We're thrilled to announce that our fact extraction model is now generally available to all duohub users, both through the app and the API.
This model does something simple but powerful: it takes a page of content and extracts the most important information from it, leaving you with atomic facts.
Consider the following example. We want to ingest information from the LTA OneMotoring website and provide an agent which is able to answer questions from vehicle owners about the complex COE rules.
Let's analyse just one page: Cars and Motorcycles Registered in Malaysia. You can find the first 10 of the 53 facts extracted from this page below.
1. Malaysia-registered vehicles must have a valid Autopass card, insurance, road tax and LTA's VEP approval email before entering Singapore.
2. Drivers of Malaysia-registered vehicles are required to present the vehicle's certificate of insurance, road tax, vehicle registration certificate, Autopass card, and LTA's VEP approval email upon request by authorised officers in Singapore.
3. Vehicle Entry Permits (VEPs) for Social Visit Pass holders have a validity period of up to 14 days from the date of the vehicle's entry into Singapore.
4. VEPs for Social Visit Pass holders can be extended up to 5 days before they expire.
5. Drivers of Malaysia-registered vehicles need to pay Electronic Road Pricing (ERP) charges if they use ERP-priced roads during ERP operating hours in Singapore.
6. Drivers of Malaysia-registered vehicles need to settle all outstanding fines for their vehicle before entering Singapore.
7. Singapore Citizens, Permanent Residents, residents of Singapore, Long-Term Visit Pass holders, Dependant's Pass holders, Student's Pass holders, Training Employment Pass holders, Work Holiday Pass holders, and Work Pass holders who are residents of Singapore are not allowed to keep or use foreign-registered vehicles in Singapore.
8. Thailand-registered vehicles must have Singapore insurance coverage from a Singapore-based insurance company for the duration of their stay in Singapore.
9. Work Pass holders who violate conditions for keeping or using foreign-registered vehicles in Singapore can face fines up to $1,000 or imprisonment for up to 3 months on first conviction.
10. Drivers can update their vehicle's road tax and insurance validity with LTA through the website https://go.gov.sg/vepds-insureapp.
You can see each one is an atomic unit of information that discards most of the marketing fluff and focuses on the facts.
Fact extraction can be used with graphs, but paired with vector storage it offers a similar level of precision without the storage cost or size constraints that come with graphs.
As an example, say you have a website with 1000 products on it. Much of the content on that website exists to help sell the products rather than to describe them.
Using fact extraction, any "fluff" gets discarded and only the atomic units of information relevant to each product are extracted, with coreferences resolved.
When a user asks a question about a product, the n results they get back are far more likely to be the most relevant ones, because every stored item is a single self-contained fact.
Government websites are notorious for being difficult to parse into a structure that can be easily retrieved with high precision. Fact extraction resolves the tables and examples into atomic facts that can be easily retrieved in a way that suits both embedding algorithms and graph parsing models.
The beauty of fact extraction is that it offers global coreference resolution in a triple-like format that is both graph and vector compatible.
In a CV example, you might get the following facts.
1. Tauhid Akhtar worked for VISHAL ENGINEERING from June 2016 to December 2016.
2. Tauhid Akhtar has been working at Reliance Industries Limited in Jamnagar, Gujarat since December 2016.
3. Tauhid Akhtar completed a PG Diploma in Process Piping Engineering as per ASME 31.3 from IPEBS, Hyderabad in 2015.
4. Tauhid Akhtar completed a B-tech in Mechanical Engineering from Maharishi Dayanand University Rohtak, Haryana in 2015.
5. Tauhid Akhtar completed a Bachelor of Science in Physics from Magadh University, Bodh Gaya in 2011.
This makes it very easy to ask, 'What is Tauhid Akhtar's highest degree?' with top_k=10 and get the answer, or to pass the facts as inputs to graph parsing models to get the most relevant information.
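To make the top_k idea concrete, here is a minimal sketch of retrieving the k best-matching facts for a question. It is not how duohub retrieves results: the bag-of-words cosine score is a stand-in for a real embedding model, and the function names (tokenize, embed, cosine, top_k) are made up for illustration.

```python
import math
import re
from collections import Counter

# The five CV facts from the example above.
FACTS = [
    "Tauhid Akhtar worked for VISHAL ENGINEERING from June 2016 to December 2016.",
    "Tauhid Akhtar has been working at Reliance Industries Limited in Jamnagar, Gujarat since December 2016.",
    "Tauhid Akhtar completed a PG Diploma in Process Piping Engineering as per ASME 31.3 from IPEBS, Hyderabad in 2015.",
    "Tauhid Akhtar completed a B-tech in Mechanical Engineering from Maharishi Dayanand University Rohtak, Haryana in 2015.",
    "Tauhid Akhtar completed a Bachelor of Science in Physics from Magadh University, Bodh Gaya in 2011.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (assumption, not a real model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, facts, k=10):
    """Return up to k facts, ordered from most to least similar to the query."""
    query_vec = embed(query)
    ranked = sorted(facts, key=lambda fact: cosine(query_vec, embed(fact)), reverse=True)
    return ranked[:k]


results = top_k("What is Tauhid Akhtar's highest degree?", FACTS, k=10)
for fact in results:
    print(fact)
```

Because every fact names its subject in full, each one stays meaningful when retrieved on its own. With only five facts, top_k=10 simply returns all of them, and the model answering the question can compare the qualifications directly.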
As we all know, the last 5% is the hardest to achieve, and when it comes to large-scale precision, fact extraction can help you get there.
To use Fact Extraction in the duohub app:
When creating a new memory, select vector under memoryType. Then under the factExtraction parameter, select True.
After adding files to memory, call the start ingestion endpoint to process the memory.
Yes, it's that easy.
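If you are scripting this rather than clicking through the app, the two settings could be captured as below. This is only a sketch: the names memoryType and factExtraction and their values come from the steps above, but the helper build_memory_settings is hypothetical, and sending both settings together as one JSON object is an assumption. Check the duohub API reference for the actual request shape and endpoints.

```python
import json


def build_memory_settings(fact_extraction=True):
    """Collect the two settings described above into one dictionary.

    Assumption: the API accepts these as fields of a single JSON object.
    """
    return {
        "memoryType": "vector",             # fact extraction is selected alongside vector memory
        "factExtraction": fact_extraction,  # True turns fact extraction on
    }


settings = build_memory_settings()
print(json.dumps(settings, indent=2))
```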
If you would like to create your own fact extraction models, you can download and use our Fact Extraction Dataset on Hugging Face. We have also trained a Llama 3.1 8B Instruct model on Fact Extraction, which you can also download from Hugging Face.