Structured AI Data Extraction for .NET Applications
🎯 Purpose of the Sample
The Structured Data Extraction Demo showcases how the LM-Kit.NET SDK can extract structured information from various data sources such as invoices, job offers, medical records, contracts, reports, and more. The range of use cases is broad, since the extraction engine can be customized for different types of data sources. Extracted values can be obtained in a structured JSON format or accessed via a straightforward, high-level API. The demo uses the high-level API provided by the TextExtraction class, which simplifies the process of extracting and structuring data. It also highlights how the extraction engine leverages LM-Kit's Dynamic Sampling technology, enabling fast responses and high accuracy, even when using smaller models.
👥 Industry Target Audience
This sample is particularly useful for developers and organizations working in the following industries:
- 💼 Business Automation: Automating data extraction from business sources like invoices or contracts.
- 🏥 Healthcare: Extracting structured data from medical records for patient management and analysis.
- 👨‍💼 Human Resources: Parsing job offers, resumes, and employment contracts to simplify HR workflows.
- 📊 Financial Services: Processing and extracting key details from financial statements and reports.
- 🏛️ Legal: Extracting clauses and terms from legal documents, contracts, or agreements.
- 📊 Data Entry and Processing: Reducing manual data entry tasks by automating the extraction of relevant information from various data sources.
🚀 Problem Solved
Unstructured data sources often contain valuable information that is difficult to extract and organize programmatically. This demo addresses that challenge by using AI-powered text extraction to automatically identify and structure data from a wide variety of sources. LM-Kit's Dynamic Sampling technology is designed to keep responses fast and accurate even with smaller models, which reduces manual effort and yields consistent output for analysis or system integration.
💻 Sample Application Description
The Structured Data Extraction Demo is a console application that allows users to input different types of data sources, including invoices, job offers, medical records, and more, to extract specific data fields into a structured JSON format. The extracted values can also be accessed via a high-level API, provided by the TextExtraction class, which simplifies the extraction process. The application uses pre-trained language models and the LMKit.Extraction namespace to process these sources efficiently.
The supported data types include:
- Char: A Unicode character.
- String: A sequence of characters.
- Integer: A 32-bit signed integer.
- Float: A single-precision floating-point number.
- Double: A double-precision floating-point number.
- Bool: A Boolean value (true or false).
- Date: A date value.
Arrays of these data types, such as StringArray, IntegerArray, and DateArray, are also supported. You can view the complete list of supported types in the LM-Kit API documentation. The SDK also allows defining complex objects using arrays or nested objects of any type, providing flexibility to handle diverse data extraction requirements.
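To make the type list above more concrete, here is a minimal sketch of how raw extracted text could be converted into the corresponding .NET values. It does not use the LM-Kit API: the FieldType enum and the FieldConverter class are hypothetical names introduced only for illustration, and mapping Date to System.DateTime is an assumption.

```csharp
using System;
using System.Globalization;

// Hypothetical enum mirroring the type names listed above (not an LM-Kit type).
public enum FieldType { Char, String, Integer, Float, Double, Bool, Date }

// Hypothetical helper: converts a raw text value into the matching .NET type.
public static class FieldConverter
{
    public static object Parse(string raw, FieldType type)
    {
        // Invariant culture keeps "1.5" and ISO dates parsing the same on every machine.
        CultureInfo inv = CultureInfo.InvariantCulture;
        switch (type)
        {
            case FieldType.Char:    return char.Parse(raw);          // exactly one character
            case FieldType.String:  return raw;
            case FieldType.Integer: return int.Parse(raw, inv);      // 32-bit signed integer
            case FieldType.Float:   return float.Parse(raw, inv);    // single precision
            case FieldType.Double:  return double.Parse(raw, inv);   // double precision
            case FieldType.Bool:    return bool.Parse(raw);          // "true" or "false"
            case FieldType.Date:    return DateTime.Parse(raw, inv); // assumed mapping
            default: throw new ArgumentOutOfRangeException(nameof(type));
        }
    }
}

public static class Program
{
    public static void Main()
    {
        object quantity = FieldConverter.Parse("42", FieldType.Integer);
        object issued = FieldConverter.Parse("2024-03-15", FieldType.Date);
        Console.WriteLine(quantity.GetType().Name); // Int32
        Console.WriteLine(issued.GetType().Name);   // DateTime
    }
}
```

In the SDK itself, typing is declared on the extraction elements rather than applied afterwards, so this sketch only illustrates what each type name stands for.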
✨ Key Features
- 🧠 Model Flexibility: Users can select from multiple pre-trained models or use custom models for extraction.
- 📄 Data Source Selection: Supports various types of data sources such as invoices, job offers, medical records, contracts, and more.
- 📊 Structured Output: Extracts relevant information from the data source and organizes it into predefined JSON structures or makes it available via a high-level API.
- 🔍 Customizable Extraction Templates: Users can modify and define their own data extraction templates based on the data source, including handling complex structures like arrays or nested objects.
- ⚡ Dynamic Sampling: Utilizes LM-Kit's Dynamic Sampling technology, allowing the application to deliver quick and accurate results even when using smaller models.
🧠 Supported Models
The sample supports the following pre-trained models for data extraction:
- Mistral Nemo 2407 12.2B
- Meta Llama 3.1 8B
- Google Gemma2 9B Medium
- Microsoft Phi-3 3.82B Mini
- Alibaba Qwen-2 7.6B
🛠️ Getting Started
📋 Prerequisites
- .NET Framework 4.6.2 or .NET 6.0
📥 Download the Project
▶️ Running the Application
📂 Clone the repository:
git clone https://github.com/LM-Kit/lm-kit-net-samples.git
📁 Navigate to the project directory:
cd lm-kit-net-samples/console_framework_4.62/structured_data_extraction
or
cd lm-kit-net-samples/console_net6/structured_data_extraction
🔨 Build and run the application:
dotnet build
dotnet run
💡 Example Usage
Set the License Key (if available):
LMKit.Licensing.LicenseManager.SetLicenseKey(""); // Set an optional license key here if available.
Select the Model: The application will prompt you to select one of the pre-defined models or enter a custom model URI.
Choose the Data Source Type: The application will allow you to select a data source type to extract structured data from. Example options include:
Please select the content from which you want to extract structured data:
0 - Invoice
1 - Job Offer
2 - Medical Record
View the Extracted Information: Once the data source is processed, the extracted information will be displayed in JSON format or can be accessed via the high-level API.
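Once you have the JSON output, it can be consumed with any JSON library. The sketch below uses System.Text.Json, which ships with .NET 6 (on .NET Framework 4.6.2 it would need to be added as a NuGet package). The JSON document and its field names (InvoiceNumber, Items, Quantity) are hypothetical and do not reflect the sample's actual output; the ExtractedJsonReader class is likewise an illustrative name.

```csharp
using System;
using System.Text.Json;

public static class ExtractedJsonReader
{
    // Hypothetical JSON, shaped like what an invoice extraction might produce.
    public const string SampleJson = @"{
        ""InvoiceNumber"": ""INV-001"",
        ""Paid"": false,
        ""Items"": [
            { ""Description"": ""Widget"", ""Quantity"": 2 },
            { ""Description"": ""Gadget"", ""Quantity"": 1 }
        ]
    }";

    // Reads a top-level string field from the JSON document.
    public static string ReadString(string json, string field)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            return doc.RootElement.GetProperty(field).GetString();
        }
    }

    // Sums the Quantity field over every entry of the nested Items array.
    public static int TotalQuantity(string json)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            int total = 0;
            foreach (JsonElement item in doc.RootElement.GetProperty("Items").EnumerateArray())
            {
                total += item.GetProperty("Quantity").GetInt32();
            }
            return total;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ReadString(SampleJson, "InvoiceNumber")); // INV-001
        Console.WriteLine(TotalQuantity(SampleJson));               // 3
    }
}
```

Reading through JsonDocument avoids declaring C# classes for every template, which is convenient while an extraction template is still changing. Once the template is stable, deserializing into typed classes is usually easier to maintain.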
By following these steps, developers can explore how LM-Kit.NET handles structured data extraction from various data sources, offering an efficient solution for automating workflows that involve processing unstructured text.