Class TextExtraction
- Namespace: LMKit.Extraction
- Assembly: LM-Kit.NET.dll
Provides functionality to extract structured data from unstructured content using a language model.
public sealed class TextExtraction
- Inheritance: object → TextExtraction
Examples
Example: Extract customer information from text
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Collections.Generic;
LM model = LM.LoadFromModelID("llama-3.2-1b");
// Create the extractor
TextExtraction extractor = new TextExtraction(model);
// Define elements to extract
extractor.Elements = new List<TextExtractionElement>
{
    new TextExtractionElement("CustomerName", ElementType.String, "Full name of the customer"),
    new TextExtractionElement("Email", ElementType.String, "Customer email address"),
    new TextExtractionElement("OrderTotal", ElementType.Double, "Total order amount"),
    new TextExtractionElement("IsPremiumMember", ElementType.Bool, "Whether customer is a premium member")
};
// Set the content to extract from
extractor.SetContent(@"
Order Confirmation
Customer: John Smith (john.smith@email.com)
Premium Member: Yes
Total: $149.99
");
// Extract the data
TextExtractionResult result = extractor.Parse();
// Access results
Console.WriteLine($"Customer: {result["CustomerName"].Value}");
Console.WriteLine($"Email: {result["Email"].Value}");
Console.WriteLine($"Order Total: {result["OrderTotal"].Value}");
Console.WriteLine($"Premium: {result["IsPremiumMember"].Value}");
Console.WriteLine($"JSON: {result.Json}");
Example: Extract data from a PDF invoice
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Collections.Generic;
LM model = LM.LoadFromModelID("llama-3.2-1b");
TextExtraction extractor = new TextExtraction(model);
// Define invoice fields
extractor.Elements = new List<TextExtractionElement>
{
    new TextExtractionElement("InvoiceNumber", ElementType.String),
    new TextExtractionElement("InvoiceDate", ElementType.Date),
    new TextExtractionElement("VendorName", ElementType.String),
    new TextExtractionElement("TotalAmount", ElementType.Double),
    new TextExtractionElement("LineItems", ElementType.StringArray, "List of items on the invoice")
};
// Load PDF and extract
var pdfAttachment = new Attachment("invoice.pdf");
extractor.SetContent(pdfAttachment);
TextExtractionResult result = extractor.Parse();
Console.WriteLine($"Invoice #{result["InvoiceNumber"].Value} from {result["VendorName"].Value}");
Example: Extract with enum values
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Collections.Generic;
LM model = LM.LoadFromModelID("llama-3.2-1b");
TextExtraction extractor = new TextExtraction(model);
// Define element with allowed values
var sentimentElement = new TextExtractionElement("Sentiment", ElementType.String, "Customer sentiment");
sentimentElement.Format.AllowedValues = new List<string> { "Positive", "Neutral", "Negative" };
extractor.Elements = new List<TextExtractionElement>
{
    new TextExtractionElement("CustomerFeedback", ElementType.String),
    sentimentElement
};
extractor.SetContent("The product exceeded my expectations! Great quality and fast shipping.");
var result = extractor.Parse();
Console.WriteLine($"Sentiment: {result["Sentiment"].Value}"); // Likely output for this input: Sentiment: Positive
Remarks
The TextExtraction class enables you to define a set of elements to extract from various content sources, including plain text, images, PDF documents, HTML files, and Microsoft Office formats (DOCX, XLSX, PPTX). It leverages a language model to parse the content and extract the specified elements into structured data.
For a complete list of supported file formats, see the Attachment class documentation.
Key Features
- Define extraction elements with names, types, and descriptions
- Extract from text, images, PDFs, and Office documents
- Support for various data types (string, integer, decimal, boolean, date, enum, arrays)
- Optional OCR engine integration for scanned documents
- Guidance text to improve extraction accuracy
- JSON output with structured results
Typical Workflow
- Create a TextExtraction instance with a language model
- Define the elements to extract by setting the Elements property
- Set the content source using SetContent(string) or one of the other SetContent overloads
- Call Parse(CancellationToken) to extract the data
- Access results via TextExtractionResult
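The workflow above can also be run asynchronously with a timeout. The following is a minimal sketch rather than a definitive implementation. It assumes that ParseAsync returns a Task<TextExtractionResult> and that cancelling the token surfaces as an OperationCanceledException; neither is confirmed on this page. The model ID and the result indexer are taken from the examples above, and the 60-second limit is an arbitrary choice.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Collections.Generic;
using System.Threading;

LM model = LM.LoadFromModelID("llama-3.2-1b");
TextExtraction extractor = new TextExtraction(model);

// Step 2 of the workflow: define what to extract.
extractor.Elements = new List<TextExtractionElement>
{
    new TextExtractionElement("CustomerName", ElementType.String, "Full name of the customer")
};

// Step 3: set the content source.
extractor.SetContent("Customer: John Smith");

// Cancel the extraction if it runs longer than 60 seconds (arbitrary limit).
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(60));

try
{
    // Step 4. Assumption: ParseAsync returns Task<TextExtractionResult>.
    TextExtractionResult result = await extractor.ParseAsync(cts.Token);

    // Step 5: read the extracted value.
    Console.WriteLine($"Customer: {result["CustomerName"].Value}");
}
catch (OperationCanceledException)
{
    // Assumption: cancellation is reported through this exception type.
    Console.WriteLine("Extraction was cancelled.");
}
```

Passing a token from a CancellationTokenSource with a timeout keeps a slow model from blocking the caller indefinitely; the synchronous Parse(CancellationToken) accepts the same token.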
Constructors
- TextExtraction(LM)
Initializes a new instance of the TextExtraction class with the specified language model.
Fields
- Description
Gets or sets the description for the current extraction schema. This value is automatically populated from the schema's "description" field when calling SetElementsFromJsonSchema(string).
- Title
Gets or sets the title for the current extraction schema. This value is automatically populated from the schema's "title" field when calling SetElementsFromJsonSchema(string).
Properties
- Elements
Gets or sets the list of TextExtractionElement instances that define which elements to extract from the content.
- Guidance
Gets or sets semantic guidance for the extraction process.
- HumanVerificationThreshold
Gets or sets the confidence threshold below which HumanVerificationRequired is set to true.
- JsonSchema
Gets the JSON schema representation of the extraction elements.
- MaximumContextLength
Gets or sets the maximum context length (in tokens) allowed for the language model during text extraction.
- Model
Gets the language model instance used to drive the extraction process.
- NullOnDoubt
Gets or sets a value indicating whether the language model should return null for uncertain content rather than risk an aggressive extraction that could lead to false positives.
- PreferredInferenceModality
Gets or sets the preferred modality for inference. This determines whether text, image, or both modalities are used when processing input. Defaults to Multimodal.
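As an illustration of how the properties above might be combined, here is a hedged sketch. The property types are assumptions, since this page does not state them: Guidance is treated as a string, NullOnDoubt as a bool, and MaximumContextLength as an int. The guidance text and the 4096-token limit are illustrative values only.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Collections.Generic;

LM model = LM.LoadFromModelID("llama-3.2-1b");
TextExtraction extractor = new TextExtraction(model);

// Assumption: Guidance is a string describing the content to the model.
extractor.Guidance = "The content is an order confirmation email written in English.";

// Assumption: NullOnDoubt is a bool. When true, uncertain fields should
// come back as null instead of a best guess.
extractor.NullOnDoubt = true;

// Assumption: MaximumContextLength is an int token count (illustrative value).
extractor.MaximumContextLength = 4096;

extractor.Elements = new List<TextExtractionElement>
{
    new TextExtractionElement("CustomerName", ElementType.String, "Full name of the customer"),
    new TextExtractionElement("OrderTotal", ElementType.Double, "Total order amount")
};

// The customer name is deliberately absent from this content.
extractor.SetContent("Thanks for your order! Total: $149.99");

TextExtractionResult result = extractor.Parse();
Console.WriteLine(result.Json);
```

With NullOnDoubt enabled, a field that is missing from the content, such as CustomerName here, is the kind of case where a null result is preferable to a guessed value.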
Methods
- ClearContent()
Removes all previously set input (both text and attachments) so that no content remains for extraction.
- Parse(CancellationToken)
Parses the content synchronously to extract the defined elements.
- ParseAsync(CancellationToken)
Parses the content asynchronously to extract the defined elements.
- SchemaDiscovery(CancellationToken)
Discovers an optimal JSON Schema for the current input content.
- SchemaDiscoveryAsync(CancellationToken)
Asynchronously discovers an optimal JSON Schema for the current input content.
- SetContent(Attachment)
Adds all pages of an attachment to be processed for data extraction.
- SetContent(Attachment, int)
Adds a specific page of an attachment to be processed for data extraction.
- SetContent(Attachment, string)
Adds specified pages of an attachment to be processed for data extraction.
- SetContent(ImageBuffer)
Sets the content for extraction from the specified image buffer.
- SetContent(IEnumerable<Attachment>)
Adds multiple attachments to be processed for data extraction.
- SetContent(string)
Sets the text content from which elements will be extracted.
- SetElementsFromJsonSchema(string)
Configures the text extraction elements by parsing a JSON schema.
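As an alternative to building the Elements list by hand, the elements can be configured from a JSON schema. The sketch below is an assumption-laden illustration: the schema is a hypothetical example, and this page does not state which JSON Schema keywords SetElementsFromJsonSchema supports. The Title and Description behaviour follows the field descriptions above.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using System;

LM model = LM.LoadFromModelID("llama-3.2-1b");
TextExtraction extractor = new TextExtraction(model);

// Hypothetical schema; supported keywords are not documented on this page.
string schema = @"{
  ""title"": ""Invoice"",
  ""description"": ""Key fields of a vendor invoice"",
  ""type"": ""object"",
  ""properties"": {
    ""InvoiceNumber"": { ""type"": ""string"" },
    ""TotalAmount"": { ""type"": ""number"" }
  }
}";

// Replaces manual Elements setup with the schema's properties.
extractor.SetElementsFromJsonSchema(schema);

// Per the field descriptions, these are populated from the schema's
// "title" and "description" entries.
Console.WriteLine(extractor.Title);
Console.WriteLine(extractor.Description);

// Inspect the schema the extractor now holds.
Console.WriteLine(extractor.JsonSchema);

extractor.SetContent("Invoice INV-1042 from Acme Supplies. Total due: $310.00");
TextExtractionResult result = extractor.Parse();
Console.WriteLine(result.Json);
```

Keeping the schema in a separate string or file lets the same extraction definition be shared with other tools that validate the resulting JSON.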
Events
- Progress
Occurs when the extraction operation makes progress, including phase transitions and pass completions.