Class TextExtraction

Namespace
LMKit.Extraction
Assembly
LM-Kit.NET.dll

Provides functionality to extract structured data from unstructured content using a language model.

public sealed class TextExtraction
Inheritance
TextExtraction

Examples

Example: Extract customer information from text

using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Collections.Generic;

LM model = LM.LoadFromModelID("llama-3.2-1b");

// Create the extractor
TextExtraction extractor = new TextExtraction(model);

// Define elements to extract
extractor.Elements = new List<TextExtractionElement>
{
    new TextExtractionElement("CustomerName", ElementType.String, "Full name of the customer"),
    new TextExtractionElement("Email", ElementType.String, "Customer email address"),
    new TextExtractionElement("OrderTotal", ElementType.Double, "Total order amount"),
    new TextExtractionElement("IsPremiumMember", ElementType.Bool, "Whether customer is a premium member")
};

// Set the content to extract from
extractor.SetContent(@"
    Order Confirmation
    Customer: John Smith (john.smith@email.com)
    Premium Member: Yes
    Total: $149.99
");

// Extract the data
TextExtractionResult result = extractor.Parse();

// Access results
Console.WriteLine($"Customer: {result["CustomerName"].Value}");
Console.WriteLine($"Email: {result["Email"].Value}");
Console.WriteLine($"Order Total: {result["OrderTotal"].Value}");
Console.WriteLine($"Premium: {result["IsPremiumMember"].Value}");
Console.WriteLine($"JSON: {result.Json}");

Example: Extract data from a PDF invoice

using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Collections.Generic;

LM model = LM.LoadFromModelID("llama-3.2-1b");
TextExtraction extractor = new TextExtraction(model);

// Define invoice fields
extractor.Elements = new List<TextExtractionElement>
{
    new TextExtractionElement("InvoiceNumber", ElementType.String),
    new TextExtractionElement("InvoiceDate", ElementType.Date),
    new TextExtractionElement("VendorName", ElementType.String),
    new TextExtractionElement("TotalAmount", ElementType.Double),
    new TextExtractionElement("LineItems", ElementType.StringArray, "List of items on the invoice")
};

// Load PDF and extract
var pdfAttachment = new Attachment("invoice.pdf");
extractor.SetContent(pdfAttachment);

TextExtractionResult result = extractor.Parse();
Console.WriteLine($"Invoice #{result["InvoiceNumber"].Value} from {result["VendorName"].Value}");

Example: Extract with enum values

using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Collections.Generic;

LM model = LM.LoadFromModelID("llama-3.2-1b");
TextExtraction extractor = new TextExtraction(model);

// Define element with allowed values
var sentimentElement = new TextExtractionElement("Sentiment", ElementType.String, "Customer sentiment");
sentimentElement.Format.AllowedValues = new List<string> { "Positive", "Neutral", "Negative" };

extractor.Elements = new List<TextExtractionElement>
{
    new TextExtractionElement("CustomerFeedback", ElementType.String),
    sentimentElement
};

extractor.SetContent("The product exceeded my expectations! Great quality and fast shipping.");

var result = extractor.Parse();
Console.WriteLine($"Sentiment: {result["Sentiment"].Value}"); // Expected: Positive (model output may vary)

Remarks

The TextExtraction class enables you to define a set of elements to extract from various content sources, including plain text, images, PDF documents, HTML files, and Microsoft Office formats (DOCX, XLSX, PPTX). It leverages a language model to parse the content and extract the specified elements into structured data.

For a complete list of supported file formats, see the Attachment class documentation.

Key Features

  • Define extraction elements with names, types, and descriptions
  • Extract from text, images, PDFs, and Office documents
  • Support for various data types (string, integer, decimal, boolean, date, enum, arrays)
  • Optional OCR engine integration for scanned documents
  • Guidance text to improve extraction accuracy
  • JSON output with structured results

Typical Workflow

  1. Create a TextExtraction instance with a language model
  2. Define elements to extract using Elements
  3. Set the content source using SetContent(string) or another SetContent overload (Attachment, ImageBuffer, or a collection of attachments)
  4. Call Parse(CancellationToken) to extract the data
  5. Access results via TextExtractionResult
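
Example: Typical workflow with a cancellation token

The steps above can be sketched as follows. This is a minimal sketch rather than a reference implementation: it reuses the model ID from the examples on this page, while the element names, the sample text, and the 60-second timeout are illustrative assumptions. How Parse reacts to a cancelled token is not described on this page.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Collections.Generic;
using System.Threading;

// 1. Create a TextExtraction instance with a language model
LM model = LM.LoadFromModelID("llama-3.2-1b");
TextExtraction extractor = new TextExtraction(model);

// 2. Define the elements to extract (names and descriptions are illustrative)
extractor.Elements = new List<TextExtractionElement>
{
    new TextExtractionElement("City", ElementType.String, "City mentioned in the text"),
    new TextExtractionElement("Temperature", ElementType.Double, "Temperature in degrees Celsius")
};

// 3. Set the content source
extractor.SetContent("It is 21.5 degrees Celsius in Lisbon this afternoon.");

// 4. Parse, passing a token that is cancelled after 60 seconds (assumed timeout)
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(60));
TextExtractionResult result = extractor.Parse(cts.Token);

// 5. Access the results
Console.WriteLine(result.Json);
```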

Constructors

TextExtraction(LM)

Initializes a new instance of the TextExtraction class with the specified language model.

Fields

Description

Gets or sets the description for the current extraction schema. This value is automatically populated from the schema's "description" field when calling SetElementsFromJsonSchema(string).

Title

Gets or sets the title for the current extraction schema. This value is automatically populated from the schema's "title" field when calling SetElementsFromJsonSchema(string).

Properties

Elements

Gets or sets the list of TextExtractionElement instances that define which elements to extract from the content.

Guidance

Gets or sets semantic guidance text that steers the extraction process and can improve extraction accuracy.

HumanVerificationThreshold

Gets or sets the confidence threshold below which HumanVerificationRequired is set to true.

JsonSchema

Gets the JSON schema representation of the extraction elements.

MaximumContextLength

Gets or sets the maximum context length (in tokens) allowed for the language model during text extraction.

Model

Gets the language model instance used to drive the extraction process.

NullOnDoubt

Gets or sets a value indicating whether the language model should return null for uncertain content instead of attempting an aggressive extraction that could produce false positives.

OcrEngine

Gets or sets an optional OcrEngine used to perform traditional OCR on raster content.

PreferredInferenceModality

Gets or sets the preferred modality for inference. This determines whether text, image, or both modalities are used when processing input. Defaults to Multimodal.
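
Example: Tune extraction with Guidance and NullOnDoubt

A minimal sketch of how the properties above might be combined. It assumes that Guidance accepts a plain string and that NullOnDoubt is a bool; neither type is stated on this page. The guidance wording, element names, and sample text are illustrative.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Collections.Generic;

LM model = LM.LoadFromModelID("llama-3.2-1b");
TextExtraction extractor = new TextExtraction(model);

// Assumption: Guidance is a free-form string
extractor.Guidance = "The text is an order confirmation email. Amounts are in US dollars.";

// Assumption: NullOnDoubt is a bool; prefer null over a guessed value
extractor.NullOnDoubt = true;

extractor.Elements = new List<TextExtractionElement>
{
    new TextExtractionElement("OrderTotal", ElementType.Double, "Total order amount"),
    new TextExtractionElement("CouponCode", ElementType.String, "Coupon code, if one was applied")
};

extractor.SetContent("Thanks for your order! Total: $149.99");

TextExtractionResult result = extractor.Parse();

// The sample text contains no coupon code, so a null value is plausible here
Console.WriteLine(result.Json);
```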

Methods

ClearContent()

Removes all previously set input (both text and attachments) so that no content remains for extraction.

Parse(CancellationToken)

Parses the content synchronously to extract the defined elements.

ParseAsync(CancellationToken)

Parses the content asynchronously to extract the defined elements.
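
Example: Parse asynchronously

A minimal sketch of the asynchronous variant. It assumes that ParseAsync returns an awaitable whose result is the same TextExtractionResult that Parse returns; the return type is not stated on this page. The element and sample text are illustrative.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Collections.Generic;
using System.Threading;

LM model = LM.LoadFromModelID("llama-3.2-1b");
TextExtraction extractor = new TextExtraction(model);

extractor.Elements = new List<TextExtractionElement>
{
    new TextExtractionElement("Email", ElementType.String, "Customer email address")
};

extractor.SetContent("Contact: john.smith@email.com");

// The token source lets the caller cancel the operation, for example from a UI button
using var cts = new CancellationTokenSource();

// Assumption: the awaited value is a TextExtractionResult
var result = await extractor.ParseAsync(cts.Token);

Console.WriteLine($"Email: {result["Email"].Value}");
```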

SchemaDiscovery(CancellationToken)

Discovers an optimal JSON Schema for the current input content.

SchemaDiscoveryAsync(CancellationToken)

Asynchronously discovers an optimal JSON Schema for the current input content.
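
Example: Discover a schema, then extract with it

A hedged sketch of how schema discovery might feed into extraction. It assumes that SchemaDiscovery returns the schema as a JSON string that SetElementsFromJsonSchema(string) accepts, and that the content set earlier is still in place when Parse is called; neither is confirmed on this page. The sample text is illustrative.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using System;
using System.Threading;

LM model = LM.LoadFromModelID("llama-3.2-1b");
TextExtraction extractor = new TextExtraction(model);

// Set the content first; discovery works on the current input content
extractor.SetContent(@"
    Order Confirmation
    Customer: John Smith (john.smith@email.com)
    Total: $149.99
");

// Assumption: the discovered schema is returned as a JSON string
string schema = extractor.SchemaDiscovery(CancellationToken.None);
Console.WriteLine(schema);

// Use the discovered schema to define the extraction elements
extractor.SetElementsFromJsonSchema(schema);

TextExtractionResult result = extractor.Parse(CancellationToken.None);
Console.WriteLine(result.Json);
```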

SetContent(Attachment)

Adds all pages of an attachment to be processed for data extraction.

SetContent(Attachment, int)

Adds a specific page of an attachment to be processed for data extraction.

SetContent(Attachment, string)

Adds specified pages of an attachment to be processed for data extraction.

SetContent(ImageBuffer)

Sets the content for extraction from the specified image buffer.

SetContent(IEnumerable<Attachment>)

Adds multiple attachments to be processed for data extraction.

SetContent(string)

Sets the text content from which elements will be extracted.

SetElementsFromJsonSchema(string)

Configures the text extraction elements by parsing a JSON schema.
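
Example: Define elements from a JSON schema

A minimal sketch of configuring elements from a schema string. The schema below is illustrative, and which JSON Schema keywords the parser supports is an assumption. Title and Description are read back because this page states they are populated from the schema's "title" and "description" fields.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using System;

LM model = LM.LoadFromModelID("llama-3.2-1b");
TextExtraction extractor = new TextExtraction(model);

// Illustrative schema; the field names are hypothetical
string schema = @"{
  ""title"": ""Invoice"",
  ""description"": ""Key fields of a vendor invoice"",
  ""type"": ""object"",
  ""properties"": {
    ""InvoiceNumber"": { ""type"": ""string"" },
    ""TotalAmount"": { ""type"": ""number"" }
  }
}";

extractor.SetElementsFromJsonSchema(schema);

// Populated from the schema's "title" and "description" fields
Console.WriteLine(extractor.Title);
Console.WriteLine(extractor.Description);

// JSON schema representation of the elements that were just configured
Console.WriteLine(extractor.JsonSchema);
```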

Events

Progress

Occurs when the extraction operation makes progress, including phase transitions and pass completions.