Class TextExtraction

Namespace
LMKit.Extraction
Assembly
LM-Kit.NET.dll

Provides functionality to extract structured data from unstructured content using a language model.

public sealed class TextExtraction
Inheritance
TextExtraction

Remarks

The TextExtraction class enables you to define a set of elements to extract from various content sources, including plain text, images, PDF documents, HTML files, and Microsoft Office formats (DOCX, XLSX, PPTX). It leverages a language model to parse the content and extract the specified elements into structured data.

For a complete list of supported file formats, see the Attachment class documentation.
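The workflow described above (define the elements, set the content, parse) can be sketched as follows. This is a minimal illustration and not an official sample: it uses only members listed on this page, the schema and the helper name `InvoiceExtractionSketch.Run` are hypothetical, and the `LMKit.Model` namespace for `LM` is an assumption. The model is passed in by the caller because loading it is outside the scope of this page, and the result is returned as `object` because the return type of `Parse` is not described here.

```csharp
using System;
using System.Threading;
using LMKit.Extraction;
using LMKit.Model; // assumption: namespace that contains the LM class

static class InvoiceExtractionSketch
{
    // Hypothetical JSON schema, for illustration only.
    private const string Schema = @"{
  ""title"": ""Invoice"",
  ""description"": ""Key fields of a supplier invoice."",
  ""type"": ""object"",
  ""properties"": {
    ""invoiceNumber"": { ""type"": ""string"" },
    ""totalAmount"": { ""type"": ""number"" }
  }
}";

    public static object Run(LM model, string text, CancellationToken ct = default)
    {
        var extractor = new TextExtraction(model);

        // Define the elements to extract from the JSON schema.
        extractor.SetElementsFromJsonSchema(Schema);

        // Per this page, Title is populated from the schema's "title" field.
        Console.WriteLine(extractor.Title);

        // Provide the unstructured input, then run the extraction synchronously.
        extractor.SetContent(text);
        return extractor.Parse(ct);
    }
}
```

Passing the `CancellationToken` through lets the caller abort a long-running extraction.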

Constructors

TextExtraction(LM)

Initializes a new instance of the TextExtraction class with the specified language model.

Fields

Description

Gets or sets the description for the current extraction schema. This value is automatically populated from the schema's "description" field when calling SetElementsFromJsonSchema(string).

Title

Gets or sets the title for the current extraction schema. This value is automatically populated from the schema's "title" field when calling SetElementsFromJsonSchema(string).

Properties

Elements

Gets or sets the list of TextExtractionElement instances that define which elements to extract from the content.

Guidance

Gets or sets semantic guidance for the extraction process.

JsonSchema

Gets the JSON schema representation of the extraction elements.

MaximumContextLength

Gets or sets the maximum context length (in tokens) allowed for the language model during text extraction.

Model

Gets the language model instance used to drive the extraction process.

NullOnDoubt

Gets or sets a value indicating whether the language model should return null when it is uncertain about a value, instead of attempting an aggressive extraction that could produce false positives.

OcrEngine

Gets or sets an optional OcrEngine used to perform traditional OCR on raster content.

PreferredInferenceModality

Gets or sets the preferred modality for inference. This determines whether text, image, or both modalities are used when processing input. Defaults to Multimodal.

Methods

ClearContent()

Removes all previously set input (both text and attachments) so that no content remains for extraction.

Parse(CancellationToken)

Parses the content synchronously to extract the defined elements.

ParseAsync(CancellationToken)

Parses the content asynchronously to extract the defined elements.

SchemaDiscovery(CancellationToken)

Discovers an optimal JSON Schema for the current input content.

SchemaDiscoveryAsync(CancellationToken)

Asynchronously discovers an optimal JSON Schema for the current input content.

SetContent(Attachment)

Adds all pages of an attachment to be processed for data extraction.

SetContent(Attachment, int)

Adds a specific page of an attachment to be processed for data extraction.

SetContent(Attachment, string)

Adds specified pages of an attachment to be processed for data extraction.

SetContent(ImageBuffer)

Sets the content for extraction from the specified image buffer.

SetContent(IEnumerable<Attachment>)

Adds multiple attachments to be processed for data extraction.

SetContent(string)

Sets the text content from which elements will be extracted.

SetElementsFromJsonSchema(string)

Configures the text extraction elements by parsing a JSON schema.
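The methods above can be combined when no schema is known in advance. The sketch below is a hedged illustration built only from members listed on this page; it assumes that `SchemaDiscovery` returns the discovered schema as a JSON string, that `Attachment` lives in the `LMKit.Data` namespace, and that `LM` lives in `LMKit.Model`. The helper name `DiscoverAndExtractSketch.RunAsync` is hypothetical.

```csharp
using System.Threading;
using System.Threading.Tasks;
using LMKit.Data;       // assumption: namespace that contains Attachment
using LMKit.Extraction;
using LMKit.Model;      // assumption: namespace that contains LM

static class DiscoverAndExtractSketch
{
    public static async Task<object> RunAsync(
        LM model, Attachment attachment, CancellationToken ct = default)
    {
        var extractor = new TextExtraction(model)
        {
            // Prefer null over a guess when the model is uncertain.
            NullOnDoubt = true
        };

        // Queue every page of the attachment for extraction.
        extractor.SetContent(attachment);

        // Assumption: the discovered schema is returned as a JSON string.
        string schema = extractor.SchemaDiscovery(ct);

        // Use the discovered schema to define the extraction elements.
        extractor.SetElementsFromJsonSchema(schema);

        // Run the extraction without blocking the calling thread.
        return await extractor.ParseAsync(ct);
    }
}
```

To reuse the same instance for another document, call `ClearContent()` before setting the next input so that no earlier content remains.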