Class TextExtraction
- Namespace: LMKit.Extraction
- Assembly: LM-Kit.NET.dll
Provides functionality to extract structured data from unstructured content using a language model.
public sealed class TextExtraction
- Inheritance: object → TextExtraction
Remarks
The TextExtraction class enables you to define a set of elements to extract from various content sources, including plain text, images, PDF documents, HTML files, and Microsoft Office formats (DOCX, XLSX, PPTX). It leverages a language model to parse the content and extract the specified elements into structured data.
For a complete list of supported file formats, see the Attachment class documentation.
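The workflow described above can be sketched as follows. This is a minimal illustration, not the library's official sample: it uses only members listed on this page (the TextExtraction(LM) constructor, SetElementsFromJsonSchema(string), NullOnDoubt, SetContent(string), Parse(CancellationToken), and ClearContent()). The LMKit.Model namespace for LM, the invoice schema, and the InvoiceExtractionSketch helper are assumptions made for the example. The return type of Parse is not stated on this page, so the result is passed back as a plain object.

```csharp
using System.Threading;
using LMKit.Extraction;
using LMKit.Model; // assumption: namespace that contains the LM class

// Hypothetical helper, for illustration only.
public static class InvoiceExtractionSketch
{
    // Hypothetical schema. Per this page, its "title" and "description"
    // values populate the Title and Description members.
    private const string Schema = @"{
      ""title"": ""Invoice"",
      ""description"": ""Key fields of a supplier invoice."",
      ""type"": ""object"",
      ""properties"": {
        ""invoiceNumber"": { ""type"": ""string"" },
        ""totalAmount"": { ""type"": ""number"" }
      }
    }";

    // The caller supplies an already-loaded language model.
    public static object Run(LM model, string text)
    {
        var extraction = new TextExtraction(model);

        // Define the elements to extract from the JSON schema.
        extraction.SetElementsFromJsonSchema(Schema);

        // Prefer null over a doubtful value to limit false positives.
        extraction.NullOnDoubt = true;

        // Provide the unstructured input, then run the synchronous parse.
        extraction.SetContent(text);
        var result = extraction.Parse(CancellationToken.None);

        // Remove the input so the instance can be reused with new content.
        extraction.ClearContent();
        return result;
    }
}
```

The same sequence should apply to the other SetContent overloads, for example when the input is an Attachment rather than a string. For non-blocking callers, ParseAsync(CancellationToken) is the asynchronous counterpart of Parse.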
Constructors
- TextExtraction(LM)
Initializes a new instance of the TextExtraction class with the specified language model.
Fields
- Description
Gets or sets the description for the current extraction schema. This value is automatically populated from the schema's "description" field when calling SetElementsFromJsonSchema(string).
- Title
Gets or sets the title for the current extraction schema. This value is automatically populated from the schema's "title" field when calling SetElementsFromJsonSchema(string).
Properties
- Elements
Gets or sets the list of TextExtractionElement instances that define which elements to extract from the content.
- Guidance
Gets or sets semantic guidance for the extraction process.
- JsonSchema
Gets the JSON schema representation of the extraction elements.
- MaximumContextLength
Gets or sets the maximum context length (in tokens) allowed for the language model during text extraction.
- Model
Gets the language model instance used to drive the extraction process.
- NullOnDoubt
Gets or sets a value indicating whether the language model should return null for uncertain content rather than risk an aggressive extraction that could lead to false positives.
- PreferredInferenceModality
Gets or sets the preferred modality for inference. This determines whether text, image, or both modalities are used when processing input. Defaults to Multimodal.
Methods
- ClearContent()
Removes all previously set input (both text and attachments) so that no content remains for extraction.
- Parse(CancellationToken)
Parses the content synchronously to extract the defined elements.
- ParseAsync(CancellationToken)
Parses the content asynchronously to extract the defined elements.
- SchemaDiscovery(CancellationToken)
Discovers an optimal JSON Schema for the current input content.
- SchemaDiscoveryAsync(CancellationToken)
Asynchronously discovers an optimal JSON Schema for the current input content.
- SetContent(Attachment)
Adds all pages of an attachment to be processed for data extraction.
- SetContent(Attachment, int)
Adds a specific page of an attachment to be processed for data extraction.
- SetContent(Attachment, string)
Adds specified pages of an attachment to be processed for data extraction.
- SetContent(ImageBuffer)
Sets the content for extraction from the specified image buffer.
- SetContent(IEnumerable<Attachment>)
Adds multiple attachments to be processed for data extraction.
- SetContent(string)
Sets the text content from which elements will be extracted.
- SetElementsFromJsonSchema(string)
Configures the text extraction elements by parsing a JSON schema.