Class DocumentRag

Namespace: LMKit.Retrieval

Assembly: LM-Kit.NET.dll

Provides document-centric Retrieval-Augmented Generation (RAG) capabilities with built-in support for multi-page document processing, OCR, and vision-based document understanding.

public sealed class DocumentRag : RagEngine

Inheritance: object → RagEngine → DocumentRag

Inherited Members: RagEngine.Reranker

RagEngine.DefaultImagePayloadPix

RagEngine.DefaultIChunking

RagEngine.DataSources

RagEngine.Filter

RagEngine.MmrLambda

RagEngine.RetrievalStrategy

RagEngine.ContextWindow

RagEngine.MaxContextWindowCharacters

RagEngine.ImportTextFromFile(string, Encoding, string, string, CancellationToken)

RagEngine.ImportTextFromFile(string, Encoding, IChunking, string, string, CancellationToken)

RagEngine.ImportTextFromFile(string, Encoding, string, string, MetadataCollection, CancellationToken)

RagEngine.ImportTextFromFile(string, Encoding, IChunking, string, string, MetadataCollection, CancellationToken)

RagEngine.ImportTextFromFileAsync(string, Encoding, string, string, CancellationToken)

RagEngine.ImportTextFromFileAsync(string, Encoding, IChunking, string, string, CancellationToken)

RagEngine.ImportTextFromFileAsync(string, Encoding, string, string, MetadataCollection, CancellationToken)

RagEngine.ImportTextFromFileAsync(string, Encoding, IChunking, string, string, MetadataCollection, CancellationToken)

RagEngine.ImportText(string, string, string, CancellationToken)

RagEngine.ImportText(string, IChunking, string, string, CancellationToken)

RagEngine.ImportText(string, string, string, MetadataCollection, CancellationToken)

RagEngine.ImportText(string, IChunking, string, string, MetadataCollection, CancellationToken)

RagEngine.ImportTextAsync(string, string, string, CancellationToken)

RagEngine.ImportTextAsync(string, IChunking, string, string, CancellationToken)

RagEngine.ImportTextAsync(string, string, string, MetadataCollection, CancellationToken)

RagEngine.ImportTextAsync(string, IChunking, string, string, MetadataCollection, CancellationToken)

RagEngine.ImportText(IList<string>, string, IList<string>, IList<MetadataCollection>, CancellationToken)

RagEngine.ImportText(IList<string>, IChunking, string, IList<string>, IList<MetadataCollection>, CancellationToken)

RagEngine.ImportTextAsync(IList<string>, string, IList<string>, IList<MetadataCollection>, CancellationToken)

RagEngine.ImportTextAsync(IList<string>, IChunking, string, IList<string>, IList<MetadataCollection>, CancellationToken)

RagEngine.Import(Attachment, IChunking, string, string, CancellationToken)

RagEngine.Import(Attachment, string, string, CancellationToken)

RagEngine.Import(Attachment, IChunking, string, string, MetadataCollection, CancellationToken)

RagEngine.Import(Attachment, string, string, MetadataCollection, CancellationToken)

RagEngine.Import(IList<Attachment>, string, IList<string>, IList<MetadataCollection>, CancellationToken)

RagEngine.Import(IList<Attachment>, IChunking, string, IList<string>, IList<MetadataCollection>, CancellationToken)

RagEngine.ImportAsync(Attachment, string, string, CancellationToken)

RagEngine.ImportAsync(Attachment, IChunking, string, string, CancellationToken)

RagEngine.ImportAsync(Attachment, string, string, MetadataCollection, CancellationToken)

RagEngine.ImportAsync(Attachment, IChunking, string, string, MetadataCollection, CancellationToken)

RagEngine.ImportAsync(IList<Attachment>, string, IList<string>, IList<MetadataCollection>, CancellationToken)

RagEngine.ImportAsync(IList<Attachment>, IChunking, string, IList<string>, IList<MetadataCollection>, CancellationToken)

RagEngine.Import(ImageBuffer, string, string, CancellationToken)

RagEngine.Import(ImageBuffer, string, string, MetadataCollection, CancellationToken)

RagEngine.Import(IList<ImageBuffer>, string, IList<string>, IList<MetadataCollection>, CancellationToken)

RagEngine.ImportAsync(ImageBuffer, string, string, CancellationToken)

RagEngine.ImportAsync(ImageBuffer, string, string, MetadataCollection, CancellationToken)

RagEngine.ImportAsync(IList<ImageBuffer>, string, IList<string>, IList<MetadataCollection>, CancellationToken)

RagEngine.QueryPartitions(string, string, IEnumerable<PartitionSimilarity>, IConversation, CancellationToken)

RagEngine.QueryPartitionsAsync(string, string, IEnumerable<PartitionSimilarity>, IConversation, CancellationToken)

RagEngine.FindMatchingPartitions(string, int, float, bool, bool, CancellationToken)

RagEngine.FindMatchingPartitionsAsync(string, int, float, bool, bool, CancellationToken)

RagEngine.FindMatchingPartitions(Attachment, int, float, bool, bool, CancellationToken)

RagEngine.FindMatchingPartitionsAsync(Attachment, int, float, bool, bool, CancellationToken)

RagEngine.AddDataSource(DataSource)

RagEngine.AddDataSource(string, CancellationToken)

RagEngine.AddDataSourceAsync(string, CancellationToken)

RagEngine.AddDataSources(IEnumerable<DataSource>)

RagEngine.GetDataSource(string)

RagEngine.TryGetDataSource(string, out DataSource)

RagEngine.RemoveDataSource(DataSource)

RagEngine.RemoveDataSource(string)

RagEngine.ClearDataSources()

object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.ReferenceEquals(object, object)

object.ToString()

Examples

// Basic document RAG setup
LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
DocumentRag docRag = new DocumentRag(embeddingModel);

// Optional: Configure OCR for scanned documents
docRag.OcrEngine = new OcrEngine();

// Import a PDF document with explicit ID for lifecycle management
var attachment = Attachment.FromFile("report.pdf");
var metadata = new DocumentMetadata(attachment, id: "quarterly-report-2024-q4");
var dataSource = await docRag.ImportDocumentAsync(attachment, metadata, "reports");

// Search for relevant content
var matches = await docRag.FindMatchingPartitionsAsync("quarterly revenue", topK: 5);

// Generate a response with source references
LM chatModel = LM.LoadFromModelID("qwen3-vl:4b");
var conversation = new SingleTurnConversation(chatModel);
var result = await docRag.QueryPartitionsAsync("What was the quarterly revenue?", matches, conversation, default);

Console.WriteLine(result.Response.Text);
foreach (var reference in result.SourceReferences)
{
    Console.WriteLine($"Source: {reference.DocumentName}, Page {reference.PageNumber}");
}

// Later, delete the document using the same ID
await docRag.DeleteDocumentAsync("quarterly-report-2024-q4", "reports");
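
The explicit document ID also makes it possible to replace a document with a newer version. The following is a minimal sketch that reuses only the calls shown above; the file name `report-revised.pdf` is hypothetical, and the delete-then-import ordering is one reasonable approach rather than a documented requirement.

```csharp
// Continues from the example above: docRag is already configured.
// Remove the sections of the old version first, using the ID it was imported with.
await docRag.DeleteDocumentAsync("quarterly-report-2024-q4", "reports");

// Import the revised file under the same ID so later deletes and lookups
// keep working with a single identifier. The file name is hypothetical.
var revised = Attachment.FromFile("report-revised.pdf");
var revisedMetadata = new DocumentMetadata(revised, id: "quarterly-report-2024-q4");
var revisedSource = await docRag.ImportDocumentAsync(revised, revisedMetadata, "reports");
```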

Remarks

DocumentRag extends RagEngine to simplify working with document attachments such as PDFs, images, and other multi-page formats. It automatically handles page-by-page extraction, text chunking, and embedding generation.

The class supports three processing modes:

Auto (default): Automatically selects the best processing strategy per page based on content type and available engines.
TextExtraction: Uses traditional text extraction with optional OCR for image-based pages.
DocumentUnderstanding: Uses vision language models (VLMs) for advanced document parsing, preserving layout and structure as Markdown.

For OCR-based text extraction, configure the OcrEngine property. For vision-based document understanding, configure the VisionParser property.
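
As a rough sketch of how these settings fit together, the snippet below configures an instance for scanned input and forces the TextExtraction mode. The model ID and the `OcrEngine` construction are copied from the Examples section. The enum type name `PageProcessingMode` is an assumption: this page lists only the mode names, so check the ProcessingMode property reference for the actual type.

```csharp
// Load the embedding model and create the engine, as in the Examples section.
LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
DocumentRag docRag = new DocumentRag(embeddingModel);

// OCR engine used for image-based pages.
docRag.OcrEngine = new OcrEngine();

// Use traditional text extraction instead of the default Auto mode.
// ASSUMPTION: the enum is named PageProcessingMode; only its member names
// (Auto, TextExtraction, DocumentUnderstanding) appear on this page.
docRag.ProcessingMode = PageProcessingMode.TextExtraction;
```

Leaving ProcessingMode at its default lets the class choose a strategy per page, which according to the remarks above depends on the page content and on which engines have been configured.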

Constructors

DocumentRag(LM, IVectorStore): Initializes a new instance of the DocumentRag class with the specified embedding model and, optionally, a vector store.

Properties

MaxChunkSize: Gets or sets the maximum size in characters for text chunks during document import.

OcrEngine: Gets or sets the OCR engine used for extracting text from image-based document pages.

ProcessingMode: Gets or sets the page processing mode that determines how document pages are analyzed and text is extracted.

PromptTemplate: Gets or sets the prompt template used when querying partitions.

VisionParser: Gets or sets the vision language model (VLM) parser used for advanced document understanding.

Methods

DeleteDocument(string, string, CancellationToken): Deletes all sections associated with a specific document from a data source.

DeleteDocumentAsync(string, string, CancellationToken): Asynchronously deletes all sections associated with a specific document from a data source.

ImportDocument(Attachment, DocumentMetadata, string, string, CancellationToken): Imports a document into a DataSource, extracting text from each page and generating embeddings for retrieval.

ImportDocumentAsync(Attachment, DocumentMetadata, string, string, CancellationToken): Asynchronously imports a document into a DataSource, extracting text from each page and generating embeddings for retrieval.

QueryPartitions(string, IEnumerable<PartitionSimilarity>, IConversation, bool, CancellationToken): Generates a response by querying the specified partitions, optionally including page renderings for visual context, and returns the result with source document references.

QueryPartitions(string, IEnumerable<PartitionSimilarity>, IConversation, CancellationToken): Generates a response by querying the specified partitions and returns the result with source document references.

QueryPartitionsAsync(string, IEnumerable<PartitionSimilarity>, IConversation, bool, CancellationToken): Asynchronously generates a response by querying the specified partitions, optionally including page renderings for visual context, and returns the result with source document references.

QueryPartitionsAsync(string, IEnumerable<PartitionSimilarity>, IConversation, CancellationToken): Asynchronously generates a response by querying the specified partitions and returns the result with source document references.
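
The overloads that take a bool can be called as in the following sketch, which continues from the Examples section (`docRag`, `matches`, and `conversation` are the variables defined there). It assumes that passing `true` enables page renderings, as the overload description suggests. The parameter name is not given on this page, so the argument is passed positionally.

```csharp
// Continues from the Examples section: docRag, matches and conversation already exist.
// ASSUMPTION: the bool argument turns on page renderings as visual context.
var visualResult = await docRag.QueryPartitionsAsync(
    "What was the quarterly revenue?",
    matches,
    conversation,
    true,      // include page renderings (assumed meaning of the flag)
    default);  // no cancellation, equivalent to CancellationToken.None

Console.WriteLine(visualResult.Response.Text);
```

The chat model in the Examples section, qwen3-vl:4b, is a vision language model, which is presumably what allows the renderings to be used as context.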

Events

Progress: Occurs when document import progress changes, providing status updates for each processing phase.
