Method ImportDocumentAsync

Namespace: LMKit.Retrieval

Assembly: LM-Kit.NET.dll

ImportDocumentAsync(Attachment, DocumentMetadata, string, string, CancellationToken)

Asynchronously imports a document into a DataSource, extracting text from each page and generating embeddings for retrieval.

public Task<DataSource> ImportDocumentAsync(Attachment attachment, DocumentRag.DocumentMetadata documentMetadata, string dataSourceIdentifier, string pageRange = null, CancellationToken cancellationToken = default)

Parameters

attachment Attachment: The document attachment to import. Must not be null.
documentMetadata DocumentRag.DocumentMetadata: Metadata to associate with the document. Use this to specify a custom document name, reference URL, or additional metadata fields for source attribution.
dataSourceIdentifier string: The unique identifier for the target DataSource. If a matching data source exists, pages are added as new sections; otherwise, a new data source is created.
pageRange string: An optional page range specification (e.g., "1-5", "1,3,5-10") to import only specific pages. If null or empty, all pages are imported.
cancellationToken CancellationToken: A token to cancel the operation.
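
The pageRange syntax can be illustrated with a small standalone parser. This is only a sketch of how a specification such as "1,3,5-10" might expand into individual page numbers. PageRangeSketch.Expand is a hypothetical helper, not part of LM-Kit, and the library's own parsing rules (whitespace, duplicates, ordering, out-of-range pages) may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of LM-Kit.
public static class PageRangeSketch
{
    // Expands a specification such as "1,3,5-10" into 1-based page numbers.
    // Returns an empty list for null, empty, or whitespace input, which a
    // caller could treat as "all pages".
    public static List<int> Expand(string pageRange)
    {
        var pages = new List<int>();
        if (string.IsNullOrWhiteSpace(pageRange))
        {
            return pages;
        }

        // Each comma-separated segment is either a single page or "first-last".
        foreach (string part in pageRange.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] bounds = part.Split('-');
            if (bounds.Length > 2)
            {
                throw new ArgumentException("Invalid page range segment: '" + part + "'.", nameof(pageRange));
            }

            int first = int.Parse(bounds[0].Trim());
            int last = bounds.Length == 2 ? int.Parse(bounds[1].Trim()) : first;

            if (first < 1 || last < first)
            {
                throw new ArgumentException("Invalid page range segment: '" + part + "'.", nameof(pageRange));
            }

            for (int page = first; page <= last; page++)
            {
                pages.Add(page);
            }
        }

        return pages;
    }
}
```

Under these assumptions, "1,3,5-10" expands to pages 1, 3, 5, 6, 7, 8, 9, and 10.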

Returns

Task<DataSource>: A task that resolves to the DataSource containing the imported document, or null if no text could be extracted from any page.

Examples

LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
DocumentRag docRag = new DocumentRag(embeddingModel);

// Import with explicit ID for document lifecycle management
var attachment = Attachment.FromFile("document.pdf");
var metadata = new DocumentMetadata(attachment, id: "annual-report-2024");
var dataSource = await docRag.ImportDocumentAsync(attachment, metadata, "documents");

// Import with custom metadata including reference URL
var customMetadata = new DocumentMetadata(
    attachment,
    id: "q4-financial-report",
    sourceUri: "https://intranet.example.com/docs/q4-report.pdf",
    customMetadata: new MetadataCollection
    {
        { "author", "Finance Team" },
        { "department", "Finance" }
    });
var dataSourceWithMeta = await docRag.ImportDocumentAsync(attachment, customMetadata, "financial-reports");

// Import specific pages
var partialMetadata = new DocumentMetadata(attachment, id: "summary-pages");
var selectedSource = await docRag.ImportDocumentAsync(attachment, partialMetadata, "documents", pageRange: "1,5,10-15");

// Later, delete a document using its ID
await docRag.DeleteDocumentAsync("annual-report-2024", "documents");

Remarks

This method processes the document page by page according to the configured ProcessingMode. Each page becomes a separate section in the data source, with metadata recording the page number and document name for source attribution.

The Progress event is raised throughout the import process to report status updates.

Pages that produce no extractable text (e.g., blank pages or image-only pages processed without OCR) are skipped. If no page yields any text, the returned task resolves to null.

Exceptions

ArgumentNullException: Thrown if attachment or documentMetadata is null.
ArgumentException: Thrown if dataSourceIdentifier is null, empty, or whitespace.
OperationCanceledException: Thrown if the operation is canceled.
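
Because the method accepts a CancellationToken and may throw OperationCanceledException, callers may want to bound a long import with a timeout. The following is a minimal sketch of one way to do that. ImportSketch.RunWithTimeoutAsync is a hypothetical wrapper, not part of LM-Kit, and it treats cancellation the same as a null result purely for brevity.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; not part of LM-Kit.
//
// Possible usage, with docRag, attachment, and metadata created as in the
// Examples section:
//   var source = await ImportSketch.RunWithTimeoutAsync(
//       ct => docRag.ImportDocumentAsync(attachment, metadata, "documents", cancellationToken: ct),
//       TimeSpan.FromMinutes(5));
//   if (source == null) { /* canceled, or no text could be extracted */ }
public static class ImportSketch
{
    // Runs an import delegate with a timeout. Returns null if the import is
    // canceled or if the import itself resolves to null.
    public static async Task<T> RunWithTimeoutAsync<T>(
        Func<CancellationToken, Task<T>> import,
        TimeSpan timeout) where T : class
    {
        // The token source cancels its token automatically once the timeout elapses.
        using (var cts = new CancellationTokenSource(timeout))
        {
            try
            {
                return await import(cts.Token).ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                // The import was canceled, for example because the timeout elapsed.
                return null;
            }
        }
    }
}
```

A caller that needs to tell a timeout apart from a document with no extractable text would catch OperationCanceledException itself instead of mapping it to null.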
