Build a Multi-Format Document Ingestion Pipeline

Enterprise applications rarely deal with a single document format. PDFs, Word documents, images, and HTML files all need to be ingested, chunked, embedded, and made searchable. LM-Kit.NET's DocumentRag class handles multi-format ingestion with a unified API. It supports text extraction, OCR, and vision-based document understanding, automatically choosing the best strategy per page. This tutorial builds a production document ingestion pipeline that processes mixed-format document collections.


Why a Unified Ingestion Pipeline Matters

A unified document ingestion pipeline solves two real-world problems:

  1. Mixed-format knowledge bases. A company's knowledge base contains scanned PDFs, typed Word documents, email screenshots, and HTML exports. Without a unified pipeline, each format requires separate parsing logic. DocumentRag abstracts format handling behind a single ImportDocumentAsync call.
  2. Scanned vs. digital document routing. Some PDF pages contain selectable text while others are scanned images. The Auto processing mode detects this per page and routes text pages through fast extraction while sending image pages through vision-based understanding.

Prerequisites

Requirement        Minimum
.NET SDK           8.0+
Embedding model    Any embedding model (e.g., qwen3-embedding:0.6b)
VRAM               2 GB+ for the embedding model
Input formats      PDF, DOCX, PPTX, EML, MBOX, PNG, JPEG, HTML, TXT, Markdown

For vision-based document understanding, you also need a Vision Language Model.


Step 1: Create the Project

dotnet new console -n DocumentIngestion
cd DocumentIngestion
dotnet add package LM-Kit.NET

Step 2: Understand the Processing Modes

┌─────────────────────────────────────┐
│          Incoming Document          │
│  (PDF, DOCX, EML, MBOX, PNG, HTML)  │
└─────────────┬───────────────────────┘
              │
              ▼
┌────────────────────────────┐
│  PageProcessingMode        │
├────────────────────────────┤
│  Auto (default)            │───► Checks each page:
│                            │     text available? → TextExtraction
│                            │     image-only?     → DocumentUnderstanding
├────────────────────────────┤
│  TextExtraction            │───► Fast text parsing + optional OCR
├────────────────────────────┤
│  DocumentUnderstanding     │───► VLM-based layout analysis
└────────────────────────────┘
              │
              ▼
┌────────────────────────────┐
│  Chunk → Embed → Store     │
│  (vector store)            │
└────────────────────────────┘

Mode                   Speed     Quality                When to use
Auto                   Adaptive  Best per page          Default for mixed documents
TextExtraction         Fast      Good for digital PDFs  Known text-based documents
DocumentUnderstanding  Slower    Excellent for layouts  Scanned docs, complex tables, forms
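
If all of your documents are known to be digital (text-based) PDFs, you can pin the mode instead of relying on Auto. A minimal sketch, assuming an embedding model has already been loaded as shown in Step 3:

// Sketch: force fast text extraction for a corpus of digital PDFs.
var digitalPdfRag = new DocumentRag(embeddingModel)
{
    ProcessingMode = PageProcessingMode.TextExtraction,
    MaxChunkSize = 512
};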

Step 3: Write the Ingestion Pipeline

using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.Retrieval;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the embedding model
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");

using LM embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b",
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
            Console.Write($"\r  Downloading: {(double)bytesRead / contentLength.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p =>
    {
        Console.Write($"\r  Loading: {p * 100:F0}%   ");
        return true;
    });

Console.WriteLine($"\n  Embedding model loaded: {embeddingModel.Name}\n");

// ──────────────────────────────────────
// 2. Create the DocumentRag instance
// ──────────────────────────────────────
var rag = new DocumentRag(embeddingModel)
{
    ProcessingMode = PageProcessingMode.Auto,
    MaxChunkSize = 512
};

// ──────────────────────────────────────
// 3. Subscribe to progress events
// ──────────────────────────────────────
rag.Progress += (sender, e) =>
{
    Console.WriteLine($"  [{e.DocumentName}] Page {e.PageIndex + 1}/{e.TotalPages}: {e.Phase}");
};

// ──────────────────────────────────────
// 4. Define the documents to ingest
// ──────────────────────────────────────
string documentsFolder = "documents";

if (!Directory.Exists(documentsFolder))
{
    Console.WriteLine($"Create a '{documentsFolder}' folder with documents, then run again.");
    return;
}

string[] supportedExtensions = { ".pdf", ".docx", ".pptx", ".eml", ".mbox", ".png", ".jpg", ".jpeg", ".html", ".txt", ".md" };

string[] documentFiles = Directory.GetFiles(documentsFolder)
    .Where(f => supportedExtensions.Contains(Path.GetExtension(f).ToLowerInvariant()))
    .ToArray();

Console.WriteLine($"Found {documentFiles.Length} document(s) to ingest.\n");

// ──────────────────────────────────────
// 5. Ingest each document
// ──────────────────────────────────────
string dataSourceId = "knowledge-base";
int successCount = 0;
int failCount = 0;

foreach (string filePath in documentFiles)
{
    string fileName = Path.GetFileName(filePath);
    Console.WriteLine($"Ingesting: {fileName}");

    try
    {
        // Create the attachment from the file
        using var attachment = new Attachment(filePath);

        // Create document metadata
        var metadata = new DocumentRag.DocumentMetadata(
            attachment: attachment,
            id: Path.GetFileNameWithoutExtension(fileName),
            sourceUri: Path.GetFullPath(filePath));

        // Import the document
        DataSource dataSource = await rag.ImportDocumentAsync(
            attachment,
            metadata,
            dataSourceId);

        Console.ForegroundColor = ConsoleColor.Green;
        Console.WriteLine($"  Ingested successfully.\n");
        Console.ResetColor();
        successCount++;
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"  Failed: {ex.Message}\n");
        Console.ResetColor();
        failCount++;
    }
}

// ──────────────────────────────────────
// 6. Summary
// ──────────────────────────────────────
Console.WriteLine("=== Ingestion Summary ===");
Console.WriteLine($"  Succeeded: {successCount}");
Console.WriteLine($"  Failed:    {failCount}");
Console.WriteLine($"  Total:     {documentFiles.Length}");

Step 4: Ingest Specific Page Ranges

For large PDFs, you can ingest only specific pages:

using LMKit.Data;
using LMKit.Retrieval;

using var attachment = new Attachment("documents/large-report.pdf");

var metadata = new DocumentRag.DocumentMetadata(
    attachment: attachment,
    id: "report-chapter-3");

// Ingest only pages 15 through 30
DataSource dataSource = await rag.ImportDocumentAsync(
    attachment,
    metadata,
    dataSourceId: "knowledge-base",
    pageRange: "15-30");

Console.WriteLine("Ingested pages 15-30 of the report.");

Step 5: Use Vision-Based Document Understanding

For scanned documents or complex layouts, set the processing mode to DocumentUnderstanding and provide a Vision Language Model:

using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;

// Load a vision-capable model for document understanding
using LM vlm = LM.LoadFromModelID("gemma3:4b");

var rag = new DocumentRag(embeddingModel)
{
    ProcessingMode = PageProcessingMode.DocumentUnderstanding,
    VisionParser = new VlmOcr(vlm)
};

// Now ingested documents will use VLM-based layout analysis
using var attachment = new Attachment("documents/scanned-invoice.pdf");

var metadata = new DocumentRag.DocumentMetadata(
    attachment: attachment,
    id: "invoice-2024-001");

DataSource dataSource = await rag.ImportDocumentAsync(
    attachment,
    metadata,
    "invoices");

Step 6: Ingest from Different Sources

Attachment supports multiple input sources beyond file paths:

// From a byte array (e.g., downloaded from an API)
byte[] pdfBytes = File.ReadAllBytes("document.pdf");
using var fromBytes = new Attachment(pdfBytes, "api-response.pdf");

// From a stream
using var stream = File.OpenRead("document.docx");
using var fromStream = new Attachment(stream, "streamed.docx");

// From a URI (downloads automatically)
var uri = new Uri("https://example.com/report.pdf");
using var fromUri = new Attachment(uri,
    downloadingProgress: (contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
            Console.Write($"\r  Downloading: {(double)bytesRead / contentLength.Value * 100:F1}%   ");
        return true;
    });

// From plain text
using var fromText = Attachment.CreateFromText(
    "This is a plain text document with important information.",
    "notes.txt");

Step 7: Add Custom Metadata

Attach custom metadata to documents for filtering during retrieval:

using LMKit.Data;
using LMKit.Retrieval;

// Load the embedding model and create the DocumentRag instance as in Step 3.

var customMetadata = new MetadataCollection();
customMetadata["department"] = "legal";
customMetadata["confidentiality"] = "internal";
customMetadata["author"] = "Jane Smith";

var metadata = new DocumentRag.DocumentMetadata(
    name: "Contract Agreement Q1 2025",
    id: "contract-q1-2025",
    sourceUri: "https://intranet.example.com/contracts/q1-2025",
    customMetadata: customMetadata);

using var attachment = new Attachment("documents/contract.pdf");

DataSource dataSource = await rag.ImportDocumentAsync(
    attachment,
    metadata,
    "legal-documents");

Step 8: Delete Documents

Remove documents from the vector store when they become outdated:

using LMKit.Retrieval;

// Load the embedding model and create the DocumentRag instance as in Step 3.

bool deleted = await rag.DeleteDocumentAsync(
    documentId: "contract-q1-2025",
    dataSourceIdentifier: "legal-documents");

if (deleted)
    Console.WriteLine("Document removed from the knowledge base.");
else
    Console.WriteLine("Document not found.");

PageProcessingMode Reference

Auto (enum value 0)
  Checks each page. Uses text extraction when text is available and falls back to vision-based understanding for image-only pages.

TextExtraction (enum value 1)
  Extracts embedded text. OCR may be used for image-based content when an OCR engine is available.

DocumentUnderstanding (enum value 2)
  Uses a Vision Language Model to analyze page layout and structure. Best for scanned documents, forms, and complex tables.

Common Issues

Slow ingestion on large PDFs
  Cause: DocumentUnderstanding processes every page with the VLM.
  Fix:   Use Auto mode or limit page ranges.

Empty text from scanned PDFs
  Cause: TextExtraction mode with no OCR engine available.
  Fix:   Switch to Auto or DocumentUnderstanding with a VLM.

Duplicate document error
  Cause: The same id was reused for different documents.
  Fix:   Use a unique ID per document, e.g., a hash of the file content (see the sketch below).

Poor chunk quality
  Cause: MaxChunkSize too large or too small.
  Fix:   Start with 512 and adjust based on retrieval quality.
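
For the duplicate-ID issue, a content hash makes a stable document ID, so the same file always maps to the same identifier regardless of its name. A minimal sketch using standard .NET hashing (nothing LM-Kit-specific):

using System.Security.Cryptography;

// Sketch: derive a document ID from the file's content.
static string ComputeContentId(string filePath)
{
    using var sha256 = SHA256.Create();
    using var stream = File.OpenRead(filePath);
    byte[] hash = sha256.ComputeHash(stream);
    return Convert.ToHexString(hash).ToLowerInvariant();
}

// Usage: pass the hash as the document ID when building metadata, e.g.
// new DocumentRag.DocumentMetadata(attachment: attachment, id: ComputeContentId(filePath))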

Next Steps