Build a Multi-Format Document Ingestion Pipeline

Enterprise applications rarely deal with a single document format. PDFs, Word documents, images, and HTML files all need to be ingested, chunked, embedded, and made searchable. LM-Kit.NET's DocumentRag class handles multi-format ingestion with a unified API. It supports text extraction, OCR, and vision-based document understanding, automatically choosing the best strategy per page. This tutorial builds a production document ingestion pipeline that processes mixed-format document collections.


Why a Unified Ingestion Pipeline Matters

A unified document ingestion pipeline solves two real-world problems:

  1. Mixed-format knowledge bases. A company's knowledge base contains scanned PDFs, typed Word documents, email screenshots, and HTML exports. Without a unified pipeline, each format requires separate parsing logic. DocumentRag abstracts format handling behind a single ImportDocumentAsync call.
  2. Scanned vs. digital document routing. Some PDF pages contain selectable text while others are scanned images. The Auto processing mode detects this per page and routes text pages through fast extraction while sending image pages through vision-based understanding.

Prerequisites

Requirement        Minimum
.NET SDK           8.0+
Embedding model    Any embedding model (e.g., qwen3-embedding:0.6b)
VRAM               2 GB+ for the embedding model
Input formats      PDF, DOCX, PPTX, EML, MBOX, PNG, JPEG, HTML, TXT, Markdown

For vision-based document understanding, you also need a Vision Language Model.


Step 1: Create the Project

dotnet new console -n DocumentIngestion
cd DocumentIngestion
dotnet add package LM-Kit.NET

Step 2: Understand the Processing Modes

┌─────────────────────────────────────┐
│          Incoming Document          │
│  (PDF, DOCX, EML, MBOX, PNG, HTML)  │
└─────────────┬───────────────────────┘
              │
              ▼
┌────────────────────────────┐
│  PageProcessingMode        │
├────────────────────────────┤
│  Auto (default)            │───► Checks each page:
│                            │     text available? → TextExtraction
│                            │     image-only?     → DocumentUnderstanding
├────────────────────────────┤
│  TextExtraction            │───► Fast text parsing + optional OCR
├────────────────────────────┤
│  DocumentUnderstanding     │───► VLM-based layout analysis
└────────────────────────────┘
              │
              ▼
┌────────────────────────────┐
│  Chunk → Embed → Store     │
│  (vector store)            │
└────────────────────────────┘

Mode                   Speed     Quality                When to use
Auto                   Adaptive  Best per page          Default for mixed documents
TextExtraction         Fast      Good for digital PDFs  Known text-based documents
DocumentUnderstanding  Slower    Excellent for layouts  Scanned docs, complex tables, forms
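
If all of your documents are known to be digital (text-based) PDFs, you can pin the mode instead of relying on Auto. A minimal sketch, assuming an embedding model has already been loaded as shown in Step 3:

// Sketch: force fast text extraction for a corpus of digital PDFs.
var digitalPdfRag = new DocumentRag(embeddingModel)
{
    ProcessingMode = PageProcessingMode.TextExtraction,
    MaxChunkSize = 512
};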

Step 3: Write the Ingestion Pipeline

using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.Retrieval;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the embedding model
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");

using LM embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b",
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
            Console.Write($"\r  Downloading: {(double)bytesRead / contentLength.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p =>
    {
        Console.Write($"\r  Loading: {p * 100:F0}%   ");
        return true;
    });

Console.WriteLine($"\n  Embedding model loaded: {embeddingModel.Name}\n");

// ──────────────────────────────────────
// 2. Create the DocumentRag instance
// ──────────────────────────────────────
var rag = new DocumentRag(embeddingModel)
{
    ProcessingMode = PageProcessingMode.Auto,
    MaxChunkSize = 512
};

// ──────────────────────────────────────
// 3. Subscribe to progress events
// ──────────────────────────────────────
rag.Progress += (sender, e) =>
{
    Console.WriteLine($"  [{e.DocumentName}] Page {e.PageIndex + 1}/{e.TotalPages}: {e.Phase}");
};

// ──────────────────────────────────────
// 4. Define the documents to ingest
// ──────────────────────────────────────
string documentsFolder = "documents";

if (!Directory.Exists(documentsFolder))
{
    Console.WriteLine($"Create a '{documentsFolder}' folder with documents, then run again.");
    return;
}

string[] supportedExtensions = { ".pdf", ".docx", ".pptx", ".eml", ".mbox", ".png", ".jpg", ".jpeg", ".html", ".txt", ".md" };

string[] documentFiles = Directory.GetFiles(documentsFolder)
    .Where(f => supportedExtensions.Contains(Path.GetExtension(f).ToLowerInvariant()))
    .ToArray();

Console.WriteLine($"Found {documentFiles.Length} document(s) to ingest.\n");

// ──────────────────────────────────────
// 5. Ingest each document
// ──────────────────────────────────────
string dataSourceId = "knowledge-base";
int successCount = 0;
int failCount = 0;

foreach (string filePath in documentFiles)
{
    string fileName = Path.GetFileName(filePath);
    Console.WriteLine($"Ingesting: {fileName}");

    try
    {
        // Create the attachment from the file
        using var attachment = new Attachment(filePath);

        // Create document metadata
        var metadata = new DocumentRag.DocumentMetadata(
            attachment: attachment,
            id: Path.GetFileNameWithoutExtension(fileName),
            sourceUri: Path.GetFullPath(filePath));

        // Import the document
        DataSource dataSource = await rag.ImportDocumentAsync(
            attachment,
            metadata,
            dataSourceId);

        Console.ForegroundColor = ConsoleColor.Green;
        Console.WriteLine($"  Ingested successfully.\n");
        Console.ResetColor();
        successCount++;
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"  Failed: {ex.Message}\n");
        Console.ResetColor();
        failCount++;
    }
}

// ──────────────────────────────────────
// 6. Summary
// ──────────────────────────────────────
Console.WriteLine("=== Ingestion Summary ===");
Console.WriteLine($"  Succeeded: {successCount}");
Console.WriteLine($"  Failed:    {failCount}");
Console.WriteLine($"  Total:     {documentFiles.Length}");

Step 4: Ingest Specific Page Ranges

For large PDFs, you can ingest only specific pages:

using LMKit.Data;
using LMKit.Retrieval;

using var attachment = new Attachment("documents/large-report.pdf");

var metadata = new DocumentRag.DocumentMetadata(
    attachment: attachment,
    id: "report-chapter-3");

// Ingest only pages 15 through 30
DataSource dataSource = await rag.ImportDocumentAsync(
    attachment,
    metadata,
    dataSourceId: "knowledge-base",
    pageRange: "15-30");

Console.WriteLine("Ingested pages 15-30 of the report.");

Step 5: Use Vision-Based Document Understanding

For scanned documents or complex layouts, set the processing mode to DocumentUnderstanding and provide a Vision Language Model:

using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;

// Load a vision-capable model for document understanding
using LM vlm = LM.LoadFromModelID("gemma3:4b");

var rag = new DocumentRag(embeddingModel)
{
    ProcessingMode = PageProcessingMode.DocumentUnderstanding,
    VisionParser = new VlmOcr(vlm)
};

// Now ingested documents will use VLM-based layout analysis
using var attachment = new Attachment("documents/scanned-invoice.pdf");

var metadata = new DocumentRag.DocumentMetadata(
    attachment: attachment,
    id: "invoice-2024-001");

DataSource dataSource = await rag.ImportDocumentAsync(
    attachment,
    metadata,
    "invoices");

Step 6: Ingest from Different Sources

Attachment supports multiple input sources beyond file paths:

// From a byte array (e.g., downloaded from an API)
byte[] pdfBytes = File.ReadAllBytes("document.pdf");
using var fromBytes = new Attachment(pdfBytes, "api-response.pdf");

// From a stream
using var stream = File.OpenRead("document.docx");
using var fromStream = new Attachment(stream, "streamed.docx");

// From a URI (downloads automatically)
var uri = new Uri("https://example.com/report.pdf");
using var fromUri = new Attachment(uri,
    downloadingProgress: (contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
            Console.Write($"\r  Downloading: {(double)bytesRead / contentLength.Value * 100:F1}%   ");
        return true;
    });

// From plain text
using var fromText = Attachment.CreateFromText(
    "This is a plain text document with important information.",
    "notes.txt");

Step 7: Add Custom Metadata

Attach custom metadata to documents for filtering during retrieval:

using LMKit.Data;
using LMKit.Retrieval;

// Load the embedding model and create the DocumentRag instance as in Step 3.

var customMetadata = new MetadataCollection();
customMetadata["department"] = "legal";
customMetadata["confidentiality"] = "internal";
customMetadata["author"] = "Jane Smith";

var metadata = new DocumentRag.DocumentMetadata(
    name: "Contract Agreement Q1 2025",
    id: "contract-q1-2025",
    sourceUri: "https://intranet.example.com/contracts/q1-2025",
    customMetadata: customMetadata);

using var attachment = new Attachment("documents/contract.pdf");

DataSource dataSource = await rag.ImportDocumentAsync(
    attachment,
    metadata,
    "legal-documents");

Step 8: Delete Documents

Remove documents from the vector store when they become outdated:

using LMKit.Retrieval;

// Load the embedding model and create the DocumentRag instance as in Step 3.

bool deleted = await rag.DeleteDocumentAsync(
    documentId: "contract-q1-2025",
    dataSourceIdentifier: "legal-documents");

if (deleted)
    Console.WriteLine("Document removed from the knowledge base.");
else
    Console.WriteLine("Document not found.");

PageProcessingMode Reference

Auto (enum value 0)
  Checks each page. Uses text extraction when text is available and falls back to vision-based understanding for image-only pages.

TextExtraction (enum value 1)
  Extracts embedded text. OCR may be used for image-based content when an OCR engine is available.

DocumentUnderstanding (enum value 2)
  Uses a Vision Language Model to analyze page layout and structure. Best for scanned documents, forms, and complex tables.

Common Issues

Slow ingestion on large PDFs
  Cause: DocumentUnderstanding processes every page with the VLM.
  Fix:   Use Auto mode or limit page ranges.

Empty text from scanned PDFs
  Cause: TextExtraction mode with no OCR engine available.
  Fix:   Switch to Auto or DocumentUnderstanding with a VLM.

Duplicate document error
  Cause: The same id was reused for different documents.
  Fix:   Use a unique ID per document, e.g., a hash of the file content (see the sketch below).

Poor chunk quality
  Cause: MaxChunkSize too large or too small.
  Fix:   Start with 512 and adjust based on retrieval quality.
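
For the duplicate-ID issue, a content hash makes a stable document ID, so the same file always maps to the same identifier regardless of its name. A minimal sketch using standard .NET hashing (nothing LM-Kit-specific):

using System.Security.Cryptography;

// Sketch: derive a document ID from the file's content.
static string ComputeContentId(string filePath)
{
    using var sha256 = SHA256.Create();
    using var stream = File.OpenRead(filePath);
    byte[] hash = sha256.ComputeHash(stream);
    return Convert.ToHexString(hash).ToLowerInvariant();
}

// Usage: pass the hash as the document ID when building metadata, e.g.
// new DocumentRag.DocumentMetadata(attachment: attachment, id: ComputeContentId(filePath))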

Next Steps