Build a Multi-Format Document Ingestion Pipeline
Enterprise applications rarely deal with a single document format. PDFs, Word documents, images, and HTML files all need to be ingested, chunked, embedded, and made searchable. LM-Kit.NET's DocumentRag class handles multi-format ingestion with a unified API. It supports text extraction, OCR, and vision-based document understanding, automatically choosing the best strategy per page. This tutorial builds a production-ready document ingestion pipeline that processes mixed-format document collections.
Why a Unified Ingestion Pipeline Matters
Two real-world problems that a unified document ingestion pipeline solves:
- Mixed-format knowledge bases. A company's knowledge base contains scanned PDFs, typed Word documents, email screenshots, and HTML exports. Without a unified pipeline, each format requires separate parsing logic. DocumentRag abstracts format handling behind a single ImportDocumentAsync call.
- Scanned vs. digital document routing. Some PDF pages contain selectable text while others are scanned images. The Auto processing mode detects this per page and routes text pages through fast extraction while sending image pages through vision-based understanding.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| Embedding model | Any embedding model (e.g., qwen3-embedding:0.6b) |
| VRAM | 2 GB+ for embedding model |
| Input formats | PDF, DOCX, PPTX, EML, MBOX, PNG, JPEG, HTML, TXT, Markdown |
For vision-based document understanding (Step 5), you also need a Vision Language Model (VLM).
Step 1: Create the Project
dotnet new console -n DocumentIngestion
cd DocumentIngestion
dotnet add package LM-Kit.NET
Step 2: Understand the Processing Modes
┌───────────────────────────────────┐
│         Incoming Document         │
│ (PDF, DOCX, EML, MBOX, PNG, HTML) │
└─────────────────┬─────────────────┘
                  │
                  ▼
┌───────────────────────────────────┐
│        PageProcessingMode         │
├───────────────────────────────────┤
│ Auto (default)                    │───► Checks each page:
│                                   │       text available? → TextExtraction
│                                   │       image-only?     → DocumentUnderstanding
├───────────────────────────────────┤
│ TextExtraction                    │───► Fast text parsing + optional OCR
├───────────────────────────────────┤
│ DocumentUnderstanding             │───► VLM-based layout analysis
└─────────────────┬─────────────────┘
                  │
                  ▼
┌───────────────────────────────────┐
│       Chunk → Embed → Store       │
│           (vector store)          │
└───────────────────────────────────┘
| Mode | Speed | Quality | When to use |
|---|---|---|---|
| Auto | Adaptive | Best per page | Default for mixed documents |
| TextExtraction | Fast | Good for digital PDFs | Known text-based documents |
| DocumentUnderstanding | Slower | Excellent for layouts | Scanned docs, complex tables, forms |
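The mode is a property on the DocumentRag instance, so you can run separate instances tuned to different corpora. A minimal sketch, assuming an already-loaded embedding model named embeddingModel (loaded as shown in Step 3 below):
// Known digital-born PDFs: skip per-page detection and use fast text parsing.
var digitalRag = new DocumentRag(embeddingModel)
{
    ProcessingMode = PageProcessingMode.TextExtraction
};
// Mixed or scanned collections: let Auto route each page to the right strategy.
var mixedRag = new DocumentRag(embeddingModel)
{
    ProcessingMode = PageProcessingMode.Auto
};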
Step 3: Write the Ingestion Pipeline
using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.Retrieval;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the embedding model
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b",
downloadingProgress: (path, contentLength, bytesRead) =>
{
if (contentLength.HasValue)
Console.Write($"\r Downloading: {(double)bytesRead / contentLength.Value * 100:F1}% ");
return true;
},
loadingProgress: p =>
{
Console.Write($"\r Loading: {p * 100:F0}% ");
return true;
});
Console.WriteLine($"\n Embedding model loaded: {embeddingModel.Name}\n");
// ──────────────────────────────────────
// 2. Create the DocumentRag instance
// ──────────────────────────────────────
var rag = new DocumentRag(embeddingModel)
{
ProcessingMode = PageProcessingMode.Auto,
MaxChunkSize = 512
};
// ──────────────────────────────────────
// 3. Subscribe to progress events
// ──────────────────────────────────────
rag.Progress += (sender, e) =>
{
Console.WriteLine($" [{e.DocumentName}] Page {e.PageIndex + 1}/{e.TotalPages}: {e.Phase}");
};
// ──────────────────────────────────────
// 4. Define the documents to ingest
// ──────────────────────────────────────
string documentsFolder = "documents";
if (!Directory.Exists(documentsFolder))
{
Console.WriteLine($"Create a '{documentsFolder}' folder with documents, then run again.");
return;
}
string[] supportedExtensions = { ".pdf", ".docx", ".pptx", ".eml", ".mbox", ".png", ".jpg", ".jpeg", ".html", ".txt", ".md" };
string[] documentFiles = Directory.GetFiles(documentsFolder)
.Where(f => supportedExtensions.Contains(Path.GetExtension(f).ToLowerInvariant()))
.ToArray();
Console.WriteLine($"Found {documentFiles.Length} document(s) to ingest.\n");
// ──────────────────────────────────────
// 5. Ingest each document
// ──────────────────────────────────────
string dataSourceId = "knowledge-base";
int successCount = 0;
int failCount = 0;
foreach (string filePath in documentFiles)
{
string fileName = Path.GetFileName(filePath);
Console.WriteLine($"Ingesting: {fileName}");
try
{
// Create the attachment from the file
using var attachment = new Attachment(filePath);
// Create document metadata
var metadata = new DocumentRag.DocumentMetadata(
attachment: attachment,
id: Path.GetFileNameWithoutExtension(fileName),
sourceUri: Path.GetFullPath(filePath));
// Import the document
DataSource dataSource = await rag.ImportDocumentAsync(
attachment,
metadata,
dataSourceId);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($" Ingested successfully.\n");
Console.ResetColor();
successCount++;
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($" Failed: {ex.Message}\n");
Console.ResetColor();
failCount++;
}
}
// ──────────────────────────────────────
// 6. Summary
// ──────────────────────────────────────
Console.WriteLine("=== Ingestion Summary ===");
Console.WriteLine($" Succeeded: {successCount}");
Console.WriteLine($" Failed: {failCount}");
Console.WriteLine($" Total: {documentFiles.Length}");
Step 4: Ingest Specific Page Ranges
For large PDFs, you can ingest only specific pages:
using LMKit.Data;
using LMKit.Retrieval;
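// Assumes the DocumentRag instance 'rag' from Step 3 is in scope.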
using var attachment = new Attachment("documents/large-report.pdf");
var metadata = new DocumentRag.DocumentMetadata(
attachment: attachment,
id: "report-chapter-3");
// Ingest only pages 15 through 30
DataSource dataSource = await rag.ImportDocumentAsync(
attachment,
metadata,
dataSourceId: "knowledge-base",
pageRange: "15-30");
Console.WriteLine("Ingested pages 15-30 of the report.");
Step 5: Use Vision-Based Document Understanding
For scanned documents or complex layouts, set the processing mode to DocumentUnderstanding and provide a Vision Language Model:
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;
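// Assumes the embedding model from Step 3 is already loaded as 'embeddingModel'.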
// Load a vision-capable model for document understanding
using LM vlm = LM.LoadFromModelID("gemma3:4b");
var rag = new DocumentRag(embeddingModel)
{
ProcessingMode = PageProcessingMode.DocumentUnderstanding,
VisionParser = new VlmOcr(vlm)
};
// Now ingested documents will use VLM-based layout analysis
using var attachment = new Attachment("documents/scanned-invoice.pdf");
var metadata = new DocumentRag.DocumentMetadata(
attachment: attachment,
id: "invoice-2024-001");
DataSource dataSource = await rag.ImportDocumentAsync(
attachment,
metadata,
"invoices");
Step 6: Ingest from Different Sources
Attachment supports multiple input sources beyond file paths:
// From a byte array (e.g., downloaded from an API)
byte[] pdfBytes = File.ReadAllBytes("document.pdf");
using var fromBytes = new Attachment(pdfBytes, "api-response.pdf");
// From a stream
using var stream = File.OpenRead("document.docx");
using var fromStream = new Attachment(stream, "streamed.docx");
// From a URI (downloads automatically)
var uri = new Uri("https://example.com/report.pdf");
using var fromUri = new Attachment(uri,
downloadingProgress: (contentLength, bytesRead) =>
{
if (contentLength.HasValue)
Console.Write($"\r Downloading: {(double)bytesRead / contentLength.Value * 100:F1}% ");
return true;
});
// From plain text
using var fromText = Attachment.CreateFromText(
"This is a plain text document with important information.",
"notes.txt");
Step 7: Add Custom Metadata
Attach custom metadata to documents for filtering during retrieval:
using LMKit.Data;
using LMKit.Retrieval;
// Assumes the embedding model and DocumentRag instance ('embeddingModel', 'rag')
// are set up exactly as in Step 3.
var customMetadata = new MetadataCollection();
customMetadata["department"] = "legal";
customMetadata["confidentiality"] = "internal";
customMetadata["author"] = "Jane Smith";
var metadata = new DocumentRag.DocumentMetadata(
name: "Contract Agreement Q1 2025",
id: "contract-q1-2025",
sourceUri: "https://intranet.example.com/contracts/q1-2025",
customMetadata: customMetadata);
using var attachment = new Attachment("documents/contract.pdf");
DataSource dataSource = await rag.ImportDocumentAsync(
attachment,
metadata,
"legal-documents");
Step 8: Delete Documents
Remove documents from the vector store when they become outdated:
using LMKit.Retrieval;
// Assumes the DocumentRag instance 'rag' from Step 3 is in scope.
bool deleted = await rag.DeleteDocumentAsync(
documentId: "contract-q1-2025",
dataSourceIdentifier: "legal-documents");
if (deleted)
Console.WriteLine("Document removed from the knowledge base.");
else
Console.WriteLine("Document not found.");
PageProcessingMode Reference
| Mode | Enum Value | Behavior |
|---|---|---|
| Auto | 0 | Checks each page. Uses text extraction when text is available, falls back to vision understanding for image-only pages |
| TextExtraction | 1 | Extracts embedded text. OCR may be used for image-based content when an OCR engine is available |
| DocumentUnderstanding | 2 | Uses a Vision Language Model to analyze page layout and structure. Best for scanned documents, forms, and complex tables |
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Slow ingestion on large PDFs | DocumentUnderstanding processes every page with the VLM | Use Auto mode or limit page ranges |
| Empty text from scanned PDFs | TextExtraction mode with no OCR engine | Switch to Auto or DocumentUnderstanding with a VLM |
| Duplicate document error | Same id used for different documents | Use unique IDs per document (e.g., hash of file content; see the sketch below) |
| Poor chunk quality | MaxChunkSize too large or too small | Start with 512 and adjust based on retrieval quality |
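For the duplicate-id row above, a content hash gives a stable, collision-resistant document id. A sketch using SHA-256 from the .NET base library; truncating to 16 hex characters is a readability choice, not a requirement:
using System.Security.Cryptography;
// Derive a document id from the file's content rather than its name.
static string ComputeDocumentId(string filePath)
{
    using var stream = File.OpenRead(filePath);
    byte[] hash = SHA256.HashData(stream);
    return Convert.ToHexString(hash)[..16].ToLowerInvariant();
}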
Next Steps
- Build a RAG Pipeline Over Your Own Documents: query the ingested documents.
- Chat with PDF Documents: interactive document Q&A.
- Automatically Split Multi-Document PDFs with AI Vision: detect logical document boundaries before ingestion.
- Preprocess Images for Vision Pipelines: clean images before ingestion.
- Process PDFs and Images with Built-In Document Tools: use PdfInfo, DocumentText, and PdfMerge tools in agent-driven ingestion workflows.