Import and Query Documents with Vision Understanding
Standard text extraction misses tables, charts, headers, and complex layouts in PDFs and scanned documents. DocumentRag extends the base RAG engine with vision-based document understanding: it renders each page as an image, processes it through a Vision Language Model (VLM), and generates layout-aware Markdown that preserves tables, headings, and formatting. This produces dramatically better retrieval results for visually complex documents like invoices, research papers, and technical manuals.
Why This Matters
Two enterprise problems that vision-based document understanding solves:
- Accurate table and chart extraction. Financial reports, lab results, and compliance documents contain critical data in tables and charts that plain text extraction destroys. Vision understanding preserves table structure as Markdown, making it searchable and answerable.
- Scanned document processing. Many enterprises still work with scanned PDFs (contracts, historical records, government filings). Vision understanding reads these documents without requiring a separate OCR pipeline.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16 GB recommended |
| VRAM | 8+ GB (embedding, vision, and chat models loaded simultaneously) |
| Disk | ~6 GB free for model downloads |
| PDF files | At least one .pdf file to test with |
Step 1: Create the Project
dotnet new console -n VisionDocQuickstart
cd VisionDocQuickstart
dotnet add package LM-Kit.NET
Step 2: Basic Document Import with Vision Understanding
This program loads three models (embedding, vision, and chat), imports a PDF using vision-based page processing, and queries the indexed content.
using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.Retrieval;
using LMKit.TextGeneration;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
downloadingProgress: DownloadProgress,
loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");
Console.WriteLine("Loading vision model...");
using LM visionModel = LM.LoadFromModelID("gemma3-vl:4b",
downloadingProgress: DownloadProgress,
loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");
Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: DownloadProgress,
loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");
// ──────────────────────────────────────
// 2. Create DocumentRag with vision understanding
// ──────────────────────────────────────
var docRag = new DocumentRag(embeddingModel)
{
ProcessingMode = DocumentRag.PageProcessingMode.DocumentUnderstanding,
VisionParser = new LMKit.Graphics.VlmOcr(visionModel)
};
// ──────────────────────────────────────
// 3. Import a PDF document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "documents/financial-report.pdf";
if (!File.Exists(pdfPath))
{
Console.WriteLine($"File not found: {pdfPath}");
Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
return;
}
Console.WriteLine($"Importing: {Path.GetFileName(pdfPath)}...");
var attachment = new Attachment(pdfPath);
var metadata = new DocumentMetadata(attachment, id: "fin-report-2024");
await docRag.ImportDocumentAsync(attachment, metadata, "Reports");
Console.WriteLine("Document imported and indexed.\n");
// ──────────────────────────────────────
// 4. Query the document
// ──────────────────────────────────────
string query = "What was the total revenue in Q3?";
Console.WriteLine($"Query: \"{query}\"\n");
var matches = docRag.FindMatchingPartitions(query, topK: 3, minScore: 0.3f);
foreach (var match in matches)
{
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($" [{match.SectionIdentifier}] score={match.Similarity:F3}");
Console.ResetColor();
}
var chat = new SingleTurnConversation(chatModel)
{
SystemPrompt = "Answer the question using only the provided context. If the context does not contain the answer, say so.",
MaximumCompletionTokens = 512
};
var result = docRag.QueryPartitions(query, matches, chat);
Console.WriteLine($"\nAnswer: {result.Completion}");
// ──────────────────────────────────────
// Helper callbacks
// ──────────────────────────────────────
static bool DownloadProgress(string path, long? contentLength, long bytesRead)
{
if (contentLength.HasValue)
Console.Write($"\r Downloading: {(double)bytesRead / contentLength.Value * 100:F1}% ");
return true;
}
static bool LoadProgress(float progress)
{
Console.Write($"\r Loading: {progress * 100:F0}% ");
return true;
}
Run it:
dotnet run -- "path/to/your/document.pdf"
Step 3: Import Specific Page Ranges
For large documents, you can import only the pages you need. This saves time and memory by skipping irrelevant sections.
// Import only pages 1 through 5
await docRag.ImportDocumentAsync(attachment, metadata, "Reports", pageRange: "1-5");
// Import specific pages
await docRag.ImportDocumentAsync(attachment, metadata, "Reports", pageRange: "1,3,7-10");
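If the pages you want come from upstream logic (a page classifier, a table-of-contents parse), a small helper can collapse them into the range syntax shown above. This is a plain utility sketch, not an LM-Kit API, and relevantPages is a hypothetical list you supply yourself:
// Hypothetical helper (not part of LM-Kit): collapses a sorted, de-duplicated
// list of page numbers such as [1, 3, 7, 8, 9, 10] into "1,3,7-10".
static string BuildPageRange(IReadOnlyList<int> pages)
{
    var parts = new List<string>();
    for (int i = 0; i < pages.Count; )
    {
        int start = pages[i];
        int end = start;
        while (i + 1 < pages.Count && pages[i + 1] == end + 1)
            end = pages[++i];   // extend the current consecutive run
        parts.Add(start == end ? $"{start}" : $"{start}-{end}");
        i++;
    }
    return string.Join(",", parts);
}
// relevantPages is your own list, e.g. new[] { 1, 3, 7, 8, 9, 10 } => "1,3,7-10"
await docRag.ImportDocumentAsync(attachment, metadata, "Reports",
    pageRange: BuildPageRange(relevantPages));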
Step 4: Choosing the Right Processing Mode
DocumentRag supports three processing modes. Choose the one that matches your document type.
// Auto mode: DocumentRag decides based on document content
docRag.ProcessingMode = DocumentRag.PageProcessingMode.Auto;
// Text extraction: fast, works well for text-heavy PDFs
docRag.ProcessingMode = DocumentRag.PageProcessingMode.TextExtraction;
// Document understanding: uses VLM for complex layouts, tables, charts
docRag.ProcessingMode = DocumentRag.PageProcessingMode.DocumentUnderstanding;
| Mode | Speed | Accuracy on Tables | Scanned Docs | VLM Required |
|---|---|---|---|---|
| `TextExtraction` | Fast | Low | No (needs OCR engine) | No |
| `DocumentUnderstanding` | Slower | High | Yes | Yes |
| `Auto` | Varies | Adaptive | Adaptive | Recommended |
Use TextExtraction for text-heavy PDFs with simple layouts (contracts, articles, reports with no tables). Use DocumentUnderstanding for visually complex documents (invoices, financial statements, research papers with figures). Auto inspects each page and selects the best approach automatically.
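When a single engine handles a mixed corpus, note that ProcessingMode is a mutable property, so you can switch it between imports. A short sketch, assuming the mode in effect at the time of each ImportDocumentAsync call is the one applied (the file names are illustrative):
// Scanned, table-heavy invoice: route through the VLM
var invoice = new Attachment("documents/invoice-scan.pdf");
docRag.ProcessingMode = DocumentRag.PageProcessingMode.DocumentUnderstanding;
await docRag.ImportDocumentAsync(invoice, new DocumentMetadata(invoice, id: "invoice-001"), "Invoices");
// Born-digital, text-only contract: fast text extraction is enough
var contract = new Attachment("documents/contract.pdf");
docRag.ProcessingMode = DocumentRag.PageProcessingMode.TextExtraction;
await docRag.ImportDocumentAsync(contract, new DocumentMetadata(contract, id: "contract-001"), "Contracts");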
Step 5: Managing the Document Lifecycle
Documents can be added, checked, and removed from the index at any time.
// Check if a document section exists
bool exists = docRag.HasSection("fin-report-2024");
// Delete a document by ID
await docRag.DeleteDocumentAsync("fin-report-2024", "Reports");
// Import an updated version
var updatedAttachment = new Attachment("documents/financial-report-v2.pdf");
var updatedMeta = new DocumentMetadata(updatedAttachment, id: "fin-report-2024-v2");
await docRag.ImportDocumentAsync(updatedAttachment, updatedMeta, "Reports");
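A common pattern is replacing a document in place: delete the stale version if it exists, then import the new file under the same ID. A sketch built from the calls above (the helper name is ours, not LM-Kit's):
static async Task ReplaceDocumentAsync(DocumentRag rag, string path, string id, string collection)
{
    // Drop the stale version first so the index holds only one copy per ID.
    if (rag.HasSection(id))
        await rag.DeleteDocumentAsync(id, collection);
    var attachment = new Attachment(path);
    await rag.ImportDocumentAsync(attachment, new DocumentMetadata(attachment, id: id), collection);
}
// Usage: re-index the updated report under its existing ID
await ReplaceDocumentAsync(docRag, "documents/financial-report-v2.pdf", "fin-report-2024", "Reports");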
Step 6: Adding Reranking for Better Retrieval
For the highest retrieval quality, combine vision-based document understanding with reranking. The reranker re-scores retrieved passages using a cross-encoder, improving ranking accuracy for domain-specific queries.
// Enable reranking on the DocumentRag engine
docRag.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f);
// Now queries will be reranked automatically
var matches = docRag.FindMatchingPartitions("quarterly revenue breakdown by region", topK: 5, minScore: 0.2f);
Step 7: Interactive Q&A Loop
For a full interactive experience, wrap the query logic in a loop with token streaming.
var multiChat = new SingleTurnConversation(chatModel)
{
SystemPrompt = "Answer the question using only the provided context. " +
"If the context does not contain the answer, say so.",
MaximumCompletionTokens = 512
};
multiChat.AfterTextCompletion += (_, e) =>
{
if (e.SegmentType == TextSegmentType.UserVisible)
Console.Write(e.Text);
};
Console.WriteLine("Ask questions about the document (or 'quit' to exit):\n");
while (true)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write("Question: ");
Console.ResetColor();
string? question = Console.ReadLine();
if (string.IsNullOrWhiteSpace(question) || question.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
var queryMatches = docRag.FindMatchingPartitions(question, topK: 3, minScore: 0.3f);
if (queryMatches.Count == 0)
{
Console.WriteLine("No relevant passages found in the document.\n");
continue;
}
Console.ForegroundColor = ConsoleColor.DarkGray;
foreach (var m in queryMatches)
Console.WriteLine($" [{m.SectionIdentifier}] score={m.Similarity:F3}");
Console.ResetColor();
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("\nAnswer: ");
Console.ResetColor();
var answer = docRag.QueryPartitions(question, queryMatches, multiChat);
Console.WriteLine($"\n [{answer.GeneratedTokenCount} tokens, {answer.TokenGenerationRate:F1} tok/s]\n");
}
Model Selection
Embedding Models
| Model ID | Size | Best For |
|---|---|---|
| `embeddinggemma-300m` | ~300 MB | General-purpose, fast, low memory (default) |
| `nomic-embed-text` | ~260 MB | High-quality text embeddings |
Vision Models
| Model ID | VRAM | Speed | Best For |
|---|---|---|---|
| `gemma3-vl:4b` | ~4 GB | Fast | General document understanding (recommended start) |
| `qwen3-vl:4b` | ~4 GB | Fast | Multilingual documents |
| `qwen3-vl:8b` | ~6.5 GB | Moderate | High accuracy on complex layouts |
| `gemma3:12b` | ~11 GB | Slower | Maximum accuracy, small text, dense tables |
Chat Models
| Model ID | VRAM | Best For |
|---|---|---|
| `gemma3:4b` | ~3.5 GB | Good quality, fast responses |
| `qwen3:4b` | ~3.5 GB | Strong reasoning, multilingual |
| `qwen3:8b` | ~6 GB | Best balance for document Q&A |
For document Q&A with vision, the recommended combination is embeddinggemma-300m + gemma3-vl:4b + gemma3:4b. This provides good quality while fitting within 8 GB of VRAM. If you have more VRAM available, upgrade the vision model to qwen3-vl:8b for better table and chart extraction.
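In code, the recommended stack is just the three LoadFromModelID calls from Step 2; swap the vision ID when you have the extra VRAM:
// Recommended trio for ~8 GB of VRAM
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
using LM visionModel = LM.LoadFromModelID("gemma3-vl:4b");   // or "qwen3-vl:8b" with more VRAM
using LM chatModel = LM.LoadFromModelID("gemma3:4b");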
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| `VisionParser` is null exception | DocumentUnderstanding mode without VLM | Set `docRag.VisionParser = new VlmOcr(visionModel)` |
| Tables not extracted correctly | Using TextExtraction mode | Switch to DocumentUnderstanding mode |
| Very slow import | VLM processes every page individually | Use pageRange to import only relevant pages |
| Large memory usage | Multiple large models loaded simultaneously | Use a smaller VLM or process documents in batches |
| Poor results on scanned PDFs | Text extraction fails on images | Use DocumentUnderstanding mode with a VLM |
| Low similarity scores after import | Vision output not well-chunked for embeddings | Increase MaxChunkSize on the chunking configuration |
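For the memory row in particular, one workaround is to stage the models instead of holding all three at once: the VLM is only needed during import. A sketch, assuming the index built by ImportDocumentAsync remains usable after the vision model is disposed:
var docRag = new DocumentRag(embeddingModel)
{
    ProcessingMode = DocumentRag.PageProcessingMode.DocumentUnderstanding
};
using (LM visionModel = LM.LoadFromModelID("gemma3-vl:4b"))
{
    docRag.VisionParser = new LMKit.Graphics.VlmOcr(visionModel);
    await docRag.ImportDocumentAsync(attachment, metadata, "Reports");
    docRag.VisionParser = null;   // detach the parser before its model is disposed
}
// Load the chat model only after the VLM has been released.
using LM chatModel = LM.LoadFromModelID("gemma3:4b");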
Next Steps
- Build a RAG Pipeline Over Your Own Documents: foundational RAG with `RagEngine` for text files and custom data sources.
- Chat with PDF Documents: high-level PDF chat API with conversation history.
- Build Semantic Search with Embeddings: embedding fundamentals and similarity computation.
- Convert Documents to Markdown with VLM OCR: standalone VLM OCR for document conversion.
- Analyze Images with Vision Language Models: image Q&A and visual analysis.