Import and Query Documents with Vision Understanding

Standard text extraction misses tables, charts, headers, and complex layouts in PDFs and scanned documents. DocumentRag extends the base RAG engine with vision-based document understanding: it renders each page as an image, processes it through a Vision Language Model (VLM), and generates layout-aware Markdown that preserves tables, headings, and formatting. This produces dramatically better retrieval results for visually complex documents like invoices, research papers, and technical manuals.


Why This Matters

Two enterprise problems that vision-based document understanding solves:

  1. Accurate table and chart extraction. Financial reports, lab results, and compliance documents contain critical data in tables and charts that plain text extraction destroys. Vision understanding preserves table structure as Markdown, making it searchable and answerable; see the example after this list.
  2. Scanned document processing. Many enterprises still work with scanned PDFs (contracts, historical records, government filings). Vision understanding reads these documents without requiring a separate OCR pipeline.
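
For example, a revenue table that plain text extraction flattens into a single run of words and numbers comes back from the vision parser as structured Markdown, roughly like this (illustrative output with made-up figures; exact formatting varies by model):

| Quarter | Revenue | YoY Growth |
|---------|---------|------------|
| Q1 2024 | $4.2M   | +12%       |
| Q2 2024 | $4.8M   | +14%       |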

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
RAM           16 GB recommended
VRAM          8+ GB (for VLM + embedding model simultaneously)
Disk          ~6 GB free for model downloads
PDF files     At least one .pdf file to test with

Step 1: Create the Project

dotnet new console -n VisionDocQuickstart
cd VisionDocQuickstart
dotnet add package LM-Kit.NET

Step 2: Basic Document Import with Vision Understanding

This program loads three models (embedding, vision, and chat), imports a PDF using vision-based page processing, and queries the indexed content.

using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.Retrieval;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey(""); // insert your license key here (left empty for this demo)

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
    downloadingProgress: DownloadProgress,
    loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");

Console.WriteLine("Loading vision model...");
using LM visionModel = LM.LoadFromModelID("gemma3-vl:4b",
    downloadingProgress: DownloadProgress,
    loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");

Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: DownloadProgress,
    loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");

// ──────────────────────────────────────
// 2. Create DocumentRag with vision understanding
// ──────────────────────────────────────
var docRag = new DocumentRag(embeddingModel)
{
    ProcessingMode = DocumentRag.PageProcessingMode.DocumentUnderstanding,
    VisionParser = new LMKit.Graphics.VlmOcr(visionModel)
};

// ──────────────────────────────────────
// 3. Import a PDF document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "documents/financial-report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

Console.WriteLine($"Importing: {Path.GetFileName(pdfPath)}...");

var attachment = new Attachment(pdfPath);
var metadata = new DocumentMetadata(attachment, id: "fin-report-2024");

await docRag.ImportDocumentAsync(attachment, metadata, "Reports");
Console.WriteLine("Document imported and indexed.\n");

// ──────────────────────────────────────
// 4. Query the document
// ──────────────────────────────────────
string query = "What was the total revenue in Q3?";
Console.WriteLine($"Query: \"{query}\"\n");

var matches = docRag.FindMatchingPartitions(query, topK: 3, minScore: 0.3f);

foreach (var match in matches)
{
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.WriteLine($"  [{match.SectionIdentifier}] score={match.Similarity:F3}");
    Console.ResetColor();
}

var chat = new SingleTurnConversation(chatModel)
{
    SystemPrompt = "Answer the question using only the provided context. If the context does not contain the answer, say so.",
    MaximumCompletionTokens = 512
};

var result = docRag.QueryPartitions(query, matches, chat);
Console.WriteLine($"\nAnswer: {result.Completion}");

// ──────────────────────────────────────
// Helper callbacks
// ──────────────────────────────────────
static bool DownloadProgress(string path, long? contentLength, long bytesRead)
{
    if (contentLength.HasValue)
        Console.Write($"\r  Downloading: {(double)bytesRead / contentLength.Value * 100:F1}%   ");
    return true;
}

static bool LoadProgress(float progress)
{
    Console.Write($"\r  Loading: {progress * 100:F0}%   ");
    return true;
}

Run it:

dotnet run -- "path/to/your/document.pdf"
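
The first run downloads the three models (roughly 6 GB total), so expect a wait. After that, output looks roughly like this; the scores, timings, and answer shown are illustrative and depend on your document and hardware:

Loading embedding model...
  Loading: 100%    Done.

Loading vision model...
  Loading: 100%    Done.

Loading chat model...
  Loading: 100%    Done.

Importing: financial-report.pdf...
Document imported and indexed.

Query: "What was the total revenue in Q3?"

  [fin-report-2024] score=0.687

Answer: ...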

Step 3: Import Specific Page Ranges

For large documents, you can import only the pages you need. This saves time and memory by skipping irrelevant sections.

// Import only pages 1 through 5
await docRag.ImportDocumentAsync(attachment, metadata, "Reports", pageRange: "1-5");

// Import specific pages
await docRag.ImportDocumentAsync(attachment, metadata, "Reports", pageRange: "1,3,7-10");
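
For very large documents, one convenient pattern is to import chapter-sized page ranges under separate document IDs, so each chapter can be refreshed or deleted independently. This sketch reuses the ImportDocumentAsync overload shown above; the chapter IDs and page ranges are made up:

// Hypothetical chapter map for a long manual; adjust IDs and ranges to your document.
var chapters = new (string Id, string Pages)[]
{
    ("manual-ch1", "1-40"),
    ("manual-ch2", "41-95"),
    ("manual-ch3", "96-150")
};

foreach (var (id, pages) in chapters)
{
    var chapterMeta = new DocumentMetadata(attachment, id: id);
    await docRag.ImportDocumentAsync(attachment, chapterMeta, "Manuals", pageRange: pages);
}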

Step 4: Choosing the Right Processing Mode

DocumentRag supports three processing modes. Choose the one that matches your document type.

// Auto mode: DocumentRag decides based on document content
docRag.ProcessingMode = DocumentRag.PageProcessingMode.Auto;

// Text extraction: fast, works well for text-heavy PDFs
docRag.ProcessingMode = DocumentRag.PageProcessingMode.TextExtraction;

// Document understanding: uses VLM for complex layouts, tables, charts
docRag.ProcessingMode = DocumentRag.PageProcessingMode.DocumentUnderstanding;

Mode                   Speed   Accuracy on Tables  Scanned Docs           VLM Required
TextExtraction         Fast    Low                 No (needs OCR engine)  No
DocumentUnderstanding  Slower  High                Yes                    Yes
Auto                   Varies  Adaptive            Adaptive               Recommended

Use TextExtraction for text-heavy PDFs with simple layouts (contracts, articles, reports with no tables). Use DocumentUnderstanding for visually complex documents (invoices, financial statements, research papers with figures). Auto inspects each page and selects the best approach automatically.
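
If you route heterogeneous documents through a single pipeline, you can also set the mode per document before each import. The helper below is illustrative, not part of LM-Kit; Auto makes a similar decision per page for you:

// Illustrative heuristic: pick a processing mode from what you already
// know about the source document before importing it.
static DocumentRag.PageProcessingMode PickMode(bool isScanned, bool hasTablesOrCharts)
{
    if (isScanned || hasTablesOrCharts)
        return DocumentRag.PageProcessingMode.DocumentUnderstanding;

    return DocumentRag.PageProcessingMode.TextExtraction;
}

// Example: a scanned contract goes through the VLM path.
docRag.ProcessingMode = PickMode(isScanned: true, hasTablesOrCharts: false);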


Step 5: Managing the Document Lifecycle

Documents can be added, checked, and removed from the index at any time.

// Check if a document section exists
bool exists = docRag.HasSection("fin-report-2024");

// Delete a document by ID
await docRag.DeleteDocumentAsync("fin-report-2024", "Reports");

// Import an updated version
var updatedAttachment = new Attachment("documents/financial-report-v2.pdf");
var updatedMeta = new DocumentMetadata(updatedAttachment, id: "fin-report-2024-v2");
await docRag.ImportDocumentAsync(updatedAttachment, updatedMeta, "Reports");
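
The replace pattern above is worth wrapping in a helper. This is a sketch: it assumes delete-then-import is the supported update path and that HasSection takes the document ID, as in the snippet above:

// Sketch: replace a document under the same ID by removing the old
// version first, then importing the new file.
static async Task ReplaceDocumentAsync(DocumentRag rag, string id, string collection, string newPath)
{
    if (rag.HasSection(id))
        await rag.DeleteDocumentAsync(id, collection);

    var newAttachment = new Attachment(newPath);
    var newMetadata = new DocumentMetadata(newAttachment, id: id);
    await rag.ImportDocumentAsync(newAttachment, newMetadata, collection);
}

// Usage:
await ReplaceDocumentAsync(docRag, "fin-report-2024", "Reports", "documents/financial-report-v2.pdf");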

Step 6: Adding Reranking for Better Retrieval

For the highest retrieval quality, combine vision-based document understanding with reranking. The reranker re-scores retrieved passages using a cross-encoder, improving ranking accuracy for domain-specific queries.

// Enable reranking on the DocumentRag engine
docRag.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f);

// Now queries will be reranked automatically
var matches = docRag.FindMatchingPartitions("quarterly revenue breakdown by region", topK: 5, minScore: 0.2f);
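
To see what the reranker changes, run the same query with and without it and compare the two orderings. One assumption here: assigning null to Reranker disables reranking; check the API reference to confirm:

// Baseline ranking from embedding similarity alone.
docRag.Reranker = null; // assumption: null disables reranking
var baseline = docRag.FindMatchingPartitions("quarterly revenue breakdown by region", topK: 5, minScore: 0.2f);

// Reranked: passages re-scored, blended via rerankedAlpha.
docRag.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f);
var reranked = docRag.FindMatchingPartitions("quarterly revenue breakdown by region", topK: 5, minScore: 0.2f);

foreach (var (b, r) in baseline.Zip(reranked))
    Console.WriteLine($"baseline={b.Similarity:F3}  reranked={r.Similarity:F3}");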

Step 7: Interactive Q&A Loop

For a full interactive experience, wrap the query logic in a loop with token streaming.

var multiChat = new SingleTurnConversation(chatModel)
{
    SystemPrompt = "Answer the question using only the provided context. " +
                   "If the context does not contain the answer, say so.",
    MaximumCompletionTokens = 512
};

multiChat.AfterTextCompletion += (_, e) =>
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
};

Console.WriteLine("Ask questions about the document (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("Question: ");
    Console.ResetColor();

    string? question = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(question) || question.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    var queryMatches = docRag.FindMatchingPartitions(question, topK: 3, minScore: 0.3f);

    if (queryMatches.Count == 0)
    {
        Console.WriteLine("No relevant passages found in the document.\n");
        continue;
    }

    Console.ForegroundColor = ConsoleColor.DarkGray;
    foreach (var m in queryMatches)
        Console.WriteLine($"  [{m.SectionIdentifier}] score={m.Similarity:F3}");
    Console.ResetColor();

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("\nAnswer: ");
    Console.ResetColor();

    var answer = docRag.QueryPartitions(question, queryMatches, multiChat);
    Console.WriteLine($"\n  [{answer.GeneratedTokenCount} tokens, {answer.TokenGenerationRate:F1} tok/s]\n");
}

Model Selection

Embedding Models

Model ID             Size     Best For
embeddinggemma-300m  ~300 MB  General-purpose, fast, low memory (default)
nomic-embed-text     ~260 MB  High-quality text embeddings

Vision Models

Model ID      VRAM     Speed     Best For
gemma3-vl:4b  ~4 GB    Fast      General document understanding (recommended start)
qwen3-vl:4b   ~4 GB    Fast      Multilingual documents
qwen3-vl:8b   ~6.5 GB  Moderate  High accuracy on complex layouts
gemma3:12b    ~11 GB   Slower    Maximum accuracy, small text, dense tables

Chat Models

Model ID   VRAM     Best For
gemma3:4b  ~3.5 GB  Good quality, fast responses
qwen3:4b   ~3.5 GB  Strong reasoning, multilingual
qwen3:8b   ~6 GB    Best balance for document Q&A

For document Q&A with vision, the recommended combination is embeddinggemma-300m + gemma3-vl:4b + gemma3:4b. This provides good quality while fitting within 8 GB of VRAM. If you have more VRAM available, upgrade the vision model to qwen3-vl:8b for better table and chart extraction.


Common Issues

Problem                             Cause                                           Fix
VisionParser is null exception      DocumentUnderstanding mode without VLM          Set docRag.VisionParser = new VlmOcr(visionModel)
Tables not extracted correctly      Using TextExtraction mode                       Switch to DocumentUnderstanding mode
Very slow import                    VLM processes every page individually          Use pageRange to import only relevant pages
Large memory usage                  Multiple large models loaded simultaneously    Use a smaller VLM or process documents in batches
Poor results on scanned PDFs        Text extraction fails on images                Use DocumentUnderstanding mode with a VLM
Low similarity scores after import  Vision output not well-chunked for embeddings  Increase MaxChunkSize on the chunking configuration

Next Steps