Import and Query Documents with Vision Understanding
Standard text extraction misses tables, charts, headers, and complex layouts in PDFs and scanned documents. DocumentRag extends the base RAG engine with vision-based document understanding: it renders each page as an image, processes it through a Vision Language Model (VLM), and generates layout-aware Markdown that preserves tables, headings, and formatting. This produces dramatically better retrieval results for visually complex documents like invoices, research papers, and technical manuals.
Why This Matters
Two enterprise problems that vision-based document understanding solves:
- Accurate table and chart extraction. Financial reports, lab results, and compliance documents contain critical data in tables and charts that plain text extraction destroys. Vision understanding preserves table structure as Markdown, making it searchable and answerable.
- Scanned document processing. Many enterprises still work with scanned PDFs (contracts, historical records, government filings). Vision understanding reads these documents without requiring a separate OCR pipeline.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16 GB recommended |
| VRAM | 8+ GB (embedding, vision, and chat models loaded simultaneously) |
| Disk | ~6 GB free for model downloads |
| PDF files | At least one .pdf file to test with |
Step 1: Create the Project
dotnet new console -n VisionDocQuickstart
cd VisionDocQuickstart
dotnet add package LM-Kit.NET
Step 2: Basic Document Import with Vision Understanding
This program loads three models (embedding, vision, and chat), imports a PDF using vision-based page processing, and queries the indexed content.
using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.Retrieval;
using LMKit.TextGeneration;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
downloadingProgress: DownloadProgress,
loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");
Console.WriteLine("Loading vision model...");
using LM visionModel = LM.LoadFromModelID("gemma3-vl:4b",
downloadingProgress: DownloadProgress,
loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");
Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: DownloadProgress,
loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");
// ──────────────────────────────────────
// 2. Create DocumentRag with vision understanding
// ──────────────────────────────────────
var docRag = new DocumentRag(embeddingModel)
{
ProcessingMode = DocumentRag.PageProcessingMode.DocumentUnderstanding,
VisionParser = new LMKit.Graphics.VlmOcr(visionModel)
};
// ──────────────────────────────────────
// 3. Import a PDF document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "documents/financial-report.pdf";
if (!File.Exists(pdfPath))
{
Console.WriteLine($"File not found: {pdfPath}");
Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
return;
}
Console.WriteLine($"Importing: {Path.GetFileName(pdfPath)}...");
var attachment = new Attachment(pdfPath);
var metadata = new DocumentMetadata(attachment, id: "fin-report-2024");
await docRag.ImportDocumentAsync(attachment, metadata, "Reports");
Console.WriteLine("Document imported and indexed.\n");
// ──────────────────────────────────────
// 4. Query the document
// ──────────────────────────────────────
string query = "What was the total revenue in Q3?";
Console.WriteLine($"Query: \"{query}\"\n");
var matches = docRag.FindMatchingPartitions(query, topK: 3, minScore: 0.3f);
foreach (var match in matches)
{
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($" [{match.SectionIdentifier}] score={match.Similarity:F3}");
Console.ResetColor();
}
var chat = new SingleTurnConversation(chatModel)
{
SystemPrompt = "Answer the question using only the provided context. If the context does not contain the answer, say so.",
MaximumCompletionTokens = 512
};
var result = docRag.QueryPartitions(query, matches, chat);
Console.WriteLine($"\nAnswer: {result.Completion}");
// ──────────────────────────────────────
// Helper callbacks
// ──────────────────────────────────────
static bool DownloadProgress(string path, long? contentLength, long bytesRead)
{
if (contentLength.HasValue)
Console.Write($"\r Downloading: {(double)bytesRead / contentLength.Value * 100:F1}% ");
return true;
}
static bool LoadProgress(float progress)
{
Console.Write($"\r Loading: {progress * 100:F0}% ");
return true;
}
Run it:
dotnet run -- "path/to/your/document.pdf"
Step 3: Import Specific Page Ranges
For large documents, you can import only the pages you need. This saves time and memory by skipping irrelevant sections.
// Import only pages 1 through 5
await docRag.ImportDocumentAsync(attachment, metadata, "Reports", pageRange: "1-5");
// Import specific pages
await docRag.ImportDocumentAsync(attachment, metadata, "Reports", pageRange: "1,3,7-10");
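If the pages you want come from upstream logic (a page classifier, a table-of-contents parse), a small helper can collapse them into the range syntax shown above. This is a plain utility sketch, not an LM-Kit API, and relevantPages is a hypothetical list you supply yourself:
// Hypothetical helper (not part of LM-Kit): collapses a sorted, de-duplicated
// list of page numbers such as [1, 3, 7, 8, 9, 10] into "1,3,7-10".
static string BuildPageRange(IReadOnlyList<int> pages)
{
    var parts = new List<string>();
    for (int i = 0; i < pages.Count; )
    {
        int start = pages[i];
        int end = start;
        while (i + 1 < pages.Count && pages[i + 1] == end + 1)
            end = pages[++i];   // extend the current consecutive run
        parts.Add(start == end ? $"{start}" : $"{start}-{end}");
        i++;
    }
    return string.Join(",", parts);
}
// relevantPages is your own list, e.g. new[] { 1, 3, 7, 8, 9, 10 } => "1,3,7-10"
await docRag.ImportDocumentAsync(attachment, metadata, "Reports",
    pageRange: BuildPageRange(relevantPages));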
Step 4: Choosing the Right Processing Mode
DocumentRag supports three processing modes. Choose the one that matches your document type.
// Auto mode: DocumentRag decides based on document content
docRag.ProcessingMode = DocumentRag.PageProcessingMode.Auto;
// Text extraction: fast, works well for text-heavy PDFs
docRag.ProcessingMode = DocumentRag.PageProcessingMode.TextExtraction;
// Document understanding: uses VLM for complex layouts, tables, charts
docRag.ProcessingMode = DocumentRag.PageProcessingMode.DocumentUnderstanding;
| Mode | Speed | Accuracy on Tables | Scanned Docs | VLM Required |
|---|---|---|---|---|
| `TextExtraction` | Fast | Low | No (needs OCR engine) | No |
| `DocumentUnderstanding` | Slower | High | Yes | Yes |
| `Auto` | Varies | Adaptive | Adaptive | Recommended |
Use TextExtraction for text-heavy PDFs with simple layouts (contracts, articles, reports with no tables). Use DocumentUnderstanding for visually complex documents (invoices, financial statements, research papers with figures). Auto inspects each page and selects the best approach automatically.
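When a single engine handles a mixed corpus, note that ProcessingMode is a mutable property, so you can switch it between imports. A short sketch, assuming the mode in effect at the time of each ImportDocumentAsync call is the one applied (the file names are illustrative):
// Scanned, table-heavy invoice: route through the VLM
var invoice = new Attachment("documents/invoice-scan.pdf");
docRag.ProcessingMode = DocumentRag.PageProcessingMode.DocumentUnderstanding;
await docRag.ImportDocumentAsync(invoice, new DocumentMetadata(invoice, id: "invoice-001"), "Invoices");
// Born-digital, text-only contract: fast text extraction is enough
var contract = new Attachment("documents/contract.pdf");
docRag.ProcessingMode = DocumentRag.PageProcessingMode.TextExtraction;
await docRag.ImportDocumentAsync(contract, new DocumentMetadata(contract, id: "contract-001"), "Contracts");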
Step 5: Managing the Document Lifecycle
Documents can be added, checked, and removed from the index at any time.
// Check if a document section exists
bool exists = docRag.HasSection("fin-report-2024");
// Delete a document by ID
await docRag.DeleteDocumentAsync("fin-report-2024", "Reports");
// Import an updated version
var updatedAttachment = new Attachment("documents/financial-report-v2.pdf");
var updatedMeta = new DocumentMetadata(updatedAttachment, id: "fin-report-2024-v2");
await docRag.ImportDocumentAsync(updatedAttachment, updatedMeta, "Reports");
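A common pattern is replacing a document in place: delete the stale version if it exists, then import the new file under the same ID. A sketch built from the calls above (the helper name is ours, not LM-Kit's):
static async Task ReplaceDocumentAsync(DocumentRag rag, string path, string id, string collection)
{
    // Drop the stale version first so the index holds only one copy per ID.
    if (rag.HasSection(id))
        await rag.DeleteDocumentAsync(id, collection);
    var attachment = new Attachment(path);
    await rag.ImportDocumentAsync(attachment, new DocumentMetadata(attachment, id: id), collection);
}
// Usage: re-index the updated report under its existing ID
await ReplaceDocumentAsync(docRag, "documents/financial-report-v2.pdf", "fin-report-2024", "Reports");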
Step 6: Adding Reranking for Better Retrieval
For the highest retrieval quality, combine vision-based document understanding with reranking. The reranker re-scores retrieved passages using a cross-encoder, improving ranking accuracy for domain-specific queries.
// Enable reranking on the DocumentRag engine
docRag.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f);
// Now queries will be reranked automatically
var matches = docRag.FindMatchingPartitions("quarterly revenue breakdown by region", topK: 5, minScore: 0.2f);
Step 7: Interactive Q&A Loop
For a full interactive experience, wrap the query logic in a loop with token streaming.
var multiChat = new SingleTurnConversation(chatModel)
{
SystemPrompt = "Answer the question using only the provided context. " +
"If the context does not contain the answer, say so.",
MaximumCompletionTokens = 512
};
multiChat.AfterTextCompletion += (_, e) =>
{
if (e.SegmentType == TextSegmentType.UserVisible)
Console.Write(e.Text);
};
Console.WriteLine("Ask questions about the document (or 'quit' to exit):\n");
while (true)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write("Question: ");
Console.ResetColor();
string? question = Console.ReadLine();
if (string.IsNullOrWhiteSpace(question) || question.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
var queryMatches = docRag.FindMatchingPartitions(question, topK: 3, minScore: 0.3f);
if (queryMatches.Count == 0)
{
Console.WriteLine("No relevant passages found in the document.\n");
continue;
}
Console.ForegroundColor = ConsoleColor.DarkGray;
foreach (var m in queryMatches)
Console.WriteLine($" [{m.SectionIdentifier}] score={m.Similarity:F3}");
Console.ResetColor();
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("\nAnswer: ");
Console.ResetColor();
var answer = docRag.QueryPartitions(question, queryMatches, multiChat);
Console.WriteLine($"\n [{answer.GeneratedTokenCount} tokens, {answer.TokenGenerationRate:F1} tok/s]\n");
}
Model Selection
Embedding Models
| Model ID | Size | Best For |
|---|---|---|
| `embeddinggemma-300m` | ~300 MB | General-purpose, fast, low memory (default) |
| `nomic-embed-text` | ~260 MB | High-quality text embeddings |
Vision Models
| Model ID | VRAM | Speed | Best For |
|---|---|---|---|
| `gemma3-vl:4b` | ~4 GB | Fast | General document understanding (recommended start) |
| `qwen3-vl:4b` | ~4 GB | Fast | Multilingual documents |
| `qwen3-vl:8b` | ~6.5 GB | Moderate | High accuracy on complex layouts |
| `gemma3:12b` | ~11 GB | Slower | Maximum accuracy, small text, dense tables |
Chat Models
| Model ID | VRAM | Best For |
|---|---|---|
| `gemma3:4b` | ~3.5 GB | Good quality, fast responses |
| `qwen3:4b` | ~3.5 GB | Strong reasoning, multilingual |
| `qwen3:8b` | ~6 GB | Best balance for document Q&A |
For document Q&A with vision, the recommended combination is embeddinggemma-300m + gemma3-vl:4b + gemma3:4b. This provides good quality while fitting within 8 GB of VRAM. If you have more VRAM available, upgrade the vision model to qwen3-vl:8b for better table and chart extraction.
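In code, the recommended stack is just the three LoadFromModelID calls from Step 2; swap the vision ID when you have the extra VRAM:
// Recommended trio for ~8 GB of VRAM
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
using LM visionModel = LM.LoadFromModelID("gemma3-vl:4b");   // or "qwen3-vl:8b" with more VRAM
using LM chatModel = LM.LoadFromModelID("gemma3:4b");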
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| `VisionParser` is null exception | DocumentUnderstanding mode without VLM | Set `docRag.VisionParser = new VlmOcr(visionModel)` |
| Tables not extracted correctly | Using TextExtraction mode | Switch to DocumentUnderstanding mode |
| Very slow import | VLM processes every page individually | Use pageRange to import only relevant pages |
| Large memory usage | Multiple large models loaded simultaneously | Use a smaller VLM or process documents in batches |
| Poor results on scanned PDFs | Text extraction fails on images | Use DocumentUnderstanding mode with a VLM |
| Low similarity scores after import | Vision output not well-chunked for embeddings | Increase MaxChunkSize on the chunking configuration |
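For the memory row in particular, one workaround is to stage the models instead of holding all three at once: the VLM is only needed during import. A sketch, assuming the index built by ImportDocumentAsync remains usable after the vision model is disposed:
var docRag = new DocumentRag(embeddingModel)
{
    ProcessingMode = DocumentRag.PageProcessingMode.DocumentUnderstanding
};
using (LM visionModel = LM.LoadFromModelID("gemma3-vl:4b"))
{
    docRag.VisionParser = new LMKit.Graphics.VlmOcr(visionModel);
    await docRag.ImportDocumentAsync(attachment, metadata, "Reports");
    docRag.VisionParser = null;   // detach the parser before its model is disposed
}
// Load the chat model only after the VLM has been released.
using LM chatModel = LM.LoadFromModelID("gemma3:4b");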
Next Steps
- Build a RAG Pipeline Over Your Own Documents: foundational RAG with `RagEngine` for text files and custom data sources.
- Chat with PDF Documents: high-level PDF chat API with conversation history.
- Build Semantic Search with Embeddings: embedding fundamentals and similarity computation.
- Convert Documents to Markdown with VLM OCR: standalone VLM OCR for document conversion.
- Analyze Images with Vision Language Models: image Q&A and visual analysis.