Build a Private Document Q&A System
This tutorial builds an on-device document Q&A system that loads PDFs, answers questions with source references, maintains multi-turn conversation history, and streams responses token by token. Everything runs locally with no cloud API calls.
Why Private Document Q&A
Two enterprise problems that local document Q&A solves:
- Regulated document access. Healthcare, legal, and financial organizations handle sensitive documents (patient records, contracts, financial statements) that cannot leave the organization's infrastructure. A local Q&A system keeps all data on-premises while delivering AI-powered answers.
- Offline field operations. Engineers, auditors, and inspectors working in disconnected environments need to query technical manuals, compliance checklists, and SOPs. A local system runs on a laptop with no internet required.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16 GB recommended |
| VRAM | 6 GB (for both embedding and chat models) |
| Disk | ~4 GB free for model downloads |
| PDF files | At least one .pdf file to test with |
Step 1: Create the Project
dotnet new console -n PrivateDocQA
cd PrivateDocQA
dotnet add package LM-Kit.NET
Step 2: Understand the Architecture
PdfChat is a high-level class that wraps document parsing, chunking, embedding, retrieval, and chat generation into a single API. Under the hood, it uses RagEngine for vector search and MultiTurnConversation for contextual responses.
┌─────────────────────────────────────────────┐
│ PdfChat │
│ │
PDF files ───► │ LoadDocument() │
│ │ │
│ ▼ │
│ Parse ► Chunk ► Embed ► Store │
│ │ │
User query ──► │ Submit() │ │
│ │ │ │
│ ▼ ▼ │
│ Embed query ► Similarity Search │
│ │ │
│ ▼ │
│ Top-K passages │
│ │ │
│ ▼ │
│ Inject into prompt + Chat history │
│ │ │
│ ▼ │
│ Generate answer │
│ + source refs │
└─────────────────────────────────────────────┘
Key advantage: PdfChat maintains conversation history automatically, so follow-up questions reference prior answers without any extra code.
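To make the flow concrete, here is what the retrieval stage does conceptually, sketched in plain C#. This is an illustration only, not the LM-Kit API (PdfChat performs embedding, search, and prompt injection internally): chunks and the query become float vectors, cosine similarity ranks the chunks, and the top-K results above the relevance threshold are kept.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Conceptual sketch of passage retrieval (illustration, not the LM-Kit API).
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Rank indexed chunks against the query; keep the best K above a threshold.
static IEnumerable<(string Chunk, double Score)> Retrieve(
    float[] queryVector,
    IReadOnlyList<(string Text, float[] Vector)> index,
    int topK, double minScore) =>
    index.Select(c => (Chunk: c.Text, Score: CosineSimilarity(queryVector, c.Vector)))
         .Where(r => r.Score >= minScore)
         .OrderByDescending(r => r.Score)
         .Take(topK);

// Toy 2-dimensional "embeddings" for demonstration.
var index = new (string Text, float[] Vector)[]
{
    ("intro",    new[] { 1.0f, 0.0f }),
    ("appendix", new[] { 0.0f, 1.0f }),
    ("summary",  new[] { 0.7f, 0.7f })
};

foreach (var (chunk, score) in Retrieve(new[] { 1.0f, 0.0f }, index, topK: 2, minScore: 0.25))
    Console.WriteLine($"{chunk}: {score:F2}"); // intro: 1.00, then summary: 0.71
```

In the real system the vectors have hundreds of dimensions and come from the embedding model, but the ranking logic is the same idea.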
Step 3: Basic PDF Q&A
This is the minimal working program. It loads one PDF and starts a chat loop.
using System.Text;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey(""); // set your license key here

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
    downloadingProgress: DownloadProgress,
    loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");

Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: DownloadProgress,
    loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");

// ──────────────────────────────────────
// 2. Create PdfChat instance
// ──────────────────────────────────────
using var pdfChat = new PdfChat(chatModel, embeddingModel)
{
    MaximumCompletionTokens = 1024,
    MaxRetrievedPassages = 5,
    MinRelevanceScore = 0.25f
};

// Stream tokens as they are generated
pdfChat.AfterTextCompletion += (_, e) =>
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
};

// ──────────────────────────────────────
// 3. Load PDF documents
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "document.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

Console.WriteLine($"Indexing {Path.GetFileName(pdfPath)}...");

pdfChat.DocumentImportProgress += (_, e) =>
{
    int percent = (int)((e.PageIndex + 1) / (float)e.TotalPages * 100);
    Console.Write($"\r Processing: page {e.PageIndex + 1}/{e.TotalPages} ({percent}%) ");
};

var indexResult = await pdfChat.LoadDocumentAsync(pdfPath);
Console.WriteLine($"\n Indexed {indexResult.PageCount} pages ({indexResult.TokenCount} tokens).\n");

// ──────────────────────────────────────
// 4. Chat loop
// ──────────────────────────────────────
Console.WriteLine("Ask a question about the document (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("You: ");
    Console.ResetColor();

    string? question = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(question) || question.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("Answer: ");
    Console.ResetColor();

    var result = await pdfChat.SubmitAsync(question);

    // Show source references
    if (result.SourceReferences?.Count > 0)
    {
        Console.ForegroundColor = ConsoleColor.DarkGray;
        Console.WriteLine("\n\n Sources:");
        foreach (var source in result.SourceReferences)
            Console.WriteLine($" p.{source.PageNumber}: {Truncate(source.Excerpt, 80)}");
        Console.ResetColor();
    }

    Console.WriteLine();
}

// ──────────────────────────────────────
// Helper methods
// ──────────────────────────────────────
static bool DownloadProgress(string path, long? contentLength, long bytesRead)
{
    if (contentLength.HasValue)
        Console.Write($"\r Downloading: {(double)bytesRead / contentLength.Value * 100:F1}% ");
    return true;
}

static bool LoadProgress(float progress)
{
    Console.Write($"\r Loading: {progress * 100:F0}% ");
    return true;
}

static string Truncate(string text, int maxLength)
{
    if (string.IsNullOrEmpty(text)) return "";
    string cleaned = text.Replace("\n", " ").Replace("\r", "");
    return cleaned.Length <= maxLength ? cleaned : cleaned.Substring(0, maxLength) + "...";
}
Run it:
dotnet run -- "path/to/your/document.pdf"
Step 4: Loading Multiple Documents
PdfChat supports loading multiple PDFs. Each document is indexed separately, and all indexed documents are searched together at query time.
string[] pdfPaths = {
    "reports/annual-report-2024.pdf",
    "reports/quarterly-earnings-q4.pdf",
    "policies/employee-handbook.pdf"
};

foreach (string path in pdfPaths)
{
    if (!File.Exists(path))
    {
        Console.WriteLine($" Skipping {path} (not found)");
        continue;
    }

    Console.Write($" Indexing {Path.GetFileName(path)}...");
    var result = await pdfChat.LoadDocumentAsync(path);
    Console.WriteLine($" {result.PageCount} pages indexed.");
}

Console.WriteLine($"\nTotal documents loaded: {pdfChat.DocumentCount}");
Step 5: Configuring Retrieval Quality
Passage Count and Relevance Threshold
// More passages = more context, but slower and uses more tokens
pdfChat.MaxRetrievedPassages = 10;
// Lower threshold = more results (higher recall, lower precision)
// Higher threshold = fewer, more relevant results
pdfChat.MinRelevanceScore = 0.3f;
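As a concrete (hypothetical) illustration of the tradeoff, suppose five candidate passages score as below; raising or lowering the threshold directly changes how many of them reach the prompt:

```csharp
using System;
using System.Linq;

// Hypothetical relevance scores for five retrieved passages.
double[] passageScores = { 0.62, 0.41, 0.33, 0.28, 0.18 };

int KeptAt(double minRelevanceScore) =>
    passageScores.Count(s => s >= minRelevanceScore);

Console.WriteLine(KeptAt(0.30)); // 3 — strict threshold, higher precision
Console.WriteLine(KeptAt(0.15)); // 5 — permissive threshold, higher recall
```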
Reranking
A reranker re-scores retrieved passages using a cross-encoder for better ranking accuracy:
pdfChat.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f);
// 0.0 = only original similarity score
// 1.0 = only reranker score
// 0.7 = blend favoring reranker (recommended)
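The comments above describe a linear blend between the two scores. The exact formula is internal to RagReranker, but under that assumption the combined score behaves like this sketch (values hypothetical):

```csharp
using System;

// Assumed linear interpolation implied by the alpha comments above
// (the real RagReranker implementation may differ).
static double BlendScore(double similarity, double rerankerScore, double alpha) =>
    (1 - alpha) * similarity + alpha * rerankerScore;

// A passage the embedder liked (0.80) but the cross-encoder downranks (0.40):
Console.WriteLine($"{BlendScore(0.80, 0.40, alpha: 0.7):F2}"); // 0.52 — reranker dominates
```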
Full Document Context vs. Passage Retrieval
For small documents (under ~50 pages), you can inject the entire document into the prompt context instead of doing passage retrieval:
// Force full document context for small docs
pdfChat.PreferFullDocumentContext = true;
pdfChat.FullDocumentTokenBudget = 8000; // max tokens to allocate for document content
This gives the model complete document visibility at the cost of higher token usage. For large documents, passage retrieval is more efficient and more accurate.
Step 6: Custom System Prompt
Override the default system prompt to control answer style and behavior:
pdfChat.SystemPrompt =
    "You are a document analyst. Answer questions using only the information " +
    "found in the loaded documents. If a question cannot be answered from the " +
    "documents, say: 'This information is not in the loaded documents.' " +
    "Always cite the page number when referencing specific information.";
Step 7: Processing Scanned PDFs
For scanned PDFs (image-based, no text layer), configure vision or OCR processing:
using LMKit.Inference;
// Option A: Use a Vision Language Model for direct image understanding
pdfChat.DocumentProcessingModality = InferenceModality.Vision;
// Option B: Use OCR to extract text first, then process normally
pdfChat.OcrEngine = new TesseractOcr();
Vision mode works best when you load a VLM as the chat model (e.g., gemma3-vl:4b). OCR mode works with any text model.
Model Selection
Embedding Models
| Model ID | Size | Best For |
|---|---|---|
| embeddinggemma-300m | ~300 MB | General-purpose, fast, low memory (default) |
| nomic-embed-text | ~260 MB | High-quality text embeddings |
Chat Models
| Model ID | VRAM | Best For |
|---|---|---|
| gemma3:4b | ~3.5 GB | Good quality, fast responses |
| qwen3:4b | ~3.5 GB | Strong reasoning, multilingual |
| gemma3:12b | ~8 GB | High accuracy on complex questions |
| qwen3:8b | ~6 GB | Best balance for document analysis |
For document Q&A specifically, qwen3:8b or gemma3:12b deliver noticeably better accuracy on complex multi-hop questions (questions that require synthesizing information from multiple sections). Use gemma3:4b if VRAM is limited.
Example Session
Loading embedding model...
Loading: 100% Done.
Loading chat model...
Loading: 100% Done.
Indexing annual-report-2024.pdf...
Processing: page 48/48
Indexed 48 pages (32,541 tokens).
Ask a question about the document (or 'quit' to exit):
You: What was the company's total revenue last year?
Answer: According to the financial statements, total revenue for fiscal year 2024
was $2.47 billion, representing a 12% increase year-over-year. The growth was
primarily driven by the cloud services division, which contributed $1.1 billion.
Sources:
p.12: Total revenue for the fiscal year ended December 31, 2024 was $2,470...
p.15: Cloud services revenue grew 23% to $1.1 billion, accounting for 44.5%...
You: How does that compare to the previous year?
Answer: In fiscal year 2023, total revenue was $2.21 billion. The year-over-year
increase of $260 million (12%) was above the company's guidance of 8-10% growth.
The largest contributor was cloud services, which grew from $894 million to
$1.1 billion.
Sources:
p.12: ...compared to $2,205 million in the prior year, representing growth...
p.8: Management guidance for FY2024 projected revenue growth of 8-10%...
Notice that the second question ("How does that compare") works correctly because PdfChat maintains conversation history. The model understands "that" refers to the revenue discussed in the previous turn.
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| "No relevant passages found" | Relevance threshold too high, or document not text-searchable | Lower MinRelevanceScore to 0.15; check if PDF is scanned (use OCR) |
| Answer ignores document content | System prompt not directive enough | Use a system prompt that explicitly says "answer ONLY from the documents" |
| Slow indexing on large PDFs | Many pages being embedded sequentially | Normal for 100+ page documents. Index once; subsequent queries are fast |
| Out of memory loading two models | Combined model size exceeds VRAM | Use embeddinggemma-300m (small) + gemma3:4b (medium), or reduce GpuLayerCount |
| Garbled text from scanned PDF | PDF has no text layer | Set DocumentProcessingModality = InferenceModality.Vision or enable OCR |
| Follow-up questions lose context | Using SingleTurnConversation instead of PdfChat | PdfChat handles multi-turn automatically; do not replace it with SingleTurnConversation |
Next Steps
- Build a RAG Pipeline Over Your Own Documents: lower-level RAG control with RagEngine for text files and custom data sources.
- Load a Model and Generate Your First Response: model loading fundamentals if you haven't set up yet.
- Extract Structured Data from Unstructured Text: pull typed fields from your documents.
- Samples: Chat with PDF: full PDF chat demo application.
- Samples: Building a Custom Chatbot with RAG: custom RAG chatbot demo.