Build a RAG Pipeline Over Your Own Documents
Retrieval-Augmented Generation (RAG) grounds LLM responses in your own data. Instead of relying on the model's training data alone, RAG retrieves relevant passages from your documents and injects them into the prompt context. This sharply reduces hallucinations on domain-specific questions and keeps answers current without retraining.
This tutorial builds a working RAG system that indexes text files, persists the index to disk, and answers questions using retrieved context.
Why Local RAG Matters
Two real-world problems that on-device RAG solves:
- Data sovereignty in regulated industries. Healthcare, finance, and legal organizations cannot send proprietary documents to cloud APIs. Local RAG keeps all data on-premises while still delivering AI-powered Q&A.
- Offline knowledge bases for field workers. Technicians, inspectors, and engineers need access to manuals and procedures in environments with no internet connectivity. Local RAG runs entirely on a laptop or edge device.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16 GB recommended |
| VRAM | 6 GB (for both models simultaneously) |
| Disk | ~4 GB free for model downloads |
You will load two models: an embedding model (for indexing and search) and a chat model (for generating answers).
Step 1: Create the Project
dotnet new console -n RagQuickstart
cd RagQuickstart
dotnet add package LM-Kit.NET
Step 2: Understand the RAG Architecture
┌──────────────┐   chunk + embed   ┌────────────────┐
│  Your Docs   │ ────────────────► │   DataSource   │
│ (.txt, .md)  │                   │ (vector index) │
└──────────────┘                   └───────┬────────┘
                                           │ similarity search
┌──────────────┐    embed query            │
│  User Query  │ ─────────────────────────►│
└──────────────┘                           │
                                           ▼
                                   ┌───────────────┐
                                   │ Top-K chunks  │
                                   └───────┬───────┘
                                           │ inject into prompt
                                           ▼
                                   ┌───────────────┐
                                   │  Chat Model   │ ──► Answer
                                   └───────────────┘
Key classes:
| Class | Role |
|---|---|
| `RagEngine` | Orchestrates indexing, search, and LLM querying |
| `DataSource` | Stores chunk embeddings (in-memory or file-backed) |
| `TextChunking` | Splits text into overlapping chunks |
| `Embedder` | Generates vector embeddings |
| `SingleTurnConversation` | Generates the final answer from retrieved context |
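Before the full listing, here is how these classes connect; the calls below are a condensed preview of the exact code in Step 3 (it assumes the models, `dataSource`, and `chat` objects are already created):

```csharp
var rag = new RagEngine(embeddingModel);                    // orchestrates the pipeline
rag.AddDataSource(dataSource);                              // attach the vector index
rag.ImportText(text, "KnowledgeBase", "manual");            // chunk + embed a document
var matches = rag.FindMatchingPartitions(query, topK: 3, minScore: 0.3f); // similarity search
var result = rag.QueryPartitions(query, matches, chat);     // answer grounded in the matches
```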
Step 3: Write the Program
using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;
LMKit.Licensing.LicenseManager.SetLicenseKey(""); // set your LM-Kit license key here
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
downloadingProgress: DownloadProgress,
loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");
Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: DownloadProgress,
loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");
// ──────────────────────────────────────
// 2. Create the RAG engine with a file-backed index
// ──────────────────────────────────────
const string IndexPath = "knowledge_base.dat";
DataSource dataSource;
if (File.Exists(IndexPath))
{
Console.WriteLine("Loading existing index from disk...");
dataSource = DataSource.LoadFromFile(IndexPath, readOnly: false);
}
else
{
dataSource = DataSource.CreateFileDataSource(IndexPath, "KnowledgeBase", embeddingModel);
}
var rag = new RagEngine(embeddingModel);
rag.AddDataSource(dataSource);
// Configure chunking
rag.DefaultIChunking = new TextChunking
{
MaxChunkSize = 500, // tokens per chunk
MaxOverlapSize = 50 // overlap for context continuity
};
// ──────────────────────────────────────
// 3. Index documents (skip sections already indexed)
// ──────────────────────────────────────
string[] docs = {
"docs/product-manual.txt",
"docs/faq.txt",
"docs/troubleshooting.txt"
};
foreach (string docPath in docs)
{
string sectionName = Path.GetFileNameWithoutExtension(docPath);
if (dataSource.HasSection(sectionName))
{
Console.WriteLine($" Skipping {sectionName} (already indexed)");
continue;
}
if (!File.Exists(docPath))
{
Console.WriteLine($" Skipping {docPath} (file not found)");
continue;
}
Console.WriteLine($" Indexing {sectionName}...");
string content = File.ReadAllText(docPath);
rag.ImportText(content, "KnowledgeBase", sectionName);
}
Console.WriteLine($"\nIndex contains {dataSource.Sections.Count()} section(s).\n");
// ──────────────────────────────────────
// 4. Query loop
// ──────────────────────────────────────
var chat = new SingleTurnConversation(chatModel)
{
SystemPrompt = "Answer the question using only the provided context. " +
"If the context does not contain the answer, say so.",
MaximumCompletionTokens = 512
};
chat.AfterTextCompletion += (_, e) =>
{
if (e.SegmentType == TextSegmentType.UserVisible)
Console.Write(e.Text);
};
Console.WriteLine("Ask a question about your documents (or 'quit' to exit):\n");
while (true)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write("Question: ");
Console.ResetColor();
string? query = Console.ReadLine();
if (string.IsNullOrWhiteSpace(query) || query.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
// Retrieve top-3 most relevant chunks
var matches = rag.FindMatchingPartitions(query, topK: 3, minScore: 0.3f);
if (matches.Count == 0)
{
Console.WriteLine("No relevant passages found in the index.\n");
continue;
}
// Show which sections were matched
Console.ForegroundColor = ConsoleColor.DarkGray;
foreach (var m in matches)
Console.WriteLine($" [{m.SectionIdentifier}] score={m.Similarity:F3}");
Console.ResetColor();
// Generate answer grounded in the retrieved context
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("\nAnswer: ");
Console.ResetColor();
var result = rag.QueryPartitions(query, matches, chat);
Console.WriteLine($"\n [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
}
// ──────────────────────────────────────
// Helper callbacks
// ──────────────────────────────────────
static bool DownloadProgress(string path, long? contentLength, long bytesRead)
{
if (contentLength.HasValue)
Console.Write($"\r Downloading: {(double)bytesRead / contentLength.Value * 100:F1}% ");
return true;
}
static bool LoadProgress(float progress)
{
Console.Write($"\r Loading: {progress * 100:F0}% ");
return true;
}
Step 4: Create Sample Documents and Run
Create a docs/ folder with a few .txt files containing your content.
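For a quick test, illustrative content like the following (it matches the example session shown below) can go in docs/product-manual.txt:

```text
FACTORY RESET
To reset the device to factory settings, press and hold the power button
and volume-down button simultaneously for 10 seconds until the LED flashes
red. The device will restart and all user data will be erased.
```

Then run: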
dotnet run
Example session:
Loading embedding model...
Loading: 100% Done.
Loading chat model...
Loading: 100% Done.
Indexing product-manual...
Indexing faq...
Index contains 2 section(s).
Ask a question about your documents (or 'quit' to exit):
Question: How do I reset the device to factory settings?
[product-manual] score=0.847
[faq] score=0.612
Answer: To reset the device to factory settings, press and hold the power button
and volume-down button simultaneously for 10 seconds until the LED flashes red.
The device will restart and all user data will be erased.
[52 tokens, 38.7 tok/s]
Choosing an Embedding Model
| Model ID | Dimensions | Size | Best For |
|---|---|---|---|
| `embeddinggemma-300m` | 256 | ~300 MB | General-purpose, fast, low memory |
| `nomic-embed-text` | 768 | ~260 MB | High-quality text embeddings |
Both are downloaded automatically by `LoadFromModelID`. Use `embeddinggemma-300m` as the default starting point.
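Switching models is a one-line change to the loading code from Step 3. One caveat: vectors produced by different embedding models are not comparable, so delete the existing knowledge_base.dat and re-index after switching.

```csharp
// Load an alternative embedding model; the ID comes from the table above.
// Remember to rebuild the index: old vectors will not match the new model's.
using LM embeddingModel = LM.LoadFromModelID("nomic-embed-text",
    downloadingProgress: DownloadProgress,
    loadingProgress: LoadProgress);
```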
Tuning Retrieval Quality
Chunk Size
| Chunk Size | Effect |
|---|---|
| Small (200-300) | More precise matches, but may split important context |
| Medium (400-500) | Good default balance |
| Large (800-1000) | Better for long-form content, less precise matching |
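These sizes plug directly into the TextChunking settings from Step 3. Note that chunking is applied at import time, so a change only affects documents indexed afterwards. A sketch for long-form content (the numbers are starting points to tune, not SDK recommendations):

```csharp
// Larger chunks for long-form content, with ~10% overlap for continuity.
rag.DefaultIChunking = new TextChunking
{
    MaxChunkSize = 800,
    MaxOverlapSize = 80
};
```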
Search Parameters
var matches = rag.FindMatchingPartitions(
query,
topK: 5, // return up to 5 chunks
minScore: 0.3f, // minimum cosine similarity threshold
forceUniqueSection: true // at most one result per section
);
Lowering minScore returns more results (higher recall, lower precision). Raising it returns fewer, more relevant results.
Adding a Reranker
A reranker re-scores retrieved chunks using a cross-encoder, improving ranking quality at a small latency cost:
rag.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f);
// rerankedAlpha: 0.0 = only original score, 1.0 = only reranker score
Persistence and Incremental Updates
The DataSource.CreateFileDataSource approach persists embeddings to disk. On subsequent runs, DataSource.LoadFromFile reloads the index directly, skipping re-embedding entirely.
To add new documents later:
if (!dataSource.HasSection("new-document"))
{
string content = File.ReadAllText("docs/new-document.txt");
rag.ImportText(content, "KnowledgeBase", "new-document");
}
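The same HasSection guard generalizes from a fixed file list to a whole folder; a minimal sketch using standard .NET file enumeration:

```csharp
// Index every .txt file under docs/, skipping already-indexed sections.
foreach (string path in Directory.GetFiles("docs", "*.txt"))
{
    string section = Path.GetFileNameWithoutExtension(path);
    if (!dataSource.HasSection(section))
        rag.ImportText(File.ReadAllText(path), "KnowledgeBase", section);
}
```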
Scaling Up: PDF and Markdown Documents
For PDF files, use DocumentRag instead of RagEngine for built-in document parsing:
var docRag = new DocumentRag(embeddingModel);
var attachment = new Attachment("report.pdf");
var metadata = new DocumentMetadata(attachment, id: "q4-report");
await docRag.ImportDocumentAsync(attachment, metadata, "Reports");
var matches = docRag.FindMatchingPartitions("quarterly revenue", topK: 5);
For the most turnkey PDF Q&A experience, with chat history and source references built in, use PdfChat:
var pdfChat = new PdfChat(chatModel, embeddingModel);
await pdfChat.LoadDocumentAsync("report.pdf");
var response = await pdfChat.SubmitAsync("What were the key findings?");
Console.WriteLine(response.Response.Completion);
Custom Prompt Templates
Override how retrieved context is injected into the prompt:
string customTemplate = @"Use the following reference material to answer the user's question.
If the material does not contain the answer, state that clearly.
## Reference Material:
@context
## User Question:
@question";
var result = rag.QueryPartitions(query, customTemplate, matches, chat);
The placeholders @context and @question are replaced automatically.
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Low similarity scores | Embedding model not suited to your domain | Try nomic-embed-text or increase chunk overlap |
| Answers ignore retrieved context | System prompt too weak | Strengthen the instruction: "Answer ONLY from the provided context" |
| Index file grows large | Many large documents | Use MarkdownChunking for structured docs, or reduce MaxChunkSize |
| Slow indexing | Large corpus on CPU | Use GPU-accelerated embedding, or batch-index offline |
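For the "answers ignore retrieved context" row, tightening the system prompt is a one-line change to the SingleTurnConversation from Step 3; the exact wording below is just one option:

```csharp
chat.SystemPrompt =
    "Answer ONLY from the provided context. If the context does not " +
    "contain the answer, reply that the documents do not cover it.";
```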
Next Steps
- Create an AI Agent with Tools: build agents that can search the web and call functions.
- Extract Structured Data from Unstructured Text: pull typed fields from documents.
- Samples: Building a Custom Chatbot with RAG: full demo application.
- Samples: Chat with PDF: PDF-specific RAG demo.