Build a RAG Pipeline Over Your Own Documents

Retrieval-Augmented Generation (RAG) grounds LLM responses in your own data. Instead of relying on the model's training data alone, RAG retrieves relevant passages from your documents and injects them into the prompt context. This reduces hallucinations on domain-specific questions and keeps answers current without retraining.

This tutorial builds a working RAG system that indexes text files, persists the index to disk, and answers questions using retrieved context.


Why Local RAG Matters

Two real-world problems that on-device RAG solves:

  1. Data sovereignty in regulated industries. Healthcare, finance, and legal organizations cannot send proprietary documents to cloud APIs. Local RAG keeps all data on-premises while still delivering AI-powered Q&A.
  2. Offline knowledge bases for field workers. Technicians, inspectors, and engineers need access to manuals and procedures in environments with no internet connectivity. Local RAG runs entirely on a laptop or edge device.

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
RAM           16 GB recommended
VRAM          6 GB (for both models simultaneously)
Disk          ~4 GB free for model downloads

You will load two models: an embedding model (for indexing and search) and a chat model (for generating answers).


Step 1: Create the Project

dotnet new console -n RagQuickstart
cd RagQuickstart
dotnet add package LM-Kit.NET

Step 2: Understand the RAG Architecture

┌──────────────┐    chunk + embed    ┌────────────────┐
│  Your Docs   │ ──────────────────► │  DataSource    │
│  (.txt, .md) │                     │  (vector index)│
└──────────────┘                     └───────┬────────┘
                                             │ similarity search
┌──────────────┐    embed query              │
│  User Query  │ ───────────────────────────►│
└──────────────┘                             │
                                             ▼
                                     ┌───────────────┐
                                     │  Top-K chunks │
                                     └───────┬───────┘
                                             │ inject into prompt
                                             ▼
                                     ┌───────────────┐
                                     │  Chat Model   │ ──► Answer
                                     └───────────────┘

Key classes:

Class                    Role
RagEngine                Orchestrates indexing, search, and LLM querying
DataSource               Stores chunk embeddings (in-memory or file-backed)
TextChunking             Splits text into overlapping chunks
Embedder                 Generates vector embeddings
SingleTurnConversation   Generates the final answer from retrieved context

Step 3: Write the Program

using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
    downloadingProgress: DownloadProgress,
    loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");

Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: DownloadProgress,
    loadingProgress: LoadProgress);
Console.WriteLine(" Done.\n");

// ──────────────────────────────────────
// 2. Create the RAG engine with a file-backed index
// ──────────────────────────────────────
const string IndexPath = "knowledge_base.dat";

DataSource dataSource;
if (File.Exists(IndexPath))
{
    Console.WriteLine("Loading existing index from disk...");
    dataSource = DataSource.LoadFromFile(IndexPath, readOnly: false);
}
else
{
    dataSource = DataSource.CreateFileDataSource(IndexPath, "KnowledgeBase", embeddingModel);
}

var rag = new RagEngine(embeddingModel);
rag.AddDataSource(dataSource);

// Configure chunking
rag.DefaultIChunking = new TextChunking
{
    MaxChunkSize = 500,    // tokens per chunk
    MaxOverlapSize = 50    // overlap for context continuity
};

// ──────────────────────────────────────
// 3. Index documents (skip sections already indexed)
// ──────────────────────────────────────
string[] docs = {
    "docs/product-manual.txt",
    "docs/faq.txt",
    "docs/troubleshooting.txt"
};

foreach (string docPath in docs)
{
    string sectionName = Path.GetFileNameWithoutExtension(docPath);

    if (dataSource.HasSection(sectionName))
    {
        Console.WriteLine($"  Skipping {sectionName} (already indexed)");
        continue;
    }

    if (!File.Exists(docPath))
    {
        Console.WriteLine($"  Skipping {docPath} (file not found)");
        continue;
    }

    Console.WriteLine($"  Indexing {sectionName}...");
    string content = File.ReadAllText(docPath);
    rag.ImportText(content, "KnowledgeBase", sectionName);
}

Console.WriteLine($"\nIndex contains {dataSource.Sections.Count()} section(s).\n");

// ──────────────────────────────────────
// 4. Query loop
// ──────────────────────────────────────
var chat = new SingleTurnConversation(chatModel)
{
    SystemPrompt = "Answer the question using only the provided context. " +
                   "If the context does not contain the answer, say so.",
    MaximumCompletionTokens = 512
};

chat.AfterTextCompletion += (_, e) =>
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
};

Console.WriteLine("Ask a question about your documents (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("Question: ");
    Console.ResetColor();

    string? query = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(query) || query.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    // Retrieve top-3 most relevant chunks
    var matches = rag.FindMatchingPartitions(query, topK: 3, minScore: 0.3f);

    if (matches.Count == 0)
    {
        Console.WriteLine("No relevant passages found in the index.\n");
        continue;
    }

    // Show which sections were matched
    Console.ForegroundColor = ConsoleColor.DarkGray;
    foreach (var m in matches)
        Console.WriteLine($"  [{m.SectionIdentifier}] score={m.Similarity:F3}");
    Console.ResetColor();

    // Generate answer grounded in the retrieved context
    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("\nAnswer: ");
    Console.ResetColor();

    var result = rag.QueryPartitions(query, matches, chat);
    Console.WriteLine($"\n  [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
}

// ──────────────────────────────────────
// Helper callbacks
// ──────────────────────────────────────
static bool DownloadProgress(string path, long? contentLength, long bytesRead)
{
    if (contentLength.HasValue)
        Console.Write($"\r  Downloading: {(double)bytesRead / contentLength.Value * 100:F1}%   ");
    return true;
}

static bool LoadProgress(float progress)
{
    Console.Write($"\r  Loading: {progress * 100:F0}%   ");
    return true;
}

Step 4: Create Sample Documents and Run

Create a docs/ folder with a few .txt files containing your content, then:

dotnet run

Example session:

Loading embedding model...
  Loading: 100%    Done.

Loading chat model...
  Loading: 100%    Done.

  Indexing product-manual...
  Indexing faq...

Index contains 2 section(s).

Ask a question about your documents (or 'quit' to exit):

Question: How do I reset the device to factory settings?
  [product-manual] score=0.847
  [faq] score=0.612

Answer: To reset the device to factory settings, press and hold the power button
and volume-down button simultaneously for 10 seconds until the LED flashes red.
The device will restart and all user data will be erased.
  [52 tokens, 38.7 tok/s]

Choosing an Embedding Model

Model ID              Dimensions   Size      Best For
embeddinggemma-300m   256          ~300 MB   General-purpose, fast, low memory
nomic-embed-text      768          ~260 MB   High-quality text embeddings

Both are downloaded automatically by LoadFromModelID. Start with embeddinggemma-300m; it is small, fast, and a reasonable general-purpose default.
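
If you switch embedding models, delete and rebuild the index: vectors produced by different models live in different embedding spaces and are not comparable. A minimal sketch of the swap, reusing the progress callbacks from Step 3:

// Only the model ID changes; the rest of the pipeline is untouched.
// Remember to delete knowledge_base.dat so the index is re-embedded.
using LM embeddingModel = LM.LoadFromModelID("nomic-embed-text",
    downloadingProgress: DownloadProgress,
    loadingProgress: LoadProgress);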


Tuning Retrieval Quality

Chunk Size

Chunk Size (tokens)   Effect
Small (200-300)       More precise matches, but may split important context
Medium (400-500)      Good default balance
Large (800-1000)      Better for long-form content, less precise matching
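
These sizes map directly onto the TextChunking settings from Step 3. A sketch for long-form content; the exact numbers are illustrative starting points, not tuned values:

rag.DefaultIChunking = new TextChunking
{
    MaxChunkSize = 800,     // larger chunks keep long passages intact
    MaxOverlapSize = 100    // scale the overlap up with the chunk size
};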

Search Parameters

var matches = rag.FindMatchingPartitions(
    query,
    topK: 5,                    // return up to 5 chunks
    minScore: 0.3f,             // minimum cosine similarity threshold
    forceUniqueSection: true    // at most one result per section
);

Lowering minScore returns more results (higher recall, lower precision). Raising it returns fewer, more relevant results.
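As a sketch of the trade-off (both thresholds are illustrative and corpus-dependent):

// High precision: accept only strong matches
var strict = rag.FindMatchingPartitions(query, topK: 3, minScore: 0.5f);

// High recall: cast a wider net, e.g. ahead of a reranker
var broad = rag.FindMatchingPartitions(query, topK: 10, minScore: 0.15f);
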

Adding a Reranker

A reranker re-scores retrieved chunks using a cross-encoder, improving ranking quality at a small latency cost:

rag.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f);
// rerankedAlpha: 0.0 = only original score, 1.0 = only reranker score

Persistence and Incremental Updates

The DataSource.CreateFileDataSource approach persists embeddings to disk. On subsequent runs, DataSource.LoadFromFile reloads the existing index without re-embedding any content.

To add new documents later:

if (!dataSource.HasSection("new-document"))
{
    string content = File.ReadAllText("docs/new-document.txt");
    rag.ImportText(content, "KnowledgeBase", "new-document");
}
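
The same guard scales to an entire folder. A sketch that picks up any new .txt files dropped into docs/ between runs (the folder layout carries over from Step 3):

// Index only files whose section is not already in the store
foreach (string path in Directory.GetFiles("docs", "*.txt"))
{
    string section = Path.GetFileNameWithoutExtension(path);
    if (!dataSource.HasSection(section))
        rag.ImportText(File.ReadAllText(path), "KnowledgeBase", section);
}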

Scaling Up: PDF and Markdown Documents

For PDF files, use DocumentRag instead of RagEngine for built-in document parsing:

var docRag = new DocumentRag(embeddingModel);

var attachment = new Attachment("report.pdf");
var metadata = new DocumentMetadata(attachment, id: "q4-report");
await docRag.ImportDocumentAsync(attachment, metadata, "Reports");

var matches = docRag.FindMatchingPartitions("quarterly revenue", topK: 5);

For the highest-level PDF Q&A experience (with chat history and source references), use PdfChat:

var pdfChat = new PdfChat(chatModel, embeddingModel);
await pdfChat.LoadDocumentAsync("report.pdf");
var response = await pdfChat.SubmitAsync("What were the key findings?");
Console.WriteLine(response.Response.Completion);

Custom Prompt Templates

Override how retrieved context is injected into the prompt:

string customTemplate = @"Use the following reference material to answer the user's question.
If the material does not contain the answer, state that clearly.

## Reference Material:
@context

## User Question:
@question";

var result = rag.QueryPartitions(query, customTemplate, matches, chat);

The placeholders @context and @question are replaced automatically.


Common Issues

Problem                            Cause                                        Fix
Low similarity scores              Embedding model not suited to your domain    Try nomic-embed-text or increase chunk overlap
Answers ignore retrieved context   System prompt too weak                       Strengthen the instruction: "Answer ONLY from the provided context"
Index file grows large             Many large documents                         Use MarkdownChunking for structured docs, or reduce MaxChunkSize
Slow indexing                      Large corpus on CPU                          Use GPU-accelerated embedding, or batch-index offline
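
If your sources are Markdown, the MarkdownChunking mentioned above plugs in the same way TextChunking does. A minimal sketch, assuming a parameterless constructor; check the API reference for its actual options:

// Split on Markdown structure instead of raw token windows
// (default construction is an assumption, not verified here)
rag.DefaultIChunking = new MarkdownChunking();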

Next Steps