What is Retrieval-Augmented Generation (RAG)?


TL;DR

Retrieval-Augmented Generation (RAG) is a technique that enhances text generation by combining a Large Language Model (LLM) with a retrieval system that fetches relevant information from external data sources. In LM-Kit.NET, the RagEngine class implements the core retrieval pipeline, while the RagChat class provides a turnkey multi-turn conversational RAG experience. The framework supports hybrid search (BM25 + vector fusion), advanced query generation strategies (Contextual, Multi-Query, HyDE), Maximal Marginal Relevance diversity filtering, context window expansion, and reranking. This makes LM-Kit.NET one of the most complete local RAG frameworks available for .NET.


Retrieval-Augmented Generation (RAG)

Definition: Retrieval-Augmented Generation (RAG) is a method in which a language model augments its response generation by retrieving relevant information from external sources. Unlike traditional LLMs, which rely solely on their pre-trained knowledge, RAG enables the model to consult and incorporate up-to-date information from documents, databases, or other data sources during the generation process.

In LM-Kit.NET, the RAG subsystem is built around three complementary classes:

  • RagEngine is the core retrieval engine. It manages DataSource repositories, performs similarity search across partitions, and supports pluggable retrieval strategies (vector, BM25, or hybrid).
  • RagChat wraps a RagEngine with an internal MultiTurnConversation, orchestrating query contextualization, retrieval, prompt construction, and grounded response generation in a single call. It supports all four QueryGenerationMode strategies, tools, skills, and agent memory.
  • DocumentRag extends RagEngine with multi-page document import (PDF, DOCX, images) and configurable processing modes (text extraction, OCR, or VLM-based document understanding).

This layered architecture lets developers choose the level of abstraction they need: low-level control with RagEngine, turnkey conversational RAG with RagChat, or document-centric workflows with DocumentRag.


The Role of RAG in LLMs

  1. Combining Retrieval with Generation: RAG enhances language models by allowing them to retrieve external information before generating text. This makes the model capable of providing up-to-date, factually correct, and contextually relevant responses, especially in cases where its pre-trained knowledge may be insufficient.

  2. Improving Accuracy and Contextual Relevance: By retrieving related content from a data source, RAG ensures that the generated responses are more grounded in real-world data. This is particularly useful for tasks that require up-to-date knowledge, such as question answering, document summarization, and chatbots that need to refer to external data.

  3. Handling Large Text Datasets: RAG is highly effective for processing large datasets by breaking them down into manageable chunks of text or images, known as partitions. The retrieval process finds the most relevant chunks in the data source, which are then used to generate accurate and context-aware responses.

  4. Leveraging Multiple Retrieval Strategies: RAG can combine different retrieval methods. Semantic search uses vector embeddings and cosine similarity to match meaning, while keyword search (BM25) matches exact terms. Hybrid search fuses both with Reciprocal Rank Fusion for comprehensive coverage. MMR then removes near-duplicate passages to maximize the diversity of context sent to the LLM.
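The semantic half of this combination rests on one simple measure: cosine similarity between embedding vectors. The following is a short, language-agnostic sketch (shown in Python for illustration; the names are not part of the LM-Kit.NET API, and real embeddings have hundreds of dimensions rather than three):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: 1.0 means identical
    direction (same meaning), values near 0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for illustration only
query = [0.2, 0.8, 0.1]
passage_a = [0.25, 0.75, 0.05]  # semantically close to the query
passage_b = [0.9, 0.05, 0.4]    # semantically unrelated
```

During vector retrieval, every partition embedding is scored against the query embedding this way, and the highest-scoring partitions are returned.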


Practical Application in LM-Kit.NET SDK

LM-Kit.NET provides a layered RAG framework that scales from simple single-turn Q&A to production-grade conversational RAG with hybrid search and advanced query processing.

  1. Core RAG Engine (RagEngine): Manages data sources, text chunking, embedding, retrieval, and context-augmented generation.

    • AddDataSource: Adds data sources backed by the built-in vector database, file-based persistence, or external stores like Qdrant.
    • ImportText / ImportTextAsync: Chunks and embeds text into a named section with configurable chunking strategies (TextChunking, MarkdownChunking, HtmlChunking).
    • FindMatchingPartitions: Searches across all data sources using the active retrieval strategy.
    • QueryPartitions: Injects matched partitions into a prompt template and generates a grounded response.
  2. Retrieval Strategies: The RetrievalStrategy property on RagEngine controls how partitions are matched:

    • VectorRetrievalStrategy (default): Semantic similarity via cosine distance on embeddings.
    • Bm25RetrievalStrategy: BM25+ lexical ranking with configurable term saturation, length normalization, proximity boosting, and language-aware stopword filtering.
    • HybridRetrievalStrategy: Combines both with weighted Reciprocal Rank Fusion, configurable via VectorWeight, KeywordWeight, and RrfK.
  3. Conversational RAG (RagChat): A turnkey multi-turn class that wraps RagEngine with an internal conversation. Supports four QueryGenerationMode options:

    • Original: Uses the user's question as-is.
    • Contextual: Rewrites follow-up questions into self-contained queries using conversation history.
    • Multi-Query: Generates multiple query variants and merges results with Reciprocal Rank Fusion.
    • HyDE (Hypothetical Document Embeddings): Generates a hypothetical answer and uses it as the retrieval query, bridging the gap between question and document phrasing.
  4. Quality Refinement:

    • Reranking: The Reranker property on RagEngine re-scores retrieved partitions with a cross-encoder for higher precision.
    • Maximal Marginal Relevance: The MmrLambda property reduces near-duplicate passages by balancing relevance against diversity.
    • Context Window Expansion: The ContextWindow property automatically includes neighboring partitions around each match, giving the LLM surrounding context for more accurate answers.
  5. Document-Centric RAG (DocumentRag): Extends RagEngine for multi-page document processing (PDF, DOCX, images) with three processing modes:

    • Auto: Automatically selects the best strategy per page.
    • TextExtraction: Traditional text extraction with optional OCR.
    • DocumentUnderstanding: Uses a VLM to parse complex layouts as Markdown.
  6. Chunking Strategies: LM-Kit.NET ships three chunking strategies, configurable per import or as a default on RagEngine:

    • TextChunking: Paragraph and sentence-aware splitting with configurable overlap.
    • MarkdownChunking: Heading-aware splitting that preserves code fences and document structure.
    • HtmlChunking: DOM-aware splitting with boilerplate removal and heading breadcrumbs.
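The MMR diversity filtering mentioned under Quality Refinement can be sketched in a few lines. This is a conceptual Python illustration of the standard greedy MMR algorithm, not LM-Kit.NET's internal implementation; all names here are hypothetical:

```python
def mmr_select(candidates, relevance, similarity, lam=0.5, top_n=3):
    """Greedy Maximal Marginal Relevance selection.

    candidates: list of item IDs
    relevance:  dict of id -> relevance score against the query
    similarity: dict of (id, id) -> pairwise similarity between items
    lam:        trade-off; 1.0 = pure relevance, 0.0 = pure diversity
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < top_n:
        def mmr_score(c):
            # Penalize items similar to anything already selected
            redundancy = max((similarity[(c, s)] for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

rel = {"a": 0.9, "b": 0.85, "c": 0.5}
sim = {("b", "a"): 0.95, ("c", "a"): 0.1, ("b", "c"): 0.1}
order = mmr_select(["a", "b", "c"], rel, sim, lam=0.5, top_n=3)
# "b" is nearly a duplicate of "a", so the more diverse "c" is picked second
```

Lowering the lambda value pushes selection toward diversity, which is useful when a data source contains many near-identical passages.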

Code Examples

Single-Turn RAG (RagEngine)

using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;

using LM chatModel = LM.LoadFromModelID("gemma3:4b");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");

// Create a RAG engine with a file-backed data source
var dataSource = DataSource.CreateFileDataSource("index.dat", "KB", embeddingModel);
var rag = new RagEngine(embeddingModel);
rag.AddDataSource(dataSource);

// Import and chunk a document
rag.ImportText(File.ReadAllText("docs/manual.txt"), "KB", "manual");

// Retrieve and generate
var matches = rag.FindMatchingPartitions("How do I reset the device?", topK: 3, minScore: 0.3f);
var chat = new SingleTurnConversation(chatModel);
var result = rag.QueryPartitions("How do I reset the device?", matches, chat);

Conversational RAG (RagChat)

using LMKit.Model;
using LMKit.Retrieval;

using LM chatModel = LM.LoadFromModelID("qwen3:8b");
using LM embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");

// Create a RagChat instance (multi-turn, with query contextualization)
var rag = new RagEngine(embeddingModel);
var ragChat = new RagChat(chatModel, rag)
{
    QueryGenerationMode = QueryGenerationMode.Contextual,
    MaxRetrievedPartitions = 5
};

// Submit questions with automatic context tracking across turns
var result = await ragChat.SubmitAsync("What products does NovaPulse offer?");
Console.WriteLine(result.TextGenerationResult.Completion);

// Follow-up: "Contextual" mode rewrites this into a self-contained query
var followUp = await ragChat.SubmitAsync("What about pricing?");

// Enable BM25 + vector fusion with weighted Reciprocal Rank Fusion for later turns
rag.RetrievalStrategy = new HybridRetrievalStrategy
{
    VectorWeight = 0.6f,
    KeywordWeight = 0.4f
};
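To see how the two weights above interact, here is a sketch of weighted Reciprocal Rank Fusion in Python. This shows one common formulation of weighted RRF and may differ in detail from LM-Kit.NET's internal implementation; the function name is illustrative:

```python
def weighted_rrf(vector_ranking, keyword_ranking,
                 vector_weight=0.6, keyword_weight=0.4, rrf_k=60):
    """Fuse a semantic and a lexical ranking with per-list weights.

    Each document's score is the weighted sum of 1 / (rrf_k + rank)
    over the lists it appears in; rrf_k dampens top-rank dominance.
    """
    scores = {}
    for weight, ranking in ((vector_weight, vector_ranking),
                            (keyword_weight, keyword_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = weighted_rrf(["p1", "p2"], ["p2", "p3"])
# "p2" ranks first because it appears in both rankings
```

Documents found by both search methods accumulate score from both lists, which is why hybrid search tends to surface results that are strong both semantically and lexically.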

Key Classes and Concepts in LM-Kit.NET RAG

  • RagEngine: The core retrieval-augmented generation engine. Manages data sources, chunking, embedding, retrieval (vector, BM25, or hybrid), reranking, MMR diversity filtering, and context-augmented LLM generation.

  • RagChat: A turnkey multi-turn conversational RAG class. Wraps RagEngine with an internal MultiTurnConversation and supports four query generation modes (Original, Contextual, Multi-Query, HyDE). Returns RagQueryResult with both the generated answer and the retrieved partitions.

  • DocumentRag: Extends RagEngine for document-centric workflows. Imports multi-page PDFs, DOCX, and images with configurable processing modes (text extraction, OCR, VLM understanding).

  • DataSource: Stores chunk embeddings. Supports three storage modes: in-memory, file-backed (built-in vector database), and external vector stores (e.g., Qdrant).

  • TextChunking / MarkdownChunking / HtmlChunking: Three chunking strategies implementing the IChunking interface. Each optimizes splitting for its content type.

  • PartitionSimilarity: Represents a retrieval result with the matched partition, similarity score, and optional reranked score.

  • IRetrievalStrategy: Interface for pluggable retrieval strategies (VectorRetrievalStrategy, Bm25RetrievalStrategy, HybridRetrievalStrategy).

  • RagReranker: Cross-encoder reranker that plugs into RagEngine via the Reranker property for improved retrieval precision.
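The chunking strategies listed above all build on the same basic idea: splitting with overlap so that content straddling a boundary survives intact in at least one chunk. Here is a deliberately simplified Python sketch of fixed-size splitting; LM-Kit.NET's actual strategies are additionally sentence-, heading-, or DOM-aware:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks that overlap by `overlap`
    characters, so a sentence cut at one boundary still appears
    whole at the start of the next chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Overlap trades a little index size for retrieval robustness: without it, a fact split across two chunks might match neither chunk well enough to be retrieved.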


Key Terms

  • Retrieval-Augmented Generation (RAG): A technique that combines retrieval of external information with text generation, improving the accuracy and relevance of the generated output by using real-world data.

  • Text Chunking: The process of breaking large texts into smaller segments (chunks or partitions) to make them easier to retrieve and process during RAG. See Optimize RAG with Custom Chunking.

  • Hybrid Search: Combining semantic vector search with BM25 keyword search and fusing results with Reciprocal Rank Fusion for comprehensive retrieval.

  • Query Contextualization: Rewriting follow-up questions into self-contained queries using conversation history, so retrieval stays accurate across turns.

  • Multi-Query Retrieval: Generating multiple query variants from a single question and merging results with Reciprocal Rank Fusion for improved recall.

  • HyDE: Hypothetical Document Embeddings. Generating a hypothetical answer and using it as the retrieval query to bridge the gap between question and document phrasing.

  • Maximal Marginal Relevance (MMR): A diversity filtering technique that reduces near-duplicate passages in retrieval results.

  • Embedding: A vector representation of text in a high-dimensional space. Embeddings are used during RAG to measure the similarity between text partitions and the query.

  • Reranking: Re-scoring retrieved passages with a cross-encoder model for higher precision ranking.
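The BM25 keyword scoring referenced throughout these terms can be sketched for a single term as follows. This is the classic BM25 formula in illustrative Python; LM-Kit.NET's Bm25RetrievalStrategy uses the BM25+ variant with additional features (proximity boosting, stopword filtering) not shown here:

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Score one term in one document with classic BM25.

    tf: term frequency in the document; df: number of documents
    containing the term. k1 controls term-frequency saturation,
    b controls document-length normalization.
    """
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * norm)
```

The k1 parameter gives BM25 its characteristic saturation: the second occurrence of a term adds much more score than the tenth, so documents cannot win by mere keyword stuffing.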





External Resources

  • RAG Original Paper (Lewis et al., 2020): Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  • Self-RAG (Asai et al., 2023): Learning to retrieve, generate, and critique
  • RAPTOR (Sarthi et al., 2024): Recursive abstractive processing for tree-organized retrieval

Summary

Retrieval-Augmented Generation (RAG) is a technique that improves the output of Large Language Models (LLMs) by incorporating external information retrieved from data sources. LM-Kit.NET provides a comprehensive RAG framework: RagEngine for core retrieval and generation, RagChat for turnkey multi-turn conversational RAG with four query generation strategies, and DocumentRag for document-centric workflows. The framework supports hybrid search (BM25 + vector fusion), MMR diversity filtering, context window expansion, reranking, and three chunking strategies (TextChunking, MarkdownChunking, HtmlChunking). This makes LM-Kit.NET a production-ready platform for building RAG systems that run entirely on-device, keeping data private and eliminating cloud API dependencies.
