What is Retrieval-Augmented Generation (RAG)?
TL;DR
Retrieval-Augmented Generation (RAG) is a technique that enhances text generation by combining a Large Language Model (LLM) with a retrieval system that fetches relevant information from external data sources. In LM-Kit.NET, the RagEngine class implements the core retrieval pipeline, while the RagChat class provides a turnkey multi-turn conversational RAG experience. The framework supports hybrid search (BM25 + vector fusion), advanced query generation strategies (Contextual, Multi-Query, HyDE), Maximal Marginal Relevance diversity filtering, context window expansion, and reranking. This makes LM-Kit.NET one of the most complete local RAG frameworks available for .NET.
Retrieval-Augmented Generation (RAG)
Definition: Retrieval-Augmented Generation (RAG) is a method in which a language model augments its response generation by retrieving relevant information from external sources. Unlike traditional LLMs, which rely solely on their pre-trained knowledge, RAG enables the model to consult and incorporate up-to-date information from documents, databases, or other data sources during the generation process.
In LM-Kit.NET, the RAG subsystem is built around three complementary classes:
- RagEngine is the core retrieval engine. It manages DataSource repositories, performs similarity search across partitions, and supports pluggable retrieval strategies (vector, BM25, or hybrid).
- RagChat wraps a RagEngine with an internal MultiTurnConversation, orchestrating query contextualization, retrieval, prompt construction, and grounded response generation in a single call. It supports all four QueryGenerationMode strategies, tools, skills, and agent memory.
- DocumentRag extends RagEngine with multi-page document import (PDF, DOCX, images) and configurable processing modes (text extraction, OCR, or VLM-based document understanding).
This layered architecture lets developers choose the level of abstraction they need: low-level control with RagEngine, turnkey conversational RAG with RagChat, or document-centric workflows with DocumentRag.
The Role of RAG in LLMs
Combining Retrieval with Generation: RAG enhances language models by allowing them to retrieve external information before generating text. This makes the model capable of providing up-to-date, factually correct, and contextually relevant responses, especially in cases where its pre-trained knowledge may be insufficient.
Improving Accuracy and Contextual Relevance: By retrieving related content from a data source, RAG ensures that the generated responses are more grounded in real-world data. This is particularly useful for tasks that require up-to-date knowledge, such as question answering, document summarization, and chatbots that need to refer to external data.
Handling Large Text Datasets: RAG is highly effective for processing large datasets by breaking them down into manageable chunks of text or images, known as Partitions. The retrieval process finds the most relevant chunks in the data source, which are then used to generate accurate and context-aware responses.
Leveraging Multiple Retrieval Strategies: RAG can combine different retrieval methods. Semantic search uses vector embeddings and cosine similarity to match meaning, while keyword search (BM25) matches exact terms. Hybrid search fuses both with Reciprocal Rank Fusion for comprehensive coverage. MMR then removes near-duplicate passages to maximize the diversity of context sent to the LLM.
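The fusion step can be written out explicitly. In the standard Reciprocal Rank Fusion formulation, each document's fused score is a weighted sum of reciprocal ranks across the contributing rankings, where the constant k (commonly 60) damps the dominance of top-ranked items. This is the textbook form; how exactly LM-Kit.NET applies VectorWeight, KeywordWeight, and RrfK may differ in detail:

```latex
\mathrm{score}(d) \;=\; \sum_{r \,\in\, \{\text{vector},\,\text{keyword}\}} \frac{w_r}{k + \mathrm{rank}_r(d)}
```

A document ranked highly by both retrievers accumulates score from both terms, which is why hybrid search tends to surface results that either method alone would miss or under-rank.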
Practical Application in LM-Kit.NET SDK
LM-Kit.NET provides a layered RAG framework that scales from simple single-turn Q&A to production-grade conversational RAG with hybrid search and advanced query processing.
Core RAG Engine (RagEngine): Manages data sources, text chunking, embedding, retrieval, and context-augmented generation.
- AddDataSource: Adds data sources backed by the built-in vector database, file-based persistence, or external stores like Qdrant.
- ImportText / ImportTextAsync: Chunks and embeds text into a named section with configurable chunking strategies (TextChunking, MarkdownChunking, HtmlChunking).
- FindMatchingPartitions: Searches across all data sources using the active retrieval strategy.
- QueryPartitions: Injects matched partitions into a prompt template and generates a grounded response.
Retrieval Strategies: The RetrievalStrategy property on RagEngine controls how partitions are matched:
- VectorRetrievalStrategy (default): Semantic similarity via cosine distance on embeddings.
- Bm25RetrievalStrategy: BM25+ lexical ranking with configurable term saturation, length normalization, proximity boosting, and language-aware stopword filtering.
- HybridRetrievalStrategy: Combines both with weighted Reciprocal Rank Fusion, configurable via VectorWeight, KeywordWeight, and RrfK.
Conversational RAG (RagChat): A turnkey multi-turn class that wraps RagEngine with an internal conversation. Supports four QueryGenerationMode options:
- Original: Uses the user's question as-is.
- Contextual: Rewrites follow-up questions into self-contained queries using conversation history.
- Multi-Query: Generates multiple query variants and merges results with Reciprocal Rank Fusion.
- HyDE (Hypothetical Document Embeddings): Generates a hypothetical answer and uses it as the retrieval query, bridging the gap between question and document phrasing.
Quality Refinement:
- Reranking: The Reranker property on RagEngine re-scores retrieved partitions with a cross-encoder for higher precision.
- Maximal Marginal Relevance: The MmrLambda property reduces near-duplicate passages by balancing relevance against diversity.
- Context Window Expansion: The ContextWindow property automatically includes neighboring partitions around each match, giving the LLM surrounding context for more accurate answers.
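All three refinement knobs above are surfaced as properties on RagEngine. The sketch below is illustrative only: the property names (Reranker, MmrLambda, ContextWindow) come from this article, but the RagReranker constructor arguments and the specific values are assumptions, not verified API signatures:

```csharp
// Illustrative sketch - property names per this article; values and the
// RagReranker constructor shape are assumptions, not a verified API.
var rag = new RagEngine(embeddingModel);

// Cross-encoder reranking for higher-precision ordering of matches.
rag.Reranker = new RagReranker(rerankerModel);

// MMR trade-off between relevance and diversity (assumed 0..1 range,
// where higher values favor relevance over diversity).
rag.MmrLambda = 0.7f;

// Context window expansion: also include neighboring partitions around
// each match (assumed to be a neighbor count).
rag.ContextWindow = 1;
```

Reranking and MMR act on the candidate set after retrieval, while context window expansion changes what is sent to the LLM, so the three can be combined independently.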
Document-Centric RAG (DocumentRag): Extends RagEngine for multi-page document processing (PDF, DOCX, images) with three processing modes:
- Auto: Automatically selects the best strategy per page.
- TextExtraction: Traditional text extraction with optional OCR.
- DocumentUnderstanding: Uses a VLM to parse complex layouts as Markdown.
Chunking Strategies: LM-Kit.NET ships three chunking strategies, configurable per import or as a default on RagEngine:
- TextChunking: Paragraph- and sentence-aware splitting with configurable overlap.
- MarkdownChunking: Heading-aware splitting that preserves code fences and document structure.
- HtmlChunking: DOM-aware splitting with boilerplate removal and heading breadcrumbs.
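As a sketch of per-import configuration: the article states that chunking is configurable per import, so the example below assumes an ImportText overload (or equivalent) that accepts a chunking strategy instance; the exact overload shape is an assumption, not a verified signature:

```csharp
// Illustrative sketch - assumes an ImportText overload that accepts
// a chunking strategy; verify against the actual RagEngine API.
var markdown = File.ReadAllText("docs/guide.md");

// MarkdownChunking splits on headings and preserves code fences,
// so each partition stays a coherent section of the guide.
rag.ImportText(markdown, "KB", "guide", new MarkdownChunking());
```

Matching the chunker to the content type (Markdown here) keeps partitions semantically coherent, which directly improves retrieval quality downstream.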
Code Examples
Single-Turn RAG (RagEngine)
using System.IO;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;
using LM chatModel = LM.LoadFromModelID("gemma3:4b");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
// Create a RAG engine with a file-backed data source
var dataSource = DataSource.CreateFileDataSource("index.dat", "KB", embeddingModel);
var rag = new RagEngine(embeddingModel);
rag.AddDataSource(dataSource);
// Import and chunk a document
rag.ImportText(File.ReadAllText("docs/manual.txt"), "KB", "manual");
// Retrieve and generate
var matches = rag.FindMatchingPartitions("How do I reset the device?", topK: 3, minScore: 0.3f);
var chat = new SingleTurnConversation(chatModel);
var result = rag.QueryPartitions("How do I reset the device?", matches, chat);
Conversational RAG (RagChat)
using System;
using LMKit.Model;
using LMKit.Retrieval;
using LM chatModel = LM.LoadFromModelID("qwen3:8b");
using LM embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");
// Create a RagChat instance (multi-turn, with query contextualization)
var rag = new RagEngine(embeddingModel);
var ragChat = new RagChat(chatModel, rag)
{
QueryGenerationMode = QueryGenerationMode.Contextual,
MaxRetrievedPartitions = 5
};
// Submit questions with automatic context tracking across turns
var result = await ragChat.SubmitAsync("What products does NovaPulse offer?");
Console.WriteLine(result.TextGenerationResult.Completion);
// Follow-up: "Contextual" mode rewrites this into a self-contained query
var followUp = await ragChat.SubmitAsync("What about pricing?");
Console.WriteLine(followUp.TextGenerationResult.Completion);
Hybrid Search
// Enable BM25 + vector fusion with weighted Reciprocal Rank Fusion
rag.RetrievalStrategy = new HybridRetrievalStrategy
{
VectorWeight = 0.6f,
KeywordWeight = 0.4f
};
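To make the fusion concrete, here is a minimal, self-contained C# illustration of weighted Reciprocal Rank Fusion, independent of the LM-Kit.NET API. The document IDs, weights, and k value are invented for the example; this is a sketch of the general technique, not LM-Kit.NET's internal implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Weighted Reciprocal Rank Fusion: each document's fused score is the
// weighted sum of 1 / (k + rank) over the ranked lists that contain it.
static class RrfDemo
{
    public static List<string> Fuse(
        IList<string> vectorRanking, IList<string> keywordRanking,
        double vectorWeight = 0.6, double keywordWeight = 0.4, int k = 60)
    {
        var scores = new Dictionary<string, double>();
        void Accumulate(IList<string> ranking, double weight)
        {
            for (int rank = 0; rank < ranking.Count; rank++)
            {
                scores.TryGetValue(ranking[rank], out double s);
                // rank is 0-based, so the top result contributes weight / (k + 1).
                scores[ranking[rank]] = s + weight / (k + rank + 1);
            }
        }
        Accumulate(vectorRanking, vectorWeight);
        Accumulate(keywordRanking, keywordWeight);
        return scores.OrderByDescending(p => p.Value).Select(p => p.Key).ToList();
    }

    public static void Main()
    {
        var fused = Fuse(
            vectorRanking: new[] { "A", "B", "C" },
            keywordRanking: new[] { "B", "D", "A" });
        // "B" and "A" outrank "C" and "D" because they appear in both lists.
        Console.WriteLine(string.Join(", ", fused));
    }
}
```

Note how documents found by both retrievers ("A" and "B") accumulate score from both lists and rise above documents found by only one, which is the behavior the VectorWeight/KeywordWeight settings above tune.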
Key Classes and Concepts in LM-Kit.NET RAG
- RagEngine: The core retrieval-augmented generation engine. Manages data sources, chunking, embedding, retrieval (vector, BM25, or hybrid), reranking, MMR diversity filtering, and context-augmented LLM generation.
- RagChat: A turnkey multi-turn conversational RAG class. Wraps RagEngine with an internal MultiTurnConversation and supports four query generation modes (Original, Contextual, Multi-Query, HyDE). Returns RagQueryResult with both the generated answer and the retrieved partitions.
- DocumentRag: Extends RagEngine for document-centric workflows. Imports multi-page PDFs, DOCX, and images with configurable processing modes (text extraction, OCR, VLM understanding).
- DataSource: Stores chunk embeddings. Supports three storage modes: in-memory, file-backed (built-in vector database), and external vector stores (e.g., Qdrant).
- TextChunking / MarkdownChunking / HtmlChunking: Three chunking strategies implementing the IChunking interface. Each optimizes splitting for its content type.
- PartitionSimilarity: Represents a retrieval result with the matched partition, similarity score, and optional reranked score.
- IRetrievalStrategy: Interface for pluggable retrieval strategies (VectorRetrievalStrategy, Bm25RetrievalStrategy, HybridRetrievalStrategy).
- RagReranker: Cross-encoder reranker that plugs into RagEngine via the Reranker property for improved retrieval precision.
Key Terms
Retrieval-Augmented Generation (RAG): A technique that combines retrieval of external information with text generation, improving the accuracy and relevance of the generated output by using real-world data.
Text Chunking: The process of breaking large texts into smaller segments (chunks or partitions) to make them easier to retrieve and process during RAG. See Optimize RAG with Custom Chunking.
Hybrid Search: Combining semantic vector search with BM25 keyword search and fusing results with Reciprocal Rank Fusion for comprehensive retrieval.
Query Contextualization: Rewriting follow-up questions into self-contained queries using conversation history, so retrieval stays accurate across turns.
Multi-Query Retrieval: Generating multiple query variants from a single question and merging results with Reciprocal Rank Fusion for improved recall.
HyDE: Hypothetical Document Embeddings. Generating a hypothetical answer and using it as the retrieval query to bridge the gap between question and document phrasing.
Maximal Marginal Relevance (MMR): A diversity filtering technique that reduces near-duplicate passages in retrieval results.
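In the standard formulation (Carbonell and Goldstein), MMR greedily selects the next passage from the remaining candidates R∖S by balancing relevance to the query q against similarity to the already-selected set S, with λ controlling the trade-off; how MmrLambda maps onto λ in LM-Kit.NET is an assumption:

```latex
\mathrm{MMR} \;=\; \arg\max_{d_i \,\in\, R \setminus S} \Bigl[\, \lambda \,\mathrm{sim}(d_i, q) \;-\; (1-\lambda) \max_{d_j \in S} \mathrm{sim}(d_i, d_j) \Bigr]
```

With λ = 1 this reduces to plain relevance ranking; lower values increasingly penalize passages that duplicate content already selected.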
Embedding: A vector representation of text in a high-dimensional space. Embeddings are used during RAG to measure the similarity between text partitions and the query.
Reranking: Re-scoring retrieved passages with a cross-encoder model for higher precision ranking.
Related API Documentation
- RagEngine: Core retrieval-augmented generation engine
- RagChat: Turnkey multi-turn conversational RAG
- DocumentRag: Document-centric RAG with multi-page processing
- DataSource: Repository for content partitions
- TextChunking: Recursive text partitioning
- MarkdownChunking: Heading-aware Markdown splitting
- HtmlChunking: DOM-aware HTML splitting
- Partition: Individual text or image chunks
- Embedder: Generate embeddings for similarity search
Related Glossary Topics
- Agentic RAG: Agent-driven retrieval with iterative refinement and tool use
- Hybrid Search: BM25 + vector fusion for comprehensive retrieval
- Reciprocal Rank Fusion: Merging ranked lists from different retrieval methods
- Maximal Marginal Relevance (MMR): Diversity filtering for retrieval results
- Query Contextualization: Rewriting follow-up questions for multi-turn RAG
- Multi-Query Retrieval: Generating query variants for improved recall
- HyDE: Hypothetical Document Embeddings for better retrieval alignment
- Embeddings: Vector representations for similarity search
- Reranking: Cross-encoder re-scoring for retrieval precision
- Chunking: Splitting documents into retrievable partitions
- Vector Database: Persistent storage for embeddings
- AI Agent Memory: RAG-like patterns for agent context
- Semantic Similarity: Measuring how close two pieces of text are in meaning
- LLM: Large Language Models that RAG augments with external knowledge
Related Guides and Demos
- Build a RAG Pipeline Over Your Own Documents: Step-by-step tutorial with indexing, search, and generation
- Improve RAG Results with Reranking: Cross-encoder reranking for precision
- Optimize RAG with Custom Chunking Strategies: TextChunking, MarkdownChunking, HtmlChunking
- Build a Unified Multimodal RAG System: Audio, images, and text in one knowledge base
- Build Semantic Search with Embeddings: Embedding fundamentals and similarity search
- Chat with PDF Documents: PdfChat with conversational Q&A
- Conversational RAG Demo: Multi-turn RAG with RagChat and query generation modes
- Single-Turn RAG Demo: Basic Q&A over documents
External Resources
- RAG Original Paper (Lewis et al., 2020): Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Self-RAG (Asai et al., 2023): Learning to retrieve, generate, and critique
- RAPTOR (Sarthi et al., 2024): Recursive abstractive processing for tree-organized retrieval
Summary
Retrieval-Augmented Generation (RAG) is a technique that improves the output of Large Language Models (LLMs) by incorporating external information retrieved from data sources. LM-Kit.NET provides a comprehensive RAG framework: RagEngine for core retrieval and generation, RagChat for turnkey multi-turn conversational RAG with four query generation strategies, and DocumentRag for document-centric workflows. The framework supports hybrid search (BM25 + vector fusion), MMR diversity filtering, context window expansion, reranking, and three chunking strategies (TextChunking, MarkdownChunking, HtmlChunking). This makes LM-Kit.NET a production-ready platform for building RAG systems that run entirely on-device, keeping data private and eliminating cloud API dependencies.