Build Conversational RAG with RagChat
A basic RAG pipeline answers isolated questions, but real users ask follow-up questions that depend on conversation history. "What were Q4 revenues?" followed by "How does that compare to Q3?" requires the system to understand that "that" refers to Q4 revenues. RagChat wraps RagEngine with built-in conversation management, query contextualization, and configurable query generation modes so that each follow-up retrieves the right passages automatically.
This tutorial builds a multi-turn document Q&A system where follow-up questions are handled transparently.
Why This Matters
Two enterprise problems that conversational RAG solves:
- Support agents drilling into complex cases. A support engineer asks about an error, then follows up with "Which versions are affected?" and "What is the workaround?" Each question depends on the previous context. Without conversation-aware retrieval, the system retrieves unrelated passages because the follow-up questions lack sufficient standalone context.
- Analysts exploring large document collections. A financial analyst asks "What is the company's debt-to-equity ratio?" then "How has it changed year over year?" The second query is meaningless without the first; RagChat rewrites it into a self-contained query before retrieval.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16 GB recommended |
| VRAM | 6 GB (for both models simultaneously) |
| Disk | ~4 GB free for model downloads |
You should be familiar with the foundational RAG pipeline before starting this tutorial.
Step 1: Create the Project
dotnet new console -n ConversationalRag
cd ConversationalRag
dotnet add package LM-Kit.NET
Step 2: Understand the Architecture
┌──────────────────────────────────────────────────────────────────┐
│ RagChat │
│ │
│ ChatHistory ──► QueryGenerationMode ──► RagEngine ──► LLM │
│ │
│ Modes: │
│ Original ............. use query as-is │
│ Contextual ........... rewrite with history context │
│ MultiQuery ........... generate N query variants + RRF │
│ HypotheticalAnswer ... generate hypothetical answer (HyDE) │
└──────────────────────────────────────────────────────────────────┘
Key classes:
| Class | Role |
|---|---|
| RagChat | Multi-turn conversational interface over RagEngine |
| RagEngine | Core retrieval and indexing engine |
| QueryGenerationMode | Controls how queries are transformed before retrieval |
Step 3: Build a Conversational RAG System
using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.Retrieval;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine(" Done.\n");
Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine(" Done.\n");
// ──────────────────────────────────────
// 2. Build the RAG engine and index documents
// ──────────────────────────────────────
var dataSource = DataSource.CreateInMemoryDataSource("KnowledgeBase", embeddingModel);
var rag = new RagEngine(embeddingModel);
rag.AddDataSource(dataSource);
rag.DefaultIChunking = new TextChunking { MaxChunkSize = 500, MaxOverlapSize = 50 };
// Index some sample content
string[] sections =
{
"Q4 2024 revenue was $12.3M, a 15% increase over Q3 2024 revenue of $10.7M. " +
"The growth was driven by enterprise license sales in the EMEA region.",
"Q3 2024 revenue was $10.7M, primarily from North American cloud subscriptions. " +
"Operating expenses were $8.1M, resulting in an operating margin of 24.3%.",
"The company's debt-to-equity ratio improved from 1.8 in Q3 to 1.5 in Q4, " +
"driven by accelerated debt repayment and retained earnings growth."
};
for (int i = 0; i < sections.Length; i++)
rag.ImportText(sections[i], "KnowledgeBase", $"financial-report-section-{i}");
Console.WriteLine($"Indexed {sections.Length} sections.\n");
// ──────────────────────────────────────
// 3. Create RagChat with contextual query rewriting
// ──────────────────────────────────────
using var ragChat = new RagChat(rag, chatModel)
{
QueryGenerationMode = QueryGenerationMode.Contextual,
MaxRetrievedPartitions = 3,
MinRelevanceScore = 0.3f,
SystemPrompt = "Answer the question using only the provided context. " +
"If the context does not contain the answer, say so.",
MaximumCompletionTokens = 512
};
// Stream tokens as they are generated
ragChat.AfterTextCompletion += (_, e) =>
{
Console.Write(e.Text);
};
// Monitor retrieval
ragChat.RetrievalCompleted += (_, e) =>
{
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($" [Retrieved {e.Partitions.Count} passages]");
Console.ResetColor();
};
// ──────────────────────────────────────
// 4. Multi-turn conversation loop
// ──────────────────────────────────────
Console.WriteLine("Ask questions about the financial reports (or 'quit' to exit):\n");
while (true)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write("You: ");
Console.ResetColor();
string? input = Console.ReadLine();
if (string.IsNullOrWhiteSpace(input) || input.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Assistant: ");
Console.ResetColor();
RagQueryResult result = ragChat.Submit(input);
Console.WriteLine($"\n [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
}
Step 4: Run and Test Multi-Turn Retrieval
dotnet run
Example session showing how contextual mode rewrites follow-up queries:
Indexed 3 sections.
Ask questions about the financial reports (or 'quit' to exit):
You: What was Q4 revenue?
[Retrieved 2 passages]
Assistant: Q4 2024 revenue was $12.3M, a 15% increase over Q3.
[38 tokens, 41.2 tok/s]
You: How does that compare to Q3?
[Retrieved 2 passages]
Assistant: Q4 revenue of $12.3M was 15% higher than Q3 revenue of $10.7M.
The growth was driven by enterprise license sales in EMEA.
[52 tokens, 39.8 tok/s]
You: What about the debt-to-equity ratio?
[Retrieved 1 passages]
Assistant: The debt-to-equity ratio improved from 1.8 in Q3 to 1.5 in Q4,
driven by accelerated debt repayment and retained earnings growth.
[47 tokens, 40.1 tok/s]
Without contextual mode, the query "How does that compare to Q3?" would be sent to the retrieval engine as-is, and "that" would match poorly against any document. With QueryGenerationMode.Contextual, RagChat rewrites it into something like "How does Q4 2024 revenue compare to Q3 2024 revenue?" before retrieval.
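Conceptually, the rewrite step folds recent turns into an instruction that the chat model answers with a standalone query. The sketch below is illustrative only — RagChat performs this step internally, and this prompt wording is an assumption, not the library's actual template:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative sketch of query contextualization — not LM-Kit's internal prompt.
static string BuildRewritePrompt(IReadOnlyList<(string Role, string Text)> history, string followUp)
{
    var sb = new StringBuilder();
    sb.AppendLine("Rewrite the final question as a standalone search query,");
    sb.AppendLine("resolving pronouns and references using the conversation below.");
    sb.AppendLine();
    foreach (var (role, text) in history)
        sb.AppendLine($"{role}: {text}");
    sb.AppendLine($"User: {followUp}");
    return sb.ToString();
}

var history = new List<(string, string)>
{
    ("User", "What was Q4 revenue?"),
    ("Assistant", "Q4 2024 revenue was $12.3M.")
};
Console.WriteLine(BuildRewritePrompt(history, "How does that compare to Q3?"));
```

Given a prompt like this, the chat model emits a self-contained query such as "How does Q4 2024 revenue compare to Q3 2024 revenue?", which is then embedded and used for retrieval in place of the raw follow-up.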
Step 5: Configure Query Contextualization
Fine-tune how the contextualizer uses conversation history:
ragChat.QueryGenerationMode = QueryGenerationMode.Contextual;
ragChat.ContextualizationOptions.MaxHistoryTurns = 5; // Look back 5 turns (default: 10)
ragChat.ContextualizationOptions.MaxCompletionTokens = 128; // Token budget for rewrite (default: 256)
| Property | Default | Effect |
|---|---|---|
| MaxHistoryTurns | 10 | Number of previous turns used for context. Lower values reduce latency. |
| MaxCompletionTokens | 256 | Token budget for the rewritten query. Keep it low for simple rewrites. |
Step 6: Switch to Advanced Query Modes
RagChat supports four query generation modes. You can switch modes at any point during a conversation:
Multi-Query Mode
Generates multiple query variants and merges results with RRF. Improves recall for ambiguous or broad queries.
ragChat.QueryGenerationMode = QueryGenerationMode.MultiQuery;
ragChat.MultiQueryOptions.QueryVariantCount = 4; // Generate 4 variants (default: 3)
ragChat.MultiQueryOptions.MaxCompletionTokens = 256; // Token budget per generation
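The fusion step is easy to reason about in isolation. The standalone sketch below (plain C#, not LM-Kit API) shows reciprocal rank fusion: each document scores the sum of 1/(k + rank) over every variant's ranked list, so documents that appear near the top of several lists rise to the top of the merged result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Standalone RRF sketch — not LM-Kit API. k = 60 is the conventional constant.
static Dictionary<string, double> Rrf(IEnumerable<string[]> rankedLists, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var list in rankedLists)
        for (int rank = 0; rank < list.Length; rank++)
            scores[list[rank]] = scores.GetValueOrDefault(list[rank]) + 1.0 / (k + rank + 1);
    return scores;
}

var variantA = new[] { "docA", "docB", "docC" };   // results for query variant 1
var variantB = new[] { "docB", "docC", "docD" };   // results for query variant 2
var fused = Rrf(new[] { variantA, variantB }).OrderByDescending(kv => kv.Value).ToList();

// docB ranks first: it appears near the top of both lists.
Console.WriteLine(string.Join(", ", fused.Select(kv => kv.Key)));
```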
HyDE Mode
Generates a hypothetical answer first, embeds it, and uses that embedding for retrieval. Bridges the vocabulary gap between short questions and long document passages.
ragChat.QueryGenerationMode = QueryGenerationMode.HypotheticalAnswer;
ragChat.HydeOptions.MaxCompletionTokens = 512; // Token budget for hypothesis
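The intuition behind HyDE: a hypothetical answer is phrased like a document passage ("Q4 2024 revenue was…") rather than a question, so it shares more vocabulary with the passages you want to retrieve. The toy example below demonstrates only this vocabulary-gap effect with bag-of-words vectors and cosine similarity; real retrieval uses neural embeddings, not word counts.

```csharp
using System;
using System.Linq;

// Toy illustration of the HyDE intuition — bag-of-words, not neural embeddings.
static double Cosine(double[] a, double[] b)
{
    double dot = a.Zip(b, (x, y) => x * y).Sum();
    return dot / (Math.Sqrt(a.Sum(x => x * x)) * Math.Sqrt(b.Sum(x => x * x)));
}

static double[] Vectorize(string text, string[] vocab) =>
    vocab.Select(w => (double)text.Split(' ').Count(t => t == w)).ToArray();

string[] vocab = { "what", "was", "revenue", "q4", "2024", "12.3m", "increase", "over", "q3" };
string question     = "what was q4 revenue";
string hypothetical = "q4 2024 revenue was 12.3m an increase over q3";
string passage      = "q4 2024 revenue was 12.3m a 15% increase over q3 2024 revenue";

double qScore = Cosine(Vectorize(question, vocab), Vectorize(passage, vocab));
double hScore = Cosine(Vectorize(hypothetical, vocab), Vectorize(passage, vocab));

// The hypothetical answer scores higher: it is written in document style.
Console.WriteLine($"question vs passage:     {qScore:F2}");
Console.WriteLine($"hypothetical vs passage: {hScore:F2}");
```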
See Improve Recall with Multi-Query and HyDE Retrieval for a detailed comparison of these modes.
Step 7: Combine with Hybrid Search
RagChat inherits the retrieval strategy from its RagEngine. To use hybrid search with conversational RAG:
rag.RetrievalStrategy = new HybridRetrievalStrategy();
using var ragChat = new RagChat(rag, chatModel)
{
QueryGenerationMode = QueryGenerationMode.Contextual,
MaxRetrievedPartitions = 5,
MinRelevanceScore = 0.2f
};
This gives you multi-turn conversation with contextual rewriting and hybrid (vector + BM25) retrieval in a single setup.
Step 8: Manage Conversation State
Clear History
Reset the conversation to start fresh:
ragChat.ClearHistory();
Access Chat History
Inspect the conversation state programmatically:
foreach (var message in ragChat.ChatHistory)
{
Console.WriteLine($"[{message.AuthorRole}]: {message.Content}");
}
Custom Prompt Templates
Override how retrieved context is injected into the prompt:
ragChat.PromptTemplate = @"Use the following reference material to answer the user's question.
If the material does not contain the answer, state that clearly.
## Reference Material:
@context
## User Question:
@question";
The placeholders @context and @question are replaced automatically.
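The expansion itself amounts to plain string substitution. The sketch below is illustrative of the effect only, not of how RagChat performs it internally:

```csharp
using System;

// Illustrative: how @context and @question expand into the final prompt.
string template = """
Use the following reference material to answer the user's question.

## Reference Material:
@context

## User Question:
@question
""";

string prompt = template
    .Replace("@context", "Q4 2024 revenue was $12.3M, a 15% increase over Q3.")
    .Replace("@question", "What was Q4 revenue?");

Console.WriteLine(prompt);
```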
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Follow-ups retrieve irrelevant passages | QueryGenerationMode set to Original | Switch to Contextual for conversation-aware retrieval |
| Rewritten query too verbose | MaxCompletionTokens too high | Reduce ContextualizationOptions.MaxCompletionTokens to 128 |
| History context too stale | Too many turns in memory | Lower MaxHistoryTurns to 3 or 5 |
| Slow responses | Multiple LLM calls per query in MultiQuery/HyDE modes | Use Contextual mode for lower latency, or reduce QueryVariantCount |
Next Steps
- Build a RAG Pipeline Over Your Own Documents: foundational RAG tutorial with indexing and search.
- Improve Recall with Multi-Query and HyDE Retrieval: detailed guide on query transformation strategies.
- Boost Retrieval with Hybrid Search: combine vector and BM25 search for broader recall.
- Diversify and Filter RAG Results: reduce redundancy with MMR and scope retrieval with metadata filtering.
- Improve RAG Results with Reranking: add cross-encoder reranking for higher precision.
- Glossary: Query Contextualization: how follow-up queries are rewritten into standalone queries.
- Glossary: Multi-Query Retrieval: how query variants improve recall.
- Samples: Conversational RAG: interactive demo showcasing all RagChat features.