Improve Recall with Multi-Query and HyDE Retrieval
A single query may not capture all the ways your documents express the answer. "What causes high latency?" might miss a passage that says "response times degrade when the thread pool is saturated." LM-Kit.NET provides two query transformation strategies that expand retrieval coverage: Multi-Query generates multiple reformulations of the original query, while HyDE (Hypothetical Document Embeddings) generates a hypothetical answer and uses its embedding for retrieval.
This tutorial shows how to configure both strategies, when to use each, and how they combine with other retrieval techniques.
Why This Matters
Two enterprise problems that query transformation solves:
- Short, ambiguous user queries. Users often type minimal queries like "timeout issue" or "performance." These contain too little information for vector search to find the best passages. Multi-Query generates expanded variants like "What causes request timeout errors?" and "How to diagnose timeout issues in production?" to cast a wider net.
- Vocabulary gap between questions and documents. Technical documentation uses different language than the questions people ask about it. A user asks "Why is my app slow?" while the document says "Latency increases due to garbage collection pressure." HyDE bridges this gap by generating a hypothetical answer in document-style language, then searching for passages similar to that answer.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16 GB recommended |
| VRAM | 6 GB (for both models simultaneously) |
| Disk | ~4 GB free for model downloads |
You should be familiar with the foundational RAG pipeline before starting this tutorial.
Step 1: Create the Project
dotnet new console -n QueryTransformQuickstart
cd QueryTransformQuickstart
dotnet add package LM-Kit.NET
Step 2: Understand the Four Query Modes
User Query: "Why is it slow?"
│
├─ Original ─────────────► "Why is it slow?"
│ (used as-is)
│
├─ Contextual ───────────► "Why is the API response time slow?"
│ (rewritten with conversation history)
│
├─ MultiQuery ───────────► "What causes slow response times?"
│ "How to diagnose latency issues?"
│ "Why is application performance degraded?"
│ (3 variants, results merged with RRF)
│
└─ HypotheticalAnswer ──► "Slow response times are typically caused by
thread pool exhaustion, excessive GC pressure,
or database connection bottlenecks..."
(hypothetical answer embedded for retrieval)
| Mode | LLM Calls | Retrieval Calls | Latency | Best For |
|---|---|---|---|---|
| Original | 0 | 1 | Lowest | Well-formed, specific queries |
| Contextual | 1 | 1 | Low | Multi-turn conversations (follow-up questions) |
| MultiQuery | 1 | N | Medium | Ambiguous or broad queries |
| HypotheticalAnswer | 1 | 1 | Medium | Short queries with vocabulary gaps |
Step 3: Multi-Query Retrieval with RagChat
Multi-Query generates N variant phrasings of the user's question, retrieves results for each, and merges them using Reciprocal Rank Fusion (RRF).
using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.Retrieval;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine(" Done.\n");
Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine(" Done.\n");
// ──────────────────────────────────────
// 2. Build RAG engine and index documents
// ──────────────────────────────────────
var dataSource = DataSource.CreateInMemoryDataSource("KnowledgeBase", embeddingModel);
var rag = new RagEngine(embeddingModel);
rag.AddDataSource(dataSource);
rag.DefaultIChunking = new TextChunking { MaxChunkSize = 500, MaxOverlapSize = 50 };
string[] docs =
{
"Thread pool exhaustion causes request queuing and elevated response times. " +
"Monitor ThreadPool.PendingWorkItemCount to detect saturation early.",
"Garbage collection pressure from large object heap allocations can freeze " +
"application threads for hundreds of milliseconds during Gen2 collections.",
"Database connection pool limits default to 100. When all connections are in use, " +
"new requests block until a connection is returned, causing cascading timeouts.",
"Network latency between microservices increases proportionally with payload size. " +
"Use compression and pagination to reduce round-trip times.",
"CPU throttling in containerized environments occurs when the container exceeds its " +
"CPU quota. The CFS scheduler introduces artificial delays of up to 100ms."
};
foreach (string doc in docs)
rag.ImportText(doc, "KnowledgeBase", "performance-docs");
// ──────────────────────────────────────
// 3. Enable Multi-Query mode
// ──────────────────────────────────────
using var ragChat = new RagChat(rag, chatModel)
{
QueryGenerationMode = QueryGenerationMode.MultiQuery,
MaxRetrievedPartitions = 5,
MinRelevanceScore = 0.2f,
SystemPrompt = "Answer using only the provided context.",
MaximumCompletionTokens = 512
};
// Configure variant generation
ragChat.MultiQueryOptions.QueryVariantCount = 3; // Generate 3 variants (default)
ragChat.MultiQueryOptions.MaxCompletionTokens = 256; // Token budget per generation
ragChat.AfterTextCompletion += (_, e) => Console.Write(e.Text);
// ──────────────────────────────────────
// 4. Query: "Why is it slow?" generates 3 variants
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine("Query: Why is it slow?\n");
Console.ResetColor();
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Answer: ");
Console.ResetColor();
var result = ragChat.Submit("Why is it slow?");
Console.WriteLine($"\n [{result.GeneratedTokenCount} tokens]\n");
The short query "Why is it slow?" is expanded into multiple variants that capture different aspects of the question. Each variant runs its own retrieval pass, and RRF fusion ensures that passages relevant to any variant surface in the final results.
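To make the merge step concrete, RRF gives each passage a score of 1/(k + rank) for every variant's result list it appears in, then sums across lists. Below is a minimal, self-contained sketch of that computation; the constant k = 60 follows the original RRF paper, and LM-Kit.NET's internal constant and tie-breaking may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class RrfDemo
{
    static void Main()
    {
        // Ranked passage IDs returned for two hypothetical query variants.
        string[] variantA = { "thread-pool", "gc-pressure", "cpu-throttle" };
        string[] variantB = { "db-pool", "thread-pool", "network" };

        const double k = 60.0;
        var scores = new Dictionary<string, double>();

        foreach (var ranking in new[] { variantA, variantB })
            for (int rank = 1; rank <= ranking.Length; rank++)
            {
                // Each list contributes 1 / (k + rank) to a passage's score.
                string id = ranking[rank - 1];
                scores.TryGetValue(id, out double s);
                scores[id] = s + 1.0 / (k + rank);
            }

        // "thread-pool" wins: it ranks high in both lists.
        foreach (var kv in scores.OrderByDescending(kv => kv.Value))
            Console.WriteLine($"{kv.Key,-13} {kv.Value:F4}");
    }
}
```

A passage that appears in several lists accumulates score from each, which is why RRF favors results that multiple variants agree on.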
Step 4: Tune Multi-Query Options
ragChat.MultiQueryOptions.QueryVariantCount = 4; // More variants = broader recall
ragChat.MultiQueryOptions.MaxCompletionTokens = 128; // Shorter budget = faster generation
| Setting | Default | Guidance |
|---|---|---|
| QueryVariantCount | 3 | 3 to 4 variants is optimal. More than 5 adds latency with diminishing returns. |
| MaxCompletionTokens | 256 | 128 is sufficient for query reformulation. Increase only for very complex queries. |
Step 5: HyDE Retrieval
HyDE takes a different approach: instead of generating query variants, it generates a hypothetical answer to the question. This hypothetical answer uses vocabulary and sentence structure similar to the actual documents, producing an embedding that is closer to the real answer in vector space.
using var ragChat = new RagChat(rag, chatModel)
{
QueryGenerationMode = QueryGenerationMode.HypotheticalAnswer,
MaxRetrievedPartitions = 5,
MinRelevanceScore = 0.2f,
SystemPrompt = "Answer using only the provided context.",
MaximumCompletionTokens = 512
};
// Configure hypothesis generation
ragChat.HydeOptions.MaxCompletionTokens = 512; // Token budget for hypothetical answer
ragChat.AfterTextCompletion += (_, e) => Console.Write(e.Text);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine("Query: timeout issue\n");
Console.ResetColor();
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Answer: ");
Console.ResetColor();
var result = ragChat.Submit("timeout issue");
Console.WriteLine($"\n [{result.GeneratedTokenCount} tokens]\n");
For the terse query "timeout issue," HyDE generates a paragraph-length hypothetical answer about timeout causes. That paragraph's embedding matches the actual document about database connection pool timeouts far better than the two-word query would.
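The geometry behind this can be illustrated with a toy cosine-similarity calculation. The three-dimensional vectors below are invented purely to show the mechanics (real embeddings have hundreds of dimensions and come from the embedding model, not hand-picked numbers):

```csharp
using System;

class HydeIntuition
{
    // Standard cosine similarity: dot product over the product of norms.
    static double Cosine(double[] a, double[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }

    static void Main()
    {
        // Invented toy embeddings for illustration only.
        double[] document   = { 0.9, 0.1, 0.3 }; // "connection pool ... cascading timeouts"
        double[] rawQuery   = { 0.2, 0.8, 0.1 }; // "timeout issue"
        double[] hypothesis = { 0.8, 0.2, 0.4 }; // document-style hypothetical answer

        Console.WriteLine($"raw query  vs document: {Cosine(rawQuery, document):F2}");
        Console.WriteLine($"hypothesis vs document: {Cosine(hypothesis, document):F2}");
        // The hypothesis vector sits much closer to the document than the raw query,
        // which is exactly the gap HyDE exploits.
    }
}
```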
Step 6: Choosing Between Multi-Query and HyDE
| Factor | Multi-Query | HyDE |
|---|---|---|
| Query type | Ambiguous, multi-faceted questions | Short, terse queries with vocabulary gaps |
| Mechanism | Generates N query variants, merges results | Generates one hypothetical answer, uses its embedding |
| Recall boost | Broad: captures different phrasings | Deep: bridges question vs. document vocabulary |
| Latency | 1 LLM call + N retrieval calls | 1 LLM call + 1 retrieval call |
| Risk | Low (variants are short) | Medium (hypothetical answer may hallucinate, leading to wrong embedding) |
| Best domains | General Q&A, conversational search | Technical documentation, knowledge bases with specialized vocabulary |
Practical heuristic: Start with Contextual mode for multi-turn conversations. If recall is insufficient, try MultiQuery. If queries are very short or use different vocabulary than your documents, try HypotheticalAnswer.
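That heuristic can be sketched as a small selection helper. This is a hypothetical function, not part of the LM-Kit.NET API, and the word-count and turn-count thresholds are illustrative guesses to tune against your own query logs:

```csharp
using System;
using LMKit.Retrieval;

static class ModeSelector
{
    // Hypothetical helper: picks a QueryGenerationMode from simple signals.
    // Thresholds are illustrative, not LM-Kit.NET defaults.
    public static QueryGenerationMode Choose(string query, int conversationTurns)
    {
        int words = query.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

        if (conversationTurns > 1)
            return QueryGenerationMode.Contextual;         // follow-up question
        if (words <= 3)
            return QueryGenerationMode.HypotheticalAnswer; // terse, likely vocabulary gap
        if (words <= 8)
            return QueryGenerationMode.MultiQuery;         // short or ambiguous
        return QueryGenerationMode.Original;               // specific, well-formed
    }
}
```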
Step 7: Combine with Other Retrieval Techniques
Query transformation is the first stage of the retrieval pipeline. It combines naturally with other techniques:
With Hybrid Search
Use hybrid search to ensure both semantic and keyword matches are captured, on top of query expansion:
rag.RetrievalStrategy = new HybridRetrievalStrategy();
using var ragChat = new RagChat(rag, chatModel)
{
QueryGenerationMode = QueryGenerationMode.MultiQuery
};
With Reranking
Add reranking to re-score the expanded result set for higher precision:
rag.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f);
using var ragChat = new RagChat(rag, chatModel)
{
QueryGenerationMode = QueryGenerationMode.MultiQuery
};
With MMR Diversity
Use MMR to ensure the expanded result set does not contain near-duplicate passages:
rag.MmrLambda = 0.7f;
using var ragChat = new RagChat(rag, chatModel)
{
QueryGenerationMode = QueryGenerationMode.MultiQuery
};
The full pipeline: Multi-Query expansion (broad recall), then hybrid search (keyword + semantic), then reranking (precision), with MMR (diversity).
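Putting it together, all four stages can be enabled in one configuration. This simply combines the snippets above; the property names follow the earlier steps, and the numeric values are the tutorial's examples rather than required defaults:

```csharp
// Assumes rag, embeddingModel, and chatModel are set up as in Step 3.
rag.RetrievalStrategy = new HybridRetrievalStrategy();   // keyword + semantic recall
rag.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f); // precision
rag.MmrLambda = 0.7f;                                    // diversity in the final set

using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.MultiQuery, // broad recall up front
    MaxRetrievedPartitions = 5,
    MinRelevanceScore = 0.2f
};
ragChat.MultiQueryOptions.QueryVariantCount = 3;
```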
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Generated variants are too similar | Model repeating itself | Use a more capable chat model (e.g., gemma3:12b) |
| HyDE retrieves wrong passages | Hypothetical answer hallucinated off-topic | Lower HydeOptions.MaxCompletionTokens to constrain the hypothesis |
| High latency in MultiQuery mode | Too many variants | Reduce QueryVariantCount to 2 or 3 |
| No improvement over Original mode | Queries are already specific and well-formed | Query transformation helps most with short or ambiguous queries |
Next Steps
- Build Conversational RAG with RagChat: full multi-turn conversational interface with query modes.
- Boost Retrieval with Hybrid Search: combine vector and BM25 for broader recall.
- Improve RAG Results with Reranking: add cross-encoder reranking for precision.
- Diversify and Filter RAG Results: reduce redundancy with MMR and scope results with metadata filtering.
- Glossary: Multi-Query Retrieval: how query variants are generated and fused.
- Glossary: HyDE: how hypothetical document embeddings work.
- Glossary: Reciprocal Rank Fusion: the algorithm that merges multiple ranked lists.
- Samples: Conversational RAG: interactive demo with all four query modes.