Improve Recall with Multi-Query and HyDE Retrieval

A single query may not capture all the ways your documents express the answer. "What causes high latency?" might miss a passage that says "response times degrade when the thread pool is saturated." LM-Kit.NET provides two query transformation strategies that expand retrieval coverage: Multi-Query generates multiple reformulations of the original query, while HyDE (Hypothetical Document Embeddings) generates a hypothetical answer and uses its embedding for retrieval.

This tutorial shows how to configure both strategies, when to use each, and how they combine with other retrieval techniques.


Why This Matters

Two enterprise problems that query transformation solves:

  1. Short, ambiguous user queries. Users often type minimal queries like "timeout issue" or "performance." These contain too little information for vector search to find the best passages. Multi-Query generates expanded variants like "What causes request timeout errors?" and "How to diagnose timeout issues in production?" to cast a wider net.
  2. Vocabulary gap between questions and documents. Technical documentation uses different language than the questions people ask about it. A user asks "Why is my app slow?" while the document says "Latency increases due to garbage collection pressure." HyDE bridges this gap by generating a hypothetical answer in document-style language, then searching for passages similar to that answer.

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
RAM           16 GB recommended
VRAM          6 GB (for both models simultaneously)
Disk          ~4 GB free for model downloads

You should be familiar with the foundational RAG pipeline before starting this tutorial.


Step 1: Create the Project

dotnet new console -n QueryTransformQuickstart
cd QueryTransformQuickstart
dotnet add package LM-Kit.NET

Step 2: Understand the Four Query Modes

User Query: "Why is it slow?"
       │
       ├─ Original ─────────────► "Why is it slow?"
       │                                (used as-is)
       │
       ├─ Contextual ───────────► "Why is the API response time slow?"
       │                                (rewritten with conversation history)
       │
       ├─ MultiQuery ───────────► "What causes slow response times?"
       │                          "How to diagnose latency issues?"
       │                          "Why is application performance degraded?"
       │                                (3 variants, results merged with RRF)
       │
       └─ HypotheticalAnswer ──► "Slow response times are typically caused by
                                   thread pool exhaustion, excessive GC pressure,
                                   or database connection bottlenecks..."
                                        (hypothetical answer embedded for retrieval)

Mode                 LLM Calls   Retrieval Calls   Latency   Best For
Original             0           1                 Lowest    Well-formed, specific queries
Contextual           1           1                 Low       Multi-turn conversations (follow-up questions)
MultiQuery           1           N                 Medium    Ambiguous or broad queries
HypotheticalAnswer   1           1                 Medium    Short queries with vocabulary gaps
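Contextual is the only mode not demonstrated in code later in this tutorial. Assuming the same `RagChat` API used in the following steps, enabling it is a one-property change (a sketch, not verified against the library):

```csharp
// Contextual mode rewrites follow-up questions using the conversation history
// before retrieval, so terse follow-ups still retrieve well.
using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.Contextual
};

ragChat.Submit("What causes high API latency?");
// The follow-up below is rewritten with context from the first turn
// (e.g. into something like "How can high API latency be fixed?") before retrieval.
ragChat.Submit("How do I fix it?");
```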

Step 3: Multi-Query Retrieval with RagChat

Multi-Query generates N variant phrasings of the user's question, retrieves results for each, and merges them using Reciprocal Rank Fusion (RRF).

using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.Retrieval;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine(" Done.\n");

Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine(" Done.\n");

// ──────────────────────────────────────
// 2. Build RAG engine and index documents
// ──────────────────────────────────────
var dataSource = DataSource.CreateInMemoryDataSource("KnowledgeBase", embeddingModel);
var rag = new RagEngine(embeddingModel);
rag.AddDataSource(dataSource);
rag.DefaultIChunking = new TextChunking { MaxChunkSize = 500, MaxOverlapSize = 50 };

string[] docs =
{
    "Thread pool exhaustion causes request queuing and elevated response times. " +
    "Monitor ThreadPool.PendingWorkItemCount to detect saturation early.",

    "Garbage collection pressure from large object heap allocations can freeze " +
    "application threads for hundreds of milliseconds during Gen2 collections.",

    "Database connection pool limits default to 100. When all connections are in use, " +
    "new requests block until a connection is returned, causing cascading timeouts.",

    "Network latency between microservices increases proportionally with payload size. " +
    "Use compression and pagination to reduce round-trip times.",

    "CPU throttling in containerized environments occurs when the container exceeds its " +
    "CPU quota. The CFS scheduler introduces artificial delays of up to 100ms."
};

foreach (string doc in docs)
    rag.ImportText(doc, "KnowledgeBase", "performance-docs");

// ──────────────────────────────────────
// 3. Enable Multi-Query mode
// ──────────────────────────────────────
using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.MultiQuery,
    MaxRetrievedPartitions = 5,
    MinRelevanceScore = 0.2f,
    SystemPrompt = "Answer using only the provided context.",
    MaximumCompletionTokens = 512
};

// Configure variant generation
ragChat.MultiQueryOptions.QueryVariantCount = 3;        // Generate 3 variants (default)
ragChat.MultiQueryOptions.MaxCompletionTokens = 256;    // Token budget per generation

ragChat.AfterTextCompletion += (_, e) => Console.Write(e.Text);

// ──────────────────────────────────────
// 4. Query: "Why is it slow?" generates 3 variants
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine("Query: Why is it slow?\n");
Console.ResetColor();

Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Answer: ");
Console.ResetColor();

var result = ragChat.Submit("Why is it slow?");
Console.WriteLine($"\n  [{result.GeneratedTokenCount} tokens]\n");

The short query "Why is it slow?" is expanded into multiple variants that capture different aspects of the question. Each variant retrieves independently, and RRF fusion ensures that passages relevant to any variant surface in the final results.
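RRF itself is simple and independent of LM-Kit. The sketch below is illustrative only, not the library's internal implementation: each document earns 1/(k + rank) from every variant's result list it appears in, with k = 60 by convention, and documents are sorted by total score.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Rrf
{
    // Merge ranked lists with Reciprocal Rank Fusion:
    // score(doc) = sum over lists of 1 / (k + rank), with rank starting at 1.
    public static List<string> Fuse(IEnumerable<IReadOnlyList<string>> rankedLists, int k = 60)
    {
        var scores = new Dictionary<string, double>();
        foreach (var list in rankedLists)
            for (int rank = 1; rank <= list.Count; rank++)
            {
                scores.TryGetValue(list[rank - 1], out double s);
                scores[list[rank - 1]] = s + 1.0 / (k + rank);
            }
        return scores.OrderByDescending(kv => kv.Value).Select(kv => kv.Key).ToList();
    }
}
```

A document ranked moderately by all three variants can outrank one ranked first by a single variant, which is what makes fusion robust to one bad reformulation.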


Step 4: Tune Multi-Query Options

ragChat.MultiQueryOptions.QueryVariantCount = 4;       // More variants = broader recall
ragChat.MultiQueryOptions.MaxCompletionTokens = 128;   // Shorter budget = faster generation

Setting               Default   Guidance
QueryVariantCount     3         3 to 4 variants is optimal. More than 5 adds latency with diminishing returns.
MaxCompletionTokens   256       128 is sufficient for query reformulation. Increase only for very complex queries.

Step 5: HyDE Retrieval

HyDE takes a different approach: instead of generating query variants, it generates a hypothetical answer to the question. This hypothetical answer uses vocabulary and sentence structure similar to the actual documents, producing an embedding that is closer to the real answer in vector space.

using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.HypotheticalAnswer,
    MaxRetrievedPartitions = 5,
    MinRelevanceScore = 0.2f,
    SystemPrompt = "Answer using only the provided context.",
    MaximumCompletionTokens = 512
};

// Configure hypothesis generation
ragChat.HydeOptions.MaxCompletionTokens = 512;  // Token budget for hypothetical answer

ragChat.AfterTextCompletion += (_, e) => Console.Write(e.Text);

Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine("Query: timeout issue\n");
Console.ResetColor();

Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Answer: ");
Console.ResetColor();

var result = ragChat.Submit("timeout issue");
Console.WriteLine($"\n  [{result.GeneratedTokenCount} tokens]\n");

For the terse query "timeout issue," HyDE generates a paragraph-length hypothetical answer about timeout causes. That paragraph's embedding matches the actual document about database connection pool timeouts far better than the two-word query would.
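The vocabulary-gap effect can be seen with a toy bag-of-words similarity. This is purely illustrative; the real embedding model captures semantics far beyond token overlap:

```csharp
using System;
using System.Linq;

string doc   = "database connection pool limits cause cascading timeouts";
string query = "timeout issue";
string hyde  = "timeouts are often caused by database connection pool limits";

Console.WriteLine(Toy.Cosine(query, doc)); // 0: no shared tokens ("timeout" vs "timeouts")
Console.WriteLine(Toy.Cosine(hyde, doc));  // ~0.63: the hypothesis shares the document's vocabulary

static class Toy
{
    // Toy cosine similarity over bag-of-words token counts.
    public static double Cosine(string a, string b)
    {
        var ta = a.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var tb = b.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var vocab = ta.Concat(tb).Distinct().ToArray();
        double[] va = vocab.Select(w => (double)ta.Count(t => t == w)).ToArray();
        double[] vb = vocab.Select(w => (double)tb.Count(t => t == w)).ToArray();
        double dot = va.Zip(vb, (x, y) => x * y).Sum();
        return dot / (Math.Sqrt(va.Sum(x => x * x)) * Math.Sqrt(vb.Sum(x => x * x)));
    }
}
```

Even this crude measure shows why searching with the hypothetical answer's embedding beats searching with the raw two-word query.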


Step 6: Choosing Between Multi-Query and HyDE

Factor         Multi-Query                                  HyDE
Query type     Ambiguous, multi-faceted questions           Short, terse queries with vocabulary gaps
Mechanism      Generates N query variants, merges results   Generates one hypothetical answer, uses its embedding
Recall boost   Broad: captures different phrasings          Deep: bridges question vs. document vocabulary
Latency        1 LLM call + N retrieval calls               1 LLM call + 1 retrieval call
Risk           Low (variants are short)                     Medium (the hypothesis may hallucinate, skewing the embedding)
Best domains   General Q&A, conversational search           Technical documentation, knowledge bases with specialized vocabulary

Practical heuristic: Start with Contextual mode for multi-turn conversations. If recall is insufficient, try MultiQuery. If queries are very short or use different vocabulary than your documents, try HypotheticalAnswer.


Step 7: Combine with Other Retrieval Techniques

Query transformation is the first stage of the retrieval pipeline. It combines naturally with other techniques:

With Hybrid Search

Layer hybrid search on top of query expansion so both semantic and keyword matches are captured:

rag.RetrievalStrategy = new HybridRetrievalStrategy();

using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.MultiQuery
};

With Reranking

Add reranking to re-score the expanded result set for higher precision:

rag.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f);

using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.MultiQuery
};

With MMR Diversity

Use MMR to ensure the expanded result set does not contain near-duplicate passages:

rag.MmrLambda = 0.7f;

using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.MultiQuery
};

The full pipeline: Multi-Query expansion (broad recall), then hybrid search (keyword + semantic), then reranking (precision), with MMR (diversity).
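Assembled from the settings shown earlier in this step, the combined configuration looks like this (a sketch stitching together this tutorial's snippets, not an additional API):

```csharp
// Query expansion + hybrid retrieval + reranking + MMR diversity, combined.
rag.RetrievalStrategy = new HybridRetrievalStrategy();                          // keyword + semantic
rag.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f);  // precision re-scoring
rag.MmrLambda = 0.7f;                                                           // diversity

using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.MultiQuery,  // broad recall
    MaxRetrievedPartitions = 5,
    MinRelevanceScore = 0.2f
};
```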


Common Issues

Problem                              Cause                                          Fix
Generated variants are too similar   Model repeating itself                         Use a more capable chat model (e.g., gemma3:12b)
HyDE retrieves wrong passages        Hypothetical answer hallucinated off-topic     Lower HydeOptions.MaxCompletionTokens to constrain the hypothesis
High latency in MultiQuery mode      Too many variants                              Reduce QueryVariantCount to 2 or 3
No improvement over Original mode    Queries are already specific and well-formed   Query transformation helps most with short or ambiguous queries

Next Steps
