Improve Recall with Multi-Query and HyDE Retrieval

A single query may not capture all the ways your documents express the answer. "What causes high latency?" might miss a passage that says "response times degrade when the thread pool is saturated." LM-Kit.NET provides two query transformation strategies that expand retrieval coverage: Multi-Query generates multiple reformulations of the original query, while HyDE (Hypothetical Document Embeddings) generates a hypothetical answer and uses its embedding for retrieval.

This tutorial shows how to configure both strategies, when to use each, and how they combine with other retrieval techniques.


Why This Matters

Two enterprise problems that query transformation solves:

  1. Short, ambiguous user queries. Users often type minimal queries like "timeout issue" or "performance." These contain too little information for vector search to find the best passages. Multi-Query generates expanded variants like "What causes request timeout errors?" and "How to diagnose timeout issues in production?" to cast a wider net.
  2. Vocabulary gap between questions and documents. Technical documentation uses different language than the questions people ask about it. A user asks "Why is my app slow?" while the document says "Latency increases due to garbage collection pressure." HyDE bridges this gap by generating a hypothetical answer in document-style language, then searching for passages similar to that answer.

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
RAM           16 GB recommended
VRAM          6 GB (for both models simultaneously)
Disk          ~4 GB free for model downloads

You should be familiar with the foundational RAG pipeline before starting this tutorial.


Step 1: Create the Project

dotnet new console -n QueryTransformQuickstart
cd QueryTransformQuickstart
dotnet add package LM-Kit.NET

Step 2: Understand the Four Query Modes

User Query: "Why is it slow?"
       │
       ├─ Original ─────────────► "Why is it slow?"
       │                                (used as-is)
       │
       ├─ Contextual ───────────► "Why is the API response time slow?"
       │                                (rewritten with conversation history)
       │
       ├─ MultiQuery ───────────► "What causes slow response times?"
       │                          "How to diagnose latency issues?"
       │                          "Why is application performance degraded?"
       │                                (3 variants, results merged with RRF)
       │
       └─ HypotheticalAnswer ──► "Slow response times are typically caused by
                                   thread pool exhaustion, excessive GC pressure,
                                   or database connection bottlenecks..."
                                        (hypothetical answer embedded for retrieval)

Mode                 LLM Calls   Retrieval Calls   Latency   Best For
Original             0           1                 Lowest    Well-formed, specific queries
Contextual           1           1                 Low       Multi-turn conversations (follow-up questions)
MultiQuery           1           N                 Medium    Ambiguous or broad queries
HypotheticalAnswer   1           1                 Medium    Short queries with vocabulary gaps
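Contextual is the only mode not demonstrated in code later in this tutorial. Assuming the same `RagChat` API used in the following steps, enabling it is a one-property change (a sketch, not verified against the library):

```csharp
// Contextual mode rewrites follow-up questions using the conversation history
// before retrieval, so terse follow-ups still retrieve well.
using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.Contextual
};

ragChat.Submit("What causes high API latency?");
// The follow-up below is rewritten with context from the first turn
// (e.g. into something like "How can high API latency be fixed?") before retrieval.
ragChat.Submit("How do I fix it?");
```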

Step 3: Multi-Query Retrieval with RagChat

Multi-Query generates N variant phrasings of the user's question, retrieves results for each, and merges them using Reciprocal Rank Fusion (RRF).

using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.Retrieval;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine(" Done.\n");

Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine(" Done.\n");

// ──────────────────────────────────────
// 2. Build RAG engine and index documents
// ──────────────────────────────────────
var dataSource = DataSource.CreateInMemoryDataSource("KnowledgeBase", embeddingModel);
var rag = new RagEngine(embeddingModel);
rag.AddDataSource(dataSource);
rag.DefaultIChunking = new TextChunking { MaxChunkSize = 500, MaxOverlapSize = 50 };

string[] docs =
{
    "Thread pool exhaustion causes request queuing and elevated response times. " +
    "Monitor ThreadPool.PendingWorkItemCount to detect saturation early.",

    "Garbage collection pressure from large object heap allocations can freeze " +
    "application threads for hundreds of milliseconds during Gen2 collections.",

    "Database connection pool limits default to 100. When all connections are in use, " +
    "new requests block until a connection is returned, causing cascading timeouts.",

    "Network latency between microservices increases proportionally with payload size. " +
    "Use compression and pagination to reduce round-trip times.",

    "CPU throttling in containerized environments occurs when the container exceeds its " +
    "CPU quota. The CFS scheduler introduces artificial delays of up to 100ms."
};

foreach (string doc in docs)
    rag.ImportText(doc, "KnowledgeBase", "performance-docs");

// ──────────────────────────────────────
// 3. Enable Multi-Query mode
// ──────────────────────────────────────
using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.MultiQuery,
    MaxRetrievedPartitions = 5,
    MinRelevanceScore = 0.2f,
    SystemPrompt = "Answer using only the provided context.",
    MaximumCompletionTokens = 512
};

// Configure variant generation
ragChat.MultiQueryOptions.QueryVariantCount = 3;        // Generate 3 variants (default)
ragChat.MultiQueryOptions.MaxCompletionTokens = 256;    // Token budget per generation

ragChat.AfterTextCompletion += (_, e) => Console.Write(e.Text);

// ──────────────────────────────────────
// 4. Query: "Why is it slow?" generates 3 variants
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine("Query: Why is it slow?\n");
Console.ResetColor();

Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Answer: ");
Console.ResetColor();

var result = ragChat.Submit("Why is it slow?");
Console.WriteLine($"\n  [{result.GeneratedTokenCount} tokens]\n");

The short query "Why is it slow?" is expanded into multiple variants that capture different aspects of the question. Each variant retrieves independently, and RRF fusion ensures that passages relevant to any variant surface in the final results.
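RRF itself is simple and independent of LM-Kit. The sketch below is illustrative only, not the library's internal implementation: each document earns 1/(k + rank) from every variant's result list it appears in, with k = 60 by convention, and documents are sorted by total score.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Rrf
{
    // Merge ranked lists with Reciprocal Rank Fusion:
    // score(doc) = sum over lists of 1 / (k + rank), with rank starting at 1.
    public static List<string> Fuse(IEnumerable<IReadOnlyList<string>> rankedLists, int k = 60)
    {
        var scores = new Dictionary<string, double>();
        foreach (var list in rankedLists)
            for (int rank = 1; rank <= list.Count; rank++)
            {
                scores.TryGetValue(list[rank - 1], out double s);
                scores[list[rank - 1]] = s + 1.0 / (k + rank);
            }
        return scores.OrderByDescending(kv => kv.Value).Select(kv => kv.Key).ToList();
    }
}
```

A document ranked moderately by all three variants can outrank one ranked first by a single variant, which is what makes fusion robust to one bad reformulation.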


Step 4: Tune Multi-Query Options

ragChat.MultiQueryOptions.QueryVariantCount = 4;       // More variants = broader recall
ragChat.MultiQueryOptions.MaxCompletionTokens = 128;   // Shorter budget = faster generation

Setting               Default   Guidance
QueryVariantCount     3         3 to 4 variants is optimal. More than 5 adds latency with diminishing returns.
MaxCompletionTokens   256       128 is sufficient for query reformulation. Increase only for very complex queries.

Step 5: HyDE Retrieval

HyDE takes a different approach: instead of generating query variants, it generates a hypothetical answer to the question. This hypothetical answer uses vocabulary and sentence structure similar to the actual documents, producing an embedding that is closer to the real answer in vector space.

using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.HypotheticalAnswer,
    MaxRetrievedPartitions = 5,
    MinRelevanceScore = 0.2f,
    SystemPrompt = "Answer using only the provided context.",
    MaximumCompletionTokens = 512
};

// Configure hypothesis generation
ragChat.HydeOptions.MaxCompletionTokens = 512;  // Token budget for hypothetical answer

ragChat.AfterTextCompletion += (_, e) => Console.Write(e.Text);

Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine("Query: timeout issue\n");
Console.ResetColor();

Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Answer: ");
Console.ResetColor();

var result = ragChat.Submit("timeout issue");
Console.WriteLine($"\n  [{result.GeneratedTokenCount} tokens]\n");

For the terse query "timeout issue," HyDE generates a paragraph-length hypothetical answer about timeout causes. That paragraph's embedding matches the actual document about database connection pool timeouts far better than the two-word query would.
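The vocabulary-gap effect can be seen with a toy bag-of-words similarity. This is purely illustrative; the real embedding model captures semantics far beyond token overlap:

```csharp
using System;
using System.Linq;

string doc   = "database connection pool limits cause cascading timeouts";
string query = "timeout issue";
string hyde  = "timeouts are often caused by database connection pool limits";

Console.WriteLine(Toy.Cosine(query, doc)); // 0: no shared tokens ("timeout" vs "timeouts")
Console.WriteLine(Toy.Cosine(hyde, doc));  // ~0.63: the hypothesis shares the document's vocabulary

static class Toy
{
    // Toy cosine similarity over bag-of-words token counts.
    public static double Cosine(string a, string b)
    {
        var ta = a.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var tb = b.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var vocab = ta.Concat(tb).Distinct().ToArray();
        double[] va = vocab.Select(w => (double)ta.Count(t => t == w)).ToArray();
        double[] vb = vocab.Select(w => (double)tb.Count(t => t == w)).ToArray();
        double dot = va.Zip(vb, (x, y) => x * y).Sum();
        return dot / (Math.Sqrt(va.Sum(x => x * x)) * Math.Sqrt(vb.Sum(x => x * x)));
    }
}
```

Even this crude measure shows why searching with the hypothetical answer's embedding beats searching with the raw two-word query.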


Step 6: Choosing Between Multi-Query and HyDE

Factor         Multi-Query                                  HyDE
Query type     Ambiguous, multi-faceted questions           Short, terse queries with vocabulary gaps
Mechanism      Generates N query variants, merges results   Generates one hypothetical answer, uses its embedding
Recall boost   Broad: captures different phrasings          Deep: bridges question vs. document vocabulary
Latency        1 LLM call + N retrieval calls               1 LLM call + 1 retrieval call
Risk           Low (variants are short)                     Medium (the hypothesis may hallucinate, skewing the embedding)
Best domains   General Q&A, conversational search           Technical documentation, knowledge bases with specialized vocabulary

Practical heuristic: Start with Contextual mode for multi-turn conversations. If recall is insufficient, try MultiQuery. If queries are very short or use different vocabulary than your documents, try HypotheticalAnswer.


Step 7: Combine with Other Retrieval Techniques

Query transformation is the first stage of the retrieval pipeline. It combines naturally with other techniques:

With Hybrid Search

Layer hybrid search on top of query expansion so both semantic and keyword matches are captured:

rag.RetrievalStrategy = new HybridRetrievalStrategy();

using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.MultiQuery
};

With Reranking

Add reranking to re-score the expanded result set for higher precision:

rag.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f);

using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.MultiQuery
};

With MMR Diversity

Use MMR to ensure the expanded result set does not contain near-duplicate passages:

rag.MmrLambda = 0.7f;

using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.MultiQuery
};

The full pipeline: Multi-Query expansion (broad recall), then hybrid search (keyword + semantic), then reranking (precision), with MMR (diversity).
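Assembled from the settings shown earlier in this step, the combined configuration looks like this (a sketch stitching together this tutorial's snippets, not an additional API):

```csharp
// Query expansion + hybrid retrieval + reranking + MMR diversity, combined.
rag.RetrievalStrategy = new HybridRetrievalStrategy();                          // keyword + semantic
rag.Reranker = new RagEngine.RagReranker(embeddingModel, rerankedAlpha: 0.7f);  // precision re-scoring
rag.MmrLambda = 0.7f;                                                           // diversity

using var ragChat = new RagChat(rag, chatModel)
{
    QueryGenerationMode = QueryGenerationMode.MultiQuery,  // broad recall
    MaxRetrievedPartitions = 5,
    MinRelevanceScore = 0.2f
};
```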


Common Issues

Problem                              Cause                                          Fix
Generated variants are too similar   Model repeating itself                         Use a more capable chat model (e.g., gemma3:12b)
HyDE retrieves wrong passages        Hypothetical answer hallucinated off-topic     Lower HydeOptions.MaxCompletionTokens to constrain the hypothesis
High latency in MultiQuery mode      Too many variants                              Reduce QueryVariantCount to 2 or 3
No improvement over Original mode    Queries are already specific and well-formed   Query transformation helps most with short or ambiguous queries

Next Steps
