What is Multi-Query Retrieval?


TL;DR

Multi-query retrieval is a RAG technique that improves recall by generating multiple variants of the user's query and running each variant as a separate search. Because a single query captures only one way of asking the question, it may miss relevant documents that use different terminology or approach the topic from a different angle. Multi-query generates several reformulations (e.g., paraphrases, decomposed sub-questions, or different perspectives), retrieves results for each, and merges them with Reciprocal Rank Fusion (RRF) to produce a single, comprehensive result set. This substantially improves recall, while RRF's consensus-based merging helps preserve precision. LM-Kit.NET implements multi-query retrieval via QueryGenerationMode.MultiQuery on PdfChat and RagEngine, configurable through MultiQueryOptions.


What Exactly is Multi-Query Retrieval?

A single search query, no matter how well-formulated, captures only one perspective on the information need. Consider:

User query: "How do I reduce memory usage when running LLMs?"

This query will match documents about:
  ✓ "memory optimization for LLMs"
  ✓ "reducing LLM memory footprint"

But may miss documents about:
  ✗ "quantization reduces model size" (different vocabulary)
  ✗ "KV-cache management strategies" (specific technique)
  ✗ "context window sizing for memory constraints" (indirect relation)
  ✗ "choosing smaller models for limited hardware" (alternative approach)

Multi-query retrieval addresses this by generating multiple query variants that approach the question from different angles:

Original query: "How do I reduce memory usage when running LLMs?"

Generated variants:
  Q1: "What techniques reduce the memory footprint of large language models?"
  Q2: "How does quantization affect LLM memory requirements?"
  Q3: "What is KV-cache and how does it impact memory during inference?"
  Q4: "How to choose model size based on available RAM and VRAM?"

Each variant is searched independently:
  Q1 results: [doc_a, doc_b, doc_c, doc_d, doc_e]
  Q2 results: [doc_f, doc_b, doc_g, doc_h, doc_i]
  Q3 results: [doc_j, doc_k, doc_l, doc_b, doc_m]
  Q4 results: [doc_n, doc_o, doc_b, doc_p, doc_q]

Merged with RRF: [doc_b, doc_a, doc_f, doc_j, doc_n, ...]
  → doc_b appears in all four result sets → ranked highest
  → Results cover quantization, KV-cache, model sizing, and general optimization

The merged result set is far more comprehensive than any single query could produce, covering the topic from multiple angles.
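The merge step above can be sketched with a minimal Reciprocal Rank Fusion implementation (illustrative Python, not the LM-Kit.NET API; k = 60 is the conventional RRF constant):

```python
def rrf_merge(result_sets, k=60):
    """Merge ranked result lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over every result set it
    appears in, so documents found by several variants float to the top.
    """
    scores = {}
    for results in result_sets:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# The four result sets from the example above
merged = rrf_merge([
    ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"],  # Q1
    ["doc_f", "doc_b", "doc_g", "doc_h", "doc_i"],  # Q2
    ["doc_j", "doc_k", "doc_l", "doc_b", "doc_m"],  # Q3
    ["doc_n", "doc_o", "doc_b", "doc_p", "doc_q"],  # Q4
])
# doc_b appears in all four sets, so it is ranked first
```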

Types of Query Variants

Multi-query systems can generate different types of variants:

  1. Paraphrases: Same question, different wording ("reduce memory" → "minimize RAM consumption")
  2. Sub-questions: Decompose complex questions into simpler parts ("How to optimize?" → "What uses memory?" + "How to reduce each?")
  3. Perspective shifts: Approach from different angles ("reduce memory" → "what are the memory requirements?" + "what compression options exist?")
  4. Specificity variations: Both broader and narrower versions of the query
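Variant generation is typically a single LLM call that asks for a numbered list, which is then parsed back into individual queries. A minimal sketch, where the prompt wording and the parsing helper are illustrative assumptions rather than LM-Kit.NET internals:

```python
import re

# Hypothetical prompt template for the variant-generation LLM call
VARIANT_PROMPT = (
    "Generate {n} different search queries that would help answer the "
    "following question from different angles (paraphrases, sub-questions, "
    "perspective shifts). Return one query per line, numbered.\n\n"
    "Question: {question}"
)

def parse_variants(llm_output):
    """Extract the numbered queries from a model response."""
    variants = []
    for line in llm_output.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            variants.append(match.group(1).strip())
    return variants

# Example model response for the memory-usage question
response = """\
1. What techniques reduce the memory footprint of large language models?
2. How does quantization affect LLM memory requirements?
3. How to choose model size based on available RAM and VRAM?"""
queries = parse_variants(response)  # three clean queries, numbers stripped
```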

Why Multi-Query Retrieval Matters

  1. Dramatically Improved Recall: A single query misses documents that use different terminology. Multiple variants cast a wider net, catching relevant documents that any individual query would miss.

  2. Covers Multiple Aspects: Complex questions have multiple facets. "How do I deploy an AI agent in production?" involves infrastructure, monitoring, scaling, error handling, and more. Multi-query naturally decomposes this into searchable sub-topics.

  3. Vocabulary Bridging: Different documents describe the same concept with different terms. Multi-query generates variants that match different terminological conventions, reducing the embedding vocabulary gap.

  4. Robust to Query Formulation: Users do not always formulate the best possible query. Multi-query compensates by exploring alternative formulations, making the system less sensitive to how the user phrases their question.

  5. Composable with Other Techniques: Multi-query works with query contextualization (contextualize first, then expand), HyDE (generate hypothetical answers for each variant), MMR (diversify the merged results), and reranking (rerank the final set).


Technical Insights

The Multi-Query Pipeline

Step 1: Generate query variants
  Input:  Original user query (+ conversation history if contextualized)
  LLM:    "Generate 3-5 different search queries that would help
           answer this question from different angles"
  Output: List of query variants

Step 2: Retrieve independently
  For each variant:
    Embed the variant → Search vector store → Get top-K results

Step 3: Merge results with RRF
  Combine all result sets using Reciprocal Rank Fusion
  Documents appearing in multiple result sets are boosted
  See: Reciprocal Rank Fusion (RRF)

Step 4: (Optional) Apply MMR diversity filtering
  Filter merged results to reduce redundancy
  See: Maximal Marginal Relevance (MMR)

Step 5: (Optional) Rerank
  Use a reranker to refine the final ordering
  See: Reranking

Step 6: Generate answer
  Feed the top results to the LLM along with the original query
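Steps 1 through 3 can be condensed into a short sketch; generate_variants, embed, and vector_search are hypothetical stand-ins for your own LLM and vector-store calls:

```python
def multi_query_retrieve(query, generate_variants, embed, vector_search,
                         top_k=5, k=60):
    """Steps 1-3 of the pipeline: expand, retrieve per variant, merge.

    generate_variants(query)  -> list of query strings (one LLM call)
    embed(text)               -> embedding vector
    vector_search(vec, top_k) -> list of doc ids, best first
    """
    # Step 1: generate query variants (keep the original query too)
    variants = [query] + generate_variants(query)

    # Step 2: retrieve independently for each variant
    result_sets = [vector_search(embed(v), top_k) for v in variants]

    # Step 3: merge with Reciprocal Rank Fusion
    scores = {}
    for results in result_sets:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Tiny stand-ins to show the call shape (a real system would use an
# LLM and a vector store here)
def _variants(q):  return ["How to shrink LLM RAM usage?"]
def _embed(text):  return text  # identity "embedding" for the sketch
def _search(vec, top_k):
    return ["d1", "d2"] if "memory" in vec else ["d2", "d3"]

ranked = multi_query_retrieve("reduce memory usage", _variants, _embed, _search)
# "d2" is returned for both variants, so RRF ranks it first
```

Steps 4 through 6 (MMR, reranking, answer generation) then operate on the merged list.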

How Many Variants?

  Variants   Tradeoff
  2-3        Good balance of recall improvement and latency
  4-5        Better recall, noticeable latency increase
  6+         Diminishing returns; variants become repetitive

The typical recommendation is 3-4 variants for most applications. Beyond 5, the additional variants rarely cover new ground and the latency cost dominates.

Merging Strategy: Why RRF?

The key challenge in multi-query is combining results from multiple searches into a single ranked list. Reciprocal Rank Fusion (RRF) is the standard approach because:

  • Rank-based: It uses rank positions rather than raw similarity scores, which makes it robust to score scale differences between queries
  • Boosts consensus: Documents that appear in multiple result sets receive higher scores, surfacing broadly relevant results
  • Simple and effective: No training or tuning required beyond the standard RRF constant
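Concretely, the RRF score of a document d over n result sets is a sum of reciprocal ranks, where rank_i(d) is d's position in result set i (the term is omitted when d does not appear there) and k is a smoothing constant, conventionally 60:

```latex
\mathrm{RRF}(d) = \sum_{i=1}^{n} \frac{1}{k + \operatorname{rank}_i(d)}
```

The constant k damps the influence of top ranks, so one first-place hit cannot outweigh consistent mid-rank appearances across several result sets.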

Latency Considerations

Multi-query adds latency in two places:

  1. Variant generation: One LLM call to generate 3-4 variants (~200-500ms with a fast model)
  2. Multiple retrievals: 3-4 vector store searches instead of 1

The retrieval step can be parallelized: all variant queries can search the vector store simultaneously, so the wall-clock time for retrieval is approximately the same as a single query. The main latency cost is the variant generation step.

Sequential timing:  Generate (300ms) + Retrieve×4 (200ms each) = 1100ms
Parallel timing:    Generate (300ms) + Retrieve (200ms parallel) = 500ms
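The parallel retrieval step can be sketched with a thread pool, assuming search is an I/O-bound call into the vector store (illustrative code, not the LM-Kit.NET API):

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_parallel(variants, search, top_k=5):
    """Run one vector-store search per variant concurrently.

    Wall-clock time is roughly that of the slowest single search,
    rather than the sum of all of them.
    """
    with ThreadPoolExecutor(max_workers=len(variants)) as pool:
        # map() preserves the order of the input variants
        return list(pool.map(lambda q: search(q, top_k), variants))
```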

Multi-Query vs. HyDE

  Aspect        Multi-Query                          HyDE
  Approach      Multiple question variants           Generate a hypothetical answer
  Fixes         Vocabulary and perspective gaps      Question-document style mismatch
  Best for      Complex, multi-faceted questions     Technical domains with jargon
  Merging       RRF across result sets               Single retrieval from hypothesis
  Combinable?   Yes, apply HyDE to each variant      Yes, generate multiple hypotheses

These techniques are complementary and can be combined: generate multiple query variants, then apply HyDE to each, retrieve for all, and merge with RRF.
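That combination can be sketched as follows; hyde stands in for a call that asks the LLM for a short hypothetical answer, and generate_variants, embed, and vector_search are likewise hypothetical stand-ins for your own LLM and vector-store calls:

```python
def multi_query_hyde(query, generate_variants, hyde, embed, vector_search,
                     top_k=5, k=60):
    """Expand the query, apply HyDE to each variant, then merge with RRF."""
    result_sets = []
    for variant in [query] + generate_variants(query):
        hypothesis = hyde(variant)  # hypothetical answer text for this variant
        result_sets.append(vector_search(embed(hypothesis), top_k))
    # Merge all per-variant result sets with Reciprocal Rank Fusion
    scores = {}
    for results in result_sets:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```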


Practical Use Cases

  • Complex Research Questions: "What factors should I consider when building a production AI system?" has many facets (cost, performance, reliability, security, monitoring). Multi-query decomposes this into specific searches for each facet.

  • Technical Troubleshooting: "Why is my model generating bad outputs?" could be about temperature settings, context length, model quality, prompt engineering, or hallucination. Multi-query explores all these possibilities. See Build RAG Pipeline.

  • Comprehensive Document Analysis: When analyzing a large knowledge base, multi-query ensures that retrieval covers different sections and topics rather than focusing on the single most similar passage. See Chat with PDF Documents.

  • Enterprise Search: Employees use different terminology depending on their department. Multi-query bridges these vocabulary differences, ensuring documents are found regardless of the terms used. See Build Private Document Q&A.

  • Agentic RAG Workflows: AI agents that autonomously search knowledge bases benefit from multi-query because agents formulate queries programmatically and may not produce the optimal phrasing. Multi-query compensates for this. See Agentic RAG.


Key Terms

  • Multi-Query Retrieval: A RAG technique that generates multiple search query variants from a single user question and merges the results for improved recall.

  • Query Variant: An alternative formulation of the original query that approaches the same information need from a different angle or with different vocabulary.

  • Query Expansion: The broader category of techniques that augment or modify queries before retrieval, including multi-query, HyDE, and keyword expansion.

  • Recall: The proportion of all relevant documents that are successfully retrieved. Multi-query primarily improves recall.

  • Precision: The proportion of retrieved documents that are actually relevant. Multi-query maintains precision through RRF's consensus boosting.

  • Reciprocal Rank Fusion (RRF): The standard algorithm for merging ranked result sets from multiple queries. See Reciprocal Rank Fusion.


API References

  • PdfChat: PDF-based RAG with QueryGenerationMode.MultiQuery
  • RagEngine: Core RAG engine supporting multi-query retrieval
  • MultiQueryOptions: Configuration for query variant generation




Summary

Multi-query retrieval is a powerful technique for improving RAG recall by generating multiple search query variants and merging results with Reciprocal Rank Fusion (RRF). A single query only captures one way of asking the question, inevitably missing documents that use different terminology or cover related sub-topics. By generating 3-4 variants that approach the question from different angles, multi-query casts a wider retrieval net while maintaining precision through RRF's consensus-based merging. LM-Kit.NET supports this via QueryGenerationMode.MultiQuery on PdfChat and RagEngine, with MultiQueryOptions for controlling variant generation. In a production pipeline, multi-query combines with query contextualization (for multi-turn conversations), HyDE (for vocabulary bridging), MMR (for diversity), and reranking (for precision) to build a robust retrieval system that handles the full range of user queries effectively.
