What is Multi-Query Retrieval?
TL;DR
Multi-query retrieval is a RAG technique that improves recall by generating multiple variants of the user's query and running each variant as a separate search. Because a single query only captures one way of asking the question, it may miss relevant documents that use different terminology or approach the topic from a different angle. Multi-query generates several reformulations (e.g., paraphrases, decomposed sub-questions, or different perspectives), retrieves results for each, and merges them using Reciprocal Rank Fusion (RRF) to produce a single, comprehensive result set. This substantially improves recall, while RRF's consensus-based merging helps preserve precision. LM-Kit.NET implements multi-query retrieval via QueryGenerationMode.MultiQuery on PdfChat and RagEngine, configurable through MultiQueryOptions.
What Exactly is Multi-Query Retrieval?
A single search query, no matter how well-formulated, captures only one perspective on the information need. Consider:
User query: "How do I reduce memory usage when running LLMs?"
This query will match documents about:
✓ "memory optimization for LLMs"
✓ "reducing LLM memory footprint"
But may miss documents about:
✗ "quantization reduces model size" (different vocabulary)
✗ "KV-cache management strategies" (specific technique)
✗ "context window sizing for memory constraints" (indirect relation)
✗ "choosing smaller models for limited hardware" (alternative approach)
Multi-query retrieval addresses this by generating multiple query variants that approach the question from different angles:
Original query: "How do I reduce memory usage when running LLMs?"
Generated variants:
Q1: "What techniques reduce the memory footprint of large language models?"
Q2: "How does quantization affect LLM memory requirements?"
Q3: "What is KV-cache and how does it impact memory during inference?"
Q4: "How to choose model size based on available RAM and VRAM?"
Each variant is searched independently:
Q1 results: [doc_a, doc_b, doc_c, doc_d, doc_e]
Q2 results: [doc_f, doc_b, doc_g, doc_h, doc_i]
Q3 results: [doc_j, doc_k, doc_l, doc_b, doc_m]
Q4 results: [doc_n, doc_o, doc_b, doc_p, doc_q]
Merged with RRF: [doc_b, doc_a, doc_f, doc_j, doc_n, ...]
→ doc_b appears in all four result sets → ranked highest
→ Results cover quantization, KV-cache, model sizing, and general optimization
The merged result set is far more comprehensive than any single query could produce, covering the topic from multiple angles.
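The merge above can be reproduced with a short Reciprocal Rank Fusion sketch. The `rrf_merge` helper and the constant `k = 60` are illustrative conventions from the RRF literature, not LM-Kit.NET API:

```python
# Merge the four example result sets with Reciprocal Rank Fusion:
# each document scores sum(1 / (k + rank)) across the lists it appears in.
from collections import defaultdict

def rrf_merge(result_sets, k=60):
    """Merge ranked lists; documents in multiple lists accumulate score."""
    scores = defaultdict(float)
    for results in result_sets:
        for rank, doc in enumerate(results, start=1):  # ranks are 1-based
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge([
    ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"],  # Q1
    ["doc_f", "doc_b", "doc_g", "doc_h", "doc_i"],  # Q2
    ["doc_j", "doc_k", "doc_l", "doc_b", "doc_m"],  # Q3
    ["doc_n", "doc_o", "doc_b", "doc_p", "doc_q"],  # Q4
])
print(merged[:5])  # → ['doc_b', 'doc_a', 'doc_f', 'doc_j', 'doc_n']
```

Because doc_b appears in all four lists, its accumulated score beats every document that tops only a single list, matching the merged ordering shown above. The `k = 60` default dampens the advantage of the very top ranks so consensus across lists matters more than any one first-place finish.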
Types of Query Variants
Multi-query systems can generate different types of variants:
- Paraphrases: Same question, different wording ("reduce memory" → "minimize RAM consumption")
- Sub-questions: Decompose complex questions into simpler parts ("How to optimize?" → "What uses memory?" + "How to reduce each?")
- Perspective shifts: Approach from different angles ("reduce memory" → "what are the memory requirements?" + "what compression options exist?")
- Specificity variations: Both broader and narrower versions of the query
Why Multi-Query Retrieval Matters
Dramatically Improved Recall: A single query misses documents that use different terminology. Multiple variants cast a wider net, catching relevant documents that any individual query would miss.
Covers Multiple Aspects: Complex questions have multiple facets. "How do I deploy an AI agent in production?" involves infrastructure, monitoring, scaling, error handling, and more. Multi-query naturally decomposes this into searchable sub-topics.
Vocabulary Bridging: Different documents describe the same concept with different terms. Multi-query generates variants that match different terminological conventions, reducing the embedding vocabulary gap.
Robust to Query Formulation: Users do not always formulate the best possible query. Multi-query compensates by exploring alternative formulations, making the system less sensitive to how the user phrases their question.
Composable with Other Techniques: Multi-query works with query contextualization (contextualize first, then expand), HyDE (generate hypothetical answers for each variant), MMR (diversify the merged results), and reranking (rerank the final set).
Technical Insights
The Multi-Query Pipeline
Step 1: Generate query variants
Input: Original user query (+ conversation history if contextualized)
LLM: "Generate 3-5 different search queries that would help
answer this question from different angles"
Output: List of query variants
Step 2: Retrieve independently
For each variant:
Embed the variant → Search vector store → Get top-K results
Step 3: Merge results with RRF
Combine all result sets using Reciprocal Rank Fusion
Documents appearing in multiple result sets are boosted
See: Reciprocal Rank Fusion (RRF)
Step 4: (Optional) Apply MMR diversity filtering
Filter merged results to reduce redundancy
See: Maximal Marginal Relevance (MMR)
Step 5: (Optional) Rerank
Use a reranker to refine the final ordering
See: Reranking
Step 6: Generate answer
Feed the top results to the LLM along with the original query
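The six steps can be sketched as a minimal pipeline. Here `generate_variants` and `vector_search` are hypothetical stubs standing in for the LLM call and the vector store round trip; they are not LM-Kit.NET APIs:

```python
# Minimal multi-query pipeline sketch with stubbed LLM and vector store.
from collections import defaultdict

def generate_variants(query, n=3):
    # Step 1: in a real system, a single LLM call produces n reformulations.
    return [f"{query} (variant {i + 1})" for i in range(n)]

def vector_search(query, top_k=5):
    # Step 2: in a real system, embed the variant and search the store.
    return [f"doc_{hash(query) % 100}_{i}" for i in range(top_k)]

def rrf_merge(result_sets, k=60):
    # Step 3: Reciprocal Rank Fusion across all result sets.
    scores = defaultdict(float)
    for results in result_sets:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def multi_query_retrieve(query, n_variants=3, top_k=5):
    variants = generate_variants(query, n_variants)            # Step 1
    result_sets = [vector_search(v, top_k) for v in variants]  # Step 2
    merged = rrf_merge(result_sets)                            # Step 3
    # Steps 4-5 (optional MMR filtering and reranking) would go here.
    return merged[:top_k]                                      # Step 6 input

context = multi_query_retrieve("How do I reduce memory usage when running LLMs?")
```

The returned `context` is what Step 6 would feed to the LLM alongside the original query.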
How Many Variants?
| Variant Count | Tradeoff |
|---|---|
| 2-3 | Good balance of recall improvement and latency |
| 4-5 | Better recall, noticeable latency increase |
| 6+ | Diminishing returns; variants become repetitive |
The typical recommendation is 3-4 variants for most applications. Beyond 5, the additional variants rarely cover new ground and the latency cost dominates.
Merging Strategy: Why RRF?
The key challenge in multi-query is combining results from multiple searches into a single ranked list. Reciprocal Rank Fusion (RRF) is the standard approach because:
- Rank-based: It uses rank positions rather than raw similarity scores, which makes it robust to score scale differences between queries
- Boosts consensus: Documents that appear in multiple result sets receive higher scores, surfacing broadly relevant results
- Simple and effective: No training or tuning required beyond the standard RRF constant
Latency Considerations
Multi-query adds latency in two places:
- Variant generation: One LLM call to generate 3-4 variants (~200-500ms with a fast model)
- Multiple retrievals: 3-4 vector store searches instead of 1
The retrieval step can be parallelized: all variant queries can search the vector store simultaneously, so the wall-clock time for retrieval is approximately the same as a single query. The main latency cost is the variant generation step.
Sequential timing: Generate (300ms) + Retrieve×4 (200ms each) = 1100ms
Parallel timing: Generate (300ms) + Retrieve (200ms parallel) = 500ms
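The parallel retrieval step can be sketched with a thread pool; `search_store` below is a stub that simulates a ~200ms vector store round trip, not an LM-Kit.NET API:

```python
# Run all variant searches concurrently so wall-clock time stays near
# the cost of a single search rather than the sum of all of them.
import time
from concurrent.futures import ThreadPoolExecutor

def search_store(variant):
    time.sleep(0.2)  # simulate one ~200ms vector store round trip
    return [f"{variant}:doc_{i}" for i in range(5)]

variants = ["Q1", "Q2", "Q3", "Q4"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(variants)) as pool:
    result_sets = list(pool.map(search_store, variants))  # order preserved
parallel_s = time.perf_counter() - start

# The four 0.2s searches overlap, so parallel_s lands near 0.2s
# instead of the ~0.8s a sequential loop would take.
```

Threads are sufficient here because vector store searches are I/O-bound; an async client would achieve the same overlap.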
Multi-Query vs. HyDE
| Aspect | Multi-Query | HyDE |
|---|---|---|
| Approach | Multiple question variants | Generate hypothetical answer |
| Fixes | Vocabulary and perspective gaps | Question-document style mismatch |
| Best for | Complex, multi-faceted questions | Technical domains with jargon |
| Merging | RRF across result sets | Single retrieval from hypothesis |
| Combinable? | Yes, can apply HyDE to each variant | Yes, can generate multiple hypotheses |
These techniques are complementary and can be combined: generate multiple query variants, then apply HyDE to each, retrieve for all, and merge with RRF.
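One way to wire that combination, with every helper a hypothetical stub rather than an LM-Kit.NET API: generate variants, write a hypothetical answer for each (the HyDE step), retrieve on each hypothesis, and fuse with RRF.

```python
# Sketch of multi-query + HyDE composition with stubbed helpers.
from collections import defaultdict

def generate_variants(query, n=3):
    return [f"{query} [variant {i + 1}]" for i in range(n)]   # stub LLM call

def write_hypothesis(variant):
    return f"Hypothetical answer to: {variant}"               # stub HyDE step

def search_by_text(text, top_k=4):
    return [f"doc_{abs(hash(text)) % 50}_{i}" for i in range(top_k)]  # stub

def rrf_merge(result_sets, k=60):
    scores = defaultdict(float)
    for results in result_sets:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

hypotheses = [write_hypothesis(v) for v in generate_variants("reduce LLM memory")]
merged = rrf_merge([search_by_text(h) for h in hypotheses])
```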
Practical Use Cases
Complex Research Questions: "What factors should I consider when building a production AI system?" has many facets (cost, performance, reliability, security, monitoring). Multi-query decomposes this into specific searches for each facet.
Technical Troubleshooting: "Why is my model generating bad outputs?" could be about temperature settings, context length, model quality, prompt engineering, or hallucination. Multi-query explores all these possibilities. See Build RAG Pipeline.
Comprehensive Document Analysis: When analyzing a large knowledge base, multi-query ensures that retrieval covers different sections and topics rather than focusing on the single most similar passage. See Chat with PDF Documents.
Enterprise Search: Employees use different terminology depending on their department. Multi-query bridges these vocabulary differences, ensuring documents are found regardless of the terms used. See Build Private Document Q&A.
Agentic RAG Workflows: AI agents that autonomously search knowledge bases benefit from multi-query because agents formulate queries programmatically and may not produce the optimal phrasing. Multi-query compensates for this. See Agentic RAG.
Key Terms
Multi-Query Retrieval: A RAG technique that generates multiple search query variants from a single user question and merges the results for improved recall.
Query Variant: An alternative formulation of the original query that approaches the same information need from a different angle or with different vocabulary.
Query Expansion: The broader category of techniques that augment or modify queries before retrieval, including multi-query, HyDE, and keyword expansion.
Recall: The proportion of all relevant documents that are successfully retrieved. Multi-query primarily improves recall.
Precision: The proportion of retrieved documents that are actually relevant. Multi-query maintains precision through RRF's consensus boosting.
Reciprocal Rank Fusion (RRF): The standard algorithm for merging ranked result sets from multiple queries. See Reciprocal Rank Fusion.
Related API Documentation
- PdfChat: PDF-based RAG with QueryGenerationMode.MultiQuery
- RagEngine: Core RAG engine supporting multi-query retrieval
- MultiQueryOptions: Configuration for query variant generation
Related Glossary Topics
- RAG (Retrieval-Augmented Generation): The retrieval framework that multi-query enhances
- Reciprocal Rank Fusion (RRF): The algorithm used to merge multi-query results
- Maximal Marginal Relevance (MMR): Diversity filtering applied after multi-query merging
- HyDE (Hypothetical Document Embeddings): Complementary query expansion technique
- Query Contextualization: Fixes conversational references before multi-query expansion
- Embeddings: The vector representations searched by each query variant
- Reranking: Refines the merged result set for precision
- Chunking: The document segments that multi-query searches against
- Agentic RAG: Agent-driven retrieval that benefits from multi-query expansion
- Context Engineering: Multi-query improves the quality of retrieved context
- Semantic Similarity: The matching mechanism used for each query variant
Related Guides and Demos
- Build RAG Pipeline: End-to-end RAG setup with multi-query support
- Chat with PDF Documents: PDF Q&A with advanced query strategies
- Build Private Document Q&A: Private document search with improved recall
- Improve RAG Results with Reranking: Combine multi-query with reranking
- Optimize RAG with Custom Chunking: Prepare documents for multi-query retrieval
- Single-Turn RAG (CLI): Single-turn RAG demo
External Resources
- RAG-Fusion: a New Take on Retrieval-Augmented Generation (Raudaschl, 2024): Multi-query with RRF for RAG
- Query Rewriting for Retrieval-Augmented Large Language Models (Ma et al., 2023): Query transformation techniques
- Decomposed Prompting: A Modular Approach for Solving Complex Tasks (Khot et al., 2022): Decomposing complex queries into sub-questions
Summary
Multi-query retrieval is a powerful technique for improving RAG recall by generating multiple search query variants and merging results with Reciprocal Rank Fusion (RRF). A single query only captures one way of asking the question, inevitably missing documents that use different terminology or cover related sub-topics. By generating 3-4 variants that approach the question from different angles, multi-query casts a wider retrieval net while maintaining precision through RRF's consensus-based merging. LM-Kit.NET supports this via QueryGenerationMode.MultiQuery on PdfChat and RagEngine, with MultiQueryOptions for controlling variant generation. In a production pipeline, multi-query combines with query contextualization (for multi-turn conversations), HyDE (for vocabulary bridging), MMR (for diversity), and reranking (for precision) to build a robust retrieval system that handles the full range of user queries effectively.