What is HyDE (Hypothetical Document Embeddings)?
TL;DR
HyDE (Hypothetical Document Embeddings) is a retrieval technique that bridges the gap between how users phrase questions and how answers are written in documents. Instead of embedding the user's question directly and searching for similar text, HyDE first asks the LLM to generate a hypothetical answer to the question, then embeds that hypothetical answer and uses it as the search query. Because the hypothetical answer is written in the same style and vocabulary as actual documents (declarative statements, domain terminology, factual descriptions), it matches far better against the document embeddings in the vector store. HyDE can dramatically improve retrieval recall for questions that are phrased very differently from the source documents. LM-Kit.NET implements HyDE via QueryGenerationMode.HypotheticalAnswer on PdfChat and RagEngine, configurable through HydeOptions.
What Exactly is HyDE?
The fundamental challenge in RAG retrieval is the query-document mismatch problem. Users ask questions; documents contain answers. Questions and answers are linguistically different:
Question style: "What causes transformers to struggle with long sequences?"
Document style: "The self-attention mechanism in transformer architectures has quadratic complexity O(n²) with respect to sequence length, which limits practical context window sizes..."
When you embed the question and search for similar document chunks, the embedding model must bridge this stylistic gap. While modern embedding models handle this reasonably well, there are many cases where the gap is too wide: highly technical documents, domain-specific jargon, or questions that approach a topic from an unexpected angle.
HyDE inverts the problem: instead of trying to match a question against documents, it first generates a document-like answer, then matches that answer against real documents:
Standard RAG retrieval:
User question → [Embed question] → Search vector store
"What causes memory issues in LLMs?"
↓
Embedding of a question → searches for similar text
↓
May miss documents about "KV-cache growth" or "attention overhead"
HyDE retrieval:
User question → [LLM generates hypothetical answer] → [Embed answer] → Search
"What causes memory issues in LLMs?"
↓
LLM generates: "Memory issues in LLMs primarily stem from
KV-cache growth during inference, which scales linearly with
sequence length and batch size. The attention mechanism requires
storing key-value pairs for all previous tokens..."
↓
Embedding of a document-like passage → searches for similar text
↓
Finds actual documents about KV-cache, attention memory, etc.
The hypothetical answer does not need to be factually correct. Its purpose is to be linguistically similar to the real documents, using the same vocabulary, structure, and domain language. This produces an embedding vector that is much closer in vector space to the actual relevant documents.
Why HyDE Works
The effectiveness of HyDE rests on two observations:
LLMs know the vocabulary: Even if the LLM does not have access to your specific documents, it knows the domain terminology and writing style. A hypothetical answer about "transformer memory issues" will naturally use terms like "KV-cache", "attention heads", "VRAM", and "context length", which are exactly the terms in the real documents.
Document-document similarity is easier than question-document similarity: Embedding models are generally better at measuring similarity between two passages of the same type (both declarative text) than between a question and an answer (different rhetorical structures).
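To make the second observation concrete, here is a minimal sketch (not tied to any particular library) that compares the raw question and a document-style hypothesis against the same document chunk by cosine similarity. The `Embed` stub stands in for whatever embedding model you use, and the example strings are taken from the illustration above.

```csharp
using System;

static class HydeSimilarityDemo
{
    // Plug in your embedding model here. Queries and documents must be
    // embedded by the same model so the vectors share one space.
    static float[] Embed(string text) =>
        throw new NotImplementedException("Replace with a real embedding call.");

    // Cosine similarity between two dense vectors.
    static double Cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
    }

    static void Main()
    {
        string question = "What causes memory issues in LLMs?";
        string hypothesis =
            "Memory issues in LLMs primarily stem from KV-cache growth during " +
            "inference, which scales with sequence length and batch size.";
        string documentChunk =
            "The key-value cache stores attention keys and values for all previous " +
            "tokens, so its memory footprint grows linearly with context length.";

        double questionToDoc = Cosine(Embed(question), Embed(documentChunk));
        double hypothesisToDoc = Cosine(Embed(hypothesis), Embed(documentChunk));

        // Per the observations above, the document-style hypothesis is expected
        // to score closer to the chunk than the raw question does.
        Console.WriteLine($"question -> doc:   {questionToDoc:F3}");
        Console.WriteLine($"hypothesis -> doc: {hypothesisToDoc:F3}");
    }
}
```

The absolute scores depend on the embedding model; what matters is the relative gap between the question-to-document and hypothesis-to-document similarities.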
Why HyDE Matters
Bridges the Query-Document Gap: The most common failure mode in RAG is retrieving the wrong documents. HyDE addresses the root cause: questions and documents are written differently, and embedding similarity does not always bridge this gap.
Improves Recall for Technical Domains: In domains with specialized vocabulary (medical, legal, financial, scientific), users often describe concepts in everyday language while documents use technical terminology. HyDE translates the user's language into domain language before retrieval.
No Training Required: Unlike fine-tuning an embedding model for better query-document matching, HyDE works with any off-the-shelf embedding model. The LLM does the linguistic bridging at inference time.
Complements Other Retrieval Strategies: HyDE can be combined with query contextualization (for multi-turn conversations), multi-query retrieval (for broader coverage), and reranking (for precision). Each technique addresses a different retrieval failure mode.
Works with Small Models: The hypothetical answer generation does not require a frontier model. A capable small language model can produce adequate hypothetical answers because the task requires domain vocabulary, not deep reasoning.
Technical Insights
The HyDE Pipeline
Step 1: Generate hypothetical answer
- Input: User query + optional system prompt
- Output: A passage that might appear in a document answering this query
- Note: Factual accuracy is NOT required; linguistic similarity is what matters

Step 2: Embed the hypothetical answer
- Input: The generated passage
- Output: A dense vector in the same embedding space as the document store

Step 3: Retrieve using the hypothetical embedding
- Input: The embedding vector
- Output: Top-K most similar document chunks from the vector store

Step 4: Generate final answer from real documents
- Input: User query + retrieved real documents
- Output: Final answer grounded in actual sources
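The four steps can be sketched end to end as follows. This is an illustrative, library-agnostic outline: the LLM, the embedder, and the in-memory chunk store are stand-ins (delegates and a simple record), and the prompt wording is an assumption rather than a prescribed template.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A pre-embedded document chunk in the store.
record Chunk(string Text, float[] Vector);

static class HydePipeline
{
    public static string Answer(
        string userQuery,
        Func<string, string> generate,   // LLM text generation (any model)
        Func<string, float[]> embed,     // embedding model (same space as the store)
        IReadOnlyList<Chunk> store,      // pre-embedded document chunks
        int topK = 5)
    {
        // Step 1: generate a hypothetical answer (accuracy not required,
        // only document-like vocabulary and structure).
        string hypothesis = generate(
            "Write a short passage, as it might appear in a technical document, " +
            $"that answers the question: {userQuery}");

        // Step 2: embed the hypothetical answer instead of the raw question.
        float[] queryVector = embed(hypothesis);

        // Step 3: retrieve the top-K chunks closest to the hypothesis embedding.
        var retrieved = store
            .OrderByDescending(c => Cosine(queryVector, c.Vector))
            .Take(topK)
            .ToList();

        // Step 4: answer the original question from the real retrieved documents,
        // never from the hypothesis itself.
        string context = string.Join("\n---\n", retrieved.Select(c => c.Text));
        return generate(
            $"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {userQuery}");
    }

    static double Cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
    }
}
```

Note that the hypothesis is discarded after step 3: it influences only which chunks are retrieved, while the final answer is grounded in the retrieved text.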
When HyDE Helps Most
| Scenario | Impact |
|---|---|
| Technical questions in everyday language | High: bridges vocabulary gap |
| Cross-lingual retrieval (question in one language, docs in another) | High: LLM translates before embedding |
| Short, vague queries ("tell me about performance") | High: LLM adds specificity |
| Well-formed queries matching document language | Low: standard retrieval already works |
| Factual lookups ("what is the capital of France?") | Low: exact match retrieval is sufficient |
Limitations and Considerations
Latency: HyDE adds one LLM generation step before retrieval. For latency-sensitive applications, this tradeoff must be evaluated. The generation can be kept short (one paragraph) to minimize delay.
Hallucination risk in the hypothesis: The hypothetical answer may contain wrong information that biases retrieval toward incorrect documents. This is mitigated by using the hypothetical answer only for embedding similarity, not for answering. The final answer is always generated from the real retrieved documents.
Cost: An additional LLM call per query. For high-volume systems, consider whether the retrieval improvement justifies the extra computation. Batch processing can amortize this cost.
Not always necessary: If your embedding model already achieves high recall with standard queries, HyDE adds complexity without proportional benefit. Measure retrieval quality before and after to verify improvement.
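One way to run that before/after comparison is sketched below: given a small evaluation set of queries with human-judged relevant chunk IDs, compute recall@K for a standard retriever and a HyDE retriever over the same store. The `EvalItem` record and the retriever delegates are illustrative assumptions, not part of any specific API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One evaluation item: a query plus the IDs of chunks a human judged relevant.
record EvalItem(string Query, HashSet<string> RelevantChunkIds);

static class RetrievalEval
{
    // retrieve: maps a query to the ordered IDs of retrieved chunks
    // (plug in either the standard or the HyDE-based retriever here).
    public static double RecallAtK(
        IEnumerable<EvalItem> evalSet,
        Func<string, IReadOnlyList<string>> retrieve,
        int k)
    {
        double total = 0;
        int count = 0;
        foreach (var item in evalSet)
        {
            var topK = retrieve(item.Query).Take(k).ToHashSet();
            // Fraction of the known-relevant chunks that appear in the top K.
            total += (double)item.RelevantChunkIds.Count(id => topK.Contains(id))
                     / Math.Max(1, item.RelevantChunkIds.Count);
            count++;
        }
        return count == 0 ? 0 : total / count;
    }
}

// Usage: compare the two retrievers on the same evaluation set.
// double baseline = RetrievalEval.RecallAtK(evalSet, standardRetrieve, k: 5);
// double withHyde = RetrievalEval.RecallAtK(evalSet, hydeRetrieve, k: 5);
// Adopt HyDE only if withHyde is meaningfully higher than baseline.
```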
HyDE vs. Other Query Expansion Techniques
| Technique | Approach | Strength |
|---|---|---|
| HyDE | Generate hypothetical answer, embed it | Bridges question-document style gap |
| Multi-Query | Generate multiple query variants | Covers different phrasings and aspects |
| Contextualization | Rewrite follow-up to standalone query | Handles multi-turn conversations |
| Query expansion | Add synonyms and related terms | Simple, no LLM needed, limited effectiveness |
These techniques are complementary. A robust pipeline might contextualize first (resolve references), then apply HyDE or multi-query for improved retrieval.
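A rough sketch of that ordering, with each stage abstracted as a delegate (the stage names are illustrative, not specific API calls): resolve references against the conversation first, then apply HyDE to the standalone query, then retrieve.

```csharp
using System;
using System.Collections.Generic;

static class CombinedQueryPipeline
{
    public static IReadOnlyList<string> Retrieve(
        string followUpQuestion,
        IReadOnlyList<string> conversationHistory,
        Func<string, IReadOnlyList<string>, string> contextualize, // follow-up -> standalone query
        Func<string, string> generateHypothesis,                   // HyDE step
        Func<string, IReadOnlyList<string>> searchByText)          // embeds text, searches the store
    {
        // 1. Resolve pronouns and references against the conversation history.
        string standaloneQuery = contextualize(followUpQuestion, conversationHistory);

        // 2. Generate a document-like hypothetical answer for the standalone query.
        string hypothesis = generateHypothesis(standaloneQuery);

        // 3. Retrieve using the hypothesis text (embedded inside searchByText).
        return searchByText(hypothesis);
    }
}
```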
Practical Use Cases
Enterprise Knowledge Bases: Employees search internal documentation using natural questions, but documents are written in formal, technical style. HyDE translates the casual question into document-like text for better matching. See Build Private Document Q&A.
Legal and Compliance Search: Legal questions often use plain language ("Can we fire someone for being late?") while legal documents use precise legal terminology. HyDE generates a passage using legal language, improving retrieval of relevant statutes and precedents.
Medical Information Retrieval: Patients describe symptoms in everyday terms; medical documents use clinical terminology. HyDE bridges this vocabulary gap.
Code Documentation Search: Developers ask "how do I parse JSON?" while documentation contains method signatures and technical descriptions. HyDE generates a technical passage that better matches the documentation style.
Multi-Format Document Q&A: When documents contain technical reports, research papers, and manuals, HyDE generates text in the appropriate style for the domain, improving cross-format retrieval. See Chat with PDF Documents.
Key Terms
HyDE (Hypothetical Document Embeddings): A retrieval technique that generates a hypothetical answer to the user's question and uses its embedding as the search query, rather than embedding the question directly.
Query-Document Mismatch: The fundamental problem in information retrieval where users' questions are linguistically different from the documents containing the answers.
Hypothetical Answer: An LLM-generated passage that approximates what a real document answering the query might look like. Factual accuracy is secondary to linguistic similarity with real documents.
Dense Retrieval: Retrieval based on comparing dense vector embeddings, as opposed to sparse methods like keyword matching.
Query Expansion: The general category of techniques that modify or augment a query before retrieval, of which HyDE is one approach.
Related API Documentation
- PdfChat: PDF-based RAG with QueryGenerationMode.HypotheticalAnswer
- RagEngine: Core RAG engine supporting HyDE retrieval
- HydeOptions: Configuration for hypothetical answer generation
- Embedder: Generates embeddings for both hypothetical and real documents
Related Glossary Topics
- RAG (Retrieval-Augmented Generation): The core framework that HyDE enhances
- Embeddings: The vector representations at the heart of HyDE's matching
- Query Contextualization: Complementary technique for multi-turn conversations
- Multi-Query Retrieval: Alternative query expansion strategy often combined with HyDE
- Reciprocal Rank Fusion (RRF): Merges results from HyDE and standard retrieval
- Maximal Marginal Relevance (MMR): Diversity filtering applied after HyDE retrieval
- Reranking: Refines HyDE retrieval results for precision
- Chunking: The document segments that HyDE queries search against
- Agentic RAG: Agent-driven retrieval that can selectively apply HyDE
- Hallucination: HyDE intentionally generates potentially inaccurate text, but only for retrieval, not for answering
- Semantic Similarity: The matching mechanism HyDE optimizes
Related Guides and Demos
- Build RAG Pipeline: End-to-end RAG setup where HyDE can be integrated
- Chat with PDF Documents: Document Q&A with advanced query strategies
- Build Private Document Q&A: Private document search enhanced by HyDE
- Improve RAG Results with Reranking: Combine HyDE with reranking for best results
- Optimize RAG with Custom Chunking: Document preparation for HyDE-enhanced retrieval
- Single-Turn RAG (CLI): Single-turn RAG demo
External Resources
- Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE) (Gao et al., 2022): The original HyDE paper
- Query2doc: Query Expansion with Large Language Models (Wang et al., 2023): Related query expansion technique
- Large Language Models are Built-in Autoregressive Search Engines (Ziems et al., 2023): Analysis of LLMs as retrieval tools
Summary
HyDE (Hypothetical Document Embeddings) solves one of RAG's most persistent problems: the mismatch between how users phrase questions and how answers are written in documents. By generating a hypothetical answer before retrieval, HyDE produces an embedding vector that is linguistically similar to real documents, dramatically improving retrieval recall for queries that use different vocabulary, style, or structure than the source material. The technique requires no embedding model changes, works with any vector store, and is particularly effective for technical, legal, medical, and other domains with specialized terminology. LM-Kit.NET supports HyDE via QueryGenerationMode.HypotheticalAnswer on PdfChat and RagEngine, with HydeOptions for controlling hypothesis generation. Combined with query contextualization for multi-turn conversations, multi-query retrieval for breadth, MMR for diversity, and reranking for precision, HyDE is a powerful component in a production-grade retrieval pipeline.