What is HyDE (Hypothetical Document Embeddings)?


TL;DR

HyDE (Hypothetical Document Embeddings) is a retrieval technique that bridges the gap between how users phrase questions and how answers are written in documents. Instead of embedding the user's question directly and searching for similar text, HyDE first asks the LLM to generate a hypothetical answer to the question, then embeds that hypothetical answer and uses it as the search query. Because the hypothetical answer is written in the same style and vocabulary as actual documents (declarative statements, domain terminology, factual descriptions), it matches far better against the document embeddings in the vector store. HyDE can dramatically improve retrieval recall for questions that are phrased very differently from the source documents. LM-Kit.NET implements HyDE via QueryGenerationMode.HypotheticalAnswer on PdfChat and RagEngine, configurable through HydeOptions.


What Exactly is HyDE?

The fundamental challenge in RAG retrieval is the query-document mismatch problem. Users ask questions; documents contain answers. Questions and answers are linguistically different:

Question style:  "What causes transformers to struggle with long sequences?"
Document style:  "The self-attention mechanism in transformer architectures
                  has quadratic complexity O(n²) with respect to sequence
                  length, which limits practical context window sizes..."

When you embed the question and search for similar document chunks, the embedding model must bridge this stylistic gap. While modern embedding models handle this reasonably well, there are many cases where the gap is too wide: highly technical documents, domain-specific jargon, or questions that approach a topic from an unexpected angle.

HyDE inverts the problem: instead of trying to match a question against documents, it first generates a document-like answer, then matches that answer against real documents:

Standard RAG retrieval:
  User question → [Embed question] → Search vector store
  "What causes memory issues in LLMs?"
       ↓
  Embedding of a question → searches for similar text
       ↓
  May miss documents about "KV-cache growth" or "attention overhead"

HyDE retrieval:
  User question → [LLM generates hypothetical answer] → [Embed answer] → Search
  "What causes memory issues in LLMs?"
       ↓
  LLM generates: "Memory issues in LLMs primarily stem from
  KV-cache growth during inference, which scales linearly with
  sequence length and batch size. The attention mechanism requires
  storing key-value pairs for all previous tokens..."
       ↓
  Embedding of a document-like passage → searches for similar text
       ↓
  Finds actual documents about KV-cache, attention memory, etc.

The hypothetical answer does not need to be factually correct. Its purpose is to be linguistically similar to the real documents, using the same vocabulary, structure, and domain language. This produces an embedding vector that is much closer in vector space to the actual relevant documents.
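At its core, HyDE is a one-line change to the retrieval step: embed an LLM-generated answer instead of the question. A minimal, framework-agnostic sketch (the `generate`, `embed`, and `search` callables are stand-ins for your own model and vector store, not LM-Kit.NET APIs):

```python
# Minimal HyDE sketch. `generate`, `embed`, and `search` are stand-ins
# for whatever LLM, embedding model, and vector store you use.

def standard_retrieve(query, embed, search, k=5):
    # Standard RAG: embed the question itself.
    return search(embed(query), k)

def hyde_retrieve(query, generate, embed, search, k=5):
    # HyDE: ask the LLM for a short, document-like answer first,
    # then embed that answer instead of the question.
    prompt = (
        "Write a short passage that could appear in a document "
        f"answering this question:\n{query}"
    )
    hypothetical = generate(prompt)
    return search(embed(hypothetical), k)
```

Everything downstream of the embedding call is unchanged, which is why HyDE slots into existing pipelines without touching the vector store or the embedding model.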

Why HyDE Works

The effectiveness of HyDE rests on two observations:

  1. LLMs know the vocabulary: Even if the LLM does not have access to your specific documents, it knows the domain terminology and writing style. A hypothetical answer about "transformer memory issues" will naturally use terms like "KV-cache", "attention heads", "VRAM", and "context length", which are exactly the terms in the real documents.

  2. Document-document similarity is easier than question-document similarity: Embedding models are generally better at measuring similarity between two passages of the same type (both declarative text) than between a question and an answer (different rhetorical structures).
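The vocabulary effect can be illustrated with a deliberately crude model: bag-of-words vectors and cosine similarity in place of a real embedding model. Dense embeddings behave far less literally than word overlap, but the direction of the effect is the same:

```python
import math
from collections import Counter

def bow(text):
    # Bag-of-words "embedding": word counts as a sparse vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

doc = bow("kv-cache growth during inference scales with sequence "
          "length and stores key-value pairs for previous tokens")
question = bow("what causes memory issues in llms")
hypothetical = bow("memory issues stem from kv-cache growth during "
                   "inference which scales with sequence length")

# The document-like hypothetical shares far more vocabulary with the
# real document than the question does.
print(cosine(question, doc) < cosine(hypothetical, doc))  # True
```

Here the question shares no terms with the document at all, while the hypothetical answer reuses "kv-cache", "growth", "inference", and "sequence length", pulling its vector toward the document's.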


Why HyDE Matters

  1. Bridges the Query-Document Gap: The most common failure mode in RAG is retrieving the wrong documents. HyDE addresses the root cause: questions and documents are written differently, and embedding similarity does not always bridge this gap.

  2. Improves Recall for Technical Domains: In domains with specialized vocabulary (medical, legal, financial, scientific), users often describe concepts in everyday language while documents use technical terminology. HyDE translates the user's language into domain language before retrieval.

  3. No Training Required: Unlike fine-tuning an embedding model for better query-document matching, HyDE works with any off-the-shelf embedding model. The LLM does the linguistic bridging at inference time.

  4. Complements Other Retrieval Strategies: HyDE can be combined with query contextualization (for multi-turn conversations), multi-query retrieval (for broader coverage), and reranking (for precision). Each technique addresses a different retrieval failure mode.

  5. Works with Small Models: The hypothetical answer generation does not require a frontier model. A capable small language model can produce adequate hypothetical answers because the task requires domain vocabulary, not deep reasoning.


Technical Insights

The HyDE Pipeline

Step 1: Generate hypothetical answer
  Input:  User query + optional system prompt
  Output: A passage that might appear in a document answering this query
  Note:   Factual accuracy is NOT required; linguistic similarity is

Step 2: Embed the hypothetical answer
  Input:  The generated passage
  Output: A dense vector in the same embedding space as the document store

Step 3: Retrieve using the hypothetical embedding
  Input:  The embedding vector
  Output: Top-K most similar document chunks from the vector store

Step 4: Generate final answer from real documents
  Input:  User query + retrieved real documents
  Output: Final answer grounded in actual sources
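The four steps above map directly onto a short end-to-end function. A hedged sketch, with `generate`, `embed`, and `vector_search` standing in for your model calls and store (these are illustrative names, not LM-Kit.NET APIs):

```python
def hyde_answer(query, generate, embed, vector_search, k=5):
    # Step 1: generate a hypothetical answer (accuracy not required,
    # only linguistic similarity to real documents).
    hypothesis = generate(
        "Write one paragraph that could appear in a document "
        f"answering: {query}"
    )
    # Step 2: embed the hypothetical answer, not the question.
    query_vector = embed(hypothesis)
    # Step 3: retrieve the top-k real chunks by vector similarity.
    chunks = vector_search(query_vector, k)
    # Step 4: generate the final answer from the real documents.
    # The hypothesis is discarded here; it never reaches the user.
    context = "\n\n".join(chunks)
    return generate(
        f"Using only these sources:\n{context}\n\nAnswer: {query}"
    )
```

Note that the hypothesis is used only in steps 1-3; the final answer in step 4 is grounded exclusively in the retrieved real chunks.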

When HyDE Helps Most

  • Technical questions asked in everyday language: High impact. Bridges the vocabulary gap.
  • Cross-lingual retrieval (question in one language, documents in another): High impact. The LLM translates before embedding.
  • Short, vague queries ("tell me about performance"): High impact. The LLM adds specificity.
  • Well-formed queries that already match the document language: Low impact. Standard retrieval already works.
  • Factual lookups ("what is the capital of France?"): Low impact. Exact-match retrieval is sufficient.

Limitations and Considerations

  • Latency: HyDE adds one LLM generation step before retrieval. For latency-sensitive applications, this tradeoff must be evaluated. The generation can be kept short (one paragraph) to minimize delay.

  • Hallucination risk in the hypothesis: The hypothetical answer may contain wrong information that biases retrieval toward incorrect documents. This is mitigated by using the hypothetical answer only for embedding similarity, not for answering. The final answer is always generated from the real retrieved documents.

  • Cost: An additional LLM call per query. For high-volume systems, consider whether the retrieval improvement justifies the extra computation. Batch processing can amortize this cost.

  • Not always necessary: If your embedding model already achieves high recall with standard queries, HyDE adds complexity without proportional benefit. Measure retrieval quality before and after to verify improvement.
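The last point is worth operationalizing: before committing to HyDE, compare the two strategies on a small labeled set. A minimal sketch, assuming you have (query, relevant-chunk-ids) pairs; `retrieve` is any function mapping a query and k to a list of chunk ids:

```python
def hit_rate_at_k(retrieve, labeled_queries, k=5):
    # Fraction of queries for which at least one relevant chunk
    # appears in the top-k results.
    # labeled_queries: list of (query, set-of-relevant-chunk-ids).
    hits = 0
    for query, relevant in labeled_queries:
        retrieved = set(retrieve(query, k))
        hits += bool(retrieved & relevant)
    return hits / len(labeled_queries)

# Run the same evaluation twice, once per strategy:
#   baseline = hit_rate_at_k(standard_retrieve, labeled, k=5)
#   with_hyde = hit_rate_at_k(hyde_retrieve, labeled, k=5)
```

If the HyDE number is not meaningfully higher than the baseline for your corpus and query mix, the extra LLM call is not paying for itself.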

HyDE vs. Other Query Expansion Techniques

  • HyDE: generates a hypothetical answer and embeds it. Strength: bridges the question-document style gap.
  • Multi-Query: generates multiple query variants. Strength: covers different phrasings and aspects.
  • Contextualization: rewrites a follow-up into a standalone query. Strength: handles multi-turn conversations.
  • Query expansion: adds synonyms and related terms. Strength: simple and needs no LLM, but limited in effectiveness.

These techniques are complementary. A robust pipeline might contextualize first (resolve references), then apply HyDE or multi-query for improved retrieval.
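One way to compose them, again with stand-in callables (a `contextualize` step that resolves pronouns against chat history, a HyDE retriever over-fetching candidates, and a `rerank` step such as a cross-encoder; none of these names are LM-Kit.NET APIs):

```python
def retrieve_pipeline(query, history, contextualize, hyde_retrieve,
                      rerank, k=20, top_n=5):
    # 1. Resolve references ("it", "that model") against the
    #    conversation so the query stands on its own.
    standalone = contextualize(query, history)
    # 2. Use HyDE to fetch a generous candidate set (recall-oriented).
    candidates = hyde_retrieve(standalone, k)
    # 3. Rerank candidates against the standalone query and keep
    #    the best few (precision-oriented).
    return rerank(standalone, candidates)[:top_n]
```

The ordering matters: contextualization must come first so that both the hypothetical answer and the reranker see a self-contained query, and reranking comes last because it is most useful on a larger, recall-heavy candidate pool.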


Practical Use Cases

  • Enterprise Knowledge Bases: Employees search internal documentation using natural questions, but documents are written in formal, technical style. HyDE translates the casual question into document-like text for better matching. See Build Private Document Q&A.

  • Legal and Compliance Search: Legal questions often use plain language ("Can we fire someone for being late?") while legal documents use precise legal terminology. HyDE generates a passage using legal language, improving retrieval of relevant statutes and precedents.

  • Medical Information Retrieval: Patients describe symptoms in everyday terms; medical documents use clinical terminology. HyDE bridges this vocabulary gap.

  • Code Documentation Search: Developers ask "how do I parse JSON?" while documentation contains method signatures and technical descriptions. HyDE generates a technical passage that better matches the documentation style.

  • Multi-Format Document Q&A: When documents contain technical reports, research papers, and manuals, HyDE generates text in the appropriate style for the domain, improving cross-format retrieval. See Chat with PDF Documents.


Key Terms

  • HyDE (Hypothetical Document Embeddings): A retrieval technique that generates a hypothetical answer to the user's question and uses its embedding as the search query, rather than embedding the question directly.

  • Query-Document Mismatch: The fundamental problem in information retrieval where users' questions are linguistically different from the documents containing the answers.

  • Hypothetical Answer: An LLM-generated passage that approximates what a real document answering the query might look like. Factual accuracy is secondary to linguistic similarity with real documents.

  • Dense Retrieval: Retrieval based on comparing dense vector embeddings, as opposed to sparse methods like keyword matching.

  • Query Expansion: The general category of techniques that modify or augment a query before retrieval, of which HyDE is one approach.


API References

  • PdfChat: PDF-based RAG with QueryGenerationMode.HypotheticalAnswer
  • RagEngine: Core RAG engine supporting HyDE retrieval
  • HydeOptions: Configuration for hypothetical answer generation
  • Embedder: Generates embeddings for both hypothetical and real documents




Summary

HyDE (Hypothetical Document Embeddings) solves one of RAG's most persistent problems: the mismatch between how users phrase questions and how documents phrase answers. By generating a hypothetical answer before retrieval, HyDE produces an embedding vector that is linguistically similar to real documents, dramatically improving retrieval recall for queries that use different vocabulary, style, or structure than the source material. The technique requires no embedding model changes, works with any vector store, and is particularly effective for technical, legal, medical, and other domains with specialized terminology. LM-Kit.NET supports HyDE via QueryGenerationMode.HypotheticalAnswer on PdfChat and RagEngine, with HydeOptions for controlling hypothesis generation. Combined with query contextualization for multi-turn conversations, multi-query retrieval for breadth, MMR for diversity, and reranking for precision, HyDE is a powerful component in a production-grade retrieval pipeline.
