What is Maximal Marginal Relevance (MMR)?


TL;DR

Maximal Marginal Relevance (MMR) is a retrieval technique that balances relevance (how well a document matches the query) against diversity (how different retrieved documents are from each other). Standard similarity search often returns near-duplicate passages that all say the same thing, wasting precious context window space and providing the LLM with redundant information. MMR solves this by iteratively selecting documents that are both relevant to the query and dissimilar from documents already selected. The result is a set of retrieved passages that covers more ground, provides the LLM with broader context, and leads to more complete answers. LM-Kit.NET implements MMR via the MmrLambda parameter on RagEngine and PdfChat, where the lambda value controls the relevance-diversity tradeoff.


What Exactly is Maximal Marginal Relevance?

When you search a vector store for the top-K most similar documents, you often get results like this:

Query: "What are the benefits of edge AI?"

Standard top-5 similarity search:
  1. "Edge AI provides privacy by keeping data local..."        (score: 0.95)
  2. "Running AI on edge devices ensures data privacy..."       (score: 0.93)
  3. "Data privacy is a key advantage of edge computing..."     (score: 0.91)
  4. "Edge AI eliminates network latency for inference..."      (score: 0.87)
  5. "Local inference on edge devices removes latency..."       (score: 0.85)

Problem: Results 1-3 all say the same thing (privacy).
         Results 4-5 both say the same thing (latency).
         Other benefits (cost, offline capability, control) are missed.

All five documents are individually relevant, but collectively they are redundant. The LLM receives the same information three times about privacy and twice about latency, while missing other important aspects entirely.

MMR reranks the results to maximize both relevance and diversity:

MMR-filtered top-5:
  1. "Edge AI provides privacy by keeping data local..."        (most relevant)
  2. "Edge AI eliminates network latency for inference..."      (relevant + different from #1)
  3. "Zero marginal cost at scale makes edge AI economical..."  (relevant + different from #1,#2)
  4. "Edge AI works without internet connectivity..."           (relevant + different from #1-#3)
  5. "Running AI on edge devices ensures data privacy..."       (relevant, somewhat similar to #1)

Now the LLM receives diverse information covering privacy, latency, cost, and offline capability, producing a much more complete answer.

The MMR Formula

MMR selects documents iteratively using a scoring formula that balances two factors:

MMR Score = λ × Relevance(doc, query) - (1-λ) × max(Similarity(doc, selected_docs))

Where:
  λ (lambda)                = tradeoff parameter between 0 and 1
  Relevance(doc, query)     = similarity between the document and the query
  Similarity(doc, selected) = maximum similarity to any already-selected document

Behavior at different λ values:
  • λ = 1.0: pure relevance ranking (no diversity consideration; same as standard search)
  • λ = 0.0: pure diversity (selects the most different documents, regardless of relevance)
  • λ = 0.5: equal weight to relevance and diversity
  • Typical range: λ = 0.5 to 0.8 balances both goals effectively

The algorithm works iteratively: it first selects the most relevant document, then for each subsequent selection, it picks the document that best combines relevance to the query with dissimilarity from the documents already chosen.


Why MMR Matters

  1. Eliminates Redundant Context: Context windows are finite. Every redundant passage wastes tokens that could contain new, useful information. MMR ensures each retrieved passage adds unique value.

  2. More Complete Answers: By covering more aspects of the query, MMR helps the LLM generate answers that address the question from multiple angles rather than repeating the same point.

  3. Better Use of the K Budget: If you retrieve K documents, standard search might give you K variations of the same passage. MMR gives you K genuinely different passages, effectively multiplying the information content of your retrieval.

  4. Reduces Hallucination Risk: When the LLM receives diverse, comprehensive context, it is less likely to hallucinate information to fill gaps. Redundant context leaves gaps in coverage that the LLM might fill with fabricated details.

  5. Simple to Implement and Tune: MMR requires only a single parameter (λ) to control the relevance-diversity tradeoff. This makes it easy to integrate into any RAG pipeline and straightforward to tune for specific use cases.


Technical Insights

How MMR Selection Works Step by Step

Given: Query Q, Candidate documents D = {d1, d2, ..., dn}, K results needed

Step 1: Select d_best = argmax(Similarity(d, Q)) from D
        Add d_best to Selected set S
        Remove d_best from D

Step 2: For each remaining step (until |S| = K):
        For each candidate d in D:
          mmr_score(d) = λ × Sim(d, Q) - (1-λ) × max(Sim(d, s) for s in S)
        Select d_best = argmax(mmr_score(d)) from D
        Add d_best to S, remove from D

Result: S contains K documents that are relevant AND diverse
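The steps above can be sketched in Python. This is an illustrative implementation, not LM-Kit.NET's internal one: cosine similarity over embedding vectors stands in for Sim(·,·), and all names are hypothetical.

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def mmr_select(query_vec, doc_vecs, k, lam=0.7):
    """Return indices of k documents chosen by Maximal Marginal Relevance."""
    candidates = list(range(len(doc_vecs)))
    relevance = [cosine(query_vec, d) for d in doc_vecs]

    # Step 1: seed with the single most relevant document.
    selected = [max(candidates, key=lambda i: relevance[i])]
    candidates.remove(selected[0])

    # Step 2: repeatedly add the candidate with the best MMR score:
    # λ × Sim(d, Q) - (1-λ) × max(Sim(d, s) for s in selected).
    while candidates and len(selected) < k:
        def mmr_score(i):
            max_sim = max(cosine(doc_vecs[i], doc_vecs[j]) for j in selected)
            return lam * relevance[i] - (1 - lam) * max_sim
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With λ = 1.0 this degenerates to pure relevance ordering; lower values increasingly penalize candidates that resemble anything already selected.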

Choosing the Lambda Value

Lambda Value   Behavior                   Best For
0.9 - 1.0      Almost pure relevance      When redundancy is acceptable and precision is critical
0.7 - 0.8      Relevance-biased balance   Most RAG applications (recommended starting point)
0.5 - 0.6      Equal balance              Broad exploration, research queries
0.3 - 0.4      Diversity-biased           When covering all aspects matters more than top relevance
0.0 - 0.2      Almost pure diversity      Rarely useful; may include irrelevant results
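To see where λ moves the crossover point, the toy sketch below (invented numbers, hypothetical names) scores a highly relevant but redundant candidate against a less relevant but novel one at several λ values:

```python
# Candidate A: very relevant (0.90) but nearly duplicates a selected doc (0.95).
# Candidate B: less relevant (0.70) but novel (0.20 max similarity).
# All numbers are invented for illustration.

def mmr_score(relevance, max_sim_to_selected, lam):
    return lam * relevance - (1 - lam) * max_sim_to_selected

for lam in (1.0, 0.8, 0.5, 0.2):
    a = mmr_score(0.90, 0.95, lam)
    b = mmr_score(0.70, 0.20, lam)
    winner = "A (redundant)" if a > b else "B (novel)"
    print(f"lambda={lam}: A={a:.2f}  B={b:.2f}  -> {winner}")
    # At lambda = 1.0 and 0.8 the redundant A wins;
    # at 0.5 and 0.2 the novel B wins.
```

The crossover between 0.8 and 0.5 here is an artifact of the toy numbers, but it illustrates why the 0.5 to 0.8 range is the usual tuning territory.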

MMR in the Retrieval Pipeline

MMR is typically applied as a post-retrieval step. You first retrieve a larger candidate set, then apply MMR to select the final subset:

[Query] → [Retrieve top 20 candidates] → [MMR filter to top 5] → [LLM generates answer]

This two-stage approach works because MMR needs a pool of candidates to select from. Retrieving more candidates than needed gives MMR room to find diverse, relevant passages.
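A minimal sketch of this two-stage shape, assuming a simple in-memory store of embedding vectors (toy data; hypothetical names, not the LM-Kit.NET API):

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

def retrieve_then_mmr(query_vec, store, pool_size, k, lam=0.7):
    """Stage 1: over-retrieve a candidate pool; Stage 2: MMR-filter to k."""
    # Stage 1: candidate pool = top pool_size by query similarity alone.
    pool = sorted(store, key=lambda d: cosine(query_vec, d["vec"]),
                  reverse=True)[:pool_size]

    # Stage 2: greedy MMR selection over the pool.
    selected = []
    while pool and len(selected) < k:
        def score(d):
            rel = cosine(query_vec, d["vec"])
            max_sim = max((cosine(d["vec"], s["vec"]) for s in selected),
                          default=0.0)
            return lam * rel - (1 - lam) * max_sim
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return [d["id"] for d in selected]
```

A pool of roughly 3-4× the final k is a common starting point: large enough that diverse alternatives survive stage 1, small enough that the quadratic MMR loop stays cheap.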

MMR vs. Other Diversity Techniques

Technique          Approach                                               Tradeoff
MMR                Iterative selection balancing relevance and diversity  Tunable λ parameter; well-established
Clustering         Cluster results, take one from each cluster            Fixed diversity; ignores relevance ordering
Deduplication      Remove near-exact duplicates                           Simple, but only handles exact redundancy
Multi-Query + RRF  Multiple queries merged with rank fusion               Diversity through query variety, not result filtering

MMR is complementary to multi-query and RRF. You can first broaden retrieval with multi-query, merge with RRF, and then apply MMR to ensure the final set is diverse.
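As a sketch of that combination, the snippet below implements a generic Reciprocal Rank Fusion merge (k = 60 is the conventional smoothing constant); an MMR pass can then be applied to the fused candidate list. The function name and document ids are hypothetical.

```python
def rrf_merge(rankings, k=60):
    """Merge several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the commonly used smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two hypothetical per-query rankings from a multi-query retrieval step:
fused = rrf_merge([["d1", "d2", "d3"], ["d3", "d1", "d4"]])
# The fused list then serves as the candidate pool for an MMR filter.
```

Documents ranked well by multiple query variants ("d1", "d3") rise to the top of the fused list, which MMR can then diversify.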

Impact on RAG Quality

In retrieval pipelines where documents contain overlapping content (which is common after chunking, since adjacent chunks share context), MMR provides significant quality improvements:

  • Without MMR: Top-5 chunks often come from the same document section, covering one subtopic deeply
  • With MMR: Top-5 chunks span multiple sections and subtopics, giving the LLM a comprehensive view

This is especially important for chunking strategies that use overlapping windows, where adjacent chunks share substantial text and will naturally have high similarity scores.


Practical Use Cases

  • Document Q&A: When users ask broad questions about a document, MMR ensures the retrieved passages cover different sections and aspects rather than returning variations of the same paragraph. See Chat with PDF Documents.

  • Research and Analysis: Researchers asking "What are the key findings?" need diverse passages covering multiple findings, not five variations of the most prominent one.

  • Multi-Document RAG: When the knowledge base contains multiple documents on the same topic, standard retrieval may return passages from only one document. MMR encourages selection across documents, providing multiple perspectives. See Build RAG Pipeline.

  • Summarization: Generating summaries from retrieved context benefits from diverse passages that cover the full scope of the topic. See Build Document Summarization Pipeline.

  • Enterprise Knowledge Search: Internal documentation often has redundant content across wikis, manuals, and guides. MMR filters out the redundancy, presenting the user with distinct, useful results. See Build Private Document Q&A.


Key Terms

  • Maximal Marginal Relevance (MMR): A retrieval reranking algorithm that iteratively selects documents maximizing a combined score of query relevance and dissimilarity from already-selected documents.

  • Lambda (λ): The tradeoff parameter controlling the balance between relevance (λ = 1.0) and diversity (λ = 0.0).

  • Marginal Relevance: The additional, non-redundant information a document provides given what has already been selected.

  • Diversity Filtering: The general practice of ensuring retrieved results are not redundant, of which MMR is the most widely used algorithm.

  • Candidate Pool: The initial set of retrieved documents (typically larger than the final K) from which MMR selects the diverse subset.

  • Redundancy: When multiple retrieved passages convey the same information, wasting context window capacity without adding new knowledge.


Related LM-Kit.NET APIs

  • RagEngine: Core RAG engine with MmrLambda parameter
  • PdfChat: PDF-based RAG with MmrLambda for diversity filtering
  • Embedder: Generates the embeddings used for similarity and diversity calculations




Summary

Maximal Marginal Relevance (MMR) is a simple but powerful technique that prevents RAG retrieval from returning redundant passages. By iteratively selecting documents that are both relevant to the query and dissimilar from already-selected documents, MMR ensures that the context window is filled with diverse, non-overlapping information. The single λ parameter provides intuitive control: higher values favor relevance, lower values favor diversity, with the typical range of 0.5 to 0.8 providing the best balance for most applications. LM-Kit.NET implements MMR via the MmrLambda parameter on RagEngine and PdfChat, making it straightforward to enable diversity filtering in any retrieval pipeline. Combined with multi-query retrieval for broader recall, RRF for result merging, and reranking for precision, MMR is an essential component of production-grade RAG systems that deliver complete, non-redundant answers.
