What is Maximal Marginal Relevance (MMR)?


TL;DR

Maximal Marginal Relevance (MMR) is a retrieval technique that balances relevance (how well a document matches the query) against diversity (how different retrieved documents are from each other). Standard similarity search often returns near-duplicate passages that all say the same thing, wasting precious context window space and providing the LLM with redundant information. MMR solves this by iteratively selecting documents that are both relevant to the query and dissimilar from documents already selected. The result is a set of retrieved passages that covers more ground, provides the LLM with broader context, and leads to more complete answers. LM-Kit.NET implements MMR via the MmrLambda parameter on RagEngine and PdfChat, where the lambda value controls the relevance-diversity tradeoff.


What Exactly is Maximal Marginal Relevance?

When you search a vector store for the top-K most similar documents, you often get results like this:

Query: "What are the benefits of edge AI?"

Standard top-5 similarity search:
  1. "Edge AI provides privacy by keeping data local..."        (score: 0.95)
  2. "Running AI on edge devices ensures data privacy..."       (score: 0.93)
  3. "Data privacy is a key advantage of edge computing..."     (score: 0.91)
  4. "Edge AI eliminates network latency for inference..."      (score: 0.87)
  5. "Local inference on edge devices removes latency..."       (score: 0.85)

Problem: Results 1-3 all say the same thing (privacy).
         Results 4-5 both say the same thing (latency).
         Other benefits (cost, offline capability, control) are missed.

All five documents are individually relevant, but collectively they are redundant. The LLM receives the same information three times about privacy and twice about latency, while missing other important aspects entirely.

MMR reranks the results to maximize both relevance and diversity:

MMR-filtered top-5:
  1. "Edge AI provides privacy by keeping data local..."        (most relevant)
  2. "Edge AI eliminates network latency for inference..."      (relevant + different from #1)
  3. "Zero marginal cost at scale makes edge AI economical..."  (relevant + different from #1,#2)
  4. "Edge AI works without internet connectivity..."           (relevant + different from #1-#3)
  5. "Running AI on edge devices ensures data privacy..."       (relevant, somewhat similar to #1)

Now the LLM receives diverse information covering privacy, latency, cost, and offline capability, producing a much more complete answer.

The MMR Formula

MMR selects documents iteratively using a scoring formula that balances two factors:

MMR Score = λ × Relevance(doc, query) - (1-λ) × max(Similarity(doc, selected_docs))

Where:
  λ (lambda)                = tradeoff parameter between 0 and 1
  Relevance(doc, query)     = similarity between the document and the query
  Similarity(doc, selected) = maximum similarity to any already-selected document

Behavior at different λ values:
  • λ = 1.0: pure relevance ranking (no diversity consideration; same as standard search)
  • λ = 0.0: pure diversity (selects the most different documents, regardless of relevance)
  • λ = 0.5: equal weight to relevance and diversity
  • Typical range: λ = 0.5 to 0.8 balances both goals effectively

The algorithm works iteratively: it first selects the most relevant document, then for each subsequent selection, it picks the document that best combines relevance to the query with dissimilarity from the documents already chosen.


Why MMR Matters

  1. Eliminates Redundant Context: Context windows are finite. Every redundant passage wastes tokens that could contain new, useful information. MMR ensures each retrieved passage adds unique value.

  2. More Complete Answers: By covering more aspects of the query, MMR helps the LLM generate answers that address the question from multiple angles rather than repeating the same point.

  3. Better Use of the K Budget: If you retrieve K documents, standard search might give you K variations of the same passage. MMR gives you K genuinely different passages, effectively multiplying the information content of your retrieval.

  4. Reduces Hallucination Risk: When the LLM receives diverse, comprehensive context, it is less likely to hallucinate information to fill gaps. Redundant context leaves gaps in coverage that the LLM might fill with fabricated details.

  5. Simple to Implement and Tune: MMR requires only a single parameter (λ) to control the relevance-diversity tradeoff. This makes it easy to integrate into any RAG pipeline and straightforward to tune for specific use cases.


Technical Insights

How MMR Selection Works Step by Step

Given: Query Q, Candidate documents D = {d1, d2, ..., dn}, K results needed

Step 1: Select d_best = argmax(Similarity(d, Q)) from D
        Add d_best to Selected set S
        Remove d_best from D

Step 2: For each remaining step (until |S| = K):
        For each candidate d in D:
          mmr_score(d) = λ × Sim(d, Q) - (1-λ) × max(Sim(d, s) for s in S)
        Select d_best = argmax(mmr_score(d)) from D
        Add d_best to S, remove from D

Result: S contains K documents that are relevant AND diverse
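The steps above can be sketched in Python. This is an illustrative implementation, not LM-Kit.NET's internal one: cosine similarity over embedding vectors stands in for Sim(·,·), and all names are hypothetical.

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def mmr_select(query_vec, doc_vecs, k, lam=0.7):
    """Return indices of k documents chosen by Maximal Marginal Relevance."""
    candidates = list(range(len(doc_vecs)))
    relevance = [cosine(query_vec, d) for d in doc_vecs]

    # Step 1: seed with the single most relevant document.
    selected = [max(candidates, key=lambda i: relevance[i])]
    candidates.remove(selected[0])

    # Step 2: repeatedly add the candidate with the best MMR score:
    # λ × Sim(d, Q) - (1-λ) × max(Sim(d, s) for s in selected).
    while candidates and len(selected) < k:
        def mmr_score(i):
            max_sim = max(cosine(doc_vecs[i], doc_vecs[j]) for j in selected)
            return lam * relevance[i] - (1 - lam) * max_sim
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With λ = 1.0 this degenerates to pure relevance ordering; lower values increasingly penalize candidates that resemble anything already selected.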

Choosing the Lambda Value

Lambda Value   Behavior                   Best For
0.9 - 1.0      Almost pure relevance      When redundancy is acceptable and precision is critical
0.7 - 0.8      Relevance-biased balance   Most RAG applications (recommended starting point)
0.5 - 0.6      Equal balance              Broad exploration, research queries
0.3 - 0.4      Diversity-biased           When covering all aspects matters more than top relevance
0.0 - 0.2      Almost pure diversity      Rarely useful; may include irrelevant results
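To see where λ moves the crossover point, the toy sketch below (invented numbers, hypothetical names) scores a highly relevant but redundant candidate against a less relevant but novel one at several λ values:

```python
# Candidate A: very relevant (0.90) but nearly duplicates a selected doc (0.95).
# Candidate B: less relevant (0.70) but novel (0.20 max similarity).
# All numbers are invented for illustration.

def mmr_score(relevance, max_sim_to_selected, lam):
    return lam * relevance - (1 - lam) * max_sim_to_selected

for lam in (1.0, 0.8, 0.5, 0.2):
    a = mmr_score(0.90, 0.95, lam)
    b = mmr_score(0.70, 0.20, lam)
    winner = "A (redundant)" if a > b else "B (novel)"
    print(f"lambda={lam}: A={a:.2f}  B={b:.2f}  -> {winner}")
    # At lambda = 1.0 and 0.8 the redundant A wins;
    # at 0.5 and 0.2 the novel B wins.
```

The crossover between 0.8 and 0.5 here is an artifact of the toy numbers, but it illustrates why the 0.5 to 0.8 range is the usual tuning territory.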

MMR in the Retrieval Pipeline

MMR is typically applied as a post-retrieval step. You first retrieve a larger candidate set, then apply MMR to select the final subset:

[Query] → [Retrieve top 20 candidates] → [MMR filter to top 5] → [LLM generates answer]

This two-stage approach works because MMR needs a pool of candidates to select from. Retrieving more candidates than needed gives MMR room to find diverse, relevant passages.
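A minimal sketch of this two-stage shape, assuming a simple in-memory store of embedding vectors (toy data; hypothetical names, not the LM-Kit.NET API):

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

def retrieve_then_mmr(query_vec, store, pool_size, k, lam=0.7):
    """Stage 1: over-retrieve a candidate pool; Stage 2: MMR-filter to k."""
    # Stage 1: candidate pool = top pool_size by query similarity alone.
    pool = sorted(store, key=lambda d: cosine(query_vec, d["vec"]),
                  reverse=True)[:pool_size]

    # Stage 2: greedy MMR selection over the pool.
    selected = []
    while pool and len(selected) < k:
        def score(d):
            rel = cosine(query_vec, d["vec"])
            max_sim = max((cosine(d["vec"], s["vec"]) for s in selected),
                          default=0.0)
            return lam * rel - (1 - lam) * max_sim
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return [d["id"] for d in selected]
```

A pool of roughly 3-4× the final k is a common starting point: large enough that diverse alternatives survive stage 1, small enough that the quadratic MMR loop stays cheap.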

MMR vs. Other Diversity Techniques

Technique          Approach                                               Tradeoff
MMR                Iterative selection balancing relevance and diversity  Tunable λ parameter; well-established
Clustering         Cluster results, take one from each cluster            Fixed diversity; ignores relevance ordering
Deduplication      Remove near-exact duplicates                           Simple, but only handles exact redundancy
Multi-Query + RRF  Multiple queries merged with rank fusion               Diversity through query variety, not result filtering

MMR is complementary to multi-query and RRF. You can first broaden retrieval with multi-query, merge with RRF, and then apply MMR to ensure the final set is diverse.
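As a sketch of that combination, the snippet below implements a generic Reciprocal Rank Fusion merge (k = 60 is the conventional smoothing constant); an MMR pass can then be applied to the fused candidate list. The function name and document ids are hypothetical.

```python
def rrf_merge(rankings, k=60):
    """Merge several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the commonly used smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two hypothetical per-query rankings from a multi-query retrieval step:
fused = rrf_merge([["d1", "d2", "d3"], ["d3", "d1", "d4"]])
# The fused list then serves as the candidate pool for an MMR filter.
```

Documents ranked well by multiple query variants ("d1", "d3") rise to the top of the fused list, which MMR can then diversify.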

Impact on RAG Quality

In retrieval pipelines where documents contain overlapping content (which is common after chunking, since adjacent chunks share context), MMR provides significant quality improvements:

  • Without MMR: Top-5 chunks often come from the same document section, covering one subtopic deeply
  • With MMR: Top-5 chunks span multiple sections and subtopics, giving the LLM a comprehensive view

This is especially important for chunking strategies that use overlapping windows, where adjacent chunks share substantial text and will naturally have high similarity scores.


Practical Use Cases

  • Document Q&A: When users ask broad questions about a document, MMR ensures the retrieved passages cover different sections and aspects rather than returning variations of the same paragraph. See Chat with PDF Documents.

  • Research and Analysis: Researchers asking "What are the key findings?" need diverse passages covering multiple findings, not five variations of the most prominent one.

  • Multi-Document RAG: When the knowledge base contains multiple documents on the same topic, standard retrieval may return passages from only one document. MMR encourages selection across documents, providing multiple perspectives. See Build RAG Pipeline.

  • Summarization: Generating summaries from retrieved context benefits from diverse passages that cover the full scope of the topic. See Build Document Summarization Pipeline.

  • Enterprise Knowledge Search: Internal documentation often has redundant content across wikis, manuals, and guides. MMR filters out the redundancy, presenting the user with distinct, useful results. See Build Private Document Q&A.


Key Terms

  • Maximal Marginal Relevance (MMR): A retrieval reranking algorithm that iteratively selects documents maximizing a combined score of query relevance and dissimilarity from already-selected documents.

  • Lambda (λ): The tradeoff parameter controlling the balance between relevance (λ = 1.0) and diversity (λ = 0.0).

  • Marginal Relevance: The additional, non-redundant information a document provides given what has already been selected.

  • Diversity Filtering: The general practice of ensuring retrieved results are not redundant, of which MMR is the most widely used algorithm.

  • Candidate Pool: The initial set of retrieved documents (typically larger than the final K) from which MMR selects the diverse subset.

  • Redundancy: When multiple retrieved passages convey the same information, wasting context window capacity without adding new knowledge.


Related LM-Kit.NET APIs

  • RagEngine: Core RAG engine with MmrLambda parameter
  • PdfChat: PDF-based RAG with MmrLambda for diversity filtering
  • Embedder: Generates the embeddings used for similarity and diversity calculations




Summary

Maximal Marginal Relevance (MMR) is a simple but powerful technique that prevents RAG retrieval from returning redundant passages. By iteratively selecting documents that are both relevant to the query and dissimilar from already-selected documents, MMR ensures that the context window is filled with diverse, non-overlapping information. The single λ parameter provides intuitive control: higher values favor relevance, lower values favor diversity, with the typical range of 0.5 to 0.8 providing the best balance for most applications. LM-Kit.NET implements MMR via the MmrLambda parameter on RagEngine and PdfChat, making it straightforward to enable diversity filtering in any retrieval pipeline. Combined with multi-query retrieval for broader recall, RRF for result merging, and reranking for precision, MMR is an essential component of production-grade RAG systems that deliver complete, non-redundant answers.
