Class Bm25RetrievalStrategy
A retrieval strategy that scores partitions using the BM25+ ranking function with proximity-aware boosting, measuring lexical relevance based on term frequency, document length, and query term proximity.
public sealed class Bm25RetrievalStrategy : IRetrievalStrategy
- Inheritance
- object
Bm25RetrievalStrategy
- Implements
- IRetrievalStrategy
Examples
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Retrieval.Bm25;

LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
RagEngine ragEngine = new RagEngine(embeddingModel);

// Use pure BM25 keyword retrieval.
ragEngine.RetrievalStrategy = new Bm25RetrievalStrategy
{
    K1 = 1.5f,
    B = 0.6f,
    Language = Language.English
};

ragEngine.ImportText("The quick brown fox jumps over the lazy dog.", "docs", "animals");
var results = await ragEngine.QueryAsync("brown fox");
Remarks
BM25+ excels at exact keyword matching and complements vector search, which captures semantic similarity. An inverted index is built lazily on the first query and rebuilt automatically when the underlying data changes.
Only TextPartition instances are indexed; image partitions are skipped because they contain no textual content for lexical matching.
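The lazy build-and-rebuild pattern described above can be pictured with a small standalone sketch. Everything in it (the LazyInvertedIndex class, the BuildCount counter, whitespace tokenization) is hypothetical and only illustrates the general idea; it is not the library's internal implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of a lazily built inverted index; not LM-Kit's internal code.
public sealed class LazyInvertedIndex
{
    private readonly List<string> _documents = new List<string>();
    private Dictionary<string, List<int>> _postings; // null means "needs a (re)build"

    // Counts how many times the index was actually built.
    public int BuildCount { get; private set; }

    public void Add(string text)
    {
        _documents.Add(text);
        _postings = null; // data changed: drop the cached index
    }

    // Plays the role of an explicit invalidation call.
    public void Invalidate() => _postings = null;

    public IReadOnlyList<int> Lookup(string term)
    {
        if (_postings == null)
        {
            Build(); // built on the first query, or after a change
        }
        return _postings.TryGetValue(term.ToLowerInvariant(), out var ids)
            ? ids
            : (IReadOnlyList<int>)Array.Empty<int>();
    }

    private void Build()
    {
        _postings = new Dictionary<string, List<int>>();
        for (int id = 0; id < _documents.Count; id++)
        {
            var tokens = _documents[id].ToLowerInvariant()
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (var token in new HashSet<string>(tokens))
            {
                if (!_postings.TryGetValue(token, out var list))
                {
                    list = new List<int>();
                    _postings[token] = list;
                }
                list.Add(id);
            }
        }
        BuildCount++;
    }
}

public static class LazyIndexDemo
{
    public static void Main()
    {
        var index = new LazyInvertedIndex();
        index.Add("the quick brown fox");
        index.Add("the lazy dog");
        Console.WriteLine(index.BuildCount);            // 0: nothing built yet
        Console.WriteLine(index.Lookup("fox").Count);   // 1
        index.Lookup("dog");
        Console.WriteLine(index.BuildCount);            // 1: second query reused the index
        index.Add("a brown dog");
        Console.WriteLine(index.Lookup("brown").Count); // 2
        Console.WriteLine(index.BuildCount);            // 2: rebuilt after the data changed
    }
}
```

The point of the pattern is that adding data is cheap and the indexing cost is paid only when a query actually needs it.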
The BM25+ variant adds a configurable Delta floor to the term frequency component, preventing excessive penalization of long documents. When multiple query terms appear close together in a document, the ProximityWeight parameter controls how much this phrase-like co-occurrence boosts the score.
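For reference, the textbook BM25+ score has the form below. The mapping of K1, B, and Delta onto $k_1$, $b$, and $\delta$ is an assumption based on the property names. The symbols $f(t,d)$ (frequency of term $t$ in document $d$), $|d|$ (document length), and $\mathrm{avgdl}$ (average document length) are standard notation rather than names from this API. The exact IDF variant and the way the proximity boost is combined with this sum are not specified on this page.

```latex
\mathrm{score}(q, d) \;=\; \sum_{t \in q} \mathrm{IDF}(t)
\left[
  \frac{(k_1 + 1)\, f(t, d)}
       {f(t, d) + k_1 \left( 1 - b + b \, \frac{|d|}{\mathrm{avgdl}} \right)}
  \;+\; \delta
\right]
```

Because $\delta$ is added after length normalization, every matched term contributes at least $\mathrm{IDF}(t)\,\delta$, no matter how long the document is. This is the lower bound that keeps long documents from being over-penalized.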
Set the Language property to match the language of your corpus. This selects the appropriate stopword list for filtering high-frequency function words and controls whether suffix stemming is applied during tokenization. The default is English.
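As a rough picture of what stopword filtering does during tokenization, here is a minimal standalone sketch. The TokenizerSketch class, its six-word stopword list, and the regex-based word splitting are all hypothetical, and stemming is left out entirely. The library's actual tokenizer and stopword lists may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical tokenizer sketch; not the library's tokenizer.
public static class TokenizerSketch
{
    // Tiny illustrative English stopword list; a real list is much longer.
    private static readonly HashSet<string> EnglishStopWords =
        new HashSet<string> { "the", "a", "an", "of", "over", "and" };

    public static List<string> Tokenize(string text, IEnumerable<string> customStopWords = null)
    {
        // Custom stopwords are applied in addition to the language list.
        var stop = new HashSet<string>(EnglishStopWords);
        if (customStopWords != null)
        {
            stop.UnionWith(customStopWords.Select(w => w.ToLowerInvariant()));
        }

        // Lowercase, split into runs of letters/digits, then drop stopwords.
        return Regex.Matches(text.ToLowerInvariant(), @"[\p{L}\p{N}]+")
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(token => !stop.Contains(token))
            .ToList();
    }
}

public static class TokenizerDemo
{
    public static void Main()
    {
        const string text = "The quick brown fox jumps over the lazy dog.";

        // prints: quick brown fox jumps lazy dog
        Console.WriteLine(string.Join(" ", TokenizerSketch.Tokenize(text)));

        // prints: quick brown fox jumps dog
        Console.WriteLine(string.Join(" ", TokenizerSketch.Tokenize(text, new[] { "lazy" })));
    }
}
```

Removing function words such as "the" keeps them from dominating term statistics, which is why the stopword list should match the corpus language.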
Scores are normalized to [0, 1] via sigmoid scaling so that the standard minScore threshold works consistently.
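The exact scaling function is not given on this page. The sketch below uses one plausible sigmoid-shaped mapping, purely as an assumption, to show why squashing unbounded BM25 scores into [0, 1] makes a fixed threshold meaningful. The Normalize function and its scale parameter are hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical normalization sketch; the library's actual formula may differ.
public static class ScoreNormalizationSketch
{
    // Maps a non-negative raw score into [0, 1] with a rescaled logistic curve.
    // Equivalent to tanh(rawScore / (2 * scale)); 'scale' is an illustrative knob.
    public static double Normalize(double rawScore, double scale)
    {
        if (rawScore <= 0)
        {
            return 0.0;
        }
        return 2.0 / (1.0 + Math.Exp(-rawScore / scale)) - 1.0;
    }

    public static void Main()
    {
        double[] rawScores = { 0.0, 2.0, 8.0 };
        const double minScore = 0.5;

        foreach (double raw in rawScores)
        {
            double normalized = Normalize(raw, scale: 4.0);
            string verdict = normalized >= minScore ? "kept" : "dropped";
            Console.WriteLine(string.Format(
                CultureInfo.InvariantCulture,
                "{0:F1} -> {1:F3} ({2})", raw, normalized, verdict));
        }
    }
}
```

With these inputs, raw scores of 0, 2, and 8 map to roughly 0.000, 0.245, and 0.762, so only the last one passes a minScore of 0.5. The mapping is monotonic, so it changes the scale of the scores without changing their ranking.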
Fields
- DefaultDelta
The default value for the Delta parameter.
- DefaultProximityWeight
The default value for the ProximityWeight parameter.
Properties
- B
Gets or sets the BM25 length normalization parameter.
- CustomStopWords
Gets or sets custom stopwords to filter during tokenization in addition to the language-specific stopwords selected by Language.
- Delta
Gets or sets the BM25+ lower-bound delta applied to the term frequency component.
- K1
Gets or sets the BM25 term saturation parameter.
- Language
Gets or sets the language used for stopword filtering and stemming during BM25 tokenization.
- ProximityWeight
Gets or sets the weight applied to the proximity boosting factor.
- RequiresQueryVector
Gets a value indicating whether the strategy requires a query embedding vector.
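To make the roles of K1, B, and Delta more concrete, the following standalone sketch computes the textbook BM25+ contribution of a single term. It assumes that these properties correspond to the usual k1, b, and delta of BM25+. The Bm25PlusSketch class and its parameter names are hypothetical, the proximity boost is not modeled, and the delta value of 1.0 is for illustration only (the value of DefaultDelta is not shown on this page).

```csharp
using System;

// Textbook BM25+ per-term score; a sketch, not the library's implementation.
public static class Bm25PlusSketch
{
    public static double TermScore(
        double idf, double termFrequency,
        double docLength, double avgDocLength,
        double k1, double b, double delta)
    {
        if (termFrequency <= 0)
        {
            return 0.0; // the term does not occur, so it contributes nothing
        }

        // b = 0 disables length normalization; b = 1 applies it fully.
        double lengthNorm = 1.0 - b + b * (docLength / avgDocLength);

        // k1 controls how quickly repeated occurrences saturate (upper bound: k1 + 1).
        double saturated = (k1 + 1.0) * termFrequency / (termFrequency + k1 * lengthNorm);

        // delta guarantees a minimum contribution for any matched term.
        return idf * (saturated + delta);
    }

    public static void Main()
    {
        const double k1 = 1.5, b = 0.6; // same values as the example on this page

        // Saturation: ten occurrences score far less than ten times one occurrence.
        Console.WriteLine(TermScore(1.0, 1, 100, 100, k1, b, 0.0));   // about 1.00
        Console.WriteLine(TermScore(1.0, 10, 100, 100, k1, b, 0.0));  // about 2.17

        // Length normalization: the same match in a 10x longer document scores lower,
        // and a positive delta lifts it back above a fixed floor.
        Console.WriteLine(TermScore(1.0, 1, 1000, 100, k1, b, 0.0));  // about 0.24
        Console.WriteLine(TermScore(1.0, 1, 1000, 100, k1, b, 1.0));  // about 1.24
    }
}
```

In this model, raising K1 lets repeated terms keep adding to the score for longer, raising B penalizes long documents more strongly, and raising Delta narrows the gap between long and short documents that both contain the term.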
Methods
- InvalidateIndex()
Explicitly invalidates the cached BM25 index, forcing a rebuild on the next query.
- RetrieveAsync(IReadOnlyList<DataSource>, string, float[], int, float, bool, bool, DataFilter, CancellationToken)
Retrieves matching partitions from the given data sources.