Understanding Semantic Similarity in LM-Kit.NET
TL;DR
Semantic Similarity measures how alike two pieces of text are in meaning, not just word overlap. Using vector embeddings, semantic similarity captures conceptual relationships that keyword matching misses, such as understanding that "car" and "automobile" are similar despite being different words. In LM-Kit.NET, semantic similarity powers RAG retrieval, reranking, text matching, and document clustering through the EmbeddingGenerator, TextMatcher, and DataSource classes, enabling applications to find relevant content based on meaning rather than exact matches.
What is Semantic Similarity?
Definition: Semantic Similarity is a measure of how closely related two texts are in terms of their meaning or conceptual content. Unlike lexical similarity (which counts shared words), semantic similarity understands that:
- "The cat sat on the mat" is similar to "A feline rested on the rug"
- "Bank" (financial) is different from "bank" (river) based on context
- "Happy" relates to "joyful" more than to "blue"
Lexical vs Semantic Matching
+-------------------------------------------------------------------------+
| Lexical vs Semantic Similarity |
+-------------------------------------------------------------------------+
| |
| Query: "How to fix a car engine" |
| |
| LEXICAL MATCHING (keyword-based): |
| +-------------------------------------------------------------------+ |
| | Matches documents containing: "fix", "car", "engine" | |
| | | |
| | Match: "Fix your car engine in 5 steps" | |
| | Miss: "Automobile motor repair guide" (no shared words) | |
| | Miss: "Vehicle troubleshooting manual" (no shared words) | |
| +-------------------------------------------------------------------+ |
| |
| SEMANTIC MATCHING (meaning-based): |
| +-------------------------------------------------------------------+ |
| | Matches documents with similar MEANING | |
| | | |
| | Match: "Fix your car engine in 5 steps" (direct match) | |
| | Match: "Automobile motor repair guide" (synonyms understood) | |
| | Match: "Vehicle troubleshooting manual" (concept overlap) | |
| | Miss: "Car dealership locations" (different intent) | |
| +-------------------------------------------------------------------+ |
| |
+-------------------------------------------------------------------------+
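The lexical half of this comparison is easy to reproduce in code. The sketch below scores the same document titles against the query by raw word overlap (Jaccard similarity over lowercased word sets), showing why pure keyword matching misses the synonym documents. This is plain C# with no LM-Kit types; the scoring function is a simplified stand-in (real keyword search would also stem words and drop stop words).

```csharp
using System;
using System.Linq;

// Jaccard similarity over lowercased word sets: |A ∩ B| / |A ∪ B|
static double JaccardOverlap(string a, string b)
{
    var setA = a.ToLowerInvariant().Split(' ').ToHashSet();
    var setB = b.ToLowerInvariant().Split(' ').ToHashSet();
    return (double)setA.Intersect(setB).Count() / setA.Union(setB).Count();
}

var query = "How to fix a car engine";
string[] docs =
{
    "Fix your car engine in 5 steps",
    "Automobile motor repair guide",
    "Vehicle troubleshooting manual"
};

foreach (var doc in docs)
{
    Console.WriteLine($"{doc}: {JaccardOverlap(query, doc):F2}");
}
// "Fix your car engine in 5 steps" shares 3 words with the query (score 0.30);
// the two synonym documents share none and score 0.00 despite matching in meaning
```

Semantic matching closes exactly this gap: an embedding model maps "car engine" and "automobile motor" to nearby vectors even though they share no surface tokens.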
How Semantic Similarity Works
+-------------------------------------------------------------------------+
| Semantic Similarity Pipeline |
+-------------------------------------------------------------------------+
| |
| Text A: "Machine learning algorithms" |
| Text B: "AI neural network models" |
| |
| | | |
| v v |
| +-----------+ +-----------+ |
| | Embedding | | Embedding | |
| | Model | | Model | |
| +-----------+ +-----------+ |
| | | |
| v v |
| [0.23, -0.15, 0.87, ...] [0.21, -0.12, 0.84, ...] |
| (Vector A: 768 dimensions) (Vector B: 768 dimensions) |
| |
| | |
| v |
| +----------------+ |
| | Cosine | |
| | Similarity | |
| +----------------+ |
| | |
| v |
| Score: 0.92 |
| (Highly similar) |
| |
+-------------------------------------------------------------------------+
Similarity Metrics
Common Distance/Similarity Functions
| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine Similarity | A·B / (\|A\| × \|B\|) | -1 to 1 | Normalized text comparison |
| Dot Product | A·B | -∞ to +∞ | When magnitudes matter |
| Euclidean Distance | √Σ(Aᵢ-Bᵢ)² | 0 to ∞ | Absolute difference |
| Manhattan Distance | Σ\|Aᵢ-Bᵢ\| | 0 to ∞ | Sparse vectors |
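Each metric in the table maps to a few lines of code. Here is a minimal sketch of all four functions in plain C# (no LM-Kit types assumed), applied to a vector and a scaled copy of itself to highlight how the metrics differ:

```csharp
using System;

// Dot product: Σ aᵢ·bᵢ, sensitive to vector magnitude
static float Dot(float[] a, float[] b)
{
    float sum = 0;
    for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
    return sum;
}

static float Magnitude(float[] v) => MathF.Sqrt(Dot(v, v));

// Cosine similarity: dot product normalized by both magnitudes
static float Cosine(float[] a, float[] b) => Dot(a, b) / (Magnitude(a) * Magnitude(b));

// Euclidean distance: straight-line distance between vector endpoints
static float Euclidean(float[] a, float[] b)
{
    float sum = 0;
    for (int i = 0; i < a.Length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return MathF.Sqrt(sum);
}

// Manhattan distance: sum of per-dimension absolute differences
static float Manhattan(float[] a, float[] b)
{
    float sum = 0;
    for (int i = 0; i < a.Length; i++) sum += MathF.Abs(a[i] - b[i]);
    return sum;
}

var a = new float[] { 1f, 2f, 2f };
var b = new float[] { 2f, 4f, 4f }; // same direction, twice the magnitude

Console.WriteLine($"Cosine:    {Cosine(a, b):F2}");    // 1.00
Console.WriteLine($"Dot:       {Dot(a, b):F2}");       // 18.00
Console.WriteLine($"Euclidean: {Euclidean(a, b):F2}"); // 3.00
Console.WriteLine($"Manhattan: {Manhattan(a, b):F2}"); // 5.00
```

Because cosine ignores magnitude, the vector and its scaled copy score a perfect 1.0, while dot product, Euclidean, and Manhattan all register the size difference. This is why cosine is the default for comparing text embeddings.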
Cosine Similarity Explained
+-------------------------------------------------------------------------+
| Cosine Similarity |
+-------------------------------------------------------------------------+
| |
| Measures the angle between two vectors, ignoring magnitude |
| |
| Vector B |
| / |
| / angle = small |
| / cos(angle) = high |
| / similarity = HIGH |
| Vector A /________________ |
| |
| |
| Vector B |
| | |
| | angle = large |
| | cos(angle) = low |
| | similarity = LOW |
| Vector A ____| |
| |
| Score Interpretation: |
| - 1.0: Identical meaning |
| - 0.8-0.9: Very similar |
| - 0.5-0.7: Somewhat related |
| - 0.0-0.4: Different topics |
| - Negative: Opposite meanings (rare in practice) |
| |
+-------------------------------------------------------------------------+
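The angle-to-score relationship sketched in the diagram can be checked numerically. This toy example uses 2D unit vectors at known angles, so the dot product alone gives the cosine score; it illustrates only the geometry, not a real embedding comparison:

```csharp
using System;

// Cosine similarity between a fixed unit vector and a second unit
// vector rotated by the given angle; both have length 1, so the
// dot product needs no normalization
static float ScoreAtAngle(double degrees)
{
    double r = degrees * Math.PI / 180.0;
    float[] a = { 1f, 0f };
    float[] b = { (float)Math.Cos(r), (float)Math.Sin(r) };
    return a[0] * b[0] + a[1] * b[1];
}

Console.WriteLine($"  0 deg: {ScoreAtAngle(0):F2}");   // 1.00  identical direction
Console.WriteLine($" 45 deg: {ScoreAtAngle(45):F2}");  // 0.71  similar
Console.WriteLine($" 90 deg: {ScoreAtAngle(90):F2}");  // 0.00  unrelated
Console.WriteLine($"180 deg: {ScoreAtAngle(180):F2}"); // -1.00 opposite
```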
Semantic Similarity in LM-Kit.NET
Generating Embeddings
using LMKit.Model;
using LMKit.Embeddings;
// Load an embedding model
var embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");
// Create embedding generator
var embedder = new EmbeddingGenerator(embeddingModel);
// Generate embeddings for texts
var embedding1 = embedder.GenerateEmbedding("Machine learning algorithms");
var embedding2 = embedder.GenerateEmbedding("AI neural network models");
Console.WriteLine($"Embedding dimension: {embedding1.Length}");
Computing Similarity
using LMKit.Embeddings;
// Generate embeddings
var vec1 = embedder.GenerateEmbedding("The quick brown fox");
var vec2 = embedder.GenerateEmbedding("A fast auburn canine");
var vec3 = embedder.GenerateEmbedding("Stock market analysis");
// Compute cosine similarity
float similarity12 = CosineSimilarity(vec1, vec2);
float similarity13 = CosineSimilarity(vec1, vec3);
Console.WriteLine($"Fox sentences: {similarity12:F3}"); // High (~0.85)
Console.WriteLine($"Fox vs stocks: {similarity13:F3}"); // Low (~0.15)
static float CosineSimilarity(float[] a, float[] b)
{
float dot = 0, magA = 0, magB = 0;
for (int i = 0; i < a.Length; i++)
{
dot += a[i] * b[i];
magA += a[i] * a[i];
magB += b[i] * b[i];
}
return dot / (MathF.Sqrt(magA) * MathF.Sqrt(magB));
}
Text Matching with TextMatcher
using LMKit.TextAnalysis;
var model = LM.LoadFromModelID("gemma3:4b");
var embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");
// Create text matcher
var matcher = new TextMatcher(model, new EmbeddingGenerator(embeddingModel));
// Add reference texts
matcher.AddReference("Invoice processing automation");
matcher.AddReference("Customer support chatbot");
matcher.AddReference("Data analytics dashboard");
matcher.AddReference("Inventory management system");
// Find most similar
var query = "Automated billing and payment handling";
var matches = matcher.FindSimilar(query, topK: 3);
foreach (var match in matches)
{
Console.WriteLine($"{match.Text}: {match.Score:F3}");
}
// "Invoice processing automation" will score highest
RAG with Semantic Search
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Embeddings;
var model = LM.LoadFromModelID("gemma3:12b");
var embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");
// Create knowledge base with semantic search
var dataSource = new DataSource(new EmbeddingGenerator(embeddingModel));
// Index documents (embeddings computed automatically)
await dataSource.AddDocumentAsync("company_policies.pdf");
await dataSource.AddDocumentAsync("product_manual.docx");
await dataSource.AddDocumentAsync("faq.md");
// Semantic search
var results = await dataSource.SearchAsync(
"What is the refund policy?",
topK: 5,
minScore: 0.5f // Minimum similarity threshold
);
foreach (var result in results)
{
    // Math.Min guards against content shorter than 100 characters
    Console.WriteLine($"[{result.Score:F3}] {result.DocumentName}: {result.Content[..Math.Min(100, result.Content.Length)]}...");
}
Similarity-Based Clustering
using LMKit.Embeddings;
// Generate embeddings for documents
var documents = new[]
{
"Machine learning for image recognition",
"Deep learning neural networks",
"Stock market prediction algorithms",
"Financial portfolio optimization",
"Computer vision applications",
"Investment risk analysis"
};
var embeddings = documents
.Select(d => embedder.GenerateEmbedding(d))
.ToArray();
// Compute similarity matrix
var similarityMatrix = new float[documents.Length, documents.Length];
for (int i = 0; i < documents.Length; i++)
{
for (int j = 0; j < documents.Length; j++)
{
similarityMatrix[i, j] = CosineSimilarity(embeddings[i], embeddings[j]);
}
}
// Documents cluster into: ML/AI group and Finance group
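That final comment can be made concrete with a simple threshold-based grouping pass over the similarity matrix. The sketch below is a greedy one-pass clusterer; the matrix values are hardcoded stand-ins for the embedding similarities computed above, and a production system would use a proper algorithm such as k-means or agglomerative clustering instead:

```csharp
using System;
using System.Collections.Generic;

// Greedy clustering: assign each item to the first cluster whose seed
// it resembles above the threshold, otherwise start a new cluster
static List<List<int>> GreedyCluster(float[,] sim, float threshold)
{
    var clusters = new List<List<int>>();
    int n = sim.GetLength(0);
    for (int i = 0; i < n; i++)
    {
        var home = clusters.Find(c => sim[c[0], i] >= threshold);
        if (home != null) home.Add(i);
        else clusters.Add(new List<int> { i });
    }
    return clusters;
}

// Stand-in similarity matrix for the six documents above
// (indices 0, 1, 4 are the ML/AI topics; 2, 3, 5 are the finance topics)
var sim = new float[,]
{
    { 1.0f, 0.8f, 0.2f, 0.1f, 0.7f, 0.2f },
    { 0.8f, 1.0f, 0.3f, 0.2f, 0.8f, 0.1f },
    { 0.2f, 0.3f, 1.0f, 0.8f, 0.2f, 0.7f },
    { 0.1f, 0.2f, 0.8f, 1.0f, 0.1f, 0.9f },
    { 0.7f, 0.8f, 0.2f, 0.1f, 1.0f, 0.2f },
    { 0.2f, 0.1f, 0.7f, 0.9f, 0.2f, 1.0f }
};

var clusters = GreedyCluster(sim, 0.6f);
foreach (var c in clusters)
{
    Console.WriteLine(string.Join(", ", c)); // prints "0, 1, 4" then "2, 3, 5"
}
```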
Applications of Semantic Similarity
1. Retrieval-Augmented Generation (RAG)
Find relevant context for grounded responses:
var ragEngine = new RagEngine(model, dataSource);
// Semantic search finds relevant documents
var response = await ragEngine.GenerateAsync(
"Explain the return process",
CancellationToken.None
);
2. Duplicate Detection
Find near-duplicate content:
// Detect semantically similar tickets.
// GetOpenTickets and ComputeSimilarity are application-specific helpers:
// ComputeSimilarity would embed both texts and return their cosine
// similarity; Text and Id are fields of a hypothetical Ticket type.
var newTicket = "Cannot log into my account after password reset";
var existingTickets = await GetOpenTickets();
foreach (var ticket in existingTickets)
{
    var similarity = ComputeSimilarity(newTicket, ticket.Text);
if (similarity > 0.85f)
{
Console.WriteLine($"Possible duplicate: {ticket.Id}");
}
}
3. Semantic Search
Search by meaning, not keywords:
// Query: "feeling sad and hopeless"
// Finds: "depression symptoms", "mental health support", "emotional wellness"
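With precomputed embeddings, this lookup is a ranked nearest-neighbor scan. Here is a minimal self-contained sketch; the 3-dimensional vectors are toy stand-ins for real embeddings (vectors from an embedding model such as EmbeddingGenerator would have hundreds of dimensions and come from the model, not from hand-written literals):

```csharp
using System;
using System.Linq;

// Cosine similarity, as defined earlier in this article
static float Cosine(float[] a, float[] b)
{
    float dot = 0, ma = 0, mb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        ma += a[i] * a[i];
        mb += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(ma) * MathF.Sqrt(mb));
}

// Toy embeddings; in practice these come from an embedding model
var query = (Text: "feeling sad and hopeless", Vec: new float[] { 0.9f, 0.1f, 0.0f });
var corpus = new (string Text, float[] Vec)[]
{
    ("depression symptoms",   new float[] { 0.8f, 0.2f, 0.1f }),
    ("mental health support", new float[] { 0.7f, 0.3f, 0.1f }),
    ("car maintenance tips",  new float[] { 0.0f, 0.1f, 0.9f })
};

// Rank every document by similarity to the query vector
var ranked = corpus
    .Select(d => (d.Text, Score: Cosine(query.Vec, d.Vec)))
    .OrderByDescending(x => x.Score);

foreach (var (text, score) in ranked)
    Console.WriteLine($"{text}: {score:F2}");
// The two mental-health documents rank above the unrelated car document
```

At corpus scale, this linear scan is replaced by an approximate nearest-neighbor index or a vector database, but the ranking principle is identical.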
4. Recommendation Systems
Find similar content:
// "Users who liked X also liked Y" based on content similarity
var userLiked = embedder.GenerateEmbedding(likedArticle);
var recommendations = articles
.Select(a => (Article: a, Score: CosineSimilarity(userLiked, a.Embedding)))
.OrderByDescending(x => x.Score)
.Take(5);
5. Question-Answer Matching
Match questions to known answers:
// FAQ matching
var faqPairs = LoadFAQs();
var userQuestion = "How do I reset my password?";
var userEmbedding = embedder.GenerateEmbedding(userQuestion);
var bestMatch = faqPairs
.Select(faq => (FAQ: faq, Score: CosineSimilarity(userEmbedding, faq.QuestionEmbedding)))
.OrderByDescending(x => x.Score)
.First();
if (bestMatch.Score > 0.8f)
{
return bestMatch.FAQ.Answer;
}
Choosing Similarity Thresholds
| Threshold | Interpretation | Use Case |
|---|---|---|
| > 0.95 | Near-identical | Exact duplicate detection |
| 0.85-0.95 | Very similar | Paraphrase detection |
| 0.70-0.85 | Related | RAG retrieval, recommendations |
| 0.50-0.70 | Somewhat related | Broad topic matching |
| < 0.50 | Different topics | Filter out irrelevant results |
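The table translates directly into a small routing helper. The band boundaries below are the table's; a real application should tune them per embedding model, since score distributions vary between models:

```csharp
using System;

// Map a cosine similarity score to the interpretation bands above
static string InterpretScore(float score) => score switch
{
    > 0.95f  => "Near-identical",
    >= 0.85f => "Very similar",
    >= 0.70f => "Related",
    >= 0.50f => "Somewhat related",
    _        => "Different topics"
};

Console.WriteLine(InterpretScore(0.97f)); // Near-identical
Console.WriteLine(InterpretScore(0.78f)); // Related
Console.WriteLine(InterpretScore(0.30f)); // Different topics
```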
Key Terms
- Semantic Similarity: Measure of meaning overlap between texts
- Embedding: Dense vector representation of text in high-dimensional space
- Cosine Similarity: Similarity metric based on angle between vectors
- Vector Space: Mathematical space where embeddings reside
- Nearest Neighbor Search: Finding most similar vectors to a query
- Similarity Threshold: Minimum score to consider texts related
- Dense Retrieval: Finding documents using embedding similarity
- Cross-Encoder: Model that scores pairs directly (more accurate, slower)
Related API Documentation
- EmbeddingGenerator: Generate text embeddings
- DataSource: Semantic search over documents
- TextMatcher: Find similar texts
- RagEngine: RAG with semantic retrieval
Related Glossary Topics
- Embeddings: Vector representations enabling similarity
- RAG (Retrieval-Augmented Generation): Using similarity for retrieval
- Vector Database: Storing and searching embeddings
- Reranking: Improving similarity-based retrieval
- AI Agent Grounding: Using similarity for context retrieval
External Resources
- Sentence-BERT (Reimers & Gurevych, 2019): Sentence embeddings for similarity
- Dense Passage Retrieval (Karpukhin et al., 2020): Dense retrieval for QA
- SimCSE (Gao et al., 2021): Contrastive learning for embeddings
- MTEB Benchmark: Embedding model evaluation
Summary
Semantic Similarity measures how alike texts are in meaning using vector embeddings and distance metrics like cosine similarity. Unlike keyword matching, semantic similarity understands synonyms, paraphrases, and conceptual relationships. In LM-Kit.NET, semantic similarity powers RAG retrieval (DataSource.SearchAsync), text matching (TextMatcher), and reranking through the EmbeddingGenerator class. Applications include document search, duplicate detection, recommendation systems, and FAQ matching. Choosing appropriate similarity thresholds (typically 0.7-0.85 for retrieval) balances precision and recall. Combined with vector databases for efficient storage, semantic similarity enables intelligent content discovery that understands user intent beyond exact keyword matches.