🔍 Understanding Semantic Similarity in LM-Kit.NET


📄 TL;DR

Semantic Similarity measures how alike two pieces of text are in meaning, not just word overlap. Using vector embeddings, semantic similarity captures conceptual relationships that keyword matching misses, such as understanding that "car" and "automobile" are similar despite being different words. In LM-Kit.NET, semantic similarity powers RAG retrieval, reranking, text matching, and document clustering through the EmbeddingGenerator, TextMatcher, and DataSource classes, enabling applications to find relevant content based on meaning rather than exact matches.


📚 What is Semantic Similarity?

Definition: Semantic Similarity is a measure of how closely related two texts are in terms of their meaning or conceptual content. Unlike lexical similarity (which counts shared words), semantic similarity understands that:

  • "The cat sat on the mat" is similar to "A feline rested on the rug"
  • "Bank" (financial) is different from "bank" (river) based on context
  • "Happy" relates to "joyful" more than to "blue"

Lexical vs Semantic Matching

+-------------------------------------------------------------------------+
|                  Lexical vs Semantic Similarity                         |
+-------------------------------------------------------------------------+
|                                                                         |
|  Query: "How to fix a car engine"                                       |
|                                                                         |
|  LEXICAL MATCHING (keyword-based):                                      |
|  +-------------------------------------------------------------------+  |
|  | Matches documents containing: "fix", "car", "engine"              |  |
|  |                                                                   |  |
|  | Match: "Fix your car engine in 5 steps"                           |  |
|  | Miss:  "Automobile motor repair guide"      (no shared words)     |  |
|  | Miss:  "Vehicle troubleshooting manual"     (no shared words)     |  |
|  +-------------------------------------------------------------------+  |
|                                                                         |
|  SEMANTIC MATCHING (meaning-based):                                     |
|  +-------------------------------------------------------------------+  |
|  | Matches documents with similar MEANING                            |  |
|  |                                                                   |  |
|  | Match: "Fix your car engine in 5 steps"     (direct match)        |  |
|  | Match: "Automobile motor repair guide"      (synonyms understood) |  |
|  | Match: "Vehicle troubleshooting manual"     (concept overlap)     |  |
|  | Miss:  "Car dealership locations"           (different intent)    |  |
|  +-------------------------------------------------------------------+  |
|                                                                         |
+-------------------------------------------------------------------------+
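
To make the lexical half of this comparison concrete, here is a minimal sketch of naive keyword matching as a word-overlap count (deliberately simplified; real keyword search adds stemming, stop words, and TF-IDF/BM25 weighting). The semantic half reuses the embedding code shown later on this page.

// Naive lexical matching: count shared lowercase words between query and document.
static int KeywordOverlap(string query, string document)
{
    var queryWords = query.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries).ToHashSet();
    var docWords = document.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
    return docWords.Count(queryWords.Contains);
}

// "Fix your car engine in 5 steps" -> overlap 3 ("fix", "car", "engine")
// "Automobile motor repair guide"  -> overlap 0, even though the meaning matches
Console.WriteLine(KeywordOverlap("how to fix a car engine", "fix your car engine in 5 steps"));
Console.WriteLine(KeywordOverlap("how to fix a car engine", "automobile motor repair guide"));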

How Semantic Similarity Works

+-------------------------------------------------------------------------+
|                    Semantic Similarity Pipeline                         |
+-------------------------------------------------------------------------+
|                                                                         |
|  Text A: "Machine learning algorithms"                                  |
|  Text B: "AI neural network models"                                     |
|                                                                         |
|       |                                   |                             |
|       v                                   v                             |
|  +-----------+                       +-----------+                      |
|  | Embedding |                       | Embedding |                      |
|  |   Model   |                       |   Model   |                      |
|  +-----------+                       +-----------+                      |
|       |                                   |                             |
|       v                                   v                             |
|  [0.23, -0.15, 0.87, ...]           [0.21, -0.12, 0.84, ...]            |
|  (Vector A: 768 dimensions)          (Vector B: 768 dimensions)         |
|                                                                         |
|                    |                                                    |
|                    v                                                    |
|           +----------------+                                            |
|           | Cosine         |                                            |
|           | Similarity     |                                            |
|           +----------------+                                            |
|                    |                                                    |
|                    v                                                    |
|              Score: 0.92                                                |
|              (Highly similar)                                           |
|                                                                         |
+-------------------------------------------------------------------------+

🔍 Similarity Metrics

Common Distance/Similarity Functions

Metric               Formula               Range       Best For
Cosine Similarity    A·B / (‖A‖ × ‖B‖)     -1 to 1     Normalized text comparison
Dot Product          A·B                   -∞ to +∞    When magnitudes matter
Euclidean Distance   √Σ(Aᵢ − Bᵢ)²          0 to ∞      Absolute difference
Manhattan Distance   Σ|Aᵢ − Bᵢ|            0 to ∞      Sparse vectors
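
For reference, here is a minimal sketch of these four metrics as standalone helpers over plain float arrays. These are illustrative implementations of the formulas above, not part of the LM-Kit.NET API:

// Standalone metric implementations for two equal-length vectors.
static float Dot(float[] a, float[] b)
{
    float sum = 0;
    for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
    return sum;
}

// Cosine similarity: dot product divided by the product of the vector norms.
static float Cosine(float[] a, float[] b) =>
    Dot(a, b) / (MathF.Sqrt(Dot(a, a)) * MathF.Sqrt(Dot(b, b)));

// Euclidean distance: square root of the summed squared differences.
static float Euclidean(float[] a, float[] b)
{
    float sum = 0;
    for (int i = 0; i < a.Length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return MathF.Sqrt(sum);
}

// Manhattan distance: sum of absolute differences per dimension.
static float Manhattan(float[] a, float[] b)
{
    float sum = 0;
    for (int i = 0; i < a.Length; i++) sum += MathF.Abs(a[i] - b[i]);
    return sum;
}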

Cosine Similarity Explained

+-------------------------------------------------------------------------+
|                      Cosine Similarity                                  |
+-------------------------------------------------------------------------+
|                                                                         |
|  Measures the angle between two vectors, ignoring magnitude             |
|                                                                         |
|                    Vector B                                             |
|                   /                                                     |
|                  /  angle = small                                       |
|                 /   cos(angle) = high                                   |
|                /    similarity = HIGH                                   |
|    Vector A  /________________                                          |
|                                                                         |
|                                                                         |
|                    Vector B                                             |
|                   |                                                     |
|                   |  angle = large                                      |
|                   |  cos(angle) = low                                   |
|                   |  similarity = LOW                                   |
|    Vector A  ____|                                                      |
|                                                                         |
|  Score Interpretation:                                                  |
|  - 1.0: Identical meaning                                               |
|  - 0.8-0.9: Very similar                                                |
|  - 0.5-0.7: Somewhat related                                            |
|  - 0.0-0.4: Different topics                                            |
|  - Negative: Opposite meanings (rare in practice)                       |
|                                                                         |
+-------------------------------------------------------------------------+
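
A quick worked example with two-dimensional vectors makes the score concrete:

  A = (1.0, 0.0), B = (0.8, 0.6)
  A·B = 1.0 × 0.8 + 0.0 × 0.6 = 0.8
  ‖A‖ = √(1.0² + 0.0²) = 1.0
  ‖B‖ = √(0.8² + 0.6²) = √(0.64 + 0.36) = 1.0
  cosine similarity = 0.8 / (1.0 × 1.0) = 0.8  → lands in the "very similar" band above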

⚙️ Semantic Similarity in LM-Kit.NET

Generating Embeddings

using LMKit.Model;
using LMKit.Embeddings;

// Load an embedding model
var embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");

// Create embedding generator
var embedder = new EmbeddingGenerator(embeddingModel);

// Generate embeddings for texts
var embedding1 = embedder.GenerateEmbedding("Machine learning algorithms");
var embedding2 = embedder.GenerateEmbedding("AI neural network models");

Console.WriteLine($"Embedding dimension: {embedding1.Length}");

Computing Similarity

using LMKit.Embeddings;

// Generate embeddings
var vec1 = embedder.GenerateEmbedding("The quick brown fox");
var vec2 = embedder.GenerateEmbedding("A fast auburn canine");
var vec3 = embedder.GenerateEmbedding("Stock market analysis");

// Compute cosine similarity
float similarity12 = CosineSimilarity(vec1, vec2);
float similarity13 = CosineSimilarity(vec1, vec3);

Console.WriteLine($"Fox sentences: {similarity12:F3}");  // High (~0.85)
Console.WriteLine($"Fox vs stocks: {similarity13:F3}"); // Low (~0.15)

static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(magA) * MathF.Sqrt(magB));
}
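
If your embeddings are L2-normalized (many embedding models return unit-length vectors, or you can normalize them yourself), the denominator becomes 1 and cosine similarity reduces to a plain dot product, which is cheaper when comparing one query against many stored vectors. A minimal sketch, assuming you normalize once at indexing time:

// L2-normalize once, then cosine similarity is just a dot product.
static float[] Normalize(float[] v)
{
    float norm = MathF.Sqrt(v.Sum(x => x * x));
    return v.Select(x => x / norm).ToArray();
}

static float DotProduct(float[] a, float[] b)
{
    float dot = 0;
    for (int i = 0; i < a.Length; i++) dot += a[i] * b[i];
    return dot;
}

var n1 = Normalize(vec1);
var n2 = Normalize(vec2);
Console.WriteLine($"{DotProduct(n1, n2):F3}"); // matches CosineSimilarity(vec1, vec2)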

Text Matching with TextMatcher

using LMKit.Model;
using LMKit.TextAnalysis;
using LMKit.Embeddings;

var model = LM.LoadFromModelID("gemma3:4b");
var embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");

// Create text matcher
var matcher = new TextMatcher(model, new EmbeddingGenerator(embeddingModel));

// Add reference texts
matcher.AddReference("Invoice processing automation");
matcher.AddReference("Customer support chatbot");
matcher.AddReference("Data analytics dashboard");
matcher.AddReference("Inventory management system");

// Find most similar
var query = "Automated billing and payment handling";
var matches = matcher.FindSimilar(query, topK: 3);

foreach (var match in matches)
{
    Console.WriteLine($"{match.Text}: {match.Score:F3}");
}
// "Invoice processing automation" will score highest

Semantic Search with DataSource

using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Embeddings;

var model = LM.LoadFromModelID("gemma3:12b");
var embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");

// Create knowledge base with semantic search
var dataSource = new DataSource(new EmbeddingGenerator(embeddingModel));

// Index documents (embeddings computed automatically)
await dataSource.AddDocumentAsync("company_policies.pdf");
await dataSource.AddDocumentAsync("product_manual.docx");
await dataSource.AddDocumentAsync("faq.md");

// Semantic search
var results = await dataSource.SearchAsync(
    "What is the refund policy?",
    topK: 5,
    minScore: 0.5f  // Minimum similarity threshold
);

foreach (var result in results)
{
    Console.WriteLine($"[{result.Score:F3}] {result.DocumentName}: {result.Content[..Math.Min(100, result.Content.Length)]}...");
}

Similarity-Based Clustering

using LMKit.Embeddings;

// Generate embeddings for documents
var documents = new[]
{
    "Machine learning for image recognition",
    "Deep learning neural networks",
    "Stock market prediction algorithms",
    "Financial portfolio optimization",
    "Computer vision applications",
    "Investment risk analysis"
};

var embeddings = documents
    .Select(d => embedder.GenerateEmbedding(d))
    .ToArray();

// Compute similarity matrix
var similarityMatrix = new float[documents.Length, documents.Length];
for (int i = 0; i < documents.Length; i++)
{
    for (int j = 0; j < documents.Length; j++)
    {
        similarityMatrix[i, j] = CosineSimilarity(embeddings[i], embeddings[j]);
    }
}

// Documents cluster into: ML/AI group and Finance group
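
The grouping itself can be as simple as a greedy pass over the similarity matrix. A minimal sketch, assuming a similarity threshold of 0.6 (tune this per embedding model and domain):

// Greedy clustering: assign each document to the first cluster whose
// representative is similar enough, otherwise start a new cluster.
const float clusterThreshold = 0.6f;
var clusters = new List<List<int>>();

for (int i = 0; i < documents.Length; i++)
{
    var home = clusters.FirstOrDefault(c =>
        CosineSimilarity(embeddings[c[0]], embeddings[i]) >= clusterThreshold);

    if (home != null) home.Add(i);
    else clusters.Add(new List<int> { i });
}

for (int c = 0; c < clusters.Count; c++)
{
    Console.WriteLine($"Cluster {c}: {string.Join(" | ", clusters[c].Select(i => documents[i]))}");
}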

🎯 Applications of Semantic Similarity

1. Retrieval-Augmented Generation (RAG)

Find relevant context for grounded responses:

var ragEngine = new RagEngine(model, dataSource);

// Semantic search finds relevant documents
var response = await ragEngine.GenerateAsync(
    "Explain the return process",
    CancellationToken.None
);

2. Duplicate Detection

Find near-duplicate content:

// Detect semantically similar tickets
var newTicket = "Cannot log into my account after password reset";
var existingTickets = await GetOpenTickets();

foreach (var ticket in existingTickets)
{
    var similarity = ComputeSimilarity(newTicket, ticket.Text); // assumes the ticket object exposes its body text
    if (similarity > 0.85f)
    {
        Console.WriteLine($"Possible duplicate: {ticket.Id}");
    }
}
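
ComputeSimilarity above is not an LM-Kit.NET call; here is a minimal sketch of such a helper, reusing the embedder and the CosineSimilarity function defined earlier on this page:

// Hypothetical helper for the snippet above: embeds both texts and compares them.
// For large ticket queues, embed each ticket once and cache the vector instead of
// re-embedding it on every comparison.
float ComputeSimilarity(string textA, string textB)
{
    var a = embedder.GenerateEmbedding(textA);
    var b = embedder.GenerateEmbedding(textB);
    return CosineSimilarity(a, b);
}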

3. Semantic Search

Search by meaning, not keywords:

// Query: "feeling sad and hopeless"
// Finds: "depression symptoms", "mental health support", "emotional wellness"
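
A minimal sketch of how such a lookup could be wired up, reusing the embedder and CosineSimilarity helper from earlier (the document titles are made up for illustration):

var corpus = new[]
{
    "Depression symptoms and warning signs",
    "Mental health support resources",
    "Quarterly sales report template"
};

var queryEmbedding = embedder.GenerateEmbedding("feeling sad and hopeless");

// Rank documents by semantic similarity to the query.
var ranked = corpus
    .Select(text => (Text: text, Score: CosineSimilarity(queryEmbedding, embedder.GenerateEmbedding(text))))
    .OrderByDescending(x => x.Score);

foreach (var (text, score) in ranked)
{
    Console.WriteLine($"[{score:F3}] {text}");
}
// The mental-health documents rank above the sales template despite zero shared keywords.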

4. Recommendation Systems

Find similar content:

// "Users who liked X also liked Y" based on content similarity
var userLiked = embedder.GenerateEmbedding(likedArticle);
var recommendations = articles
    .Select(a => (Article: a, Score: CosineSimilarity(userLiked, a.Embedding)))
    .OrderByDescending(x => x.Score)
    .Take(5);

5. Question-Answer Matching

Match questions to known answers:

// FAQ matching
var faqPairs = LoadFAQs();
var userQuestion = "How do I reset my password?";
var userEmbedding = embedder.GenerateEmbedding(userQuestion);

var bestMatch = faqPairs
    .Select(faq => (FAQ: faq, Score: CosineSimilarity(userEmbedding, faq.QuestionEmbedding)))
    .OrderByDescending(x => x.Score)
    .First();

if (bestMatch.Score > 0.8f)
{
    return bestMatch.FAQ.Answer;
}
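
The snippet assumes each FAQ entry already carries a precomputed QuestionEmbedding. A sketch of that preprocessing step, with a hypothetical FaqEntry type (LoadFAQs itself is application code, not an LM-Kit.NET API):

// Illustrative FAQ entry carrying a precomputed question embedding.
record FaqEntry(string Question, string Answer, float[] QuestionEmbedding);

// Embed every question once at startup so each lookup only embeds the user's question.
FaqEntry[] BuildFaqIndex(IEnumerable<(string Question, string Answer)> rawFaqs) =>
    rawFaqs
        .Select(f => new FaqEntry(f.Question, f.Answer, embedder.GenerateEmbedding(f.Question)))
        .ToArray();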

📊 Choosing Similarity Thresholds

Threshold    Interpretation     Use Case
> 0.95       Near-identical     Exact duplicate detection
0.85-0.95    Very similar       Paraphrase detection
0.70-0.85    Related            RAG retrieval, recommendations
0.50-0.70    Somewhat related   Broad topic matching
< 0.50       Different topics   Filter out irrelevant results
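
In practice the threshold is applied as a filter on top of ranked results. For example, with the DataSource search results from the earlier snippet (0.7 is a starting point to tune, not a universal constant):

// Keep only results in the "related" band or above; tune per model and corpus.
const float retrievalThreshold = 0.7f;
var confident = results.Where(r => r.Score >= retrievalThreshold).ToList();

if (confident.Count == 0)
{
    Console.WriteLine("No sufficiently similar content found; consider a fallback answer.");
}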

📖 Key Terms

  • Semantic Similarity: Measure of meaning overlap between texts
  • Embedding: Dense vector representation of text in high-dimensional space
  • Cosine Similarity: Similarity metric based on angle between vectors
  • Vector Space: Mathematical space where embeddings reside
  • Nearest Neighbor Search: Finding most similar vectors to a query
  • Similarity Threshold: Minimum score to consider texts related
  • Dense Retrieval: Finding documents using embedding similarity
  • Cross-Encoder: Model that scores pairs directly (more accurate, slower)



📝 Summary

Semantic Similarity measures how alike texts are in meaning using vector embeddings and distance metrics like cosine similarity. Unlike keyword matching, semantic similarity understands synonyms, paraphrases, and conceptual relationships. In LM-Kit.NET, semantic similarity powers RAG retrieval (DataSource.SearchAsync), text matching (TextMatcher), and reranking through the EmbeddingGenerator class. Applications include document search, duplicate detection, recommendation systems, and FAQ matching. Choosing appropriate similarity thresholds (typically 0.7-0.85 for retrieval) balances precision and recall. Combined with vector databases for efficient storage, semantic similarity enables intelligent content discovery that understands user intent beyond exact keyword matches.