Understanding Semantic Similarity in LM-Kit.NET
TL;DR
Semantic Similarity measures how alike two pieces of text are in meaning, not just word overlap. Using vector embeddings, semantic similarity captures conceptual relationships that keyword matching misses, such as understanding that "car" and "automobile" are similar despite being different words. In LM-Kit.NET, semantic similarity powers RAG retrieval, reranking, text matching, and document clustering through the EmbeddingGenerator, TextMatcher, and DataSource classes, enabling applications to find relevant content based on meaning rather than exact matches.
What is Semantic Similarity?
Definition: Semantic Similarity is a measure of how closely related two texts are in terms of their meaning or conceptual content. Unlike lexical similarity (which counts shared words), semantic similarity understands that:
- "The cat sat on the mat" is similar to "A feline rested on the rug"
- "Bank" (financial) is different from "bank" (river) based on context
- "Happy" relates to "joyful" more than to "blue"
Lexical vs Semantic Matching
+-------------------------------------------------------------------------+
| Lexical vs Semantic Similarity |
+-------------------------------------------------------------------------+
| |
| Query: "How to fix a car engine" |
| |
| LEXICAL MATCHING (keyword-based): |
| +-------------------------------------------------------------------+ |
| | Matches documents containing: "fix", "car", "engine" | |
| | | |
| | Match: "Fix your car engine in 5 steps" | |
| | Miss: "Automobile motor repair guide" (no shared words) | |
| | Miss: "Vehicle troubleshooting manual" (no shared words) | |
| +-------------------------------------------------------------------+ |
| |
| SEMANTIC MATCHING (meaning-based): |
| +-------------------------------------------------------------------+ |
| | Matches documents with similar MEANING | |
| | | |
| | Match: "Fix your car engine in 5 steps" (direct match) | |
| | Match: "Automobile motor repair guide" (synonyms understood) | |
| | Match: "Vehicle troubleshooting manual" (concept overlap) | |
| | Miss: "Car dealership locations" (different intent) | |
| +-------------------------------------------------------------------+ |
| |
+-------------------------------------------------------------------------+
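The lexical half of this comparison is easy to reproduce in code. The sketch below scores the same document titles against the query by raw word overlap (Jaccard similarity over lowercased word sets), showing why pure keyword matching misses the synonym documents. This is plain C# with no LM-Kit types; the scoring function is a simplified stand-in (real keyword search would also stem words and drop stop words).

```csharp
using System;
using System.Linq;

// Jaccard similarity over lowercased word sets: |A ∩ B| / |A ∪ B|
static double JaccardOverlap(string a, string b)
{
    var setA = a.ToLowerInvariant().Split(' ').ToHashSet();
    var setB = b.ToLowerInvariant().Split(' ').ToHashSet();
    return (double)setA.Intersect(setB).Count() / setA.Union(setB).Count();
}

var query = "How to fix a car engine";
string[] docs =
{
    "Fix your car engine in 5 steps",
    "Automobile motor repair guide",
    "Vehicle troubleshooting manual"
};

foreach (var doc in docs)
{
    Console.WriteLine($"{doc}: {JaccardOverlap(query, doc):F2}");
}
// "Fix your car engine in 5 steps" shares 3 words with the query (score 0.30);
// the two synonym documents share none and score 0.00 despite matching in meaning
```

Semantic matching closes exactly this gap: an embedding model maps "car engine" and "automobile motor" to nearby vectors even though they share no surface tokens.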
How Semantic Similarity Works
+-------------------------------------------------------------------------+
| Semantic Similarity Pipeline |
+-------------------------------------------------------------------------+
| |
| Text A: "Machine learning algorithms" |
| Text B: "AI neural network models" |
| |
| | | |
| v v |
| +-----------+ +-----------+ |
| | Embedding | | Embedding | |
| | Model | | Model | |
| +-----------+ +-----------+ |
| | | |
| v v |
| [0.23, -0.15, 0.87, ...] [0.21, -0.12, 0.84, ...] |
| (Vector A: 768 dimensions) (Vector B: 768 dimensions) |
| |
| | |
| v |
| +----------------+ |
| | Cosine | |
| | Similarity | |
| +----------------+ |
| | |
| v |
| Score: 0.92 |
| (Highly similar) |
| |
+-------------------------------------------------------------------------+
Similarity Metrics
Common Distance/Similarity Functions
| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine Similarity | A·B / (\|A\| × \|B\|) | -1 to 1 | Normalized text comparison |
| Dot Product | A·B | -∞ to +∞ | When magnitudes matter |
| Euclidean Distance | √Σ(Aᵢ-Bᵢ)² | 0 to ∞ | Absolute difference |
| Manhattan Distance | Σ\|Aᵢ-Bᵢ\| | 0 to ∞ | Sparse vectors |
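Each metric in the table maps to a few lines of code. Here is a minimal sketch of all four functions in plain C# (no LM-Kit types assumed), applied to a vector and a scaled copy of itself to highlight how the metrics differ:

```csharp
using System;

// Dot product: Σ aᵢ·bᵢ, sensitive to vector magnitude
static float Dot(float[] a, float[] b)
{
    float sum = 0;
    for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
    return sum;
}

static float Magnitude(float[] v) => MathF.Sqrt(Dot(v, v));

// Cosine similarity: dot product normalized by both magnitudes
static float Cosine(float[] a, float[] b) => Dot(a, b) / (Magnitude(a) * Magnitude(b));

// Euclidean distance: straight-line distance between vector endpoints
static float Euclidean(float[] a, float[] b)
{
    float sum = 0;
    for (int i = 0; i < a.Length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return MathF.Sqrt(sum);
}

// Manhattan distance: sum of per-dimension absolute differences
static float Manhattan(float[] a, float[] b)
{
    float sum = 0;
    for (int i = 0; i < a.Length; i++) sum += MathF.Abs(a[i] - b[i]);
    return sum;
}

var a = new float[] { 1f, 2f, 2f };
var b = new float[] { 2f, 4f, 4f }; // same direction, twice the magnitude

Console.WriteLine($"Cosine:    {Cosine(a, b):F2}");    // 1.00
Console.WriteLine($"Dot:       {Dot(a, b):F2}");       // 18.00
Console.WriteLine($"Euclidean: {Euclidean(a, b):F2}"); // 3.00
Console.WriteLine($"Manhattan: {Manhattan(a, b):F2}"); // 5.00
```

Because cosine ignores magnitude, the vector and its scaled copy score a perfect 1.0, while dot product, Euclidean, and Manhattan all register the size difference. This is why cosine is the default for comparing text embeddings.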
Cosine Similarity Explained
+-------------------------------------------------------------------------+
| Cosine Similarity |
+-------------------------------------------------------------------------+
| |
| Measures the angle between two vectors, ignoring magnitude |
| |
| Vector B |
| / |
| / angle = small |
| / cos(angle) = high |
| / similarity = HIGH |
| Vector A /________________ |
| |
| |
| Vector B |
| | |
| | angle = large |
| | cos(angle) = low |
| | similarity = LOW |
| Vector A ____| |
| |
| Score Interpretation: |
| - 1.0: Identical meaning |
| - 0.8-0.9: Very similar |
| - 0.5-0.7: Somewhat related |
| - 0.0-0.4: Different topics |
| - Negative: Opposite meanings (rare in practice) |
| |
+-------------------------------------------------------------------------+
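The angle-to-score relationship sketched in the diagram can be checked numerically. This toy example uses 2D unit vectors at known angles, so the dot product alone gives the cosine score; it illustrates only the geometry, not a real embedding comparison:

```csharp
using System;

// Cosine similarity between a fixed unit vector and a second unit
// vector rotated by the given angle; both have length 1, so the
// dot product needs no normalization
static float ScoreAtAngle(double degrees)
{
    double r = degrees * Math.PI / 180.0;
    float[] a = { 1f, 0f };
    float[] b = { (float)Math.Cos(r), (float)Math.Sin(r) };
    return a[0] * b[0] + a[1] * b[1];
}

Console.WriteLine($"  0 deg: {ScoreAtAngle(0):F2}");   // 1.00  identical direction
Console.WriteLine($" 45 deg: {ScoreAtAngle(45):F2}");  // 0.71  similar
Console.WriteLine($" 90 deg: {ScoreAtAngle(90):F2}");  // 0.00  unrelated
Console.WriteLine($"180 deg: {ScoreAtAngle(180):F2}"); // -1.00 opposite
```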
Semantic Similarity in LM-Kit.NET
Generating Embeddings
using LMKit.Model;
using LMKit.Embeddings;
// Load an embedding model
var embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");
// Create embedding generator
var embedder = new EmbeddingGenerator(embeddingModel);
// Generate embeddings for texts
var embedding1 = embedder.GenerateEmbedding("Machine learning algorithms");
var embedding2 = embedder.GenerateEmbedding("AI neural network models");
Console.WriteLine($"Embedding dimension: {embedding1.Length}");
Computing Similarity
using LMKit.Embeddings;
// Generate embeddings
var vec1 = embedder.GenerateEmbedding("The quick brown fox");
var vec2 = embedder.GenerateEmbedding("A fast auburn canine");
var vec3 = embedder.GenerateEmbedding("Stock market analysis");
// Compute cosine similarity
float similarity12 = CosineSimilarity(vec1, vec2);
float similarity13 = CosineSimilarity(vec1, vec3);
Console.WriteLine($"Fox sentences: {similarity12:F3}"); // High (~0.85)
Console.WriteLine($"Fox vs stocks: {similarity13:F3}"); // Low (~0.15)
static float CosineSimilarity(float[] a, float[] b)
{
float dot = 0, magA = 0, magB = 0;
for (int i = 0; i < a.Length; i++)
{
dot += a[i] * b[i];
magA += a[i] * a[i];
magB += b[i] * b[i];
}
return dot / (MathF.Sqrt(magA) * MathF.Sqrt(magB));
}
Text Matching with TextMatcher
using LMKit.TextAnalysis;
var model = LM.LoadFromModelID("gemma3:4b");
var embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");
// Create text matcher
var matcher = new TextMatcher(model, new EmbeddingGenerator(embeddingModel));
// Add reference texts
matcher.AddReference("Invoice processing automation");
matcher.AddReference("Customer support chatbot");
matcher.AddReference("Data analytics dashboard");
matcher.AddReference("Inventory management system");
// Find most similar
var query = "Automated billing and payment handling";
var matches = matcher.FindSimilar(query, topK: 3);
foreach (var match in matches)
{
Console.WriteLine($"{match.Text}: {match.Score:F3}");
}
// "Invoice processing automation" will score highest
RAG with Semantic Search
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Embeddings;
var model = LM.LoadFromModelID("gemma3:12b");
var embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");
// Create knowledge base with semantic search
var dataSource = new DataSource(new EmbeddingGenerator(embeddingModel));
// Index documents (embeddings computed automatically)
await dataSource.AddDocumentAsync("company_policies.pdf");
await dataSource.AddDocumentAsync("product_manual.docx");
await dataSource.AddDocumentAsync("faq.md");
// Semantic search
var results = await dataSource.SearchAsync(
"What is the refund policy?",
topK: 5,
minScore: 0.5f // Minimum similarity threshold
);
foreach (var result in results)
{
    // Math.Min guards against content shorter than 100 characters
    Console.WriteLine($"[{result.Score:F3}] {result.DocumentName}: {result.Content[..Math.Min(100, result.Content.Length)]}...");
}
Similarity-Based Clustering
using LMKit.Embeddings;
// Generate embeddings for documents
var documents = new[]
{
"Machine learning for image recognition",
"Deep learning neural networks",
"Stock market prediction algorithms",
"Financial portfolio optimization",
"Computer vision applications",
"Investment risk analysis"
};
var embeddings = documents
.Select(d => embedder.GenerateEmbedding(d))
.ToArray();
// Compute similarity matrix
var similarityMatrix = new float[documents.Length, documents.Length];
for (int i = 0; i < documents.Length; i++)
{
for (int j = 0; j < documents.Length; j++)
{
similarityMatrix[i, j] = CosineSimilarity(embeddings[i], embeddings[j]);
}
}
// Documents cluster into: ML/AI group and Finance group
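That final comment can be made concrete with a simple threshold-based grouping pass over the similarity matrix. The sketch below is a greedy one-pass clusterer; the matrix values are hardcoded stand-ins for the embedding similarities computed above, and a production system would use a proper algorithm such as k-means or agglomerative clustering instead:

```csharp
using System;
using System.Collections.Generic;

// Greedy clustering: assign each item to the first cluster whose seed
// it resembles above the threshold, otherwise start a new cluster
static List<List<int>> GreedyCluster(float[,] sim, float threshold)
{
    var clusters = new List<List<int>>();
    int n = sim.GetLength(0);
    for (int i = 0; i < n; i++)
    {
        var home = clusters.Find(c => sim[c[0], i] >= threshold);
        if (home != null) home.Add(i);
        else clusters.Add(new List<int> { i });
    }
    return clusters;
}

// Stand-in similarity matrix for the six documents above
// (indices 0, 1, 4 are the ML/AI topics; 2, 3, 5 are the finance topics)
var sim = new float[,]
{
    { 1.0f, 0.8f, 0.2f, 0.1f, 0.7f, 0.2f },
    { 0.8f, 1.0f, 0.3f, 0.2f, 0.8f, 0.1f },
    { 0.2f, 0.3f, 1.0f, 0.8f, 0.2f, 0.7f },
    { 0.1f, 0.2f, 0.8f, 1.0f, 0.1f, 0.9f },
    { 0.7f, 0.8f, 0.2f, 0.1f, 1.0f, 0.2f },
    { 0.2f, 0.1f, 0.7f, 0.9f, 0.2f, 1.0f }
};

var clusters = GreedyCluster(sim, 0.6f);
foreach (var c in clusters)
{
    Console.WriteLine(string.Join(", ", c)); // prints "0, 1, 4" then "2, 3, 5"
}
```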
Applications of Semantic Similarity
1. Retrieval-Augmented Generation (RAG)
Find relevant context for grounded responses:
var ragEngine = new RagEngine(model, dataSource);
// Semantic search finds relevant documents
var response = await ragEngine.GenerateAsync(
"Explain the return process",
CancellationToken.None
);
2. Duplicate Detection
Find near-duplicate content:
// Detect semantically similar tickets.
// GetOpenTickets and ComputeSimilarity are application-specific helpers:
// ComputeSimilarity would embed both texts and return their cosine
// similarity; Text and Id are fields of a hypothetical Ticket type.
var newTicket = "Cannot log into my account after password reset";
var existingTickets = await GetOpenTickets();
foreach (var ticket in existingTickets)
{
    var similarity = ComputeSimilarity(newTicket, ticket.Text);
if (similarity > 0.85f)
{
Console.WriteLine($"Possible duplicate: {ticket.Id}");
}
}
3. Semantic Search
Search by meaning, not keywords:
// Query: "feeling sad and hopeless"
// Finds: "depression symptoms", "mental health support", "emotional wellness"
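With precomputed embeddings, this lookup is a ranked nearest-neighbor scan. Here is a minimal self-contained sketch; the 3-dimensional vectors are toy stand-ins for real embeddings (vectors from an embedding model such as EmbeddingGenerator would have hundreds of dimensions and come from the model, not from hand-written literals):

```csharp
using System;
using System.Linq;

// Cosine similarity, as defined earlier in this article
static float Cosine(float[] a, float[] b)
{
    float dot = 0, ma = 0, mb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        ma += a[i] * a[i];
        mb += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(ma) * MathF.Sqrt(mb));
}

// Toy embeddings; in practice these come from an embedding model
var query = (Text: "feeling sad and hopeless", Vec: new float[] { 0.9f, 0.1f, 0.0f });
var corpus = new (string Text, float[] Vec)[]
{
    ("depression symptoms",   new float[] { 0.8f, 0.2f, 0.1f }),
    ("mental health support", new float[] { 0.7f, 0.3f, 0.1f }),
    ("car maintenance tips",  new float[] { 0.0f, 0.1f, 0.9f })
};

// Rank every document by similarity to the query vector
var ranked = corpus
    .Select(d => (d.Text, Score: Cosine(query.Vec, d.Vec)))
    .OrderByDescending(x => x.Score);

foreach (var (text, score) in ranked)
    Console.WriteLine($"{text}: {score:F2}");
// The two mental-health documents rank above the unrelated car document
```

At corpus scale, this linear scan is replaced by an approximate nearest-neighbor index or a vector database, but the ranking principle is identical.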
4. Recommendation Systems
Find similar content:
// "Users who liked X also liked Y" based on content similarity
var userLiked = embedder.GenerateEmbedding(likedArticle);
var recommendations = articles
.Select(a => (Article: a, Score: CosineSimilarity(userLiked, a.Embedding)))
.OrderByDescending(x => x.Score)
.Take(5);
5. Question-Answer Matching
Match questions to known answers:
// FAQ matching
var faqPairs = LoadFAQs();
var userQuestion = "How do I reset my password?";
var userEmbedding = embedder.GenerateEmbedding(userQuestion);
var bestMatch = faqPairs
.Select(faq => (FAQ: faq, Score: CosineSimilarity(userEmbedding, faq.QuestionEmbedding)))
.OrderByDescending(x => x.Score)
.First();
if (bestMatch.Score > 0.8f)
{
return bestMatch.FAQ.Answer;
}
Choosing Similarity Thresholds
| Threshold | Interpretation | Use Case |
|---|---|---|
| > 0.95 | Near-identical | Exact duplicate detection |
| 0.85-0.95 | Very similar | Paraphrase detection |
| 0.70-0.85 | Related | RAG retrieval, recommendations |
| 0.50-0.70 | Somewhat related | Broad topic matching |
| < 0.50 | Different topics | Filter out irrelevant results |
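The table translates directly into a small routing helper. The band boundaries below are the table's; a real application should tune them per embedding model, since score distributions vary between models:

```csharp
using System;

// Map a cosine similarity score to the interpretation bands above
static string InterpretScore(float score) => score switch
{
    > 0.95f  => "Near-identical",
    >= 0.85f => "Very similar",
    >= 0.70f => "Related",
    >= 0.50f => "Somewhat related",
    _        => "Different topics"
};

Console.WriteLine(InterpretScore(0.97f)); // Near-identical
Console.WriteLine(InterpretScore(0.78f)); // Related
Console.WriteLine(InterpretScore(0.30f)); // Different topics
```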
Key Terms
- Semantic Similarity: Measure of meaning overlap between texts
- Embedding: Dense vector representation of text in high-dimensional space
- Cosine Similarity: Similarity metric based on angle between vectors
- Vector Space: Mathematical space where embeddings reside
- Nearest Neighbor Search: Finding most similar vectors to a query
- Similarity Threshold: Minimum score to consider texts related
- Dense Retrieval: Finding documents using embedding similarity
- Cross-Encoder: Model that scores pairs directly (more accurate, slower)
Related API Documentation
- EmbeddingGenerator: Generate text embeddings
- DataSource: Semantic search over documents
- TextMatcher: Find similar texts
- RagEngine: RAG with semantic retrieval
Related Glossary Topics
- Embeddings: Vector representations enabling similarity
- RAG (Retrieval-Augmented Generation): Using similarity for retrieval
- Vector Database: Storing and searching embeddings
- Reranking: Improving similarity-based retrieval
- AI Agent Grounding: Using similarity for context retrieval
External Resources
- Sentence-BERT (Reimers & Gurevych, 2019): Sentence embeddings for similarity
- Dense Passage Retrieval (Karpukhin et al., 2020): Dense retrieval for QA
- SimCSE (Gao et al., 2021): Contrastive learning for embeddings
- MTEB Benchmark: Embedding model evaluation
Summary
Semantic Similarity measures how alike texts are in meaning using vector embeddings and distance metrics like cosine similarity. Unlike keyword matching, semantic similarity understands synonyms, paraphrases, and conceptual relationships. In LM-Kit.NET, semantic similarity powers RAG retrieval (DataSource.SearchAsync), text matching (TextMatcher), and reranking through the EmbeddingGenerator class. Applications include document search, duplicate detection, recommendation systems, and FAQ matching. Choosing appropriate similarity thresholds (typically 0.7-0.85 for retrieval) balances precision and recall. Combined with vector databases for efficient storage, semantic similarity enables intelligent content discovery that understands user intent beyond exact keyword matches.