👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/rag-and-knowledge/embeddings/text_similarity_ranker
Semantic Search Over a Text Corpus for C# .NET Applications
🎯 Purpose of the Demo
An interactive console app that indexes a folder of .md / .txt files and answers semantic queries. The chunker splits files on paragraph boundaries with file + start/end-line metadata, so every hit comes back with provenance ready to paste into a ticket or jump to in an editor. Built directly on LM-Kit.NET's Embedder API.
All inference runs on-device. No cloud round-trip, no token billing.
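The paragraph-boundary chunking described above can be sketched in a few lines. This is a minimal illustration, not the demo's actual chunker: the `Chunk` record and `ParagraphChunker` class are hypothetical names, paragraphs are assumed to be separated by blank lines, line numbers are assumed 1-based and inclusive, and the chunk-size limit the demo prompts for is ignored here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record; the demo's real chunk type may differ.
public sealed record Chunk(string File, int StartLine, int EndLine, string Text);

public static class ParagraphChunker
{
    // Splits content on blank lines and records a 1-based, inclusive line range per paragraph.
    public static List<Chunk> Split(string file, string content)
    {
        var chunks = new List<Chunk>();
        string[] lines = content.Replace("\r\n", "\n").Split('\n');
        var buffer = new List<string>();
        int start = 0;

        // Iterate one step past the last line so a trailing paragraph is flushed.
        for (int i = 0; i <= lines.Length; i++)
        {
            bool atEnd = i == lines.Length;
            bool blank = atEnd || string.IsNullOrWhiteSpace(lines[i]);

            if (!blank)
            {
                if (buffer.Count == 0) start = i + 1; // first line of a new paragraph
                buffer.Add(lines[i]);
            }
            else if (buffer.Count > 0)
            {
                // The paragraph ended on the previous line: index i - 1, line number i.
                chunks.Add(new Chunk(file, start, i, string.Join("\n", buffer)));
                buffer.Clear();
            }
        }
        return chunks;
    }
}
```

Calling `ParagraphChunker.Split("notes.md", text)` returns one `Chunk` per paragraph, each carrying the file name and line range that a search hit can later cite.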
👥 Industry Target Audience
- Engineering teams indexing runbooks / wikis / incident postmortems for "did anyone hit this before?" queries.
- Support and CX indexing prior tickets for similar-case retrieval.
- Internal docs / KB owners building search over a Markdown content repo.
- Compliance / policy teams searching a body of regulation for relevant clauses.
🚀 Problem Solved
Keyword grep misses paraphrases. Semantic search finds "memory leak" inside a doc that says "RAM keeps climbing". This demo is the smallest correct pipeline that does it on a real folder: chunk, batch-embed, rank by cosine similarity, attribute each hit back to its source line range.
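The ranking step of that pipeline can be sketched as follows. This is an illustrative sketch that assumes embeddings are plain `float[]` vectors already produced by the embedder; `CosineRanker` and its method names are hypothetical and not part of LM-Kit.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CosineRanker
{
    // Cosine similarity: dot(a, b) / (|a| * |b|).
    public static double Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        // Treat zero vectors as having no similarity instead of dividing by zero.
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Returns the k best-scoring corpus indices, highest score first.
    public static List<(int Index, double Score)> TopK(
        float[] query, IReadOnlyList<float[]> corpus, int k)
    {
        return corpus
            .Select((vec, i) => (Index: i, Score: Cosine(query, vec)))
            .OrderByDescending(hit => hit.Score)
            .Take(k)
            .ToList();
    }
}
```

Each returned index points back into the chunk list, which is how a hit can be attributed to its file and line range. A linear scan like this is usually adequate for an in-memory index of a single folder.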
💻 Application Overview
Interactive menu (no command-line arguments) with four modes plus Quit. The model loads once at startup; the index is held in memory and reused across queries.
| Mode | What it does |
|---|---|
| Index | Prompts for a folder and chunk size. Walks the folder, splits files on paragraph boundaries, batch-embeds every chunk (32 at a time). |
| Demo | Loads a built-in 10-passage corpus so you can run search without an external folder. |
| Search | Top-K query REPL. Prompts for K (default 5). Each query prints `score \| file:lines \| snippet` and offers to write a CSV (`rank, score, path, start_line, end_line, snippet`). |
| Stats | Chunks indexed, distinct files, total characters, embedding dimension. |
| Quit | Exit. |
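The 32-at-a-time batching used by Index mode might look roughly like this. The `embedBatch` delegate is a hypothetical stand-in for whatever call returns one vector per input text; its signature is an assumption made for this sketch and is not taken from the LM-Kit.NET API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchEmbedding
{
    // embedBatch (hypothetical) must return exactly one vector per input text, in order.
    public static List<float[]> EmbedAll(
        IReadOnlyList<string> texts,
        Func<string[], float[][]> embedBatch,
        int batchSize = 32)
    {
        var vectors = new List<float[]>(texts.Count);

        // Enumerable.Chunk (.NET 6+) yields arrays of at most batchSize items.
        foreach (string[] batch in texts.Chunk(batchSize))
        {
            float[][] result = embedBatch(batch);
            if (result.Length != batch.Length)
                throw new InvalidOperationException(
                    "Embedder returned a different number of vectors than inputs.");
            vectors.AddRange(result);
        }
        return vectors;
    }
}
```

Because results are appended in input order, `vectors[i]` always corresponds to `texts[i]`, so chunk metadata and embeddings stay aligned by index.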
✨ Key Features
- `Embedder.GetEmbeddings(string[])` batched 32 at a time for high throughput on real folders.
- Asymmetric query side via `Embedder.GetQueryEmbeddings(text)` when the model exposes one.
- Paragraph-level chunking with `(file, start_line, end_line)` provenance: citations are exact, not document-level.
- In-memory index reuse across the query REPL: load once, query many times.
- Optional per-query CSV for downstream analysis or auto-ticketing.
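A per-query CSV with the columns listed for Search mode could be produced along these lines. The `Hit` record and `HitCsv` class are hypothetical, and both the four-decimal score format and the RFC 4180-style quoting are assumptions of this sketch rather than the demo's confirmed behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical shape of one search result row.
public sealed record Hit(int Rank, double Score, string Path, int StartLine, int EndLine, string Snippet);

public static class HitCsv
{
    // Quote a field only when it contains a comma, quote, or line break; double any inner quotes.
    static string Escape(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    public static string ToCsv(IEnumerable<Hit> hits)
    {
        var sb = new StringBuilder();
        sb.Append("rank,score,path,start_line,end_line,snippet\n");

        foreach (Hit h in hits)
        {
            // Invariant culture keeps the decimal separator a dot on every machine.
            sb.Append(string.Join(",",
                h.Rank.ToString(CultureInfo.InvariantCulture),
                h.Score.ToString("F4", CultureInfo.InvariantCulture),
                Escape(h.Path),
                h.StartLine.ToString(CultureInfo.InvariantCulture),
                h.EndLine.ToString(CultureInfo.InvariantCulture),
                Escape(h.Snippet)));
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

Quoting matters here because snippets are free text and will often contain commas or quotation marks.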
🧠 Supported Models
The model picker offers:
- `qwen3-embedding:0.6b`: fast default.
- `qwen3-embedding:4b`: higher quality.
- `qwen3-embedding:8b`: best quality.
- `embeddinggemma-300m`: very small.

Or any text-embedding model URI / id.
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
▶️ Running the Application
```bash
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/rag-and-knowledge/embeddings/text_similarity_ranker
dotnet run
```
Pick a model, then either 1 to index a folder or 2 to load the built-in corpus, then 3 to query.