👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/rag-and-knowledge/embeddings/text_similarity_ranker

Semantic Search Over a Text Corpus for C# .NET Applications

🎯 Purpose of the Demo

An interactive console app that indexes a folder of .md / .txt files and answers semantic queries. The chunker splits files on paragraph boundaries with file + start/end-line metadata, so every hit comes back with provenance ready to paste into a ticket or jump to in an editor. Built directly on LM-Kit.NET's Embedder API.

All inference runs on-device. No cloud round-trip, no token billing.
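
The paragraph-level chunking with line provenance described above can be sketched as follows. This is a minimal illustration, not the demo's actual chunker: the `Chunk` record and `ParagraphChunker` class are hypothetical names. It assumes paragraphs are separated by blank lines, and it ignores the chunk-size limit the demo prompts for.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record for illustration; the demo's own chunk type may differ.
public sealed record Chunk(string File, int StartLine, int EndLine, string Text);

public static class ParagraphChunker
{
    // Splits content into blank-line-separated paragraphs and records
    // 1-based, inclusive start/end line numbers for each one.
    public static List<Chunk> Split(string file, string content)
    {
        var chunks = new List<Chunk>();
        string[] lines = content.Replace("\r\n", "\n").Split('\n');
        var current = new List<string>();
        int start = 0;

        for (int i = 0; i < lines.Length; i++)
        {
            if (string.IsNullOrWhiteSpace(lines[i]))
            {
                // The paragraph ended on the previous line, whose 1-based number is i.
                Flush(i);
            }
            else
            {
                if (current.Count == 0) start = i + 1;
                current.Add(lines[i]);
            }
        }
        Flush(lines.Length); // final paragraph when the file has no trailing blank line
        return chunks;

        void Flush(int endLine)
        {
            if (current.Count == 0) return;
            chunks.Add(new Chunk(file, start, endLine, string.Join("\n", current)));
            current.Clear();
        }
    }
}
```

Calling `ParagraphChunker.Split("runbook.md", File.ReadAllText("runbook.md"))` would yield one chunk per paragraph, each carrying the line range needed for a `file:lines` citation.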

👥 Industry Target Audience

Engineering teams indexing runbooks / wikis / incident postmortems for "did anyone hit this before?" queries.
Support and CX indexing prior tickets for similar-case retrieval.
Internal docs / KB owners building search over a Markdown content repo.
Compliance / policy teams searching a body of regulation for relevant clauses.

🚀 Problem Solved

Keyword grep misses paraphrases. A semantic query for "memory leak" still finds a doc that says "RAM keeps climbing". This demo is the smallest correct pipeline that does it on a real folder: chunk, batch-embed, rank by cosine similarity, and attribute each hit back to its source line range.
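
The ranking step can be sketched in plain C# once the embeddings exist. This is an illustrative sketch, not the demo's code: `CosineRanker` is a hypothetical name, and it assumes the query and every indexed chunk have already been embedded as `float[]` vectors of the same dimension.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CosineRanker
{
    // Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
    public static double Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Scores every indexed vector against the query and keeps the k best,
    // returning each hit's position in the index alongside its score.
    public static List<(int Index, double Score)> TopK(
        float[] query, IReadOnlyList<float[]> index, int k)
    {
        return index
            .Select((vec, i) => (Index: i, Score: Cosine(query, vec)))
            .OrderByDescending(hit => hit.Score)
            .Take(k)
            .ToList();
    }
}
```

A brute-force scan like this is linear in the number of chunks, which is usually adequate for a folder-sized, in-memory index.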

💻 Application Overview

An interactive menu, with no command-line arguments, offers four modes plus Quit. The model loads once at startup; the index is held in memory and reused across queries.

| Mode | What it does |
|---|---|
| Index | Prompts for a folder and chunk size. Walks the folder, splits files on paragraph boundaries, batch-embeds every chunk (32 at a time). |
| Demo | Loads a built-in 10-passage corpus so you can run search without an external folder. |
| Search | Top-K query REPL. Prompts for K (default 5). Each query prints `score \| file:lines \| snippet` and offers to write a CSV (`rank, score, path, start_line, end_line, snippet`). |
| Stats | Chunks indexed, distinct files, total characters, embedding dimension. |
| Quit | Exit. |
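
The Search output formats can be sketched as below. This is an assumption-laden illustration: `HitFormatter` is a hypothetical name, and the four-decimal score, the `start-end` rendering of the line range, and the quoting rules are guesses at reasonable behavior rather than the demo's confirmed format. The CSV column order follows the one listed above.

```csharp
using System;
using System.Globalization;

public static class HitFormatter
{
    // Console line in the "score | file:lines | snippet" shape.
    public static string ConsoleLine(
        double score, string path, int startLine, int endLine, string snippet) =>
        score.ToString("F4", CultureInfo.InvariantCulture)
        + " | " + path + ":" + startLine + "-" + endLine
        + " | " + snippet;

    // Quotes a field when it contains a comma, quote, or line break,
    // doubling any embedded quotes.
    public static string CsvEscape(string field)
    {
        bool needsQuotes = field.Contains(',') || field.Contains('"')
            || field.Contains('\n') || field.Contains('\r');
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // One CSV row: rank, score, path, start_line, end_line, snippet.
    public static string CsvRow(
        int rank, double score, string path, int startLine, int endLine, string snippet) =>
        string.Join(",",
            rank.ToString(CultureInfo.InvariantCulture),
            score.ToString("F4", CultureInfo.InvariantCulture),
            CsvEscape(path),
            startLine.ToString(CultureInfo.InvariantCulture),
            endLine.ToString(CultureInfo.InvariantCulture),
            CsvEscape(snippet));
}
```

Formatting with the invariant culture keeps the decimal separator a period, so scores do not collide with the comma delimiter on machines with other locale settings.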

✨ Key Features

Embedder.GetEmbeddings(string[]) called in batches of 32 for high throughput on real folders.
Asymmetric query-side embeddings via Embedder.GetQueryEmbeddings(text) when the model supports them.
Paragraph-level chunking with (file, start_line, end_line) provenance — citations are exact, not document-level.
In-memory index reuse across the query REPL — load once, query many times.
Optional per-query CSV for downstream analysis or auto-ticketing.
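
The 32-at-a-time batching can be sketched independently of the embedding library. In this sketch, `BatchEmbedding` and the `embedBatch` delegate are hypothetical: the delegate stands in for whatever call produces one vector per input string (such as the `Embedder.GetEmbeddings(string[])` call named above, whose exact return type is not assumed here).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchEmbedding
{
    // Embeds texts in fixed-size batches and returns the vectors in input order.
    // embedBatch is a stand-in for the real embedding call.
    public static List<float[]> EmbedAll(
        IEnumerable<string> texts,
        Func<string[], float[][]> embedBatch,
        int batchSize = 32)
    {
        var vectors = new List<float[]>();

        // Enumerable.Chunk (.NET 6+) yields arrays of up to batchSize items.
        foreach (string[] batch in texts.Chunk(batchSize))
        {
            float[][] result = embedBatch(batch);
            if (result.Length != batch.Length)
                throw new InvalidOperationException("Expected one vector per input text.");
            vectors.AddRange(result);
        }
        return vectors;
    }
}
```

Because the vectors come back in the same order as the input texts, position `i` in the result lines up with chunk `i`, which is what lets each hit be attributed back to its file and line range.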

🧠 Supported Models

The model picker offers:

qwen3-embedding:0.6b: fast default.
qwen3-embedding:4b: higher quality.
qwen3-embedding:8b: best quality.
embeddinggemma-300m: very small.
harrier-oss:0.6b: compact, multilingual, instruction-aware.

Or any text-embedding model URI / id.

🛠️ Getting Started

📋 Prerequisites

.NET 8.0 or later

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/rag-and-knowledge/embeddings/text_similarity_ranker
dotnet run

Pick a model, then enter 1 to index a folder or 2 to load the built-in corpus, and then 3 to query.
