👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/rag-and-knowledge/embeddings/text_similarity_ranker

Semantic Search Over a Text Corpus for C# .NET Applications


🎯 Purpose of the Demo

An interactive console app that indexes a folder of .md / .txt files and answers semantic queries. The chunker splits files on paragraph boundaries with file + start/end-line metadata, so every hit comes back with provenance ready to paste into a ticket or jump to in an editor. Built directly on LM-Kit.NET's Embedder API.

All inference runs on-device. No cloud round-trip, no token billing.
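
The paragraph chunking with line provenance described above can be sketched roughly as follows. This is a minimal illustration, not the demo's actual chunker: `ChunkByParagraph` and its tuple shape are hypothetical names, paragraphs are assumed to be separated by blank lines, and the chunk-size limit the demo prompts for is ignored here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not the demo's code): groups consecutive non-blank
// lines into paragraphs and records 1-based start/end line numbers.
static List<(string File, int StartLine, int EndLine, string Text)> ChunkByParagraph(
    string file, string content)
{
    var chunks = new List<(string File, int StartLine, int EndLine, string Text)>();
    string[] lines = content.Replace("\r\n", "\n").Split('\n');
    var current = new List<string>();
    int start = 0;

    for (int i = 0; i <= lines.Length; i++)
    {
        // Treat end-of-input like a blank line so the last paragraph is flushed.
        bool blank = i == lines.Length || string.IsNullOrWhiteSpace(lines[i]);
        if (!blank)
        {
            if (current.Count == 0) start = i + 1; // first line of a new paragraph
            current.Add(lines[i]);
        }
        else if (current.Count > 0)
        {
            chunks.Add((file, start, start + current.Count - 1, string.Join("\n", current)));
            current.Clear();
        }
    }
    return chunks;
}

// Illustrative input only.
string sample = "RAM keeps climbing\nafter every deploy.\n\nRestarting the service helps.";
foreach (var chunk in ChunkByParagraph("notes.md", sample))
{
    string oneLine = chunk.Text.Replace("\n", " ");
    Console.WriteLine($"{chunk.File}:{chunk.StartLine}-{chunk.EndLine}  {oneLine}");
}
// notes.md:1-2  RAM keeps climbing after every deploy.
// notes.md:4-4  Restarting the service helps.
```

Carrying the line range with each chunk is what lets a hit be reported as file:lines rather than only as a file name.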


👥 Industry Target Audience

  • Engineering teams indexing runbooks / wikis / incident postmortems for "did anyone hit this before?" queries.
  • Support and CX indexing prior tickets for similar-case retrieval.
  • Internal docs / KB owners building search over a Markdown content repo.
  • Compliance / policy teams searching a body of regulation for relevant clauses.

🚀 Problem Solved

Keyword grep misses paraphrases. Semantic search finds "memory leak" inside a doc that says "RAM keeps climbing". This demo is a minimal end-to-end pipeline that does it on a real folder: chunk, batch-embed, rank by cosine similarity, and attribute each hit back to its source line range.
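
The ranking step can be illustrated with a short sketch. It assumes embeddings are plain float arrays and uses hypothetical helper names (`Cosine`, `TopK`); the demo itself may compute similarity through LM-Kit.NET rather than with hand-written code like this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Cosine similarity between two equal-length vectors; returns 0 for a zero vector.
static double Cosine(float[] a, float[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vector lengths differ.");
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += (double)a[i] * b[i];
        normA += (double)a[i] * a[i];
        normB += (double)b[i] * b[i];
    }
    return normA == 0 || normB == 0 ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Scores every indexed vector against the query and keeps the k best, highest first.
static List<(int Index, double Score)> TopK(float[] query, IReadOnlyList<float[]> index, int k) =>
    index.Select((vector, i) => (Index: i, Score: Cosine(query, vector)))
         .OrderByDescending(hit => hit.Score)
         .Take(k)
         .ToList();

// Toy 2-dimensional vectors, for illustration only.
var toyIndex = new List<float[]> { new[] { 1f, 0f }, new[] { 0f, 1f }, new[] { 1f, 1f } };
foreach (var hit in TopK(new[] { 1f, 0f }, toyIndex, 2))
    Console.WriteLine($"chunk {hit.Index}: {hit.Score}");
```

A linear scan like this is simple and fits an in-memory index of modest size; a larger corpus would usually call for an approximate nearest-neighbor structure instead.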


💻 Application Overview

An interactive menu with four modes plus Quit; there are no command-line arguments. The model loads once at startup, and the index is held in memory and reused across queries.

Mode | What it does
Index | Prompts for a folder and chunk size. Walks the folder, splits files on paragraph boundaries, batch-embeds every chunk (32 at a time).
Demo | Loads a built-in 10-passage corpus so you can run search without an external folder.
Search | Top-K query REPL. Prompts for K (default 5). Each query prints score | file:lines | snippet and offers to write a CSV (rank, score, path, start_line, end_line, snippet).
Stats | Chunks indexed, distinct files, total characters, embedding dimension.
Quit | Exit.
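
As an illustration of the CSV export in Search mode, the sketch below writes one row with the column order listed above. The helper names, the four-decimal score format, and the sample values are assumptions made for this example; the demo's own writer may quote or format fields differently.

```csharp
using System;
using System.Globalization;

// Quote a field only when it contains a comma, quote, or line break (RFC 4180 style).
static string CsvField(string value)
{
    bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0;
    return needsQuotes ? "\"" + value.Replace("\"", "\"\"") + "\"" : value;
}

// Builds one row; invariant culture keeps the decimal separator a dot on every machine.
static string CsvRow(int rank, double score, string path, int startLine, int endLine, string snippet) =>
    string.Join(",", new[]
    {
        rank.ToString(CultureInfo.InvariantCulture),
        score.ToString("F4", CultureInfo.InvariantCulture), // assumed precision
        CsvField(path),
        startLine.ToString(CultureInfo.InvariantCulture),
        endLine.ToString(CultureInfo.InvariantCulture),
        CsvField(snippet),
    });

// Made-up hit, for illustration only.
Console.WriteLine("rank,score,path,start_line,end_line,snippet");
Console.WriteLine(CsvRow(1, 0.8731, "runbooks/oom.md", 12, 15, "RAM keeps climbing, then the pod restarts."));
// 1,0.8731,runbooks/oom.md,12,15,"RAM keeps climbing, then the pod restarts."
```

Quoting matters here because snippets are free text and will often contain commas or quotation marks.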

✨ Key Features

  • Embedder.GetEmbeddings(string[]) called in batches of 32 for high throughput on real folders.
  • Asymmetric query-side embeddings via Embedder.GetQueryEmbeddings(text) when the model supports them.
  • Paragraph-level chunking with (file, start_line, end_line) provenance — citations are exact, not document-level.
  • In-memory index reuse across the query REPL — load once, query many times.
  • Optional per-query CSV for downstream analysis or auto-ticketing.
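
The batching pattern from the first feature can be sketched without depending on any particular embedder. In the example below, `embedBatch` is a stand-in delegate and `FakeEmbed` is a fake used only to make the sketch runnable; neither is LM-Kit.NET's signature, so in practice you would wrap the real batch call in a delegate of this shape.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Embeds texts in fixed-size batches. `embedBatch` is a hypothetical stand-in
// for whatever batch call the embedder exposes.
static List<float[]> EmbedInBatches(
    IReadOnlyList<string> texts, Func<string[], float[][]> embedBatch, int batchSize = 32)
{
    var vectors = new List<float[]>(texts.Count);
    foreach (string[] batch in texts.Chunk(batchSize)) // Enumerable.Chunk, .NET 6+
    {
        float[][] result = embedBatch(batch);
        if (result.Length != batch.Length)
            throw new InvalidOperationException("Embedder returned a different number of vectors than inputs.");
        vectors.AddRange(result); // order is preserved, so vectors[i] belongs to texts[i]
    }
    return vectors;
}

// Fake embedder for illustration only: a 1-dimensional vector holding the text length.
int calls = 0;
float[][] FakeEmbed(string[] batch)
{
    calls++;
    return batch.Select(text => new[] { (float)text.Length }).ToArray();
}

var texts = Enumerable.Range(0, 70).Select(i => $"chunk {i}").ToList();
var vectors = EmbedInBatches(texts, FakeEmbed);
Console.WriteLine($"{vectors.Count} vectors from {calls} batch calls");
// 70 vectors from 3 batch calls
```

Keeping vectors in input order means each one can be paired with its chunk's (file, start_line, end_line) record by index.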

🧠 Supported Models

The model picker offers:

  • qwen3-embedding:0.6b — fast default.
  • qwen3-embedding:4b — higher quality.
  • qwen3-embedding:8b — best quality.
  • embeddinggemma-300m — very small.

Or any text-embedding model URI / id.


🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/rag-and-knowledge/embeddings/text_similarity_ranker
dotnet run

Pick a model, then choose 1 to index a folder or 2 to load the built-in corpus, and then 3 to query.
