👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/vision/image-embeddings/image_similarity_search
Visual Similarity & Near-Duplicate Index for C# .NET Applications
🎯 Purpose of the Demo
An interactive console app that embeds an image folder into a persistent file-based vector index, then either (a) clusters near-duplicates above a cosine threshold or (b) runs visual search by query image. Built on LM.LoadFromModelID("nomic-embed-vision"), Embedder, DataSource, and VectorSearch.
All embedding and search runs on-device.
👥 Industry Target Audience
- Photo libraries / DAM: surface near-duplicates and reclaim storage.
- E-commerce: find product photos that look alike across listings.
- Content moderation / brand-safety: detect re-uploads of flagged images.
- Personal archives: clean up years of unsorted captures.
- Visual search UX: "find more like this" backend.
🚀 Problem Solved
Perceptual hashing can miss crops, rescales, re-encodes, and minor edits. Embedding-based similarity is more robust to them: the same scene shot from a slightly different angle, or the same photo re-saved at a different quality, ends up close in vector space. The demo applies this to two practical tasks: clustering duplicates (with a reclaimable-bytes estimate) and searching by example image.
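To make "close in vector space" concrete, here is a minimal sketch of cosine similarity over two embedding vectors. It is plain C# with no LM-Kit calls; the class and method names (`VectorMath`, `CosineSimilarity`) are hypothetical and are not part of the demo or the SDK, which computes similarity internally.

```csharp
using System;

public static class VectorMath
{
    // Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
    // Hypothetical helper for illustration; not the SDK's implementation.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // Treat a zero vector as having no similarity rather than dividing by zero.
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

A score near 1 means the two vectors point in almost the same direction, which is what a near-duplicate pair is expected to produce; unrelated images score noticeably lower.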
💻 Application Overview
Interactive menu (no command-line arguments) with four modes plus Quit. The embedding model loads once at startup.
| Mode | What it does |
|---|---|
| Index | Walk a folder, embed every image with nomic-embed-vision, persist into a file-based DataSource. |
| Duplicates | Prompt for a cosine threshold (default 0.92), cluster every neighbour above the threshold, write duplicates_report.md + duplicates.csv, print reclaimable-bytes estimate. |
| Search | REPL of query-image paths; each returns top-K nearest neighbours with similarity scores. |
| Stats | Image count, total bytes, embedding dimension. |
| Quit | Exit. |
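As a rough picture of what Search mode returns, the sketch below ranks an in-memory set of embeddings against a query vector and keeps the top K. This is a brute-force scan written for illustration, not the SDK's VectorSearch; the `BruteForceSearch` class, the dictionary-based index, and the `TopK` signature are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BruteForceSearch
{
    // Returns the k entries most similar to the query, best match first.
    // Hypothetical stand-in for an index lookup: it scores every entry.
    public static List<(string Path, double Score)> TopK(
        float[] query,
        IReadOnlyDictionary<string, float[]> index,
        int k)
    {
        return index
            .Select(entry => (Path: entry.Key, Score: Cosine(query, entry.Value)))
            .OrderByDescending(hit => hit.Score)
            .Take(k)
            .ToList();
    }

    // Cosine similarity; returns 0 when either vector has zero length.
    private static double Cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            na += (double)a[i] * a[i];
            nb += (double)b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.Sqrt(na * nb);
    }
}
```

A linear scan like this costs one similarity computation per indexed image for every query, which is acceptable for small folders; a persistent vector index avoids re-embedding and is the route the demo takes.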
Output artefacts (after Duplicates mode):
- duplicates_report.md: one section per cluster with similarity / size / path.
- duplicates.csv: flat rows with columns cluster, similarity, size_bytes, path.
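For readers who want to consume or reproduce the flat CSV, here is a minimal sketch of a writer for the four columns named above. Whether the demo emits a header row, how many decimals it prints, and how it quotes paths are not stated in this README, so those choices (and the `DuplicateCsv` and `Row` names) are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class DuplicateCsv
{
    // One output line per image that belongs to a duplicate cluster.
    public record Row(int Cluster, double Similarity, long SizeBytes, string Path);

    // Builds the CSV text. The header row and 4-decimal similarity are assumptions.
    public static string Build(IEnumerable<Row> rows)
    {
        var sb = new StringBuilder();
        sb.Append("cluster,similarity,size_bytes,path\n");
        foreach (var r in rows)
        {
            // Invariant culture keeps '.' as the decimal separator on every machine.
            sb.Append(r.Cluster.ToString(CultureInfo.InvariantCulture)).Append(',')
              .Append(r.Similarity.ToString("F4", CultureInfo.InvariantCulture)).Append(',')
              .Append(r.SizeBytes.ToString(CultureInfo.InvariantCulture)).Append(',')
              .Append(Quote(r.Path)).Append('\n');
        }
        return sb.ToString();
    }

    // RFC 4180 style quoting: wrap in quotes when needed and double embedded quotes.
    private static string Quote(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }
}
```

Quoting matters here because file paths may legitimately contain commas, which would otherwise shift the columns when the file is opened in a spreadsheet.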
✨ Key Features
- Embedder.GetEmbeddings(Attachment): one call per image; returns a fixed-length float vector.
- DataSource.CreateFileDataSource(...): persistent on-disk index so the embedding pass is paid once.
- VectorSearch.FindMatchingPartitions(...): cosine-similarity neighbours straight from the SDK.
- Cluster aggregation: longest-overlap-wins growth so each image lands in at most one cluster.
- Reclaimable-bytes estimate: keep the largest file per cluster, report the rest as savings.
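The last two features can be sketched in a few lines. The grouping below is a simplified greedy scheme (each unassigned image seeds a cluster and claims its unassigned neighbours above the threshold), swapped in for illustration; it is not the demo's longest-overlap-wins logic, though it preserves the property that an image lands in at most one cluster. The `DuplicateClusters` class and `Image` record are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DuplicateClusters
{
    public record Image(string Path, long SizeBytes, float[] Embedding);

    // Greedy grouping: each unassigned image seeds a cluster and claims every
    // other unassigned image whose similarity to the seed meets the threshold.
    public static List<List<Image>> Cluster(IReadOnlyList<Image> images, double threshold)
    {
        var assigned = new bool[images.Count];
        var clusters = new List<List<Image>>();
        for (int i = 0; i < images.Count; i++)
        {
            if (assigned[i]) continue;
            assigned[i] = true;
            var members = new List<Image> { images[i] };
            for (int j = i + 1; j < images.Count; j++)
            {
                if (!assigned[j] && Cosine(images[i].Embedding, images[j].Embedding) >= threshold)
                {
                    members.Add(images[j]);
                    assigned[j] = true;
                }
            }
            // Singletons are not duplicates, so only groups of two or more are kept.
            if (members.Count > 1) clusters.Add(members);
        }
        return clusters;
    }

    // Keep the largest file in each cluster; every other file counts as savings.
    public static long ReclaimableBytes(IEnumerable<List<Image>> clusters) =>
        clusters.Sum(c => c.Sum(m => m.SizeBytes) - c.Max(m => m.SizeBytes));

    // Cosine similarity; returns 0 when either vector has zero length.
    private static double Cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            na += (double)a[i] * a[i];
            nb += (double)b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.Sqrt(na * nb);
    }
}
```

With a cluster holding a 300-byte and a 100-byte copy of the same photo, the estimate keeps the 300-byte file and reports 100 bytes as reclaimable. The greedy pass compares every pair in the worst case, so it suits illustration better than large libraries.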
🧠 Model
nomic-embed-vision: fixed across this demo. (Edit the source to use a different image-embedding model.)
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
▶️ Running the Application
```bash
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/vision/image-embeddings/image_similarity_search
dotnet run
```
Pick a mode from the menu and follow the prompts.