👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/vision/image-embeddings/image_similarity_search

Visual Similarity & Near-Duplicate Index for C# .NET Applications


🎯 Purpose of the Demo

An interactive console app that embeds an image folder into a persistent file-based vector index, then either (a) clusters near-duplicates above a cosine threshold or (b) runs visual search by query image. Built on LM.LoadFromModelID("nomic-embed-vision"), Embedder, DataSource, and VectorSearch.

All embedding and search run on-device.


👥 Industry Target Audience

  • Photo libraries / DAM: surface near-duplicates and reclaim storage.
  • E-commerce: find product photos that look alike across listings.
  • Content moderation / brand-safety: detect re-uploads of flagged images.
  • Personal archives: clean up years of unsorted captures.
  • Visual search UX: "find more like this" backend.

🚀 Problem Solved

Perceptual hashing often misses crops, rescales, re-encodes, and minor edits. Embedding-based similarity catches many of them: the same scene shot from a slightly different angle, or the same photo re-saved at a different quality, ends up close in vector space. The demo uses that for two practical tasks: clustering duplicates (with a reclaimable-bytes estimate) and searching by example image.
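To make "close in vector space" concrete, here is a minimal sketch of the cosine-similarity measure that the threshold in Duplicates mode refers to. It is plain C# and does not call the LM-Kit SDK; the class name VectorMath and the zero-vector handling are assumptions made for illustration, not part of the demo.

```csharp
using System;

// Hypothetical helper, not part of LM-Kit or the demo source.
public static class VectorMath
{
    // Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // Assumption of this sketch: a zero vector is treated as dissimilar to everything.
        if (normA == 0 || normB == 0)
            return 0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Two embeddings of the same photo saved at different qualities would be expected to score near 1, while unrelated images score lower; the 0.92 default mentioned later is a cut-off on this kind of score.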


💻 Application Overview

Interactive menu with no command-line arguments, offering four modes plus Quit. The embedding model loads once at startup.

| Mode | What it does |
| --- | --- |
| Index | Walk a folder, embed every image with nomic-embed-vision, persist into a file-based DataSource. |
| Duplicates | Prompt for a cosine threshold (default 0.92), cluster every neighbour above the threshold, write duplicates_report.md + duplicates.csv, print reclaimable-bytes estimate. |
| Search | REPL of query-image paths; each returns top-K nearest neighbours with similarity scores. |
| Stats | Image count, total bytes, embedding dimension. |
| Quit | Exit. |
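Search mode's top-K ranking can be pictured with a brute-force sketch like the one below. The demo itself delegates neighbour lookup to the SDK's VectorSearch; this standalone version only illustrates the idea. The class BruteForceSearch, the in-memory dictionary standing in for the index, and the method names are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an index lookup; does not use LM-Kit.
public static class BruteForceSearch
{
    // Scores every indexed vector against the query and returns the k best matches.
    public static List<(string Path, double Score)> TopK(
        float[] query, IReadOnlyDictionary<string, float[]> index, int k)
    {
        return index
            .Select(e => (Path: e.Key, Score: Cosine(query, e.Value)))
            .OrderByDescending(r => r.Score)
            .Take(k)
            .ToList();
    }

    // Cosine similarity; returns 0 when either vector has zero length (a choice of this sketch).
    private static double Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

A linear scan like this costs one similarity computation per indexed image per query, which is usually acceptable for a folder-sized collection.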

Output artefacts (after Duplicates mode):

  • duplicates_report.md — one section per cluster with similarity / size / path.
  • duplicates.csv — flat: cluster, similarity, size_bytes, path.
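As an illustration of the flat CSV layout listed above, the following sketch writes rows with the same four columns. Only the column names come from this page; the DuplicateRow record, the DuplicateCsv class, the four-decimal similarity format, and the quoting rule are assumptions, and the demo's actual writer may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical row type mirroring the documented columns.
public record DuplicateRow(int Cluster, double Similarity, long SizeBytes, string Path);

public static class DuplicateCsv
{
    // Writes a header line followed by one line per row.
    public static void Write(TextWriter writer, IEnumerable<DuplicateRow> rows)
    {
        writer.WriteLine("cluster,similarity,size_bytes,path");
        foreach (var r in rows)
        {
            writer.WriteLine(string.Join(",",
                r.Cluster.ToString(CultureInfo.InvariantCulture),
                // Invariant culture keeps '.' as the decimal separator on every machine.
                r.Similarity.ToString("F4", CultureInfo.InvariantCulture),
                r.SizeBytes.ToString(CultureInfo.InvariantCulture),
                Quote(r.Path)));
        }
    }

    // Quotes a field only when it contains a comma, quote, or line break.
    private static string Quote(string field) =>
        field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0
            ? "\"" + field.Replace("\"", "\"\"") + "\""
            : field;
}
```

Passing a StreamWriter opened on duplicates.csv would produce a file on disk; a StringWriter works the same way for inspection in memory.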

✨ Key Features

  • Embedder.GetEmbeddings(Attachment): one call per image; returns a fixed-length float vector.
  • DataSource.CreateFileDataSource(...): persistent on-disk index so the embedding pass is paid once.
  • VectorSearch.FindMatchingPartitions(...): cosine-similarity neighbours straight from the SDK.
  • Cluster aggregation: longest-overlap-wins growth so each image lands in at most one cluster.
  • Reclaimable-bytes estimate: keep the largest file per cluster, report the rest as savings.
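The last two features can be sketched together. Note that the grouping below is a plain union-find (single-linkage) merge of every pair above the threshold, swapped in for simplicity; it is not the demo's longest-overlap-wins rule, although it shares the property that each image ends up in at most one cluster. The reclaimable-bytes part follows the description above: keep the largest file per cluster and count the rest. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; a simplified stand-in for the demo's cluster aggregation.
public static class DuplicateClusters
{
    // Merges items linked by a pair whose similarity is above the threshold.
    // Returns only groups with more than one member, as lists of item indices.
    public static List<List<int>> Cluster(
        int count, IEnumerable<(int A, int B, double Similarity)> pairs, double threshold)
    {
        var parent = Enumerable.Range(0, count).ToArray();

        int Find(int x)
        {
            while (parent[x] != x)
            {
                parent[x] = parent[parent[x]]; // path halving keeps lookups short
                x = parent[x];
            }
            return x;
        }

        foreach (var (a, b, sim) in pairs)
        {
            if (sim > threshold)
                parent[Find(a)] = Find(b);
        }

        return Enumerable.Range(0, count)
            .GroupBy(i => Find(i))
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList())
            .ToList();
    }

    // Keeps the largest file in each cluster and sums the sizes of the others.
    public static long ReclaimableBytes(
        IEnumerable<List<int>> clusters, IReadOnlyList<long> sizes)
    {
        long total = 0;
        foreach (var cluster in clusters)
        {
            var clusterSizes = cluster.Select(i => sizes[i]).ToList();
            total += clusterSizes.Sum() - clusterSizes.Max();
        }
        return total;
    }
}
```

Single linkage can chain loosely related images into one group (A matches B, B matches C, but A and C are far apart), which is one reason a stricter aggregation rule such as the demo's may be preferable in practice.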

🧠 Model

  • nomic-embed-vision — fixed across this demo. (Edit the source to use a different image-embedding model.)

🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/vision/image-embeddings/image_similarity_search
dotnet run

Pick a mode from the menu and follow the prompts.
