👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/vision/image-understanding/vlm_visual_qa

VLM Visual Q&A for C# .NET Applications

🎯 Purpose of the Demo

An interactive console app that asks vision-language models analytical questions about images. Pick the vision model at startup, then the mode from a menu: chat with one image, run a standard 4-question audit, or caption every image in a folder.

All inference runs on-device.

👥 Industry Target Audience

Image-aware chat / search / moderation: replace several task-specific CV pipelines with one VLM.
DAM / asset management: bulk-caption media libraries.
E-commerce: automated alt-text and category cues for product photos.
Accessibility: generate alt-text for user-uploaded images.
Triage / claims: caption and audit photos arriving from the field.

🚀 Problem Solved

A VLM is a single endpoint that answers caption / count / describe / yes-no questions without per-task training. The demo wraps that endpoint behind a menu so the same model handles ad-hoc REPL exploration and folder-wide batch captioning. The folder mode emits a portable CSV that downstream tooling can consume immediately.
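
The folder-mode export could be sketched roughly as follows. This is a minimal illustration, not the demo's actual code: `FolderCaptions`, `FindImages`, `Escape`, `BuildCsv`, the extension list, the header row, and the `captionImage` delegate (a stand-in for the VLM call) are all assumptions introduced here. It shows one way to keep `path,caption` rows portable by quoting fields that contain commas, quotes, or line breaks, in the style of RFC 4180.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical stand-in for the model call; a real app would query the VLM here.
Func<string, string> captionImage = path => $"caption for {Path.GetFileName(path)}";

// Folder to scan: first command-line argument, or the current directory.
string folder = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();
Console.Write(FolderCaptions.BuildCsv(FolderCaptions.FindImages(folder), captionImage));

static class FolderCaptions
{
    // Assumed extension list; the demo's actual set of supported formats may differ.
    static readonly HashSet<string> ImageExtensions =
        new(StringComparer.OrdinalIgnoreCase) { ".png", ".jpg", ".jpeg", ".bmp", ".gif", ".webp" };

    // Returns the image files directly inside the folder, in a stable order.
    public static IEnumerable<string> FindImages(string folder) =>
        Directory.EnumerateFiles(folder)
                 .Where(f => ImageExtensions.Contains(Path.GetExtension(f)))
                 .OrderBy(f => f, StringComparer.OrdinalIgnoreCase);

    // Quotes a field only when it contains a comma, quote, CR, or LF;
    // embedded quotes are doubled.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Builds the whole CSV: an assumed header row, then one row per image.
    public static string BuildCsv(IEnumerable<string> paths, Func<string, string> caption)
    {
        var sb = new StringBuilder("path,caption\r\n");
        foreach (var p in paths)
            sb.Append(Escape(p)).Append(',').Append(Escape(caption(p))).Append("\r\n");
        return sb.ToString();
    }
}
```

Captions are free text and often contain commas, so quoting them is what lets spreadsheet tools and CSV parsers read the file without custom handling.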

💻 Application Overview

Interactive menu (no command-line arguments) with three modes and a Quit entry:

| Mode | What it does |
| --- | --- |
| Chat | Attach one image and ask repeated questions (REPL). Token streaming via `AfterTextCompletion`. |
| Audit | Run a standard 4-question audit: caption, 3-sentence description, person count, outdoors yes/no. |
| Folder | Caption every supported image in a folder using a chosen prompt; write a `path,caption` CSV. |
| Quit | Exit. |

✨ Key Features

Model picker at startup: `qwen3.5:4b`, `qwen3.5:9b`, `gemma4:e2b`, `gemma4:e4b`, `glm-4.6v-flash`, or a custom URI / ID.
`LM.HasVision` capability check at runtime.
`Attachment(imagePath)` to attach an image to a chat turn.
`chat.Submit(new ChatHistory.Message(prompt, attachment))` for vision input.
`AfterTextCompletion` token streaming.
CSV folder export: portable `path,caption` rows for downstream tools.

🧠 Models

qwen3.5:4b (default, fast, ~3 GB VRAM)
qwen3.5:9b (~6 GB VRAM)
gemma4:e2b (~3 GB VRAM)
gemma4:e4b (~5 GB VRAM)
glm-4.6v-flash (~6 GB VRAM)

🛠️ Getting Started

📋 Prerequisites

.NET 8.0 or later
A GPU is recommended for VLMs (CPU works but is slower).

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/vision/image-understanding/vlm_visual_qa
dotnet run

Pick the vision model at startup, then a mode from the menu.
