👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/vision/image-understanding/vlm_visual_qa
VLM Visual Q&A for C# .NET Applications
🎯 Purpose of the Demo
An interactive console app that asks vision-language models analytical questions about images. Pick the vision model at startup, then the mode from a menu: chat with one image, run a standard 4-question audit, or caption every image in a folder.
All inference runs on-device.
👥 Industry Target Audience
- Image-aware chat / search / moderation: replace several task-specific CV pipelines with one VLM.
- DAM / asset management: bulk-caption media libraries.
- E-commerce: automated alt-text and category cues for product photos.
- Accessibility: generate alt-text for user-uploaded images.
- Triage / claims: caption and audit photos arriving from the field.
🚀 Problem Solved
A VLM is a single endpoint that answers caption / count / describe / yes-no questions without per-task training. The demo wraps that endpoint behind a menu so the same model handles ad-hoc REPL exploration and folder-wide batch captioning. The folder mode emits a portable CSV that downstream tooling can consume immediately.
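A minimal sketch of how the folder mode's path,caption export could be written. It does not reproduce the demo's actual code: the `CaptionCsv` class, the header row, and the RFC 4180-style quoting are assumptions made for illustration. Quoting matters because model captions often contain commas.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of LM-Kit or the demo.
public static class CaptionCsv
{
    // Quote a field when it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled (RFC 4180 style).
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) < 0)
            return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Build the CSV text: an assumed header row, then one row per image.
    public static string Build(IEnumerable<(string Path, string Caption)> rows)
    {
        var sb = new StringBuilder();
        sb.Append("path,caption\r\n");
        foreach (var (path, caption) in rows)
            sb.Append(Escape(path)).Append(',').Append(Escape(caption)).Append("\r\n");
        return sb.ToString();
    }

    // Write the CSV as UTF-8 without a byte-order mark.
    public static void Write(string csvPath, IEnumerable<(string Path, string Caption)> rows)
        => File.WriteAllText(csvPath, Build(rows), new UTF8Encoding(false));
}
```

Keeping `Build` separate from `Write` lets the row formatting be checked without touching the file system.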
💻 Application Overview
Interactive menu (no command-line arguments) with three modes, plus Quit:
| Mode | What it does |
|---|---|
| Chat | Attach one image and ask repeated questions (REPL). Token streaming via AfterTextCompletion. |
| Audit | Run a standard 4-question audit: caption, 3-sentence description, person count, outdoors yes/no. |
| Folder | Caption every supported image in a folder using a chosen prompt; write a path,caption CSV. |
| Quit | Exit. |
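The Audit row can be pictured as a fixed list of prompts run against a single image. The sketch below is illustrative only: the prompt wording, the labels, and the `ask` delegate (a stand-in for whatever call sends the image and prompt to the model) are assumptions, not the demo's implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of LM-Kit or the demo.
public static class ImageAudit
{
    // The four audit questions from the table; the exact wording is assumed.
    public static readonly (string Label, string Prompt)[] Questions =
    {
        ("caption", "Write a one-sentence caption for this image."),
        ("description", "Describe this image in exactly three sentences."),
        ("person_count", "How many people are visible? Answer with a number."),
        ("outdoors", "Was this photo taken outdoors? Answer yes or no."),
    };

    // Ask every question about one image and collect the trimmed answers.
    // `ask(imagePath, prompt)` is a placeholder for the real VLM call.
    public static Dictionary<string, string> Run(
        string imagePath, Func<string, string, string> ask)
    {
        var answers = new Dictionary<string, string>();
        foreach (var (label, prompt) in Questions)
            answers[label] = ask(imagePath, prompt).Trim();
        return answers;
    }
}
```

Passing the model call in as a delegate keeps the question list testable with a stub and independent of any particular model.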
✨ Key Features
- Model picker at startup:
qwen3.5:4b,qwen3.5:9b,gemma4:e2b,gemma4:e4b,glm-4.6v-flash, or custom URI / ID. LM.HasVisioncapability check at runtime.Attachment(imagePath)to attach an image to a chat turn.chat.Submit(new ChatHistory.Message(prompt, attachment))for vision input.AfterTextCompletiontoken streaming.- CSV folder export: portable
path,captionrows for downstream tools.
🧠 Models
- qwen3.5:4b (default, fast, ~3 GB VRAM)
- qwen3.5:9b (~6 GB VRAM)
- gemma4:e2b (~3 GB VRAM)
- gemma4:e4b (~5 GB VRAM)
- glm-4.6v-flash (~6 GB VRAM)
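One possible way to turn the startup prompt into a model identifier, sketched under assumptions: the numbered-menu convention, the `ModelPicker` class, and the rule that empty input selects the default are all hypothetical. Only the model IDs come from the list above.

```csharp
// Hypothetical helper, not part of LM-Kit or the demo.
public static class ModelPicker
{
    // Model IDs from this README in menu order; the first entry is the default.
    public static readonly string[] KnownModels =
    {
        "qwen3.5:4b", "qwen3.5:9b", "gemma4:e2b", "gemma4:e4b", "glm-4.6v-flash",
    };

    // Map raw menu input to a model ID, or pass it through as a custom URI / ID.
    public static string Resolve(string input)
    {
        var text = (input ?? "").Trim();
        if (text.Length == 0)
            return KnownModels[0]; // empty input: assume the default model
        if (int.TryParse(text, out var n) && n >= 1 && n <= KnownModels.Length)
            return KnownModels[n - 1]; // numbered menu choice
        return text; // anything else is treated as a custom URI / ID
    }
}
```

The resolved string would then be handed to whatever model-loading call the application uses, followed by the vision capability check.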
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
- A GPU is recommended for VLMs (CPU works but is slower).
▶️ Running the Application
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/vision/image-understanding/vlm_visual_qa
dotnet run
Pick the vision model at startup, then a mode from the menu.