👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/vision/image-understanding/vlm_visual_qa

VLM Visual Q&A for C# .NET Applications


🎯 Purpose of the Demo

An interactive console app that asks vision-language models analytical questions about images. Pick the vision model at startup, then the mode from a menu: chat with one image, run a standard 4-question audit, or caption every image in a folder.

All inference runs on-device.


👥 Industry Target Audience

  • Image-aware chat / search / moderation: replace several task-specific CV pipelines with one VLM.
  • DAM / asset management: bulk-caption media libraries.
  • E-commerce: automated alt-text and category cues for product photos.
  • Accessibility: generate alt-text for user-uploaded images.
  • Triage / claims: caption and audit photos arriving from the field.

🚀 Problem Solved

A VLM is a single endpoint that answers caption / count / describe / yes-no questions without per-task training. The demo wraps that endpoint behind a menu so the same model handles ad-hoc REPL exploration and folder-wide batch captioning. The folder mode emits a portable CSV that downstream tooling can consume immediately.
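To make the CSV output concrete, here is a minimal sketch of a folder export using only the .NET base class library. It is not the demo's actual code: the class name `CaptionCsv`, the extension list, the header row, and the `caption` delegate (a stand-in for whatever calls the vision model) are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of the demo or of LM-Kit.
public static class CaptionCsv
{
    // Assumed set of image extensions; the demo's real list may differ.
    private static readonly HashSet<string> ImageExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { ".jpg", ".jpeg", ".png", ".bmp", ".webp" };

    // Quote a field when it contains a comma, a quote, or a line break
    // (RFC 4180 style), doubling any embedded quotes.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) < 0)
            return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Caption every image directly inside 'folder' and write path,caption rows.
    // 'caption' is a placeholder for the call into the vision model.
    // Returns the number of images written.
    public static int Export(string folder, string csvPath, Func<string, string> caption)
    {
        var rows = new List<string> { "path,caption" }; // header row is an assumption

        IEnumerable<string> images = Directory.EnumerateFiles(folder)
            .Where(p => ImageExtensions.Contains(Path.GetExtension(p)))
            .OrderBy(p => p, StringComparer.Ordinal);

        foreach (string path in images)
            rows.Add(Escape(path) + "," + Escape(caption(path)));

        File.WriteAllLines(csvPath, rows, new UTF8Encoding(false)); // UTF-8 without BOM
        return rows.Count - 1;
    }
}
```

Quoting matters here because model-generated captions routinely contain commas and quotation marks; without escaping, a spreadsheet or CSV parser would split one caption across several columns.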


💻 Application Overview

Interactive menu (no command-line arguments) with three modes and a Quit entry:

Mode   | What it does
Chat   | Attach one image and ask repeated questions (REPL). Tokens stream via AfterTextCompletion.
Audit  | Run a standard 4-question audit: caption, 3-sentence description, person count, outdoors yes/no.
Folder | Caption every supported image in a folder using a chosen prompt; write a path,caption CSV.
Quit   | Exit.
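The Audit mode amounts to running a fixed list of prompts against one attached image. A rough sketch of that loop, using only the base class library, might look like the following. The prompt wording, the labels, and the `ask` delegate (a stand-in for a call into the chat session that already holds the image) are illustrative assumptions; the demo's actual prompts are not shown on this page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the demo or of LM-Kit.
public static class ImageAudit
{
    // The four audit questions listed above, with assumed wording.
    public static readonly (string Label, string Prompt)[] Questions =
    {
        ("caption", "Write a one-line caption for this image."),
        ("description", "Describe this image in exactly three sentences."),
        ("person_count", "How many people are visible? Answer with a number."),
        ("outdoors", "Was this photo taken outdoors? Answer yes or no."),
    };

    // Ask every question in order and collect the trimmed answers by label.
    // 'ask' is a placeholder that sends one prompt to the model and returns its reply.
    public static Dictionary<string, string> Run(Func<string, string> ask)
    {
        var answers = new Dictionary<string, string>();
        foreach (var (label, prompt) in Questions)
            answers[label] = ask(prompt).Trim();
        return answers;
    }
}
```

Keeping the questions in a single table makes it easy to add or reword an audit item without touching the loop that drives the model.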

✨ Key Features

  • Model picker at startup: qwen3.5:4b, qwen3.5:9b, gemma4:e2b, gemma4:e4b, glm-4.6v-flash, or custom URI / ID.
  • LM.HasVision capability check at runtime.
  • Attachment(imagePath) to attach an image to a chat turn.
  • chat.Submit(new ChatHistory.Message(prompt, attachment)) for vision input.
  • AfterTextCompletion token streaming.
  • CSV folder export: portable path,caption rows for downstream tools.
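Put together, the API pieces named above could combine roughly as follows. Treat this as a sketch under stated assumptions, not the demo's source: `HasVision`, `Attachment`, `ChatHistory.Message`, `Submit`, and `AfterTextCompletion` come from the feature list, while the namespaces, `LM.LoadFromModelID`, `MultiTurnConversation`, the `e.Text` event property, and the file name `photo.jpg` are assumptions that should be checked against the LM-Kit API reference for your SDK version.

```csharp
using System;
using LMKit.Data;                 // Attachment (assumed namespace)
using LMKit.Model;                // LM (assumed namespace)
using LMKit.TextGeneration;       // MultiTurnConversation (assumed namespace)
using LMKit.TextGeneration.Chat;  // ChatHistory (assumed namespace)

// Assumption: models can be loaded by ID with LM.LoadFromModelID.
LM model = LM.LoadFromModelID("qwen3.5:4b");

// Capability check named in the feature list: stop early for text-only models.
if (!model.HasVision)
{
    Console.WriteLine("The selected model cannot read images.");
    return;
}

// Assumption: MultiTurnConversation is the chat type that exposes Submit.
var chat = new MultiTurnConversation(model);

// Print tokens as they arrive; e.Text is assumed to carry the newly generated text.
chat.AfterTextCompletion += (sender, e) => Console.Write(e.Text);

// Attach one image (hypothetical path) to a chat turn and ask a question about it.
var attachment = new Attachment("photo.jpg");
chat.Submit(new ChatHistory.Message("Describe this image in one sentence.", attachment));
Console.WriteLine();
```

Checking `HasVision` before building the chat session avoids submitting an image to a model that would silently ignore it.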

🧠 Models

  • qwen3.5:4b (default, fast, ~3 GB VRAM)
  • qwen3.5:9b (~6 GB VRAM)
  • gemma4:e2b (~3 GB VRAM)
  • gemma4:e4b (~5 GB VRAM)
  • glm-4.6v-flash (~6 GB VRAM)

🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later
  • A GPU is recommended for VLMs (CPU works but is slower).

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/vision/image-understanding/vlm_visual_qa
dotnet run

Pick the vision model at startup, then a mode from the menu.