Analyze Images with Vision Language Models
Vision Language Models (VLMs) combine text and image understanding in a single model. Instead of chaining OCR, object detection, and text generation as separate steps, a VLM processes an image directly and answers questions about it in natural language. This tutorial builds a working image analysis program that describes images, answers visual questions, and handles multi-turn conversations about images.
Why Local Vision Matters
Two enterprise problems that on-device VLMs solve:
- Sensitive document processing. Organizations handling medical scans, legal evidence, classified imagery, or proprietary engineering diagrams cannot send images to cloud APIs. A local VLM processes everything on-premises, keeping sensitive visual data within the organization's infrastructure.
- Field inspection and quality control. Manufacturing floors, construction sites, and remote facilities need real-time visual analysis without depending on internet connectivity. A local VLM running on an edge device or laptop can inspect parts, flag defects, and read labels offline.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB (for a 4B VLM) |
| Disk | ~3 GB free for model download |
| Test image | Any .jpg, .png, .bmp, or .webp file |
Step 1: Create the Project
dotnet new console -n VisionQuickstart
cd VisionQuickstart
dotnet add package LM-Kit.NET
Step 2: Understand How VLMs Work
A VLM extends a text-only LLM with a vision encoder. When you send an image along with a text prompt, the vision encoder converts the image into a sequence of visual tokens that the language model processes alongside the text tokens. The result is a unified understanding of both modalities.
              ┌──────────────┐
Image file ──►│    Vision    │──► Visual tokens ─┐
              │   Encoder    │                   │
              └──────────────┘                   ▼
                                          ┌──────────────┐
                                          │   Language   │──► Text response
                                          │    Model     │
              ┌──────────────┐            └──────────────┘
Text prompt ─►│  Tokenizer   │──► Text tokens ───┘
              └──────────────┘
In LM-Kit.NET, you send an image by attaching an Attachment to a ChatHistory.Message; the model handles the visual encoding internally.
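In its simplest form, a vision request is a single message that carries both a prompt and an image. A minimal preview (the full program in Step 3 shows this in context):

var message = new ChatHistory.Message("What is in this image?", new Attachment("photo.jpg"));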
Step 3: Basic Image Analysis
This program loads a VLM, takes an image path as input, describes the image, and enters a multi-turn chat loop for follow-up questions.
using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.TextGeneration.Chat;
LMKit.Licensing.LicenseManager.SetLicenseKey(""); // set your LM-Kit license key here
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load a Vision Language Model
// ──────────────────────────────────────
Console.WriteLine("Loading vision model...");
using LM model = LM.LoadFromModelID("qwen3-vl:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine($"\n\nModel loaded: {model.Name}");
Console.WriteLine($" Vision: {model.HasVision}\n");
// ──────────────────────────────────────
// 2. Get the image path
// ──────────────────────────────────────
string imagePath = args.Length > 0 ? args[0] : "";
if (string.IsNullOrWhiteSpace(imagePath))
{
    Console.Write("Enter the path to an image file: ");
    imagePath = Console.ReadLine()?.Trim('"') ?? "";
}
if (!File.Exists(imagePath))
{
    Console.WriteLine($"File not found: {imagePath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-image>");
    return;
}
// ──────────────────────────────────────
// 3. Set up the conversation
// ──────────────────────────────────────
var chat = new MultiTurnConversation(model)
{
    MaximumCompletionTokens = 1024,
    SystemPrompt = "You are a visual analysis assistant. Describe images accurately and " +
                   "answer questions about their content. Be specific about colors, text, " +
                   "positions, and quantities when relevant."
};
// Stream tokens as they are generated
chat.AfterTextCompletion += (_, e) =>
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
};
// ──────────────────────────────────────
// 4. First turn: describe the image
// ──────────────────────────────────────
Console.WriteLine($"Analyzing: {Path.GetFileName(imagePath)}\n");
var attachment = new Attachment(imagePath);
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Assistant: ");
Console.ResetColor();
var result = chat.Submit(
    new ChatHistory.Message("Describe this image in detail.", attachment));
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($"\n [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
Console.ResetColor();
// ──────────────────────────────────────
// 5. Follow-up questions (text only)
// ──────────────────────────────────────
Console.WriteLine("Ask follow-up questions about the image (or 'quit' to exit):\n");
while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("You: ");
    Console.ResetColor();
    string? question = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(question) || question.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;
    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("Assistant: ");
    Console.ResetColor();
    result = chat.Submit(new ChatHistory.Message(question));
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.WriteLine($"\n [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
    Console.ResetColor();
}
Run it:
dotnet run -- "photo.jpg"
Step 4: Practical Use Cases
Invoice and Receipt Reading
var invoice = new Attachment("receipt.jpg");
chat.Submit(new ChatHistory.Message(
    "Extract all line items from this receipt. " +
    "For each item list: name, quantity, unit price, and total. " +
    "Also extract the subtotal, tax, and grand total.",
    invoice));
Visual Quality Inspection
var partPhoto = new Attachment("component.png");
chat.Submit(new ChatHistory.Message(
    "Inspect this manufactured component. " +
    "Identify any defects: scratches, cracks, discoloration, or misalignment. " +
    "Rate the overall quality as PASS, MARGINAL, or FAIL.",
    partPhoto));
Diagram and Chart Interpretation
var chart = new Attachment("quarterly-chart.png");
chat.Submit(new ChatHistory.Message(
    "Analyze this chart. What trends do you see? " +
    "Which quarter had the highest value? " +
    "Summarize the key takeaways.",
    chart));
Step 5: Analyzing Multiple Images
To analyze a new image in the same session, attach it to a new message. The model retains the conversation history, so you can compare images across turns:
// First image
var before = new Attachment("site-before.jpg");
chat.Submit(new ChatHistory.Message("Describe this construction site.", before));
// Second image in the same conversation
var after = new Attachment("site-after.jpg");
chat.Submit(new ChatHistory.Message(
    "Now look at this updated photo of the same site. " +
    "What has changed since the first image?", after));
To start fresh with a new image and no prior context:
chat.ClearHistory();
var newImage = new Attachment("new-photo.jpg");
chat.Submit(new ChatHistory.Message("Describe this image.", newImage));
Choosing a Vision Model
| Model ID | VRAM | Speed | Quality | Best For |
|---|---|---|---|---|
| qwen3-vl:2b | ~2.5 GB | Fastest | Good | Quick classification, simple descriptions |
| qwen3-vl:4b | ~4 GB | Fast | Very good | General analysis (recommended start) |
| gemma3:4b | ~5.7 GB | Fast | Very good | Multilingual image understanding |
| qwen3-vl:8b | ~6.5 GB | Moderate | Excellent | Detailed analysis, complex reasoning |
| gemma3:12b | ~11 GB | Slower | Excellent | Highest accuracy, OCR-grade text reading |
| ministral3:3b | ~3.5 GB | Fast | Good | Lightweight edge deployment |
For document processing (invoices, forms, text-heavy images), larger models (8B+) read small text more accurately. For general object recognition and scene description, 4B models offer the best speed/quality balance.
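When deployment targets vary in available VRAM, this guidance can be encoded as a small selection helper. The sketch below is illustrative only: PickVisionModel is a hypothetical function, and its thresholds simply restate the VRAM column from the table above.

// Hypothetical helper: pick a model ID from the table above based on the
// workload and available VRAM. Thresholds mirror the table, not LM-Kit rules.
static string PickVisionModel(bool textHeavy, double vramGb)
{
    if (textHeavy && vramGb >= 11.0) return "gemma3:12b";  // OCR-grade text reading
    if (textHeavy && vramGb >= 6.5) return "qwen3-vl:8b";  // better small-text accuracy
    if (vramGb >= 4.0) return "qwen3-vl:4b";               // recommended starting point
    return "qwen3-vl:2b";                                  // lightweight fallback
}

using LM model = LM.LoadFromModelID(PickVisionModel(textHeavy: true, vramGb: 8.0));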
Combining Vision with Structured Extraction
Combine VLMs with structured extraction to get typed data from images instead of free-form text:
using LMKit.Extraction;
var extractor = new TextExtraction(model);
extractor.Elements = new List<TextExtractionElement>
{
    new("ItemCount", ElementType.Integer, "Number of distinct items visible"),
    new("DominantColor", ElementType.String, "The most prominent color"),
    new("ContainsText", ElementType.Bool, "Whether readable text is visible"),
    new("Description", ElementType.String, "One-sentence description")
};
extractor.SetContent(new Attachment("photo.jpg"));
var extractResult = extractor.Parse();
int items = extractResult["ItemCount"].As<int>();
string color = extractResult["DominantColor"].Value.ToString();
See Extract Structured Data from Unstructured Text for the full extraction API.
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| HasVision is False | Model is text-only | Use a VLM: qwen3-vl:4b, gemma3:4b, or ministral3:3b |
| Blurry or small text not read | Model too small for OCR tasks | Use qwen3-vl:8b or gemma3:12b for text-heavy images |
| Slow first response | Image encoding is compute-heavy | Normal for high-resolution images. Subsequent text-only turns are faster |
| Out of memory | Image generates many visual tokens | Resize large images before loading (see the sketch below), or use a smaller model |
| Wrong colors or counts | VLMs can hallucinate visual details | Ask the model to be precise; use structured extraction for critical data |
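For the out-of-memory case, one option is to downscale the image before attaching it. A minimal sketch, assuming the third-party SixLabors.ImageSharp package (dotnet add package SixLabors.ImageSharp); LM-Kit does not require this library, and the 1024 px cap is an illustrative threshold, not a documented limit:

using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Processing;

// Downscale so the longest side is at most 1024 px, preserving aspect ratio.
using (var image = Image.Load("large-photo.jpg"))
{
    image.Mutate(x => x.Resize(new ResizeOptions
    {
        Mode = ResizeMode.Max,
        Size = new Size(1024, 1024)
    }));
    image.Save("resized-photo.jpg");
}

var attachment = new Attachment("resized-photo.jpg");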
Next Steps
- Extract Structured Data from Unstructured Text: get typed JSON from images instead of free-form text.
- Build a Private Document Q&A System: process scanned PDFs with vision models.
- Create an AI Agent with Tools: build agents that can analyze images as part of a tool-using workflow.
- Samples: Multi-Turn Chat with Vision: full vision chat demo.