Analyze Images with Vision Language Models

Vision Language Models (VLMs) understand both text and images in a single model. Instead of chaining OCR, object detection, and text generation as separate steps, a VLM processes an image directly and answers questions about it in natural language. This tutorial builds a working image analysis program that describes images, answers visual questions, and handles multi-turn conversations about images.


Why Local Vision Matters

Two enterprise problems that on-device VLMs solve:

  1. Sensitive document processing. Organizations handling medical scans, legal evidence, classified imagery, or proprietary engineering diagrams cannot send images to cloud APIs. A local VLM processes everything on-premises, keeping sensitive visual data within the organization's infrastructure.
  2. Field inspection and quality control. Manufacturing floors, construction sites, and remote facilities need real-time visual analysis without depending on internet connectivity. A local VLM running on an edge device or laptop can inspect parts, flag defects, and read labels offline.

Prerequisites

| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB (for a 4B VLM) |
| Disk | ~3 GB free for model download |
| Test image | Any .jpg, .png, .bmp, or .webp file |
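
If you want to reject unsuitable files before loading a model, a small extension check is one option. This is a minimal sketch that only covers the four extensions listed in the table above; `HasSupportedImageExtension` is a hypothetical helper, not part of LM-Kit.NET, and it inspects the file name only, not the file contents.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Extensions taken from the prerequisites table; extend the set if your setup accepts more.
var supportedExtensions = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    ".jpg", ".png", ".bmp", ".webp"
};

// Hypothetical helper: true when the path ends in one of the listed extensions.
bool HasSupportedImageExtension(string path) =>
    supportedExtensions.Contains(Path.GetExtension(path));

Console.WriteLine(HasSupportedImageExtension("photo.JPG"));   // True
Console.WriteLine(HasSupportedImageExtension("notes.txt"));   // False
```

The comparison is case-insensitive, so `photo.JPG` and `photo.jpg` are treated the same.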

Step 1: Create the Project

dotnet new console -n VisionQuickstart
cd VisionQuickstart
dotnet add package LM-Kit.NET

Step 2: Understand How VLMs Work

A VLM extends a text-only LLM with a vision encoder. When you send an image along with a text prompt, the vision encoder converts the image into a sequence of visual tokens that the language model processes alongside the text tokens. The result is a unified understanding of both modalities.

                ┌──────────────┐
  Image file ──►│ Vision       │──► Visual tokens ─┐
                │ Encoder      │                    │
                └──────────────┘                    ▼
                                            ┌──────────────┐
                                            │  Language    │──► Text response
                                            │  Model       │
                ┌──────────────┐            └──────────────┘
  Text prompt ─►│ Tokenizer    │──► Text tokens ───┘
                └──────────────┘

In LM-Kit.NET, you wrap an image in an Attachment and pass it to a ChatHistory.Message together with the text prompt. The model handles the visual encoding internally.
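
To make "a sequence of visual tokens" more concrete, the sketch below estimates how many tokens an image might produce. It assumes a patch-based vision encoder that tiles the image into fixed-size squares and merges neighbouring patches into one token. The patch size, merge factor, and the helper `EstimateVisualTokens` are illustrative assumptions, not values or APIs from LM-Kit.NET or from any model named in this tutorial.

```csharp
using System;

// Illustrative estimate only: assumes the encoder tiles the image into
// patchSize x patchSize squares, then merges mergeFactor x mergeFactor
// neighbouring patches into a single visual token. Real models differ.
int EstimateVisualTokens(int width, int height, int patchSize = 14, int mergeFactor = 2)
{
    int cell = patchSize * mergeFactor;      // pixels covered by one token along each side
    int cols = (width + cell - 1) / cell;    // ceiling division
    int rows = (height + cell - 1) / cell;
    return cols * rows;
}

Console.WriteLine(EstimateVisualTokens(448, 448));     // 256
Console.WriteLine(EstimateVisualTokens(1792, 1344));   // 3072
```

Under these assumptions the token count grows with the image area, which is why large images take longer to encode and use more memory than small ones.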


Step 3: Basic Image Analysis

This program loads a VLM, takes an image path as input, describes the image, and enters a multi-turn chat loop for follow-up questions.

using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.TextGeneration.Chat;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a Vision Language Model
// ──────────────────────────────────────
Console.WriteLine("Loading vision model...");
using LM model = LM.LoadFromModelID("qwen3.5:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });

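Console.WriteLine($"\n\nModel loaded: {model.Name}");
Console.WriteLine($"  Vision: {model.HasVision}\n");

// Stop early if the selected model cannot process images
if (!model.HasVision)
{
    Console.WriteLine("This model is text-only. Choose a vision-capable model ID.");
    return;
}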

// ──────────────────────────────────────
// 2. Get the image path
// ──────────────────────────────────────
string imagePath = args.Length > 0 ? args[0] : "";

if (string.IsNullOrWhiteSpace(imagePath))
{
    Console.Write("Enter the path to an image file: ");
    imagePath = Console.ReadLine()?.Trim('"') ?? "";
}

if (!File.Exists(imagePath))
{
    Console.WriteLine($"File not found: {imagePath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-image>");
    return;
}

// ──────────────────────────────────────
// 3. Set up the conversation
// ──────────────────────────────────────
var chat = new MultiTurnConversation(model)
{
    MaximumCompletionTokens = 1024,
    SystemPrompt = "You are a visual analysis assistant. Describe images accurately and " +
                   "answer questions about their content. Be specific about colors, text, " +
                   "positions, and quantities when relevant."
};

// Stream tokens as they are generated
chat.AfterTextCompletion += (_, e) =>
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
};

// ──────────────────────────────────────
// 4. First turn: describe the image
// ──────────────────────────────────────
Console.WriteLine($"Analyzing: {Path.GetFileName(imagePath)}\n");

var attachment = new Attachment(imagePath);

Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Assistant: ");
Console.ResetColor();

var result = chat.Submit(
    new ChatHistory.Message("Describe this image in detail.", attachment));

Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($"\n  [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
Console.ResetColor();

// ──────────────────────────────────────
// 5. Follow-up questions (text only)
// ──────────────────────────────────────
Console.WriteLine("Ask follow-up questions about the image (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("You: ");
    Console.ResetColor();

    string? question = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(question) || question.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("Assistant: ");
    Console.ResetColor();

    result = chat.Submit(new ChatHistory.Message(question));

    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.WriteLine($"\n  [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
    Console.ResetColor();
}

Run it:

dotnet run -- "photo.jpg"

Step 4: Practical Use Cases

Invoice and Receipt Reading

// Reuses the chat conversation (and its streaming handler) from Step 3.
var invoice = new Attachment("receipt.jpg");

chat.Submit(new ChatHistory.Message(
    "Extract all line items from this receipt. " +
    "For each item list: name, quantity, unit price, and total. " +
    "Also extract the subtotal, tax, and grand total.",
    invoice));
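
Because a VLM can misread digits, it may be worth cross-checking the figures it reports before relying on them. The sketch below assumes you have already parsed the extracted values into numbers (that step is not shown) and that the receipt has no discounts or special rounding rules. `TotalsAreConsistent`, its tolerance, and the sample values are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: checks that each line total equals quantity x unit price,
// that the line totals add up to the subtotal, and that subtotal + tax = grand total.
bool TotalsAreConsistent(
    IReadOnlyList<(int Quantity, decimal UnitPrice, decimal Total)> items,
    decimal subtotal, decimal tax, decimal grandTotal, decimal tolerance = 0.01m)
{
    bool linesOk = items.All(i => Math.Abs(i.Quantity * i.UnitPrice - i.Total) <= tolerance);
    bool subtotalOk = Math.Abs(items.Sum(i => i.Total) - subtotal) <= tolerance;
    bool grandOk = Math.Abs(subtotal + tax - grandTotal) <= tolerance;
    return linesOk && subtotalOk && grandOk;
}

// Made-up sample values standing in for parsed model output
var items = new List<(int Quantity, decimal UnitPrice, decimal Total)>
{
    (2, 3.50m, 7.00m),
    (1, 12.25m, 12.25m)
};

Console.WriteLine(TotalsAreConsistent(items, 19.25m, 1.54m, 20.79m));   // True
Console.WriteLine(TotalsAreConsistent(items, 19.25m, 1.54m, 25.00m));   // False
```

A failed check does not say which value is wrong, only that the receipt should be re-read or reviewed by a person.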

Visual Quality Inspection

// Reuses the chat conversation (and its streaming handler) from Step 3.
var partPhoto = new Attachment("component.png");

chat.Submit(new ChatHistory.Message(
    "Inspect this manufactured component. " +
    "Identify any defects: scratches, cracks, discoloration, or misalignment. " +
    "Rate the overall quality as PASS, MARGINAL, or FAIL.",
    partPhoto));
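
The prompt above asks for a PASS, MARGINAL, or FAIL rating, but the answer still arrives as free-form text. One possible way to turn it into a value your program can branch on is a small parser like the sketch below. `ParseVerdict` is a hypothetical helper, and it assumes the model writes the rating in uppercase and states its final verdict last.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns "PASS", "MARGINAL", or "FAIL", or null when
// no rating is found. If several rating words appear, the last one wins,
// on the assumption that the final verdict comes at the end of the answer.
string? ParseVerdict(string response)
{
    var matches = Regex.Matches(response, @"\b(PASS|MARGINAL|FAIL)\b");
    return matches.Count > 0 ? matches[matches.Count - 1].Value : null;
}

// Made-up response text standing in for the model's answer
string sample = "Minor scratch near the left edge, no cracks. Overall quality: MARGINAL";
Console.WriteLine(ParseVerdict(sample) ?? "no verdict");   // MARGINAL
```

Treat a null result as "needs human review" rather than as a pass.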

Diagram and Chart Interpretation

// Reuses the chat conversation (and its streaming handler) from Step 3.
var chart = new Attachment("quarterly-chart.png");

chat.Submit(new ChatHistory.Message(
    "Analyze this chart. What trends do you see? " +
    "Which quarter had the highest value? " +
    "Summarize the key takeaways.",
    chart));

Step 5: Analyzing Multiple Images

To analyze a new image in the same session, attach it to a new message. The conversation retains its history, so the model can compare images across turns:

// Reuses the chat conversation (and its streaming handler) from Step 3.
// First image
var before = new Attachment("site-before.jpg");
chat.Submit(new ChatHistory.Message("Describe this construction site.", before));

// Second image in the same conversation
var after = new Attachment("site-after.jpg");
chat.Submit(new ChatHistory.Message(
    "Now look at this updated photo of the same site. " +
    "What has changed since the first image?", after));

To start fresh with a new image and no prior context:

// Reuses the chat conversation from Step 3; ClearHistory() discards the earlier turns.
chat.ClearHistory();
var newImage = new Attachment("new-photo.jpg");
chat.Submit(new ChatHistory.Message("Describe this image.", newImage));

Choosing a Vision Model

| Model ID | VRAM | Speed | Quality | Best For |
|---|---|---|---|---|
| qwen3-vl:2b | ~2 GB | Fastest | Good | Compact purpose-built VL; quick classification |
| qwen3.5:2b | ~2 GB | Fastest | Good | Quick classification, simple descriptions |
| qwen3-vl:4b | ~4 GB | Fast | Very good | Latest compact VL with tool calling and OCR |
| qwen3.5:4b | ~3.5 GB | Fast | Very good | General analysis (recommended start) |
| gemma4:e4b | ~6 GB | Fast | Very good | Multilingual image understanding |
| qwen3-vl:8b | ~7 GB | Moderate | Excellent | Latest purpose-built VL flagship |
| qwen3.5:9b | ~7 GB | Moderate | Excellent | Detailed analysis, complex reasoning |
| glm-4.6v-flash | ~7 GB | Moderate | Excellent | Lightweight GLM vision-language with OCR |
| qwen3.6:27b | ~17 GB | Moderate | Top-tier | Latest Qwen flagship, Vision + OCR + reasoning |
| qwen3.6:35b-a3b | ~22 GB | Fast (MoE) | Top-tier | Latest MoE flagship, 3B active params |
| ministral3:3b | ~3.5 GB | Fast | Good | Lightweight edge deployment |

For document processing (invoices, forms, text-heavy images), larger models (8B+) read small text more accurately. For general object recognition and scene description, 4B models offer the best balance of speed and quality. The Qwen 3.6 dense (qwen3.6:27b) and MoE (qwen3.6:35b-a3b) checkpoints are the highest-quality options in the catalog.


Combining Vision with Structured Extraction

Combine VLMs with structured extraction to get typed data from images instead of free-form text:

using LMKit.Extraction;
using LMKit.Data;

var extractor = new TextExtraction(model);
extractor.Elements = new List<TextExtractionElement>
{
    new("ItemCount", ElementType.Integer, "Number of distinct items visible"),
    new("DominantColor", ElementType.String, "The most prominent color"),
    new("ContainsText", ElementType.Bool, "Whether readable text is visible"),
    new("Description", ElementType.String, "One-sentence description")
};

extractor.SetContent(new Attachment("photo.jpg"));
var extractResult = extractor.Parse();

int items = extractResult["ItemCount"].As<int>();
string color = extractResult["DominantColor"].Value.ToString();

See Extract Structured Data from Unstructured Text for the full extraction API.


Common Issues

| Problem | Cause | Fix |
|---|---|---|
| HasVision is False | Model is text-only | Use a VLM: qwen3-vl:4b, qwen3.5:4b, gemma4:e4b, or ministral3:3b |
| Blurry or small text not read | Model too small for OCR tasks | Use an OCR-capable VLM for text-heavy images: qwen3.5:9b, qwen3-vl:8b, glm-4.6v-flash, or qwen3.6:27b |
| Slow first response | Image encoding is compute-heavy | Normal for high-resolution images. Subsequent text-only turns are faster |
| Out of memory | Image generates many visual tokens | Resize large images before loading, or use a smaller model |
| Wrong colors or counts | VLMs can hallucinate visual details | Ask the model to be precise; use structured extraction for critical data |
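
For the "resize large images" fix, the first step is deciding on target dimensions. The sketch below only computes them; the actual resizing needs an imaging library of your choice and is not shown. `FitWithin` is a hypothetical helper, and the 1024-pixel limit is an arbitrary example value, not a recommendation from LM-Kit.NET.

```csharp
using System;

// Hypothetical helper: scales (width, height) down so the longer side is at
// most maxSide, preserving the aspect ratio. Images already within the limit
// are returned unchanged.
(int Width, int Height) FitWithin(int width, int height, int maxSide)
{
    int longest = Math.Max(width, height);
    if (longest <= maxSide) return (width, height);

    double scale = (double)maxSide / longest;
    return (Math.Max(1, (int)Math.Round(width * scale)),
            Math.Max(1, (int)Math.Round(height * scale)));
}

var (w, h) = FitWithin(4032, 3024, 1024);
Console.WriteLine($"{w}x{h}");   // 1024x768
```

Downscaling too aggressively can make small text unreadable, so for text-heavy images try a larger limit before switching to a smaller model.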
