Analyze Images with Vision Language Models

Vision Language Models (VLMs) handle both text and images in a single model. Instead of chaining OCR, object detection, and text generation as separate steps, a VLM processes an image directly and answers questions about it in natural language. This tutorial builds a working image analysis program that describes images, answers visual questions, and handles multi-turn conversations about images.


Why Local Vision Matters

Two enterprise problems that on-device VLMs solve:

  1. Sensitive document processing. Organizations handling medical scans, legal evidence, classified imagery, or proprietary engineering diagrams cannot send images to cloud APIs. A local VLM processes everything on-premises, keeping sensitive visual data within the organization's infrastructure.
  2. Field inspection and quality control. Manufacturing floors, construction sites, and remote facilities need real-time visual analysis without depending on internet connectivity. A local VLM running on an edge device or laptop can inspect parts, flag defects, and read labels offline.

Prerequisites

Requirement    Minimum
.NET SDK       8.0+
VRAM           4+ GB (for a 4B VLM)
Disk           ~3 GB free for model download
Test image     Any .jpg, .png, .bmp, or .webp file

Step 1: Create the Project

dotnet new console -n VisionQuickstart
cd VisionQuickstart
dotnet add package LM-Kit.NET

Step 2: Understand How VLMs Work

A VLM extends a text-only LLM with a vision encoder. When you send an image along with a text prompt, the vision encoder converts the image into a sequence of visual tokens that the language model processes alongside the text tokens. The result is a unified understanding of both modalities.

                ┌──────────────┐
  Image file ──►│ Vision       │──► Visual tokens ─┐
                │ Encoder      │                    │
                └──────────────┘                    ▼
                                            ┌──────────────┐
                                            │  Language    │──► Text response
                                            │  Model       │
                ┌──────────────┐            └──────────────┘
  Text prompt ─►│ Tokenizer    │──► Text tokens ───┘
                └──────────────┘

In LM-Kit.NET, you send an image by attaching an Attachment to a ChatHistory.Message; the model handles the visual encoding internally.
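
The pattern boils down to a few calls. The snippet below is a minimal sketch using the same model ID and API calls as the full program in Step 3; it skips the download/loading progress callbacks and the follow-up loop shown there.

// Minimal sketch of the attach-and-ask pattern (Step 3 expands it with progress
// reporting, colored output, and a follow-up question loop).
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.TextGeneration.Chat;

using LM model = LM.LoadFromModelID("qwen3-vl:4b");          // same VLM used throughout this tutorial
var chat = new MultiTurnConversation(model);

chat.AfterTextCompletion += (_, e) => Console.Write(e.Text); // stream the answer as it is generated

var image = new Attachment("photo.jpg");                     // any supported image file
chat.Submit(new ChatHistory.Message("What is in this picture?", image));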


Step 3: Basic Image Analysis

This program loads a VLM, takes an image path as input, describes the image, and enters a multi-turn chat loop for follow-up questions.

using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.TextGeneration.Chat;

LMKit.Licensing.LicenseManager.SetLicenseKey(""); // optional: set your LM-Kit license key here

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a Vision Language Model
// ──────────────────────────────────────
Console.WriteLine("Loading vision model...");
using LM model = LM.LoadFromModelID("qwen3-vl:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });

Console.WriteLine($"\n\nModel loaded: {model.Name}");
Console.WriteLine($"  Vision: {model.HasVision}\n");

// ──────────────────────────────────────
// 2. Get the image path
// ──────────────────────────────────────
string imagePath = args.Length > 0 ? args[0] : "";

if (string.IsNullOrWhiteSpace(imagePath))
{
    Console.Write("Enter the path to an image file: ");
    imagePath = Console.ReadLine()?.Trim('"') ?? "";
}

if (!File.Exists(imagePath))
{
    Console.WriteLine($"File not found: {imagePath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-image>");
    return;
}

// ──────────────────────────────────────
// 3. Set up the conversation
// ──────────────────────────────────────
var chat = new MultiTurnConversation(model)
{
    MaximumCompletionTokens = 1024,
    SystemPrompt = "You are a visual analysis assistant. Describe images accurately and " +
                   "answer questions about their content. Be specific about colors, text, " +
                   "positions, and quantities when relevant."
};

// Stream tokens as they are generated
chat.AfterTextCompletion += (_, e) =>
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
};

// ──────────────────────────────────────
// 4. First turn: describe the image
// ──────────────────────────────────────
Console.WriteLine($"Analyzing: {Path.GetFileName(imagePath)}\n");

var attachment = new Attachment(imagePath);

Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Assistant: ");
Console.ResetColor();

var result = chat.Submit(
    new ChatHistory.Message("Describe this image in detail.", attachment));

Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($"\n  [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
Console.ResetColor();

// ──────────────────────────────────────
// 5. Follow-up questions (text only)
// ──────────────────────────────────────
Console.WriteLine("Ask follow-up questions about the image (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("You: ");
    Console.ResetColor();

    string? question = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(question) || question.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("Assistant: ");
    Console.ResetColor();

    result = chat.Submit(new ChatHistory.Message(question));

    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.WriteLine($"\n  [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
    Console.ResetColor();
}

Run it:

dotnet run -- "photo.jpg"

Step 4: Practical Use Cases

Invoice and Receipt Reading

var invoice = new Attachment("receipt.jpg");

chat.Submit(new ChatHistory.Message(
    "Extract all line items from this receipt. " +
    "For each item list: name, quantity, unit price, and total. " +
    "Also extract the subtotal, tax, and grand total.",
    invoice));

Visual Quality Inspection

var partPhoto = new Attachment("component.png");

chat.Submit(new ChatHistory.Message(
    "Inspect this manufactured component. " +
    "Identify any defects: scratches, cracks, discoloration, or misalignment. " +
    "Rate the overall quality as PASS, MARGINAL, or FAIL.",
    partPhoto));

Diagram and Chart Interpretation

var chart = new Attachment("quarterly-chart.png");

chat.Submit(new ChatHistory.Message(
    "Analyze this chart. What trends do you see? " +
    "Which quarter had the highest value? " +
    "Summarize the key takeaways.",
    chart));

Step 5: Analyzing Multiple Images

To analyze a new image in the same session, attach it to a new message. The model retains the conversation history, so you can compare images across turns:

// First image
var before = new Attachment("site-before.jpg");
chat.Submit(new ChatHistory.Message("Describe this construction site.", before));

// Second image in the same conversation
var after = new Attachment("site-after.jpg");
chat.Submit(new ChatHistory.Message(
    "Now look at this updated photo of the same site. " +
    "What has changed since the first image?", after));

To start fresh with a new image and no prior context:

chat.ClearHistory();
var newImage = new Attachment("new-photo.jpg");
chat.Submit(new ChatHistory.Message("Describe this image.", newImage));

Choosing a Vision Model

Model ID VRAM Speed Quality Best For
qwen3-vl:2b ~2.5 GB Fastest Good Quick classification, simple descriptions
qwen3-vl:4b ~4 GB Fast Very good General analysis (recommended start)
gemma3:4b ~5.7 GB Fast Very good Multilingual image understanding
qwen3-vl:8b ~6.5 GB Moderate Excellent Detailed analysis, complex reasoning
gemma3:12b ~11 GB Slower Excellent Highest accuracy, OCR-grade text reading
ministral3:3b ~3.5 GB Fast Good Lightweight edge deployment

For document processing (invoices, forms, text-heavy images), larger models (8B+) read small text more accurately. For general object recognition and scene description, 4B models offer the best speed/quality balance.
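
In code, the choice is a one-line swap of the ID passed to LoadFromModelID. The sketch below is an assumption about how you might wire that choice to a command-line flag; only the model IDs come from the table above.

// Sketch: pick a model ID by task profile, then load it exactly as in Step 3.
// The "--documents" flag and the selection logic are illustrative, not part of LM-Kit.NET.
using System.Linq;
using LMKit.Model;

string modelId = args.Contains("--documents")
    ? "qwen3-vl:8b"     // text-heavy images (invoices, forms) benefit from a larger model
    : "qwen3-vl:4b";    // general-purpose default

using LM model = LM.LoadFromModelID(modelId);
Console.WriteLine($"Loaded {model.Name}, vision support: {model.HasVision}");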


Combining Vision with Structured Extraction

Combine VLMs with structured extraction to get typed data from images instead of free-form text:

using LMKit.Extraction;

var extractor = new TextExtraction(model);
extractor.Elements = new List<TextExtractionElement>
{
    new("ItemCount", ElementType.Integer, "Number of distinct items visible"),
    new("DominantColor", ElementType.String, "The most prominent color"),
    new("ContainsText", ElementType.Bool, "Whether readable text is visible"),
    new("Description", ElementType.String, "One-sentence description")
};

extractor.SetContent(new Attachment("photo.jpg"));
var extractResult = extractor.Parse();

int items = extractResult["ItemCount"].As<int>();
string color = extractResult["DominantColor"].Value.ToString();

See Extract Structured Data from Unstructured Text for the full extraction API.


Common Issues

Problem                          Cause                                  Fix
HasVision is False               Model is text-only                     Use a VLM: qwen3-vl:4b, gemma3:4b, or ministral3:3b
Blurry or small text not read    Model too small for OCR tasks          Use qwen3-vl:8b or gemma3:12b for text-heavy images
Slow first response              Image encoding is compute-heavy        Normal for high-resolution images; subsequent text-only turns are faster
Out of memory                    Image generates many visual tokens     Resize large images before loading (see the sketch below), or use a smaller model
Wrong colors or counts           VLMs can hallucinate visual details    Ask the model to be precise; use structured extraction for critical data
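
For the out-of-memory case, one option is to downscale oversized photos before attaching them. The sketch below uses SixLabors.ImageSharp, a separate NuGet package that is not part of LM-Kit.NET, as one possible way to do that; the 1536 px cap is an arbitrary example value.

// Sketch: cap the longest image side before attaching, to reduce the number of visual tokens.
// Assumes: dotnet add package SixLabors.ImageSharp (not an LM-Kit.NET dependency).
using LMKit.Data;
using SixLabors.ImageSharp.Processing;

string originalPath = "large-photo.jpg";
string resizedPath = Path.Combine(Path.GetTempPath(), "resized-" + Path.GetFileName(originalPath));

using (var image = SixLabors.ImageSharp.Image.Load(originalPath))
{
    image.Mutate(x => x.Resize(new ResizeOptions
    {
        Mode = ResizeMode.Max,                           // shrink only if larger than the target box
        Size = new SixLabors.ImageSharp.Size(1536, 1536)
    }));
    image.Save(resizedPath);                             // encoder inferred from the file extension
}

var smaller = new Attachment(resizedPath);               // attach the downscaled copy instead of the original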

Next Steps