# Convert Documents to Markdown
LM-Kit.NET ships a single universal converter, `DocumentToMarkdown`, that turns any
supported document format into clean, LLM-ready Markdown. It replaces a whole stack of legacy
components (PDF text extractors, Tesseract-style OCR engines, DOCX/XLSX parsers, email parsers,
HTML-to-Markdown libraries) with one API, one result type, and one unified quality signal.
Everything runs 100% on-device.
This guide walks through the three conversion strategies, per-page progress, and the common patterns you will actually use in production.
## Supported Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, PPTX, XLSX, TXT |
| Email | EML, MBOX |
| Web | HTML |
| Images | PNG, JPG, JPEG, TIFF, BMP, WEBP, GIF |
EML, MBOX, HTML, and DOCX flow through dedicated format-aware converters that preserve email headers, HTML structure, and DOCX tables in a single pass. Every other input (PDF, images, TXT, XLSX, PPTX) flows through the strategy-driven page pipeline described below.
## The Three Strategies
| Strategy | Model Needed | Best For | Speed |
|---|---|---|---|
| `TextExtraction` | No (or `LMKitOcr` for OCR paths) | Born-digital PDFs, DOCX, XLSX, PPTX, HTML, EML, MBOX | 🔥 Fastest |
| `VlmOcr` | Vision model | Scans, photos, handwriting, layout-heavy pages | 🐢 Slowest |
| `Hybrid` (recommended) | Vision model (lazy) | Mixed PDFs, unknown corpora | ⚡ Adaptive |
Under `Hybrid`, each page is inspected individually: pages with a clean text layer stay on
the fast text path, while pages without extractable text, or with embedded images, are routed
to VLM OCR. No pre-classification is required from the caller.
`TextExtraction` becomes a full traditional-OCR pipeline the moment you set
`options.OcrEngine`: standalone images get transcribed, embedded raster images on each
PDF page are OCRed and merged back into the page layout, and scanned PDFs fall back
to a full-page OCR pass on the fly. See Step 7 for the details.
## Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | ~2 GB if a vision strategy is used (default `lightonocr-2:1b`) |
| Disk | ~1 GB free for the model download on first run |
`TextExtraction` on text-bearing formats (PDF, DOCX, XLSX, PPTX, EML, MBOX, HTML, TXT) needs no
VRAM and no model.
## Step 1: Create the Project

```shell
dotnet new console -n MarkdownQuickstart
cd MarkdownQuickstart
dotnet add package LM-Kit.NET
```
## Step 2: Convert a PDF (Zero-Config, Hybrid)
The fastest path to production is the zero-config constructor. No model is loaded until a VLM-bound page actually needs one; if the PDF is fully born-digital, the engine stays on the CPU text path.
```csharp
using System.Text;
using LMKit.Document.Conversion;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

var converter = new DocumentToMarkdown();
DocumentToMarkdownResult result = converter.Convert("report.pdf");

File.WriteAllText("report.md", result.Markdown);

Console.WriteLine($"Pages    : {result.Pages.Count}");
Console.WriteLine($"Strategy : {result.EffectiveStrategy}");
Console.WriteLine($"Elapsed  : {result.Elapsed.TotalSeconds:F2} s");
```
## Step 3: Bring Your Own Vision Model
Pass an explicit LM when you want to reuse the model across converters, pick a different
vision model, or take full control of download and loading progress.
```csharp
using LMKit.Document.Conversion;
using LMKit.Model;

using LM model = LM.LoadFromModelID("lightonocr-2:1b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rDownloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });

var converter = new DocumentToMarkdown(model);
var result = await converter.ConvertAsync("scan.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.VlmOcr
});

File.WriteAllText("scan.md", result.Markdown);
```
## Step 4: Stream Per-Page Progress
Subscribe to `PageStarting` and `PageCompleted` to drive a progress bar, log per-page
diagnostics, or cancel mid-flight by flipping `e.Cancel`.
```csharp
using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();

converter.PageStarting += (_, e) =>
    Console.WriteLine($"▶ Page {e.PageNumber}/{e.PageCount} planned: {e.PlannedStrategy}");

converter.PageCompleted += (_, e) =>
{
    if (e.Exception != null)
    {
        Console.WriteLine($"✗ Page {e.PageNumber} failed: {e.Exception.Message}");
        return;
    }

    var p = e.PageResult!;
    string q = p.QualityScore.HasValue ? $", quality={p.QualityScore:F2}" : "";
    string t = p.GeneratedTokenCount > 0 ? $", {p.GeneratedTokenCount} tok" : "";
    Console.WriteLine($"✓ Page {p.PageNumber} in {p.Elapsed.TotalMilliseconds:F0} ms [{p.StrategyUsed}{t}{q}]");
};

var result = converter.Convert("mixed.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.Hybrid
});
```
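`PageStarting` is also the cancellation hook. A minimal sketch that bounds a long conversion with a timeout, assuming the engine stops scheduling pages once `e.Cancel` is set; the two-minute budget and the `large-scan.pdf` file name are arbitrary placeholders:

```csharp
using LMKit.Document.Conversion;

// Give the whole conversion a time budget.
var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));
var converter = new DocumentToMarkdown();

converter.PageStarting += (_, e) =>
{
    // Once the budget expires, stop before the next page is processed.
    if (cts.IsCancellationRequested)
        e.Cancel = true;
};

var result = converter.Convert("large-scan.pdf");
```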
## Step 5: Pick Pages and Shape the Output
Use `PageRange` to slice large PDFs. Use `EmitFrontMatter`, `IncludePageSeparators`, and
`PreferMarkdownTablesForNonNested` to shape the final Markdown for LLM ingestion or a static
site.
```csharp
var result = converter.Convert("big-report.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.Hybrid,
    PageRange = "1-5, 7, 10-12",
    EmitFrontMatter = true,
    IncludePageSeparators = true,
    PageSeparatorFormat = "\n\n---\n\n<!-- Page {pageNumber} -->\n\n",
    PreferMarkdownTablesForNonNested = true,
    NormalizeWhitespace = true
});
```
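The per-page objects on `result.Pages` expose the same fields used in the Step 4 events, so the unified quality signal can be audited after the fact. A sketch that flags low-scoring pages for review; the 0.70 cutoff is an arbitrary assumption, and `QualityScore` is null on pages that never produced a score:

```csharp
// Flag pages whose quality score fell below a review threshold.
foreach (var page in result.Pages)
{
    if (page.QualityScore is double score && score < 0.70)
        Console.WriteLine(
            $"Review page {page.PageNumber}: quality={score:F2} [{page.StrategyUsed}]");
}
```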
## Step 6: Convert Straight to Disk
`ConvertToFile` / `ConvertToFileAsync` skip the intermediate in-memory string and stream the
Markdown to the target path.
```csharp
await converter.ConvertToFileAsync("invoice.pdf", "out/invoice.md",
    new DocumentToMarkdownOptions { Strategy = DocumentToMarkdownStrategy.Hybrid });
```
## Step 7: Traditional OCR Without a Vision Model
When you want to run on very constrained hardware, pair `TextExtraction` with an
`OcrEngine` such as `LMKitOcr`. Supplying the engine extends the text-extraction
strategy at three complementary points:
- Image attachments (PNG, JPEG, TIFF, BMP, WEBP, GIF, ...) are transcribed by the engine instead of producing empty Markdown.
- Embedded raster images on PDF pages (charts, figure legends, scanned tables) are OCRed and their text projected back into the page's layout, so rasterised content flows alongside the native text.
- Full-page fallback: PDF pages whose native text layer is empty (scans, flattened print-to-PDF) are rendered as a full-page raster and OCRed end-to-end.
```csharp
using LMKit.Document.Conversion;
using LMKit.Extraction.Ocr;

using var ocr = new LMKitOcr();
var converter = new DocumentToMarkdown();

var result = converter.Convert("invoice.png", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine = ocr,
    OcrImageParallelism = 4 // concurrent OCR calls per page (clamped to [1, 12])
});
```
Raise `OcrImageParallelism` on machines with spare CPU cores to speed up
image-heavy PDFs; lower it to protect an OCR engine with its own internal thread
pool. The converter also caps the per-image pipeline at 20 images per page: any
page carrying more than that is routed to the full-page OCR fallback instead of
spawning an unbounded number of per-image calls (a DoS guard against pathological
PDFs).
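One way to size `OcrImageParallelism` is to derive it from the core count and stay inside the documented [1, 12] clamp. A sketch, where the divide-by-two headroom is an arbitrary choice and `ocr` is the engine from the snippet above:

```csharp
// Use roughly half the cores for per-image OCR, bounded to the [1, 12] range.
int parallelism = Math.Clamp(Environment.ProcessorCount / 2, 1, 12);

var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine = ocr,
    OcrImageParallelism = parallelism
};
```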
Tip: `TextExtraction` + `LMKitOcr` gives you OCR on PDFs, scans, and standalone images with no language model loaded at all: the lightest possible deployment for a pure OCR pipeline.
## Step 8: Batch a Folder
`DocumentToMarkdown` is stateless across calls, so the same instance can be reused to process
a whole directory.
```csharp
using LMKit.Document.Conversion;

string inputDir = "inbox";
string outputDir = "markdown";
Directory.CreateDirectory(outputDir);

string[] files = Directory.GetFiles(inputDir, "*.*", SearchOption.TopDirectoryOnly);

var converter = new DocumentToMarkdown();
var options = new DocumentToMarkdownOptions { Strategy = DocumentToMarkdownStrategy.Hybrid };

foreach (string file in files)
{
    string outPath = Path.Combine(outputDir, Path.GetFileNameWithoutExtension(file) + ".md");
    try
    {
        var result = await converter.ConvertToFileAsync(file, outPath, options);
        Console.WriteLine($"{Path.GetFileName(file),-40} {result.Pages.Count} page(s) [{result.EffectiveStrategy}]");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"{Path.GetFileName(file),-40} FAILED: {ex.Message}");
    }
}
```
## Advanced: Tune the VLM Path
`DocumentToMarkdownOptions` exposes the VLM knobs that used to live on `VlmOcr` directly:
```csharp
var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.VlmOcr,
    VlmImageDetail = LMKit.Inference.Vision.ImageDetail.High,
    VlmMaximumCompletionTokens = 4096,
    VlmStripImageMarkup = true,
    VlmStripStyleAttributes = true
};
```
For workflows that need raw access to the vision model (intent selection, coordinate
extraction, custom instructions), drop down to `VlmOcr` directly. See the
VLM OCR demo and the VLM OCR with Coordinates demo.
## Model Selection for the VLM Strategies
| Model ID | VRAM | Speed | Quality | Best For |
|---|---|---|---|---|
| `lightonocr-2:1b` | ~2 GB | Fastest | Very good | Purpose-built OCR specialist (default) |
| `glm-ocr` | ~1 GB | Very fast | Good | Lightweight OCR specialist |
| `qwen3.5:2b` | ~2 GB | Very fast | Good | Lightweight multilingual OCR |
| `qwen3.5:4b` | ~3.5 GB | Fast | Very good | Multilingual documents |
| `gemma4:e4b` | ~6 GB | Moderate | Very good | Mixed text and vision tasks |
| `minicpm-o-45` | ~5.9 GB | Moderate | Very good | Strong all-round vision model |
| `qwen3.5:9b` | ~7 GB | Moderate | Excellent | High-quality multilingual OCR |
| `ministral3:8b` | ~6.5 GB | Moderate | Very good | Complex document layouts |
| `glm-4.6v-flash` | ~7 GB | Moderate | Excellent | Highest fidelity on complex layouts |
| `qwen3.6:27b` | ~18 GB | Slow | Excellent | Critical documents, demanding layouts |
`lightonocr-2:1b` is a compact 1B model specifically trained for high-accuracy OCR and
document understanding. It is the best default for dedicated OCR workloads. Switch to a
larger model like `qwen3.6:27b` or `glm-4.6v-flash` when dealing with complex layouts,
degraded scans, or handwriting.
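Swapping models from the table is a one-line change at load time. A sketch, assuming `LM.LoadFromModelID` can be called without the progress callbacks shown in Step 3 (they are optional named parameters there):

```csharp
using LMKit.Document.Conversion;
using LMKit.Model;

// Trade speed for fidelity on complex layouts by loading a larger vision model,
// then reuse it across converters.
using LM model = LM.LoadFromModelID("glm-4.6v-flash");
var converter = new DocumentToMarkdown(model);
```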
## Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Output truncated mid-sentence | `VlmMaximumCompletionTokens` too low | Raise to 4096+ or set to -1 for unlimited |
| Empty Markdown on an image input with `TextExtraction` | No OCR engine supplied | Set `OcrEngine = new LMKitOcr()` or switch to `Hybrid` / `VlmOcr` |
| Empty Markdown on a scanned PDF with `TextExtraction` | No OCR engine supplied; the text layer is empty | Add `OcrEngine = new LMKitOcr()`; the full-page OCR fallback kicks in automatically when the text layer is sparse |
| Chart labels / figure legends missing on a born-digital PDF | Text sits inside embedded raster images, not the text layer | Add `OcrEngine = new LMKitOcr()` under `TextExtraction` to recognise embedded images and merge their text back into the page layout |
| Tables rendered as HTML | `<table>` uses nested tables or `rowspan`/`colspan` | Leave as HTML (Markdown cannot express those layouts) or post-process |
| PDFs take a long time on scans | Every page is routed to VLM | Use `Hybrid`, which keeps born-digital pages on the CPU text path |
| Per-image OCR feels slow | Image-heavy page with default `OcrImageParallelism = 4` | Raise `OcrImageParallelism` (up to 12) on CPU-rich machines |
| Model downloads on first run | First-time use of the default vision model | Pre-load with `LM.LoadFromModelID("lightonocr-2:1b")` and pass to the constructor |
| VLM page quality looks low | Dense layout or small fonts exceeding `VlmImageDetail.Low` | Set `VlmImageDetail = ImageDetail.High` (default) and raise `VlmMaximumCompletionTokens` |
## Next Steps
- Samples: Document to Markdown: the complete interactive demo.
- Samples: VLM OCR: drop down to `VlmOcr` for intent selection and custom instructions.
- Samples: VLM OCR with Coordinates: extract bounding boxes alongside text.
- Analyze Images with Vision Language Models: image Q&A and analysis beyond OCR.
- Extract Structured Data from Unstructured Text: turn the Markdown output into typed fields.
- Preprocess Images for Vision Pipelines: deskew, crop, and resize scans before conversion.