Recognize Mathematical Formulas from Documents with VLM OCR

Scientific papers, textbooks, exam sheets, and engineering specifications contain mathematical formulas that traditional OCR cannot extract reliably. LM-Kit.NET's `VlmOcr` engine, combined with PaddleOCR VL's `Formula Recognition:` instruction, converts printed and handwritten equations into machine-readable notation in a single inference pass. This tutorial shows how to extract formulas from images, PDFs, and mixed-content documents containing both text and equations.

Why Vision-Based Formula Recognition

Two practical advantages of PaddleOCR VL's formula mode:

Handles complex notation natively. Fractions, integrals, summations, matrices, Greek letters, and nested subscripts/superscripts are recognized as structured expressions. Traditional OCR sees these as random character sequences with broken spatial relationships.
No symbol library or grammar rules. Rule-based formula extractors require maintaining symbol dictionaries and layout grammars for every notation variant. PaddleOCR VL learned to read formulas end-to-end from training data, covering standard mathematical typography without manual configuration.

Prerequisites

Requirement	Minimum
.NET SDK	8.0+
VRAM	~1 GB (PaddleOCR VL 1.5)
Disk	~750 MB free for model download

Input formats: scanned PDF, PNG, JPEG, TIFF, BMP, WebP.

Step 1: Create the Project

dotnet new console -n FormulaRecognition
cd FormulaRecognition
dotnet add package LM-Kit.NET

Step 2: Extract Formulas from an Image

Load the PaddleOCR VL model and select the `Formula Recognition:` instruction by passing `VlmOcrIntent.FormulaRecognition` to `VlmOcr`:

using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Extract formula using Formula Recognition mode
// ──────────────────────────────────────
var ocr = new VlmOcr(model, VlmOcrIntent.FormulaRecognition);

var attachment = new Attachment("equation_screenshot.png");

VlmOcr.VlmOcrResult result = ocr.Run(attachment);

string formula = result.PageElement.Text;
Console.WriteLine("Recognized formula:");
Console.WriteLine(formula);

File.WriteAllText("formula_output.txt", formula);
Console.WriteLine("\nSaved to formula_output.txt");

The `Formula Recognition:` instruction tells PaddleOCR VL to focus on mathematical notation. The model identifies equation regions, parses symbol relationships, and outputs a structured representation.
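
Once you have the recognized text, you will usually want to embed it in a document. The sketch below assumes the model returns LaTeX-like notation for a single formula, which the text above does not state explicitly, and wraps it in `$$ ... $$` display-math delimiters. `ToDisplayMath` is a hypothetical helper written for this tutorial, not part of LM-Kit.NET, and the sample string stands in for `result.PageElement.Text`.

```csharp
using System;

// Hypothetical helper (not an LM-Kit.NET API): wraps recognized text in $$ ... $$.
// Assumption: the input is LaTeX-like and holds a single formula.
static string ToDisplayMath(string recognized)
{
    string body = recognized.Trim();

    // Remove one pair of outer delimiters if the model already emitted them.
    (string, string)[] pairs = { ("$$", "$$"), (@"\[", @"\]") };
    foreach (var (open, close) in pairs)
    {
        bool wrapped = body.Length >= open.Length + close.Length
            && body.StartsWith(open, StringComparison.Ordinal)
            && body.EndsWith(close, StringComparison.Ordinal);
        if (wrapped)
        {
            body = body.Substring(open.Length, body.Length - open.Length - close.Length).Trim();
            break;
        }
    }

    return "$$\n" + body + "\n$$";
}

// Sample input standing in for result.PageElement.Text from the listing above.
Console.WriteLine(ToDisplayMath(@"\[ E = mc^2 \]"));
```

Stripping existing delimiters first keeps the output from ending up double-wrapped when the model already returns `\[ ... \]` or `$$ ... $$`.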

Step 3: Extract Formulas from a Multi-Page PDF

Process a textbook or paper with equations spread across multiple pages:

using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Multi-page formula extraction
// ──────────────────────────────────────
var ocr = new VlmOcr(model, VlmOcrIntent.FormulaRecognition)
{
    MaximumCompletionTokens = 4096
};

string pdfPath = "math_textbook_chapter.pdf";
var attachment = new Attachment(pdfPath);

int pageCount = attachment.PageCount;
Console.WriteLine($"Scanning {pageCount} pages for formulas...\n");

var allFormulas = new StringBuilder();

for (int page = 0; page < pageCount; page++)
{
    Console.Write($"  Page {page + 1}/{pageCount}... ");

    VlmOcr.VlmOcrResult pageResult = ocr.Run(attachment, pageIndex: page);
    string pageContent = pageResult.PageElement.Text;

    if (!string.IsNullOrWhiteSpace(pageContent))
    {
        allFormulas.AppendLine($"--- Formulas from page {page + 1} ---");
        allFormulas.AppendLine(pageContent);
        allFormulas.AppendLine();
        Console.WriteLine("formula(s) found");
    }
    else
    {
        Console.WriteLine("no formula detected");
    }
}

File.WriteAllText("all_formulas.txt", allFormulas.ToString());
Console.WriteLine("\nExtracted formulas saved to all_formulas.txt");
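
The output file marks each page with a `--- Formulas from page N ---` header, so it can be split back into per-page entries for indexing or review. The following is a minimal sketch of that parsing step. `SplitByPage` is a hypothetical helper, and the sample string stands in for `File.ReadAllText("all_formulas.txt")`.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: maps each 1-based page number found in a header line
// to the text that follows it, up to the next header.
static Dictionary<int, string> SplitByPage(string aggregated)
{
    var header = new Regex(@"^--- Formulas from page (\d+) ---$");
    var buffers = new Dictionary<int, StringBuilder>();
    int currentPage = -1; // -1 means no header has been seen yet

    foreach (string rawLine in aggregated.Split('\n'))
    {
        string line = rawLine.TrimEnd('\r'); // tolerate Windows line endings
        Match m = header.Match(line);
        if (m.Success)
        {
            currentPage = int.Parse(m.Groups[1].Value);
            buffers[currentPage] = new StringBuilder();
        }
        else if (currentPage != -1)
        {
            buffers[currentPage].AppendLine(line);
        }
    }

    var pages = new Dictionary<int, string>();
    foreach (var pair in buffers)
        pages[pair.Key] = pair.Value.ToString().Trim();
    return pages;
}

// Sample content in the same layout the loop above writes.
string sample = "--- Formulas from page 1 ---\na^2 + b^2 = c^2\n\n"
              + "--- Formulas from page 3 ---\n\\int_0^1 x\\,dx\n\n";
foreach (var entry in SplitByPage(sample))
    Console.WriteLine($"Page {entry.Key}: {entry.Value}");
```

Pages where no formula was detected never get a header, so they are simply missing from the dictionary.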

Step 4: Combine Text and Formula Extraction

For documents that mix prose with equations (papers, textbooks, homework), run two passes:

using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Two-pass: full text + formulas
// ──────────────────────────────────────
var attachment = new Attachment("physics_exam.png");

// Pass 1: Full text extraction
var textOcr = new VlmOcr(model, VlmOcrIntent.PlainText);
VlmOcr.VlmOcrResult textResult = textOcr.Run(attachment);
Console.WriteLine("=== Full Page Text ===");
Console.WriteLine(textResult.PageElement.Text);

// Pass 2: Focused formula extraction
var formulaOcr = new VlmOcr(model, VlmOcrIntent.FormulaRecognition);
VlmOcr.VlmOcrResult formulaResult = formulaOcr.Run(attachment);
Console.WriteLine("\n=== Extracted Formulas ===");
Console.WriteLine(formulaResult.PageElement.Text);

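This two-pass approach suits exam papers and similar documents where you need both the question text and the mathematical expressions. Each pass is a separate inference run over the same image, so processing time increases accordingly.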
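
For a multi-page document, the same two passes can be run page by page and merged into one report. The sketch below keeps the merging logic separate from the OCR calls by taking two delegates that return the text for a zero-based page index. `BuildTwoPassReport` and the report layout are hypothetical choices for illustration, and the stand-in arrays let the sketch run without a model.

```csharp
using System;
using System.Text;

// Hypothetical helper: calls a text reader and a formula reader for every page
// and merges both outputs into a single report string.
static string BuildTwoPassReport(int pageCount, Func<int, string> readText, Func<int, string> readFormulas)
{
    var report = new StringBuilder();
    for (int page = 0; page < pageCount; page++)
    {
        string text = (readText(page) ?? string.Empty).Trim();
        string formulas = (readFormulas(page) ?? string.Empty).Trim();

        report.Append("=== Page ").Append(page + 1).Append(": Text ===\n");
        report.Append(text).Append('\n');
        report.Append("=== Page ").Append(page + 1).Append(": Formulas ===\n");
        report.Append(formulas.Length > 0 ? formulas : "(none detected)").Append('\n');
    }
    return report.ToString();
}

// Stand-in data so the sketch runs on its own. With the objects from this tutorial
// you would instead pass delegates such as:
//   i => textOcr.Run(attachment, pageIndex: i).PageElement.Text
//   i => formulaOcr.Run(attachment, pageIndex: i).PageElement.Text
string[] fakeText = { "Q1. Solve for x.", "Q2. State the theorem." };
string[] fakeFormulas = { "x^2 - 4 = 0", "" };
Console.Write(BuildTwoPassReport(2, i => fakeText[i], i => fakeFormulas[i]));
```

Passing delegates rather than `VlmOcr` instances keeps the helper testable and lets you swap in cached results when rerunning a long document.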

Industry Use Cases

Industry	Document Type	What You Extract
Education	Textbooks, exams, homework, lecture notes	Equations, derivations, worked solutions
Scientific Publishing	Research papers, journal articles, preprints	Inline and display equations for indexing
Engineering	Specification sheets, design calculations	Transfer functions, load equations, tolerances
Finance (Quantitative)	Pricing models, risk formulas, algorithm specs	Black-Scholes, VaR formulas, regression equations
Healthcare	Dosage calculation sheets, pharmacokinetic models	Drug concentration formulas, decay equations
Patent Processing	Technical patents with mathematical claims	Algorithmic formulas, signal processing equations

Common Issues

Problem	Cause	Fix
Symbols garbled or swapped	Low-resolution input image	Use at least 150 DPI; upscale small formula crops before processing
Partial formula recognized	`MaximumCompletionTokens` too low for complex expressions	Increase to 4096 or higher
Surrounding text mixed in	Image contains both text and formulas	Crop the formula region before processing, or use the two-pass approach from Step 4
Handwritten formulas poorly recognized	Messy handwriting or unusual notation	PaddleOCR VL works best on printed formulas; for handwriting, try a larger VLM like `qwen3.5:9b`
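
To catch the "partial formula" case automatically, you can check whether the recognized text looks cut off before accepting it. The heuristic below assumes LaTeX-like output, which this tutorial does not confirm, and flags results whose unescaped brace counts differ or whose `\begin` and `\end` counts differ. `LooksTruncated` is a hypothetical helper, not an LM-Kit.NET API.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical heuristic: returns true when LaTeX-like text appears incomplete.
// Assumption: the recognizer emits LaTeX-style braces and environments.
static bool LooksTruncated(string latex)
{
    int depth = 0;
    for (int i = 0; i < latex.Length; i++)
    {
        char c = latex[i];
        if (c == '\\') { i++; continue; } // skip the character after a backslash, e.g. \{ or \\
        if (c == '{') depth++;
        else if (c == '}') depth--;
    }

    int begins = Regex.Matches(latex, @"\\begin\{").Count;
    int ends = Regex.Matches(latex, @"\\end\{").Count;

    return depth != 0 || begins != ends;
}

Console.WriteLine(LooksTruncated(@"\frac{a}{b}"));
Console.WriteLine(LooksTruncated(@"\begin{pmatrix} a & b \\ c & d"));
```

When the check returns true, raise `MaximumCompletionTokens` as shown in Step 3 and run the page again. The check is only a heuristic: a formula can be cut off at a point where braces happen to balance.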

Next Steps

Extract Text from Images and Documents with VLM OCR: general OCR for full-page text extraction.
Extract Tables from Documents with VLM OCR: switch to table mode for structured tabular data.
Extract Data from Charts and Graphs with VLM OCR: pull data from visual charts.
Samples: VLM OCR Demo: interactive console demo with all OCR intents.
