Recognize Mathematical Formulas from Documents with VLM OCR

Scientific papers, textbooks, exam sheets, and engineering specifications contain mathematical formulas that traditional OCR rarely extracts intact. LM-Kit.NET's VlmOcr engine, combined with PaddleOCR VL's Formula Recognition: instruction, converts printed and handwritten equations into machine-readable notation in a single inference pass. This tutorial shows how to extract formulas from images, PDFs, and mixed-content documents containing both text and equations.


Why Vision-Based Formula Recognition

Two practical advantages of PaddleOCR VL's formula mode:

  1. Handles complex notation natively. Fractions, integrals, summations, matrices, Greek letters, and nested subscripts/superscripts are recognized as structured expressions. Traditional OCR flattens them into character sequences that lose the spatial relationships between symbols.
  2. No symbol library or grammar rules. Rule-based formula extractors require maintaining symbol dictionaries and layout grammars for every notation variant. PaddleOCR VL learned to read formulas end-to-end from training data, covering standard mathematical typography without manual configuration.

Prerequisites

Requirement | Minimum
.NET SDK | 8.0+
VRAM | ~1 GB (PaddleOCR VL 1.5)
Disk | ~750 MB free for model download

Input formats: scanned PDF, PNG, JPEG, TIFF, BMP, WebP.
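When you process a folder of mixed files, it can help to skip unsupported inputs before loading the model. The sketch below uses only the .NET base library. The extension list is an assumption derived from the formats named above, and IsSupportedInput is a hypothetical helper, not part of LM-Kit.NET.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Extensions assumed to correspond to the input formats listed above.
var supported = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    ".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".webp"
};

// Hypothetical helper: true when the file extension is in the list (case-insensitive).
bool IsSupportedInput(string path) => supported.Contains(Path.GetExtension(path));

string[] candidates = { "scan.PDF", "notes.docx", "eq.webp", "README" };
foreach (string file in candidates.Where(IsSupportedInput))
{
    Console.WriteLine(file); // prints scan.PDF, then eq.webp
}
```

The check looks at the file name only; it does not verify that the file content really is a PDF or an image.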


Step 1: Create the Project

dotnet new console -n FormulaRecognition
cd FormulaRecognition
dotnet add package LM-Kit.NET

Step 2: Extract Formulas from an Image

Load the PaddleOCR VL model and use the Formula Recognition: instruction:

using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Extract formula using Formula Recognition mode
// ──────────────────────────────────────
var ocr = new VlmOcr(model, VlmOcrIntent.FormulaRecognition);

var attachment = new Attachment("equation_screenshot.png");

VlmOcr.VlmOcrResult result = ocr.Run(attachment);

string formula = result.PageElement.Text;
Console.WriteLine("Recognized formula:");
Console.WriteLine(formula);

File.WriteAllText("formula_output.txt", formula);
Console.WriteLine("\nSaved to formula_output.txt");

The Formula Recognition: instruction tells PaddleOCR VL to focus on mathematical notation. The model locates equation regions, parses the spatial relationships between symbols, and returns the recognized notation as text in result.PageElement.Text.
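If you plan to render the result, you may want every formula wrapped in the same delimiters. The following is a minimal sketch that assumes the recognized text is LaTeX-style markup, which may or may not already carry delimiters such as \[ ... \] or $ ... $. StripMathDelimiters and ToDisplayMath are hypothetical helpers built on plain string operations, not LM-Kit.NET APIs.

```csharp
using System;

// Hypothetical helper: removes one layer of common math delimiters, if present.
static string StripMathDelimiters(string s)
{
    s = s.Trim();
    (string open, string close)[] pairs =
    {
        ("$$", "$$"), (@"\[", @"\]"), (@"\(", @"\)"), ("$", "$")
    };
    foreach (var (open, close) in pairs)
    {
        if (s.Length >= open.Length + close.Length &&
            s.StartsWith(open, StringComparison.Ordinal) &&
            s.EndsWith(close, StringComparison.Ordinal))
        {
            return s.Substring(open.Length, s.Length - open.Length - close.Length).Trim();
        }
    }
    return s; // no known delimiter pair found
}

// Hypothetical helper: wraps the bare formula in $$ ... $$ display-math delimiters.
static string ToDisplayMath(string recognized) =>
    "$$\n" + StripMathDelimiters(recognized) + "\n$$";

// Sample input standing in for result.PageElement.Text.
string sample = @"\[ E = mc^2 \]";
Console.WriteLine(ToDisplayMath(sample));
// prints:
// $$
// E = mc^2
// $$
```

The $$ delimiters suit Markdown renderers with math support; change the wrapper string if your target is a .tex file or an HTML page.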


Step 3: Extract Formulas from a Multi-Page PDF

Process a textbook or paper with equations spread across multiple pages:

using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Multi-page formula extraction
// ──────────────────────────────────────
var ocr = new VlmOcr(model, VlmOcrIntent.FormulaRecognition)
{
    MaximumCompletionTokens = 4096
};

string pdfPath = "math_textbook_chapter.pdf";
var attachment = new Attachment(pdfPath);

int pageCount = attachment.PageCount;
Console.WriteLine($"Scanning {pageCount} pages for formulas...\n");

var allFormulas = new StringBuilder();

for (int page = 0; page < pageCount; page++)
{
    Console.Write($"  Page {page + 1}/{pageCount}... ");

    VlmOcr.VlmOcrResult pageResult = ocr.Run(attachment, pageIndex: page);
    string pageContent = pageResult.PageElement.Text;

    if (!string.IsNullOrWhiteSpace(pageContent))
    {
        allFormulas.AppendLine($"--- Formulas from page {page + 1} ---");
        allFormulas.AppendLine(pageContent);
        allFormulas.AppendLine();
        Console.WriteLine("formula(s) found");
    }
    else
    {
        Console.WriteLine("no formula detected");
    }
}

File.WriteAllText("all_formulas.txt", allFormulas.ToString());
Console.WriteLine("\nExtracted formulas saved to all_formulas.txt");
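The output file groups formulas under "--- Formulas from page N ---" headers. If a later step needs the content per page, the file can be read back into a dictionary. This is a sketch using only the .NET base library; ParseFormulaReport is a hypothetical helper, and it assumes no recognized formula contains a line that looks like one of those headers.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: maps each page number to the text written under its header.
static Dictionary<int, string> ParseFormulaReport(string report)
{
    var header = new Regex(@"^--- Formulas from page (\d+) ---\s*$");
    var pages = new Dictionary<int, string>();
    var buffer = new StringBuilder();
    int current = -1; // -1 means no header has been seen yet

    foreach (string line in report.Split('\n'))
    {
        Match m = header.Match(line);
        if (m.Success)
        {
            // A new header closes the previous page section.
            if (current >= 0) pages[current] = buffer.ToString().Trim();
            current = int.Parse(m.Groups[1].Value);
            buffer.Clear();
        }
        else if (current >= 0)
        {
            buffer.Append(line.TrimEnd('\r')).Append('\n');
        }
    }
    if (current >= 0) pages[current] = buffer.ToString().Trim();
    return pages;
}

// Sample text in the same layout as all_formulas.txt.
string report =
    "--- Formulas from page 1 ---\na^2 + b^2 = c^2\n\n" +
    "--- Formulas from page 3 ---\nE = mc^2\n";

foreach (var (page, formulas) in ParseFormulaReport(report))
{
    Console.WriteLine($"Page {page}: {formulas}");
}
```

In the tutorial program you would pass File.ReadAllText("all_formulas.txt") instead of the sample string. Pages without formulas are simply absent from the dictionary, because Step 3 writes no header for them.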

Step 4: Combine Text and Formula Extraction

For documents that mix prose with equations (papers, textbooks, homework), run two passes:

using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Two-pass: full text + formulas
// ──────────────────────────────────────
var attachment = new Attachment("physics_exam.png");

// Pass 1: Full text extraction
var textOcr = new VlmOcr(model, VlmOcrIntent.PlainText);
VlmOcr.VlmOcrResult textResult = textOcr.Run(attachment);
Console.WriteLine("=== Full Page Text ===");
Console.WriteLine(textResult.PageElement.Text);

// Pass 2: Focused formula extraction
var formulaOcr = new VlmOcr(model, VlmOcrIntent.FormulaRecognition);
VlmOcr.VlmOcrResult formulaResult = formulaOcr.Run(attachment);
Console.WriteLine("\n=== Extracted Formulas ===");
Console.WriteLine(formulaResult.PageElement.Text);

This two-pass approach is valuable for digitizing exam papers where you need both the question text and the mathematical expressions.
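To keep the two results together for downstream indexing, you could store them in one record per page. The sketch below serializes both strings with System.Text.Json. BuildPageRecord and the field names source, text, and formulas are illustrative choices, not an LM-Kit.NET format.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: bundles the output of both passes into one JSON object.
static string BuildPageRecord(string source, string fullText, string formulas)
{
    var record = new Dictionary<string, string>
    {
        ["source"] = source,
        ["text"] = fullText.Trim(),
        ["formulas"] = formulas.Trim()
    };
    return JsonSerializer.Serialize(record, new JsonSerializerOptions { WriteIndented = true });
}

// Sample strings standing in for textResult.PageElement.Text
// and formulaResult.PageElement.Text.
string json = BuildPageRecord(
    "physics_exam.png",
    "Question 1. State Newton's second law.",
    "F = ma");

Console.WriteLine(json);
```

The serializer escapes backslashes and some other characters, so LaTeX-style markup looks different inside the JSON file but is restored unchanged when the record is parsed again.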


Industry Use Cases

Industry | Document Type | What You Extract
Education | Textbooks, exams, homework, lecture notes | Equations, derivations, worked solutions
Scientific Publishing | Research papers, journal articles, preprints | Inline and display equations for indexing
Engineering | Specification sheets, design calculations | Transfer functions, load equations, tolerances
Finance (Quantitative) | Pricing models, risk formulas, algorithm specs | Black-Scholes, VaR formulas, regression equations
Healthcare | Dosage calculation sheets, pharmacokinetic models | Drug concentration formulas, decay equations
Patent Processing | Technical patents with mathematical claims | Algorithmic formulas, signal processing equations

Common Issues

Problem | Cause | Fix
Symbols garbled or swapped | Low-resolution input image | Use at least 150 DPI; resize small formula crops before processing
Partial formula recognized | MaximumCompletionTokens too low for complex expressions | Increase to 4096 or higher
Surrounding text mixed in | Image contains both text and formulas | Crop the formula region before processing, or use the two-pass approach
Handwritten formulas poorly recognized | Messy handwriting or unusual notation | PaddleOCR VL works best on printed formulas; for handwriting, try a larger VLM like qwen3-vl:8b
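A partially recognized formula is easy to miss by eye. One rough way to flag it, assuming the output is LaTeX-style markup, is to check that braces and \begin/\end pairs balance. LooksIncomplete is a hypothetical helper and only a heuristic: a balanced formula can still be cut short, and it does not validate LaTeX.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical heuristic: true when braces or \begin/\end pairs do not balance.
static bool LooksIncomplete(string latex)
{
    int depth = 0;
    for (int i = 0; i < latex.Length; i++)
    {
        char c = latex[i];
        if (c == '\\') { i++; continue; } // skip the escaped character, e.g. \{ or \}
        if (c == '{') depth++;
        else if (c == '}') depth--;
        if (depth < 0) return true;       // closing brace without a matching opener
    }

    int begins = Regex.Matches(latex, @"\\begin\{").Count;
    int ends = Regex.Matches(latex, @"\\end\{").Count;
    return depth != 0 || begins != ends;
}

string[] samples = { @"\frac{a}{b}", @"\frac{a}{", @"\begin{bmatrix} 1 & 2" };
foreach (string s in samples)
{
    string status = LooksIncomplete(s) ? "incomplete" : "ok";
    Console.WriteLine(s + " -> " + status);
}
```

When the check flags a result, re-running that page with a higher MaximumCompletionTokens value, as the table suggests, is a reasonable first step.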

Next Steps