Recognize Mathematical Formulas from Documents with VLM OCR
Scientific papers, textbooks, exam sheets, and engineering specifications contain mathematical formulas that traditional OCR cannot extract reliably. LM-Kit.NET's VlmOcr engine, combined with PaddleOCR VL's "Formula Recognition:" instruction, converts printed and handwritten equations into machine-readable notation in a single inference pass. This tutorial shows how to extract formulas from images, PDFs, and mixed-content documents containing both text and equations.
Why Vision-Based Formula Recognition
Two practical advantages of PaddleOCR VL's formula mode:
- Handles complex notation natively. Fractions, integrals, summations, matrices, Greek letters, and nested subscripts/superscripts are recognized as structured expressions. Traditional OCR sees these as random character sequences with broken spatial relationships.
- No symbol library or grammar rules. Rule-based formula extractors require maintaining symbol dictionaries and layout grammars for every notation variant. PaddleOCR VL learned to read formulas end-to-end from training data, covering standard mathematical typography without manual configuration.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | ~1 GB (PaddleOCR VL 1.5) |
| Disk | ~750 MB free for model download |
Input formats: scanned PDF, PNG, JPEG, TIFF, BMP, WebP.
Step 1: Create the Project
dotnet new console -n FormulaRecognition
cd FormulaRecognition
dotnet add package LM-Kit.NET
Step 2: Extract Formulas from an Image
Load the PaddleOCR VL model and select the "Formula Recognition:" instruction by passing VlmOcrIntent.FormulaRecognition to the VlmOcr constructor:
using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Extract formula using Formula Recognition mode
// ──────────────────────────────────────
var ocr = new VlmOcr(model, VlmOcrIntent.FormulaRecognition);
var attachment = new Attachment("equation_screenshot.png");
VlmOcr.VlmOcrResult result = ocr.Run(attachment);
string formula = result.PageElement.Text;
Console.WriteLine("Recognized formula:");
Console.WriteLine(formula);
File.WriteAllText("formula_output.txt", formula);
Console.WriteLine("\nSaved to formula_output.txt");
The "Formula Recognition:" instruction, selected here through VlmOcrIntent.FormulaRecognition, tells PaddleOCR VL to focus on mathematical notation. The model identifies equation regions, parses symbol relationships, and returns a structured representation in result.PageElement.Text.
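If the recognized text needs to go into a Markdown or notebook document, it can be normalized into a display-math block first. The sketch below assumes the model returns LaTeX-style notation, which you should confirm against your own output; ToDisplayMath is a hypothetical helper written for this illustration and is not part of LM-Kit.NET.

```csharp
using System;

// Hypothetical helper (not an LM-Kit.NET API): wraps a recognized formula
// in $$ ... $$ so it renders as display math in Markdown.
// Assumption: the recognized text is LaTeX-style notation.
static string ToDisplayMath(string raw)
{
    string s = raw.Trim();

    // Strip one layer of delimiters the model may already have emitted.
    if (s.Length >= 4 && s.StartsWith("$$", StringComparison.Ordinal) && s.EndsWith("$$", StringComparison.Ordinal))
        s = s.Substring(2, s.Length - 4);
    else if (s.Length >= 4 && s.StartsWith("\\[", StringComparison.Ordinal) && s.EndsWith("\\]", StringComparison.Ordinal))
        s = s.Substring(2, s.Length - 4);
    else if (s.Length >= 2 && s.StartsWith("$", StringComparison.Ordinal) && s.EndsWith("$", StringComparison.Ordinal))
        s = s.Substring(1, s.Length - 2);

    return "$$\n" + s.Trim() + "\n$$";
}

// Example input standing in for result.PageElement.Text.
Console.WriteLine(ToDisplayMath(@"\[ E = mc^2 \]"));
```

In the program above, the same call could be applied to the formula variable before it is written to formula_output.txt.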
Step 3: Extract Formulas from a Multi-Page PDF
Process a textbook or paper with equations spread across multiple pages:
using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Multi-page formula extraction
// ──────────────────────────────────────
var ocr = new VlmOcr(model, VlmOcrIntent.FormulaRecognition)
{
    MaximumCompletionTokens = 4096
};
string pdfPath = "math_textbook_chapter.pdf";
var attachment = new Attachment(pdfPath);
int pageCount = attachment.PageCount;
Console.WriteLine($"Scanning {pageCount} pages for formulas...\n");
var allFormulas = new StringBuilder();
for (int page = 0; page < pageCount; page++)
{
    Console.Write($" Page {page + 1}/{pageCount}... ");
    VlmOcr.VlmOcrResult pageResult = ocr.Run(attachment, pageIndex: page);
    string pageContent = pageResult.PageElement.Text;
    if (!string.IsNullOrWhiteSpace(pageContent))
    {
        allFormulas.AppendLine($"--- Formulas from page {page + 1} ---");
        allFormulas.AppendLine(pageContent);
        allFormulas.AppendLine();
        Console.WriteLine("formula(s) found");
    }
    else
    {
        Console.WriteLine("no formula detected");
    }
}
File.WriteAllText("all_formulas.txt", allFormulas.ToString());
Console.WriteLine("\nExtracted formulas saved to all_formulas.txt");
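Downstream tools usually want the formulas grouped by page rather than as one text file. The following sketch splits the aggregated text back into per-page entries using the "--- Formulas from page N ---" headers written by the loop above. SplitByPage is a hypothetical helper, not an LM-Kit.NET API, and it assumes no recognized formula line is identical to a header line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper (not an LM-Kit.NET API): splits the aggregated output
// into one entry per page, keyed by the 1-based page number in each header.
static Dictionary<int, string> SplitByPage(string aggregated)
{
    var pages = new Dictionary<int, string>();
    var header = new Regex(@"^--- Formulas from page (\d+) ---$");
    int current = -1;
    var buffer = new StringBuilder();

    foreach (string rawLine in aggregated.Split('\n'))
    {
        string line = rawLine.TrimEnd('\r');
        Match m = header.Match(line);
        if (m.Success)
        {
            // A new header closes the previous page's block.
            if (current > 0) pages[current] = buffer.ToString().Trim();
            current = int.Parse(m.Groups[1].Value);
            buffer.Clear();
        }
        else if (current > 0)
        {
            buffer.AppendLine(line);
        }
    }

    if (current > 0) pages[current] = buffer.ToString().Trim();
    return pages;
}

// Sample text in the same layout as all_formulas.txt; page 2 had no formulas.
string sample = "--- Formulas from page 1 ---\nE = mc^2\n\n--- Formulas from page 3 ---\na^2 + b^2 = c^2\n";
foreach (var (page, formulas) in SplitByPage(sample))
    Console.WriteLine($"Page {page}: {formulas}");
```

Pages without formulas are simply absent from the dictionary, which mirrors how the extraction loop skips them.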
Step 4: Combine Text and Formula Extraction
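For documents that mix prose with equations (papers, textbooks, homework), run two passes over the same attachment, first with VlmOcrIntent.PlainText and then with VlmOcrIntent.FormulaRecognition: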
using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Two-pass: full text + formulas
// ──────────────────────────────────────
var attachment = new Attachment("physics_exam.png");
// Pass 1: Full text extraction
var textOcr = new VlmOcr(model, VlmOcrIntent.PlainText);
VlmOcr.VlmOcrResult textResult = textOcr.Run(attachment);
Console.WriteLine("=== Full Page Text ===");
Console.WriteLine(textResult.PageElement.Text);
// Pass 2: Focused formula extraction
var formulaOcr = new VlmOcr(model, VlmOcrIntent.FormulaRecognition);
VlmOcr.VlmOcrResult formulaResult = formulaOcr.Run(attachment);
Console.WriteLine("\n=== Extracted Formulas ===");
Console.WriteLine(formulaResult.PageElement.Text);
This two-pass approach is valuable for digitizing exam papers where you need both the question text and the mathematical expressions.
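To keep both results together, the two outputs can be merged into a single report. The sketch below is one possible layout, not a prescribed format. BuildReport is a hypothetical helper, and it assumes each non-empty line of the formula pass is a separate formula, which may not hold for multi-line expressions.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper (not an LM-Kit.NET API): merges the results of the
// text pass and the formula pass into one plain-text report.
// Assumption: each non-empty line of the formula pass is one formula.
static string BuildReport(string pageText, string formulaText)
{
    var report = new StringBuilder();
    report.AppendLine("=== Full Page Text ===");
    report.AppendLine(pageText.Trim());
    report.AppendLine();
    report.AppendLine("=== Extracted Formulas ===");

    string[] formulaLines = formulaText
        .Split('\n')
        .Select(l => l.Trim())
        .Where(l => l.Length > 0)
        .ToArray();

    // Number the formulas so they can be cross-referenced with the question text.
    for (int i = 0; i < formulaLines.Length; i++)
        report.AppendLine($"({i + 1}) {formulaLines[i]}");

    return report.ToString();
}

// Example strings standing in for textResult.PageElement.Text and
// formulaResult.PageElement.Text.
Console.Write(BuildReport("Q1. Find the kinetic energy.", "E_k = \\frac{1}{2} m v^2\n\nF = ma"));
```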
Industry Use Cases
| Industry | Document Type | What You Extract |
|---|---|---|
| Education | Textbooks, exams, homework, lecture notes | Equations, derivations, worked solutions |
| Scientific Publishing | Research papers, journal articles, preprints | Inline and display equations for indexing |
| Engineering | Specification sheets, design calculations | Transfer functions, load equations, tolerances |
| Finance (Quantitative) | Pricing models, risk formulas, algorithm specs | Black-Scholes, VaR formulas, regression equations |
| Healthcare | Dosage calculation sheets, pharmacokinetic models | Drug concentration formulas, decay equations |
| Patent Processing | Technical patents with mathematical claims | Algorithmic formulas, signal processing equations |
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Symbols garbled or swapped | Low-resolution input image | Use at least 150 DPI; upscale small formula crops before processing |
| Partial formula recognized | MaximumCompletionTokens too low for complex expressions | Increase to 4096 or higher |
| Surrounding text mixed in | Image contains both text and formulas | Crop the formula region before processing, or use the two-pass approach |
| Handwritten formulas poorly recognized | Messy handwriting or unusual notation | PaddleOCR VL works best on printed formulas; for handwriting, try a larger VLM like qwen3-vl:8b |
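The "Partial formula recognized" case can be detected heuristically before the output is saved. The sketch below flags output with unclosed braces or an unmatched \begin. It assumes LaTeX-style output and catches only structural truncation, not a formula cut off at a clean boundary. LooksTruncated and CountOf are hypothetical helpers, not LM-Kit.NET APIs.

```csharp
using System;

// Hypothetical helper: counts non-overlapping occurrences of a token.
static int CountOf(string text, string token)
{
    int count = 0, index = 0;
    while ((index = text.IndexOf(token, index, StringComparison.Ordinal)) >= 0)
    {
        count++;
        index += token.Length;
    }
    return count;
}

// Hypothetical heuristic: returns true when the output has unclosed braces
// or more \begin{...} than \end{...}, which suggests generation stopped early.
// Assumption: the recognized text is LaTeX-style notation.
static bool LooksTruncated(string latex)
{
    int depth = 0;
    for (int i = 0; i < latex.Length; i++)
    {
        char c = latex[i];
        if (c == '\\') { i++; continue; }   // skip the character after a backslash, e.g. \{ or \}
        if (c == '{') depth++;
        else if (c == '}') depth--;
    }

    return depth > 0 || CountOf(latex, "\\begin{") > CountOf(latex, "\\end{");
}

Console.WriteLine(LooksTruncated(@"\frac{a}{b}"));                    // False
Console.WriteLine(LooksTruncated(@"\begin{pmatrix} a & b \\ c &"));   // True
```

When the check returns true, re-running the page with a higher MaximumCompletionTokens value, as set in Step 3, is a reasonable first retry.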
Next Steps
- Extract Text from Images and Documents with VLM OCR: general OCR for full-page text extraction.
- Extract Tables from Documents with VLM OCR: switch to table mode for structured tabular data.
- Extract Data from Charts and Graphs with VLM OCR: pull data from visual charts.
- Samples: VLM OCR Demo: interactive console demo with all OCR intents.