
Extract Tables from Documents with VLM OCR

Financial statements, medical lab reports, shipping manifests, and procurement orders all rely on tabular data. When these documents arrive as scanned PDFs or photographs, extracting the table structure is one of the hardest problems in document processing. LM-Kit.NET's VlmOcr engine, combined with PaddleOCR VL's dedicated Table Recognition: instruction, detects rows, columns, headers, and merged cells in a single inference pass without any layout heuristics or post-processing rules. This tutorial shows how to extract tables from images, PDFs, and mixed-content documents.


Why Dedicated Table Recognition Matters

PaddleOCR VL's table mode has two practical advantages over generic OCR:

  1. Structural fidelity. Generic OCR reads text left-to-right, top-to-bottom. It cannot distinguish between a heading, a body cell, and a footer. PaddleOCR VL's Table Recognition: mode preserves row and column boundaries, merged cells, and header rows, producing output that maps directly to structured data.
  2. No template configuration. Traditional table extraction requires manually defining column positions, separator patterns, or anchor keywords for each document type. PaddleOCR VL generalizes across invoices, lab results, financial reports, and forms without any per-template setup.

Prerequisites

Requirement | Minimum
.NET SDK | 8.0+
VRAM | ~1 GB (PaddleOCR VL 1.5)
Disk | ~750 MB free for model download

Input formats: scanned PDF, PNG, JPEG, TIFF, BMP, WebP.


Step 1: Create the Project

dotnet new console -n TableExtraction
cd TableExtraction
dotnet add package LM-Kit.NET

Step 2: Extract a Table from an Image

Load the PaddleOCR VL model and use the Table Recognition: instruction to extract structured table data:

using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Extract table using Table Recognition mode
// ──────────────────────────────────────
var ocr = new VlmOcr(model, VlmOcrIntent.TableRecognition);

var attachment = new Attachment("financial_statement.png");

VlmOcr.VlmOcrResult result = ocr.Run(attachment);

string tableOutput = result.PageElement.Text;
Console.WriteLine(tableOutput);

File.WriteAllText("extracted_table.txt", tableOutput);
Console.WriteLine("\nSaved to extracted_table.txt");

The Table Recognition: instruction activates PaddleOCR VL's specialized table detection pipeline. The model identifies table boundaries, column headers, row separators, and cell content, and returns the data in a structured format.
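
Once the text is in hand, a common next step is to turn it into rows and cells. The sketch below assumes the output is a Markdown-style pipe table, which this tutorial does not confirm; check the actual text your model version returns and adapt the parser if it differs. The sample string and the `ParsePipeTable` helper are illustrative and not part of LM-Kit.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample output. The real format depends on the model and library version.
string sampleOutput = string.Join("\n",
    "| Account | Q1 | Q2 |",
    "|---|---|---|",
    "| Revenue | 1,200 | 1,350 |",
    "| Expenses | 800 | 910 |");

List<string[]> rows = ParsePipeTable(sampleOutput);

foreach (string[] cells in rows)
    Console.WriteLine(string.Join(" ; ", cells));

// Splits each pipe-delimited line into trimmed cells and skips separator rows
// such as |---|---|. Escaped pipes inside cells are not handled.
static List<string[]> ParsePipeTable(string text)
{
    var parsed = new List<string[]>();
    foreach (string rawLine in text.Split('\n'))
    {
        string line = rawLine.Trim();
        if (!line.StartsWith('|')) continue; // not a table line

        // Drop exactly one leading pipe and, if present, one trailing pipe.
        string inner = line[1..];
        if (inner.EndsWith('|')) inner = inner[..^1];

        string[] parts = inner.Split('|').Select(c => c.Trim()).ToArray();

        // A separator row contains only dashes and colons in every cell.
        bool isSeparator = parts.All(c => c.Length > 0 && c.All(ch => ch is '-' or ':'));
        if (isSeparator) continue;

        parsed.Add(parts);
    }
    return parsed;
}
```

In the Step 2 program, you would pass `tableOutput` to the helper instead of the sample string.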


Step 3: Extract Tables from a Multi-Page PDF

Financial reports and procurement documents often span multiple pages. Process each page and collect all tables:

using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Multi-page table extraction
// ──────────────────────────────────────
var ocr = new VlmOcr(model, VlmOcrIntent.TableRecognition)
{
    MaximumCompletionTokens = 4096
};

string pdfPath = "quarterly_report.pdf";
var attachment = new Attachment(pdfPath);

int pageCount = attachment.PageCount;
Console.WriteLine($"Scanning {pageCount} pages for tables...\n");

var allTables = new StringBuilder();

for (int page = 0; page < pageCount; page++)
{
    Console.Write($"  Page {page + 1}/{pageCount}... ");

    VlmOcr.VlmOcrResult pageResult = ocr.Run(attachment, pageIndex: page);
    string pageContent = pageResult.PageElement.Text;

    if (!string.IsNullOrWhiteSpace(pageContent))
    {
        allTables.AppendLine($"--- Table(s) from page {page + 1} ---");
        allTables.AppendLine(pageContent);
        allTables.AppendLine();
        Console.WriteLine("table(s) found");
    }
    else
    {
        Console.WriteLine("no table detected");
    }
}

File.WriteAllText("all_tables.txt", allTables.ToString());
Console.WriteLine($"\nExtracted tables saved to all_tables.txt");
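
Because long tables can produce long output, it can help to sanity-check each page's result before saving it. The sketch below assumes the page text has already been parsed into rows of cells (the parsing step is up to you) and flags rows whose cell count differs from the header row. That is only a heuristic: a table with merged cells may legitimately have uneven rows. `FindRaggedRows` is a hypothetical helper, not an LM-Kit.NET API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical parsed rows: the last row is missing a cell,
// as might happen if the output were cut short.
var pageRows = new List<string[]>
{
    new[] { "Item", "Qty", "Unit Price" },
    new[] { "Bolt M6", "200", "0.12" },
    new[] { "Washer", "500" },
};

List<int> ragged = FindRaggedRows(pageRows);

Console.WriteLine(ragged.Count == 0
    ? "All rows match the header width."
    : $"Rows with unexpected width: {string.Join(", ", ragged)}");

// Returns the zero-based indexes of rows whose cell count differs
// from the first (header) row.
static List<int> FindRaggedRows(IReadOnlyList<string[]> table)
{
    var bad = new List<int>();
    if (table.Count == 0) return bad;

    int expected = table[0].Length;
    for (int i = 1; i < table.Count; i++)
    {
        if (table[i].Length != expected) bad.Add(i);
    }
    return bad;
}
```

If a page is flagged, one option is to rerun that page with a larger `MaximumCompletionTokens` value and compare the results.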

Step 4: Combine General OCR with Table Extraction

Many documents contain both free-form text and tables. Run two passes, one for general text and one for tables:

using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Two-pass extraction: text + tables
// ──────────────────────────────────────
var attachment = new Attachment("invoice.png");

// Pass 1: General text extraction
var textOcr = new VlmOcr(model, VlmOcrIntent.PlainText);
VlmOcr.VlmOcrResult textResult = textOcr.Run(attachment);
Console.WriteLine("=== Full Document Text ===");
Console.WriteLine(textResult.PageElement.Text);

// Pass 2: Focused table extraction
var tableOcr = new VlmOcr(model, VlmOcrIntent.TableRecognition);
VlmOcr.VlmOcrResult tableResult = tableOcr.Run(attachment);
Console.WriteLine("\n=== Extracted Table(s) ===");
Console.WriteLine(tableResult.PageElement.Text);

This two-pass approach is useful for invoices where you need both the header information (vendor, date, invoice number) and the line-item table.
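
As a rough illustration of what to do with the first pass, the sketch below pulls an invoice number and a date out of plain text with regular expressions. The sample text, the field labels, and the patterns are all assumptions made for this example; real invoices vary widely, so treat this as a starting point and not a general solution. `MatchField` is a hypothetical helper.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical plain-text output from the first pass.
string documentText = string.Join("\n",
    "ACME Supplies Ltd.",
    "Invoice Number: INV-2024-0187",
    "Date: 2024-03-15",
    "Item  Qty  Price");

// Patterns assume English labels and an ISO-style date (yyyy-MM-dd).
string? invoiceNumber = MatchField(documentText, @"Invoice\s*(?:Number|No\.?|#)\s*:?\s*(\S+)");
string? invoiceDate = MatchField(documentText, @"Date\s*:?\s*(\d{4}-\d{2}-\d{2})");

Console.WriteLine($"Invoice number: {invoiceNumber ?? "(not found)"}");
Console.WriteLine($"Date: {invoiceDate ?? "(not found)"}");

// Returns the first capture group of the first match, or null when nothing matches.
static string? MatchField(string text, string pattern)
{
    Match m = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
    return m.Success ? m.Groups[1].Value : null;
}
```

In the Step 4 program, you would pass `textResult.PageElement.Text` in place of the sample string and keep the table pass for the line items.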


Industry Use Cases

Industry | Document Type | What You Extract
Finance | Income statements, balance sheets, trial balances | Account names, amounts, period columns
Healthcare | Lab reports, vital signs logs, medication schedules | Test names, reference ranges, values, dates
Procurement | Purchase orders, packing slips, price lists | Item codes, quantities, unit prices, totals
Insurance | Coverage comparison tables, benefit schedules | Plan names, limits, deductibles, copays
Logistics | Customs declarations, bill of lading tables | HS codes, weights, quantities, origins
Education | Grade sheets, timetables, exam results | Student names, subjects, scores, credits

Common Issues

Problem | Cause | Fix
Columns misaligned in output | Low-resolution scan or extreme skew | Improve scan quality; consider preprocessing with ImageBuffer.Deskew()
Merged cells not detected | Complex multi-level headers | Increase MaximumCompletionTokens to give the model room for verbose output
Table output mixed with body text | Document has text above and below the table | Use Table Recognition: mode, which focuses specifically on table regions
Partial table on page boundary | Table spans two PDF pages | Extract tables from consecutive pages and merge programmatically
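
For the page-boundary case, one possible way to merge programmatically is sketched below. It assumes each page's table has already been parsed into rows of cells, and that a continuation page may repeat the header row. `MergeContinuation` is a hypothetical helper, and the header comparison is a simple heuristic that will not suit every document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical parsed tables from two consecutive pages.
var page1 = new List<string[]>
{
    new[] { "Item", "Qty" },
    new[] { "Bolt", "200" },
};
var page2 = new List<string[]>
{
    new[] { "Item", "Qty" }, // header repeated at the top of the next page
    new[] { "Washer", "500" },
};

List<string[]> merged = MergeContinuation(page1, page2);

foreach (string[] cells in merged)
    Console.WriteLine(string.Join(" | ", cells));

// Appends the rows of `next` to `first`, dropping the first row of `next`
// when it repeats the header of `first` (case-insensitive comparison).
static List<string[]> MergeContinuation(List<string[]> first, List<string[]> next)
{
    var combined = new List<string[]>(first);
    if (first.Count == 0)
    {
        combined.AddRange(next);
        return combined;
    }

    bool repeatsHeader = next.Count > 0
        && next[0].SequenceEqual(first[0], StringComparer.OrdinalIgnoreCase);

    combined.AddRange(repeatsHeader ? next.Skip(1) : next);
    return combined;
}
```

Before merging, you may also want to confirm that both pages have the same number of columns, since two unrelated tables can sit on adjacent pages.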

Next Steps