Extract Text with Layout Preservation from PDFs

Standard PDF text extraction loses the spatial relationships that give a document meaning: columns collapse into a single stream, table cells merge, and headers blend with body text. LM-Kit.NET's PageElement class provides five text output modes that preserve different aspects of document layout. You control whether the output maintains column alignment, groups text into paragraphs, preserves tabular structure, or lets the engine auto-select the best strategy per page. This tutorial builds a text extraction tool that handles financial reports, technical manuals, and multi-column documents.


Why Layout Preservation Matters

Layout-aware text extraction solves two common enterprise problems:

  1. Financial report parsing. Annual reports and earnings statements use multi-column layouts with tables, footnotes, and sidebar text. Naive extraction produces interleaved gibberish. Layout-aware extraction keeps columns separate and tables intact, producing clean input for downstream analysis or RAG indexing.
  2. Technical manual indexing. Service manuals and product datasheets use structured layouts with numbered sections, specification tables, and diagrams with captions. Preserving this structure during extraction means the indexed text retains its organizational hierarchy, improving search and retrieval accuracy.
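
To make the interleaving problem concrete, here is a minimal sketch that uses hand-made word positions for a two-column page, not LM-Kit.NET output. The helper names NaiveOrder and ColumnAwareOrder and the fixed column split at x = 300 are assumptions for illustration only; a layout-aware engine would find the column boundary itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical word boxes for a two-column page: (text, x, y) in points.
var words = new List<(string Text, double X, double Y)>
{
    ("Revenue", 50, 100), ("Risk", 320, 100),
    ("grew", 50, 115), ("factors", 320, 115),
    ("12%", 50, 130), ("include", 320, 130),
};

// Naive extraction: read top-to-bottom, left-to-right across the whole page width.
static string NaiveOrder(IEnumerable<(string Text, double X, double Y)> ws) =>
    string.Join(" ", ws.OrderBy(w => w.Y).ThenBy(w => w.X).Select(w => w.Text));

// Column-aware extraction: finish the left column before starting the right one.
// splitX is an assumed, fixed column boundary (illustration only).
static string ColumnAwareOrder(IEnumerable<(string Text, double X, double Y)> ws, double splitX) =>
    string.Join(" ", ws.OrderBy(w => w.X >= splitX).ThenBy(w => w.Y).ThenBy(w => w.X).Select(w => w.Text));

Console.WriteLine(NaiveOrder(words));            // Revenue Risk grew factors 12% include
Console.WriteLine(ColumnAwareOrder(words, 300)); // Revenue grew 12% Risk factors include
```

The first line of output mixes the two columns word by word, which is the "interleaved gibberish" described above; the second keeps each column readable.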

Prerequisites

Requirement     Minimum
.NET SDK        8.0+
RAM             4 GB
Input formats   PDF (with text layer), DOCX

No GPU is required for text extraction from digital PDFs. A GPU is only needed for VLM-based OCR on scanned documents.


Step 1: Create the Project

dotnet new console -n LayoutExtraction
cd LayoutExtraction
dotnet add package LM-Kit.NET

Step 2: Understand the Text Output Modes

                    PageElement.GetText(mode)
                            │
    ┌───────────┬───────────┼───────────┬───────────┐
    ▼           ▼           ▼           ▼           ▼
 RawLines   GridAligned  Paragraph   Structured   Auto
                           Flow
 One line    Preserve     Group       Preserve     Auto-
 per text    columns &    lines       tables &     select
 element     indentation  into        paragraphs   best
                          paragraphs               mode

Mode            Best For                 Output Style
RawLines        Debug, raw analysis      One line per detected text element, no grouping
GridAligned     Multi-column documents   Preserves column alignment and indentation
ParagraphFlow   Articles, contracts      Groups lines into paragraphs with spacing
Structured      Tables, forms, reports   Preserves both tabular and paragraph structure
Auto            Unknown document types   Evaluates page structure and selects optimal mode

Step 3: Extract Text in All Modes

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Extract page 1 in each mode
// ──────────────────────────────────────
PageElement page = attachment.PageElements.First();

TextOutputMode[] modes =
{
    TextOutputMode.RawLines,
    TextOutputMode.GridAligned,
    TextOutputMode.ParagraphFlow,
    TextOutputMode.Structured,
    TextOutputMode.Auto
};

foreach (TextOutputMode mode in modes)
{
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($"=== {mode} ===");
    Console.ResetColor();

    string text = page.GetText(mode);
    // Show first 500 characters
    string preview = text.Length > 500 ? text[..500] + "\n[...]" : text;
    Console.WriteLine(preview);
    Console.WriteLine();
}

Step 4: Detect Lines and Paragraphs

PageElement provides line and paragraph detection that groups individual text elements into reading-order structures:

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Get page 1
// ──────────────────────────────────────
PageElement page = attachment.PageElements.First();

// ──────────────────────────────────────
// 3. Line detection
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("=== Detected Lines ===");
Console.ResetColor();

List<LineElement> lines = page.DetectLines();

Console.WriteLine($"Found {lines.Count} text lines on page 1:\n");
for (int i = 0; i < Math.Min(lines.Count, 10); i++)
{
    LineElement line = lines[i];
    Console.WriteLine($"  Line {i + 1}: \"{Truncate(line.Text, 80)}\"");
    Console.WriteLine($"    Words: {line.Words.Count}, " +
                      $"Bounds: ({line.Bounds.TopLeft.X:F0}, {line.Bounds.TopLeft.Y:F0}) " +
                      $"to ({line.Bounds.BottomRight.X:F0}, {line.Bounds.BottomRight.Y:F0})");
}

if (lines.Count > 10)
    Console.WriteLine($"  ... and {lines.Count - 10} more lines");

// ──────────────────────────────────────
// 4. Paragraph detection
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== Detected Paragraphs ===");
Console.ResetColor();

List<ParagraphElement> paragraphs = page.DetectParagraphs();

Console.WriteLine($"Found {paragraphs.Count} paragraphs on page 1:\n");
for (int i = 0; i < Math.Min(paragraphs.Count, 5); i++)
{
    ParagraphElement para = paragraphs[i];
    Console.WriteLine($"  Paragraph {i + 1} ({para.Lines.Count} lines):");
    Console.WriteLine($"    \"{Truncate(para.Text, 100)}\"");
    Console.WriteLine();
}

static string Truncate(string text, int max)
{
    if (string.IsNullOrEmpty(text)) return "";
    string cleaned = text.Replace("\n", " ").Replace("\r", "");
    return cleaned.Length <= max ? cleaned : cleaned[..max] + "...";
}
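
The line bounds printed above can also drive simple layout heuristics. As a rough sketch, the snippet below estimates where columns begin from the left edge of each line. It works on plain numbers standing in for line.Bounds.TopLeft.X, and the ColumnStarts helper and the 100-point gap threshold are assumptions for illustration, not part of LM-Kit.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: estimates column start positions from line left edges.
// A new column starts whenever consecutive sorted left edges are more than minGap apart.
static List<double> ColumnStarts(IEnumerable<double> leftEdges, double minGap)
{
    var starts = new List<double>();
    double previous = double.NegativeInfinity;

    foreach (double x in leftEdges.OrderBy(v => v))
    {
        if (x - previous > minGap)
            starts.Add(x);
        previous = x;
    }
    return starts;
}

// Stand-in values for line.Bounds.TopLeft.X on a two-column page,
// with one indented line in each column.
double[] leftEdges = { 50, 320, 50, 332, 62, 320 };
List<double> columns = ColumnStarts(leftEdges, minGap: 100);
Console.WriteLine($"Detected {columns.Count} columns starting at x = {string.Join(", ", columns)}");
```

Small indentation differences stay within one column because they fall under the gap threshold. Centered headings or lines that span the full page width can create false column starts, so a heuristic like this is best treated as a diagnostic rather than a replacement for GridAligned or Structured output.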

Step 5: Process Multi-Page Documents

Extract structured text from every page and combine the results into a single output file:

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);

Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("=== Full Document Extraction (Structured Mode) ===");
Console.ResetColor();

var fullText = new StringBuilder();
int totalLines = 0;
int totalParagraphs = 0;
int pageIndex = 0;

foreach (PageElement p in attachment.PageElements)
{
    string pageText = p.GetText(TextOutputMode.Structured);
    List<LineElement> pageLines = p.DetectLines();
    List<ParagraphElement> pageParas = p.DetectParagraphs();

    totalLines += pageLines.Count;
    totalParagraphs += pageParas.Count;

    fullText.AppendLine($"<!-- Page {pageIndex + 1} | {pageLines.Count} lines, {pageParas.Count} paragraphs -->");
    fullText.AppendLine(pageText);
    fullText.AppendLine();

    Console.WriteLine($"  Page {pageIndex + 1}: {pageLines.Count} lines, {pageParas.Count} paragraphs");
    pageIndex++;
}

string outputPath = Path.ChangeExtension(pdfPath, ".txt");
File.WriteAllText(outputPath, fullText.ToString());

Console.WriteLine($"\nTotal: {totalLines} lines, {totalParagraphs} paragraphs");
Console.WriteLine($"Saved to: {outputPath}");
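
Because each page in the saved file starts with a "<!-- Page N | ... -->" marker, a downstream step such as a chunker can split the combined text back into pages. The sketch below shows one way this might be done; SplitPages is a hypothetical helper that relies only on the marker format written by the loop above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: splits the combined text into (page number, text) pairs
// using the "<!-- Page N | ... -->" marker lines.
static List<(int Page, string Text)> SplitPages(string combined)
{
    // \r? tolerates Windows line endings produced by AppendLine.
    var marker = new Regex(@"^<!-- Page (\d+) \|.*-->\r?$", RegexOptions.Multiline);
    var matches = marker.Matches(combined);
    var pages = new List<(int Page, string Text)>();

    for (int i = 0; i < matches.Count; i++)
    {
        // A page's text runs from the end of its marker to the start of the next marker.
        int start = matches[i].Index + matches[i].Length;
        int end = i + 1 < matches.Count ? matches[i + 1].Index : combined.Length;
        pages.Add((int.Parse(matches[i].Groups[1].Value), combined[start..end].Trim()));
    }
    return pages;
}

string sample = "<!-- Page 1 | 2 lines, 1 paragraphs -->\nFirst page text\n\n" +
                "<!-- Page 2 | 1 lines, 1 paragraphs -->\nSecond page text\n";

foreach (var (page, text) in SplitPages(sample))
    Console.WriteLine($"Page {page}: {text}");
```

If a document's own text could contain a line that looks like the marker, a more distinctive delimiter would be a safer choice.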

Step 6: Export Page Layout as JSON

PageElement supports JSON serialization, enabling you to store and reload page layout data without re-parsing the original document:

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Get page 1
// ──────────────────────────────────────
PageElement firstPage = attachment.PageElements.First();

Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== JSON Export ===");
Console.ResetColor();

// Export to JSON
string json = firstPage.ToJson();
File.WriteAllText("page1_layout.json", json);
Console.WriteLine($"Exported page layout to page1_layout.json ({json.Length} chars)");

// Reload from JSON
string loadedJson = File.ReadAllText("page1_layout.json");
PageElement reloaded = PageElement.FromJson(loadedJson);
Console.WriteLine($"Reloaded: {reloaded.Width:F0}x{reloaded.Height:F0} points, " +
                  $"{reloaded.DetectLines().Count} lines");

This is useful for caching layout data. Parse the PDF once, store the JSON, and reuse it for subsequent searches or analysis without re-opening the PDF.
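
A minimal sketch of such a cache follows, assuming the cache key is a SHA-256 hash of the file contents so that an edited PDF gets a fresh entry. GetOrCreateLayoutJson and its buildJson parameter are hypothetical names; buildJson stands in for the parse-and-ToJson() step, and the demo uses a placeholder file and payload rather than a real PDF.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper: returns cached layout JSON for a file, or builds and stores it.
static string GetOrCreateLayoutJson(string sourcePath, string cacheDir, Func<string> buildJson)
{
    // Key the cache on file content, not file name.
    string hash = Convert.ToHexString(SHA256.HashData(File.ReadAllBytes(sourcePath)));
    string cachePath = Path.Combine(cacheDir, hash + ".json");

    if (File.Exists(cachePath))
        return File.ReadAllText(cachePath);

    string json = buildJson();
    Directory.CreateDirectory(cacheDir);
    File.WriteAllText(cachePath, json);
    return json;
}

// Demo with a temporary placeholder file and a stand-in JSON payload.
string tempDir = Path.Combine(Path.GetTempPath(), "layout-cache-demo-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(tempDir);
string fakePdf = Path.Combine(tempDir, "report.pdf");
File.WriteAllText(fakePdf, "placeholder bytes");

int parseCount = 0;
Func<string> parse = () => { parseCount++; return "{\"lines\":[]}"; };

string first = GetOrCreateLayoutJson(fakePdf, Path.Combine(tempDir, "cache"), parse);
string second = GetOrCreateLayoutJson(fakePdf, Path.Combine(tempDir, "cache"), parse);
Console.WriteLine($"Parsed {parseCount} time(s); second call returned identical JSON: {first == second}");

Directory.Delete(tempDir, recursive: true);
```

In the setting of this tutorial, buildJson would be something like () => attachment.PageElements.First().ToJson(), and the returned string would be passed to PageElement.FromJson.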


Step 7: Choosing the Right Mode for Your Use Case

Use Case                    Recommended Mode   Why
RAG ingestion pipeline      Structured         Preserves tables and paragraphs for accurate chunking
LLM prompt context          ParagraphFlow      Clean, readable text without visual formatting artifacts
Financial report analysis   GridAligned        Maintains column alignment for side-by-side data
Form field extraction       Structured         Keeps label-value pairs associated
OCR output cleanup          Auto               Adapts to whatever the OCR produced
Debug / raw analysis        RawLines           Shows every text element independently

For most document intelligence workflows, Structured mode produces the best results. It preserves the information hierarchy that LLMs and search engines need for accurate retrieval and generation.


Page Metadata

PageElement exposes geometric metadata about the page:

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);

PageElement p = attachment.PageElements.First();

Console.WriteLine($"Dimensions:  {p.Width:F0} x {p.Height:F0} points");
Console.WriteLine($"Rotation:    {p.Rotation} degrees");
Console.WriteLine($"Skew:        {p.Skew:F2} degrees");
Console.WriteLine($"Unit:        {p.Unit}");

Property        Description
Width, Height   Page dimensions in points (72 points = 1 inch)
Rotation        Detected page rotation in degrees clockwise (0, 90, 180, 270)
Skew            Detected text skew angle in degrees
Unit            Measurement unit (Points or Pixels)
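
Since 72 points equal 1 inch, page dimensions in points convert directly to inches or to pixels at a chosen resolution. The helpers below are a small sketch of that arithmetic; PointsToInches and PointsToPixels are illustrative names, not LM-Kit.NET members.

```csharp
using System;

// 72 points = 1 inch, so pixels at a given DPI = points / 72 * dpi.
static double PointsToInches(double points) => points / 72.0;
static int PointsToPixels(double points, int dpi) => (int)Math.Round(points / 72.0 * dpi);

// A US Letter page measures 612 x 792 points (8.5 x 11 inches).
double widthPt = 612, heightPt = 792;

Console.WriteLine($"{PointsToInches(widthPt)} x {PointsToInches(heightPt)} inches");
Console.WriteLine($"{PointsToPixels(widthPt, 300)} x {PointsToPixels(heightPt, 300)} pixels at 300 DPI");
```

This kind of conversion is handy when mapping line or paragraph bounds onto a rendered page image, where coordinates are in pixels rather than points.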

Common Issues

Problem                      Cause                                   Fix
Empty text from PDF          PDF is scanned (image-only)             Use VlmOcr to extract text first, then use PageElement
Columns interleaved          Using RawLines on multi-column layout   Switch to GridAligned or Structured mode
Tables not preserved         Using ParagraphFlow on tabular data     Switch to Structured mode for table preservation
Text elements out of order   Complex layout with text boxes          DetectLines() re-orders elements into reading order
JSON export very large       Page has many small text elements       This is normal for complex PDFs; use compression if storing
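
For the large-JSON case, one option is to gzip the exported string before storing it. The sketch below uses only the .NET base class library; CompressJson and DecompressJson are hypothetical helper names, and the payload is a repetitive stand-in rather than real PageElement output.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

// Hypothetical helper: gzip-compresses a JSON string into a byte array.
static byte[] CompressJson(string json)
{
    using var output = new MemoryStream();
    using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
    {
        byte[] bytes = Encoding.UTF8.GetBytes(json);
        gzip.Write(bytes, 0, bytes.Length);
    } // Disposing the GZipStream flushes the remaining compressed data.
    return output.ToArray();
}

// Hypothetical helper: restores the original JSON string.
static string DecompressJson(byte[] compressed)
{
    using var input = new MemoryStream(compressed);
    using var gzip = new GZipStream(input, CompressionMode.Decompress);
    using var reader = new StreamReader(gzip, Encoding.UTF8);
    return reader.ReadToEnd();
}

// Stand-in payload with the kind of repetition typical of layout data.
string json = "[" + string.Join(",", Enumerable.Repeat("{\"text\":\"cell\",\"x\":10,\"y\":20}", 200)) + "]";

byte[] packed = CompressJson(json);
Console.WriteLine($"{json.Length} chars -> {packed.Length} bytes");
Console.WriteLine($"Round trip OK: {DecompressJson(packed) == json}");
```

Layout JSON tends to repeat the same property names for every element, which is the kind of input gzip handles well, though the actual ratio depends on the document.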
