Extract Text with Layout Preservation from PDFs

Standard PDF text extraction loses the spatial relationships that give a document meaning: columns collapse into a single stream, table cells merge, and headers blend with body text. LM-Kit.NET's PageElement class provides five text output modes that preserve different aspects of document layout. You control whether the output maintains column alignment, groups text into paragraphs, preserves tabular structure, or lets the engine auto-select the best strategy per page. This tutorial builds a text extraction tool that handles financial reports, technical manuals, and multi-column documents.

Why Layout Preservation Matters

Layout-aware text extraction solves two common enterprise problems:

Financial report parsing. Annual reports and earnings statements use multi-column layouts with tables, footnotes, and sidebar text. Naive extraction produces interleaved gibberish. Layout-aware extraction keeps columns separate and tables intact, producing clean input for downstream analysis or RAG indexing.
Technical manual indexing. Service manuals and product datasheets use structured layouts with numbered sections, specification tables, and diagrams with captions. Preserving this structure during extraction means the indexed text retains its organizational hierarchy, improving search and retrieval accuracy.
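
The interleaving problem described above can be reproduced with a toy model. The sketch below does not use LM-Kit.NET and does not reflect how its layout engine works; the fragment coordinates, the sample text, and the fixed `columnBoundary` value are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy page with two columns. Each fragment has a made-up (X, Y) position.
var fragments = new List<(int X, int Y, string Text)>
{
    (50, 100, "Revenue grew 12%"),
    (320, 100, "Operating costs"),
    (50, 115, "year over year."),
    (320, 115, "fell by 3%."),
};

// Naive order: top-to-bottom, then left-to-right.
// Fragments from both columns end up interleaved.
string naive = string.Join(" ", fragments
    .OrderBy(f => f.Y).ThenBy(f => f.X)
    .Select(f => f.Text));

// Column-aware order: split at a hypothetical gutter position,
// then read each column top-to-bottom.
const int columnBoundary = 300; // assumption: the gutter position is known
string columnAware = string.Join(" ", fragments
    .OrderBy(f => f.X < columnBoundary ? 0 : 1).ThenBy(f => f.Y)
    .Select(f => f.Text));

Console.WriteLine($"Naive:        {naive}");
Console.WriteLine($"Column-aware: {columnAware}");
```

A real page has no known gutter position, which is why column detection belongs in the extraction engine rather than in post-processing code like this.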

Prerequisites

Requirement	Minimum
.NET SDK	8.0+
RAM	4 GB
Input formats	PDF (with text layer), DOCX

No GPU is required for text extraction from digital PDFs. A GPU is needed only for VLM-based OCR on scanned documents.

Step 1: Create the Project

dotnet new console -n LayoutExtraction
cd LayoutExtraction
dotnet add package LM-Kit.NET

Step 2: Understand the Text Output Modes

                    PageElement.GetText(mode)
                            │
    ┌───────────┬───────────┼───────────┬───────────┐
    ▼           ▼           ▼           ▼           ▼
 RawLines   GridAligned  Paragraph   Structured   Auto
                           Flow
 One line    Preserve     Group       Preserve     Auto-
 per text    columns &    lines       tables &     select
 element     indentation  into        paragraphs   best
                          paragraphs               mode

Mode	Best For	Output Style
`RawLines`	Debug, raw analysis	One line per detected text element, no grouping
`GridAligned`	Multi-column documents	Preserves column alignment and indentation
`ParagraphFlow`	Articles, contracts	Groups lines into paragraphs with spacing
`Structured`	Tables, forms, reports	Preserves both tabular and paragraph structure
`Auto`	Unknown document types	Evaluates page structure and selects optimal mode

Step 3: Extract Text in All Modes

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Extract page 1 in each mode
// ──────────────────────────────────────
PageElement page = attachment.PageElements.First();

TextOutputMode[] modes =
{
    TextOutputMode.RawLines,
    TextOutputMode.GridAligned,
    TextOutputMode.ParagraphFlow,
    TextOutputMode.Structured,
    TextOutputMode.Auto
};

foreach (TextOutputMode mode in modes)
{
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($"=== {mode} ===");
    Console.ResetColor();

    string text = page.GetText(mode);
    // Show first 500 characters
    string preview = text.Length > 500 ? text[..500] + "\n[...]" : text;
    Console.WriteLine(preview);
    Console.WriteLine();
}
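
When the 500-character previews look similar, a few simple metrics can make the differences between modes easier to spot. The following is a minimal sketch in plain .NET; `Summarize` is a hypothetical helper, not part of LM-Kit.NET, and the two sample strings are stand-ins for real extraction output.

```csharp
using System;
using System.Linq;

// Stand-in strings; in the program above these would come from page.GetText(mode).
string paragraphLike = "First paragraph of text.\n\nSecond paragraph of text.";
string gridLike = "Item        Q1      Q2\nRevenue     10      12";

Console.WriteLine($"Paragraph-like: {Summarize(paragraphLike)}");
Console.WriteLine($"Grid-like:      {Summarize(gridLike)}");

// Count lines, blank lines, and the longest line so outputs can be compared at a glance.
static (int Lines, int BlankLines, int LongestLine) Summarize(string text)
{
    string[] lines = text.Replace("\r\n", "\n").Split('\n');
    int blank = lines.Count(l => string.IsNullOrWhiteSpace(l));
    int longest = lines.Max(l => l.Length);
    return (lines.Length, blank, longest);
}
```

Inside the `foreach` loop above, calling `Summarize(text)` for each mode would print one summary per mode. Many blank lines hint at paragraph grouping, while long lines with runs of spaces hint at column alignment.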

Step 4: Detect Lines and Paragraphs

PageElement provides line and paragraph detection that groups individual text elements into reading-order structures:

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Get page 1
// ──────────────────────────────────────
PageElement page = attachment.PageElements.First();

// ──────────────────────────────────────
// 3. Line detection
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("=== Detected Lines ===");
Console.ResetColor();

List<LineElement> lines = page.DetectLines();

Console.WriteLine($"Found {lines.Count} text lines on page 1:\n");
for (int i = 0; i < Math.Min(lines.Count, 10); i++)
{
    LineElement line = lines[i];
    Console.WriteLine($"  Line {i + 1}: \"{Truncate(line.Text, 80)}\"");
    Console.WriteLine($"    Words: {line.Words.Count}, " +
                      $"Bounds: ({line.Bounds.TopLeft.X:F0}, {line.Bounds.TopLeft.Y:F0}) " +
                      $"to ({line.Bounds.BottomRight.X:F0}, {line.Bounds.BottomRight.Y:F0})");
}

if (lines.Count > 10)
    Console.WriteLine($"  ... and {lines.Count - 10} more lines");

// ──────────────────────────────────────
// 4. Paragraph detection
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== Detected Paragraphs ===");
Console.ResetColor();

List<ParagraphElement> paragraphs = page.DetectParagraphs();

Console.WriteLine($"Found {paragraphs.Count} paragraphs on page 1:\n");
for (int i = 0; i < Math.Min(paragraphs.Count, 5); i++)
{
    ParagraphElement para = paragraphs[i];
    Console.WriteLine($"  Paragraph {i + 1} ({para.Lines.Count} lines):");
    Console.WriteLine($"    \"{Truncate(para.Text, 100)}\"");
    Console.WriteLine();
}

static string Truncate(string text, int max)
{
    if (string.IsNullOrEmpty(text)) return "";
    string cleaned = text.Replace("\n", " ").Replace("\r", "");
    return cleaned.Length <= max ? cleaned : cleaned[..max] + "...";
}

Step 5: Process Multi-Page Documents

Extract structured text from every page and combine it into a single output file:

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);

Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("=== Full Document Extraction (Structured Mode) ===");
Console.ResetColor();

var fullText = new StringBuilder();
int totalLines = 0;
int totalParagraphs = 0;
int pageIndex = 0;

foreach (PageElement p in attachment.PageElements)
{
    string pageText = p.GetText(TextOutputMode.Structured);
    List<LineElement> pageLines = p.DetectLines();
    List<ParagraphElement> pageParas = p.DetectParagraphs();

    totalLines += pageLines.Count;
    totalParagraphs += pageParas.Count;

    fullText.AppendLine($"<!-- Page {pageIndex + 1} | {pageLines.Count} lines, {pageParas.Count} paragraphs -->");
    fullText.AppendLine(pageText);
    fullText.AppendLine();

    Console.WriteLine($"  Page {pageIndex + 1}: {pageLines.Count} lines, {pageParas.Count} paragraphs");
    pageIndex++;
}

string outputPath = Path.ChangeExtension(pdfPath, ".txt");
File.WriteAllText(outputPath, fullText.ToString());

Console.WriteLine($"\nTotal: {totalLines} lines, {totalParagraphs} paragraphs");
Console.WriteLine($"Saved to: {outputPath}");
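
The `<!-- Page N | ... -->` comment written before each page also makes it possible to split the combined file back into pages later, for example to index one page at a time. The following is a sketch in plain .NET; `SplitByPageMarker` is a hypothetical helper, and the sample text only imitates the shape of the file produced above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sample text in the same shape as the file written by the loop above.
string combined =
    "<!-- Page 1 | 2 lines, 1 paragraphs -->\nAlpha\nBeta\n\n" +
    "<!-- Page 2 | 1 lines, 1 paragraphs -->\nGamma\n\n";

Dictionary<int, string> pages = SplitByPageMarker(combined);
foreach (var (number, body) in pages)
    Console.WriteLine($"Page {number}: {body.Length} chars");

// Split the combined text into per-page strings keyed by page number.
static Dictionary<int, string> SplitByPageMarker(string text)
{
    // Matches a whole marker line; \r? tolerates Windows line endings.
    var marker = new Regex(@"^<!-- Page (\d+) \|.*-->\r?$", RegexOptions.Multiline);
    MatchCollection matches = marker.Matches(text);
    var result = new Dictionary<int, string>();

    for (int i = 0; i < matches.Count; i++)
    {
        int start = matches[i].Index + matches[i].Length;
        int end = i + 1 < matches.Count ? matches[i + 1].Index : text.Length;
        // Trim only line breaks so leading indentation inside the page survives.
        result[int.Parse(matches[i].Groups[1].Value)] = text[start..end].Trim('\r', '\n');
    }
    return result;
}
```

One caveat: a page whose own text contains a line in the marker format would be split incorrectly, so a more distinctive marker may be safer for untrusted input.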

Step 6: Export Page Layout as JSON

PageElement supports JSON serialization, enabling you to store and reload page layout data without re-parsing the original document:

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Get page 1
// ──────────────────────────────────────
PageElement firstPage = attachment.PageElements.First();

Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== JSON Export ===");
Console.ResetColor();

// Export to JSON
string json = firstPage.ToJson();
File.WriteAllText("page1_layout.json", json);
Console.WriteLine($"Exported page layout to page1_layout.json ({json.Length} chars)");

// Reload from JSON
string loadedJson = File.ReadAllText("page1_layout.json");
PageElement reloaded = PageElement.FromJson(loadedJson);
Console.WriteLine($"Reloaded: {reloaded.Width:F0}x{reloaded.Height:F0} points, " +
                  $"{reloaded.DetectLines().Count} lines");

This is useful for caching layout data. Parse the PDF once, store the JSON, and reuse it for subsequent searches or analysis without re-opening the PDF.
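
The parse-once pattern can be sketched as a small cache helper. This is an illustration in plain .NET, not an LM-Kit.NET feature: `LoadOrCreate` is a hypothetical name, freshness is judged by file timestamps only, and the lambda stands in for the expensive parsing step.

```csharp
using System;
using System.IO;

// Demo with temporary files; the lambda stands in for the expensive parse step.
string source = Path.GetTempFileName();
string cache = source + ".layout.json";
int parseCount = 0;

string first = LoadOrCreate(source, cache, () => { parseCount++; return "{\"demo\":true}"; });
string second = LoadOrCreate(source, cache, () => { parseCount++; return "{\"demo\":true}"; });

Console.WriteLine($"Parsed {parseCount} time(s); cache hit returned same text: {first == second}");

File.Delete(source);
File.Delete(cache);

// Return the cached text when the cache file is at least as new as the source file;
// otherwise call produce(), store its result, and return it.
static string LoadOrCreate(string sourcePath, string cachePath, Func<string> produce)
{
    bool fresh = File.Exists(cachePath) &&
                 File.GetLastWriteTimeUtc(cachePath) >= File.GetLastWriteTimeUtc(sourcePath);
    if (fresh)
        return File.ReadAllText(cachePath);

    string content = produce();
    File.WriteAllText(cachePath, content);
    return content;
}
```

In the program above, `produce` would be `() => firstPage.ToJson()`, and the returned string would be passed to `PageElement.FromJson`. Note that the PDF is only left unopened on a cache hit if the `Attachment` is created inside the lambda as well.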

Step 7: Choose the Right Mode for Your Use Case

Use Case	Recommended Mode	Why
RAG ingestion pipeline	`Structured`	Preserves tables and paragraphs for accurate chunking
LLM prompt context	`ParagraphFlow`	Clean, readable text without visual formatting artifacts
Financial report analysis	`GridAligned`	Maintains column alignment for side-by-side data
Form field extraction	`Structured`	Keeps label-value pairs associated
OCR output cleanup	`Auto`	Adapts to whatever the OCR produced
Debug / raw analysis	`RawLines`	Shows every text element independently

For most document intelligence workflows, Structured mode produces the best results. It preserves the information hierarchy that LLMs and search engines need for accurate retrieval and generation.

Page Metadata

PageElement exposes geometric metadata about the page:

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);

PageElement p = attachment.PageElements.First();

Console.WriteLine($"Dimensions:  {p.Width:F0} x {p.Height:F0} points");
Console.WriteLine($"Rotation:    {p.Rotation} degrees");
Console.WriteLine($"Skew:        {p.Skew:F2} degrees");
Console.WriteLine($"Unit:        {p.Unit}");

Property	Description
`Width`, `Height`	Page dimensions in points (72 points = 1 inch)
`Rotation`	Detected page rotation in degrees clockwise (0, 90, 180, 270)
`Skew`	Detected text skew angle in degrees
`Unit`	Measurement unit (`Points` or `Pixels`)
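
Because 72 points equal 1 inch, point dimensions convert to physical sizes with simple arithmetic. The helpers below are hypothetical and use plain .NET only; the US Letter dimensions and the 300 DPI figure are sample inputs. Pixel values carry no physical size on their own, so the pixel conversion assumes the caller knows the resolution.

```csharp
using System;

// US Letter in points, used as sample input.
double widthPt = 612, heightPt = 792;

Console.WriteLine($"{PointsToInches(widthPt):F2} x {PointsToInches(heightPt):F2} in");
Console.WriteLine($"{PointsToMillimeters(widthPt):F1} x {PointsToMillimeters(heightPt):F1} mm");

// A 2550-pixel-wide scan at an assumed 300 DPI.
Console.WriteLine($"{PixelsToInches(2550, 300):F2} in wide");

// 72 points = 1 inch.
static double PointsToInches(double points) => points / 72.0;

// 1 inch = 25.4 mm.
static double PointsToMillimeters(double points) => points / 72.0 * 25.4;

// Pixels need a resolution (dots per inch) supplied by the caller.
static double PixelsToInches(double pixels, double dpi) => pixels / dpi;
```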

Common Issues

Problem	Cause	Fix
Empty text from PDF	PDF is scanned (image-only)	Use `VlmOcr` to extract text first, then use `PageElement`
Columns interleaved	Using `RawLines` on multi-column layout	Switch to `GridAligned` or `Structured` mode
Tables not preserved	Using `ParagraphFlow` on tabular data	Switch to `Structured` mode for table preservation
Text elements out of order	Complex layout with text boxes	Call `DetectLines()`, which re-orders elements into reading order
JSON export very large	Page has many small text elements	This is normal for complex PDFs; use compression if storing
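
For the large-JSON case, one option is to gzip the exported string before storing it. The following is a minimal sketch using `System.IO.Compression` from the .NET base class library; `Compress` and `Decompress` are hypothetical helper names, and the repetitive sample string only stands in for a real layout export.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

// Stand-in for a large layout export: repetitive JSON-like text.
string json = "[" + string.Join(",",
    Enumerable.Repeat("{\"text\":\"cell\",\"x\":10,\"y\":20}", 500)) + "]";

byte[] packed = Compress(json);
string restored = Decompress(packed);

Console.WriteLine($"Original: {Encoding.UTF8.GetByteCount(json)} bytes, compressed: {packed.Length} bytes");
Console.WriteLine($"Round trip intact: {restored == json}");

// Encode the text as UTF-8 and gzip it into a byte array.
static byte[] Compress(string text)
{
    using var output = new MemoryStream();
    using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        gzip.Write(raw, 0, raw.Length);
    } // disposing the GZipStream writes the final compressed block
    return output.ToArray();
}

// Reverse of Compress: gunzip the bytes and decode them as UTF-8.
static string Decompress(byte[] data)
{
    using var input = new MemoryStream(data);
    using var gzip = new GZipStream(input, CompressionMode.Decompress);
    using var reader = new StreamReader(gzip, Encoding.UTF8);
    return reader.ReadToEnd();
}
```

In the Step 6 program, the string returned by `ToJson()` would be passed to `Compress` and written with `File.WriteAllBytes`; the decompressed string would then go to `PageElement.FromJson`.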

Next Steps

Search and Locate Content Within Documents: perform text, regex, fuzzy, and spatial searches on extracted pages.
Build a Multi-Format Document Ingestion Pipeline: ingest and index documents for RAG.
Convert Documents to Markdown with VLM OCR: extract text from scanned documents using vision models.
Chat with PDF Documents: interactive Q&A over PDF content.
