Extract Text with Layout Preservation from PDFs
Standard PDF text extraction loses the spatial relationships that give a document meaning: columns collapse into a single stream, table cells merge, and headers blend with body text. LM-Kit.NET's PageElement class provides five text output modes that preserve different aspects of document layout. You control whether the output maintains column alignment, groups text into paragraphs, preserves tabular structure, or lets the engine auto-select the best strategy per page. This tutorial builds a text extraction tool that handles financial reports, technical manuals, and multi-column documents.
Why Layout Preservation Matters
Two enterprise problems that layout-aware text extraction solves:
- Financial report parsing. Annual reports and earnings statements use multi-column layouts with tables, footnotes, and sidebar text. Naive extraction produces interleaved gibberish. Layout-aware extraction keeps columns separate and tables intact, producing clean input for downstream analysis or RAG indexing.
- Technical manual indexing. Service manuals and product datasheets use structured layouts with numbered sections, specification tables, and diagrams with captions. Preserving this structure during extraction means the indexed text retains its organizational hierarchy, improving search and retrieval accuracy.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 4 GB |
| Input formats | PDF (with text layer), DOCX |
No GPU is required for text extraction from digital PDFs. GPU is only needed for VLM-based OCR on scanned documents.
Step 1: Create the Project
```bash
dotnet new console -n LayoutExtraction
cd LayoutExtraction
dotnet add package LM-Kit.NET
```
Step 2: Understand the Text Output Modes
```
                 PageElement.GetText(mode)
                            │
    ┌───────────┬───────────┼───────────┬───────────┐
    ▼           ▼           ▼           ▼           ▼
RawLines    GridAligned  Paragraph   Structured    Auto
                           Flow
One line    Preserve     Group       Preserve     Auto-
per text    columns &    lines       tables &     select
element     indentation  into        paragraphs   best
                         paragraphs               mode
```
| Mode | Best For | Output Style |
|---|---|---|
| RawLines | Debug, raw analysis | One line per detected text element, no grouping |
| GridAligned | Multi-column documents | Preserves column alignment and indentation |
| ParagraphFlow | Articles, contracts | Groups lines into paragraphs with spacing |
| Structured | Tables, forms, reports | Preserves both tabular and paragraph structure |
| Auto | Unknown document types | Evaluates page structure and selects optimal mode |
Step 3: Extract Text in All Modes
```csharp
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Extract page 1 in each mode
// ──────────────────────────────────────
PageElement page = attachment.PageElements.First();

TextOutputMode[] modes =
{
    TextOutputMode.RawLines,
    TextOutputMode.GridAligned,
    TextOutputMode.ParagraphFlow,
    TextOutputMode.Structured,
    TextOutputMode.Auto
};

foreach (TextOutputMode mode in modes)
{
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($"=== {mode} ===");
    Console.ResetColor();

    string text = page.GetText(mode);

    // Show first 500 characters
    string preview = text.Length > 500 ? text[..500] + "\n[...]" : text;
    Console.WriteLine(preview);
    Console.WriteLine();
}
```
Step 4: Detect Lines and Paragraphs
PageElement provides line and paragraph detection that groups individual text elements into reading-order structures:
```csharp
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Get page 1
// ──────────────────────────────────────
PageElement page = attachment.PageElements.First();

// ──────────────────────────────────────
// 3. Line detection
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("=== Detected Lines ===");
Console.ResetColor();

List<LineElement> lines = page.DetectLines();
Console.WriteLine($"Found {lines.Count} text lines on page 1:\n");

for (int i = 0; i < Math.Min(lines.Count, 10); i++)
{
    LineElement line = lines[i];
    Console.WriteLine($"  Line {i + 1}: \"{Truncate(line.Text, 80)}\"");
    Console.WriteLine($"    Words: {line.Words.Count}, " +
                      $"Bounds: ({line.Bounds.TopLeft.X:F0}, {line.Bounds.TopLeft.Y:F0}) " +
                      $"to ({line.Bounds.BottomRight.X:F0}, {line.Bounds.BottomRight.Y:F0})");
}

if (lines.Count > 10)
    Console.WriteLine($"  ... and {lines.Count - 10} more lines");

// ──────────────────────────────────────
// 4. Paragraph detection
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== Detected Paragraphs ===");
Console.ResetColor();

List<ParagraphElement> paragraphs = page.DetectParagraphs();
Console.WriteLine($"Found {paragraphs.Count} paragraphs on page 1:\n");

for (int i = 0; i < Math.Min(paragraphs.Count, 5); i++)
{
    ParagraphElement para = paragraphs[i];
    Console.WriteLine($"  Paragraph {i + 1} ({para.Lines.Count} lines):");
    Console.WriteLine($"    \"{Truncate(para.Text, 100)}\"");
    Console.WriteLine();
}

static string Truncate(string text, int max)
{
    if (string.IsNullOrEmpty(text)) return "";
    string cleaned = text.Replace("\n", " ").Replace("\r", "");
    return cleaned.Length <= max ? cleaned : cleaned[..max] + "...";
}
```
Step 5: Process Multi-Page Documents
Extract structured text from every page and combine into a single output:
```csharp
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);

Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("=== Full Document Extraction (Structured Mode) ===");
Console.ResetColor();

var fullText = new StringBuilder();
int totalLines = 0;
int totalParagraphs = 0;
int pageIndex = 0;

foreach (PageElement p in attachment.PageElements)
{
    string pageText = p.GetText(TextOutputMode.Structured);
    List<LineElement> pageLines = p.DetectLines();
    List<ParagraphElement> pageParas = p.DetectParagraphs();

    totalLines += pageLines.Count;
    totalParagraphs += pageParas.Count;

    fullText.AppendLine($"<!-- Page {pageIndex + 1} | {pageLines.Count} lines, {pageParas.Count} paragraphs -->");
    fullText.AppendLine(pageText);
    fullText.AppendLine();

    Console.WriteLine($"  Page {pageIndex + 1}: {pageLines.Count} lines, {pageParas.Count} paragraphs");
    pageIndex++;
}

string outputPath = Path.ChangeExtension(pdfPath, ".txt");
File.WriteAllText(outputPath, fullText.ToString());

Console.WriteLine($"\nTotal: {totalLines} lines, {totalParagraphs} paragraphs");
Console.WriteLine($"Saved to: {outputPath}");
```
Step 6: Export Page Layout as JSON
PageElement supports JSON serialization, enabling you to store and reload page layout data without re-parsing the original document:
```csharp
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Get page 1
// ──────────────────────────────────────
PageElement firstPage = attachment.PageElements.First();

Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== JSON Export ===");
Console.ResetColor();

// Export to JSON
string json = firstPage.ToJson();
File.WriteAllText("page1_layout.json", json);
Console.WriteLine($"Exported page layout to page1_layout.json ({json.Length} chars)");

// Reload from JSON
string loadedJson = File.ReadAllText("page1_layout.json");
PageElement reloaded = PageElement.FromJson(loadedJson);
Console.WriteLine($"Reloaded: {reloaded.Width:F0}x{reloaded.Height:F0} points, " +
                  $"{reloaded.DetectLines().Count} lines");
```
This is useful for caching layout data. Parse the PDF once, store the JSON, and reuse it for subsequent searches or analysis without re-opening the PDF.
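The caching pattern can be sketched as a small helper that checks for a saved layout before parsing. This is an illustrative sketch, not part of the tutorial's code: the cache-file naming and the freshness check are assumptions, while `ToJson`, `FromJson`, `Attachment`, and `PageElement` are the LM-Kit.NET APIs shown above.

```csharp
using LMKit.Data;
using LMKit.Document.Layout;

// Sketch: serve the first page's layout from a JSON cache when possible.
// The ".layout.json" naming convention is invented for this example.
static PageElement GetFirstPageLayout(string pdfPath)
{
    string cachePath = Path.ChangeExtension(pdfPath, ".layout.json");

    if (File.Exists(cachePath) &&
        File.GetLastWriteTimeUtc(cachePath) >= File.GetLastWriteTimeUtc(pdfPath))
    {
        // Cache hit: rebuild the layout without re-opening the PDF.
        return PageElement.FromJson(File.ReadAllText(cachePath));
    }

    // Cache miss (or stale cache): parse the document and persist the layout.
    var attachment = new Attachment(pdfPath);
    PageElement page = attachment.PageElements.First();
    File.WriteAllText(cachePath, page.ToJson());
    return page;
}
```

The timestamp comparison guards against serving a cached layout for a PDF that has since been replaced; drop it if your documents are immutable.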
Step 7: Choosing the Right Mode for Your Use Case
| Use Case | Recommended Mode | Why |
|---|---|---|
| RAG ingestion pipeline | Structured | Preserves tables and paragraphs for accurate chunking |
| LLM prompt context | ParagraphFlow | Clean, readable text without visual formatting artifacts |
| Financial report analysis | GridAligned | Maintains column alignment for side-by-side data |
| Form field extraction | Structured | Keeps label-value pairs associated |
| OCR output cleanup | Auto | Adapts to whatever the OCR produced |
| Debug / raw analysis | RawLines | Shows every text element independently |
For most document intelligence workflows, Structured mode produces the best results. It preserves the information hierarchy that LLMs and search engines need for accurate retrieval and generation.
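If your pipeline routes different document types through the same extraction code, the recommendations above can be folded into a small helper. This is a sketch: the `DocumentKind` enum and `ModePicker` class are hypothetical names invented for this example, while `TextOutputMode` and its values come from the tutorial above.

```csharp
using LMKit.Document.Layout;

// Hypothetical document categories; adapt to your own pipeline's taxonomy.
enum DocumentKind { RagIngestion, LlmContext, FinancialReport, Form, OcrOutput, Debug }

static class ModePicker
{
    // Maps a document category to the recommended TextOutputMode.
    public static TextOutputMode For(DocumentKind kind) => kind switch
    {
        DocumentKind.LlmContext      => TextOutputMode.ParagraphFlow,
        DocumentKind.FinancialReport => TextOutputMode.GridAligned,
        DocumentKind.OcrOutput       => TextOutputMode.Auto,
        DocumentKind.Debug           => TextOutputMode.RawLines,
        _                            => TextOutputMode.Structured // RAG, forms, default
    };
}
```

Defaulting the fall-through arm to Structured mirrors the guidance above: when in doubt, preserve the most structure.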
Page Metadata
PageElement exposes geometric metadata about the page:
```csharp
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
PageElement p = attachment.PageElements.First();

Console.WriteLine($"Dimensions: {p.Width:F0} x {p.Height:F0} points");
Console.WriteLine($"Rotation: {p.Rotation} degrees");
Console.WriteLine($"Skew: {p.Skew:F2} degrees");
Console.WriteLine($"Unit: {p.Unit}");
```
| Property | Description |
|---|---|
| Width, Height | Page dimensions in points (72 points = 1 inch) |
| Rotation | Detected page rotation in degrees clockwise (0, 90, 180, 270) |
| Skew | Detected text skew angle in degrees |
| Unit | Measurement unit (Points or Pixels) |
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Empty text from PDF | PDF is scanned (image-only) | Use VlmOcr to extract text first, then use PageElement |
| Columns interleaved | Using RawLines on multi-column layout | Switch to GridAligned or Structured mode |
| Tables not preserved | Using ParagraphFlow on tabular data | Switch to Structured mode for table preservation |
| Text elements out of order | Complex layout with text boxes | DetectLines() re-orders elements into reading order |
| JSON export very large | Page has many small text elements | This is normal for complex PDFs; use compression if storing |
Next Steps
- Search and Locate Content Within Documents: perform text, regex, fuzzy, and spatial searches on extracted pages.
- Build a Multi-Format Document Ingestion Pipeline: ingest and index documents for RAG.
- Convert Documents to Markdown with VLM OCR: extract text from scanned documents using vision models.
- Chat with PDF Documents: interactive Q&A over PDF content.