Build a Multi-Language Document Processing Pipeline
Global organizations process documents in dozens of languages: vendor contracts in German, compliance filings in Japanese, engineering specs in Mandarin, and customer correspondence in Spanish. LM-Kit.NET provides language detection, translation, multilingual extraction, and localized summarization, all running locally without sending documents to external translation services. This tutorial builds a pipeline that detects document language, extracts structured data regardless of language, translates content, and generates summaries in a target language.
Why Local Multilingual Processing Matters
Two enterprise problems that on-device multilingual document processing solves:
- Export compliance and trade documents. Import/export companies process customs declarations, bills of lading, and certificates of origin in the language of the origin country. These documents contain trade secrets, pricing, and shipment details that cannot be sent to cloud translation APIs without regulatory risk. Local processing keeps all data on-premises.
- Multinational contract management. A company with subsidiaries in 15 countries receives vendor agreements in local languages. Each contract needs key terms extracted (parties, dates, amounts) and a summary generated in the corporate language (English). A multilingual pipeline automates this without a translation agency.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 6+ GB (for multilingual-capable model) |
| Disk | ~5 GB free for model downloads |
Step 1: Create the Project
dotnet new console -n MultiLanguagePipeline
cd MultiLanguagePipeline
dotnet add package LM-Kit.NET
Step 2: Understand the Pipeline
Document            ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
(any language)  ►   │ 1. Detect     │ ──► │ 2. Extract    │ ──► │ 3. Translate  │
                    │    language   │     │    (native)   │     │       or      │
                    └───────────────┘     └───────────────┘     │    Summarize  │
                                                                └───────────────┘
                                                                        │
                                                                        ▼
                                                                Output in target
                                                                    language
The pipeline uses four LM-Kit.NET capabilities:
- TextTranslation.DetectLanguage() to identify the source language
- TextExtraction to extract fields from documents in any language
- TextTranslation.Translate() to convert text to a target language
- Summarizer with TargetLanguage for cross-lingual summarization
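The stage order in the diagram can be sketched as a plain function that receives each capability as a delegate. This is a minimal sketch, not LM-Kit.NET API: RunPipeline and its three delegate parameters are hypothetical names, and the stand-in lambdas return canned values so the flow runs without a model.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical orchestration: each delegate would wrap one LM-Kit.NET call
// (detection, extraction, translation or summarization) in a real pipeline.
static (string Language, IReadOnlyDictionary<string, string> Fields, string Output) RunPipeline(
    string document,
    Func<string, string> detectLanguage,
    Func<string, string, IReadOnlyDictionary<string, string>> extractFields,
    Func<string, string, string> renderInTarget)
{
    string language = detectLanguage(document);           // stage 1
    var fields = extractFields(document, language);       // stage 2: runs on the native text
    string output = renderInTarget(document, language);   // stage 3: translate or summarize
    return (language, fields, output);
}

// Stand-in stages with canned values (assumptions, no model involved)
var result = RunPipeline(
    "Rechnung Nr. 2025-0847",
    detectLanguage: _ => "German",
    extractFields: (doc, lang) => new Dictionary<string, string> { ["invoice_number"] = "2025-0847" },
    renderInTarget: (doc, lang) => $"[{lang} -> English] Invoice No. 2025-0847");

Console.WriteLine($"{result.Language}: {result.Fields["invoice_number"]} | {result.Output}");
```

Extraction deliberately receives the original text rather than a translation, which matches the "native" label on the second box.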
Step 3: Language Detection
using System.Text;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Translation;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load a multilingual model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3.5:9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Detect document language
// ──────────────────────────────────────
var translator = new TextTranslation(model);
string[] documents =
{
    "Rechnung Nr. 2025-0847 vom 15. Januar 2025. Kunde: Müller GmbH. " +
    "Gesamtbetrag: 4.250,00 EUR inklusive 19% MwSt. Zahlungsziel: 30 Tage netto.",
    "Facture N° 2025-1203 du 22 février 2025. Client: Dupont SA. " +
    "Montant total: 3.780,00 EUR TTC. Conditions de paiement: 45 jours.",
    "請求書番号: 2025-0392。日付: 2025年3月1日。" +
    "株式会社田中製作所。合計金額: ¥856,000(税込)。支払条件: 月末締め翌月末払い。"
};
Console.WriteLine("=== Language Detection ===\n");
foreach (string doc in documents)
{
    Language detected = translator.DetectLanguage(doc);
    float confidence = translator.Confidence;
    string preview = doc.Length > 60 ? doc[..60] + "..." : doc;
    Console.WriteLine($" [{detected}] ({confidence:P0}) {preview}");
}
Console.WriteLine();
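Detection is least reliable on short snippets, so a pipeline may want to check both text length and the reported confidence before extraction runs. The helper below is a sketch under assumptions: Route, the 0.80 threshold, and the 40-character minimum are illustrative values, not LM-Kit.NET defaults. The confidence argument would be the value read from translator.Confidence in the loop above.

```csharp
using System;

// Hypothetical gate: decides whether a detection result can be used as-is.
// Threshold and minimum length are assumed values; tune them on real documents.
static string Route(string text, float confidence, float minConfidence = 0.80f, int minChars = 40)
{
    // Very short inputs give the detector little evidence, whatever the score says
    if (text.Trim().Length < minChars) return "manual-review";
    return confidence >= minConfidence ? "auto" : "manual-review";
}

string invoice = "Rechnung Nr. 2025-0847 vom 15. Januar 2025. Kunde: Müller GmbH.";
Console.WriteLine(Route(invoice, 0.95f));      // long text, high score
Console.WriteLine(Route("Rechnung", 0.99f));   // too short to trust
Console.WriteLine(Route(invoice, 0.55f));      // long text, low score
```

Documents routed to manual review can still be processed later once a person confirms the language.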
Step 4: Extract Data from Any Language
TextExtraction works across languages. The model understands the document regardless of language and maps content to your schema:
using System.Text;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Translation;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load a multilingual model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3.5:9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Detect document language
// ──────────────────────────────────────
var translator = new TextTranslation(model);
string[] documents =
{
    "Rechnung Nr. 2025-0847 vom 15. Januar 2025. Kunde: Müller GmbH. " +
    "Gesamtbetrag: 4.250,00 EUR inklusive 19% MwSt. Zahlungsziel: 30 Tage netto.",
    "Facture N° 2025-1203 du 22 février 2025. Client: Dupont SA. " +
    "Montant total: 3.780,00 EUR TTC. Conditions de paiement: 45 jours.",
    "請求書番号: 2025-0392。日付: 2025年3月1日。" +
    "株式会社田中製作所。合計金額: ¥856,000(税込)。支払条件: 月末締め翌月末払い。"
};
// ──────────────────────────────────────
// 3. Extract structured data (language-agnostic)
// ──────────────────────────────────────
Console.WriteLine("=== Cross-Language Extraction ===\n");
var extractor = new TextExtraction(model)
{
    NullOnDoubt = true,
    Elements = new List<TextExtractionElement>
    {
        new("invoice_number", ElementType.String,
            "Invoice identifier or number", isRequired: true),
        new("customer_name", ElementType.String,
            "Name of the customer or client", isRequired: true),
        new("total_amount", ElementType.Double,
            "Total amount including tax", isRequired: true),
        new("currency", ElementType.String,
            "Currency code (EUR, USD, JPY, etc.)"),
        new("payment_terms", ElementType.String,
            "Payment terms or conditions")
    }
};
foreach (string doc in documents)
{
    Language lang = translator.DetectLanguage(doc);
    extractor.SetContent(doc);
    // Add guidance about regional formatting
    extractor.Guidance = lang switch
    {
        Language.German => "German document. Dates use DD.MM.YYYY. Amounts use comma as decimal separator.",
        Language.French => "French document. Dates use DD/MM/YYYY. Amounts use comma as decimal separator.",
        Language.Japanese => "Japanese document. Dates may use 年月日 format. Amounts in yen have no decimal.",
        _ => ""
    };
    TextExtractionResult result = extractor.Parse();
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($" [{lang}] Invoice #{result.GetValue<string>("invoice_number")}");
    Console.ResetColor();
    Console.WriteLine($" Customer: {result.GetValue<string>("customer_name")}");
    Console.WriteLine($" Total: {result.GetValue<double>("total_amount")} {result.GetValue<string>("currency")}");
    Console.WriteLine($" Terms: {result.GetValue<string>("payment_terms")}");
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.WriteLine($" Confidence: {result.Confidence:P0}");
    Console.ResetColor();
    Console.WriteLine();
}
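Guidance steers the model, but an extracted total can also be cross-checked in plain .NET by parsing the raw amount string with explicit separators. This sketch uses only System.Globalization. ParseAmount is a hypothetical helper, and the separators passed to it are assumptions that mirror the guidance strings above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: parses an amount using explicit separators instead of
// relying on the machine's current culture.
static decimal ParseAmount(string raw, char decimalSeparator, char groupSeparator)
{
    var format = new NumberFormatInfo
    {
        NumberDecimalSeparator = decimalSeparator.ToString(),
        NumberGroupSeparator = groupSeparator.ToString()
    };
    return decimal.Parse(raw, NumberStyles.Number, format);
}

// German invoice: '.' groups thousands, ',' marks decimals
decimal german = ParseAmount("4.250,00", ',', '.');
// Japanese invoice: ',' groups thousands, yen has no decimal part
decimal japanese = ParseAmount("856,000", '.', ',');

Console.WriteLine(german.ToString(CultureInfo.InvariantCulture));
Console.WriteLine(japanese.ToString(CultureInfo.InvariantCulture));
```

If the parsed value disagrees with the model's total_amount, the document is a good candidate for review.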
Step 5: Translate Documents
Translate document content into a target language:
using System.Text;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Translation;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load a multilingual model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3.5:9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Detect document language
// ──────────────────────────────────────
var translator = new TextTranslation(model);
string[] documents =
{
    "Rechnung Nr. 2025-0847 vom 15. Januar 2025. Kunde: Müller GmbH. " +
    "Gesamtbetrag: 4.250,00 EUR inklusive 19% MwSt. Zahlungsziel: 30 Tage netto.",
    "Facture N° 2025-1203 du 22 février 2025. Client: Dupont SA. " +
    "Montant total: 3.780,00 EUR TTC. Conditions de paiement: 45 jours.",
    "請求書番号: 2025-0392。日付: 2025年3月1日。" +
    "株式会社田中製作所。合計金額: ¥856,000(税込)。支払条件: 月末締め翌月末払い。"
};
// ──────────────────────────────────────
// 4. Translate to English
// ──────────────────────────────────────
Console.WriteLine("=== Translation to English ===\n");
// Stream translation output
translator.AfterTextCompletion += (_, e) =>
{
    Console.Write(e.Text);
};
foreach (string doc in documents)
{
    Language sourceLang = translator.DetectLanguage(doc);
    if (sourceLang == Language.English)
    {
        Console.WriteLine($" [{sourceLang}] Already in English. Skipping.\n");
        continue;
    }
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.Write($" [{sourceLang} → English] ");
    Console.ResetColor();
    // The text is printed by the streaming handler above; the return value holds the full translation
    string translated = translator.Translate(doc, Language.English);
    Console.WriteLine("\n");
}
Step 6: Cross-Lingual Summarization
Generate summaries in a target language regardless of the source document language:
using System.Text;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Translation;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load a multilingual model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3.5:9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Detect document language
// ──────────────────────────────────────
var translator = new TextTranslation(model);
string[] documents =
{
    "Rechnung Nr. 2025-0847 vom 15. Januar 2025. Kunde: Müller GmbH. " +
    "Gesamtbetrag: 4.250,00 EUR inklusive 19% MwSt. Zahlungsziel: 30 Tage netto.",
    "Facture N° 2025-1203 du 22 février 2025. Client: Dupont SA. " +
    "Montant total: 3.780,00 EUR TTC. Conditions de paiement: 45 jours.",
    "請求書番号: 2025-0392。日付: 2025年3月1日。" +
    "株式会社田中製作所。合計金額: ¥856,000(税込)。支払条件: 月末締め翌月末払い。"
};
// ──────────────────────────────────────
// 5. Summarize in English regardless of source language
// ──────────────────────────────────────
Console.WriteLine("=== Cross-Lingual Summarization ===\n");
var summarizer = new Summarizer(model)
{
    MaxContentWords = 50,
    MaxTitleWords = 8,
    GenerateTitle = true,
    GenerateContent = true,
    Intent = Summarizer.SummarizationIntent.Abstraction,
    TargetLanguage = Language.English
};
foreach (string doc in documents)
{
    Language sourceLang = translator.DetectLanguage(doc);
    Summarizer.SummarizerResult summary = summarizer.Summarize(doc);
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($" [{sourceLang} → English summary]");
    Console.ResetColor();
    Console.WriteLine($" Title: {summary.Title}");
    Console.WriteLine($" Summary: {summary.Summary}");
    Console.WriteLine();
}
Step 7: Processing Multilingual PDF Files
The same pipeline works with PDF attachments:
using System.Text;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Translation;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load a multilingual model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3.5:9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Set up the translator, extractor, and CSV output
// ──────────────────────────────────────
var translator = new TextTranslation(model);
// Same invoice fields as Step 4, minus payment terms (not exported to the CSV)
var extractor = new TextExtraction(model)
{
    NullOnDoubt = true,
    Elements = new List<TextExtractionElement>
    {
        new("invoice_number", ElementType.String,
            "Invoice identifier or number", isRequired: true),
        new("customer_name", ElementType.String,
            "Name of the customer or client", isRequired: true),
        new("total_amount", ElementType.Double,
            "Total amount including tax", isRequired: true),
        new("currency", ElementType.String,
            "Currency code (EUR, USD, JPY, etc.)")
    }
};
string[] pdfFiles = Directory.GetFiles("multilingual_docs", "*.pdf");
var csvOutput = new List<string>();
csvOutput.Add("file,language,invoice_number,customer,total,currency");
// ──────────────────────────────────────
// 3. Detect, extract, and export each PDF
// ──────────────────────────────────────
foreach (string pdfPath in pdfFiles)
{
    string fileName = Path.GetFileName(pdfPath);
    var attachment = new Attachment(pdfPath);
    // Detect language from PDF
    Language lang = translator.DetectLanguage(attachment);
    // Set regional guidance
    extractor.Guidance = lang switch
    {
        Language.German or Language.Dutch => "European document. Comma is decimal separator. Dates: DD.MM.YYYY.",
        Language.French or Language.Italian or Language.Spanish => "European document. Comma is decimal separator. Dates: DD/MM/YYYY.",
        Language.Japanese or Language.Korean or Language.ChineseSimplified => "East Asian document. Use local number formatting.",
        _ => ""
    };
    // Extract
    extractor.SetContent(attachment);
    TextExtractionResult result = extractor.Parse();
    string invoiceNum = result.GetValue<string>("invoice_number") ?? "N/A";
    string customer = result.GetValue<string>("customer_name") ?? "N/A";
    double total = result.GetValue<double>("total_amount");
    string currency = result.GetValue<string>("currency") ?? "N/A";
    // Invariant culture keeps '.' as the decimal point, so a comma-decimal locale cannot add a CSV column
    string totalText = total.ToString(System.Globalization.CultureInfo.InvariantCulture);
    Console.WriteLine($" {fileName} [{lang}] → #{invoiceNum}, {customer}, {totalText} {currency}");
    csvOutput.Add($"\"{fileName}\",\"{lang}\",\"{invoiceNum}\",\"{customer}\",{totalText},\"{currency}\"");
}
File.WriteAllLines("multilingual_results.csv", csvOutput);
Console.WriteLine($"\nExported to multilingual_results.csv");
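The export above wraps text fields in quotes but does not escape quotes that appear inside a value, such as a company name containing a quoted trade name. A minimal sketch of RFC 4180 style quoting follows. CsvField and CsvRow are hypothetical helpers built only on the .NET base library, and numbers are formatted with the invariant culture so the decimal separator is always a period.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper: quotes a field only when needed and doubles embedded quotes
static string CsvField(string value)
{
    bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
    return needsQuotes ? "\"" + value.Replace("\"", "\"\"") + "\"" : value;
}

// Hypothetical helper: formats every cell culture-independently, then joins with commas
static string CsvRow(params object[] cells) =>
    string.Join(",", cells.Select(c =>
        CsvField(Convert.ToString(c, CultureInfo.InvariantCulture) ?? "")));

string row = CsvRow("rechnung.pdf", "German", "2025-0847", "Müller \"Nord\" GmbH, Berlin", 4250.5);
Console.WriteLine(row);
```

In the loop above, the csvOutput.Add call could pass the same values through CsvRow instead of building the line by hand.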
Supported Languages
LM-Kit.NET supports 35+ languages for detection, translation, and summarization:
| Region | Languages |
|---|---|
| Western European | English, French, German, Spanish, Italian, Portuguese, Dutch, Danish, Swedish, Norwegian, Finnish |
| Eastern European | Polish, Czech, Slovak, Hungarian, Romanian, Bulgarian, Croatian, Serbian, Ukrainian, Russian |
| East Asian | Japanese, Korean, Chinese (Simplified), Chinese (Traditional) |
| Middle Eastern | Arabic, Hebrew, Turkish |
| South/Southeast Asian | Hindi, Indonesian, Thai, Vietnamese |
| Other | Armenian, Modern Greek |
Model Selection
| Model ID | VRAM | Multilingual Quality | Best For |
|---|---|---|---|
| qwen3.5:4b | ~3.5 GB | Good | European languages, high throughput |
| qwen3.5:9b | ~7 GB | Very good | Most languages including CJK (recommended) |
| gemma4:e4b | ~8 GB | Excellent | Complex multilingual documents |
| qwen3.6:27b | ~18 GB | Excellent | Highest accuracy across all languages |
The Qwen 3.5 family has the strongest multilingual capabilities. Use qwen3.5:9b for the best balance of accuracy and speed across all supported languages.
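The table can be turned into a simple startup check that picks a model ID from the VRAM available. This is a sketch: PickModel is a hypothetical helper, the thresholds are the approximate figures copied from the table, and measuring available VRAM is left out because it is platform-specific.

```csharp
using System;

// Hypothetical selector based on the approximate VRAM figures in the table
static string PickModel(double availableVramGb, bool preferAccuracy = false)
{
    if (preferAccuracy && availableVramGb >= 18.0) return "qwen3.6:27b";
    if (availableVramGb >= 7.0) return "qwen3.5:9b";   // recommended default
    if (availableVramGb >= 3.5) return "qwen3.5:4b";   // European languages, high throughput
    throw new InvalidOperationException(
        $"No listed model fits in {availableVramGb} GB of VRAM.");
}

Console.WriteLine(PickModel(8.0));
Console.WriteLine(PickModel(4.0));
Console.WriteLine(PickModel(24.0, preferAccuracy: true));
```

The returned ID can be passed to LM.LoadFromModelID in place of the hard-coded string used in the earlier steps.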
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Wrong language detected | Short text or mixed-language document | Provide a longer text sample; use DetectLanguage with specific language candidates |
| Extraction fails on CJK documents | Model too small for ideographic scripts | Use qwen3.5:9b or larger for Japanese, Chinese, Korean |
| Date/number format errors | Regional formatting not specified | Add Guidance with regional format hints |
| Translation quality low | Using a non-multilingual model | Switch to Qwen 3.5 family for best multilingual support |
| Summary in wrong language | TargetLanguage not set on summarizer | Set summarizer.TargetLanguage = Language.English explicitly |
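For the date format row, extracted date strings can be normalized to ISO 8601 after extraction so downstream systems see one format. The sketch below relies only on DateTime.TryParseExact. TryNormalizeDate is a hypothetical helper, and the three patterns are assumptions taken from the sample invoices in this tutorial.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: tries a fixed list of regional patterns and emits yyyy-MM-dd
static bool TryNormalizeDate(string raw, out string iso)
{
    string[] formats =
    {
        "dd'.'MM'.'yyyy",       // German style: 15.01.2025
        "dd'/'MM'/'yyyy",       // French style: 22/02/2025
        "yyyy'年'M'月'd'日'"     // Japanese style: 2025年3月1日
    };
    bool ok = DateTime.TryParseExact(raw.Trim(), formats, CultureInfo.InvariantCulture,
                                     DateTimeStyles.None, out DateTime parsed);
    iso = ok ? parsed.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture) : "";
    return ok;
}

foreach (string raw in new[] { "15.01.2025", "22/02/2025", "2025年3月1日", "next Tuesday" })
{
    bool ok = TryNormalizeDate(raw, out string iso);
    Console.WriteLine(ok ? $"{raw} -> {iso}" : $"{raw} -> not recognized");
}
```

Day-first parsing is an assumption here. A pipeline that also handles US documents would need to choose the pattern list from the detected language rather than trying them all.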
Next Steps
- Translate and Localize Content: deep dive into translation with streaming and batch processing.
- Extract Invoice Data from PDFs and Images: specialized invoice extraction with predefined schemas.
- Build a Document Summarization Pipeline for Large Archives: batch summarization with overflow handling.
- Automate Contract and Compliance Document Review: contract review pipeline with risk flagging.