Build a Multi-Language Document Processing Pipeline

Global organizations process documents in dozens of languages: vendor contracts in German, compliance filings in Japanese, engineering specs in Mandarin, and customer correspondence in Spanish. LM-Kit.NET provides language detection, translation, multilingual extraction, and localized summarization, all running locally without sending documents to external translation services. This tutorial builds a pipeline that detects document language, extracts structured data regardless of language, translates content, and generates summaries in a target language.


Why Local Multilingual Processing Matters

On-device multilingual document processing solves two enterprise problems:

  1. Export compliance and trade documents. Import/export companies process customs declarations, bills of lading, and certificates of origin in the language of the origin country. These documents contain trade secrets, pricing, and shipment details that cannot be sent to cloud translation APIs without regulatory risk. Local processing keeps all data on-premises.
  2. Multinational contract management. A company with subsidiaries in 15 countries receives vendor agreements in local languages. Each contract needs key terms extracted (parties, dates, amounts) and a summary generated in the corporate language (English). A multilingual pipeline automates this without a translation agency.

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
VRAM          6+ GB (for multilingual-capable model)
Disk          ~5 GB free for model downloads

Step 1: Create the Project

dotnet new console -n MultiLanguagePipeline
cd MultiLanguagePipeline
dotnet add package LM-Kit.NET

Step 2: Understand the Pipeline

  Document          ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
  (any language) ►  │ 1. Detect     │ ──► │ 2. Extract    │ ──► │ 3. Translate  │
                    │    language   │     │    (native)   │     │    or         │
                    └───────────────┘     └───────────────┘     │    Summarize  │
                                                                └───────────────┘
                                                                       │
                                                                       ▼
                                                                Output in target
                                                                language

The pipeline uses four LM-Kit.NET capabilities:

  • TextTranslation.DetectLanguage() to identify the source language
  • TextExtraction to extract fields from documents in any language
  • TextTranslation.Translate() to convert text to a target language
  • Summarizer with TargetLanguage for cross-lingual summarization
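
The flow in the diagram can be sketched as a single function that receives each stage as a delegate. This is a minimal sketch, not LM-Kit.NET code: RunPipeline and its three delegate parameters are hypothetical names, and the stages are stubbed so the sketch runs without a model. In the tutorial, each stage is backed by the LM-Kit.NET classes listed above.

```csharp
using System;

// Hypothetical orchestration helper: detect first, then extract and summarize.
static (string Language, string Fields, string Summary) RunPipeline(
    string document,
    Func<string, string> detectLanguage,
    Func<string, string, string> extractFields,
    Func<string, string, string> summarizeTo,
    string targetLanguage)
{
    string language = detectLanguage(document);             // stage 1
    string fields = extractFields(document, language);      // stage 2 (language used for guidance)
    string summary = summarizeTo(document, targetLanguage); // stage 3
    return (language, fields, summary);
}

// Stub stages (assumptions for illustration only).
var result = RunPipeline(
    "Rechnung Nr. 2025-0847",
    detectLanguage: text => text.StartsWith("Rechnung", StringComparison.Ordinal) ? "German" : "Unknown",
    extractFields: (text, lang) => $"lang={lang};length={text.Length}",
    summarizeTo: (text, target) => $"[{target}] {text}",
    targetLanguage: "English");

Console.WriteLine(result);
```

Passing stages as delegates keeps the ordering logic testable on its own, separate from model loading.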

Step 3: Language Detection

using System.Text;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Translation;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a multilingual model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3.5:9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Detect document language
// ──────────────────────────────────────
var translator = new TextTranslation(model);

string[] documents =
{
    "Rechnung Nr. 2025-0847 vom 15. Januar 2025. Kunde: Müller GmbH. " +
    "Gesamtbetrag: 4.250,00 EUR inklusive 19% MwSt. Zahlungsziel: 30 Tage netto.",

    "Facture N° 2025-1203 du 22 février 2025. Client: Dupont SA. " +
    "Montant total: 3.780,00 EUR TTC. Conditions de paiement: 45 jours.",

    "請求書番号: 2025-0392。日付: 2025年3月1日。" +
    "株式会社田中製作所。合計金額: ¥856,000(税込)。支払条件: 月末締め翌月末払い。"
};

Console.WriteLine("=== Language Detection ===\n");

foreach (string doc in documents)
{
    Language detected = translator.DetectLanguage(doc);
    float confidence = translator.Confidence;

    string preview = doc.Length > 60 ? doc[..60] + "..." : doc;
    Console.WriteLine($"  [{detected}] ({confidence:P0}) {preview}");
}
Console.WriteLine();
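
Because detection is less reliable on short or mixed-language text, a pipeline may want to gate on the reported confidence before continuing. The sketch below is one possible policy, not part of LM-Kit.NET: NeedsReview is a hypothetical helper, the 0.70 threshold and 20-character minimum are arbitrary assumptions to tune on your own documents, and the sample confidence values are made up for illustration.

```csharp
using System;

// Hypothetical gate: flag a detection for manual review when the confidence
// is low or the text is too short to be a reliable sample.
static bool NeedsReview(string text, float confidence,
                        float minConfidence = 0.70f, int minLength = 20) =>
    confidence < minConfidence || text.Trim().Length < minLength;

// Illustrative inputs only; real confidences come from the detector.
var samples = new (string Text, float Confidence)[]
{
    ("Rechnung Nr. 2025-0847 vom 15. Januar 2025.", 0.97f),
    ("OK", 0.99f),
    ("Facture N° 2025-1203 du 22 février 2025.", 0.41f)
};

foreach (var (text, confidence) in samples)
{
    string action = NeedsReview(text, confidence) ? "review" : "accept";
    Console.WriteLine($"{action}: {text}");
}
```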

Step 4: Extract Data from Any Language

TextExtraction works across languages. The model understands the document regardless of language and maps content to your schema:

using System.Text;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Translation;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a multilingual model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3.5:9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Detect document language
// ──────────────────────────────────────
var translator = new TextTranslation(model);

string[] documents =
{
    "Rechnung Nr. 2025-0847 vom 15. Januar 2025. Kunde: Müller GmbH. " +
    "Gesamtbetrag: 4.250,00 EUR inklusive 19% MwSt. Zahlungsziel: 30 Tage netto.",

    "Facture N° 2025-1203 du 22 février 2025. Client: Dupont SA. " +
    "Montant total: 3.780,00 EUR TTC. Conditions de paiement: 45 jours.",

    "請求書番号: 2025-0392。日付: 2025年3月1日。" +
    "株式会社田中製作所。合計金額: ¥856,000(税込)。支払条件: 月末締め翌月末払い。"
};

// ──────────────────────────────────────
// 3. Extract structured data (language-agnostic)
// ──────────────────────────────────────
Console.WriteLine("=== Cross-Language Extraction ===\n");

var extractor = new TextExtraction(model)
{
    NullOnDoubt = true,
    Elements = new List<TextExtractionElement>
    {
        new("invoice_number", ElementType.String,
            "Invoice identifier or number", isRequired: true),
        new("customer_name", ElementType.String,
            "Name of the customer or client", isRequired: true),
        new("total_amount", ElementType.Double,
            "Total amount including tax", isRequired: true),
        new("currency", ElementType.String,
            "Currency code (EUR, USD, JPY, etc.)"),
        new("payment_terms", ElementType.String,
            "Payment terms or conditions")
    }
};

foreach (string doc in documents)
{
    Language lang = translator.DetectLanguage(doc);

    extractor.SetContent(doc);
    // Add guidance about regional formatting
    extractor.Guidance = lang switch
    {
        Language.German => "German document. Dates use DD.MM.YYYY. Amounts use comma as decimal separator.",
        Language.French => "French document. Dates use DD/MM/YYYY. Amounts use comma as decimal separator.",
        Language.Japanese => "Japanese document. Dates may use 年月日 format. Amounts in yen have no decimal.",
        _ => ""
    };

    TextExtractionResult result = extractor.Parse();

    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($"  [{lang}] Invoice #{result.GetValue<string>("invoice_number")}");
    Console.ResetColor();
    Console.WriteLine($"    Customer: {result.GetValue<string>("customer_name")}");
    Console.WriteLine($"    Total:    {result.GetValue<double>("total_amount")} {result.GetValue<string>("currency")}");
    Console.WriteLine($"    Terms:    {result.GetValue<string>("payment_terms")}");
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.WriteLine($"    Confidence: {result.Confidence:P0}");
    Console.ResetColor();
    Console.WriteLine();
}
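
The regional guidance matters because the same digits mean different things under different separators. As a cross-check on model output, a raw amount string can be parsed deterministically with the standard .NET NumberFormatInfo class. This is a minimal sketch under the assumption that the raw amount is available as a string (for example, from an extra string field in the schema); ParseAmount is a hypothetical helper, not an LM-Kit.NET API.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: parse an amount using explicit separators instead of
// relying on the machine's current culture.
static double ParseAmount(string raw, bool commaIsDecimal)
{
    var format = new NumberFormatInfo
    {
        NumberDecimalSeparator = commaIsDecimal ? "," : ".",
        NumberGroupSeparator = commaIsDecimal ? "." : ","
    };
    return double.Parse(raw.Trim(), NumberStyles.Number, format);
}

// Amounts taken from the sample invoices above.
double german = ParseAmount("4.250,00", commaIsDecimal: true);
double japanese = ParseAmount("856,000", commaIsDecimal: false);

Console.WriteLine(german.ToString(CultureInfo.InvariantCulture));
Console.WriteLine(japanese.ToString(CultureInfo.InvariantCulture));
```

If the parsed value and the model's total_amount disagree, that record is a good candidate for manual review.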

Step 5: Translate Documents

Translate document content into a target language:

using System.Text;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Translation;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a multilingual model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3.5:9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Detect document language
// ──────────────────────────────────────
var translator = new TextTranslation(model);

string[] documents =
{
    "Rechnung Nr. 2025-0847 vom 15. Januar 2025. Kunde: Müller GmbH. " +
    "Gesamtbetrag: 4.250,00 EUR inklusive 19% MwSt. Zahlungsziel: 30 Tage netto.",

    "Facture N° 2025-1203 du 22 février 2025. Client: Dupont SA. " +
    "Montant total: 3.780,00 EUR TTC. Conditions de paiement: 45 jours.",

    "請求書番号: 2025-0392。日付: 2025年3月1日。" +
    "株式会社田中製作所。合計金額: ¥856,000(税込)。支払条件: 月末締め翌月末払い。"
};

// ──────────────────────────────────────
// 4. Translate to English
// ──────────────────────────────────────
Console.WriteLine("=== Translation to English ===\n");

// Stream translation output
translator.AfterTextCompletion += (_, e) =>
{
    Console.Write(e.Text);
};

foreach (string doc in documents)
{
    Language sourceLang = translator.DetectLanguage(doc);

    if (sourceLang == Language.English)
    {
        Console.WriteLine($"  [{sourceLang}] Already in English. Skipping.\n");
        continue;
    }

    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.Write($"  [{sourceLang} → English] ");
    Console.ResetColor();

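    // The AfterTextCompletion handler above streams the output as it is generated;
    // the return value holds the complete translation for later use.
    string translated = translator.Translate(doc, Language.English);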
    Console.WriteLine("\n");
}

Step 6: Cross-Lingual Summarization

Generate summaries in a target language regardless of the source document language:

using System.Text;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Translation;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a multilingual model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3.5:9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Detect document language
// ──────────────────────────────────────
var translator = new TextTranslation(model);

string[] documents =
{
    "Rechnung Nr. 2025-0847 vom 15. Januar 2025. Kunde: Müller GmbH. " +
    "Gesamtbetrag: 4.250,00 EUR inklusive 19% MwSt. Zahlungsziel: 30 Tage netto.",

    "Facture N° 2025-1203 du 22 février 2025. Client: Dupont SA. " +
    "Montant total: 3.780,00 EUR TTC. Conditions de paiement: 45 jours.",

    "請求書番号: 2025-0392。日付: 2025年3月1日。" +
    "株式会社田中製作所。合計金額: ¥856,000(税込)。支払条件: 月末締め翌月末払い。"
};

// ──────────────────────────────────────
// 5. Summarize in English regardless of source language
// ──────────────────────────────────────
Console.WriteLine("=== Cross-Lingual Summarization ===\n");

var summarizer = new Summarizer(model)
{
    MaxContentWords = 50,
    MaxTitleWords = 8,
    GenerateTitle = true,
    GenerateContent = true,
    Intent = Summarizer.SummarizationIntent.Abstraction,
    TargetLanguage = Language.English
};

foreach (string doc in documents)
{
    Language sourceLang = translator.DetectLanguage(doc);

    Summarizer.SummarizerResult summary = summarizer.Summarize(doc);

    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($"  [{sourceLang} → English summary]");
    Console.ResetColor();
    Console.WriteLine($"    Title:   {summary.Title}");
    Console.WriteLine($"    Summary: {summary.Summary}");
    Console.WriteLine();
}

Step 7: Processing Multilingual PDF Files

The same pipeline works with PDF attachments:

using System.Text;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Translation;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a multilingual model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3.5:9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Detect document language
// ──────────────────────────────────────
var translator = new TextTranslation(model);

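// ──────────────────────────────────────
// 3. Configure the extractor (same schema as Step 4)
// ──────────────────────────────────────
var extractor = new TextExtraction(model)
{
    NullOnDoubt = true,
    Elements = new List<TextExtractionElement>
    {
        new("invoice_number", ElementType.String,
            "Invoice identifier or number", isRequired: true),
        new("customer_name", ElementType.String,
            "Name of the customer or client", isRequired: true),
        new("total_amount", ElementType.Double,
            "Total amount including tax", isRequired: true),
        new("currency", ElementType.String,
            "Currency code (EUR, USD, JPY, etc.)"),
        new("payment_terms", ElementType.String,
            "Payment terms or conditions")
    }
};

string[] pdfFiles = Directory.GetFiles("multilingual_docs", "*.pdf");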

var csvOutput = new List<string>();
csvOutput.Add("file,language,invoice_number,customer,total,currency");

foreach (string pdfPath in pdfFiles)
{
    string fileName = Path.GetFileName(pdfPath);
    var attachment = new Attachment(pdfPath);

    // Detect language from PDF
    Language lang = translator.DetectLanguage(attachment);

    // Set regional guidance
    extractor.Guidance = lang switch
    {
        Language.German or Language.Dutch => "European document. Comma is decimal separator. Dates: DD.MM.YYYY.",
        Language.French or Language.Italian or Language.Spanish => "European document. Comma is decimal separator. Dates: DD/MM/YYYY.",
        Language.Japanese or Language.Korean or Language.ChineseSimplified => "East Asian document. Use local number formatting.",
        _ => ""
    };

    // Extract
    extractor.SetContent(attachment);
    TextExtractionResult result = extractor.Parse();

    string invoiceNum = result.GetValue<string>("invoice_number") ?? "N/A";
    string customer = result.GetValue<string>("customer_name") ?? "N/A";
    double total = result.GetValue<double>("total_amount");
    string currency = result.GetValue<string>("currency") ?? "N/A";

    Console.WriteLine($"  {fileName} [{lang}] → #{invoiceNum}, {customer}, {total} {currency}");

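    // Format the total with the invariant culture so a comma decimal separator cannot split the CSV column
    csvOutput.Add($"\"{fileName}\",\"{lang}\",\"{invoiceNum}\",\"{customer}\",{total.ToString(System.Globalization.CultureInfo.InvariantCulture)},\"{currency}\"");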
}

File.WriteAllLines("multilingual_results.csv", csvOutput);
Console.WriteLine($"\nExported to multilingual_results.csv");
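
Customer names extracted from real documents can contain quotes or commas, which would corrupt a hand-built CSV row. The sketch below shows one way to build a row safely, following the usual CSV convention of doubling embedded quotes. CsvField and CsvRow are hypothetical helpers, not part of LM-Kit.NET, and the sample values are illustrative.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: quote a field and double any embedded quotes.
static string CsvField(string value) =>
    "\"" + (value ?? "").Replace("\"", "\"\"") + "\"";

// Hypothetical helper: build one row matching the header used in this step.
// The total is written with the invariant culture so it always uses '.'.
static string CsvRow(string file, string language, string invoice,
                     string customer, double total, string currency) =>
    string.Join(",",
        CsvField(file), CsvField(language), CsvField(invoice), CsvField(customer),
        total.ToString(CultureInfo.InvariantCulture), CsvField(currency));

string row = CsvRow("a.pdf", "German", "2025-0847", "Müller \"Nord\" GmbH", 4250.5, "EUR");
Console.WriteLine(row);
```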

Supported Languages

LM-Kit.NET supports 35+ languages for detection, translation, and summarization:

Region                  Languages
Western European        English, French, German, Spanish, Italian, Portuguese, Dutch, Danish, Swedish, Norwegian, Finnish
Eastern European        Polish, Czech, Slovak, Hungarian, Romanian, Bulgarian, Croatian, Serbian, Ukrainian, Russian
East Asian              Japanese, Korean, Chinese (Simplified), Chinese (Traditional)
Middle Eastern          Arabic, Hebrew, Turkish
South/Southeast Asian   Hindi, Indonesian, Thai, Vietnamese
Other                   Armenian, Modern Greek

Model Selection

Model ID      VRAM      Multilingual Quality   Best For
qwen3.5:4b    ~3.5 GB   Good                   European languages, high throughput
qwen3.5:9b    ~7 GB     Very good              Most languages including CJK (recommended)
gemma4:e4b    ~8 GB     Excellent              Complex multilingual documents
qwen3.6:27b   ~18 GB    Excellent              Highest accuracy across all languages

The Qwen 3.5 family offers strong multilingual capabilities for its size. Use qwen3.5:9b for the best balance of accuracy and speed across all supported languages, and move to a larger model when accuracy matters more than throughput.
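
The table can also drive model choice at startup. The sketch below picks the largest listed model that fits a given VRAM budget. It is illustrative only: PickModel is a hypothetical helper, the VRAM figures are the approximate values from the table, and "largest that fits" is an assumed policy that ignores throughput and context-size overhead.

```csharp
using System;
using System.Linq;

// Hypothetical helper: return the largest model from the table that fits
// the available VRAM, or null when none fits.
static string PickModel(double availableVramGb)
{
    var models = new (string Id, double VramGb)[]
    {
        ("qwen3.5:4b", 3.5),
        ("qwen3.5:9b", 7.0),
        ("gemma4:e4b", 8.0),
        ("qwen3.6:27b", 18.0)
    };

    return models
        .Where(m => m.VramGb <= availableVramGb)
        .OrderByDescending(m => m.VramGb)
        .Select(m => m.Id)
        .FirstOrDefault();
}

Console.WriteLine(PickModel(12) ?? "no model fits");
```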


Common Issues

Problem                             Cause                                     Fix
Wrong language detected             Short text or mixed-language document     Provide a longer text sample; use DetectLanguage with specific language candidates
Extraction fails on CJK documents   Model too small for ideographic scripts   Use qwen3.5:9b or larger for Japanese, Chinese, Korean
Date/number format errors           Regional formatting not specified         Add Guidance with regional format hints
Translation quality low             Using a non-multilingual model            Switch to Qwen 3.5 family for best multilingual support
Summary in wrong language           TargetLanguage not set on summarizer      Set summarizer.TargetLanguage = Language.English explicitly
