Generate Fine-Tuning Datasets from Extraction Configurations

When you have a working TextExtraction configuration (schema, instructions, element definitions), you can generate ShareGPT-format training datasets from it. ExtractionTrainingDataset takes your extraction setup, pairs it with ground-truth samples, and exports conversation-style training data ready for LoRA fine-tuning. This lets you distill a large model's extraction behavior into a smaller, specialized model that runs faster and cheaper.

For background on fine-tuning concepts, see the Fine-Tuning glossary entry.


Why This Matters

Two production problems that extraction dataset generation solves:

  1. Cost reduction through distillation. You develop an extraction pipeline using a 14B model for accuracy. Once the pipeline is validated, you generate training data and fine-tune a 4B model to match the larger model's extractions, cutting inference cost and latency significantly.
  2. Domain specialization. General models struggle with industry-specific formats (medical records, legal contracts, financial filings). By collecting ground-truth samples from your domain and generating training data, you create a model that understands your document structure natively.

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
VRAM          4+ GB

Step 1: Create the Project

dotnet new console -n ExtractionTraining
cd ExtractionTraining
dotnet add package LM-Kit.NET

Step 2: Understand the Training Pipeline

The pipeline has three stages:

Stage       Input                                   Output
Configure   TextExtraction with element schema      Extraction template
Collect     Document content + JSON ground truth    ChatTrainingSample objects
Export      Collected samples                       ShareGPT JSON file for fine-tuning

ExtractionTrainingDataset inherits from TrainingDataset, which stores ChatTrainingSample objects. Each sample is a conversation: system prompt (extraction instructions), user message (document content), and assistant message (ground truth JSON).
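
In code, the three stages map onto three calls, each covered in detail in the steps below (documentText and groundTruthJson are placeholders):

// Configure: reuse your TextExtraction schema (Step 3)
var dataset = new ExtractionTrainingDataset(extractor);

// Collect: pair document content with its ground-truth JSON (Step 4)
dataset.AddSample(documentText, groundTruthJson);

// Export: write a ShareGPT file for fine-tuning (Step 6)
dataset.ExportAsSharegpt("dataset.json", overwrite: true, imagePrefix: "sample");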


Step 3: Define the Extraction Schema

Set up a TextExtraction configuration that defines what fields to extract. This is the same configuration you use for actual extraction.

using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Extraction.Training;
using LMKit.Finetuning;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// Define the extraction schema
var extractor = new TextExtraction(model);
extractor.Elements = new List<TextExtractionElement>
{
    new("CompanyName", ElementType.String, "Name of the company"),
    new("InvoiceNumber", ElementType.String, "Invoice reference number"),
    new("InvoiceDate", ElementType.Date, "Date the invoice was issued"),
    new("TotalAmount", ElementType.Double, "Total amount due"),
    new("Currency", ElementType.String, "Currency code (e.g., USD, EUR)"),
};
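
The same configuration drives actual extraction, so you can validate the schema on a few documents before collecting training data. The sketch below shows the idea; SetContent, Parse, and the result's Json property are assumed member names, so check the TextExtraction API reference for the exact signatures.

// Sketch only: run the extractor on a sample document with the schema above.
// SetContent, Parse, and Json are assumptions; verify against the TextExtraction reference.
extractor.SetContent("INVOICE\nAcme Corp\nInvoice #: INV-2025-001842\nDate: January 15, 2025\nTotal: $4,250.00 USD");
var extractionResult = extractor.Parse();
Console.WriteLine(extractionResult.Json);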

Step 4: Create the Training Dataset and Add Samples

Each sample pairs a document with its ground-truth JSON output.

// Create training dataset from the extraction configuration
var dataset = new ExtractionTrainingDataset(extractor);

// Add training samples: document content + expected JSON output
dataset.AddSample(
    "INVOICE\nAcme Corp\nInvoice #: INV-2025-001842\nDate: January 15, 2025\nTotal: $4,250.00 USD",
    """
    {
        "CompanyName": "Acme Corp",
        "InvoiceNumber": "INV-2025-001842",
        "InvoiceDate": "2025-01-15",
        "TotalAmount": 4250.00,
        "Currency": "USD"
    }
    """);

dataset.AddSample(
    "Rechnung Nr. 2025-0042\nFirma: Deutsche Technik GmbH\nDatum: 03.02.2025\nGesamtbetrag: 1.890,50 EUR",
    """
    {
        "CompanyName": "Deutsche Technik GmbH",
        "InvoiceNumber": "2025-0042",
        "InvoiceDate": "2025-02-03",
        "TotalAmount": 1890.50,
        "Currency": "EUR"
    }
    """);

dataset.AddSample(
    "Facture N° FAC-2025-789\nSociété: Lyon Services SARL\nDate: 28/01/2025\nMontant TTC: 3 200,00 €",
    """
    {
        "CompanyName": "Lyon Services SARL",
        "InvoiceNumber": "FAC-2025-789",
        "InvoiceDate": "2025-01-28",
        "TotalAmount": 3200.00,
        "Currency": "EUR"
    }
    """);

Console.WriteLine($"Training samples collected: {dataset.Samples.Count}");

Step 5: Enable Modality Augmentation

When EnableModalityAugmentation is enabled, the dataset generates additional training variants for different input modalities (text-only, vision, multimodal), which helps the fine-tuned model stay robust across input types.

dataset.EnableModalityAugmentation = true;

Step 6: Export as ShareGPT

The simplest export: call ExportAsSharegpt() on the dataset.

dataset.ExportAsSharegpt(
    "invoice_extraction_dataset.json",
    overwrite: true,
    imagePrefix: "invoice");

Console.WriteLine("Dataset exported to invoice_extraction_dataset.json");

The exported JSON follows the ShareGPT format, which is the standard input for LoRA fine-tuning tools:

[
  {
    "id": "invoice001",
    "messages": [
      { "role": "system", "content": "Extract the following fields..." },
      { "role": "user", "content": "INVOICE\nAcme Corp..." },
      { "role": "assistant", "content": "{\"CompanyName\": \"Acme Corp\", ...}" }
    ]
  }
]
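
Before handing the file to a trainer, it can be worth a quick structural sanity check. A small sketch using System.Text.Json (add using System.Text.Json; at the top of the file), assuming the field names shown above ("id", "messages"):

// Sketch: count conversations and messages in the exported ShareGPT file.
using JsonDocument doc = JsonDocument.Parse(File.ReadAllText("invoice_extraction_dataset.json"));
Console.WriteLine($"Conversations: {doc.RootElement.GetArrayLength()}");

foreach (JsonElement conversation in doc.RootElement.EnumerateArray())
{
    int messageCount = conversation.GetProperty("messages").GetArrayLength();
    Console.WriteLine($"{conversation.GetProperty("id").GetString()}: {messageCount} messages");
}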

Step 7: Use ShareGptExporter for Advanced Options

For finer control over the export process, use ShareGptExporter directly with DatasetBuilderOptions.

using LMKit.Finetuning.Export;

var options = new DatasetBuilderOptions
{
    Overwrite = true,
    IndentedJson = true,
    ImagePrefix = "invoice",
    ImageFolderName = "images",
    ContinueOnError = true,  // Skip invalid samples instead of failing
    RoleMappingPolicy = RoleMappingPolicy.Strict
};

var progress = new Progress<ExportProgress>(p =>
{
    Console.Write($"\rExporting: {p.Completed}/{p.Total} ({p.Percent:F0}%)   ");
});

ExportResult result = ShareGptExporter.Export(
    dataset.Samples,
    "invoice_extraction_dataset_v2.json",
    options,
    progress);

Console.WriteLine();
Console.WriteLine($"Exported: {result.SamplesWritten} samples");
Console.WriteLine($"Skipped: {result.SkippedSamples} samples");
Console.WriteLine($"JSON: {result.JsonPath}");
Console.WriteLine($"Images: {result.ImagesFolder}");

DatasetBuilderOptions controls:

Option              Default    Purpose
Overwrite           false      Overwrite existing output file
IndentedJson        true       Pretty-print the JSON
ImagePrefix         "sample"   Prefix for image file names
ImageFolderName     "images"   Subfolder for extracted images
ContinueOnError     false      Skip invalid samples instead of throwing
RoleMappingPolicy   Strict     How to handle non-standard message roles

RoleMappingPolicy options:

Policy                Behavior
Strict                Leaves roles unchanged
CoerceUnknownToUser   Maps unrecognized roles to "user"
DropUnknown           Drops messages with unrecognized roles
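
For example, if your samples come from mixed sources and may contain non-standard roles, a more forgiving configuration could look like this (same types as in Step 7, different policy):

// Sketch: lenient export settings for mixed-source datasets.
var lenientOptions = new DatasetBuilderOptions
{
    Overwrite = true,
    ContinueOnError = true,                                     // keep going past malformed samples
    RoleMappingPolicy = RoleMappingPolicy.CoerceUnknownToUser   // fold unrecognized roles into "user"
};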

Step 8: Use the Dataset for Fine-Tuning

The exported ShareGPT file is the input for LM-Kit.NET's fine-tuning pipeline.

using LMKit.Finetuning;

// Load the base model to fine-tune
using LM baseModel = LM.LoadFromModelID("gemma3:4b");

var finetuning = new FineTuning(baseModel);

// Load the generated dataset
var trainingSamples = TrainingDataset.LoadSharegpt("invoice_extraction_dataset.json");

// Configure and start training (see the fine-tuning guide for details)

For the complete fine-tuning workflow, see Prepare Training Datasets for LoRA Fine-Tuning.

