Generate Fine-Tuning Datasets from Extraction Configurations
When you have a working TextExtraction configuration (schema, instructions, element definitions), you can generate ShareGPT-format training datasets from it. ExtractionTrainingDataset takes your extraction setup, pairs it with ground-truth samples, and exports conversation-style training data ready for LoRA fine-tuning. This lets you distill a large model's extraction behavior into a smaller, specialized model that runs faster and cheaper.
For background on fine-tuning concepts, see the Fine-Tuning glossary entry.
Why This Matters
Two production problems that extraction dataset generation solves:
- Cost reduction through distillation. You develop an extraction pipeline using a 14B model for accuracy. Once the pipeline is validated, you generate training data and fine-tune a 4B model to match the larger model's extractions, cutting inference cost and latency significantly.
- Domain specialization. General models struggle with industry-specific formats (medical records, legal contracts, financial filings). By collecting ground-truth samples from your domain and generating training data, you create a model that understands your document structure natively.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB |
Step 1: Create the Project
```bash
dotnet new console -n ExtractionTraining
cd ExtractionTraining
dotnet add package LM-Kit.NET
```
Step 2: Understand the Training Pipeline
The pipeline has three stages:
| Stage | Input | Output |
|---|---|---|
| Configure | TextExtraction with element schema | Extraction template |
| Collect | Document content + JSON ground truth | ChatTrainingSample objects |
| Export | Collected samples | ShareGPT JSON file for fine-tuning |
ExtractionTrainingDataset inherits from TrainingDataset, which stores ChatTrainingSample objects. Each sample is a conversation: system prompt (extraction instructions), user message (document content), and assistant message (ground truth JSON).
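Conceptually, the Collect stage turns each (document, ground truth) pair into a three-turn conversation record. A minimal sketch of that shape (shown in Python for brevity; `to_chat_sample` is a hypothetical helper, not the LM-Kit.NET API):

```python
def to_chat_sample(instructions: str, document: str, ground_truth_json: str) -> dict:
    """Pair one document with its ground truth as a three-turn conversation."""
    return {
        "messages": [
            {"role": "system", "content": instructions},          # extraction instructions
            {"role": "user", "content": document},                # raw document content
            {"role": "assistant", "content": ground_truth_json},  # expected JSON output
        ]
    }

sample = to_chat_sample(
    "Extract CompanyName, InvoiceNumber, InvoiceDate, TotalAmount, Currency as JSON.",
    "INVOICE\nAcme Corp\nInvoice #: INV-2025-001842",
    '{"CompanyName": "Acme Corp", "InvoiceNumber": "INV-2025-001842"}',
)
```

Every sample in the dataset follows this same system/user/assistant shape, which is what the ShareGPT export serializes.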
Step 3: Define the Extraction Schema
Set up a TextExtraction configuration that defines what fields to extract. This is the same configuration you use for actual extraction.
```csharp
using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Extraction.Training;
using LMKit.Finetuning;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");

// Define the extraction schema
var extractor = new TextExtraction(model);
extractor.Elements = new List<TextExtractionElement>
{
    new("CompanyName", ElementType.String, "Name of the company"),
    new("InvoiceNumber", ElementType.String, "Invoice reference number"),
    new("InvoiceDate", ElementType.Date, "Date the invoice was issued"),
    new("TotalAmount", ElementType.Double, "Total amount due"),
    new("Currency", ElementType.String, "Currency code (e.g., USD, EUR)"),
};
```
Step 4: Create the Training Dataset and Add Samples
Each sample pairs a document with its ground-truth JSON output.
```csharp
// Create training dataset from the extraction configuration
var dataset = new ExtractionTrainingDataset(extractor);

// Add training samples: document content + expected JSON output
dataset.AddSample(
    "INVOICE\nAcme Corp\nInvoice #: INV-2025-001842\nDate: January 15, 2025\nTotal: $4,250.00 USD",
    """
    {
        "CompanyName": "Acme Corp",
        "InvoiceNumber": "INV-2025-001842",
        "InvoiceDate": "2025-01-15",
        "TotalAmount": 4250.00,
        "Currency": "USD"
    }
    """);

dataset.AddSample(
    "Rechnung Nr. 2025-0042\nFirma: Deutsche Technik GmbH\nDatum: 03.02.2025\nGesamtbetrag: 1.890,50 EUR",
    """
    {
        "CompanyName": "Deutsche Technik GmbH",
        "InvoiceNumber": "2025-0042",
        "InvoiceDate": "2025-02-03",
        "TotalAmount": 1890.50,
        "Currency": "EUR"
    }
    """);

dataset.AddSample(
    "Facture N° FAC-2025-789\nSociété: Lyon Services SARL\nDate: 28/01/2025\nMontant TTC: 3 200,00 €",
    """
    {
        "CompanyName": "Lyon Services SARL",
        "InvoiceNumber": "FAC-2025-789",
        "InvoiceDate": "2025-01-28",
        "TotalAmount": 3200.00,
        "Currency": "EUR"
    }
    """);

Console.WriteLine($"Training samples collected: {dataset.Samples.Count}");
```
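Before exporting, it is worth sanity-checking each ground-truth string against the schema: it must parse as JSON, carry exactly the declared fields, and use compatible types (note how the German and French samples above normalize locale-specific dates and amounts to ISO dates and plain numbers). An illustrative check in Python (a hypothetical validator, not part of LM-Kit.NET):

```python
import json
from datetime import date

# Mirror of the schema from Step 3: field name -> expected Python type(s)
SCHEMA = {
    "CompanyName": str,
    "InvoiceNumber": str,
    "InvoiceDate": str,           # ISO yyyy-MM-dd string in the ground truth
    "TotalAmount": (int, float),
    "Currency": str,
}

def validate_ground_truth(ground_truth_json: str) -> list[str]:
    """Return a list of problems; an empty list means the sample looks sound."""
    try:
        data = json.loads(ground_truth_json)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    problems = []
    for field, expected in SCHEMA.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    if "InvoiceDate" in data:
        try:
            date.fromisoformat(data["InvoiceDate"])
        except (TypeError, ValueError):
            problems.append("InvoiceDate is not ISO yyyy-MM-dd")
    return problems

good = ('{"CompanyName": "Acme Corp", "InvoiceNumber": "INV-2025-001842", '
        '"InvoiceDate": "2025-01-15", "TotalAmount": 4250.00, "Currency": "USD"}')
```

Catching a malformed sample here is cheaper than discovering it after a fine-tuning run.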
Step 5: Enable Modality Augmentation
When EnableModalityAugmentation is set, the dataset generates additional training variants for different input modalities (text-only, vision, multimodal). This produces a more robust fine-tuned model.
```csharp
dataset.EnableModalityAugmentation = true;
```
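Conceptually, augmentation multiplies each base sample into per-modality variants. A rough Python illustration of the idea only (the exact variants LM-Kit.NET generates are internal to the library):

```python
def augment_modalities(sample: dict, modalities=("text", "vision", "multimodal")) -> list[dict]:
    """Produce one training variant per input modality from a single base sample."""
    return [{**sample, "modality": m} for m in modalities]

variants = augment_modalities({"document": "INVOICE\nAcme Corp", "truth": '{"Currency": "USD"}'})
```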
Step 6: Export as ShareGPT
The simplest export: call ExportAsSharegpt() on the dataset.
```csharp
dataset.ExportAsSharegpt(
    "invoice_extraction_dataset.json",
    overwrite: true,
    imagePrefix: "invoice");

Console.WriteLine("Dataset exported to invoice_extraction_dataset.json");
```
The exported JSON follows the ShareGPT format, a widely supported input format for LoRA fine-tuning tools:
```json
[
  {
    "id": "invoice001",
    "messages": [
      { "role": "system", "content": "Extract the following fields..." },
      { "role": "user", "content": "INVOICE\nAcme Corp..." },
      { "role": "assistant", "content": "{\"CompanyName\": \"Acme Corp\", ...}" }
    ]
  }
]
```
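After export, you can spot-check the file before handing it to a trainer. A Python sketch that validates the structure shown above (it assumes the messages/role/content shape of this example; some ShareGPT tools instead expect conversations/from/value keys):

```python
import json

def check_sharegpt(records: list[dict]) -> None:
    """Assert every record is a system/user/assistant conversation with JSON ground truth."""
    for rec in records:
        roles = [m["role"] for m in rec["messages"]]
        assert roles == ["system", "user", "assistant"], f"unexpected roles in {rec.get('id')}"
        # The assistant turn must itself parse as JSON (the ground-truth extraction)
        json.loads(rec["messages"][2]["content"])

records = [
    {
        "id": "invoice001",
        "messages": [
            {"role": "system", "content": "Extract the following fields..."},
            {"role": "user", "content": "INVOICE\nAcme Corp"},
            {"role": "assistant", "content": '{"CompanyName": "Acme Corp"}'},
        ],
    }
]
check_sharegpt(records)  # raises if any record is malformed
```

In practice you would load `records` with `json.load` from the exported file instead of defining them inline.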
Step 7: Use ShareGptExporter for Advanced Options
For finer control over the export process, use ShareGptExporter directly with DatasetBuilderOptions.
```csharp
using LMKit.Finetuning.Export;

var options = new DatasetBuilderOptions
{
    Overwrite = true,
    IndentedJson = true,
    ImagePrefix = "invoice",
    ImageFolderName = "images",
    ContinueOnError = true, // Skip invalid samples instead of failing
    RoleMappingPolicy = RoleMappingPolicy.Strict
};

var progress = new Progress<ExportProgress>(p =>
{
    Console.Write($"\rExporting: {p.Completed}/{p.Total} ({p.Percent:F0}%) ");
});

ExportResult result = ShareGptExporter.Export(
    dataset.Samples,
    "invoice_extraction_dataset_v2.json",
    options,
    progress);

Console.WriteLine();
Console.WriteLine($"Exported: {result.SamplesWritten} samples");
Console.WriteLine($"Skipped: {result.SkippedSamples} samples");
Console.WriteLine($"JSON: {result.JsonPath}");
Console.WriteLine($"Images: {result.ImagesFolder}");
```
DatasetBuilderOptions controls:
| Option | Default | Purpose |
|---|---|---|
| Overwrite | false | Overwrite existing output file |
| IndentedJson | true | Pretty-print the JSON |
| ImagePrefix | "sample" | Prefix for image file names |
| ImageFolderName | "images" | Subfolder for extracted images |
| ContinueOnError | false | Skip invalid samples instead of throwing |
| RoleMappingPolicy | Strict | How to handle non-standard message roles |
RoleMappingPolicy options:
| Policy | Behavior |
|---|---|
| Strict | Leaves roles unchanged |
| CoerceUnknownToUser | Maps unrecognized roles to "user" |
| DropUnknown | Drops messages with unrecognized roles |
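The three policies can be sketched as a pure function over a message list (illustrative Python based on the table above, not the exporter's internal code):

```python
KNOWN_ROLES = {"system", "user", "assistant"}

def apply_role_policy(messages: list[dict], policy: str) -> list[dict]:
    """Mimic the three RoleMappingPolicy behaviors on ShareGPT-style messages."""
    if policy == "Strict":
        return messages  # roles pass through unchanged
    if policy == "CoerceUnknownToUser":
        return [m if m["role"] in KNOWN_ROLES else {**m, "role": "user"} for m in messages]
    if policy == "DropUnknown":
        return [m for m in messages if m["role"] in KNOWN_ROLES]
    raise ValueError(f"unknown policy: {policy}")

msgs = [{"role": "tool", "content": "lookup result"}, {"role": "user", "content": "hi"}]
```

Coercing or dropping matters when samples originate from agent traces with tool or function roles that downstream trainers do not understand.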
Step 8: Use the Dataset for Fine-Tuning
The exported ShareGPT file is the input for LM-Kit.NET's fine-tuning pipeline.
```csharp
using LMKit.Finetuning;

// Load the base model to fine-tune
using LM baseModel = LM.LoadFromModelID("gemma3:4b");
var finetuning = new FineTuning(baseModel);

// Load the generated dataset
var trainingSamples = TrainingDataset.LoadSharegpt("invoice_extraction_dataset.json");

// Configure and start training (see the fine-tuning guide for details)
```
For the complete fine-tuning workflow, see Prepare Training Datasets for LoRA Fine-Tuning.
What to Read Next
- Extract Structured Data from Unstructured Text: basic extraction patterns
- Validate Extracted Entities with Built-In Format Validators: post-extraction validation
- Prepare Training Datasets for LoRA Fine-Tuning: full fine-tuning workflow
- Fine-Tuning: fine-tuning concepts