Prepare Training Datasets for LoRA Fine-Tuning
LoRA fine-tuning adapts a pre-trained model to your domain, but the quality of the result depends entirely on the training data. LM-Kit.NET provides a structured pipeline for building training datasets: construct conversations as ChatHistory objects, wrap them in ChatTrainingSample entries, export to ShareGPT JSON format for review or external tools, or load data directly into the LoraFinetuning engine. This tutorial covers all three ways of supplying training data: ShareGPT JSON export, direct ChatHistory loading, and plain-text files.
Why Dataset Preparation Matters
Two real-world problems that structured dataset preparation solves:
- Consistent training format across team members. When multiple people contribute training examples, using `ChatTrainingSample` and `ShareGptExporter` enforces a uniform schema. Every sample follows the same role structure and can include images for multimodal training.
- Iterative dataset refinement. Exporting to ShareGPT JSON lets you inspect, filter, and version-control your training data before committing to an expensive fine-tuning run. You can review samples, remove low-quality ones, and re-export without touching model code.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16 GB recommended for fine-tuning |
| VRAM | 8+ GB (for the base model during training) |
| Disk | Space for model + training data + output adapter |
Step 1: Create the Project
dotnet new console -n DatasetPrep
cd DatasetPrep
dotnet add package LM-Kit.NET
Step 2: Understand the Dataset Pipeline
┌───────────────────┐
│ ChatHistory │──── conversation turns
│ (User/Assistant) │ with role + content
└────────┬──────────┘
│
▼
┌────────────────────┐
│ ChatTrainingSample │──── wraps a ChatHistory
│ (+ modality) │ for training
└────────┬───────────┘
│
┌────┴────────────────┐
│ │
▼ ▼
┌──────────────┐ ┌────────────────┐
│ ShareGpt │ │ LoraFinetuning │
│ Exporter │ │ (direct load) │
│ (.json) │ │ │
└──────────────┘ └────────────────┘
| Class | Purpose |
|---|---|
| `ChatHistory` | Holds a sequence of role-tagged messages (system, user, assistant) |
| `ChatTrainingSample` | Wraps a ChatHistory for fine-tuning with a target modality |
| `TrainingDataset` | Collection of samples with convenience export |
| `ShareGptExporter` | Exports samples to ShareGPT JSON format |
| `DatasetBuilderOptions` | Controls export behavior (overwrite, image handling, error policy) |
| `LoraFinetuning` | Fine-tuning engine that accepts ChatHistory or text data directly |
Step 3: Build Training Samples from ChatHistory
using System.Text;
using LMKit.Finetuning;
using LMKit.Finetuning.Export;
using LMKit.Inference;
using LMKit.Model;
using LMKit.TextGeneration.Chat;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load a base model (needed for tokenization)
// ──────────────────────────────────────
Console.WriteLine("Loading base model...");
using LM model = LM.LoadFromModelID("qwen3:1.7b",
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Create training samples
// ──────────────────────────────────────
var dataset = new TrainingDataset();
// Sample 1: Customer support Q&A
var chat1 = new ChatHistory(model);
chat1.AddMessage(AuthorRole.System, "You are a customer support agent for Acme Corp.");
chat1.AddMessage(AuthorRole.User, "How do I reset my password?");
chat1.AddMessage(AuthorRole.Assistant,
"To reset your password, go to Settings > Account > Change Password. " +
"Enter your current password, then type your new password twice. " +
"Click Save to confirm the change.");
dataset.AddSample(new ChatTrainingSample(chat1));
// Sample 2: Multi-turn conversation
var chat2 = new ChatHistory(model);
chat2.AddMessage(AuthorRole.System, "You are a customer support agent for Acme Corp.");
chat2.AddMessage(AuthorRole.User, "What are your business hours?");
chat2.AddMessage(AuthorRole.Assistant,
"Our support team is available Monday through Friday, 9 AM to 6 PM Eastern Time.");
chat2.AddMessage(AuthorRole.User, "Do you have weekend support?");
chat2.AddMessage(AuthorRole.Assistant,
"We offer limited weekend support via email only. " +
"Emails received on weekends are answered by Monday noon.");
dataset.AddSample(new ChatTrainingSample(chat2));
// Sample 3: Technical Q&A
var chat3 = new ChatHistory(model);
chat3.AddMessage(AuthorRole.System, "You are a customer support agent for Acme Corp.");
chat3.AddMessage(AuthorRole.User, "My device won't turn on after the update.");
chat3.AddMessage(AuthorRole.Assistant,
"Try a hard reset: hold the power button for 10 seconds until the LED blinks red. " +
"If the device still doesn't respond, connect it to power and wait 30 minutes before retrying. " +
"Contact support if the issue persists.");
dataset.AddSample(new ChatTrainingSample(chat3));
Console.WriteLine($"Created {dataset.Samples.Count} training samples.\n");
// ──────────────────────────────────────
// 3. Export to ShareGPT JSON
// ──────────────────────────────────────
Console.WriteLine("Exporting to ShareGPT JSON...");
dataset.ExportAsSharegpt("training_data.json", overwrite: true);
Console.WriteLine(" Exported to training_data.json\n");
Console.WriteLine("Done. Review training_data.json to verify data quality.");
Step 4: Advanced Export with Options and Progress
For larger datasets, use ShareGptExporter directly for progress tracking and fine-grained control:
using System.Text;
using LMKit.Finetuning;
using LMKit.Finetuning.Export;
using LMKit.Inference;
using LMKit.Model;
using LMKit.TextGeneration.Chat;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine("Loading base model...");
using LM model = LM.LoadFromModelID("qwen3:1.7b",
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 1. Build a large dataset programmatically
// ──────────────────────────────────────
var samples = new List<ChatTrainingSample>();
// Example: generate samples from a CSV or database
string[,] qaData = {
{ "What is your return policy?", "Items can be returned within 30 days of purchase with receipt." },
{ "How do I track my order?", "Log into your account and visit Orders > Track Shipment." },
{ "Do you offer international shipping?", "Yes, we ship to 40+ countries. Rates vary by destination." },
{ "How do I cancel my subscription?", "Go to Account > Subscriptions > Cancel. Effective at billing cycle end." }
};
for (int i = 0; i < qaData.GetLength(0); i++)
{
var chat = new ChatHistory(model);
chat.AddMessage(AuthorRole.System, "You are a helpful customer support agent.");
chat.AddMessage(AuthorRole.User, qaData[i, 0]);
chat.AddMessage(AuthorRole.Assistant, qaData[i, 1]);
samples.Add(new ChatTrainingSample(chat, InferenceModality.Text));
}
Console.WriteLine($"Built {samples.Count} training samples.\n");
// ──────────────────────────────────────
// 2. Configure export options
// ──────────────────────────────────────
var options = new DatasetBuilderOptions
{
Overwrite = true,
IndentedJson = true,
ImagePrefix = "sample",
ImageFolderName = "images",
RoleMappingPolicy = RoleMappingPolicy.Strict,
ContinueOnError = false,
ExpectedCount = samples.Count
};
// ──────────────────────────────────────
// 3. Export with progress tracking
// ──────────────────────────────────────
var progress = new Progress<ExportProgress>(p =>
{
Console.Write($"\r Exporting: {p.Completed}/{p.Total} ({p.Percent:F0}%) ");
});
ExportResult result = await ShareGptExporter.ExportAsync(
samples,
"customer_support_dataset.json",
options,
progress);
Console.WriteLine($"\n\n Samples written: {result.SamplesWritten}");
Console.WriteLine($" JSON path: {result.JsonPath}");
Console.WriteLine($" Images folder: {result.ImagesFolder}");
Console.WriteLine($" Skipped: {result.SkippedSamples}");
Step 5: Load Training Data Directly into LoRA
For fine-tuning without an intermediate JSON file, load ChatHistory directly:
using System.Text;
using LMKit.Finetuning;
using LMKit.Model;
using LMKit.TextGeneration.Chat;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the model for fine-tuning
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:1.7b",
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
using var finetuning = new LoraFinetuning(model, FinetuningIntent.StylisticGuidance);
// ──────────────────────────────────────
// 2. Load training data from ChatHistory
// ──────────────────────────────────────
var trainingChat = new ChatHistory(model);
// Each user/assistant pair becomes a training example.
// Use BeginOfNewConversation to separate independent examples.
trainingChat.AddMessage(AuthorRole.System, "You are a concise technical writer.");
trainingChat.AddMessage(AuthorRole.User, "Explain what an API is.");
trainingChat.AddMessage(AuthorRole.Assistant,
"An API (Application Programming Interface) is a set of rules " +
"that lets software programs communicate with each other.");
trainingChat.AddMessage(AuthorRole.BeginOfNewConversation, "");
trainingChat.AddMessage(AuthorRole.System, "You are a concise technical writer.");
trainingChat.AddMessage(AuthorRole.User, "What is a REST API?");
trainingChat.AddMessage(AuthorRole.Assistant,
"A REST API uses HTTP methods (GET, POST, PUT, DELETE) to perform " +
"operations on resources identified by URLs.");
int sampleCount = finetuning.LoadTrainingDataFromChatHistory(trainingChat);
Console.WriteLine($"Loaded {sampleCount} training sample(s).");
Console.WriteLine($" Average length: {finetuning.SampleAvgLength} tokens");
Console.WriteLine($" Min length: {finetuning.SampleMinLength} tokens");
Console.WriteLine($" Max length: {finetuning.SampleMaxLength} tokens\n");
// ──────────────────────────────────────
// 3. Configure and run fine-tuning
// ──────────────────────────────────────
finetuning.Iterations = 64;
finetuning.BatchSize = 4;
finetuning.ContextSize = 256;
finetuning.UseGradientCheckpointing = true;
// LoRA hyperparameters
finetuning.LoraTrainingParameters.LoraRank = 8;
finetuning.LoraTrainingParameters.LoraAlpha = 8;
finetuning.LoraTrainingParameters.AdamAlpha = 1e-4f;
// Monitor progress
finetuning.FinetuningProgress += (sender, e) =>
{
Console.Write($"\r Iteration {e.Iterations}/{e.IterationCount} | " +
$"Loss: {e.Loss:F4} | Best: {e.BestLoss:F4} | " +
$"{e.Percentage:F0}% ");
// Optional: save checkpoints periodically
if (e.Iterations % 20 == 0 && e.Iterations > 0)
{
e.SaveLoraCheckpoint($"checkpoint_iter{e.Iterations}.gguf");
}
};
Console.WriteLine("Starting fine-tuning...\n");
finetuning.Finetune2Lora("my-adapter.gguf");
Console.WriteLine($"\n\nAdapter saved to my-adapter.gguf");
Step 6: Load Training Data from Text Files
For plain-text datasets, use LoadTrainingDataFromText with sample delimiters:
using LMKit.Finetuning;
using LMKit.Model;
using LM model = LM.LoadFromModelID("qwen3:1.7b");
using var finetuning = new LoraFinetuning(model);
// Load from a text file where each sample starts with <SFT>
int count = finetuning.LoadTrainingDataFromText(
"training_samples.txt",
sampleStart: "<SFT>");
Console.WriteLine($"Loaded {count} samples from text file.");
// Inspect a specific sample
TrainingSample sample = finetuning.GetSample(0);
Console.WriteLine($"Sample 0: {sample.Tokens.Count} tokens");
Console.WriteLine($"Content: {sample.Value}");
// Filter out samples that are too short or too long
int removed = finetuning.FilterSamplesBySize(minSize: 32, maxSize: 512);
Console.WriteLine($"Removed {removed} samples outside [32, 512] token range.");
Console.WriteLine($"Remaining: {finetuning.SampleCount} samples.");
The text file format uses delimiters to separate samples:
<SFT>
What is machine learning?
Machine learning is a subset of AI where systems learn patterns from data.
<SFT>
What is deep learning?
Deep learning uses neural networks with many layers to learn complex patterns.
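If you maintain your Q&A pairs elsewhere (a spreadsheet, a database), a few lines of standard C# can generate a file in this delimiter format. The sketch below uses only the .NET standard library; the pair data and the output filename are illustrative, not part of the tutorial's dataset:

```csharp
using System;
using System.IO;
using System.Text;

// Illustrative Q&A pairs; in practice these would come from your own source.
(string Question, string Answer)[] pairs =
{
    ("What is machine learning?",
     "Machine learning is a subset of AI where systems learn patterns from data."),
    ("What is deep learning?",
     "Deep learning uses neural networks with many layers to learn complex patterns."),
};

var sb = new StringBuilder();
foreach (var (question, answer) in pairs)
{
    // Each sample starts with the <SFT> delimiter, matching the format above.
    sb.AppendLine("<SFT>");
    sb.AppendLine(question);
    sb.AppendLine(answer);
}

File.WriteAllText("training_samples.txt", sb.ToString(), Encoding.UTF8);
Console.WriteLine($"Wrote {pairs.Length} samples to training_samples.txt");
```

The resulting file can then be passed to LoadTrainingDataFromText with `sampleStart: "<SFT>"` as shown in the snippet above.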
ShareGPT JSON Output Format
The exported JSON follows the ShareGPT schema, which is compatible with many fine-tuning frameworks:
[
{
"id": "sample001",
"images": [],
"messages": [
{ "role": "system", "content": "You are a helpful customer support agent." },
{ "role": "user", "content": "What is your return policy?" },
{ "role": "assistant", "content": "Items can be returned within 30 days..." }
]
},
{
"id": "sample002",
"images": ["images/sample002_1.png"],
"messages": [
{ "role": "user", "content": "What does this image show?" },
{ "role": "assistant", "content": "The image shows a product diagram..." }
]
}
]
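Because the export is plain JSON, you can sanity-check it with System.Text.Json before spending compute on a fine-tuning run. The following sketch validates an inline sample against the schema shown above; the role whitelist and the minimum-message check are illustrative heuristics, not LM-Kit.NET API behavior. Swap the inline string for `File.ReadAllText("training_data.json")` to check a real export:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Inline ShareGPT sample mirroring the schema above (illustrative).
string json = """
[
  {
    "id": "sample001",
    "images": [],
    "messages": [
      { "role": "system", "content": "You are a helpful customer support agent." },
      { "role": "user", "content": "What is your return policy?" },
      { "role": "assistant", "content": "Items can be returned within 30 days..." }
    ]
  }
]
""";

var allowedRoles = new HashSet<string> { "system", "user", "assistant" };
var problems = new List<string>();
int index = 0;

using (JsonDocument doc = JsonDocument.Parse(json))
{
    foreach (JsonElement sample in doc.RootElement.EnumerateArray())
    {
        // A usable sample needs at least a user turn and an assistant turn.
        JsonElement messages = sample.GetProperty("messages");
        if (messages.GetArrayLength() < 2)
            problems.Add($"sample {index}: fewer than 2 messages");

        foreach (JsonElement message in messages.EnumerateArray())
        {
            string? role = message.GetProperty("role").GetString();
            if (role is null || !allowedRoles.Contains(role))
                problems.Add($"sample {index}: unexpected role '{role}'");
        }
        index++;
    }
}

Console.WriteLine(problems.Count == 0
    ? $"OK: {index} sample(s) passed validation."
    : string.Join(Environment.NewLine, problems));
```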
Role Mapping Policies
When your data contains non-standard roles, RoleMappingPolicy controls how they are handled:
| Policy | Behavior | Use Case |
|---|---|---|
| `Strict` (default) | Roles are left as-is; export fails on unrecognized roles | Clean, validated data |
| `CoerceUnknownToUser` | Unknown roles are mapped to "user" | Data from external sources |
| `DropUnknown` | Messages with unknown roles are silently dropped | Noisy data with metadata messages |
using LMKit.Finetuning;
using LMKit.Finetuning.Export;

var options = new DatasetBuilderOptions
{
    RoleMappingPolicy = RoleMappingPolicy.CoerceUnknownToUser
};
// Pass these options to ShareGptExporter.ExportAsync, as shown in Step 4.
Dataset Quality Checklist
Before running fine-tuning, verify your dataset:
| Check | Why |
|---|---|
| Consistent system prompts | Varying system prompts confuse the model about its role |
| Balanced turn lengths | Extremely long or short assistant responses skew training |
| No duplicate samples | Duplicates cause overfitting to specific examples |
| Representative distribution | Include edge cases, not just common questions |
| Correct role ordering | System first, then alternating user/assistant |
Use FilterSamplesBySize to remove outliers:
finetuning.FilterSamplesBySize(minSize: 32, maxSize: 1024);
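The duplicate-sample check from the list above can also be automated before any ChatHistory objects are built. This is a minimal stdlib sketch, assuming a simple trim-and-lowercase normalization is good enough for your data; the pair data and the `Normalize` helper are illustrative:

```csharp
using System;
using System.Collections.Generic;

// Illustrative Q&A pairs; the second is a near-duplicate of the first.
var qaPairs = new List<(string Question, string Answer)>
{
    ("What is your return policy?", "Items can be returned within 30 days."),
    ("what is your return policy? ", "Items can be returned within 30 days."),
    ("How do I track my order?", "Visit Orders > Track Shipment."),
};

// Simple normalization heuristic: trim, lowercase, collapse whitespace.
static string Normalize(string s) =>
    string.Join(' ', s.Trim().ToLowerInvariant()
        .Split(' ', StringSplitOptions.RemoveEmptyEntries));

var seen = new HashSet<string>();
var deduped = new List<(string Question, string Answer)>();
foreach (var pair in qaPairs)
{
    string key = Normalize(pair.Question) + "\n" + Normalize(pair.Answer);
    if (seen.Add(key))        // Add returns false for an already-seen key
        deduped.Add(pair);
}

Console.WriteLine($"{qaPairs.Count} pairs -> {deduped.Count} after dedup");
```

The surviving pairs can then be turned into ChatTrainingSample entries as in Step 4.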
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| `ExportAsync` throws on first sample | Empty `ChatHistory` | Ensure each sample has at least one user + assistant message |
| Low quality after fine-tuning | Too few samples | Aim for 50+ diverse examples minimum |
| High loss that doesn't decrease | Learning rate too high | Reduce `AdamAlpha` (e.g., from 1e-3 to 1e-4) |
| Out of memory during training | Context size too large | Reduce `ContextSize` or `BatchSize` |
| Training stops early | `MaxNoImprovement` set too low | Increase it, or set to 0 to disable early stopping |
Next Steps
- Quantize a Model for Edge Deployment: compress your fine-tuned model.
- Build a Conversational Assistant with Memory: use your fine-tuned model in a chat app.
- Samples: Fine-Tuning: complete fine-tuning demo application.