Prepare Training Datasets for LoRA Fine-Tuning
LoRA fine-tuning adapts a pre-trained model to your domain, but the quality of the result depends entirely on the training data. LM-Kit.NET provides a structured pipeline for building training datasets: construct conversations as ChatHistory objects, wrap them in ChatTrainingSample entries, export to ShareGPT JSON format for review or external tools, or load data directly into the LoraFinetuning engine. This tutorial covers all three ways of supplying training data: ShareGPT JSON export, direct ChatHistory loading, and plain-text files.
Why Dataset Preparation Matters
Two real-world problems that structured dataset preparation solves:
- Consistent training format across team members. When multiple people contribute training examples, using `ChatTrainingSample` and `ShareGptExporter` enforces a uniform schema. Every sample follows the same role structure and can include images for multimodal training.
- Iterative dataset refinement. Exporting to ShareGPT JSON lets you inspect, filter, and version-control your training data before committing to an expensive fine-tuning run. You can review samples, remove low-quality ones, and re-export without touching model code.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16 GB recommended for fine-tuning |
| VRAM | 8+ GB (for the base model during training) |
| Disk | Space for model + training data + output adapter |
Step 1: Create the Project
dotnet new console -n DatasetPrep
cd DatasetPrep
dotnet add package LM-Kit.NET
Step 2: Understand the Dataset Pipeline
┌───────────────────┐
│ ChatHistory │──── conversation turns
│ (User/Assistant) │ with role + content
└────────┬──────────┘
│
▼
┌────────────────────┐
│ ChatTrainingSample │──── wraps a ChatHistory
│ (+ modality) │ for training
└────────┬───────────┘
│
┌────┴────────────────┐
│ │
▼ ▼
┌──────────────┐ ┌────────────────┐
│ ShareGpt │ │ LoraFinetuning │
│ Exporter │ │ (direct load) │
│ (.json) │ │ │
└──────────────┘ └────────────────┘
| Class | Purpose |
|---|---|
| `ChatHistory` | Holds a sequence of role-tagged messages (system, user, assistant) |
| `ChatTrainingSample` | Wraps a ChatHistory for fine-tuning with a target modality |
| `TrainingDataset` | Collection of samples with convenience export |
| `ShareGptExporter` | Exports samples to ShareGPT JSON format |
| `DatasetBuilderOptions` | Controls export behavior (overwrite, image handling, error policy) |
| `LoraFinetuning` | Fine-tuning engine that accepts ChatHistory or text data directly |
Step 3: Build Training Samples from ChatHistory
using System.Text;
using LMKit.Finetuning;
using LMKit.Finetuning.Export;
using LMKit.Inference;
using LMKit.Model;
using LMKit.TextGeneration.Chat;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load a base model (needed for tokenization)
// ──────────────────────────────────────
Console.WriteLine("Loading base model...");
using LM model = LM.LoadFromModelID("qwen3:1.7b",
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Create training samples
// ──────────────────────────────────────
var dataset = new TrainingDataset();
// Sample 1: Customer support Q&A
var chat1 = new ChatHistory(model);
chat1.AddMessage(AuthorRole.System, "You are a customer support agent for Acme Corp.");
chat1.AddMessage(AuthorRole.User, "How do I reset my password?");
chat1.AddMessage(AuthorRole.Assistant,
"To reset your password, go to Settings > Account > Change Password. " +
"Enter your current password, then type your new password twice. " +
"Click Save to confirm the change.");
dataset.AddSample(new ChatTrainingSample(chat1));
// Sample 2: Multi-turn conversation
var chat2 = new ChatHistory(model);
chat2.AddMessage(AuthorRole.System, "You are a customer support agent for Acme Corp.");
chat2.AddMessage(AuthorRole.User, "What are your business hours?");
chat2.AddMessage(AuthorRole.Assistant,
"Our support team is available Monday through Friday, 9 AM to 6 PM Eastern Time.");
chat2.AddMessage(AuthorRole.User, "Do you have weekend support?");
chat2.AddMessage(AuthorRole.Assistant,
"We offer limited weekend support via email only. " +
"Emails received on weekends are answered by Monday noon.");
dataset.AddSample(new ChatTrainingSample(chat2));
// Sample 3: Technical Q&A
var chat3 = new ChatHistory(model);
chat3.AddMessage(AuthorRole.System, "You are a customer support agent for Acme Corp.");
chat3.AddMessage(AuthorRole.User, "My device won't turn on after the update.");
chat3.AddMessage(AuthorRole.Assistant,
"Try a hard reset: hold the power button for 10 seconds until the LED blinks red. " +
"If the device still doesn't respond, connect it to power and wait 30 minutes before retrying. " +
"Contact support if the issue persists.");
dataset.AddSample(new ChatTrainingSample(chat3));
Console.WriteLine($"Created {dataset.Samples.Count} training samples.\n");
// ──────────────────────────────────────
// 3. Export to ShareGPT JSON
// ──────────────────────────────────────
Console.WriteLine("Exporting to ShareGPT JSON...");
dataset.ExportAsSharegpt("training_data.json", overwrite: true);
Console.WriteLine(" Exported to training_data.json\n");
Console.WriteLine("Done. Review training_data.json to verify data quality.");
Step 4: Advanced Export with Options and Progress
For larger datasets, use ShareGptExporter directly for progress tracking and fine-grained control:
using System.Text;
using LMKit.Finetuning;
using LMKit.Finetuning.Export;
using LMKit.Inference;
using LMKit.Model;
using LMKit.TextGeneration.Chat;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine("Loading base model...");
using LM model = LM.LoadFromModelID("qwen3:1.7b",
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 1. Build a large dataset programmatically
// ──────────────────────────────────────
var samples = new List<ChatTrainingSample>();
// Example: generate samples from a CSV or database
string[,] qaData = {
{ "What is your return policy?", "Items can be returned within 30 days of purchase with receipt." },
{ "How do I track my order?", "Log into your account and visit Orders > Track Shipment." },
{ "Do you offer international shipping?", "Yes, we ship to 40+ countries. Rates vary by destination." },
{ "How do I cancel my subscription?", "Go to Account > Subscriptions > Cancel. Effective at billing cycle end." }
};
for (int i = 0; i < qaData.GetLength(0); i++)
{
var chat = new ChatHistory(model);
chat.AddMessage(AuthorRole.System, "You are a helpful customer support agent.");
chat.AddMessage(AuthorRole.User, qaData[i, 0]);
chat.AddMessage(AuthorRole.Assistant, qaData[i, 1]);
samples.Add(new ChatTrainingSample(chat, InferenceModality.Text));
}
Console.WriteLine($"Built {samples.Count} training samples.\n");
// ──────────────────────────────────────
// 2. Configure export options
// ──────────────────────────────────────
var options = new DatasetBuilderOptions
{
Overwrite = true,
IndentedJson = true,
ImagePrefix = "sample",
ImageFolderName = "images",
RoleMappingPolicy = RoleMappingPolicy.Strict,
ContinueOnError = false,
ExpectedCount = samples.Count
};
// ──────────────────────────────────────
// 3. Export with progress tracking
// ──────────────────────────────────────
var progress = new Progress<ExportProgress>(p =>
{
Console.Write($"\r Exporting: {p.Completed}/{p.Total} ({p.Percent:F0}%) ");
});
ExportResult result = await ShareGptExporter.ExportAsync(
samples,
"customer_support_dataset.json",
options,
progress);
Console.WriteLine($"\n\n Samples written: {result.SamplesWritten}");
Console.WriteLine($" JSON path: {result.JsonPath}");
Console.WriteLine($" Images folder: {result.ImagesFolder}");
Console.WriteLine($" Skipped: {result.SkippedSamples}");
Step 5: Load Training Data Directly into LoRA
For fine-tuning without an intermediate JSON file, load ChatHistory directly:
using System.Text;
using LMKit.Finetuning;
using LMKit.Model;
using LMKit.TextGeneration.Chat;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the model for fine-tuning
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:1.7b",
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
using var finetuning = new LoraFinetuning(model, FinetuningIntent.StylisticGuidance);
// ──────────────────────────────────────
// 2. Load training data from ChatHistory
// ──────────────────────────────────────
var trainingChat = new ChatHistory(model);
// Each user/assistant pair becomes a training example.
// Use BeginOfNewConversation to separate independent examples.
trainingChat.AddMessage(AuthorRole.System, "You are a concise technical writer.");
trainingChat.AddMessage(AuthorRole.User, "Explain what an API is.");
trainingChat.AddMessage(AuthorRole.Assistant,
"An API (Application Programming Interface) is a set of rules " +
"that lets software programs communicate with each other.");
trainingChat.AddMessage(AuthorRole.BeginOfNewConversation, "");
trainingChat.AddMessage(AuthorRole.System, "You are a concise technical writer.");
trainingChat.AddMessage(AuthorRole.User, "What is a REST API?");
trainingChat.AddMessage(AuthorRole.Assistant,
"A REST API uses HTTP methods (GET, POST, PUT, DELETE) to perform " +
"operations on resources identified by URLs.");
int sampleCount = finetuning.LoadTrainingDataFromChatHistory(trainingChat);
Console.WriteLine($"Loaded {sampleCount} training sample(s).");
Console.WriteLine($" Average length: {finetuning.SampleAvgLength} tokens");
Console.WriteLine($" Min length: {finetuning.SampleMinLength} tokens");
Console.WriteLine($" Max length: {finetuning.SampleMaxLength} tokens\n");
// ──────────────────────────────────────
// 3. Configure and run fine-tuning
// ──────────────────────────────────────
finetuning.Iterations = 64;
finetuning.BatchSize = 4;
finetuning.ContextSize = 256;
finetuning.UseGradientCheckpointing = true;
// LoRA hyperparameters
finetuning.LoraTrainingParameters.LoraRank = 8;
finetuning.LoraTrainingParameters.LoraAlpha = 8;
finetuning.LoraTrainingParameters.AdamAlpha = 1e-4f;
// Monitor progress
finetuning.FinetuningProgress += (sender, e) =>
{
Console.Write($"\r Iteration {e.Iterations}/{e.IterationCount} | " +
$"Loss: {e.Loss:F4} | Best: {e.BestLoss:F4} | " +
$"{e.Percentage:F0}% ");
// Optional: save checkpoints periodically
if (e.Iterations % 20 == 0 && e.Iterations > 0)
{
e.SaveLoraCheckpoint($"checkpoint_iter{e.Iterations}.gguf");
}
};
Console.WriteLine("Starting fine-tuning...\n");
finetuning.Finetune2Lora("my-adapter.gguf");
Console.WriteLine($"\n\nAdapter saved to my-adapter.gguf");
Step 6: Load Training Data from Text Files
For plain-text datasets, use LoadTrainingDataFromText with sample delimiters:
using LMKit.Finetuning;
using LMKit.Model;
using LM model = LM.LoadFromModelID("qwen3:1.7b");
using var finetuning = new LoraFinetuning(model);
// Load from a text file where each sample starts with <SFT>
int count = finetuning.LoadTrainingDataFromText(
"training_samples.txt",
sampleStart: "<SFT>");
Console.WriteLine($"Loaded {count} samples from text file.");
// Inspect a specific sample
TrainingSample sample = finetuning.GetSample(0);
Console.WriteLine($"Sample 0: {sample.Tokens.Count} tokens");
Console.WriteLine($"Content: {sample.Value}");
// Filter out samples that are too short or too long
int removed = finetuning.FilterSamplesBySize(minSize: 32, maxSize: 512);
Console.WriteLine($"Removed {removed} samples outside [32, 512] token range.");
Console.WriteLine($"Remaining: {finetuning.SampleCount} samples.");
The text file format uses delimiters to separate samples:
<SFT>
What is machine learning?
Machine learning is a subset of AI where systems learn patterns from data.
<SFT>
What is deep learning?
Deep learning uses neural networks with many layers to learn complex patterns.
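If you maintain your Q&A pairs elsewhere (a spreadsheet, a database), a few lines of standard C# can generate a file in this delimiter format. The sketch below uses only the .NET standard library; the pair data and the output filename are illustrative, not part of the tutorial's dataset:

```csharp
using System;
using System.IO;
using System.Text;

// Illustrative Q&A pairs; in practice these would come from your own source.
(string Question, string Answer)[] pairs =
{
    ("What is machine learning?",
     "Machine learning is a subset of AI where systems learn patterns from data."),
    ("What is deep learning?",
     "Deep learning uses neural networks with many layers to learn complex patterns."),
};

var sb = new StringBuilder();
foreach (var (question, answer) in pairs)
{
    // Each sample starts with the <SFT> delimiter, matching the format above.
    sb.AppendLine("<SFT>");
    sb.AppendLine(question);
    sb.AppendLine(answer);
}

File.WriteAllText("training_samples.txt", sb.ToString(), Encoding.UTF8);
Console.WriteLine($"Wrote {pairs.Length} samples to training_samples.txt");
```

The resulting file can then be passed to LoadTrainingDataFromText with `sampleStart: "<SFT>"` as shown in the snippet above.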
ShareGPT JSON Output Format
The exported JSON follows the ShareGPT schema, which is compatible with many fine-tuning frameworks:
[
{
"id": "sample001",
"images": [],
"messages": [
{ "role": "system", "content": "You are a helpful customer support agent." },
{ "role": "user", "content": "What is your return policy?" },
{ "role": "assistant", "content": "Items can be returned within 30 days..." }
]
},
{
"id": "sample002",
"images": ["images/sample002_1.png"],
"messages": [
{ "role": "user", "content": "What does this image show?" },
{ "role": "assistant", "content": "The image shows a product diagram..." }
]
}
]
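Because the export is plain JSON, you can sanity-check it with System.Text.Json before spending compute on a fine-tuning run. The following sketch validates an inline sample against the schema shown above; the role whitelist and the minimum-message check are illustrative heuristics, not LM-Kit.NET API behavior. Swap the inline string for `File.ReadAllText("training_data.json")` to check a real export:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Inline ShareGPT sample mirroring the schema above (illustrative).
string json = """
[
  {
    "id": "sample001",
    "images": [],
    "messages": [
      { "role": "system", "content": "You are a helpful customer support agent." },
      { "role": "user", "content": "What is your return policy?" },
      { "role": "assistant", "content": "Items can be returned within 30 days..." }
    ]
  }
]
""";

var allowedRoles = new HashSet<string> { "system", "user", "assistant" };
var problems = new List<string>();
int index = 0;

using (JsonDocument doc = JsonDocument.Parse(json))
{
    foreach (JsonElement sample in doc.RootElement.EnumerateArray())
    {
        // A usable sample needs at least a user turn and an assistant turn.
        JsonElement messages = sample.GetProperty("messages");
        if (messages.GetArrayLength() < 2)
            problems.Add($"sample {index}: fewer than 2 messages");

        foreach (JsonElement message in messages.EnumerateArray())
        {
            string? role = message.GetProperty("role").GetString();
            if (role is null || !allowedRoles.Contains(role))
                problems.Add($"sample {index}: unexpected role '{role}'");
        }
        index++;
    }
}

Console.WriteLine(problems.Count == 0
    ? $"OK: {index} sample(s) passed validation."
    : string.Join(Environment.NewLine, problems));
```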
Role Mapping Policies
When your data contains non-standard roles, RoleMappingPolicy controls how they are handled:
| Policy | Behavior | Use Case |
|---|---|---|
| `Strict` (default) | Roles are left as-is; export fails on unrecognized roles | Clean, validated data |
| `CoerceUnknownToUser` | Unknown roles are mapped to "user" | Data from external sources |
| `DropUnknown` | Messages with unknown roles are silently dropped | Noisy data with metadata messages |
using LMKit.Finetuning;
using LMKit.Finetuning.Export;

var options = new DatasetBuilderOptions
{
    RoleMappingPolicy = RoleMappingPolicy.CoerceUnknownToUser
};
// Pass these options to ShareGptExporter.ExportAsync, as shown in Step 4.
Dataset Quality Checklist
Before running fine-tuning, verify your dataset:
| Check | Why |
|---|---|
| Consistent system prompts | Varying system prompts confuse the model about its role |
| Balanced turn lengths | Extremely long or short assistant responses skew training |
| No duplicate samples | Duplicates cause overfitting to specific examples |
| Representative distribution | Include edge cases, not just common questions |
| Correct role ordering | System first, then alternating user/assistant |
Use FilterSamplesBySize to remove outliers:
finetuning.FilterSamplesBySize(minSize: 32, maxSize: 1024);
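The duplicate-sample check from the list above can also be automated before any ChatHistory objects are built. This is a minimal stdlib sketch, assuming a simple trim-and-lowercase normalization is good enough for your data; the pair data and the `Normalize` helper are illustrative:

```csharp
using System;
using System.Collections.Generic;

// Illustrative Q&A pairs; the second is a near-duplicate of the first.
var qaPairs = new List<(string Question, string Answer)>
{
    ("What is your return policy?", "Items can be returned within 30 days."),
    ("what is your return policy? ", "Items can be returned within 30 days."),
    ("How do I track my order?", "Visit Orders > Track Shipment."),
};

// Simple normalization heuristic: trim, lowercase, collapse whitespace.
static string Normalize(string s) =>
    string.Join(' ', s.Trim().ToLowerInvariant()
        .Split(' ', StringSplitOptions.RemoveEmptyEntries));

var seen = new HashSet<string>();
var deduped = new List<(string Question, string Answer)>();
foreach (var pair in qaPairs)
{
    string key = Normalize(pair.Question) + "\n" + Normalize(pair.Answer);
    if (seen.Add(key))        // Add returns false for an already-seen key
        deduped.Add(pair);
}

Console.WriteLine($"{qaPairs.Count} pairs -> {deduped.Count} after dedup");
```

The surviving pairs can then be turned into ChatTrainingSample entries as in Step 4.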
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| `ExportAsync` throws on first sample | Empty `ChatHistory` | Ensure each sample has at least one user + assistant message |
| Low quality after fine-tuning | Too few samples | Aim for 50+ diverse examples minimum |
| High loss that doesn't decrease | Learning rate too high | Reduce `AdamAlpha` (e.g., from 1e-3 to 1e-4) |
| Out of memory during training | Context size too large | Reduce `ContextSize` or `BatchSize` |
| Training stops early | `MaxNoImprovement` set too low | Increase it, or set to 0 to disable early stopping |
Next Steps
- Quantize a Model for Edge Deployment: compress your fine-tuned model.
- Build a Conversational Assistant with Memory: use your fine-tuned model in a chat app.
- Samples: Fine-Tuning: complete fine-tuning demo application.