Quantize a Model for Edge Deployment

Large language models consume significant memory: a 7B parameter model at FP16 requires ~14 GB of VRAM. Quantization reduces model precision from 16-bit floating point to 4-bit or 8-bit integers, cutting memory usage by 50-75% with minimal quality loss. LM-Kit.NET's Quantizer class converts models between precision levels, enabling deployment on laptops, edge devices, and consumer GPUs. This tutorial builds a quantization pipeline that converts models and compares output quality.
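
To see where those numbers come from, here is a back-of-the-envelope estimate. The 4.5 bits-per-weight figure used for Q4_K_M is an approximation, and real files also carry scale metadata, so treat the results as ballpark values:

// Rough memory estimate: parameter count x bits per weight / 8 bits per byte.
double EstimateGb(double parameters, double bitsPerWeight) =>
    parameters * bitsPerWeight / 8 / 1e9;

Console.WriteLine($"7B @ FP16   : {EstimateGb(7e9, 16):F1} GB");  // ~14 GB
Console.WriteLine($"7B @ Q8_0   : {EstimateGb(7e9, 8):F1} GB");   // ~7 GB
Console.WriteLine($"7B @ Q4_K_M : {EstimateGb(7e9, 4.5):F1} GB"); // ~3.9 GB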


Why Quantization Matters

Two deployment problems that quantization solves:

  1. Run larger models on smaller hardware. A 12B model quantized to Q4_K_M fits in 8 GB VRAM instead of 24 GB. This opens deployment on consumer GPUs, laptops, and even some mobile devices, bringing powerful models to hardware that could not run them at full precision.
  2. Faster inference. Lower precision means smaller tensors, so memory transfers and computation are both faster. Quantized models typically run 1.5-2x faster than their FP16 counterparts on the same hardware.

Prerequisites

| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16+ GB (quantization runs on CPU) |
| Disk | 2x the model size free, for source + output |

Step 1: Create the Project

dotnet new console -n QuantizeQuickstart
cd QuantizeQuickstart
dotnet add package LM-Kit.NET

Step 2: Basic Model Quantization

using System.Text;
using LMKit.Quantization;
using LMKit.Model;
using System.Diagnostics;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// Quantize a model to Q4_K_M
// ──────────────────────────────────────
string sourceModel = "models/gemma-3-4b-it-F16.lmk";
string outputModel = "models/gemma-3-4b-it-Q4_K_M.lmk";

Console.WriteLine($"Source: {sourceModel}");
Console.WriteLine($"Output: {outputModel}");
Console.WriteLine($"Target: Q4_K_M (4-bit, medium quality)\n");

var quantizer = new Quantizer(sourceModel)
{
    ThreadCount = Environment.ProcessorCount
};

Console.WriteLine("Quantizing...");
var stopwatch = Stopwatch.StartNew();

quantizer.Quantize(outputModel, Precision.MOSTLY_Q4_K_M);

stopwatch.Stop();

long sourceSize = new FileInfo(sourceModel).Length;
long outputSize = new FileInfo(outputModel).Length;
double ratio = (double)outputSize / sourceSize * 100;

Console.WriteLine($"\nDone in {stopwatch.Elapsed.TotalSeconds:F0}s");
Console.WriteLine($"Source size: {sourceSize / (1024.0 * 1024):F0} MB");
Console.WriteLine($"Output size: {outputSize / (1024.0 * 1024):F0} MB ({ratio:F0}% of original)");
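
As an optional sanity check, load the freshly quantized file before deploying it. This is a minimal sketch that reuses the LM loading pattern from Step 5 (LMKit.Model is already imported at the top of this program):

// Confirm the quantized output is a valid, loadable model.
using LM quantized = new LM(new Uri($"file://{Path.GetFullPath(outputModel)}"),
    loadingProgress: p => { Console.Write($"\rVerifying: {p * 100:F0}%   "); return true; });
Console.WriteLine("\nQuantized model loaded successfully.");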

Step 3: Understand Precision Levels

Choose the right quantization level for your hardware and quality requirements:

| Precision | Bits | Size Reduction | Quality | Use Case |
|---|---|---|---|---|
| MOSTLY_Q8_0 | 8-bit | ~50% | Excellent | High-quality deployment with moderate savings |
| MOSTLY_Q6_K | 6-bit | ~60% | Very good | Best balance of quality and size |
| MOSTLY_Q5_K_M | 5-bit | ~65% | Very good | Good quality, significant savings |
| MOSTLY_Q4_K_M | 4-bit | ~70% | Good | Most popular choice for consumer GPUs |
| MOSTLY_Q4_K_S | 4-bit | ~72% | Good | Slightly smaller than Q4_K_M |
| MOSTLY_Q3_K_M | 3-bit | ~78% | Acceptable | Tight memory constraints |
| MOSTLY_Q2_K | 2-bit | ~85% | Low | Extreme compression, noticeable quality loss |

Rule of thumb: Start with Q4_K_M. If quality is insufficient, move up to Q5_K_M or Q6_K. If you need to squeeze into tighter memory, try Q3_K_M.
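
To encode that rule of thumb in code, a small helper can map a VRAM budget to one of the Precision values above. The thresholds below are illustrative for a model around the 4B scale used in this tutorial, not measured limits; benchmark on your own hardware:

// Hypothetical helper: map a rough VRAM budget (in GB) to a precision level.
// Thresholds are illustrative examples, not measured values.
Precision ChoosePrecision(double vramGb) => vramGb switch
{
    >= 8 => Precision.MOSTLY_Q8_0,    // plenty of headroom, favor quality
    >= 6 => Precision.MOSTLY_Q6_K,
    >= 5 => Precision.MOSTLY_Q5_K_M,
    >= 4 => Precision.MOSTLY_Q4_K_M,  // recommended starting point
    _    => Precision.MOSTLY_Q3_K_M   // tight memory, accept some quality loss
};

Console.WriteLine(ChoosePrecision(6.0)); // MOSTLY_Q6_K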


Step 4: Quantize to Multiple Levels

Generate several variants and compare:

using LMKit.Model;
using LMKit.Quantization;

string sourceModel = "models/gemma-3-4b-it-F16.lmk";

var precisions = new[]
{
    (Precision.MOSTLY_Q8_0,  "Q8_0"),
    (Precision.MOSTLY_Q6_K,  "Q6_K"),
    (Precision.MOSTLY_Q5_K_M, "Q5_K_M"),
    (Precision.MOSTLY_Q4_K_M, "Q4_K_M"),
    (Precision.MOSTLY_Q3_K_M, "Q3_K_M")
};

var quantizer = new Quantizer(sourceModel)
{
    ThreadCount = Environment.ProcessorCount
};

long sourceSize = new FileInfo(sourceModel).Length;

Console.WriteLine($"Source: {sourceSize / (1024.0 * 1024):F0} MB\n");
Console.WriteLine($"{"Level",-10} {"Size (MB)",-12} {"Reduction",-12} {"Time"}");
Console.WriteLine(new string('-', 50));

foreach (var (precision, label) in precisions)
{
    string outputPath = $"models/gemma-3-4b-it-{label}.lmk";

    var sw = System.Diagnostics.Stopwatch.StartNew();
    quantizer.Quantize(outputPath, precision);
    sw.Stop();

    long outSize = new FileInfo(outputPath).Length;
    double reduction = (1.0 - (double)outSize / sourceSize) * 100;

    Console.WriteLine($"{label,-10} {outSize / (1024.0 * 1024),-12:F0} {reduction,-12:F0}% {sw.Elapsed.TotalSeconds:F0}s");
}
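
As a follow-up, the sketch below (meant to run directly after the loop above, so the variant files and the precisions array are in scope) picks the highest-quality variant that still fits a size budget; the 5 GB budget is an arbitrary example:

// precisions is ordered from highest quality to smallest file, so the first
// variant under the budget is the best quality that fits.
const long budgetBytes = 5L * 1024 * 1024 * 1024; // example 5 GB budget

foreach (var (_, label) in precisions)
{
    string candidate = $"models/gemma-3-4b-it-{label}.lmk";
    long size = new FileInfo(candidate).Length;

    if (size <= budgetBytes)
    {
        Console.WriteLine($"Best fit under budget: {label} ({size / (1024.0 * 1024):F0} MB)");
        break;
    }
}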

Step 5: Quality Comparison

Test the quantized model against the original to verify output quality:

using LMKit.Model;
using LMKit.TextGeneration;

string[] testPrompts =
{
    "Explain the difference between TCP and UDP in one paragraph.",
    "Write a function in C# that checks if a string is a palindrome.",
    "What are the three main types of machine learning?"
};

string[] modelPaths =
{
    "models/gemma-3-4b-it-F16.lmk",
    "models/gemma-3-4b-it-Q4_K_M.lmk"
};

foreach (string modelPath in modelPaths)
{
    string modelName = Path.GetFileNameWithoutExtension(modelPath);
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($"\n=== {modelName} ===\n");
    Console.ResetColor();

    using LM testModel = new LM(new Uri($"file://{Path.GetFullPath(modelPath)}"),
        loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
    Console.WriteLine();

    var chat = new SingleTurnConversation(testModel)
    {
        MaximumCompletionTokens = 256
    };

    foreach (string prompt in testPrompts)
    {
        Console.ForegroundColor = ConsoleColor.Green;
        Console.WriteLine($"  Q: {prompt}");
        Console.ResetColor();

        TextGenerationResult result = chat.Submit(prompt);
        string preview = result.Completion.Length > 120
            ? result.Completion.Substring(0, 120) + "..."
            : result.Completion;
        Console.WriteLine($"  A: {preview}");
        Console.ForegroundColor = ConsoleColor.DarkGray;
        Console.WriteLine($"     [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
        Console.ResetColor();
    }
}
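
The previews above are an eyeball check. For a rough number instead, one simple heuristic is word-level overlap between the FP16 answer and the quantized answer to the same prompt. This is only a sketch, not a real quality benchmark (perplexity or task-level evaluations are better), and fp16Answer / q4Answer are hypothetical placeholders for completions collected in the loop above:

// Rough heuristic: Jaccard overlap of the word sets of two completions.
// A very low score flags answers that diverged heavily after quantization.
double WordOverlap(string a, string b)
{
    var wordsA = a.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries).ToHashSet();
    var wordsB = b.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries).ToHashSet();
    if (wordsA.Count == 0 && wordsB.Count == 0) return 1.0;
    return (double)wordsA.Intersect(wordsB).Count() / wordsA.Union(wordsB).Count();
}

// Example usage:
// Console.WriteLine($"Overlap: {WordOverlap(fp16Answer, q4Answer):P0}");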

Step 6: Quantize with Custom Thread Count

Control CPU usage during quantization to leave resources for other processes:

using LMKit.Model;
using LMKit.Quantization;

string sourceModel = "models/gemma-3-4b-it-F16.lmk";

// Use all cores for fastest quantization
var fastQuantizer = new Quantizer(sourceModel)
{
    ThreadCount = Environment.ProcessorCount
};

// Use half the cores to leave headroom
var balancedQuantizer = new Quantizer(sourceModel)
{
    ThreadCount = Environment.ProcessorCount / 2
};

// Single-threaded for minimal system impact
var gentleQuantizer = new Quantizer(sourceModel)
{
    ThreadCount = 1
};
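
The Quantize calls in the snippets above run synchronously, so in a desktop app or service you may want to move the work off the calling thread. A minimal sketch, assuming the single-threaded gentleQuantizer defined above and an example output path:

// Run the quantizer on a background task so the calling thread stays responsive.
string backgroundOutput = "models/gemma-3-4b-it-Q4_K_M.lmk"; // example target path

Task quantizeTask = Task.Run(() => gentleQuantizer.Quantize(backgroundOutput, Precision.MOSTLY_Q4_K_M));

Console.WriteLine("Quantizing in the background...");
await quantizeTask;
Console.WriteLine("Background quantization finished.");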

Common Issues

| Problem | Cause | Fix |
|---|---|---|
| Out of memory during quantization | Source model too large for RAM | Quantization runs on CPU; ensure system RAM of at least 2x the model size |
| Output quality too low | Precision too aggressive | Move up one level (Q3 to Q4, Q4 to Q5) |
| "Quantization failed" error | Unsupported source format or corrupted model | Verify the source model loads correctly; re-download if corrupted |
| File not found | Incorrect model path | Use a full absolute path, or verify the relative path from the working directory |
| Slow quantization | Low thread count | Increase ThreadCount up to Environment.ProcessorCount |
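
Several of these issues can be caught up front. The sketch below (assuming the LMKit.Quantization namespace is imported as in the earlier steps) verifies the source path, warns when less than 2x the model size is free on disk, and surfaces quantization failures with a readable message:

// Defensive pre-flight checks before quantizing.
string source = "models/gemma-3-4b-it-F16.lmk";
string output = "models/gemma-3-4b-it-Q4_K_M.lmk";

if (!File.Exists(source))
    throw new FileNotFoundException($"Source model not found: {Path.GetFullPath(source)}");

long modelBytes = new FileInfo(source).Length;
var drive = new DriveInfo(Path.GetPathRoot(Path.GetFullPath(source))!);

if (drive.AvailableFreeSpace < modelBytes * 2)
    Console.WriteLine("Warning: less than 2x the model size is free on this drive.");

try
{
    new Quantizer(source).Quantize(output, Precision.MOSTLY_Q4_K_M);
}
catch (Exception ex)
{
    Console.WriteLine($"Quantization failed: {ex.Message}");
}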

Next Steps