Quantize a Model for Edge Deployment
Large language models consume significant memory: a 7B parameter model at FP16 requires ~14 GB of VRAM. Quantization reduces model precision from 16-bit floating point to 4-bit or 8-bit integers, cutting memory usage by 50-75% with minimal quality loss. LM-Kit.NET's Quantizer class converts models between precision levels, enabling deployment on laptops, edge devices, and consumer GPUs. This tutorial builds a quantization pipeline that converts models and compares output quality.
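As a rough sanity check on those numbers, weight memory is approximately parameter count times bits per weight divided by eight. The sketch below is plain C# arithmetic; the effective bits-per-weight figures are approximations, and runtime overhead such as the KV cache is ignored:

```csharp
// Back-of-the-envelope weight memory: parameters × bits per weight / 8.
// Effective bits per weight are approximate; KV cache and runtime overhead are ignored.
double parameters = 7e9; // a 7B model

foreach (var (label, bits) in new[] { ("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8) })
{
    double gb = parameters * bits / 8 / 1e9;
    Console.WriteLine($"{label,-7} ~ {gb:F1} GB of weights");
}
```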
Why Quantization Matters
Two deployment problems that quantization solves:
- Run larger models on smaller hardware. A 12B model quantized to Q4_K_M fits in 8 GB VRAM instead of 24 GB. This opens deployment on consumer GPUs, laptops, and even some mobile devices, bringing powerful models to hardware that could not run them at full precision.
- Faster inference. Lower precision means smaller tensors, so memory transfers and computation are faster. Quantized models typically run 1.5-2x faster than their FP16 counterparts on the same hardware.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16+ GB (quantization runs on CPU) |
| Disk | 2x model size free for source + output |
Step 1: Create the Project
dotnet new console -n QuantizeQuickstart
cd QuantizeQuickstart
dotnet add package LM-Kit.NET
Step 2: Basic Model Quantization
using System.Text;
using LMKit.Quantization;
using LMKit.Model;
using System.Diagnostics;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// Quantize a model to Q4_K_M
// ──────────────────────────────────────
string sourceModel = "models/gemma-3-4b-it-F16.lmk";
string outputModel = "models/gemma-3-4b-it-Q4_K_M.lmk";
Console.WriteLine($"Source: {sourceModel}");
Console.WriteLine($"Output: {outputModel}");
Console.WriteLine($"Target: Q4_K_M (4-bit, medium quality)\n");
var quantizer = new Quantizer(sourceModel)
{
ThreadCount = Environment.ProcessorCount
};
Console.WriteLine("Quantizing...");
var stopwatch = Stopwatch.StartNew();
quantizer.Quantize(outputModel, Precision.MOSTLY_Q4_K_M);
stopwatch.Stop();
long sourceSize = new FileInfo(sourceModel).Length;
long outputSize = new FileInfo(outputModel).Length;
double ratio = (double)outputSize / sourceSize * 100;
Console.WriteLine($"\nDone in {stopwatch.Elapsed.TotalSeconds:F0}s");
Console.WriteLine($"Source size: {sourceSize / (1024.0 * 1024):F0} MB");
Console.WriteLine($"Output size: {outputSize / (1024.0 * 1024):F0} MB ({ratio:F0}% of original)");
Step 3: Understand Precision Levels
Choose the right quantization level for your hardware and quality requirements:
| Precision | Bits | Size Reduction | Quality | Use Case |
|---|---|---|---|---|
| `MOSTLY_Q8_0` | 8-bit | ~50% | Excellent | High-quality deployment with moderate savings |
| `MOSTLY_Q6_K` | 6-bit | ~60% | Very good | Best balance of quality and size |
| `MOSTLY_Q5_K_M` | 5-bit | ~65% | Very good | Good quality, significant savings |
| `MOSTLY_Q4_K_M` | 4-bit | ~70% | Good | Most popular choice for consumer GPUs |
| `MOSTLY_Q4_K_S` | 4-bit | ~72% | Good | Slightly smaller than Q4_K_M |
| `MOSTLY_Q3_K_M` | 3-bit | ~78% | Acceptable | Tight memory constraints |
| `MOSTLY_Q2_K` | 2-bit | ~85% | Low | Extreme compression, noticeable quality loss |
Rule of thumb: Start with Q4_K_M. If quality is insufficient, move up to Q5_K_M or Q6_K. If you need to squeeze into tighter memory, try Q3_K_M.
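If you want to encode that rule of thumb, one option is a small helper that maps an approximate memory budget to a precision level. The thresholds below are illustrative assumptions for a model in the ~4B range, not LM-Kit recommendations:

```csharp
using LMKit.Quantization;

// Illustrative only: map an approximate VRAM budget (GB) to a precision level.
// Thresholds are rough assumptions for a ~4B model; adjust for your own models.
static Precision PickPrecision(double vramGb) => vramGb switch
{
    >= 8.0 => Precision.MOSTLY_Q8_0,   // plenty of headroom
    >= 6.0 => Precision.MOSTLY_Q6_K,
    >= 5.0 => Precision.MOSTLY_Q5_K_M,
    >= 4.0 => Precision.MOSTLY_Q4_K_M, // default starting point
    _      => Precision.MOSTLY_Q3_K_M  // tight memory constraints
};

Console.WriteLine(PickPrecision(4.5)); // MOSTLY_Q4_K_M
```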
Step 4: Quantize to Multiple Levels
Generate several variants and compare:
using LMKit.Quantization;
string sourceModel = "models/gemma-3-4b-it-F16.lmk";
var precisions = new[]
{
(Precision.MOSTLY_Q8_0, "Q8_0"),
(Precision.MOSTLY_Q6_K, "Q6_K"),
(Precision.MOSTLY_Q5_K_M, "Q5_K_M"),
(Precision.MOSTLY_Q4_K_M, "Q4_K_M"),
(Precision.MOSTLY_Q3_K_M, "Q3_K_M")
};
var quantizer = new Quantizer(sourceModel)
{
ThreadCount = Environment.ProcessorCount
};
long sourceSize = new FileInfo(sourceModel).Length;
Console.WriteLine($"Source: {sourceSize / (1024.0 * 1024):F0} MB\n");
Console.WriteLine($"{"Level",-10} {"Size (MB)",-12} {"Reduction",-12} {"Time"}");
Console.WriteLine(new string('-', 50));
foreach (var (precision, label) in precisions)
{
string outputPath = $"models/gemma-3-4b-it-{label}.lmk";
var sw = System.Diagnostics.Stopwatch.StartNew();
quantizer.Quantize(outputPath, precision);
sw.Stop();
long outSize = new FileInfo(outputPath).Length;
double reduction = (1.0 - (double)outSize / sourceSize) * 100;
Console.WriteLine($"{label,-10} {outSize / (1024.0 * 1024),-12:F0} {reduction,-12:F0}% {sw.Elapsed.TotalSeconds:F0}s");
}
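One way to act on the table this loop prints is to keep the highest-precision variant that still fits your deployment budget. A sketch, assuming the file naming scheme above; the 6 GB budget is an arbitrary example:

```csharp
// Illustrative: walk the variants from highest to lowest precision and keep
// the first one that fits the size budget. The 6 GB figure is just an example.
const double budgetGb = 6.0;

foreach (string label in new[] { "Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M" })
{
    string path = $"models/gemma-3-4b-it-{label}.lmk";
    double sizeGb = new FileInfo(path).Length / (1024.0 * 1024 * 1024);

    if (sizeGb <= budgetGb)
    {
        Console.WriteLine($"Best fit under {budgetGb} GB: {label} ({sizeGb:F1} GB)");
        break;
    }
}
```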
Step 5: Quality Comparison
Test the quantized model against the original to verify output quality:
using LMKit.Model;
using LMKit.TextGeneration;
string[] testPrompts =
{
"Explain the difference between TCP and UDP in one paragraph.",
"Write a function in C# that checks if a string is a palindrome.",
"What are the three main types of machine learning?"
};
string[] modelPaths =
{
"models/gemma-3-4b-it-F16.lmk",
"models/gemma-3-4b-it-Q4_K_M.lmk"
};
foreach (string modelPath in modelPaths)
{
string modelName = Path.GetFileNameWithoutExtension(modelPath);
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine($"\n=== {modelName} ===\n");
Console.ResetColor();
using LM testModel = new LM(new Uri($"file://{Path.GetFullPath(modelPath)}"),
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine();
var chat = new SingleTurnConversation(testModel)
{
MaximumCompletionTokens = 256
};
foreach (string prompt in testPrompts)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($" Q: {prompt}");
Console.ResetColor();
TextGenerationResult result = chat.Submit(prompt);
string preview = result.Completion.Length > 120
? result.Completion.Substring(0, 120) + "..."
: result.Completion;
Console.WriteLine($" A: {preview}");
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($" [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
Console.ResetColor();
}
}
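If you want a number rather than eyeballing the previews, a crude automated signal is the word overlap between the FP16 answer and the quantized answer to the same prompt. This is only a rough sanity check, not a real evaluation, and the `fp16Result`/`q4Result` names in the commented usage are hypothetical:

```csharp
// Crude quality signal: word-level Jaccard overlap between two completions.
// Not a substitute for proper evaluation; low scores just flag answers worth reading.
static double WordOverlap(string a, string b)
{
    static HashSet<string> Words(string s) =>
        s.ToLowerInvariant()
         .Split(new[] { ' ', '\n', '\t', '.', ',', ';', ':' },
                StringSplitOptions.RemoveEmptyEntries)
         .ToHashSet();

    HashSet<string> wa = Words(a);
    HashSet<string> wb = Words(b);
    if (wa.Count == 0 && wb.Count == 0) return 1.0;

    return (double)wa.Intersect(wb).Count() / wa.Union(wb).Count();
}

// Hypothetical usage, with one completion from each model kept as a string:
// double score = WordOverlap(fp16Result.Completion, q4Result.Completion);
// Console.WriteLine($"Word overlap: {score:P0}");
```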
Step 6: Quantize with Custom Thread Count
Control CPU usage during quantization to leave resources for other processes:
using LMKit.Quantization;
string sourceModel = "models/gemma-3-4b-it-F16.lmk";
// Use all cores for fastest quantization
var fastQuantizer = new Quantizer(sourceModel)
{
ThreadCount = Environment.ProcessorCount
};
// Use half the cores to leave headroom
var balancedQuantizer = new Quantizer(sourceModel)
{
ThreadCount = Environment.ProcessorCount / 2
};
// Single-threaded for minimal system impact
var gentleQuantizer = new Quantizer(sourceModel)
{
ThreadCount = 1
};
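Another common pattern is to reserve a fixed number of cores for the rest of the system rather than thinking in halves; the two-core reservation below is an arbitrary example:

```csharp
// Reserve a couple of cores for other workloads; the "2" is an arbitrary example.
int reservedCores = 2;
int threadCount = Math.Max(1, Environment.ProcessorCount - reservedCores);

var reservedQuantizer = new Quantizer(sourceModel)
{
    ThreadCount = threadCount
};

Console.WriteLine($"Quantizing with {threadCount} of {Environment.ProcessorCount} threads");
```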
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Out of memory during quantization | Source model too large for RAM | Quantization runs on CPU; ensure enough system RAM (2x model size) |
| Output quality too low | Precision too aggressive | Move up one level (Q3 to Q4, Q4 to Q5) |
| Quantization failed error | Unsupported source format or corrupted model | Verify source model loads correctly; re-download if corrupted |
| File not found | Incorrect model path | Use full absolute path or verify relative path from working directory |
| Slow quantization | Low thread count | Increase ThreadCount to Environment.ProcessorCount |
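Several of these issues can be caught up front. Below is a small pre-flight sketch that checks the source path and free disk space before quantizing; requiring at least the source size free is a loose heuristic, since the quantized output is smaller than the source:

```csharp
// Pre-flight check: does the source exist, and is there enough free disk space?
// Requiring at least the source size free is a loose heuristic; the output is smaller.
string source = "models/gemma-3-4b-it-F16.lmk";

if (!File.Exists(source))
{
    Console.WriteLine($"Source model not found: {Path.GetFullPath(source)}");
    return;
}

long sourceBytes = new FileInfo(source).Length;
string root = Path.GetPathRoot(Path.GetFullPath(source))!;
long freeBytes = new DriveInfo(root).AvailableFreeSpace;

if (freeBytes < sourceBytes)
{
    Console.WriteLine($"Low disk space: {freeBytes / (1024.0 * 1024 * 1024):F1} GB free, " +
                      $"source is {sourceBytes / (1024.0 * 1024 * 1024):F1} GB.");
}
```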
Next Steps
- Load a Model and Generate Your First Response: load and use quantized models.
- Build Semantic Search with Embeddings: use quantized models for embedding generation.
- Samples: Quantization: quantization demo.