Quantize a Model for Edge Deployment
Large language models consume significant memory: a 7B parameter model at FP16 requires ~14 GB of VRAM. Quantization reduces model precision from 16-bit floating point to 4-bit or 8-bit integers, cutting memory usage by 50-75% with minimal quality loss. LM-Kit.NET's Quantizer class converts models between precision levels, enabling deployment on laptops, edge devices, and consumer GPUs. This tutorial builds a quantization pipeline that converts models and compares output quality.
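As a rough sanity check on those numbers, weight memory is approximately parameter count times bits per weight divided by eight. The sketch below is plain C# arithmetic; the effective bits-per-weight figures are approximations, and runtime overhead such as the KV cache is ignored:

```csharp
// Back-of-the-envelope weight memory: parameters × bits per weight / 8.
// Effective bits per weight are approximate; KV cache and runtime overhead are ignored.
double parameters = 7e9; // a 7B model

foreach (var (label, bits) in new[] { ("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8) })
{
    double gb = parameters * bits / 8 / 1e9;
    Console.WriteLine($"{label,-7} ~ {gb:F1} GB of weights");
}
```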
Why Quantization Matters
Two deployment problems that quantization solves:
- Run larger models on smaller hardware. A 12B model quantized to Q4_K_M fits in 8 GB VRAM instead of 24 GB. This opens deployment on consumer GPUs, laptops, and even some mobile devices, bringing powerful models to hardware that could not run them at full precision.
- Faster inference. Lower precision means smaller tensors, so memory transfers and computation are faster. Quantized models typically run 1.5-2x faster than their FP16 counterparts on the same hardware.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16+ GB (quantization runs on CPU) |
| Disk | 2x model size free for source + output |
Step 1: Create the Project
dotnet new console -n QuantizeQuickstart
cd QuantizeQuickstart
dotnet add package LM-Kit.NET
Step 2: Basic Model Quantization
using System.Text;
using LMKit.Quantization;
using LMKit.Model;
using System.Diagnostics;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// Quantize a model to Q4_K_M
// ──────────────────────────────────────
string sourceModel = "models/gemma-3-4b-it-F16.lmk";
string outputModel = "models/gemma-3-4b-it-Q4_K_M.lmk";
Console.WriteLine($"Source: {sourceModel}");
Console.WriteLine($"Output: {outputModel}");
Console.WriteLine($"Target: Q4_K_M (4-bit, medium quality)\n");
var quantizer = new Quantizer(sourceModel)
{
ThreadCount = Environment.ProcessorCount
};
Console.WriteLine("Quantizing...");
var stopwatch = Stopwatch.StartNew();
quantizer.Quantize(outputModel, Precision.MOSTLY_Q4_K_M);
stopwatch.Stop();
long sourceSize = new FileInfo(sourceModel).Length;
long outputSize = new FileInfo(outputModel).Length;
double ratio = (double)outputSize / sourceSize * 100;
Console.WriteLine($"\nDone in {stopwatch.Elapsed.TotalSeconds:F0}s");
Console.WriteLine($"Source size: {sourceSize / (1024.0 * 1024):F0} MB");
Console.WriteLine($"Output size: {outputSize / (1024.0 * 1024):F0} MB ({ratio:F0}% of original)");
Step 3: Understand Precision Levels
Choose the right quantization level for your hardware and quality requirements:
| Precision | Bits | Size Reduction | Quality | Use Case |
|---|---|---|---|---|
| `MOSTLY_Q8_0` | 8-bit | ~50% | Excellent | High-quality deployment with moderate savings |
| `MOSTLY_Q6_K` | 6-bit | ~60% | Very good | Best balance of quality and size |
| `MOSTLY_Q5_K_M` | 5-bit | ~65% | Very good | Good quality, significant savings |
| `MOSTLY_Q4_K_M` | 4-bit | ~70% | Good | Most popular choice for consumer GPUs |
| `MOSTLY_Q4_K_S` | 4-bit | ~72% | Good | Slightly smaller than Q4_K_M |
| `MOSTLY_Q3_K_M` | 3-bit | ~78% | Acceptable | Tight memory constraints |
| `MOSTLY_Q2_K` | 2-bit | ~85% | Low | Extreme compression, noticeable quality loss |
Rule of thumb: Start with Q4_K_M. If quality is insufficient, move up to Q5_K_M or Q6_K. If you need to squeeze into tighter memory, try Q3_K_M.
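If you want to encode that rule of thumb, one option is a small helper that maps an approximate memory budget to a precision level. The thresholds below are illustrative assumptions for a model in the ~4B range, not LM-Kit recommendations:

```csharp
using LMKit.Quantization;

// Illustrative only: map an approximate VRAM budget (GB) to a precision level.
// Thresholds are rough assumptions for a ~4B model; adjust for your own models.
static Precision PickPrecision(double vramGb) => vramGb switch
{
    >= 8.0 => Precision.MOSTLY_Q8_0,   // plenty of headroom
    >= 6.0 => Precision.MOSTLY_Q6_K,
    >= 5.0 => Precision.MOSTLY_Q5_K_M,
    >= 4.0 => Precision.MOSTLY_Q4_K_M, // default starting point
    _      => Precision.MOSTLY_Q3_K_M  // tight memory constraints
};

Console.WriteLine(PickPrecision(4.5)); // MOSTLY_Q4_K_M
```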
Step 4: Quantize to Multiple Levels
Generate several variants and compare:
using LMKit.Quantization;
string sourceModel = "models/gemma-3-4b-it-F16.lmk";
var precisions = new[]
{
(Precision.MOSTLY_Q8_0, "Q8_0"),
(Precision.MOSTLY_Q6_K, "Q6_K"),
(Precision.MOSTLY_Q5_K_M, "Q5_K_M"),
(Precision.MOSTLY_Q4_K_M, "Q4_K_M"),
(Precision.MOSTLY_Q3_K_M, "Q3_K_M")
};
var quantizer = new Quantizer(sourceModel)
{
ThreadCount = Environment.ProcessorCount
};
long sourceSize = new FileInfo(sourceModel).Length;
Console.WriteLine($"Source: {sourceSize / (1024.0 * 1024):F0} MB\n");
Console.WriteLine($"{"Level",-10} {"Size (MB)",-12} {"Reduction",-12} {"Time"}");
Console.WriteLine(new string('-', 50));
foreach (var (precision, label) in precisions)
{
string outputPath = $"models/gemma-3-4b-it-{label}.lmk";
var sw = System.Diagnostics.Stopwatch.StartNew();
quantizer.Quantize(outputPath, precision);
sw.Stop();
long outSize = new FileInfo(outputPath).Length;
double reduction = (1.0 - (double)outSize / sourceSize) * 100;
Console.WriteLine($"{label,-10} {outSize / (1024.0 * 1024),-12:F0} {reduction,-12:F0}% {sw.Elapsed.TotalSeconds:F0}s");
}
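One way to act on the table this loop prints is to keep the highest-precision variant that still fits your deployment budget. A sketch, assuming the file naming scheme above; the 6 GB budget is an arbitrary example:

```csharp
// Illustrative: walk the variants from highest to lowest precision and keep
// the first one that fits the size budget. The 6 GB figure is just an example.
const double budgetGb = 6.0;

foreach (string label in new[] { "Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M" })
{
    string path = $"models/gemma-3-4b-it-{label}.lmk";
    double sizeGb = new FileInfo(path).Length / (1024.0 * 1024 * 1024);

    if (sizeGb <= budgetGb)
    {
        Console.WriteLine($"Best fit under {budgetGb} GB: {label} ({sizeGb:F1} GB)");
        break;
    }
}
```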
Step 5: Quality Comparison
Test the quantized model against the original to verify output quality:
using LMKit.Model;
using LMKit.TextGeneration;
string[] testPrompts =
{
"Explain the difference between TCP and UDP in one paragraph.",
"Write a function in C# that checks if a string is a palindrome.",
"What are the three main types of machine learning?"
};
string[] modelPaths =
{
"models/gemma-3-4b-it-F16.lmk",
"models/gemma-3-4b-it-Q4_K_M.lmk"
};
foreach (string modelPath in modelPaths)
{
string modelName = Path.GetFileNameWithoutExtension(modelPath);
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine($"\n=== {modelName} ===\n");
Console.ResetColor();
using LM testModel = new LM(new Uri($"file://{Path.GetFullPath(modelPath)}"),
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine();
var chat = new SingleTurnConversation(testModel)
{
MaximumCompletionTokens = 256
};
foreach (string prompt in testPrompts)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($" Q: {prompt}");
Console.ResetColor();
TextGenerationResult result = chat.Submit(prompt);
string preview = result.Completion.Length > 120
? result.Completion.Substring(0, 120) + "..."
: result.Completion;
Console.WriteLine($" A: {preview}");
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($" [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
Console.ResetColor();
}
}
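If you want a number rather than eyeballing the previews, a crude automated signal is the word overlap between the FP16 answer and the quantized answer to the same prompt. This is only a rough sanity check, not a real evaluation, and the `fp16Result`/`q4Result` names in the commented usage are hypothetical:

```csharp
// Crude quality signal: word-level Jaccard overlap between two completions.
// Not a substitute for proper evaluation; low scores just flag answers worth reading.
static double WordOverlap(string a, string b)
{
    static HashSet<string> Words(string s) =>
        s.ToLowerInvariant()
         .Split(new[] { ' ', '\n', '\t', '.', ',', ';', ':' },
                StringSplitOptions.RemoveEmptyEntries)
         .ToHashSet();

    HashSet<string> wa = Words(a);
    HashSet<string> wb = Words(b);
    if (wa.Count == 0 && wb.Count == 0) return 1.0;

    return (double)wa.Intersect(wb).Count() / wa.Union(wb).Count();
}

// Hypothetical usage, with one completion from each model kept as a string:
// double score = WordOverlap(fp16Result.Completion, q4Result.Completion);
// Console.WriteLine($"Word overlap: {score:P0}");
```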
Step 6: Quantize with Custom Thread Count
Control CPU usage during quantization to leave resources for other processes:
using LMKit.Quantization;
string sourceModel = "models/gemma-3-4b-it-F16.lmk";
// Use all cores for fastest quantization
var fastQuantizer = new Quantizer(sourceModel)
{
ThreadCount = Environment.ProcessorCount
};
// Use half the cores to leave headroom
var balancedQuantizer = new Quantizer(sourceModel)
{
ThreadCount = Environment.ProcessorCount / 2
};
// Single-threaded for minimal system impact
var gentleQuantizer = new Quantizer(sourceModel)
{
ThreadCount = 1
};
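Another common pattern is to reserve a fixed number of cores for the rest of the system rather than thinking in halves; the two-core reservation below is an arbitrary example:

```csharp
// Reserve a couple of cores for other workloads; the "2" is an arbitrary example.
int reservedCores = 2;
int threadCount = Math.Max(1, Environment.ProcessorCount - reservedCores);

var reservedQuantizer = new Quantizer(sourceModel)
{
    ThreadCount = threadCount
};

Console.WriteLine($"Quantizing with {threadCount} of {Environment.ProcessorCount} threads");
```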
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Out of memory during quantization | Source model too large for RAM | Quantization runs on CPU; ensure enough system RAM (2x model size) |
| Output quality too low | Precision too aggressive | Move up one level (Q3 to Q4, Q4 to Q5) |
| Quantization failed error | Unsupported source format or corrupted model | Verify source model loads correctly; re-download if corrupted |
| File not found | Incorrect model path | Use full absolute path or verify relative path from working directory |
| Slow quantization | Low thread count | Increase ThreadCount to Environment.ProcessorCount |
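Several of these issues can be caught up front. Below is a small pre-flight sketch that checks the source path and free disk space before quantizing; requiring at least the source size free is a loose heuristic, since the quantized output is smaller than the source:

```csharp
// Pre-flight check: does the source exist, and is there enough free disk space?
// Requiring at least the source size free is a loose heuristic; the output is smaller.
string source = "models/gemma-3-4b-it-F16.lmk";

if (!File.Exists(source))
{
    Console.WriteLine($"Source model not found: {Path.GetFullPath(source)}");
    return;
}

long sourceBytes = new FileInfo(source).Length;
string root = Path.GetPathRoot(Path.GetFullPath(source))!;
long freeBytes = new DriveInfo(root).AvailableFreeSpace;

if (freeBytes < sourceBytes)
{
    Console.WriteLine($"Low disk space: {freeBytes / (1024.0 * 1024 * 1024):F1} GB free, " +
                      $"source is {sourceBytes / (1024.0 * 1024 * 1024):F1} GB.");
}
```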
Next Steps
- Load a Model and Generate Your First Response: load and use quantized models.
- Build Semantic Search with Embeddings: use quantized models for embedding generation.
- Samples: Quantization: quantization demo.