What Is Quantization and Which Level Should I Choose?
TL;DR
Quantization reduces model precision to shrink file size and memory usage. A full-precision (F32) model uses 32 bits per weight. Quantization compresses this to 8, 4, 3, or even 2 bits, making models dramatically smaller and faster at some cost in quality. Q4_K_M (4-bit k-quant, medium variant) is the recommended default: it offers the best balance of quality and size. The LM-Kit.NET catalog uses Q4_K_M for most models.
Quantization Levels
| Level | Bits | Size Ratio | Quality | Best For |
|---|---|---|---|---|
| F32 | 32 | 1.0x (baseline) | Lossless | Research, reference comparisons |
| F16 | 16 | 0.5x | Near-lossless | When quality is paramount and memory allows |
| Q8_0 | 8 | 0.25x | Excellent | Whisper models, embeddings, quality-critical tasks |
| Q6_K | 6 | ~0.2x | Very good | When Q8 is too large but you need high quality |
| Q5_K_M | 5 | ~0.17x | Very good | Good quality with meaningful size reduction |
| Q4_K_M | 4 | ~0.14x | Good | Recommended default. Best quality/size balance. |
| Q4_K_S | 4 | ~0.13x | Good | Slightly smaller than Q4_K_M |
| Q3_K_M | 3 | ~0.11x | Acceptable | Memory-constrained environments |
| Q2_K | 2 | ~0.08x | Degraded | Extreme compression, noticeable quality loss |
The "K" variants use the k-quant scheme, which quantizes weights in blocks that share scale factors rather than quantizing each weight independently. "M" (medium) and "S" (small) indicate how many tensors are kept at higher precision: M keeps more (for example, key attention and feed-forward projections), producing better quality at a slightly larger size.
Why Q4_K_M Is the Default
Q4_K_M has become the industry standard for quantized LLMs because:
- Quality: Retains most of the model's original capability. Benchmarks show minimal degradation compared to F16 for instruction-following and generation tasks.
- Size: A 7B parameter model compresses from ~14 GB (F16) to ~4 GB (Q4_K_M).
- Speed: Fewer bits per weight means less data to move, so inference runs faster, especially on CPU where memory bandwidth is the bottleneck.
- Compatibility: Works well on consumer hardware (8 GB VRAM handles most 7B-8B models in Q4_K_M).
Quantizing a Model
Use the Quantizer class to convert models to a different precision:
```csharp
using LMKit.Quantization;
using LMKit.Model;

var quantizer = new Quantizer();

// Convert the F16 source file to Q4_K_M.
quantizer.Quantize(
    inputPath: "model-f16.gguf",
    outputPath: "model-q4km.gguf",
    modelPrecision: LM.Precision.MOSTLY_Q4_K_M);
```
Impact on Different Tasks
| Task | Tolerance to Quantization | Recommended Minimum |
|---|---|---|
| Chat and conversation | High | Q4_K_M |
| RAG and Q&A | High | Q4_K_M |
| Function calling / tool use | Medium | Q4_K_M (prefer Q5+) |
| Code generation | Medium | Q4_K_M |
| Classification | High | Q4_K_M or lower |
| Mathematical reasoning | Low | Q5_K_M or higher |
| Embeddings | Low | Q8_0 or F16 |
| Speech-to-text (Whisper) | Low | Q8_0 |
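The table above can be encoded as a simple lookup so a deployment script picks a quantization floor per task. This is a minimal sketch, not an LM-Kit.NET API; the task names and the `FloorFor` helper are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lookup encoding the task-tolerance table above.
var minimumLevel = new Dictionary<string, string>
{
    ["chat"] = "Q4_K_M",
    ["rag"] = "Q4_K_M",
    ["function-calling"] = "Q4_K_M", // prefer Q5_K_M or higher when memory allows
    ["code-generation"] = "Q4_K_M",
    ["classification"] = "Q4_K_M",
    ["math-reasoning"] = "Q5_K_M",
    ["embeddings"] = "Q8_0",
    ["speech-to-text"] = "Q8_0",
};

// Fall back to the Q4_K_M default for tasks not listed in the table.
string FloorFor(string task) =>
    minimumLevel.TryGetValue(task, out var level) ? level : "Q4_K_M";

Console.WriteLine(FloorFor("embeddings")); // prints "Q8_0"
```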
Size Estimation
For a quick estimate: multiply the parameter count by the bits per weight, then divide by 8 to get bytes.
| Model Parameters | F16 | Q8_0 | Q4_K_M |
|---|---|---|---|
| 1B | ~2 GB | ~1 GB | ~0.7 GB |
| 4B | ~8 GB | ~4 GB | ~2.5 GB |
| 8B | ~16 GB | ~8 GB | ~4.5 GB |
| 14B | ~28 GB | ~14 GB | ~8 GB |
Actual sizes vary because quantization is not perfectly uniform across all layers.
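The rule of thumb above can be sketched in a few lines. The effective bits-per-weight values in the comments are rough assumptions, not exact figures: k-quants mix precisions across layers, so real GGUF files land near, not exactly on, these sizes.

```csharp
using System;

// Rule-of-thumb estimator: parameters × bits-per-weight ÷ 8 bits-per-byte.
// billionParams is in units of 1e9 parameters, so the result is directly in GB.
double EstimateGB(double billionParams, double bitsPerWeight) =>
    billionParams * bitsPerWeight / 8.0;

Console.WriteLine(EstimateGB(8, 16));  // F16:    16 GB
Console.WriteLine(EstimateGB(8, 8));   // Q8_0:    8 GB
Console.WriteLine(EstimateGB(8, 4.5)); // Q4_K_M: ~4.5 GB (effective bits assumed)
```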
📚 Related Content
- How do I choose the right model size for my hardware?: Memory estimation and hardware matching.
- Can I use my own GGUF model files with LM-Kit.NET?: Loading quantized GGUF models.
- Can LM-Kit.NET work with models from Hugging Face?: Finding quantized models on Hugging Face.
- Quantize a Model for Edge Deployment: Step-by-step quantization guide.