What Is Quantization and Which Level Should I Choose?


TL;DR

Quantization reduces model precision to shrink file size and memory usage. A full-precision (F32) model uses 32 bits per weight. Quantization compresses this to 8, 4, 3, or even 2 bits, making models dramatically smaller and faster with some quality trade-off. Q4_K_M (4-bit, K-means, medium) is the recommended default: it offers the best balance of quality and size. The LM-Kit.NET catalog uses Q4_K_M for most models.


Quantization Levels

| Level | Bits | Size ratio (vs F32) | Quality | Best for |
|---|---|---|---|---|
| F32 | 32 | 1.0x (baseline) | Lossless | Research, reference comparisons |
| F16 | 16 | 0.5x | Near-lossless | When quality is paramount and memory allows |
| Q8_0 | 8 | 0.25x | Excellent | Whisper models, embeddings, quality-critical tasks |
| Q6_K | 6 | ~0.2x | Very good | When Q8 is too large but you need high quality |
| Q5_K_M | 5 | ~0.17x | Very good | Good quality with meaningful size reduction |
| Q4_K_M | 4 | ~0.14x | Good | Recommended default; best quality/size balance |
| Q4_K_S | 4 | ~0.13x | Good | Slightly smaller than Q4_K_M |
| Q3_K_M | 3 | ~0.11x | Acceptable | Memory-constrained environments |
| Q2_K | 2 | ~0.08x | Degraded | Extreme compression, noticeable quality loss |

The "K" variants use K-means-style clustering for smarter weight grouping. "M" (medium) and "S" (small) indicate how much of the model is kept at higher precision: M retains more, producing better quality at a slightly larger size.
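To make the clustering idea concrete, here is a minimal 1-D k-means sketch applied to a block of weights. This is illustrative only: the actual GGUF K-quant formats use block-wise scales and offsets rather than this literal algorithm, and all names below are hypothetical.

```csharp
using System;
using System.Linq;

class KMeansSketch
{
    // Group a block of weights into k clusters and replace each weight
    // with its cluster centroid. Each weight then needs only a small
    // cluster index (e.g. 4 bits for k = 16) plus a shared centroid table.
    public static float[] QuantizeBlock(float[] weights, int k, int iterations = 20)
    {
        // Initialize centroids evenly across the weight range (assumes k >= 2).
        float min = weights.Min(), max = weights.Max();
        var centroids = Enumerable.Range(0, k)
            .Select(i => min + (max - min) * i / (k - 1))
            .ToArray();

        var assignment = new int[weights.Length];
        for (int it = 0; it < iterations; it++)
        {
            // Assign each weight to its nearest centroid.
            for (int w = 0; w < weights.Length; w++)
            {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Math.Abs(weights[w] - centroids[c]) <
                        Math.Abs(weights[w] - centroids[best]))
                        best = c;
                assignment[w] = best;
            }
            // Move each centroid to the mean of its assigned weights.
            for (int c = 0; c < k; c++)
            {
                var members = weights.Where((_, i) => assignment[i] == c).ToArray();
                if (members.Length > 0) centroids[c] = members.Average();
            }
        }
        return weights.Select((_, i) => centroids[assignment[i]]).ToArray();
    }

    static void Main()
    {
        var weights = new float[] { -0.9f, -0.85f, -0.1f, 0.0f, 0.05f, 0.8f, 0.9f, 0.95f };
        var quantized = QuantizeBlock(weights, k: 4);
        Console.WriteLine(string.Join(", ", quantized.Select(v => v.ToString("0.00"))));
    }
}
```

Nearby weights collapse to shared values, which is why fewer bits per weight can still preserve most of the model's behavior.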


Why Q4_K_M Is the Default

Q4_K_M has become the industry standard for quantized LLMs because:

  • Quality: Retains most of the model's original capability. Benchmarks show minimal degradation compared to F16 for instruction-following and generation tasks.
  • Size: A 7B parameter model compresses from ~14 GB (F16) to ~4 GB (Q4_K_M).
  • Speed: Fewer bits mean less data to move per token, so inference is faster, especially on CPU where memory bandwidth is the bottleneck.
  • Compatibility: Works well on consumer hardware (8 GB VRAM handles most 7B-8B models in Q4_K_M).

Quantizing a Model

Use the Quantizer class to convert models to a different precision:

using LMKit.Quantization;
using LMKit.Model;

// Convert an F16 GGUF model to the recommended Q4_K_M precision.
var quantizer = new Quantizer();

quantizer.Quantize(
    inputPath: "model-f16.gguf",    // source model
    outputPath: "model-q4km.gguf",  // quantized output
    modelPrecision: LM.Precision.MOSTLY_Q4_K_M
);

Impact on Different Tasks

| Task | Tolerance to quantization | Recommended minimum |
|---|---|---|
| Chat and conversation | High | Q4_K_M |
| RAG and Q&A | High | Q4_K_M |
| Function calling / tool use | Medium | Q4_K_M (prefer Q5+) |
| Code generation | Medium | Q4_K_M |
| Classification | High | Q4_K_M or lower |
| Mathematical reasoning | Low | Q5_K_M or higher |
| Embeddings | Low | Q8_0 or F16 |
| Speech-to-text (Whisper) | Low | Q8_0 |

Size Estimation

For a quick estimate: multiply the parameter count by the bits per weight, then divide by 8 to get bytes.

| Model parameters | F16 | Q8_0 | Q4_K_M |
|---|---|---|---|
| 1B | ~2 GB | ~1 GB | ~0.7 GB |
| 4B | ~8 GB | ~4 GB | ~2.5 GB |
| 8B | ~16 GB | ~8 GB | ~4.5 GB |
| 14B | ~28 GB | ~14 GB | ~8 GB |

Actual sizes vary because quantization is not perfectly uniform across all layers.
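The rule of thumb above can be sketched as a small helper. This is illustrative arithmetic, not part of the LM-Kit API; note that K-quant files come out somewhat above the nominal bit count because some layers keep higher precision.

```csharp
using System;

class SizeEstimate
{
    // Rough GGUF size: parameters (in billions) x bits per weight / 8,
    // which conveniently yields gigabytes (decimal) directly, since
    // 1e9 params x (bits / 8) bytes = (paramsBillions x bits / 8) GB.
    public static double EstimateGB(double parametersBillions, double bitsPerWeight)
        => parametersBillions * bitsPerWeight / 8.0;

    static void Main()
    {
        // 8B model at the nominal bit widths from the table above.
        Console.WriteLine($"8B @ F16:    ~{EstimateGB(8, 16):0.#} GB"); // 16 GB
        Console.WriteLine($"8B @ Q8_0:   ~{EstimateGB(8, 8):0.#} GB");  // 8 GB
        Console.WriteLine($"8B @ Q4_K_M: ~{EstimateGB(8, 4):0.#} GB");  // 4 GB nominal;
        // real Q4_K_M files land nearer ~4.5 GB due to mixed-precision layers.
    }
}
```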

