What Is Quantization and Which Level Should I Choose?


TL;DR

Quantization reduces model precision to shrink file size and memory usage. A full-precision (F32) model uses 32 bits per weight. Quantization compresses this to 8, 4, 3, or even 2 bits, making models dramatically smaller and faster with some quality trade-off. Q4_K_M (4-bit, K-means, medium) is the recommended default: it offers the best balance of quality and size. The LM-Kit.NET catalog uses Q4_K_M for most models.


Quantization Levels

| Level | Bits | Size ratio (vs F32) | Quality | Best for |
|---|---|---|---|---|
| F32 | 32 | 1.0x (baseline) | Lossless | Research, reference comparisons |
| F16 | 16 | 0.5x | Near-lossless | When quality is paramount and memory allows |
| Q8_0 | 8 | 0.25x | Excellent | Whisper models, embeddings, quality-critical tasks |
| Q6_K | 6 | ~0.2x | Very good | When Q8 is too large but you need high quality |
| Q5_K_M | 5 | ~0.17x | Very good | Good quality with meaningful size reduction |
| Q4_K_M | 4 | ~0.14x | Good | Recommended default; best quality/size balance |
| Q4_K_S | 4 | ~0.13x | Good | Slightly smaller than Q4_K_M |
| Q3_K_M | 3 | ~0.11x | Acceptable | Memory-constrained environments |
| Q2_K | 2 | ~0.08x | Degraded | Extreme compression, noticeable quality loss |

The "K" variants use K-means-style clustering for smarter weight grouping. "M" (medium) and "S" (small) indicate how much of the model is kept at higher precision: M retains more, producing better quality at a slightly larger size.
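To make the clustering idea concrete, here is a minimal 1-D k-means sketch applied to a block of weights. This is illustrative only: the actual GGUF K-quant formats use block-wise scales and offsets rather than this literal algorithm, and all names below are hypothetical.

```csharp
using System;
using System.Linq;

class KMeansSketch
{
    // Group a block of weights into k clusters and replace each weight
    // with its cluster centroid. Each weight then needs only a small
    // cluster index (e.g. 4 bits for k = 16) plus a shared centroid table.
    public static float[] QuantizeBlock(float[] weights, int k, int iterations = 20)
    {
        // Initialize centroids evenly across the weight range (assumes k >= 2).
        float min = weights.Min(), max = weights.Max();
        var centroids = Enumerable.Range(0, k)
            .Select(i => min + (max - min) * i / (k - 1))
            .ToArray();

        var assignment = new int[weights.Length];
        for (int it = 0; it < iterations; it++)
        {
            // Assign each weight to its nearest centroid.
            for (int w = 0; w < weights.Length; w++)
            {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Math.Abs(weights[w] - centroids[c]) <
                        Math.Abs(weights[w] - centroids[best]))
                        best = c;
                assignment[w] = best;
            }
            // Move each centroid to the mean of its assigned weights.
            for (int c = 0; c < k; c++)
            {
                var members = weights.Where((_, i) => assignment[i] == c).ToArray();
                if (members.Length > 0) centroids[c] = members.Average();
            }
        }
        return weights.Select((_, i) => centroids[assignment[i]]).ToArray();
    }

    static void Main()
    {
        var weights = new float[] { -0.9f, -0.85f, -0.1f, 0.0f, 0.05f, 0.8f, 0.9f, 0.95f };
        var quantized = QuantizeBlock(weights, k: 4);
        Console.WriteLine(string.Join(", ", quantized.Select(v => v.ToString("0.00"))));
    }
}
```

Nearby weights collapse to shared values, which is why fewer bits per weight can still preserve most of the model's behavior.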


Why Q4_K_M Is the Default

Q4_K_M has become the industry standard for quantized LLMs because:

  • Quality: Retains most of the model's original capability. Benchmarks show minimal degradation compared to F16 for instruction-following and generation tasks.
  • Size: A 7B parameter model compresses from ~14 GB (F16) to ~4 GB (Q4_K_M).
  • Speed: Fewer bits mean less data to move per token, so inference is faster, especially on CPU where memory bandwidth is the bottleneck.
  • Compatibility: Works well on consumer hardware (8 GB VRAM handles most 7B-8B models in Q4_K_M).

Quantizing a Model

Use the Quantizer class to convert models to a different precision:

using LMKit.Quantization;
using LMKit.Model;

// Convert an F16 GGUF model to the recommended Q4_K_M precision.
var quantizer = new Quantizer();

quantizer.Quantize(
    inputPath: "model-f16.gguf",    // source model
    outputPath: "model-q4km.gguf",  // quantized output
    modelPrecision: LM.Precision.MOSTLY_Q4_K_M
);

Impact on Different Tasks

| Task | Tolerance to quantization | Recommended minimum |
|---|---|---|
| Chat and conversation | High | Q4_K_M |
| RAG and Q&A | High | Q4_K_M |
| Function calling / tool use | Medium | Q4_K_M (prefer Q5+) |
| Code generation | Medium | Q4_K_M |
| Classification | High | Q4_K_M or lower |
| Mathematical reasoning | Low | Q5_K_M or higher |
| Embeddings | Low | Q8_0 or F16 |
| Speech-to-text (Whisper) | Low | Q8_0 |

Size Estimation

For a quick estimate: multiply the parameter count by the bits per weight, then divide by 8 to get bytes.

| Model parameters | F16 | Q8_0 | Q4_K_M |
|---|---|---|---|
| 1B | ~2 GB | ~1 GB | ~0.7 GB |
| 4B | ~8 GB | ~4 GB | ~2.5 GB |
| 8B | ~16 GB | ~8 GB | ~4.5 GB |
| 14B | ~28 GB | ~14 GB | ~8 GB |

Actual sizes vary because quantization is not perfectly uniform across all layers.
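The rule of thumb above can be sketched as a small helper. This is illustrative arithmetic, not part of the LM-Kit API; note that K-quant files come out somewhat above the nominal bit count because some layers keep higher precision.

```csharp
using System;

class SizeEstimate
{
    // Rough GGUF size: parameters (in billions) x bits per weight / 8,
    // which conveniently yields gigabytes (decimal) directly, since
    // 1e9 params x (bits / 8) bytes = (paramsBillions x bits / 8) GB.
    public static double EstimateGB(double parametersBillions, double bitsPerWeight)
        => parametersBillions * bitsPerWeight / 8.0;

    static void Main()
    {
        // 8B model at the nominal bit widths from the table above.
        Console.WriteLine($"8B @ F16:    ~{EstimateGB(8, 16):0.#} GB"); // 16 GB
        Console.WriteLine($"8B @ Q8_0:   ~{EstimateGB(8, 8):0.#} GB");  // 8 GB
        Console.WriteLine($"8B @ Q4_K_M: ~{EstimateGB(8, 4):0.#} GB");  // 4 GB nominal;
        // real Q4_K_M files land nearer ~4.5 GB due to mixed-precision layers.
    }
}
```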

