Enum KVCacheType
Specifies the data type, and therefore the quantization level, used to store a context's KV cache. Lower-precision types shrink the per-token cache footprint at some cost in numerical accuracy. F16 is the default and serves as the unquantized baseline for caching purposes.
public enum KVCacheType
Fields
F32 = 0
Full 32-bit floating point. Highest precision, largest footprint.
F16 = 1
16-bit floating point. The default; treated as the unquantized baseline.
BF16 = 2
16-bit brain floating point (bfloat16). Same size as F16 with a wider exponent range.
Q8_0 = 3
8-bit quantization. About half the footprint of F16 with minor precision loss.
Q4_0 = 4
4-bit quantization. Smallest common footprint, larger precision loss.
Q4_1 = 5
4-bit quantization with a per-block scale and offset, slightly more accurate than Q4_0.
IQ4_NL = 6
4-bit non-linear quantization. 4-bit footprint with improved accuracy over Q4_0.
Q5_0 = 7
5-bit quantization. Between Q4_0 and Q8_0 in size and precision.
Q5_1 = 8
5-bit quantization with a per-block scale and offset, slightly more accurate than Q5_1's offset-free counterpart, Q5_0.
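Examples
The following is a minimal sketch, not part of the library, of how the choice of cache type might translate into a per-token footprint. It redeclares the enum locally so that it compiles on its own. The helper names (KVCacheFootprint, BitsPerElement, BytesPerToken) and the model shape are hypothetical, and the bits-per-element figures assume ggml-style 32-element blocks with per-block scale/offset overhead, which this page does not state.

```csharp
using System;

// Local mirror of the documented enum so the sketch is self-contained.
public enum KVCacheType
{
    F32 = 0, F16 = 1, BF16 = 2, Q8_0 = 3, Q4_0 = 4,
    Q4_1 = 5, IQ4_NL = 6, Q5_0 = 7, Q5_1 = 8
}

// Hypothetical helper; not an API of the documented library.
public static class KVCacheFootprint
{
    // ASSUMPTION: effective bits per cached element, including per-block
    // scale/offset overhead for ggml-style blocks of 32 elements.
    public static double BitsPerElement(KVCacheType type) => type switch
    {
        KVCacheType.F32    => 32.0,
        KVCacheType.F16    => 16.0,
        KVCacheType.BF16   => 16.0,
        KVCacheType.Q8_0   => 8.5,
        KVCacheType.Q4_0   => 4.5,
        KVCacheType.Q4_1   => 5.0,
        KVCacheType.IQ4_NL => 4.5,
        KVCacheType.Q5_0   => 5.5,
        KVCacheType.Q5_1   => 6.0,
        _ => throw new ArgumentOutOfRangeException(nameof(type))
    };

    // Bytes needed to cache the keys and values of one token across all layers.
    public static double BytesPerToken(KVCacheType type, int layers, int kvHeads, int headDim)
        => 2.0 * layers * kvHeads * headDim * BitsPerElement(type) / 8.0;
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical model shape, chosen only for illustration.
        const int layers = 32, kvHeads = 8, headDim = 128;

        foreach (KVCacheType t in Enum.GetValues(typeof(KVCacheType)))
        {
            double kib = KVCacheFootprint.BytesPerToken(t, layers, kvHeads, headDim) / 1024.0;
            Console.WriteLine($"{t,-7} {kib,8:F1} KiB/token");
        }
    }
}
```

Under these assumptions, F16 works out to 128 KiB per token for the illustrative shape and Q8_0 to 68 KiB, which matches the "about half" description above. Actual savings depend on the block layout the backend uses.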