Field EnableKVCacheQuantization
Gets or sets a value indicating whether the conversation key-value (KV) cache is stored in a quantized form instead of full precision.
public static bool EnableKVCacheQuantization
Returns
- bool
true to quantize the KV cache; otherwise, false. Default is false.
Remarks
When enabled, eligible contexts hold their KV cache at 8-bit precision (Q8_0) rather than 16-bit, roughly halving the cache's memory footprint so longer contexts fit on the same device, at a small generation-quality cost. The setting is read when a context is created, so it affects newly created contexts only; contexts already in memory keep the precision they were built with. Quantization is skipped for models that do not support it.
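The "roughly halving" figure can be checked with a back-of-the-envelope estimate. The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128, 8192-token context) is a hypothetical example, and it assumes Q8_0 uses the ggml block layout of 32 8-bit values plus one 16-bit scale (34 bytes per block). Neither assumption is stated on this page, and the sketch ignores any allocator or padding overhead.

```csharp
using System;

// Hypothetical model shape, chosen only for illustration.
const long layers = 32;
const long kvHeads = 8;
const long headDim = 128;
const long contextLength = 8192;

// One key vector and one value vector per layer, per token.
long elements = 2 * layers * contextLength * kvHeads * headDim;

// 16-bit cache: 2 bytes per element.
long f16Bytes = elements * 2;

// Q8_0 (assumed ggml layout): blocks of 32 values,
// each block = 32 one-byte values + one 2-byte scale = 34 bytes.
long q8Bytes = elements / 32 * 34;

Console.WriteLine($"16-bit cache: {f16Bytes / (1024 * 1024)} MiB");
Console.WriteLine($"Q8_0 cache:   {q8Bytes / (1024 * 1024)} MiB");
```

For this shape the estimate comes to 1024 MiB at 16-bit and 544 MiB at Q8_0, about 53% of the original size. The saving is slightly less than half because each block carries its own scale factor.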