Class ContextInfo
Immutable, read-only snapshot of a single inference context (KV-cache) held in memory for a loaded model, as returned by GetLoadedContexts().
public sealed class ContextInfo
- Inheritance
ContextInfo
Remarks
A loaded model keeps one context per concurrent session or in-flight request, plus any contexts retained in the recycle pool for reuse. Each context owns a KV-cache, which is the dominant per-session memory cost on top of the model weights. This type exposes that cost and the context's lifecycle state so callers can account for a model's full memory footprint and understand what is keeping it resident.
The values are captured at the moment of the call and never change afterwards; the underlying context is not exposed, so reading them cannot mutate inference state.
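As an illustration of "what is keeping it resident", the sketch below sorts snapshots into three buckets using the IsInUse and IsCachePriority flags documented under Properties. It is a minimal sketch, not the library's own reporting logic: `ContextFlags` is a hypothetical stand-in record that mirrors only those two properties so the example compiles on its own, the sample values are invented, and the bucket names and the rule that "in use" takes precedence over "pinned" are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Invented sample data; in real code these values would come from the
// ContextInfo snapshots returned by GetLoadedContexts().
var snapshots = new List<ContextFlags>
{
    new ContextFlags("session-1", IsInUse: true,  IsCachePriority: false),
    new ContextFlags("pooled-1",  IsInUse: false, IsCachePriority: false),
    new ContextFlags("pinned-1",  IsInUse: false, IsCachePriority: true),
};

// Group the snapshots by the reason each context is still resident.
foreach (var group in snapshots.GroupBy(s => ResidencyReport.Classify(s)))
{
    Console.WriteLine($"{group.Key}: {string.Join(", ", group.Select(s => s.Id))}");
}

// Hypothetical stand-in mirroring ContextInfo.Id, IsInUse and IsCachePriority.
public sealed record ContextFlags(string Id, bool IsInUse, bool IsCachePriority);

public static class ResidencyReport
{
    // Assumption: an actively held context is reported as "in use" even if it is also pinned.
    public static string Classify(ContextFlags c)
    {
        if (c.IsInUse) return "in use";
        return c.IsCachePriority ? "idle, pinned" : "idle, pooled";
    }
}
```

Because each snapshot is immutable, the grouping reflects the state at the moment the snapshots were taken; a fresh call is needed to observe later changes.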
Constructors
- ContextInfo(string, int, long, ContextResidency, bool, int, bool, bool, KVCacheType, long, long)
Initializes a new ContextInfo snapshot.
Properties
- ContextLength
Gets the context window size, in tokens.
- DeviceNumber
Gets the number of the device the context resides on. -1 indicates the CPU; a value of 0 or greater is the GPU device number (see DeviceNumber), matching the convention used by MainGpu.
- DraftMemorySize
Gets the size, in bytes, of the speculative-decoding draft (Multi-Token Prediction or attached draft-model) sibling context bound to this session: the draft's own compute buffers, plus its own KV-cache when it keeps one. Reported apart from MemorySize so the draft's footprint is visible on its own. When the draft shares the main context's KV-cache (an attached assistant draft linked through the target), that shared cache belongs to the main context and is counted in MemorySize, not here, so the two never overlap. Returns 0 when the session has no draft context or when the context is hibernated.
- FlashAttention
Gets a value indicating whether flash-attention is enabled for the context.
- Id
Gets the stable, unique identifier of the context.
- IsCachePriority
Gets a value indicating whether the context is pinned, exempting it from cache eviction under memory pressure.
- IsInUse
Gets a value indicating whether the context is actively held by a session or an in-flight request (true), or sits idle in the recycle pool kept warm for reuse (false).
- KVCacheQuantization
Gets the data type the context's KV-cache is stored in, that is, its quantization level. F16 is the unquantized default; lower-precision types such as Q8_0 trade accuracy for a smaller per-token footprint.
- MemorySize
Gets the combined size, in bytes, of the context's main KV-cache and its scheduler-managed compute buffers. Returns 0 when the context is hibernated (its memory has been released to disk; see Residency). This does not include the draft context (see DraftMemorySize) or the output/logits buffer (see OutputBufferBytes); the full resident footprint of the session is MemorySize + DraftMemorySize + OutputBufferBytes.
- OutputBufferBytes
Gets the size, in bytes, of the context's output/logits buffer, allocated apart from the KV-cache and compute buffers and therefore not counted in MemorySize. This buffer scales with the model's vocabulary size and can be a substantial per-context allocation for large-vocabulary models. Returns 0 when the context is hibernated, or when the running native backend predates the query export (older redistributable binaries), in which case the bytes fall into the dashboard's unattributed bucket rather than being mis-reported.
- Residency
Gets the residency of the context: whether it is live in memory, hibernated to disk, or not yet created.
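The footprint formula given under MemorySize (MemorySize + DraftMemorySize + OutputBufferBytes) can be sketched as follows. This is an illustration under stated assumptions rather than library code: `ContextSizes` is a hypothetical stand-in record carrying only the three size properties, typed as `long` on the assumption that they correspond to the `long` constructor parameters, and the byte counts are invented sample values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Invented sample data; in real code these values would come from the
// ContextInfo snapshots returned by GetLoadedContexts().
var contexts = new List<ContextSizes>
{
    // A live context with a draft sibling: 512 MiB + 64 MiB + 8 MiB.
    new ContextSizes("live", MemorySize: 512L << 20, DraftMemorySize: 64L << 20, OutputBufferBytes: 8L << 20),
    // A hibernated context: all three size properties report 0.
    new ContextSizes("hibernated", MemorySize: 0, DraftMemorySize: 0, OutputBufferBytes: 0),
};

foreach (var c in contexts)
{
    Console.WriteLine($"{c.Id}: {Footprint.ResidentBytes(c)} bytes resident");
}
Console.WriteLine($"total: {Footprint.TotalResidentBytes(contexts)} bytes resident");

// Hypothetical stand-in mirroring the three size properties of ContextInfo.
public sealed record ContextSizes(string Id, long MemorySize, long DraftMemorySize, long OutputBufferBytes);

public static class Footprint
{
    // Full resident footprint of one session, per the formula documented under MemorySize.
    public static long ResidentBytes(ContextSizes c) =>
        c.MemorySize + c.DraftMemorySize + c.OutputBufferBytes;

    // Sum over every context of a model; the model weights are not included.
    public static long TotalResidentBytes(IEnumerable<ContextSizes> contexts) =>
        contexts.Sum(c => ResidentBytes(c));
}
```

Since all three properties return 0 for a hibernated context, such contexts drop out of the sum without any special handling. An OutputBufferBytes of 0 on a live context may instead mean the native backend predates the query export, in which case the total understates the real footprint.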