Class ContextInfo

Namespace: LMKit.Inference

Assembly: LM-Kit.NET.dll

Immutable, read-only snapshot of a single inference context (KV-cache) held in memory for a loaded model, as returned by GetLoadedContexts().

public sealed class ContextInfo

Inheritance: object → ContextInfo

Inherited Members:

object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.ReferenceEquals(object, object)

object.ToString()

Remarks

A loaded model keeps one context per concurrent session or in-flight request, plus any contexts retained in the recycle pool for reuse. Each context owns a KV-cache, which is the dominant per-session memory cost on top of the model weights. This type exposes that cost and the context's lifecycle state so callers can account for a model's full memory footprint and understand what is keeping it resident.

The values are captured at the moment of the call and never change afterwards; the underlying context is not exposed, so reading them cannot mutate inference state.
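As a rough illustration of the accounting described above, the sketch below sums the three size properties that the MemorySize description names as the full resident footprint of a session. ContextSnapshot and FootprintReport are hypothetical stand-ins introduced only for this example; they mirror the documented property names but are not part of LM-Kit.NET, and how the snapshots are obtained (for instance through GetLoadedContexts()) is left out because its owning type is not shown on this page.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in mirroring three documented ContextInfo properties.
// It is NOT the real LMKit.Inference.ContextInfo type.
public sealed record ContextSnapshot(
    string Id,
    long MemorySize,        // main KV-cache plus compute buffers, in bytes
    long DraftMemorySize,   // speculative-decoding draft context, in bytes
    long OutputBufferBytes  // output/logits buffer, in bytes
);

public static class FootprintReport
{
    // Resident footprint of one session:
    // MemorySize + DraftMemorySize + OutputBufferBytes.
    public static long ResidentBytes(ContextSnapshot c) =>
        c.MemorySize + c.DraftMemorySize + c.OutputBufferBytes;

    // Total across all snapshots. Hibernated contexts report 0 for all
    // three values, so they add nothing to the sum.
    public static long TotalResidentBytes(IEnumerable<ContextSnapshot> contexts) =>
        contexts.Sum(c => ResidentBytes(c));
}
```

With real ContextInfo instances the same arithmetic would apply to the properties of the same names. The total covers context memory only; the model weights are a separate cost.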

Constructors

ContextInfo(string, int, long, ContextResidency, bool, int, bool, bool, KVCacheType, long, long): Initializes a new ContextInfo snapshot.

Properties

ContextLength: Gets the context window size, in tokens.

DeviceNumber: Gets the number of the device the context resides on. -1 indicates the CPU; a value of 0 or greater is the GPU device number (see DeviceNumber), matching the convention used by MainGpu.

DraftMemorySize: Gets the size, in bytes, of the speculative-decoding draft (Multi-Token Prediction or attached draft-model) sibling context bound to this session: the draft's own compute buffers, plus its own KV-cache when it keeps one. Reported apart from MemorySize so the draft's footprint is visible on its own. When the draft shares the main context's KV-cache (an attached assistant draft linked through the target), that shared cache belongs to the main context and is counted in MemorySize, not here, so the two never overlap. Returns 0 when the session has no draft context or when the context is hibernated.

FlashAttention: Gets a value indicating whether flash-attention is enabled for the context.

Id: Gets the stable, unique identifier of the context.

IsCachePriority: Gets a value indicating whether the context is pinned, exempting it from cache eviction under memory pressure.

IsInUse: Gets a value indicating whether the context is actively held by a session or an in-flight request (true), or sits idle in the recycle pool kept warm for reuse (false).

KVCacheQuantization: Gets the data type the context's KV-cache is stored in, that is, its quantization level. F16 is the unquantized default; lower-precision types such as Q8_0 trade accuracy for a smaller per-token footprint.

MemorySize: Gets the context's main KV-cache plus scheduler-managed compute-buffer size, in bytes. Returns 0 when the context is hibernated (its memory has been released to disk; see Residency). This does not include the draft context (see DraftMemorySize) or the output/logits buffer (see OutputBufferBytes); the full resident footprint of the session is MemorySize + DraftMemorySize + OutputBufferBytes.

OutputBufferBytes: Gets the size, in bytes, of the context's output/logits buffer, allocated apart from the KV-cache and compute buffers and therefore not counted in MemorySize. This buffer scales with the model's vocabulary size and can be a substantial per-context allocation for large-vocabulary models. Returns 0 when the context is hibernated, or when the running native backend predates the query export (older redistributable binaries), in which case the bytes fall into the dashboard's unattributed bucket rather than being mis-reported.

Residency: Gets the residency of the context: whether it is live in memory, hibernated to disk, or not yet created.
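To see what is keeping memory resident, the IsInUse and IsCachePriority flags can be combined with MemorySize. The following is a minimal sketch under stated assumptions: PoolSnapshot and PoolReport are hypothetical names used only for illustration, mirroring the documented properties rather than the real ContextInfo type. It splits context memory into three buckets: held by active sessions, idle but pinned, and idle and unpinned.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in mirroring documented ContextInfo properties.
// It is NOT the real LMKit.Inference.ContextInfo type.
public sealed record PoolSnapshot(
    string Id,
    bool IsInUse,          // true: held by a session or in-flight request
    bool IsCachePriority,  // true: pinned, exempt from cache eviction
    long MemorySize        // bytes; 0 when the context is hibernated
);

public static class PoolReport
{
    // Buckets MemorySize by lifecycle state.
    public static (long InUse, long IdlePinned, long IdleUnpinned) Summarize(
        IEnumerable<PoolSnapshot> contexts)
    {
        long inUse = 0, idlePinned = 0, idleUnpinned = 0;
        foreach (var c in contexts)
        {
            if (c.IsInUse) inUse += c.MemorySize;
            else if (c.IsCachePriority) idlePinned += c.MemorySize;
            else idleUnpinned += c.MemorySize;
        }
        return (inUse, idlePinned, idleUnpinned);
    }
}
```

The idle, unpinned bucket is the portion that the IsCachePriority description implies is not exempt from eviction under memory pressure. Whether and when the runtime actually evicts those contexts is not specified here.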
