Can Multiple Users Share One LM-Kit.NET Instance?


TL;DR

Yes. A single loaded LM model instance can be shared across multiple conversations running on different threads. The model weights are loaded into memory only once and shared by all threads. Each conversation allocates its own KV-cache (context memory), which is a small fraction of the model size: typically 25 to 150 MB per conversation compared to several GB for the model itself. You do not need to duplicate model memory for each thread.


Thread Safety in the SDK

The LM class uses internal locking to ensure safe concurrent access. You load the model once and share it across as many MultiTurnConversation instances as you need:

using LMKit.Model;
using LMKit.TextGeneration;

// Load the model once
using LM model = LM.LoadFromModelID("qwen3.5:9b");

// Create separate conversations for each user (share the same model)
var userAChat = new MultiTurnConversation(model);
var userBChat = new MultiTurnConversation(model);

// These can run on different threads safely
var taskA = Task.Run(() => userAChat.Submit("What is machine learning?"));
var taskB = Task.Run(() => userBChat.Submit("Explain neural networks."));

await Task.WhenAll(taskA, taskB);

Each MultiTurnConversation maintains its own conversation history and state. The underlying model handles request serialization internally through a ConcurrentWaitingQueue, so concurrent submissions are queued and processed safely.


Concurrency Model

| Component | Behavior |
| --- | --- |
| LM instance | Thread-safe. Shared across conversations. Model weights are loaded once. |
| MultiTurnConversation | One instance per conversation. Maintains its own history and KV-cache. Multiple conversations can generate concurrently. |
| Agent | One instance per agent session. Can share the underlying LM with other agents and conversations. |

The SDK handles concurrency internally: each conversation acquires its own inference context, so multiple conversations sharing the same model can process requests concurrently. Within a single conversation, requests are serialized (a second request waits for the first to finish). This means N users submitting to N separate conversations can all be served in parallel without requiring N copies of the model.
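The two scheduling rules above can be sketched side by side. This is illustrative code built from the API calls shown earlier in this article (the model ID and prompts are placeholders):

```csharp
using LMKit.Model;
using LMKit.TextGeneration;

// Load the shared model once.
using LM model = LM.LoadFromModelID("qwen3.5:9b");

var sharedChat = new MultiTurnConversation(model);
var otherChat = new MultiTurnConversation(model);

// Same conversation: the second Submit is queued behind the first.
var first = Task.Run(() => sharedChat.Submit("Summarize topic A."));
var second = Task.Run(() => sharedChat.Submit("Now summarize topic B."));

// Separate conversation: can generate concurrently with the two above.
var parallel = Task.Run(() => otherChat.Submit("An independent question."));

await Task.WhenAll(first, second, parallel);
```

The calling code does not need its own locks in either case; the SDK applies per-conversation serialization and cross-conversation concurrency on its own.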


Scaling Patterns

| Pattern | When to Use | Setup |
| --- | --- | --- |
| Single model, shared by all conversations | Low to medium traffic. Simple deployment. | One LM instance shared by all conversations; throughput is bounded by that single instance. |
| Multiple model instances | Need more parallel generation throughput than one instance delivers. | Load the same model multiple times (each copy needs its own VRAM/RAM). |
| ASP.NET Core service | Multi-user production via REST API. | Host LM-Kit.NET in an ASP.NET Core application with your own endpoints. |
| Multiple app instances behind a load balancer | High traffic. Horizontal scaling. | Deploy multiple instances across machines. |
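The ASP.NET Core pattern can be sketched as a minimal API that registers the loaded LM as a singleton and creates a conversation per request. The endpoint shape, the `ChatRequest` record, and the model ID are illustrative; only `LM.LoadFromModelID` and `MultiTurnConversation.Submit` come from the example earlier in this article:

```csharp
using LMKit.Model;
using LMKit.TextGeneration;

var builder = WebApplication.CreateBuilder(args);

// Load the model weights once; every request shares this instance.
builder.Services.AddSingleton(_ => LM.LoadFromModelID("qwen3.5:9b"));

var app = builder.Build();

// Each request gets its own conversation (and thus its own KV-cache).
app.MapPost("/chat", (LM model, ChatRequest request) =>
{
    var chat = new MultiTurnConversation(model);
    var answer = chat.Submit(request.Prompt);
    return Results.Ok(new { answer });
});

app.Run();

record ChatRequest(string Prompt);
```

Note that creating a conversation per request discards history between calls; for true multi-turn chat you would keep one MultiTurnConversation per user session (for example in a session-keyed dictionary) rather than per request.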

Model Weights vs Per-Conversation Memory

When a model is loaded, its memory footprint has two distinct parts:

| Memory Component | Loaded When | Shared? | Typical Size |
| --- | --- | --- | --- |
| Model weights | LM.LoadFromModelID() is called | Yes, shared by all conversations | 2 to 20 GB depending on model size and quantization |
| KV-cache (context) | A MultiTurnConversation or Agent is created | No, private to each conversation | 25 to 150 MB per conversation |

Model weights are the neural network parameters. They are loaded into VRAM (or RAM) once and shared read-only across every conversation that uses the model; creating additional conversations does not duplicate them. (Loading a second LM instance, by contrast, does allocate a second copy of the weights.)

KV-cache (key-value cache) stores the attention state for each conversation's context window. It is the per-conversation memory overhead, and its size depends on context length, model architecture, and KV-cache quantization level. Even at full precision, the KV-cache is a small fraction of total model memory.

Memory Example

Consider a 7B parameter model (Q4_K_M quantization, ~4.9 GB) serving 10 concurrent conversations with 8192-token context each:

| Component | Memory |
| --- | --- |
| Model weights (shared) | ~4.9 GB |
| KV-cache per conversation (F16) | ~100 MB |
| 10 conversations total | 4.9 GB + 10 × 100 MB ≈ 5.9 GB |

For comparison, loading 10 separate model instances would require ~49 GB. Sharing one model instance keeps memory usage nearly flat as you add more conversations.
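The arithmetic above reduces to one shared copy of the weights plus one KV-cache per conversation. A quick sanity check, using the article's figures:

```csharp
// Memory budget: shared weights + per-conversation KV-cache.
// Sizes are the article's example figures (7B Q4_K_M, F16 KV, 8K context).
double modelWeightsGB = 4.9;
double kvCachePerConvMB = 100;
int conversations = 10;

double sharedTotalGB = modelWeightsGB + conversations * kvCachePerConvMB / 1024.0;
Console.WriteLine($"Shared model:    {sharedTotalGB:F1} GB");   // prints "Shared model:    5.9 GB"

// Naive alternative: one full model copy per conversation.
double duplicatedGB = conversations * modelWeightsGB;
Console.WriteLine($"10 full copies:  {duplicatedGB:F0} GB");    // prints "10 full copies:  49 GB"
```

Because the per-conversation term is two orders of magnitude smaller than the weights, total memory grows only slowly as conversations are added.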

KV-Cache Quantization

LM-Kit.NET supports KV-cache quantization to reduce per-conversation memory further:

| KV-Cache Precision | Per-Conversation Overhead (7B, 8K context) | Use Case |
| --- | --- | --- |
| F16 (default) | ~100 MB | Best quality, recommended default |
| Q8_0 | ~55 MB | Good balance of quality and memory |
| Q4_0 | ~30 MB | Maximum density, minimal quality impact for most tasks |

Configure via Configuration.KVCacheQuantizationLevel before creating conversations.
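As a sketch of that configuration step: the article names the Configuration.KVCacheQuantizationLevel property but not its value type, so the `KVCacheQuantization.Q8_0` enum value below is an assumed name used purely for illustration:

```csharp
using LMKit.Model;
using LMKit.TextGeneration;

// Assumed enum value name: set the KV-cache precision BEFORE creating
// any conversation, since the cache is allocated per conversation.
Configuration.KVCacheQuantizationLevel = KVCacheQuantization.Q8_0;

using LM model = LM.LoadFromModelID("qwen3.5:9b");

// This conversation's KV-cache is now allocated at Q8_0 (~55 MB instead
// of ~100 MB for an 8K context on a 7B model, per the table above).
var chat = new MultiTurnConversation(model);
```

Check the LM-Kit.NET API reference for the exact enum type and namespace before relying on this.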


Memory Considerations for Scaling

A single model instance with multiple conversations covers most deployment scenarios efficiently. You only need to load a second model instance if you need more parallel throughput than one instance can deliver:

| Setup | Parallel Throughput | Memory Usage |
| --- | --- | --- |
| 1 model, 10 conversations | Concurrent (shared model) | 1x model weights + 10x KV-cache |
| 1 model, 100 conversations | Concurrent (shared model) | 1x model weights + 100x KV-cache |
| 2 model instances, 100 conversations | 2x throughput ceiling | 2x model weights + 100x KV-cache |

For most deployments, a single model instance handles typical multi-user traffic well because individual inference requests are fast (especially on GPU) and the SDK manages context scheduling internally.
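If profiling does show one instance saturating, the two-instance setup from the table can be sketched as follows, reusing the API calls from earlier in this article (the round-robin assignment is an illustrative choice, not an SDK feature):

```csharp
using LMKit.Model;
using LMKit.TextGeneration;

// Two full copies of the weights: roughly 2x the model's memory footprint.
using LM modelA = LM.LoadFromModelID("qwen3.5:9b");
using LM modelB = LM.LoadFromModelID("qwen3.5:9b");

// Spread conversations across the two instances round-robin; each instance
// serves its share concurrently, doubling the throughput ceiling.
var chats = new List<MultiTurnConversation>();
for (int i = 0; i < 100; i++)
{
    LM assigned = (i % 2 == 0) ? modelA : modelB;
    chats.Add(new MultiTurnConversation(assigned));
}
```

Measure before adopting this: it only pays off when a single instance is compute-bound, and it doubles weight memory regardless.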
