Can Multiple Users Share One LM-Kit.NET Instance?
TL;DR
Yes. A single loaded LM model instance can be shared across multiple conversations running on different threads. The model weights are loaded into memory only once and shared by all threads. Each conversation allocates its own KV-cache (context memory), which is a small fraction of the model size: typically 25 to 150 MB per conversation compared to several GB for the model itself. You do not need to duplicate model memory for each thread.
Thread Safety in the SDK
The LM class uses internal locking to ensure safe concurrent access. You load the model once and share it across as many MultiTurnConversation instances as you need:
```csharp
using LMKit.Model;
using LMKit.TextGeneration;

// Load the model once
using LM model = LM.LoadFromModelID("qwen3.5:9b");

// Create a separate conversation for each user (sharing the same model)
var userAChat = new MultiTurnConversation(model);
var userBChat = new MultiTurnConversation(model);

// These can run on different threads safely
var taskA = Task.Run(() => userAChat.Submit("What is machine learning?"));
var taskB = Task.Run(() => userBChat.Submit("Explain neural networks."));
await Task.WhenAll(taskA, taskB);
```
Each MultiTurnConversation maintains its own conversation history and state. The underlying model handles request serialization internally through a ConcurrentWaitingQueue, so concurrent submissions are queued and processed safely.
Concurrency Model
| Component | Behavior |
|---|---|
| LM instance | Thread-safe. Shared across conversations. Model weights are loaded once. |
| MultiTurnConversation | One instance per conversation. Maintains its own history and KV-cache. Multiple conversations can generate concurrently. |
| Agent | One instance per agent session. Can share the underlying LM with other agents and conversations. |
The SDK handles concurrency internally: each conversation acquires its own inference context, so multiple conversations sharing the same model can process requests concurrently. Within a single conversation, requests are serialized (a second request waits for the first to finish). This means N users submitting to N separate conversations can all be served in parallel without requiring N copies of the model.
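The distinction matters when a single user fires several requests in quick succession. A minimal sketch of both behaviors, reusing the same types as the example above (prompts are illustrative):

```csharp
using LMKit.Model;
using LMKit.TextGeneration;

using LM model = LM.LoadFromModelID("qwen3.5:9b");

var chatA = new MultiTurnConversation(model);
var chatB = new MultiTurnConversation(model);

// Two DIFFERENT conversations on the same model: these generate concurrently.
await Task.WhenAll(
    Task.Run(() => chatA.Submit("Summarize attention mechanisms.")),
    Task.Run(() => chatB.Submit("Explain tokenization.")));

// Two requests on the SAME conversation: the second is queued
// internally and waits for the first to finish.
var first = Task.Run(() => chatA.Submit("First question"));
var second = Task.Run(() => chatA.Submit("A follow-up question"));
await Task.WhenAll(first, second);
```

Because per-conversation serialization is handled by the SDK, application code does not need its own locking around Submit calls.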
Scaling Patterns
| Pattern | When to Use | Setup |
|---|---|---|
| Single model, queued requests | Low to medium traffic. Simple deployment. | One LM instance shared by all conversations. Requests are serialized. |
| Multiple model instances | Need parallel generation for throughput. | Load the same model multiple times (each needs its own VRAM/RAM). |
| ASP.NET Core service | Multi-user production via REST API. | Host LM-Kit.NET in an ASP.NET Core application with your own endpoints. |
| Multiple app instances behind a load balancer | High traffic. Horizontal scaling. | Deploy multiple instances across machines. |
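For the ASP.NET Core pattern, the usual shape is: register the loaded model as a singleton and create one conversation per user session. A minimal sketch under those assumptions; the endpoint route and the in-memory session registry are illustrative, not part of the SDK:

```csharp
using System.Collections.Concurrent;
using LMKit.Model;
using LMKit.TextGeneration;

var builder = WebApplication.CreateBuilder(args);

// One model for the whole process: weights are loaded once and shared.
builder.Services.AddSingleton(_ => LM.LoadFromModelID("qwen3.5:9b"));

var app = builder.Build();

// One conversation per session (illustrative in-memory registry;
// a production service would also evict idle sessions).
var sessions = new ConcurrentDictionary<string, MultiTurnConversation>();

app.MapPost("/chat/{sessionId}", (string sessionId, string prompt, LM model) =>
{
    var chat = sessions.GetOrAdd(sessionId, _ => new MultiTurnConversation(model));
    // Requests within one session are serialized by the SDK;
    // different sessions generate concurrently.
    return chat.Submit(prompt);
});

app.Run();
```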
Model Weights vs Per-Conversation Memory
When a model is loaded, its memory footprint has two distinct parts:
| Memory Component | Loaded When | Shared? | Typical Size |
|---|---|---|---|
| Model weights | LM.LoadFromModelID() is called | Yes, shared by all conversations | 2 to 20 GB depending on model size and quantization |
| KV-cache (context) | A MultiTurnConversation or Agent is created | No, private to each conversation | 25 to 150 MB per conversation |
Model weights are the neural network parameters. They are loaded into VRAM (or RAM) once and shared read-only across every conversation that uses the model. Adding more conversations never duplicates these weights; only loading a second LM instance does.
KV-cache (key-value cache) stores the attention state for each conversation's context window. It is the per-conversation memory overhead, and its size depends on context length, model architecture, and KV-cache quantization level. Even at full precision, the KV-cache is a small fraction of total model memory.
Memory Example
Consider a 7B parameter model (Q4_K_M quantization, ~4.9 GB) serving 10 concurrent conversations with 8192-token context each:
| Component | Memory |
|---|---|
| Model weights (shared) | ~4.9 GB |
| KV-cache per conversation (F16) | ~100 MB |
| 10 conversations total | 4.9 GB + 10 × 100 MB ≈ 5.9 GB |
For comparison, loading 10 separate model instances would require ~49 GB. Sharing one model instance keeps memory usage nearly flat as you add more conversations.
KV-Cache Quantization
LM-Kit.NET supports KV-cache quantization to reduce per-conversation memory further:
| KV-Cache Precision | Per-Conversation Overhead (7B, 8K context) | Use Case |
|---|---|---|
| F16 (default) | ~100 MB | Best quality, recommended default |
| Q8_0 | ~55 MB | Good balance of quality and memory |
| Q4_0 | ~30 MB | Maximum density, minimal quality impact for most tasks |
Configure via Configuration.KVCacheQuantizationLevel before creating conversations.
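In code, that looks roughly like the following. Note this is a hedged sketch: the containing namespace and the exact enum member names are assumptions here, so check the SDK reference for the values your version exposes:

```csharp
using LMKit.Global; // assumed namespace for the static Configuration class

// Must be set before any MultiTurnConversation or Agent is created,
// since the KV-cache is allocated at conversation creation time.
Configuration.KVCacheQuantizationLevel = KVCacheQuantizationLevel.Q8_0; // assumed enum name
```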
Memory Considerations for Scaling
A single model instance with multiple conversations covers most deployment scenarios efficiently. You only need to load a second model instance if you need more parallel throughput than one instance can deliver:
| Setup | Parallel Throughput | Memory Usage |
|---|---|---|
| 1 model, 10 conversations | Concurrent (shared model) | 1x model weights + 10x KV-cache |
| 1 model, 100 conversations | Concurrent (shared model) | 1x model weights + 100x KV-cache |
| 2 model instances, 100 conversations | 2x throughput ceiling | 2x model weights + 100x KV-cache |
For most deployments, a single model instance handles typical multi-user traffic well because individual inference requests are fast (especially on GPU) and the SDK manages context scheduling internally.
📚 Related Content
- How fast is local inference compared to cloud APIs?: Understand throughput to plan capacity for multi-user deployments.
- What happens when a model does not fit in my GPU memory?: Manage memory when running multiple model instances.
- What .NET frameworks and integrations does LM-Kit.NET support?: ASP.NET Core hosting and integration options for server deployments.
- How does LM-Kit.NET compare to cloud AI APIs?: When local multi-user serving makes more sense than cloud APIs.