Can Multiple Users Share One LM-Kit.NET Instance?
TL;DR
Yes. A single loaded LM model instance can be shared across multiple conversations running on different threads. The model weights are loaded into memory only once and shared by all threads. Each conversation allocates its own KV-cache (context memory), which is a small fraction of the model size: typically 25 to 150 MB per conversation compared to several GB for the model itself. You do not need to duplicate model memory for each thread.
Thread Safety in the SDK
The LM class uses internal locking to ensure safe concurrent access. You load the model once and share it across as many MultiTurnConversation instances as you need:
```csharp
using LMKit.Model;
using LMKit.TextGeneration;

// Load the model once
using LM model = LM.LoadFromModelID("qwen3.5:9b");

// Create a separate conversation for each user (sharing the same model)
var userAChat = new MultiTurnConversation(model);
var userBChat = new MultiTurnConversation(model);

// These can run on different threads safely
var taskA = Task.Run(() => userAChat.Submit("What is machine learning?"));
var taskB = Task.Run(() => userBChat.Submit("Explain neural networks."));
await Task.WhenAll(taskA, taskB);
```
Each MultiTurnConversation maintains its own conversation history and state. The underlying model handles request serialization internally through a ConcurrentWaitingQueue, so concurrent submissions are queued and processed safely.
Concurrency Model
| Component | Behavior |
|---|---|
| LM instance | Thread-safe. Shared across conversations. Model weights are loaded once. |
| MultiTurnConversation | One instance per conversation. Maintains its own history and KV-cache. Multiple conversations can generate concurrently. |
| Agent | One instance per agent session. Can share the underlying LM with other agents and conversations. |
The SDK handles concurrency internally: each conversation acquires its own inference context, so multiple conversations sharing the same model can process requests concurrently. Within a single conversation, requests are serialized (a second request waits for the first to finish). This means N users submitting to N separate conversations can all be served in parallel without requiring N copies of the model.
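The distinction matters when a single user fires several requests in quick succession. A minimal sketch of both behaviors, reusing the same types as the example above (prompts are illustrative):

```csharp
using LMKit.Model;
using LMKit.TextGeneration;

using LM model = LM.LoadFromModelID("qwen3.5:9b");

var chatA = new MultiTurnConversation(model);
var chatB = new MultiTurnConversation(model);

// Two DIFFERENT conversations on the same model: these generate concurrently.
await Task.WhenAll(
    Task.Run(() => chatA.Submit("Summarize attention mechanisms.")),
    Task.Run(() => chatB.Submit("Explain tokenization.")));

// Two requests on the SAME conversation: the second is queued
// internally and waits for the first to finish.
var first = Task.Run(() => chatA.Submit("First question"));
var second = Task.Run(() => chatA.Submit("A follow-up question"));
await Task.WhenAll(first, second);
```

Because per-conversation serialization is handled by the SDK, application code does not need its own locking around Submit calls.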
Scaling Patterns
| Pattern | When to Use | Setup |
|---|---|---|
| Single model, queued requests | Low to medium traffic. Simple deployment. | One LM instance shared by all conversations. Requests are serialized. |
| Multiple model instances | Need parallel generation for throughput. | Load the same model multiple times (each needs its own VRAM/RAM). |
| ASP.NET Core service | Multi-user production via REST API. | Host LM-Kit.NET in an ASP.NET Core application with your own endpoints. |
| Multiple app instances behind a load balancer | High traffic. Horizontal scaling. | Deploy multiple instances across machines. |
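For the ASP.NET Core pattern, the usual shape is: register the loaded model as a singleton and create one conversation per user session. A minimal sketch under those assumptions; the endpoint route and the in-memory session registry are illustrative, not part of the SDK:

```csharp
using System.Collections.Concurrent;
using LMKit.Model;
using LMKit.TextGeneration;

var builder = WebApplication.CreateBuilder(args);

// One model for the whole process: weights are loaded once and shared.
builder.Services.AddSingleton(_ => LM.LoadFromModelID("qwen3.5:9b"));

var app = builder.Build();

// One conversation per session (illustrative in-memory registry;
// a production service would also evict idle sessions).
var sessions = new ConcurrentDictionary<string, MultiTurnConversation>();

app.MapPost("/chat/{sessionId}", (string sessionId, string prompt, LM model) =>
{
    var chat = sessions.GetOrAdd(sessionId, _ => new MultiTurnConversation(model));
    // Requests within one session are serialized by the SDK;
    // different sessions generate concurrently.
    return chat.Submit(prompt);
});

app.Run();
```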
Model Weights vs Per-Conversation Memory
When a model is loaded, its memory footprint has two distinct parts:
| Memory Component | Loaded When | Shared? | Typical Size |
|---|---|---|---|
| Model weights | LM.LoadFromModelID() is called | Yes, shared by all conversations | 2 to 20 GB depending on model size and quantization |
| KV-cache (context) | A MultiTurnConversation or Agent is created | No, private to each conversation | 25 to 150 MB per conversation |
Model weights are the neural network parameters. They are loaded into VRAM (or RAM) once and shared read-only across every conversation that uses the model. Adding more conversations never duplicates these weights; only loading a second LM instance does.
KV-cache (key-value cache) stores the attention state for each conversation's context window. It is the per-conversation memory overhead, and its size depends on context length, model architecture, and KV-cache quantization level. Even at full precision, the KV-cache is a small fraction of total model memory.
Memory Example
Consider a 7B parameter model (Q4_K_M quantization, ~4.9 GB) serving 10 concurrent conversations with 8192-token context each:
| Component | Memory |
|---|---|
| Model weights (shared) | ~4.9 GB |
| KV-cache per conversation (F16) | ~100 MB |
| 10 conversations total | 4.9 GB + 10 × 100 MB ≈ 5.9 GB |
For comparison, loading 10 separate model instances would require ~49 GB. Sharing one model instance keeps memory usage nearly flat as you add more conversations.
KV-Cache Quantization
LM-Kit.NET supports KV-cache quantization to reduce per-conversation memory further:
| KV-Cache Precision | Per-Conversation Overhead (7B, 8K context) | Use Case |
|---|---|---|
| F16 (default) | ~100 MB | Best quality, recommended default |
| Q8_0 | ~55 MB | Good balance of quality and memory |
| Q4_0 | ~30 MB | Maximum density, minimal quality impact for most tasks |
Configure via Configuration.KVCacheQuantizationLevel before creating conversations.
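In code, that looks roughly like the following. Note this is a hedged sketch: the containing namespace and the exact enum member names are assumptions here, so check the SDK reference for the values your version exposes:

```csharp
using LMKit.Global; // assumed namespace for the static Configuration class

// Must be set before any MultiTurnConversation or Agent is created,
// since the KV-cache is allocated at conversation creation time.
Configuration.KVCacheQuantizationLevel = KVCacheQuantizationLevel.Q8_0; // assumed enum name
```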
Memory Considerations for Scaling
A single model instance with multiple conversations covers most deployment scenarios efficiently. You only need to load a second model instance if you need more parallel throughput than one instance can deliver:
| Setup | Parallel Throughput | Memory Usage |
|---|---|---|
| 1 model, 10 conversations | Concurrent (shared model) | 1x model weights + 10x KV-cache |
| 1 model, 100 conversations | Concurrent (shared model) | 1x model weights + 100x KV-cache |
| 2 model instances, 100 conversations | 2x throughput ceiling | 2x model weights + 100x KV-cache |
For most deployments, a single model instance handles typical multi-user traffic well because individual inference requests are fast (especially on GPU) and the SDK manages context scheduling internally.
📚 Related Content
- How fast is local inference compared to cloud APIs?: Understand throughput to plan capacity for multi-user deployments.
- What happens when a model does not fit in my GPU memory?: Manage memory when running multiple model instances.
- What .NET frameworks and integrations does LM-Kit.NET support?: ASP.NET Core hosting and integration options for server deployments.
- How does LM-Kit.NET compare to cloud AI APIs?: When local multi-user serving makes more sense than cloud APIs.