Can I Run Multiple AI Models at the Same Time?


TL;DR

Yes. LM-Kit.NET places no restriction on the number of models loaded simultaneously. You can run a chat model, an embedding model, and a Whisper speech model all at once. Each LM instance consumes its own memory, so the practical limit is your available VRAM and RAM. Most real-world applications load at least two models (chat + embeddings).


Common Multi-Model Scenarios

| Scenario | Models Needed | Typical Memory |
|----------|---------------|----------------|
| RAG pipeline | Chat model + embedding model | 5 GB + 300 MB |
| Document processing agent | Chat model + embedding model + vision model | 5 GB + 300 MB + 1.4 GB |
| Meeting assistant | Whisper model + chat model | 874 MB + 5 GB |
| Multilingual pipeline | Chat model + embedding model + reranker | 5 GB + 438 MB + 438 MB |
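Summing the per-model footprints gives a quick feasibility check before loading anything. A minimal back-of-envelope sketch, using the figures from the document-processing row above (actual usage is higher, because each model also allocates context memory, which is what MemoryEstimation measures):

```csharp
// Back-of-envelope budget for a document processing agent:
// chat (5 GB) + embedding (300 MB) + vision (1.4 GB)
double chatGb = 5.0;
double embeddingGb = 0.3;
double visionGb = 1.4;
double totalGb = chatGb + embeddingGb + visionGb;

double availableVramGb = 8.0; // example: an 8 GB GPU
Console.WriteLine($"Needed: {totalGb:F1} GB, fits on GPU: {totalGb <= availableVramGb}");
```

If the total exceeds VRAM, the overflow spills to system RAM (CPU layers), so the load still succeeds but runs slower.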

Loading Multiple Models

Each model is an independent LM instance. Load them one at a time (or in parallel) and pass them to the components that need them:

using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Agents;
using LMKit.Agents.Tools.BuiltIn;

// Load an embedding model for RAG
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");

// Load a chat model for the agent
using LM chatModel = LM.LoadFromModelID("qwen3.5:9b");

// Use both together
var ragEngine = new RagEngine(embeddingModel);
ragEngine.ImportDocument("knowledge-base.pdf");

var agent = Agent.CreateBuilder(chatModel)
    .WithTools(tools =>
    {
        tools.Register(BuiltInTools.Calculator);
    })
    .Build();

Memory Planning

Each model consumes memory independently. Use MemoryEstimation.FitParameters() to check each model before loading:

using LMKit.Model;

// Check if both models fit
var fitEmbedding = MemoryEstimation.FitParameters("embeddinggemma-300m.lmk", contextSize: 512);
var fitChat = MemoryEstimation.FitParameters("qwen3.5-9b-Q4_K_M.lmk", contextSize: 8192);

Console.WriteLine($"Embedding: {fitEmbedding.Success}, GPU layers: {fitEmbedding.GpuLayerCount}");
Console.WriteLine($"Chat: {fitChat.Success}, GPU layers: {fitChat.GpuLayerCount}");

Tip: Load the larger model first. The MemoryEstimation API accounts for currently used memory, so checking the chat model first (before the embedding model occupies memory) gives you the most accurate GPU layer estimate.
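Putting the tip into practice: check and load the larger chat model first, then re-check the smaller embedding model against whatever memory remains. This sketch only reuses the MemoryEstimation and LM calls shown above; the model IDs, file names, and context sizes are the same illustrative values:

```csharp
using LMKit.Model;

// 1. Check the large chat model while memory is still free, then load it
var fitChat = MemoryEstimation.FitParameters("qwen3.5-9b-Q4_K_M.lmk", contextSize: 8192);
if (!fitChat.Success)
    throw new InvalidOperationException("Not enough memory for the chat model.");
using LM chatModel = LM.LoadFromModelID("qwen3.5:9b");

// 2. The estimate for the embedding model now reflects the memory
//    the chat model already occupies
var fitEmbedding = MemoryEstimation.FitParameters("embeddinggemma-300m.lmk", contextSize: 512);
Console.WriteLine($"Embedding: fits={fitEmbedding.Success}, GPU layers={fitEmbedding.GpuLayerCount}");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
```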


GPU Memory Sharing

When multiple models are loaded on the same GPU, they share the available VRAM. This means:

  • The first model loaded gets the most GPU layers.
  • Subsequent models may get fewer GPU layers or fall back to CPU if VRAM is exhausted.
  • Small embedding models (67 to 438 MB) have minimal impact on available VRAM for the main chat model.
  • Whisper models (44 MB to 1.7 GB) vary widely; the turbo variant (874 MB) is a good balance.

If you have limited VRAM, consider keeping the chat model on GPU and running embedding and speech models on CPU. Embedding and speech models are typically small enough that CPU inference is fast.


Loading Models in Parallel

For faster startup, load independent models concurrently:

using LMKit.Model;

// Load independent models concurrently to cut startup time
var embeddingTask = Task.Run(() => LM.LoadFromModelID("embeddinggemma-300m"));
var chatTask = Task.Run(() => LM.LoadFromModelID("qwen3.5:9b"));

await Task.WhenAll(embeddingTask, chatTask);

// Awaiting the completed tasks is cleaner than .Result
// (no AggregateException wrapping if a load failed)
using LM embeddingModel = await embeddingTask;
using LM chatModel = await chatTask;

Disposing Models to Free Memory

When you no longer need a model, dispose it to reclaim memory for other models:

// Assume whisperModel was loaded earlier without a 'using' declaration.
// Done with speech processing: dispose it to free memory for a larger chat model.
whisperModel.Dispose();

// Now load a larger model that needs the freed memory
using LM largeModel = LM.LoadFromModelID("gemma4:e4b");
