Build and Deploy an Offline AI Application for Edge Environments
Many production scenarios require AI to run without any internet connection: factory floors, medical devices, remote field equipment, and classified environments. This guide walks through the complete workflow of selecting a model that fits your hardware, downloading it ahead of time, configuring the inference backend for your device, and packaging everything into a self-contained .NET deployment that runs fully offline.
Why Offline AI Deployment Matters
Two enterprise problems that offline edge AI solves:
- Data sovereignty and air-gapped compliance. Defense contractors, healthcare organizations, and financial institutions operate networks that are physically disconnected from the internet. Sending patient records or classified documents to a cloud API is not an option. A local LLM embedded in the application keeps all data on-premises, simplifying compliance with ITAR, HIPAA, and SOC 2.
- Field operations with no connectivity. Offshore oil rigs, disaster relief teams, and agricultural inspectors work in environments where internet access is unreliable or nonexistent. An AI assistant that requires an API call for every prompt is unusable. An offline model on a ruggedized laptop works anywhere.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | Varies by model (see Step 2) |
| Disk | 1 to 15 GB depending on model size |
Step 1: Create the Project
```bash
dotnet new console -n OfflineAiApp
cd OfflineAiApp
dotnet add package LM-Kit.NET
```
Step 2: Choose a Model That Fits Your Hardware
The model catalog contains predefined models with metadata about size, quantization level, capabilities, and context length. Use this metadata to pick a model that fits your target device.
```csharp
using LMKit.Model;

// List all predefined models with their memory requirements
var models = ModelCard.GetPredefinedModelCards();

Console.WriteLine($"{"Model ID",-30} {"Size (MB)",10} {"Quant",8} {"Ctx",6} {"Capabilities"}");
Console.WriteLine(new string('-', 90));

foreach (var card in models)
{
    // Skip embedding and speech models for this example
    if (!card.Capabilities.HasFlag(ModelCapabilities.Chat))
        continue;

    double sizeMB = card.FileSize / (1024.0 * 1024.0);
    Console.WriteLine(
        $"{card.ModelID,-30} {sizeMB,10:F0} {card.QuantizationPrecision,8:F1} {card.ContextLength,6} {card.Capabilities}");
}
```
A practical rule of thumb: the model file size is roughly the minimum VRAM needed for inference. A 2 GB model needs approximately 2 GB of VRAM (plus overhead for the KV-cache and context).
| Target Device | VRAM Budget | Recommended Models |
|---|---|---|
| Low-end laptop (no GPU) | CPU only | gemma3:1b, qwen3:0.6b, phi4-mini:3.8b |
| Mid-range laptop (4 GB GPU) | ~4 GB | gemma3:4b, qwen3:4b |
| Workstation (8+ GB GPU) | 8+ GB | qwen3:8b, gemma3:12b, llama3.1:8b |
| Server (16+ GB GPU) | 16+ GB | qwen3:14b, phi4:14.7b, gemma3:27b |
For edge AI scenarios, prefer smaller models with more aggressive quantization. A gemma3:4b model at Q4_K_M quantization offers an excellent quality-to-size ratio for most tasks.
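The selection rule above can be sketched as a small filter: pick the largest chat model whose file size, plus headroom for the KV-cache, fits the device's VRAM budget. The `(Id, SizeMB)` pairs below are illustrative stand-ins, not the live `ModelCard` catalog, and the 512 MB overhead figure is an assumption.

```csharp
using System;
using System.Linq;

// Illustrative model list; in a real app, build this from
// ModelCard.GetPredefinedModelCards() instead.
var candidates = new (string Id, double SizeMB)[]
{
    ("gemma3:1b",  800),
    ("qwen3:4b",  2500),
    ("gemma3:4b", 3300),
    ("qwen3:8b",  5000),
};

double vramBudgetMB = 4096; // e.g. a 4 GB GPU
double overheadMB   = 512;  // assumed headroom for KV-cache and context

// Largest model that fits the budget after reserving overhead
var best = candidates
    .Where(c => c.SizeMB + overheadMB <= vramBudgetMB)
    .OrderByDescending(c => c.SizeMB)
    .FirstOrDefault();

Console.WriteLine(best.Id is null
    ? "No model fits; fall back to CPU inference."
    : $"Selected {best.Id} ({best.SizeMB:F0} MB) for a {vramBudgetMB:F0} MB budget.");
```

With a 4 GB budget this selects gemma3:4b, matching the mid-range laptop row in the table above.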
Step 3: Download Models for Offline Packaging
Before deploying to an air-gapped device, download models to a local folder on a connected machine.
```csharp
using System.Text;
using LMKit.Model;

Console.OutputEncoding = Encoding.UTF8;

// Choose your target model
var card = ModelCard.GetPredefinedModelCardByModelID("gemma3:4b");
double sizeMB = card.FileSize / (1024.0 * 1024.0);

Console.WriteLine($"Model: {card.ModelID}");
Console.WriteLine($"Size: {sizeMB:F0} MB");
Console.WriteLine($"Context: {card.ContextLength} tokens");

// Download to a specific folder for offline packaging
string offlineModelPath = Path.Combine(AppContext.BaseDirectory, "models");
Directory.CreateDirectory(offlineModelPath);

Console.WriteLine($"\nDownloading to {offlineModelPath}...");

card.Download(downloadingProgress: (path, contentLength, bytesRead) =>
{
    if (contentLength.HasValue && contentLength.Value > 0)
    {
        double percent = (double)bytesRead / contentLength.Value * 100;
        Console.Write($"\r  {percent:F1}% ({bytesRead / (1024 * 1024)} / {contentLength.Value / (1024 * 1024)} MB)");
    }
    return true; // return false to cancel
});

Console.WriteLine("\nDownload complete.");

// Verify the file is ready for offline use
if (card.IsLocallyAvailable)
{
    Console.WriteLine($"Model verified at: {card.LocalPath}");
}
```
Tip: After downloading, copy the entire `models/` folder to your air-gapped device. The model file is self-contained and requires no additional downloads at runtime.
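Before trusting a file that was copied across an air gap, it is worth confirming the transfer was not corrupted. LM-Kit has its own check (`ModelCard.ValidateFileChecksum()`, see Common Issues below); the sketch here is a generic, library-free alternative: record a SHA-256 hash on the connected machine, then re-hash on the target. The file name and placeholder hash are illustrative.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hash a file with SHA-256; compare against a hash captured before transfer.
static string Sha256Of(string path)
{
    using var stream = File.OpenRead(path);
    return Convert.ToHexString(SHA256.HashData(stream));
}

string modelFile = Path.Combine("models", "gemma-3-4b-it-Q4_K_M.lmk");
string expectedHash = "..."; // hash recorded on the connected machine

if (File.Exists(modelFile))
{
    Console.WriteLine(Sha256Of(modelFile) == expectedHash
        ? "Checksum OK, model is intact."
        : "Checksum mismatch, re-copy the model file.");
}
```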
Step 4: Configure the Inference Backend
On the target device, configure the backend to match available hardware. LM-Kit.NET supports CPU, CUDA (NVIDIA), Vulkan (cross-platform GPU), and Metal (macOS).
```csharp
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

// ──────────────────────────────────────
// Option A: Let LM-Kit auto-detect the best GPU
// ──────────────────────────────────────
var autoConfig = new LM.DeviceConfiguration();
// GpuLayerCount defaults to int.MaxValue (offload all layers)
// MainGpu defaults to the best GPU detected on the system

// ──────────────────────────────────────
// Option B: Force CPU-only for devices without a GPU
// ──────────────────────────────────────
var cpuConfig = new LM.DeviceConfiguration
{
    GpuLayerCount = 0 // Zero layers on GPU = pure CPU inference
};

// ──────────────────────────────────────
// Option C: Partial GPU offload for limited VRAM
// ──────────────────────────────────────
var partialConfig = new LM.DeviceConfiguration
{
    GpuLayerCount = 20 // Offload 20 layers to GPU, rest on CPU
};
```
Choosing the Right Backend
| Backend | When to Use | Platform |
|---|---|---|
| CPU (SSE4/AVX2) | No GPU available, or model fits in RAM | All |
| CUDA 12/13 | NVIDIA GPU with sufficient VRAM | Windows, Linux |
| Vulkan | AMD/Intel/NVIDIA GPU (cross-platform) | Windows, Linux |
| Metal | Apple Silicon Macs | macOS |
LM-Kit.NET selects the appropriate backend automatically based on the installed runtime package. See Configure GPU Backends for detailed backend setup instructions.
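For Option C above, a reasonable starting value for `GpuLayerCount` can be estimated from free VRAM and the model file size, assuming (simplistically) that each transformer layer occupies an equal share of the file. This is a rough heuristic, not an LM-Kit API; the layer count and sizes below are illustrative.

```csharp
using System;

// Estimate how many layers fit in free VRAM, then feed the result into
// LM.DeviceConfiguration.GpuLayerCount. Assumes equal-sized layers and
// leaves KV-cache headroom to the caller.
static int EstimateGpuLayers(long modelFileBytes, int totalLayers, long freeVramBytes)
{
    if (totalLayers <= 0 || freeVramBytes <= 0) return 0;
    long bytesPerLayer = modelFileBytes / totalLayers;
    if (bytesPerLayer == 0) return totalLayers;
    long fit = freeVramBytes / bytesPerLayer;
    return (int)Math.Min(fit, totalLayers);
}

long modelBytes = 3_460_300_800; // ~3.3 GB model file (illustrative)
int layers = 34;                 // illustrative layer count
long freeVram = 2L * 1024 * 1024 * 1024; // 2 GB free VRAM

int gpuLayers = EstimateGpuLayers(modelBytes, layers, freeVram);
Console.WriteLine($"Offload {gpuLayers}/{layers} layers to the GPU.");
// var cfg = new LM.DeviceConfiguration { GpuLayerCount = gpuLayers };
```

If inference still runs out of memory, lower the value further; the rest of the layers simply stay on the CPU.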
Step 5: Load the Model Offline
Load the model from a local file path without any network access.
```csharp
using System.Text;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Configure for target hardware
// ──────────────────────────────────────
var device = new LM.DeviceConfiguration
{
    GpuLayerCount = 0 // CPU-only for this example
};

// ──────────────────────────────────────
// 2. Load from local file (no internet required)
// ──────────────────────────────────────
string modelPath = Path.Combine(AppContext.BaseDirectory, "models", "gemma-3-4b-it-Q4_K_M.lmk");

Console.Write("Loading model...");

using LM model = new LM(
    modelPath,
    deviceConfiguration: device,
    loadingProgress: progress =>
    {
        Console.Write($"\rLoading model... {progress * 100:F0}%");
        return true;
    });

Console.WriteLine($"\rModel loaded: {model.Name}");
Console.WriteLine($"  Parameters: {model.ParameterCount:N0}");
Console.WriteLine($"  Context: {model.ContextLength} tokens");
Console.WriteLine($"  Quantization: {model.ModelType}");
Console.WriteLine($"  Layers: {model.LayerCount}");
Console.WriteLine($"  GPU layers: {model.GpuLayerCount}");

// ──────────────────────────────────────
// 3. Run inference entirely offline
// ──────────────────────────────────────
using var chat = new MultiTurnConversation(model);
chat.SystemPrompt = "You are a helpful field assistant. Be concise.";

var result = chat.Submit("Summarize the safety procedure for a gas leak in 3 bullet points.");
Console.WriteLine($"\n{result.Content}");
```
Step 6: Reduce Model Size with Quantization (Optional)
If your target device has very limited memory and no predefined model is small enough, you can quantize a higher-precision model down to a smaller format.
```csharp
using LMKit.Model;
using LMKit.Quantization;

// Quantize from FP16 down to Q4_K_M (roughly 4x size reduction)
var quantizer = new Quantizer("models/gemma-3-4b-it-f16.lmk");
quantizer.ThreadCount = Environment.ProcessorCount;

quantizer.Quantize(
    dstFileName: "models/gemma-3-4b-it-Q4_K_M.lmk",
    modelPrecision: LM.Precision.MOSTLY_Q4_K_M);

Console.WriteLine("Quantization complete.");
```
Quantization Precision Guide
| Precision | Size vs FP16 | Quality | Best For |
|---|---|---|---|
| `MOSTLY_Q8_0` | ~50% | Near-lossless | When VRAM allows, maximum quality |
| `MOSTLY_Q5_K_M` | ~35% | Very good | Balance of quality and size |
| `MOSTLY_Q4_K_M` | ~25% | Good | Recommended default for edge |
| `MOSTLY_Q3_K_M` | ~20% | Acceptable | Very constrained devices |
| `MOSTLY_Q2_K` | ~15% | Degraded | Last resort, smallest possible |
For more details, see Quantize a Model for Edge Deployment.
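The ratios in the table above make it easy to budget disk and VRAM before quantizing. The sketch below applies them to an assumed 8 GB FP16 baseline; actual output sizes vary by model architecture, so treat the results as estimates only.

```csharp
using System;
using System.Collections.Generic;

// Approximate size ratios from the quantization guide above
var ratios = new Dictionary<string, double>
{
    ["MOSTLY_Q8_0"]   = 0.50,
    ["MOSTLY_Q5_K_M"] = 0.35,
    ["MOSTLY_Q4_K_M"] = 0.25,
    ["MOSTLY_Q3_K_M"] = 0.20,
    ["MOSTLY_Q2_K"]   = 0.15,
};

double fp16SizeGB = 8.0; // assumed FP16 baseline for a ~4B-parameter model

foreach (var (precision, ratio) in ratios)
    Console.WriteLine($"{precision,-15} ~{fp16SizeGB * ratio:F1} GB");
```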
Step 7: Package for Deployment
Publish as a self-contained .NET application so the target machine does not need the .NET SDK installed.
```bash
dotnet publish -c Release -r win-x64 --self-contained true -o ./publish
```
Then copy to the target device:

- The `publish/` folder (your application)
- The `models/` folder (your downloaded model files)

The application runs without the .NET SDK, without internet access, and without any external dependencies.
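A deployed app cannot phone home when something is missing, so a startup preflight check that mirrors the checklist below is worth its few lines. This is a sketch with assumed values (the model file name and the ~500 MB working-space figure from the checklist), not part of LM-Kit.

```csharp
using System;
using System.IO;

// Verify the model file exists and enough disk remains before loading.
static bool Preflight(string modelPath, long workingSpaceBytes, out string report)
{
    if (!File.Exists(modelPath))
    {
        report = $"Missing model file: {modelPath}";
        return false;
    }

    var drive = new DriveInfo(Path.GetPathRoot(Path.GetFullPath(modelPath))!);
    if (drive.AvailableFreeSpace < workingSpaceBytes)
    {
        report = $"Only {drive.AvailableFreeSpace / (1024 * 1024)} MB free on {drive.Name}";
        return false;
    }

    report = "Preflight OK";
    return true;
}

string modelPath = Path.Combine(AppContext.BaseDirectory, "models", "gemma-3-4b-it-Q4_K_M.lmk");
const long workingSpace = 500L * 1024 * 1024; // ~500 MB, per the checklist below

Console.WriteLine(Preflight(modelPath, workingSpace, out var msg) ? msg : $"ABORT: {msg}");
```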
Deployment Checklist
| Item | Verify |
|---|---|
| Application binary | publish/ folder contains the executable |
| Model file | .lmk file present in the expected path |
| Runtime libraries | LM-Kit native libraries included (automatic with NuGet) |
| GPU driver | Installed on target if using GPU backend |
| Disk space | Model file + application + ~500 MB working space |
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| `FileNotFoundException` on model load | Model file not at expected path | Verify the `.lmk` file exists at the path passed to the `LM` constructor |
| Out of memory at inference | Model too large for available RAM/VRAM | Use a smaller model or reduce `GpuLayerCount` for partial CPU offload |
| Slow inference on CPU | No GPU offload configured | Set `GpuLayerCount` > 0 if a GPU is available, or use a smaller model |
| Model validation fails | Corrupted download | Re-download the model and verify with `ModelCard.ValidateFileChecksum()` |
Next Steps
- Configure GPU Backends and Optimize Performance for detailed backend tuning
- Optimize Memory with Context Recycling and KV-Cache Configuration to fit larger contexts in limited memory
- Quantize a Model for Edge Deployment for advanced quantization options
- Browse and Select Models Programmatically to build model selection into your application
- Distribute Large Models Across Multiple GPUs for multi-GPU server deployments