Build and Deploy an Offline AI Application for Edge Environments
Many production scenarios require AI to run without any internet connection: factory floors, medical devices, remote field equipment, and classified environments. This guide walks through the complete workflow of selecting a model that fits your hardware, downloading it ahead of time, configuring the inference backend for your device, and packaging everything into a self-contained .NET deployment that runs fully offline.
Why Offline AI Deployment Matters
Two enterprise problems that offline edge AI solves:
- Data sovereignty and air-gapped compliance. Defense contractors, healthcare organizations, and financial institutions operate networks that are physically disconnected from the internet. Sending patient records or classified documents to a cloud API is not an option. A local LLM embedded in the application keeps all data on-premises, simplifying compliance with ITAR, HIPAA, and SOC 2.
- Field operations with no connectivity. Offshore oil rigs, disaster relief teams, and agricultural inspectors work in environments where internet access is unreliable or nonexistent. An AI assistant that requires an API call for every prompt is unusable. An offline model on a ruggedized laptop works anywhere.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | Varies by model (see Step 2) |
| Disk | 1 to 15 GB depending on model size |
Step 1: Create the Project
```bash
dotnet new console -n OfflineAiApp
cd OfflineAiApp
dotnet add package LM-Kit.NET
```
Step 2: Choose a Model That Fits Your Hardware
The model catalog contains predefined models with metadata about size, quantization level, capabilities, and context length. Use this metadata to pick a model that fits your target device.
```csharp
using LMKit.Model;

// List all predefined models with their memory requirements
var models = ModelCard.GetPredefinedModelCards();

Console.WriteLine($"{"Model ID",-30} {"Size (MB)",10} {"Quant",8} {"Ctx",6} {"Capabilities"}");
Console.WriteLine(new string('-', 90));

foreach (var card in models)
{
    // Skip embedding and speech models for this example
    if (!card.Capabilities.HasFlag(ModelCapabilities.Chat))
        continue;

    double sizeMB = card.FileSize / (1024.0 * 1024.0);
    Console.WriteLine(
        $"{card.ModelID,-30} {sizeMB,10:F0} {card.QuantizationPrecision,8:F1} {card.ContextLength,6} {card.Capabilities}");
}
```
A practical rule of thumb: the model file size is roughly the minimum VRAM needed for inference. A 2 GB model needs approximately 2 GB of VRAM (plus overhead for the KV-cache and context).
| Target Device | VRAM Budget | Recommended Models |
|---|---|---|
| Low-end laptop (no GPU) | CPU only | gemma3:1b, qwen3:0.6b, phi4-mini:3.8b |
| Mid-range laptop (4 GB GPU) | ~4 GB | gemma3:4b, qwen3:4b |
| Workstation (8+ GB GPU) | 8+ GB | qwen3:8b, gemma3:12b, llama3.1:8b |
| Server (16+ GB GPU) | 16+ GB | qwen3:14b, phi4:14.7b, gemma3:27b |
For edge AI scenarios, prefer smaller models with more aggressive quantization. A gemma3:4b model at Q4_K_M quantization offers an excellent quality-to-size ratio for most tasks.
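The selection rule above can be sketched as a small filter: pick the largest chat model whose file size, plus headroom for the KV-cache, fits the device's VRAM budget. The `(Id, SizeMB)` pairs below are illustrative stand-ins, not the live `ModelCard` catalog, and the 512 MB overhead figure is an assumption.

```csharp
using System;
using System.Linq;

// Illustrative model list; in a real app, build this from
// ModelCard.GetPredefinedModelCards() instead.
var candidates = new (string Id, double SizeMB)[]
{
    ("gemma3:1b",  800),
    ("qwen3:4b",  2500),
    ("gemma3:4b", 3300),
    ("qwen3:8b",  5000),
};

double vramBudgetMB = 4096; // e.g. a 4 GB GPU
double overheadMB   = 512;  // assumed headroom for KV-cache and context

// Largest model that fits the budget after reserving overhead
var best = candidates
    .Where(c => c.SizeMB + overheadMB <= vramBudgetMB)
    .OrderByDescending(c => c.SizeMB)
    .FirstOrDefault();

Console.WriteLine(best.Id is null
    ? "No model fits; fall back to CPU inference."
    : $"Selected {best.Id} ({best.SizeMB:F0} MB) for a {vramBudgetMB:F0} MB budget.");
```

With a 4 GB budget this selects gemma3:4b, matching the mid-range laptop row in the table above.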
Step 3: Download Models for Offline Packaging
Before deploying to an air-gapped device, download models to a local folder on a connected machine.
```csharp
using System.Text;
using LMKit.Model;

Console.OutputEncoding = Encoding.UTF8;

// Choose your target model
var card = ModelCard.GetPredefinedModelCardByModelID("gemma3:4b");
double sizeMB = card.FileSize / (1024.0 * 1024.0);

Console.WriteLine($"Model: {card.ModelID}");
Console.WriteLine($"Size: {sizeMB:F0} MB");
Console.WriteLine($"Context: {card.ContextLength} tokens");

// Download to a specific folder for offline packaging
string offlineModelPath = Path.Combine(AppContext.BaseDirectory, "models");
Directory.CreateDirectory(offlineModelPath);

Console.WriteLine($"\nDownloading to {offlineModelPath}...");

card.Download(downloadingProgress: (path, contentLength, bytesRead) =>
{
    if (contentLength.HasValue && contentLength.Value > 0)
    {
        double percent = (double)bytesRead / contentLength.Value * 100;
        Console.Write($"\r  {percent:F1}% ({bytesRead / (1024 * 1024)} / {contentLength.Value / (1024 * 1024)} MB)");
    }
    return true; // return false to cancel
});

Console.WriteLine("\nDownload complete.");

// Verify the file is ready for offline use
if (card.IsLocallyAvailable)
{
    Console.WriteLine($"Model verified at: {card.LocalPath}");
}
```
Tip: After downloading, copy the entire `models/` folder to your air-gapped device. The model file is self-contained and requires no additional downloads at runtime.
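Before trusting a file that was copied across an air gap, it is worth confirming the transfer was not corrupted. LM-Kit has its own check (`ModelCard.ValidateFileChecksum()`, see Common Issues below); the sketch here is a generic, library-free alternative: record a SHA-256 hash on the connected machine, then re-hash on the target. The file name and placeholder hash are illustrative.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hash a file with SHA-256; compare against a hash captured before transfer.
static string Sha256Of(string path)
{
    using var stream = File.OpenRead(path);
    return Convert.ToHexString(SHA256.HashData(stream));
}

string modelFile = Path.Combine("models", "gemma-3-4b-it-Q4_K_M.lmk");
string expectedHash = "..."; // hash recorded on the connected machine

if (File.Exists(modelFile))
{
    Console.WriteLine(Sha256Of(modelFile) == expectedHash
        ? "Checksum OK, model is intact."
        : "Checksum mismatch, re-copy the model file.");
}
```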
Step 4: Configure the Inference Backend
On the target device, configure the backend to match available hardware. LM-Kit.NET supports CPU, CUDA (NVIDIA), Vulkan (cross-platform GPU), and Metal (macOS).
```csharp
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

// ──────────────────────────────────────
// Option A: Let LM-Kit auto-detect the best GPU
// ──────────────────────────────────────
var autoConfig = new LM.DeviceConfiguration();
// GpuLayerCount defaults to int.MaxValue (offload all layers)
// MainGpu defaults to the best GPU detected on the system

// ──────────────────────────────────────
// Option B: Force CPU-only for devices without a GPU
// ──────────────────────────────────────
var cpuConfig = new LM.DeviceConfiguration
{
    GpuLayerCount = 0 // Zero layers on GPU = pure CPU inference
};

// ──────────────────────────────────────
// Option C: Partial GPU offload for limited VRAM
// ──────────────────────────────────────
var partialConfig = new LM.DeviceConfiguration
{
    GpuLayerCount = 20 // Offload 20 layers to GPU, rest on CPU
};
```
Choosing the Right Backend
| Backend | When to Use | Platform |
|---|---|---|
| CPU (SSE4/AVX2) | No GPU available, or model fits in RAM | All |
| CUDA 12/13 | NVIDIA GPU with sufficient VRAM | Windows, Linux |
| Vulkan | AMD/Intel/NVIDIA GPU (cross-platform) | Windows, Linux |
| Metal | Apple Silicon Macs | macOS |
LM-Kit.NET selects the appropriate backend automatically based on the installed runtime package. See Configure GPU Backends for detailed backend setup instructions.
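For Option C above, a reasonable starting value for `GpuLayerCount` can be estimated from free VRAM and the model file size, assuming (simplistically) that each transformer layer occupies an equal share of the file. This is a rough heuristic, not an LM-Kit API; the layer count and sizes below are illustrative.

```csharp
using System;

// Estimate how many layers fit in free VRAM, then feed the result into
// LM.DeviceConfiguration.GpuLayerCount. Assumes equal-sized layers and
// leaves KV-cache headroom to the caller.
static int EstimateGpuLayers(long modelFileBytes, int totalLayers, long freeVramBytes)
{
    if (totalLayers <= 0 || freeVramBytes <= 0) return 0;
    long bytesPerLayer = modelFileBytes / totalLayers;
    if (bytesPerLayer == 0) return totalLayers;
    long fit = freeVramBytes / bytesPerLayer;
    return (int)Math.Min(fit, totalLayers);
}

long modelBytes = 3_460_300_800; // ~3.3 GB model file (illustrative)
int layers = 34;                 // illustrative layer count
long freeVram = 2L * 1024 * 1024 * 1024; // 2 GB free VRAM

int gpuLayers = EstimateGpuLayers(modelBytes, layers, freeVram);
Console.WriteLine($"Offload {gpuLayers}/{layers} layers to the GPU.");
// var cfg = new LM.DeviceConfiguration { GpuLayerCount = gpuLayers };
```

If inference still runs out of memory, lower the value further; the rest of the layers simply stay on the CPU.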
Step 5: Load the Model Offline
Load the model from a local file path without any network access.
```csharp
using System.Text;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Configure for target hardware
// ──────────────────────────────────────
var device = new LM.DeviceConfiguration
{
    GpuLayerCount = 0 // CPU-only for this example
};

// ──────────────────────────────────────
// 2. Load from local file (no internet required)
// ──────────────────────────────────────
string modelPath = Path.Combine(AppContext.BaseDirectory, "models", "gemma-3-4b-it-Q4_K_M.lmk");

Console.Write("Loading model...");

using LM model = new LM(
    modelPath,
    deviceConfiguration: device,
    loadingProgress: progress =>
    {
        Console.Write($"\rLoading model... {progress * 100:F0}%");
        return true;
    });

Console.WriteLine($"\rModel loaded: {model.Name}");
Console.WriteLine($"  Parameters: {model.ParameterCount:N0}");
Console.WriteLine($"  Context: {model.ContextLength} tokens");
Console.WriteLine($"  Quantization: {model.ModelType}");
Console.WriteLine($"  Layers: {model.LayerCount}");
Console.WriteLine($"  GPU layers: {model.GpuLayerCount}");

// ──────────────────────────────────────
// 3. Run inference entirely offline
// ──────────────────────────────────────
using var chat = new MultiTurnConversation(model);
chat.SystemPrompt = "You are a helpful field assistant. Be concise.";

var result = chat.Submit("Summarize the safety procedure for a gas leak in 3 bullet points.");
Console.WriteLine($"\n{result.Content}");
```
Step 6: Reduce Model Size with Quantization (Optional)
If your target device has very limited memory and no predefined model is small enough, you can quantize a higher-precision model down to a smaller format.
```csharp
using LMKit.Model;
using LMKit.Quantization;

// Quantize from FP16 down to Q4_K_M (roughly 4x size reduction)
var quantizer = new Quantizer("models/gemma-3-4b-it-f16.lmk");
quantizer.ThreadCount = Environment.ProcessorCount;

quantizer.Quantize(
    dstFileName: "models/gemma-3-4b-it-Q4_K_M.lmk",
    modelPrecision: LM.Precision.MOSTLY_Q4_K_M);

Console.WriteLine("Quantization complete.");
```
Quantization Precision Guide
| Precision | Size vs FP16 | Quality | Best For |
|---|---|---|---|
| `MOSTLY_Q8_0` | ~50% | Near-lossless | When VRAM allows, maximum quality |
| `MOSTLY_Q5_K_M` | ~35% | Very good | Balance of quality and size |
| `MOSTLY_Q4_K_M` | ~25% | Good | Recommended default for edge |
| `MOSTLY_Q3_K_M` | ~20% | Acceptable | Very constrained devices |
| `MOSTLY_Q2_K` | ~15% | Degraded | Last resort, smallest possible |
For more details, see Quantize a Model for Edge Deployment.
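The ratios in the table above make it easy to budget disk and VRAM before quantizing. The sketch below applies them to an assumed 8 GB FP16 baseline; actual output sizes vary by model architecture, so treat the results as estimates only.

```csharp
using System;
using System.Collections.Generic;

// Approximate size ratios from the quantization guide above
var ratios = new Dictionary<string, double>
{
    ["MOSTLY_Q8_0"]   = 0.50,
    ["MOSTLY_Q5_K_M"] = 0.35,
    ["MOSTLY_Q4_K_M"] = 0.25,
    ["MOSTLY_Q3_K_M"] = 0.20,
    ["MOSTLY_Q2_K"]   = 0.15,
};

double fp16SizeGB = 8.0; // assumed FP16 baseline for a ~4B-parameter model

foreach (var (precision, ratio) in ratios)
    Console.WriteLine($"{precision,-15} ~{fp16SizeGB * ratio:F1} GB");
```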
Step 7: Package for Deployment
Publish as a self-contained .NET application so the target machine does not need the .NET SDK installed.
```bash
dotnet publish -c Release -r win-x64 --self-contained true -o ./publish
```
Then copy to the target device:

- The `publish/` folder (your application)
- The `models/` folder (your downloaded model files)

The application runs without the .NET SDK, without internet access, and without any external dependencies.
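A deployed app cannot phone home when something is missing, so a startup preflight check that mirrors the checklist below is worth its few lines. This is a sketch with assumed values (the model file name and the ~500 MB working-space figure from the checklist), not part of LM-Kit.

```csharp
using System;
using System.IO;

// Verify the model file exists and enough disk remains before loading.
static bool Preflight(string modelPath, long workingSpaceBytes, out string report)
{
    if (!File.Exists(modelPath))
    {
        report = $"Missing model file: {modelPath}";
        return false;
    }

    var drive = new DriveInfo(Path.GetPathRoot(Path.GetFullPath(modelPath))!);
    if (drive.AvailableFreeSpace < workingSpaceBytes)
    {
        report = $"Only {drive.AvailableFreeSpace / (1024 * 1024)} MB free on {drive.Name}";
        return false;
    }

    report = "Preflight OK";
    return true;
}

string modelPath = Path.Combine(AppContext.BaseDirectory, "models", "gemma-3-4b-it-Q4_K_M.lmk");
const long workingSpace = 500L * 1024 * 1024; // ~500 MB, per the checklist below

Console.WriteLine(Preflight(modelPath, workingSpace, out var msg) ? msg : $"ABORT: {msg}");
```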
Deployment Checklist
| Item | Verify |
|---|---|
| Application binary | publish/ folder contains the executable |
| Model file | .lmk file present in the expected path |
| Runtime libraries | LM-Kit native libraries included (automatic with NuGet) |
| GPU driver | Installed on target if using GPU backend |
| Disk space | Model file + application + ~500 MB working space |
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| `FileNotFoundException` on model load | Model file not at expected path | Verify the `.lmk` file exists at the path passed to the `LM` constructor |
| Out of memory at inference | Model too large for available RAM/VRAM | Use a smaller model or reduce `GpuLayerCount` for partial CPU offload |
| Slow inference on CPU | No GPU offload configured | Set `GpuLayerCount` > 0 if a GPU is available, or use a smaller model |
| Model validation fails | Corrupted download | Re-download the model and verify with `ModelCard.ValidateFileChecksum()` |
Next Steps
- Configure GPU Backends and Optimize Performance for detailed backend tuning
- Optimize Memory with Context Recycling and KV-Cache Configuration to fit larger contexts in limited memory
- Quantize a Model for Edge Deployment for advanced quantization options
- Browse and Select Models Programmatically to build model selection into your application
- Distribute Large Models Across Multiple GPUs for multi-GPU server deployments