Build and Deploy an Offline AI Application for Edge Environments

Many production scenarios require AI to run without any internet connection: factory floors, medical devices, remote field equipment, and classified environments. This guide walks through the complete workflow of selecting a model that fits your hardware, downloading it ahead of time, configuring the inference backend for your device, and packaging everything into a self-contained .NET deployment that runs fully offline.


Why Offline AI Deployment Matters

Two enterprise problems that offline edge AI solves:

  1. Data sovereignty and air-gapped compliance. Defense contractors, healthcare organizations, and financial institutions operate networks that are physically disconnected from the internet. Sending patient records or classified documents to a cloud API is not an option. A local LLM embedded in the application keeps all data on-premises, simplifying compliance with ITAR, HIPAA, and SOC 2.
  2. Field operations with no connectivity. Offshore oil rigs, disaster relief teams, and agricultural inspectors work in environments where internet access is unreliable or nonexistent. An AI assistant that requires an API call for every prompt is unusable. An offline model on a ruggedized laptop works anywhere.

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
VRAM          Varies by model (see Step 2)
Disk          1 to 15 GB depending on model size

Step 1: Create the Project

dotnet new console -n OfflineAiApp
cd OfflineAiApp
dotnet add package LM-Kit.NET

Step 2: Choose a Model That Fits Your Hardware

The model catalog contains predefined models with metadata about size, quantization level, capabilities, and context length. Use this metadata to pick a model that fits your target device.

using LMKit.Model;

// List all predefined models with their memory requirements
var models = ModelCard.GetPredefinedModelCards();

Console.WriteLine($"{"Model ID",-30} {"Size (MB)",10} {"Quant",8} {"Ctx",6} {"Capabilities"}");
Console.WriteLine(new string('-', 90));

foreach (var card in models)
{
    // Skip embedding and speech models for this example
    if (!card.Capabilities.HasFlag(ModelCapabilities.Chat))
        continue;

    double sizeMB = card.FileSize / (1024.0 * 1024.0);
    Console.WriteLine(
        $"{card.ModelID,-30} {sizeMB,10:F0} {card.QuantizationPrecision,8:F1} {card.ContextLength,6} {card.Capabilities}");
}

A practical rule of thumb: the model file size is roughly the minimum VRAM needed for inference. A 2 GB model needs approximately 2 GB of VRAM, plus overhead for the KV cache, which grows with the context length you configure.
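The rule of thumb above can be sketched as a back-of-envelope calculation. The layer, head, and size figures below are illustrative assumptions, not values read from any LM-Kit API; a real estimate would take them from the model's metadata:

```csharp
// Back-of-envelope memory estimate: weights + KV cache.
// All figures here are illustrative assumptions, not library values.
long modelFileBytes = 2_500L * 1024 * 1024; // e.g. a ~2.5 GB Q4_K_M file

// KV cache: 2 tensors (K and V) per layer, one entry per context token.
int layers = 34, kvHeads = 8, headDim = 128, contextLength = 8192;
int bytesPerElement = 2; // FP16 cache
long kvCacheBytes = 2L * layers * contextLength * kvHeads * headDim * bytesPerElement;

double gb = 1024.0 * 1024 * 1024;
Console.WriteLine($"Weights:  {modelFileBytes / gb:F2} GB");
Console.WriteLine($"KV cache: {kvCacheBytes / gb:F2} GB");
Console.WriteLine($"Total:    {(modelFileBytes + kvCacheBytes) / gb:F2} GB (plus runtime overhead)");
```

With these sample numbers the KV cache alone adds about 1 GB at an 8K context, which is why a 2.5 GB model will not quite fit on a card with 2.5 GB of free VRAM.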

Target Device                VRAM Budget   Recommended Models
Low-end laptop (no GPU)      CPU only      gemma3:1b, qwen3:0.6b, phi4-mini:3.8b
Mid-range laptop (4 GB GPU)  ~4 GB         gemma3:4b, qwen3:4b
Workstation (8+ GB GPU)      8+ GB         qwen3:8b, gemma3:12b, llama3.1:8b
Server (16+ GB GPU)          16+ GB        qwen3:14b, phi4:14.7b, gemma3:27b

For edge AI scenarios, prefer smaller models at more aggressive quantization levels. A gemma3:4b model at Q4_K_M quantization offers an excellent quality-to-size ratio for most tasks.


Step 3: Download Models for Offline Packaging

Before deploying to an air-gapped device, download models to a local folder on a connected machine.

using System.Text;
using LMKit.Model;

Console.OutputEncoding = Encoding.UTF8;

// Choose your target model
var card = ModelCard.GetPredefinedModelCardByModelID("gemma3:4b");

double sizeMB = card.FileSize / (1024.0 * 1024.0);
Console.WriteLine($"Model:    {card.ModelID}");
Console.WriteLine($"Size:     {sizeMB:F0} MB");
Console.WriteLine($"Context:  {card.ContextLength} tokens");

// Download to a specific folder for offline packaging
string offlineModelPath = Path.Combine(AppContext.BaseDirectory, "models");
Directory.CreateDirectory(offlineModelPath);

Console.WriteLine($"\nDownloading to {offlineModelPath}...");

card.Download(downloadingProgress: (path, contentLength, bytesRead) =>
{
    if (contentLength.HasValue && contentLength.Value > 0)
    {
        double percent = (double)bytesRead / contentLength.Value * 100;
        Console.Write($"\r  {percent:F1}% ({bytesRead / (1024 * 1024)} / {contentLength.Value / (1024 * 1024)} MB)");
    }
    return true; // return false to cancel
});

Console.WriteLine("\nDownload complete.");

// Verify the file is ready for offline use
if (card.IsLocallyAvailable)
{
    Console.WriteLine($"Model verified at: {card.LocalPath}");
}

Tip: After downloading, copy the entire models/ folder to your air-gapped device. The model file is self-contained and requires no additional downloads at runtime.


Step 4: Configure the Inference Backend

On the target device, configure the backend to match available hardware. LM-Kit.NET supports CPU, CUDA (NVIDIA), Vulkan (cross-platform GPU), and Metal (macOS).

using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

// ──────────────────────────────────────
// Option A: Let LM-Kit auto-detect the best GPU
// ──────────────────────────────────────
var autoConfig = new LM.DeviceConfiguration();
// GpuLayerCount defaults to int.MaxValue (offload all layers)
// MainGpu defaults to the best GPU detected on the system

// ──────────────────────────────────────
// Option B: Force CPU-only for devices without a GPU
// ──────────────────────────────────────
var cpuConfig = new LM.DeviceConfiguration
{
    GpuLayerCount = 0  // Zero layers on GPU = pure CPU inference
};

// ──────────────────────────────────────
// Option C: Partial GPU offload for limited VRAM
// ──────────────────────────────────────
var partialConfig = new LM.DeviceConfiguration
{
    GpuLayerCount = 20  // Offload 20 layers to GPU, rest on CPU
};
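To pick a GpuLayerCount for Option C, one rough approach is to assume each layer costs about fileSize / layerCount bytes of VRAM. This is a hypothetical sizing sketch with made-up numbers, not an LM-Kit calculation:

```csharp
// Rough sizing for GpuLayerCount: assume each layer costs about
// fileSize / layerCount bytes of VRAM. Illustrative numbers only.
long fileSizeBytes = 2_500L * 1024 * 1024;      // ~2.5 GB model file
int layerCount = 34;                            // from model metadata
long vramBudgetBytes = 1536L * 1024 * 1024;     // ~1.5 GB usable of a 4 GB card

long bytesPerLayer = fileSizeBytes / layerCount;
int layersThatFit = (int)(vramBudgetBytes / bytesPerLayer);
int gpuLayers = Math.Min(layersThatFit, layerCount);

Console.WriteLine($"~{bytesPerLayer / (1024 * 1024)} MB per layer; " +
                  $"offload {gpuLayers} of {layerCount} layers");
// prints: ~73 MB per layer; offload 20 of 34 layers
```

Leaving headroom below the card's nominal VRAM matters because the KV cache and the display compositor also consume GPU memory.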

Choosing the Right Backend

Backend          When to Use                             Platform
CPU (SSE4/AVX2)  No GPU available, or model fits in RAM  All
CUDA 12/13       NVIDIA GPU with sufficient VRAM         Windows, Linux
Vulkan           AMD/Intel/NVIDIA GPU (cross-platform)   Windows, Linux
Metal            Apple Silicon Macs                      macOS

LM-Kit.NET selects the appropriate backend automatically based on the installed runtime package. See Configure GPU Backends for detailed backend setup instructions.


Step 5: Load the Model Offline

Load the model from a local file path without any network access.

using System.Text;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Configure for target hardware
// ──────────────────────────────────────
var device = new LM.DeviceConfiguration
{
    GpuLayerCount = 0  // CPU-only for this example
};

// ──────────────────────────────────────
// 2. Load from local file (no internet required)
// ──────────────────────────────────────
string modelPath = Path.Combine(AppContext.BaseDirectory, "models", "gemma-3-4b-it-Q4_K_M.lmk");

Console.Write("Loading model...");
using LM model = new LM(
    modelPath,
    deviceConfiguration: device,
    loadingProgress: progress =>
    {
        Console.Write($"\rLoading model... {progress * 100:F0}%");
        return true;
    });

Console.WriteLine($"\rModel loaded: {model.Name}");
Console.WriteLine($"  Parameters:  {model.ParameterCount:N0}");
Console.WriteLine($"  Context:     {model.ContextLength} tokens");
Console.WriteLine($"  Quantization: {model.ModelType}");
Console.WriteLine($"  Layers:      {model.LayerCount}");
Console.WriteLine($"  GPU layers:  {model.GpuLayerCount}");

// ──────────────────────────────────────
// 3. Run inference entirely offline
// ──────────────────────────────────────
using var chat = new MultiTurnConversation(model);
chat.SystemPrompt = "You are a helpful field assistant. Be concise.";

var result = chat.Submit("Summarize the safety procedure for a gas leak in 3 bullet points.");
Console.WriteLine($"\n{result.Content}");

Step 6: Reduce Model Size with Quantization (Optional)

If your target device has very limited memory and no predefined model is small enough, you can quantize a higher-precision model down to a smaller format.

using LMKit.Quantization;
using LMKit.Model;

// Quantize from FP16 down to Q4_K_M (roughly 4x size reduction)
var quantizer = new Quantizer("models/gemma-3-4b-it-f16.lmk");
quantizer.ThreadCount = Environment.ProcessorCount;

quantizer.Quantize(
    dstFileName: "models/gemma-3-4b-it-Q4_K_M.lmk",
    modelPrecision: LM.Precision.MOSTLY_Q4_K_M);

Console.WriteLine("Quantization complete.");

Quantization Precision Guide

Precision      Size vs FP16   Quality        Best For
MOSTLY_Q8_0    ~50%           Near-lossless  When VRAM allows, maximum quality
MOSTLY_Q5_K_M  ~35%           Very good      Balance of quality and size
MOSTLY_Q4_K_M  ~25%           Good           Recommended default for edge
MOSTLY_Q3_K_M  ~20%           Acceptable     Very constrained devices
MOSTLY_Q2_K    ~15%           Degraded       Last resort, smallest possible
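Applying the table's size ratios to a hypothetical 4-billion-parameter FP16 baseline (about 8 GB at 2 bytes per parameter) gives rough on-disk sizes per precision. The parameter count is an assumption for illustration, not a property of any specific model:

```csharp
// Approximate on-disk size per precision, derived from an FP16 baseline.
// Ratios mirror the table above; the 4B parameter count is illustrative.
double fp16GB = 4.0e9 * 2 / 1e9; // 4B params x 2 bytes = ~8 GB

var ratios = new (string Precision, double Ratio)[]
{
    ("MOSTLY_Q8_0",   0.50),
    ("MOSTLY_Q5_K_M", 0.35),
    ("MOSTLY_Q4_K_M", 0.25),
    ("MOSTLY_Q3_K_M", 0.20),
    ("MOSTLY_Q2_K",   0.15),
};

foreach (var (precision, ratio) in ratios)
    Console.WriteLine($"{precision,-14} ~{fp16GB * ratio:F1} GB");
```

At Q4_K_M this lands around 2 GB, consistent with the VRAM guidance in Step 2.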

For more details, see Quantize a Model for Edge Deployment.


Step 7: Package for Deployment

Publish as a self-contained .NET application so the target machine does not need the .NET SDK installed.

dotnet publish -c Release -r win-x64 --self-contained true -o ./publish
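The win-x64 value above is the runtime identifier (RID) for 64-bit Windows; it must match the target device, not the build machine. For other common edge targets, substitute the matching standard .NET RID:

```shell
# Publish for other targets by swapping the RID:
dotnet publish -c Release -r linux-x64   --self-contained true -o ./publish   # x64 Linux
dotnet publish -c Release -r linux-arm64 --self-contained true -o ./publish   # ARM64 edge boards
dotnet publish -c Release -r osx-arm64   --self-contained true -o ./publish   # Apple Silicon macOS
```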

Then copy the following to the target device:

  1. The publish/ folder (your application)
  2. The models/ folder (your downloaded model files)

The application runs without the .NET SDK, without internet access, and without any external dependencies.

Deployment Checklist

Item                Verify
Application binary  publish/ folder contains the executable
Model file          .lmk file present in the expected path
Runtime libraries   LM-Kit native libraries included (automatic with NuGet)
GPU driver          Installed on target if using GPU backend
Disk space          Model file + application + ~500 MB working space
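Parts of this checklist can be automated with a small pre-flight check that runs on the target device before the first model load. The sketch below uses only the base class library; the model path and the ~500 MB working-space figure are assumptions mirroring the checklist, not values read from LM-Kit:

```csharp
// Minimal deployment pre-flight check using only the BCL.
// Model path and working-space figure are assumptions from the checklist.
string modelPath = Path.Combine(AppContext.BaseDirectory, "models", "gemma-3-4b-it-Q4_K_M.lmk");

if (!File.Exists(modelPath))
{
    Console.Error.WriteLine($"Missing model file: {modelPath}");
    Environment.Exit(1);
}

// Require the model size plus ~500 MB of working space on the drive.
long requiredBytes = new FileInfo(modelPath).Length + 500L * 1024 * 1024;
var drive = new DriveInfo(Path.GetPathRoot(Path.GetFullPath(modelPath))!);

if (drive.AvailableFreeSpace < requiredBytes)
{
    Console.Error.WriteLine($"Need ~{requiredBytes / (1024 * 1024)} MB free on {drive.Name}");
    Environment.Exit(1);
}

Console.WriteLine("Pre-flight checks passed.");
```

Running this as the first step of your application surfaces a missing model file as a clear error message instead of a FileNotFoundException deep inside model loading.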

Common Issues

Problem                              Cause                                   Fix
FileNotFoundException on model load  Model file not at expected path         Verify the .lmk file exists at the path passed to the LM constructor
Out of memory at inference           Model too large for available RAM/VRAM  Use a smaller model or reduce GpuLayerCount for partial CPU offload
Slow inference on CPU                No GPU offload configured               Set GpuLayerCount > 0 if a GPU is available, or use a smaller model
Model validation fails               Corrupted download                      Re-download the model and verify with ModelCard.ValidateFileChecksum()
