Protect GGUF Model Files with Encryption

Commercial applications that ship a fine-tuned or proprietary model on the end-user machine face a simple problem: a .gguf file on disk is readable by anyone. Copying, redistributing, or loading the weights into another tool is trivial. LM-Kit.NET ships an encryption pipeline that keeps the plaintext model off disk entirely: you ship an encrypted container, the user provides a password, and the runtime decrypts one tensor at a time as it streams into the backend buffer. This tutorial shows how to encrypt a GGUF file and load it at runtime without ever materializing the plaintext.


Why On-Device Model Encryption Matters

Two enterprise problems that encrypted GGUF loading solves:

  1. Protecting fine-tuned model IP. Teams that invest in domain-specific fine-tuning (legal, medical, proprietary code) need to ship the resulting weights with their product without exposing the raw file to competitors or disgruntled users who might rip it out of the install directory.
  2. Licensing-controlled model distribution. A model may be licensed per seat, per tenant, or tied to an activation key. Keeping the weights encrypted at rest and decrypting in-memory at load time lets the application gate access through its own license logic without relying on filesystem permissions alone.

What Makes LM-Kit's Approach Different

Typical approaches either (a) decrypt the full model to a temp file before handing it to the inference engine, or (b) load the entire decrypted blob into RAM. Both defeat the purpose: (a) leaves a plaintext copy in the filesystem for the duration of the session, and (b) doubles peak memory for a 30 GB model.

LM-Kit's design is different:

  • Metadata-only upfront decrypt. Only the GGUF metadata block (a few MB) is decrypted into a pinned managed buffer.
  • Per-tensor streaming decrypt. The native runtime invokes a managed read callback once per tensor. Each call decrypts exactly that tensor's bytes from disk directly into the backend buffer. The decrypted bytes are then freed before the next tensor is requested.
  • No plaintext on disk, ever. The .lmke container stays encrypted at rest. No temp file is produced during load.
  • Seekable cipher. AES-256-CTR is used because it lets any byte range be decrypted independently, which is the building block needed for per-tensor random access.
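
The seekability property is what makes per-tensor random access possible: in CTR mode the keystream for any byte offset can be computed directly from the block index, with no dependence on earlier ciphertext. The sketch below illustrates the principle with .NET's built-in AES. It is a minimal demonstration, not LM-Kit's internal reader; the counter convention (nonce plus a big-endian block index over all 16 bytes) is an assumption for illustration, not the documented .lmke wire format.

```csharp
using System;
using System.Security.Cryptography;

// XOR the AES-CTR keystream starting at an arbitrary byte offset into `data`.
// In CTR mode, encrypt and decrypt are the same operation.
void CtrXor(byte[] key, byte[] nonce, long byteOffset, byte[] data)
{
    using var aes = Aes.Create();
    aes.Key = key;
    aes.Mode = CipherMode.ECB;          // ECB on the counter block = raw AES
    aes.Padding = PaddingMode.None;
    using var enc = aes.CreateEncryptor();

    long block = byteOffset / 16;
    int skip = (int)(byteOffset % 16);  // partial first block
    var counter = new byte[16];
    var keystream = new byte[16];

    for (int i = 0; i < data.Length; block++, skip = 0)
    {
        nonce.CopyTo(counter, 0);
        AddBigEndian(counter, block);   // counter = nonce + block index
        enc.TransformBlock(counter, 0, 16, keystream, 0);
        for (int j = skip; j < 16 && i < data.Length; j++, i++)
            data[i] ^= keystream[j];
    }
}

static void AddBigEndian(byte[] ctr, long value)
{
    ulong v = (ulong)value;
    for (int i = 15; i >= 0 && v != 0; i--)
    {
        ulong sum = ctr[i] + (v & 0xFF);
        ctr[i] = (byte)sum;
        v = (v >> 8) + (sum >> 8);      // propagate carry
    }
}

var key = RandomNumberGenerator.GetBytes(32);
var nonce = RandomNumberGenerator.GetBytes(16);
var plain = new byte[1000];
RandomNumberGenerator.Fill(plain);

// Encrypt the whole "file" in one pass...
var cipher = (byte[])plain.Clone();
CtrXor(key, nonce, 0, cipher);

// ...then decrypt just bytes 237..500 independently, the way a single
// tensor would be pulled out of the container without reading the rest.
var slice = cipher[237..500];
CtrXor(key, nonce, 237, slice);
Console.WriteLine(slice.AsSpan().SequenceEqual(plain.AsSpan(237, 263))
    ? "range decrypt matches" : "MISMATCH");
```

A CBC-style chained cipher could not do this: decrypting tensor N would require decrypting everything before it first.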

Prerequisites

Requirement   Minimum
───────────   ─────────────────────────────────────────────────────────────
.NET SDK      8.0+
VRAM / RAM    Whatever the model itself needs (encryption adds ~0.5% overhead)
Disk          Space for the encrypted container (same size as plaintext, plus a 64-byte header)

Step 1: Encrypt a Plaintext GGUF File

Call EncryptedGguf.Encrypt once, typically as part of your build or packaging step. The input is any standard .gguf file; the output is an .lmke container that is useless without the password.

using LMKit.Cryptography;

EncryptedGguf.Encrypt(
    plaintextGgufPath: @"C:\models\my-finetuned-model.gguf",
    encryptedPath:     @"C:\models\my-finetuned-model.lmke",
    scheme:            GgufEncryptionScheme.AesCtr256,
    password:          "your-strong-password");

Defaults:

  • Scheme: AesCtr256 is the only supported scheme today (seekable, which is required for per-tensor random-access decrypt).
  • Key derivation: PBKDF2-HMAC-SHA256, 100,000 iterations, 32-byte key.
  • Per-container salt and nonce: 16 bytes each, generated via RandomNumberGenerator. Stored in the container header.
  • Encryption throughput: streams in 64 KB chunks. A 250 MB model encrypts in under two seconds on commodity hardware.
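
The key-derivation parameters listed above can be reproduced independently with .NET's built-in PBKDF2. The sketch below is illustrative only; application code never needs to compute the container key itself, since LM-Kit derives it internally from the password:

```csharp
using System;
using System.Security.Cryptography;

// Derive a 32-byte AES key matching the stated parameters:
// PBKDF2-HMAC-SHA256, 100,000 iterations, 16-byte random salt.
byte[] salt = RandomNumberGenerator.GetBytes(16);   // per-container salt

byte[] key = Rfc2898DeriveBytes.Pbkdf2(
    password: "your-strong-password",
    salt: salt,
    iterations: 100_000,
    hashAlgorithm: HashAlgorithmName.SHA256,
    outputLength: 32);

Console.WriteLine($"key length: {key.Length} bytes");

// Same password + same salt => same key, which is why a container
// encrypted on a build machine opens anywhere the password is known.
byte[] again = Rfc2898DeriveBytes.Pbkdf2(
    "your-strong-password", salt, 100_000, HashAlgorithmName.SHA256, 32);
Console.WriteLine(key.AsSpan().SequenceEqual(again) ? "deterministic" : "BUG");
```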

Step 2: Load the Encrypted Model at Runtime

Replace your regular new LM(path) call with LM.LoadEncrypted:

using LMKit.Cryptography;
using LMKit.Model;

using LM model = LM.LoadEncrypted(
    encryptedPath: @"C:\models\my-finetuned-model.lmke",
    scheme:        GgufEncryptionScheme.AesCtr256,
    password:      "your-strong-password");

The returned LM is a normal model instance. Use it with MultiTurnConversation, SingleTurnConversation, Summarizer, Agent, RagChat, or any other LM-Kit component exactly as you would a model loaded from a plaintext file.

using LMKit.TextGeneration;

var chat = new MultiTurnConversation(model)
{
    MaximumCompletionTokens = 512,
};

Console.WriteLine(chat.Submit("Summarize our company refund policy.").Completion);

Step 3: Protect the Password

The password never lives inside the container. That is the whole point: possession of the container alone gives no access to the weights. Common patterns:

  • Ask the user at startup. Simple, and appropriate when the model ships with a per-license password.
  • Derive from a hardware-bound secret. Combine a fixed salt with a machine-specific value (TPM-sealed blob, DPAPI-protected secret, Keychain item) and pass the result as the password.
  • Fetch from a remote license server. After license validation, the server returns the password over HTTPS. The password lives only in process memory.

Whichever path you pick, treat the password like any other secret: do not log it, do not persist it in plaintext, and zero it from memory when done (Array.Clear on the backing byte[]).
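
The hardware-bound pattern can be sketched as follows. Everything machine-specific here is a placeholder: Environment.MachineName stands in for a real machine secret (a TPM-sealed blob, DPAPI-protected value, or Keychain item in production), and the salt string is a hypothetical app identifier:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical app-specific salt; in a real deployment this ships with
// the application while the machine secret comes from TPM/DPAPI/Keychain.
byte[] appSalt = Encoding.UTF8.GetBytes("com.example.myapp.model-key.v1");

// WEAK stand-in for a machine-bound secret, used only so this sketch
// runs anywhere. Do not use MachineName as a secret in production.
string machineSecret = Environment.MachineName;

byte[] passwordBytes = Rfc2898DeriveBytes.Pbkdf2(
    machineSecret, appSalt, 100_000, HashAlgorithmName.SHA256, 32);

// The derived bytes would be passed (e.g. hex-encoded) as the password
// to LM.LoadEncrypted, then zeroed as soon as the load returns.
string password = Convert.ToHexString(passwordBytes);
Console.WriteLine($"derived password: {password.Length} hex chars");

Array.Clear(passwordBytes, 0, passwordBytes.Length);  // zero the secret
```

The result is a container that only opens on machines where the bound secret is available, without any password prompt.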


Step 4: Memory Profile While Loading

A common concern with encrypted model loading is: does the runtime secretly buffer the whole model? It does not. Here is the memory profile during a load:

┌─────────────────────────────────────────────────────────────┐
│ Encrypted container (disk)             ──── ENCRYPTED ──── │
├─────────────────────────────────────────────────────────────┤
│ In process memory at peak:                                  │
│   • Metadata buffer          ~few MB (decrypted, pinned)    │
│   • One tensor's bytes       tens to hundreds of MB        │
│                              (decrypted briefly, released  │
│                              on return)                    │
│   • Final backend buffer     equals model size (same as     │
│                              loading a plaintext model)     │
└─────────────────────────────────────────────────────────────┘

The plaintext model is never simultaneously materialized in memory. Peak decrypted "extra" RAM is metadata + the largest single tensor. For a 30 GB model that is typically under 500 MB of overhead at any point.


Step 5: Validate the Loader Didn't Break Anything

Greedy decoding is deterministic, so the encrypted and plaintext loads must produce byte-identical completions for the same prompt. This makes a perfect regression test:

using LMKit.Cryptography;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.TextGeneration.Sampling;

string plaintextPath = @"C:\models\my-finetuned-model.gguf";
string encryptedPath = @"C:\models\my-finetuned-model.lmke";
string password      = "your-strong-password";

EncryptedGguf.Encrypt(plaintextPath, encryptedPath,
    GgufEncryptionScheme.AesCtr256, password);

string Run(Func<LM> load)
{
    using LM m = load();
    var chat = new MultiTurnConversation(m)
    {
        SamplingMode = new GreedyDecoding(),
        MaximumCompletionTokens = 64,
    };
    return chat.Submit("hello").Completion;
}

string a = Run(() => new LM(plaintextPath));
string b = Run(() => LM.LoadEncrypted(encryptedPath,
    GgufEncryptionScheme.AesCtr256, password));

if (a != b)
{
    throw new InvalidOperationException("encrypted load diverges from plaintext");
}
Console.WriteLine("PASS: encrypted and plaintext outputs are byte-identical.");

The LM-Kit unit test suite ships this exact check (T50_EncryptedGgufTests).


Container Format

EncryptedGguf.Encrypt produces a file with the following layout:

Offset  Size  Field
──────  ────  ──────────────────────────────────────────────────
[ 0..  4)  4  Magic "LMKE"
[ 4..  8)  4  Format version (uint32 LE)
[ 8.. 12)  4  Encryption scheme (uint32 LE), matches GgufEncryptionScheme
[12.. 28) 16  PBKDF2 salt (random, per container)
[28.. 44) 16  AES-CTR nonce (random, initial counter block)
[44.. 48)  4  PBKDF2 iteration count (uint32 LE)
[48.. 56)  8  Plaintext total size (uint64 LE)
[56.. 64)  8  Plaintext metadata size (uint64 LE)
[64..  N)  _  AES-CTR ciphertext of the original plaintext GGUF, byte-for-byte

Because AES-CTR is a stream cipher, file offset P + 64 in the container corresponds to plaintext offset P. GGUF tensor offsets embedded in the metadata block therefore stay valid for direct seeking after decryption.
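
The header is simple enough to parse with a BinaryReader. The sketch below writes a synthetic header to a memory stream and reads it back; field names follow the table, the scheme value 0 for AesCtr256 is an assumption, and this is an illustration rather than LM-Kit's own reader:

```csharp
using System;
using System.IO;
using System.Text;

// Build a synthetic 64-byte .lmke header per the layout table.
using var ms = new MemoryStream();
using (var w = new BinaryWriter(ms, Encoding.ASCII, leaveOpen: true))
{
    w.Write(Encoding.ASCII.GetBytes("LMKE"));  // magic
    w.Write(1u);                               // format version (LE)
    w.Write(0u);                               // scheme (AesCtr256, assumed 0)
    w.Write(new byte[16]);                     // PBKDF2 salt
    w.Write(new byte[16]);                     // AES-CTR nonce
    w.Write(100_000u);                         // PBKDF2 iteration count
    w.Write(250_000_000UL);                    // plaintext total size
    w.Write(2_000_000UL);                      // plaintext metadata size
}

// Parse it back, validating the magic first.
ms.Position = 0;
using var r = new BinaryReader(ms);
string magic = Encoding.ASCII.GetString(r.ReadBytes(4));
if (magic != "LMKE")
    throw new InvalidDataException("not an LM-Kit encrypted GGUF container");
uint version    = r.ReadUInt32();
uint scheme     = r.ReadUInt32();
byte[] salt     = r.ReadBytes(16);
byte[] nonce    = r.ReadBytes(16);
uint iterations = r.ReadUInt32();
ulong plainSize = r.ReadUInt64();
ulong metaSize  = r.ReadUInt64();

Console.WriteLine($"v{version}, scheme {scheme}, {iterations} iterations, " +
    $"{plainSize} plaintext bytes ({metaSize} metadata)");
// Ciphertext of plaintext offset P starts at container offset P + 64.
```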


Troubleshooting

"File is not an LM-Kit encrypted GGUF container." The magic bytes at the start of the file do not match LMKE. You are probably pointing LoadEncrypted at a plaintext .gguf by mistake, or the file was truncated during transfer.

"Failed to load encrypted model." The container is well-formed but the weights did not decode. Almost always a wrong password. Passwords are case-sensitive and must be byte-exact.

Slow load times. Expect roughly a 2x load-time overhead versus a plaintext load: the first pass decrypts metadata, then every tensor is decrypted on the callback path. If you need lower latency, encrypt once, cache the resulting LM in-process via LM-Kit's Configuration.EnableModelCache, and reuse it across sessions.


Limitations and Security Notes

  • AES-256-CTR is a confidentiality primitive, not an authentication primitive. A motivated attacker with write access to the container can flip bits and produce a slightly different model. If you also need tamper detection, wrap the container with an HMAC-SHA256 tag or attach a digital signature next to it and verify before calling LoadEncrypted.
  • In-process secrets. The derived key lives in managed memory during a load. EncryptedGguf.Reader.Dispose() zeroes it. But the weights themselves are in plaintext in RAM for the lifetime of the LM instance, same as any model: if your threat model includes memory dumps, encryption at rest alone is not enough.
  • Single scheme today. Only AesCtr256 is supported. The enum leaves room to add authenticated variants (e.g. AES-GCM per chunk) in a future release without breaking existing containers.
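
The HMAC-wrapping pattern from the first bullet can be sketched with the standard library. The demo below operates on an in-memory "container"; in practice the bytes come from the .lmke file and the tag lives in a detached file next to it, and the key-distribution choice is yours:

```csharp
using System;
using System.Security.Cryptography;

// Stand-ins: in practice the container is the .lmke file's bytes and
// the HMAC key is distributed with (or derived by) the application.
byte[] hmacKey = RandomNumberGenerator.GetBytes(32);
byte[] container = RandomNumberGenerator.GetBytes(4096);

// Packaging step: compute a detached MAC over the whole container.
byte[] tag = HMACSHA256.HashData(hmacKey, container);

// Load step: recompute and compare in constant time BEFORE LoadEncrypted.
byte[] recomputed = HMACSHA256.HashData(hmacKey, container);
bool ok = CryptographicOperations.FixedTimeEquals(tag, recomputed);
Console.WriteLine(ok ? "container authentic" : "TAMPERED: refuse to load");

// A single flipped bit anywhere in the container must be caught.
container[1234] ^= 0x01;
bool stillOk = CryptographicOperations.FixedTimeEquals(
    tag, HMACSHA256.HashData(hmacKey, container));
Console.WriteLine(stillOk ? "BUG" : "tamper detected");
```

FixedTimeEquals avoids leaking, via timing, how many leading tag bytes matched.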

Related Resources

  • demos/console_net/encrypted_model_loading: end-to-end demo that downloads a model from the LM-Kit catalog, encrypts it, loads the container, and drives a multi-turn chat.
  • testing/unit_tests/T50_EncryptedGgufTests.cs: regression test verifying that greedy decoding from the encrypted and plaintext loads matches byte-for-byte.