How Do I Reduce Hallucinations in Local AI Responses?


TL;DR

Hallucinations happen when a model generates plausible-sounding but incorrect information. LM-Kit.NET provides multiple techniques to minimize them: RAG grounding (anchor answers in your data), grammar-constrained decoding (force structured output), lower temperature (reduce randomness), system prompts (instruct the model to say "I don't know"), and choosing a larger model (which tends to hallucinate less). The most effective approach combines several of these techniques.


Technique 1: Ground Responses with RAG

The single most effective way to reduce hallucinations is Retrieval-Augmented Generation (RAG). Instead of relying on the model's training data, RAG retrieves relevant passages from your documents and injects them into the prompt. The model then generates answers based on this retrieved context.

using LMKit.Retrieval;

// Build a knowledge base from your documents
var ragEngine = new RagEngine(embeddingModel);
ragEngine.ImportDocument("company-handbook.pdf");

// RagChat automatically retrieves relevant context before generating
var chat = new RagChat(chatModel, ragEngine);
var answer = await chat.SubmitAsync("What is our vacation policy?");
// Answer is grounded in the actual handbook content

RAG works because the model is generating from provided facts rather than recalling from potentially unreliable training data.


Technique 2: Constrain Output Format with Grammar

Grammar-constrained decoding forces the model to produce output that matches a specific structure. This eliminates a large class of hallucinations (invalid values, malformed JSON, labels outside an allowed set) by restricting what the model can generate:

using LMKit.TextGeneration.Sampling;

var chat = new MultiTurnConversation(model);

// Force JSON output matching a schema
chat.Grammar = Grammar.FromJsonSchema(@"{
    ""type"": ""object"",
    ""properties"": {
        ""answer"": { ""type"": ""string"" },
        ""confidence"": { ""type"": ""number"", ""minimum"": 0, ""maximum"": 1 },
        ""source"": { ""type"": ""string"" }
    },
    ""required"": [""answer"", ""confidence"", ""source""]
}");

Predefined grammars are also available for common formats:

// Force valid JSON output
chat.Grammar = new Grammar(Grammar.PredefinedGrammar.Json);

// Force output from a specific list of values
chat.Grammar = Grammar.FromStringList(new[] { "positive", "negative", "neutral" });

Technique 3: Lower the Temperature

Temperature controls randomness in token selection. Lower values make the model more deterministic and less likely to generate creative (but potentially wrong) content:

using LMKit.TextGeneration.Sampling;

var chat = new MultiTurnConversation(model);

// More deterministic, less creative (reduces hallucinations)
chat.SamplingMode = new RandomSampling
{
    Temperature = 0.3f,  // Default is 0.8
    TopP = 0.9f,
    TopK = 30
};

For factual Q&A tasks, a temperature of 0.1 to 0.3 significantly reduces hallucinations. For creative tasks, keep it higher (0.7 to 1.0).

Greedy decoding (temperature = 0) always picks the most likely token and produces fully deterministic output:

chat.SamplingMode = new GreedyDecoding();

Technique 4: Use Mirostat Sampling

Mirostat directly controls the perplexity (surprisal) of generated text, keeping output at a consistent quality level:

using LMKit.TextGeneration.Sampling;

chat.SamplingMode = new MirostatSampling
{
    TargetEntropy = 3.0f,  // Lower = more focused, less hallucination
    LearningRate = 0.1f,
    Temperature = 0.5f
};

Mirostat is particularly useful when you want consistent quality without manually tuning temperature and top-p together.


Technique 5: Write Effective System Prompts

A well-crafted system prompt instructs the model to acknowledge uncertainty instead of fabricating answers:

var chat = new MultiTurnConversation(model);
chat.SystemPrompt = @"You are a helpful assistant that answers questions accurately.
Follow these rules strictly:
- Only answer based on information you are confident about.
- If you are unsure or do not have enough information, say 'I don't have enough information to answer that.'
- Never make up facts, statistics, URLs, or citations.
- When providing technical information, qualify your confidence level.";

Technique 6: Choose a Larger Model

Larger models tend to hallucinate less because they have richer internal representations of factual knowledge. If hallucination is a critical concern and your hardware allows it, moving from a 4B to an 8B or 12B model often provides a noticeable improvement in factual accuracy.

Model Size | Hallucination Risk                                      | Typical Use
1B to 3B   | Higher. Best with RAG or constrained output.            | Classification, extraction, simple chat
4B to 8B   | Moderate. Good for general tasks with proper prompting. | Production agents, document Q&A
12B+       | Lower. Stronger factual recall and reasoning.           | High-stakes generation, complex reasoning
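
As a sketch, swapping in a larger model is typically a one-line change at load time. This assumes model loading via an LM class constructed from a model file, as in LM-Kit's quickstart samples; the file names below are placeholders, not real model paths:

using LMKit.Model;
using LMKit.TextGeneration;

// Placeholder file names: point these at whatever GGUF models you actually use.
// var model = new LM("small-4b-model.gguf");  // smaller: faster, more hallucination-prone
var model = new LM("larger-8b-model.gguf");    // larger: slower, better factual recall

var chat = new MultiTurnConversation(model);

Everything else in your pipeline (sampling settings, grammar, system prompt) stays the same, which makes it easy to A/B test model sizes against your own hallucination-sensitive queries.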

Combining Techniques

The strongest results come from combining multiple approaches. A production setup might use:

  1. RAG for grounding in domain-specific data
  2. Grammar constraints for structured output
  3. Low temperature (0.2 to 0.4) for factual consistency
  4. System prompt instructing the model to avoid speculation
  5. 8B+ model for better base quality
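
Putting the pieces together, a combined setup might look like the sketch below. It reuses only the classes shown in the techniques above and assumes RagChat exposes the same Grammar, SamplingMode, and SystemPrompt properties demonstrated on MultiTurnConversation; the document name is a placeholder:

using LMKit.Retrieval;
using LMKit.TextGeneration.Sampling;

// 1. RAG: ground answers in your own documents
var ragEngine = new RagEngine(embeddingModel);
ragEngine.ImportDocument("company-handbook.pdf");
var chat = new RagChat(chatModel, ragEngine);

// 2. Grammar: force well-formed structured output
chat.Grammar = new Grammar(Grammar.PredefinedGrammar.Json);

// 3. Low temperature for factual consistency
chat.SamplingMode = new RandomSampling
{
    Temperature = 0.2f,
    TopP = 0.9f,
    TopK = 30
};

// 4. System prompt that permits "I don't know"
chat.SystemPrompt = "Answer only from the retrieved context. " +
    "If the context does not contain the answer, say so instead of guessing.";

var answer = await chat.SubmitAsync("What is our vacation policy?");

Each layer catches a different failure mode: RAG supplies the facts, the grammar constrains the form, low temperature suppresses creative drift, and the system prompt gives the model an explicit escape hatch instead of forcing it to guess.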
