Build a Voice-Commanded Agent That Executes Tools

Voice interfaces make AI accessible in situations where typing is impractical: hands-busy factory floors, medical exam rooms, accessibility scenarios, and vehicle cabins. This guide connects LM-Kit.NET's local speech-to-text engine (Whisper) to an AI agent equipped with tools, creating a system where a user speaks a request, the agent reasons about which tools to call, and returns a concise, speech-ready answer. All processing runs locally, with no cloud API calls.


Why Voice-Commanded Agents Matter

Two enterprise problems that voice-driven local agents solve:

  1. Hands-free operation in regulated environments. A surgeon reviewing patient notes mid-procedure, a warehouse worker checking inventory while handling packages, a pilot running pre-flight checklists. All need an AI assistant they can talk to without touching a screen. A local voice agent responds to spoken commands, calls tools to look up data, and delivers answers verbally, without sending audio to cloud servers.
  2. Accessible AI for users with mobility impairments. Keyboard and mouse interfaces exclude users with limited dexterity. A voice-commanded agent lets anyone interact with complex systems (search the web, calculate values, query databases) through natural speech, running entirely on their local machine for privacy.

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
VRAM          ~1 GB (Whisper) + 4+ GB (chat model with tools)
Disk          ~5 GB for model downloads
Audio         A .wav file (16-bit PCM, any sample rate)
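
If you are unsure whether an audio file meets the 16-bit PCM requirement, a quick header check with standard .NET APIs can save a debugging round-trip. This is an illustrative sketch using only the base class library (not an LM-Kit.NET API), and it assumes a canonical WAV layout where the fmt chunk immediately follows the "WAVE" marker:

```csharp
using System;
using System.IO;

// Sketch: read the RIFF/WAVE header of a .wav file and report its format.
static void PrintWavFormat(string path)
{
    using var reader = new BinaryReader(File.OpenRead(path));
    string riff = new string(reader.ReadChars(4));   // "RIFF"
    reader.ReadInt32();                              // total file size
    string wave = new string(reader.ReadChars(4));   // "WAVE"
    if (riff != "RIFF" || wave != "WAVE")
        throw new InvalidDataException("Not a WAV file.");

    reader.ReadChars(4);                             // "fmt "
    reader.ReadInt32();                              // fmt chunk size
    short audioFormat   = reader.ReadInt16();        // 1 = uncompressed PCM
    short channels      = reader.ReadInt16();
    int   sampleRate    = reader.ReadInt32();
    reader.ReadInt32();                              // byte rate
    reader.ReadInt16();                              // block align
    short bitsPerSample = reader.ReadInt16();

    Console.WriteLine($"PCM: {audioFormat == 1}, {channels} ch, " +
                      $"{sampleRate} Hz, {bitsPerSample}-bit");
}
```

A file that reports PCM with 16 bits per sample satisfies the table above regardless of sample rate.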

Step 1: Create the Project

dotnet new console -n VoiceAgent
cd VoiceAgent
dotnet add package LM-Kit.NET

Step 2: Understand the Pipeline

The voice agent pipeline has three stages: transcribe, reason, respond.

  ┌───────────────┐     ┌──────────────────────┐     ┌──────────────────┐
  │  Audio Input  │     │     AI Agent         │     │   Text Output    │
  │  (.wav file)  │────►│                      │────►│   (response)     │
  └───────────────┘     │  1. Parse request    │     └──────────────────┘
         │              │  2. Select tools     │
         ▼              │  3. Execute tools    │
  ┌───────────────┐     │  4. Synthesize reply │
  │ Speech-to-    │     └──────────────────────┘
  │ Text (Whisper)│              │
  │               │──────────────┘
  │  Transcribed  │     Tools available:
  │  text         │     ├── Web Search
  └───────────────┘     ├── Calculator
                        ├── Date/Time
                        └── ... (any built-in tool)

The key insight: speech-to-text converts audio to plain text, and the agent processes that text exactly as it would process typed input. The two systems compose naturally.
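
This composition can be captured in a single helper. The sketch below reuses the stt and agent objects constructed in Steps 3 and 4; the method shape is illustrative glue code, not a fixed LM-Kit.NET API:

```csharp
// Sketch: the whole pipeline is one function from audio path to reply text.
// `stt` and `agent` are the objects built in Steps 3 and 4 below.
static async Task<string> RunVoiceCommandAsync(
    SpeechToText stt, Agent agent, string audioPath)
{
    var transcription = stt.Transcribe(new WaveFile(audioPath), language: "auto");

    if (string.IsNullOrWhiteSpace(transcription.Text))
        return "(no speech detected)";

    // The agent sees plain text; it cannot tell speech from typing.
    var result = await agent.RunAsync(transcription.Text);
    return result.Content;
}
```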


Step 3: Set Up Speech-to-Text

Load a Whisper model and configure voice activity detection (VAD) to handle real-world audio with background noise and pauses.

using System.Text;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.Write("Loading speech model...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rSpeech model: {(double)read / len.Value * 100:F1}%");
        return true;
    });
Console.WriteLine(" done.");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var stt = new SpeechToText(whisperModel);
stt.EnableVoiceActivityDetection = true;
stt.SuppressHallucinations = true;
stt.SuppressNonSpeechTokens = true;

// Tune VAD for your environment
stt.VadSettings = new VadSettings
{
    EnergyThreshold = 0.15f,              // Sensitivity (0 = most sensitive)
    MinSpeechDuration = TimeSpan.FromMilliseconds(300),  // Ignore very short sounds
    MinSilenceDuration = TimeSpan.FromMilliseconds(200)  // Pause detection
};

// ──────────────────────────────────────
// 3. Transcribe a test file
// ──────────────────────────────────────
var audio = new WaveFile("command.wav");
var transcription = stt.Transcribe(audio, language: "auto");

Console.WriteLine($"Transcribed: \"{transcription.Text}\"");
Console.WriteLine($"Language:    {transcription.Language}");
Console.WriteLine($"Segments:    {transcription.Segments.Count}");

For a deeper dive into transcription options, see Transcribe Audio with Local Speech-to-Text and Tune Whisper Transcription with VAD.


Step 4: Set Up the Agent with Tools

Create an agent with built-in tools so it can act on spoken commands.

using LMKit.Model;
using LMKit.Agents;
using LMKit.Agents.Tools;
using LMKit.Agents.Tools.BuiltIn;

// ──────────────────────────────────────
// 1. Load a chat model with tool-calling support
// ──────────────────────────────────────
Console.Write("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("qwen3:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rChat model: {(double)read / len.Value * 100:F1}%");
        return true;
    });
Console.WriteLine(" done.");

// ──────────────────────────────────────
// 2. Build the agent
// ──────────────────────────────────────
var agent = Agent.CreateBuilder(chatModel)
    .WithPersona("Voice Assistant")
    .WithInstruction(
        "You are a voice-controlled assistant. Users speak commands that are " +
        "transcribed to text. Keep responses concise and conversational, as they " +
        "will be read aloud. Use tools when the user asks for real-time data, " +
        "calculations, or factual lookups.")
    .WithTools(tools =>
    {
        tools.Register(BuiltInTools.WebSearch);
        tools.Register(BuiltInTools.CalcArithmetic);
        tools.Register(BuiltInTools.DateTimeNow);
    })
    .WithMaxIterations(5)
    .Build();

For details on tool registration and permission policies, see Create an AI Agent with Tools and Secure Agent Tool Access with Permission Policies.


Step 5: Connect Voice Input to Agent Execution

Wire the two systems together: transcribe audio, pass the text to the agent, display the result.

using System.Text;
using LMKit.Model;
using LMKit.Speech;
using LMKit.Agents;
using LMKit.Agents.Tools;
using LMKit.Agents.Tools.BuiltIn;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading models...");

using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rSpeech model: {(double)read / len.Value * 100:F1}%");
        return true;
    });
Console.WriteLine();

using LM chatModel = LM.LoadFromModelID("qwen3:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rChat model: {(double)read / len.Value * 100:F1}%");
        return true;
    });
Console.WriteLine("\nModels loaded.\n");

// ──────────────────────────────────────
// 2. Set up speech-to-text
// ──────────────────────────────────────
var stt = new SpeechToText(whisperModel);
stt.EnableVoiceActivityDetection = true;
stt.SuppressHallucinations = true;

// ──────────────────────────────────────
// 3. Set up agent with tools
// ──────────────────────────────────────
var agent = Agent.CreateBuilder(chatModel)
    .WithPersona("Voice Assistant")
    .WithInstruction(
        "You are a voice-controlled assistant. Keep responses short and " +
        "conversational. Use tools for real-time data and calculations.")
    .WithTools(tools =>
    {
        tools.Register(BuiltInTools.WebSearch);
        tools.Register(BuiltInTools.CalcArithmetic);
        tools.Register(BuiltInTools.DateTimeNow);
    })
    .WithMaxIterations(5)
    .Build();

// ──────────────────────────────────────
// 4. Process voice commands
// ──────────────────────────────────────
string[] audioFiles = Directory.GetFiles("audio_commands", "*.wav");

foreach (string audioPath in audioFiles)
{
    Console.WriteLine($"Processing: {Path.GetFileName(audioPath)}");

    // Transcribe
    var audio = new WaveFile(audioPath);
    var transcription = stt.Transcribe(audio, language: "auto");
    Console.WriteLine($"  Heard: \"{transcription.Text}\"");

    if (string.IsNullOrWhiteSpace(transcription.Text))
    {
        Console.WriteLine("  (no speech detected, skipping)\n");
        continue;
    }

    // Execute through agent
    var result = await agent.RunAsync(transcription.Text);
    Console.WriteLine($"  Response: {result.Content}\n");
}

Step 6: Add Multi-Turn Conversation Memory

For a conversational voice assistant (not just one-shot commands), use MultiTurnConversation to maintain context across multiple spoken exchanges.

using System.Text;
using LMKit.Model;
using LMKit.Speech;
using LMKit.TextGeneration;
using LMKit.Agents.Tools;
using LMKit.Agents.Tools.BuiltIn;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.OutputEncoding = Encoding.UTF8;

// Load models (as shown in previous steps)
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3");
using LM chatModel = LM.LoadFromModelID("qwen3:8b");

// Set up speech-to-text
var stt = new SpeechToText(whisperModel);
stt.EnableVoiceActivityDetection = true;
stt.SuppressHallucinations = true;

// Set up multi-turn conversation with tools
using var chat = new MultiTurnConversation(chatModel);
chat.SystemPrompt =
    "You are a voice-controlled assistant. Responses are read aloud, " +
    "so keep them concise (1 to 3 sentences). Use tools when needed.";

chat.Tools.Register(BuiltInTools.WebSearch);
chat.Tools.Register(BuiltInTools.CalcArithmetic);
chat.Tools.Register(BuiltInTools.DateTimeNow);

// Conversation loop
Console.WriteLine("Voice assistant ready. Place .wav files in 'input/' to process.\n");

// Simulate a multi-turn conversation with sequential audio files
string[] turns = { "input/turn1.wav", "input/turn2.wav", "input/turn3.wav" };

foreach (string audioPath in turns)
{
    if (!File.Exists(audioPath))
        continue;

    var audio = new WaveFile(audioPath);
    var transcription = stt.Transcribe(audio, language: "auto");

    if (string.IsNullOrWhiteSpace(transcription.Text))
        continue;

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.WriteLine($"User: {transcription.Text}");
    Console.ResetColor();

    // Submit to multi-turn conversation (maintains full history)
    var result = chat.Submit(transcription.Text);

    Console.ForegroundColor = ConsoleColor.Green;
    Console.WriteLine($"Assistant: {result.Content}");
    Console.ResetColor();
    Console.WriteLine();
}

Console.WriteLine($"Conversation turns: {chat.ChatHistory.Count}");

For persistent memory across sessions, see Use Agent Memory for Long-Term Knowledge Across Sessions.


Common Issues

Problem: Transcription returns empty text
  Cause: Audio has no speech, or VAD is too strict
  Fix:   Lower EnergyThreshold in VadSettings (e.g., 0.05), or disable VAD temporarily to test

Problem: Agent ignores tool results
  Cause: Model lacks tool-calling capability
  Fix:   Use a model with the ToolsCall capability: qwen3:8b, gemma3:12b, gptoss:20b

Problem: Whisper hallucinates repeated text
  Cause: Silence or low-quality audio
  Fix:   Set SuppressHallucinations = true and ensure the audio contains clear speech

Problem: Slow transcription on CPU
  Cause: Large Whisper model running on CPU
  Fix:   Use whisper-tiny or whisper-small for faster CPU transcription
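
For the empty-transcription case, a permissive VAD configuration is a useful first diagnostic. The values below are illustrative starting points using only the settings shown in Step 3, not tuned recommendations:

```csharp
// Diagnostic: loosen VAD to confirm the audio itself contains speech.
stt.VadSettings = new VadSettings
{
    EnergyThreshold = 0.05f,                             // more sensitive
    MinSpeechDuration = TimeSpan.FromMilliseconds(100),  // accept short utterances
    MinSilenceDuration = TimeSpan.FromMilliseconds(500)  // tolerate longer pauses
};

// If the text still comes back empty, disable VAD entirely to isolate the cause:
stt.EnableVoiceActivityDetection = false;
```

If transcription succeeds only with VAD disabled, tighten the thresholds gradually until your environment's background noise is filtered without clipping speech.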

Next Steps
