Build a Voice-Commanded Agent That Executes Tools
Voice interfaces make AI accessible in situations where typing is impractical: hands-busy factory floors, medical exam rooms, accessibility scenarios, and vehicle cabins. This guide connects LM-Kit.NET's local speech-to-text engine (Whisper) to an AI agent equipped with tools, creating a system where a user speaks a request, the agent reasons about which tools to call, and returns a concise answer ready to be read aloud. All processing runs locally, with no cloud API calls.
Why Voice-Commanded Agents Matter
Two enterprise problems that voice-driven local agents solve:
- Hands-free operation in regulated environments. A surgeon reviewing patient notes mid-procedure, a warehouse worker checking inventory while handling packages, a pilot running pre-flight checklists: all need an AI assistant they can talk to without touching a screen. A local voice agent responds to spoken commands, calls tools to look up data, and delivers answers verbally, without sending audio to cloud servers.
- Accessible AI for users with mobility impairments. Keyboard and mouse interfaces exclude users with limited dexterity. A voice-commanded agent lets anyone interact with complex systems (search the web, calculate values, query databases) through natural speech, running entirely on their local machine for privacy.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | ~1 GB (Whisper) + 4+ GB (chat model with tools) |
| Disk | ~5 GB for model downloads |
| Audio | A .wav file (16-bit PCM, any sample rate) |
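If you are unsure whether an audio file meets the 16-bit PCM requirement, a quick header check with plain .NET can save a debugging round trip. This is a sketch, not part of the LM-Kit API: it assumes a canonical 44-byte RIFF header, so files with extra metadata chunks (common from some recorders) would need full chunk parsing.

```csharp
// Reads the canonical WAV header fields. Assumes "fmt " is the first chunk.
using var reader = new BinaryReader(File.OpenRead("command.wav"));
string riff = new string(reader.ReadChars(4));  // "RIFF"
reader.ReadInt32();                             // total chunk size
string wave = new string(reader.ReadChars(4));  // "WAVE"
reader.ReadChars(4);                            // "fmt "
reader.ReadInt32();                             // fmt chunk size (16 for PCM)
short audioFormat = reader.ReadInt16();         // 1 = uncompressed PCM
short channels = reader.ReadInt16();
int sampleRate = reader.ReadInt32();
reader.ReadInt32();                             // byte rate
reader.ReadInt16();                             // block align
short bitsPerSample = reader.ReadInt16();

Console.WriteLine(
    $"{riff}/{wave}, PCM={audioFormat == 1}, {channels} ch, {sampleRate} Hz, {bitsPerSample}-bit");
```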
Step 1: Create the Project
```bash
dotnet new console -n VoiceAgent
cd VoiceAgent
dotnet add package LM-Kit.NET
```
Step 2: Understand the Pipeline
The voice agent pipeline has three stages: transcribe, reason, respond.
```
┌───────────────┐      ┌──────────────────────┐      ┌──────────────────┐
│  Audio Input  │      │       AI Agent       │      │   Text Output    │
│  (.wav file)  │────► │                      │────► │   (response)     │
└───────────────┘      │  1. Parse request    │      └──────────────────┘
        │              │  2. Select tools     │
        ▼              │  3. Execute tools    │
┌───────────────┐      │  4. Synthesize reply │
│  Speech-to-   │      └──────────────────────┘
│ Text (Whisper)│               │
│               │───────────────┘
│  Transcribed  │      Tools available:
│     text      │      ├── Web Search
└───────────────┘      ├── Calculator
                       ├── Date/Time
                       └── ... (any built-in tool)
```
The key insight: speech-to-text converts audio to plain text, and the agent processes that text exactly as it would process typed input. The two systems compose naturally.
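That compose point can be sketched in a few lines, reusing the `stt` and `agent` objects that Steps 3 and 4 below construct:

```csharp
// Stage 1: audio → text (Whisper). Stage 2: text → tool-using reasoning (agent).
// Assumes `stt` and `agent` are configured as shown in Steps 3 and 4.
var transcription = stt.Transcribe(new WaveFile("command.wav"), language: "auto");
var result = await agent.RunAsync(transcription.Text);
Console.WriteLine(result.Content);
```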
Step 3: Set Up Speech-to-Text
Load a Whisper model and configure voice activity detection (VAD) to handle real-world audio with background noise and pauses.
```csharp
using System.Text;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.Write("Loading speech model...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rSpeech model: {(double)read / len.Value * 100:F1}%");
        return true;
    });
Console.WriteLine(" done.");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var stt = new SpeechToText(whisperModel);
stt.EnableVoiceActivityDetection = true;
stt.SuppressHallucinations = true;
stt.SuppressNonSpeechTokens = true;

// Tune VAD for your environment
stt.VadSettings = new VadSettings
{
    EnergyThreshold = 0.15f,                            // Sensitivity (0 = most sensitive)
    MinSpeechDuration = TimeSpan.FromMilliseconds(300), // Ignore very short sounds
    MinSilenceDuration = TimeSpan.FromMilliseconds(200) // Pause detection
};

// ──────────────────────────────────────
// 3. Transcribe a test file
// ──────────────────────────────────────
var audio = new WaveFile("command.wav");
var transcription = stt.Transcribe(audio, language: "auto");

Console.WriteLine($"Transcribed: \"{transcription.Text}\"");
Console.WriteLine($"Language: {transcription.Language}");
Console.WriteLine($"Segments: {transcription.Segments.Count}");
```
For a deeper dive into transcription options, see Transcribe Audio with Local Speech-to-Text and Tune Whisper Transcription with VAD.
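If transcription latency matters on your hardware, measure it before committing to the large model. A minimal sketch, assuming the `stt` engine from above; `whisper-small` is one of the smaller model IDs mentioned in the Common Issues table below:

```csharp
// Time a single transcription to decide whether the large model is viable on this machine.
var sw = System.Diagnostics.Stopwatch.StartNew();
var timed = stt.Transcribe(new WaveFile("command.wav"), language: "auto");
sw.Stop();
Console.WriteLine($"Transcribed in {sw.Elapsed.TotalSeconds:F1}s: \"{timed.Text}\"");

// On CPU-only machines, if this is too slow, swap the model ID when loading:
//   using LM whisperModel = LM.LoadFromModelID("whisper-small", ...);
// then rebuild the SpeechToText engine with the smaller model.
```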
Step 4: Set Up the Agent with Tools
Create an agent with built-in tools so it can act on spoken commands.
```csharp
using LMKit.Model;
using LMKit.Agents;
using LMKit.Agents.Tools;
using LMKit.Agents.Tools.BuiltIn;

// ──────────────────────────────────────
// 1. Load a chat model with tool-calling support
// ──────────────────────────────────────
Console.Write("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("qwen3:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rChat model: {(double)read / len.Value * 100:F1}%");
        return true;
    });
Console.WriteLine(" done.");

// ──────────────────────────────────────
// 2. Build the agent
// ──────────────────────────────────────
var agent = Agent.CreateBuilder(chatModel)
    .WithPersona("Voice Assistant")
    .WithInstruction(
        "You are a voice-controlled assistant. Users speak commands that are " +
        "transcribed to text. Keep responses concise and conversational, as they " +
        "will be read aloud. Use tools when the user asks for real-time data, " +
        "calculations, or factual lookups.")
    .WithTools(tools =>
    {
        tools.Register(BuiltInTools.WebSearch);
        tools.Register(BuiltInTools.CalcArithmetic);
        tools.Register(BuiltInTools.DateTimeNow);
    })
    .WithMaxIterations(5)
    .Build();
```
For details on tool registration and permission policies, see Create an AI Agent with Tools and Secure Agent Tool Access with Permission Policies.
Step 5: Connect Voice Input to Agent Execution
Wire the two systems together: transcribe audio, pass the text to the agent, display the result.
```csharp
using System.Text;
using LMKit.Model;
using LMKit.Speech;
using LMKit.Agents;
using LMKit.Agents.Tools;
using LMKit.Agents.Tools.BuiltIn;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading models...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rSpeech model: {(double)read / len.Value * 100:F1}%");
        return true;
    });
Console.WriteLine();

using LM chatModel = LM.LoadFromModelID("qwen3:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rChat model: {(double)read / len.Value * 100:F1}%");
        return true;
    });
Console.WriteLine("\nModels loaded.\n");

// ──────────────────────────────────────
// 2. Set up speech-to-text
// ──────────────────────────────────────
var stt = new SpeechToText(whisperModel);
stt.EnableVoiceActivityDetection = true;
stt.SuppressHallucinations = true;

// ──────────────────────────────────────
// 3. Set up agent with tools
// ──────────────────────────────────────
var agent = Agent.CreateBuilder(chatModel)
    .WithPersona("Voice Assistant")
    .WithInstruction(
        "You are a voice-controlled assistant. Keep responses short and " +
        "conversational. Use tools for real-time data and calculations.")
    .WithTools(tools =>
    {
        tools.Register(BuiltInTools.WebSearch);
        tools.Register(BuiltInTools.CalcArithmetic);
        tools.Register(BuiltInTools.DateTimeNow);
    })
    .WithMaxIterations(5)
    .Build();

// ──────────────────────────────────────
// 4. Process voice commands
// ──────────────────────────────────────
string[] audioFiles = Directory.GetFiles("audio_commands", "*.wav");

foreach (string audioPath in audioFiles)
{
    Console.WriteLine($"Processing: {Path.GetFileName(audioPath)}");

    // Transcribe
    var audio = new WaveFile(audioPath);
    var transcription = stt.Transcribe(audio, language: "auto");
    Console.WriteLine($"  Heard: \"{transcription.Text}\"");

    if (string.IsNullOrWhiteSpace(transcription.Text))
    {
        Console.WriteLine("  (no speech detected, skipping)\n");
        continue;
    }

    // Execute through agent
    var result = await agent.RunAsync(transcription.Text);
    Console.WriteLine($"  Response: {result.Content}\n");
}
```
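In production, the audio directory may be missing, or a file may be corrupt or still being written by a recorder. A defensive variant of the loop is sketched below, using only the APIs already shown in this guide:

```csharp
// Guard against a missing input directory before enumerating files.
if (!Directory.Exists("audio_commands"))
{
    Console.WriteLine("No 'audio_commands' directory found; nothing to process.");
    return;
}

foreach (string audioPath in Directory.GetFiles("audio_commands", "*.wav"))
{
    try
    {
        var audio = new WaveFile(audioPath);
        var transcription = stt.Transcribe(audio, language: "auto");

        if (string.IsNullOrWhiteSpace(transcription.Text))
            continue; // silence or unreadable speech

        var result = await agent.RunAsync(transcription.Text);
        Console.WriteLine($"{Path.GetFileName(audioPath)} → {result.Content}");
    }
    catch (Exception ex)
    {
        // A corrupt or half-written .wav should not abort the whole batch.
        Console.WriteLine($"Skipping {Path.GetFileName(audioPath)}: {ex.Message}");
    }
}
```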
Step 6: Add Multi-Turn Conversation Memory
For a conversational voice assistant (not just one-shot commands), use MultiTurnConversation to maintain context across multiple spoken exchanges.
```csharp
using System.Text;
using LMKit.Model;
using LMKit.Speech;
using LMKit.TextGeneration;
using LMKit.Agents.Tools;
using LMKit.Agents.Tools.BuiltIn;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.OutputEncoding = Encoding.UTF8;

// Load models (as shown in previous steps)
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3");
using LM chatModel = LM.LoadFromModelID("qwen3:8b");

// Set up speech-to-text
var stt = new SpeechToText(whisperModel);
stt.EnableVoiceActivityDetection = true;
stt.SuppressHallucinations = true;

// Set up multi-turn conversation with tools
using var chat = new MultiTurnConversation(chatModel);
chat.SystemPrompt =
    "You are a voice-controlled assistant. Responses are read aloud, " +
    "so keep them concise (1 to 3 sentences). Use tools when needed.";
chat.Tools.Register(BuiltInTools.WebSearch);
chat.Tools.Register(BuiltInTools.CalcArithmetic);
chat.Tools.Register(BuiltInTools.DateTimeNow);

// Conversation loop
Console.WriteLine("Voice assistant ready. Place .wav files in 'input/' to process.\n");

// Simulate a multi-turn conversation with sequential audio files
string[] turns = { "input/turn1.wav", "input/turn2.wav", "input/turn3.wav" };

foreach (string audioPath in turns)
{
    if (!File.Exists(audioPath))
        continue;

    var audio = new WaveFile(audioPath);
    var transcription = stt.Transcribe(audio, language: "auto");

    if (string.IsNullOrWhiteSpace(transcription.Text))
        continue;

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.WriteLine($"User: {transcription.Text}");
    Console.ResetColor();

    // Submit to multi-turn conversation (maintains full history)
    var result = chat.Submit(transcription.Text);

    Console.ForegroundColor = ConsoleColor.Green;
    Console.WriteLine($"Assistant: {result.Content}");
    Console.ResetColor();
    Console.WriteLine();
}

Console.WriteLine($"Chat history entries: {chat.ChatHistory.Count}");
```
For persistent memory across sessions, see Use Agent Memory for Long-Term Knowledge Across Sessions.
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Transcription returns empty text | Audio has no speech or VAD is too strict | Lower EnergyThreshold in VadSettings (e.g., 0.05) or disable VAD temporarily to test |
| Agent ignores tool results | Model lacks tool-calling capability | Use a model with ToolsCall capability: qwen3:8b, gemma3:12b, gptoss:20b |
| Whisper hallucinates repeated text | Silence or low-quality audio | Enable SuppressHallucinations = true and ensure audio has clear speech |
| Slow transcription on CPU | Large Whisper model on CPU | Use whisper-tiny or whisper-small for faster CPU transcription |
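The first row of the table suggests disabling VAD to isolate the problem. A minimal diagnostic, assuming the `stt` engine from Step 3:

```csharp
// Temporarily bypass VAD to check whether the audio contains recognizable speech at all.
stt.EnableVoiceActivityDetection = false;
var raw = stt.Transcribe(new WaveFile("command.wav"), language: "auto");
Console.WriteLine($"Raw (no VAD): \"{raw.Text}\"");

// If text appears here but not in normal runs, VAD was filtering the speech out:
// re-enable it with a more sensitive threshold rather than leaving it off.
stt.EnableVoiceActivityDetection = true;
stt.VadSettings = new VadSettings { EnergyThreshold = 0.05f };
```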
Next Steps
- Transcribe Audio with Local Speech-to-Text for advanced Whisper configuration
- Tune Whisper Transcription with VAD, Hallucination Suppression, and Segment Processing for audio quality tuning
- Create an AI Agent with Tools for adding more tools to the agent
- Build an Agent with Web Search and Live Data Access for real-time web search
- Choose the Right Planning Strategy for Your Agent for multi-step voice commands