Transcribe Audio with Local Speech-to-Text
LM-Kit.NET includes OpenAI Whisper models for on-device speech recognition. Audio is transcribed entirely on your machine with no cloud API calls, no internet required, and no audio data leaving your infrastructure. This tutorial builds a working transcription program that processes audio files, streams results segment by segment, and detects the spoken language automatically.
Why Local Speech-to-Text Matters
Two enterprise problems that on-device transcription solves:
- Healthcare and legal compliance. Patient dictations, attorney-client conversations, and therapy sessions contain protected information that cannot be sent to cloud APIs without complex data processing agreements. Local Whisper transcription eliminates third-party data exposure entirely, simplifying HIPAA, GDPR, and privilege compliance.
- Offline field transcription. Journalists in remote areas, field engineers on oil rigs, and military personnel in disconnected environments need to transcribe interviews, inspection notes, and briefings without internet access. A local model runs on a laptop with no connectivity.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 0.9 GB (for whisper-large-turbo3) |
| Disk | ~1.7 GB free for model download |
| Audio file | A .wav file (16-bit PCM, any sample rate) |
Whisper models are small. Even the largest turbo model needs under 1 GB of VRAM, so speech-to-text works on virtually any GPU.
Step 1: Create the Project
dotnet new console -n TranscriptionQuickstart
cd TranscriptionQuickstart
dotnet add package LM-Kit.NET
Step 2: Understand Whisper Models
Whisper models convert audio waveforms into text. They process audio in 30-second chunks, detecting language automatically and producing timestamped segments.
┌───────────────────────────────────┐
│ Whisper Model │
│ │
Audio (WAV) ────► │ 1. Split into 30s chunks │
│ 2. Detect language │
│ 3. Transcribe each chunk │
│ 4. Output timestamped segments │
│ │
└───────────────┬───────────────────┘
│
▼
┌───────────────────────────────────┐
│ TranscriptionResult │
│ Segment 1: [00:00 - 00:08] │
│ Segment 2: [00:08 - 00:15] │
│ ... │
└───────────────────────────────────┘
| Model ID | VRAM | Speed | Accuracy | Best For |
|---|---|---|---|---|
| whisper-tiny | ~50 MB | Fastest | Basic | Quick drafts, real-time previews |
| whisper-base | ~80 MB | Very fast | Good | General use with speed priority |
| whisper-small | ~260 MB | Fast | Very good | Good balance for most tasks |
| whisper-medium | ~820 MB | Moderate | Excellent | Professional transcription |
| whisper-large-turbo3 | ~870 MB | Moderate | Best | Highest accuracy (recommended) |
whisper-large-turbo3 is the recommended default. It matches the accuracy of the full large model at roughly 3x the speed.
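Any ID from the table above can be substituted at load time. A minimal sketch, assuming the progress callbacks shown in Step 3 are optional parameters:

```csharp
using LMKit.Model;

// Pick a smaller model when running on constrained hardware.
// whisper-small is a reasonable middle ground between speed
// and accuracy per the table above.
using LM model = LM.LoadFromModelID("whisper-small");
```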
Step 3: Basic Transcription
This program loads a Whisper model, transcribes a WAV file, and prints each segment with timestamps.
using System.Diagnostics;
using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
EnableVoiceActivityDetection = true,
SuppressNonSpeechTokens = true,
SuppressHallucinations = true
};
// Stream segments as they are recognized
engine.OnNewSegment += (_, e) =>
{
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.Write($"[{e.Segment.Start:mm\\:ss} - {e.Segment.End:mm\\:ss}] ");
Console.ResetColor();
Console.WriteLine(e.Segment.Text);
};
// Show progress
engine.OnProgress += (_, e) =>
{
Console.Write($"\r Progress: {e.Progress}% ");
};
// ──────────────────────────────────────
// 3. Transcription loop
// ──────────────────────────────────────
Console.WriteLine("Enter the path to a WAV audio file (or 'quit' to exit):\n");
while (true)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write("File: ");
Console.ResetColor();
string? path = Console.ReadLine()?.Trim('"');
if (string.IsNullOrWhiteSpace(path) || path.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
if (!File.Exists(path))
{
Console.WriteLine($" File not found: {path}\n");
continue;
}
try
{
using var audio = new WaveFile(path);
Console.WriteLine($" Audio duration: {audio.Duration:mm\\:ss\\.ff}\n");
var sw = Stopwatch.StartNew();
var result = engine.Transcribe(audio);
sw.Stop();
// Summary
Console.WriteLine();
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine($" Segments: {result.Segments.Count}");
Console.WriteLine($" Duration: {sw.Elapsed:mm\\:ss\\.ff}");
Console.ResetColor();
// Full transcript
Console.WriteLine($"\n Full transcript:\n {result.Text}\n");
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($" Error: {ex.Message}\n");
Console.ResetColor();
}
}
Run it:
dotnet run
Step 4: Language Detection
Whisper can detect the spoken language before transcribing. This is useful for multilingual audio or when you need to route audio to language-specific processing:
using var audio = new WaveFile("meeting-recording.wav");
var langResult = engine.DetectLanguage(audio);
Console.WriteLine($"Detected language: {langResult.Language}");
Console.WriteLine($"Confidence: {langResult.Confidence:P0}");
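Building on the detection call above, the result can drive simple routing. A sketch, assuming langResult.Language yields a code such as "en" or "fr" that Transcribe accepts (the exact value format is an assumption):

```csharp
// Route transcription based on the detected language.
// Falling back to forced English below a confidence cutoff is a
// policy choice for this example, not an LM-Kit requirement.
var result = langResult.Confidence < 0.5f
    ? engine.Transcribe(audio, language: "en")              // low confidence: use a default
    : engine.Transcribe(audio, language: langResult.Language);
```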
To force a specific language (skips auto-detection and can improve accuracy):
// Force a specific language instead of auto-detection:
// engine.Transcribe(audio, language: "fr") -> Force French
// engine.Transcribe(audio, language: "ja") -> Force Japanese
var result = engine.Transcribe(audio, language: "en"); // Force English
Step 5: Translation Mode
Whisper can translate speech from any supported language directly to English:
engine.Mode = SpeechToTextMode.Translation;
// French audio in, English text out
using var frenchAudio = new WaveFile("interview-fr.wav");
var result = engine.Transcribe(frenchAudio);
Console.WriteLine(result.Text); // English translation
// Switch back to transcription mode
engine.Mode = SpeechToTextMode.Transcription;
Step 6: Voice Activity Detection
Voice Activity Detection (VAD) identifies speech vs. silence in the audio. When enabled, Whisper skips silent segments, reducing processing time and preventing hallucinated text in quiet sections.
VAD is enabled by default. Fine-tune it for your audio characteristics:
engine.EnableVoiceActivityDetection = true;
engine.VadSettings = new VadSettings
{
EnergyThreshold = 0.5f, // 0.0-1.0, higher = stricter speech detection
MinSpeechDuration = TimeSpan.FromMilliseconds(250), // Ignore speech shorter than 250ms
MinSilenceDuration = TimeSpan.FromMilliseconds(100) // Silence gap to split segments
};
| Setting | Low Value | High Value |
|---|---|---|
| EnergyThreshold | More sensitive, catches quiet speech | Stricter, may miss soft speech |
| MinSpeechDuration | Catches short utterances | Filters out clicks, coughs |
| MinSilenceDuration | Fewer segment breaks | More granular segments |
For noisy environments (factories, outdoor sites), raise EnergyThreshold to 0.6-0.7. For clean recordings (studios, phone calls), the default 0.5 works well.
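That guidance can be captured in a small helper. A sketch; the enum and the specific threshold values are illustrative, not part of LM-Kit:

```csharp
enum RecordingEnvironment { Studio, Office, Outdoor, Factory }

// Map the recording environment to an EnergyThreshold following
// the guidance above: noisier settings get a stricter threshold.
static float ThresholdFor(RecordingEnvironment env) => env switch
{
    RecordingEnvironment.Studio  => 0.5f,  // clean audio: the default works well
    RecordingEnvironment.Office  => 0.55f,
    RecordingEnvironment.Outdoor => 0.6f,
    RecordingEnvironment.Factory => 0.7f,  // heavy background noise
    _ => 0.5f
};
```

Usage: engine.VadSettings.EnergyThreshold = ThresholdFor(RecordingEnvironment.Outdoor);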
Step 7: Processing Partial Audio
To transcribe only a portion of a long recording:
using var audio = new WaveFile("meeting-recording.wav");

// Start at 5 minutes, transcribe 2 minutes
engine.Start = TimeSpan.FromMinutes(5);
engine.Duration = TimeSpan.FromMinutes(2);

var result = engine.Transcribe(audio);

// Reset for full transcription
engine.Start = TimeSpan.Zero;
engine.Duration = TimeSpan.Zero;
Step 8: Using Prompts for Domain Accuracy
The Prompt property provides context that biases Whisper toward domain-specific vocabulary:
// Medical transcription
engine.Prompt = "cardiology, echocardiogram, ejection fraction, systolic, diastolic";
// Legal transcription
engine.Prompt = "deposition, plaintiff, defendant, stipulation, voir dire";
// Technical meeting
engine.Prompt = "Kubernetes, microservices, CI/CD pipeline, load balancer, API gateway";
The prompt does not appear in the output. It tells the model which specialized terms to expect, improving recognition accuracy for domain jargon.
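Since the prompt is just a comma-separated string of expected terms, it can be assembled from a maintained glossary rather than hard-coded. A sketch; the one-term-per-line glossary file is an assumption of this example:

```csharp
using System.Linq;

// Load domain terms from a glossary file (one term per line)
// and join them into the comma-separated form shown above.
string[] terms = File.ReadAllLines("glossary.txt")
    .Select(t => t.Trim())
    .Where(t => t.Length > 0)
    .ToArray();

engine.Prompt = string.Join(", ", terms);
```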
Working with AudioSegment Results
Each segment in the transcription result contains timing and confidence information:
using var audio = new WaveFile("meeting-recording.wav");
var result = engine.Transcribe(audio);

foreach (var segment in result.Segments)
{
    Console.WriteLine($"[{segment.Start:hh\\:mm\\:ss} - {segment.End:hh\\:mm\\:ss}]");
    Console.WriteLine($"Text: {segment.Text}");
    Console.WriteLine($"Confidence: {segment.Confidence:P0}");
    Console.WriteLine($"Language: {segment.Language}");
    Console.WriteLine();
}

// Full transcript (all segments joined)
string fullText = result.Text;
Console.WriteLine(fullText);
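Timestamped segments map naturally onto subtitle formats. A sketch that writes a SubRip (.srt) file from result.Segments, assuming each segment exposes the Start, End, and Text members shown above:

```csharp
using System.Text;

// Write result.Segments as SubRip (.srt): a 1-based index, an
// "HH:MM:SS,mmm --> HH:MM:SS,mmm" time range, then the segment text.
var srt = new StringBuilder();
int index = 1;

foreach (var segment in result.Segments)
{
    srt.AppendLine(index++.ToString());
    srt.AppendLine($"{segment.Start:hh\\:mm\\:ss\\,fff} --> {segment.End:hh\\:mm\\:ss\\,fff}");
    srt.AppendLine(segment.Text.Trim());
    srt.AppendLine();
}

File.WriteAllText("transcript.srt", srt.ToString());
```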
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| InvalidOperationException on load | Model is not a Whisper model | Use a Whisper model ID: whisper-large-turbo3, whisper-small, etc. |
| Empty transcription | Audio is silence or VAD threshold too high | Lower VadSettings.EnergyThreshold to 0.3; check that the audio file is not empty |
| Hallucinated text in silent parts | VAD disabled or threshold too low | Enable VAD; raise the threshold; set SuppressHallucinations = true |
| Wrong language detected | Short audio clip or ambiguous content | Force the language with engine.Transcribe(audio, language: "en") |
| Garbled or repeated words | Audio quality too poor for model size | Use whisper-large-turbo3 for difficult audio; clean the audio with preprocessing |
| Non-WAV format not supported | WaveFile expects 16-bit PCM WAV | Convert to WAV first with ffmpeg or NAudio before loading |
Next Steps
- Load a Model and Generate Your First Response: model loading fundamentals if you are new to LM-Kit.NET.
- Build a RAG Pipeline Over Your Own Documents: index transcribed text for searchable knowledge bases.
- Extract Structured Data from Unstructured Text: pull structured data (names, dates, action items) from transcripts.
- Build a Voice-Commanded Agent That Executes Tools: connect speech-to-text to an AI agent with web search, calculator, and other tools.
- Samples: Audio Transcription App: full MAUI transcription application.