
Transcribe Audio with Local Speech-to-Text

LM-Kit.NET includes OpenAI Whisper models for on-device speech recognition. Audio is transcribed entirely on your machine with no cloud API calls, no internet required, and no audio data leaving your infrastructure. This tutorial builds a working transcription program that processes audio files, streams results segment by segment, and detects the spoken language automatically.


Why Local Speech-to-Text Matters

Two enterprise problems that on-device transcription solves:

  1. Healthcare and legal compliance. Patient dictations, attorney-client conversations, and therapy sessions contain protected information that cannot be sent to cloud APIs without complex data processing agreements. Local Whisper transcription eliminates third-party data exposure entirely, simplifying HIPAA, GDPR, and privilege compliance.
  2. Offline field transcription. Journalists in remote areas, field engineers on oil rigs, and military personnel in disconnected environments need to transcribe interviews, inspection notes, and briefings without internet access. A local model runs on a laptop with no connectivity.

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
VRAM          0.9 GB (for whisper-large-turbo3)
Disk          ~1.7 GB free for model download
Audio file    A .wav file (16-bit PCM, any sample rate)

Whisper models are small. Even the largest turbo model needs under 1 GB of VRAM, so speech-to-text works on virtually any GPU.


Step 1: Create the Project

dotnet new console -n TranscriptionQuickstart
cd TranscriptionQuickstart
dotnet add package LM-Kit.NET

Step 2: Understand Whisper Models

Whisper models convert audio waveforms into text. They process audio in 30-second chunks, detecting language automatically and producing timestamped segments.

                    ┌───────────────────────────────────┐
                    │          Whisper Model            │
                    │                                   │
  Audio (WAV) ────► │  1. Split into 30s chunks         │
                    │  2. Detect language               │
                    │  3. Transcribe each chunk         │
                    │  4. Output timestamped segments   │
                    │                                   │
                    └───────────────┬───────────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────────────┐
                    │  TranscriptionResult              │
                    │    Segment 1: [00:00 - 00:08]     │
                    │    Segment 2: [00:08 - 00:15]     │
                    │    ...                            │
                    └───────────────────────────────────┘

Model ID               VRAM      Speed       Accuracy    Best For
whisper-tiny           ~50 MB    Fastest     Basic       Quick drafts, real-time previews
whisper-base           ~80 MB    Very fast   Good        General use with speed priority
whisper-small          ~260 MB   Fast        Very good   Good balance for most tasks
whisper-medium         ~820 MB   Moderate    Excellent   Professional transcription
whisper-large-turbo3   ~870 MB   Moderate    Best        Highest accuracy (recommended)

whisper-large-turbo3 is the recommended default. It matches the accuracy of the full large model at roughly 3x the speed.
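The choice between models can also be made at run time. A minimal sketch, assuming the model IDs from the table above (the `--fast` flag is illustrative, not part of LM-Kit.NET):

```csharp
// Pick a faster model for quick drafts, the turbo model for final transcripts.
bool preferSpeed = Array.IndexOf(args, "--fast") >= 0;
string modelId = preferSpeed ? "whisper-small" : "whisper-large-turbo3";

using LM model = LM.LoadFromModelID(modelId);
var engine = new SpeechToText(model);
```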


Step 3: Basic Transcription

This program loads a Whisper model, transcribes a WAV file, and prints each segment with timestamps.

using System.Diagnostics;
using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

// Stream segments as they are recognized
engine.OnNewSegment += (_, e) =>
{
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.Write($"[{e.Segment.Start:mm\\:ss} - {e.Segment.End:mm\\:ss}] ");
    Console.ResetColor();
    Console.WriteLine(e.Segment.Text);
};

// Show progress
engine.OnProgress += (_, e) =>
{
    Console.Write($"\r  Progress: {e.Progress}%   ");
};

// ──────────────────────────────────────
// 3. Transcription loop
// ──────────────────────────────────────
Console.WriteLine("Enter the path to a WAV audio file (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("File: ");
    Console.ResetColor();

    string? path = Console.ReadLine()?.Trim('"');
    if (string.IsNullOrWhiteSpace(path) || path.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    if (!File.Exists(path))
    {
        Console.WriteLine($"  File not found: {path}\n");
        continue;
    }

    try
    {
        using var audio = new WaveFile(path);
        Console.WriteLine($"  Audio duration: {audio.Duration:mm\\:ss\\.ff}\n");

        var sw = Stopwatch.StartNew();
        var result = engine.Transcribe(audio);
        sw.Stop();

        // Summary
        Console.WriteLine();
        Console.ForegroundColor = ConsoleColor.Cyan;
        Console.WriteLine($"  Segments: {result.Segments.Count}");
        Console.WriteLine($"  Duration: {sw.Elapsed:mm\\:ss\\.ff}");
        Console.ResetColor();

        // Full transcript
        Console.WriteLine($"\n  Full transcript:\n  {result.Text}\n");
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"  Error: {ex.Message}\n");
        Console.ResetColor();
    }
}

Run it:

dotnet run

Step 4: Language Detection

Whisper can detect the spoken language before transcribing. This is useful for multilingual audio or when you need to route audio to language-specific processing:

using var audio = new WaveFile("meeting-recording.wav");

var langResult = engine.DetectLanguage(audio);
Console.WriteLine($"Detected language: {langResult.Language}");
Console.WriteLine($"Confidence: {langResult.Confidence:P0}");

To force a specific language (skips auto-detection and can improve accuracy):

using System.Diagnostics;
using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

// Stream segments as they are recognized
engine.OnNewSegment += (_, e) =>
{
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.Write($"[{e.Segment.Start:mm\\:ss} - {e.Segment.End:mm\\:ss}] ");
    Console.ResetColor();
    Console.WriteLine(e.Segment.Text);
};

// Show progress
engine.OnProgress += (_, e) =>
{
    Console.Write($"\r  Progress: {e.Progress}%   ");
};

// ──────────────────────────────────────
// 3. Transcription loop
// ──────────────────────────────────────
Console.WriteLine("Enter the path to a WAV audio file (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("File: ");
    Console.ResetColor();

    string? path = Console.ReadLine()?.Trim('"');
    if (string.IsNullOrWhiteSpace(path) || path.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    if (!File.Exists(path))
    {
        Console.WriteLine($"  File not found: {path}\n");
        continue;
    }

    try
    {
        using var audio = new WaveFile(path);
        Console.WriteLine($"  Audio duration: {audio.Duration:mm\\:ss\\.ff}\n");

        var sw = Stopwatch.StartNew();
        // Force a specific language instead of auto-detection:
        //   engine.Transcribe(audio, language: "fr")  -> Force French
        //   engine.Transcribe(audio, language: "ja")  -> Force Japanese
        var result = engine.Transcribe(audio, language: "en");  // Force English
        sw.Stop();

        // Summary
        Console.WriteLine();
        Console.ForegroundColor = ConsoleColor.Cyan;
        Console.WriteLine($"  Segments: {result.Segments.Count}");
        Console.WriteLine($"  Duration: {sw.Elapsed:mm\\:ss\\.ff}");
        Console.ResetColor();

        // Full transcript
        Console.WriteLine($"\n  Full transcript:\n  {result.Text}\n");
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"  Error: {ex.Message}\n");
        Console.ResetColor();
    }
}

Step 5: Translation Mode

Whisper can translate speech from any supported language directly to English:

engine.Mode = SpeechToTextMode.Translation;

// French audio in, English text out
using var frenchAudio = new WaveFile("interview-fr.wav");
var result = engine.Transcribe(frenchAudio);
Console.WriteLine(result.Text);  // English translation

// Switch back to transcription mode
engine.Mode = SpeechToTextMode.Transcription;

Step 6: Voice Activity Detection

Voice Activity Detection (VAD) identifies speech vs. silence in the audio. When enabled, Whisper skips silent segments, reducing processing time and preventing hallucinated text in quiet sections.

VAD is enabled by default. Fine-tune it for your audio characteristics:

using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

engine.VadSettings = new VadSettings
{
    EnergyThreshold = 0.5f,                                     // 0.0-1.0, higher = stricter speech detection
    MinSpeechDuration = TimeSpan.FromMilliseconds(250),         // Ignore speech shorter than 250ms
    MinSilenceDuration = TimeSpan.FromMilliseconds(100)         // Silence gap to split segments
};

Setting              Low Value                              High Value
EnergyThreshold      More sensitive, catches quiet speech   Stricter, may miss soft speech
MinSpeechDuration    Catches short utterances               Filters out clicks and coughs
MinSilenceDuration   Fewer segment breaks                   More granular segments

For noisy environments (factories, outdoor recordings), raise EnergyThreshold to 0.6-0.7. For clean recordings (studios, phone calls), the default of 0.5 works well.
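As a concrete starting point for noisy input, a stricter profile might look like this (the exact values are illustrative and worth tuning against your own recordings):

```csharp
// Stricter VAD profile for noisy environments:
// higher energy threshold and a longer minimum speech duration
// filter out brief noise bursts that are not speech.
engine.VadSettings = new VadSettings
{
    EnergyThreshold = 0.65f,                               // stricter speech detection
    MinSpeechDuration = TimeSpan.FromMilliseconds(400),    // ignore short noise bursts
    MinSilenceDuration = TimeSpan.FromMilliseconds(150)    // slightly longer gap before splitting
};
```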


Step 7: Processing Partial Audio

To transcribe only a portion of a long recording:

using System.Diagnostics;
using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

// Stream segments as they are recognized
engine.OnNewSegment += (_, e) =>
{
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.Write($"[{e.Segment.Start:mm\\:ss} - {e.Segment.End:mm\\:ss}] ");
    Console.ResetColor();
    Console.WriteLine(e.Segment.Text);
};

// Show progress
engine.OnProgress += (_, e) =>
{
    Console.Write($"\r  Progress: {e.Progress}%   ");
};

// ──────────────────────────────────────
// 3. Transcription loop
// ──────────────────────────────────────
Console.WriteLine("Enter the path to a WAV audio file (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("File: ");
    Console.ResetColor();

    string? path = Console.ReadLine()?.Trim('"');
    if (string.IsNullOrWhiteSpace(path) || path.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    if (!File.Exists(path))
    {
        Console.WriteLine($"  File not found: {path}\n");
        continue;
    }

    try
    {
        using var audio = new WaveFile(path);
        Console.WriteLine($"  Audio duration: {audio.Duration:mm\\:ss\\.ff}\n");

        // Start at 5 minutes, transcribe 2 minutes
        // (assumes the recording is longer than 7 minutes)
        engine.Start = TimeSpan.FromMinutes(5);
        engine.Duration = TimeSpan.FromMinutes(2);

        var sw = Stopwatch.StartNew();
        var result = engine.Transcribe(audio);
        sw.Stop();

        // Reset for full transcription
        engine.Start = TimeSpan.Zero;
        engine.Duration = TimeSpan.Zero;

        // Summary
        Console.WriteLine();
        Console.ForegroundColor = ConsoleColor.Cyan;
        Console.WriteLine($"  Segments: {result.Segments.Count}");
        Console.WriteLine($"  Duration: {sw.Elapsed:mm\\:ss\\.ff}");
        Console.ResetColor();

        Console.WriteLine($"\n  Full transcript:\n  {result.Text}\n");
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"  Error: {ex.Message}\n");
        Console.ResetColor();
    }
}

Step 8: Using Prompts for Domain Accuracy

The Prompt property provides context that biases Whisper toward domain-specific vocabulary:

using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

// Only one prompt is active at a time; each assignment replaces the previous one.

// Medical transcription
engine.Prompt = "cardiology, echocardiogram, ejection fraction, systolic, diastolic";

// Legal transcription
// engine.Prompt = "deposition, plaintiff, defendant, stipulation, voir dire";

// Technical meeting
// engine.Prompt = "Kubernetes, microservices, CI/CD pipeline, load balancer, API gateway";

The prompt does not appear in the output. It tells the model which specialized terms to expect, improving recognition accuracy for domain jargon.


Working with AudioSegment Results

Each segment in the transcription result contains timing and confidence information:

using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

// Stream segments as they are recognized
engine.OnNewSegment += (_, e) =>
{
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.Write($"[{e.Segment.Start:mm\\:ss} - {e.Segment.End:mm\\:ss}] ");
    Console.ResetColor();
    Console.WriteLine(e.Segment.Text);
};

// Show progress
engine.OnProgress += (_, e) =>
{
    Console.Write($"\r  Progress: {e.Progress}%   ");
};

// ──────────────────────────────────────
// 3. Transcription loop
// ──────────────────────────────────────
Console.WriteLine("Enter the path to a WAV audio file (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("File: ");
    Console.ResetColor();

    string? path = Console.ReadLine()?.Trim('"');
    if (string.IsNullOrWhiteSpace(path) || path.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    if (!File.Exists(path))
    {
        Console.WriteLine($"  File not found: {path}\n");
        continue;
    }

    try
    {
        using var audio = new WaveFile(path);
        Console.WriteLine($"  Audio duration: {audio.Duration:mm\\:ss\\.ff}\n");

        var result = engine.Transcribe(audio);

        foreach (var segment in result.Segments)
        {
            Console.WriteLine($"  [{segment.Start:hh\\:mm\\:ss} - {segment.End:hh\\:mm\\:ss}]");
            Console.WriteLine($"  Text: {segment.Text}");
            Console.WriteLine($"  Confidence: {segment.Confidence:P0}");
            Console.WriteLine($"  Language: {segment.Language}");
            Console.WriteLine();
        }

        // Full transcript (all segments joined)
        string fullText = result.Text;
        Console.WriteLine($"\n  Full transcript:\n  {fullText}\n");
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"  Error: {ex.Message}\n");
        Console.ResetColor();
    }
}
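Because each segment carries start and end timestamps, exporting subtitles is a small extra step. A sketch that reuses `result` and the usings from the listing above; `ToSrt` is a local helper (not part of LM-Kit.NET), and `AudioSegment` is assumed to be the segment type, as the section title suggests:

```csharp
// Write the segments as an SRT subtitle file.
// SRT uses a comma before milliseconds: 00:01:02,345
static string ToSrt(IEnumerable<AudioSegment> segments)
{
    var sb = new StringBuilder();
    int index = 1;
    foreach (var s in segments)
    {
        sb.AppendLine((index++).ToString());
        sb.AppendLine($"{s.Start:hh\\:mm\\:ss\\,fff} --> {s.End:hh\\:mm\\:ss\\,fff}");
        sb.AppendLine(s.Text.Trim());
        sb.AppendLine();   // blank line separates entries
    }
    return sb.ToString();
}

File.WriteAllText("transcript.srt", ToSrt(result.Segments));
```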

Common Issues

Problem                             Cause                                        Fix
InvalidOperationException on load   Model is not a Whisper model                 Use a Whisper model ID: whisper-large-turbo3, whisper-small, etc.
Empty transcription                 Audio is silent or VAD threshold too high    Lower VadSettings.EnergyThreshold to 0.3; check that the audio file is not empty
Hallucinated text in silent parts   VAD disabled or threshold too low            Enable VAD, raise the threshold, and set SuppressHallucinations = true
Wrong language detected             Short audio clip or ambiguous content        Force the language with engine.Transcribe(audio, language: "en")
Garbled or repeated words           Audio quality too poor for the model size    Use whisper-large-turbo3 for difficult audio; clean the audio with preprocessing
Non-WAV format not supported        WaveFile expects 16-bit PCM WAV              Convert to WAV first with ffmpeg or NAudio before loading
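For the conversion mentioned in the last row, a typical ffmpeg invocation looks like this (16 kHz mono 16-bit PCM is a safe target for speech models; the filenames are placeholders):

```shell
# Convert any ffmpeg-readable input (MP3, M4A, FLAC, ...) to
# 16-bit PCM WAV at 16 kHz mono, which WaveFile can load.
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```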
