
Transcribe Audio with Local Speech-to-Text

LM-Kit.NET includes OpenAI Whisper models for on-device speech recognition. Audio is transcribed entirely on your machine with no cloud API calls, no internet required, and no audio data leaving your infrastructure. This tutorial builds a working transcription program that processes audio files, streams results segment by segment, and detects the spoken language automatically.


Why Local Speech-to-Text Matters

Two enterprise problems that on-device transcription solves:

  1. Healthcare and legal compliance. Patient dictations, attorney-client conversations, and therapy sessions contain protected information that cannot be sent to cloud APIs without complex data processing agreements. Local Whisper transcription eliminates third-party data exposure entirely, simplifying HIPAA, GDPR, and privilege compliance.
  2. Offline field transcription. Journalists in remote areas, field engineers on oil rigs, and military personnel in disconnected environments need to transcribe interviews, inspection notes, and briefings without internet access. A local model runs on a laptop with no connectivity.

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
VRAM          0.9 GB (for whisper-large-turbo3)
Disk          ~1.7 GB free for model download
Audio file    A .wav file (16-bit PCM, any sample rate)

Whisper models are small. Even the largest turbo model needs under 1 GB of VRAM, so speech-to-text works on virtually any GPU.


Step 1: Create the Project

dotnet new console -n TranscriptionQuickstart
cd TranscriptionQuickstart
dotnet add package LM-Kit.NET

Step 2: Understand Whisper Models

Whisper models convert audio waveforms into text. They process audio in 30-second chunks, detecting language automatically and producing timestamped segments.

                    ┌───────────────────────────────────┐
                    │          Whisper Model            │
                    │                                   │
  Audio (WAV) ────► │  1. Split into 30s chunks         │
                    │  2. Detect language               │
                    │  3. Transcribe each chunk         │
                    │  4. Output timestamped segments   │
                    │                                   │
                    └───────────────┬───────────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────────────┐
                    │  TranscriptionResult              │
                    │    Segment 1: [00:00 - 00:08]     │
                    │    Segment 2: [00:08 - 00:15]     │
                    │    ...                            │
                    └───────────────────────────────────┘

Model ID               VRAM      Speed       Accuracy    Best For
whisper-tiny           ~50 MB    Fastest     Basic       Quick drafts, real-time previews
whisper-base           ~80 MB    Very fast   Good        General use with speed priority
whisper-small          ~260 MB   Fast        Very good   Good balance for most tasks
whisper-medium         ~820 MB   Moderate    Excellent   Professional transcription
whisper-large-turbo3   ~870 MB   Moderate    Best        Highest accuracy (recommended)

whisper-large-turbo3 is the recommended default. It matches the accuracy of the full large model at roughly 3x the speed.
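The choice between models can also be made at run time. A minimal sketch, assuming the model IDs from the table above (the `--fast` flag is illustrative, not part of LM-Kit.NET):

```csharp
// Pick a faster model for quick drafts, the turbo model for final transcripts.
bool preferSpeed = Array.IndexOf(args, "--fast") >= 0;
string modelId = preferSpeed ? "whisper-small" : "whisper-large-turbo3";

using LM model = LM.LoadFromModelID(modelId);
var engine = new SpeechToText(model);
```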


Step 3: Basic Transcription

This program loads a Whisper model, transcribes a WAV file, and prints each segment with timestamps.

using System.Diagnostics;
using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

// Stream segments as they are recognized
engine.OnNewSegment += (_, e) =>
{
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.Write($"[{e.Segment.Start:mm\\:ss} - {e.Segment.End:mm\\:ss}] ");
    Console.ResetColor();
    Console.WriteLine(e.Segment.Text);
};

// Show progress
engine.OnProgress += (_, e) =>
{
    Console.Write($"\r  Progress: {e.Progress}%   ");
};

// ──────────────────────────────────────
// 3. Transcription loop
// ──────────────────────────────────────
Console.WriteLine("Enter the path to a WAV audio file (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("File: ");
    Console.ResetColor();

    string? path = Console.ReadLine()?.Trim('"');
    if (string.IsNullOrWhiteSpace(path) || path.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    if (!File.Exists(path))
    {
        Console.WriteLine($"  File not found: {path}\n");
        continue;
    }

    try
    {
        using var audio = new WaveFile(path);
        Console.WriteLine($"  Audio duration: {audio.Duration:mm\\:ss\\.ff}\n");

        var sw = Stopwatch.StartNew();
        var result = engine.Transcribe(audio);
        sw.Stop();

        // Summary
        Console.WriteLine();
        Console.ForegroundColor = ConsoleColor.Cyan;
        Console.WriteLine($"  Segments: {result.Segments.Count}");
        Console.WriteLine($"  Duration: {sw.Elapsed:mm\\:ss\\.ff}");
        Console.ResetColor();

        // Full transcript
        Console.WriteLine($"\n  Full transcript:\n  {result.Text}\n");
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"  Error: {ex.Message}\n");
        Console.ResetColor();
    }
}

Run it:

dotnet run

Step 4: Language Detection

Whisper can detect the spoken language before transcribing. This is useful for multilingual audio or when you need to route audio to language-specific processing:

using var audio = new WaveFile("meeting-recording.wav");

var langResult = engine.DetectLanguage(audio);
Console.WriteLine($"Detected language: {langResult.Language}");
Console.WriteLine($"Confidence: {langResult.Confidence:P0}");

To force a specific language (skips auto-detection and can improve accuracy):

using System.Diagnostics;
using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

// Stream segments as they are recognized
engine.OnNewSegment += (_, e) =>
{
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.Write($"[{e.Segment.Start:mm\\:ss} - {e.Segment.End:mm\\:ss}] ");
    Console.ResetColor();
    Console.WriteLine(e.Segment.Text);
};

// Show progress
engine.OnProgress += (_, e) =>
{
    Console.Write($"\r  Progress: {e.Progress}%   ");
};

// ──────────────────────────────────────
// 3. Transcription loop
// ──────────────────────────────────────
Console.WriteLine("Enter the path to a WAV audio file (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("File: ");
    Console.ResetColor();

    string? path = Console.ReadLine()?.Trim('"');
    if (string.IsNullOrWhiteSpace(path) || path.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    if (!File.Exists(path))
    {
        Console.WriteLine($"  File not found: {path}\n");
        continue;
    }

    try
    {
        using var audio = new WaveFile(path);
        Console.WriteLine($"  Audio duration: {audio.Duration:mm\\:ss\\.ff}\n");

        var sw = Stopwatch.StartNew();
        // Force a specific language instead of auto-detection:
        //   engine.Transcribe(audio, language: "fr")  -> Force French
        //   engine.Transcribe(audio, language: "ja")  -> Force Japanese
        var result = engine.Transcribe(audio, language: "en");  // Force English
        sw.Stop();

        // Summary
        Console.WriteLine();
        Console.ForegroundColor = ConsoleColor.Cyan;
        Console.WriteLine($"  Segments: {result.Segments.Count}");
        Console.WriteLine($"  Duration: {sw.Elapsed:mm\\:ss\\.ff}");
        Console.ResetColor();

        // Full transcript
        Console.WriteLine($"\n  Full transcript:\n  {result.Text}\n");
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"  Error: {ex.Message}\n");
        Console.ResetColor();
    }
}

Step 5: Translation Mode

Whisper can translate speech from any supported language directly to English:

engine.Mode = SpeechToTextMode.Translation;

// French audio in, English text out
using var frenchAudio = new WaveFile("interview-fr.wav");
var result = engine.Transcribe(frenchAudio);
Console.WriteLine(result.Text);  // English translation

// Switch back to transcription mode
engine.Mode = SpeechToTextMode.Transcription;

Step 6: Voice Activity Detection

Voice Activity Detection (VAD) identifies speech vs. silence in the audio. When enabled, Whisper skips silent segments, reducing processing time and preventing hallucinated text in quiet sections.

VAD is enabled by default. Fine-tune it for your audio characteristics:

using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

engine.VadSettings = new VadSettings
{
    EnergyThreshold = 0.5f,                                     // 0.0-1.0, higher = stricter speech detection
    MinSpeechDuration = TimeSpan.FromMilliseconds(250),         // Ignore speech shorter than 250ms
    MinSilenceDuration = TimeSpan.FromMilliseconds(100)         // Silence gap to split segments
};

Setting              Low Value                              High Value
EnergyThreshold      More sensitive, catches quiet speech   Stricter, may miss soft speech
MinSpeechDuration    Catches short utterances               Filters out clicks and coughs
MinSilenceDuration   Fewer segment breaks                   More granular segments

For noisy environments (factories, outdoor recordings), raise EnergyThreshold to 0.6-0.7. For clean recordings (studios, phone calls), the default of 0.5 works well.
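As a concrete starting point for noisy input, a stricter profile might look like this (the exact values are illustrative and worth tuning against your own recordings):

```csharp
// Stricter VAD profile for noisy environments:
// higher energy threshold and a longer minimum speech duration
// filter out brief noise bursts that are not speech.
engine.VadSettings = new VadSettings
{
    EnergyThreshold = 0.65f,                               // stricter speech detection
    MinSpeechDuration = TimeSpan.FromMilliseconds(400),    // ignore short noise bursts
    MinSilenceDuration = TimeSpan.FromMilliseconds(150)    // slightly longer gap before splitting
};
```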


Step 7: Processing Partial Audio

To transcribe only a portion of a long recording:

using System.Diagnostics;
using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

// Stream segments as they are recognized
engine.OnNewSegment += (_, e) =>
{
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.Write($"[{e.Segment.Start:mm\\:ss} - {e.Segment.End:mm\\:ss}] ");
    Console.ResetColor();
    Console.WriteLine(e.Segment.Text);
};

// Show progress
engine.OnProgress += (_, e) =>
{
    Console.Write($"\r  Progress: {e.Progress}%   ");
};

// ──────────────────────────────────────
// 3. Transcription loop
// ──────────────────────────────────────
Console.WriteLine("Enter the path to a WAV audio file (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("File: ");
    Console.ResetColor();

    string? path = Console.ReadLine()?.Trim('"');
    if (string.IsNullOrWhiteSpace(path) || path.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    if (!File.Exists(path))
    {
        Console.WriteLine($"  File not found: {path}\n");
        continue;
    }

    try
    {
        using var audio = new WaveFile(path);
        Console.WriteLine($"  Audio duration: {audio.Duration:mm\\:ss\\.ff}\n");

        // Start at 5 minutes, transcribe 2 minutes
        // (assumes the recording is longer than 7 minutes)
        engine.Start = TimeSpan.FromMinutes(5);
        engine.Duration = TimeSpan.FromMinutes(2);

        var sw = Stopwatch.StartNew();
        var result = engine.Transcribe(audio);
        sw.Stop();

        // Reset for full transcription
        engine.Start = TimeSpan.Zero;
        engine.Duration = TimeSpan.Zero;

        // Summary
        Console.WriteLine();
        Console.ForegroundColor = ConsoleColor.Cyan;
        Console.WriteLine($"  Segments: {result.Segments.Count}");
        Console.WriteLine($"  Duration: {sw.Elapsed:mm\\:ss\\.ff}");
        Console.ResetColor();

        Console.WriteLine($"\n  Full transcript:\n  {result.Text}\n");
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"  Error: {ex.Message}\n");
        Console.ResetColor();
    }
}

Step 8: Using Prompts for Domain Accuracy

The Prompt property provides context that biases Whisper toward domain-specific vocabulary:

using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

// Only one prompt is active at a time; each assignment replaces the previous one.

// Medical transcription
engine.Prompt = "cardiology, echocardiogram, ejection fraction, systolic, diastolic";

// Legal transcription
// engine.Prompt = "deposition, plaintiff, defendant, stipulation, voir dire";

// Technical meeting
// engine.Prompt = "Kubernetes, microservices, CI/CD pipeline, load balancer, API gateway";

The prompt does not appear in the output. It tells the model which specialized terms to expect, improving recognition accuracy for domain jargon.


Working with AudioSegment Results

Each segment in the transcription result contains timing and confidence information:

using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM model = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine
// ──────────────────────────────────────
var engine = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

// Stream segments as they are recognized
engine.OnNewSegment += (_, e) =>
{
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.Write($"[{e.Segment.Start:mm\\:ss} - {e.Segment.End:mm\\:ss}] ");
    Console.ResetColor();
    Console.WriteLine(e.Segment.Text);
};

// Show progress
engine.OnProgress += (_, e) =>
{
    Console.Write($"\r  Progress: {e.Progress}%   ");
};

// ──────────────────────────────────────
// 3. Transcription loop
// ──────────────────────────────────────
Console.WriteLine("Enter the path to a WAV audio file (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("File: ");
    Console.ResetColor();

    string? path = Console.ReadLine()?.Trim('"');
    if (string.IsNullOrWhiteSpace(path) || path.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    if (!File.Exists(path))
    {
        Console.WriteLine($"  File not found: {path}\n");
        continue;
    }

    try
    {
        using var audio = new WaveFile(path);
        Console.WriteLine($"  Audio duration: {audio.Duration:mm\\:ss\\.ff}\n");

        var result = engine.Transcribe(audio);

        foreach (var segment in result.Segments)
        {
            Console.WriteLine($"  [{segment.Start:hh\\:mm\\:ss} - {segment.End:hh\\:mm\\:ss}]");
            Console.WriteLine($"  Text: {segment.Text}");
            Console.WriteLine($"  Confidence: {segment.Confidence:P0}");
            Console.WriteLine($"  Language: {segment.Language}");
            Console.WriteLine();
        }

        // Full transcript (all segments joined)
        string fullText = result.Text;
        Console.WriteLine($"\n  Full transcript:\n  {fullText}\n");
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"  Error: {ex.Message}\n");
        Console.ResetColor();
    }
}
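Because each segment carries start and end timestamps, exporting subtitles is a small extra step. A sketch that reuses `result` and the usings from the listing above; `ToSrt` is a local helper (not part of LM-Kit.NET), and `AudioSegment` is assumed to be the segment type, as the section title suggests:

```csharp
// Write the segments as an SRT subtitle file.
// SRT uses a comma before milliseconds: 00:01:02,345
static string ToSrt(IEnumerable<AudioSegment> segments)
{
    var sb = new StringBuilder();
    int index = 1;
    foreach (var s in segments)
    {
        sb.AppendLine((index++).ToString());
        sb.AppendLine($"{s.Start:hh\\:mm\\:ss\\,fff} --> {s.End:hh\\:mm\\:ss\\,fff}");
        sb.AppendLine(s.Text.Trim());
        sb.AppendLine();   // blank line separates entries
    }
    return sb.ToString();
}

File.WriteAllText("transcript.srt", ToSrt(result.Segments));
```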

Common Issues

Problem                             Cause                                        Fix
InvalidOperationException on load   Model is not a Whisper model                 Use a Whisper model ID: whisper-large-turbo3, whisper-small, etc.
Empty transcription                 Audio is silent or VAD threshold too high    Lower VadSettings.EnergyThreshold to 0.3; check that the audio file is not empty
Hallucinated text in silent parts   VAD disabled or threshold too low            Enable VAD, raise the threshold, and set SuppressHallucinations = true
Wrong language detected             Short audio clip or ambiguous content        Force the language with engine.Transcribe(audio, language: "en")
Garbled or repeated words           Audio quality too poor for the model size    Use whisper-large-turbo3 for difficult audio; clean the audio with preprocessing
Non-WAV format not supported        WaveFile expects 16-bit PCM WAV              Convert to WAV first with ffmpeg or NAudio before loading
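For the conversion mentioned in the last row, a typical ffmpeg invocation looks like this (16 kHz mono 16-bit PCM is a safe target for speech models; the filenames are placeholders):

```shell
# Convert any ffmpeg-readable input (MP3, M4A, FLAC, ...) to
# 16-bit PCM WAV at 16 kHz mono, which WaveFile can load.
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```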
