
Tune Whisper Transcription with VAD, Hallucination Suppression, and Segment Processing

Basic transcription works out of the box, but production audio is rarely clean. Factory floors introduce constant machine hum, phone calls have echo and compression artifacts, meeting recordings pick up HVAC noise and side conversations, and long silent pauses cause Whisper to hallucinate phantom text. LM-Kit.NET exposes fine-grained controls for Voice Activity Detection thresholds, hallucination suppression, non-speech token filtering, domain prompts, segment confidence scoring, and partial audio processing. This guide walks through each control and builds a complete transcription pipeline with quality tiers.


Why Advanced Transcription Tuning Matters

Two enterprise problems that tuned transcription solves:

  1. Manufacturing quality inspection audio. Floor supervisors dictate inspection notes over constant machine noise. Default VAD settings either miss quiet speech or include machine hum as false positives. Tuning the energy threshold and minimum speech duration for that specific acoustic environment eliminates false segments while capturing every spoken note, producing reliable inspection records without manual cleanup.
  2. Long meeting transcription with dead air. Two-hour strategy meetings contain 15-minute breaks, side conversations, and periods of silence while participants read documents. Without hallucination suppression, Whisper generates phantom text during silent sections. Without confidence filtering, low-quality segments from overlapping speakers contaminate the transcript. Tuning both filters produces a clean, trustworthy record.

Prerequisites

Requirement    Minimum
.NET SDK       8.0+
VRAM           ~870 MB (for whisper-large-turbo3)
Disk           ~1.7 GB free for model download
Audio file     A .wav file (16-bit PCM, any sample rate)

Step 1: Create the Project

dotnet new console -n WhisperTuning
cd WhisperTuning
dotnet add package LM-Kit.NET

Step 2: Understand the Whisper Pipeline

When you configure the SpeechToText engine, audio flows through a multi-stage pipeline. Each stage can be tuned independently:

  Audio (.wav)
       │
       ▼
  ┌────────────────────────────────────────┐
  │  Voice Activity Detection (VAD)        │   Filter silence and noise
  │  EnergyThreshold, MinSpeechDuration,   │   before Whisper sees it
  │  MinSilenceDuration, SpeechPadding     │
  └────────────┬───────────────────────────┘
               │  Speech segments only
               ▼
  ┌────────────────────────────────────────┐
  │  Whisper Decoder                       │   Transcribe speech to text
  │  Model, Language, Prompt               │   using domain-biased decoding
  └────────────┬───────────────────────────┘
               │  Raw segments
               ▼
  ┌────────────────────────────────────────┐
  │  Hallucination Filter                  │   Adaptive energy analysis,
  │  SuppressHallucinations                │   speaking rate validation,
  │  SuppressNonSpeechTokens               │   no-speech probability check
  └────────────┬───────────────────────────┘
               │  Filtered segments
               ▼
  ┌────────────────────────────────────────┐
  │  AudioSegment Output                   │   Text, Start, End,
  │  OnNewSegment event stream             │   Confidence, Language
  └────────────┬───────────────────────────┘
               │
               ▼
  ┌────────────────────────────────────────┐
  │  Post-Processing                       │   Confidence filtering,
  │  (your application code)               │   quality tiers, export
  └────────────────────────────────────────┘

Whisper Model Selection

Model ID               VRAM     Speed      Accuracy    Best For
whisper-tiny           ~50 MB   Fastest    Basic       Quick drafts, real-time previews
whisper-base           ~80 MB   Very fast  Good        General use with speed priority
whisper-small          ~260 MB  Fast       Very good   Good balance for most tasks
whisper-medium         ~820 MB  Moderate   Excellent   Professional transcription
whisper-large-turbo3   ~870 MB  Moderate   Best        Highest accuracy (recommended)

whisper-large-turbo3 matches the accuracy of the full large model at roughly 3x the speed. Use it as the default unless VRAM is constrained.
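If the deployment environment's available VRAM varies, the table above can be collapsed into a simple selection helper. The sketch below is illustrative (PickWhisperModel is not an LM-Kit.NET API); the VRAM figures come directly from the table:

```csharp
using System;

Console.WriteLine(PickWhisperModel(1024)); // whisper-large-turbo3
Console.WriteLine(PickWhisperModel(300));  // whisper-small

// Returns the most accurate model from the table above that fits in the
// available VRAM (figures are approximate load footprints).
static string PickWhisperModel(int availableVramMb) => availableVramMb switch
{
    >= 870 => "whisper-large-turbo3",
    >= 820 => "whisper-medium",
    >= 260 => "whisper-small",
    >= 80  => "whisper-base",
    _      => "whisper-tiny",
};
```

The returned ID can then be passed straight to LM.LoadFromModelID.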


Step 3: Configure Voice Activity Detection for Noisy Environments

VAD runs before Whisper processes the audio. It identifies speech vs. silence and discards non-speech regions. This both speeds up transcription and prevents the decoder from hallucinating text during quiet sections.

The key insight: different acoustic environments need different VAD settings. A factory floor and a quiet office have very different noise profiles.

using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

var stt = new SpeechToText(whisperModel);

// ──────────────────────────────────────
// 2. Default VAD: good for clean audio
// ──────────────────────────────────────
stt.EnableVoiceActivityDetection = true;
// Default VadSettings: EnergyThreshold=0.1, MinSpeechDuration=250ms,
// MinSilenceDuration=100ms, SpeechPadding=30ms

// ──────────────────────────────────────
// 3. Noisy environment: raise threshold, require longer speech
// ──────────────────────────────────────
stt.VadSettings = new VadSettings
{
    EnergyThreshold = 0.3f,                                    // Higher = more aggressive noise rejection
    MinSpeechDuration = TimeSpan.FromMilliseconds(500),        // Ignore short bursts (clanks, thuds)
    MinSilenceDuration = TimeSpan.FromMilliseconds(300),       // Wait longer before splitting segments
    SpeechPadding = TimeSpan.FromMilliseconds(50)              // Extra padding around detected speech
};

Console.WriteLine("Noisy environment VAD configured.");

// ──────────────────────────────────────
// 4. Quiet office: lower threshold, catch soft speech
// ──────────────────────────────────────
stt.VadSettings = new VadSettings
{
    EnergyThreshold = 0.05f,                                   // Sensitive: catch soft speakers
    MinSpeechDuration = TimeSpan.FromMilliseconds(150),        // Capture short confirmations ("yes", "ok")
    MinSilenceDuration = TimeSpan.FromMilliseconds(80),        // Fine-grained segment splits
    SpeechPadding = TimeSpan.FromMilliseconds(20)              // Minimal padding in quiet environments
};

Console.WriteLine("Quiet office VAD configured.");

VadSettings Parameter Reference

  • EnergyThreshold (0.0 to 1.0, default 0.1): Normalized energy level that distinguishes speech from noise. Higher values reject more background noise but may miss quiet speech.
  • MinSpeechDuration (TimeSpan, default 250 ms): Minimum duration for a region to qualify as speech. Short values catch brief utterances; long values filter out clicks and coughs.
  • MinSilenceDuration (TimeSpan, default 100 ms): Minimum silence gap required to split segments. Short values produce fine-grained segments; long values produce fewer, larger segments.
  • MaxSpeechDuration (TimeSpan, default unlimited): Maximum length of a single speech region before forced splitting. Useful for very long continuous speech.
  • SpeechPadding (TimeSpan, default 30 ms): Extra audio included before and after each detected speech region. Prevents clipping the start or end of words.
  • SampleOverlapSeconds (0.0 to 1.0, default 0.1): Overlap between VAD analysis windows in seconds. Higher overlap improves boundary precision at a small speed cost.

Step 4: Suppress Hallucinations and Non-Speech Tokens

Whisper can produce two types of unwanted output. Hallucinations are phantom phrases generated during silent or low-energy sections, such as "Thank you for watching" or "Please subscribe." Non-speech tokens are filler sounds, music notation, and other artifacts that do not represent actual spoken words.

// ──────────────────────────────────────
// Hallucination suppression
// ──────────────────────────────────────
stt.SuppressHallucinations = true;

// When enabled, the engine applies adaptive filtering that combines:
//   - Audio energy analysis (RMS comparison against adaptive thresholds)
//   - Statistical adaptation (median, variance, and stability of prior segments)
//   - No-speech probability scoring from the model
//   - Token confidence validation (high-confidence segments bypass filtering)
//   - Speaking rate validation (words per second within realistic human range)

// ──────────────────────────────────────
// Non-speech token suppression
// ──────────────────────────────────────
stt.SuppressNonSpeechTokens = true;

// When enabled, the model suppresses tokens representing filler sounds,
// background noise artifacts, and repetitive non-speech patterns.

When hallucination suppression matters most: long recordings with significant silence or low-energy sections, such as meetings with breaks, overnight monitoring audio, and recordings that begin or end with dead air.

When non-speech suppression matters most: audio with background music, environmental sounds (rain, traffic), or recordings from devices that pick up mechanical noise (keyboards, fans).

Both properties default to true. Disable them only when you need to preserve every token the decoder produces, for example when debugging or when the audio contains intentional non-speech content you want captured.


Step 5: Use Domain Prompts to Bias Transcription

The Prompt property prepends context to the Whisper decoder. This biases the model toward recognizing domain-specific vocabulary without constraining it. The prompt text does not appear in the transcription output.

using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

var stt = new SpeechToText(whisperModel);

// ──────────────────────────────────────
// Medical terminology
// ──────────────────────────────────────
stt.Prompt = "Patient presents with dyspnea, tachycardia, and elevated troponin levels.";

// ──────────────────────────────────────
// Legal terminology
// ──────────────────────────────────────
stt.Prompt = "The plaintiff alleges breach of fiduciary duty under Section 10(b).";

// ──────────────────────────────────────
// Technical meeting
// ──────────────────────────────────────
stt.Prompt = "Discussion of Kubernetes deployment, CI/CD pipeline, and microservices architecture.";

// ──────────────────────────────────────
// Financial reporting
// ──────────────────────────────────────
stt.Prompt = "Quarterly EBITDA, year-over-year revenue growth, and adjusted free cash flow margin.";

// ──────────────────────────────────────
// Clear the prompt (no bias)
// ──────────────────────────────────────
stt.Prompt = "";

How it works. Whisper's decoder conditions each output token on all preceding tokens. By placing domain-specific terms in the prompt, those terms enter the decoder's context window, making the model more likely to recognize similar-sounding words as domain terms rather than common alternatives. For example, "troponin" might otherwise be transcribed as "tropan" or "troponine" without the medical prompt.

Best practices for prompts:

  • Include the exact spelling of domain terms you expect to hear.
  • Write the prompt in the same language as the audio.
  • Keep prompts under 200 tokens. Longer prompts reduce the effective context window for transcription.
  • Use complete phrases rather than isolated keywords for better decoder conditioning.
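These practices can be folded into a small helper that assembles a prompt from a term list. The sketch below is a hypothetical utility (BuildDomainPrompt is not part of LM-Kit.NET), and the 4-characters-per-token length check is a rough heuristic, not an exact tokenizer:

```csharp
using System;

string prompt = BuildDomainPrompt(
    "Discussion of",
    new[] { "Kubernetes deployment", "CI/CD pipeline", "microservices architecture" });
Console.WriteLine(prompt);

// Joins exact-spelling domain terms into one complete phrase (better decoder
// conditioning than isolated keywords) and warns when the prompt likely
// exceeds ~200 tokens, using a rough 4-characters-per-token estimate.
static string BuildDomainPrompt(string lead, string[] terms)
{
    string prompt = $"{lead} {string.Join(", ", terms)}.";
    if (prompt.Length / 4 > 200)
        Console.Error.WriteLine($"Warning: prompt is roughly {prompt.Length / 4} tokens; consider trimming.");
    return prompt;
}
```

The result is assigned to stt.Prompt exactly as in the examples above.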

Step 6: Process Segments in Real Time with Events

The OnNewSegment and OnProgress events let you process transcription results as they are produced, rather than waiting for the entire file to finish.

using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

var stt = new SpeechToText(whisperModel);

// ──────────────────────────────────────
// Configure real-time segment streaming
// ──────────────────────────────────────
stt.EnableVoiceActivityDetection = true;
stt.SuppressHallucinations = true;
stt.SuppressNonSpeechTokens = true;

stt.OnProgress += (sender, e) =>
{
    Console.Write($"\rProgress: {e.Progress}%   ");
};

stt.OnNewSegment += (sender, e) =>
{
    var seg = e.Segment;

    // Classify confidence level
    string confidence = seg.Confidence >= 0.8f ? "HIGH"
                      : seg.Confidence >= 0.5f ? "MED"
                      : "LOW";

    // Color-code by confidence
    Console.ForegroundColor = seg.Confidence >= 0.8f ? ConsoleColor.Green
                            : seg.Confidence >= 0.5f ? ConsoleColor.Yellow
                            : ConsoleColor.Red;

    Console.WriteLine($"[{seg.Start:mm\\:ss} - {seg.End:mm\\:ss}] ({confidence} {seg.Confidence:F2}) {seg.Text}");
    Console.ResetColor();
};

// ──────────────────────────────────────
// Transcribe with real-time output
// ──────────────────────────────────────
var audioFile = new WaveFile("meeting.wav");
Console.WriteLine($"Audio duration: {audioFile.Duration:mm\\:ss\\.ff}\n");

var result = await stt.TranscribeAsync(audioFile);

Console.WriteLine($"\nTotal segments: {result.Segments.Count}");

The OnNewSegment event fires for each segment as it is recognized, allowing real-time display, logging, or streaming to a downstream system. The OnProgress event fires periodically with an integer from 0 to 100 representing overall completion.


Step 7: Filter Segments by Confidence

Each AudioSegment includes a Confidence score between 0.0 and 1.0. After transcription completes, you can separate segments into quality tiers for different processing paths.

using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");
var stt = new SpeechToText(whisperModel)
{
    EnableVoiceActivityDetection = true,
    SuppressHallucinations = true,
    SuppressNonSpeechTokens = true
};

// ──────────────────────────────────────
// 2. Transcribe before post-processing
// ──────────────────────────────────────
var audioFile = new WaveFile("meeting.wav");
var result = await stt.TranscribeAsync(audioFile);

// ──────────────────────────────────────
// Post-process: separate by confidence
// ──────────────────────────────────────
var highConfidence = result.Segments
    .Where(s => s.Confidence >= 0.7f)
    .ToList();

var lowConfidence = result.Segments
    .Where(s => s.Confidence < 0.7f)
    .ToList();

Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"High confidence: {highConfidence.Count} segments");
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine($"Low confidence (review needed): {lowConfidence.Count} segments");
Console.ResetColor();

// Build a clean transcript from high-confidence segments only
string cleanTranscript = string.Join("\n",
    highConfidence
        .OrderBy(s => s.Start)
        .Select(s => s.Text));

Console.WriteLine($"\n── Clean Transcript ──\n{cleanTranscript}");

// Flag low-confidence segments for human review
if (lowConfidence.Count > 0)
{
    Console.WriteLine("\n── Segments Requiring Review ──");
    foreach (var seg in lowConfidence.OrderBy(s => s.Start))
    {
        Console.ForegroundColor = ConsoleColor.Yellow;
        Console.WriteLine($"  [{seg.Start:mm\\:ss} - {seg.End:mm\\:ss}] " +
                          $"(confidence: {seg.Confidence:F2}) {seg.Text}");
        Console.ResetColor();
    }
}

Choosing a confidence threshold. A threshold of 0.7 works well as a general starting point. For critical transcription (legal depositions, medical records), raise the threshold to 0.85 and route all segments below it to human review. For draft transcription (meeting notes, interview summaries), a threshold of 0.5 retains more content at the cost of occasional errors.
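The guidance above can be kept in one place so the threshold is chosen per use case rather than hard-coded. A minimal sketch (ConfidenceThreshold and the use-case labels are illustrative, not an LM-Kit.NET API):

```csharp
using System;

Console.WriteLine(ConfidenceThreshold("legal"));    // 0.85
Console.WriteLine(ConfidenceThreshold("meeting")); // 0.7

// Thresholds from the guidance above: 0.85 for critical transcription,
// 0.5 for draft transcription, 0.7 as the general-purpose default.
static float ConfidenceThreshold(string useCase) => useCase switch
{
    "legal" or "medical" => 0.85f,
    "draft" or "notes"   => 0.5f,
    _                    => 0.7f,
};
```

The returned value replaces the 0.7f literal in the filtering code above.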


Step 8: Detect Language Before Transcription

When you process multilingual audio or audio of unknown language, detect the language first and pass it explicitly to Transcribe. This improves accuracy because Whisper skips its internal auto-detection step and applies the correct language model from the start.

using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

var stt = new SpeechToText(whisperModel);

// ──────────────────────────────────────
// Load audio
// ──────────────────────────────────────
var audioFile = new WaveFile("meeting.wav");

// ──────────────────────────────────────
// Detect language
// ──────────────────────────────────────
var langResult = await stt.DetectLanguageAsync(audioFile);
Console.WriteLine($"Detected: {langResult.Language} (confidence: {langResult.Confidence:P0})");

// Use detected language for transcription
var transcription = await stt.TranscribeAsync(audioFile, langResult.Language);
Console.WriteLine($"Segments: {transcription.Segments.Count}");
Console.WriteLine($"Text: {transcription.Text}");

// ──────────────────────────────────────
// List all supported languages
// ──────────────────────────────────────
var languages = stt.GetSupportedLanguages();
Console.WriteLine($"Supported languages ({languages.Count}): {string.Join(", ", languages)}");

DetectLanguage analyzes the first portion of the audio and returns a LanguageDetectionResult with the ISO 639-1 language code (e.g., "en", "fr", "ja") and a confidence score. Use GetSupportedLanguages() to retrieve the full list of language codes the loaded model supports.


Step 9: Transcribe Specific Time Ranges

For long recordings, you can transcribe only a specific portion by setting Start and Duration. This is useful for targeting a known segment of interest without processing the entire file.

using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

var stt = new SpeechToText(whisperModel);

var audioFile = new WaveFile("meeting.wav");

// ──────────────────────────────────────
// Transcribe only minutes 5 through 10
// ──────────────────────────────────────
stt.Start = TimeSpan.FromMinutes(5);
stt.Duration = TimeSpan.FromMinutes(5);

var partial = await stt.TranscribeAsync(audioFile);
Console.WriteLine($"Partial transcript ({stt.Start:mm\\:ss} to {stt.Start + stt.Duration:mm\\:ss}):");
Console.WriteLine(partial.Text);

// ──────────────────────────────────────
// Reset for full transcription
// ──────────────────────────────────────
stt.Start = TimeSpan.Zero;
stt.Duration = TimeSpan.Zero;

var full = await stt.TranscribeAsync(audioFile);
Console.WriteLine($"\nFull transcript ({full.Segments.Count} segments):");
Console.WriteLine(full.Text);

Setting Duration to TimeSpan.Zero means "process until the end of the audio." Start is measured from the beginning of the audio content; Duration is the length of the window that begins at Start.
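Start and Duration also make it possible to batch a long recording into fixed-size windows. Below is a minimal sketch of the window arithmetic (PlanChunks is a hypothetical helper; each returned pair would be assigned to stt.Start and stt.Duration before a Transcribe call):

```csharp
using System;
using System.Collections.Generic;

// Example: a 12-minute recording processed in 5-minute windows.
foreach (var (start, length) in PlanChunks(TimeSpan.FromMinutes(12), TimeSpan.FromMinutes(5)))
    Console.WriteLine($"{start:hh\\:mm\\:ss} for {length:hh\\:mm\\:ss}");

// Produces consecutive (Start, Length) windows that exactly tile the
// audio; the final window is shortened to fit.
static List<(TimeSpan Start, TimeSpan Length)> PlanChunks(TimeSpan total, TimeSpan chunk)
{
    var windows = new List<(TimeSpan Start, TimeSpan Length)>();
    for (TimeSpan start = TimeSpan.Zero; start < total; start += chunk)
    {
        TimeSpan remaining = total - start;
        windows.Add((start, remaining < chunk ? remaining : chunk));
    }
    return windows;
}
```

Transcribing window by window keeps memory use flat and lets failed windows be retried individually.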


Step 10: Build a Complete Transcription Pipeline with Quality Tiers

This complete example brings together every technique from the previous steps into a single, production-ready pipeline. It detects the language, transcribes with tuned VAD and hallucination suppression, applies a domain prompt, separates segments into confidence tiers, and produces a timestamped transcript with quality annotations and summary statistics.

using System.Diagnostics;
using System.Text;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load Whisper model
// ──────────────────────────────────────
Console.WriteLine("Loading Whisper model...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure the engine for a conference room
// ──────────────────────────────────────
var stt = new SpeechToText(whisperModel)
{
    EnableVoiceActivityDetection = true,
    SuppressHallucinations = true,
    SuppressNonSpeechTokens = true,
    Prompt = "Quarterly business review, revenue forecast, customer acquisition cost, churn rate."
};

// Conference room VAD tuning
stt.VadSettings = new VadSettings
{
    EnergyThreshold = 0.15f,
    MinSpeechDuration = TimeSpan.FromMilliseconds(300),
    MinSilenceDuration = TimeSpan.FromMilliseconds(200),
    SpeechPadding = TimeSpan.FromMilliseconds(40),
    SampleOverlapSeconds = 0.15f
};

// ──────────────────────────────────────
// 3. Track segments and statistics
// ──────────────────────────────────────
var allSegments = new List<AudioSegment>();

stt.OnProgress += (sender, e) =>
{
    Console.Write($"\r  Transcribing: {e.Progress}%   ");
};

stt.OnNewSegment += (sender, e) =>
{
    allSegments.Add(e.Segment);
};

// ──────────────────────────────────────
// 4. Load audio and detect language
// ──────────────────────────────────────
string audioPath = "quarterly_review.wav";
if (!File.Exists(audioPath))
{
    Console.WriteLine($"Place a WAV file at '{audioPath}' and run again.");
    return;
}

var audioFile = new WaveFile(audioPath);
Console.WriteLine($"Audio: {audioPath} (duration: {audioFile.Duration:hh\\:mm\\:ss\\.ff})\n");

Console.Write("Detecting language... ");
var langResult = stt.DetectLanguage(audioFile);
Console.WriteLine($"{langResult.Language} (confidence: {langResult.Confidence:P0})\n");

// ──────────────────────────────────────
// 5. Transcribe with detected language
// ──────────────────────────────────────
var stopwatch = Stopwatch.StartNew();
var result = stt.Transcribe(audioFile, langResult.Language);
stopwatch.Stop();
Console.WriteLine("\n");

// ──────────────────────────────────────
// 6. Separate into confidence tiers
// ──────────────────────────────────────
const float highThreshold = 0.8f;
const float medThreshold = 0.5f;

var highTier = result.Segments.Where(s => s.Confidence >= highThreshold).OrderBy(s => s.Start).ToList();
var medTier = result.Segments.Where(s => s.Confidence >= medThreshold && s.Confidence < highThreshold).OrderBy(s => s.Start).ToList();
var lowTier = result.Segments.Where(s => s.Confidence < medThreshold).OrderBy(s => s.Start).ToList();

// ──────────────────────────────────────
// 7. Output timestamped transcript with quality annotations
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine("══════════════════════════════════════════");
Console.WriteLine("  TIMESTAMPED TRANSCRIPT WITH QUALITY");
Console.WriteLine("══════════════════════════════════════════");
Console.ResetColor();

foreach (var seg in result.Segments.OrderBy(s => s.Start))
{
    string tier = seg.Confidence >= highThreshold ? "HIGH"
                : seg.Confidence >= medThreshold  ? " MED"
                : " LOW";

    ConsoleColor color = seg.Confidence >= highThreshold ? ConsoleColor.Green
                       : seg.Confidence >= medThreshold  ? ConsoleColor.Yellow
                       : ConsoleColor.Red;

    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.Write($"  [{seg.Start:hh\\:mm\\:ss} - {seg.End:hh\\:mm\\:ss}] ");
    Console.ForegroundColor = color;
    Console.Write($"[{tier} {seg.Confidence:F2}] ");
    Console.ResetColor();
    Console.WriteLine(seg.Text);
}

// ──────────────────────────────────────
// 8. Build clean transcript (high confidence only)
// ──────────────────────────────────────
string cleanTranscript = string.Join("\n",
    highTier.Select(s => s.Text));

Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine("\n══════════════════════════════════════════");
Console.WriteLine("  CLEAN TRANSCRIPT (HIGH CONFIDENCE)");
Console.WriteLine("══════════════════════════════════════════");
Console.ResetColor();
Console.WriteLine(cleanTranscript);

// ──────────────────────────────────────
// 9. Summary statistics
// ──────────────────────────────────────
int totalSegments = result.Segments.Count;
float avgConfidence = totalSegments > 0
    ? result.Segments.Average(s => s.Confidence)
    : 0f;

TimeSpan totalSpeech = TimeSpan.Zero;
foreach (var seg in result.Segments)
{
    totalSpeech += seg.End - seg.Start;
}

Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine("\n══════════════════════════════════════════");
Console.WriteLine("  SUMMARY STATISTICS");
Console.WriteLine("══════════════════════════════════════════");
Console.ResetColor();
Console.WriteLine($"  Language detected:     {langResult.Language} ({langResult.Confidence:P0} confidence)");
Console.WriteLine($"  Audio duration:        {audioFile.Duration:hh\\:mm\\:ss}");
Console.WriteLine($"  Speech duration:       {totalSpeech:hh\\:mm\\:ss}");
Console.WriteLine($"  Processing time:       {stopwatch.Elapsed:mm\\:ss\\.ff}");
Console.WriteLine($"  Total segments:        {totalSegments}");
Console.WriteLine($"  Average confidence:    {avgConfidence:F2}");
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"  High confidence:       {highTier.Count} segments ({(totalSegments > 0 ? (float)highTier.Count / totalSegments * 100 : 0):F0}%)");
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine($"  Medium confidence:     {medTier.Count} segments ({(totalSegments > 0 ? (float)medTier.Count / totalSegments * 100 : 0):F0}%)");
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"  Low confidence:        {lowTier.Count} segments ({(totalSegments > 0 ? (float)lowTier.Count / totalSegments * 100 : 0):F0}%)");
Console.ResetColor();

// ──────────────────────────────────────
// 10. Export results
// ──────────────────────────────────────
File.WriteAllText("transcript_full.txt", result.Text);
File.WriteAllText("transcript_clean.txt", cleanTranscript);

if (lowTier.Count > 0)
{
    string reviewContent = string.Join("\n",
        lowTier.Select(s =>
            $"[{s.Start:hh\\:mm\\:ss} - {s.End:hh\\:mm\\:ss}] (confidence: {s.Confidence:F2}) {s.Text}"));
    File.WriteAllText("segments_for_review.txt", reviewContent);
    Console.WriteLine("\n  Exported: transcript_full.txt, transcript_clean.txt, segments_for_review.txt");
}
else
{
    Console.WriteLine("\n  Exported: transcript_full.txt, transcript_clean.txt");
}

Run it:

dotnet run

VAD Tuning Reference

Use this table as a starting point for common acoustic environments. Adjust from these baselines based on your specific audio characteristics.

Environment         EnergyThreshold  MinSpeechDuration  MinSilenceDuration  SpeechPadding  Notes
Quiet office        0.05             150 ms             80 ms               20 ms          Sensitive settings to catch soft speech
Conference room     0.15             300 ms             200 ms              40 ms          Balanced for multiple speakers with moderate room noise
Outdoor recording   0.25             400 ms             250 ms              50 ms          Wind and ambient noise require higher thresholds
Phone call / VoIP   0.2              300 ms             150 ms              30 ms          Compression artifacts and echo; moderate sensitivity
Noisy factory       0.35             500 ms             300 ms              60 ms          Aggressive filtering to reject constant machine hum
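The table rows can also live in code as a preset lookup. A sketch, assuming each tuple is copied into a VadSettings instance at configuration time (the dictionary and its environment keys are illustrative, not an LM-Kit.NET API):

```csharp
using System;
using System.Collections.Generic;

// Baseline VAD presets from the table above, keyed by environment.
// Each tuple is (EnergyThreshold, MinSpeechDuration ms, MinSilenceDuration ms,
// SpeechPadding ms) and would be copied into a VadSettings instance when
// configuring the engine.
var vadPresets = new Dictionary<string, (float Energy, int SpeechMs, int SilenceMs, int PaddingMs)>
{
    ["quiet-office"]    = (0.05f, 150, 80, 20),
    ["conference-room"] = (0.15f, 300, 200, 40),
    ["outdoor"]         = (0.25f, 400, 250, 50),
    ["phone-voip"]      = (0.20f, 300, 150, 30),
    ["noisy-factory"]   = (0.35f, 500, 300, 60),
};

var preset = vadPresets["noisy-factory"];
Console.WriteLine($"Energy {preset.Energy}, speech {preset.SpeechMs} ms, " +
                  $"silence {preset.SilenceMs} ms, padding {preset.PaddingMs} ms");
```

Applying a preset then becomes a one-line lookup before building VadSettings, which keeps environment tuning in data rather than scattered through code.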

Common Issues

  • Empty transcription output. Cause: VAD EnergyThreshold is too high for quiet audio, filtering out all speech. Fix: lower EnergyThreshold to 0.05 or 0.03, or disable VAD entirely with EnableVoiceActivityDetection = false and check whether speech is present.
  • Hallucinated text during silence (e.g., "Thank you for watching"). Cause: SuppressHallucinations is disabled, or VAD is not enabled. Fix: set SuppressHallucinations = true and EnableVoiceActivityDetection = true; if hallucinations persist, raise EnergyThreshold.
  • Words clipped at the start or end of segments. Cause: SpeechPadding is too short. Fix: increase SpeechPadding to 50 ms or higher, and verify MinSilenceDuration is not forcing premature segment splits.
  • Too many tiny segments. Cause: MinSilenceDuration is too low, causing splits on brief pauses. Fix: raise MinSilenceDuration to 200 ms or higher, and increase MinSpeechDuration to filter very short fragments.
  • Domain-specific terms transcribed incorrectly. Cause: no prompt set, or the prompt is in a different language than the audio. Fix: set Prompt with example sentences containing the exact terms, written in the same language as the audio.
  • Low confidence scores across all segments. Cause: model too small for the audio quality, or language mismatch. Fix: use whisper-large-turbo3 for the best accuracy; detect the language first with DetectLanguage and pass it explicitly to Transcribe.
