🎙️ Understanding Voice Activity Detection (VAD) in LM-Kit.NET
📄 TL;DR:
Voice Activity Detection (VAD) identifies segments of speech within audio streams, distinguishing spoken content from silence or background noise. In LM-Kit.NET, VAD is configured via the VadSettings class and improves accuracy and efficiency by isolating meaningful speech segments before downstream processing such as transcription or translation.
📝 What is Voice Activity Detection?
Definition: Voice Activity Detection (VAD) is a technology that detects the presence or absence of human speech within audio signals. It segments audio streams into speech and non-speech portions, significantly enhancing the effectiveness of subsequent speech-processing tasks; a simplified illustration of the idea follows the list below.
- Detection: Identifies regions containing speech within an audio stream.
- Segmentation: Extracts speech segments clearly separated from silence or noise.
- Customization: Provides adjustable parameters for precise detection accuracy.
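To make the detection idea concrete, here is a minimal, illustrative sketch in plain C# (not the LM-Kit.NET implementation) of how an energy threshold can separate speech frames from silence. The frame size, threshold value, and LoadSamples helper are hypothetical placeholders for demonstration only:
using System;

class EnergyVadSketch
{
    static void Main()
    {
        // Hypothetical input: PCM samples normalized to [-1, 1]. A real application
        // would decode these from an audio file.
        float[] samples = LoadSamples();

        const int frameSize = 480;          // 30 ms frames at an assumed 16 kHz sample rate
        const float energyThreshold = 0.1f; // same idea as VadSettings.EnergyThreshold

        for (int start = 0; start + frameSize <= samples.Length; start += frameSize)
        {
            // Root-mean-square energy of the frame.
            double sumSquares = 0;
            for (int i = start; i < start + frameSize; i++)
                sumSquares += samples[i] * samples[i];
            double rms = Math.Sqrt(sumSquares / frameSize);

            // Frames above the threshold are treated as speech, the rest as silence or noise.
            bool isSpeech = rms >= energyThreshold;
            Console.WriteLine($"Frame starting at sample {start}: {(isSpeech ? "speech" : "silence")}");
        }
    }

    // Placeholder so the sketch compiles; returns one second of silence at 16 kHz.
    static float[] LoadSamples() => new float[16000];
}
In practice, the VAD built into LM-Kit.NET also applies the duration and padding rules described below, so you normally only adjust VadSettings rather than implement frame analysis yourself.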
🎯 Why Use VAD?
- Improved Transcription Accuracy: Accurately identifies and isolates speech for clearer, error-reduced transcriptions.
- Efficiency Optimization: Reduces processing time and computational resources by focusing only on meaningful speech segments.
- Enhanced User Experience: Provides cleaner audio data, enhancing the clarity and usefulness of automated analyses.
⚙️ Key Class: VadSettings
Located in the LMKit.Speech namespace, VadSettings encapsulates the VAD configuration:
public sealed class VadSettings
{
    // Energy level threshold for detecting speech (0 to 1, default: 0.1)
    public float EnergyThreshold { get; set; }

    // Maximum allowed duration of a speech segment (default: unlimited)
    public TimeSpan MaxSpeechDuration { get; set; }

    // Minimum duration of silence between speech segments (default: 100 ms)
    public TimeSpan MinSilenceDuration { get; set; }

    // Minimum duration for a segment to be considered speech (default: 250 ms)
    public TimeSpan MinSpeechDuration { get; set; }

    // Overlap duration between analysis windows (0 to 1 second, default: 0.1)
    public float SampleOverlapSeconds { get; set; }

    // Additional padding added around detected speech segments (default: 30 ms)
    public TimeSpan SpeechPadding { get; set; }
}
🔍 Key Parameters in VadSettings
- EnergyThreshold: Determines sensitivity to sound levels, with lower values detecting quieter speech.
- MaxSpeechDuration: Controls maximum segment length, preventing overly long segments.
- MinSilenceDuration: Ensures clear separation between speech segments.
- MinSpeechDuration: Filters out brief noises by setting a minimum segment duration.
- SampleOverlapSeconds: Allows overlapping analysis for smoother detection.
- SpeechPadding: Adds context around speech segments, improving transcription accuracy (a tuned configuration is sketched after this list).
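As an illustration of how these parameters interact, here is one possible configuration for noisier recordings. The specific values are assumptions chosen for demonstration, not official recommendations or defaults:
// Illustrative settings for a noisy environment (values are assumptions, not defaults):
var noisyEnvironmentSettings = new VadSettings
{
    EnergyThreshold = 0.2f,                              // less sensitive, so background noise is not flagged as speech
    MinSpeechDuration = TimeSpan.FromMilliseconds(400),  // ignore short bursts such as clicks or coughs
    MinSilenceDuration = TimeSpan.FromMilliseconds(200), // require a clearer pause before splitting segments
    SpeechPadding = TimeSpan.FromMilliseconds(60)        // keep extra context so word edges are not clipped
};
Lowering EnergyThreshold instead would make detection more sensitive, which suits quiet speakers or well-isolated microphones.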
🚀 Quickstart Example
Here's how to quickly set up VAD settings for transcription:
using System;
using System.Threading;
using LMKit.Speech;

// Configure VAD to favor slightly longer speech segments with extra padding.
var vadSettings = new VadSettings
{
    EnergyThreshold = 0.1f,
    MinSpeechDuration = TimeSpan.FromMilliseconds(300),
    MinSilenceDuration = TimeSpan.FromMilliseconds(150),
    SpeechPadding = TimeSpan.FromMilliseconds(50)
};

// Transcribe the audio file, using VAD to isolate speech segments first.
var speechToText = new SpeechToText("model-path", vadSettings);
var transcriptionResult = await speechToText.TranscribeAsync("audio-file.wav", CancellationToken.None);

// Print each detected speech segment with its time range.
foreach (var segment in transcriptionResult.AudioSegments)
{
    Console.WriteLine($"[{segment.StartTime}-{segment.EndTime}] {segment.Text}");
}
📖 Common Terms
- Speech Segment: Continuous portion of audio containing speech.
- Energy Threshold: Level determining what is considered speech.
- Silence Duration: Length of silence separating speech segments.
- Padding: Extra time added around segments for context clarity.
🔗 Related Concepts
- Speech-to-Text (STT): Converting spoken words to text.
- Audio Segmentation: Dividing audio streams into meaningful sections.
- Language Detection: Identifying spoken language automatically.
📝 Summary
In LM-Kit.NET, Voice Activity Detection (VAD) provides configurable and precise audio segmentation capabilities, isolating speech from noise for enhanced transcription accuracy, efficiency, and overall audio analysis quality. The adjustable settings in the VadSettings class empower developers to tailor speech processing workflows to specific use cases and environments.