What is Speech-to-Text (Automatic Speech Recognition)?


TL;DR

Speech-to-text (STT), also called automatic speech recognition (ASR), is the technology that converts spoken language in audio recordings into written text. Modern ASR systems use deep learning models (most notably OpenAI's Whisper architecture) to transcribe speech with high accuracy across dozens of languages, handling accents, background noise, and domain-specific vocabulary. Speech-to-text is the foundational modality bridge that turns audio into text that LLMs can process, enabling voice-driven AI assistants, meeting transcription, audio content analysis, and accessibility applications. LM-Kit.NET provides local, on-device speech-to-text via Whisper models through the SpeechToText class, with voice activity detection (VAD) for automatic speech segmentation, all running entirely on local hardware without cloud dependency.


What Exactly is Speech-to-Text?

Speech-to-text converts an audio signal (a waveform of sound) into a sequence of words:

Input:  Audio waveform (WAV, MP3, or other format)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        "Hello, I'd like to schedule a meeting for tomorrow at 3 PM"

Processing:
  [Audio Encoder] → Acoustic features (mel spectrograms)
        ↓
  [Decoder / Language Model] → Token predictions
        ↓
  [Post-processing] → Punctuation, formatting, timestamps

Output: "Hello, I'd like to schedule a meeting for tomorrow at 3 PM."
        + timestamps: [(0.0s, "Hello,"), (0.5s, "I'd like to..."), ...]

This is fundamentally different from simple audio analysis. ASR must handle:

  • Continuous speech: Words blend together without clear boundaries
  • Variability: Different speakers, accents, speaking speeds, emotional states
  • Ambiguity: "their" vs. "there" vs. "they're" require contextual understanding
  • Noise: Background sounds, music, multiple speakers, echo
  • Domain vocabulary: Medical terms, legal jargon, product names, technical terms

The Whisper Revolution

The Whisper model family (released by OpenAI in 2022, continuously improved since) transformed ASR by training on 680,000 hours of multilingual audio data. Key advances:

  • Multilingual: Supports 99+ languages in a single model
  • Robust to noise: Handles real-world audio quality, not just studio recordings
  • Punctuation and formatting: Produces properly punctuated, readable text
  • Translation: Can transcribe and translate simultaneously (e.g., French audio to English text)
  • Multiple sizes: From tiny (39M parameters) to large (1.5B parameters), enabling deployment across hardware tiers

Whisper Model    Parameters   Speed vs. Quality
Tiny             39M          Fastest, suitable for real-time on CPU
Base             74M          Good balance for lightweight deployment
Small            244M         Strong accuracy for most languages
Medium           769M         High accuracy, especially for non-English
Large v3         1.5B         Best accuracy, requires more compute
Large v3 Turbo   809M         Near-large accuracy at medium speed

Why Speech-to-Text Matters

  1. Voice as Input Modality: Speech is the most natural human communication mode. STT enables AI agents and assistants to accept voice input, making AI accessible in hands-free scenarios, on mobile devices, and for users who prefer speaking to typing.

  2. Audio Content Unlocking: Meetings, interviews, podcasts, customer calls, lectures, and voicemails contain valuable information locked in audio. STT converts this into searchable, analyzable text that LLMs can process. See Extract Action Items from Audio.

  3. Multi-Modal Pipelines: STT is the bridge that connects audio to text-based AI. Transcribed audio can feed into summarization, sentiment analysis, extraction, classification, and RAG pipelines. See Multi-Modal AI.

  4. Accessibility: STT enables real-time captioning for hearing-impaired users, transcription services for educational content, and voice-controlled interfaces for users with mobility limitations.

  5. Privacy with Local Inference: Cloud STT services require sending audio (often containing sensitive conversations) to external servers. Local STT with LM-Kit.NET keeps all audio on-premises, critical for healthcare, legal, financial, and government applications. See Edge AI.

  6. Cost at Scale: Transcribing thousands of hours of audio through cloud APIs is expensive. Local inference has a fixed hardware cost and zero per-minute fees, making high-volume transcription economically viable.


Technical Insights

The STT Pipeline

A complete speech-to-text pipeline involves several stages:

Raw Audio
    ↓
[Pre-processing]
  - Format conversion (MP3, WAV, etc.)
  - Sample rate normalization (16kHz for Whisper)
  - Channel mixing (stereo to mono)
    ↓
[Voice Activity Detection (VAD)]
  - Detect speech segments vs. silence/noise
  - Split audio into manageable chunks
  - Filter non-speech segments
  See: Voice Activity Detection (VAD)
    ↓
[Feature Extraction]
  - Convert waveform to mel spectrogram
  - 80-dimensional mel-frequency features
  - 30-second processing windows
    ↓
[Model Inference]
  - Encoder processes audio features
  - Decoder generates token sequence
  - Beam search for best transcription
    ↓
[Post-processing]
  - Assemble segments into full transcript
  - Add punctuation and formatting
  - Generate word-level timestamps
    ↓
Transcribed Text + Timestamps
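The feature-extraction stage above can be sketched in plain NumPy. This is an illustrative reimplementation of the standard log-mel front end, not LM-Kit.NET code: Whisper frames 16 kHz audio into 25 ms windows with a 10 ms hop (n_fft = 400, hop = 160) and projects each frame's power spectrum onto 80 triangular mel filters.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=400, sr=16000):
    # Triangular filters spaced evenly on the mel scale.
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Frame the waveform (25 ms windows, 10 ms hop at 16 kHz), apply a
    # Hann window, take the power spectrum, project onto the mel
    # filterbank, and compress the dynamic range with a log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log10(np.maximum(mel, 1e-10))
```

One second of 16 kHz audio yields 98 frames of 80 mel features, which is the kind of (time, 80) matrix the encoder consumes in 30-second windows.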

Voice Activity Detection (VAD)

VAD is a critical preprocessing step that detects which portions of audio contain speech. Without VAD, the model processes silence and background noise, wasting compute and potentially generating hallucinated text from non-speech audio. VAD segments the audio into speech chunks that are processed independently, improving both speed and accuracy.
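As a toy illustration of the segmentation idea, here is an energy-threshold VAD. Real VAD (including what production systems ship) uses trained models that are far more robust to noise; the frame size and threshold below are illustrative assumptions.

```python
import numpy as np

def energy_vad(audio, sr=16000, frame_ms=30, threshold_db=-35.0):
    # Label each fixed-size frame as speech (True) or non-speech (False)
    # by comparing its RMS energy, in dB, against a threshold.
    frame_len = int(sr * frame_ms / 1000)
    flags = []
    for i in range(len(audio) // frame_len):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2) + 1e-12)
        flags.append(20 * np.log10(rms + 1e-12) > threshold_db)
    return flags

def speech_segments(flags, frame_ms=30):
    # Merge runs of consecutive speech frames into (start_s, end_s) pairs.
    segments, start = [], None
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, len(flags) * frame_ms / 1000))
    return segments
```

Only the returned segments are sent to the recognizer, so silence never reaches the model and cannot be "transcribed" into hallucinated text.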

Key Quality Factors

Factor             Impact                                               Mitigation
Audio quality      Low bitrate or heavy compression degrades accuracy   Use highest available quality source
Background noise   Competes with the speech signal                      VAD filtering, noise-robust models (Whisper)
Speaker overlap    Multiple simultaneous speakers confuse the model     Speaker diarization preprocessing
Domain vocabulary  Technical terms may be misrecognized                 Post-processing correction, prompt conditioning
Language mixing    Code-switching between languages                     Multilingual models handle this better
Audio length       Very long audio requires segmentation                VAD-based chunking with overlap
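The "VAD-based chunking with overlap" mitigation for long audio can be sketched as a window scheduler. The 30-second chunk matches Whisper's processing window; the 2-second overlap is an illustrative choice that lets boundary words, cut mid-chunk, be reconciled when segments are stitched back together.

```python
def chunk_with_overlap(duration_s, chunk_s=30.0, overlap_s=2.0):
    # Split a recording of duration_s seconds into overlapping
    # (start, end) windows; each window is transcribed independently.
    chunks, start = [], 0.0
    step = chunk_s - overlap_s
    while start < duration_s:
        chunks.append((start, min(start + chunk_s, duration_s)))
        if start + chunk_s >= duration_s:
            break
        start += step
    return chunks
```

For a 70-second file this yields windows (0, 30), (28, 58), and (56, 70); the duplicated words in the 2-second overlaps are deduplicated during transcript assembly.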

STT for Downstream AI Processing

The real power of STT emerges when transcripts feed into LLM-based processing:

Audio Recording
    ↓
[Speech-to-Text] → Raw transcript
    ↓
[LLM Processing] → Choose downstream task:
    ├── Summarization → Meeting summary
    ├── Extraction → Action items, decisions, names
    ├── Sentiment Analysis → Customer satisfaction score
    ├── Classification → Topic categorization
    ├── RAG Indexing → Searchable knowledge base
    └── Translation → Multi-language transcript

See Transcribe and Reformat Audio with LLM for combining STT with LLM post-processing.


Practical Use Cases

  • Meeting Transcription and Summarization: Record meetings, transcribe with STT, then use an LLM to generate summaries, extract action items, and identify decisions. See Extract Action Items from Audio and Transcribe and Generate Chaptered Documents.

  • Customer Call Analysis: Transcribe support calls, then apply sentiment analysis to gauge customer satisfaction, extract key issues, and identify trends across thousands of calls. See Analyze Customer Sentiment.

  • Medical Dictation: Physicians dictate clinical notes that are transcribed locally, keeping patient data on-premises for HIPAA compliance. The transcript can then feed into extraction pipelines for structured data capture.

  • Legal Transcription: Depositions, hearings, and client meetings transcribed with timestamps for legal reference, all processed locally for attorney-client privilege protection.

  • Podcast and Video Indexing: Transcribe audio content and index it in a RAG knowledge base for searchable, question-answerable content. See Build RAG Pipeline.

  • Accessibility Services: Real-time captioning for live events, lecture transcription for students, and voice-to-text for communication assistance.


Key Terms

  • Speech-to-Text (STT): The process of converting spoken language in audio into written text. Also called automatic speech recognition (ASR).

  • Automatic Speech Recognition (ASR): The technical field and technology for converting speech to text. STT and ASR are used interchangeably.

  • Whisper: OpenAI's open-source speech recognition model family, trained on 680,000 hours of multilingual audio, available in multiple sizes from tiny to large.

  • Mel Spectrogram: A visual/numerical representation of audio frequency content over time, used as the input feature for most modern ASR models.

  • Voice Activity Detection (VAD): The preprocessing step that identifies speech segments in audio, filtering out silence and noise. See Voice Activity Detection.

  • Speaker Diarization: The process of determining "who spoke when" in multi-speaker audio, separating different speakers' contributions.

  • Word-Level Timestamps: Timing information associating each transcribed word with its position in the audio, enabling precise audio-text alignment.

  • Beam Search: A decoding strategy that explores multiple transcription hypotheses simultaneously to find the most likely complete transcription.
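To make the Beam Search entry concrete, here is a toy decoder over a hand-built three-token model (all probabilities are invented for illustration). Greedy decoding (beam width 1) commits to "their" because it looks best locally; a width-2 beam also keeps "there" alive and finds the globally better "there is":

```python
import math

def beam_search(next_logprobs, n_steps, beam_width=2):
    # next_logprobs(prefix) -> {token: log-probability}, standing in for
    # the ASR decoder's distribution conditioned on the tokens so far.
    beams = [((), 0.0)]  # (token sequence, cumulative log probability)
    for _ in range(n_steps):
        candidates = [(seq + (tok,), score + lp)
                      for seq, score in beams
                      for tok, lp in next_logprobs(seq).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the top hypotheses
    return beams[0][0]

def toy_model(prefix):
    # Hand-built distribution: "their" is the better first token in
    # isolation, but every continuation of "there" scores higher overall.
    table = {
        (): {"their": 0.6, "there": 0.4},
        ("their",): {"car": 0.5, "dog": 0.5},
        ("there",): {"is": 0.9, "was": 0.1},
    }
    return {tok: math.log(p) for tok, p in table[prefix].items()}
```

Here "their car" scores 0.6 × 0.5 = 0.30 while "there is" scores 0.4 × 0.9 = 0.36, so the wider beam recovers the transcription greedy search misses — the same mechanism that resolves "their"/"there"/"they're" from context in real decoders.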


Related LM-Kit.NET APIs

  • SpeechToText: Core speech recognition engine
  • ModelCard: Whisper model catalog (tiny through large-v3-turbo)




Summary

Speech-to-text (STT) converts spoken language into written text, bridging the gap between audio content and text-based AI processing. Modern ASR, powered by models like Whisper, delivers high accuracy across 99+ languages with robustness to real-world noise, accents, and audio quality variations. STT is the entry point for multi-modal AI pipelines that process audio: meetings become searchable transcripts, customer calls become sentiment data, and voice commands become agent instructions. LM-Kit.NET provides fully local STT through the SpeechToText class with Whisper models ranging from tiny (39M parameters, real-time on CPU) to large-v3 (1.5B parameters, highest accuracy), integrated with VAD for automatic speech segmentation. Running STT on local hardware with edge AI deployment keeps sensitive audio on-premises, eliminates per-minute cloud costs, and enables offline operation.
