What is Speech-to-Text (Automatic Speech Recognition)?


TL;DR

Speech-to-text (STT), also called automatic speech recognition (ASR), is the technology that converts spoken language in audio recordings into written text. Modern ASR systems use deep learning models (most notably OpenAI's Whisper architecture) to transcribe speech with high accuracy across dozens of languages, handling accents, background noise, and domain-specific vocabulary. Speech-to-text is the foundational modality bridge that turns audio into text that LLMs can process, enabling voice-driven AI assistants, meeting transcription, audio content analysis, and accessibility applications. LM-Kit.NET provides local, on-device speech-to-text via Whisper models through the SpeechToText class, with voice activity detection (VAD) for automatic speech segmentation, all running entirely on local hardware without cloud dependency.


What Exactly is Speech-to-Text?

Speech-to-text converts an audio signal (a waveform of sound) into a sequence of words:

Input:  Audio waveform (WAV, MP3, or other format)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        "Hello, I'd like to schedule a meeting for tomorrow at 3 PM"

Processing:
  [Audio Encoder] → Acoustic features (mel spectrograms)
        ↓
  [Decoder / Language Model] → Token predictions
        ↓
  [Post-processing] → Punctuation, formatting, timestamps

Output: "Hello, I'd like to schedule a meeting for tomorrow at 3 PM."
        + timestamps: [(0.0s, "Hello,"), (0.5s, "I'd like to..."), ...]

This is fundamentally different from simple audio analysis. ASR must handle:

  • Continuous speech: Words blend together without clear boundaries
  • Variability: Different speakers, accents, speaking speeds, emotional states
  • Ambiguity: "their" vs. "there" vs. "they're" require contextual understanding
  • Noise: Background sounds, music, multiple speakers, echo
  • Domain vocabulary: Medical terms, legal jargon, product names, technical terms

The Whisper Revolution

The Whisper model family (released by OpenAI in 2022, continuously improved since) transformed ASR by training on 680,000 hours of multilingual audio data. Key advances:

  • Multilingual: Supports 99+ languages in a single model
  • Robust to noise: Handles real-world audio quality, not just studio recordings
  • Punctuation and formatting: Produces properly punctuated, readable text
  • Translation: Can transcribe and translate simultaneously (e.g., French audio to English text)
  • Multiple sizes: From tiny (39M parameters) to large (1.5B parameters), enabling deployment across hardware tiers

Whisper Model    Parameters   Speed vs. Quality
Tiny             39M          Fastest, suitable for real-time on CPU
Base             74M          Good balance for lightweight deployment
Small            244M         Strong accuracy for most languages
Medium           769M         High accuracy, especially for non-English
Large v3         1.5B         Best accuracy, requires more compute
Large v3 Turbo   809M         Near-large accuracy at medium speed

Why Speech-to-Text Matters

  1. Voice as Input Modality: Speech is the most natural human communication mode. STT enables AI agents and assistants to accept voice input, making AI accessible in hands-free scenarios, on mobile devices, and for users who prefer speaking to typing.

  2. Audio Content Unlocking: Meetings, interviews, podcasts, customer calls, lectures, and voicemails contain valuable information locked in audio. STT converts this into searchable, analyzable text that LLMs can process. See Extract Action Items from Audio.

  3. Multi-Modal Pipelines: STT is the bridge that connects audio to text-based AI. Transcribed audio can feed into summarization, sentiment analysis, extraction, classification, and RAG pipelines. See Multi-Modal AI.

  4. Accessibility: STT enables real-time captioning for hearing-impaired users, transcription services for educational content, and voice-controlled interfaces for users with mobility limitations.

  5. Privacy with Local Inference: Cloud STT services require sending audio (often containing sensitive conversations) to external servers. Local STT with LM-Kit.NET keeps all audio on-premises, critical for healthcare, legal, financial, and government applications. See Edge AI.

  6. Cost at Scale: Transcribing thousands of hours of audio through cloud APIs is expensive. Local inference has a fixed hardware cost and zero per-minute fees, making high-volume transcription economically viable.


Technical Insights

The STT Pipeline

A complete speech-to-text pipeline involves several stages:

Raw Audio
    ↓
[Pre-processing]
  - Format conversion (MP3, WAV, etc.)
  - Sample rate normalization (16kHz for Whisper)
  - Channel mixing (stereo to mono)
    ↓
[Voice Activity Detection (VAD)]
  - Detect speech segments vs. silence/noise
  - Split audio into manageable chunks
  - Filter non-speech segments
  See: Voice Activity Detection (VAD)
    ↓
[Feature Extraction]
  - Convert waveform to mel spectrogram
  - 80-dimensional mel-frequency features
  - 30-second processing windows
    ↓
[Model Inference]
  - Encoder processes audio features
  - Decoder generates token sequence
  - Beam search for best transcription
    ↓
[Post-processing]
  - Assemble segments into full transcript
  - Add punctuation and formatting
  - Generate word-level timestamps
    ↓
Transcribed Text + Timestamps
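The feature-extraction stage above can be sketched in plain NumPy. This is an illustrative reimplementation of the standard log-mel front end, not LM-Kit.NET code: Whisper frames 16 kHz audio into 25 ms windows with a 10 ms hop (n_fft = 400, hop = 160) and projects each frame's power spectrum onto 80 triangular mel filters.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=400, sr=16000):
    # Triangular filters spaced evenly on the mel scale.
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Frame the waveform (25 ms windows, 10 ms hop at 16 kHz), apply a
    # Hann window, take the power spectrum, project onto the mel
    # filterbank, and compress the dynamic range with a log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log10(np.maximum(mel, 1e-10))
```

One second of 16 kHz audio yields 98 frames of 80 mel features, which is the kind of (time, 80) matrix the encoder consumes in 30-second windows.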

Voice Activity Detection (VAD)

VAD is a critical preprocessing step that detects which portions of audio contain speech. Without VAD, the model processes silence and background noise, wasting compute and potentially generating hallucinated text from non-speech audio. VAD segments the audio into speech chunks that are processed independently, improving both speed and accuracy.
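As a toy illustration of the segmentation idea, here is an energy-threshold VAD. Real VAD (including what production systems ship) uses trained models that are far more robust to noise; the frame size and threshold below are illustrative assumptions.

```python
import numpy as np

def energy_vad(audio, sr=16000, frame_ms=30, threshold_db=-35.0):
    # Label each fixed-size frame as speech (True) or non-speech (False)
    # by comparing its RMS energy, in dB, against a threshold.
    frame_len = int(sr * frame_ms / 1000)
    flags = []
    for i in range(len(audio) // frame_len):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2) + 1e-12)
        flags.append(20 * np.log10(rms + 1e-12) > threshold_db)
    return flags

def speech_segments(flags, frame_ms=30):
    # Merge runs of consecutive speech frames into (start_s, end_s) pairs.
    segments, start = [], None
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, len(flags) * frame_ms / 1000))
    return segments
```

Only the returned segments are sent to the recognizer, so silence never reaches the model and cannot be "transcribed" into hallucinated text.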

Key Quality Factors

Factor             Impact                                               Mitigation
Audio quality      Low bitrate or heavy compression degrades accuracy   Use highest available quality source
Background noise   Competes with the speech signal                      VAD filtering, noise-robust models (Whisper)
Speaker overlap    Multiple simultaneous speakers confuse the model     Speaker diarization preprocessing
Domain vocabulary  Technical terms may be misrecognized                 Post-processing correction, prompt conditioning
Language mixing    Code-switching between languages                     Multilingual models handle this better
Audio length       Very long audio requires segmentation                VAD-based chunking with overlap
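The "VAD-based chunking with overlap" mitigation for long audio can be sketched as a window scheduler. The 30-second chunk matches Whisper's processing window; the 2-second overlap is an illustrative choice that lets boundary words, cut mid-chunk, be reconciled when segments are stitched back together.

```python
def chunk_with_overlap(duration_s, chunk_s=30.0, overlap_s=2.0):
    # Split a recording of duration_s seconds into overlapping
    # (start, end) windows; each window is transcribed independently.
    chunks, start = [], 0.0
    step = chunk_s - overlap_s
    while start < duration_s:
        chunks.append((start, min(start + chunk_s, duration_s)))
        if start + chunk_s >= duration_s:
            break
        start += step
    return chunks
```

For a 70-second file this yields windows (0, 30), (28, 58), and (56, 70); the duplicated words in the 2-second overlaps are deduplicated during transcript assembly.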

STT for Downstream AI Processing

The real power of STT emerges when transcripts feed into LLM-based processing:

Audio Recording
    ↓
[Speech-to-Text] → Raw transcript
    ↓
[LLM Processing] → Choose downstream task:
    ├── Summarization → Meeting summary
    ├── Extraction → Action items, decisions, names
    ├── Sentiment Analysis → Customer satisfaction score
    ├── Classification → Topic categorization
    ├── RAG Indexing → Searchable knowledge base
    └── Translation → Multi-language transcript

See Transcribe and Reformat Audio with LLM for combining STT with LLM post-processing.


Practical Use Cases

  • Meeting Transcription and Summarization: Record meetings, transcribe with STT, then use an LLM to generate summaries, extract action items, and identify decisions. See Extract Action Items from Audio and Transcribe and Generate Chaptered Documents.

  • Customer Call Analysis: Transcribe support calls, then apply sentiment analysis to gauge customer satisfaction, extract key issues, and identify trends across thousands of calls. See Analyze Customer Sentiment.

  • Medical Dictation: Physicians dictate clinical notes that are transcribed locally, keeping patient data on-premises for HIPAA compliance. The transcript can then feed into extraction pipelines for structured data capture.

  • Legal Transcription: Depositions, hearings, and client meetings transcribed with timestamps for legal reference, all processed locally for attorney-client privilege protection.

  • Podcast and Video Indexing: Transcribe audio content and index it in a RAG knowledge base for searchable, question-answerable content. See Build RAG Pipeline.

  • Accessibility Services: Real-time captioning for live events, lecture transcription for students, and voice-to-text for communication assistance.


Key Terms

  • Speech-to-Text (STT): The process of converting spoken language in audio into written text. Also called automatic speech recognition (ASR).

  • Automatic Speech Recognition (ASR): The technical field and technology for converting speech to text. STT and ASR are used interchangeably.

  • Whisper: OpenAI's open-source speech recognition model family, trained on 680,000 hours of multilingual audio, available in multiple sizes from tiny to large.

  • Mel Spectrogram: A visual/numerical representation of audio frequency content over time, used as the input feature for most modern ASR models.

  • Voice Activity Detection (VAD): The preprocessing step that identifies speech segments in audio, filtering out silence and noise. See Voice Activity Detection.

  • Speaker Diarization: The process of determining "who spoke when" in multi-speaker audio, separating different speakers' contributions.

  • Word-Level Timestamps: Timing information associating each transcribed word with its position in the audio, enabling precise audio-text alignment.

  • Beam Search: A decoding strategy that explores multiple transcription hypotheses simultaneously to find the most likely complete transcription.
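To make the Beam Search entry concrete, here is a toy decoder over a hand-built three-token model (all probabilities are invented for illustration). Greedy decoding (beam width 1) commits to "their" because it looks best locally; a width-2 beam also keeps "there" alive and finds the globally better "there is":

```python
import math

def beam_search(next_logprobs, n_steps, beam_width=2):
    # next_logprobs(prefix) -> {token: log-probability}, standing in for
    # the ASR decoder's distribution conditioned on the tokens so far.
    beams = [((), 0.0)]  # (token sequence, cumulative log probability)
    for _ in range(n_steps):
        candidates = [(seq + (tok,), score + lp)
                      for seq, score in beams
                      for tok, lp in next_logprobs(seq).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the top hypotheses
    return beams[0][0]

def toy_model(prefix):
    # Hand-built distribution: "their" is the better first token in
    # isolation, but every continuation of "there" scores higher overall.
    table = {
        (): {"their": 0.6, "there": 0.4},
        ("their",): {"car": 0.5, "dog": 0.5},
        ("there",): {"is": 0.9, "was": 0.1},
    }
    return {tok: math.log(p) for tok, p in table[prefix].items()}
```

Here "their car" scores 0.6 × 0.5 = 0.30 while "there is" scores 0.4 × 0.9 = 0.36, so the wider beam recovers the transcription greedy search misses — the same mechanism that resolves "their"/"there"/"they're" from context in real decoders.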


Related LM-Kit.NET APIs

  • SpeechToText: Core speech recognition engine
  • ModelCard: Whisper model catalog (tiny through large-v3-turbo)




Summary

Speech-to-text (STT) converts spoken language into written text, bridging the gap between audio content and text-based AI processing. Modern ASR, powered by models like Whisper, delivers high accuracy across 99+ languages with robustness to real-world noise, accents, and audio quality variations. STT is the entry point for multi-modal AI pipelines that process audio: meetings become searchable transcripts, customer calls become sentiment data, and voice commands become agent instructions. LM-Kit.NET provides fully local STT through the SpeechToText class with Whisper models ranging from tiny (39M parameters, real-time on CPU) to large-v3 (1.5B parameters, highest accuracy), integrated with VAD for automatic speech segmentation. Running STT on local hardware with edge AI deployment keeps sensitive audio on-premises, eliminates per-minute cloud costs, and enables offline operation.
