Speech Recognition Overview

Study Notes

Speech Recognition (SR), also called Automatic Speech Recognition (ASR), is the process of converting spoken language into text.
SR involves multiple layers of processing, from the acoustic level up to the pragmatic level:
- Acoustic Layer: Analyzing the sound waves of speech.
- Phonetic/Prosodic Layer: Identifying the sounds and their timing/intonation.
- Syntactic Layer: Determining how the recognized words combine into grammatically valid sentences.
- Semantic Layer: Understanding the meaning of the words and their relationships.
- Pragmatic Layer: Interpreting the context and speaker's intent.
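
As a small illustration of the acoustic layer, the sketch below takes one short frame of a signal and finds its strongest frequency with a naive discrete Fourier transform. The 8000 Hz sample rate, the 50 ms frame, and the 440 Hz test tone are assumptions chosen for the example; real systems use faster transforms and richer features.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the first N/2 DFT bins of a real-valued frame (naive O(N^2))."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags

def dominant_frequency(frame, sample_rate):
    """Frequency in Hz of the DFT bin with the largest magnitude."""
    mags = dft_magnitudes(frame)
    peak_bin = max(range(len(mags)), key=mags.__getitem__)
    return peak_bin * sample_rate / len(frame)

SAMPLE_RATE = 8000   # samples per second (assumed)
FRAME_LEN = 400      # 50 ms frame, so each bin is 20 Hz wide

# Synthetic test tone standing in for a voiced sound.
tone = [math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(FRAME_LEN)]
print(dominant_frequency(tone, SAMPLE_RATE))  # 440.0
```

The tone falls exactly on a bin centre here, which is why the result is exact; real speech spreads energy across many bins.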

Several factors make speech recognition difficult:
Word Boundary Detection: Identifying where one word ends and another begins is difficult because continuous speech rarely pauses between words, pronunciation varies, and speakers produce disfluencies (hesitations, repetitions, etc.).
Speaking Rate Variability: People speak at different speeds, which changes the duration and clarity of individual sounds.
Variability Across Languages: Languages differ in their sounds and grammatical structures, so models often have to be trained or adapted for each language.
Noise and Environment: Background noise, microphone quality, and transmission channels can significantly impact the clarity of the speech signal, making it harder to analyze.
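
Two of these difficulties, boundaries and noise, can be illustrated with a deliberately simple energy-based segmenter. This is a sketch under stated assumptions: it only finds stretches of sound separated by pauses, which is not the same as finding word boundaries, and the signal, frame length, and threshold are all made up for the example.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def speech_segments(samples, frame_len, threshold):
    """Return (start_frame, end_frame) pairs, end exclusive, where energy exceeds threshold."""
    energies = frame_energies(samples, frame_len)
    segments = []
    start = None
    for idx, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = idx                      # a loud stretch begins
        elif energy <= threshold and start is not None:
            segments.append((start, idx))    # the loud stretch ended
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments

# Synthetic signal: silence, a burst, silence, a louder burst.
signal = [0.0] * 20 + [0.5, -0.5] * 15 + [0.0] * 20 + [0.8, -0.8] * 10
print(speech_segments(signal, frame_len=10, threshold=0.01))  # [(2, 5), (7, 9)]

# Same signal with a constant noise floor replacing the silence.
noisy = [x if x != 0.0 else 0.2 for x in signal]
print(speech_segments(noisy, frame_len=10, threshold=0.01))   # [(0, 9)]
```

With the noise floor added, the pause disappears and the two bursts merge into one segment, which is the kind of failure that background noise causes for simple detectors.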

Speech technology supports several related tasks:
Speech to Text: Converting spoken language to written text for applications such as dictation software, transcription, and voice search.
Speaker Identification: Determining the identity of a speaker based on their voice characteristics.
Speaker Verification: Confirming the identity of a speaker by comparing their voice to a previously stored voice print.
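
The difference between identification and verification comes down to the decision being made. Identification picks the best match among all enrolled speakers (1:N), while verification checks a single claimed identity against a threshold (1:1). The sketch below assumes similarity scores have already been computed; the speaker names, scores, and threshold are hypothetical.

```python
def identify(scores):
    """1:N decision: return the enrolled speaker with the highest similarity score."""
    return max(scores, key=scores.get)

def verify(scores, claimed_id, threshold):
    """1:1 decision: accept only if the claimed speaker's score reaches the threshold."""
    return scores.get(claimed_id, float("-inf")) >= threshold

# Hypothetical similarity of one utterance to each enrolled voice print.
scores = {"alice": 0.82, "bob": 0.41, "carol": 0.67}

print(identify(scores))                # alice
print(verify(scores, "alice", 0.75))   # True
print(verify(scores, "carol", 0.75))   # False
```

Note that identification always returns someone from the enrolled set, whereas verification can reject the claim outright.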

Speaker verification works in two phases:
Enrollment Phase: Collects and analyzes voice samples from a speaker to create a unique vocal model.
Verification Phase: Compares the voice of a speaker claiming a specific identity to their enrolled model to confirm or reject the identity claim.
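
The two phases can be sketched as follows. This is a minimal illustration, not a production method: it assumes each voice sample has already been reduced to a fixed-length feature vector, builds the speaker model by simple averaging, and compares with cosine similarity. The class name, vectors, and threshold are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SpeakerVerifier:
    def __init__(self, threshold):
        self.threshold = threshold
        self.models = {}  # speaker id -> averaged feature vector

    def enroll(self, speaker_id, feature_vectors):
        """Enrollment: build the speaker's model from several samples."""
        self.models[speaker_id] = mean_vector(feature_vectors)

    def verify(self, claimed_id, feature_vector):
        """Verification: accept the claim only if the sample is close to the model."""
        model = self.models.get(claimed_id)
        if model is None:
            return False  # nobody enrolled under this identity
        return cosine_similarity(model, feature_vector) >= self.threshold

verifier = SpeakerVerifier(threshold=0.9)
verifier.enroll("alice", [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.1, 0.0, 0.1]])

print(verifier.verify("alice", [1.0, 0.1, 0.1]))  # True: close to the enrolled model
print(verifier.verify("alice", [0.0, 1.0, 0.2]))  # False: points in a different direction
```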

Several factors affect the accuracy of speaker recognition:
Speech Quality: Clarity of speech, background noise, microphone quality, and channel variations can affect accuracy.
Speech Modality: Whether the system requires a pre-defined phrase (text-dependent) or can handle arbitrary speech (text-independent) influences performance and the range of applications.
Speech Duration: The length of the samples used for enrollment and verification affects accuracy; longer samples generally give more reliable results.
Speaker Population: The more speakers are enrolled in the system, the harder it becomes to tell them apart.
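
The effect of population size can be made concrete with a rough calculation. The sketch below assumes every comparison against an enrolled speaker is independent and has the same false-accept rate, which is a simplification; the 1% rate is a hypothetical figure. Under those assumptions, the chance that an unknown voice falsely matches at least one of N enrolled speakers is 1 - (1 - rate)^N.

```python
def prob_any_false_match(false_accept_rate, num_speakers):
    """Probability of at least one false match across independent comparisons."""
    return 1.0 - (1.0 - false_accept_rate) ** num_speakers

# Hypothetical 1% false-accept rate per comparison.
for n in (1, 10, 100, 1000):
    print(n, round(prob_any_false_match(0.01, n), 3))
```

The probability climbs steadily with the number of enrolled speakers, which is why large populations call for stricter thresholds or more discriminative models.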