Audio Speaker Diarization PDF
Summary
This document explains speaker diarization, a process for segmenting and clustering speech recordings to identify speakers. It describes steps like speaker segmentation, feature extraction, and clustering, along with the MFCC (Mel Frequency Cepstral Coefficients) features and the CTC (Connectionist Temporal Classification) technique used for timestamp alignment and best alignment path selection. It also covers two common search algorithms, Greedy Search and Beam Search, which come from NLP but are also used in speech recognition systems.
Full Transcript
Audio Speaker Diarization

Speaker diarization is the process of segmenting and clustering a speech recording into homogeneous regions; it answers the question "who spoke when". First, it discriminates speech segments from non-speech ones. Second, it detects speaker change points to segment the audio data. Finally, it groups the segmented regions into speaker-homogeneous clusters.

Overall scheme:
- Speaker segmentation
- Feature extraction
- Speaker clustering

MFCC (Mel Frequency Cepstral Coefficients): spectral features extracted from each short audio frame (a minimal extraction sketch appears at the end of this note).

CTC Timestamp Alignment (e.g. the Wav2Vec model)

1. Input and Model
We input an audio waveform and its known transcription. A neural network then processes the audio to generate emissions. Emissions are simply probability distributions over possible output labels, which can be phonemes or characters. One distribution is produced per frame, where a frame usually covers 10-25 milliseconds of audio.

2. Frame-level emissions
The model generates emissions for each frame (10-25 milliseconds); a special blank label is also used.

(Figure: Audio 1)

Emission matrix:

Time step | Blank ("-") | A    | B    | C
T1        | 0.80        | 0.10 | 0.05 | 0.05
T2        | 0.70        | 0.20 | 0.05 | 0.05
T3        | 0.30        | 0.10 | 0.40 | 0.20
T4        | 0.10        | 0.10 | 0.10 | 0.70

A, B, C and Blank can be treated as phonemes, and the numbers are their probabilities. There are many potential paths through this matrix, such as:
Path 1: "Blank -> A -> Blank -> B -> Blank -> C"
Path 2: "A -> Blank -> B -> C"
Path 3: "Blank -> A -> A -> B -> Blank -> C"

3. CTC Loss and Alignment Path
CTC uses the blank label to represent time steps where no output is assigned. The forced alignment problem is, overall, about finding the most likely alignment path through the emissions that matches the known transcription. CTC computes the best alignment by considering all possible paths through the emissions that could lead to the target transcription.

4. Best Alignment Path
The best path is simply the one with the highest product of per-frame probabilities (the decoding sketch at the end of this note makes this concrete).

Beam Search and Greedy Search
These two algorithms are used mostly in NLP, but they can also be found in speech recognition systems. Their role is to propose an output sentence from a matrix of words and their probabilities at each position in the sentence.

(Figure: Audio 2)

Greedy Search simply takes the word with the highest probability at each position. It is very simple, but not as effective as Beam Search.

Beam Search looks at the probabilities at each position and, depending on a parameter (the beam width, which says how many "branches" of the search tree are kept), it calculates cumulative probability scores; see the example in the figure below and the sketch at the end of this note. All of this is repeated until the END token is finally produced.

(Figure: Audio 3)

ASR and VAD
The term ASR stands for Automatic Speech Recognition, while VAD stands for Voice Activity Detection.

Purpose
1. VAD simply detects whether a segment of audio contains speech or not. It does not provide any form of transcript.
2. ASR prepares transcripts by recognizing spoken words. It interprets and identifies the actual content of the speech.

Function
1. VAD is used primarily to separate silent segments from speech segments, improving the accuracy of downstream systems such as ASR. It is used in WhisperX before the audio parts land in the actual Whisper model, and it improves the results.
2. ASR is a full-featured system that analyzes input speech and converts it to text. Whisper is an ASR model. It is based on the transformer architecture (the same one language models use), which is why it works in much the same way.

Processing Complexity
1. VAD is a relatively lightweight process because it focuses on basic features like volume, pitch and spectral changes.
2. ASR is a complex model, requiring acoustic modeling, language processing, and often extensive computational resources.
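As a rough illustration of the MFCC feature-extraction step, here is a minimal sketch using the librosa library. The file name speech.wav and the parameter choices (16 kHz sample rate, 13 coefficients) are assumptions for the example, not values from the note.

```python
import librosa

# Load a mono waveform, resampled to 16 kHz (hypothetical file name).
y, sr = librosa.load("speech.wav", sr=16000)

# Compute 13 MFCCs per frame; each column of `mfcc` is the
# feature vector for one short analysis frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```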
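To make sections 2 and 4 concrete, here is a small sketch that takes the example emission matrix above and decodes the greedy best path using the standard CTC rule: pick the most probable label per frame, collapse repeats, then drop blanks. This is only an illustration of best-path decoding, not the full forced-alignment computation, which considers all paths matching the transcription.

```python
import numpy as np

labels = ["-", "A", "B", "C"]   # "-" is the CTC blank
emissions = np.array([           # rows = T1..T4 from the table above
    [0.80, 0.10, 0.05, 0.05],
    [0.70, 0.20, 0.05, 0.05],
    [0.30, 0.10, 0.40, 0.20],
    [0.10, 0.10, 0.10, 0.70],
])

# Greedy best path: the highest-probability label at every frame.
best = emissions.argmax(axis=1)            # [0, 0, 2, 3] -> "-", "-", "B", "C"
path_prob = emissions.max(axis=1).prod()   # 0.8 * 0.7 * 0.4 * 0.7 = 0.1568

# CTC decoding rule: collapse repeated labels, then remove blanks.
decoded, prev = [], None
for i in best:
    if i != prev and labels[i] != "-":
        decoded.append(labels[i])
    prev = i

print("".join(decoded), path_prob)  # BC 0.1568
```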
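Here is a minimal sketch contrasting greedy search with beam search over a per-position probability table; the toy vocabulary and probabilities are invented for illustration. Log-probabilities are summed to avoid numerical underflow from multiplying many small numbers, and a full decoder would additionally stop expanding a hypothesis once it emits the END token.

```python
import math

# Invented example: probability of each token at each position.
steps = [
    {"the": 0.5, "a": 0.4, "an": 0.1},
    {"dog": 0.4, "cat": 0.35, "car": 0.25},
    {"runs": 0.6, "sleeps": 0.3, "flies": 0.1},
]

def greedy_search(steps):
    # Take the single most probable token at every position.
    return [max(step, key=step.get) for step in steps]

def beam_search(steps, beam_width=2):
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for step in steps:
        candidates = [
            (seq + [tok], score + math.log(p))
            for seq, score in beams
            for tok, p in step.items()
        ]
        # Keep only the `beam_width` highest-scoring branches.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

print(greedy_search(steps))            # ['the', 'dog', 'runs']
print(beam_search(steps, beam_width=2))
```

In this toy table the per-position probabilities are independent, so greedy search already finds the best sequence; beam search pays off with autoregressive models, where each step's scores depend on the tokens chosen so far.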
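The VAD description above can be illustrated with a deliberately crude energy-threshold sketch. Real VADs also use pitch, spectral features, or neural networks; the threshold and the toy signal below are arbitrary assumptions.

```python
import numpy as np

def energy_vad(signal, sr, frame_ms=25, threshold=1e-4):
    """Flag each frame as speech (True) or non-speech (False)
    using only short-time energy -- a naive VAD sketch."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len : (i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))
        flags.append(energy > threshold)
    return flags

# Toy input: 0.5 s of near-silence followed by 0.5 s of a loud tone.
sr = 16000
silence = 0.001 * np.random.randn(sr // 2)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr // 2) / sr)
print(energy_vad(np.concatenate([silence, tone]), sr))
# -> False for the silent frames, True for the tone frames
```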
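Since the note names Whisper as the ASR model, here is a minimal usage sketch of the open-source openai-whisper package, assuming it is installed; speech.wav is again a hypothetical input file.

```python
import whisper  # the openai-whisper package

# Load a small pretrained checkpoint and transcribe a file to text.
model = whisper.load_model("base")
result = model.transcribe("speech.wav")  # hypothetical input file
print(result["text"])
```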
Acoustic modeling
A fundamental component of traditional ASR systems. It analyzes audio signals to identify phonetic units.

Metrics

(Figure: Audio 4)

BLEU score (Bilingual Evaluation Understudy)
A fairly complex metric used mainly in NLP, but since Whisper's architecture is based on the transformer, it applies here as well. It consists of several parts, such as the n-gram calculation (a small worked example follows below): …
https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b

(Figure: Audio 5)
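As a sketch of the n-gram calculation the note points to, here is BLEU computed with NLTK on an invented reference/hypothesis pair; both sentences and the quoted score are purely illustrative.

```python
from nltk.translate.bleu_score import sentence_bleu

# One reference translation (BLEU allows several) and one hypothesis.
reference = [["the", "cat", "sat", "on", "the", "mat"]]
hypothesis = ["the", "cat", "sat", "on", "a", "mat"]

# Default BLEU: geometric mean of modified 1- to 4-gram precisions,
# times a brevity penalty (1.0 here, since the lengths match).
score = sentence_bleu(reference, hypothesis)
print(score)  # roughly 0.54 for this pair
```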