Speech Representation
Speech Language Processing
Faculty of Computer Science, Universitas Indonesia
Dr. Kurniawati Azizah, S.T., M.Phil.
Odd Semester 2024/2025

References
➜ Spectrogram, Cepstrum, and Mel-Frequency Analysis – Kishore Prahallad
➜ Acoustic Modeling, Feature Extraction, HMM-DNN models – Dan Jurafsky
➜ https://speechprocessingbook.aalto.fi/
➜ https://www.audiolabs-erlangen.de/resources/MIR/FMP/C3/C3S1_SpecLogFreq-Chromagram.html

Speech representation: overview
➜ Speech block processing
➜ Spectrogram
➜ Cepstrum
➜ Mel-spectrogram, mel-frequency cepstral coefficients (MFCC)
➜ Chromagram
➜ Zero crossing rate

Speech Block Processing
The speech signal captured by a microphone is converted into digital form by an Analog-to-Digital Converter (ADC). It can be stored in a speech file consisting of a long sequence of sample values (quantised in value).
➜ When read completely into memory, the file yields one long sequence of samples.
➜ The algorithms in this course all operate on quasi-stationary speech segments called blocks or frames.
➜ Recall that the frame size is a compromise between:
   ✘ having enough data to accurately measure the quantities of interest, and
   ✘ keeping the quasi-stationarity assumption appropriate.
We also need frames that are frequent enough to fully represent the signal (especially important when windowing is applied).

Block Processing
To accommodate the above, let frames overlap in time, so that block processing is characterised by:
➜ Frame size or window length ($N_F$): the number of samples per frame (alternatively, the number of seconds of speech per frame, $N_F T_s$).
➜ Frame shift or hop length ($N_H$): the number of samples between the starts of successive frames (or, in seconds, $N_H T_s$).
The frame shift is sometimes characterised instead as a frame rate ($f_r = f_s / N_H$): the number of frames per second.

Example: if $T_s = 100\,\mu\text{s}$ (0.1 ms), $N_F = 256$ and $N_H = 100$, then in seconds:
➜ the frame size is 25.6 ms (= 256 × 0.1 ms),
➜ the frame shift is 10 ms (= 100 × 0.1 ms),
and
$$f_r = \frac{1}{N_H T_s} = \frac{1}{100 \times 0.1\,\text{ms}} = \frac{1}{0.01\,\text{s}} = 100\,\text{Hz}$$
Hence the frame rate $f_r$ is 100 Hz.
[Figure: overlapping frames 0, 1, 2, ... along the time axis, annotated with the frame size $N_F$ and the hop length $N_H$]

How many frames are there?
A signal of $N$ samples, with frame length $N_F \le N$ and hop length $N_H$, produces $K$ full frames, where
$$K = 1 + \left\lfloor \frac{N - N_F}{N_H} \right\rfloor \quad \text{or, in duration,} \quad K = 1 + \left\lfloor \frac{N T_s - N_F T_s}{N_H T_s} \right\rfloor$$
The "extra" 1+ comes from the fact that frame k = 0 does not involve a step by the hop length. Without this extra 1, the result would not agree with the following "extreme" cases:
➜ What if $N_F = N$, so that the frame length is exactly the same as the signal length? In this case there should be only 1 frame. The hop length does not matter here, because any index offset other than 0 would push the frame off the end of the input array, and we would not have a full frame.
➜ What if $N_F = N_H = 1$? In this case each sample is a frame by itself, so we should have $N$ frames (one per sample).
➜ What if $N_F = 1$ and $N_H = 2$? In this case we are effectively decimating the signal by a factor of 2 (taking every other sample), so we should have $N/2$ frames if $N$ is even, or $(N+1)/2$ if $N$ is odd.

Frame to Time Conversion
Combining these two quantities, the nth sample of the kth frame is given by
$$y[k, n] = x[k N_H + n] \quad \text{for } n = 0, 1, 2, \dots, N_F - 1$$
$$y_k = x[k N_H : k N_H + N_F] \quad \text{(end index exclusive)}$$
where y is a two-dimensional array representing the framed signal: the first index k selects the frame, and the second index n selects the sample within the frame. The kth frame is therefore the slice of the signal from sample index $k N_H$ to $k N_H + N_F - 1$.
Using some dimensional analysis, we can convert frame indices to time as well:
$$k\,[\text{frames}] \times N_H \left[\frac{\text{samples}}{\text{frame}}\right] \times T_s \left[\frac{\text{seconds}}{\text{sample}}\right] = k N_H T_s\,[\text{seconds}] = k \frac{N_H}{f_s}\,[\text{seconds}]$$
We can observe that $N_H / f_s$ can be interpreted as the (time) period between frames.
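The framing rules above (frame count, frame slicing and frame-to-time conversion) can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and it assumes that an incomplete trailing frame is simply dropped, in line with the "full frame" remark above.

```python
def frame_count(n_samples, frame_len, hop_len):
    """Number of full frames: K = 1 + floor((N - N_F) / N_H)."""
    if n_samples < frame_len:
        return 0  # assumption: no full frame fits, so no frame is produced
    return 1 + (n_samples - frame_len) // hop_len


def frame_signal(x, frame_len, hop_len):
    """Return the full frames y[k] = x[k*N_H : k*N_H + N_F] as a list of lists."""
    k_frames = frame_count(len(x), frame_len, hop_len)
    return [x[k * hop_len:k * hop_len + frame_len] for k in range(k_frames)]


def frame_start_time(k, hop_len, fs):
    """Start time of frame k in seconds: k * N_H / f_s."""
    return k * hop_len / fs


# Toy signal with N = 10 samples, N_F = 4, N_H = 2.
x = list(range(10))
frames = frame_signal(x, frame_len=4, hop_len=2)

# Slide example: T_s = 100 microseconds (f_s = 10 kHz), N_H = 100.
fs = 10_000
hop = 100
frame_rate = fs / hop                      # frames per second
second_frame_start = frame_start_time(1, hop, fs)
```

With the toy signal, each frame starts two samples after the previous one and overlaps it by two samples.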
The reciprocal of the frame period $N_H / f_s$, namely $f_s / N_H$, gives the number of frames per second, a quantity known as the frame rate.

Spectrogram
Each speech signal frame x[n] passes through the following steps:
➜ Pre-emphasis filter:
   o balances the frequency spectrum, since high frequencies usually have smaller magnitudes than low frequencies,
   o avoids numerical problems during Fourier transform operations,
   o can increase the Signal-to-Noise Ratio (SNR).
➜ Windowing: overcomes the discontinuities caused by segmenting a signal into frames, which would otherwise distort the spectrum.
   o Window shape: Hamming, Hann, etc.
➜ FFT: an algorithm for computing the DFT, which converts the signal from the time domain to the frequency domain.

Building the spectrogram
[Figure: amplitude-versus-time waveform split into frames; each frame is windowed and passed through an FFT to give one spectrum]
➜ Apply windowing and an FFT to every frame. Each FFT yields a spectrum (amplitude versus frequency in Hz).
➜ Rotate each spectrum by 90 degrees.
➜ Map spectral amplitude to a grey-level value (0-255), where 0 represents black and 255 represents white.
The higher the amplitude, the darker the corresponding region.
➜ Place the grey-level columns of all frames side by side in time order to obtain the spectrogram.
[Figure: spectrogram with time on the horizontal axis and frequency (Hz) on the vertical axis]

Usefulness of spectrogram
➜ Dark regions indicate peaks (formants) in the spectrum.
➜ Phones and their properties are better observed in a spectrogram.
➜ Speech sounds can be identified much better by the formants and by their transitions.
➜ ASR models implicitly model these spectrograms to perform speech recognition.
In summary:
➜ The spectrogram is a time-frequency representation of the speech signal.
➜ It is a tool for studying speech sounds (phones): phones and their properties are studied visually by phoneticians.
➜ ASR models implicitly model spectrograms in speech-to-text systems.
➜ It is useful for evaluating text-to-speech systems: a high-quality text-to-speech system should produce synthesized speech whose spectrograms nearly match those of the natural sentences.

Speech spectrum
[Figure: log-spectrum of a speech frame]
➜ Peaks denote dominant frequency components in the speech signal.
➜ Peaks are referred to as formants.
➜ Formants carry the identity of the sound.

Spectral envelope
➜ What we want to extract: the formants and a smooth curve connecting them across the speech spectrum.
➜ The smooth curve is referred to as the spectral envelope.
Our goal: separate the spectral envelope and the spectral details from the spectrum. That is, given log X[k], obtain log H[k] and log E[k] such that
log X[k] = log H[k] + log E[k]
where log X[k] is the log-spectrum, log H[k] is the spectral envelope and log E[k] is the spectral details.
How can this separation be achieved? By playing a mathematical trick.

Extracting spectral envelope
Trick: take the FFT of the spectrum!
➜ An FFT applied to a spectrum is referred to as an Inverse FFT (IFFT).
➜ Note: we are dealing with the spectrum in the log domain (part of the trick).
➜ The IFFT of the log spectrum represents the signal on a pseudo-frequency axis.
[Figure: log X[k] = log H[k] + log E[k], with the low-frequency and high-frequency regions marked on the pseudo-frequency axis]
➜ The spectral envelope log H[k] varies slowly. Treat it as a sine wave with 4 cycles per second: its IFFT gives a peak at 4 Hz on the pseudo-frequency axis, in the low-frequency region.
➜ The spectral details log E[k] vary rapidly. Treat them as a sine wave with 100 cycles per second: their IFFT gives a peak at 100 Hz on the pseudo-frequency axis, in the high-frequency region.
[Figure: the IFFT applied to log X[k] = log H[k] + log E[k]; the spectral envelope and the spectral details fall in separate regions of the pseudo-frequency axis]
➜ In practice, all you have access to is log X[k], from which you can obtain x[k] by taking the IFFT.
➜ Once you know x[k], filter its low-frequency region to get h[k].
➜ x[k] is referred to as the cepstrum.
➜ h[k] is obtained by considering the low-frequency region of x[k].
➜ h[k] represents the spectral envelope and is widely used as a feature for speech recognition.

Mel-scale
We have captured the spectral envelope, but experiments show that the human ear concentrates on certain regions rather than using the whole spectral envelope.
The human ear is less sensitive at higher frequencies, roughly above 1000 Hz.

Mel-frequency analysis
➜ Mel-frequency analysis of speech is based on human perception experiments.
➜ It is observed that the human ear acts as a filter:
   o it concentrates on only certain frequency components.
➜ These filters are non-uniformly spaced on the frequency axis:
   o more filters in the low-frequency region,
   o fewer filters in the high-frequency region.

Mel-scale filterbanks
[Figure: mel-scale filterbank, with more filters in the low-frequency region and fewer filters in the high-frequency region]
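The non-uniform filter spacing can be illustrated with a short sketch. The slides do not give a formula for the mel scale, so the conversion below, m = 2595 log10(1 + f / 700), is an assumption: it is one commonly used variant, and the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    # Assumed mel formula (one common variant), not taken from the slides.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse of the assumed formula above.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Centre frequencies in Hz that are equally spaced on the mel axis."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + i * step) for i in range(1, n_filters + 1)]


centers = mel_center_frequencies(n_filters=10, f_min=0.0, f_max=8000.0)
# Distance in Hz between neighbouring filter centres.
gaps = [b - a for a, b in zip(centers, centers[1:])]
```

Because the centres are equally spaced in mel, the gaps in Hz grow with frequency: the filters sit close together at low frequencies and spread out at high frequencies.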
Mel-spectrogram
Spectrum -> Mel-Filters -> Mel-Spectrum
Now perform the operations that produce a spectrogram:
   o Rotate the mel-spectrum by 90 degrees.
   o Map it to grayscale values (0-255).
   o Do the same for every frame.
The result is referred to as the mel-spectrogram.

Mel-Spectrogram
[Figure: speech signal frame x[n] -> spectrum -> mel filterbank and normalization -> mel-spectrum]
➜ The mel scale is the result of a perception study of sound, and it is in line with human sensitivity to sound at different frequencies.
➜ The number of mel filters is generally 80 or 40.
➜ Normalization: to balance the spectrum.

[Figure: the FFT spectrum of every frame passes through the mel filters to give a mel-spectrum; all mel-spectra together form the mel-spectrogram]
➜ Rotate each mel-spectrum by 90 degrees.
➜ Map spectral amplitude to a yellow-level value (0-255). The higher the amplitude, the lighter the corresponding region.
➜ Plot the mel-spectrum for all frames to obtain the mel-spectrogram.

Mel-Frequency Cepstral Coefficients (MFCC)
Spectrum -> Mel-Filters -> Mel-Spectrum
Say log X[k] = log(Mel-Spectrum). Now perform spectral envelope extraction on log X[k]:
   o log X[k] = log H[k] + log E[k]
   o Take the IFFT.
   o x[k] = h[k] + e[k]
The cepstral coefficients h[k] obtained for the mel-spectrum are referred to as Mel-Frequency Cepstral Coefficients, often denoted by MFCC.

MFCC
[Figure: the FFT spectrum of one frame -> mel filters -> mel-spectrum -> cepstral analysis -> cepstral vector]
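The cepstral-analysis step (IFFT of the log spectrum, then keeping the low region h[k]) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: it works on the plain log-magnitude spectrum of a synthetic frame and omits the mel filterbank, it uses a naive O(N^2) DFT for self-containment, and all names (`real_cepstrum`, `lifter_low`, `roughness`, `make_frame`) are hypothetical.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def real_cepstrum(frame, eps=1e-12):
    """Return (x, log_mag): x is the IFFT of the log-magnitude spectrum."""
    log_mag = [math.log(abs(v) + eps) for v in dft(frame)]
    return [c.real for c in idft(log_mag)], log_mag


def lifter_low(cepstrum, n_keep):
    """Keep the low region (indices 0..n_keep-1 and their mirror), zero the rest."""
    n = len(cepstrum)
    return [c if (q < n_keep or q > n - n_keep) else 0.0
            for q, c in enumerate(cepstrum)]


def roughness(v):
    """Energy of the circular second difference; smaller means smoother."""
    n = len(v)
    return sum((v[(i + 1) % n] - 2.0 * v[i] + v[i - 1]) ** 2 for i in range(n))


def make_frame(n=64, period=8):
    """Synthetic frame: a pulse train (fine detail) shaped by a decaying resonance."""
    h = [(0.8 ** i) * math.cos(0.6 * i) for i in range(n)]
    frame = [0.0] * n
    for start in range(0, n, period):
        for i in range(n - start):
            frame[start + i] += h[i]
    return frame


frame = make_frame()
cep, log_mag = real_cepstrum(frame)               # x[k]
low = lifter_low(cep, n_keep=6)                   # h[k]: low region
high = [c - l for c, l in zip(cep, low)]          # e[k]: remaining detail
envelope = [v.real for v in dft(low)]             # back to the log-spectrum domain
```

Transforming `low` back gives a smoothed version of `log_mag`, which plays the role of log H[k]. Practical MFCC pipelines often use a DCT in place of the IFFT shown here.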
[Figure: the FFT spectrum of every frame passes through the mel filters and cepstral analysis, giving a sequence of cepstral vectors]

Usefulness of MFCC
➜ Speech synthesis:
   o Used for joining two speech segments S1 and S2.
   o Represent S1 as a sequence of MFCCs.
   o Represent S2 as a sequence of MFCCs.
   o Join at the point where the MFCCs of S1 and S2 have minimal Euclidean distance.
➜ Speech recognition:
   o MFCCs are the most commonly used features in state-of-the-art speech recognition systems.

Chroma
Assuming the equal-tempered scale, one considers twelve chroma values represented by the set {C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B}, which consists of the twelve pitch spelling attributes used in Western music notation.

Octave
One octave contains the twelve pitches {C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B}.
[Figure: piano keyboard spanning two octaves, with white keys C D E F G A B and black keys C# D# F# G# A#]
The human perception of pitch is periodic in the sense that two pitches are perceived as similar in "color" if they differ by an octave.

Pitch class
➜ A pitch class is the set of all pitches that share the same chroma, consisting of all pitches separated by an integer number of octaves, for example C = {…, C-2, C-1, C0, C1, C2, C3, ...}.
➜ We want to represent each chroma with one coefficient: this is the chromagram.
➜ A total of 128 pitches (spanning just over 10 octaves) need to be considered.
➜ 128 pitches -> 12 chromas -> 12 coefficients

Chromagram
[Figure: amplitude-versus-time waveform split into frames; an FFT of each frame gives a spectrum (amplitude versus Hz)]
➜ Rotate each spectrum by 90 degrees.
➜ For each chroma, consider only the values of those of the 128 pitches p for which p % 12 == chroma.
➜ Sum all the values that belong to that chroma, and repeat for every chroma.
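The folding rule p % 12 == chroma can be written out directly. The sketch below assumes that a 128-value pitch vector is already available for one frame (how the spectrum is mapped onto pitches is outside its scope) and that pitches follow MIDI numbering, in which pitch 60 is a C and pitch 69 is an A. The function name is hypothetical.

```python
CHROMA_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_to_chroma(pitch_energy):
    """Fold a pitch vector into 12 chroma sums: pitch p contributes to bin p % 12."""
    chroma = [0.0] * 12
    for p, value in enumerate(pitch_energy):
        chroma[p % 12] += value
    return chroma


# Toy frame: energy at two C pitches an octave apart and at one A.
pitch_energy = [0.0] * 128
pitch_energy[60] = 1.0   # C (MIDI numbering, an assumption of this sketch)
pitch_energy[72] = 0.5   # C one octave higher
pitch_energy[69] = 2.0   # A

chroma = pitch_to_chroma(pitch_energy)
strongest = CHROMA_NAMES[chroma.index(max(chroma))]
```

The two C pitches land in the same bin, which is exactly the octave equivalence described for pitch classes. Repeating this for every frame and stacking the 12-value vectors over time gives the chromagram.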
[Figure: chromagram, with time on the horizontal axis and chroma on the vertical axis]

Usefulness of chromagram
➜ Used in speech recognition:
   o Speech classification
   o Cover song identification
   o Audio matching
   o Plagiarism detection

Zero crossing rate
➜ A very simple way of measuring the smoothness of a signal.
➜ For example, voiced speech sounds are smoother than unvoiced ones.
➜ It counts how often the speech signal changes sign, from positive to negative or vice versa, within each analysis window, where M is the step between analysis windows and N is the analysis window length.

[Figure: zero crossing rate example]

Usefulness of zero crossing rate
➜ Used in speech recognition:
   o Classify percussive sounds
   o Detect whether human speech is present or not
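A minimal sketch of the zero crossing rate follows. The normalisation is an assumption, since the transcript does not preserve the formula: here the rate is the fraction of adjacent sample pairs in a window whose signs differ. The parameters `win_len` and `step` stand for N and M above, and the function name is hypothetical.

```python
import math


def zero_crossing_rate(x, win_len, step):
    """Per-window fraction of adjacent sample pairs whose signs differ."""
    rates = []
    for start in range(0, len(x) - win_len + 1, step):
        w = x[start:start + win_len]
        # A crossing is counted when one sample is >= 0 and its neighbour is < 0.
        crossings = sum(1 for a, b in zip(w, w[1:]) if (a >= 0) != (b >= 0))
        rates.append(crossings / (win_len - 1))
    return rates


# Smooth signal: two cycles of a sine wave over 100 samples.
smooth = [math.sin(2 * math.pi * 2 * n / 100 + 0.1) for n in range(100)]
# Rough signal: the sign flips at every sample.
rough = [(-1) ** n for n in range(100)]

smooth_rate = zero_crossing_rate(smooth, win_len=100, step=100)[0]
rough_rate = zero_crossing_rate(rough, win_len=100, step=100)[0]
windowed = zero_crossing_rate(smooth, win_len=50, step=25)
```

The smooth signal gives a rate close to zero, while the alternating signal gives the maximum rate. That contrast is what makes the measure usable for separating voiced from unvoiced segments.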