Voiceprints: Acoustic Processing of Speech
Summary
This document presents an overview of the acoustic processing of speech signals, covering signal analysis, feature extraction, Fourier analysis, linear predictive coding (LPC), spectral analysis, human voiceprints, sound waves, and the interpretation of waveforms. It appears to be a set of lecture notes for a course on speech processing.
Full Transcript
Introduction to the acoustic processing of speech signals (the basis of speech recognition by computers):
· Signal Analysis
· Feature Extraction
· Fourier Analysis and Linear Predictive Coding (LPC)
· Spectral Analysis and Spectra: Human Voiceprints
· Sound Waves
· Interpreting a Waveform
· Some Things to Do

Major Topics of this Course
· Language Structure: words, syntax, grammars, parsers
· Language Meaning: semantics, discourse structures
· Dialogue Agents and Machine Translation: translation engines, conversation agents
· Processing of Speech Signals: speech recognition, speech synthesis, text-to-speech systems
· Tools and Techniques: Prolog/Java, data structures, software (Java/XML), VoiceXML and voice-activated applications and user interfaces
· Applications

Acoustic Processing of Speech

This lecture presents a brief overview of the kind of acoustic processing commonly called signal analysis or feature extraction. The term "features" refers to the vector of numbers which represents one time slice of a speech signal. A number of kinds of features are commonly used, e.g. LPC features. These are spectral features, which means that they represent the waveform in terms of the distribution of the different frequencies that make up the waveform. Such a distribution of frequencies is called a spectrum. In this lecture we summarise the idea of frequency analysis and spectra, and sketch out different kinds of extracted features.

Sound Waves

The input to a speech recogniser, like the input to the human ear, is a complex series of changes in air pressure. These changes in air pressure originate with the speaker, and are caused by the specific way that air passes through the glottis and out of the oral or nasal cavities. We represent sound waves by plotting the change in air pressure over time. One way of visualising this is to imagine a vertical plate which is blocking the air pressure waves, i.e. a microphone in front of the speaker, or the eardrum of the hearer. The graph measures the amount of compression of the air molecules at this plate. The diagram following, from the course textbook by Jurafsky and Martin, shows the waveform taken from a corpus of telephone speech of someone saying "she just had a baby".

[Waveform diagram: time axis from 0.47 s to 0.56 s, with 28 cycles numbered 1, 2, 3, ..., 28]

Two important characteristics of a wave are its frequency and its amplitude. The frequency is the number of times a second that a wave repeats itself, or more technically, cycles:

frequency = repetitions / time period

In the diagram there are 28 repetitions of the wave in the 0.11 seconds captured. Therefore, the frequency of this segment of the wave is 28/0.11, or approximately 255 cycles per second. Cycles per second are called Hertz (Hz for short), so the frequency in the diagram is 255 Hz.

The vertical axis in the diagram measures the amount of air pressure variation. A high value on the vertical axis (a high amplitude) indicates that there is more air pressure at that point in time, a zero value means normal (atmospheric) pressure, and a negative value means lower than normal air pressure.
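The frequency arithmetic above is easy to reproduce in code. Below is a minimal Python sketch (NumPy assumed; the 8,000 Hz sampling rate is an illustrative choice, not something fixed by the notes) that synthesises a 255 Hz wave over 0.11 seconds and recovers its frequency by counting cycles, exactly as the formula frequency = repetitions / time period suggests:

```python
import numpy as np

fs = 8000                            # assumed sampling rate (samples/second)
duration = 0.11                      # seconds, as in the diagram
t = np.arange(int(fs * duration)) / fs
wave = np.sin(2 * np.pi * 255 * t)   # a pure 255 Hz tone

# Each full cycle contains exactly one rising zero crossing, so counting
# them and dividing by the duration recovers the frequency.
cycles = np.sum((wave[:-1] < 0) & (wave[1:] >= 0))
print(cycles, cycles / duration)     # 28 cycles -> about 255 Hz
```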
Two important perceptual properties are related to frequency and amplitude. The pitch of a sound is the perceptual correlate of frequency: in general, if a sound has a higher frequency we perceive it as having a higher pitch, but the relationship is not linear, since human hearing has different acuities for different frequencies. Similarly, the loudness of a sound is the perceptual correlate of power, which is related to the square of the amplitude. Sounds with higher amplitude are perceived as louder, but again, the relationship is not linear.

How to Interpret a Waveform

Since humans (and computers) can transcribe and understand speech given just the sound wave, the waveform must contain enough information to make this task possible. In most cases this information is hard to unlock just by looking at the waveform, but we can still learn many things by a visual inspection. For example, the difference between vowels and consonants of spoken language is quite clear on a waveform. Vowels tend to be long and relatively loud: their loudness manifests itself as high amplitude, and their length as extent along the time axis. Fricatives such as [sh] can also be recognised in a waveform: they produce an intense irregular pattern. The diagram shows a waveform produced by a 20-year-old female speaker with an accent from the south midlands of the USA.

[Waveform labelled with the phone sequence: sh iy j ax s hh ae dx ax b ey b]

[Diagrams: waveforms of "Say hid twice", "Say had twice", etc.; spectrograms produced with the CoolEdit software and with the Gram software (Gram30, Gram50)]

Spectra

While some broad phonetic features can be interpreted from a waveform, more detailed classification requires a different representation of the input in terms of spectral features. Spectral features are based on the insight of Fourier that every complex wave can be interpreted as a sum of many simple waves of different frequencies. A musical analogy is a chord: just as a chord is composed of multiple notes, any waveform is composed of the waves corresponding to its individual "notes".

[Waveform diagram: part of the vowel [ae], y-axis from -2000 to 2000, time axis from 0.905 s to 0.940 s]

The diagram shows part of the waveform for the vowel [ae] (in American speech) of the word "had" at second 0.9 of the sentence "she just had a baby". Note that there is a complex wave which repeats about 9 times in the diagram, and also a smaller repeated wave which repeats 4 times for every larger pattern (look at the 4 small peaks inside each repetition of the larger wave).
· The complex wave has a frequency of about 250 Hz. We can figure this out since it repeats about 9 times in 0.036 seconds, giving 9 cycles / 0.036 s = 250 Hz.
· The smaller wave should then have a frequency of approximately 4 times the frequency of the larger wave, i.e. about 1000 Hz.
· If you look very closely, you can also see two little lines on the peak of many of the 1000 Hz waves. The frequency of this tiny wave must be approximately 2 times that of the 1000 Hz wave, i.e. about 2000 Hz.

A spectrum is a representation of these different frequency components of a wave. It can be computed by a Fourier transform, a mathematical procedure which separates out each of the frequency components of a wave. Many speech applications instead use an LPC (Linear Predictive Coding) spectrum, because this makes it easier to see where the peaks are. The next diagram shows an LPC spectrum for the vowel [ae] of "she just had a baby" at the point in time shown in the previous diagram; LPC spectra make it easier to see formants.
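To make the idea of a spectrum concrete, here is a small Python sketch (NumPy assumed; the sampling rate and the component amplitudes are illustrative, not taken from the notes) that builds a stand-in for the [ae] wave described above, with a 250 Hz fundamental plus weaker 1000 Hz and 2000 Hz components, and then uses a Fourier transform to separate those components back out:

```python
import numpy as np

fs = 8000.0                                    # assumed sampling rate (Hz)
t = np.arange(288) / fs                        # 288 samples = the 0.036 s span
wave = (1.00 * np.sin(2 * np.pi * 250 * t)     # the "larger" complex wave
        + 0.50 * np.sin(2 * np.pi * 1000 * t)  # smaller wave riding on it
        + 0.25 * np.sin(2 * np.pi * 2000 * t)) # tiny wave on the peaks

magnitudes = np.abs(np.fft.rfft(wave))           # Fourier transform
freqs = np.fft.rfftfreq(len(wave), d=1.0 / fs)   # frequency of each bin
top3 = freqs[np.argsort(magnitudes)[-3:]]        # three strongest components
print(sorted(top3.round()))                      # [250.0, 1000.0, 2000.0]
```

The three peaks the transform finds are exactly the three component waves we mixed in, which is the sense in which a spectrum reveals the "notes" of the chord.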
The x-axis of a spectrum shows frequency, while the y-axis shows some measure of the magnitude of each frequency component, in decibels (dB), a logarithmic measure of amplitude. The diagram shows that there are important frequency components at 930 Hz, 1860 Hz, and 3020 Hz, along with many other lower-magnitude frequency components. The components at approximately 1000 Hz and 2000 Hz are the ones we predicted earlier.

Why is a Spectrum Useful?

It turns out that the spectral peaks that are easily visible in a spectrum are very characteristic of different sounds: phones have characteristic spectral "signatures". For example, chemical elements give off different wavelengths of light when they burn, allowing scientists to detect elements in stars that are light-years away by looking at the spectrum of the light. Similarly, by looking at the spectrum of a waveform, we can detect the characteristic signature of the different phones that are present. This use of spectral information is essential to both human and machine speech recognition.

Spectrogram

While a spectrum shows the frequency components of a wave at one point in time, a spectrogram is a way of visualising how the different frequencies which make up a waveform change over time. In human audition, the function of the cochlea, or inner ear, is to compute a spectrum of the incoming waveform. Similarly, the features that are input to the HMMs used in speech recognition are all representations of spectra. The x-axis of a spectrogram shows time, as it did for the waveform, but the y-axis now shows frequency in Hertz (Hz). The darkness of a point on a spectrogram corresponds to the amplitude of that frequency component at that time.

[Spectrogram labelled with the phone sequence: sh iy j ax s hh ae dx ax b ey b iy]

For example, in the diagram at the point in time of second 0.9, notice the dark bar at around 1000 Hz. This means that the vowel [iy] of the word "she" has an important component around 1000 Hz. The dark horizontal bars on a spectrogram, representing spectral peaks, usually of vowels, are called formants.

What specific clues can spectral representations give for phone identification? First, different vowels have their formants at characteristic places. We have seen that the vowel [ae] in the simple waveform had formants at 930 Hz, 1860 Hz and 3020 Hz. Now consider the vowel [iy] at the beginning of the utterance "she just had a baby" in the diagram following.

[Spectrogram with the first formant (540 Hz) and second formant (2581 Hz) of [iy] marked]

F1 (formant 1) and F2 (formant 2) play a large role in determining vowel identity, although the formants still differ from speaker to speaker. Formants can also be used to identify the nasal phones [m], [n] and [ng], and the liquids [l] and [r]. The spectrum for this particular vowel is shown next. The first formant of [iy] is 540 Hz, much lower than the first formant for [ae], while the second formant at 2581 Hz is much higher than the second formant for [ae]. We can see these as dark bars on the spectrogram diagram.

Why do different vowels have different spectra? The formants are caused by the resonant cavities of the mouth. The oral cavity can be thought of as a filter which selectively passes through some of the harmonics of the vocal cord vibrations. Moving the tongue creates spaces of different sizes in the mouth which selectively amplify waves of the appropriate wavelength, hence amplifying different frequency bands.
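Since a spectrogram is just a sequence of spectra computed over successive short windows of the signal, it is straightforward to sketch one. The following Python sketch (NumPy assumed; the 25 ms window, 10 ms hop and 8,000 Hz sampling rate are conventional illustrative choices, not values from the notes) computes the frame-by-frame magnitude spectra that a spectrogram displays as darkness:

```python
import numpy as np

def spectrogram(signal, fs, frame_ms=25, hop_ms=10):
    """Short-time Fourier transform magnitudes in dB.
    Rows are time frames, columns are frequency bins."""
    frame = int(fs * frame_ms / 1000)    # samples per analysis window
    hop = int(fs * hop_ms / 1000)        # samples between window starts
    window = np.hanning(frame)           # taper to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame) // hop
    spec = np.empty((n_frames, frame // 2 + 1))
    for i in range(n_frames):
        chunk = signal[i * hop:i * hop + frame] * window
        spec[i] = 20 * np.log10(np.abs(np.fft.rfft(chunk)) + 1e-10)
    freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
    return spec, freqs

# A steady 250 Hz tone should give one dark horizontal bar near 250 Hz.
fs = 8000
t = np.arange(fs) / fs                   # one second of signal
spec, freqs = spectrogram(np.sin(2 * np.pi * 250 * t), fs)
print(freqs[np.argmax(spec[0])])         # ~250 (to the nearest 40 Hz bin)
```

In a real spectrogram display, each row of `spec` becomes one vertical slice of the image, with larger dB values drawn darker; the dark horizontal bars that persist across rows are the formants discussed above.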
[Diagram: positions of the tongue for three English vowels: high front [iy], low front [ae] and high back [uw]]

[Table: IPA and ARPAbet symbols for the transcription of English vowels]

[Table: IPA and ARPAbet symbols for the transcription of English consonants]

Feature Extraction

We can now summarise the process of extraction of spectral features, beginning with the sound wave itself and ending with a feature vector. An input sound wave is first digitised. This process of analogue-to-digital conversion has two steps: sampling and quantisation.

A signal is sampled by measuring its amplitude at particular points in time; the sampling rate is the number of samples taken per second. Common sampling rates are 8,000 Hz and 16,000 Hz. In order to measure a wave accurately, it is necessary to have at least two samples in each cycle: one measuring the positive part of the wave and the other measuring the negative part. More than two samples per cycle increases the amplitude accuracy, but fewer than two samples per cycle will cause the frequency of the wave to be missed completely. Therefore, the maximum frequency wave that can be measured is one whose frequency is half the sampling rate (since every cycle needs two samples). This maximum frequency for a given sampling rate is called the Nyquist frequency. Most information in human speech is in frequencies below 10,000 Hz, so a 20,000 Hz sampling rate would be necessary for complete accuracy.

A sampling rate of, for example, 8,000 Hz requires 8,000 amplitude measurements for each second of speech, so it is very important to store amplitude measurements efficiently. They are usually stored as integers, either 8-bit (values from -128 to 127) or 16-bit (values from -32768 to 32767). This process of representing a real-valued number as an integer is called quantisation, because there is a minimum granularity (the quantum size) and all values that are closer together than this quantum size are represented identically. Once a waveform has been digitised, it is converted to some set of spectral features.

TO DO THIS WEEK

1. Read chapter 7 of the Jurafsky and Martin textbook.

Then, using the software supplied under /.../misc/acoustic-resources/ (and freely downloadable from the web):

1. Make separate recordings of your own voice saying each of the following utterances:
   a. "Say hod twice"
   b. "Say hood twice"
   c. "Say hide twice"
   d. "Say who'd twice"
   e. "Say hoed twice"
   f. "Say hawed twice"
   g. "Say had twice"
   h. "Say head twice"
   i. "Say hayed twice"
   j. "Say hid twice"
   k. "Say heed twice"
2. Save each of these in RAW format.
3. Play each sound as a waveform and capture the screen image.
4. Play each sound back as a spectrogram and capture the screen image.
5. Identify the point where the vowel starts and ends in each.
6. Identify the formant frequencies in Hz for F1, F2, F3, F4 and F5 (see the code sketch below for one way to estimate these).
7. Compare your results with at least one male and one female in class for each of the utterances, and discuss how your spectrogram voiceprints differ in each individual case.

Email these to me before the next lecture. This counts as CA.
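As a starting point for step 6, here is a minimal Python sketch of automatic formant estimation using the LPC technique described earlier. It assumes NumPy, a headerless 16-bit little-endian mono RAW file recorded at 8,000 Hz, and a hypothetical filename heed.raw; none of these details come from the notes, so adjust them to match your own recordings, and treat the output only as a cross-check against the formants you read off the spectrogram by eye:

```python
import numpy as np

def lpc(frame, order):
    """LPC coefficients by the autocorrelation method
    (Levinson-Durbin recursion); returns a with a[0] = 1."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a.copy()
        for j in range(1, i + 1):
            a[j] = prev[j] + k * prev[i - j]
        err *= 1.0 - k * k
    return a

def formants(frame, fs, order):
    """Candidate formant frequencies (Hz) from the LPC polynomial roots."""
    a = lpc(frame * np.hamming(len(frame)), order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)   # pole angle -> frequency
    bw = -np.log(np.abs(roots)) * fs / np.pi     # pole radius -> bandwidth
    return np.sort(freqs[(freqs > 90) & (bw < 400)])  # drop implausible poles

# Assumed recording format: headerless 16-bit little-endian mono PCM at 8 kHz.
fs = 8000
sig = np.fromfile("heed.raw", dtype="<i2").astype(np.float64)  # hypothetical file

mid = len(sig) // 2                          # crude: assume the vowel is mid-file
frame = sig[mid - 160:mid + 160]             # a 40 ms frame
frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])    # pre-emphasis
print(formants(frame, fs, order=2 + fs // 1000))  # candidate F1, F2, ... in Hz
```

For the comparison in step 7, running this on the same frame of a male and a female recording of the same utterance should show the formant positions shifting even though the vowel is the same, which is exactly the speaker-to-speaker variation the notes mention.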