3a asr-earlymodel-en.pdf
Automatic Speech Recognition
Speech Language Processing
Faculty of Computer Science, Universitas Indonesia
Dr. Kurniawati Azizah, S.T., M.Phil.
Odd Semester (Semester Gasal) 2024/2025

References
➜ ASR – Bill Byrne
➜ ASR: Noisy Channel, HMMs, Evaluation – Dan Jurafsky
➜ End-to-end neural network speech recognition – Andrew Maas
➜ Foundation models and SpeechBrain training – Andrew Maas
➜ Speech in healthcare – Frank Rudzicz
➜ https://lorenlugosch.github.io/posts/2020/11/transducer/
➜ https://distill.pub/2017/ctc/

Automatic speech recognition: overview
➜ ASR task
➜ ASR history
➜ Noisy channel model and ASR architecture
➜ ASR with Hidden Markov Models
o HMM-GMM ASR model
o Basic problems in the HMM ASR model

What is speech recognition?
The following distinction is usually made:
➜ Recognition
Identification of the words in an utterance (speech to orthographic transcription).
➜ Understanding
Identification of utterance meaning.
This course deals only with speech recognition. How far can you go in building ASR (automatic speech recognition) systems without understanding? (A long way...)

Why is it difficult?
Speech is a complex combination of information from different levels (discourse, semantics, syntax, phonology, phonetics, acoustics) that is used to convey a message. The signal contains much variability (important difference or noise?).
➜ Intra-speaker
Physical/emotional state, environment, etc.
➜ Inter-speaker
Physiological differences, accent/dialect, etc.
➜ Speaking style
Read/spontaneous, formal/casual.
➜ Acoustic channel
Recording conditions, telephone channel, background speech, noise, etc.

Variability
ASR devices often lump together many of the variability sources. An ASR system needs the means of dealing with (i.e. the capability to model):
➜ Spectral variability
Linear or non-linear effects due to all variability sources.
➜ Timing variability
Mostly non-linear effects; speech can be stretched in a non-linear fashion.
➜ More variation for speaker-independent and continuous speech
The importance of these effects varies with the task.

Task classification
Research tends to focus on making recognition systems more general (“all purpose”): large-vocabulary, speaker-independent, continuous speech recognition systems trained on data from a variety of different sources. Recogniser capabilities can be defined along a number of dimensions:
➜ Speech
➜ Input/Output
➜ Environment
➜ Internal specifications
➜ Linguistic criteria
➜ Platforms

Task classification – Speech and Environment
➜ Types of speech
✘ Mode of speaking
Isolated word, connected, continuous.
✘ Speaker set
Single speaker (dependent), multi-speaker, any speaker (independent). Most speaker-independent systems make assumptions about accent subsets. A modern trend is towards speaker-adaptive systems.
➜ Environment
✘ Noise
Noise free, office, telephone, high-noise (aircraft, factory floor). Fixed noise condition or adaptive to the condition.
✘ Microphone and channel
Close-talking, far-field, telephone channel. Known fixed microphone or variable (adaptation to channel).

Task classification – Linguistic and System Input
➜ Linguistic criteria
✘ Vocabulary
ASR systems can only recognise words they “know”. Vocabulary size determines complexity: very small (≤ 20 words), small (≤ 200 words), medium (≤ 2000 words), large (≤ 64k words), very large (> 64k words).
✘ Syntax
None, finite state (possibly stochastic), context free (possibly stochastic), N-gram.
✘ Languages
Most work is in English. Systems exist in most European languages, Mandarin, Japanese, etc., as well as multi-lingual systems.
➜ System input
✘ Multi-modal interfaces
Combined modelling, lip-reading, gesture, speaker identification.

Task classification – System Output and Internal Representation
➜ System output
✘ Multiple hypotheses
System produces single or multiple (ranked) hypotheses as a list or (preferably) a lattice.
✘ Confidence scores
System produces a confidence level (estimated probability of correctness) for each word in the output.
✘ Rich transcripts
Human-readable transcripts, esp. of spontaneous speech.
➜ Internal representation
✘ Recognition units
Words or sub-words (phones). Phone-based systems are relatively vocabulary-independent and much more easily re-configured.

System Aspects
What makes a “good” ASR system?
➜ Low error rate
Performance of speech recognisers can be measured by comparing the output string with a manually transcribed version (a reference transcript).
➜ User satisfaction
Recognisers form part of a larger system (e.g. a text input system or an enquiry system). Users (customers!) are interested in overall system performance (e.g. transaction time) rather than the raw error rate. It is necessary to integrate the recogniser (and its shortcomings) into the system design so that the system can cope with recognition errors, e.g. by using confirmatory strategies (explicit or implicit) or a user interface that allows correction.

ASR Evaluation
How do we evaluate the word string output by a speech recognizer?
➜ Word error rate (WER)
➜ Character error rate (CER)

Word error rate (WER)
WER = 100 × (substitutions + deletions + insertions) / (number of words in the reference transcript). Note that WER can be > 100%.
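The alignment-based WER just described can be sketched in a few lines of Python. This is a minimal illustration, not the NIST sclite tool: the function names (`edit_ops`, `wer`, `cer`) and the example sentences are invented for this sketch.

```python
def edit_ops(ref, hyp):
    """Count (substitutions, deletions, insertions) in the minimum-cost
    alignment of two token sequences, via dynamic programming."""
    m, n = len(ref), len(hyp)
    # cost[i][j] = minimum edits turning ref[:i] into hyp[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i
    for j in range(1, n + 1):
        cost[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace the table to classify each edit operation.
    i, j, S, D, I = m, n, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            S += ref[i - 1] != hyp[j - 1]   # substitution (or a free match)
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            D += 1                          # deletion: REF word missing from HYP
            i -= 1
        else:
            I += 1                          # insertion: extra word in HYP
            j -= 1
    return S, D, I

def wer(ref_sentence, hyp_sentence):
    """Word error rate in percent: 100 * (S + D + I) / #reference words."""
    ref, hyp = ref_sentence.split(), hyp_sentence.split()
    S, D, I = edit_ops(ref, hyp)
    return 100.0 * (S + D + I) / len(ref)

def cer(ref_sentence, hyp_sentence):
    """Character error rate: the same computation over characters."""
    S, D, I = edit_ops(list(ref_sentence), list(hyp_sentence))
    return 100.0 * (S + D + I) / len(ref_sentence)

print(wer("the cat sat", "the cat sat"))                   # 0.0
print(wer("the cat sat", "a cat sat down"))                # ~66.7: 1 sub + 1 ins over 3 words
print(wer("the cat sat", "uh the big cat sat down here"))  # ~133.3: 4 insertions
```

Because insertions are counted against the reference length, a hypothesis much longer than the reference drives WER past 100%, as the last example shows.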
➜ Doesn’t distinguish between function words (of, they, he, she) and more important content words.
➜ Compute the best alignment of reference and hypothesis to count errors.

Word error rate (WER): computing WER with sclite
Comparing aligned systems allows deeper error analysis.
NIST sctk scoring software: http://www.nist.gov/speech/tools/
sclite aligns a hypothesized text (HYP, from the recognizer) with a correct or reference text (REF, human transcribed).

Sclite output for error analysis – confusion pairs:

CONFUSION PAIRS                  Total                 (972)
                                 With >= 1 occurrances (972)

   1:   6  ->  (%hesitation) ==> on
   2:   6  ->  the ==> that
   3:   5  ->  but ==> that
   4:   4  ->  a ==> the
   5:   4  ->  four ==> for
   6:   4  ->  in ==> and
   7:   4  ->  there ==> that
   8:   3  ->  (%hesitation) ==> and
   9:   3  ->  (%hesitation) ==> the
  10:   3  ->  (a-) ==> i
  11:   3  ->  and ==> i
  12:   3  ->  and ==> in
  13:   3  ->  are ==> there
  14:   3  ->  as ==> is
  15:   3  ->  have ==> that
  16:   3  ->  is ==> this

Better metrics than WER?
WER is useful, but