Lecture 1 - Speech Recognition
Dr. Abeer Saber
Summary
This lecture covers automatic speech recognition (ASR), its components, architecture, challenges, and purpose. The lecture notes explain how speech signals are processed and analyzed for human-computer communication in a variety of applications.
Automatic Speech Recognition
Lecture 1
Dr. Abeer Saber

Automatic speech recognition
Speech is the most natural, efficient and preferred mode of communication between humans. It can therefore be assumed that people are more comfortable using speech as a mode of input for machines than primitive alternatives such as keypads and keyboards.

An automatic speech recognition (ASR) system helps us achieve this goal. Such a system allows a computer to take an audio file, or speech directly from a microphone, as input and convert it into text, preferably in the script of the spoken language. An ideal ASR should be able to "perceive" the given input, "recognize" the spoken words, and then pass the recognized words as input to another machine so that some action can be performed. ASRs are therefore expected to become a principal means of communication between humans and machines.

Human speech and accents vary enormously, and this variation in speech patterns is one of the biggest obstacles to creating an autonomous speech recognition system. Bilingual or multilingual people tend to show more of this variation in their speech patterns than people who speak only one language.

The purpose of creating an ASR is to transcribe any language for any speaker. Languages differ in phonetics, character set, and grammar rules; speakers vary in voice pitch, accent, and personality. Every speaker has a unique voice and speaking style, and on this basis an ASR can be classified into three types.

The architecture of an ASR
The function of an ASR is to take a sound wave as input and convert the spoken speech into text; the input can either be captured directly with a microphone or supplied as an audio file.
This problem can be stated as follows: for a given input sequence X = X1, X2, …, Xn, where n is the length of the input sequence, the function of an ASR is to find the corresponding output sequence Y = Y1, Y2, …, Ym, where m is the length of the output sequence, such that Y has the highest posterior probability P(Y|X). Writing W for a candidate word sequence (the role played by Y above), this posterior can be computed with Bayes' rule:

P(W|X) = P(X|W) P(W) / P(X)

where P(W) is the prior probability of the word sequence occurring, P(X) is the probability of observing the acoustic signal, and P(X|W) is the probability of the acoustic signal X occurring in correspondence with the word sequence W. Since P(X) does not depend on W, the recognizer outputs the W that maximizes P(X|W) P(W).

An ASR can generally be divided into four modules: a pre-processing module, a feature extraction module, a classification model, and a language model, as shown in the following figure.

[Figure: Basic structure of an ASR]

Usually the input given to an ASR is captured using a microphone, which implies that noise may be carried alongside the audio. The goal of preprocessing the audio is therefore to improve the signal-to-noise ratio. Different filters and methods can be applied to a sound signal to reduce the associated noise; framing, normalization, end-point detection and pre-emphasis are among the most frequently used. Preprocessing methods also vary with the algorithm used for feature extraction: certain feature extraction algorithms require a specific type of pre-processing to be applied to their input signal.

After pre-processing, the clean speech signal is passed through the feature extraction module. The performance and efficiency of the classification module depend heavily on the extracted features. There are different methods of extracting features from speech signals.
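Two of the preprocessing steps named above, pre-emphasis and framing, are simple enough to sketch directly. The snippet below is a minimal NumPy illustration; the frame length (25 ms), hop size (10 ms), filter coefficient, and the synthetic test tone are typical illustrative choices, not values prescribed by the lecture.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """High-pass pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len=400, hop=160):
    """Slice the signal into overlapping, Hamming-windowed frames."""
    # Number of frames needed to cover the signal (last frame zero-padded)
    n_frames = 1 + max(0, (len(x) - frame_len + hop - 1) // hop)
    pad = (n_frames - 1) * hop + frame_len - len(x)
    x = np.append(x, np.zeros(pad))
    # Row i holds samples [i*hop, i*hop + frame_len)
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    return x[idx] * np.hamming(frame_len)

# One second of a synthetic 440 Hz tone at 16 kHz stands in for microphone input
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

frames = frame_signal(pre_emphasis(signal))
print(frames.shape)  # (99, 400): one row per 25 ms frame with a 10 ms hop
```

Each row of `frames` would then be handed to a feature extraction method such as MFCC, which applies a Fourier transform and mel-scale filtering per frame.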
Features are usually a predefined number of coefficients or values obtained by applying such methods to the input speech signal. The feature extraction module should be robust to factors such as noise and echo. The most commonly used feature extraction methods are Mel-frequency cepstral coefficients (MFCCs), linear predictive coding (LPC), and the discrete wavelet transform (DWT).

Types of ASR

Automatic speech recognition
Automatic speech recognition (ASR) has been an active research area for over five decades. It has always been considered an important bridge in fostering better human-human and human-machine communication. In the past, however, speech never actually became an important modality in human-machine communication. This is partly because the technology of the time was not good enough to pass the usability bar for most real-world users under most real usage conditions, and partly because in many situations alternative communication modalities such as the keyboard and mouse significantly outperform speech in communication efficiency, restriction, and accuracy.

Components in a typical spoken language system
[Figure: Components in a typical speech-to-speech translation system]
Spoken language systems often include one or more of four major components:
1. A speech recognition component that converts speech into text;
2. A spoken language understanding component that finds semantic information in the spoken words;
3. A text-to-speech component that conveys spoken information;
4. A dialog manager that communicates with applications and with the other three components.
All these components are important to building a successful spoken language system; in this lecture, we focus only on the ASR component.

Virtues of Spoken Language
The Speech Dialog Circle
What Is Speech Processing?
Speech processing is the study of speech signals and their processing methods. The purposes of speech processing are:
- To represent speech for transmission and reproduction
- To improve speech quality and/or intelligibility
- To analyze speech for automatic recognition and extraction of information
- To understand speech as a means of communication
- To discover some physiological characteristics of the speaker

Key challenges with ASR
From a linguistic perspective, there exist many sources of variation:
- Speaker: is the system tuned to a particular speaker, or speaker-independent? Adaptation to speaker characteristics.
- Environment: noise, competing speakers.
- Channel conditions: microphone, phone line, room acoustics.
- Style: continuously spoken or isolated? Or spontaneous conversation?
- Vocabulary: machine-directed commands, scientific language, colloquial expressions.
- Accent/dialect: recognizing the speech of all speakers who speak a particular language.
- Other paralinguistic factors: emotional state, social class.

From a machine learning perspective:
- As a classification problem: a very high-dimensional output space.
- As a sequence-to-sequence problem: very long input sequences (although with limited reordering between the acoustic and word sequences).
- Data is often noisy.
- Very limited quantities of training data are available.
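To make the "very high-dimensional output space" point concrete: with a vocabulary of V words, an utterance of m words has V^m candidate transcriptions. The numbers below are illustrative assumptions, not figures from the lecture.

```python
# Size of the ASR output space: every length-m word sequence over a
# vocabulary of V words is a candidate transcription, giving V**m
# hypotheses, so exhaustive scoring is hopeless and decoders must prune.
V = 20_000   # illustrative vocabulary size
m = 10       # illustrative utterance length in words

hypotheses = V ** m
print(f"{hypotheses:.3e}")  # 1.024e+43 candidate word sequences
```

This is why practical recognizers combine the acoustic and language model scores inside a pruned search (e.g. beam search) rather than enumerating word sequences.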