Transformers for Speech - Ajou University - PDF
Document Details
Ajou University
2024
Sang-Hoon Lee
Summary
This document is the transcript of a lecture on Transformers for speech, covering applications and techniques in artificial intelligence: Transformer networks, large language models, automatic speech recognition, text-to-speech, self-supervised pre-training, and neural codec language models.
Full Transcript
Artificial Intelligence 7. Transformers for Speech, 14 Oct. 2024, Sang-Hoon Lee, Ajou University

Transformer Networks
▪ Feed-forward Neural Networks (no recurrence) with an Attention Module

Large Language Models

Index
▪ Transformer with Speech (Encoder-Decoder): Conventional Speech Applications
✓ Automatic Speech Recognition
✓ Text-to-Speech
▪ Self-supervised Pre-training with Transformers (Encoder-only)
✓ Masked Language Model
✓ Wav2Vec 2.0
✓ HuBERT
▪ LLM with Speech (Decoder-only): Recent Speech Applications
✓ Neural Audio Codec for Speech Tokenization
✓ Neural Codec Language Models for Text-to-Speech
✓ Speech Language Models

Automatic Speech Recognition
▪ Automatic Speech Recognition (ASR): Transcribing the text from speech
[Figure: Speech → ASR → Text transcript ("Hello!")]
▪ Whisper [OpenAI, ASR model]
✓ Architecture: Encoder-Decoder Transformer
▪ The Encoder-Decoder Transformer is primarily used in tasks that require sequence-to-sequence learning
✓ Machine Translation: The encoder processes the source language, and the decoder generates the translation in the target language.
✓ Text Summarization: The model can summarize large bodies of text into shorter, coherent summaries by encoding the input text and decoding it into a condensed version.
✓ Speech Recognition and Synthesis: In automatic speech recognition (ASR) or text-to-speech (TTS) tasks, the model can encode audio signals and decode them into text.
✓ Image Captioning: The encoder processes image features, while the decoder generates a descriptive caption for the image.
▪ Encoder-Decoder Transformer: Cross-Attention (a minimal sketch appears below, after this part)
✓ Operates on two different input sequences: the source and target sequences
✓ Allows the model to focus on different parts of the source sequence when generating each element of the target sequence
✓ K and V come from the source sequence
✓ Q comes from the target sequence
▪ Whisper [OpenAI, ASR model]
✓ Encoder: Encoding the speech
✓ Decoder: Predicting the text tokens
✓ Easy to use (https://github.com/openai/whisper):

import whisper
model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
print(result["text"])

✓ Supports multilingual ASR
✓ Uses 680k hours of audio to train the model

Speech Synthesis
▪ Text-to-Speech (TTS): Generating speech given a text sequence
[Figure: Text sequence ("Hello!") + Speaker Information → TTS → Speech]
▪ Transformer TTS
✓ Encoder: Encoding the text
✓ Decoder: Predicting the next Mel-spectrogram frame and a Stop Token
✓ Mel-spectrogram: speech acoustic feature
✓ Stop Token: signals the end of speech generation, indicating when the model should stop producing audio output (instead of an EOS token)
[Figure: RNN-based TTS vs. Transformer-based TTS]

Self-supervised Pre-training with Transformers
▪ Self-supervised Learning (SSL)
✓ A paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on external labels provided by humans.
✓ SSL tasks are designed so that solving them requires capturing essential features or relationships in the data.
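The cross-attention slide above says that K and V come from the source sequence while Q comes from the target sequence. The following is a minimal sketch of that computation, assuming PyTorch, a single attention head, and illustrative tensor sizes (none of the names or dimensions come from the slides):

import torch
import torch.nn.functional as F

def cross_attention(target, source, d_k=64):
    # Projection weights are created inline only to keep the sketch short.
    W_q = torch.nn.Linear(target.size(-1), d_k, bias=False)  # Q from the target sequence
    W_k = torch.nn.Linear(source.size(-1), d_k, bias=False)  # K from the source sequence
    W_v = torch.nn.Linear(source.size(-1), d_k, bias=False)  # V from the source sequence
    Q, K, V = W_q(target), W_k(source), W_v(source)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # [T_target, T_source]
    weights = F.softmax(scores, dim=-1)             # each target step attends over the source
    return weights @ V                              # [T_target, d_k]

# e.g., 100 encoded speech frames (source) attended to by 12 text tokens (target)
source = torch.randn(100, 256)
target = torch.randn(12, 256)
print(cross_attention(target, source).shape)        # torch.Size([12, 64])

In Whisper, for example, the decoder's text queries attend over the encoder's speech frames in this way, at every decoder layer and with multiple heads.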
▪ Encoder-only Transformer
✓ Bidirectional Encoder-only Transformer models are usually utilized for self-supervised learning
✓ BERT (Bidirectional Encoder Representations from Transformers)

Self-supervised Pre-training with Transformers
▪ Masked Language Models (MLM)
✓ MLM uses unannotated text from a large corpus
✓ The input tokens are randomly replaced with [MASK] tokens
✓ The MLM training objective is to predict the original input for each of the masked tokens using a bidirectional encoder (cross-entropy loss)
▪ Masked Language Models (MLM)
✓ A pre-trained MLM can extract useful contextual embeddings for each token in the input
✓ The hidden representation (contextual embedding or self-supervised representation) contains contextual information
▪ Masked Language Models (MLM): BERT
✓ BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
✓ As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

Wav2Vec 2.0
▪ Goal
✓ Learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods
✓ Masked language modeling with a contrastive loss (replacing the cross-entropy loss)
✓ Wav2vec 2.0 outperforms the previous state of the art on the 100-hour subset while using 100 times less labeled data.
▪ Contrastive Loss
✓ Calculated by comparing the predicted representation for the masked time step with the true representation and distractors
✓ We want the similar vectors to be as close to 1 as possible (similar vector: the true representation from the masked position)
✓ We want the negative examples to be close to 0 (negative samples: the others)
▪ Fine-tuning for ASR
✓ Wav2vec 2.0 outperforms the previous state of the art on the 100-hour subset while using 100x less labeled data
✓ These results showed the large potential of pre-training on unlabeled data for speech processing
✓ The approach is also effective when large amounts of labeled data are available

Self-supervised Speech Representation
▪ Self-supervised speech representation for linguistic information
✓ Wav2Vec 2.0 [A. Baevski, 2020]
✓ NANSY [H.-S. Choi, 2021]: Using a middle layer to extract the linguistic representation (see the sketch below)
✓ XLS-R [A. Babu, 2021]: Wav2Vec 2.0 with a large-scale cross-lingual speech dataset
[Figure: Visualization of intermediate representations of Wav2vec 2.0 using t-SNE, from NANSY [H.-S. Choi, 2021]]
[A. Baevski, 2020] A. Baevski et al., "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," NeurIPS, 2020.
[H.-S. Choi, 2021] H.-S. Choi et al., "Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations," NeurIPS, 2021.
[A. Babu, 2021] A. Babu et al., "XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale," arXiv preprint arXiv:2111.09296, 2021.
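As the NANSY bullet notes, a middle transformer layer of a pre-trained wav2vec 2.0 model can be used as a linguistic representation. A minimal sketch of extracting such hidden states, assuming the Hugging Face transformers library, the facebook/wav2vec2-base checkpoint, and layer 6 as the "middle" layer (all three are illustrative choices, not taken from the slides):

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000)  # 1 second of dummy 16 kHz audio; replace with a real recording
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the projected CNN feature; the remaining entries are the transformer layers.
linguistic = outputs.hidden_states[6]  # [1, T_frames, 768], roughly 50 frames per second
print(linguistic.shape)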
Hierarchical TTS Pipeline
▪ Goal
✓ To bridge the information gap between text and speech, we adopt self-supervised speech representations as additional linguistic representations
▪ Proposed pipeline
✓ Text → Linguistic representation → Acoustic representation → Waveform
[Figure: Conventional TTS pipeline: Text sequence → Acoustic representation (e.g., spectrogram). Hierarchical TTS pipeline: Text sequence → Linguistic representation (self-supervised representation, XLS-R / Wav2vec 2.0) → Acoustic representation (e.g., spectrogram)]

HuBERT
▪ Goal: Discretization of speech with self-supervised learning and K-means
✓ 1. Encode unmasked audio inputs into meaningful continuous latent representations (using convolutional layers)
✓ 2. Predict discrete targets obtained by the K-means algorithm, to capture the long-range temporal relations between learned representations
✓ MFCC (Mel-Frequency Cepstral Coefficient): acoustic features from speech
▪ Speech → Discrete Units
✓ We can use the discrete speech units as tokens for language models

Language Models for Generative Models
▪ Image Tokenization
▪ Language models predict the image tokens

Audio Codec
▪ Audio codec (audio enCOder/DECoder): software that compresses and decompresses a digital audio signal.
✓ MP3, Windows Media Audio (WMA), Dolby Digital, and DTS are examples of popular codecs that compress and decompress digital audio.
✓ The audio codec may also be a hardware circuit.

SoundStream [TASLP, 2021]
▪ Goal
✓ Efficiently compressing speech, music, and general audio
✓ Combining the codec and a neural vocoder: vector quantization + adversarial training (for high-quality waveform generation)
▪ Vector Quantization
✓ Categorizing the samples into similar groups (similar to K-means)
✓ A classic way to do this is to choose template vectors, each representing a typical sound in each environment
✓ To categorize the sounds, you then find the template vector which is closest to your recording
▪ Residual Vector Quantization (RVQ): multi-stage vector quantizer which cascades N_q layers of VQ (sketched below)
✓ 1. The unquantized input vector is passed through a first VQ
✓ 2. Quantization residuals are computed: Residual = Residual - Q_i(Residual)
✓ 3. The residuals are iteratively quantized by a sequence of additional quantizers
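A minimal NumPy sketch of the three RVQ steps just listed, with randomly initialized codebooks (the number of quantizers, codebook size, and dimension are illustrative values, not taken from SoundStream):

import numpy as np

rng = np.random.default_rng(0)
num_quantizers, codebook_size, dim = 8, 1024, 128
codebooks = rng.standard_normal((num_quantizers, codebook_size, dim))

def rvq_encode(x, codebooks):
    # Cascade of VQ stages: each stage quantizes the residual left by the previous one.
    residual = x.copy()
    indices, quantized = [], np.zeros_like(x)
    for cb in codebooks:                                   # N_q layers of VQ
        dists = np.linalg.norm(cb - residual, axis=1)      # distance to every codeword
        idx = int(np.argmin(dists))                        # nearest codeword = Q_i(residual)
        indices.append(idx)
        quantized += cb[idx]
        residual = residual - cb[idx]                      # Residual = Residual - Q_i(Residual)
    return indices, quantized

x = rng.standard_normal(dim)
codes, x_hat = rvq_encode(x, codebooks)
print(codes)                      # 8 integer tokens representing one frame
print(np.linalg.norm(x - x_hat))  # reconstruction error shrinks as stages are added

Each additional stage only has to encode what the previous stages missed, which is why using more codebooks reduces the reconstruction error.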
EnCodec [TMLR, 2023]
▪ Goal
✓ A real-time neural audio compression model that can produce high-fidelity audio samples

Neural Codec Language Models
▪ Goal
✓ Neural codec language models for in-context learning
✓ They use the neural codec's output as input and target tokens, so a language model can be applied to zero-shot text-to-speech, enjoying the effective in-context learning of language models

Speech Quantization
▪ EnCodec
✓ 24 kHz audio → 75 Hz codec frames (hop size of 320)
✓ 8 Residual Vector Quantization (RVQ) codebooks: 75 x 8 tokens per second

Comparison with Mel-spectrogram
▪ Mel-spectrogram: continuous signal regression (80 bins)
▪ Codec: next token prediction
[Figure: Waveform → STFT → Mel filter → Mel-spectrogram]

Codec Language Model
▪ Stage 1: AR (autoregressive) GPT-3-style Transformer decoder
✓ Predicting the first RVQ token
▪ Stage 2: NAR (non-autoregressive) Transformer decoder
✓ Predicting the tokens of the other seven quantizers (1:8)

Inference: In-context Learning via Prompting
▪ Previous method
✓ Extracting a global style embedding from the Mel-spectrogram or waveform signal
[Figure: Speaker Encoder → global vector → conditioning the TTS model]
▪ Prompting method
✓ For language models, prompting is necessary to enable in-context learning in the zero-shot scenario
✓ Prepending the enrolled speech to the sequence of acoustic tokens extracted from the target speaker
[Figure: Prompt followed by Generation]

Results
▪ WER: Word Error Rate
▪ SPK: Speaker Similarity

Limitations
▪ Requires a large-scale dataset to train the model
✓ However, language models have a higher capacity to learn from a large-scale dataset than other methods
▪ Highly dependent on the pre-trained neural audio codec or discrete speech unit
✓ The audio quality of these models is relatively lower than that of end-to-end speech synthesis frameworks
▪ Slow inference speed and lack of robustness, resulting in repeating, skipping, and mispronunciation due to the auto-regressive generative manner

Codec Challenge
▪ To improve the quality and efficiency of neural codecs

Other Neural Codec Papers
▪ DAC: High-Fidelity Audio Compression with Improved RVQGAN [NeurIPS, 2023]
✓ EnCodec + BigVGAN (Snake1D) + improved RVQ (by reducing codebook collapse)
▪ RepCodec: SSL token
▪ SpeechTokenizer: SSL-distilled RVQ
▪ NaturalSpeech 3: factorized codec

UniAudio [ICML, 2024]
▪ Multi-scale Transformer for multiple token prediction
▪ Universal audio generation model

Multiple Codec Prediction Methods

Multi-task Results

Audio Generation
▪ Text-conditional audio generation
✓ Next audio token prediction, given the text conditioning and previous tokens
▪ Example prompts:
✓ "Pop dance track with catchy melodies, tropical percussion, and upbeat rhythms, perfect for the beach"
✓ "A grand orchestral arrangement with thunderous percussion, epic brass fanfares, and soaring strings, creating a cinematic atmosphere fit for a heroic battle."
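The two prompts above are the kind of text conditioning used by open audio generation models such as MusicGen (listed in the conclusion as an open-source application). A minimal usage sketch, assuming the audiocraft package and the facebook/musicgen-small checkpoint; the 8-second duration and output file names are arbitrary choices:

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds of audio per prompt

descriptions = [
    "Pop dance track with catchy melodies, tropical percussion, and upbeat rhythms",
    "A grand orchestral arrangement with thunderous percussion and epic brass fanfares",
]

# Next-audio-token prediction conditioned on the text, then decoded back to waveforms
wavs = model.generate(descriptions)
for i, wav in enumerate(wavs):
    audio_write(f"sample_{i}", wav.cpu(), model.sample_rate, strategy="loudness")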
Conventional Machine Translation
▪ Text tokens → text tokens in another language

Speech-to-Speech Translation
▪ Speech → discrete speech unit prediction, without ASR models

Sentence-to-Speech Translation
▪ TranSentence
✓ Encoding speech into a sentence-level speech embedding
✓ Decoding speech in any language from the sentence-level speech embedding

Demo

SpeechGPT
▪ Goal
✓ Integrating language models with a speech encoder
✓ Speech encoder: unit-based models

Unified Speech-Text Pretraining for Spoken Dialog Modeling
▪ Speech-text foundation models [NAVER CLOVA]
✓ This work proposes an extensive speech-text LLM framework to generate coherent spoken responses with organic prosodic features relevant to the given input speech
✓ Without relying on automatic speech recognition (ASR) or text-to-speech (TTS) solutions
✓ Can reflect the content, emotion, and speaker identity
[H. Kim, 2024] H. Kim et al., "Unified Speech-Text Pretraining for Spoken Dialog Modeling," arXiv, 2024.

AnyGPT: Unified Multimodal LLM
▪ Speech/Text/Image/Music tokenization

GPT-4o
▪ OpenAI
✓ Whisper [2023~]: ASR
✓ ChatGPT [2022~]: LLM
✓ OpenAI TTS [2023~]: TTS
▪ GPT-4o: Unified multi-modal LLM
✓ Better performance than single-modal models on ASR and translation tasks
✓ Multi-modal tokenization (?)

Speech-Text Language Model
▪ Moshi
✓ Speech-text foundation model and full-duplex spoken dialogue framework
✓ Real-time streaming neural audio codec
✓ Language models with a streaming neural audio codec
▪ Preparing a large-scale dataset
▪ Training the audio codec
▪ Training the speech-text language models

Real-time Streaming Codec
▪ Streaming audio codec
✓ Consists of causal convolutional layers and transformer layers with causal masking to encode and decode speech in a streaming manner
✓ Self-supervised representation distillation on the first VQ layer
✓ WavLM (SSL pre-trained model): transfers non-causal, high-level semantic information into the tokens produced by a causal model
✓ Semantic token (1st layer) + acoustic tokens (residual layers)

Moshi
▪ Speech-text foundation model which enables real-time spoken dialogue
✓ Moshi receives speech and generates (text, speech) tokens jointly
✓ Semantic token (1st layer) + acoustic tokens (7 residual tokens) = 8 tokens
▪ Tokens are predicted from bottom to top in the Depth Transformer
✓ Next token prediction with an acoustic delay (sketched below)
✓ The semantic token with delayed acoustic tokens enables streaming modeling of semantic and acoustic tokens jointly
▪ RQ-Transformer: Global-Local (Coarse-to-Fine) Transformer
✓ Global → Local (residual token prediction)
✓ The RQ-Transformer breaks down a flattened sequence of length K·S into S timesteps for a large Temporal Transformer, which produces a context embedding used to condition a smaller Depth Transformer over K steps.
✓ This allows scaling to longer sequences by increasing S, or to a higher depth by increasing K, than modeling the flattened sequence with a single model. (K = 4 in the illustration, for the sake of example.)
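The acoustic-delay idea from the Moshi slides, where the semantic token of a frame is modeled slightly ahead of its acoustic (residual) tokens, can be illustrated with a small helper. This is only a sketch of the general delay pattern; the 1-frame delay and the padding token are assumptions, not values from the slides:

import torch

def apply_acoustic_delay(codes, delay=1, pad_token=0):
    # codes: LongTensor [K, T] of codec tokens, row 0 = semantic, rows 1..K-1 = acoustic.
    # Returns [K, T + delay] where the acoustic rows are shifted `delay` steps to the right,
    # so at generation time the semantic token of a frame is predicted before its acoustics.
    K, T = codes.shape
    out = torch.full((K, T + delay), pad_token, dtype=codes.dtype)
    out[0, :T] = codes[0]                  # semantic tokens, no delay
    out[1:, delay:delay + T] = codes[1:]   # acoustic (residual) tokens, delayed
    return out

codes = torch.randint(1, 2048, (8, 5))     # 1 semantic + 7 acoustic codebooks, 5 frames
print(apply_acoustic_delay(codes))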
LLaMA-Omni
▪ Speech adaptor + pre-trained LLM
✓ Efficient fine-tuning with a speech adaptor

NotebookLM
▪ Open NotebookLM (https://huggingface.co/spaces/gabrielchua/open-notebooklm)
✓ Converts PDFs into podcasts with open-source AI models
✓ Llama 3.1 (with prompting)
✓ TTS models

Conclusion
▪ Language models have also shown their strength as generative models
▪ Speech models can utilize language models with a neural codec
▪ Open-source applications
✓ Whisper: ASR model
✓ Moshi and LLaMA-Omni: speech language models
✓ AudioGen and MusicGen: audio generation

Next Class
▪ Vision Transformer
▪ Masked Auto-Encoder (MAE)
▪ CLIP
▪ Vision Language Models (VLM)
▪ Generative Models
✓ DiT (Diffusion Transformer)
✓ VDT (Video Diffusion Transformer)