Audio_Survey.pdf
Document Details
Uploaded by EndorsedAntigorite1391
2023
Tags
Full Transcript
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 1 Audio Deepfake Detection: A Survey...
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 1 Audio Deepfake Detection: A Survey Jiangyan Yi, Member, IEEE, Chenglong Wang, Jianhua Tao, Senior Member, IEEE, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao Abstract—Audio deepfake detection is an emerging active topic. A growing number of literatures have aimed to study deepfake detection algorithms and achieved effective performance, the problem of which is far from being solved. Although there are some review literatures, there has been no comprehensive survey that provides researchers with a systematic overview of these developments with a unied evaluation. Accordingly, in this survey paper, we rst highlight the key differences across various types of deepfake audio, then outline and analyse competitions, datasets, features, classications, and evaluation of state-of-the-art approaches. For each aspect, the basic techniques, advanced developments and major challenges are discussed. In addition, we perform a unied comparison of representative features and classiers on ASVspoof 2021, ADD 2023 and In-the-Wild datasets for arXiv:2308.14970v1 [cs.SD] 29 Aug 2023 audio deepfake detection, respectively. The survey shows that future research should address the lack of large scale datasets in the wild, poor generalization of existing detection methods to unknown fake attacks, as well as interpretability of detection results etc. Index Terms—Audio, deepfake detection, survey, features, classiers. ✦ 1 I NTRODUCTION O VER the past few years, deep learning based text-to- speech (TTS) and voice conversion (VC) technologies have made great improvement ,. These technologies enable the generation of human-like natural speech that proves difcult to distinguish from real audio. Admittedly, the development in these technologies signicantly improve the conveniences of our life in many scenarios, such as in- car navigation systems, e-book readers, intelligent robots Fig. 1. Mainstream solutions on audio deepfake detection: pipeline and etc. They nonetheless also pose a serious threat to social end-to-end detector. security and political economy if someone misuses the so- called deepfake technologies for malicious purposes. The term deepfake refers to the usage of deep learning meth- categorized into two kinds of solutions: pipeline and end-to- ods to seamlessly swap faces in videos on Reddit in 2017. end detector. The pipeline solution, consisting of a front-end Nowadays, deepfake is now generically used by the media feature extractor and a back-end classier, has become the or people to refer to any audio or video in which important de facto standard framework over the last decades. In recent attributes have been either digitally altered or swapped, years, end-to-end methods have attracted more and more with the help of articial intelligence (AI). Fraudsters used attention, which employ a model to jointly optimise the AI based software to mimic a chief executive’s voice and feature extraction and classication via operating directly demand a fraudulent transfer of USD 243, 000 in 2019. In upon raw audio waveform. response to these attacks, it is necessary to be able to detect Although previous studies on audio deepfake detection deepfake audio. have obtained promising performance, their scopes remain Audio deepfake detection is a task that aims to dis- largely scattered, with few systematic surveys. Most of them tinguish genuine utterances from fake ones via machine aim for summarising previous spoong attacks and coun- learning techniques as shown in Figure 1. An increasing termeasures for protecting automatic speaker verication number of attempts – have been made to further the (ASV) systems. Wu et al. provide a comprehensive sur- development of audio deepfake detection. Existing main- vey of past work to assess the vulnerability of ASV systems stream studies on audio deepfake detection can be roughly and the countermeasures to protect them in 2015. One recent review literature presents advances in anti-spoong from a perspective of ASVspoof challenges in 2020. Another Jiangyan Yi and Chu Yuan Zhang are with the State Key Laboratory survey work presents and analyses attack detection of Multimodal Articial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences and University of Chinese Academy of work for ASV systems published between 2015 and 2021. Sciences. E-mail: [email protected] (Jiangyan Yi and Jianhua Tao Aakshi et al. review and analysis most of the benchmark are corresponding authors.) spoofed speech datasets, methods and evaluation metrics Jianhua Tao is with the Department of Automation, Tsinghua University. for ASV systems and spoof detection techniques. Very few of Chenglong Wang is with the University of Science and Technology of China. them focus on summarising the past work of audio deepfake Xiaohui Zhang is with the Beijing Jiaotong University. detection through the lens of helping people refrain from Yan Zhao is with the Hebei University of Technology. being deceived. Most recently, a survey introduces the deepfake audio types, datasets and detection methods. But JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 2 (a) (b) (c) (d) (e) Fig. 2. Five kinds of deepfake audio: (a) text-to-speech, (b) voice conversion, (c) emotion fake, (d) scene fake, (e) partially fake. it only provides a collection of results from classic methods, TABLE 1 and lacks consistent experimental analysis. Summary of audio deepfake types in past studies. Different from , this paper presents a comprehensive survey that makes the following contributions. We provide a Fake Type Fake Trait Fake Duration AI-aided systematic overview focusing on learning common discrimi- Text-to-speech (TTS) Speaker identity, Fully Yes Speech content native audio features related to audio deepfake detection, as Voice conversion (VC) Speaker identity Fully Yes well as computing methodologies that can be used to build Emotion fake Speaker emotion Fully Yes an appropriate generalized automatic system. This survey Scene fake Acoustic scene Fully Yes also includes a detailed summary of up-to-date audio deep- Partially fake Speech content Partially Yes fake detection datasets; based on this summary, we perform a unied comparison of representative detection methods. The remainder of this paper is organized as follows. Section 2 highlights differences across various types of deep- 2.1.1 Text-to-Speech fake audio, summarizes existing benchmark datasets and Text-to-speech (TTS) , commonly known as speech syn- competitions, as well as evaluation metrics. Discriminative thesis and shown in Figure 2 (a), aims to synthesise intel- features for audio deepfake detection are presented and ligible and natural speech, given any arbitrary text, using categorized in Section 3. Representative classication algo- machine learning based models. TTS models can generate rithms are summarized in Sections 4. End-to-end methods realistic and human-like speech with the development of and generalization methods are introduced in Sections 5 deep neural networks. TTS systems mainly include text and 6. Section 7 presents a detailed comparison of different analysis and speech waveform generation modules. There features and models. Some remaining challenges and future are two major methods on speech waveform generation: research directions are summarized in Section 8. Finally, concatenative , and statistical parametric TTS. Section 9 concludes this paper. The latter often consists of an acoustic model and a vocoder. Most recently, some end-to-end models have been proposed to generate high-quality sounding audio, such as Variational Inference with adversarial learning for end-to-end Text-to- 2 OVERVIEW Speech (VITS) and FastDiff-TTS. The eld of audio deepfake detection has been blossoming in terms of deepfake technologies, competitions, datasets, 2.1.2 Voice Conversion evaluation metrics and detection methods. Voice conversion (VC) refers to cloning a person’s voice digitally as shown in Figure 2 (b). It aims to change the timbre and prosody of a given speaker’s speech to that of 2.1 Types of Deepfake Audio another speaker, while the content of the speech remians the same. The input to a VC system is a natural utterance of Deepfake audio generally refers to any audio in which the given speaker. There are about three main approaches of important attributes have been manipulated via AI tech- VC technologies: statistical parametric , , frequency nologies while still retaining its perceived naturalness. Pre- warping and unit-selection. Statistical parametric vious studies mainly involve ve kinds of deepfake audio: model also has a vocoder which is similiar to that in statisti- text-to-speech, voice conversion, emotion fake, scene fake, cal parametric TTS ,. In recent years, end-to-end VC partially fake. The characteristics of different deepfake types models have also been proposed to mimic a person’s voice are summarised in Table 1. characteristics. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 3 TABLE 2 Characteristics of representative competitions on audio fake detection. Competition Language Year #Registration #Submission Task Fake Type Deepfake Goal Baseline Model Features Classiers 2015 28 16 LA TTS, VC No Detection - - 2017 113 49 PA Replay No Detection CQCC GMM 48 LA TTS, VC No Detection 2019 154 LFCC, CQCC GMM ASVspoof English 50 PA Replay No Detection 41 LA TTS, VC No Detection 2021 198 23 PA Replay No Detection LFCC, CQCC, Raw GMM, LCNN, RawNet2 33 DF Deepfake Yes Detection 48 LF TTS, VC Yes Detection 2022 121 33 PF Partially fake Yes Detection LFCC, Raw GMM, LCNN, RawNet2 39 FG TTS, VC Yes Game fake ADD Chinese 63 FG TTS, VC Yes Game fake LFCC, Wav2vec2.0 GMM, LCNN 2023 145 16 RL Partially fake Yes Forensics LFCC LCNN 11 AR TTS, VC Yes Attribution LFCC ResNet + Openmax 2.1.3 Emotion Fake access (LA) task involving the detection of synthetic and Emotion fake seeks to change the audio in such a converted utterances. The ASVspoof 2017 only has one way that the emotion of the speech changes, while other task named physical access (PA), which includes replay information remains the same, such as speaker identity attacks. The ASVspoof 2019 consists of two tasks: LA and speech content. Changing emotions of the voice often and PA, which are included in previous two challenges. leads to semantics changes. An example of emotion fake Speech deepfake detection task is included in the ASVspoof is illustrated in Figure 2 (c). The original utterance said by 2021 , which consists of three tasks: LA, PA and speech speaker B is with a happy emotion. The fake utterance is the deepfake (DF). The DF task involves compressed audio audio where the happy emotion has been changed into a similar to the LA task. sad emotion. There are two kinds of methods on emotional The ADD 2022 challenge was organized by includ- VC called emotion fake : parallel data based and non- ing three tasks: low-quality fake audio detection (LF), par- parallel data based methods. tially fake audio detection (PF) and audio fake game (FG). The LF task focuses on dealing with genuine and fully fake 2.1.4 Scene Fake utterances with various real-world noises and interferences Scene fake involves the tempering of the acoustic scene etc. The PF task aims to distinguish between partially fake of the original utterance with another scene via speech and real audio. The FG task is a rivalry game involving enhancement technologies while the speaker identity and an audio generation task and an audio fake detection task, speech content remain unchanged. An example of scene wherein the generation task participants aim to generate fake is shown in Figure 2 (d). The acoustic scene of the real audio that could deceive the detection systems submitted utterance is “Ofce”. The acoustic scene of the fake utterance by the detection task participants. The results in ADD 2022 is “Airport”. If the scene of an original audio is manipulated show that it is difcult to use the same model to deal with with another one, authenticity and integrity verication of all fake types. The result also show that the generalisation the audio will be unreliable and even the semantic meaning of detection techniques remains an open problem. Different of the original audio could be changed. from previous challenges (e.g. ADD 2022), ADD 2023 focuses on surpassing the constraints of binary real or 2.1.5 Partially Fake fake classication, and actually localizing the manipulated Partially fake focuses on only changing several words intervals in a partially fake utterance as well as pinpointing in an utterance. The fake utterance is generated by manip- the source responsible for generating any fake audio. The ulating the original utterances with genuine or synthesized ADD 2023 challenge includes three subchallenges: audio audio clips. The speaker of the original utterance and fake fake game (FG), manipulation region location (RL) and clips is the same person. The synthesized audio clips, while deepfake algorithm recognition (AR). keeping the speaker identity unchanged. An example of partially fake is shown in Figure 2 (e). 2.3 Benchmark Datasets 2.2 Competitions The development of audio deepfake detection techniques has been largely dependent on well-established datasets Over the last few years, a series of competitions have played with various fake types and diverse acoustic conditions. A a key role in accelerating the development of audio deep- variety of datasets have been designed to protect ASV sys- fake detection, such as the ASVspoof1 and ADD2 challenges. tems or human listeners from spoong or deceiving. Table 3 Table 2 shows the characteristics and baseline models of the highlights the characteristics of representative datasets on representative competitions. audio deepfake detection. The ASVspoof challenges mainly focus on detecting Many early studies designed spoofed datasets to de- spoofed audio from the perspective of protecting ASV sys- velop spoong countermeasures for ASV systems. In the tems from attack. The ASVspoof 2015 involves logical early days, a diverse set of spoong datasets were propri- 1. https://www.asvspoof.org etary due to the design of a dataset depending very much 2. http://addchallenge.cn on the specic spoong approach assumed in a particular JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 4 TABLE 3 Characteristics of representative datasets on audio deepfake detection. SR denotes sampling rate and SL refers to average length of utterances. Utt and Spk denote utterances and speakers, respectively. ASVspoof 2021 ADD 2022 ADD 2023 In-the-Wild WaveFake FoR DF LF PF FG-D FG-D LR AR Year 2021 2022 2022 2022 2023 2023 2023 2022 2021 2019 Language English Chinese Chinese Chinese Chinese Chinese Chinese English English English Goal Detection Detection Detection Game fake Game fake Forensics Attribution Detection Detection Detection Fake Types VC, TTS TTS, VC Partially fake TTS, VC TTS, VC Partially fake TTS, VC TTS TTS TTS Clean, Clean, Clean, Condition Noisy Clean Noisy Clean Noisy Clean Clean Noisy Noisy Noisy Format FLAC WAV WAV WAV WAV WAV WAV WAV WAV WAV SR (Hz) 16k 16k 16k 16k 16k 16k 16k 16k 16k 16k SL (s) 0.5∼12 1∼10 1∼10 1∼10 1∼10 1∼10 1∼10 2∼ 8 8∼12 0.5∼20 #Hours 325.8 222.0 201.8 396.0 394.7 131.2 194.5 38.0 196.0 150.3 #Real Utt 22, 617 36, 953 23, 897 46, 871 172, 819 46, 554 14, 907 19, 963 0 108, 256 #Fake Utt 589, 212 123, 932 127, 414 243, 537 113, 042 65, 449 95, 383 11, 816 117, 985 87, 285 #Real Spk 48 >400 >200 >400 >1000 >200 >500 58 0 140 #Fake Spk 48 >300 >200 >300 >500 >200 >500 58 2 33 Accessibility Public Restricted Restricted Restricted Restricted Restricted Restricted Public Public Public study. Some spoong datasets are designed only involving PF dataset are chosen from the HAD dataset designed by Yi a kind of TTS method or a sort of VC approach ,. et al. , which are generated by manipulating the original However, it is difcult to make comparisons across different genuine utterances with real or synthesized audio segments spoong methods. To alleviate this issue, several spoofed of several key words, such as named entities. The detection datasets including multiple approaches are designed by Wu task dataset of the FG track (FG-D) are randomly selected et al. and Alegre et al. , which involve replay, from the submitted utterances of generation task in ADD TTS and VC technologies. But the varieties of spoong 2022. A Chinese synthetic speech detection dataset FMFCC- techniques are still insufcient compared to the diversity A contains 13 types of fake audio involving noise addi- required by generalised countermeasure studies. In order tion and audio compression. The above-mentioned datasets to conduct repeatable and comparable spoong detection have played a pivotal role in accelerating the development studies, Wu et al. develop a standard public spoof- of audio deepfake detection. However, the fake utterances ing dataset SAS which consists of various TTS and VC mainly involve changing speaker identity, speech content methods in 2015. The SAS dataset is used to support the or channel noise of the original audio. Most recently, Zhao rst ASVspoof challenge (ASVspoof 2015) aiming to detect et al. design an emotion fake audio detection dataset the spoofed speech. Replay is considered as a lowcost named EmoFake, where the original emotion of a speaker’s and challenging attack included in the ASVspoof 2017 chal- speech has been manipulated with another one but other lenge. The ASVspoof 2019 and 2021 datasets both information still remains the same. A scene manipulation consist of replay, TTS and VC attacks. Previous datasets audio dataset named SceneFake is constructed by Yi et in ASVspoof challenges focus on detecting speech attacks al. , in which the acoustic scene of an original utterance in the microphone channel. Lavrentyeva et al. design is replaced with another one using speech enhancement a PhoneSpoof dataset for speaker verication systems, in technologies. In 2022, a real-world dataset named In-the- which the utterances are collected in telephone channels. Wild are collected from publicly available sources such as A partially spoofed database is designed by using social networks and popular video sharing platforms, where voice activity detection technologies to randomly concate- the utterances are from English-speaking celebrities and nate spoofed utterances. politicians. A few audio deepfake detection datasets have been developed to protect people from deceiving by deepfake 2.4 Evaluation Metrics audio. The deepfake types contained in the datasets mainly include: TTS, VC, emotion fake, scene fake and partially Previously, equal error rate (EER) is used as the evalu- fake. In 2020, Reimao et al. developed a publicly ation metrics for audio deepfake detection tasks in the available dataset FoR containing synthetic utterances, which ASVspoof and ADD challenges. The ’threshold- are generated with open-sourced TTS tools. A private fake free’ EER is dened as follows. Let Pf a (θ) and Pmiss (θ) dataset is constructed using the open-sourced VC and TTS denote the false alarm and miss rates at threshold θ. systems. In 2021, Frank et al. developed a fake audio #{fake trials with score > θ} dataset named WaveFake, which contained two speakers’ Pf a (θ) = (1) #{total fake trials} fake utterances synthesised by the latest TTS models. Audio #{genuine trials with score < θ} deepfake attacks are included in ASVspoof 2021 , which Pmiss (θ) = (2) consider data compression effects. However, these datasets #{total genuine trials} have not covered some real-life challenging situations. The So Pf a (θ) and Pmiss (θ) are, respectively, monotoni- datasets in ADD 2022 challenge are designed to ll the cally decreasing and increasing functions of θ. The EER gap. The fake utterances in LF dataset are generated corresponds to the error rate at the threshold θEER at using the latest state-of-the-art TTS and VC models, which which the two detection error rates are equal, i.e. EER = contain diversied noise interference. The fake utterances in Pf a (θEER ) = Pmiss (θEER ). JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 5 There are two rounds of evaluations in the detection task 3.1.1 Short-term magnitude based features of audio fake game track in ADD challenges ,. Each round evaluation has each own ranking in terms of EER. The statistical averaging inherent in parametric modeling The nal ranking is in terms of the weighted EER (WEER), of the magnitude spectrum may introduce artefacts, such which is dened as follows. as over-smoothed spectral envelopes. The use of magni- tude based spectrum can therefore be useful for detecting W EER = α ∗ EER R1 + β ∗ EER R2 (3) generated speech. Short-term magnitude features include where α and β denotes the weight of the corresponding magnitude spectrum and power spectrum features. EER, EER R1 and EER R2 are the EER of the rst and Magnitude spectrum features are directly derived from second round evaluation in the detection task of audio fake the magnitude spectrum ,. The logarithm of mag- game track, respectively. nitude spectrum is called log magnitude spectrum (LMS) containing the formant information, harmonic structure and all the spectral details of speech signal. The logarithmic is 3 D ISCRIMINATIVE F EATURES used to reduce the the dynamic range of the magnitude The feature extraction is a key module of the pipeline spectrum. Formant information contained in LMS is impor- detector. The goal of feature extraction is to learn dis- tant for speech recognition but may not be useful for fake criminative features via capturing audio fake artifacts from detection as most of the fake techniques (e.g. TTS or VC) are speech signals. Large amounts of efforts – have effective in modelling the formant of speakers. Therefore, shown the importance of useful features for detecting fake residual log LMS (RLMS) is proposed by employing inverse attacks. The features used in previous studies can be roughly linear predictive coding (LPC) lter to reduce the impact divided into four categories: short-term spectral features, of formant information but better analyse the details of long-term spectral features, prosodic features and deep fea- spectrum such as harmonics. tures. Short- and long-term spectral features are extracted Power spectrum features are derived from the power largely by relying on digital signal processing algorithms. spectrum, which may be the most well studied in fake audio Short-term spectral features, extracted from short frames detection. They include log power spectrum (LPS), cep- typically with durations of 20-30 ms, describe the short-term strum (Cep), lter bank based cepstral coefcients (FBCC), spectral envelope involving an acoustic correlate of voice all-pole modeling based cepstral coefcient (APCC) and timbre. However, short-term spectral features have been subband spectral (SS) features. LPS, commonly called log- demonstrated inadequate in capturing temporal character- spectrum, is computed directly on raw power spectrum by istics of speech feature trajectories. In response to this, some the logarithm. Cep is derived from the power spectrum researchers propose long-term spectral features to capture by applying discrete cosine transform (DCT). However, the long-range information from speech signals. In addition, dimensionality of LPS and Cep features is too high. FBCC prosodic features are used to detect fake speech. Unlike the features are proposed to address the aforementioned short-term spectral features from short duration, prosodic issue, and include rectangular lter cepstral coefcients features spans over longer segments, such as phones, sylla- (RFCC), linear frequency cepstral coefcients (LFCC), mel bles, words, and utterances etc. Most of the aforementioned frequency cepstral coefcient (MFCC), inverted MFCC (IM- spectral and prosodic features are hand-crafted features, the FCC). RFCC is computed using linear scale rectangular design of which is awed by biases due to limitations of lters. LFCC is extracted with linear triangular lters. handmade representations. So deep features, extracted MFCC is derived from mel scale triangular lters, with via deep neural network based models, are motivated to denser placement in lower-frequencies to simulate human- ll the gap. The characteristics and relationships of different ear perception. IMFCC utilizes triangular lters that features are listed in Figure 3. are linearly spaced on inverted-mel scale, giving higher emphasis to the high-frequency region. Mel-frequency prin- 3.1 Short-term Spectral Features cipal coeftients (MFPC) features are obtained similarly to the MFCC coefcients, but using principal component Short-term spectral features are computed mainly by ap- analysis (PCA) instead of the DCT to reomve the rela- plying the short-time Fourier transform (STFT) on a speech tions of the acoustic features. Mel spectrum (Mel spec) signal. Given a speech signal x(t), it is assumed to be also is derived similarly to the MFCC coefcients without quasi-stationary within a short period (e.g. 25ms). The STFT DCT. LFCC features are well-known, which together with of the speech signal x(t) is formulated as follows: Gaussian mixture model (GMM) and light convolutional X(t, ω) = |X(t, ω)| ejϕ(ω) , (4) neural network (LCNN) are used as the baseline models for ASVspoof and ADD , challenges. APCC fea- where |X(t, ω)| is the magnitude spectrum and ϕ(ω) is the tures are derived from all-pole modeling representation phase spectrum at frame t and frequency bin ω. The power of signal converted to linear prediction cepstral coefcients 2 spectrum is dened to be |X(t, ω)|. (LPCC). SS features include subband spectral ux Short-term spectral features are mainly composed of coefcients (SSFC), spectral centroid magnitude coefcients short-term magnitude and phase based features. Usually, (SCMC), subband centroid frequency coefcients (SCFC) few of the magnitude based features are directly derived and discrete Fourier mel subband transform (DF-MST). from the magnitude spectrum but most of them are derived The subband features mostly extract information such as from the power spectrum. The phase based features are spectral ux and centroid magnitude without looking into derived from the phase spectrum. the details within each subband. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 6 Fig. 3. The typical features used in previous studies can be roughly divided into four categories: short-term spectral features, long-term spectral features, prosodic features and deep features. 3.1.2 Short-term phase based features the derivative of the phase spectrum along the time axis. Even though phase information is important in human Different from the raw phase spectrum that scarcely reects speech perception, most TTS and VC systems use a simpli- any patterns, the IF spectrum capturing the temporal infor- ed, minimum phase model which may introduce artefacts mation of phase has clear patterns. The IF and GD contain into the phase spectrum. Therefore, phase based features very different patterns, which could provide complemen- can be used to discriminate between human and generated tary information for spoofed speech detection. BPD speech. The phase spectrum itself does not have stable is a phase feature extracted from baseband STFT, which patterns for fake audio detection due to phase warping. can provide more stable time-derivative phase information Post-processing methods are instead utilised to generate compared to the IF. RPS reects the ”phase shift” of har- useful short-term phase based features including group monic components in relation to the fundamental frequency. delay (GD) based and other phase features. Another way to reveal the patterns in phase spectrum is to GD based features involve GD, modied GD (MGD), use pitch synchronous STFT, where the patterns are called MGD cepstral coefcients (MGDCC) and all-pole group PSP features. CosPhase features are extracted from delay (APGD). GD is the derivative of phase spectrum along the phase spectrum by applying the cosine function to the frequency axis, which is referred as to a representation of unwrapped phase spectrum followed by DCT. In order to lter phase response. MGD is computed from the spectrum reduce the dimensionality of CosPhase features, CosPhase after cepstral smoothing frame-by-frame, which is a varia- principal coefcients (CosPhasePC) are computed by tion of GD and can extract a more clear phase pattern than means of PCA. GD. Xiao et al. use two factors to control the dynamic range of the MGD for anti-spoong. MGDCC is computed 3.2 Long-term Spectral Features from the MGD phase spectrum, using both phase and mag- nitude information ,. Wu et al. use MGDCC Short-term spectral features are not good at capturing tem- features to distinguish between synthetic speech and hu- poral characteristics of speech feature trajectories due to man speech, which outperform MFCC features. APGD is being computed in a frame-by-frame fashion. Therefore, a phase-based feature using all-pole modeling, whose role long-term spectral features have been proposed to capture in spoofed speech detection is investigated, notably in , long-range information from speech signals, and studies which has fewer parameters compared to MGD due to only have shown that they are critical to fake speech detec- the all-pole predictor order needing to be optimized. tion. The long-term features can be roughly categorized Other phase features , include instantaneous into four types in terms of time-frequency analysis ap- frequency (IF), baseband phase difference (BPD), relative proaches: STFT based features, constant-Q transform (CQT) phase shift (RPS), pitch synchronous phase (PSP) and based features, Hilbert transform (HT) based features and cosine-phase (CosPhase) based features. IF features is wavelet transform (WT) based features. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 7 3.2.1 STFT based features cients (MHEC). The Hilbert envelope is computed from There are four kinds of STFT based features: modulation each Gammatone lter output on the signal and a low-pass features, shifted delta coefcients (SDC), frequency domain lter is used for smoothing. linear prediction (FDLP) and local binary pattern (LBP) features. 3.2.4 WT based features Modulation features include modulation spectrum WT based features are derived mainly by performing WT on (ModSpec) and global modulation (Global M). ModSpec a speech sigal, which include mel wavelet packet coefcients contains long-term temporal characteristics of speech sig- (MWPC), cochlear lter cepstral coefcients (CFCC) and nal. Global M features combines spectral (e.g. MFCC) CFCC plus instantaneous frequency (CFCCIF). and temporal modulation information for better long range MWPC is computed by performing wavelet-packet feature modeling to further improve the performance of transform on speech signals. Novoselov et al. applied fake audio detection. SDC captures long-term speech PCA to the log mel scale information to derive 12 coef- information and are computed by augmenting delta coef- cients, which are called MWPC features. CFCC is de- cients of multiple speech frames. FDLP is obtained by rived based on wavelet transform-like auditory transform, performing DCT on speech signal using linear prediction the relevant mechanism of which occurs in the cochlea of analysis performed on different subbands, which are stud- the human ear. CFCCIF denotes CFCC plus IF features at ied in fake audio detection. LBP features obtain long the output of each subband lters. The IF and phase of the span information upon spectral features via LBP analysis envelope of the cochlear lter are vital features for speech in computer vision tasks ,. Alegre et al. , perception of human listeners. TTS and VC models generate use uniform LBP analysis to convert LFCC features into a the speech in a frame-by-frame pattern. However, human so-called textrogram, the histograms of which are used for speech production system does not produce speech at frame spoof detection. level rather in continuum. Therefore, Patel et al. propose CFCCIF features with variation capturing across frames to 3.2.2 CQT based features discriminate the real speech from the spoofed one, which Unlike short-term spectral features derived from a window won the best result of ASVspoof 2015. of tens of milliseconds, the CQT is a long-term window transform. The CQT provides higher frequency resolution at lower frequencies, but higher temporal resolution at higher 3.3 Prosodic Features frequencies in contrast to the STFT. The center frequencies Prosody refers to non-segmental information of speech sig- of each lter and the octaves are geometrically distributed nals, including syllable stress, intonation patterns, speaking for CQT. Various CQT based features are derived using rate and rhythm. Unlike the short-term spectral features CQT in different ways, which include CQT spectrum, CQ from short duration typically of 20–30 ms, it spans over cepstral coefcient (CQCC), extended CQCC (eCQCC), in- longer segments, such as phones, syllables, words, and verted CQCC (ICQCC), and CQT-based modied group utterances etc. The important prosodic parameters include delay (CQTMGD) etc. fundamental frequency (F0), duration (e.g. phone duration, CQT spectrum is known as CQTgram , which is pause statistics), energy distribution, speaking rate etc. Pre- computed by directly applied the logarithm on raw power vious studies on fake audio detection mainly consider magnitude spectrum obtained via CQT. Lavrentyeva el three major prosodic features: F0, duration and energy. al. obtain the best results by using the CQT spectrum These features are less sensitive to channel effects when for audio replay attack detection of ASVspoof 2017. CQCC compared to spectral features. They can provide com- is obtained from the DCT of the log power magnitude plementary information to spectral features for improving spectrum derived by CQT. In 2016, Todisco et al. the performance of fake audio detection. achieved promising performce of detecting unit selection F0 is also known as pitch. The pitch pattern of synthetic TTS based attacks via CQCC features. CQCC features enjoy speech is different from that of natural speech. It is wide usage, including as input features of the baseline difcult for TTS or VC models to precisely model human models of ASVspoof and ADD challenges ,. Over physiological features required to properly synthesize nat- half of the ASVspoof 2019 participants (26 out of 48) ultilsed ural speech. So synthetic speech has a different mean pitch CQCC features as the input of their classiers , many stability than human speech. In addition, co-articulation of of which obtain top-performing results ,. eCQCC human speech is smoother and more relaxed than that of is derived from the combination of coefcients from octave synthetic speech. This difference is captured by the jitters power spectrum with the CQCC features that are computed in the pitch pattern of the latter. Therefore, De Leon et from linear power spectrum. ICQCC is derived from al. use pitch pattern statistics like mean pitch stability the inverted linear power spectrum of long-term CQT. derived from image analysis for synthetic speech detection. Cheng et al. propose to incorporate the CQT and MGD Wu et al. introduce pitch pattern calculated by dividing for a more powerful representation of phase-based features the short-range autocorrelation function for anti-spoong named CQTMGD, which won the 1st place of the ASVspoof in 2016. Since TTS usually predicts F0 from text resulting 2019 physical access sub-challenge. in unnatural trajectories but VC usually copies a source speaker’s natural F0 trajectories, pitch pattern is more useful 3.2.3 HT based features for detecting synthetic speech than VC speech, especially HT based features are computed from the analytical signal for unit selection synthesis attack. In 2018, Pal et al. obtained by the HT, such as mean Hilbert envelope coef- extracted pitch variation at frame-level as complementary JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 8 information to magnitude and phase based features to im- Fully learnable spectral features are learned directly prove the performance of synthetic speech detection. How- from raw waveforms to approximate the standard ltering ever, the distribution of F0 is irregular so that it is difcult to process. They are different from partially learnable spectral use it directly. In order to address this issue, Xue et al. features extracted by training a lterbank matrix with a propose a method to capture discriminative features of the spectrogram. Zeghidour et al. propose time-domain F0 subband for fake speech detection, which not only uses lterbanks (TD-FBanks) via scattering transform approxi- the F0 information but also spectral features in 2022. How- mation of mel-lterbanks. The TD-FBanks are learned ever, pitch extraction algorithms are generally unreliable in without any constraints to approximate mel-lterbanks at noisy environments and the extraction of prosodic features initialization. SincNet is proposed to learn a convolution requires relatively large amounts of training data due to with sine cardinal lters, a non-linearity and a max-pooling their sparsity. Most recently, Wang et al. rst try to fuse layer, as well as a variant using Gabor lters. Tak F0, phoneme duration and energy for fake audio detection. et al use SincNet as the rst layer of the end-to-end Phoneme duration features are extracted from a pre-trained anti-spoong model called RawNet2. In 2021, Zeghidour et model HuBERT trained using a large amount of speech al. designed a new learnable ltering layer with Gabor data. lters called LEAF, which can be used as a drop-in substitute of mel-lterbanks. Unlike Sinc lters that require using a window function , Gabor lters are optimally localized 3.4 Deep Features in time and frequency domain. Tomilov et al. obtain The aforementioned spectral features and prosodic features promising results by using LEAF features for detecting are almost all hand-crafted features have strong and de- replay attacks of ASVspoof 2021. sirable representation abilities. However, their design is awed by biases due to limitations of handmade represen- 3.4.2 Supervised embedding features tations. Therefore, deep features are motivated to ll Supervised embedding features involve the extraction of in the gap. Deep features are learned by using deep neural deep embeddings from deep neural networks via super- networks, which can be roughly categorized into: learnable vised training. There are about four kinds of supervised spectral features, supervised embedding features and self- embedding features for audio deepfake detection: spoof supervised embedding features. embeddings, emotion embeddings, speaker embedings and pronunciation embeddings. 3.4.1 Learnable spectral features Spoof embeddings are extracted from a neural network Learnable spectral features involve using learnable neural based model trained on the bonade and spoofed data. layers to estimate the standard ltering process, which can Chen et al. use a DNN based model to compute ro- be categorized in terms of the procedures they perform: bust and abstract feature representation for spoofed speech partially and fully learnable spectral features. detection in ASVspoof 2015 challenge. Qian et al. extract Partially learnable spectral features are extracted by sequence-level bottleneck features, named s-vector, from training a neural network based lterbank matrix with a recurrent neural network (RNN) models for anti-spoong. spectrogram obtained by applying STFT on a speech sig- Das et al. train a deep feature extractor using a DNN nal. In 2017, Hong et al. developed deep neural net- model with LPS features in 2019. The spoof embeddings are work (DNN) lter bank cepstral coefcients (FBCC) named computed from the feature extractor by removing the output learned FBCC, to distinguish natural speech from spoofed layer. In order to learn sequence contextual information one. The learned FBCC can capture better differences be- for fake audio detection, Alejandro et al. propose a tween real and synthetic speech than most hand-crafted light convolutional gated RNN to learn utterance-level deep FBCC, especially for detecting unseen attacks. Sailor et al. embeddings, which are then used as the inputs of the back- propose a method to learn lterbank representation end classication. Most recently, Doan et al. use RNN, using convolutional restricted boltzmann machine named convolutional sequence-to-sequence and transformer based ConvRBM features. In 2020, Cheuk et al. presented encoder to learn breathing and talking sounds as well as a neural network-based audio processing toolkit named silence in an audio clip for deepfake detection. nnAudio, which uses 1D convolutional neural networks to Emotion embeddings are learned using a supervised transform audio signal from time domain to frequency do- speech emotion recognition model trained with emotion main. This toolkit on make the waveform-to-spectram labelled data. The emotion embeddings are directly used to transformation layer trainable via back-propagation. How- detect fake utterances. Conti et al. proposed a method ever, Fu et al. report that nnAudio based anti-spoof to detect fake speech via emotion recognition in 2022. The methods obtain limited improvement due to the fact that rationale behind this method is that the emotional behavior nnAudio is implemented by a set of unconstrained learn- of the generated audio is not natural like that of real human able lterbanks. Zhang et al. use a neural network to speech. The results in show that the method can learn the frequency centre, bandwidth, gain, and shape of generalize well in cross-dataset scenarios. the lter banks performing different constraints to extract Speaker embeddings are trained using a supervised features. Fu et al. propose a front-end named FastAudio speaker recognition model using training data with speaker whose input is a spectrogram of the STFT. The learnable identity label. The speaker embeddings are used as auxiliary layer of FastAudio is instead of replacing xed lterbanks features to improve the performance of fake audio detection. by performing lterbank shape constraints for anti-spoong In 2022, Pan et al. joinly train a speaker recognition tasks. model and a fake audio detection model via multi-objective JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 9 learning. The speaker embeddings and LFCC features are with target data, the classier needs to be deep for the anti- both used as the input features of fake audio detection spoong models. However, a simple neural network with model. just an average temporal pooling and linear layer is suf- Pronunciation embeddings are extracted from a speech cient when the pre-trained model is ne-tuned with anti- recognition model trained with labelled data. The pronun- spoong data. Learning deep embedding features using ciation embeddings can be directly used to discriminate the self-supervised training is suggested as a potential direction real speech from the fake one. In 2023, Wang et al. fuse to improve the generalization of fake audio detection. pronunciation embeddings and prosodic features to train a fake audio detection model. The pronunciation embeddings are computated from a pretrained conformer-based speech 4 C LASSIFICATION A LGORITHMS recognition model. The back-end classier is also very important for audio deepfake detection, which aims to learn high-level feature representation of the front-end input features and model 3.4.3 Self-supervised embedding features excellent discrimination capabilities. The classication algo- Although supervised embedding features are well general- rithms are mainly divided into two categories: traditional ized to unknown conditions, they are learnt using a plentiful and deep learning classication. supply of labeled training data. However, obtaining annotated speech data or fake utterances is costly and 4.1 Traditional Classication technically demanding. This motivates researchers to ex- tract deep embedding features from self-supervised speech Many classic pattern classication approaches have been model trained using any bona de speech data. Despite employed to detect fake speech, including logistic regres- training an effective self-supervised model costly, there are sion (LR) , , probabilistic linear discriminant anal- a number of pre-trained self-supervised speech models are ysis (PLDA) , , random forest (RF) , gradient publicly available, such as Wav2vec , , XLS-R boosting decision Tree (GBDT) , extreme learning ma- and HuBERT. chine (ELM) , k-nearest neighbor (KNN) and so Wav2vec based features are extracted from the pre- on. The most widely used classiers are the support vector trained Wav2vec or Wav2vec2.0 models. Xie et al. machine (SVM) and GMM. propose a Siamese neural network based representation learning model to distinguish real and fake speech in 2021. 4.1.1 SVM based classiers The model is trained with the wav2vec features extracted One of the extensively used traditional classiers in pre- from a pretrained Wav2vec model. Tak et al. use self- vious early work for spoong audio detetion is SVM due supervised learning in the form of a Wav2vec2.0 front-end to its excellent classication capabilities. Alegre et al. with ne tuning for fake audio detection in 2022. Although suggest that SVM classiers are inherently robust to articial the pretrained Wav2vec2.0 is trained using only genuine signal spoong attacks. However, it is very difcult to know speech data without any fake audio, they obtain the state- the exact nature of spoong attacks in practical scenarios. of-the-art results reported in the literature for both the Therefore, Alegre et al. and Villalba et al. propose ASVspoof 2021 LA and Deepfake datasets. a one-class SVM classier only trained using genuine ut- XLS-R based features are extracted from the pre- terances to classify real and fake voices, which generalizes trained XLS-R models which is a variant of Wav2vec2.0. well to unknown spoof attacks. Hanilçi et al. use i- Martin-Donas utilize deep features extracted from pre- vectors as the input features of SVM to discriminate the real trained XLS-R , which is a large-scale model for cross- utterance from the fake one. lingual speech representation learning based on a pretrained Wav2vec2.0. The method ranked rst in the LF track of ADD 4.1.2 GMM based classiers 2022 challenge , where the utterances are interfered Another conventional classier well-known as GMM is with various noises. Lv et al. use a self-supervised widely used in fake audio detection as it is an effective model XLS-R as a feature extractor for fake audio detection. generative model employed as the baseline model in a series The features generalize well for unknown partially fake of competitions, such as ASVspoof 2017 , 2019 , voices and obtain the best results of PF task of ADD 2022 2021 and ADD 2022. Amin et al. train a competition. GMM classier fed with MFCC features to detect voice HuBERT based features are extracted from the pre- disguise from speech variability. De Leon et al. use trained HuBERT models. Wang et. use a pre-trained a GMM classication to discriminate between human and HuBERT model to extract the duration encoding vector synthetic voices. Wu et al. propose a method which for audio deepfake detection. The encoding vector is an decided between real and converted speech by using log- encoding similar to speech phonemes. Wang et al. scale likelihood ratio based upon the GMM model for real directly use the embeddings from the pre-trained HuBERT and converted speech. Sizov et al. use i-vectors trained as the input features of the detection models. with GMM mean supervector to jointly perform VC attacks Wang et al. also investigate the performance detection and speaker verication obtaining promising per- of spoof speech detection using embedding features ex- formance. Many participants of ASVspoof 2015 and 2017 tracted from different pre-trained self-supervised models, have obtained promising performance by adopting GMM e.g. Wav2vec2.0, XLS-R and HuBERT, providing some useful for classifying genuine and spoofed speech. Sahidullah ndings in. If the pre-trained model is not ne-tuned et al. choose GMM classiers for benchmarking of JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 10 TABLE 4 Comparison of representative audio deepfake detection classications Algorithm Advantages Disadvantages Traditional SVM , , Early work, excellent classication capabilities Restricted by the limited training samples Classication GMM , – Most widely used baseline Performance needs to be further improved Deeper networks are difcult to train and result CNN based LCNN , , , – Easier to benchmark due to popularity in performance degradation ResNet , – Avoiding performance degradation Limitted generalizabilities to unseen fake attacks ResNet based Enhances feature representations AFN Out-of-domain generalizabilities should be improved in both the frequency and time domains Enlarging the receptive elds and improve the Res2Net based Res2Net , Not considering channel relationship Deep Learning generalization to unseen fake utterance Classication Deeper networks are difcult to train and result in SENet , Modelling interdependencies between channels performance degradation SENet based Combining SENet and ResNet, achieving better Not learning the relationships between ASSERT performance neighbouring sub-bands or segment Modelling relationships between temporal GNN based GAT Not automatically optimize network architectures segments or spectral sub-bands Little human effort, automatically optimizes the DARTS based PC-DARTS Difcult to train operations in network architecture blocks Early end-to-end work CNN based CRNNSpoof Performance need be improved for audio deepfake detection Not optimize the parameters of the Sinc-conv during RawNet2 , Widely used end-to-end model training RawNet2 based Reducing the correlation between lters in the TO-RawNet Not learning adjacent temporal relationships Sinc-conv Data boosting and augmentation technique Not considering two heterogeneous graphs via a RawGAT-ST with spectro-temporal GAT heterogeneous attention mechanism GNN based Modelling artefacts spanning temporal and spectral AASIST Unreliably for unknown fake attacks End-to-End segments with a heterogeneous attention mechanism Model Little human effort DARTS based Raw PC-DARTS Not easy to train directly upon the raw speech Use positional-related local and global dependency Rawformer Acquire local dependency not well for synthetic speech detection Transformer based Using squeeze-and-excitation operation SE-Rawformer Computation costly to acquire local dependency various features as it yields reasonably good accuracy in 4.2.2 ResNet based classiers the ASVspoof 2015 dataset. In addition, Todisco et al. Although deep CNNs have achieved promising results for use GMM in a standard 2-class classier in which the classes fake audio detection, deeper neural networks are more dif- correspond to genuine and spoofed speech. cult to train and result in performance degradation. In order 4.2 Deep Learning Classication to address this problem, ResNet is introduced as the classier , employing a residual mapping. Tomilov et The back-end classications of the latest fake audio de- al. and Chen et al. both use ResNet as the classier tection systems are mostly based on deep learning meth- for audio deepfake detection and achieve promising results ods, which signicantly outperform the SVM and GMM in the deepfake task of ASVspoof 2021. Yan et al. em- based classiers due to their powerful modelling capabil- ploy a standard 34-layer ResNet with multi-head attention ities. The model architectures of back-end classica- pooling layer for detecting deepfake audio, ranking in the tion are generally based on convolutional neural network rst place in the FG-D task of the ADD 2022. Lai et al. (CNN), deep residual network (ResNet), modied ResNet propose a ResNet-based classier named Attentive Filtering (Res2Net), squeeze-and-excitation networks (SENet), graph Network (AFN) to further improve the performance. AFN is neural network (GNN), differentiable architecture search based on dilated residual network, using convolution layers (DARTS) and Transformer. instead of fully connected layers, and modifying the resid- 4.2.1 CNN based classiers ual units by adding a dilation factor. A light ResNet based Since CNNs are good at capturing spatially-local correla- fake audio detection system is introduced by Parasu et al. tion, CNN based classiers have achieved promising per- , reducing network parameters to prevent overtting. formance , such as light CNN (LCNN) consisting Kwak et al. introduce a compact network ResMax of convolutional and max-pooling layers with Max-Feature- combining MFM activation and ResNet for improving the Map (MFM) activation. LCNN is used as the baseline model performance of spoofed audio detection system. of the ASVspoof and ADD competitions. The best system in ASVspoof 2017 and the best single system in 4.2.3 Res2Net based classiers the LA task of ASVspoof 2019 are also utilize LCNN for fake audio detection. The MFM activation of LCNN not Although ResNet based models have strong ability to cap- only lters the noise effects (ambient noise, signal distortion, ture fake cues, their generalizabilities to unseen fake attacks etc.), retains the core information, but also reduces the are still limited. Therefore, Li et al. propose the incor- computational cost and storage space. Zeinali et al. use poration of Res2Net in fake audio detection systems, where VGG-like network comprising several convolutional and the feature maps within one ResNet block are splitted into to pooling layers followed by a statistics pooling and several multiple channel groups linking by a residual-like connec- dense layers to detect fake utterances. Wu et al. propose tion. The connection of Res2Net enlarges the receptive elds a feature genuinization transformer with CNN trained only and improve the generalization to unseen fake utterances. using genuine speech, and the outputs of this transformer Kim et al. utilize Res2Net and Phase network, fed with are then fed into the LCNN based classier. phase and magnitude features, for detecting fake audio. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 11 4.2.4 SENet based classiers classiers fed with hand-crafted or learnable features. Al- Convolution operators of CNN aim to fuse both spatial though past literature , , , shows that and channel-wise information within local receptive elds the use of well-designed classier usually leads to better at each layer, not considering channel relationship. Hu et performing models, the performance of a given classier al. and Zhang et al. propose a focus Squeeze- can vary greatly when combined with different features. In and-Excitation network (SENet) adaptively modelling inter- recent years, deep neural network based approaches that dependencies between channels. Lai et al. use SENet integrate feature extraction and classication in an end- to train one of the fake audio detection models. Further- to-end manner have achieved competitive performance, more, they propose a system named Anti-Spoong with where both the feature extractor and the classier are jointly Squeeze-Excitation and Residual neTworks (ASSERT) com- optimized directly upon the raw speech waveform. The bining SENet and ResNet. The ASSERT are ranked as one end-to-end models avoid limitations introduced from the of the top performing systems in the two sub-challenges use of knowledge-based features and are optimized for the in ASVspooof 2019. Wu et al. use SENet with self- application rather than generic decompositions. The attention layer to detect partially fake audio, achieving top end-to-end architectures of audio deepfake detection can be performance in ADD 2022 challenge. Xue et al. utilize roughly classied into four types: CNN, RawNet2, ResNet, SENet with efcient channel attention via self-distillation for GNN, DARTS and Transformer. fake speech detection. 5.1 CNN based models 4.2.5 GNN based classiers Some researchers attempt the CNN based models to end- Graph neural networks (GNNs) , like graph attention to-end fake audio detection. Muckenhirn et al. em- network (GAT) or graph convolutional network (GCN), are ployed a simple CNN-based end-to-end approach to de- used to learn underlying relationships among data. The fake tection spoofed attacks. The proposed model consists of a artefacts used to detect spoong attacks are often located in single convolution layer and a multilayer perceptron (MLP) specic temporal segments or spectral sub-bands. However, layer, which performs well for VC and TTS attacks. A the aforementioned studies do not focus on learning the raw waveform convolutional long short term neural net- relationships between neighbouring sub-bands or segments. work (CLDNN) based anti-spoong method is proposed by Tak et al. use a GAT to model these relationships to Dinkel et al.. The CLDNN model employs time- and improve the performance of fake audio detection systems. frequency-convolutional layers to reduce time and spectral More recently, Chen et al. utilize GCN incorporat- variations, as well as long-term temporal memory layers to ing prior knowledge to learn spectro-temporal dependency model long-term temporal information. In 2020, Chintha information for anti-spoong, which achieves promising et al. proposed a convolution-recurrent neural net- performance on the ASVspoof 2019 LA dataset. work for spoong detection named CRNNSpoof, which is composed of ve 1-D convolution layers, a bidirectional 4.2.6 DARTS based classiers LSTM layer and two fully-connected layers. However, the A particular variant of neural architecture search known aforementioned models do not perform well in cross-dataset as differentiable architecture search (DARTS) , auto- evaluation. In order to alleviate this issue, Hua et al. matically optimizes the operations contained within archi- propose a time-domain synthetic speech section net, called tecture blocks, including convolutional, pooling, residual TSSDNet, including Inception parallel convolutions struc- connections operations. Ge et al. introduce a variant of tures named Inc-TSSDNet. The proposed model has promis- DARTS known as partial channel connections (PC-DARTS) ing generalization capability to unseen datasets. for audio deepfake detection. The PC-DARTS based model, with little human effort and containing 85% fewer parame- 5.2 RawNet2 based models ters than a Res2Net model, obtains competitive results com- Motivated by the power of RawNet2 in text-independent pared to the best performing systems in previous studies. speaker verication , Tak et al. employ RawNet2 to Wang et al. propose a light DARTS, which combines anti-spoong. RawNet2 is a convolutional neural network DARTS with MFM activation, playing the role of feature with residual blocks, the rst layer of which has a bank selection. of sinc-shaped lters, which is essentially the same as that of SincNet. RawNet2 operates directly on raw audio 4.2.7 Transformer based classiers through time-domain convolution and has potential to learn Different from the fully fake utterances, the partially fake cues that are not detectable using knowledge-based meth- utterances contains some discontinuity artifacts between ods. Wang et al. use RawNet2 with weighted additive concatenated audio clips. Transformer is good at modelling angular margin loss for fake audio detection. However, local and global artifacts and relationship. So Cai et al. RawNet2 does not optimize the parameters of the Sinc-Conv use Transformer and ResNet-1D as the backend classier to layer during training, limiting its performance. In order detect partially fake audio and locate the fake regions. to alleviate this problem, Wang et al. propose TO- RawNet to improve its discriminability, which incorporates orthogonal convolution into RawNet2 reducing the correla- 5 E ND - TO -E ND M ODELS tion between lters in the sinc-conv. TO-RawNet based fake The aforementioned methods to audio deepfake detection audio detection models observably outperform RawNet2 have focused on the design of machine learning based based models. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 12 5.3 ResNet based models Liu et al. also propose a squeeze-and-excitation Raw- Deeper neural network with residual mapping named former called SE-Rawformer using squeeze-and-excitation ResNet become easy to train and achieve promising per- operation to acquire local dependency, which outperforms formance. Hua et al. propose a TSSDNet with resid- the Rawformer. ual skip connection named Res-TSSDNet, obtaining better performance. Ma et al. propose a speech anti-spoong 6 G ENERALIZATION M ETHODS model named RW-ResNet composed of Conv1D Resblocks Although most of the existing audio deepfake detection and backbone ResNet34. methods have achieved impressive performance in in- domain test, their performance drops sharply when dealing 5.4 GNN based models with out-of-domain dataset in real-life scenarios –. Inspired by the success of GAT to model complicated rela- In other words, the generalization ability of audio deepfake tionships among graph representations , Tak et al. detection systems is still poor. Several attempts have been propose a spectro-temporal GAT named RawGAT-ST, which made to try to tackle this challenge from different perspec- learns the relationship, outperforming the RawNet2 and tives, such as loss function and continual learning. Res-TSSDNet on the LA evaluation set of ASVspoof 2019. However, the RawGAT-ST consists of a pair of parallel 6.1 Loss Function graphs to combine information by employing element-wise It has become increasingly challenging to improve the gen- multiplication to the two graphs. In fact, it will be benecial eralization ability of audio deepfake detection systems to to combine the two heterogeneous graphs via a hetero- unknown attacks. In order to overcome this problem, Chen geneous attention mechanism. Therefore, Jung et al. et al. ensure the neural network to learn more robust propose a heterogeneous stacking graph attention layer to feature embeddings using large margin cosine loss (LMCL) model artefacts spanning temporal and spectral segments function and online frequency masking augmentation. The with a heterogeneous attention mechanism, named AASIST. generalization ability of detection models is increased by The AASIST outperforms the current state-of-the-art end- using LMCL and applying data augmentation. Zhang et to-end models. Furthermore, a proposed lightweight vari- al. use one-class learning to deal with unknown fake ant called AASIST-L obtains competing performances. attacks, the key idea of which is to construct a compact These methods perform reliably under seen encoding and representation of genuine audio representation and utilize transimission conditions but unreliably for unknown tele- an angular margin to separate the fake utterances in the phony scenarios. In order to alleviate the problem, Tak et al. embedding space. This method outperforms all previous propose a model based on RawGAT-ST and RawNet2 existing single systems on the evaluation set of the LA systems, named RawBoost, which is a data boosting and task in ASVspoof 2019 challenge, without any data aug- augmentation technique based upon the combination of mentation methods. These methods address the difculties, linear and non-linear convolutive noises as well as im- to some degree, in detecting unknown attacks in practical pulsive and stationary additive noises that can be applied use. However, the compactness of bona de utterances in directly to raw audio. In addition, the AASIST does not the embedding space lacks consideration of the diversity of optimize the parameters of the sinc-conv during training, speakers. Ding et al. propose speaker attractor multi- which limited its performance. Therefore, Wang et al. center one-class learning (SAMO) to address the problem. employ orthogonal regularization in the Sinc-conv layer of The core idea of the SAMO is that real utterances are the AASIST, which is called Orth-AASIST, outperforming clustered around a number of speaker attractors and the the AASIST based model. method pushes away fake voices from all the attractors in a high-dimensional embedding space. 5.5 DARTS based models The aforementioned end-to-end methods are encouraging 6.2 Continual Learning and promising. However, they can only automatically learn Continual learning focuses on the continuous training and features and network parameters rather than network archi- adaptation of models on new information, aiming to over- tecture. Therefore, Ge et al. try to employ an automatic come catastrophic forgetting existing in ne-tuning. In order approach, which not only operates directly upon the raw to improve the performance to unseen deepfake audio, Ma speech signal but also jointly optimizes of both the net- et al. propose a regularization based continual learning work architecture and network parameters. The appoarch is method, named Detecting Fake Without Forgetting (DFWF), implemented based upon partially-connected differentiable to make the model learn new fake attacks incrementally. architecture search from the raw audio waveform (Raw PC- This method doesn’t need to access old data but can ensure DARTS). the model remember previous information. It also improves the detection performance on the new dataset and over- 5.6 Transformer based models comes catastrophic forgetting by introducing regularization. In order to modelling local and global artefacts and relation- However, the approximation of the DFWF may result in ship directly on raw audio, Liu et al. propose a model error accumulation in continual learning, leading to de- named Rawformer composed of convolution layer and teriorating learning performance. Most recently, Zhang et transformer to detect fake utterances. The Rawformer gen- al. propose a continual learning algorithm for fake eralizes better than the AASIST on cross-dataset evaluation. audio detection to solve this problem, called regularized JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 13 TABLE 5 Features and classiers of the top-3 submitted systems for each task in ASVspoof and ADD competitions. The performance of each independent task is evaluated in terms of the EER (%). Top 1 Top 2 Top 3 Competition Year Task EER (%) Features Classiers EER (%) Features Classiers EER (%) Features Classiers MFCC, MFCC, MFPC, Mahalanobis 2015 LA 1.21 GMM 1.97 SVM 2.53 s-vector CFCCIF CosPhase Distance LPS, LCNN, CQCC, MFCC, GMM, GBDT, MFCC, IMFCC, PLP, GMM 2017 PA 6.73 12.34 14.03 LPCC GMM PLP SVM, RF LFCC, RFCC, CQCC ANN ResNet, LFCC, CQT, MFCC, IMFCC, GMM, SVM LA 0.22 Cep 1.86 LCNN 2.64 MobileNet LPS SCMC, CQCC, Raw CNN, CRNN 2019 CQT, MGD, GD, LFCC, CQT, LFCC, ASVspoof PA 0.39 ResNet 0.54 ResNet 0.59 LCNN CQTMGD Cep, IMFCC Cep Mel spec, LCNN, ResNet, LA 1.32 2.77 LFCC ResNet 3.13 Mel Spec SincNet,Raw ResNet SENet VAE, LEAF, LCNN, 2021 PA 24.25 LPS 26.42 27.59 SincNet ResNet GMM Mel spec ResNet Mel spec, LCNN, ResNet, DF 15.64 16.05 Fbank 18.30 CQT LCNN SincNet, Raw ResNet MLP ResNet, CQT, LCNN, LF 21.70 XLS-R DNN 23.00 Log Fbank 23.80 SENet Mel spec