Full Transcript


OFFENSIVE AI LECTURE 8: VOICE DEEPFAKES & RECORD TAMPERING
Dr. Yisroel Mirsky ([email protected])

Today's Agenda
- Mouth re-enactment
- Voice synthesis
- Voice cloning
- Attack automation
- Record tampering

Mouth Re-enactment (dubbing)

Mouth Re-enactment: Attack Goals
- Misinformation: "Opposition leader Bob is right!", "I hate ...", "Today TESLA stocks fell 10%"
- Social engineering: "Hey Joe, on which server do we keep the credentials?", "Turn off the firewall for 10 minutes, I'm doing some tests"

Mouth Re-enactment: General Approach
In-painting: the mouth region of the original frame is masked, and a model in-paints it according to a driving signal.
[Diagram: original frame -> masked frame -> in-painting model driven by the driving signal -> in-painted frame]

Mouth Re-enactment: The Pipeline
[Diagram: the target frame x_t is extracted and pre-processed into a masked frame x_t^m; the source (driver) provides audio a_s and landmarks l_s, l_t; the generation model M (optionally given a reference) produces the output, which is then post-processed.]

Mouth Re-enactment: Audio Representations: Framing
Audio x_a: "hello – my-name---is-Joe-and you?"
- Entire clip: only if short (seconds)
- Per frame (milliseconds): non-overlapping or overlapping frames

Mouth Re-enactment: Audio Representations: Direct vs. Indirect
- Direct (amplitude): the raw samples, e.g., x_a = [-12, 230, 118, -60, ...]
- Indirect:
  - Fourier transform: summarizes the frequencies in a single frame
  - Spectrogram: summarizes the frequencies over a number of frames

Mouth Re-enactment: Audio Representations: Fourier
Theory: any periodic signal can be decomposed into a sum of sine waves, each with a different amplitude and phase.
- Periodic: we use frames and "pretend" that each segment is periodic
- Amplitude: the energy level of a wave
- Phase: the offset of a wave
- Frequency is measured in Hz (repetitions per second). Example: a 1 kHz frequency oscillates 1000 times a second.

Mouth Re-enactment: Audio Representations: Fast Fourier Transform (FFT)
- An efficient algorithm that maps a signal to its frequencies: FFT: X -> F, with X in R^n and F in C^n
- n must be a power of 2 (128, 512, 1024, ...) or the signal is padded
- Complex values are used for the frequencies since they capture both phase and amplitude (magnitude): F = numpy.fft.fft(x)
- Example: if FFT[i] = 1.3 - 0.8i, then the phase is angle(FFT[i]) = arctan(-0.8 / 1.3) ≈ -0.55 rad and the magnitude is |FFT[i]| = sqrt(1.3^2 + 0.8^2) ≈ 1.53

Mouth Re-enactment: Audio Representations: More Important Facts
- An inverse FFT (iFFT) reverses the mapping F -> X
- For the FFT of a real-valued signal (e.g., audio), only the first n/2 values are used (the rest are a mirror)
- Nyquist theorem (simplified): the highest frequency captured in an FFT of x is f_s / 2, where f_s is the sample rate of x (microphones record at around 44 kHz)
- The highest frequency in the FFT of a real signal is found at index n/2
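To make the FFT facts above concrete, here is a minimal numpy sketch; the 16 kHz sample rate and the 1 kHz test tone are illustrative assumptions rather than values from the lecture:

```python
import numpy as np

fs = 16_000                        # assumed sample rate (Hz)
n = 1024                           # frame length, a power of 2
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 1000 * t)   # toy frame: a 1 kHz tone

F = np.fft.fft(x)                  # complex spectrum, same length as x
magnitude = np.abs(F)              # amplitude of each frequency bin
phase = np.angle(F)                # phase (radians) of each frequency bin

# For a real-valued signal only the first n/2 bins are informative;
# the remaining bins mirror them.
half = magnitude[: n // 2]

# Bin k corresponds to frequency k * fs / n, so the highest usable
# frequency (at index n/2) is fs / 2, the Nyquist frequency.
freqs = np.fft.fftfreq(n, d=1 / fs)[: n // 2]
print(freqs[np.argmax(half)])      # prints 1000.0, the dominant tone
```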
Mouth Re-enactment: Audio Representations: Spectrogram
- A visualization of the frequency magnitudes in x varying over time (frequency vs. time)
- Audible range for humans: 100-20,000 Hz; voice range for humans: 100-3,200 Hz
- Humans have a hard time distinguishing between high frequencies
- The Mel scale 'normalizes' the spectrogram to the human hearing scale: an exponential scale that compresses the high range (like human hearing)
https://andrelucascourant.medium.com/mel-frequency-cepstral-coefficients-ae245a9709ba
https://youtu.be/4_SH2nfbQZ8

Mouth Re-enactment: Audio Representations: Mel Frequency Cepstrum Coefficients (MFCC)
- Features that capture the presence of the perceived harmonics in an audio frame: the "spectrum of a spectrum" (cepstrum)
- Takes the amplitudes of the frequencies most important to human hearing
- Of all the MFCCs, we usually keep only the first ~13-20: python_speech_features.mfcc()
Jamaludin A, et al. You Said That?: Synthesising Talking Faces from Audio. 2017

Mouth Re-enactment (dubbing): Many-to-Many Speech-Driven Animation
Turing test: https://docs.google.com/forms/d/e/1FAIpQLSftFTMoCmNl6evECx4LaoPqIKgZoRo1pB7GrsCmsRXDQij4Xg/viewform

Voice Synthesis

Voice Synthesis: Attack Goals
- Evasion
- New identity
- Voice masking
Tan X, et al. A Survey on Neural Speech Synthesis. 2021

Voice Synthesis: Popular Approach: Text to Speech (TTS)
- Converts text (e.g., "Hello, my name is Joe") to audio
- There are three main components in TTS, each of which can be a DNN model
Tan X, et al. A Survey on Neural Speech Synthesis. 2021

Voice Synthesis: TTS Linguistic Features
Can't we just input the characters into an RNN? Consider "Please turn on the light":
- Which 'e' sound should be used?
- Is it the compound 'reh-neh' or 'ern'?
- Are we supposed to hear the 'g'?
- Which 'i' sound?
Tan X, et al. A Survey on Neural Speech Synthesis. 2021

Voice Synthesis: TTS Linguistic Features: Phonemes
Why give the model useless information regarding the sound of the text?
- Special cases such as unspoken letters: "light" sounds like "lite"
- Complex rules such as vowels: "lite" should be encoded as "l-I-y-t"
- The same characters have different sounds in different contexts
- Multi-character sounds: "thanks", "photon"
Phonemes are encoded using the IPA alphabet.
Tan X, et al. A Survey on Neural Speech Synthesis. 2021

Voice Synthesis: Text-to-Speech (TTS) Pipeline
"hello" -> Text Analysis -> hə-loʊ -> Acoustic Model -> spectrogram -> Vocoder -> audio signal
Each stage is usually a specialized DNN.
Tan X, et al. A Survey on Neural Speech Synthesis. 2021

Voice Synthesis: TTS Design Patterns
Each method combines many prior works and techniques in its pipeline. Evolution of DNN TTS (Shai Shalev-Shwartz).

Voice Synthesis: TTS Durations
Problem: models generate arbitrary durations or have uniform durations. Example: Glow-TTS + HiFi-GAN.
Kim J, et al. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. 2021

Voice Synthesis: TTS State of the Art: an end-to-end model
Sample text: "that not more than one bottle of wine or one quart of beer could be issued at one time. No account was taken of the amount of liquors admitted in one day."
[Audio comparison: Ground Truth vs. Tacotron 2 + HiFi-GAN vs. VITS]
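To recap the text analysis -> acoustic model -> vocoder pipeline shown a few slides back, here is a minimal toy sketch in Python; all three stage functions are placeholders invented for illustration (a real system would use a G2P frontend, a DNN acoustic model such as Glow-TTS, and a neural vocoder such as HiFi-GAN):

```python
import numpy as np

def text_analysis(text):
    # Placeholder G2P: a real frontend would map "hello" to phonemes like "h ə l oʊ".
    return list(text.lower())

def acoustic_model(phonemes, n_mels=80, frames_per_phoneme=10):
    # Placeholder acoustic model: emits a random mel spectrogram with a fixed
    # duration per phoneme (real models predict both durations and content).
    return np.random.rand(n_mels, frames_per_phoneme * len(phonemes))

def vocoder(mel, hop=256):
    # Placeholder vocoder: noise shaped by each frame's energy (a real vocoder
    # such as HiFi-GAN reconstructs the waveform from the spectrogram).
    energy = np.repeat(mel.mean(axis=0), hop)
    return energy * np.random.randn(energy.size)

def text_to_speech(text):
    phonemes = text_analysis(text)   # text -> linguistic features / phonemes
    mel = acoustic_model(phonemes)   # phonemes -> mel spectrogram
    return vocoder(mel)              # mel spectrogram -> waveform samples

audio = text_to_speech("Hello, my name is Joe")
print(audio.shape)                   # one float per output audio sample
```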
Voice Cloning

Voice Cloning: Attack Goals
- Scams: "Mom, I need help!", "Transfer that 100k right away!"
- Authentication: "Alexa, unlock the front door", "I'm Robert, send my new SIM to..."
- Social engineering: "The levels are too low...", "What's the IP of our portal?"

Voice Cloning: examples from 2019 and 2021

Voice Cloning: Common Methods
1. Text to Speech (TTS): teach a TTS system to use a specific individual's voice
2. Speech to Speech (Voice Conversion): train a model to modify the audio's style, not its content
3. Replay: cut and paste the victim's words from another source (e.g., splicing "Hello" and "Joe" into "Hello Joe")
https://youtu.be/VnFC-s2nOtI
https://youtu.be/spg2NMoKYU8

Voice Cloning via TTS: Services
- Available since 2018; by 2020 the quality was improving...
- Requires about 10 minutes of audio
https://youtu.be/0fO7CBDMGNA

Voice Cloning via TTS: Services
Are they used by attackers? Almost certainly.
- Many require accepting terms of use ("use only voices you have a right to")
- Others will only train on a reading script
- But... an attacker could collect the words from past recordings (Corridor Digital)
Jia Y, et al. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. 2019

Voice Cloning via TTS: Zero-shot (only 3 seconds!)
- WaveNet is used as the vocoder; attention is used to help the network align the text sequence to the audio
Jia Y, et al. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. 2019

Voice Cloning via TTS: Zero-shot (only 3 seconds!)
- Fake and real identities fall close together in the embedding space

Voice Cloning via TTS
Remember the CEO that was scammed in 2019? Here is the actual recording.

Voice Cloning via TTS
Demo of a (fictitious) deepfake phone scam by Protocol, streamed on SoundCloud.

Voice Cloning via Voice Conversion (VC)
- TTS can't capture expression or emotion
- Voice conversion transfers the 'style' of one recording to the 'content' of another: different timbre, different accent, different emotion, ...
- Voice cloning can be accomplished with voice conversion
- a_t: target voice (content); a_s: source voice (style, timbre, ...); a_g: converted voice
https://youtu.be/glwgybvxk-0

Voice Cloning via VC: Services (2022)
https://youtu.be/AALf9w37COM
Huang T, et al. How Far Are We from Robust Voice Conversion: A Survey. 2021

Voice Cloning via VC: Two common approaches for many-to-many conversion
1. Content-style disentanglement (encoder-decoder)
2. Conditional GANs
Huang T, et al. How Far Are We from Robust Voice Conversion: A Survey. 2021

Voice Cloning via VC: Disentanglement Approach
- An encoder-decoder that disentangles content from timbre; the timbre is transferred as 'style'
- Identity is removed from the content using instance normalization
- A discriminator ensures the content holds no identity
Rebryk Y, et al. ConVoice: Real-Time Zero-Shot Voice Style Transfer with Convolutional Network. 2020

Voice Cloning via VC: Zero-shot Example: ConVoice
[Audio examples: source voices a_s, target voices a_t, and the generated conversions a_g. Cross-gender VC is hard.]

Voice Cloning via VC: Conditional GAN Approach
Recall CycleGAN:
- Two generators, H_ab and H_ba, map between domains a and b, and two discriminators, D_a and D_b, judge the samples in each domain
- The cycle-consistency loss means there is no need for paired training data (unlike pix2pix)
...but what if there is more than one domain, e.g., more than two voices?
Kaneko T, et al. StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion. 2019
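Before the StarGAN-VC2 extension, here is a minimal sketch of the cycle-consistency idea just described: mapping a sample to the other domain and back should reproduce it, which is what removes the need for paired (parallel) training data. The generators here are hypothetical placeholders, not trained networks, and the L1 penalty is one common choice of reconstruction error:

```python
import numpy as np

def cycle_consistency_loss(x_a, x_b, H_ab, H_ba):
    """L1 reconstruction error after a round trip through both generators."""
    x_a_rec = H_ba(H_ab(x_a))   # domain a -> b -> a
    x_b_rec = H_ab(H_ba(x_b))   # domain b -> a -> b
    return np.mean(np.abs(x_a - x_a_rec)) + np.mean(np.abs(x_b - x_b_rec))

# Toy usage with identity "generators" (so the loss is 0 by construction).
x_a = np.random.rand(80, 100)   # e.g., a mel spectrogram from speaker A
x_b = np.random.rand(80, 100)   # e.g., a mel spectrogram from speaker B
print(cycle_consistency_loss(x_a, x_b, lambda x: x, lambda x: x))
```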
Voice Cloning via VC: Conditional GAN Approach: StarGAN-VC2
StarGAN-VC2 extends the conditional GAN approach to many voices (domains) with a single generator.
Kaneko T, et al. StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion. 2019

Voice Cloning via VC: Real-Time Voice Cloning Framework
- The delay depends on (1) the buffer size b [protection] and (2) the end-to-end processing time [MFCC, GAN, ...]
[Diagram: audio buffer in -> wave-to-MFCC/spectrogram -> conversion model -> MFCC-to-wave (vocoder) -> audio buffer out; b is the buffer size, and x_i^s, x_i^g are the source and generated frames.]
Huang T, et al. How Far Are We from Robust Voice Conversion: A Survey. 2021

Voice Cloning via VC: Subjective Evaluation (many-to-many)
- Comparing real to fake: in the majority of cases, volunteers have doubt
- Measuring the quality of fakes: without a comparison to a real sample, anomalies are missed
In practice:
- Victims are not 'listening' for a fake voice
- Victims will accept stronger anomalies when under duress

Voice Cloning: where we are today...

Automation: STT-LLM-TTS

Vishing
What is vishing?
- Voice phishing
- The impersonation of legitimate companies, government agencies, or other entities to create a sense of urgency or fear, prompting the victim to act quickly.
How it works:
1. Initiation: the attacker calls the victim, typically using a spoofed caller ID to appear more credible.
2. Manipulation: through persuasive language and social engineering techniques, the attacker convinces the victim to provide sensitive information or perform actions that compromise security (e.g., transferring funds, revealing passwords).
3. Exploitation: the obtained information is then used for fraudulent activities, identity theft, or unauthorized access to accounts.

Vishing Scenario: The Amazon Customer Service Impersonation
1. Initiation: a call from "Amazon Customer Service"; the caller ID might even display "Amazon" or a convincing phone number.
2. Manipulation:
   - Urgent problem presented: the victim is informed of suspicious activity on their Amazon account, such as an unauthorized purchase of an expensive item; the caller creates a sense of urgency to prevent financial loss.
   - Request for information: the caller requests sensitive information to fix the issue, e.g., Amazon login credentials, credit card numbers, or remote access.
3. Exploitation: the information is used to commit fraud, or it is sold to third parties.

A Dark Future
Today this is all done manually by malicious call centers.

A Dark Future
What if it were automated?
- Scalability = mass exploitation
- All of the required technologies exist!
[Example exchange: STT hears "Hello? Who is this?" and TTS answers "This is Amazon customer service, ..."]

A Dark Future: The Large Language Model (LLM)
- LLMs read sequences of tokens (words) and generate sequences of tokens.
- Modern models are pretrained on massive text corpora, then fine-tuned for a specific task.
- Chat-based LLMs are specialized for text conversations, simulating human-like dialogue.
The double-edged sword of AI: an attacker can...
- download an existing pretrained model (Mistral, GPT-2, LLaMA 2, ...), or
- pay for API access to a service (ChatGPT-4 Turbo, Google Gemini, ...)
AI safeguards aren't perfect.
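To make the automation threat concrete without tying it to any real service, here is a minimal sketch of the STT-LLM-TTS loop described above; every component (listen, transcribe, generate_reply, synthesize, play) is a hypothetical placeholder, not an actual API:

```python
def conversation_loop(listen, transcribe, generate_reply, synthesize, play, turns=3):
    """Run a fixed number of spoken dialogue turns by chaining STT, an LLM, and TTS."""
    history = []
    for _ in range(turns):
        audio_in = listen()                  # capture the other party's audio
        text_in = transcribe(audio_in)       # STT: audio -> text
        history.append(("caller", text_in))
        text_out = generate_reply(history)   # LLM: dialogue history -> next utterance
        history.append(("agent", text_out))
        play(synthesize(text_out))           # TTS: text -> audio (possibly a cloned voice)
    return history
```

Each stage is an off-the-shelf component, which is exactly why the slides frame automation as a scalability problem.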
Record Tampering

Record Tampering: Example Attack Motivations
- Money: fraud (hide tampering*), ransom, blackmail
- Crime: court evidence, surveillance (evasion)
- Damage: medical records, logs
*Schreyer M, et al. Adversarial Learning of Deepfakes in Accounting. 2019

Record Tampering: Common Methods
- Refine a tampered sample: tamper with the record manually, then use a GAN to refine it (hide anomalies/artifacts)
- Style transfer: transfer from a sample (zero-shot, AdaIN network) or transfer from a domain (e.g., CycleGAN)
- Modify attribute encodings: StyleGAN, VAE
- Inpainting: masking or semantic

Record Tampering: Inpainting
Definition: the task of filling in missing content.
Isola P, et al. Image-to-Image Translation with Conditional Adversarial Networks. 2016

Record Tampering: Inpainting: Pix2Pix Approach
- The generator G receives the masked record x^m together with its mask and in-paints it to produce x_g
- The discriminator D judges samples of images together with their masks as real or fake, so it knows where to look
Mirsky Y, et al. CT-GAN: Malicious Tampering of 3D Medical Imagery using Deep Learning. 2019

Record Tampering: Medical Tampering
1. Investigation: the patient receives a CT scan (2D slices of the 3D body, stored as DICOM files); the scan is saved on a storage (PACS) server over an Ethernet network. This is where the attacker injects or removes evidence.
2. Diagnosis: a radiologist downloads and analyzes the scan and writes a report.
3. Treatment: the report is sent to the referring doctor (oncologist, neurologist, ...) to plan the treatment and next steps.
Mirsky Y, et al. CT-GAN: Malicious Tampering of 3D Medical Imagery using Deep Learning. 2019

Record Tampering: Attacker's Motivation
- Psychological: traumatization or a change in life course (affect politics, remove a leader/boss, terrorism, ...)
- Physical: the patient receives harmful unneeded treatment or a biopsy, or does not receive needed treatment (murder, assassination, terrorism, ...)
- Monetary: sabotage/falsify research, ransomware, insurance fraud, ...
Mirsky Y, et al. CT-GAN: Malicious Tampering of 3D Medical Imagery using Deep Learning. 2019

Record Tampering: CT-GAN

A Comment on Diffusion Models
[Image examples generated from prompts such as "An astronaut on a horse" and "into a cyborg robot"]

Recommended Reading: Week 8
ConVoice: Real-Time Zero-Shot Voice Style Transfer
https://arxiv.org/pdf/2005.07815.pdf
https://rebryk.github.io/convoice-demo/
