Consultants Support Transformers Part 2 PDF
Summary
This document is an overview of transformers, explaining their architecture and function. It covers the attention mechanism, the different transformer architectures, pre-training, and fine-tuning, with examples, and aims to summarize the essential details.
Full Transcript
“Attention is All You Need”: Transformers

Agenda
1. What is a Transformer? The magic of the Attention Mechanism
2. The different Transformer architectures
3. Pre-training and Foundation Models
4. Specialization of Foundation Models (especially LLMs)

NLP back in the days: Convolutional Neural Networks for Sentence Classification, Yoon Kim, 2014, https://arxiv.org/pdf/1408.5882.pdf
NLP back in the days: https://medium.com/saarthi-ai/elmo-for-contextual-word-embedding-for-text-classification-24c9693b0045
NLP back in the days

What do we want?
- Process sequences (ideally the entire sentence)
- Easy to distribute on multiple GPUs
- Faster training than with RNNs
- Initially for NLP tasks
- Allows training huge models on gigantic datasets
- Allows a pre-training phase that pools training (at least partially) across multiple tasks

The King is dead. Long live the King!
Size evolution of transformers

Part 1: What is a Transformer? The magic of the Attention Mechanism

Example of an NLP system
Feature extraction reminder: feature extraction, then predict/decide
Feature extraction with Transformers: feature extraction, then predict/decide
Vanilla Transformer architecture
Transformer layer simplified
Feed-forward layer
Transformer layer
Vanilla Transformer architecture summary
Attention explained: parts 1 (also animated and full versions), 2 and 3 (detailed)
Intuition behind the attention mechanism: parts 0 to 3 (detailed)
Multi-head attention explained: parts 1 to 3 (a code sketch of the attention computation follows at the end of this section)

Part 2: The different Transformer architectures

Transformer architectures
Encoder architecture: BERT / encoder / auto-encoding
Decoder architecture: GPT / decoder / auto-regressive
Bidirectional vs unidirectional attention: bidirectional attention for the encoder, unidirectional attention for the decoder
Masked attention
Unidirectional attention detailed
Encoder-decoder architecture: T5 / encoder-decoder / sequence-to-sequence
Encoder-decoder translation example
Encoder-decoder Whisper example
Encoder-decoder attention: parts 1 to 3
Misunderstanding about Mixture of Experts
Mixture of Experts transformer architecture
Sparse feed-forward layer
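The attention and masked-attention slides above are figures in the original deck. As a rough companion, here is a minimal sketch of scaled dot-product attention, including the causal mask that turns bidirectional (encoder-style, BERT-like) attention into unidirectional (decoder-style, GPT-like) attention. Names, shapes and the toy input are illustrative, not taken from the deck.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=False):
    # q, k, v: (seq_len, d_k) query/key/value projections of the token embeddings
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # how well each query matches each key
    if causal:
        # decoder-style (unidirectional) attention: a position may only look
        # at itself and earlier positions, never at future tokens
        seq_len = scores.size(-1)
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)             # attention weights, one row per query
    return weights @ v                              # weighted mix of the value vectors

# toy example: 5 tokens with 8-dimensional projections
x = torch.randn(5, 8)
encoder_style = scaled_dot_product_attention(x, x, x)               # bidirectional (BERT-like)
decoder_style = scaled_dot_product_attention(x, x, x, causal=True)  # unidirectional (GPT-like)
```

Multi-head attention simply runs several such attention computations in parallel on different learned projections and concatenates the results.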
Part 3: Pre-training and Foundation Models

Training a language model
"I want to make a chatbot to help people better understand the civil code."
"I want to make a bot which can filter out respectful comments for a real Reddit experience."

Training a language model: what do you need to train a large language model?
- A truckload of data: transformers are ravenous, you need to feed them a substantial and hard-to-come-by dataset.
- A mighty compute infrastructure: GPUs with high throughput and large VRAM can execute this training in a reasonable amount of time.
- A copious amount of electricity: storage, memory and compute power consume a lot of electricity.
- An abundance of manpower: cleaning the dataset, running experiments and monitoring SOTA advancements is a lot of work.

Training a language model
PRE-TRAINING produces a model that understands the language in general but is not particularly good at any task; a lot of work, data and money is required.
FINE-TUNING then specializes it ("I detect toxicity", "I summarize long texts", "I generate high quality clickbaits"); a lower amount of work, data and money is required.

Pretraining a GPT-style transformer
Input: "A transformer is a deep learning model that adopts the mechanism of self". What do we do with this output? Same embedding.

Pretraining a GPT-style transformer: the output vs the classification target
Model output / target: Cheese 3% / 0; Attention 13% / 0; Model 56% / 1; Calf 0.04% / 0; Muscle 0.05% / 0; Mercenary 6% / 0

Pretraining a GPT-style transformer: next word prediction
Input / target: A / transformer; transformer / is; is / a; a / deep; deep / learning; learning / model; model / that; that / adopts; adopts / the; the / mechanism; mechanism / of; of / self; self / attention

Pretraining a BERT-style transformer: masked words prediction
Sample: "A transformer is a deep learning model that adopts the mechanism of self attention"
Input: [CLS] A [MASK] is a deep learning model that [MASK] the mechanism of self [MASK] (~15% of the words are masked)
Target: the masked words (transformer, adopts, attention)

Finding a pretrained model: the largest free database of pretrained transformer models
1. Go to https://huggingface.co/
2. Look for a model
3. Download the model
4. Enjoy
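The next-word and masked-word objectives above appear in the deck as tables. The following is a deliberately simplified sketch of how such training pairs could be built in Python; real pipelines work on subword tokens, and BERT's masking scheme also keeps or randomly replaces some of the selected tokens.

```python
import random

tokens = ("A transformer is a deep learning model that adopts "
          "the mechanism of self attention").split()

# GPT-style pretraining: at each position, predict the next word from the prefix
next_word_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["A"], "transformer"), (["A", "transformer"], "is"), ...

# BERT-style pretraining: mask ~15% of the words and predict the hidden ones
masked_input, targets = [], []
for tok in tokens:
    if random.random() < 0.15:
        masked_input.append("[MASK]")
        targets.append(tok)
    else:
        masked_input.append(tok)
print(masked_input, targets)
```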
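Steps 2 to 4 of the Hugging Face recipe above are usually done programmatically with the `transformers` library rather than by hand. A minimal sketch, assuming `transformers` is installed and letting the library pick its default English sentiment checkpoint:

```python
from transformers import pipeline

# downloads a pretrained (and already fine-tuned) model from https://huggingface.co/ on first use
classifier = pipeline("sentiment-analysis")
print(classifier("Star Wars The Last Jedi is really bad!"))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
```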
Part 4: Specialization of Foundation Models (especially LLMs)

Fine-tuning of language models
"My tailor is rich!" Great, now assign a number of stars to this film review: "Star Wars The Last Jedi is really bad! The story is awful. And so are the characters. What a boring pile of garbage!"

Fine-tuning a BERT: sentence classification
[CLS] Star Wars The Last Jedi is really bad! Target: 0

Fine-tuning a GPT: sentence classification
Star Wars The Last Jedi is really bad! Target: 0

Fine-tuning a transformer: Named Entity Recognition
Sentence: "Zidane came to Paris to watch the Paris Saint-Germain"
Softmax output / target: Person 0.45 / 1; Organization 0.12 / 0; Location 0.27 / 0; None 0.16 / 0
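The sentence-classification slides are diagrams in the original deck. As a hedged sketch of one way such a fine-tuning step can be written with the Hugging Face `transformers` API (the checkpoint name, label count and learning rate are illustrative choices, not taken from the deck):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# one fine-tuning step on the review from the slides (label 0 = negative)
batch = tokenizer("Star Wars The Last Jedi is really bad!", return_tensors="pt")
outputs = model(**batch, labels=torch.tensor([0]))  # classification head sits on top of [CLS]
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```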
Fine-tuning a GPT with prompting
Review: "This film is really trash!"
Template: {{ REVIEW }} This review is (positive, negative or neutral):
Output / target: Positive 0.18 / 0; Negative 0.44 / 1; Neutral 0.38 / 0

Fine-tuning a GPT: example of summarization
Input: several paragraphs of Lorem ipsum placeholder text, fed to the generation head
Template: {{ INPUT }} TL;DR:
Target: "She dumped me because I took a sip before toasting."

Orienting the style with templates
System: "Speak as if you were a pirate."
Input: "What is the Cayley-Hamilton theorem?"
Template: {{ SYSTEM }} {{ INPUT }}
Generation head output: "Ahoy, me hearties! I be hearin' 'bout a mathematical concept called the Cayley-Hamilton theorem. It's a fancy way o' sayin' that every square matrix is a root of its own characteristic equation, which is a polynomial equation that describes the matrix's properties. This theorem has important implications in linear algebra and matrix theory, as it allows for the calculation of matrix powers and inverses in certain cases. So, if you ever find yourself in a sea of matrices, remember the Cayley-Hamilton theorem and you'll be sailing smoothly!"
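The prompting slides above all share the same mechanics: a fixed template is filled with the user-supplied fields and the resulting text is handed to the decoder, whose generation head continues it. A minimal sketch of that substitution; the `render` helper is hypothetical, while the template strings are the ones shown on the slides.

```python
def render(template: str, **fields: str) -> str:
    # fill {{ NAME }} placeholders with the supplied values
    for name, value in fields.items():
        template = template.replace("{{ " + name.upper() + " }}", value)
    return template

summarization_prompt = render("{{ INPUT }} TL;DR:", input="...long article text...")
style_prompt = render(
    "{{ SYSTEM }} {{ INPUT }}",
    system="Speak as if you were a pirate.",
    input="What is the Cayley-Hamilton theorem?",
)
# each rendered prompt is then passed to the model, whose generation head
# continues the text (a summary, or a pirate-flavoured answer)
```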