Mathematics of Transformers - GPT Meets Game Theory PDF

Document Details


Université du Québec à Trois-Rivières

Hamidou Tembine

Tags

game theory machine learning transformers mathematics

Summary

This document is a lecture presentation on the mathematics of transformers and game theory. It covers interactive decision-making, behavioral types and solution concepts in cooperative and non-cooperative games, the transformer architecture (attention, normalization, feed-forward blocks), transformer variants and their training, and applications to backcasting, nowcasting, and forecasting.

Full Transcript


Mathematics of Transformers
Hamidou Tembine
Laboratory of Signal and System Integration, Department of Electrical and Computer Engineering, University of Quebec in Trois-Rivieres, Canada, and Learning and Game Theory Laboratory, TIMADIE
GPT Meets Game Theory

Interactive Decision-Making
Decision-makers, information, choices and preferences. Decision-makers, strategies, and outcomes.

More variety of games...
zero-sum game theory / robust game theory / distributionally robust game theory
algorithmic game theory / computational game theory
repeated game theory / multi-stage game theory
random matrix game theory / stochastic game theory
quantum game theory
sequential dynamic game theory / difference game theory
differential game theory
evolutionary game theory / mean-field game theory
psychological game theory / experimental game theory
coopetitive game theory
mean-field-type game theory

Various behaviors
Include various categories of decision-makers: low to medium to high rationality, irrationality, self-abnegation, partial altruism, selflessness, selfishness, partial cooperation, maliciousness/spite, risk-sensitivity, risk-neutrality, other-regarding preferences, belief- and/or conjecture-dependence.

Noncooperative / cooperative games are extreme cases
It is not a good idea to classify games as either cooperative or non-cooperative:
in a noncooperative game, all the decision-makers have to be selfish;
in a fully cooperative game, all the decision-makers have to be fully cooperative.
In practice, a well-mixed behavior is often observed, and cooperative behaviors can emerge in noncooperative games.

Some Solution Concepts
Cournot, Pareto solution, Ross, von Neumann, Stackelberg solution, Nash, Wardrop, Berge solution, empathetic solution, conjectural variation, risk-sensitive solution, K-strong solution, signal-induced solution.

Coopetition
Empathy structure: $\Lambda \in \mathbb{R}^{n \times n}$.
Empathetic objective function: $R^{\Lambda}_j := \lambda_{jj} R_j + \sum_{i \in N_j} \lambda_{ji} R_i$.
One-hop neighbors of $j$: $N_j = \{ i \in N \mid i \neq j,\ \lambda_{ji} \neq 0 \}$.
Other-regarding payoffs: empathy, antipathy, self-abnegation.

Coopetition
[Diagram: behavior regions by the signs of the empathy weights: selfish ($\lambda_{ji} = 0$, $\lambda_{jj} > 0$), partially selfish ($\lambda_{jj} > 0$), partially altruistic ($\lambda_{ji} > 0$), partially spiteful ($\lambda_{ji} < 0$), selfless ($\lambda_{jj} = 0$), self-abnegating ($\lambda_{jj} < 0$).]

One decision-maker
As if one has one single decision-maker: $\min_{u} L(x, u)$.
Control problem. Example: fully centralized control.

When all decision-makers are selfish
Multiple decision-makers $I = \{1, \ldots, I\}$: decision-maker 1 solves $\min_{u_1} L_1(x, u)$ and decision-maker 2 solves $\min_{u_2} L_2(x, u)$.
Non-cooperative game problem. Example: security problem.
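A minimal numerical sketch of this selfish setting: each decision-maker repeatedly best-responds to the other's current action. The quadratic costs, the constants a_i and c, and the best-response scheme below are illustrative assumptions, not an example from the lecture.

import numpy as np

# Decision-maker i minimizes L_i(u_i, u_-i) = (u_i - (a_i - c * u_-i))^2,
# so its best response is BR_i(u_-i) = a_i - c * u_-i.  For |c| < 1 the
# alternating best responses converge to the Nash equilibrium of this toy game.
a = np.array([1.0, 2.0])   # illustrative cost parameters
c = 0.5                    # coupling between the two decision-makers
u = np.zeros(2)            # initial actions of decision-makers 1 and 2
for _ in range(100):
    u_new = np.array([a[0] - c * u[1], a[1] - c * u[0]])
    if np.max(np.abs(u_new - u)) < 1e-12:
        break
    u = u_new
print(u)   # approximate Nash equilibrium (here u_1 = 0, u_2 = 2)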
When all decision-makers are fully cooperative
Multiple decision-makers $I = \{1, \ldots, I\}$ jointly solve $\min_{u} L(x, u)$.
Cooperative game problem. Example: multi-agent common-objective control.

Adversarial behavior
Multiple decision-makers $I = \{1, \ldots, I\}$: one solves $\min_{u} L(x, u, v)$ while the other solves $\max_{v} L(x, u, v)$.
Adversarial game problem. Example: one malicious/adversarial agent/attacker.

Berge game: mutual support
Two decision-makers $I = \{1, 2\}$: decision-maker 1 solves $\min_{u_1} L_2(x, u)$ and decision-maker 2 solves $\min_{u_2} L_1(x, u)$.
Berge game problem. Example: mutual support.

Stackelberg problem
Two decision-makers $I = \{1, 2\}$. Stage 1 (action): the leader solves $\min_{u_1} L_1(x, u)$. Stage 2 (reaction): the follower solves $\min_{u_2} L_2(x, u)$. In general,
$u_j^{*} \in \arg\min_{u_j(\cdot) \in U_j} \{ L_j(u) : u_i \in BR_i(u_j) \}, \qquad u_i^{*} \in BR_i(u_j^{*})$.
Stackelberg game problem. Example: government and individuals in a society / first and second movers. A small numerical sketch of this leader-follower structure is given after the Strategic Machine Intelligence overview below.

Co-opetitive behavior
Decision-maker 1 competes with decision-maker 2 and cooperates with decision-maker 3: decision-maker 1 solves $\min_{u_1} [\,L_3(x, u) - L_2(x, u)\,]$, decision-maker 2 solves $\min_{u_2} L_2(x, u)$, and decision-maker 3 solves $\min_{u_3} L_3(x, u)$.
Co-opetitive game problem. Example: help the weak. Coalitions.

Partial altruism
Decision-maker 1 solves $\min_{u_1} [\,L_1(x, u) + L_2(x, u)\,]$ and decision-maker 2 solves $\min_{u_2} L_2(x, u)$.
Partial-altruism game problem. Example: partial cooperation.

Self-abnegation
Decision-maker 1 solves $\max_{u_{12}} \min_{u_{11}} L_1(x, u)$ and decision-maker 2 solves $\min_{u_2} L_2(x, u)$.
Self-abnegation game problem. Example: self-destruction.

Strategic Learning
Learning the outcomes of an interactive decision-making problem: strategic learning, distributed strategic learning, fully/partially distributed strategic learning.

From Strategic Learning and Game Theory to Practice

Strategic Machine Intelligence
Introduction to Strategic Machine Intelligence. Brief overview of Machine Intelligence and its impact on business. The role of transformers in generative machine intelligence.

What is Machine Intelligence?
Machine Intelligence (MI) is an advanced area of computer science that allows a machine, device, software, program, code, or algorithm to interact intelligently with its environment: it can collect data, make decisions, perform actions, and develop strategies to maximize its chances of successfully achieving its preferences and objectives.

What is Strategic Machine Intelligence?
Interactive and interdependent multiple Machine Intelligences, or multi-machine-intelligence agents.
Machine Intelligence Technologies: there is no single machine intelligence; there are multiple machine intelligence technologies. Each company/industry has its own machine intelligences to interact with its market and environment.
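As announced above, here is a minimal numerical sketch of the Stackelberg (leader-follower) problem. The follower's quadratic cost, the constant b, and the grid search are illustrative assumptions rather than the lecture's example.

import numpy as np

# Follower: minimizes L2(u1, u2) = (u2 - b*u1)^2, so BR2(u1) = b*u1.
# Leader: anticipates that reaction and minimizes L1(u1, BR2(u1)) over u1.
b = 0.7

def L1(u1, u2):
    return (u1 - 1.0) ** 2 + u2 ** 2   # leader's cost (illustrative)

def BR2(u1):
    return b * u1                      # follower's best response to u1

u1_grid = np.linspace(-2.0, 2.0, 4001)
costs = [L1(u1, BR2(u1)) for u1 in u1_grid]
u1_star = u1_grid[int(np.argmin(costs))]
u2_star = BR2(u1_star)
print(u1_star, u2_star)   # close to the closed form u1* = 1 / (1 + b^2)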
What is Machine Learning?

What is a Neural Network?
A neural network is a computational model inspired by the structure and function of neural units; it comprises interconnected nodes, or artificial neural units, that process and transform information to solve various machine learning and pattern recognition tasks.

What is a Large Learning Model?
A large learning model typically refers to a learning model, often a neural network, that is pre-trained and trained on a substantial amount of data/knowledge to perform various tasks or make risk-aware predictions with high accuracy. Example scale: 16 trillion tokens, 200,000 sequence size, 96 heads, 120 layers. Example task: writing a 400-page book with high-quality image illustrations, tables, and charts.

From Transformer to Game Theory

From Deep Learning to Game Theory

Deep Learning ↔ Game Theory
The classical deep learning problem as a game.
Table: Deep Learning vs Game Theory terminologies
game theory | deep learning
agent | neuron unit
action | weight, bias
objective function | objective function
sub-goal | feature measurement output
game design | enhanced architecture design

Deep Strategic Learning
Tembine, H., Khan, M. A., & Bamia, I. (2024). Mean-Field-Type Transformers. Mathematics, 12(22), 3506.
Djehiche, B., & Tembine, H. (2024). Outcomes of Neural Networks are Nash Equilibria of a Non-Potential Game. In Partial Identification in Econometrics and Related Topics (pp. 57-80). Cham: Springer Nature Switzerland.
Tembine, H. (2020). Game Theory Meets Deep Learning. IEEE SMC.
Gao, J., & Tembine, H. (2018). Distributed Mean-Field-Type Filters for Traffic Networks. IEEE Transactions on Intelligent Transportation Systems.
Gao, J., & Tembine, H. (2018). Bregman Learning for Generative Adversarial Networks. CCDC (Finalist, Best Paper Award).
Gao, J., & Tembine, H. (2016). Distributed Mean-Field-Type Filters for Big Data Assimilation. IEEE International Conference on Data Science and Systems (HPCC-SmartCity-DSS), Sydney, Australia, Dec. 2016, pp. 1446-1453.
Gao, J., & Tembine, H. (2018). Distributionally Robust Games for Deep Generative Learning. IEEE World Congress on Computational Intelligence, Windsor Convention Centre, Rio de Janeiro, Brazil, 8-13 July 2018.
Gao, J., Xu, Y., Barreiro-Gomez, J., Ndong, M., Smyrnakis, M., & Tembine, H. (2018). Distributionally Robust Optimization. In Optimization Algorithms, IntechOpen, Editor Poom Kumam.
Transformer Architecture

Transformer Architecture: the classical one
Attention mechanism. Encoder-decoder layers. Positional encoding.

What is a Transformer?

Multi-Head Self-Attention
Mathematical formulation $O_{l,att}$. Visualization of the attention process.
What is an attention mechanism in a Transformer?

Libraries

from numpy import array
from numpy import random
from numpy import dot
from scipy.special import softmax

Encoder representations of four different words

word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])
# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])

Boltzmann-Gibbs weight matrices

random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))
# generating the queries, keys, and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V
# scoring the query vectors against all key vectors
scores = Q @ K.transpose()
# computing the weights by a softmax operation, scaled by the key dimension
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

Attention

attention = weights @ V
print(attention)

The attention mechanism is an interaction between data points.

Layer Normalization
Purpose and benefits. Mathematical explanation $O_{l,nn}$.
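The slide refers to the mathematical explanation of $O_{l,nn}$ without reproducing it. For reference, the standard per-token layer normalization that the LayerNorm code below implements can be written (in standard notation, not necessarily the lecture's) as

$\mathrm{LN}(x) = \gamma \odot \dfrac{x - \mu(x)\,\mathbf{1}}{\sqrt{\sigma^2(x) + \epsilon}} + \beta, \qquad \mu(x) = \dfrac{1}{d}\sum_{k=1}^{d} x_k, \qquad \sigma^2(x) = \dfrac{1}{d}\sum_{k=1}^{d}\bigl(x_k - \mu(x)\bigr)^2,$

with learnable gain $\gamma$, learnable bias $\beta$, and a small $\epsilon > 0$ for numerical stability.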
import torch
import torch.nn as nn

class BatchNorm(nn.Module):
    def __init__(self, size: int, eps: float = 1e-5):
        """Batch Normalization.
        Assumes the shape of the input x is (batch, seq_len, d_model).
        Args:
            size: shape of the feature dimension (i.e. d_model)
            eps: for numerical stability. Defaults to 1e-5.
        """
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(size), requires_grad=True)
        self.beta = nn.Parameter(torch.zeros(size), requires_grad=True)

    def forward(self, x):
        # statistics over the batch and sequence dimensions
        x_var, x_mean = torch.var_mean(x, dim=[0, 1], keepdim=True, correction=0)
        x_std = torch.sqrt(x_var + self.eps)
        x_norm = (x - x_mean) / x_std
        return self.gamma.unsqueeze(0).unsqueeze(1) * x_norm + self.beta.unsqueeze(0).unsqueeze(1)

class LayerNorm(nn.Module):
    def __init__(self, size: int, eps: float = 1e-5):
        """Layer Normalization.
        Assumes the shape of the input x is (batch, seq_len, d_model).
        Args:
            size: shape of the feature dimension (i.e. d_model)
            eps: for numerical stability. Defaults to 1e-5.
        """
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(size), requires_grad=True)
        self.beta = nn.Parameter(torch.zeros(size), requires_grad=True)

    def forward(self, x):
        # statistics over the feature dimension only
        x_var, x_mean = torch.var_mean(x, dim=-1, keepdim=True, correction=0)
        x_std = torch.sqrt(x_var + self.eps)
        x_norm = (x - x_mean) / x_std
        return self.gamma.unsqueeze(0).unsqueeze(1) * x_norm + self.beta.unsqueeze(0).unsqueeze(1)

class RMSNorm(nn.Module):
    def __init__(self, size: int, eps: float = 1e-5):
        """Root-Mean-Square Layer Normalization.
        Assumes the shape of the input x is (batch, seq_len, d_model).
        Args:
            size: shape of the feature dimension (i.e. d_model)
            eps: for numerical stability. Defaults to 1e-5.
        """
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(size), requires_grad=True)

    def forward(self, x):
        # as an alternative, the Frobenius norm can also be used to compute rms
        rms = torch.sqrt((x ** 2).mean(dim=-1, keepdim=True) + self.eps)
        x_norm = x / rms
        return self.gamma.unsqueeze(0).unsqueeze(1) * x_norm

Table: normalization operator types
Type | Formula | Properties
center/std (layer norm) | center by the mean and scale by the standard deviation | issue for constant inputs
RMSnorm | $x \mapsto \frac{x}{\|x\|}\, I_{x \neq 0}$ | not invertible; maps onto the unit sphere
RMS-epsilon norm (design $\epsilon_l > 0$) | $x \mapsto \frac{x}{\sqrt{\epsilon_l + \langle x, x\rangle}}$, inverse $y \mapsto \frac{\sqrt{\epsilon_l}\, y}{\sqrt{1 - \|y\|^2}}$ | maps into the unit ball
$\epsilon$-RMSnorm (design $\epsilon_l > 0$) | | maps into the unit ball
id/(1 + norm) | $x \mapsto \frac{x}{1 + \|x\|}$, inverse $y \mapsto \frac{y}{1 - \|y\|}$ | one-to-one map
CapsuleNorm | $x \mapsto \frac{\|x\|^2}{1 + \|x\|^2}\,\frac{x}{\|x\|}$ | maps into the unit ball

2-Layered FeedForward Neural Network
Weights $W_{1l}, W_{2l}$, biases $b_{1l}, b_{2l}$, and activation function $r_l$:
$y_{li} = W_{2l}\, r_l(W_{1l} x_{li} + b_{1l}) + b_{2l}$.

Transformer block
Input, normalization, attention, add & norm, feed-forward, add & norm. The add & norm implements a state feedback loop:
$\hat O_l = (\mathrm{Id} + O_{l,ff} \circ O_{l,nn}) \circ (\mathrm{Id} + O_{l,att} \circ O_{l,nn})$.
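A minimal PyTorch sketch of this two-sub-layer block structure. It is a sketch only: it uses the standard LayerNorm for $O_{l,nn}$ (the listings later in the lecture use $x \mapsto x/(1 + \|x\|)$ instead), and the class name, dimensions, and activation are illustrative choices.

import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    # O_l = (Id + O_ff . O_nn) . (Id + O_att . O_nn): two normalized residual sub-layers
    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        u = self.norm1(x)                                   # O_nn
        x = x + self.attn(u, u, u, need_weights=False)[0]   # Id + O_att . O_nn
        x = x + self.ff(self.norm2(x))                      # Id + O_ff . O_nn
        return x

block = TransformerBlockSketch()
y = block(torch.randn(2, 10, 64))   # (batch, seq_len, d_model)
print(y.shape)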
Hands-on Session 1: Building a Simple Transformer
Practical implementation of self-attention. Code walkthrough and explanation (afternoon session).
Parameters. Input: $x$.
$y_1 = (\mathrm{Id} + O_{1,ff} \circ O_{1,nn}) \circ (\mathrm{Id} + O_{1,att} \circ O_{1,nn})(x)$,
for $l \in \{2, \ldots, L\}$: $y_l = (\mathrm{Id} + O_{l,ff} \circ O_{l,nn}) \circ (\mathrm{Id} + O_{l,att} \circ O_{l,nn})(y_{l-1})$.

GPT
GPT stands for Generative Pre-trained Transformer. GPT generates text/image/audio/video given a prompt.

GPT Input/Output
We look at the text case here. Input/Output:

def gpt(inputs: list[int]) -> list[list[float]]:
    # inputs has shape [n_seq]
    # output has shape [n_seq, n_vocab]
    output = ...  # neural network output
    return output

Parameters
training data, vocabulary, token size: several trillions
input/prompt sequence length
output sequence length
number of layers
For each layer:
normalization parameters
attention parameters: number of heads, queries, keys, values, weights/biases
normalization parameters: specific to the norm used and possibly weights/biases
2-layer feedforward parameters: activation function and weights/biases (2 sub-layers)

Boltzmann-Gibbs Transformer
Parameters: $L, D, d, k, k', q, H \in \mathbb{N}$; for $1 \le l \le L$:
for $1 \le h \le H$: $Q_{lh}, K_{lh} \in \mathcal{L}(H^d, H^k)$, $V_{lh} \in \mathcal{L}(H^d, H^{k'})$, $W_{lh} \in \mathcal{L}(H^{k'}, H^d)$;
$W_{1l} \in \mathcal{L}(H^d, H^q)$, $b_{1l} \in H^q$, $W_{2l} \in \mathcal{L}(H^q, H^d)$, $b_{2l} \in H^d$.
Input: $y_0 = x_0 \in H^{Dd}$. For $1 \le l \le L$:
input at $l$: $x_l = y_{l-1} \in H^{Dd}$;
for $1 \le i \le D$: $u_{li} = \frac{x_{li}}{1 + \|x_{li}\|}$;
for $1 \le i \le D$: $\tilde u_{li} = \frac{1}{\sqrt{H}} \sum_{h=1}^{H} W_{lh} \sum_{j=1}^{i} \frac{e^{\frac{1}{\sqrt{k}}\langle Q_{lh} u_{li}, K_{lh} u_{lj}\rangle}}{\sum_{j'=1}^{i} e^{\frac{1}{\sqrt{k}}\langle Q_{lh} u_{li}, K_{lh} u_{lj'}\rangle}}\, V_{lh} u_{lj}$;
for $1 \le i \le D$: $\hat u_{li} = x_{li} + \tilde u_{li}$ and $\hat y_{li} = \frac{\hat u_{li}}{1 + \|\hat u_{li}\|}$;
for $1 \le i \le D$: $\tilde y_{li} = W_{2l}\, r_l(W_{1l} \hat y_{li} + b_{1l}) + b_{2l}$ and $y_{li} = \hat u_{li} + \tilde y_{li}$;
output at $l$: $y_l$. Return $y_L \in H^{Dd}$.

Sigmoid Transformer
Same parameters as the Boltzmann-Gibbs Transformer. Input: $y_0 = x_0 \in H^{Dd}$. For $1 \le l \le L$:
input at $l$: $x_l = y_{l-1} \in H^{Dd}$;
for $1 \le i \le D$: $u_{li} = \frac{x_{li}}{1 + \|x_{li}\|}$;
for $1 \le i \le D$: $\tilde u_{li} = \frac{1}{\sqrt{H}} \sum_{h=1}^{H} W_{lh} \sum_{j=1}^{i} \frac{e^{\frac{1}{\sqrt{k}}\langle Q_{lh} u_{li}, K_{lh} u_{lj}\rangle}}{D + e^{\frac{1}{\sqrt{k}}\langle Q_{lh} u_{li}, K_{lh} u_{lj}\rangle}}\, V_{lh} u_{lj}$;
for $1 \le i \le D$: $\hat u_{li} = x_{li} + \tilde u_{li}$ and $\hat y_{li} = \frac{\hat u_{li}}{1 + \|\hat u_{li}\|}$;
for $1 \le i \le D$: $\tilde y_{li} = W_{2l}\, r_l(W_{1l} \hat y_{li} + b_{1l}) + b_{2l}$ and $y_{li} = \hat u_{li} + \tilde y_{li}$;
output at $l$: $y_l$. Return $y_L \in H^{Dd}$.
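A small NumPy sketch of one Boltzmann-Gibbs layer as written above. Since the slide formulas are partly garbled in this transcript, treat it as an interpretation: the activation $r_l$ is taken to be ReLU, the attention weights are computed causally over $j \le i$, and the dimensions and random parameters are illustrative.

import numpy as np

def bg_layer(x, Q, K, V, W, W1, b1, W2, b2):
    # x: (D, d) token block; Q, K: (H, k, d); V: (H, kp, d); W: (H, d, kp)
    D, d = x.shape
    H, k, _ = Q.shape
    u = x / (1.0 + np.linalg.norm(x, axis=1, keepdims=True))       # u_li = x_li / (1 + ||x_li||)
    u_tilde = np.zeros_like(x)
    for i in range(D):
        acc = np.zeros(d)
        for h in range(H):
            q_i = Q[h] @ u[i]
            scores = np.array([q_i @ (K[h] @ u[j]) for j in range(i + 1)]) / np.sqrt(k)
            w = np.exp(scores - scores.max())
            w = w / w.sum()                                          # causal Boltzmann-Gibbs weights
            head = sum(w[j] * (V[h] @ u[j]) for j in range(i + 1))   # mix of value vectors
            acc += W[h] @ head                                       # back to model dimension d
        u_tilde[i] = acc / np.sqrt(H)
    u_hat = x + u_tilde                                              # first residual connection
    y_hat = u_hat / (1.0 + np.linalg.norm(u_hat, axis=1, keepdims=True))
    hidden = np.maximum(W1 @ y_hat.T + b1[:, None], 0.0)             # r_l assumed to be ReLU
    return u_hat + (W2 @ hidden + b2[:, None]).T                     # second residual connection

rng = np.random.default_rng(0)
D, d, k, kp, q, H = 5, 8, 4, 4, 16, 2
y = bg_layer(rng.standard_normal((D, d)),
             rng.standard_normal((H, k, d)), rng.standard_normal((H, k, d)),
             rng.standard_normal((H, kp, d)), rng.standard_normal((H, d, kp)),
             rng.standard_normal((q, d)), rng.standard_normal(q),
             rng.standard_normal((d, q)), rng.standard_normal(d))
print(y.shape)   # (D, d)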
Mixture of Experts Boltzmann-Gibbs Transformer
Parameters: $E = \{1, \ldots, E\}$, $L, D, d, k, k', q, H \in \mathbb{N}$; for $1 \le l \le L$:
for $1 \le h \le H$: $Q_{lh}, K_{lh} \in \mathcal{L}(H^d, H^k)$, $V_{lh} \in \mathcal{L}(H^d, H^{k'})$, $W_{lh} \in \mathcal{L}(H^{k'}, H^d)$;
$W_{1l,e} \in \mathcal{L}(H^d, H^q)$, $b_{1l,e} \in H^q$, $W_{2l,e} \in \mathcal{L}(H^q, H^d)$, $b_{2l,e} \in H^d$, $e \in E$;
$W_{3l} \in \mathcal{L}(H^d, \mathbb{R}^E)$, $b_{3l} \in \mathbb{R}^E$.

Mixture of Experts Boltzmann-Gibbs Transformer (continued)
Input: $y_0 = x_0 \in H^{Dd}$. For $1 \le l \le L$:
input at $l$: $x_l = y_{l-1} \in H^{Dd}$, and for $1 \le i \le D$: $u_{li} = \frac{x_{li}}{1 + \|x_{li}\|}$;
for $1 \le i \le D$: $\tilde u_{li} = \frac{1}{\sqrt{H}}\sum_{h=1}^{H} W_{lh} \sum_{j=1}^{i} \frac{e^{\frac{1}{\sqrt{k}}\langle Q_{lh}u_{li}, K_{lh}u_{lj}\rangle}}{\sum_{j'=1}^{i} e^{\frac{1}{\sqrt{k}}\langle Q_{lh}u_{li}, K_{lh}u_{lj'}\rangle}}\, V_{lh}u_{lj}$;
for $1 \le i \le D$: $\hat u_{li} = x_{li} + \tilde u_{li}$ and $\hat y_{li} = \frac{\hat u_{li}}{1 + \|\hat u_{li}\|}$;
for $1 \le i \le D$: choose the active expert set $E_i$ and gate with
$\eta_{lie} = \frac{e^{(W_{3l}\hat y_{li} + b_{3l})_e}}{\sum_{e' \in E_i} e^{(W_{3l}\hat y_{li} + b_{3l})_{e'}}}\, I_{e \in E_i}$;
for $e = E + 1$: $\tilde y_{li,e} = W_{2l,e}\, r_l(W_{1l,e}\hat y_{li} + b_{1l,e}) + b_{2l,e}$;
for $e \in E_i$: $\tilde y_{li,e} = W_{2l,e}\, r_l(W_{1l,e}\hat y_{li} + b_{1l,e}) + b_{2l,e}$;
$\tilde y_{li} = \eta_{li,E+1}\,\tilde y_{li,E+1} + \sum_{e \in E_i} \eta_{lie}\,\tilde y_{li,e}$;
for $1 \le i \le D$: $y_{li} = \hat u_{li} + \tilde y_{li}$;
output at $l$: $y_l$. Return $y_L \in H^{Dd}$.

Mixture of Experts Sigmoid Transformer
Same structure with sigmoid attention weights: for $1 \le i \le D$,
$\tilde u_{li} = \frac{1}{\sqrt{H}}\sum_{h=1}^{H} W_{lh} \sum_{j=1}^{i} \frac{e^{\frac{1}{\sqrt{k}}\langle Q_{lh}u_{li}, K_{lh}u_{lj}\rangle}}{D + e^{\frac{1}{\sqrt{k}}\langle Q_{lh}u_{li}, K_{lh}u_{lj}\rangle}}\, V_{lh}u_{lj}$,
followed by the same residual, gating, expert feed-forward, and output steps as above; return $y_L \in H^{Dd}$.

Difference Transformer
Parameters: $L, D, d, k, k', q, H \in \mathbb{N}$, $\lambda \in \mathbb{R}$; for $1 \le l \le L$:
for $1 \le h \le H$: $Q^{(a)}_{lh}, K^{(a)}_{lh} \in \mathcal{L}(H^d, H^k)$ ($a = 1, 2$), $V_{lh} \in \mathcal{L}(H^d, H^{k'})$, $W_{lh} \in \mathcal{L}(H^{k'}, H^d)$;
$W_{1l} \in \mathcal{L}(H^d, H^q)$, $b_{1l} \in H^q$, $W_{2l} \in \mathcal{L}(H^q, H^d)$, $b_{2l} \in H^d$.
Input: $y_0 = x_0 \in H^{Dd}$. For $1 \le l \le L$:
input at $l$: $x_l = y_{l-1} \in H^{Dd}$, and for $1 \le i \le D$: $u_{li} = \mathrm{normalized}(x_{li})$;
for $1 \le i \le D$: $\tilde u_{li} = \frac{1}{\sqrt{H}}\sum_{h=1}^{H} W_{lh}\sum_{j=1}^{i} \bigl(\mathrm{attention}_{ij}(Q^{(1)}_{lh}, K^{(1)}_{lh}, u_l) - \lambda\,\mathrm{attention}_{ij}(Q^{(2)}_{lh}, K^{(2)}_{lh}, u_l)\bigr)\, V_{lh}u_{lj}$;
for $1 \le i \le D$: $\hat u_{li} = x_{li} + \tilde u_{li}$ and $\hat y_{li} = \mathrm{normalized}(\hat u_{li})$;
for $1 \le i \le D$: $\tilde y_{li} = W_{2l}\, r_l(W_{1l}\hat y_{li} + b_{1l}) + b_{2l}$ and $y_{li} = \hat u_{li} + \tilde y_{li}$;
output at $l$: $y_l$. Return $y_L \in H^{Dd}$.

Fixed-Points of Composition of Operators
What does it do? Consistency. Take a prompt. Look at the output. Take the output, summarize it into a next prompt, and iterate several times. After some iterations:
is the output relevant to the initial input?
is the context similar?
is the meaning similar?
what is the output of the output?
what is the output of the output of the output?

Fixed-Points of Composition of Operators
Composition of operators and fixed points in the context of deep learning. A fixed point $x^{*}$ of a function $f$ is a point such that $f(x^{*}) = x^{*}$. In the context of neural networks, a fixed point represents a state where the output of a layer is identical to its input.
[Figure: visual representation of the concept of fixed points in a neural network.]
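A tiny sketch of fixed-point iteration for a composed operator. The affine maps below are illustrative stand-ins, chosen so that the composition is a contraction and the iterates converge to its unique fixed point.

import numpy as np

def f1(x):
    return 0.5 * x + 1.0        # illustrative first operator

def f2(x):
    return 0.4 * x - 0.2        # illustrative second operator

def f(x):
    return f2(f1(x))            # composition f2 . f1, a contraction here

x = np.array([10.0])
for t in range(100):
    x_next = f(x)
    if np.linalg.norm(x_next - x) < 1e-12:   # stop when the iterate no longer moves
        break
    x = x_next
print(t, x, f(x))   # x is (approximately) a fixed point: f(x) = x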
Fixed-Point Behavior in Transformers
How fixed points influence model stability:
stable fixed points can lead to convergence and improved performance;
unstable fixed points can cause inconsistencies, incoherence, instability and divergence.
Potential benefits and drawbacks of fixed-point behavior:
benefits: faster convergence, improved generalization;
drawbacks: potential for reduced model capacity.
Strategies for fixed-point behavior: careful initialization, regularization techniques, adaptive learning rate schedules.

Training and Consistency as Intersection of Kernels
Kernel functions and antiderivatives in deep learning. Maximally cyclically monotone attention. Consistency as intersection of kernels. Averaged operators. Reverse Ishikawa algorithm.

Transformer Training

Transformer Design
Find $L, H, D, d$ and the operators $\hat O_1, \ldots, \hat O_L$:
$Y_o = \hat O_L \circ \cdots \circ \hat O_1(X_o)$, $o \in \{1, \ldots, D\}$.
We can turn it into a variational inequality as follows:
$\langle Y_o - \hat O_L \circ \cdots \circ \hat O_1(X_o),\ X'_o - X_o \rangle \ge 0, \quad \forall X'_o \in H_0,\ o \in \{1, \ldots, D\}$.

Transformer Design: Design Problem
Find $L, H, D, d$ and the operators $\hat O_1, \ldots, \hat O_L$:
$\arg\min_{L, D \ge 1}\ \arg\min_{\hat O_L, \ldots, \hat O_1} \int_{X_0}\int_{Y_L} \ell\bigl(Y_L,\ \hat O_L \circ \cdots \circ \hat O_1 X_0,\ \mathbb{P}\bigr)\, \mathbb{P}(dX_0\, dY_L)$,
where $\ell$ is a real-valued risk-aware cost function.

Transformer Training
We focus on a sub-goal, which is the minimization over the parameters $\theta = (Q_{lh}, K_{lh}, V_{lh}, W_{lh}, W_{1l}, W_{2l}, b_{1l}, b_{2l})_{1 \le l \le L}$. The variational inequality becomes
$\langle (Y_L - \hat O_L \circ \cdots \circ \hat O_1(X_o))(\theta),\ A^{\dagger}_L(\theta' - \theta)\rangle \ge 0, \quad \forall \theta',\ o \in \{1, \ldots, D\}$,
where $A^{\dagger}_L$ is the adjoint operator, which maps back into the space of $Y_L$.

Transformer Mini-Batch Training
Given $\mathcal{D} = \{(\hat X_o, \hat Y_o) \in H_0 \times H_L,\ o \in \{1, 2, \ldots, D\}\}$, the mini-batch training problem is to find a time- and state-independent control action $\theta$ such that
$\inf_{\theta \in \Theta} \frac{1}{D}\sum_{o=1}^{D} \ell(Y_{o,T}, \hat Y_o)$, such that $Y_0 = \hat X$ and $Y_{t+1} = Y_t + \lambda_t\bigl(\hat O_L \circ \cdots \circ \hat O_1(Y_t) - Y_t\bigr)$.

Transformer Training: Small Learning Rate Regime
Given $\mathcal{D} = \{(\hat X_o, \hat Y_o) \in H_0 \times H_L,\ o \in \{1, 2, \ldots, D\}\}$:
$\inf_{\theta \in \Theta} \frac{1}{D}\sum_{o=1}^{D} \ell(Y_o(T), \hat Y_o)$, such that $Y(0) = \hat X$ and $\dot Y = (\hat O_L \circ \cdots \circ \hat O_1)(Y) - Y$, $T > t > 0$.
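A minimal sketch of the relaxed iteration $Y_{t+1} = Y_t + \lambda_t(\hat O_L \circ \cdots \circ \hat O_1(Y_t) - Y_t)$ used in the mini-batch training formulation above; the linear stand-in for the stacked blocks, the step sizes, and the dimensions are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
A = 0.9 * np.eye(4) + 0.05 * rng.standard_normal((4, 4))   # toy stand-in for O_L . ... . O_1

def O_hat(y):
    return A @ y

y = rng.standard_normal(4)              # Y_0 = X_hat
for t in range(500):
    lam = 1.0 / (t + 2)                 # relaxation / learning-rate parameter lambda_t
    y = y + lam * (O_hat(y) - y)        # move part of the way toward the operator's output
print(y)                                # drifts toward a fixed point of O_hat as t grows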
Backcasting
Definition: backcasting involves using historical data to estimate past values of a variable.
Kernel-based approach: use kernel methods to model the underlying dynamics of the system; train a transformer model on historical data to learn the kernel function; apply the learned kernel to backcast missing values.

Nowcasting
Definition: nowcasting involves estimating the current state of a system using real-time data.
Kernel-based approach: use kernel methods to fuse information from multiple data sources; train a transformer model on real-time data to learn the kernel function; apply the learned kernel to nowcast the current state.

Forecasting and Forward-Looking Problems
Definition: forecasting involves predicting future values of a variable. Forward-looking problems involve decision-making based on future predictions.
Kernel-based approach: use kernel methods to model the temporal dependencies in the data; train a transformer model on historical data to learn the kernel function; apply the learned kernel to forecast future values; use the forecasts to make informed decisions.

Limitation: no best, only a better algorithm
There is no best generative machine intelligence in pointwise forecasting.

Limitation: "current" training algorithms are suboptimal
Constant weights/biases are suboptimal.

Limitation: state-feedback loop feasibility in Transformers
The Add-and-Norm operator, used either before or after the self-attention mechanism, includes the state feedback, though it is not exploited in weight/bias design.
