
Full Transcript


Notes on Neural Networks: part 1
Alessio Micheli, [email protected], 2024
Dipartimento di Informatica, Università di Pisa, Italy

ML Course structure: where we are
[Course map figure: the function approximation framework — intro (data, task, model, learning algorithm, validation); discrete, continuous and probabilistic hypothesis spaces; concept learning, linear models (LTU-LMS), K-nn, inductive bias; neural networks and deep learning (many lectures), SOM, RNN; Bayesian networks; theory (validation & SLT, bias/variance); SVM; applications/project and advanced topics.]

NN: which target?
NN can be used to study and model biological systems and learning processes (philosophy, psychology, neurobiology, cognitive science, computational neuroscience), where biological realism is essential; or to build effective ML systems/algorithms (statistics, artificial intelligence, physics, mathematics, engineering, ...), often losing strict biological realism, where the ML, computational and algorithmic properties are essential. For us, ANN (Artificial Neural Networks) are indeed a (flexible) machine learning tool, in the sense of function approximation: a NN realizes a mathematical function h(x) with special properties.

A first look at NN (as an ML tool)
A historical learning machine that is still a powerful approach to ML:
- NN can learn from examples.
- NN are universal approximators (theorem of Cybenko): flexible approaches for arbitrary functions, including non-linear ones.
- NN can deal with noise and incomplete data: performance degrades gracefully under adverse operating conditions.
- NN can handle continuous (real) and discrete data, for both regression and classification tasks.
- NN are a successful model in ML due to their flexibility in applications.
- NN encompass a wide set of models: it is a paradigm (in the class of subsymbolic approaches).

"Philosophy": Connectionism
Complex behavior emerging from the interaction of simple computational units. Connectionism: mental or behavioural phenomena as the emergent processes of interconnected networks of simple units.

Nervous system
Human brain: more than 10^10 neurons, with 10^4-10^5 connections for each neuron. Response time: ~0.001 sec (msec order). Image recognition: ~0.1 sec, i.e. only about 100 serial computation steps → highly parallel computation. (Note that face recognition can be complex for conventional computers.)
[Figure: three stained pyramidal neurons. These neurons stand nearly 2 mm high and receive over 10,000 inputs from other neurons, which they process in their complex dendritic arbors using active regenerative mechanisms.]

Biological neuron
Inputs (from dendrites) and outputs: a charge/discharge cycle (sodium/potassium ions) changes the potential of the cell; when a threshold is exceeded, a spike (voltage pulse) is generated; information is also carried by the time interval among spikes.

Signal propagation and synapse
A signal propagates down the axon to the cell body and dendrites of the next cell: electrical → chemical → electrical.
GIF: https://it.m.wikipedia.org/wiki/Neurone#/media/File%3ASinapsi.gif
Synapse
Pre- and post-synaptic membrane; neurotransmitters; inhibitory/excitatory connections. Change due to learning: strength (weight) reinforcement due to stimuli, the plasticity/adaptivity of the nervous system. Hebbian learning (Hebb, 1949): a synapse is strengthened when the neuron's inputs (and correlated outputs) are repeated. Input stimulates the synaptic strength, so weights become closer to inputs.

Neurons in the art
Michelangelo, "The Creation of Adam". A synapse?

Artificial Neuron: processing unit
A node, neuron or unit receives inputs from external sources or from other units. The input connections carry weights w: the free parameters, which can be modified by learning (the synaptic strength); x_0 = 1 is a fixed input whose weight w_i0 plays the role of the bias. Unit i computes

net_i(x) = Σ_j w_ij x_j        o_i(x) = f(net_i(x))

The weighted sum net_i is called the net input to unit i. Note that w_ij refers to the weight of unit i, i.e. from unit/input j to unit i (not the other way around, e.g. see the next slide). The function f is the unit's activation function (e.g. linear, LTU, ...). A small code sketch of such a unit is given after this passage.

Notation for w_ij
Some media, e.g. Wikipedia (backpropagation section), and some NN simulators/libraries use a different notation, where for example w_2j is the weight from input 2 to unit j, instead of the traditional w_j2 (because the weights w_j belong to unit j). Take care: we fix the traditional notation of the previous slide for our course.

Neuron: three activation functions
h(x) = Σ_i w_i x_i. With the identity function we get the linear activation function; with a threshold activation function we get the Perceptron (see LTU); the logistic (sigmoid) function is discussed later.

Perceptron
Frank Rosenblatt (1957-1958, 1960, ...). The Perceptron was used for pattern classification, simulating human perception. [Figure: Rosenblatt's perceptron as a visual analyzer with photoelectric cells, classifying the different letters that can appear; © 1969 MIT Press.]

Perceptron II
Nice: a single neuron is a very simple computational unit. Minsky: "for the kind of recognition it can do, it is such a simple machine that it would be astonishing if nature did not make use of it somewhere". Hence, we are going to discuss a biologically inspired model with a simple computational unit (and its learning algorithm), a model of historical importance for ML, with different approaches since the early 60s.

Models with Perceptron
Perceptrons can be composed and connected to build networks: from the LTU to NN (MLP NN = Multi-Layer Perceptron). This is paradigmatic for the transition from linear to non-linear models, towards the flexible models at the state of the art in ML. First, a historical model: McCulloch & Pitts. Note: in practice NN are realized by software (simulators) or by hardware on chips.
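As a small illustration (not from the original slides), the following Python sketch implements the unit defined above: the net input net_i(x) = Σ_j w_ij x_j with x_0 = 1 as the bias input, plus the three activation functions listed above. The weight values and function names are illustrative only.

```python
import math

def net_input(w, x):
    """Weighted sum net_i(x) = sum_j w_ij * x_j, with x[0] = 1 acting as the bias input."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))

# The three activation functions from the notes
identity = lambda net: net                                       # linear unit
ltu      = lambda net: 1 if net > 0 else 0                       # threshold unit (LTU / perceptron)
logistic = lambda net, a=1.0: 1.0 / (1.0 + math.exp(-a * net))   # sigmoid, slope parameter a

def unit_output(w, x, f=ltu):
    """o_i(x) = f(net_i(x))."""
    return f(net_input(w, x))

# Illustrative weights (w[0] is the bias weight) and input pattern
w = [-1.5, 1.0, 1.0]       # e.g. the AND unit shown later in the notes
x = [1, 1, 1]              # x0 = 1, x1 = 1, x2 = 1
print(unit_output(w, x))   # -> 1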
McCulloch & Pitts Networks
"A logical calculus of the ideas immanent in nervous activity" (1943). What does a given net compute? Can a given net compute a given logical sentence? Neurons are in two possible states: firing (1) and not firing (0). All synapses (connections) are equivalent and characterized by a real number (their strength w), which is positive for excitatory connections and negative for inhibitory connections; a neuron i becomes active when the sum of the connections w_ij coming from the active neurons j connected to it, plus a bias, is larger than zero. Binary inputs and binary output (a binary classification task): "basically" the LTU/Perceptron.

AND, OR Boolean functions
Assume net = w^T x = Σ_{i=0..n} w_i x_i and sign(net) = 1 if net > 0, 0 (or -1) otherwise.
AND: f(0,0)=0, f(0,1)=0, f(1,0)=0, f(1,1)=1, realized with w_1 = 1, w_2 = 1, w_0 = -1.5.
OR: f(0,0)=0, f(0,1)=1, f(1,0)=1, f(1,1)=1, realized with w_1 = 1, w_2 = 1, w_0 = -0.5.
Exercise: find a network for the NOT.

Exclusive OR (XOR)
f(0,0)=0, f(0,1)=1, f(1,0)=1, f(1,1)=0. No values of w_0, w_1, w_2 for a single unit work: no linear separation surface exists! (Minsky and Papert, 1969.)

XOR by a two-layer network
(Notation: dot = AND, + = OR; see the next slide.) Hidden unit h_1 computes x_1 AND x_2 (weights w_1 = w_2 = 1, bias w_0 = -1.5); hidden unit h_2 computes x_1 OR x_2 (weights w_1 = w_2 = 1, bias w_0 = -0.5); the output unit combines them with weights -1 (from h_1) and 1 (from h_2) and bias w_0 = -0.5. The points are linearly separable in the new (h_1, h_2) space: e.g. x = (1,0) becomes h_1 = 0, h_2 = 1. Exercise: separate the points by other planes → find different networks solving XOR. (A code sketch of this network is given after the McCulloch & Pitts summary below.)

Details to solve the XOR (proof)
De Morgan rule: not(a and b) = not(a) or not(b). Hence, if h_1 = (a and b), then not(h_1) = not(a) or not(b). And if h_2 = (a or b), then
not(h_1) and h_2 = (not(a) or not(b)) and (a or b)
= (not(a) and a) or (not(a) and b) or (a and not(b)) or (b and not(b))
= (not(a) and b) or (a and not(b)) = a xor b (by the xor definition, QED).
Exercise: redo it by yourself with the ·, + notation.

Hidden layer & representation
This introduces the useful concept of an internal (re)representation of the input variables via internal (hidden) units. Developing high-level (hidden) features is a key factor in NN. The representation in the hidden layer makes the task easier for the output layer (the last unit): e.g. in the figure, the point (1,0) (in the south-east corner of the input space) becomes h_1 = 0, h_2 = 1 (north-west corner) → now it is a linearly separable problem. A look ahead: such a composition of sub-/intermediate operations to perform a complex task can be extended through many layers of abstraction, and in NN such internal representations/intermediate features can be learned. Learning internal distributed representations → representation learning and deep learning concepts (link to the next parts of the ML course).

McCulloch & Pitts summary
Properties of networks of perceptrons: perceptrons can represent AND, OR, NOT (NAND, NOR) → every Boolean function can be represented by some network of interconnected perceptrons, and only two levels deep is sufficient (more levels can be more efficient). ... but a single layer cannot model all possible functions (Minsky & Papert, 1969): the limitation of single perceptrons to linearly separable problems! Note: no learning algorithm has been provided up to now (even for one unit)! Let us start (again) with a single-unit model.
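The two-layer XOR network described above can be checked with a few lines of code. This is a minimal sketch using the weight values given in the notes (AND: -1.5, 1, 1; OR: -0.5, 1, 1; output unit: -0.5, -1, 1); the function and variable names are my own.

```python
def ltu(w, x):
    """Threshold unit: fires (1) iff w0*1 + w1*x1 + w2*x2 > 0."""
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else 0

AND_W = (-1.5, 1, 1)    # bias, w1, w2 (values from the notes)
OR_W  = (-0.5, 1, 1)
OUT_W = (-0.5, -1, 1)   # output unit: NOT(h1) AND h2

def xor(x1, x2):
    h1 = ltu(AND_W, (x1, x2))    # hidden unit: x1 AND x2
    h2 = ltu(OR_W, (x1, x2))     # hidden unit: x1 OR x2
    return ltu(OUT_W, (h1, h2))  # (h1, h2) is linearly separable for XOR

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "AND:", ltu(AND_W, x), "OR:", ltu(OR_W, x), "XOR:", xor(*x))
```

Running it reproduces the truth tables above, including XOR, which no single unit can realize.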
Learning for one unit model – Overview
Two kinds of method for learning (also a historical view):
1. Adaline = Adaptive Linear Neuron (Widrow, Hoff): a linear unit during training, with the LMS direct solution and the gradient descent solution. For regression tasks, see the LMS algorithm; for classification, see the LTU and the LMS algorithm of the previous lectures (on the linear model). This is the approach that we will generalize to the MLP.
2. Perceptron (Rosenblatt): a non-linear unit during training, with a hard limiter or threshold activation function. Only classification: capability studies and the convergence theorem. Next slides.

Introduction to 2.: the Perceptron Learning Algorithm
Rosenblatt's Perceptron Learning Algorithm (1958-1962) minimizes the number of misclassified patterns: find w s.t. sign(w^T x) = d. It is an on-line algorithm: a step can be made for each input pattern. Note that, to build a linear classifier, we have from now on 3 learning algorithms:
1. Adaline LMS direct solution
2. Adaline LMS gradient-based algorithm
3. Perceptron learning algorithm (now)
And others will come later in the course, e.g. SVM.

The "Perceptron Learning Algorithm"
1. Initialize the weights (either to zero or to a small random value).
2. Pick a learning rate η (a number between 0 and 1).
3. Until a stopping condition is satisfied (e.g. the weights don't change), for each training pattern (x, d), with d = +1 or -1 and out the output:
   - compute the output activation out = sign(w^T x) ∈ {+1, -1};
   - if out = d, don't change the weights, i.e. "minimize (only) misclassifications";
   - if out ≠ d, update the weights: w_new = w + η d x, i.e. add +η x if w^T x ≤ 0 and d = +1, and -η x if w^T x > 0 and d = -1. In a different form: w_new = w + (η/2) (d - out) x.
Note that w in LMS was different, because here we have the sign() in the computation of out. (A code sketch of this procedure is given after the Delta Rule below.)

Geometrical view
An example: before updating the weight w (pointing towards the positive region), note that both p1 and p2 are incorrectly classified (the red dashed line is the decision boundary). Suppose we choose p1 to update the weights. p1 has target value d = +1, so w is moved a small amount in the direction of p1: w_new = w + η d x (the sum of the vectors w and η d x). The new boundary (blue dashed line) is better than before, and w_new is closer to p1. Exercise: what happens if the training pattern p2 is chosen? (In the figure, * denotes d = +1 and o denotes d = -1; only for this example the bias w_0 = 0.) Note the Hebbian rule: w moves towards x.

Delta Rule
The form w_new = w + η d x is seen as Hebbian learning. The other form, w_new = w + η (d - out) x = w + η δ x, is in the form of error-correction learning. Recall from LMS: this is an "error correction" rule (the delta / Widrow-Hoff / Adaline / LMS rule) that changes w proportionally to the error (target d - output):
- if (target d - output) = err = 0 → no correction;
- (for input > 0) if err > 0 (the output is too low), increase w → increase w^T x → reduce err;
- ...
In terms of "neurons": the adjustment made to a synaptic weight is proportional to the product of the error signal and the input signal that excites the synapse. It is easy to compute when the error signal δ is directly measurable (we know the desired response for each unit).
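Below is a compact Python sketch of the perceptron learning rule just described, assuming the bias is handled as weight w_0 on a constant input x_0 = 1; eta, max_epochs and the toy OR training set are illustrative choices, not part of the original notes.

```python
def sign(net):
    return 1 if net > 0 else -1

def perceptron_train(patterns, eta=1.0, max_epochs=100):
    """patterns: list of (x, d) with x including x0 = 1 and d in {+1, -1}."""
    n = len(patterns[0][0])
    w = [0.0] * n                       # initialize the weights to zero
    for _ in range(max_epochs):
        errors = 0
        for x, d in patterns:
            out = sign(sum(wi * xi for wi, xi in zip(w, x)))
            if out != d:                # update only on misclassified patterns
                w = [wi + eta * d * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                 # stopping condition: no weight change in a full pass
            break
    return w

# Illustrative linearly separable task: the OR function with targets in {+1, -1}
tr = [([1, 0, 0], -1), ([1, 0, 1], +1), ([1, 1, 0], +1), ([1, 1, 1], +1)]
print(perceptron_train(tr))
```

The learned weights will generally differ from the hand-chosen ones shown earlier: the solution is not unique.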
Represent / Learn
What can the Perceptron represent? Linear decision boundaries (see LTU): the Perceptron can solve linearly separable problems. Can it always learn the solution? Yes! The Perceptron with the "Perceptron Learning Algorithm" is always able to learn what it is able to represent: the Perceptron Convergence Theorem (proof in the next lesson). Note: a milestone result! A biologically inspired model (Bio) with well-defined and proved computational capabilities (Comp), proved by a theorem (Math).

Perceptron Convergence Theorem
The perceptron is guaranteed to converge (classifying correctly all the input patterns) in a finite number of steps if the problem is linearly separable, independently of the starting point (although the final solution is not unique and depends on the starting point). It may be "unstable" if the problem is not separable: in particular, it develops cycles (with a set of weights that are not necessarily optimal). Note: the notation in the proof is simplified, e.g. the transpose T in the scalar products between vectors is not written, as usual.

Perceptron Convergence Theorem: preliminaries (1)
We can focus on an "all positive patterns" task, as shown in the following. Assume (x_i, d_i) in the TR set, with d_i = +1 or -1 and i = 1...l. Linearly separable → there exists a solution w* s.t. (omitting T) d_i (w* x_i) ≥ α, with α = min_i d_i (w* x_i) > 0. Hence, w* (d_i x_i) ≥ α. Defining x_i' = d_i x_i, then w* is a solution iff w* is a solution of (x_i', +1):
(if) w* solves the original task → d_i (w* x_i) ≥ α → (w* d_i x_i) ≥ α → (w* x_i') ≥ α → w* is a solution of (x_i', +1);
(only if) w* is a solution of (x_i', +1) → (w* d_i x_i) ≥ α → d_i (w* x_i) ≥ α → w* solves the task for the x_i.

Perceptron Convergence Theorem: preliminaries (2)
Assume w(0) = 0 (at step 0), η = 1, and β = max_i ||x_i||^2 (i = 1...l, where || || denotes the Euclidean norm, see the first lesson). After q errors (all false negatives),
w(q) = Σ_{j=1..q} x_{i_j}
where i_j denotes the patterns belonging to the subset of misclassified patterns (e.g. the indices 2, 8, 9, 14 in the TR set), because w(j) = w(j-1) + x_{i_j} (the rule w_new = w + η d x, with all d = +1 here, only positive patterns, and η = 1).

Perceptron Convergence Theorem: proof
Theorem: "For linearly separable tasks, the perceptron algorithm converges after a finite number of steps."
Basic idea: we can find a lower bound on ||w(q)||^2 that grows with the square of the number of steps q and an upper bound that grows linearly with q → we can find the maximum q (number of steps) s.t. the algorithm converges. [Proof also in Haykin, 2nd ed., pages 139-141. Figure: ||w(q)||^2 versus the number of steps, with the quadratic lower bound and the linear upper bound crossing at the maximum q.]

Proof (part 1): lower bound on ||w(q)||^2
Recall that w* is the solution and α = min_i (w*^T x_i). Then
w*^T w(q) = w*^T Σ_{j=1..q} x_{i_j} ≥ q α.
By Cauchy-Schwarz, (a^T b)^2 ≤ ||a||^2 ||b||^2; using a = w* and b = w(q):
||w*||^2 ||w(q)||^2 ≥ (w*^T w(q))^2 ≥ (q α)^2, hence ||w(q)||^2 ≥ (q α)^2 / ||w*||^2 — the lower bound.

Proof (part 2): upper bound on ||w(q)||^2
Since ||a + b||^2 = Σ_i (a_i + b_i)^2 = Σ_i a_i^2 + Σ_i 2 a_i b_i + Σ_i b_i^2 = ||a||^2 + 2 a^T b + ||b||^2, we have
||w(q)||^2 = ||w(q-1) + x_{i_q}||^2 = ||w(q-1)||^2 + 2 w(q-1)^T x_{i_q} + ||x_{i_q}||^2.
For the q-th error the pattern x_{i_q} was misclassified, so w(q-1)^T x_{i_q} ≤ 0, and ||x_{i_q}||^2 ≤ β. Hence ||w(q)||^2 ≤ ||w(q-1)||^2 + β and, iterating from w(0) = 0, ||w(q)||^2 ≤ q β — the upper bound.

Proof (part 3): bring them together
q β ≥ ||w(q)||^2 ≥ (q α)^2 / ||w*||^2, i.e. q β ≥ q^2 α' with α' = α^2 / ||w*||^2. Hence q ≤ β / α' = β ||w*||^2 / α^2, i.e. a finite number of steps.
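As an optional empirical check (my own addition, not part of the notes), the sketch below runs the perceptron updates on a small linearly separable set generated by a known separator w* and compares the number of weight updates with the theoretical bound β ||w*||^2 / α^2 derived above (with w(0) = 0 and η = 1); the data generation, margin and seed are arbitrary choices.

```python
import random

random.seed(0)
w_star = [0.0, 2.0, -1.0]                       # known separating weights (bias first)
data = []
while len(data) < 50:
    x = [1.0, random.uniform(-1, 1), random.uniform(-1, 1)]
    net = sum(ws * xi for ws, xi in zip(w_star, x))
    if abs(net) > 0.1:                          # keep a margin so that alpha > 0
        data.append((x, 1 if net > 0 else -1))

alpha = min(d * sum(ws * xi for ws, xi in zip(w_star, x)) for x, d in data)
beta = max(sum(xi * xi for xi in x) for x, d in data)
bound = beta * sum(ws * ws for ws in w_star) / alpha ** 2

w, updates, converged = [0.0] * 3, 0, False
while not converged:
    converged = True
    for x, d in data:
        out = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
        if out != d:
            w = [wi + d * xi for wi, xi in zip(w, x)]   # eta = 1
            updates += 1
            converged = False

print(f"updates = {updates}, theoretical bound = {bound:.1f}")
```

The number of updates observed stays below the bound, as the theorem guarantees; the bound is usually far from tight.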
Differences between the "Perceptron Learning Alg." and the LMS Alg. (I)
w_new = w + η (d - out) x. The rules are apparently similar, but note that:
- the LMS rule was derived (by the gradient) without a threshold activation function: it minimizes the error of the linear unit, using directly w^T x;
- the perceptron uses out = sign(w^T x).
Hence, for training, δ = (d - w^T x) for the LMS approach versus δ = (d - sign(w^T x)) for the perceptron learning algorithm. Of course the model trained with LMS can still be used for classification, applying the threshold function h(x) = sign(w^T x) (LTU).

LMS can misclassify in linearly separable problems
LMS does not necessarily minimize the number of TR examples misclassified by the LTU:
E(w) = Σ_p (y_p - x_p^T w)^2
It changes the weights not only for the misclassified patterns!

Differences summary
Perceptron Learning Alg.: minimizes the misclassifications (out = sign(w^T x)); it always converges for linearly separable problems, in a finite number of steps, to a perfect classifier (else no convergence); it is difficult to extend to networks of units (NN).
LMS Alg.: minimizes E(w) with out = w^T x; asymptotic convergence (*), also for non-linearly separable problems; not always zero classification errors even on linearly separable problems; it can be extended to networks of units (NN) using the gradient-based approach.
(*) An example of asymptotic convergence: training with the MSE loss.

Activation functions
1. Linear function.
2. Threshold (or step) function (Perceptron/LTU).
3. A non-linear squashing function like the sigmoidal logistic function, which assumes a continuous range of values in the bounded interval [0,1]:
f(x) = 1 / (1 + e^(-ax))
The sigmoidal-logistic function has the property of being a smoothed, differentiable threshold function; a is the slope parameter of the sigmoid.

Activation functions (type 3.)
Logistic: f(x) = 1 / (1 + e^(-ax)), with values in [0, +1] (plotted for a = 0.5, 1, 2). TanH (hyperbolic tangent): f_symm = 2 f(x) - 1 = tanh(ax/2), with values in [-1, +1]. Both are semi-linear close to 0. What if a → 0? (they tend to a linear unit) And a → ∞? (they tend to the LTU)

Sigmoidal functions and classifier outcomes
These functions provide continuous outputs, but for the logistic function an output value ≥ 0.5 (threshold) corresponds to the positive class.
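A quick numerical check (my own addition) of the relation stated above between the logistic and the hyperbolic tangent, 2 f(x) - 1 = tanh(ax/2), and of the effect of the slope parameter a (small a ≈ linear near 0, large a ≈ LTU); the sample values of a and x are arbitrary.

```python
import math

def logistic(x, a=1.0):
    """Sigmoidal-logistic activation, values in (0, 1); a is the slope parameter."""
    return 1.0 / (1.0 + math.exp(-a * x))

for a in (0.5, 1.0, 2.0):
    for x in (-2.0, 0.0, 1.5):
        lhs = 2 * logistic(x, a) - 1
        rhs = math.tanh(a * x / 2)
        assert abs(lhs - rhs) < 1e-12            # f_symm = 2*f(x) - 1 = tanh(a*x/2)

# Slope parameter behaviour around 0: small a ~ linear, large a ~ step (LTU)
print([round(logistic(x, a=0.01), 3) for x in (-1, 0, 1)])   # ~[0.498, 0.5, 0.502]
print([round(logistic(x, a=50.0), 3) for x in (-1, 0, 1)])   # ~[0.0, 0.5, 1.0]
```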
