Linear and Logistic Regression
Summary
This document introduces linear and logistic regression, core topics in machine learning. It covers learning outcomes, algorithms, tasks, performance measures, and the concept of convexity alongside examples of practical problems and activities.
Full Transcript
Shaping the Future
Linear and Logistic Regression

Learning outcomes
▪ By the end of this session you will:
– Learn about the main ingredients constituting most Machine Learning problems
– Master the key concepts associated with linear and logistic regression
– Build a design matrix from training data
– Understand the method of Gradient Descent applied to some classes of Machine Learning problems, its limitations and its parameters

Learning algorithms
▪ What is a machine learning algorithm?
– A machine learning algorithm is an algorithm that is able to learn from data.
▪ What is learning?
– A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
– So, for every machine learning algorithm, we need to define what T, P and E are!

The Task, T
▪ The process of learning itself is not the task.
▪ Learning is our means of attaining the ability to perform the task.
– If we want a robot to be able to walk, then walking is the task.
▪ Machine learning tasks are usually described in terms of how the machine learning system should process an example.
– An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process.
▪ Typically, an example is represented as a vector $\mathbf{x} \in \mathbb{R}^n$ where each entry $x_i$ of the vector is a feature.
– The features of an image are the values of the pixels in the image.

The Task, T
▪ The most common machine learning tasks are:
– Classification: the computer program is asked to specify which of $k$ categories some input belongs to.
▪ Object recognition: pedestrians, cars, buses…
– Regression: the computer program is asked to predict a numerical value given some input.
▪ Prediction of the expected claim amount that an insured person will make (used to set insurance premiums).
– Transcription: the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form.
▪ Optical character recognition.
▪ Speech recognition (Siri, etc.).

The Task, T
▪ The other most common machine learning tasks are:
– Machine translation.
– Synthesis and sampling: the machine learning algorithm is asked to generate new examples that are similar to those in the training data.
▪ Video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel.
– Imputation of missing values: the machine learning algorithm is given a new example $\mathbf{x} \in \mathbb{R}^n$, but with some entries $x_i$ of $\mathbf{x}$ missing.
– Denoising.

The Performance Measure, P
▪ In order to evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance.
▪ Usually, this performance measure P is specific to the task T being carried out by the system.
▪ For classification and transcription, we often measure the accuracy of the model; an equivalent way is to measure the error rate.
▪ Usually, we are interested in how well the machine learning algorithm performs on data that it has not seen before.
– Training set: used for training the machine learning system.
– Test set: used for measuring the performance of the machine learning system.
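As a minimal illustration of the accuracy and error-rate measures above, the sketch below scores predictions against held-out test labels; the arrays are hypothetical placeholders, not data from the slides.

```python
import numpy as np

# Hypothetical ground-truth labels and model predictions on a held-out test set.
y_test = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])

accuracy = np.mean(y_pred == y_test)   # fraction of correctly classified examples
error_rate = 1.0 - accuracy            # equivalently, the fraction misclassified

print(f"accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")
```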
The Experience, E
▪ Machine learning algorithms can be broadly categorised as Unsupervised and Supervised (we will see Reinforcement Learning later).
▪ Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset.
– The data is unlabelled; it is used in clustering methods, for example.
– In unsupervised learning there is no instructor or guide: the algorithm must learn to make sense of the data without one.
▪ Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.
– Classify iris plants into three different species based on their measurements.
– An instructor or teacher shows the machine learning system what to do.

The dataset
▪ Most machine learning algorithms simply experience a dataset. A dataset can be described in many ways. In all cases, a dataset is a collection of examples, which are in turn collections of features.
▪ One common way of describing a dataset is with a design matrix:
– One sample per row.
– One feature per column.
▪ You might also see the transposed convention:
– One feature per row.
– One sample per column.

The first dataset: the iris dataset
▪ The Iris dataset:
– 150 samples and 4 features.
– The design matrix $\mathbf{X} \in \mathbb{R}^{150 \times 4}$, where $X_{i,1}$ is the sepal length and $X_{i,2}$ the sepal width of plant (sample) $i$, etc.
▪ Not all datasets can be described this way, as each sample must have the same length. More generally, we can use a set containing $m$ elements $\{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(m)}\}$.

Dataset in supervised learning
▪ In the case of supervised learning, an example contains a label or target as well as a collection of features.
▪ For example, if we want to use a learning algorithm to perform object recognition from photographs, we need to specify which object appears in each of the photos.
– We might do this with a numeric code, with 0 signifying a person, 1 signifying a car, 2 signifying a cat, etc.
▪ Often, when working with a dataset containing a design matrix of feature observations $\mathbf{X}$, we also provide a vector of labels $\mathbf{y}$, with $y_i$ providing the label for example $i$.

The iris dataset
(figure: the iris design matrix $\mathbf{X}$ and the corresponding label vector $\mathbf{y}$; not reproduced in the transcript)

Individual activity (10 min)
▪ We have 2 features, $x_1$ the age in years and $x_2$ the height in cm, and 4 samples labelled with respect to their weight $y$ in kg:
– Paul: $x_1 = 4$, $x_2 = 103$ and $y = 17$
– Ronaldo: $x_1 = 6$, $x_2 = 116$ and $y = 21$
– Chang: $x_1 = 9$, $x_2 = 132$ and $y = 28$
– Manuel: $x_1 = 10$, $x_2 = 138$ and $y = 31$
▪ How many rows and columns has the design matrix $\mathbf{X}$?
▪ How many elements has the vector of labels $\mathbf{y}$?
▪ Write down the design matrix $\mathbf{X}$ and the corresponding vector of labels $\mathbf{y}$. (A possible construction in code is sketched below.)
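One way to write down and check the answer, using NumPy (the slides do not prescribe a particular tool for this activity): the design matrix stacks one sample per row and one feature per column.

```python
import numpy as np

# One row per sample (Paul, Ronaldo, Chang, Manuel), one column per feature (age, height).
X = np.array([[4, 103],
              [6, 116],
              [9, 132],
              [10, 138]], dtype=float)

# One label (weight in kg) per sample, in the same order as the rows of X.
y = np.array([17, 21, 28, 31], dtype=float)

print(X.shape)  # (4, 2): 4 rows (samples), 2 columns (features)
print(y.shape)  # (4,):   4 labels
```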
Linear regression - Task
▪ This concrete model can be used to introduce most Neural Network concepts:
– Analytical vs. iterative solutions
– Convexity
– Gradient Descent
– Capacity

Linear regression - Task
▪ Task: the goal is to determine the values of the weights $\mathbf{w}$ to predict the value of the output scalar $y \in \mathbb{R}$ given the input vector $\mathbf{x} \in \mathbb{R}^n$.
▪ In linear regression, the assumption is that the output $y$ is a linear function of the input $\mathbf{x}$:
– Let $\hat{y}$ be the value that our model predicts given $\mathbf{x}$; the true value is written $y$.
– $\hat{y} = \mathbf{w}^T \mathbf{x}$, where $\mathbf{w} \in \mathbb{R}^n$ is a vector of parameters or weights to be determined.
– $\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$ in its expanded form.
– In this model, $\hat{y} = 0$ when $\mathbf{x} = \mathbf{0}$, which is a strong assumption.

Augmented linear regression model
▪ It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated/general model with one additional parameter, an intercept term $b$:
– $\hat{y} = \mathbf{w}^T \mathbf{x} + b$, where $b \in \mathbb{R}$.
▪ This extension to affine functions means that the plot of the model's predictions still looks like a line, but it need not pass through the origin.

Augmented linear regression model
▪ Instead of adding the bias parameter $b$, one can continue to use the model with only weights, but augment $\mathbf{x}$ with an extra feature that is always set to 1 for every sample:
– $\hat{y} = \mathbf{w}_{aug}^T \mathbf{x}_{aug}$, where
– $\mathbf{w}_{aug} = \begin{pmatrix} b \\ \mathbf{w} \end{pmatrix} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{pmatrix}$ with $w_0 = b$, and $\mathbf{x}_{aug} = \begin{pmatrix} 1 \\ \mathbf{x} \end{pmatrix} = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{pmatrix}$ with $x_0 = 1$.
▪ The intercept term $b$ is often called the bias parameter of the linear regression model.
▪ This terminology derives from the point of view that the output is biased toward being $b$ in the absence of any input.

Individual activity (5 min)
▪ We have 2 features, $x_1$ the age in years and $x_2$ the height in cm, and 4 samples labelled with respect to their weight $y$ in kg:
– Paul: $x_1 = 4$, $x_2 = 103$ and $y = 17$
– Ronaldo: $x_1 = 6$, $x_2 = 116$ and $y = 21$
– Chang: $x_1 = 9$, $x_2 = 132$ and $y = 28$
– Manuel: $x_1 = 10$, $x_2 = 138$ and $y = 31$
▪ You initially found that
– $\mathbf{X} = \begin{pmatrix} 4 & 103 \\ 6 & 116 \\ 9 & 132 \\ 10 & 138 \end{pmatrix}$ and $\mathbf{y} = \begin{pmatrix} 17 \\ 21 \\ 28 \\ 31 \end{pmatrix}$.
▪ Rewrite the augmented design matrix $\mathbf{X}$ to include the bias in the linear regression model.
▪ What are the entries of $\mathbf{x}$ and $\mathbf{w}$? Check that $\hat{y} = \mathbf{w}^T \mathbf{x}$.

Introduction to Linear Regression
https://www.youtube.com/watch?v=PaFPbb66DxQ

Augmented linear regression model
▪ From now on, by linear regression model we mean the extended linear model $\hat{y} = \mathbf{w}^T \mathbf{x}$ with:
– the feature vector $\mathbf{x} = (x_0, x_1, \dots, x_n)^T$ and $x_0 = 1$ for all samples,
– the weight vector $\mathbf{w} = (w_0, w_1, \dots, w_n)^T$ and $w_0 = b$ the bias parameter.
– $\hat{y} = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
– Equivalently, $\hat{y} = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$.
▪ Solving this machine learning problem means determining the weight vector $\mathbf{w}$.

Recall: Finding the minimum of a function (10 min)
▪ Let $f(x) = x^2 - 6x + 2$.
– Find the unique stationary point $x_0$, i.e. find the value of $x$ for which $f'(x_0) = 0$.
– Is the point you determined above a local minimum or maximum? For this, you will need to determine the sign of $f''(x_0)$.
▪ Let $f(x) = 5x^3 + 2x^2 - 3x + 4$.
– Find the stationary points of the function and determine their nature (local maximum or minimum).

Recall: Norm 2 (5 min)
▪ For a vector $\mathbf{x} = (x_1, \dots, x_n)^T \in \mathbb{R}^n$, the 2-norm of $\mathbf{x}$ is given by
– $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x_1^2 + \dots + x_n^2}$, which is a non-negative real number.
– One result we will use later is $\mathbf{x}^T \mathbf{x} = x_1^2 + \dots + x_n^2 = \|\mathbf{x}\|_2^2$, i.e. $\mathbf{x}^T \mathbf{x} = \|\mathbf{x}\|_2^2$.
▪ Let $\mathbf{x} = (6, 2, 9)^T$; compute $\|\mathbf{x}\|_2$.

Notation: The Euclidean distance
▪ For any two points in $\mathbb{R}^n$, say $A = (a_1, \dots, a_n)^T$ and $B = (b_1, \dots, b_n)^T$, the Euclidean distance between them is given by
– $d(A, B) = \|\vec{AB}\|_2$,
– where $\vec{AB}$ is the vector with origin $A$ and head $B$.
▪ Let $A = (9, 3, 13)^T$ and $B = (12, 7, 25)^T$; compute $d(A, B)$. (Worked answers to these recap exercises are sketched below.)
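For checking purposes, here is one possible working of the recap exercises above; this working is mine, not reproduced from the slides.

```latex
% Stationary point of f(x) = x^2 - 6x + 2:
\[ f'(x) = 2x - 6 = 0 \;\Rightarrow\; x_0 = 3, \qquad f''(x_0) = 2 > 0 \;\Rightarrow\; \text{a minimum at } x_0 = 3. \]

% Stationary points of f(x) = 5x^3 + 2x^2 - 3x + 4:
\[ f'(x) = 15x^2 + 4x - 3 = 0 \;\Rightarrow\; x = \tfrac{-4 \pm 14}{30} = \tfrac{1}{3} \text{ or } -\tfrac{3}{5}, \quad
   f''(\tfrac{1}{3}) = 14 > 0 \text{ (local min)}, \; f''(-\tfrac{3}{5}) = -14 < 0 \text{ (local max)}. \]

% 2-norm of x = (6, 2, 9)^T:
\[ \|\mathbf{x}\|_2 = \sqrt{6^2 + 2^2 + 9^2} = \sqrt{121} = 11. \]

% Euclidean distance between A = (9, 3, 13)^T and B = (12, 7, 25)^T:
\[ d(A, B) = \sqrt{(12-9)^2 + (7-3)^2 + (25-13)^2} = \sqrt{9 + 16 + 144} = \sqrt{169} = 13. \]
```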
Training the linear regression model
▪ One way of measuring the performance of the model is to compute the mean squared error of the model on the test set, a.k.a. the loss.
▪ Since we train our ML algorithm on the training data, to determine the best fit what we have to minimise is
– $\epsilon_i = \hat{y}_i^{(train)} - y_i^{(train)}$ for every sample $i$, where $\epsilon_i$ is the residual or error between the predicted output $\hat{y}_i^{(train)}$ and its true value $y_i^{(train)}$.
▪ We need to jointly minimise $\epsilon_i$ for every sample, but $\epsilon_i$ can be either positive or negative. So, we could instead minimise the following function:
– $MSE_{train}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \big(\hat{y}_i^{(train)} - y_i^{(train)}\big)^2 = \frac{1}{N} \big\|\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big\|_2^2$
– $N$ is the number of samples in the training set.

Training the linear regression model (10 min)
▪ Compute and expand the product $\mathbf{X}^{(train)} \mathbf{w}$. What is $\mathbf{X}^{(train)} \mathbf{w}$ equal to?

Training the linear regression model
▪ What is $\mathbf{X}^{(train)} \mathbf{w}$ equal to?
$$\mathbf{X}^{(train)} \mathbf{w} = \begin{pmatrix} x_0^{(0)} & \dots & x_n^{(0)} \\ \vdots & \ddots & \vdots \\ x_0^{(N-1)} & \dots & x_n^{(N-1)} \end{pmatrix} \begin{pmatrix} w_0 \\ \vdots \\ w_n \end{pmatrix} = \begin{pmatrix} w_0 x_0^{(0)} + \dots + w_n x_n^{(0)} \\ \vdots \\ w_0 x_0^{(N-1)} + \dots + w_n x_n^{(N-1)} \end{pmatrix} = \begin{pmatrix} \hat{y}_0 \\ \vdots \\ \hat{y}_{N-1} \end{pmatrix} = \hat{\mathbf{y}}$$
▪ We can calculate all the predicted labels from the design matrix and the weight vector.

Training the linear regression model (5 min)
▪ Rewrite $MSE_{train}(\mathbf{w})$ in terms of the training data and $\mathbf{w}$.

Training the linear regression model
▪ Rewrite $MSE_{train}(\mathbf{w})$ in terms of the training data and $\mathbf{w}$:
– $MSE_{train}(\mathbf{w}) = \frac{1}{N} \big\|\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big\|_2^2 = \frac{1}{N} \big\|\mathbf{X}^{(train)} \mathbf{w} - \mathbf{y}^{(train)}\big\|_2^2$
– In this expression, $MSE_{train}(\mathbf{w})$ has been expressed in terms of $\mathbf{X}^{(train)}$ and $\mathbf{y}^{(train)}$, which are given to us by the training data set and thus are known.
– The only unknown to be determined is the weight vector $\mathbf{w}$ that minimises $MSE_{train}(\mathbf{w})$, which we will denote by $\mathbf{w}_{min}$.
▪ Solving linear regression, or finding the best fit, is simply determining the value of $\mathbf{w}$ that minimises the mean squared error.

Random search: solving the linear regression model
▪ One simple method to find the minimum could be to let $\mathbf{w}$ take different values and evaluate $MSE_{train}(\mathbf{w})$. (A code sketch of this idea follows below.)
▪ In this example, we have 2 features (the constant feature $x_0 = 1$ and one input feature $x$):
– the feature vector $\mathbf{x} = \begin{pmatrix} x_0 \\ x_1 \end{pmatrix} = \begin{pmatrix} 1 \\ x \end{pmatrix}$,
– the weight vector $\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} = \begin{pmatrix} b \\ a \end{pmatrix}$, where $b$ is the bias parameter,
– $\hat{y} = w_0 x_0 + w_1 x_1 = b + a x$.

Random search: solving the linear regression model
▪ Let's calculate $MSE_{train}(\mathbf{w})$ for various values of $\mathbf{w}$:
– If $\mathbf{w} = (-10, 2)^T$, then $MSE_{train}(\mathbf{w}) = 25.84$.
– If $\mathbf{w} = (8, 0.5)^T$, then $MSE_{train}(\mathbf{w}) = 18.79$.
– If $\mathbf{w} = (2, 1.5)^T$, then $MSE_{train}(\mathbf{w}) = 6.25$.
▪ The smaller $MSE_{train}(\mathbf{w})$, the better. But how do we know we have found the best possible fit?

Training the linear regression model
▪ Instead of going for a random search, we can use vector calculus to determine the exact value of $\mathbf{w}$ that minimises $MSE_{train}$.
▪ Let's remind ourselves how to do that for one variable. Let $f$ be a function of one variable $x$:
– Find the zeroes of $f'(x)$ (a.k.a. critical/stationary points).
– Say $x_0$ is the only value such that $f'(x_0) = 0$.
▪ If $f''(x) \le 0$ on $\mathbb{R}$, then $f$ has a global maximum at $x_0$.
▪ If $f''(x) \ge 0$ on $\mathbb{R}$, then $f$ has a global minimum at $x_0$.
– We may not have $f''(x)$ always negative or always positive. However, we can still find local extrema:
▪ If $f''(x_0) < 0$, then $f$ has a local maximum at $x_0$.
▪ If $f''(x_0) > 0$, then $f$ has a local minimum at $x_0$.

Training the linear regression model
▪ As $MSE_{train}(\mathbf{w})$ is a function of $n + 1$ variables (the entries of $\mathbf{w}$), the approach is slightly different but essentially the same.
▪ $MSE_{train}(\mathbf{w}) = \frac{1}{N} \big\|\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big\|_2^2 = \frac{1}{N} \big\|\mathbf{X}^{(train)} \mathbf{w} - \mathbf{y}^{(train)}\big\|_2^2$
▪ This is mathematically written as
– $\mathbf{w}_{min} = \operatorname{argmin}_{\mathbf{w}} MSE_{train}(\mathbf{w})$, i.e. find the argument $\mathbf{w}$ that minimises $MSE_{train}(\mathbf{w})$.
– $\mathbf{w}_{min} = \operatorname{argmin}_{\mathbf{w}} MSE_{train}(\mathbf{w}) = \operatorname{argmin}_{\mathbf{w}} \frac{1}{N} \big\|\mathbf{X}^{(train)} \mathbf{w} - \mathbf{y}^{(train)}\big\|_2^2$
– Does $N$ play a role in the minimisation of $MSE_{train}(\mathbf{w})$?
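A minimal NumPy sketch of the random-search idea. The one-feature dataset behind the quoted MSE values (25.84, 18.79, 6.25) is not reproduced in the transcript, so the data below is hypothetical.

```python
import numpy as np

def mse_train(w, X, y):
    """Mean squared error (1/N) * ||X.w - y||_2^2 for a given weight vector w."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

# Hypothetical one-feature dataset; the design matrix is augmented with a column of ones (x0 = 1).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 5.8, 8.2, 9.9])
X = np.column_stack([np.ones_like(x), x])   # columns: x0 = 1, x1 = x

# Naive random search: try many candidate weight vectors and keep the best one found.
rng = np.random.default_rng(0)
best_w, best_mse = None, np.inf
for _ in range(10_000):
    w = rng.uniform(-10, 10, size=2)        # candidate (b, a)
    m = mse_train(w, X, y)
    if m < best_mse:
        best_w, best_mse = w, m

print(best_w, best_mse)   # close, but nothing guarantees this is the best possible fit
```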
Closed form solution to the linear regression model
▪ We can use vector calculus to determine the exact value of $\mathbf{w}$ that minimises $\big\|\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big\|_2^2$. We first need to compute the gradient of $\big\|\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big\|_2^2$ with respect to $\mathbf{w}$.
– $\big\|\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big\|_2^2 = \big(\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big)^T \big(\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big)$
– Expanding yields
▪ $\big\|\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big\|_2^2 = \hat{\mathbf{y}}^{(train)T} \hat{\mathbf{y}}^{(train)} - 2\, \hat{\mathbf{y}}^{(train)T} \mathbf{y}^{(train)} + \mathbf{y}^{(train)T} \mathbf{y}^{(train)}$
– As $\hat{\mathbf{y}}^{(train)} = \mathbf{X}^{(train)} \mathbf{w}$ and $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T \mathbf{A}^T$, this gives
▪ $\big\|\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big\|_2^2 = \mathbf{w}^T \mathbf{X}^{(train)T} \mathbf{X}^{(train)} \mathbf{w} - 2\, \mathbf{w}^T \mathbf{X}^{(train)T} \mathbf{y}^{(train)} + \mathbf{y}^{(train)T} \mathbf{y}^{(train)}$

Closed form solution to the linear regression model
▪ Using the gradient properties $\nabla_{\mathbf{w}} \big(\mathbf{w}^T \mathbf{A} \mathbf{w}\big) = (\mathbf{A} + \mathbf{A}^T)\,\mathbf{w}$ and $\nabla_{\mathbf{w}} \big(\mathbf{w}^T \mathbf{a}\big) = \mathbf{a}$, we can calculate the gradient of $\big\|\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big\|_2^2$:
– $\nabla_{\mathbf{w}} \big\|\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big\|_2^2 = 2\, \mathbf{X}^{(train)T} \mathbf{X}^{(train)} \mathbf{w} - 2\, \mathbf{X}^{(train)T} \mathbf{y}^{(train)}$

Closed form solution to the linear regression model
▪ Setting $\nabla_{\mathbf{w}} \big\|\hat{\mathbf{y}}^{(train)} - \mathbf{y}^{(train)}\big\|_2^2 = \mathbf{0}$ and solving the equation for $\mathbf{w}$ yields
▪ $\mathbf{w}_{min} = \big(\mathbf{X}^{(train)T} \mathbf{X}^{(train)}\big)^{-1} \mathbf{X}^{(train)T} \mathbf{y}^{(train)}$, which is the only solution provided that $\mathbf{X}^{(train)T} \mathbf{X}^{(train)}$ is invertible.
▪ This equality is known as the normal equations and is the analytical solution of the linear regression problem.
▪ For this particular value $\mathbf{w} = \mathbf{w}_{min}$, it can be proven that $MSE_{train}$ attains a global minimum.

Evaluating the linear regression model
▪ Now we have found the best fit to our linear regression problem, but it has only been determined using the training dataset. We need another performance measure to determine whether our model predicts labels well for unseen data points. This is where the test set plays its role. Similarly:
– We have a design matrix of $M$ example inputs that we will not use for training, only for evaluating how well the model performs: $\mathbf{X}^{(test)} \in \mathbb{R}^{M \times n}$.
– We have a vector of regression targets providing the exact value of $y$ for each of these examples: $\mathbf{y}^{(test)} \in \mathbb{R}^M$.
▪ $\hat{\mathbf{y}}^{(test)} \in \mathbb{R}^M$ gives the predictions of the model on the test set.
▪ $MSE_{test} = \frac{1}{M} \sum_{i=1}^{M} \big(\hat{y}_i^{(test)} - y_i^{(test)}\big)^2 = \frac{1}{M} \big\|\hat{\mathbf{y}}^{(test)} - \mathbf{y}^{(test)}\big\|_2^2$
▪ The smaller $MSE_{test}$ is, the closer $\hat{\mathbf{y}}^{(test)}$ is to $\mathbf{y}^{(test)}$, which would mean that our model generalises well and is able to perform well on previously unseen inputs.

Individual activity (20 min)
▪ The augmented design matrix and the corresponding vector of labels of the previous problem are
– $\mathbf{X} = \begin{pmatrix} 1 & 4 & 103 \\ 1 & 6 & 116 \\ 1 & 9 & 132 \\ 1 & 10 & 138 \end{pmatrix}$ and $\mathbf{y} = \begin{pmatrix} 17 \\ 21 \\ 28 \\ 31 \end{pmatrix}$.
– Also, $\mathbf{x} = (x_0, x_1, x_2)^T$ and $\mathbf{w} = (w_0, w_1, w_2)^T$, i.e. $\hat{y} = \mathbf{w}^T \mathbf{x} = w_0 x_0 + w_1 x_1 + w_2 x_2$ with $w_0 = b$ and $x_0 = 1$ for all samples.
▪ In Python, compute $\mathbf{w}_{min}$ and show that $\mathbf{w}_{min} = (33.38983,\ 4.1695,\ -0.3220)^T$, i.e. $\hat{y} = 33.38983 + 4.1695\, x_1 - 0.3220\, x_2$.
▪ Compute $MSE_{train}$ for $\mathbf{w} = \mathbf{w}_{min}$. (One possible solution is sketched below.)
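One possible NumPy solution to the activity, using the normal equations above. Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a numerical-stability choice, not something required by the slides.

```python
import numpy as np

# Augmented design matrix (first column is the constant feature x0 = 1) and labels.
X = np.array([[1, 4, 103],
              [1, 6, 116],
              [1, 9, 132],
              [1, 10, 138]], dtype=float)
y = np.array([17, 21, 28, 31], dtype=float)

# Normal equations: w_min = (X^T X)^(-1) X^T y, solved as the linear system (X^T X) w = X^T y.
w_min = np.linalg.solve(X.T @ X, X.T @ y)

# Training error at w_min: MSE_train = (1/N) * ||X w_min - y||_2^2.
mse_train = np.mean((X @ w_min - y) ** 2)

print(w_min)      # expected (from the slides): approximately [33.38983, 4.1695, -0.3220]
print(mse_train)
```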
Finding the minimum of a function and Convexity
▪ Studying the convexity of a function is about determining whether a function is convex, concave or neither.
▪ In mathematics, a real-valued function is called convex (resp. concave) if the line segment between any two distinct points on the graph of the function lies above (resp. below) the graph between the two points.

Finding the minimum of a function and Convexity
▪ A function that is not convex is not necessarily concave, and vice versa. Most functions are neither:
– Convex functions: $f(x) = x^2$, $f(x) = \exp(x)$
– Concave functions: $f(x) = \sqrt{x}$, $f(x) = \ln(x)$
– Neither convex nor concave: $f(x) = x^3$, $f(x) = \cos(x)$
▪ The objective/loss function $MSE_{train}$ is a convex function, but not all ML optimisation problems are convex or concave!
– Deep neural network loss functions are typical examples of functions that are neither convex nor concave.
– So, what we will talk about next does not generalise to all ML problems.

Finding the minimum of a function and Convexity
▪ Convex functions play an important role in many areas of mathematics. They are especially important in the study of optimisation problems, where they are distinguished by a number of convenient properties. For instance, a strictly convex function on an open set has no more than one minimum.
– So, if we find one local minimum, then this local minimum is actually the global minimum.
▪ Another important property is that a twice-differentiable function of a single variable is convex if and only if its second derivative is non-negative on its entire domain.
▪ The objective function $MSE_{train}$ is a convex function.
– Deep neural network loss functions are typical examples of functions that are neither convex nor concave.
– So, what we will talk about next, unfortunately, does not generalise to all ML problems.

Finding the minimum of a function and Convexity
▪ Why is convexity important?
– It allows us to conclude whether a local minimum (resp. maximum) is a global minimum (resp. maximum).
– A local minimum (resp. maximum) is a global minimum (resp. maximum) over $\mathbb{R}^n$ if the objective function is convex (for a minimum) or concave (for a maximum).
– The $\mathbf{w}_{min}$ we found only makes the gradient of $MSE_{train}$ vanish.
▪ We need to show that it is a local minimum, and convexity will then guarantee that it is the global minimum.
▪ This involves more advanced mathematics: calculating the Hessian matrix of $MSE_{train}$ at $\mathbf{w}_{min}$ and proving that it is positive (resp. negative) semi-definite on $\mathbb{R}^n$ for a minimum (resp. for a maximum).
▪ This is just a generalisation of A-level maths to multivariable functions!
– Find the zeroes of $f'(x)$ (a.k.a. critical points). Say $x_0$ is the only value such that $f'(x_0) = 0$.
▪ If $f''(x) \le 0$ on $\mathbb{R}$, then $f$ has a maximum at $x_0$.

Linear regression
▪ So, we could ask ourselves whether ML is just an optimisation problem.
– The central challenge in machine learning is that we must perform well on new, previously unseen inputs, and not only on the training data set.
– The $\mathbf{w}_{min}$ we found only minimises $MSE_{train}$ and obviously depends on the chosen training set.
▪ Can we therefore guarantee that $\mathbf{w}_{min}$ will also be good for $MSE_{test}$?
– This will depend on whether the statistical properties of the training set are the same as those of the test set.
– It will also depend on whether the data set we use captures the phenomenon/phenomena of interest entirely.

Linear regression
▪ So far so good: we have solved the problem analytically, but can we always solve all ML problems analytically?
– The answer is no, and it depends on the complexity of the problem to solve.
▪ The next question is, can we find a numerical approximation of the solution?
– This time, the answer is yes, but with some caveats:
▪ The minimum we find may be a local minimum.
▪ The local minimum we find will depend on the implementation of our algorithm.

Training and test error
▪ The ability to perform well on previously unobserved inputs is called generalisation.
▪ As we train our model (through numerical methods) we will keep track of two errors:
– The training error (or validation error), which we want to minimise.
– The test error (or generalisation error), which we also want to keep low.
▪ By training, we mean iteratively updating the weights to minimise the loss function. Be careful not to confuse training and learning!
▪ What is the difference?
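The slides introduce gradient descent through the linked videos; as a rough sketch of what "iteratively updating the weights to minimise the loss" can look like for linear regression, here is a minimal NumPy version. The dataset, learning rate and iteration count below are my own choices, not values from the slides.

```python
import numpy as np

def gradient_descent(X, y, lr=0.05, n_iters=2000):
    """Minimise MSE_train(w) = (1/N)||Xw - y||^2 by repeatedly stepping
    against its gradient (2/N) X^T (Xw - y)."""
    N, d = X.shape
    w = np.zeros(d)                           # arbitrary starting point
    for _ in range(n_iters):
        grad = (2.0 / N) * X.T @ (X @ w - y)  # gradient of the training MSE at w
        w -= lr * grad                        # update rule: w <- w - lr * gradient
    return w

# Small hypothetical one-feature dataset (same as the earlier random-search sketch),
# with an augmented column of ones for the bias.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 5.8, 8.2, 9.9])
X = np.column_stack([np.ones_like(x), x])

w = gradient_descent(X, y)
print(w)   # approaches the normal-equations solution (roughly b = 2.0, slope = 2.0)
```

Note that the learning rate has to be small enough for the iterations to converge; this is one of the parameters and limitations of gradient descent mentioned in the learning outcomes.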
Factors determining ML performance
▪ Method used to train an ML algorithm:
– Sample the data set to generate the training/test sets.
– Choose parameters to reduce the training set error.
▪ It can then be shown that the expected training error is less than or equal to the expected test error. We can conclude that we need to:
– make the training error small,
– make the gap between training and test error small.

Capacity, Overfitting and Underfitting
▪ Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.
▪ Overfitting occurs when the gap between the training error and test error is too large.
▪ We can control whether a model is more likely to overfit or underfit by altering its capacity. Capacity measures the ability of a model to fit a wide variety of functions.
– Models with low capacity may struggle to fit the training set.
– Models with high capacity can overfit by memorising properties of the training set that do not serve them well on the test set.

Hypothesis space
▪ One way to control the capacity of a learning algorithm is by choosing its hypothesis space:
– the set of functions that the learning algorithm is allowed to select as being the solution.
▪ What does it mean for linear regression?
– The linear regression algorithm has the set of all linear functions of its input as its hypothesis space.
– We can generalise linear regression to include polynomials, rather than just linear functions, in its hypothesis space:
▪ $\hat{y} = b + wx$, with $\mathbf{x} = (x_0, x_1)^T = (1, x)^T$ and $\mathbf{w} = (w_0, w_1)^T = (b, w)^T$.
▪ $\hat{y} = b + w_1 x + w_2 x^2$, with $\mathbf{x} = (x_0, x_1, x_2)^T = (1, x, x^2)^T$ and $\mathbf{w} = (w_0, w_1, w_2)^T = (b, w_1, w_2)^T$, has a larger capacity than the previous model as its hypothesis space is larger.

Overfitting and Underfitting
(figure not reproduced in the transcript)

Bias and Variance
▪ Empirically, when training a model, it has been noticed that a model can have:
– an irreducibly large training error with a small gap between the training and testing error,
– a small training error with a large gap between the training and testing error,
– an acceptably low training error with an acceptably low gap between training and testing error.
▪ A high/low bias occurs when the training error is high/low.
▪ A high/low variance occurs when the gap between the training error and testing error is high/low.

Bias and Variance
https://www.youtube.com/watch?v=EuBBz3bI-aA&t=3s

Gradient descent

Gradient descent for linear regression
https://www.youtube.com/watch?v=sDv4f4s2SB8

Shaping the Future
Logistic regression

Binary classifier
▪ A binary classifier is a classifier with exactly two classes.
– Each input (element in the dataset) belongs either to one class or to the other, but never to both.
– One class is called the positive class (or class 1).
– The other class is called the negative class (or class 0).
– E.g. reptile vs. non-reptile.
▪ How do we measure the performance of the model?
– True positive, $T_p$ (e.g. a crocodile is classed as a reptile).
– True negative, $T_n$ (e.g. a cat is classed as a non-reptile).
– False positive, $F_p$ (e.g. a cat is classed as a reptile although it is not).
– False negative, $F_n$ (e.g. a crocodile is classed as a non-reptile although it is one).
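A small sketch of counting these four outcomes for a binary task; the label arrays below are hypothetical, not data from the slides.

```python
import numpy as np

# Hypothetical ground truth and predictions (reptile = 1, non-reptile = 0).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))   # reptiles correctly classed as reptiles
tn = np.sum((y_true == 0) & (y_pred == 0))   # non-reptiles correctly classed as non-reptiles
fp = np.sum((y_true == 0) & (y_pred == 1))   # non-reptiles wrongly classed as reptiles
fn = np.sum((y_true == 1) & (y_pred == 0))   # reptiles wrongly classed as non-reptiles

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, tn, fp, fn, accuracy)
```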
Confusion matrix for binary classifier
▪ A confusion matrix is a table summarising the performance of a classifier.

                          Predicted +    Predicted -
    Ground truth      +      #Tp            #Fn
    (reality)         -      #Fp            #Tn

▪ Performance metrics of the model can be easily calculated from the table:
– $Accuracy = (T_p + T_n)/(T_p + F_p + F_n + T_n)$
– $Sensitivity = Recall = T_p/(T_p + F_n)$
– $Precision = T_p/(T_p + F_p)$
– $Specificity = T_n/(T_n + F_p)$

Exercise on confusion matrix
▪ A data set with 4 classes {C1 = car, C2 = pedestrian, C3 = motorbike, C4 = lorry} is used to train a classifier. The confusion matrix generated to evaluate the performance of the classifier is:

                         Predicted
                     C1    C2    C3    C4
    Ground    C1     52     3     7     2
    truth     C2      2    28     2     0
              C3      5     2    25    12
              C4      1     1     9    40

– How many elements are in every class?
– Compute the accuracy of the model.
▪ To compute precision and recall, we need to generate another confusion matrix with one positive class (e.g. C1) and one negative class (e.g. C2, C3, C4).
– Complete the 2x2 (+/-) confusion matrix where C1 is the positive class:
▪ Positive = C1
▪ Negative = C2 ∪ C3 ∪ C4
– Compute the precision and recall.

Logistic regression
▪ The Task is a classification task.
– A somewhat strange name, since we use the model for classification rather than regression. However, we predict the probability of an input belonging to the positive class, hence the name.
▪ It is a binary classifier.
– Two classes: class 0 and class 1.
– We need only specify the probability of one of these classes, since $P(\mathbf{x} \in Class_1) = 1 - P(\mathbf{x} \in Class_0)$.
– $P(\mathbf{x} \in Class_1) \in [0, 1]$.
▪ This is a supervised machine learning algorithm.

Let's work through an example…
▪ We have one feature. The value of sample $i$ is denoted by $x^{(i)}$.
▪ We also have a label attached to each example: $y^{(i)}$.
▪ The design matrix and label vector are:

    x^(i):   -2.2  -1.4  -0.8   0.2   0.4   0.8   1.2   2.2   2.9   4.6
    colour:   red   red  green  red  green green green green green green
    y^(i):     0     0     1     0     1     1     1     1     1     1

The problem to be solved
▪ Given the value of $x$, what is the probability that the point is red? Green?
▪ Basically, we want to predict the label from the feature value of the point. Ideally:
– green points from our dataset would have a probability of 1 (of being green),
– red points from our dataset would have a probability of 0 (of being green).
▪ What would the performance measure be?
– Given what we know about the colour of the points, how can we evaluate how good (or bad) the predicted probabilities are?
– It should return high values for bad predictions and low values for good predictions.

▪ Let's split the data into our two classes against the value their label takes.
▪ We want to fit an S-curve which will give us $P(x \in Class_1)$.
– The chosen fitted regression curve is a sigmoid curve representing the probability of a point being green for any given $x$.

Logistic regression
▪ Logistic regression uses the logistic sigmoid $\sigma$ defined by:
– $\sigma(t) = \dfrac{1}{1 + e^{-t}}$
▪ For one feature: $t = b + w_1 x$.
▪ For more features: $t = \mathbf{w}^T \mathbf{x}$, with $\mathbf{w}$ and $\mathbf{x}$ augmented.
▪ The idea is to find the best distribution. Based on the S-curve, this means finding:
– its shift,
– its stretch,
– which are both controlled by $b$ and $w_1$.

Predicted probabilities for the positive class
▪ For all points belonging to the positive class (green), what are the predicted probabilities given by our classifier?
– These are the green bars under the sigmoid curve, at the $x$ coordinates corresponding to the points.

Predicted probabilities for the negative class
▪ What is the probability of a given point being red?
– The red bars above the sigmoid curve.

Predicted probabilities
▪ The bars represent the predicted probabilities associated with the corresponding true class of each point.

Predicted probabilities for both classes
▪ We do not need to keep the value of $x$, as we only care about the predicted probability of each example.

Compute the loss function
▪ Aim: penalise bad predictions and incur no loss for good predictions.
– If the probability associated with the true class is 1.0, we need its loss to be zero.
– If that probability is low, say 0.01, we need its loss to be huge.
▪ Which function does this for us?
– $-\log(P(x))$ does the job, with $P(x) = P(x \in Class_1)$ or $P(x) = P(x \in Class_0)$.
– Since the log of values between 0.0 and 1.0 is negative, we take the negative log to obtain a positive value for the loss.

▪ Let's take the (negative) log of the probabilities; these are the corresponding losses of each and every point.
▪ The mean is the value of our loss function.

▪ More generally, the loss function is given by
– $LogLoss = -\dfrac{1}{N} \sum_{i=1}^{N} \Big[ y^{(i)} \log P\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\big(1 - P\big(x^{(i)}\big)\big) \Big]$
– $y^{(i)} \in \{0, 1\}$ is the label attached to sample $i$.
– $P\big(x^{(i)}\big) = \sigma\big(b + w_1 x^{(i)}\big)$ is the probability that sample $i$ belongs to the positive class.
– $N$ is the total number of samples.
▪ It is also called the cross-entropy or log loss function.
▪ Now, the ultimate question: how do we minimise it?
– There is no closed form solution.
– But the problem is convex and differentiable, so gradient descent will do.
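A minimal sketch of fitting this logistic regression model to the one-feature dataset above by gradient descent on the log loss. The gradient expressions are the standard ones for this loss; the learning rate and iteration count are arbitrary choices, not values from the slides.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(b, w1, x, y):
    """Cross-entropy loss from the slides:
    -(1/N) * sum( y*log(P(x)) + (1-y)*log(1-P(x)) ), with P(x) = sigmoid(b + w1*x)."""
    p = sigmoid(b + w1 * x)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# One-feature dataset from the slides (red = 0, green = 1).
x = np.array([-2.2, -1.4, -0.8, 0.2, 0.4, 0.8, 1.2, 2.2, 2.9, 4.6])
y = np.array([0, 0, 1, 0, 1, 1, 1, 1, 1, 1])

# Gradient descent on (b, w1).
b, w1, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    p = sigmoid(b + w1 * x)
    # Gradients of the log loss with respect to b and w1: mean(p - y) and mean((p - y) * x).
    b  -= lr * np.mean(p - y)
    w1 -= lr * np.mean((p - y) * x)

print(b, w1, log_loss(b, w1, x, y))
```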
Questions
[email protected]