Lecture Notes: Supervised Learning - Chapter 3 Part 1 PDF

Summary

These lecture notes cover supervised learning, focusing on regression and classification methods. Topics include linear and logistic regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Naive Bayes. Additional concepts like ensemble classifiers, uncertainty estimation, and theoretical foundations are also mentioned.

Full Transcript


CHAPTER 3: SUPERVISED LEARNING

RECAP: MACHINE LEARNING PROCESS
Data Acquisition -> Data Pre-processing -> Model Training & Building -> Model Testing/Evaluation (using test data) -> Model Deployment

CONTENTS
1. Regression vs Classification
2. Traditional Machine Learning Models/Algorithms
   - Linear and Logistic Regression
   - Naïve Bayes
   - K-Nearest Neighbour (k-NN)
   - Support Vector Machine (SVM)
   - Artificial Neural Network (ANN)
   - Decision Tree
3. Ensemble Classifiers
   - Bagging
   - Boosting
   - Stacking
4. Uncertainty Estimates from Classifiers
5. No Free Lunch Theorem

COURSE OUTCOMES
By the end of this chapter, students should be able to:
- understand the concepts of regression and classification in solving machine learning problems.
- know some applications of regression analysis and classification in solving real-world machine learning problems.
- understand the concept of uncertainty estimates from classifiers and the no free lunch theorem.

CLASSIFICATION VS REGRESSION
Classification: predicts/classifies discrete values. It is the process of finding a function that divides the dataset into classes based on different parameters. The task of a classification algorithm is to find the mapping function that maps the input (x) to a discrete output (y). Algorithms include k-NN, SVM, logistic regression, decision tree, ANN, Naïve Bayes and ensemble classifiers.
Regression: predicts continuous values. It is the process of finding the correlations between dependent and independent variables. The task of a regression algorithm is to find the mapping function that maps the input variable (x) to a continuous output variable (y). Algorithms include linear regression, polynomial regression, support vector regression and other regression methods.

LINEAR REGRESSION
Linear regression is one of the most common statistical modeling approaches used in data science. It is commonly used to quantify the relationship between two or more variables.

Introduction
In data science applications, it is very common to be interested in the relationship between two or more variables.

Simple Linear Regression
Estimate the model parameters and the prediction:

    ŷ = β̂0 + β̂1·x

where
ŷ = estimated dependent (response) variable
x = independent (predictor / regressor / explanatory) variable
β̂0 = estimate of the y-intercept, the point at which the line intersects the y-axis (regression constant)
β̂1 = estimate of the slope, the amount of increase/decrease in y for each unit increase (or decrease) in x (regression coefficient)
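The estimates β̂0 and β̂1 can be computed directly from the data with the usual least-squares formulas: β̂1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and β̂0 = ȳ − β̂1·x̄. The following is a minimal sketch (not from the slides) using NumPy on a small made-up dataset; the arrays x and y are placeholders for any paired observations.

    import numpy as np

    x = np.array([1000, 1500, 2000, 2500, 3000])            # e.g. living area in sq ft (made-up values)
    y = np.array([280000, 410000, 560000, 700000, 830000])  # e.g. price (made-up values)

    # Closed-form least-squares estimates of the slope and intercept
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()

    print(beta0, beta1)          # estimated intercept and slope
    print(beta0 + beta1 * 4600)  # predicted y for x = 4600

scikit-learn's LinearRegression, used in the next section, computes the same estimates.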
Building a Simple Linear Regression Model
The Python library sklearn will be used to build a simple linear regression model that finds the line of best fit. The coefficients β0 and β1 are calculated so that the residuals are minimized. Because we want to evaluate the model's predictions on data it has not seen before (to give us a sense of accuracy), we train the model on a subset of the data and test it on another subset.

Building linear regression using train_test_split (hold-out method)

Step 1: Define our train and test data

    # Import the module train_test_split
    from sklearn.model_selection import train_test_split

    # Define our predictor and target variables
    X = houses[['sqft_living']]
    Y = houses['price']

    # Create four groups using train_test_split.
    # In this example, 75% of the data is used to train and the remaining 25% to test.
    x_train, x_test, y_train, y_test = train_test_split(X, Y)

Step 2: Build and fit the model

    # Import the libraries
    from sklearn.linear_model import LinearRegression
    import matplotlib.pyplot as plt

    # Initialize a linear regression model object
    lr = LinearRegression()

    # Fit the linear regression model object to our data
    lr.fit(x_train, y_train)

    # Print the intercept and the slope of the model
    print(lr.intercept_)
    print(lr.coef_)

    # Show the line of best fit
    plt.plot(x_train, lr.coef_*x_train + lr.intercept_, '-r',
             label='Intercept: -39,163 \nSlope: 279.4')

Figure 3: Linear regression model (scatter of sqft_living against price with the fitted line).

The coefficient β1 of our model tells us that the price increases by approximately $279.4 for every additional square foot in a house. Now suppose we want to predict the price of a house with 4,600 square feet. We can use the .predict() method on our model lr to obtain the price:

    lr.predict([[4600]])

Multiple Linear Regression
Multiple linear regression is used to describe linear relationships involving a dependent variable (y) and two or more independent variables. The general form of the multiple linear regression model is

    y = β0 + β1·x1 + β2·x2 + ... + βk·xk + ε

where β0, β1, ..., βk are the unknown parameters (regression coefficients) and ε is the error term.

Implementing Multiple Linear Regression in Python
We will use the 50_Startups dataset. Start by importing the dataset:

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    url = 'https://raw.githubusercontent.com/content-anu/dataset-multiple-regression/master/50_Startups.csv'
    dataset = pd.read_csv(url)
    dataset.head()

Data Pre-processing
First: build the matrix of features and the dependent vector.

    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, 4].values

Second: encode the categorical variable.

    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    labelEncoder_X = LabelEncoder()
    X[:, 3] = labelEncoder_X.fit_transform(X[:, 3])

    from sklearn.compose import ColumnTransformer
    ct = ColumnTransformer([('encoder', OneHotEncoder(), [3])], remainder='passthrough')
    X = np.array(ct.fit_transform(X), dtype=float)

Third: avoid the dummy variable trap by dropping one dummy column.

    X = X[:, 1:]

Fourth: split the data into training and test sets.

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Last step: fit the model.

    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

Predicting the Test Set Results
We create a vector containing all the predictions of the test-set profit. The predicted profits are put into a vector called y_pred, which contains a prediction for every observation in the test set. The predict method makes the predictions for the test set, so its input is the test set; the argument must be an array or sparse matrix, hence we pass X_test.

    y_pred = regressor.predict(X_test)

R-Squared
R-squared, also known as the coefficient of determination, is the proportion of variation that is explained by a linear model. In general, a higher R-squared value represents a better fit of the data.

    # Import the r2_score module
    from sklearn.metrics import r2_score

    # Print the R2 score
    print(r2_score(y_test, y_pred))
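As an aside (not part of the original notes), the same pre-processing and fitting can be expressed more compactly with a ColumnTransformer inside a Pipeline; OneHotEncoder(drop='first') drops one dummy column and so avoids the dummy variable trap without manual slicing. This is an illustrative sketch that assumes the same 50_Startups CSV used above.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    url = 'https://raw.githubusercontent.com/content-anu/dataset-multiple-regression/master/50_Startups.csv'
    dataset = pd.read_csv(url)

    X = dataset.iloc[:, :-1]   # predictors, including the categorical State column
    y = dataset.iloc[:, 4]     # profit

    # One-hot encode column 3 (State) and drop the first dummy to avoid the dummy variable trap
    preprocess = ColumnTransformer([('state', OneHotEncoder(drop='first'), [3])],
                                   remainder='passthrough')
    model = Pipeline([('prep', preprocess), ('reg', LinearRegression())])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))   # R-squared on the test set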
Regression Applications
- Linear regression can be used to predict the sale of products in the future based on past buying behaviour.
- Economists use linear regression to predict the economic growth of a country or state.
- Sports analysts use linear regression to predict the number of runs or goals a player will score in coming matches based on previous performances.
- An organization can use linear regression to figure out how much to pay a new joiner based on years of experience.
- Linear regression analysis can help a builder predict how many houses they would sell in the coming months and at what price.

Other Regression Techniques
1. Polynomial Regression: extends linear regression by adding polynomial terms to the model, allowing it to fit non-linear relationships between the independent and dependent variables. Use case: when the relationship between variables is non-linear but can be approximated by a polynomial function.
2. Lasso Regression (L1 Regularization): similar to ridge regression but uses L1 regularization, which can lead to sparsity in the coefficient estimates (some coefficients become exactly zero). Use case: when you want feature selection and regularization to reduce the number of predictors.
3. Ridge Regression (L2 Regularization): a type of linear regression that includes a regularization term to penalize large coefficients, which helps prevent overfitting by shrinking the coefficients. Use case: when dealing with multicollinearity or when you need to prevent overfitting.
4. Elastic Net Regression: combines both L1 and L2 regularization, balancing the benefits of ridge and lasso regression. Use case: when you want both regularization and feature selection.
5. Poisson Regression: used for count data, where the dependent variable represents the number of occurrences of an event. Use case: when modelling count data or rates, such as the number of events occurring in a fixed period of time.
6. Quantile Regression: estimates conditional quantiles (e.g., the median) of the response variable rather than the conditional mean, which provides a more comprehensive view of the relationship between variables. Use case: when you want to understand the impact of predictors on different points of the distribution of the response variable.
7. Robust Regression: techniques designed to be less sensitive to outliers and violations of model assumptions, such as the Huber or RANSAC (Random Sample Consensus) methods. Use case: when dealing with data that may contain outliers or is not well behaved.
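None of these techniques is demonstrated in the notes, but scikit-learn exposes several of them with the same fit/predict interface. The following is an illustrative sketch (the alpha and l1_ratio values are arbitrary placeholders) that could be run on the X_train/X_test split created in the multiple regression example above.

    from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import make_pipeline

    models = {
        'ridge (L2)':   Ridge(alpha=1.0),
        'lasso (L1)':   Lasso(alpha=0.1),
        'elastic net':  ElasticNet(alpha=0.1, l1_ratio=0.5),
        # polynomial regression: expand the features, then fit an ordinary linear model
        'polynomial':   make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, model.score(X_test, y_test))   # R-squared on the test set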
LOGISTIC REGRESSION

Logistic Regression
- Logistic regression uses an equation as its representation, very much like linear regression.
- Input values (x) are combined linearly using weights or coefficient values (referred to by the Greek letter beta) to predict an output value (y).
- Logistic regression is another technique borrowed by machine learning from the field of statistics.
- It is the go-to method for binary classification problems (problems with two class values).

- Logistic regression is named for the function used at the core of the method, the logistic function.
- The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology: rising quickly and maxing out at the carrying capacity of the environment.
- It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits:

    1 / (1 + e^(-value))

  where e is the base of the natural logarithms (Euler's number, or the EXP() function in a spreadsheet) and value is the actual numerical value that you want to transform.
- A plot of the numbers between -5 and 5 transformed into the range 0 to 1 using the logistic function shows this characteristic S-shaped curve.

All regression models attempt to model the relationship f between a dependent variable y and a number of independent variables xi.

Differences between linear and logistic regression
- Dependent variable y: numeric in linear regression; nominal in logistic regression.
- Functional relationship between the independent and dependent variables: linear regression models the dependent variable directly,

    y = f(x1, ..., xn, β0, ..., βn) = β0 + β1·x1 + ... + βn·xn,

  while logistic regression models the class probability,

    P(y = class i) = f(x1, ..., xn, β0, ..., βn).

In binary logistic regression the dependent variable has two possible outcomes, such as y ∈ {white, black}, which can be coded as y ∈ {0, 1}. The function that describes the relationship between the probability of class y = 1 and the independent variables is

    P(y = 1 | x) = π(z) = exp(z) / (1 + exp(z)),   π(z) ∈ (0, 1),

with z = β0 + β1·x1 + ... + βn·xn taken from the linear regression model. P(y = 1 | x) is also known as π. The probabilities of all classes have to sum to 1:

    P(y = 1 | x) + P(y = 0 | x) = 1,   so   P(y = 0 | x) = 1 − π,

which means we do not need any coefficients for the second class.

Binary Logistic Regression: Calculating the regression coefficients
To calculate the coefficients, we maximize the likelihood function L(β; y, X) to get the best approximation of the probabilities:

    L(β; y, X) = ∏ (i = 1 to m) πi^yi · (1 − πi)^(1 − yi)

where yi = 0 if observation i belongs to the reference category and yi = 1 otherwise.

The logarithm is a monotonically increasing function, so maximizing the log-likelihood

    max LL(β; y, X) = max Σ (i = 1 to m) [ yi·ln(πi) + (1 − yi)·ln(1 − πi) ]

is equivalent to maximizing the original likelihood function

    max L(β; y, X) = max ∏ (i = 1 to m) πi^yi · (1 − πi)^(1 − yi).

Meaning of the Regression Coefficients
- Interpretation of the sign: βi > 0 means a higher xi leads to a higher probability; βi < 0 means a higher xi leads to a smaller probability.
- Interpretation of the p-value, which is the result of the Wald test: it shows whether a feature has a significant impact.
- Another, more advanced interpretation method is the odds ratio: OddsRatio(xi) = exp(βi).
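The notes do not include code for logistic regression, so here is an illustrative sketch (not from the slides) using scikit-learn's LogisticRegression on the built-in breast cancer dataset, chosen only because it has a binary 0/1 target. It shows the fitted coefficients, the odds ratios exp(βi) and the class probabilities P(y = 0 | x) and P(y = 1 | x).

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)               # binary target: y is 0 or 1
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=5000)                   # fits beta by maximizing the log-likelihood
    clf.fit(X_train, y_train)

    print(clf.coef_)                        # signs: beta_i > 0 raises P(y=1), beta_i < 0 lowers it
    print(np.exp(clf.coef_))                # odds ratios exp(beta_i)
    print(clf.predict_proba(X_test[:5]))    # P(y=0|x) and P(y=1|x) for 5 test points; each row sums to 1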
Logistic Regression Applications
- Credit scoring: ID Finance is a financial company that builds predictive models for credit scoring. They need their models to be easily interpretable, since a regulator can ask about a certain decision at any moment.
- Medicine: medical information is gathered in such a way that when a research group studies a biological molecule and its properties, they publish a paper about it. Thus there is a huge amount of medical data about various compounds, but it is not combined into a single database.
- Text editing: making a claim about a text fragment. Toxic speech detection, topic classification for support questions, and email sorting are examples where logistic regression shows good results.

NAÏVE BAYES

Recap: Bayes Theorem
- Naive Bayes is a classification algorithm that works based on the Bayes theorem.
- Bayes theorem is used to find the probability of a hypothesis given the evidence. Using Bayes theorem we can find the probability of A given that B has occurred, where A is the hypothesis and B is the evidence:

    P(A | B) = P(B | A) · P(A) / P(B)

  Here P(B | A) is the probability of B given that A is true, and P(A) and P(B) are the independent probabilities of A and B.
- Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.

- To demonstrate the concept of Naïve Bayes, consider the example in the illustration: the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., to decide which class label they belong to, based on the currently existing objects.
- Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which has not been observed yet) is twice as likely to have membership GREEN rather than RED.
- In Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen.
- Thus, we can write:

    Prior probability of GREEN ∝ number of GREEN objects / total number of objects
    Prior probability of RED ∝ number of RED objects / total number of objects

- Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:

    P(GREEN) = 40/60        P(RED) = 20/60

- Having formulated our prior probability, we are now ready to classify a new object X (the WHITE circle). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely the new case belongs to that particular colour.
- To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we count the number of points in the circle belonging to each class label. From this we calculate the likelihoods:

    Likelihood of X given GREEN ∝ number of GREEN in the vicinity of X / total number of GREEN = 1/40
    Likelihood of X given RED ∝ number of RED in the vicinity of X / total number of RED = 3/20

- From the illustration it is clear that the likelihood of X given GREEN is smaller than the likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones.
- Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN as RED), the likelihood indicates otherwise: that the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In Bayesian analysis, the final classification is produced by combining both sources of information, the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule (named after Rev. Thomas Bayes, 1702-1761):

    Posterior probability of X being GREEN ∝ prior probability of GREEN × likelihood of X given GREEN = 40/60 × 1/40 ≈ 0.0167
    Posterior probability of X being RED ∝ prior probability of RED × likelihood of X given RED = 20/60 × 3/20 = 0.05

- Finally, we classify X as RED since its class membership achieves the largest posterior probability. Note: the above probabilities are not normalized; however, this does not affect the classification outcome since their normalizing constants are the same.
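The arithmetic of this worked example can be checked with a few lines of plain Python; the counts below are taken directly from the illustration (40 GREEN, 20 RED, with 1 GREEN and 3 RED points inside the circle around X).

    # Numeric check of the Naive Bayes worked example above
    n_green, n_red = 40, 20
    n_total = n_green + n_red

    prior_green, prior_red = n_green / n_total, n_red / n_total   # 0.667 and 0.333
    like_green, like_red = 1 / n_green, 3 / n_red                 # 0.025 and 0.15

    post_green = prior_green * like_green    # unnormalized posterior, ~0.0167
    post_red = prior_red * like_red          # unnormalized posterior, 0.05
    print(post_green, post_red)              # the larger value wins: X is classified RED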
K-NEAREST NEIGHBOUR METHOD

k-Nearest Neighbour (k-NN)
- k-NN is the simplest and most straightforward classifier, since classification of a data point is based on the classes of its nearest neighbours.
- It is also known as lazy learning because its training is deferred until run time.
- The classifier is memory-based and requires no model to be fitted.
- The method was studied by Fix and Hodges in 1951 at the US Air Force School of Aviation Medicine.
- There are two parameters in this classifier: the value of k and the distance metric that is used.

Classification is by majority vote within the k nearest neighbours of the new point X. [Figure: (a) 1-nearest neighbour, (b) 2-nearest neighbour, (c) 3-nearest neighbour; with k = 1 the new point is labelled dark green, with k = 3 it is labelled green.] The k-nearest neighbours of a record x are the data points that have the k smallest distances to x.

Choosing the Right Value for k
To select the k that is right for your data, run the k-NN algorithm several times with different values of k and choose the k that reduces the number of errors while maintaining the algorithm's ability to make accurate predictions on data it has not seen before. Here are some things to keep in mind:
1. As the value of k decreases to 1, your predictions become less stable. Imagine k = 1 and a query point surrounded by several reds and one green, where the green happens to be the single nearest neighbour. Reasonably, you would think the query point is most likely red, but because k = 1, k-NN incorrectly predicts that the query point is green.
2. Inversely, as the value of k increases, your predictions become more stable due to majority voting/averaging and are thus more likely to be accurate (up to a certain point). Eventually, you will begin to see an increasing number of errors; at this point you have pushed the value of k too far.
3. In cases where a majority vote is taken among labels (for example, picking the mode in a classification problem), k is usually chosen to be an odd number to act as a tiebreaker.

Error Rate vs K Value Plot: Python Code

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import KNeighborsClassifier

    error_rate = []
    for i in range(1, 40):
        knn = KNeighborsClassifier(n_neighbors=i)
        knn.fit(X_train, y_train)
        pred_i = knn.predict(X_test)
        error_rate.append(np.mean(pred_i != y_test))

    plt.figure(figsize=(10, 6))
    plt.plot(range(1, 40), error_rate, color='blue', linestyle='dashed',
             marker='o', markerfacecolor='red', markersize=10)
    plt.title('Error Rate vs. K Value')
    plt.xlabel('K')
    plt.ylabel('Error Rate')

k-NN Distance Metrics
Minkowski distance, Manhattan distance, Euclidean distance, Cosine distance, Jaccard distance, Hamming distance.

Advantages
1. The algorithm is simple and easy to implement.
2. There is no need to build a model, tune several parameters, or make additional assumptions.
3. The algorithm is versatile: it can be used for classification, regression, and search.

Disadvantages
1. Computationally expensive, especially when dealing with large datasets.
2. Sensitive to outliers.
3. Selecting the optimal k value can be challenging and problem-dependent.
4. Sensitive to class imbalance.
5. Noisy data, or data with errors, can have a significant impact on k-NN's performance, as it relies on the similarity of data points.
6. Not suitable for real-time or online learning applications.
7. Lack of model interpretability.

k-NN Simplified
The k-NN algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. It is easy to implement and understand, but has the major drawback of becoming significantly slower as the size of the data in use grows. k-NN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (k) closest to the query, then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression). The right k for your data can be chosen by trying several values of k, or by plotting the error rate and picking the value that works best.

k-Nearest Neighbors Applications
- Document search: k-NN can search for semantically similar documents, where each document is treated as a vector.
- Estimating the ratio of diabetics: diabetes risk depends on age, health condition, family history and food habits; within a particular locality, the ratio of diabetics can be estimated using the k-NN algorithm.
- Predicting breast cancer: the k-NN algorithm is widely used in the medical sector, for example to predict breast cancer.
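To tie the pieces together, here is a minimal illustrative sketch (not from the notes) that fits a k-NN classifier on the built-in Iris dataset for a chosen k and distance metric; metric='manhattan' or metric='minkowski' (with a value of p) would select the other distances listed above.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # k = 5 neighbours with Euclidean distance
    knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
    knn.fit(X_train, y_train)            # "lazy" learning: this step just stores the training data
    print(knn.score(X_test, y_test))     # accuracy from the majority vote of the 5 nearest neighbours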
SUPPORT VECTOR MACHINE (SVM)
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N = the number of features) that distinctly classifies the data points. Many possible separating hyperplanes exist; SVM looks for the one with the maximum margin.

Hyperplanes and Support Vectors
Hyperplanes are decision boundaries that help classify the data points: data points falling on either side of the hyperplane can be attributed to different classes. The dimension of the hyperplane depends on the number of features. If the number of input features is 2, the hyperplane is just a line; if the number of input features is 3, the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

Support vectors are the data points that are closest to the hyperplane and influence its position and orientation. Using these support vectors, we maximize the margin of the classifier; deleting the support vectors would change the position of the hyperplane. These are the points that help us build the SVM. Note that we want data points not only to fall on the correct side of the hyperplane but also to be located beyond the margin.

Cost Function & Gradient Updates
In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is the hinge loss: the cost is 0 if the predicted value and the actual value are of the same sign; otherwise, we calculate the loss value. We also add a regularization parameter to the cost function, whose objective is to balance margin maximization and loss. After adding the regularization parameter, the cost function consists of a minimization objective plus a loss term:

    J(w, b) = (1/2)·||w||²  +  C · Σ (i = 1 to m) max(0, 1 − yi·(w·xi + b))

The smaller C is, the stronger the regularization; accordingly, the model will attempt to maximize the margin and be more tolerant towards misclassifications. If we set C to a large number, the SVM will pursue outliers more aggressively, which potentially comes at the cost of a smaller margin and may lead to overfitting on the training data; the classifier might be less robust on unseen data.

Kernel Functions in SVM
A kernel function is a method used to take data as input and transform it into the required form for processing. Kernels are a set of mathematical functions used in the Support Vector Machine that provide a window to manipulate the data. A kernel function generally transforms the training data so that a non-linear decision surface becomes a linear equation in a higher-dimensional space; essentially, it returns the inner product between two points in a standard feature dimension. The choice of kernel and its hyperparameters greatly affects the separability of the classes (in classification) and the performance of the algorithm.

- Linear kernel: used when the data is linearly separable. With this kernel we only have one hyperparameter to set, which is C.
- Polynomial kernel: popular in image processing; it represents the similarity of vectors in the training set in a feature space over polynomials of the original variables.
- Gaussian kernel: a general-purpose kernel, used when there is no prior knowledge about the data.
- Gaussian Radial Basis Function (RBF): the same as the Gaussian kernel, adding the radial basis method to improve the transformation.
- Sigmoid kernel: equivalent to a two-layer perceptron model of a neural network, which is used as an activation function for artificial neurons; it is also similar to logistic regression.
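As an illustrative sketch (not from the notes), the kernels above can be compared by passing them to scikit-learn's SVC; the breast cancer dataset and the scaling step are arbitrary choices made only so the example runs end to end.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
        clf = make_pipeline(StandardScaler(),           # SVMs are sensitive to feature scales
                            SVC(kernel=kernel, C=1.0))  # C controls the strength of regularization
        clf.fit(X_train, y_train)
        print(kernel, clf.score(X_test, y_test))        # test-set accuracy for each kernel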
Gamma is a hyperparameter which we have to set before training the model if we use a kernel other than the linear kernel; in the Gaussian/RBF kernel it plays the role of sigma (the two are inversely related). Gamma decides how much curvature we want in the decision boundary: a high gamma means more curvature, a low gamma means less curvature. So when should we set gamma high or low? The answer depends entirely on the data. For C we generally try values such as 0.001, 0.01, 0.1, 1, 10, 100, and the same for gamma. Which value is best depends on your dataset, and the optimal values can be found by using GridSearchCV (covered in Chapter 4; a sketch is given at the end of this section).

When to apply SVM?
1. Binary classification.
2. High-dimensional data (when the number of features/dimensions is large compared to the number of samples).
3. Non-linear boundaries (when the decision boundary between classes is not a straight line, i.e., non-linear classification).
4. When the dataset is relatively small.
5. Text classification, such as sentiment analysis or document categorization.

Pros & Cons associated with SVM
Pros:
- It works really well with a clear margin of separation.
- It is effective in high-dimensional spaces.
- It is effective in cases where the number of dimensions is greater than the number of samples.
- It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Cons:
- It does not perform well when we have a large dataset, because the required training time is higher.
- It also does not perform very well when the dataset has more noise, i.e., when the target classes are overlapping.
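As a preview of the grid search mentioned above (GridSearchCV itself is covered in Chapter 4), here is a hedged sketch that tries the candidate C and gamma values for an RBF-kernel SVM; the dataset and the scaling step are illustrative choices.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    param_grid = {
        'svc__C':     [0.001, 0.01, 0.1, 1, 10, 100],
        'svc__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
    }
    grid = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel='rbf')),
                        param_grid, cv=5)
    grid.fit(X_train, y_train)            # cross-validated search over all 36 C/gamma pairs
    print(grid.best_params_)              # the C and gamma that performed best
    print(grid.score(X_test, y_test))     # accuracy of the refit model on the held-out test set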
