Introduction to Machine Learning PDF by Prof. Karlik
Summary
This document is an introduction to machine learning by Prof. Dr. Bekir Karlik. It covers the definitions of machine learning, including supervised, unsupervised, and reinforcement learning. The document also features sections on linear and logistic regression implementation using Python's scikit-learn library with code examples, data processing techniques, and appendices reviewing logic and matrix operations.
Full Transcript
Introduction to Machine Learning
Prof. Dr. Bekir KARLIK

Overview
❖ Definition of Machine Learning
❖ Difference Between AI and ML
❖ Linear Regression
❖ Implementation for Linear Regression
❖ Logistic Regression
❖ Implementation for Logistic Regression
❖ Data Processing for ML
❖ Appendix-1: Review of Logic Operators
❖ Appendix-2: Review of Matrix Operations
❖ Appendix-3: Calculus and Differential Equations

Definition of Machine Learning
A learning algorithm is an adaptive method by which a network of computing units self-organizes to realize the target (desired) behavior. Machine learning is about learning to predict from samples of target behaviors or from past observations of data.

Machine learning algorithms are classified as:
1. Supervised learning, where the algorithm creates a function that maps inputs to target outputs. The learner compares its actual response to the target and adjusts its internal memory in such a way that it is more likely to produce the appropriate response the next time it receives the same input.
2. Unsupervised learning (clustering, dimensionality reduction, recommender systems, self-organizing learning), which models a set of inputs. There are no target outputs (no labeled examples), and the learner receives no feedback from the environment.
3. Semi-supervised learning, where the algorithm learns a function from a combination of labeled and unlabeled examples.
4. Reinforcement learning, which is learning by interacting with an environment. The learner receives feedback about the appropriateness of its responses.
5. Learning to learn, where the algorithm learns its own inductive bias based on previous experience; this is also called inductive learning.

Deep Learning
Deep learning is a class of machine learning algorithms that:
- use a cascade of many layers of nonlinear processing units for feature extraction and transformation, where each successive layer uses the output of the previous layer as input. The algorithms may be supervised or unsupervised, and applications include pattern analysis (unsupervised) and classification (supervised);
- are based on the (unsupervised) learning of multiple levels of features or representations of the data, with higher-level features derived from lower-level features to form a hierarchical representation;
- are part of the broader machine learning field of learning representations of data;
- learn multiple levels of representations that correspond to different levels of abstraction; the levels form a hierarchy of concepts.

Difference Between AI and ML

Linear Regression vs Logistic Regression
Linear regression is a machine learning algorithm based on supervised regression. Regression models a target prediction value based on independent variables; it is mostly used for finding the relationship between variables and for forecasting. Regression models differ in the kind of relationship they assume between the dependent and independent variables and in the number of independent variables they use.
Logistic regression is basically a supervised classification algorithm: in a classification problem, the target variable (output) y can take only discrete values for a given set of features (inputs) X.
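To make the contrast concrete, here is a minimal sketch that fits both models with scikit-learn on synthetic data (the data and parameter choices are illustrative assumptions, not taken from the slides):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # two independent variables

# Regression: the target is continuous, so LinearRegression applies
y_continuous = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
lin = LinearRegression().fit(X, y_continuous)
print(lin.predict(X[:2]))                    # real-valued predictions

# Classification: the target is discrete (0/1), so LogisticRegression applies
y_discrete = (X[:, 0] + X[:, 1] > 0).astype(int)
log = LogisticRegression().fit(X, y_discrete)
print(log.predict(X[:2]))                    # class labels
print(log.predict_proba(X[:2]))              # class probabilities
```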
The table below summarizes the differences:

| Linear Regression | Logistic Regression |
|---|---|
| A supervised regression model. | A supervised classification model. |
| Equation: $y = a_0 + a_1x_1 + a_2x_2 + \dots + a_ix_i$, where $y$ is the response variable, $x_i$ is the $i$-th predictor variable, and $a_i$ is the average effect on $y$ as $x_i$ increases by 1. | Equation: $y(x) = \dfrac{e^{a_0 + a_1x_1 + a_2x_2 + \dots + a_ix_i}}{1 + e^{a_0 + a_1x_1 + a_2x_2 + \dots + a_ix_i}}$, with the same meanings for $y$, $x_i$, and $a_i$. |
| Predicts a continuous numeric value. | Predicts a value of 1 or 0. |
| No activation function is used. | An activation function is used to convert the linear regression equation into the logistic regression equation. |
| No threshold value is needed. | A threshold value is added. |
| Root Mean Square Error (RMSE) is calculated when predicting the next weight value. | Precision is used when predicting the next weight value. |
| The dependent variable must be numeric and the response variable is continuous. | The dependent variable has only two categories; logistic regression estimates the odds of the outcome given a set of quantitative or categorical independent variables. |
| Based on least-squares estimation. | Based on maximum-likelihood estimation. |
| When the training data are plotted, a straight line can be drawn that touches the maximum number of points. | Any change in a coefficient changes both the direction and the steepness of the logistic function: positive slopes give an S-shaped curve and negative slopes give a Z-shaped curve. |

For example, with $a_0 = 0$ and $a_1 = 1$, the input $x_1 = 2$ gives the linear prediction $y = 2$, while the logistic equation gives $e^2/(1+e^2) \approx 0.88$, a probability.

Linear regression is used to estimate the dependent variable when the independent variables change, for example to predict the price of houses. Logistic regression is used to calculate the probability of an event, for example to classify whether tissue is benign or malignant. Linear regression assumes a normal (Gaussian) distribution of the dependent variable, whereas logistic regression assumes a binomial distribution of the dependent variable.

Applications of linear regression: financial risk assessment, business insights, market analysis, text editing. Applications of logistic regression: medicine, credit scoring, hotel booking, gaming.

Code Implementation for Logistic Regression (scikit-learn)
Here is a simple Python logistic regression implementation using scikit-learn:

1. Import Required Libraries

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```

2. Load and Prepare the Data

```python
# Load dataset (example: breast cancer dataset from sklearn)
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split into features and target
X = df.drop(columns=['target'])
y = df['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize features for better performance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

3. Train the Logistic Regression Model

```python
# Create and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
```
4. Evaluate the Model

```python
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')

# Display the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', conf_matrix)

# Classification report
print('Classification Report:\n', classification_report(y_test, y_pred))
```

5. Visualize the Decision Boundary (for 2D Data)

```python
from matplotlib.colors import ListedColormap

def plot_decision_boundary(X, y, model):
    X = X[:, :2]  # consider only two features for visualization
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3, cmap=ListedColormap(('red', 'blue')))
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(('darkred', 'darkblue')))
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.title("Decision Boundary of Logistic Regression")
    plt.show()

# The model above was trained on all 30 features, so it cannot score a
# 2-D grid; fit a separate model on the first two (scaled) features
# before plotting.
model_2d = LogisticRegression().fit(X_train[:, :2], y_train)
plot_decision_boundary(X_train, y_train, model_2d)
```

Terminologies Involved in Logistic Regression
Here are some common terms involved in logistic regression:
- Independent variables: the input characteristics or predictor factors applied to predict the dependent variable.
- Dependent variable: the target variable in a logistic regression model, which we are trying to predict.
- Logistic function: the formula used to represent how the independent and dependent variables relate to one another. It transforms the input variables into a probability value between 0 and 1, which represents the likelihood that the dependent variable is 1 or 0.
- Odds: the ratio of something occurring to something not occurring. It differs from probability, which is the ratio of something occurring to everything that could possibly occur. For example, a probability of 0.8 corresponds to odds of 0.8/0.2 = 4.
- Log-odds: also known as the logit function, the natural logarithm of the odds. In logistic regression, the log-odds of the dependent variable are modeled as a linear combination of the independent variables and the intercept. Continuing the example, odds of 4 give log-odds of ln 4 ≈ 1.39.
- Coefficient: an estimated parameter of the logistic regression model, showing how an independent variable relates to the dependent variable.
- Intercept: the constant term in the logistic regression model, representing the log-odds when all independent variables are equal to zero.
- Maximum likelihood estimation: the method used to estimate the coefficients of the logistic regression model, which maximizes the likelihood of observing the data given the model.

Code Implementation for Logistic Regression
So far we have covered the basics of logistic regression and its theoretical concepts; now let us focus on the hands-on code implementation, which makes logistic regression easier to understand. We will discuss binomial logistic regression and multinomial logistic regression one by one.
Binomial logistic regression: the target variable can have only two possible values, "0" or "1", which may represent "win" vs "loss", "pass" vs "fail", "dead" vs "alive", and so on. In this case the sigmoid function, already discussed above, is used. Import the necessary libraries based on the requirements of the model; the Python code below shows how to use the breast cancer dataset to implement a logistic regression model for classification.
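Code for Binomial Logistic Regression

A minimal sketch consistent with the description above (the solver settings and train/test split are assumptions, not taken from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Binary (binomial) target: 0 = malignant, 1 = benign
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=23)

clf = LogisticRegression(max_iter=10000)  # large max_iter so the solver converges on unscaled data
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Code for Multinomial Logistic Regression

In the multinomial case the target variable has three or more unordered categories. A minimal sketch, assuming the ten-class digits dataset bundled with scikit-learn (an illustrative choice, not from the slides):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Ten-class target: the digits 0-9
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=23)

clf = LogisticRegression(max_iter=10000)  # scikit-learn handles multiclass targets automatically
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```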
How to Evaluate a Logistic Regression Model?
Common evaluation metrics, namely accuracy, the confusion matrix, and the classification report, are computed in step 4 of the implementation above.

Data Processing for Machine Learning

Normalization
The most common tool that designers of automatic recognition systems use to obtain better results is data normalization. Ideally, a system designer wants the same range of values for each input feature in order to minimize bias within the neural network toward one feature over another. Data normalization can also speed up training by starting the training process for each feature within the same scale. It is especially useful for modeling applications where the inputs are generally on widely different scales.

The use of data normalization has a number of advantages:
- the application of data mining algorithms becomes easier
- the data mining algorithms become more effective and efficient
- the data is converted into a format that everyone can get their heads around
- the data can be extracted from databases faster
- it is possible to analyze the data in a specific manner

A dataset has numeric values, but these values come in different formats, so they are converted to numbers varying between 0 and 1; this is called normalizing the data. Raw data should be normalized before processing. Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0, and it is generally useful for classification algorithms. There are many types of data normalization; the most useful and well-known methods are:
- Min-max normalization
- Z-score normalization
- Data normalization by decimal scaling

Min-Max Normalization
The rescaling is often accomplished by using a linear interpolation formula such as:

$$x'_i = \frac{x_i - \text{min value}}{\text{max value} - \text{min value}}\,(\text{max target} - \text{min target}) + \text{min target}$$

The formula is undefined when a feature takes a constant value (its maximum equals its minimum). If a feature with a constant value is found in the data, it must be removed, because it provides no information to the neural network. As the equation shows, the maximum and minimum values of each feature in the data are calculated and the data are linearly transformed to lie within the desired range of values. When min-max normalization is applied, each feature value lies in the new range, but the underlying distribution of each feature within the new range of values is unchanged. Min-max normalized data thus has the advantage of preserving exactly all relationships among the features, and it does not introduce any bias.
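A short sketch of min-max rescaling with NumPy (illustrative; the helper name and the constant-feature handling are assumptions):

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    """Linearly rescale a feature vector to [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # A constant feature carries no information and should be removed
        raise ValueError("constant feature: remove it instead of normalizing")
    return (x - lo) / (hi - lo) * (new_max - new_min) + new_min

print(min_max_normalize([20, 50, 80]))   # -> [0.  0.5 1. ]
```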
Statistical or Z-Score Normalization
The statistical or z-score normalization technique uses the mean ($\mu$) and standard deviation ($\sigma$) of each feature across a set of training data to normalize each input feature vector. The mean and standard deviation are computed for each feature, and then the transformation given below is applied to each input feature vector as it is presented. This produces data where each feature has zero mean and unit variance. Sometimes the normalization is applied to all of the feature vectors in the data set first, creating a new training set, and then training is commenced. Once the means and standard deviations are computed for each feature $x_i$ over a set of training data, they must be retained and used as weights in the final system design. One advantage of the statistical norm is that it reduces the effect of outliers in the data. The statistical norm is given by:

$$x'_i = \frac{x_i - \mu_i}{\sigma_i}$$

Decimal Scaling Normalization
Normalization by decimal scaling normalizes data by moving the decimal point of the values of an attribute. The number of decimal places moved depends on the maximum absolute value. A value $v$ is normalized to $v'$ by computing:

$$v' = \frac{v}{10^j}$$

where $j$ is the smallest integer such that $\max(|v'|) < 1$. For instance, suppose the values of feature F range from 825 to 850. The maximum absolute value of F is 850, so $j = 3$ and all values are divided by 1,000. Therefore 850 is normalized to 0.850 and 825 is transformed to 0.825. This technique transforms the decimal points of the values according to the maximum absolute value, so the normalized values always lie between -1 and 1.

Example: Data Normalization
The min-max normalization formula is:

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$

Example: we want to normalize the data to the interval [0, 1], so we set new_max_A = 1 and new_min_A = 0. Say max_A is 80 and min_A is 20 (the maximum and minimum values of the attribute). Now, if v = 50 for a particular pattern, v' is calculated as v' = (50 - 20)(1 - 0)/(80 - 20) + 0 = 30/60 = 0.5.

Data Reduction
Today in data mining research we are confronted daily with large amounts of data. Most of the time these data contain redundant and irrelevant parts, which it is important to remove before a learning task in order to get good accuracy. The fact that today's computers are more powerful does not solve the problem of this ever-growing data. It is therefore crucial to find techniques for handling these large databases, which are often too big to be processed, and data reduction techniques are a very important step in preparing the data for data mining and knowledge discovery. For a successful machine learning task it is important to take many factors into consideration, and among them the most significant is the quality of the dataset. In theory, having many features should result in better discriminability; in practice this is not always the case, and sometimes good discrimination (classification) is achieved with a limited dataset. However, if the data contain noisy, unreliable, or irrelevant parts, it becomes difficult to learn during training. Commonly there are two kinds of data reduction: instance reduction and feature selection.

Instance Reduction
In an instance reduction task, the original dataset is used as the input and the output is a subset of the original dataset. This is done in order to remove the superfluous instances from the original data; instance reduction is a success if the accuracy on the reduced data is the same as, or even better than, the accuracy on the original dataset. Some well-known data reduction techniques are Consistency Subset, Correlation Attribute, Info Gain Attribute, and Wrapper Subset.

Feature Selection
Feature selection is a process that consists of identifying redundant and irrelevant features and then removing them; this process helps reduce the dimensionality and at the same time allows a fast and effective machine learning task. Moreover, in some cases the subsequent test performance can be better; in other words, the outcome is more compact and easily interpretable.
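The slides use Weka-style selectors such as ConsistencySubsetEval; as a rough scikit-learn analogue, here is a hedged sketch of filter-style feature selection (the selector, the value of k, and the dataset are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the selector on the training set only, then apply it to both splits
selector = SelectKBest(mutual_info_classif, k=10).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# Compare accuracy before and after feature selection
for name, (a, b) in {"all features": (X_tr, X_te),
                     "selected features": (X_tr_sel, X_te_sel)}.items():
    knn = KNeighborsClassifier().fit(a, y_tr)
    print(name, accuracy_score(y_te, knn.predict(b)))
```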
[Figure: feature selection pipeline. The training set passes through a feature selector; the selected features are fed to the data mining algorithm; the resulting model is then evaluated on the testing set.]

Example: The vertebral column dataset used in this study is taken from the UCI online medical dataset repository for machine learning. It is organized to determine whether a patient has a disc hernia, has spondylolisthesis, or is healthy. Each patient is represented by six features along with a class label. The dataset is composed of 310 instances, organized as follows: 100 patients are normal, 60 patients suffer from disc hernia, and 150 have spondylolisthesis. Table I presents the test results on the original dataset using the ANN-MLP, C4.5, and K-NN learning algorithms. For the training set, 75% of the available patterns are used, and the remaining 25% are used for the test.

TABLE I: Test results on the original data using different machine learning algorithms

| Learning Algorithm | Reduction Technique | Features | Instances | Classes | Training set | Test set | Iterations | Test accuracy | Well classified | Misclassified |
|---|---|---|---|---|---|---|---|---|---|---|
| ANN-MLP | None | 7 | 310 | 3 | 75% | 25% | 1000 | 84.4156% | 65 | 12 |
| C4.5 | None | 7 | 310 | 3 | 75% | 25% | - | 80.5195% | 62 | 15 |
| K-NN | None | 7 | 310 | 3 | 75% | 25% | - | 74.026% | 57 | 20 |

TABLE II: Test results using the ConsistencySubsetEval feature selector and different machine learning algorithms

| Learning Algorithm | Feature Selector | Features | Instances | Classes | Training set | Test set | Iterations | Test accuracy | Well classified | Misclassified |
|---|---|---|---|---|---|---|---|---|---|---|
| ANN-MLP | ConsistencySubsetEval | 6 | 310 | 3 | 75% | 25% | 1000 | 88.3117% | 68 | 9 |
| C4.5 | ConsistencySubsetEval | 6 | 310 | 3 | 75% | 25% | - | 80.5195% | 62 | 15 |
| K-NN | ConsistencySubsetEval | 6 | 310 | 3 | 75% | 25% | - | 74.026% | 57 | 20 |

In Table II, a feature selector is used to reduce the original data; the obtained feature subset is then trained and tested, using the same amounts of training and testing data as in Table I. As Table II shows, the number of features has been reduced, and the test results on the reduced data (Table II) equal those on the original data (Table I) for the C4.5 and K-NN algorithms, while the ANN-MLP result on the reduced data is better than the result on the original data.

TABLE III: Test results using the Resample instance reducer and different machine learning algorithms

| Learning Algorithm | Instance Reducer | Features | Instances | Classes | Training set | Test set | Iterations | Test accuracy | Well classified | Misclassified |
|---|---|---|---|---|---|---|---|---|---|---|
| ANN-MLP | Resample | 7 | 186 | 3 | 80% | 20% | 1000 | 91.8919% | 34 | 3 |
| C4.5 | Resample | 7 | 186 | 3 | 80% | 20% | - | 89.1892% | 33 | 4 |
| K-NN | Resample | 7 | 186 | 3 | 80% | 20% | - | 97.2973% | 36 | 1 |

In Table III, the instances of the original dataset have been reduced using a resample method; after reduction, 60% of the original data remained. Then 80% of this remaining data is used for training and 20% for testing. The results obtained in Table III are all far better than the results on the original data (Table I).

[Figure 1: testing results using ANN]
[Figure 2: testing results using K-NN]
[Figure 3: testing results using C4.5]

Example: In this example, the learning algorithms used are Artificial Neural Network (MLP), C4.5, and K-Nearest Neighbors.
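To illustrate the resampling idea behind Table III, here is a hedged sketch (the dataset is a stand-in for the vertebral column data, the 60% sampling fraction mirrors the text, and the classifier choice is an assumption):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset

# Instance reduction by resampling: keep a random 60% of the instances
rng = np.random.default_rng(0)
keep = rng.choice(len(X), size=int(0.6 * len(X)), replace=False)
X_red, y_red = X[keep], y[keep]

# Train on 80% of the reduced data, test on the remaining 20%
X_tr, X_te, y_tr, y_te = train_test_split(X_red, y_red, test_size=0.2, random_state=0)
knn = KNeighborsClassifier().fit(X_tr, y_tr)
print("K-NN accuracy on resampled data:", accuracy_score(y_te, knn.predict(X_te)))
```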
Much research has shown that machine learning algorithms are negatively influenced by redundant and irrelevant data. Every algorithm is sensitive to redundant and irrelevant features to some degree: in K-NN the data complexity increases exponentially with the amount of irrelevant data, and for decision trees, in some cases such as the parity concept, the data complexity can increase exponentially as well. In C4.5 the training samples can sometimes overfit, resulting in a large tree. Therefore, by removing noisy data, the result can in many cases be better, yielding a small tree that is easy to interpret.

Example: For a better view of the results, a graphical representation is presented in Fig. 1 to Fig. 4. In Fig. 1, using ANN-MLP, all the reduced data perform better than the original data; in Fig. 2 and Fig. 3, using K-NN and C4.5 respectively, the original data (blue line) have the same accuracy as the data after feature reduction (orange line). As said previously, data is successfully reduced if its accuracy is the same as or better than the accuracy of the original data, so we can deduce that the feature selection methods used in this task were successful. The gray line in the same figures represents the accuracy of the data after instance reduction; compared to the original data, its accuracy is far better.

[Figure 4: comparison of testing results for all methods]

Appendix-1: Review of Logic Operators

Appendix-2: Review of Matrix Operations
Vector: a sequence of elements (the order is important). For example, $x = (2, 1)$ denotes a vector in the plane with length $\sqrt{2 \cdot 2 + 1 \cdot 1}$ and some orientation angle $a$. In general, $x = (x_1, x_2, \ldots, x_n)$ is an $n$-dimensional vector, i.e. a point in an $n$-dimensional space.

Column vector and row vector: the transpose converts one into the other. For example,

$$x = \begin{pmatrix} 1 \\ 2 \\ 5 \\ 8 \end{pmatrix}, \qquad y = (1 \;\; 2 \;\; 5 \;\; 8) = x^T, \qquad (x^T)^T = x$$

Norms of a vector (magnitude):
- $L_1$ norm: $\|x\|_1 = \sum_{i=1}^{n} |x_i|$
- $L_2$ norm: $\|x\|_2 = \left(\sum_{i=1}^{n} x_i^2\right)^{1/2}$
- $L_\infty$ norm: $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$

Vector operations:
- Scaling: $rx = (rx_1, rx_2, \ldots, rx_n)^T$, where $r$ is a scalar and $x$ is a column vector.
- Inner (dot) product of column vectors $x, y$ of the same dimension $n$: $x^T y = \sum_{i=1}^{n} x_i y_i = y^T x$. In particular, $x^T x = \sum_{i=1}^{n} x_i^2 \ge 0$.
- Cross product: $x \times y$ defines another vector orthogonal to the plane formed by $x$ and $y$.

Matrix:
$$A_{m \times n} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} = \{a_{ij}\}_{m \times n}$$
where $a_{ij}$ is the element on the $i$-th row and $j$-th column, $a_{ii}$ is a diagonal element, and $w_{ij}$ denotes a weight in a weight matrix $W$. Each row or column is a vector: $a_j$ is the $j$-th column vector and $a_i$ is the $i$-th row vector, so $A_{m \times n} = (a_1, \ldots, a_n)$ column-wise. A column vector of dimension $m$ is a matrix of size $m \times 1$.

Transpose: $A^T_{n \times m}$, in which the $j$-th column of $A$ becomes the $j$-th row.

Square matrix: $A_{n \times n}$. Identity matrix: $I$ with $a_{ij} = 1$ if $i = j$ and $0$ otherwise.

Symmetric matrix: $m = n$ and $A = A^T$, i.e. $a_{ij} = a_{ji}$ for all $i, j$.

Matrix operations:
- Scaling: $rA = (ra_1, \ldots, ra_n) = (ra_{ij})$.
- Vector-matrix product: $x^T A_{m \times n} = x^T (a_1, \ldots, a_n) = (x^T a_1, \ldots, x^T a_n)$. The result is a row vector, each element of which is the inner product of $x^T$ with a column vector $a_j$.
- Product of two matrices: $A_{m \times n} B_{n \times p} = C_{m \times p}$, where $c_{ij} = a_i \cdot b_j$ (the $i$-th row of $A$ dotted with the $j$-th column of $B$), and $A_{m \times n} I_{n \times n} = A_{m \times n}$.
- Vector outer product:
$$x y^T = \begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix} (y_1, \ldots, y_n) = \begin{pmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ \vdots & & & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n \end{pmatrix}$$
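These operations map directly onto NumPy; a short illustrative sketch (not from the slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 5.0, 8.0])
y = np.array([2.0, 0.0, 1.0, 3.0])

# Norms
print(np.linalg.norm(x, 1))       # L1 norm: sum of |x_i|
print(np.linalg.norm(x, 2))       # L2 norm: Euclidean length
print(np.linalg.norm(x, np.inf))  # L-infinity norm: max |x_i|

# Inner and outer products
print(x @ y)                      # inner product, a scalar
print(np.outer(x, y))             # outer product, a 4x4 matrix

# Matrix product and the identity
A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
print(A @ B)                      # (2x3)(3x4) -> (2x4)
print(np.allclose(A @ np.eye(3), A))  # A I = A

# A A^T is always a symmetric (square) matrix
S = A @ A.T
print(np.allclose(S, S.T))
```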
Appendix-3: Calculus and Differential Equations
$\dot{x}_i(t)$ denotes the derivative of $x_i$ with respect to time $t$. A system of differential equations

$$\dot{x}_1(t) = f_1(t), \;\ldots,\; \dot{x}_n(t) = f_n(t)$$

has the solution $(x_1(t), \ldots, x_n(t))$, which is difficult to find unless the $f_i(t)$ are simple.

Multi-variable calculus: let $y(t) = f(x_1(t), x_2(t), \ldots, x_n(t))$. The partial derivative $\partial y / \partial x_i$ gives the direction and speed of change of $y$ with respect to $x_i$. For example, if

$$y = \sin(x_1) + x_2^2 + e^{-(x_1 + x_2 + x_3)}$$

then

$$\frac{\partial y}{\partial x_1} = \cos(x_1) - e^{-(x_1 + x_2 + x_3)}, \qquad \frac{\partial y}{\partial x_2} = 2x_2 - e^{-(x_1 + x_2 + x_3)}, \qquad \frac{\partial y}{\partial x_3} = -e^{-(x_1 + x_2 + x_3)}$$

The total derivative gives the direction and speed of change of $y$ with respect to $t$:

$$\dot{y}(t) = \frac{df}{dt} = \frac{\partial f}{\partial x_1}\dot{x}_1(t) + \cdots + \frac{\partial f}{\partial x_n}\dot{x}_n(t) = \nabla f \cdot (\dot{x}_1(t), \ldots, \dot{x}_n(t))^T$$

Gradient of $f$: $\nabla f = \left(\dfrac{\partial f}{\partial x_1}, \ldots, \dfrac{\partial f}{\partial x_n}\right)$.

Chain rule: if $z$ is a function of $y$, $y$ is a function of $x$, and $x$ is a function of $t$, then

$$\frac{dz}{dt} = \frac{dz}{dy}\,\frac{dy}{dx}\,\frac{dx}{dt}$$

Dynamic system:

$$\dot{x}_1(t) = f_1(x_1, \ldots, x_n), \;\ldots,\; \dot{x}_n(t) = f_n(x_1, \ldots, x_n)$$

A change in one $x_i$ may potentially affect the others, and all $x_i$ continue to change (the system evolves). The system reaches equilibrium when $\dot{x}_i = 0$ for all $i$. Stability/attraction concerns special equilibrium points (minimal-energy states); the pattern of $(x_1, \ldots, x_n)$ at a stable state often represents a solution.
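As a quick check of the worked partial derivatives above, a short SymPy sketch (SymPy is an assumption; the slides do not use it):

```python
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
y = sp.sin(x1) + x2**2 + sp.exp(-(x1 + x2 + x3))

# These match the partial derivatives in the appendix
print(sp.diff(y, x1))  # cos(x1) - exp(-x1 - x2 - x3)
print(sp.diff(y, x2))  # 2*x2 - exp(-x1 - x2 - x3)
print(sp.diff(y, x3))  # -exp(-x1 - x2 - x3)
```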