Python Machine Learning PDF
Summary
Python Machine Learning provides a tutorial on the fundamental concepts of machine learning, including supervised, unsupervised, and reinforcement learning. It covers classification and regression techniques and gives examples using Python code.
Full Transcript
Giving Computers the Ability to Learn from Data

Machine learning (ML) has evolved as a subfield of AI, enabling self-learning algorithms to make predictions by analyzing structured and unstructured data. This advancement has significantly improved efficiency and accuracy in various domains, including computer science, healthcare, and climate change. Notable examples include robust email spam filters, voice recognition software, accurate medical diagnoses, and even predictions of COVID-19 patients' oxygen needs. ML also holds promise in addressing climate change, particularly through precision agriculture techniques. By harnessing ML's predictive power, we have the potential to make significant strides in tackling some of the most pressing challenges of our time.

The three different types of machine learning

Making predictions about the future with supervised learning
Supervised learning aims to predict outcomes based on labeled training data, where the model learns the relationship between inputs and corresponding outputs. In supervised machine learning, models are trained using labeled data to predict outcomes, such as categorizing emails as spam or non-spam in a classification task, or predicting continuous values in regression tasks.

Classification for predicting class labels
In classification, a subcategory of supervised learning, models predict categorical labels for new data based on past observations. An example is the binary classification of emails as spam or non-spam, where machine learning algorithms learn rules to distinguish between the two classes. Using the 30-example scenario in Figure 1.3, with 15 examples each for two classes (A and B) in a two-dimensional dataset, an algorithm can be trained to find a decision boundary (dashed line) that categorizes new data based on their x1 and x2 values. Supervised learning can assign any class label that appears in the training dataset to new, unlabeled instances. Multiclass classification tasks, like handwritten character recognition, involve training a model on multiple examples of each letter so that it can predict the correct letter from the alphabet. However, if the dataset lacks certain classes (for example, some of the digits 0-9), the model cannot predict those absent classes.

Regression for predicting continuous outcomes
Supervised learning also includes regression analysis, where predictor variables (features) are used to predict continuous outcomes (target variables), such as predicting math SAT scores based on study time. The corresponding figure depicts linear regression, where a straight line is fitted to minimize the distance (often the average squared distance) between the data points and the line. The intercept and slope learned from this regression can then be used to predict the target variable of new data.

Solving interactive problems with reinforcement learning
Reinforcement learning is a type of machine learning in which an agent interacts with an environment to improve its performance. Unlike supervised learning, where we have labeled data, reinforcement learning uses a reward signal to guide the agent's actions. The goal is to learn a sequence of actions that maximizes this reward, either through trial and error or careful planning. Reinforcement learning can be exemplified by a chess program: the agent makes moves based on the board state, and the reward is defined as the game's outcome, win or lose. In sum, reinforcement learning is concerned with learning to choose a series of actions that maximizes the total reward, which could be earned either immediately after taking an action or via delayed feedback.
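To make the supervised learning terminology above concrete, here is a minimal sketch (not from the book) that trains a classifier and a regressor with scikit-learn on small synthetic datasets; the numbers and feature meanings are purely illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.RandomState(1)

# Classification: predict a categorical label (0 or 1) from two features.
X_cls = rng.randn(30, 2)
y_cls = (X_cls[:, 0] + X_cls[:, 1] > 0).astype(int)   # labeled training data
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[1.0, 1.0]]))                      # predicted class label

# Regression: predict a continuous target (e.g., a score) from study time.
hours = rng.uniform(0, 10, size=(30, 1))
score = 40 + 15 * hours[:, 0] + rng.randn(30) * 5     # continuous outcome
reg = LinearRegression().fit(hours, score)
print(reg.predict([[6.0]]))                           # predicted score for 6 hours

The classifier returns a discrete class label, while the regressor returns a continuous value, mirroring the distinction described above.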
Discovering hidden structures with unsupervised learning
Using unsupervised learning techniques, we are able to explore the structure of our data and extract meaningful information without the guidance of a known outcome variable or reward function.

Finding subgroups with clustering
Clustering is a technique for exploratory data analysis that organizes information into meaningful subgroups without prior knowledge of their memberships. It is used to structure information and derive meaningful relationships; for example, it allows marketers to discover customer groups based on their interests in order to develop distinct marketing programs.

Dimensionality reduction for data compression
Dimensionality reduction, a subfield of unsupervised learning, is used when dealing with high-dimensional data. It helps overcome challenges related to storage space and computational performance. This technique compresses the data onto a smaller subspace while retaining most of the relevant information, and it also aids in noise removal. It is useful for data visualization as well, allowing high-dimensional data to be projected onto lower-dimensional spaces that can be visualized via scatterplots or histograms.

A roadmap for building machine learning systems

Training and selecting a predictive model
Machine learning algorithms are diverse and need to be compared to find the best-performing model. The "No Free Lunch" theorems suggest that no single model is superior without making assumptions about the task. Performance is measured using metrics such as classification accuracy. Cross-validation techniques help estimate how well a model generalizes by dividing the data into training and validation subsets. Because the default parameters of learning algorithms may not be optimal, hyperparameter optimization is needed to fine-tune performance.

Training Simple Machine Learning Algorithms for Classification

Artificial neurons – a brief glimpse into the early history of machine learning
Biological neurons are interconnected nerve cells in the brain that are involved in processing and transmitting chemical and electrical signals. McCulloch and Pitts modeled a nerve cell as a logic gate with binary outputs: signals are integrated in the cell body and, if they exceed a threshold, an output signal is generated. Rosenblatt later proposed the perceptron learning rule, an algorithm that learns the optimal weight coefficients for the input features in order to decide whether a neuron fires or not, which is useful for classifying new data points.

The formal definition of an artificial neuron
Artificial neurons can be used for binary classification tasks. A decision function, σ, takes a linear combination of input values, x, and a weight vector, w, to calculate the net input:

    z = w_1 x_1 + w_2 x_2 + ... + w_m x_m = wᵀx

If z exceeds a threshold, θ, we predict class 1, otherwise class 0. In the perceptron algorithm, σ is a variant of a unit step function:

    σ(z) = 1 if z ≥ θ, and 0 otherwise

To simplify the code implementation later, we can modify this setup by moving the threshold to the left-hand side of the inequality and defining a bias unit, b = −θ, so that z = wᵀx + b and we predict class 1 whenever z ≥ 0.
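As a small numerical sketch of this decision function (the weights and inputs below are assumed, illustrative values, not from the book):

import numpy as np

x = np.array([1.0, 2.0, 3.0])      # input features of one example
w = np.array([0.2, -0.1, 0.4])     # weight vector (illustrative values)
b = -0.5                           # bias unit (b = -theta)

z = np.dot(w, x) + b               # net input
prediction = np.where(z >= 0.0, 1, 0)
print(z, prediction)               # z is about 0.7, so we predict class 1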
The perceptron learning rule
The whole idea behind the MCP neuron and Rosenblatt's thresholded perceptron model is to use a reductionist approach to mimic how a single neuron in the brain works: it either fires or it doesn't. Thus, Rosenblatt's classic perceptron rule is fairly simple, and the perceptron algorithm can be summarized by the following steps:

1. Initialize the weights and bias unit to 0 or small random numbers.
2. For each training example, x^(i):
   a. Compute the output value, ŷ^(i); the output value is the class label predicted by the unit step function.
   b. Update the weights and bias unit.

The update values ("deltas") are computed as follows:

    Δw_j = η (y^(i) − ŷ^(i)) x_j^(i)
    Δb   = η (y^(i) − ŷ^(i))

Each weight, w_j, corresponds to a feature, x_j, in the dataset and is involved in determining the update value, Δw_j. Here, η is the learning rate (typically a constant between 0.0 and 1.0), y^(i) is the true class label of the ith training example, and ŷ^(i) is the predicted class label. The bias unit and all weights in the weight vector are updated simultaneously, meaning that we don't recompute the predicted label, ŷ^(i), before all of the weights and the bias are updated via the respective update values, Δw_j and Δb. Concretely, for a two-dimensional dataset, we would write the update as:

    Δw_1 = η (y^(i) − ŷ^(i)) x_1^(i)
    Δw_2 = η (y^(i) − ŷ^(i)) x_2^(i)
    Δb   = η (y^(i) − ŷ^(i))

When the perceptron correctly predicts the class label, the bias and weights stay the same, because the update values are zero. In the case of a wrong prediction, the weights are pushed toward the direction of the positive or negative target class. Suppose, for example, that we misclassify a class 1 example as class 0 and its feature value x_j^(i) is positive: the update then increases the corresponding weight so that the net input would be more positive the next time we encounter this example, and would thus be more likely to be above the threshold of the unit step function and classify the example as class 1.

It is important to note that the convergence of the perceptron is only guaranteed if the two classes are linearly separable, which means that the two classes can be perfectly separated by a linear decision boundary. If the classes can't be linearly separated, we can limit the number of epochs or set a threshold on the number of tolerated misclassifications; otherwise, the perceptron would update the weights endlessly.

The preceding diagram illustrates how the perceptron receives the inputs of an example (x) and combines them with the bias unit (b) and weights (w) to compute the net input. The net input is then passed on to the threshold function, which generates a binary output of 0 or 1, the predicted class label of the example. During the learning phase, this output is used to calculate the error of the prediction and update the weights and bias unit.

Implementing a perceptron learning algorithm in Python
We define the perceptron interface as a Python class for an object-oriented approach. This allows us to initialize new Perceptron objects that learn via a fit method and predict via a separate predict method. Attributes that are not created during initialization, like self.w_, are given a trailing underscore (_).

import numpy as np

class Perceptron:
    """Perceptron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.
    random_state : int
        Random number generator seed for random weight initialization.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    b_ : Scalar
        Bias unit after fitting.
    errors_ : list
        Number of misclassifications (updates) in each epoch.
    """
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def fit(self, X, y):
        """Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_examples, n_features]
            Training vectors, where n_examples is the number of
            examples and n_features is the number of features.
        y : array-like, shape = [n_examples]
            Target values.

        Returns
        -------
        self : object
        """
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1])
        self.b_ = np.float_(0.)
        self.errors_ = []

        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                self.w_ += update * xi
                self.b_ += update
                errors += int(update != 0.0)
            self.errors_.append(errors)
        return self

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_) + self.b_

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.net_input(X) >= 0.0, 1, 0)

We now have a Perceptron model. We can create new Perceptron objects with a specific learning rate (eta) and number of training passes (n_iter). When we call the fit method, it sets up an initial bias and a weight for each feature in the dataset. The weights start as small random numbers; if we started with all weights at zero, the learning rate would only change the size of the weight vector, not its direction. Once the weights are initialized, the fit method goes through each example in the training data and adjusts the weights using the perceptron learning rule. The predict method is used to predict the class labels, both during training for the weight updates and, after fitting, for new data. We keep track of the number of misclassifications in each training round in the self.errors_ list to evaluate the performance of our perceptron. The np.dot function in the net_input method calculates the vector dot product of the feature vector and the weight vector.

Training a perceptron model on the Iris dataset
To test our Perceptron, we limit the analysis to two features, sepal length and petal length, which allows us to visualize the decision regions in a scatterplot. We only consider two flower classes, setosa and versicolor, from the Iris dataset because the perceptron is a binary classifier. However, it can be extended to multiclass classification using techniques like one-versus-all (OvA).

We take the first 100 class labels, which correspond to 50 Iris-setosa and 50 Iris-versicolor flowers, and convert these labels into two integer labels, 1 for versicolor and 0 for setosa, assigned to a vector, y. We also extract the first feature (sepal length) and the third feature (petal length) of these 100 examples and assign them to a feature matrix, X. We can then visualize the data in a two-dimensional scatterplot:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# load the Iris dataset
s = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(s, header=None, encoding='utf-8')

# select setosa and versicolor
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', 0, 1)

# extract sepal length and petal length
X = df.iloc[0:100, [0, 2]].values

# plot data
plt.scatter(X[:50, 0], X[:50, 1],
            color='red', marker='o', label='Setosa')
plt.scatter(X[50:100, 0], X[50:100, 1],
            color='blue', marker='s', label='Versicolor')
plt.xlabel('Sepal length [cm]')
plt.ylabel('Petal length [cm]')
plt.legend(loc='upper left')
# plt.savefig('images/02_06.png', dpi=300)
plt.show()

Now, it's time to train our perceptron algorithm on the Iris data subset that we just extracted. We will also plot the misclassification error for each epoch to check whether the algorithm converged and found a decision boundary that separates the two Iris flower classes. Note that the number of misclassification errors and the number of updates is the same, since the perceptron weights and bias are updated each time it misclassifies an example.
ppn = Perceptron(eta=0.1, n_iter=10)
ppn.fit(X, y)
plt.plot(range(1, len(ppn.errors_) + 1),
         ppn.errors_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Number of updates')
# plt.savefig('images/02_07.png', dpi=300)
plt.show()

After executing the preceding code, we should see the plot of the misclassification errors versus the number of epochs. Next, we implement a small convenience function, plot_decision_regions, to visualize the decision boundaries for two-dimensional datasets (see the code source).

Adaptive linear neurons and the convergence of learning
Adaline is a type of single-layer neural network. It is an improvement on the perceptron algorithm, developed by Bernard Widrow and Ted Hoff. The Adaline algorithm is important because it helps us understand how to define and minimize continuous loss functions, which is key for understanding many other machine learning algorithms. The main difference between Adaline and the perceptron is that Adaline updates the weights based on a linear activation function rather than a unit step function. Even though the linear activation function is used to learn the weights, we still use a threshold function to make the final prediction. The Adaline algorithm compares the true class labels with the linear activation function's continuous-valued output to compute the model error and update the weights. In contrast, the perceptron compares the true class labels to the predicted class labels.

Minimizing loss functions with gradient descent
In supervised machine learning, there is an objective function that needs to be optimized. This is often a loss or cost function that we aim to minimize. For Adaline, the loss function is defined as the mean squared error (MSE) between the predicted outcome and the actual class label:

    L(w, b) = (1/n) Σ_i (y^(i) − σ(z^(i)))²

This function is used to learn the model parameters. In Adaline, the linear activation function, σ(z), is simply the identity function of the net input, so that σ(z) = z. The continuous linear activation function, unlike the unit step function, makes the loss function differentiable. This loss function is also convex, so we can use gradient descent, a simple but powerful optimization algorithm, to find the weights that minimize it. The concept of gradient descent is like descending a hill until we reach a local or global loss minimum. In each step, we move in the opposite direction of the gradient, where the step size is determined by the learning rate and the slope of the gradient.

Using gradient descent, we can update the model parameters by taking a step in the opposite direction of the gradient, ∇L(w, b), of our loss function, L(w, b):

    w := w + Δw,   b := b + Δb

The parameter changes, Δw and Δb, are defined as the negative gradient multiplied by the learning rate, η:

    Δw = −η ∇_w L(w, b),   Δb = −η ∂L/∂b

To compute the gradient of the loss function, we need the partial derivative of the loss with respect to each weight, w_j (using the MSE as defined above):

    ∂L/∂w_j = −(2/n) Σ_i (y^(i) − σ(z^(i))) x_j^(i)

Similarly, the partial derivative of the loss with respect to the bias is:

    ∂L/∂b = −(2/n) Σ_i (y^(i) − σ(z^(i)))

So we can write the weight update as Δw_j = −η ∂L/∂w_j and Δb = −η ∂L/∂b. Since we update all parameters simultaneously, our Adaline learning rule becomes w := w + Δw and b := b + Δb.

While the Adaline learning rule may look similar to the perceptron rule, it is important to note that the output of the activation function in Adaline, σ(z^(i)), is a real number, not an integer class label. Furthermore, the weight update in Adaline is calculated based on all examples in the training dataset, not just one example at a time, which is why this approach is known as batch gradient descent. To be more specific and avoid confusion, we will refer to this process as full batch gradient descent.
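Before looking at the full implementation, here is a minimal sketch (not the book's AdalineGD class) of a single full-batch gradient descent update using the MSE gradients above:

import numpy as np

def adaline_step(X, y, w, b, eta=0.01):
    """Perform one full-batch gradient descent update and return new (w, b)."""
    output = X.dot(w) + b                               # linear activation: sigma(z) = z
    errors = y - output                                 # deviations from the true labels
    w = w + eta * 2.0 * X.T.dot(errors) / X.shape[0]    # w := w - eta * dL/dw
    b = b + eta * 2.0 * errors.mean()                   # b := b - eta * dL/db
    return w, b

The book's AdalineGD class (see the code source) wraps exactly this kind of update in a fit loop that runs for n_iter epochs and records the loss per epoch.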
Implementing Adaline in Python
See the code source for the full AdalineGD class.

The accompanying figure illustrates what might happen if we change the value of a particular weight parameter to minimize the loss function, L. The left subfigure illustrates the case of a well-chosen learning rate, where the loss decreases gradually, moving in the direction of the global minimum. The subfigure on the right illustrates what happens if we choose a learning rate that is too large: we overshoot the global minimum.

Improving gradient descent through feature scaling
Gradient descent is one of the many algorithms that benefit from feature scaling. One method of feature scaling is standardization, which helps gradient descent converge faster; note, however, that it does not make the dataset normally distributed. Standardization centers the mean of each feature at zero and gives each feature a standard deviation of 1. To standardize a feature, we subtract the sample mean from every training example and divide it by the feature's standard deviation. Standardization aids gradient descent learning by making it easier to find a learning rate that works well for all weights: if features are on very different scales, a learning rate that is good for updating one weight might not work well for another. Using standardized features can stabilize training, allowing the optimizer to find a good or optimal solution in fewer steps. As we can see in the plots, Adaline has now converged after training on the standardized features. However, note that the MSE remains non-zero even though all flower examples were classified correctly.

Large-scale machine learning and stochastic gradient descent
We learned that minimizing a loss function involves taking steps in the opposite direction of the loss gradient calculated from the entire training dataset; this is known as full batch gradient descent. However, if we have a very large dataset with millions of data points, which is common in machine learning, full batch gradient descent can be computationally expensive, because we need to re-evaluate the whole training dataset each time we take a step toward the global minimum.

A popular alternative to the batch gradient descent algorithm is stochastic gradient descent (SGD), which is sometimes also called iterative or online gradient descent. Instead of updating the parameters based on the sum of the accumulated errors over all training examples, we update them incrementally for each training example, x^(i), using only that example's error term, (y^(i) − σ(z^(i))).

SGD can be seen as an approximation of gradient descent, and it usually converges faster because of the more frequent weight updates. Since each gradient is calculated from a single training example, the error surface is noisier than in gradient descent, which can be beneficial: SGD can escape shallow local minima more easily when working with nonlinear loss functions. To get good results with SGD, it is important to present the training data in a random order and to shuffle the training dataset for every epoch to avoid cycles.

SGD is also well suited to online learning, where the model is trained on the fly as new data arrives. This is useful when dealing with large amounts of data, such as customer data in web applications. The system can adapt quickly to changes, and the training data can be discarded after updating the model if storage space is an issue.
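As a minimal sketch (illustrative, not the book's AdalineSGD class) of one SGD epoch with shuffling and per-example updates:

import numpy as np

def sgd_epoch(X, y, w, b, eta=0.01, seed=1):
    """Run one epoch of per-example (stochastic) updates and return (w, b)."""
    rng = np.random.RandomState(seed)
    for i in rng.permutation(len(y)):      # shuffle the example order to avoid cycles
        output = X[i].dot(w) + b           # linear activation for a single example
        error = y[i] - output
        w = w + eta * 2.0 * X[i] * error   # incremental weight update
        b = b + eta * 2.0 * error          # incremental bias update
    return w, b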
A Tour of Machine Learning Classifiers Using Scikit-Learn
No single classifier works best across all possible scenarios. In practice, it is always recommended that you compare the performance of at least a handful of different learning algorithms to select the best model for the particular problem. The five main steps involved in training a supervised machine learning algorithm can be summarized as follows:
1. Selecting features and collecting labeled training examples.
2. Choosing a performance metric.
3. Choosing a learning algorithm and training a model.
4. Evaluating the performance of the model.
5. Changing the settings of the algorithm and tuning the model.

First steps with scikit-learn – training a perceptron
Now we will take a look at the scikit-learn API, which, as mentioned, combines a user-friendly and consistent interface with a highly optimized implementation of several classification algorithms. The scikit-learn library offers not only a large variety of learning algorithms, but also many convenient functions to preprocess data and to fine-tune and evaluate our models.

We put the petal length and petal width of the 150 flowers into a feature matrix, X, and the corresponding species labels into a vector, y. The np.unique(y) function returns the three unique class labels stored in iris.target:

from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

Next, we randomly split the X and y arrays into 30 percent test data (45 examples) and 70 percent training data (105 examples):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

The train_test_split function shuffles the dataset before splitting it into training and test sets. Without this shuffling, all examples from class 0 and class 1 would end up in the training set, leaving only class 2 examples in the test set. By setting a fixed random seed (random_state=1), we ensure that our results can be reproduced consistently.

1. Stratification: by "stratify" we mean that train_test_split ensures that the training and test sets have the same proportions of the different classes as the original dataset. It is like keeping the same mix of flower species in both sets.
2. Verification: to check whether this worked, we can use NumPy's bincount function, which counts how many times each class appears in an array. If the proportions match, our stratification is successful.

We then standardize the features using the StandardScaler class from scikit-learn's preprocessing module:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

We initialized a StandardScaler object, sc, which estimated the parameters (mean, μ, and standard deviation, σ) for each feature from the training data. The training data was then standardized using these parameters, and the same parameters were used to standardize the test dataset, ensuring that the two remain comparable. Most algorithms in scikit-learn already support multiclass classification by default via the one-versus-rest (OvR) method, which allows us to feed the three flower classes to the perceptron all at once.
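To run the verification step described above, a short check with NumPy's bincount (the printed counts follow from the stratified 70/30 split of the 150 Iris examples):

import numpy as np

print('Labels counts in y:', np.bincount(y))              # [50 50 50]
print('Labels counts in y_train:', np.bincount(y_train))  # [35 35 35]
print('Labels counts in y_test:', np.bincount(y_test))    # [15 15 15]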
We can now train the perceptron and make predictions via the predict method:

from sklearn.linear_model import Perceptron

ppn = Perceptron(eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)

y_pred = ppn.predict(X_test_std)
print('Misclassified examples: %d' % (y_test != y_pred).sum())

from sklearn.metrics import accuracy_score
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
print('Accuracy: %.3f' % ppn.score(X_test_std, y_test))

See the code source for the plot_decision_regions plotting function used to visualize the results.

Modeling class probabilities via logistic regression

Logistic regression and conditional probabilities
Logistic regression is an easily implemented and widely used classification model. It excels with linearly separable classes and, like the perceptron and Adaline, is a linear model for binary classification. Note that logistic regression can be readily generalized to multiclass settings, which is known as multinomial logistic regression, or softmax regression.

Logistic regression, as a binary classification model, is built on the odds, defined as p/(1 − p), where p is the probability of the event we want to predict. This event is represented as class label y = 1, and the features x are the conditions for this event, so the probability p is defined as p := p(y = 1 | x). We can then further define the logit function, which is simply the logarithm of the odds (log-odds):

    logit(p) = log( p / (1 − p) )

The logit function maps values from the range [0, 1] to the real number range; here, log refers to the natural logarithm, as is the common convention in computer science. The logistic model assumes a linear relationship between the net input (the weighted inputs plus the bias) and the log-odds:

    logit(p(y = 1 | x)) = w_1 x_1 + ... + w_m x_m + b = wᵀx + b = z

However, we are interested in the probability p, the class-membership probability. The inverse of the logit function, the sigmoid function, maps real numbers back to the [0, 1] probability range:

    σ(z) = 1 / (1 + e^(−z))

The sigmoid function's output, σ(z) = p(y = 1 | x; w, b), is the probability of an example belonging to class 1, given its features, x, and parameters, w and b. For instance, if σ(z) = 0.8, the example has an 80 percent chance of belonging to class 1 and a 20 percent chance of belonging to class 0. The predicted probability can then simply be converted into a binary outcome via a threshold function: we predict class 1 if σ(z) ≥ 0.5 and class 0 otherwise, which, looking at the plot of the sigmoid function, is equivalent to predicting class 1 whenever z ≥ 0.

Logistic regression is used not only for class prediction but also for estimating the class-membership probability itself. It is used in weather forecasting to predict and report the chance of rain, and in medicine to predict the likelihood of a disease given certain symptoms.

Learning the model weights via the logistic loss function
To learn the model's parameters, we define a likelihood, ℒ, to maximize, assuming that the individual examples in our dataset are independent:

    ℒ(w, b) = Π_i ( σ(z^(i)) )^y^(i) ( 1 − σ(z^(i)) )^(1 − y^(i))

In practice, it is easier to maximize the (natural) log of this equation, which is called the log-likelihood function:

    l(w, b) = Σ_i [ y^(i) log(σ(z^(i))) + (1 − y^(i)) log(1 − σ(z^(i))) ]

We can then rewrite the negative log-likelihood as a loss function, L, that can be minimized using gradient descent as in Chapter 2:

    L(w, b) = Σ_i [ −y^(i) log(σ(z^(i))) − (1 − y^(i)) log(1 − σ(z^(i))) ]

If we were to implement logistic regression ourselves, we could simply substitute the loss function, L, in our Adaline implementation from Chapter 2 with this new loss function (see the code source, LogisticRegressionGD).
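As a quick numeric sketch of these definitions (the net input value below is an assumed, illustrative number):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 1.5                                   # net input w^T x + b (illustrative value)
p = sigmoid(z)                            # class-membership probability p(y=1 | x)
print(p)                                  # ~0.82, so we predict class 1 (p >= 0.5)

y = 1                                     # true label
loss = -y * np.log(p) - (1 - y) * np.log(1 - p)
print(loss)                               # ~0.20: small loss for a confident, correct prediction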
Training a logistic regression model with scikit-learn
Scikit-learn's LogisticRegression class supports multiclass settings out of the box, and we train it via the fit method on the standardized training dataset. We set multi_class='ovr' here for illustration; 'multinomial' is the default and is recommended for mutually exclusive classes:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=100.0, solver='lbfgs', multi_class='ovr')
lr.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
                      classifier=lr, test_idx=range(105, 150))
plt.xlabel('Petal length [standardized]')
plt.ylabel('Petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

Note that there exist many different algorithms for solving optimization problems. For minimizing convex loss functions, such as the logistic regression loss, it is recommended to use more advanced approaches than regular stochastic gradient descent (SGD). In fact, scikit-learn implements a whole range of such optimization algorithms, which can be specified via the solver parameter, namely 'newton-cg', 'lbfgs', 'liblinear', 'sag', and 'saga'.

The probability that training examples belong to a certain class can be computed using the predict_proba method:

lr.predict_proba(X_test_std[:3, :])

array([[3.81527885e-09, 1.44792866e-01, 8.55207131e-01],
       [8.34020679e-01, 1.65979321e-01, 3.25737138e-13],
       [8.48831425e-01, 1.51168575e-01, 2.62277619e-14]])

Tackling overfitting via regularization
Overfitting occurs when a model performs well on training data but not on unseen data; it can be caused by having too many parameters, which makes the model overly complex for the underlying data. Underfitting means the model is not complex enough to capture the pattern in the data, which also leads to poor performance. In machine learning parlance, "high variance" corresponds to overfitting and "high bias" to underfitting. Variance measures the consistency of the model's predictions when it is retrained on different subsets of the data; bias measures how far off the predictions are from the correct values in general when the model is rebuilt on different training datasets.

Regularization helps find a good bias-variance tradeoff and also helps handle collinearity. It introduces additional information to penalize extreme parameter values. The most common form is L2 regularization, which adds the following term to the loss:

    (λ/2) ||w||² = (λ/2) Σ_j w_j²

where λ is the regularization parameter. The loss function for logistic regression can thus be regularized by adding this simple regularization term, which shrinks the weights during model training.

Maximum margin classification with support vector machines
Another powerful and widely used learning algorithm is the support vector machine (SVM), which can be considered an extension of the perceptron. Whereas the perceptron algorithm minimizes misclassification errors, in SVMs the optimization objective is to maximize the margin. The margin is defined as the distance between the separating hyperplane (the decision boundary) and the training examples that are closest to this hyperplane, the so-called support vectors.

Dealing with a nonlinearly separable case using slack variables
The slack variable, introduced by Vladimir Vapnik in 1995, allows the SVM optimization to converge even in the presence of misclassifications. Its use introduces a hyperparameter, C, that controls the penalty for misclassification: large values of C correspond to large error penalties. We can therefore use C to control the width of the margin and tune the bias-variance tradeoff. This concept is related to regularization, which we discussed in the previous section in the context of regularized regression, where decreasing the value of C increases the bias (underfitting) and lowers the variance (overfitting) of the model.
Let's train an SVM model to classify the different flowers in our Iris dataset:

from sklearn.svm import SVC

svm = SVC(kernel='linear', C=1.0, random_state=1)
svm.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
                      classifier=svm, test_idx=range(105, 150))
plt.xlabel('Petal length [standardized]')
plt.ylabel('Petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

Solving nonlinear problems using a kernel SVM
Another reason why SVMs enjoy high popularity among machine learning practitioners is that they can easily be kernelized to solve nonlinear classification problems. The basic idea behind kernel methods for dealing with linearly inseparable data is to create nonlinear combinations of the original features and project them onto a higher-dimensional space, via a mapping function, φ, where the data becomes linearly separable. As shown in Figure 3.14, we can transform a two-dimensional dataset into a new three-dimensional feature space, where the classes become separable, via the following projection:

    φ(x_1, x_2) = (z_1, z_2, z_3) = (x_1, x_2, x_1² + x_2²)

This allows us to separate the two classes shown in the plot via a linear hyperplane that becomes a nonlinear decision boundary if we project it back onto the original feature space, as illustrated with the concentric-circle dataset.

Using the kernel trick to find separating hyperplanes in a high-dimensional space
To solve a nonlinear problem using an SVM, we would transform the training data into a higher-dimensional feature space via a mapping function, φ, and train a linear SVM model to classify the data in this new feature space. Then, we could use the same mapping function, φ, to transform new, unseen data and classify it using the linear SVM model. However, one problem with this mapping approach is that constructing the new features is computationally very expensive, especially for high-dimensional data. This is where the so-called kernel trick comes into play. Although we did not go into much detail about how to solve the quadratic programming task to train an SVM, in practice we just need to replace the dot product of the mapped examples by a kernel function, which saves the expensive step of calculating this dot product between two points explicitly:

    K(x^(i), x^(j)) = φ(x^(i))ᵀ φ(x^(j))

One of the most widely used kernels is the radial basis function (RBF) kernel, which can simply be called the Gaussian kernel:

    K(x^(i), x^(j)) = exp( −||x^(i) − x^(j)||² / (2σ²) )

This is often simplified to:

    K(x^(i), x^(j)) = exp( −γ ||x^(i) − x^(j)||² )

Here, γ = 1/(2σ²) is a free parameter to be optimized. Roughly speaking, the term "kernel" can be interpreted as a similarity function between a pair of examples. The minus sign inverts the distance measure into a similarity score and, due to the exponential term, the resulting similarity score falls into a range between 1 (for identical examples) and 0 (for very dissimilar examples).

To use a kernel SVM, we simply take the SVC class from scikit-learn that we imported earlier and replace the kernel='linear' parameter with kernel='rbf':

svm = SVC(kernel='rbf', random_state=1, gamma=0.10, C=10.0)
svm.fit(X_xor, y_xor)
plot_decision_regions(X_xor, y_xor, classifier=svm)
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

The γ parameter, which we set to gamma=0.1, can be understood as a cut-off parameter for the Gaussian sphere. If we increase the value of γ, we shrink the influence or reach of each training example, which leads to a tighter and bumpier decision boundary.
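A small numeric illustration (not from the book) of how γ controls the RBF similarity between two points; larger γ makes the similarity decay faster with distance:

import numpy as np

x_i = np.array([1.0, 2.0])
x_j = np.array([2.0, 3.5])
sq_dist = np.sum((x_i - x_j) ** 2)        # squared Euclidean distance = 3.25

for gamma in (0.1, 1.0, 10.0):
    print(gamma, np.exp(-gamma * sq_dist))
# gamma=0.1  -> ~0.72  (the two points still look similar)
# gamma=1.0  -> ~0.039
# gamma=10.0 -> ~7.6e-15 (the influence of x_j has essentially vanished)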
To get a better understanding of γ, let's apply an RBF kernel SVM to our Iris flower dataset. The results illustrate that the γ parameter also plays an important role in controlling overfitting, or variance, when the algorithm is too sensitive to fluctuations in the training dataset.

Decision tree learning
In decision tree algorithms, we start at the root and split the data on the feature that results in the largest information gain (IG). We repeat this splitting procedure at each child node until the leaves are pure, meaning that all training examples at each node belong to the same class. In practice, to avoid overfitting, we often limit the maximum depth of the tree.

Maximizing IG – getting the most bang for your buck
In tree learning, we define an objective function to optimize: the goal is to maximize the information gain at each split,

    IG(D_p, f) = I(D_p) − Σ_j (N_j / N_p) I(D_j)

where f is the feature to split on, D_p and D_j are the datasets of the parent and the jth child node, I is the impurity measure, N_p is the number of examples at the parent node, and N_j is the number of examples at the jth child node. Information gain is therefore the difference between the impurity of the parent node and the sum of the weighted child node impurities. Most libraries implement binary decision trees, splitting each parent node into two child nodes.

The three impurity measures or splitting criteria that are commonly used in binary decision trees are Gini impurity (I_G), entropy (I_H), and the classification error (I_E). The entropy for all non-empty classes (p(i|t) ≠ 0) is defined as:

    I_H(t) = − Σ_i p(i|t) log2 p(i|t)

where p(i|t) is the proportion of the examples that belong to class i at a particular node, t. The entropy is 0 if all examples at a node belong to the same class, and it is maximal for a uniform class distribution; the entropy criterion therefore tries to maximize the mutual information in the tree. The Gini impurity can be understood as a criterion to minimize the probability of misclassification:

    I_G(t) = Σ_i p(i|t)(1 − p(i|t)) = 1 − Σ_i p(i|t)²

Another impurity measure is the classification error:

    I_E(t) = 1 − max{ p(i|t) }

Building a decision tree
Decision trees can create complex decision boundaries by dividing the feature space, but deeper trees can overfit. We will train a decision tree with a maximum depth of 4, using Gini impurity as our impurity measure. Although feature scaling may be desired for visualization purposes, note that feature scaling is not a requirement for decision tree algorithms. The code is as follows:

from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier(criterion='gini',
                                    max_depth=4,
                                    random_state=1)
tree_model.fit(X_train, y_train)
X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X_combined, y_combined,
                      classifier=tree_model, test_idx=range(105, 150))
plt.xlabel('Petal length [cm]')
plt.ylabel('Petal width [cm]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

A nice feature of scikit-learn is that it allows us to readily visualize the decision tree model after training via the following code:

from sklearn import tree

feature_names = ['Sepal length', 'Sepal width',
                 'Petal length', 'Petal width']
tree.plot_tree(tree_model, feature_names=feature_names, filled=True)
plt.show()

In the plot_tree function, setting filled=True colors the nodes by the majority class label at that node. The decision tree splits the data based on the features, with left branches corresponding to "True" and right branches to "False". The root node starts with 105 examples, and the first split is based on sepal width. The left child node is pure, containing only Iris-setosa examples, and further splits separate the other classes. Scikit-learn does not support manual post-pruning, but it does offer automatic cost-complexity post-pruning.
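As a quick numeric illustration (not from the book) of the impurity measures defined above, consider a node containing 40 examples of class 0 and 10 examples of class 1:

import numpy as np

p = np.array([40 / 50, 10 / 50])                 # class proportions p(i|t) = [0.8, 0.2]

entropy = -np.sum(p * np.log2(p))                # ~0.722
gini = 1.0 - np.sum(p ** 2)                      # 0.32
class_error = 1.0 - np.max(p)                    # 0.2
print(entropy, gini, class_error)

All three measures would be 0 for a pure node and maximal for a 50/50 split.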
Combining multiple decision trees via random forests
Ensemble methods, such as random forests, are popular in machine learning because of their robustness and good classification performance. A random forest averages multiple decision trees to build a more robust model. The random forest algorithm can be summarized in four simple steps:
1. Draw a random bootstrap sample of size n (randomly choose n examples from the training dataset with replacement).
2. Grow a decision tree from the bootstrap sample. At each node:
   a. Randomly select d features without replacement.
   b. Split the node using the feature that provides the best split according to the objective function, for instance, maximizing the information gain.
3. Repeat steps 1-2 k times.
4. Aggregate the predictions of the individual trees to assign the class label by majority vote.

Random forests, while less interpretable than single decision trees, are robust and require less hyperparameter tuning. The key parameter is the number of trees, k: more trees typically improve performance but increase the computational cost. Other random forest hyperparameters include the bootstrap sample size and the number of features chosen for each split; adjusting these controls the bias-variance tradeoff. Decreasing the bootstrap size can reduce overfitting but may lower overall performance, while increasing it may increase overfitting. The bootstrap size is usually set equal to the number of training examples, and the number of features considered at each split is typically the square root of the total number of features.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=25,
                                random_state=1, n_jobs=2)
forest.fit(X_train, y_train)
plot_decision_regions(X_combined, y_combined,
                      classifier=forest, test_idx=range(105, 150))
plt.xlabel('Petal length [cm]')
plt.ylabel('Petal width [cm]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

We trained a random forest of 25 decision trees via the n_estimators parameter; it uses Gini impurity to split the nodes. We also used the n_jobs parameter to parallelize the training across two cores.

K-nearest neighbors – a lazy learning algorithm
KNN is a typical example of a lazy learner. It is called "lazy" not because of its apparent simplicity, but because it does not learn a discriminative function from the training data; instead, it memorizes the training dataset. The KNN algorithm itself is fairly straightforward and can be summarized by the following steps:
1. Choose the number of neighbors, k, and a distance metric.
2. Find the k nearest neighbors of the data record that we want to classify.
3. Assign the class label by majority vote.

Based on the chosen distance metric, the KNN algorithm finds the k examples in the training dataset that are closest (most similar) to the point that we want to classify. The class label of the data point is then determined by a majority vote among its k nearest neighbors.

Parametric versus non-parametric models
Machine learning algorithms can be grouped into parametric and non-parametric models. Using parametric models, we estimate parameters from the training dataset to learn a function that can classify new data points without requiring the original training dataset anymore. Typical examples of parametric models are the perceptron, logistic regression, and the linear SVM. In contrast, non-parametric models cannot be characterized by a fixed set of parameters, and the number of parameters changes with the amount of training data. Two examples of non-parametric models that we have seen so far are the decision tree classifier/random forest and the kernel (but not linear) SVM.

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
                      classifier=knn, test_idx=range(105, 150))
plt.xlabel('Petal length [standardized]')
plt.ylabel('Petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

Choosing the right k is key to balancing overfitting and underfitting. We also need a distance metric that is appropriate for the dataset; for real-valued examples, the Euclidean distance is often used. It is important to standardize the data so that each feature contributes equally to the distance. The Minkowski distance is a generalization of the Euclidean and Manhattan distances.

Building Good Training Datasets – Data Preprocessing

Dealing with missing data

Identifying missing values in tabular data
To find missing values in a large DataFrame, use isnull, which returns True for missing entries, and then use sum to count the missing values in each column. Note that you can always access the underlying NumPy array of a DataFrame via its values attribute:

import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))
df.isnull().sum()
df.values

Eliminating training examples or features with missing values
One of the easiest ways to deal with missing data is simply to remove the corresponding features (columns) or training examples (rows) from the dataset entirely:

df.dropna(axis=0)     # drop rows with missing values
df.dropna(axis=1)     # drop columns that have at least one NaN in any row
df.dropna(how='all')  # only drop rows where all columns are NaN
df.dropna(thresh=4)   # drop rows that have fewer than 4 real values

Imputing missing values
One of the most common interpolation techniques is mean imputation, where we simply replace the missing value with the mean value of the entire feature column. A convenient way to achieve this is by using the SimpleImputer class from scikit-learn, as shown in the following code; other options for the strategy parameter are median and most_frequent:

from sklearn.impute import SimpleImputer
import numpy as np

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)

Alternatively, we can use pandas' fillna method and provide an imputation value as an argument:

df.fillna(df.mean())

Understanding the scikit-learn estimator API
We used scikit-learn's SimpleImputer to fill in missing values. It is part of the transformer API for data transformation, which has two main methods: fit learns the parameters from the data, and transform applies them. The data to be transformed must have the same features as the data the transformer was fitted on. Classifiers in scikit-learn, such as those used in Chapter 3, are called estimators; they have predict and, in some cases, transform methods. We use fit to train them with class labels, and we can then predict on new, unlabeled data.

Handling categorical data

Categorical data encoding with pandas
Categorical data comprises ordinal and nominal features. Ordinal features, such as t-shirt size, have an order (XL > L > M); nominal features, such as t-shirt color, do not imply any order.

import pandas as pd
df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']])
df.columns = ['color', 'size', 'price', 'classlabel']
Mapping ordinal features
To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers:

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)

Encoding class labels
Machine learning libraries generally expect class labels to be encoded as integers. Scikit-learn converts class labels internally, but it is good practice to provide them as integers to avoid technical glitches. We can simply enumerate the class labels, starting at 0:

import numpy as np
class_mapping = {label: idx for idx, label in
                 enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_mapping)

Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to achieve this:

from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)

Here, fit_transform is a shortcut for calling fit and transform separately, and we can use inverse_transform to get the original string labels back from the integers.

Performing one-hot encoding on nominal features
We previously used a dictionary to map sizes to integers. Since scikit-learn's classifiers treat class labels as unordered categories, we used LabelEncoder to turn the string labels into integers. We can apply a similar approach to the nominal color column in our dataset:

X = df[['color', 'size', 'price']].values
color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
# result: blue = 0, green = 1, red = 2

If we use this array as it is, we will make one of the most common mistakes in dealing with categorical data: a learning algorithm may now assume that the colors have an order, for example, that green is larger than blue and red is larger than green. This assumption is incorrect; the model could still work, just not at its best. One way to fix this is one-hot encoding, which creates a new dummy feature for each unique value of the nominal feature. The color feature thus becomes three features, blue, green, and red, and binary values indicate the color of an example; a blue example is encoded as blue=1, green=0, red=0. We can use scikit-learn's OneHotEncoder to perform this transformation:

from sklearn.preprocessing import OneHotEncoder
X = df[['color', 'size', 'price']].values
color_ohe = OneHotEncoder()
color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()

We applied the OneHotEncoder to a single column only, to avoid modifying the other columns. To selectively transform columns in a multi-feature array, we can use the ColumnTransformer, which takes a list of (name, transformer, columns) tuples:

from sklearn.compose import ColumnTransformer
X = df[['color', 'size', 'price']].values
c_transf = ColumnTransformer([
    ('onehot', OneHotEncoder(), [0]),
    ('nothing', 'passthrough', [1, 2])])
c_transf.fit_transform(X).astype(float)

A simpler way to create dummy features is to use pandas' get_dummies method. When applied to a DataFrame, it converts only the string columns and leaves all other columns unchanged:

pd.get_dummies(df[['price', 'color', 'size']])

One-hot encoding can introduce multicollinearity, which makes matrix inversion difficult for some methods. To reduce this, we can remove one column from the encoded array without losing information: for example, if color_blue is removed and we see color_green=0 and color_red=0, the color must be blue. With pandas' get_dummies, we can drop the first column by setting drop_first=True. In order to drop a redundant column via the OneHotEncoder, we need to set drop='first' and categories='auto'.

Optional: encoding ordinal features
If we are unsure about the numerical differences between the categories of ordinal features, or if the difference between two ordinal values is not defined, we can also encode them using a threshold encoding with 0/1 values. For example, we can split the feature size, with values M, L, and XL, into two new features, "x > M" and "x > L", as shown in the sketch below.
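A minimal sketch of this threshold encoding on a fresh toy column (not the book's exact code; the column and feature names are illustrative):

import pandas as pd

df_size = pd.DataFrame({'size': ['M', 'L', 'XL']})       # illustrative data
df_size['x > M'] = df_size['size'].apply(lambda s: 1 if s in {'L', 'XL'} else 0)
df_size['x > L'] = df_size['size'].apply(lambda s: 1 if s == 'XL' else 0)
print(df_size)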
Partitioning a dataset into separate training and test datasets
A convenient way to randomly partition the Wine dataset into separate test and training datasets is to use the train_test_split function from scikit-learn's model_selection submodule:

from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

Bringing features onto the same scale
Feature scaling is an important preprocessing step. It is not needed for decision trees and random forests, which are scale-invariant, but most other machine learning algorithms work much better when the features are on the same scale, as we saw in Chapter 2 when we used the gradient descent optimization algorithm. There are two common approaches to feature scaling: normalization and standardization. Normalization usually means rescaling the features to a range of [0, 1], which is a special case of min-max scaling. We can normalize the data by applying min-max scaling to each feature column:

from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

The procedure for standardization can be expressed by the following equation, where μ_x is the sample mean of a particular feature column and σ_x is the corresponding standard deviation:

    x_std^(i) = (x^(i) − μ_x) / σ_x

from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

Selecting meaningful features
If we notice that a model performs much better on a training dataset than on the test dataset, this observation is a strong indicator of overfitting.

L1 and L2 regularization as penalties against model complexity
L2 regularization is one approach to reduce the complexity of a model by penalizing large individual weights. We defined the squared L2 norm of our weight vector, w, as follows:

    ||w||_2² = Σ_j w_j²

Another approach to reduce the model complexity is the related L1 regularization:

    ||w||_1 = Σ_j |w_j|

A geometric interpretation of L2 regularization
Regularization adds a penalty term to the loss function to favor smaller weights; in other words, we penalize large weights. By increasing the regularization parameter, we shrink the weights toward zero and make our model less dependent on the training data. This concept can be visualized for the L2 penalty term: the L2 regularization term can be thought of as a shaded ball, and our weight coefficients cannot exceed our budget, meaning that the weight combination cannot fall outside this ball. We still want to minimize the loss function, so we choose the point where the L2 ball intersects the contours of the unpenalized loss function. As the regularization parameter grows, the penalized loss grows as well, which leads to a smaller L2 ball; if we increase the parameter toward infinity, the weights become zero, the center of the L2 ball. In summary, we aim to minimize the sum of the unpenalized loss and the penalty term, which can be understood as adding bias and preferring a simpler model to reduce the variance in the absence of sufficient training data.

Sparse solutions with L1 regularization
The main concept behind L1 regularization is similar to what we discussed in the previous section. However, since the L1 penalty is the sum of the absolute weight coefficients (remember that the L2 term is quadratic), we can represent it as a diamond-shaped budget. In the corresponding figure, we can see that the contour of the loss function touches the L1 diamond at w_1 = 0.
Since the contours of an L1-regularized system are sharp, it is more likely that the optimum, that is, the intersection between the ellipses of the loss function and the boundary of the L1 diamond, is located on one of the axes, which encourages sparsity:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l1', C=1.0,
                        solver='liblinear', multi_class='ovr')

Note that we also need to select a different optimization algorithm (for example, solver='liblinear'), since 'lbfgs' currently does not support L1-regularized loss optimization. Here, C is the inverse of the regularization parameter, λ.

Sequential feature selection algorithms
To simplify models and avoid overfitting, we can use dimensionality reduction, which comes in two flavors: feature selection and feature extraction. Feature selection picks a subset of the original features, while feature extraction constructs a new feature subspace from the original feature set. Here we look at a classic family of feature selection algorithms, sequential feature selection algorithms, which reduce an initial d-dimensional feature space to a k-dimensional feature subspace (k < d).
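The chapter's own sequential backward selection (SBS) implementation is not reproduced in this transcript. As a rough, hedged stand-in, scikit-learn (version 0.24 and later) ships a SequentialFeatureSelector that performs a similar greedy search; the snippet below assumes the standardized Wine training data (X_train_std, y_train) from above:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
sfs = SequentialFeatureSelector(knn, n_features_to_select=3,
                                direction='backward', cv=5)
sfs.fit(X_train_std, y_train)
print(sfs.get_support())                  # boolean mask of the selected features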