Machine Learning and Classification Algorithms PDF

Chapter 1 Machine Learning and Classification Algorithms Objectives: Thanks to the many powerful open source libraries that have been developed in recent years, there has probably never been a better time to break into the machine learning field and learn how to utilize powerful algorithms to spot patterns in data and make predictions about future events. In this chapter, we will learn about the main concepts and different types of machine learning. Together with a basic introduction to the relevant terminology, we will lay the groundwork for successfully using machine learning techniques for practical problem solving. In this chapter, we will cover the following topics: The general concepts of machine learning The three types of learning and basic terminology The building blocks for successfully designing machine learning systems Installing and setting up Python for data analysis and machine learning Building an intuition for machine learning algorithms Using pandas, NumPy, and matplotlib to read in, process, and visualize data Implementing linear classification algorithms in Python Structure: 1.1 Building intelligent machines to transform data into knowledge 1.2 Types of machine learning 1.3 An introduction to the basic terminology and notations 1.4 A roadmap for building machine learning systems 1.5 Using Python for machine learning 1.6 Artificial neurons – a brief glimpse into the early history of machine learning 1.7 Implementing a perceptron learning algorithm in Python 1.8 Adaptive linear neurons and the convergence of learning 1.9 Summary 1.10 Self Assessment Questions 1.1 Building Intelligent Machines To Transform Data Into Knowledge In this age of modern technology, there is one resource that we have in abundance: a large amount of structured and unstructured data. In the second half of the twentieth century, machine learning evolved as a subfield of artificial intelligence that involved the development of self-learning algorithms to gain knowledge from that data in order to make predictions. Instead of requiring humans to manually derive rules and build models from analyzing large amounts of data, machine learning offers a more efficient alternative for capturing the knowledge in data to gradually improve the performance of predictive models, and make data-driven decisions. Not only is machine learning becoming increasingly important in computer science research but it also plays an ever greater role in our everyday life. Thanks to machine learning, we enjoy robust e-mail spam filters, convenient text and voice recognition software, reliable Web search engines, challenging chess players, and, hopefully soon, safe and efficient self-driving cars. 1.2 Types of Machine Learning In this section, we will take a look at the three types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. We will learn about the fundamental differences between the three different learning types and, using conceptual examples, we will develop an intuition for the practical problem domains where these can be applied: 1.2.1 Making predictions about the future with supervised learning The main goal in supervised learning is to learn a model from labeled training data that allows us to make predictions about unseen or future data. Here, the term supervised refers to a set of samples where the desired output signals (labels) are already known. Considering the example of e-mail spam filtering, we can train a model using a supervised machine learning algorithm on a corpus of labeled e-mail, e-mail that are correctly marked as spam or not-spam, to predict whether a new e-mail belongs to either of the two categories. A supervised learning task with discrete class labels, such as in the previous e-mail spam-filtering example, is also called a classification task. Another subcategory of supervised learning is regression, where the outcome signal is a continuous value: 1.2.1.1. Classification for predicting class labels Classification is a subcategory of supervised learning where the goal is to predict the categorical class labels of new instances based on past observations. Those class labels are discrete, unordered values that can be understood as the group memberships of the instances. The previously mentioned example of e-mail-spam detection represents a typical example of a binary classification task, where the machine learning algorithm learns a set of rules in order to distinguish between two possible classes: spam and non-spam e-mail. However, the set of class labels does not have to be of a binary nature. The predictive model learned by a supervised learning algorithm can assign any class label that was presented in the training dataset to a new, unlabeled instance. A typical example of a multi-class classification task is handwritten character recognition. Here, we could collect a training dataset that consists of multiple handwritten examples of each letter in the alphabet. Now, if a user provides a new handwritten character via an input device, our predictive model will be able to predict the correct letter in the alphabet with certain accuracy. However, our machine learning system would be unable to correctly recognize any of the digits zero to nine, for example, if they were not part of our training dataset. The following figure illustrates the concept of a binary classification task given 30 training samples: 15 training samples are labeled as negative class (circles) and 15 training samples are labeled as positive class (plus signs). In this scenario, our dataset is two-dimensional, which means that each sample has two values associated with it: x1 and x2. Now, we can use a supervised machine learning algorithm to learn a rule—the decision boundary represented as a black dashed line—that can separate those two classes and classify new data into each of those two categories given its x1 and x2 values: 1.2.1.2. Regression for predicting continuous outcomes We learned in the previous section that the task of classification is to assign categorical, unordered labels to instances. A second type of supervised learning is the prediction of continuous outcomes, which is also called regression analysis. In regression analysis, we are given a number of predictor (explanatory) variables and a continuous response variable (outcome), and we try to find a relationship between those variables that allows us to predict an outcome. For example, let's assume that we are interested in predicting the Math SAT scores of our students. If there is a relationship between the time spent studying for the test and the final scores, we could use it as training data to learn a model that uses the study time to predict the test scores of future students who are planning to take this test. The term regression was devised by Francis Galton in his article Regression Towards Mediocrity in Hereditary Stature in 1886. Galton described the biological phenomenon that the variance of height in a population does not increase over time. He observed that the height of parents is not passed on to their children but the children's height is regressing towards the population mean. The following figure illustrates the concept of linear regression. Given a predictor variable x and a response variable y, we fit a straight line to this data that minimizes the distance—most commonly the average squared distance—between the sample points and the fitted line. We can now use the intercept and slope learned from this data to predict the outcome variable of new data: 1.2.2. Solving interactive problems with reinforcement learning Another type of machine learning is reinforcement learning. In reinforcement learning, the goal is to develop a system (agent) that improves its performance based on interactions with the environment. Since the information about the current state of the environment typically also includes a so-called reward signal, we can think of reinforcement learning as a field related to supervised learning. However, in reinforcement learning this feedback is not the correct ground truth label or value, but a measure of how well the action was measured by a reward function. Through the interaction with the environment, an agent can then use reinforcement learning to learn a series of actions that maximizes this reward via an exploratory trial-and-error approach or deliberative planning. A popular example of reinforcement learning is a chess engine. Here, the agent decides upon a series of moves depending on the state of the board (the environment), and the reward can be defined as win or lose at the end of the game: 1.2.3. Discovering hidden structures with unsupervised learning In supervised learning, we know the right answer beforehand when we train our model, and in reinforcement learning, we define a measure of reward for particular actions by the agent. In unsupervised learning, however, we are dealing with unlabeled data or data of unknown structure. Using unsupervised learning techniques, we are able to explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function. 1.2.3.1. Finding subgroups with clustering Clustering is an exploratory data analysis technique that allows us to organize a pile of information into meaningful subgroups (clusters) without having any prior knowledge of their group memberships. Each cluster that may arise during the analysis defines a group of objects that share a certain degree of similarity but are more dissimilar to objects in other clusters, which is why clustering is also sometimes called "unsupervised classification." Clustering is a great technique for structuring information and deriving meaningful relationships among data, For example, it allows marketers to discover customer groups based on their interests in order to develop distinct marketing programs. The figure below illustrates how clustering can be applied to organizing unlabeled data into three distinct groups based on the similarity of their features x1 and x2 : 1.2.3.2. Dimensionality reduction for data compression Another subfield of unsupervised learning is dimensionality reduction. Often we are working with data of high dimensionality—each observation comes with a high number of measurements—that can present a challenge for limited storage space and the computational performance of machine learning algorithms. Unsupervised dimensionality reduction is a commonly used approach in feature preprocessing to remove noise from data, which can also degrade the predictive performance of certain algorithms, and compress the data onto a smaller dimensional subspace while retaining most of the relevant information. Sometimes, dimensionality reduction can also be useful for visualizing data—for example, a high-dimensional feature set can be projected onto one-, two-, or three-dimensional feature spaces in order to visualize it via 3D- or 2D-scatterplots or histograms. The figure below shows an example where non-linear dimensionality reduction was applied to compress a 3D Swiss Roll onto a new 2D feature subspace: 1.3 An Introduction To The Basic Terminology And Notations Now that we have discussed the three broad categories of machine learning—supervised, unsupervised, and reinforcement learning—let us have a look at the basic terminology that we will be using in the next chapters. The following table depicts an excerpt of the Iris dataset, which is a classic example in the field of machine learning. The Iris dataset contains the measurements of 150 iris flowers from three different species: Setosa, Versicolor, and Viriginica. Here, each flower sample represents one row in our data set, and the flower measurements in centimeters are stored as columns, which we also call the features of the dataset: To keep the notation and implementation simple yet efficient, we will make use of some of the basics of linear algebra. In the following chapters, we will use a matrix and vector notation to refer to our data. We will follow the common convention to represent each sample as separate row in a feature matrix X , where each feature is stored as a separate column. The Iris dataset, consisting of 150 samples and 4 features, can then be written as a 150× 4 matrix X ∈ 150×4: For the rest of this book, we will use the superscript (i) to refer to the ith training sample, and the subscript j to refer to the jth dimension of the training dataset. We use lower-case, bold-face letters to refer to vectors ( x ∈ Rn×1 ) and upper-case, bold- face letters to refer to matrices, respectively ( X ∈ Rn×m ) ). To refer to single elements in a vector or matrix, we write the letters in italics x( n) or x((mn)) , respectively). For example, x1150 refers to the first dimension of flower sample 150, the sepal width. Thus, each row in this feature matrix represents one flower instance and can be written as four-dimensional column vector x( i ) ∈ R1× 4 , 𝑥( ) = 𝑥 () 𝑥 () 𝑥 () 𝑥 () Each feature dimension is a 150-dimensional row vector x( i ) ∈ R 150×1 , for example: 𝑥( ) ⎡ ⎤ ⎢ ( ) ⎥ 𝑥 =⎢ 𝑥 ⎥ ⎢ (⋮ ) ⎥ ⎣𝑥 ⎦ Similarly, we store the target variables (here: class labels) as a 150-dimensional column vector 𝑦( ) y= ⋮ 𝑦( ) (y ∈ { Setosa, Versicolor, Virginica}). 1.4 A Roadmap for Building Machine Learning Systems In the previous sections, we discussed the basic concepts of machine learning and the three different types of learning. In this section, we will discuss other important parts of a machine learning system accompanying the learning algorithm. The diagram below shows a typical workflow diagram for using machine learning in predictive modeling, which we will discuss in the following subsections: 1.4.1. Preprocessing – getting data into shape Raw data rarely comes in the form and shape that is necessary for the optimal performance of a learning algorithm. Thus, the preprocessing of the data is one of the most crucial steps in any machine learning application. If we take the Iris flower dataset from the previous section as an example, we could think of the raw data as a series of flower images from which we want to extract meaningful features. Useful features could be the color, the hue, the intensity of the flowers, the height, and the flower lengths and widths. Many machine learning algorithms also require that the selected features are on the same scale for optimal performance, which is often achieved by transforming the features in the range [0, 1] or a standard normal distribution with zero mean and unit variance, as we will see in the later chapters. Some of the selected features may be highly correlated and therefore redundant to a certain degree. In those cases, dimensionality reduction techniques are useful for compressing the features onto a lower dimensional subspace. Reducing the dimensionality of our feature space has the advantage that less storage space is required, and the learning algorithm can run much faster. To determine whether our machine learning algorithm not only performs well on the training set but also generalizes well to new data, we also want to randomly divide the dataset into a separate training and test set. We use the training set to train and optimize our machine learning model, while we keep the test set until the very end to evaluate the final model. 1.4.2. Training and selecting a predictive model In practice, it is essential to compare at least a handful of different algorithms in order to train and select the best performing model. But before we can compare different models, we first have to decide upon a metric to measure performance. One commonly used metric is classification accuracy, which is defined as the proportion of correctly classified instances. One legitimate question to ask is: how do we know which model performs well on the final test dataset and real-world data if we don't use this test set for the model selection but keep it for the final model evaluation? In order to address the issue embedded in this question, different cross-validation techniques can be used where the training dataset is further divided into training and validation subsets in order to estimate the generalization performance of the model. Finally, we also cannot expect that the default parameters of the different learning algorithms provided by software libraries are optimal for our specific problem task. Therefore, we will make frequent use of hyperparameter optimization techniques that help us to fine-tune the performance of our model in later chapters. Intuitively, we can think of those hyperparameters as parameters that are not learned from the data but represent the knobs of a model that we can turn to improve its performance, which will become much clearer in later chapters when we see actual examples. 1.4.3. Evaluating models and predicting unseen data instances After we have selected a model that has been fitted on the training dataset, we can use the test dataset to estimate how well it performs on this unseen data to estimate the generalization error. If we are satisfied with its performance, we can now use this model to predict new, future data. It is important to note that the parameters for the previously mentioned procedures—such as feature scaling and dimensionality reduction—are solely obtained from the training dataset, and the same parameters are later re-applied to transform the test dataset, as well as any new data samples—the performance measured on the test data may be overoptimistic otherwise. 1.5 Using Python For Machine Learning Python is one of the most popular programming languages for data science and therefore enjoys a large number of useful add-on libraries developed by its great community. Although the performance of interpreted languages, such as Python, for computation- intensive tasks is inferior to lower-level programming languages, extension libraries such as NumPy and SciPy have been developed that build upon lower layer Fortran and C implementations for fast and vectorized operations on multidimensional arrays. For machine learning programming tasks, we will mostly refer to the scikit-learn library, which is one of the most popular and accessible open source machine learning libraries as of today. 1.5.1. Installing Python packages Python is available for all three major operating systems—Microsoft Windows, Mac OS X, and Linux—and the installer, as well as the documentation, can be downloaded from the official Python website: https://www.python.org. This book is written for Python version >= 3.4.3, and it is recommended you use the most recent version of Python 3 that is currently available, although most of the code examples may also be compatible with Python >= 2.7.10. If you decide to use Python 2.7 to execute the code examples, please make sure that you know about the major differences between the two Python versions. A good summary about the differences between Python 3.4 and 2.7 can be found at https://wiki.python.org/moin/Python2orPython3. The additional packages that we will be using throughout this book can be installed via the pip installer program, which has been part of the Python standard library since Python 3.3. More information about pip can be found at https://docs.python.org/3/installing/index.html. After we have successfully installed Python, we can execute pip from the command line terminal to install additional Python packages: 1.5.2. pip install SomePackage Already installed packages can be updated via the --upgrade flag: pip install SomePackage --upgrade A highly recommended alternative Python distribution for scientific computing is Anaconda by Continuum Analytics. Anaconda is a free—including commercial use—enterprise-ready Python distribution that bundles all the essential Python packages for data science, math, and engineering in one user-friendly cross-platform distribution. The Anaconda installer can be downloaded at http://continuum.io/downloads#py34, and an Anaconda quick start-guide is available at https://store.continuum.io/static/img/Anaconda-Quickstart.pdf. After successfully installing Anaconda, we can install new Python packages using the following command: conda install SomePackage Existing packages can be updated using the following command: conda update SomePackage Throughout this book, we will mainly use NumPy's multi-dimensional arrays to store and manipulate data. Occasionally, we will make use of pandas, which is a library built on top of NumPy that provides additional higher level data manipulation tools that make working with tabular data even more convenient. To augment our learning experience and visualize quantitative data, which is often extremely useful to intuitively make sense of it, we will use the very customizable matplotlib library. The version numbers of the major Python packages that were used for writing this book are listed below. Please make sure that the version numbers of your installed packages are equal to, or greater than, those version numbers to ensure the code examples run correctly: NumPy 1.9.1 SciPy 0.14.0 scikit-learn 0.15.2 matplotlib 1.4.0 pandas 0.15.2 1.6 Artificial Neurons – A Brief Glimpse Into The Early History Of Machine Learning Before we discuss the perceptron and related algorithms in more detail, let us take a brief tour through the early beginnings of machine learning. Trying to understand how the biological brain works to design artificial intelligence, Warren McCullock and Walter Pitts published the first concept of a simplified brain cell, the so-called McCullock-Pitts (MCP) neuron, in 1943 (W. S. McCulloch and W. Pitts. A Logical Calculus of the Ideas Immanent in Nervous Activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943). Neurons are interconnected nerve cells in the brain that are involved in the processing and transmitting of chemical and electrical signals, which is illustrated in the following figure: McCullock and Pitts described such a nerve cell as a simple logic gate with binary outputs; multiple signals arrive at the dendrites, are then integrated into the cell body, and, if the accumulated signal exceeds a certain threshold, an output signal is generated that will be passed on by the axon. Only a few years later, Frank Rosenblatt published the first concept of the perceptron learning rule based on the MCP neuron model (F. Rosenblatt, The Perceptron, a Perceiving and Recognizing Automaton. Cornell Aeronautical Laboratory, 1957). With his perceptron rule, Rosenblatt proposed an algorithm that would automatically learn the optimal weight coefficients that are then multiplied with the input features in order to make the decision of whether a neuron fires or not. In the context of supervised learning and classification, such an algorithm could then be used to predict if a sample belonged to one class or the other. More formally, we can pose this problem as a binary classification task where we refer to our two classes as 1 (positive class) and -1 (negative class) for simplicity. We can then define an activation function that takes a linear combination of certain input values x and a corresponding weight vector w, where z is the so-called net input ( z = w1 x1 +…+ wm xm ): 𝑤 𝑥 w= ⋮ , x= ⋮ 𝑤 𝑥 Now, if the activation of a particular sample x( i ) , that is, the output of φ ( z ) , is greater than a defined threshold θ , we predict class 1 and class -1, otherwise, in the perceptron algorithm, the activation function φ (⋅) is a simple unit step function, which is sometimes also called the Heaviside step function: 𝟏 𝒊𝒇 𝒛 ≥ 𝜽 ∅(𝒛) = −𝟏 𝒐𝒕𝒉𝒆𝒓𝒘𝒊𝒔𝒆 For simplicity, we can bring the threshold θ to the left side of the equation and define a weight- zero as w0= −θ and x0 = 1, so that we write z in a more compact form z= w0x0+ w1x1+ …+ wmxm=wTx 𝟏 𝒊𝒇 𝒛 ≥ ∅ and ∅(𝒛) =. −𝟏 𝒐𝒕𝒉𝒆𝒓𝒘𝒊𝒔𝒆 In the following sections, we will often make use of basic notations from linear algebra. For example, we will abbreviate the sum of the products of the values in x and w using a vector dot product, whereas superscript T stands for transpose, which is an operation that transforms a column vector into a row vector and vice versa: z = w0 x0 + w1 x1 +…. + wm xm = ∑mj=0 x j w j = w T x 4 For Example: [1 2 3 ] × 5 = 1 × 4 + 2 × 5 + 3 × 6 = 32 6 Furthermore, the transpose operation can also be applied to a matrix to reflect it over its diagonal, for example: 1 2 1 3 5 3 4 = 2 4 6 5 6 In this book, we will only use the very basic concepts from linear algebra. However, if you need a quick refresher, please take a look at Zico Kolter's excellent Linear Algebra Review and Reference, which is freely available at http://www.cs.cmu.edu/~zkolter/course/linalg/linalg_ notes.pdf. The following figure illustrates how the net input z = w T x is squashed into a binary output (-1 or 1) by the activation function of the perceptron (left subfigure) and how it can be used to discriminate between two linearly separable classes (right subfigure): The whole idea behind the MCP neuron and Rosenblatt's thresholded perceptron model is to use a reductionist approach to mimic how a single neuron in the brain works: it either fires or it doesn't. Thus, Rosenblatt's initial perceptron rule is fairly simple and can be summarized by the following steps: 1. Initialize the weights to 0 or small random numbers. 2. For each training sample x( i ) perform the following steps:  Compute the output value yˆ.  Update the weights. Here, the output value is the class label predicted by the unit step function that we defined earlier, and the simultaneous update of each weight wj in the weight vector w can be more formally written as: wj := wj + ∆wj. The value of ∆wj , which is used to update the weight wj , is calculated by the perceptron learning rule: ∆w j = η ( y( i ) − yˆ ( i ) )x(ji ) Where η is the learning rate (a constant between 0.0 and 1.0), y( i ) is the true class label of the i th training sample, and yˆ( i ) is the predicted class label. It is important to note that all weights in the weight vector are being updated simultaneously, which means that we don't recompute the before all of the weights wj were updated. Concretely, for a 2D dataset, we would write the update as follows: ∆ w0 = η (y( i ) − output( i ) ) ∆ w1 = η (y( i ) − output( i ) )x1( i ) ∆ w2 = η (y( i ) − output( i ) )x2(i ) Before we implement the perceptron rule in Python, let us make a simple thought experiment to illustrate how beautifully simple this learning rule really is. In the two scenarios where the perceptron predicts the class label correctly, the weights remain unchanged: ∆ wj = η ( −1( i ) − −1( i ) )x(ji ) = 0 ∆ wj = η (1( i ) − 1( i ) )x(ji ) = 0 However, in the case of a wrong prediction, the weights are being pushed towards the direction of the positive or negative target class, respectively: ∆wj = η (1( i ) − −1( i ) )x(ji ) = η ( 2) x(ji ) ∆ wj = η ( −1( i ) −1( i ) )x(ji ) = η ( −2) x(ji ) To get a better intuition for the multiplicative factor x (i ) , let us go through another simple j example, where: yˆ (ji ) = +1, y( i ) = −1, η = 1 Let's assume that x (ji ) = 0.5 , and we misclassify this sample as -1. In this case, we would increase the corresponding weight by 1 so that the activation x (ji ) = w(ji ) will be more positive the next time we encounter this sample and thus will be more likely to be above the threshold of the unit step function to classify the sample as +1: ∆ w(ji ) = (1( i ) − −1( i ) )0.5( i ) = ( 2)0.5( i ) = 1 The weight update is proportional to the value of x (ji ). For example, if we have another sample x (ji ) = 2 that is incorrectly classified as -1, we'd push the decision boundary by an even larger extend to classify this sample correctly the next time: ∆ wj = (1( i ) − −1( i ) )2 ( i ) = ( 2) 2 ( i ) = 4 It is important to note that the convergence of the perceptron is only guaranteed if the two classes are linearly separable and the learning rate is sufficiently small. If the two classes can't be separated by a linear decision boundary, we can set a maximum number of passes over the training dataset (epochs) and/or a threshold for the number of tolerated misclassifications—the perceptron would never stop updating the weights otherwise: Now, before we jump into the implementation in the next section, let us summarize what we just learned in a simple figure that illustrates the general concept of the perceptron: The preceding figure illustrates how the perceptron receives the inputs of a sample and combines them with the weights w to compute the net input. The net input is then passed on to the activation function (here: the unit step function), which generates a binary output -1 or +1—the predicted class label of the sample. During the learning phase, this output is used to calculate the error of the prediction and update the weights. 1.7 Implementing A Perceptron Learning Algorithm In Python In the previous section, we learned how Rosenblatt's perceptron rule works; let us now go ahead and implement it in Python and apply it to the Iris dataset that we introduced. We will take an objected-oriented approach to define the perceptron interface as a Python Class, which allows us to initialize new perceptron objects that can learn from data via a fit method, and make predictions via a separate predict method. As a convention, we add an underscore to attributes that are not being created upon the initialization of the object but by calling the object's other methods—for example, self.w_. If you are not yet familiar with Python's scientific libraries or need a refresher, please see the following resources: NumPy: http://wiki.scipy.org/Tentative_NumPy_Tutorial Pandas: http://pandas.pydata.org/pandas-docs/stable/ tutorials.html Matplotlib: http://matplotlib.org/ussers/beginner.html Also, to better follow the code examples, I recommend you download the IPython notebooks from the Packt website. For a general introduction to IPython notebooks, please visit https://ipython. org/ipython-doc/3/notebook/index.html. import numpy as np class Perceptron(object): """Perceptron classifier. Parameters ------------ eta : float Learning rate (between 0.0 and 1.0) n_iter : int Passes over the training dataset. Attributes ----------- w_ : 1d-array Weights after fitting. errors_ : list Number of misclassifications in every epoch. """ def __init__(self, eta=0.01, n_iter=10): self.eta = eta self.n_iter = n_iter def fit(self, x, y): """Fit training data. Parameters ---------- X : {array-like}, shape = [n_samples, n_features] Training vectors, where n_samples is the number of samples and n_features is the number of features. y : array-like, shape = [n_samples] Target values. Returns ------- self : object """ self.w_ = np.zeros(1 + X.shape) self.errors_ = [] for _ in range(self.n_iter): errors = 0 for xi, target in zip(x, y): update = self.eta * (target - self.predict(xi)) self.w_[1:] += update * xi self.w_ += update errors += int(update != 0.0) self.errors_.append(errors) return self def net_input(self, X): """Calculate net input""" return np.dot(X, self.w_[1:]) + self.w_ def predict(self, X): """Return class label after unit step""" return np.where(self.net_input(X) >= 0.0, 1, -1) Using this perceptron implementation, we can now initialize new Perceptron objects with a given learning rate eta and n_iter, which is the number of epochs (passes over the training set). Via the fit method we initialize the weights in self.w_ to a zero-vector R m+1where m stands for the number of dimensions (features) in the dataset where we add 1 for the zero-weight (that is, the threshold). After the weights have been initialized, the fit method loops over all individual samples in the training set and updates the weights according to the perceptron learning rule that we discussed in the previous section. The class labels are predicted by the predict method, which is also called in the fit method to predict the class label for the weight update, but predict can also be used to predict the class labels of new data after we have fitted our model. Furthermore, we also collect the number of misclassifications during each epoch in the list self.errors_ so that we can later analyze how well our perceptron performed during the training. The np.dot function that is used in the net_input method simply calculates the vector dot product w T x. Instead of using NumPy to calculate the vector dot product between two arrays a and b via a.dot(b) or np.dot(a, b),we could also perform the calculation in pure Python via sum([j*j for i,j in zip(a, b)]. However, the advantage of using NumPy over classic Python for-loop structures is that its arithmetic operations are vectorized. Vectorization means that an elemental arithmetic operation is automatically applied to all elements in an array. By formulating our arithmetic operations as a sequence of instructions on an array rather than performing a set of operations for each element one at a time, we can make better use of our modern CPU architectures with Single Instruction, Multiple Data (SIMD) support. Furthermore, NumPy uses highly optimized linear algebra libraries, such as Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) that have been written in C or Fortran. Lastly, NumPy also allows us to write our code in a more compact and intuitive way using the basics of linear algebra, such as vector and matrix dot products. 1.7.1 Training a perceptron model on the Iris dataset To test our perceptron implementation, we will load the two flower classes Setosa and Versicolor from the Iris dataset. Although, the perceptron rule is not restricted to two dimensions, we will only consider the two features sepal length and petal length for visualization purposes. Also, we only chose the two flower classes Setosa and Versicolor for practical reasons. However, the perceptron algorithm can be extended to multi-class classification—for example, through the One-vs.-All technique. First, we will use the pandas library to load the Iris dataset directly from the UCI Machine Learning Repository into a DataFrame object and print the last five lines via the tail method to check that the data was loaded correctly: >>> import pandas as pd >>> df = pd.read_csv('https://archive.ics.uci.edu/ml/'... 'machine-learning-databases/iris/iris.data', header=None) >>> df.tail() Next, we extract the first 100 class labels that correspond to the 50 Iris-Setosa and 50 Iris- Versicolor flowers, respectively, and convert the class labels into the two integer class labels 1 (Versicolor) and -1 (Setosa) that we assign to a vector y where the values method of a pandas DataFrame yields the corresponding NumPy representation. Similarly, we extract the first feature column (sepal length) and the third feature column (petal length) of those 100 training samples and assign them to a feature matrix X, which we can visualize via a two-dimensional scatter plot: >>> import matplotlib.pyplot as plt >>> import numpy as np >>> y = df.iloc[0:100, 4].values >>> y = np.where(y == 'Iris-setosa', -1, 1) >>> x = df.iloc[0:100, [0, 2]].values >>> plt.scatter(X[:50, 0], X[:50, 1],... color='red', marker='o', label='setosa') >>> plt.scatter(X[50:100, 0], X[50:100, 1],... color='blue', marker='x', label='versicolor') >>> plt.xlabel('petal length') >>> plt.ylabel('sepal length') >>> plt.legend(loc='upper left') >>> plt.show() After executing the preceding code example we should now see the following scatterplot: Now it's time to train our perceptron algorithm on the Iris data subset that we just extracted. Also, we will plot the misclassification error for each epoch to check if the algorithm converged and found a decision boundary that separates the two Iris flower classes: >>> ppn = Perceptron(eta=0.1, n_iter=10) >>> ppn.fit(X, y) >>> plt.plot(range(1, len(ppn.errors_) + 1), ppn.errors_,... marker='o') >>> plt.xlabel('Epochs') >>> plt.ylabel('Number of misclassifications') >>> plt.show() After executing the preceding code, we should see the plot of the misclassification errors versus the number of epochs, as shown next: As we can see in the preceding plot, our perceptron already converged after the sixth epoch and should now be able to classify the training samples perfectly. Let us implement a small convenience function to visualize the decision boundaries for 2D datasets: from matplotlib.colors import ListedColormap def plot_decision_regions(X, y, classifier, resolution=0.02): # setup marker generator and color map markers = ('s', 'x', 'o', '^', 'v') colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan') cmap = ListedColormap(colors[:len(np.unique(y))]) # plot the decision surface x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1 x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1 xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,resolution),np.arange(x2_min, x2_max, resolution)) Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T) Z = Z.reshape(xx1.shape) plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap) plt.xlim(xx1.min(), xx1.max()) plt.ylim(xx2.min(), xx2.max()) # plot class samples for idx, cl in enumerate(np.unique(y)): plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], alpha=0.8, c=cmap(idx), marker=markers[idx], label=cl) First, we define a number of colors and markers and create a color map from the list of colors via ListedColormap. Then, we determine the minimum and maximum values for the two features and use those feature vectors to create a pair of grid arrays xx1 and xx2 via the NumPy meshgrid function. Since we trained our perceptron classifier on two feature dimensions, we need to flatten the grid arrays and create a matrix that has the same number of columns as the Iris training subset so that we can use the predict method to predict the class labels Z of the corresponding grid points. After reshaping the predicted class labels Z into a grid with the same dimensions as xx1 and xx2, we can now draw a contour plot via matplotlib's contourf function that maps the different decision regions to different colors for each predicted class in the grid array: >>> plot_decision_regions(X, y, classifier=ppn) >>> plt.xlabel('sepal length [cm]') >>> plt.ylabel('petal length [cm]') >>> plt.legend(loc='upper left') >>> plt.show() After executing the preceding code example, we should now see a plot of the decision regions, as shown in the following figure: As we can see in the preceding plot, the perceptron learned a decision boundary that was able to classify all flower samples in the Iris training subset perfectly. 1.8 Adaptive Linear Neurons And The Convergence Of Learning In this section, we will take a look at another type of single-layer neural network: ADAptive LInear NEuron (Adaline). Adaline was published, only a few years after Frank Rosenblatt's perceptron algorithm, by Bernard Widrow and his doctoral student Tedd Hoff, and can be considered as an improvement on the latter (B. Widrow et al. Adaptive "Adaline" neuron using chemical "memistors". Number Technical Report 1553-2. Stanford Electron. Labs. Stanford, CA, October 1960). The Adaline algorithm is particularly interesting because it illustrates the key concept of defining and minimizing cost functions, which will lay the groundwork for understanding more advanced machine learning algorithms for classification, such as logistic regression and support vector machines, as well as regression models that we will discuss in future chapters. The key difference between the Adaline rule (also known as the Widrow-Hoff rule) and Rosenblatt's perceptron is that the weights are updated based on a linear activation function rather than a unit step function like in the perceptron. In Adaline, this linear activation function φ ( z ) is simply the identity function of the net input so that φ ( w T x) = w T x. While the linear activation function is used for learning the weights, a quantizer, which is similar to the unit step function that we have seen before, can then be used to predict the class labels, as illustrated in the following figure: If we compare the preceding figure to the illustration of the perceptron algorithm that we saw earlier, the difference is that we know to use the continuous valued output from the linear activation function to compute the model error and update the weights, rather than the binary class labels. 1.8.1 Minimizing cost functions with gradient descent One of the key ingredients of supervised machine learning algorithms is to define an objective function that is to be optimized during the learning process. This objective function is often a cost function that we want to minimize. In the case of Adaline, we can define the cost function J to learn the weights as the Sum of Squared Errors (SSE) between the calculated outcome and the true class label J ( w) = ∑i (y( i ) −φ ( z( i ) ))2 The term is just added for our convenience; it will make it easier to derive the gradient, as we will see in the following paragraphs. The main advantage of this continuous linear activation function is—in contrast to the unit step function—that the cost function becomes differentiable. Another nice property of this cost function is that it is convex; thus, we can use a simple, yet powerful, optimization algorithm called gradient descent to find the weights that minimize our cost function to classify the samples in the Iris dataset. As illustrated in the following figure, we can describe the principle behind gradient descent as climbing down a hill until a local or global cost minimum is reached. In each iteration, we take a step away from the gradient where the step size is determined by the value of the learning rate as well as the slope of the gradient: Using gradient descent, we can now update the weights by taking a step away from the gradient ∇J (w) of our cost function J (w): w := w + ∆w Here, the weight change w is defined as the negative gradient multiplied by the learning rate η : ∆w = −η∆ J (w) To compute the gradient of the cost function, we need to compute the partial derivative of the cost function with respect to each weight so that we can write the update of weight j w as: : Since we update all weights simultaneously, our Adaline learning rule becomes w := w + ∆w. For those who are familiar with calculus, the partial derivative of the SSE cost function with respect to the jth weight in can be obtained as follows: Although the Adaline learning rule looks identical to the perceptron rule, the φ ( z( i ) ) with z( i ) = w T x( i ) is a real number and not an integer class label. Furthermore, the weight update is calculated based on all samples in the training set (instead of updating the weights incrementally after each sample), which is why this approach is also referred to as "batch" gradient descent. 1.8.2. Implementing an Adaptive Linear Neuron in Python Since the perceptron rule and Adaline are very similar, we will take the perceptron implementation that we defined earlier and change the fit method so that the weights are updated by minimizing the cost function via gradient descent: class AdalineGD(object): """ADAptive LInear NEuron classifier. Parameters eta : float Learning rate (between 0.0 and 1.0) n_iter : int Passes over the training dataset. Attributes ----------- w_ : 1d-array Weights after fitting. errors_ : list Number of misclassifications in every epoch. """ def __init__(self, eta=0.01, n_iter=50): self.eta = eta self.n_iter = n_iter def fit(self, X, y): """ Fit training data. Parameters --------- X : {array-like}, shape = [n_samples, n_features] Training vectors, where n_samples is the number of samples and n_features is the number of features. y : array-like, shape = [n_samples] Target values. Returns ------- self : object """ self.w_ = np.zeros(1 + X.shape) self.cost_ = [] for i in range(self.n_iter): output = self.net_input(X) errors = (y - output) self.w_[1:] += self.eta * X.T.dot(errors) self.w_ += self.eta * errors.sum() cost = (errors**2).sum() / 2.0 self.cost_.append(cost) return self def net_input(self, X): """Calculate net input""" return np.dot(X, self.w_[1:]) + self.w_ def activation(self, X): """Compute linear activation""" return self.net_input(X) def predict(self, X): """Return class label after unit step""" return np.where(self.activation(X) >= 0.0, 1, -1) Instead of updating the weights after evaluating each individual training sample, as in the perceptron, we calculate the gradient based on the whole training dataset via self.eta * errors.sum() for the zero-weight and via self.eta * X.T.dot(errors) for the weights 1 to m where X.T.dot(errors) is a matrix-vector multiplication between our feature matrix and the error vector. Similar to the previous perceptron implementation, we collect the cost values in a list self.cost_ to check if the algorithm converged after training. In practice, it often requires some experimentation to find a good learning rate η for optimal convergence. So, let's choose two different learning rates η = 0.1 and η = 0.0001 to start with and plot the cost functions versus the number of epochs to see how well the Adaline implementation learns from the training data. Let us now plot the cost against the number of epochs for the two different learning rates: >>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4)) >>> ada1 = AdalineGD(n_iter=10, eta=0.01).fit(X, y) >>> ax.plot(range(1, len(ada1.cost_) + 1),... np.log10(ada1.cost_), marker='o') >>> ax.set_xlabel('Epochs') >>> ax.set_ylabel('log(Sum-squared-error)') >>> ax.set_title('Adaline - Learning rate 0.01') >>> ada2 = AdalineGD(n_iter=10, eta=0.0001).fit(X, y) >>> ax.plot(range(1, len(ada2.cost_) + 1),... ada2.cost_, marker='o') >>> ax.set_xlabel('Epochs') >>> ax.set_ylabel('Sum-squared-error') >>> ax.set_title('Adaline - Learning rate 0.0001') >>> plt.show() As we can see in the resulting cost function plots next, we encountered two different types of problems. The left chart shows what could happen if we choose a learning rate that is too large— instead of minimizing the cost function, the error becomes larger in every epoch because we overshoot the global minimum: Although we can see that the cost decreases when we look at the right plot, the chosen learning rate η = 0.0001 is so small that the algorithm would require a very large number of epochs to converge. The following figure illustrates how we change the value of a particular weight parameter to minimize the cost function J (left subfigure). The subfigure on the right illustrates what happens if we choose a learning rate that is too large, we overshoot the global minimum: Gradient descent is one of the many algorithms that benefit from feature scaling. Here, we will use a feature scaling method called standardization, which gives our data the property of a standard normal distribution. The mean of each feature is centered at value 0 and the feature column has a standard deviation of 1. For example, to standardize the j th feature, we simply need to subtract the sample mean µj from every training sample and divide it by its standard deviation σ j: Here x j is a vector consisting of the jth feature values of all training samples n. Standardization can easily be achieved using the NumPy methods mean and std: >>> X_std = np.copy(X) >>> X_std[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std() >>> X_std[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std() After standardization, we will train the Adaline again and see that it now converges using a learning rate η = 0.01: >>> ada = AdalineGD(n_iter=15, eta=0.01) >>> ada.fit(X_std, y) >>> plot_decision_regions(X_std, y, classifier=ada) >>> plt.title('Adaline - Gradient Descent') >>> plt.xlabel('sepal length [standardized]') >>> plt.ylabel('petal length [standardized]') >>> plt.legend(loc='upper left') >>> plt.show() >>> plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o') >>> plt.xlabel('Epochs') >>> plt.ylabel('Sum-squared-error') >>> plt.show() After executing the preceding code, we should see a figure of the decision regions as well as a plot of the declining cost, as shown in the following figure: As we can see in the preceding plots, the Adaline now converges after training on the standardized features using a learning rate η = 0.01. However, note that the SSE remains non- zero even though all samples were classified correctly. 1.8.3. Large scale machine learning and stochastic gradient descent In the previous section, we learned how to minimize a cost function by taking a step into the opposite direction of a gradient that is calculated from the whole training set; this is why this approach is sometimes also referred to as batch gradient descent. Now imagine we have a very large dataset with millions of data points, which is not uncommon in many machine learning applications. Running batch gradient descent can be computationally quite costly in such scenarios since we need to reevaluate the whole training dataset each time we take one step towards the global minimum. A popular alternative to the batch gradient descent algorithm is stochastic gradient descent, sometimes also called iterative or on-line gradient descent. Instead of updating the weights based on the sum of the accumulated errors over all samples x( i ) : We update the weights incrementally for each training sample: Although stochastic gradient descent can be considered as an approximation of gradient descent, it typically reaches convergence much faster because of the more frequent weight updates. Since each gradient is calculated based on a single training example, the error surface is noisier than in gradient descent, which can also have the advantage that stochastic gradient descent can escape shallow local minima more readily. To obtain accurate results via stochastic gradient descent, it is important to present it with data in a random order, which is why we want to shuffle the training set for every epoch to prevent cycles. Another advantage of stochastic gradient descent is that we can use it for online learning. In online learning, our model is trained on-the-fly as new training data arrives. This is especially useful if we are accumulating large amounts of data—for example, customer data in typical web applications. Using online learning, the system can immediately adapt to changes and the training data can be discarded after updating the model if storage space in an issue. Since we already implemented the Adaline learning rule using gradient descent, we only need to make a few adjustments to modify the learning algorithm to update the weights via stochastic gradient descent. Inside the fit method, we will now update the weights after each training sample. Furthermore, we will implement an additional partial_fit method, which does not reinitialize the weights, for on-line learning. In order to check if our algorithm converged after training, we will calculate the cost as the average cost of the training samples in each epoch. Furthermore, we will add an option to shuffle the training data before each epoch to avoid cycles when we are optimizing the cost function; via the random_state parameter, we allow the specification of a random seed for consistency: from numpy.random import seed class AdalineSGD(object): """ADAptive LInear NEuron classifier. Parameters ------------ eta : float Learning rate (between 0.0 and 1.0) n_iter : int Passes over the training dataset. Attributes ----------- w_ : 1d-array Weights after fitting. errors_ : list Number of misclassifications in every epoch. shuffle : bool (default: True) Shuffles training data every epoch if True to prevent cycles. random_state : int (default: None) Set random state for shuffling and initializing the weights. """ def __init__(self, eta=0.01, n_iter=10, shuffle=True, random_state=None): self.eta = eta self.n_iter = n_iter self.w_initialized = False self.shuffle = shuffle if random_state: seed(random_state) def fit(self, X, y): """ Fit training data. Parameters ---------- X : {array-like}, shape = [n_samples, n_features] Training vectors, where n_samples is the number of samples and n_features is the number of features. y : array-like, shape = [n_samples] Target values. Returns ------- self : object """ self._initialize_weights(X.shape) self.cost_ = [] for i in range(self.n_iter): if self.shuffle: X, y = self._shuffle(X, y) cost = [] for xi, target in zip(X, y): cost.append(self._update_weights(xi, target)) avg_cost = sum(cost)/len(y) self.cost_.append(avg_cost) return self def partial_fit(self, X, y): """Fit training data without reinitializing the weights""" if not self.w_initialized: self._initialize_weights(X.shape) if y.ravel().shape > 1: for xi, target in zip(X, y): self._update_weights(xi, target) else: self._update_weights(X, y) return self def _shuffle(self, X, y): """Shuffle training data""" r = np.random.permutation(len(y)) return X[r], y[r] def _initialize_weights(self, m): """Initialize weights to zeros""" self.w_ = np.zeros(1 + m) self.w_initialized = True def _update_weights(self, xi, target): """Apply Adaline learning rule to update the weights""" output = self.net_input(xi) error = (target - output) self.w_[1:] += self.eta * xi.dot(error) self.w_ += self.eta * error cost = 0.5 * error**2 return cost def net_input(self, X): """Calculate net input""" return np.dot(X, self.w_[1:]) + self.w_ def activation(self, X): """Compute linear activation""" return self.net_input(X) def predict(self, X): """Return class label after unit step""" return np.where(self.activation(X) >= 0.0, 1, -1) The _shuffle method that we are now using in the AdalineSGD classifier works as follows: via the permutation function in numpy.random, we generate a random sequence of unique numbers in the range 0 to 100. Those numbers can then be used as indices to shuffle our feature matrix and class label vector. We can then use the fit method to train the AdalineSGD classifier and use our plot_decision_regions to plot our training results: >>> ada = AdalineSGD(n_iter=15, eta=0.01, random_state=1) >>> ada.fit(X_std, y) >>> plot_decision_regions(X_std, y, classifier=ada) >>> plt.title('Adaline - Stochastic Gradient Descent') >>> plt.xlabel('sepal length [standardized]') >>> plt.ylabel('petal length [standardized]') >>> plt.legend(loc='upper left') >>> plt.show() >>> plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o') >>> plt.xlabel('Epochs') >>> plt.ylabel('Average Cost') >>> plt.show() The two plots that we obtain from executing the preceding code example are shown in the following figure: As we can see, the average cost goes down pretty quickly, and the final decision boundary after 15 epochs looks similar to the batch gradient descent with Adaline. If we want to update our model—for example, in an on-line learning scenario with streaming data—we could simply call the partial_fit method on individual samples—for instance, ada.partial_fit(X_std[0, :], y). 1.9 Summary We learned that supervised learning is composed of two important subfields: classification and regression. While classification models allow us to categorize objects into known classes, we can use regression analysis to predict the continuous outcomes of target variables. Unsupervised learning not only offers useful techniques for discovering structures in unlabeled data, but it can also be useful for data compression in feature preprocessing steps. We briefly went over the typical roadmap for applying machine learning to problem tasks, which we will use as a foundation for deeper discussions and hands-on examples in the following chapters. Eventually, we set up our Python environment and installed and updated the required packages to get ready to see machine-learning in action. In this chapter, we gained a good understanding of the basic concepts of linear classifiers for supervised learning. After we implemented a perceptron, we saw how we can train adaptive linear neurons efficiently via a vectorized implementation of gradient descent and on-line learning via stochastic gradient descent. Now that we have seen how to implement simple classifiers in Python, we are ready to move on to the next chapter where we will use the Python scikit-learn machine learning library to get access to more advanced and powerful off-the-shelf machine learning classifiers that are commonly used in academia as well as in industry. 1.10 Self Assessment Questions 1. Define machine learning and its type. 2. What do you mean by regression analysis. 3. Explain reinforcement learning. 4. Explain unsupervised learning in detail. 5. How to install and upgrade packages in python? 6. Explain Artificial neurons in terms of machine learning. 7. Write code in Python for perceptron implementation. 8. What is Adaline? 9. Differentiate between: Adaline rule and Rosenblatt's perceptron.

Machine Learning and Classification Algorithms PDF

Document Details

Tags

Related

Summary

Full Transcript