Machine Learning PDF

Machine Learning

How Do Humans Learn?
• Learning under expert guidance
  - A baby learns things from his parents (this is a hand, this is water, this is food, the sky is blue, ...).
  - When the baby starts going to school, teachers teach him the alphabet, digits, sciences, ...
  - In every phase of a human being's life, learning is gained from someone who has experience in that field.
• Learning based on knowledge gained from experts
  - A baby can group together all objects of the same color even if his parents have not taught him to do so. He can do so because at some point his parents told him which color is blue, which is red, which is green, etc.
  - In postgraduate study, researchers solve problems that have never been solved before, based on their previous knowledge.
• Learning by self
  - A baby learning to walk bumps into obstacles and falls down multiple times until he learns that whenever there is an obstacle, he needs to cross over it.
  - Not everything is taught by others; many things can only be learnt from past attempts. We tend to form a checklist of things that we should do and things that we should not do, based on our experiences.

What is Machine Learning?
‘A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.’ [Tom Mitchell]

Types of Machine Learning
• Machine learning can be classified into three broad categories:
• Supervised learning algorithms are trained with labeled data, i.e. data composed of examples of the desired answers.
• Unsupervised learning algorithms are trained on data with no labels, and the goal is to find relationships in the data.
FIG. 1.3 Types of machine learning

Reinforcement Learning
• Reinforcement learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as shown in Figure 1-12). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
Figure 1-12. Reinforcement Learning
• Examples of reinforcement learning: self-driving cars, intelligent robots, etc. For example, many robots implement reinforcement learning algorithms to learn how to walk. DeepMind's AlphaGo program is also a good example: it made the headlines in May 2017 when it beat the world champion Ke Jie at the game of Go. It learned its winning policy by analyzing millions of games.

• Some examples of supervised learning are:
  - Predicting the results of a football game
  - Predicting whether a tumor is malignant or benign
  - Predicting the price of houses
  - Classifying emails as spam or non-spam
• When we are trying to predict the category of a new data sample, the problem is known as classification. Some typical classification problems include:
  - Image classification
  - Prediction of disease
  - Win–loss prediction of games
  - Prediction of natural calamities such as earthquakes, floods, etc.
  - Recognition of handwriting
  - Recognition of car plate numbers
• When we are instead trying to predict a real value for a new data sample, the problem is known as regression. Some typical regression problems include:
  - Prediction of demand in retail
  - Sales prediction for managers
  - Price prediction in real estate
  - Weather forecasting
  - Skill demand forecasting in the job market
• Clustering is the main type of unsupervised learning.
  It groups, or clusters, similar objects within the data together, so that objects belonging to the same cluster are quite similar to each other while objects belonging to different clusters are quite dissimilar. Some typical clustering problems include:
  - Grouping the crimes in Yemen based on age, education, region, ...
  - Grouping the customers of a streaming service

Types of Data in Machine Learning
• A data set is a collection of related information or records.
FIG. 2.2 Examples of data set
• A row or record represents a point in the four-dimensional data space, as each row has specific values for each of the four attributes or features. The value of an attribute, quite understandably, may vary from record to record. For example, if we refer to the first two records in the Student data set, the values of the attributes Name, Gender, and Age are different (Fig. 2.3).
FIG. 2.3 Data set records and attributes
• Data can broadly be divided into the following two types:
  - Qualitative data
  - Quantitative data
• Qualitative data is information about the quality of an object, which cannot be measured.
  - For example, the name or roll number of a student. Likewise, the performance of students expressed as ‘Good’, ‘Average’ or ‘Poor’ is information that cannot be measured using a scale of measurement.
• Qualitative data can be further subdivided into two types:
  - Nominal data
  - Ordinal data
• Nominal data has no numeric value, but a named value. Examples of nominal data are:
  - Blood group: A, B, O, AB, etc.
  - Nationality: Yemeni, Indian, American, British, etc.
  - Gender: Male, Female
• Mathematical operations (addition, subtraction, multiplication, etc.) cannot be performed on nominal data. For that reason, statistical functions such as mean and variance cannot be applied to nominal data either. However, a basic count is possible, so the mode can be identified for nominal data.
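As a small illustration of the point about nominal data, the sketch below counts how often each category occurs and reads off the mode. The blood-group sample and the function name `nominal_summary` are made up for illustration; they do not come from the slides.

```python
from collections import Counter


def nominal_summary(values):
    """Return (counts, modes) for a list of nominal (named) values.

    Only counting is used, because arithmetic such as a mean is not
    defined for nominal data.
    """
    counts = Counter(values)
    if not counts:
        return counts, []
    highest = max(counts.values())
    # There can be more than one mode if several categories tie.
    modes = sorted(value for value, count in counts.items() if count == highest)
    return counts, modes


# Hypothetical sample of blood groups (not from the source).
blood_groups = ["A", "B", "O", "AB", "O", "A", "O"]
counts, modes = nominal_summary(blood_groups)
print(dict(counts))
print(modes)
```

The function returns a list of modes rather than a single value, since two categories can be equally frequent.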
• Ordinal data has no numeric value but can be naturally ordered (we can say whether one value is better than or greater than another).
  - Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
  - Grades: ‘Excellent’, ‘Very Good’, ‘Good’, ‘Poor’ and ‘Fail’
  - Hardness of metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.
• As with nominal data, the mode can be identified. Since ordering is possible for ordinal data, the median and quartiles can also be identified. Mean and variance still cannot be calculated.
• Quantitative data is information about the quantity of an object, which can be measured (and is therefore ordered).
  - For example, the attribute ‘marks’ can be measured using a scale of measurement.
• There are two types of quantitative data:
  - Interval data
  - Ratio data
• Interval data is quantitative data for which the exact difference between values is known, but which has no ‘absolute zero’ (meaningful zero point).
  - For example, temperature in degrees Celsius is interval data. The difference between 12°C and 18°C is 6°C, the same as the difference between 15.5°C and 21.5°C.
  - A temperature of 0°C does not mean that there is no temperature (or no heat at all); it just means the temperature is 10 degrees less than 10°C.
  - Other examples include date, time, etc.
• For interval data, mathematical operations such as addition and subtraction are possible. For that reason, mean, median, mode, standard deviation, etc. can be measured.
• Ratio data is quantitative data for which the exact difference between values is known and which also has an ‘absolute zero’.
• For ratio data, mathematical operations such as addition and subtraction, as well as multiplication and division, are possible. Mean, median, mode and standard deviation can be measured.
  - For example, attributes such as height, weight, age, salary, etc. are ratio data.
FIG. 2.4 Types of data
• Apart from the approach detailed above, attributes can also be categorized based on the number of values that can be assigned. On this basis, an attribute is either discrete or continuous.

Exploring Structure of Data
Exploring quantitative data
Understanding central tendency
• Measures of central tendency help to understand the central point of a set of data.
  - Mean: the sum of all data values divided by the count of data elements.
  - Median: the value of the element appearing in the middle of an ordered list of data elements.
• Why review two measures of central tendency? The reason is that mean and median are impacted differently by data values appearing at the beginning or at the end of the range.
• The mean is very sensitive to outliers (values which are unusually high or low compared to the other values).
• If we observe that, for certain attributes, the deviation between the mean and the median is quite high, we should investigate those attributes further.
[Table: Auto MPG data set, mileage per gallon performance of various cars]
FIG. 2.6 Mean vs. Median for Auto MPG
• The deviation is significant for the attributes ‘cylinders’, ‘displacement’ and ‘origin’. So, we need to look at some more statistics for these attributes.
• There is also a problem with the attribute ‘horsepower’, because of which the mean and median cannot be calculated. With a bit of investigation, we can find out that the problem occurs because 6 data elements, as shown in Figure 2.7, do not have a value for the attribute ‘horsepower’.

Understanding data spread
• Now, we have a clear idea of which attributes have a large deviation between mean and median.
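To see how differently the mean and the median react to a single extreme value, consider this minimal sketch. The numbers are invented for illustration and are not taken from the Auto MPG data set; `mean_median_gap` is a hypothetical helper name.

```python
from statistics import mean, median


def mean_median_gap(values):
    """Return (mean, median, mean - median) for a list of numbers."""
    m = mean(values)
    med = median(values)
    return m, med, m - med


# Made-up values: the second list differs only in its last entry.
typical = [21, 22, 23, 24, 25]
with_outlier = [21, 22, 23, 24, 250]

print(mean_median_gap(typical))
print(mean_median_gap(with_outlier))
```

In the second list only the mean moves; the median stays where it was. A large gap between the two is therefore a cheap signal that an attribute deserves a closer look.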
  Let’s look closely at those attributes in terms of:
  - Dispersion of data
  - Position of the different data values

Measuring data dispersion:
• Consider the data values of two attributes:
  - Attribute 1 values: 44, 46, 48, 45, and 47 (mean = 46)
  - Attribute 2 values: 34, 46, 59, 39, and 52 (mean = 46)
  Both have the same mean. However, the values of attribute 1 are more concentrated around the mean, whereas the values of attribute 2 are quite spread out or dispersed.
• To find out how much the different values of a data set are spread out, the variance of the data is measured as
  $\sigma^2 = \dfrac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$
  where $\mu$ is the mean of the data elements and $n$ is their number.
• A larger value of variance indicates more dispersion in the data, and vice versa. For the above example:
  - $\sigma_1^2 = \dfrac{(44-46)^2 + (46-46)^2 + (48-46)^2 + (45-46)^2 + (47-46)^2}{5} = \dfrac{10}{5} = 2$
  - $\sigma_2^2 = \dfrac{(34-46)^2 + (46-46)^2 + (59-46)^2 + (39-46)^2 + (52-46)^2}{5} = \dfrac{398}{5} = 79.6$
• So it is quite clear from this measure that attribute 1 values are concentrated around the mean while attribute 2 values are much more spread out.

Measuring data value position:
• The data values of an attribute are arranged in increasing order and then divided into two halves. Each half is again divided into two halves. This gives the points: minimum, Q1, median (Q2), Q3, maximum.
• We look at the differences between consecutive points (minimum and Q1, Q1 and median, median and Q3, Q3 and maximum). For ‘displacement’, the larger values are more spread out than the smaller ones.
• This helps in understanding why the mean is much higher than the median for the attribute ‘displacement’.
• However, we still cannot ascertain whether there is any outlier present in the data. For that, it is better to visualize the data.

Plotting and exploring numerical data
Box plots:
• A box plot (also called a box and whisker plot) gives a standard visualization of the five-number summary statistics of a data set, namely: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
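The dispersion and position measures above can be sketched in a few lines of Python. This is an illustration under stated assumptions: variance divides by n (the population form), and quartiles use the `inclusive` method of the standard-library `statistics.quantiles`, which is only one of several quartile conventions. The function names are hypothetical.

```python
from statistics import quantiles


def population_variance(values):
    """Mean squared deviation from the mean (divides by n, not n - 1)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Quartile conventions differ between tools; 'inclusive' is an
    assumption made here, not something the slides prescribe.
    """
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


attribute_1 = [44, 46, 48, 45, 47]
attribute_2 = [34, 46, 59, 39, 52]

print(population_variance(attribute_1))
print(population_variance(attribute_2))
print(five_number_summary(attribute_2))
```

For the two example attributes this gives variances of 2 and 79.6, which matches the picture of attribute 2 being far more dispersed around the shared mean of 46.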
[Figure: detailed interpretation of a box plot]
FIG. 2.10 Box plot of Auto MPG attributes

Analysing box plot for ‘cylinders’
• The box plot for attribute ‘cylinders’ looks pretty weird in shape. The upper whisker is missing, the median falls at the bottom of the box, and even the lower whisker is pretty small compared to the length of the box! Is everything right?
• The answer is a big YES, and you can figure it out if you delve a little deeper into the actual data values of the attribute. The attribute ‘cylinders’ is discrete in nature, having values from 3 to 8. Table 2.2 captures its frequency and cumulative frequency.

Histogram:
• A histogram is a graph that shows the frequency of numerical data using rectangles whose area is proportional to the frequency of the data and whose width is equal to the data interval.
• Histograms might be of different shapes depending on the nature of the data.
• The histograms for ‘mpg’ and ‘weight’ are right-skewed. The histogram for ‘acceleration’ is symmetric and unimodal, whereas the one for ‘model.year’ is symmetric and uniform. For the remaining attributes, the histograms are multimodal in nature.

Exploring qualitative data
• There are not many options for exploring qualitative data. We show how many unique values there are for the attribute.
• For example, for attribute ‘car name’:
  1. Chevrolet chevelle malibu
  2. Buick skylark 320
  3. Plymouth satellite
  4. Amc rebel sst
  5. Ford torino
  6. Ford galaxie 500
  7. Chevrolet impala
  8. Plymouth fury iii
  9. Pontiac catalina
  10. Amc ambassador dpl
• We may also look for a little more detail and get a table consisting of the count of the data elements for each category.
Table 2.4 Count of Categories for ‘car name’ Attribute
• In the same way, we may also be interested to know the proportion (or percentage) of the count of data elements belonging to a category. For example, for the attribute ‘cylinders’, the proportion of data elements belonging to the category 4 is 204 ÷ 398 = 0.513, i.e. 51.3%. Tables 2.6 and 2.7 contain the summarization of the categorical attributes by proportion of data elements.
Table 2.6 Proportion of Categories for ‘cylinders’ Attribute

Exploring relationship between attributes
Scatter plot:
• A two-dimensional plot that helps to visualize the relationship between two attributes (variables).
[Figure: scatter plot with an outlier marked]
FIG. 2.13 Scatter plot of ‘displacement’ and ‘mpg’

Data Quality and Data Handling
Data quality:
• The success of machine learning depends largely on the quality of data. Data of the right quality helps to achieve better prediction accuracy.
• We have already come across at least two types of problems:
  - Data elements without values, i.e. data with missing values
  - Data elements having a value very different from the other elements, which we term ‘outliers’

Data Handling:
• The issues in data quality mentioned above need to be handled to achieve the right amount of efficiency.
1) Outliers
• Outliers are data elements with an abnormally high (or low) value which may impact prediction accuracy.
Detecting Outliers:
• There are a number of techniques to detect outliers. We will discuss some of them:
  - Box plot
  - Scatter plot
  - Tukey Fences
  - Z-Score

Tukey Fences Method
• It is based on the interquartile range (IQR), where IQR = Q3 - Q1.
• In Tukey Fences, outliers are values that are:
  - less than Q1 - (1.5 × IQR), or
  - more than Q3 + (1.5 × IQR)

Z-Score Method
• A Z-score indicates how many standard deviations a data point is from the mean. The Z-score has the following formula:
  $Z = \dfrac{x_i - \mu}{\sigma}$
  where $x_i$ is the data point, $\mu$ is the mean of the data set, and $\sigma$ is the standard deviation. A negative Z-score indicates that the data point is below the mean.
• A data point is considered to be an outlier if its Z-score is:
  - greater than 3, or
  - less than -3

Handling outliers
• Once the outliers are identified and the decision has been taken to fix those values, you may consider one of the following approaches:
  - Remove outliers: if the number of records which are outliers is not large, we can simply remove them.
  - Imputation: another way is to replace the outlier value with the mean, median or mode of the attribute's values.
  - Capping: removing outliers may result in the removal of a large number of records from the data set, which is not desirable in some cases. Capping replaces the outliers with a maximum or minimum capped value. With percentile capping, values below the value at the 1st percentile are replaced by the value at the 1st percentile, and values above the value at the 99th percentile are replaced by the value at the 99th percentile. Capping at the 5th and 95th percentiles is also common.

2) Missing Values
• In a data set, one or more data elements may have missing values in multiple records.
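Returning briefly to outlier detection: the Tukey Fences and Z-score rules described earlier can be sketched as follows. This is a minimal illustration, not a reference implementation. It assumes the `inclusive` quartile convention of `statistics.quantiles` and the population standard deviation (`pstdev`); neither choice is specified in the slides, and the sample values are made up.

```python
from statistics import mean, pstdev, quantiles


def tukey_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR (Tukey Fences)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


def zscore_outliers(values, threshold=3.0):
    """Values whose Z-score is greater than threshold or less than -threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [x for x in values if abs((x - mu) / sigma) > threshold]


# Made-up sample: 20 ordinary values and one extreme value.
sample = [48, 49, 50, 51, 52] * 4 + [200]
print(tukey_outliers(sample))
print(zscore_outliers(sample))
```

Note that with very few records the Z-score rule can never fire, because a single point cannot lie more than about (n - 1)/√n population standard deviations from the mean. The Tukey rule has no such limit.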
• There are multiple strategies to handle missing values of data elements. Some of those strategies are:
  - Removing records having a missing value.
  - Imputing records having a missing value: all missing values are imputed with the mean, median, or mode (whichever is applicable) of the remaining values of the same attribute.
  - Estimating missing values: if there are records similar to the ones with missing values, then the attribute values from those similar records can be used in place of the missing value. For example, if the weight of a Russian student aged 12 years with a height of 5 ft. is missing, then the weight of any other Russian student with an age close to 12 years and a height close to 5 ft. can be used.

Features
• What is a feature? A feature is an attribute of a data set that is used in a machine learning process. Consider the Iris data set:
[Figure: the Iris data set]

Feature Engineering
• What is feature engineering? Feature engineering is an important pre-processing step for machine learning. It is the process of transforming a data set into features that represent the data set more effectively and result in better learning performance.
• Feature engineering has two major elements:
  - Feature transformation
  - Feature selection

Feature Transformation
• Feature transformation is the process of generating new features from the existing features.
• There are two types of feature transformation:
  - Feature construction
  - Feature extraction
• Feature construction is the process of creating additional new features from the existing features by discovering missing information about the relationships between features. Hence, feature construction expands the feature space. For example, if there are ‘n’ features in a raw data set, after feature construction ‘m’ more features may get added. So at the end, the data set will have ‘n + m’ features.
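A minimal sketch of feature construction, assuming a hypothetical apartment data set with three features (length, breadth and price, so n = 3). One constructed feature, the area, is added (m = 1), giving n + m = 4 features. The field names and values are invented for illustration.

```python
def add_area_feature(records):
    """Return new records with a constructed 'area' feature.

    Each input record is a dict with 'length' and 'breadth' keys; the
    originals are left unmodified.
    """
    return [{**r, "area": r["length"] * r["breadth"]} for r in records]


# Hypothetical apartments: length and breadth in metres, price in any currency.
apartments = [
    {"length": 10.0, "breadth": 8.0, "price": 120000},
    {"length": 12.0, "breadth": 9.5, "price": 171000},
]

expanded = add_area_feature(apartments)
print(expanded[0])
print(len(apartments[0]), "->", len(expanded[0]), "features")
```

The constructed feature carries no information that was not already in length and breadth; it simply makes the relationship a linear model cares about explicit.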
• For example, suppose a data set has three features: apartment length, apartment breadth, and apartment price. Used as input to a regression problem, such data can be used for training a regression model. However, it is more convenient and makes more sense to use the area of the apartment, which is not an existing feature of the data set. So such a feature, namely apartment area, can be added to the data set. In other words, we transform the three-dimensional data set into a four-dimensional data set, with the newly ‘discovered’ feature apartment area being added to the original data set. This is depicted in Figure 4.2.
FIG. 4.2 Feature construction (example 1)

... space so that, after the transformation, it has a clear dividing margin between classes of data. The transformation, conventionally denoted by the Greek letter phi ($\phi$), is chosen such that the transformed data set is linearly separable. In Figure 5, applying the transformation to every point of the data set on the left yields the linearly separable data set on the right. In Figure 6, note that the hyperplane learned in the transformed space is nonlinear in the original space. Thus, working in a higher-dimensional space improves the expressiveness of the linear SVM.
• Kernels are functions that convert linearly non-separable data into linearly separable data that SVM can handle.
• Suppose we add a third dimension, say the z-axis, and define z to be:
  $z = x^2 + y^2$
• Once we plot the data points after adding the third dimension z, the points are now linearly separable.
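The effect of adding z = x² + y² can be checked numerically. The sketch below builds two made-up classes, one ring of points close to the origin and one ring surrounding it, which no straight line in the (x, y) plane can separate. After lifting the points, a single threshold on z separates them. The radii, point counts and helper names are assumptions chosen for illustration.

```python
import math


def lift(point):
    """Map a 2-D point (x, y) to 3-D by adding z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)


def ring(radius, count):
    """Return `count` points evenly spaced on a circle of the given radius."""
    return [
        (radius * math.cos(2 * math.pi * k / count),
         radius * math.sin(2 * math.pi * k / count))
        for k in range(count)
    ]


inner = ring(1.0, 8)  # class A: made-up points near the origin
outer = ring(3.0, 8)  # class B: made-up points surrounding class A

z_inner = [lift(p)[2] for p in inner]
z_outer = [lift(p)[2] for p in outer]

# Any plane z = c with c between 1 and 9 separates the two lifted classes.
threshold = 5.0
print(max(z_inner) < threshold < min(z_outer))
```

In the lifted space the separating surface is the flat plane z = 5. Projected back to the original plane it is the circle x² + y² = 5, which is why the boundary looks nonlinear there.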

Strengths and Weaknesses of SVM
• Strengths of SVM
  - SVM can be used for both classification and regression.
  - SVM is robust, i.e. not much impacted by data with noise or outliers.
  - The prediction results using SVM are very promising.
• Weaknesses of SVM
  - The SVM model is very complex when it deals with a high-dimensional data set. Hence, it is very difficult, close to impossible, to understand the model in such cases.
  - SVM is slow for a large data set, i.e. a data set with either a large number of features or a large number of instances.
  - SVM is quite memory-intensive.

Supervised Learning: Regression
• We now build concepts on the prediction of numerical variables, which is another key area of supervised learning. This area, known as regression, focuses on solving problems such as predicting the value of real estate, demand forecasting in retail, weather forecasting, etc.
• The most common regression algorithms are:
  - Simple linear regression
  - Multiple linear regression
  - Polynomial regression
  - Logistic regression

Simple Linear Regression
• As the name indicates, simple linear regression is:
  - Simple = only one independent variable (predictor)
  - Linear = assumes a linear relationship between the dependent and independent variables
• For example, suppose we want to find the price of a property as a function of the area of the property. We can build a model using simple linear regression:
  $\text{Price} = a + b \times \text{Area}$
  where $a$ and $b$ are the intercept and slope of the straight line.
• Recall that a straight line can be defined in slope–intercept form, $Y = a + bX$, where $a$ = intercept and $b$ = slope of the straight line. The intercept is the value of $Y$ when $X = 0$. It is known as ‘the intercept’ or ‘Y intercept’ because it specifies where the straight line crosses the Y-axis.
• The slope of a straight line represents how much the line changes in the vertical direction (Y-axis) for a given change in the horizontal direction (X-axis):
  $\text{Slope} = \dfrac{\text{Change in } Y}{\text{Change in } X}$

Least Squares Technique to Estimate the Regression Line
• The least squares technique is used to estimate the line that minimizes the errors (residuals: the differences between the predicted values $\hat{Y}$ on the regression line and the actual values $Y$). This means minimizing the sum of the squares of the errors:
  $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
• Example: a college professor believes that if the grade for the internal examination is high in a class, the grade for the external examination will also be high. A random sample of 15 students in that class was selected, and the data is given below:
[Table: internal and external examination marks of the 15 sampled students]
• A scatter plot was drawn to explore the relationship between the independent variable (internal marks), mapped to the X-axis, and the dependent variable (external marks), mapped to the Y-axis, as depicted in Figure 8.8.
• As we know, in simple linear regression, the line is drawn using the regression formula:
  $Y = a + bX$
• If we know the values of $a$ and $b$, then it is easy to predict the value of $Y$ for any given $X$ by using the above formula. But the question is how to calculate the values of $a$ and $b$ for a given set of $X$ and $Y$ values?
• We are going to use the least squares technique, which minimizes the sum of the squares of the errors (SSE):
  $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
• It is observed that the SSE is least when $b$ takes the value
  $b = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$
• The corresponding value of $a$, calculated using the above value of $b$, is given as:
  $a = \bar{Y} - b\bar{X}$
• So, let us calculate the values of $a$ and $b$ for the given example. For the detailed calculation:
  Step 7: calculate the value of b
  $b = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \dfrac{429.28}{226.93} = 1.89$
  Step 8: calculate a using the value of b
  $a = \bar{Y} - b\bar{X} = 56.8 - 1.89 \times 19.9 = 19.05$
• Hence, for the above example, the estimated regression equation is constructed on the basis of the estimated values of $a$ and $b$:
  $\hat{Y} = 19.05 + 1.89 \times X$
• So, in the context of the given problem, we can say
  Marks in external exam = 19.04 + 1.89 × Marks in internal exam
  or, $M_{Ext} = 19.04 + 1.89 \times M_{Int}$
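The least squares estimates of the slope and intercept can be sketched directly from the two formulas, b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² and a = Ȳ − bX̄. The marks below are invented for illustration and are not the professor's 15-student sample; the function names are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) for the least squares line Y = a + b*X."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept
    return a, b


def sum_squared_errors(xs, ys, a, b):
    """SSE: sum over i of (Y_i - (a + b*X_i))^2."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up internal (X) and external (Y) marks, for illustration only.
internal = [12, 15, 18, 20, 23]
external = [40, 47, 55, 58, 63]

a, b = fit_simple_linear_regression(internal, external)
print(round(a, 2), round(b, 2))
print(round(sum_squared_errors(internal, external, a, b), 2))
```

Any other choice of a and b gives a larger SSE on the same data, which is a useful sanity check when experimenting with the function.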
