Machine Learning 702AI0C012 Data Exploration PDF
Document Details
Uploaded by ExcitedDarmstadtium156
Mukesh Patel School of Technology Management & Engineering, Mumbai
Summary
This document discusses machine learning, specifically focusing on data exploration and visualization. It covers a range of data types (categorical, nominal, ordinal, quantitative) and illustrates their roles in preparing and interpreting data. The document lays out concepts and techniques for data preparation, including methods for handling missing values and outliers. The text also introduces different types of normalization for data transformation.
Full Transcript
Machine Learning 702AI0C012
Data Exploration, Pre-processing and Visualization (Unit 2)

What will be covered
Missing values treatment. Handling categorical data: mapping ordinal features, encoding class labels, performing one-hot encoding on nominal features. Outlier detection and treatment. Feature engineering: variable transformation and variable creation, selecting meaningful features.

[Figure: ML steps]

Data set
A data set is a collection of related information or records.

[Figure: types of data in ML]

Qualitative Data
Provides information about the quality of an object, i.e. information which cannot be measured. E.g. if we consider the quality of performance of students in terms of ‘Good’, ‘Average’, and ‘Poor’, it falls under the category of qualitative data. Qualitative data is also called categorical data and can be further subdivided into two types:
1. Nominal data
2. Ordinal data

Nominal Data
Has no numeric value, only a named value, and is used for assigning named values to attributes. Nominal values cannot be quantified. A few examples: Blood group: A, B, O, AB, etc. Nationality: Indian, American, British, etc. Gender: Male, Female. Mathematical operations such as addition, subtraction, etc. cannot be performed on nominal data. Basic counting is possible, so the mode (the most frequently occurring value) can be identified.

Ordinal Data
Ordinal data also assigns named values to attributes, but the values can be naturally ordered. They can be arranged in a sequence of increasing or decreasing value, so we can say whether one value is better or greater than another. E.g. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc. Grades: A, B, C, etc. Hardness of metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc. Like nominal data, basic counting is possible for ordinal data. Mode and median can be identified, but the mean still cannot be calculated.

Quantitative Data
Relates to information about the quantity of an object, hence it can be measured. For example, if we consider the attribute ‘marks’, it can be measured using a scale of measurement.
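To make the distinction between the data types introduced so far concrete, here is a minimal Python sketch of which summary statistic is meaningful for each. All sample values, and the names `rank_of` and `label_of`, are invented for illustration; the ordinal mapping is one possible way to encode an ordered category, not a method prescribed by the slides.

```python
from statistics import mean, median_low, mode

# Hypothetical sample values, invented for illustration only.
blood_group = ["A", "B", "O", "AB", "O", "O", "A"]            # nominal
performance = ["Good", "Average", "Poor", "Good",
               "Average", "Good", "Poor"]                      # ordinal
marks = [72, 85, 64, 90, 58]                                   # quantitative

# Nominal: only counting is meaningful, so the mode is the usable summary.
nominal_mode = mode(blood_group)

# Ordinal: map each label to its rank so the values can be sorted,
# then take the median rank and translate it back to a label.
rank_of = {"Poor": 0, "Average": 1, "Good": 2}
label_of = {rank: label for label, rank in rank_of.items()}
ranks = sorted(rank_of[p] for p in performance)
ordinal_median = label_of[median_low(ranks)]

# Quantitative: arithmetic is meaningful, so the mean can be computed.
marks_mean = mean(marks)

print(nominal_mode, ordinal_median, marks_mean)
```

`median_low` is used so that the median is always one of the observed ranks and can be mapped back to a label; averaging two ranks would not correspond to any category.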
Quantitative data is also termed numeric data. There are two types of quantitative data:
1. Interval data
2. Ratio data

Interval Data
Numeric data for which not only the order is known, but the exact difference between values is also known. E.g. the difference between 12°C and 18°C is measurable and is 6°C. Other examples include date, time, etc. Interval data do not have a true zero: 0°C does not mean the absence of temperature. Only addition and subtraction are meaningful, so the central tendency can be measured by mean, median, or mode. Ratios cannot be applied: 40°C is not twice as hot as 20°C.

Ratio Data
Numeric data for which the exact value can be measured. An absolute zero is available for ratio data. These variables can be added, subtracted, multiplied, or divided. The central tendency can be measured by mean, median, or mode, and the spread by the standard deviation. Examples of ratio data include height, weight, age, salary, etc.

[Figure: types of data in ML]

Test your understanding
Favorite candy bar. Weight of luggage. Year of your birth. Burger size (small, medium, large, extra large, jumbo). Military rank. Number of children in a family. Shoe size.

Answers:
Favorite candy bar: Nominal
Weight of luggage: Ratio
Year of your birth: Interval
Burger size (small, medium, large, extra large, jumbo): Ordinal
Military rank: Ordinal
Number of children in a family: Ratio
Shoe size: Interval

Data Attributes based on value
Discrete attributes can assume a finite or countably infinite number of values. Nominal attributes such as roll number, street number, pin code, etc. can have a finite number of values. Numeric attributes such as count, rank of students, etc. can have countably infinite values. A special type of discrete attribute which can assume only two values is called a binary attribute.
Examples of binary attributes include male/female, positive/negative, yes/no, etc. Continuous attributes can take any value which is a real number. Examples of continuous attributes include length, height, price, etc.

Test your understanding
Number of emergency room patients. Blood pressure of a patient. Weight of a patient. Pulse for a patient. Emergency room wait time rounded to the nearest minute. Tumor size.

Answers:
Number of emergency room patients: Discrete
Blood pressure of a patient: Continuous
Weight of a patient: Continuous
Pulse for a patient: Discrete
Emergency room wait time rounded to the nearest minute: Discrete
Tumor size: Continuous

Exploring Structure of Data in ML
First understand which attributes in a data set are numeric and which are categorical, because the approach to exploring numeric data is different from the approach to exploring categorical data. Identify types of data:
Car name: Categorical (nominal)
Cyl, ModYr, Orig: Discrete, finite
Mpg, disp, horse, weight, acclr: Continuous

[Figure: descriptive statistics, exploring structure of data in ML]

Descriptive statistics: measures of central tendency
Mean is the sum of all data values divided by the count of data elements. E.g. the mean of 21, 89, 34, 67, and 96 is 61.4. Median is the value of the element appearing in the middle of an ordered list of data elements. For the above example, the ordered list would be 21, 34, 67, 89, and 96. Since there are 5 data elements, the 3rd element in the ordered list is considered as the median. Hence, the median value of this set of data is 67.

Since the mean is calculated from the cumulative sum of data values, it is impacted if too many data elements have values close to the far end of the range, i.e. close to the maximum or minimum values. Mean is especially sensitive to outliers, i.e. values which are unusually high or low compared to the other values.
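The worked example above can be checked in a few lines of Python, and the same sketch shows how one extreme value moves the mean far more than the median. The extra value 960 is a hypothetical outlier added only for this illustration; it is not part of the slide data.

```python
from statistics import mean, median

values = [21, 89, 34, 67, 96]
base_mean = mean(values)        # sum of values divided by their count
base_median = median(values)    # middle element of the sorted list

# Hypothetical outlier (not from the slides) to show the effect on each measure.
with_outlier = values + [960]
shifted_mean = mean(with_outlier)
shifted_median = median(with_outlier)

print(base_mean, base_median)
print(shifted_mean, shifted_median)
```

With the extra value the median moves from 67 to 78, while the mean more than triples, which is why a large gap between mean and median is a useful signal to investigate an attribute.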
Mean is likely to get shifted drastically even due to the presence of a small number of outliers. If we observe that for certain attributes the deviation between the values of mean and median is quite high, we should investigate those attributes further and find the root cause along with the need for remediation.

Weighted Mean
Useful where each outcome has a different probability of occurring. When calculating an arithmetic mean, we assume that all numbers used in the calculation have an equal probability of occurring, i.e. equal weights.

[Figure: example of weighted mean]

Weighted mean example: You take three 100-point exams in your statistics class and score 80, 80 and 95. The last exam is much easier than the first two, so your professor has given it less weight. The weights for the three exams are:
Exam 1: 40% of your grade. (Note: 40% as a decimal is 0.4.)
Exam 2: 40% of your grade.
Exam 3: 20% of your grade.
What is your final weighted average for the class?
1. Multiply the numbers in your data set by the weights: 0.4(80) = 32, 0.4(80) = 32, 0.2(95) = 19.
2. Add the numbers up: 32 + 32 + 19 = 83.

For attributes such as ‘mpg’, ‘weight’, ‘acceleration’, and ‘modelyear’ the deviation between mean and median is not significant, which means these attributes are less likely to have many outlier values. However, the deviation is significant for the attributes ‘cylinders’, ‘displacement’ and ‘origin’. Also, the ‘horsepower’ attribute has missing values.

Mode is the peak of the distribution, i.e. the most common value. E.g. the mode of 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 6, 6, 7, 8, 10, 13 is 3.

Measures of data spread
We take a granular view of the data spread in the form of:
1. Dispersion of data
2. Position of the different data values

Dispersion of data: Consider data values of two attributes:
Attribute 1 values: 44, 46, 48, 45, and 47
Attribute 2 values: 34, 46, 59, 39, and 52
Both sets of values have a mean and median of 46. However, the first set of values (attribute 1) is more concentrated or clustered around the mean/median value, whereas the second set of values (attribute 2) is quite spread out or dispersed.

Variance and standard deviation: a larger value of variance or standard deviation indicates more dispersion in the data, and vice versa. Calculating the variance of the two attributes above shows that attribute 1 values are quite concentrated around the mean while attribute 2 values are extremely spread out.

Range
Let x1, x2, ..., xN be a set of observations for some numeric attribute X. The range of the set is the difference between the largest (max()) and smallest (min()) values.

Quantiles and Quartiles
Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.
The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds to the median.
The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution. They are more commonly referred to as quartiles.
The 100-quantiles are more commonly referred to as percentiles; they divide the data distribution into 100 equal-sized consecutive sets.
Median, quartiles, and percentiles are the most widely used forms of quantiles.
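The weighted-mean and dispersion examples above can be reproduced with a short Python sketch. The slides do not say whether the population or the sample variance is intended; this sketch assumes the population form (`statistics.pvariance` and `statistics.pstdev`), and the helper name `spread` is made up for the example.

```python
from statistics import pstdev, pvariance

# Weighted mean: exam scores and their weights from the example.
scores = [80, 80, 95]
weights = [0.4, 0.4, 0.2]
weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)

attribute_1 = [44, 46, 48, 45, 47]
attribute_2 = [34, 46, 59, 39, 52]


def spread(values):
    """Return range, population variance and population standard deviation."""
    return {
        "range": max(values) - min(values),
        "variance": pvariance(values),   # assumption: population variance
        "std_dev": pstdev(values),
    }


spread_1 = spread(attribute_1)
spread_2 = spread(attribute_2)

print(weighted_mean)
print(spread_1)
print(spread_2)
```

Both attributes share the mean 46, yet attribute 2 has a range of 25 against 4 for attribute 1, and a much larger variance, which is the concentration-versus-dispersion contrast described in the text.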
Quantiles and Quartiles
Exercise: determine the quartiles of the list 1, 3, 3, 4, 5, 6, 6, 7, 8, 8.

The quartiles give an indication of a distribution’s center, spread, and shape. The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. The third quartile, denoted by Q3, is the 75th percentile; it cuts off the lowest 75% (or highest 25%) of the data. The second quartile is the 50th percentile. As the median, it gives the center of the data distribution.

Interquartile range
The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined as IQR = Q3 - Q1.

Example: suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. Quartiles are the three values that split the sorted data set into four equal parts. With 12 values, the median is the mean of the 6th and 7th values, (52 + 56)/2 = 54; Q1 is the median of the lower six values, (47 + 50)/2 = 48.5; and Q3 is the median of the upper six values, (63 + 70)/2 = 66.5. Therefore IQR = 66.5 - 48.5 = 18.

Interquartile range with an even sample size: for the sample (n = 10) the median diastolic blood pressure is 71 (50% of the values are above 71, and 50% are below). There are 5 values below the median (lower half); their middle value is 64, which is the first quartile. There are 5 values above the median (upper half); their middle value is 77, which is the third quartile. The interquartile range is 77 - 64 = 13; it is the range of the middle 50% of the data.

Interquartile range with an odd sample size: when the sample size is odd, the median and quartiles are determined in the same way.
When the sample size is 9, the median is the middle number, 72. The quartiles are determined in the same way, looking at the lower and upper halves respectively. There are 4 values in the lower half, so the first quartile is the mean of the 2 middle values in the lower half ((64 + 64)/2 = 64). The same approach is used in the upper half to determine the third quartile ((77 + 81)/2 = 79).

[Figure from blog.dailydoseofds.com]

Data Visualization: Box Plot
A box plot (also known as a box and whisker plot) is a type of chart often used in data analysis to visually show the distribution and skewness of numerical data by displaying the data quartiles and averages. A boxplot gives a good indication of how the values in the data are spread out. Boxplots are a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).

Five number summary, Boxplots, and Outliers
Minimum score: the lowest score, excluding outliers (shown at the end of the left whisker).
Lower quartile: twenty-five percent of scores fall below the lower quartile value (also known as the first quartile).
Median: the median marks the mid-point of the data and is shown by the line that divides the box into two parts (sometimes known as the second quartile). Half the scores are greater than or equal to this value and half are less.
Upper quartile: seventy-five percent of the scores fall below the upper quartile value (also known as the third quartile). Thus, 25% of the data are above this value.
Maximum score: the highest score, excluding outliers (shown at the end of the right whisker).
Whiskers: the upper and lower whiskers represent scores outside the middle 50% (i.e. the lower 25% of scores and the upper 25% of scores).
Interquartile range (IQR): this is the box itself, showing the middle 50% of scores (i.e. the range between the 25th and 75th percentiles).
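The quartile and IQR calculations in the examples above follow a median-of-halves approach: split the sorted data at the median and take the median of each half, leaving out the middle value when the sample size is odd. Below is a minimal Python sketch of that approach, applied to the salary data. The function name `quartiles` is made up for this example, and the 1.5 × IQR cut-off for flagging outliers is the usual box-plot convention rather than something derived here. Library routines often use interpolation-based definitions and may return slightly different quartiles.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower = data[:half]      # values before the median position
    upper = data[-half:]     # values after it (middle value skipped if n is odd)
    return median(lower), median(data), median(upper)


salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q2, q3 = quartiles(salary)
iqr = q3 - q1

# Conventional box-plot limits: points beyond them are drawn as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in salary if v < lower_limit or v > upper_limit]

print(q1, q2, q3, iqr)
print(lower_limit, upper_limit, outliers)
```

For the salary data this gives Q1 = 48.5, median = 54, Q3 = 66.5 and IQR = 18, and the value 110 falls above the upper limit of 93.5, so a box plot would draw it as an individual point beyond the whisker.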
[Figure: understanding the box plot]

Why are box plots useful?
Box plots divide the data into sections that each contain approximately 25% of the data in that set. Box plots are useful as they provide a visual summary of the data, enabling researchers to quickly identify the median, the dispersion of the data set, and signs of skewness.
Box plots are also useful as they show outliers within a data set: values lying more than 1.5 times the interquartile range above the upper quartile or below the lower quartile (i.e. below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR).

[Figure: box plots and signs of skewness in the data]

How to compare box plots
Compare the respective medians of each box plot. If the median line of a box plot lies outside of the box of a comparison box plot, then there is likely to be a difference between the two groups.

Finding five number summary
A sample of 10 boxes of raisins has these weights (in grams): 25, 28, 29, 29, 30, 34, 35, 35, 37, 38. Make a box plot of the data.
Step 1: Order the data from smallest to largest. The data is already in ascending order.
Step 2: Find the median. The median is the mean of the middle two numbers: (30 + 34)/2 = 32.
Step 5: Complete the five number summary.
1. Min: 25
2. Max: 38
3. Median: 32
4. Q1: 29
5. Q3: 35
Step 6: Make a box plot from the five number summary.

[Figure: interpreting quartiles]

Draw box plot for given data set
The following dollar amounts were the hourly collections from a Salvation Army kettle at a local store one day in December: $19, $26, $25, $37, $32, $28, $22, $23, $29, $34, $39, and $31. Construct the box-and-whisker plot for the amount collected.
Solution: The five-number summary is Minimum = 19, Q1 = 24, Median = 28.5, Q3 = 33, Maximum = 39.

Self-Check
Suppose that the box-and-whisker plots below represent quiz scores out of 25 points for Quiz 1 and Quiz 2 for the same class. What do these box-and-whisker plots show about how the class did on test #2 compared to test #1?
These box-and-whisker plots show that the lowest score, highest score, and Q3 are all the same for both exams, so performance on the two exams was quite similar. However, the movement of Q1 up from a score of 6 to a score of 9 indicates that there was an overall improvement. On the first test, approximately 75% of the students scored at or above a score of 6. On the second test, the same proportion of students (75%) scored at or above a score of 9.

Outlier example
data1 = [2, 43, 49, 50, 51, 51, 53, 54, 60, 62, 63]
Median = 51
Q1 = 49
Q3 = 60
IQR = Q3 - Q1 = 60 - 49 = 11
1.5 * IQR = 16.5
Q1 - 1.5 * IQR = 32.5
Q3 + 1.5 * IQR = 76.5
The value 2 lies below 32.5, so it is an outlier.

Box plot for the unit price data for items sold at four branches of AllElectronics during a given time period
For branch 1, we see that the median price of items sold is $80, Q1 is $60, and Q3 is $100. Notice that two outlying observations for this branch were plotted individually, as their values of 175 and 202 lie more than 1.5 times the IQR (here 40) above Q3.

Handling Missing Values
[Figures: handling missing values]

Data Cleaning: handling missing values
1. Ignore the tuple
2. Fill in the missing values manually
3. Use a global constant to fill in the missing value
4. Use a measure of central tendency for the attribute (e.g., mean or median)
5. Use the attribute mean or median for all samples belonging to the same class
6. Use the most probable value to fill in the missing value

1. Ignore the tuple: Usually done when the class label is missing (assuming the mining task involves classification).
This method is not very effective, unless the tuple contains several attributes with missing values. By ignoring the tuple, we do not make use of the remaining attributes’ values in the tuple. Such data could have been useful to the task at hand.
2. Fill in the missing values manually: This is time consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown” or -∞. If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of “Unknown.”
4. Use a measure of central tendency for the attribute (e.g., mean or median): Replace the missing value with a measure of central tendency, which indicates the middle value of the data distribution. For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median.
5. Use the attribute mean or median for all samples belonging to the same class: For example, if classifying customers according to credit risk, we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple. If the data distribution for a given class is skewed, the median value is a better choice.
6. Use the most probable value to fill in the missing value (the most popular strategy): This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.

Normalization
Min-max normalization to [new_min_A, new_max_A]:
$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$
Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
$v' = \frac{v - \mu_A}{\sigma_A}$
Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to
$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$

Z-score
Any data point whose Z-score lies more than 3 standard deviations from the mean is treated as an outlier. Loop through all the data points and compute the Z-score using the formula (Xi - mean)/std. Define a threshold value of 3 and mark the data points whose absolute Z-score is greater than the threshold as outliers.

Treatment of outliers
Main methods of dealing with outliers:
1) Trimming: remove the outliers from the dataset.
2) Reducing the weights of outliers (trimming weight).
3) Mean/median imputation: replace the values of outliers with the mean or median value. Since the mean is highly influenced by outliers, it is advised to replace the outliers with the median value.
4) Log transform: the easiest transformation relies on taking the logarithm of the variable of interest. The log “squeezes” large values more, so that skewed distributions become more symmetrical and closer to a Normal distribution.
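The normalization formulas and the Z-score outlier rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are invented for the example, the population standard deviation (`statistics.pstdev`) is assumed because the slides do not specify which form is meant, the data must not be constant (otherwise the standard deviation is zero), and the `sample` list is hypothetical data made up to contain one obvious outlier.

```python
from statistics import mean, median, pstdev


def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mu, sigma):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mu) / sigma


def replace_outliers_with_median(values, threshold=3.0):
    """Replace points whose |Z-score| exceeds the threshold by the median."""
    mu = mean(values)
    sigma = pstdev(values)   # assumption: population standard deviation
    med = median(values)
    return [med if abs(z_score(v, mu, sigma)) > threshold else v
            for v in values]


# Worked examples from the normalization slide.
income_scaled = min_max(73_600, 12_000, 98_000)
income_z = z_score(73_600, 54_000, 16_000)

# Hypothetical data: twenty ordinary values plus one extreme value.
sample = list(range(40, 60)) + [500]
cleaned = replace_outliers_with_median(sample)

print(round(income_scaled, 3), income_z)
print(cleaned[-1])
```

Median imputation is used here rather than the mean because, as noted above, the mean is itself pulled towards the outlier. On very small samples the Z-score rule may flag nothing at all, since a single extreme value also inflates the standard deviation it is measured against.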