Data Description - Chapter 3 - Statistics - PDF
Document Details
Uploaded by HighQualityOrientalism
2000
Summary
This document is Chapter 3 of a statistics textbook, published by The McGraw-Hill Companies, Inc. in 2000, and focuses on data description. It covers measures of central tendency, variation, and position, including the mean, median, mode, and range, as well as exploratory data analysis, and it provides example calculations and applications of these statistical tools.
Full Transcript
Chapter 3 Data Description
© The McGraw-Hill Companies, Inc., 2000

Outline
⚫ 3-1 Introduction
⚫ 3-2 Measures of Central Tendency
⚫ 3-3 Measures of Variation
⚫ 3-4 Measures of Position
⚫ 3-5 Exploratory Data Analysis

Objectives
⚫ Summarize data using the measures of central tendency, such as the mean, median, mode, and midrange.
⚫ Describe data using the measures of variation, such as the range, variance, and standard deviation.
⚫ Identify the position of a data value in a data set using various measures of position, such as percentiles, deciles, and quartiles.
⚫ Use the techniques of exploratory data analysis, including stem and leaf plots, box plots, and five-number summaries, to discover various aspects of data.

3-2 Measures of Central Tendency
⚫ A statistic is a characteristic or measure obtained by using the data values from a sample.
⚫ A parameter is a characteristic or measure obtained by using the data values from a specific population.

3-2 The Mean (arithmetic average)
⚫ The mean is defined to be the sum of the data values divided by the total number of values.
⚫ We will compute two means: one for the sample and one for a finite population of values.
⚫ The mean, in most cases, is not an actual data value.

3-2 The Sample Mean
The symbol \(\bar{X}\) represents the sample mean and is read as "X-bar". The Greek symbol \(\Sigma\) is read as "sigma" and means "to sum".
\[\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum X}{n}\]

3-2 The Sample Mean - Example
The ages in weeks of a random sample of six kittens at an animal shelter are 3, 8, 5, 12, 14, and 12. Find the average age of this sample.
The sample mean is
\[\bar{X} = \frac{\sum X}{n} = \frac{3 + 8 + 5 + 12 + 14 + 12}{6} = \frac{54}{6} = 9 \text{ weeks}.\]

3-2 The Population Mean
The Greek symbol \(\mu\) represents the population mean and is read as "mu". \(N\) is the size of the finite population.
\[\mu = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum X}{N}\]

3-2 The Population Mean - Example
A small company consists of the owner, the manager, the salesperson, and two technicians. Their salaries are 50,000, 20,000, 12,000, 9,000, and 9,000 dollars, respectively. (Assume this is the population.) Then the population mean is
\[\mu = \frac{\sum X}{N} = \frac{50{,}000 + 20{,}000 + 12{,}000 + 9{,}000 + 9{,}000}{5} = 20{,}000 \text{ dollars}.\]

3-2 The Sample Mean for an Ungrouped Frequency Distribution
The mean for an ungrouped frequency distribution is given by
\[\bar{X} = \frac{\sum (f \cdot X)}{n}.\]
Here \(f\) is the frequency for the corresponding value of \(X\), and \(n = \sum f\).

3-2 The Sample Mean for an Ungrouped Frequency Distribution - Example
The scores for 25 students on a 4-point quiz are given in the table. Find the mean score.

| Score, X | Frequency, f | f · X |
|---|---|---|
| 0 | 2 | 0 |
| 1 | 4 | 4 |
| 2 | 12 | 24 |
| 3 | 4 | 12 |
| 4 | 3 | 12 |
| Total | 25 | 52 |

\[\bar{X} = \frac{\sum (f \cdot X)}{n} = \frac{52}{25} = 2.08.\]

3-2 The Sample Mean for a Grouped Frequency Distribution
The mean for a grouped frequency distribution is given by
\[\bar{X} = \frac{\sum (f \cdot X_m)}{n}.\]
Here \(X_m\) is the corresponding class midpoint.
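The ungrouped and grouped formulas above have the same shape, so both can be sketched with one small helper. This is only an illustration: the function name `frequency_mean` is a hypothetical choice, and the data are the quiz scores from the ungrouped example.

```python
def frequency_mean(values, frequencies):
    """Mean of a frequency distribution: sum(f * X) / sum(f)."""
    n = sum(frequencies)  # n is the sum of the frequencies
    if n == 0:
        raise ValueError("frequencies must sum to a positive count")
    return sum(f * x for x, f in zip(values, frequencies)) / n


# Quiz scores (X) and their frequencies (f) from the ungrouped example.
scores = [0, 1, 2, 3, 4]
freqs = [2, 4, 12, 4, 3]

quiz_mean = frequency_mean(scores, freqs)
print(quiz_mean)  # 52 / 25 = 2.08
```

Passing class midpoints \(X_m\) in place of the individual values gives the grouped mean with the same function.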
3-2 The Sample Mean for a Grouped Frequency Distribution - Example
Table with class midpoints, \(X_m\):

| Class | Frequency, f | Midpoint, X_m | f · X_m |
|---|---|---|---|
| 15.5 – 20.5 | 3 | 18 | 54 |
| 20.5 – 25.5 | 5 | 23 | 115 |
| 25.5 – 30.5 | 4 | 28 | 112 |
| 30.5 – 35.5 | 3 | 33 | 99 |
| 35.5 – 40.5 | 2 | 38 | 76 |
| Total | 17 | | 456 |

\(\sum (f \cdot X_m) = 54 + 115 + 112 + 99 + 76 = 456\) and \(n = 17\). So
\[\bar{X} = \frac{\sum (f \cdot X_m)}{n} = \frac{456}{17} = 26.82.\]

3-2 The Median
⚫ When a data set is ordered, it is called a data array.
⚫ The median is defined to be the midpoint of the data array.
⚫ The symbol used to denote the median is MD.

3-2 The Median - Example
⚫ The weights (in pounds) of seven army recruits are 180, 201, 220, 191, 219, 209, and 186. Find the median.
⚫ Arrange the data in order and select the middle point.
⚫ Data array: 180, 186, 191, 201, 209, 219, 220.
⚫ The median, MD = 201.

3-2 The Median
⚫ In the previous example, there was an odd number of values in the data set. In this case it is easy to select the middle number in the data array.
⚫ When there is an even number of values in the data set, the median is obtained by taking the average of the two middle numbers.
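The odd and even cases can be sketched in a few lines. This is a minimal illustration, not the textbook's own code: the function name `median` is chosen here, the odd case uses the recruits' weights, and the even-length list is made up purely to show the averaging step.

```python
def median(data):
    """Middle value of the ordered data, or the mean of the two middle values."""
    array = sorted(data)  # the ordered "data array"
    n = len(array)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return array[mid]  # odd count: single middle value
    return (array[mid - 1] + array[mid]) / 2  # even count: average the two middle values


weights = [180, 201, 220, 191, 219, 209, 186]
md_odd = median(weights)  # 201, the 4th of 7 ordered values

# Hypothetical even-length data, used only to exercise the second branch.
md_even = median([10, 20, 30, 40])  # (20 + 30) / 2 = 25.0
```

Python's standard library function `statistics.median` applies the same rule.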
3-2 The Median - Example
⚫ Six customers purchased the following number of magazines: 1, 7, 3, 2, 3, 4. Find the median.
⚫ Arrange the data in order and compute the middle point.
⚫ Data array: 1, 2, 3, 3, 4, 7.
⚫ The median, MD = (3 + 3)/2 = 3.

3-2 The Median - Example
⚫ The ages of 10 college students are: 18, 24, 20, 35, 19, 23, 26, 23, 19, 20. Find the median.
⚫ Arrange the data in order and compute the middle point.
⚫ Data array: 18, 19, 19, 20, 20, 23, 23, 24, 26, 35.
⚫ The median, MD = (20 + 23)/2 = 21.5.

3-2 The Median - Ungrouped Frequency Distribution
⚫ For an ungrouped frequency distribution, find the median by examining the cumulative frequencies to locate the middle value.
⚫ If n is the sample size, compute n/2. Locate the data point where n/2 values fall below and n/2 values fall above.

3-2 The Median - Ungrouped Frequency Distribution - Example
⚫ LRJ Appliance recorded the number of VCRs sold per week over a one-year period. The data are given below.

| No. Sets Sold | Frequency |
|---|---|
| 1 | 4 |
| 2 | 9 |
| 3 | 6 |
| 4 | 2 |
| 5 | 3 |

⚫ To locate the middle point, divide n by 2; 24/2 = 12.
⚫ Locate the point where 12 values fall below and 12 values fall above.
⚫ Consider the cumulative distribution.
⚫ The 12th and 13th values fall in class 2. Hence MD = 2.

3-2 The Median - Ungrouped Frequency Distribution - Example
| No. Sets Sold | Frequency | Cumulative Frequency |
|---|---|---|
| 1 | 4 | 4 |
| 2 | 9 | 13 |
| 3 | 6 | 19 |
| 4 | 2 | 21 |
| 5 | 3 | 24 |

The class for 2 sets sold contains the 5th through the 13th values.

3-2 The Median for a Grouped Frequency Distribution
The median can be computed from
\[MD = \frac{(n/2) - cf}{f}(w) + L_m\]
where
n = sum of the frequencies
cf = cumulative frequency of the class immediately preceding the median class
f = frequency of the median class
w = width of the median class
L_m = lower boundary of the median class

3-2 The Median for a Grouped Frequency Distribution - Example

| Class | Frequency, f | Cumulative Frequency |
|---|---|---|
| 15.5 – 20.5 | 3 | 3 |
| 20.5 – 25.5 | 5 | 8 |
| 25.5 – 30.5 | 4 | 12 |
| 30.5 – 35.5 | 3 | 15 |
| 35.5 – 40.5 | 2 | 17 |

⚫ To locate the halfway point, divide n by 2; 17/2 = 8.5 ≈ 9.
⚫ Find the class that contains the 9th value. This will be the median class.
⚫ Consider the cumulative distribution.
⚫ The median class will then be 25.5 – 30.5.

With n = 17, cf = 8, f = 4, w = 30.5 – 25.5 = 5, and L_m = 25.5:
\[MD = \frac{(n/2) - cf}{f}(w) + L_m = \frac{(17/2) - 8}{4}(5) + 25.5 = 26.125.\]

3-2 The Mode
⚫ The mode is defined to be the value that occurs most often in a data set.
⚫ A data set can have more than one mode.
⚫ A data set is said to have no mode if all values occur with equal frequency.
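The three cases in this definition (one mode, several modes, no mode) can be sketched as follows. The function name `modes` and the short lists are hypothetical, chosen only to exercise each case; the "no mode" branch encodes the rule stated above and is not a universal convention.

```python
from collections import Counter


def modes(data):
    """Return the sorted list of modes, or [] when every value is equally frequent."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    # Rule from the definition: if all distinct values share one frequency, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)


# Made-up illustrative lists, one per case.
one_mode = modes([1, 2, 2, 3])      # [2]
two_modes = modes([1, 1, 2, 2, 3])  # [1, 2] (bimodal)
no_mode = modes([1, 2, 3])          # []
```

Returning a list keeps the bimodal case explicit, whereas a function that returns a single value would have to pick one mode arbitrarily.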
3-2 The Mode - Examples
⚫ The following data represent the duration (in days) of U.S. space shuttle voyages for the years 1992-94. Find the mode.
⚫ Data set: 8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11.
⚫ Ordered set: 6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10, 11, 11, 14, 14, 14. Mode = 8.

3-2 The Mode - Examples
⚫ Six strains of bacteria were tested to see how long they could remain alive outside their normal environment. The times, in minutes, are given below. Find the mode.
⚫ Data set: 2, 3, 5, 7, 8, 10.
⚫ There is no mode since each data value occurs equally with a frequency of one.

3-2 The Mode - Examples
⚫ Eleven different automobiles were tested at a speed of 15 mph for stopping distances. The distances, in feet, are given below. Find the mode.
⚫ Data set: 15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26.
⚫ There are two modes (bimodal). The values are 18 and 24. Why?

3-2 The Mode for an Ungrouped Frequency Distribution - Example
Given the table below, find the mode.

| Values | Frequency, f |
|---|---|
| 15 | 3 |
| 20 | 5 |
| 25 | 8 |
| 30 | 3 |
| 35 | 2 |

The mode is 25, the value with the largest frequency (8).

3-2 The Mode - Grouped Frequency Distribution
⚫ The mode for grouped data is the modal class.
⚫ The modal class is the class with the largest frequency.
⚫ Sometimes the midpoint of the class is used rather than the boundaries.

3-2 The Mode for a Grouped Frequency Distribution - Example
Given the table below, find the mode.
| Class | Frequency, f |
|---|---|
| 15.5 – 20.5 | 3 |
| 20.5 – 25.5 | 5 |
| 25.5 – 30.5 | 7 |
| 30.5 – 35.5 | 3 |
| 35.5 – 40.5 | 2 |

The modal class is 25.5 – 30.5, which has the largest frequency (7).

3-2 The Midrange
⚫ The midrange is found by adding the lowest and highest values in the data set and dividing by 2.
⚫ The midrange is a rough estimate of the middle value of the data.
⚫ The symbol that is used to represent the midrange is MR.

3-2 The Midrange - Example
⚫ Last winter, the city of Brownsville, Minnesota, reported the following number of water-line breaks per month: 2, 3, 6, 8, 4, 1. Find the midrange. MR = (1 + 8)/2 = 4.5.
⚫ Note: Extreme values influence the midrange, so it may not be a typical description of the middle.

3-2 The Weighted Mean
⚫ The weighted mean is used when the values in a data set are not all equally represented.
⚫ The weighted mean of a variable X is found by multiplying each value by its corresponding weight and dividing the sum of the products by the sum of the weights.

The weighted mean is
\[\bar{X} = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum wX}{\sum w}\]
where \(w_1, w_2, \ldots, w_n\) are the weights for the values \(X_1, X_2, \ldots, X_n\).

Distribution Shapes
⚫ Frequency distributions can assume many shapes.
⚫ The three most important shapes are positively skewed, symmetrical, and negatively skewed.
Positively Skewed
[Figure: positively skewed distribution curve, axes X and Y.] Mode < Median < Mean

Symmetrical
[Figure: symmetrical distribution curve, axes X and Y.] Mean = Median = Mode

Negatively Skewed
[Figure: negatively skewed distribution curve, axes X and Y.] Mean < Median < Mode

3-3 Measures of Variation - Range
⚫ The range is defined to be the highest value minus the lowest value. The symbol R is used for the range.
⚫ R = highest value – lowest value.
⚫ Extremely large or extremely small data values can drastically affect the range.

3-3 Measures of Variation - Population Variance
The variance is the average of the squares of the distance each value is from the mean. The symbol for the population variance is \(\sigma^2\) (\(\sigma\) is the Greek lowercase letter sigma):
\[\sigma^2 = \frac{\sum (X - \mu)^2}{N},\]
where
X = individual value
μ = population mean
N = population size

3-3 Measures of Variation - Population Standard Deviation
The standard deviation is the square root of the variance.
\[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}.\]

3-3 Measures of Variation - Example
⚫ Consider the following data to constitute the population: 10, 60, 50, 30, 40, 20. Find the mean and variance.
⚫ The mean μ = (10 + 60 + 50 + 30 + 40 + 20)/6 = 210/6 = 35.
⚫ The variance σ² = 1750/6 = 291.67. The computations are shown in the table.

| X | X – μ | (X – μ)² |
|---|---|---|
| 10 | –25 | 625 |
| 60 | +25 | 625 |
| 50 | +15 | 225 |
| 30 | –5 | 25 |
| 40 | +5 | 25 |
| 20 | –15 | 225 |
| Total: 210 | | 1750 |

3-3 Measures of Variation - Sample Variance
The unbiased estimator of the population variance, or the sample variance, is a statistic whose value approximates the expected value of a population variance.
It is denoted by \(s^2\), where
\[s^2 = \frac{\sum (X - \bar{X})^2}{n - 1},\]
and
\(\bar{X}\) = sample mean
n = sample size

3-3 Measures of Variation - Sample Standard Deviation
The sample standard deviation is the square root of the sample variance.
\[s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}.\]

3-3 Shortcut Formula for the Sample Variance and the Standard Deviation
\[s^2 = \frac{\sum X^2 - (\sum X)^2 / n}{n - 1}\]
\[s = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n - 1}}\]

3-3 Sample Variance - Example
⚫ Find the variance and standard deviation for the following sample: 16, 19, 15, 15, 14.
⚫ \(\sum X = 16 + 19 + 15 + 15 + 14 = 79\).
⚫ \(\sum X^2 = 16^2 + 19^2 + 15^2 + 15^2 + 14^2 = 1263\).
\[s^2 = \frac{\sum X^2 - (\sum X)^2 / n}{n - 1} = \frac{1263 - (79)^2 / 5}{4} = 3.7\]
\[s = \sqrt{3.7} \approx 1.9.\]

3-3 Coefficient of Variation
⚫ The coefficient of variation is defined to be the standard deviation divided by the mean. The result is expressed as a percentage.
\[CVar = \frac{s}{\bar{X}} \cdot 100\% \quad \text{or} \quad CVar = \frac{\sigma}{\mu} \cdot 100\%.\]

The Empirical (Normal) Rule
⚫ For any bell-shaped distribution:
⚫ Approximately 68% of the data values will fall within one standard deviation of the mean.
⚫ Approximately 95% will fall within two standard deviations of the mean.
⚫ Approximately 99.7% will fall within three standard deviations of the mean.
[Figure: bell-shaped curve illustrating the empirical rule, with the 95% region marked.]

3-4 Measures of Position - z score
⚫ The standard score or z score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation.
⚫ The symbol z is used for the z score.
⚫ The z score represents the number of standard deviations a data value falls above or below the mean.
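As a rough sketch of the definition, the snippet below computes z scores for the five-value sample from the variance example (16, 19, 15, 15, 14). The function name `z_scores` is a hypothetical choice; `statistics.variance` and `statistics.stdev` use the n – 1 denominator, which matches the sample formulas above.

```python
import statistics


def z_scores(sample):
    """z score of each value: (value - sample mean) / sample standard deviation."""
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("z scores are undefined when all values are equal")
    return [(x - mean) / s for x in sample]


sample = [16, 19, 15, 15, 14]
s_squared = statistics.variance(sample)  # about 3.7, as in the worked example
zs = z_scores(sample)

# Values above the mean (15.8) get positive z scores; values below get negative ones.
for value, z in zip(sample, zs):
    print(value, round(z, 2))
```

For a population, `statistics.pstdev` (denominator N) would take the place of `statistics.stdev`.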
For samples:
\[z = \frac{X - \bar{X}}{s}.\]
For populations:
\[z = \frac{X - \mu}{\sigma}.\]

3-4 z score - Example
⚫ A student scored 65 on a statistics exam that had a mean of 50 and a standard deviation of 10. Compute the z score.
⚫ z = (65 – 50)/10 = 1.5.
⚫ That is, the score of 65 is 1.5 standard deviations above the mean.
⚫ It is above the mean because the z score is positive.

3-4 Measures of Position - Percentiles
⚫ Percentiles divide the distribution into 100 groups.
⚫ The \(P_k\) percentile is defined to be that numerical value such that at most k% of the values are smaller than \(P_k\) and at most (100 – k)% are larger than \(P_k\) in an ordered data set.

3-4 Percentile Formula
⚫ The percentile corresponding to a given value (X) is computed by using the formula:
\[\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\%\]

3-4 Percentiles - Example
⚫ A teacher gives a 20-point test to 10 students. Find the percentile rank of a score of 12. Scores: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10.
⚫ Ordered set: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
⚫ Percentile = [(6 + 0.5)/10](100%) = 65th percentile. The student did better than 65% of the class.

3-4 Percentiles - Finding the Value Corresponding to a Given Percentile
⚫ Procedure: Let p be the percentile and n the sample size.
⚫ Step 1: Arrange the data in order.
⚫ Step 2: Compute c = (n · p)/100.
⚫ Step 3: If c is not a whole number, round up to the next whole number. If c is a whole number, use the value halfway between the cth and (c + 1)th values.
⚫ Step 4: The value of c is the position value of the required percentile.
⚫ Example: Find the value of the 25th percentile for the following data set: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
⚫ Note: the data set is already ordered.
⚫ n = 10, p = 25, so c = (10 · 25)/100 = 2.5. Hence round up to c = 3.
⚫ Thus, the value of the 25th percentile is the 3rd value, X = 5.
⚫ Find the 80th percentile.
⚫ c = (10 · 80)/100 = 8. Since c is a whole number, the value of the 80th percentile is the average of the 8th and 9th values: (15 + 18)/2 = 16.5.

3-4 Special Percentiles - Deciles and Quartiles
⚫ Deciles divide the data set into 10 groups.
⚫ Deciles are denoted by D1, D2, …, D9, with the corresponding percentiles being P10, P20, …, P90.
⚫ Quartiles divide the data set into 4 groups.
⚫ Quartiles are denoted by Q1, Q2, and Q3, with the corresponding percentiles being P25, P50, and P75.
⚫ The median is the same as P50 or Q2.

3-4 Outliers and the Interquartile Range (IQR)
⚫ An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.
⚫ The interquartile range, IQR = Q3 – Q1.
⚫ To determine whether a data value can be considered an outlier:
⚫ Step 1: Compute Q1 and Q3.
⚫ Step 2: Find the IQR = Q3 – Q1.
⚫ Step 3: Compute (1.5)(IQR).
⚫ Step 4: Compute Q1 – (1.5)(IQR) and Q3 + (1.5)(IQR).
⚫ Step 5: Compare the data value (say X) with Q1 – (1.5)(IQR) and Q3 + (1.5)(IQR).
⚫ If X < Q1 – (1.5)(IQR) or if X > Q3 + (1.5)(IQR), then X is considered an outlier.
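The outlier steps, together with the percentile position rule used to obtain Q1 = P25 and Q3 = P75, can be sketched as below. The function names `percentile_value` and `is_outlier` are hypothetical, and the sketch assumes p is a whole number strictly between 0 and 100.

```python
def percentile_value(data, p):
    """Value of the p-th percentile via the position rule c = n * p / 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    array = sorted(data)                  # arrange the data in order
    c, remainder = divmod(len(array) * p, 100)
    if remainder:                         # c is not whole: round up to the next position
        return array[c]                   # 1-based position c + 1 is 0-based index c
    return (array[c - 1] + array[c]) / 2  # c is whole: halfway between c-th and (c+1)-th values


def is_outlier(x, data):
    """True if x lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    q1 = percentile_value(data, 25)
    q3 = percentile_value(data, 75)
    iqr = q3 - q1
    return x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr


scores = [2, 3, 5, 6, 8, 10, 12, 15, 18, 20]
p25 = percentile_value(scores, 25)  # c = 2.5, round up to the 3rd value: 5
p80 = percentile_value(scores, 80)  # c = 8, average of 8th and 9th values: 16.5

sample = [5, 6, 12, 13, 15, 18, 22, 50]
flag = is_outlier(50, sample)       # Q1 = 9, Q3 = 20, upper fence 36.5, so True
```

Software that interpolates quartiles differently may produce other values for Q1 and Q3, and therefore other fences, on the same data.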
3-4 Outliers and the Interquartile Range (IQR) - Example
⚫ Given the data set 5, 6, 12, 13, 15, 18, 22, 50, can the value of 50 be considered an outlier?
⚫ Q1 = 9, Q3 = 20, IQR = 11. Verify.
⚫ (1.5)(IQR) = (1.5)(11) = 16.5.
⚫ 9 – 16.5 = –7.5 and 20 + 16.5 = 36.5.
⚫ The value of 50 is outside the range –7.5 to 36.5, hence 50 is an outlier.

3-4 Outliers and the Interquartile Range (IQR) - Example (Excel)
⚫ Given the same data set 5, 6, 12, 13, 15, 18, 22, 50, can the value of 50 be considered an outlier when the quartiles are computed in Excel, which interpolates between data values?
⚫ Q1 = 10.5, Q3 = 19, IQR = 8.5. Verify.
⚫ (1.5)(IQR) = (1.5)(8.5) = 12.75.
⚫ 10.5 – 12.75 = –2.25 and 19 + 12.75 = 31.75.
⚫ The value of 50 is outside the range –2.25 to 31.75, hence 50 is an outlier.

3-5 Exploratory Data Analysis - Stem and Leaf Plot
⚫ A stem and leaf plot is a data plot that uses part of a data value as the stem and part of the data value as the leaf to form groups or classes.

3-5 Exploratory Data Analysis - Stem and Leaf Plot - Example
⚫ At an outpatient testing center, a sample of 20 days showed the following number of cardiograms done each day: 25, 31, 20, 32, 13, 14, 43, 02, 57, 23, 36, 32, 33, 32, 44, 32, 52, 44, 51, 45. Construct a stem and leaf plot for the data.
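One way to build such a plot programmatically is sketched below. It assumes non-negative whole numbers with the tens digit as the stem and the units digit as the leaf; the function name `stem_and_leaf` is a hypothetical choice.

```python
def stem_and_leaf(data):
    """Map each stem (leading digit) to its sorted list of leaves (trailing digits)."""
    if not data:
        return {}
    stems = {}
    for value in sorted(data):
        stem, leaf = divmod(value, 10)  # e.g. 43 -> stem 4, leaf 3
        stems.setdefault(stem, []).append(leaf)
    # Keep stems that have no leaves so gaps in the data stay visible.
    return {stem: stems.get(stem, []) for stem in range(min(stems), max(stems) + 1)}


# Cardiogram counts from the example; the value written as 02 is entered as 2.
cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
               36, 32, 33, 32, 44, 32, 52, 44, 51, 45]

plot = stem_and_leaf(cardiograms)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Sorting first means the leaves on each row come out in increasing order, which is the usual presentation.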
| Leading Digit (Stem) | Trailing Digit (Leaf) |
|---|---|
| 0 | 2 |
| 1 | 3 4 |
| 2 | 0 3 5 |
| 3 | 1 2 2 2 2 3 6 |
| 4 | 3 4 4 5 |
| 5 | 1 2 7 |

Ordered data: 2, 13, 14, 20, 23, 25, 31, 32, 32, 32, 32, 33, 36, 43, 44, 44, 45, 51, 52, 57.

3-5 Exploratory Data Analysis - Box Plot
⚫ When the data set contains a small number of values, a box plot is used to graphically represent the data set. These plots involve five values: the minimum value, the lower hinge, the median, the upper hinge, and the maximum value.
⚫ Identify the minimum value.
⚫ Find Q1.
⚫ Find the median.
⚫ Find Q3.
⚫ Identify the maximum value.

3-5 Exploratory Data Analysis - Box Plot - Example
⚫ At an outpatient testing center, a sample of 20 days showed the following number of cardiograms done each day: 25, 31, 20, 32, 13, 14, 43, 02, 57, 23, 36, 32, 33, 32, 44, 32, 52, 44, 51, 45. Construct a box plot for the data.
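The five values a box plot needs can be sketched with Python's standard library. Note the assumption: this uses the interpolating "inclusive" quartile method of `statistics.quantiles`, which behaves like spreadsheet-style quartiles; other conventions (such as hinges) can give slightly different Q1 and Q3. The function name `five_number_summary` is a hypothetical choice.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using interpolated quartiles."""
    # Assumption: "inclusive" interpolation between data values for the quartiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


# Cardiogram counts from the example; the value written as 02 is entered as 2.
cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
               36, 32, 33, 32, 44, 32, 52, 44, 51, 45]

summary = five_number_summary(cardiograms)
print(summary)  # (2, 24.5, 32.0, 44.0, 57)
```

The box of the plot spans Q1 to Q3 with a line at the median, and the whiskers run out to the minimum and maximum.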
3-5 Exploratory Data Analysis - Box Plot
⚫ Minimum value = 2
⚫ Q1 = 24.5
⚫ Median = 32
⚫ Q3 = 44
⚫ Maximum value = 57
[Figure: box plot of the cardiogram data, marked at 2, 24.5, 32, 44, and 57.]

3-5 Exploratory Data Analysis - Box Plot
⚫ The lower hinge is the median of all values less than or equal to the median when the data set has an odd number of values, or the median of all values less than the median when the data set has an even number of values. The symbol for the lower hinge is LH.
⚫ The upper hinge is the median of all values greater than or equal to the median when the data set has an odd number of values, or the median of all values greater than the median when the data set has an even number of values. The symbol for the upper hinge is UH.

3-5 Exploratory Data Analysis - Box Plot - Example (Cardiograms data)
[Figure: box plot for the cardiograms data.]

Information Obtained from a Box Plot
⚫ If the median is near the center of the box, the distribution is approximately symmetric.
⚫ If the median falls to the left of the center of the box, the distribution is positively skewed.
⚫ If the median falls to the right of the center of the box, the distribution is negatively skewed.
⚫ If the lines are about the same length, the distribution is approximately symmetric.
⚫ If the right line is longer than the left line, the distribution is positively skewed.
⚫ If the left line is longer than the right line, the distribution is negatively skewed.