Module 1 Descriptive Statistics Handout 1 PDF

Module 1: Descriptive Statistics
Handout 1

Ch 1. Organizing and Displaying Data

HW: 1, 3, 5, 7, 47, 48, 53, 61 plus handout
Read: Chapter 1: pages 7-14 (1.0 – 1.2.3) and 26 – 38 (1.6.0 – 1.6.6)

Example 1: Continuous Data: Earthquake magnitudes: 57 observations

Histograms

Since the data set for a continuous variable may contain many distinct values, a table or plot of all the values provides little information. A histogram is a common way to display continuous data. Usually we construct histograms using relative frequencies, but you will see histograms based on percentage or actual cell frequencies.

To construct a histogram when the data set is continuous, we must
1. Find the range of the data: find the difference between the smallest and largest measurements.
2. Divide the range into class intervals
   a. that do not overlap,
   b. that are equal in length,
   c. so that most intervals will contain at least 5 measurements (this number may vary).
3. Make a frequency and relative frequency table, and form the relative frequency histogram. Round to two decimal places.

Data:
6.1  6.1  6.1  6.1  6.1  6.2  6.2  6.2  6.2  6.3  6.3
6.3  6.4  6.4  6.4  6.4  6.4  6.4  6.4  6.5  6.5  6.5
6.5  6.5  6.5  6.6  6.6  6.6  6.7  6.7  6.7  6.8  6.8
6.8  6.8  6.8  6.9  6.9  7    7    7    7.1  7.1  7.1
7.2  7.2  7.2  7.2  7.2  7.3  7.3  7.3  7.3  7.4  7.8
7.8  7.9

Class Interval    Frequency    Relative Frequency
6.01 – 6.30       12           12/57 = 0.21
6.31 – 6.60
6.61 – 6.90
6.91 – 7.20
7.21 – 7.50
7.51 – 7.80
7.81 – 8.10
Total             57

What percentage of earthquakes were between 6.01 and 6.6?

What percentage of earthquakes were greater than 6.9?

What percentage of earthquakes were less than 7.21?

[Figure: relative frequency histogram titled "2014 Earthquake Magnitudes"; vertical axis from 0.00 to 0.30; horizontal axis showing the seven class intervals from 6.01-6.30, i.e. (6.0, 6.3], through 7.81-8.10, i.e. (7.8, 8.1].]

Example 2: Categorical Data

Recorded here are the blood types of 40 persons who have volunteered to donate blood at a plasma center.
Summarize the data in a frequency table and include the relative frequencies. Create a frequency histogram. Round to two decimal places.

Data:
O  O  A   B  A  O  A  A  A  O
B  O  B   O  O  A  O  O  A  A
A  A  AB  A  B  A  A  O  O  A
O  O  A   A  A  O  A  O  O  AB

Blood Type    Frequency    Relative frequency
A
AB
B
O
Totals                     1

[Figure: bar chart titled "Count of Blood Type"; vertical axis from 0 to 20; categories A, AB, B, O.]

What would a relative frequency histogram look like?

Example 3. Stem and Leaf Display: A useful way of quickly displaying data, which can be used instead of a histogram.
1. List the first two digits, 12 – 17, in a column and draw a vertical line. These digits correspond to the leading digits in each row.
2. For each observation, record the last digit to the right of the vertical line in the row where its first digits appear.
3. Finally, arrange the last digits in increasing order.

The following data represent the scores of 14 students on a college qualification test.

157  162  152  155  136  140  133
154  166  125  163  155  144  176

12 |                12 |
13 |                13 |
14 |                14 |
15 |                15 |
16 |                16 |
17 |                17 |

Stop here and try Homework 1.

ANALYZING DATA: MEASURES OF CENTER AND MEASURES OF VARIATION

Measures of Center: Central values about which the measurements are distributed, such as the mean, mode, and median.

Mean: "the average value"

$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$, where $x_1, x_2, \ldots, x_n$ represent the $n$ observed values.

Median: the value "in the middle" or the average of the two values "in the middle".

NOTE: to find the median, we must arrange our data values in order: smallest to largest.

Which value to choose? Here is a simple rule: Suppose you have n observations. Divide this number by two. If the answer is a whole number, you will need to use the average of this observation and the next. If the answer is not a whole number, ROUND UP and that observation is the median. (Always round up.)

Find the median of the Example 3 data:

Mode: The value or values that occur most often. A data set can have no mode, or it can have multiple modes.
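The median rule and the definition of the mode above can be sketched in Python. This is only an illustration and not part of the original handout: the function names `median_by_rule` and `modes` are hypothetical, and returning an empty list when no value repeats is one possible reading of "no mode". The scores are the Example 3 data.

```python
from collections import Counter


def median_by_rule(values):
    """Median by the handout's rule: divide n by two, then average or round up."""
    ordered = sorted(values)              # the rule requires ordered data
    n = len(ordered)
    if n % 2 == 0:
        k = n // 2                        # n/2 is a whole number
        # average the k-th observation and the next one (1-based positions)
        return (ordered[k - 1] + ordered[k]) / 2
    k = n // 2 + 1                        # n/2 is not whole: round up
    return ordered[k - 1]


def modes(values):
    """Every value that occurs most often; empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                         # assumption: treat this as "no mode"
    return sorted(v for v, c in counts.items() if c == top)


# Example 3 data: scores of 14 students
scores = [157, 162, 152, 155, 136, 140, 133, 154, 166, 125, 163, 155, 144, 176]
print("median:", median_by_rule(scores))
print("mode(s):", modes(scores))
```

Working the rule by hand first and then running the sketch is a quick way to check an answer.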
Find the mode of the Example 3 data:

Measures of Variation: a measure of the extent of variation around the center
   Range
   Variance
   Standard Deviation
   Interquartile Range

When we analyze data, it is useful to know how much variation there is, that is, how much the numbers in the sample vary from the mean. Sometimes there is very little variation and sometimes there is a great deal. Consider the table below. The mean, or average, for each row is 4 and they all have the same number of entries. What makes these rows so different?

[Table: Rows I, II, and III]

RANGE: the difference between the largest number and the smallest number in a sample is called the range.

Example 1. Find the range in each row above.

Row I:
Row II:
Row III:

It is more useful to know how far our data deviate from the mean. In the table above, in row two, there is no deviation from the mean. In row three, there is a great deal of deviation from the mean.

Deviations from the mean: the differences found by subtracting the mean from each number in a sample.

Example 2. A class had test scores of 72, 84, 96, 64, 88, 92, 74, and 78. Find the deviations from the mean.

Solution. First, we must calculate the mean:

$\bar{x} = \dfrac{72 + 84 + 96 + 64 + 88 + 92 + 74 + 78}{8} = 81.$

Next we subtract the mean from each of these numbers to see how much it deviates from the mean. Ideally, we would like to know the average deviation from the mean. However, note that

$\sum_{i=1}^{n} (x_i - \bar{x}) = 0.$

We must eliminate the signs associated with the deviations from the mean, and we do this by squaring them and calculating the SAMPLE STANDARD DEVIATION.

$x_i$     $(x_i - \bar{x})$
72
84
96
64
88
92
74
78
Total

SAMPLE STANDARD DEVIATION AND SAMPLE VARIANCE

The sample standard deviation, $s$, is calculated by squaring each of the deviations from the mean, adding them up and dividing by $n - 1$, where $n$ is the number of observations. Finally, we take the square root.
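The recipe just described (square the deviations, add them up, divide by n − 1, take the square root) can be traced step by step in Python. This is a sketch for checking hand calculations, not part of the original handout; the variable names are illustrative, and the cross-check uses `statistics.stdev` from the Python standard library, which also divides by n − 1.

```python
import math
import statistics

scores = [72, 84, 96, 64, 88, 92, 74, 78]    # test scores from Example 2

n = len(scores)
mean = sum(scores) / n                        # sample mean
deviations = [x - mean for x in scores]       # x_i minus the mean
squared = [d ** 2 for d in deviations]        # squaring removes the signs

variance = sum(squared) / (n - 1)             # divide by n - 1, not n
s = math.sqrt(variance)                       # finally, take the square root

print("mean:", mean)
print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum(squared))
print("sample variance:", round(variance, 3))
print("sample standard deviation:", round(s, 3))

# Cross-check against the standard library's sample standard deviation
print("matches statistics.stdev:", math.isclose(s, statistics.stdev(scores)))
```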
Sample Standard Deviation: the sample standard deviation, $s$, of $n$ numbers with mean $\bar{x}$ is given by the formula:

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

Note that this will never be negative (it is zero only when every observation equals the mean) and that we divide by $n - 1$ and not $n$.

Example 3. We will use the data from Example 2 and calculate $s$ by hand.

$x_i$     $(x_i - \bar{x})$     $(x_i - \bar{x})^2$
72        -9
84         3
96        15
64       -17
88         7
92        11
74        -7
78        -3
Totals     0

Sample variance, denoted $s^2$, is simply the square of the sample standard deviation. The advantage of the standard deviation is that it will be in the same units as your original observations.

$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$   or   $s^2 = \dfrac{1}{n - 1}\left[\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}\right]$

The second formula is sometimes more convenient to use.

Interquartile Range: Box Plots

Example 4. A zoologist collected thirty wild lizards and put them on a treadmill (!). The recorded speed (meters/second) is based on each lizard's fastest time to run a half meter.

Find the sample median, first quartile, and the third quartile. Round to 3 decimal places. What is the interquartile range? What does the interquartile range represent? Label the box plot.

 n    Original Speed    Sorted Speed
 1    1.28              0.5
 2    1.56              0.76
 3    2.57              1.02
 4    1.04              1.04
 5    1.36              1.2
 6    2.66              1.24
 7    1.72              1.28
 8    1.92              1.29
 9    1.24              1.36
10    2.17              1.49
11    0.76              1.55
12    1.55              1.56
13    2.47              1.57
14    1.57              1.57
15    1.02              1.63
16    1.78              1.7
17    1.94              1.72
18    2.1               1.78
19    1.78              1.78
20    1.7               1.92
21    2.52              1.94
22    2.54              2.1
23    0.5               2.11
24    1.2               2.17
25    2.67              2.47
26    1.63              2.52
27    1.49              2.54
28    1.29              2.57
29    2.11              2.66
30    1.57              2.67

Calculating the Sample 100 p-th Percentile
1. Order the data from smallest to largest.
2. Calculate the product: (sample size) × (proportion) = $np$.
   If $np$ is not a whole number, round it up to the next whole number and find the corresponding data value.
   If $np$ is a whole number, $k$, calculate the average of the $k$th and $(k + 1)$st ordered values.

a) Find the sample 90th percentile. Round to three decimal places.
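The two-step percentile procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the original handout: the function name `percentile` is hypothetical, and the percentile is passed as a whole number from 1 to 99 so that np can be computed in integers, avoiding floating-point round-off when deciding whether np is whole. Note that statistical software often uses other interpolation rules, so library results may differ slightly from this rule.

```python
def percentile(values, percent):
    """Sample percentile by the handout's rule; percent is a whole number, 1 to 99."""
    if not 0 < percent < 100:
        raise ValueError("percent must be between 1 and 99")
    ordered = sorted(values)                   # step 1: order the data
    n = len(ordered)
    # step 2: np = n * percent / 100, kept as quotient and remainder
    whole, remainder = divmod(n * percent, 100)
    if remainder:
        # np is not a whole number: round up to position whole + 1 (1-based)
        return ordered[whole]
    # np is a whole number k: average the k-th and (k + 1)-st ordered values
    return (ordered[whole - 1] + ordered[whole]) / 2


# Lizard speeds (meters/second) from Example 4, in the original order
speeds = [1.28, 1.56, 2.57, 1.04, 1.36, 2.66, 1.72, 1.92, 1.24, 2.17,
          0.76, 1.55, 2.47, 1.57, 1.02, 1.78, 1.94, 2.1, 1.78, 1.7,
          2.52, 2.54, 0.5, 1.2, 2.67, 1.63, 1.49, 1.29, 2.11, 1.57]

q1 = percentile(speeds, 25)
q2 = percentile(speeds, 50)
q3 = percentile(speeds, 75)
print("Q1, median, Q3:", q1, round(q2, 3), q3)
print("IQR:", round(q3 - q1, 3))
print("90th percentile:", round(percentile(speeds, 90), 3))
```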
Sample Quartiles
   Lower (first) quartile: $Q_1$ = 25th percentile
   Second quartile (or median): $Q_2$ = 50th percentile = MEDIAN
   Upper (third) quartile: $Q_3$ = 75th percentile

Proportion = Relative Frequency. Note $0 \le p \le 1$, just like probability.

What does the value of the sample standard deviation tell us?

A large value of $s$ tells us that our observations are spread out away from the mean, and a small value of $s$ tells us that the data are more clustered around the mean.

It is often useful to know how many standard deviations a data point is from the mean. We define z-scores, or standardized scores, as the distance from $\bar{x}$ per standard deviation:

$z = \dfrac{x_i - \bar{x}}{s}$ for each observation $x_i$.

Example 5. HW Speedy Lizards again! Given that $\bar{x} = 1.72$ and $s = 0.573$, calculate the z-scores for a lizard that ran at a speed of 2.11 m/s and for one that ran at a speed of 1.7 m/s.

Example 6. What percentage of the observations from the lizard data fall
a. within one standard deviation of the mean?
b. within two standard deviations of the mean?

OUTLIERS: An outlier is an observation that appears extreme relative to the rest of the data. An observation or measurement is flagged as an outlier if it falls above $Q_3 + 1.5 \times \text{IQR}$ or below $Q_1 - 1.5 \times \text{IQR}$. These are represented by the "dots" in the boxplots below.

Example 7. Match the histograms to the box plots.

Robust Statistics: The median $Q_2$ and the IQR are called robust statistics because they are less affected by outliers (extreme observations in your data set).

Example 1. Consider these 13 test scores and notice how the sample mean and the sample standard deviation are affected by the outliers in the second set of scores.

Data Set 1: $\bar{x} = 75.7$, $s = 8.68$, Median: $Q_2 = 75$, IQR = 15

x1   x2   x3   x4   x5   x6   x7   x8   x9   x10  x11  x12  x13
60   65   68   70   72   72   75   78   83   85   85   85   86

Data Set 2: $\bar{x} = 67.2$, $s = 27.3$, Median: $Q_2 = 75$, IQR = 15

x1   x2   x3   x4   x5   x6   x7   x8   x9   x10  x11  x12  x13
5    10   68   70   72   72   75   78   83   85   85   85   86

Example 2. Given the data in the table below, find the quartiles and IQR.
 n    weight
 1    108
 2    124
 3    136
 4    140
 5    141
 6    143
 7    148
 8    153
 9    158
10    160
11    168
12    169
13    171
14    179
15    181
16    193
17    199
18    203
19    206
20    213
21    216
22    217
23    222
24    226
25    227
26    229
27    230

Calculating the Sample 100 p-th Percentile
1. Order the data from smallest to largest.
2. Calculate the product: (sample size) × (proportion) = $np$.
   If $np$ is not a whole number, round it up to the next whole number and find the corresponding data value.
   If $np$ is a whole number, $k$, calculate the average of the $k$th and $(k + 1)$st ordered values.

Sample Quartiles
   Lower (first) quartile: $Q_1$ = 25th percentile
   Second quartile (or median): $Q_2$ = 50th percentile
   Upper (third) quartile: $Q_3$ = 75th percentile
   IQR = $Q_3 - Q_1$

NOTE: Proportion = Relative Frequency. Note $0 \le p \le 1$, just like probability.

Q1 =
Q2 =
Q3 =
IQR =

Example 3. The sample mean and sample standard deviation for the given data set are $\bar{x} = 180$ and $s = 36.3$.
1. What percentage of the measurements fall within one standard deviation of the mean?
2. What percentage of the measurements fall within two standard deviations of the mean?

Example 4. Answer the questions about the box plot.
1. What are the quartiles? Q1 =      Q2 =      Q3 =
2. What is the IQR?
3. What percentage of students scored lower than 70?
4. What percentage of students scored below 82?
5. What is the range of the data?

Example 5. Suppose all three boxplots below are on the same scale, 0 – 100. Answer the following questions.
a. Which group had the lowest median score? ____________
b. Which group had the largest IQR (interquartile range)? __________
c. Which group had the smallest IQR? ___________
d. Which group has the most outliers? ______
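For questions of the form "what percentage of the measurements fall within k standard deviations of the mean" (Example 3 above, using the weight data), one possible check is to count the observations whose z-score is at most k in absolute value. The sketch below is an illustration, not part of the original handout: the helper names `z_score` and `count_within` are hypothetical, and the mean and standard deviation are the values stated in Example 3 rather than recomputed.

```python
def z_score(x, mean, s):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / s


def count_within(values, mean, s, k):
    """Count the observations whose z-score is between -k and k inclusive."""
    return sum(1 for x in values if abs(z_score(x, mean, s)) <= k)


# Weight data from Example 2 (27 observations, already sorted)
weights = [108, 124, 136, 140, 141, 143, 148, 153, 158, 160, 168, 169, 171,
           179, 181, 193, 199, 203, 206, 213, 216, 217, 222, 226, 227, 229, 230]

mean, s = 180, 36.3        # values given in Example 3

for k in (1, 2):
    inside = count_within(weights, mean, s, k)
    share = 100 * inside / len(weights)
    print(f"within {k} standard deviation(s): {inside} of {len(weights)} ({share:.1f}%)")
```

The same two helpers apply to the lizard data in the earlier z-score examples by swapping in that data set's mean and standard deviation.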
