Organizing and Displaying Data with Solutions SP 22 PDF

Summary

This document provides examples and exercises on organizing and displaying data, including continuous data (like earthquake magnitudes) and categorical data (blood types). It covers creating histograms, calculating relative frequencies, and understanding data distributions.

Full Transcript

Ch 1. Organizing and Displaying Data

Example 1: Continuous Data: Earthquake magnitudes: Histograms

Since the data set for a continuous variable may contain many distinct values, a table or plot of all the values provides little information. A histogram is a common way to display continuous data. Usually we construct histograms using relative frequencies, but you will see histograms based on percentage or actual cell frequencies.

To construct a histogram when the data set is continuous, we must

1. Find the range of the data - find the difference between the smallest and largest measurements
2. Divide the range into class intervals
   a. that do not overlap,
   b. that are equal in length,
   c. so that most intervals will contain at least 5 measurements (this number may vary).
3. Make a frequency and relative frequency table, and form the relative frequency histogram. Round to two decimal places.

Data (57 earthquake magnitudes):

6.1  6.1  6.1  6.1  6.1  6.2  6.2  6.2  6.2  6.3  6.3
6.3  6.4  6.4  6.4  6.4  6.4  6.4  6.4  6.5  6.5  6.5
6.5  6.5  6.5  6.6  6.6  6.6  6.7  6.7  6.7  6.8  6.8
6.8  6.8  6.8  6.9  6.9  7    7    7    7.1  7.1  7.1
7.2  7.2  7.2  7.2  7.2  7.3  7.3  7.3  7.3  7.4  7.8
7.8  7.9

Class Interval    Frequency    Relative Frequency
6.01 – 6.30       12           0.21
6.31 – 6.60       16           0.28
6.61 – 6.90       10           0.18
6.91 – 7.20       11           0.19
7.21 – 7.50       5            0.09
7.51 – 7.80       2            0.04
7.81 – 8.10       1            0.02
Totals            57           1

[Figure: relative frequency histogram "2014 Earthquake Magnitudes"; horizontal axis: class intervals (6.0,6.3] through (7.8,8.1]; vertical axis: relative frequency, 0.00 to 0.30]

What percentage of earthquakes were between 6.01 and 6.6? 12 + 16 = 28 of 57 = 49.1%

What percentage of earthquakes were greater than 6.9? 33.3%

What percentage of earthquakes were less than 7.21? 86%

Note that the histogram is not symmetric – it has a long right tail. We say the distribution is skewed to the right.

Example 2: Categorical Data

Recorded here are the blood types of 40 persons who have volunteered to donate blood at a plasma center.
Summarize the data in a frequency table and include the relative frequencies. Create a frequency histogram. Round to two decimal places.

O   O   A   B   A   O   A   A   A   O
B   O   B   O   O   A   O   O   A   A
A   A   AB  A   B   A   A   O   O   A
O   O   A   A   A   O   A   O   O   AB

Blood Type    Frequency    Relative frequency
A             18           0.45
AB            2            0.05
B             4            0.1
O             16           0.4
Totals        40           1

[Figure: frequency histogram "Count of Blood Type" with bars for A, AB, B, and O; vertical axis: count, 0 to 20]

What would a relative frequency histogram look like? The shape would be identical; only the units would change.

Example 3. Stem and Leaf Display: A useful way of quickly displaying data which can be used instead of a histogram.

1. List the digits 12 – 17 in a column and draw a vertical line. These digits correspond to the leading digits in each row.
2. For each observation, record the next digit to the right of the vertical line in the row where its leading digits appear.
3. Finally, arrange the second digits in increasing order.

The following data represent the scores of 14 students on a college qualification test.

157  162  152  155  136  140  133  154  166  125  163  155  144  176

As recorded          In increasing order
12 | 5               12 | 5
13 | 6 3             13 | 3 6
14 | 0 4             14 | 0 4
15 | 7 4 2 5 5       15 | 2 4 5 5 7
16 | 2 6 3           16 | 2 3 6
17 | 6               17 | 6

ANALYZING DATA: MEASURES OF CENTER AND MEASURES OF VARIATION

Measures of Center: Central values about which the measurements are distributed, such as the mean, mode, and median.

Mean: “the average value”

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n},$$

where $x_1, x_2, \ldots, x_n$ represent the $n$ observed values.

Median: the value “in the middle” or the average of the two values “in the middle”.

NOTE: to find the median, we must arrange our data values in order: smallest to largest. Which value to choose? Here is a simple rule: Suppose you have n observations. Divide this number by two. If the answer is a whole number, you will need to use the average of the observation in this position and the next. If the answer is not a whole number, ROUND UP and the observation in that position is the median. (Always round up.)
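The median rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated; the function name `median_by_rule` is a hypothetical choice, not something from the handout.

```python
import math

def median_by_rule(values):
    """Median via the 'divide n by two' rule described in the text."""
    ordered = sorted(values)              # the rule requires ordered data
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 0:
        k = n // 2                        # whole number: average k-th and (k+1)-th
        return (ordered[k - 1] + ordered[k]) / 2
    k = math.ceil(n / 2)                  # not whole: round up, take the k-th value
    return ordered[k - 1]

# Test scores from Example 3 (14 observations, so positions 7 and 8 are averaged).
scores = [157, 162, 152, 155, 136, 140, 133, 154, 166, 125, 163, 155, 144, 176]
print(median_by_rule(scores))
```

Positions in the rule are counted from 1, while Python lists are indexed from 0, which is why the code subtracts 1 when looking up the k-th ordered value.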
Find the median of the Example 3 data: Since there are 14 observations, and 14 ÷ 2 = 7, we will need the average of the 7th and 8th ordered observations.

$$\text{Median} = \frac{154 + 155}{2} = 154.5$$

Mode: The value or values that occur most often. A data set can have no mode, or it can have multiple modes.

Find the mode of the Example 3 data: The mode is 155 (it occurs twice).

Measures of Variation: a measure of the extent of variation around the center. These include the Range, Variance, Standard Deviation, and Interquartile Range.

When we analyze data, it is useful to know how much variation there is, that is, how much the numbers in the sample vary from the mean. Sometimes there is very little variation and sometimes there is a great deal. Consider the table below. The mean, or average, for each row is 4 and they all have the same number of entries. What makes these rows so different?

RANGE: the difference between the largest number and the smallest number in a sample is called the range.

Example 1. Find the range in each row above.

Row I: 6 – 3 = 3
Row II: 4 – 4 = 0
Row III: 10 – 0 = 10

Deviations from the mean: The deviations from the mean are the differences found by subtracting the mean from each number in a sample.

Example 2. A class had test scores of 72, 84, 96, 64, 88, 92, 74, and 78. Find the deviations from the mean.

Solution. First we must calculate the mean:

$$\bar{x} = \frac{72 + 84 + 96 + 64 + 88 + 92 + 74 + 78}{8} = 81.$$

Next we subtract the mean from each of these numbers to see how much it deviates from the mean.

$x_i$     $(x_i - \bar{x})$
72        -9
84        3
96        15
64        -17
88        7
92        11
74        -7
78        -3
Total     0

Ideally, we would like to know the average deviation from the mean. However, note that $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$. You should verify this. So, we must eliminate the signs associated with the deviations from the mean, and we do this by squaring them and calculating the SAMPLE STANDARD DEVIATION.
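The claim that the deviations from the mean always add up to zero can be checked numerically. The short Python sketch below does this for the eight test scores and also squares the deviations, which is the step the text describes next; the variable names are illustrative only.

```python
# Test scores from Example 2.
scores = [72, 84, 96, 64, 88, 92, 74, 78]

mean = sum(scores) / len(scores)              # 648 / 8
deviations = [x - mean for x in scores]       # each score minus the mean
squared = [d ** 2 for d in deviations]        # squaring removes the signs

print("mean:", mean)
print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum(squared))
```

The positive and negative deviations cancel exactly, so their plain average carries no information about spread, whereas the squared deviations cannot cancel.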
SAMPLE STANDARD DEVIATION AND SAMPLE VARIANCE

The sample standard deviation, $s$, is calculated by squaring each of the deviations from the mean, adding them up and dividing by $n - 1$, where $n$ is the number of observations. Finally, we take the square root. The quantity under the square root is the sample variance, $s^2$.

Sample Standard Deviation: the sample standard deviation, $s$, of $n$ numbers with mean $\bar{x}$ is given by the formula:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Note that this can never be negative. Note also that we divide by $n - 1$ and not $n$.

Example 3.

$x_i$     $(x_i - \bar{x})$     $(x_i - \bar{x})^2$
72        -9                    81
84        3                     9
96        15                    225
64        -17                   289
88        7                     49
92        11                    121
74        -7                    49
78        -3                    9
Totals    0                     832

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = 81 + 9 + 225 + 289 + 49 + 121 + 49 + 9 = 832$$

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{832}{7}} = \sqrt{118.857} = 10.902$$

Interquartile Range: Box Plots

Example 4. A zoologist collected thirty wild lizards and put them on a treadmill (!). The recorded speed (meters/second) is based on each lizard's fastest time to run a half meter.

Find the sample median, first quartile, and the third quartile. Round to 3 decimal places. What is the interquartile range? What does the interquartile range represent? Label the box plot.

n     Original Speed    Sorted Speed
1     1.28              0.5
2     1.56              0.76
3     2.57              1.02
4     1.04              1.04
5     1.36              1.2
6     2.66              1.24
7     1.72              1.28
8     1.92              1.29
9     1.24              1.36
10    2.17              1.49
11    0.76              1.55
12    1.55              1.56
13    2.47              1.57
14    1.57              1.57
15    1.02              1.63
16    1.78              1.7
17    1.94              1.72
18    2.1               1.78
19    1.78              1.78
20    1.7               1.92
21    2.52              1.94
22    2.54              2.1
23    0.5               2.11
24    1.2               2.17
25    2.67              2.47
26    1.63              2.52
27    1.49              2.54
28    1.29              2.57
29    2.11              2.66
30    1.57              2.67

[Figure: box plot of the lizard speeds, labeled from top to bottom: 2.67 m/s, Q3 = 2.11 m/s, Q2 = 1.665 m/s, Q1 = 1.29 m/s, 0.5 m/s]

There are 30 observations, so $Q_2 = \frac{1.63 + 1.7}{2} = 1.665$, and

$Q_1$: $np = 30 \times 0.25 = 7.5$, so we need the 8th observation: $Q_1 = 1.29$, and

$Q_3$: $np = 30 \times 0.75 = 22.5$, so we need the 23rd observation: $Q_3 = 2.11$.

Calculating the Sample 100 p-th Percentile

1. Order the data from smallest to largest.
2. Calculate the product: (sample size) × (proportion) = $np$. If $np$ is not a whole number, round it up to the next whole number and find the corresponding data value. If $np$ is a whole number, $k$, calculate the average of the $k$-th and $(k+1)$-st ordered values.

Sample Quartiles

Lower (first) quartile: $Q_1$ = 25th percentile
Second quartile (or median): $Q_2$ = 50th percentile; $Q_2$ = MEDIAN
Upper (third) quartile: $Q_3$ = 75th percentile

a) Find the sample 90th percentile. Round to three decimal places.

NOTE: Proportion = Relative Frequency. Note $0 \le p \le 1$, just like probability.

What does the value of the sample standard deviation tell us? A large value of $s$ tells us that our observations are spread out away from the mean, and a small value of $s$ tells us that the data are more clustered around the mean.

It is often useful to know how many standard deviations a data point is from the mean. We define z-scores, or standardized scores, as the signed distance from $\bar{x}$ measured in standard deviations:

$$z = \frac{x_i - \bar{x}}{s} \quad \text{for each observation } x_i.$$

Example 5. Speedy Lizards again! Given that $\bar{x} = 1.724$ and $s = 0.573$, calculate the z-scores for a lizard that ran at a speed of 2.11 m/s and for one that ran at a speed of 1.7 m/s.

$$z_1 = \frac{2.11 - 1.724}{0.573} = 0.674 \quad \text{and} \quad z_2 = \frac{1.7 - 1.724}{0.573} = -0.042$$

The first lizard is 0.67 standard deviations above the mean while the second one is 0.04 standard deviations below the mean.

Example 6. What percentage of the observations from the lizard data fall
a. within one standard deviation of the mean?
b. within two standard deviations of the mean?

OUTLIERS: An outlier is an observation that appears extreme relative to the rest of the data. An observation or measurement is flagged as an outlier if it falls above $Q_3 + 1.5 \times IQR$ or below $Q_1 - 1.5 \times IQR$. These are represented by the “dots” in the boxplots below.

Example 7. Match the histograms to the box plots.

Histogram (a) goes with boxplot (2). Notice that it is nearly symmetrical but has a few outliers.
Histogram (b) goes with boxplot (3). This is a uniform distribution.
Histogram (c) goes with boxplot (1). This distribution is skewed to the left, with several outliers there.
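The percentile rule, the outlier fences, and the z-score formula from the lizard examples can be tried together in a short Python sketch. It is an illustration under stated assumptions rather than a prescribed method: the names `percentile_by_rule` and `z_score` are hypothetical, and the sample standard deviation comes from the standard library's `statistics.stdev`, which divides by n − 1 as the text does.

```python
import math
import statistics

# Sorted lizard speeds (m/s) from Example 4.
speeds = [0.5, 0.76, 1.02, 1.04, 1.2, 1.24, 1.28, 1.29, 1.36, 1.49,
          1.55, 1.56, 1.57, 1.57, 1.63, 1.7, 1.72, 1.78, 1.78, 1.92,
          1.94, 2.1, 2.11, 2.17, 2.47, 2.52, 2.54, 2.57, 2.66, 2.67]

def percentile_by_rule(data, p):
    """Sample 100p-th percentile using the np rule from the text."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < p < 1:
        raise ValueError("this sketch assumes 0 < p < 1")
    ordered = sorted(data)
    n = len(ordered)
    np_ = round(n * p, 9)                 # rounding guards against float noise
    k = math.ceil(np_)
    if k == np_ and 1 <= k < n:           # whole number k: average k-th and (k+1)-st
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[max(k, 1) - 1]         # otherwise round up and take that value

def z_score(x, mean, s):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / s

q1 = percentile_by_rule(speeds, 0.25)
q2 = percentile_by_rule(speeds, 0.50)
q3 = percentile_by_rule(speeds, 0.75)
iqr = q3 - q1

# Outlier fences: 1.5 IQRs below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in speeds if x < lower_fence or x > upper_fence]

mean = statistics.mean(speeds)
s = statistics.stdev(speeds)              # sample standard deviation (n - 1)

print("Q1, Q2, Q3:", round(q1, 3), round(q2, 3), round(q3, 3))
print("IQR:", round(iqr, 3))
print("fences:", round(lower_fence, 3), round(upper_fence, 3))
print("outliers:", outliers)
print("mean, s:", round(mean, 3), round(s, 3))
print("z for 2.11:", round(z_score(2.11, 1.724, 0.573), 3))
print("z for 1.7:", round(z_score(1.7, 1.724, 0.573), 3))
```

The z-scores are computed with the rounded mean and standard deviation given in Example 5 so that they can be compared directly with the worked values. Statistical packages often use interpolation-based quantile definitions, so their quartiles may differ slightly from those produced by the np rule.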
