Summary Measures PDF
Summary
This document covers summary measures for describing data numerically: measures of central tendency, variation, and relative standing, together with their uses and properties.
Full Transcript
Summary Measures
Measures of Central Tendency, Variation and Relative Standing

Summary Measures
Describing data numerically:
Center and location: mean, median, mode
Measures of variation: range, interquartile range, variance, standard deviation, coefficient of variation
Relative standing: percentiles, quartiles

Measures of Central Tendency
In addition to describing the shape of a distribution, we want to describe the data set's central tendency.
A measure of central tendency represents the center or middle of the data.
The measures of central tendency are the mean, the median, and the mode.

Arithmetic Mean
The mean is the arithmetic average of the data values: the sum of the scores divided by the number of scores (what is generally thought of as the average).

Mean (Arithmetic Average)
Sample mean (n = sample size):
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
Population mean (N = population size):
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

Arithmetic Mean
The most common measure of central tendency.
Mean = sum of values divided by the number of values.
Affected by extreme values (outliers).
Example: for the values 1, 2, 3, 4, 5 the mean is (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3. Replacing 5 with 10 raises the mean to (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4.

Median
In an ordered array, the median is the "middle" number (50% above, 50% below).
The median is not affected by extreme values.
[Figure: two dot plots on a 0 to 10 axis, each with Median = 3]

Finding the Median
The location of the median:
$$\text{Median position} = \frac{n+1}{2} \text{ position in the ordered array}$$
If the number of values is odd, the median is the middle number.
If the number of values is even, the median is the average of the two middle numbers.
Note that (n+1)/2 is not the value of the median, only the position of the median in the ranked data.

Mode
A measure of central tendency.
Value that occurs most often.
Not affected by extreme values.
Mainly used for grouped numerical data or categorical data.
There may be no mode.
There may be several modes.
[Figure: two dot plots, one labeled "Mode = 9" and one labeled "No Mode"]

Which measure is the "best"?
The mean is generally used, unless extreme values (outliers) exist.
Then the median is often used, since the median is not sensitive to extreme values.
For a relatively small number of extreme observations (either very small or very large, but not both), the median is usually better.
Choosing:
The mode is meaningful on a nominal scale.
The median is meaningful on an ordinal scale.
The mean is meaningful on an interval/ratio scale.

Shape of a Distribution
Describes how data are distributed: symmetric or skewed.
If the distribution is symmetric, then mean = median.
If the distribution is skewed to the right, then typically mode < median < mean.
If the distribution is skewed to the left, then typically mode > median > mean.

Exercise
The following data represent the ages of 20 randomly selected managers:
43 44 49 37 45 35 46 32 47 42 39 40 41 45 41 43 50 47 41 51
a) Find the mean, median and mode for the above data.
b) Which measure would you choose to describe the data? Why?
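The definitions of mean, median and mode above can be tried out with a short Python sketch. It uses the two five-value data sets from the arithmetic mean example; the helper names `median_position` and `modes` are illustrative choices, and treating "all values equally frequent" as "no mode" is an assumption about how the slides define that case.

```python
from collections import Counter
from statistics import mean, median

values = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]   # same data with 5 replaced by an extreme value

# The mean moves with the outlier, the median does not.
print(mean(values), median(values))              # 3 3
print(mean(with_outlier), median(with_outlier))  # 4 3


def median_position(n):
    """Position (1-based) of the median in the ordered array: (n + 1) / 2."""
    return (n + 1) / 2


def modes(data):
    """Return every most-frequent value, or [] when there is no mode.

    Assumption: if several distinct values all occur equally often,
    the data set is reported as having no mode.
    """
    counts = Counter(data)
    if not counts:
        return []
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(median_position(5))        # 3.0
print(modes([1, 2, 2, 3, 3]))    # [2, 3]
print(modes([1, 2, 3]))          # []
```

A non-integer position such as 10.5 (for n = 20) means the median is the average of the 10th and 11th ordered values.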
Measures of Variability
Range, variance, standard deviation, coefficient of variation.

Measures of Variation
Knowing the measures of center is not enough: two distributions can have identical measures of central tendency and still differ in variation.
[Figure: two distributions with identical measures of central tendency]

Range
Simplest measure of variation.
Difference between the largest and the smallest observations:
Range = maximum - minimum
Example: Range = 14 - 1 = 13
[Figure: dot plot on a 0 to 14 axis]

Disadvantages of the Range
Ignores the way in which data are distributed:
[Figure: two dot plots on a 7 to 12 axis, each with Range = 12 - 7 = 5]
Sensitive to outliers:
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119

Variance
Average of squared deviations of values from the mean.
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
where μ = population mean, N = population size, X_i = ith value of the variable X.
Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
where \bar{X} = arithmetic mean, n = sample size, X_i = ith value of the variable X.

Standard Deviation
Most commonly used measure of variation.
The square root of the variance.
Shows variation about the mean.
Has the same units as the original data.
Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$

Example: Sample Standard Deviation
Sample data (X_i): 10 12 14 15 17 18 18 24
n = 8, mean = \bar{X} = 16
$$S = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} \approx 4.3095$$
The squared deviations are 36, 16, 4, 1, 1, 4, 4 and 64, which sum to 130.

Comparing Standard Deviations
Data A: mean = 15.5, S = 3.338
Data B: mean = 15.5, S = 0.9258
Data C: mean = 15.5, S = 4.57
[Figure: three dot plots on an 11 to 21 axis, one per data set]

Coefficient of Variation
Measures relative variation.
Always a percentage (%).
Shows variation relative to the mean.
Is used to compare two or more sets of data measured in different units.
$$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$$
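As a check on the formulas above, here is a minimal Python sketch of the sample variance, sample standard deviation and coefficient of variation, applied to the eight sample values from the standard deviation example. The function names are illustrative. For this data the squared deviations from the mean of 16 sum to 130, so the sketch yields the square root of 130/7.

```python
import math


def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)


def population_variance(data):
    """Same sum of squared deviations, divided by N instead of n - 1."""
    n = len(data)
    mu = sum(data) / n
    return sum((x - mu) ** 2 for x in data) / n


sample = [10, 12, 14, 15, 17, 18, 18, 24]
x_bar = sum(sample) / len(sample)   # 16.0

s2 = sample_variance(sample)        # 130 / 7
s = math.sqrt(s2)                   # sample standard deviation
cv = s / x_bar * 100                # coefficient of variation, in percent

print(round(s, 4))    # 4.3095
print(round(cv, 2))   # 26.93
```

Dividing by n - 1 rather than n is what distinguishes the sample formula from the population formula; on the same eight values the population variance is 130/8 = 16.25.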
Comparing Coefficients of Variation
Stock A:
Average price last year = $50
Standard deviation = $5
$$CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$$
Stock B:
Average price last year = $100
Standard deviation = $5
$$CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$$
Both stocks have the same standard deviation, but stock B is less variable relative to its price.

The Empirical Rule
If the data distribution is bell-shaped, then the interval:
a) $(\mu - \sigma, \mu + \sigma)$ contains about 68.26% of the values in the population.
b) $(\mu - 2\sigma, \mu + 2\sigma)$ contains about 95.44% of the values in the population.
c) $(\mu - 3\sigma, \mu + 3\sigma)$ contains about 99.74% of the values in the population.

Example
IQs measured on the Stanford Revision of the Binet–Simon Intelligence Scale have a mean of 100 points and a standard deviation of 16 points. The interval:
a) (84, 116) contains about 68.26% of the IQ scores.
b) (68, 132) contains about 95.44% of the IQ scores.
c) (52, 148) contains about 99.74% of the IQ scores.
The scores of 25 randomly selected people are shown below.
66 82 86 88 91 95 96 96 97 98 101 102 102 104 105 106 111 112 115 116 118 121 124 127 129
a) 18 scores (72%) fall in the interval (84, 116).
b) 24 scores (96%) fall in the interval (68, 132).
c) 25 scores (100%) fall in the interval (52, 148).

Exercise
The exam scores for the students in an introductory statistics course are as follows.
88 67 64 76 86 85 82 39 75 34 90 63 89 90 84 81 96 100 70 96
a) Compute the descriptive statistics for the given exam scores.
b) Apply the empirical rule and check the consistency with the sample results. Explain your conclusion.
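The empirical-rule comparison in the IQ example can be reproduced with a short Python sketch. The function name `empirical_rule_check` is illustrative, and counting values that sit exactly on an interval endpoint as inside is an assumption; it is needed to arrive at 18 scores in the first interval, because one score equals 116.

```python
def empirical_rule_check(scores, mu, sigma):
    """Count the scores within 1, 2 and 3 standard deviations of the mean.

    Returns {k: (low, high, count, percent)} for k = 1, 2, 3.
    Endpoints are counted as inside the interval (an assumption).
    """
    results = {}
    for k in (1, 2, 3):
        low, high = mu - k * sigma, mu + k * sigma
        count = sum(low <= x <= high for x in scores)
        results[k] = (low, high, count, 100 * count / len(scores))
    return results


iq_scores = [66, 82, 86, 88, 91, 95, 96, 96, 97, 98, 101, 102, 102,
             104, 105, 106, 111, 112, 115, 116, 118, 121, 124, 127, 129]

results = empirical_rule_check(iq_scores, mu=100, sigma=16)
for k, (low, high, count, percent) in results.items():
    print(f"({low}, {high}): {count} scores ({percent:.0f}%)")
# (84, 116): 18 scores (72%)
# (68, 132): 24 scores (96%)
# (52, 148): 25 scores (100%)
```

The same function can be applied to other data by passing the sample mean and sample standard deviation in place of the population values used here.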
Measures of Relative Standing
Percentiles and quartiles.
The pth percentile in a data set (where 0 ≤ p ≤ 100):
p% of the values are less than or equal to this value
(100 - p)% of the values are greater than or equal to this value
Quartiles:
1st quartile = 25th percentile
2nd quartile = 50th percentile = median
3rd quartile = 75th percentile

Percentiles
The pth percentile in an ordered array of n values is the value in the ith position, where
$$i = \frac{p}{100}(n + 1)$$
Example: The 60th percentile in an ordered array of 19 values is the value in the 12th position:
$$i = \frac{p}{100}(n + 1) = \frac{60}{100}(19 + 1) = 12$$
In Excel, write =percentile(array, k), where array is the range of data and k is the percentile value in the range 0-1.

Quartiles
Quartiles split the ranked data into 4 equal groups of 25% each, separated by Q1, Q2 and Q3.
Example: Find the first quartile.
Sample data in ordered array: 11 12 13 16 16 17 18 21 22 (n = 9)
Q1 = 25th percentile, so find the (25/100)(9 + 1) = 2.5 position. Use the value halfway between the 2nd and 3rd values, so Q1 = 12.5.

Interquartile Range and Fences
The interquartile range is the difference between the third and first quartiles:
IQR = Q3 - Q1
Inner fences, located 1.5 IQR away from the quartiles:
Q1 - (1.5 IQR)
Q3 + (1.5 IQR)
Outer fences, located 3 IQR away from the quartiles:
Q1 - (3 IQR)
Q3 + (3 IQR)

Outliers
Outliers are measurements that are very different from other measurements.
They are either much larger or much smaller than most of the other measurements.
Outliers lie beyond the fences of the box-and-whiskers plot.
Measurements between the inner and outer fences are mild outliers.
Measurements beyond the outer fences are severe outliers.
The adjacent values are:
The smallest data point that falls above the lower fence.
The largest data point that falls below the upper fence.

Box and Whisker Plot (Boxplot)
A graphical display of data using the 5-number summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum
The box plots the:
First quartile (Q1), median (Md), third quartile (Q3).
Inner fences, outer fences.
The "whiskers" are dashed lines that plot the range of the data:
A dashed line drawn from the box below Q1 down to the minimum.
Another dashed line drawn from the box above Q3 up to the maximum.

Distribution shapes and boxplots
[Figure: distribution shapes and the corresponding boxplots]

How to construct a Boxplot?
1. Determine the quartiles.
2. Determine the potential outliers and the adjacent values.
3. Draw a horizontal axis on which the numbers obtained in Steps 1 and 2 can be located. Above this axis, mark the quartiles and the adjacent values with vertical lines.
4. Connect the quartiles to each other to make a box, and then connect the box to the adjacent values with lines.
5. Plot each potential outlier with an asterisk.

Example: Box-and-Whiskers Plots
[Figure: example box-and-whiskers plots]

Example
A sample of 20 people yielded the following weekly TV viewing times, in hours:
25 41 27 32 43 66 35 31 15 5 34 26 32 38 16 30 38 30 20 21
The five-number summary is 5, 24, 30.5, 35.75, 66.
These quartiles match the interpolation used by Excel's =percentile function; the (n + 1) position rule given under Percentiles would instead give Q1 = 22 and Q3 = 37.25.
IQR = 35.75 - 24 = 11.75
1.5*IQR = 1.5*11.75 = 17.625
Lower fence = Q1 - 1.5*IQR = 24 - 17.625 = 6.375
Upper fence = Q3 + 1.5*IQR = 35.75 + 17.625 = 53.375
The observations 5 and 66 lie beyond the inner fences and hence should be classified as outliers.
The adjacent values are 15 and 43.

Exercise
IQs measured on the Stanford Revision of the Binet–Simon Intelligence Scale. The scores of 25 randomly selected people are shown below.
66 82 86 88 91 95 96 96 97 98 101 102 102 104 105 106 111 112 115 116 118 121 124 127 129
Identify potential outliers, if any, and construct and interpret a boxplot.
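The numbers in the TV viewing example can be reproduced with the following Python sketch, which covers Steps 1 and 2 of the boxplot construction. The function names are illustrative. The quartile rule used here, linear interpolation at position (n - 1)p/100 counted from zero, is an assumption chosen because it reproduces the stated quartiles of 24 and 35.75; other quartile conventions give slightly different fences.

```python
def percentile_inclusive(sorted_values, p):
    """pth percentile by linear interpolation between neighbouring values.

    Assumed convention: 0-based fractional position (n - 1) * p / 100.
    """
    pos = (len(sorted_values) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


def boxplot_summary(data):
    """Five-number summary, inner fences, outliers and adjacent values."""
    v = sorted(data)
    q1, med, q3 = (percentile_inclusive(v, p) for p in (25, 50, 75))
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # inner fences
    inside = [x for x in v if lower <= x <= upper]
    return {
        "five_number": (v[0], q1, med, q3, v[-1]),
        "iqr": iqr,
        "inner_fences": (lower, upper),
        "outliers": [x for x in v if x < lower or x > upper],
        "adjacent": (inside[0], inside[-1]),
    }


tv_hours = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
            34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

summary = boxplot_summary(tv_hours)
print(summary["five_number"])    # (5, 24.0, 30.5, 35.75, 66)
print(summary["inner_fences"])   # (6.375, 53.375)
print(summary["outliers"])       # [5, 66]
print(summary["adjacent"])       # (15, 43)
```

The outer fences follow the same pattern with 3 in place of 1.5, which separates mild outliers from severe ones.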