Chapter 2 Descriptive Statistics
This document is a chapter on descriptive statistics, focusing on variables, data types, and their characteristics. It discusses topics like qualitative and quantitative variables, including discrete and continuous data.
Variables and Data

A variable is a characteristic that changes or varies over time and/or for different individuals or objects under consideration.
Examples: hair color, temperature, account balance, number of students present in class.

Definitions
– An experimental unit is the individual or object on which a variable is measured.
– A measurement is the actual value you obtain when a variable is observed or recorded for an experimental unit.
– A set of measurements, called data, can be either a sample or a population.

Example
– Variable: hair color
– Experimental unit: person
– Typical measurements: brown, black, blonde, etc.

Example
– Variable: time until a light bulb burns out
– Experimental unit: light bulb
– Typical measurements: 1500 hours, 1535.5 hours, etc.

How many variables have you measured?
– Univariate data: One variable is measured on a single experimental unit.
– Bivariate data: Two variables are measured on a single experimental unit.
– Multivariate data: More than two variables are measured on a single experimental unit.

Identify the type of data in each example:
– Exam scores and study hours of students.
– Height of students in a classroom.
– Demographic information (age, gender, income) of households.

Types of Variables
[Diagram: a variable is either qualitative or quantitative; a quantitative variable is either discrete or continuous]

Qualitative variables measure a quality or characteristic on each experimental unit.
Examples:
– Hair color (black, brown, blonde, ...)
– Make of car (Dodge, Honda, Ford, ...)
– Gender (male, female)
– State of birth (California, Arizona, ...)

Quantitative variables measure a numerical quantity on each experimental unit.
✓ Discrete variables can only assume certain values, and there are usually "gaps" between values. E.g., number of students in a class, number of cars in a parking lot.
✓ Continuous variables can assume any value within a specified range. E.g., height, weight, time.

Examples
For each orange tree in a grove, the number of oranges is measured.
– Quantitative discrete
For a particular day, the number of cars entering a college campus is measured.
– Quantitative discrete
Time until a light bulb burns out.
– Quantitative continuous

Discussion
Identify the type of each variable:
– The most frequent use of your microwave oven functions
– The number of consumers who refuse to answer the telephone survey
– The door chosen by a mouse in a maze experiment
– The winning time for a horse running in the Kentucky Derby
– The number of children in a fifth-grade class who are reading at or above grade level

Graphing Qualitative Variables
Use a data distribution to describe:
– What values of the variable have been measured
– How often each value has occurred
"How often" can be measured 3 ways:
– Frequency
– Relative frequency = Frequency/n
– Percent = 100 x Relative frequency

Example
A bag of M&Ms contains 25 candies.
Raw data: [Image: the 25 candies]
Statistical table:

Color     Tally     Frequency   Relative Frequency   Percent
Red       mmm       3           3/25 = .12           12%
Blue      mmmmmm    6           6/25 = .24           24%
Green     mmmm      4           4/25 = .16           16%
Orange    mmmmm     5           5/25 = .20           20%
Brown     mmm       3           3/25 = .12           12%
Yellow    mmmm      4           4/25 = .16           16%

Graphs
[Bar chart: frequency (0 to 6) on the vertical axis, color on the horizontal axis]
[Pie chart: Blue 24.0%, Orange 20.0%, Green 16.0%, Yellow 16.0%, Red 12.0%, Brown 12.0%]

Graphing Quantitative Variables
A single quantitative variable measured for different population segments or for different categories of classification can be graphed using a pie or bar chart.
Example: A Big Mac hamburger costs $4.90 in Switzerland, $2.90 in the U.S. and $1.86 in South Africa.
[Bar chart: cost of a Big Mac ($) by country]

A single quantitative variable measured over time is called a time series. It can be graphed using a line or bar chart.
CPI: All Urban Consumers, Seasonally Adjusted

Month   Sept     Oct      Nov      Dec      Jan      Feb      Mar
CPI     178.10   177.60   177.50   177.30   177.60   178.00   178.60

Dotplots
– The simplest graph for quantitative data.
– Plots the measurements as points on a horizontal axis, stacking the points that duplicate existing points.
Example: The set 4, 5, 5, 7, 6
[Dotplot: horizontal axis from 4 to 7, one point each at 4, 6 and 7, two stacked points at 5]

Stem and Leaf Plots
A simple graph for quantitative data that uses the actual numerical values of each data point.
– Divide each measurement into two parts: the stem and the leaf.
– List the stems in a column, with a vertical line to their right.
– For each measurement, record the leaf portion in the same row as its matching stem.
– Order the leaves from lowest to highest in each stem.
– Provide a key to your coding.

Example
The prices ($) of 18 brands of walking shoes:
90 70 70 70 75 70 65 68 60 74 70 95 75 70 68 65 40 65

Leaves in the order recorded:
4 | 0
5 |
6 | 5 8 0 8 5 5
7 | 0 0 0 5 0 4 0 5 0
8 |
9 | 0 5

Reordered:
4 | 0
5 |
6 | 0 5 5 5 8 8
7 | 0 0 0 0 0 0 4 5 5
8 |
9 | 0 5

Key: 4 | 0 means $40.

Interpreting Graphs: Location and Spread
Where is the data centered on the horizontal axis, and how does it spread out from the center?

Interpreting Graphs: Shapes
– Mound shaped and symmetric (mirror images)
– Skewed right: a few unusually large measurements
– Skewed left: a few unusually small measurements
– Bimodal: two local peaks

Interpreting Graphs: Outliers
Are there any strange or unusual measurements that stand out in the data set?
[Figure: two dotplots, one with no outliers and one with an outlier]

Example
A quality control process measures the diameter of a gear being made by a machine (cm). The technician records 15 diameters, but unintentionally makes a typing mistake on the second entry.
1.991 1.891 1.991 1.988 1.993 1.989 1.990 1.988 1.988 1.993 1.991 1.989 1.989 1.993 1.990 1.994

Relative Frequency Histograms
A relative frequency histogram for a quantitative data set is a bar graph in which the height of the bar shows "how often" (measured as a proportion or relative frequency) measurements fall in a particular class or subinterval.
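Returning to the stem and leaf procedure listed earlier in this section, the steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stem_and_leaf` is hypothetical, and it assumes whole-dollar data where the stem is the tens-and-above part and the leaf is the ones digit.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return ordered (stem, leaves) rows for whole-number data.

    Assumption: stem = value // 10, leaf = value % 10.
    Stems with no measurements are kept so gaps stay visible.
    """
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    stems = range(min(rows), max(rows) + 1)
    # Order the leaves from lowest to highest within each stem.
    return [(s, sorted(rows.get(s, []))) for s in stems]

# Prices ($) of the 18 brands of walking shoes from the example.
prices = [90, 70, 70, 70, 75, 70, 65, 68, 60, 74,
          70, 95, 75, 70, 68, 65, 40, 65]

plot = stem_and_leaf(prices)
for stem, leaves in plot:
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 4 | 0 means $40")
```

Keeping the empty stems (5 and 8 here) is a deliberate choice: dropping them would hide the gap between $40 and the rest of the prices.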
[Diagram: create intervals, then stack and draw bars]

Relative Frequency Histograms
– Divide the range of the data into 5 to 12 subintervals of equal length.
– Calculate the approximate width of the subinterval as Range/number of subintervals.
– Round the approximate width up to a convenient value.
– Use the method of left inclusion, including the left endpoint but not the right in your tally.
– Create a statistical table including the subintervals, their frequencies and relative frequencies.
– Draw the relative frequency histogram, plotting the subintervals on the horizontal axis and the relative frequencies on the vertical axis.

The height of the bar represents
– The proportion of measurements falling in that class or subinterval.
– The probability that a single measurement, drawn at random from the set, will belong to that class or subinterval.

Example
The ages of 50 tenured faculty at a state university.
34 48 70 63 52 52 35 50 37 43
53 43 52 44 42 31 36 48 43 26
58 62 49 34 48 53 39 45 34 59
34 66 40 59 36 41 35 36 62 34
38 28 43 50 30 43 32 44 58 53
We choose to use 6 intervals.
Minimum class width = (70 – 26)/6 = 7.33
Convenient class width = 8
Use 6 classes of length 8, starting at 25.

How to choose class & interval?
Step 1: Decide on the number of classes. A useful recipe to determine the number of classes (k) is the "2 to the k rule": choose the smallest k such that $2^k > n$. There were 50 faculty, so n = 50. If we try k = 5, then $2^5 = 32$, which is less than 50. Hence, 5 is not enough classes. If we let k = 6, then $2^6 = 64$, which is greater than 50. So, the recommended number of classes is 6.
Step 2: Determine the class interval or width. The formula is $i \ge \frac{H - L}{k}$, where i is the class interval, H is the highest observed value, L is the lowest observed value, and k is the number of classes.
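The two steps above, together with the left-inclusion tally, can be sketched as follows for the faculty ages. The helper names `number_of_classes` and `tally_left_inclusive` are hypothetical, and rounding the width up to the next whole number is an assumption standing in for "a convenient value"; the starting point 25 is the one chosen in the example.

```python
import math

# Ages of the 50 tenured faculty from the example.
ages = [
    34, 48, 70, 63, 52, 52, 35, 50, 37, 43,
    53, 43, 52, 44, 42, 31, 36, 48, 43, 26,
    58, 62, 49, 34, 48, 53, 39, 45, 34, 59,
    34, 66, 40, 59, 36, 41, 35, 36, 62, 34,
    38, 28, 43, 50, 30, 43, 32, 44, 58, 53,
]

def number_of_classes(n):
    """Smallest k with 2**k > n (the "2 to the k" rule)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

def tally_left_inclusive(data, start, width, k):
    """Count measurements in each class [start + j*width, start + (j+1)*width)."""
    counts = [0] * k
    for x in data:
        j = int((x - start) // width)
        if not 0 <= j < k:
            raise ValueError(f"{x} falls outside the {k} classes")
        counts[j] += 1
    return counts

k = number_of_classes(len(ages))            # Step 1
raw_width = (max(ages) - min(ages)) / k     # Step 2: (H - L) / k
width = math.ceil(raw_width)                # assumption: next whole number is "convenient"
start = 25                                  # starting point chosen in the example

counts = tally_left_inclusive(ages, start, width, k)
relative = [c / len(ages) for c in counts]

for j, (c, r) in enumerate(zip(counts, relative)):
    lo = start + j * width
    print(f"{lo} to < {lo + width}: frequency {c}, relative frequency {r:.2f}")
```

The `ValueError` guard is there because a starting point or width chosen by eye can easily leave the largest measurement outside the last class.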
The highest value in the data is 70, the lowest is 26, and k = 6, so the formula gives (70 – 26)/6 = 7.33. Round up to a convenient number such as 8. Hence, we are using 6 classes of length 8, starting at 25.

Age          Tally               Frequency   Relative Frequency   Percent
25 to < 33   |||||               5           5/50 = .10           10%
33 to < 41   ||||| ||||| ||||    14          14/50 = .28          28%
41 to < 49   ||||| ||||| |||     13          13/50 = .26          26%
49 to < 57   ||||| ||||          9           9/50 = .18           18%
57 to < 65   ||||| ||            7           7/50 = .14           14%
65 to < 73   ||                  2           2/50 = .04           4%

[Relative frequency histogram: ages from 25 to 73 in classes of width 8 on the horizontal axis, relative frequency from 0 to 14/50 on the vertical axis]

Describing the Distribution
– Shape? Skewed right.
– Outliers? No.
– What proportion of the tenured faculty are younger than 41? (14 + 5)/50 = 19/50 = .38
– What is the probability that a randomly selected faculty member is 49 or older? (9 + 7 + 2)/50 = 18/50 = .36

Describing Data with Numerical Measures
Graphical methods may not always be sufficient for describing data. Numerical measures can be created for both populations and samples.
– A parameter is a numerical descriptive measure calculated for a population.
– A statistic is a numerical descriptive measure calculated for a sample.

Measures of Center
A measure along the horizontal axis of the data distribution that locates the center of the distribution.

Arithmetic Mean or Average
The mean of a set of measurements is the sum of the measurements divided by the total number of measurements.

$\bar{x} = \frac{\sum x_i}{n}$

where n = number of measurements and $\sum x_i$ = sum of all the measurements.

Example
The set: 2, 9, 11, 5, 6

$\bar{x} = \frac{\sum x_i}{n} = \frac{2 + 9 + 11 + 5 + 6}{5} = \frac{33}{5} = 6.6$

If we were able to enumerate the whole population, the population mean would be called $\mu$ (the Greek letter "mu").

Median
The median of a set of measurements is the middle measurement when the measurements are ranked from smallest to largest. The position of the median is .5(n + 1) once the measurements have been ordered.
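The two definitions above translate directly into code. The sketch below is a minimal illustration, not a replacement for a statistics library; the function names `mean` and `median` are chosen here for convenience, and the median follows the .5(n + 1) position rule, averaging the two neighbouring measurements when the position is not a whole number.

```python
def mean(xs):
    """Sum of the measurements divided by the number of measurements."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle measurement, located at position .5(n + 1) in the ordered data."""
    ordered = sorted(xs)
    n = len(ordered)
    position = 0.5 * (n + 1)       # 1-based position
    lower = int(position)          # whole part of the position
    if position == lower:
        return ordered[lower - 1]  # odd n: a single middle measurement
    # even n: average the measurements on either side of the position
    return (ordered[lower - 1] + ordered[lower]) / 2

data = [2, 9, 11, 5, 6]
x_bar = mean(data)      # 33 / 5
m = median(data)        # position .5(5 + 1) = 3rd ordered measurement
print(f"mean = {x_bar:.1f}, median = {m}")
```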
Example
The set: 2, 4, 9, 8, 6, 5, 3 (n = 7)
– Sort: 2, 3, 4, 5, 6, 8, 9
– Position: .5(n + 1) = .5(7 + 1) = 4th
– Median = 4th ordered measurement = 5
The set: 2, 4, 9, 8, 6, 5 (n = 6)
– Sort: 2, 4, 5, 6, 8, 9
– Position: .5(n + 1) = .5(6 + 1) = 3.5th
– Median = (5 + 6)/2 = 5.5, the average of the 3rd and 4th measurements

Mode
The mode is the measurement which occurs most frequently.
– The set: 2, 4, 9, 8, 8, 5, 3. The mode is 8, which occurs twice.
– The set: 2, 2, 9, 8, 8, 5, 3. There are two modes, 8 and 2 (bimodal).
– The set: 2, 4, 9, 8, 5, 3. There is no mode (each value is unique).

Example
The number of quarts of milk purchased by 25 households:
0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 5
– Mean? $\bar{x} = \frac{\sum x_i}{n} = \frac{55}{25} = 2.2$
– Median? m = 2
– Mode? (Highest peak) mode = 2
[Relative frequency histogram: quarts from 0 to 5 on the horizontal axis, relative frequency from 0 to 10/25 on the vertical axis]

Extreme Values
The mean is more easily affected by extremely large or small values than the median.
The median is often used as a measure of center when the distribution is skewed.
– Symmetric: Mean = Median
– Skewed right: Mean > Median
– Skewed left: Mean < Median

Measures of Variability
A measure along the horizontal axis of the data distribution that describes the spread of the distribution from the center.

The Range
The range, R, of a set of n measurements is the difference between the largest and smallest measurements.
Example: A botanist records the number of petals on 5 flowers: 5, 12, 6, 8, 14
The range is R = 14 – 5 = 9.
Quick and easy, but only uses 2 of the 5 measurements.

The Variance
The variance is a measure of variability that uses all the measurements. It measures the average deviation of the measurements about their mean.
Flower petals: 5, 12, 6, 8, 14

$\bar{x} = \frac{45}{5} = 9$

[Dotplot of the five measurements on an axis from 4 to 14]

The Variance
The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean $\mu$:

$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

The variance of a sample of n measurements is the sum of the squared deviations of the measurements about their mean $\bar{x}$, divided by (n – 1):

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

The Standard Deviation
In calculating the variance, we squared all of the deviations, and in doing so changed the scale of the measurements. To return this measure of variability to the original units of measure, we calculate the standard deviation, the positive square root of the variance.
– Population standard deviation: $\sigma = \sqrt{\sigma^2}$
– Sample standard deviation: $s = \sqrt{s^2}$

Two Ways to Calculate the Sample Variance
Use the Definition Formula:

$x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
5        -4                 16
12       3                  9
6        -3                 9
8        -1                 1
14       5                  25
Sum 45   0                  60

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{60}{4} = 15$

$s = \sqrt{s^2} = \sqrt{15} = 3.87$

Use the Calculational Formula:

$x_i$    $x_i^2$
5        25
12       144
6        36
8        64
14       196
Sum 45   465

$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{465 - \frac{45^2}{5}}{4} = 15$

$s = \sqrt{s^2} = \sqrt{15} = 3.87$

Some Notes
– The value of s is never negative; it equals zero only when all the measurements are the same.
– The larger the value of $s^2$ or s, the larger the variability of the data set.
– Why divide by n – 1? The sample standard deviation s is often used to estimate the population standard deviation $\sigma$. Dividing by n – 1 gives us a better estimate of $\sigma$.

Using Measures of Center and Spread: Tchebysheff's Theorem
Given a number k greater than or equal to 1 and a set of n measurements, at least $1 - \frac{1}{k^2}$ of the measurements will lie within k standard deviations of the mean.
✓ Can be used for either samples ($\bar{x}$ and s) or for a population ($\mu$ and $\sigma$).
✓ Important results:
✓ If k = 2, at least $1 - \frac{1}{2^2} = \frac{3}{4}$ of the measurements are within 2 standard deviations of the mean.
✓ If k = 3, at least $1 - \frac{1}{3^2} = \frac{8}{9}$ of the measurements are within 3 standard deviations of the mean.
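As a check on the arithmetic in this section, the sketch below computes the sample variance of the flower-petal data both ways and evaluates the Tchebysheff lower bound for k = 2 and k = 3. The function names are hypothetical and chosen only for this illustration.

```python
import math

def sample_variance_definition(xs):
    """Sum of squared deviations about the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_variance_calculational(xs):
    """Shortcut form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

def tchebysheff_lower_bound(k):
    """Minimum proportion of measurements within k standard deviations."""
    if k < 1:
        raise ValueError("Tchebysheff's Theorem requires k >= 1")
    return 1 - 1 / k ** 2

petals = [5, 12, 6, 8, 14]
s2_definition = sample_variance_definition(petals)
s2_calculational = sample_variance_calculational(petals)
s = math.sqrt(s2_definition)

print(f"s^2 (definition)    = {s2_definition}")
print(f"s^2 (calculational) = {s2_calculational}")
print(f"s = {s:.2f}")
for k in (2, 3):
    print(f"k = {k}: at least {tchebysheff_lower_bound(k):.3f} of the measurements")
```

Both formulas are algebraically identical. With floating-point data of very different magnitudes the calculational form can lose precision, which is one reason the definition form is often preferred in code.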
Using Measures of Center and Spread: The Empirical Rule
Given a distribution of measurements that is approximately mound-shaped:
✓ The interval $\mu \pm \sigma$ contains approximately 68% of the measurements.
✓ The interval $\mu \pm 2\sigma$ contains approximately 95% of the measurements.
✓ The interval $\mu \pm 3\sigma$ contains approximately 99.7% of the measurements.

Using Measures of Center and Spread: The Empirical Rule - Example
Suppose we have a dataset representing the scores of students on a standardized test. The mean score ($\mu$) is 500, and the standard deviation ($\sigma$) is 100. Using the Empirical Rule:
– Approximately 68% of the students scored within 500 ± 100 points, that is, between 400 and 600 points.
– Approximately 95% of the students scored within 500 ± 2×100 points, that is, between 300 and 700 points.
– Approximately 99.7% of the students scored within 500 ± 3×100 points, that is, between 200 and 800 points.
This rule shows how the test scores are distributed and how far they spread around the mean score.

Example
The ages of 50 tenured faculty at a state university.
34 48 70 63 52 52 35 50 37 43
53 43 52 44 42 31 36 48 43 26
58 62 49 34 48 53 39 45 34 59
34 66 40 59 36 41 35 36 62 34
38 28 43 50 30 43 32 44 58 53

$\bar{x} = 44.9$, $s = 10.73$
Shape? Skewed right.
[Relative frequency histogram: ages from 25 to 73 on the horizontal axis, relative frequency from 0 to 14/50 on the vertical axis]

k   $\bar{x} \pm ks$    Interval         Proportion in Interval   Tchebysheff     Empirical Rule
1   44.9 ± 10.73        34.17 to 55.63   31/50 (.62)              At least 0      ≈ .68
2   44.9 ± 21.46        23.44 to 66.36   49/50 (.98)              At least .75    ≈ .95
3   44.9 ± 32.19        12.71 to 77.09   50/50 (1.00)             At least .89    ≈ .997

– Do the actual proportions in the three intervals agree with those given by Tchebysheff's Theorem? Yes. Tchebysheff's Theorem must be true for any data set.
– Do they agree with the Empirical Rule? Why or why not? No, not very well. The data distribution is not very mound-shaped but skewed right.

Example
The length of time for a worker to complete a specified operation averages 12.8 minutes with a standard deviation of 1.7 minutes. If the distribution of times is approximately mound-shaped, what proportion of workers will take longer than 16.2 minutes to complete the task?
– 95% of the times lie between 9.4 and 16.2 minutes (12.8 ± 2 × 1.7).
– By symmetry, 47.5% lie between 12.8 and 16.2.
– (50 - 47.5)% = 2.5% lie above 16.2.
[Figure: mound-shaped curve with areas .475, .475 and .025]

Measures of Relative Standing
Where does one particular measurement stand in relation to the other measurements in the data set?
How many standard deviations away from the mean does the measurement lie? This is measured by the z-score.

$z\text{-score} = \frac{x - \bar{x}}{s}$

Suppose s = 2 and $\bar{x} = 5$. Then x = 9 is 4 units above the mean, so x = 9 lies z = 4/2 = 2 standard deviations from the mean.

z-Scores
From Tchebysheff's Theorem and the Empirical Rule:
– At least 3/4 and more likely 95% of measurements lie within 2 standard deviations of the mean.
– At least 8/9 and more likely 99.7% of measurements lie within 3 standard deviations of the mean.
z-scores between –2 and 2 are not unusual. z-scores should not be more than 3 in absolute value. z-scores larger than 3 in absolute value would indicate a possible outlier.
[Figure: z axis from -3 to 3. Between -2 and 2: not unusual. Between 2 and 3 in absolute value: somewhat unusual. Beyond 3 in absolute value: outlier.]

Measures of Relative Standing
How many measurements lie below the measurement of interest? This is measured by the pth percentile.
[Figure: p% of the measurements lie below the pth percentile and (100 - p)% lie above it]

Examples
90% of all men (16 and older) earn more than $319 per week (Bureau of Labor Statistics). So $319 is the 10th percentile.
– 50th percentile = median
– 25th percentile = lower quartile (Q1)
– 75th percentile = upper quartile (Q3)

Quartiles and the IQR
The lower quartile (Q1) is the value of x which is larger than 25% and less than 75% of the ordered measurements.
The upper quartile (Q3) is the value of x which is larger than 75% and less than 25% of the ordered measurements.
The range of the "middle 50%" of the measurements is the interquartile range, IQR = Q3 – Q1.

Calculating Sample Quartiles
The lower and upper quartiles (Q1 and Q3) can be calculated as follows, once the measurements have been ordered:
– The position of Q1 is .25(n + 1)
– The position of Q3 is .75(n + 1)
If the positions are not integers, find the quartiles by interpolation.

Example
The prices ($) of 18 brands of walking shoes:
40 60 65 65 65 68 68 70 70 70 70 70 70 74 75 75 90 95
Position of Q1 = .25(18 + 1) = 4.75
Position of Q3 = .75(18 + 1) = 14.25
✓ Q1 is 3/4 of the way between the 4th and 5th ordered measurements, or Q1 = 65 + .75(65 - 65) = 65.
✓ Q3 is 1/4 of the way between the 14th and 15th ordered measurements, or Q3 = 74 + .25(75 - 74) = 74.25
✓ and IQR = Q3 – Q1 = 74.25 - 65 = 9.25

Using Measures of Center and Spread: The Box Plot
The Five-Number Summary: Min, Q1, Median, Q3, Max
– Divides the data into 4 sets containing an equal number of measurements.
– A quick summary of the data distribution.
– Use it to form a box plot to describe the shape of the distribution and to detect outliers.

Constructing a Box Plot
✓ Calculate Q1, the median, Q3 and IQR.
✓ Draw a horizontal line to represent the scale of measurement.
✓ Draw a box using Q1, the median, Q3.
✓ Isolate outliers by calculating
   – Lower fence: Q1 - 1.5 IQR
   – Upper fence: Q3 + 1.5 IQR
✓ Measurements beyond the upper or lower fence are outliers and are marked (*).
✓ Draw "whiskers" connecting the largest and smallest measurements that are NOT outliers to the box.
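The quartile positions and the fence calculation above can be sketched as follows, using the walking-shoe prices. The function names are hypothetical. Clamping a position that falls before the first or after the last measurement is an assumption made here for very small samples; the slides do not cover that case.

```python
def value_at_position(ordered, position):
    """Interpolate the value at a 1-based, possibly fractional, position."""
    lower = int(position)              # whole part of the position
    fraction = position - lower
    if lower < 1:                      # assumption: clamp to the smallest value
        return ordered[0]
    if lower >= len(ordered):          # assumption: clamp to the largest value
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

def quartiles(data):
    """Q1 and Q3 at positions .25(n + 1) and .75(n + 1) of the ordered data."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, 0.25 * (n + 1))
    q3 = value_at_position(ordered, 0.75 * (n + 1))
    return q1, q3

def fences_and_outliers(data):
    """Lower fence, upper fence, and the measurements beyond them."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in sorted(data) if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers

shoes = [40, 60, 65, 65, 65, 68, 68, 70, 70,
         70, 70, 70, 70, 74, 75, 75, 90, 95]

q1, q3 = quartiles(shoes)
lower_fence, upper_fence, outliers = fences_and_outliers(shoes)
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")
print(f"fences: {lower_fence} to {upper_fence}, outliers: {outliers}")
```

Statistical packages use several different quartile conventions, so a library routine may return slightly different values from the .25(n + 1) and .75(n + 1) rule used here.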
[Box plot sketch: box from Q1 to Q3 with the median line m inside, whiskers on both sides, outlier marked *]

Example
Amount of sodium in 8 brands of cheese:
260 290 300 320 330 340 340 520
Q1 = 292.5, m = 325, Q3 = 340
IQR = 340 - 292.5 = 47.5
Lower fence = 292.5 - 1.5(47.5) = 221.25
Upper fence = 340 + 1.5(47.5) = 411.25
Outlier: x = 520
[Box plot: box from Q1 to Q3 with median line m, outlier at 520 marked *]

Interpreting Box Plots
✓ Median line in center of box and whiskers of equal length: symmetric distribution
✓ Median line left of center and long right whisker: skewed right
✓ Median line right of center and long left whisker: skewed left

Key Concepts
I. How Data Are Generated
   1. Experimental units, variables, measurements
   2. Samples and populations
   3. Univariate, bivariate, and multivariate data
II. Types of Variables
   1. Qualitative or categorical
   2. Quantitative
      a. Discrete
      b. Continuous
III. Graphs for Univariate Data Distributions
   1. Qualitative or categorical data
      a. Pie charts
      b. Bar charts
   2. Quantitative data
      a. Pie and bar charts
      b. Line charts
      c. Dotplots
      d. Stem and leaf plots
      e. Relative frequency histograms
   3. Describing data distributions
      a. Shapes: symmetric, skewed left, skewed right, unimodal, bimodal
      b. Proportion of measurements in certain intervals
      c. Outliers
IV. Describing Data with Numerical Measures
   Measures of Center
   1. Arithmetic mean (mean) or average
      a. Population: $\mu$
      b. Sample of size n: $\bar{x} = \frac{\sum x_i}{n}$
   2. Median: position of the median = .5(n + 1)
   3. Mode
   4. The median may be preferred to the mean if the data are highly skewed.
   Measures of Variability
   1. Range: R = largest − smallest
   2. Variance
      a. Population of N measurements: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
      b. Sample of n measurements: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$
   3. Standard deviation
      Population standard deviation: $\sigma = \sqrt{\sigma^2}$
      Sample standard deviation: $s = \sqrt{s^2}$
   4. A rough approximation for s can be calculated as $s \approx R/4$. The divisor can be adjusted depending on the sample size.
V. Tchebysheff's Theorem and the Empirical Rule
   1. Use Tchebysheff's Theorem for any data set, regardless of its shape or size.
      a. At least $1 - \frac{1}{k^2}$ of the measurements lie within k standard deviations of the mean.
      b. This is only a lower bound; there may be more measurements in the interval.
   2. The Empirical Rule can be used only for relatively mound-shaped data sets. Approximately 68%, 95%, and 99.7% of the measurements are within one, two, and three standard deviations of the mean, respectively.
VI. Measures of Relative Standing
   1. Sample z-score: $z = \frac{x - \bar{x}}{s}$
   2. pth percentile: p% of the measurements are smaller, and (100 − p)% are larger.
   3. Lower quartile, Q1; position of Q1 = .25(n + 1)
   4. Upper quartile, Q3; position of Q3 = .75(n + 1)
   5. Interquartile range: IQR = Q3 − Q1
VII. Box Plots
   1. Box plots are used for detecting outliers and shapes of distributions.
   2. Q1 and Q3 form the ends of the box. The median line is in the interior of the box.
   3. Upper and lower fences are used to find outliers.
      a. Lower fence: Q1 − 1.5(IQR)
      b. Upper fence: Q3 + 1.5(IQR)
   4. Whiskers are connected to the smallest and largest measurements that are not outliers.
   5. Skewed distributions usually have a long whisker in the direction of the skewness, and the median line is drawn away from the direction of the skewness.
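As a closing numerical check on Key Concepts V and VI, the sketch below recomputes the mean, the sample standard deviation, and the proportion of measurements within k standard deviations of the mean for the 50 faculty ages used in this chapter, then finds the z-score of the oldest faculty member. The helper name `proportion_within` is hypothetical; the Empirical Rule percentages are the approximate values quoted in the chapter.

```python
import math

# Ages of the 50 tenured faculty from the chapter example.
ages = [
    34, 48, 70, 63, 52, 52, 35, 50, 37, 43,
    53, 43, 52, 44, 42, 31, 36, 48, 43, 26,
    58, 62, 49, 34, 48, 53, 39, 45, 34, 59,
    34, 66, 40, 59, 36, 41, 35, 36, 62, 34,
    38, 28, 43, 50, 30, 43, 32, 44, 58, 53,
]

n = len(ages)
x_bar = sum(ages) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in ages) / (n - 1))

def proportion_within(data, center, spread, k):
    """Proportion of measurements in the interval center +/- k * spread."""
    low, high = center - k * spread, center + k * spread
    return sum(low <= x <= high for x in data) / len(data)

proportions = []
for k, empirical in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    actual = proportion_within(ages, x_bar, s, k)
    bound = 1 - 1 / k ** 2              # Tchebysheff lower bound
    proportions.append(actual)
    print(f"k = {k}: actual {actual:.2f}, "
          f"Tchebysheff at least {bound:.2f}, Empirical Rule about {empirical}")

# z-score of the largest measurement: how unusual is the oldest faculty member?
z_max = (max(ages) - x_bar) / s
print(f"mean = {x_bar:.1f}, s = {s:.2f}, z-score of {max(ages)} = {z_max:.2f}")
```

The actual proportions always satisfy the Tchebysheff bound, while the gap from the Empirical Rule at k = 1 reflects the right skew of the ages. The largest z-score is between 2 and 3, so the oldest age is somewhat unusual but not flagged as an outlier by the z-score guideline.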