Chapter 2: Describing Data with Numerical Measures
Ahmad Farooqi, PhD
Summary
This document is a lecture or presentation on describing data using numerical measures. It covers concepts like mean, median, mode, and measures of variability like range, variance, and standard deviation. It also discusses z-scores, percentiles, quartiles, and boxplots.
Full Transcript
DESCRIBING DATA WITH NUMERICAL MEASURES
By Ahmad Farooqi, PhD
9/14/2024

Outline of Chapter 2

Describing Data with Numerical Measures
Numerical Measures: Key Terms
Measures of Centre or Average
  The Arithmetic Mean
  The Median
  The Mode
  The Extreme Values
Measures of Variability or Dispersion
  The Range
  The Variance and the Standard Deviation
Tchebysheff's Theorem and the Empirical Rule
Measures of Relative Standing and the z-Score
Percentiles, Quartiles, the Interquartile Range (IQR), and Box Plots
Chapter Review

Learning Outcomes of Chapter 2

On completion of this chapter, with the aid of your course textbook, you should know:
1. What parameters and statistics are, and the difference between them.
2. How to compute and interpret the arithmetic mean, median, and mode (measures of centre for given sample or population data).
3. How to compute and interpret the variance, standard deviation, and range (measures of variability or spread for given sample or population data).
4. What the measures of relative standing are (how big or small a given number is within population or sample data).
5. Tchebysheff's Theorem and the Empirical Rule, and how to use them.
6. How to approximate the standard deviation from the range.
7. The five-number summary.

Describing Data with Numerical Measures

Graphical methods may not always be sufficient for describing data. Numerical measures can be created for both populations and samples.

Population and Sample

Population: A population (not necessarily a human population) consists of all possible observations, individuals, subjects, or items of interest about which you want to draw a conclusion.
The number of subjects in a population is called the population size and is normally denoted by N.

Sample: A sample is a part (subset) of a population. The number of observations, individuals, or subjects in a sample is called the sample size and is usually denoted by n.

Parameter and Statistic

Parameter: A parameter is a numerical quantity obtained from population data, such as the population mean ($\mu$), population standard deviation ($\sigma$), population total, or population proportion ($P$). Note that parameters are usually unknown and fixed.

Statistic: A statistic is a numerical quantity obtained from sample data, such as the sample mean ($\bar{x}$), sample standard deviation ($s$), sample total, or sample proportion ($p$). Note that statistics are variables: their values change from sample to sample.

Measures of Central Tendency (or Averages)

The Arithmetic Mean: the sum of a given set of values divided by the number of values.
The Median: the value that divides a given set of values into two equal parts after the values are arranged in increasing or decreasing order of magnitude.
The Mode: the most frequently repeated value in a set of values. A data set may have one mode, two modes, many modes, or no mode at all.
The Geometric Mean
The Harmonic Mean

Measures of Centre or Average

Measure of Centre or Central Location: a single measure along the horizontal axis of the data distribution that locates the centre of the distribution. In other words, a measure of centre (central location, or average) is a single value that represents the given set of data.

The Arithmetic Mean (or The Mean)

The Mean: the sum of the measurements divided by the total number of measurements.
The sample mean is denoted by $\bar{x}$ (read as "x bar").
The population mean is denoted by $\mu$ (Greek letter "mu").
If our sample data are $x_1, x_2, \ldots, x_n$, then we calculate the sample mean as

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

The Greek capital "sigma" ($\Sigma$) indicates the sum of the measurements and is called "summation". Its subscript and superscript tell you the first and last indices of the values to add together. These are sometimes suppressed when you are simply adding every measurement together.

The Arithmetic Mean (or Mean): Example

Example: Suppose we have 9 patients with hospital length of stay (in days) at Windsor Regional Hospital: 3, 7, 5, 8, 9, 8, 10, 12, 35. Calculate the arithmetic mean (A.M.) of the given data.

$$\text{A.M.} = \bar{x} = \frac{\sum x_i}{n} = \frac{97}{9} \approx 10.78 \text{ days}$$

Example 2.1 (Textbook): Calculate the sample mean for the data set 2, 9, 11, 5, 6.
Solution: We have $n = 5$ and $\sum x_i = 33$, so

$$\bar{x} = \frac{\sum x_i}{n} = \frac{33}{5} = 6.6$$

The Median

The Median: the value that divides the given set of values into two equal parts after ranking the values from smallest to largest (or largest to smallest). In other words, the median is the middle measurement when the measurements are ranked in order.
We denote the median by $m$.
The $0.5(n + 1)$th value indicates the position of the median in the ordered data set. If the position of the median is a number that ends in 0.5 (1/2), you need to average the two adjacent values.
Note that:
If $n$ is odd, the median is simply the sorted data point in this position.
If $n$ is even, the median is the mean of the two "middle" observations.
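The definitions of the mean and the median above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `sample_mean` and `sample_median` are made up for this note, and the data are the hospital length-of-stay values from the mean example.

```python
def sample_mean(values):
    """Sum of the measurements divided by the number of measurements."""
    return sum(values) / len(values)


def sample_median(values):
    """Median via the position rule 0.5 * (n + 1) on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = 0.5 * (n + 1)            # 1-based position in the ordered data
    lower = ordered[int(position) - 1]  # value at the whole-number part
    if position == int(position):       # n odd: a single middle value
        return lower
    upper = ordered[int(position)]      # n even: position ends in .5
    return (lower + upper) / 2          # average the two adjacent values


length_of_stay = [3, 7, 5, 8, 9, 8, 10, 12, 35]
mean_stay = sample_mean(length_of_stay)
median_stay = sample_median(length_of_stay)
print(round(mean_stay, 2), median_stay)  # 10.78 8
```

Note how far the mean (about 10.78 days) sits above the median (8 days): the single stay of 35 days pulls the mean upward.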

The Median: Example

Example: Again, suppose we have hospital length of stay (in days) at Windsor Regional Hospital for 9 patients in part (a) and 8 patients in part (b):
a) 3, 7, 5, 8, 9, 8, 10, 12, 35
b) 3, 7, 5, 8, 9, 10, 12, 35
Calculate the median ($m$) of both data sets.

Solution (a): The sorted data in increasing order are 3, 5, 7, 8, 8, 9, 10, 12, 35 (here $n = 9$). The position of the median is the $0.5(n + 1) = 0.5(10) = 5$th observation. The median is the fifth measurement, that is, $m = 8$ days.

Solution (b): The sorted data set is 3, 5, 7, 8, 9, 10, 12, 35 ($n = 8$). The position of the median is the $0.5(n + 1) = 0.5(9) = 4.5$th observation. The median is the average of the 4th and 5th measurements: $m = (8 + 9)/2 = 8.5$ days.

The Mode: Example

The Mode: the most frequently repeated value in a set of values. Note that a data set may have one mode, two modes, or many modes. Sometimes there is no mode.
Examples:
In the set 2, 4, 9, 8, 8, 5, 3
  o The mode is 8, which occurs twice.
In the set 2, 2, 9, 8, 8, 5, 3
  o There are two modes, 8 and 2 (the distribution is bimodal).
In the set 2, 4, 9, 8, 5, 3
  o There is no mode; each value is unique.

Measures of Centre: Example

The number of liters of milk purchased by 25 households:
0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 5

Mean: $\bar{x} = \frac{\sum x_i}{n} = \frac{55}{25} = 2.2$ liters
Median: the position is $0.5(n + 1) = 0.5(26) = 13$, so the median is the 13th ordered value: $m = 2$ liters
Mode: We can identify this by finding the highest peak of its histogram.
So, the mode is 2 liters.

Extreme Values and Relationship between Mean, Median, and Mode

The mean is more easily affected by extremely large or small values than the median.
Symmetric: If a distribution is symmetric, the mean, median, and mode are equal. Mean = Median = Mode
Skewed right: If a distribution is skewed to the right, the mean is greater than the median, which is greater than the mode. Mean > Median > Mode
Skewed left: If a distribution is skewed to the left, the mean is less than the median, which is less than the mode. Mean < Median < Mode
The median is often used as a measure of centre when the distribution is skewed.

Extreme Values: Example

Example: Consider the following set of data: 1, 3, 5, 7, 8. Calculate the mean and median. If we then replace 8 with a "wrong" number of 100, what happens to these values?
Solution:
Original data: $\bar{x} = 24/5 = 4.8$ and $m = 5$.
Modified data: $\bar{x} = 116/5 = 23.2$ and $m = 5$.
The median didn't change, but the mean had a large increase in value! This shows that the mean is greatly affected by extreme values (outliers).

Measures of Centre: Blood Pressure Example

Example: Suppose we take the first 10 observations from our SBP data set: 110, 127, 135, 120, 169, 104, 140, 116, 137, 138. Compute the mean, median, and mode of these data.
Solution:
Mean: $\bar{x} = \frac{\sum x_i}{n} = \frac{1296}{10} = 129.6$ mmHg.
Median: Sorted data: 104, 110, 116, 120, 127, 135, 137, 138, 140, 169. We have $n = 10$, and the position of the median is the $0.5(n + 1) = 5.5$th value, so the median is $m = (127 + 135)/2 = 131$ mmHg.
Mode: Since each number occurs once, there is no mode in these data.

Measures of Variability (Dispersion or Spread)

The Range: the difference between the maximum value and the minimum value.
The Mean Deviation: the ratio of the sum of the absolute deviations from the mean to the total number of observations.
The Interquartile Range (Quartile Deviation): the difference between the upper quartile (Q3) and the lower quartile (Q1).
The Variance and Standard Deviation: the variance is the ratio of the sum of the squared deviations from the mean to the total number of observations.

The Measures of Variability

Although a measure of centre is a concise way to present statistical data, it does not by itself give a clear picture of the data; for example, it gives no indication of the reliability of the data. The more variability in the data, the less reliable the average. Another type of measure that helps describe the shape of the data distribution more clearly is variability. Variability indicates how the measurements are spread out from their average. There are different measures of variability, such as:
1. The Range.
2. The Variance and Standard Deviation.
3. The Interquartile Range (Quartile Deviation).
4. The Mean Deviation.

The Range

The Range (R): the difference between the maximum value and the minimum value.
Example: Again, let the length of stay (days) at WRH be 3, 7, 5, 8, 9, 8, 10, 12, 35.
Solution: The range is R = (maximum value - minimum value) = (35 - 3) = 32 days.
Note: The range is quick and easy to calculate, but it uses only 2 measurements from the data set, so it is not a good measure of variability.
Example: A botanist records the number of petals on five flowers: 5, 12, 6, 8, 14.
Solution: The range is R = (14 - 5) = 9 petals.

The Variance

The Variance: a measure of variability (spread) that uses all the measurements; it measures the average deviation of the measurements about their mean ($\bar{x}$).
Example: A botanist records the number of petals on five flowers: 5, 12, 6, 8, 14.

$$\bar{x} = \frac{45}{5} = 9$$

Note that if the distances of the measurements from the mean are large, the data are more spread out and variable.

The Variance

Variance of a Population: The variance of a population of $N$ measurements is the average of the squared deviations of the measurements about their mean ($\mu$). It is denoted by $\sigma^2$ and defined by

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

Variance of a Sample: The variance of a sample of $n$ measurements is the sum of the squared deviations of the measurements about their mean $\bar{x}$, divided by $(n - 1)$. It is denoted by $s^2$ and defined by

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Note: The variance is never negative.

The Standard Deviation (S.D.)

In calculating the variance, we squared all the deviations, and in doing so changed the scale of the measurements. To return this measure of variability to the original units of measure, we calculate the standard deviation, the positive square root of the variance.
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$

There are two convenient formulae for calculating the sample variance. We can obtain the second by manipulating our original definition:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}$$

The Standard Deviation

Note:
1. The value of $s$ is always greater than or equal to zero (it is never negative).
2. The larger the value of $s^2$ or $s$, the greater the variability of the data, and hence the less reliable the mean is as a summary of the data.
3. If $s^2$ or $s$ is equal to zero, all the measurements must have the same value; e.g., 6, 6, 6, 6, ..., 6 has $s^2$ and $s$ equal to zero (no variability).
4. We compute the standard deviation to measure the variability in the same units as the original observations.
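The two sample-variance formulae above should always agree. A minimal sketch, assuming the petal counts from the botanist example and using illustrative function names (`variance_definition`, `variance_computing`) that are not part of the slides:

```python
import math


def variance_definition(values):
    """Definition formula: sum of squared deviations divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_computing(values):
    """Computing formula: (sum of x^2 - (sum of x)^2 / n) divided by (n - 1)."""
    n = len(values)
    total = sum(values)
    total_of_squares = sum(x * x for x in values)
    return (total_of_squares - total ** 2 / n) / (n - 1)


petals = [5, 12, 6, 8, 14]
s2_definition = variance_definition(petals)   # 60 / 4
s2_computing = variance_computing(petals)     # (465 - 45**2 / 5) / 4
s = math.sqrt(s2_definition)                  # back to the original units
print(s2_definition, s2_computing, round(s, 2))  # 15.0 15.0 3.87
```

The computing formula avoids working out each deviation by hand, which is why it is convenient on a calculator; in floating-point code the definition formula is usually the numerically safer choice.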
Why divide by $(n - 1)$ in the variance and standard deviation formula? Because the sample standard deviation $s$ is often used to estimate the population standard deviation $\sigma$. Dividing by $(n - 1)$ gives a better estimate: it makes $s^2$ an unbiased estimate of $\sigma^2$.

Two Ways to Calculate the Sample Variance: Example

Use the Definition Formula. For the petal data 5, 12, 6, 8, 14 with $\bar{x} = 9$:

x_i     x_i - x̄     (x_i - x̄)²
5       -4           16
12      3            9
6       -3           9
8       -1           1
14      5            25
Σ 45    0            60

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{60}{4} = 15, \qquad s = \sqrt{s^2} = \sqrt{15} = 3.87$$

Use the Computing Formula. For the same data:

x_i     x_i²
5       25
12      144
6       36
8       64
14      196
Σ 45    465

$$s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1} = \frac{465 - \frac{45^2}{5}}{4} = 15, \qquad s = \sqrt{s^2} = \sqrt{15} = 3.87$$

On the Practical Significance of the Standard Deviation

Tchebysheff's Theorem: This theorem, also called Chebyshev's Theorem, is used in describing the variability of the measurements in the data. It provides a way to bound the proportion of measurements within a certain number of standard deviations of the mean in any distribution, regardless of its shape.
The theorem states that, given a number $k$ greater than or equal to 1 ($k$ is the number of standard deviations away from the mean) and a set of $n$ measurements, at least $1 - (1/k^2)$ of the measurements will lie within $k$ standard deviations ($s$ or $\sigma$) of the mean ($\bar{x}$ or $\mu$).

Important Results Regarding Tchebysheff's Theorem

When $k = 1$: At least $1 - 1/1^2 = 0$ of the measurements are within 1 standard deviation of the mean, i.e., within $(\mu \pm \sigma)$. The bound says nothing useful in this case.
When $k = 2$: At least $1 - 1/2^2 = 3/4 = 75\%$ of the measurements are within 2 standard deviations of the mean, i.e., within $(\mu \pm 2\sigma)$.
When $k = 3$: At least $1 - 1/3^2 = 8/9 \approx 88.9\%$ of the measurements are within 3 standard deviations of the mean, i.e., within $(\mu \pm 3\sigma)$.
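The three results above are all instances of the single bound $1 - 1/k^2$. A small sketch that evaluates it; the function name `tchebysheff_lower_bound` is illustrative only:

```python
def tchebysheff_lower_bound(k):
    """Minimum proportion of measurements within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("Tchebysheff's Theorem requires k >= 1")
    return 1 - 1 / k ** 2


for k in (1, 2, 3):
    print(k, round(tchebysheff_lower_bound(k), 3))
# 1 0.0
# 2 0.75
# 3 0.889
```

The theorem does not require k to be a whole number; for example, k = 1.5 gives a bound of 1 - 1/2.25, or about 55.6%.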
Since Tchebysheff's Theorem applies to any distribution, it is very conservative (its bounds are often loose). If a distribution is mound-shaped, we have an alternative rule that can be more accurate.
Note that this theorem applies to any set of measurements and can be used for either samples ($\bar{x}$ and $s$) or a population ($\mu$ and $\sigma$).

The Empirical Rule for Bell-Shaped Data

Mound-shaped distribution: The bell-shaped curve, more commonly known as the normal curve, is called a mound-shaped distribution.
For a distribution of measurements that is approximately mound-shaped, the Empirical Rule states:
1. The interval $(\mu \pm \sigma)$ contains approximately 68% of the measurements.
2. The interval $(\mu \pm 2\sigma)$ contains approximately 95% of the measurements.
3. The interval $(\mu \pm 3\sigma)$ contains approximately 99.7% of the measurements.

Tchebysheff's Theorem & Empirical Rule: Example

Example 2.9 (Textbook): Student teachers are trained to develop lesson plans, on the assumption that the written plan will help them perform successfully in the classroom. In a study to assess the relationship between written lesson plans and their implementation in the classroom, 25 lesson plans were scored on a scale of 0 to 34 according to a Lesson Plan Assessment Checklist:

26.1 26.0 14.5 29.3 19.7
22.1 21.2 26.6 31.9 25.0
15.9 20.8 20.2 17.8 13.3
25.6 26.5 15.7 22.1 13.8
29.0 21.3 23.5 22.1 10.2

Use Tchebysheff's Theorem and the Empirical Rule (if applicable) to describe the distribution of these assessment scores.
Solution: It is straightforward to verify that $\bar{x} = 21.6$ and $s = 5.5$.
Can we apply Tchebysheff's Theorem? Yes! It works for any distribution.
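The quoted values of the mean and standard deviation can be checked directly, and the scores within 1, 2, and 3 standard deviations of the mean can be counted for comparison with both rules. This is a sketch using Python's standard `statistics` module; the helper `count_within` is a made-up name for this note.

```python
import statistics

# Lesson plan assessment scores from Example 2.9.
scores = [
    26.1, 26.0, 14.5, 29.3, 19.7,
    22.1, 21.2, 26.6, 31.9, 25.0,
    15.9, 20.8, 20.2, 17.8, 13.3,
    25.6, 26.5, 15.7, 22.1, 13.8,
    29.0, 21.3, 23.5, 22.1, 10.2,
]

x_bar = statistics.mean(scores)   # sample mean
s = statistics.stdev(scores)      # sample standard deviation (divides by n - 1)


def count_within(values, centre, spread, k):
    """Count the values inside the interval centre +/- k * spread."""
    low, high = centre - k * spread, centre + k * spread
    return sum(1 for x in values if low <= x <= high)


counts = {k: count_within(scores, x_bar, s, k) for k in (1, 2, 3)}
print(round(x_bar, 1), round(s, 1))
print(counts)  # {1: 16, 2: 24, 3: 25}
```

Out of 25 scores, the counts 16, 24, and 25 correspond to proportions 0.64, 0.96, and 1.00, which exceed the Tchebysheff lower bounds and sit close to the Empirical Rule's 68%, 95%, and 99.7%.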

Tchebysheff's Theorem & Empirical Rule: Example (cont'd)

Can we apply the Empirical Rule? One can see the distribution is relatively mound-shaped, so the Empirical Rule should work relatively well.

Statistical Table for Empirical Rule & Histogram:

k    x̄ ± ks         Interval        Proportion in Interval    Tchebysheff      Empirical Rule
1    21.6 ± 5.5     16.1 to 27.1    16/25 (0.64)              At least 0       Approximately 0.68
2    21.6 ± 11      10.6 to 32.6    24/25 (0.96)              At least 0.75    Approximately 0.95
3    21.6 ± 16.5    5.1 to 38.1     25/25 (1.00)              At least 0.89    Approximately 0.997

Do the actual proportions in the three intervals agree with those given by Tchebysheff's Theorem?
  o Yes. Tchebysheff's Theorem must be true for any data set.
Do they agree with the Empirical Rule? Why or why not?
  o Yes, relatively well, since the distribution is relatively mound-shaped (bell curve).

Empirical Rule: Tail Probability Example

Example (Tail Probability): The length of time for a worker to complete a specified operation averages 12.8 minutes with a standard deviation of 1.7 minutes. If the distribution of times is approximately mound-shaped, what proportion of workers will take longer than 16.2 minutes to complete the task?
Solution: A mound-shaped distribution is symmetric about $\mu$, so 50% of values are below $\mu$ and 50% are above $\mu$. Note that $16.2 = 12.8 + 2(1.7) = \mu + 2\sigma$. About 95% of the area lies between $\mu - 2\sigma$ and $\mu + 2\sigma$, so there must be (approximately) 47.5% between $\mu$ and $\mu + 2\sigma$. That leaves $50\% - 47.5\% = 2.5\%$ above 16.2 minutes.
[Figure: mound-shaped curve with 95% of the area between $\mu - 2\sigma$ and $\mu + 2\sigma$ (47.5% on each side of $\mu$) and 2.5% in each tail; $\mu + 2\sigma = 16.2$.]

Tchebysheff's Theorem vs. Empirical Rule

Tchebysheff's Theorem is conservative (not precise) but is applicable to any distribution.
The Empirical Rule is more precise but is limited to mound-shaped distributions.
Both rules give us an idea of how many measurements (sample or population) we expect to fall within 1, 2, or 3 standard deviations of the centre (mean).

Approximating the Standard Deviation (s)

From Tchebysheff's Theorem and the Empirical Rule, we know that most of the time, measurements lie within 2 standard deviations of their mean.
The range $R$ is the difference between the maximum and minimum values.
If we assume that our max is near $(\bar{x} + 2s)$ and our min is near $(\bar{x} - 2s)$, then $R \approx 4s$. Equivalently, $s \approx R/4$.
We can use this as a shortcut to approximate $s$ from $R$!
If we have a large data set (say, $n = 50$ or more), it's more likely that our max is near $(\bar{x} + 3s)$ and our min is near $(\bar{x} - 3s)$. In this case, we would use $s \approx R/6$.
For smaller samples (say, $n = 5$ or less), the range may be as small as or smaller than $2.5s$.
This is not intended to provide an accurate value for $s$. Rather, its purpose is to detect gross errors in calculating $s$ (e.g., forgetting to take the square root).

Approximating the Standard Deviation (s): Sample Sizes and Divisors

Table 2.6 (Textbook): The calculated $s$ should not differ substantially from the range divided by the appropriate ratio.

Approximating the Standard Deviation (s): Lesson Plan Assessment Scores Example

Example 2.11 (Textbook): Using the lesson plan assessment scores data, approximate the standard deviation by using the range. Recall that the calculated value is $s = 5.5$.

26.1 26.0 14.5 29.3 19.7
22.1 21.2 26.6 31.9 25.0
15.9 20.8 20.2 17.8 13.3
25.6 26.5 15.7 22.1 13.8
29.0 21.3 23.5 22.1 10.2

Solution: The range is $R = 31.9 - 10.2 = 21.7$. Our approximation for $s$ is therefore $s \approx R/4 = 21.7/4 = 5.425$, which is very close to the calculated value!
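The range shortcut is easy to use as a sanity check on a computed standard deviation. A minimal sketch, assuming the divisor 4 for this sample of 25 scores; `approximate_sd` is a hypothetical helper name, and the divisor is left as a parameter because the appropriate value depends on the sample size.

```python
import statistics


def approximate_sd(values, divisor=4):
    """Rough standard deviation estimate: range divided by a chosen divisor."""
    return (max(values) - min(values)) / divisor


scores = [
    26.1, 26.0, 14.5, 29.3, 19.7,
    22.1, 21.2, 26.6, 31.9, 25.0,
    15.9, 20.8, 20.2, 17.8, 13.3,
    25.6, 26.5, 15.7, 22.1, 13.8,
    29.0, 21.3, 23.5, 22.1, 10.2,
]

rough_s = approximate_sd(scores)        # (31.9 - 10.2) / 4
computed_s = statistics.stdev(scores)   # sample standard deviation
print(round(rough_s, 3), round(computed_s, 3))
```

If the two numbers differed by an order of magnitude, that would point to an arithmetic slip such as reporting the variance instead of its square root.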

MEASURES OF RELATIVE STANDING

The sample z-score is a measure of relative standing defined by

$$z = \frac{x - \bar{x}}{s}$$

A z-score describes the position of an observation relative to the others in a set of data. It measures the distance between an observation and the mean, in units of standard deviations!
Suppose that $s = 2$ and $\bar{x} = 5$. The measurement $x = 9$ is $z = (9 - 5)/2 = 2$ standard deviations from the mean.

z-Scores: Example

Suppose $s = 2$ and $\bar{x} = 5$. Then $x = 9$ lies 4 units, that is $z = 2$ standard deviations, above the mean.
[Figure: number line marking $\bar{x} = 5$ and $x = 9$, a distance of $4 = 2s$ apart.]

z-Score: Intuitive Meaning

The sample z-score for observation $x$ is $z = (x - \bar{x})/s$.
The numerator, $(x - \bar{x})$, is the distance of $x$ from $\bar{x}$. How far is this measurement from the centre of the distribution?
If this distance is "small", then $x$ is "usual". If it is "large", then $x$ is "unusual".
What is "small" or "large"? 185 km would be a very long length for a rope, but not a very long distance between a city and Toronto!
To remove the subjectivity of determining what distances are "small" or "large", we divide by the standard deviation.
A similar interpretation holds for the population z-score, $z = (x - \mu)/\sigma$.

z-Scores: Example

Example (Textbook), Test Score: Suppose that the mean and standard deviation of test scores are 25 and 4 (out of 35), respectively. A student received a score of 30. What is their z-score?
Their z-score is:

$$z = \frac{x - \bar{x}}{s} = \frac{30 - 25}{4} = 1.25$$

Therefore, their score lies 1.25 standard deviations above the mean.

z-Scores: Tchebysheff's Theorem and the Empirical Rule

From Tchebysheff's Theorem and the Empirical Rule:
At least 3/4 (75%) and more likely 95% of measurements lie within 2 standard deviations of the mean.
Observations with z-scores exceeding 2 in absolute value happen less than 5% of the time and are considered somewhat unlikely.
At least 8/9 (89%) and more likely 99.7% of measurements lie within 3 standard deviations of the mean.
Observations with z-scores exceeding 3 in absolute value happen less than 1% of the time and are considered very unlikely.

z-Scores

z-scores between -2 and 2 are not unusual.
z-scores should not be more than 3 in absolute value.
z-scores larger than 3 in absolute value would indicate a possible outlier. The observation may have been recorded incorrectly, or it may not belong to the population being sampled. Or it may just be a highly unlikely observation (but valid, nonetheless)!
[Figure: z-score scale from -3 to 3. Between -2 and 2: not unusual; between 2 and 3 in absolute value: somewhat unusual; beyond -3 or 3: outlier.]

z-Scores: Example

Example 2.12 (Textbook): Consider a sample of measurements in which one measurement $x$ seems unusually large! Calculate its z-score.
Solution: It is straightforward to calculate $\bar{x}$ and $s$. The z-score is $z = (x - \bar{x})/s$.
Although it doesn't exceed 3, it's close enough that you may suspect that $x$ is an outlier. In this situation, you should examine the sampling procedure to see whether $x$ is a faulty observation.

Quartiles and Interquartile Range of Data

Quartiles: Quartiles divide the given data into four equal parts. There are three quartiles.
The 1st quartile (Q1) has 1/4 (25%) of the data below it. Its position is given by
Position of lower quartile (Q1) = $\frac{1(n + 1)}{4}$th value
The 2nd quartile (Q2 or median) has 2/4 (50%) of the data below it.
Position of Q2 = $\frac{2(n + 1)}{4}$th value
The 3rd quartile (Q3) has 3/4 (75%) of the data below it.
Position of upper quartile (Q3) = $\frac{3(n + 1)}{4}$th value
The range of the "middle 50%" of the measurements is called the interquartile range: $IQR = Q_3 - Q_1$.

Quartiles and IQR: Example

Example (Running Shoes): Find the interquartile range of the ordered prices (in dollars) of 18 brands of walking shoes:
40 60 65 65 65 68 68 70 70 70 70 70 70 74 75 75 90 95
Solution: The position of the lower quartile is the $\frac{(n + 1)}{4} = \frac{(18 + 1)}{4} = 4.75$th value. The lower quartile is 0.75 of the way between the 4th and 5th ordered values, so
Lower quartile: $Q_1 = \text{4th} + 0.75(\text{5th} - \text{4th}) = 65 + 0.75(65 - 65) = 65$ dollars.
The position of the upper quartile is the $\frac{3(n + 1)}{4} = \frac{3(18 + 1)}{4} = 14.25$th value. The upper quartile is 0.25 of the way between the 14th and 15th ordered measurements, so
Upper quartile: $Q_3 = \text{14th} + 0.25(\text{15th} - \text{14th}) = 74 + 0.25(75 - 74) = 74.25$ dollars.
The IQR is therefore $Q_3 - Q_1 = 74.25 - 65 = 9.25$ dollars.

Deciles and Percentiles

Deciles: Deciles divide the given data into ten equal parts. There are nine deciles. The 1st decile (D1) has 1/10 (10%) of the data below it, the 2nd decile (D2) has 2/10 (20%) of the data below it, and so on; the 9th decile (D9) has 9/10 (90%) of the data below it.
Percentiles: Similarly, percentiles divide the given data into one hundred equal parts. There are ninety-nine percentiles. The 1st percentile (P1) has 1/100 (1%) of the data below it, the 2nd percentile (P2) has 2/100 (2%) of the data below it, and so on; the 99th percentile (P99) has 99/100 (99%) of the data below it.

Five-Number Summary

Quartiles divide the data into 4 sets containing an (approximately) equal number of measurements. If we add the largest number (Max) and the smallest number (Min) to this group, we will have five numbers that provide a quick and rough summary of the data distribution.
The five-number summary consists of the smallest number, the lower quartile, the median, the upper quartile, and the largest number, presented in order from smallest to largest.
The Five-Number Summary: Min, Q1, Median, Q3, Max
By definition, approximately 1/4 (0.25) of the measurements lie between each of the four adjacent pairs of numbers.

Detecting Outliers

To detect outliers, we first define the fences:
Lower fence: $Q_1 - 1.5(IQR)$
Upper fence: $Q_3 + 1.5(IQR)$
Any measurements beyond the upper or lower fence are outliers.

Boxplot (or Box-and-Whisker Plot)

Boxplot (or box-and-whisker plot): The boxplot is a useful graph for identifying the central aspects of the data, as well as the extreme observations, or outliers. A boxplot consists of several components.
[Figure: labelled boxplot showing its components.]

How to Construct a Boxplot

The steps to construct a boxplot are:
1. Calculate Q1, the median, Q3, and the IQR for the given data.
2. Draw a horizontal line representing the scale of measurement. Create a box just above the horizontal line with its left and right ends at Q1 and Q3. Draw a vertical line through the box at the location of the median.
3. Use the IQR (or z-scores) to create imaginary "fences".
4. Mark any outliers above or below the fences with an asterisk (*).
5. Extend horizontal lines called "whiskers" from the ends of the box to the smallest and largest observations that are not outliers.
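The quartile position rule, the five-number summary, and the fences can be sketched together as follows. This is an illustrative sketch rather than a reference implementation: the helper names are made up, it assumes at least three measurements, and it reuses the walking-shoe prices from the earlier quartile example. Statistical software often uses a different quartile convention, so its results may differ slightly from the $(n + 1)$ position rule used here.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ordered data."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    # Move the given fraction of the way from one ordered value to the next.
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using positions 0.25/0.5/0.75 * (n + 1)."""
    ordered = sorted(values)
    n = len(ordered)  # assumed to be at least 3
    q1 = value_at_position(ordered, 0.25 * (n + 1))
    median = value_at_position(ordered, 0.50 * (n + 1))
    q3 = value_at_position(ordered, 0.75 * (n + 1))
    return ordered[0], q1, median, q3, ordered[-1]


def fences(q1, q3):
    """Lower and upper fences: 1.5 * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


prices = [40, 60, 65, 65, 65, 68, 68, 70, 70, 70, 70, 70, 70, 74, 75, 75, 90, 95]
minimum, q1, median, q3, maximum = five_number_summary(prices)
lower_fence, upper_fence = fences(q1, q3)
outliers = [x for x in prices if x < lower_fence or x > upper_fence]

print(minimum, q1, median, q3, maximum)  # 40 65.0 70.0 74.25 95
print(lower_fence, upper_fence)          # 51.125 88.125
print(outliers)                          # [40, 90, 95]
```

In a boxplot of these prices, the whiskers would stop at 60 and 75, the most extreme prices inside the fences, and 40, 90, and 95 would be marked individually as outliers.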

Interpreting Box Plots

One can use the box plot to detect skewness by looking at the position of the median line in comparison to Q1 and Q3:
Median line in centre of box and whiskers of equal length: symmetric distribution
Median line left of centre and long right whisker: skewed right
Median line right of centre and long left whisker: skewed left

Five-Number Summary, Boxplot, & Outliers: Example

Example 2.16 (Textbook): The amount of sodium per slice (in milligrams) in 8 brands of cheese is given. Calculate the five-number summary, construct a boxplot, and look for outliers.
260 290 300 320 330 340 340 520
Solution: We have $m = 325$, $Q_1 = 292.5$, and $Q_3 = 340$, so $IQR = 340 - 292.5 = 47.5$. The five-number summary is 260, 292.5, 325, 340, 520.
Lower fence: $Q_1 - 1.5(IQR) = 292.5 - 71.25 = 221.25$
Upper fence: $Q_3 + 1.5(IQR) = 340 + 71.25 = 411.25$
Outlier: $x = 520$ (as it is greater than the upper fence, 411.25)
[Figure: boxplot of the cheese sodium data, with the outlier 520 marked beyond the upper fence.]

Chapter Review: Key Concepts

Measures of Centre:
1. Arithmetic mean (mean) or average
   a. Population: $\mu$
   b. Sample of size n: $\bar{x} = \frac{\sum x_i}{n}$
2. Median: the position of the median is the $0.5(n + 1)$th value
3. Mode: the most repeated value
4. The median may be preferred to the mean if the data are highly skewed.

Chapter Review: Key Concepts (Cont'd)

Measures of Variability
1. Range: $R$ = largest - smallest
2. Variance
   a. Population of $N$ measurements: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
   b. Sample of $n$ measurements: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$
3. Standard deviation
   a. Population standard deviation: $\sigma = \sqrt{\sigma^2}$
   b. Sample standard deviation: $s = \sqrt{s^2}$
4. A rough approximation for $s$ can be calculated as $s \approx R/4$.
The divisor can be adjusted depending on the sample size.

Chapter Review: Key Concepts (Cont'd)

Tchebysheff's Theorem and the Empirical Rule
1. Use Tchebysheff's Theorem for any data set, regardless of its shape or size.
   a. At least $1 - (1/k^2)$ of the measurements lie within $k$ standard deviations of the mean.
   b. This is only a lower bound; there may be more measurements in the interval.
2. The Empirical Rule can be used only for relatively mound-shaped data sets. Approximately 68%, 95%, and 99.7% of the measurements are within one, two, and three standard deviations of the mean, respectively.

Chapter Review: Key Concepts (Cont'd)

Measures of Relative Standing
1. Sample z-score: $z = \frac{x - \bar{x}}{s}$
2. $p$th percentile: $p\%$ of the measurements are smaller, and $(100 - p)\%$ are larger.
3. Lower quartile, $Q_1$; position of $Q_1 = 0.25(n + 1)$
4. Upper quartile, $Q_3$; position of $Q_3 = 0.75(n + 1)$
5. Interquartile range: $IQR = Q_3 - Q_1$

Chapter Review: Key Concepts (Cont'd)

Box Plots
1. Box plots are used for detecting outliers and shapes of distributions.
2. $Q_1$ and $Q_3$ form the ends of the box, and the median line is in the interior of the box.
3. Upper and lower fences are used to find outliers.
   Lower fence: $Q_1 - 1.5(IQR)$
   Upper fence: $Q_3 + 1.5(IQR)$
4. Whiskers are connected to the smallest and largest measurements that are not outliers.
5. Skewed distributions usually have a long whisker in the direction of the skewness, and the median line is drawn away from the direction of the skewness.

PRACTICE EXERCISES

Exercise 1: You are given n = 5 measurements: 2, 1, 1, 3, 5.
1. Calculate the sample mean $\bar{x}$.
2. Calculate the sample variance $s^2$ using the formula given by the definition.
3. Find the sample standard deviation $s$.
4. Find $s^2$ and $s$ using the computing formula.
Exercise 2: You are given n = 8 measurements: 3, 1, 5, 6, 4, 4, 3, 5.
a. Calculate the sample mean.
b. Calculate the range.
c. Calculate the sample standard deviation.
d. Compare the range with the standard deviation.

Exercise 3: A distribution of measurements is relatively mound-shaped with mean 50 and standard deviation 10.
a. What proportion of the measurements will fall between 40 and 60?
b. What proportion of the measurements will fall between 30 and 70?
c. What proportion of the measurements will fall between 30 and 60?
d. If a measurement is chosen at random from this distribution, what is the probability that it will be greater than 60?

Exercise 4: Given the following data set: 8, 7, 1, 4, 6, 6, 4, 5, 7, 6, 3, 0.
a. Find the five-number summary and the IQR.
b. Calculate the sample mean and variance.
c. Calculate the z-score for the smallest and largest observations.
d. Is either of these observations unusually large or unusually small?

References

Textbook: Mendenhall, W., Ahmed, S.E., Beaver, R. and Beaver, B.: Introduction to Probability and Statistics, Fourth Canadian Edition, Nelson (2018).
Notes: STAT-2910, Dr. Kevin Granville.