Lesson 3: Data Description PDF

ENENDA30 – ENGINEERING DATA ANALYSIS
Lesson 3: Data Description

❑ Measures of Central Tendency (Mean, Median, Mode, Midrange)
❑ Distribution Shapes
❑ Measures of Variation (Range, Variance, Standard Deviation)
❑ Chebyshev’s Theorem and The Empirical Rule
❑ Measures of Position (z-score, Percentiles, Quartiles, Interquartile Range, Deciles)
❑ Exploratory Data Analysis (Boxplot)

Prepared by: Herbert V. Villaruel, ECE, ECT
National University – Manila, College of Engineering
© Engr. H. V. Villaruel, A.Y. 2021-2022

No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.

INTRODUCTION
❑ This chapter shows the statistical methods that can be used to summarize data. The most familiar of these methods is the finding of averages.
❑ The word average is ambiguous, since several different methods can be used to obtain an average. Loosely stated, the average means the center of the distribution or the most typical case. Measures of average are also called measures of central tendency and include the mean, median, mode, and midrange.
❑ In addition to knowing the average, you must know how the data values are dispersed. The measures that determine the spread of the data values are called measures of variation, or measures of dispersion. These measures include the range, variance, and standard deviation.
❑ Finally, another set of measures is necessary to describe data. These measures are called measures of position. They tell where a specific data value falls within the data set or its relative position in comparison with other data values. The most common position measures are percentiles, deciles, and quartiles. These measures are used extensively in psychology and education.
Sometimes they are referred to as norms.

MEASURES OF CENTRAL TENDENCY
❑ A statistic is a characteristic or measure obtained by using the data values from a sample.
❑ A parameter is a characteristic or measure obtained by using all the data values from a specific population.

EXAMPLE: Suppose an insurance manager wanted to know the average weekly sales of all the company’s representatives. If the company employed a large number of salespeople, say, nationwide, he would have to use a sample and make an inference to the entire sales force. But if the company had only a few salespeople, say, only 87 agents, he would be able to use all representatives’ sales for a randomly chosen week and thus use the entire population. In this example, the average of the sales from a sample of representatives is a statistic, and the average of sales obtained from the entire population is a parameter.

GENERAL ROUNDING RULE: In statistics, the basic rounding rule is that when a calculation involves several steps, rounding should not be done until the final answer is calculated.

THE MEAN
The mean, also known as the arithmetic average, is found by adding the values of the data and dividing by the total number of values.
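The definition of the mean can be sketched in a few lines of Python. This is only an illustrative sketch and not part of the original slides: the function name `mean` and the variable names are assumptions, and the five values are the donut-store sales figures quoted in Example 2 of this lesson. Rounding is applied once, at the end, in line with the general rounding rule.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Systemwide sales in millions of dollars (the Example 2 data).
sales = [221, 239, 262, 281, 318]

# Round only the final answer, to one more decimal place than the raw data.
sales_mean = round(mean(sales), 1)
print(sales_mean)  # 1321 / 5 = 264.2
```

The same function serves for a sample mean or a population mean; only the interpretation (statistic versus parameter) changes.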
❑ The sample mean, denoted by \(\bar{X}\) (pronounced “X bar”), is calculated by using sample data. The sample mean is a statistic.

\[ \bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{\sum X}{n} \]

where n represents the total number of values in the sample.

❑ The population mean, denoted by μ (pronounced “mew”), is calculated by using all the values in the population. The population mean is a parameter.

\[ \mu = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{\sum X}{N} \]

where N represents the total number of values in the population.

WORDED PROBLEM
Example 1: The number of confirmed flu cases for a 9-year period is shown. Find the mean.
4 46 98 115 88 44 73 48 62
Source: World Health Organization

Example 2: The data show the systemwide sales (in millions) for U.S. franchises of a well-known donut store for a 5-year period. Find the mean.
$221 $239 $262 $281 $318
Source: Krispy Kreme.

Rounding Rule for the Mean: The mean should be rounded to one more decimal place than occurs in the raw data.
For example, if the raw data are given in whole numbers, the mean should be rounded to the nearest tenth.

MEASURES OF CENTRAL TENDENCY
❑ The mean, in most cases, is not an actual data value.
❑ The procedure for finding the mean for grouped data assumes that the mean of all the raw data values in each class is equal to the midpoint of the class.
❑ In reality, this is not true, since the average of the raw data values in each class usually will not be exactly equal to the midpoint. However, using this procedure will give an acceptable approximation of the mean, since some values fall above the midpoint and other values fall below the midpoint for each class, and the midpoint represents an estimate of all values in the class.

WORDED PROBLEM
Example 3: The frequency distribution shows the salaries (in millions) for a specific year of the top 25 CEOs in the United States. Find the mean.
Source: S & P Capital.

MEASURES OF CENTRAL TENDENCY
THE MEDIAN
The median is the halfway point in a data set. Before you can find this point, the data must be arranged in ascending order. When the data set is ordered, it is called a data array. The median either will be a specific value in the data set or will fall between two values. The median is the midpoint of the data array. The symbol for the median is MD.

WORDED PROBLEM
Example 4: The data show the number of tablet sales in millions of units for a 5-year period. Find the median of the data.
108.2 17.6 159.8 69.8 222.6
Source: Gartner.

Example 5: The number of tornadoes that have occurred in the United States over an 8-year period is as follows. Find the median.
684, 764, 656, 702, 856, 1133, 1132, 1303
Source: The Universal Almanac.

MEASURES OF CENTRAL TENDENCY
THE MODE
The third measure of average is called the mode. The mode is the value that occurs most often in the data set. It is sometimes said to be the most typical case.
❑ A data set that has only one value that occurs with the greatest frequency is said to be unimodal.
❑ If a data set has two values that occur with the same greatest frequency, both values are considered to be the mode and the data set is said to be bimodal.
❑ If a data set has more than two values that occur with the same greatest frequency, each value is used as the mode, and the data set is said to be multimodal.
❑ When no data value occurs more than once, the data set is said to have no mode.

Note: Do not say that the mode is zero. That would be incorrect, because in some data, such as temperature, zero can be an actual value.

WORDED PROBLEM
Example 6: The data show the number of public libraries in a sample of eight states. Find the mode.
114 77 21 101 311 77 159 382
Source: The World Almanac.

Example 7: The data show the number of licensed nuclear reactors in the United States for a recent 15-year period. Find the mode.
Source: The World Almanac and Book of Facts.

Example 8: The data show the number of patents secured for the top 5 companies for a specific year. Find the mode.
6180 4894 2821 2559 2483
Source: IFI Claims Patent Services.

Example 9: Find the modal class for the frequency distribution for the salaries of the top CEOs in the United States, shown in Example 3.

NOTE: The mode for grouped data is the modal class. The modal class is the class with the largest frequency.
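The median and mode rules above can be sketched in Python as follows. This is a minimal sketch, not the lesson's own procedure: the function names `median` and `modes` are illustrative, and returning an empty list to signal "no mode" is a design assumption that reflects the rule that a data set with no repeated value has no mode. The data are those quoted in Examples 4, 5, 6, and 8.

```python
from collections import Counter


def median(values):
    """Midpoint of the ordered data array (MD)."""
    data = sorted(values)  # the data must be arranged in ascending order first
    n = len(data)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                     # odd n: the middle value itself
    return (data[mid - 1] + data[mid]) / 2   # even n: halfway between the two middle values


def modes(values):
    """Every value tied for the greatest frequency; an empty list means no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # no value occurs more than once, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)


tablet_sales = [108.2, 17.6, 159.8, 69.8, 222.6]          # Example 4
tornadoes = [684, 764, 656, 702, 856, 1133, 1132, 1303]   # Example 5
libraries = [114, 77, 21, 101, 311, 77, 159, 382]         # Example 6
patents = [6180, 4894, 2821, 2559, 2483]                  # Example 8

print(median(tablet_sales))  # 108.2
print(median(tornadoes))     # 810.0, halfway between 764 and 856
print(modes(libraries))      # [77]
print(modes(patents))        # [] (no mode)
```

A list with one entry corresponds to a unimodal data set, two entries to a bimodal one, and more than two to a multimodal one.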
WORDED PROBLEM
Example 10: The data show the number of gallons of various nonalcoholic drinks Americans consume in a year. Find the mode.
Source: U.S. Department of Agriculture

NOTE: The mode is the only measure of central tendency that can be used in finding the most typical case when the data are nominal or categorical.

Example 11: A small company consists of the owner, the manager, the salesperson, and two technicians, all of whose annual salaries are listed here. (Assume that this is the entire population.) Find the mean, median, and mode.

SOLUTION:

\[ \mu = \frac{\sum X}{N} = \frac{100{,}000 + 40{,}000 + 24{,}000 + 18{,}000 + 18{,}000}{5} = \frac{\$200{,}000}{5} = \$40{,}000 \]

Hence, the mean is $40,000, the median is $24,000, and the mode is $18,000.

NOTE: An extremely high or extremely low data value in a data set can have a striking effect on the mean of the data set. These extreme values are called outliers. This is one reason why, when analyzing a frequency distribution, you should be aware of any of these values. For the data set shown in this example, the mean, median, and mode are quite different because of the extreme value.
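To see the effect of the outlier in Example 11, the three measures can be recomputed with Python's standard `statistics` module. This is an illustrative sketch: the variable names are assumptions, and dropping the largest salary is an experiment added here for comparison, not a step in the original solution.

```python
import statistics

# Annual salaries from the Example 11 solution (entire population).
salaries = [100_000, 40_000, 24_000, 18_000, 18_000]

mu = statistics.mean(salaries)      # 40000
md = statistics.median(salaries)    # 24000
mode = statistics.mode(salaries)    # 18000

# Illustration only: remove the largest salary to see how far one outlier pulls the mean.
without_outlier = sorted(salaries)[:-1]
mu_trimmed = statistics.mean(without_outlier)    # 100000 / 4 = 25000
md_trimmed = statistics.median(without_outlier)  # halfway between 18000 and 24000

print(mu, md, mode)
print(mu_trimmed, md_trimmed)
```

Removing the single extreme value moves the mean by $15,000 but the median by only $3,000, which is why the median is often preferred when outliers are present.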
MEASURES OF CENTRAL TENDENCY
THE MIDRANGE
The midrange is a rough estimate of the middle. It is found by adding the lowest and highest values in the data set and dividing by 2. It is a very rough estimate of the average and can be affected by one extremely high or low value.

\[ MR = \frac{\text{lowest value} + \text{highest value}}{2} \]

WORDED PROBLEM
Example 12: The number of bank failures for a recent five-year period is shown. Find the midrange.
3, 30, 148, 157, 71
Source: Federal Deposit Insurance Corporation.

Example 13: Find the midrange of data for the NFL signing bonuses. The bonuses in millions of dollars are
18, 14, 34.5, 10, 11.3, 10, 12.4, 10

NOTE: In statistics, several measures can be used for an average. The most common measures are the mean, median, mode, and midrange. Each has its own specific purpose and use.
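The midrange formula translates directly into code. The sketch below is illustrative only; the function name `midrange` is an assumption, and the data are those given in Examples 12 and 13.

```python
def midrange(values):
    """MR = (lowest value + highest value) / 2."""
    if not values:
        raise ValueError("the midrange of an empty data set is undefined")
    return (min(values) + max(values)) / 2


bank_failures = [3, 30, 148, 157, 71]                     # Example 12
signing_bonuses = [18, 14, 34.5, 10, 11.3, 10, 12.4, 10]  # Example 13, millions of dollars

print(midrange(bank_failures))    # (3 + 157) / 2 = 80.0
print(midrange(signing_bonuses))  # (10 + 34.5) / 2 = 22.25
```

Because only the two extreme values enter the calculation, a single unusually large bonus (34.5) determines half of the result.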
MEASURES OF CENTRAL TENDENCY
WEIGHTED MEAN
The type of mean that considers an additional factor is called the weighted mean, and it is used when the values are not all equally represented.

WORDED PROBLEM
Example 14: A student received an A in English Composition I (3 credits), a C in Introduction to Psychology (3 credits), a B in Biology I (4 credits), and a D in Physical Education (2 credits). Assuming A = 4 grade points, B = 3 grade points, C = 2 grade points, D = 1 grade point, and F = 0 grade points, find the student’s grade point average.

DISTRIBUTION SHAPES
Frequency distributions can assume many shapes. The three most important shapes are positively skewed, symmetric, and negatively skewed.

In a positively skewed or right-skewed distribution, the majority of the data values fall to the left of the mean and cluster at the lower end of the distribution; the “tail” is to the right. For example, if an instructor gave an examination and most of the students did poorly, their scores would tend to cluster on the left side of the distribution.
A few high scores would constitute the tail of the distribution, which would be on the right side.

In a symmetric distribution, the data values are evenly distributed on both sides of the mean. In addition, when the distribution is unimodal, the mean, median, and mode are the same and are at the center of the distribution. Examples of symmetric distributions are IQ scores and heights of adult males.

When the majority of the data values fall to the right of the mean and cluster at the upper end of the distribution, with the tail to the left, the distribution is said to be negatively skewed or left-skewed. As an example, a negatively skewed distribution results if the majority of students score very high on an instructor’s examination. These scores will tend to cluster to the right of the distribution.

OTHER CONCEPTS:
GEOMETRIC MEAN: The geometric mean (GM) is defined as the nth root of the product of n values.

\[ GM = \sqrt[n]{(X_1)(X_2)(X_3)\cdots(X_n)} \]

HARMONIC MEAN: The harmonic mean (HM) is defined as the number of values divided by the sum of the reciprocals of each value.

\[ HM = \frac{n}{\sum \frac{1}{X}} \]

QUADRATIC MEAN: A useful mean in the physical sciences (such as voltage) is the quadratic mean (QM), which is found by taking the square root of the average of the squares of each value.

\[ QM = \sqrt{\frac{\sum X^2}{n}} \]
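The weighted mean from Example 14 and the three means defined above can be sketched together in Python. This is a hedged illustration rather than the lesson's worked solution: the function names are assumptions, the weighted mean uses the usual definition \(\sum wX / \sum w\) (which the slide text does not spell out), and the list `[2, 8]` is a hypothetical data set chosen only because its geometric mean is easy to check by hand.

```python
import math


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def geometric_mean(values):
    """nth root of the product of n values."""
    return math.prod(values) ** (1 / len(values))


def harmonic_mean(values):
    """Number of values divided by the sum of their reciprocals."""
    return len(values) / sum(1 / x for x in values)


def quadratic_mean(values):
    """Square root of the average of the squared values."""
    return math.sqrt(sum(x * x for x in values) / len(values))


# Example 14: grade points (A, C, B, D) weighted by course credits.
grade_points = [4, 2, 3, 1]
credits = [3, 3, 4, 2]
gpa = weighted_mean(grade_points, credits)  # 32 / 12
print(round(gpa, 1))  # 2.7

# Hypothetical data, not from the lesson.
sample = [2, 8]
print(geometric_mean(sample))  # square root of 16
print(harmonic_mean(sample))   # 2 / (1/2 + 1/8)
print(quadratic_mean(sample))  # square root of 34
```

For positive data that are not all equal, these satisfy HM < GM < arithmetic mean < QM, which the hypothetical sample illustrates (3.2, 4, 5, about 5.83).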
MEASURES OF VARIATION
In statistics, to describe the data set accurately, statisticians must know more than the measures of central tendency. Consider this example:

COMPARISON OF OUTDOOR PAINT
A testing lab wishes to test two experimental brands of outdoor paint to see how long each will last before fading. The testing lab makes 6 gallons of each paint to test. Since different chemical agents are added to each group and only six cans are involved, these two groups constitute two small populations. The results (in months) are shown. Find the mean of each group.

Brand A: 10, 60, 50, 30, 40, 20
Brand B: 35, 45, 30, 35, 40, 25

SOLUTION:
The mean for brand A is

\[ \mu = \frac{\sum X}{N} = \frac{210}{6} = 35 \text{ months} \]

The mean for brand B is

\[ \mu = \frac{\sum X}{N} = \frac{210}{6} = 35 \text{ months} \]

Graphically, a different conclusion might be drawn.

MEASURES OF VARIATION
RANGE
The range is the simplest of the three measures of variation. The range is the highest value minus the lowest value. The symbol R is used for the range.

R = highest value − lowest value

For the previous example, find the ranges for the paints.
For brand A, the range is R = 60 − 10 = 50 months
For brand B, the range is R = 45 − 25 = 20 months

Conclusion: The range for brand A shows that 50 months separate the largest data value from the smallest data value.
For brand B, 20 months separate the largest data value from the smallest data value, which is less than one-half of brand A’s range.

WORDED PROBLEM
Example 15: The data show a sample of the top-grossing movies in millions of dollars. Find the range.
$409 $386 $150 $117 $73 $70
Source: The World Almanac and Book of Facts.

CONCLUSION: The range for these data is quite large since it depends on the highest data value and the lowest data value. To have a more meaningful statistic to measure the variability, statisticians use measures called the variance and standard deviation.

NOTE: Make sure the range is given as a single number.

MEASURES OF VARIATION
POPULATION VARIANCE AND STANDARD DEVIATION
It is necessary to know what data variation means. It is based on the difference or distance each data value is from the mean. This difference or distance is called a deviation.

❑ The population variance is the average of the squares of the distance each value is from the mean. The symbol for the population variance is \(\sigma^2\) and the formula is:

\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} \]

where:
X → individual value
μ → population mean
N → population size

❑ The population standard deviation is the square root of the variance.
The symbol for the population standard deviation is σ and its formula is:

\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}} \]

MEASURES OF VARIATION
FINDING THE POPULATION VARIANCE & POPULATION STANDARD DEVIATION
Step 1 Find the mean for the data: \( \mu = \frac{\sum X}{N} \)
Step 2 Find the deviation for each data value: \( X - \mu \)
Step 3 Square each of the deviations: \( (X - \mu)^2 \)
Step 4 Find the sum of the squares: \( \sum (X - \mu)^2 \)
Step 5 Divide by N to get the variance: \( \sigma^2 = \frac{\sum (X - \mu)^2}{N} \)
Step 6 Take the square root of the variance to get the standard deviation: \( \sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}} \)

ROUNDING RULE FOR THE STANDARD DEVIATION: The rounding rule for the standard deviation is the same as that for the mean. The final answer should be rounded to one more decimal place than that of the original data.

WORDED PROBLEM
Example 16: Find the variance and standard deviation for the data set for brand A paint in the previous example about the comparison of outdoor paint. The number of months brand A lasted before fading was
10, 60, 50, 30, 40, 20

SOLUTION:
The sum of the squared deviations is 1750. Thus, the variance is:

\[ \sigma^2 = \frac{1750}{6} \rightarrow \sigma^2 = 291.7 \]

Then, the standard deviation is:

\[ \sigma = \sqrt{291.7} \rightarrow \sigma = 17.1 \]

Example 17: Find the variance and standard deviation for the data set for brand B paint in the previous example about the comparison of outdoor paint. The number of months brand B lasted before fading was
35, 45, 30, 35, 40, 25

SOLUTION:
The sum of the squared deviations is 250. Thus, the variance is:

\[ \sigma^2 = \frac{250}{6} \rightarrow \sigma^2 = 41.7 \]

Then, the standard deviation is:

\[ \sigma = \sqrt{41.7} \rightarrow \sigma = 6.5 \]

Since the standard deviation of brand A is 17.1 and the standard deviation of brand B is 6.5, the data are more variable for brand A. In summary, when the means are equal, the larger the variance or standard deviation is, the more variable the data are.

MEASURES OF VARIATION
SAMPLE VARIANCE AND STANDARD DEVIATION
When computing the variance for a sample, one might expect the following expression to be used:

\[ \frac{\sum (X - \bar{X})^2}{n} \]

where \(\bar{X}\) is the sample mean and n is the sample size. This formula is not usually used, however, since in most cases the purpose of calculating the statistic is to estimate the corresponding parameter. For example, the sample mean \(\bar{X}\) is used to estimate the population mean μ.

❑ The expression above does not give the best estimate of the population variance because when the population is large and the sample is small (usually less than 30), the variance computed by this formula usually underestimates the population variance.
❑ Therefore, instead of dividing by n, find the variance of the sample by dividing by n − 1, giving a slightly larger value and an unbiased estimate of the population variance.
❑ The sample variance is denoted by \(s^2\) and the formula is:

\[ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \]

where:
X → individual value
\(\bar{X}\) → sample mean
n → sample size

❑ The formula for the sample standard deviation, denoted by s, is

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]

NOTE: The procedure for finding the sample variance and the sample standard deviation is the same as the procedure for finding the population variance and the population standard deviation, except the sum of the squares is divided by n − 1 (sample size minus 1) instead of N (population size).

WORDED PROBLEM
Example 18: The number of public school teacher strikes in Pennsylvania for a random sample of school years is shown. Find the sample variance and the sample standard deviation.
9 10 14 7 8 3
Source: Pennsylvania School Board Association.

MEASURES OF VARIATION
SHORTCUT COMPUTATION FORMULAS
The shortcut formulas for computing the variance and standard deviation for data obtained from samples are as follows.

VARIANCE:

\[ s^2 = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)} \]

STANDARD DEVIATION:

\[ s = \sqrt{\frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)}} \]

NOTE: \(\sum X^2\) is not the same as \(\left(\sum X\right)^2\). The notation \(\sum X^2\) means to square the values first, then sum; \(\left(\sum X\right)^2\) means to sum the values first, then square the sum.

WORDED PROBLEM
Example 19: The number of public school teacher strikes in Pennsylvania for a random sample of school years is shown. Find the sample variance and the sample standard deviation. Use the shortcut formulas.
9 10 14 7 8 3
Source: Pennsylvania School Board Association.
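The six-step procedure and the sample formulas above can be checked with a short Python sketch. This is an illustration under stated assumptions, not the lesson's own worked solutions: the function names are hypothetical, and the data are the paint lifetimes from Examples 16 and 17 and the strike counts from Examples 18 and 19. The last function applies the shortcut formula so that its result can be compared with the definition-based one.

```python
import math


def population_variance(values):
    """Steps 1 to 5: average of the squared deviations from the mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Same procedure, but the sum of squares is divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("a sample variance needs at least two values")
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Shortcut formula: (n * sum(X^2) - (sum X)^2) / (n * (n - 1))."""
    n = len(values)
    if n < 2:
        raise ValueError("a sample variance needs at least two values")
    sum_x = sum(values)                    # sum first ...
    sum_x2 = sum(x * x for x in values)    # ... versus square first, then sum
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


brand_a = [10, 60, 50, 30, 40, 20]   # Example 16 (population)
brand_b = [35, 45, 30, 35, 40, 25]   # Example 17 (population)
strikes = [9, 10, 14, 7, 8, 3]       # Examples 18 and 19 (sample)

var_a, var_b = population_variance(brand_a), population_variance(brand_b)
print(round(var_a, 1), round(math.sqrt(var_a), 1))  # 291.7 17.1
print(round(var_b, 1), round(math.sqrt(var_b), 1))  # 41.7 6.5

s2 = sample_variance(strikes)
s2_short = sample_variance_shortcut(strikes)
print(round(s2, 1), round(math.sqrt(s2), 1))  # 13.1 3.6
print(round(s2_short, 1))                     # 13.1, same as the definition
```

Step 6 (the square root) is taken from the unrounded variance, in line with the general rounding rule of rounding only the final answer.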
MEASURES OF VARIATION
SHORTCUT COMPUTATION FORMULAS
The procedure for finding the variance and standard deviation for grouped data is similar to that for finding the mean for grouped data, and it uses the midpoints of each class.
SAMPLE VARIANCE:
\[ s^2 = \frac{n\left(\sum f \cdot X_m^2\right) - \left(\sum f \cdot X_m\right)^2}{n(n - 1)} \]
SAMPLE STANDARD DEVIATION:
\[ s = \sqrt{\frac{n\left(\sum f \cdot X_m^2\right) - \left(\sum f \cdot X_m\right)^2}{n(n - 1)}} \]
where \(X_m\) is the midpoint of each class and \(f\) is the frequency of each class.

MEASURES OF VARIATION
PROCEDURE: Finding the Sample Variance and Standard Deviation for Grouped Data
Step 1 Make a table as shown and find the midpoint of each class.
Step 2 Multiply the frequency by the midpoint for each class, and place the products in column D.
Step 3 Multiply the frequency by the square of the midpoint, and place the products in column E.
Step 4 Find the sums of columns B, D, and E. (The sum of column B is \(n\). The sum of column D is \(\sum f \cdot X_m\). The sum of column E is \(\sum f \cdot X_m^2\).)
Step 5 Substitute in the formula and solve to get the variance.
\[ s^2 = \frac{n\left(\sum f \cdot X_m^2\right) - \left(\sum f \cdot X_m\right)^2}{n(n - 1)} \]
Step 6 Take the square root to get the standard deviation.

WORDED PROBLEM
Example 20: Find the sample variance and the sample standard deviation for the frequency distribution of the data shown.
The data represent the number of miles that 20 runners ran during one week.

COEFFICIENT OF VARIATION
❑ Whenever two samples have the same units of measure, the variance and standard deviation for each can be compared directly.
❑ A statistic that allows you to compare standard deviations when the units are different is called the coefficient of variation.
❑ The coefficient of variation, denoted by CVar, is the standard deviation divided by the mean. The result is expressed as a percentage.
For samples,
\[ CVar = \frac{s}{\bar{X}} \cdot 100 \]
For populations,
\[ CVar = \frac{\sigma}{\mu} \cdot 100 \]

WORDED PROBLEM
Example 21: The mean of the number of sales of cars over a 3-month period is 87, and the standard deviation is 5. The mean of the commissions is $5225, and the standard deviation is $773. Compare the variations of the two.
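As a rough illustration of how the coefficient of variation makes the two quantities in Example 21 comparable, the sample formula can be evaluated directly. The helper name `coefficient_of_variation` is an assumption, and the slide's own worked solution is not available in this transcript.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return std_dev / mean * 100


# Values stated in Example 21.
cvar_sales = coefficient_of_variation(std_dev=5, mean=87)
cvar_commissions = coefficient_of_variation(std_dev=773, mean=5225)

print(f"Sales:       CVar = {cvar_sales:.1f}%")
print(f"Commissions: CVar = {cvar_commissions:.1f}%")

# The data set with the larger coefficient is the more variable one.
more_variable = "commissions" if cvar_commissions > cvar_sales else "sales"
print(f"More variable: {more_variable}")
```

The sales figure comes to roughly 5.7% and the commissions figure to roughly 14.8%, so the commissions vary more relative to their mean even though the two are measured in different units.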
WORDED PROBLEM
Example 22: The mean speed for the five fastest wooden roller coasters is 69.16 miles per hour, and the variance is 2.76. The mean height for the five tallest roller coasters is 177.80 feet, and the variance is 157.70. Compare the variations of the two data sets.

MEASURES OF VARIATION
RANGE RULE OF THUMB
❑ The range can be used to approximate the standard deviation. The approximation is called the range rule of thumb.
❑ A rough estimate of the standard deviation is
\[ s \approx \frac{\text{range}}{4} \]

CHEBYSHEV’S THEOREM
\[ \text{Mean value} \pm (k)(\text{Standard deviation}) \]
Chebyshev’s theorem specifies the proportions of the spread in terms of the standard deviation. The proportion of values from a data set that will fall within ‘k’ standard deviations of the mean will be at least \(1 - \frac{1}{k^2}\), where ‘k’ is a number greater than 1 (k is not necessarily an integer).
In summary, then, Chebyshev’s theorem states:
❑ At least three-fourths, or 75%, of all data values fall within 2 standard deviations of the mean.
❑ At least eight-ninths, or about 89%, of all data values fall within 3 standard deviations of the mean.
This theorem can be applied to any distribution regardless of its shape.
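The bound in Chebyshev’s theorem is simple enough to tabulate. The following sketch, with the assumed helper names `chebyshev_min_proportion` and `chebyshev_interval`, reproduces the 75% and 89% figures and shows that a non-integer k works as well. The mean of 100 and standard deviation of 15 in the interval call are hypothetical values chosen only for illustration.

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


def chebyshev_interval(mean, std_dev, k):
    """Interval mean - k*s to mean + k*s covered by the bound."""
    return (mean - k * std_dev, mean + k * std_dev)


for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_min_proportion(k):.1%} of values")

# Hypothetical numbers (not from the lesson): mean 100, standard deviation 15.
low, high = chebyshev_interval(mean=100, std_dev=15, k=2)
print(f"At least 75% of values lie between {low} and {high}")
```

Because the bound makes no assumption about the shape of the distribution, it is deliberately conservative; actual data sets usually have more values inside the interval than the minimum it guarantees.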
WORDED PROBLEM
Example 23: The mean price of houses in a certain neighborhood is $50,000, and the standard deviation is $10,000. Find the price range for which at least 75% of the houses will sell.

WORDED PROBLEM
Example 24: A survey of local companies found that the mean amount of travel allowance for couriers was $0.25 per mile. The standard deviation was $0.02. Using Chebyshev’s theorem, find the minimum percentage of the data values that will fall between $0.20 and $0.30.
NOTE: Chebyshev’s theorem can be used to find the minimum percentage of data values that will fall between any two given values that are equally distant from the mean.
PROCEDURE:
1. Subtract the mean from the larger value.
2. Divide the difference by the standard deviation to get k.
3. Use Chebyshev’s theorem to find the percentage.

MEASURES OF VARIATION
THE EMPIRICAL RULE (NORMAL RULE)
Chebyshev’s theorem applies to any distribution regardless of its shape. However, when a distribution is bell-shaped (or what is called normal), the following statements, which make up the empirical rule, are true.
Approximately 68% of the data values will fall within 1 standard deviation of the mean.
Approximately 95% of the data values will fall within 2 standard deviations of the mean.
Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.
NOTE: Because the empirical rule requires that the distribution be approximately bell-shaped, the results are more accurate than those of Chebyshev’s theorem, which applies to all distributions.

MEASURES OF VARIATION
LINEAR TRANSFORMATION OF DATA
❑ In statistics, sometimes it is necessary to transform the data values into other data values.
❑ For example, if you are using temperature values collected from the Philippines, these values will be given using the Celsius temperature scale. If the study is to be used in the United States, you might want to change the data values to the Fahrenheit temperature scale. This change is called the linear transformation of the data.
QUESTION: How does a linear transformation of the data values affect the mean and standard deviation of the data values?

MEASURES OF VARIATION
LINEAR TRANSFORMATION OF DATA
Example: Suppose you own a small store with five employees. Their hourly salaries are:
$10 $13 $10 $11 $16
The mean of the salaries is \(\bar{X}\) = $12, and the standard deviation is 2.550. Suppose you decide after a profitable year to give each employee a raise of $1.00 per hour.
The new salaries would be:
$11 $14 $11 $12 $17
The mean of the new salaries is \(\bar{X}\) = $13, and the standard deviation of the new salaries is 2.550.
OBSERVATION: The value of the mean increases by the amount that was added to each data value, and the standard deviation does not change.

MEASURES OF VARIATION
LINEAR TRANSFORMATION OF DATA
Example: Suppose that the five employees worked the number of hours per week shown here:
15 12 18 20 10
The mean of the number of hours is \(\bar{X}\) = 15, and the standard deviation is 4.123. Suppose you decide to double the amount of each employee’s hours for December. The new number of hours would be:
30 24 36 40 20
The mean of the new number of hours is \(\bar{X}\) = 30, and the standard deviation of the new number of hours is 8.246.
OBSERVATION: When each data value is multiplied by a constant, the mean of the new data set will be equal to the constant times the mean of the original set, and the standard deviation of the new data set will be equal to the absolute value (positive value) of the constant times the standard deviation of the original data set.
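Both observations can be checked numerically. The sketch below recomputes the mean and sample standard deviation for the salary and hours data before and after each transformation, using Python's standard `statistics` module. The helper name `summarize` is an assumption for illustration.

```python
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) for a list of values."""
    return statistics.mean(values), statistics.stdev(values)


# Adding a constant: each hourly salary is raised by $1.
salaries = [10, 13, 10, 11, 16]
raised = [x + 1 for x in salaries]

# Multiplying by a constant: each employee's weekly hours are doubled.
hours = [15, 12, 18, 20, 10]
doubled = [2 * x for x in hours]

for label, data in [("salaries", salaries), ("raised", raised),
                    ("hours", hours), ("doubled", doubled)]:
    mean, sd = summarize(data)
    print(f"{label:>8}: mean = {mean}, s = {sd:.3f}")
```

The added constant shifts the mean from 12 to 13 and leaves the standard deviation at about 2.550, while doubling the hours doubles both the mean (15 to 30) and the standard deviation (about 4.123 to about 8.246).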
MEASURES OF VARIATION
ADDITIONAL CONCEPTS:
MEAN DEVIATION
\[ \text{Mean deviation} = \frac{\sum |X - \bar{X}|}{n} \]
Where:
\(X\) → value
\(\bar{X}\) → mean
\(n\) → number of values
PEARSON COEFFICIENT OF SKEWNESS
A measure to determine the skewness of a distribution is called the Pearson coefficient (PC) of skewness.
\[ PC = \frac{3(\bar{X} - MD)}{s} \]
Where:
\(\bar{X}\) → mean
\(MD\) → median
\(s\) → standard deviation

MEASURES OF POSITION
❑ In addition to measures of central tendency and measures of variation, there are measures of position or location.
❑ These measures include standard scores, percentiles, deciles, and quartiles. They are used to locate the relative position of a data value in the data set.
STANDARD SCORES
A standard score or z score tells how many standard deviations a data value is above or below the mean for a specific distribution of values. If a standard score is zero, then the data value is the same as the mean.
\[ z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \]
For samples,
\[ z = \frac{X - \bar{X}}{s} \]
For populations,
\[ z = \frac{X - \mu}{\sigma} \]
The z score represents the number of standard deviations that a data value falls above or below the mean.

WORDED PROBLEM
Example 25: A student scored 85 on an English test while the mean score of all the students was 76 and the standard deviation was 4.
She also scored 42 on a French test where the class mean was 36 and the standard deviation was 3. Compare the relative positions on the two tests.

WORDED PROBLEM
Example 26: In a recent study, the mean age at which men get married is said to be 26.4 years with a standard deviation of 2 years. The mean age at which women marry is 23.5 years with a standard deviation of 2.3 years. Find the relative positions for a man who marries at age 24 and a woman who marries at age 22.

MEASURES OF POSITION
PERCENTILES
❑ Percentiles divide the data set into 100 equal groups.
❑ Percentiles are position measures used in educational and health-related fields to indicate the position of an individual in a group.
The percentile corresponding to a given value X is computed by using the following formula:
\[ \text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100 \]
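The percentile formula translates almost directly into code. As a minimal sketch, the function below counts the values strictly below X, adds 0.5, divides by the total number of values, and multiplies by 100 to express the result as a percentile. The name `percentile_rank` is an assumption, and the data are the traffic-violation counts that the lesson uses in Example 28.

```python
def percentile_rank(values, x):
    """Percentile of x: (count of values below x + 0.5) / n, times 100."""
    below = sum(1 for v in values if v < x)
    # Multiply before dividing to avoid small floating-point artifacts.
    return (below + 0.5) * 100 / len(values)


# Traffic violations recorded over a 10-day period (data from Example 28).
violations = [22, 19, 25, 24, 18, 15, 9, 12, 16, 20]

rank_16 = percentile_rank(violations, 16)
rank_24 = percentile_rank(violations, 24)

print(f"16 is at about the {rank_16:.0f}th percentile")
print(f"24 is at about the {rank_24:.0f}th percentile")
```

Three values fall below 16 and eight fall below 24, which places them at the 35th and 85th percentiles of this data set. The data do not need to be sorted first, since only a count of smaller values is required.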
WORDED PROBLEM
Example 27: The frequency distribution for the systolic blood pressure readings (in millimeters of mercury, mm Hg) of 200 randomly selected college students is shown here. Construct a percentile graph.
SOLUTION:
Note that a blood pressure of 130 corresponds to approximately the 70th percentile. Also, the 40th percentile corresponds to a value of approximately 118. Thus, if a person has a blood pressure of 118, he or she is at the 40th percentile.

WORDED PROBLEM
Example 28: The number of traffic violations recorded by a police department for a 10-day period is shown. Find the percentile ranks of 16 and 24.
22 19 25 24 18 15 9 12 16 20

MEASURES OF POSITION
PERCENTILES
Finding a Data Value Corresponding to a Given Percentile
Step 1 Arrange the data in order from lowest to highest.
Step 2 Substitute into the formula
\[ c = \frac{n \cdot p}{100} \]
where
n = total number of values
p = percentile
Step 3A If c is not a whole number, round up to the next whole number. Starting at the lowest value, count over to the number that corresponds to the rounded-up value.
Step 3B If c is a whole number, use the value halfway between the cth and (c + 1)st values when counting up from the lowest value.
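Steps 1 to 3B above can be sketched as a short function. The name `value_at_percentile` and the restriction to 0 < p < 100 are assumptions made for this illustration; the data are the traffic-violation counts from Example 28, evaluated at the 65th and 30th percentiles so that both branches of Step 3 are exercised.

```python
def value_at_percentile(values, p):
    """Data value at the p-th percentile, following Steps 1, 2, 3A and 3B."""
    if not 0 < p < 100:
        raise ValueError("this sketch assumes 0 < p < 100")
    ordered = sorted(values)                  # Step 1
    n = len(ordered)
    whole, remainder = divmod(n * p, 100)     # Step 2: c = n * p / 100
    whole = int(whole)
    if remainder:
        # Step 3A: c is not whole, so round up and take that position.
        return ordered[whole]                 # position whole + 1, 1-based
    # Step 3B: c is whole, so average the c-th and (c + 1)-st values.
    return (ordered[whole - 1] + ordered[whole]) / 2


# Traffic violations recorded over a 10-day period (data from Example 28).
violations = [22, 19, 25, 24, 18, 15, 9, 12, 16, 20]

p65 = value_at_percentile(violations, 65)  # c = 6.5, rounds up to 7th value
p30 = value_at_percentile(violations, 30)  # c = 3, average of 3rd and 4th

print(f"65th percentile value: {p65}")
print(f"30th percentile value: {p30}")
```

For p = 65, c = 6.5 rounds up to 7 and the seventh ordered value is 20. For p = 30, c = 3 is a whole number, so the result is halfway between the third and fourth ordered values, 15 and 16, giving 15.5.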
WORDED PROBLEM
Example 29: Using the data from Example 28, find the values corresponding to the 65th and 30th percentiles.
22 19 25 24 18 15 9 12 16 20

MEASURES OF POSITION
QUARTILES
Quartiles divide the distribution into four equal groups, separated by \(Q_1\), \(Q_2\), and \(Q_3\).
Note: \(Q_1\) is the same as the 25th percentile; \(Q_2\) is the same as the 50th percentile, or the median; \(Q_3\) corresponds to the 75th percentile, as shown:
Finding Data Values Corresponding to \(Q_1\), \(Q_2\), and \(Q_3\)
Step 1 Arrange the data in order from lowest to highest.
Step 2 Find the median of the data values. This is the value for \(Q_2\).
Step 3 Find the median of the data values that fall below \(Q_2\). This is the value for \(Q_1\).
Step 4 Find the median of the data values that fall above \(Q_2\). This is the value for \(Q_3\).

WORDED PROBLEM
Example 30: Using the data from Example 28, find the values corresponding to \(Q_1\), \(Q_2\), and \(Q_3\).
22 19 25 24 18 15 9 12 16 20
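The four-step quartile procedure can be sketched as follows for the Example 30 data. The function name `quartiles_by_halves` is an assumption, as is the handling of an odd number of values, where the middle value is left out of both halves. Other quartile conventions exist and can give slightly different results on small data sets.

```python
import statistics


def quartiles_by_halves(values):
    """Q1, Q2, Q3 found as the medians of the lower half, whole set, upper half."""
    ordered = sorted(values)                   # Step 1
    n = len(ordered)
    q2 = statistics.median(ordered)            # Step 2
    lower_half = ordered[: n // 2]             # values below Q2
    upper_half = ordered[(n + 1) // 2:]        # values above Q2
    q1 = statistics.median(lower_half)         # Step 3
    q3 = statistics.median(upper_half)         # Step 4
    return q1, q2, q3


# Traffic violations recorded over a 10-day period (data from Example 28).
violations = [22, 19, 25, 24, 18, 15, 9, 12, 16, 20]

q1, q2, q3 = quartiles_by_halves(violations)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```

The ordered data are 9, 12, 15, 16, 18, 19, 20, 22, 24, 25, so the median is halfway between 18 and 19, the lower half has median 15, and the upper half has median 22.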
MEASURES OF POSITION
INTERQUARTILE RANGE (IQR)
❑ In addition to dividing the data set into four groups, quartiles can be used as a rough measure of variability. This measure of variability, which uses quartiles, is called the interquartile range and is the range of the middle 50% of the data values.
❑ The interquartile range (IQR) is the difference between the third and first quartiles.
\[ IQR = Q_3 - Q_1 \]
NOTE: As with the standard deviation, the more variable the data set is, the larger the value of the interquartile range will be.

DECILES
❑ Deciles divide the distribution into 10 groups, as shown. They are denoted by \(D_1\), \(D_2\), \(D_3\), etc.
❑ Note that \(D_1\) corresponds to \(P_{10}\); \(D_2\) corresponds to \(P_{20}\); etc. Deciles can be found by using the formulas given for percentiles.

MEASURES OF POSITION
OUTLIERS
❑ An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.
❑ An outlier can strongly affect the mean and standard deviation of a variable.
❑ Since these measures (mean and standard deviation) are affected by outliers, they are called nonresistant statistics.
❑ The median and interquartile range are less affected by outliers, so they are called resistant statistics.
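The difference between resistant and nonresistant statistics can be illustrated with a small experiment. The sketch below takes the traffic-violation data from Example 28 and builds a second, hypothetical copy in which the largest value, 25, is replaced by 250. That altered data set is an assumption made purely for illustration and does not appear in the lesson. The quartiles are found as medians of the lower and upper halves.

```python
import statistics


def interquartile_range(values):
    """IQR = Q3 - Q1, with quartiles taken as medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = statistics.median(ordered[: n // 2])
    q3 = statistics.median(ordered[(n + 1) // 2:])
    return q3 - q1


# Data from Example 28.
original = [22, 19, 25, 24, 18, 15, 9, 12, 16, 20]
# Hypothetical variant: the value 25 is replaced by an extreme value, 250.
with_outlier = [250 if x == 25 else x for x in original]

for label, data in [("original", original), ("with outlier", with_outlier)]:
    print(f"{label:>12}: mean = {statistics.mean(data)}, "
          f"s = {statistics.stdev(data):.2f}, "
          f"median = {statistics.median(data)}, "
          f"IQR = {interquartile_range(data)}")
```

In this experiment the median and IQR are identical for both data sets, while the mean and standard deviation change substantially once the extreme value is introduced.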
NOTE: Sometimes when a distribution is skewed or contains outliers, the median and interquartile range can be used to more accurately describe the data than the mean and standard deviation.

MEASURES OF POSITION
OUTLIERS
Procedure for Identifying Outliers
Step 1 Arrange the data in order from lowest to highest and find \(Q_1\) and \(Q_3\).
Step 2 Find the interquartile range: \(IQR = Q_3 - Q_1\).
Step 3 Multiply the IQR by 1.5.
Step 4 Subtract the value obtained in step 3 from \(Q_1\) and add the value obtained in step 3 to \(Q_3\).
Step 5 Check the data set for any data value that is smaller than \(Q_1 - 1.5(IQR)\) or larger than \(Q_3 + 1.5(IQR)\).
There are several reasons why outliers may occur.
1. The data value may have resulted from a measurement or observational error.
2. The data value may have resulted from a recording error.
3. The data value may have been obtained from a subject that is not in the defined population.
4. The data value might be a legitimate value that occurred by chance.

WORDED PROBLEM
Example 31: Check the following data set for outliers.
5, 6, 12, 13, 15, 18, 22, 50
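The five-step outlier check can be sketched for the Example 31 data as follows. The function name `find_outliers` is an assumption, and the quartiles are computed as medians of the lower and upper halves of the ordered data, which is one of several possible conventions.

```python
import statistics


def find_outliers(values):
    """Return (lower fence, upper fence, outliers) using the 1.5 * IQR rule."""
    ordered = sorted(values)                          # Step 1
    n = len(ordered)
    q1 = statistics.median(ordered[: n // 2])
    q3 = statistics.median(ordered[(n + 1) // 2:])
    iqr = q3 - q1                                     # Step 2
    step = 1.5 * iqr                                  # Step 3
    lower_fence = q1 - step                           # Step 4
    upper_fence = q3 + step
    outliers = [x for x in ordered                    # Step 5
                if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers


# Data from Example 31.
data = [5, 6, 12, 13, 15, 18, 22, 50]

lower_fence, upper_fence, outliers = find_outliers(data)
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Outliers: {outliers}")
```

For this data set Q1 = 9 and Q3 = 20, so IQR = 11 and 1.5(IQR) = 16.5. Any value below −7.5 or above 36.5 is flagged, which singles out 50.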
EXPLORATORY DATA ANALYSIS
❑ In exploratory data analysis (EDA), data can be organized using a stem and leaf plot.
❑ The measure of central tendency used in EDA is the median.
❑ The measure of variation used in EDA is the interquartile range.
❑ The data are represented graphically using a boxplot (sometimes called a box and whisker plot).
The purpose of exploratory data analysis is to examine data to find out what information can be discovered about the data, such as the center and the spread.
A boxplot can be used to graphically represent the data set.
The five-number summary consists of:
1. The lowest value of the data set (i.e., minimum)
2. \(Q_1\)
3. The median
4. \(Q_3\)
5. The highest value of the data set (i.e., maximum)

EXPLORATORY DATA ANALYSIS
Constructing a Boxplot
Step 1 Find the five-number summary for the data.
Step 2 Draw a horizontal axis and place the scale on the axis. The scale should start on or below the minimum data value and end on or above the maximum data value.
Step 3 Locate the lowest data value, \(Q_1\), the median, \(Q_3\), and the highest data value; then draw a box whose vertical sides go through \(Q_1\) and \(Q_3\). Draw a vertical line through the median.
Finally, draw a line from the minimum data value to the left side of the box, and draw a line from the maximum data value to the right side of the box.

WORDED PROBLEM
Example 32: The number of meteorites found in 10 states of the United States is 89, 47, 164, 296, 30, 215, 138, 78, 48, 39. Construct a boxplot for the data.
Source: Natural History Museum.
SOLUTION:

EXPLORATORY DATA ANALYSIS
Information Obtained from a Boxplot
1. a. If the median is near the center of the box, the distribution is approximately symmetric.
   b. If the median falls to the left of the center of the box, the distribution is positively skewed.
   c. If the median falls to the right of the center, the distribution is negatively skewed.
2. a. If the lines are about the same length, the distribution is approximately symmetric.
   b. If the right line is larger than the left line, the distribution is positively skewed.
   c. If the left line is larger than the right line, the distribution is negatively skewed.

Traditional | Exploratory data analysis
Frequency distribution | Stem and leaf plot
Histogram | Boxplot
Mean | Median
Standard deviation | Interquartile range

END OF LESSON 3

REFERENCES:
Bluman, A. G. (2018). Elementary Statistics (8th Edition). McGraw-Hill Education, New York.
Walpole, R. E., Myers, R. H. (2007). Probability and Statistics for Engineers & Scientists (8th Edition). Pearson Education International.
Montgomery, D. C., Runger, G. C. (2003). Applied Statistics & Probability for Engineers (3rd Edition). John Wiley & Sons, Inc.
