Statistics for Engineers PDF
Document Details
Uploaded by DiligentRomanticism
Summary
This document discusses statistics and exploratory data analysis techniques, as well as different methods for displaying and describing univariate data, including frequency distributions, numerical measures of central tendency, and variation.
Full Transcript
Statistics for Engineers
Exploratory Data Analysis: Univariate Data

EDA: Univariate data analysis
Contents
- Aim
- Frequencies and frequency distribution
- Graphical displays for categorical data (bar chart, pie chart)
- Graphical displays for numerical data (histogram, polygon, boxplot)
- Numerical measures to describe:
  - central tendency (mean, median, mode)
  - location (quartiles, percentiles)
  - variation (variance, standard deviation, quasi-variance and quasi-standard deviation, range, IQR, coefficient of variation)

Aim
The aim of exploratory data analysis is to summarize the information given in a data set.
[Figure: Raw data]
[Figure: Summary graphs and statistics]

Frequencies and frequency distribution
Def. A frequency distribution is
- a list or a table
- containing class groupings (categories or ranges within which the data fall)
- and the corresponding frequencies with which data fall within each class or category
Frequencies:
- absolute (number of times the value appeared in the sample)
- relative (proportion of times the value appeared in the sample)

Why use frequency distributions?
- A frequency distribution is a way to summarize data
- The distribution condenses the raw data into a more useful form
- and allows for a quick visual interpretation of the data

Grouping by classes: categorical and discrete data

| Class, $x_i$ | Absolute freq., $n_i$ | Relative freq., $f_i$ | Cumulative absolute freq., $N_i$ | Cumulative relative freq., $F_i$ |
|---|---|---|---|---|
| $x_1$ | $n_1$ | $f_1 = n_1/n$ | $N_1 = n_1$ | $F_1 = f_1$ |
| $x_2$ | $n_2$ | $f_2 = n_2/n$ | $N_2 = N_1 + n_2$ | $F_2 = F_1 + f_2$ |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| $x_k$ | $n_k$ | $f_k = n_k/n$ | $N_k = n$ | $F_k = 1$ |
| Total | $n$ | $1$ | | |

Note:
- $n_i$ = number of $x_i$ in the sample, $f_i = \frac{\text{number of } x_i}{n}$
- $N_i = N_{i-1} + n_i$, $F_i = F_{i-1} + f_i$
- $0 \le f_i, F_i \le 1$
- $F_i$ and $N_i$ do not make sense for categorical-nominal variables

Grouping by classes
Example 1: Soft drink purchases:
- Which brand is the leader?
- What percentage of the sampled people purchased Diet Coke?
- What percentage of the sampled people purchased Dr. Pepper?

Grouping by classes
Example 1: Soft drink purchases:
- Which one is the leader? Coke Classic is the leader with 38% of purchases.
- What percentage of the sampled people purchased Diet Coke? 16% of the people purchased Diet Coke.
- What percentage of the sampled people purchased Dr. Pepper? 10% of the people purchased Dr. Pepper.

Grouping by classes
Example 2: A company wants to evaluate the potential brain drain among its employees. To do so, a survey was conducted. The table below shows different levels of satisfaction (S = satisfied, V = very, U = unsatisfied) for 901 employees.

| Class | Absolute frequency |
|---|---|
| VU | 62 |
| U | 108 |
| S | 319 |
| VS | 412 |
| Total | 901 |

- What type of variable is being studied? Find a frequency distribution of the data.
- What percentage of the sampled people are satisfied?
- How many individuals are unsatisfied or worse? In %?
- How many individuals are at least satisfied? In %?

Grouping by classes
Example 2 cont.:
- Categorical, ordinal with 4 different classes. The frequency distribution is:

| Class | Absolute frequency | Relative frequency | Cumulative absolute frequency | Cumulative relative frequency |
|---|---|---|---|---|
| VU | 62 | 0.07 | 62 | 0.07 |
| U | 108 | 0.12 | 170 | 0.19 |
| S | 319 | 0.35 | 489 | 0.54 |
| VS | 412 | 0.46 | 901 | 1 |
| Total | 901 | 1 | | |

- What percentage of the sampled people are satisfied? 35%
- How many individuals are unsatisfied or worse? In %? 170, 19%
- How many individuals are at least satisfied? In %? 319 + 412 = 731 or 901 − 170 = 731; 35% + 46% = 81% or 100% − 19% = 81%

Grouping by classes
Example 3: To explore the possibility of increasing the service price, data on 500 HBO subscribers were analyzed. The table below summarizes the average number of HBO emissions followed per week:

| Number of emissions followed ($x_i$) | Absolute frequency ($n_i$) |
|---|---|
| 0 | 60 |
| 1 | 100 |
| 2 | 120 |
| 3 | 80 |
| 4 | 50 |
| 5 | 40 |
| 6 | 30 |
| 8 | 10 |
| 10 | 10 |
| Total | 500 |

Grouping by classes
Example 3 cont.:
- What can you say about the variable in the study?
  Find its frequency distribution.
- What percentage of sampled people followed 3 emissions per week?
- How many people followed no more than 3 emissions per week?
- How many people followed at least 6 emissions per week?
- What percentage of people followed between 3 and 5 emissions per week?
- What percentage of people followed at least 8 emissions per week?
- What percentage of people followed at most 2 emissions per week?

Grouping by classes
Example 3 cont.:
- Numerical, discrete with 9 different values. The frequency distribution is:

| $x_i$ | Absolute frequency | Relative frequency | Cumulative absolute frequency | Cumulative relative frequency |
|---|---|---|---|---|
| 0 | 60 | 0.12 | 60 | 0.12 |
| 1 | 100 | 0.20 | 160 | 0.32 |
| 2 | 120 | 0.24 | 280 | 0.56 |
| 3 | 80 | 0.16 | 360 | 0.72 |
| 4 | 50 | 0.10 | 410 | 0.82 |
| 5 | 40 | 0.08 | 450 | 0.90 |
| 6 | 30 | 0.06 | 480 | 0.96 |
| 8 | 10 | 0.02 | 490 | 0.98 |
| 10 | 10 | 0.02 | 500 | 1 |
| Total | 500 | 1 | | |

Grouping by classes
Example 3 cont.:
- What percentage of sampled people followed 3 emissions per week? 16%
- How many people followed no more than 3 emissions per week? 360
- How many people followed at least 6 emissions per week? 30 + 10 + 10 = 50 or 500 − 450 = 50
- What percentage of people followed between 3 and 5 emissions per week? 16% + 10% + 8% = 34% or (80 + 50 + 40)/500 = 34%
- What percentage of people followed at least 8 emissions per week? 2% + 2% = 4% or 100% − 96% = 4%
- What percentage of people followed at most 2 emissions per week? 56%

Grouping by class intervals: continuous (and discrete) data

| Class interval $[l_{i-1}, l_i)$ | Midpoint $x_i = \frac{l_i + l_{i-1}}{2}$ | $n_i$ | $f_i$ | $N_i$ | $F_i$ |
|---|---|---|---|---|---|
| $[l_0, l_1)$ | $x_1$ | $n_1$ | $f_1$ | $N_1$ | $F_1$ |
| $[l_1, l_2)$ | $x_2$ | $n_2$ | $f_2$ | $N_2$ | $F_2$ |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| $[l_{k-1}, l_k]$ | $x_k$ | $n_k$ | $f_k$ | $n$ | $1$ |
| Total | | $n$ | $1$ | | |

Note:
- The left end-point is included, but the right end-point is excluded (typical convention)
- The reverse end-point convention can also be applied; check your software for its definition
- Useful for tabulating discrete data if $X$ takes many values

Grouping by class intervals: continuous (and discrete) data
- Very often class intervals have the same width
- Determine the width $w$ of each interval by
  $$w = \frac{\text{largest number} - \text{smallest number}}{\text{number of desired intervals}}$$
- How many intervals? Roughly between 5 and 20. More specifically:
  - $k \approx \sqrt{n}$ if $n$ is 'small'
  - $k \approx 1 + 3.322 \log_{10}(n)$ if $n$ is 'large' (Sturges' rule)
- Intervals never overlap
- Round up the interval width to get desirable interval endpoints

Grouping by class intervals: continuous (and discrete) data
Example 4: A LinkedIn job offer received 20 applications on the first day. The data below show the age of each applicant:
24, 35, 37, 21, 24, 37, 26, 46, 58, 30, 32, 23, 46, 38, 41, 43, 44, 27, 53, 27
Find the frequency distribution of the data.
- Sort the raw data in ascending order: 21, 23, 24, 24, 26, 27, 27, 30, 32, 35, 37, 37, 38, 41, 43, 44, 46, 46, 53, 58
- Find the range: Max − Min = 58 − 21 = 37
- Select the number of classes: say $k = 4$
- Compute the interval width: 10 (37/4 = 9.25, then round up)
- Determine the end-points: 20 but less than 30, 30 but less than 40, etc.
- Count the observations and assign them to classes

Grouping by class intervals: continuous (and discrete) data
Example 4 cont.:

| Class interval | Midpoint | $n_i$ | $f_i$ | $N_i$ | $F_i$ |
|---|---|---|---|---|---|
| [20, 30) | 25 | 7 | 0.35 | 7 | 0.35 |
| [30, 40) | 35 | 6 | 0.30 | 13 | 0.65 |
| [40, 50) | 45 | 5 | 0.25 | 18 | 0.90 |
| [50, 60] | 55 | 2 | 0.10 | 20 | 1 |
| Total | | 20 | 1 | | |

- How many people are aged less than 40? In %? 7 + 6 = 13, which is 65%
- How many people (approximately) are aged at least 45? In %?
  $2 + 5 \cdot \frac{50 - 45}{50 - 40} = 4.5$, which is 22.5% (the 2 people in [50, 60] plus half of the 5 people in [40, 50), assuming ages are spread evenly within that class)

Graphical presentation of data
Once we have a frequency distribution of the data, the following graphical displays can be obtained:
- Categorical data: pie chart, bar chart
- Numerical data: histogram, polygon, boxplot

Graphs for qualitative data: pie chart
Example: [Figure: pie chart]
- Each slice is a fraction of the total size of the pie
- Many software packages order slices alphabetically
- Although 'pretty', pie charts are harder to read than bar charts

Graphs for qualitative data: pie chart
Example: Spanish energy mix (https://www.esios.ree.es/es).
[Figure]

Graphs for qualitative data: pie chart misuse
[Figure]

Graphs for qualitative data: bar chart
Example: The frequency table below corresponds to levels of satisfaction for 901 employees.

| Class | Absolute frequency | Relative frequency | Cumulative absolute frequency | Cumulative relative frequency |
|---|---|---|---|---|
| VU | 62 | 0.07 | 62 | 0.07 |
| U | 108 | 0.12 | 170 | 0.19 |
| S | 319 | 0.35 | 489 | 0.54 |
| VS | 412 | 0.46 | 901 | 1 |
| Total | 901 | 1 | | |

Graphs for qualitative data: bar chart
Example cont.: [Figure: bar chart]
- Bars are of the same width and equally spaced, with the heights corresponding to the frequencies
- There are gaps between the bars
- Bars are labeled with class names
- Many software packages order bars alphabetically

Graphs for qualitative data: bar chart misuse
[Figure]

Graphs for qualitative data: bar chart
- Bar charts can also be constructed for discrete data if there are not too many values
- This bar chart shows Netflix's Spanish productions (data from www.kaggle.com)
[Figure]

Graphs for qualitative data: bar chart misuse
[Figure]

Graphs for qualitative data: other
Example: word cloud
- This is a word cloud built from your names
- Pretty, but useful? How many of you are named Alvaro or Lucia?
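The cumulative frequency table shown with the bar chart example above can be rebuilt from the absolute frequencies alone. The following is a minimal Python sketch of that calculation, using the definitions $f_i = n_i/n$ and $N_i = N_{i-1} + n_i$; the function name `frequency_table` and the rounding to two decimals are illustrative choices, not part of the original slides.

```python
from itertools import accumulate

# Absolute frequencies from the employee-satisfaction example (ordered classes).
classes = ["VU", "U", "S", "VS"]
abs_freq = [62, 108, 319, 412]

n = sum(abs_freq)                         # sample size (901)
rel_freq = [ni / n for ni in abs_freq]    # f_i = n_i / n
cum_abs = list(accumulate(abs_freq))      # N_i = N_{i-1} + n_i
cum_rel = [Ni / n for Ni in cum_abs]      # F_i = N_i / n


def frequency_table():
    """Return rows (class, n_i, f_i, N_i, F_i); relative values rounded to 2 places."""
    return [
        (c, ni, round(fi, 2), Ni, round(Fi, 2))
        for c, ni, fi, Ni, Fi in zip(classes, abs_freq, rel_freq, cum_abs, cum_rel)
    ]


for row in frequency_table():
    print(row)
```

Because the classes are ordinal, the cumulative columns are meaningful here; for a nominal variable only the first two computed columns would be reported.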

Graphs for quantitative data: histogram and polygon
Example: The following table summarizes the monthly smartphone spending (in euros) reported by 20 UC3M students.

| Class interval | Midpoint | $n_i$ | $f_i$ | $N_i$ | $F_i$ |
|---|---|---|---|---|---|
| [10, 20) | 15 | 3 | 0.15 | 3 | 0.15 |
| [20, 30) | 25 | 6 | 0.30 | 9 | 0.45 |
| [30, 40) | 35 | 5 | 0.25 | 14 | 0.70 |
| [40, 50) | 45 | 4 | 0.20 | 18 | 0.90 |
| [50, 60) | 55 | 2 | 0.10 | 20 | 1 |
| Total | | 20 | 1 | | |

Histogram and polygon
[Figure: histogram with frequency polygon]
- There are no gaps between the bars/bins
- Bin widths = widths of class intervals (identical); class boundaries are marked on the horizontal axis
- Bin heights = frequencies (here, absolute)
- Bin areas are proportional to the frequencies

Cumulative polygon
[Figure]

Histogram with area of 1 (on a density scale)
- Bin widths = widths of class intervals (not necessarily identical)
- Bin heights = $\frac{f_i}{l_i - l_{i-1}}$
- Bin areas = $f_i$

Histogram shows skewness in data
[Figure]

Describing data numerically
- Center: mean, median, mode
- Location: quartiles, percentiles
- Variation: range, interquartile range, variance, standard deviation, coefficient of variation

New notation:
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n$$
($\sum$: sum, $i = 1$: the lower limit, $n$: the upper limit, $x_i$: example of a formula depending on $i$)
Example: If $x_i = i$, then
$$\sum_{i=-1}^{3} x_i^2 = (-1)^2 + 0^2 + 1^2 + 2^2 + 3^2 = 15$$

Central tendency: (arithmetic) mean
- The most common measure of central tendency
- Population mean
  $$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + \ldots + x_N}{N}$$
- Sample mean
  $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \ldots + x_n}{n}$$
- If $a, b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $\bar{y} = a + b\bar{x}$
- Affected by extreme values (outliers)
Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 5, 4, 200
$$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3 \qquad \bar{y} = \frac{3 + 1 + 5 + 4 + 200}{5} = 42.6$$
A single extreme value moves the mean from 3 to 42.6.

Central tendency: median
- In the ordered list, the median $M$ is the middle number
  $$M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd (the middle number)} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even (the average of the two middle numbers)} \end{cases}$$
  ($x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ means that the observations are ranked in increasing order, e.g. $x_{(1)} = x_{\min}$, $x_{(n)} = x_{\max}$)
- Not affected by outliers
Example: Given observations 3, 1, 5, 4, 2 ($n = 5$), first rank the data 1, 2, 3, 4, 5, then identify the middle number (the 3rd smallest):
$$M = x_{((5+1)/2)} = x_{(3)} = 3$$
Example: Given observations 3, 1, 5, 4, 2, 0 ($n = 6$), first rank the data 0, 1, 2, 3, 4, 5, then take the average of the 3rd and 4th:
$$M = \frac{x_{(6/2)} + x_{(6/2+1)}}{2} = \frac{x_{(3)} + x_{(4)}}{2} = \frac{2 + 3}{2} = 2.5$$

Central tendency: mode
- The value that occurs most often
- Not affected by outliers
- Used for either numerical or categorical data
- There may be no mode, or there may be several modes
Example: Given observations 3, 1, 5, 4, 2, there is no mode
Example: Given observations 3, 1, 5, 4, 2, 1, the mode is 1

Shape: comparing mean and median
Three types of distributions:
- Skewed to the left: Mean < Median
- Symmetric: Mean = Median
- Skewed to the right: Median < Mean

Shape: right-skewed distribution
Example: Salaries in Spain in 2017 (Figure from CincoDías, https://cincodias.elpais.com/cincodias/2019/06/21/economia/1561113360_547756.html)

Shape: left-skewed distribution
Example: Age at death in Spain in 2019 (Data from INE, https://www.ine.es/jaxi/Tabla.htm?path=/t20/e301/provi/l0/&file=02001.px&L=0)

Quartiles and percentiles
- Quartiles split the ranked data into four segments with an equal number of values per segment
- The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$
- The second quartile $Q_2$ (= median) has position $\frac{1}{2}(n + 1)$
- The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$
Example: Given observations 22, 18, 17, 16, 16, 13, 12, 21, 11 ($n = 9$), first rank the data 11, 12, 13, 16, 16, 17, 18, 21, 22, then identify the positions:
$$Q_1 = x_{(2.5)} = x_{(3)} = 13 \qquad Q_2 = 16 \qquad Q_3 = x_{(7.5)} = x_{(8)} = 21$$
- $k$th percentile, $k = 1, 2, \ldots, 99$: $P_k = x_{(k(n+1)/100)}$
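The position rule above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the helper name `percentile_by_position` is hypothetical, and rounding a fractional position half up (2.5 to 3, 7.5 to 8) is an assumption chosen because it reproduces the quartile example; many software packages interpolate between neighbouring values instead.

```python
import math


def percentile_by_position(data, k):
    """k-th percentile via the position rule k(n+1)/100 on the ranked data.

    Assumption: a fractional position is rounded half up (2.5 -> 3, 7.5 -> 8),
    which matches the worked quartile example; other conventions interpolate.
    """
    ranked = sorted(data)
    n = len(ranked)
    position = k * (n + 1) / 100          # 1-based position in the ranked list
    index = math.floor(position + 0.5)    # round half up
    index = min(max(index, 1), n)         # clamp to valid positions 1..n
    return ranked[index - 1]              # convert to 0-based indexing


observations = [22, 18, 17, 16, 16, 13, 12, 21, 11]
q1 = percentile_by_position(observations, 25)   # first quartile
q2 = percentile_by_position(observations, 50)   # median
q3 = percentile_by_position(observations, 75)   # third quartile
print(q1, q2, q3)
```

Quartiles are just the 25th, 50th and 75th percentiles, so one function covers both definitions.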
Example cont.: 60th percentile $= x_{(60(9+1)/100)} = x_{(6)} = 17$

Quartiles and percentiles
[Figure]

Variation: range and interquartile range (IQR)
- Range is the simplest measure of variation: $R = x_{\max} - x_{\min}$
- Ignores the way the data is distributed
- Sensitive to outliers
Example: Given observations 3, 1, 5, 4, 2, $R = 5 - 1 = 4$
Example: Given observations 3, 1, 5, 4, 100, $R = 100 - 1 = 99$
- Interquartile range (IQR) can eliminate some outlier problems. Eliminate high and low observations and calculate the range of the middle 50% of the data:
  $$IQR = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$$

Variation: Outliers
- Outliers are observations that fall
  - below the value of $Q_1 - 1.5 \cdot IQR$
  - above the value of $Q_3 + 1.5 \cdot IQR$
- For extreme outliers, replace 1.5 by 3 in the above definition
- Example: Consider the following data
  - 2, 3, 4, 5, 5, 5, 6, 7, 8, 11, 13, 14, 15, 15, 16, 19, 22, 23, 23, 24, 25, 29, 30
    $IQR = Q_3 - Q_1 = 23 - 5 = 18 \rightarrow Q_3 + 1.5 \cdot IQR = 50$. No outliers
  - 2, 3, 4, 5, 5, 5, 6, 7, 8, 11, 13, 14, 15, 15, 16, 19, 22, 23, 23, 24, 25, 29, 53
    $IQR = Q_3 - Q_1 = 23 - 5 = 18 \rightarrow Q_3 + 1.5 \cdot IQR = 50$. One outlier: 53

Boxplot
To build a boxplot we need the five-number summary (min, $Q_1$, $M$ (or $Q_2$), $Q_3$, max) and to identify the outliers (if any)
- Example: consider the following data: 2, 3, 4, 5, 5, 5, 6, 7, 8, 11, 13, 14, 15, 15, 16, 19, 22, 23, 23, 24, 25, 29, 30
- min = 2, $Q_1$ = 5, $M$ = 14, $Q_3$ = 23, max = 30
- there are no outliers
[Figure: boxplot]

Boxplot
An alternative illustration of a boxplot including outliers:
[Figure]

Boxplot: example
Example: Power losses in Europe (data from https://www.ceer.eu/documents/104400/-/-/09ecee88-e877-3305-6767-e75404637087)

Measure of variation: variance
- Average of squared deviations of values from the mean
- Population variance
  $$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
- Sample variance (divided by $n$; the second form is faster to calculate)
  $$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n}$$
- Sample quasi-variance (corrected sample variance; divided by $n - 1$)
  $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
- They are related via
  $$\hat{\sigma}^2 = \frac{n - 1}{n} s^2$$
- If $a, b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $s_y^2 = b^2 s_x^2$

Measure of variation: standard deviation (SD)
- The most commonly used measure of spread
- Population standard deviation, sample standard deviation and sample quasi-standard deviation are respectively
  $$\sigma = \sqrt{\sigma^2} \qquad \hat{\sigma} = \sqrt{\hat{\sigma}^2} \qquad s = \sqrt{s^2}$$
- Shows variation about the mean
- Has the same units as the original data, whilst variance is in units²
- Variance and SD are both affected by outliers

Calculating variance and standard deviation
Example: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20
$$\bar{x} = \frac{124}{8} = 15.5 \qquad \bar{y} = \frac{124}{8} = 15.5 \qquad \bar{z} = \frac{124}{8} = 15.5$$
$$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \ldots + 21^2 = 2000$$
$$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \ldots + 17^2 = 1928$$
$$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \ldots + 20^2 = 2068$$
$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429 \Rightarrow s_x = 3.3381$$
$$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571 \Rightarrow s_y = 0.9258$$
$$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571 \Rightarrow s_z = 4.5670$$

Comparing standard deviations
Example cont.: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20
[Figure: dot plots of X, Y and Z on a common axis from 11 to 21]
- $\bar{x} = 15.5$, $s_x = 3.3$
- $\bar{y} = 15.5$, $s_y = 0.9$
- $\bar{z} = 15.5$, $s_z = 4.6$

Numerical summaries and frequency tables. Standardization.
- If the data is discrete then
  $$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n} \quad \text{and} \quad s^2 = \frac{\sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2}{n - 1}$$
- If the data is continuous, we replace $x_i$ in the above definition by the mid-points of the class intervals
- To standardize variable $x$ means to calculate
  $$\frac{x - \bar{x}}{s}$$
- If you apply this formula to all observations $x_1, \ldots, x_n$ and call the transformed ones $z_1, \ldots, z_n$, then the mean of the $z$'s is zero and their standard deviation is one
- Standardization = finding the z-score

Standardization and z-scores
In addition to measures of location, variability, and shape, we are also interested in the relative location of values within a data set. Measures of relative location help us determine how far a particular value is from the mean. The z-score of an observation $x_i$ is $z_i = \frac{x_i - \bar{x}}{s}$, that is, its distance from the mean measured in standard deviations.

Measure of variation: variance
Many textbooks do not distinguish between sample variance and sample quasi-variance, and use the unbiased estimator as the sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Analogously with the sample standard deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Empirical rule
If the data is bell-shaped (normal), that is, symmetric and with light tails, the following rule holds:
- approximately 68% of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$
- approximately 95% of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$
- approximately 99.7% of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$
Note: This rule is also known as the 68-95-99.7 rule
Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data is bell-shaped, give the limits of an interval that captures 95% of the observations.
95% of the $x_i$'s are in: $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$

Chebyshev's Theorem
The empirical rule can only be applied to bell-shaped (normal) distributions. However, Chebyshev's Theorem allows us to make statements about the proportion of data values that must be within a specified number of standard deviations from the mean for any type of distribution: at least $1 - \frac{1}{k^2}$ of the data values lie within $k$ standard deviations of the mean, for any $k > 1$.

Measure of variation: coefficient of variation (CV)
- Measures relative variation and is defined as
  $$CV = \frac{s}{|\bar{x}|}$$
- Is a unitless number (sometimes given as a percentage)
- Shows variation relative to the mean
Example: Stock A: average price last year = 50, standard deviation = 5. Stock B: average price last year = 100, standard deviation = 5.
$$CV_A = \frac{5}{50} = 0.10 \qquad CV_B = \frac{5}{100} = 0.05$$
Both stocks have the same SD, but stock B is less variable relative to its mean price.
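The coefficient of variation and the Chebyshev bound are both one-line calculations, sketched below in Python as a minimal illustration. The helper names `coefficient_of_variation` and `chebyshev_lower_bound` are hypothetical; the bound uses the standard statement of Chebyshev's Theorem, $1 - 1/k^2$ for $k > 1$. The standard library's `statistics.stdev` divides by $n - 1$, so it corresponds to the quasi-standard deviation $s$ used in this document.

```python
import statistics


def coefficient_of_variation(mean, sd):
    """CV = s / |mean|; undefined when the mean is zero."""
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return sd / abs(mean)


def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is informative only for k > 1")
    return 1 - 1 / k**2


# Stocks from the example: same SD, different mean price.
cv_a = coefficient_of_variation(50, 5)
cv_b = coefficient_of_variation(100, 5)

# CV from raw data (variable X of the variance example), with divisor n - 1.
x = [11, 12, 13, 16, 16, 17, 18, 21]
cv_x = coefficient_of_variation(statistics.mean(x), statistics.stdev(x))

print(cv_a, cv_b, round(cv_x, 4), chebyshev_lower_bound(2))
```

For $k = 2$ the Chebyshev bound guarantees at least 75% of the data within two standard deviations for any distribution, compared with roughly 95% under the empirical rule for bell-shaped data.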