Topic4_Descriptive_Stat_2023_1.pptx

Full Transcript

Topic 4: Descriptive Statistics: Normal Distribution
GEN 4191 Data Analytics for Business Optimisation 2023.2 – BBA6
Dr. Krisztina Soreg

Let’s get started!
After this session you should be able to:
• Distinguish the three main types of Descriptive Statistics and the most frequently used tools such as frequency, mean, median, mode, range, standard deviation and variance;
• Understand and apply the main tools of variability including the range, standard deviation, standard error and variance;
• Use the Data Analysis ToolPak for the Descriptive Statistics measures and the functions.

Central Tendency
• Goal: to use a single value reflecting the center of the data distribution → central location
• Types: mean, median and mode
Mean: average of a data set
Median: middle of the set of numbers
Mode: most common number

Central Tendency
What is a mean?

Central Tendency: Mean (Arithmetic Average)
Definition: the sum of all observations divided by the number of observations
• Values of our data: x1, x2, … xn
• n: number of values
• “X bar”: the mean (average) of the values
• Excel: “average” function
Disadvantages
1. The arithmetic mean cannot be used when we are dealing with qualitative characteristics
2. The mean cannot be calculated if even a single data value is missing
3. It might create a distorted picture of the data due to the presence of extreme values

Central Tendency: Mean
Why is this chart based on average data about revenues?
Data is listed by industries (significant groups, categories) and not smaller units → we can compare the average performance of these sectors for the given year

Central Tendency: Mean
What is the purpose of the average EU spending bar?
It allows us to analyze individual countries' spending on research and development → we can see who is spending above or below the average within the entire group (EU)

Central Tendency
What is a median?
Central Tendency: Median (Positional Average)
Definition: the median is the measure of location that specifies the middle value when the data are arranged from least to greatest → where the center of the given data lies
• Half the data are below the median, and half the data are above it.
• For an odd number of observations, the median is the middle of the sorted numbers.
• For an even number of observations, the median is the mean of the two middle numbers.
• Not sensitive to outliers
Disadvantages
1. Less representative average because it does not depend on all the items in the series
2. With an even number of items, the median cannot be found exactly; it is taken as the mean of the two middle values
3. Affected more by sampling fluctuations than the mean, as it is concerned with only one item, i.e., the middle item

Central Tendency: Median
(Figure labels: number of data points is even; number of data points is odd)

Central Tendency
What is a mode?

Central Tendency: Mode
Definition: the observation that occurs most frequently
• Remember: a set of data may have one mode, more than one mode or no mode!
• E.g.: 3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48 → one mode
• 3, 3, 3, 9, 16, 16, 16, 27, 37, 48 → bimodal
• 3, 6, 9, 16, 27, 37, 48 → no mode
Disadvantages
1. Unstable when the data consist of a small number of values
2. It is not rigidly defined when there is more than one mode
3. Not based on all values

Central Tendency: Types of distribution
The mean, median and mode might not always belong to one “team”…

Central Tendency: Types of distribution
Normal (symmetrical) distribution
• Mean = median = mode
• The data is said to have no skew
Negative skew
• Median value > mean
• The ‘long tail’ of the few, but extreme, values is on the ‘negative’ side (left) of the mean
Positive skew
• Median value < mean
• A long tail of a few relatively extreme values lies on the positive (right) side of the mean

Central Tendency: Types of distribution - Examples
What can be observed?
• The IQ of a majority of the people in the population lies in the normal range
• The IQ of the rest of the population lies in the deviated range

Central Tendency: Types of distribution - Examples

Variability
Main goal: to detect how far apart the data points appear to fall from the center
Range: the difference between the maximum value and the minimum value in the data set
E.g.: 0, 3, 3, 12, 15, 24
Range: 24 – 0 = 24
Standard deviation: average amount of variability in your dataset → how far each score lies from the mean → the larger the standard deviation, the more variable the data set is!
How to calculate? → square root of the variance
Variance: the average of squared deviations from the mean → the more spread out the data, the larger the variance is in relation to the mean
How to calculate? → square the standard deviation

Variability
What is standard deviation?
The dispersion of a dataset relative to its mean. It tells us, on average, how far each value lies from the mean.
• “Dispersion” tells us how much your data is spread out
• How to calculate? → take the square root of the variance
• High standard deviation means that values are generally far from the mean.
• Low standard deviation indicates that values are clustered close to the mean.

Standard Deviation
What does standard deviation demonstrate?
• Useful measure of spread for normal distributions (bell curve)
• In normal distributions, data is symmetrically distributed with no skew. Most values cluster around a central region, with values tapering off as they go further away from the center.
• The standard deviation tells us how spread out from the center of the distribution your data is on average
• The mean is represented by the Greek letter μ, in the center.
• Each segment (colored in dark blue to light blue) represents one standard deviation away from the mean.
• For example, 2σ means two standard deviations from the mean.

Standard Deviation
How to interpret it?
• 68% of the data lies within 1 standard deviation of the mean; in other words, there is a 68% chance that a value lies within 1 SD of the mean
• 95% of the data lies within 2 standard deviations of the mean; in other words, there is a 95% chance that a value lies within 2 SD of the mean
• 99.7% of the data lies within 3 standard deviations of the mean; in other words, there is a 99.7% chance that a value lies within 3 SD of the mean

Standard Deviation
Low and high SD
Let’s see an example:
• The graph on the left might represent an abnormally high number of students getting scores close to the average → small standard deviation (tall histogram)
• The graph on the right represents more students getting scores away from the average → large standard deviation (flat histogram)

Standard Deviation
Share of Amazon orders arriving late from January 2020 to January 2021
Which one is the outlier?
• May 2020: Amazon was overwhelmed by the surge in demand, resulting in delivery delays and logistics bottlenecks at its warehouses
• Pandemic: shoppers were panic buying essential items
• Source: https://www.statista.com/statistics/1220033/share-of-amazon-orders-arriving-late/

Standard Deviation
Example: Job satisfaction survey
Collecting data on job satisfaction ratings from three groups of employees using simple random sampling
What do we see?
• The mean (M) ratings are the same for each group: it is the value on the x-axis where the curve is at its peak.
• Their standard deviations (SD) differ from each other.
• Highest peak: lowest SD
• Lowest peak: highest SD

Standard Deviation
Example: Food delivery time
Situation: you are starving and, according to Google, two pizza restaurants advertise a 20-minute average delivery time → both look equally good!
What do we know? → mean (average): not informative enough!
How to make a good choice? We obtain the delivery time data:
• Pizza Hut: has a SD of 10 minutes
• Don Pepe: has a SD of 5 minutes
Let’s create a graph based on standard deviation!
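Before graphing, the comparison can be sketched numerically. The short Python example below is only an illustration: it assumes delivery times are roughly normally distributed and treats a wait of more than 30 minutes as unacceptable. The function name `share_late` is made up for this sketch.

```python
from statistics import NormalDist

# Assumption: delivery times follow a normal distribution.
MEAN_MINUTES = 20      # advertised average for both restaurants
CUTOFF_MINUTES = 30    # wait treated as unacceptable in this sketch


def share_late(sd_minutes: float) -> float:
    """Share of deliveries that take longer than the cutoff."""
    deliveries = NormalDist(mu=MEAN_MINUTES, sigma=sd_minutes)
    return 1 - deliveries.cdf(CUTOFF_MINUTES)


late_sd10 = share_late(10)   # restaurant with SD = 10 minutes
late_sd5 = share_late(5)     # restaurant with SD = 5 minutes

print(f"SD 10 minutes: {late_sd10:.1%} of deliveries exceed 30 minutes")
print(f"SD 5 minutes:  {late_sd5:.1%} of deliveries exceed 30 minutes")
```

Under these assumptions, the restaurant with the smaller standard deviation is late far less often, even though both have the same mean.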
Standard Deviation
Example: Food delivery time
Which one will you choose?
Let’s consider a 30-minute wait or longer to be unacceptable → shaded areas represent the % of delivery times exceeding 30 minutes
• SD of 10 minutes: 16% of deliveries exceed 30 minutes
• SD of 5 minutes: 2% of deliveries exceed 30 minutes

What is normal distribution?
Normal distribution with different means
Changing the mean shifts the entire curve left or right on the X-axis → the mean always defines the location of the peak of the bell curve

What is normal distribution?
Normal distribution with different standard deviations
As we increase the spread of the bell curve, the likelihood that observations will be further away from the mean also increases

What is normal distribution?
Normal distribution and its main characteristics
• Symmetric bell curve → no skewed distributions
• The mean, median and mode are equal
• Half of the population is less than the mean and half is greater than the mean
• The Empirical Rule allows us to determine the proportion of values that fall within certain distances from the mean

What is normal distribution?
Normal distribution and its main characteristics
Mean +/- standard deviations     Percentage of data contained
1                                68%
2                                95%
3                                99.7%
For example, in a normal distribution, 68% of the observations fall within +/- 1 standard deviation from the mean

What is normal distribution?
Normal distribution and the Empirical Rule
The empirical rule, also referred to as the three-sigma rule or 68-95-99.7 rule, is a statistical rule which states that for a normal distribution, almost all observed data will fall within three standard deviations (denoted by σ) of the mean.
The empirical rule predicts that 68% of observations fall within the first standard deviation (µ ± σ), 95% within the first two standard deviations (µ ± 2σ), and 99.7% within the first three standard deviations (µ ± 3σ).
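The percentages in the empirical rule can be checked with a few lines of Python. This is a minimal sketch using the standard library's `NormalDist`. The helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k: float) -> float:
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


coverage = {k: share_within(k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} SD of the mean: {share:.1%}")
```

The computed shares are about 68.3%, 95.4% and 99.7%, so the 68-95-99.7 figures are rounded values.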
Normal distribution and its main characteristics
Assume that a pizza restaurant has a mean delivery time of 30 minutes and a standard deviation of 5 minutes. Using the Empirical Rule, we can determine that:
• 68% of the delivery times are between 25-35 minutes (30 +/- 5),
• 95% are between 20-40 minutes (30 +/- 2*5) and
• 99.7% are between 15-45 minutes (30 +/- 3*5).

Normal distribution: real examples

Standard Deviation
Example: How to calculate it manually?
You are provided the following dataset: 46, 69, 32, 60, 52, 41
Step 1: find the mean (300 / 6 = 50)
Step 2: find each score’s deviation from the mean (-4, 19, -18, 10, 2, -9)
Step 3: square each deviation from the mean (16, 361, 324, 100, 4, 81)
Step 4: find the sum of squares (886)
Step 5: find the variance → as we are working with a sample of size 6, we use n – 1 (for a more general conclusion without underestimating variability), where n = 6 (886 / 5 = 177.2)
Step 6: find the square root of the variance
As the SD = 13.31, we can say that each score deviates from the mean by 13.31 points on average

Standard Error
How to define it?
• The approximate standard deviation of a sample statistic (such as the sample mean) across samples drawn from a population
• E.g.: a sample mean deviates from the actual mean of a population → this deviation is the standard error of the mean.
Main characteristics:
• The smaller the standard error, the more representative the sample will be of the overall population.
• The larger the sample size, the smaller the standard error, because the statistic will approach the actual value
1. The standard error increases when the variance of the population increases.
2. The standard error decreases when the size of the sample increases, as the sample size gets closer to the real size of the population.

Standard Error
What is the difference compared to the standard deviation?
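As a point of comparison, both measures can be computed for the six-score dataset from the manual example. The sketch below follows the six steps in Python and then derives the standard error of the mean as SD / √n. That formula is the usual textbook definition and is stated here as an assumption, since the slides do not spell it out.

```python
from math import sqrt
from statistics import stdev

scores = [46, 69, 32, 60, 52, 41]
n = len(scores)

mean = sum(scores) / n                        # Step 1: mean
deviations = [x - mean for x in scores]       # Step 2: deviation of each score
squared = [d ** 2 for d in deviations]        # Step 3: squared deviations
sum_of_squares = sum(squared)                 # Step 4: sum of squares
variance = sum_of_squares / (n - 1)           # Step 5: sample variance (n - 1)
sd = sqrt(variance)                           # Step 6: standard deviation

# Assumed formula: standard error of the mean = SD / sqrt(n)
standard_error = sd / sqrt(n)

print(f"mean = {mean:.2f}, sum of squares = {sum_of_squares:.2f}")
print(f"variance = {variance:.2f}, SD = {sd:.2f}, SE = {standard_error:.2f}")

# Cross-check against the library's sample standard deviation.
print(f"statistics.stdev gives {stdev(scores):.2f}")
```

The standard deviation describes the spread of the six scores themselves, while the standard error is smaller because it describes how much the sample mean would vary from sample to sample.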
The standard deviation describes variability within a single sample.
The standard error estimates the variability across multiple samples of a population.

Standard Deviation
Let’s see an example!
Assume that Business A’s average wage is $80,000, with a standard deviation of $20,000.
The standard deviation is large → there is no assurance that you will be paid close to $80,000 per year if you work at the company, because salaries vary a lot!
Assume that Business B’s average wage is $80,000, but the standard deviation is only $4,000.
The standard deviation is small → you will be paid close to $80,000, as wages vary so little.
The standard deviation of wages is thus much higher at Business A → the boxplot for Business A is comparatively longer
(Figure: box plot)

Standard Deviation
Let’s see an example!
Why might wages vary? Same occupation, different pay:
• Level of education, credentials
• Experience (years)
• Performance and success
• Geographical location
• Industry, sector, etc.

Standard Deviation
Let’s see an example!
• Median wage: the point at which half of workers earned more than that amount and half earned less
• 10th percentile wage: the point at which 10 percent of workers in an occupation made less than that amount and 90 percent made more
• 90th percentile wage: the point at which 90 percent of workers in an occupation made less than that amount and 10 percent made more
• The difference between those two wages (the high earners and low earners in an occupation) is referred to as the wage difference

Standard Deviation
Let’s see an example!
Assume that the mean property price in neighbourhood A is $250,000, with a $50,000 standard deviation.
The standard deviation is high → some of the housing values will be much higher than $250,000, while others will be much lower.
Assume that the mean property price in neighbourhood B is similarly $250,000, but the standard deviation is just $10,000.
The standard deviation is low → you can be confident that any property in the neighbourhood will be near this price.
As the standard deviation of property values is so much higher in neighbourhood A, its boxplot is significantly longer
(Figure: box plot)

Shape of the Distribution: Kurtosis & Skewness
How to check normality?

Shape of the Distribution: Kurtosis & Skewness
Kurtosis: the degree to which a distribution is peaked
Skewness: the degree to which a set of data are not symmetrical
KURTOSIS = “Peakedness” or flatness
SKEWNESS = Symmetry or asymmetry

Shape of the Distribution: Kurtosis
What is Kurtosis?
• A statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution
• Goal: to show the extent to which a distribution contains outliers
Types:
• Distributions with medium kurtosis (medium tails) → normal distribution (0)
• Distributions with low kurtosis (thin tails) → negative result (-)
• Distributions with high kurtosis (fat tails) → positive result (+)

Shape of the Distribution: Kurtosis
Positive: narrower and taller curve than the normal distribution + heavy-tailed curve → longer tails with a possibility for more extreme values.
Negative: lower and flatter curve than the normal distribution → light-tailed curve with shorter tails and possibly fewer outliers.

Shape of the Distribution: Kurtosis
Normality of the data distribution can be defined as Sk = 0 & K = 0*
* In the Data Analysis ToolPak and when using formulas in Excel; in other software it might be K = 3.

Shape of the Distribution: Skewness
• The analysis of a data set often depends on whether the distribution is symmetric or non-symmetric.
• Skewness tells us about the direction of variation of the data set. It is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
• Symmetric distribution: the pattern of frequencies from a central point is the same (or nearly so) from the left and right.
• Non-symmetric distribution: the patterns from a central point from the left and right are different.

Shape of the Distribution: Skewness
Symmetrical: Sk = 0, Mean = Median = Mode
Skewed to the Left: Sk < 0, negative skewness
Skewed to the Right: Sk > 0, positive skewness

Shape of the Distribution: Skewness
Interpretation:
• If Sk = 0, then the distribution is symmetrical (as a normal distribution is).
• If Sk > 0, then the distribution is right skewed.
• If Sk < 0, then the distribution is left skewed.

Problems with normality
• In statistics we often assume that data follows a Gaussian (or normal) distribution (i.e., we assume that the population from which the sample was taken was normally distributed)
• But with large enough sample sizes (>30 or >40) the violation of normality should not cause major problems
• Although TRUE NORMALITY IS CONSIDERED TO BE A MYTH

Problems with normality
• If you repeated your analysis 1000 times, choosing a new random sample every time, the sample means would form something that looks like a normal distribution with a mean (x̄) equal to the population’s mean (μ) and a standard deviation equal to the standard error.
• Most people don’t want to take 1000 new samples, since the whole point of sampling is to reduce the work!
• What do we do instead?
• We assume that normality exists…

Descriptive Statistics in Data Analysis ToolPak
Each of these tools might be calculated either manually (with a function) or with the Data Analysis ToolPak
Data tab → Data Analysis → Descriptive Statistics

How to use the normal distribution function?
What is it good for?
It will calculate the probability that variable x falls below or at a specified value.
That is, it will calculate the normal probability density function or the cumulative normal distribution function for a given set of parameters
The function: =NORM.DIST(x,mean,standard_dev,cumulative)
• X: the value for which we wish to calculate the distribution.
• Mean: the arithmetic mean of the distribution.
• Standard_dev: the standard deviation of the distribution.
• Cumulative: specifies the type of distribution to be used: TRUE (Cumulative Normal Distribution Function) or FALSE (Normal Probability Density Function).
• We can use 1 for TRUE and 0 for FALSE when entering the formula.

Descriptive Statistics in Data Analysis ToolPak
Moodle exercises for practicing Descriptive Statistics under Topic 4:
• Normal distribution construction
• Salaries
• Annual sales

Margin of error and confidence interval
If several samples are taken from a population, it is very likely that each sample will have a different mean value.
We want to know the mean value of the population and not that of the sample.
The confidence interval indicates the range in which the true mean value of the population lies with a certain probability.

Margin of error and confidence interval
Main target: to specify by how many percentage points your results will differ from the real population value
Margin of error: the range of values below and above the sample statistic in a confidence interval.
The confidence interval is a way to show what the uncertainty is with a certain statistic (from a poll or survey).
The level of confidence (CL) is usually expressed as a percent. Common values are 90%, 95%, 99%.
For example, a 95% confidence interval with a 4 percent margin of error means that your statistic will be within 4 percentage points of the real population value 95% of the time

Margin of error and confidence interval
How to calculate the CI?
Add the "Confidence Level" value to the "Mean" for the upper limit and subtract it from the "Mean" for the lower limit.
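The calculation can be sketched in Python using the summary figures from the Descriptive Statistics Tool output in this example (mean 3.818182, standard error 0.615234, count 11). The critical value 2.228139 is an assumption taken from a standard Student's t table (two-sided 95%, 10 degrees of freedom); the reported "Confidence Level (95.0%)" figure is consistent with it.

```python
# Summary figures from the Descriptive Statistics Tool output.
mean = 3.818182
standard_error = 0.615234
count = 11

# Assumption: two-sided 95% critical value of Student's t
# with count - 1 = 10 degrees of freedom, taken from a t table.
T_CRITICAL_95_DF10 = 2.228139

margin_of_error = T_CRITICAL_95_DF10 * standard_error
lower_limit = mean - margin_of_error
upper_limit = mean + margin_of_error

print(f"margin of error: {margin_of_error:.4f}")
print(f"95% confidence interval: {lower_limit:.4f} to {upper_limit:.4f}")
```

The margin of error obtained this way corresponds to the value the tool labels "Confidence Level (95.0%)".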
Descriptive Statistics Tool
Mean                          3.818182
Standard Error of the mean    0.615234
Median                        3
Mode                          3
Standard Deviation            2.040499
Sample Variance               4.163636
Kurtosis                      0.260801
Skewness                      0.730477
Range                         7
Minimum                       1
Maximum                       8
Sum                           42
Count                         11
Confidence Level (95.0%)      1.370826 (margin of error)

Lower Confidence Limit: 2.447356 (3.818 – 1.37)
Upper Confidence Limit: 5.189008 (3.818 + 1.37)
Interpretation: the population mean falls in the interval between 2.447 and 5.189 with probability 0.95 or 95%.
• There is a 95% probability that the interval we obtain from sample data contains the true population mean
• We can say there is a 95% chance of including the population mean in our interval
• There is only a 5% chance that the range excludes the mean of the population.

Box plot
What is it?
A method for graphically demonstrating the locality, spread and skewness of groups of numerical data through their quartiles (which divide the data points into four parts)
A box and whisker plot, also called a box plot, displays the five-number summary of a set of data. The five-number summary is the minimum, first quartile, median, third quartile, and maximum
In a box plot, we draw a box from the first quartile to the third quartile. A vertical line goes through the box at the median. The whiskers (lines) go from each quartile to the minimum or maximum.
E.g.: test scores between schools or classrooms → multiple data sets from independent sources that are related to each other in some way

Box plot
(Figure labels: Maximum, Q3, Mean, Median, Q1, Minimum)

Thank you for your attention!
[email protected]
