BA 311 Lecture 2 PDF

Summary

This document is a lecture on business analytics discussing data processing in spreadsheets and statistical concepts. It covers measures of central tendency (mean, median, mode) and variability (range, interquartile range, mean absolute deviation, variance, standard deviation, and coefficient of variation). It includes various examples using Excel, including calculations and how to use formulas.

Full Transcript

Data Processing in spreadsheets: Measures of Central Tendency and Variability Module Outline Measures of Central Tendency Mean Median Mode Measures of Variability Range Inter-Quartile Range MAD Variance Standard Deviation Coefficient of V...

Data Processing in spreadsheets: Measures of Central Tendency and Variability Module Outline Measures of Central Tendency Mean Median Mode Measures of Variability Range Inter-Quartile Range MAD Variance Standard Deviation Coefficient of Variations Measures Of Central Tendency Aside from frequency distributions and histograms, collected data can also be summarized as a single index/value, which represents the entire data. These measures may also help in the comparison of data. Central tendency is defined as “the statistical measure that identifies a single value as representative of an entire distribution”. This value aims to provide an accurate description of the entire data. It is the single value that is most typical/representative of the collected data. Measures of central tendency such as mean or median only provides the location of the middle value but it does not tell anything about the spread of the data Measures Of Central Tendency ത Mean (𝑋) Median (Md) Mode (Mo) Arithmetic Mean Most common measure of location Computed as the average value of the data set. If the data for the entire population are available, the population mean is computed in the same manner by the letter µ. For ungrouped data it is Using excel, it is =AVERAGE(data) Arithmetic Mean For grouped data, it is Example A real estate agent would like to establish pricing for lots in an upscale area outside the city limit. To set the prices, he gathered pricing data on nearby area presented in the table. He would like to get the average value of lots to get an idea of how much he should sell. Example A gym trainer would like to come up with a fitness plan for his clients, he come up with a table to show how long (in minutes) do people stay inside the workout area. Median Another measure of central location, it is the value in the middle when the data are arranged in ascending order (smallest to largest value) For ungrouped data: If n (total number of observations), is odd the median is the middle value. If n is even number of observations then median has no single middle value. In this case we follow the convention and define the median as the average of the values for the middle 2 observations Median For Grouped Data Example Example Mode MODE is the value that occurs most frequently in a data set. It is the data with the greatest frequency value For ungrouped data if there’s only 1 mode it is called Unimodal If data contains at least 2 modes it’s called Multimodal For grouped data where L is the lower class boundary of the modal class/cell fm-1 is the frequency of the group before the modal class/cell (with the highest frequency) fm is the frequency of the modal group fm+1 is the frequency of the group after the modal class/cell w is the group width Example Example Measures of Variability Range Interquartile Range Mean Absolute Deviation Variance Standard Deviation Coefficient of Variation Range Difference between the highest value and the lowest value Although it is the most simple measure of variability it is seldom used as the only measure, because the range is based only on 2 of the observed values and is highly influenced by extreme values. The smaller value for a measure of dispersion shows that mean that the data are clustered closely (the mean of the data is representative and therefore is reliable) The large value for a measure of dispersion shows that the data are scattered and in this case mean is not the representative figure and therefore it is not reliable. Using excel function =MAX(data range)-MIN(data range) Example Interquartile Range It is computed on the middle 50% of the observations, after elimination of highest and lowest 25% observations in a data set which is arranged in ascending order. Unlike range inter-quartile range it is not sensitive to extreme values. To avoid distorting measures of range with extreme values, one can resort to inter- quartile range. For rather big data sets, it is easier to compute for the 1st and 3rd quartile as follow: =QUARTILE(data range,1) , 1 is the 1st quartile =QUARTILE(data range,3) 3 is the 3rd quartile And then Subtract the values Example Using the example used previously, compute for the interquartile range 50% of the total observations is 6 observations Sort the observations in ascending order Eliminate the highest 25% and lowest 25% of the observations Mean Absolute Deviation Gets the absolute difference between the actual value and the measure of central location/tendency MAD is based on the absolute (difference is always positive, regardless of direction) distance from the mean. Example Variance Measure of variability that utilizes all the data. The variance is based on the deviation from the mean (difference between the value of each observation and the mean) Basically average of the square deviations from the arithmetic mean Variances of population and sample are practically the same, when the number of observations is large When one computed variance is computed using the sample, the sum of squared deviations about the sample mean is divided by n-1, the resulting sample variance provides an unbiased estimate of the population variance Variance where: σ2 = estimated variance from population S2 = estimated variance from sample n = total number of observations x i = the ith observation xbar = measure of central location/tendency (mean) Using Excel To obtain variance based on the population =VAR.P(data range) To obtain variance based on the sample =VAR.S(data range) Example Due to the pandemic, a certain health insurance is checking the amount of time a benefit/claim is filed and claimed for the last 3 months. The following table shows the processing times. Standard Deviation While mean indicates representative value for a data set, standard deviation shows the dispersion or variability across data points. It is considered as the most efficient measure of dispersion. Was introduced by Karl Pearson in 1983. If all data points present in a sample are near each other, the standard deviation tends to be small. Otherwise, if data points are greatly dispersed then standard deviation will tend to be large. If the sample gathered is a good representation of the population, it approximates the population variance denoted by σ. Standard deviation has little meaning in its absolute sense. It is best used when standard deviations of 2 distributions are compared, the distribution with smaller standard deviation show less variability Standard Deviation where: S = estimated variance N = total number of observation Xi = the ith observation Xbar = measure of central tendency (mean) Using Excel To obtain variance based on the population = STDEV.P(data range ) To obtain variance based on the sample = STDEV.S(data range ) Standard Deviation Example Coefficient of Variation Measures dispersion in relation to the mean. This is a relative measure of dispersion and is used to compare the relative variation in one data set with the relative variation in another data set. The coefficient of variation is a useful statistic for comparing the relative variability of different variables, each with different standard deviations and different means. CVP = Coefficient of variation CVS = Coefficient of variation σ = estimated variance from population S = estimated variance from sample µ = measure of central location/tendency xbar = measure of central location/tendency Example Measures of Dispersion Variance is based on the SQUARED distance from the mean. Variance is also always positive, regardless of direction, since squaring a positive or negative real number always creates a positive number (or zero). So when a value is far from the mean, variance tends to consider these points more heavily than MAD would. Variance is expressed in squares units instead of original units and create problem in interpretation, for this reason the standard deviation is preferred to variance. Another important difference is that variance is in a different 'measurement' than MAD. For example, if the measurements are in length, say inches (or centimeters), the MAD is a measurement in inches (or centimeters), while variance is in inches- squared (or cm-squared). Standard deviation is the square root of variance, and it is a more comparable metric to MAD, though it also treats observations that are far from the mean as 'heavier' than MAD does. Reference Course Notes from Eugene Rex L. Jalao, Ph.D. Business Intelligence Training of University of the Philippines National Engineering Center. http://igcseatmathematicsrealm.blogspot.com/2015/02/mean-of-grouped-data.html Turban, E; Sharda, R.; Delen, D.; King, D. (2010) Business Intelligence: A Managerial Approach Albright, S. Christian; Winston, Wayne L.; (2015) Business analytics: Data analysis and decision making Evans, James R. (2017) Business Analytics : methods, models, and decisions Camm, Jeffrey D.; Ohlmann, Jeffrey W.; Fry, Michael J.; Anderson, David; Cochran,, James J. (2015) Essentials of Business Analytics

Use Quizgecko on...
Browser
Browser