Data Description: Measures of Variation PDF
University of Balamand
Clara Al Kosseifi
Summary
This document describes measures of variation in data, including range, percentiles, variance, and standard deviation. It provides examples and explanations for each concept.
Full Transcript
Data description: measures of variation
Clara Al Kosseifi

Goals

The arithmetic means of the two samples are both 200 mg/dL. However, the two samples appear radically different. The difference lies in the greater variability, or spread, of one of the samples. Several different measures can be used to describe the variability of a sample.

Range

The range is the difference between the largest and smallest observations in a sample. It is very easy to compute, but it is very sensitive to extreme observations. A further drawback is that it depends on the sample size (n): the larger n is, the larger the range tends to be. We therefore can't compare ranges from data sets of differing size.

Quantiles/Percentiles

The pth percentile is the value $V_p$ such that p percent of the sample points are less than or equal to $V_p$. The median, being the 50th percentile, is a special case of a quantile.

To compute percentiles, the sample points must be ordered. This can be difficult if n is even moderately large.

Frequently used percentiles are quartiles (25th, 50th, and 75th percentiles), quintiles (20th, 40th, 60th, and 80th percentiles), and deciles (10th, 20th, ..., 90th percentiles).

Variance and standard deviation

Data variation is based on the difference, or distance, of each data value from the mean. This difference or distance is called a deviation.
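As a brief aside, the range and percentile definitions given earlier can be sketched in Python. This is only an illustration under stated assumptions: the function names `sample_range` and `percentile` are hypothetical, the cholesterol values are made up, and the rule used when the position falls between two ordered points (average two neighbours when n·p/100 is a whole number, otherwise take the next ordered point) is one common textbook convention rather than something stated in the slides. Other conventions give slightly different values.

```python
def sample_range(values):
    """Range: largest observation minus smallest observation."""
    return max(values) - min(values)


def percentile(values, p):
    """Return the pth percentile for an integer p with 0 < p < 100.

    Assumed convention: let k = n*p/100. If k is a whole number, average the
    kth and (k+1)th ordered points; otherwise take the first ordered point
    above position k.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)  # percentiles require ordered sample points
    n = len(ordered)
    k, remainder = divmod(n * p, 100)
    if remainder == 0:
        # k is a whole number: average the kth and (k+1)th points (1-based)
        return (ordered[k - 1] + ordered[k]) / 2
    # otherwise use the (k+1)th ordered point (1-based), which is index k
    return ordered[k]


# Made-up cholesterol values in mg/dL, for illustration only
sample = [180, 190, 195, 200, 205, 210, 220, 200, 185, 215]

print("range:", sample_range(sample))
print("quartiles:", [percentile(sample, p) for p in (25, 50, 75)])
```

Sorting is done inside `percentile`, which reflects the point that the sample must be ordered first. Adding a single extreme value to `sample` changes the range immediately while leaving the quartiles almost untouched, which is the sensitivity to extreme observations noted for the range.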
Example: cholesterol measurement. The deviations from the mean always sum to zero, $\sum (X - \mu) = 0$, where $\mu$ denotes the arithmetic mean of the data, so their sum cannot measure spread. To eliminate this problem, we sum the squared deviations and divide by $n - 1$ (n is the total number of data values). This gives the variance

$$s^2 = \frac{\sum (X - \mu)^2}{n - 1}$$

and the standard deviation $s = \sqrt{s^2}$.

If the values are near the mean, the variance $s^2$ is small. If the values are far from the mean, the variance $s^2$ is large.

Properties of the variance and standard deviation

If we create a translated sample $x_1 + c, \dots, x_n + c$ by adding a constant c to each data point, then the variance and standard deviation remain the same, because the relationship of the points in the sample relative to one another remains the same.

If we create a rescaled sample $c x_1, \dots, c x_n$ by multiplying each data point by a constant c, then the arithmetic mean of the rescaled sample is also rescaled by the factor c.

Shortcut formulas

To save time when repeated subtracting and squaring occur in the original formulas, we have shortcut formulas that are mathematically equivalent to the previous ones.

The coefficient of variation (CV)

It is useful to relate the arithmetic mean and the standard deviation to each other:

$$CV = \frac{s}{\mu} \times 100\%$$

The CV is most useful in comparing the variability of several different samples, each with a different arithmetic mean. Comparing the CVs gives a more accurate comparison than comparing the standard deviations. This measure remains the same regardless of what units are used: if the units change by a factor c, then both the mean and the standard deviation change by the factor c, so the CV remains the same.

Grouped data

Consider the data set in Table 2.7, which represents the birthweights from 100 consecutive deliveries at a Boston hospital. The simplest way to display the data is to generate a frequency distribution.
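A frequency distribution of this kind can be sketched in a few lines of Python. Table 2.7 is not reproduced in this transcript, so the birthweights below are made-up values and the function name `frequency_distribution` is hypothetical. Besides the count for each value, the sketch also reports the percent of the total and a running (cumulative) count and percent.

```python
from collections import Counter


def frequency_distribution(values):
    """Return rows (value, count, percent, cum_count, cum_percent),
    ordered by value."""
    counts = Counter(values)      # how many times each value occurs
    total = len(values)
    rows = []
    cum_count = 0
    for value in sorted(counts):  # ordered display of the distinct values
        count = counts[value]
        cum_count += count        # number of data points <= this value
        rows.append((value, count, 100 * count / total,
                     cum_count, 100 * cum_count / total))
    return rows


# Made-up birthweights for illustration only (not the Table 2.7 data)
birthweights = [112, 98, 120, 112, 105, 120, 112, 98, 130, 120]

for row in frequency_distribution(birthweights):
    print(row)
```

The last row always has a cumulative count equal to the sample size and a cumulative percent of 100, which is a quick sanity check on any such table.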
A frequency distribution is an ordered display of each value in a data set together with its frequency, that is, the number of times that value occurs in the data set.

Alongside the frequency (Count) of each value b, the display can give the relative frequency (Percent),

$$\text{Percent} = \frac{\text{frequency}}{\text{total}} \times 100$$

and the cumulative frequency (CumCnt), which is the number of data points in the sample that are less than or equal to b. CumPct is the cumulative frequency expressed as a percent of the total.

Grouped data: steps

The data could be grouped into broader categories:
1. Subdivide the data into k intervals, starting at some lower bound $y_1$ and ending at some upper bound $y_{k+1}$.
2. The first interval is from $y_1$ inclusive to $y_2$ exclusive; the kth and last interval is from $y_k$ inclusive to $y_{k+1}$ exclusive.
3. The group intervals are generally chosen to be equal.
4. A count is made of the number of units that fall in each interval.

Group interval        Frequency (count)   Percent   CumCnt   CumPct
29.5 ≤ x < 69.5               5               5         5        5
69.5 ≤ x < 89.5              10              10        15       15
89.5 ≤ x < 99.5              11              11        26       26
99.5 ≤ x < 109.5             19              19        45       45
109.5 ≤ x < 119.5            17              17        62       62
119.5 ≤ x < 129.5            20              20        82       82
129.5 ≤ x < 139.5            12              12        94       94
139.5 ≤ x < 169.5             6               6       100      100
Total                       100             100

For the first interval, for example, Percent = 5/100 × 100 = 5.

Summary

Variance and standard deviation can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. This is useful in comparing two (or more) data sets to determine which is more (most) variable.

Variance and standard deviation are used to determine the consistency of a variable.

When two variables with different units are to be compared, we use the coefficient of variation.
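To make these points concrete, here is a minimal Python sketch of the variance, standard deviation, and coefficient of variation, with checks of the translation and rescaling properties. Several things are assumptions for illustration: the function names, the two made-up samples (both chosen to have mean 200 mg/dL), and the shortcut form $s^2 = \frac{n \sum X^2 - (\sum X)^2}{n(n-1)}$, which is one common textbook version; the slides' own shortcut formulas did not survive in this transcript.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Sample variance: sum of squared deviations divided by n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def variance_shortcut(xs):
    """Assumed shortcut form: [n*sum(x^2) - (sum x)^2] / [n*(n - 1)]."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))


def std_dev(xs):
    return math.sqrt(variance(xs))


def cv(xs):
    """Coefficient of variation, as a percent of the mean."""
    return 100 * std_dev(xs) / mean(xs)


# Two made-up samples (mg/dL) with the same mean but different spread
tight = [195, 198, 200, 202, 205]
wide = [150, 175, 200, 225, 250]

print("means:", mean(tight), mean(wide))
print("variances:", variance(tight), variance(wide))
print("shortcut variance (tight):", variance_shortcut(tight))

# Translation: adding a constant leaves the variance unchanged
shifted = [x + 10 for x in tight]
print("variance after +10:", variance(shifted))

# Rescaling: a change of units (mg/dL to g/L is a factor of 0.01)
# rescales the mean and standard deviation but not the CV
in_g_per_l = [0.01 * x for x in tight]
print("CV in mg/dL:", cv(tight), " CV in g/L:", cv(in_g_per_l))
```

Both samples report the same mean, yet the variance of `wide` is far larger, which is the situation described under Goals. The CV printed for the two unit systems agrees up to floating-point rounding.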