GM Biostatistics 2017-18 3rd Seminar (PDF)

3rd seminar
▪ Descriptive statistics
  Measure of center
  Measure of spread
  Percentile, quartile
  Histogram, box-and-whisker plot
▪ Examples
Week 4

Descriptive statistics
Aim: to obtain information about the population from which the sample was taken.
In a study, we collect information (data) from individuals. Individuals can be people, animals, plants, or any object of interest.
Descriptive statistics: organize and summarize data.
Inferential statistics: reach a decision about a large body of data by examining only a small part of the data.
A variable is a characteristic that varies among individuals in a population or in a sample (a subset of a population).
- quantitative: something that can be counted or measured for each individual and then added, subtracted, averaged, etc. across individuals in the population
- categorical: something that falls into one of several categories. What can be counted is the count or proportion of individuals in each category.
The distribution of a variable tells us what values the variable takes and how often it takes these values.

❖ Measure of center.
Mean (arithmetic average)
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
$x_i$ designates the elements of the sample, $n$ the number of elements in the sample.
Add all values, then divide by the number of individuals. It is the "center of mass."
Advantages:
- Algebraically defined and so mathematically manageable
- There is only one.
- Uses all the data values, therefore there is no information loss
- Known sampling distribution
Disadvantages:
- Distorted by outliers
- Distorted by skewed data

Descriptive statistics II.
❖ Measure of center.
Median
- The median is the value which divides the sample into two equal parts, such that the number of values equal to or greater than the median is equal to the number of values equal to or less than the median.
- An ordered array is a listing of the values of a sample from the smallest to the largest.
- If the number of elements is odd, the median is the middle value in the ordered array.
- If the number of elements is even, the median is the average of the two middle values in the ordered array.
sample: 5, 8, 2, 4, 6        ordered array: 2, 4, 5, 6, 8        median: 5
sample: 7, 1, 0, 6, 5, 2     ordered array: 0, 1, 2, 5, 6, 7     median: (2+5)/2 = 3.5
- If the distribution is symmetrical, the mean and the median coincide; in a skewed distribution they generally differ.
Advantages:
- Clearly defined, all data sets have (exactly one!) median.
- Not distorted by outliers
- Not distorted by skewed data
Disadvantages:
- Not algebraically defined
- Ignores most of the information
- Complicated sampling distribution

Mode
- The value that occurs the most frequently.
sample: 1, 2, 2, 2, 3, 4, 4, 5, 5        mode = 2
- The number of modes is not necessarily one. If all the values are different, there is no mode.
- One mode: unimodal distribution; two modes: bimodal distribution.
Advantages:
- Easily determined for categorical data
- Not distorted by outliers
Disadvantages:
- Ignores most of the information
- Not algebraically defined
- Unknown sampling distribution
- Does not always exist, or there may be more than one

Descriptive statistics III.
❖ Measure of spread.
standard deviation (SD, s; related to samples)
$SD_{sample} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
$x_i$: the elements of the sample
$\bar{x}$: the mean of the sample
$n$: the number of elements in the sample
The standard deviation is used to describe the variation around the mean (within a given set of data).
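
The definitions of the mean, median, mode and sample standard deviation given so far can be written out step by step. The following is a minimal Python sketch, not part of the seminar material: the function names (`mean`, `median`, `modes`, `sample_sd`) are chosen here for illustration, and returning an empty list when all values differ is just one way to encode "there is no mode".

```python
import math
from collections import Counter


def mean(values):
    """Arithmetic mean: add all values, divide by their number."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered array (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    """All values with the highest frequency; empty list if every value is different."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


def sample_sd(values):
    """Sample standard deviation, with n - 1 in the denominator."""
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))


print(median([5, 8, 2, 4, 6]))               # 5
print(median([7, 1, 0, 6, 5, 2]))            # 3.5
print(modes([1, 2, 2, 2, 3, 4, 4, 5, 5]))    # [2]
print(round(sample_sd([5, 8, 2, 4, 6]), 3))  # 2.236
```

The three samples are the ones used in the median and mode slides; the last line applies the sample SD formula to the first of them.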
Like the mean, the standard deviation is not resistant to skew or outliers.
The square of the sample SD (the sample variance, with n − 1 in the denominator) is an unbiased estimate of the population variance; the sample SD is therefore used as the estimate of the population SD.
standard deviation (σ; related to a population)
$SD_{population} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
$x_i$: the elements in the population
$\bar{x}$: the mean of the population
$n$: the number of elements in the population
variance: $SD^2$, measures the dispersion (scatter) of the values about their mean

Descriptive statistics IV.
❖ Measure of spread.
coefficient of variation (CV)
Comparing the standard deviations of two sets of data may lead to fallacious results, for instance when the two variables involved are measured in different units. For example, we may wish to know, for a certain population, whether serum cholesterol levels, measured in milligrams per 100 ml, are more variable than body weight, measured in pounds.
The coefficient of variation expresses the standard deviation as a percentage of the mean. The formula is given by:
$CV = 100 \cdot \frac{SD}{\bar{x}}$
Because the coefficient of variation is independent of the scale of measurement, it is a useful statistic for comparing the variability of two or more variables measured on different scales.

Descriptive statistics V.
❖ Percentile, quartiles.
The i-th percentile of a sample is the value which is equal to or larger than i % of the observations.
The first, second and third quartile of a sample are the values which have 25%, 50% and 75%, respectively, of the observations at or below them.
- Q1 is the 25th percentile, Q2 is the 50th percentile (or the median), Q3 is the 75th percentile
OR
- Q1 is the median of the lower half of the sorted data, excluding the median
- Q3 is the median of the upper half of the sorted data, excluding the median
[Figure: example data set with Q1 = first quartile = 2.2, Q2 = second quartile = median = 3.4, Q3 = third quartile = 4.35]
The interquartile range (IQR) is the difference between the third and first quartiles (it contains the middle 50 percent of the observations in a data set):
$IQR = Q_3 - Q_1$
Range: the difference between the largest and the smallest value in a set.
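
As an illustration of the measures of spread defined above, here is a small Python sketch. It is an assumption-laden example rather than seminar material: `quartiles` and `coefficient_of_variation` are hypothetical helper names, `quartiles` implements the "median of the lower/upper half, excluding the median" variant (other quartile conventions can give slightly different values), and the data are the five-element sample from the median slide.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3); Q1 and Q3 are the medians of the lower and upper
    halves of the sorted data, leaving out the median itself when n is odd.
    Needs at least two values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


def coefficient_of_variation(values):
    """Sample SD expressed as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


data = [5, 8, 2, 4, 6]                    # sample from the median slide

q1, q2, q3 = quartiles(data)
iqr = q3 - q1                             # interquartile range
data_range = max(data) - min(data)        # largest minus smallest value
sd_sample = statistics.stdev(data)        # n - 1 in the denominator
sd_population = statistics.pstdev(data)   # n in the denominator

print(q1, q2, q3)                                  # 3.0 5 7.0
print(iqr, data_range)                             # 4.0 6
print(round(sd_sample, 3), sd_population)          # 2.236 2.0
print(round(coefficient_of_variation(data), 1))    # 44.7
```

The two SD values differ only in the denominator, which is the distinction between estimating the population SD from a sample and calculating it from the whole population.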
The range is of limited usefulness.

Example
In a study the average weight of boys in a class at primary school was investigated. A random sample was taken and the data are shown in the table below:
weight (kg) of the six test persons: 21, 25, 23, 20, 26, 23
a) Determine the mean, the median and the mode of the sample.
b) Determine the variance and the standard deviation of the sample.
c) Estimate the standard deviation of the population (i.e. all the boys in the same class of the primary school).
d) Assuming that there are only six boys in the class (i.e. the whole population was measured) determine the standard deviation of the population.

Solution
a) $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{21 + 25 + 23 + 20 + 26 + 23}{6} = \frac{138}{6} = 23$
In order to determine the median an ordered array has to be constructed: 20, 21, 23, 23, 25, 26
Since the number of elements is even, the median is the mean of the two central elements: $\frac{23 + 23}{2} = 23$
The mode is the element present with the highest frequency. In this case it is 23.
b) SD of the sample:
$SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{(21 - 23)^2 + (25 - 23)^2 + \ldots + (23 - 23)^2}{6 - 1}} = \sqrt{\frac{4 + 4 + 0 + 9 + 9 + 0}{5}} = \sqrt{\frac{26}{5}} = 2.28$
variance: $SD^2 = 2.28^2 = 5.2$
c) The standard deviation of the sample is the statistical estimate of the standard deviation of the population. Therefore, based on the sample, the estimate of the population SD is 2.28.
d) In this part the SD of the population has to be calculated, i.e. we do not estimate the SD, but calculate its exact value. This can only be done if each individual in the population has been measured. If this is the case, the population SD can be calculated according to the following formula:
$SD_{pop} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} = \sqrt{\frac{\sum_{i=1}^{6} (x_i - \bar{x})^2}{6}} = \sqrt{\frac{(21 - 23)^2 + (25 - 23)^2 + \ldots + (23 - 23)^2}{6}} = \sqrt{\frac{26}{6}} = 2.08$

Example 2.
In a survey researchers wanted to investigate the ability of residents of a town to recall historical dates.
A random sample was taken and the scores achieved on a test by the volunteers are summarized in the table below:
scores of the seven test persons: 45, 64, 33, 77, 64, 86, 54
a) Determine the mean, the median and the mode of the sample.
b) Determine the standard deviation and the variance of the sample.

Solution
a) $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{45 + 64 + 33 + 77 + 64 + 86 + 54}{7} = \frac{423}{7} = 60.43$
In order to determine the median the elements have to be arranged in ascending order (ordered array): 33, 45, 54, 64, 64, 77, 86
The median is the central element of the ordered array if the number of elements is odd. Therefore, the median of the sample is 64.
Mode: the element occurring with the highest frequency in the sample. Since element "64" is present twice in the sample and all the other elements only once, the mode of the sample is 64.
b) standard deviation (SD):
$SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{(45 - 60.43)^2 + (64 - 60.43)^2 + \ldots + (54 - 60.43)^2}{7 - 1}} = 18.19$
The variance is the square of the SD, therefore: $SD^2 \approx 330.95$ (squaring the rounded value 18.19 gives 330.88).

c) Prepare a table showing the following descriptors of the sample: relative frequencies and cumulative distribution function of the sample (cumulative relative frequencies).

Solution
Sample elements   Relative frequency   Cumulative relative freq.
33                1/7                  1/7
45                1/7                  2/7
54                1/7                  3/7
64                2/7                  5/7
77                1/7                  6/7
86                1/7                  7/7
First column: unique sample elements are written in ascending order (i.e. if an element is present more than once, it is displayed only once).
Second column: the number of occurrences of a given element is divided by the number of elements in the sample to yield the relative frequency.
Third column: for every element the relative frequencies of all the elements smaller than or equal to the given element are summed.

d) Prepare two graphs based on the table prepared in c). Plot the (relative frequency) distribution of the sample as a histogram in the 1st graph, and the empirical cumulative distribution function in the 2nd one.

Solution
[Figure: two graphs. Histogram: relative frequency vs. test scores. CDF: cumulative relative frequency vs. test scores.]
The filled circles in the graph displaying the cumulative distribution function of the sample show which value the function assumes at the discontinuities (i.e. at the step-wise increments), e.g. at 45 the cumulative distribution function of the sample assumes the value 2/7.
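
The relative frequency table and the step-wise cumulative distribution function of Example 2 can also be produced programmatically. The sketch below is a hedged illustration in Python, using the seven scores that appear in the solution of part a); the names `rel_freq`, `cum_freq` and `ecdf` are assumptions made for this example and do not come from the seminar.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate

scores = [45, 64, 33, 77, 64, 86, 54]   # Example 2 test scores
n = len(scores)

counts = Counter(scores)
elements = sorted(counts)                # unique elements, ascending order
rel_freq = [Fraction(counts[v], n) for v in elements]
cum_freq = list(accumulate(rel_freq))    # running sum of relative frequencies


def ecdf(x):
    """Empirical CDF: fraction of sample elements smaller than or equal to x."""
    return Fraction(sum(1 for s in scores if s <= x), n)


for value, rel, cum in zip(elements, rel_freq, cum_freq):
    print(value, rel, cum)

print(ecdf(45))   # 2/7, the value at the filled circle at 45
```

Exact fractions are used instead of floating-point numbers so that the output matches the table entries; note that `Fraction` prints the final cumulative value 7/7 as 1.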

Example 3.
The pulse rate of test subjects was measured and the data are summarized in the table below:
ID                  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
pulse rate (1/min) 64 80 69 79 72 63 66 74 70 75 72 62 68 57 73 64
Define class intervals according to Sturges' rule, and determine the relative frequencies and cumulative relative frequencies.

Solution
According to Sturges' rule the number of class intervals should be:
$k = 1 + \log_2 N = 1 + \frac{\log_{10} N}{\log_{10} 2} = 1 + 3.3219 \log_{10} N$
$k = 1 + \log_2 16 = 1 + 3.3219 \log_{10} 16 = 5$
Another question that must be decided regards the width of the class intervals (w). Class intervals generally should be of the same width (although this is sometimes impossible to accomplish). This width may be determined by dividing the range (R) by k, the number of class intervals, thus $w = R/k$.
The range (R) is the difference between the largest and the smallest observation in the data set: R = 80 − 57 = 23, thus w = 23/5 = 4.6 ≈ 5
Class interval   Numbers belonging to the class interval   Relative frequency   Cumulative relative frequency
52-57            57                                        1/16                 1/16
58-63            63, 62                                    2/16                 3/16
64-69            64, 69, 66, 68, 64                        5/16                 8/16
70-75            72, 74, 70, 75, 72, 73                    6/16                 14/16
76-81            80, 79                                    2/16                 16/16
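
The calculation in Example 3 can be retraced with a short Python sketch. Two points are assumptions of this illustration rather than statements from the seminar: non-integer results are rounded up with `math.ceil`, and the class limits are copied from the solution table as given (inclusive on both ends) instead of being derived from w.

```python
import math
from itertools import accumulate

pulse = [64, 80, 69, 79, 72, 63, 66, 74, 70, 75, 72, 62, 68, 57, 73, 64]
n = len(pulse)

# Sturges' rule: k = 1 + log2(N); rounding up is an assumption for non-integer k.
k = math.ceil(1 + math.log2(n))
data_range = max(pulse) - min(pulse)      # R = 80 - 57 = 23
width = math.ceil(data_range / k)         # 23 / 5 = 4.6, rounded up to 5

# Class limits copied from the solution table, both ends inclusive.
classes = [(52, 57), (58, 63), (64, 69), (70, 75), (76, 81)]
counts = [sum(1 for x in pulse if lo <= x <= hi) for lo, hi in classes]
cumulative = list(accumulate(counts))     # running total of class counts

for (lo, hi), count, cum in zip(classes, counts, cumulative):
    print(f"{lo}-{hi}: {count}/{n}  {cum}/{n}")
```

The frequencies are printed as unreduced fractions over 16 so that they can be compared directly with the table.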

Example 4.
Test scores achieved on a test were investigated in a sample. The following table summarizes the percentile values of the sample:
percentile   5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
score        1  5 10 16 20 23 26 28 29 30 31 33 35 38 41 44 45 48 49  50
a) Read the median, the Q1, Q2, Q3 values from the table.
b) What percent of the values lie in the range between 16 and 33 (more specifically larger than 16 and smaller than or equal to 33)?
c) What percent of values are smaller than or equal to 45?
d) What percent of values are larger than 26?

Solution
a) Q2 is the same as the median. Q1, Q2 and Q3 correspond to the 25th, 50th and 75th percentile, respectively. Therefore, Q1 = 20, Q2 = 30, Q3 = 41
b) 60% of the values are smaller than or equal to 33. 20% of the values are smaller than or equal to 16. This is the range we are looking for, i.e. 16
