24 biostat_lecture_2.pdf

Full Transcript

Common Numerical Descriptive Statistics

Measures of Center
- Measure of center: the value at the center or middle of a data set.

Arithmetic Mean
- Arithmetic mean (mean): the measure of center obtained by adding the values and dividing the total by the number of values. It is what most people call an average.

Mean
- Advantages: it takes every data value into account, and it is relatively reliable, since means of samples drawn from the same population don't vary as much as other measures of center.
- Disadvantage: it is sensitive to extreme data values; one extreme value can affect it dramatically.

Median
- Median: the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude.
- The median is not affected by an extreme value.

Finding the Median
First sort the values (arrange them in order), then follow one of these rules:
1. If the number of data values is odd, the median is the number located in the exact middle of the list.
2. If the number of data values is even, the median is found by computing the mean of the two middle numbers.

Example (even number of values):
Data: 5.40 1.10 0.42 0.73 0.48 1.10
In order: 0.42 0.48 0.73 1.10 1.10 5.40
There is no exact middle; the middle is shared by the two numbers 0.73 and 1.10.
$\text{median} = \frac{0.73 + 1.10}{2} = 0.915$

Example (odd number of values):
Data: 5.40 1.10 0.42 0.73 0.48 1.10 0.66
In order: 0.42 0.48 0.66 0.73 1.10 1.10 5.40
The exact middle value is 0.73, so the median is 0.73.

Mode
- Mode: the value that occurs with the greatest frequency.
- A data set can have one, more than one, or no mode.
- Bimodal: two data values occur with the same greatest frequency.
- Multimodal: more than two data values occur with the same greatest frequency.
- No mode: no data value is repeated.
- The mode is the only measure of center that can be used with nominal data.

Mode - Examples
a. 5.40 1.10 0.42 0.73 0.48 1.10: mode is 1.10
b. 27 27 27 55 55 55 88 88 99: bimodal, 27 and 55
c. 1 2 3 6 7 8 9 10: no mode

Critical Thinking
Interpreting measures of center in the context of histograms.
Copyright © 2010 Pearson Education

Shape of the Distribution
The mean, median, and mode are useful measures of center, but they cannot be used on their own to identify the shape of the distribution.

Basic Concepts of Measures of Variation
- In the most basic of definitions, variability reflects how scores differ from one another.
- It is sometimes called spread or dispersion.
- Three measures of variability are commonly used: range, standard deviation, and variance.

Range
The range of a set of data values is the difference between the maximum data value and the minimum data value.
Range = (maximum value) - (minimum value)
It is very sensitive to extreme values and is therefore not as useful as other measures of variation.

Standard Deviation
The standard deviation of a set of sample values, denoted by s, is a measure of variation of values about the mean. It represents the average amount of variability in a set of scores (roughly, the average distance from the mean).
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
where Σ = sigma, "sum of"; x = each individual score; $\bar{x}$ = mean of all scores; n = sample size.

Standard Deviation - Properties
- The standard deviation is a measure of variation of all values from the mean. Values close together have a small standard deviation, but values with much more variation have a larger standard deviation.
- The value of the standard deviation s is never negative; it is zero only when all data values are equal.
- The value of the standard deviation s can increase dramatically with the inclusion of one or more outliers (data values far away from all others).
- The units of the standard deviation s are the same as the units of the original data values.

Variance
- The variance of a set of values is a measure of variation equal to the square of the standard deviation.
- Sample variance: $s^2$, the square of the sample standard deviation s.
- Population variance: $\sigma^2$, the square of the population standard deviation σ.

Measures of Location

Percentiles
There are 99 percentiles, denoted P1, P2, ..., P99, which divide a set of data into 100 groups with about 1% of the values in each group.
Example: a pediatrician needs percentiles to interpret a growth chart.

Quartiles
Quartiles, denoted Q1, Q2, and Q3, divide a set of ranked data into four groups with about 25% of the values in each group.
[Figure: (minimum) 25% Q1 25% Q2 (median) 25% Q3 25% (maximum)]
- Q1 (first quartile) separates the bottom 25% of sorted values from the top 75%. In other words, the 25th percentile is the value that has 25% of the scores below it. Also known as the lower quartile.
- Q2 (second quartile) is the same as the median; it separates the bottom 50% of sorted values from the top 50%. The 50th percentile is the value that has 50% of the scores below it.
- Q3 (third quartile) separates the bottom 75% of sorted values from the top 25%. The 75th percentile is the value that has 75% of the scores below it. Also known as the upper quartile.

Z-scores: Measures of Relative Standing
Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.

Motivations
This section introduces measures of relative standing, which are numbers showing the location of data values relative to the other values within a data set. They can be used to compare values from different data sets.
Motivation:
- Michael Jordan is 78 inches tall.
- Rebecca Lobo is 76 inches tall.
- Relatively speaking, who is taller?
Facts:
- The average height of men is 69 inches with a standard deviation of 2.8 inches.
- For women, the average height is 63.6 inches with a standard deviation of 2.5 inches.
Jordan's and Lobo's heights should be standardized relative to their gender so the data can be compared.

Standardizing Data
Data can be standardized so that values from different data sets can be compared, or so that values within the same data set can be compared.
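As a quick check on the measures of center, variation, and location covered so far, the following minimal sketch recomputes them with Python's standard `statistics` module on the data set from the median and mode examples. The helper `modes` is a hypothetical name introduced here, not something from the lecture; it exists only to reproduce the lecture's convention that a data set with no repeated value has no mode. Quartile values depend on the interpolation convention, so `statistics.quantiles` may differ slightly from a textbook hand calculation.

```python
import statistics
from collections import Counter


def modes(data):
    """Return all modes, or [] when no value repeats (the "no mode" case)."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]


# Data set used in the lecture's median and mode examples
values = [5.40, 1.10, 0.42, 0.73, 0.48, 1.10]

mean = statistics.mean(values)                    # sum of values / number of values
median_even = statistics.median(values)           # mean of the two middle values, about 0.915
median_odd = statistics.median(values + [0.66])   # exact middle value of seven values
data_range = max(values) - min(values)            # maximum minus minimum
s = statistics.stdev(values)                      # sample standard deviation (divides by n - 1)
s_squared = statistics.variance(values)           # sample variance, the square of s
q1, q2, q3 = statistics.quantiles(values, n=4)    # quartiles; q2 matches the median

print("mean:", round(mean, 3))
print("median (even n):", round(median_even, 3))
print("median (odd n):", median_odd)
print("modes:", modes(values))
print("range:", round(data_range, 2))
print("s:", round(s, 3), "s^2:", round(s_squared, 3))
print("quartiles:", round(q1, 3), round(q2, 3), round(q3, 3))
```

Because the value 5.40 is far from the rest, the mean lands well above the median here, which illustrates the mean's sensitivity to extreme values.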
Z Score
- z score (or standardized value): the number of standard deviations that a given value x is above or below the mean.

Measures of Position: z Score
Sample: $z = \frac{x - \bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma}$

Facts About z-Scores
- The mean of the z-scores of a population is always 0.
- The standard deviation of the z-scores of a population is always 1.
- z-scores do not have units.

Interpreting Z Scores
Whenever a value is less than the mean, its corresponding z score is negative.
Ordinary values: $-2 \le z \le 2$
Unusual values: $z < -2$ or $z > 2$

In our example:
MJ's z-score = (78 - 69)/2.8 = 3.21
Lobo's z-score = (76 - 63.6)/2.5 = 4.96
Lobo's z-score is larger, so relative to others of the same gender she is the taller of the two.

Example: Body Temperatures
Body temperatures of healthy human children have mean = 98.60°F and standard deviation = 0.62°F. A child has a temperature of 101°F. Does this child have a fever?
$z = \frac{101 - 98.6}{0.62} = 3.87$
Since 3.87 is greater than 2, this temperature is unusual for a healthy child.

Probability
Always express a probability as a fraction or decimal number between 0 and 1.
- For any event A, the probability of A is between 0 and 1 inclusive. That is, $0 \le P(A) \le 1$.

Basic Rules for Computing Probability

Relative Frequency Approximation of Probability
Gallup Poll: among 1038 randomly selected adults, 52 said that second-hand smoke is not at all harmful.
P(person believes that second-hand smoke is not at all harmful) = 52/1038 = 0.0501

Addition Rule
Sickle cell anemia is an autosomal recessive disease. Suppose both parents are carriers with the genotype Aa.
Possible genotypes of a child: AA, Aa, aa
P(child has the genotype aa) = 1/4
P(child has the genotype Aa) = 1/2
P(child has the genotype AA) = 1/4
Question: P(child has the genotype AA or Aa) = ?
By the addition rule:
P(child has the genotype AA or Aa) = 1/4 + 1/2 = 3/4
AA and Aa are mutually exclusive, i.e. no child can have the AA and Aa genotypes at the same time.

Multiplication Rule

Key Concept
The basic multiplication rule is used for finding P(A and B), the probability that event A occurs in a first trial and event B occurs in a second trial.

Notation
P(A and B) = P(event A occurs in a first trial and event B occurs in a second trial)

Multiplication Rule
P(A and B) = P(A) * P(B), where events A and B are independent.
Example: assume cancer and schizophrenia are two independent illnesses.
P(person develops cancer during life) = 0.25
P(person develops schizophrenia during life) = 0.01
P(person develops both cancer and schizophrenia during life) = 0.25 * 0.01 = 0.0025

Normal Distribution and Probability Calculation

Example - Heights of U.S. Adults
Female and male adult heights are well approximated by normal distributions, written here as N(mean, standard deviation):
$Y_F \sim N(63.7, 2.5)$
$Y_M \sim N(69.1, 2.6)$
[Figure: histograms of adult heights in inches. Males (INCHESM, cases weighted by PCTM): mean = 69.1, std. dev. = 2.61. Females (INCHESF, cases weighted by PCTF): mean = 63.7, std. dev. = 2.48. Source: Statistical Abstract of the U.S. (1992)]

Continuous Random Variables
- A continuous random variable takes infinitely many possible values,
- and those values can be associated with measurements on a continuous scale without gaps or interruptions, e.g. baby weight, blood pressure, heart rate.

Area and Probability
Because the total area under the density curve is equal to 1, there is a correspondence between area and probability.

Normal Distribution for Continuous Data
Example: a population's resting heart rate is normally distributed with mean = 70 and standard deviation (SD) = 10.
About 68% of the heart rates in the population lie within ± 1 SD of the mean, i.e. mean ± 1 SD = 70 ± 10, or the range 60-80.
Do we need a new normal reference distribution each time we look at a normally distributed variable? Not necessarily. We will use the z-score and the standard normal distribution.

z Score
Population: $z = \frac{x - \mu}{\sigma}$
If x follows a normal distribution with mean µ and standard deviation σ, the z score follows a standard normal distribution with mean = 0 and standard deviation = 1.

Standard Normal Distribution
The standard normal distribution is a normal probability distribution with µ = 0 and σ = 1. The total area under its density curve is equal to 1.
The standard normal distribution has three properties:
1. Its graph is bell-shaped.
2. Its mean is equal to 0 (µ = 0).
3. Its standard deviation is equal to 1 (σ = 1).
Develop the skill to find areas (or probabilities) corresponding to various regions under the graph of the standard normal distribution.

Converting to a Standard Normal Distribution
$z = \frac{x - \mu}{\sigma}$
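
The z-score, probability-rule, and normal-distribution examples in this lecture can be reproduced with a short sketch like the one below. It uses `statistics.NormalDist` from the Python standard library; the function names `z_score` and `is_unusual` are assumptions introduced for illustration, and the cutoff of 2 is the lecture's rule of thumb for unusual values.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - mean) / sd


def is_unusual(z, cutoff=2):
    """Lecture's rule of thumb: unusual when z < -2 or z > 2."""
    return abs(z) > cutoff


# z-score examples from the lecture
z_jordan = z_score(78, 69, 2.8)       # about 3.21
z_lobo = z_score(76, 63.6, 2.5)       # about 4.96
z_temp = z_score(101, 98.6, 0.62)     # about 3.87

# Relative frequency approximation (Gallup poll example)
p_not_harmful = 52 / 1038

# Addition rule for mutually exclusive events: P(AA or Aa)
p_aa_or_carrier = 1 / 4 + 1 / 2

# Multiplication rule for independent events: P(cancer and schizophrenia)
p_both = 0.25 * 0.01

# Resting heart rate modeled as normal with mean 70 and SD 10:
# the area between 60 and 80 is the probability of being within 1 SD.
heart_rate = NormalDist(mu=70, sigma=10)
p_within_1sd = heart_rate.cdf(80) - heart_rate.cdf(60)

# The same area on the standard normal curve, after converting to z-scores
standard = NormalDist()  # mean 0, standard deviation 1
p_standard = standard.cdf(z_score(80, 70, 10)) - standard.cdf(z_score(60, 70, 10))

print(round(z_jordan, 2), round(z_lobo, 2), round(z_temp, 2))
print(is_unusual(z_temp))
print(round(p_not_harmful, 4), p_aa_or_carrier, round(p_both, 4))
print(round(p_within_1sd, 4), round(p_standard, 4))
```

The last two probabilities agree, which is the point of converting to z-scores: one standard normal distribution serves for any normally distributed variable.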
