Mine Çetinkaya-Rundel

Summary

This document is chapter 2 from the 4th edition of OpenIntro Statistics. It covers summarizing data, including examining numerical data, scatterplots, and dot plots. The slides also discuss concepts like mean, median, and variance.

Full Transcript

Chapter 2: Summarizing data
OpenIntro Statistics, 4th Edition

Slides developed by Mine Çetinkaya-Rundel of OpenIntro. The slides may be copied, edited, and/or shared via the CC BY-SA license. Some images may be included under fair use guidelines (educational purposes).

Examining numerical data

Scatterplot

Scatterplots are useful for visualizing the relationship between two numerical variables.

[Figure: scatterplot of life expectancy versus total fertility; source: http://www.gapminder.org/world]

Do life expectancy and total fertility appear to be associated or independent?
They appear to be linearly and negatively associated: as fertility increases, life expectancy decreases.

Was the relationship the same throughout the years, or did it change?
The relationship changed over the years.

Dot plots

Useful for visualizing one numerical variable. Darker colors represent areas where there are more observations.

[Figure: dot plot of GPA, axis from 2.5 to 4.0]

How would you describe the distribution of GPAs in this data set? Make sure to say something about the center, shape, and spread of the distribution.

Dot plots & mean

[Figure: dot plot of GPA, axis from 2.5 to 4.0, with the mean marked by a triangle]

The mean, also called the average (marked with a triangle in the above plot), is one way to measure the center of a distribution of data. The mean GPA is 3.59.
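As a small illustration of the mean as a measure of center, the sketch below averages a handful of GPA values. The values are hypothetical (the data behind the dot plot are not listed in the slides), so the result will not match the 3.59 reported above.

```python
from statistics import mean

# Hypothetical GPA values, for illustration only; the slide's data are not given.
gpas = [3.2, 3.5, 3.6, 3.8, 3.9]

# The mean is the sum of the observations divided by how many there are.
manual_mean = sum(gpas) / len(gpas)

# statistics.mean performs the same calculation.
library_mean = mean(gpas)

print(round(manual_mean, 2), round(library_mean, 2))
```

Both calculations should agree up to floating-point rounding.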
Mean

The sample mean, denoted as $\bar{x}$, can be calculated as

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$,

where $x_1, x_2, \cdots, x_n$ represent the $n$ observed values.

The population mean is computed the same way but is denoted as $\mu$. It is often not possible to calculate $\mu$ since population data are rarely available.

The sample mean is a sample statistic, and serves as a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.

Stacked dot plot

Higher bars represent areas where there are more observations, which makes it a little easier to judge the center and the shape of the distribution.

[Figure: stacked dot plot of GPA, axis from 2.6 to 4.0]

Histograms - Extracurricular hours

Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.
Histograms are especially convenient for describing the shape of the data distribution.
The chosen bin width can alter the story the histogram is telling.

[Figure: histogram of extracurricular hours, axis from 0 to 70]

Bin width

Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?

[Figure: four histograms of hours / week spent on extracurricular activities, each with a different bin width]

Shape of a distribution: modality

Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?
[Figure: four histograms to be classified by modality]

Note: In order to determine modality, step back and imagine a smooth curve over the histogram. Imagine that the bars are wooden blocks and you drop a limp piece of spaghetti over them; the shape the spaghetti would take could be viewed as a smooth curve.

Shape of a distribution: skewness

Is the histogram right skewed, left skewed, or symmetric?

[Figure: three histograms to be classified by skewness]

Note: Histograms are said to be skewed to the side of the long tail.

Shape of a distribution: unusual observations

Are there any unusual observations or potential outliers?

[Figure: two histograms]

Extracurricular activities

How would you describe the shape of the distribution of hours per week students spend on extracurricular activities?

[Figure: histogram of hours / week spent on extracurricular activities, axis from 0 to 70]

Unimodal and right skewed, with a potentially unusual observation at 60 hours/week.
Commonly observed shapes of distributions

modality: uniform, unimodal, bimodal, multimodal
skewness: right skew, symmetric, left skew

Practice

Which of these variables do you expect to be uniformly distributed?
(a) weights of adult females
(b) salaries of a random sample of people from North Carolina
(c) house prices
(d) birthdays of classmates (day of the month)

Application activity: Shapes of distributions

Sketch the expected distributions of the following variables:
number of piercings
scores on an exam
IQ scores
Come up with a concise way (1-2 sentences) to teach someone how to determine the expected distribution of any variable.

Are you typical?

http://www.youtube.com/watch?v=4B2xOvKFFz4

How useful are centers alone for conveying the true characteristics of a distribution?

Variance

Variance is roughly the average squared deviation from the mean.
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

[Figure: histogram of hours of sleep / night, axis from 2 to 12]

The sample mean is $\bar{x} = 6.71$, and the sample size is $n = 217$.

The variance of amount of sleep students get per night can be calculated as:

$s^2 = \frac{(5 - 6.71)^2 + (9 - 6.71)^2 + \cdots + (7 - 6.71)^2}{217 - 1} = 4.11 \text{ hours}^2$

Variance (cont.)

Why do we use the squared deviation in the calculation of variance?
To get rid of negatives so that observations equally distant from the mean are weighed equally.
To weigh larger deviations more heavily.

Standard deviation

The standard deviation is the square root of the variance, and has the same units as the data.

$s = \sqrt{s^2}$

The standard deviation of amount of sleep students get per night can be calculated as:

$s = \sqrt{4.11} = 2.03 \text{ hours}$

We can see that all of the data are within 3 standard deviations of the mean.

Median

The median is the value that splits the data in half when ordered in ascending order.

0, 1, 2, 3, 4 → 2

If there are an even number of observations, then the median is the average of the two values in the middle.

0, 1, 2, 3, 4, 5 → $\frac{2 + 3}{2} = 2.5$

Since the median is the midpoint of the data, 50% of the values are below it. Hence, it is also the 50th percentile.
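The variance, standard deviation, and median calculations above can be sketched in a few lines of Python. The `sleep_hours` list is a hypothetical stand-in (the 217 observations behind the slides are not listed) and `sample_variance` is an illustrative helper name; the median examples and the square root of 4.11 come straight from the slides.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


# Hypothetical hours of sleep, for illustration only.
sleep_hours = [5, 9, 7, 6, 8, 7]

s2 = sample_variance(sleep_hours)   # variance, in hours squared
s = math.sqrt(s2)                   # standard deviation, in hours

# The two median examples from the slides.
odd_median = statistics.median([0, 1, 2, 3, 4])
even_median = statistics.median([0, 1, 2, 3, 4, 5])

# Standard deviation implied by the slide's variance of 4.11 hours squared.
sd_from_slide = round(math.sqrt(4.11), 2)

print(s2, round(s, 2), odd_median, even_median, sd_from_slide)
```

Dividing by n - 1 rather than n matches the sample variance formula above; `statistics.variance` uses the same divisor, so it can serve as a cross-check.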
Q1, Q3, and IQR

The 25th percentile is also called the first quartile, Q1.
The 50th percentile is also called the median.
The 75th percentile is also called the third quartile, Q3.
Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.

IQR = Q3 − Q1

Box plot

The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

[Figure: box plot of # of study hours / week, axis from 0 to 70]

Anatomy of a box plot

[Figure: box plot of # of study hours / week, labeled from top to bottom: suspected outliers, max whisker reach & upper whisker, Q3 (third quartile), median, Q1 (first quartile), lower whisker]

Whiskers and outliers

Whiskers of a box plot can extend up to 1.5 × IQR away from the quartiles.

max upper whisker reach = Q3 + 1.5 × IQR
max lower whisker reach = Q1 − 1.5 × IQR

IQR: 20 − 10 = 10
max upper whisker reach = 20 + 1.5 × 10 = 35
max lower whisker reach = 10 − 1.5 × 10 = −5

A potential outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.

Outliers (cont.)

Why is it important to look for outliers?
Identify extreme skew in the distribution.
Identify data collection and entry errors.
Provide insight into interesting features of the data.
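The whisker rule above can be sketched as a short Python example. The quartiles (Q1 = 10, Q3 = 20) are the ones used in the slide's arithmetic; the function names and the `study_hours` values are hypothetical and only serve to show how observations beyond the whisker reach get flagged.

```python
def whisker_reach(q1, q3):
    """Return (max lower reach, max upper reach) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def potential_outliers(values, q1, q3):
    """Observations that fall beyond the maximum reach of the whiskers."""
    lower, upper = whisker_reach(q1, q3)
    return [v for v in values if v < lower or v > upper]


# Quartiles from the slide: Q1 = 10, Q3 = 20.
lower, upper = whisker_reach(10, 20)

# Hypothetical study hours per week, for illustration only.
study_hours = [2, 8, 10, 12, 15, 20, 24, 40, 55]
flagged = potential_outliers(study_hours, 10, 20)

print(lower, upper, flagged)
```

Note that the lower reach can be negative even when the variable cannot be; the whisker itself only extends to the most extreme observation inside the reach.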
Extreme observations

How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?

[Figure: histogram of annual household income, axis from 0e+00 to 1e+06]

Robust statistics

                                 robust            not robust
scenario                         median   IQR      x̄       s
original data                    190K     200K     245K    226K
move largest to $10 million      190K     200K     309K    853K
move smallest to $10 million     200K     200K     316K    854K

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,
for skewed distributions it is often more helpful to use median and IQR to describe the center and spread;
for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread.

If you would like to estimate the typical household income for a student, would you be more interested in the mean or median income?
Median

Mean vs. median

If the distribution is symmetric, center is often defined as the mean: mean ≈ median.

If the distribution is skewed or has extreme outliers, center is often defined as the median.
Right-skewed: mean > median
Left-skewed: mean < median

[Figure: symmetric, right-skewed, and left-skewed distributions with the mean and median marked]

Practice

Which is most likely true for the distribution of percentage of time actually spent taking notes in class versus on Facebook, Twitter, etc.?

[Figure: histogram of % of time in class spent taking notes, axis from 0 to 100]

(a) mean > median
(b) mean < median
(c) mean ≈ median
(d) impossible to tell

median: 80%
mean: 76%

Extremely skewed data

When data are extremely skewed, transforming them might make modeling easier. A common transformation is the log transformation.

The histogram on the left shows the distribution of number of basketball games attended by students. The histogram on the right shows the distribution of log of number of games attended.

[Figure: two histograms, # of basketball games attended (left) and log of # of basketball games attended (right)]

Pros and cons of transformations

Skewed data are easier to model when they are transformed because outliers tend to become far less prominent after an appropriate transformation.

# of games         70     50     25     ···
log(# of games)    4.25   3.91   3.22   ···

However, results of an analysis in log units of the measured variable might be difficult to interpret.
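The log values in the table above can be reproduced with a short Python sketch. The slides do not state the base of the logarithm; the tabulated values match the natural log, which is what `math.log` computes. The variable names are illustrative.

```python
import math

# Number of games attended, taken from the table in the slides.
games_attended = [70, 50, 25]

# Natural log of each value, rounded to two decimals as in the table.
log_games = [round(math.log(g), 2) for g in games_attended]

# Back-transforming with exp recovers the original scale, which is one way
# to make results reported in log units easier to interpret.
recovered = [round(math.exp(math.log(g))) for g in games_attended]

print(log_games)
print(recovered)
```

On the log scale the gap between 70 and 25 games shrinks from 45 to about 1, which is why large values become far less prominent after the transformation.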
What other variables would you expect to be extremely skewed?
Salary, housing prices, etc.

Intensity maps

What patterns are apparent in the change in population between 2000 and 2010?

[Figure: intensity map of population change; source: http://projects.nytimes.com/census/2010/map]

Considering categorical data

Contingency tables

A table that summarizes data for two categorical variables is called a contingency table.

The contingency table below shows the distribution of survival and ages of passengers on the Titanic.

                   Survival
               Died    Survived   Total
Age   Adult    1438    654        2092
      Child    52      57         109
      Total    1490    711        2201

Bar plots

A bar plot is a common way to display a single categorical variable. A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.

[Figure: bar plot of frequency (left) and relative frequency bar plot (right) of survival: Died, Survived]
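The step from a frequency bar plot to a relative frequency bar plot is a division by the total count. A minimal sketch, using the Total row of the Titanic contingency table above (the dictionary and variable names are illustrative):

```python
# Counts from the Total row of the Titanic contingency table.
survival_counts = {"Died": 1490, "Survived": 711}

total = sum(survival_counts.values())

# Relative frequency of each category: count divided by the overall total.
relative_freq = {
    category: count / total for category, count in survival_counts.items()
}

for category, proportion in relative_freq.items():
    print(f"{category}: {proportion:.1%}")
```

The relative frequencies sum to 1, so the two plots have the same shape and differ only in the scale of the vertical axis.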
How are bar plots different from histograms?

Bar plots are used for displaying distributions of categorical variables; histograms are used for numerical variables.
The x-axis in a histogram is a number line, hence the order of the bars cannot be changed. In a bar plot, the categories can be listed in any order (though some orderings make more sense than others, especially for ordinal variables).

Choosing the appropriate proportion

Does there appear to be a relationship between age and survival for passengers on the Titanic?
                   Survival
               Died    Survived   Total
Age   Adult    1438    654        2092
      Child    52      57         109
      Total    1490    711        2201

To answer this question we examine the row proportions:
Proportion of adults who survived: 654 / 2092 ≈ 0.31
Proportion of children who survived: 57 / 109 ≈ 0.52

Bar plots with two variables

Stacked bar plot: Graphical display of contingency table information, for counts.
Side-by-side bar plot: Displays the same information by placing bars next to, instead of on top of, each other.
Standardized stacked bar plot: Graphical display of contingency table information, for proportions.

What are the differences between the three visualizations shown below?

[Figure: three bar plots of survival (Died, Survived) by age (Adult, Child); two show frequency and one shows relative frequency]

Mosaic plots

What is the difference between the two visualizations shown below?

[Figure: two plots of survival (Died, Survived) by age (Adult, Child); one has a relative frequency axis]

Pie charts

Can you tell which order encompasses the lowest percentage of mammal species?

[Figure: pie chart of mammal species by order. Legend: RODENTIA, CHIROPTERA, CARNIVORA, ARTIODACTYLA, PRIMATES, SORICOMORPHA, LAGOMORPHA, DIPROTODONTIA, DIDELPHIMORPHIA, CETACEA, DASYUROMORPHIA, AFROSORICIDA, ERINACEOMORPHA, SCANDENTIA, PERISSODACTYLA, HYRACOIDEA, PERAMELEMORPHIA, CINGULATA, PILOSA, MACROSCELIDEA, TUBULIDENTATA, PHOLIDOTA, MONOTREMATA, PAUCITUBERCULATA, SIRENIA, PROBOSCIDEA, DERMOPTERA, NOTORYCTEMORPHIA, MICROBIOTHERIA. Data from http://www.bucknell.edu/msw3.]

Side-by-side box plots

Does there appear to be a relationship between class year and number of clubs students are in?

[Figure: side-by-side box plots of number of clubs by class year: First-year, Sophomore, Junior, Senior]

Case study: Gender discrimination

Gender discrimination

In 1972, as a part of a study on gender discrimination, 48 male bank supervisors were each given the same personnel file and asked to judge whether the person should be promoted to a branch manager job that was described as “routine”.
The files were identical except that half of the supervisors had files showing the person was male while the other half had files showing the person was female.
It was randomly determined which supervisors got “male” applications and which got “female” applications.
Of the 48 files reviewed, 35 were promoted.
The study is testing whether females are unfairly discriminated against.

Is this an observational study or an experiment?

Data

At a first glance, does there appear to be a relationship between promotion and gender?

                        Promotion
                  Promoted   Not Promoted   Total
Gender   Male     21         3              24
         Female   14         10             24
         Total    35         13             48

Proportion of males promoted: 21/24 = 0.875
Proportion of females promoted: 14/24 ≈ 0.583

Practice

We saw a difference of almost 30% (29.2% to be exact) between the proportion of male and female files that are promoted. Based on this information, which of the below is true?
(a) If we were to repeat the experiment we will definitely see that more female files get promoted. This was a fluke.
(b) Promotion is dependent on gender, males are more likely to be promoted, and hence there is gender discrimination against women in promotion decisions. Maybe
(c) The difference in the proportions of promoted male and female files is due to chance, this is not evidence of gender discrimination against women in promotion decisions. Maybe
(d) Women are less qualified than men, and this is why fewer females get promoted.

Two competing claims

1. “There is nothing going on.”
Promotion and gender are independent, no gender discrimination, observed difference in proportions is simply due to chance. → Null hypothesis
2. “There is something going on.”
Promotion and gender are dependent, there is gender discrimination, observed difference in proportions is not due to chance. → Alternative hypothesis

A trial as a hypothesis test

Hypothesis testing is very much like a court trial.
H0: Defendant is innocent
HA: Defendant is guilty
We then present the evidence - collect data.
Then we judge the evidence - “Could these data plausibly have happened by chance if the null hypothesis were true?”
If they were very unlikely to have occurred, then the evidence raises more than a reasonable doubt in our minds about the null hypothesis.
Ultimately we must make a decision. How unlikely is unlikely?

Image from http://www.nwherald.com/internal/cimg!0/oo1il4sf8zzaqbboq25oevvbg99wpot.

A trial as a hypothesis test (cont.)

If the evidence is not strong enough to reject the assumption of innocence, the jury returns with a verdict of “not guilty”.
The jury does not say that the defendant is innocent, just that there is not enough evidence to convict.
The defendant may, in fact, be innocent, but the jury has no way of being sure.
Said statistically, we fail to reject the null hypothesis.
We never declare the null hypothesis to be true, because we simply do not know whether it’s true or not.
Therefore we never “accept the null hypothesis”.

In a trial, the burden of proof is on the prosecution.
In a hypothesis test, the burden of proof is on the unusual claim.
The null hypothesis is the ordinary state of affairs (the status quo), so it’s the alternative hypothesis that we consider unusual and for which we must gather evidence.

Recap: hypothesis testing framework

We start with a null hypothesis (H0) that represents the status quo.
We also have an alternative hypothesis (HA) that represents our research question, i.e. what we’re testing for.
We conduct a hypothesis test under the assumption that the null hypothesis is true, either via simulation (today) or theoretical methods (later in the course).
If the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, we stick with the null hypothesis. If they do, then we reject the null hypothesis in favor of the alternative.

Simulating the experiment...

... under the assumption of independence, i.e. leave things up to chance.

If results from the simulations based on the chance model look like the data, then we can determine that the difference between the proportions of promoted files for males and females was simply due to chance (promotion and gender are independent).

If the results from the simulations based on the chance model do not look like the data, then we can determine that the difference between the proportions of promoted files for males and females was not due to chance, but due to an actual effect of gender (promotion and gender are dependent).

Application activity: simulating the experiment

Use a deck of playing cards to simulate this experiment.
1. Let a face card represent not promoted and a non-face card represent promoted. Consider aces as face cards.
   Set aside the jokers.
   Take out 3 aces → there are exactly 13 face cards left in the deck (face cards: A, K, Q, J).
   Take out a number card → there are exactly 35 number (non-face) cards left in the deck (number cards: 2-10).
2. Shuffle the cards and deal them into two groups of size 24, representing males and females.
3. Count and record how many files in each group are promoted (number cards).
4. Calculate the proportion of promoted files in each group and take the difference (male - female), and record this value.

[Figure: Step 1]
[Figure: Steps 2 - 4]

Practice

Do the results of the simulation you just ran provide convincing evidence of gender discrimination against women, i.e. dependence between gender and promotion decisions?
(a) No, the data do not provide convincing evidence for the alternative hypothesis, therefore we can’t reject the null hypothesis of independence between gender and promotion decisions. The observed difference between the two proportions was due to chance.
(b) Yes, the data provide convincing evidence for the alternative hypothesis of gender discrimination against women in promotion decisions. The observed difference between the two proportions was due to a real effect of gender.

Simulations using software

These simulations are tedious and slow to run using the method described earlier. In reality, we use software to generate the simulations. The dot plot below shows the distribution of simulated differences in promotion rates based on 100 simulations.

[Figure: dot plot of difference in promotion rates, axis from −0.4 to 0.4]
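One way such a software simulation might look is sketched below in Python. It mirrors the card procedure described earlier: 35 "promoted" and 13 "not promoted" files are shuffled and dealt into two groups of 24. The function name, the seed, and the tail-count summary are assumptions made for illustration; the slides do not say which software or settings produced their dot plot, so this sketch will not reproduce that plot exactly.

```python
import random


def simulate_differences(n_simulations=100, seed=0):
    """Shuffle promotion outcomes and return male - female proportion differences."""
    rng = random.Random(seed)  # seed is arbitrary, chosen only for reproducibility
    # 1 = promoted (35 files), 0 = not promoted (13 files), as in the card deck.
    files = [1] * 35 + [0] * 13
    diffs = []
    for _ in range(n_simulations):
        rng.shuffle(files)
        male, female = files[:24], files[24:]  # two groups of size 24
        diffs.append(sum(male) / 24 - sum(female) / 24)
    return diffs


# Observed difference from the study: 21/24 - 14/24.
observed = 21 / 24 - 14 / 24

diffs = simulate_differences()

# Count simulated differences at least as large as the observed one.
n_extreme = sum(d >= observed - 1e-9 for d in diffs)

print(f"observed difference: {observed:.3f}")
print(f"simulations with a difference at least this large: {n_extreme} of {len(diffs)}")
```

If only a small share of the shuffled differences reach the observed value, the chance model does a poor job of explaining the data, which is the comparison the practice question above asks for.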
