Chapter 2 2024 Statistics Lecture Slides PDF

Document Details


2017

Moore / McCabe / Craig

Tags

statistics data analysis scatterplots relationships

Summary

These lecture slides cover Chapter 2 of Introduction to the Practice of Statistics, Ninth Edition. They introduce concepts around data relationships, including scatterplots, and explore associations between variables.

Full Transcript


Chapter 2: Looking at Data — Relationships
Lecture Presentation Slides
Macmillan Learning © 2017

Chapter 2: Looking at Data — Relationships
Introduction
2.4 Least-Squares Regression
2.5 Cautions about Correlation and Regression
2.6 Data Analysis for Two-Way Tables

Introduction
▪ Scatterplots
▪ Explanatory and response variables
▪ Interpreting scatterplots
▪ Categorical variables in scatterplots

Associations Between Variables
Many interesting examples of the use of statistics involve relationships between pairs of variables. Two variables measured on the same cases are associated if knowing the value of one of the variables tells you something about the values of the other variable that you would not know without this information.
When you examine the relationship between two variables, a new question becomes important: Is your purpose simply to explore the nature of the relationship, or do you wish to show that one of the variables can explain variation in the other? A response (dependent) variable measures an outcome of a study. An explanatory (independent) variable explains or causes changes in the response variable.

Scatterplot
The most useful graph for displaying the relationship between two quantitative variables is a scatterplot. A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as a point on the graph.

How to Make a Scatterplot
1. Decide which variable should go on each axis. If a distinction exists, plot the explanatory variable on the x-axis and the response variable on the y-axis.
2. Label and scale your axes.
3. Plot individual data values.

Example: Make a scatterplot of the relationship between body weight and backpack weight for a group of hikers.
Body weight (lb):      120  187  109  103  131  165  158  116
Backpack weight (lb):   26   30   26   24   29   35   31   28

Interpreting Scatterplots
To interpret a scatterplot, follow the basic strategy of data analysis from Chapter 1: look for patterns and for important departures from those patterns.

How to Examine a Scatterplot
As in any graph of data, look for the overall pattern and for striking deviations from that pattern.
▪ You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship.
▪ An important kind of departure is an outlier, an individual value that falls outside the overall pattern of the relationship.

Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and when below-average values also tend to occur together. Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other, and vice versa.

For the hiker data:
✓ There is a moderately strong, positive, linear relationship between body weight and backpack weight (strength, direction, and form).
✓ It appears that lighter hikers are carrying lighter backpacks.
✓ There is one possible outlier: the hiker with the body weight of 187 pounds seems to be carrying relatively less weight than the other group members.

Adding Categorical Variables
Consider the relationship between mean SAT verbal score and percent of high school grads taking the SAT for each state. To add a categorical variable, use a different plot color or symbol for each category (here, Southern states highlighted).

Categorical Explanatory Variables
When the explanatory variable is categorical, you cannot make a scatterplot, but you can compare the different categories side by side on the same graph (side-by-side boxplots), for example a comparison of income (quantitative response variable) for different education levels (five categories).
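The "moderately strong, positive, linear" description of the hiker scatterplot can be backed with a number. A minimal sketch, assuming Python (the correlation r used here is defined later in the chapter; the variable names are mine):

```python
import math

# Hiker data: body weight (x, lb) and backpack weight (y, lb)
body = [120, 187, 109, 103, 131, 165, 158, 116]
pack = [26, 30, 26, 24, 29, 35, 31, 28]

n = len(body)
xbar, ybar = sum(body) / n, sum(pack) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(body, pack))
sxx = sum((x - xbar) ** 2 for x in body)
syy = sum((y - ybar) ** 2 for y in pack)

# Pearson correlation between body weight and backpack weight
r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))  # about 0.79: positive, and moderately strong
```

A value near 0.8 matches the verbal description: clearly positive, close to linear, but with visible scatter.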
But be careful in your interpretation: this is NOT a positive association, because education is not quantitative.

Nonlinear Relationships
▪ There are other forms of relationships besides linear. A scatterplot in which there is curvature in the relationship between x and y is an example of a nonlinear form.

Correlation
▪ The correlation coefficient r
▪ Properties of r
▪ Influential points

Measuring Linear Association
A scatterplot displays the strength, direction, and form of the relationship between two quantitative variables. Linear relations are important because a straight line is a simple pattern that is quite common. Our eyes are not good judges of how strong a relationship is, so we use a numerical measure to supplement our scatterplot and help us interpret the strength of the linear relationship. The correlation r measures the strength of the linear relationship between two quantitative variables.

We say a linear relationship is strong if the points lie close to a straight line and weak if they are widely scattered about a line. The following facts about r help us further interpret the strength of the linear relationship.

Properties of Correlation
▪ r is always a number between –1 and 1.
▪ r > 0 indicates a positive association.
▪ r < 0 indicates a negative association.
▪ Values of r near 0 indicate a very weak linear relationship.
▪ The strength of the linear relationship increases as r moves away from 0 toward –1 or 1.
▪ The extreme values r = –1 and r = 1 occur only in the case of a perfect linear relationship.

Correlation
For n individuals with values x_1, …, x_n and y_1, …, y_n,

r = (1 / (n – 1)) Σ [(x_i – x-bar) / s_x] [(y_i – y-bar) / s_y]

More Properties of Correlation
1. Correlation makes no distinction between explanatory and response variables.
2. r has no units and does not change when we change the units of measurement of x, y, or both.
3. Positive r indicates positive association between the variables, and negative r indicates negative association.
4.
The correlation r is always a number between –1 and 1.

Cautions:
▪ Correlation requires that both variables be quantitative.
▪ Correlation does not describe curved relationships between variables, no matter how strong the relationship is.
▪ Correlation is not resistant: r is strongly affected by a few outlying observations.
▪ Correlation is not a complete summary of two-variable data.

Correlation Examples
For each graph, estimate the correlation r and interpret it in context.

2.4 Least-Squares Regression
▪ Regression lines
▪ Least-squares regression line
▪ Facts about least-squares regression
▪ Correlation and regression

Regression Line
A regression line is a line that best describes the linear relationship between the two variables, and it is expressed by means of an equation of the form y-hat = b0 + b1 x, where b1 is the slope and b0 is the intercept. Once the equation of the regression line is established, we can use it to predict the response y for a specific value of the explanatory variable x.
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We can use a regression line to predict the value of y for a given value of x.

Example: Predict the number of new adult birds that join the colony based on the percent of adult birds that return to the colony from the previous year. If 60% of adults return, how many new birds are predicted?

The Least-Squares Regression Line
The least-squares regression line is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible. The equation of the least-squares regression line of y on x is

y-hat = b0 + b1 x

where y-hat ("y hat") is the predicted y value, b1 is the slope, and b0 is the y-intercept. Two different regression lines can be drawn if we interchange the roles of x and y.
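A minimal sketch of the least-squares computation for the hiker backpack data, assuming Python. The closed forms b1 = Sxy/Sxx and b0 = y-bar – b1·x-bar are the standard ones; the helper names are mine, not the text's:

```python
import math

# Hiker data: body weight (x, lb) and backpack weight (y, lb)
body = [120, 187, 109, 103, 131, 165, 158, 116]
pack = [26, 30, 26, 24, 29, 35, 31, 28]

n = len(body)
xbar, ybar = sum(body) / n, sum(pack) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(body, pack))
sxx = sum((x - xbar) ** 2 for x in body)
syy = sum((y - ybar) ** 2 for y in pack)

b1 = sxy / sxx          # slope: about 0.091 lb of pack per lb of body weight
b0 = ybar - b1 * xbar   # intercept: about 16.3 lb
print(f"y-hat = {b0:.2f} + {b1:.4f} x")

# The slide's Fact 1 in formula form: b1 = r * sy / sx
r = sxy / math.sqrt(sxx * syy)
sx, sy = math.sqrt(sxx / (n - 1)), math.sqrt(syy / (n - 1))
print(abs(b1 - r * sy / sx) < 1e-12)          # True

# Fact 2: the least-squares line passes through (x-bar, y-bar)
print(abs((b0 + b1 * xbar) - ybar) < 1e-12)   # True
```

Interchanging `body` and `pack` in the fit would give a different line, as the slide notes, even though r itself would not change.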
Example: Fitted Line Plot
[Two fitted line plots for the nonexercise activity study: Fat = 3.505 – 0.003441 NEA (fat gain in kilograms against nonexercise activity in calories) and, with the roles of x and y interchanged, NEA = 745.3 – 176.1 Fat. The correlation coefficient of NEA and Fat, r = –0.779, stays the same in both cases.]

BEWARE! Not all calculators and software use the same convention. Some use y-hat = a + bx and some use y-hat = ax + b. Make sure you know what YOUR calculator gives you for a and b before you answer homework or exam questions.

Facts About Least-Squares Regression
Regression is one of the most common statistical settings, and least-squares is the most common method for fitting a regression line to data. Here are some facts about least-squares regression lines.
▪ Fact 1: A change of one standard deviation in x corresponds to a change of r standard deviations in y.
▪ Fact 2: The LSRL always passes through (x-bar, y-bar).
▪ Fact 3: The distinction between explanatory and response variables is essential.

Example: Powerboats and manatee deaths

Year   Powerboats (in 1000s)   Dead manatees
1977   447                     13
1978   460                     21
1979   481                     24
1980   498                     16
1981   513                     24
1982   512                     20
1983   526                     15
1984   559                     34
1985   585                     33
1986   614                     33
1987   645                     39
1988   675                     43
1989   711                     50
1990   719                     47

There is a positive linear relationship between the number of powerboats registered and the number of manatee deaths. The least-squares regression line has the equation y-hat = 0.125x – 41.4. Thus, if we were to limit the number of powerboat registrations to 500,000, what could we expect for the number of manatee deaths?

y-hat = 0.125(500) – 41.4 = 62.5 – 41.4 = 21.1

Roughly 21 manatees. Could we use this regression line to predict the number of manatee deaths for a year with 200,000 powerboat registrations?

Extrapolation!
Extrapolation is the use of a regression line for prediction far outside the range of values of x used to obtain the line.
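The manatee prediction, and the danger of extrapolating with the same line, can both be checked with the fitted equation y-hat = 0.125x – 41.4 (x in thousands of registrations). A small sketch; the function name is mine:

```python
def predict_deaths(powerboats_thousands):
    """Predicted manatee deaths from the fitted line y-hat = 0.125x - 41.4."""
    return 0.125 * powerboats_thousands - 41.4

# Within the observed range of x (447 to 719 thousand): a sensible prediction
print(predict_deaths(500))   # about 21.1 -> roughly 21 manatees

# Far below the observed range: extrapolation gives a nonsense answer
print(predict_deaths(200))   # about -16.4 -> negative deaths, not meaningful
```

The second call is exactly the 200,000-registration question the slide poses: the arithmetic works, but the answer is impossible, which is why prediction far outside the observed x range is unsafe.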
Such predictions are often not accurate.

Extrapolation Example
▪ Sarah's height was plotted against her age.
▪ Can you predict her height at age 42 months?
▪ Can you predict her height at age 30 years (360 months)?
▪ Regression line: y-hat = 71.95 + 0.383x
▪ Height at age 42 months: y-hat = 88 (cm)
▪ Height at age 30 years: y-hat = 209.8 (cm)
▪ She is predicted to be 6'10.5" at age 30!

Coefficient of Determination, r²
Least-squares regression looks at the distances of the data points from the line only in the y direction, so the variables x and y play different roles in regression. Even though correlation r ignores the distinction between x and y, there is a close connection between correlation and regression. The square of the correlation, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
▪ r² is called the coefficient of determination.
▪ r² represents the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x.

Illustrations:
▪ r = –1, r² = 1: Changes in x explain 100% of the variations in y. Y can be entirely predicted for any given value of x.
▪ r = 0, r² = 0: Changes in x explain 0% of the variations in y. The values y takes are entirely independent of what value x takes.
▪ r = 0.87, r² = 0.76: Here the change in x explains only 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained by something other than x.
▪ r = –0.3, r² = 0.09, or 9%: The regression model explains not even 10% of the variations in y.
▪ r = –0.7, r² = 0.49, or 49%: The regression model explains nearly half of the variations in y.
▪ r = –0.99, r² = 0.9801, or ~98%: The regression model explains almost all of the variations in y.

Regression Line (Summary)
Correlation tells us about the strength and direction of the linear relationship between two quantitative variables. In regression we study the association between two variables in order to explain the values of one from the values of the other (i.e., make predictions). When there is a linear association between two variables, a straight-line equation can be used to model the relationship. In regression the distinction between response and explanatory variables is important.

2.5 Cautions About Correlation and Regression
▪ Residuals and residual plots
▪ Outliers and influential observations
▪ Lurking variables
▪ Correlation and causation

Residuals
A residual is the difference between an observed value of the response variable and the value predicted by the regression line:

residual = observed y – predicted y = y – y-hat

Points above the line have a positive residual; points below the line have a negative residual. The sum of these residuals is always 0.

Residual Plots
A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line.
▪ Look for a "random" scatter around zero.
▪ Residual patterns suggest deviations from a linear relationship.
[Example: Gesell Adaptive Score and Age at First Word.] The x-axis in a residual plot is the same as on the scatterplot; only the y-axis is different.
▪ Residuals randomly scattered: good!
▪ Curved pattern: the relationship you are looking at is not linear.
▪ A change in variability across the plot is a warning sign. You need to find out why it is there, and remember that predictions made in areas of larger variability will not be as good.
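Two claims here, that the least-squares residuals sum to 0 and that r² equals 1 – SSE/SST (the explained fraction of the variation in y), can be checked numerically on the hiker backpack data from earlier in the chapter. A minimal Python sketch:

```python
import math

# Hiker data: body weight (x, lb) and backpack weight (y, lb)
body = [120, 187, 109, 103, 131, 165, 158, 116]
pack = [26, 30, 26, 24, 29, 35, 31, 28]

n = len(body)
xbar, ybar = sum(body) / n, sum(pack) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(body, pack))
sxx = sum((x - xbar) ** 2 for x in body)
syy = sum((y - ybar) ** 2 for y in pack)   # SST: total variation in y
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# residual = observed y - predicted y = y - y-hat
residuals = [y - (b0 + b1 * x) for x, y in zip(body, pack)]
print(abs(sum(residuals)) < 1e-9)          # True: least-squares residuals sum to 0

# r^2 computed two ways: squared correlation, and 1 - SSE/SST
r = sxy / math.sqrt(sxx * syy)
sse = sum(e ** 2 for e in residuals)       # leftover vertical scatter about the line
print(round(r ** 2, 2), round(1 - sse / syy, 2))  # both about 0.63
```

So for the hikers, roughly 63% of the variation in backpack weight is explained by the regression on body weight.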
Outliers and Influential Points
An outlier is an observation that lies outside the overall pattern of the other observations.
▪ Outliers in the y direction have large residuals.
▪ Outliers in the x direction are often influential for the least-squares regression line, meaning that the removal of such points would markedly change the equation of the line.

Example: Gesell Adaptive Score and Age at First Word. From all of the data, r² = 41%; after removing child 18, r² = 11%.

Cautions About Correlation and Regression
▪ Both describe linear relationships.
▪ Both are affected by outliers.
▪ Always plot the data before interpreting.
▪ Beware of extrapolation (predicting outside of the range of x).
▪ Beware of lurking variables. These have an important effect on the relationship among the variables in a study, but are not included in the study.
▪ Correlation does not imply causation!

Example: A personal trainer wants to look at the relationship between number of hours of exercise per week and resting heart rate of her clients. The data show a linear pattern with the summary statistics shown below:
▪ x = hours of exercise per week, with standard deviation s_x = 4.8
▪ y = resting heart rate (beats per minute), with standard deviation s_y = 7.2
▪ r = –0.88
Find the equation of the least-squares regression line for predicting resting heart rate from the hours of exercise per week.

2.6 Data Analysis for Two-Way Tables
▪ The two-way table
▪ Joint distribution
▪ Conditional distributions
▪ Simpson's paradox

Categorical Variables
Recall that categorical variables place individuals into one of several groups or categories.
▪ The values of a categorical variable are labels for the different categories.
▪ The distribution of a categorical variable lists the count or percent of individuals who fall into each category.
When a dataset involves two categorical variables, we begin by examining the counts or percents in various categories for one of the variables.
A two-way table describes two categorical variables, organizing counts according to a row variable and a column variable. Each combination of values for these two variables is called a cell.

Two-Way Tables
Two-way tables summarize data about two categorical variables (or factors) collected on the same set of individuals.

Example (Smoking Survey in Arizona): High school students were asked whether they smoke and whether their parents smoke. Does parental smoking influence the smoking habits of their high school children?
▪ Explanatory variable: smoking habit of the student's parents (both smoke / one smokes / neither smokes)
▪ Response variable: smoking habit of the student (smokes / does not smoke)
To analyze the relationship we can summarize the results in a two-way table:

Parent smoking status   Student smokes   Student does not smoke
Both                    400              1380
One                     416              1823
Neither                 188              1168

This 3×2 two-way table has 3 rows and 2 columns. The numbers are counts (frequencies).

Margins
Margins show the total for each column and each row: the margin for parental smoking (row totals) and the margin for student smoking (column totals). For each cell, we can compute a proportion by dividing the cell entry by the total sample size. The collection of these proportions is the joint distribution of the two categorical variables.

Marginal Distributions
(Used when we examine the distribution of a single variable in a two-way table.) Marginal distributions give the distribution of the column variable (or of the row variable) separately, expressed in counts or percents.

Parental smoking   Smoker   Nonsmoker   Total
Both               400      1380        1780 (33.1%)
One                416      1823        2239 (41.7%)
Neither            188      1168        1356 (25.2%)
Total              18.7%    81.3%       5375 (100%)

For example, 1780/5375 = 33.1% and 1004/5375 = 18.7%.
The marginal distributions can be displayed on separate bar graphs, typically expressed as percents instead of raw counts. Each graph represents only one of the two variables, ignoring the second one. Each marginal distribution can also be shown in a pie chart.
[Bar graphs: percent of students interviewed by parental smoking status (Both 33.1%, One 41.7%, Neither 25.2%) and by student smoking status (Smoker 18.7%, Nonsmoker 81.3%).]

Conditional Distribution
A conditional distribution is the distribution of one factor for each level of the other factor. A conditional percent is computed using the counts within a single row or a single column. The denominator is the corresponding row or column total (rather than the table grand total). For example, the percent of students who smoke when both parents smoke = 400/1780 = 22.5%.

Comparing conditional distributions helps us describe the relationship between the two categorical variables. We can compare the percent of individuals in one level of factor 1 for each level of factor 2.

Conditional distribution of student smokers for different parental smoking statuses:
▪ Percent of students who smoke when both parents smoke = 400/1780 = 22.5%
▪ Percent of students who smoke when one parent smokes = 416/2239 = 18.6%
▪ Percent of students who smoke when neither parent smokes = 188/1356 = 13.9%

The conditional distributions can be compared graphically by displaying the percents making up one level of one factor, for each level of the other factor.
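The marginal and conditional percents above can be reproduced directly from the cell counts. A minimal Python sketch (the dictionary layout and names are mine):

```python
# Smoking-survey counts: (student smokes, student does not smoke) per row
table = {
    "both parents smoke":    (400, 1380),
    "one parent smokes":     (416, 1823),
    "neither parent smokes": (188, 1168),
}

grand_total = sum(s + ns for s, ns in table.values())   # 5375 students in all

# Marginal distribution of parental smoking: row total / grand total
for parents, (s, ns) in table.items():
    print(f"{parents}: {100 * (s + ns) / grand_total:.1f}% of all students")
# 33.1%, 41.7%, 25.2%

# Conditional distribution of student smoking given parental smoking:
# the denominator is the ROW total, not the grand total
for parents, (s, ns) in table.items():
    print(f"{parents}: {100 * s / (s + ns):.1f}% of these students smoke")
# 22.5%, 18.6%, 13.9%
```

The only difference between the two loops is the denominator, which is exactly the marginal-versus-conditional distinction the slides draw.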
Conditional distribution of student smoking status for different levels of parental smoking status:

                        Percent who smoke   Percent who do not smoke   Row total
Both parents smoke      22%                 78%                        100%
One parent smokes       19%                 81%                        100%
Neither parent smokes   14%                 86%                        100%

The Two-Way Table
Young adults by gender and chance of getting rich:

                                Female   Male   Total
Almost no chance                96       98     194
Some chance, but probably not   426      286    712
A 50-50 chance                  696      720    1416
A good chance                   663      758    1421
Almost certain                  486      597    1083
Total                           2367     2459   4826

What are the variables described by this two-way table? How many young adults were surveyed?

Marginal Distribution
The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table.
Note: Percents are often more informative than counts, especially when comparing groups of different sizes.
To examine a marginal distribution:
1. Use the data in the table to calculate the marginal distribution (in percents) of the row or column totals.
2. Make a graph to display the marginal distribution.

Example: Examine the marginal distribution of chance of getting rich (chance of being wealthy by age 30).

Response           Percent
Almost no chance   194/4826 = 4.0%
Some chance        712/4826 = 14.8%
A 50-50 chance     1416/4826 = 29.3%
A good chance      1421/4826 = 29.4%
Almost certain     1083/4826 = 22.4%

Conditional Distribution
Marginal distributions tell us nothing about the relationship between two variables. For that, we need to explore the conditional distributions of the variables.
A conditional distribution of a variable describes the values of that variable among individuals who have a specific value of another variable.
To examine or compare conditional distributions:
1. Select the row(s) or column(s) of interest.
2. Use the data in the table to calculate the conditional distribution (in percents) of the row(s) or column(s).
3. Make a graph to display the conditional distribution.
▪ Use a side-by-side bar graph or segmented bar graph to compare distributions.

Example: Calculate the conditional distribution of opinion among males, and examine the relationship between gender and opinion (young adults by gender and chance of getting rich).

Response           Male               Female
Almost no chance   98/2459 = 4.0%     96/2367 = 4.1%
Some chance        286/2459 = 11.6%   426/2367 = 18.0%
A 50-50 chance     720/2459 = 29.3%   696/2367 = 29.4%
A good chance      758/2459 = 30.8%   663/2367 = 28.0%
Almost certain     597/2459 = 24.3%   486/2367 = 20.5%

[The conditional distributions can be displayed as side-by-side or segmented bar graphs of opinion for males and for females.]

Another example: [a two-way table of counts by age range, in which the 25 to 34 age group occupies the first column]. Here the percents are calculated by age range (columns): for example, 29.30% = 11071/37785 = cell total / column total. The conditional distributions can be graphically compared using side-by-side bar graphs of one variable for each value of the other variable.
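The conditional distributions of opinion for males and for females can be recomputed from the counts, conditioning on gender (column totals as denominators). A minimal Python sketch (the dictionary layout is mine):

```python
# Young adults by opinion (rows); each value is (female count, male count)
counts = {
    "Almost no chance":              (96,  98),
    "Some chance, but probably not": (426, 286),
    "A 50-50 chance":                (696, 720),
    "A good chance":                 (663, 758),
    "Almost certain":                (486, 597),
}

female_total = sum(f for f, _ in counts.values())   # 2367
male_total = sum(m for _, m in counts.values())     # 2459

# Conditional distribution of opinion given gender: divide by the column total
for opinion, (f, m) in counts.items():
    print(f"{opinion}: males {100 * m / male_total:.1f}%, "
          f"females {100 * f / female_total:.1f}%")
# e.g. "Some chance, but probably not": males 11.6%, females 18.0%
```

Comparing the two columns row by row is exactly the comparison a side-by-side bar graph displays.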
Simpson's Paradox
When studying the relationship between two variables, there may exist a lurking variable that creates a reversal in the direction of the relationship when the lurking variable is ignored, as opposed to the direction of the relationship when the lurking variable is considered. The lurking variable creates subgroups, and failure to take these subgroups into consideration can lead to misleading conclusions regarding the association between the two variables. An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson's paradox.

Example: Consider the acceptance rates for the following groups of men and women who applied to college.

Counts   Accepted   Not accepted   Total
Men      198        162            360
Women    88         112            200
Total    286        274            560

Percents: Men 55% accepted, 45% not accepted; Women 44% accepted, 56% not accepted.

A higher percentage of men were accepted: is there evidence of discrimination? Now consider the acceptance rates when broken down by type of school.

BUSINESS SCHOOL
Counts   Accepted   Not accepted   Total
Men      18         102            120
Women    24         96             120
Total    42         198            240

Percents: Men 15% accepted, 85% not accepted; Women 20% accepted, 80% not accepted.

ART SCHOOL
Counts   Accepted   Not accepted   Total
Men      180        60             240
Women    64         16             80
Total    244        76             320

Percents: Men 75% accepted, 25% not accepted; Women 80% accepted, 20% not accepted.

Within each school a higher percentage of women were accepted than men. There is not any discrimination against women!
Lurking variables have an important effect on the relationship among the variables in a study, but are not included in the study.
✓ Lurking variable: applications were split between the Business School (240) and the Art School (320). This is an example of Simpson's paradox.
When the lurking variable (type of school: Business or Art) is ignored, the data seem to suggest discrimination against women. However, when the type of school is considered, the association is reversed and suggests discrimination against men.
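The reversal can be verified directly from the admissions counts. A minimal Python sketch (the nested-dictionary layout and names are mine):

```python
# (accepted, not accepted) counts by school and gender
admissions = {
    "business": {"men": (18, 102), "women": (24, 96)},
    "art":      {"men": (180, 60), "women": (64, 16)},
}

def rate(accepted, rejected):
    """Acceptance rate for one group."""
    return accepted / (accepted + rejected)

# Within each school, women are accepted at a higher rate than men ...
for school, groups in admissions.items():
    assert rate(*groups["women"]) > rate(*groups["men"])

# ... yet in the combined data the comparison reverses: Simpson's paradox
men_acc = sum(g["men"][0] for g in admissions.values())      # 198
men_tot = sum(sum(g["men"]) for g in admissions.values())    # 360
wom_acc = sum(g["women"][0] for g in admissions.values())    # 88
wom_tot = sum(sum(g["women"]) for g in admissions.values())  # 200
print(men_acc / men_tot, wom_acc / wom_tot)  # 0.55 vs 0.44: men higher overall
```

The reversal happens because women applied disproportionately to the school with the lower acceptance rate, which is exactly the lurking-variable effect the slides describe.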
