BMS 511 Biostatistics & Statistical Analysis Chapter 4 PDF
Document Details
Marian University
2018
Guang Xu
Summary
This document provides lecture notes on biostatistics, covering Chapter 4 on relationships and regression: the least-squares regression line, the coefficient of determination, outliers and influential observations, logarithm transformations, and cautions about correlation and causation, along with a review of boxplots, histograms, and quantile-quantile plots. The notes were published in 2018.
Full Transcript
BMS 511 Biostatistics & Statistical Analysis
Chapter 4 Relationships: Regression
Guang Xu, PhD, MPH
Assistant Professor of Biostatistics and Public Health
College of Osteopathic Medicine, Marian University

Previous Learning Objectives
Demonstrate relationships: scatterplots and correlation
- Bivariate data
- Scatterplots
- Interpreting scatterplots
- Adding categorical variables to scatterplots
- The correlation coefficient r
- Facts about correlation

Learning Objectives
Demonstrate regression
- The least-squares regression line
- Facts about least-squares regression
- Outliers and influential observations
- Working with logarithm transformations
- Cautions about correlation and regression
- Association does not imply causation

The least-squares regression line
The least-squares regression line is the line that makes the sum of the squared vertical distances of the data points from the line as small as possible.

Residuals
The vertical distances from each point to the least-squares regression line are called residuals. It can be shown algebraically that the sum of all the residuals is 0.

Notation
ŷ is the predicted y value on the regression line:
ŷ = intercept + slope × x
ŷ = a + bx
Not all calculators and software use this convention. Other notations include:
- ŷ = ax + b
- ŷ = b0 + b1x
- ŷ = (variable name) × x + constant

Interpreting the regression line
The slope of the regression line describes how much we expect y to change, on average, for every unit change in x. The intercept is a necessary mathematical descriptor of the regression line; it does not necessarily describe a specific property of the data.

Finding the least-squares regression line
The slope of the regression line is b = r (s_y / s_x), where
- r is the correlation coefficient between x and y,
- s_y is the standard deviation of the response variable y,
- s_x is the standard deviation of the explanatory variable x.
The intercept is a = ȳ − b x̄, where x̄ and ȳ are the respective means of the x and y variables.

Plotting the least-squares regression line
Use the regression equation to find the value of ŷ for two distinct values of x, and draw the line that goes through those two points. Hint: the regression line always passes through the point (x̄, ȳ). The points used for drawing the regression line are derived from the equation; they are NOT actual points from the data set (except by pure coincidence).

Facts about the least-squares regression line
Fact 1. There is a distinction between the explanatory variable and the response variable. If their roles are reversed, we get a different regression line.
Fact 2. The slope of the regression line is proportional to the correlation between the two variables.
Fact 3. The regression line always passes through the point (x̄, ȳ).
Fact 4. The correlation measures the strength of the association, while the square of the correlation measures the percent of the variation that is explained by the regression line.
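As a concrete illustration of these formulas (not part of the original slides), here is a minimal Python sketch with invented data; it assumes Python 3.10+ for statistics.correlation. It computes the slope and intercept from r, the standard deviations, and the means, then verifies Fact 3 and the claim that the residuals sum to 0:

```python
import statistics

# Invented illustration data (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)
r = statistics.correlation(x, y)   # Pearson's r (Python 3.10+)

b = r * s_y / s_x                  # slope:     b = r * (s_y / s_x)
a = y_bar - b * x_bar              # intercept: a = y_bar - b * x_bar

# Residuals are the vertical distances from the points to the line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(f"y-hat = {a:.3f} + {b:.3f} x")
print(f"sum of residuals (should be ~0): {sum(residuals):.2e}")
print(f"line at x_bar: {a + b * x_bar:.3f}  equals y_bar: {y_bar:.3f}")
```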
Linear associations only (1 of 2)
Don't compute the regression line until you have confirmed that there is a linear relationship between x and y. ALWAYS PLOT THE RAW DATA. These data sets all give a linear regression equation of about ŷ = 3 + 0.5x, but don't report that equation until you have plotted the data.

Linear associations only (2 of 2)
Is regression appropriate for these data sets?

The coefficient of determination, r²
r², the coefficient of determination, is the square of the correlation coefficient. r² represents the fraction of the variance in y that can be explained by the regression model. In the blood alcohol content (BAC) example, r = 0.87, so r² = 0.76: the model explains 76% of the individual variation in BAC.

Interpreting r² (1 of 3)
r = −0.3, r² = 0.09, or 9%: the regression model explains less than 10% of the variation in y.
r = −0.7, r² = 0.49, or 49%: the regression model explains nearly half of the variation in y.

Interpreting r² (2 of 3)
r = −0.99, r² = 0.9801, or about 98%: the regression model explains almost all of the variation in y.

Interpreting r² (3 of 3)
r represents the direction and strength of a linear relationship; r² indicates what fraction of the variation in y can be explained by the linear regression model.
r = −0.972, r² = 0.946
r = −0.538, r² = 0.290

Outliers and influential points
Outlier: an observation that lies outside the overall pattern.
Influential observation: an observation that markedly changes the regression if removed. This is often an isolated point.

Outlier example
The rightmost point changes the regression line substantially when it is removed, which makes it an influential point. The topmost point is an outlier of the relationship, but it is not influential: the regression line changes very little when it is removed.

Regression with a transformation
Logarithm transformations are often used when data are highly right-skewed. If the response variable is transformed with logarithms, regression is performed as usual, except that the predicted response must be transformed back into the original units. For example, to predict brain weight when body weight is 100 kg:
log(brain weight) = 1.01 + 0.72 × log(body weight)
log(brain weight) = 1.01 + 0.72 × 2 = 2.45
brain weight = 10^2.45 ≈ 282 g
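The back-transformation step is where this calculation most often goes wrong, so here is a short Python sketch of the same arithmetic. The helper name is mine; the fitted equation (base-10 logarithms) comes from the slide:

```python
import math

def predict_brain_weight_g(body_weight_kg: float) -> float:
    """Predict brain weight (g) from body weight (kg) with the slide's
    fitted line on log10 scales, then back-transform to original units."""
    log10_brain = 1.01 + 0.72 * math.log10(body_weight_kg)
    return 10 ** log10_brain  # undo the log transform

# log10(100) = 2, so log10(brain) = 1.01 + 0.72 * 2 = 2.45 -> about 282 g
print(f"{predict_brain_weight_g(100):.0f} g")
```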
Making predictions (1 of 4)
Use the equation of the least-squares regression line to predict y for any value of x within the range studied. Prediction outside that range is extrapolation; avoid extrapolation. What would we expect the BAC to be after drinking 6.5 beers?
ŷ = 0.0144 × 6.5 + 0.0008 = 0.0936 + 0.0008 = 0.0944 mg/mL

Making predictions (2 of 4)
Positive linear relationship.

Making predictions (3 of 4)
If Florida were to limit the number of powerboats to 500,000, what could we expect the number of manatee deaths to be in that year?
A) ~21    B) ~65    C) ~109    D) ~65,006
What if Florida were to limit the number of powerboats to 200,000?

Making predictions (4 of 4)
Powerboat registrations (in thousands) and manatee deaths in Florida, 1977-2012:

Year  Boats  Deaths    Year  Boats  Deaths    Year  Boats  Deaths
1977   447     13      1989   711     50      2001   944     81
1978   460     21      1990   719     47      2002   962     95
1979   481     24      1991   681     55      2003   978     73
1980   498     16      1992   679     38      2004   983     69
1981   513     24      1993   678     35      2005  1010     79
1982   512     20      1994   696     49      2006  1024     92
1983   526     15      1995   713     42      2007  1027     73
1984   559     34      1996   732     60      2008  1010     90
1985   585     33      1997   755     54      2009   982     97
1986   614     33      1998   809     66      2010   942     83
1987   645     39      1999   830     82      2011   922     87
1988   675     43      2000   880     78      2012   902     81
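To check the quiz answer, here is a sketch (not from the slides; it assumes Python 3.10+ for statistics.correlation) that fits the least-squares line to the data transcribed from the table above and predicts deaths at x = 500, i.e., 500,000 powerboats. The prediction lands near 21, matching choice A:

```python
import statistics

# (powerboat registrations in thousands, manatee deaths), 1977-2012,
# transcribed from the table above.
data = [
    (447, 13), (460, 21), (481, 24), (498, 16), (513, 24), (512, 20),
    (526, 15), (559, 34), (585, 33), (614, 33), (645, 39), (675, 43),
    (711, 50), (719, 47), (681, 55), (679, 38), (678, 35), (696, 49),
    (713, 42), (732, 60), (755, 54), (809, 66), (830, 82), (880, 78),
    (944, 81), (962, 95), (978, 73), (983, 69), (1010, 79), (1024, 92),
    (1027, 73), (1010, 90), (982, 97), (942, 83), (922, 87), (902, 81),
]
boats = [b for b, _ in data]
deaths = [d for _, d in data]

# Least-squares line via b = r * (s_y / s_x), a = y_bar - b * x_bar.
r = statistics.correlation(boats, deaths)
b = r * statistics.stdev(deaths) / statistics.stdev(boats)
a = statistics.mean(deaths) - b * statistics.mean(boats)

# x = 500 (thousands) lies inside the observed range (447 to 1027), so
# this is a prediction; x = 200 lies well below it, so predicting there
# would be extrapolation and should be avoided.
print(f"predicted manatee deaths at 500,000 boats: {a + b * 500:.0f}")
```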
Association does not imply causation
Association, however strong, does NOT imply causation. The observed association could have an external cause. A lurking variable (also called a confounder, confounding variable, or confounding factor) is a variable that is not among the explanatory or response variables in a study and yet may influence the relationship between the variables studied. We say that two variables are confounded when their effects on a response variable cannot be distinguished from each other.

Lurking variables (1 of 5)
What is most likely the lurking variable, if any, in each case? Strong positive association between shoe size and reading skills in young children.

Lurking variables (2 of 5)
Negative association between moderate amounts of wine drinking and death rates from heart disease in developed nations.

Lurking variables (3 of 5)
Clear positive association between per capita chocolate consumption and the concentration of Nobel laureates in world nations!

Lurking variables (4 of 5)
Relationship between muscle sympathetic nerve activity and a measure of arterial stiffness in young adults. Gender is a lurking variable.

Lurking variables (5 of 5)
The same data broken down by gender.

Establishing causation (1 of 2)
Causation can be established from an observed association if:
1. The association is strong.
2. The association is consistent.
3. Higher doses are associated with stronger responses.
4. The alleged cause precedes the effect.
5. The alleged cause is plausible.

Establishing causation (2 of 2)
Lung cancer is clearly associated with smoking. What if a genetic mutation (a lurking variable) caused people both to get lung cancer and to become addicted to smoking? It took years of research and accumulated indirect evidence to reach the conclusion that smoking causes lung cancer.

Stretch Break!

Boxplot
A boxplot is a way of summarizing a set of data measured on an interval scale, often used in exploratory data analysis. It shows the shape of the distribution, its central value, and its variability. The picture produced consists of:
- the most extreme values in the data set (the maximum and minimum values),
- the lower and upper quartiles,
- and the median.
Boxplots are especially helpful for indicating whether a distribution is skewed and whether there are any outliers in the data set. They are also very useful when large numbers of observations are involved and when two or more data sets are being compared.

Advantages of boxplots
- They graphically display a variable's location and spread at a glance.
- They provide some indication of the data's symmetry and skewness.
- They show outliers.
- Datasets can be compared quickly by showing two boxplots side by side.

Interpreting a boxplot
- The box itself contains the middle 50% of the data. The upper hinge (edge) of the box indicates the 75th percentile of the data, and the lower hinge indicates the 25th percentile.
- The range of the middle two quartiles is known as the inter-quartile range (IQR).
- The line in the box indicates the median value of the data (some software marks the mean instead). If the median line within the box is not equidistant from the hinges, the data are skewed.
- The ends of the vertical lines, or "whiskers," indicate the minimum and maximum data values, unless outliers are present, in which case the whiskers extend to the most extreme data values within 1.5 times the inter-quartile range of the hinges. The points beyond the ends of the whiskers are the outliers, as shown in the sketch below.

[Figure: example boxplots.]
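The 1.5 × IQR whisker rule above is easy to compute directly. Here is a minimal sketch with Python's statistics module, using invented data with one deliberately extreme value:

```python
import statistics

# Invented sample (not from the slides) with one extreme value.
data = [4.2, 4.8, 5.1, 5.3, 5.6, 5.9, 6.1, 6.4, 6.8, 7.0, 11.5]

q1, median, q3 = statistics.quantiles(data, n=4)  # lower hinge, median, upper hinge
iqr = q3 - q1                                     # inter-quartile range

# Points beyond 1.5 * IQR from the hinges are drawn as outliers;
# the whiskers stop at the most extreme values inside these fences.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < low_fence or v > high_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: [{low_fence}, {high_fence}], outliers: {outliers}")
```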
Histograms
A histogram is a way of summarizing data that are measured on an interval scale (either discrete or continuous). It is used in exploratory data analysis to illustrate the major features of the distribution of the data. A histogram divides the range of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that specific group and an area proportional to the number of observations falling into that group. This means that the rectangles may be drawn with non-uniform heights. High bars indicate more points in a class; low bars indicate fewer points. Histograms can also help detect outliers or gaps in the data set.

The strength of a histogram is that it provides an easy-to-read picture of the location and variation in a data set. There are, however, two weaknesses of histograms that you should bear in mind:
- histograms can be manipulated to show different pictures, and
- if too few or too many bars are used, the histogram can be misleading.
This is an area that requires some judgment, and perhaps some experimentation, based on the analyst's experience.

Histogram statistics
- Mean: the average of all the values.
- Minimum: the smallest value.
- Maximum: the biggest value.
- Std Dev: an expression of how widely the values are spread around the mean.
- Class Width: the x-axis distance between the left and right edges of each bar in the histogram.
- Number of Classes: the number of bars (including zero-height bars) in the histogram.
- Skewness: is the histogram symmetrical? If so, skewness is zero. If the left-hand tail is longer, skewness will be negative; if the right-hand tail is longer, skewness will be positive.
- Kurtosis: a measure of the peakedness of a distribution. The standard normal curve has a kurtosis of zero. A sharply peaked curve like the Matterhorn has positive kurtosis, while a flatter curve has negative kurtosis.

Shape: skewness and kurtosis
A "normal" distribution of variation results in a specific bell-shaped curve, with the highest point in the middle and smoothly curving symmetrical slopes on both sides of center. Many distributions are non-normal. They may be skewed, or they may be flatter or more sharply peaked than the normal distribution. A "skewed" distribution is one that is not symmetrical but rather has a long tail in one direction. If the tail extends to the right, the curve is said to be right-skewed, or positively skewed; if the tail extends to the left, it is negatively skewed. Kurtosis is also a measure of the length of the tails of a distribution: a symmetrical distribution with positive kurtosis has a greater-than-normal proportion of observations in the tails, while negative kurtosis indicates shorter tails than a normal distribution would have.

[Figure: example histograms.]

Quantile-quantile plots
Quantile-quantile (Q-Q) plots are used to see whether a given set of data follows some specified distribution. The values of the variable are first sorted into ascending order. The ith ordered observation is then plotted against the quantile of the reference distribution at cumulative probability i/n (or, when comparing two datasets, against the ith ordered observation from the second dataset). The plot should be approximately linear if the specified distribution is the correct model, or if the two datasets come from the same distribution.

[Figure: Q-Q plots against quantiles of the standard normal. Simulated N(5, 10) data fall close to a straight line; simulated Exponential(λ = 1) data curve away from it. Further panels contrast normal data with heavy-tailed and skewed data, which are not normally distributed.]

Learning Objectives
Demonstrate regression
- The least-squares regression line
- Facts about least-squares regression
- Outliers and influential observations
- Working with logarithm transformations
- Cautions about correlation and regression
- Association does not imply causation

Copyright © 2018 W. H. Freeman and Company
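To close, here is a standard-library Python sketch of the quantile-quantile construction described above, applied to simulated N(5, 10) data as in the figure; it also computes the skewness and excess kurtosis defined in the histogram-statistics list. The (i - 0.5)/n plotting position is a common refinement of the i/n rule mentioned above:

```python
import random
import statistics

random.seed(0)
sample = sorted(random.gauss(5, 10) for _ in range(500))  # simulated N(5, 10)
n = len(sample)

# Skewness and excess kurtosis as standardized third and fourth moments;
# both are near 0 for normal data (3 is subtracted so the normal scores zero).
mean, sd = statistics.fmean(sample), statistics.pstdev(sample)
z = [(v - mean) / sd for v in sample]
skewness = statistics.fmean(t ** 3 for t in z)
excess_kurtosis = statistics.fmean(t ** 4 for t in z) - 3
print(f"skewness ~ {skewness:.2f}, excess kurtosis ~ {excess_kurtosis:.2f}")

# Q-Q construction: pair the i-th of n ordered values with the standard
# normal quantile at probability (i - 0.5) / n; the offset avoids the
# impossible probabilities 0 and 1 at the extremes.
std_normal = statistics.NormalDist()
theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
qq_points = list(zip(theoretical, sample))  # the points one would plot

# For normal data these points hug a straight line whose intercept and
# slope estimate the mean and standard deviation of the sample.
b = (statistics.correlation(theoretical, sample)
     * statistics.stdev(sample) / statistics.stdev(theoretical))
a = statistics.fmean(sample) - b * statistics.fmean(theoretical)
print(f"Q-Q line: intercept ~ {a:.1f} (mean ~ 5), slope ~ {b:.1f} (sd ~ 10)")
```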