Summary

This document presents a lecture on the multiple regression model. It explains how to extend a bivariate regression model to include multiple predictors, how to interpret and visualize the resulting coefficients, and how to use dummy variables for categorical predictors, with worked examples.

Full Transcript


The multiple regression model

In a bivariate regression model, the equation is as follows (Greek letters denote population parameters):

$E(y) = \alpha + \beta x$

In the multiple regression model, this equation is extended with a second predictor:

$E(y) = \alpha + \beta_1 x_1 + \beta_2 x_2$

Or, with a third predictor, and so on:

$E(y) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots$

A b-coefficient in a multiple regression analysis describes the slope of the effect of an explanatory variable when we control for the effects of the other explanatory variables in the model. Another way of putting it: the effect of any independent variable in the model is the effect of that single variable when we hold the effects of the other explanatory variables constant.

Example: Exploring Education's Influence on Monthly Opera Attendance

Cultural capital is operationalised as opera attendance.

Bivariate regression:
Predicted monthly opera attendance = a + b1 * level of education
$\hat{y} = 0.042 + 0.032 \cdot \text{LoE}$

Multiple regression:
Predicted monthly opera attendance = a + b1 * level of education + b2 * age

Visualizing the relationship between X and Y in a multiple regression context

The Problem: In multiple regression, $\hat{y}$ depends on multiple variables, making it hard to isolate the effect of a single $x$ for visualization.

The Solution: To focus on one $x$ (let's say $x_1$), we fix the other variables ($x_2$, $x_3$, etc.) by assigning them specific values (e.g., their mean or any other value). This is like "freezing" the values of $x_2$, $x_3$, etc., so that their effects do not change $\hat{y}$.

What Happens: Fixing the other variables simplifies the equation so that it includes only one $x$ ($x_1$), which we can plot as a line.
$\hat{y} = (\text{constant based on the fixed values of } x_2, x_3, \dots) + b_1 x_1$

Fixing the value of age

$\hat{y} = -0.328 + 0.043 \cdot \text{LoE} + 0.006 \cdot \text{age}$

If age is 60: $\hat{y} = -0.328 + 0.043 \cdot \text{LoE} + 0.006 \cdot 60 = 0.032 + 0.043 \cdot \text{LoE}$
If age is 70: $\hat{y} = -0.328 + 0.043 \cdot \text{LoE} + 0.006 \cdot 70 = 0.092 + 0.043 \cdot \text{LoE}$

The slope (b) of education is the same for each age, but the intercept (a) is different.

Visualizing the regression equations for different values of age

[Figure: predicted opera attendance plotted against level of education for two fixed values of age.]
age = 70: $\hat{y} = 0.092 + 0.043 \cdot \text{LoE}$
age = 60: $\hat{y} = 0.032 + 0.043 \cdot \text{LoE}$
The lines are parallel because they have the same slope (b) for LoE; only the intercept changes.

Bivariate vs multiple regression

[Figure: the two fixed-age lines plotted together with the bivariate regression line.]
Multiple regression (two predictors), age = 70: $\hat{y} = 0.092 + 0.043 \cdot \text{LoE}$
Multiple regression (two predictors), age = 60: $\hat{y} = 0.032 + 0.043 \cdot \text{LoE}$
Bivariate regression (one predictor): $\hat{y} = 0.042 + 0.032 \cdot \text{LoE}$

Interpretation of coefficients

$\hat{y} = -0.328 + 0.043 \cdot \text{LoE} + 0.006 \cdot \text{age}$

When "controlling for age": the effect of a one-unit increase on the education scale, when age has a specific "fixed" value, gives a set of parallel "partial" regression lines, one for each value of age. In other words: there are different age groups, and within each of those age groups the effect of education on opera visits is the same (0.043).

Interpretation of multiple regression coefficients

Bivariate model: The constant represents the baseline opera attendance score for individuals with an educational level of 0. In simpler terms, individuals with no education attend the opera 0.042 times per month on average. (The constant changes as the model changes.)

Multiple regression model: The constant represents the baseline opera attendance score for individuals with an educational level of 0 and an age of 0. In simpler terms, individuals with no education and an age of 0 attend the opera -0.328 times per month on average.

In the bivariate model, the positive and significant education level slope coefficient indicates an average increase of 0.032 in opera attendance for each one-unit increase on the education scale.
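The "fixing the value of age" step can be sketched in a few lines of Python. This is only an illustration using the coefficients quoted in the lecture; the function names `partial_line` and `predict` are made up for this sketch and do not come from any statistics package.

```python
# Coefficients quoted in the lecture for the multiple regression model:
# y_hat = -0.328 + 0.043*LoE + 0.006*age
A = -0.328      # constant (intercept)
B_LOE = 0.043   # slope for level of education
B_AGE = 0.006   # slope for age


def partial_line(age):
    """Return (intercept, slope) of the education line when age is fixed."""
    # Fixing age folds its contribution into the intercept;
    # the education slope is unaffected.
    return A + B_AGE * age, B_LOE


def predict(loe, age):
    """Predicted monthly opera attendance for a given education level and age."""
    return A + B_LOE * loe + B_AGE * age


for fixed_age in (60, 70):
    intercept, slope = partial_line(fixed_age)
    print(f"age={fixed_age}: y_hat = {intercept:.3f} + {slope:.3f}*LoE")
```

Both printed lines share the slope 0.043 and differ only in the intercept, which is why the partial regression lines are parallel.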
In the multiple regression model, the positive and significant education level slope coefficient indicates an average increase of 0.043 in opera attendance for each one-unit increase on the education scale, holding age constant (that is, regardless of age).

The positive and significant age slope coefficient indicates an average increase of 0.006 in opera attendance for each one-unit increase on the age scale, holding education level constant (that is, regardless of education level).

Multiple regression with dummy variables (two categories)

Nominal/categorical variables

One assumption of linear regression analysis is that both dependent and independent variables have an interval/ratio scale (continuous variables such as age, income, height, etc.). However, many variables are not available at this measurement level, such as: married or not; repeating a grade or not; religious affiliation; political preference.

Nominal/categorical variables with two categories (dichotomous variables)

If the dependent variable is nominal/categorical, we must use models that are not covered in this course. If an independent variable is nominal/categorical, we can use dummy variables. A dummy variable is dichotomous and has the values 0 and 1. E.g., the smoking variable has two (answer) categories: "yes" and "no".

1. Do you smoke?
⎕ yes
⎕ no

Example: Smoking and self-rated health

[Slide with regression output and handwritten notes: the dummy variable is coded 0 = non-smoker, 1 = smoker; smokers have significantly lower self-rated health on average; for a smoker, $\hat{y} = 6.765 - 2.504(1) = 4.261$.]
The coefficient b represents the difference in health scores between smokers and nonsmokers. On average, smokers score 2.504 points lower than nonsmokers, or equivalently, nonsmokers score 2.504 points higher than smokers. Nonsmokers are the reference category, and in a model with a single dummy variable, the constant a corresponds to the value for the reference category.

$\hat{y} = 6.765 - 2.504x$, where $x = 1$ for a smoker and $x = 0$ for a non-smoker

Smoker: $\hat{y} = 6.765 - 2.504 \cdot 1 = 4.261$
Non-smoker: $\hat{y} = 6.765 - 2.504 \cdot 0 = 6.765$

[Figure: self-rated health (0 to 10) plotted against smoking status (non-smoker = 0, smoker = 1), with fitted line y = -2.5038x + 6.7647, R² = 0.386.]

Multiple regression with dummy variables (three or more categories)

Nominal/categorical variables with three or more categories

A nominal/categorical variable may also have more than two categories, for example religious affiliation (Jewish, Muslim, Catholic, Protestant, etc.) or type of extreme sport (skateboarder, BMX'er, inline skater).

Assumption: the categories are mutually exclusive. You can never fall into two categories at once, e.g., you can't be a Catholic and a Protestant, or a skateboarder and an inline skater, at the same time.

A nominal/categorical variable with three or more categories can also be used to create dummy variables. E.g.: do you do one of the following extreme sports (you can tick only one answer category)?
⎕ Yes, I skateboard
⎕ Yes, I BMX
⎕ Yes, I inline-skate
⎕ No, I do not like sports

Four categories, four dummies

This variable with 4 answer categories can be converted into 4 dummy variables with 2 answer categories each:
⎕ Yes, I skateboard (value = 1) vs. the rest (value = 0)
⎕ Yes, I BMX (value = 1) vs. the rest (value = 0)
⎕ Yes, I inline-skate (value = 1) vs. the rest (value = 0)
⎕ No, I do not like sports (value = 1) vs. the rest (value = 0)
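As a rough sketch of the dummy coding described above, the plain-Python snippet below turns one answer into four 0/1 dummies and reproduces the smoker and non-smoker predictions. The category labels, the sample answers, and the function names are assumptions made for illustration only; the two coefficients are the ones quoted in the lecture.

```python
# Short labels for the four answer categories (labels are illustrative).
CATEGORIES = ["skateboard", "bmx", "inline", "no_sport"]


def make_dummies(answer):
    """Return one 0/1 dummy per category: 1 for the ticked category, 0 for the rest."""
    if answer not in CATEGORIES:
        raise ValueError(f"unknown category: {answer}")
    return {category: int(answer == category) for category in CATEGORIES}


def predicted_health(smoker):
    """Self-rated health from the lecture's equation; smoker is 1 (yes) or 0 (no)."""
    return 6.765 - 2.504 * smoker


# Hypothetical respondents, one ticked answer each.
for answer in ["bmx", "no_sport", "skateboard"]:
    print(answer, make_dummies(answer))

print("smoker:", round(predicted_health(1), 3))      # 4.261
print("non-smoker:", round(predicted_health(0), 3))  # 6.765
```

Because the categories are mutually exclusive, exactly one of the four dummies equals 1 for every respondent.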
Example: Relationship between extreme sports and pain threshold

You also asked a question about the pain threshold: "Indicate your pain threshold on a scale from 0 (extremely low pain threshold) to 10 (extremely high pain threshold)."

Research question: is there a relationship between the type of extreme sports and the pain threshold?

You test the following hypothesis: "Practitioners of extreme sports have a higher pain threshold than non-athletes."

[Regression output table: pain threshold regressed on the dummy variables skateboarder, BMXer and inliner. Non-athletes (all dummies = 0) are the reference category, represented by the intercept.]

What does the output say?

On average, skateboarders score 4.9 points higher on the variable "pain threshold" compared to the reference category "non-athletes". Or: on average, non-athletes score 4.9 points lower on the variable "pain threshold" compared to skateboarders.

On average, BMXers score 3.6 points higher on the variable "pain threshold" compared to the reference category "non-athletes". Or: on average, non-athletes score 3.6 points lower on the variable "pain threshold" compared to BMXers.

On average, inliners score 2.3 points higher on the variable "pain threshold" compared to the reference category "non-athletes". Or: on average, non-athletes score 2.3 points lower on the variable "pain threshold" compared to inliners.

Since there is only one dummified independent variable in the model, the value of the reference category is equal to the value of the constant/intercept (a).

Do we have evidence supporting the hypothesis "Practitioners of extreme sports have a higher pain threshold than non-athletes"?

Prediction equation

What is the average pain threshold for a skateboarder?

$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3$
where $a$ = intercept, $x_1$ = skateboarder, $x_2$ = BMXer, $x_3$ = inliner

$\hat{y} = 3.571 + (4.857 \cdot 1) + (3.616 \cdot 0) + (2.304 \cdot 0) = 8.428$

What happens if we change the reference category?

[Two regression output tables: one with skateboarders as the reference category, one with non-athletes as the reference category.]
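The prediction equation, and the effect of switching the reference category, can be sketched as follows. The coefficients are the ones quoted in the lecture with non-athletes as reference; the re-referenced values are derived from them arithmetically (in a model that contains only these dummies, the predicted group means do not depend on which category is the reference) and are not read from the lecture's second output table. Function and key names are illustrative.

```python
# Coefficients quoted in the lecture (reference category: non-athletes).
INTERCEPT = 3.571
COEFS = {
    "non_athlete": 0.0,   # reference category: no dummy in the model
    "skateboarder": 4.857,
    "bmxer": 3.616,
    "inliner": 2.304,
}


def predicted_pain_threshold(group):
    """Predicted mean pain threshold: intercept plus the group's dummy coefficient."""
    return INTERCEPT + COEFS[group]


def rereference(new_ref):
    """Re-express intercept and coefficients relative to a new reference category."""
    # The new intercept is the predicted mean of the new reference group;
    # every other coefficient becomes a difference from that group.
    new_intercept = INTERCEPT + COEFS[new_ref]
    new_coefs = {g: b - COEFS[new_ref] for g, b in COEFS.items() if g != new_ref}
    return new_intercept, new_coefs


print(round(predicted_pain_threshold("skateboarder"), 3))  # 8.428

intercept, coefs = rereference("skateboarder")
print(f"intercept = {intercept:.3f}")
for group, b in coefs.items():
    print(f"{group}: {b:+.3f}")
```

With skateboarders as the reference, all remaining coefficients come out negative, which expresses the same group differences from the other side.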
Ref category: Non-athletes
$\hat{y} = a + b_1 \cdot \text{Skateboarder} + b_2 \cdot \text{BMXer} + b_3 \cdot \text{Inliner}$

Ref category: Skateboarders
$\hat{y} = a + b_1 \cdot \text{BMXer} + b_2 \cdot \text{Inliner} + b_3 \cdot \text{Non-athlete}$

What if you have a variable in your model in addition to your dummies?

How do we interpret the output? Still the same way, only you can no longer interpret the constant/intercept as being the average value of the reference category. It is now the average value of the reference category for age = 0.

In this model, age has been added as a control variable. So, when keeping the effect of age constant, skateboarders score on average 4.656 higher on the pain threshold scale compared to the reference category "non-athlete".

What is the average pain threshold of a 42-year-old skateboarder?

$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4$
where $a$ = intercept, $x_1$ = skateboarder, $x_2$ = BMXer, $x_3$ = inliner, $x_4$ = age

$\hat{y} = 4.411 + (4.656 \cdot 1) + (3.421 \cdot 0) + (2.257 \cdot 0) + (-0.042 \cdot 42) = 7.303$

Summary: Dummy Variables in Regression

Dummy variables are used for categorical/nominal independent variables.
Dummy variables are dichotomous.
You always include one dummy variable fewer in your model than you can create based on the original categorical/nominal variable.
The dummy variable that you do not include in your model is the reference category.
All dummy variables included in the model are to be interpreted relative to the reference category.
In a model with only one dummified independent variable, the constant/intercept is the value of the reference category.
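As a small check of the worked example with age as a control variable, the sketch below evaluates the prediction equation using the coefficients quoted in the lecture. The function name and the group labels are assumptions made for illustration.

```python
# Coefficients quoted in the lecture for the model with age as a control
# variable (reference category: non-athletes).
INTERCEPT = 4.411
DUMMY_COEFS = {"skateboarder": 4.656, "bmxer": 3.421, "inliner": 2.257}
B_AGE = -0.042
GROUPS = ("non_athlete",) + tuple(DUMMY_COEFS)


def predict_pain_threshold(group, age):
    """Predicted pain threshold for a member of `group` at a given age."""
    if group not in GROUPS:
        raise ValueError(f"unknown group: {group}")
    # One dummy per non-reference category; a non-athlete has all dummies at 0.
    dummies = {g: int(g == group) for g in DUMMY_COEFS}
    dummy_part = sum(DUMMY_COEFS[g] * dummies[g] for g in DUMMY_COEFS)
    return INTERCEPT + dummy_part + B_AGE * age


print(round(predict_pain_threshold("skateboarder", 42), 3))  # 7.303
```

Calling the function for a non-athlete at age 0 returns the intercept itself, which matches the point that the constant is the value of the reference category only at age = 0.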
