Lecture 3 PDF - Statistical Linear Models

Summary

This lecture recaps statistical linear models: linear functions, sample and population models, notation, the slope and standardized slope, and the decomposition of variance. It then covers inference for the slope (t-tests and the steps of hypothesis testing), association versus causality, and three-variable relationships such as spurious associations, mediation, multiple causes, suppressor variables, and statistical interaction.

Full Transcript


Recap Lecture 1

Linear function: y = α + βx (all data points are exactly on the line). In the social sciences we work with a statistical linear model instead:

Sample: ŷ = a + bx (the prediction equation, which gives the best matching/fitting line to the cloud of data points), or y = a + bx + e (prediction error included).
Population: E(y) = α + βx, or y = α + βx + ε.

Notation: the population mean and standard deviation, μ and σ, are (unknown) constants; the sample mean and standard deviation, ȳ and s, are variables.

Recap Lecture 2

The slope b of the prediction equation (ŷ = a + bx, or y = a + bx + e) depends on the units in which the dependent and independent variables are measured. The standardized b, which in bivariate regression equals Pearson's correlation coefficient, is a unitless and easily interpretable measure of the strength of the relationship between two variables. (Note: this equivalence holds only in bivariate regression.)

Recap Lecture 2

In linear regression, the variation in the dependent variable (the Total Sum of Squares, TSS) is decomposed into the Regression Sum of Squares (SSR) and the Error Sum of Squares (SSE):

TSS = SSR + SSE
TSS = Σ(yᵢ − ȳ)²
SSE = Σ(yᵢ − ŷᵢ)²
SSR = Σ(ŷᵢ − ȳ)²

where yᵢ is an individual observation, ŷᵢ is the prediction from the equation, and ȳ is the mean of the dependent-variable values. [Figures: scatter plots of Grade against Hours of study illustrating TSS, SSE and SSR.]

Recap Lecture 2

The coefficient of determination, r², is used to assess model fit and indicates the proportional reduction in error from using the linear prediction equation instead of ȳ to predict y:

r² = 1 − SSE/TSS = (TSS − SSE)/TSS = SSR/TSS

Another method for assessing model fit is the F-test; however, it is not always meaningful, because we are typically more interested in evaluating the effects of specific explanatory variables, for which the t-test suffices, than in the model as a whole.

Inference for the slope

We know what the estimator b for the parameter β is and what the correlation is, but we also want to say something about generalizability. Can we infer from the sample to the population: is the value found for the estimator b statistically significantly different from zero? The hypotheses are:

H₀: β = 0; Hₐ: β ≠ 0 (e.g., hours of study has an effect on the grade obtained)

Note: this is a hypothesis without direction. A directional (one-tailed) hypothesis states existence and direction (e.g., H₀: μ₁ ≤ μ₂, Hₐ: μ₁ > μ₂); a non-directional (two-tailed) hypothesis states existence but no direction (H₀: μ₁ = μ₂, Hₐ: μ₁ ≠ μ₂).

Hypothesis testing

A test statistic is used in hypothesis testing and helps you decide whether to support or reject the null hypothesis. You calculate the test statistic from your data (e.g., from an experiment or survey) and then compare its value to the expected/critical value under the null hypothesis. If the value is larger than the critical t, reject H₀.

t-test types

A t-test compares means to determine whether there is a statistically significant difference between them.
One-sample: compares the mean of a sample to a known population mean (e.g., the average class height against a known population average).
Independent: compares the means of two independent groups (e.g., difference in test scores, female vs. male).
Paired: compares the means of the same group before and after a treatment (e.g., employee productivity before and after a training programme).
Assumptions: normal distribution, independent observations, equal variances.
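To make the three variants concrete, here is a minimal sketch using Python with scipy (an assumption; the lecture itself works by hand and in SPSS). The data values and group labels are made up purely for illustration.

```python
# Minimal illustration of the three t-test variants (made-up data).
import numpy as np
from scipy import stats

# One-sample: is the class's mean height different from a known population mean of 170 cm?
heights = np.array([168, 172, 175, 169, 171, 174])
t1, p1 = stats.ttest_1samp(heights, popmean=170)

# Independent samples: do two independent groups differ in test scores?
scores_f = np.array([7.2, 6.8, 8.1, 7.5, 6.9])
scores_m = np.array([6.5, 7.0, 6.2, 7.1, 6.6])
t2, p2 = stats.ttest_ind(scores_f, scores_m)   # assumes equal variances by default

# Paired: the same employees measured before and after a training programme.
before = np.array([54, 61, 48, 57, 60])
after = np.array([58, 63, 52, 59, 65])
t3, p3 = stats.ttest_rel(before, after)

print(f"one-sample:  t = {t1:.2f}, p = {p1:.3f}")
print(f"independent: t = {t2:.2f}, p = {p2:.3f}")
print(f"paired:      t = {t3:.2f}, p = {p3:.3f}")
```

In each case the null hypothesis is "no difference", and the reported p-value is compared with the chosen α exactly as in the steps listed next.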
Steps of hypothesis testing in a t-test

1. Formulate the hypotheses, e.g. H₀: there is no difference in test scores between men and women (H₀: μ₁ = μ₂, non-directional); Hₐ: there is a difference in average test scores between men and women (Hₐ: μ₁ ≠ μ₂).
2. Choose a significance level, e.g. α = 0.05: if p < 0.05, reject H₀; if p ≥ 0.05, fail to reject H₀.
3. Compute the t-statistic.
   One-sample t-test: t = (x̄ − μ)/(s/√n), where x̄ is the sample mean, μ the population mean, s the sample standard deviation, and n the sample size.
   Independent t-test: t = (x̄₁ − x̄₂)/√(s₁²/n₁ + s₂²/n₂), where x̄₁ and x̄₂ are the group means, s₁² and s₂² the group variances, and n₁ and n₂ the sample sizes.
   Paired t-test: t = D̄/(s_D/√n), where D̄ is the mean of the differences and s_D the standard deviation of the differences.
4. Compare to the critical value/p-value: use a t-table to find the critical t for the degrees of freedom of the t-statistic. If the t-statistic exceeds the critical value, reject H₀; equivalently, if the p-value < 0.05, reject H₀.

Test statistic for the b-coefficient

It has the same form as a t or z test (recall Introduction to Statistics):

Test Statistic = (Point Estimate − Null Hypothesis value) / Standard Error
t = (b − 0)/se

se is the standard error of the sample slope b and estimates the variability of the values we would get if random samples were repeatedly drawn from the population (i.e., the standard deviation of the sampling distribution of b).

Test statistic for the b-coefficient

For the standard error of b we have:

se = s / √Σ(x − x̄)²,  with  s = √(SSE/(n − p − 1))

The degrees of freedom of the errors, n − p − 1, are what we use to find the critical t (exam: know how to compute them). Intuitively, t is what is explained by the model (the effect) divided by what is not explained by the model (the error). This is the test statistic we calculate from our data; it can be written t* or t_observed.

The steps to follow for hypothesis testing – OPTION 1

1. Formulate H₀ and Hₐ (and decide whether you should take a one-sided or two-sided test) and choose the type of test.
2. Choose a significance/alpha level (e.g., α = .05).
3. Calculate the test statistic (t_observed).
4. Determine the critical value of the test statistic (the critical value is a point in the distribution of the test statistic under the null hypothesis). For t-tests use the t-distribution with (n − p − 1) degrees of freedom at the selected alpha level.
5. Reject the null hypothesis if the calculated test statistic is greater than the critical value (t_obs > t_critical).
6. Interpret the results in terms of the content of your hypothesis and research question.

The steps to follow for hypothesis testing – OPTION 2

1. Formulate H₀ and Hₐ (and decide whether you should take a one-sided or two-sided test) and choose the type of test.
2. Choose a significance/alpha level (e.g., α = .05).
3. Calculate the test statistic (t_observed).
4. Find the p-value that corresponds to the t_observed value. The p-value answers the question "how likely is it that we would get a test statistic as large as we did if the null hypothesis were true?". The p-value thus quantifies the strength of evidence against the null hypothesis: a smaller p-value means stronger evidence against it.
5. Reject the null hypothesis if the p-value < the significance level α.
6. Interpret the results in terms of the content of your hypothesis and research question.

Calculation of t*/t_observed

s = √(SSE/(n − p − 1)) = √(4.53/4) = √1.1325 = 1.064

x    x̄    (x − x̄)²
0    2.5   6.25
1    2.5   2.25
2    2.5   0.25
3    2.5   0.25
4    2.5   2.25
5    2.5   6.25
           sum = 17.5

se = s / √Σ(x − x̄)² = 1.064 / √17.5 = 0.254

Calculation of t*/t_observed

t = b/se = 1.60/0.254 = 6.287
df = 6 − 2 = 4

Let us say we test at α = 0.01.

Walkthrough of Option 1

x = Hours of study, y = Grade.

x    y
0    0
1    3
2    3
3    7
4    7
5    8

ŷ = 0.667 + 1.600x

(Handwritten note: in the population equation, a coefficient that is not significant is treated as 0, e.g. E(y) = 0 + 1.600x if the intercept is not significant.)
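Before looking up the critical value, here is a minimal sketch (Python with numpy and scipy, which is an assumption; the lecture itself works by hand and in SPSS) that reproduces b, se, t_observed, the critical t at α = 0.01 and the p-value for this example from the raw data.

```python
# Sketch: reproduce the slope test for the study-hours example by hand.
import numpy as np
from scipy import stats

x = np.array([0, 1, 2, 3, 4, 5])   # hours of study
y = np.array([0, 3, 3, 7, 7, 8])   # grade

n, p = len(x), 1                   # 6 observations, 1 predictor

# Least-squares estimates of the slope and intercept
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Residual sum of squares and the residual standard deviation s
y_hat = a + b * x
SSE = np.sum((y - y_hat) ** 2)                       # ≈ 4.53
s = np.sqrt(SSE / (n - p - 1))                       # ≈ 1.064

# Standard error of b, observed t, critical t at α = 0.01 (two-tailed), and p-value
se = s / np.sqrt(np.sum((x - x.mean()) ** 2))        # ≈ 0.254
t_obs = (b - 0) / se                                 # ≈ 6.287
t_crit = stats.t.ppf(1 - 0.01 / 2, df=n - p - 1)     # ≈ 4.604
p_value = 2 * stats.t.sf(abs(t_obs), df=n - p - 1)   # ≈ 0.003

print(f"b = {b:.3f}, se = {se:.3f}, t = {t_obs:.3f}, "
      f"t_critical = {t_crit:.3f}, p = {p_value:.3f}")
```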
Walkthrough of Option 1: Find t_critical

Using a t-table (or an online calculator) with df = 4 and α = 0.01 (two-tailed), the critical t is 4.604.

Walkthrough of Option 1: Compare t_observed to t_critical

6.287 > 4.604, so t_observed > t_critical. We can reject the null hypothesis and say that the value of b differs significantly from zero.

Visualization of the t-test

[Figure: the sampling distribution of the test statistic under H₀, with a rejection region of P = 0.005 in each tail beyond the critical values t = −4.604 and t = +4.604.] Our t-value of 6.287 lies above the critical value of 4.604 and thus in the rejection region.

Walkthrough of Option 2

x = Hours of study, y = Grade (same data as above). The relevant SPSS output cell shows the p-value associated with t_observed. The p-value is 0.003, which is lower than the significance level α: 0.003 < 0.01, so we can reject the null hypothesis and say that the value of b differs significantly from zero.

SPSS output (Study hours and grade)

What                    Value
b                       1.60
a                       0.67
r                       0.95
r²                      0.91
se                      0.25
t                       6.29
RSS/MSS (= SSR)         44.80
SSE                     4.53
TSS                     49.33
dfM                     1
dfE                     4
dfT                     5
p value (for t-test)    0.003

Interpret the results in context

There is a significant positive relationship between hours of study and grade (b = 1.600, t(4) = 6.29, p = .003). In other words, the more hours one spends studying, the better the grade one receives. Note: this is just an example with a very small sample that does not meet the sampling assumptions.
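For readers who want to reproduce the SPSS output above outside SPSS, here is a minimal sketch using Python's statsmodels (an assumption; the course itself uses SPSS). With the same six observations it returns the same b, a, se, t, r² and p-value; note that statsmodels labels the lecture's SSR as ess and the lecture's SSE as ssr.

```python
# Sketch: reproduce the SPSS output for the study-hours example (assumes statsmodels).
import numpy as np
import statsmodels.api as sm

x = np.array([0, 1, 2, 3, 4, 5])     # hours of study
y = np.array([0, 3, 3, 7, 7, 8])     # grade

X = sm.add_constant(x)               # adds the intercept column
model = sm.OLS(y, X).fit()

print(model.params)                  # a ≈ 0.67, b ≈ 1.60
print(model.bse)                     # standard errors; se(b) ≈ 0.25
print(model.tvalues, model.pvalues)  # slope: t ≈ 6.29, p ≈ 0.003
print(model.rsquared)                # r² ≈ 0.91
# Naming differs from the lecture: statsmodels' ess is the lecture's SSR,
# and statsmodels' ssr is the lecture's SSE.
print(model.ess, model.ssr, model.centered_tss)   # ≈ 44.80, 4.53, 49.33
print(model.summary())               # full regression table (coefficients, df, F-test)
```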
Association and causality

Association/correlation does not imply causality. "While correlation - or, more generally, association - does not imply causation, causation must in some way or other imply association" (Goldthorpe, 2001, p. 2).

Three criteria for a causal relationship between two variables (exam material):
1. There must be an association between the two variables.
2. There must be an appropriate time order.
3. Alternative explanations should be eliminated.

Time order

The cause (X) should precede the effect (Y). Sometimes the causal direction is obvious (e.g., race, age or sex exist before current achievements), but in most cases it is not, because the variables are measured together with no specific time order.

Alternative explanations

Are there any alternative explanations for a relationship between two variables? Will the relationship between two variables remain if we remove the effect of other variables? How does that work? Experiments are the gold standard for establishing causality: in laboratory experiments we create conditions in which we can control all factors and thereby isolate a causal relation.

Experimental control in biology

To assess the influence of soil type on tomato plant growth, one can use the following design: plants grow in Soil A (standard) or Soil B (enriched with nutrients), and growth in cm is compared. To isolate the impact of soil type on plant growth, we maintain constant conditions for sunlight exposure, water supply, temperature and pot size, effectively controlling these variables throughout the experiment.

Experimental control in human behaviour research

What is the effect of eco-labelling on purchasing behaviour for wines? 1) Participants are randomly assigned to a treatment or a control condition; 2) they indicate how much money they would pay for the wine. This experiment does not directly control for other variables, but through randomization it aims to create similar variable distributions between the groups, including for unobserved variables potentially linked to the outcome.

Alternative explanations

Not all social science inquiries can be addressed through experiments. For instance, in investigating the impact of educational level on voting preferences, it is not feasible (for ethical reasons) to assign children to varying educational levels and subsequently ask about their voting preferences. In observational studies, we can use statistical control instead of experimental control.

Summary of three-variable relationships (exam material)

The main types are spurious associations, chain/indirect relationships (mediation), multiple causes (independent or related), suppressor variables, and statistical interaction (moderation) (Agresti & Finlay, 2018).

Spurious associations

If the dependent variable (Y) and the independent variable (X₁) both depend on a third variable (X₂), and their association disappears when we control for this third variable, then we speak of a spurious relationship between Y and X₁. We thought that height (X₁) influenced school performance (Y), but it turns out that age (X₂) influences both height and school performance, so we control for age.

What does it look like in the regression output? In the bivariate regression y = a + bx (here ŷ = 9.478 + 0.375x) the association is strong and significant (correlation close to 1), but once the third variable is added as a control, the coefficient of X₁ becomes insignificant.

Chain/indirect relationship via a mediator variable

Education (X₁) → Income (X₂) → Health (Y). We speak of an indirect relationship if the relationship between X₁ and Y runs via a third variable X₂. For example, education has a positive relationship with health, which runs via income. A potential explanatory mechanism is as follows: individuals with higher education levels tend to earn more, enabling them to afford superior healthcare services and, consequently, uphold better health.

What does it look like in the regression output? The bivariate effect of X₁ on Y is strong and significant, but after adding the mediator it turns weak/insignificant. Important: the difference between an indirect relationship and a spurious relationship is made on the basis of theory.

Multiple causes: independent causes

Age (X₁) and Sex (X₂) both affect adolescents' smoking behaviour (Y), and there is no relation between X₁ and X₂. What does it look like in the regression output? Both effects are significant, and because the causes are unrelated, entering them together hardly changes either coefficient.

Multiple causes: related causes

Parental attitudes towards smoking (X₁) and parental smoking behaviour (X₂) both affect adolescents' smoking behaviour (Y), and they are related to each other. What does it look like in the regression output? The strong effect of X₁ remains significant, but it decreases once X₂ is added.

Suppressor variables

Suppose, based on previous research, you expect a positive relationship between education and annual income, but your data show no relation whatsoever. How is that possible? After some reflection and checking the literature, you find out that older people tend to have higher income but lower education than young people. You examine these relations in your data (age and education, and age and income). So, what will happen if we control for age in our model (i.e., remove the effect of age on the relationship between education and income)? By removing the effect of age, we see that the effect of education on income is indeed there. Age is therefore a so-called suppressor variable: a variable that "suppresses" (or "conceals") the effect of another variable. The effect of education on income is suppressed by the age variable.
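To illustrate what statistical control does in the spurious-association case, here is a small simulated sketch (the data are generated, not the lecture's; the variable names are hypothetical, and numpy/statsmodels are assumed): age drives both height and school performance, so the bivariate height coefficient looks significant until age is added as a control.

```python
# Sketch: a simulated spurious association that disappears under statistical control.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500

age = rng.uniform(6, 12, n)                          # X2: age in years
height = 80 + 6 * age + rng.normal(0, 5, n)          # X1: height depends on age
performance = 2 + 1.5 * age + rng.normal(0, 2, n)    # Y: depends on age, NOT on height

# Bivariate regression: height appears to "affect" performance
biv = sm.OLS(performance, sm.add_constant(height)).fit()
print("bivariate slope for height:", round(biv.params[1], 3),
      "p =", round(biv.pvalues[1], 4))

# Multiple regression: controlling for age, the height effect shrinks toward zero
X = sm.add_constant(np.column_stack([height, age]))
multi = sm.OLS(performance, X).fit()
print("controlled slope for height:", round(multi.params[1], 3),
      "p =", round(multi.pvalues[1], 4))
```

This reproduces the regression-output pattern described above: a strong bivariate coefficient that becomes insignificant once the third variable is controlled for.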
Controlling for variables

In the previous example, when we control for the effect of age, we effectively keep the effect of age constant. What we do with 'keeping constant' is to look at the relationship between education and income within a group of cases of the same age. Looking at relations between two variables within groups is also called partial regression. This assumes that the relationship between education and income is the same within each age group.

Statistical interaction

The effect of an independent variable on the dependent variable may be influenced by another independent variable. For example, certain industries (e.g., tech, finance) offer higher returns on education compared to others (e.g., retail, transportation): industry of employment (X₂) moderates the effect of education (X₁) on income (Y).
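As a final sketch (simulated data, hypothetical variable names; pandas and statsmodels are assumed), a regression with an interaction term lets the return to education differ by industry, which is the moderation pattern described above.

```python
# Sketch: statistical interaction (moderation) with simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 400

education = rng.uniform(8, 20, n)        # years of education (X1)
tech = rng.integers(0, 2, n)             # industry dummy (X2): 1 = tech/finance, 0 = other
# Return to education is higher in the tech/finance group (1.5 + 2.0 vs 1.5 per year)
income = 10 + 1.5 * education + 5 * tech + 2.0 * education * tech + rng.normal(0, 4, n)

df = pd.DataFrame({"income": income, "education": education, "tech": tech})

# 'education * tech' expands to education + tech + education:tech (the interaction term)
model = smf.ols("income ~ education * tech", data=df).fit()
print(model.params)   # the education:tech coefficient estimates the extra return in tech/finance
```

A significant interaction coefficient means the slope of education on income depends on industry, i.e., X₂ moderates the X₁–Y relationship.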
