
ASA 3 answers.pdf


Full Transcript

Assignment 3
Logistic regression and interaction variables

The aim of this assignment is to learn how to perform logistic regression models with the statistical software package STATA. In later exercises we focus on interaction variables to improve our statistical models and comprehension of model results, to critically reflect on the goodness of fit and to critically reflect on microdata using STATA.

Create a *.do file with STATA syntax to work time-efficiently. STATA output tables and figures should be neatly presented in your Word document. For an incomplete overview of STATA commands, see the STATA workshop documents and the Appendix of this document.

All exercises!
- Always paste your syntax and the appropriate output from STATA in your answer file to support your answers.
- Whenever a new command is used, explain the structure and content of the code. At all times, make sure your work is transparent and traceable! Remember to save your work.
- Upload the answers of your assignment on Nestor when you are finished. The deadline is on Wednesday 19h CET.

Student names:
Student IDs:

Introduction

Topic: Fertility intentions
Research question: Which socio-demographic characteristics influence whether a person intends to have a child within the next three years or not?
Dataset: GGS_NetherlandsW1_teaching.dta

The dataset used in this week’s assignment is an adjusted subset from Wave 1 of the Generations and Gender Survey (GGS), containing a sample of 18-50 year old individuals in the Netherlands. You will use the data to study fertility intentions. After age 50, people are considered to no longer be able to conceive children, and are thus excluded from this study.
The variables in this dataset are:

arid          ID variable
a622          1801 = wants to have a child within 3 years
              1802 = wants to have a child >3 years
              1803 = does not want to have (more) children
aage          Age of the respondent in years
asex          1 = male
              2 = female (ref)
ankids        Number of children of the respondent
unionstatus   1 = single
              2 = LAT
              3 = cohabiting (ref)
              4 = married
employ        1 = employed (ref)
              2 = other
education     1 = secondary or less
              2 = post-secondary (ref)
prevdissol    If respondent experienced the dissolution of a previous relationship
              0 = no (ref)
              1 = yes
siblings      Number of siblings the respondent has
mumalive      Whether respondent’s mother is still alive
              0 = not alive
              1 = alive (ref)

Report
Collect the answers of each exercise in this Word document including the *.do file in the Appendix. From this moment on, train yourself to answer the questions in an academic manner.

If you experience any difficulties with the assignment or Stata:
1) First, Google it;
2) Then, discuss it with your fellow students;
3) Last, ask or e-mail the computer lab supervisors (preferably during the computer lab sessions).

Good luck with assignment 3!

Exercise A

1. Report summary statistics. How many observations are there? How many observations of each variable? For each variable, what is the min/max, mean and standard deviation of the variable? Do we need to worry about missing values? Which variables need to be transformed?

sum

Variable        Obs      Mean       Std. Dev.   Min     Max
arid            3,459    81412.45   2378.315    77328   85485
aage            3,459    34.93033   7.663585    18      50
asex            3,459    1.559699   .496495     1       2
education       3,459    1.349523   .4768883    1       2
ankids          3,459    1.235328   1.251077    0       7
a622            3,459    1802.386   .8112421    1801    1803
unionstatus     3,459    2.952587   1.231247    1       4
employ          3,459    1.262215   .4399023    1       2
prevdissol      3,459    .2656837   .4417605    0       1
siblings        3,459    2.558543   1.986573    0       15
mumalive        3,459    .8722174   .3338958    0       1

Each of the variables has the same number of observations, so we do not have to worry about missing values.
It seems we do need to transform some of the variables: the dependent variable a622 is the most important one, but also the categorical variables need to be used as dummy variables (either by specifying i. in regression, or by generating dummies).

2. Given the research question, what kind of dependent variable is needed? Can we work with the data as it is now? If not, make any necessary changes to the variables. Report and explain what you did.

* First ask frequency table for a622, to explore the values.
tab a622

Intention to have a child during next 3 yrs     Freq.    Percent    Cum.
want to have a child within 3 years             728      21.05      21.05
wants to have a child in > 3 years              668      19.31      40.36
does not want to have (more) children           2,063    59.64      100.00
Total                                           3,459    100.00

Alternative code, that gives you both the values and labels:

sysdir set PLUS "X:\PLUS"
ssc install fre
fre a622

a622: Intention to have a child during next 3 yrs
Valid                                           Freq.    Percent    Valid     Cum.
1801 want to have a child within 3 years        728      21.05      21.05     21.05
1802 wants to have a child in > 3 years         668      19.31      19.31     40.36
1803 does not want to have (more) children      2063     59.64      59.64     100.00
Total                                           3459     100.00     100.00

Or:

codebook a622

a622: Intention to have a child during next 3 yrs
type:           numeric (int)
label:          a622
range:          [1801,1803]       units: 1
unique values:  3                 missing .: 0/3,459
tabulation:     Freq.    Numeric   Label
                728      1801      want to have a child within 3 years
                668      1802      wants to have a child in > 3 years
                2,063    1803      does not want to have (more) children

* Create binary variable for child intentions.
generate chintent = (a622 == 1801)
* Assign variable label.
label variable chintent "Whether intends to have a child within 3 years"
* Assign value labels.
label define chintentl 0 "No" 1 "Yes"
label values chintent chintentl

Given the research question, the response variable should be binary, telling you whether the person intends to have a child within the next three years (yes/no).
The variable ‘a622’ needs to be recoded as a dummy variable before continuing the analysis. We ask Stata to compute a new variable, called ‘chintent’, where those individuals who intend to have a child within the next three years (a622==1801) will be coded with a 1 (yes) in the new variable. All those who do not have this intention (a622==1802 or a622==1803) will be coded as a 0 (no). We also add value labels and a variable label to the new variable.

Please note that recoding the a622 variable is also a possibility (use it instead of, not in addition to, the generate command above). However, always recode into a different variable, because if you recode the same variable you lose the original data.

recode a622 (1801 = 1 "Yes") (1802 1803 = 0 "No"), gen(chintent)

3. Run an OLS regression using the dependent variable (which you created in question 2) and all available socio-demographic characteristics. Make sure you set the reference categories as said in the beginning of this assignment. To do this once permanently, use command fvset base 1 variablename. Show the regression results and provide a neat table of the results using the “outreg2” command (see Appendix 2). Which assumptions of OLS regression and the error term are violated? (see lecture 3 and Appendix 2 for some help) Show the problems using statistical tests and graphs and explain each statistical test and graph. What are the consequences of each violation? Propose a solution to the problem. Hint: use the command predict to save and explore the error term.

* Set the reference categories.
fvset base 0 chintent
fvset base 2 asex
fvset base 3 unionstatus
fvset base 1 employ
fvset base 2 education
fvset base 0 prevdissol
fvset base 1 mumalive

You first have to set the reference categories for the categorical variables. As default, Stata takes the smallest value as reference. With the command fvset you can set new reference categories permanently. The alternative is that you define the reference category every time you run an analysis.
In the reg command, you would then specify ib3.unionstatus.

Note: you might need to install the ‘outreg2’ command first. The command is ‘ssc install outreg2’, after you have changed the sysdir directory to your student drive!

sysdir set PLUS "X:\PLUS"

* Run linear regression with all available independent variables.
reg chintent aage i.asex ankids i.unionstatus i.employ i.education i.prevdissol siblings i.mumalive

Source      SS            df       MS              Number of obs   = 3,459
Model       99.7472416    11       9.06793105      F(11, 3447)     = 65.80
Residual    475.033909    3,447    .137810824      Prob > F        = 0.0000
Total       574.781151    3,458    .166217799      R-squared       = 0.1735
                                                   Adj R-squared   = 0.1709
                                                   Root MSE        = .37123

chintent               Coef.       Std. Err.   t        P>|t|   [95% Conf. Interval]
aage                   -.0124145   .0011002    -11.28   0.000   -.0145715   -.0102574
asex
  male                 -.0458278   .013359     -3.43    0.001   -.0720202   -.0196354
ankids                 -.089256    .0069752    -12.80   0.000   -.1029319   -.0755801
unionstatus
  single               -.2375353   .0207536    -11.45   0.000   -.278226    -.1968446
  LAT                  -.2519584   .0257882    -9.77    0.000   -.3025201   -.2013967
  Married              .0042734    .0201321    0.21     0.832   -.0351987   .0437454
employ
  Other                -.0895282   .0155257    -5.77    0.000   -.1199687   -.0590876
education
  Secondary or less    -.0447768   .0137497    -3.26    0.001   -.0717352   -.0178185
prevdissol
  Yes                  .0782544    .015586     5.02     0.000   .0476958    .1088131
siblings               .0077603    .0033233    2.34     0.020   .0012444    .0142761
mumalive
  Not alive            -.0155577   .0196718    -0.79    0.429   -.0541273   .0230118
_cons                  .864825     .0371062    23.31    0.000   .7920727    .9375773

ssc install outreg2
outreg2 using reg01, replace excel e(all)

                 (1)
VARIABLES        chintent
aage             -0.0124***   (0.00110)
1.asex           -0.0458***   (0.0134)
ankids           -0.0893***   (0.00698)
1.unionstatus    -0.238***    (0.0208)
2.unionstatus    -0.252***    (0.0258)
4.unionstatus    0.00427      (0.0201)
2.employ         -0.0895***   (0.0155)
1.education      -0.0448***   (0.0137)
1.prevdissol     0.0783***    (0.0156)
siblings         0.00776**    (0.00332)
0.mumalive       -0.0156      (0.0197)
Constant         0.865***     (0.0371)
Observations     3,459
R-squared        0.174
rank             12
ll_0             -1804
ll               -1474
r2_a             0.171
rss              475.0
mss              99.75
rmse             0.371
r2               0.174
F                65.80
df_r             3447
df_m             11
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

predict e, res
sum e

Variable    Obs      Mean       Std. Dev.   Min         Max
e           3,459    1.71e-11   .3706379    -.6435837   1.149897

hist e

[Figure: histogram of the residuals e (x-axis: Residuals; y-axis: Density)]

[Annotation on the page, translated from Romanian: we run a logarithmic regression or use percentage changes: (today's change - yesterday's change) / yesterday's change.]

* Graph scatter plots between the error term and the independent variables (one example is done below, aage)
graph twoway scatter e aage

[Figure: scatter plot of the residuals e against aage (x-axis: Age of Respondent; y-axis: Residuals)]

* Test for heteroscedasticity of the error term. Try to interpret the test.
estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of chintent
chi2(1)       = 465.89
Prob > chi2   = 0.0000

The null hypothesis of this test (constant variance, homoskedasticity) is rejected.

* Perform Durbin-Wu-Hausman test to test for endogeneity (not further elaborated in this answer model)

Violation 1: Heteroskedasticity. Plot residuals versus predicted (fitted) values. Graph in STATA: -graph scatter e x-
Ideally the plot should look like a random scatter of points, as we want residuals to be randomly spread over the range of measured values. But it clearly does not. Formal test in STATA: -estat hettest-
Consequence: Standard errors will be wrong (efficiency loss) and it is difficult to do hypothesis testing.

Violation 2: Errors are not normally distributed (hist e, normal). By definition, you cannot have a normal distribution of the error terms when the residuals are only free to take on two possible values.
Consequence: Standard errors will be wrong (efficiency loss) and it is difficult to do hypothesis testing.

Violation 3: Linearity. The existing values of the dependent variable are only 0 or 1. Using OLS, there is no constraint that the predicted value of the dependent variable falls in the 0-1 range. This implies that it is possible to predict values below 0 and above 1 which are, by definition, incorrect. What we learn from this is that the true relationship between the expected value of Y and X is nonlinear.
Consequence: OLS will tend to give the correct sign but the magnitude of the estimates is likely to be grossly understated or overstated (biased coefficients / coefficients are not consistent) and it is impossible to do statistical inference.

Solution: Alternative estimation techniques. In this case, run a discrete (choice) model, such as logistic or probit regression.

4. What is the predicted probability (using the estimates of question 3) of…
i) a 40y-old woman, single, unemployed, lower educated, with two children and no siblings, whose previous relation has been dissolved and whose mother is alive…
ii) a 30y-old woman, married, employed, higher educated, without children and without siblings, who has not experienced the dissolution of a previous relationship and whose mother has passed away…
… to have fertility intentions? Is there anything noteworthy about the predicted probabilities?

The predicted probabilities are as follows (see Excel file). The coefficients appear in the order: constant, age, male, union status, other employment, secondary or less, nr of children, siblings, prevdissol, mother not alive.

i) y = 0.8648 - 0.0124*40 - 0.0458*0 - 0.2375*1 - 0.0895*1 - 0.0448*1 - 0.0893*2 + 0.0078*0 + 0.0783*1 - 0.0156*0 ≈ -0.10
ii) y = 0.8648 - 0.0124*30 - 0.0458*0 + 0.0043*1 - 0.0895*0 - 0.0448*0 - 0.0893*0 + 0.0078*0 + 0.0783*0 - 0.0156*1 ≈ 0.48

The predicted probability of i) is negative, which underlines the problems with predicting a binary variable (chintent) using OLS.
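The two hand calculations of question 4 can also be checked with plain arithmetic in Stata. The sketch below assumes the full-precision coefficients from the reg output of question 3; the scalar names (b_cons, b_age, and so on) are made up for this illustration and do not appear in the assignment.

```stata
* Sketch: linear-probability predictions of question 4, computed by hand.
* Coefficients are copied from the reg output; scalar names are illustrative only.
scalar b_cons    = .864825
scalar b_age     = -.0124145
scalar b_kids    = -.089256
scalar b_single  = -.2375353
scalar b_married = .0042734
scalar b_other   = -.0895282
scalar b_lowedu  = -.0447768
scalar b_dissol  = .0782544
scalar b_mumdead = -.0155577

* i) woman, 40, single, not employed, secondary or less, 2 children,
*    no siblings, previous relationship dissolved, mother alive
display b_cons + b_age*40 + b_single + b_other + b_lowedu + b_kids*2 + b_dissol

* ii) woman, 30, married, employed, post-secondary, no children,
*     no siblings, no dissolution, mother not alive
display b_cons + b_age*30 + b_married + b_mumdead
```

Terms for reference categories are simply left out, since their dummies are 0. The first value comes out slightly below zero and the second near 0.48, which matches the point made above: OLS does not keep predictions inside the 0-1 range.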
Note: for clarity we have also written down the coefficients that are not applicable and have multiplied them by 0, but of course, you can simply leave those out and also leave out *1 for the applicable coefficients.
Note: you can best do these calculations in Excel, if you copy and paste your coefficients there. See the Excel answer sheet for the calculations.
Note: we want you to be able to calculate these probabilities manually, but it is also possible to have Stata do it with the following code:

margins, at(aage=(40) asex=(2) unionstatus=(1) employ=(2) education=(1) ankids=(2) siblings=(0) prevdissol=(1) mumalive=(1))

5. Interpret the coefficient of age (sign, significance, value).

The coefficient of age is negative and statistically significantly different from zero (at the 1% significance level). If age increases by 1 year, the predicted probability of intending to have a child within 3 years decreases by 0.0124, that is, by about 1.2 percentage points.

6. Now, run the appropriate method of analysis with the main effects of all available socio-demographic characteristics. Make sure you set the right reference categories (use command fvset: fvset base 1 variablename). Decide whether you want to keep all variables in your final model. What is the (final) model specification on the logit scale? What is the (final) model specification on the odds scale? What is the (final) model specification on the probability scale? Show that the results are robust to changes in the model specification (for example, by leaving one variable out).

* Set the reference categories.
fvset base 1 chintent
fvset base 2 asex
fvset base 3 unionstatus
fvset base 1 employ
fvset base 2 education
fvset base 0 prevdissol
fvset base 1 mumalive

* Run logistic regression with all available independent variables.
logit chintent aage i.asex ankids i.unionstatus i.employ i.education i.prevdissol siblings i.mumalive

Logistic regression                     Number of obs   = 3,459
                                        LR chi2(11)     = 679.37
                                        Prob > chi2     = 0.0000
Log likelihood = -1440.2196             Pseudo R2       = 0.1908

chintent               Coef.       Std. Err.   z        P>|z|   [95% Conf. Interval]
aage                   -.0885674   .0086794    -10.20   0.000   -.1055788   -.071556
asex
  male                 -.327371    .0976669    -3.35    0.001   -.5187946   -.1359475
ankids                 -.824535    .0661383    -12.47   0.000   -.9541637   -.6949063
unionstatus
  single               -1.462335   .143403     -10.20   0.000   -1.743399   -1.18127
  LAT                  -1.520078   .1802654    -8.43    0.000   -1.873391   -1.166764
  Married              .4669831    .1378864    3.39     0.001   .1967307    .7372355
employ
  Other                -.6493875   .123818     -5.24    0.000   -.8920664   -.4067087
education
  Secondary or l..     -.3006374   .0998864    -3.01    0.003   -.4964111   -.1048638
prevdissol
  Yes                  .4883868    .1138649    4.29     0.000   .2652158    .7115578
siblings               .0584183    .0259698    2.25     0.024   .0075184    .1093182
mumalive
  Not alive            -.2650245   .1757751    -1.51    0.132   -.6095373   .0794884
_cons                  2.876541    .2793299    10.30    0.000   2.329064    3.424018

You first have to set the reference categories for the categorical variables. As default, Stata takes the smallest value as reference. With the command fvset you can set new reference categories permanently. The alternative is that you define the reference category every time you run an analysis. In the logit command, you would then specify ib3.unionstatus. If you already specified fvset for question 3, then you do not need to do it again.

In the regression, you specify which variables are categorical by adding i. in front of the variable name so they will be modelled as dummy variables. Note that you do not need to specify that the dependent variable is categorical, or what the reference category is. The command logit is for binary response variables only, and models the probability of a positive outcome (dependent variable is nonzero). You can also read this in the help file (help logit).

The logit command gives coefficients (beta) and their confidence intervals, while the logistic command gives odds ratios (exp(beta)) and their confidence intervals.
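To make the link between the two commands concrete, the odds ratios can be obtained by exponentiating the logit coefficients. The lines below are a small sketch that assumes the coefficients from the full model output above; only three of them are shown.

```stata
* Sketch: odds ratio = exp(beta), using coefficients copied from the logit output.
display "aage            " %6.3f exp(-.0885674)
display "asex male       " %6.3f exp(-.327371)
display "prevdissol yes  " %6.3f exp(.4883868)

* With the data in memory, the or option of logit reports odds ratios directly:
* logit chintent aage i.asex ankids i.unionstatus i.employ i.education i.prevdissol siblings i.mumalive, or
```

For age this gives an odds ratio of roughly 0.92: each additional year of age multiplies the odds of intending to have a child within 3 years by about 0.92, holding the other variables constant.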
You will also notice that the logistic command does not give any information regarding the constant, because it does not make much sense to talk about a constant with odds ratios (however, the constant is important when you want to predict, so the logit command is preferred).

We first ran the model with all available socio-demographic characteristics as independent variables. Note that one of them (‘mumalive’) is not statistically significant, looking at the z-score, p-value or 95% confidence interval. Next, we run the regression again, this time removing ‘mumalive’ from the model, and we investigate whether the coefficients of the other independent variables are robust to this change.

* Run logistic regression without mumalive.
logit chintent aage i.asex ankids i.unionstatus i.employ i.education i.prevdissol siblings

Logistic regression                     Number of obs   = 3,459
                                        LR chi2(10)     = 677.02
                                        Prob > chi2     = 0.0000
Log likelihood = -1441.3969             Pseudo R2       = 0.1902

Logit = 2.942857 - 0.091*age - 0.3302*asex male - 0.8212*ankids .....

chintent               Coef.       Std. Err.   z        P>|z|   [95% Conf. Interval]
aage                   -.0910597   .0085348    -10.67   0.000   -.1077875   -.0743319
asex
  male                 -.330242    .0976205    -3.38    0.001   -.5215746   -.1389094
ankids                 -.8212769   .065995     -12.44   0.000   -.9506246   -.6919291
unionstatus
  single               -1.466456   .1434075    -10.23   0.000   -1.74753    -1.185383
  LAT                  -1.525917   .1802282    -8.47    0.000   -1.879158   -1.172676
  Married              .4667768    .1378468    3.39     0.001   .1966021    .7369515
employ
  Other                -.6556288   .1237742    -5.30    0.000   -.8982218   -.4130358
education
  Secondary or l..     -.302349    .0998367    -3.03    0.002   -.4980253   -.1066727
prevdissol
  Yes                  .488535     .1138365    4.29     0.000   .2654197    .7116504
siblings               .0556298    .0258379    2.15     0.031   .0049884    .1062711
_cons                  2.942857    .2761546    10.66    0.000   2.401604    3.48411

The fitted regression equations are:

Logit(intention to have a child within 3 years) = 2.9429 - 0.0911 * age - 0.3302 * male - 0.8213 * nr children - 1.4665 * single - 1.5259 * LAT + 0.4668 * married - 0.6556 * other employment status - 0.3023 * secondary education or less + 0.4885 * yes prevdissol + 0.0556 * nr siblings

Odds(intention to have a child within 3 years) = exp(2.9429 - 0.0911 * age - 0.3302 * male - 0.8213 * nr children - 1.4665 * single - 1.5259 * LAT + 0.4668 * married - 0.6556 * other employment status - 0.3023 * secondary education or less + 0.4885 * yes prevdissol + 0.0556 * nr siblings)

Probability(intention to have a child within 3 years) = odds / (1 + odds) = exp(Logit) / (1 + exp(Logit)), where Logit is the full right-hand side of the first equation.

Leaving out one variable is also directly a robustness check. The coefficients of the independent variables seem to be robust to the changes that were made (the coefficients seem not significantly different). Note that you are not allowed to compare the coefficients between different specifications in logistic regression, so be careful with comparisons (different specifications in logistic regression have different likelihood functions).

7.
What is the predicted probability (using the estimates of question 6) of
i) a 40y-old woman, single, unemployed, lower educated, with two children and no siblings, whose previous relationship has been dissolved?
ii) a 30y-old woman, married, employed, higher educated, without children and without siblings, who has not experienced the dissolution of a previous relationship?
Is there anything noteworthy about the predicted probabilities? Remember to first calculate the log(odds), then odds, and finally the probabilities.

The predicted log(odds), odds and probabilities are as follows (see Excel file). The coefficients appear in the order: constant, age, male, union status, other employment, secondary or less, nr of children, siblings, prevdissol.

i) ln(odds) = 2.9429 - 0.0911*40 - 0.3302*0 - 1.4665*1 - 0.6556*1 - 0.3023*1 - 0.8213*2 + 0.0556*0 + 0.4885*1 ≈ -4.278
   odds = e^(ln(odds)) = 0.0139
   Probability = 0.014
ii) ln(odds) = 2.9429 - 0.0911*30 - 0.3302*0 + 0.4668*1 - 0.6556*0 - 0.3023*0 - 0.8213*0 + 0.0556*0 + 0.4885*0 ≈ 0.678
   odds = e^(ln(odds)) = 1.9696
   Probability = 0.663

Nothing really noteworthy. Probabilities are between 0 and 1, as they should be.

Note: Use as many decimals as possible for the steps prior to the final probabilities, especially in the exam! It is advisable to copy the regression output to Excel and make calculations using Excel.

Again: you need to be able to do these calculations in Excel as done above. But it is good to also be aware of this alternative:

margins, at(aage=(40) asex=(2) unionstatus=(1) employ=(2) education=(1) ///
    ankids=(2) siblings=(0) prevdissol=(1))

8. Do older individuals have greater intentions to have a child within 3 years? Explain why or why not. Is there a linear or non-linear relationship between age and the intentions of having a child within 3 years? Explain how you can test for this and perform the regressions. Exclude the variable mumalive.

No. The results in question 6 suggest that there is a negative, not positive, relationship between age and the log(odds) of the intentions of having a child within 3 yrs.
To test for linearity, we have to generate polynomials of the variable aage and run the regression again including the newly created variables:

gen aage2 = aage * aage
gen aage3 = aage * aage * aage
gen aage4 = aage * aage * aage * aage
gen aage5 = aage * aage * aage * aage * aage

An alternative way to code this is:

forvalues i = 2(1)5 {
    gen aage`i' = aage^`i'
    label variable aage`i' "Age^`i'"
}

logit chintent aage aage2 i.asex ankids i.unionstatus i.employ i.education i.prevdissol siblings
logit chintent aage aage2 aage3 i.asex ankids i.unionstatus i.employ i.education i.prevdissol siblings
logit chintent aage aage2 aage3 aage4 i.asex ankids i.unionstatus i.employ i.education i.prevdissol siblings
logit chintent aage aage2 aage3 aage4 aage5 i.asex ankids i.unionstatus i.employ i.education i.prevdissol siblings

As you can see from the regression results, the relationship between age and the intentions of having a child within 3 years is a nonlinear one. The preferred model is (looking at the significance level of the aage variables in the different models):

logit chintent aage aage2 aage3 aage4 i.asex ankids i.unionstatus i.employ i.education i.prevdissol siblings

(or…
* Create a categorical variable for age. The last break is 51 rather than 50, because egen cut() sets values at or above the last break to missing and the oldest respondents are 50.
egen agecat = cut(aage), at(17,25,30,35,40,45,51)
label define agecatl 17 "17-24" 25 "25-29" 30 "30-34" 35 "35-39" 40 "40-44" 45 "45-50"
label values agecat agecatl
and run the regression including the variable i.agecat)

9. Create a plot (in Excel) that shows the functional relationship between age and the probability of fertility intentions to support your answer of question 5. For the calculation of these probabilities, you should keep the other variables in the model constant. For categorical variables, use the reference category. For ratio variables, use the mean value. Remember to calculate log(odds), odds and probabilities first where only the variable ‘age’ is allowed to change.
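The calculation behind such a plot can be sketched in Stata before moving to Excel. The loop below is only an illustration: it assumes the linear-age coefficients of the final model of question 6 (without mumalive) rather than the polynomial model, whose coefficients are not reported here, and it plugs in the sample means of ankids and siblings from the summary table.

```stata
* Sketch: predicted probability by age, all other variables held constant.
* Categorical variables sit at their reference category, so their dummies are 0.
* ankids and siblings are fixed at their sample means (1.235328 and 2.558543).
forvalues a = 20(5)50 {
    local xb = 2.942857 - .0910597*`a' - .8212769*1.235328 + .0556298*2.558543
    display "age " `a' "   log(odds) = " %7.3f `xb' "   odds = " %6.3f exp(`xb') "   p = " %5.3f invlogit(`xb')
}
```

Even with a logit that is linear in age, the resulting probabilities do not fall along a straight line, because invlogit() maps the log(odds) onto the 0-1 scale non-linearly.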
The plot shows that age has a nonlinear relationship with fertility intentions.

* Check mean nr of children and nr of siblings as input for calculating probabilities for the plots.
mean ankids siblings

(or… Calculate the log(odds), odds and probabilities per year or per age group (below done per age group). See the Excel file for the calculations and plot. As in the previous exercise, we assume the other variables to stay constant at the reference categories for the categorical variables, and at the mean values for the ratio variables. For this exercise, we are only interested in the effect of age. This plot shows that age has a non-linear effect on the probability of fertility intentions.)

Next, you want to explore whether the effect of having experienced the dissolution of a previous relationship (variable ‘prevdissol’) on the intentions to have a child within 3 yrs is the same for men as it is for women. Run the model with the same variables as in your final model of question 6 (so excluding mumalive and using the simple aage), but this time include the interaction between ‘asex’ and ‘prevdissol’ (add to the regression command: i.asex##i.prevdissol). Be consistent with the same definition of reference categories that you used previously.

10. Is the interaction coefficient significant? Describe the changes in the coefficients and significance of the main effects of the variables ‘sex’ and ‘prevdissol’ after adding the interaction effect.

* Logistic regression with interaction asex*prevdissol.
logit chintent aage i.asex ankids i.unionstatus i.employ i.education i.prevdissol siblings i.asex#i.prevdissol

The interaction is significant, with p=0.013. This means that the effect of having dissolved a previous relationship is significantly different for males than for females. Or, formulated differently, it means that the effect of sex is different depending on whether you have dissolved a previous relationship or not.
The coefficient of the main effect of ‘prevdissol’ is still positive, but smaller in size (0.270 instead of 0.489) and no longer significant at the 5% level (p=0.062). The coefficient of the main effect of ‘sex’ is negative in both models, and slightly larger in size and more significant in this interaction model (-0.467, z=-4.14, instead of -0.330, z=-3.38). You can now no longer interpret prevdissol without reference to sex, and vice versa.

Note: when you use the double ## in the interaction instead of the single #, then even if you do not specify the main effects in your Stata command, Stata will estimate them anyway. Thus, this code would give the same output:

logit chintent aage ankids i.unionstatus i.employ i.education siblings i.asex##i.prevdissol

But it’s best to specify main effects and interaction effects yourself, just to be sure.

Logistic regression                     Number of obs   = 3,459
                                        LR chi2(11)     = 683.16
                                        Prob > chi2     = 0.0000
Log likelihood = -1438.3269             Pseudo R2       = 0.1919

chintent               Coef.       Std. Err.   z        P>|z|   [95% Conf. Interval]
aage                   -.0914408   .0085496    -10.70   0.000   -.1081977   -.074684
asex
  male                 -.467348    .1127548    -4.14    0.000   -.6883434   -.2463526
ankids                 -.8206944   .0661659    -12.40   0.000   -.9503772   -.6910117
unionstatus
  single               -1.462417   .1435676    -10.19   0.000   -1.743804   -1.18103
  LAT                  -1.52185    .1803116    -8.44    0.000   -1.875254   -1.168446
  Married              .4590363    .1379121    3.33     0.001   .1887336    .7293391
employ
  Other                -.6498551   .1238386    -5.25    0.000   -.8925743   -.407136
education
  Secondary or less    -.305776    .0999765    -3.06    0.002   -.5017263   -.1098257
prevdissol
  Yes                  .2704672    .1450571    1.86     0.062   -.0138396   .5547739
siblings               .054704     .0258685    2.11     0.034   .0040027    .1054052
asex#prevdissol
  male#Yes             .5412407    .2178048    2.48     0.013   .1143511    .9681303
_cons                  3.015382    .2783424    10.83    0.000   2.469841    3.560923

11.
Analyze your interaction findings (other than looking at the significance of the values) in the following way:
i) by making a table with predicted probabilities
ii) by making a plot of this table (y-axis: probability; x-axis: prevdissol; separate lines for men and women)
For the calculation of these probabilities, you should keep the other variables in the model constant. For categorical variables, use the reference category. For ratio variables, use the mean value. See the Excel file on Nestor for an example of how to do this, using different data.

* Check mean age, mean nr of children and mean nr of siblings.
mean aage ankids siblings

Mean estimation                         Number of obs   = 3,459

            Mean       Std. Err.   [95% Conf. Interval]
aage        34.93033   .1303037    34.67485   35.18581
ankids      1.235328   .021272     1.193621   1.277035
siblings    2.558543   .0337776    2.492317   2.624769

Interaction between sex and prevdissol: predicted probability

                 prevdissol
sex              No previous relationship dissolved    A previous relationship dissolved
1, male          0.179                                 0.330
2, female        0.259                                 0.314

In the Excel file you can see the calculations involved, leading to the table and plot of predicted probabilities. For both males and females, the probability that one intends to have a child within the next three years is higher when one has already experienced the dissolution of a previous relationship. However, the interaction tells us that a union dissolution has a larger positive effect on the probability of fertility intentions for males than for females. For females, union dissolution seems a less important factor in fertility intentions.

12. Interpret the coefficient of the interaction variable (sign, significance, value). What is the effect on the log(odds) if you compare men and women whose previous relationship has and has not been dissolved? What is the effect on the odds if you compare men and women whose previous relationship has and has not been dissolved?
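For the comparisons asked about in question 12, the relevant contrasts can be worked out directly from three coefficients of the interaction model. The following sketch assumes the coefficients from the output of question 10; the scalar names are hypothetical and chosen only for readability.

```stata
* Sketch: contrasts implied by the interaction model of question 10.
* Coefficients copied from the logit output; scalar names are illustrative only.
* b_male = asex male, b_dissol = prevdissol yes, b_inter = male#yes
scalar b_male   = -.467348
scalar b_dissol = .2704672
scalar b_inter  = .5412407

* Effect of being male on the log(odds), by prevdissol
display "male vs female, no dissolution : " b_male
display "male vs female, dissolution    : " b_male + b_inter

* Effect of a dissolution on the log(odds), by sex
display "dissolution vs none, women     : " b_dissol
display "dissolution vs none, men       : " b_dissol + b_inter

* The same contrasts on the odds scale (odds ratios)
display "male vs female (no, yes)       : " exp(b_male) "  " exp(b_male + b_inter)
display "dissolution (women, men)       : " exp(b_dissol) "  " exp(b_dissol + b_inter)

* With the interaction model as the last estimation, lincom adds standard errors:
* lincom 1.asex + 1.asex#1.prevdissol
```

The sums on the log(odds) scale become products on the odds scale, which is why the last two lines exponentiate the summed coefficients rather than adding odds ratios.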
The interaction coefficient is positive (0.5412 on the log(odds)) and statistically significant (p<0.05): the effect of having experienced the dissolution of a previous union, as opposed to not having experienced one, is more positive for men than for women.

Either use the first derivative to interpret the coefficient of the interaction variable or use your table with predicted probabilities to check your interpretation.

Fertility intention (on the log(odds) scale) = 3.0154 - 0.0914*age - 0.4673*male - 0.8207*nr of kids - 1.4624*single - 1.5219*LAT + 0.4590*married - 0.6499*other employment status - 0.3058*secondary education or less + 0.2705*relationship dissolved + 0.0547*nr siblings + 0.5412*male*relationship dissolved

You take the derivative with respect to sex. The change in fertility intentions will depend on whether or not one has experienced the dissolution of a previous relationship:

∂ Fertility intention / ∂ Male = -0.4673 + 0.5412 * relationship dissolved
(coefficients used: male and male#yes)

You take the derivative with respect to prevdissol. The change in fertility intentions will depend on whether one is male or female:

∂ Fertility intention / ∂ Relationship dissolved = 0.2705 + 0.5412 * male
(coefficients used: prevdissol and male#yes)

Example: the effect of being male rather than female on fertility intentions is negative when no previous relationship has been dissolved (-0.4673 on the log odds), but positive when a previous relationship has been dissolved (-0.4673 + 0.5412 = 0.0739 on the log odds). This is confirmed by the predicted probabilities of the previous exercise. This is the same as saying that the odds of intending to have a child are (e^(-0.4673+0.5412) = e^(-0.4673) * e^(0.5412) =) 1.077 times higher for males who have experienced the dissolution of a previous union than for females who have also experienced one. Notice how the effect on log(odds) is additive and on odds it is multiplicative.

13. Interpret the coefficients of the main effects of ‘sex’ and ‘prevdissol’ in the model with the interaction sex*prevdissol.

When ‘sex’ and ‘prevdissol’ are involved in an interaction, the coefficient for ‘sex’ alone is the effect of sex when the respondent has not experienced the dissolution of a previous relationship (ref).
In other words, it is the effect of sex when the value for the interacting variable is set at the reference category: prevdissol = no. So the effect of sex is conditional on the value of prevdissol, and the other way around, and can therefore not really be read as a main effect anymore.

Appendix 1 Getting Started with STATA

GETTING STARTED
Open syntax editor:
Browse data:

WEBSOURCES
help [stata command]
http://www.princeton.edu/~otorres/Stata/statnotes

Appendix 2. STATA commands

insheet using H:/stata/data.csv, delimit(";")
Imports a data file in *.csv format into STATA. The data is delimited with ;.

saveold data.dta, replace
Save the data in an older *.dta-format so any STATA version can open the *.dta file. You need to use replace or it will not save your file if data.dta already exists.

sum year
The command shows a table including the mean, standard deviation, minimum and maximum.

tab year, gen(year)
The command shows a table of frequencies for each year. In addition, it generates dummy variables for each year.

gen lnx1=ln(x1)
The command generates a new variable, called lnx1, which is the natural logarithm of x1.

gen x1x2=x1*x2
The command generates a new variable, called x1x2, which is an interaction variable of x1 multiplied by x2.

tab year, gen(y)
The command shows a one-way table of frequencies of the variable year. It also generates a dummy variable for each unique value of the variable year. In this case, it will create year dummies (a 0/1 variable for each year) named y1, y2, et cetera.

sort x1
The command sorts the data based on the values contained in variable x1.

cor x1 x2
The command generates the correlation matrix between variables x1 and x2.

reg y x1 x2, r
Regression model, where y is the dependent variable, x1 and x2 are independent variables. The constant will also be estimated by default. Robust standard errors will be estimated.

xi: reg y x1 x2 i.x3
The command xi makes it possible to turn the categorical variable x3 into dummy variables in a simple regression model.

outreg2 using H:/stata/reg1.xml, replace excel e(all)
Use this command after a regression to save the coefficients + extras in the specified folder.

estat hettest
Use this command after a regression to test for heteroscedasticity in the error term.

gen beta1=_b[x1]
This command generates a new variable containing the coefficient of x1 from the last regression.

graph twoway line x1 x2
graph save H:/stata/graph1.gph, replace
These commands provide you with a line graph and save it in the specified folder.

help sum
help tab
help cor
help reg
The help command is your friend in most cases. Or… look online for help: See, for example, http://www.ats.ucla.edu/stat/stata/modules/
