Linear Regression PDF
Summary
This document provides an overview of linear regression. It covers correlation, the purpose of linear regression, Ordinary Least Squares (OLS) estimators, and the classical OLS assumptions. It also covers measures of model fit, the interpretation of parameters, statistical validation of the model, and the use of dummy variables.
Full Transcript
Linear regression

Outline
1. Correlation
2. Purpose of linear regression
3. Ordinary least squares (OLS) estimators
4. Classical OLS assumptions

Correlation coefficient
The correlation coefficient (Pearson's correlation coefficient) in the empirical distribution of two variables x and y - being an estimator of the correlation coefficient ρ in the population - is defined as

r_xy = cov(x, y) / (s_x · s_y)

where cov(x, y) = (1/n) Σ (x_i − x̄)(y_i − ȳ) is called the covariance and s_x, s_y are the standard deviations of x and y.

Properties of the correlation coefficient
- The correlation coefficient falls in the range [−1, 1].
- The correlation coefficient takes the value zero when the variables are not linearly related.
- The absolute value of the correlation coefficient equals one when there is an exact linear functional relationship between the two variables.
- The correlation coefficient also describes the direction of the relationship (positive - low scores of one variable are associated with low scores of the other variable; negative - low scores of one variable are associated with high scores of the other variable).

Linear correlation coefficient and a non-linear relation
When the relationship between the variables is non-linear, the correlation coefficient will suggest that the variables are not related (figure omitted).

Correlation coefficient and outliers
In the illustrated case, the value of the correlation coefficient is significantly overestimated because of a single observation for which the values of both characteristics are abnormally high (figure omitted).

Purpose of linear regression
1. We have some clues about causal effects of variables, but we require a numerical answer, obtained by estimating a relationship between variables.
2. We want to forecast or predict the value of one variable, Y, based on the values of other variables X1, X2, ....

Pros/Cons
Pros
- By far the most common approach for modeling numeric data
- Can be adapted to almost any modeling task
- Provides estimates of both the strength and size of the relationships among features and the outcome
Cons
- Makes strong assumptions about the data distribution
- The model's form must be specified by the user in advance
- Does not handle missing data
- Only works with numeric features, so categorical data requires extra processing

Classical linear regression model - one independent variable
Writing the model as

y_i = α1 + α2 · x_i + ε_i

where:
- y - dependent variable (endogenous variable, explained variable),
- x - independent variable (exogenous variable, explanatory variable),
- ε_i - error term (disturbance, noise),
- i - number of the observation.

Assumptions:
- The independent variable is nonstochastic (its values are considered fixed in repeated samples) and uncorrelated with the error term.
- Disturbances represented by the error terms tend to cancel each other out (the mean, or expected value, of the error term is zero).
- The error term is spherical (it is not autocorrelated and its variance is constant).
- The error term is normally distributed.

Least squares method - one independent variable
The unknown parameters of the linear regression model, α1 and α2, are estimated by the least squares (LS) method. The resulting estimates α̂1 and α̂2 are used to compute the predicted (estimated, theoretical) values of the dependent variable ŷ_i and the residuals ε̂_i = y_i − ŷ_i (the difference between the actual and estimated values of the dependent variable). The unknown variance of the error term σ² is estimated by the variance of the residuals σ̂². The logic of the least squares method is to minimise the sum of squared residuals.

Parameters' estimators (with α1 the intercept and α2 the slope):

α̂2 = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²,    α̂1 = ȳ − α̂2 · x̄

The variance of the parameters' estimators:

Var(α̂2) = σ² / Σ (x_i − x̄)²,    Var(α̂1) = σ² · (1/n + x̄² / Σ (x_i − x̄)²)
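The following minimal sketch (not part of the original slides) computes the Pearson correlation and the simple-regression OLS formulas above with NumPy; the data and variable names are invented for the example.

```python
# Illustrative sketch: Pearson correlation and simple-regression OLS by hand.
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=10, scale=2, size=100)
y = 3.0 + 1.5 * x + rng.normal(scale=1.0, size=100)   # true line: alpha1 = 3, alpha2 = 1.5

# Pearson correlation: covariance divided by the product of standard deviations
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_xy = cov_xy / (x.std() * y.std())

# OLS estimates for y_i = alpha1 + alpha2 * x_i + eps_i
alpha2_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha1_hat = y.mean() - alpha2_hat * x.mean()

# Residual variance (unbiased version, dividing by n - 2) and the slope estimator's variance
resid = y - (alpha1_hat + alpha2_hat * x)
sigma2_hat = np.sum(resid ** 2) / (len(x) - 2)
var_alpha2_hat = sigma2_hat / np.sum((x - x.mean()) ** 2)

print(f"r = {r_xy:.3f}, alpha1_hat = {alpha1_hat:.3f}, alpha2_hat = {alpha2_hat:.3f}")
print(f"Var(alpha2_hat) = {var_alpha2_hat:.5f}")
```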
Decomposition of the dependent variable variance

Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ ε̂_i²

- Sum of squared deviations of the actual values from the mean (total variation)
- Sum of squared deviations of the estimated values from the mean (explained variation)
- Sum of squared residuals (unexplained variation)

Coefficient of determination
The coefficient of determination

R² = explained variation / total variation = 1 − unexplained variation / total variation

measures the proportion or percentage of the total variation in the dependent variable explained by the regression model (in the model with one independent variable it is equal to the square of the correlation coefficient between regressand and regressor).

Corrected (adjusted) coefficient of determination

R̄² = 1 − (1 − R²) · (n − 1) / (n − k − 1)

where n is the number of observations and k is the number of independent variables (adding a new independent variable always increases the standard coefficient of determination R²).

Interpreting R² and corrected R̄²
High values of R² mean that the regressors are good at predicting the values of the dependent variable in the sample, but there are a few common pitfalls:
1. An increase in R² or R̄² does not necessarily mean that an added variable is statistically significant.
2. A high R² or R̄² does not mean that the regressors are a true cause of the dependent variable.
3. A high R² or R̄² does not mean that there is no omitted variable bias.
4. A high R² or R̄² does not mean that you have the most appropriate set of regressors.

Information criteria
Another approach to measuring model error, used to compare different specifications of models estimated with the maximum likelihood method:
- AIC - Akaike's Information Criterion: AIC = 2k − 2 ln L
- AICc - Akaike's corrected Information Criterion: AICc = AIC + 2k(k + 1) / (n − k − 1)
- BIC - Bayesian Information Criterion: BIC = k ln n − 2 ln L
where L is the likelihood of the model and k is the total number of parameters and initial states that have been estimated. Smaller values indicate a better specification.

Classical linear regression model - multiple independent variables

y_i = β0 + β1 x_1i + β2 x_2i + ... + βk x_ki + ε_i

or in matrix notation

y = Xβ + ε

where y is the vector of observations of the dependent variable, X is the matrix of observations of the independent variables (including a column of ones for the intercept), β is the vector of parameters and ε is the vector of error terms.

Assumptions:
- The independent variables are nonstochastic (their values are considered fixed in repeated samples) and uncorrelated with the error term.
- There is no exact collinearity between the independent variables (none of the independent variables is a linear combination of the others).
- Disturbances represented by the error terms tend to cancel each other out (the mean, or expected value, of the error term is zero).
- The error term is spherical (it is not autocorrelated and its variance is constant).
- The error term is normally distributed.

Least squares method - multiple independent variables
The least squares estimator of the parameter vector is

β̂ = (X'X)⁻¹ X'y
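To make the matrix formula and the fit measures concrete, here is a small sketch (not from the original slides) with simulated data. The AIC is computed under a Gaussian-likelihood convention, which is an assumption of this example rather than something stated in the slides.

```python
# Illustrative sketch: multiple regression via beta_hat = (X'X)^(-1) X'y,
# plus R^2, adjusted R^2 and AIC under a Gaussian likelihood.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)   # true data-generating process

X = np.column_stack([np.ones(n), x1, x2])      # design matrix with an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^(-1) X'y without an explicit inverse

resid = y - X @ beta_hat
rss = np.sum(resid ** 2)                       # unexplained variation
tss = np.sum((y - y.mean()) ** 2)              # total variation

k = X.shape[1] - 1                             # number of independent variables (no intercept)
r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Gaussian log-likelihood at the ML variance estimate rss/n, then AIC = 2p - 2 ln L,
# where p counts all estimated parameters (coefficients plus the error variance).
loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
p = X.shape[1] + 1
aic = 2 * p - 2 * loglik

print("beta_hat:", np.round(beta_hat, 3))
print("R2:", round(r2, 3), "adjusted R2:", round(adj_r2, 3), "AIC:", round(aic, 1))
```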
Types of data for regression models
- Cross-section data - data on the state of different objects at the same point in time.
- Time series data - a set of observations on the values that a variable takes at different times, collected at regular time intervals, e.g. daily, monthly, quarterly, yearly. In formulas the subscript t is very often used instead of i.
- Pooled data - a combination of time series and cross-section data, i.e. data on the state of different objects at different, regular time intervals.
- Dummy (binary) variable - used for qualitative variables, structural changes, outliers and seasonality.

Interpretation of parameters
How do we interpret parameters in regression? Assuming that we want to estimate the parameters in

Y = β0 + β1 X1 + β2 X2 + ... + βk Xk + ε

the coefficient β1 measures the expected change in Y if X1 changes by one unit while all other variables X2, ..., Xk do not change. This is the so-called ceteris paribus condition; in multiple regression, parameters can only be interpreted under the ceteris paribus condition.

Interpretation of parameters and transformations
Depending on the transformation, we interpret parameters in a different way. Logarithms can be used to transform the dependent variable Y, an independent variable X, or both. There are three cases of logarithms in regression that affect the interpretation:
- Y regressed on log(X): a 1% change in X is associated with a change in Y of 0.01β1.
- log(Y) regressed on X: a change in X by one unit (∆X = 1) is associated with a 100β1% change in Y.
- log(Y) regressed on log(X): a 1% change in X is associated with a β1% change in Y; β1 is the elasticity of Y with respect to X.

Statistical validation of the model
Investigating the error terms:
- whether there is no autocorrelation (Durbin-Watson test; not analysed in the case of cross-section data)
- whether they have constant variance (White's test)
- whether they are normally distributed (Jarque-Bera test)
If the error terms do not have adequate properties, it may mean that the model is missing important explanatory variables or has an incorrect functional form. Validation also covers testing the significance of the regression coefficients/independent variables (Student's t-test, F-test) and checking for multicollinearity.

Testing the significance of a single coefficient - Student's t-test
Hypotheses: H0: βj = 0 against H1: βj ≠ 0. The test statistic has Student's t-distribution with (N − K) degrees of freedom, where N denotes the number of observations and K is the number of independent variables (including the intercept).

Testing the overall significance of a regression - F-test
Hypotheses: H0: all slope coefficients are jointly equal to zero against H1: at least one of them is non-zero. The test statistic follows Snedecor's F-distribution with (K − 1) and (N − K) degrees of freedom.

Binary variables
Binary variables are used as a tool to incorporate qualitative information into regression models. In this section we focus only on qualitative independent variables. How can we incorporate binary information about sex, or about whether a person does or does not own a PC or a house? This type of information can be captured by defining a binary 0-1 variable, often called a dummy variable. We need to recode the information into a 0-1 variable, assigning the value 0 to one event and 1 to the other. For example, for a sex variable with the values female and male, we might define a variable female taking on the value 1 for females and the value 0 for males.

Single independent binary variable
Consider the following simple model of hourly wage determination:

wage = β0 + δ0 · female + β1 · educ + ε

where δ0 is the parameter on the dummy variable female (1 - female, 0 - male). If the coefficient δ0 < 0, then we can say that there is discrimination: for the same level of the other variables, women earn less than men on average. In terms of the regression interpretation, the variable female plays the role of an intercept shift between males and females (graphical interpretation omitted: the two groups share the slope on education but have different intercepts).

Dummy variable interpretation
In the model of hourly wage determination the interpretation of the parameter δ0 is straightforward. Assuming that δ̂0 = −1.5, we can say that, given the same level of education, a woman earns on average 1.5 GBP less per hour than a man. The interpretation is trickier when the dependent variable is in logarithmic transformation: when log(wage) is the dependent variable, the coefficient on a dummy variable is interpreted as a percentage difference calculated with the formula

100 · (exp(δ̂0) − 1) %
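The sketch below (not from the original slides) fits the wage model with a female dummy on simulated data, once in levels and once in logs, and reads the dummy coefficient both ways. It uses statsmodels as one convenient OLS implementation; the data and the specific numbers are invented here.

```python
# Illustrative sketch: dummy-variable interpretation in a level and a log wage model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
educ = rng.integers(8, 21, size=n).astype(float)
female = rng.integers(0, 2, size=n).astype(float)
wage = 2.0 + 0.8 * educ - 1.5 * female + rng.normal(scale=2.0, size=n)
wage = np.clip(wage, 1.0, None)                # keep wages positive so log() is defined

X = sm.add_constant(np.column_stack([female, educ]))   # columns: const, female, educ

# Level model: the female coefficient is the average wage gap in currency units,
# holding education fixed (ceteris paribus).
level = sm.OLS(wage, X).fit()
print(f"level model: average wage difference for women = {level.params[1]:.2f} per hour")

# t-statistics for individual coefficients and the overall F-statistic,
# matching the Student's t-test and F-test described earlier.
print("t-statistics:", np.round(level.tvalues, 2), "F-statistic:", round(level.fvalue, 1))

# Log model: the dummy coefficient is converted to an exact percentage difference
# with 100 * (exp(coef) - 1), as in the formula above.
logged = sm.OLS(np.log(wage), X).fit()
pct = 100 * (np.exp(logged.params[1]) - 1)
print(f"log model: percentage wage difference for women = {pct:.1f}%, ceteris paribus")
```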
Dummy variables for multiple categories
We can use several dummy independent variables in the same equation for different variables. We can also use dummy variables for a variable with multiple categories. For example, a credit rating variable has several possible values, such as AAA, AA and A. In this situation we have to create a dummy variable for each category of credit rating:
- dAAA = 1 if the credit rating is AAA and 0 otherwise
- dAA = 1 if the credit rating is AA and 0 otherwise
- dA = 1 if the credit rating is A and 0 otherwise
However, we can include only 2 of these dummy variables in a regression with an intercept: the omitted category serves as the reference, and including all three would create exact collinearity (the dummy variable trap).

Using dummy variables for different slopes
We can also use dummy variables to allow for differences in slope. Consider the following model:

wage = β0 + δ0 · female + β1 · educ + δ1 · (female · educ) + ε

For males the intercept is β0 and the slope on education is β1. For females the intercept is β0 + δ0 and the slope on education is β1 + δ1.
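As a closing illustration (again not part of the original slides), the sketch below builds dummies for a multi-category rating variable, drops one category as the reference to avoid the dummy variable trap, and adds a female × educ interaction so the slope on education can differ between groups. All data and column names are invented for the example.

```python
# Illustrative sketch: multi-category dummies and an interaction term for different slopes.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "educ": rng.integers(8, 21, size=n).astype(float),
    "female": rng.integers(0, 2, size=n).astype(float),
    "rating": rng.choice(["AAA", "AA", "A"], size=n),
})
df["wage"] = (2.0 + 0.9 * df["educ"] - 1.0 * df["female"]
              - 0.2 * df["female"] * df["educ"]
              + rng.normal(scale=2.0, size=n))

# Only two of the three rating dummies enter the regression; "A" is the reference category.
rating_dummies = pd.get_dummies(df["rating"], prefix="d")[["d_AAA", "d_AA"]].astype(float)

# Interaction term: the slope on education differs for females by delta1.
df["female_educ"] = df["female"] * df["educ"]

X = sm.add_constant(pd.concat([df[["female", "educ", "female_educ"]], rating_dummies], axis=1))
res = sm.OLS(df["wage"], X).fit()
print(res.params.round(3))
# For males the slope on educ is beta1 (params["educ"]);
# for females it is beta1 + delta1 (params["educ"] + params["female_educ"]).
```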