Statistics 136 Chapter 8: Dummy Variables

Study Notes

A dummy variable is a numeric indicator that takes the value 1 when a particular characteristic (a level of a qualitative variable) is present and 0 when it is absent.
For models with more than one qualitative independent variable, define dummy variables for each qualitative variable (k - 1 dummies for a variable with k levels when the model has an intercept) and include them in the model.
Models containing only qualitative independent variables are called analysis of variance (ANOVA) models.
Models containing both qualitative and quantitative independent variables, where the qualitative variables are of primary interest, are called analysis of covariance (ANCOVA) models.
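
As an illustration of dummy coding, the sketch below builds k - 1 dummy variables for a three-level qualitative variable. The data, the level names, and the choice of "A" as baseline are hypothetical. It also uses the fact that, in a model with only these dummies, the least squares intercept is the baseline mean and each dummy coefficient is that level's mean minus the baseline mean.

```python
from statistics import mean

# Hypothetical data: a response y and a qualitative variable with three
# levels, so k - 1 = 2 dummy variables are needed (baseline level: "A").
groups = ["A", "A", "B", "B", "C", "C"]
y = [10.0, 12.0, 15.0, 17.0, 20.0, 24.0]

levels = sorted(set(groups))              # ["A", "B", "C"]
baseline, others = levels[0], levels[1:]

# One 0/1 column per non-baseline level.
dummies = [[1 if g == lvl else 0 for lvl in others] for g in groups]

# ANOVA-type model with only dummies: intercept = baseline mean,
# dummy coefficient = level mean minus baseline mean.
group_mean = {lvl: mean(v for g, v in zip(groups, y) if g == lvl)
              for lvl in levels}
intercept = group_mean[baseline]
effects = {lvl: group_mean[lvl] - intercept for lvl in others}

print(dummies)
print(intercept, effects)
```

Dropping one level avoids perfect collinearity with the intercept; the dropped level becomes the reference against which the other levels are compared.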

A significant interaction effect means that the effect of one variable on the response variable depends on the level of another variable; this interpretation holds even when the main dummy variables are not significant on their own.
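
A small sketch may help make the interaction idea concrete. It assumes a hypothetical model y = b0 + b1*x + b2*d + b3*(x*d) with one quantitative variable x and one dummy d, and uses the fact that fitting this full interaction model gives the same estimates as fitting a separate line within each group. The data and the helper name `fit_line` are illustrative only.

```python
def fit_line(xs, ys):
    """Simple least squares fit; returns (intercept, slope)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical data: x is quantitative, d is a 0/1 dummy.
x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
d = [0, 0, 0, 0, 1, 1, 1, 1]
y = [2.1, 3.9, 6.2, 7.8, 3.0, 8.1, 12.9, 18.0]

def group_fit(level):
    xs = [xi for xi, di in zip(x, d) if di == level]
    ys = [yi for yi, di in zip(y, d) if di == level]
    return fit_line(xs, ys)

int0, slope0 = group_fit(0)
int1, slope1 = group_fit(1)

b0 = int0             # intercept when d = 0
b1 = slope0           # slope of x when d = 0
b2 = int1 - int0      # dummy coefficient: shift in intercept
b3 = slope1 - slope0  # interaction coefficient: change in slope

print(b0, b1, b2, b3)
```

A nonzero b3 is what "the effect of x depends on the level of d" means: the slope of x is b1 in one group and b1 + b3 in the other.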

The Lack of Fit Test is used to determine if a regression model is adequate for a given dataset.
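The test splits the error sum of squares into the lack-of-fit sum of squares, SSLF = SSE - SSPE, and the sum of squared pure errors, SSPE (squared deviations of replicated observations from their group means).
The test statistic is F1 = [SSLF / (c - p)] / [SSPE / (n - c)], where n is the number of observations, c is the number of distinct levels of X, and p is the number of regression parameters.
The critical region for the test is: reject the null hypothesis that the model is adequate if F1 > F(α, c-p, n-c).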
Limitations of the test include the need for replication in X and the assumption of normality and homoskedasticity.
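
The computation can be sketched as follows for a simple linear regression with replicated X values. The data are hypothetical, and the degrees of freedom follow the critical region quoted above (c - p for lack of fit, n - c for pure error).

```python
from collections import defaultdict

# Hypothetical data with two replicates at each of four X levels.
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [2.0, 2.4, 4.1, 3.9, 5.8, 6.2, 8.3, 7.7]
n, p = len(x), 2                      # p = 2 parameters (intercept, slope)

# Fit the straight line by least squares and get SSE.
xbar, ybar = sum(x) / n, sum(y) / n
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Pure error: deviations of replicates from their own group mean.
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
c = len(groups)                       # number of distinct X levels
sspe = sum(sum((yi - sum(ys) / len(ys)) ** 2 for yi in ys)
           for ys in groups.values())

sslf = sse - sspe                     # lack-of-fit sum of squares
f1 = (sslf / (c - p)) / (sspe / (n - c))

print(sse, sspe, sslf, f1)
```

Here F1 is far below the tabulated F(0.05, 2, 4) of about 6.94, so for these made-up data the straight-line model would not be rejected as inadequate.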

Ramsey's Regression Specification Error Test (RESET) is used to test whether non-linear combinations of the explanatory variables help to explain the response variable.
The test involves regressing the dependent variable on the original variables together with powers of the fitted values (such as squares and cubes), then testing whether the added powers are jointly significant.
Weaknesses of the test include its high power, which can detect even trivial departures from the null hypothesis, and its sensitivity to identical residual values.
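
A minimal sketch of the RESET idea follows, under several assumptions: the data are hypothetical, only the squared fitted values are added (one common choice; cubes are often added too), and the comparison uses the usual F statistic for nested models. The helper `ols` is written here for self-containment and is not part of any library.

```python
def ols(columns, y):
    """Least squares of y on the given columns via the normal equations.
    Returns (fitted values, SSE)."""
    k, n = len(columns), len(y)
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination.
    a = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(u * v for u, v in zip(columns[i], y))] for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        a[i] = [v / a[i][i] for v in a[i]]
        for r in range(k):
            if r != i:
                a[r] = [vr - a[r][i] * vi for vr, vi in zip(a[r], a[i])]
    b = [row[k] for row in a]
    fitted = [sum(bj * col[t] for bj, col in zip(b, columns)) for t in range(n)]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return fitted, sse

# Hypothetical data with clear curvature in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 3.8, 9.5, 15.7, 25.4, 35.9, 49.3, 63.8]
n = len(y)
ones = [1.0] * n

# Restricted model: y on the original variable only.
fitted_r, sse_r = ols([ones, x], y)

# Unrestricted model: add the squared fitted values as a regressor.
yhat2 = [f ** 2 for f in fitted_r]
_, sse_u = ols([ones, x, yhat2], y)

q, k_u = 1, 3   # one added term; three parameters in the larger model
f_stat = ((sse_r - sse_u) / q) / (sse_u / (n - k_u))

print(sse_r, sse_u, f_stat)
```

A large F statistic relative to F(α, q, n - k_u) suggests that the linear specification is missing non-linear structure.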

The Shapiro-Wilk Test is a test for normality of a dataset.
The test statistic is W = (Σ ai x(i))^2 / Σ (xi - x̄)^2, where x(i) is the i-th order statistic, x̄ is the sample mean, and the ai are constants computed from the expected values and covariance matrix of the order statistics of an iid sample from the standard normal distribution.
The null hypothesis is that the data follow a normal distribution; small values of W are evidence against normality.

The Anderson-Darling Test is a test for normality of a dataset.
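The test is based on the idea that, under a hypothesized distribution, applying that distribution's cumulative distribution function to the data yields values that are uniformly distributed on (0, 1); the test measures how far the transformed data depart from uniformity, giving extra weight to the tails.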
Strengths of the test include its high power and ability to detect most departures from normality, even with small sample sizes.
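
The transformation idea can be sketched directly. The function below computes the Anderson-Darling statistic A^2 = -n - (1/n) Σ (2i - 1)[ln u(i) + ln(1 - u(n+1-i))], where u(i) is the fitted normal CDF evaluated at the i-th order statistic. Both samples are hypothetical, and estimating the mean and standard deviation from the data is an assumption of this sketch.

```python
from math import log
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(data):
    """A^2 statistic against a normal distribution with estimated parameters."""
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Under normality these transformed values behave like sorted uniforms.
    u = [fitted.cdf(v) for v in sorted(data)]
    s = sum((2 * i + 1) * (log(u[i]) + log(1 - u[n - 1 - i]))
            for i in range(n))
    return -n - s / n

# Hypothetical samples: one roughly symmetric, one strongly right-skewed.
symmetric = [-1.6, -1.1, -0.7, -0.4, -0.1, 0.1, 0.4, 0.7, 1.1, 1.6]
skewed = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.7, 1.0, 2.5, 9.0]

a2_symmetric = anderson_darling_normal(symmetric)
a2_skewed = anderson_darling_normal(skewed)

print(a2_symmetric, a2_skewed)
```

Larger values of A^2 indicate a poorer fit to the normal distribution; in practice the statistic is compared with tabulated critical values.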

Remedial measures for heteroskedasticity include correcting nonlinearity through variable transformation, dealing with outliers, and using Generalized Least Squares (GLS) or Weighted Least Squares (WLS).
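
As a rough sketch of WLS for a single predictor, the function below minimizes Σ w_i (y_i - b0 - b1 x_i)^2. The data are hypothetical, and the weights 1/x^2 rest on the assumption that the error standard deviation is proportional to x; in practice the weights come from whatever variance structure is believed to hold.

```python
def wls_line(x, y, w):
    """Weighted least squares for one predictor; returns (intercept, slope)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical data whose spread grows with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [5.1, 7.8, 11.6, 13.1, 18.9, 17.2]

# Assumed variance structure: Var(error) proportional to x^2.
weights = [1.0 / xi ** 2 for xi in x]

b0_wls, b1_wls = wls_line(x, y, weights)
b0_ols, b1_ols = wls_line(x, y, [1.0] * len(x))   # equal weights give OLS

print(b0_wls, b1_wls)
print(b0_ols, b1_ols)
```

Observations with smaller assumed variance receive more weight, so noisy high-x points pull the fitted line less than they do under ordinary least squares.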

High leverage points, outliers, and influential observations can be generated by various sources, including measurement problems, recording/encoding problems, contamination/mixture populations, and non-linearity.
These observations can have a significant impact on the model and its estimates, and must be detected and addressed accordingly.
Detection methods include visual plots and statistical tests, while remedial measures include transformations, robust regression, and deletion of the problematic observations.
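
Two common numerical diagnostics can be sketched for a simple linear regression: leverage, h_i = 1/n + (x_i - x̄)^2 / Sxx, and Cook's distance, D_i = [e_i^2 / (p s^2)] * h_i / (1 - h_i)^2. The data are hypothetical, and the 2p/n leverage cutoff is a common rule of thumb rather than something stated in these notes.

```python
# Hypothetical data: the last observation lies far from the others in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
y = [2.0, 4.1, 5.9, 8.2, 9.9, 20.0]
n, p = len(x), 2

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in residuals) / (n - p)     # residual mean square

# Leverage: diagonal of the hat matrix (closed form for one predictor).
leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Cook's distance: combines residual size and leverage.
cooks = [(e ** 2 / (p * s2)) * h / (1 - h) ** 2
         for e, h in zip(residuals, leverage)]

# Rule-of-thumb flag for high leverage (assumed cutoff: 2p/n).
high_leverage = [i for i, h in enumerate(leverage) if h > 2 * p / n]

print(leverage)
print(cooks)
print(high_leverage)
```

The leverages always sum to p, so a single value near 1 means that one observation largely determines its own fitted value.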
