46 Questions
What is a scenario where the removal of univariate outliers may not be sufficient?
When dealing with multivariate outliers
What is a characteristic of multivariate outliers?
They need not be outliers on any of the variables individually
What is a limitation of using scatterplots to identify multivariate outliers?
Scatterplots are useless when dealing with more than two variables
Who is credited with developing a technique for identifying multivariate outliers?
Prasanta Chandra Mahalanobis
What is the purpose of Mahalanobis' Distance?
To identify multivariate outliers
What is the distribution of Mahalanobis' Distance?
Chi-square distribution
How do we determine the critical value for identifying multivariate outliers using Mahalanobis' Distance?
Using a chi-square table
What is the typical procedure in SPSS for identifying multivariate outliers using Mahalanobis' Distance?
It gives us the value of Mahalanobis' Distance for the 10 most extreme points in the dataset
What is the critical chi-square value for p = .001 and df = 2?
13.816
What is the primary difference between univariate and multivariate outliers?
Univariate outliers affect only one variable, while multivariate outliers affect multiple variables.
What is the purpose of calculating Mahalanobis' Distance?
To detect multivariate outliers
What happens when multivariate outliers are removed from the dataset?
We re-examine the results to see if they have changed
What should you do when univariate outliers are removed and the results do not change?
Return them to the dataset
What is the role of the ID variable in the regression syntax?
It is used to identify each participant in the dataset
What should you do when testing for univariate outliers?
Ask yourself if the nature of the results has changed
Why do we always remove multivariate outliers?
Because they are particularly problematic
What does the first added line to the regression syntax do to the residual scores?
It standardizes them like deriving a Z score
What should the histogram of residual scores look like to meet the assumption of normality?
A bell-shaped curve centered around zero
What does the scatterplot of residual scores against predicted values help to check?
The assumption of homoscedasticity
What is a common indication of nonnormality in the scatterplot of residual scores against predicted values?
A fan-shaped curve
What is the purpose of standardizing residual scores in regression analysis?
To make the residuals more interpretable and comparable
What can be used to detect multivariate outliers in regression analysis?
Mahalanobis' distance
What is the assumption that the regression equation should be equally good at predicting values for low and high levels of the predictors?
Homoscedasticity
What can be a consequence of violating the assumption of homoscedasticity in regression analysis?
Decreased model accuracy
What is the primary reason for not deleting outliers from the dataset?
To maintain the integrity of the original data
What is the purpose of creating a filter variable when dealing with outliers?
To de-select the outliers from the analysis
What is the assumption that is being tested in Step 2 of the data checking process?
Normality
What is the next step in the data checking process after testing for univariate outliers?
Testing for multivariate outliers
What is the purpose of testing for multivariate outliers?
To identify cases that are unusual on multiple variables
What should you do if removing univariate outliers does not change the results of the analysis?
Return the excluded cases to the analysis
What is the purpose of creating a new variable called varFilter?
To filter out cases that are not selected
What is the primary purpose of testing for univariate outliers?
To check the normality assumption
What is the consequence of not checking for multivariate outliers?
The analysis may be biased towards the outliers
What is indicated by a fan-shaped scatterplot of residual scores against predicted values?
Heteroscedasticity
What statistical measure can be used to detect multivariate outliers?
Mahalanobis' Distance
What is the purpose of standardizing residual scores in regression analysis?
To check the normality assumption
What is a consequence of violating the assumption of homoscedasticity?
All of the above
What is the role of the ID variable in regression syntax?
To label cases in the dataset
What type of plot is used to check the normality assumption of residual scores?
Histogram
What is the purpose of plotting residual scores against predicted values?
To check for homoscedasticity
What is a potential consequence of not removing multivariate outliers from a dataset?
Violation of the homoscedasticity assumption
Which of the following is a limitation of using Mahalanobis' Distance to identify multivariate outliers?
It is sensitive to the scale of the predictor variables
What is the purpose of calculating the chi-square value when using Mahalanobis' Distance to identify multivariate outliers?
To establish a cut-off value for identifying outliers
Why is it problematic to only consider univariate outliers in a multivariate analysis?
Univariate outliers may not be extreme in the multivariate space
What is a common indication of nonnormality in the residual scores of a regression analysis?
A skewed distribution with outliers
What is the purpose of standardizing residual scores in regression analysis?
To compare the magnitude of residual scores across different models
Study Notes
Multivariate Outliers
- Multivariate outliers are cases that are outliers when considered against all the variables simultaneously.
- They need not (but could) be outliers on any of the variables individually.
- An example is an 18-year-old earning $100,000 a year, where neither age nor salary is an outlier, but the combination is likely a multivariate outlier.
Identifying Multivariate Outliers
- Mahalanobis' Distance is a statistical technique used to identify multivariate outliers.
- It measures how far a case is from the centroid of the multidimensional normal distribution created from all the predictors.
- Mahalanobis' Distance is distributed as chi-square, and we look for cases with a chi-square corresponding to a p-value less than 0.001.
Application in SPSS
- SPSS can give the value of Mahalanobis' Distance for the 10 most extreme points in the dataset.
- We compare these 10 most extreme points with a critical chi-squared value, which depends on the number of predictors.
- If any case in our sample has a value of Mahalanobis' Distance greater than the critical chi-squared value, it is considered a multivariate outlier.
Removing Multivariate Outliers
- When testing for multivariate outliers, remove them from the dataset and rerun the regression.
- The nature of the results may change due to removing the outliers.
- However, multivariate outliers are always removed, as they are particularly problematic.
Regression Analysis Assumptions
- The first assumption is normality, where residuals should be normally distributed and centred around zero.
- The second assumption is homoscedasticity, where residuals should be distributed similarly across values of the predictors.
Evaluating Regression Assumptions
- The histogram of standardized residual scores should be normally distributed and centred around zero.
- The scatterplot of residual scores against predicted values should resemble a rectangle, indicating homoscedasticity.
- Deviations from these patterns suggest problems with the assumptions.
Outliers
- Multivariate outliers are outliers when considered against all variables simultaneously
- Example: an 18-year-old earning $100,000/year is likely to be a multivariate outlier
- Multivariate outliers are particularly problematic and always need to be removed
Identifying Multivariate Outliers
- We use Mahalanobis' Distance to identify multivariate outliers
- Provides the 10 most extreme points in the dataset (not necessarily outliers)
- Compare these points using chi-square cut-off value (at p < .001 level)
- Cut-off value differs based on the number of predictors
Testing Multivariate Outliers
- Use Mahalanobis' Distance to identify multivariate outliers
- Provides the 10 most extreme points in the dataset
- Compare to chi-square (.001) cut-off value
- Any values exceeding this cut-off are considered multivariate outliers
- Filter out any such cases
Regression and Data Analysis
- Use regression analysis with course, age, and mathsBC as variables
- Use residuals to test for outliers and check for errors
Data Entry and Assumptions
- Check for data entry errors
- Check normality assumption
- Check homoscedasticity assumption
- Testing for univariate outliers
Consequences of Violating Assumptions
- Check for data entry errors
- Check normality assumption
- Check homoscedasticity assumption
- Testing for univariate outliers
Regression Diagnostics
- What happens when assumptions aren't met?
- Test for univariate outliers
- Test for multivariate outliers
- Transform data
Testing for Univariate Outliers
- Examine descriptives (skewness, kurtosis)
- Examine histograms and boxplots
This quiz covers the detection of multivariate outliers in regression analysis, including cases that are outliers when considered against all variables simultaneously. Learn how to identify and deal with these outliers in statistical analysis.
Make Your Own Quizzes and Flashcards
Convert your notes into interactive study material.
Get started for free