quiz image

Identifying Multivariate Outliers in Regression Analysis

MesmerizedPeridot avatar
MesmerizedPeridot
·
·
Download

Start Quiz

Study Flashcards

46 Questions

What is a scenario where the removal of univariate outliers may not be sufficient?

When dealing with multivariate outliers

What is a characteristic of multivariate outliers?

They need not be outliers on any of the variables individually

What is a limitation of using scatterplots to identify multivariate outliers?

Scatterplots are useless when dealing with more than two variables

Who is credited with developing a technique for identifying multivariate outliers?

Prasanta Chandra Mahalanobis

What is the purpose of Mahalanobis' Distance?

To identify multivariate outliers

What is the distribution of Mahalanobis' Distance?

Chi-square distribution

How do we determine the critical value for identifying multivariate outliers using Mahalanobis' Distance?

Using a chi-square table

What is the typical procedure in SPSS for identifying multivariate outliers using Mahalanobis' Distance?

It gives us the value of Mahalanobis' Distance for the 10 most extreme points in the dataset

What is the critical chi-square value for p = .001 and df = 2?

13.816

What is the primary difference between univariate and multivariate outliers?

Univariate outliers affect only one variable, while multivariate outliers affect multiple variables.

What is the purpose of calculating Mahalanobis' Distance?

To detect multivariate outliers

What happens when multivariate outliers are removed from the dataset?

We re-examine the results to see if they have changed

What should you do when univariate outliers are removed and the results do not change?

Return them to the dataset

What is the role of the ID variable in the regression syntax?

It is used to identify each participant in the dataset

What should you do when testing for univariate outliers?

Ask yourself if the nature of the results has changed

Why do we always remove multivariate outliers?

Because they are particularly problematic

What does the first added line to the regression syntax do to the residual scores?

It standardizes them like deriving a Z score

What should the histogram of residual scores look like to meet the assumption of normality?

A bell-shaped curve centered around zero

What does the scatterplot of residual scores against predicted values help to check?

The assumption of homoscedasticity

What is a common indication of nonnormality in the scatterplot of residual scores against predicted values?

A fan-shaped curve

What is the purpose of standardizing residual scores in regression analysis?

To make the residuals more interpretable and comparable

What can be used to detect multivariate outliers in regression analysis?

Mahalanobis' distance

What is the assumption that the regression equation should be equally good at predicting values for low and high levels of the predictors?

Homoscedasticity

What can be a consequence of violating the assumption of homoscedasticity in regression analysis?

Decreased model accuracy

What is the primary reason for not deleting outliers from the dataset?

To maintain the integrity of the original data

What is the purpose of creating a filter variable when dealing with outliers?

To de-select the outliers from the analysis

What is the assumption that is being tested in Step 2 of the data checking process?

Normality

What is the next step in the data checking process after testing for univariate outliers?

Testing for multivariate outliers

What is the purpose of testing for multivariate outliers?

To identify cases that are unusual on multiple variables

What should you do if removing univariate outliers does not change the results of the analysis?

Return the excluded cases to the analysis

What is the purpose of creating a new variable called varFilter?

To filter out cases that are not selected

What is the primary purpose of testing for univariate outliers?

To check the normality assumption

What is the consequence of not checking for multivariate outliers?

The analysis may be biased towards the outliers

What is indicated by a fan-shaped scatterplot of residual scores against predicted values?

Heteroscedasticity

What statistical measure can be used to detect multivariate outliers?

Mahalanobis' Distance

What is the purpose of standardizing residual scores in regression analysis?

To check the normality assumption

What is a consequence of violating the assumption of homoscedasticity?

All of the above

What is the role of the ID variable in regression syntax?

To label cases in the dataset

What type of plot is used to check the normality assumption of residual scores?

Histogram

What is the purpose of plotting residual scores against predicted values?

To check for homoscedasticity

What is a potential consequence of not removing multivariate outliers from a dataset?

Violation of the homoscedasticity assumption

Which of the following is a limitation of using Mahalanobis' Distance to identify multivariate outliers?

It is sensitive to the scale of the predictor variables

What is the purpose of calculating the chi-square value when using Mahalanobis' Distance to identify multivariate outliers?

To establish a cut-off value for identifying outliers

Why is it problematic to only consider univariate outliers in a multivariate analysis?

Univariate outliers may not be extreme in the multivariate space

What is a common indication of nonnormality in the residual scores of a regression analysis?

A skewed distribution with outliers

What is the purpose of standardizing residual scores in regression analysis?

To compare the magnitude of residual scores across different models

Study Notes

Multivariate Outliers

  • Multivariate outliers are cases that are outliers when considered against all the variables simultaneously.
  • They need not (but could) be outliers on any of the variables individually.
  • An example is an 18-year-old earning $100,000 a year, where neither age nor salary is an outlier, but the combination is likely a multivariate outlier.

Identifying Multivariate Outliers

  • Mahalanobis' Distance is a statistical technique used to identify multivariate outliers.
  • It measures how far a case is from the centroid of the multidimensional normal distribution created from all the predictors.
  • Mahalanobis' Distance is distributed as chi-square, and we look for cases with a chi-square corresponding to a p-value less than 0.001.

Application in SPSS

  • SPSS can give the value of Mahalanobis' Distance for the 10 most extreme points in the dataset.
  • We compare these 10 most extreme points with a critical chi-squared value, which depends on the number of predictors.
  • If any case in our sample has a value of Mahalanobis' Distance greater than the critical chi-squared value, it is considered a multivariate outlier.

Removing Multivariate Outliers

  • When testing for multivariate outliers, remove them from the dataset and rerun the regression.
  • The nature of the results may change due to removing the outliers.
  • However, multivariate outliers are always removed, as they are particularly problematic.

Regression Analysis Assumptions

  • The first assumption is normality, where residuals should be normally distributed and centred around zero.
  • The second assumption is homoscedasticity, where residuals should be distributed similarly across values of the predictors.

Evaluating Regression Assumptions

  • The histogram of standardized residual scores should be normally distributed and centred around zero.
  • The scatterplot of residual scores against predicted values should resemble a rectangle, indicating homoscedasticity.
  • Deviations from these patterns suggest problems with the assumptions.

Outliers

  • Multivariate outliers are outliers when considered against all variables simultaneously
  • Example: an 18-year-old earning $100,000/year is likely to be a multivariate outlier
  • Multivariate outliers are particularly problematic and always need to be removed

Identifying Multivariate Outliers

  • We use Mahalanobis' Distance to identify multivariate outliers
  • Provides the 10 most extreme points in the dataset (not necessarily outliers)
  • Compare these points using chi-square cut-off value (at p < .001 level)
  • Cut-off value differs based on the number of predictors

Testing Multivariate Outliers

  • Use Mahalanobis' Distance to identify multivariate outliers
  • Provides the 10 most extreme points in the dataset
  • Compare to chi-square (.001) cut-off value
  • Any values exceeding this cut-off are considered multivariate outliers
  • Filter out any such cases

Regression and Data Analysis

  • Use regression analysis with course, age, and mathsBC as variables
  • Use residuals to test for outliers and check for errors

Data Entry and Assumptions

  • Check for data entry errors
  • Check normality assumption
  • Check homoscedasticity assumption
  • Testing for univariate outliers

Consequences of Violating Assumptions

  • Check for data entry errors
  • Check normality assumption
  • Check homoscedasticity assumption
  • Testing for univariate outliers

Regression Diagnostics

  • What happens when assumptions aren't met?
  • Test for univariate outliers
  • Test for multivariate outliers
  • Transform data

Testing for Univariate Outliers

  • Examine descriptives (skewness, kurtosis)
  • Examine histograms and boxplots

This quiz covers the detection of multivariate outliers in regression analysis, including cases that are outliers when considered against all variables simultaneously. Learn how to identify and deal with these outliers in statistical analysis.

Make Your Own Quizzes and Flashcards

Convert your notes into interactive study material.

Get started for free

More Quizzes Like This

Use Quizgecko on...
Browser
Browser