STT157 Exploratory Data Analysis (EDA) Lecture 5 PDF
MSU-IIT
Summary
This document presents a lecture on median polish, a robust data analysis technique. It describes the method, its applications, and how it can be implemented using R. This method is suitable for exploring patterns and effects in two-way tables.
Full Transcript
Welcome to STT157 Exploratory Data Analysis (EDA)

Median Polish

WHAT IS MEDIAN POLISH?

Median polish is one of the methods introduced by John Tukey. It is a simple and robust exploratory data analysis technique used to extract effects from a two-way table.

Median polish is robust to outliers because it uses medians rather than means. It is a data analysis technique that is more robust than ANOVA for examining the significance of the various factors in a multifactor model.

The goal is to characterize the role each factor has in contributing towards the expected value. It does so by iteratively extracting the effects associated with the row and column factors via medians.

STEPS FOR CONDUCTING MEDIAN POLISH

According to John Tukey, the steps in median polish are as follows:

1. Take the median of each row and record the value to the side of the row. Subtract the row median from each value in that row.
2. Compute the median of the row medians and record the value as the overall effect. Subtract the overall effect from each of the row medians.
3. Take the median of each column and record the value beneath the column. Subtract the column median from each value in that column.
4. Compute the median of the column medians and add the value to the current overall effect. Subtract this addition to the overall effect from each of the column medians.
5. Repeat steps 1–4 until no changes occur with the row or column medians.

The fit for each cell in row i and column j is:

$$\text{fit}_{ij} = \text{common term} + \text{row effect}_i + \text{column effect}_j$$

Whenever we fit a model to data, we need to examine the differences between the raw data and the values suggested by the fitted equation. For additive models fitted to two-way tables, we can find these differences from

$$\text{residual}_{ij} = \text{data}_{ij} - \text{fit}_{ij}$$

or, equivalently,

$$\text{residual}_{ij} = \text{data}_{ij} - (\text{common term} + \text{row effect}_i + \text{column effect}_j)$$
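The steps and the fit and residual definitions above can be sketched in a few lines of R. This is only one possible reading of the procedure, not the lecture's own code: the function name `median_polish_sketch`, its `n_iter` argument (a fixed number of sweeps standing in for step 5), and the small `toy` table are all hypothetical and chosen for illustration.

```r
# Hypothetical helper: one possible reading of steps 1-4, repeated n_iter times.
median_polish_sketch <- function(x, n_iter = 3) {
  resid   <- as.matrix(x)          # working table, ends up holding the residuals
  overall <- 0                     # common term
  row_eff <- numeric(nrow(resid))  # accumulated row effects
  col_eff <- numeric(ncol(resid))  # accumulated column effects

  for (k in seq_len(n_iter)) {
    # Step 1: take each row's median and subtract it from that row.
    row_med <- apply(resid, 1, median)
    resid   <- sweep(resid, 1, row_med, "-")
    row_eff <- row_eff + row_med

    # Step 2: move the median of the row effects into the overall effect.
    shift   <- median(row_eff)
    overall <- overall + shift
    row_eff <- row_eff - shift

    # Step 3: take each column's median and subtract it from that column.
    col_med <- apply(resid, 2, median)
    resid   <- sweep(resid, 2, col_med, "-")
    col_eff <- col_eff + col_med

    # Step 4: move the median of the column effects into the overall effect.
    shift   <- median(col_eff)
    overall <- overall + shift
    col_eff <- col_eff - shift
  }

  list(overall = overall, row = row_eff, col = col_eff, residuals = resid)
}

# Made-up, exactly additive table: common term 10,
# row effects (-1, 0, 2), column effects (-3, 0, 1).
toy <- matrix(c(6,  9, 10,
                7, 10, 11,
                9, 12, 13), nrow = 3, byrow = TRUE)

fit <- median_polish_sketch(toy, n_iter = 2)

# fit_ij = common term + row effect(i) + column effect(j)
fitted_vals <- fit$overall + outer(fit$row, fit$col, "+")

# residual_ij = data_ij - fit_ij (all zeros for this exactly additive table)
print(toy - fitted_vals)
```

Because the toy table is exactly additive, the sketch recovers the common term and effects it was built from and leaves no residual. On real data the residuals are generally nonzero, and they are what we inspect to judge the fit.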
We can rearrange the residual equation as

$$\text{data}_{ij} = \text{common term} + \text{row effect}_i + \text{column effect}_j + \text{residual}_{ij}$$

The resulting model is additive, of the form

$$y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$$

where $y_{ij}$ is the response variable for row i and column j, $\mu$ is the overall typical value (hereafter referred to as the common value), $\alpha_i$ is the row effect, $\beta_j$ is the column effect, and $\epsilon_{ij}$ is the residual, or value left over after all effects are taken into account.

Example: Infant mortality rates in the United States, all races, 1964-1966, by region and father’s education. (Entries are numbers of deaths per 1000 live births.)

                 Education of Father (in years)
Region           ≤8     9-11   12     13-15  ≥16
Northeast        25.3   25.3   18.2   18.3   16.3
North Central    32.1   29.0   18.8   24.3   19.0
South            38.8   31.0   19.3   15.7   16.8
West             25.4   21.1   20.3   24.0   17.5

Source: U.S. Dept. of Health, Education and Welfare, National Center for Health Statistics, Infant Mortality Rates: Socioeconomic Factors, United States, Vital and Health Statistics, Series 22, Number 14, Rockville, MD, 1972. DHEW publication number (HSM)72-1045 (data from Table 8, p. 21).

WHEN DO WE STOP ITERATING?

The goal is to iterate through the row and column smoothing operations until the row and column effect medians are close to 0. However, Hoaglin et al. (1983) warn against “using unrestrained iteration” and suggest that a few steps should be more than adequate in most instances.

[Table: final version of the table, along with the column and row values. Columns are education of father (in years); rows are the regions Northeast, North Central, South, and West.]

INTERPRETING MEDIAN POLISH

As noted earlier, a two-way table represents the relationship between the response variable y and the two categories as:

$$y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$$

In our working example, $\mu = 20.6$; the $\alpha_i$ are the region effects (for example, 2.6 and 0.4 for North Central and West, respectively); and the $\beta_j$ are the education effects (for example, 7.6 and -3.5 for the lowest and highest education levels, respectively).

So, the mortality rate in the upper left-hand cell of the original table can be deconstructed as:

$$y_{11} = \mu + \alpha_1 + \beta_1 + \epsilon_{11}$$

$$25.3 = 20.6 - 1.5 + 7.6 - 1.4$$

Examination of the table suggests that the infant mortality rate is greatest for fathers who did not attain more than 8 years of school (i.e. who have not completed high school), as shown by the high column effect value of 7.6. This is the rate of infant mortality relative to the overall median (i.e. on average, 20.6 infants per thousand die every year, and the rate goes up to 7.6 + 20.6 = 28.2 for infants whose father has not passed the 8th grade).

Infants whose father has completed 16 or more years of school (i.e. who has completed college) have a lower rate of mortality, as indicated by the low effect value of -3.5 (i.e. 3.5 fewer deaths than average). The region effects also show higher infant mortality rates for the North Central and Western regions (with effect values of 2.6 and 0.4, respectively) and lower rates for the Northeastern and Southern regions; however, the regional effect does not appear to be as dominant as that of the father’s educational attainment.

IMPLEMENTING THE MEDIAN POLISH IN R

R has a built-in function called medpolish(). We can set the maximum number of iterations with the maxiter= parameter, but note that medpolish() will, by default, stop on its own once further iterations change the residuals very little (or after 10 iterations, its default maximum).

R CODE: First load your data frame df
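As a minimal sketch of how such a call might look on the working example, the block below enters the infant mortality table as a matrix and runs medpolish() on it. The object name `infant`, the column labels, and the choice of `maxiter = 3` are assumptions made for illustration and are not taken from the lecture's own code. Capping the run at three sweeps should give a common value of about 20.6 and effects close to those quoted in the interpretation.

```r
# Infant mortality table from the example (deaths per 1000 live births).
# The object name and the column labels are assumptions made for this sketch.
infant <- matrix(
  c(25.3, 25.3, 18.2, 18.3, 16.3,
    32.1, 29.0, 18.8, 24.3, 19.0,
    38.8, 31.0, 19.3, 15.7, 16.8,
    25.4, 21.1, 20.3, 24.0, 17.5),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    region    = c("Northeast", "North Central", "South", "West"),
    education = c("<=8", "9-11", "12", "13-15", ">=16")
  )
)

# maxiter = 3 stops after three sweeps. medpolish() warns when it stops before
# its own convergence test is met, so the warning is silenced here on purpose.
mp <- suppressWarnings(medpolish(infant, maxiter = 3, trace.iter = FALSE))

mp$overall     # common value
mp$row         # region (row) effects
mp$col         # education (column) effects
mp$residuals   # what is left after removing all effects

# Check the additive model: data = common + row effect + column effect + residual
recon <- mp$overall + outer(mp$row, mp$col, "+") + mp$residuals
all.equal(recon, infant, check.attributes = FALSE)
```

Dropping the `maxiter` argument lets medpolish() decide when to stop, in which case the effects may differ slightly from the three-sweep values. Calling `plot(mp)` on the result draws a Tukey additivity plot, which is one way to check whether an additive model is reasonable.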