Lecture 6 2024 PDF
Utrecht University
2024
Summary
This lecture covers survey weighting and different types of missing data mechanisms, including MCAR, MAR, and MNAR. It discusses random sampling and generalizability, along with how to identify non-representativeness in a sample. The lecture also introduces methods to handle missing data, such as mean imputation and regression imputation, emphasizing the importance of considering the nature of the missing data in the choice of imputation method.
Full Transcript
Conducting a Survey
Block 2, 2024
Week 6: Weighting and Scaling

This week
Scales and reliability of scales
Missingness Mechanisms
Unit non-response
Weighting
Short introduction to Multiple Imputation

Scales and reliability of scales
Many constructs are not easily measured (i.e. not with a single question)
  – You do not measure math abilities with one task (e.g. 5*12 =...)
  – You do not measure neuroticism by asking “how neurotic do you consider yourself on a scale from 1-10?”
Solution: several items together are supposed to measure the construct
The items are summed or averaged into a scale score (new variable)
The level of measurement of the scale is an additional advantage
Reliability of the scale: “do the items more or less measure the same thing?”
Cronbach’s alpha and item analysis (next slide)

Reliability of scales: output
Cronbach's alpha: the closer to 1, the higher the internal consistency of the scale

Introduction to Missingness Mechanisms
Every data point has some likelihood to be missing
The process that governs these probabilities is called the missing data mechanism
We distinguish three missingness mechanisms
  – MCAR: Missing Completely at Random
  – MAR: Missing at Random
  – MNAR: Missing Not at Random
A missingness mechanism governs the probability that a set of values is missing, given the values taken by the observed and missing observations

What it would look like (figure)

Missing Completely at Random (MCAR)
The probability to be missing is the same for all cases
There is no observable reason why the data is missing → the missingness does NOT depend on the data.
Example: asking people on the street and finding some people unwilling to disclose their weight, without a reason.

Missing at Random (MAR)
The probability to be missing is not the same for all cases, but may depend on observed information
There is an observable reason why the data is missing → the missingness DEPENDS on the observed data.
Example: asking people on the street and finding e.g. women less likely to disclose their weight → we have information on gender and can use this information!

Missing Not at Random (MNAR)
The probability to be missing is not the same for all cases AND may now depend on missing information
There is no observable reason why the data is missing → the missingness DEPENDS on the missing data.
Example: asking people on the street and finding heavier people unwilling to disclose their weight.

Random Sampling and Generalisation
Target population → (simple) random sample
Randomness ‘ensures’ generalisability to the population
  – but chance is involved (especially risky in small samples)
  – check by comparing on important variables
  – it can be hard to find the required information about your population (possible sources: complete lists, CBS, funda)
Generalisability might be lost if not all intended respondents participate
  – are non-respondents different from respondents?
  – check by comparing on important variables
  – it can be hard to get information from non-respondents
  – if respondents and non-respondents differ, the sample is biased

Does the sample resemble the population?
Example
Target population: all people that use the fitness facilities at Olympos.
The administration probably did not give you a members list, but may have been willing to give some numbers:
  – percentages of males and females
  – percentages of UU students, HBO students (higher vocational schools) and staff members
Consider the percentage of females according to the administration is 30%
Consider your sample of 40 respondents includes 10 females → 25%
It is possible to test:
  $H_0: \pi_{fem} = .30$
  $H_A: \pi_{fem} \neq .30$

Testing a proportion in SPSS
Test value: .70 (note: SPSS chooses the largest category as test category)
Conclusion: the percentage of females in your sample (25%) is below the population value (30%), but not significantly different

Does the sample resemble the population?
Consider the percentages according to the administration are:
  – UU students .40
  – HBO students .40
  – staff members .20
Consider the numbers in the sample are respectively: 26 (.65), 12 (.30), 2 (.05)
Test:
  $H_0: \pi_{UU} = .40,\ \pi_{HBO} = .40,\ \pi_{staff} = .20$
  $H_A$: not $H_0$
Chi-square test

Unit vs Item Non-response
Generalisability might be lost if not all intended respondents participate

Weighting
Under- or oversampling can be corrected by weighting
Make a new variable in SPSS and denote this new variable as a weighting variable: Data - Weight cases

Output before/after weighting
Weights off
Weights on

Stratified sampling and weighting
Population: 500 residents in a home for elderly people
For instance, randomly sample 10 people from each stratum.
Sample distribution of age is not representative of the population → weight!

Unit non-response
Generalisability might be lost if not all intended respondents participate as desired/designed
  – non-respondents (not reached or refused to participate)
  – respondents (providing all information)
  – respondents that did not answer all questions
    » most important answers available: item non-response (case included)
    » most important answers are not available: unit non-response (case deleted)

Unit non-response in SPSS
Make a data file in SPSS with the variables you reported for both the respondents and non-respondents (e.g. gender, age)
Add a variable ‘response’ to the file with values: 1=‘yes’, 0=‘no’
Compare the respondents and non-respondents:
  – if variable is categorical (e.g. gender) → cross-tabs
  – if variable is numerical (e.g. age) → independent samples t-test
Note: you can report that the average age in the sample deviates from the population; however, to weight you need age classes (1=16-18 (20%), 2=18-20 (35%), etc.)
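The Olympos checks and the weighting step from the slides above can be sketched outside SPSS. The following is a minimal Python illustration, not the SPSS procedure itself: the counts and population shares are the ones quoted on the slides, while the function names and the particular definition of the exact two-sided binomial p-value are assumptions made for this sketch (SPSS may compute its p-value differently). The weight per group is taken as population share divided by sample share.

```python
from math import comb, exp

# Figures quoted on the slides (Olympos example).
n = 40
females_in_sample = 10
pi_female = 0.30

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def exact_binomial_p(k, n, p):
    """Two-sided exact p-value (assumed definition): total probability of all
    outcomes that are no more likely than the observed one."""
    p_obs = binom_pmf(k, n, p)
    total = sum(q for q in (binom_pmf(i, n, p) for i in range(n + 1))
                if q <= p_obs * (1 + 1e-9))
    return min(1.0, total)

p_female = exact_binomial_p(females_in_sample, n, pi_female)

# Chi-square goodness-of-fit test for UU / HBO / staff.
population = {"UU": 0.40, "HBO": 0.40, "staff": 0.20}
observed = {"UU": 26, "HBO": 12, "staff": 2}
expected = {g: n * share for g, share in population.items()}
chi2 = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in population)

# Three categories give 2 degrees of freedom; for df = 2 the chi-square
# upper-tail probability has the closed form exp(-x / 2).
p_groups = exp(-chi2 / 2)

# Weight per group = population share / sample share.
weights = {g: population[g] / (observed[g] / n) for g in population}

print(f"females: p = {p_female:.3f}")
print(f"groups: chi2 = {chi2:.2f}, p = {p_groups:.4f}")
for g, w in weights.items():
    print(f"weight {g}: {w:.3f}")
```

With these numbers the chi-square statistic is 11.75 (p of about 0.003), so the UU/HBO/staff distribution in the sample differs significantly from the administration's figures, and the weights come out at roughly 0.615 (UU), 1.333 (HBO) and 4.0 (staff). After weighting, the group totals are 16, 16 and 8, matching the population shares. Note that the staff weight rests on only two respondents.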
Comparing distributions (figure)

Comparing averages
Independent samples t-test (grouping variable = ‘response’, test variable = ‘age’) has two variants:
  – equal variances assumed
  – equal variances not assumed

More on weighting
Weighting is also used to account for complex survey designs
  – Clustering
  – Stratification
  – Etc.
This is a topic of advanced survey methodology and statistics courses
In this case, the weights will be design weights.

Weighting summarised
Several sources of non-representativeness of the sample are possible: stratified sampling, ‘failed randomization’, non-random non-response
For each cause, a weight (new variable in the data file) can be computed to correct for it
The weights can be multiplied into one overall weight (new variable in your data file)
The final weight ‘ensures’ that the sample (after weighting) is representative of the target population.

Next topic: imputation. Short break?

Item Missings

MCAR, MAR, MNAR and Imputation
If the data are MCAR or MAR, the missing data mechanism can be ignored, and multiple imputation and maximum likelihood procedures can be used.
If the data are MNAR, one may (in general) not ignore the missing data mechanism.
  – Out of scope of this course
We assume MCAR or MAR for now.
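To see why MCAR and MAR can be handled with observed information while MNAR cannot, the street-survey example (people disclosing their weight) can be simulated. The sketch below is a hypothetical illustration: the population means, the missingness probabilities, the 80 kg cut-off and all function names are assumptions chosen for the demonstration, not values from the lecture.

```python
import random
import statistics

rng = random.Random(6)  # fixed seed so the sketch is reproducible

# Hypothetical sample: gender is always observed, weight (kg) may be missing.
n = 20000
people = []
for _ in range(n):
    female = rng.random() < 0.5
    weight = rng.gauss(70.0 if female else 82.0, 10.0)
    people.append((female, weight))

def respond(prob_missing):
    """Keep each person whose weight is NOT missing under the given mechanism."""
    return [(f, w) for f, w in people if rng.random() >= prob_missing(f, w)]

def mean_weight(rows):
    return statistics.fmean(w for _, w in rows)

def gender_adjusted_mean(rows):
    """Average the observed weights within gender, then combine the two
    group means using the gender shares of the full sample."""
    share_f = sum(f for f, _ in people) / n
    mean_f = statistics.fmean(w for f, w in rows if f)
    mean_m = statistics.fmean(w for f, w in rows if not f)
    return share_f * mean_f + (1 - share_f) * mean_m

true_mean = mean_weight(people)

# MCAR: everyone has the same chance of not answering.
mcar = respond(lambda f, w: 0.3)
# MAR: women answer less often; the chance depends only on observed gender.
mar = respond(lambda f, w: 0.5 if f else 0.1)
# MNAR: heavier people answer less often; the chance depends on the missing value.
mnar = respond(lambda f, w: 0.6 if w > 80 else 0.1)

results = {
    "complete data": true_mean,
    "MCAR, observed cases": mean_weight(mcar),
    "MAR, observed cases": mean_weight(mar),
    "MAR, adjusted for gender": gender_adjusted_mean(mar),
    "MNAR, observed cases": mean_weight(mnar),
    "MNAR, adjusted for gender": gender_adjusted_mean(mnar),
}
for label, value in results.items():
    print(f"{label:28s} {value:6.2f}")
```

Under these assumptions the MCAR mean should stay close to the complete-data mean, the unadjusted MAR mean should drift upward (too few women) but recover once gender is used, and the MNAR mean should stay too low even after adjusting for gender, because the reason for the missingness is not in the observed data.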
Item Missings

Example: The benefits of actually studying (figure)

One value removed (figure)

Let’s impute the data: Mean Imputation (figure)

Let us remove 50% of the data (figure)

Mean Imputation (figure)

Regression Imputation (figure)

Regression Imputation – 50% missing (figure)

Regression Imputation
Regression imputation gives us the best value under a regression model
  – The imputed value is the most probable value
  – The imputed value has minimum error
True grade is uncertain
  – Predictions do not portray this uncertainty

Some Theory
Let's say that $\bar{M}$ is the sample mean for the incomplete data
  – It differs for each sample
The sample mean for the complete data is $M$ and the population mean is $M$
What says that our sample is the best sample?
  – There is sample uncertainty that should be accounted for in our imputations
Solution: multiple imputation, which tackles the problem of uncertainty about the imputation.
  – Instead of imputing one value, we impute the same missing value m times.
  – For now we assume that $m = 5$

Regression Multiple Imputation (figure)

Stochastic Regression Imputation (figure)

SPSS – Analyse Pattern (screenshots)

SPSS (screenshots)

Multiple Imputation (Predictive Mean Matching) – Descriptives

Multiple Imputation (PMM) – Pooled Regression

Pooled – why?
Multiple imputation generates different imputed files
These will have to be combined
Combine the between and within imputation variance (somehow – technical, hence not covered here!)

Missing data mechanisms
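The imputation slides above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the SPSS procedure: the data (hours studied and grade) are simulated with made-up coefficients, all names are hypothetical, and the multiple-imputation part simply repeats stochastic regression imputation m = 5 times instead of using predictive mean matching. The pooling step uses Rubin's rules, the standard way to combine within- and between-imputation variance, which the lecture itself leaves out as too technical.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Hypothetical data: hours studied (always observed) and grade (partly missing).
n = 1000
hours = [rng.uniform(0, 20) for _ in range(n)]
grade = [4.0 + 0.25 * h + rng.gauss(0, 1.0) for h in hours]

# Remove 50% of the grades completely at random (MCAR).
missing = set(rng.sample(range(n), n // 2))
obs = [i for i in range(n) if i not in missing]

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (len(xs) - 2)) ** 0.5

a, b, resid_sd = fit_line([hours[i] for i in obs], [grade[i] for i in obs])
obs_mean = statistics.fmean(grade[i] for i in obs)

def impute(method):
    """Return a completed copy of the grades using the given method."""
    completed = []
    for i in range(n):
        if i not in missing:
            completed.append(grade[i])
        elif method == "mean":
            completed.append(obs_mean)
        elif method == "regression":
            completed.append(a + b * hours[i])
        else:  # "stochastic": prediction plus random residual noise
            completed.append(a + b * hours[i] + rng.gauss(0, resid_sd))
    return completed

sd_true = statistics.stdev(grade)
sd_mean = statistics.stdev(impute("mean"))
sd_reg = statistics.stdev(impute("regression"))

# Impute m times and pool the estimated mean grade with Rubin's rules.
m = 5
estimates, within = [], []
for _ in range(m):
    completed = impute("stochastic")
    estimates.append(statistics.fmean(completed))
    within.append(statistics.variance(completed) / n)  # squared SE of the mean

pooled = statistics.fmean(estimates)  # pooled point estimate
W = statistics.fmean(within)          # within-imputation variance
B = statistics.variance(estimates)    # between-imputation variance
T = W + (1 + 1 / m) * B               # total variance

print(f"sd of grades: true {sd_true:.2f}, mean imp. {sd_mean:.2f}, "
      f"regression imp. {sd_reg:.2f}")
print(f"pooled mean {pooled:.2f}, pooled SE {T ** 0.5:.3f}")
```

In this setup mean imputation shrinks the spread of the grades the most, and regression imputation shrinks it less but still understates it, because every imputed value sits exactly on the regression line. Adding residual noise and repeating the imputation is one way to carry that uncertainty into the pooled standard error. A fuller multiple-imputation procedure would also draw the regression coefficients anew for each imputation, which this sketch omits.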