Lecture 5 2024 PDF
Utrecht University
2024
Summary
This is a lecture about conducting surveys, focusing on coding and screening processes using SPSS. It covers a range of topics including straightforward and not so straightforward coding methods, as well as handling missing values and outliers in data analysis.
Full Transcript
Conducting a Survey
Block 2, 2024
Week 5: Coding & Screening

Straightforward coding

Not so straightforward coding
Order the following statistics courses from easy (1) to difficult (5):
– ... M&S1
– ... M&S2
– ... M&S3
– ... Test theory
– ... CaS
Which courses in your study did you really like (multiple answers allowed)?
– Developmental psychology
– Psychonomics
– Multilevel analysis
– Philosophy of science
– Health psychology I
– CaS

Not so straightforward coding
Forced choice: a ranking question is made easier by dividing it into smaller problems (choosing between just two options at a time).
Which course did you find the most difficult (choose one from each pair)?
In this way a single question translates into several distinct variables.

Some remarks
Missing values can be left empty (system-missing) or defined explicitly (user-missing). User-missing codes are sometimes used to distinguish between different types of missing data, e.g.
– 97 = does not apply
– 98 = don’t know
– 99 = refused to answer
SPSS also allows string values, e.g. a variable gender with the values male and female.
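The coding ideas above can be illustrated outside SPSS with a small Python sketch. It rests on assumptions not stated in the slides: that a multiple-answer question is stored as one 0/1 variable per option, that the forced-choice version presents every possible pair of courses, and that the helper names (`code_multiple_answer`, `forced_choice_pairs`, `valid_values`) are purely illustrative.

```python
from itertools import combinations

# Options taken from the questions above; the coding scheme itself is an assumption.
COURSES = ["M&S1", "M&S2", "M&S3", "Test theory", "CaS"]
LIKED_OPTIONS = [
    "Developmental psychology", "Psychonomics", "Multilevel analysis",
    "Philosophy of science", "Health psychology I", "CaS",
]
# User-missing codes as listed in the remarks above.
USER_MISSING = {97: "does not apply", 98: "don't know", 99: "refused to answer"}


def code_multiple_answer(ticked, options):
    """One 0/1 variable per option: 1 = ticked, 0 = not ticked."""
    return {option: int(option in ticked) for option in options}


def forced_choice_pairs(options):
    """Every pair of options; each pair becomes its own variable."""
    return list(combinations(options, 2))


def valid_values(values, missing_codes=USER_MISSING):
    """Drop system-missing (None) and user-missing codes before analysis."""
    return [v for v in values if v is not None and v not in missing_codes]


liked = code_multiple_answer({"Psychonomics", "CaS"}, LIKED_OPTIONS)
pairs = forced_choice_pairs(COURSES)
scores = valid_values([23, 99, None, 31, 98])

print(liked)
print(len(pairs), "forced-choice variables")
print(scores)
```

Under these assumptions, five courses give ten pairs, so one ranking question turns into ten variables. User-missing codes only work if they lie outside the range of valid answers for that variable.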
For variables such as gender, it is better to use numerical codes with value labels: 1 = male, 2 = female.
Assign an identification number to each questionnaire (respondent) and add this id as a variable in the SPSS file.

Screening
Variables in the data:
– gender: 1 = male, 2 = female
– age: in years
– anxiety: ordinal 1-7 scale
– married: number of years married
– income: 5 categories (1 = ≤ 1500, 2 = 1501-2500, ..., 5 = > 5000)

Screening per variable
For continuous variables: much information is provided simply by descriptives.
For categorical variables: much information is provided by frequency distributions and cross-tables.

Screening per variable

Bivariate screening
Strange combinations cannot be found by examining one variable at a time.
Bivariate outlier: visible, for instance, in a scatter plot (for continuous variables).

Multivariate screening
Multivariate outliers: found by examining many variables together.
Mahalanobis distance: each respondent gets a score representing the distance in multivariate space between the respondent and the average respondent.
Saving the Mahalanobis distance in SPSS:
– is done in regression (so you have to specify a regression model)
– examines only the x-space (the predictors of the regression)
– trick: use all variables as predictors and the case number as the dependent variable (note: the regression output itself is nonsense!)
Saving the Mahalanobis distance creates a new variable: examine its high values.

Mahalanobis: results

Potential outliers
Examine the score patterns of suspicious persons:
– are these values very low or high? (compare with the means of the variables)
– is there an irregular combination of values on the predictors?
Causes of outliers:
– typos, for instance weight = 888 kg (probably 88; check the questionnaire)
– respondent is not part of the target population (e.g. age = 35, target 20-25)
– sample consists of subsamples
– no clear explanation can be found: some people are just very different from the average person

What do you do with outliers?
Outliers can have a large effect on the outcome of your analyses.
Alternatives:
– minimize the influence by changing the score to a less extreme value
– remove the influence by deleting the extreme case(s) from the analyses
Disadvantage:
– is the sample still random if you remove one or more respondents?
Advice:
– always be careful in removing outliers!
– analyze with and without outliers and see if the results differ
– if the results differ, I would choose to report the analyses without outliers, BUT you have to report honestly and precisely what you did

Analyses
Interested in a general summary measure (e.g. age of males and females)?
Central tendency measures: mean, median, mode
– the mean assumes a continuous variable and (approximate) normality
– the median can be a good alternative (it is not sensitive to skewness and also works for ordinal categorical data)
– the mode is the most frequently occurring score and is used as a summary for nominal data
Explore option in SPSS: mean, median, information about the distribution, useful plots (e.g. boxplot).
Comparing males and females on average age? Find the appropriate test.

Explore

Boxplots

Independent Samples T-Test
E.g. in SPSS: Analyze → Compare means → Independent samples t-test
– Test variable: age
– Grouping variable: gender
– Define groups: (specify the codes used in the data file, e.g. 1 and 2)

More than two independent groups
For a continuous dependent variable (age) and more than two groups, e.g. degree (5 groups):
SPSS → Analyze → General Linear Model → Univariate
– Dependent variable: age
– Fixed factor: degree
What are the assumptions underlying this model?
– absence of (severe) outliers
– normality of residuals (that is, normality of age in each of the 5 subgroups)
– equal variances in the 5 subgroups (activate option: homogeneity test)

Histogram after ‘split file’
To get an analysis or plot per group, you can use: Data → Split file → Compare groups (include the variable ‘degree’).
Now, by asking for one histogram for ‘age’, you’ll get five!
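As a rough illustration of what a per-group analysis amounts to, the Python sketch below splits age by degree, prints per-group descriptives, and computes the F statistic of a one-way ANOVA by hand. The data are made up, only three degree groups are used instead of five, and the function names `split_by` and `one_way_anova_f` are illustrative; this is not how SPSS computes its output internally.

```python
from collections import defaultdict
from statistics import mean, median, variance

# Hypothetical records: degree is a numeric group code, age is in years.
rows = [
    {"degree": 1, "age": 20}, {"degree": 1, "age": 22}, {"degree": 1, "age": 24},
    {"degree": 2, "age": 25}, {"degree": 2, "age": 27}, {"degree": 2, "age": 29},
    {"degree": 3, "age": 30}, {"degree": 3, "age": 32}, {"degree": 3, "age": 34},
]


def split_by(rows, group_key, value_key):
    """Collect the values of value_key separately for each code of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return dict(groups)


def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA for a dict of group -> scores."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = mean(all_scores)
    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())
    # Within-groups sum of squares: scores around their own group mean.
    ss_within = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


ages_by_degree = split_by(rows, "degree", "age")
for degree, ages in sorted(ages_by_degree.items()):
    # Similar per-group variances are what the equal-variances assumption asks for.
    print(degree, mean(ages), median(ages), variance(ages))

f_value = one_way_anova_f(ages_by_degree)
print("F =", f_value)
```

Comparing the per-group variances printed here is only an informal check of the equal-variances assumption; the homogeneity test option in SPSS provides a formal test.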
Warning: after using Split file, do not forget to return to Data → Split file and re-activate the default: ‘Analyze all cases, do not create groups’.

Output ANOVA through GLM

Association between variables
Relation between two variables:
– Pearson correlation: requires continuous data and a linear relation
– Spearman or Kendall's tau: can be used for ordinal data and for non-linear (monotonic) relations
– Crosstabs with chi-square (χ2) test: association between nominal variables
Always examine the scatterplot before analysing correlations (to choose the best measure, to notice outliers, non-linearity, etc.).
The reported significance values test the null hypothesis that the relation is equal to zero.
Always report both significance and relevance (e.g. the actual value of the correlation).

χ2 tests
Goodness of fit – establishes whether or not an observed frequency distribution differs from a theoretical distribution.
Independence (homogeneity) – assesses whether paired observations on two variables are independent of each other.

Crosstabs in SPSS

Output Crosstabs
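The χ2 test of independence on a crosstab can be sketched in a few lines of Python. This is a minimal illustration of the Pearson χ2 statistic without continuity correction; the 2 × 2 table of counts is made up, and the function name `chi_square_independence` is illustrative rather than anything SPSS exposes.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts.

    table is a list of rows, each row a list of observed frequencies.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical crosstab: rows = gender (1, 2), columns = answer (yes, no).
observed = [
    [10, 20],
    [20, 10],
]
chi2, df = chi_square_independence(observed)
print("chi-square =", round(chi2, 3), "df =", df)
```

With 1 degree of freedom the critical value at the 5% level is 3.84, so a statistic above that value would lead to rejecting independence for a table like this one. As with correlations, the size of the association should be reported alongside its significance.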