Lecture 7 (2024) - Combined PDF
2024
Summary
This lecture discusses different types of exam questions (selected response vs. constructed response) and their implications. It uses Bloom's Taxonomy and examples from different fields to illustrate the concepts.
Full Transcript
Meeting 7

What I will discuss today
- Selected response (= multiple-choice questions) versus constructed response (= open-ended or essay questions). See also book chap 6
- Assignments (feedback)
- Practice exam questions

Open versus multiple-choice (MC) items
Some people say there are too many MC exams (recently in the University Krant, UK). They say it is important that you can reproduce, apply and evaluate your knowledge, and that MC exams leave too little attention for this and reward recognition too much. So they want more essay questions.
Is this a “face validity” argument or is it grounded in scientific research?

Bloom’s taxonomy (lower versus higher orders of cognition)
So the idea is: MC = remembering; essay questions = higher-order cognition. Is this true?

Example (from Hift, 2014)
A 24-year-old woman is admitted to a local hospital with a short history of epistaxis (= nosebleed). On examination she is found to have a temperature of 36.9 °C. She is wasted, has significant generalized lymphadenopathy and mild oral candidiasis but no dysphagia. A diffuse skin rash is noticed, characterized by numerous small purple punctate lesions. A full blood count shows a haemoglobin value of 110 g/L, a white cell count of 3.8 × 10^9 per litre and a platelet count of 8.3 × 10^9 per litre.
Which therapeutic intervention is most urgently indicated in this patient?
a. Antiretroviral therapy
b. Fluconazole
c. Imipenem
d. Prednisone
e.
Platelet concentrate infusion

Example from Hift (2014)
Although this is a “medical item”, you can see that it is not a remembering or recognition item. This item requires analysis, knowledge and evaluation, not only recognition.

Hift (2014)
“That the multiple-choice format demonstrates high validity is due in part to the observation that well-constructed, context-rich multiple-choice questions are fully capable of assessing higher orders of cognition, and that they call forth cognitive problem-solving processes which exactly mirror those required in practice.”

As another example
In 1923, Babe Ruth had 522 at bats with 205 hits. Assuming that the binomial distribution can be appropriately applied, find the expected number of hits in 529 at bats.
a. 321
b. 186
c. 230
d. 20
This is not pure recognition.

From Hift (2014)
[Figure: what some people suggest versus reality; error variance]

Thus: Why MC questions?
Advantages of multiple-choice exams over essay-question exams
1. More items (questions) and therefore:
A. often more reliable than essay questions (Spearman-Brown formula, see lecture 2)
B. better representation of the construct (content validity)

Why MC questions?
2. Objective scoring (standardization! See lecture 1) and thus fair
3. MC questions are as well suited as essay questions to test, apply, analyze and evaluate knowledge (Schuwirth & van der Vleuten, 2004). Also, for relations with future behavior and achievements there are no differences, or MC does better! (Bridgeman & Lewis, 1994; Hift, 2014).

Why MC questions?
4. MC questions are in most cases more cost-effective (Hift, 2014)

When open/essay questions?
To test “creation”, open-ended questions are more suited than MC questions. Example: writing a story to show your writing skills.
Conclusion: it is perfectly fine in many cases to use MC questions, also to measure higher-order cognition!
It is not so much the MC question format, but how the questions are sometimes constructed, that makes them “recognition” items.

Feedback assignments
See lecture

Exam questions
See lecture and the Word file with practice exam questions

Meeting 6
chap 14 Projective tests
chap 15 Interests and attitudes

Chap 14 Projective personality tests
Projective test: an ambiguous stimulus; the reaction to this stimulus shows needs, feelings, experiences etc.
Examples: the Rorschach test and the Thematic Apperception Test (TAT)
These tests are controversial.

Rorschach
The Rorschach inkblot test
10 cards: 5 black and grey, 2 black, grey and red, and 3 cards with different colours
Individual test in which the respondent is asked to interpret an inkblot; the psychologist does not give any clues

Rorschach
Two rounds:
(1) Free association: “Now we’re going to do the inkblot test - perhaps you’ve heard of it?” Hand the subject the first card and say “What might this be?”
(2) “Inquiry”, where answers are being scored (tip: check in the book what is being asked)

Rorschach
What is scored? Characteristics like:
(1) Location (W-whole; D-detail; Dd-unusual detail)
(2) Determinant (What inkblot features helped determine your response and how? F-Form; m-Movement; Color; Shading)
(3) Content (Human, Animal, Nature)
(4) Popularity (how frequently is the percept seen in normative samples)

Exner
Collection of a broad normative database
Integrated system of scoring
The Rorschach: A comprehensive system

Rorschach
Problems with the Rorschach:
(1) Administration, scoring and interpretation are not standardized
(2) Subjective interpretation of results
(3) Results are unstable over time
(4) Psychometric properties are insufficient
However, it is used in clinical practice as a kind of semi-structured interview, to see how people respond to ambiguous stimuli.
TAT
Thematic Apperception Test (Murray)
30 cards: “tell me what happens on these cards, what do these people think?”
Based on Murray’s theory of needs
Registration of reactions and response time
Scoring of themes like achievement, affiliation, and power
http://www.utpsyc.org/TATintro/

TAT
Underlying assumption: the respondent shows his/her conflicts.
Problems: not standardized, subjective interpretation of the results, unreliable

Chapter 15: Interests and Attitudes
Measures of personality, interests, and attitudes are all used to measure non-cognitive traits, but there are some different traditions.

Measuring interests
Two important distinctions
I. Origin of scales (= underlying principle of construction)
A. Criterion-keying
B. Broad areas
II. Item format
A. Absolute level of interest
B. Relative level of interest

Measuring interests
I. Origin of the scale
One approach uses criterion-keying: which items differentiate between well-defined groups (see also chapter 12!)? Here the groups are vocational groups, so the question is which items differentiate between members of different vocational groups.

Measuring interests
The other approach focuses on broad areas of interest; a high score in a certain area (for example, artistic, persuasive or scientific) may point to certain occupations.

Measuring interests
II. Item format
ABSOLUTE LEVEL
Rate the extent to which you like each of these activities:
                    Dislike  Neutral  Like
Dissecting frogs       O        O      O
Analyzing data         O        O      O
Selling magazines      O        O      O
RELATIVE LEVEL
Among these activities, mark M for the one you like the Most and mark L for the one you like the Least. Make no mark for the other activity.
Dissecting frogs    [M] [L]
Analyzing data      [M] [L]
Selling magazines   [M] [L]

Holland Themes and RIASEC Codes
[Figure: hexagon with the six themes Realistic, Investigative, Artistic, Social, Enterprising and Conventional at its vertexes]
R(ealistic) I(nvestigative) A(rtistic) S(ocial) E(nterprising) C(onventional) = acronym used to report scores

Holland
Features of the hexagon:
The hexagon gives an idea of the degree of relationship between themes (= personality types)
Vertexes represent themes; adjacent themes are more strongly related, diagonally opposite themes have low correlations
Themes are related to job types
See Table 15.3 (next)

Holland’s Personality Types with Examples of Related Jobs and Characteristics
Type | Code | Examples of jobs | A few descriptors
Realistic | R | security guard, athletic trainer, dentist | practical, frank
Investigative | I | steel worker, police detective, chemical engineer | critical, curious
Artistic | A | dancer, fashion designer, editor | expressive, idealistic
Social | S | child care worker, occupational therapy aide, teacher | kind, generous
Enterprising | E | telemarketer, sales, marketing manager | extroverted, optimistic
Conventional | C | police dispatcher, dental assistant, accountant | orderly, efficient

O*NET
http://www.onetonline.org/find/descriptor/browse/Interests
This site lists many jobs on the basis of RIASEC codes

Strong Interest Inventory (SII)
A very popular instrument
Consists of different scales and scores (see next slide)

Strong Interest Inventory
1. General Occupational Themes (GOT): based on Holland’s RIASEC, 6 scores
2. Basic Interest Scales (BIS): Athletics, Science, Military etc., 30 scores, based on factor analysis
3. Occupational Scales: different occupations (based on criterion keying)
4. Personal Style Scales: work style, learning environment, leadership, team orientation, risk taking
5.
Administrative Indexes (= validity indexes)

Reliability and Validity Strong Inventory
Reliability through Cronbach’s alpha and test-retest: in general satisfactory (larger than r = .80)
How do you investigate validity?
(1) test results differentiate between existing occupational groups in predictable directions (see next slide)
(2) scores are predictive of the occupation (thus of future behavior)

Differentiation on SII Occupational Scales
[Figure: differentiation on SII Occupational Scales]

But there are also invalid tests in this area
The DISC colour test
Very popular questionnaire in human resources management
Objective scoring (not projective)
But is it a good test?

From the manual
The DiSC® Model
The foundation of DiSC® was first described by William Moulton Marston in his 1928 book, Emotions of Normal People. Marston identified what he called four “primary emotions” and associated behavioral responses, which today we know as Dominance (D), Influence (i), Steadiness (S), and Conscientiousness (C).

DISC colour test
Since Marston’s time, many instruments have been developed to measure these attributes. The Everything DiSC® assessment uses the circle, or circumplex, as illustrated below, as an intuitive way to represent this model. Although all points around the circle are equally meaningful and interpretable, the DiSC model discusses four specific reference points.
From the manual
Dominance: direct, strong-willed, and forceful
Influence: sociable, talkative, and lively
Steadiness: gentle, accommodating, and soft-hearted
Conscientiousness: private, analytical, and logical

The DISC colour test

Example questions Dominance scale
I am very outspoken with my opinion
I am forceful
I tend to challenge people
I can be blunt
I am tough minded

Colour test
This test is popular in human resources management (HRM)
People do the test and they get a colour (one, perhaps two colours)
Thus people are a “type”

Colour test
Popular management literature describes how you should deal with “red” people or “yellow” people, or that you need a “red” person or a “yellow” person in the team
It is suggested that this is handy for team dynamics (“Not too many reds” etc.)

In the bookshop …

Problems
The theoretical background is weak (based on Jung’s typology and a work from 1928 by Marston)
There are no scientific articles that, independently of commercial interest, show the psychometric quality of this test
There is a commercial test manual with some psychometric information, but it is difficult to judge whether the analyses make sense

But HRM people love it …
Why are these tests so popular?
Again, as for the MBTI (see lecture 5):
People can make a nice story around this test
Face validity is high
“If people can construct a simple and coherent story, they will feel confident regardless of how well grounded it is in reality” (Kahneman & Klein, 2010)

Generalizations about Career Interest Measures
Quite reliable
Respectable validity
Little use of modern psychometric theory
Movement to online completion
Assessing abilities along with interests

Important lesson
And do not forget …
these are commercial products (like many tests), therefore … questions are not publicly available

Attitude Measures
Components: cognitive, behavioral, emotional
Measures concentrate on cognitive elements
The number of attitudes: huge
Many attitude scales; none widely used

Types of scales: Likert
Well known
Identify target
Large number of items
Agree-disagree format; 5-point Likert scale
Sum ratings (method of summated ratings)
Item analysis to get final items
Different types of scales: Likert, Thurstone, Guttman

Example
From Likert (1932), Internationalism scale (opinion polling):
All men who have the opportunity should enlist in the Citizens Military Training Camps
Strongly approve / Approve / Undecided / Disapprove / Strongly disapprove

Thurstone Scales
Read through the text; not much used in practice, no questions on the exam

Guttman Scales
I do not agree with the book: although Guttman scaling is not often used, probabilistic versions of Guttman scaling are popular to construct attitude and other non-cognitive scales

Guttman scaling
For example, Mokken scaling
First, what is Guttman scaling? (see next slide)
Assume we have 8 items
We can order the items according to increasing difficulty
Then it is expected that, given a person’s total score, say X = 4, he/she will answer the 4 easiest items correctly and the other items incorrectly; for a total score X = 2, the 2 easiest items, etc.
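The expectation described above (a person with total score X answers exactly the X easiest items correctly) can be checked mechanically. The following is a minimal sketch, not the lecturer's own procedure: the function names are hypothetical, responses are coded 1 = correct and 0 = incorrect, and a Guttman error is counted as every incorrect answer that sits on an easier item than a correct answer. The patterns are those of Persons A, B and C and of the practice question (Susan, Tom, Marvin, Hajo) that appear in this lecture.

```python
from typing import List, Sequence


def order_by_difficulty(data: Sequence[Sequence[int]]) -> List[int]:
    """Return item indices from easiest to hardest (highest proportion correct first)."""
    n_persons = len(data)
    n_items = len(data[0])
    p_values = [sum(row[j] for row in data) / n_persons for j in range(n_items)]
    # sorted() is stable, so items with equal p-values keep their original order
    return sorted(range(n_items), key=lambda j: -p_values[j])


def guttman_errors(pattern: Sequence[int]) -> int:
    """Count Guttman errors in a 0/1 pattern whose items run from easiest to hardest."""
    errors = 0
    zeros_seen = 0
    for score in pattern:
        if score == 0:
            zeros_seen += 1
        else:
            # each correct answer adds one error per easier item answered incorrectly
            errors += zeros_seen
    return errors


def is_perfect_guttman(pattern: Sequence[int]) -> bool:
    """A perfect Guttman pattern contains no Guttman errors."""
    return guttman_errors(pattern) == 0


# Persons A, B and C; items are already ordered from easiest to hardest
person_a = [1, 1, 0, 0, 0, 0, 0, 0]
person_b = [1, 0, 1, 1, 0, 0, 0, 0]
person_c = [1, 1, 1, 0, 1, 1, 1, 0]
errors_abc = [guttman_errors(p) for p in (person_a, person_b, person_c)]  # [0, 2, 3]

# Practice question: items are NOT yet ordered, so sort them by p-value first
exam = {
    "Susan": [1, 1, 0, 1],
    "Tom": [1, 0, 1, 1],
    "Marvin": [1, 0, 0, 1],
    "Hajo": [0, 0, 1, 1],
}
order = order_by_difficulty(list(exam.values()))  # [3, 0, 2, 1], i.e. items 4, 1, 3, 2
perfect = [
    name for name, row in exam.items()
    if is_perfect_guttman([row[j] for j in order])
]  # ['Tom', 'Marvin']
```

Sorting the items before checking matters: judged in the printed order 1 to 4, none of the four patterns in the practice question looks perfect, which is exactly the trap in that question.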
This is a Guttman scale

Guttman Scale
[Figure: example of a Guttman scale]
+ = gives a correct answer; - = gives an incorrect answer; often we also use 1 for a correct answer and 0 for an incorrect answer

Mokken scaling
However, in practice we almost never encounter such data because this is unrealistic: persons’ answering behavior is not completely in agreement with a Guttman scale. More realistic is:
item       1 2 3 4 5 6 7 8
Person A   + + - - - - - -
Person B   + - + + - - - -
Person C   + + + - + + + -

Mokken scaling
To model this more realistic behavior: Mokken scaling
Using Mokken scaling we check to what degree our data are in agreement with the Guttman model, using the scalability coefficient H
H is between 0 and 1; the higher the value, the better the scale and the better we are able to scale persons
H = 0: bad scale
H = 1: very good scale

Mokken scaling
Note that the items are ordered according to increasing difficulty
Then, how many Guttman errors are there for persons A, B and C?
Answer: A, no Guttman errors; B, 2 Guttman errors; C, 3 Guttman errors
item       1 2 3 4 5 6 7 8
Person A   + + - - - - - -
Person B   + - + + - - - -
Person C   + + + - + + + -
Why? The number of errors is defined as the number of minus signs to the left of every plus sign

Guttman errors
item     1 2 3 4 5
Peter    1 0 0 1 0
Marvin   1 1 0 0 1
Hajo     0 1 1 1 0
Tom      1 1 1 0 0
…… (other persons)
5 items, 4 persons; assume that the items are ordered according to increasing difficulty!
Then Tom has a perfect Guttman pattern

Which person(s) has (have) a perfect Guttman pattern?
item     1 2 3 4
Susan    1 1 0 1
Tom      1 0 1 1
Marvin   1 0 0 1
Hajo     0 0 1 1
a. Susan
b. Tom and Marvin
c. Marvin

Answer!
item     4 1 3 2
Susan    1 1 0 1
Tom      1 1 1 0
Marvin   1 1 0 0
Hajo     1 0 1 0
The items are now first ordered according to increasing difficulty (p-value), and now you can see that Tom and Marvin both have a perfect Guttman pattern

Statistical versus clinical prediction (chap 5)
Sarbin (1943): prediction of college grade point average using (1)
High school ranks + college aptitude test: r = .45
(2) Judgment of counselors on the basis of interviews, high school ranks and college aptitude test: r = .35
This was (and is) very counterintuitive: more information leads to worse prediction, and expert judgment makes things worse

Meehl (1954)
Judgment: two basic options
1. “Mechanical” or statistical prediction: X1 + X2 + X3 = Xt
2. Expert (“clinical”) judgment: the expert combines information “in the head”

Some very counterintuitive findings …
Statistical methods are more effective than human judgment for combining multiple data features
People do not minimize error, they introduce error
But as long as they can construct a coherent story they will believe in it irrespective of any empirical evidence (cf. Kahneman)
Meehl (1956): “the first rule (..) to predict a patient’s or student’s behavior is carefully to avoid talking to him and the second rule is to avoid thinking of him”

Not discussed in the lectures but still important
What is correction for attenuation?
What other career interest inventories do we have?

Summarizing this lecture
It is important to know some basics about projective tests, but these tests are controversial
Interest and attitude questionnaires have similarities with personality questionnaires
Basic principles and construction of interest and attitude scales are important

Meeting 5
Personality and Clinical questionnaires
Chap. 12: Objective personality questionnaires
Chap. 13: Clinical instruments and methods

Personality and Clinical questionnaires
Things I will discuss today are related to chapters 12 and 13
I will also relate different strategies to construct tests to item response theory, which is discussed in chapter 6
In this lecture and also in lecture 6 I will discuss some strange tests that are often used!
Objective Personality Assessment, mood assessment
Self-report and peer-report questionnaires are often used to determine personality traits
Contexts: health care, clinical psychology, personnel selection and development, research
We measure here typical performance
Objective = scoring does not involve professional training

Example Likert scale
[Figure: example Likert scale]
Non-conditional behavior: not dependent on context

But sometimes contextual behavior
[Figure: conscientiousness questions that depend on context (research job); bipolar scale (= two opposite end points) with answer categories 1 2 3 4]

Contextual questionnaires
There is an increasing interest in questionnaires that are developed for a specific context (for an example, see previous slide)
In this questionnaire conscientiousness is measured in the context of doing PhD research

Contextual questionnaires
The scale indicates whether your personality is suited to becoming a PhD student
Advantage: higher criterion validity than when using unconditional questionnaires like the NEO-PI-R
Drawback: for each job a different questionnaire should be developed

Classification Scheme for Objective Personality Tests
Orientation | Scope of coverage: Comprehensive | Scope of coverage: Specific domain
Normal | Edwards Personal Preference Schedule; Sixteen Personality Factor Inventory; NEO Personality Inventory | Piers-Harris Children’s Self-Concept Scale; Rotter Locus of Control Scale; Bem Sex Role Inventory
Abnormal | MMPI-2; Millon Clinical Multiaxial Inventory-III; Personality Assessment Inventory | Beck Depression Inventory; State-Trait Anxiety Inventory; Suicidal Ideation Questionnaire

Response Sets and Faking
Terms: response sets, response styles, distortion, impression management, socially desirable, faking good, faking bad
Terminology in the literature is not consistent
Basic problem: disentangling scores that reflect true personality traits from scores that reflect response sets etc.
Response Sets and Faking
Occur in clinical settings, but also in personnel selection settings
For example, in selecting students for university, our own research showed that personality questionnaires cannot be used, because students distorted their responses

Response sets, styles
Response set = a person’s tendency to answer questions in a certain way, independent of true feelings (not stable over time)
Response style = consistent and stable tendencies in response behavior that cannot be explained by question content
- For example, always choosing the extreme options, or always saying “agree”

Strategies to detect or minimize distortion
Detection
Check for consistency on the same or similar items (see MMPI validity scales)
Example: Variable Response Inconsistency Scale (VRIN), 67 pairs of items with either similar or opposite content
Inconsistent responding: the same or different responding (depending on the content of the item pair)
Minimizing
Balancing the direction of items (see next slide)
Using the forced-choice method (see next slide)

Balancing directionality
Item | Response in direction of friendliness: True (T) or False (F)
1. I enjoy being with friends. | T
2. I often go places with friends. | T
3. Friends are more trouble than they’re worth. | F
4. I have very few friends. | F

Using forced-choice method
Directions: In each pair, pick the statement (A or B) that best describes you.
1A. I usually work hard.
1B. I like most people I meet.
2A. I often feel uneasy in crowds.
2B. I lose my temper frequently.

Forced choice
Ipsative scoring: items for each of the scales are paired with items on the other scales that are roughly comparable on social desirability, and you then have to select the statement that describes you best
Ipsative means “of the self”: everyone has the same total score, so you compare a person with him/herself

Strategies to personality test development
1. Content strategy
2. Criterion-keying strategy
3. Theory-based approach
4. Factor analysis and item response theory
5.
Combined approaches

1. Content Strategy
- Tests are constructed on the basis of a simple understanding of what we want to measure, based on “common sense”
- Written form of an interview
Example: Woodworth Personal Data Sheet
Aim: to identify army candidates who were not suited to go to war

Examples of items Woodworth sheet
1. Do you usually feel well and strong? YES NO
2. Do you usually sleep well? YES NO
3. Are you often frightened in the middle of the night? YES NO
4. Are you troubled with dreams about your work? YES NO
5. Do you have nightmares? YES NO
6. Do you have too many sexual dreams? YES NO
7. Do you ever walk in your sleep? YES NO
8. Do you have the sensation of falling when going to sleep? YES NO
9. Does your heart ever thump in your ears so you cannot sleep? YES NO
10. Do ideas run through your head so you cannot sleep? YES NO
… (106 more questions)

2. Criterion-keying strategy
Items are selected on the basis of their ability to discriminate between two well-defined groups, for example normal and clinical persons
Advantage: direct and simple
Disadvantage: (1) atheoretical, (2) only applicable when there are well-defined criterion groups, and (3) suggests a clear distinction between groups, which is often not the case

Example: the problem of using no theory
Item content | Item-test correlation
I am content with my job | .31 (select)
I often feel enthusiastic about my job | .37 (select)
My favourite colour is green | .33 (select)
A lot of things bother me | .03 (reject)
Most people are trustworthy | .34 (select)
Without a theory you would also select the item “My favourite …”

MMPI
MMPI: the most often used questionnaire to distinguish “psychiatric patients” from normal persons
Development: scores of different types of psychiatric patients and control groups. Items that distinguished psychiatric patients from control groups became part of the questionnaire.
The Minnesota Multiphasic Personality Inventory (MMPI) was developed in the late 1930s by psychologist Starke R.
Hathaway and psychiatrist J.C. McKinley at the University of Minnesota.

MMPI-2
I will not ask about specifics of the differences between the versions of the MMPI on the exam; I am more interested in general principles (how was it constructed, what do the validity scales measure?)

MMPI-2: Validity scales
L-scale (15 items): high scores indicate that the respondent claims a degree of virtue that is rarely observed (fake good)
Examples:
I never get angry (answering True results in one point)
I like everyone (T)

Validity scales
F-scale: 60 items endorsed by a normal person no more than 10% of the time. Indicates: fake bad
Example: I have a cough most of the time. (T)
K-scale: 30 items, defensiveness: fake good
Example: I have very few quarrels with members of my family. (T)

MMPI-2
Very high scores on the validity scales imply that the scores on the clinical scales cannot be trusted
There are about 10 000 published papers using the MMPI-2, and hundreds of papers are added to this pool every year.
Most used questionnaire in the clinical context
Recent version: MMPI Restructured Form (RF)

Question
The K-scale of the MMPI measures
a. Fake good
b. Fake bad
c. Fake resistance

3.
Theoretical Strategy
Edwards Personal Preference Schedule (EPPS), based on Murray’s theory of needs
Edwards selected 15 needs from the 28 needs described by Murray
Murray: 28 human needs such as achievement, need to conform, need for attention (exhibition) etc.

EPPS
Achievement: need to accomplish tasks well
Deference: to conform to customs and defer to others
Order: to plan well and be organized
Exhibition: to be the center of attention in a group
Autonomy: to be free of responsibilities and obligations
Affiliation: to form strong friendships and attachments
Intraception: to analyze behaviors and feelings of others
Succorance: to receive support and attention from others
Dominance: to be a leader and influence others
Abasement: to accept blame for problems
Nurturance: to be of assistance to others
Change: to seek new experiences and avoid routine
Endurance: to follow through on tasks and complete assignments
Heterosexuality: to be associated with and attractive to members of the opposite sex
Aggression: to express one's opinion and be critical of others

EPPS
Instead of evaluating each statement in relation to a rating scale (Likert scaling!), respondents have to choose between statements according to the extent to which these statements describe their preferences or behavior
Thus also a nice example of a forced-choice questionnaire

4.
Factor analytic strategy (and item response theory)
Factor analysis: identify dimensions (factors) underlying a set of items
Advantage: empirical method to check whether there are basic dimensions underlying human personality
Disadvantage: (1) the result depends on the initial pool of items, (2) different methods sometimes give different results, and it is sometimes not clear what the best method is

Factor analytic strategy
Most famous example: the 16 Personality Factor (16PF) Questionnaire, based on a reduction of 4504 adjectives that describe traits (Cattell); different norms for different groups
http://personality-testing.info/tests/16PF.php
Factor A: cool-warm
Factor B: lower-higher mental capacity
Factor C: ego strength
Factor D: submissive-dominance
etc.

NEO-PI-R
The NEO-PI-R was designed to provide a general description of normal personality relevant to clinical, counseling and educational situations.
Neuroticism (Anxiety, Hostility, Depression, Self-Consciousness, Impulsiveness, Vulnerability)
Extraversion (Warmth, Gregariousness, Assertiveness, Activity, Excitement-Seeking, Positive Emotions)
Openness to Experience (Fantasy, Aesthetics, Feelings, Actions, Ideas, Values)
Agreeableness (Trust, Modesty, Compliance, Altruism, Straightforwardness, Tender-Mindedness)
Conscientiousness (Competence, Self-Discipline, Achievement-Striving, Dutifulness, Order, Deliberation)

5. Combined approaches
Most tests are now based on a combination of some theory, content and factor analytic approaches

Brief Interlude
So we discussed a number of strategies to construct personality/clinical tests and some well-designed tests.
Before we turn to a technique that is related to factor analysis (IRT), it is good to realize that in practice many bad tests (that is, tests without empirical evidence) are also often used.
I mention two tests:
The MBTI
The DISC colour test (will be discussed in lecture 6)

The Myers-Briggs Type Indicator test
Very popular questionnaire in human resources management
Objective scoring
But is it a good test?

What is the Myers-Briggs Type Indicator Test (MBTI)?
Developed by I. B. Myers and K. C. Briggs
Based on Carl Jung’s theory of psychological types

From the (commercial!) site
[Figure: screenshots from the MBTI site]

Theory behind MBTI
Jung: 4 ways in which we experience the world
1. Sensing (knowing through sight, hearing, touch)
2. Intuition (inferring what underlies sensory inputs)
3. Feeling (focusing on emotional aspects of experience)
4. Thinking (reasoning or thinking abstractly)

Dimensions
1. Extraversion/Introversion dichotomy (E/I)
2. Intuition (“impressions and patterns”)/Sensing (“what happens in my environment”) (N/S)
3. Thinking/Feeling (T/F)
4. Judging/Perceiving (J/P)
Four dichotomies: 2 x 2 x 2 x 2 = 16 personality types

Theory behind MBTI
Jung says: we all have our preferences, and these underlie our interests, needs, values and motivation
OK, but what does empirical research say about this?

Structure MBTI
93 items
For example: “When I go out”
- I plan my activities
- I just do what comes to mind

Problems:
Not valid; research did not support the classification into 16 types
Relation with other tests measuring similar constructs is weak
Low test-retest reliability
Factor structure is unclear

Conclusion
Thus, through empirical psychometric research we can show that using this questionnaire cannot be defended
It is “dangerous” to use this test because you classify persons into types that cannot be defended on the basis of empirical findings

But why is it then so popular?
Perhaps:
It has intuitive appeal; it is handy to classify persons into different “types”
Some proponents of the MBTI say: “the scores are not determinate, they are subject to feedback”
“When people hear about their personality MBTI scores it allows them to understand an aspect of their personality without needing to add judgment of what is sick or well, good or bad”

Strategies to personality test development
Back to the scientific underpinnings of testing
Related to factor analysis is item response theory (IRT) (see chapter 6! Perhaps you remember)
Often used to construct tests
Central in IRT is the item characteristic curve (ICC)
The ICC can be used to investigate the quality of items and tests
The ICC gives the relation between a trait value and the probability of answering an item correctly

Example ICCs
[Figure: example ICCs]

IRT
X-axis: the trait value (theta = Greek symbol θ) is a standardized transformed total score
Thus, theta may represent “traits” like intelligence, personality, psychopathology, but also “knowledge about statistics”

IRT
Interpretation: for example, theta = +1 means 1 SD above the mean score in the population
Y-axis: for each theta value you can compute the proportion of correct answers to an item; this is “the probability of a correct response”

IRT
Parameters are used to describe the curve: difficulty (b), slope (a), guessing (c)
1-, 2-, 3-parameter models
1-parameter model: only the difficulty
2-parameter model: difficulty and slope
3-parameter model: difficulty, slope, and guessing

IRT
Item difficulty (b) = the point on the trait scale where the probability of giving a correct answer equals .5 (see figure)
Item discrimination (a) = the slope of the ICC; the higher the slope, the better an item discriminates between persons with different trait scores

IRT
Guessing parameter (c) = lower asymptote, that is, the probability of giving a correct answer if you have no knowledge (cognitive testing) or a very low trait value (non-cognitive testing)
Thus, you can use
ICCs to see where an item discriminates between different theta values
This is an extension of classical test theory (CTT), where we only had the (corrected) item-total correlation

More on IRT
You can also use IRT to inspect local reliability using the information function
“Local” means “as a function of the trait”
Practical value: you can use information functions to select items

Question
Which item is the most difficult item?
a. Item A
b. Item C
c. Both items C and D

Remember from lecture 2
We already talked about reliability
Different methods to estimate the reliability of a test
But note that this was only one estimate for a test; for example, the reliability equaled .78
Based on this estimate you can calculate the SEM and determine a confidence interval for the individual scores on a test

IRT and measurement precision
Now back to item response theory: using IRT you can estimate the measurement precision (“reliability”) as a function of the trait value; this is not possible using classical test theory
This is done using the item information functions (next slide)

Item information functions for two hypothetical items
[Figure: item information functions for two hypothetical items]
Item A: high measurement precision around θ = -1
Item B: measurement precision around θ = 2

Information and SE
We can obtain information about the precision of the estimate of θ through the information function
SE(θ) = 1/√I(θ)
Thus, contrary to classical test theory, measurement precision depends on the trait level

Item and test information
Test information is the sum of the item information functions
Thus we can select items that have optimal information in a specified area of the trait scale

Empirical item information functions from own research
[Figure: item information curves; x-axis: Ability (−4 to 4), y-axis: Information (0.0 to 0.6)]

Test Information Function
[Figure: test information function]
I = 3, SE = .58
I = 5, SE = .45

Interesting application: target information function
You can specify a priori the SE you are interested in, for example to distinguish groups of persons or individuals
SE (θ) = .25 then
information should be 16
You can select items for your test until you reach SE(θ) = .25
61

Target information function
Interesting for using cut-off scores in personnel selection or in student selection
Assume that you want to select persons 1 SD above the mean; then you should construct a test that has the smallest SE (= highest information) around θ = 1
62

To summarize …
We discussed different personality tests and clinical scales
We emphasized different types of test construction
We also related topics mentioned in chapter 6 of the book to topics discussed in chapters 12/13
We discussed some new ways of test construction based on item response theory!
63

faculty of behavioural and social sciences
06-10-2022 | 1
Intelligence Testing
Juan Camilo Arboleda
Lecture 4
October 3, 2024

Overview
1. Definition of intelligence
2. Theories of intelligence
3. Individual intelligence tests
4. Intellectual disability
5. Trends in individual testing

What is intelligence?
› Based on Gottfredson (1997), a very general mental capacity that involves the ability to:
  ▪ Reason
  ▪ Plan
  ▪ Solve problems
  ▪ Think abstractly
  ▪ Comprehend complex ideas
  ▪ Learn quickly
  ▪ Learn from experience

Correlates of intelligence
› Correlations between performance on an intelligence test and some external criterion:
  ▪ Academic grades (r = .50)
  ▪ Job performance (r’s between .30 and .60)
  ▪ Quality of life (e.g., health indices)

Theories of intelligence
› Spearman’s two-factor theory
  ▪ Based his theory on correlations between tests of simple sensory functions
  ▪ Concluded that correlations were generally high
    - Underlying factor “g” (general mental ability)
    - Specific test variance “s” (plus error variance)

Theories of intelligence
› Thurstone’s multiple-factor theory
  ▪ Concluded that correlations among tests were low
    - Multiple largely independent factors
  ▪ Original primary mental abilities (PMA):
    - Spatial
    - Perceptual
    - Numerical
    - Verbal
    - Memory
    - Words
    - Induction
    - Reasoning
    - Deduction

Theories of intelligence
› Other (extreme) multifactor theories of intelligence:
  ▪ The structure of intellect model by J.P. Guilford
    - 180 facets of intelligence
    - Has not stood the test of time
    - But:
      - divergent thinking: different solutions/alternatives, e.g., a company slogan
      - convergent thinking: most appropriate solution, e.g., a medical diagnosis
› So what? “G” or multiple factors?

Cattell’s Gf-Gc theory
Fluid and crystallized intelligence theory
  ▪ Fluid intelligence (Gf): raw mental capacity
  ▪ Crystallized intelligence (Gc): everything learned
  ▪ Gf: potential vs.
Gc: actual
  ▪ Hierarchical theory: both Gf and Gc consist of several components

Carroll’s three-stratum theory
› Carroll summarized existing hierarchical models by analyzing hundreds of factor analyses on human abilities

Developmental theories
› So far, we considered psychometric theories
  ▪ Depend heavily on the analysis of relationships among specific tests
› Developmental theories focus on cognitive development with age and experience

Developmental theories
› Key features:
  ▪ Development is stage-based and stages are qualitatively different from each other
  ▪ Fixed order of stages for everyone
  ▪ Stages cannot be skipped and are irreversible
  ▪ There is usually a relationship between stages and age

Jean Piaget’s theory
› The most prominent developmental theory of intelligence
› 4 main stages:
  ▪ Sensorimotor (0-2 years)
    - lack of object permanence
  ▪ Preoperational (2-6 years)
    - lack of principles of conservation
  ▪ Concrete operational (7-12 years)
    - Use of principles of conservation and reversibility
  ▪ Formal operational (12+ years)
    - Mature cause and effect reasoning
› https://www.youtube.com/watch?v=TRF27F2bn-A

Information processing theories
› Emphasize not what is known but how content is processed
› Intelligence is measured with elementary cognitive tasks (ECTs)
  ▪ A relatively simple task requiring mental processing
  ▪ May provide a culture- and education-free measure
    - e.g., reaction time
› Semantic verification task

Jensen’s theory
› Fluid intelligence (“g”) influences:
  ▪ intelligence test performance
  ▪ But also information processing factors

Sternberg’s triarchic theory
› Three sub-theories:
  ▪ Componential (mental processes)
  ▪ Experiential (novelty-automation)
  ▪ Contextual (environment interaction)
› “Practical intelligence”
› Widely criticized theory

Gardner’s MI theory
› Proposed several types of intelligence:
  ▪ linguistic
  ▪
logical-mathematical
  ▪ spatial
  ▪ musical
  ▪ bodily-kinesthetic
  ▪ intrapersonal
  ▪ interpersonal
Intelligence?
  ▪ naturalist
  ▪ spiritual
  ▪ existential
  ▪ moral
  ▪ …
› Popular theory in educational settings

Summary of theories
1. Classical theories
  - Spearman’s “g”
  - Thurstone’s PMA
2. Hierarchical theories
  - Cattell’s Gf-Gc theory
  - Carroll’s three-stratum theory
3. Developmental theories
  - Piaget’s four stages of cognitive development
4. Information processing and biological theories
  - Jensen’s information processing theory
  - Gardner’s MI theory

Theories and Testing in Practice
› Hierarchical models most prevalent in widely used tests
  ▪ Gf-Gc theory
  ▪ Carroll’s three-stratum theory
› Test reports usually contain:
  ▪ a total score indicative of “g”
  ▪ sub-scores for group factors such as verbal, spatial, numerical, memory, etc.
› Less influence on practice:
  ▪ Information processing models
  ▪ Developmental theories

Individual intelligence tests
› Characteristics:
  ▪ Individual administration
  ▪ Advanced training necessary for administration
  ▪ Wide range of age and ability, start and stop rules
  ▪ Requires establishing rapport
  ▪ Usually free-response format
  ▪ Immediate scoring of items
  ▪ Duration of roughly one hour
  ▪ Opportunity for observation

Example items

Example items

The Wechsler scales
› Wechsler:
  ▪ Was unhappy with the Stanford-Binet test
    - Wanted a test for adults
    - Wanted a test with more than a general score
  ▪ Invented many different scales (children, adolescents, adults)
  ▪ Defined intelligence as: “the aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment” (Wechsler 1958, p.
7)

WAIS 4 - Structure
› Four index scores, determined by factor analysis

WAIS 4 – Example items

WAIS 4 - Administration
› Rapport building
› Following start rule for each subtest
› Administration proceeds according to stop rule
› Development towards multiple-choice items

WAIS 4 – Scoring and Norming
Subtest raw scores → Scaled scores (M = 10, SD = 3) → Composite scores (M = 100, SD = 15)
Separate conversions for 13 different age groups and one reference group

WAIS 4 – Score transformation

WAIS 4 – Psychometric Info
› Standardized on a stratified sample of 2450 adults, representative of the U.S. population, aged 16-89
› Stratification variables: age, sex, race/ethnicity, education level, geographic region
› 200 healthy cases per 12 age groups
› Internal consistency & test-retest reliability > .90
› Standard error of measurement: 3-5 scaled score points for index scores, 2 points for FSIQ, on average

The Stanford-Binet
› First original scale appeared in 1905
› For many years, it was the measure for intelligence
› Nowadays overshadowed by Wechsler scales, but still used in clinical practice
› Items were organized by age level
› Scale yielded a single total score
› A couple of translations and revisions occurred afterwards

The Stanford-Binet (SB5)
› Radical revision of earlier scale in 2003
  ▪ Items organized by subtest
  ▪ Multiple scores next to total score
  ▪ Scoring similar to WAIS
    - FSIQ
    - Composite scores:
      - Verbal & Nonverbal IQ
    - Index scores:
      - Fluid reasoning
      - Knowledge
      - Quantitative reasoning
      - Visual-spatial processing
      - Working memory

The Stanford-Binet (SB5)
› Routing tests: are administered first to identify the appropriate level of other tests to administer
› Very broad age range: 2-85+ years
› Psychometric properties are comparable to Wechsler scales

Other tests
› Peabody Picture Vocabulary Test
  ▪ Quick assessment of mental ability
  ▪ Takes 10-15 minutes to administer
  ▪ 228 items
  ▪ Example item:
    - “show me the ball”
    - “show me the sphere”

Other tests
› The Wechsler memory scale (WMS)
  ▪ Specific measure of memory
  ▪ Designed for age 16-90
  ▪ Primarily a measure of short-term memory
  ▪ Important distinctions:
    - Length of recall (immediate or delayed)
    - Type of input (auditory or visual)

Other tests
› WMS items

Intellectual Disability
› Adaptive behavior = how well a person copes with ordinary life
  ▪ Differs by age
› Intellectual disability is characterized by:
  ▪ Significant limitations in intellectual functioning and in adaptive behavior (e.g., IQ < 70)
  ▪ An onset before age 18

The VABS scales
› Vineland Adaptive Behavior Scales
  ▪ Most widely used measure of adaptive behavior
› Two features that distinguish it from the Wechsler scales
  ▪ Measures typical instead of maximum performance
  ▪ Information provided by external observer (e.g., parent)

The VABS scales
› Administration is similar to a semi-structured interview

Trends in individual testing
1. Increased use of hierarchical models to guide test development and interpretation
2. Greater complexity in both structure and scoring
3. Additional materials for remedial instruction
4. Increased use of briefer tests
5. Development and documentation of excellent norms
6. Attention to test bias
7. Increased frequency of revision
  ▪ “study your psychological testing book diligently” ☺

Thank you!

Lecture 4: Validity (part 2)

Decisions and test use
A test is often used with a particular purpose
Decisions!
Fail/pass
Particular diagnosis
Admitted or rejected
Particular treatment
1

Decisions and test use
The practical importance of a test depends on the quality of the decisions made on the basis of the test
To determine what the test adds to the decision: compare the decisions made with and without the test
Utility
2

Example of the contribution of a test
Critics of the SAT admission test: the SAT gives no adequate description of all the qualities of a student
Should we then not use the SAT?
Compare decisions with and without using SAT scores (with other decision procedures)
Random allocation of potential students?
Teachers use, for example, grades and behavior in the class to make a decision
3

Contribution test scores
The contribution of a test is not equal to its predictive validity
Also without the test, decisions are often better than chance
Take care of context or circumstances
4

Utility models
The contribution of a test to the decisions made partly depends on context factors
Base rate
Proportion of applicants that is suited for the job within the group of applicants
Proportion of persons with a depression in the population
Selection ratio
Proportion of applicants that is hired
Proportion of persons that gets a diagnosis of depression
5
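The base rate and selection ratio described above, together with the success ratio (the proportion of hired applicants who turn out to be successful), can be read off a 2x2 decision table. The following is a minimal sketch in Python, not part of the lecture material: the class name `DecisionTable`, its field names, and the counts used are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DecisionTable:
    """Counts in a 2x2 selection table (names are hypothetical, for illustration)."""
    successful_rejected: int      # suited for the job, but rejected by the test
    successful_hired: int         # suited for the job and hired
    unsuccessful_rejected: int    # not suited and rejected
    unsuccessful_hired: int       # not suited, but hired

    @property
    def total(self) -> int:
        return (self.successful_rejected + self.successful_hired
                + self.unsuccessful_rejected + self.unsuccessful_hired)

    def base_rate(self) -> float:
        # Proportion of all applicants that is suited for the job.
        return (self.successful_rejected + self.successful_hired) / self.total

    def selection_ratio(self) -> float:
        # Proportion of all applicants that is hired.
        return (self.successful_hired + self.unsuccessful_hired) / self.total

    def success_ratio(self) -> float:
        # Proportion of hired applicants that is successful.
        return self.successful_hired / (self.successful_hired + self.unsuccessful_hired)


# Illustrative counts for 100 applicants (an assumption for this sketch).
table = DecisionTable(successful_rejected=2, successful_hired=8,
                      unsuccessful_rejected=75, unsuccessful_hired=15)
print(f"base rate       = {table.base_rate():.2f}")
print(f"selection ratio = {table.selection_ratio():.2f}")
print(f"success ratio   = {table.success_ratio():.2f}")
```

Comparing `base_rate()` with `success_ratio()` gives a first impression of what the test adds over selecting by chance.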
Utility models
Aim: to optimize the success ratio
Proportion of persons that is successful
Proportion of diagnosed persons that really has a depression
6

Utility on the basis of a table
The table contains all applicants
Aim: to hire successful candidates and to reject unsuccessful candidates
But the test is not perfect, thus also incorrect decisions

Work performance (criterion)   Decision on the basis of test score
                               Fail (reject)         Pass
Successful                     A (false negative)    B (hit)
Unsuccessful                   C (hit)               D (false positive)

A: incorrectly rejected
B: correctly hired
C: correctly rejected
D: incorrectly hired
5

Base rate
Proportion of persons that is suited for the job within the applicant pool
This equals the success ratio when selecting by chance
$\text{Base rate} = \dfrac{A + B}{A + B + C + D}$

For example:
Job performance   Fail (reject)   Pass
Successful        2               8
Unsuccessful      75              15

$\text{Base rate} = \dfrac{2 + 8}{2 + 8 + 75 + 15} = .10$
By chance 10% successful
6

Selection ratio
› Proportion of applicants that is accepted for the job
› How ‘restrictive’ is the selection?
$\text{Selection ratio} = \dfrac{B + D}{A + B + C + D}$

Example:
Job performance   Fail (reject)   Pass
Successful        2               8
Unsuccessful      75              15

$\text{Selection ratio} = \dfrac{8 + 15}{2 + 8 + 75 + 15} = .23$
23% of the applicants is accepted
7

Success ratio
› Proportion of the hired applicants that is successful
$\text{Success ratio} = \dfrac{B}{B + D}$

Example:
Job performance   Reject   Pass
Successful        2        8
Unsuccessful      75       15

$\text{Success ratio} = \dfrac{8}{8 + 15} = .35$
35% of the hired people is successful
8

Utility/incremental value test
Compare base rate with the success ratio
Does the test have incremental value?
Base rate was 10% successful
Success ratio using the test was 35%

Job performance   Reject   Pass
Successful        2        8
Unsuccessful      75       15
9

Role predictive validity
Distribution of persons in the table
The higher the predictive validity:
The fewer the persons in boxes A and D (misses)
The more persons in boxes B and C (hits)

               Reject                Pass
Successful     A (false negative)    B (hit)
Unsuccessful   C (hit)               D (false positive)
10

Incremental value of a test
A test has little incremental value when:
The base rate is very high or very low
Almost everyone can do the job or almost no one can do the job: then a test is not useful
The selection ratio is high
When almost everyone is hired: a test is not useful
11

Incremental value of a test
Even tests with relatively low validity have incremental validity when:
The base rate is not very high or low (around .50 is ideal)
There are enough persons suited/not suited to make a distinction between the two
The selection ratio is low
Almost nobody is selected
Only the very best are selected

Estimation of the incremental value of a test
Taylor and Russell (1939)
They provide tables for the estimation of the incremental value of using test scores.
The tables provide the success ratio for a particular base rate, selection ratio and predictive validity (r or R).
The tables can also be used to obtain insight into the success ratio when predictive validity is increased
12

Example Taylor-Russell table
What is the success ratio?
Assume: base rate = 80%
Assume realistic predictive validity r = .40
Incremental value is moderate when the selection ratio is low (from 80% to 95% success)
Incremental value is small when the selection ratio is high (from 80% to 83% success)
If we invest in a test with r = .50, only 1 or 2% increase
13

Example 2 Taylor-Russell table
What is the success ratio?
Assume: base rate = 5%
Assume realistic predictive validity (r = .40)
Moderate incremental value when the selection ratio is low (from 5% to 12%)
No added value when the selection ratio is large (stays 5%)
Thus: when we have an extreme base rate and a high selection ratio, even a test with high predictive validity does not add to the success ratio
14

Question
A test is used to select a candidate for a call center; this test has a predictive validity of r = .35. Every year a company gets 400 candidates of whom 200 get a job. The base rate (thus without using the test) is 80%. How many will be successful in the job when the test is used?
a. 89
b. 168
c. 178
15

Question: answer
Selection ratio = 200/400 = .50
Success ratio = .89
A test is used to select a candidate for a call center; this test has a predictive validity of r = .35. Every year a company gets 400 candidates of whom 200 get a job. The base rate (thus without using the test) is 80%. How many will be successful?
89% of the 200 are successful: 200 * 0.89 = 178
16

Considerations test use
Should we use a test in a selection procedure?
Costs/benefits
Unsuccessful employees/students are expensive
Salary, extra time to invest, etc.
Selection procedures require investments
Time, organization, buying a test, administration costs, etc.
Benefits (expected change in success ratio) versus costs of the procedure: economic utility
17

Prediction and Decisions
Predictive validity
Check the quality of your data, for example:
Criterion measures
Range restriction
High predictive validity does not always lead to better decisions
Check relations with existing information (see next slide)
Check the context in which the test is used
21

Output SPSS MULTIPLE REGRESSION
Y = mean grade; X1 = cognitive test, X2 = procrastination
Change in R² = .009 as a result of adding procrastination
22

Meeting 3
Factor analysis (literature: see document on Nestor)
Validity (Chapter 5 book)
1

Factor analysis
Validity: Does the test measure what it is supposed to measure?
Factor analysis is one of the methods that can be used to demonstrate internal structure validity
2

Factor Analysis
Aim of factor analysis:
Summarize many variables (items) into fewer variables (factors)
... and retain as much information as possible
Gives you a better idea of what you are measuring
Factor = weighted sum of item scores or subtest scores
3

Central idea
The key concept of factor analysis is that there are similar patterns of responses to multiple variables (items) because they are all associated with a latent (= not directly observable) variable: the factor
The factor may be conscientiousness, verbal intelligence, knowledge of test theory, etc.
4

Factor analysis
Exploratory
What is the structure of the test?
Central question: Suppose you construct an intelligence test. How many subfactors can you distinguish on the basis of the test data (e.g., verbal, numerical)?
Confirmatory
Can you confirm the assumed structure of the test?
Central question: Suppose you construct an intelligence test that should measure verbal and numerical intelligence. Can you distinguish those subfactors?
5

Factor analysis
Component analysis
Principal Component Analysis (PCA): exploratory
Multiple Group Method (MGM): confirmatory
Common factor analysis
Exploratory or confirmatory
Difference: How do we summarize information from the items?
Component analysis: sum of weighted observed variable scores
We only discuss PCA and MGM
6

Component analysis
Factor score or component score = weighted sum of item scores or subtest scores
$f_{iq} = b_{1q} z_{i1} + b_{2q} z_{i2} + b_{3q} z_{i3} + \dots + b_{Jq} z_{iJ}$
$f_{iq}$ = factor score of person i for factor q
$b_{jq}$ = weight of variable j for factor q
$z_{ij}$ = standardized score of person i on variable j
NOTE: the terms factor scores and component scores are both used in the context of component analysis
7

Component analysis
Step 1: Determine the weights $b_{jq}$
MGM: chosen by the researcher (0/1)
PCA: optimal estimation on the basis of observed data
Step 2: Correlations of all variables with all factor scores
For some models we call these correlations loadings on a factor
8

Component analysis
Step 3: Interpretation
Variables with a high correlation with the same factor measure similar content
Label the factor
Uncorrelated factors: orthogonal
Interpretation: constructs are independent
Correlated factors: oblique
Interpretation: dependent constructs
9

Component analysis
Step 4: Proportion explained variance
How well do the factors represent the variation in the data?
Variance accounted for (VAF), usually between .30 and .80
More factors = higher VAF
10

Variance accounted for
How well does the model (MGM, PCA) fit (“describe”) the data?
Or: How well does my factor score describe my data?
Assume I have an intelligence test, “verbal” and “nonverbal”
If the variance explained is high, then a score on the factor “verbal intelligence” and on the factor “nonverbal intelligence” predicts the item scores very well
11

VAF
“Explained variance” or “Variance accounted for” will also be discussed in Stats 2
12

Multiple Group Method (MGM)
Are the expected groupings of variables found in the data?
Weights are 1 or 0
Correlation between (standardized) variables and proposed factors
Expected: for factor q the variables with weight 1 correlate higher than variables with weight 0.
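The MGM expectation above can be checked numerically. What follows is a minimal sketch in plain Python under several assumptions that are not part of the lecture: the data are simulated (two independent latent factors with three items each), raw rather than standardized item scores are summed, and the helper names `pearson`, `mgm_correlations` and `simulate` are hypothetical. For each item it computes the correlation with the factor sum score (item-test) and, for items that belong to the factor, the correlation with the sum of the remaining items (item-rest).

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mgm_correlations(scores, groups):
    """Correlate every item with every proposed factor (0/1 weights).

    scores: list of persons, each a list of item scores.
    groups: dict mapping a factor name to the item indices with weight 1.
    Returns {factor: [(item_test, item_rest or None), ...]}, one tuple per item.
    """
    n_items = len(scores[0])
    columns = [[person[j] for person in scores] for j in range(n_items)]
    result = {}
    for name, members in groups.items():
        # Factor score = unweighted sum of the items that carry weight 1.
        factor = [sum(person[j] for j in members) for person in scores]
        rows = []
        for j in range(n_items):
            item_test = pearson(columns[j], factor)
            item_rest = None
            if j in members:
                # Remove the item itself from the sum before correlating.
                rest = [f - x for f, x in zip(factor, columns[j])]
                item_rest = pearson(columns[j], rest)
            rows.append((item_test, item_rest))
        result[name] = rows
    return result


def simulate(n_persons=1000, seed=0):
    """Simulated data: items 0-2 reflect one latent factor, items 3-5 another."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        f1, f2 = rng.gauss(0, 1), rng.gauss(0, 1)
        data.append([f1 + rng.gauss(0, 1) for _ in range(3)]
                    + [f2 + rng.gauss(0, 1) for _ in range(3)])
    return data


table = mgm_correlations(simulate(), {"factor_1": [0, 1, 2], "factor_2": [3, 4, 5]})
for name, rows in table.items():
    print(name, [(round(t, 2), None if r is None else round(r, 2)) for t, r in rows])
```

In this simulated setting the item-test correlation of an item with its own factor is inflated because the item is part of the sum, which is why the item-rest version is usually the more informative number.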
Be careful about the correlation of an item with the subtest including itself
13

Example MGM
Jealousy scale
15 polytomous items
3 expected subscales:
Reactive jealousy: items 1-5
Anxious jealousy: items 6-10
Possessive jealousy: items 11-15
To analyze these groupings: MGM on a data set of 1345 persons
14

Example MGM
The factor “reactive jealousy” was computed by taking the sum score on items 1-5
“anxious jealousy” by taking the sum score on items 6-10
“possessive jealousy” by taking the sum score on items 11-15
15

Example MGM
Item-rest correlation = the correlation of an item with the factor sum score, not taking that item into account in the factor
Item-test correlation = the correlation of an item with the sum score on all items in the factor
16

MGM results
Item-factor correlations: item-rest factor correlations, with item-test factor correlations in parentheses

Item  Label                  Reactive     Anxious      Possessive
1     Flirt                  .59 (.79)    (.11)        (.31)
2     Private matters        .50 (.69)    (.07)        (.25)
3     Sexual contacts        .39 (.61)    (.03)        (.16)
4     Intimate dancing       .66 (.80)    (.10)        (.27)
5     Kiss on mouth          .54 (.73)    (.09)        (.25)
6     Attractive others      (.15)        .62 (.77)    (.35)
7     Sexual relationship    (.09)        .67 (.80)    (.28)
8     Sexual interest        (.05)        .72 (.83)    (.27)
9     Other sex              (.13)        .64 (.78)    (.43)
10    Leave me               (.02)        .61 (.76)    (.28)
11    Contact to other sex   (.24)        (.36)        .64 (.79)
12    Friendship others      (.24)        (.26)        .60 (.76)
13    Look at others         (.24)        (.25)        .49 (.68)
14    Claiming               (.32)        (.27)        .50 (.69)
15    No freedom             (.26)        (.37)        .65 (.79)
17

Example MGM
So these results are in agreement with what we expected: the item-test (or item-rest) correlations are higher for the items in the factor where an item belongs than for the items in the other factors
18

Question
In the table below you find correlations between 6 items and 2 factors. Which labels can you put on factor 1 and factor 2?

Variable            F1     F2
Neck problems       .05    .50
Headache           -.09    .67
Feeling useless     .72   -.03
Back pain          -.10    .73
Restless            .81    .08
Feeling depressed   .65    .07

a. 1: Psychological complaints; 2: Physical complaints
b.
1: Physical complaints; 2: Psychological complaints
c. The different complaints cannot be distinguished
19

Answer: (a)

Variable            F1     F2
Neck problems       .05    .50
Headache           -.09    .67
Feeling useless     .72   -.03
Back pain          -.10    .73
Restless            .81    .08
Feeling depressed   .65    .07
20

Principal Component Analysis (PCA)
Find a number x of factors that explain as much variance as possible
Find “optimal” weights for these variables
21

PCA
Factors are found one by one
Find weights so that the factor explains maximum variance
Result: first principal component (PC)
Then, for all variables, the residuals are computed by subtracting the part of the variables that is explained by the first PC
Find weights from the residuals that explain maximum variance
Result: second principal component
22

Properties PCA
The first PC explains most variance, then the second PC, etc.
All PCs are uncorrelated
Factor-item correlations (= factor loadings) are sometimes difficult to interpret
23

Example
                       After rotation
Item   PC1    PC2      F1     F2
1     -.73    .60     -.95    .04
2     -.78    .52     -.94   -.06
3     -.78    .53     -.95   -.05
4      .59    .72      .04    .93
5      .60    .70      .05    .92
6      .55    .77     -.03    .95
High loadings on both PCs
Solution: rotation
24

Rotation
VAF = variance accounted for
Aim PCA: to explain as much variance as possible
Total VAF as large as possible
This says nothing about the VAF of separate factors
Aim rotation: substitute the PCs by new factors with in total the same amount of VAF
But other VAF for separate factors
25

[Figure: items plotted as vectors, v1 = (-.73, .60), v2 = (-.78, .52), etc.; see slide 24]
26

How do we determine the number of principal components?
1. The researcher may have an idea about the number of factors
The procedure stops when this number of factors is reached
27

How do we determine the number of principal components?
2. On the basis of criteria
Kaiser’s criterion
Eigenvalue > 1
The eigenvalue is related to the amount of explained variance associated with the component
This criterion often overestimates the number of PCs
28

How do we determine the number of principal components?
Scree criterion
As much VAF as possible with the least number of factors
Number of factors before the ‘bend’ in the scree plot
29

Scree plot
[Figure: scree plot; x-axis: number of components, y-axis: eigenvalues. Scree criterion: 3 components (before the bend). Kaiser criterion: 3 components (eigenvalue larger than 1).]
30

Example
SAQ: Statistics Anxiety Questionnaire
23 items
2571 students
5-point Likert scale
“Statistics makes me cry”
“SPSS always crashes”
Can we distinguish different types of statistics anxiety (= can we distinguish different factors)?
31

What do you get from PCA output in SPSS?
Total Variance Explained
% explained variance and eigenvalues
Scree plot
Rotated Component Matrix
Loading/structure matrix in SPSS
32

SPSS component analysis
33

SPSS
34

First component: % of variance = 7.29/23 = .31696, that is, 31.696%
35

SPSS output
Thus, using the eigenvalue criterion we conclude that there are four components for “statistics anxiety”, because there are 4 components with an eigenvalue larger than 1
And if we use the scree plot criterion?
That is: the number of factors before the “bend” (see next slide)
36

Using the scree plot it is not completely clear how many factors there are
37

Interpretation: four components
- fear of computers
- fear of statistics
- fear of mathematics
- fear of peer evaluations
38

Validity (chapter 5)
Validity: Does the test measure what it is supposed to measure?
So: not a straightforward definition
Depends on the aim of the test, how it is used!
Why do we use a test? Different aims:
1. Test as a predictor for behavior/performance
Admission to university, job selection, etc.
2. Test as an operationalization of a psychological construct
Hypothetical construct: e.g., intelligence, personality trait
Therefore, two types of validity:
1. Predictive validity (prediction of criterion behavior)
2. Construct validity (measuring a trait)
40

Predictive validity
Question is: How well does test score X predict criterion Y?
How do we research predictive validity?
We need: test scores and criterion data in a representative sample
Determine the relation between test scores and criterion
Correlation between test score and criterion score = validity coefficient r(X,Y)
Multiple predictors: R (multiple correlation) = validity coefficient
41

Different types of predictive validity
Predictive validity
To what extent are the predictions confirmed by the criterion data obtained in the future?
r (correlation), R (multiple correlation), R² (explained variance)
Concurrent validity
To what extent do the test results agree with criterion data obtained at the same time?
r (correlation), R (multiple correlation), R² (explained variance)
42

Types Predictive validity
Incremental validity
Improvement of prediction on top of information from existing predictors
ΔR² (additional explained variance)
43

Example College Admission
Aim: predicting which candidates are most successful during the study
One of the predictors: Math test
Because: Statistics/Methods in the psychology curriculum
Does the math test score predict statistics grades?
44

Example
649 students
Correlation math test with statistics grade: r = .40
Explained variance R² = .16
45

Predictive validity in practice
Predictive validity is often not larger than r = .60
Explained variance R² = .36
Does not seem much, but:
We can sometimes explain a reasonable amount of variance with one or a small number of test scores
Rules of thumb for r: .10 small, .30 moderate, .50 large
46

Criticism (psychological) test scores
Measurements and predictions are far from perfect
Dawes (1979): “In response, I can only point out that 16% of the variance is better than 4% of the variance. To me, however, the fascinating part of this argument is the implicit assumption that that the other 84% of the variance is predictable and that we can somehow predict it.”
47

Reasons low validity coefficients
1. Low reliability of the criterion: underestimation of predictive validity
2.
Ignoring the different meanings of the criterion
e.g., successful performance as judged by a manager: what is a successful employee?
May be different per organization/function
48

Reasons low validity coefficients
3. Assuming a linear relation
The relation may not be linear: underestimation of predictive validity
Consider the relation
49

Linear relation?
Linear relation: more is always better
Not always the case: check!
Le et al., 2011: Personality and task performance
50

Reasons low validity coefficient
4. Ignoring complex group composition
Moderator variables
Sometimes for different subgroups we need different subsamples
5. Criterion is more complex than assumed
Complex criterion behavior is often difficult to operationalize
The criterion measure is too general, important subtleties are lost
6. Range restriction
There are only criterion scores from the selected group: underestimation of validity
51

Range restriction
Often encountered problem when determining the predictive validity in selection
Selection: only persons with a high score on the predictors are selected
Effect: there are only criterion scores Y (job success/study success, etc.) available for the highest scoring candidates
Reduced spread in X (and often also in Y as a result)
Underestimation of the predictive validity!
52

Example range restriction
Math test and statistics grade
We only admit students with at least half of the Math questions correct (X ≥ 15)
[Figure: two scatter plots of statistics grade against math test score]
All 649 students: correlation math test with statistics grade r = .40
434 admitted students: correlation math test with statistics grade r = .29
53

Question
A questionnaire about study behavior is used to select candidates for Medicine. A researcher would like to investigate the predictive validity of the questionnaire. Therefore, she determines the relation between the questionnaire scores and the grades obtained during the study. Of the candidates who did the selection procedure approximately 50% is being selected.
The correlation on the basis of this selection procedure probably is
a. an underestimation of the predictive validity.
b. an accurate estimation of the predictive validity.
c. an underestimation of the incremental validity.
54

Incremental validity
What is the additional value of a test on top of existing information?
Important here:
Correlation with criterion Y (as high as possible)
But also: correlation with existing predictors X
As low as possible
Because then the test explains unique variance
If not, no improved prediction
55

Incremental validity
A test with a relatively low correlation with the criterion can sometimes add important information on top of other predictors: when the relation with existing predictors is low
56

Incremental Validity
[Figure: two diagrams of the overlap of predictors X1 and X2 with the criterion; one with a high correlation between X1 and X2, one with a low correlation between X1 and X2]
57

Low relationship between predictors
[Figure: criterion, X1, X2; low correlation between X1 and X2]
Predictor 2: explains a considerable unique amount of the criterion
So: has added value on top of predictor 1
High incremental validity (ΔR²)

Strong relationship between predictors
[Figure: criterion, X1, X2; high correlation between X1 and X2]
Predictor 2: explains a small unique amount of the criterion
So: has little added value on top of predictor 1
Low incremental validity (ΔR²)

Next lecture
More on validity
Intelligence testing

Lecture 2
Test Development, Item Analysis (chap 6)
Reliability (chap 4)
1

What are we discussing today?
Both chapter 4 and chapter 6 discuss methods that provide us with quantitative (psychometric) information about the quality of a test
I first discuss chapter 6, because it contains the easier, basic statistics, and then I discuss chapter 4 on reliability
Note that pp.
230-243 on fairness and bias (chapter 6) do not have to be studied
2

Chapter 6: Test Development, Item Analysis
I will focus here on item analysis because this is central to test construction, and needs some additional clarification
The book distinguishes different types of test items on p. 201 and further, which is ok!
But in the literature the terms dichotomous and polytomous items are often used
3

Types of items
Dichotomous item: item with two item scores
Example: multiple-choice item, such as
A power test usually will have ___.
A. a very generous time limit (correct)
B. many items
C. machine scoring
D. at least some essays
Thus: 4 options but only two scores (correct and incorrect: 1, 0)
4

Types of items
Polytomous item: more than two item scores
Example: Likert item, such as “I love algebra”
1 = Strongly agree, 2 = Agree, 3 = Uncertain, 4 = Disagree, 5 = Strongly disagree
5 possible scores, thus polytomous
5

Item Analysis
Central aim is to present indicators that can help you to obtain insight into the quality of the items and the test
Define the purpose of the test and the construct, write test items, then investigate the psychometric quality of the items
When items do not perform as expected we can change item content or we can remove items from the test or questionnaire
Important indicators for items are item difficulty (or item popularity) and item discrimination
Basis is a score matrix
6

Score matrix (dichotomous item scores)
Person/item   1    2    3    4    5
1             1    1    1    1    1
2             1    1    1    1    1
3             0    1    0    0    0
4             1    0    0    0    1
5             0    0    0    0    1
…
mean          .6   .6   .4   .4   .8
Mean for dichotomous items is called the p-value = item difficulty (popularity) = item proportion correct (or proportion endorsed)
7

Score matrix (polytomous item scores)
Person/item   1     2     3     4
1             1     3     5     4
2             3     4     2     5
3             3     3     4     6
4             6     6     6     4
5             4     6     7     3
…
mean          3.4   4.4   4.8   4.4
8

p-values and a-values for dichotomous items
Thus:
p-value: proportion of persons choosing the correct (or keyed) item option = mean item score for dichotomous items
a-value: proportion of persons
choosing an incorrect item option; it can be used to investigate the quality of the alternative answers
Note: the p-value has nothing to do with the p-value in hypothesis testing!

Example
Relative frequency distribution (* denotes the correct option)
          A       B       C       D
Item 1    0.20    0.10    0.55*   0.15
Item 2    0.00    0.90*   0.05    0.05    Very easy item
Item 3    0.25    0.25    0.25    0.25*   Difficult item
Item 4    0.05    0.05    0.40    0.50*
Item 2: p-value = 0.90; a-values = 0.00, 0.05, 0.05

Item Discrimination
Item discrimination is an item’s ability to differentiate in a desired way between (groups of) persons.
Thus not: racial/ethnic/gender discrimination.
Criteria:
External (defined groups, for example depressed vs nondepressed): almost never used.
Internal: differentiate between persons with different total scores on the test.
Two indexes of discrimination: the Disc index and the item-test correlation.

Table 6.8 Example of Data Set-up for Item Analysis
                        Items (1 = correct, 0 = incorrect)
Group   Case   Score    1   2   3   4   5   6   7   8   9   10
Upper   1      10       1   1   1   1   1   1   1   1   1   1
Upper   2      9        1   1   1   1   1   1   1   1   0   1
Upper   3      9        1   1   1   1   1   1   1   0   1   1
Upper   4      8        1   1   1   1   1   1   0   1   0   1
Upper   5      8        1   1   1   1   1   1   1   0   0   1
Upper   6      8        1   1   1   1   1   1   0   1   0   1
…
Lower   95     3        1   0   1   1   0   0   0   0   0   0
Lower   96     3        1   1   1   0   0   0   0   0   0   0
Lower   97     3        1   0   1   1   0   0   0   0   0   0
Lower   98     2        1   0   1   0   0   0   0   0   0   0
Lower   99     2        1   1   0   0   0   0   0   0   0   0
Lower   100    2        1   0   1   0   0   0   0   0   0   0

Item Discrimination
The book discusses the discrimination index (Disc index).
How to calculate this index? Consider the correct (or keyed) answer option, then take the difference between the proportion correct (or endorsed) in the high scoring group and the proportion correct (or endorsed) in the low scoring group (see Table 6.9 below).

TABLE 6.9 Sample Item Analysis Data for Items from an Achievement Test
Item Statistics                                  Statistics on Alternatives (Prop. Endorsing)
Item   Prop. Correct   Disc. Index   Point Biser.   Alt   Total   Low    High   Key
6      .56             .50           .43            1     .56     .36    .87    *
                                                    2     .26     .45    .07
                                                    3     .10     .09    .07
                                                    4     .05     .00    .00
10     .62             .10           .04            1     .05     .00    .00
                                                    2     .62     .64    .73    *
                                                    3     .00     .00    .00
                                                    4     .31     .36    .27
23     .26             .40           .37            1     .03     .09    .00
                                                    2     .08     .18    .00
                                                    3     .26     .00    .40    *
                                                    4     .56     .55    .60
28     .97             .09           .24            1     .00     .00    .00
                                                    2     .03     .09    .00
                                                    3     .00     .00    .00
                                                    4     .97     .91    1.00   *
29     .69             .05           .03            1     .69     .55    .60    *
                                                    2     .08     .09    .13
                                                    3     .15     .27    .20
                                                    4     .08     .09    .07

Item discrimination
In practice the item-test correlation (= item-total correlation) or the corrected item-total correlation is often used.
That is: how is the score on an item related to the total score?
If this relation is strong (= relatively high item-test correlation), the item has high discrimination and it is a good item.
Same principle as the Disc index, but now we consider the whole score range instead of low vs high scoring persons.

Corrected item-total correlation = correlation between the score on a specific item and the sum of the other items, without the score on that specific item (see the example below).

The correlation between the item 1 scores and the total score without item 1 (the numbers in bold on the slide) is the corrected item-total correlation of item 1.
Pers./item   1   2   3   4   Total score   Total score without score on item 1
1            1   1   0   0   2             1
2            0   1   1   0   2             2
3            1   1   1   1   4             3
4            1   1   1   1   4             3
5            0   0   0   0   0             0
6            0   0   1   0   1             1
7            1   1   1   0   3             2
8            1   0   1   0   2             1
9            0   0   1   0   1             1
10           0   0   0   0   0             0

[Figure: SPSS output for 6 items of the NEO]

Item-total correlation
Note: the item-total correlation = point biserial correlation for dichotomous item scores (see also Table 6.9 in the book, reproduced above).

Question
Below you find the item scores of 8 persons on a test of 3 items.
          Item
Person    1   2   3
1         0   1   0
2         1   1   1
3         1   0   0
4         0   1   0
5         1   0   1
6         0   1   1
7         0   1   0
8         1   1   0
Which statement is true?
(a) Person 2 answers the easiest item incorrectly.
(b) Person 3 has a raw score of 2.
(c) The p-values of item 1 and item 3 are equal.
(d) None of the above statements is correct.

Item Selection Criteria
Number of items: longer tests are more reliable
p-values
First items in the test are easy items, followed by more difficult items
Shape of distribution (see Figure 6.10) and purpose
Discrimination
We strive for item-total correlations larger than, say, .15 to .20
Content
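
The item statistics from this lecture (p-value, Disc index, corrected item-total correlation) can be sketched in a few lines of Python. This is only an illustration, not the book's or SPSS's procedure: the function names are made up for this sketch, the data are the 10 persons x 4 items from the corrected item-total example, and the group size of 3 for the Disc index is an arbitrary choice for such a small data set.

```python
from math import sqrt

# Score matrix from the corrected item-total example
# (10 persons x 4 dichotomous items; 1 = correct, 0 = incorrect).
SCORES = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]


def p_values(scores):
    """Mean item score per item (= proportion correct for 0/1 items)."""
    n_persons = len(scores)
    n_items = len(scores[0])
    return [sum(row[j] for row in scores) / n_persons for j in range(n_items)]


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(scores, item):
    """Correlate one item with the sum of all other items (item left out)."""
    item_scores = [row[item] for row in scores]
    rest_scores = [sum(row) - row[item] for row in scores]
    return pearson(item_scores, rest_scores)


def proportion_correct(group, item):
    """Proportion of persons in a group answering the item correctly."""
    return sum(row[item] for row in group) / len(group)


def disc_index(scores, item, group_size):
    """Proportion correct in the highest scoring group minus the lowest."""
    ranked = sorted(scores, key=sum, reverse=True)  # highest totals first
    high, low = ranked[:group_size], ranked[-group_size:]
    return proportion_correct(high, item) - proportion_correct(low, item)


if __name__ == "__main__":
    print("p-values:", p_values(SCORES))
    print("corrected item-total r, item 1:", corrected_item_total(SCORES, 0))
    print("Disc index, item 1 (top 3 vs bottom 3):", disc_index(SCORES, 0, 3))
```

For item 1 the corrected item-total correlation comes out at about .59, well above the .15 to .20 rule of thumb. Note that items are indexed from 0 in the code, so item 1 is `item=0`.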