Psychological Testing PDF

Summary

This document is about psychological testing, providing definitions, types of tests, and applications. It reviews the history of testing and the role of statistics in the testing process. Detailed information is provided about the scales of measurement (nominal, ordinal, interval, ratio), frequency distributions, percentile ranks, and Z-scores.

Full Transcript

Psychological Testing INTRODUCTION Getting to know you! What is your Where are you What year are name? from? you? Is there anything in What do you What is your particular that plan to do after major? interests you you graduate? about this class? Most recent binge watched TV show(s)? Definitions Test  Measurement device or technique used to quantify behavior or aid in the understanding and prediction of behavior  Measures a sample of behavior  Error is involved in the sampling process Test Item  Specific stimulus to which a person responds overtly  Can be scored or evaluated Psychological Tests  Designed to measure human characteristics that pertain to behaviors  Can be educational tests Definitions Reliability  Accuracy, dependability, consistency, or repeatability of test results Validity  Meaning and usefulness of test results  How appropriate are specific interpretations or inferences of test results Test Administration  How a test is given to test takers (highly standardized? training required?) Overview: Types of Tests How may people are taking it?  Individual: one person at a time by a test administrator  Group: many people at the same time What is it trying to measure?  Achievement: assess prior learning  Aptitude: evaluate potential for learning  Intelligence: solve problems, adapt to changes, think abstractly  Personality Tests: Look at overt and covert dispositions of an individual 1. Structured personality tests (endorse or reject statements about themselves) 2. Projective personality tests (responses to ambiguous stimuli are interpreted) Overview: Applications  Interviewing  IQ (aptitude) testing  Education (achievement) and learning disabilities  Personality and other clinical tests  Neuropsychological tests  Test bias and law Testing History: IQ Testing  … some really early stuff happened …  1905: Binet and Simon (France)  Request by French government to create test  FIRST general intelligence test  30 items, normed with 50 kids  1908: Revised Binet scale  Increase # of items, normed with > 200  Test score determined a mental age (compare to chron. age)  1916: Revised at Stanford  became Stanford – Binet  Items added, normed with 1000 people Testing History  Role of WWI  Army asked APA to create tests evaluate recruits  2 group tests: Army Alpha and Army Beta  1937: First Wechsler intelligence test  Provided several sub scores (not just a mental age)  Included performance tests (no verbal response)  Around WWI: increase in personality testing  Personality as a trait (vs. state)  1921: Rorschach; 1935: Thematic Apperception Test  1943: MMPI – structured, empirical methods to determine meaning  1990: MMPI-2 (so you aren’t all compared to Minnesotans!) Testing History: Status  WWII  Federal funding for paid, supervised training for clinically oriented psychologists  1947  Report states that testing is a unique function of clinical psychology  Taught only to doctoral psychology students Why We Need Stats Descriptive Statistics  Describe what has happened  Evaluation and comparison of observations  Means, frequencies, etc. 
Inferential Statistics  Speculate about what CANNOT be directly observed/measured  Infer something about a larger population from a smaller sample
Properties of Scales Magnitude  Measure of “moreness” of a given quantity Equal Intervals  Distance between two points is the same as the distance between two other points separated by the same number of units  Relationship between measured units and outcome is linear Absolute 0  Absence of the property being measured rather than an arbitrary, relative zero
Types of Scales Nominal Scales  ONLY purpose is naming objects  Often assigns an arbitrary number to a given object (NO MATH!) Ordinal Scales  Ranks objects  Difference between ranks has no meaning (Arithmetic can be used, but hard to interpret!) Interval Scales  Has magnitude and equal intervals  No absolute zero (Arithmetic can be used, but ratios do not mean anything b/c no zero) Ratio Scales  Properties of the interval scale  Does have an absolute zero (All math and all conclusions work just fine!)
Frequency Distributions  How often each score occurs in data set  Typically on ascending X and Y axes  Not all frequency distributions are so symmetrical! [several slides of example frequency-distribution graphs]  Can be in the form of a frequency polygon  Includes class intervals  This example is skewed!
Percentile Ranks  Determine: What percent of the scores fall below a given point?  Steps: 1. Arrange data from smallest to largest 2. Count # of cases with lower scores 3. Determine total # of cases 4. Follow formula: Pr = (B / N) x 100, where Pr = percentile rank, Xi = score of interest, B = number of scores below Xi, N = total number of scores
Percentile Ranks Examples: Group 1 – Find %ile rank of 9 miles away Group 2 – Find %ile rank of 21 years old
Describing Distributions Mean  Average of a data set  Add all of the scores, then divide by the total number of cases  Problems with the mean?
Class Data “Distance” Question  Minimum: 3  Maximum: 1423  Mean: 161.24  Drops down to 68.78 (!!!) without the 2 outliers
Describing Distributions Variance  Average squared deviation around the mean: Variance = Σ(X – Mean)² / N
Describing Distributions Standard Deviation  Square root of the variance is called the standard deviation  Gives more information than the mean, shows variation in data set  Allows for precise statements about the distribution of scores
Standard Deviations Examples: Group 1 – Find the variance and SD for Age Group 2 – Find the variance and SD for Credits  What does it mean that they are different?
Z Scores  Limitation of a mean or standard deviation:  Do not convey a lot of information  Conclusions are difficult to draw from them  Z score creates standardized units that are easier to interpret  Deviation of a score from the mean in standard deviation units: Z = (X – Mean) / SD  Score > mean, Z score is positive  Score = mean, Z score is 0  Score < mean, Z score is negative
Z Scores Examples Group 1 – What is the Z score for 19 years old? Group 2 – What is the Z score for 218 miles away? (SD = 332.0)
Standard Normal Distribution  Also called a symmetrical binomial probability distribution  Typically uses Z scores on the X axis  Shows the scores of a population in standard deviation units
Percentiles and Z Scores Z Scores  Mean = 0  Standard Deviation = 1 ** When a distribution is normal  transform raw scores to %iles Steps: (example Z = 2.43) 1. Have a Z score table! 2. When using table Part II, find the Z score to 1 decimal point on the left (i.e. 2.4) 3. Follow that row across the table to the column for the second decimal point (i.e. to where you see .03 at the top) 4. The number for this example is .4925 5. Because the Z score is positive: .5 + .4925 = .9925 x 100 = 99.25 %ile 6. If the Z score was negative: .5 - .4925 = .0075 x 100 = 0.75 %ile
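The percentile-rank, variance, standard-deviation, and Z-score material above can be tied together in a few lines of code. The Python sketch below is not part of the original slides: the "miles from campus" numbers are invented for illustration, and the normal CDF (via math.erf) stands in for the printed Z-score table used in the worked example.

```python
import math
from statistics import mean, pstdev  # pstdev = population SD ("average squared deviation")

def percentile_rank(scores, x):
    """Pr = (B / N) * 100: B = number of scores below x, N = total number of scores."""
    b = sum(1 for s in scores if s < x)
    return b / len(scores) * 100

def z_score(x, m, sd):
    """Deviation of a score from the mean in standard-deviation units."""
    return (x - m) / sd

def z_to_percentile(z):
    """Area below z under the standard normal curve, expressed as a percentile.
    Uses the normal CDF in place of the printed Z table."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2))) * 100

# Invented "miles from campus" scores, standing in for the class data set.
miles = [3, 9, 12, 21, 45, 60, 75, 120, 218, 500]
m, sd = mean(miles), pstdev(miles)

print(percentile_rank(miles, 21))        # percent of scores falling below 21 miles
print(z_score(218, m, sd))               # Z score for 218 miles given this sample's mean and SD
print(round(z_to_percentile(2.43), 2))   # ~99.24, matching the table-lookup answer of 99.25
```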
Z Scores and Percentiles Examples Group 1 – What %ile is associated with the Z score for 19 years old? Group 2 – What %ile is associated with the Z score for 218 miles away? (SD = 332.0)
T Scores  Another system of transforming raw data to give more intuitive meaning  Percentile equivalents are assigned to each raw score  Characteristics:  Mean of the distribution is set at 50  Standard deviation is 10  Acts as an alternate to Z scores by multiplying a Z score by 10 and adding 50
Norms  Performances by defined groups on a given test  Are represented in different ways (Z scores, quartiles, etc.)  Give information about individual performance relative to a standardization sample
Norms  Evaluating performance relative to a standardized sample  Compares each person to the norm
Age-Related Norms  Different normative samples for a given test (e.g., IQ) based on age groups  Important function: monitoring how an individual person changes relative to their peers Tracking (specific form of age-related norms)  When one stays at the same level relative to those peers  Examples: growth charts for kids Critics: This form of testing forces competition
Criterion-Referenced Tests  Describe specific skills/abilities that can be demonstrated by a test taker  NOT used to compare people to each other – compared directly to criteria  Applications in designing individualized educational plans for children?  Critics: Standards are often arbitrary (and high-stakes)
Scatter Diagram  Individual’s score on two variables at the same time  Relationship between two variables  Allows for visual inspection of data
Regression Line  Regression: used to predict the scores on one variable based on knowing the scores of a second variable  Obtained through a regression line  Best fitting straight line through the points on a scatter diagram  Refers to raw scores (i.e. actual data in the real data units)  Predicts: change in Y when X increases by 1 unit  Best fitting line through a series of data points § Involves both predicted & actual scores (which are rarely the same!) § The difference between them is called a residual § The slope of the line is chosen to minimize these errors/differences § The sum of the residuals always equals zero (what does this sound like?)
Best-Fitting Line
Residual and Standard Error of Estimate  Standard deviation of the residuals = standard error of estimate  Measure of the accuracy of prediction  Standard error of estimate is low --> prediction more accurate  Measures VARIABILITY: scatter of OBSERVED values around the regression line.
How to Interpret a Regression Plot  Regression plots  Visual representations of the relationship between variables  Often used to determine predictive validity of a measure  Initially study a sample (measure both variables)  Compute regression equation  Use regression equation on future samples when only 1 variable is available  Use to predict the other variable  What happens if the slope is zero? 
(no relationship)  May use normative data  Use the average score as the prediction
How to Interpret a Regression Plot
Shrinkage and Cross Validation Shrinkage  Regression equation is created on one group of subjects...  … then is applied to a different population and...  … prediction is usually WORSE/decreased (called shrinkage)  Can be estimated in advance based on the variance, covariance, and sample size of the original regression equation Cross Validation  Apply a regression equation to a group other than the one for which it was created  Calculate standard error of estimate for the relationship between predicted and actual values
Correlation  Correlation (generally!)  Special form of regression  Scores for both variables are in standardized (or Z) units  Pearson product moment correlation coefficient  Ranges from −1.0 to 1.0  Ratio that can be used to determine the degree of variation in one variable from knowing the movement of the other
Correlation  Special form of regression  Scores for both variables are in standardized (or Z) units  Examine a linear relationship b/w two variables
Correlations  Correlation coefficient  Direction and magnitude of a relationship between two variables  Pearson product moment correlation coefficient  Most common, both variables continuous
Other Correlation Coefficients  Spearman’s rho  Association between two sets of ranks (ordinal)  Biserial r, Point Biserial r, Tetrachoric r, Phi  Dichotomous variables (yes/no, correct/incorrect, etc.)  True and artificially dichotomous variables
Coefficients of Determination (R2) and Alienation Coefficient of Determination (R2)  Square of the correlation coefficient  % of the variance in Y scores related to the variability in X  Value: between 0 and 1  How well the regression line predicts actual values  The relationship between THESE 2 variables accounts for ___ % of the variation in the data  Is MOST variation explained by this relationship? Or something else?  Low R2 means the data does NOT fit the regression line well Coefficient of Alienation  Nonassociation between two variables  Inverse of the coefficient of determination
Coefficients of Determination (R2)  Proportion of variance in first-year college performance explained by SAT score  Despite a significant relationship between SAT and college performance (r = .40), the coefficient of determination shows that only 16% of college performance is explained by SAT scores. 
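As a companion to the regression and correlation slides above, the following Python sketch (not from the original materials; the x/y pairs are invented) fits a least-squares line, confirms that the residuals sum to zero, computes the standard error of estimate as the standard deviation of the residuals, and shows R² as the squared correlation, including the SAT example where r = .40 corresponds to 16% of variance explained.

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson product-moment correlation: mean product of standardized (Z) scores."""
    mx, my, sx, sy = mean(x), mean(y), pstdev(x), pstdev(y)
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])

def regression_line(x, y):
    """Least-squares slope and intercept for predicting raw y scores from raw x scores."""
    r = pearson_r(x, y)
    slope = r * pstdev(y) / pstdev(x)       # change in Y when X increases by 1 unit
    intercept = mean(y) - slope * mean(x)
    return slope, intercept

# Invented predictor/criterion pairs (e.g., an admissions test score vs. later GPA).
x = [45, 50, 55, 60, 65, 70, 75]
y = [2.1, 2.4, 2.3, 2.9, 3.1, 3.0, 3.6]

slope, intercept = regression_line(x, y)
predicted = [slope * a + intercept for a in x]
residuals = [actual - fit for actual, fit in zip(y, predicted)]

print(round(sum(residuals), 10))   # ~0: least-squares residuals always sum to zero
print(pstdev(residuals))           # standard error of estimate, taken here as the SD of the residuals
r = pearson_r(x, y)
print(r, r ** 2)                   # correlation and coefficient of determination (R^2)
print(0.40 ** 2)                   # SAT example: r = .40 explains only 16% of the variance
```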
The Correlation-Causation Problem and the Third Variable Explanation  Relationship between two variables does not mean that one caused the other  Other research is needed to determine such cause-and-effect relationships  Possible that two variables may be related to each other, but both are driven by some third factor  Example: The murder rate and the rate of ice-cream sales in a major city might have a strong positive correlation, but both are driven by hot temperatures (a third variable) Restricted Range When variability is restricted, a correlation may be difficult to find If this chart, which shows GPA vs GRE-Q, only looks at GRE-Q scores over 700, a true correlation will not emerge Measurement and Conceptualization of  Error Measurement in social sciences is more difficult than physical measurements (height or length)  Complex traits often cannot be directly observed  Measuring error is an important part of many sciences, not just psychology Measurement Error  When we give a test  want to know results can be replicated under similar conditions (CONSISTENT  reliable)  Errors in measurement that could REDUCE reliability 1. Systematic Errors  Error impacts score the SAME WAY every time test is taken  Does NOT result in inconsistent measurement (but could be inaccurate!) 2. Random Errors  Error impacts score in a RANDOM WAY every time test is taken  Purely chance  Guessing, distraction, content issues, test takers mood, misreading, etc.  Retake test – likely NOT same error impacts score, but SOME error will  Reduces consistency and accuracy of test Basics of Test Score Theory Score you would get if there were Score you get b/c NO errors in measurement is measurement imperfect Error  Difference b/w a true score and an observed score on some trait or quality  Error is variation in score that has nothing to do with true score  Error = Observed Test Score – True score (E = X – T)  Observed test score = True score + Error (X = T + E) Classical Test Theory  Assumes that errors of measurement are random  Also assumes:  True score WILL NOT CHANGE with repeated assessment  Repeated assessment CAN produce different observed scores Basics of Test Score  Theory Mean: estimated true score for the individual  Dispersion: distribution of random errors TRUE SCORE = average of observed scored over and INFINITE # of testing with the same test Distribution of observed scores for REPEATED TESTING of SAME PERSON Basics of Test Score Theoryhas Distribution the MOST error Distribution has the LEAST error Distribution of observed scores for REPEATED TESTING of SAME PERSON Domain Sampling Model  Issue:  Not feasible to ask you every single question that could possibly assess a certain domain  Example: spelling test for every English word  What do I do instead?  Chose a sample of items from a larger domain  Example: subset of words of a particular difficulty level  Problem this causes?  There will be error because I am using a “shorter” test with items that may/may not correctly represent the content domain  Reliability analysis  estimate how much error we make w/ shorter test Reliability  Estimated from correlation of = observed score & true score  What proportion of variation in OBSERVED test score can be attributed to the TRUE test score (vs ERROR)?  BUT we cannot know the actual true score, so …  We are stuck with ESTIMATING what it would be  Measures MUST have demonstrated reliability before they can be used to make decisions (by law!) 
 Reliability estimates are correlations Reliability  Possible sources of error  Situational factors (issues with room or test taker)  Characteristics of the test (item issues)  Types of reliability (i.e. ways we estimate)  Test–retest  Parallel/Equivalent forms  Internal consistency  Interrater Test-Retest Reliability  Type of error measured?  Changes of observed score around true score due to tester state  Put another way – something about the TIME of the test impacted the score  Time sampling error  How it is measured?  Same person administered same test at 2 different time points  Calculate correlation between the two (stability coefficient)  Issues to consider?  Only useful for assessing presumed TRAITS (vs. states)  Need to consider time interval between measurements  Reduce impact of memory / practice (1st test can impact 2nd test!)  BUT, not so long that maturational or historical changes impact score Parallel/Equivalent Forms Reliability  Type of error measured?  Error due to content chosen for each test form (represent domain well?)  Item/content sampling error  How it is measured?  Construct two similar forms of a test  Give the two different to the same group of participants  … preferably at the same time  Balance the order of administration (some get 12, others 21)  Calculate correlation between the two (equivalence coefficient)  Issues to consider?  If given on the same day: Differences ONLY reflect random error and differences between the two tests  If correlation is high, forms can be used interchangeably  UNDERUTILIZED! Internal Consistency  Reliability statistics obtained from ONE test administration Measures:  Consistency of performance across items /subtests Used to Estimate:  How performance would generalize to other items in tested content domain High Internal Consistency Means:  Items in the measure all assess the same content domain  Items are well written / not technically flawed Issues to consider?  Only useful if measure is MEANT to assess one concept Otherwise use factor analysis (and want internally consistent factors) Internal Consistency / Split-Half  Type of error measured?  Content sampling error  How it is measured?  One test is split into 2 equal halves – each half is compared  Options: randomly, 1st half / 2nd half, evens / odds  End up with 2 halves that are as “parallel” as possible  Issues to consider?  Measures with fewer items have lower reliability, SO …  Split-half alone will underestimate the true reliability  Use the Spearman-Brown formula to correct for half-length test  BUT, Spearman-Brown should not be used if 2 halves have different variances ( Cronbach's alpha, which everyone uses anyway!) Addition Test 1. 3+2 Both methods will underestimate reliability 2. 4+5 because it correlates two 6- 3. 8+6 item tests (vs. 12 item) First 1/2 4. 17 + 5 Shorter tests = lower reliability 5. 28 + 13 Correct with the Spearman- 6. 75 + 17 Brown formula 7. 113 + 85 8. 166 + 39 9. 476 + 215 Second 1/2 10. 781 + 432 11. 1094 + 841 12. 2741 + 4052 Internal Consistency / KR20  Type of error measured?  If different items all measure same ability / trait / concept  Content sampling error and flawed items  How it is measured?  Used for tests with dichotomous scored items ONLY (0/1)  Usually used for items that are right or wrong  Considers ALL WAYS of splitting the items (vs. split-half) Internal Consistency / Coefficient Alpha  Type of error measured? 
 If different items all measure same ability / trait / concept  Content sampling error and flawed items  How it is measured?  Used for dichotomous OR non-dichotomous tests (like KR20)  Considers ALL WAYS of splitting the items (vs. split-half)  Provides the lowest estimate of reliability of any internal consistency measure (most conservative)  Most general method for finding internal consistency Interrater Reliability: Observational Studies and Interview Studies  Type of error measured?  Observer differences  Observations do not reflect true occurrences / ratings on scale  How it is measured?  Correlation between observations/ratings of two different raters  Typically done on a subset of the studies observations/ratings  Issues to consider?  Kappa: nominal scale (> 0.75 = excellent, 0.40-0.75 = satisfactory)  ICC: continuous scale When is it “reliable enough”?  Acceptable level of reliability depends largely on what is being measured .70 to.80 is seen as “good enough” for basic research  Researchers often look for reliability of.90 or even.95  Some argue that these reliability levels are less valuable, as they indicate that all items on a test measure the same thing, and thus the test could be shortened  But these levels are more necessary in clinical settings, when they influence treatment decisions  Standard error of measurement is a very useful index of reliability because it leads to confidence intervals Standard Errors of Measurement  SEM = on average, how much a score varies from the true score  Find this using: Standard deviation of the observed score Reliability of the test  Allows estimate the degree to which a test provides inaccurate reading  All tests are not equally accurate but also not equally inaccurate  Higher standard error of measurement leads to less certainty about the accuracy of a given measurement of an attribute What to do about low reliability  Increase the number of items  Is particularly effective in the domain sampling model  Spearman–Brown prophecy formula can be used to calculate the number of items needed for certain level of reliability  Factor and item analysis  Discriminability analysis: correlate each item with the total test score  LOW item-total correlation  probably measuring something different  May choose to drop item  increase reliability Validity Validity  Agreement between a test score or measure and the quality it is believed to measure  Does a test accurately measure what it is intended to measure?  Gather evidence through systematic studies Three types of evidence  Content-related  Criterion-related  Construct-related Validation of a Test  Process for the test developer or user  Collect evidence to support inferences that are made from the test score …  … about what tester will do now or in the future  MUST know desired inference (what is the test supposed to do?) …  To determine how useful the actually test is  Different types of validation studies support different types of inferences!  Types of validity are not interchangeable  May need to do more than one type of validation study Face Validity Face Validity  Extent to which a measure appears to have validity  Do items seem related to the perceived purpose of the test?  Does not offer evidence to support conclusions drawn from a test  Is NOT a statistical or numerical measure  Whether test ”seems like” it measures a criterion  May be undesirable in some circumstances! (examples?) 
 Likely motivating for achievement and employment testing Content Validity Content Validity  Determine if items on a test are directly related to the content they are assessing  Most often used with achievement tests (i.e. educational settings)  Test that felt like it had nothing to do with what you were told to study?  LOW CONTENT VALIDITY  cannot make a valid inference about your knowledge  Logical rather than statistical quality (i.e. no number associated w/ it!) Process (after initial form of measure is developed) 1. Define domain of interest 2. Select panel of qualified experts (NOT the item writers!) 3. Panel participates in process of matching measure items to domain (or not!) 4. Collect/summarize data from matching process Criterion Validity Criterion Validity  Need when: Using current test to INFER some performance criterion that is NOT being directly measured  Gather evidence that there is a relationship between test score and criterion performance (before making decisions based on test score!) Supported by high correlations b/w test score and well-defined measure Allows calculation of a regression line Process FUTURE prediction of criterion 1. variable ID criterion and measurement method. 2. ID appropriate sample (representative of those that test will be used with) 3. Give test to sample & and obtain criterion data when it is available 4. Determine strength of relationship b/w test score and criterion performance Correlation Criterion Validity: Predictive and Concurrent Predictive Validity  How well test predicts criterion performance in the future  Does the SAT predict future college success?  Predictor variable / initial test: SAT  Criterion: College GPA  High Predictive Validity: If SAT has a high correlation with later college GPA Concurrent Validity  Assess the simultaneous relationship between a test and criterion  Relates to something that is happening right now  Does using a depression screener predict a DSM-5 depression diagnosis?  Predictor variable / initial test: Depression screening measure (short!)  Criterion: DSM-V diagnosis by interview (long and time intensive!)  High Criterion Validity: If screener has a high correlation current DSM-5 dx Validity Coefficient Validity Coefficient  The relationship between a test and the related criterion (r)  Extent to which the test is valid for making statements about the criterion Validity Coefficient Squared (coefficient of determination, R2)  Relationship between variation in the criterion and our knowledge of a test score  Question this answers  How much of the variation in Y will we be able to predict on the basis of X?  How much variation in college performance will we be able to predict on the basis of SAT scores?  One way to think about it is, what percentage of data points will fall on the regression (PREDICTION) line?  If I know JUST GPA, how often will I know exactly what someone's college performance will be?  How much of the variation in college performance does GPA account for?  The remainder of the variation is NOT explained by GPA Cross Validation and Shrinkage Review  Measure developed  successfully predicts something with some accuracy (i.e. has predictive validity)  Want to use the regression equation to make those predictions again in the future with DIFFERENT SAMPLES  PROBLEM: the prediction is going to be HIGHEST on the sample regression equation was developed on  SOLUTION: see how the equation performs on another sample (i.e. 
CROSS-VALIDATE the equation with new sample) to make sure it still works well with a different group of people  RESULT: the prediction will be worse, there will be MORE ERROR in the prediction, fewer data points will fall on the regression/prediction line (i.e. ability to predict will SHRINK)  BUT could still have very good predictive validity! Evaluating Validity Coefficients  What does the criterion mean?  Ensure the criterion itself is reliable and valid!  Correlating with or predicting to something meaningless is … meaningless  Review the subject population in the validity study  If NOT representative of group that inferences will be made about?  Be sure that the sample size was adequate  Small sample = greater chance that correlation was artificially inflated  A GOOD validity study will include cross-validation (check out predication ability) Evaluating Validity Coefficients  Check for restricted range on both predictor and criterion  Means all score fall very close together (i.e. very little variation)  Correlation requires that there be variation in variables  Review evidence for validity generalization  NOT AN ISSUE OF JUDGMENT, MUST BE STUDIED  Does the validity generalize to other situations?  Groups of people who take the test?  Times periods / situations test is taken?  Consider differential prediction  Predictive relationships may NOT be the same for all demographic groups  I.E. VALIDITY COULD DIFFER  May need validity studies for different groups Construct Validity Why is it necessary?  Often measure things that are not directly observable  Constructs are not clearly defined  To be useful, a construct needs: 1. An operational definition 2. Description of relationships with other variables (i.e. What is it related to? and what is it NOT related to?) Process  Assembling evidence about what a test means  Each relationship that is identified helps to provide a piece of the puzzle of what a test means Convergent Evidence  Know in advance that test SHOULD relate highly to certain other tests measuring the same construct  Expect a high correlation between two or more tests that purport to assess the same construct  Two tests converge (narrow in) on the same thing  Do not expect the correlation to be perfect (or exceptionally high, >0.90), why not? (what would it mean about your test?) Discriminant/Divergent Evidence  Know in advance that test SHOULD NOT relate highly to certain other tests measuring concepts known to be distinct from the one being measured  Two tests of unrelated constructs should have low correlations  Discriminate between two qualities that are not related to each other  What could you do if items on your test correlate highly with items on one of your discriminate validity tests? Item Writing: Guidelines 1. Define clearly what you wish to measure  Items should be as specific as possible 2. Generate pool of items  Likely write 3-4 items for every ONE that you will keep  Avoid redundant items in final test 3. Avoid items that are exceptionally long  Long = confusing or misleading 4. Be aware of the reading level (scale and test-takers) 5. Avoid items that convey two or more ideas at the same time  Hard to answer when they question is asking about 2 things! 6. 
Consider using questions that mix positive and negative wording  We tend to start “just agreeing” with the items  To work against this, we will mix up the wording of questions (and reverse score) Dichotomous Format Offers two choices for each question  Examples: yes/no or true/false  Appears on educational as well as personality tests Advantages  Simplicity  Often requires absolute judgment Disadvantages  Can promote memorization without understanding  Many situations are not truly dichotomous  50% of getting any item right even if material is not known Polytomous/Polychotomous Format  Has more than two options  Point is given for the correct selection  Advantage over dichotomous: probability of guessing correctly is lower!  Most common example is a multiple-choice test,  One right and several wrong answers  Incorrect answers are called distractors  How many distractors to use?  Reliability of item is NOT enhanced by distractors THAT NO ONE WOULD SELECT  Rarely more than 3 or 4 distractors that work well  What is the impact of poor distractors?  Hurt reliability and validity  Limit good items on a test and time-consuming to read Common Problems w/ MC ?s 1. Unfocused Stem  Stem should include the information necessary to answer the question.  Should not need to read the options to figure out what question is being asked. 2. Negative Stem  Whenever possible, should exclude negative terms such as not and except 3. Window Dressing  Information in the stem that is irrelevant to the question or concept being assessed should be avoided. 4. Unequal Option Length  The correct answer and the distractors should be about the same length. Common Problems w/ MC ?s 5. Negative Options  Whenever possible, response options should exclude negatives such as “not” 6. Clues to the Correct Answer  Inadvertently provide clues by using vague terms such as might, may, and can.  Where certainty is rare, vague terms may signal that the option is correct. 7. Heterogeneous Options  Correct option and all of the distractors should be in the same general category. 
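The format slides above note that blind guessing alone yields roughly 50% on true/false items and 25% on four-option multiple-choice items, and the next slide refers to a formula that corrects for guessing. A minimal Python sketch of that arithmetic follows; the correction shown, R − W/(k − 1), is the conventional one and is an assumption on my part, since the slides reference the formula without stating it.

```python
def chance_score(n_items, n_options):
    """Expected number of items answered correctly by blind guessing alone."""
    return n_items / n_options

def guessing_corrected_score(n_right, n_wrong, n_options):
    """A conventional correction for guessing (assumed here; the slides mention but do not
    state the formula): corrected = R - W / (k - 1), where k = number of response options."""
    return n_right - n_wrong / (n_options - 1)

print(chance_score(50, 2))                  # true/false: 25 of 50 correct by chance (50%)
print(chance_score(50, 4))                  # 4-option multiple choice: 12.5 of 50 (25%)
print(guessing_corrected_score(30, 20, 4))  # 30 right, 20 wrong on 4-option items -> ~23.3
```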
Polytomous Format: Guessing  Test item with a limited number of responses  certain number can be answered correctly through guessing  Typical multiple choice test: 25% correct by guessing alone  Formula can correct for this guessing effect  Whether or not guessing is a good idea depends on whether incorrect answers carry greater penalties than simply getting no credit
Likert Format  Rate degree of agreement with a statement (on a continuum)  Often used for attitude and personality scales  Can use factor analysis  Groups of items that “go together” can be identified  Familiar approach that is easy to use
Likert Format 5 Choice with a neutral point: response options are Strongly disagree / Somewhat disagree / Neither agree nor disagree / Somewhat agree / Strongly agree. Example items: “Some politicians can be trusted”; “I am confident that I will achieve my life goals”; “I am comfortable talking to my parents about personal problems” 6 Choice without a neutral point: response options are Strongly disagree / Moderately disagree / Mildly disagree / Mildly agree / Moderately agree / Strongly agree, applied to the same example items
Category Format  Similar to a Likert format, but has a greater number of choices  “On a scale of one to ten….”  Can be controversial for a number of reasons  People’s ratings can be affected by a number of factors that can threaten the validity of their responses  Context can change the way people respond to items  Clearly defined endpoints can help overcome such issues  What is the optimal range? 1 to 10? 1 to 7? Why?  Visual analogue scales
Development of a Test 1. Reviewing the literature  See what other measures exist already  Can they be improved? 2. Defining the construct  Part of reviewing the literature  Need to know what domain you will be sampling from 3. Test planning and layout  Representative sample … of ITEMS (i.e. ability, behaviors)
Development of a Test 4. Designing the test  Brief, clear instructions  Manual / directions for test users (administrators)  Item difficulty (ability tests): Common issue is making items too difficult or easy Will not distinguish / discriminate between individuals Test is not informative when this happens  Item attractiveness (personality tests): Test taker is likely to answer yes/true/agree Should be rephrased or removed if MOST people would agree
Development of a Test 5. Item tryout  Choose sample of individuals that match target population  Initial test has 1 ½ to 2 times more items than final test  Can have multiple versions all with different items  combine later 6. 
Item analysis  People with high levels of characteristic should get HIGH scores  There should be RANGE of scores (not clumped together)  Item Difficulty / Attractiveness = number correct / marked true  If the number is too high, the item does not tell us very much  Item difficulty should vary across the test  Item Discrimination Index = how well item discriminates between high and low scorers on the test Development of a Test 7. Building a scale  Choose items with: moderate difficulty, high discriminability 8. Standardizing the test  Test used with large representative sample  Same conditions and demographics as intended use  If sufficient reliability/validity  calculate percentiles, etc.  If NOT sufficient reliability/validity  back to item writing/analysis Item Difficulty / “Easiness”  Item difficulty is an essential evaluation component  Asks what percent of people got an item correct  What factors go into determining a reasonable difficulty level? How many answers are there? What would we expect chance to produce? T/F Item: can get item correct by chance 50% of time, difficulty of 0.5 is BAD MC Item: can get item correct by chance 25% of time, difficulty of.25 is BAD  Most tests have difficulty between 0.3 - 0.7  Discriminates those who know the answer from those who do not  Formula on p. 173  MC: optimal difficulty is approximately 0.625  Good to have a range! Want people to feel like they are getting some right Item Discriminability  Determines:  Whether people who have done well on a particular item…  … have also done well on the entire test  Items that are too easy/hard will not discriminate well (too little variability)  Extreme Group method  Calculation of a discrimination index  Compares those who did well to those who did poorly (top 27% vs. bottom 27%).. Who are high on the construct of interest and those who are low Want our items to be able to discriminate between high/low well  Discrimination Index = P(upper group correct) – P(lower group correct) If the number is negative, the item should be removed Range is +1 to -1, Aim for > +0.3 Item Discriminability  Point Biserial method  Correlation between a dichotomous and a continuous variable  Individual item versus (correct/incorrect) and overall test score (continuous)  Is less useful on tests with fewer items  Point biserial correlations closer to 1.0 indicate better questions  May want to exclude the item from the total score when doing this! Factor Analysis Factor  Unobserved (latent) variable Factor Analysis  Data reduction technique  Find the fewest number of distinct factors w/in a data set Factor Loading  Correlation b/w an individual item and newly found factors Relationship Between Examiner and Test Taker  Affect test scores:  Examiners’ behavior  Examiners’ relationship w/ test taker (stronger rapport  higher score)  Supportive comments by the administrator  higher score  Even subtle cues (may be unaware of!) 
can be important  Preexisting idea of their ability (positive or negative impact) Stereotype Threat  Two levels of threat on a test  Anxiety over how one will be evaluated and how one will perform  Members of stereotyped group, pressure to disconfirm negative stereotypes  Examples  Explains 50-80% of difference b/w males & females on SAT math  Explains 25-41% of difference b/w white non-Hispanic & Hispanic SAT scores  Explains 17-29% if difference b/w white non-Hispanic & black SAT scores  Research finds that even with equal preparation, African Americans underperform white students on college tests  Belief that intelligence is a fixed trait that is inherited  can be changed! Stereotype Threat  Hypotheses about causes  Stereotype threat depletes working memory  Self-handicapping  reduced effort  reduced performance  A problem with this hypothesis is its “blame the victim” tone  Stereotype threat  physiological arousal  disrupt performance  Possible remedies  Shift demographic questions to end of test (evidence for this)  Tell test taker the test is is not expected to show differences Language of Test Taker  Many tests are highly linguistic  Puts non-native English speaker at disadvantage  For all tests, can we assume that the test taker understands the questions or instructions of the administrator?  Translating tests is one option  Translations are often incomplete / not validated.  Can also introduce bias.  How were tests validated?  Do they hold for those for whom the testing language is not native?  Interpreters used with caution (introduce bias!) Training Test Administrators  Often requires a high level of training  Standardized administration is an important part of the test’s validity  Often no specific standard of training that must be demonstrated  As an example, can any psychologists administer a Wechsler Intelligence test? Can we assume that they have all been equally trained?  Research finds a minimum of 10 practice administrations is necessary to gain competence  Each program has own training standards Expectancy Effects  Data can be impacted by what an we expect to find  Such effects are not limited to human subjects or research  Expectancy effects can be very subtle, often unintentional  What interpersonal or cognitive factors lead to such effects? Effects of Reinforcing Responses  If your teacher looks at your answers during a test and smiles and nods, will that impact your overall performance? What if he or she frowns at you?  Test takers (particularly children!) work hard for approval/praise  Reactions given to a given answer can influence future answers  Reward (praise!) improves IQ test scores  Questions  Does giving a reinforcing response violate standardized administration protocols? How might that impact testing validity?  What situations might legitimately call for adjustments to standardized administration? Computer-Assisted Test Administration  Advantages  High standardization  Individually tailored sequential administration  Precision of timing responses  Releases human testers for other tasks  Patience  Control of bias  What might be some drawbacks of computer testing?  Misinterpretation (still need some clinical judgment!) 
Subject Variables  State of the subject can and does influence test performance  Motivation and anxiety  Test anxiety  Often seen in school settings and has three components  Worry  Emotionality  Lack of self-confidence  Illness Types of Interviews  Standardized: predetermined set of questions  Unstandardized: unstructured, questions depend on client responses  Directive: largely guided and controlled by interviewer  Nondirective: largely directed by client, interviewer asks fewer questions  Selection Interview: identify qualifications and capabilities  Diagnostic Interview: emotional/cognitive functioning Interview as a Test How so?  Method for gathering information  Used to describe, make predictions, or both  Reliability and validity are evaluated  Each type has a defined purpose What is the role of an interview?  Some tests CANNOT BE PROPERLY DONE without an interview  Sometimes the interview is the test itself  Selection for employment  Diagnostic interviews Interviews are Reciprocal  Interviews are human interactions!  Anytime we interact we impact each other  Social Facilitation  We tend to act like those around us  If INTERVIEWER acts defensive and aloof, so will the client  GOOD INTERVIEWERS can remain in control and set the tone  NOT reacting to clients tension and anxiety with MORE tension/anxiety  Remain relaxed/confident  calming effect on client Effective Interviewing  Specific techniques vary by the type and purpose of interview Proper Attitudes  A good interview is “more a matter of attitude than skill”  Translation: more about HOW you do things than WHAT you do  Further translation: you can read all the books and SUCK at this!  Often discuss that how one relates to others is VERY hard to teach  Attitudes related to good interviewing:  Warmth Interviewers are rated poorly when  Genuineness seen as: Cold  Acceptance Defensive  Honesty Uninterested / Uninvolved Aloof  Fairness Bored Effective Interviewing Responses to Avoid  With the exception of stress interviews (induce anxiety!)  Know that certain interactions  guarded, discomfort   reveal less information  Specifics:  Judgmental / evaluative statements (stupid, terrible, disgusting)  Probing statements (“Why did you do that?)  demanding  Hostile/angry statements (unless need to know response to anger)  False reassurance (“Everything is going to be okay”)  DO NOT DO THIS. YOU DO NOT KNOW THAT IT WILL BE OKAY!  Does nothing to help. Minimizes their situation. Dismissive.  You know it is false. They know it is false. Just don’t do it. Effective Interviewing Effective Responses  Responses that keep the interaction flowing smoothly  Specifics:  Use open-ended questions when possible (less in structured!)  Lets interviewee lead topics of import/interest (can learn a lot!)  Requires client to produce something spontaneously  Can be an issue sometimes with some people! Need a back up plan! Effective Interviewing Responses that Keep an Interaction Flowing  Respond without interruption (watch your “uhs” and ”hmms”!)  … urge them to continue with ”Yes”, “and”, “because …”  Verbatim Playback: repeat last response exactly  Paraphrasing/Restatement: similar, captures the meaning  Does NOT add anything additional  Shows that you are listening! 
 Summarizing/Clarification: pull together meaning across responses  Goes JUST beyond what they said to tie things together  Ensure you are understanding what they are saying  Empathy/Understanding: may/may not be reflection of what was said  Infer and state emotions that were implied Developing Interviewing Skills 1. Understand theory and research. 2. Supervised practice.  Reading books alone won’t be enough!  Watching yourself and getting feedback  Watching others and watching them get feedback  BEING UNCOMFORTABLE is a part of the process  Must be open to feedback, and be willing to make changes 3. Make conscious effort to apply interviewing principles  Requires constant self-evaluation  Am I communicating that I understand?  What is being communicated non-verbally?  Experienced interviewers attend to many things at once because they have trained to … not because they have innate special talent Sources of Error in an Interview Interview Validity  Halo Effect: judge person during interview based on first impression (which is made within the first few minutes!)  Can be favorable or unfavorable impression!  Impairs objectivity (… and thus… ?)  General Standoutishness: judge on the basis of one outstanding characteristic (i.e. physical appearance)  Prevents objective evaluation  Make unwarranted inferences based on the characteristic  Example: attractive  more intelligent  Even in HIGHLY STRUCTURED interviews, impressions made during rapport building can influence the evaluation Sources of Error in an Interview Interview Reliability  Unstructured interviews have low reliability  Why does this make sense?  YET, interviews tend to give FAIRER outcomes than other tests  Structured interviews can have high levels of reliability  May limit content obtained during interview Intelligence  Defining is … has never gone well!  Everyone who measures it has their own definition  Consistent correlation between SES and scores on ALL standardized intelligence tests  Originally developed to we would be LESS biased  Yet … here we are Intelligence: Binet  Binet’s three components of intelligence 1. To find and maintain a definite direction and purpose 2. The ability to make necessary adaptations 3. 
To engage in self-criticism and adjustments in strategy  Principle 1: Age Differentiation  One can differentiate kids of different ages by their ability levels  Assess the mental abilities of different aged children  Found tasks that MOST kids of one age could do and few younger could do  Principle 2: General Mental Ability  Was not concerned with individual elements of intelligence  Essentially, expected high subtest/total test correlations Intelligence: Spearman  All intelligent behavior rests on a foundation of general mental ability (g)  g is believed to be composed of several s factors  Positive manifold (tests have positive correlations, g drives them)  Developed factor analysis to reduce many items to common factors  g  Consistent with Binet’s initial approach  Still considered a predictor of performance (occupational, educational)  Implies that intelligence is best represented as a single score  Performance on individual task: attributed to g (and some unique variance) Binet Scales  1905  30 items increasing in difficulty  TERRIBLE terms (idiot, imbecile, moron)  No normative data or validity evidence  1908  Items group by age level (age scale)  One score (almost all based on verbal ability)  Mental age determined by performance compared to age group  6 year old answers questions for 8 year olds (that 75% can answer) MA =8  10 yo only answers questions 75% of 5 yo’s can answer, MA = 5 Binet Scales  1916 Stanford-Binet  Intelligence Quotient: IQ = (Mental age / chronological age) x 100  When MA is less than chronological age, IQ is < 100  PROBLEM: in 1916 scale had a max MA of 19.5  SOLUTION?: belief that MA ceased to improve after 16, max CA = 16  1937 Stanford-Binet  Max CA changed to 22 years 10 months  Improved scoring standards and instructions  Improved standardization sample  Created alternate equivalent form  PROBLEMS: reliability differences (worse: young age, high IQ), each age group had difference standard deviations of IQ scores (IQ scores across age ranges could not be compared) Binet Scales  1960 Stanford-Binet Chose best items from 2 forms, combined to 1 test Dropped old IQ calculation Standard score, mean = 100, standard deviation = 16 IQ scores at all age ranges could be compared  1972 Stanford-Binet  First time standardization sample had non-white participants (!!!) Wechsler Intelligence Scales  David Wechsler did NOT like the single score of the Binet scale  Point Scale Concept Binet scale  items grouped by AGE (if you were a certain age you could pass the items), many types of items mixed together Wechsler scale  POINT SCALE Points for each items passed / answered correctly Group item content together (i.e. all working memory together) Allows individual scores for CONTENT areas, in addition to overall score  Performance Scale Concept Binet Scale  relied heavily on language and verbal skills Wechsler scale  included scales of NONVERBAL intelligence (i.e. performance) Requires DOING something vs. answering a question w/language Other scales measured performance, this was first that could compare v vs. 
p Scale attempts to overcome biases of language, culture, education Wechsler Intelligence Scales  Wechsler’s definitions of intelligence: Capacity to act purposefully Think rationally Deal effectively with environment  Believes that intelligence = several interrelated specific abilities SO, his scales involve… Measuring separate abilities Adding them together to get a general intelligence score Wechsler Intelligence Scales  Format of Tests Substests: measure basic underlying skill/ability Index: related subtests brought together, WAIS-IV has 4 indexes Full Scale IQ (FSIQ): summed scores of all four indexes Wechsler Intelligence Scales Wechsler Subtests Vocabulary  Ability to define words (increase in difficulty)  VERY STABLE OVER TIME (often last to be impacted)  Can be used to estimate baseline/premorbid intelligence Similarities  Identify the similarity between a pair of items (increase in difficulty)  Require abstract thinking – pairs you have not heard before  Ask about seemingly dissimilar things Arithmetic  Math problems in increasing order of difficulty  Less a measure of math than concentration, motivation, memory Wechsler Subtests Digit Span  Repeat digits forward, another section is backward  Test of short term auditory memory  Attention and anxiety can play a significant role in results Information  Tests a range of knowledge (like things you learn in school)  Also taps into curiosity and interest in acquiring knowledge Comprehension  Three types of questions, generally testing common sense 1. “What should be done if …” (… you find someone lying in the street?) 2. Logical explanations (Why do we bury the dead?) 3. Define proverbs (A journey of a 1000 miles begins with the first step) Wechsler Subtests Letter-Number Sequencing  NOT REQUIRED for an index score (supplementary test for WM)  Great test for working memory and attention  Reorder letters and numbers (i.e. Z3B12) Digit-Symbol Coding  Numbers are paired with symbols at the top  Rest of sheet is just numbers --> fill in correct symbols (timed)  Measures ability to learn unfamiliar task, visual-motor dexterity, persistence, speed of performance Block Design  Nine blocks with different sides and a booklet with designs  Arrange the blocks to look like the designs (timed)  Measures reasoning, spatial skills, visual/motor functions, abstract thinking Wechsler Subtests Matrix Reasoning  Identify a pattern / relationship between stimuli  Measures information process and abstract reasoning Symbol Search  Shown two “target” geometric figures  Search among a set of figures, determine if they are in the group  Timed test that measures processing speed Wechsler Scoring  Each subtest  raw score (total number of points)  Raw scores are converted to standard scores for EACH subtest  Allows comparison between subtests  Mean = 10  SD = 3  Subtest scores are then combined into the 4 index scores 1. Verbal Comprehension: crystallized intelligence 2. Perceptual Reasoning Index: fluid intelligence 3. Working Memory: MOST IMPORTANT INNOVATION OF WAIS 4. Processing Speed: how quickly your mind works  Full scale IQ  summing age-corrected scaled scores of indexes Mean = 100 SD = 15 Interpretations of Wechsler Tests Index Comparisons  Can divide scores into verbal IQ (VIQ) and performance IQ (PIQ)  IF they are similar to each other  assume FSIQ is a good estimate of the individuals intelligence  BUT, if the VIQ and PIQ are significantly different? 
 Example: VIQ = 50, PIQ = 110  FSIQ needs to be interpreted with caution  Would likely not want to make an intellectual disability diagnosis (.90 for FSIQ, VIQ, PRI, WMI, PSI  Test retest is “slightly lower”  SEM for FSIQ = 2.16 (remember, +/- 1 SEM is a 68% confidence interval)  Reliability of individual subtests is lower (this is typical of any test!) Validity  “rests heavily on its correlation with earlier versions of the test” Wechsler Tests: WISC WISC-V  Ages 6-16  Can be administered and scored with 2 iPads  Can be scored and interpreted online (potential issues?)  Good reliability and validity  Pattern analysis  need to be careful (subtest scores are always less reliable) …. Example of the WISC … Wechsler Tests: WPPSI WPPSI-IV  Ages 2.5 to 7 years 7 months  Flexible –more/less subtests depending on what's needed  Good reliability and validity Important….  All of these tests should be used with other assessments  Includes a good interview!  FUNCTIONING and level of impairment matters  No diagnosis should be made on a test score alone Learning Disability: DSM-5 1. Persistent difficulties in reading, writing, arithmetic, or mathematical reasoning skills during formal years of schooling. Symptoms may include inaccurate or slow and effortful reading, poor written expression that lacks clarity, difficulties remembering number facts, or inaccurate mathematical reasoning. 2. Current academic skills must be well below the average range of scores in culturally and linguistically appropriate tests of reading, writing, or mathematics. 3. Learning difficulties begin during the school-age years. 4. The individual's difficulties must not be better explained by developmental, neurological, sensory (vision or hearing), or motor disorders and must significantly interfere with academic achievement, occupational performance, or activities of daily living ( APA, 2013). Learning Disability Testing  Identify students of average intelligence who are struggling in school due to a specific deficit that interferes with learning  Individuals with Disabilities Act (IDEA, 2004)  Every child w/ a disability entitled to FREE appropriate public education, including special education to meet needs  To qualify, disability must adversely impact educational performance  MAJOR disagreements about how LD identification should be done!  cutoff  Often used due to concerns with grade inflation  Best used in conjunction with other pieces of information  Just like all other types of assessment! Graduate and Professional School Entrance Tests Miller Analogies Test  Graduate school admission test  Verbal test  60 minutes, 120 varied analogy problems (examples pg. 319)  Gets VERY difficult  Has specific norms for specific fields  Correlates well with GRE, but predictive validity is unimpressive  Accuracy of prediction varies by age group  More overprediction found for the 45-year old group Graduate and Professional School Entrance Tests LSAT (Law School Admission Test)  Involves intensive time pressure  Three problem types 1. Reading comprehension 2. Logical reasoning 3. 
Graduate and Professional School Entrance Tests LSAT (Law School Admission Test)  Involves intense time pressure  Three problem types: 1. Reading comprehension 2. Logical reasoning 3. Analytical reasoning  Law schools publicize the specific weight they put on the LSAT/GPA  Changed very little over the years  Each ACTUAL test is made public after it is no longer in use  VERY good psychometric properties  High reliability  High content validity  High predictive validity (first-year GPA)

Nonverbal Group Ability Tests Raven Progressive Matrices  Well-known, very popular nonverbal test, ages 5-adult  Can give instructions without language if needed  Missing part of a logical sequence is identified  60 items of escalating difficulty (challenging item pg. 323)  Considered a strong measure of g (general intelligence)  Strong psychometrics, worldwide norms (!!)  Appears to minimize effects of language and culture  REDUCE BIAS  Latinx/African American test takers tend to score 15 pts lower on the Binet/Wechsler  7 to 8 points lower with the Raven

Nonverbal Group Ability Tests Goodenough-Harris Drawing Test (G-HDT)  Nonverbal test  Can be administered individually or to a group  Very quick, easy, and inexpensive to administer  Simple  draw a picture of a whole man  Scoring is based on what is or is not included in the drawing  Up to 70 points possible  Score converted to standard scores (mean = 100, SD = 15)  SOMEHOW significantly related to the Wechsler tests! (not perfect!)  Particularly appropriate for younger children  Often included as part of a test battery (don't use it alone!)

Military Ability Tests ASVAB (Armed Services Vocational Aptitude Battery)  Administered to over a million people each year  Designed for 11th and 12th graders and those out of high school  10 subtests (ex: general science, electronics information)  3 academic composites  4 occupational composites  1 overall composite (general ability)  Helps determine entry, assignments, military training schools  Excellent psychometrics  Now administered adaptively using computers

Testing in Clinical and Counseling  Testing is used in a variety of ways:  Symptoms  psychological diagnosis  Personality ("relatively stable and distinctive pattern of behavior")  Neuropsychological  brain injury, cognitive deficits, LDs  Often emphasized early in treatment (diagnostic assessment)  Symptom measures SHOULD be used during treatment  Important to know:  What a measure was developed to do (and NOT to do)  Psychometrics of the measure (reliability, validity, who was it validated with?)  How to APPROPRIATELY interpret scores  Diagnoses should NOT be based on one measure  a full assessment is done with an interview and multiple measures

Strategies of Structured Personality Test Construction Logical-Content Strategy  Most early personality tests were built with this method  Measure developed using reason (i.e. what SHOULD measure this?)  Test designer determines the type of content needed to measure the characteristic being assessed  Makes the assumption that a test item accurately describes the subject's personality or behavior Theoretical Strategy  Measure development begins with a theory about the personality characteristic being assessed  Build items that are consistent with the theory Criterion-Group Strategy  Approach relies on data collection and statistics  Requires 2 groups: 1. Criterion group (i.e. all have a common feature, diagnosis, disorder) 2. Control group (i.e.
do NOT have the feature, diagnosis, disorder)  A large number of items are given to BOTH the criterion and control groups  Those items that BEST DISCRIMINATE between the two are kept, e.g., most of the criterion group marks True while most of the control group marks False, or most of the control group marks True while most of the criterion group marks False (a sketch of this item-selection step appears below, after the MMPI development slide)  Once you know the items that discriminate in your initial group …  … cross-validate with a new group of subjects …  … who meet the same criterion and control criteria …  … to make sure that the selected items STILL discriminate b/w the two groups

Woodworth Personal Data Sheet  Developed using the logical-content strategy  VERY first personality inventory (during WWI)  Helped identify the emotional state of recruits  Format was basically a paper-and-pencil psychiatric interview  Questions were chosen to reduce false positives:  Questions that more than 25% of a normal sample endorsed were dropped  Only included items that occurred 2x more often in a group of "neurotic" individuals than in "normals"  Led to the development of other personality disorder inventories Criticisms of the Logical-Content Approach  Based on face validity  People may be unable to accurately evaluate their own situation/problems (the approach assumes they can!)

Minnesota Multiphasic Personality Inventory (MMPI)  Developed using the criterion-group strategy  Format: true/false, self-report inventory  Scales: 1. Validity: help identify possible dishonest / inconsistent responding 2. Clinical: help identify psychological disorders / symptoms 3. Content: related to specific content areas (e.g. anger)  Purpose:  Distinguish between "normal" and non-normal groups  Originally designed to assist with diagnosis  Requires a minimum reading level to be useful (MMPI – 6th grade, MMPI-2 – 8th)

Minnesota Multiphasic Personality Inventory (MMPI): Development  Began with 1000 items  ended with 504  Used 8 different diagnostic criterion groups:
Hypochondriacs – Patients who suffer from overconcern with bodily symptoms and express conflicts through bodily (somatic) symptoms
Depressives – Patients with depressed mood, loss of appetite, loss of interest, suicidal thoughts, and other depressive symptoms
Hysterics – Immature individuals who overdramatize their plight and may exhibit physical symptoms for which no physical cause exists
Psychopathic deviates – Individuals who are antisocial and rebellious and exploit others without remorse or anxiety
Paranoids – Individuals who show extreme suspicions, hypersensitivity, and delusions
Psychasthenics – Individuals plagued by excessive self-doubts, obsessive thoughts, anxiety, and low energy
Schizophrenics – Disorganized, highly disturbed individuals out of contact with reality and having difficulties with communication, interpersonal relations, sensory abnormalities (e.g., hallucinations), or motor abnormalities (e.g., catatonia)
Hypomanics – Individuals in a high-energy, agitated state with poor impulse control, inability to sleep, and poor judgment
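As a concrete illustration of the criterion-keying step described above (keep only the items whose endorsement rates clearly separate the criterion group from the controls, then cross-validate), here is a minimal Python sketch. The endorsement rates and the 0.30 gap threshold are invented for the example; they are not MMPI values.

```python
# Toy criterion-keying: keep items whose True-endorsement rates differ most
# between a criterion group and a control group. All numbers are hypothetical.

criterion_rates = {"item_1": 0.82, "item_2": 0.40, "item_3": 0.75, "item_4": 0.55}
control_rates   = {"item_1": 0.20, "item_2": 0.38, "item_3": 0.30, "item_4": 0.52}

MIN_GAP = 0.30   # arbitrary threshold for "discriminates between the groups"

def select_items(criterion, control, min_gap=MIN_GAP):
    keep = []
    for item in criterion:
        gap = criterion[item] - control[item]
        if abs(gap) >= min_gap:                       # either direction can be keyed
            keyed_direction = "True" if gap > 0 else "False"
            keep.append((item, keyed_direction, round(gap, 2)))
    return keep

print(select_items(criterion_rates, control_rates))
# -> [('item_1', 'True', 0.62), ('item_3', 'True', 0.45)]
# These survivors would then be re-checked (cross-validated) in a fresh
# criterion and control sample before being kept on the final scale.
```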
Minnesota Multiphasic Personality Inventory (MMPI): Development  The criterion groups were compared to the control group  Issue with the control group being non-representative!!!  Added 2 content groups: 1. Masculinity/Femininity 2. Social introversion  Validity Scales  L Scale: presenting the self in an overly favorable way, "fake good"  K Scale: defensiveness (empirically rather than rationally derived)  F Scale: endorsing items rated very infrequently, "fake bad"

Minnesota Multiphasic Personality Inventory (MMPI): Interpretations  Scoring  T scores for each scale  M = 50, SD = 10  Scores > 70 considered a "significant elevation" (> 65 on the MMPI-2)  PROBLEMS!  Those with a specific diagnosis did not just have an elevation on that one scale – most had multiple scale elevations  Made it very difficult to use this to diagnose! (can't just say "he is elevated on depression so he must have depression")  Tried to use "pattern analysis", thinking that different diagnoses would have different patterns of scores  not really …  … this is why we just do a diagnostic interview!

MMPI-2: Restandardization  Started in 1982  Update and expand norms (better control group!)  Revise out-of-date items (awkward, sexist, otherwise problematic)  Add items to measure additional constructs (e.g. Content Scales)  Increased the number of items to 567 (still T/F)  Some items dropped  Added: 81 items for new content scales, 2 critical items  New sample for norms  CA, MN, NC, OH, PA, VA, WA  2600 total  Added additional validity scales  Variable Response Inconsistency Scale (VRIN): random responding  True Response Inconsistency Scale (TRIN): tendency to mark "true" regardless of the question (acquiescence)

MMPI-2: Pearson Resources  Interpreting Clinical Scales  Content Scales  Sample Reports

MMPI-2: Psychometric Properties  MMPI and MMPI-2 are comparable psychometrically  Split-half reliability (median) = .70s  Test-retest reliability (median) = low .50s to low .90s  "as high or better than comparable tests"  … a consideration when interpreting a test  A reason for using multiple measures!

MMPI-2: Issues  Many items are on more than one scale (some on MANY scales)  VERY HIGH level of intercorrelation between the clinical scales  Makes sense, if one item is counted on many scales!  The scales are NOT independent of each other; if you answer True to one question, it will influence the score on multiple scales  Many items are keyed in the same direction  May lead to response sets / styles (e.g. acquiescence)  Would want an equal number keyed T and F  L Scale: all of the items keyed false  K Scale: 29 of 30 items keyed false  VRIN/TRIN help to identify issues with response sets, but …

NEO Personality Inventory  Developed w/ a combined approach: theory & factor analysis  3 broad domains, each with 6 facets: 1. Neuroticism: anxiety, hostility, depression, self-consciousness, impulsiveness, vulnerability 2. Extraversion: warmth, gregariousness, assertiveness, activity, excitement seeking, positive emotions 3. Openness: fantasy, aesthetics, openness to feelings, willingness to try new activities, intellectual curiosity, values  Rational approach to item writing (face valid!)  14 items per facet (7 positively worded, 7 negatively worded)  5-point Likert scale (strongly disagree  strongly agree; see the scoring sketch below)  Good psychometrics  Internal consistency and test-retest reliability – high .80s to low .90s  Often used to measure the Big Five (the NEO domains plus agreeableness and conscientiousness)  Translated into many languages
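Because the NEO mixes positively and negatively worded items on a 5-point Likert scale, scoring a facet means reverse-keying the negatively worded items before summing. A minimal sketch, with hypothetical item names, keys, and responses:

```python
# Score one hypothetical facet: 1-5 Likert responses, some items reverse-keyed.

LIKERT_MIN, LIKERT_MAX = 1, 5

def score_facet(responses, reverse_keyed):
    """Sum Likert responses after flipping the reverse-keyed items.

    responses: dict of item_id -> response (1 = strongly disagree ... 5 = strongly agree)
    reverse_keyed: set of item_ids whose wording runs opposite to the trait
    """
    total = 0
    for item, value in responses.items():
        if item in reverse_keyed:
            value = (LIKERT_MIN + LIKERT_MAX) - value   # 1<->5, 2<->4, 3 stays 3
        total += value
    return total

responses = {"w1": 4, "w2": 5, "w3": 2, "w4": 1}   # invented warmth-facet items
reverse   = {"w3", "w4"}                            # negatively worded items
print(score_facet(responses, reverse))              # 4 + 5 + 4 + 5 = 18
```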
How do you become a Neuropsychologist?  Get a PhD or PsyD in Clinical Psychology  This is why most call themselves a CLINICAL neuropsychologist  Clinical psychology training came first!  Internship training in neuropsychology

Neuropsychological Testing  Focus on psychological impairments of the CNS & rehabilitation  Studies the relationship between behavior and brain functioning  Cognitive  Motor  Sensory  Emotional  Overlaps with neurology, psychiatry, and psychometric testing  Used to help identify what is missed by neuroimaging (CT/MRI)  Provides clues about areas of the brain to examine more closely  Provides behavioral and functional assessments (brains differ!)  Early detection of diseases like Alzheimer's  Develop interventions for those with brain injuries (TBI, stroke, etc.)  Help understand concussions and decide when to return

Neuropsychological Testing  Groups of patients that neuropsychologists often work with: 1) Acquired brain dysfunction (e.g., stroke, tumor, CNS infection) 2) Traumatic brain injury 3) Seizure disorders/epilepsy 4) Demyelinating diseases (e.g., multiple sclerosis, ALS) 5) Neurodegenerative disease (e.g., dementia, movement disorders) 6) Medical disorders impacting cognition (e.g., vascular, transplant, pain) 7) Psychiatric conditions (e.g., anxiety, mood, and substance abuse) 8) Childhood issues (e.g., learning disabilities, ADHD, autism spectrum) https://scn40.org/anst/whats-neuropsychology/

Neuropsychological Testing  TABLE 17.1 Selected Neuropsychological Deficits Associated With Left or Right Hemisphere Damage
Left hemisphere: word memory problems; right–left disorientation; finger agnosia; problems recognizing written words; problems performing calculations; problems with detailed voluntary motor activities, not explained by paralysis
Right hemisphere: visual–spatial deficits; impaired visual perception; neglect; difficulty writing; problems with spatial calculations; problems with gross coordinated voluntary motor activities, not explained by paralysis; problems dressing; inability to recognize a physical deficit (e.g., denial of a paralyzed limb)
For left-handers: 20% have language skills on the right, 15% split b/w L & R

Neuropsychological Testing  NPs go about testing in different ways  Some use fixed batteries of neuropsychological tests  EVERYONE gets the same set of tests every time  Examples: Halstead – Reitan and Luria – Nebraska  Others create a unique set of tests for each individual client based on the presenting problem, which typically includes: 1) General intellect 2) Motor skills and sensation 3) Attention and concentration 4) Language 5) Visual-spatial skills 6) Learning and memory 7) Executive functioning 8) Mood and personality https://scn40.org/anst/whats-neuropsychology/

Halstead – Reitan NP Battery  Began in a laboratory designed to assess the impact of brain impairments on adults  Led to an extensive battery that includes MANY different tests  Takes 8-12 hours to complete  The examinee typically also takes the MMPI (another 567 questions!)  Available for both adults and children  Valid and useful for:  Identifying dysfunction in a given hemisphere  Identifying tumors or lesions in the brain  Some argue that the benefits do not outweigh the extensive time and effort required to administer  I've never seen it used in its entirety; subtests we used at WR:  Finger tapping, trail making, strength-of-grip
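However the battery is assembled, individual test scores are usually interpreted against demographically matched norms. Here is a minimal sketch of that comparison; the normative mean and SD below are invented, and for timed tests where a longer time is worse, the sign of the z-score is flipped so that negative always means "worse than expected".

```python
# Compare a patient's raw score to hypothetical normative data.
from statistics import NormalDist

def norm_z(raw, norm_mean, norm_sd, higher_is_better=True):
    """z-score relative to a normative group; negative = worse than expected."""
    z = (raw - norm_mean) / norm_sd
    return z if higher_is_better else -z

def percentile(z):
    return 100 * NormalDist().cdf(z)

# Hypothetical: Trail Making Test Part B completion time (seconds), lower is better.
z = norm_z(raw=110, norm_mean=75, norm_sd=20, higher_is_better=False)
print(f"z = {z:.2f}, percentile = {percentile(z):.0f}")   # z = -1.75, ~4th percentile
```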
Luria – Nebraska NP Battery  Views the brain as a functional system with several areas contributing to any one ability or skill  Luria used very individualized assessments that were later incorporated into the Luria-Nebraska Neuropsychological Battery  269 items that can be given in 2–4 hours (takes many sessions!)  Divided into 11 subsections  Includes 32 items that are "highly sensitive to brain dysfunction"  High discrimination!  2 scores that indicate whether dysfunction is on the L or R side  Can compare the overall profile of scores with research

Automated Neuropsychological Assessment Metrics (ANAM)  Developed by the U.S. Department of Defense  Used w/ a variety of populations  Developed for pre/post deployment  Now also used to assess the effects of head injuries in athletes  Sensitive to cognitive changes that result from diseases  Pencil-and-paper and person-administered tests may show different results from computer administration  There is "reasonably good evidence" of its validity

California Verbal Learning Test (CVLT)  Example of a test that would be part of an individualized battery  Many tests just tell you IF there is a deficit; this one gives more info  How the test is done:  Hear a list of 16 items at 1 word per second  Asked to repeat the list  Repeated FIVE times (that's a lot!)  Hear a DIFFERENT interference list of 16 words  Asked to immediately repeat that list  After 20 minutes, asked to remember the FIRST list by: 1. Free recall (no hints) 2. Cued recall (hint provided) 3. Recognition (do you recognize the word when you see it?)

California Verbal Learning Test (CVLT)  Why do so many different trials? It gives you a LOT of info  Different strategies (are they grouping words together?)  Is there a difference between recall and recognition memory?  Can see the role of serial position effects  Learning rates across trials (improvement?)  Consistency of item recall across trials  How well information is retained over delays (short/long)  Errors in recall and recognition (should be FAR fewer in recognition)  The scoring program provides all of this information (and more!) based on how many words were remembered, and in what order  There are multiple versions of the CVLT, and competitors like the HVLT

Tests We Often Used  WMS: Logical Memory, Visual Reproduction  Trail Making Test / Trails A & B (discussed in text)  CVLT  Grooved Pegboard  WAIS (usually just certain subtests)  Finger Tapping (from the Halstead – Reitan)  Wisconsin Card Sorting Test (you can try it!)  Boston Naming Test  Would add or subtract tests depending on the presenting problem  ALWAYS test for effort! (…discussed more in next lecture!)

Effort Testing & Symptom Validity  OFTEN overlooked outside of neuropsychology  Significant issue for test interpretation  Treatment planning  Allocation of services / resources  MOST people are being honest / putting forth effort  Even those who "fail effort" may not do so on purpose  Lack of motivation (related to depression or other mental illness)  Conversion disorder  Always assume that something is going on (it may not be what they say)  Tests can be:  Embedded – built into a test  Stand-alone – an additional test that you add to a battery

Effort Testing  Related to ability tests (i.e. cognition, sensorimotor)  Reduced effort  invalidates the measure  Observed score is lower than the true score  Low motivation  just not trying to do well  Malingering  TRYING to do poorly (picking the WRONG answer)  Effort tests take very little effort to do well on!  People with VERY severe cognitive and psychiatric illnesses can get perfect or near-perfect scores on them
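One common way deliberately poor performance is detected (a general technique, not a specific test named in this lecture) is with two-choice forced-choice items: someone guessing at random lands near 50% correct, so scoring far below chance suggests the wrong answers were being chosen on purpose. A sketch of that logic using a binomial tail probability; the 40-item length and the example score are arbitrary.

```python
# Probability of scoring at or below k correct on n two-choice items by guessing.
from math import comb

def p_at_or_below_chance(k, n, p_guess=0.5):
    """Binomial tail probability P(X <= k) under random guessing."""
    return sum(comb(n, i) * p_guess**i * (1 - p_guess)**(n - i) for i in range(k + 1))

n_items = 40
score = 11                      # hypothetical examinee: 11/40 correct
p = p_at_or_below_chance(score, n_items)
print(f"P(score <= {score} by guessing) = {p:.4f}")
# A very small p (here ~0.003) means the performance is worse than guessing,
# which is hard to do without knowing the correct answers.
```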
Symptom Validity  Gets at issues of self-report (are they always accurate? … no)  Related to somatic and psychiatric complaints  Symptom exaggeration (symptom exists  severity is exaggerated)  Symptom fabrication (symptom does not actually exist)  Requires an in-depth diagnostic evaluation  History, past reports/notes, information from others if possible  Many of these symptoms are subjective by nature  can't just say "you're wrong!"; pull together information to support your opinion  Symptom validity measures (e.g. SIRS)  Any inconsistencies (in reporting, with the typical course of illness/symptoms)  Change in behavior when being observed vs. not  PTSD – identify a Criterion A trauma (VA disability/service connection claims)

Malingering  Very possible to have poor effort and/or symptom validity issues with a client who is NOT considered malingering  Could be other psychological issues (likely something!)  Conversion disorder  Malingering can be a challenge to actually "diagnose"  NOT a DSM diagnosis  Requires intent of the client to exaggerate symptoms or diminish their ability during the assessment for some external reward  Secondary gain: $$$, time away from work/military, getting out of legal trouble, getting medication/narcotics, etc.

State-Trait Anxiety Inventory  State anxiety: changes often, day-to-day, with the situation  Trait anxiety: part of personality, how one normally responds  STAI: 40-item scale (20 S, 20 T), 4-point Likert scale  Test-retest reliability: 0.73 – 0.86 (T) and 0.16 – 0.54 (S)  Good concurrent validity with other anxiety measures  Good discriminant validity (S changes and T doesn't – they measure different things!)  Translated into many languages  Parent and child versions (and parent report of child)

PHQ-9  Not copyrighted!  Often used as a screening measure for depressive symptoms  Provides information about the severity of depressive symptoms  Can be used to track change over the course of treatment (see the scoring sketch below)  High internal consistency

Beck Depression Inventory  Copyrighted … but on the internet  Less often used because it costs $  Provides information about the severity of depressive symptoms  High internal consistency

Quality of Life  Symptoms are not the only thing that matters!  Sometimes people will live with some symptoms, or symptoms will come and go throughout life …  Need to consider QOL in addition to symptoms  Improve the GOOD in addition to decreasing the bad  SF-36 is the most common measure (self-report / interview)  Tried the SF-20, but it didn't have good reliability (why does this make sense?)  8 health concepts: 1. Physical functioning 2. Role limitations due to physical health 3. Pain 4. General health perceptions 5. Vitality (energy/fatigue) 6. Social functioning 7. Role limitations due to emotional problems 8. Mental health / emotional well-being

NIH Toolbox  There are A LOT of measures that cost $$$$  … and there are just A LOT OF MEASURES  When everyone uses a different measure  difficult to compare  NIH decided to create … even more measures to fix this?  NIH Toolbox  nihtoolbox.org  Want us all to use the same measures (good idea, but …)  MEASURES are free … the system/software is not
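Circling back to the PHQ-9 mentioned above: it is a simple sum score (nine items rated 0-3), with severity read off fixed bands. The bands below are the commonly published 5/10/15/20 cut points; treat them, and the example responses, as assumptions of this sketch rather than lecture content.

```python
# Minimal PHQ-9-style sum score with commonly cited severity bands (assumed here).

def phq9_severity(item_scores):
    if len(item_scores) != 9 or not all(0 <= s <= 3 for s in item_scores):
        raise ValueError("expects nine items scored 0-3")
    total = sum(item_scores)
    if total >= 20:
        band = "severe"
    elif total >= 15:
        band = "moderately severe"
    elif total >= 10:
        band = "moderate"
    elif total >= 5:
        band = "mild"
    else:
        band = "minimal"
    return total, band

print(phq9_severity([2, 1, 2, 1, 1, 0, 1, 2, 1]))   # -> (11, 'moderate')
```

Because the score is just a sum, re-administering it across sessions gives an easy way to track symptom change during treatment, which is how the slide above suggests it be used.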
Why Test Bias is Controversial  All people may be created equal, but they are not all treated that way  When test scores show differences, are those differences real or an artifact of bias in the test?  Studies have consistently found the same differences  Example – 15 pt / 1 SD difference on IQ tests b/w African American and white test takers  The distributions of scores overlap!!  People have considered reasons why differences occur  Environmental factors  Biological differences and the g factor

Traditional Defense of Testing QUESTION: Are standardized tests as valid for African Americans and other groups as they are for white individuals? One answer is that tests are differentially valid for the groups  Controversial!  Would mean  test differences between groups DO NOT equal test bias; instead the tests have different meanings for different groups  If a test is valid in different ways for different groups, is it really valuable? Is it ethical to use?

Traditional Defense of Testing Content-Related Evidence for Validity  IQ test scores may be impacted by language skills developed as part of a white, middle-class upbringing  Puts other kids at a disadvantage (nothing to do with intelligence!)  Privileged kids are more likely to be exposed to words/concepts on the test   advantage to kids from wealthier homes (again, not b/c they are smarter!)  Evidence of SAT scores having a positive correlation with family income  A study attempted to reduce bias by removing "unfair" items  After many attempts, the difference in scores between groups remained

Traditional Defense of Testing Differential Item Functioning (DIF) Analysis  Developed by the Educational Testing Service (of GRE & SAT fame)  On all of their tests, white test takers do better than other racial and ethnic groups (Asian test takers do best on quantitative tests)  Also a consistent difference b/w men and women on math  What is DIF analysis?  When tests show different performances between white test takers and other racial or ethnic groups, items that may show bias are analyzed  The test is then rescored with those items eliminated to see if overall scores show significant change  Outcome: only a slight effect on the gap in scores

Traditional Defense of Testing Criterion-Related Sources of Bias  The test's ability to accurately predict  Based on the correlation between the test and the criterion (in the future)  Create a regression line (i.e. the best-fit line in the data)  Create a regression line with one sample (both scores)  formula  Use that formula to let one score (test) predict the other score (criterion) for future test takers (i.e. use the GRE to predict GPA)

Traditional Defense of Testing Criterion-Related Sources of Bias  A single regression slope predicts performance equally well for two groups; however, the means of the groups differ  In some cases, two regression lines are needed, and the overlap between their isodensity curves corresponds to the average regression line between them  If these lines are parallel, they reflect similar application for different groups

Traditional Defense of Testing  Regression lines with different slopes suggest that a test has different meanings for different groups. This is the most clear-cut example of test bias.
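The slope and intercept language above can be made concrete: fit a separate regression of the criterion on the test for each group and compare the lines. Roughly parallel lines with different intercepts tell a different story than lines with clearly different slopes. All data below are invented for illustration.

```python
# Fit one regression line per group and compare slopes/intercepts (invented data).
from statistics import linear_regression   # Python 3.10+

group_a = ([40, 50, 60, 70, 80], [2.4, 2.8, 3.0, 3.3, 3.6])
group_b = ([40, 50, 60, 70, 80], [2.1, 2.5, 2.7, 3.0, 3.3])

def fit(xy):
    x, y = xy
    slope, intercept = linear_regression(x, y)
    return slope, intercept

slope_a, int_a = fit(group_a)
slope_b, int_b = fit(group_b)

print(f"Group A: criterion = {int_a:.2f} + {slope_a:.3f} * score")
print(f"Group B: criterion = {int_b:.2f} + {slope_b:.3f} * score")
# Near-equal slopes with different intercepts ~ parallel lines (similar meaning,
# different means); clearly different slopes would be the clearest sign that the
# test means different things in the two groups.
```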
"Ignorance vs. Stupidity"  QUESTION: Do tests measure how smart a child is or their ability to memorize the right answer to the test's questions?  If someone does poorly, are they unintelligent? Or have they just not yet learned/been exposed to the right response?  One implies something stable, the other something flexible/fluid  To demonstrate the "ignorance" position, two tests were developed that would intentionally favor people of color and put white test takers at a disadvantage  The Dove Counterbalance General Intelligence Test, AKA the Chitling Test  The Black Intelligence Test of Cultural Homogeneity (BITCH)  On both tests, white test takers did worse than African Americans

Ethical Concerns and Definition of Test Bias  There is no uniform agreement about what "test bias" means, but three models include:  Unqualified individualism  Quotas  Qualified individualism  Each model relates to a specific statistical definition of test bias, all of which define fairness differently and are based on regression lines  Examine the drawbacks and limitations of each approach to see which is most appropriate for a given situation or circumstance

Ethical Concerns and Definition of Test Bias (models of fair selection, compared on use of regression, rationale, effect on minority selection, and effect on average criterion performance)
Regression model – Use of regression: separate regression lines are used for different groups; those with the highest predicted criterion scores are selected. Rationale: this is fair because those with the highest estimated level of success are selected. Effect on minority selection: few minority group members selected. Effect on average criterion performance: good performance on criteria.
Constant ratio model – Use of regression: points equal to approximately half of the average difference between the groups are added to the test scores of the group with the lower score; then a single regression line is used, and those with the highest predicted scores are selected. Rationale: this is fair because it best reflects the potential of the lower-scoring group. Effect on minority selection: some increase in the number of minority group members selected. Effect on average criterion performance: somewhat lower.
Cole/Darlington model – Use of regression: separate regression equations are used for each group, and points are added to the scores of those from the lower group to ensure that those with the same criterion score have the same predictor score. Rationale: this is fair because it selects more potentially successful people from the lower group. Effect on minority selection: larger increase in the number of minority group members selected. Effect on average criterion performance: lower.
Quota model – Use of regression: the proportion of people to be selected from each group is predetermined; separate regression equations are used … Rationale: this is fair because members of different subgroups are … Effect on minority selection: best representation of minority groups. Effect on average criterion performance: about the same as for the Cole/Darlington model.
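To make two of the models in the table concrete, here is a toy sketch comparing plain top-down selection on predicted criterion scores (the "regression" model) with a constant-ratio-style adjustment that adds half the average group difference to the lower-scoring group before selecting. The applicants, scores, group labels, and the number selected are all invented.

```python
# Toy comparison of two selection models from the table (invented applicants).
from statistics import mean

applicants = [  # (name, group, test score); higher score = higher predicted criterion
    ("A1", "higher", 82), ("A2", "higher", 78), ("A3", "higher", 74),
    ("B1", "lower", 73), ("B2", "lower", 70), ("B3", "lower", 66),
]
N_SELECT = 3

def top_down(people, key):
    return [name for name, _, _ in sorted(people, key=key, reverse=True)[:N_SELECT]]

# "Regression" model: select straight off the (predicted) criterion ordering.
regression_pick = top_down(applicants, key=lambda p: p[2])

# Constant-ratio-style adjustment: add half the mean group difference to the
# lower-scoring group's scores, then select top-down on the adjusted scores.
gap = mean(s for _, g, s in applicants if g == "higher") - \
      mean(s for _, g, s in applicants if g == "lower")
adjusted = [(n, g, s + gap / 2 if g == "lower" else s) for n, g, s in applicants]
constant_ratio_pick = top_down(adjusted, key=lambda p: p[2])

print("regression model selects:    ", regression_pick)
print("constant-ratio-style selects:", constant_ratio_pick)
# -> with these invented numbers, the adjustment moves one lower-group
#    applicant (B1) into the selected set in place of A3.
```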
