Chapter 3: Issues in Personality Assessment

Learning Objectives

3.1 Report three techniques used to access information about personality
3.2 Distinguish among the three kinds of reliability
3.3 Analyze the issue of validity in assessment
3.4 Relate the logic behind the theoretical and the empirical approaches to the development of assessment devices
3.5 Analyze the importance of investing effort in creating and improving tests of personality

The measuring of personality is called assessment. It's something we all do informally all the time. We want to know what the people we interact with are like, so we know what to expect of them. For this reason, we develop ways to gauge people, to judge what they're like. You probably don't think of this as "assessment," but what you're doing informally is much the same---in principle---as what psychologists do more formally.

Forming impressions of what other people are like can be hard. It's easy to get misleading impressions. Personality assessment is also hard for psychologists. All the problems you have, they have. But personality psychologists work hard to deal with those problems.

3.1: Sources of Information

3.1 Report three techniques used to access information about personality

Informal assessment draws information from many sources, and so does formal assessment.

3.1.1: Observer Ratings

As suggested in Chapter 2, many measures of personality come from someone other than the person being assessed (Funder, 1991; Paunonen, 1989). The name for this approach is observer ratings.

There are many kinds of observer ratings. Some of them involve interviews. People being assessed talk about themselves, and the interviewer draws conclusions from what's said and how it's said. Sometimes people being interviewed talk about something other than themselves. In doing this, they reveal something indirectly to the interviewer about what they're like.

[Photo: an observer directly rating a research participant's overt behavior.]

Other kinds of observer ratings don't require that kind of complexity. Observers may make judgments about a person based on watching his or her actions. Or observers may be people who already know the person well enough to say what he or she is like, and their ratings are simply those summary judgments. Observers can even observe a person's belongings and draw conclusions about what the person is like (see Box 3.1).

Box 3.1 What Does Your Stuff Say about You?

Many people assume that the vast reach of the web and popular media has completely homogenized American culture. Everyone buys more or less the same stuff, and everyone's personal space therefore looks more or less the same. Not so. Not even close. Sam Gosling and his colleagues have found that people "portray and betray" their personalities by the objects and mementos they surround themselves with (Gosling, 2008). Practicing a research technique they refer to as "snoopology" (the science of snooping), these researchers have extensively studied people's offices, bedrooms, and other personal domains. They've found evidence of three broad mechanisms that connect people to their spaces. They refer to these as identity claims, feeling regulators, and behavioral residue.

Identity claims are symbolic statements about who we are. Photos, awards, bumper stickers, and other objects that symbolize a past, current, or hoped-for identity (e.g., cheerleader pompoms) are identity claims.
They are indicators of how we want to be regarded. They can be directed to other people who enter our space, or they can be directed to ourselves, as reminders of who we are (or want to be). Photos can be particularly revealing. They say, "Here I am being me" (Gosling, 2008, p. 16).

Feeling regulators aren't intended to send messages about our identities but to help us manage our emotions. Being in a particular desired emotional state can be important for a variety of life's activities, and emotions can be regulated in a wide variety of ways. You can improve your mood by looking at a picture that reminds you of a time when you were very happy. You can soothe yourself with pictures of tranquil nature scenes and with readily available music playing through a high-quality sound system. A bathtub surrounded by candles and scented oils can be eminently relaxing. And if you're the sort of person who thrives on excitement, there are plenty of things that can be included in your surroundings to stimulate those feelings, as well.

Behavioral residues are in some ways less interesting than either of these. Behavioral residues are physical traces left in our surroundings by everyday actions (trash is a special case of residue that is discarded repeatedly). What can these residues tell about you? A simple thing is how much residue you've accumulated. The more the residue, the less organized you probably are. A separate issue is what kinds of residue show up. As noted in Chapter 1, personality is displayed in consistencies. Similarly, behavioral residue tends to give an indication of what sorts of things take place repeatedly in your life space.

3.1.2: Self-Reports

Another category of assessment technique is self-report. In self-reports, people themselves indicate what they think they're like or how they feel or act. Self-reports thus resemble the process of introspection described in Chapter 2. Although self-reporting can be done in an unstructured descriptive way, it's usually not. Most self-reports ask people to respond to a specific set of pre-written items.

Self-report scales can be created in many formats. An example is the true--false format, where you read statements and decide whether each one is true or false for you. Another common one is a multipoint rating scale. Here, a wider range of response options is available---for example, a 5-point response scale ranging from "strongly agree" to "strongly disagree."

Some self-reports focus on a single quality of personality. Often, though, people who develop assessment devices want to assess several aspects of personality in the same test (as separate scales). A measure that assesses several dimensions of personality is called an inventory. The process of developing an inventory is pretty much the same as that for developing a single scale. The difference is that for an inventory, you go through each step of development for each scale of the inventory, rather than for just one.

3.1.3: Implicit Assessment

Also of increasing interest over the past decade (though they've been around for a long time) are techniques called implicit assessment. These techniques try to find out what a person is like from the person (like self-reports) but not by asking him or her directly. Rather, the person does a task of some sort that involves making judgments about stimuli. The pattern of responses (e.g., reaction times) can inform the assessor about what the person is like.
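To make that abstract idea concrete before turning to an actual instrument, here is a minimal sketch, in Python, of how a pattern of reaction times might become a score. The trial counts, timings, and scoring rule are hypothetical illustrations, not any published procedure.

```python
import numpy as np

# Hypothetical data: reaction times (ms) for trials on which a participant
# classified trait words. "me_rts" are trials pairing trait words with a
# "me" response; "not_me_rts" pair the same kinds of words with "not me."
me_rts = np.array([512.0, 498.0, 530.0, 487.0, 505.0])
not_me_rts = np.array([640.0, 615.0, 662.0, 598.0, 631.0])

# A simple implicit score: how much faster "me" responses are. Faster
# "me" responses suggest the trait is strongly linked to the self-concept.
implicit_score = not_me_rts.mean() - me_rts.mean()
print(f"Mean 'me' RT: {me_rts.mean():.0f} ms")
print(f"Mean 'not me' RT: {not_me_rts.mean():.0f} ms")
print(f"Implicit association score: {implicit_score:.0f} ms")
```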
An example of such a procedure is the Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998). It measures links among memory traces that are believed to be hard to detect by introspection (and thus are "implicit"). The IAT can be applied to virtually any kind of association. As applied to properties of personality, it would go like this. Your job is to categorize a long series of stimuli as quickly as you can. Each can be categorized according to either of two dichotomies: "me" versus "not me" or (for example) "plant" versus "mineral." You don't know which dichotomy applies until the item appears. Some items pertain to qualities of personality. If one of those items is strongly associated in your memory with you, your "me" response will be faster than if it isn't. Thus, reaction times across a large number of stimuli can provide information about your implicit sense of self.

Implicit assessment techniques have been particularly important in the motive approach to personality. Accordingly, we will spend more time on that technique in Chapter 5.

As indicated by the preceding sections, the arsenal of assessment techniques is large. All require two things, though. First, in each case, the person being assessed produces a sample of "behavior." This may be an action, which someone observes; it may be internal behavior, such as a change in heart rate; it may be the behavior of answering questions; or it may be the accumulation of possessions over years. Second, someone then uses the behavior sample as a guide to some aspect of the person's personality.

3.1.4: Subjective versus Objective Measures

One more distinction among measures is important. Some measures are termed subjective, and others are termed objective. In subjective measures, an interpretation is part of the measure. An example is an observer's judgment that the person he or she is watching looks nervous. The judgment makes the measure subjective, because it's an interpretation of the behavior. If the measure is of a physical reality that requires no interpretation, it's objective. For example, you could count the number of times a person stammers while talking. This would involve no interpretation. Although this count might then be used to infer nervousness, the measure itself is objective.

To some extent, this issue cuts across the distinction between observer ratings and self-reports. An observer can make objective counts of acts or can develop a subjective impression of the person. Similarly, a person making a self-report can report objective events as they occur (as in experience sampling) or can report a subjective overall impression of what he or she is like. It should be apparent, though, that self-reports are particularly vulnerable to incorporating subjectivity. Even reports of specific events permit unintended interpretations to creep in.

3.2: Reliability of Measurement

3.2 Distinguish among the three kinds of reliability

All techniques of assessment confront several kinds of problems or issues. One issue is termed reliability of measurement. The nature of this issue can be conveyed by putting it as a question: Once you've made an observation about someone, how confident can you be that if you looked again a second or third time, you'd see about the same thing? When an observation is reliable, it has a high degree of consistency or repeatability. Low reliability means that what's measured is less consistent.
The measure isn't just reflecting the person being measured. It's also including a lot of randomness, termed error. All measurement procedures have sources of error (error can be reduced, but not eliminated). When you use a telescope to look at the moon, a little dust on the lens, minor imperfections in the glass, flickering lights nearby, and swirling air currents all contribute error to what you see. When you use a rating scale to measure how self-reliant people think they are, the way you phrase the item can be a source of error, because it can lead to varying interpretations. When you have an observer watch a child's behavior, the observer is a source of error because of variations in how closely he or she is paying attention, thinking about what he or she is seeing, or being influenced by a thousand other things.

How do you deal with the issue of reliability in measurement? The general answer is to repeat the measurement---make the observation more than once. Usually, this means measuring the same quality from a slightly different angle or using a slightly different "measuring device." This lets the diverse sources of error in the different devices cancel each other out.

Reliability actually is a family of problems, not just a single problem, because it crops up in several different places. Each version of the problem has a separate name, and the tactic used to treat each one differs slightly from the tactics used for the others.

3.2.1: Internal Consistency

The simplest act of assessment is the single observation or measurement. How can you be sure that it doesn't include too much error? Let's take an illustration from ability assessment. Think about what you'd do if you wanted to know how good someone was at a particular type of problem---math problems or word puzzles. You wouldn't give just a single problem, because whether the person solved it easily or not might depend too much on some quirk of that particular problem. If you wanted to know (reliably) how well the person solves that kind of problem, you'd give several problems.

The same strategy applies to personality assessment. If you were using a self-report to ask people how self-reliant they think they are, you wouldn't ask just once. You'd ask several times, using different items that all reflect self-reliance, but in different words. In this example, each item is a "measuring device." When you go to a new item, you're shifting to a different measuring device, trying to measure the same quality. In effect, you're putting down one telescope and picking up another. The reliability question is whether you see about the same thing through each of the different telescopes.

This kind of reliability is termed internal reliability or internal consistency. This is reliability within a set of observations of a single aspect of personality. Because different items have different sources of error, using many items should tend to balance out the error. The more observations, the more likely the random error will cancel out. Because people using self-report scales want good reliability, most scales contain many items (but see Box 3.2). If the items are reliable enough, then they're used together as a single index of the personality quality.

[Photo: Human judges are not infallible; they sometimes perceive things inaccurately.]

Box 3.2 A New Approach to Assessment: Item Response Theory

The idea that having lots of items increases a scale's internal consistency comes from classical test theory, which guided scale construction for years.
More recently, a different approach called item response theory (IRT) has emerged. IRT is an attempt to increase the efficiency of assessment (Reeve, Hays, Chang, & Perfetto, 2007), while reducing the number of items.

One thing IRT tries to do is determine the usefulness of the response choices. Doing this starts by creating response curves for each item. These show how frequently each response is used and whether each choice is used differently from other choices (Streiner, 2010). For example, consider a scale with the response choices "always," "often," "sometimes," and "never." Analysis might find that "often" and "sometimes" are actually treated the same. If so, there's no point in having these responses as separate choices.

IRT also determines the "difficulty" of an item (Streiner, 2010). For instance, on a scale assessing anxiety, the item "I worry" would be easier to agree with than the item "I get panicky." Why? Because the second item requires more anxiety. A more "difficult" item will better distinguish people who have anxiety from those who do not.

IRT has been applied to a diverse range of assessments, including those for personality (e.g., Samuel, Simms, Clark, Livesley, & Widiger, 2010; Walton, Roberts, Krueger, Blonigen, & Hicks, 2008) and psychological disorders (e.g., Gelhorn et al., 2009; Purpura, Wilson, & Lonigan, 2010; Uebelacker, Strong, Weinstock, & Miller, 2009). An interesting finding from these analyses is that there is more overlap than expected between measures of normal versus abnormal personality patterns (Samuel et al., 2010; Walton et al., 2008).

How do you find out whether the items you're using have good internal reliability? Just having a lot of items doesn't guarantee it. Reliability is a question about the correlations among people's responses to the items. Saying that the items are highly reliable means that people's responses to the individual items are highly correlated.

As a practical matter, there are several ways to investigate internal consistency. All of them test correlations among people's responses across items. Perhaps the best way (although it's cumbersome) is to look at the average correlation between each pair of items taken separately. A simpler way is to separate the items into two subsets (often odd- versus even-numbered items), add up people's scores for each subset, and correlate the two subtotals with each other. This index is called split-half reliability. If the two halves of the item set measure the same quality, people who score high on one half should also score high on the other half, and people who score low on one half should also score low on the other half. Thus, a strong positive correlation between halves is evidence of internal consistency (see the sketch below).

3.2.2: Inter-Rater Reliability

As noted, personality isn't always measured by self-reports. Some observations are literally observations, made by one person watching and assessing someone else. Use of observer ratings poses a slightly different reliability problem. In observer ratings, the person making the rating is a "measuring device." There are sources of error in this device, just as in other devices. How can you judge reliability in this case? Conceptually, the answer is the same as it was in the other case: You need to put down one telescope and pick up another. In the case of observer ratings, you need to check this observer against another observer. To the extent that both see about the same thing when they look at the same event, reliability is high.
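Both checks described above (correlating two halves of an item set, and correlating one observer's ratings with another's) come down to the same computation: a correlation across people. Here is a minimal sketch with simulated data standing in for real respondents; the scale length, the rater-error level, and the use of the Spearman-Brown correction for the half-length split are illustrative assumptions, not features of any particular published scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 respondents, 10 self-reliance items. Each
# response mixes a shared "true" level of the trait with item-level
# noise (the error the text describes).
true_level = rng.normal(0.0, 1.0, size=(100, 1))
responses = true_level + rng.normal(0.0, 1.0, size=(100, 10))

# Split-half reliability: correlate odd-item totals with even-item totals.
odd_total = responses[:, 0::2].sum(axis=1)
even_total = responses[:, 1::2].sum(axis=1)
r_halves = np.corrcoef(odd_total, even_total)[0, 1]

# Spearman-Brown correction: estimates full-length-scale reliability
# from the correlation between its two half-length versions.
split_half = 2 * r_halves / (1 + r_halves)

# Inter-rater reliability: correlate two observers' ratings of the same
# 30 targets. Rater B mostly agrees with rater A, plus some error.
rater_a = rng.normal(5.0, 2.0, size=30)
rater_b = rater_a + rng.normal(0.0, 1.0, size=30)
inter_rater = np.corrcoef(rater_a, rater_b)[0, 1]

print(f"Split-half reliability (Spearman-Brown corrected): {split_half:.2f}")
print(f"Inter-rater reliability: {inter_rater:.2f}")
```

With many items, or with well-trained raters, both coefficients move toward 1; heavy error pushes them toward 0.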
This is logically the same as using two items on a questionnaire. Two raters whose judgments correlate highly with each other across many ratings are said to have high inter-rater reliability.

In many cases, having high inter-rater reliability requires that the judges be thoroughly trained in how to observe what they're observing. Judges of Olympic diving, for example, have seen many thousands of dives and know precisely what to look for. As a result, their inter-rater reliability is high. Similarly, when observers assess personality, they often receive much instruction and practice before turning to the "real thing," so their reliability will be high.

[Photo: If all judges are seeing the same thing when they rate an event, inter-rater reliability will be high.]

3.2.3: Stability across Time

One more kind of reliability is important in the measurement of personality. This type of reliability concerns repeatability across time. That is, assessment at one time should agree fairly well with assessment done at a different time. Why is this important? Remember, personality is supposed to be stable. That's one reason people use the word---to convey a sense of stability. If personality is really stable---if it doesn't fluctuate from minute to minute or from day to day---then measures of personality should be reliable across time. People's scores should stay roughly the same when measured a week later, a month later, or four years later.

This kind of reliability is termed test--retest reliability. It's determined by giving the test to the same people at two different times. A scale with high test--retest reliability will yield scores the second time (the retest) that are fairly similar to those from the first time. People with high scores the first time will have high scores the second time, and people with lower scores at first will have lower scores later on. (For a summary of these three types of reliability, see Table 3.1.)

3.3: Validity of Measurement

3.3 Analyze the issue of validity in assessment

Reliability is a starting point in measurement, but it's not the only issue that matters. It's possible for measures to be highly reliable but completely meaningless. Thus, another important issue is what's called validity. This issue concerns whether what you're measuring is what you think you're measuring (or what you're trying to measure). Earlier, we portrayed reliability in terms of random influences on the image in a telescope as you look through it at the moon. To extend the same analogy, the validity issue is whether what you're seeing is really the moon or just a streetlight (see also Figure 3.1).

Figure 3.1 A simple way to think about the difference between reliability and validity, using the metaphor of target shooting. (A) Sometimes when people shoot at a target, their shots go all over. This result corresponds to measurement that's neither reliable nor valid. (B) Reliability is higher as the shots are closer together. Shots that miss the mark, however, are not valid. (C) Good measurement means that the shots are close together (reliable) and near the bull's-eye (valid).

How do you decide whether you're measuring what you want to measure? There are two ways to answer this question. One is an "in principle" answer; the other is a set of tactics. The "in principle" answer is that people decide by comparing two kinds of "definitions" with each other.
When you see the word definition, what probably comes to mind is a conceptual definition, which spells out the word's meaning in terms of properties or attributes (as in a dictionary). It tells us what information a word conveys, by consensus among users of the language. Psychologists also talk about another kind of definition, however, called an operational definition. This is a description of a physical event.

The difference between the two kinds of definition is easy to illustrate. Consider the concept love. Its conceptual definition might be something like "a strong affection for another person." There are many ways, however, to define love operationally. For example, you might ask the person you're assessing to indicate on a rating scale how much she loves someone. You might measure how often she looks into that person's eyes when interacting with him. You might measure how willing she is to give up events she enjoys in order to be with him. These three measures differ considerably from one another. Yet each might be taken as an operational definition (or operationalization) of love.

The essence of the validity issue in measurement can be summarized in this question: How well does the operational definition (the event) match the conceptual definition (the abstract quality you have in mind to measure)? If the two are close, the measure has high validity. If they aren't close, validity is low. How do you decide whether the two are close? Usually, psychologists poke at the conceptual definition until they're sure what the critical elements are and then look to see whether the same elements are in the operationalization. If they aren't (at least by strong implication), the validity of the operationalization is questionable.

The validity issue is critically important. It's also extremely tricky. It's the subject of continual debate in psychology, as researchers try to think of better and better ways to look at human behavior (Borsboom, Mellenbergh, & van Heerden, 2004). The reason the issue is important is that researchers and assessors form conclusions about personality in terms of what they think they're measuring. If what they're measuring isn't what they think they're measuring, they will draw wrong conclusions. Likewise, a clinician may draw the wrong conclusion about a person if the measure doesn't measure what the clinician thinks it measures.

Validity is important whenever anything is being observed. In personality assessment, the validity question has been examined closely for a long time. In trying to be sure that personality tests are valid, theorists have come to distinguish several aspects of validity from one another. These distinctions have also influenced the practical process of establishing validity.

3.3.1: Construct Validity

The idea of validity you have in mind at this point is technically called construct validity (Campbell, 1960; Cronbach & Meehl, 1955; Strauss & Smith, 2009). Construct validity is an all-encompassing validity and is therefore the most important kind (Hogan & Nicholson, 1988; Landy, 1986). Construct validity means that the measure (the assessment device) reflects the construct (the conceptual quality) that the psychologist has in mind. Although the word construct sounds abstract, it just means a concept. Any trait quality, for example, is a construct. Establishing construct validity for a measure is a complex process. It uses several kinds of information, each treated as a separate aspect of the validation process.
For this reason, the various qualities that provide support for construct validity have separate names of their own. Several are described in the following paragraphs.

3.3.2: Criterion Validity

An important part of showing that an assessment device has construct validity is showing that it relates to other manifestations of the quality it's supposed to measure (Campbell, 1960). The "other manifestation" usually means a behavioral index, or the judgment of a trained observer, taken as an external criterion (a standard of comparison). The researcher collects this information and sees how well the assessment device correlates with it. This aspect of validity is sometimes referred to as criterion validity (because it uses an external criterion) or predictive validity (because it tests how well the measure predicts something else it's supposed to predict).

As an example, suppose you were interested in criterion validity for a measure of dominance you were developing. One way to approach this problem would be to select people who score high and low on your measure and bring them to a laboratory one at a time to work on a task with two other people. You could record each group's discussion and score it for the number of times each person made suggestions, gave instructions, took charge of the situation, and so on. These would be viewed as behavioral criteria of dominance. If people who scored high on your measure did these things more than people who scored low, it would indicate a kind of criterion validity.

Another way to approach the problem would be to have each person who completed your scale spend 20 minutes with a trained interviewer (who didn't know the scale result) who then rated each person's dominance after the interview. The interviewer's ratings would be a different kind of criterion for dominance. If the ratings related to scores on your measure, it would indicate a different kind of criterion validity for the measure.

Criterion validity is regarded as the most important way to support construct validity. A controversy has arisen over the process of establishing it, however. Howard (1990; Howard, Maxwell, Weiner, Boynton, & Rooney, 1980) pointed out that people often assume that a criterion is a perfect reflection of the construct. In reality, though, this is almost never true. In fact, far too often, researchers choose criterion measures that are poor reflections of the construct. We raise this point to emphasize how important it is to be careful in deciding what criterion to use. Unless the criterion is a good one, associations with it are meaningless. Despite this issue, criterion validity remains the keystone of construct validation.

3.3.3: Convergent Validity

Another kind of support for a measure's construct validity involves showing that the measure relates to characteristics that are similar to, but not the same as, what it's supposed to measure. How is this different from criterion validity? It's just a very small step away from it. In this case, though, you know the second measure aims to assess something a little different from what your measure assesses. Because this kind of information gathering often proceeds from several angles, the result is termed convergent validity (Campbell & Fiske, 1959). That is, the evidence "converges" on the construct you're interested in, even though any single finding by itself won't clearly reflect the construct.
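The next paragraph develops this logic with a dominance scale; first, here is a minimal computational sketch of what convergent evidence looks like. All scores are simulated, and the variable names and effect sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scores for 200 people: a latent dominance level drives
# the new scale and, more loosely, the related measures.
dominance = rng.normal(0.0, 1.0, size=200)
new_scale = dominance + rng.normal(0.0, 0.5, size=200)
leadership = 0.6 * dominance + rng.normal(0.0, 1.0, size=200)  # related construct
shyness = -0.5 * dominance + rng.normal(0.0, 1.0, size=200)    # inversely related

def corr(x, y):
    """Pearson correlation between two score arrays."""
    return np.corrcoef(x, y)[0, 1]

# Convergent evidence: moderate positive link with leadership, moderate
# negative link with shyness. Near-perfect correlations would suggest
# redundancy; near-zero ones would cast doubt on the new scale.
print(f"new scale vs. leadership: r = {corr(new_scale, leadership):+.2f}")
print(f"new scale vs. shyness:    r = {corr(new_scale, shyness):+.2f}")
```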
For example, a scale intended to measure dominance should relate at least a little bit to measures of qualities such as leadership (positively) or shyness (inversely). The correlations shouldn't be perfect, because those aren't quite the same constructs, but they shouldn't be zero either. If you developed a measure to assess dominance and it didn't correlate at all with measures of leadership and shyness, you'd have to start wondering whether your measure really assesses dominance.

3.3.4: Discriminant Validity

It's important to show that an assessment device measures what it's intended to measure. But it's also important to show that it does not measure qualities it's not intended to measure---especially qualities that don't fit your conceptual definition of the construct (Campbell, 1960). This aspect of the construct validation process is called establishing discriminant validity (Campbell & Fiske, 1959).

The importance of discriminant validity can be easy to overlook. However, discriminant validation is a major line of defense against the third-variable problem in correlational research, discussed in Chapter 2. That is, you can't be sure why two correlated variables correlate. It may be that one influences the other. But it may be that a third variable, correlated with the two you've studied, is really responsible. In principle, it's always possible to attribute the effect of a personality dimension to some other personality dimension. In practice, however, this can be made much harder by evidence of discriminant validity. That is, if research shows that the dimension you're interested in is unrelated to another variable, then that variable can't be invoked as an alternative explanation for any effect of the first.

To illustrate this, let's return to an example used in discussing the third-variable problem in Chapter 2: a correlation between self-esteem and academic performance. This association might reflect the effect of an unmeasured variable---for instance, IQ. Suppose, though, that we know this measure of self-esteem is unrelated to IQ, because someone checked that possibility during its validation. This would make it hard to claim that IQ underlies the correlation between self-esteem and academic performance. The process of discriminant validation is never ending, because new possibilities for third variables always suggest themselves. Ruling out alternative explanations is thus a challenging task, but it's also a necessary one.

Earlier in the chapter (in Box 3.2), we discussed implications of IRT for internal consistency. IRT also provides safeguards that help ensure that items measure only what they are intended to measure. This new method therefore offers a valuable tool to enhance discriminant validity and help reduce the third-variable problem.

3.3.5: Face Validity

One more kind of validity should be mentioned. It's much simpler and a little more intuitive, and most people think it's less important. It's called face validity. Face validity means that the assessment device appears, on its "face," to be measuring the construct it was intended to measure. It looks right. A test of sociability made up of items such as "I prefer to spend time with friends rather than alone" and "I would rather socialize than read books" would have high face validity. A test of sociability made up of items such as "Green is my favorite color" and "I prefer imported cars" would have low face validity.

Many researchers regard face validity as a convenience, for two reasons.
First, some believe that face-valid measures are easier to respond to than measures with less face validity. Second, researchers sometimes focus on distinctions between qualities of personality that differ in subtle ways. It often seems impossible to separate these qualities from each other except by using measures that are high in face validity.

On the other hand, face validity can occasionally be a detriment. This is true when the assessment device is intended to measure something that the person being assessed would find threatening or embarrassing to admit. In such cases, the test developer usually tries to obscure the purpose of the test by reducing its face validity.

Whether face validity is good, bad, or neither, it should be clear that it does not substitute for other aspects of validity. If an assessment device is to be useful in the long run, it must undergo the laborious process of construct validation. The "bottom line" is always construct validity.

3.3.6: Culture and Validity

Another important issue in assessment concerns cultural differences. In a sense, this is a validity issue; in a sense, it's an issue of generalizability. Let's frame the issue as a question: Do the scores on a personality test have the same meaning for a person from an Asian culture, a Latino culture, or an African American culture as they do for a person from a middle-class European American culture?

There are at least two aspects to this question. The first is whether the psychological construct itself has the same meaning from one culture to another. This is a fundamental question about the nature of personality. Are the elements of personality the same from one human group to another? Many people assume the basic elements of personality are universal. That may, in fact, be a dangerous assumption.

The second aspect of the question concerns how people from different cultures interpret the items of the measure. If an item has one meaning for middle-class Americans but a different meaning in some other culture, responses to the item will also have different meanings in the two cultures. A similar issue arises when a measure is translated into a different language. Checking a translation usually involves translating the measure into the new language and then translating it back into the original language by someone who's never seen the original items. This back-translation process sometimes reveals that items contain idiomatic or metaphorical meanings that are hard to translate. Adapting a measure from one culture for use in another culture is a complex process with many difficulties (Butcher, 1996). It must be done very carefully if the measure is to be valid in the new culture.

3.3.7: Response Sets and Loss of Validity

Any discussion of validity must also note that there are problems in self-reports that can interfere with the validity of the information collected. We've already mentioned that biases in recall can distort the picture and render the information less valid. In the same way, people's motivational tendencies can also get in the way of accurate reporting. The tendency to provide socially desirable responses can sometimes mask a person's true characteristics or feelings.

There are at least two biases in people's responses in assessment. These biases are called response sets. A response set is a psychological orientation, a readiness to answer in a particular way (Jackson & Messick, 1967). Response sets create distortions. Personality psychologists want their assessments to provide information that's free from contamination.
Thus, response sets are problems. Two response sets are particularly important in personality assessment. One of them emerges most clearly when the assessment device is a self-report that, in one fashion or another, asks the person questions that require a "yes" or "no" response (or a response on a rating scale with "agree" and "disagree" as the opposite ends of the scale). This response set, called acquiescence, is the tendency to say "yes" (Couch & Keniston, 1960). Everyone presumably has a bit of this tendency, but people vary greatly on it. That's the problem. If the set isn't counteracted somehow, the scores of people who are highly acquiescent become inflated. Their high scores reflect the response set, instead of their personalities. People who have extreme personalities but not much acquiescence will also have high scores. But you won't know whose high scores are from personality and whose are from acquiescence.

Many view acquiescence as an easy problem to combat. The way it's handled for self-reports is this: Write half the items so that "yes" means being at one end of the personality dimension. Write the other half of the items so that "no" means being at that same end of the dimension. Then reverse the response value for each item in the "no" set before summing (see the sketch at the end of this section). In the process of scoring the test, then, any bias that comes from the simple tendency to say "yes" is canceled out.

This procedure takes care of the problem of overagreement, but not everyone is convinced it's a good idea. Negatively worded items often are harder to understand or more complicated to answer than positively worded items. The result can be responses that are less accurate (Converse & Presser, 1986). For this reason, some people feel it's better to live with the acquiescence problem than to introduce a different kind of error through complex wording.

A second response set is perhaps more important than acquiescence and also more troublesome. It's called social desirability. It reflects the fact that people tend to portray themselves in a good light (in socially desirable ways) whenever possible. Once again, this tendency is stronger among some people than others (Crowne & Marlowe, 1964). As with acquiescence, if it isn't counteracted, people with strong concerns about social desirability will produce scores that reflect the response set, rather than their personalities.

For some personality dimensions, this isn't much of a problem. The reason is that there's really no social approval or disapproval at either end of the dimension. In other cases, though, there's a consensus that it's better to be one way (e.g., honest or likable) than the other (e.g., dishonest or unlikable). In these cases, assessment becomes tricky. In general, psychologists deal with this problem by trying to phrase items so that the issue of social desirability isn't salient. As much as anything else, this means trying to avoid even bringing up the idea that one kind of person is approved of more than the other. Sometimes this means phrasing undesirable responses in ways that make them more acceptable. Sometimes it means looking for ways to let people admit the undesirable quality indirectly. A different way to deal with the problem is to include items that assess the person's degree of concern about social desirability and use this information as a correction factor in evaluating the person's responses to other items.
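Here is a minimal sketch of the reverse-keying procedure described above, for a 5-point agree-disagree format. The items and responses are hypothetical; the substantive step is simply flipping the scored value of the reverse-keyed items before summing.

```python
import numpy as np

# Hypothetical responses of one person to a 6-item scale,
# rated 1 ("strongly disagree") to 5 ("strongly agree").
responses = np.array([4, 2, 5, 1, 4, 2])

# Half the items are written so that *disagreeing* indicates the high
# end of the trait. These reverse-keyed items get flipped before summing.
reverse_keyed = np.array([False, True, False, True, False, True])

scored = responses.copy()
scored[reverse_keyed] = 6 - scored[reverse_keyed]  # on a 1-5 scale: x -> 6 - x

# A pure yes-sayer (answering 5 to everything) would now get 5s on half
# the items and 1s on the other half, so acquiescence alone can no
# longer inflate the total.
print(f"Raw responses:    {responses}")
print(f"Scored responses: {scored}")
print(f"Scale total: {scored.sum()}")
```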
In any event, response sets are a problem that personality psychologists must constantly be aware of and guard against in trying to measure what people are like.

3.4: Two Rationales behind the Development of Assessment Devices

3.4 Relate the logic behind the theoretical and the empirical approaches to the development of assessment devices

Thus far, this chapter has considered issues that arise when measuring any quality of personality. But how do people decide what qualities to measure in the first place? This question won't be answered fully here, because the answer depends partly on the theoretical perspective underlying the assessment. We will, however, address one general issue. In particular, measure development usually follows one of two approaches, each of which has its own kind of logic.

3.4.1: Rational or Theoretical Approach

One strategy is termed a rational or theoretical approach to assessment. This strategy is based on theoretical considerations from the very start. The psychologist first develops a theoretical basis for believing that a particular aspect of personality is important. The next task is to create a test in which this dimension is reflected validly and reliably in people's answers. This approach to test development often leads to assessment devices with a high degree of face validity.

It's important to realize that the work doesn't stop once a set of items has been developed. Instruments developed from this starting point must be shown to be reliable, to predict behavioral criteria, and to have good construct validity. Until these steps have been taken, the scale isn't considered a useful measure of anything.

It's probably safe to say that the majority of personality measurement devices that exist today were developed using this path. Some of these measures focus on a single construct; others are inventories with scales for multiple constructs. Most of the measures discussed in later chapters were created by first deciding what to measure and then figuring out how to measure it.

3.4.2: Empirical Approaches

A second strategy is usually characterized as an empirical, or data-based, approach. Its basic characteristic is that it relies on data, rather than on theory, to decide what items go into the assessment device. There are two important variations on this theme. In one of them, the person developing the measure uses the data to decide what qualities of personality even exist (e.g., Cattell, 1979). Because that line of thought is an important contributor to trait psychology, we're going to wait to discuss it until Chapter 4.

We'll focus here on another empirical approach---one that reflects a very pragmatic orientation to the process of assessment. It's guided less by a desire to understand personality than by a practical aim: to sort people into categories. Instead of developing the test first and then validating it against a criterion, this approach works in the opposite direction. The criterion is the groups into which people are to be sorted (maybe two, or maybe more). To develop the test, you start with a huge number of possible items and find out which ones are answered differently by one criterion group than by other people. This is called the criterion keying approach. This label reflects the fact that the items retained are those that distinguish between the criterion group and other people. If an item set can be found for each group, then the test (all item sets together) can be used to tell who belongs to which group.
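Here is a minimal sketch of that item-selection step, using simulated true-false responses. Real projects use far larger item pools and samples, and formal statistical tests rather than the simple endorsement-rate threshold assumed here.

```python
import numpy as np

rng = np.random.default_rng(2)

n_items = 20
# Hypothetical true/false answers (True = endorsed) from a criterion group
# (say, 50 people sharing a diagnosis) and 200 comparison people. Each
# item has its own base rate of endorsement in each group.
criterion_group = rng.random((50, n_items)) < rng.uniform(0.2, 0.8, n_items)
comparison = rng.random((200, n_items)) < rng.uniform(0.2, 0.8, n_items)

# Criterion keying: keep items the criterion group endorses at a clearly
# different rate than other people, regardless of what the items say.
endorse_crit = criterion_group.mean(axis=0)
endorse_comp = comparison.mean(axis=0)
keep = np.abs(endorse_crit - endorse_comp) > 0.20  # assumed threshold

print(f"Items retained for this scale: {np.flatnonzero(keep)}")
# A person's score on the resulting scale is then how often he or she
# answers the retained items in the direction typical of the criterion group.
```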
In this view, it doesn't matter what the items look like. Items are chosen solely because members of a specific group (defined on some other basis) tend to answer them differently than other people do.

This method underlies the Minnesota Multiphasic Personality Inventory, or MMPI (Hathaway & McKinley, 1943), revised in 1989 as the MMPI-2 (Butcher, Dahlstrom, Graham, Tellegen, & Kaemmer, 1989). This is a very long true--false inventory that was developed to assess abnormality. A large number of self-descriptive statements were given to a group of normal persons and to groups of psychiatric patients---people already judged by clinicians to have specific disorders. Thus, the criterion already existed. If people with one diagnosis either agreed or disagreed with an item more often than normal people and people with different diagnoses, that item was included in the scale for that diagnosis.

The MMPI-2 has become controversial in recent years, for several reasons. Most important for our purposes in this book, it is increasingly recognized that different diagnostic categories are not as distinct as they were formerly thought to be. As a result, scores on the MMPI tend to be elevated on several scales, rather than just one. One consequence of the recognition of this pattern has been a broad (and intense) reconsideration of the nature of psychiatric diagnosis.

3.5: Never-Ending Search for Better Assessment

3.5 Analyze the importance of investing effort in creating and improving tests of personality

No test is perfect, and no test is ever considered finished just because it's widely used. Most personality scales in wide use today have been revised and restandardized periodically. The process of establishing construct validity requires not just a single study but many. It thus takes time. The process of establishing discriminant validity is virtually never ending.

Tremendous effort is invested in creating and improving tests of personality. This investment of effort is necessary if people are to feel confident of knowing what the tests measure. Having that confidence is an important part of the assessment of personality.

The characteristics of personality tests discussed in this chapter distinguish these tests from those you see in newspapers and magazines, on TV, online, and so forth. Sometimes the items in a magazine article were written specifically for that article. It's unlikely, though, that anyone checked on their reliability. It's even less likely that anyone checked on their validity. Unless the right steps have been taken to create an instrument, you should be careful about putting your faith in the results that come from it.

Summary: Issues in Personality Assessment

Assessment (measurement of personality) is something that people constantly do informally. Psychologists formalize this process into several distinct techniques. Observer ratings are made by someone other than the person being rated---an interviewer, someone who watches, or someone who knows the people well enough to make ratings of what they are like. Self-reports are reports about themselves made by the people being assessed. Self-reports can be single scales or multiscale inventories. Implicit assessment measures patterns of associations within the self that are not open to introspection. Assessment devices can be subjective or objective. Objective techniques require no interpretation as the assessment is made. Subjective techniques involve some sort of interpretation as an intrinsic part of the measure.
One issue for all assessment is reliability (the reproducibility of measurement). Reliability is determined by checking one measurement against another (or several others). Self-report scales usually have many items (each a measurement device), leading to indices of internal reliability or internal consistency. Observer judgments are checked via inter-rater reliability. Test--retest reliability assesses the reproducibility of the measure over time. In all cases, high positive correlation among measures means good reliability.

Another important issue is validity (whether what you're measuring is what you want to measure). The attempt to determine whether the operational definition (the assessment device) matches the concept you set out to measure is called construct validation. Contributors to construct validity are evidence of criterion, convergent, and discriminant validity. Face validity isn't usually taken as an important element of construct validity. Validity is threatened by the fact that people have response sets (acquiescence and social desirability) that bias their responses.

Development of assessment devices follows one of two strategies or approaches. The rational strategy uses a theory to decide what should be measured and then figures out the best way to measure it. Most assessment devices were developed this way. The empirical strategy involves using data to determine what items should be in a scale. The MMPI was developed this way, using a technique called criterion keying, in which the test developers let people's responses tell them which items to use. Test items that members of a diagnostic category answered differently from other people were retained.
