Psychological Assessment and Measurement
Summary
This document provides an overview of psychological assessment and measurement. It covers psychometric theory, different types of measurement, and the importance of reliability in assessment. The document aims to provide a foundational understanding of these topics.
Table of Contents
WEEK 1: Psychometric Theory
WEEK 2
WEEK 3
WEEK 4
WEEK 5
WEEK 6

WEEK 1
Psychometric Theory
Psychometric theory relates to "the quantification and measurement of mental attributes, behaviour, and performance as well as the design, analysis, and improvement of the tests used in such measurement". A construct is "a complex idea or concept formed from a synthesis of simpler ideas". Psychometrics enables you to observe and measure constructs (e.g., intelligence, extraversion), and to begin to make sense of and compare them. Several methods and instruments have been devised to measure what may seem unmeasurable. This means you can assign a numerical value to something that might otherwise be measured or perceived subjectively. This is an important first step, although further information is typically required for accurate decision-making, such as a diagnosis of a mental illness.

The terms measurement and assessment are often used interchangeably, though there are key differences:
- Measurement is "the process of assigning numbers to observations to quantify important characteristics of individuals".
- Assessment is "the systematic process of obtaining information from participants and using it to make inferences or judgments about them".

This can be considered in different contexts, such as the terminology differences within the educational context. Lynch's model of evaluation, measurement, and testing is commonly accepted. In this model, evaluation is akin to assessment: the process by which a conclusion is made. It is worth noting that this model also places testing within measurement more generally.

➔ Nominal = A nominal scale classifies cases into labelled categories that have no inherent order (e.g., eye colour, diagnostic category).
➔ Ordinal = An ordinal scale is a scale (of measurement) that uses labels to classify cases (measurements) into ordered classes. Note that an ordinal scale implies that the classes must be put into an order such that each case in one class is considered greater than (or less than) every case in another.
➔ Interval = An interval scale is a quantitative measurement scale in which the difference between two values is meaningful and the intervals between neighbouring points are equal, but the zero point is arbitrary (e.g., temperature in degrees Celsius), so ratios of values are not meaningful.
➔ Ratio = A ratio scale is a quantitative scale with a true zero and equal intervals between neighbouring points. Unlike on an interval scale, a zero on a ratio scale means there is a total absence of the variable you are measuring. Length, area, and population are examples of ratio scales.
(A short code sketch at the end of this section illustrates which summary statistics each scale type supports.)

Functions of measurement may include:
- Decision-making: for example, psychological measurements may be used in personnel selection, or in ability or aptitude measurement for educational decision-making.
- Comparison: for example, measurement may be used to discover individual personality differences.
- Research: for example, psychological measurement may be used in cross-sectional, longitudinal, and meta-analytic studies.
- Diagnosis: for example, measurement may be used to identify strengths and weaknesses in a causal context to facilitate treatment planning.
- Risk assessment: for example, psychological measurement may help to identify potential risks (e.g., violence, suicidality) to facilitate appropriate action. Violence, for instance, is a social construct with the potential to impact more broadly than the individual being assessed.

To enable sound decision-making, it is often necessary to objectively determine whether a construct exists, or to measure its strength.
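The scale type determines which summary statistics are meaningful. The following is a minimal Python sketch, not part of the original notes and using made-up data, to illustrate the idea:

```python
import statistics

# Nominal: unordered categories -> mode is meaningful, mean/median are not.
eye_colours = ["blue", "brown", "brown", "green", "brown"]
print(statistics.mode(eye_colours))            # 'brown'

# Ordinal: ordered classes -> median (and mode) are meaningful, mean is not.
likert = [1, 2, 2, 3, 5, 4, 2]                 # 1 = strongly disagree ... 5 = strongly agree
print(statistics.median(likert))               # 2

# Interval: equal intervals, arbitrary zero -> differences and means are
# meaningful, but ratios are not (20 deg C is not "twice as hot" as 10 deg C).
temps_c = [18.5, 21.0, 19.2]
print(statistics.mean(temps_c))

# Ratio: true zero -> ratios are meaningful (4 errors really is twice 2 errors).
reaction_times_ms = [420.0, 380.5, 510.2]
print(max(reaction_times_ms) / min(reaction_times_ms))
```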
Psychological measurement and assessment are used in a number of settings, and for a range of different purposes (e.g., to explore capabilities, health, and attributes). For example, clinical or health psychologists may use psychological assessment as part of therapeutic interventions, for treatment planning, or to measure change. Similarly, psychological assessment may be used by organisational psychologists to compare individuals to one another, or to compare individuals against a particular benchmark.

➔ Capability is "an ability, talent, or facility that a person can put to constructive use. For example, a child may have great musical capability." (APA, 2020, APA Dictionary of Psychology)
➔ Health is "the condition of one's mind, body, and spirit, the idea of being free from illness, injury, pain, and distress".
➔ An attribute is "a quality or property of a person, sensation, or object, for example, the tonal attribute of a note".

Reliability
In psychological assessment, reliability concerns may arise in a number of ways:
1. the context in which the testing takes place (including factors related to the test administrator, the test scorer, and the test environment, as well as the reasons the test is undertaken)
2. the test taker (including individual differences, such as fatigue)
3. specific characteristics of the test itself (including whether the test can be reused on multiple occasions with similar outcomes).

It is important that the environment in which psychological assessment takes place is suitable, and that the test chosen is appropriate for the questions being explored. Factors such as room temperature, lighting, background noise, the presence of other individuals, and the demeanour of the test administrator can all affect test results.

Types of reliability
1. Test-retest reliability: whether a measure is consistent if the test is taken again by the same individual.
2. Inter-rater reliability: whether different researchers or practitioners achieve the same result when using the measure.
3. Internal consistency: whether items included in the measure are responded to in a consistent manner.

You want to be sure that the measure you are using is, in fact, measuring what you intend it to. It is also important to know that ethical and accurate decisions are being made, especially given that decisions are often made on the basis of what is determined through psychological tests.

ACTIVITY
Video of an interview, tracking negative thoughts. Number of negative thoughts in the 4 minutes you were instructed to watch: 1, 2, 3. This was a difficult task because of the subjectivity of the term "negative", and because it might be interpreted differently by each person. Negative thoughts can be defined as "cognitions about the self, others, or the world in general that are characterized by negative perceptions, expectations, and attributions and are associated with unpleasant emotions and adverse behavioral, physiological, and health outcomes". This definition alone may not be enough to make the task easier. It is broad and does not direct you to be able to clearly identify the thoughts of the interviewee. For example, you may be tempted to infer what the interviewee is thinking based on their responses, or you may simply record what is stated. The instructions you were given do not assist you in determining what you need to do. This example was designed to get you to think about inter-rater reliability; one common way of quantifying such agreement is sketched below.
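A common statistic for quantifying inter-rater reliability on categorical judgments is Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement. Below is a minimal Python sketch, not part of the original notes, using entirely hypothetical codings from two raters who each judged the same ten interview statements as negative (1) or not negative (0):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical codes of the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)

    # Observed agreement: proportion of items the raters coded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal proportions, summed
    # over categories.
    p_e = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n)
        for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codings of 10 statements (1 = negative, 0 = not negative).
rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.58
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance; here the two hypothetical raters land at roughly 0.58, often described as moderate agreement.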
How can you achieve consistency between researchers or practitioners, so that they reach the same result when attempting to measure something?

Internal Consistency
From your previous studies, you would be aware that internal consistency reliability is a measure of the extent to which the items in a test are measuring the same thing. It is the most common form of reliability you will encounter. You may remember that Cronbach's alpha, a number between 0 and 1, is a measure of internal consistency reliability (Tavakol & Dennick, 2011). A commonly used minimum cut-off is .70, but given alpha's sensitivity to factors such as the number of items and the sample, context is very important when considering whether internal consistency reliability is high enough. (A sketch showing how alpha is computed follows at the end of this section.)

Consider, for example, research which has explored the internal consistency of a Swedish version of the Pain Catastrophizing Scale (PCS). This measure includes items which aim to measure: rumination (e.g., "I can't seem to get it out of my mind"; "I keep thinking about how much it hurts"), magnification (e.g., "I become afraid that the pain will get worse"; "I wonder whether something serious may happen"), and helplessness (e.g., "It's terrible and I think it's never going to get any better"; "I feel I can't stand it anymore"). Cronbach's alpha was calculated for each of the subscales: rumination (α = 0.84), magnification (α = 0.69), and helplessness (α = 0.89). Additionally, Cronbach's alpha was calculated for the total scale (α = 0.92). Results of these analyses indicate that "... internal consistency of the rumination and helplessness subscales were good, but questionable for the magnification subscale (α = 0.69)" (Kemani et al., 2019, p. 263).

Test-Retest Reliability
How can you achieve consistency between administrations of the tool, so that it produces the same result when you are attempting to measure something? Consider the following tips and how you would implement them in practice:
1. Consider whether there are some items that are less reliable. Items that may need to be omitted include those that are more susceptible to external influences. This will be considered in more detail in Module 2.
2. Ensure there has been consistency between the different test administrations (e.g., same location, same time of day).
3. Document any notable life events or changes that may impact test administration. For example, if a participant is experiencing grief, this may impact their concentration.
4. Consider the amount of time between administrations. While a construct is likely to produce consistency on multiple occasions, the length of time between administrations may impact some test results. For example, some tests may be affected by practice effects. The APA defines practice effects as "any change or improvement that results from practice or repetition of task items or activities. The practice effect is of particular concern in experimentation involving within-subjects designs, as participants' performance on the variable of interest may improve simply from repeating the activity rather than from any study manipulation imposed by the researcher."

Validity
1. Face validity: the degree to which the assessment tool appears to measure what is intended.
2. Criterion validity: the degree to which assessment scores are consistent with another relevant measure.
3. Construct validity: the degree to which the assessment tool adequately represents the construct being measured.
4. Content validity: the degree to which the assessment tool reflects, and is representative of, the subject matter or behaviour being measured.
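Before turning to each type of validity in detail, here is a minimal sketch, not from the original notes, of the two reliability statistics discussed above: Cronbach's alpha, computed as alpha = (k / (k - 1)) * (1 - (sum of item variances) / (variance of total scores)) for k items, and a test-retest correlation between two administrations. All data are made up.

```python
import numpy as np

def cronbachs_alpha(items):
    """items: 2-D array, rows = respondents, columns = test items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Made-up responses: 6 respondents x 4 items on a 1-5 scale.
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])
print(f"alpha = {cronbachs_alpha(responses):.2f}")

# Test-retest reliability: correlate total scores across two administrations.
time1 = responses.sum(axis=1)
time2 = time1 + np.array([1, -1, 0, 2, 0, -1])   # hypothetical retest scores
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r:.2f}")
```

In practice you would use real item-level data and report alpha per subscale, as Kemani et al. (2019) did for the PCS.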
To determine validity, you want to be sure that the psychological measure you are using provides an accurate assessment of the construct, for the population of interest, on the occasion it has been administered.

Face Validity
Face validity is "the extent to which the items or content of the test appear to be appropriate for measuring something, regardless of whether they actually are".

Example 1 (a direct measure of aggression):
- Advantage: straight to the point and direct; a clear measure without ambiguous interpretation arising from individual differences.
- Disadvantage: people who suppress their aggression may not be willing to admit their aggression, or their suppression of it, so directly.

Example 2 (an indirect measure):
- Advantage: allows researchers to interpret patterns from individuals who may be unknowingly displaying suppressed aggression that they would not otherwise admit directly.
- Disadvantage: may be ambiguous and misinterpreted by participants; their pattern of responses may reflect individual differences and interests rather than suppressed anger.

A test that appears to be measuring what it is supposed to measure has high face validity. For example, asking an individual to complete a typing test for an administrative assistant role might be seen to have high face validity, whereas asking an individual to complete a typing test to become a lifeguard might not be regarded as having the same face validity. In some circumstances, it may not be advisable to have high face validity, especially if the area of interest may be affected by socially desirable responding. You would know from your previous studies that social desirability is the tendency of individuals to self-report or present themselves in a favourable light, especially where they believe that they may be judged negatively by others. Some examples where high face validity may not be desirable, given that the accuracy of self-report may be affected, include substance abuse, offending behaviour, and any other areas where you might suspect individuals to 'fake good' or 'fake bad'.

Criterion Validity
Criterion validity is a measure of "how well a test correlates with an established standard of comparison (i.e., a criterion)" (APA, 2020, para. 1). There are three types of criterion validity: predictive, concurrent, and retrospective validity.

For example, Jenkinson et al. (1994) considered criterion validity for a psychological tool, the Short-Form Health Survey (SF-36). As the authors note, an initial test item (a global health question) had been used in other studies to test the criterion validity of other measures. On this basis, it was proposed as an acceptable benchmark against which health may be measured. In this research, concurrent validity was assessed by measuring the SF-36 against an established measure of health (used in other studies) (Jenkinson et al., 1994). This differs from predictive validity, where the intent is to predict future outcomes (e.g., health in the future).

Kruskal-Wallis tests were used to statistically assess criterion validity. This is a non-parametric alternative to ANOVA, used to test whether three or more groups differ on the variable of interest; it can be used when the dependent variable is ordinal. Notably, other statistical methods can be used to assess criterion validity, depending on the data; the intent is to determine whether your measure correlates with a 'gold standard'. A sketch of such a test follows below.
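As an illustration only (these are not Jenkinson et al.'s data), a Kruskal-Wallis test of the kind described above can be run with scipy, comparing hypothetical SF-36 scores across groups formed by an ordinal global health item:

```python
from scipy import stats

# Hypothetical SF-36 scale scores, grouped by response to an ordinal
# global health question (the criterion). All numbers are made up.
poor = [35, 42, 38, 40, 33]
fair = [55, 60, 52, 58, 61]
good = [78, 82, 75, 80, 85]

# Kruskal-Wallis H test: a non-parametric alternative to one-way ANOVA.
h_stat, p_value = stats.kruskal(poor, fair, good)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```

A significant result here indicates that scores differ across the criterion groups, which is the pattern you would expect from a measure with good concurrent validity.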
Construct Validity
The third type of validity that is notable for psychological measurement and assessment is construct validity. You may recall that earlier in this module, you were asked to think about how you might measure a construct. Measuring construct validity can be quite complex, as it aims to consider the degree to which a theoretical model is representative of the data. For example, does a psychological assessment tool, such as the NEO-PI-3, accurately measure the construct of conscientiousness, where conscientiousness is conceptualised using a theoretical basis (e.g., need for achievement, a tendency to keep one's environment tidy, and so forth)?

Construct validity is "the degree to which a test or instrument is capable of measuring a concept, trait, or other theoretical entity. For example, if a researcher develops a new questionnaire to evaluate respondents' levels of aggression, the construct validity of the instrument would be the extent to which it actually assesses aggression as opposed to assertiveness, social dominance, and so forth. A variety of factors can threaten the basic construct validity of an experiment, including (a) mismatch between the construct and its operational definition, (b) various forms of bias, and (c) various experimenter effects and other participant reactions to aspects of the experimental situation. There are two main forms of construct validity in the social sciences: convergent validity and discriminant validity."

Construct validity has two main subtypes:
1. Convergent validity: the degree to which a measure correlates with other measures that are theoretically related to the same construct. For example, if you're measuring depression, your test should correlate highly with other established depression measures.
2. Discriminant validity (or divergent validity): the extent to which a measure does not correlate with other measures that assess different, unrelated constructs. For instance, a depression scale should not show a strong correlation with a measure of physical fitness, as they assess different concepts. (A short correlation sketch at the end of these notes illustrates both subtypes.)

Content Validity
Content validity is "the extent to which a test measures a representative sample of the subject matter or behaviour under investigation. For example, if a test is designed to survey arithmetic skills at a third-grade level, content validity indicates how well it represents the range of arithmetic operations possible at that level" (APA, 2020, para. 1). Importantly, a measure that has good content validity will demonstrate both "relevance and representativeness" (Dixon & Johnston, 2019). In other words, it is important that there is sufficient coverage of the construct of interest.

Consider the example from Haynes et al. (1995): "For sake of illustration, presume we are attempting to measure the efficacy of a psychosocial treatment for panic attack (as defined in DSM-IV; APA, 1994) with a self-report questionnaire. Scores from the questionnaire on panic attacks would reflect the panic attack construct (i.e., would evidence content validity) to the extent that the items measured all facets of the construct, namely, (a) tapped the 13 criteria for panic attacks (DSM-IV; APA, 1994, pp. 395), (b) targeted the appropriate time frame estimate for peak response (
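To make the convergent/discriminant distinction under Construct Validity concrete, here is a minimal sketch with entirely made-up scores: a hypothetical new depression scale should correlate strongly with an established depression measure (convergent) and only weakly with an unrelated construct such as physical fitness (discriminant).

```python
import numpy as np

# Made-up scores for 8 respondents. new_dep is a hypothetical new depression
# scale; est_dep is an established depression measure (related construct);
# fitness is an unrelated construct.
new_dep = np.array([22, 30, 15, 28, 10, 35, 18, 25])
est_dep = np.array([20, 33, 14, 26, 12, 38, 17, 27])
fitness = np.array([50, 62, 47, 55, 58, 49, 63, 52])

# Convergent validity: expect a strong correlation with the related measure.
r_convergent = np.corrcoef(new_dep, est_dep)[0, 1]

# Discriminant validity: expect a weak correlation with the unrelated measure.
r_discriminant = np.corrcoef(new_dep, fitness)[0, 1]

print(f"convergent r = {r_convergent:.2f}")      # expected: high (near 1)
print(f"discriminant r = {r_discriminant:.2f}")  # expected: low in magnitude
```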