Chapter 3: Understanding Measurement Properties
Susan Magasi, PhD; Apeksha Gohil, BOT; Mark Burghart, MOT; and Anna Wallisch, MOT

In: Law M, Baum C, Dunn W, eds. Measuring Occupational Performance: Supporting Best Practice in Occupational Therapy, Third Edition (pp 29-41). Taylor & Francis Group; 2017. DOI: 10.1201/9781003525042-3
Summary
This chapter provides a framework for understanding measurement properties in occupational therapy, including reliability, validity, and responsiveness. It also highlights the importance of using appropriate and well-designed instruments for accurate communication and decision-making within treatment teams.
Occupational therapists frequently face the need to select high-quality measurement instruments that are applicable to the client's needs and within the parameters of one's clinical environment. Selecting appropriate measurement instruments provides therapists with the ability to accurately identify problems in occupational performance, monitor change over time, and make appropriate recommendations based on predictions of future outcomes. In order to select the best measurement instrument, one must first understand the clinical applicability of the measurement instrument, as well as its properties. Knowledge of measurement properties allows a practitioner to understand how reproducible and consistent a measure is (reliability), as well as whether the measure captures the content one intends to measure (validity).

Interpreting and reporting results from appropriate and well-designed measurement instruments can facilitate communication within treatment teams (both inside and outside of the profession), with care coordinators and funders, and with clients themselves. For example, the ability to demonstrate improvement in occupational performance can be used to communicate shared goals and challenges within the interdisciplinary team, to support the need for additional therapy time, and to motivate clients.

A proliferation of measurement instruments exists for every clinical population and setting, giving the practicing clinician a wide range of measurement options. However, not all instruments are created equal. The task of choosing the best measurement instrument for your setting and the people you serve, as well as interpreting the results to support clinical decision making, can be daunting. A solid understanding of measurement properties is central to occupational therapy practitioners' abilities to perform these essential tasks.

The purpose of this chapter is to provide occupational therapy practitioners with the following:
- A framework for understanding measurement properties (including reliability, validity, and responsiveness)
- A framework for interpreting scores from measurement instruments
- A process for comparing and selecting appropriate measurement instruments for their clients and setting

A FRAMEWORK FOR UNDERSTANDING MEASUREMENT PROPERTIES

The challenge of understanding and interpreting measurement properties is not unique to occupational therapy or rehabilitation. Indeed, different disciplines use different language and evidence to document measurement properties, making meaningful comparisons between measures difficult.1 In an effort to address this conceptual confusion, an international consensus group, COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments), created uniform terminology and interpretation criteria for evaluating measurement properties in health status measures.2 The uniform terminology and application of the COSMIN framework structure our discussion of the measurement properties throughout this chapter. Figure 3-1 provides a graphic representation of the COSMIN taxonomy.

The COSMIN taxonomy identifies 3 measurement property domains: reliability, validity, and responsiveness. Interpretability is not a measurement property but is important when selecting a measurement instrument for clinical practice and is thus included in the taxonomy.
Figure 3-1. COSMIN taxonomy. (Reprinted from Mokkink LB, Terwee CB, Patrick DL, et al. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 2010;63(7):737-745. Copyright 2010, with permission from Elsevier.)

Measurement decisions should also be based on theoretical concepts. We must recognize why we are selecting a measurement approach based on occupational therapy principles. For example, checking a person's pulse is a common and conceptually driven measurement for nursing, as nurses evaluate the person's physical status each day. Similarly, occupational therapy practitioners base the selection of measurement instruments on the relationship of the underlying constructs to occupational performance and participation. The purpose for which the measurement is used also guides the instrument selection process. Measurement instruments serve 3 main purposes in clinical practice: 1) to discriminate or differentiate between characteristics of an individual, 2) to evaluate change over time, and 3) to predict future events and needs. It is important to clarify both what constructs are being measured and why (from both a theoretical and practical perspective).

Measurement experts also have conceptual models for helping us decide about the strength and usefulness of tests. Their conceptual models describe the statistical relationships among test items, the person's competence, and item difficulty. These statistical methods help test designers select strong items and put items in a proper order when difficulty matters. Classical Test Theory (CTT) is the basis of many widely used occupational therapy assessments. Item Response Theory (IRT) is gaining prominence in occupational therapy and rehabilitation research and practice.3 Many well-known and frequently used measurement instruments in occupational therapy practice, including the Functional Independence Measure (FIM), the Assessment of Motor and Process Skills (AMPS), and the Pediatric Evaluation of Disability Inventory (PEDI), are IRT-based. CTT and IRT use different statistical tests to document reliability and validity. Criteria for evaluating both CTT- and IRT-based measures will be discussed later in this chapter.

LEVELS OF MEASUREMENT

Measurement instruments have different ways of categorizing and quantifying the construct of interest. Levels of measurement refer to properties of the numbers assigned to an observation.4 The levels of measurement provide rules for understanding, manipulating, and interpreting different types of numerical data. There are 4 levels of measurement: nominal, ordinal, interval, and ratio.

Nominal

Nominal data are constructed as named categories, and objects or people are assigned to specific named categories based on criteria and descriptors. Gender, handedness, and diagnosis are examples of nominal scales. A rule of nominal scales states that categories are mutually exclusive, so that no object or person can be assigned to more than one group. An additional assumption states that categories are exhaustive, so that every object or person can be accurately assigned to one category. From a statistical perspective, nominal scales are for counting frequencies.
Ordinal

The next level of measurement is the ordinal scale. Ordinal scales are used to rank phenomena. Just as in a track race, objects or people are ranked first, second, third, and so on, and the relationship is based on having more or less of an attribute (eg, a faster runner compared with a slower runner). The intervals between ranked categories in ordinal scales may be inconsistent or unknown. For instance, the time between the first-place and second-place runners may differ from the time between the second- and third-place runners. Because ordinal scales portray ranked categories, which are essentially labels, there are fewer statistical interpretations and manipulations one may use when working with ordinal scales. Manual muscle testing ratings are a clinical example of an ordinal scale.

Interval

Interval scales resemble ordinal scales because they depict order, but the distance between scores is known and equal. For instance, the distance between years is an equally spaced, fixed amount (ie, 365 days). Interval scales lack an absolute 0 point; thus, a score of 0 does not indicate a complete lack of an attribute. For example, a temperature of 0 degrees Fahrenheit does not mean there is a lack of temperature; it means it is cold outside. Interval scales allow researchers to apply a greater range of statistics because relative differences and equivalence may be determined.
Ratio

Ratio scales are the highest level of measurement. A ratio scale is the same as an interval scale, but it has an absolute 0 point, where a score of 0 means an individual has a complete lack of the construct being measured. For instance, when measuring the height of an individual, it is impossible to get a score of 0 and to completely lack height. Because there is an absolute 0 point, one may discuss scores as a ratio; for instance, one could describe a child's height as half of his or her parent's height, or a parent as double the height of his or her child. Ratio scales are the highest level because one may run all statistical operations at this level.5

MEASUREMENT PROPERTIES

Measurement properties help end users determine whether measurement instruments are consistent, well-targeted, and sensitive enough to evaluate the constructs of interest in the population they serve. Occupational therapists must consider reliability, validity, and responsiveness when selecting a measurement instrument.

Reliability

Reliability helps occupational therapists decide whether they can trust the measurement instrument to give consistent, error-free scores. The term also denotes the reproducibility or dependability of a measurement. When an instrument is administered repeatedly under similar conditions to a person whose ability has not changed, the derived scores should be the same (or at least very similar). Reliability is thus an indication of both the consistency of the measurement and the ability to differentiate between clients.6 From a clinical perspective, reliability is an indication of how confident you can be that the scores derived from a measurement instrument are truly accurate, and thus how confident you can be using those scores as the basis for clinical decisions. There are 2 important features clinicians should notice when interpreting the reliability of a measurement.

1. First, reliability is derived from the sample and is not an attribute of the measurement instrument itself. This is very important from both a clinical and a research perspective because it means that reliability depends on the sample context, including the testing methods, the characteristics of the tested individuals, and the condition of interest. Therefore, therapists must be diligent when interpreting an instrument's reliability in the process of selecting a measurement instrument or reading a research study. As a critical consumer of the research literature, you should ask the questions, "reliable for whom?" and "reliable for what purposes?" When selecting a measurement instrument or reporting on its reliability, you should select data from studies that most closely approximate your setting and clients.

2. Second, reliability research is conducted under strict conditions with a small number of highly trained and monitored test administrators. Clinical practice is much more variable and likely to be less precise. Therefore, the reliability reported in the literature should be considered the highest reliability possible.

According to the COSMIN taxonomy, reliability is a measurement domain that includes test-retest reliability, inter-rater reliability, intra-rater reliability, measurement error, and, for a multi-item assessment, internal consistency.

Test-Retest Reliability

Test-retest reliability is the consistency of repeated measures over time in an individual who has not changed. To further understand test-retest reliability, we will examine the 9-hole pegboard assessment, a simple performance-based measure of hand dexterity scored by the number of seconds required to place and remove pegs. If an occupational therapist tests an individual with stable hand function, his or her time (score) should be the same now, in 1 hour, and in 1 week.
High test-retest reliability indicates that a measurement instrument is capable of measuring a variable with consistency. Test-retest reliability also provides confidence to practitioners that changes in test scores are due to actual changes in clients, which is useful when measuring progress.

Inter-Rater Reliability

Inter-rater reliability determines how well a measurement instrument provides consistent results between 2 or more practitioners measuring the same construct or individual. For instance, if 2 different occupational therapy practitioners timed the client's performance on the 9-hole pegboard test, they should both get a similar time.

Perfect agreement between test administrators is difficult to achieve, even with pre-determined rules and definitions for the raters to follow. In the pegboard test, each practitioner may have different reaction times when starting and stopping the timer, resulting in different scores. Additionally, when practitioners must observe a client's behavior in order to rate it, their individual backgrounds may result in different ratings. Thus, when an assessment reports high inter-rater reliability, it means the measurement the therapist has chosen yields consistent results across raters.

Intra-Rater Reliability

Intra-rater reliability determines how well an assessment provides the same scores across multiple tests rated by one individual rater. For example, if a therapist timed a 9-hole pegboard administration multiple times, he or she should get the same time on each test.

Intra-rater reliability appears easy to achieve in the 9-hole pegboard example, but it is often more difficult to attain than expected. One feature that challenges intra-rater reliability is rater bias. Rater bias occurs when a therapist is influenced by his or her memory of the first score rated, which results in apparently good intra-rater reliability but is not a true depiction of a measure's reliability.5 If an assessment reports good intra-rater reliability, the test yields consistent results within a single rater.

Statistics Used to Evaluate Test-Retest, Inter-Rater, and Intra-Rater Reliability

The preferred reliability statistic for test-retest, inter-rater, and intra-rater reliability depends on the measurement instrument's response options. Intraclass Correlation Coefficients (ICC) are preferred for continuous scores because they reflect both correlation and agreement.7-9 Cohen's kappa is the preferred statistical method for nominal scales, whereas weighted kappa is preferred for ordinal scales; both correct percent agreement for chance.7,9,10

ICC scores typically range between 0 and 1, with larger values representing greater reliability. Kappa coefficients for nominal and ordinal scales (both weighted and unweighted) can range from -1 to 1 but typically fall between 0 and 1 in clinical assessments, with larger values representing greater reliability.11 When evaluating the reliability of a measurement instrument, Terwee et al12 recommend an ICC or weighted kappa of ≥ 0.70.
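To make these statistics concrete, the brief sketch below computes an ICC(2,1) (two-way random effects, absolute agreement, single rater, following Shrout and Fleiss) from first principles and a weighted kappa using scikit-learn. The rating data, the choice of ICC form, and the library call are illustrative assumptions, not prescriptions from this chapter.

```python
# Illustrative sketch: inter-rater reliability statistics on hypothetical data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def icc_2_1(x: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    x is an (n subjects) x (k raters) matrix of continuous scores."""
    n, k = x.shape
    grand = x.mean()
    ms_subjects = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_raters = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)
    ss_error = np.sum((x - grand) ** 2) - ms_subjects * (n - 1) - ms_raters * (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )

# Hypothetical 9-hole pegboard times (seconds): 6 clients x 2 raters.
times = np.array([[18.2, 18.5], [21.0, 20.6], [25.4, 25.9],
                  [19.8, 19.5], [30.1, 31.0], [22.3, 22.4]])
print(f"ICC(2,1) = {icc_2_1(times):.2f}")  # compare against the >= 0.70 criterion

# Hypothetical ordinal ratings (eg, manual muscle testing grades 0 to 5).
rater_a = [3, 4, 5, 2, 4, 3, 5, 1]
rater_b = [3, 4, 4, 2, 5, 3, 5, 2]
print(f"Weighted kappa = {cohen_kappa_score(rater_a, rater_b, weights='linear'):.2f}")
```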
Internal Consistency

Internal consistency is the degree of inter-relatedness among items in a multi-item measurement instrument2 and indicates the extent to which the items measure aspects of the same characteristic and nothing else.5 For instance, if an individual created a survey for parents to determine their knowledge of developmental milestones, the survey items should depict developmental milestones and not questions about child safety in the home. Unidimensionality (measuring only one construct) of the measurement instrument (or of subscales within it) is a requirement for internal consistency.

Internal consistency typically measures the degree of correlation between the items within a measure. Cronbach's coefficient alpha is the preferred method for evaluating internal consistency for measures developed using CTT. Cronbach's coefficient alpha scores above 0.80 are considered to be high quality,12 although a Cronbach's alpha greater than 0.90 may indicate redundancy in item content.13 An important consideration when analyzing Cronbach's coefficient alpha is the number of items being tested. When more items are being tested, there is a greater chance they are measuring the same construct, resulting in a better Cronbach's coefficient alpha score simply because there are more items.5
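The standard formula is alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. A minimal sketch, using made-up survey responses:

```python
# Illustrative sketch: Cronbach's coefficient alpha for a multi-item scale.
# Rows are respondents, columns are items; the data are hypothetical.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n respondents) x (k items) matrix of item scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 5-item parent-knowledge survey, 8 respondents, items scored 1 to 5.
responses = np.array([
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
    [5, 5, 5, 4, 5],
    [3, 3, 2, 3, 3],
    [4, 4, 4, 5, 4],
    [1, 2, 1, 1, 2],
    [3, 4, 3, 3, 3],
    [5, 4, 5, 5, 4],
])
alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")  # > 0.80 high quality; > 0.90 may signal redundancy
```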
Measurement Error

We strive for reliable measurements, but in reality, no measurement instrument is completely reliable or free from measurement error. Clients, test administrators, the environment, or the instrument itself can introduce error. For example, a client's motivation to perform the test may differ from one administration to the next. Test administrators can introduce error if they do not adhere rigidly to the test protocol. A clinician's expectations based on the client's previous performance may also introduce error. Environmental factors such as lighting, noise, and temperature can introduce variability, and thus error, into the assessment process. Instrumentation errors can be caused by calibration or scale issues (eg, if the battery was wearing down in the stopwatch used for a timed test, the scores may not be accurate) or by defective instruments (eg, the pegboard was poorly manufactured and the pegs did not fit smoothly in the holes). Adherence to test administration protocols, frequent retraining of clinicians, creation of a supportive environment that limits distractions, and consistent documentation of alterations or accommodations to testing protocols are strategies that can help reduce measurement error. While adherence to administration protocols is recommended, many of the clients seen in occupational therapy have physical, cognitive, and sensory issues that may impact their ability to follow administration guidelines. Clinicians should be flexible and use clinical judgment when working with clients.14,15

Measurement error is composed of 2 components: random error and systematic error.2 Systematic errors are predictable errors of measurement: they consistently overestimate or underestimate the true score.5 When a systematic error is detected, one may simply be able to correct the under- or overestimation. For example, if the batteries in a grip strength dynamometer are wearing down or the device needs to be recalibrated, it may consistently underreport test scores. Similarly, culturally insensitive measures that selectively disadvantage test takers from certain racial or ethnic groups are examples of systematic error. A systematic error generally affects validity (ie, the values are not true representations of the construct) more than reliability because the error is consistent.14

Random errors are due to chance and may affect a subject's score in an unforeseeable way from one trial to another. Random errors arise from differences in client motivation, fatigue, inconsistent test administration, and inconsistent scoring due to clinicians' expectations of a client. As random error diminishes, reliability increases. Because we may not always know how much a value sways from the intended value, we must estimate the measurement error through the use of reliability.5

The preferred statistic for measurement error in studies based on CTT is the standard error of measurement (SEM). The SEM allows us to estimate the amount of error by measuring the variability of multiple scores on a single subject. Reliability and measurement error are related but distinctly separate; when interpreting the SEM, the more reliable the measurement, the smaller the error or SEM. SEM statistics also typically provide a confidence level for the derived score. For example, most SEM statistics report a 95% confidence interval (CI), meaning that the SEM estimates that 95% of the time, the errors from the measurement fall within the reported SEM range.

Minimal detectable change (MDC) is related to the SEM. The MDC is the smallest amount of change in a score that can be interpreted as real change (and not just a function of measurement error). When interpreting a measurement instrument's derived scores, it is important to consider both the score change and the standard error of measurement in order to determine the MDC. The MDC provides an estimate of the error between 2 scores from 2 different assessments of the same person. For example, the Barthel Index is a commonly used performance-based measure of independence in activities of daily living (ADL). Clients receive a score between 0 and 100, with 0 being total dependence and 100 being complete independence. The Barthel Index's measurement properties have been evaluated in people with chronic stroke16:

SEM = 1.45
MDC = 4.02

Consider a client with a chronic stroke with the following scores:
Admission Barthel Index score = 57 ± 1.45, or 55.55 to 58.45
Re-evaluation Barthel Index score = 60 ± 1.45, or 58.55 to 61.45

Did the client improve? Based on the raw scores alone, we might conclude that the client is improving. However, the difference in scores (3 points) is less than the MDC (4.02 points), so we cannot conclude that the client has improved.
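The arithmetic behind this example can be scripted. At the 95% confidence level, MDC95 = 1.96 × √2 × SEM. The sketch below reproduces the Barthel Index numbers above; the published SEM of 1.45 is the only input taken from the text, and the helper names are our own.

```python
# Illustrative sketch: using SEM and MDC to judge whether a score change is real.
import math

def mdc95(sem: float) -> float:
    """Minimal detectable change at 95% confidence: 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2) * sem

sem = 1.45                      # published SEM of the Barthel Index in chronic stroke
mdc = mdc95(sem)                # ~4.02 points
print(f"MDC95 = {mdc:.2f}")

admission, re_evaluation = 57, 60
change = re_evaluation - admission
print(f"Admission:     {admission} +/- {sem} -> {admission - sem:.2f} to {admission + sem:.2f}")
print(f"Re-evaluation: {re_evaluation} +/- {sem} -> {re_evaluation - sem:.2f} to {re_evaluation + sem:.2f}")

# A change must exceed the MDC before we can call it real change.
if change >= mdc:
    print(f"Change of {change} points exceeds the MDC; real change is likely.")
else:
    print(f"Change of {change} points is within measurement error (MDC = {mdc:.2f}).")
```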
Reliability of Item Response Theory–Based Measurement Instruments

For IRT-based measurement instruments, reliability is determined by the discriminative ability of the items.1,17 Person separation statistics indicate how well items within an IRT-based measurement instrument differentiate people. Separation statistics range from 0.0 to 1.0, with higher values indicative of better separation. Person separation reliability values of > 0.8 are acceptable.18 Standard error (SE) is comparable to the SEM and is the recommended statistic for measurement error in IRT-based measurement instruments.1 Smaller SEs are associated with more discriminating items.

Validity

After determining a measurement instrument's reliability and that it is relatively free from error, we must determine whether the assessment is valid. Validity is the degree to which an instrument measures the construct(s) it intends to assess.2 Validity places emphasis on the ability to make inferences based on derived test scores or measurements. For example, clinicians typically use assessment scores to evaluate and predict functional performance in daily life. Assessments must have the ability to evaluate the person's current performance and accurately predict future functioning. Therefore, validity addresses what practitioners are able to do with the instrument results.

As with reliability, validity must be understood as a property of the instrument's derived scores, rather than of the instrument itself.19 Context of use, including client characteristics, clinical settings, and conditions of measurement, all influence the validity of the scores.6 As with reliability claims, the critical consumer must be wary of blanket statements of an instrument's validity and ask the questions, "valid for whom?" and "valid for what purpose?"

Reliability is a necessary prerequisite for validity because validity implies that an instrument is relatively free from error. A score derived from an unreliable measure cannot be considered valid. Reliability thus defines the upper limit of validity.6 Validity is a unitary construct,14 but it can also be reported along a continuum. Occupational therapy practitioners must evaluate the accumulated validity data to make informed choices when selecting and interpreting an instrument and its scores.

Validity can be understood as consisting of 3 main types: content validity, construct validity, and criterion validity. Additionally, 3 measurement properties (fairness in testing, responsiveness to change, and floor/ceiling effects) help us to further understand the application of validity to clinical practice.

Content Validity

Content validity is the extent to which the domain of interest is comprehensively sampled by the items in the instrument.12 Content validity is evaluated by making judgments about the relevance and comprehensiveness of items within a given measurement. Content validity should be evaluated in a sample similar to the target population for which the measurement will be clinically used. The context in which the instrument will be used is important to consider, as not all groups may experience the measured construct in the exact same manner. Relevant experiences and manifestations of the construct must be represented; otherwise, inaccurate inferences may be made based on the derived scores.

In 2009, the United States Food and Drug Administration issued the Guidance for Industry on Patient-Reported Outcome Measures, stressing the importance of direct input from people in the measured population when evaluating content validity.20 Therefore, when evaluating content validity claims, it is important to determine that members of the target population, not just clinical or content experts, were included in the validation sample. Content validity is typically evaluated using rigorous qualitative methods.21 Mixed methods approaches that integrate qualitative and advanced quantitative methods, such as factor analysis, structural equation modeling, and differential item functioning, are also used.22

Face Validity

Face validity is a component of content validity but is less rigorous. Face validity is the extent to which an instrument "appears" to test what it is intended to test. Face validity lends itself to some subjectivity and should not be considered a significant source of instrument validity, as there is no standard for judging the amount of validity an instrument possesses.
Criterion Validity

Criterion validity is the degree to which scores of an instrument are an adequate reflection of a "gold standard" measure (also referred to as the criterion standard).2 By definition, criterion validity is only applicable in situations where there is an agreed-upon gold standard assessment for the construct being measured. The gold standard instrument used for comparison must have previously established validity, because without a valid instrument serving as the comparative criterion, the comparison cannot support the target instrument's accuracy.

When evaluating criterion validity, there must be convincing evidence that the comparative assessment is truly the agreed-upon gold standard method for assessing the construct.12 To assess criterion validity, the target instrument and the gold standard tool should be administered and scored independently of one another. Scores from the target instrument and the gold standard should then be compared and should meet a predefined level of agreement. Criterion validity is further divided into concurrent and predictive validity.1

Concurrent validity is established when 2 measures are taken at relatively the same time, with 1 measure being the gold standard clinical tool. If the scores are similar, we may conclude that the target assessment is as accurate as (or more efficient than) the gold standard measure and, thus, can be used interchangeably with it. For example, when determining the concurrent validity of an observational balance assessment, researchers may have participants balance on a force platform, which has been previously validated, while observing the person's posture. Concurrent validity is appropriate for instruments used for diagnostic/identification purposes (eg, to identify limitations in occupational performance) or for evaluative purposes (eg, to track change over time).

Predictive validity establishes that derived scores from the target assessment can be used to predict future scores or clinical outcomes in the same individuals. Predictive validity is evaluated for instruments that seek to determine whether scores can predict gold standard outcomes in the future. For example, do fall risk assessment scores predict future falls in older adults? Assessments with predictive validity are useful in identifying at-risk populations, allowing clinicians to intervene. Assessments with adequate predictive validity also help practitioners set long-term goals for the people they serve.
Construct Validity

Many constructs of interest in occupational therapy have no commonly agreed-upon gold standards. In these situations, construct validation should be used to support validity claims. Construct validity is the degree to which the scores of an instrument are consistent with hypotheses regarding internal relationships, relationships to scores of other instruments, or differences between relevant groups.2 There are 3 aspects of construct validity: structural validity, hypothesis testing, and cross-cultural validity.

Structural validity is a relevant concept only for multi-item measurement instruments and concerns the extent to which all items represent the same underlying construct.2 If a measurement instrument is determined to measure more than one construct, then subscale scores should be reported. For example, the Bruininks-Oseretsky Test of Motor Proficiency is a comprehensive assessment of pediatric motor performance, but it includes separate subscores for strength and dexterity. For measurement instruments developed using CTT, confirmatory factor analysis (CFA) is the preferred statistical approach.1 The factor structure and model fit are evaluated using the comparative fit index (CFI), the root mean square error of approximation (RMSEA), and the standardized root mean square residual (SRMR). Models with CFI ≥ 0.95, RMSEA ≤ 0.06, and SRMR ≤ 0.08 are considered to be good-fitting models.23

Hypothesis testing is the extent to which scores on a particular measurement instrument relate to other measures in a manner that is consistent with theoretically derived hypotheses.12 The construct validation of measurement instruments is based on the development and empirical testing of conceptually driven relationships between the scores of one measurement instrument and the scores from other measures. Hypotheses are evaluated by correlation and should specify both the direction and magnitude of the correlations or mean differences.24

There are 3 main types of construct validity described in the literature. Convergent validity claims are supported when scores from measurement instruments that purport to measure the same construct are highly correlated. Divergent validity claims are supported when scores from measurement instruments that purport to measure different constructs are weakly correlated. Known-groups or discriminant validity claims are supported when samples with known differences in the constructs of interest obtain significantly different or weakly correlated scores. Construct validation is an ongoing process aimed at building a body of evidence to support construct validity claims.25,26 Terwee and colleagues12 suggest, as quality criteria to substantiate construct validity claims, that specific hypotheses be formulated and that at least 75% of the results be in accordance with these hypotheses.

Cross-cultural validity is the "degree to which the performance of the items on a translated or culturally adapted [measurement] instrument are an adequate reflection of the performance of items in the original version of the instrument."2(p743) Issues of cross-cultural validity are particularly important in questionnaires and self-report instruments. Some items or constructs that are highly relevant in one culture may be quite meaningless in another. As a starting point, rigorous translation and cultural adaptation practices should be implemented in accordance with state-of-the-science guidelines,27 followed by cognitive and hypothesis testing in the target population.24 A cross-culturally valid instrument should function the same way in different populations. From a clinical perspective, it is important to recognize that validity evidence derived from measurement instruments cannot be assumed across cultures; rather, translations and cultural adaptations must be rigorously conducted and systematically evaluated.
Statistics Used to Evaluate Validity

As with reliability, the preferred validity statistic depends on the instrument's response options. When evaluating concurrent validity and both measures report continuous variables, Pearson correlation coefficients are often reported.7-9 When the target instrument scores are continuous and the gold standard is dichotomous (has only 2 categories), the area under the receiver operating characteristic (ROC) curve is commonly reported. Finally, when both measures are dichotomous, sensitivity and specificity are reported.25 Sensitivity is the proportion of true positives that are correctly identified by the test, whereas specificity is the proportion of true negatives that are correctly identified.28 Sensitivity and specificity are particularly important for diagnostic tests, where false positives and false negatives have critical consequences. Refer to Table 3-5 for acceptable statistical ranges for validity measures.

Validity of IRT-Based Measurement Instruments

Validity in the IRT context is related to measurement accuracy, and fit statistics provide evidence of the validity of the items and their coherence with the construct.18 Items that deviate from the IRT measurement model are considered misfitting items and are typically deleted during the instrument development process. Fit statistics are evaluated for each item via a mean square statistic: the ratio of observed variance to the variance expected from the model. A value of 1.0 indicates perfect agreement. Fit statistics of < 2.0 are considered acceptable.
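A brief sketch of the three situations described above, using scipy and scikit-learn; the instrument, the cutoff, and all data are invented for illustration:

```python
# Illustrative sketch: common criterion validity statistics on hypothetical data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score, confusion_matrix

# 1) Both measures continuous: Pearson correlation (concurrent validity).
target_scores = np.array([12, 18, 25, 31, 40, 44, 52, 60])   # new instrument
gold_scores   = np.array([10, 20, 24, 35, 38, 47, 55, 58])   # validated instrument
r, p = pearsonr(target_scores, gold_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")

# 2) Continuous target, dichotomous gold standard: area under the ROC curve.
fall_risk_scores = np.array([2, 3, 5, 6, 7, 8, 9, 10])  # higher = higher risk
fell_within_year = np.array([0, 0, 0, 1, 0, 1, 1, 1])   # gold standard outcome
print(f"ROC AUC = {roc_auc_score(fell_within_year, fall_risk_scores):.2f}")

# 3) Both measures dichotomous: sensitivity and specificity.
screen_positive = (fall_risk_scores >= 6).astype(int)    # hypothetical cutoff
tn, fp, fn, tp = confusion_matrix(fell_within_year, screen_positive).ravel()
print(f"Sensitivity = {tp / (tp + fn):.2f}, Specificity = {tn / (tn + fp):.2f}")
```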
Fairness in Testing

Fairness in testing dictates that all test takers should have an unobstructed opportunity to demonstrate their standing on the construct(s) being measured.14 Fairness in testing is a fundamental validity issue and requires attention through all stages of test development and use.14 Issues of fairness in testing may be particularly important given that occupational therapy practitioners serve increasingly diverse populations, not only in terms of disability status, but also age, race/ethnicity, gender identity, functional literacy, and culture.

Principles of fairness in testing stress flexibility in assessment administration to provide equal opportunities for all test takers.14 It may be appropriate to provide reasonable accommodations (alterations to the testing environment or task demands) for some people with disabilities. An in-depth understanding of the constructs being measured and the theoretical relationship between the test items (performance-based or self-report) can help ensure that accommodations do not fundamentally alter the core task demands and thereby invalidate the derived scores. Information about reasonable accommodations and modifications can help end users evaluate the impact that nonstandard administrations can have on derived scores.

Responsiveness to Change

The ultimate goal of occupational therapy is to optimize occupational performance, participation, and quality of life for the people we serve. It is therefore necessary that practitioners be able to document change in performance over time. Change can be either an improvement or a decline in the person's ability to complete the measured skill. The ability of an instrument to detect change over time in the measured construct is called responsiveness.2

Responsiveness may be understood as an aspect of validity related to change in scores over time and is sometimes called longitudinal validity. Longitudinal studies are required to evaluate responsiveness, but the absolute time frame for the study is less important than the observation that changes in scores have occurred.1 As with validity claims, assessing responsiveness consists of hypothesis testing by comparing change scores in known groups or by evaluating changes in measures.12 Change scores should be obtained independently for both the target instrument and the comparative measure (a criterion measure, if available, or another instrument whose responsiveness has been established). The appropriate statistics for evaluating responsiveness are correlations between change scores, the area under the ROC curve for continuous variables, and sensitivity and specificity for dichotomous scales.24

Terwee and colleagues12 suggest that responsive measures should also be able to distinguish important change from measurement error. The minimal clinically important difference (MCID) has become a valuable statistic when assessing changes in performance over time. The MCID is the smallest difference in the measured construct that indicates a clinically meaningful change has occurred.29 Determining the MCID is necessary to judge the effectiveness of treatments and interventions, as clinicians rely on tracking meaningful changes in performance when comparing the effectiveness of clinical approaches.
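In practice, a change score is often read against both the MDC (can we trust that any change occurred?) and the MCID (is the change large enough to matter clinically?). A minimal sketch of that two-step interpretation, with invented threshold values:

```python
# Illustrative sketch: interpreting a change score against MDC and MCID.
# The thresholds below are invented for demonstration, not published values.
def interpret_change(change: float, mdc: float, mcid: float) -> str:
    """Classify an observed change score relative to MDC and MCID."""
    if abs(change) < mdc:
        return "within measurement error; no detectable change"
    if abs(change) < mcid:
        return "detectable, but below the threshold for clinical importance"
    return "detectable and clinically meaningful change"

MDC, MCID = 4.0, 9.0   # hypothetical thresholds for a 0-100 scale
for change in (3, 6, 12):
    print(f"Change of {change:>2} points: {interpret_change(change, MDC, MCID)}")
```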
Floor and Ceiling Effects

Floor and ceiling effects indicate that a measurement instrument may lack the ability to adequately discriminate among individuals with the highest and lowest scores, thereby limiting the reliability of the results. A measurement instrument with significant floor and ceiling effects does not allow for measurement of changes at the lowest or highest ratings, limiting responsiveness and the tracking of functional change over time. Clinicians need to be cautious of instruments with documented floor and ceiling effects when working with individuals testing at the extremes of the assessment scale.
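Floor and ceiling effects are commonly screened by counting the share of a sample scoring at the scale's minimum and maximum. A minimal sketch, assuming hypothetical scores and a 15% flag threshold drawn from the broader quality-criteria literature rather than from this chapter:

```python
# Illustrative sketch: screening for floor and ceiling effects.
# The sample scores and the 15% flag threshold are assumptions for demonstration.
def floor_ceiling(scores: list[float], minimum: float, maximum: float,
                  threshold: float = 0.15) -> None:
    n = len(scores)
    floor_pct = sum(s == minimum for s in scores) / n
    ceiling_pct = sum(s == maximum for s in scores) / n
    print(f"At floor:   {floor_pct:.0%}{'  <- possible floor effect' if floor_pct > threshold else ''}")
    print(f"At ceiling: {ceiling_pct:.0%}{'  <- possible ceiling effect' if ceiling_pct > threshold else ''}")

# Hypothetical 0-100 ADL scores from an outpatient sample.
sample = [100, 100, 100, 95, 90, 100, 85, 100, 75, 100]
floor_ceiling(sample, minimum=0, maximum=100)  # 60% at ceiling -> ceiling effect
```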
PRACTICAL CONSIDERATIONS

Interpretability

One challenge for occupational therapy practitioners using measurement instruments in clinical practice is interpreting changes in derived scores and using them to inform treatment decisions. Statistically significant change does not necessarily translate to meaningful changes in daily life experiences and outcomes. While not strictly a measurement property, interpretability is an important consideration for the use of measurement instruments in clinical practice. A variety of forms of evidence can be used to support the interpretability of derived scores, including comparing means and standard deviations of subgroups that are known to be different. According to Terwee et al,12 positive ratings for interpretability are given if mean scores and standard deviations are provided for at least 4 subgroups, regardless of the measurement approach. The MCID also provides evidence of interpretability and is increasingly reported in the assessment literature.30

Clinical Utility

Clinicians must have a thorough understanding of measurement properties in order to make informed decisions about the selection of measurement instruments and the interpretation of scores. It is also important to acknowledge the practical considerations that influence assessment use in routine clinical care. Clinical utility, or the usefulness of a clinical instrument or intervention, is related to the feasibility of administering the measurement instrument in clinical environments. Riddle and Stratford6 have identified important considerations related to clinical utility at the clinician, client, and organizational levels, including the following:

Clinician-related factors
- Opportunity to research and identify the best possible measurement instruments for each individual client
- Training and certification requirements needed to properly administer the measurement instrument
- Burden of measurement on clinicians, including the time needed to administer and score measures
- Clinical reasoning to interpret derived scores and apply them to practice

Client-related factors
- Availability of appropriately targeted measurement instruments to address the construct of interest
- Accessibility considerations when there is a mismatch between the client's functional capacity and the task demands
- Client preferences for active treatment vs assessment; some clients may perceive assessment processes to be less valuable than treatment

Organization-related factors
- Cost of measurement instruments, which can limit the choices practitioners have when evaluating instruments
- Resource requirements, such as space, equipment, and other instrument-specific requirements, which can be a limiting factor in some clinical environments

Organizational factors, including policies, practices, and procedures, can help or hinder measurement. Riddle and Stratford6 suggest that a systematic and thoughtful approach to measurement can reduce burden and increase the application of appropriate instruments by following a 6-step process:

Step 1: Clarify what you hope to assess and why
Step 2: State important patient-related constraints
Step 3: State important constraints relevant to your clinical setting
Step 4: Search effectively and efficiently for relevant measurement instruments
Step 5: Seek out measures that provide information on interpretation of score values
Step 6: Scrutinize measurement properties relevant to your clinical context

The following case study applies the Riddle and Stratford strategy to illustrate the identification and selection of instruments based on clinical utility and measurement property considerations.

Case Study

Felipe is a 53-year-old restaurant owner and chef who experienced a left ischemic stroke, which primarily affected his right side. After intensive inpatient rehabilitation...

Step 4: Search effectively and efficiently for relevant measurement instruments. Based on Steps 1 through 3, a functional performance-based assessment of upper extremity function for use in adults after stroke is the most appropriate measure to find. The assessment must be free or low cost, with minimal training requirements. By clearly defining these criteria in advance, the clinician can rule out instruments that are not feasible for the practice setting and the client.

The most rigorous approach to instrument identification and selection is to conduct a systematic literature search. There are a number of searchable instrument databases that provide synthesized, peer-reviewed instrument summaries. One such database is the RehabMeasures Database (RMD) (www.rehabmeasures.org). The RMD is searchable by area of assessment, diagnosis, cost, and length of time to administer the test.

Entering the search criteria in the RMD yielded a total of 5 candidate measures:
1. Nine-hole peg test (9-HPT)