
Psychometric Properties



Psychometric Properties
Assumptions about Psychological Testing and Assessment
Characteristics of a "Good Test": Reliability, Validity, Norms (Standardization), and Utility

Assumptions about Psychological Testing and Assessment
1. Psychological traits and states exist.
2. Psychological traits and states can be quantified and measured.
3. Test-related behavior predicts non-test-related behavior.
4. Tests and other measurement techniques have strengths and weaknesses.
5. Various sources of error are part of the assessment process.
6. Testing and assessment can be conducted in a fair and unbiased manner.
7. Testing and assessment benefit society.

Assumption 1: Psychological Traits and States Exist
Trait: "any distinguishable, relatively enduring way in which one individual varies from another" (Guilford, 1959, p. 6). A trait is inferred from a sample of behavior that may be obtained through direct observation, analysis of self-report statements, or pen-and-paper test answers. States also distinguish one person from another but are relatively less enduring. The phrase "relatively enduring" in this definition is a reminder that a trait is not expected to be manifested in behavior 100% of the time. For example, we may become more agreeable and conscientious as we age, and perhaps become less prone to "sweat the small stuff" (Lüdtke et al., 2009; Roberts et al., 2003, 2006). Yet there also seems to be rank-order stability in personality traits, as evidenced by relatively high correlations between trait scores at different time points (Lüdtke et al., 2009; Roberts & Del Vecchio, 2000).

Describing psychological traits and states means pointing out the way in which one individual varies from another and the magnitude of the observed characteristic (e.g., degree of shyness). Trait expression may also be situation-dependent (a PDL's violent behavior in the presence of law enforcers versus family) and contextual (praying in church versus at a sports event). (PDL: person deprived of liberty.)

Assumption 2: Psychological Traits and States Can Be Quantified and Measured
Operationally defining constructs aids in quantifying and measuring traits and states. Sample operational definitions of "aggressive": the number of self-reported acts of physically harming others; the number of observed acts of aggression, such as pushing, hitting, or kicking, that occur in a playground setting. If constructs have components or dimensions (e.g., intelligence components such as Philippine history and social judgment), should they carry the same weight in the test? Weighting the comparative value of a test's items comes about as the result of a complex interplay among many factors, including technical considerations, the way a construct has been defined for the purposes of the test, and the value society (and the test developer) attaches to the behaviors evaluated.

Scoring
A test score is a number representing the strength of the test taker's targeted ability, trait, or state, and it is frequently based on cumulative scoring: the assumption that the more the test taker responds in the direction keyed by the test manual as correct or consistent with a particular trait, the higher that test taker is presumed to stand on the targeted ability or trait.
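To make cumulative scoring concrete, here is a minimal sketch (not from the slides); the item names, scoring key, and responses are all invented for illustration.

```python
# Hypothetical scoring key: for each item, the response keyed as consistent
# with the targeted trait (all names and values invented for illustration).
KEY = {"item1": "yes", "item2": "no", "item3": "yes", "item4": "yes"}

def cumulative_score(responses: dict) -> int:
    """Count responses matching the keyed direction; a higher total is taken
    to indicate more of the targeted trait or ability."""
    return sum(1 for item, answer in responses.items() if KEY.get(item) == answer)

test_taker = {"item1": "yes", "item2": "no", "item3": "no", "item4": "yes"}
print(cumulative_score(test_taker))  # 3 of 4 items answered in the keyed direction
```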
Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior
Tasks in some tests mimic the actual behaviors that the test user is attempting to understand. The obtained sample of behavior is typically used to make predictions about future behavior, such as the work performance of a job applicant. It is beyond the capability of any known testing or assessment procedure to reconstruct someone's state of mind. In practice, tests have proven to be good predictors of some types of behavior and not-so-good predictors of others; for example, tests have not proven to be as good at predicting violence as had been hoped. Why do you think it is so difficult to predict violence by means of a test?

Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses
Competent test users understand and appreciate the limitations of the tests they use, as well as how those limitations might be compensated for by data from other sources. This point is emphasized in the codes of ethics of assessment professionals.

Assumption 5: Various Sources of Error Are Part of the Assessment Process
Error is a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test; it is a variable that must be considered in any assessment. Error variance is the component of a test score attributable to sources other than the trait or ability measured; potential sources include test takers, test users, the instruments used, and so on. In classical test theory (CTT, also known as true score theory), the assumption is made that each test taker has a true score on a test that would be obtained but for the action of measurement error.

Assumption 6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
This is the most controversial assumption. One source of fairness-related problems is the test user who attempts to use a particular test with people whose background and experience differ from the background and experience of the people for whom the test was intended. Some potential problems related to test fairness are more political than psychometric. In all questions about test fairness, it is important to keep in mind that tests are tools with proper and improper ways of being used.

Assumption 7: Testing and Assessment Benefit Society
In a world without tests or other assessment procedures: personnel might be hired on the basis of nepotism rather than documented merit; teachers and school administrators could arbitrarily place children in different types of special classes simply because that is where they believed the children belonged; there would be a great need for instruments to diagnose educational difficulties in reading and math and to point the way to remediation; there would be no instruments to diagnose neuropsychological impairments; and there would be no practical way for the military to screen thousands of recruits on many key variables.

Characteristics of a "Good Test"
Technical criteria used to evaluate the quality of tests and other measurement procedures → psychometric soundness: reliability and validity. Other considerations: adequate norms, when the test's purpose is to compare the test taker's performance with other test takers' performance, and test utility, the practical value (usefulness) of the test.

Reliability
Reliability is the extent to which a test, assessment, or data collection instrument or procedure measures consistently.
The basic starting point for all theories of reliability is the idea that test scores reflect the influence of two sorts of factors:
- Factors of consistency: stable characteristics of the person and/or the construct.
- Factors of inconsistency: features of the individual or situation that affect test scores but have nothing to do with the construct.

Example: weighing 100 g of gold on three different weighing scales. (A) Every weighing reads 100 g. (B) Every weighing reads 103 g — consistent, hence reliable, though not accurate. (C) Three weighings read 95 g, 101 g, and 103 g — this scale is not reliable. Error may be systematic or random.

The reliability criterion is the dependability or consistency of the measuring tool: in theory, a perfectly reliable tool measures in the same way every time. The reliability coefficient (estimate) is an index of reliability. A score on a test (X) is presumed to reflect not only the examinee's true score (T) on the construct being measured but also error (E), the component of the observed test score that does not have to do with the examinee's ability:

X = T + E

Total variance (of observed test scores) = true variance (of the measured construct) + error variance. Statistically, reliability is the proportion of the total variance attributable to true variance. (A brief simulation of this decomposition appears at the end of this subsection.)

Sources of Error Variance
- Test construction: item or content sampling — variation among items within a test as well as variation among items between tests.
- Test administration: the test taker's reaction to anything that influences attention or motivation (the test environment, current events); test taker-related variables (pressing emotional problems, physical discomfort, lack of sleep, effects of drugs or medication, formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state); and examiner-related variables (physical appearance and demeanor, unconscious provision of clues, non-verbal gestures, and level of professionalism).
- Test scoring and interpretation: scorer differences in manual scoring and individual test administration.
- Other sources of error: sampling error; nonsystematic error (in abuse cases, for example, forgetting, failing to notice abusive behavior, and misunderstanding instructions regarding reporting).

Sources of Consistency and Inconsistency
1. Lasting and general characteristics of the individual: level of ability on one or more general traits; test-taking skills and techniques; general ability to comprehend instructions.
2. Lasting but specific characteristics of the individual: knowledge and skills specific to a particular form of test items; the chance element determining whether the individual knows a particular fact.
3. Temporary but general characteristics of the individual: health, fatigue, motivation, emotional strain.
4. Temporary and specific characteristics of the individual: comprehension of the specific test task; fluctuations of human memory.
5. Systematic or chance factors affecting the administration of the test or the appraisal of test performance: conditions of testing, such as adherence to time limits; personality, race, or sex of the examiner (which may facilitate or inhibit performance).
6. Variance not otherwise accounted for (chance): luck in the selection of answers by sheer guessing; momentary distraction.
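A small simulation (not part of the original slides) can make the X = T + E decomposition concrete: with assumed spreads for true scores and error, reliability works out to the proportion of observed-score variance that is true-score variance.

```python
# Minimal sketch of the classical test theory decomposition X = T + E.
# All distributions and numbers are assumptions chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_scores = rng.normal(loc=100, scale=15, size=n)  # T: stable construct level
error = rng.normal(loc=0, scale=5, size=n)           # E: random measurement error
observed = true_scores + error                       # X = T + E

reliability = true_scores.var() / observed.var()     # true variance / total variance
print(round(reliability, 3))  # close to 15**2 / (15**2 + 5**2) = 0.90
```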
Types of Reliability and Their Estimates
- Test-retest reliability: coefficient of stability
- Equivalent-forms reliability: coefficient of equivalence (parallel-forms reliability, alternate-forms reliability)
- Split-half reliability
- Other internal (inter-item) consistency measures: Kuder-Richardson 20 (KR-20), coefficient alpha, average proportional distance (APD)
- Inter-scorer reliability

Test-Retest Reliability
The extent to which a data collection tool measures consistently across different data collection events — stability over time (Test at Time 1 = Test at Time 2). Contributing factors:
1. Clear instructions for administrators, research participants, and raters.
2. Tasks or questions in the participants' first language, or in the target language at an appropriate level of difficulty.
3. Unambiguously phrased tasks or questions.
If the characteristic being measured is assumed to fluctuate over time, there would be little sense in assessing the reliability of the test using the test-retest method.

Test-Retest Reliability Estimates
An estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. The term coefficient of stability is used when the interval between testings is greater than six months. This approach is appropriate when evaluating the reliability of a test that purports to measure something relatively stable over time, such as a personality trait. Sources of error variance: the length of time between test administrations is crucial (generally, the longer the interval, the lower the reliability); memory and learning effects; and the stability of the construct being assessed.

Equivalent-Forms Reliability
Parallel forms are the ideal kind of alternate forms. Equivalent-forms reliability is the extent to which different forms of the same tool measure in a similar way (the extent to which the forms are interchangeable) — stability across forms (Form A at Time 1 compared with Form B at Time 2). Procedure:
1. Administer both forms to the same people.
2. Compute the correlation between the two forms.
This is usually done in educational contexts where alternative forms are needed because of the frequency of retesting. Contributing factors: (1) development of equivalent forms from specifications that describe tool content; (2) trial of the tools before data collection to ensure equivalence. Two types: (1) immediate (back-to-back administrations) and (2) delayed (a time interval between administrations). Some issues: the forms need the same number and type of items, item difficulty must be the same on each form, and the variability of scores must be equivalent across forms.

Parallel forms: for each form of the test, the means and variances of observed test scores are equal. Parallel-forms reliability is an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form, the means and variances of observed test scores are equal. Alternate forms are different versions of a test that have been constructed so as to be parallel. Alternate-forms reliability is an estimate of the extent to which these different forms of the same test have been affected by item sampling error or other error.
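A minimal sketch of the correlational estimate underlying both test-retest and alternate-forms reliability: correlate the same people's scores from two administrations (or from Form A and Form B). The scores below are invented.

```python
# Hypothetical total scores for eight people on two administrations of a test.
import numpy as np

time1 = np.array([12, 18, 25, 31, 22, 15, 28, 20])  # first administration
time2 = np.array([14, 17, 27, 30, 21, 16, 26, 22])  # later administration (or Form B)

r = np.corrcoef(time1, time2)[0, 1]  # Pearson r: coefficient of stability/equivalence
print(round(r, 2))
```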
Equivalent-Forms Reliability Estimate
Coefficient of equivalence: the degree of relationship between various forms of a test, evaluated by means of an alternate-forms or parallel-forms coefficient of reliability. Considerations when obtaining this estimate: two test administrations with the same group are required; test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy (although not as much as when the same test is administered twice); and the procedure is time-consuming and expensive.

Internal Consistency
The extent to which individual test items are congruent with the other items on the data collection tool. High internal consistency is a necessity for norm-referenced tools. Contributing factors:
1. Careful item writing, guided by item specifications.
2. Field testing and item analysis.
3. Construction of tests with reference to item performance.

Other Methods of Estimating Internal Consistency
Inter-item consistency assesses the homogeneity of the test — the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test. Estimates of internal consistency include the Kuder-Richardson formula 20 (KR-20), coefficient alpha (α), McDonald's omega (ω), and the average proportional distance (APD).

Coefficient alpha (α): the preferred statistic for obtaining an estimate of internal consistency reliability; it assumes that responses to individual questions are normally distributed, have equal variance, and explain the factor equally. Alpha is widely used in part because it requires only one administration of the test, and it is most appropriate when the scale is unidimensional. With alpha, similarity is gauged, in essence, on a scale from 0 (absolutely no similarity) to 1 (perfectly identical). As Streiner (2003b) pointed out, a value of alpha above .90 may be "too high" and indicate redundancy in the items.

McDonald's omega (ω): based on a factor-analytic approach, in contrast to alpha, which is primarily based on the correlations between the questions. Omega has proven to be more robust than alpha against deviations from alpha's assumptions, and it will thus generally be a more suitable measure of internal consistency for multidimensional scales.

The Kuder-Richardson formula 20 (KR-20): the statistic of choice for determining the inter-item consistency of homogeneous dichotomous items, primarily items that can be scored right or wrong (such as multiple-choice items).

Average proportional distance (APD): a relatively new measure for evaluating the internal consistency of a test; it focuses on the degree of difference that exists between item scores.
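A short sketch of coefficient alpha computed from a person-by-item score matrix, using the standard formula α = (k/(k−1))(1 − Σ item variances / total-score variance); the tiny data sets are invented, and applying the same function to 0/1-scored items reproduces KR-20.

```python
# Coefficient alpha (and, for dichotomous items, KR-20) from invented data.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: rows = test takers, columns = item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

likert = np.array([[4, 5, 4, 5],   # multipoint (e.g., Likert) responses
                   [2, 3, 2, 2],
                   [3, 3, 4, 3],
                   [5, 4, 5, 5],
                   [1, 2, 1, 2]])
print(round(cronbach_alpha(likert), 2))

dichotomous = np.array([[1, 1, 1, 0],   # right/wrong items
                        [1, 0, 0, 0],
                        [1, 1, 1, 1],
                        [0, 0, 1, 0],
                        [1, 1, 0, 1]])
print(round(cronbach_alpha(dichotomous), 2))  # equals KR-20 for 0/1 items
```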
Internal Consistency Reliability: Average Inter-Item Correlation
The average inter-item correlation is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients. (In the slide's six-item example, the pairwise correlations range from about .84 to .95 and average .90.)

Internal Consistency Reliability: Average Item-Total Correlation
Each item is also correlated with the total test score; the average of these item-total correlations (about .85 in the slide's example) is another index of internal consistency.

Split-Half Reliability
Obtained by correlating two pairs of scores from equivalent halves of a single test administered once. Procedure:
1. Divide the test into equivalent halves — for example, an odd-even split, or a split by content so that each half contains items equivalent with respect to content and difficulty.
2. Calculate a Pearson r between scores on the two halves of the test.
3. Adjust the half-test reliability using the Spearman-Brown formula, which allows a test developer or user to estimate internal consistency reliability from the correlation between the two halves:

r_SB = 2 r_hh / (1 + r_hh)

where r_hh is the correlation between the two half-tests (a worked example appears at the end of this subsection). The Spearman-Brown formula is a specific application of a more general formula for estimating the reliability of a test that is lengthened or shortened by any number of items. Cronbach's alpha can be thought of as roughly the average of all possible split-half correlations (in the slide's example, split halves SH1 through SHn correlate between about .83 and .91, and α = .85); use KR-20 rather than alpha for dichotomously scored (nominal) data.

Inter-Scorer Reliability
Also called scorer reliability, judge reliability, observer reliability, and inter-rater reliability: the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure. Intrarater reliability is the extent to which an individual scorer is consistent in how the criteria are applied; interrater reliability is the extent to which multiple scorers are consistent in how the criteria are applied. The coefficient of inter-scorer reliability is the correlation coefficient between the raters' ratings; Cohen's kappa (κ) is another common index. Contributing factors for both intrarater and interrater reliability: unambiguous criteria for scoring, and the raters' thorough training (and practice!) in applying the criteria.
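A worked sketch of the odd-even split-half procedure with the Spearman-Brown correction r_SB = 2r_hh / (1 + r_hh); the 8-person, 6-item data are invented for illustration.

```python
# Odd-even split-half reliability with the Spearman-Brown correction.
import numpy as np

scores = np.array([[4, 3, 4, 4, 3, 4],   # rows = test takers, columns = items 1-6
                   [2, 2, 1, 2, 2, 1],
                   [5, 4, 5, 5, 4, 5],
                   [3, 3, 3, 2, 3, 3],
                   [1, 2, 1, 1, 2, 2],
                   [4, 4, 5, 4, 4, 4],
                   [2, 3, 2, 3, 2, 2],
                   [5, 5, 4, 5, 5, 5]])

odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5
even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6

r_hh = np.corrcoef(odd_half, even_half)[0, 1]  # half-test correlation
r_sb = 2 * r_hh / (1 + r_hh)                   # Spearman-Brown corrected estimate
print(round(r_hh, 2), round(r_sb, 2))
```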
Reliability Types Summary
- Test-retest. Purpose: to evaluate the stability of a measure. Typical uses: assessing the stability of various personality traits. Testing sessions: 2. Source of error variance: administration. Statistical procedures: Pearson r or Spearman rho.
- Alternate forms. Purpose: to evaluate the relationship between different forms of a measure. Typical uses: when there is a need for different forms of a test (e.g., makeup tests). Testing sessions: 1 or 2. Sources of error variance: test construction or administration. Statistical procedures: Pearson r or Spearman rho.
- Internal consistency. Purpose: to evaluate the extent to which items on a scale relate to one another. Typical uses: evaluating the homogeneity of a measure (whether all items are tapping a single construct). Testing sessions: 1. Source of error variance: test construction. Statistical procedures: Pearson r between equivalent test halves with the Spearman-Brown correction, KR-20 for dichotomous items, coefficient alpha for multipoint items, or APD.
- Inter-scorer. Purpose: to evaluate the level of agreement between raters on a measure. Typical uses: interviews or coding of behavior; used when researchers need to show that there is consensus in the way different raters view a particular behavior pattern (and hence no observer bias). Testing sessions: 1. Sources of error variance: scoring and interpretation. Statistical procedures: Cohen's kappa, Pearson r, or Spearman rho.

Some Factors Affecting Reliability
1. Number of items: the more questions, the higher the reliability (see the sketch following this list).
2. Item difficulty: moderately difficult items lead to higher reliability (e.g., p-values of .40 to .60).
3. Homogeneity/similarity of item content (e.g., item-total score correlations): the more homogeneous the items, the higher the reliability.
4. Scale format/number of response options: the more options, the higher the reliability.
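As a sketch of factor 1 (test length), the more general Spearman-Brown formula mentioned earlier estimates how reliability changes when a test is lengthened or shortened by a factor k; the starting reliability and the factors below are illustrative assumptions.

```python
# General Spearman-Brown formula: r_new = (k * r) / (1 + (k - 1) * r),
# where k is the factor by which the number of items changes.
def spearman_brown(r_current: float, k: float) -> float:
    return (k * r_current) / (1 + (k - 1) * r_current)

r = 0.70                                 # assumed reliability of the current test
print(round(spearman_brown(r, 2.0), 2))  # doubling the length -> about 0.82
print(round(spearman_brown(r, 0.5), 2))  # halving the length  -> about 0.54
```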
Validity
Validity refers to the degree to which evidence and theory support the interpretations of test scores for the proposed use (AERA, APA, & NCME, 2014, p. 11) — how well a test measures what it purports to measure. Reliability is necessary but not sufficient for validity: a measure can be reliable without being valid, but it cannot be valid unless it is reliable.

Types of Validity
- Face validity
- Content validity
- Criterion-related validity: concurrent validity, predictive validity
- Construct validity: convergent validity, discriminant validity

Face Validity
Ascertains that the measure appears to be assessing the intended construct under study. Although this is not a very "scientific" type of validity, it may be an essential component in enlisting the motivation of test takers; it may be more a matter of public relations than psychometric soundness.

Content Validity
Sample item (for discussion): "Gavrilo Princip was (a) a poet, (b) a hero, (c) a terrorist, (d) a nationalist, (e) all of the above."

Content Validity in Different Assessment Settings
- Educational achievement tests: the proportion of material covered by the test should approximate the proportion of material covered in the course — summarized in a test blueprint.
- Employment tests: the test must be a representative sample of the job-related skills required for employment (e.g., via behavioral observation).
Test blueprint: the "structure" of the evaluation — a plan regarding the types of information to be covered by the items and the number of items tapping each area of coverage.

Criterion-Related Validity
A judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest — the measure of interest being the criterion. The criterion should be relevant and uncontaminated. Types of criterion-related validity: concurrent validity, an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently); and predictive validity, an index of the degree to which a test score predicts some criterion measure.

Types of statistical evidence:
1. Validity coefficient: a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure. Cronbach and Gleser (1965) cautioned against establishing rules for the minimum acceptable size of a validity coefficient, arguing that validity coefficients need only be large enough to enable the test user to make accurate decisions within the unique context in which the test is being used.
2. Incremental validity: the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use (see the sketch following this list).
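A hedged sketch of incremental validity as the increase in R² when a new predictor is added to one already in use; the data are simulated, and ordinary least squares stands in for whatever model a test user might actually employ.

```python
# Incremental validity as the gain in R^2 from adding a new predictor.
import numpy as np

rng = np.random.default_rng(1)
n = 200
existing = rng.normal(size=n)                   # predictor already in use
new_pred = 0.5 * existing + rng.normal(size=n)  # new test, partly overlapping
criterion = existing + 0.8 * new_pred + rng.normal(size=n)

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    X = np.column_stack([np.ones(len(y)), X])   # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_old = r_squared(existing.reshape(-1, 1), criterion)
r2_new = r_squared(np.column_stack([existing, new_pred]), criterion)
print(round(r2_new - r2_old, 3))                # incremental validity (delta R^2)
```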
Construct Validity
Used to ensure that the measure is actually measuring what it is intended to measure (i.e., the construct) and not other variables. Evidence of construct validity includes findings that:
- the test is homogeneous, measuring a single construct;
- test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation, as theoretically predicted;
- test scores obtained after some event, or after the mere passage of time (i.e., posttest scores), differ from pretest scores as theoretically predicted;
- test scores obtained by people from distinct groups vary as predicted by the theory;
- test scores correlate with scores on other tests in accordance with what would be predicted from a theory that covers the manifestation of the construct in question (convergent and discriminant evidence).
Traditionally, construct validity has been viewed as the unifying concept for all validity evidence (American Educational Research Association et al., 1999): all types of validity evidence, including evidence from the content- and criterion-related varieties, come under the umbrella of construct validity.

Convergent validity: convergence of evidence from a number of sources, such as other tests or measures designed to assess the same (or a similar) construct. It is supported if scores on the test undergoing construct validation tend to correlate highly, in the predicted direction, with scores on older, more established, already validated tests designed to measure the same (or a similar) construct.

Discriminant validity: supported if scores on the test undergoing construct validation show little relationship with scores on measures of constructs from which, according to theory, the test should differ — that is, low correlations where low correlations are predicted.

Factor Analysis
A term for a class of mathematical procedures designed to identify factors — specific variables that are typically attributes, characteristics, or dimensions on which people may differ. Factor analysis is frequently employed as a data reduction method in which several sets of scores and the correlations between them are analyzed. Types:
- Exploratory factor analysis (EFA): "estimating, or extracting factors; deciding how many factors to retain; and rotating factors to an interpretable orientation."
- Confirmatory factor analysis (CFA): evaluating the degree to which a hypothetical model (which includes factors) fits the actual data.
In studies using factor analysis for data reduction, the purpose may be to identify the factor or factors in common between test scores on subscales within a particular test, or the factors in common between scores on a series of tests. A factor loading conveys information about the extent to which the factor determines the test score or scores. A new test purporting to measure bulimia, for example, can be factor-analyzed with other known measures of bulimia, as well as with other kinds of measures (such as measures of intelligence, self-esteem, general anxiety, anorexia, or perfectionism). High factor loadings by the new test on a "bulimia factor" would provide convergent evidence of construct validity; moderate to low factor loadings by the new test with respect to measures of other eating disorders, such as anorexia, would provide discriminant evidence of construct validity.

Validity, Bias, and Fairness

Test Bias
A factor inherent in a test that systematically prevents accurate, impartial measurement. Bias implies systematic variation. Prevention during test development is the best cure for test bias, though a procedure called estimated true score transformations represents one of many available post hoc remedies (Mueller, 1949; see also Reynolds & Brown, 1984).

Rating error: a judgment resulting from the intentional or unintentional misuse of a rating scale.
- Leniency error (generosity error): an error in rating that arises from the rater's tendency to be lenient in scoring, marking, and/or grading.
- Severity error: the opposite of leniency error — an error in rating that arises from the rater's tendency to be too critical in scoring, marking, and/or grading.
- Central tendency error: the rater exhibits a general and systematic reluctance to give ratings at either the positive or the negative extreme.
- Ranking instead of rating can help overcome the preceding errors.
- Halo effect: the tendency to give a particular ratee a higher rating than he or she objectively deserves because of the rater's failure to discriminate among conceptually distinct and potentially independent aspects of the ratee's behavior.
- Horn effect: the tendency to give a particular ratee a lower rating than he or she objectively deserves.
Test Fairness
The extent to which a test is used in an impartial, just, and equitable way. Common issues: discrimination among groups of people; administration of a standardized test to a particular population when members of that population were not included in the standardization sample; and remedying situations where bias or unfair test usage has been found to occur.

Characteristics of a "Good Test": Other Considerations
Beyond psychometric soundness (reliability and validity), a good test contains adequate norms when its purpose is to compare the test taker's performance with other test takers' performance, and it has utility — practical value or usefulness.

Norms
Norms are test performance data of defined groups on particular tests, designed for use as a reference when interpreting individual test scores.
- Norm-referenced testing and assessment: comparing a test taker's score to a normative sample — a group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual test takers.
- (Domain- or content-) criterion-referenced testing and assessment: a method of evaluation and a way of deriving meaning from test scores by evaluating an individual's score with reference to a set standard.
- Fixed reference group scoring systems: the distribution of scores obtained by one group of test takers is used as the basis for calculating test scores on future administrations of the test (e.g., SAT scores).
Examples of criterion-referenced testing and assessment: to be eligible for a high-school diploma, students must demonstrate at least a sixth-grade reading level; to earn the privilege of driving an automobile, would-be drivers must take a road test and demonstrate their driving skill to the satisfaction of a state-appointed examiner; to be licensed as a psychologist, the applicant must achieve a score that meets or exceeds the score mandated by the state on the licensing test; and to conduct research using human subjects, many universities and other organizations require researchers to successfully complete an online course that presents test takers with ethics-oriented information in a series of modules, followed by a set of forced-choice questions.

Norming is the process of deriving norms by obtaining a distribution of scores for a particular group through administration of the test to a sample of people who belong to that group. Race norming is the controversial practice of norming on the basis of race or ethnic background. User norms (or program norms) consist of descriptive statistics based on a group of test takers in a given period rather than norms obtained by formal sampling methods. (Test) standardization is the process of administering a test to a representative sample of test takers for the purpose of establishing norms; sampling is an important process in standardization and in the collection of normative data.

Types of Norms
1. Percentiles: an expression of the percentage of people whose score on a test or measure falls below a particular raw score (see the sketch following this list). Developmental norms are norms developed on the basis of any trait, ability, skill, or other characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life.
2. Age norms: different normative groups for particular age groups; age-equivalent scores (e.g., IQ scores, growth charts).
3. Grade norms: developed by administering the test to representative samples of children over a range of consecutive grade levels (such as first through sixth grades).
4. National norms: derived from a normative sample that was nationally representative of the population at the time the norming study was conducted.
5. National anchor norms: an equivalency table for scores on different tests provides the tool for comparison; in the equipercentile method, the equivalency of scores on different tests is calculated with reference to corresponding percentile scores.
6. Subgroup norms: a normative sample can be segmented by any of the criteria initially used in selecting subjects for the sample.
7. Local norms: provide normative information with respect to the local population's performance on some test.
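A minimal sketch of a percentile-norm lookup, using one common convention (the percentage of the normative sample scoring strictly below the raw score; other conventions count half of tied scores). The normative scores are invented.

```python
# Percentile rank of a raw score relative to an invented normative sample.
import numpy as np

normative_sample = np.array([8, 11, 12, 13, 13, 15, 16, 17, 18, 18,
                             19, 20, 21, 22, 23, 24, 25, 27, 29, 31])

def percentile_rank(raw_score: float) -> float:
    return 100.0 * np.mean(normative_sample < raw_score)

print(percentile_rank(18))  # 40.0 -> outscored 40% of the norm group
print(percentile_rank(25))  # 80.0
```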
Culturally Informed Assessment
- Do: be aware of the cultural assumptions on which a test is based. Do not: take for granted that a test is based on assumptions that impact all groups in much the same way.
- Do: consider consulting with members of cultural communities regarding the appropriateness of assessment techniques, tests, or test items. Do not: take for granted that members of all cultural communities will automatically deem techniques, tests, or test items appropriate for use.
- Do: strive to incorporate assessment methods that complement the worldview and lifestyle of the test taker who comes from a specific cultural and linguistic population. Do not: take a "one-size-fits-all" view of assessment when it comes to the evaluation of persons from various cultural and linguistic populations.
- Do: be knowledgeable about the many alternative tests or measurement procedures that may be used to fulfill the assessment objectives. Do not: select tests or other tools of assessment with little or no regard for the extent to which such tools are appropriate for use with a particular test taker.
- Do: be aware of equivalence issues across cultures, including equivalence of the language used and the constructs measured. Do not: simply assume that a test that has been translated into another language is automatically equivalent in every way to the original.
- Do: score, interpret, and analyze assessment data in their cultural context, with due consideration of cultural hypotheses as possible explanations for findings. Do not: score, interpret, and analyze assessment data in a cultural vacuum.

Standardization
Standardization implies uniformity of procedure in test administration and scoring — controlled conditions so that scores obtained by different persons are comparable; in a test situation, a single independent variable is often the individual being tested. The process of standardization involves:
1. Formulation of directions
2. Giving of instructions
3. Establishment of norms
Formulation of Directions
The test developer provides detailed directions for administering the test: the exact materials to be employed, time limits, preliminary demonstrations, and ways of handling queries from test takers.

Giving of Instructions
When giving instructions or presenting problems orally, the rate of speaking, tone of voice, pauses, and facial expression all matter.

Establishment of Norms
Psychological tests have no predetermined standards of passing or failing; performance on each test is evaluated on the basis of empirical data. Usually an individual's test score is interpreted by comparing it with the scores obtained by others on the same test. Raw scores are meaningless until evaluated in terms of suitable interpretative data. A norm is the normal or average performance and may be expressed as the number of correct items, the time required to finish the task, or the number of errors.

Process of Standardization
The test is administered to a large, representative sample of the type of persons for whom it is designed (known as the standardization sample); this serves to establish the norms. Norms represent not only the average performance but also the relative frequency of varying degrees of deviation above and below that average, thus allowing different degrees of inferiority and superiority to be evaluated.

Utility
Utility is the usefulness or practical value of testing — whether it improves efficiency, is cost-effective, and helps us make better decisions. Factors that affect a test's utility: psychometric soundness, costs, and benefits. Psychometric soundness: a test serves its intended purpose when it is reliable and valid. Costs: disadvantages, losses, or expenses in both economic and noneconomic terms. Benefits: advantages, profits, or gains in both economic and noneconomic terms.

Utility Analysis
A family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment — evaluating whether the benefits of using a test (or training program or intervention) outweigh the costs. A utility analysis may address whether:
- one test is preferable to another test for a specific purpose;
- one tool of assessment (such as a test) is preferable to another tool of assessment (such as behavioral observation) for a specific purpose;
- the addition of one or more tests (or other tools of assessment) to one or more tests (or other tools of assessment) already in use is preferable for a specific purpose;
- no testing or assessment is preferable to any testing or assessment.
For training program or intervention evaluation, a utility analysis may address whether: one training program is preferable to another; one method of intervention is preferable to another; the addition or subtraction of elements to an existing training program or intervention improves it by making it more effective and efficient; no training program is preferable to a given training program; or no intervention is preferable to a given intervention.

Utility Analysis Methods: Expectancy Data
An expectancy table or chart shows the likelihood that individuals who score within a given range on the predictor will perform successfully on the criterion. Taylor-Russell tables provide an estimate of the extent to which inclusion of a particular test in the selection system will improve selection.
Naylor-Shine tables provide an estimate of the likely average increase in criterion performance as a result of using a particular test or intervention, as well as the selection ratio needed to achieve a particular increase in criterion performance. The Brogden-Cronbach-Gleser formula is used to calculate the dollar amount of a utility gain resulting from the use of a particular selection instrument under specified conditions; utility gain refers to an estimate of the benefit (monetary or otherwise) of using a particular test or selection method.

Decision Theory and Test Utility
Decision theory provides guidelines for setting optimal cutoff scores. In setting such scores, the relative seriousness of making false-positive or false-negative selection decisions is frequently taken into account. Methods for setting cut scores:
- The Angoff method: expert judges provide estimates regarding how test takers who have at least minimal competence for the position should answer the test items correctly, and these judgments are combined to set the cut score; low interrater agreement among the judges is a potential problem.
- The known groups method (method of contrasting groups): data are used to choose the score that best discriminates two contrasting groups from each other — the score at the point of least difference between the two groups.
- IRT-based (item response theory) methods: the item mapping method and the bookmark method.
- Method of predictive yield: considers the number of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores.
- Discriminant analysis: techniques typically used to shed light on the relationship between identified variables (such as scores on a battery of tests) and two (or in some cases more) naturally occurring groups (such as persons judged to be successful at a job and persons judged unsuccessful at a job).

Item mapping method: entails arranging items in a histogram, with each column containing items deemed to be of equivalent value. Judges who have been trained regarding the minimal competence required for licensure are presented with sample items from each column and asked whether a minimally competent licensed individual would answer those items correctly about half the time.

Bookmark method: begins with training of experts with regard to the minimal knowledge, skills, and/or abilities that test takers should possess in order to "pass." After this training, the experts are given a book of items, one item printed per page, arranged in ascending order of difficulty. Each expert places a "bookmark" between the two pages (items) deemed to separate test takers who have acquired the minimal knowledge, skills, and/or abilities from those who have not; the bookmark serves as the cut score. Additional rounds of bookmarking with the same or other judges may take place as necessary, feedback regarding placement may be provided, and discussion among the experts about the bookmarkings may be allowed. In the end, the level of difficulty to use as the cut score is decided upon by the test developers.
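To close, a rough numerical illustration of the Brogden-Cronbach-Gleser utility gain described above, in one common textbook form of the formula; every figure is invented solely to show the arithmetic.

```python
# One common textbook form of the Brogden-Cronbach-Gleser utility estimate:
#   gain = (N selected) x (tenure in years) x (validity r)
#          x (SD of job performance in dollars)
#          x (mean standard score of those selected) - (total cost of testing).
# All values are hypothetical.
n_selected = 10           # people hired using the test
tenure_years = 2.0        # expected time on the job
validity_r = 0.40         # criterion-related validity of the selection test
sd_performance = 15_000   # dollar value of one SD of job performance
mean_z_selected = 1.0     # average standardized test score of those hired
testing_cost = 200 * 50   # 200 applicants tested at $50 each

utility_gain = (n_selected * tenure_years * validity_r
                * sd_performance * mean_z_selected) - testing_cost
print(utility_gain)       # 120000.0 - 10000 = 110000.0 estimated dollar gain
```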
