Module 4 - Psychometric Properties of a Test
Kolehiyo ng Lungsod ng Dasmariñas
Summary
This document is a chapter from a psychology module, describing the psychometric properties of a test, covering topics such as validity and reliability. It explores the different types of validity, including content validity, criterion-related validity, and construct validity, emphasizing the importance of test validity in educational measurement and evaluation.
Psychometric Properties of a Test

Psychometrics is a field that deals with the theory and techniques of psychological measurement. It is the science of psychological assessment and a foundation of assessment and measurement (Rust, 2007). According to Borsboom and Molenaar (2015), psychometrics is a scientific discipline concerned with the question of how psychological constructs (intelligence, neuroticism, or depression) can be optimally related to observables (outcomes of psychological tests, genetic profiles, neuroscientific information). Generally, it refers to a field of study within psychology and education devoted to testing, measurement, assessment, and related activities. The field is concerned with the objective measurement of skills and knowledge, abilities, attitudes, and personality traits, as well as educational achievement. Psychometrics is a highly interdisciplinary field, with connections to statistics, data theory, econometrics, biometrics, measurement theory, and mathematical psychology.

The intrinsic components of a test are called its psychometric properties. These properties are the characteristics of a test that identify and define critical aspects of an instrument, such as its reliability for use in a specific circumstance. In a nutshell, psychometric properties reveal information about a test's adequacy, relevance, and usefulness. The psychometric properties of a test are derived from the data the assessment generates and show how well it measures the construct of interest. The development of a valid test is conditional on its having been subjected to statistical analysis that ascertains it has adequate psychometric properties.

Tests are vital tools in educational measurement and evaluation. They are used to gather the valuable data on which educational decisions are based. Such decisions should be made carefully, based on authentic and accurate data; erroneous data will certainly lead to wrong decisions. To obtain authentic and accurate data, the test employed for that purpose should possess some essential characteristics (psychometric properties) (Ezeh, 2003). Two of the most widely discussed psychometric properties of tests are validity and reliability, which are broad categories of related concepts.

Test Validity

Validity is arguably the most important quality of a test because it has to do with the fundamental measurement issue of what our measuring instruments are really measuring (Bandalos, 2018). The validity of a test is the degree of accuracy with which the test measures what it is intended to measure (Iwuji, 1997). Messick (1995) described validity as "an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores or other modes of assessment". Allen and Yen (1979) referred to validity as the appropriateness, meaningfulness, and application of test scores, as well as the usefulness of judgments made from test scores. Validity gives meaning to test scores and gives authority to the link between how an individual performs on the test and the test's stated measurement criteria. It tells us the degree to which it is possible to draw specific conclusions or predictions based on an individual's test score; in other words, it provides information on the efficacy of the test.
A test is a measuring instrument designed for the measurement of educational or psychological attributes. Most of these attributes are constructs whose existence we cannot see but rather infer from the manifested behaviours associated with them. The validity of a test, therefore, is the degree of accuracy with which the test measures the specific attribute it was constructed to measure.

Types of Validity

There are three major types of validity: content, criterion-related, and construct validity. There is also another type, usually referred to as "face validity", which in a strict or technical sense is not considered a type of validity (Ezeh, 2003).

1. Content Validity: Content validity is most often evaluated during the development of an instrument rather than after the instrument has been created. The content validity of a test refers to the extent to which the test measures both the subject matter content and the instructional objectives designed for a given course. It is the extent to which the questions on the instrument, and the scores from those questions, represent all possible questions that could be asked about the content or skill (Creswell, 2005). According to DeVellis (2003), content validity refers to the degree to which the instrument comprehensively assesses the underlying construct of interest. It is the most appropriate form of validity for achievement tests. According to Nwana (1982), to ensure the sound construction of a valid achievement test, the following principles are recommended: questions should be set from all parts of the syllabus, and the number of questions set in each section of the syllabus must reflect the relative importance of that section.

Content validity usually depends on the judgment of experts in the field. Unclear and obscure questions can be amended, and ineffective and non-functioning questions discarded, on the advice of the reviewers (Mohajan, 2017). Expert judges may be asked to rate the "fit" of each item on the measuring instrument to its respective domain using a Likert-type rating scale (Briere, 2011). Examining the variance in the judges' fit ratings then allows the instrument developer to assess the degree of agreement on item fits across all domains and judges. The higher the degree of agreement, the higher the content validity. To ensure high content validity, a table is made to map out the content or learning materials the students have been taught, as well as the levels of behavioural objectives at which they have learnt them. This provides a basis for the construction of test items. This table is known as the table of specification.
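As a rough illustration of the judge-rating procedure just described, the sketch below computes the mean and variance of hypothetical fit ratings for each item; low variance indicates high agreement among judges. The judge names, the ratings, and the retain/amend cut-offs are illustrative assumptions, not part of the module.

```python
# A minimal sketch of the expert-judge fit-rating check described above.
# Judges, ratings, and cut-offs are hypothetical; a 1-5 Likert-type "fit"
# scale is assumed (5 = item fits its content domain very well).
from statistics import mean, pvariance

# Rows are judges, columns are items (fit ratings on a 1-5 scale).
ratings = {
    "judge_1": [5, 4, 2, 5],
    "judge_2": [5, 5, 1, 4],
    "judge_3": [4, 5, 2, 5],
}

n_items = len(next(iter(ratings.values())))
for item in range(n_items):
    item_ratings = [judge[item] for judge in ratings.values()]
    avg = mean(item_ratings)
    var = pvariance(item_ratings)  # low variance = high judge agreement
    # The cut-offs below are illustrative only.
    flag = "review/amend" if avg < 3 or var > 1.5 else "retain"
    print(f"Item {item + 1}: mean fit = {avg:.2f}, variance = {var:.2f} -> {flag}")
```

In practice the developer would compare such agreement figures across all domains and judges, amending or discarding low-agreement items as the module describes.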
2. Criterion-Related Validity: Another name for criterion-related validity is empirical validity. The criterion-related validity of a test can be defined as the degree of correlation between the scores obtained by a group of testees on the test and their scores on another test known to be a standard measure of the characteristic the test under study claims to measure (Nkwocha, 2019). The correlation coefficient estimated is called the validity coefficient. The closer the validity coefficient is to one (1), the higher the level of validity. Gronlund, in Nkwocha (2019), states that "criterion-related validity is demonstrated by comparing the test scores with one or more external variables considered to provide a direct measure of the characteristics or behaviour in question." There are two types of criterion-related validity: concurrent and predictive.

2.1 Concurrent criterion-related validity: A test is said to have concurrent criterion-related validity when there is a correlation between the scores obtained by a group of testees on the test and the scores they obtained on a criterion test taken within the same period (Nkwocha, 2019). It is the degree to which the scores on a test are related to the scores on another test, already established as valid and designed to measure the same construct, administered at the same time, or to some other valid criterion available at the same time (Mohajan, 2017). It is established by correlating one question with another that has previously been validated through standard setting (Okoro, 2002). The two tests to be correlated are taken concurrently; they can be described as equivalent tests. The following steps are the procedure for estimating the concurrent criterion-related validity of a test:

A. Identify a criterion test, that is, a test which has been recognized as a standard measure of the characteristic the new test is designed to measure.
B. Give the test under study and the criterion test to the same testees within a short interval.
C. Compute the correlation coefficient of the scores obtained on the two tests using an appropriate correlation procedure.

If the calculated validity coefficient (correlation coefficient) is high and positive, the test is said to have high concurrent criterion validity.

2.2 Predictive criterion-related validity: Predictive validity is often used in programme evaluation studies and is very suitable for applied research (Mohajan, 2017). It concerns a test constructed and developed for the purpose of predicting some form of behaviour (Allen & Yen, in Mohajan, 2017). It indicates the ability of the measuring instrument to differentiate among individuals with reference to a future criterion. This implies that it indicates the degree of correspondence between scores on the test in question and future outcomes that are expected to be related to the characteristic measured by the test (Tuckman, in Okoye, 2015). A test is said to have predictive criterion-related validity if individuals' present scores on the test can be used to predict their future performance on another test which measures a similar criterion. Examples of tests expected to show predictive validity are entrance examinations: they are expected to predict the candidates' grades in the subjects at graduation. If the entrance scores in a subject correlate with the graduation scores in that subject, the entrance examination is said to have predictive validity. Predictive validity is most applicable to aptitude or intelligence tests, whose scores are intended to predict what the taker will be able to do in the future. It should be noted that there is a long interval between the time of obtaining scores from the test (the predictor) and the time of obtaining scores from the criterion. The higher the correlation between the predictor and the criterion, the greater the predictive validity. If the correlation is perfect, that is 1, the prediction is also perfect. Most such correlations are only modest, somewhere between 0.3 and 0.6 (Mohajan, 2017).
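The correlation step in both procedures can be carried out with Pearson's product-moment formula. The sketch below is a minimal, self-contained version; the score lists are hypothetical, and in practice any statistics package would be used instead.

```python
# A minimal sketch of estimating a criterion-related validity coefficient:
# Pearson's product-moment correlation between scores on the test under
# study (the predictor) and scores on a criterion test. All data are
# hypothetical.
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

new_test  = [55, 62, 70, 48, 85, 66, 74, 59]   # scores on the test under study
criterion = [58, 65, 72, 50, 80, 70, 78, 55]   # scores on the criterion test

r = pearson_r(new_test, criterion)
print(f"validity coefficient r = {r:.2f}")  # closer to 1 = higher validity
```

The same computation serves both cases: for concurrent validity the two score sets are collected within the same period, while for predictive validity the criterion scores (for example, graduation grades) are collected after the long interval noted above.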
3. Construct Validity (Logical Validity): This refers to the extent to which the test measures the psychological construct or trait it is supposed to measure. Constructs are terms which have no direct representations in the empirical world but which are used to describe and explain certain aspects of our behaviour. Examples of psychological constructs or traits include intelligence, speed of reading, honesty, anxiety, stability, sociability, and verbal fluency. For each of these constructs, a theory may be available that aims to explain what it is and how people who possess it are likely to behave. If a test is designed to measure any such construct, we expect the scores people obtain on it to agree with the theoretical prediction. Construct validity therefore deals with the degree to which scores obtained from a test agree with the theory underlying the construct. It involves testing a scale in terms of theoretically derived hypotheses concerning the nature of the underlying variables or constructs (Pallant, 2011). Evidence of construct validity can be ascertained by obtaining convergent validity and discriminant validity. An instrument is said to have convergent validity if scores obtained from it are highly and positively related to scores from an instrument that measures a similar construct. On the other hand, an instrument has discriminant validity if scores obtained from it have little or no relationship with scores from an instrument that measures a dissimilar construct. Evidence of convergent and discriminant validity can be obtained using factor analysis.
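The module points to factor analysis as the formal route; as a simpler first pass, the convergent/discriminant pattern can be illustrated with plain correlations, as in the sketch below. All scale names and scores are hypothetical assumptions, and statistics.correlation requires Python 3.10 or later.

```python
# A minimal correlational sketch of the convergent/discriminant check
# described above (a stand-in for full factor analysis). All data are
# hypothetical; statistics.correlation computes Pearson's r.
from statistics import correlation

new_anxiety_scale   = [12, 30, 25, 8, 19, 27, 15, 22]
other_anxiety_scale = [14, 28, 24, 10, 18, 29, 13, 21]  # similar construct
sociability_scale   = [18, 21, 16, 20, 24, 19, 22, 17]  # dissimilar construct

r_convergent   = correlation(new_anxiety_scale, other_anxiety_scale)
r_discriminant = correlation(new_anxiety_scale, sociability_scale)

print(f"convergent r   = {r_convergent:.2f}")    # expected: high and positive
print(f"discriminant r = {r_discriminant:.2f}")  # expected: little or none
```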
4. Face Validity: According to Okoye (2015), face validity refers to whether the test looks valid on its surface. It can be described as the extent to which the test looks like an instrument measuring the variable it is designed to examine (Nkwocha, 2019). It is the simplest and least precise method of determining validity, relying entirely on the expertise and familiarity of the assessor concerning the subject matter (Nwana, 2007). Face validity is not computed; it is usually checked through surface observation of the items, and it is usually used to describe the appearance of validity in the absence of empirical testing (Cook & Beckman, 2006). This type of validity is needed for the acceptance of the test by evaluators and for sustaining the motivation of the test users. A mathematics test, for instance, should contain mathematical signs and symbols that make it look like a mathematics test. If it is composed of a comprehension passage, like an English comprehension test, testees who prepared for a mathematics test may go away with the notion that it is a test for candidates who want to be examined in English comprehension. Before a test is considered for other types of validity, it must first have face validity.

Importance of Test Validity

Test validity is relevant to education in the following ways:
1. Test validity helps in the achievement of test purposes.
2. Test validity guarantees the collection of authentic data. It ensures that tests designed to measure specific attributes actually measure them.
3. Validity of a test enhances accurate evaluation of the capabilities and potentials of individuals based on test results. Placement, promotion, and certification of individuals are adequately done when tests measure exactly what they are designed to measure.

Factors Affecting the Validity of Tests

1. Poor sentence structure in tests and the use of inappropriate vocabulary, which affect the testee's comprehension of the task required in the test.
2. Poor construction of items.
3. Ambiguous and misleading statements.
4. Improper arrangement of the test items.
5. Unclear instructions.

Most tests are specific in terms of the behaviour being measured. The inclusion of items from behaviours other than the one intended for the test reduces its validity.

Reliability

A reliable test is one we can trust to measure a person's performance in approximately the same way each time. Reliability is used to evaluate the stability of tests administered at different times to the same individuals and the equivalence of sets of items from the same test (Kimberlin & Winterstein, 2008). According to Iwuji (1997), the reliability of any measuring instrument is the degree of consistency of the results of its repeated measures of the attribute it is measuring. Nkwocha (2019) saw reliability as the degree to which test items consistently measure whatever phenomenon they measure. Anastasi, in Nkwocha (2019), opined that reliability refers to the consistency of scores obtained by the same person when re-examined with the same test on different occasions. It measures the consistency, precision, repeatability, and trustworthiness of a test (Chakrabartty, 2013). It is the degree to which an assessment tool produces stable (error-free) and consistent results (Mohajan, 2017), indicating that the observed score of a measure reflects the true score of that measure. The term reliability in the context of measurement theory refers to the dependability or consistency of measurements across conditions (Bandalos, 2018).

Reliability thus relates to the degree of consistency or stability a test exhibits; it is the attribute of a test responsible for making its results reproducible (Kulwinder, 2012). In other words, it is concerned with how reproducible test results are when the measurement is repeated on different occasions. Suppose Adaeze and Chizzy each obtained a score of 60% on a given test. Three days later, the same test was re-administered to the same class, and their scores were 75% and 30% respectively. As the scores of these two testees on the same test are inconsistent over time, the test is said to be unreliable. Reliability therefore connotes the accuracy, trustworthiness, or dependability of a test (Ezeh, 2003).

Coefficient of Reliability

The degree of consistency of a test is expressed as a coefficient called the coefficient of reliability. The reliability coefficient takes a value between 0 and 1, with perfect reliability equaling 1 and no reliability equaling 0. In most cases it is determined by correlating two sets of scores independently obtained from the test. The following guidelines can be used for interpreting the values of reliability coefficients, according to Yolonda, as cited in Ohiri and Okoye (2023):

0.9 and greater = excellent reliability
0.8 – 0.9 = good reliability
0.7 – 0.8 = acceptable reliability
0.6 – 0.7 = questionable reliability
0.5 – 0.6 = poor reliability
0.5 and less = unacceptable reliability
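A small helper that applies the interpretation bands quoted above could look like the sketch below. How exact boundary values (for example, precisely 0.8) are assigned to a band is an assumption, since the quoted ranges overlap at their edges.

```python
# A small helper that applies the interpretation bands quoted above
# (Yolonda, as cited in Ohiri and Okoye, 2023) to a reliability coefficient.
# Assignment of exact boundary values to a band is assumed.

def interpret_reliability(r: float) -> str:
    """Map a reliability coefficient (0-1) to the guideline label."""
    if r >= 0.9:
        return "excellent reliability"
    if r >= 0.8:
        return "good reliability"
    if r >= 0.7:
        return "acceptable reliability"
    if r >= 0.6:
        return "questionable reliability"
    if r >= 0.5:
        return "poor reliability"
    return "unacceptable reliability"

for coefficient in (0.95, 0.82, 0.74, 0.65, 0.55, 0.41):
    print(f"r = {coefficient:.2f}: {interpret_reliability(coefficient)}")
```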
Ways of Establishing Test Reliability

The reliability of a test is usually determined by computing a reliability coefficient. This coefficient can be obtained using various methods, giving rise to different types of reliability. The measures commonly obtained in the bid to ascertain reliability are: measures of stability, often referred to as the test-retest method; measures of equivalence, usually referred to as the equivalent-form or parallel-form method; and measures of internal consistency, which include the split-half, Kuder-Richardson, and Cronbach's coefficient alpha methods.

I. Test-Retest Method: This involves estimating the correlation coefficient of two sets of scores obtained from the same group of people on the same test, given on two different occasions under the same conditions after a short time interval. The scores the testees obtained on the two occasions are correlated using an appropriate correlation method. The closer the correlation coefficient is to one (1), the more reliable the test. If the reliability coefficient is high, for example r = 0.98, we can suggest that the instrument is relatively free of measurement error. Coefficients above 0.7 are considered acceptable, and coefficients above 0.8 are considered very good.

Limitations of the Test-Retest Method
A. The conditions under which the test is taken on the two occasions may differ, particularly if invigilated by different people.
B. The testee's state, such as emotional state, may differ on the two occasions.
C. Some testees may not be available for the second test.
D. More learning may take place before the second testing.
E. Forgetting may also influence the responses made during the second testing.
F. It is expensive in terms of time and money.

II. Equivalent-Form Method: This is a measure of reliability obtained by administering two different versions of an assessment tool to the same group of individuals. The scores from the two versions can then be correlated to evaluate the consistency of results across the alternate versions. If they are highly correlated, the result is known as equivalent-form reliability.

Limitations of the Equivalent-Form Method
A. It is fatiguing and time-consuming for testees, who are required to write two tests at almost the same time.
B. It is costly to produce two different tests in the same period.
C. An equivalent test may not be readily available.

III. Split-Half Method: In this case, the test is administered to the same group of testees once. The common procedure is to divide the test into two halves, with odd-numbered items usually placed in one group and even-numbered items in the other. Each testee gets two scores, one from each half of the test. The scores on one half of the test are correlated with the scores on the second half. The computation is done with Pearson's product-moment correlation method. Since the score of each testee has been divided into two, the resulting correlation index is the split-half reliability coefficient. To calculate the reliability index for the full test, the Spearman-Brown formula is applied to the split-half coefficient. The Spearman-Brown formula is given as:

$$r = \frac{n\, r_s}{1 + (n - 1)\, r_s}$$

where
r = the reliability of the full test,
n = the number of times the test is shortened or lengthened (n = 2 for the split-half case),
r_s = the reliability of the shortened or lengthened test (here, one half).

The split-half method is used when the items of the test are homogeneous, that is, when the items measure one construct.

Limitations of the Split-Half Method
A. The two halves may not be properly split; hence, the correlation coefficient produced may be unreliable and misleading.
B. It is not appropriate for tests which measure different constructs.
C. Chance errors in the test can inflate the correlation (r) between the two parts.
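A minimal sketch of the split-half procedure with the Spearman-Brown step-up is given below. The item scores are hypothetical dichotomous (1/0) data, and statistics.correlation (Python 3.10+) is used for the Pearson computation.

```python
# A minimal sketch of the split-half procedure described above: split one
# administration into odd- and even-numbered items, correlate the two
# half-scores, then step the result up to full-test length with the
# Spearman-Brown formula (n = 2). Item scores are hypothetical
# (1 = correct, 0 = incorrect); rows are testees, columns are items.
from statistics import correlation  # Pearson's r; Python 3.10+

item_scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 1],
]

# Each testee gets two scores: one from the odd items, one from the even items.
odd_half  = [sum(row[0::2]) for row in item_scores]
even_half = [sum(row[1::2]) for row in item_scores]

r_half = correlation(odd_half, even_half)       # split-half coefficient
n = 2                                           # full test is twice as long
r_full = (n * r_half) / (1 + (n - 1) * r_half)  # Spearman-Brown step-up

print(f"split-half r = {r_half:.2f}, full-test r = {r_full:.2f}")
```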
IV. Kuder-Richardson (K-R) Method: Kuder-Richardson reliability is a method that makes use of the full test. It is used for dichotomously scored (e.g., right/wrong) items. The Kuder-Richardson approach avoids the problem of how to split the items, and it has two procedures: KR-20 and KR-21. KR-20 is best used for a test that does not have many items, because the formula requires computing the proportion of testees who passed each item and the proportion who failed it. KR-21 does not require such rigour and hence is used when the test items are many. In the computation of a K-R reliability coefficient, a single test is administered to a group of testees. It estimates the consistency of responses to all the items in the test. The KR-20 formula is:

$$r_{KR20} = \frac{K}{K - 1}\left(1 - \frac{\sum pq}{SD_t^2}\right)$$

where
K = the number of items in the whole test,
p = the proportion of testees who passed each item,
q = the proportion of testees who failed each item,
SD_t = the standard deviation of testees' scores on the whole test (so SD_t² is the variance of the total scores).

The KR-21 formula is:

$$r_{KR21} = \frac{K}{K - 1}\left(1 - \frac{\bar{X}(K - \bar{X})}{K \cdot SD_t^2}\right)$$

where K and SD_t are defined as in KR-20, and X̄ = the mean of the summated scores.

The Kuder-Richardson method provides an estimate of the average reliability that would be found by taking all possible splits, without actually having to do so.

Limitations of the Kuder-Richardson (K-R) Method
A. Its application is not recommended for scales that offer multiple-choice formats.
B. The reliability coefficient obtained by this method is somewhat lower than that obtained by other methods.
C. The method rests on the assumption that all items have the same difficulty level, which is not possible in practice.
D. The K-R method is not applicable to speeded tests, because the parts of such a test are not independent.

V. Cronbach's Coefficient Alpha: This is a lower-bound estimate of reliability under the assumption that item errors are uncorrelated. It can be used for any mixture of binary (true/false) and partial-credit (true/sometimes/false) items. It is computed by correlating the score for each scale item with the total score for each observation (usually individual test takers), and then comparing that to the variance of all individual item scores. The formula for Cronbach's coefficient alpha, according to Nkwocha (2019), is:

$$\alpha = \frac{K}{K - 1}\left(1 - \frac{\sum V_i}{V_t}\right)$$

where
α = coefficient alpha (the reliability index),
K = the number of items that compose the test,
V_t = the variance of the respondents' total scores on the test,
V_i = the variance of the scores obtained by all respondents on item i,
ΣV_i = the sum of the item variances.

Limitations of the Cronbach's Coefficient Alpha Method
A. It does not account for measurement error, which can lead to an overestimation of reliability.
B. It assumes that all items in a test measure the same attribute, which may not always be true.
C. Cronbach's alpha coefficient is highly influenced by the number of items in the measurement instrument.
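To make the two internal-consistency formulas concrete, the sketch below computes KR-20 and coefficient alpha on the same hypothetical dichotomous data; for such right/wrong items the two estimates coincide, while for partial-credit items only alpha applies.

```python
# A minimal sketch of KR-20 and Cronbach's alpha, following the formulas
# above. Data are hypothetical dichotomous item scores (1 = pass, 0 = fail);
# rows are testees, columns are items. Population variance is used throughout.
from statistics import pvariance

item_scores = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

K = len(item_scores[0])                      # number of items
totals = [sum(row) for row in item_scores]   # each testee's total score
var_total = pvariance(totals)                # SD_t^2: variance of total scores

# KR-20: p = proportion passing each item, q = 1 - p.
pq_sum = 0.0
for i in range(K):
    p = sum(row[i] for row in item_scores) / len(item_scores)
    pq_sum += p * (1 - p)
kr20 = (K / (K - 1)) * (1 - pq_sum / var_total)

# Cronbach's alpha: sum of item variances over total-score variance.
# For dichotomous items, alpha and KR-20 coincide.
item_var_sum = sum(pvariance([row[i] for row in item_scores]) for i in range(K))
alpha = (K / (K - 1)) * (1 - item_var_sum / var_total)

print(f"KR-20 = {kr20:.2f}, alpha = {alpha:.2f}")
```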
Importance of Reliability

Some values of test reliability are:
1. Test reliability makes a test dependable for future planning.
2. Reliable tests can be used to establish criterion-related validity and equivalent-form reliability for new tests that measure the same traits.
3. Promotion, placement, and certification are all based on test reliability.

Factors that Affect Test Reliability

1. Use of different administration procedures: the results obtained when the test is conducted in a well-arranged, spacious, and conducive environment will differ from results obtained in a crowded examination hall that gives room for examination malpractice.
2. Momentary fluctuations, such as level of anxiety, motivation, fatigue, and lack of concentration on the part of either the testee or the scorer, can reduce the degree of test reliability.
3. Ambiguity: an ambiguous question can be interpreted by testees in different ways on different occasions, so the responses they give on different occasions may differ.
4. Learning and forgetting: when the test-retest method is involved, test scores may be affected by the fact that the testee may acquire new knowledge, or forget some of what he earlier knew, after the initial testing.

Similarities Between Test Validity and Reliability

1. Validity and reliability are indispensable qualities a good test must possess.
2. Both are concerned with measuring an attribute of whatever or whoever is being measured, not the person or object itself.
3. Ambiguity of test items reduces both test validity and reliability.
4. Chance factors, such as involvement in examination malpractice, can affect both validity and reliability.
5. A valid test also tends to be reliable.

Conclusion

Psychometrically sound tests are indispensable tools in the measurement of educational or psychological attributes. As a result, ascertaining whether or not a test produces valid and reliable scores should be of great concern to test developers and users. Evidence of validity and reliability is not only crucial when developing new testing instruments but is also imperative when choosing instruments to be used in educational and psychological settings. Irrespective of the circumstances, maximum accuracy in measurement is always the goal. Thus, choosing the most valid and reliable measure of the construct of interest helps to ensure maximum accuracy.

Do you have any queries about the module? Please email Inst. Jom Caballero at [email protected].