Assessing Psychometric Quality of a Test - Lesson 10-11 PDF

Summary

This lesson discusses how to assess the psychometric quality of a test, covering item difficulty, item discrimination, the standard error of measurement, item-criterion correlations, item characteristic curves, and item bias. It explains how to interpret item analysis results and how to detect and address bias in testing.

Full Transcript

Assessing the Psychometric Quality of a Test (PSYASSMENT1)

Item Difficulty

Item difficulty is a psychometric property that measures how easy or difficult an item is for respondents to answer correctly. Examining item difficulty is important because it can help identify items that are too easy or too difficult; such items limit the variability of responses and make it harder to discriminate between participants with different levels of the construct being measured. The proportion of correct responses to each item is calculated and reported as the item difficulty value. This calculation can be done manually in spreadsheet software or programmatically in statistical software such as R or SPSS; R offers many packages and functions for calculating item difficulties.

Classical Test Theory

To calculate classical item difficulty with dichotomous items, you simply count the number of examinees who responded correctly (or in the keyed direction) and divide by the number of respondents. This yields a proportion, which is like a percentage but on a scale of 0 to 1 rather than 0 to 100, so the possible range you will see reported is 0 to 1. This approach to calculating difficulty is sample-dependent: with a different sample of people, the statistics could be quite different. This is one of the primary drawbacks of classical test theory. Item response theory tackles that issue with a different paradigm, and its index also points in the intuitive "direction," since in IRT high values mean high difficulty. If you are working with multiple-choice items, remember that while you might have four or five response options, you are still scoring the items as right/wrong, so the data end up dichotomous (0/1).

Interpreting Item Difficulty

Interpreting item difficulty results is straightforward. Items with higher difficulty values were easier for participants to answer correctly, while items with lower difficulty values were more difficult. For example, if the output shows that item 40 had the highest difficulty value of 0.952, then 95.2% of participants answered that item correctly; if item 30 had the lowest difficulty value of 0.044, then only 4.4% of participants answered it correctly.

Each construct should be evaluated in its own context when interpreting item difficulties. Still, for achievement tests, a generic classification might be: "easy" if the index is 0.85 or above, "moderate" if it is between 0.41 and 0.84, and "hard" if it is 0.40 or below. Item difficulty is also not the only factor to consider when evaluating the quality of a measure. Items that are too easy or too difficult may still be valid and reliable, depending on the construct being measured and the purpose of the measure. Even so, examining item difficulty provides valuable insight into the psychometric properties of the measure and informs decisions about item selection and revision.
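As a concrete illustration, here is a minimal sketch in base R. The responses matrix is hypothetical example data (rows are examinees, columns are dichotomously scored items), and the cut-offs are the generic achievement-test bands described above.

    # Hypothetical example data: 200 examinees x 10 dichotomous (0/1) items.
    set.seed(1)
    responses <- matrix(rbinom(200 * 10, size = 1, prob = 0.6), nrow = 200,
                        dimnames = list(NULL, paste0("item", 1:10)))

    # Classical item difficulty: the proportion answering each item correctly.
    difficulty <- colMeans(responses)

    # Generic achievement-test bands: hard <= .40, moderate .41-.84, easy >= .85.
    band <- cut(difficulty, breaks = c(-Inf, 0.40, 0.84, Inf),
                labels = c("hard", "moderate", "easy"))

    round(difficulty, 3)
    table(band)

Because difficulty is just the column mean of 0/1 scores, the same computation can be done in a spreadsheet by averaging each item's column.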
Item Discrimination

Item discrimination measures the extent to which each item differentiates between participants who have high or low levels of the construct being measured; in other words, it indicates how well an item distinguishes between participants with different levels of the construct. Just like item difficulties, item discriminations should be interpreted in the context of the construct being measured. For achievement tests, a generic classification might be: "good" if the index is above 0.30, "fair" if it is between 0.10 and 0.30, and "poor" if it is below 0.10. Three common approaches to calculating discrimination are described below.

Correlation between item and total score, with the item. This approach calculates the point-biserial correlation coefficient (r_pb) between each item and the total score of the measure, where the total score is the sum of all item scores. The r_pb ranges from -1 to 1, with values closer to 1 indicating higher discrimination.

Correlation between item and total score, without the item. This approach is very similar to the first; the only difference is that when we correlate an item with the total score, we exclude that item from the total. It yields slightly lower index values than the first approach, which is why test developers usually prefer it (it stays in the safe zone).

Upper-lower groups index. This approach relates most directly to the everyday meaning of "discrimination." We divide the whole group into sub-groups (usually three) according to their total scores, then calculate the discrimination index for an item by comparing those groups' responses to it. This definition feels the most like a true discrimination index.
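The following sketch, again using the hypothetical responses matrix from the item-difficulty example, implements all three approaches in base R; splitting at the outer thirds is one common convention for the upper-lower index.

    total <- rowSums(responses)

    # (1) Point-biserial correlation of each item with the total score.
    r_it <- apply(responses, 2, cor, y = total)

    # (2) Corrected item-total: remove the item from the total before correlating.
    r_it_corrected <- sapply(seq_len(ncol(responses)), function(j)
      cor(responses[, j], total - responses[, j]))

    # (3) Upper-lower groups index: proportion correct in the top third of
    # examinees minus proportion correct in the bottom third.
    cuts    <- quantile(total, c(1/3, 2/3))
    d_index <- colMeans(responses[total >= cuts[2], , drop = FALSE]) -
               colMeans(responses[total <= cuts[1], , drop = FALSE])

Note how the corrected values in r_it_corrected come out slightly lower than those in r_it, exactly as described above.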
Standard Error of Measurement

The standard error of measurement is directly related to the reliability of the test. It is an index of the amount of variability in an individual student's performance due to random measurement error. If it were possible to administer an infinite number of parallel tests, a student's score would be expected to change from one administration to the next due to a number of factors. For each student, the scores would form a "normal" (bell-shaped) distribution. The mean of the distribution is assumed to be the student's "true score," reflecting what he or she "really" knows about the subject.

The standard deviation of that distribution is called the standard error of measurement, and it reflects the amount of change in the student's score that could be expected from one test administration to another. Whereas the reliability of a test always varies between 0.00 and 1.00, the standard error of measurement is expressed in the same scale as the test scores. (In classical test theory this relationship is SEM = SD x sqrt(1 - r), where SD is the standard deviation of the test scores and r is the reliability coefficient.) For example, multiplying all test scores by a constant will multiply the standard error of measurement by that same constant but will leave the reliability coefficient unchanged.

A general rule of thumb for predicting the amount of change that can be expected in an individual's test score is to multiply the standard error of measurement by 1.5. Only rarely would one expect a student's score to increase or decrease by more than that amount between two such similar tests. The smaller the standard error of measurement, the more accurate the measurement provided by the test.

Item Analysis: Basic Concepts in Summary

Item analysis is a technique that evaluates the effectiveness of items in tests. Its two principal measures are item difficulty and item discrimination. Item difficulty is the proportion of the sample taking the test that answers a question correctly; it takes a value between 0 and 1, with high values indicating an easy item and low values a difficult one. Item discrimination is a measure of how well an item distinguishes between those with more skill (based on whatever the test measures) and those with less skill.

Interpreting Item Analysis Results

Item analysis data are not synonymous with item validity. An external criterion is required to judge the validity of test items accurately; by using the internal criterion of total test score, item analyses reflect the internal consistency of items rather than their validity.

The discrimination index is not always a measure of item quality. An item may have low discriminating power for a variety of reasons: (a) extremely difficult or easy items will have little ability to discriminate, but such items are often needed to adequately sample course content and objectives; (b) an item may show low discrimination if the test measures many different content areas and cognitive skills. For example, if the majority of the test measures "knowledge of facts," then an item assessing "ability to apply principles" may have a low correlation with total test score, yet both types of items are needed to measure attainment of course objectives.

Item analysis data are also tentative. Such data are influenced by the type and number of students being tested, the instructional procedures employed, and chance errors. If repeated use of items is possible, statistics should be recorded for each administration of each item.

Item-criterion Correlations

An item-criterion correlation is the correlation between a test item and a criterion measure, which could be the total test score, a subtest score, or an external measure such as job performance or academic achievement. The main goal is to determine how well each item on a test predicts the criterion: high item-criterion correlations indicate that the item is a good predictor of the criterion, while low correlations suggest that the item may not be measuring the same construct. Item-criterion correlations are used in test development and validation to improve the quality of the test; by identifying items that do not correlate well with the criterion, test developers can refine the test to better measure the intended construct.

Calculation of Item-criterion Correlations

This correlation is often calculated using the point-biserial or biserial correlation coefficient. The point-biserial correlation is used for dichotomous items (e.g., true/false) scored 0/1. The biserial correlation is used when a dichotomous item is assumed to reflect an underlying continuous trait, and it provides a correction for the artificial dichotomization of the data.

Interpretation of Item-criterion Correlations

A high correlation indicates that the item is strongly related to the criterion and is likely a good measure of the underlying construct. A low or negative correlation suggests that the item may not be measuring the same construct as the criterion and might need to be revised or removed.
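Here is a minimal sketch of both coefficients in base R, using made-up data: item is a hypothetical dichotomous item and criterion a hypothetical continuous criterion score. The biserial value is obtained from the point-biserial via the standard textbook conversion.

    # Hypothetical data: one 0/1 item and a continuous criterion for 200 people.
    set.seed(2)
    criterion <- rnorm(200, mean = 50, sd = 10)
    item      <- rbinom(200, 1, plogis((criterion - 50) / 10))

    # Point-biserial: the Pearson correlation between a 0/1 item and the criterion.
    r_pb <- cor(item, criterion)

    # Biserial estimate: r_bis = r_pb * sqrt(p * (1 - p)) / dnorm(qnorm(p)),
    # where p is the proportion answering the item correctly.
    p     <- mean(item)
    r_bis <- r_pb * sqrt(p * (1 - p)) / dnorm(qnorm(p))

The biserial estimate is always larger in magnitude than the point-biserial, since it corrects for the information lost by dichotomizing.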
Item Characteristic Curve (ICC)

The item characteristic curve is a fundamental concept in Item Response Theory (IRT), used to describe the relationship between a test-taker's ability and the probability of a correct response to a specific test item. It is a graphical representation showing how the probability of a correct response changes with varying levels of the latent trait (e.g., ability, proficiency) being measured. The x-axis represents the latent trait or ability level of the test-taker; the y-axis represents the probability of a correct response to the item. The curve is typically S-shaped (sigmoid or ogive), indicating that as the ability level increases, the probability of a correct response also increases. ICCs are used to evaluate and improve test items, ensuring they are appropriate for measuring the intended construct across different ability levels.

Each ICC is defined by several parameters:
- Difficulty (b): the point on the ability scale where the probability of a correct response is 50%. Higher values indicate more difficult items.
- Discrimination (a): how well the item differentiates between test-takers with different levels of ability. Steeper slopes indicate higher discrimination.
- Guessing (c): the probability of a correct response due to guessing, often relevant for multiple-choice items.
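To make the parameters concrete, here is a small sketch of the standard three-parameter logistic (3PL) form of an ICC in R; the parameter values are made up for illustration. (Strictly, when c > 0 the probability at theta = b is (1 + c)/2, so the 50% description above applies to items without a guessing floor.)

    # 3PL item characteristic curve:
    # P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))
    icc_3pl <- function(theta, a, b, c) {
      c + (1 - c) / (1 + exp(-a * (theta - b)))
    }

    # Example item: moderate discrimination (a = 1.2), average difficulty (b = 0),
    # and a guessing floor of 0.2 (e.g., a multiple-choice item).
    theta <- seq(-4, 4, by = 0.1)
    plot(theta, icc_3pl(theta, a = 1.2, b = 0, c = 0.2), type = "l",
         xlab = "Ability (theta)", ylab = "P(correct response)")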
Item Bias

Item bias occurs when a test item favors one group of test-takers over another, not because of differences in the trait being measured, but because of some irrelevant characteristic. Item bias, also known as differential item functioning (DIF), happens when individuals from different groups (e.g., gender, ethnicity) with the same underlying ability have different probabilities of answering an item correctly. Biased items can create unfair advantages or disadvantages for certain groups, affecting the validity and fairness of the test and leading to inaccurate conclusions about the abilities of individuals from different groups.

Determining Item Bias

- Mantel-Haenszel procedure: a statistical test that detects DIF by comparing the odds of different groups answering an item correctly.
- Logistic regression: analyzes the probability of a correct response while controlling for overall ability level (see the sketch below).
- Item Response Theory (IRT): uses item characteristic curves to compare how different groups respond to an item.
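As a minimal illustration of the logistic-regression approach, the sketch below uses entirely hypothetical data: item is a 0/1 response, total a proxy for overall ability, and group a two-level grouping factor. (R's stats package also provides mantelhaen.test() for the Mantel-Haenszel procedure.)

    # Hypothetical data for 400 examinees in two groups, with built-in DIF.
    set.seed(3)
    n     <- 400
    group <- factor(rep(c("A", "B"), each = n / 2))
    total <- rnorm(n, mean = 20, sd = 5)
    item  <- rbinom(n, 1, plogis(0.4 * (total - 20) + 0.8 * (group == "B")))

    # Uniform DIF check: does group membership still predict the response
    # after controlling for overall ability?
    fit0 <- glm(item ~ total,         family = binomial)
    fit1 <- glm(item ~ total + group, family = binomial)
    anova(fit0, fit1, test = "Chisq")  # a significant drop in deviance suggests DIF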

Addressing Item Bias

- Review and revise items: conduct thorough reviews to identify and revise or remove biased items.
- Pilot testing: use diverse samples in pilot testing to detect potential biases before the test is finalized.
- Expert panels: involve experts from various backgrounds to review items for cultural, gender, or other biases.

Ethical Considerations

Ensuring that tests are free from bias is crucial for ethical testing practices. It helps provide equal opportunities for all test-takers and maintains the integrity of the test results.