Validity & Reliability Lesson 3 PDF

Summary

This lesson provides an overview of validity and reliability in educational testing. It explains how these concepts relate to the quality of test scores (rather than to tests themselves) and how to evaluate the reliability and validity of teacher-made and commercially produced assessments. The material covers common types of reliability evidence, such as test-retest and internal consistency, and common types of validity evidence, such as content, criterion-related, and construct validity.

Full Transcript

Validity & Reliability

No matter how well learning objectives are written, or how clever the items, the quality and usefulness of an examination is predicated on validity and reliability, making these concepts essential in any testing situation. This lesson will provide you with a fundamental overview of some of the different forms of reliability and validity evidence and give you some basic tools to assess the reliability and validity of company-made tests as well as teacher-made assessments.

What are Reliability and Validity?

Reliability and validity are both associated with the quality of the score you receive from an exam or assessment.

IMPORTANT: Reliability and validity are related to the score, product, or outcome of an assessment, not the test itself. We do not say "an exam is reliable and valid." Instead, we say "the exam score is reliable and valid for a specified purpose."

EXAMPLE: We would not say "SAT scores are reliable and valid." Instead we would say "SAT scores are a reliable and valid indicator of college performance." Without validity and reliability evidence, a test is meaningless.

Reliability answers questions about the stability and clarity of examination results. How consistent are the results? Was the measure of a student's ability reached by accident or chance, or was it reached through a clear, stable, meaningful examination?

Reliability is a STATISTIC. Specifically, it is a correlation ranging from 0 (completely unreliable) to 1 (perfectly reliable). Correlations measure relationships between two variables, two sets of test scores, or two items on a test. Correlations range from -1 to +1, but with reliability we use only the positive end of the scale (there are no negative reliability coefficients). Rule of thumb: measures with reliability coefficients of .70 or greater are acceptable.

Reliability Score   Interpretation
.9 – 1.0            Very High
.7 – .89            High
.5 – .69            Moderate
.3 – .49            Low
0 – .29             Negligible

Reliability is also a measure of error within an examination. Poor wording, ambiguity, problems with the keyed correct answer, failure to link items to objectives, and similar quality issues result in a high amount of error in a test score. All test scores are made up of two parts: Ability + Error. Ability is the information we seek (student knowledge); Error is human interference (poorly written questions, a student having a bad day, etc.). We are never able to measure true ability; we always have some degree of error mixed in with our test score.

Why is this concept of "error" important? The purpose of every assessment is to discriminate between those who know and those who do not know the content. If we have too much error in our results, we cannot effectively determine who really has the ability we are testing for and who does not.

NOTE: Do not confuse test error with the number of correct items a student achieves. The number correct reflects the difficulty of the items on the test. You could have a very difficult test on which many students fail that is still highly reliable, because it consistently keeps students ranked in the same order from one administration to the next.

Types of Reliability

Reliability evidence comes in multiple forms. We will discuss only a few of the most common types of reliability evidence: Test-Retest and Internal Consistency.
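Before turning to the specific types, here is a minimal sketch (not part of the original lesson) of two ideas above: an observed score is Ability + Error, and a reliability coefficient is simply a correlation between two sets of scores from the same students, which also previews the test-retest idea in the next section. The number of students, the score scale, and the error sizes are made-up values for illustration only.

```python
# Minimal sketch: observed score = ability + error, and more error -> lower reliability.
import numpy as np

rng = np.random.default_rng(0)
n_students = 200
true_ability = rng.normal(loc=70, scale=10, size=n_students)  # the knowledge we want to measure

def administer(true_ability, error_sd, rng):
    """Simulate one test administration: observed score = true ability + random error."""
    return true_ability + rng.normal(0, error_sd, size=true_ability.size)

for error_sd in (2, 10, 25):  # small, moderate, and large amounts of error (illustrative)
    time1 = administer(true_ability, error_sd, rng)
    time2 = administer(true_ability, error_sd, rng)
    # Reliability here is just the correlation between the two sets of scores.
    reliability = np.corrcoef(time1, time2)[0, 1]
    print(f"error SD = {error_sd:>2}: reliability = {reliability:.2f}")
```

With small error the coefficient lands near the "Very High" band of the table above; with large error it falls into the "Negligible" range, even though the students' true ability never changed.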
Test-Retest Reliability

Measures the stability and consistency of participants' results over time (the "Cadillac" of reliability measures). How is this done? Administer the test to students on January 1, then administer the same test to the same students on February 1. Are students performing in a consistent manner from Time 1 to Time 2? We expect that students may grow from Time 1 to Time 2, but they should remain in the same (consistent) order and show a similar amount of growth.

[Figure: two side-by-side charts showing seven students' math ability at Time 1 and Time 2. Example 1, "Highly Reliable Test-Retest Results," shows the students keeping the same rank order from T1 to T2; Example 2, "Unreliable Test-Retest Results," shows the rank order changing between T1 and T2.]

Check Your Understanding: Why are the Test-Retest results from Example 1 highly reliable and the results from Example 2 highly unreliable?

Internal Consistency Reliability

Refers to inter-item reliability and assesses the degree of consistency among the items in a scale or test. This tells us whether all items are measuring the same construct. Split-Half Reliability and Cronbach's Alpha are both measures of internal consistency.

Split-Half Reliability measures the internal consistency of items by correlating scores from one half of the test with scores from the other half of the items. How is this done? Randomly divide the items into two equal subsets and examine the consistency of total scores across the two subsets. Are scores from one half of the test consistent with scores from the other half? We would expect a high reliability coefficient if scores are consistent between the halves (both halves measuring the same construct).

Cronbach's Alpha is the most commonly used measure of internal consistency for a set of items across all respondents. Conceptually, it is the average consistency across all possible split-half reliabilities. How is this done? Use a statistical software package to compute this reliability coefficient directly from the data. Are all items on the test measuring the same construct? We would expect a high reliability coefficient if all items are measuring the same construct.

Check Your Understanding: Which type of internal consistency reliability would you expect to see reported for the Ohio Achievement and Graduation Tests, and why?
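The lesson says a statistical package computes Cronbach's alpha from the data; as a rough illustration (not part of the original lesson), here is a minimal sketch of the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The five respondents and four right/wrong items are hypothetical.

```python
# Minimal sketch of Cronbach's alpha computed directly from item-level data.
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """item_scores: rows = respondents, columns = test items."""
    k = item_scores.shape[1]                          # number of items
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of each respondent's total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 5 students answering 4 items scored right (1) or wrong (0).
responses = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```

A higher alpha here would indicate that the four items tend to rise and fall together across respondents, i.e., that they appear to measure the same construct.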
Validity

Validity answers questions about the reasonableness of the content, criterion, and construct associated with the learning objectives (LOs) and the assessment results. Did the test measure what we intended it to measure?

Types of Validity

Validity evidence comes in multiple forms. We will discuss only Content, Criterion-Related, and Construct validity in this lesson.

Content Validity

Addresses the match between test questions and the content or subject area they are intended to assess. Does the content we have selected for our test (the items) appropriately represent our construct (the top-level domain)? How is this assessed? It is tested by linking test items to learning objectives; it is a judgment call. Assessing whether test items are properly linked and aligned with LOs is the MOST important step a teacher can take to ensure their tests have strong content validity evidence. While this is a minimum requirement for a useful test, it is not a guarantee of a quality test.

Check Your Understanding: You have taught your 3rd grade students about forces, motion, and simple machines in science so far this year. Next week your students will participate in a standardized achievement test where science is one component that is tested. The science portion of the test will cover the concepts you have taught along with content about Earth and Space Sciences (rocks, minerals, soil, etc.) and Life Sciences (life cycles, animal structures and survival, animal classification by characteristics, etc.). What does the concept of Content Validity have to do with this situation? Do you think your students will perform poorly or well on this test? Explain.

Criterion-Related Validity

Looks at the relationship between a test score and an outcome (criterion). Do the test results strongly predict or relate to students' actual performance? Predictive Validity and Concurrent Validity are both measures of criterion-related validity.

Predictive Validity refers to the usefulness of test scores for accurately predicting future performance. How is this assessed? Suppose we want to predict who will graduate from high school. We administer a test and then wait to see how many students graduate. How well do the test results match the actual performance measured? If our test has high predictive validity evidence, the test results will have identified, before graduation, most of the students who actually graduated from high school.

Concurrent Validity is used to examine whether one measure can be substituted for another; the test scores and the outcome (criterion) measures are collected at the same time (concurrently) or very close together. How is this assessed? Suppose we are trying to test Reading Ability. A quality Reading Ability test already exists, but it is 100 items long and we want ours to be shorter yet of similar quality. How highly correlated is our test with the already well-established test? If our shorter (20-item) Reading Ability test is highly correlated with the well-established longer test, we have demonstrated concurrent validity evidence.

Check Your Understanding: Most colleges require undergraduates to take and submit scores from the SAT or ACT for admissions consideration. Why might they do this? And what type of criterion-related validity evidence might it provide the college admissions department?

Construct Validity

Refers to the degree to which a test assesses the underlying theoretical construct it is supposed to measure. How is this assessed? We need to see whether our test is related to other information according to some theory (a logical explanation). Does the test measure what it is alleged to measure? And do the concepts (constructs) we are measuring actually exist? (For example, does "Reading Ability" really exist?) If we were testing mechanical aptitude, we would expect mechanics and engineers to perform better than poets or ballerinas. We would also expect our mechanical aptitude test to be highly correlated with math ability, but not necessarily correlated with social studies ability.

Check Your Understanding: You are a test developer and have created what you believe will be a quality assessment of Writing Aptitude. How might you assess the construct validity of your new assessment? What "abilities" might your assessment be highly correlated with? Which "abilities" do you think would not need to be highly correlated with Writing Aptitude? Who would you expect to perform better on your assessment, and who would you expect to perform worse?
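As an illustration of the concurrent-validity check described for the reading test (not part of the original lesson), here is a minimal sketch using made-up scores for a hypothetical 20-item short test and the established 100-item test; the validity coefficient is simply the correlation between the two sets of scores.

```python
# Minimal sketch: concurrent validity as a correlation between a new short test
# and an established test taken at (roughly) the same time.
import numpy as np

# Hypothetical scores for 8 students on each measure.
short_test_scores = np.array([12, 15, 18, 9, 14, 17, 11, 19])          # out of 20 items
established_test_scores = np.array([61, 78, 90, 45, 72, 85, 55, 95])   # out of 100 items

validity_coefficient = np.corrcoef(short_test_scores, established_test_scores)[0, 1]
print(f"Concurrent validity evidence (correlation) = {validity_coefficient:.2f}")
# A high positive correlation supports substituting the shorter test for the longer one.
```

The same correlation-based logic applies to predictive validity, except that the criterion (e.g., whether a student later graduates) is collected well after the test is given.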
How are Reliability and Validity Related?

The figure below depicts the relationship between reliability and validity using three shooting targets.

First Target: The shooter is neither reliable nor valid. The holes are spread apart, indicating the shooter is not consistent (not reliable), and the shooter also fails to hit the bull's-eye (not valid).

Second Target: The shooter is reliable but not valid. The holes are all clustered in the same general area, so the shooter is consistently hitting the same place (reliable), but he is consistently off target (not valid). Our assessments could be consistently (reliably) measuring the wrong information (not valid).

Third Target: The shooter is both reliable and valid, because the holes are consistently in the same area (reliable) and on target (valid).

Reliability is a necessary but not sufficient condition for validity. An assessment MUST be reliable in order to be valid, but being reliable alone does not make it valid; we could reliably be measuring the wrong construct.
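As a rough numeric illustration of "reliable but not valid" (all data simulated, not from the lesson): a quiz that actually measures spelling gives very consistent scores across two administrations, yet those scores barely correlate with the math ability we intended to measure.

```python
# Minimal sketch: high reliability does not guarantee validity.
import numpy as np

rng = np.random.default_rng(1)
n = 100
math_ability = rng.normal(50, 10, n)    # the construct we intend to measure
spelling_skill = rng.normal(50, 10, n)  # an unrelated construct

# The (wrong) quiz measures spelling, with only a little random error each time.
quiz_time1 = spelling_skill + rng.normal(0, 2, n)
quiz_time2 = spelling_skill + rng.normal(0, 2, n)

reliability = np.corrcoef(quiz_time1, quiz_time2)[0, 1]
validity = np.corrcoef(quiz_time1, math_ability)[0, 1]
print(f"Test-retest reliability = {reliability:.2f}  (high: second target)")
print(f"Correlation with math ability = {validity:.2f}  (near zero: not valid)")
```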
