Lesson 6: Establishing Test Validity and Reliability
Summary
This document is a lesson on test validity and reliability. It covers different types of validity, such as content validity, face validity, predictive validity, concurrent validity, construct validity, convergent validity, and divergent validity. It also explains different types of reliability, such as test-retest reliability, parallel forms reliability, split-half reliability, and internal consistency reliability.
Full Transcript
Lesson 6: Establishing Test Validity and Reliability
How do we establish the validity and reliability of tests?

Desired Significant Learning Outcomes
In this lesson, you are expected to:
- explain the procedures and statistical analyses used to establish test validity and reliability; and
- decide whether a test is valid or reliable.

What is test reliability?
Reliability is the consistency of responses to a measure under three conditions: (1) when the same person is retested on the same measure; (2) when the same or an equivalent measure is administered at a different time; and (3) when responses are similar across items that measure the same characteristic. In the first condition, consistent responses are expected when the test is given again to the same participants. In the second condition, reliability is attained when responses to a test are consistent with responses to the same test, its equivalent form, or another test that measures the same characteristic, administered at a different time. In the third condition, there is reliability when a person responds in the same way, or consistently, across items that measure the same characteristic.

Different factors affect the reliability of a measure. The reliability of a measure can be high or low, depending on the following:
- The number of items in a test - The more items a test has, the higher the likelihood of reliability, because a large pool of items raises the probability of obtaining consistent scores.
- Individual differences of participants - Every participant possesses characteristics that affect test performance, such as fatigue, concentration, innate ability, perseverance, and motivation. These individual factors change over time and affect the consistency of answers in a test.
- External environment - The external environment may include room temperature, noise level, depth of instruction, exposure to materials, and quality of instruction, all of which can change examinees' responses to a test.

What are the different ways to establish test reliability?
There are different ways of determining the reliability of a test. The specific kind of reliability to use depends on: (1) the variable being measured, (2) the type of test, and (3) the number of versions of the test.

Methods in Testing Reliability

1. Test-retest
How is this reliability done? Administer a test at one time to a group of examinees, then administer it again at another time to the same group. For tests that measure stable characteristics, such as standardized aptitude tests, the time interval between the first and second administration should be no more than six months; the retest can be given with a minimum time interval of 30 minutes. The responses should be more or less the same across the two points in time. Test-retest is applicable for tests that measure stable variables, such as aptitude and psychomotor measures (e.g., a typing test or tasks in physical education).
What statistic is used? Correlate the test scores from the first and the second administration. A significant and positive correlation indicates that the test has temporal stability, that is, scores remain consistent over time. Correlation is a statistical procedure in which a linear relationship is expected between two variables. The Pearson Product Moment Correlation (Pearson r) is typically used because test data are usually on an interval scale.
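The following is a minimal sketch, assuming hypothetical scores, of how the test-retest correlation could be computed with SciPy; the score lists are invented for illustration, and the same pearsonr call would apply to parallel-forms data (first form versus second form).

```python
# Test-retest reliability sketch: correlate two administrations of the same
# test given to the same ten examinees. The scores below are hypothetical.
from scipy.stats import pearsonr

first_administration = [38, 42, 25, 47, 33, 40, 29, 44, 36, 31]    # time 1
second_administration = [40, 41, 27, 45, 35, 42, 30, 43, 34, 33]   # time 2

r, p_value = pearsonr(first_administration, second_administration)

# A significant (e.g., p < .05) and positive r suggests temporal stability:
# examinees who scored high at time 1 also tended to score high at time 2.
print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")
```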
2. Parallel Forms
How is this reliability done? There are two versions of the test, and the items in each version must measure exactly the same skill. Each version is called a "form." Administer one form at one time and the other form at another time to the same group of participants. The responses on the two forms should be more or less the same. Parallel forms are applicable when there are two versions of a test. This is usually done when the test is used repeatedly for different groups, such as entrance examinations and licensure examinations, where different versions of the test are given to different groups of examinees.
What statistic is used? Correlate the test results from the first form and the second form. A significant and positive correlation coefficient is expected; it indicates that the responses to the two forms are the same or consistent. Pearson r is usually used for this analysis.

3. Split-Half
How is this reliability done? Administer a test to a group of examinees, then split the items into halves, usually with the odd-even technique: get the sum of points on the odd-numbered items and correlate it with the sum of points on the even-numbered items. Each examinee will thus have two scores coming from the same test, and the two sets of scores should be close or consistent. Split-half is applicable when the test has a large number of items.
What statistic is used? Correlate the two sets of scores using Pearson r, then apply another formula called the Spearman-Brown coefficient. The coefficients obtained from Pearson r and Spearman-Brown should be significant and positive, which indicates that the test has internal consistency reliability.

4. Test of Internal Consistency
How is this reliability done? This procedure involves determining whether the scores for each item are consistently answered by the examinees. After administering the test to a group of examinees, determine and record the scores for each item; the idea is to see whether the responses per item are consistent with one another. This technique works well when the assessment tool has a large number of items. It is also applicable for scales and inventories (e.g., a Likert scale from "strongly agree" to "strongly disagree").
What statistic is used? Cronbach's alpha or the Kuder-Richardson formula is used to determine the internal consistency of the items. A Cronbach's alpha value of 0.60 and above indicates that the test items have internal consistency.

5. Inter-rater Reliability
How is this reliability done? This procedure is used to determine the consistency of ratings when multiple raters use rating scales or rubrics to judge performance. Reliability here refers to similar or consistent ratings provided by more than one rater or judge using the same assessment tool. Inter-rater reliability is applicable when the assessment requires the use of multiple raters.
What statistic is used? Kendall's coefficient of concordance (Kendall's W) is used to determine whether the ratings provided by multiple raters agree with one another. A significant coefficient indicates that the raters concur or agree with one another in their ratings.
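Below is a minimal sketch, under stated assumptions, of the three statistics just described: an odd-even split-half coefficient stepped up with the Spearman-Brown formula, Cronbach's alpha computed from its textbook definition, and Kendall's coefficient of concordance without a tie correction. The item-score and rating matrices are hypothetical, and the hand-rolled functions are for illustration rather than a replacement for a statistics package.

```python
# Reliability statistics sketch: split-half with Spearman-Brown, Cronbach's
# alpha, and Kendall's W. All data below are hypothetical.
import numpy as np
from scipy.stats import pearsonr, rankdata

# Rows = examinees, columns = dichotomously scored items (1 = correct).
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
])

def split_half_spearman_brown(scores):
    """Odd-even split-half coefficient corrected with Spearman-Brown."""
    odd_total = scores[:, 0::2].sum(axis=1)    # odd-numbered items (1, 3, 5, ...)
    even_total = scores[:, 1::2].sum(axis=1)   # even-numbered items (2, 4, 6, ...)
    r_half, _ = pearsonr(odd_total, even_total)
    return 2 * r_half / (1 + r_half)           # Spearman-Brown step-up formula

def cronbach_alpha(scores):
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / total variance)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def kendalls_w(ratings):
    """Kendall's coefficient of concordance (no tie correction).
    ratings: rows = raters, columns = performances being judged."""
    m, n = ratings.shape
    ranks = np.vstack([rankdata(row) for row in ratings])
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Rows = 3 raters scoring the same 5 performances on a rubric.
ratings = np.array([
    [4, 3, 5, 2, 4],
    [4, 2, 5, 3, 4],
    [3, 3, 5, 2, 5],
])

print("Split-half (Spearman-Brown):", round(split_half_spearman_brown(items), 2))
print("Cronbach's alpha:", round(cronbach_alpha(items), 2))
print("Kendall's W:", round(kendalls_w(ratings), 2))
```

In practice, the significance of each coefficient would still be checked, since the lesson requires significant values before concluding that a test is reliable.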
What is test validity?
A measure is valid when it measures what it is supposed to measure. If a quarterly exam is valid, then its contents should directly measure the objectives of the curriculum. If a scale that measures personality is composed of five factors, then each of the five factors should contain items that are highly correlated. If an entrance exam is valid, it should predict students' grades after the first semester.

What are the different ways to establish test validity?

1. Content Validity
Definition: The items represent the domain being measured.
Procedure: The items are compared with the objectives of the program. The items need to measure the objectives directly (for achievement tests) or the definition of the construct (for scales). A reviewer conducts the checking.

2. Face Validity
Definition: The test is presented well, free of errors, and administered well.
Procedure: The test items and layout are reviewed and tried out on a small group of respondents. A manual for administration can be made as a guide for the test administrator.

3. Predictive Validity
Definition: The measure should predict a future criterion. An example is an entrance exam predicting the grades of students after the first semester.
Procedure: A correlation coefficient is obtained, where the X-variable is used as the predictor and the Y-variable as the criterion.

4. Construct Validity
Definition: The components or factors of the test should contain items that are strongly correlated.
Procedure: Pearson r can be used to correlate the items within each factor. However, there is a technique called factor analysis that determines which items are highly correlated enough to form a factor.

5. Concurrent Validity
Definition: Two or more measures that assess the same characteristic are present for each examinee.
Procedure: The scores on the measures should be correlated.

6. Convergent Validity
Definition: The components or factors of a test are hypothesized to have a positive correlation.
Procedure: Correlation is done for the factors of the test.

7. Divergent Validity
Definition: The components or factors of a test are hypothesized to have a negative correlation. An example is correlating the scores on a test of intrinsic and extrinsic motivation.
Procedure: Correlation is done for the factors of the test.

Sample Cases for Each Type of Validity

1. Content Validity
A coordinator in science is checking the science test paper for Grade 4. She asked the Grade 4 science teacher to submit the table of specifications containing the objectives of the lesson and the corresponding items. The coordinator checked whether each item is aligned with the objectives.

2. Face Validity
The assistant principal browsed the test paper made by the math teacher. She checked whether the contents of the items are about mathematics, examined whether the instructions are clear, and browsed through the items to see whether the grammar is correct and the vocabulary is within the students' level of understanding.

3. Predictive Validity
The school admissions office developed an entrance examination. The officials wanted to determine whether the results of the entrance examination are accurate in identifying good students. They took the first-quarter grades of the students who were accepted and correlated them with the entrance examination results. They found significant and positive correlations between the entrance examination scores and the grades: the entrance examination results predicted the grades of students after the first quarter. Thus, there was predictive validity.

4. Concurrent Validity
A school guidance counselor administered a math achievement test to Grade 6 students. She also has a copy of the students' grades in math. She wanted to verify whether the math grades of the students measure the same competencies as the math achievement test, so she correlated the math achievement scores with the math grades.

5. Construct Validity
A science test made by a Grade 10 teacher is composed of four domains: matter, living things, force and motion, and earth and space, with 10 items under each domain. The teacher wanted to determine whether the 10 items written under each domain really belong to that domain. The teacher consulted an expert in test measurement, and they conducted a procedure called factor analysis. Factor analysis is a statistical procedure done to determine whether the items written load under the domain to which they belong. A sketch of this procedure is shown below.
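As a rough illustration of the factor-analysis step, the sketch below fits an exploratory factor analysis with scikit-learn and inspects the loadings. The response matrix is simulated, the two-factor structure and variable names are assumptions made up for the example, and real work on the Grade 10 test above would use the actual item scores (and typically a dedicated psychometrics package).

```python
# Construct validity sketch: exploratory factor analysis on a hypothetical
# item-response matrix (rows = examinees, columns = items).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate 200 examinees answering 8 items written to form two domains
# (items 0-3 for one domain, items 4-7 for another).
latent_traits = rng.normal(size=(200, 2))
true_loadings = np.array([
    [0.8, 0.0], [0.7, 0.1], [0.9, 0.0], [0.6, 0.1],   # domain 1 items
    [0.0, 0.8], [0.1, 0.7], [0.0, 0.9], [0.1, 0.6],   # domain 2 items
])
responses = latent_traits @ true_loadings.T + rng.normal(scale=0.3, size=(200, 8))

# Fit a two-factor model; a varimax rotation makes the loadings easier to read.
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(responses)

# components_ holds the estimated loadings (factors x items). Items that load
# highly on the same factor are taken to belong to the same domain.
print(np.round(fa.components_, 2))
```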
6. Convergent Validity
A math teacher developed a test to be administered at the end of the school year, measuring number sense, patterns and algebra, measurement, geometry, and statistics. The teacher assumed that students' competence in number sense improves their capacity to learn patterns and algebra and the other areas. After administering the test, the scores were separated for each area, and the five domains were intercorrelated using Pearson r. The positive correlation between number sense and patterns and algebra indicates that, when number sense scores increase, patterns and algebra scores also increase. This shows that students' learning of number sense scaffolds their patterns and algebra competencies.

7. Divergent Validity
An English teacher taught a metacognitive awareness strategy for comprehending a paragraph to her Grade 11 students. She wanted to determine whether her students' performance in reading comprehension would be reflected well in a reading comprehension test. She administered the same reading comprehension test to another class that was not taught the metacognitive awareness strategy. She compared the results using a t-test for independent samples and found that the class taught the metacognitive awareness strategy performed significantly better than the other group. The test has divergent validity. A sketch of this comparison is given below.
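The sketch below mirrors the comparison described in the divergent-validity case using SciPy's independent-samples t-test; the two sets of reading-comprehension scores are invented for illustration.

```python
# Divergent validity case sketch: independent-samples t-test comparing the
# class taught the metacognitive strategy with the class that was not.
# The scores are hypothetical.
from scipy.stats import ttest_ind

taught_strategy = [34, 38, 41, 36, 39, 42, 37, 40, 35, 43]
not_taught = [28, 31, 33, 29, 35, 30, 32, 27, 34, 31]

t_stat, p_value = ttest_ind(taught_strategy, not_taught)

# A significant difference in the expected direction (the taught class scoring
# higher) is the evidence cited above for divergent validity.
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```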