Test Reliability and Validity
This document provides an introduction to the concepts of test reliability and validity in the context of education. It explains the concept of true score and compares different techniques of estimating reliability and validity of a test.
Topic 8: Test Reliability and Validity

LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Explain the concept of true score;
2. Compare the different techniques of estimating the reliability of a test;
3. Compare the different techniques of establishing the validity of a test; and
4. Discuss the relationship between reliability and validity.

INTRODUCTION
In this topic we will address two important issues, namely the reliability and validity of an assessment. How do we ensure that the techniques we use for assessing the knowledge, skills and values of students are reliable and valid? We are making important decisions about the abilities and capabilities of our future generation, so obviously we want to ensure that we are making the right decisions.

8.1 WHAT IS RELIABILITY?

You have given a geometry test to a group of Form Four students and one of your students, Swee Leong, obtained a score of 66 per cent in the test. How sure are you that the score is what Swee Leong should actually receive? In other words, is that his true score?

When you develop a test and administer it to your students, you are attempting to measure, as far as possible, the true score of your students. The true score is a hypothetical concept of the actual ability, competency and capacity of an individual. A test attempts to measure the true score of a person. When measuring human abilities, it is practically impossible to develop an error-free test. However, just because there is error, it does not mean that the test is not good. The more important factor is the size of the error.

Formally, an observed test score, X, is conceived as the sum of a true score, T, and an error term, E:

Observed Score = True Score + Error, or X = T + E

The true score is defined as the average of test scores if a test were repeatedly administered to a student (and the student could be made to forget the content of the test between repeated administrations). Given that the true score is defined as the average of the observed scores, in each administration of a test the observed score departs from the true score, and the difference is called measurement error. This departure is not caused by blatant mistakes made by test writers; it is caused by chance elements in students' performance during the test.

Measurement error mostly comes from the fact that we have only sampled a small portion of a student's capabilities. Ambiguous questions and incorrect marking can contribute to measurement error, but they are only a small part of it. Imagine that there are 10,000 items and a learner can answer 60 per cent of all 10,000 items if all of them were administered (which is not practically feasible). Then 60 per cent is the true score. Now, when you sample only 40 items in a test, the expected score for the learner is 24 items. But the learner may get 20, 26, 30 and so forth, depending on which items happen to be in the test. In this example, this is the main source of measurement error. That is to say, measurement error is due to the sampling of items rather than poorly written items.

Error may also come from various other sources, such as within the test takers (the learners), within the test (questions are not clear), in the administration of the test or even during scoring (or marking). For example, fatigue, illness, copying or even the unintentional noticing of another learner's answer all contribute to error from within the test taker.
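To make the item-sampling point concrete, here is a minimal Python sketch (all numbers hypothetical, matching the example above): a learner who can answer 60 per cent of a 10,000-item domain sits repeated 40-item tests, and the observed scores scatter around the expected 24 purely because of which items happen to be drawn.

```python
import random

# Hypothetical domain: 10,000 items, of which the learner can
# answer 60 per cent (1 = can answer, 0 = cannot).
domain = [1] * 6000 + [0] * 4000

def administer_test(n_items=40):
    """Observed score on one randomly sampled n-item test."""
    return sum(random.sample(domain, n_items))

random.seed(1)
scores = [administer_test() for _ in range(10)]
print("Expected (true) score on 40 items:", 24)
print("Observed scores on repeated tests:", scores)
# The scores vary around 24 even though the learner's ability never
# changes: this spread is measurement error due to item sampling.
```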
Generally, the smaller the error, the closer you are to measuring the true score of the learners. If you are confident that your geometry test (observed score) has a small error, then you can confidently infer that Swee Leong's score of 66 per cent is close to his true score, his actual ability in solving geometry problems, in other words, what he actually knows. To reduce the error in a test, you must ensure that your test is reliable and valid. The higher the reliability and validity of your test, the greater the likelihood that you will be measuring the true score of your learners.

We will first examine the reliability of a test. What is reliability? Reliability is the consistency of the measurement. Would your learners get the same scores if they took your test on two different occasions? Would they get approximately the same score if they took two different forms of your test? These questions have to do with the consistency of your classroom tests in measuring learners' abilities, skills and attitudes or values. The generic name for consistency is reliability. Reliability is an essential characteristic of a good test because if a test does not measure consistently (reliably), then you cannot count on the scores resulting from the administration of the test (Jacobs, 1991).

8.2 THE RELIABILITY COEFFICIENT

Reliability is quantified as a reliability coefficient. The symbol used to denote a reliability coefficient is r with two identical subscripts (r_xx). The reliability coefficient is defined as the variance of the true score divided by the variance of the observed score:

r_xx = (variance of the true score) / (variance of the observed score) = σ²_T / σ²_X

If there is relatively little error, the ratio of the true score variance to the observed score variance approaches 1.00, which is perfect reliability. If there is a relatively large amount of error, the ratio approaches 0.00, which is total unreliability. A reliability coefficient therefore runs from 0.00 (a test with no reliability) to 1.00 (a test with perfect reliability).

High reliability means that the questions of a test tended to "pull together": learners who answered a given question correctly were more likely to answer other questions correctly, and if an equivalent or parallel test were developed using similar items, the relative scores of learners would show little change. Low reliability means that the questions tended to be unrelated to each other in terms of who answered them correctly. The resulting test scores reflect that something is wrong with the items or the testing situation rather than learners' knowledge of the subject matter.
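The variance-ratio definition can be illustrated with a minimal sketch, assuming invented true scores and chance errors for five learners; classical test theory's assumption that errors are unrelated to true scores lets the two variances simply be added.

```python
import statistics

# Invented illustration: true scores for five learners and the
# chance errors on one administration (errors average zero).
true_scores = [55, 60, 65, 70, 75]
errors = [3, -2, 1, -4, 2]

var_true = statistics.pvariance(true_scores)
var_error = statistics.pvariance(errors)

# Classical test theory assumes errors are unrelated to true scores,
# so observed-score variance = true-score variance + error variance.
var_observed = var_true + var_error

# Reliability: the proportion of observed-score variance that is
# true-score variance; it approaches 1.00 as error shrinks.
r_xx = var_true / var_observed
print(f"r_xx = {var_true:.1f} / {var_observed:.1f} = {r_xx:.2f}")
```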
The guidelines in Table 8.1 may be used to interpret reliability coefficients for classroom tests.

Table 8.1: Interpretation of Reliability Coefficients

Reliability      Interpretation
0.90 and above   Excellent reliability (comparable to the best standardised tests).
0.80 to 0.90     Very good for a classroom test.
0.70 to 0.80     Good for a classroom test; there are probably a few items which could be improved.
0.60 to 0.70     Somewhat low. There are probably some items which could be removed or improved.
0.50 to 0.60     The test needs to be revised.
Below 0.50       Questionable reliability; the test should be replaced or is in need of a major revision.

If you know the reliability coefficient of a test, can you estimate the true score of a learner on the test? In testing, we use the standard error of measurement (SEM) to estimate the true score:

SEM = standard deviation × √(1 − r)

Note: r is the reliability of the test.

Using the normal curve, you can estimate a learner's true score with some degree of certainty based on the observed score and the standard error of measurement. For example: You gave a history test to a group of 40 students. Khairul obtained a score of 75 in the test, which is his observed score. The standard deviation of your test is 2.0, and you had earlier established that your history test has a reliability coefficient of 0.7. You are interested in finding out Khairul's true score.

SEM = 2.0 × √(1 − 0.7) = 2.0 × 0.548 ≈ 1.1

Therefore, based on the normal distribution curve (refer to Figure 8.1), Khairul's true score should be:
(a) Between 75 − 1.1 and 75 + 1.1, or between 73.9 and 76.1, for 68% of the time.
(b) Between 75 − 2.2 and 75 + 2.2, or between 72.8 and 77.2, for 95% of the time.
(c) Between 75 − 3.3 and 75 + 3.3, or between 71.7 and 78.3, for 99% of the time.

Figure 8.1: Determining Khairul's true score based on a normal distribution
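Here is a short sketch of the SEM calculation and the confidence bands for Khairul's example; the function name is ours, and the numbers come from the example above.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# Khairul's history test: observed score 75, SD 2.0, reliability 0.7.
observed, sd, r = 75, 2.0, 0.7
sem = standard_error_of_measurement(sd, r)
print(f"SEM = {sem:.2f}")  # about 1.10

# True-score bands at 1, 2 and 3 SEMs around the observed score.
for k, pct in [(1, 68), (2, 95), (3, 99)]:
    low, high = observed - k * sem, observed + k * sem
    print(f"{pct}% of the time: between {low:.1f} and {high:.1f}")
```

The same function applied to the figures in Activity 8.1 below (standard deviation 1.5, reliability 0.65) gives an SEM of roughly 0.89.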
SELF-CHECK 8.1
1. Define the reliability of a test.
2. What does the reliability coefficient indicate?
3. Explain the concept of true score.

ACTIVITY 8.1
Shalin obtains a score of 70 in a biology test. The reliability of the test is 0.65 and the standard deviation of the test is 1.5. Compute the true score of Shalin for the biology test. (Hint: Use 1 standard error of measurement.) Share your answer with your coursemates in the myINSPIRE online forum.

8.3 METHODS TO ESTIMATE THE RELIABILITY OF A TEST

Let us now discuss how we estimate the reliability of a test. It is not possible to calculate reliability exactly, so we have to estimate it. Figure 8.2 lists three common methods of estimating the reliability of a test.

Figure 8.2: Methods for estimating reliability

Let us now look at each method in detail.

(a) Test-retest
Using the test-retest technique, the same test is administered again to the same group of learners. The scores obtained in the first administration of the test are correlated with the scores obtained in the second administration. If the correlation between the two sets of scores is high, then the test can be considered to have high reliability. However, a test-retest situation is somewhat difficult to arrange, as it is unlikely that learners will be prepared to take the same test twice. There is also the effect of practice and memory that may influence the correlation: the shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation, because the two observations are related over time. Since this correlation is the test-retest estimate of reliability, you can obtain considerably different estimates depending on the interval.

(b) Parallel or Equivalent Forms
For this technique, two equivalent tests (or forms) are administered to the same group of learners. The two tests are not identical but are equivalent: they may have different questions, but they measure the same knowledge, skills or attitudes. You therefore have two sets of scores which are correlated, and reliability can be established. Unlike the test-retest technique, the parallel or equivalent forms reliability measure is not affected by the influence of memory. One major problem with this approach is that you have to be able to generate many items that reflect the same construct, which is often not an easy feat.

(c) Internal Consistency
Internal consistency is determined using only one test administered once to learners. Internal consistency refers to how the individual items or questions behave in relation to each other and to the overall test. In effect, we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results; we are looking at how consistent the results are for different items measuring the same construct within the test. The following are two common internal consistency measures.

(i) Split-half
To solve the problem of having to administer the same test twice, the split-half technique is used. In the split-half technique, a test is administered once to a group of learners and is divided into two equal halves after the learners have completed it. This technique is most appropriate for tests which include multiple-choice items, true-false items and perhaps short-answer essays. The items are assigned using the odd-even method, whereby one half of the test consists of odd-numbered items while the other half consists of even-numbered items. The scores obtained for the two halves are then correlated, and the reliability of the whole test is estimated using the Spearman-Brown formula:

r_sb = 2r_xy / (1 + r_xy)

In this formula, r_sb is the split-half reliability coefficient and r_xy represents the correlation between the two halves. Say, for example, you have established that the correlation coefficient between the two halves is 0.65. What is the reliability of the whole test?

r_sb = (2 × 0.65) / (1 + 0.65) = 1.30 / 1.65 ≈ 0.79
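A minimal sketch of the split-half procedure, assuming invented half-test scores for eight learners; statistics.correlation (Python 3.10+) supplies the Pearson correlation between the halves, and the Spearman-Brown formula steps it up to whole-test reliability.

```python
import statistics

def spearman_brown(r_half):
    """Step the half-test correlation up to whole-test reliability."""
    return 2 * r_half / (1 + r_half)

# Invented odd-item and even-item half scores for eight learners.
odd_half = [12, 15, 9, 18, 14, 11, 16, 13]
even_half = [11, 16, 10, 17, 12, 12, 15, 14]

r_half = statistics.correlation(odd_half, even_half)  # Python 3.10+
print(f"correlation between halves = {r_half:.2f}")
print(f"split-half reliability     = {spearman_brown(r_half):.2f}")

# With the correlation from the worked example above:
print(f"{spearman_brown(0.65):.2f}")  # -> 0.79
```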
(ii) Cronbach's Alpha
Cronbach's coefficient alpha can be used for both binary items (1 = correct, 0 = incorrect, or 1 = true and 0 = false) and scale items (1 = strongly agree, 2 = agree, 3 = disagree, 4 = strongly disagree). Reliability is estimated by computing the correlations among the individual questions and the extent to which individual questions correlate with the total test. This is what is meant by internal consistency; the key word is "internal". It is unlike the test-retest and parallel or equivalent forms techniques, which require another test administration as an external reference. The more strongly the items are inter-related, the more likely the test is consistent, and the higher the alpha, the more reliable the test. There is no generally agreed cut-off; usually 0.7 and above is acceptable (Nunnally, 1978). The formula for Cronbach's alpha for binary items is as follows:

alpha = (k / (k − 1)) × (1 − Σ p_i(1 − p_i) / σ²_x)

where k is the number of items in the test; p_i is the item difficulty, that is, the proportion of learners who answered item i correctly; and σ²_x is the sample variance of the total score.

For example, suppose that in a multiple-choice test consisting of five items, the following difficulty index for each item was observed: p1 = 0.4, p2 = 0.5, p3 = 0.6, p4 = 0.75 and p5 = 0.85, and the sample variance (σ²_x) is 1.84. Cronbach's alpha would be calculated as follows:

alpha = (5 / 4) × (1 − 1.045 / 1.840) = 1.25 × 0.432 ≈ 0.54
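The same calculation in code: a small sketch of the binary-item form of Cronbach's alpha (equivalent to the KR-20 formula), using the item difficulties and sample variance from the worked example; the function name is ours.

```python
def cronbach_alpha_binary(p_values, sample_variance):
    """Binary-item Cronbach's alpha (the KR-20 form):
    alpha = k/(k - 1) * (1 - sum(p*(1 - p)) / variance of total)."""
    k = len(p_values)
    sum_item_variances = sum(p * (1 - p) for p in p_values)
    return (k / (k - 1)) * (1 - sum_item_variances / sample_variance)

# Item difficulties and total-score variance from the worked example.
p = [0.4, 0.5, 0.6, 0.75, 0.85]
print(f"alpha = {cronbach_alpha_binary(p, sample_variance=1.84):.2f}")
# -> alpha = 0.54, matching the hand calculation above.
```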
Professionally developed standardised tests should have an internal consistency coefficient of at least 0.85. High reliability coefficients are required for standardised tests because they are administered only once, and the score on that one test is used to draw conclusions about each learner's level on the constructs measured. Perhaps the closest to standardised tests in the Malaysian context are the tests for different subjects conducted at the national level in the PT3 and SPM examinations. According to Wells and Wollack (2003), it is acceptable for classroom tests to have reliability coefficients of 0.70 and higher because a learner's score on any one test does not determine the learner's entire grade in the subject or course. Usually, grades are based on several other measures such as project work, oral presentations, practical tests, class participation and so forth. To what extent is this true in the Malaysian classroom?

A Word of Caution! When you obtain a low alpha, be careful not to immediately conclude that the test is a bad test. Check whether the test measures several attributes or dimensions rather than one; if it does, Cronbach's alpha is likely to be deflated. For example, an aptitude test may measure three attributes or dimensions such as quantitative ability, language ability and analytical ability. Hence, it is not surprising if the Cronbach's alpha for the whole test is low, as the questions may not correlate with each other. Why? Because the items are measuring three different types of human abilities. The solution is to compute three different Cronbach's alphas, one each for quantitative ability, language ability and analytical ability, which tells you more about the internal consistency of the items in the test.

SELF-CHECK 8.2
1. What is the main advantage of the split-half technique over the test-retest technique in determining the reliability of a test?
2. Explain the parallel or equivalent forms technique in determining the reliability of a test.
3. Explain the concept of internal consistency reliability of a test.

8.4 INTER-RATER AND INTRA-RATER RELIABILITY

Whenever you use humans as part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency: we are easily distracted, we get tired of doing repetitive tasks, we daydream and we misinterpret. So, how do we determine whether:
(a) Two observers are being consistent in their observations?
(b) Two examiners are being consistent in their marking of an essay?
(c) Two examiners are being consistent in their marking of a project?

To find the answers to these questions, let us read further on inter-rater reliability and intra-rater reliability.

(a) Inter-rater Reliability
When two or more people mark essay questions, the extent to which there is agreement in the marks allotted is called inter-rater reliability (refer to Figure 8.3). The greater the agreement, the higher the inter-rater reliability.

Figure 8.3: Examiner A vs Examiner B

Inter-rater reliability can be low for the following reasons:
(i) Examiners are subconsciously influenced by knowledge of the learners whose scripts are being marked;
(ii) Consistency in marking is affected after marking a set of either very good or very weak scripts;
(iii) When there is an interruption during the marking of a batch of scripts, different standards may be applied after the break; and
(iv) The marking scheme is poorly developed, resulting in examiners making their own interpretations of the answers.

According to Frith and Macintosh (1987), inter-rater reliability can be enhanced if the criteria for marking or the marking scheme:
(i) Contains suggested answers related to the question;
(ii) Makes provision for acceptable alternative answers;
(iii) Ensures that the time allotted is appropriate for the work required;
(iv) Is sufficiently broken down to allow the marking to be as objective as possible and the totalling of marks to be correct; and
(v) Allocates marks according to the degree of difficulty of the question.

(b) Intra-rater Reliability
While inter-rater reliability involves two or more individuals, intra-rater reliability is the consistency of grading by a single rater, where scores on a test are rated by that one rater at different times. When we grade tests at different times, we may become inconsistent in our grading for various reasons. For example, some papers graded during the day may get our full attention, while others graded towards the end of the day may be glossed over very quickly. Similarly, a change in mood may affect the grading of papers. In these situations, the lack of consistency affects intra-rater reliability in the grading of learners' answers.
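How might the agreement between two examiners be quantified? A minimal sketch with invented marks: the correlation between the two examiners' marks is one simple index of inter-rater reliability, alongside the proportion of scripts given identical marks. (More refined indices such as Cohen's kappa exist but are beyond this sketch.)

```python
import statistics

# Invented marks awarded by two examiners to the same ten essays.
examiner_a = [14, 12, 17, 9, 15, 11, 18, 13, 10, 16]
examiner_b = [13, 12, 16, 10, 15, 12, 17, 14, 9, 16]

# One simple index: the correlation between the two sets of marks
# (statistics.correlation requires Python 3.10+).
r = statistics.correlation(examiner_a, examiner_b)

# Another: the proportion of essays given identical marks.
agreement = sum(a == b for a, b in zip(examiner_a, examiner_b)) / len(examiner_a)

print(f"inter-rater correlation = {r:.2f}")
print(f"exact agreement         = {agreement:.0%}")
```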
SELF-CHECK 8.3
List the steps that may be taken to enhance inter-rater reliability in the grading of essay answer scripts.

ACTIVITY 8.2
Suggest other steps you would take to enhance intra-rater reliability in the grading of learners' assignments. Share your answer with your coursemates in the myINSPIRE online forum.

8.5 TYPES OF VALIDITY

What is validity? Validity is often defined as the extent to which a test measures what it was designed to measure (Nutall, 1987). While reliability relates to the consistency of the test, validity relates to the relevancy of the test. If a test does not measure what it sets out to measure, then its use is misleading and the interpretation based on the test is not valid or relevant. For example, a test that is supposed to measure the "spelling ability of 8-year-old children" but does not do so is not a valid test. It would be disastrous if you made claims about what a learner can or cannot do based on a test that is actually measuring something else. It is for this reason that many educators argue that validity is the most important aspect of a test. However, validity will vary from test to test depending on what the test is used for. For example, a test may have high validity in testing the recall of facts in economics, but that same test may have low validity with regard to testing the application of concepts in economics.

Messick (1989) was most concerned about the inferences a teacher draws from the test score, the interpretation the teacher makes about his learners and the consequences of such inferences and interpretations. You can imagine the power an educator holds when designing a test: your test could determine the future of thousands of learners, and inferences based on a test of low validity could give a completely different picture of the actual abilities and competencies of learners.

Three types of validity have been identified: construct validity, content validity and criterion-related validity, the last of which is made up of predictive and concurrent validity (refer to Figure 8.4).

Figure 8.4: Types of validity

Now let us find out more about each type of validity.

(a) Construct Validity
Construct validity relates to whether the test is an adequate measure of the underlying construct. A construct could be any phenomenon such as mathematics achievement, map skills, reading comprehension, attitude towards school, inductive reasoning, environmental awareness, spelling ability and so forth. You might think of construct validity as a "labelling" issue. For example, when you measure what you term "critical thinking", is that what you are really measuring? Thus, to ensure high construct validity, you must be clear about the definition of the construct you intend to measure. For example, a construct such as reading comprehension would include vocabulary development, reading for literal meaning and reading for inferential meaning. Some experts in educational measurement have argued that construct validity is the most critical type of validity.

You could establish the construct validity of an instrument by correlating it with another test that measures the same construct. For example, you could compare the scores obtained on your reading comprehension test with the scores obtained on another well-known reading comprehension test administered to the same sample of learners. If the scores for the two tests are highly correlated, then you may conclude that your reading comprehension test has high construct validity.

A construct is determined by referring to theory. For example, if you are interested in measuring the construct "self-esteem", you need to be clear about what self-esteem is. Perhaps you need to refer to literature in the field describing the attributes of self-esteem. You will find that, theoretically, self-esteem is made up of the following attributes: physical self-esteem, academic self-esteem and social self-esteem. Based on this theoretical perspective, you can build items or questions to measure self-esteem covering these three types. Through such a process you will be more certain of ensuring high construct validity.

(b) Content Validity
Content validity is more straightforward and likely to be related to construct validity. It concerns the coverage of appropriate and necessary content: does the test cover the skills necessary for good performance, or all the aspects of the subject taught? It is concerned with sample-population representativeness; that is to say, the facts, concepts and principles covered by the test items should be representative of the larger domain (such as the syllabus) of facts, concepts and principles.
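A Table of Specifications lends itself to a simple check in code. The sketch below stores Table 8.2 as a dictionary and recomputes the skill and topic weightings, confirming the 64/36 split described above.

```python
# Table 8.2 as a simple data structure. Checking the weightings
# helps confirm the test samples the syllabus as intended, which
# supports content validity.
spec = {
    "Light":       {"understanding": 7, "application": 4},
    "Sound":       {"understanding": 7, "application": 4},
    "Heat":        {"understanding": 7, "application": 4},
    "Magnetism":   {"understanding": 3, "application": 3},
    "Electricity": {"understanding": 8, "application": 3},
}

total = sum(sum(cells.values()) for cells in spec.values())  # 50 items
for skill in ("understanding", "application"):
    n = sum(cells[skill] for cells in spec.values())
    print(f"{skill}: {n} items ({n / total:.0%})")  # 64% and 36%
for topic, cells in spec.items():
    n = sum(cells.values())
    print(f"{topic}: {n} items ({n / total:.0%})")
```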
For example, the Science unit on "Energy and Forces" may include facts, concepts, principles and skills on light, sound, heat, magnetism and electricity. However, it is difficult, if not impossible, to administer a two- to three-hour paper that tests all aspects of the syllabus on "Energy and Forces" (refer to Figure 8.5). Therefore, only selected facts, concepts, principles and skills from the syllabus (or domain) are sampled. The content selected will be determined by content experts who will judge the relevance of the content in the test to the content in the syllabus or particular domain.

Figure 8.5: Sample of content tested for the unit on "Energy and Forces"

Content validity will be low if the questions in the test cover content not included in the domain or syllabus. To ensure content validity and coverage, most teachers use the Table of Specifications (as discussed in Topic 3). Table 8.2 is an example of a Table of Specifications which specifies the knowledge and skills to be measured and the topics covered for the unit on "Energy and Forces". You cannot measure all the content of a topic; therefore, you will have to focus on the key areas and give due weightage to those areas that are important. For example, the teacher has decided that 64 per cent of the questions will emphasise the understanding of concepts while 36 per cent will focus on the application of concepts across the five topics. A Table of Specifications provides teachers with evidence that a test has high content validity and that it covers what should be covered.

Table 8.2: Table of Specifications for the Unit on "Energy and Forces"

Topics        Understanding of Concepts   Application of Concepts   Total
Light         7                           4                         11 (22%)
Sound         7                           4                         11 (22%)
Heat          7                           4                         11 (22%)
Magnetism     3                           3                         6 (12%)
Electricity   8                           3                         11 (22%)
TOTAL         32 (64%)                    18 (36%)                  50

Content validity is different from face validity, which refers to what the test superficially appears to measure. Face validity assesses whether the test "looks valid" to the examinees, the administrative personnel who decide on its use and other technically untrained observers. Face validity is a weak measure of validity, but that does not mean it is incorrect, only that caution is necessary; its importance should not be underestimated.

(c) Criterion-related Validity
Criterion-related validity of a test is established by relating the scores obtained to the scores on some other criterion or test. There are two types of criterion-related validity:

(i) Predictive validity relates to whether the test accurately predicts some future performance or ability. Is the STPM examination a good predictor of performance in university? One difficulty in calculating the predictive validity of STPM is that, generally speaking, only those who pass the examination proceed to university, and we do not know how well learners who did not pass the examination might have done (Wood, 1991). Moreover, only a small proportion of the population sit for the STPM examination, so any correlation between STPM grades and performance at the degree level is computed on a narrow, pre-selected group.

(ii) Concurrent validity concerns whether the test correlates with, or gives substantially the same results as, another test of the same skill. For example, does your end-of-year language test correlate with the MUET examination? If your language test correlates highly with MUET, then your language test has high concurrent validity.
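A brief sketch of how a concurrent validity coefficient might be computed, assuming invented scores for the same learners on a classroom language test and on MUET.

```python
import statistics

# Invented scores for the same eight learners on two measures of
# the same skill: a classroom language test and MUET.
class_test = [62, 71, 55, 80, 67, 74, 59, 88]
muet = [60, 70, 52, 83, 65, 71, 61, 85]

# Concurrent validity is evidenced by a high correlation between
# the two measures (statistics.correlation requires Python 3.10+).
r = statistics.correlation(class_test, muet)
print(f"concurrent validity coefficient = {r:.2f}")
```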
8.6 FACTORS AFFECTING RELIABILITY AND VALIDITY

Deale (1975) suggests that to prepare tests which are acceptably valid and reliable, the following factors should be taken into account:

(a) Length of the Test
Generally, the longer the test, the more reliable and valid it is. A short test would not adequately cover a year's work; the syllabus needs to be sampled, and the test should consist of enough questions to be representative of the knowledge, skills and competencies in the syllabus. However, there is also a problem with tests that are too long. A long test may be valid, but it will take too much time, and fatigue may set in, which may affect performance and the reliability of the test.

(b) Selection of Topics
The topics selected and the test questions prepared should reflect the way the topics were treated during teaching and learning. It is necessary to be clear about the learning outcomes and to design items that measure these learning outcomes. For example, suppose that in your teaching, learners were not given the opportunity to think critically and solve problems, yet your test consists of items requiring learners to do exactly that. In such a situation, the reliability and validity of the test will be affected.

(c) Choice of Testing Techniques
The testing techniques selected will also affect reliability and validity. For example, if you choose to use essay questions, validity may be high but reliability may be low. Essay questions tend to be less reliable than short-answer questions, and structured essays are usually more reliable than open-ended essays.

(d) Method of Test Administration
Test administration is also an important step in the measurement process. This includes the arrangement of items in a test, the monitoring of test taking and the preparation of data files from the test booklets. Poor test administration procedures can lead to problems in the data collected and threaten the validity of test results. Adequate time must be allowed for the majority of learners to complete the test; this reduces wild guessing and instead encourages learners to think carefully about their answers. Instructions need to be clear to reduce the effects of confusion on reliability and validity. The physical conditions under which the test is taken must be favourable for learners: there must be adequate space, lighting and appropriate temperature, learners must be able to work independently, and the possibility of distractions in the form of movement and noise must be guarded against.

(e) Method of Marking
The marking should be as objective as possible. Marking which depends on the exercise of human judgement, such as in essays, observations of classroom activities and hands-on practices, is subject to the variations of human fallibility (refer to inter-rater reliability discussed in the earlier subtopic). It is quite easy to mark objective items quickly, but it is also surprisingly easy to make careless errors, especially where large numbers of scripts are being marked. A system of checks is strongly advised; one method is through the comments of the learners themselves when their marked papers are returned to them.

8.7 RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY

Some people may think of reliability and validity as two separate concepts.
In reality, reliability and validity are related. Trochim (2005) provided the following analogy (refer to Figure 8.6).

Figure 8.6: Graphical representation of the relationship between reliability and validity

The centre of the target, or the bullseye, is the concept that we are trying to measure. Say, for example, that you are trying to measure the concept of "inductive reasoning". You are likely to hit the centre (the bullseye) if your inductive reasoning test is both reliable and valid, which is what all test developers aim to achieve (see Figure 8.6(d)). On the other hand, your inductive reasoning test could be reliable but not valid. How is that possible? Your test may not measure inductive reasoning, but the score you obtain each time you administer the test is approximately the same (see Figure 8.6(b)). In other words, the test is consistently and systematically measuring the wrong construct (not inductive reasoning). Imagine the consequences of making judgements about the inductive reasoning of learners using such a test!

In the context of psychological testing, if an instrument does not have satisfactory reliability, one typically cannot claim validity: validity requires that instruments are sufficiently reliable. The diagram in Figure 8.6(c) does not show validity because the reliability is low; you are not getting a valid estimate of the inductive reasoning ability of your learners because the scores are inconsistent. The worst-case scenario is when the test is neither reliable nor valid (see Figure 8.6(a)). In this scenario, the scores obtained by learners tend to concentrate on the top half of the target and consistently miss the centre; such a test should be rejected or improved.

• The true score is a hypothetical concept of the actual ability, competency and capacity of an individual.
• The higher the reliability and validity of your test, the greater the likelihood that you will be measuring the true score of your learners.
• Reliability refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly.
• Validity requires that instruments are sufficiently reliable.
• Face validity is a weak measure of validity.
• Using the test-retest technique, the same test is administered again to the same group of learners.
• For the parallel or equivalent forms technique, two equivalent tests (or forms) are administered to the same group of learners.
• Internal consistency is determined using only one test administered once to learners.
• When two or more persons mark essay questions, the extent to which there is agreement in the marks allotted is called inter-rater reliability.
• While inter-rater reliability involves two or more individuals, intra-rater reliability is the consistency of grading by a single rater.
• Validity is the extent to which a test measures what it claims to measure. It is vital for a test to be valid in order for the results to be accurately applied and interpreted.
• Construct validity relates to whether the test is an adequate measure of the underlying construct.
• Content validity is more straightforward and likely to be related to construct validity. It relates to the coverage of appropriate and necessary content.
• Some people may think of reliability and validity as two separate concepts; in reality, they are related.
Key terms: concurrent validity, construct validity, content validity, criterion-related validity, face validity, internal consistency, inter-rater reliability, intra-rater reliability, predictive validity, reliability, test-retest, true score, validity.