Test Construction & Development
Summary
This document discusses the steps involved in test development and construction, which include conceptualization, construction, tryout, analysis, and revision. It also touches upon different approaches and methods for evaluating test items, as well as scaling methods.
Psychological Testing and Measurement (PSY-P631) VU
Lesson 10: Test Construction

Test Development: A good test is created by applying established principles of test construction. The process of test development occurs in five steps:
1. Test conceptualization
2. Test construction
3. Test tryout
4. Item analysis
5. Test revision

Once the idea for a test is conceived (test conceptualization), items for the test are drafted (test construction). This first draft of the test is then tried out on a group of sample test takers (test tryout). Once the data from the tryout are in, test takers' performance on the test as a whole and on each of the test's items will be analyzed. Statistical procedures, collectively referred to as item analysis, will be employed to assist in making judgments about which items are good as they are, which items may need to be revised, and which items should be discarded. The analysis of test items may include analyses of item reliability, item validity, item discrimination, and, depending upon the type of test, item difficulty level. On the basis of item analysis and related considerations, a revision or second draft of the test will be created. This revised version of the test is then tried out on a new sample of test takers, the results will be analyzed, the test further revised if necessary, and so it goes.

Test Development Process: Test conceptualization → Test construction → Test tryout → Analysis → Revision

1. Test Conceptualization
Test development begins with a test developer's idea for a tool to measure a particular construct. The stimulus for developing a test can be almost anything. For example, the literature on an already developed test might point to weaknesses in its psychometric soundness, and the would-be test developer thinks that he or she can do better. The emergence to prominence of some social phenomenon or pattern of behavior might also serve as the stimulus for development of a new test. Apart from the stimulus for developing a new test, a number of questions immediately confront the prospective test developer. Some of these questions include:
What is the test designed to measure?
What is the purpose of developing the test?
Is there any need for this test?
What will the sample for the test be?
What should the test content be?
What should the procedure for test administration be?
What should the ideal format of the test be?
Should more than one form of the test be developed?
What special training will be required of test users for its administration and interpretation?
What type of responses will be required of test takers?
Who will benefit from its administration?
Is there any potential for harm as the result of an administration of the test?
How will meaning be attributed to scores on the test?

The last question points to the issue of norm-referenced versus criterion-referenced tests, and the approach to test development differs depending upon which is intended. A good item on a norm-referenced achievement test is an item for which high scorers on the test as a whole respond correctly, while low scorers on the test tend to get that same item wrong. Development of a criterion-oriented test or technique, by contrast, entails pilot work with at least two groups of test takers: one group known to have mastered the knowledge or skill being measured and another group known not to have mastered it. The items that best discriminate between these two groups would be considered "good" items.
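To make the two-group logic concrete, here is a minimal sketch in Python (not part of the original lesson). It flags items that best discriminate between a mastery and a non-mastery group; the response data, the group sizes, and the 0.30 cut-off are illustrative assumptions, not prescribed values.

```python
# Sketch: evaluating criterion-referenced items by comparing a mastery
# group with a non-mastery group. All data below are made up for
# illustration; real pilot work would use actual tryout responses.

# Each row is one test taker's scored responses (1 = correct, 0 = incorrect).
mastery_group = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 0],
]
non_mastery_group = [
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
]

def proportion_correct(group, item):
    """Proportion of the group answering the given item correctly."""
    return sum(person[item] for person in group) / len(group)

n_items = len(mastery_group[0])
for item in range(n_items):
    p_mastery = proportion_correct(mastery_group, item)
    p_non_mastery = proportion_correct(non_mastery_group, item)
    d = p_mastery - p_non_mastery  # larger gap = better discrimination
    # 0.30 is an arbitrary illustrative threshold for a "good" item.
    verdict = "good" if d >= 0.30 else "review"
    print(f"Item {item + 1}: mastery={p_mastery:.2f} "
          f"non-mastery={p_non_mastery:.2f} d={d:.2f} -> {verdict}")
```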
Pilot Work: In the context of test development, pilot study or pilot research refers to preliminary research surrounding the creation of a prototype of the test. Test items may be pilot studied to evaluate whether they should be included in the final form of the instrument. In pilot work, the test developer typically explores how best to measure the targeted construct. The process may involve the creation, revision, and deletion of many test items. Once pilot work has been completed, the process of test construction begins. Even then, the need for further pilot research is always a possibility, because tests require periodic updates and revisions.

2. Test Construction
Scaling may be defined as the process of setting rules for assigning numbers in measurement. In other words, scaling is the process by which values are assigned to different amounts of the attribute being measured.

Types of scales: Scales can be categorized along a continuum of level of measurement and referred to as nominal, ordinal, interval, or ratio scales. But scales can also be categorized in other ways.
Age scale: if test takers' performance on a test as a function of age is of critical interest, the test might be referred to as an age scale.
Grade scale: if test takers' performance on a test as a function of grade is of critical interest, the test might be referred to as a grade scale.
Stanine scale: if all raw scores on the test are to be transformed into scores that can range from 1 to 9, the test might be referred to as a stanine scale.

Scaling Methods: The Likert scale is used to scale attitudes. Likert scales are relatively easy to construct. Each item presents the test taker with five alternative responses, usually on an agree/disagree or approve/disapprove continuum. Likert (1932), after various experiments, concluded that assigning weights of 1 through 5 generally works best. Another scaling method is the method of paired comparisons. Test takers are presented with pairs of stimuli which they are asked to compare; they must select one member of each pair as, for example, more appealing than the other, and so on. For each pair of options, the test taker receives a higher score for selecting the option judged more justifiable by the majority of a group of judges. Another way of deriving ordinal information through a scaling system entails sorting tasks. In these approaches, printed cards, drawings, photographs, objects, or other such stimuli are typically presented to test takers for evaluation. One method of sorting, comparative scaling, entails judgments of a stimulus in comparison with every other stimulus on the scale. Categorical scaling is another scaling system that relies on sorting: stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum. All the foregoing methods yield ordinal data. The method of equal-appearing intervals, first described by Thurstone (1929), is one scaling method used to obtain data that are presumed to be interval.
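As an illustration of Likert-style scaling, the following Python sketch scores a short agree/disagree attitude scale with the 1-through-5 weights Likert recommended. The items, the responses, and the reverse-keyed (negatively worded) items are invented for the example; reverse keying is a common practice but is not spelled out in the lesson itself.

```python
# Sketch: scoring a five-point Likert attitude scale.
# Responses are coded 1 = strongly disagree ... 5 = strongly agree.
# Items and response data are hypothetical.

responses = {           # one test taker's answers, keyed by item number
    1: 5,               # e.g., "I enjoy statistics."  -> strongly agree
    2: 2,               # e.g., "Tests make me anxious." -> disagree
    3: 4,
    4: 1,
}
reverse_keyed = {2, 4}  # items worded against the trait; flip their weights

def score_item(item, response):
    """Return the weighted score, reversing negatively worded items."""
    return (6 - response) if item in reverse_keyed else response

total = sum(score_item(item, r) for item, r in responses.items())
print(f"Total attitude score: {total}")  # higher = more favorable attitude
```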
Writing Items: The process of test construction also involves considerations related to item writing. The three important considerations in this regard are:
What range of content should the items cover?
Which of the many different item formats should be employed?
How many items should be written?

When a standardized test with a multiple-choice response format is developed, it is usually advisable that the first draft contain approximately twice the number of items that the final version of the test will contain. Because roughly half of the items will be eliminated, the test developer should keep in mind that the final version must still sample the content domain adequately; this sampling provides a basis for the content validity of the final version of the test. The test developer may write a large number of items from personal experience. Help from experts can also be taken for item writing. In addition to experts, information for item writing can be obtained from the sample to be studied. Literature searches may also be a valuable source of ideas for item writing. Considerations related to variables such as the purpose of the test and the number of examinees to be tested at one time enter into decisions regarding the format of the test.

Item Formats: There are two types of format: the selected-response format and the constructed-response format.
Selected-response format: this format presents the examinee with a choice of answers and requires selection of one alternative. The types of selected-response items are multiple-choice, matching, and true/false items.
Constructed-response format: this format requires the examinee to provide or create the correct answer rather than merely selecting it. Three types of constructed-response items are the completion item, the short answer, and the essay. A completion item requires the examinee to provide a word or phrase that completes a sentence. A good short-answer item is written clearly enough that the test taker can indeed respond briefly, with a short answer. An essay item asks the examinee to discuss a single assigned topic in detail.

Scoring Items: There are many scoring models, but the most common is the cumulative model. The concept underlying this model is that the higher the score on the test, the higher the ability or trait being measured. For each response to a targeted item made in a particular way, the test taker earns cumulative credit with regard to a particular construct. The second model is a class model, in which test takers' responses earn credit toward placement in a particular class or category with other test takers whose pattern of scores is presumably similar in some way. A third scoring model is ipsative scoring. A typical objective in ipsative scoring is the comparison of a test taker's score on one scale within a test with his or her score on another scale within that same test.
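The cumulative and ipsative models can be contrasted with a small sketch (illustrative Python; the scales, items, and keyed responses are invented, and this is one reading of the models rather than the lesson's own procedure): cumulative scoring sums credit toward a construct, while the ipsative view compares scales within the same test taker.

```python
# Sketch: cumulative vs. ipsative scoring. Scales, items, and
# responses below are hypothetical.

# Each keyed answer earns 1 point of credit toward the scale it belongs to.
item_scale = {1: "assertiveness", 2: "sociability", 3: "assertiveness",
              4: "sociability", 5: "assertiveness"}
credited = {1: True, 2: False, 3: True, 4: True, 5: False}  # keyed responses

# Cumulative model: total credit per scale; higher total = more of the trait.
totals = {}
for item, scale in item_scale.items():
    totals[scale] = totals.get(scale, 0) + (1 if credited[item] else 0)
print("Cumulative scale scores:", totals)

# Ipsative view: compare the same person's scales against each other,
# rather than against other test takers.
ordered = sorted(totals, key=totals.get, reverse=True)
print("Relative (ipsative) ordering within this test taker:", ordered)
```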
Once all of the groundwork for a test has been laid and a draft of the test is ready for administration, the next step is, logically enough, test tryout.

3. Test Tryout
Having created a pool of items from which the final version of the test will be developed, the test developer next tries out the test on the kind of sample for which it is constructed. It is also important to consider how many subjects the test should be tried out on. A common guideline is no fewer than five subjects, and preferably as many as ten, for every item on the test; in general, the more subjects in the tryout, the better. The test tryout should be executed under conditions as similar as possible to the conditions under which the standardized test will be administered; everything from the test instructions and the time limits allotted for completing the test to the atmosphere at the test site should be as similar as possible.

What is a Good Item? The characteristics of a good test are also the characteristics of a good item. A good test is reliable and valid; similarly, a good test item should be reliable and valid. Further, a good test item helps to discriminate among test takers: a good test item is one that high scorers on the test as a whole get right, and an item that high scorers on the test as a whole do not get right is probably not a good item. Conversely, a good test item can be described as one that low scorers on the test as a whole get wrong; an item that low scorers on the test as a whole get right may not be a good item. After the first draft has been administered to a representative group of examinees, it remains for the test developer to analyze test scores and responses to individual items. At this stage the test undergoes different types of statistical analyses, collectively called item analysis.

4. Item Analysis
In item analysis, different statistical procedures are employed in order to select the best items from the pool of tryout items. Among the tools the test developer employs to analyze and select items are an index of the item's difficulty, an item-validity index, an item-reliability index, and an index of item discrimination. (A short computational sketch of two of these indices appears at the end of this lesson.)

Qualitative Item Analysis: Though statistical procedures are employed for item analysis, there are also non-quantitative methods that employ verbal rather than mathematical techniques. Through the use of simple questionnaires or individual or group discussions with test takers, the test developer can obtain valuable information on how the test could be improved.

5. Test Revision
A great amount of information is gathered at the item-analysis stage. On the basis of that information, some items from the original pool will be eliminated and others will be rewritten. One approach is to characterize each item according to its strengths and weaknesses. The test developer may sometimes find it necessary to balance strengths and weaknesses across items; for example, if many otherwise good items tend to be somewhat easy, the test developer may purposely include some more difficult items. Having balanced all these concerns, the test developer comes out of the revision stage with a test of improved quality. The next step is to administer the revised test under standardized conditions. On the basis of the item analysis of the data derived from this administration of the second draft, the test developer may consider the test to be in its finished form.
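As promised above, the sketch below (illustrative Python; the response data and the upper/lower split are assumptions, not the lesson's prescriptions) computes two of the indices named in the item-analysis section: an item-difficulty index, the proportion of examinees answering the item correctly, and an item-discrimination index, the difference in that proportion between high scorers and low scorers on the test as a whole.

```python
# Sketch: item difficulty (p) and item discrimination (d) from tryout data.
# Rows are test takers, columns are items; 1 = correct, 0 = incorrect.
# The data and the top/bottom-half split are illustrative assumptions.

data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]

n_items = len(data[0])

# Item difficulty: proportion of all examinees getting the item right.
difficulty = [sum(row[i] for row in data) / len(data) for i in range(n_items)]

# Item discrimination: p in the upper-scoring half minus p in the lower half
# (many texts use the upper/lower 27%; halves keep this tiny example simple).
ranked = sorted(data, key=sum, reverse=True)
upper, lower = ranked[: len(ranked) // 2], ranked[len(ranked) // 2 :]

def p(group, i):
    """Proportion of the group answering item i correctly."""
    return sum(row[i] for row in group) / len(group)

for i in range(n_items):
    d = p(upper, i) - p(lower, i)
    print(f"Item {i + 1}: difficulty p={difficulty[i]:.2f}, "
          f"discrimination d={d:.2f}")
```

©copyright Virtual University of Pakistan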