Midterms Psych Assessment Reviewer (Test Development)
Summary
This document is a reviewer for a midterms exam in psychology, focusing on test development. It covers test conceptualization, test construction, scaling methods (including rating scales, Likert scales, and paired comparisons), item writing, and item analysis, along with test revision strategies and quality assurance. It also covers the measurement of non-cognitive constructs such as attitudes, beliefs, interests, values, and dispositions.
Full Transcript
MIDTERMS PSYCH ASSESSMENT REVIEWER

TEST DEVELOPMENT

The Five Stages of Test Development
Test conceptualization ➡️ Test construction ➡️ Test tryout ➡️ Analysis ➡️ Revision (then back to test tryout)
➔ Test development is an umbrella term for all that goes into the process of creating a test.

★ Test Conceptualization
The impetus for developing a new test is some thought that "there ought to be a test for...". The stimulus could be knowledge of psychometric problems with other tests, a new social phenomenon, or any number of other things. There may be a need to assess mastery in an emerging occupation.

Preliminary questions:
○ What is the test designed to measure?
○ What is the objective of the test?
○ Is there a need for this test?
○ Who will use this test?
○ Who will take this test?
○ What content will the test cover?

❖ Item Development in Norm-Referenced and Criterion-Referenced Tests
Generally, a good item on a norm-referenced achievement test is an item that high scorers on the test respond to correctly and low scorers respond to incorrectly.
Ideally, each item on a criterion-referenced test addresses whether the respondent has met certain criteria.
Development of a criterion-referenced test may entail exploratory work with at least two groups of testtakers: one group known to have mastered the knowledge or skill being measured and another group known not to have mastered it.
Test items may be pilot studied to evaluate whether they should be included in the final form of the instrument.

★ Test Construction
❖ Scaling
The process of setting rules for assigning numbers in measurement.
❖ Types of scales
Scales are instruments used to measure some trait, state, or ability. They may be categorized in many ways (e.g., unidimensional, multidimensional, etc.).
L.L. Thurstone ➔ was very influential in the development of sound scaling methods.

★ Scaling Methods
Numbers can be assigned to responses to calculate test scores using a number of methods.
Rating Scales
○ A grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker.
Likert Scale
○ Each item presents the testtaker with five alternative responses (sometimes seven), usually on an agree–disagree or approve–disapprove continuum.
○ Likert scales are typically reliable.
All rating scales result in ordinal-level data. Some rating scales are unidimensional, meaning that only one dimension is presumed to underlie the ratings; others are multidimensional, meaning that more than one dimension is thought to underlie the ratings.
Method of Paired Comparisons
○ Testtakers must choose between two alternatives according to some rule.
○ For each pair of options, testtakers receive a higher score for selecting the option deemed more justifiable by the majority of a group of judges.
○ The test score reflects the number of times the testtaker's choices agreed with those of the judges.
Comparative Scaling
○ Entails judgments of a stimulus in comparison with every other stimulus on the scale.
Categorical Scaling
○ Stimuli (e.g., index cards) are placed into one of two or more alternative categories.
Guttman Scale
○ Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured.
○ All respondents who agree with the stronger statements of the attitude will also agree with the milder statements.
The method of equal-appearing intervals (Thurstone) can be used to obtain data that are interval in nature.
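As an illustration of cumulative scoring of a Likert scale, here is a minimal Python sketch. The number of items, the reverse-keyed item, and the 5-point coding are hypothetical choices for illustration, not taken from any specific instrument.

```python
# Minimal sketch of cumulative scoring for a 5-point Likert scale.
# The items and the reverse-keyed item are hypothetical examples.

LIKERT_MAX = 5  # 1 = strongly disagree ... 5 = strongly agree

def score_likert(responses, reverse_keyed=()):
    """Sum item ratings, flipping reverse-keyed items (1<->5, 2<->4)."""
    total = 0
    for i, rating in enumerate(responses):
        if not 1 <= rating <= LIKERT_MAX:
            raise ValueError(f"item {i}: rating {rating} out of range")
        if i in reverse_keyed:
            rating = (LIKERT_MAX + 1) - rating
        total += rating
    return total

# A respondent's ratings on four hypothetical items; item 2 is reverse-keyed.
print(score_likert([4, 5, 2, 3], reverse_keyed={2}))  # -> 4 + 5 + 4 + 3 = 16
```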
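The paired-comparisons scoring rule described above (credit each time the testtaker's choice matches the option the judges deemed more justifiable) can be sketched the same way; the option pairs and the judge key below are hypothetical.

```python
# Sketch of method-of-paired-comparisons scoring: one point each time the
# testtaker picks the option a panel of judges preferred for that pair.
# The option pairs and the judge-preferred key are hypothetical.

judge_key = {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "B"}

def paired_comparison_score(choices):
    """choices maps each pair to the option the testtaker selected."""
    return sum(1 for pair, pick in choices.items() if judge_key[pair] == pick)

print(paired_comparison_score({("A", "B"): "A",
                               ("A", "C"): "A",
                               ("B", "C"): "B"}))  # agrees on 2 of 3 pairs -> 2
```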
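A Guttman scale's defining property (agreeing with a stronger statement implies agreeing with all milder ones) can also be checked mechanically. This is a minimal sketch with hypothetical response patterns.

```python
# Sketch of a Guttman-pattern check: with items ordered from mildest to
# strongest, a perfectly scalable respondent endorses some prefix of the
# list and nothing beyond it. Patterns are hypothetical (1 = agree).

def is_guttman_consistent(responses):
    """True if no stronger item is endorsed after a milder one is rejected."""
    seen_rejection = False
    for r in responses:
        if r == 1 and seen_rejection:
            return False  # endorsed a stronger item after rejecting a milder one
        if r == 0:
            seen_rejection = True
    return True

print(is_guttman_consistent([1, 1, 1, 0, 0]))  # True: clean cumulative pattern
print(is_guttman_consistent([1, 0, 1, 0, 0]))  # False: scale error
```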
★ Writing Items
➔ Item pool
The reservoir or well from which items will or will not be drawn for the final version of the test (a test bank). Comprehensive sampling from the item pool provides a basis for the content validity of the final version of the test.
➔ Item format
Includes variables such as the form, plan, structure, arrangement, and layout of individual test items.
◆ Selected-response format – items require testtakers to select a response from a set of alternative responses.
◆ Constructed-response format – items require testtakers to supply or to create the correct answer, not merely to select it.
The multiple-choice format has three elements: (1) a stem, (2) a correct alternative or option, and (3) several incorrect alternatives or options variously referred to as distractors or foils.
Other commonly used selected-response formats include matching and true–false items.

Writing Items for Computer Administration
Item bank – a relatively large and easily accessible collection of test questions.
➔ Computerized adaptive testing (CAT)
An interactive, computer-administered test-taking process wherein the items presented to the testtaker are based in part on the testtaker's performance on previous items. CAT provides economy in testing time and in the number of items presented, and it tends to reduce floor effects and ceiling effects.

★ Scoring Items
Cumulative scoring
○ Assumes that the higher the score on the test, the higher the testtaker stands on the ability, trait, or other characteristic the test purports to measure.
Class scoring
○ Responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way (e.g., diagnostic testing).
Ipsative scoring
○ Compares a testtaker's score on one scale within a test with the testtaker's score on another scale within that same test.

★ Test Tryout
The test should be tried out on the same population for which it was designed, with roughly 5–10 respondents per item. It should be administered in the same manner, and with the same instructions, as the final product.

What is a good item?
A good item is reliable and valid, and it discriminates among testtakers – high scorers on the test overall answer the item correctly.

★ Item Analysis
The nature of the item analysis will vary depending on the goals of the test developer. Among the tools test developers might employ to analyze and select items are:
○ an index of the item's difficulty
○ an index of the item's reliability
○ an index of the item's validity
○ an index of item discrimination

Item-Difficulty Index
The proportion of respondents answering an item correctly. For maximum discrimination among the abilities of the testtakers, the optimal average item difficulty is approximately .5, with individual items on the test ranging in difficulty from about .3 to .8.

Item-Reliability Index
An indication of the internal consistency of the scale. Factor analysis can also indicate whether items that are supposed to measure the same thing load on a common factor.

Item-Validity Index
Allows test developers to evaluate the validity of items in relation to a criterion measure.

Item-Discrimination Index
Indicates how adequately an item separates or discriminates between high scorers and low scorers on the entire test.
It is a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly.

Analysis of Item Alternatives
The quality of each alternative within a multiple-choice item can be readily assessed with reference to the comparative performance of upper and lower scorers.

❖ Item-Characteristic Curves (ICC)
A graphic representation of item difficulty and discrimination.

Other Considerations in Item Analysis
Guessing
◆ Test developers and users must decide whether they wish to correct for guessing, but to date no entirely satisfactory solution for correcting for guessing has been achieved.
Item fairness
◆ The degree, if any, to which a test item is biased. A biased test item is one that favors one particular group of examinees in relation to another when differences in group ability are controlled.
Speed tests
◆ Item analyses of tests taken under speed conditions yield misleading or uninterpretable results. The closer an item is to the end of the test, the more difficult it may appear to be.

Qualitative Item Analysis
Qualitative methods: techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.
Qualitative item analysis: a general term for various nonstatistical procedures designed to explore how individual test items work.
Think-aloud test administration: respondents are asked to verbalize their thoughts as they occur during testing.
Expert panels: experts may be employed to conduct a qualitative item analysis.
Sensitivity review: items are examined in relation to fairness to all prospective testtakers; they are checked for offensive language, stereotypes, etc.

★ Test Revision
Revision in New Test Development
Items are evaluated as to their strengths and weaknesses – some items may be eliminated, and some may be replaced by others from the item pool. The revised test is then administered under standardized conditions to a second sample. Once the test has been finalized, norms may be developed from the data, and the test is said to be standardized.

Revision in the Life Cycle of a Test
Existing tests may be revised if the stimulus material or verbal material is dated, if outdated words have become offensive, if the norms no longer represent the population, if the psychometric properties could be improved, or if the underlying theory behind the test has changed. In test revision, the same steps are followed as with new tests (i.e., test conceptualization, construction, tryout, item analysis, and revision).

Cross-Validation and Co-Validation
Cross-validation refers to the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion. Item validities inevitably become smaller when a test is administered to a second sample – a phenomenon known as validity shrinkage.
Co-validation is a test-validation process conducted on two or more tests using the same sample of testtakers. Co-validation is economical for test developers.

Quality Assurance
Test developers employ examiners who have experience testing members of the population targeted by the test. Examiners follow standardized procedures and undergo training.
Anchor protocols are also used in quality assurance. An anchor protocol is a test protocol scored by a highly authoritative scorer that is designed as a model for scoring and as a mechanism for resolving scoring discrepancies. A discrepancy between the scoring in an anchor protocol and the scoring of another protocol is referred to as scoring drift.
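The CAT process described in the Writing Items section adapts item selection to performance. This is a deliberately simplified sketch: the item pool, the halving step-size update, and the fixed-length stopping rule are all hypothetical, and operational CATs use IRT-based ability estimation instead.

```python
# Highly simplified CAT sketch: present the unused item whose difficulty is
# closest to the current ability estimate, then nudge the estimate up or down.
# Item difficulties, the update rule, and the stopping rule are hypothetical.

items = {"i1": -1.0, "i2": -0.5, "i3": 0.0, "i4": 0.5, "i5": 1.0}  # difficulty

def run_cat(answer, n_items=3):
    """answer(item_id) -> True/False; returns the final ability estimate."""
    theta, step, used = 0.0, 1.0, set()
    for _ in range(n_items):                      # fixed-length stopping rule
        item = min((i for i in items if i not in used),
                   key=lambda i: abs(items[i] - theta))
        used.add(item)
        theta += step if answer(item) else -step  # move toward the testtaker
        step /= 2                                 # shrink steps as we learn more
    return theta

# Simulated testtaker who answers correctly whenever the item is easy enough.
print(run_cat(lambda item: items[item] <= 0.6))
```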
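As a sketch of how the item-difficulty index (p) and item-discrimination index (d) defined above are computed: the response matrix below and the upper/lower 27% split are hypothetical choices for illustration.

```python
# Sketch: item-difficulty index (p) and item-discrimination index (d).
# Rows are testtakers, columns are items; 1 = correct, 0 = incorrect.
# The data and the 27% upper/lower split are hypothetical illustrations.

scores = [
    [1, 1, 1, 0],  # each row: one testtaker's item responses
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]

def difficulty(item):
    """p = proportion of all testtakers answering the item correctly."""
    col = [row[item] for row in scores]
    return sum(col) / len(col)

def discrimination(item, fraction=0.27):
    """d = p(upper group) - p(lower group), groups formed by total score."""
    ranked = sorted(scores, key=sum, reverse=True)
    n = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n], ranked[-n:]
    p = lambda grp: sum(r[item] for r in grp) / len(grp)
    return p(upper) - p(lower)

for i in range(4):
    print(f"item {i}: p = {difficulty(i):.2f}, d = {discrimination(i):+.2f}")
```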
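In classical test theory, the item-reliability and item-validity indices are usually computed as the product of the item-score standard deviation and a correlation coefficient. The source gives no formulas, so the notation below is an assumed but standard formulation:

```latex
\text{item-reliability index} = s_i \, r_{iT},
\qquad
\text{item-validity index} = s_i \, r_{iC}
```

where s_i is the standard deviation of scores on item i, r_iT is the correlation between the item score and the total test score, and r_iC is the correlation between the item score and the criterion measure.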
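Although the guessing entry above notes that no entirely satisfactory correction exists, the formula most often discussed in this context is the classic correction that penalizes wrong answers. It is shown here as background, not as the source's recommendation:

```latex
S_{\text{corrected}} = R - \frac{W}{k - 1}
```

where R is the number of items answered correctly, W is the number answered incorrectly (omitted items are not counted), and k is the number of response options per item.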
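The next section turns to IRT, in which each item-characteristic curve plots the probability of a correct response against the underlying ability θ. The source does not specify a model, but a commonly used form is the two-parameter logistic (2PL):

```latex
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
```

where b_i is the item's location (difficulty) and a_i its slope (discrimination), mirroring the two properties an ICC is said to represent.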
The Use of IRT in Building and Revising Tests
Items are evaluated using item-characteristic curves (ICCs), in which performance on items is related to underlying ability. Three possible applications of IRT in building and revising tests include:
★ evaluating existing tests for the purpose of mapping test revisions,
★ determining measurement equivalence across testtaker populations, and
★ developing item banks.

NON-COGNITIVE CONSTRUCTS

The Nature of Non-Cognitive Constructs
Human behavior is composed of multiple dimensions. Behaviors are the characteristic ways in which one thinks, feels, and acts while interacting with the environment. Affective characteristics are further classified according to specific variables such as attitudes, beliefs, interests, values, and dispositions.

Attitudes
➔ Learned predispositions to respond in a consistently favorable or unfavorable manner with respect to a given object.
➔ Favorable or unfavorable evaluative reactions, whether exhibited in beliefs, feelings, or inclinations to act toward something.
➔ Psychologists agree that knowing people's attitudes helps predict their actions; attitudes involve evaluations.
➔ An example of an attitude scale is the Attitude Toward the Church Scale by Thurstone and Chave. The scale measures the respondent's position on a continuum ranging from strong depreciation to strong appreciation of the church.

Beliefs
➔ Beliefs are judgments and evaluations that we make about ourselves, about others, and about the world around us.
➔ Beliefs are generalizations about things such as causality or the meaning of specific actions.
➔ Examples of belief statements made in the educational environment are:
◆ "A quiet classroom is conducive to learning."
◆ "Studying longer will improve a student's score on the test."
◆ "Grades encourage students to work harder."
➔ An example of a measure of belief is the Schommer Epistemological Questionnaire, which Schommer (1990) developed to assess beliefs about knowledge and learning.
➔ An Asian version of the Schommer Epistemological Questionnaire has been validated with a sample of 285 Filipino college students. The questionnaire was revised to have fewer items and simpler expression of ideas so as to be more appropriate for Asian learners; the number of statements was reduced to ensure that participants would not be placed under stress while completing the questionnaire. Students rate their degree of agreement with each item on a 5-point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree).

Interests
➔ Interest generally refers to an individual's strengths, needs, and preferences.
➔ Strong (1955) defined interests as "a liking/disliking state of mind accompanying the doing of an activity" (p. 138).
➔ According to Holland's theory, there are six vocational interest types.
➔ Examples of affective measures of interest are the Strong-Campbell Interest Inventory, the Strong Interest Inventory (SII), the Jackson Vocational Interest Inventory, the Guilford-Zimmerman Interest Inventory, and the Kuder Occupational Interest Survey.

Values
➔ Values refer to "the principles and fundamental convictions which act as general guides to behavior, the standards by which particular actions are judged to be good or desirable."
➔ Values are internalized and learned at an early stage in life.
➔ The school setting is one major avenue where people show how values are learned, respected, and upheld.
➔ Examples of values are diligence, respect for authority, emotional restraint, filial piety, and humility.
➔ An example of a measure of values is the Asian Values Scale-Revised (AVS-R), a 25-item instrument designed to measure an individual's adherence to Asian cultural values: the enculturation process and the maintenance of one's native cultural values and beliefs.

Dispositions
➔ Dispositions are the values, commitments, and professional ethics that influence behaviors toward students, families, colleagues, and communities and affect student learning, motivation, and development, as well as the educator's own professional growth.
➔ Dispositions are guided by beliefs and attitudes related to values such as caring, fairness, honesty, responsibility, and social justice.
➔ Examples of dispositions include fairness, being democratic, empathy, enthusiasm, thoughtfulness, and respectfulness.
➔ Disposition measures are also created for metacognition, self-regulation, self-efficacy, approaches to learning, and critical thinking.

Steps in Constructing Non-Cognitive Measures
1. Decide what information should be sought.
2. Write the first draft of items.
3. Good questionnaire items should:
a. Include vocabulary that is simple, direct, and familiar to all respondents
b. Be clear and specific
c. Not involve leading, loaded, or double-barreled questions
d. Be as short as possible
e. Include all conditional information prior to the key ideas
f. Be edited for readability
4. Select a scaling technique.
5. Develop directions for responding.
6. Conduct a judgmental review of items.
7. Reexamine and revise the questionnaire.
8. Prepare a draft and gather preliminary pilot data.
9. Analyze the pilot data.
10. Revise the instrument.
11. Gather final pilot data.
12. Conduct additional validity and reliability analyses (see the sketch after this section).
13. Edit the questionnaire and specify the procedures for its use.
14. Prepare the test manual.

Response Formats
Conventional scale types (commonly used in surveys):
Likert Scale
Verbal Frequency Scale
Ordinal Scale: also a multiple-choice item, but the response alternatives do not stand in any fixed relationship with one another. The responses are ordinal because each listed category comes before the next one.
Forced Ranking Scale: produces ordinal values; the items are each ranked relative to one another. This scaling technique obtains not only the most preferred item but also the sequence of the remaining items.
Paired Comparison Scale
Comparative Scale
Linear, Numeric Scale: used in judging a single dimension, arrayed on a scale with equal intervals. The scale is characterized by a simple, linear, numeric scale with extremes labeled appropriately.
Semantic Differential Scale: with this scaling device, the image of a brand, store, political candidate, company, organization, institution, or idea can be measured, assessed, and compared with that of a similar topic. The areas investigated are called entities.
Adjective Checklist
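Steps 9 and 12 above call for analyzing pilot data and running reliability analyses. A common internal-consistency statistic for Likert-type questionnaires is Cronbach's alpha; here is a minimal sketch with hypothetical pilot ratings.

```python
# Sketch: Cronbach's alpha for pilot questionnaire data (steps 9 and 12).
# Rows are respondents, columns are items; ratings are hypothetical 1-5 values.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance

def cronbach_alpha(data):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))."""
    k = len(data[0])                       # number of items
    item_vars = [variance([row[i] for row in data]) for i in range(k)]
    total_var = variance([sum(row) for row in data])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

pilot = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 3, 3, 2],
    [4, 4, 5, 4],
]
print(f"alpha = {cronbach_alpha(pilot):.2f}")
```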