Constructing Written Test Questions PDF

Constructing Written Test Questions For the Basic and Clinical Sciences
Third Edition

Susan M. Case
David B. Swanson

National Board of Medical Examiners
3750 Market Street
Philadelphia, PA 19104

Printed copies of the publication are not available from the National Board of Medical Examiners (NBME). Additional copies can be obtained by downloading the manual from the NBME’s Web site. Permission to copy and distribute this document is granted by the NBME provided that (1) the copyright and permission notices appear on all reproductions; (2) use of the document is for noncommercial, educational, and scientific purposes only; and (3) the document is not modified in any way. Any rights not expressly granted herein are reserved by the NBME.

Copyright © 1996, 1998 National Board of Medical Examiners® (NBME®). Copyright © 2001 National Board of Medical Examiners® (NBME®). All rights reserved. Printed in the United States of America.

Table of Contents

Section I Issues Regarding Format and Structure of Test Questions
  Chapter 1. Introduction
    Assessment: An Important Component of Instruction
    Issues of Sampling
    Importance of Psychometric Considerations
  Chapter 2. Multiple-Choice-Item Formats
    True/False vs One-Best-Answer Questions
    The True/False Family
    The One-Best-Answer Family
    The Bottom Line on Item Formats
  Chapter 3. Technical Item Flaws
    Issues Related to Testwiseness
    Issues Related to Irrelevant Difficulty
    Summary of Technical Item Flaws
    Use of Imprecise Terms in Examination Questions

Section II Writing One-Best-Answer Questions for the Basic and Clinical Sciences
    The Basic Rules for One-Best-Answer Items
  Chapter 4. Item Content: Testing Application of Basic Science Knowledge
    Item Content for the Basic Sciences
    Item Templates
    Additional Templates
    Types of Questions and Sample Lead-ins and Option Lists
    Writing the Options: Altering Item Difficulty
    Item Shape
    Problem-Based Learning and Use of Case Clusters
    Sample Items for the Basic Sciences
  Chapter 5. Item Content: Testing Application of Clinical Science Knowledge
    Methods for Assessment
    General Issues Regarding What to Test
    Testing Recall of Isolated Facts or Application of Knowledge
    Writing One-Best-Answer Items
    Fine Points on Item Stems
    Verbosity, Window Dressing, and Red Herrings: Do They Make a Better Test Item?
    Writing Items Related to Physician Tasks
    Writing Items on Difficult Topics

Section III Extended-Matching Items
  Chapter 6. Extended-Matching (R-Type) Items
    Avoiding Flaws When You Write Extended-Matching Items for Your Own Examination
    Sample Lead-ins and Topics for Option Lists
    More on Options for R-Sets
    Writing the Item Stems
    Sample Good and Bad Item Stems Using the Same Option List
    Overview of the Steps for Writing Extended-Matching Items
    Sample Extended-Matching Sets
    Steps for Organizing a Group to Write Clinical R-sets
    Form for Writing R-Sets
    Sample SPSSX Code to Score Multiple-Choice Tests Including Extended-Matching Items
    Comparison of Items in Five-Option and Extended-Matching Format
    A’s to R’s and Back Again
  Chapter 7. Pick N Items: An Extension of the Extended-Matching Format

Section IV Additional Issues
  Chapter 8. Interpretation of Item Analysis Results
  Chapter 9. Establishing a Pass/Fail Standard
    Definitions and Basic Principles
    Two Standard-Setting Methods Based on Judgements about Items
    Relative/Absolute Compromise Standards: The Hofstee Method
  Chapter 10. Miscellaneous Thoughts on Topics Related to Testing

Appendix A. The Graveyard of NBME Item Formats
Appendix B. Sample Item-Writing Templates, Items, Lead-Ins, and Option Lists For the Basic and Clinical Sciences

Preface to Third Edition

This manual was written to help faculty members improve the quality of the multiple-choice questions written for their examinations. The manual provides an overview of item formats, concentrating on the traditional one-best-answer and matching formats. It reviews issues related to technical item flaws and issues related to item content. The manual also provides basic information to help faculty review statistical indices of item quality after test administration. An overview of standard-setting techniques is also provided. Issues related to exam blueprinting are not addressed in any detail. We have focused almost exclusively on the item level, leaving exam-level planning for another manuscript.
We anticipate that this manual will be useful primarily to faculty who are teaching medical students in basic science courses and clinical clerkships. The examples focus on undergraduate medical education, though the general approach to item writing may be useful for assessing examinees at other levels.

This manual reflects lessons that we have learned in developing items and tests over the past 20 years. During this period, we have reviewed (quite literally) tens of thousands of multiple-choice questions and have conducted item-writing workshops for thousands of item writers preparing USMLE, NBME, and specialty board examinations as well as faculty at more than 60 medical schools developing test questions for their own examinations. Each workshop attendee has helped us to frame our thoughts regarding how to write better quality test questions, and, over the years, we have become better able (we believe) to articulate the whys and wherefores. We hope this manual helps to communicate these thoughts.

Susan M. Case, PhD
David B. Swanson, PhD
January 1998

Section I
Issues Regarding Format and Structure of Test Questions

This section reviews structural issues important for the construction of high-quality test questions. The following section will review issues related to item content.

Chapter 1
Introduction

Assessment: An Important Component of Instruction

Assessment is a critical component of instruction; properly used, it can aid in accomplishing key curricular goals. The impact of decisions regarding how and when to evaluate the knowledge and performance of your students cannot be overestimated.

A primary purpose of testing is to communicate what you view as important. Tests are a powerful motivator, and students will learn what they believe you value. Assessment also helps to fill instructional gaps by encouraging students to read broadly on their own and to participate broadly as educational opportunities are available.
This outcome of testing is especially important in the clerkships, where the curriculum may vary from student to student, depending on factors such as the clinical setting and the random flow of patients. This outcome may also be important in some basic-science settings (eg, problem-based learning), where the educational experiences may vary from student to student.

Because tests have such a powerful influence on student learning, it is important to develop tests that will further your educational goals. Introduction of a hands-on clinical skills test drives students out of the library into the clinic, where they may seek help with their physical-exam skills; introduction of a test assessing only recall of isolated facts, on the other hand, drives them to “cram” course review books. This manual focuses on how to write high-quality, multiple-choice questions that assess skill in interpreting data and making decisions, which we believe are important components of clinical skills.

Students’ paths toward mastery or even excellence will be less rocky if they receive ongoing feedback on their progress.

Purposes of Testing
  Communicate to students what material is important
  Motivate students to study
  Identify areas of deficiency in need of remediation or further learning
  Determine final grades or make promotion decisions
  Identify areas where the course/curriculum is weak

What Should Be Tested?
  Exam content should match course/clerkship objectives
  Important topics should be weighted more heavily than less important topics
  The testing time devoted to each topic should reflect the relative importance of the topic
  The sample of items should be representative of the instructional goals

Issues of Sampling

The purpose of any assessment is to permit inferences to be drawn concerning the skills of examinees: inferences that extend beyond the particular problems (or, equivalently, cases or test questions) included in the exam to the larger domain from which the cases (or questions) are sampled.

It’s clear to all of us that assessment takes time. It’s also clear that, if you increase time spent in one activity, you have to decrease time spent in other activities. Whether you’re deciding on an overall plan for evaluation or you’re deciding what to include on a single test, you’re basically faced with a sampling problem.

With multiple-choice questions (MCQs), you first need to decide what you want to include on the test. The amount of attention given to evaluating something should reflect its relative importance. You need to sample topics and also sample skills (eg, determining the diagnosis, deciding on the next step in management); you cannot ask everything. Performance on the sample provides a basis for estimating achievement in the broader domain that is actually of interest. The nature of the sample determines the extent to which the estimate of true ability is reproducible (reliable, generalizable) and accurate (valid).
If the sample is not representative of the broader domain of interest (eg, including only cardiovascular-related content in a test of competence in general medical practice), exam results will be biased and will not provide a good basis for estimating achievement in the domain of interest. If the sample is too small, exam results may not be stable enough to ensure that they reflect true ability.

With a multiple-choice test, there’s almost always one grader (usually the computer) and a series of questions or sets of questions; sampling involves selecting a subset of questions to include on the test. With other evaluation methods (eg, oral exams based on patient cases, standardized patient exams, essay exams), the sampling is much more complicated. Any method that can’t be scored mechanically requires sampling on a second dimension: the dimension of grader. In these exams, you are interested in performance across a range of cases and you want the grade to be independent of who the examiner is. You therefore need to sample across two dimensions: one for the questions or cases and one for the judges or raters. You need to sample across a range of cases, because performance on one case is not a very good predictor of performance on other cases. You also need to sample across different raters to minimize the effects of rater harshness or leniency, and other issues like halo that cause problems in the consistency of scoring across raters. With broad samples, peaks and valleys in performance and peaks and valleys in rater differences tend to average out.

Although this manual focuses on multiple-choice questions, we believe that it is generally appropriate to use a variety of testing methods. No one method is likely to assess all the skills of interest. It should also be noted that the method used for assessment does not directly affect test quality, nor does it determine the component of competence measured by the test.
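The point that a small sample of questions gives unstable results can be illustrated with a rough calculation. The sketch below is not taken from the manual. It assumes, purely for illustration, that items are independent and that an examinee answers each one correctly with the same probability; the function name `score_standard_error` and the value 0.7 are hypothetical.

```python
import math


def score_standard_error(p_correct: float, n_items: int) -> float:
    """Standard error of a proportion-correct score.

    Illustrative assumption: the n_items questions are independent and
    the examinee answers each correctly with probability p_correct.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    return math.sqrt(p_correct * (1.0 - p_correct) / n_items)


if __name__ == "__main__":
    # Compare short and long tests for a hypothetical examinee
    # whose chance of answering any item correctly is 0.7.
    for n in (10, 25, 100, 400):
        print(n, round(score_standard_error(0.7, n), 3))
```

Under these simplifying assumptions, quadrupling the number of items halves the standard error of the score, which is one way to see why broad sampling makes scores more reproducible.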
Importance of Psychometric Considerations

The extent to which the psychometric characteristics of an assessment method are important is determined by the purpose of the test and the decisions that will be made based on the results. For “high-stakes” tests (those used for promotion or graduation decisions, even course grades), test results must be reasonably reproducible (precise, reliable) and accurate (valid). For “low-stakes” tests, the psychometric characteristics are less important, and the primary consideration should be directing student learning. As noted above, in order to generate a reproducible score, you need to sample content broadly (ie, typically, a dozen or more cases, 100 or more short-answer questions or MCQs).

The following papers include more detail about assessment issues in general:

Swanson DB. A measurement framework for performance based tests. In: Hart I, Harden R, eds. Further Developments in Assessing Clinical Competence. Montreal: Can-Heal Publications; 1987:13-45.

Case SM. Assessment of truths that we hold as self-evident and their implications. In: Scherpbier AJJA, Van der Vleuten CPM, Rethans JJ, Van der Steeg AFW, eds. Advances in Medical Education. Dordrecht, The Netherlands: Kluwer Academic Publishers; 1997:2-6.

The following three papers discuss item format issues in greater detail:

Case SM, Swanson DB, Ripkey DR. Comparison of items in five-option and extended-matching format for assessment of diagnostic skills. Academic Medicine. 1994;69(suppl):S1-S3.

Swanson DB, Case SM. Trends in written assessment: a strangely biased perspective. In: Harden R, Hart I, Mulholland H, eds. Approaches to Assessment of Clinical Competence. Norwich, England: Page Brothers; 1992:38-53.

Case SM, Downing SM. Performance of various multiple-choice item types on medical specialty examinations: types A, B, C, K and X. Proceedings of the Twenty-Eighth Annual Conference of Research in Medical Education (RIME). 1989:167-172.
Chapter 2
Multiple-Choice-Item Formats

In order for a test question to be a good one, it must satisfy two basic criteria. First, the test question must address important content. This is an essential condition, which will be addressed further along in the manual. Obviously, item content is of critical importance, but, in and of itself, focusing on important content is not sufficient to guarantee that your test question is a good one. Items that attempt to assess critically important topics cannot do so unless they are well-structured: avoiding flaws that benefit the testwise examinee and avoiding irrelevant difficulty are prerequisites that must be met in order for test questions to generate valid scores.

True/False vs One-Best-Answer Questions

The universe of multiple-choice questions (MCQs) can be divided into two families of items: those that require the examinee to indicate all responses that are appropriate (true/false) and those that require the examinee to indicate a single response (one best answer). Each family is represented by several specific formats, as listed below:

True/false-item formats require that examinees select all options that are true
  C (A / B / Both / Neither items)
  K (complex true/false items)
  X (simple true/false items)
  Simulations such as Patient Management Problems (PMPs)

One-best-answer item formats require that examinees select the single best response
  A (4 or more options, single items or sets)
  B (4 or 5 option matching items in sets of 2-5 items)
  R (Extended-Matching items in sets of 2-20 items)

The letters used to label the item formats hold no intrinsic meaning. Letters have been assigned more or less sequentially to new item formats as they are developed (see Appendix A).

The True/False Family

The true/false and one-best-answer families pose very different tasks for the examinee.
True/false items require an examinee to select all the options that are “true.” For these items, the examinee must decide where to make the cut-off: to what extent must a response be “true” in order to be keyed as “true”? While this task requires additional judgment (beyond what is required in selecting the one best answer), this additional judgment may be unrelated to clinical expertise or knowledge. Too often, examinees have to guess what the item writer had in mind because the options are not either completely true or completely false.

The following is an example of an acceptable true/false item from a structural perspective.* Note that the stem is clear and the options are absolutely true or false with no ambiguity.

    Which of the following is/are X-linked recessive conditions?
    1. Hemophilia A (classic hemophilia)
    2. Cystic fibrosis
    3. Duchenne’s muscular dystrophy
    4. Tay-Sachs disease

The options can be diagramed as follows.

    Totally Wrong Options: 2, 4
    Totally Correct Options: 1, 3

*Following tradition, for true/false items, the options are numbered; for one-best-answer items, the options are lettered.

This true/false item is flawed. Options 1, 2, and 3 cannot be judged as absolutely true or false; a group of experts would not agree on the answers. In thinking about Option 1, note that the incidence is not exactly 1:2000; experts would want more information: Is this in the USA? Is this among all ethnic groups? Modifying the language to “approximately 1:2000” doesn’t help, since the band is not specified. Similar issues arise with Options 2 and 3, while Option 4 is clear.

    True statements about cystic fibrosis (CF) include:
    1. The incidence of CF is 1:2000.
    2. Children with CF usually die in their teens.
    3. Males with CF are sterile.
    4. CF is an autosomal recessive disease.

While written in jest (by the second author), this true/false item illustrates a common problem: items for which the stem is unclear. Depending on your perspective, Options 1, 2, and 3 might be true; alternatively, 1, 2, and 3 might be false while 4 is true.

    The way to a man’s heart is through his
    1. aorta
    2. pulmonary arteries
    3. pulmonary veins
    4. stomach

In this true/false example, there are vague terms in the options that provide cues to the testwise examinee. For example, the term “may” in Options 1, 2, and 3 cues the testwise examinee that those options are true. Option 4 is harder to guess: what does “usually” mean? Research has shown that these vague frequency terms do not have a shared definition. Experts would not agree on whether the fourth option is true or false.

    In the clinical assessment of chronic pain,
    1. the physician’s personal attitude concerning pain may affect medical judgement
    2. unpleasant emotions may be converted to complaints of bodily pain
    3. pain may have a symbolic meaning
    4. facial appearance or body posture is usually a clue to the severity of the pain

The flaws in this item are more subtle. The difficulty is that the examinee has to make assumptions about the severity of the disease, the age of the patient, and whether or not the disease has been treated. Different assumptions lead to different answers, even among experts.

    In children, ventricular septal defects are associated with
    1. systolic murmur
    2. pulmonary hypertension
    3. tetralogy of Fallot
    4. cyanosis

Note that in each sample flawed item, the stem is unclear, the options contain vague terms, or the options are partially correct. In each instance, a group of experts would have difficulty reaching a consensus on the correct answer.

Because examinees are required to select all the options that are “true,” true/false items must satisfy the following rules:

  Stems must be clear and unambiguous. Imprecise phrases such as “is associated with,” “is useful for,” and “is important”; words that provide cueing such as “may” or “could be”; and vague terms such as “usually” or “frequently” should be avoided.

  Options must be absolutely true or false; no shades of gray are permissible; avoid phrases and words noted in the first item above.

The One-Best-Answer Family

In contrast to true/false questions, one-best-answer (A-type) questions make explicit the number of options to be selected. A-type items are the most widely used multiple-choice-item format. They consist of a stem (eg, a clinical case presentation) and a lead-in question, followed by a series of choices, typically one correct answer and four distractors. The following question describes a situation (in this instance, a patient) and asks the examinee to indicate the most likely cause of the problem.

    Stem: A 32-year-old man has a 4-day history of progressive weakness in his extremities. He has been healthy except for an upper respiratory tract infection 10 days ago. His temperature is 37.8 C (100 F), blood pressure is 130/80 mm Hg, pulse is 94/min, and respirations are 42/min and shallow. He has symmetric weakness of both sides of the face and the proximal and distal muscles of the extremities. Sensation is intact. No deep tendon reflexes can be elicited; the plantar responses are flexor.

    Lead-in: Which of the following is the most likely diagnosis?

    Options:
    A. Acute disseminated encephalomyelitis
    B. Guillain-Barré syndrome
    C. Myasthenia gravis
    D. Poliomyelitis
    E. Polymyositis

Note that the incorrect options are not totally wrong. The options can be diagramed as follows:

    Least Correct   D   C   A   E   B   Most Correct

Even though the incorrect answers are not completely wrong, they are less correct than the “keyed answer.” The examinee is instructed to select the “most likely diagnosis”; experts would all agree that the most likely diagnosis is B; they would also agree that the other diagnoses are somewhat likely, but less likely than B.
As long as the options can be laid out on a single continuum, in this case from “Most Likely Diagnosis” to “Least Likely Diagnosis,” options in one-best-answer questions do not have to be totally wrong.

    Which of the following is true about pseudogout?
    A. It occurs frequently in women.
    B. It is seldom associated with acute pain in a joint.
    C. It may be associated with a finding of chondrocalcinosis.
    D. It is clearly hereditary in most cases.
    E. It responds well to treatment with allopurinol.

This item is flawed. After reading the stem, the examinee has only the vaguest idea what the question is about. In an attempt to determine the “best” answer, the examinees have to decide whether “it occurs frequently in women” is more or less true than “it is seldom associated with acute pain in a joint.” This is a comparison of apples and oranges. In order to rank-order the relative correctness of options, the options must differ on a single dimension or else all options must be absolutely 100% true or false.

The diagram of these options might look like this.

    [Diagram]

The options are heterogeneous and deal with miscellaneous facts; they cannot be rank-ordered from least to most true along a single dimension. Although this question appears to assess knowledge of several different points, its inherent flaws preclude this. The question by itself is not clear; the item cannot be answered without looking at the options.

In contrast to the options in the item on pseudogout, the options in the item on Guillain-Barré syndrome are homogeneous (eg, all diagnoses); knowledgeable examinees can rank-order the options along a single dimension.

Well-constructed one-best-answer questions satisfy the “cover-the-options” rule. The questions could be administered as write-in questions. The entire question is included in the stem.

The Bottom Line on Item Formats

We recommend that you do not use true/false questions. While many item writers believe the true/false items are easier to write than one-best-answer items, we find that they are more problematic. The item writer had something particular in mind when the question was written, but careful review commonly reveals subtle difficulties that were not apparent to the item author. Often the distinction between “true” and “false” is not clear, and it is not uncommon for subsequent reviewers to alter the answer key. As a result, reviewers rewrite or discard true/false items far more frequently than items written in other formats. Some ambiguities can be clarified, but others cannot.

There is a final reason that is more compelling than those noted above. We find that, to avoid ambiguity, we are pushed toward assessing recall of an isolated fact, something we are actively trying to avoid. We find that application of knowledge, integration, synthesis, and judgement questions can better be assessed by one-best-answer questions. As a result, the NBME has completely stopped using true/false formats on its examinations.

We also recommend that you not use negative A-type questions. The most problematic are those that take the form: “Each of the following is correct EXCEPT” or “Which of the following statements is NOT correct?” These suffer from the same problem as true/false questions: if options cannot be rank-ordered on a single continuum, the examinees cannot determine either the “least” or the “most” correct answer. On the other hand, we occasionally use well-focused negative A-types with single-word options on some exams, largely as a (poor) substitute for items that instruct the examinee to select more than one response. A superior format for this purpose, the Pick “N” format, in which examinees are instructed to select “N” responses, is discussed later in the manual.

Appendix A illustrates a variety of item formats that are no longer used on NBME exams.
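The wording rules given earlier in this chapter for true/false items (avoid imprecise phrases, cueing words, and vague frequency terms) lend themselves to a simple automated screen of draft items. The sketch below is only an illustration and is not an NBME tool: the names `FLAGGED_TERMS` and `flag_terms` are hypothetical, and the term list contains only the examples named in the chapter, so it would need to be extended for real use.

```python
import re

# Example terms named in this chapter's rules; the grouping and the
# dictionary name are illustrative assumptions, not an official list.
FLAGGED_TERMS = {
    "imprecise phrase": ["is associated with", "is useful for", "is important"],
    "cueing word": ["may", "could be"],
    "vague frequency term": ["usually", "frequently"],
}


def flag_terms(text):
    """Return (category, term) pairs for every flagged term found in text.

    Matching is case-insensitive and on whole words, so "may" does not
    match inside a longer word such as "dismay".
    """
    lowered = text.lower()
    hits = []
    for category, terms in FLAGGED_TERMS.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, lowered):
                hits.append((category, term))
    return hits


if __name__ == "__main__":
    options = [
        "pain may have a symbolic meaning",
        "facial appearance or body posture is usually a clue "
        "to the severity of the pain",
        "Hemophilia A (classic hemophilia)",
    ]
    for option in options:
        print(option, "->", flag_terms(option))
```

A screen like this can only point a reviewer at suspect wording; deciding whether an option is unambiguous still requires expert review.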
18 Chapter 3 Technical Item Flaws v This section describes two types of technical item flaws: testwiseness and irrelevant difficulty. Flaws related to testwiseness make it easier for some students to answer the question correctly, based on their test-taking skills alone. These flaws com- monly occur in items that are unfocused and do not satisfy the “cover-the-options” rule. Flaws related to irrelevant difficul- ty make the question difficult for reasons unrelated to the trait that is the focus of assessment. The purpose of this section is to outline common flaws and to encourage you to eliminate these flaws from your questions to provide a level playing field for the testwise and not-so-testwise students. The probability of answering a question cor- rectly should relate to the examinee’s amount of expertise on the topic being assessed and should not relate to their exper- tise on test-taking strategies. Issues Related to Testwiseness Grammatical cues: one or more distractors don’t follow grammatically from the stem Because an item writer tends to pay more attention to the cor- rect answer than to the distractors, grammatical errors are A 60-year-old man is brought to the emergency more likely to occur in the distractors. In this example, test- department by the police, who found him lying uncon- wise students would eliminate A and C as options because scious on the sidewalk. After ascertaining that the they do not follow grammatically or logically from the stem. airway is open, the first step in management should be Testwise students then have to choose only between B, D, intravenous administration of and E. A. examination of cerebrospinal fluid B. glucose with vitamin B 1 (thiamine) C. CT scan of the head D. phenytoin E. diazepam Chapter 3. Technical Item Flaws 19 Logical cues: a subset of the options are collectively exhaustive In this item, Options A, B, and C include all possibilities. 
The testwise student knows that A, B, or C must be correct, whereas the non-testwise student spends time considering D and E. Often, the item writers add D and E only because they want to list five options. In these situations, the item writer may not have paid much attention to the merits of Options D and E; sometimes, they are partially correct and confusing because they cannot be rank-ordered on the same dimension as Options A, B, and C. This flaw is commonly seen in items with options such as “Increases,” “Decreases,” and “Remains the same.”

Crime is
A. equally distributed among the social classes
B. overrepresented among the poor
C. overrepresented among the middle class and rich
D. primarily an indication of psychosexual maladjustment
E. reaching a plateau of tolerability for the nation

Absolute terms: terms such as “always” or “never” are used in options

In patients with advanced dementia, Alzheimer’s type, the memory defect
A. can be treated adequately with phosphatidylcholine (lecithin)
B. could be a sequela of early parkinsonism
C. is never seen in patients with neurofibrillary tangles at autopsy
D. is never severe
E. possibly involves the cholinergic system

In this item, Options A, B, and E contain terms that are less absolute than those in Options C and D. The testwise student will eliminate Options C and D as possibilities because they are less likely to be true than something stated less absolutely. Note that this flaw would not arise if the stem were focused and the options were short; it arises only when verbs are included in the options rather than in the lead-in.

Long correct answer: correct answer is longer, more specific, or more complete than other options

Secondary gain is
A. synonymous with malingering
B. a frequent problem in obsessive-compulsive disorder
C. a complication of a variety of illnesses and tends to prolong many of them
D. never seen in organic brain damage

In this item, Option C is longer than the other options; it is also the only double option. Item writers tend to pay more attention to the correct answer than to the distractors. Because you are teachers, you write long correct answers that include additional instructional material, parenthetical information, caveats, etc. Sometimes this can be quite extreme: the correct answer is a paragraph in length and the distractors are single words.

Word repeats: a word or phrase is included in the stem and in the correct answer

A 58-year-old man with a history of heavy alcohol use and previous psychiatric hospitalization is confused and agitated. He speaks of experiencing the world as unreal. This symptom is called
A. depersonalization
B. derailment
C. derealization
D. focal memory deficit
E. signal anxiety

This item uses the word “unreal” in the stem, and “derealization” is the correct answer. Sometimes, a word is repeated only in a metaphorical sense, eg, a stem mentioning bone pain, with the correct answer beginning with the prefix “osteo-”.

Convergence strategy: the correct answer includes the most elements in common with the other options

This item flaw is less obvious than the others, but it occurs frequently and is worth noting. The flaw is seen in several forms. The underlying premise is that the correct answer is the option that has the most in common with the other options; it is not likely to be an outlier. For example, in numeric options, the correct answer is more often the middle number than an extreme value. In double options, the correct answer is more likely to be the option that has the most elements in common with the other distractors. For example, if the options are “Pencil and pen”; “Pencil and highlighter”; “Pencil and crayon”; “Pen and marker,” the correct answer is likely to be “Pencil and pen” (ie, by simple count, “Pencil” appeared 3 times in the options; “Pen” appeared twice; other elements each appeared only once).
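The simple count described above can be sketched in a few lines of Python. This is an illustrative sketch only, not a tool described in this manual: the function names are hypothetical, and it assumes each double option has already been split into its elements by hand.

```python
from collections import Counter


def convergence_scores(options):
    """Score each option by how often its elements appear across all options.

    options: list of tuples, one tuple of elements per option.
    """
    counts = Counter(element for option in options for element in option)
    return [sum(counts[element] for element in option) for option in options]


def convergence_guesses(options):
    """Indices of the options a testwise examinee would favor (highest score)."""
    scores = convergence_scores(options)
    best = max(scores)
    return [i for i, score in enumerate(scores) if score == best]


# The pencil-and-pen example from the text.
options = [
    ("pencil", "pen"),
    ("pencil", "highlighter"),
    ("pencil", "crayon"),
    ("pen", "marker"),
]
print(convergence_scores(options))   # [5, 4, 4, 3]
print(convergence_guesses(options))  # [0], ie, "Pencil and pen"
```

An item writer could run a check of this kind on a draft item: if the keyed answer is the only option with the highest score, the option set may be giving the answer away.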
While this might seem ridiculous, this flaw occurs because item writers start with the correct answer and write permutations of the correct answer as the distractors. The correct answer is, therefore, more likely to have elements in common with the rest of the options; the incorrect answers are more likely to be outliers as the item writer has difficulty generating viable distractors.

Local anesthetics are most effective in the
A. anionic form, acting from inside the nerve membrane
B. cationic form, acting from inside the nerve membrane
C. cationic form, acting from outside the nerve membrane
D. uncharged form, acting from inside the nerve membrane
E. uncharged form, acting from outside the nerve membrane

In this example, the testwise student would eliminate “anionic form” as unlikely because “anionic form” appears only once; that student would also exclude “outside the nerve membrane” because “outside” appears less frequently than “inside”. The student would then have to decide between Options B and D. Since three of the five options involve a charge, the testwise student would then pick Option B.

Issues Related to Irrelevant Difficulty

Options are long, complicated, or double

This item illustrates a common flaw. The stem contains extraneous reading, but, more importantly, the options are very long and complicated. Trying to decide among these options requires a significant amount of reading because of the number of elements in each option. This can shift what is measured by an item from content knowledge to reading speed. Please note that this flaw relates only to options. There are many well-constructed test questions that include a long stem. Decisions about stem length should be made in accord with the purpose of the item.
If the purpose of the item is to assess whether or not the student can interpret and synthesize information to determine, for example, the most likely diagnosis, then it is appropriate for the stem to include a fairly complete description of the situation.

Peer review committees in HMOs may move to take action against a physician’s credentials to care for participants of the HMO. There is an associated requirement to assure that the physician receives due process in the course of these activities. Due process must include which of the following?
A. Notice, an impartial forum, council, a chance to hear and confront evidence against him/her.
B. Proper notice, a tribunal empowered to make the decision, a chance to confront witnesses against him/her, and a chance to present evidence in defense.
C. Reasonable and timely notice, impartial panel empowered to make a decision, a chance to hear evidence against himself/herself and to confront witnesses, and the ability to present evidence in defense.

Numeric data are not stated consistently

When numeric options are used, the options should be listed in numeric order and the options should be listed in a single format (ie, as single terms or as ranges). Confusion occurs when formats are mixed and when the options are listed in an illogical order or in an inconsistent format.

Following a second episode of infection, what is the likelihood that a woman is infertile?
A. Less than 20%
B. 20 to 30%
C. Greater than 50%
D. 90%
E. 75%

In this example, Options A, B, and C are expressed as ranges, whereas Options D and E are specific percentages. All options should be expressed as ranges or as specific percentages; mixing them is ill-advised. In addition, the range for Option C includes Options D and E, which almost certainly rules out Options D and E as correct answers.

Frequency terms in the options are vague (eg, rarely, usually)

Severe obesity in early adolescence
A. usually responds dramatically to dietary regimens
B. often is related to endocrine disorders
C. has a 75% chance of clearing spontaneously
D. shows a poor prognosis
E. usually responds to pharmacotherapy and intensive psychotherapy

Research has shown that vague frequency terms are not consistently defined, even by experts. A more complete discussion of this research is included elsewhere in the manual.

Language in the options is not parallel; options are in a nonlogical order

In a vaccine trial, 200 2-year-old boys were given a vaccine against a certain disease and then monitored for five years for occurrence of the disease. Of this group, 85% never contracted the disease. Which of the following statements concerning these results is correct?
A. No conclusion can be drawn, since no follow-up was made of nonvaccinated children
B. The number of cases (ie, 30 cases over five years) is too small for statistically meaningful conclusions
C. No conclusions can be drawn because the trial involved only boys
D. Vaccine efficacy (%) is calculated as 85-15/100

This item illustrates a common flaw in which the options are long and the language makes it difficult and time-consuming to determine which is the most correct. Generally, this flaw can be corrected by careful editing. In this particular item, the lead-in can be changed to “For which of the following reasons can no conclusion be drawn from these data?” The options can then be edited (ie, A. No follow-up was made of nonvaccinated children; B. The number of cases was too small; C. The trial involved only boys, and a new option can be written for D).

None of the above is used as an option

The diagnosis of a large ovarian cyst is most strongly suggested by
A. an anterior dullness, lateral tympany
B. a decreased peristalsis
C. a fluid wave
D. a shifting dullness
E. none of the above

The phrase “None of the above” is problematic in items where judgement is involved and where the options are not absolutely true or false. If the answer is intended to be one of the listed options, very knowledgeable students are faced with a dilemma, because they have to decide between a very detailed perfect option and the one that you have developed as correct. They can generally construct an option that is more correct than the one you have intended to be correct. Use of “none of the above” essentially turns the item into a true/false item; each option has to be evaluated as more or less true than the universe of unlisted options.

Stems are tricky or unnecessarily complicated

Arrange the parents of the following children with Down’s syndrome in order of highest to lowest risk of recurrence. Assume that the maternal age in all cases is 22 years and that a subsequent pregnancy occurs within 5 years. The karyotypes of the daughters are:
I: 46, XX, -14, +T (14q21q) pat
II: 46, XX, -14, +T (14q21q) de novo
III: 46, XX, -14, +T (14q21q) mat
IV: 46, XX, -21, +T (14q21q) pat
V: 47, XX, -21, +T (21q21q) (parents not karyotyped)
A. III, IV, I, V, II
B. IV, III, V, I, II
C. III, I, IV, V, II
D. IV, III, I, V, II
E. III, IV, I, II, V

Sometimes, item writers can take a perfectly easy question and turn it into something so convoluted that only the most stalwart will even read it. This item is a sample of that kind of item.

Summary of Technical Item Flaws

Issues Related to Testwiseness
Grammatical cues - one or more distractors don’t follow grammatically from the stem
Logical cues - a subset of the options is collectively exhaustive
Absolute terms - terms such as “always” or “never” are in some options
Long correct answer - correct answer is longer, more specific, or more complete than other options
Word repeats - a word or phrase is included in the stem and in the correct answer
Convergence strategy - the correct answer includes the most elements in common with the other options

Issues Related to Irrelevant Difficulty
Options are long, complicated, or double
Numeric data are not stated consistently
Terms in the options are vague (eg, “rarely,” “usually”)
Language in the options is not parallel
Options are in a nonlogical order
“None of the above” is used as an option
Stems are tricky or unnecessarily complicated
The answer to an item is “hinged” to the answer of a related item

General Guidelines for Item Construction

Make sure the item can be answered without looking at the options OR that the options are 100% true or false.
Include as much of the item as possible in the stem; the stems should be long and the options short.
Avoid superfluous information.
Avoid “tricky” and overly complex items.
Write options that are grammatically consistent and logically compatible with the stem; list them in logical or alphabetical order.
Write distractors that are plausible and the same relative length as the answer.
Avoid using absolutes such as always, never, and all in the options; also avoid using vague terms such as usually and frequently.
Avoid negatively phrased items (eg, those with except or not in the lead-in). If you must use a negative stem, use only short (preferably single word) options.

And most important of all: Focus on important concepts; don’t waste time testing trivial facts.
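Several of the flaws summarized above are surface features of the options, so a draft item can be screened for them mechanically before peer review. The following Python sketch is illustrative only. The function names and word lists are assumptions made for this example (the word lists are drawn from the examples in this chapter, not from an official vocabulary), and it assumes the item writer supplies the position of the keyed answer. It cannot judge content, plausibility, or importance.

```python
import re

# Illustrative word lists taken from the examples in this chapter (assumption).
ABSOLUTE_TERMS = {"always", "never", "all"}
VAGUE_TERMS = {"rarely", "usually", "often", "frequently", "commonly"}


def words(text):
    """Lowercase alphabetic words in a string."""
    return re.findall(r"[a-z]+", text.lower())


def find_flaws(options, answer_index):
    """Return (flaw name, option letter) pairs for one item's options.

    options: list of option strings; answer_index: position of the keyed answer.
    Only surface features are checked.
    """
    flaws = []
    for i, option in enumerate(options):
        letter = chr(ord("A") + i)
        found = set(words(option))
        if found & ABSOLUTE_TERMS:
            flaws.append(("absolute term", letter))
        if found & VAGUE_TERMS:
            flaws.append(("vague frequency term", letter))
        if option.strip().lower() == "none of the above":
            flaws.append(("none of the above", letter))

    # Long correct answer: the keyed option has more words than every distractor.
    lengths = [len(words(option)) for option in options]
    others = lengths[:answer_index] + lengths[answer_index + 1:]
    if others and lengths[answer_index] > max(others):
        flaws.append(("long correct answer", chr(ord("A") + answer_index)))
    return flaws


def mixed_numeric_formats(options):
    """True if some options are bare values (eg, "90%") and others are not."""
    bare = [bool(re.fullmatch(r"\d+(\.\d+)?%?", o.strip())) for o in options]
    return any(bare) and not all(bare)


# The "Secondary gain" item, assuming Option C is the keyed answer,
# as the discussion of that item implies.
secondary_gain = [
    "synonymous with malingering",
    "a frequent problem in obsessive-compulsive disorder",
    "a complication of a variety of illnesses and tends to prolong many of them",
    "never seen in organic brain damage",
]
print(find_flaws(secondary_gain, 2))
# [('absolute term', 'D'), ('long correct answer', 'C')]

# The infertility item, which mixes ranges with specific percentages.
infertility = ["Less than 20%", "20 to 30%", "Greater than 50%", "90%", "75%"]
print(mixed_numeric_formats(infertility))  # True
```

A screen like this only flags candidates for review. Grammatical cues, logical cues, and tricky stems still have to be caught by a reader.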
Use of Imprecise Terms in Examination Questions

While imprecise terms are used in our everyday speech and in our writing, these terms cause confusion when they are used in the text of examination items. In a study conducted at the NBME, 60 members of eight test committees who wrote questions for various medical specialty examinations reviewed a list of terms used in MCQs to express some concept related to frequency of occurrence and indicated the percentage of time that was reflected by each term. Results (shown below) indicated that the terms do not have an operational definition that is commonly shared, even among the item writers themselves. The mean value plus or minus one standard deviation exceeded 50 percentage points for more than half of the phrases. For example, on average, the item writers believed the term “frequently” indicated 70% of the time; half believed it was between 45% and 75% of the time; actual responses ranged from 20% to 80%. Of particular note is that values for “frequently” overlapped with values for “rarely.”

The implication of these results for the construction of test questions varies by item format. Vague terms create far more severe problems in the various kinds of true/false items (K-, C- and X-type items) than in one-best-answer (A- and R-type) items. For example, imprecise terms cause major problems in true/false items such as this example:

True statements about pseudogout include:
1. It occurs commonly in women.
2. It is often associated with acute pain.
3. It is usually hereditary.
4. Serum calcium levels are frequently increased.

In true/false items, the examinee has to judge whether each option is true or false. When options are not absolutely true or false, examinees rely on their personal definition of the ambiguous terms or their guesses about what these terms meant to the item writer.
Alternatively, examinee responses may reflect personal response style (the tendency to respond either true or false when the correct answer is unknown). These response style factors may have more of an effect on whether or not an examinee answers the item correctly than knowledge of the subject matter and may be part of the reason why true/false items tend to perform poorly.

Rewording the options by specifying exact numbers does not correct the problem. For example, the statement, “the incidence among women is 1:2000” would not be an appropriate modification of Option 1 in the example shown. The incidence is not exactly 1:2000, and because a band is not specified, examinees would define their own bands, narrowly or widely, presumably depending on personal response styles. In true/false items, the appropriate treatment of numeric options is either to generate a comparison (eg, the incidence is greater than that of osteoarthritis) or to specify a range (eg, the incidence is between 1:1000 and 1:2000).

The issue noted above with true/false items is not as problematic with well-constructed one-best-answer items (ie, those that pose a clear question and have homogeneous options). For example, the following question includes a vague term in the item stem, yet, because the task is to select the one-best answer, the question is relatively unambiguous.

Which of the following laboratory values is usually increased in patients with pseudogout?

Problems do arise with one-best-answer items that have vague terms in the options, as in this example.

Patients with pseudogout have pain:
A. frequently
B. usually
C. often
D. commonly

The only way to make such an item more ambiguous would be to use a fifth option “none of the above.”

Box plot showing distribution of responses for frequency terms. These data are based on responses from 60 members of eight item-writing committees. The horizontal line in each box indicates the median response; the boxes include the ranges for 50% of the responses. The vertical lines extend to the highest and lowest values indicated. For example, the median response was that “frequently” indicated 70% of the time; half believed it was between 45% and 75% of the time; actual responses ranged from 20% to 80%.

From: Case SM. The use of imprecise terms in examination questions: how frequent is frequently? Academic Medicine. 1994;69(suppl):S4-S6.

Section II
Writing One-Best-Answer Questions for the Basic and Clinical Sciences

The previous chapters outlined technical issues related to the construction of multiple-choice questions. Section II focuses on item content.

The Basic Rules for One-Best-Answer Items

Each item should focus on an important concept, typically a common or potentially catastrophic clinical problem. Don’t waste testing time with questions assessing knowledge of trivial facts. Focus on problems that would be encountered in real life. Avoid trivial, “tricky,” or overly complex questions.

Each item should assess application of knowledge, not recall of an isolated fact. The item stems may be relatively long; the options should be short. Clinical vignettes provide a good basis for a question. For the clinical sciences, each should begin with the presenting problem of a patient, followed by the history (including duration of signs and symptoms), physical findings, results of diagnostic studies, initial treatment, subsequent findings, etc. Vignettes may include only a subset of this information, but the information should be provided in this specified order. For the basic sciences, patient vignettes may be very brief; “laboratory vignettes” are also appropriate.

The stem of the item must pose a clear question, and it should be possible to arrive at an answer with the options covered.
To determine if the question is focused, cover up the options and see if the question is clear and if the examinees can pose an answer based only on the stem. Rewrite the stem and/or options if they could not.

All distractors (ie, incorrect options) should be homogeneous. They should fall into the same category as the correct answer (eg, all diagnoses, tests, treatments, prognoses, disposition alternatives). Rewrite any dissimilar distractors. Avoid using “double options” (eg, do W and X; do Y because of Z) unless the correct answer and all distractors are double options. Rewrite double options to focus on a single point.

All distractors should be plausible, grammatically consistent, logically compatible, and of the same (relative) length as the correct answer. Order the options in logical order (eg, numeric), or in alphabetical order.

Avoid technical item flaws that provide special benefit to testwise examinees or that pose irrelevant difficulty.

Do NOT write any questions of the form “Which of the following statements is correct?” or “Each of the following statements is correct EXCEPT.” These questions are unfocused and have heterogeneous options.

Subject each question to the five “tests” implied by the above rules. If a question passes all five, it is probably well-phrased and focused on an appropriate topic.

See also: Swanson DB, Case SM. Assessment in basic science instruction: directions for practice and research. Advances in Health Sciences Education: Theory and Practice. 1997;2:71-84.

Chapter 4
Item Content: Testing Application of Basic Science Knowledge

Item Content for the Basic Sciences

Traditionally, items are classified by the cognitive processes required to answer the question (eg, recall, interpretation, or problem solving; memory, comprehension, or reasoning). Recall items are thought to test examinees’ knowledge of isolated facts.
Interpretation items require examinees to review some information (often in tabular or graphical form) and reach some conclusion (eg, a diagnosis). Problem-solving items present a situation and require examinees to take some action (eg, the next step in patient management). Interpretation and problem-solving items are thought to involve “higher order” skills, rather than just rote memory of factual information.

Unfortunately, the cognitive processes required to answer an item are often difficult to determine, because they are as dependent on the background of the examinee as they are on the item content. For example, an item concerning blood flow in a patient with ventricular septal defect might require simply recall with little or no conscious thought from a pediatric cardiologist or cardiovascular physiologist, but a typical Step 1 examinee might have to reason out the answer from basic principles of hemodynamics. The cognitive processes involved in responding to a question vary by examinee, making this taxonomic approach difficult to use.

A simpler, more objective approach bases item classification on the task of the examinee. If an item requires an examinee to reach a conclusion, make a prediction, or select a course of action, it should be classified as an application of knowledge item. If an item tests only rote memory for isolated facts (without requiring their application), it should be classified as a recall item. All items should require application of knowledge, allowing assessment of both an examinee’s information base and the ability to use that information.

The following pair of item stems illustrates the difference between a question assessing recall of an isolated fact and a question assessing application of knowledge.

Basic Science Recall Item Stem: What area is supplied with blood by the posterior inferior cerebellar artery?
Basic Science Application of Knowledge Item Stem: A 62-year-old man develops left-sided limb ataxia, Horner’s syndrome, nystagmus, and loss of appreciation of facial pain and temperature sensations. What artery is most likely to be occluded?

It is common to use clinical vignettes as item stems to assess application of basic science knowledge to interpret clinical situations. For example, instead of asking examinees to identify the muscles innervated by a cranial nerve, provide a set of physical findings and ask examinees to identify the most likely site of the lesion. Instead of asking for a description of respiratory acidosis or alkalosis, provide values for arterial blood gases (and other patient findings as needed) and ask examinees to identify the most likely pathophysiologic explanation. Make sure that examinees can answer the question based on an understanding of basic science; experience in patient care should not be necessary.

“Lab vignettes” can also be useful in preparing items that test application of knowledge. These items present lab experiments and require examinees to use their understanding of basic science principles to predict or explain the results. The vignettes may describe classic experiments in a basic science area, or they may involve less well-known or hypothetical situations. Such items effectively shift the focus of assessment from knowledge of isolated facts to use of basic science principles to solve problems.

Use of patient and lab vignettes to assess application of knowledge has several benefits. First, the “face validity” of the exam is greatly enhanced by using “problem-solving” items. Second, items are more likely to focus on important information, rather than trivia. Third, it helps to identify those examinees who have memorized a substantial body of factual information, but are unable to use that information effectively.
Guidelines for Basic Science Item Content

Test application of knowledge using experimental and clinical vignettes
Focus items on key concepts and principles that are essential information (without access to references) for all examinees to understand
Test material that is relevant to learning in clinical clerkships, postgraduate medical education, and beyond
Avoid items that only require recall of isolated facts
Avoid esoteric or interesting topics that are not essential

These two items were written to assess the same topic. We recommend that questions be written like the second item, not the first one.

Acute intermittent porphyria is the result of a defect in the biosynthetic pathway for
A. collagen
B. corticosteroid
C. fatty acid
D. glucose
*E. heme
F. thyroxine (T4)

An otherwise healthy 33-year-old man has mild weakness and occasional episodes of steady, severe abdominal pain with some cramping but no diarrhea. One aunt and a cousin have had similar episodes. During an episode, his abdomen is distended, and bowel sounds are decreased. Neurologic examination shows mild weakness in the upper arms. These findings suggest a defect in the biosynthetic pathway for
A. collagen
B. corticosteroid
C. fatty acid
D. glucose
*E. heme
F. thyroxine (T4)

Item Templates

The overall structure of an item can be depicted by an item template. You can typically generate many items using the same template. For example, the following template could be used to generate a series of questions related to gross anatomy:

A (patient description) is unable to (functional disability). Which of the following is most likely to have been injured?

This is a question that could be written using this template:

A 65-year-old man has difficulty rising from a seated position and straightening his trunk, but he has no difficulty flexing his leg. Which of the following muscles is most likely to have been injured?
*A. Gluteus maximus
B. Gluteus minimus
C. Hamstrings
D. Iliopsoas
E. Obturator internus

Many basic science questions can be presented within the context of a patient vignette. The patient vignettes may include some or all of the following components:

Age, Gender (eg, A 45-year-old man)
Site of Care (eg, comes to the emergency department)
Presenting Complaint (eg, because of a headache)
Duration (eg, that has continued for 2 days).
Patient History (with Family History ?)
Physical Findings
+/- Results of Diagnostic Studies
+/- Initial Treatment, Subsequent Findings, etc.

Additional Templates

A (patient description) has a (type of injury and location). Which of the following structures is most likely to be affected?

A (patient description) has (history findings) and is taking (medications). Which of the following medications is the most likely cause of his (one history, PE or lab finding)?

A (patient description) has (abnormal findings). Which [additional] finding would suggest/suggests a diagnosis of (disease 1) rather than (disease 2)?

A (patient description) has (symptoms and signs). These observations suggest that the disease is a result of the (absence or presence) of which of the following (enzymes, mechanisms)?

A (patient description) follows a (specific dietary regime). Which of the following conditions is most likely to occur?

A (patient description) has (symptoms, signs, or specific disease) and is being treated with (drug or drug class). The drug acts by inhibiting which of the following (functions, processes)?

A (patient description) has (abnormal findings). Which of the following (positive laboratory results) would be expected?

(time period) after a (event such as trip or meal with certain foods), a (patient or group description) became ill with (symptoms and signs). Which of the following (organisms, agents) is most likely to be found on analysis of (food)?

Following (procedure), a (patient description) develops (symptoms and signs). Laboratory findings show (findings). Which of the following is the most likely cause?

A (patient description) dies of (disease). Which of the following is the most likely finding on autopsy?

A patient has (symptoms and signs). Which of the following is the most likely explanation for the (findings)?

A (patient description) has (symptoms and signs). Exposure to which of the (toxic agents) is the most likely cause?

Which of the following is the most likely mechanism of the therapeutic effect of this (drug class) in patients with (disease)?

A patient has (abnormal findings), but (normal findings). Which of the following is the most likely diagnosis?

See Appendix B for additional examples.

Types of Questions

Guess my drug
Guess my toxic exposure
Guess my diet
Guess my mood
Predict physical findings
Predict lab findings
Predict sequelae
Identify underlying cause/diagnosis
Identify cause of drug responses
Identify drug to administer

Sample Lead-ins and Option Lists

Which of the following is (abnormal)?
Option sets could include sites of lesions; list of nerves; list of muscles; list of enzymes; list of hormones; types of cells; list of neurotransmitters; list of toxins, molecules, vessels, spinal segments.

Which of the following findings is most likely?
Option sets could include list of laboratory results; list of additional physical signs; autopsy results; results of microscopic examination of fluids, muscle or joint tissue; DNA analysis results; serum levels.

Which of the following is the most likely cause?
Option sets could include list of underlying mechanisms of the disease; medications that might cause side effects; drugs or drug classes; toxic agents; hemodynamic mechanisms, viruses, metabolic defects.

Which of the following should be administered?
Option sets could include drugs, vitamins, amino acids, enzymes, hormones.

Which of the following is defective/deficient/nonfunctioning?
Option sets could include list of enzymes, feedback mechanisms, endocrine structures, dietary elements, vitamins.

Given the pedigree, what is the likelihood that the next child (specify gender) will have the disease?

Writing the Options: Altering Item Difficulty

The incorrect options in each question are called distractors. Each distractor should be selected by some examinees; therefore, each distractor should be plausible and none should stand out as being obviously incorrect. Common misconceptions and faulty reasoning provide a good source of plausible distractors. Distractors directly affect the difficulty of a question. Consider the following question.

Who was the primary author of the Declaration of Independence?
A. Abraham Lincoln
B. Thomas Jefferson
C. Franklin Roosevelt
D. King George II
E. Catherine the Great

In the example above, the options are quite divergent and Thomas Jefferson is easily identified as the correct answer. Someone who knows relatively little about American history could answer this correctly. Now consider the same question with a different set of options.

Who was the primary author of the Declaration of Independence?
A. George Washington
B. Thomas Jefferson
C. Alexander Hamilton
D. Benjamin Franklin
E. James Madison

In this example, the question becomes more difficult; the options are all plausible answers to someone who has limited knowledge. For some content areas, options like those in the first example might be appropriate; for others, those in the second example are more appropriate.

When writing your options, make sure that they are:

Homogeneous in content (eg, all are diagnoses; all are next steps in patient care)
Incorrect or inferior to the correct answer
Plausible and attractive to the uninformed
Similar to the correct answer in construction and length
Grammatically consistent and logically compatible with the stem

Item Shape

An appropriately shaped item includes as much of the item as possible in the stem; the stem should be relatively long and the options should be relatively short. The stem should include all relevant facts; no additional data should be provided in the options.

Appropriately Shaped Item: long stem, followed by short options (A through E)
Poorly Shaped Item: short stem, followed by long options (A through E)

Problem-Based Learning and Use of Case Clusters

An increasing number of medical schools have adopted problem-based learning (PBL) as an instructional strategy for portions of the basic science curriculum. Although each school’s approach to PBL is somewhat unique, all involve the use of written patient cases (problems) in basic science instruction. Problems are designed to stimulate learning of material from traditional basic science disciplines (eg, anatomy, physiology, biochemistry) from a clinical perspective, and application of basic science principles to clinical situations is stressed. Material is typically covered through independent study and discussed in small groups with a faculty tutor.

PBL courses and curricula typically emphasize the learning process, learning how to learn, responsibility of students for their own learning, and preparation for lifelong learning. However, there are important variations among programs that have implications for assessment. The Open Discovery approach emphasizes the learning process: students have responsibility for determining what to learn, as well as when and how to learn it. Learning to apply broad principles in problem-solving situations is viewed as most important, with minimal guidance provided by instructors and maximum opportunity for exploration by students.
In contrast, in the Guided Discovery approach, curriculum developers identify specific learning objectives for each problem, and these objectives are provided to instructors who use them to organize group discussion and student learning. These curricula can be highly structured, with careful sequencing of instructional experiences. Students may or may not be aware of the structure and the specific objectives: their experience may be quite similar to that of students in programs using the Open Discovery approach. In practice, the Open and Guided Discovery approaches are probably best viewed as opposite ends of a continuum. Programs vary along the continuum, and, within a program, problems (and groups) also vary.

Assessment in programs using the Open Discovery approach often focuses on process variables such as self-directedness, motivation, effort, problem-solving, and attitudes. Assessment of learning outcomes is genuinely problematic, because each student is encouraged to pursue a somewhat different course of study. Use of traditional multiple-choice tests, in particular, is often viewed as inappropriate, because they may cause students to “study to the test,” thus discouraging students from self-determination of the material to be learned and the process for learning it.

Assessment of learning outcomes poses fewer problems when the Guided Discovery approach is used, since the same learning objectives that guide problem development and use can also guide test development. To achieve congruence with curricular goals, assessment should focus on students’ understanding of basic mechanisms of health, disease and treatment. Well-written multiple-choice tests can play a major role in assessment, as long as they stress application of basic science knowledge to patient care.
Tests using “case clusters” (multiple-choice questions associated with the same patient presentation) are particularly appropriate for PBL courses. An example of a simple case cluster is shown below. It consists of a brief case presentation, followed by a series of three multiple-choice questions. Each question addresses a somewhat different aspect of the case, looking at the clinical situation from a variety of perspectives. Like PBL more generally, use of test material like this emphasizes learning of basic science information so that it is organized to be useful in provision of patient care.

A 34-year-old woman has had severe watery diarrhea for the past four days. Two months earlier she had infectious mononucleosis. She abuses drugs intravenously and has antibodies to HIV in her blood. Physical examination shows dehydration and marked muscle weakness.

1. Laboratory studies are most likely to show
A. decreased serum K+ concentration
B. decreased serum Ca2+ concentration
C. increased serum HCO3- concentration
*D. increased serum Na+ concentration
E. increased serum pH

2. In evaluating the cause of the diarrhea, which of the following is most appropriate?
A. Colonic biopsy to identify Giardia lamblia
B. Culture of the oral cavity for Candida albicans
C. Duodenal biopsy to identify Entamoeba histolytica
D. Gastric aspirate to identify Mycobacterium avium-intracellulare
*E. Stool specimen to identify Cryptosporidium

3. Further studies to evaluate her HIV infection show the ratio of helper T lymphocytes to suppressor T lymphocytes to be 0.3. This occurs because HIV
A. induces proliferation of helper T lymphocytes
B. induces proliferation of suppressor T lymphocytes
*C. infects cells with CD4 receptors
D. infects macrophages
E. stimulates the synthesis of leukotriene

In addition to principles described earlier in this manual, there are two more considerations required in preparing case clusters: cueing and hinging.
First, it is desirable to avoid “cueing,” that is, providing hints at the answers to earlier questions in later questions. Students are very likely to “read ahead” for these clues, and item writers should avoid providing them. For example, in a cluster describing a patient with chest pain, if the first question addresses the most likely cause of the pain and the second requires selection of the most appropriate drug treatment, it is important that each of the diagnoses associated with the first question have a “matching” drug in the second (and vice versa); testwise examinees can rule out diagnoses (and drugs) simply by comparing the option lists. Second, it is desirable to avoid “hinging,” that is, creating questions where students must know the answer to one question in order to answer other questions, unless the topic to be tested is so important that the item writer is willing to have students receive either all of the points or none of the points associated with a cluster.

The cluster that follows, prepared by Drs. David Felten and Ralph Jozefowicz for the final examination in the University of Rochester first-year Neural Science course, illustrates one strategy to avoid hinging. Each of the first three items focuses on a different aspect of the patient presentation, and students are likely to respond correctly to some and incorrectly to others, receiving “partial credit” for partial knowledge. The last item is probably slightly hinged on the preceding items, since it requires students to “put the whole picture together” in order to respond correctly, but this seems reasonable, given the importance of the latter.

It can be difficult for a single faculty member to prepare case clusters where the items draw on information from several basic science disciplines; this requires substantial breadth of knowledge.
One strategy for coping with this problem is to adopt a “team approach” to preparation of test material analogous to the method generally used for preparation of problems for use in PBL instruction. For example, a clinician member of a team can prepare the patient description with which the cluster begins, along with questions related to pathophysiology. Faculty members from relevant basic science disciplines can contribute items that address various aspects of the patient situation from the perspective of their discipline.

Use of this kind of material is not, of course, restricted to curricula and courses taught using a PBL approach. It is completely appropriate any time it is desirable to stress clinical application of basic science information in teaching, learning and assessment. In our view, this includes most basic science courses, even those taught in the first year. As the neural science example that follows illustrates quite well, it is straightforward and appropriate to test basic knowledge of anatomy and physiology in the context of patient care in a traditionally taught course.

An unresponsive 58-year-old woman is brought to the emergency department after collapsing at a local shopping mall. Her family reports that she felt well that morning but developed a headache that progressively worsened while she was shopping. She has had hypertension and atrial fibrillation and is taking an antihypertensive medication and an oral anticoagulant. Her blood pressure is 220/130 mm Hg and her respiratory pattern is one of apnea alternating with hyperpnea. She responds only to noxious stimuli with extensor posturing involving the right arm and leg. Fundoscopic examination reveals papilledema involving the left optic disc. Pupils are 3.0/7.0 (R/L) with no reaction to light on the left. There is a left gaze preference.
There is diffuse hyperreflexia (R > L) and Babinski’s sign is present bilaterally.

1. The dilated, unreactive left pupil is most consistent with injury to the left
A. optic nerve
B. optic tract
*C. oculomotor nerve
D. lateral geniculate nucleus
E. superior colliculus

2. The extensor posturing on the right is most consistent with injury to the left
A. telencephalon
B. diencephalon
*C. midbrain
D. pons
E. medulla

3. Her respiratory pattern is best described as
A. normal
*B. Cheyne-Stokes
C. central neurogenic hyperventilation
D. apneustic
E. ataxic

4. Which of the following herniation syndromes is most consistent with her clinical presentation?
A. Cingulate gyrus beneath the falx
*B. Temporal lobe uncus across the tentorium
C. Diencephalon through the tentorial notch
D. Brain stem through the tentorial notch
E. Cerebellar tonsils through the foramen magnum

Additional discussion of assessment in PBL courses and curricula can be found in: Swanson DB, Case SM, and van der Vleuten CM. Strategies for student assessment. In: Boud, Feletti, eds. The Challenge of Problem-Based Learning - Second Edition. London: Kogan Page Ltd; 1997:269-282.

Sample Items for the Basic Sciences

1. Several contiguous cells are labeled with a fluorescent dye that cannot cross cell membranes. One cell is experimentally bleached with light that destroys the dye, but soon recovers dye fluorescence. This recovery is best explained by the presence of which of the following structures between the bleached cell and its fluorescent neighbors?
A. A basal lamina
B. Desmosomes (maculae adherentes)
*C. Gap junctions
D. Glycosaminoglycans
E. Tight junctions (zonulae occludentes)

2. A 30-year-old man has loss of pain and temperature sensation from the neck down on the right side of the body and on the left side of the face; partial paralysis of the soft palate, larynx, and pharynx on the left; and ataxia on the left. This syndrome is most likely to result from thrombosis of which of the following arteries?
A. Basilar
B. Right posterior inferior cerebellar
*C. Left posterior inferior cerebellar
D. Right superior cerebellar
E. Left superior cerebellar

3. During an operation, the arterial PCO2 and pH of an anesthetized patient are monitored. The patient is being ventilated by a mechanical respirator, and the initial values are normal (PCO2 = 40 mm Hg; pH = 7.42). If the ventilation is decreased, which of the following is most likely to occur?
    Arterial PCO2    pH
 A. Decrease         decrease
 B. Decrease         increase
 C. Decrease         no change
*D. Increase         decrease
 E. Increase         increase
 F. Increase         no change

4. In the branched metabolic pathway, a different single enzyme catalyzes each of the individual steps. The enzyme that would be expected to be most severely inhibited by compound V is enzyme
A. A
*B. B
C. C
D. D
E. E

5. A patient with posthepatitic cirrhosis develops rapid enlargement of the liver associated with deterioration of hepatic function. Serum concentration of which of the following is most likely to be abnormal?
A. α1-Antitrypsin
B. Carcinoembryonic antigen
C. Chorionic gonadotropin
*D. α-Fetoprotein
E. Gastrin

6. The first-born infant of an Rh-negative 26-year-old woman who had two previous second trimester abortions has severe hemolysis and circulatory failure. This condition could have been prevented by treating the mother with
A. anti-RhD IgG during the most recent pregnancy
*B. anti-RhD IgG on termination of each of the first two pregnancies
C. anti-RhD IgM during the most recent pregnancy
D. anti-RhD IgM on termination of the first pregnancy

7. Laboratory tests on an edematous 35-year-old man show a normal serum concentration of complement and an increased serum concentration of cholesterol. Urinalysis shows 4+ protein, 0-5 erythrocytes/hpf, and several hyaline casts. Examination of tissue obtained on renal biopsy is most likely to show
A. acute poststreptococcal (proliferative) glomerulonephritis
B. membranoproliferative glomerulonephritis
*C. membranous glomerulonephritis
D. minimal change disease (lipoid nephrosis)
E. rapidly progressive glomerulonephritis

8. Genes on the bacterial chromosome have the following linkages in conjugal transfer: x and y, 25% of the time; y and z, 50% of the time. If the gene order is x-y-z, approximately what percentage of the time will x and z be transferred together?
A. 1% of the time
B. 5% of the time
*C. 13% of the time
D. 20% of the time
E. 40% of the time

9. At a banquet, the menu included fried chicken, home-fried potatoes, peas, chocolate eclairs, and coffee. Within 2 hours, most of the diners became violently ill, with nausea, vomiting, and abdominal pain. Analysis of the contaminated food is most likely to yield large numbers of which of the following organisms?
A. Escherichia coli
B. Proteus mirabilis
C. Salmonella typhimurium
*D. Staphylococcus aureus
E. Streptococcus faecalis

10. Drug Y has a volume of distribution (Vd) of 75 L in both younger and older adult men. In younger adults, it has a clearance rate of 15 L/h, 50% of which is via the liver and 50% via the kidneys. For younger men, the maintenance regimen is 100 mg every 6 hours. Which of the following regimens will produce essentially the same steady-state concentration in an older man, whose creatinine clearance is reduced to half that of younger men, but whose hepatic function is unimpaired?
A. 75 mg every 3 hours
*B. 75 mg every 6 hours
C. 75 mg every 9 hours
D. 100 mg every 3 hours
E. 100 mg every 6 hours
F. 100 mg every 12 hours

11. A patient seen in the emergency department does not know which “heart drug” he is taking. His heart rate is greater than 80/min, and the PR and QRS intervals on an ECG are prolonged. The patient reports ringing in his ears. Which of the following drugs has the patient most likely been taking?
A. Digoxin
B. Lidocaine
C. Phenytoin
D. Propranolol
*E. Quinidine

12. An 8-year-old boy needs to be coaxed to go to school, and often, while there, he complains of severe headaches or stomach pain. Sometimes his mother has to take him home because of his symptoms. At night, he tries to sleep with his parents. When they insist he sleep in his own room, he says there are monsters in his closet. These findings are most consistent with which of the following diagnoses?
A. Childhood schizophrenia
B. Normal concerns of latency-age children
*C. Separation anxiety disorder
D. Socialized conduct disorder
E. Symbiotic psychosis

Chapter 5. Item Content: Testing Application of Clinical Science Knowledge

Methods for Assessment

Despite continued debate about the appropriateness of multiple-choice tests, all three Step exams of the USMLE currently include only multiple-choice questions (MCQs). In a quest for the optimal evaluation instrument, the NBME has conducted continual research on test formats. For the past 25 years, a major focus of this research has been the Computer Based Examination (CBX) project. Since the mid-1970s, a second area of research has focused on standardized patients (SPs). Although these projects are in a research mode currently, both are being considered for live use in the future. As with other forms of more “authentic assessment,” CBX and SP-based examinations appear to provide significant advantages over MCQ tests for assessment of clinical competence: they have good face validity and pose tasks for the examinee in a way that may appear to be more realistic than MCQs. However, there are psychometric and practical problems that require better solutions before these methods are likely to be used for licensure in the United States.
While these research projects continue, other projects have focused on enhancing the multiple-choice format. As a result of test development research, MCQs today appear very different than those used in the past. For content, as well as psychometric reasons, true/false varieties such as K-types (multiple true/false) and C-types (A, B, Both, Neither) are no longer used on the licensure exam. While most of the questions on Step 2 have the traditional five options, both A-type questions and extended-matching questions include as many as 26 options, pushing the examinee task to something closer to uncued free-response. With few exceptions, every item on Step 2 provides a patient vignette that focuses on a task that is relevant to a new intern, such as determining the diagnosis, or the next step in patient care. These items require interpretation and synthesis of the data that are provided; they also require application of knowledge to familiar or unfamiliar situations (depending on the experiences of the examinee). Item sets that include several questions relating to the same patient scenario are also used. Patient vignette items attempt to include some of the advantages of simulations, while avoiding some of the disadvantages.

General Issues Regarding What to Test

There are several tensions that influence the construction of each Step exam which may be relevant to you as you consider what to include on your
