Summary

This article proposes an approach for studying test fairness, linking it directly to test validity. It argues that fairness is comparable validity for identifiable groups. The article uses the TOEFL iBT test as an example.

Full Transcript


How do we go about investigating test fairness?

Xiaoming Xi, Educational Testing Service, USA

Language Testing 27(2), 147–170, © The Author(s) 2010. Originally published online 8 March 2010. DOI: 10.1177/0265532209349465

Abstract

Previous test fairness frameworks have greatly expanded the scope of fairness, but do not provide a means to fully integrate fairness investigations and set priorities. This article proposes an approach to guide practitioners on fairness research and practices. This approach treats fairness as an aspect of validity and conceptualizes it as comparable validity for all relevant groups. Anything that weakens fairness compromises the validity of a test. This conceptualization expands the scope and enriches the interpretations of fairness by drawing on well-defined validity theories while enhancing the meaning of validity by integrating fairness in a principled way. The TOEFL iBT test is then used to illustrate how a fairness argument may be established and supported in a validity argument. The fairness argument consists of a series of rebuttals to the validity argument that would compromise the comparability of score-based interpretations and uses for relevant groups, and it provides a logical mechanism for identifying critical research areas and setting research priorities. This approach will hopefully inspire more investigations motivated by and built on a central fairness argument. It may also foster a deeper understanding and expanded explorations of actions based on test results and social consequences, as impartiality and justice of actions and comparability of test consequences are at the core of fairness.

Keywords

test fairness, validity, validity argument, fairness argument, comparable validity, test validity

Motivated by broader social justice theories (Jensen, 1980), test fairness has been defined in many different ways. It is broadly conceived in the Standards for Educational and Psychological Testing (AERA, APA & NCME, 1999) (hereafter called the 1999 Standards) as absence of bias, equitable treatment of all test takers in the testing process, and equity in opportunity to learn the material in an achievement test. The development of conceptual frameworks for fairness in language testing has greatly expanded the scope of fairness (Kunnan, 2000, 2004); however, empirical research that is motivated by and couched in these frameworks has been slow to catch up. For one thing, current empirical research in language testing has been piecemeal.
The studies have typically focused on only one of a number of different aspects of fairness at any one time. These aspects may include differential item functioning (DIF) investigations across sub-groups (see Kunnan, 2000 and Ferne & Rupp, 2007 for comprehensive reviews of DIF research in language testing), the influence of construct-irrelevant test taker characteristics on test performance (Alderson & Urquhart, 1985a, b; Zeidner, 1986; Hale, 1988; Kunnan, 1995; Clapham, 1998; Taylor et al., 1998), the influence of interviewer behavior on examinees' speaking scores across studied groups (Brown, 2003), the influence of gender bias in oral interviews (O'Loughlin, 2002), the invariance of factor structures of test scores across groups (Swinton & Powers, 1980; Hale et al., 1989; Oltman et al., 1990; Ginther & Stevens, 1998; Stricker et al., 2005), and the reliability of multiple-choice test scores across L1 groups (Brown, 1999). No research has analyzed in depth how different manifestations of unfairness may impact the ultimate score interpretation and score-based decisions for a particular assessment. Further, all of the empirical studies have looked at the stability of score interpretations across groups in different ways, but almost none have addressed the consistency of score-based decisions (see Zeidner, 1987 for an exception) or the comparability of the broader effects of testing for different groups. Although there has been a substantial amount of work on the consequences of large-scale language tests (see Cheng, 2008 for a comprehensive review), none of the studies have really looked at the differential impact a language test might have on different groups of test takers.

One reason for the lag of empirical fairness research in language testing may be that previous fairness frameworks (Kunnan, 2000, 2004), although very useful in pointing to general areas of potential research and practice, may not provide practical guidance on how to go about developing the relevant evidence to support fairness. Another reason may be that the frameworks themselves do not offer a means to plan fairness research and set priorities, nor do they provide a mechanism to integrate all aspects of fairness investigations into a fairness argument. Given that it is impossible for a test to be perfectly fair for the intended use(s), a systematic way to identify areas where research and practice are most needed is necessary to focus resources on the key areas. To guide practical research, a framework should provide a principled way to anticipate potential threats to fairness, to identify and prioritize research needs, and to gauge the progress of fairness investigations.

This article proposes an approach to investigating fairness for an assessment to guide practitioners on fairness research and practice. The Test of English as a Foreign Language Internet-based test (TOEFL iBT test) is used to illustrate how this approach may be applied to establish and support the overall fairness argument for using an assessment for its intended purpose(s).
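The DIF studies cited above typically rely on procedures such as the Mantel-Haenszel statistic. As a concrete illustration (my own sketch, not taken from the article), the Python function below computes the Mantel-Haenszel common odds ratio for one dichotomous item, stratifying on total score; the ETS delta transformation and the |delta| >= 1.5 flagging convention are standard in the DIF literature, but the implementation details here are assumptions.

```python
import numpy as np

def mantel_haenszel_dif(correct, focal, total):
    """Mantel-Haenszel DIF statistic for one dichotomous item.

    correct: 0/1 item responses; focal: 1 for the focal group,
    0 for the reference group; total: matching variable, e.g.
    total test score. Returns (common odds ratio, ETS delta).
    """
    num = den = 0.0
    for t in np.unique(total):          # stratify by the matching variable
        m = total == t
        n = m.sum()
        if n < 2:                       # skip near-empty strata in this sketch
            continue
        a = np.sum(m & (focal == 0) & (correct == 1))  # reference, right
        b = np.sum(m & (focal == 0) & (correct == 0))  # reference, wrong
        c = np.sum(m & (focal == 1) & (correct == 1))  # focal, right
        d = np.sum(m & (focal == 1) & (correct == 0))  # focal, wrong
        num += a * d / n
        den += b * c / n
    alpha = num / den                   # > 1 means the item favors the reference group
    delta = -2.35 * np.log(alpha)       # ETS delta scale; |delta| >= 1.5 is often flagged
    return alpha, delta
```

A flagged item would then go to expert review; as the article argues later, statistical group differences only become evidence of bias once they are interpreted through value judgment.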
To establish a framework that is useful for practical research, we must ask two fundamental questions: first, how do we define fairness in a way that is meaningful and not too abstract for practitioners? Second, what are the concrete steps to follow in investigating fairness? The first question concerns the conceptualization of fairness and the second pertains to the process of planning and conducting fairness research.

Fairness has been conceptualized in various ways. Although these conceptual approaches may vary on dimensions such as how much emphasis is placed on the social and political aspects of fairness, a central point on which they differ is how fairness is related to validity: in particular, whether fairness is independent of validity, subsumes it, or is a facet of it. The first view sees fairness as a relatively independent facet of test quality or general testing practices and does not make clear and consistent connections to validity (Joint Committee on Testing Practices, 1988; ETS, 2002). Another view sees fairness as an overarching test quality that consists of different facets including validity (Kunnan, 2000, 2004). Proponents of the third view treat validity as the fundamental test quality and link fairness directly to it (Willingham & Cole, 1997; Willingham, 1999; AERA, APA & NCME, 1999).

The three approaches are discussed in detail below. For each approach, its characterization of fairness, its strengths, and its limitations as a workable framework for guiding practical research will be addressed.

Different Conceptualizations of Fairness

View 1: Fairness as an independent test quality

In this view, fairness is characterized as a test quality that is separate from validity, although some tenuous and inconsistent references may be made to validity. The Code of Fair Testing Practices in Education (Joint Committee on Testing Practices, 1988, 2004) and the Standards for Fairness and Quality by Educational Testing Service (ETS, 2002) are probably representative of this approach.

The Code of Fair Testing Practices in Education (Joint Committee on Testing Practices, 1988) (hereafter called the 1988 Code) is based on the relevant parts of the 1985 Standards (AERA, APA & NCME, 1985). However, unlike the 1985 Standards, which are primarily for professionals involved in making tests, using tests and interpreting test results, the 1988 Code is intended to be accessible to the general public (especially test takers and their parents or guardians) and focuses only on issues that impact the proper use of educational tests. It argues that test developers and test users share joint responsibilities in advocating fair testing practices in four areas: developing and selecting tests, interpreting scores, striving for fairness, and informing test takers. The 1988 Code clearly defines the responsibilities that test developers and users have respectively in ensuring fairness in each area. To strive for fairness, it is recommended that test developers conduct sensitivity reviews of test materials, ensure that differences in test performance across sub-groups are not due to irrelevant factors, and provide appropriate accommodations for examinees with disabilities.
Correspondingly, test users are encouraged to review and evaluate the sensitivity review procedures and the information on performance differences, and to use appropriately modified test forms or administration procedures for examinees with disabilities.

In 2004, the 1988 Code was modified and expanded (Joint Committee on Testing Practices, 2004). Striving for fairness is not listed as a separate area; rather, fairness issues are discussed in relation to the aspect of the testing process they are associated with: developing and selecting appropriate tests, administering and scoring tests, reporting and interpreting test results, or informing test takers. This treatment highlights the role of fairness as a test quality that permeates the whole assessment process.

The primary focus of the Code is the division of responsibilities between test developers and users in ensuring fair testing practices. This is also a major contribution of the Code compared to the Standards, as the partition of responsibilities between test developers and users has not always been clear cut (Shohamy, 2001). Since it requires both test developers and users to work in concert to ensure fairness, guidelines as to who is responsible for what help promote fairness.

The ETS Standards for Fairness and Quality (ETS, 2002) is largely informed by the 1999 Standards. It contends that 'fairness requires that construct-irrelevant personal characteristics of test takers have no appreciable effect on test results or their interpretation' (p. 17). It then provides a comprehensive list of fairness standards that need to be satisfied for assessment products and services, if deemed relevant and appropriate for the particular assessment context. These standards include ensuring fairness in the design, development, administration, and use of the assessment product or service for the studied groups; obtaining and documenting judgmental and empirical evaluations of fairness for studied groups; eliminating offensive, sexist, and racist symbols, language, or content in test materials; providing impartial access to products and services and impartial registration, administration, and reporting of assessment results; and offering appropriate and reasonable accommodations for test takers with disabilities. Although the ETS Standards for Fairness and Quality includes a broad list of fairness standards, it does not provide a mechanism for prioritizing them or for weighing one piece of fairness evidence against another.

Most of the standards for ensuring validity in the ETS Standards for Fairness and Quality apply to the design, development, administration, and use of an assessment product for the entire test-taking group. However, one of the standards requires the examination of the consequences of using assessment results for different studied groups. It further elaborates that if the use of assessment results causes unintended consequences for a studied group, the validity evidence should be investigated to see if the differential impact for the studied group is a result of construct-irrelevant factors or construct under-representation. This treatment supports the linkage between fairness and validity and points to the potential of connecting fairness and validity in a more consistent and coherent way.
View 2: Fairness as an all-encompassing test quality

The second view gives primacy to test fairness and defines it as a test quality which subsumes and goes beyond validity. Implicit in this view is the argument that a test has to be valid to be fair. This view is evident in Kunnan's work (Kunnan, 2000, 2004), which represents the first attempt in language testing to propose an overarching framework for fairness research. His earlier work on fairness (Kunnan, 2000) expands on the 1988 Code and describes fairness as a three-faceted concept: validity, access, and justice. He adopts Jensen's definition of test fairness and sees fairness as 'the ways in which test scores (whether of biased or unbiased tests) are used in any selection situation' (Jensen, 1980, p. 376). He agrees with Jensen that 'the concepts of fairness, social justice, and equal protection of the laws are moral, legal, and philosophical ideas and therefore must be evaluated in these terms' (Jensen, 1980, p. 376).

This definition is built upon broader social justice theories. As Kunnan points out, it highlights the social, ethical, legal and philosophical aspects of fairness, in addition to the traditional psychometric dimension of tests and testing practice that defines fairness.

Kunnan's more recent work (Kunnan, 2004) draws on the 1988 Code, the 1999 Standards and Willingham and Cole's (1997) notion of comparable validity. However, he brings to the forefront two unique fairness qualities, access and administration, which have been ignored or not emphasized in previous work. The other three qualities, validity, absence of bias, and social consequences, have been addressed in one way or another in previous fairness frameworks. Kunnan sees fairness as a test quality that encompasses validity, absence of bias, access to the test, administration conditions, and test consequences. His discussion of absence of bias comes the closest to traditional territories of fairness, supporting the central argument that fairness is threatened if test content or language is offensive to test takers from certain backgrounds, or if group differences exist in test performance, in test scores' prediction of success, or in selection decisions based on test scores, as a result of construct-irrelevant factors. The other four qualities, though, do not rest on the concept of group differences and pertain to the overall quality of testing practices.

Although Kunnan's frameworks have made important contributions in broadening the span of fairness in language testing, they could be improved in a few ways. First, current validity theories address the appropriateness of score-based decisions and the consequences of testing (Messick, 1989), which are also primary concerns of fairness. Current validation frameworks (Kane, 1992; Kane et al., 1999; Kane, 2001, 2002, 2004, 2006; Bachman, 2005; Chapelle et al., 2008) have provided means to address all the fairness qualities proposed in Kunnan (2004) in a coherent way within the framework of a validity or assessment use argument. It does not seem necessary to treat them as separate facets of fairness. Second, as noted by Bachman (2005), the various fairness qualities discussed in Kunnan's work are important by themselves but are not necessarily connected to one another. Thus a mechanism is needed to integrate them properly to support an overall fairness argument.
Third, Kunnan's frameworks do not offer any guidance in planning and prioritizing fairness research or in evaluating the overall fairness argument, although he notes the potential usefulness of the argument-based approach for fairness investigations (Kunnan, 2004).

McNamara and Roever (2006), in endorsing a few important pieces of conceptual fairness work in language testing (Kunnan, 2000; Shohamy, 2000), express the view that test fairness encompasses many different aspects. While not providing an elaborated definition or defining the exact scope of test fairness, they steer their discussion toward the social dimensions of language testing that are manifest in investigations of item bias (i.e. test items that favor or are biased against certain groups of test takers). McNamara and Roever distinguish the psychometric and social approaches adopted to investigate item bias, while acknowledging the interrelatedness of the two. The psychometric approach is motivated by the desire to ensure social justice, and interpretations of psychometric results are informed by value judgments. The socially and politically inspired approach features fairness reviews by testing agencies and codes of ethics that guide the practices of test developers and practitioners, including the practices that are supported by statistical and measurement procedures. As McNamara and Roever argue, 'fairness review constitutes a systematic process of identifying possibly biased items or items that may be so controversial that a test's acceptance might suffer. The implementation of fairness review processes is an acknowledgment of the political side of the test design process and a step away from purely psychometric procedures, which make an important contribution to test quality but are the least obviously socially oriented procedures' (2006, p. 147).

McNamara and Roever's approach puts considerable emphasis on the social and political aspects of fairness. They have rightfully highlighted the role of value judgments informed by social and political considerations, in addition to psychometric issues, in item bias investigations. Although enlightening, their discussion, which is contextualized in a larger discussion of the social dimensions of language assessment, is limited in its scope and does not intend to provide a full treatment of fairness.

View 3: Fairness linked directly to validity

The 1999 Standards defines 'customary responsibilities' for professional test developers, publishers, sponsors, and users in the evaluation of tests, testing practices and effects of test use (AERA, APA & NCME, 1999, p. 73). It contains a section on fairness in testing and test use. While recognizing the existence of many other alternative but equally legitimate perspectives on fairness, it endorses three prevalent characterizations of test fairness in the field of educational and psychological testing: fairness as lack of bias, fairness as equitable treatment of all examinees in the testing process, and fairness as equity in opportunity to learn the materials covered in an achievement test.
The 1999 Standards explicitly rejects a popular view that fairness requires the equality of testing outcomes for different test taker groups, and argues that a more widely accepted view in the professional literature would hold that test takers from different groups who have equal standing with respect to the construct of interest should on average receive the same test score. The 1999 Standards also elaborates on fairness issues related to the rights and responsibilities of test takers, testing individuals of diverse linguistic backgrounds, and testing individuals with disabilities. The 1999 Standards advocates the gathering of multiple types of evidence to support test fairness, including evidence related to the content of the tests, the internal structure of test scores for different sub-groups, and the relationships of test scores to other external measures.

The 1999 Standards discusses a few types of validity evidence based on test content, examinee or rater response processes, internal structures of scores, relations of test scores to scores on non-test measures, and consequences of testing. In discussing each type of validity evidence, the corresponding fairness issues are alluded to and then further clarified and expanded in the separate section on fairness. In particular, the Standards argues that each type of validity evidence should also be examined for relevant sub-groups of examinees. The purpose is to determine whether the meaning and interpretation of assessment scores and the consequences of the use of assessment results may differ as a result of construct-irrelevant factors or construct under-representation. This connection between discussions of fairness and validity suggests a strong possibility for linking fairness back to validity in a principled way. This kind of linkage would allow fairness research and practice to take advantage of a well-defined framework for validity.

The 1999 Standards has been influential in guiding empirical research and practice in educational testing. It has pointed to general areas of fairness research and practice and offered checklists for fairness investigations. However, it does not provide a systematic approach that allows the integration of all aspects of fairness practices and investigations, nor does it provide a mechanism to set priorities for fairness investigations.

Willingham and Cole (Willingham & Cole, 1997; Willingham, 1999) provide the most lucid conceptualization of fairness. They see test fairness as an important aspect of validity and conceive of it as comparability in assessment and comparable validity for all individuals and groups. As they contend, 'viewing fairness as comparable validity defines and elaborates the interpretation of test fairness because validity is based on an established system of features and evidence' (Willingham & Cole, 1997, p. 7). It follows from this conceptualization that whatever weakens fairness also compromises the validity of a test.

Three fairness qualities are highlighted in Willingham and Cole's fairness framework: comparability of opportunity for examinees to demonstrate relevant proficiency, comparable assessment exercises and scores, and comparable treatment of examinees in test interpretation and use.
The first quality is most relevant to the design stage in the assessment process, the second to development and administration, and the third to test use. They also propose that fairness issues be organized around four stages of the assessment process: design, development, administration and use. They argue that different validity issues are relevant at different assessment stages, and these determine the relevant fairness investigations.

Willingham (1999) reinforces comparable validity as the fundamental principle for fairness and elaborates further on the fairness issues at each stage of the assessment process in the context of educational measurement. Willingham also stresses that special attention should be given to the use of assessments and implications for use in fairness investigations, because fairness essentially rests on social judgments that may be informed by statistical and measurement procedures (Willingham, 1999). This perspective is consistent with the emphasis on social dimensions of fairness in other fairness frameworks.

Willingham and Cole also point out the importance of integrating fairness investigations rather than treating them in isolation, because they are usually interrelated and impact one another. The assessment process provides a useful framework for anticipating and addressing fairness issues in conceptualizing and developing an assessment for a particular use; however, it does not provide a means to plan and prioritize fairness investigations or to integrate and evaluate all the fairness evidence.

Fairness, Ethics, and Professional Standards for Language Testing

In addition to explicit discussions of fairness, the topics of ethics and professional standards have attracted growing attention in language testing. Most notably, in the past decade or so, two special journal issues on ethics have come out, one in Language Testing (1997) and the other in Language Assessment Quarterly (2004), both guest edited by Davies. Correspondingly, the language testing field has attempted to codify ethics guidelines and professional obligations of language testers by publishing the International Language Testing Association's Code of Ethics (ILTA, 2000) and draft Code of Practice (ILTA, 2005), and the Association of Language Testers in Europe's Code of Practice (ALTE, 1994). These codes of ethics and practice go beyond individual test makers and provide guidelines for the entire language testing profession on what constitutes acceptable professional standards of conduct.

Ethics has been conceived as a much broader concept than fairness in the field of language testing; as Shohamy argues, 'language tests employing methods which are not fair to all test takers are unethical' (2000, p. 340). Ethics may go beyond validity and fairness issues and cover all accepted professional standards of conduct. Being 'ethical' is defined as 'conforming to accepted professional standards of conduct' (Webster's Ninth New Collegiate Dictionary, 1988). Although the concepts of fairness and ethics do not totally overlap, as McNamara and Roever (2006) point out, awareness of ethical issues is an impetus for language testers to devote more attention to the impact of the decisions they make about the assessment process on various stakeholders, including the impact for different test-taking groups, which is what fairness is all about.
A New Approach to Fairness

In this section, a new approach to investigating fairness is described. Fairness is first defined, and the concrete steps in articulating and supporting a fairness argument are then discussed using the TOEFL iBT test as an example. A few major considerations in prioritizing fairness investigations are also discussed.

The definition of fairness

Webster's Ninth New Collegiate Dictionary (1988) defines 'fair' as 'free from favor toward either or any side'. This definition suggests that a central focus of fairness in the context of testing is the comparison of testing practices and test outcomes across different groups. Willingham and Cole's view of fairness as comparable validity for all groups (Willingham & Cole, 1997; Willingham, 1999) is consistent with this definition. Building on their conceptualization, fairness is defined here as comparable validity for identifiable and relevant groups across all stages of assessment, from assessment conceptualization to the use of assessment results. This conceptualization of fairness implies that a test has to be fair to be valid. Anything that weakens fairness compromises the validity of test score interpretation and use. To further elaborate on this definition, it is argued that fairness requires that construct-irrelevant factors, construct under-representation, inconsistent test administration practices, and inappropriate decision-making procedures or uses of test results have no systematic and appreciable effects on test scores, test score interpretations, score-based decisions and consequences for relevant groups of examinees.

This conceptualization expands the scope and enriches the interpretations of fairness by drawing on evolving validity theories, while enhancing the meaning of validity by integrating fairness. The span of fairness investigations and the methods for investigating fairness can be illuminated by well-established validity theories and validation frameworks (Cureton, 1951; Cronbach & Meehl, 1955; Cronbach, 1988; Messick, 1989; Kane, 1992; Kane et al., 1999; Kane, 2001, 2002, 2004; Bachman, 2005; Kane, 2006; Chapelle et al., 2008). Specifically, the psychometric dimensions of fairness (such as differential item functioning, differential factorial structures, and differential criterion-related validity across sub-groups) can readily fit into a validity framework; Messick's expansion of validity to include values and social consequences of testing practices (Messick, 1989) clearly resonates with the social and political dimensions of fairness, thus providing a platform for linking fairness to validity on all levels. Regarding the methods for examining fairness, the recent advancement of the argument-based validation approach by Kane and his associates offers a vehicle for fairness to be articulated and supported as a component in the larger validity argument. Fairness investigations may also benefit from further advances in validity theories and validation approaches, which have consistently attracted major research efforts in the field of educational and psychological testing.

Validity is typically established by evidence that supports the soundness of score-based interpretations and uses for the whole test-taking population.
With the integration of fairness in a systematic fashion, however, the concept of validity is expanded to require further evidence pertaining to the comparability of score-based interpretations and uses for relevant sub-groups. Willingham and Cole's work (Willingham & Cole, 1997) provides a foundation for the approach described above. Nevertheless, because their framework does not draw on an argument-based approach to validation, it cannot take advantage of this systematic approach to investigating validity and fairness. The approach proposed below lends itself to building and supporting an overall fairness argument and provides a systematic way to organize different types of fairness evidence.

The approach to investigating fairness: a fairness argument in a validity argument

In this section, the TOEFL iBT test is used as an example to demonstrate how a fairness argument may be built and substantiated in the context of a validity argument. The constructs and intended uses of the test are described first to provide some background for the subsequent substantive discussions. The discussions include the validity argument, the fairness argument, the relationship between the validity and fairness arguments, and priority-setting in fairness investigations.

Constructs and intended uses of the TOEFL iBT test

An examination of the intended interpretations and uses of the TOEFL iBT test scores is an essential first step in articulating and building a fairness argument. The TOEFL iBT test was released in September of 2005 in North America and has since been launched worldwide. It consists of Reading, Listening, Writing and Speaking sections and is intended to assess the ability to use English for communicative purposes in an academic environment at English-medium post-secondary institutions. The Writing and Speaking sections contain integrated tasks which require test takers to first read and/or listen to academic materials and then write or speak about them.

The TOEFL iBT test is intended for two uses: its primary purpose is for admitting applicants who are non-native speakers of English into undergraduate or graduate programs, and its secondary purpose is for determining if admitted students need remedial English classes. For the first purpose, the TOEFL iBT test scores can be used along with other indicators of candidates' academic competence to make admissions decisions. For placement purposes, the TOEFL iBT scores can be used alone or along with an in-house English placement test to exempt international students from taking English classes or to place them into English support classes. In addition, the Speaking section of the TOEFL iBT test can be used in conjunction with an in-house international teaching assistant (ITA) screening test to qualify international students for teaching assignments.

The validity argument for the TOEFL iBT test

When fairness is linked to validity, the scope of fairness investigations, the kinds of evidence needed, and the means to organize and integrate fairness evidence depend on how validity is conceptualized and structured. As discussed earlier, current views consider the test validation process as building an argument: a validity argument can be organized around a series of inferences that lead to appropriate test score interpretations and uses (Kane, 1992; Kane et al., 1999; Kane, 2001, 2002, 2004, 2006). Validation consists of two stages.
The first stage involves the construction of an interpretive argument, consisting of the chain of inferences linking test performance to a decision, the warrant supporting each inference, and the assumptions upon which the warrant rests. The soundness of the interpretive argument is then evaluated in the context of a validity argument in the second stage.

This approach to test validation has inspired expansions and applications in the context of language testing (Bachman, 2005; Fulcher & Davidson, 2007; Chapelle et al., 2008). In particular, Chapelle et al. (2008) extended the typical inferential bridges in Kane's work. They then applied the adapted framework to provide an extensive narrative of the interpretive argument for the TOEFL iBT test and an evaluation of the strength of the interpretive argument in the context of a validity argument. In their modified framework, six types of inferences are essential in linking performance on the TOEFL iBT test to the intended score interpretations and uses: Domain Definition, Evaluation, Generalization, Explanation, Extrapolation, and Utilization. Figure 1 illustrates the six inferential steps and the mechanisms under which they can be organized conceptually to link an observation in a test to score-based interpretations and uses. The discussion below on the warrants for the inferences is mostly consistent with that in Chapelle et al. (2008).

Domain description: The first link is from the target domain to observations on the test. The warrant supporting this inference is that the target domain of language use in English-medium institutions of higher education provides a basis for the observations of performance on the TOEFL test to reveal relevant knowledge, skills, and abilities.

Evaluation: The second link, from observations on the test to observed test scores, hinges on the warrant that observations of performance on the TOEFL iBT test are obtained and evaluated appropriately to provide observed scores reflective of intended academic language abilities, not other irrelevant factors.

Generalization: The third link is from the observed score to the expected (universe) score. The pertinent warrant is that the observed scores on the test are generalizable over similar language tasks in the universe, test forms, and occasions.

Explanation: The fourth link, between the expected scores and the theoretical score interpretation, bears on the warrant that expected scores can be accounted for by underlying language abilities in an academic environment.

Extrapolation: The fifth link connects the theoretical score interpretation and the target score interpretation. The warrant is that the theoretical construct of academic language abilities accounts for the quality of language performance in English-medium institutions of higher education.

At these two links (Explanation and Extrapolation), meaning can be attached to the expected scores in two potential ways to support valid interpretations of the assessment results. The expected scores can be interpreted by drawing on a theoretical construct (e.g. a communicative competence model) that underlies consistencies in test takers' performances. For assessments for which specific domains of generalization can be defined, this representation of the meaning of assessment results is further contextualized in the target domain to which the test scores are intended to be generalized.
In some instances, in the absence of a strong construct theory, the generalization of test performance to the intended domain may sustain the link from the expected scores to the target score interpretation.

Utilization: The last link connects score-based interpretations and test use. The warrants are that test scores and other information provided to users are relevant, useful and sufficient for evaluating the adequacy of international students' English proficiency for studying at English-medium institutions, for determining the appropriate ESL coursework needed, and for selecting international teaching assistants, and that they have beneficial consequences for the teaching and learning of English.

These six inferences, if supported, increasingly add meaning and value to the elicited test performance, thus supporting score-based decisions.

Figure 1. Types of inferences in an interpretive argument (modified based on Chapelle et al., 2008): Target domain → (Domain definition) → Observation → (Evaluation) → Observed score → (Generalization) → Expected score → (Explanation) → Theoretical score interpretation → (Extrapolation) → Domain score interpretation → (Utilization) → Test use.

The fairness argument in the validity argument

This section addresses how a fairness argument can be established that is embedded within the validity argument. The previous section provides an account of the typical inferences underlying the interpretation and use of the TOEFL iBT scores and the warrants supporting the inferences. As previously discussed, for each warrant there is also a set of assumptions in need of backing, which are listed in Table 1 along with the associated warrant and inference. The inferences, warrants, assumptions, and backing are key elements in most interpretive arguments. Some argument structures may also include an explicit representation of rebuttals that would invalidate or reduce the force of the claim or conclusion. This is where the fairness argument can be articulated.

As shown in Table 1, the fairness argument consists of a series of rebuttals that may challenge the comparability of scores, score interpretations, score-based decisions and consequences for sub-groups. For fairness investigations, we need to articulate a coherent fairness argument by specifying the series of rebuttals. To substantiate this argument, evidence has to be put forward that sustains the comparability of the score-based interpretations and uses for different relevant groups. This evidence needs to discount or reduce the force of the rebuttals that the score interpretations and uses are not comparable across groups due to construct-irrelevant factors, construct under-representation, inappropriate score reporting practices or decision-making procedures, or unintended uses of the test scores. Failure to rebut any of these counter-assertions may weaken the fairness argument and thus compromise the validity of the TOEFL iBT test.

Just as each inference addresses a different aspect of the validity argument, the foci for fairness that are relevant to each inference vary as well. Below is a brief description of the major fairness issues that pertain to each inference.

Domain definition: The relevant fairness issue is whether test tasks are equally relevant to and representative of the sub-domains for different test taker groups.
For example, since the TOEFL iBT test scores are used for admitting both undergraduate and graduate applicants who are non-native speakers of English, a potential fairness issue is that the test tasks may not assess some critical language skills required of undergraduate or graduate students. For another example, since the TOEFL iBT scores are used for admitting international students to English-medium institutions of higher education, using American accents only in the listening section may pose fairness issues for candidates seeking admission to colleges and universities in Britain, Australia or New Zealand.

Table 1. Inferences, warrants and assumptions in the validity argument and counter-arguments in the fairness argument. (The warrants and assumptions are modified based on Chapelle et al., 2008.)

Domain definition
Warrant: Observations of performance on TOEFL reveal relevant knowledge, skills, and abilities in situations representative of those in the target domain of language use in English-medium institutions of higher education.
Assumptions: 1. Assessment tasks representing the academic domain can be identified. 2. Critical English language skills, knowledge, and processes needed for study in English-medium colleges and universities can be identified. 3. Assessment tasks requiring important skills and representing the academic domain can be simulated.
Rebuttals that would weaken the fairness argument: 1. Assessment tasks are not equally representative of the academic domain for different groups. 2. Critical English language skills, knowledge, and processes required for some sub-domains are not assessed. 3. Varieties of English included in the test are not representative of the domain.

Evaluation
Warrant: Observations of performance on TOEFL tasks are obtained and evaluated to provide observed scores reflective of targeted academic language abilities.
Assumptions: 1. Rubrics for scoring responses are appropriate for providing evidence of targeted language abilities. 2. The test provides equal opportunities for test takers to demonstrate intended knowledge, skills and abilities. 3. Task administration conditions are appropriate for providing evidence of targeted language abilities. 4. The test delivery system is appropriate for supporting the assessment of targeted language abilities. 5. The statistical characteristics of items, measures and test forms are appropriate for norm-referenced decisions. 6. Appropriate and reasonable accommodations are provided to test takers with disabilities. 7. Raters are well trained and monitored to ensure trustworthy scores.
Rebuttals: 1. Rubrics emphasize linguistic features not relevant to the domain or do not include some highly relevant features, biasing toward or against certain groups. 2. Inappropriate test content or construct-irrelevant knowledge and skills engaged by some test items, or under-representation of the domain, lead to group differences in item/test scores. 3. Inconsistent test administration practices lead to group differences in test scores. 4. Factors in the test delivery system introduce construct-irrelevant differences in test scores across groups. 5. Item/task response format introduces construct-irrelevant differences in test scores across groups. 6. Test takers with physical or learning disabilities are not provided with appropriate accommodations to help demonstrate their relevant abilities.

Generalization
Warrant: Observed scores are estimates of expected scores over the relevant parallel versions of tasks and test forms and across raters and occasions.
Assumptions: 1. A sufficient number of tasks are included on the test to provide stable estimates of test takers' performances. 2. The configuration of tasks on measures is appropriate for the intended interpretation. 3. Appropriate scaling and equating procedures for test scores are used. 4. Task and test specifications are well defined so that parallel tasks and test forms are created.
Rebuttals: 1. Construct-irrelevant factors lead to differences in the generalizability of scores for different groups.

Explanation
Warrant: Expected scores are attributed to a construct of academic language proficiency.
Assumptions: 1. The linguistic knowledge, processes, and strategies required to successfully complete tasks are consistent with theoretical expectations. 2. Performance on the test measures relates to performance on other test-based measures of language proficiency as expected theoretically. 3. The internal structure of the test scores is consistent with a theoretical view of language proficiency as a number of highly interrelated components. 4. Test performance varies according to the amount and quality of experience in learning English.
Rebuttals: 1. Some assessment tasks engage irrelevant processes and strategies for some test taker groups. 2. Construct-irrelevant factors lead to different factor structures for different groups. 3. Construct-irrelevant factors lead to differences in the relationships between the test of interest and other relevant test-related measures for different groups.

Extrapolation
Warrant: The construct of academic language proficiency as assessed by the TOEFL accounts for the quality of linguistic performance in English-medium institutions of higher education.
Assumptions: Performance on the test is related to other criteria of language proficiency in the academic context.
Rebuttals: 1. Inappropriate test content or construct under-representation leads to differences in predicting performances on relevant criterion measures for different groups.

Utilization
Warrant: The test scores and other related information provided to users are relevant and useful for making decisions about admissions, appropriate ESL coursework needed, and the selection of international teaching assistants.
Assumptions: 1. The score reports and other related information provided to users support appropriate decision-making. 2. The meaning of test scores is clearly interpreted by admissions officers and teachers to aid relevant decision-making. 3. Reasonable admissions standards are used to ensure students can cope with the communication demands. 4. The test will have a positive influence on how English is learned and taught around the world.
Rebuttals: 1. Inappropriate score aggregation and reporting practices lead to biased decisions for members of some groups. 2. Information about group differences is inappropriately used in decision-making, leading to biased decisions for members of some groups. 3. Factors in the decision-making process, such as inappropriate cut score models, lead to biased decisions for some groups. 4. Construct-irrelevant factors, construct under-representation, or inappropriate decision-making processes cause negative impact on some groups. 5. Different groups of test takers have differential access to test preparation materials, thus impacting the equity of the testing practice. 6. Inappropriate use of test results causes negative impact on some groups.

Evaluation: The major fairness concerns include score differences across groups that may be introduced by a host of factors, including: inconsistent test administration procedures; inappropriate item/task response format; irrelevant factors in the test delivery system; lack of, or inappropriate, test accommodations for test takers with disabilities; inappropriate test content; test content that under-represents the construct; rubrics that fail to represent the critical skills required in the domain or that represent irrelevant skills; and rater bias against certain groups in the scoring of the Writing or Speaking sections.

Generalization: This inference may be weakened by differences in the generalizability of scores across groups caused by construct-irrelevant factors. When score generalizability differs across sub-groups, additional investigations are needed to reveal whether the factors causing the difference are construct-irrelevant.

Explanation: The pertinent fairness concerns include differences in factorial structures, or in relationships between scores on the TOEFL iBT test and other relevant test-based measures for different groups, that are caused by construct-irrelevant factors. Evidence of irrelevant knowledge, processes and strategies engaged by some test taker groups to complete the tasks, as revealed through verbal protocol or stimulated recall research, may also weaken the explanatory power of the test scores.

Extrapolation: The fairness issue is that the relationships between candidates' scores on the TOEFL iBT test and on a relevant criterion measure may differ across sub-groups due to construct-irrelevant factors or under-representation of the target domain.

Utilization: One fairness issue involves the comparability of the relevance and usefulness of the assessment results for making decisions for different groups. Another issue related to decision making is whether the decision-making procedures lead to any unfair decisions for certain groups. In addition, the impact and consequences of score-based decisions on test taker groups need to be investigated to see if the use of the test incurs any non-comparable consequences for different test taker groups.

One point evident in the discussion above is that an inappropriate decision made about the assessment process may have effects on fairness issues pertinent to multiple inferences. This improper decision could gather more and more force working its way through the inferential bridges and become increasingly pronounced until it is manifest in its impact on the score-based decisions and consequences. For example, under-representation of the target domain for some test taker groups may be identified as a fairness concern for the Domain Definition inference. However, at this point, it remains unclear how much impact this would have on the score-based decisions and consequences.
Then this domain under-representation issue may manifest itself at the inferential step for Evaluation through unjustified score differences across groups. It may subsequently result in differential prediction of test takers' actual performance in the domain across groups, thus impacting the Extrapolation inference. The illegitimate score differences and the differential prediction of domain performances across groups may lead to decisions for some groups which are not fair. Eventually, unfair score-based decisions for some groups may have profound consequences for test takers in the disadvantaged groups, as they may be denied opportunities, access to resources, and credentials that they would otherwise deserve. These consequences may also have far more serious repercussions on the society at large and raise concerns about the fairness of the test.

By tracking the changes in the extent of the impact of the inappropriate decision through the inferential steps, it is shown how an inappropriate decision made about the assessment process may accumulate in strength and eventually have an impact at the level of score-based decisions or consequences. This approach does not deviate from prevalent fairness approaches that emphasize the justice of actions. An added advantage of this approach is that it offers a logical mechanism for examining how unfairness manifested in earlier inferential steps may accumulate force and eventually become salient through biased score-based decisions and inequitable consequences.

Another point worth mentioning is that the concept of group differences is central to the discussion above. Some types of group differences can be quantified by using statistical procedures (such as DIF), whereas others may be determined based solely on expert or value judgments (such as evaluations of the explanatory power of test scores, or of the consequences for different groups). Irrespective of the type of method used to identify group differences, fairness is essentially motivated and defined by social and political considerations. The interpretation of group differences identified by statistical procedures has to be informed by value judgment to determine whether bias actually exists.

Relationship between the fairness argument and the validity argument

This section uses an example to illustrate the relationship between a validity argument and a fairness argument. Figure 2 shows the argument structure for the Extrapolation inference. This structure can be used to lay out the argument supporting each of the inferences in the chain. This particular argument structure demonstrates the grounds on which the Extrapolation inference rests, the warrant that authorizes the step or the linkage from the grounds to the conclusion, the assumptions that underlie the warrant, the backing that is needed for the assumptions to hold true, and the rebuttals that may potentially compromise the soundness of the warrant. In this network of inferences, the intermediate conclusions that derive from the previous inferences become the data or the grounds for the subsequent inference (Toulmin et al., 1984). In Figure 1, the inferences that precede Extrapolation are Domain Definition, Evaluation, Generalization and Explanation.
If the corresponding warrants for these previous inferences are fully backed to support the intermediate conclusions, those conclusions become the grounds for the Extrapolation inference. The warrant that authorizes this step from Explanation to Extrapolation is 'The construct of academic English proficiency accounts for the quality of language performance on relevant tasks in an academic setting.'

There are two types of rebuttals that would weaken the strength of this intermediate conclusion. Type 1 rebuttals weaken the conclusion for all test takers, so a lack of counter-evidence reduces the force of the conclusion for the whole test-taking population. Type 2 rebuttals, on the other hand, point to the specific examinee groups to which the conclusion may not apply or for which it may not be completely tenable. For example, one potential rebuttal to the fairness argument is that the test measures the language skills which are important for undergraduate applicants only, but may not measure all of the critical language skills required of those who are graduate school bound, thus potentially leading to over- or under-prediction of their actual communication skills in an academic environment (Type 2 Rebuttal 1). In this case, if counter-evidence were not available to reject this rebuttal, the conclusion that the test scores are predictive of test takers' actual communicative skills in an academic setting would stand true only for the undergraduate applicants rather than for the whole test-taking population. Similarly, Type 2 Rebuttal 2, if no counter-evidence were put forward, would probably weaken the strength of the claim particularly for test takers residing outside the United States, if the test measures knowledge about American culture that is not defined as part of the construct. These two Type 2 rebuttals are specific to the fairness argument, which is further linked to the validity argument.

In contrast, failure to provide conclusive evidence against Type 1 rebuttals would reduce the force of the warrant for the whole test-taking group. The magnitude of the reduction will depend on how strong the counter-evidence is. The conditions that weaken a warrant compromise the absolute authority of the warrant, which is typically expressed with some qualification to indicate the degree of strength and the limitations of the corresponding claim.

In the example provided, one potential Type 1 rebuttal may be that faculty ratings of their students' language performance are biased by their perceptions of students' mastery of the content knowledge. In other words, a student who has a poor understanding of the content knowledge may be rated unfavorably on language performance, whereas mastery of the content knowledge may inappropriately boost a student's language ability ratings. This rebuttal, if not rejected by counter-evidence, would call into question the soundness of Backing 1. Similarly, if no evidence were available to counter the arguments that the association between students' TOEFL iBT scores and faculty ratings was based on a small and unrepresentative sample, or that faculty members were inconsistent in rating their students' language abilities, the soundness of Backing 1 would also be questioned.
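To make the Type 2 reasoning concrete, the sketch below (my own illustration, not from the article) shows one way counter-evidence for a rebuttal like Type 2 Rebuttal 1 might be sought: fitting separate regressions of a criterion measure (e.g., faculty ratings) on test scores for undergraduate and graduate applicants and comparing the prediction lines. The data and function names are hypothetical, and a real study would use significance tests or effect sizes rather than eyeballing raw differences.

```python
import numpy as np

def prediction_line(scores, ratings):
    """Least-squares slope and intercept for predicting the
    criterion (e.g., faculty ratings) from test scores."""
    slope, intercept = np.polyfit(scores, ratings, 1)
    return slope, intercept

def differential_prediction(groups):
    """Compare prediction lines across sub-groups.

    groups: dict mapping group label -> (scores, ratings).
    Large slope or intercept gaps suggest the test over- or
    under-predicts criterion performance for some group,
    i.e., evidence supporting a Type 2 rebuttal.
    """
    lines = {g: prediction_line(s, r) for g, (s, r) in groups.items()}
    for g, (slope, intercept) in lines.items():
        print(f"{g}: rating ~= {slope:.3f} * score + {intercept:.2f}")
    return lines

# Hypothetical illustration with synthetic data only.
rng = np.random.default_rng(0)
ug_scores = rng.normal(85, 10, 200)
gr_scores = rng.normal(85, 10, 200)
groups = {
    "undergraduate": (ug_scores, 0.04 * ug_scores + rng.normal(0, 0.3, 200)),
    "graduate":      (gr_scores, 0.02 * gr_scores + rng.normal(0, 0.3, 200)),
}
differential_prediction(groups)
```

Comparable lines across groups would count toward discounting the rebuttal; diverging lines would strengthen it and narrow the claim to the groups for which prediction holds.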
As shown in the example, a fairness argument may be used to systematize the generation of rebuttals for the validity argument. This is just an illustration of how a fairness argument within a validity argument could be established and substantiated for the Extrapolation inference; the same approach can be applied to build an argument structure for each of the remaining inferences.

[Figure 2. Illustration of the fairness argument in the validity argument for the Extrapolation inference. The diagram lays out a Toulmin-style structure with the following elements.
Grounds: Expected scores are attributed to the construct of academic language abilities.
Claim (SO): The test scores reflect the quality of language performance on relevant tasks in an academic setting.
Warrant (SINCE): The construct of academic language abilities can account for the quality of language performance on relevant tasks in an academic setting.
Assumption: Performance on the test is related to other criteria of language proficiency in the academic context.
Backing 1 (BECAUSE OF): Students' TOEFL iBT test scores and faculty ratings of their academic language abilities were strongly correlated.
Type 1 Rebuttal 1 (UNLESS): The faculty members' ratings of students' language abilities are influenced by students' mastery of the content knowledge.
Type 1 Rebuttal 2: Different faculty members provided inconsistent ratings of students' language abilities.
Type 1 Rebuttal 3: The relationship between students' TOEFL iBT test scores and faculty ratings of their performance on the relevant real-world tasks was investigated based on a small and unrepresentative sample.
Type 2 Rebuttal 1 (specific to the fairness argument): Construct under-representation leads to under- or over-prediction of test takers' performances on relevant language tasks in an academic setting for graduate applicants.
Type 2 Rebuttal 2 (specific to the fairness argument): Inappropriate test content leads to differences in predicting performance on relevant tasks in an academic setting for test takers residing in and outside the USA.]

Setting priorities for fairness investigations

As Kane (2001) points out, the strength of the chain of inferences linking test performance to score-based decisions is most affected by its weakest links. Therefore, we need to anticipate the potential weaknesses in the TOEFL iBT test fairness argument and be prepared to provide backing to refute the rebuttals that would most compromise the overall argument. Focusing on the weakest areas in the argument helps maximize the use of resources, as it is typically not possible to address all potential fairness issues. This overall analysis of the links, undertaken to identify the weakest among them, leads to the development of a critical research agenda. It is consistent with Willingham and Cole's view that different fairness issues are interconnected and should be considered globally, as 'particular fairness issues, considered in isolation, may suggest contradictory solutions or test modifications that have contradictory effects on different groups of examinees' (Willingham & Cole, 1997, p. 11).

A few considerations are critical in identifying potential weaknesses in the fairness argument.
First, priority should be given to gathering evidence to counter fairness threats that are likely to carry over to subsequent inferential steps and have appreciable effects on the final score-based decisions and test consequences. DIF investigations are a good example. Although TOEFL does not conduct routine DIF analyses, it conducts well-motivated DIF research to examine group differences that may be caused by construct-irrelevant factors (Angoff, 1989; Zwick & Thayer, 1995; Breland et al., 2004). If item-level differential performance is found across two studied groups, additional analyses are needed to investigate whether this difference affects the overall reported score that supports decision-making. Item-level bias would pose less of a threat unless the pattern were prevalent and always in the same direction, thereby affecting the section and total scores on which decisions are based. In other instances, DIF may be detected for some items, but positive and negative DIF may cancel out, leading to no differences at the level of section or total scores for the sub-groups. (A minimal illustration of such an item-level check is sketched at the end of this discussion.)

Second, as mentioned above, the fairness assurance strategies that have been implemented in the assessment process to lessen the impact of potential fairness threats can be used to set research priorities and allocate resources. For example, many large-scale testing programs implement some form of fairness and sensitivity review in their test development process (e.g. ETS, 2003). Fairness reviews that are conducted with rigor and draw on solid expert judgments can be used as support for counteracting some potential fairness threats. However, if resources allow and sample sizes are sufficient, it is always advisable to strengthen the argument with additional empirical evidence that bias does not in fact exist. Evaluation of existing fairness assurance strategies in the assessment process is especially important for supporting the fairness of newly revised tests, as operational test data are typically not yet available for investigating DIF, differential factorial structure, differential criterion-related validity, or differential impact for test taker sub-groups.

Based on the two considerations discussed above, we need to consider which assumptions underlying a specific warrant may be susceptible to rebuttals that are specific to the fairness argument. For these assumptions, a careful analysis is then needed to identify factors that could lead to irrelevant differences in the accuracy, generalizability and meaning of assessment results, in decisions based on assessment results, or in consequences incurred through the use of the assessment or its results. These irrelevant factors should determine the appropriate comparison groups for the investigation of each fairness issue. For instance, if knowledge about a particular subject matter may affect test performance, test takers' field of study may be used to form comparison groups. Then, for each set of comparison groups identified to address a fairness issue, the impact on fairness is evaluated, considering the likelihood of finding construct-irrelevant group differences and the fairness assurance strategies adopted in the assessment process.
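As an illustration of the item-level check described above, the following sketch in Python computes the Mantel-Haenszel delta-DIF statistic for a single dichotomously scored item across two comparison groups (e.g., two fields of study) matched on total score. The Mantel-Haenszel procedure is one widely used DIF method, chosen here purely for illustration; the article does not tie its argument to any particular technique, and the data, group labels and effect shown below are entirely hypothetical.

from collections import defaultdict
from math import log

def mh_delta_dif(records):
    """Mantel-Haenszel delta-DIF for one dichotomously scored item.

    records: iterable of (group, matching_score, correct) tuples, where
    group is 'ref' or 'focal', matching_score is the stratifying variable
    (here, total test score) and correct is 0 or 1. Returns the ETS delta
    metric, -2.35 * ln(common odds ratio); negative values indicate an
    item favoring the reference group.
    """
    # Cross-tabulate correct/incorrect counts per group within each stratum.
    strata = defaultdict(lambda: {'ref': [0, 0], 'focal': [0, 0]})
    for group, score, correct in records:
        strata[score][group][0 if correct else 1] += 1

    num = den = 0.0
    for cells in strata.values():
        a, b = cells['ref']    # reference group: correct, incorrect
        c, d = cells['focal']  # focal group: correct, incorrect
        n = a + b + c + d
        if a + b == 0 or c + d == 0:
            continue           # skip strata that lack one of the groups
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError('no usable strata')
    return -2.35 * log(num / den)

# Hypothetical data: the focal group is disadvantaged on this item beyond
# what its ability distribution explains, so delta should come out negative.
import random
random.seed(0)
data = []
for group, bias in (('ref', 0.0), ('focal', -0.10)):
    for _ in range(2000):
        ability = random.gauss(0, 1)
        total = max(0, min(10, round(5 + 2 * ability)))  # matching score
        p = min(0.95, max(0.05, 0.5 + 0.15 * ability + bias))
        data.append((group, total, 1 if random.random() < p else 0))
print('MH delta-DIF: %+.2f' % mh_delta_dif(data))

In operational work the matching variable would typically be refined (e.g., by removing the studied item), the delta estimate would be paired with a significance test, and absolute delta values of 1.5 or more are conventionally treated as large. The section-level question raised above is then whether the signed deltas across items accumulate in one direction or cancel out.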
For example, for the Reading and Listening sections, the impact of subject matter familiarity is less of a problem than for the Writing and Speaking sections since 1) the stimulus materials are carefully designed to be self-contained and the associated items do not assume any knowledge of the subject matter; 2) a variety of subject matters is covered in each test form to minimize the influence of any specific topic on performance. Subject matter familiarity is more of a concern for the integrated writing and speaking tasks on academic course content in the TOEFL iBT test, although special care has been taken to ensure that understanding of the stimulus materials does not depend on familiarity with the subject matter. This is because 1) balancing content is almost impossible given the small number of items; 2) subject matter familiarity may have a larger impact on task-taking processes and task performance given the large amount of information to be processed in real time. This analysis, based on a coherent fairness argument, helps us pinpoint the critical areas of research needed to support the fairness of the TOEFL iBT test.

Conclusion

This article proposes an approach for studying fairness that links it directly to validity. Fairness is characterized as comparable validity for identifiable relevant groups. The fairness argument consists of a series of rebuttals that may challenge the comparability of score-based decisions and consequences for sub-groups. This approach organizes different fairness investigations into a coherent framework and offers a principled basis for evaluating the soundness of the overall fairness argument and for setting research priorities. It allows the scope of fairness explorations to be expanded and clarified, taking advantage of the well-defined framework for validity. The characterization of fairness as a facet of validity also augments traditional interpretations of validity by demanding additional support for the comparability of assessment results, interpretations, decisions and consequences for relevant sub-groups.

This approach draws on current argument-based methods of test validation to systematize fairness investigations. Within this framework, a fairness argument can be used to systematically generate rebuttals to the validity argument that would compromise the comparability of assessment results, interpretations, decisions and consequences for relevant sub-groups. These rebuttals stand in contrast to those that would potentially weaken validity for the whole test-taking population. This argument-based structure allows us to track how fairness issues permeate the inferential steps and become prominent in score-based decisions, actions and consequences. The emerging research on building validity or assessment use arguments in language testing (Bachman, 2005; Chapelle et al., 2008) may give empirical fairness research the momentum to be further developed and systematized. The approach described in this article will hopefully inspire more fairness investigations that are motivated by and built on a central fairness argument.
It is also my hope that the integration of fairness into validity will foster a deeper understanding and expanded exploration of actions based on test results and of social consequences, as impartiality and justice of actions and comparability of test consequences are at the core of fairness.

Notes
1. The series of rebuttals can be thought of as the part of the interpretive argument that is specific to fairness, and the support for this part as the fairness argument. In this paper, for the sake of simplicity, I do not differentiate between the interpretive argument and the fairness argument.
2. Only Type 2 rebuttals are presented in Table 1, since the focus of the paper is on fairness.

References
Alderson, J. C. & Urquhart, A. H. (1985a). The effect of students' academic discipline on their performance in ESP reading tests. Language Testing, 2, 192–204.
Alderson, J. C. & Urquhart, A. H. (1985b). This test is unfair: I'm not an economist. In Hauptman, P. C., LeBlanc, R. & Wesche, M. B. (Eds.), Second language performance testing (pp. 25–44). Ottawa: University of Ottawa Press.
Angoff, W. H. (1989). Context bias in the Test of English as a Foreign Language (TOEFL Research Report RR-29). Princeton, NJ: Educational Testing Service.
Association of Language Testers in Europe (1994). The Association of Language Testers in Europe Code of Practice. http://www.alte.org.
American Educational Research Association, American Psychological Association & National Council on Measurement in Education (1985). Standards for educational and psychological testing. Washington, DC: Author.
American Educational Research Association, American Psychological Association & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: Author.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2, 1–34.
Breland, H., Lee, Y.-W., Najarian, M. & Muraki, E. (2004). An analysis of TOEFL CBT writing prompt difficulty and comparability for different gender groups (TOEFL Research Report RR-76). Princeton, NJ: Educational Testing Service.
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20, 1–25.
Brown, J. D. (1999). Relative importance of persons, items, subtests and languages to TOEFL test variance. Language Testing, 16, 216–237.
Chapelle, C. A., Enright, M. K. & Jamieson, J. M. (Eds.) (2008). Building a validity argument for the Test of English as a Foreign Language™. Mahwah, NJ: Lawrence Erlbaum.
Cheng, L. (2008). Washback, impact and consequences. In Shohamy, E. & Hornberger, N. H. (Eds.), Encyclopedia of language and education. Volume 7: Language testing and assessment, 2nd edn. (pp. 349–364). New York: Springer Science and Business Media.
Clapham, C. (1998). The effect of language proficiency and background knowledge on EAP students' reading comprehension. In Kunnan, A. J. (Ed.), Validation in language assessment (pp. 141–168). Mahwah, NJ: Lawrence Erlbaum.
Cronbach, L. J. (1988). Five perspectives on the validity argument. In Wainer, H. & Braun, H. I. (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum.
Cronbach, L. J. & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Cureton, E. E. (1951). Validity. In Lindquist, E. F. (Ed.), Educational measurement, 1st edn. (pp. 621–694). Washington, DC: American Council on Education.
Davies, A. (Guest Ed.) (1997). Ethics (Special issue). Language Testing, 14.
Davies, A. (Guest Ed.) (2004). Ethics (Special issue). Language Assessment Quarterly, 4.
Educational Testing Service (2002). ETS standards for quality and fairness. Princeton, NJ: Author.
Educational Testing Service (2003). ETS fairness review guidelines. Princeton, NJ: Author.
Ferne, T. & Rupp, A. (2007). A synthesis of research on DIF in language testing: Methodological advances, challenges, and recommendations. Language Assessment Quarterly, 4, 113–148.
Fulcher, G. & Davidson, F. (2007). Language testing and assessment. London and New York: Routledge.
Ginther, A. & Stevens, J. (1998). Language background, ethnicity, and the internal construct validity of the Advanced Placement Spanish language examination. In Kunnan, A. J. (Ed.), Validation in language assessment (pp. 169–194). Mahwah, NJ: Lawrence Erlbaum.
Hale, G. (1988). Student major field and text content: Interactive effects on reading comprehension in the TOEFL. Language Testing, 5, 49–61.
Hale, G. A., Rock, D. A. & Jirele, T. J. (1989). Confirmatory factor analysis of the Test of English as a Foreign Language (TOEFL Research Report RR-32). Princeton, NJ: Educational Testing Service.
International Language Testing Association (2000). Code of Ethics for ILTA. Retrieved February 18, 2008 from http://www.iltaonline.com/code.pdf.
International Language Testing Association (2005). Draft Code of Practice, Version 3. Retrieved February 18, 2008 from http://www.iltaonline.com/ILTA-COP-ver3-21Jun2006.pdf.
Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.
Joint Committee on Testing Practices (1988). Code of fair testing practices in education. Washington, DC: Author.
Joint Committee on Testing Practices (2004). Code of fair testing practices in education. Washington, DC: Author.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38, 319–342.
Kane, M. T. (2002). Validating high-stakes testing programs. Educational Measurement: Issues and Practice, 21, 31–41.
Kane, M. T. (2004). Certification testing as an illustration of argument-based validation. Measurement: Interdisciplinary Research and Perspectives, 2, 135–170.
Kane, M. T. (2006). Validation. In Brennan, R. L. (Ed.), Educational measurement, 4th edn. (pp. 18–64). Washington, DC: American Council on Education/Praeger.
Kane, M., Crooks, T. & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18, 5–17.
Kunnan, A. J. (1995). Test taker characteristics and performance: A structural modeling approach. Cambridge, UK: Cambridge University Press.
Kunnan, A. J. (2000). Fairness and justice for all. In Kunnan, A. J. (Ed.), Fairness and validation in language assessment (pp. 1–14). Cambridge, UK: Cambridge University Press.
Kunnan, A. J. (2004). Test fairness. In Milanovic, M. & Weir, C. (Eds.), European language testing in a global context: Proceedings of the ALTE Barcelona Conference (pp. 27–48). Cambridge, UK: Cambridge University Press.
McNamara, T. F. & Roever, C. (2006). Language testing: The social dimension. Oxford: Blackwell.
Messick, S. (1989). Validity. In Linn, R. L. (Ed.), Educational measurement, 3rd edn. (pp. 13–103). New York: American Council on Education and Macmillan.
O'Loughlin, K. (2002). The impact of gender in oral proficiency testing. Language Testing, 19, 169–192.
Oltman, P., Stricker, L. & Barrows, T. (1990). Analyzing test structure by multidimensional scaling. Journal of Applied Psychology, 75, 21–27.
Shohamy, E. (2000). Fairness in language testing. In Kunnan, A. J. (Ed.), Fairness and validation in language assessment (pp. 15–19). Cambridge, UK: Cambridge University Press.
Shohamy, E. (2001). The power of tests: A critical perspective of the uses of language tests. London: Longman.
Swinton, S. S. & Powers, D. E. (1980). Factor analysis of the TOEFL® test for several language groups (TOEFL Research Report RR-06). Princeton, NJ: Educational Testing Service.
Stricker, L. J., Rock, D. A. & Lee, Y.-W. (2005). Factor structure of the LanguEdge™ test across language groups (TOEFL Monograph Series MS-32). Princeton, NJ: Educational Testing Service.
Taylor, C., Jamieson, J., Eignor, D. & Kirsch, I. (1998). The relationship between computer familiarity and performance on computer-based TOEFL test tasks (TOEFL Research Report RR-61). Princeton, NJ: Educational Testing Service.
Toulmin, S., Rieke, R. & Janik, A. (1984). An introduction to reasoning, 2nd edn. New York: Macmillan.
Webster's ninth new collegiate dictionary (1988). Springfield, MA: Merriam-Webster.
Willingham, W. W. (1999). A systemic view of test fairness. In Messick, S. (Ed.), Assessment in higher education: Issues in access, quality, student development, and public policy (pp. 213–242). Mahwah, NJ: Lawrence Erlbaum.
Willingham, W. W. & Cole, N. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum.
Zeidner, M. (1986). Are English language aptitude tests biased towards culturally different minority groups? Some Israeli findings. Language Testing, 3, 80–95.
Zeidner, M. (1987). A comparison of ethnic, sex and age biases in the predictive validity of English language aptitude tests: Some Israeli data. Language Testing, 4, 55–71.
Zwick, R. J. & Thayer, D. T. (1995). A comparison of the performance of graduate and undergraduate school applicants on the Test of Written English (TOEFL Research Report RR-50). Princeton, NJ: Educational Testing Service.
