Studies in Language Testing Volume 18: European Language Testing in a Global Context (2001) PDF
2001
Michael Milanovic and Cyril J Weir
Summary
This book, edited by Michael Milanovic and Cyril J Weir, presents selected proceedings of the ALTE Barcelona Conference held in July 2001. It examines European language testing in a global context, bringing together research studies and insights into language test development and validation, and serves as a useful resource and reference work for postgraduate students of language testing.
Full Transcript
Studies in Language Testing 18
Edited by Michael Milanovic and Cyril J Weir
European Language Testing in a Global Context
Proceedings of the ALTE Barcelona Conference, July 2001

European Language Testing in a Global Context brings together selected papers presented at the first ALTE International Conference, held in Barcelona, Spain, in July 2001. This inaugural conference was organised in support of the European Year of Languages and the volume provides a flavour of the issues addressed during the event.

This book is divided into four sections. Section One contains two papers exploring general matters of current concern to language testers worldwide, including technical, political and ethical issues. Section Two presents a set of six research studies covering a wide range of contemporary topics in the field: the value of qualitative research methods in language test development and validation; the contribution of assessment portfolios; the validation of questionnaires to explore the interaction of test-taker characteristics and L2 performance; rating issues arising from computer-based writing assessment; the modelling of factors affecting oral test performance; and the development of self-assessment tools. Section Three takes a specific European focus and presents two papers summarising aspects of the ongoing work of the Council of Europe and the European Union in relation to language policy. Section Four develops the European focus further by reporting work in progress on test development projects in various European countries, including Germany, Italy, Spain and the Netherlands.

Its coverage of issues with both regional and global relevance means this volume will be of interest to academics and policymakers within Europe and beyond. It will also be a useful resource and reference work for postgraduate students of language testing.

Also available:
Multilingualism and Assessment: Achieving transparency, assuring quality, sustaining diversity (ISBN 978 0 521 71192 0)
Language Testing Matters: Investigating the wider social and educational impact of assessment (ISBN 978 0 521 16391 0)

Series Editors: Michael Milanovic and Cyril J Weir

European language testing in a global context
Proceedings of the ALTE Barcelona Conference, July 2001

Also in this series:
An investigation into the comparability of two tests of English as a Foreign Language: The Cambridge-TOEFL comparability study (Lyle F. Bachman, F. Davidson, K. Ryan, I.-C. Choi)
Test taker characteristics and performance: A structural modeling approach (Antony John Kunnan)
Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem (Michael Milanovic, Nick Saville)
The development of IELTS: A study of the effect of background knowledge on reading comprehension (Caroline Margaret Clapham)
Verbal protocol analysis in language testing research: A handbook (Alison Green)
A multilingual glossary of language testing terms (prepared by ALTE members)
Dictionary of language testing (Alan Davies, Annie Brown, Cathie Elder, Kathryn Hill, Tom Lumley, Tim McNamara)
Learner strategy use and performance on language tests: A structural equation modelling approach (James Enos Purpura)
Fairness and validation in language assessment: Selected papers from the 19th Language Testing Research Colloquium, Orlando, Florida (Antony John Kunnan)
Issues in computer-adaptive testing of reading proficiency (Micheline Chalhoub-Deville)
Experimenting with uncertainty: Essays in honour of Alan Davies (A. Brown, C. Elder, N. Iwashita, E. Grove, K. Hill, T. Lumley, K. O'Loughlin, T. McNamara)
An empirical investigation of the componentiality of L2 reading in English for academic purposes (Cyril Weir)
The equivalence of direct and semi-direct speaking tests (Kieran O'Loughlin)
A qualitative approach to the validation of oral language tests (Anne Lazaraton)
Continuity and Innovation: Revising the Cambridge Proficiency in English Examination 1913–2002 (edited by Cyril Weir and Michael Milanovic)
Unpublished
The development of CELS: A modular approach to testing English Language Skills (Roger Hawkey)
Testing the Spoken English of Young Norwegians: a study of testing validity and the role of 'smallwords' in contributing to pupils' fluency (Angela Hasselgren)
Changing language teaching through language testing: A washback study (Liying Cheng)

European language testing in a global context
Proceedings of the ALTE Barcelona Conference, July 2001

PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge CB2 1RP, UK
CAMBRIDGE UNIVERSITY PRESS
The Edinburgh Building, Cambridge CB2 2RU, UK
40 West 20th Street, New York, NY 10011–4211, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
Ruiz de Alarcón 13, 28014 Madrid, Spain
Dock House, The Waterfront, Cape Town 8001, South Africa
http://www.cambridge.org

© UCLES 2004

This book is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2004
Printed in the United Kingdom at the University Press, Cambridge
Typeface Times 10/12pt. System QuarkXPress®

A catalogue record for this book is available from the British Library
ISBN 0 521 82897 X hardback
ISBN 0 521 53587 5 paperback

Contents

Series Editor's note

Section One: Issues in Language Testing
1 The shape of things to come: will it be the normal distribution? (Charles Alderson)
2 Test fairness (Antony Kunnan)

Section Two: Research Studies
3 Qualitative research methods in language test development and validation (Anne Lazaraton)
4 European solutions to non-European problems (Vivien Berry and Jo Lewkowicz)
5 Validating questionnaires to examine personal factors in L2 test performance (James E. Purpura)
6 Legibility and the rating of second-language writing (Annie Brown)
7 Modelling factors affecting oral language test performance: a large-scale empirical study (Barry O'Sullivan)
8 Self-assessment in DIALANG: An account of test development (Sari Luoma)

Section Three: A European View
9 Council of Europe language policy and the promotion of plurilingualism (Joseph Sheils)
10 Higher education and language policy in the European Union (Wolfgang Mackiewicz)

Section Four: Work in Progress
11 TestDaF: Theoretical basis and empirical research (Rüdiger Grotjahn)
12 A Progetto Lingue 2000 Impact Study, with special reference to language testing and certification (Roger Hawkey)
13 Distance-learning Spanish courses: a follow-up and assessment system (Silvia María Olalde Vegas and Olga Juan Lázaro)
14 Certification of knowledge of the Catalan language and examiner training (Mònica Pereña and Lluís Ràfols)
15 CNaVT: A more functional approach. Principles and construction of a profile-related examination system (Piet van Avermaet, José Bakx, Frans van der Slik and Philippe Vangeneugden)
16 Language tests in Basque (Nicholas Gardner)
17 Measuring and evaluating competence in Italian as a foreign language (Giuliana Grego Bolli)

Series Editor's note

The conference papers presented in this volume represent a small subset of the many excellent presentations made at the ALTE conference – European Language Testing in a Global Context – held in July 2001 in Barcelona in celebration of the European Year of Languages 2001. They have been selected to provide a flavour of the issues that the conference addressed. A full listing of all presentations is attached at the end of this note.

The volume is divided into four parts. The first, with two papers, one written by Charles Alderson and the other by Antony Kunnan, has a focus on more general issues in Language Testing. Alderson looks at some key issues in the field; he considers "the shape of things to come" and asks if it will be the "normal distribution". Using this pun to structure his paper, he focuses on two aspects of language testing: the first relates to the technical aspects of the subject (issues of validity, reliability, impact etc.), the second relates to ethical and political concerns. Most of his paper chimes well with current thinking on the technical aspects and, as he admits, much of what he presents is not new and is uncontroversial. Within the European context he refers to the influential work of the Council of Europe, especially the Common European Framework and the European Language Portfolio; he describes a number of other European projects, such as DIALANG and the national examination reform project in Hungary, and he praises various aspects of the work of ALTE (e.g. for its Code of Practice, for organising useful conferences, for encouraging exchange of expertise among its members, and for raising the profile of language testing in Europe). In focusing on the political dimension, however, he positions himself as devil's advocate and sets out to be provocative – perhaps deliberately introducing a "negative skew" into his discussion. As always his contribution is stimulating and his conclusions are certainly controversial, particularly his criticism of ALTE and several other organisations. These conclusions would not go unchallenged by many ALTE members, not least because he misrepresents the nature of the association and how it operates.
Kunnan's paper discusses the qualities of test fairness and reflects his longstanding concerns with the issues involved in this area. The framework he presents is of great value to the field of Language Testing and Kunnan has contributed significantly to the on-going debate on the qualities of test fairness within ALTE.

The second part of the volume presents a number of research studies. Anne Lazaraton focuses on the use of qualitative research methods in the development and validation of language tests. Lazaraton is a pioneer of qualitative research in language testing and her involvement dates back to the late eighties and early nineties when such approaches were not yet widely used in the field. It is in part due to her efforts that researchers are now more willing to embrace approaches that can provide access to the rich and deep data of qualitative research. Readers are encouraged to look at her volume in this series (A Qualitative Approach to the Validation of Oral Language Tests).

Vivien Berry and Jo Lewkowicz focus on the important issue of compulsory language assessment for graduating students in Hong Kong. Their paper considers alternatives to using a language test alone for this purpose and looks at the applicability of variations on the portfolio concept.

Jim Purpura's work on the validation of questionnaires, which addresses the interaction of personal factors and second language test performance, represents an interesting and challenging dimension of validation in language testing. Readers may also wish to refer to Purpura's volume in this series (Learner strategy use and performance on language tests: A structural equation modelling approach), which looks in more depth at the development of questionnaires to determine personal factors and a methodology that can be used to investigate their interactions with test performance.

Annie Brown's paper is particularly relevant as we move towards greater use of computers in language testing. Such a move is of course fraught with issues, not least of which is the one of legibility that Brown addresses here. Her findings are interesting, giving us pause for thought and indicating, as she suggests, that more research is required. In the context of IELTS, such research is currently being conducted in Cambridge.

Barry O'Sullivan's paper attempts to model the factors affecting oral test performance, an area of particular significance in large-scale assessment. The paper is part of on-going research commissioned by the University of Cambridge Local Examinations Syndicate and it is hoped that a collection of research studies into the dimensions of oral assessment will be published in this series in due course.

Finally, Sari Luoma's paper looks at self-assessment in the context of DIALANG. The DIALANG project, also referred to in Alderson's paper, has been one of the key initiatives of the European Commission in relation to language testing. As such it has benefited from significant funding and generated much research potential.

The last two parts of the volume cover aspects of work in progress. On the one hand, Joseph Sheils and Wolfgang Mackiewicz summarise aspects of the on-going work of the Council of Europe and the European Union in relation to language policy. On the other, a number of researchers bring us up to date with test development work largely, though not exclusively, in the context of ALTE. These papers provide the reader with a reasonable overview of what is going on in a number of European countries.
In the context of the conference reflected in this volume, it is appropriate to overview how ALTE has developed over the years and what is of particular concern to the members of ALTE at the moment. ALTE has been operating for nearly a decade and a half. It was first formed when a few organisations, acknowledging the fact that there was no obvious forum for the discussion of issues in the assessment of one's own language as a foreign language in the European context, decided to meet with this aim in mind. The question of language assessment generally is an enormous one and dealt with in different ways by national and regional authorities throughout Europe and the world. Trying to bring together such a large and diverse community would have been a very significant task and far beyond the scope of ALTE's mission. ALTE's direct interests and aims are on a much smaller scale and it is important to underline that it seeks to bring together those interested in the assessment of their own language as a foreign language. This is often in an international context, particularly with the more widely spoken languages, but also in a national context, as is the case with less widely spoken languages in particular. While some ALTE members are located within ministries or government departments, others are within universities and cultural agencies. The members of ALTE are part of the international educational context and ALTE itself, as well as the members that form it, is a not-for-profit organisation.

As a group, ALTE aims to provide a benchmark of quality in the particular domain in which it operates. Should ALTE's work be of relevance outside its own context, then so much the better, but ALTE does not set out to establish or police the standard for European language assessment in general.

The recent history of language testing in the European context is very mixed. In the case of English we are fortunate that there has been significant interest and research in this field in English-speaking countries for many years. In relation to some other European languages this is not the case. ALTE recognises that the field of language testing in different languages will be at different stages of development and that developing a language testing capacity in the European context, albeit in a relatively narrow domain, is an on-going venture. Similarly, progress, in contexts where participants are free to walk away at any time, cannot be achieved through force or coercion but rather through involvement, greater understanding and personal commitment. ALTE operates as a capacity builder in the European context, albeit in a relatively narrow domain.

As with any association, ALTE has a Secretariat, based in Cambridge and elected by the membership. The Secretariat has a three-year term of office and is supported by a number of committees, made up from the membership, who oversee various aspects of ALTE's work. The group is too large for all members to be involved in everything and there are a number of sub-groups, organised by the members and focusing on particular areas of interest. The sub-groups are formed, reformed and disbanded as circumstances and interests dictate, and at the moment there are several active ones. We will briefly describe the work of some of these here.

The whole of ALTE has been working for some time on the ALTE Framework, which seeks to place the examinations of ALTE members onto a common framework, related closely, through empirical study, to the Common European Framework.
The process of placing examinations on the framework is underpinned by extensive work on the content analysis of examinations, guidelines for the quality production of examinations and empirically validated performance indicators in many European languages. This work has been supported by grants from the European Commission for many years and is now being taken forward by a number of sub-groups which are considering different domains of use, such as language specifically for work purposes, for young learners or for study through the medium of a language.

A group has been established to look at the extent to which teacher qualifications in different languages can be harmonised and placed on some kind of framework. The group is not looking specifically at state-organised qualifications but rather those common in the private sector, for example those offered by the Alliance Française, the Goethe Institute, the Cervantes Institute or Cambridge, amongst others. It seeks to provide greater flexibility and mobility for the ever-growing body of language teachers often qualified in one language and wishing to teach another, while having their existing qualifications recognised as contributing to future ones in a more systematic way than is possible at present.

The Council of Europe has made and continues to make a substantial contribution to the teaching, learning and assessment of languages in the European context and in recent years has developed the concept of the European Language Portfolio as an aid and support to the language learning and teaching community. ALTE and the European Association for Quality Language Services have collaborated on the development of a portfolio for adults, which is now in the public domain. It is hoped that this will be a valuable aid to adult learners of languages in the European context. An ALTE sub-group has been working with the Council of Europe and John Trim on the elaboration of a Breakthrough level, which would complement the Waystage, Threshold and Vantage levels already developed. ALTE's work in this area has also been supported by the European Commission in the form of funding to a group of members from Finland, Ireland, Norway, Greece and Sweden who have a particular interest in language teaching and testing at the Breakthrough level.

Another ALTE sub-group has been working on the development of a multilingual system of computer-based assessment. The approach, which is based on the concept of computer adaptive testing, has proved highly successful and innovative, providing assessment in several European languages, and won the European Academic Software Award in 2000.

ALTE members have developed a multilingual glossary of language testing terms. Part of this work has been published in this series (A multilingual glossary of language testing terms) but is ongoing, and as new languages join ALTE, further versions of the glossary are being developed. The glossary has allowed language testers in about 20 countries to define language testing terms in their own language and thus contributes to the process of establishing language testing as a discipline in its own right. The European Commission has supported this work throughout.

In the early 1990s, ALTE developed a code of professional practice and work has continued to elaborate the concept of quality assurance in language testing through the development of quality assurance and quality management instruments for use initially by ALTE members.
This work has been in progress for several years and is now in the hands of an ALTE sub-group. As noted above, developing the concept of quality assurance and its management has to be a collaborative venture between partners and cannot be imposed in the ALTE context. ALTE members are aware that they carry significant responsibility and aim to continue to play a leading role in defining the dimensions of quality and how an effective approach to quality management can be implemented. This work is documented and has been elaborated in ALTE News as well as at a number of international conferences. Details are also available on the ALTE website: www.alte.org.

Members of ALTE are also concerned to measure the impact of their examinations, and work has gone on in the context of ALTE to develop a range of instrumentation to measure impact on stakeholders in the test-taking and test-using constituency. Roger Hawkey discusses the concept of impact in the context of the Progetto Lingue 2000 project in one of the papers in this volume – see contents page.

ALTE members meet twice a year and hold a language testing conference in each meeting location. This is an open event, details of which are available on the ALTE website. New ALTE members are elected by the membership as a whole. Members are either full – from countries in the European Union – or associate – from countries outside. For organisations which do not have the resources to be full or associate members, or which operate in a related field, there is the option of observer status. Information on all of these categories of membership is available on the ALTE website.

Finally, following the success of the Barcelona conference, ALTE has agreed to organise another international conference in 2005. Details are available on the website.

Mike Milanovic
Cyril Weir
March 2003

Presentations at the ALTE Conference, Barcelona, 2001

Karine Akerman Sarkisian, Camilla Bengtsson and Monica Langerth-Zetterman (Uppsala University, Department of Linguistics, Sweden): Developing and evaluating web-based diagnostic testing in university language education
J. Charles Alderson (Lancaster University, Department of Linguistics and Modern English Language): The shape of things to come: Will it be the normal distribution?
José Bakx and Piet Van Avermaet (Katholieke Universiteit Nijmegen and Leuven, The Netherlands and Belgium): Certificate of Dutch as a Foreign Language: A more functional approach
David Barnwell (ITÉ): Using the Common European framework in an LSP setting: Rating the Italian, Spanish, French and German abilities of workers in the teleservices industry
Marsha Bensoussan and Bonnie Ben-Israel (University of Haifa, Department of Foreign Languages, Israel): Evaluating reading comprehension of academic texts: Guided multiple-choice summary completion
Aukje Bergsma (Citogroep, The Netherlands): NT2-CAT: Computer Adaptive Test for Dutch as a second language
Vivien Berry and Jo Lewkowicz (The University of Hong Kong, The English Centre, Hong Kong): European solutions to non-European problems
Geoff Brindley (Macquarie University, NCELTR, Australia): Investigating outcomes-based assessment
Annie Brown (The University of Melbourne, Language Testing Research Centre): The impact of handwriting on the scoring of essays
Annie Brown, co-authors Noriko Iwashita and Tim McNamara (The University of Melbourne, Language Testing Research Centre): Investigating rater's orientations in specific-purpose task-based oral
Peter Brown and Marianne Hirtzel (EAQUALS, ALTE): The EAQUALS/ALTE European Language Portfolio
Jill Burstein and Claudia Leacock (ETS Technologies, USA): Applications in Automated Essay Scoring and Feedback
Ross Charnock (Université Paris 9 – Dauphine): Taking tests and playing scales – remarks on integrated tests
David Coniam (The Chinese University of Hong Kong, Faculty of Education, Hong Kong): Establishing minimum-language-standard benchmark tests for English language teachers in Hong Kong
Margaretha Corell and Thomas Wrigstad (Stockholm University, Department of Scandinavian Languages and Centre for Research on Bilingualism): What's the difference? Analysis of two paired conversations in the Oral examination of the National Tests of Swedish as a Second/Foreign Language
Benó Csapó and Marianne Nikolov (University of Szeged and University of Pécs, Hungary): Hungarian students' performances on English and German tests
John H.A.L. de Jong (Language Testing Services, The Netherlands): Procedures for Relating Test Scores to Council of Europe Framework
Clara Maria de Vega Santos (Universidad de Salamanca, Spain): Otra manera de aproximarse a la Evaluación: La Asociación Europea de Examinadores de Lenguas (presentation in Spanish)
Veerle Depauw and Sara Gysen (Catholic University of Leuven, Centre for Language and Migration, Belgium): Measuring Dutch language proficiency. A computer-based test for low-skilled adults and an evaluation system for primary school pupils
Urszula Dobesz (Uniwersytet Wroclawski, Poland): To know, to understand, to love (Knowledge about Poland in the teaching of Polish as a second language)
Ana Maria Ducasse (LaTrobe University, Spanish Department, Australia): Assessing paired oral interaction in an oral proficiency test
Ina Ferbežar and Marko Stabej (University of Ljubljana, Centre for Slovene as a Second/Foreign Language and Department of Slavic Languages and Literature, Slovenia): Developing and Implementing Language Tests in Slovenia
Jésus Fernández and Clara Maria de Vega Santos (Universidad de Salamanca, Spain): Advantages and disadvantages of the Vantage Level: the Spanish version (presentation in Spanish)
Neus Figueras (Generalitat de Catalunya, Department d'Ensenyament, Spain): Bringing together teaching and testing for certification. The experience at the Escoles Oficials d'Idiomes
Nicholas Gardner (Basque Government, Basque country): Test of Basque (EGA)
Malgorzata Gaszynska (Jagiellonian University, Poland): Testing Comprehension in Polish as a Foreign Language (paper in Spanish)
April Ginther, co-author Krishna Prasad (Purdue University, English Department, Biloa University, and Purdue University, USA): Characteristics of European TOEFL Examinees
Giuliana Grego Bolli (Università per Stranieri di Perugia, Italy): Measuring and evaluating the competence of the Italian as a second language
Rüdiger Grotjahn (Ruhr-Universitaet Bochum, Germany): TestDAF: Theoretical basis and empirical research
Anne Gutch (UCLES, UK): A major international exam: The revised CPE
H.I. Hacquebord and S.J. Andringa (University of Groningen, Applied Linguistics, The Netherlands): Testing text comprehension electronically
Roger Hawkey (c/o UCLES, UK): Progetto Lingue 2000: Impact for Language Friendly Schools
Nathalie Hirschprung (Alliance Française, France): Teacher certifications produced by the Alliance Française in the ALTE context
Maria Iakovou (University of Athens, School of Philosophy, Greece): The teaching of Greek as a foreign language: Reality and perspectives
Miroslaw Jelonkiewicz (Warsaw University and University of Wroclaw, Poland): Describing and Testing Competence in Polish Culture
Miroslaw Jelonkiewicz (Warsaw University, Poland): Describing and gauging competence in Polish culture
Neil Jones (UCLES, UK): Using ALTE Can-Do statements to equate computer-based tests across languages
Neil Jones, Henk Kuijper and Angela Verschoor (UCLES, Citogroep, UK, The Netherlands): Relationships between paper and pencil tests and computer based testing
Sue Kanburoglu Hackett and Jim Ferguson (The Advisory Council for English Language Schools Ltd, Ireland): Interaction in context: a framework for assessing learner competence in action
Lucy Katona (Idegennyelvi Továbbképzö Központ (ITK), Hungary): The development of a communicative oral rating scale in Hungary
Antony John Kunnan (Assoc. Prof., TESOL Program, USA): Articulating a fairness model
Rita Kursite (Jaunjelgava Secondary School, Latvia): Analyses of Listening Tasks from Different Points of View
Michel Laurier and Denise Lussier (University of Montreal – Faculty of Education and McGill University): The development of French language tests based on national benchmarks
Anne Lazaraton (University of Minnesota, USA): Setting standards for qualitative research in language testing
Jo Lewkowicz (The University of Hong Kong, The English Centre, Hong Kong): Stakeholder perceptions of the text in reading comprehension tests
Sari Luoma (University of Jyväskylä): Self-assessment in DIALANG
Denise Lussier (McGill University, Canada): Conceptual Framework in Teaching and Assessing Cultural Competence
Wolfgang Mackiewicz (Freie Universität Berlin, Germany): Higher education and language policy in the European Union
Waldemar Martyniuk (Jagiellonian University, Poland): Polish for Europe – Introducing Certificates in Polish as a Foreign Language
Lydia McDermott (University of Natal, Durban): Language testing, contextualised needs and lifelong learning
Debie Mirtle (MATESOL, Englishtown, Boston, USA): Online language testing: Challenges, Successes and Lessons Learned
Lelia Murtagh (ITÉ): Assessing Irish skills and attitudes among young adult secondary school leavers
Marie J. Myers (Queen's University, Canada): Entrance assessments in teacher training: a lesson of international scope
Barry O'Sullivan (The University of Reading, School of Linguistics and Applied Language Studies, UK): Modelling Factors Affecting Oral Language Test Performance: An empirical study
Silvia María Olalde Vegas and Olga Juan Lázaro (Instituto Cervantes, España): Spanish distance-learning courses: Follow-up and evaluation system
Christine Pegg (Cardiff University, Centre for Language and Communication Research, UK): Lexical resource in oral interviews: Equal assessment in English and Spanish?
Mònica Pereña & Lluís Ràfols (Generalitat de Catalunya): The new Catalan examination system and the examiners' training
Juan Miguel Prieto Hernández (Universidad de Salamanca, Spain): Problemas para elaborar y evaluar una prueba de nivel: Los Diplomas de Español como Lengua Extranjera
James E. Purpura (Columbia Teachers College, USA): Developing a computerised system for investigating non-linguistic factors in L2 learning and test performances
John Read (Victoria University of Wellington, New Zealand): Investigating the Impact of a High-stakes International Proficiency Test
Diana Rumpite (Riga Technical University, Latvia): Innovative tendencies in computer based testing in ESP
Raffaele Sanzo (Ministero della Pubblica Istruzione, Italy): Foreign languages within the frame of Italian educational reform
Joseph Sheils (Modern Languages Division, Council of Europe): Council of Europe language policy and the promotion of plurilingualism
Elana Shohamy (Tel Aviv University, School of Education, Israel): The role of language testing policies in promoting or rejecting diversity in multilingual/multicultural societies
Kari Smith (Oranim Academic College of Education, Israel): Quality assessment of Quality Learning: The digital portfolio in elementary school
M. Dolors Solé Vilanova (Generalitat de Catalunya, Centre de Recursos de Llengües Estrangeres of the Department of Education): The effectiveness of the teaching of English in the Baccalaureate school population in Catalonia. Where do we stand? Where do we want to be?
Bernard Spolsky (Bar Ilan University, Israel): Developing cross-culturally appropriate language proficiency tests for schools
Claude Springer (Université Marc Bloch, Strasbourg): A pragmatic approach to evaluating language competence: Two new certifications in France
Marko Stabej and Nataša Pirih Svetina (University of Ljubljana, Department of Slavic Languages and Literatures, Slovenia): The development of communicative language ability in Slovene as a second language
Sauli Takala, Felianka Kaftandjieva and Hanna Immonen (University of Jyväskylä, Centre for Applied Language Studies, Finland): Development and Validation of Finnish Scales of Language Proficiency
Lynda Taylor (UCLES, UK): Revising instruments for rating speaking: combining qualitative and quantitative insights
John Trim (Project Director for Modern Languages, Council of Europe): The Common European Framework of Reference for Languages and its implications for language testing
Philippe Vangeneugden and Frans van der Slik (Katholieke Universiteit Nijmegen, The Netherlands and Katholieke Universiteit Leuven, Belgium): Towards a profile related certification structure for Dutch as a foreign language. Implications of a needs analysis for profile selection and description
Juliet Wilson (UCLES, UK): Assessing Young Learners – What makes a good test?
Minjie Xing and Jing Hui Wang (Salford University, School of Languages, UK and Harbin Institute of Technology, Foreign Language Department, China): A New Method of Assessing Students' Language and Culture Learning Attitudes

Poster Presentations

Guy Bentner & Ines Quaring (Centre de Langues de Luxembourg): Tests of Luxembourgish as a foreign language
Else Lindberg & Peter Vilads Vedel (Undervisingsministeriet, Copenhagen, Denmark): The new test of Danish as a foreign language
Reidun Oanaes Andersen (Norsk Språktest, Universitetet i Bergen, Norway): Tests of Norwegian as a foreign language
José Pascoal (Universidade de Lisboa, Portugal): Tests of Portuguese as a foreign language
Heinrich Rübeling (WBT, Germany): Test Arbeitsplatz Deutsch – a workplace-related language test in German as a foreign language
Lászlo Szabo (Eotvos Lorand University, Budapest, Centre for Foreign Languages, Hungary): Hungarian as a foreign language examination

List of contributors

J. Charles Alderson, Department of Linguistics and Modern English Language, Lancaster University
Antony Kunnan, California State University
Anne Lazaraton, University of Minnesota, Minneapolis, USA
Vivien Berry and Jo Lewkowicz, University of Hong Kong
James E. Purpura, Teachers College, Columbia University
Annie Brown, University of Melbourne
Barry O'Sullivan, University of Reading
Sari Luoma, Centre for Applied Language Studies, University of Jyväskylä, Finland
Joseph Sheils, Modern Languages Division/Division des langues vivantes, DGIV, Council of Europe/Conseil de l'Europe, Strasbourg
Wolfgang Mackiewicz, Conseil Européen pour les Langues/European Language Council
Rüdiger Grotjahn, Ruhr-Universität Bochum, Germany
Roger Hawkey, Educational Consultant, UK
Silvia María Olalde Vegas and Olga Juan Lázaro, Instituto Cervantes
Mònica Pereña and Lluís Ràfols, Direccío General de Política Lingüística, Departament de Cultura, Generalitat de Catalunya
Piet van Avermaet (KU Leuven), José Bakx (KUN), Frans van der Slik (KUN) and Philippe Vangeneugden (KU Leuven)
Nicholas Gardner, Department of Culture/Kultura Saila, Government of the Basque Country/Eusko Jaurlaritza, Spain
Giuliana Grego Bolli, Università per Stranieri di Perugia, Italy

Section One: Issues in Language Testing

1 The shape of things to come: will it be the normal distribution?
Plenary address to the ALTE Conference, Barcelona, July 2001
Charles Alderson, Department of Linguistics and Modern English Language, Lancaster University

Introduction

In this paper I shall survey developments in language testing over the past decade, paying particular attention to new concerns and interests. I shall somewhat rashly venture some predictions about developments in the field over the next decade or so and explore the shape of things to come.

Many people see testing as technical and obsessed with arcane procedures and obscure discussions about analytic methods expressed in alphabet soup, such as IRT, MDS, SEM and DIF.
Such discourses and obsessions are alien to teachers, and to many other researchers. In fact these concepts are not irrelevant, because many of them are important factors in an understanding of our constructs – what we are trying to test. The problem is that they are often poorly presented: researchers talking to researchers, without being sensitive to other audiences who are perhaps less obsessed with technical matters. However, I believe that recent developments have seen improved communication between testing specialists and those more generally concerned with language education, which has resulted in a better understanding of how testing connects to people's lives.

Much of what follows is not necessarily new, in the sense that the issues have indeed been discussed before, but the difference is that they are now being addressed in a more critical light, with more questioning of assumptions and by undertaking more and better empirical research.

Washback and consequential validity

Washback is a good example of an old concern that has become new. Ten years ago washback was a common concept, and the existence and nature of washback was simply accepted without argument. Tests affect teaching. Bad tests have negative effects on teaching; more modern, good tests will have positive effects; therefore change the test and you will change teaching. I certainly believed that, and have published several articles on the topic.

But in the late 1980s and early 1990s, Dianne Wall and I were engaged in a project to investigate washback in Sri Lanka, intended to prove that positive washback had been brought about by a suite of new tests. To our surprise we discovered that things were not so simple. Although we found evidence of the impact of tests on the content of teaching, not all of that impact was positive. Moreover, there was little or no evidence of the impact of the test on how teachers taught – on their methodology. As a result we surveyed the literature to seek parallels, only to discover that there was virtually no empirical evidence on the matter. We therefore decided to problematise the concept (Alderson and Wall 1993). The rest is history, because washback research quickly took off in a fairly big way. A number of studies on the topic have been reported in recent years and washback, or more broadly the impact of tests and their consequences on society, has become a major concern.

Language testing is increasingly interested in what classrooms look like, what actually happens in class, how teachers prepare for tests and why they do what they do. We now have a fairly good idea of the impact of tests on the content of teaching, but we are less clear about how tests affect teachers' methods. What we do know is that the washback is not uniform. Indeed, it is difficult to predict exactly what teachers will teach, or how teachers will teach. In extreme cases, such as TOEFL test preparation, we know that teachers will tend to use test preparation books, but how they use them – and above all why they use them in the way they do – is still in need of research. In short, washback needs explaining.

There have been fewer studies of what students think, what their test preparation strategies are and why they do what they do, but we are starting to get insights. Watanabe (2001) shows that students prepare in particular for those parts of exams that they perceive to be more difficult, and more discriminating.
Conversely, those sections perceived to be easy have less impact on their test preparation practices: far fewer students report preparing for easy or non-discriminating exam sections. However, those students who perceived an exam section to be too difficult did not bother preparing for it at all. Other studies have turned to innovation theory in order to understand how change occurs and what might be the factors that affect washback (e.g. Wall 1999), and this is a promising area for further research. In short, in order to understand and explain washback, language testing is engaging with innovation theory, with studies of individual teacher thinking and student motivation, and with investigations of classrooms.

Interestingly, however, washback has not yet been properly researched by testing bodies, who may well not welcome the results. Despite the widely claimed negative washback of TOEFL, the test developer, Educational Testing Service, New Jersey, has not to my knowledge funded or engaged in any washback research, and the only empirical study I know of into the impact of TOEFL is an unfunded small-scale study in the USA by Hamp-Lyons and myself (Alderson and Hamp-Lyons 1996). Hopefully, members of ALTE (the Association of Language Testers in Europe) will begin to study the impact of their tests, rather than simply asserting their beneficial impact. After all, many ALTE tests affect high-stakes decisions. I know of no published washback studies among ALTE partners to date, but would be happy to be proved wrong. Certainly I would urge members of ALTE to initiate investigations into the impact of their tests on classrooms, on teachers, on students, and on society more generally.

The results of washback studies will inevitably be painful, not just for test providers but for teachers, too. From the research that has been done to date, it is becoming increasingly clear that a) what teachers say they do is not what they do in class; b) their public reasons for what they do do not always mesh with their real reasons; and c) much of teacher-thinking is vague, muddled, rationalised, prejudiced, or simply uninformed. It is certainly not politically correct to make such statements, and the teacher education literature is full of rosy views of teaching and teachers. But I firmly believe that we need a more realistic, honest view of why teachers do what they do.

Ethics: new focus on old issues

Hamp-Lyons (1997) argues that the notion of washback is too narrow a concept, and should be broadened to cover 'impact' more generally, which she defines as the effect of tests on society at large, not just on individuals or on the educational system. In this, she is expressing a growing concern with the political and related ethical issues that surround test use. Others, like Messick (1994, 1996), have redefined the scope of validity and validation to include what he calls consequential validity – the consequences of test score interpretation and use. Messick also holds that all testing involves making value judgements, and therefore language testing is open to a critical discussion of whose values are being represented and served, which in turn leads to a consideration of ethical conduct.

Tests and examinations have always been used as instruments of social policy and control, with the gate-keeping function of tests often justifying their existence.
Davies (1997) argues that language testing is an intrusive practice, and since tests often have a prescriptive or normative role, then their social consequences are potentially far-reaching. In the light of such impact, he proposes the need for a professional morality among language testers, both to protect the profession's members and to protect the individual within society from misuse and abuse of testing instruments. However, he also argues that the morality argument should not be taken too far, lest it lead to professional paralysis, or cynical manipulation of codes of practice.

A number of case studies illustrate the use and misuse of language tests. Two examples from Australia (Hawthorne 1997) are the use of the access test to regulate the flow of migrants into Australia, and the step test, allegedly designed to play a central role in the determining of asylum seekers' residential status. Similar misuses of the IELTS test to regulate immigration into New Zealand are also discussed in language testing circles – but not yet published in the literature. Perhaps the new concern for ethical conduct will result in more whistle-blowing accounts of such misuse. If not, it is likely to remain so much hot air.

Nevertheless, an important question is: to what extent are testers responsible for the consequences, use and misuse of their instruments? To what extent can test design prevent misuse? The ALTE Code of Practice is interesting, in that it includes a brief discussion of test developers' responsibility to help users to interpret test results correctly, by providing reports of results that describe candidate performance clearly and accurately, and by describing the procedures used to establish pass marks and/or grades. If no pass mark is set, ALTE members are advised to provide information that will help users set pass marks when appropriate, and they should warn users to avoid anticipated misuses of test results.

Despite this laudable advice, the notion of consequential validity is in my view highly problematic because, as washback research has clearly shown, there are many factors that affect the impact a test will have, and how it will be used, misused and abused. Not many of these can be attributed to the test, or to test developers, and we need to demarcate responsibility in these areas. But, of course, the point is well taken that testers should be aware of the consequences of their tests, and should ensure that they at least behave ethically. Part of ethical behaviour, I believe, is indeed investigating, not just asserting, the impact of the tests we develop.

Politics

Clearly, tests can be powerful instruments of educational policy, and are frequently so used. Thus testing can be seen, and increasingly is being seen, as a political activity, and new developments in the field include the relation between testing and politics, and the politics of testing (Shohamy 2001). But this need not be only at the macro-political level of national or local government. Politics can also be seen as tactics, intrigue and manoeuvring within institutions that are themselves not political, but rather commercial, financial and educational. Indeed, I argue that politics with a small 'p' includes not only institutional politics, but also personal politics: the motivation of the actors themselves and their agendas (Alderson 1999).
Test development is a complex matter intimately bound up with a myriad of agendas and considerations. Little of this complex interplay of motives and actions surfaces in the language-testing literature (just as so little of teachers' motives for teaching test-preparation lessons the way they do is ever addressed critically in the literature). I do not have the space to explore the depth and breadth of these issues, but I would call for much more systematic study of the true politics of testing.

Clearly, any project involving change on a national level is complex. However, in language testing we often give the impression that all we have to do to improve our tests is to concentrate on the technical aspects of the measuring instruments, design appropriate specifications, commission suitable test tasks, devise suitable procedures for piloting and analysis, train markers, and let the system get on with things. Reform, in short, is considered a technical matter, not a social problem. However, innovations in examinations are social experiments that are subject to all sorts of forces and vicissitudes, and are driven by personal, institutional, political and cultural agendas, and a concentration on the technical at the expense of these other, more powerful, forces risks the success of the innovation.

But to concentrate on the macro-political at the expense of understanding individuals and their agendas is equally misleading. In my experience, the macro-politics are much less important than the private agendas, prejudices and motivations of individuals – an aspect of language testing never discussed in the literature, only in bars on the fringes of meetings and conferences. Exploring this area will be difficult, partly because of the sensitivities involved and partly because there are multiple perspectives on any event, and particularly on political events and actions. It will probably be difficult to publish any account of individual motivations for proposing or resisting test use and misuse. That does not make it any less important.

Testing is crucially affected by politics and testers need to understand matters of innovation and change: how to change, how to ensure that change will be sustainable, how to persuade those likely to be affected by change and how to overcome, or at least understand, resistance.

Standards: codes of practice and levels

Given the importance of tests in society and their role in educational policy, and given recent concerns with ethical behaviour, it is no surprise that one area of increasing concern has been that of standards in testing. One common meaning of standards is that of 'levels of proficiency' – 'what standard have you reached?' Another meaning is that of procedures for ensuring quality, as in 'codes of practice'.

Language testing has developed a concern to ensure that tests are developed following appropriate professional procedures. Despite the evidence accumulated in the book I co-authored (Alderson, Clapham and Wall 1995), where British EFL exam boards appeared not to feel obliged to follow accepted development procedures or to be accountable to the public for the qualities of the tests they sold, things have now changed, and a good example of this is the publication of the ALTE Code of Practice, which is intended to ensure quality work in test development throughout Europe.
'In order to establish common levels of proficiency, tests must be comparable in terms of quality as well as level, and common standards need, therefore, to be applied to their production.' (ALTE 1998). Mechanisms for monitoring, inspecting or enforcing such a code do not yet exist, and therefore the consumer should still be sceptical, but having a Code of Practice to refer to does strengthen the position of those who believe that testing should be held accountable for its products and procedures.

The other meaning of 'standards', as 'levels of proficiency', has been a concern for some considerable time, but has received new impetus, both with recent changes in Central Europe and with the publication of the Council of Europe's Common European Framework. The Council of Europe's Common European Framework is not only seen as independent of any possible vested interest, it also has a long pedigree, originating over 25 years ago in the development of the Threshold level, and thus its broad acceptability is almost guaranteed. In addition, the development of the scales of various aspects of language proficiency that are associated with the Framework has been extensively researched and validated, by the Swiss Language Portfolio project and DIALANG amongst others. I can confidently predict that we will hear much more about the Common European Framework in the coming years, and that it will increasingly become a point of reference for language examinations across Europe and beyond.

National tests

One of the reasons we will hear a great deal about the Common European Framework in the future is because of the increasing need for mutual recognition and transparency of certificates in Europe, for reasons of educational and employment mobility. National language qualifications, be they provided by the state or by quasi-private organisations, vary enormously in their standards – both quality standards and standards as levels. International comparability of certificates has become an economic as well as an educational imperative, and the availability of a transparent, independent framework like the Common European Framework is central to the desire to have a common scale of reference and comparison.

In East and Central Europe in particular, there is great interest in the Framework, as educational systems are in the throes of revising their assessment procedures. What is desired for the new reformed exams is that they should have international recognition, unlike the current school-leaving exams which in many places are seen as virtually worthless. Being able to anchor their new tests against the Framework is seen as an essential part of test development work, and there is currently a great deal of activity in the development of school-leaving achievement tests in the region.

National language tests have always been important, of course, and we still see much activity and many publications detailing this work, although unfortunately much of this is either description or heated discussion and is not based on research into the issues. This contrasts markedly with the literature surrounding international language proficiency examinations, such as TOEFL, TWE, IELTS and some Cambridge exams. Empirical research into various aspects of the validity and reliability of such tests continues apace, often revealing great sophistication in analytic methodology, and such research is, in general, at the leading edge of language-testing research.
This, however, masks an old concern: there is a tendency for language-testing researchers to write about large-scale international tests, and not about local achievement tests (including school-leaving tests that are clearly relatively high stakes). Given the amount of language testing that must be going on in the real world, there is a relative dearth of publications and discussions about achievement testing (especially low-stakes testing), and even less about progress testing. Test development work that is known to be going on, e.g. in Slovakia, the Baltics, St Petersburg and many other places, tends not to get published. Why is this? In many cases, reports are simply not written up, so the testing community does not know about the work. Perhaps those involved have no incentive to write about their work. Or perhaps this is because test developers feel that the international community is not interested in their work, which may not be seen as contributing to debates about test methods, appropriate constructs, the consequences of test use, and so on.

However, from my own involvement in exam reform in Hungary, I can say that there is a lot of innovative work that is of interest to the wider community and that should be published. In Hungary we have published articles based on the English examination reform, addressing issues such as the use of sequencing as a test method, research into paired oral tests, and procedures for standard setting, and we have even produced evidence to inform an ongoing debate in Hungary about how many hours per week should be devoted to foreign-language education in the secondary school system.

Indeed, testing is increasingly seen as a means of informing debates in language education more generally. Examples of this include baseline studies associated with examination reform, which attempt to describe current practice in language classrooms. What such studies have revealed has been used in INSET and PRESET in Central Europe. Washback studies can also be used in teacher training, both in order to influence test preparation practices and also, more generally, to encourage teachers to reflect on the reasons for their and others' practices.

Testing and language education

I am, of course, not the first to advance the argument that testing should be close to – indeed central to – language education. Not only as a means by which data can be generated to illuminate issues in language education, as I have suggested, and not only as an external control of curricular achievement, or as a motivator of students within classrooms. But also, and crucially, as contributing to and furthering language learning.

It is a commonplace to say that tests provide essential feedback to teachers on how their learners are progressing, but frankly, few tests do. Teacher-made tests are often poorly designed, provide little meaningful information, and serve more as a disciplinary function than a diagnostic one. Many language textbooks are not accompanied by progress or achievement tests, and those that are are rarely properly piloted and researched. There is a great lack of interest among testing researchers in improving classroom-based testing. And those who reject testing, as I shall discuss later, claim that teachers know better than tests anyway: they have a more intimate, deep and complex knowledge of what the students they teach are capable of.
Frankly I doubt this, and I have yet to see convincing (or indeed any) evidence that this might be the case. What language education needs is research and development work aimed at improving regular classroom assessment practice. This can partly be addressed by INSET workshops helping teachers to write better tests, but these can only reach so many teachers, and in any case teachers need more incentives to change their behaviour than can be provided by the occasional workshop. What holds much more promise is the development of low-stakes tests that can be made available to teachers for little or no charge via the Internet, which do not deliver certificates, but which are deliberately aimed at learning, at supporting teachers’ needs for student placement, at the diagnosis of students’ strengths and weaknesses, and at assessing student progress. There are already many language tests out there on the Internet, but the quality of many of these is atrocious, and what are urgently needed are high-quality, professionally developed tests that can be made available to regular classroom teachers to select to suit their own particular needs.

At the centre of testing for learning purposes, however, is the key question: what CAN we diagnose? Diagnosis is essentially done for individuals, not groups, and testing researchers will increasingly have to ask themselves: what do we understand about individual rather than group performances? Given what we know or suspect about the variation across individuals on tests, what confidence can we have in our knowledge of which ability or process underlies a test taker’s response to an item? I shall return to this issue below, but here I raise the question: does it matter if individual learners respond to test items in different ways? If we are dealing with total scores, probably not, because the whole is more than the parts, and normally decisions are made on the basis of total scores, not responses to individual items. But when we are looking at individual skills and individual weaknesses, when we are attempting diagnosis, rather than the characterisation of overall proficiency, what confidence can or must we have that we are accurate? What can we say with confidence about an individual, about his/her individual knowledge or ability, other than through a detailed examination of each item and each response? In the past we could not dream of conducting such a detailed examination on anything other than a very small scale, but now we can. With the help of technology, we can reveal detailed item-level scores and responses (as provided in DIALANG, for example). Thanks to computers we are now able to face the dilemma: what does it all mean?

Technology and testing

Although computers have been used in language testing for a long time, the 1990s saw an explosion of interest in mounting tests on computer, as personal computers and computer labs became much more available, and the accessibility of the World Wide Web increased. Many have pointed out that computer-based testing relies overwhelmingly on selected-response (typically multiple-choice) discrete-point tasks rather than performance-based items, and thus computer-based testing may be restricted to testing linguistic knowledge rather than communicative skills. No doubt this is largely due to the fact that computer-based tests require the computer to score responses. But recent developments offer some hope.
Human-assisted scoring systems (where most scoring of responses is done by computer but responses that the programs are unable to score are given to humans for grading) could reduce this dependency. Free-response scoring tools are capable of scoring responses up to 15 words long, producing scores which correlate with human judgements at impressively high levels. ETS has developed ‘e-rater’, which uses natural language-processing techniques to duplicate the performance of humans rating open-ended essays. Already, the system is used to rate GMAT essays and research is on-going for other programs, including second/foreign language testing situations.

Another example is PhonePass, which is delivered over the telephone, using tasks like reading aloud, repeating sentences, saying opposite words, and giving short answers to questions. Speech recognition techniques are used to rate the performances, and impressive reliability coefficients have been found, as well as correlations with the Test of Spoken English and with interviews.

The advantages of computer-based assessment are already evident, not only in that they can be more user-friendly, but also because they can be more compatible with language pedagogy. Computer-based testing removes the need for fixed delivery dates and locations normally required by traditional paper-and-pencil-based testing. Group administrations are unnecessary, and users can take the test when they wish, and on their own. Whilst diskette- and CD-ROM-based tests also have such advantages, tests delivered over the Internet are even more flexible in this regard: purchase of disks is not required, and anybody with access to the Internet can take a test. Moreover, disks are fixed in format, and once the disk has been created and distributed, it cannot easily be updated. However, with tests delivered by the Internet, access is possible to a much larger database of items, which can be constantly updated. Using the Internet, tests can be piloted alongside live test items. Once a sufficient number of responses has been obtained, the items could be calibrated automatically and then entered into the live database. Use of the Internet also means that results can be sent immediately to designated score users. Access to large databases of items means that test security can be greatly enhanced, since tests can be created by randomly accessing items in the database and producing different combinations of items. Thus any one individual is exposed to only a tiny fraction of available items and any compromise of items that might occur will have a negligible effect.

Test results can be made available immediately, unlike the results of paper-and-pencil-based tests, which take time to be collected, marked and issued. As well as being of obvious benefit to the users (receiving institutions, as well as candidates), immediate reporting brings a major pedagogic advantage: immediate feedback to the learner, either after each item has been responded to, or at the end of a sub-test, or after the whole test. Feedback given immediately after an activity has been completed is likely to be more meaningful and to have more impact than feedback which is substantially delayed. In traditional paper-and-pencil tests, the test results can be delayed for several months.
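Two of the possibilities just described – assembling each test form by random selection from a large item bank, with uncalibrated pilot items administered alongside live ones, and reporting a result as soon as the last response has been given – are easy to picture in outline. The following is a minimal sketch only, using an invented item bank and invented function names; it is not drawn from DIALANG, PhonePass or any other system mentioned in this chapter.

import random

# A hypothetical item bank: each item has an id, a prompt, a key, and a status
# flag distinguishing calibrated 'live' items from uncalibrated 'pilot' items.
ITEM_BANK = [
    {"id": 1, "prompt": "Choose the correct form: She ___ to school.", "key": "goes", "status": "live"},
    {"id": 2, "prompt": "Choose the synonym of 'rapid'.", "key": "fast", "status": "live"},
    {"id": 3, "prompt": "Complete: I have lived here ___ 2001.", "key": "since", "status": "live"},
    {"id": 4, "prompt": "Choose the antonym of 'scarce'.", "key": "plentiful", "status": "pilot"},
]

def assemble_form(bank, n_live=2, n_pilot=1, seed=None):
    """Draw a randomised form: live items count towards the score;
    pilot items are administered alongside them for later calibration."""
    rng = random.Random(seed)
    live = rng.sample([item for item in bank if item["status"] == "live"], n_live)
    pilot = rng.sample([item for item in bank if item["status"] == "pilot"], n_pilot)
    form = live + pilot
    rng.shuffle(form)          # each candidate sees a different combination of items
    return form

def score_form(form, responses):
    """Score only the live items immediately; pilot responses would simply
    be stored for later calibration and do not affect the result."""
    live_items = [item for item in form if item["status"] == "live"]
    correct = sum(1 for item in live_items if responses.get(item["id"]) == item["key"])
    return correct, len(live_items)

if __name__ == "__main__":
    form = assemble_form(ITEM_BANK, seed=42)
    # Hypothetical candidate responses, keyed by item id.
    responses = {1: "goes", 2: "fast", 3: "for", 4: "plentiful"}
    raw, total = score_form(form, responses)
    print(f"Immediate result: {raw}/{total} live items correct")

Keeping the pilot items out of the reported score, while still administering them, is what would allow new material to be trialled on live candidates and calibrated later without affecting anyone’s result.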
If feedback is given immediately after an item has been attempted, users could be allowed to make a second attempt at the item – with or without penalties for doing so in the light of feedback. The interesting question then arises: if the user gets the item right the second time, which is the true measure of ability, the performance before or after the feedback? I would argue that the second performance is a better indication, since it results from the users’ having learned something about their first performance and thus is closer to current ability.

Computers can also be user-friendly in offering a range of support to test takers: on-line Help facilities, clues, tailor-made dictionaries and more, and the use of such support can be monitored and taken into account in calculating test scores. Users can be asked how confident they are that the answer they have given is correct, and their confidence rating can be used to adjust the test score. Self-assessment and the comparison of self-assessment with test performance is an obvious extension of this principle of asking users to give insights into their ability. Similarly, adaptive tests need not be merely psychometrically driven, but the user could be given the choice of taking easier or more difficult items, especially in a context where the user is given immediate feedback on their performance. Learners can be allowed to choose which skill they wish to be tested on, or which level of difficulty they take a test at. They can be allowed to choose which language they wish to see test rubrics and examples in, and in which language results and feedback are to be presented.

An example of a computer-based diagnostic test, available over the Internet, which capitalises on the advantages I have mentioned, is DIALANG (see Chapter 8 by Sari Luoma, page 143). DIALANG uses self-assessment as an integral part of diagnosis, asking users to rate their own ability. These ratings are used in combination with objective techniques in order to decide which level of test to deliver to the user. DIALANG provides immediate feedback to users, not only on scores, but also on the relationship between their test results and their self-assessment. DIALANG also gives extensive explanatory and advisory feedback on test results. The language of administration, of self-assessment, and of feedback, is chosen by the test user from a list of 14 European languages, and users can decide which skill they wish to be tested in, in any one of 14 European languages.

One of the claimed advantages of computer-based assessment is that computers can store enormous amounts of data, including every keystroke made by candidates, the sequence of those keystrokes, and the time taken to respond to a task, as well as the correctness of the response, the use of help, clue and dictionary facilities, and much more. The challenge is to make sense of this mass of data. A research agenda is needed. What is needed above all is research that will reveal more about the validity of the tests, that will enable us to estimate the effects of the test method and delivery medium; research that will provide insights into the processes and strategies test takers use; studies that will enable the exploration of the constructs that are being measured, or that might be measured.
Alongside development work that explores how the potential of the medium might best be harnessed in test methods, support, diagnosis and feedback, we need research that investigates the nature of the most effective and meaningful feedback; the best ways of diagnosing strengths and weaknesses in language use; the most appropriate and meaningful clues that might prompt a learner’s best performance; the most appropriate use and integration of media and multimedia that will allow us to measure those constructs that might have eluded us in more traditional forms of measurement – for example, latencies in spontaneous language use, planning and execution times in task performance, speed reading and processing time more generally. And we need research into the impact of the use of the technology on learning, on learners and on the curriculum (Alderson 2000). In short, what is the added value of the use of computers?

Constructs and construct validation

Recently, language testing has come to accept Messick’s conception of construct validity as being a unified, all-encompassing concept, which recognises multiple perspectives on validity. In this recently accepted view there is No Single Answer to the validity question ‘What does our test measure?’ or ‘Does this test measure what it is supposed to measure?’. Indeed, the question is now rephrased along the following lines: ‘What is the evidence that supports particular interpretations of scores on this test?’ New perspectives incorporated into this unified view of construct validity include test consequences, but there is considerable debate in the field, as I have hinted already, about whether test developers can be held responsible for test use and misuse. Is consequential validity a legitimate area of concern or a political posture? As a result of this unified perspective, validation is now seen as on-going, as the continuous monitoring and updating of relevant information, as a process that is never complete.

What is salutary and useful about this new view of construct validity is that it places the test’s construct at the centre of focus, and somewhat readjusts traditional concerns with test reliability. Emphasising the centrality of constructs – what we are trying to measure – necessarily demands an applied linguistic perspective. What do we know about language knowledge and ability, and ability for use? Assessing language involves not only the technical skills and knowledge to construct and analyse a test – the psychometric and statistical side to language testing – but it also requires a knowledge of what language is, what it means to ‘know’ a language, what is involved in learning a language as your mother tongue or as a second or subsequent language, and what is required to get somebody to perform, using language, to the best of their ability.

In the early 1990s Caroline Clapham and I published an article in which we reported our attempt to find a consensus model of language proficiency on which we could base the revised ELTS test – the IELTS, as it became known (Alderson and Clapham 1992). We were somewhat critical of the lack of consensus about which applied linguistic models we could use, and we reported our decision to be eclectic.
If we repeated that survey now, I believe we would find more consensus: the Bachman model, as it is called, is now frequently referred to, and is increasingly influential as it is incorporated into views of the constructs of reading, listening, vocabulary and so on. The model has its origins in applied linguistic thinking by Hymes, and Canale and Swain, and in research, e.g. by Bachman and Palmer and by the Canadian Immersion studies, and it has become somewhat modified as it is scrutinised and tested. But it remains very useful as the basis for test construction, and for its account of test-method facets and task characteristics.

I have already suggested that the Council of Europe’s Common European Framework will be influential in the years to come in language education generally, and one aspect of its usefulness will be its exposition of a model of language, language use and language learning – often explicitly referring to the Bachman model. The most discussed aspect of the Common European Framework to date is its set of scales, developed by North and others, which are generally perceived as the most useful aspect of the Framework, not only for their characterisation and operationalisation of language proficiency and language development, but above all for their practical value in measuring and assessing learning and achievement. I would argue that the debate about what constructs to include, what model to test, has diminished in volume as testers are now engaged in exploring the empirical value and validity of these consensus models through comparability studies and the like.

Yet despite this activity, there are still many gaps in our knowledge. It is for example a matter of regret that there are no published studies of tests like the innovative Royal Society of Arts’ Test of the Communicative Use of English as a Foreign Language – CUEFL (now rebadged as Cambridge’s Certificates in Communicative Skills in English, UCLES 1995). Empirical studies could have thrown valuable light both on development issues – the repetition of items/tasks across different test levels, the value of facsimile texts and realistic tasks – and on construct matters, such as the CUEFL’s relation to more traditional tests and constructs. Why the research has not been done – or reported – is probably a political matter, once again. We could learn a lot more about testing if we had more publications from exam boards – ALTE members – about how they construct their tests and the problems they face. A recent PhD study by Sari Luoma looked at theories of test development and construct validation and tried to see how the two can come together (Luoma 2001). Her research was limited by the lack of published studies that could throw light on problems in development (the one exception was IELTS). Most published studies are sanitised corporate statements, emphasising the positive features of tests rather than discussing knotty issues in development or construct definition that have still to be addressed. My own wish list for the future of language testing would include more accounts by developers (along the lines of Alderson, Nagy, and Öveges 2000) of how tests were developed, and of how constructs were identified, operationalised, tested and revised.
Such accounts could contribute to the applied linguistic literature by helping us understand these constructs and the issues involved in operationalisation – in validating, if you like, the theory.

Pandora’s boxes

Despite what I have said about the Bachman Model, McNamara has opened what he calls ‘Pandora’s box’ (McNamara 1995). He claims that the problem with the Bachman Model is that it lacks any sense of the social dimension of language proficiency. He argues that it is based on psychological rather than socio-psychological or social theories of language use, and he concludes that we must acknowledge the intrinsically social nature of performance and examine much more carefully its interactional – i.e. social – aspects. He asks the disturbing question: ‘whose performance are we assessing?’ Is it that of the candidate? Or the partner in paired orals? Or the interlocutor in one-to-one tests? The designer who created the tasks? Or the rater (and the creator of the criteria used by raters)? Given that scores are what is used in reporting results, a better understanding of how scores are arrived at is crucial. Research has intensified into the nature of the interaction in oral tests and I can confidently predict that this will continue to be a fruitful area for research, particularly with reference to performance tests.

Performance testing is not in itself a new concern, but is a development from older concerns with the testing of speaking. Only recently, however, have critiques of interviews made their mark. It has been shown through discourse analysis that the interview is only one of many possible genres of oral task, and it has become clear that the language elicited by interviews is not the same as that elicited by other types of task, and by different sorts of social interaction which do not have the asymmetrical power relations of the formal interview. Thus different constructs may be tapped by different tasks.

Hill and Parry (1992) claim that traditional tests of reading assume that texts ‘have meaning’, and view text, reader and the skill of reading itself as autonomous entities. In contrast, their own view of literacy is that it is socially constructed, and they see the skill of reading as being much more than decoding meaning. Rather, reading is the socially structured negotiation of meaning, where readers are seen as having social, not just individual, identities. Hill and Parry’s claim is that this view of literacy requires an alternative approach to the assessment of literacy that includes its social dimension. One obvious implication of this is that what it means to understand a text will need to be revisited. In the post-modern world, where multiple meanings are deconstructed and shown to be created and recreated in interaction, almost anything goes: what then can a text be said to ‘mean’ and when can a reader be said to have ‘understood’ a text?

Another Pandora’s box

A related issue in foreign-language reading is that of levels of meaning, and the existence of reading skills that enable readers to arrive at these different levels. This hierarchy of skills has often been characterised as consisting of ‘higher-order’ and ‘lower-order’ skills, where ‘understanding explicitly stated facts’ is held to be ‘lower-order’ and ‘synthesising ideas contained in text’ or ‘distinguishing between relevant and irrelevant ideas’ is held to be ‘higher-order’.
However, this notion has been challenged on the grounds, firstly, that expert judges do not agree, on the whole, on whether given test questions are assessing higher- or lower-order skills, and secondly, that even for those items where experts agree on the level of skill being tested, there is no correlation between level of skill and item difficulty. Item difficulty does not necessarily relate to ‘level’ of skill, and students do not have to acquire lower-order skills before they can acquire higher-order skills. Such a conclusion has proved controversial and there is some evidence that, provided that teachers can agree on a definition of skills, and provided disagreements are discussed at length, substantial agreement on matching sub-skills to individual test items (but not according to level of skill) can be reached. I argue that all that this research proves is that judges can be trained to agree. That does not mean that skills can be separated, or be tested separately by individual test items. Introspective accounts from students completing tests purporting to assess individual sub-skills in individual items demonstrate that students can get answers correct for the ‘wrong’ reason – i.e. without exhibiting the skill intended – and can get an item wrong for the right reason – i.e. whilst exhibiting the skill in question. Individuals responding to test items do so in a complex variety of different and interacting ways, and experts judging test items are not well placed to predict how students with quite different levels of language proficiency might actually respond to test items. Therefore generalisations about what skills reading test items might be testing are fatally flawed. This clearly presents a dilemma for test developers and researchers. Substantive findings on the nature of what is being tested in a reading test remain inconclusive.

A third Pandora’s box

Authenticity is a long-standing concern in language testing as well as in teaching, with the oft-repeated mantra that if we wish to test and predict a candidate’s ability to communicate in the real world, then texts and tasks should be as similar to that real world as possible. More recent discussions have become more focused, but, rather like washback a few years ago, have not to date been informed by empirical research findings. However, Lewkowicz (1997) reports a number of studies of authenticity, which result in some disturbing findings. Firstly, she found that neither native nor non-native speaker judges could identify whether listening or reading texts were or were not genuine. Secondly, students did not perceive test authenticity as an important quality that would affect their performance – they tended to be pragmatic, looking at whether they were familiar with the test format and whether they thought they would do well on the tests. Thirdly, moderating committees responsible for developing tests that are claimed to be authentic to target language-use situations were shown rarely to appeal to the criterion of authenticity when selecting texts and tasks for a communicative test, and frequently to edit texts and tasks in order, for example, to disambiguate them. And finally, a study of the integration of reading and writing tasks, in an attempt to make writing tasks more authentic in terms of Target Language Use needs, showed that when students were given source texts to base their writing on they did not produce writing that was rated more highly.
Indeed, some students in the group that had source texts to refer to were arguably disadvantaged by copying long chunks from the source texts.

Bachman (1990) distinguishes between ‘situational authenticity’ and ‘interactional authenticity’, where situational authenticity relates to some form of replication of actual speech events in language-use situations (‘life-likeness’) and interactional authenticity is ‘a function of the extent and type of involvement of test takers’ language ability in accomplishing a test task’. Bachman argues that authenticity is not an all-or-nothing affair: a test task could be high on situational authenticity and low on interactional authenticity, or vice versa. In a later publication, Bachman and Palmer (1996) separate the notion of authenticity from that of interactiveness and define authenticity as ‘the degree of correspondence of the characteristics of a given language test task to the features of a TLU task’. Bachman and Palmer consider ‘authenticity’ to be a ‘critical quality of language tests’, an aspect of what they call test usefulness, alongside validity, reliability, consequences, interactiveness and practicality, and they claim that authenticity has a strong effect on candidates’ test performance. Lewkowicz’s research, cited above, challenges this belief, and it is clear that much more empirical research, and less theoretical speculation, is needed before the nature and value of ‘authenticity’ can be resolved.

Yet another Pandora’s box

Ten years ago, I argued (Alderson and North 1991) that the traditional distinction between reliability and validity, clear though it is conceptually, becomes quite unclear when we examine how reliability is operationalised. It is easy enough to understand the concept of reliability when it is presented as ‘being consistent, but you can be consistently wrong’. Validity is ‘being right, by measuring the ability you wish to measure’, and of course this can only be measured if you do so consistently. Commentators have always argued that the two concepts are complementary, since to be valid a test needs to be reliable, although a reliable test need not be valid. However, I argued that this is too simple. In fact, if we look at how reliability is measured, problems emerge.

The classic concept of reliability is test-retest. Give a test to the same people on a second occasion and the score should remain the same. But what if they have learned from the first administration? What if their language ability has changed in some way between the two administrations? In neither case is this necessarily a matter of unreliability, but low correlations between the two administrations would be valid, if the ability had changed. Of course, there are obvious practical difficulties in persuading test takers to take the test again, unless there is some intervening period, but the longer the period that intervenes, the more likely we are to expect a less-than-perfect correlation between the two scores for perfectly valid reasons. Test-retest reliability is rarely measured, not for the theoretical reasons I have presented, but for practical reasons. More common is parallel-form reliability, where on the second occasion a parallel form of the test is administered, in an attempt to avoid the learning effect. But immediately we are faced with the problem of the nature of parallelism. Traditional concurrent validity is the usu