Issues in Language Testing
Editors: J Charles Alderson, Arthur Hughes
The British Council, 1981
Summary
This book, published by The British Council, discusses issues in language testing, based on a 1980 symposium. The authors consider communicative language testing, testing of English for specific purposes, and general language proficiency. It aims to clarify the key areas where future research in language testing is needed.
Full Transcript
ELT-38 Issues in Language Testing

Milestones in ELT

The British Council was established in 1934 and one of our main aims has always been to promote a wider knowledge of the English language. Over the years we have issued many important publications that have set the agenda for ELT professionals, often in partnership with other organisations and institutions. As part of our 75th anniversary celebrations, we re-launched a selection of these publications online, and more have now been added in connection with our 80th anniversary. Many of the messages and ideas are just as relevant today as they were when first published. We believe they are also useful historical sources through which colleagues can see how our profession has developed over the years.

Issues in Language Testing

This book is based on papers and discussions at a Lancaster University symposium in October 1980 where seven applied linguists met to discuss problems in language testing. In the Introduction, the book's editor Charles Alderson refers to the discomfort felt by many language teaching practitioners faced with the subject of 'testing', given the predominance of statistical analysis in the field. Nevertheless, Alderson noted increasing needs to clarify issues in three areas, corresponding to the three main sections of the book: communicative language testing, testing of English for specific purposes, and testing of general language proficiency. Within each section there are three parts: the original article(s), reaction papers and an account of the discussion based upon tape recordings of the proceedings by Alderson.

ELT Documents 111 - Issues in Language Testing

Editors: J Charles Alderson, Arthur Hughes

The British Council Central Information Service, English Language and Literature Division

The opinions expressed in this volume are those of the authors and do not necessarily reflect the opinion of the British Council.

ELT Documents is now including a correspondence section. Comments arising from articles in current issues will therefore be most welcome. Please address comments to ELSD, The British Council, 10 Spring Gardens, London SW1A 2BN.

The articles and information in ELT Documents are copyright but permission will generally be granted for use in whole or in part by educational establishments. Enquiries should be directed to the British Council, Design, Production and Publishing Department, 65 Davies Street, London W1Y 2AA.
ISBN 0 901618 51 9
© The British Council 1981

CONTENTS

INTRODUCTION - J Charles Alderson, University of Lancaster

SECTION 1: Communicative Language Testing
Communicative language testing: revolution or evolution - Keith Morrow, Bell School of Languages, Norwich
Reaction to the Morrow paper (1) - Cyril J Weir, Associated Examining Board
Reaction to the Morrow paper (2) - Alan Moller, The British Council, London
Reaction to the Morrow paper (3) - J Charles Alderson, University of Lancaster
Report of the discussion on Communicative Language Testing - J Charles Alderson, University of Lancaster

SECTION 2: Testing of English for Specific Purposes
Specifications for an English Language Testing Service - Brendan J Carroll, The British Council, London
Reaction to the Carroll paper (1) - Caroline M Clapham, University of Lancaster
Reaction to the Carroll paper (2) - Clive Criper, University of Edinburgh
Background to the specifications for an English Language Testing Service and subsequent developments - Ian Seaton, ELTSLU, The British Council, London
Report of the discussion on Testing English for Specific Purposes - J Charles Alderson, University of Lancaster

SECTION 3: General Language Proficiency
Basic concerns in test validation - Adrian S Palmer, English Language Institute, University of Utah, USA, and Lyle F Bachman, University of Illinois, USA
Why are we interested in 'General Language Proficiency'? - Helmut J Vollmer, University of Osnabruck, Germany
Reaction to the Palmer & Bachman and the Vollmer papers (1) - Arthur Hughes, University of Reading
Reaction to the Palmer & Bachman and the Vollmer papers (2) - Alan Davies, University of Edinburgh
Report of the discussion on General Language Proficiency - J Charles Alderson, University of Lancaster
Response: Issue or non-issue? General Language Proficiency revisited - Helmut J Vollmer, University of Osnabruck
Epilogue - Arthur Hughes, University of Reading

INTRODUCTION

This book arose from an occasion in October 1980 when seven applied linguists met in Lancaster to discuss what they felt were important problems in the assessment of learning a second or foreign language. This Symposium resulted, partly because of its informal nature and its deliberately small size, in an intense discussion in certain areas, a concentration which is rarely possible in conferences or large seminars. It was felt that the Symposium had been so useful that it was decided to make the discussion public, in order not only to let others know what had happened at Lancaster, but also to encourage and stimulate a much broader and hopefully even richer debate in the areas touched upon.

Testing has become an area of increased interest to language teachers and applied linguists in the last decade. Yet as Davies says (Davies 1979) testing has for many years firmly resisted attempts to bring it within the mainstream of applied linguistics. This is no doubt to some extent due to historical reasons, as both Davies and Morrow (this volume) suggest. In the era that Spolsky dubbed the 'psychometric-structuralist period', language testing was dominated by criteria for the establishment of educational measuring instruments developed within the tradition of psychometrics. As a result of this emphasis on the statistical analysis of language tests, a group developed, over the years, of specialists in language testing, 'Testing Experts', popularly believed to live in an arcane world of numbers and formulae.
As most language teachers are from a non-numerate background (sometimes having deliberately fled 'figures') it is not surprising that they were reluctant to involve themselves in the mysteries of statistics. Consequently, an expertise developed in language testing, and particularly proficiency testing, divorced from the concerns of the language classroom, and imbued with its own separate concerns and values which to outsiders were only partially comprehensible and apparently irrelevant. Despite the advent of Spolsky's third phase of language testing, the psycholinguistic-sociolinguistic phase (what Moller (this volume) calls the third and fourth phases, psycholinguistic-sociolinguistic and sociolinguistic-communicative), 'testing' has not yet recovered from this image of being stubbornly irrelevant to or unconcerned with the language teacher, except for its embodiment in 'exams' which dominate many a syllabus (be it the Cambridge First Certificate or the TOEFL). Teachers who have felt they should be concerned with assessing what or whether learners have learned have found the jargon and argumentation of 'Testing' forbidding and obscure.

But evaluation (note how the terminology has changed over the years, with the intention of making the subject less threatening) is readily acknowledged by teachers and curriculum theorists alike to be an essential part of language learning, just as feedback is recognised as essential in any learning process. The consequence of this need to evaluate has been that teachers have actually carried out tests all along, but have felt uncomfortable, indeed guilty and apologetic, about doing so when there is apparently so much about 'testing' they do not know. So when suggesting that 'Testing' has become more central to the present-day concerns of language teachers, it is not intended to imply that previously 'in the bad old days' nobody tested, or that the testing that was done was of ill repute, but merely to suggest that teachers felt that what they were doing was in some important sense lacking in respectability, however relevant or important it might actually have been.

The fact is, however, that testing has become an area of increased research activity, and many more articles are published on the subject today in professional journals than ten years ago. This is evidence of a turning in the tide of applied linguistics towards more empirical concerns. It has been suggested that testing has to date remained outside the mainstream of applied linguistics; in particular, the view of language incorporated in many tests has become increasingly at odds with theories of language and language use; indeed, to some extent at least, it no longer reflects classroom practice in language teaching. Now there may be good arguments for tests not to follow the whim of fashion in language teaching, but when there is a serious discrepancy between the teaching and the means of evaluating that teaching, then something appears to be amiss.

The feeling abroad today is that theories abound of communicative language teaching, of the teaching of ESP, of integrated language teaching, but where are the tests to operationalise those theories? Where are the communicative language tests, the ESP tests, the integrated language tests? Applied linguists and language teachers alike are making increasingly insistent demands on language testers to supply the language tests that current theory and practice require, and the response of testers has, to date, been mixed.
Some have rushed in where others have feared to tread: extravagant claims have been made for new techniques, new tests, new assessment procedures. Others have stubbornly resisted the pressure, claiming that tests of communicative competence or ESP are either impossible (in theory, or in practice) or unnecessary because existing tests and techniques are entirely adequate. Inevitably, there are also agnostics on the sidelines, who remain sceptical until they have seen the evidence for and against the claims of either side. This book is for those agnostics, though believers and non-believers alike may find something of interest. The Symposium at Lancaster was an attempt to focus, without taking sides, on areas of major concern to teachers and testers at present: communicative language testing, the testing of English for Specific Purposes, the testing of general language proficiency. It was hoped by intense debate to establish what the important issues were in these areas, so that the interested reader could provide himself with a set of criteria for judging (or constructing) language tests, or perhaps more realistically, for investigating further. It is clear, always, that more research is needed, but it is hoped that this book will help to clarify where research and development needs to be concentrated at present. We are living in a world of claim and counter-claim, where the excitement of the battle may make us lose sight of the reasons for the conflict: namely the need for learners and outsiders to assess progress in language learning, or potential for such progress, as accurately as possible. No research programme or test development should forget this.

The format of the Symposium was as follows. Having decided on the three main areas for debate, recent and influential articles in those areas were selected for study and all Symposium participants were asked to produce papers reacting to one or more of these articles, outlining what they felt to be the important issues being raised. These reaction papers were circulated in advance of the Symposium, and the Symposium itself consisted of a discussion in each of the three areas, based on the original articles and the related reaction papers.

Like the Symposium, the volume is divided into three main sections: one section for each of the areas of communicative language testing, ESP testing, and general language proficiency. Within each section there are three parts: the original article(s), the reaction papers and an account of the discussion based upon tape recordings of the proceedings by the present writer. These accounts of the discussion do not represent the views of any one participant, including the present writer, but are an attempt to summarise the issues that were raised. However, it should be stressed that although the accounts of the discussion attempt to be fair to the substance and quality of the debate, they must, inevitably, ultimately represent one person's view of what was said, since it would be impossible to achieve complete consensus on what was said, let alone its correctness or significance. At times the accounts repeat points made in the reaction papers also published in this volume, but no apologies are offered for repetition, as this simply reflects the level of interest in or concern over these particular points. Although it was hoped to include responses from the authors of the original articles, only one response was available at the time of going to press, that of Helmut Vollmer.
Nevertheless, it is hoped that subsequent debate will include the responses and further thoughts of the other authors in the light of these discussions.

This is not a definitive volume on language testing and it does not attempt to be such. What this book hopes to do is to encourage further debate, a critical or sceptical approach to claims made about 'progress' and 'theories', and to encourage practical research in important areas. It has not been the intention of this Introduction to guide the reader through the discussions (that would have been presumptuous and unnecessary) but rather to set the scene for them. Thus there is here no summary of positions taken, arguments developed and issues raised. However, there is, after the three main sections, an Epilogue, and the reader is advised not to ignore this: it is intended, not to tell the reader what he has read, but to point the way forward in the ongoing debate about the assessment of language learning. 'Testing' should not and cannot be left to 'Testers': one of the most encouraging developments of the last decade is the involvement of more applied linguists in the area of assessment and evaluation. In a sense, there can be no Epilogue, because the debate is unfinished, and we hope that participation in the debate will grow. It is ultimately up to the reader to write his own 'Way Forward'.

Thanks are due to all Symposium participants, not only for their contributions, written and spoken, to the Symposium, but also for their help in preparing this volume. Thanks are also due to the Institute for English Language Education, Lancaster, for hosting the Symposium and contributing materially to the preparation of this book.

J Charles Alderson, University of Lancaster

SECTION 1

COMMUNICATIVE LANGUAGE TESTING: REVOLUTION OR EVOLUTION? [1]
Keith Morrow, Bell School of Languages, Norwich

Introduction

Wilkins (1976) concludes with the observation that, 'we do not know how to establish the communicative proficiency of the learner' and expresses the hope that, 'while some people are experimenting with the notional syllabus as such, others should be attempting to develop the new testing techniques that should, ideally, accompany it' (loc cit). In the two years that have passed since the publication of this book, the author's hope on the one hand has been increasingly realised, and if his observation on the other is still valid, there are grounds for believing that it will not be so for much longer.

At the time of writing, it is probably true to say that there exists a considerable imbalance between the resources available to language teachers (at least in EFL) in terms of teaching materials, and those available in terms of testing and evaluation instruments. The former have not been slow to incorporate insights into syllabus design, and increasingly methodology, deriving from a view of language as communication; the latter still reflect, on the whole, ideas about language and how it should be tested which fail to take account of these recent developments in any systematic way. [2]

This situation does seem to be changing, however. A number of institutions and organisations have set up working parties to assess the feasibility of tests based on communicative criteria, and in some cases these have moved on to the design stage. [3] It therefore seems reasonable to expect that over the next five years new tests and examinations will become available which will aim to do precisely the job which Wilkins so recently held up as a challenge, ie to measure communicative proficiency.

This paper, then, will be concerned with the implications for test design and construction of the desire to measure communicative proficiency, and with the extent to which earlier testing procedures need to be reviewed and reconsidered in the light of this objective. But it is a polemical paper. The assumption which underlies it is that the measurement of communicative proficiency is a job worth doing, and the task is ultimately a feasible one.

[1] This article was first published in The Communicative approach to language teaching, ed: C J Brumfit and K Johnson, Oxford University Press, 1979. Reprinted here by kind permission of Oxford University Press.

[2] Exceptions to this are the two oral examinations promoted by the Association of Recognised English Language Schools: the ARELS Certificate and the ARELS Diploma, as well as the Joint Matriculation Board's Test in English for Overseas Students. But without disrespect to these, I would claim that they do not meet in a rigorous way some of the criteria established later in this paper.

[3] My own work in this field has been sponsored by the Royal Society of Arts, who have established a Working Party to re-design their range of examinations for foreign students. The English Language Testing Service of the British Council is developing communicative tests in the area of English for Academic Purposes, and a similar line is likely to be followed soon by the Associated Examining Board.

The Vale of Tears

A wide range of language tests and examinations are currently in use but most belong to a few key types. Spolsky (1975) identifies three stages in the recent history of language testing: the pre-scientific, the psychometric-structuralist, and the psycholinguistic-sociolinguistic. We might characterise these in turn as the Garden of Eden, the Vale of Tears and the Promised Land, and different tests (indeed different parts of the same test) can usually be seen to relate to one or other of these stages. The historical perspective offered by Spolsky is extremely relevant to the concerns of this paper. While critiques of the 'pre-scientific' approach to testing are already familiar (Valette, 1967), it seems useful to take some time here to clarify the extent to which current developments relate to what has more immediately gone before, through a critical look at some of the characteristics of psychometric-structuralist testing. The point of departure for this is Lado (1961).

Atomistic

A key feature of Lado's approach is the breaking down of the complexities of language into isolated segments. This influences both what is to be tested and how this testing should be carried out. What is to be tested is revealed by a structural contrastive analysis between the target language and the learner's mother tongue. Structural here is not limited to grammatical structure, though this is of course important. Contrastive analysis can be carried out of all the levels of structure (syntactic down to phonological) which the language theory encompasses, and test items can be constructed on the basis of them. The same approach is adopted to the question of how to test. Discrete items are constructed, each of which ideally reveals the candidate's ability to handle one level of the language in terms of one of the four skills.
It soon became recognised that it was in fact extremely difficult to construct 'pure' test items which were other than exceedingly trivial in nature, and thus many tests of this sort contain items which operate on more than one level of structure. The clear advantage of this form of test construction is that it yields data which are easily quantifiable. But the problem is equally clearly that its measurement of language proficiency depends crucially upon the assumption that such proficiency is neatly quantifiable in this way. Indeed the general problem with Lado's approach, which attaches itself very firmly to certain very definite views about the nature of language, is that it crumbles like a house of cards as soon as the linguistic foundation on which it is constructed is attacked. This is not the place to develop a generalised linguistic attack, but one particular assumption is worth picking up, since it is so central to the issue under discussion. An atomistic approach to test design depends utterly on the assumption that knowledge of the elements of a language is equivalent to knowledge of the language. Even if one adopts for the moment a purely grammatical view of what it is to know a language (cf Chomsky's definition in terms of the ability to formulate all and only the grammatical sentences in a language), then it seems fairly clear that a vital stage is missing from an atomistic analysis, viz the ability to synthesise. Knowledge of the elements of a language in fact counts for nothing unless the user is able to combine them in new and appropriate ways to meet the linguistic demands of the situation in which he wishes to use the language. Driving a car is a skill of a quite different order from that of performing in isolation the various movements of throttle, brake, clutch, gears and steering wheel.

Quantity v. Quality

In the previous section it was the linguistic basis of tests such as Lado's which was questioned. Let us now turn to the psychological implications. Following the behaviourist view of learning through habit formation, Lado's tests pose questions to elicit responses which show whether or not correct habits have been established. Correct responses are rewarded and negative ones punished in some way. Passing a test involves making a specified proportion of correct responses. Clearly language learning is viewed as a process of accretion.

An alternative view of the psychology of language learning would hold, however, that the answers to tests can, and should, be considered as more than simply right or wrong. In this view learners possess 'transitional competence' (Corder, 1975) which enables them to produce and use an 'interlanguage' (Selinker, 1972). Like the competence of a native speaker, this is an essentially dynamic concept and the role of the test is to show how far it has moved towards an approximation of a native speaker's system. Tests will thus be concerned with making the learner produce samples of his own 'interlanguage', based on his own norms of language production, so that conclusions can be drawn from it. Tests of receptive skills will similarly be concerned with revealing the extent to which the candidate's processing abilities match those of a native speaker. The clear implication of this is that the candidate's responses need to be assessed not quantitatively, but qualitatively. Tests should be designed to reveal not simply the number of items which are answered correctly, but to reveal the quality of the candidate's language performance.
It is not safe to assume that a given score on the former necessarily allows conclusions to be drawn about the latter.

Reliability

One of the most significant features of psychometric tests as opposed to those of 'pre-scientific' days is the development of the twin concepts of reliability and validity. The basis of the reliability claimed by Lado is objectivity. The rather obvious point has, however, not escaped observers (Pilliner, 1968; Robinson, 1973) that Lado's tests are objective only in terms of actual assessment. In terms of the evaluation of the numerical score yielded, and perhaps more importantly, in terms of the construction of the test itself, subjective factors play a large part.

It has been equally noted by observers that an insistence on testing procedures which can be objectively assessed has a number of implications for the data yielded. Robinson (op cit) identifies three areas of difference between testing procedures designed to yield data which can be objectively assessed and those which are open to subjective assessment.

1 The amount of language produced by the student. In an objective test, students may actually produce no language at all. Their role may be limited to selecting alternatives rather than producing language.

2 Thus the type of ability which is being tested is crucially different. In a subjective test the candidate's ability to produce language is a crucial factor; in an objective test the ability to recognise appropriate forms is sufficient.

3 The norms of language use are established on different grounds. In an objective test the candidate must base his responses upon the language of the examiner; in a subjective test, the norms may be his own, deriving from his own use of the language. Thus an objective test can reveal only differences and similarities between the language norms of the examiner and candidate; it can tell us nothing of the norms which the candidate himself would apply in a use situation.

The above factors lead to what Davies (1978) has called the reliability-validity 'tension'. Attempts to increase the reliability of tests have led test designers to take an over-restrictive view of what it is that they are testing.

Validity

The idea that language test designers should concern themselves with validity (in other words, that they should ask themselves whether they are actually testing what they think they are testing, and whether what they think they are testing is what they ought to be testing) is clearly an attractive one. But unfortunately, because of the 'tension' referred to above, designers working within the tradition we are discussing seem to have been content with answers to these questions which are less than totally convincing. Five types of validity which a language test may claim are traditionally identified (cf Davies, 1968).

Face: the test looks like a good one.
Content: the test accurately reflects the syllabus on which it is based.
Predictive: the test accurately predicts performance in some subsequent situation.
Concurrent: the test gives similar results to existing tests which have already been validated.
Construct: the test reflects accurately the principles of a valid theory of foreign language learning.

Statistical techniques for assessing validity in these terms have been developed to a high, and often esoteric, level of sophistication. But unfortunately, with two exceptions (face, and possibly predictive) the types of validity outlined above are all ultimately circular.
Starting from a certain set of assumptions about the nature of language and language learning will lead to language tests which are perfectly valid in terms of these assumptions, but whose value must inevitably be called into question if the basic assumptions themselves are challenged. Thus a test which perfectly satisfies criteria of content, construct or concurrent validity may nonetheless fail to show in any interesting way how well a candidate can perform in or use the target language. This may occur quite simply if the construct of the language learning theory, and the content of the syllabus, are themselves not related to this aim, or if the test is validated against other language tests which do not concern themselves with this objective. There is clearly no such thing in testing as 'absolute' validity. Validity exists only in terms of specified criteria, and if the criteria turn out to be the wrong ones, then validity claimed in terms of them turns out to be spurious. Caveat emptor.

Comments

The criticism, implicit and explicit, made in the preceding sections applies to a theory of testing which has hardly ever been realised in the extreme form in which Lado presented it. Certainly in the UK, a mixture of pragmatism and conservatism has ensured that much of the institutionalised testing of foreign languages owes as much to the 1920's as to the 1960's. This does not mean, though, that there is anything chimerical about the ideas put forward by Lado. Their influence has been recognised by writers on language testing ever since the first publication of his book. But it is as a representation of theory that the ideas are most significant. In practice, as Davies (1978) remarks, there is very often a gap between what Lado himself does and what he says he does. But this gap is often of detail rather than principle. Even if the totality of Lado's views have been more often honoured in the breach than in the observance, the influence of his work has been tremendous. Of the ideas examined above, very few have failed to find implicit acceptance in the majority of 'theory-based' tests developed over the last fifteen years. The overriding importance of reliability (hence the ubiquitous multiple-choice), the acceptance of validity of a statistical rather than necessarily of a practical nature, the directly quantifiable modes of assessment: these are all ideas which have become common currency even among those who would reject many of the theories of language and language learning on which Lado based his approach.

Only in one area has a consistent alternative to Lado's views been argued, and that is the development of 'integrated' tests/test items [4] as opposed to Lado's arguments (at least in principle) in favour of 'pure' discrete items. [5] A clear statement of an 'integrated' position is made by Carroll (1968): '... since the use of language in ordinary situations call upon all these aspects [of language], we must further recognise that linguistic performance also involves the individual's capability of mobilizing his linguistic competences and performance abilities in an integrated way, ie in the understanding, speaking, reading or writing of connected discourse.' This implies a view of language which runs directly counter to a key assumption which we have earlier examined in Lado's work. It denies the atomistic nature of language as a basis for language testing.
To this extent, Carroll's contribution is extremely important, but even here it must be observed that in practical terms he was doing no more than providing a post-hoc rationalisation. For the purely practical reasons alluded to earlier, very few 'pure' items had found their way into tests; in a sense, Carroll was merely legitimising the existing situation. Less casuistically, it must be observed that attempts to develop more revolutionary integrated tests (Oller, 1971, 1973) have left out of account a crucial element in the original formulation, viz. 'the use of language in ordinary situations'. Both cloze and dictation are fundamentally tests of language competence. Both have their uses in determining the basic level of language proficiency of a given candidate. (More accurately, they enable the level of language proficiency to be assessed relative to that of other people who take exactly the same test under the same conditions.) Oller claims that both test basic language processing mechanisms (analysis by synthesis); both sample a wide range of structural and lexical items in a meaningful context. But neither gives any convincing proof of the candidate's ability to actually use the language, to translate the competence (or lack of it) which he is demonstrating into actual performance 'in ordinary situations', ie actually using the language to read, write, speak or listen in ways and contexts which correspond to real life.

[4] Note that the word 'integrated' is used in different ways by different writers. For some it is possible to conceive of individual items which test integration of various elements of the language; for others the very isolation of separate items means that full integration is not being achieved.

[5] Earlier it was implied that Lado himself very rarely used items of a totally pure kind. See Davies (1978) for an interesting discussion of integrated v. discrete-point testing. Davies argues that they are at different ends of the same continuum rather than in different universes.

Adopting this 'use' criterion might lead us to consider precisely why neither discrete-point nor integrative tests of the type we have considered are able to meet it. Let us look in a rather simple way at some of the features of language use which do not seem to be measured in conventional tests.

Interaction-Based: in the vast majority of cases, language in use is based on an interaction. Even cases such as letter writing, which may seem to be solitary activities, can be considered as weak forms of interaction in that they involve an addressee, whose expectations will be taken into account by the writer. These expectations will affect both the content of the message and the way in which it is expressed. A more characteristic form of interaction, however, is represented by face-to-face oral interaction, which involves not only the modification of expression and content mentioned above but also an amalgam of receptive and productive skills. What is said by a speaker depends crucially on what is said to him.

Unpredictability: the apparently trivial observation that the development of an interaction is unpredictable is in fact extremely significant for the language user. The processing of unpredictable data in real time is a vital aspect of using language.

Context: any use of language will take place in a context, and the language forms which are appropriate will vary in accordance with this context.
Thus a language user must be able to handle appropriacy in terms of:
context of situation, eg physical environment, role/status of participants, attitude/formality
linguistic context, eg textual cohesion

Purpose: a rather obvious feature of communication is that every utterance is made for a purpose. Thus a language user must be able to recognise why a certain remark has been addressed to him, and be able to encode appropriate utterances to achieve his own purposes.

Performance: what Chomsky (1965) described as 'competence', leaving out of account 'such grammatically irrelevant conditions as memory limitations, distractions, shifts of attention and interest, and errors (random or characteristic)', has been the basis of most language tests. Such conditions may or may not be 'grammatically irrelevant', but they certainly exist. To this extent the idealised language presented in listening tests fails to measure the effectiveness of the candidate's strategies for receptive performance. Similarly, the demand for context-free language production fails to measure the extent to which features of the candidate's performance may in fact hamper communication.

Authenticity: a very obvious feature of authentic language should be noted in this context, ie with rare exceptions it is not simplified to take account of the linguistic level of the addressee. Thus measuring the ability of the candidate to, eg read a simplified text tells us nothing about his actual communicative ability, since an important feature of such ability is precisely the capacity to come to terms with what is unknown.

Behaviour-Based: the success or failure of an interaction is judged by its participants on the basis of behavioural outcomes. Strictly speaking no other criteria are relevant. This is an extreme view of the primacy of content over form in language and would probably be criticised by language teachers. Nevertheless, more emphasis needs to be placed in a communicative context on the notion of behaviour. A test of communication must take as its starting point the measurement of what the candidate can actually achieve through language. None of the tests we have considered have set themselves this task.

These then are some of the characteristics of language in use as communication which existing tests fail to measure or to take account of in a systematic way. Let us now turn to an examination of some of the implications of building them into the design specification for language tests.

The Promised Land

We can expect a test of communicative ability to have at least the following characteristics:

1 It will be criterion-referenced against the operational performance of a set of authentic language tasks. In other words it will set out to show whether or not (or how well) the candidate can perform a set of specified activities.

2 It will be crucially concerned to establish its own validity as a measure of those operations it claims to measure. Thus content, construct and predictive validity will be important, but concurrent validity with existing tests will not be necessarily significant.

3 It will rely on modes of assessment which are not directly quantitative, but which are instead qualitative. It may be possible or necessary to convert these into numerical scores, but the process is an indirect one and recognised as such.

4 Reliability, while clearly important, will be subordinate to face validity.
Spurious objectivity will no longer be a prime consideration, although it is recognised that in certain situations test formats which can be assessed mechanically will be advantageous. The limitations of such formats will be clearly spelt out, however.

Designing a test with these characteristics raises a number of interesting issues.

Performance Tests

Asking the question, 'What can this candidate do?' clearly implies a performance-based test. The idea that performance (rather than competence) is a legitimate area of concern for tests is actually quite a novel one and poses a number of problems, chiefly in terms of extrapolation and assessment. If one assesses a candidate's performance in terms of a particular task, what does one learn of his ability to perform other tasks? Unless ways of doing this in some effective way can be found, operational tests which are economical in terms of time are likely to run the risk of being trivial. Problems of assessment are equally fundamental. Performance is by its very nature an integrated phenomenon and any attempt to isolate and test discrete elements of it destroys the essential holism. Therefore a quantitative assessment procedure is necessarily impractical and some form of qualitative assessment must be found. This has obvious implications for reliability.

Given these problems, the question obviously arises as to whether communicative testing does necessarily involve performance tests. This seems to depend on what the purpose of the test is. If the purpose is proficiency testing, ie if one is asking how successful the candidate is likely to be as a user of the language in some general sense, then it seems to be incontrovertible that performance tests are necessary. The reasons for saying this should by now be clear, but at the risk of labouring the point let me re-state the principle that in language use the whole is bigger than the parts. No matter how sophisticated the analysis of the parts, no matter whether the parts are isolated in terms of structures, lexis or functions, it is implausible to derive hard data about actual language performance from tests of control of these parts alone.

However, if the test is to be used for diagnostic purposes rather than proficiency assessment, a rather different set of considerations may apply. In a diagnostic situation it may become important not simply to know the degree of skill which a candidate can bring to the performance of a particular global task, but also to find out precisely which of the communicative skills and elements of the language he has mastered. To the extent that these can be revealed by discrete-point tests and that the deficiencies so revealed might form the input to a teaching programme, this might be information worth having. (The form that such tests might take is discussed in Morrow, 1977.) But one more point must be made. It might be argued that discrete-point tests of the type under discussion are useful as achievement tests, ie to indicate the degree of success in assimilating the content of a language learning programme which is itself based on a communicative (notional) syllabus. This seems to me misguided. As a pedagogic device a notional syllabus may specify the elements which are to be mastered for communicative purposes. But there is little value in assimilating these elements if they cannot be integrated into meaningful language performance. Therefore discrete-point tests are of little worth in this context.
The clear implication of the preceding paragraphs is that by and large it is performance tests which are of most value in a communicative context. The very real problems of extrapolation and assessment raised at the beginning of this section therefore have to be faced. To what extent do they oblige us to compromise our principle?

Let us deal first with extrapolation. A model for the performance of global communicative tasks may show for any task the enabling skills which have to be mobilised to complete it. Such a model is implicit in Munby (1978) and has been refined for testing purposes by B J Carroll (1978). An example of the way this might work is as follows:

Global task: Search text for specific information

Enabling skills, eg:
Distinguish main point from supporting details
Understand text relations through grammatical cohesion devices
Understand relations within sentences
Understand conceptual meaning
Deduce meaning of unfamiliar lexis

The status of these enabling skills vis-à-vis competence:performance is interesting. They may be identified by an analysis of performance in operational terms, and thus they are clearly, ultimately performance-based. But at the same time, their application extends far beyond any one particular instance of performance and in this creativity they reflect an aspect of what is generally understood by competence. In this way they offer a possible approach to the problem of extrapolation. An analysis of the global tasks in terms of which the candidate is to be assessed (see later) will usually yield a fairly consistent set of enabling skills. Assessment of ability in using these skills therefore yields data which are relevant across a broad spectrum of global tasks, and are not limited to a single instance of performance. While assessment based on these skills strictly speaking offends against the performance criterion which we have established, it should be noted that the skills are themselves operational in that they derive from an analysis of task performance. It is important that the difference between discrete-point tests of these enabling skills and discrete-point tests of structural aspects of the language system is appreciated.

Clearly, though, there exists in tests of enabling skills a fundamental weakness which is reminiscent of the problem raised in connection with earlier structural tests, namely the relationship between the whole and the parts. It is conceivable that a candidate may prove quite capable of handling individual enabling skills, and yet prove quite incapable of mobilising them in a use situation or developing appropriate strategies to communicate effectively. Thus we seem to be forced back on tests of performance. A working solution to this problem seems to be the development of tests which measure both overall performance in relation to a specified task, and the strategies and skills which have been used in achieving it. Written and spoken production can be assessed in terms of both these criteria. In task-based tests of listening and reading comprehension, however, it may be rather more difficult to see just how the global task has been completed. For example, in a test based on the global task exemplified above and which has the format of a number of true/false questions which the candidate has to answer by searching through a text, it is rather difficult to assess why a particular answer has been given and to deduce the skills and strategies employed.
In such cases questions focusing on specific enabling skills do seem to be called for in order to provide the basis for convincing extrapolation. If this question of the relationship between performance and the way it is achieved, and the testing strategy which it is legitimate to adopt in order to measure it, seems to have been dealt with at inordinate length in the context of this paper, this reflects my feeling that here is the central distinction between what has gone before and what is now being proposed.

Admitting the necessity for tests of performance immediately raises the problem of assessment. How does one judge production in ways which are not hopelessly subjective, and how does one set receptive tasks appropriate for different levels of language proficiency? The answer seems to lie in the concept of an operational scale of attainment, in which different levels of proficiency are defined in terms of a set of performance criteria. The most interesting work I know of in this area has been carried out by B J Carroll (Carroll, 1977). In this, Carroll distinguishes different levels of performance by matching the candidate's performance with operational specifications which take account of the following parameters:

Complexity - of text which can be handled
Range - of, eg, enabling skills, structures, functions which can be handled
Speed - at which language can be processed
Flexibility - shown in dealing with changes of, eg, topic
Accuracy - with which, eg, enabling skills, structures, functions can be handled
Appropriacy - with which, eg, enabling skills, structures, functions can be handled
Independence - from reference sources and interlocutor
Repetition - in processing text
Hesitation - in processing text

These specifications (despite the difficulties of phrasing them to take account of this in the summary given) are related to both receptive and productive performance. It may well be that these specifications need to be refined in practice, but they seem to offer a way of assessing the quality of performance at different levels in a way which combines face validity with at least potential reliability.

This question of reliability is of course central. As yet there are no published data on the degree of marker reliability which can be achieved using a scheme of this sort, but informal experience suggests that standardisation meetings should enable fairly consistent scorings to be achieved. One important factor is obviously the form which these scores should take and the precise basis on which they should be arrived at. It would be possible to use an analytic system whereby candidates' performance was marked in terms of each of the criteria in turn and these were then totalled to give a score. More attractive (to me at least) is a scheme whereby an overall impression mark is given, with the marker instructed simply to base his impression on the specified criteria. Which of these will work better in practice remains to be seen, but the general point may be made that the first belongs to a quantitative, analytic tradition, the second to a qualitative, synthetic approach.

Content

We have so far considered some of the implications of a performance-based approach to testing, but have avoided the central issue: what performance? The general point to make in this connection is perhaps that there is no general answer. One of the characteristic features of the communicative approach to language teaching is that it obliges us (or enables us) to make assumptions about the types of communication we will equip learners to handle. This applies equally to communicative testing.
This means that there is unlikely to be, in communicative terms, a single overall test of language proficiency. What will be offered are tests of proficiency (at different levels) in terms of specified communicative criteria. There are three important implications in this. First, the concept of pass:fail loses much of its force; every candidate can be assessed in terms of what he can do. Of course some will be able to do more than others, and it may be decided for administrative reasons that a certain level of proficiency is necessary for the awarding of a particular certificate. But because of the operational nature of the test, even low scorers can be shown what they have achieved. Secondly, language performance can be differentially assessed in different communicative areas. The idea of 'profile reporting', whereby a candidate is given different scores on, eg speaking, reading, writing and listening tests, is not new, but it is particularly attractive in an operational context where scores can be related to specific communicative objectives. The third implication is perhaps the most far-reaching. The importance of specifying the communicative criteria in terms of which assessment is being offered means that examining bodies will have to draw up, and probably publish, specifications of the types of operation they intend to test, the content areas to which they will relate and the criteria which will be adopted in assessment. Only if this is done will the test be able to claim to know what it is measuring, and only in this way will the test be able to show meaningfully what a candidate can do.

The design of a communicative test can thus be seen as involving the answers to the following questions:

1 What are the performance operations we wish to test? These are arrived at by considering what sorts of things people actually use language for in the areas in which we are interested.

2 At what level of proficiency will we expect the candidate to perform these operations?

3 What are the enabling skills involved in performing these operations? Do we wish to test control of these separately?

4 What sort of content areas are we going to specify? This will affect both the types of operation and the types of 'text' [6] which are appropriate.

5 What sort of format will we adopt for the questions we set? It must be one which allows for both reliability and face validity as a test of language use.

Conclusion

The only conclusion which is necessary is to say that no conclusion is necessary. The rhetorical question posed by the title is merely rhetoric. After all it matters little if the developments I have tried to outline are actually evolutionary. But my own feeling is that those (eg Davies, 1978) who minimise the differences between different approaches to testing are adopting a viewpoint which is perhaps too comfortable; I think there is some blood to be spilt yet.

[6] Use of the term 'text' may mislead the casual reader into imagining that only the written language is under discussion. In fact the question of text type is relevant to both the written and the spoken language in both receptive and productive terms. In the written mode it is clear that types of text may be specified in terms such as 'genre' and 'topic' as belonging to a certain set in relation to which performance may be assessed; specifying spoken texts may be less easy, since the categories that should be applied in an analysis of types of talking are less well established.
I am at present working in a framework which applies certain macro-functions (eg ideational, directive, interpersonal) to a model of interaction which differentiates between speaker-centred and listener-centred speech. It is hoped that this will allow us to specify clearly enough the different types of talking candidates will be expected to deal with. More problematical is the establishing of different role-relationships in an examination context and the possibility of testing the candidates' production of anything but rather formal stranger:stranger language. Simulation techniques, while widely used for pedagogic purposes, may offend against the authenticity of performance criterion we have established, though it is possible that those who are familiar with them may be able to compensate for this.

BIBLIOGRAPHY

CARROLL, J B. The psychology of language testing. In: DAVIES, A, ed (1968), qv.
CHOMSKY, N. Aspects of the theory of syntax. MIT Press, 1965.
CORDER, S P. Error analysis, interlanguage and second language acquisition. Language teaching and linguistics abstracts, Vol. 8, no. 4, 1975.
DAVIES, A, ed. Language testing symposium. London: Oxford University Press, 1968.
DAVIES, A. Language testing. Language teaching and linguistics abstracts, Vol. 11, nos. 3/4, 1978.
MORROW, K. Techniques of evaluation for a notional syllabus. Royal Society of Arts (mimeo), 1977.
OLLER, J. Dictation as a device for testing foreign language proficiency. English language teaching journal, Vol. 25, no. 3, 1971.
OLLER, J. Cloze tests of second language proficiency and what they measure. Language learning, Vol. 23, no. 1, 1973.
PILLINER, A E G. Subjective and objective testing. In: DAVIES, A, ed (1968), qv.
ROBINSON, P. Oral expression tests. English language teaching, Vol. 25, nos. 2-3, 1973.
SELINKER, L. Interlanguage. International review of applied linguistics, Vol. 10, no. 3, 1972.
SPOLSKY, B. Language testing: art or science? Paper presented at the Fourth AILA International Congress, Stuttgart, 1975.
VALETTE, R M. Modern language testing: a handbook. Harcourt Brace & World, 1967.
WILKINS, D A. Notional syllabuses. Oxford University Press, 1976.

REACTION TO THE MORROW PAPER (1)
Cyril J Weir, Associated Examining Board

Three questions need to be answered by those professing adherence to this 'new wave' in language testing:

1 What is communicative testing?
2 Is it a job worth doing?
3 Is it feasible?

1 What is communicative testing?

There is a compelling need to achieve a wider consensus on the use of terminology in both the testing and teaching of language if epithets such as 'communicative' are to avoid becoming as debased as other terms such as 'structure' have in EFL metalanguage. Effort must be made to establish more explicitly what it is we are referring to, especially in our use of key terms such as 'competence' and 'performance', if we are to be more confident in the claims we make concerning what it is that we are testing.

Canale and Swain (1980) provide us with a useful starting point for a clarification of the terminology necessary for forming a more definite picture of the construct, communicative testing. They take communicative competence to include grammatical competence (knowledge of the rules of grammar), sociolinguistic competence (knowledge of the rules of use and rules of discourse) and strategic competence (knowledge of verbal and non-verbal communication strategies).
In Morrow's paper a further distinction is stressed between communicative competence and communicative performance, the distinguishing feature of the latter being the fact that performance is the realisation of Canale and Swain's (1980) three competences and their interaction '... in the actual production and comprehension of utterances under the general psychological constraints that are unique to performances.' Morrow agrees with Canale and Swain (1980) that communicative language testing must be devoted not only to what the learner knows about the form of the language and about how to use it appropriately in contexts of use (competence), but must also consider the extent to which the learner is actually able to demonstrate this knowledge in a meaningful communicative situation (performance), ie what he can do with the language, or as Rea (1978) puts it, his '... ability to communicate with ease and effect in specified sociolinguistic settings.'

It is held that the performance tasks candidates might be faced with in communicative tests should be representative of the type they might encounter in their own real world situation and would correspond to normal language use, where an integration of communicative skills is required with little time to reflect on or monitor language input and output.

If we accept Morrow's distinction between tests of competence and performance and agree with him that the latter is now a legitimate area for concern in language testing, then this has quite far-reaching ramifications for future testing operations. For if we support the construct of performance-based tests then in future far greater emphasis will be placed on the ability to communicate, and as Rea (1978) points out, language requirements will need to be expressed in functional terms and it will be necessary to provide operationally defined information on a candidate's test proficiency. Morrow raises the interesting possibility that, in view of the importance of specifying the communicative criteria in terms of which assessment is being offered, public examining bodies would have to demonstrate that they know what it is that they are measuring by specifying the types of operation they intend to test, and be able to show meaningfully in their assessment what a candidate could actually do with the language.

Morrow also points out that if the communicative point of view is adopted there would be no one overall test of language proficiency. Language would need to be taught and tested according to the specific needs of the learner, ie in terms of specified communicative criteria. Carroll (1980) makes reference to this: '... different patterns of communication will entail different configurations of language skill mastery and therefore a different course or test content.' Through a system of profile reporting, a learner's performance could be differentially assessed in different communicative areas and the scores related to specific communicative objectives.

2 Is it a job worth doing?

Davies (1978) suggests that by the mid '70s, approaches to testing would seem to fall along a continuum which stretches from 'pure' discrete item tests at one end, to integrative tests such as cloze at the other. He takes the view that in testing, as in teaching, there is a tension between the analytical on the one hand and the integrative on the other. For Davies: '...
the most satisfactory view of language testing and the most useful kinds of language tests, are a combination of these two views, the analytical and the integrative.'

Morrow argues that this view pays insufficient regard to the importance of the productive and receptive processing of discourse arising out of the actual use of language in a social context with all the attendant performance constraints, eg processing in real time, unpredictability, the interaction-based nature of discourse, context, purpose and behavioural outcomes. A similar view is taken by Kelly (1978), who puts forward a convincing argument that if the goal of applied linguistics is seen as the applied analysis of meaning, eg the recognition of the context-specific meaning of an utterance as distinct from its system-giving meaning, then we as applied linguists should be more interested in the development and measurement of the ability to take part in specified communicative performance, the production and comprehension of coherent discourse, rather than in linguistic competence. It is thus not a matter of whether candidates know, eg through summing the number of correct responses to a battery of discrete-point items in such restricted areas as morphology, syntax, lexis and phonology, but rather, to take the case of comprehension, whether they can use this knowledge in combination with other available evidence to recover the writer's or speaker's context-specific meaning. Morrow would seem justified in his view that if we are to assess proficiency, ie potential success in the use of the language in some general sense, it would be more valuable to test for a knowledge of, and an ability to apply, the rules and processes by which these discrete elements are synthesized into an infinite number of grammatical sentences and then selected as being appropriate for a particular context, rather than simply test a knowledge of the elements alone.

In response to a feeling that discrete-point tests were in some ways inadequate indicators of language proficiency, the testing pendulum swung in favour of global tests in the 1970s, an approach to measurement that was in many ways contrary to the allegedly atomistic assumptions of the discrete-point testers. It is claimed by Oller (1979) that global integrative tests such as cloze and dictation go beyond the measurement of a limited part of language competence achieved by discrete-point tests and can measure the ability to integrate disparate language skills in ways which more closely approximate to the actual process of language use. He maintains that provided linguistic tests such as cloze require 'performance' under real-life constraints, eg time, they are at least a guide to aptitude and potential for communication, even if not tests of communication itself.

Kelly (1978) is not entirely satisfied by this argument, and although he admits that, to the extent that:

'... they require testees to operate at many different levels simultaneously, as in authentic communication, global tests of the indirect kind have a greater initial plausibility than discrete items... and certainly more than those items which are both discrete and indirect, such as multiple-choice tests of syntax.'

he argues that:

'only a direct test which simulates as closely as possible authentic communication tasks of interest to the tester can have a first order validity ie one derived from some model of communicative interaction.'
Even if it were decided that indirect tests such as cloze were valid in some sort of derived fashion, it still remains that performing on a cloze test is not the same sort of activity as reading. This is a point taken up by Morrow, who argues that indirect integrative tests, though global in that they require candidates to exhibit simultaneous control over many different aspects of the language system and often of other aspects of verbal interaction as well, do not necessarily measure the ability to communicate in a foreign language. Morrow correctly emphasises that though indirect measures of language abilities claim extremely high standards of reliability and validity as established by statistical techniques, the claim to validity remains suspect.

Morrow's advocacy of more direct, performance-based tests of actual communication has not escaped criticism, though. One argument voiced is that communication is not co-terminous with language and a lot of communication is non-linguistic. In any case, the conditions for actual real-life communication are not replicable in a test situation, which appears to be by necessity artificial and idealised, and, to use Davies's phrase (1978), Morrow is perhaps fruitlessly pursuing 'the chimera of authenticity'. Morrow is also understandably less than explicit with regard to the nature and extent of the behavioural outcomes we might be interested in testing and the enabling skills which contribute to their realisation. Whereas we might come nearer to specifying the latter as our knowledge of the field grows, the possibility of ever specifying 'communicative performance', of developing a grammar of language in use, is surely beyond us given the unbounded nature of the surface realisations.

Reservations must also be expressed concerning Morrow's use of the phrase 'performance tests'. A test which seeks to establish how the learner performs in a single situation, because this is the only situation in which the learner will have to use the target language (a very unlikely state of affairs), could be considered a performance test. A performance test is a test which samples behaviours in a single setting with no intention of generalising beyond that setting. Any other type of test is bound to concern itself with competence, for the very act of generalising beyond the setting actually tested implies some statement about abilities to use and/or knowledge. In view of this it would perhaps be more accurate if, instead of talking in terms of testing performance ability, we merely claimed to be evaluating samples of performance, in certain specific contexts of use created under particular test constraints, for what they could tell us about a candidate's underlying competence.

Though a knowledge of the elements of a language might well be a necessary prerequisite to language use, it is difficult to see how any extension of a structuralist language framework could accommodate the testing of communicative skills in the sense in which Morrow is using the term. Further, a framework such as Lado's might allow us to infer a student's knowledge, which might be adequate, perhaps, for diagnostic/ordering purposes, but is it adequate for predicting the ability of a student to use language in any communicative situation? I do not feel we are yet in a position to give any definite answer to the question 'Is communicative testing a job worth doing?'.
Though I would accept that linguistic competence must be an essential part of communicative competence, the way in which the two relate to each other, or either relates to communicative performance, has in no sense been clearly established by empirical research. There is a good deal of work that needs to be done in comparing results obtained from linguistically based tests with those which sample communicative performance before one can make any positive statements about the former being a sufficient indication of likely ability in the latter or in real-life situations. Before any realistic comparisons are possible, reliable and effective, as well as valid, methods for establishing and testing relevant communicative tasks and enabling skills need to be devised and investigated. This raises the last of the three questions posed at the start of this paper: 'How feasible is communicative testing?'. A satisfactory standard of test reliability is essential because communicative tests, to be considered valid, must first be proven reliable. Rea (1978) argues that simply because tests which assess language as communication cannot automatically claim high standards of reliability in the same way that discrete item tests are able to, this should not be accepted as a justification for continued reliance on measures with very suspect validity. Rather, we should first be attempting to obtain more reliable measures of communicative abilities if we are to make sensible statements about their feasibility.

3 Is it feasible?

Corder (1973) noted: 'The more ambitious we are in testing the communicative competence of a learner, the more administratively costly, subjective and unreliable the results are.' Because communicative tests will involve us to a far greater extent in the assessment of actual written and oral communication, doubts have been expressed concerning time, expenditure, ease of construction, scoring, and requirements in terms of skilled manpower and equipment; in fact, about the practicability of a communicative test in all its manifestations. To add to these problems, we still lack a systematic description of the language code in use in meaningful situations and a comprehensive account of language as a system of communication.

For Kelly (1978) the possibility of devising a construct-valid proficiency test, ie one that measures ability to communicate in the target language, is dependent on the prior existence of '... appropriate objectives for the test to measure.' Advocates of communicative tests seem to be arguing that it is only necessary to select certain representative communication tasks, as we do not use the same language for all possible communication purposes. In the case of proficiency tests, these tasks are seen as inherent in the nature of the communication situation for which candidates are being assessed. Caution, however, would demand that we wait until empirical evidence is available before making such confident statements concerning the identification of these tasks: only by first examining the feasibility of establishing suitable objectives, through research into real people coping with real situations, will we have any basis for investigating the claims that might be made for selecting a representative sample of operational tasks to assess performance ability.
Even if it were possible to establish suitable objectives, ie successfully identify tasks and underlying constituent enabling skills, then we would still have to meet the further criticism that the more authentic the language task we test, the more difficult it is to measure reliably. If, as Morrow suggests, we seek to construct simulated communication tasks which closely resemble those a candidate would face in real life and which make realistic demands on him in terms of language performance behaviours, then we will certainly encounter problems, especially in the areas of extrapolation and assessment.

Kelly (1978) observed that any kind of test is an exercise in sampling, and from this sample an attempt is made to infer students' capabilities in relation to their performance in general:

'That is, of all that a student is expected to know and/or do as a result of his course of study (in an achievement test) or that the position requires (in the case of a proficiency test), a test measures students only on a selected sample. The reliability of a test in this conception is the extent to which the score on the test is a stable indication of candidates' ability in relation to the wider universe of knowledge, performance, etc., that are of interest.'

He points out that even if there is available a clear set of communication tasks:

'... the number of different communication problems a candidate will have to solve in the real world conditions is as great as the permutations and combinations produced by the values of the variables in the sorts of messages, contexts of situation and performance conditions that may be encountered.'

Thus, on the basis of performance on a particular item, one ought to be circumspect, to say the least, in drawing conclusions about a candidate's ability to handle similar communication tasks. In order to make stable predictions of student performance in relation to the indefinitely large universe of tasks, it thus seems necessary to sample candidates' performances on as large a number of tasks as is possible, which conflicts immediately with the demands of test efficiency. The larger the sample, and the more realistic the test items, the longer the test will have to be.

In the case of conventional language tests aimed at measuring mastery of the language code, extrapolation would seem to pose few problems. The grammatical and phonological systems of a language are finite and manageable, and the lexical resources can be delimited. The infinite number of sentences in a language are made up of a finite number of elements, and thus tests of the mastery of these elements are extremely powerful from a predictive point of view. Thus, we might tend to agree with Davies (1978):

'... what remains a convincing argument in favour of linguistic competence tests (both discrete point and integrative) is that grammar is at the core of language learning... Grammar is far more powerful in terms of generalisability than any other language feature.'

However, Kelly (1978) puts forward an interesting argument against this viewpoint. It is not known, for example, how crucial a complete mastery of English verb morphology is to the overall objective of being able to communicate in English, or how serious a disability it is not to know the second conditional. We thus have:

'... no reliable knowledge of the relative functional importance of the various structures in a language.'
Given this failing, it would seem impossible to make any claims about what students should be able to do in a language on the basis of scores on a discrete-point test of syntax. The construct, ability to communicate in the language, involves more than a mere manipulation of certain syntactic patterns with a certain lexical content. In consequence, it seems we still need to devise measuring instruments which can assess communicative ability in some more meaningful way.

As a way out of the extrapolation quandary, Kelly (1978) suggests a two-stage approach to the task of devising a test that represents a possible compromise between the conflicting demands of the criteria of validity, reliability and efficiency: 'The first stage involves the development of a direct test that is maximally valid and reliable, and hence inefficient. The second stage calls for the development of efficient, hence indirect, tests of high validity. The validity of the indirect tests is to be determined by reference to the first battery of direct tasks.'

As far as large-scale proficiency testing is concerned, another suggestion that has been made is that we should focus attention on language use in individual and specified situations while retaining, for purposes of extrapolation, tests of the candidate's ability to handle that aspect of language which obviously is generalisable to all language use situations, namely the grammatical and phonological systems. The hard line Morrow has adopted in the article under consideration makes it unlikely that he would contemplate either of these suggestions; he would continue to argue for the use of pure direct performance-based tests.

Morrow's argument is that a model (as yet unrealised) for the performance of global communicative tasks may show, for any task, the enabling skills which have to be mobilised to complete it. He argues that assessment of ability in using these skills would yield data which are relevant across a broad spectrum of global tasks, and are not limited to a single instance of performance, though in practice these are by no means as easy to specify precisely as he assumes, nor are there any guidelines available for assessing their relative importance for the successful completion of a particular communicative operation, let alone their relative weighting across a spectrum of tasks. He is also aware that there exists in tests of enabling skills a fundamental weakness in the relationship between the whole and the parts, as a candidate may prove quite capable of handling individual enabling skills and yet be incapable of mobilising them in a use situation or developing appropriate strategies to communicate effectively.

In practice it is by no means easy even to identify those enabling skills which might be said together to contribute towards the successful completion of a communicative task. Morrow would appear to assume that we are not only able to establish these enabling skills, but also able to describe the relationship that exists between the part and the whole in a fairly accurate manner (in this case, how 'separate' enabling skills contribute to the communicative task). He would seem to assume that there is a prescribed formula:

possession and use of enabling skills X + Y + Z = successful completion of communicative task

whereas it would seem likely that the added presence of a further skill, or the absence of a named skill, might still result in successful completion of the task in hand.

The second main problem area for Morrow is that of assessment.
Given that performance is an integrated phenomenon, a quantitative assessment procedure would seem to be invalid, so some form of qualitative assessment must be found. This has obvious implications for reliability. A criticism often made is that it is not possible to assess production qualitatively in ways which are not hopelessly subjective. For Morrow, the answer seems to lie in the concept of an operational scale of attainment, in which different levels of proficiency are defined in terms of a set of performance criteria. B J Carroll (op. cit., 1978a and this volume) distinguishes different levels of performance by matching the candidate's performance with operational specifications which take account of parameters such as size, complexity, range, speed, flexibility, accuracy, appropriacy, independence, repetition and hesitation.

Morrow, like Carroll, advocates the use of a qualitative-synthetic approach, a form of banded mark scheme (see Carroll, this volume, for examples of this type of rating scheme) in which an overall impression mark is awarded on the basis of specified criteria, in preference to any analytic scheme. It is quite likely that the operational parameters of B J Carroll (op. cit.), eg size, complexity, range, accuracy, appropriacy, etc., will be subject to amendment in practice and in some cases even omission, but as Morrow argues in the article under review: '... they seem to offer a way of assessing the quality of performance at different levels in a way which combines face validity with at least potential reliability.'

There are no published data on the degree of marker reliability which can be achieved using a scheme of this sort, but Morrow's experience with the new RSA examination and the vast experience of GCE boards in the impression-based marking of essays suggest that standardisation meetings should enable fairly consistent scorings to be achieved, or at least scorings as consistent as those achieved by analytical marking procedures.

Perhaps the point that should be made in answer to the question 'Is it feasible?' is that once again we do not yet know the answer. Until we have actually sought to confront the problems in practice, I feel it would be wrong to condemn communicative testing out of hand. What is needed is empirical research into the feasibility of establishing communicative tests, plus a comparison of the results that can be obtained through these procedures with those that are provided by discrete-point and indirect integrative measures.

BIBLIOGRAPHY

CANALE, M and SWAIN, M Theoretical bases of communicative approaches to second language teaching and testing. Applied linguistics, Vol. 1, no. 1, 1-47, 1980.
CARROLL, B J An English language testing service: specifications. London: British Council, 1978, and this volume.
CARROLL, B J Guidelines for the development of communicative tests. London: Royal Society of Arts, 1978a.
CARROLL, B J Testing communicative performance: an interim study. Pergamon, 1980.
COOPER, R L An elaborated testing model. In: Language learning (special issue) 3: Problems in foreign language testing, 57-72, 1968.
CORDER, S P Introducing applied linguistics. London: Penguin, 1973.
DAVIES, A, ed Language testing symposium: a psycholinguistic approach. London: Oxford University Press, 1968.
DAVIES, A Language testing: survey article. Language teaching and linguistics abstracts, Vol. 11, nos. 3/4; part 1: 145-159; part 2: 215-231; 1978.
FITZPATRICK, R and MORRISON, E J Performance and product evaluation. In: THORNDIKE, R L, ed.
Educational measurement. 2nd ed. Washington, DC: American Council on Education, 1971.
HYMES, D H On communicative competence. In: PRIDE and HOLMES, eds. Sociolinguistics. Harmondsworth: Penguin, 1972, pp 269-293 (excerpts from the paper published 1971 by University of Pennsylvania Press, Philadelphia).
KELLY, R On the construct validation of comprehension tests: an exercise in applied linguistics. PhD thesis, University of Queensland, 1978.
MORROW, K Techniques of evaluation for a notional syllabus. London: Royal Society of Arts, 1977.
MORROW, K Testing: revolution or evolution. In: JOHNSON, K and BRUMFIT, C, eds. The communicative approach to language teaching. London: Oxford University Press, 1979, and this volume.
OLLER, J W Language tests at school. London: Longman, 1979.
REA, P M Assessing language as communication. In: MALS journal (new series: 3). University of Birmingham: Department of English, 1978.
ROYAL SOCIETY OF ARTS Examinations in the communicative use of English as a foreign language: specifications and specimen papers. London: Royal Society of Arts, 1980.
SPOLSKY, B Language testing: the problem of validation. TESOL quarterly, 2, 88-94, 1968.
WIDDOWSON, H G Teaching language as communication. London: Oxford University Press, 1978.

REACTION TO THE MORROW PAPER (2)

Alan Moller, The British Council, London

Morrow's article is an important contribution to the discussion of communicative language testing. Some of the content, however, is marred by a somewhat emotional tone, although Morrow admits at the end that the title is rhetorical. The effect on the reader who is not informed about language testing could be misleading. The case for communicative language testing may well be stated forthrightly and with conviction, but talk of 'revolution' and 'spilling of blood' implies a crusading spirit which is not appropriate. The most traditional forms of language examining, and indeed of examining in most subjects, have been the viva and the dissertation or essay, both basic forms of communication. Reintroduction of these forms of examining, with some modifications, can hardly be termed revolutionary. What is new is the organisation of these traditional tasks. The nature of the task is more clearly specified, there is a more rigorous approach to the assessing of the language produced, and the label given to this process is new. More suitable titles for this discussion might be 'language testing: the communicative dimension' or 'communicative language testing: a re-awakening'. Work in this area is recent and falls within the compass of what Spolsky (1975) termed the psycholinguistic-sociolinguistic phase of language testing. However, it is perhaps time to identify a fourth phase in language testing, closely linked to the third: the sociolinguistic-communicative phase.

As is often the case with discussions of communicative competence, communicative performance, and now communicative testing, no definition is given! But the characteristics identified by Morrow give some indication as to what might be included in definitions. It would seem that the general purpose of communicative tests is to establish first whether communication is taking place and secondly the degree of acceptability of the communication. This implies making judgements on the effectiveness and the quality of the communication observed. The deficiencies of the structuralist method of language teaching and of that phase of language testing are well rehearsed, and Morrow need not have devoted so much space to them.
He was right to point out J B Carroll's (1968) underlining of the integrated skills of listening, speaking, reading and writing. But he has failed to point out that although integrated texts were presented to students, and although students were often asked to produce short integrated texts, the items themselves were normally discrete, focusing on structural or lexical features. While agreeing that the primacy of contrastive analysis as a basis of language tests is no longer acceptable, we must beware of implying or insisting that the primacy of language as communication is the sole basis for language proficiency tests.

Discussions on language testing normally touch on two key questions. Morrow's concern with language as communication and his failure to define communicative language testing ensure that reactions to his article bring these questions to the fore:

1 What is language, and what is language performance?
2 What is to be tested?

In answer to these questions we might propose the following definition of communicative language testing: an assessment of the ability to use one or more of the phonological, syntactic and semantic systems of the language

1 so as to communicate ideas and information to another speaker/reader in such a way that the intended meaning of the message communicated is received and understood, and
2 so as to receive and understand the meaning of a message communicated by another speaker/writer that the speaker/writer intended to convey.

This assessment will involve judging the quality of the message, the quality of the expression and of its transmission, and the quality of its reception in its transmission.

Morrow has commented on discrete item (atomistic) tests and integrated (competence) tests and concluded that neither type 'gives any convincing proof of the candidate's ability to actually use the language'. Seven features of language use 'which do not seem to be measured in conventional tests' are then examined. If by conventional tests is meant discrete item and integrated tests, it is true that certain features may not be measured. It is equally questionable whether some of these features are even measured in so-called communicative tests. Does the measurement of a subject's performance include measuring the purpose of the text, its authenticity or its unpredictability, for example? It would seem to me that the claim being made is that these features are not present in the test task in conventional tests. Even this claim is not entirely accurate. It is helpful to examine the characteristics put forward by Morrow individually.

Purpose of text
The implication that every utterance produced in a communicative test is purposeful may not always hold. In many tests candidates may participate in communication and make statements which fulfil no other purpose than to follow the rules of what is likely to be an artificial situation. There is apparent purpose to the text being uttered, but the text may genuinely be no more purposeful than the texts presented in discrete and integrative test tasks.

Context
There are few items, even in discrete item tests, that are devoid of context. Communicative tests may attempt to make the context more plausible.

Performance
Performance is not wholly absent from integrative tests, although it may be limited. Perhaps what is meant is production.
Interaction
Many conventional reading and listening tests are not based on interaction between the candidate and another speaker/hearer, but the candidate does interact with the text both in cloze and in dictation.

Authenticity
This notion has been questioned elsewhere by Davies (1980) and seems to me to need careful definition. Language generated in a communicative test may be authentic only insofar as it is authentic to the context of a language test. It may be no more authentic, in the sense of resembling real-life communication outside the test room, than many a reading comprehension passage.

Unpredictability
It is certain that unpredictability can occur naturally and can be built into tests of oral interaction. This feature would seem to be accounted for most satisfactorily in communicative language tests, as would certain behaviour as the outcome of communicative test tasks.

Thus there are only two features of language use which are likely to occur only in communicative language tests. The absence or presence of the seven characteristics in different types of test is shown more clearly in the table below. Column D refers to discrete item testing, column I to integrative tests and column C to communicative tests. Absence of a characteristic is indicated by x and presence by ✓.

There is, however, an important difference in the role of the candidate in the various kinds of tests. In the discrete and integrative tests the candidate is an outsider. The text of the test is imposed on him. He has to respond and interact in the ways set down. But in communicative performance tests the candidate is an insider, acting in and shaping the communication, producing the text together with the person with whom he is interacting.

Characteristics       D     I             C
Purpose of text       x     ✓             ✓
Context              (✓)    ✓             ✓
Performance           x     ✓ (limited)   ✓
Interaction           x     ✓             ✓
Authenticity          ?     ?             ?
Unpredictability      x     x             ✓
Behaviour-based       x     x             ✓

There may be little new in the subject's actual performance in communicative language tests. The main differences between traditional (pre-scientific) and communicative tests will lie more in the content of the tests and the way in which student performance is assessed. The content of the tests will be specified in terms of linguistic tasks and not in terms of linguistic items. Tests will be constructed in accordance with specifications and not simply to conform to formats of previous tests. Criteria for assessment will also be specified, to replace simple numerical or grading scales which frequently do not make it clear what the points on the scale stand for. Certain criteria at different levels of performance will be worked out incorporating agreed parameters. These criteria may well take the form of a set of descriptions.

Another way of comparing communicative language testing with other types of tests is by considering the relative importance of the roles of the test constructor, the subject (or candidate) and the assessor in each of the phases of language testing identified by Spolsky: the pre-scientific, the psychometric-structuralist, and the psycholinguistic-sociolinguistic (competence) phases. The table below summarises these roles. The type of test is given on the left, column T refers to the role of the test constructor, column S to the role of the student, and column A to the role of the assessor. A ✓ indicates the importance of the role, (✓) indicates minor importance, and ( ) no importance.
Test type                            T      S      A
Pre-scientific                      (✓)     ✓      ✓
Psychometric-structuralist           ✓     (✓)    ( )
Psycholinguistic-sociolinguistic    (✓)     ✓      ✓

Role relationships: insider-insider; professional-professional; professional-non-professional; consultant-client; adult-adult; junior-senior; senior-junior; equal-equal; leader-follower; man/woman-man/woman; student-student; customer-server; member of public-official; advisee-adviser; guest-host; friend-friend.

Instrumentality

Spec. 4            P1. Business                    P2. Agriculture
Medium             Listening                       as P1
                   Speaking
                   Reading
                   Writing
Mode               Monologue                       as P1
                   Dialogue
                   (spoken and written to be heard or read;
                   sometimes to be spoken as if not written)
Channel            Face-to-face                    Face-to-face
                   Print                           Print
                   Tape                            Film

Spec. 5 Dialect
All sections: Understand British Standard English dialect. Produce acceptable regional version of Standard English accent.

Spec. 6 Target Level (in the 4 media for each section)

                        P1. Business      P2. Agriculture
Dimensions (max = 7)    L  Sp  R  Wr      L  Sp  R  Wr
Size                    6   3  7   3      2   1  7   3
Complexity              7   4  6   5      2   1  6   3
Range                   5   4  5   5      2   1  4   2
Delicacy                5   5  6   6      1   1  5   3
Speed                   6   4  5   6      3   2  5   3
Flexibility             5   5  3   3      1   1  2   1

Tolerance conditions (max = 5)
Error                   3   4  3   3      4   5  1   2
Style                   4   4  5   4      5   5  4   4
Reference               3   4  2   2      5   5  3   3
Repetition              3   4  2   3      5   5  5   3
Hesitation              3   4  4   3      4   5  3   3

                   P3. Social         P4. Engineering    P5. Technician     P6. Medicine
Medium             as P1              as P1              as P1              as P1
Mode               as P1              as P1              as P1              as P1
Channel            Face-to-face       Face-to-face       Face-to-face       Face-to-face
                   Telephone          Print              Telephone          Telephone
                   Print              Film               Radio              Print
                   Public address     Pictorial          Print              Radio
                                      Mathematical       Tape recorder      TV
                                                         Disc               Tape recorder
                                                                            Film

                        P3. Social        P4. Engineering   P5. Technician    P6. Medicine
Dimensions (max = 7)    L  Sp  R  Wr      L  Sp  R  Wr      L  Sp  R  Wr      L  Sp  R  Wr
Size                    4   3  4   1      6   3  7   3      6   4  5   3      6   5  6   4
Complexity              4   3  4   1      6   5  6   5      6   3  5   3      6   4  6   4
Range                   7   3  5   1      5   4  6   4      6   5  6   3      6   4  6   4
Delicacy                4   4  4   1      6   4  6   5      6   5  6   3      6   5  6   5
Speed                   6   4  4   1      6   3  4   4      6   3  5   2      5   4  5   4
Flexibility             6   4  4   1      5   3  4   3      3   2  1   1      6   5  6   4

Tolerance conditions (max = 5)
Error                   3   4  3   5      1   3  3   2      4   4  3   4      3   4  3   4
Style                   4   4  4   5      2   3  3   3      5   5  5   5      3   3  3   3
Reference               2   2  5   3      5   4  5   5      6   5  5   5      3   3  4   4
Repetition              2   3  5   4      3   4  3   5      5   5  5   5      4   3  4   3
Hesitation              2   3  4   4      4   5  4   4      3   4  3   3      3   3  4   4

Spec. 7 Events/Activities

P1. Business
1 Lectures: Listen for overall comprehension; Make notes; Ask for clarification
2 Seminars/Tutorials: Discuss given topics; Listen for comprehension; Make notes; Ask for clarification
3 Reference Study: Intensive reading; Reading for main infm; Assignment rdg; Assessment rdg
4 Writing Reports: Sort out information; Factual writing; Evaluative writing

P2. Agriculture
1 Reference Study: Intensive for all infm; Specific assignments; Evaluative reading; Main infm rdg
2 Current Literature: Routine check; Keep abreast; For main information
3 English lessons: Test study; Teacher exposition; Group work
4 Other (Note: English is not much used in this Spanish context, outside the study area)

P3. Social
1 Official discussions: Reading forms; Complete documents; Discuss with officials
2 Social in Britain: Personal information; Invitations; Mealtime conversation; Complaints; Polite conversation
3 Places of Interest: Reading text for infm; Entrance/tickets; Guidebooks; Listen to commentary; Ask for information
4 Shopping: Attract attention; Discuss goods; Give choice; Arr payment; Complaints