Introduction to Learner Corpus Research
Université catholique de Louvain
2021
Fanny Meunier
Summary
This paper provides an introduction to learner corpus research. It discusses the key features of learner corpora, including their authenticity, communicative tasks, and contextual conditions. The author explores the different approaches and methods used in learner corpus research, including corpus-based and corpus-driven approaches.
Full Transcript
Chapter · January 2021. See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/347356234
All content following this page was uploaded by Fanny Meunier on 21 December 2020.

3 Introduction to Learner Corpus Research

Fanny Meunier

Introduction

Work in Learner Corpus Research (LCR) started around the 1980s as “an offshoot of corpus linguistics” (Granger et al., 2015, p. 1). Corpus linguistics and LCR share a set of common features, among which is the use of corpora and corpus tools to analyze language. A corpus is defined by McEnery et al. (2006, p. 5) as a “collection of machine-readable authentic texts (including transcripts of spoken data) which is sampled to be representative of a particular language or language variety”. A learner corpus is thus a specific type of corpus which, to follow up on McEnery et al.’s definition, can broadly be defined as a collection of machine-readable texts consisting of representative samples of the language written and/or spoken by learners of an additional language (viz. not their mother tongue, but a foreign/second/nth target language). LCR uses learner corpus data as its main data source. The results of learner corpus studies typically serve two main purposes: informing second language acquisition (SLA) research and providing useful input for applied projects (including the creation or improvement of teaching materials/approaches, or the training/development of Natural Language Processing tools).

A comparison of the oft-cited definitions of LCR (see Gilquin, 2015 for more details) reveals that one of the key features of learner corpora is that the language they contain is meant to be as authentic as possible and is often defined as (near-)natural.
As explained by Granger (2008, p. 337), “the term near-natural is used to highlight the ‘need for data that reflects as closely as possible ‘natural’ language use (i.e. language that is situationally and interactionally authentic) while recognizing that the limitations facing the collection of such data often obligate researchers to resort to clinically elicited data (for example, by using pedagogic tasks (Ellis & Barkhuizen, 2005, p. 7))”. As cases of purely spontaneous oral or written learners’ productions are rare¹ – or, when they take place, cannot easily be ‘spontaneously collected’ – pedagogic tasks serve as the main prompts to (near-)natural learner language productions.

Another key feature is that the texts² included in learner corpora have been selected on the basis of a number of criteria or variables related to, among others:
- the learners themselves (e.g. target language, mother tongue, proficiency level),
- the type of communicative task (e.g. written/oral communication, descriptive/persuasive/narrative/expository writing, informal/formal level),
- the contextual conditions of language production or task setting (e.g. interactive tasks, computer-mediated communication, use of reference tools or not).

The criteria/variables listed above are typically used as metadata to organize the electronic storage of the data in large databases that can later be queried. Researchers can, for instance, select sub-sections of the data collected (e.g. only spoken texts produced by female learners of German as a foreign language, at a lower beginner level, collected during an informal discussion). The variables can later serve as dependent/independent/predictive variables in the linguistic analyses carried out (see section 3 for more details).

The learners’ initial productions (often called ‘raw’ texts) are often further annotated to enable researchers to access richly annotated data.
The texts can be annotated automatically with the help of natural language processing tools, edited with the help of semi-automated tools, or annotated fully manually. Some examples of typical linguistic annotations include:
- automatic part-of-speech tagging: each word in the corpus is attributed a part-of-speech (noun, verb, adjective, etc.) thanks to the help of fully automatic part-of-speech tagging software (see Chapters 5 and 6, this volume, for more information);
- computer-aided error annotation (CEA): as learner corpora are produced by learners, some researchers may be interested in spotting areas of difficulty that learners have in producing an L2. Annotating these aspects makes it possible to subsequently focus on them to help foster learners’ proficiency. Errors/infelicities in the corpus are first spotted by researchers who then use an editor to insert codes in the corpus (e.g. a plural determiner followed by a singular common noun can receive a ‘noun number’ error code). More details on CEA can be found in Chapter 7.

As the two examples provided above show, annotations may include fully automatic tools (such as part-of-speech taggers, semantic taggers, or syntactic parsers), but also semi-automatic annotation tools requiring human intervention before the analysis can be done. Some annotations can also be done fully manually by researchers (by inserting codes in the text using text processing tools, for example) when the analysis cannot be automated, as would be the case for the inclusion of non-verbal comments in transcripts of videoed interactions ([laughs], [unfilled pauses], [gestures], [contextual comments], etc.). This type of tagging is often referred to as problem-oriented annotation/tagging, viz. the manual annotation by the researcher of any feature of interest.
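To make the role of metadata and error codes more concrete, the following minimal Python sketch shows how a sub-section of a corpus might be selected on metadata and how inserted error codes might then be counted. Everything in it is an assumption made for illustration: the `LearnerText` fields, the inline `<err code="...">form|correction</err>` format, and the codes `NOUN_NUM` and `ARTICLE` are hypothetical and do not reproduce any existing tagset or tool.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class LearnerText:
    """One learner production plus the metadata recorded at collection time."""
    text_id: str
    target_language: str
    l1: str
    proficiency: str
    mode: str   # "written" or "spoken"
    text: str   # text with (hypothetical) inline error tags


# Hypothetical inline format: <err code="CODE">learner form|suggested correction</err>
ERROR_TAG = re.compile(r'<err code="([A-Z_]+)">(.*?)\|(.*?)</err>')


def select(corpus, **criteria):
    """Keep only the texts whose metadata match every given criterion."""
    return [t for t in corpus
            if all(getattr(t, field) == value for field, value in criteria.items())]


def error_counts(texts):
    """Count how often each error code occurs in the given texts."""
    counts = Counter()
    for t in texts:
        for code, _form, _correction in ERROR_TAG.findall(t.text):
            counts[code] += 1
    return counts


# Invented toy data, for illustration only.
corpus = [
    LearnerText("t1", "English", "French", "B1", "written",
                'I read <err code="NOUN_NUM">these book|these books</err> last year.'),
    LearnerText("t2", "English", "German", "B1", "written",
                'We saw <err code="NOUN_NUM">many film|many films</err> and '
                '<err code="ARTICLE">the nature|nature</err> was nice.'),
    LearnerText("t3", "German", "English", "A1", "spoken",
                "Ich habe zwei Bruder."),
]

written_english = select(corpus, target_language="English", mode="written")
print(error_counts(written_english))
```

Keeping the metadata in explicit fields is what allows the same variables to be reused later as grouping or predictor variables in an analysis.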
A last key feature is that learner corpora, like any other type of corpus, can be queried using corpus tools such as concordancers (see Chapter 6, this volume), which can be used to:
- extract word lists, word combinations, tags, keywords or annotations,
- display occurrences of words/phrases/tags in the selected corpus,
- compare different subcorpora in terms of keywords, frequency distribution of items, etc.

Given the space limitations of an introductory chapter, it is not really possible to describe in detail all the tools that can be used to annotate or query learner corpora. I thus warmly recommend the Tools for Corpus Linguistics webpage to readers (see https://corpus-analysis.com/) as it offers an impressive list of 228 corpus tools, each described in terms of the following aspects: name, description, categories, platform, and pricing, plus a link to the tool. I also refer readers to the software index of the Handbook of Learner Corpus Research (Granger et al., 2015) as it contains a list of over 80 tools (annotation tools, DDL tools, statistical packages, text retrieval tools, iCALL and CALL packages, etc.) whose concrete use, description and illustration can be found in the handbook.

As can be seen from the list of key features above, technology is clearly part and parcel of LCR. Thanks to giant strides in computer technology in the last quarter of the 20th century, it became possible to collect data from much larger cohorts of learners and to use computer software to assist researchers in the annotation and analysis of the data. The affordances of technology also made it possible to perform data analyses that were either previously not conceivable or, at least, not feasible in a reasonable timeframe and at a reasonable cost. As the second section of this chapter will show, LCR has evolved remarkably through its three decades of existence.
Just as learners typically go through stages of development in their learning of an additional language, LCR also evolved from a novice field (filled with the excitement that usually goes with novelty) towards more competent and reflective practices. This evolution has impacted most of the core features and issues in LCR, as will be shown in the next section.

Core Issues

Size, Collection, Variables, and Analysis: Limits and Strengths

Access to large (for the time) electronic learner corpora in the 1990s led to a revolution in the way learner language was analyzed and described. The first learner corpora that exceeded one million word tokens in size were collected, and the new options offered by automatic corpus analysis tools (word lists ordered in decreasing order of frequency, retrieval of words in contexts through concordancers, automatic part-of-speech tagging, etc.) offered unprecedented insights into learner language. The urge to get access to previously inaccessible frequency information led to a ‘descriptive fever’ (analysis of productions by numerous learners; lists of the top n words in a corpus; frequency of errors; most frequently used verbs; overused and/or underused linguistic items, often in relation to an L1 corpus, etc.). The term ‘fever’ is not used here in any derogatory way but simply points to the focus of interest at the time, even if cautionary tales were already given. Granger (1994, p. 27) warned readers that “quantitative data should not be regarded as an end in itself” but rather “as a springboard for a qualitative investigation of the data” and of its patterns of use. Such cautionary tales notwithstanding, it must be acknowledged that numerous publications back then were essentially descriptive with frequency lists being provided and compared, with – in many cases – no clear reference to SLA theories, except for the sometimes simplistic reference to transfer.
This led some researchers to consider learner corpus linguistics as synonymous with distributional number crunching, which – despite the limitations mentioned above – also constituted an unfair shortcut. Granger (2009) responded to criticism levelled against LCR and the lack of collaboration between LCR and SLA by pointing out that one of the main assets of the former is that it brings to the SLA field a much wider empirical basis than previously available. She also explained that learner corpora which have been collected on the basis of strict, well-described criteria and which have been stored in easily queryable databases contain data from hundreds and sometimes thousands of learners, which greatly enhances the representativeness of the data. It also makes controlling the many variables that affect learner production possible.

Over the years, practices in LCR have also evolved significantly, moving from a focus on one main variable (mother tongue background) to studies analyzing the effects of and/or relationships between a much wider range of variables. Examples include planning time (Ädel, 2008), time of exposure/learning (Meunier & Littré, 2013), genre (Gentil & Meunier, 2018), or a combination of variables such as learning context and emotional aspects (De Smet et al., 2018).

Overall, the initial criticism levelled against LCR – be it fair or not – proved very fruitful as it prompted learner corpus researchers to explicitly verbalize the numerous advantages of LCR and move the field further. Gries (2009, p. 2), for instance, argued that corpus linguistic methods are “a method just as acceptability judgments, experimental data, etc.” are and that “linguists of every theoretical persuasion can use corpus data”. He also explained that usage-based cognitive-linguistic theories are particularly compatible with corpus linguistics methods, thereby throwing the spotlight on some of the specific strengths of LCR.
The constant questioning and reassessment of LCR led to a more reflective and competent practice in LCR, also prompting the collection of a much larger variety of learner corpus types, which subsequently opened up new avenues for analysis. Whilst the first learner corpora were mainly targeting written L2 English by relatively advanced learners, typically university students, a much larger range of target languages and text types has since been collected. The ‘Learner corpora around the world’ webpage³ maintained by the Centre for English Corpus Linguistics at the Université catholique de Louvain pays tribute to this variety of target languages (Arabic, French, German, Korean, Spanish, etc.), text types and production conditions (exam essays, argumentative and literary essays, letters, diaries, picture descriptions, book reviews, monologues, dialogues, computer-mediated communication, mails, translations, etc.). Other welcome advances have been made in terms of:
- proficiency levels (covering the whole range of proficiency levels, from beginners to advanced) and types of learners (children, teenagers, adults, non-native ‘learners’ but also non-native ‘users’ including teachers, heritage speakers, translators, etc.);
- variety of research designs (cross-sectional, quasi-longitudinal, longitudinal).

The publication of the first handbook of Learner Corpus Research (Granger et al., 2015) and the launch of the first journal entirely devoted to LCR, the International Journal of Learner Corpus Research (IJLCR), pay tribute to the variety of current LCR studies addressing areas as diverse as interdisciplinarity (Callies & Paquot, 2015), linguistic innovations and creativity in non-native Englishes (Deshors et al., 2018), and study quality (Paquot & Plonsky, 2017).

Other developments include the use of more complex statistical techniques to interpret quantitative data (see e.g. Gries, 2013) and the popularization of mixed-methods designs to complement LCR methods and studies (see Gilquin & Gries, 2009; Meunier & Littré, 2013). One of the limitations of LCR is that some of the language features studied by researchers may not naturally occur frequently enough in unconstrained, open-ended (semi-)authentic production. The collection and analysis of other data types to triangulate research results and offer converging or diverging evidence is then particularly useful. Such data types may include experimental data, questionnaires, semi-guided interviews, think-aloud protocols or ethnographic approaches (also see Chapter 10, this volume).

The (Native Speaker) Norm/Myth?

Native corpus studies have demonstrated their added value in making it possible to compare different varieties of the same language, both synchronically and diachronically, and in providing a more balanced/refined description of languages. For instance, books like Brief Grammar for English (attributed to William Bullokar in 1586 and which aimed to show that English was as rule-bound as Latin) were replaced by thick and detailed accounts like the Longman Grammar of Spoken and Written English (Biber et al., 1999) where the grammatical specificities of various text types/registers were minutely described.

The power of corpus data for comparing different language varieties is also a central asset of LCR. As Granger (2015, p. 8) explains, two types of comparison appeared to be particularly worthwhile in LCR:
- a comparison with native language (NL), seen as the ultimate attainment of learning a foreign/second language;
- a comparison of one sample of learner language (IL, for interlanguage) with other samples of learner language, particularly from learners with different mother tongue backgrounds, for example, E2F (the English produced by learners with French as an L1) vs E2G (the English produced by learners with German as an L1) in Figure 3.1.

This double entry approach to LCR, conceptualized by Granger in 1996, was labelled Contrastive Interlanguage Analysis, or CIA.

Whilst the IL vs IL approach has always been promoted and accepted, “CIA has been subjected to a range of criticism, most targeted at the L1/L2 branch” (Granger, 2015, p. 13), which prompted a new version of the methodology, abbreviated as CIA² (see Figure 3.2 for a visual representation). Put briefly, the reference to native speaker language was interpreted as the recognition of one idealized native speaker norm and even labelled as “imperialistic assumptions about the ownership of English” (Tan, 2005, p. 128). This was an unfair criticism, according to Granger (2015, p. 15), as plenty of L1 standards (such as British, American, Australian, Canadian, Hong Kong, India, Singapore, Sri Lanka, etc.) have been used as reference corpora for CIA studies. In CIA², new terms have been proposed to avoid misunderstandings: RLV (for Reference Language Varieties) and ILV (for Interlanguage Varieties). The use of RLV points to the large number of different reference points against which learner data can be set (inner circle varieties such as British or American English, outer circle varieties such as Indian or Singapore English), as well as corpora of competent L2 user data (e.g. English as a Lingua Franca).
As for ILVs, they refer to learner language varieties, given the “highly variable nature of interlanguage” (Granger 2015, p. 18). Comparing an ILV with an RLV makes it possible to better understand the processes at play in the acquisition of that specific ILV. A comparison of various ILVs (e.g. learners of L2 English whose mother tongues are French, Dutch, Italian, Greek, or Finnish) can help detect potential universal paths of acquisition versus L1 induced phenomena. A comparison of the language produced in various modes, genres or registers by the same learners (e.g. written and oral productions by the same learners) can help researchers discover mode/genre/register specific features (also see Chapter 8, this volume, for more details on comparing learner corpora).

Figure 3.1 Contrastive Interlanguage Analysis (Granger 1996). Labels in the figure: CIA; NL vs IL; IL vs IL; E1, E2, E2F, E2G, E2S, E2J.

Figure 3.2 CIA² (Granger 2015, p. 17)

Applied Perspectives in LCR: The Continuous/Contextualized Text Paradox in LCR

As explained in the introductory section, the applied perspectives of LCR are numerous. Mukherjee (2009, p. 212) states that “[l]earner corpus analyses always, at least implicitly, raise the question of what the language-pedagogical implications and applications might be” and numerous publications have addressed the links between (learner) corpora and pedagogy (see for instance Burnard & McEnery, 2000; Granger et al., 2002b; Granger, 2008; Aijmer, 2009; Meunier, 2010). Learner corpora have been used to inform lexicography, syllabus design, materials design, computer-aided language learning and pedagogical approaches such as data-driven learning. Some learners’ dictionaries (e.g. the Longman Dictionary of Contemporary English (2009) or the Cambridge Advanced Learner’s Dictionary (2008)) contain error notes intended to help learners avoid common mistakes.
Learner corpora have also been used by textbook writers to inform the design of tasks addressing typical problems that learners face (see the error correction/rewriting exercises in the Grammar and Beyond textbook series⁴). Some large-scale initiatives like the English Profile Project (see http://www.englishprofile.org/ for more details and related publications) rely on learner corpus data to help teachers and educators understand what aspects of English are typically learned/acquired at each level of the Common European Framework of Reference for Languages (Council of Europe, 2001). Learner corpora can also be used to create data-driven learning activities, i.e. activities that use corpora and concordances (typically keywords presented in their context of production) so that learners can work as language researchers in awareness-raising activities. Learners can check specific patterns of the use of keywords in native speaker texts and then compare that use to learners’ productions.

Despite numerous publications on the pedagogical value of learner corpora, a lack of uptake of corpus-informed pedagogy has been noted (Granger, 2009; McCarthy, 2008; Shirato & Stapleton, 2007; Römer, 2009; Wilson, 2013; Meunier, 2018). Besides the technical problems that are often put forward, another reason that may explain the lack of uptake of corpus-informed pedagogy can be found in what I would label the contextual/continuous text paradox in LCR. Proponents of (learner) corpus studies lay strong emphasis on the fact that corpus data is unique in that it contains continuous stretches of discourse (not single words, phrases or sentences) and consists of contextualized data (i.e. data not produced in isolation but in the context of a meaningful, set task). And yet, when it comes to pedagogical applications, the use of learner corpus data rarely goes beyond the sentence level.
This paradox would surely need to be addressed in the future to pay better tribute to the uniqueness of corpus data and maybe also ensure a clearer understanding on the part of learners and/or teachers of the usefulness of corpus-informed pedagogy (see Section 6 for some suggestions).

Main Research Methods

As some aspects related to research methods also constitute core issues in LCR, they have been addressed in the previous section (this includes contrastive interlanguage analysis and its evolution over time, as well as a number of issues related to annotation and corpus analysis tools). This third section will thus focus on only two aspects: the basic types of corpus approaches and the three main study designs typically used in LCR.

Corpus-based and Corpus-driven Approaches

Two basic approaches can be used to analyze a (learner) corpus. The corpus-based approach uses corpora as a source of information to explore a theory or hypothesis, aiming to validate it, refute it, or refine it. One concrete example would be the study of grammatical variation in terms of dative alternation, as speakers have a choice between the prepositional dative construction (e.g. give something to someone) and the double object construction (e.g. give someone something). SLA studies put forward different hypotheses when it comes to the dative alternation in L2 English (result of lexicalized verbal preferences, order of acquisition of the two constructions, etc.). As explained by Jäschke (2016, p. 19), “very few studies explored whether the learners’ use and judgments of the two variants are governed by the same linguistic factors which have been found to be predictive for English native speakers”. A corpus-based approach can be used in such cases to explore learners’ actual use of dative constructions (as was done by Deshors, 2014).
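As a rough illustration of a first retrieval step in such a corpus-based study, the sketch below builds a very small keyword-in-context (KWIC) display for forms of give, which a researcher could then classify manually as prepositional dative or double object. It is a toy example under stated assumptions: the function name `kwic`, the pattern `GIVE_FORMS`, and the sample sentences are invented here, and the sketch does not reproduce the procedure of any study cited above.

```python
import re


def kwic(text, pattern, width=30):
    """Return (left context, match, right context) for every regex match in text."""
    hits = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


# Assumed search pattern: inflected forms of the verb "give".
GIVE_FORMS = r"\b(?:give|gives|gave|given|giving)\b"

# Invented sample text, for illustration only.
sample = "She gave the book to her teacher. Then the teacher gave her a new task."

for left, node, right in kwic(sample, GIVE_FORMS):
    # Right-align the left context so that the search word lines up in a column.
    print(f"{left:>30} [{node}] {right}")
```

Retrieval by word form only finds candidate occurrences; deciding which construction each hit instantiates, and coding the factors that may govern the choice, remains an analytical step for the researcher.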
In corpus-driven approaches, the corpus is viewed as a source of inspiration to formulate hypotheses about language (Tognini-Bonelli, 2001, pp. 84-85). “The role of the researcher is to formulate questions and to draw conclusions derived from what corpus data reveal when subjected to statistical analysis rather than using the data to test a research hypothesis by approaching a corpus with a number of preconceived ideas” (Callies, 2015, p. 36). One concrete example of such an approach can be found in Belz and Vyatkina (2008) who investigated the pedagogical application of a learner corpus study in language teaching and in the developmental analysis of language learning in an instructed setting. The authors used L1 German data as a baseline against which learner German data was compared. Using a corpus-driven approach (thanks to a careful qualitative – usage in context – follow-up analysis of frequency lists), they spotted learners’ emerging use of some focal features. These included the use of fixed and creative constructions of the German modal particles ja, denn, doch, and mal. Such studies contribute to second language acquisition research via dense documentation of micro-changes in learners’ language use over time and to the formation of new hypotheses for future research.

Cross-sectional, Quasi-longitudinal, and Longitudinal Research Designs

Earlier studies in LCR were mostly cross-sectional, which means that they examined the language behavior of a group or groups of language learners at a single point in their development. Those studies usually compared one ILV with one or more other ILVs or with an RLV. With a view to addressing developmental paths in SLA, researchers decided to carry out a comparison of cross-sectional studies of different groups of learners at different developmental stages, thereby adopting what Huat (2012, p. 197) calls a pseudo-longitudinal approach.
The learners’ productions are not from the same learners, hence the use of the ‘pseudo’ prefix, and the ‘time’ variable is measured by proxies such as age or proficiency level. In such pseudo-longitudinal designs, researchers compare several groups of learners at different levels of proficiency. Cross-sectional and pseudo-longitudinal designs do not allow for the analysis of individual development. Individual variation within each group or sub-group can however be analyzed – as can group development in pseudo-longitudinal designs.

Longitudinal study designs, in contrast to the two previous types, follow the same individual(s) over time. Longitudinal research is defined as ‘emphasizing the study of change and containing at minimum three repeated observations on at least one of the substantive constructs of interest’ (Ployhart & Vandenberg, 2010, p. 97). As explained in Meunier (2015), the collection of longitudinal learner corpus data is time-consuming and costly, and the analysis can only start when the entire data collection is over. Other issues include attrition (i.e. the sometimes significant number of participants dropping out before data-collection points). Such obstacles probably account for the scarcity of longitudinal studies in the early days of LCR.

In longitudinal study designs, group progress, individual variation within groups and individual trajectories can be analyzed. This requires the use of, for instance, multi-level modelling – also referred to as hierarchical linear modelling or mixed-effects models (see Raudenbush & Bryk, 2002; Baayen et al., 2008; Cunnings, 2012; Gries, 2015). “Multi-level modelling allows a variety of predictors to be analyzed, with ‘time’ being a key predictor in longitudinal studies: do participants become more proficient as time goes by and, if so, how strong is the effect of time?
Such statistical modelling can be applied to individuals within groups as well as to individuals as individuals, by analyzing both endpoints and trajectories” (Meunier, 2015, p. 382).

With the benefit of hindsight, it can be argued that there has been a true qualitative evolution over time in the research methods used in LCR. The field broadly evolved from the descriptive analysis of aggregate data in cross-sectional designs to the use of inferential statistics and a focus on intra- and inter-learner variability in more complex types of designs (including longitudinal studies and mixed-methods approaches).

Representative Corpora and Research

The International Corpus of Learner English (ICLE) is probably the exemplar of first-generation learner corpora. It has been and is still being used massively by learner corpus researchers. Its first edition (Granger et al., 2002a) – resulting from ten years of international collaboration between numerous universities – contained 2.5 million words of English (mostly argumentative essays by university students of English) written by learners from eleven different mother tongue backgrounds and was released in CD-ROM format including an interface to compile tailor-made subcorpora on the basis of a set of predefined learner or task variables. The second version (Granger et al., 2009) differed from the first one in scope (larger amount and greater diversity of the learner data included) and in functionalities. It included a built-in concordancer and direct links between learner profile information and search results. A third extended and web-based version of ICLE (Granger et al., 2020) will soon be available. Ädel’s (2008) study, presented below, is based on ICLE v1.

Ädel (2008) examines variables related to task effects on language use. The research question addresses how the variable of ‘task setting’ (time and reference sources available) affects the learners’ writing styles on the written/spoken continuum.
She uses the concepts of ‘involvement’ and ‘detachment’ typically used in variationist corpus-based approaches to language, with informal speech typically characterized as involved (first-person reference, emphatic particles, etc.) and formal writing as detached (passive constructions, inanimate subjects, etc.). The learners were university students of English with L1 Swedish who wrote argumentative and expository essays for the Swedish subcorpus of the International Corpus of Learner English (SWICLE; Granger et al., 2009) and the Uppsala Student English Corpus (USE; Axelsson, 2000). The overall results of the study show that learners exhibit more involvement features in timed than in untimed essays, but fewer if they have access to source texts. In addition to a possible lack of register awareness, the study reveals that the extreme rate of involvement found in SWICLE is rather linked to both the lack of time that writers have to make the text more written-like and the lack of model texts to rely on.

The Longitudinal Database of Learner English (LONGDALE) (Meunier, 2016) is one representative example of truly longitudinal learner corpora. It currently contains data collected by five teams: Radboud University Nijmegen (the Netherlands), University of Hannover (Germany), University of Louvain (Belgium), University of Padua (Italy), and University Paris-Diderot (France). The same students are followed over a period of at least three years and data collections are organized at least once per year, with some teams organizing several data collections per year. The term ‘database’ (rather than ‘corpus’) has been used from the outset of the project as LONGDALE includes a wide range of data types including argumentative essays, narratives, and informal interviews, but also more guided types of productions (such as picture descriptions). Experimental data is also included for some of the subcorpora.
The database also includes comprehensive learner profile information which is gathered during each data collection. The study presented below focuses on the acquisition of phonology/pronunciation.

Méli (2013) analyzes the segmental realizations of French learners of English with a view to checking whether ‘perceived dissimilarity’ is a hindrance or an advantage for the L2 acquisition of sounds that do not exist in the learner’s L1. He focuses on the realizations of the interdental fricative, as well as some of the phonemic vowel length asymmetries for vowels, {/i/ in French, /i/-/i:/ in English} and {/u/ in French, /u/-/u:/ in English}. Eighteen students were recorded longitudinally over three years (the data from years two and four are used in the present study). The acoustic characteristics of some features of native speech were compared – using the PRAAT software (Boersma, 2001) – to learners’ realizations of the same sounds using the Bark Difference Metric method⁵. The analysis of the interdental fricatives stresses the importance of phonotactics (i.e. the syntax of phonemes) and of lexical frequency. It also mentions possible ‘islands of reliability’ (for expressions such as I think) which might help oral production in that learners may use these as ‘buying time’ devices or structuring features. The analysis also indicates different learning patterns for some sets of phonemes (with /u/~/u:/ being acquired later than /i/~/i:/). An analysis of the perception and the categorization of the phonemic realizations by learners themselves was also carried out. The paper tests how accurately the data found is predicted by known Second Language Acquisition (SLA) theoretical frameworks such as Flege’s Speech Learning Model (1995). The results of the study show that the assumptions fail to predict differences in learning patterns.

The last corpus presented in this section is rather innovative.
The Multilingual, Traditional, Immersion, and Native Corpus (MulTINCo: Meunier, Hendrikx, Bulon, Van Goethem, & Naets, 2020) includes both learner and native data types. It contains: learner data for two target languages (Dutch and English); learners’ spoken and written longitudinal data collected in two different educational settings (Content and Language Integrated Learning – CLIL – and traditional foreign language classes); data produced by the same learners in their L1 (on similar task types); comparable native data from native speakers of the learners’ L2 of about the same age; and a variety of background variables (age, gender, home language, amount of L2 curricular and extracurricular input, etc.).

Van Mensel et al. (2020) explore the impact of formal and informal input on learners’ variability in writing. The study compares two target language conditions (Dutch and English) in two different instructed settings, namely Content and Language Integrated Learning (CLIL) and traditional foreign language learning classes (non-CLIL) in French-speaking Belgium. The study is part of a large project whose main objective is to investigate the influence of CLIL – and other educational, motivational, and cognitive factors – on the acquisition of a foreign language. Over 900 French-speaking primary and secondary school pupils learning English or Dutch in CLIL and non-CLIL settings were followed longitudinally for two consecutive school years (2015-2016 and 2016-2017) and various data types were collected. Using regression models to check whether the type and amount of input that learners are exposed to (see Note 6) correlate with proficiency levels, the study shows that CLIL is a significant predictor of L2 outcomes for both target languages, but that the relative impact of formal and informal input differs depending on the target language.
The results also highlight the importance of the L2 status in research on CLIL, because different L2s can yield different results.

Future Directions

As illustrated in section 2, LCR has constantly questioned its role, methods, and goals, and has, as a result, evolved remarkably over the last 30 years. It is almost impossible to accurately predict what lies ahead of us in, say, the next 30 years, but I have identified two promising areas.

The first one is related to the very status of LCR, which has always been considered as a product-oriented approach and which actually has the potential to combine both process and product orientations in the future. Mäntylä et al. (2018), for instance, show how the use of keystroke logging software (Strömqvist et al., 2006) can not only help researchers better understand the writing process but also – and perhaps even more importantly here – lead to a reconsideration of what is actually perceived or stored in the learner’s mind as a formulaic sequence (Wray, 2002). Previous LCR research on formulaic language focused on ‘learner-external’ sequences (viz. the linguistic patterns produced). The recording of keyboard activities during the writing process on computers gives researchers unprecedented access to ‘learner-internal’ patterns (Myles & Cordier, 2017). A careful analysis of the pauses between words, for instance, can reveal difficulties in accessing a formulaic sequence but also the fact that the string of words considered as a formulaic sequence on the basis of learner-external patterns may not have psycholinguistic reality in the learner’s internally stored holistic lexicon (Durrant, 2013). Studies like this one are only the first steps towards studies that integrate both product- and process-oriented approaches.
New technologies and digital tools make it possible to record processing ‘moves’ which will, hopefully, be integrated in LCR in the future and hence help researchers revisit some of the LCR findings from a new perspective.

The second promising area is the interest in – and need for – collecting more interactive data types. Whilst some learner corpora already include samples of language in interaction (such as the Telekorp: Belz & Vyatkina, 2008), they still constitute a minority of the data types collected. The interactive nature of communication is being increasingly stressed in SLA circles, with a focus on ecological approaches (Kramsch & Vork Steffensen, 2008; Thorne, 2013) and multilingualism (May, 2013; Ortega, 2013). As explained by Van Lier (2010, p. 2), “Ecological approaches focus primarily on the quality of learning opportunities, of classroom interaction and of educational experience in general. Important pedagogical principles in an ecological approach are the creation of ecologically valid contexts, relationships, agency, motivation and identity”. Such an approach calls for more attention to be paid to the ecological value of tasks given to learners. As for the multilingual turn, it puts multilingualism at the forefront, thereby opening up new avenues for intrinsically multilingual corpora where, for instance, learners with different mother tongues can interact and translanguage. Instead of the rather homogeneous corpora of L2‘x’ with only one ‘x’, one could collect learner corpora of L2‘xs’. This focus on interaction is also found in pedagogical circles where official curricular documents clearly distinguish the features of spoken and written competences with or without interaction and also insist on the non-verbal strategies that are key to interactive competence.
Learner corpora like the Giessen-Long Beach Chaplin Corpus (GLBCC: Jucker, Müller, & Smith, 2006) or the Multimedia Adult ESL Learner Corpus (MAELC: Reder, Harris, & Setzler, 2003) should inspire future learner corpus collections. The GLBCC consists of transcribed interactions between native English, ESL, and EFL speakers. As for MAELC, it contains videotaped classroom interactions associated with written materials (copies of classroom written materials, student work, teacher logs, and teacher reflections). The corpus includes materials from four years of adult ESL classes ranging from beginning to upper-intermediate proficiency with over 3600 hours of classroom interaction recorded by six cameras and multiple microphones. The corpus has been partly coded for participation pattern and activity, and portions of these classes have been transcribed, targeting student language during pair work. As explained on the MAELC website, “examinations of dyadic interaction can focus on interactions between students from different first language backgrounds as well as on developmental studies of individual students who are recorded throughout several terms of study”.

It is also of primary importance to reach out and collect data from less favored groups of learners (such as migrants) in order for LCR to be representative of all types of learners both in formal (instructed) and informal (non-instructed) contexts.

Further Reading

Granger, S., Gilquin, G., & Meunier, F. (Eds.) (2015). The Cambridge handbook of learner corpus research. Cambridge: Cambridge University Press.
This volume is the first handbook entirely devoted to LCR. It offers a detailed overview of the field and of the affordances of learner corpora.

Paquot, M., & Granger, S. (2012). Formulaic language in learner corpora. Annual Review of Applied Linguistics, 32, 130–149.
This article focusses on the formulaic/phraseological nature of language, one of the key aspects of language that corpus-linguistic methodology has helped reveal.

Fuchs, R., & Werner, V. (Eds.) (2018). Tense and aspect in second language acquisition and learner corpus research [Special Issue]. International Journal of Learner Corpus Research, 4(2), 143–163.
This edited volume presents five studies addressing a topic that has received much attention in SLA, viz. tense and aspect. It provides a fresh LCR perspective on tense and aspect issues.

Related Topics

Chapters 2, 6, 7, 8, 9, and 11.

Notes

1 Whilst instances of informal interactions may be more likely, few learners spontaneously decide to write an argumentative or literary essay.
2 Corpus data are ideally continuous (i.e. consisting of longer stretches of discourse, not single words, phrases or sentences) and contextualized (i.e. not produced in isolation but in the context of a meaningful, set task).
3 Centre for English Corpus Linguistics (date of access 24th September 2018): Learner Corpora around the World. Louvain-la-Neuve: Université Catholique de Louvain. https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html
4 See https://www.cambridge.org/us/cambridgeenglish/catalog/english-academic-purposes/grammar-and-beyond/
5 Put simply, the Bark Difference Metric Method improves acoustic measurements by making it possible to filter out physiological differences in pronunciation while retaining sociolinguistic differences.
6 Computed thanks to a proxy gathering various types of information collected through questionnaires on input type and frequency.

References

Ädel, A. (2008). Involvement features in writing: Do time and interaction trump register awareness? In G. Gilquin, S. Papp, & M. B. Díez-Bedmar (Eds.), Linking up contrastive and learner corpus research (pp. 35–53). Amsterdam: Rodopi.
Aijmer, K. (2009). Corpora and language teaching.
Amsterdam: John Benjamins.
Axelsson, M. W. (2000). USE – The Uppsala student English corpus: An instrument for needs analysis. ICAME Journal, 24, 155–157.
Baayen, H., Davidson, D., & Bates, D. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412.
Belz, J., & Vyatkina, N. (2008). The pedagogical mediation of a developmental learner corpus for classroom-based language instruction. Language Learning and Technology, 12(3), 33–52.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Harlow: Pearson Education Limited.
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9/10), 341–345.
Burnard, L., & McEnery, T. (Eds.) (2000). Rethinking language pedagogy from a corpus perspective: Papers from the third international conference on teaching and language corpora. Frankfurt: Peter Lang Publishing.
Callies, M. (2015). Learner corpus methodology. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 35–56). Cambridge: Cambridge University Press. doi:10.1017/CBO9781139649414.003
Callies, M., & Paquot, M. (2015). Learner corpus research: An interdisciplinary field on the move. International Journal of Learner Corpus Research, 1(1), 1–6. doi:10.1075/ijlcr.1.1.00edi
McIntosh, C. Cambridge Advanced Learner's Dictionary. (2008). Cambridge: Cambridge University Press.
Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.
Cunnings, I. (2012). An overview of mixed-effects statistical models for second language researchers. Second Language Research, 28(3), 369–382.
Deshors, S. C. (2014). A case for a unified treatment of EFL and ESL: A multifactorial approach. English World-Wide, 35(3), 277–305.
Deshors, S. C., Götz, S., & Laporte, S. (Eds.) (2018).
Rethinking linguistic creativity in non-native Englishes (Volume 98). John Benjamins Publishing Company.
De Smet, A., Mettewie, L., Galand, B., Hiligsmann, P., & Van Mensel, L. (2018). Classroom anxiety and enjoyment in CLIL and non-CLIL: Does the target language matter? Studies in Second Language Learning and Teaching, 8(1), 47–71. doi:10.14746/ssllt.2018.8.1.3
Durrant, P. (2013). Formulaicity in an agglutinating language: The case of Turkish. Corpus Linguistics and Linguistic Theory, 9(1), 1–38.
Ellis, R., & Barkhuizen, G. (2005). Analysing learner language. Oxford: Oxford University Press.
Flege, J. E. (1995). Second-language speech learning: Theory, findings and problems. In W. Strange (Ed.), Speech perception and linguistic experience: Theoretical and methodological issues (pp. 229–273). Timonium: York Press.
Fuchs, R., & Werner, V. (Eds.) (2018). Tense and aspect in second language acquisition and learner corpus research [Special Issue]. International Journal of Learner Corpus Research, 4(2), 143–163.
Gentil, G., & Meunier, F. (2018). A systemic functional linguistic approach to usage-based research and instruction: The case of nominalization in L2 academic writing. In A. E. Tyler, L. Ortega, M. Uno, & H. I. Park (Eds.), Usage-inspired L2 instruction. Researched pedagogy (pp. 267–289). Amsterdam: John Benjamins.
Gilquin, G. (2015). From design to collection of learner corpora. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 9–34). Cambridge: Cambridge University Press.
Gilquin, G., & Gries, S. (2009). Corpora and experimental methods: A state-of-the-art review. In G. Gilquin (Ed.), Corpora and experimental methods [Special Issue]. Corpus Linguistics and Linguistic Theory, 5(1), 1–26.
Granger, S. (1994). The learner corpus: A revolution in applied linguistics. English Today, 10(3), 25–33. doi:10.1017/S0266078400007665
Granger, S. (1996).
From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora. In K. Aijmer, B. Altenberg, & M. Johansson (Eds.), Languages in contrast: Text-based cross-linguistic studies. Lund studies in English (vol. 88, pp. 37–51). Lund: Lund University Press.
Granger, S. (2008). Learner corpora in foreign language education. In N. Van Deusen-Scholl & N. H. Hornberger (Eds.), Encyclopedia of language and education (vol. 4, pp. 337–351). Boston: Springer.
Granger, S. (2009). The contribution of learner corpora to second language acquisition and foreign language teaching: A critical evaluation. In K. Aijmer (Ed.), Corpora and language teaching (pp. 13–32). Amsterdam: John Benjamins.
Granger, S. (2015). Contrastive interlanguage analysis: A reappraisal. International Journal of Learner Corpus Research, 1(1), 7–24. doi:10.1075/ijlcr.1.1.01gra
Granger, S., Dagneaux, E., & Meunier, F. (2002a). International corpus of learner English. Handbook and CD-ROM. Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S., Dagneaux, E., Meunier, F., & Paquot, M. (2009). International corpus of learner English. Version 2 (Handbook + CD-ROM). Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S., Dupont, M., Meunier, F., Naets, H., & Paquot, M. (2020). International corpus of learner English. Version 3 (Handbook + web interface). Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S., Gilquin, G., & Meunier, F. (Eds.) (2015). The Cambridge handbook of learner corpus research. Cambridge: Cambridge University Press.
Granger, S., Hung, J., & Petch-Tyson, S. (Eds.) (2002b). Computer learner corpora, second language acquisition and foreign language teaching. Amsterdam: John Benjamins.
Gries, S. (2009). What is corpus linguistics? Language and Linguistics Compass, 3(5), 1225–1241. doi:10.1111/j.1749-818X.2009.00149.x
Gries, S. (2013). Statistical tests for the analysis of learner corpus data.
In A. Díaz-Negrillo, N. Ballier, & P. Thompson (Eds.), Automatic treatment and analysis of learner corpus data (pp. 287–310). Amsterdam: John Benjamins.
Gries, S. (2015). Statistics for learner corpus research. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 159–182). Cambridge: Cambridge University Press. doi:10.1017/CBO9781139649414.008
Huat, C. M. (2012). Learner corpora and second language acquisition. In K. Hyland, C. M. Huat, & M. Handford (Eds.), Corpus applications in applied linguistics (pp. 191–207). London: Continuum.
Jäschke, K. (2016). The dative alternation in English as a second language. PhD dissertation. Düsseldorf: Heinrich-Heine-Universität. Retrieved from https://d-nb.info/1135382433/34
Jucker, A., Müller, S., & Smith, S. (2006). GLBCC (Giessen - Long Beach Chaplin Corpus). Oxford text archive. Retrieved from http://hdl.handle.net/20.500.12024/2506. See also http://ota.ox.ac.uk/desc/2506.
Kramsch, C., & Vork Steffensen, S. (2008). Ecological perspectives on second language acquisition and socialization. In N. H. Hornberger (Ed.), Encyclopedia of language and education (pp. 2595–2606). Boston: Springer.
Longman Dictionary of Contemporary English (Fifth edition). (2009). Harlow: Pearson Education Limited.
Mäntylä, K., Lahtinen, S., Vaakanainen, V., & Mäkilä, M. (2018). Using keystroke logging to analyse the writing process – tools for teaching writing. EuroCALL 2018 (book of abstracts, p. 27), Jyväskylä, August 22.
May, S. (Ed.) (2013). The multilingual turn: Implications for SLA, TESOL, and bilingual education. London: Routledge.
McCarthy, M. (2008). Accessing and interpreting corpus information in the teacher education context. Language Teaching, 41(4), 563–574.
McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge.
Méli, A. (2013). Phonological acquisition in the French–English interlanguage: Rising above the phoneme. In A.
Díaz-Negrillo, N. Ballier, & P. Thompson (Eds.), Automatic treatment and analysis of learner corpus data (pp. 207–226). Amsterdam: John Benjamins.
Meunier, F. (2010). Learner corpora and English language teaching: Checkup time. Anglistik: International Journal of English Studies, 21(1), 209–220.
Meunier, F. (2015). Developmental patterns in learner corpora. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 379–400). Cambridge: Cambridge University Press.
Meunier, F. (2016). Introduction to the LONGDALE project. In E. Castello, K. Ackerley, & F. Coccetta (Eds.), Studies in learner corpus linguistics: Research and applications for foreign language teaching and assessment (pp. 123–126). Berlin: Peter Lang Publishing. Retrieved from https://uclouvain.be/en/research-institutes/ilc/cecl/longdale.html
Meunier, F. (2018). Promoting TPACK and professional learning communities: Focus on teaching and learning multiword units. EuroCALL, Jyväskylä, August 23. doi:10.13140/RG.2.2.26823.14244. Retrieved from https://www.researchgate.net/publication/327237628_Promoting_TPACK_and_professional_learning_communities_focus_on_teaching_and_learning_multiword_units_EuroCALL_conference_paper_Jyvaskyla_Finland_23_August_2018
Meunier, F., Hendrikx, I., Bulon, A., Van Goethem, K., & Naets, H. (2020). MulTINCo: Multilingual traditional immersion and native corpus. Better-documented multi-literacy practices for more refined SLA studies. International Journal of Bilingual Education and Bilingualism. doi:10.1080/13670050.2020.1786494
Meunier, F., & Littré, D. (2013). Tracking learners’ progress: Adopting a dual ‘corpus cum experimental data’ approach. The Modern Language Journal, 97(1), 61–76.
Mukherjee, J. (2009). The grammar of conversation in advanced spoken learner English: Learner corpus data and language-pedagogical implications. In K. Aijmer (Ed.), Corpora and language teaching (pp. 203–230). Amsterdam: John Benjamins.
Myles, F., & Cordier, C. (2017). Formulaic sequences (FS) cannot be an umbrella term in SLA: Focusing on psycholinguistic FSs and their identification. Studies in Second Language Acquisition, 39(1), 3–28. doi:10.1017/S027226311600036X
Ortega, L. (2013). SLA for the 21st century: Disciplinary progress, transdisciplinary relevance, and the bi/multilingual turn. Language Learning, 63(1), 1–24.
Paquot, M., & Granger, S. (2012). Formulaic language in learner corpora. Annual Review of Applied Linguistics, 32, 130–149.
Paquot, M., & Plonsky, L. (2017). Quantitative research methods and study quality in learner corpus research. International Journal of Learner Corpus Research, 3(1), 61–94.
Ployhart, R. E., & Vandenberg, R. J. (2010). Longitudinal research: The theory, design, and analysis of change. Journal of Management, 36(1), 94–120.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks: Sage Publications, Inc.
Reder, S., Harris, K., & Setzler, K. (2003). A multimedia adult learner corpus. TESOL Quarterly, 37(3), 65–78. Retrieved from http://www.labschool.pdx.edu/maelc_access.html.
Römer, U. (2009). Corpus research and practice: What help do teachers need and what can we offer? In K. Aijmer (Ed.), Corpora and language teaching (pp. 83–98). Amsterdam: John Benjamins.
Shirato, J., & Stapleton, P. (2007). Comparing English vocabulary in a spoken learner corpus with a native speaker corpus: Pedagogical implications arising from an empirical study in Japan. Language Teaching Research, 1(4), 393–412.
Strömqvist, S., Holmqvist, K., Johansson, V., Karlsson, H., & Wengelin, Å. (2006). What keystroke logging can reveal about writing. In K. P. H. Sullivan & E. Lindgren (Eds.), Computer keystroke logging and writing: Methods and applications (pp. 45–71). Amsterdam: Elsevier.
Tan, M. (2005). Authentic language or language errors? Lessons from a learner corpus. ELT Journal, 59(2), 126–134.
Thorne, S.
(2013). Language learning, ecological validity, and innovation under conditions of superdiversity. Bellaterra Journal of Teaching and Learning Language and Literature, 6(2), 1–27.
Tognini-Bonelli, E. (2001). Corpus linguistics at work. Amsterdam: John Benjamins.
Van Lier, T. (2010). The ecology of language learning: Practice to theory, theory to practice. Social and Behavioral Sciences, 3, 2–6.
Van Mensel, L., Bulon, A., Hendrikx, I., Meunier, F., & Van Goethem, K. (2020). Effects of input on L2 writing skills in English and Dutch: CLIL and non-CLIL learners in French-speaking Belgium. Journal of Immersion and Content-Based Language Education.
Wilson, J. (2013). Technology, pedagogy and promotion. How can we make the most of corpora and data-driven learning (DDL) in language learning and teaching? The Higher Education Academy. Retrieved from https://www.heacademy.ac.uk/system/files/corpus_technology_pedagogy_promotion2.pdf
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.