Document Details

Uploaded by AwedGauss2256

Universität Regensburg

Dr Thorsten Brato

Tags

corpus linguistics, English linguistics, language analysis, introduction to linguistics

Summary

These lecture notes introduce corpus linguistics. After a brief recap of multilingualism and models of World Englishes, they cover what a corpus is, the major English corpora and their design, concordancing with AntConc, and part-of-speech tagging.

Full Transcript

Corpus Linguistics (1)
Dr Thorsten Brato
Department of English and American Studies
VL Introduction to English Linguistics: English in Use

Recap

Multilingualism is a common phenomenon around the world
  Diglossia: two languages are used in a speech community for different functions
  Code-switching: changing between languages according to situation
  Code-mixing: changing between languages within an exchange
English around the world
  Anglosphere (UK, Ireland, USA, Canada, Australia, New Zealand)
  Former colonial (L2) varieties in Africa, Asia, the Caribbean and Oceania
Modelling World Englishes
  Kachru’s Three Circles Model
  Schneider’s Dynamic Model
Pidgins and creoles are typologically unrelated languages that developed through language contact
  Simplified structure
  English based pidgins and creoles are found in West Africa, the Caribbean, the US and Central America and the Pacific

Today's lecture

1 Introduction
2 Corpora
3 Concordancing
4 Part-of-speech Tagging

1 Introduction

Corpus linguistics

"Corpus linguistics is a scientific method of language analysis. It requires the analyst to provide empirical evidence in the form of data drawn from language corpora in support of any statement made about language." (Brezina 2018: 2)

First corpus – the BROWN Corpus – was released in 1967
At 1 million words it was by far the largest source of authentic linguistic data
Today, corpus linguistics has become the quasi-standard in many linguistic analyses
Some of the widely used corpora consist of several billion words

Why is corpus linguistics useful?

1. Language Trends Analysis: Detect changes in language use over time
   Identification of new words or phrases trending in social media
2. Language Teaching: Develop authentic teaching materials based on real-life language use
   Creating ESL textbooks using high-frequency phrases
3. Lexicography: Enhance dictionary definitions with usage information
   Oxford English Dictionary's use of corpus to determine word frequency
4. AI & NLP: Improve computational models of language understanding
   Training AI models like Siri or Alexa to understand human language more accurately
5. Discourse Analysis: Uncover patterns and structures in spoken/written text
   Analysis of political speeches to identify common rhetoric strategies
6. Linguistic Hypotheses Testing: Validate or refute theories about language usage
   Investigating whether 'literally' is used more figuratively in modern English

What is a corpus?

Surprisingly, despite this long history [of corpus analysis; TB], there is no standard definition of a “corpus” in introductory corpus linguistics textbooks[.]… A “corpus” is:
  [a collection of electronic texts] built according to explicit design criteria for a specific purpose (Atkins, Clear, & Ostler 1992: 1)
  a large and principled collection of natural texts [which is] not simply a collection of texts [but rather it additionally] seeks to represent a language or some part of a language (Biber, Conrad, & Reppen 1998: 246)
  a body of written text or transcribed speech which can serve as a basis for linguistic analysis and description (Kennedy 1998: 1)
  a collection of (1) machine readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety (McEnery, Xiao, & Tono 2006: 5)
  some set of machine-readable texts which is deemed an appropriate basis on which to study a specific set of research questions. The set of texts or corpus dealt with is usually of a size which defies analysis by hand and eye alone within any reasonable timeframe. (McEnery & Hardie 2012: 1-2)
(Egbert et al. 2022: 2-3)

  a collection of texts based on a set of design criteria, one of which is that the corpus aims to be representative (Cheng 2011: 3)
  a representative collection of language that can be used to make statements about language use...
  A corpus is a collection of a fairly large number of examples (or, in corpus terms texts) that share similar contextual or situational characteristics. (Crawford & Csomay 2015: 6)
  an electronically available collection of texts or transcripts of audio recordings which is sampled to represent a certain language, language variety, or other linguistic domain (Kübler & Zinsmeister 2015: 4)
  a collection of spoken or written texts to be used for linguistic analyses and based on a specific set of design criteria influenced by its purpose and scope... any collection of texts that has been systematically assembled in order to investigate one or more linguistic phenomena (Weisser 2016: 13)
  a collection of written texts or transcripts of spoken language that can be searched by a computer using specialized software. A corpus usually represents a sample of language, i.e. a (small) subset of the language production of interest (Brezina 2018: 6)
(Egbert et al. 2022: 2-3)

What is a corpus (continued)?

A corpus is or consists of:
  a collection/sample/body/set
  texts
  representative
  electronic/machine-readable
  principled/designed
  large
  a collection that represents a language or domain
  a collection that enables investigations about language phenomena
  natural/authentic
Key concepts included in the definition of "corpus." Each represents a definition from one source. (Egbert et al. 2022: 3)

Difference between text and corpus

Text                                 Corpus
Read whole                           Read fragmented
Read horizontally                    Read vertically
Read for content                     Read for formal patterning
Read as a unique event               Read for repeated events
Read as an individual act of will    Read as a sample of social practice
Instance of parole                   Gives insight into langue
Coherent communicative event         Not a coherent communicative event
(Bonelli 2010: 19)

Corpus structure

The structure and level of detail of annotation and mark-up in a corpus varies greatly
All corpora will consist of transcripts of written and spoken data, either
  As a (collection of) file(s), e.g. for download and to be used with special software
  Through an online interface
Most corpora are accompanied by a manual or guidelines
  Providing metadata, e.g. on the author, genre, title, etc.
  Corpus design
Most corpora (these days) will have some kind of mark-up, e.g.
  Sentence IDs
  Grammatical annotations (POS tagging, parsing)
  Metadata annotations
  Corrections
  …

A corpus is designed to be used with and by a computer – it is not meant to be read by a human
This shows most clearly when we look at examples of what actual corpus files look like:
  BROWN
  Old Bailey Corpus
  ICE Ghana

From raw data to corpus file

TB 1967-?-? Accra ?????
FESTIVAL
THE people of Otuam in the Ekumfi State of the Central Region will begin their annual Akwambo-Ayerye festival on Tuesday, September 12. Nana Amoa Fanyin IV, divisional chief of the state who is also known in private life as Mr. J.K. Rockson, is expected to preside over a grand durbar on Tuesday. The celebration also marks the first anniversary of Nana Fanyin's installation.

2 Corpora

Overview

There exist hundreds of corpora on different aspects of (the English) language
They can be roughly divided based on these criteria
  Mode (written – spoken – both)
  Time depth (synchronic – diachronic – historical)
  Specificness (many genres, text types, topics and/or varieties – a particular type of data)
  Updates (static – dynamic)
Comprehensive lists of corpora are available online
  Martin Weisser’s Corpus-Based Linguistics Links: http://martinweisser.org/corpora_site/CBLLinks.html
  University of Helsinki’s Corpus Finder: https://varieng.helsinki.fi/CoRD/corpora/corpusfinder/index.html
  UCLouvain’s Learner corpora around the world: https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html

The BROWN family

Approx. 1m words of written American English from 1961
500 samples of about 2,000 words
8 major genres
  Press
  Religion
  Skills and hobbies
  Popular lore
  Belles-lettres
  Miscellaneous
  Learned
  Humour
Several corpora follow the BROWN* design and thus allow for diachronic and/or cross-variety analyses, e.g.
  London-Oslo-Bergen (LOB)*: British English, 1961
  Freiburg-Brown (FROWN)*: American English, 1991
  Freiburg-LOB (FLOB)*: British English, 1991
  BLOB: British English, 1931
  The Kolhapur Corpus of Indian English, 1978
  The Wellington Corpus of Written New Zealand English, 1986-1990*
  The Australian Corpus of English (ACE), 1986
  Phil-BROWN: Philippine English, 1961
*Corpora available at UR (see https://elearning.uni-regensburg.de/course/view.php?id=57501 for details)

The International Corpus of English (ICE)

The International Corpus of English (ICE) is a collection of 1m word corpora
  contains written (400,000 words), e.g.
    Student essays
    News Reports
    Business letters
  and spoken (600,000 words) data, e.g.
    Conversations
    Legal cross-examinations
    Broadcast news
Currently available for 14 varieties of English
About a dozen more are currently compiled
https://www.ice-corpora.uzh.ch/en.html
Varieties:
  Canada
  East Africa (Kenya and Tanzania)
  Great Britain
  Hong Kong
  Gibraltar
  India
  Ireland
  Jamaica
  New Zealand
  Nigeria
  Singapore
  Sri Lanka
  The Philippines
  Trinidad and Tobago
  Uganda
  USA (written only)

The International Corpus of English (ICE) Design: Spoken

Dialogues (180)
  Private (100)
    Face-to-face conversations (90)
    Phonecalls (10)
  Public (80)
    Classroom Lessons (20)
    Broadcast Discussions (20)
    Broadcast Interviews (10)
    Parliamentary Debates (10)
    Legal cross-examinations (10)
    Business Transactions (10)
Monologues (120)
  Unscripted (70)
    Spontaneous commentaries (20)
    Unscripted Speeches (30)
    Demonstrations (10)
    Legal Presentations (10)
  Scripted (50)
    Broadcast News (20)
    Broadcast Talks (20)
    Non-broadcast Talks (10)

The International Corpus of English (ICE) Design: Written

Non-printed (50)
  Student Writing (20)
    Student Essays (10)
    Exam Scripts (10)
  Letters (30)
    Social Letters (15)
    Business Letters (15)
Printed (150)
  Academic writing (40)
    Humanities (10)
    Social Sciences (10)
    Natural Sciences (10)
    Technology (10)
  Popular writing (40)
    Humanities (10)
    Social Sciences (10)
    Natural Sciences (10)
    Technology (10)
  Reportage (20)
    Press news reports (20)
  Instructional writing (20)
    Administrative Writing (10)
    Skills/hobbies (10)
  Persuasive writing (10)
    Press editorials (10)
  Creative writing (20)
    Novels & short stories (20)

COCA and COHA

Corpus of Contemporary American English (COCA)
  1 billion words
  Largest representative corpus of American English
  1990-2019
  8 major categories with fine-grained subdivisions, e.g.
    Spoken
    Fiction
    Academic journals
    TV/Movie subtitles
  https://www.english-corpora.org/coca/
Corpus of Historical American English (COHA)
  475 million words
  Largest structured corpus of historical English
  1820s-2010s
  5 categories
    TV/Movies
    Fiction
    Popular magazines
    Newspapers
    Non-fiction books
  https://www.english-corpora.org/coha/

GloWbE and NOW

Global Web-Based English (GloWbE)
  1.9 billion words
  20 countries
    6 ‘traditional’
    8 Asia
    5 Africa
    1 Caribbean
  2012-2013
  Websites
    General
    Blogs
  https://www.english-corpora.org/glowbe/
News on the Web (NOW)
  20.3 billion words (November 2024)
  Largest monitor corpus of English
  20 countries
  2010-yesterday
  https://www.english-corpora.org/now/

BNC and ICLE

British National Corpus (BNC)
  100 million words (90m written, 10m spoken)
  Largest balanced corpus of British English
  1980-1993
  7 major categories and fine-grained subdivisions, e.g.
    Broadcasts
    Lectures
    Fiction
    Newspapers
  https://www.english-corpora.org/BNC/
  2014 update
International Corpus of Learner English (ICLE)
  3.7 million words (Version 2)
  Essay writing by upper intermediate and advanced learners
  16 countries
  2009
  https://uclouvain.be/en/research-institutes/ilc/cecl/icle.html

3 Concordancing

Introduction

"Concordancing is an analysis technique that allows linguists to investigate the occurrences and behaviour of different word forms in real-life contexts, that is, in situations where they have actually been used by native or non-native speakers." (Weisser 2016: 67)

The traditional approaches relied on the intuition of the native speaker

"Essentially, a concordance is a listing of individual word forms in a given specific context, where the exact nature of the context depends on the requirements of the analysis and which particular program one may be using. The word context here refers […] not [to] the situational usage in a particular place and time, but instead the immediately surrounding text, something we can also refer to as co-text in case of ambiguity." (Weisser 2016: 67)

AntConc

Key Word in Context (KWIC)
  Bit of a misnomer – we can search anything from individual letters to larger constructions like sentences
There are many concordance tools, some already in-built into the context of, e.g., a website
The most common tool used in corpus analysis is AntConc* (Anthony 2020)
  Available for all major platforms
  Freeware
  Well-documented
  Easy-to-use
*In class I use AntConc 3.5.9. A new version (4.3.1) is out now but is more difficult for non-expert users.

[Figure: AntConc KWIC display for the corpus ICE Canada, labelled with number of hits, hit, file, context left, keyword and context right]

4 Part-of-speech (POS) tagging

Introduction

Part-of-speech (POS) tagging was one of the breakthroughs in corpus linguistic analysis
A POS tag provides information
  on the word class, e.g. noun, adjective, auxiliary verb
  on inflections, e.g. plural noun, verb in the past participle form, superlative form of an adjective
  other aspects, e.g. type of conjunction, comparative after-determiner (more, less, …), existential there, …
POS tags are usually added to the word form in the form of _TAG
  The dance made her feel light and free.
  The_AT dance_NN1 made_VVD her_PPHO1 feel_VV0 light_JJ and_CC free_JJ ._.

Usefulness

Many words belong to more than one word class, e.g. fast, as these examples drawn from NOW show:
  Noun: […] an eating window of 10 hours and a daily fast of 14 hours […]
  Verb: Halve the recipe for cooking and desserts because when you fast you can't eat much.
  Adjective: […] there are no hard and fast rules about what you can get away with at what age.
  Adverb: Times are changing fast.
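To make the idea of a Key Word in Context listing concrete, here is a minimal sketch in Python that builds KWIC lines for fast from the four NOW fragments quoted above. It is not how AntConc works internally. The function names `tokenize` and `kwic`, the four-token context window and the simple regular-expression tokenizer are all assumptions made for illustration, and the elision marks from the quotations are dropped.

```python
import re

# The four fragments quoted in the lecture (elision marks dropped).
# They stand in for a real corpus file, which would be far larger.
SENTENCES = [
    "an eating window of 10 hours and a daily fast of 14 hours",
    "Halve the recipe for cooking and desserts because when you fast you can't eat much.",
    "there are no hard and fast rules about what you can get away with at what age.",
    "Times are changing fast.",
]


def tokenize(text):
    """Split text into word tokens and single punctuation marks (a simplification)."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def kwic(sentences, keyword, width=4):
    """Return (left context, hit, right context) for every token matching keyword."""
    rows = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                rows.append((left, token, right))
    return rows


rows = kwic(SENTENCES, "fast")
for left, hit, right in rows:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>30}  {hit}  {right}")
```

All four hits line up in one column regardless of word class, which is exactly why an untagged concordance has to be checked by eye when only one word class is of interest.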
If you wanted to find only the instances of fast used as an adverb without POS tagging, you would have to go through the KWIC output line by line to pick out the tokens you are interested in
  ICE-Canada yields 80 hits for fast → could be done by hand
  NOW (for the week starting 4 November 2024) yields 6,126 hits for fast → it’s getting cumbersome
  NOW (in total) yields 1,725,546 hits for fast → even at a rate of 1 token per second you’d need about 20 days(!), since 1,725,546 seconds ÷ 86,400 seconds per day ≈ 20 days

With POS tags, the same searches can be restricted to one word class:

             ICE Canada    NOW (04/11/2024-10/11/2024)    NOW (all)
Adjective    JJ: 38        J: 3,038                       J: 874,561
Adverb       RR: 42        R: 3,042                       R: 841,975

CLAWS

CLAWS (Constituent Likelihood Automatic Word-tagging System) is the most widely used POS tagger (https://ucrel.lancs.ac.uk/claws/)
It takes raw data and, based on several algorithms, assigns each word a word class
Accuracy: approx. 96-97 per cent
The CLAWS7 tagset (https://ucrel.lancs.ac.uk/claws7tags.html) consists of 137 individual tags, including 13 for determiners and 31 for verbs, e.g.
  VBDR – were
  VDG – doing
  VM – modal auxiliary (will, should)
  VV0 – base form of lexical verb (go, drink)
If we are interested in finding only the verbal use of broke, we could search for broke_VVD; for the adjective we would search for broke_JJ

TagAnt

TagAnt is a versatile free tagging tool (https://www.laurenceanthony.net/software/tagant/)
You can choose from different tagsets:
  word_pos – basic major word classes, often sufficient for smaller projects
    The_DET dance_NOUN made_VERB her_PRON feel_VERB light_NOUN and_CCONJ free_ADJ ._PUNCT
  word_pos_lemma – includes the lemma
    The_DET_the dance_NOUN_dance made_VERB_make her_PRON_she feel_VERB_feel light_NOUN_light and_CCONJ_and free_ADJ_free ._PUNCT_.
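The word_TAG and word_pos_lemma formats shown above are easy to process once each token is split at the underscore. The following Python sketch illustrates the kind of filtering that a search such as broke_VVD performs. It does not show how CLAWS, TagAnt or AntConc are implemented. The helper names `parse_tagged` and `words_with_tag` are hypothetical, and the sketch assumes that tokens are separated by whitespace (including the final punctuation token) and that word forms contain no underscore themselves.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word_TAG' or 'word_POS_lemma' tokens into tuples."""
    # Assumption: the word form itself never contains an underscore.
    return [tuple(token.split("_")) for token in text.split()]


def words_with_tag(tokens, prefix):
    """Return the word forms whose tag (second field) starts with prefix."""
    return [fields[0] for fields in tokens if fields[1].startswith(prefix)]


# The example sentence from the lecture in CLAWS7 and TagAnt output formats.
claws = parse_tagged(
    "The_AT dance_NN1 made_VVD her_PPHO1 feel_VV0 light_JJ and_CC free_JJ ._."
)
tagant = parse_tagged(
    "The_DET_the dance_NOUN_dance made_VERB_make her_PRON_she "
    "feel_VERB_feel light_NOUN_light and_CCONJ_and free_ADJ_free ._PUNCT_."
)

print(words_with_tag(claws, "JJ"))  # adjectives → ['light', 'free']
print(words_with_tag(claws, "VV"))  # lexical verbs → ['made', 'feel']
print([fields[2] for fields in tagant if fields[1] == "VERB"])  # verb lemmas → ['make', 'feel']
```

Matching on a tag prefix rather than the full tag mirrors the table above, where the coarse tags J and R group several more specific tags such as JJ and RR.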
Keywords

AntConc
British National Corpus (BNC)
BROWN Corpus
CLAWS
Concordance
Corpus linguistics
Corpus of Contemporary American English (COCA)
Corpus of Historical American English (COHA)
Global Web-Based English (GloWbE)
International Corpus of Learner English (ICLE)
Key Word in Context (KWIC)
Monitor corpus
News on the Web (NOW)
Part-of-speech (POS) tagging
Representative
TagAnt
Tagset
The International Corpus of English (ICE)

References

Anthony, Laurence. 2020. AntConc. Tokyo: Waseda University.
Bonelli, Elena T. 2010. Theoretical overview of the evolution of corpus linguistics. In Anne O'Keeffe & Michael McCarthy (eds.), The Routledge handbook of corpus linguistics, 14–27. New York: Routledge.
Brezina, Vaclav. 2018. Statistics in corpus linguistics: A practical guide. Cambridge: Cambridge University Press.
Egbert, Jesse, Douglas Biber & Bethany Gray. 2022. Designing and evaluating language corpora: A practical framework for corpus representativeness. Cambridge: Cambridge University Press.
Weisser, Martin. 2016. Practical corpus linguistics: An introduction to corpus-based language analysis. Chichester: Wiley.
