Document Details

Uploaded by AwedGauss2256

Universität Regensburg

Dr Thorsten Brato

Tags

corpus linguistics, frequency analysis, collocations, language analysis

Summary

This document contains lecture notes on corpus linguistics covering frequency analysis, collocations, and case studies. The lecture notes are part of an introduction to English linguistics.

Full Transcript

Corpus Linguistics (2)
Dr Thorsten Brato
Department of English and American Studies
VL Introduction to English Linguistics: English in Use

Recap

- Corpus linguistics is the empirical analysis of authentic language data using corpora
- A corpus is a collection of linguistic data, compiled following a certain design, usually structured and annotated
- Corpus typology, e.g. written/spoken, synchronic/diachronic
- Frequently used corpus ‘families’, e.g. BROWN, ICE, COCA, BNC, GloWbE, NOW
- Concordancing is a method in corpus linguistics to show keywords in context
- AntConc is a frequently used software tool in corpus linguistics
- Part-of-speech (POS) tagging adds grammatical information to a corpus

Today's lecture

1 Frequency
2 Collocations
3 Case studies

1 Frequency: Quantitative and qualitative data analysis

- Concordances are at the heart of corpus linguistics
- There are two (often complementary) approaches to working with concordances
- In a quantitative approach, we count occurrences, compare frequencies and often use statistics to uncover patterns
- In a qualitative approach, the focus is not on how often a feature occurs, but
  “[t]he data are used only as a basis for identifying and describing aspects of usage in the language and to provide ‘real-life’ examples of particular phenomena.” (McEnery & Wilson 2001: 76)

1 Frequency: Quantitative data analysis

“outdoor/outdooring (V/N): the bringing ‘out of doors’ of a child after seven days.” (Dako 2003: 161)

- ‘Outdooring’
  - GloWbE: 72 hits (1 from Canada, 71 from Ghana)
  - NOW: 339 hits (1 each from Canada, Nigeria, South Africa and the US, 336 from Ghana)
- ‘Outdoored’ (verb)
  - GloWbE: 64 hits (1 from Nigeria, 63 from Ghana)
  - NOW: 438 hits (1 each from Kenya and Nigeria, 6 from South Africa, 430 from Ghana)
- This is good evidence that the tradition of ‘outdooring’ is a very Ghanaian thing

1 Frequency: Qualitative data analysis

- “The "outdooring" and naming ceremony starts when an elder of the father's family pours libation…”
- “Later, when a child is outdoored and reckoned to be human…”
- “The NDC [National Democratic Congress, a political party; TB] has succeeded since its outdooring as such in Cape Coast in 1992.”
- “When the African Union (AU) was outdoored in 2002 to replace the Organisation of African Unity (OAU)…”
- “The song is expected to be officially outdoored this week.”
- “The Ghana Oil Company (GOIL) has outdoored its rejuvenated brand identity…”

Using a qualitative analysis, we notice not only that the concept of outdooring is typically Ghanaian, but also that a semantic shift has taken place.

1 Frequency: Frequency analysis

- Frequency analyses are among the most common analyses in corpus linguistics
- Find all instances of a ‘construction’, e.g. in GloWbE
  - a word (and its spelling variations), e.g. ice-cream (3054) vs. ice cream (17524) vs. icecream (841) vs. ice creme (3)
  - all word-forms of the lemma GIVE: give (883938), gave (294542), given (563102), gives (216734), giving (263499), giveth (684), gived (16)
  - a fixed expression, e.g. merry Christmas (1729) vs. happy Christmas (78)
- Compare frequencies across varieties and/or time
  - GloWbE: GO on holiday (2263, about half from GB) vs. GO on vacation (1158, less than half from US and CA)
  - COHA: telephone: 24901 (mainly 1930s–1980s); phone: 42017 (mainly 2000s–2010s)

1 Frequency: Types and tokens

- A crucial difference is that between types and tokens
- Token refers to the total number of words or constructions in a text, corpus or sub-corpus
- Type refers to the total number of different words or constructions in a text, corpus or sub-corpus
- ICE-GB contains 1,071,926 tokens and 34,421 types
- Why is the token number so much larger than 1,000,000?
  - Each sample contains ~2,000 words, but always complete sentences
  - Tags are included in the token count

1 Frequency: Type-token ratio (TTR)

- The type/token ratio (TTR) is a measure of diversification in a text, corpus or sub-corpus:
  TTR = (types / tokens) × 100
- The TTR for ICE-GB is (34,421 / 1,071,926) × 100 ≈ 3.2
- A high TTR indicates a high degree of lexical variation, a low TTR indicates the opposite
- TTR is strongly dependent on corpus size: it will usually be higher the smaller the corpus

Activity 1: Why is the TTR higher the smaller the corpus?

1 Frequency: Comparing frequencies

- Often (sub-)corpora have different sizes, e.g.
  - in ICE: Spoken: 600,000 words; Written: 400,000 words
  - in COHA: ~7m words from the 1820s; ~30m words from the 1980s
- When comparing frequencies, you must make sure you normalise your data and calculate the per-X-word frequency (also: normalised frequency), e.g.
  - per-million-word (pmw)
  - per-thousand-word (ptw)
  - per-ten-thousand-word (pttw)
- Example: What is the difference in frequency of the word Scotland between the written and spoken sections of ICE-GB?
  - 41 times in the spoken section (643024 tokens)
  - 36 times in the written section (428854 tokens)

1 Frequency: Comparing frequencies

Activity 2: A COHA search for the term television yields the results below. Go to Pingo and answer the question.

Decade   Tokens       N
1960s    29,122,676   2,816
1970s    28,829,225   3,468
1980s    29,851,580   3,572
1990s    33,149,318   3,123

pingo.coactum.de 981125

2 Collocations: Introduction

- Collocations are ‘predictable combinations of words’ (Schmid 2003: 237)
  - University of …
  - corpus linguistics
  - get the hang of …
  - broke (up | out | in | my leg)
- How do we know if something is ‘predictable’?
  - Native-speaker intuition or having learned the appropriate construction
  - fast food vs. slow food
  - *quick food vs. *unhurried food
  - quick shower vs. *fast shower

2 Collocations: AntConc

- Before searching for collocations, you first need to create a word frequency list in the Word list tab
- In the Collocates tab you then enter your search term: a word, construction or regular expression
- Set the Window span, e.g.
  - 1L: one word to the left
  - 3R: three words to the right
- Choose the minimum collocate frequency
  - 1: show all collocations
  - n: show only those which occur at least n times in the corpus

3 Case studies: The Americani[sz]ation of English?

- It has frequently been argued that American English is constantly gaining ground due to several factors
  - Globalisation
  - Popular culture
  - Media exposure
  - Attitudes
  (cf. e.g. Anchimbe 2006: World Englishes and the American tongue; Shoba et al. 2013: ‘Locally acquired foreign accent’ (LAFA) in contemporary Ghana; Gonçalves et al. 2018: Mapping the Americanization of English in space and time; Sharbawi 2022: The Americanisation of English in Brunei)
- If this is the case, we should find American English forms at least to some degree in other varieties of English
- We may assume it to be strongest in Canada and the Philippines and perhaps least likely in Great Britain

3 Case studies: The Americani[sz]ation of English? - Lexis

- There are dozens of lists of words which (are said to) differ between British and American English
- These lists are usually categorical, e.g. for the concept (cf. OALD)
  - ‘a long thin piece of potato fried in oil or fat’: UK chips, US (French) fries
  - ‘a thin round slice of potato that is fried until hard then dried and eaten cold’: UK crisp, US chip
- Although an obvious contender, this is hard to research because of the multiple meanings of chip(s), fries and crisp

3 Case studies: The Americani[sz]ation of English? - Lexis

- Before running our analysis, we want to make sure we only include nouns, as aubergine/eggplant may also be used as an adjective, e.g. He wore an aubergine/eggplant jacket
- This, of course, requires a tagged corpus or manual analysis
- Using GloWbE, we could add _nn to find only nominal uses
- We should also extend our analysis to include plural forms
- In GloWbE (and the other corpora from english-corpora.org) we can write the search term in CAPITALS to find all word-forms of the lemma
- Looking at the structure of GloWbE (the US and GB components are more than ten times larger than the JM or TZ components and still four times larger than e.g. IN), we must make sure to look at the normalised frequencies

3 Case studies: The Americani[sz]ation of English? - Lexis

3 Case studies: The Americani[sz]ation of English? - Past tense

Keywords

Americanisation, Collocation, Construction, Frequency, PMW, Qualitative, Quantitative, Semantic shift, Token, Type, Type/token ratio (TTR)

References

Anchimbe, Eric A. 2006. World Englishes and the American tongue. English Today 22(4). 3–9.
Dako, Kari. 2003. Ghanaianisms: a glossary. Accra: Ghana University Press.
Gonçalves, Bruno, Lucía Loureiro-Porto, José J. Ramasco & David Sánchez. 2018. Mapping the Americanization of English in space and time. PLoS ONE 13(5).
Labov, William. 1966. The social stratification of English in New York City. Washington, D.C.: Center for Applied Linguistics.
Labov, William. 2006. The social stratification of English in New York City, 2nd edn. Cambridge: Cambridge University Press.
Lange, Claudia & Sven Leuckert. 2019. Corpus linguistics for World Englishes: A guide for research. London: Routledge.
Schmid, Hans-Jörg. 2003. Collocation: hard to pin down, but bloody useful. Zeitschrift für Anglistik und Amerikanistik 51(3). 235–258.
Sharbawi, Salbrina. 2022. The Americanisation of English in Brunei. World Englishes.
Shoba, Jo A., Kari Dako & Elizabeth Orfson-Offei. 2013. ‘Locally acquired foreign accent’ (LAFA) in contemporary Ghana. World Englishes 32(2). 230–242.
Trudgill, Peter. 1972. Sex, covert prestige and linguistic change in the urban British English of Norwich. Language in Society 1. 179–195.
Trudgill, Peter. 1988. Norwich revisited: Recent linguistic changes in an English urban dialect. English World-Wide 9(1). 33–49.
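
Worked example: normalised frequency. The Scotland question from the "Comparing frequencies" slide can be worked through with a few lines of Python. This is only a minimal sketch: the function name `normalised_frequency` and its `per` parameter are illustrative assumptions, not part of the lecture or of any corpus tool. The hit counts and token counts are the ICE-GB figures quoted in the lecture.

```python
def normalised_frequency(hits, corpus_size, per=1_000_000):
    """Return the frequency per `per` words (default: per million words, pmw).

    `normalised_frequency` is a hypothetical helper written for this example.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits / corpus_size * per


# Figures from the lecture: occurrences of "Scotland" in ICE-GB
spoken_pmw = normalised_frequency(41, 643_024)   # spoken section
written_pmw = normalised_frequency(36, 428_854)  # written section

print(f"spoken:  {spoken_pmw:.1f} pmw")   # spoken:  63.8 pmw
print(f"written: {written_pmw:.1f} pmw")  # written: 83.9 pmw
```

Although the raw count is higher in the spoken section (41 vs. 36), the normalised frequency is higher in the written section, which is why raw counts from sub-corpora of different sizes should not be compared directly.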

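Worked example: type/token ratio. The TTR formula from the lecture, (types / tokens) × 100, can be sketched in Python as below. The helper `type_token_ratio`, the toy sentence and the naive whitespace tokenisation are all assumptions made for illustration; a real study would use a proper tokeniser and a real corpus. The ICE-GB figures are the ones given in the lecture.

```python
def type_token_ratio(tokens):
    """Return the TTR as a percentage: (number of types / number of tokens) * 100."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens) * 100


# ICE-GB figures from the lecture: 34,421 types and 1,071,926 tokens
ice_gb_ttr = 34_421 / 1_071_926 * 100
print(round(ice_gb_ttr, 1))  # 3.2

# Toy text (invented for this example), split on whitespace only
toy = "the cat sat on the mat and the dog sat on the rug".split()

# TTR of increasingly long stretches of the same text
for n in (4, 8, len(toy)):
    print(n, round(type_token_ratio(toy[:n]), 1))
# 4 100.0
# 8 75.0
# 13 61.5
```

In the toy text the TTR drops as more tokens are added, because frequent words such as "the" keep recurring while new types appear more and more rarely. This is one way to approach the question in Activity 1.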