Sequence 1: Modern Linguistics and Corpus Linguistics PDF

Summary

This document provides an introduction to modern linguistics and corpus linguistics. It discusses the concepts and components of modern linguistics and corpus linguistics, including various subfields and methodologies. The document contains questions about corpus linguistics.

Full Transcript

Sequence 1: Modern linguistics and corpus linguistics

1.1. Introduction

[This introduction to linguistics is based on two main concepts: modern linguistics and corpus linguistics. It was not necessary to deal with traditional linguistics, because of the numerous advances this science has witnessed. Besides, modern linguistics is closely linked to corpus linguistics, which can introduce you to the register of technical English. Furthermore, this lesson will help you understand the characteristics of Legal English, which is the focus of your training this semester.]

1.2 Modern Linguistics:

[This part of the sequence is devoted to definitions and concepts around the key domains of modern linguistics and corpus linguistics. Linguistics needs to be defined as the science of language; however, it is also related to other sciences such as philosophy, psychology and sociology. The gist of the lesson concerns linguistics and language subsystems such as phonetics, phonology, morphosyntax, semantics and pragmatics, which recur in the main domains already mentioned. In this part you need to get acquainted with the most common linguistic concepts.]

1.3 Corpus Linguistics:

[This part of the lesson deals with a number of concepts that are instrumental to the handling of scientific discourse. Four components have been identified, namely: the definition of corpus (plural corpuses or corpora), corpus linguistics, corpus linguistics methodologies, and CL & discourse analysis. Corpus linguistics started to develop in the 1960s. It has made it possible for researchers to undertake systematic research on linguistic material. The ultimate aim was a linguistic description of a language system such as that of English. Corpus linguistics is thus concerned with the description and explanation of the nature, structure, and uses of language and languages.
Furthermore, it helps shed light on areas like language learning, variation, and change.]

READ:

“It [corpus linguistics] deals with sub-fields in sociolinguistics, linguistic situations (conveyance and vernacularization, for example the language of administration, education, justice...), but also typologization of situations (English as a Foreign Language, EFL; Second Language Acquisition, SLA; English as the Mother Tongue, EMT). Precisely, it identifies and analyzes the linguistic utterances produced. Besides, it is the set of statistical and geo-linguistic data. It is interested in the different modes of appropriation (acquisition vs. learning). All language productions constitute the corpus. The corpus is the collection of materials for a particular purpose, for example: copies of exams, textbooks, etc.”

“Corpus linguistics has been considered differently, first as a research approach in the study of written and spoken discourse. This is often done following certain types of methodologies in extracting frequency-based patterns, and other linguistic features of language in use. It also represents a certain kind of discourse, whether spoken or written. Other theoreticians believe that it allows the researcher to look at the way that people use language to communicate. Some go further by analyzing the ways people exercise power through their use of language. One of the most practical aspects of CL is the high-frequency words (HFW) lists, i.e., the words that appear most commonly in the English language. These words are important dealing both with meaning and context to make sentences more meaningful. The list thus established helps our reading and writing.”

Now indicate whether the following statements are true or false:

1. Corpus linguistics groups other sciences. TRUE/FALSE
2. Learning and acquisition are two ways of knowing. TRUE/FALSE
3. A corpus has several purposes. TRUE/FALSE
4. Corpus linguistics deals with a unique discourse. TRUE/FALSE
5. High-frequency word lists are not useful. TRUE/FALSE

1.3.1. What a corpus is:

[It goes without saying that one needs to understand the concepts often used in the literature, not least that of corpus (pl. corpora or corpuses). Read the two definitions of corpus below.]

Consider the following definitions: do they say the same thing, or do they differ?

“Corpus usually refers to a large collection of naturally occurring language texts presented in machine-readable form accumulated scientifically to characterize a particular variety or use of language (Sinclair¹ 1991: 172). It is methodically designed to contain many millions of words compiled from different texts across various linguistic domains to encompass the diversity a language usually exhibits through its multifaceted use. It may refer to any text in written or spoken form… collection of naturally occurring examples of language, consisting of anything from a few sentences to a set of written texts or tape recordings, which have been collected for linguistic study. More recently, the word has been reserved for collections of texts (or parts of texts) that are stored and accessed electronically.” (Dash² NS 2020).

[Sinclair and Dash agree on nearly every aspect of the definition of corpus. In fact, their definitions can be considered complementary. While many practitioners accept these definitions, research has generated a great number of findings depending on the domain and the language studied, and this has given rise to other accepted definitions. A number of corpora exist, among them the one described in Davies³ 2009.]

¹ Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
² Dash, N.S. 2005. Corpus Linguistics and Language Technology (With Reference to Indian Languages). New Delhi: Mittal Publications, in ‘Definition of a language corpus’, https://medium.com/@ns_dash/definition-of-a-language-corpus-2f237c08025a.
³ Davies, M. (2009: 159-160). “The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights”. International Journal of Corpus Linguistics 14:2 (2009), 159–190. doi 10.1075/ijcl.14.2.02dav, issn 1384–6655 / e-issn 1569–9811, © John Benjamins Publishing Company. (https://www.english-corpora.org/davies/articles/davies_36.pdf)

- Language corpora:

[Linguistic corpora are collections of linguistic data, either written texts or transcriptions of recorded speech. They can be used to produce a linguistic description or to check research, that is, to verify and confirm or refute hypotheses about a language (corpus linguistics). Corpora are thus “databases that contain texts” that can be useful in language research. They are helpful for both language teachers and learners. Furthermore, corpus linguistics concerns the compilation and analysis of collections of spoken and written texts and is employed for describing the nature, structure, and use of languages.

There are several types of corpora: monolingual, multilingual, comparable, diachronic and static corpora. The first modern, electronically readable corpus was the Brown Corpus of Standard American English. It consists of one million words of American English texts printed in 1961. One of the largest corpora is the Oxford English Corpus (OEC), a text corpus of 21st-century English used by the makers of the Oxford English Dictionary and by Oxford University Press’ language research programme. It is the largest corpus of its kind, containing nearly 2.1 billion words. Other well-known corpora include the American National Corpus (ANC) and the British National Corpus (BNC).

Corpora of texts were the initial linguistic material for investigations in corpus linguistics, and a number of corpora were designed for this purpose. Arguably, the first electronically processed text corpus was created in 1963 at Brown University, in the United States: Francis, W. and Kucera, H. designed a corpus of texts of 1 million words, which was given the name Brown Corpus. Below are several examples of corpora]:

English

British National Corpus (http://www.natcorp.ox.ac.uk/)
    The volume of the corpus is more than 100 million words of contemporary spoken (10%) and written (90%) British English. The BNC was designed to be well-balanced, with a wide range of genres.

The Bank of English (http://www.collins.co.uk/Corpus/CorpusSearch.aspx)
    The corpus includes different types of written texts and oral speech. It consists of several subcorpora, namely British books, newspapers, journals etc. (36 million words); American books, newspapers etc. (10 million words); British oral speech (10 million words). The volume of the corpus is 56 million words.

American National Corpus (ANC) (http://americannationalcorpus.org/)
    It is planned to compile a corpus which will include 100 million words. At present the volume of the existing fragment of the corpus is 10 million words.

Michigan Corpus of Academic Spoken English: MiCASE (http://quod.lib.umich.edu/m/micase/)
    The volume of the internet version of the corpus is 1 848 364 words taken from 152 samples.

Source: Modern English Legal Terminology: linguistic and cognitive aspects (Kucheruk, L. 2013).

[In French, one known corpus is the ‘Corpus of spontaneous interviews’ (Un corpus d’entretiens spontanés⁴, around 90).]

⁴ http://www.llas.ac.uk/resources/mb/80.

“The corpora can differ in a number of ways according to the purpose for which they were compiled, their representativeness, organization, and format. In the scientific literature, several types of corpora are distinguished:

1. According to the form of storage: in an audio record, written corpora, mixed corpora;
2. According to the language of texts: monolingual, multilingual;
3. According to genres of texts included: literary, dialect, informal, journalistic, mixed;
4. According to accessibility: open-access, commercial, closed;
5. According to destination: exploratory, illustrative;
6. According to dynamics: dynamic (monitor), static;
7. According to additional information provided: annotated, not annotated”.⁵

[Specialized corpora can also be included. This specialization (ESP/EST) responds to definite criteria that specify the type of texts to be considered. These corpora contain texts specialized in a particular subject (law, technology, sciences…).]

1.3.2 Tools and techniques for corpus analysis

[Because a corpus must reflect language use, it should show a variety of nuances and be representative enough of a given language or dialect, in its written and spoken forms, whether received or produced. Two methods have been used: ‘concordancing’ programs (which search for specific terms in a corpus) and other computer programs. Both rely on the sampling technique used in corpus design. From there, one can consider word frequency as a follow-up technique.]

- Word frequency (definition given in the first sentence).

“Word frequency enables computers to count how many words occur in a piece of writing. Word frequency is the technique used to see how often a word, for example “SUN”, appears in a text; let us say it occurs 10 times in a text of 100 words. This gives an absolute frequency of 10 and a relative frequency of 0.1 or 10%. This depends of course on the length of the text or corpus, and the genre or register. Word frequency is generally regarded as the most important factor in creating pedagogical word lists, but it is not the only factor to consider; see for example Nation’s “subjective criteria” (2016: 10) and Ishikawa’s “pedagogical adjustments” (2019: 2).
Consideration of other criteria for inclusion in pedagogical word lists entails a shift away from the exclusively quantitative domain of ranked frequencies of corpus items, and necessitates more qualitative, context-specific decision-making. Other criteria might include:
- the “learnability” or “learning burden” of a vocabulary item, for example the transparency of its orthography, the typicality of its grammatical patterning, etc. (West, 1953; Nation, 2001);
- relevance to specialized or syllabus-defined topics;
- modality or register;
- necessity relating to context/environment (e.g. classroom language);
- inclusion or exclusion of proper nouns, numbers, etc.;
- consideration of the L1 of the learners (e.g. the JACET lists were constructed specifically for Japanese L1 learners of English (JACET, 2016; Ishikawa, 2019)).”

[Besides, researchers start by drawing up word families. Any decision concerning choice should depend on what the list will be used for: as a simple inventory of vocabulary items, for better comprehension, or for distinguishing between forms and senses in close connection with context. This will lead to more or less extensive lists of words sorted by their frequency of occurrence. These automatically generated wordlists are continuously revised because of language change and neologisms.]

⁵ Modern English Legal Terminology: linguistic and cognitive aspects (Kucheruk, L. 2013: 107-109).

______________________

READ (what are the topic words in the following text?):

Why does word frequency matter?⁶

“Word frequency matters because it can reveal a lot about the vocabulary and the content of a text or a corpus. For instance, it can help you determine the difficulty level of a text, based on how many high-frequency or low-frequency words it contains. High-frequency words are those that appear frequently in a language, such as "the", "and", or "is". Low-frequency words are those that appear rarely, such as "quark", "antimatter", or "paradigm".
Generally, texts with more high-frequency words are easier to read and understand than texts with more low-frequency words. Word frequency can also help you identify the keywords or the topic words of a text or a corpus, which are those that are more frequent than expected by chance, compared to a reference corpus or a general language corpus. Keywords or topic words can indicate the main themes or the specific focus of a text or a corpus.”

[Word frequency counts are often used in classroom situations to identify the words the teacher could include in his or her teaching. This procedure helps the researcher to establish word-lists. One of the most famous is Svartvik’s list, “The Alphabetical Listing of Words Occurring More Than Once”, in Svartvik, J., & Quirk, R. (1980). A corpus of English conversation. Lund, Sweden: Gleerup.]

- Word lists:

“In English for Specific Academic Purposes (ESAP) or English for Specific Purposes (ESP) programs, where learners have highly specific needs, and plan to study similar subject areas (e.g., Mathematics, Physics, Chemistry) or even the same subject area (e.g., Mathematics), discipline-specific wordlists may better serve learners’ needs than general academic wordlists. Specialized vocabulary tends to occur more often in specialized texts (Chung & Nation, 2004; Nation, 2016). Compared with general academic wordlists, discipline-specific wordlists are better at drawing their attention to the most frequent and wide-ranging words in their specific areas and providing a shortcut to reduce the amount of learning (Nation, 2013). Moreover, learners might be motivated to learn items from these lists because they can clearly see the relationship between what they study in their English courses and their subject courses (Hyland, 2016). Additionally, similarities in the learners’ academic discipline may make it easier for teachers to focus on more specialized vocabulary in that discipline.
General academic wordlists have a wider application (Hyland, 2016; Nation, 2013). They can be useful in EGAP programs where learners (1) are more heterogeneous in terms of disciplines that they plan to study, (2) have not yet identified their target disciplines, (3) plan to study interdisciplinary subject areas, or (4) teachers lack background knowledge of learners’ specific disciplines. In such environments, it is usually challenging for EAP teachers to satisfy the specific needs of every learner in their programs and a general academic wordlist would therefore be more practical. The value of general academic wordlists is evident from Banister’s (2016) finding that Coxhead’s (2000) Academic Word List (AWL), a general academic wordlist, was widely used and perceived as a useful instrument for L2 learners from a wide range of subjects by EAP teachers⁷.”

⁶ https://www.linkedin.com/advice/0/how-do-you-measure-word-frequency-your-corpus.
⁷ Dang, T. N. Y., Coxhead, A., & Webb, S. (2017). The academic spoken word list. Language Learning, 67(4), 959–997. http://eprints.whiterose.ac.uk/135479/

_____________________

1.3.3 Designing a corpus

[One question you may ask is how one can design a corpus. Compiling a corpus of any stretch of discourse is a difficult task, and transcription is not the least complex part of the process. A number of linguists and sociolinguists have worked for long periods to achieve products worth analyzing, in terms of both the process itself and the results obtained. Thus, Reppen⁸ (2022 ed.) explains what to consider when faced with such a task.]

⁸ Reppen, R. “Building a corpus: what are key considerations?” in The Routledge Handbook of Corpus Linguistics 2e, edited by Anne O’Keeffe and Michael J. McCarthy.

------------------------------

READ:

In order to design a corpus, the researcher must take a series of decisions such as:

1. decide on the structural criteria that will be used to build the corpus, and apply them to create a framework for the main corpus components;
2. for each component, draw up a comprehensive inventory of text types that are found there, using external criteria only;
3. put the text types in a priority order, taking into account all the factors that might increase or decrease the importance of a text type, the kind of factors discussed above;
4. estimate a target size for each text type, relating together (i) the overall target size for the component, (ii) the number of text types, (iii) the importance of each, and (iv) the practicality of gathering quantities of it;
5. as the corpus takes shape, maintain a comparison between the actual dimensions of the material and the original plan;
6. (most important of all) document these steps so that users can have a reference point if they get unexpected results, and so that improvements can be made on the basis of experience (J. Sinclair⁹ 2004: 12).

Now ANSWER the question: Which, in your view, is the most important decision to take while building a corpus? Take a couple of minutes before you read our own response.

As far as we are concerned, it is the sixth element, because research is a non-stop activity that could induce changes in the process.

⁹ Sinclair, John (2004). “Corpus and Text: Basic Principles”. In: Martin Wynne (ed.), Developing Linguistic Corpora: a Guide to Good Practice, 1-16. Oxford: Oxbow Books.
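
As a closing illustration of the word-frequency technique described in section 1.3.2, the short Python sketch below counts how often each word occurs in a text and derives the absolute and relative frequencies, reproducing the “SUN” example (10 occurrences in a text of 100 words). It is only a minimal sketch under stated assumptions: the function names `tokenize` and `word_frequencies` are hypothetical, the tokenizer is deliberately naive (lowercase alphabetic strings only), and the sample text is artificial rather than taken from a real corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase the text and keep
    runs of letters and apostrophes as word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Return ({word: (absolute, relative)}, total_number_of_tokens)."""
    tokens = tokenize(text)
    counts = Counter(tokens)          # absolute frequency of each word
    total = len(tokens)               # length of the text in words
    freqs = {word: (n, n / total) for word, n in counts.items()}
    return freqs, total


# Artificial 100-word text: "sun" 10 times plus 90 filler tokens.
sample_text = " ".join(["SUN"] * 10 + ["word"] * 90)

freqs, total = word_frequencies(sample_text)
absolute, relative = freqs["sun"]
print(total, absolute, relative)  # 100 10 0.1
```

Sorting the resulting dictionary by absolute frequency would give a simple frequency-ranked word list; real corpus tools additionally handle tokenization, lemmatization and word families far more carefully than this sketch does.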
