Questions and Answers
Which of these corpora is the largest balanced corpus of the specified type?
- British National Corpus (BNC) (correct)
- News on the Web (NOW)
- Global Web-Based English (GloWbE)
- International Corpus of Learner English (ICLE)
Which corpus is considered the largest monitor corpus of English?
- International Corpus of Learner English (ICLE)
- News on the Web (NOW) (correct)
- Global Web-Based English (GloWbE)
- British National Corpus (BNC)
From which time period does the British National Corpus (BNC) data originate?
- 2010-present
- 18th century
- 2012-2013
- 1980-1993 (correct)
What type of data does the International Corpus of Learner English (ICLE) primarily contain?
Which corpus contains data from the largest number of countries?
What does the example of the "TB" file demonstrate about corpus files?
Which of the following is NOT mentioned as a typical component of a corpus?
What is the primary purpose of providing metadata in a corpus?
Which of the following is NOT a corpus mentioned in the provided content?
Based on the text, what is the significance of the example of the "TB" file in the context of corpus design?
What is the main focus of the ICE Corpus?
How many varieties of English are currently available in the ICE Corpus?
Which of the following categories is NOT included in the ICE Corpus's written category?
In the ICE Corpus's 'Spoken' category, which subcategory is the largest?
Which of the following is NOT a subcategory of the ICE Corpus's 'Spoken' category?
In the ICE Corpus's 'Spoken' category, which subcategory belongs to 'Monologues' and 'Unscripted'?
In the ICE Corpus's 'Written' category, which subcategory is included under 'Non-printed' and 'Letters'?
Which of the following subcategories of the ICE Corpus's 'Written' category is NOT classified under 'Printed'?
What is the approximate size of the Corpus of Contemporary American English (COCA)?
What is the time period covered by the Corpus of Historical American English (COHA)?
What is the role of POS tagging in the analysis of large corpora?
Which of the following is NOT a benefit of using POS tagging for language analysis?
According to the passage, what is the significance of the 'fast' word in the context of the 'NOW' corpus?
What is the main difference between the 'ICE Canada' and 'NOW' corpora discussed in the text?
What is the purpose of the provided numerical data on the 'NOW' corpus?
What does the abbreviation 'POS' stand for in the given text?
What is the primary function of the 'CLAWS' system?
What is the significance of the 'CLAWS7 tagset' mentioned in the passage?
What is the approximate size of the BROWN corpus?
What does the term 'diachronic' refer to in the context of corpora?
How many samples are included in the BROWN corpus?
According to the provided text, what are some examples of criteria used to categorize corpora?
What is a common characteristic of corpora following the BROWN design?
What is a corpus?
How does reading a corpus differ from reading a single text?
Which of the following is NOT a common characteristic of a corpus?
What is the purpose of annotating a corpus?
How does the structure of a single text relate to the structure of a corpus?
How can parole help us understand langue?
Why is it important for a corpus to be representative of a language or domain?
What is the main characteristic of a corpus that makes it useful for research?
Flashcards
Corpus
A corpus is a structured collection of written or spoken texts, designed for analysis by a computer.
Metadata
Data that provides information about other data, such as author, genre, and title in a corpus.
Corpus design
The process of creating a corpus, including selecting texts and defining its structure.
Raw data to corpus file
Corpus structure
Difference between Text and Corpus
Text Reading Approach
Events in Text and Corpus
Social Practice
Corpus Annotation
Types of Data in Corpus
Machine-readable
Corpora
Mode of Corpora
Time Depth
Specificness of Corpora
The BROWN Family of Corpora
GloWbE
NOW
BNC
ICLE
International Corpus of English (ICE)
Spoken Dialogues
Legal cross-examinations
Scripted Monologues
COCA
COHA
Non-printed writing
Academic writing
Creative writing
Broadcast news
Part-of-speech (POS) tagging
POS tag format
Word class ambiguity
CLAWS tagger
CLAWS7 tagset
KWIC output
Existential there
Usefulness of POS tagging
Study Notes
Corpus Linguistics (1)
- Multilingualism is a common phenomenon globally
- Diglossia involves using two languages or varieties in a speech community for different functions
- Code-switching means changing between languages depending on the situation
- Code-mixing involves mixing elements of different languages within a single stretch of communication
- English usage varies globally, including in Anglosphere countries (UK, Ireland, USA, Canada, Australia, New Zealand) and former colonial territories in Africa, Asia, the Caribbean, and Oceania
- World Englishes are described using frameworks such as Kachru's Three Circles Model and Schneider's Dynamic Model
- Pidgins and creoles are contact languages that typically emerge among speakers of typologically unrelated languages; pidgins in particular exhibit simplified structures
- These languages are prevalent in West Africa, the Caribbean, the US and Central America, and the Pacific.
Today's Lecture
- The lecture covers Introduction, Corpora, Concordancing, and Part-of-speech Tagging
Introduction to Corpus Linguistics
- Corpus linguistics is a scientific method of language analysis relying on data from language corpora to support statements about language.
- The BROWN Corpus (1 million words), first released in 1964 and sampling American English texts published in 1961, was a significant early corpus
- Modern corpora encompass billions of words
- Corpus linguistics aids in language trend analysis and authentic language teaching materials
Why Corpus Linguistics is Useful
- Language trend analysis: Identifying trends in language use over time (e.g. on social media)
- Language teaching: Creating authentic teaching materials based on real-life language use (e.g. ESL textbooks)
- Lexicography: Enhancing dictionary definitions with usage information
- AI & NLP: Improving computational models of language understanding and training AI models (e.g. Siri, Alexa)
- Discourse analysis: Uncovering patterns and structures in spoken/written text (e.g. political speeches)
- Linguistic hypothesis testing: Validating or refuting theories of language usage
What is a Corpus?
- A corpus is a collection of electronic texts designed to represent a language or a language variety, built according to specific criteria for the purpose of study.
- Corpora aim to represent language naturally
- Corpora can consist of written or spoken texts (including transcripts).
- Corpora are machine-readable and are sampled to be representative
- Corpora are large collections of texts which enable investigation of language phenomena
Difference between text and Corpus
- A text is a whole, coherent communicative event that is read horizontally and is an individual act of will.
- A corpus is fragmented and read vertically for formal patterning and repeated events; it represents language use as social practice.
Corpus Structure
- Corpus structure and level of annotation vary greatly
- A corpus frequently consists of files containing annotated written texts and transcripts of spoken data
- Corpora often include a manual, metadata (e.g., author, genre, title) and/or an online interface for exploration
From Raw Data to Corpus File
- Raw data can be transformed into a corpus file using XML (Extensible Markup Language).
- Metadata includes aspects such as the date, location, author, and the nature of the written or spoken text
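The step from raw data to a corpus file can be sketched in Python as follows. This is only an illustration under stated assumptions: the element names (`text`, `header`, `body`, `s`), the helper `build_corpus_file`, and the sample text and metadata values are invented here and do not follow any particular corpus's markup scheme.

```python
import xml.etree.ElementTree as ET

def build_corpus_file(text, metadata):
    """Wrap raw text and its metadata in a small XML tree (hypothetical layout)."""
    root = ET.Element("text")
    # Metadata goes into a header, one child element per field
    header = ET.SubElement(root, "header")
    for key, value in metadata.items():
        ET.SubElement(header, key).text = value
    # The text itself goes into a body, one numbered <s> element per line
    body = ET.SubElement(root, "body")
    for number, line in enumerate(text.strip().splitlines(), start=1):
        unit = ET.SubElement(body, "s", n=str(number))
        unit.text = line.strip()
    return root

# Invented sample data, for illustration only
raw_text = """Hello, how are you?
Fine, thanks."""
metadata = {
    "date": "2024-01-15",
    "location": "Toronto",
    "author": "Speaker A",
    "mode": "spoken",
}

corpus_xml = build_corpus_file(raw_text, metadata)
xml_string = ET.tostring(corpus_xml, encoding="unicode")
print(xml_string)
```

Keeping the metadata in a separate header makes it possible to filter texts (for instance by mode or author) without touching the text body.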
Corpora Overview
- Hundreds of corpora exist, covering various aspects of the English language
- Corpora can be categorized by mode (written, spoken, both), time depth (synchronic, diachronic, historical), and topic specificity
- Comprehensive online corpus lists exist
The BROWN Family
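- A family of comparable corpora of written English, each following the design of the original BROWN Corpus
- Spans a range of genres, including press, religion, skills and hobbies, popular lore, belles-lettres, miscellaneous, and humor
- Because the corpora apply the same design to different time periods and varieties, they provide data for diachronic and comparative analysis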
The International Corpus of English (ICE)
- A collection of comparable corpora of 1 million words each, one per variety of English
- Includes written and spoken data (student essays, news reports, conversations) for different varieties of English.
ICE Design: Spoken and Written
- Every variety follows the same design, with fixed quantities of text per genre in both the spoken and written parts
COCA and COHA
- COCA (Corpus of Contemporary American English): 1 billion words; covers 1990-2019
- COHA (Corpus of Historical American English): 475 million words; covers 1820s-2010s
GloWbE and NOW
- GloWbE (Global Web-Based English) comprises 1.9 billion words of web text from 20 countries, covering a range of domains; collected in 2012-2013
- NOW (News on the Web) is an up-to-the-minute corpus comprising 20.3 billion words, collected daily
BNC and ICLE
- BNC (British National Corpus): 100 million words; encompasses written and spoken British English from 1980-1993; 7 categories of textual data
- ICLE (International Corpus of Learner English): 3.7 million words, consisting of essays written by learners of English from diverse language backgrounds
Concordancing Introduction
- Concordancing helps linguists investigate word occurrences and behavior in real-life contexts.
- Traditional approaches rely on the native speaker's intuition
- A concordance lists the occurrences of a word form together with their surrounding context (also known as co-text)
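A concordance of the keyword-in-context (KWIC) kind can be sketched in a few lines of Python. This is a minimal illustration, not how AntConc or any other tool works internally; the function name `kwic`, the `width` parameter, and the sample sentence are assumptions made for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Co-text: up to `width` characters on either side of the match
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left co-text so the keywords line up vertically
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text, for illustration only
sample = "The fast train left early. She runs fast, but he is not as fast as her."
concordance = kwic(sample, "fast", width=20)
for line in concordance:
    print(line)
```

Aligning the keyword in a single column is what allows a concordance to be read vertically for repeated patterns rather than horizontally like a text.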
AntConc
- AntConc is widely used for corpus analysis
- It facilitates diverse searches from individual letters to larger constructions, like sentences
- It's available across many platforms, is freely accessible, well-documented and user-friendly
Part-of-speech (POS) tagging Introduction
- POS tagging assigns grammatical categories (e.g., noun, verb, adjective) to words in a text
- This aids in understanding word classes and inflections.
- CLAWS is a commonly used automated tagging system
POS tagging Usefulness
- Counting the tags assigned to a word shows how often it is used in each word class
- For example, it shows how frequently a word appears as an adjective or as an adverb, which helps to better understand its usage
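The kind of count described above can be sketched in Python. The example assumes tagged text in a `word_TAG` format with CLAWS7-style labels (for instance `JJ` for a general adjective and `RR` for a general adverb); the helper `tag_counts` and the tagged sample sentences are invented for illustration and are not real tagger output.

```python
from collections import Counter, defaultdict

def tag_counts(tagged_text):
    """Count how often each word form occurs with each POS tag."""
    counts = defaultdict(Counter)
    for token in tagged_text.split():
        # Split at the last underscore: everything before it is the word
        word, _, tag = token.rpartition("_")
        if word:  # skip tokens that carry no tag
            counts[word.lower()][tag] += 1
    return counts

# Invented sample in word_TAG format, for illustration only
tagged = (
    "The_AT fast_JJ train_NN1 left_VVD early_RR ._. "
    "She_PPHS1 runs_VVZ fast_RR ._. "
    "A_AT1 fast_JJ car_NN1 passed_VVD ._."
)

counts = tag_counts(tagged)
print(dict(counts["fast"]))
```

In this small sample, "fast" is counted twice as an adjective and once as an adverb; on a large tagged corpus the same count shows which use of an ambiguous word dominates.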
TagAnt
- TagAnt is a versatile free tagging tool for corpus analysis.
- Various tagsets (inventories of word-class labels) can be selected based on project needs
Keywords
- A list of common terms used in corpus linguistics, including types of corpora, analysis techniques, and software tools.
Description
This quiz covers key concepts in Corpus Linguistics, focusing on multilingualism, diglossia, code-switching, and language variation. Learn about the global usage of English and the emergence of pidgins and creoles, along with various models explaining World Englishes. Test your understanding with questions about language dynamics and their implications.