Corpus Linguistics 08

Questions and Answers

Which of these corpora is the largest balanced corpus of the specified type?

  • British National Corpus (BNC) (correct)
  • News on the Web (NOW)
  • Global Web-Based English (GloWbE)
  • International Corpus of Learner English (ICLE)

Which corpus is considered the largest monitor corpus of English?

  • International Corpus of Learner English (ICLE)
  • News on the Web (NOW) (correct)
  • Global Web-Based English (GloWbE)
  • British National Corpus (BNC)

From which time period does the British National Corpus (BNC) data originate?

  • 2010-present
  • 18th century
  • 2012-2013
  • 1980-1993 (correct)

What type of data does the International Corpus of Learner English (ICLE) primarily contain?

  • Essay writing (correct)

Which corpus contains data from the largest number of countries?

  • Global Web-Based English (GloWbE) (correct)
  • News on the Web (NOW) (correct)

What does the example of the "TB" file demonstrate about corpus files?

  • Corpus files are designed to be processed and analyzed by computers. (correct)

Which of the following is NOT mentioned as a typical component of a corpus?

  • Examples of human-readable text (correct)

What is the primary purpose of providing metadata in a corpus?

  • To provide context and information about the source of the text. (correct)

Which of the following is NOT a corpus mentioned in the provided content?

  • The Gutenberg Project (correct)

Based on the text, what is the significance of the example of the "TB" file in the context of corpus design?

  • It demonstrates that corpus files are not designed for human readability. (correct)

What is the main focus of the ICE Corpus?

  • Modern English from diverse geographic regions (correct)

How many varieties of English are currently available in the ICE Corpus?

  • 14 (correct)

Which of the following categories is NOT included in the ICE Corpus's written category?

  • Poetry (correct)

In the ICE Corpus's 'Spoken' category, which subcategory is the largest?

  • Dialogs (correct)

Which of the following is NOT a subcategory of the ICE Corpus's 'Spoken' category?

  • Newspaper Articles (correct)

In the ICE Corpus's 'Spoken' category, which subcategory belongs to 'Monologues' and 'Unscripted'?

  • Unscripted Speeches (correct)

In the ICE Corpus's 'Written' category, which subcategory is included under 'Non-printed' and 'Letters'?

  • Business Letters (correct)

Which of the following subcategories of the ICE Corpus's 'Written' category is NOT classified under 'Printed'?

  • Student Writing (correct)

What is the approximate size of the Corpus of Contemporary American English (COCA)?

  • 1 billion words (correct)

What is the time period covered by the Corpus of Historical American English (COHA)?

  • 1820s-2010s (correct)

What is the role of POS tagging in the analysis of large corpora?

  • It separates words into different categories based on their grammatical functions. (correct)

Which of the following is NOT a benefit of using POS tagging for language analysis?

  • Compiling a comprehensive dictionary of all words in the corpus (correct)

According to the passage, what is the significance of the word 'fast' in the context of the 'NOW' corpus?

  • It exemplifies the need for a robust tagging system to handle complex language data. (correct)

What is the main difference between the 'ICE Canada' and 'NOW' corpora discussed in the text?

  • The size and scope of the corpora. (correct)

What is the purpose of the provided numerical data on the 'NOW' corpus?

  • To highlight the challenges associated with analyzing large text datasets. (correct)

What does the abbreviation 'POS' stand for in the given text?

  • Part of Speech (correct)

What is the primary function of the 'CLAWS' system?

  • To assign a word class to each word in the corpus. (correct)

What is the significance of the 'CLAWS7 tagset' mentioned in the passage?

  • It is a collection of standardized codes for tagging different grammatical features. (correct)

What is the approximate size of the BROWN corpus?

  • 1 million words (correct)

What does the term 'diachronic' refer to in the context of corpora?

  • Analysis of language data across different time periods (correct)

How many samples are included in the BROWN corpus?

  • 500 (correct)

According to the provided text, what are some examples of criteria used to categorize corpora?

  • Source of data (written, spoken, both) and its type (text, speech) (correct)

What is a common characteristic of corpora following the BROWN design?

  • They allow for diachronic and cross-variety analyses (correct)

What is a corpus?

  • A collection of texts that represent a language or domain. (correct)

How does reading a corpus differ from reading a single text?

  • A corpus is read vertically, focusing on formal patterning, while a single text is read horizontally, focusing on content. (correct)

Which of the following is NOT a common characteristic of a corpus?

  • It is always a collection of printed texts. (correct)

What is the purpose of annotating a corpus?

  • To make it easier to analyze text. (correct)

How does the structure of a single text relate to the structure of a corpus?

  • A corpus can be thought of as a collection of single texts, but the corpus may have its own structure and markings. (correct)

How can parole help us understand langue?

  • Parole is a single act of communication, while langue is a system of language, but the study of parole can reveal patterns within langue. (correct)

Why is it important for a corpus to be representative of a language or domain?

  • So that it can provide insights into the language as a whole. (correct)

What is the main characteristic of a corpus that makes it useful for research?

  • It is carefully selected to represent a specific language or domain. (correct)

Flashcards

Corpus

A corpus is a structured collection of written or spoken texts, designed for analysis by a computer.

Metadata

Data that provides information about other data, such as author, genre, and title in a corpus.

Corpus design

The process of creating a corpus, including selecting texts and defining its structure.

Raw data to corpus file

The transformation of unprocessed data into a structured format suitable for analysis.

Corpus structure

The organization and arrangement of textual data within a corpus for computational use.

Difference between Text and Corpus

Text is read as a unique event, while a corpus provides a sample for investigation across multiple texts.

Text Reading Approach

Text is read horizontally for content, whereas a corpus is read vertically for patterns.

Events in Text and Corpus

Text is considered a unique act, while corpus focuses on repeated events for analysis.

Social Practice

Corpus treats data as a sample of social activity, not just as individual events like text does.

Corpus Annotation

Most corpora include detailed annotation for linguistic features like sentence IDs and grammatical tags.

Types of Data in Corpus

Corpora consist of written and spoken data transcripts, allowing diverse linguistic analysis.

Machine-readable

Corpus texts are designed to be processed by machines for linguistic investigations.

Corpora

Collections of written or spoken texts used for linguistic research.

Mode of Corpora

Classification criterion based on whether data is written, spoken, or both.

Time Depth

Refers to the timeframe of the language data: synchronic, diachronic, or historical.

Specificness of Corpora

How focussed the data is on genres, types, or specific varieties.

The BROWN Family of Corpora

A family of written-English corpora modeled on the BROWN corpus of American English (texts from 1961), designed for linguistic analysis.

GloWbE

Global Web-Based English, a corpus with 1.9 billion words from 20 countries.

NOW

News on the Web, the largest monitor corpus of English with 20.3 billion words.

BNC

British National Corpus, contains 100 million words, both written and spoken.

ICLE

International Corpus of Learner English, focuses on essays from advanced learners.

International Corpus of English (ICE)

A collection of spoken and written English from various countries.

Spoken Dialogues

Interpersonal communication captured in conversation format.

Legal cross-examinations

Formal questioning in the court used for evidence gathering.

Scripted Monologues

Prepared speeches or talks delivered by an individual.

COCA

Corpus of Contemporary American English, a large English corpus.

COHA

Corpus of Historical American English, focusing on language from the 1820s-2010s.

Non-printed writing

Forms of writing that do not appear in physical books or articles.

Academic writing

Formal writing style used in scholarly articles and papers.

Creative writing

Writing that expresses imagination, often seen in fiction like stories and novels.

Broadcast news

News content delivered through radio or television.

Part-of-speech (POS) tagging

The process of assigning word classes to words in a text, like nouns and verbs.

POS tag format

Tags are often added to word forms in the format '_TAG', indicating their word class.

Word class ambiguity

Some words can belong to multiple classes, such as 'fast', which can be an adjective, adverb, noun, or verb.

CLAWS tagger

A widely used POS tagger that assigns word classes with 96-97% accuracy.

CLAWS7 tagset

A set of 137 tags used by CLAWS, including tags for determiners and verbs.

KWIC output

Key Word In Context output shows words in their textual surroundings for analysis.

Existential there

A form of 'there' used to indicate existence in sentences, e.g., 'There is a cat.'

Usefulness of POS tagging

POS tagging simplifies the process of finding and analyzing specific word usages in texts.

Study Notes

Corpus Linguistics (1)

  • Multilingualism is a common phenomenon globally
  • Diglossia involves using two languages in a speech community for different functions
  • Code-switching signifies changing between languages based on the situation
  • Code-mixing involves switching between languages within communication
  • English usage varies globally, including in Anglosphere countries (UK, Ireland, USA, Canada, Australia, New Zealand) and former colonial territories in Africa, Asia, the Caribbean, and Oceania
  • World Englishes are described using frameworks such as Kachru's Three Circles Model and Schneider's Dynamic Model
  • Pidgins and creoles are typologically unrelated languages, evolving through language contact and exhibiting simplified structures
  • These languages are prevalent in West Africa, the Caribbean, the US and Central America, and the Pacific.

Today's Lecture

  • The lecture covers Introduction, Corpora, Concordancing, and Part-of-speech Tagging

Introduction to Corpus Linguistics

  • Corpus linguistics is a scientific method of language analysis relying on data from language corpora to support statements about language.
  • The BROWN Corpus (1 million words), released in 1967, was a significant early corpus
  • Modern corpora encompass billions of words
  • Corpus linguistics aids in language trend analysis and in the creation of authentic language teaching materials

Why Corpus Linguistics is Useful

  • Language trend analysis: Identifying trends in language use over time and on social media
  • Language teaching: Creating authentic teaching materials (e.g. ESL textbooks) based on real-life language use
  • Lexicography: Enhancing dictionary definitions with usage information
  • AI & NLP: Improving computational models of language understanding and training AI models (Siri, Alexa)
  • Discourse analysis: Uncovering patterns and structures in spoken/written text (e.g. political speeches)
  • Linguistic hypothesis testing: Validating or refuting language usage theories

What is a Corpus?

  • A corpus is a collection of electronic texts designed to represent a language or a language variety, built according to specific criteria for the purpose of study.
  • Corpora aim to represent language naturally
  • Corpora can consist of written or spoken texts (including transcripts).
  • Corpora are machine-readable and are sampled to be representative
  • Corpora are large collections of texts which enable investigation of language phenomena

Difference between text and Corpus

  • A text is a whole, coherent communicative event that is read horizontally and is an individual act of will.
  • A corpus is fragmented and read vertically for formal patterning and repeated events; it represents language use as social practice.

Corpus Structure

  • Corpus structure and level of annotation vary greatly
  • A corpus frequently consists of annotated transcripts of written and spoken data (e.g. in file form).
  • Corpora often include a manual, metadata (e.g., author, genre, title), and/or an online interface for exploration

From Raw Data to Corpus File

  • Raw data can be transformed into a corpus file using XML (Extensible Markup Language).
  • Metadata includes aspects such as the date, location, author, and the nature of the written or spoken text
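As a minimal sketch of this step, the Python snippet below wraps a raw text in an XML structure with a metadata header. The element names (`text`, `header`, `body`, `s`), the sample metadata values, and the naive sentence splitting are all assumptions made for illustration; real corpora define their own schemas and segmentation rules.

```python
import xml.etree.ElementTree as ET

def build_corpus_file(raw_text, metadata):
    """Wrap raw text and its metadata in a simple (hypothetical) XML layout."""
    root = ET.Element("text")
    # Metadata goes into a header, one child element per field.
    header = ET.SubElement(root, "header")
    for key, value in metadata.items():
        ET.SubElement(header, key).text = value
    body = ET.SubElement(root, "body")
    # Naive split on full stops; an assumption, not a real segmenter.
    sentences = [s.strip() for s in raw_text.split(".") if s.strip()]
    for i, sentence in enumerate(sentences, start=1):
        s_elem = ET.SubElement(body, "s", id=str(i))  # sentence ID as attribute
        s_elem.text = sentence + "."
    return root

raw = "There is a cat. The cat runs fast."
meta = {"author": "unknown", "genre": "example", "date": "2024-01-01", "location": "n/a"}
root = build_corpus_file(raw, meta)
print(ET.tostring(root, encoding="unicode"))
```

Keeping metadata in a header separate from the body means a tool can filter texts by author or genre without touching the text itself.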

Corpora Overview

  • Hundreds of corpora exist, covering various aspects of the English language
  • Corpora can be categorized by mode (written, spoken, both), time depth (synchronic, diachronic, historical), and topic specificity
  • Comprehensive online corpus lists exist

The BROWN Family

  • A large collection of written English
  • Comprises 8 major genres (press, religion, skills, hobbies, popular lore, belles-lettres, miscellaneous, humor)
  • Provides data for diachronic and comparative analysis due to its development over time

The International Corpus of English (ICE)

  • Each component corpus contains 1 million words
  • Includes written and spoken data (student essays, news reports, conversations) for different varieties of English.

ICE Design: Spoken and Written

  • Contains multiple varieties of English, with specific quantities for different genres for both written and spoken

COCA and COHA

  • COCA (Corpus of Contemporary American English): 1 billion words; covers 1990-2019
  • COHA (Corpus of Historical American English): 475 million words; covers 1820s-2010s

GloWbE and NOW

  • GloWbE (Global Web-Based English) comprises 1.9 billion words from 20 countries across various linguistic domains; collected between 2012 and 2013
  • NOW (News on the Web) is an up-to-the-minute corpus comprising 20.3 billion words, collected daily

BNC and ICLE

  • BNC (British National Corpus): 100 million words; encompasses written and spoken British English from 1980-1993; 7 categories of textual data
  • ICLE (International Corpus of Learner English): 3.7 million words, focusing on learner essays and diverse varieties
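To put the sizes quoted in these notes side by side, the short Python sketch below lists each corpus and its size relative to BROWN. The word counts are the figures given in the notes; the function name `size_ratio` is a hypothetical helper chosen for this illustration.

```python
# Word counts as stated in these notes.
corpus_sizes = {
    "BROWN": 1_000_000,
    "ICLE": 3_700_000,
    "BNC": 100_000_000,
    "COHA": 475_000_000,
    "COCA": 1_000_000_000,
    "GloWbE": 1_900_000_000,
    "NOW": 20_300_000_000,
}

def size_ratio(corpus, baseline="BROWN"):
    """Return how many times larger `corpus` is than `baseline`."""
    return corpus_sizes[corpus] / corpus_sizes[baseline]

# Print corpora from smallest to largest with their size relative to BROWN.
for name, size in sorted(corpus_sizes.items(), key=lambda kv: kv[1]):
    print(f"{name:<7}{size:>15,} words  ({size_ratio(name):,.1f}x BROWN)")
```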

Concordancing Introduction

  • Concordancing helps linguists investigate word occurrences and behavior in real-life contexts.
  • Traditional approaches rely on the native speaker's intuition
  • A concordance lists word forms together with their surrounding context, also known as co-text
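The idea of a concordance can be sketched in a few lines of Python. This is only an illustration of the principle, not how any particular concordancer works; the function name `kwic`, the `window` parameter, and the sample sentence are assumptions, and tokenization is reduced to lowercased word characters.

```python
import re

def kwic(text, keyword, window=3):
    """Return (left co-text, node word, right co-text) for each hit."""
    # Simplistic tokenization: lowercase, keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

sample = "There is a cat on the mat. The cat runs fast, and the fast car stops."
# Align the node word in a column so the hits can be read vertically.
for left, node, right in kwic(sample, "fast"):
    print(f"{left:>25}  [{node}]  {right}")
```

Lining up the node word in one column is what makes the vertical reading of a corpus possible: recurring patterns to the left and right of the keyword become visible at a glance.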

AntConc

  • AntConc is widely used for corpus analysis
  • It facilitates diverse searches from individual letters to larger constructions, like sentences
  • It's available across many platforms, is freely accessible, well-documented and user-friendly

Part-of-speech (POS) tagging Introduction

  • POS tagging assigns grammatical categories (e.g., noun, verb, adjective) to words in a text
  • This aids in understanding word classes and inflections.
  • CLAWS is a commonly used automated tagging system
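Tagged text in the common `word_TAG` format can be read back into word and tag pairs with a few lines of Python. The sketch below uses a hand-tagged sample with CLAWS7-style tags (`EX` for existential there, `NN1` for a singular common noun); it is not actual CLAWS output, and the function name `parse_tagged` is hypothetical. It assumes every token contains an underscore.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST underscore, so a word that itself
        # contains an underscore keeps its full form.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

# Hand-tagged illustration, not real tagger output.
tagged = "There_EX is_VBZ a_AT1 cat_NN1"
pairs = parse_tagged(tagged)

# With tags available, existential 'there' can be found by tag alone.
existential = [word for word, tag in pairs if tag == "EX"]
print(pairs)
print(existential)
```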

POS tagging Usefulness

Comparing the tags assigned to a word, and how often each tag occurs, shows how the word is used in different contexts.

  • Provides insight into how frequently a word appears as different word classes (e.g. as an adjective or an adverb), helping to clarify its usage
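A small Python sketch of this kind of count is given below. It tallies the tags attached to one word form in `word_TAG` text. The sample line is hand-tagged with CLAWS7-style tags (`JJ` adjective, `RR` adverb, `NN1` singular noun) purely for illustration, and `tag_distribution` is a hypothetical helper name.

```python
from collections import Counter

def tag_distribution(tagged_text, word):
    """Count how often `word` occurs with each POS tag."""
    counts = Counter()
    for token in tagged_text.split():
        form, _, tag = token.rpartition("_")
        if form.lower() == word.lower():
            counts[tag] += 1
    return counts

# Hand-tagged illustration, not real tagger output.
tagged = ("a_AT1 fast_JJ car_NN1 drives_VVZ fast_RR "
          "the_AT fast_NN1 ended_VVD a_AT1 fast_JJ train_NN1")
dist = tag_distribution(tagged, "fast")
for tag, n in dist.most_common():
    print(tag, n)
```

Without tags, a plain search for "fast" would return all four hits together; with tags, the adjective, adverb, and noun uses can be separated and counted.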

TagAnt

  • TagAnt is a versatile free tagging tool for corpus analysis.
  • Various tagsets (inventories of word-class tags) may be selected based on project needs

Keywords

  • A list of common terms used in corpus linguistics, including types of corpora, analysis techniques, and software tools.
