Language Corpus Linguistics

Questions and Answers

What was a limitation on early electronic corpus creation?

  • No need to enter material manually
  • Automated processing for language features
  • Absence of the Internet for sharing (correct)
  • Unlimited data storage capacity

What primarily drove renewed interest in corpus compilation in the 1980s?

  • Challenges related to language sharing
  • The need for manual data entry methods
  • The desire to drive language processing software (correct)
  • Limitations in computer speed and capacity

What is a characteristic of the 'golden era' of linguistic corpora?

  • Limited availability of multilingual parallel corpora
  • Decreased interest in language analysis
  • Focus on monolingual corpora only
  • Government-funded projects compiling corpora (correct)

What function do the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA) serve?

Serving as repositories and distributors of corpora

What is the initial stage of corpus creation?

Data capture

What is the most common representation format for linguistic corpora nowadays?

XML

What does the notion of 'stand-off annotation' require?

Annotations are encoded in documents separate from the primary data.

What is a consideration when creating linguistic corpora?

Gross structure of the texts.

What do programs for sentence splitting and word tokenization need?

Language-specific information.

What is a common problem with freely available tools for identifying named entities?

They may not perform well on humanist data like literary works.

What is Part-of-speech (POS) tagging primarily used for?

Adding morpho-syntactic annotation.

What do stochastic taggers rely on?

Probabilities for n-grams.

What is a characteristic of the Penn Treebank tagset?

It eliminates information retrievable from the form of the lexical item.

What ability does the presence of lemmas in an annotated text provide?

To extract all orthographic forms associated with a given lemma.

Which type of parallel alignment is generally easier and more accurate?

Sentence alignment.

What is a key function of syntactically annotated corpora?

To drive syntactic parsers.

What function does semantic annotation provide?

Information about the meaning of elements in a text.

What is WordNet used for?

Providing lists of senses and grouping words into synsets.

What remains one of the more difficult problems in language processing?

Automatic means to sense-tag data.

What is the focus of co-reference annotation?

Relating objects to prior elements in a discourse.

What does discourse structure annotation identify?

Multi-level hierarchies of discourse segments and the relations between them.

What information can be included in the orthographic transcription of speech data?

All are correct.

What is a difficulty associated with speech annotation?

The subjectivity of decisions about tone and other aspects of speech.

What is the view of the annotation process based on the MULTEXT project?

A chain of smaller, individual processes that incrementally add annotations.

What is the purpose of the ISO model for linguistic annotations?

To provide a common pivot format for interchange.

What are potential impacts regarding the Semantic Web for corpus annotation and analysis?

The Semantic Web enables definition of the kinds of relationships one resource may have with another.

What do ontologies provide?

A priori information about relationships among categories of data.

What led to the full-scale usage of data-driven methods to confront generalized problems in computational linguistics?

The increased availability of large amounts of electronic text.

In what area of language processing were corpora first put to large-scale use?

Speech recognition

What is the purpose of statistical methods in probabilistic parsing?

To choose the most probable interpretation.

What must a corpus do to be representative of any language as a whole?

Include samples from a variety of texts.

What can applying unbalanced corpora to natural language processing lead to?

Drastically skewed results.

Why are large training corpora of speech needed when a state-of-the-art speech recognition research effort moves to a new domain?

Speech recognition systems are notoriously dependent on the characteristics of their training corpora.

What does gathering authentic language data from corpora enable regarding the description of language?

Describing the language from the evidence rather than from imposing some theoretical model.

What is a concordancer?

A program that displays occurrences of a word in the middle of a line of context from the corpus.

What does the Mutual Information (MI) score show?

It shows how closely one word is associated with others, based on the regularity with which they co-occur in context.

What needs have led to increased collaboration with computational linguists?

The need for more informative results to serve the needs of lexicography.

What should the collaboration of Humanists and researchers in computational linguistics lead to?

Vastly increased capabilities for both.

What led to the creation of corpora in electronic form in the 1950s?

The availability of computers.

Which of these corpora were compiled in the 1960s using a representative sample of texts produced in the year 1961?

The Brown Corpus and the LOB corpus.

What significant advancement occurred in the 1980s that facilitated the creation of larger corpora?

Increased computer speed and text production in computerized form.

What is a parallel corpus?

A corpus containing the same text in two or more languages.

Where can numerous corpora be obtained?

The Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA).

Flashcards

Linguistic Corpus

A collection of machine-readable texts used for linguistic analysis.

POS Tagging

Assigning part-of-speech tags to words in a corpus automatically.

Parallel Corpus

A corpus containing the same text in multiple languages.

Parallel Text Alignment

Associating words/sentences with their translations in a parallel corpus.

EAGLES XML Corpus Encoding Standard (XCES)

XML standard for encoding linguistic corpora and annotations.

Stand-off Annotation

Annotations encoded in separate documents linked to primary data.

Sentence Splitting

Splitting text into individual sentences.

Word Tokenization

Breaking sentences into individual words or tokens.

Named Entity Recognition

Identifying proper names, dates, and organizations in text.

Morpho-syntactic Annotation

Assigning syntactic labels (noun, verb, etc.) to words.

Stochastic Tagger

Algorithm using probabilities of tag sequences.

Lemma

The base form of a word.

Word Sense Disambiguation

Assigning the correct meaning (sense) to a word in context.

Co-reference Resolution

Linking mentions of the same entity in a text.

Discourse Structure Annotation

Hierarchy of discourse segments and relations between them.

Speech Annotation

Time-aligned orthographic transcription of speech.

Concordancer

Displaying occurrences of a word with its surrounding context.

Mutual Information (MI)

Statistical measure of how closely words are associated.

Resource Description Framework (RDF)

Enables definition of relationships between resources.

Web Ontology Language (OWL)

Specifies and accesses ontological information.

Study Notes

Overview

  • A corpus is a crucial resource for language research.
  • Electronic corpora emerged in the 1950s alongside the advent of computers, facilitating automated searches, frequency calculations, distributional analyses, and descriptive statistics.
  • Early uses focused on literary analysis, authorship studies, and lexicography but were limited by computer storage and processing speeds.
  • Data sharing was limited prior to the Internet, meaning corpora were generally processed and stored in single locations.
  • Two exceptions were the Brown Corpus of American English and the London/Oslo/Bergen (LOB) corpus of British English, each containing one million part-of-speech-tagged words from texts produced in 1961.
  • Brown and LOB were the main computer-readable corpora of general language for many years.
  • Computer capabilities increased in the 1980s, leading to corpora containing millions of words.
  • Larger language samples allowed for meaningful statistics about language patterns, which renewed interest in corpus compilation especially for computational linguistics.
  • Parallel corpora containing texts in multiple languages also appeared, such as the Canadian Hansard corpus.
  • Corpus creation was still a lot of work, even with electronic texts, which often needed extensive processing of typesetting codes.
  • Since 1990, there has been a "golden era" of linguistic corpora with massive text and speech corpora being compiled, often via government-funded projects.
  • Automatic techniques developed in the 1990s enabled automated annotation of linguistic properties, such as part-of-speech tagging and parallel text alignment, with 95–98% accuracy.
  • Automatic identification of syntactic configurations, proper names, and dates was also developed.
  • The Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA) were founded in the mid-1990s as repositories and distributors of corpora and lexicons.
  • Existing corpora vary in composition due to the cost of obtaining texts, with few being "balanced" in genre representation.
  • Exceptions include the British National Corpus (BNC), the American National Corpus (ANC), and PAROLE project corpora.
  • Most text corpora comprise readily available materials like newspaper data and web content.
  • Speech data is often more representative of specific dialects due to controlled acquisition.
  • Many corpora are available for research under license for a small fee, but some require a substantial payment.

Preparation of Linguistic Corpora

  • Initial data capture converts text to electronic form via manual entry, OCR, word processor output, or PDF files.
  • Manual entry is costly and unsuitable for large corpora.
  • OCR output can also be costly if it needs heavy editing.
  • Electronic data from other sources often contain formatting codes that must be removed or translated.

Representation Formats

  • XML is currently the most common representation format for linguistic corpora.
  • The EAGLES XML Corpus Encoding Standard (XCES) is a Text Encoding Initiative (TEI)-compliant XML application designed for encoding linguistic corpora and their annotations.
  • XCES uses stand-off annotation, where annotations are in separate documents linked to the primary data.
  • This avoids challenges like overlapping hierarchies and unwieldy documents and enables different annotation schemes for the same feature.
  • Stand-off annotation also satisfies the requirements outlined in Leech (1993): annotations can be removed to recover the raw corpus, or extracted and processed separately (a minimal sketch follows this list).
  • Stand-off annotation is now widely accepted, although many corpora still include annotations in the same document as the text, because inter-document linking mechanisms are relatively new to XML.
  • The XCES identifies two types of information that may be encoded in the primary data: gross structure and segmental structure.
  • Gross structure includes universal text elements and features of typography, while segmental structure includes language-dependent elements at the sub-paragraph level.
  • Annotations are linked to primary data through XML conventions (XLink, XPointer).
  • Speech data is often treated as "read-only," with stand-off documents identifying start and end points of structures.
  • The ATLAS project links annotations to speech data, which carries no XML tagging, using byte offsets.
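
To make stand-off annotation concrete, here is a minimal sketch in Python rather than the XCES XML itself (whose linking relies on the conventions above): the primary text is left untouched, and a separate annotation layer points back into it by character offsets.

```python
# Minimal sketch of stand-off annotation: the primary data is read-only;
# annotations live in a separate structure that references it by offsets.
text = "Dr. Smith arrived in Paris on Monday."

# Each annotation records a span in the primary data plus a label.
annotations = [
    {"start": 0,  "end": 9,  "type": "person", "value": "Dr. Smith"},
    {"start": 21, "end": 26, "type": "place",  "value": "Paris"},
    {"start": 30, "end": 36, "type": "date",   "value": "Monday"},
]

# The raw corpus is recovered by simply ignoring the annotation layer,
# and alternative annotation schemes can coexist as separate layers.
for ann in annotations:
    assert text[ann["start"]:ann["end"]] == ann["value"]
    print(f'{ann["type"]:8} {ann["value"]}')
```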

Identification of Segmental Structures

  • Markup identifying gross structures can be automatically generated from original formatting.
  • Original formatting is presentational rather than descriptive, making automatic transduction challenging.
  • Identifying sub-paragraph structures like sentences, words, names, and dates is almost always needed.
  • Many programs for sentence splitting and word tokenization are available, including those in GATE.
  • Sentence splitting and tokenization are language-dependent and require language-specific information (see the sketch after this list).
  • Languages without word-boundary markers require different segmentation approaches, such as dynamic programming.
  • Software exists to identify named entities, dates, and time expressions, but it is often trained on newspaper and government-report corpora.
  • Available data skews the development of tools and algorithms, which limits their applicability to generalized corpora.
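
A minimal sketch of sentence splitting and word tokenization using NLTK (one freely available toolkit; GATE, mentioned above, is another). It assumes the pre-trained English "punkt" model has been fetched with nltk.download("punkt").

```python
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Dr. Smith arrived in Paris. He spoke on Monday."

# The punkt model carries the language-specific information needed to
# avoid splitting the sentence at abbreviations such as "Dr.".
for sentence in sent_tokenize(text, language="english"):
    print(word_tokenize(sentence))
# ['Dr.', 'Smith', 'arrived', 'in', 'Paris', '.']
# ['He', 'spoke', 'on', 'Monday', '.']
```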

Corpus Annotation

Morpho-syntactic Annotation

  • Morpho-syntactic annotation (part-of-speech tagging) is the most common due to highly accurate automatic taggers.
  • Part-of-speech tagging involves disambiguation to determine the correct part of speech in context.
  • English has high rates of part-of-speech ambiguity for words such as "that", and many cases of verb/noun ambiguity, e.g., "hide", "dog", "brand", "felt".
  • Taggers are either rule-based, stochastic, or hybrid (e.g., Brill tagger).
  • Stochastic taggers are trained on hand-validated data.
  • More accurately tagged training data yields better probability estimates, which in turn produce more accurately tagged corpora.
  • Generated tags give more information than simple word class.
  • Tagsets with 50-100 tags are typically used in automatic tagging.
  • Common English tagsets are derived from the 87 tags used in the Brown corpus.
  • The 45-tag set of the Penn Treebank project is widely used.
  • The 61-tag C5 tagset of the CLAWS tagger is used to tag the British National Corpus.
  • Part-of-speech taggers use lexicons that provide possible part-of-speech assignments.
  • It is often hard to map one tagset to another, requiring re-creation of lexical information.
  • Lexicons often include lemmas.
  • Morpho-syntactic annotation can generate lemmas as well as part-of-speech tags.
  • Lemma annotation makes it possible to extract all orthographic forms associated with a given lemma (the sketch after this list includes lemmatization).
  • Many corpora lack lemmas, but there are exceptions like the Journal of the Commission corpus.
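
A hedged sketch of tagging and lemmatization with NLTK, whose default tagger produces Penn Treebank tags; the small helper that maps Treebank tags onto WordNet's coarser word classes is our own illustrative addition.

```python
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

tokens = word_tokenize("The dogs felt that the hides were branded.")
tagged = pos_tag(tokens)  # e.g. [('The', 'DT'), ('dogs', 'NNS'), ...]

lemmatizer = WordNetLemmatizer()

def wn_pos(treebank_tag):
    """Map a Penn Treebank tag onto WordNet's coarse word classes."""
    return {"J": "a", "V": "v", "R": "r"}.get(treebank_tag[0], "n")

# Lemmas group orthographic forms: "dogs" -> "dog", "were" -> "be".
for word, tag in tagged:
    print(word, tag, lemmatizer.lemmatize(word.lower(), wn_pos(tag)))
```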

Parallel Alignment

  • Algorithms exist for aligning parallel texts, which can provide useful information for machine translation systems.
  • Parallel corpora are also used to generate bilingual dictionaries automatically and to achieve automatic sense tagging.
  • Sentence and word alignment are two common types.
  • Sentence alignment is generally easier and more accurate, with one-to-many mappings being the main challenge (a toy aligner is sketched after this list).
  • Most parallel aligned corpora include only two languages, such as the English-French Hansard Corpus.
  • Multilingual parallel corpora are rare due to the lack of texts in multiple languages.
  • Multilingual aligned corpora include the United Nations Parallel Text Corpus and the Orwell 1984 Corpus.
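
A toy length-based sentence aligner in the spirit of Gale and Church's classic approach (our assumption; production aligners are more elaborate): translated sentences tend to have proportional lengths, so dynamic programming over lengths recovers a plausible pairing. Only 1-1 and skip (1-0 / 0-1) steps are handled here; the one-to-many mappings noted above are the hard cases.

```python
def align(src, tgt, skip_cost=10.0):
    """Align two lists of sentences by character-length similarity."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    # cost[i][j]: best cost of aligning the first i source sentences
    # with the first j target sentences; back[i][j] records the path.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # 1-1 step: penalize length mismatch
                c = cost[i - 1][j - 1] + abs(len(src[i - 1]) - len(tgt[j - 1]))
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j - 1)
            for pi, pj in ((i - 1, j), (i, j - 1)):  # 1-0 / 0-1 skips
                if pi >= 0 and pj >= 0 and cost[pi][pj] + skip_cost < cost[i][j]:
                    cost[i][j], back[i][j] = cost[pi][pj] + skip_cost, (pi, pj)
    pairs, i, j = [], n, m
    while back[i][j] is not None:  # trace the best path backwards
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((src[i - 1], tgt[j - 1]))
        i, j = pi, pj
    return list(reversed(pairs))

src = ["Hello.", "How are you today?"]
tgt = ["Bonjour.", "Comment allez-vous aujourd'hui ?"]
print(align(src, tgt))  # both pairs align 1-1
```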

Syntactic Annotation

  • Noun phrase (NP) bracketing/chunking and treebank creation are two main types of syntactic annotation.
  • Syntactically annotated corpora support statistics-based applications, such as probabilistic parsing.
  • They also provide data for theoretical linguistic studies.
  • The Penn Treebank for English is the most widely used treebank.
  • Ongoing projects exist to develop treebanks for other languages such as German and Czech.
  • Many schemes provide constituency-based representation of relations among syntactic components.
  • Dependency schemes instead specify grammatical relations among elements, without hierarchical analysis (both styles are illustrated after this list).
  • Hybrid systems combine constituency analysis and functional dependencies.
  • Syntactic annotation is often interspersed with the text, making it hard to add other annotations.
  • Stand-off annotation is therefore encouraged; one such scheme was developed by Ide and Romary (2001).
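
To contrast the two representation styles just described, a small sketch using NLTK's Tree class for a Penn-Treebank-style constituency bracketing, with a hand-written dependency analysis of the same sentence alongside (the relation labels are illustrative):

```python
from nltk import Tree

# Constituency: hierarchical phrase structure.
constituency = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
constituency.pretty_print()

# Dependency: flat grammatical relations between words, no phrase nodes.
dependencies = [("barked", "nsubj", "dog"), ("dog", "det", "the")]
for head, relation, dependent in dependencies:
    print(f"{relation}({head}, {dependent})")
```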

Semantic Annotation

  • Semantic annotation adds information about the meaning of text elements.
  • Some annotations, such as case role, are included in syntactic annotations.
  • Another type identifies words or phrases for a particular theme.
  • "Sense tagging" associates lexical items with a sense or definition from a dictionary or online lexicon like WordNet.
  • The key difficulty in sense annotation is choosing an appropriate set of senses.
  • Some attempts have been made to identify useful sense distinctions for automatic language processing.
  • WordNet, which groups words into "synsets" of synonymous words, is the most common source of sense tags for semantic annotation (see the sketch after this list).
  • WordNet distinctions are not optimal, but it is the most widely used and freely available lexicon for English.
  • The EuroWordNet project has produced WordNets for most western European languages, linked to WordNet 1.5.
  • Sense-tagging largely requires hand annotation, and even human annotators often disagree.
  • Examples of corpora that use it are the Semantic Concordance Corpus (SemCor) and the DSO Corpus.
  • Automatic means to sense-tag data have been sought; the task has come to be known as "word sense disambiguation".
  • It remains one of the more difficult problems in language processing, and statistical approaches are the most common.
  • Recent research uses information from parallel translations to make sense distinctions.
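
A hedged sketch of sense inventories and a baseline disambiguation step using NLTK: synsets are looked up in WordNet, and the simple Lesk gloss-overlap algorithm (an early knowledge-based baseline, not the statistical state of the art) picks the sense that best matches the context. Assumes the WordNet data package has been downloaded.

```python
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

# Each synset groups synonymous words under one sense, with a gloss.
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())

# Pick the sense of "bank" most consistent with the context words.
context = "I deposited my paycheck at the bank on Monday".split()
print(lesk(context, "bank", pos="n"))
```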

Discourse-Level Annotation

  • Topic identification, co-reference annotation, and discourse structure are three main discourse-level annotations.
  • Topic detection annotates texts with information about events or activities.
  • A subtask is detecting boundaries between texts.
  • Co-reference annotation links referring objects to prior elements in a discourse, a task that must be done manually.
  • Discourse structure annotation identifies hierarchies of discourse segments and relations between them.
  • Common approaches are based on focus spaces or Rhetorical Structure Theory.
  • Discourse structure annotation is almost always done by hand, although discourse segmentation software has been developed (one example is sketched below).
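
One example of such segmentation software is Hearst's TextTiling algorithm, which NLTK implements: it hypothesizes topic boundaries where the vocabulary of adjacent blocks of text diverges. A hedged sketch, following NLTK's own demo (assumes the Brown corpus and stopword data have been downloaded):

```python
from nltk.corpus import brown
from nltk.tokenize import TextTilingTokenizer

# Segment a few thousand characters of running text into topic tiles.
tt = TextTilingTokenizer()
segments = tt.tokenize(brown.raw()[:4000])
print(len(segments), "topic segments found")
print(segments[0][:100], "...")
```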

Annotation of Speech and Spoken Data

  • The most common speech annotation is a time-aligned orthographic transcription (an illustrative format is sketched after this list).
  • Annotations demarcate speech "turns" and the individual "utterances" they contain.
  • Examples of speech corpora are the CHILDES corpus and the TIMIT corpus of read speech.
  • Annotations can include part of speech, syntax, and co-reference.
  • Speech data may be annotated with phonetic segmentation, prosodic phrasing, disfluencies, and gesture.
  • These annotations are time-consuming since they must be done by hand.
  • Speech annotation is problematic since it relies on the assumption that speech sounds are clearly demarcated.
  • Prosodic annotation is subjective, where decisions about tone vary among annotators.
  • Notations for prosodic annotation vary widely, and are not typically rendered in a standard format such as XML.
  • The London-Lund Corpus of Spoken English, for example, is notably inconsistent in its format.
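
An illustrative sketch (not any real corpus format) of what a time-aligned orthographic transcription with turn and utterance demarcation might look like:

```python
# Each record anchors one utterance to start/end times in the signal.
transcription = [
    {"speaker": "A", "start": 0.00, "end": 2.35,
     "utterance": "so what did you think of the talk"},
    {"speaker": "B", "start": 2.41, "end": 4.10,
     "utterance": "uh it was- it was fine i guess"},  # disfluency retained
]

for turn in transcription:
    print(f"[{turn['start']:5.2f}-{turn['end']:5.2f}] "
          f"{turn['speaker']}: {turn['utterance']}")
```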

Corpus Annotation Tools

  • Projects have created tools to facilitate annotation of linguistic corpora.
  • Most are based on the MULTEXT project model, which views annotation as a chain of incremental processes (sketched after this list).
  • The TIPSTER project developed a similar model.
  • LT XML and GATE are existing annotation tools for language data.
  • The Multilevel Annotation Tools Engineering (MATE) project provides an annotation tool suite for spoken dialogue corpora.
  • ATLAS is a joint initiative to build a general-purpose annotation architecture and interchange format.
  • A subcommittee of the International Organization for Standardization (ISO) is developing a generalized model for linguistic annotations and processing tools.
  • The goal is to provide a common "pivot" format for data and annotations to enable seamless interchange.
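
A minimal sketch of the MULTEXT-style view of annotation as a chain of small processes, each adding one layer to the output of the previous one; the function names are our own illustration, not any real tool's API.

```python
def pipeline(text, *steps):
    document = {"text": text}      # start from the raw primary data
    for step in steps:
        document = step(document)  # each step adds one annotation layer
    return document

def split_sentences(doc):
    doc["sentences"] = doc["text"].split(". ")
    return doc

def tokenize(doc):
    doc["tokens"] = [s.split() for s in doc["sentences"]]
    return doc

annotated = pipeline("Corpora grew. Tools followed.", split_sentences, tokenize)
print(annotated["tokens"])
```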

The Future of Corpus Annotation

  • Recent developments in XML have focused attention on building a Semantic Web.
  • The Semantic Web enables defining relationships between resources.
  • Technologies like the Resource Description Framework (RDF) can affect how annotations are associated with language resources (a small RDF sketch follows this list).
  • Another activity is developing technologies to support the specification of and access to ontological information.
  • Ontologies provide a priori information about relationships among data categories.
  • Ontologies enable inferencing processes to yield information not explicit in the data.
  • The W3C is providing a standard representation format via the Web Ontology Language (OWL).
  • It is up to the computational linguistics community to use it to annotate and analyze language data.
  • Semantic Web technologies will enable development of common and universally accessible ontological information.
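
A hedged sketch using the rdflib library (one common Python RDF toolkit, assumed here; the vocabulary URIs are invented for illustration) showing how an annotation can be stated as RDF triples relating one resource to another:

```python
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/corpus#")  # hypothetical vocabulary
g = Graph()

token = URIRef(EX["token-42"])
g.add((token, EX.partOfSpeech, Literal("NN")))
g.add((token, EX.lemma, Literal("corpus")))
g.add((token, EX.annotatedIn, URIRef(EX["pos-layer-1"])))

print(g.serialize(format="turtle"))
```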

Analysis of Linguistic Corpora

Statistics Gathering

  • A corpus provides a bank of samples for developing numerical language models.
  • This goes hand in hand with empirical methods.
  • Corpora enabled data-driven methods to solve computational linguistics problems in the late 1980s.
  • Success with statistical methods led to their wider application in statistical language processing.
  • Key to this approach is the availability of large, annotated corpora for training.
  • Stochastic part-of-speech taggers rely on previously annotated corpora (a toy example of the statistics involved follows this list).
  • In word sense disambiguation, statistics are gathered reflecting the context of previously sense-tagged words.
  • These statistics are used to disambiguate occurrences in untagged corpora.
  • Speech recognition first used corpora at a large scale in language processing, starting with Hidden Markov Models (HMMs) in the 1970s.
  • Researchers trained a French-English correspondence model on Canadian parliamentary records, together with a language model for generating English.
  • Probabilistic parsing is a more recent application which requires previously annotated and validated data for statistics.
  • A probabilistic parser uses previously gathered statistics to choose the most probable interpretation.
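
A toy sketch of the statistics a stochastic tagger gathers from previously annotated data: tag-bigram counts turned into transition probabilities P(tag_i | tag_i-1). Real training corpora are vastly larger, and full HMM taggers also model word-given-tag probabilities.

```python
from collections import Counter

tagged_corpus = [  # tiny hand-validated sample; real ones are huge
    [("the", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("the", "DT"), ("felt", "NN"), ("tore", "VBD")],
    [("dogs", "NNS"), ("felt", "VBD"), ("pain", "NN")],
]

unigrams, bigrams = Counter(), Counter()
for sentence in tagged_corpus:
    tags = ["<s>"] + [tag for _, tag in sentence]
    unigrams.update(tags[:-1])
    bigrams.update(zip(tags[:-1], tags[1:]))

# P(NN|DT) vs P(VBD|DT): evidence that "felt" after "the" is a noun.
for prev, cur in [("DT", "NN"), ("DT", "VBD")]:
    print(f"P({cur}|{prev}) = {bigrams[(prev, cur)] / unigrams[prev]:.2f}")
```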

Issues for Corpus-Based Statistics Gathering

  • A corpus must include samples from various texts that reflect the range of syntactic and semantic phenomena.
  • Data must be large enough to avoid the problem of data sparseness.
  • The many senses of polysemous words must also be adequately represented.
  • Existing corpora over-represent newspaper samples.
  • Training on such unbalanced data can lead to drastically skewed results.
  • The problem of balance is acute in speech recognition since speech recognition systems are notoriously dependent on the characteristics of their training corpora.
  • Training corpora are invariably composed of written rather than spoken texts.
  • Whenever a speech recognition research effort moves to a new domain, a new training corpus of speech must be collected.

Language Analysis

  • Authentic language data enables language description based on evidence rather than theoretical models.
  • Corpora of native-speaker texts provide samples of genuine language.
  • Corpora are used for dictionary-making (lexicography), especially for "learners' dictionaries".
  • The COBUILD corpus was used to create the Collins COBUILD English Dictionary, the first corpus-based dictionary.
  • Most British dictionary publishers use corpora as the primary data source now.
  • The basic lexicographic tool is a concordancer, which displays occurrences of a word in context (a minimal sketch follows this list).
  • The huge increase in available data has led lexicographers to use computational linguistics techniques to summarize concordance data.
  • The Mutual Information (MI) score is a statistical measure of how closely one word is associated with others, based on how regularly they co-occur (also sketched below).
  • Dictionary makers team with computational linguistics researchers to glean even more precise information from corpus data.
  • Supplemental grammatical information can provide more precise understanding of word usage, especially in context.
  • Gathering this information requires relatively sophisticated language-processing software.
  • The need for more informative results has increased collaboration with computational linguists.
  • Researchers in the humanities and computational linguistics have not collaborated often, but this should change with the World Wide Web.
  • Humanists have increased access to information, tools, and resources developed by the computational linguistics community.
  • Computational linguists are likely to face new language processing challenges.
  • Collaboration should lead to vastly increased capabilities.
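
Minimal sketches of the two tools described in this list: a KWIC-style concordancer and a pointwise Mutual Information score, MI(x, y) = log2(P(x, y) / (P(x) P(y))), estimated from raw counts (zero co-occurrence counts, unhandled here, would need smoothing in practice):

```python
import math
from collections import Counter

tokens = ("the strong tea was strong and the strong coffee was hot "
          "the tea was weak").split()

def concordance(tokens, keyword, width=2):
    """Print each occurrence of keyword centered in a window of context."""
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            print(f"{left:>15} [{keyword}] {right}")

def mi(tokens, x, y, window=1):
    """Pointwise MI of y occurring within `window` words after x."""
    n = len(tokens)
    counts = Counter(tokens)
    pair = sum(1 for i, tok in enumerate(tokens[:-window])
               if tok == x and y in tokens[i + 1:i + 1 + window])
    return math.log2((pair / n) / ((counts[x] / n) * (counts[y] / n)))

concordance(tokens, "strong")
print("MI(strong, tea) =", round(mi(tokens, "strong", "tea"), 2))
```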
