Questions and Answers
What was a limitation on early electronic corpus creation?
- No need to enter material manually
- Automated processing for language features
- Absence of the Internet for sharing (correct)
- Unlimited data storage capacity
What primarily drove renewed interest in corpus compilation in the 1980s?
- Challenges related to language sharing
- The need for manual data entry methods
- The desire to drive language processing software (correct)
- Limitations in computer speed and capacity
What is a characteristic of the 'golden era' of linguistic corpora?
- Limited availability of multilingual parallel corpora
- Decreased interest in language analysis
- Focus on monolingual corpora only
- Government-funded projects compiling corpora (correct)
What function do the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA) provide?
What is the initial stage of corpus creation?
What is currently the most common representation format for linguistic corpora?
What does the notion of 'stand-off annotation' require?
What is a consideration when creating linguistic corpora?
What do programs for sentence splitting and word tokenization need?
What is a common problem with freely available tools for identifying named entities?
What is part-of-speech (POS) tagging primarily used for?
What do stochastic taggers rely on?
What is a characteristic of the Penn Treebank tagset?
What ability does the presence of lemmas in an annotated text provide?
Which type of parallel alignment is generally easier and more accurate?
What is a key function of syntactically annotated corpora?
What function does semantic annotation provide?
What is WordNet used for?
What remains one of the more difficult problems in language processing?
What is the focus of co-reference annotation?
What does discourse structure annotation identify?
What information can be included in the orthographic transcription of speech data?
What is a difficulty associated with speech annotation?
How does the MULTEXT project model view the annotation process?
What is the purpose of the ISO model for linguistic annotations?
What potential impacts does the Semantic Web have on corpus annotation and analysis?
What do ontologies provide?
What led to the full-scale use of data-driven methods for general problems in computational linguistics?
In what area of language processing were corpora first used on a large scale?
What is the purpose of statistical methods in probabilistic parsing?
What must a corpus do to be representative of a language as a whole?
What can applying unbalanced corpora to natural language processing lead to?
Why are new training corpora of speech needed when a state-of-the-art speech recognition research effort moves to a new domain?
What does gathering authentic language data from corpora enable regarding description of the language?
What is a concordancer?
What does the Mutual Information (MI) score show?
What needs have led to increased collaboration with computational linguists?
What should the collaboration of humanists and researchers in computational linguistics lead to?
What led to the creation of corpora in electronic form in the 1950s?
Which of these corpora were compiled in the 1960s using a representative sample of texts produced in 1961?
What significant advancement occurred in the 1980s that facilitated the creation of larger corpora?
What is a parallel corpus?
Where can numerous corpora be located?
Flashcards
Linguistic Corpus
A collection of machine-readable texts used for linguistic analysis.
POS Tagging
Assigning part-of-speech tags to words in a corpus automatically.
Parallel Corpus
A corpus containing the same text in multiple languages.
Parallel Text Alignment
Linking corresponding sentences or words across the languages of a parallel corpus.
EAGLES XML Corpus Encoding Standard (XCES)
A TEI-compliant XML application designed for encoding linguistic corpora and their annotations.
Stand-off Annotation
Annotation stored in documents separate from, and linked to, the primary data.
Sentence Splitting
Automatically dividing a text into individual sentences.
Word Tokenization
Automatically segmenting a text into individual word tokens.
Named Entity Recognition
Automatic identification of proper names, dates, and similar expressions in text.
Morpho-syntactic Annotation
Annotation of words with part-of-speech and related morphological information.
Stochastic Tagger
A part-of-speech tagger that relies on statistics gathered from hand-validated training data.
Lemma
The base (dictionary) form of a word, from which its orthographic forms can be retrieved.
Word Sense Disambiguation
Automatically determining which sense of a word is intended in a given context.
Co-reference Resolution
Linking referring expressions to the elements they refer to earlier in a text.
Discourse Structure Annotation
Identification of hierarchies of discourse segments and the relations between them.
Speech Annotation
Annotation of spoken data, most commonly a time-aligned orthographic transcription.
Concordancer
A tool that displays all occurrences of a word together with its surrounding context.
Mutual Information (MI)
A statistical measure of how closely one word is associated with another.
Resource Description Framework (RDF)
A Semantic Web technology for defining relationships between resources.
Web Ontology Language (OWL)
The W3C standard representation format for ontological information.
Study Notes
Overview
- A corpus is a crucial resource for language research.
- Electronic corpora emerged in the 1950s alongside the advent of computers, facilitating automated searches, frequency calculations, distributional analyses, and descriptive statistics.
- Early uses focused on literary analysis, authorship studies, and lexicography but were limited by computer storage and processing speeds.
- Data sharing was limited prior to the Internet, meaning corpora were generally processed and stored in single locations.
- Two exceptions were the Brown Corpus of American English and the London/Oslo/Bergen (LOB) corpus of British English, each containing one million words drawn from texts published in 1961 and annotated with part-of-speech tags.
- Brown and LOB were the main computer-readable corpora of general language for many years.
- Computer capabilities increased in the 1980s, leading to corpora containing millions of words.
- Larger language samples allowed for meaningful statistics about language patterns, which renewed interest in corpus compilation especially for computational linguistics.
- Parallel corpora containing texts in multiple languages also appeared, such as the Canadian Hansard corpus.
- Corpus creation still required substantial work, even with electronic texts, which often needed extensive processing to remove typesetting codes.
- Since 1990, there has been a "golden era" of linguistic corpora with massive text and speech corpora being compiled, often via government-funded projects.
- Automatic techniques developed in the 1990s enabled automated annotation of linguistic properties, such as part-of-speech tagging and parallel text alignment, with 95–98% accuracy.
- Automatic identification of syntactic configurations, proper names, and dates was also developed.
- The Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA) were founded in the mid-1990s as repositories and distributors of corpora and lexicons.
- Existing corpora vary in composition due to the cost of obtaining texts, with few being "balanced" in genre representation.
- Exceptions include the British National Corpus (BNC), the American National Corpus (ANC), and PAROLE project corpora.
- Most text corpora comprise readily available materials like newspaper data and web content.
- Speech data is often more representative of specific dialects due to controlled acquisition.
- Many corpora are available for research under license for a small fee, but some require a substantial payment.
Preparation of Linguistic Corpora
- Initial data capture converts text to electronic form via manual entry, OCR, word processor output, or PDF files.
- Manual entry is costly and unsuitable for large corpora.
- OCR output can also be costly if it needs heavy editing.
- Electronic data from other sources often contain formatting codes that must be removed or translated.
Representation Formats
- XML is currently the most common representation format for linguistic corpora.
- The EAGLES XML Corpus Encoding Standard (XCES) is a Text Encoding Initiative (TEI)-compliant XML application designed for encoding linguistic corpora and their annotations.
- XCES uses stand-off annotation, where annotations are in separate documents linked to the primary data.
- This avoids challenges like overlapping hierarchies and unwieldy documents and enables different annotation schemes for the same feature.
- Stand-off annotation makes it possible to remove annotations to recover the raw corpus, or to extract annotations separately, as outlined in Leech (1993).
- Stand-off annotation is now widely accepted, although many corpora still include annotations in the same document as the text because inter-document linking mechanisms are new to XML.
- The XCES identifies two types of information that may be encoded in the primary data: gross structure and segmental structure.
- Gross structure includes universal text elements and features of typography, while segmental structure includes language-dependent elements at the sub-paragraph level.
- Annotations are linked to primary data through XML conventions (XLink, XPointer).
- Speech data is often treated as "read-only," with stand-off documents identifying start and end points of structures.
- The ATLAS project links annotations to data in speech corpora via byte offsets rather than XML tagging; a minimal offset-based sketch follows this list.
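As a rough, non-normative sketch of the stand-off idea, the snippet below keeps a read-only primary text and records annotations in a separate structure that points into it by character offsets; the text, spans, and labels are invented for illustration.

```python
# Minimal sketch of stand-off annotation: the primary text is never modified;
# annotations live in a separate structure that points into it by character offsets.

primary_text = "Mrs. Smith arrived in Paris on Monday."

# Each annotation records a span (start, end), an annotation type, and a label.
# These spans and labels are invented, not taken from a real corpus.
annotations = [
    {"start": 0, "end": 10, "type": "named-entity", "label": "PERSON"},    # "Mrs. Smith"
    {"start": 22, "end": 27, "type": "named-entity", "label": "LOCATION"}, # "Paris"
    {"start": 31, "end": 37, "type": "named-entity", "label": "DATE"},     # "Monday"
]

# Recovering the raw corpus is trivial: the primary text is already untouched.
# Extracting an annotation layer is a simple pass over the stand-off records.
for ann in annotations:
    span = primary_text[ann["start"]:ann["end"]]
    print(f'{ann["label"]:9} {span}')
```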
Identification of Segmental Structures
- Markup identifying gross structures can be automatically generated from original formatting.
- Original formatting is presentational rather than descriptive, making automatic transduction challenging.
- Identifying sub-paragraph structures like sentences, words, names, and dates is almost always needed.
- Many programs for sentence splitting and word tokenization are available, including those in GATE.
- Sentence splitting and tokenization are language-dependent, requiring language-specific information (a simple sketch follows this list).
- Languages without word-boundary markers require different segmentation approaches, such as dynamic programming.
- There are software tools to identify named entities, dates, and time expressions, but they are often based on newspaper and government report corpora.
- Available data skews the development of tools and algorithms, which limits their applicability to generalized corpora.
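For illustration only, here is a deliberately naive sentence splitter and word tokenizer in Python; the abbreviation list is an invented assumption, and real tools (such as those distributed with GATE) rely on much richer language-specific resources.

```python
import re

# A deliberately naive English sentence splitter and word tokenizer.
# The abbreviation set is an illustrative assumption; real splitters
# use far richer language-specific resources.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split after ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Separate words from punctuation; glosses over many hard cases (clitics, hyphens)."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

if __name__ == "__main__":
    text = "Dr. Smith arrived yesterday. She didn't stay long!"
    for s in split_sentences(text):
        print(tokenize(s))
```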
Corpus Annotation
Morpho-syntactic Annotation
- Morpho-syntactic annotation (part-of-speech tagging) is the most common due to highly accurate automatic taggers.
- Part-of-speech tagging involves disambiguation to determine the correct part of speech in context.
- English has high rates of part-of-speech ambiguity for words such as "that", and for verb/noun ambiguity in words such as "hide", "dog", "brand", and "felt".
- Taggers are either rule-based, stochastic, or hybrid (e.g., Brill tagger).
- Stochastic taggers are trained on hand-validated data (a toy sketch follows this list).
- More accurately tagged training data yields better probability estimates, which in turn can be used to tag more corpora.
- Generated tags give more information than simple word class.
- Tagsets with 50-100 tags are typically used in automatic tagging.
- Common English tagsets are derived from the 87 tags used in the Brown corpus.
- The 45-tag set of the Penn Treebank project is widely used.
- The 61-tag C5 tagset of the CLAWS tagger is used to tag the British National Corpus.
- Part-of-speech taggers use lexicons that provide possible part-of-speech assignments.
- It is often hard to map one tagset to another, requiring re-creation of lexical information.
- Lexicons often include lemmas.
- Morpho-syntactic annotation can generate lemmas as well as part-of-speech tags.
- Lemmas let you extract all orthographic forms for a given lemma.
- Many corpora lack lemmas, but there are exceptions like the Journal of the Commission corpus.
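The following toy sketch illustrates the stochastic idea: statistics are gathered from a tiny, invented hand-tagged sample, and each word is then given its most frequent training tag (real taggers use richer models such as hidden Markov models, and far larger training corpora).

```python
from collections import Counter, defaultdict

# Hand-tagged training data (invented for illustration), using Penn Treebank-style tags.
training = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("they", "PRP"), ("dog", "VBP"), ("his", "PRP$"), ("steps", "NNS")],
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
]

# Count how often each word receives each tag in the training data.
counts = defaultdict(Counter)
for sentence in training:
    for word, tag in sentence:
        counts[word][tag] += 1

def tag(words):
    """Assign each word its most frequent training tag; unknown words default to NN."""
    return [(w, counts[w].most_common(1)[0][0] if w in counts else "NN") for w in words]

print(tag(["the", "dog", "barks"]))
# [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
```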
Parallel Alignment
- Algorithms for aligning parallel texts exist which can provide useful info for machine translation systems.
- Parallel corpora are also used to generate bilingual dictionaries automatically and to achieve automatic sense tagging.
- Sentence and word alignment are two common types.
- Sentence alignment is easier and more accurate, with one-to-many mappings being a challenge (a length-based alignment sketch follows this list).
- Most parallel aligned corpora include only two languages, such as the English-French Hansard Corpus.
- Multilingual parallel corpora are rare due to the lack of texts in multiple languages.
- Multilingual aligned corpora include the United Nations Parallel Text Corpus and the Orwell 1984 Corpus.
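Below is a greatly simplified, length-based sentence alignment sketch in the spirit of Gale and Church's approach: candidate 1-1, 1-2, and 2-1 pairings are scored by character-length similarity and the cheapest path is found by dynamic programming. The cost function and example sentences are illustrative assumptions, not a production algorithm.

```python
# Simplified length-based sentence alignment (inspired by Gale & Church 1991).
# The cost of pairing a source span with a target span grows with the difference
# in character length; dynamic programming picks the cheapest overall alignment.

def span_cost(src_len, tgt_len):
    # Illustrative cost: absolute length difference (real systems model a
    # probability distribution over length ratios instead).
    return abs(src_len - tgt_len)

def align(src, tgt):
    n, m = len(src), len(tgt)
    INF = float("inf")
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    moves = [(1, 1), (1, 2), (2, 1)]  # 1-1, 1-2, and 2-1 sentence "beads"
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj in moves:
                if i + di <= n and j + dj <= m:
                    cost = best[i][j] + span_cost(
                        sum(len(s) for s in src[i:i + di]),
                        sum(len(t) for t in tgt[j:j + dj]),
                    )
                    if cost < best[i + di][j + dj]:
                        best[i + di][j + dj] = cost
                        back[i + di][j + dj] = (i, j)
    # Recover the alignment by walking the backpointers from (n, m).
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return list(reversed(beads))

src = ["Hello.", "How are you today?"]
tgt = ["Bonjour.", "Comment allez-vous", "aujourd'hui ?"]
for s, t in align(src, tgt):
    print(s, "<->", t)
```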
Syntactic Annotation
- Noun phrase (NP) bracketing/chunking and treebank creation are two main types of syntactic annotation.
- Syntactically annotated corpora support statistics-based applications, such as probabilistic parsing.
- They also provide data for theoretical linguistic studies.
- The Penn Treebank for English is the most widely used treebank (a small bracketing example follows this list).
- Ongoing projects exist to develop treebanks for other languages such as German and Czech.
- Many schemes provide constituency-based representation of relations among syntactic components.
- Dependency schemes specify grammatical relations among elements, without hierarchical analysis.
- Hybrid systems combine constituency analysis and functional dependencies.
- Syntactic annotation is often interspersed with the text, making it hard to add other annotations.
- Stand-off annotation is encouraged here as well; one such scheme was developed by Ide and Romary (2001).
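To make the bracketed constituency format concrete, the snippet below parses a Penn Treebank-style bracketing with NLTK's Tree class (assuming NLTK is installed); the sentence and labels are invented for the example.

```python
from nltk import Tree

# A Penn Treebank-style constituency bracketing (illustrative example).
bracketing = "(S (NP (DT the) (NN dog)) (VP (VBZ chases) (NP (DT the) (NN cat))))"

tree = Tree.fromstring(bracketing)
tree.pretty_print()                          # draw the constituency structure
print(tree.leaves())                         # ['the', 'dog', 'chases', 'the', 'cat']
print([t.label() for t in tree.subtrees()])  # constituent labels: S, NP, VP, ...
```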
Semantic Annotation
- Semantic annotation adds information about the meaning of text elements.
- Some annotations, such as case role, are included in syntactic annotations.
- Another type identifies words or phrases for a particular theme.
- "Sense tagging" associates lexical items with a sense or definition from a dictionary or online lexicon like WordNet.
- The key difficulty in sense annotation is choosing an appropriate set of senses.
- Some attempts have been made to identify useful sense distinctions for automatic language processing.
- WordNet, which groups words into "synsets" of synonymous words, is the most common source of sense tags for semantic annotation (a small sketch follows this list).
- WordNet's sense distinctions are not optimal, but it is the most widely used and freely available lexicon for English.
- The EuroWordNet project has produced WordNets for most western European languages, linked to WordNet 1.5.
- Sense-tagging must be done by hand, and human annotators often disagree on sense assignments.
- Examples of sense-tagged corpora are the Semantic Concordance Corpus (SemCor) and the DSO Corpus.
- Automatic means to sense-tag data have been sought; this task has come to be known as "word sense disambiguation".
- This area remains difficult in language processing, but statistical approaches are most common.
- Recent research uses information from parallel translations to make sense distinctions.
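A small sketch of sense tagging against WordNet using NLTK (assuming NLTK and its WordNet data are installed): it lists candidate synsets for a word and then applies the simplified Lesk algorithm, one baseline approach to word sense disambiguation. The example sentence is invented.

```python
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

# List the WordNet synsets (candidate senses) for an ambiguous word.
for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())

# A baseline word sense disambiguation step: the simplified Lesk algorithm
# picks the synset whose definition overlaps most with the context words.
context = "I deposited the cheque at the bank on Monday".split()
print(lesk(context, "bank"))  # prints the synset Lesk selects for this context
```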
Discourse-Level Annotation
- Topic identification, co-reference annotation, and discourse structure are three main discourse-level annotations.
- Topic detection annotates texts with information about events or activities.
- A subtask is detecting boundaries between texts.
- Co-reference annotation links referring objects to prior elements, which must be done manually.
- Discourse structure annotation identifies hierarchies of discourse segments and relations between them.
- Common approaches are based on focus spaces or Rhetorical Structure Theory.
- Discourse structure annotation is almost always done by hand, but discourse segmentation software has been developed.
Annotation of Speech and Spoken Data
- The most common speech annotation is a time-aligned orthographic transcription.
- Annotations typically demarcate speech "turns" and the individual "utterances" they contain (a small sketch follows this list).
- Examples of speech corpora are the CHILDES corpus and the TIMIT corpus of read speech.
- Annotations can include part of speech, syntax, and co-reference.
- Speech data may be annotated with phonetic segmentation, prosodic phrasing, disfluencies, and gesture.
- These annotations are time-consuming since they must be done by hand.
- Speech annotation is problematic since it relies on the assumption that speech sounds are clearly demarcated.
- Prosodic annotation is subjective, where decisions about tone vary among annotators.
- Notations for prosodic annotation vary widely, and are not typically rendered in a standard format such as XML.
- The format of the London-Lund Corpus of Spoken English, for example, is widely inconsistent with other schemes.
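As a minimal illustration of a time-aligned orthographic transcription with turns and utterances, the sketch below uses invented speakers, times, and wording; it is not any corpus's actual transcription scheme.

```python
# A time-aligned orthographic transcription as simple records: each utterance
# carries its speaker, start time, end time (in seconds), and text.
# Speakers, times, and wording are invented for illustration.
transcription = [
    {"speaker": "A", "start": 0.00, "end": 2.31, "text": "so what did you think of it"},
    {"speaker": "B", "start": 2.31, "end": 3.10, "text": "uh"},
    {"speaker": "B", "start": 3.10, "end": 5.72, "text": "honestly I thought it was too long"},
]

# Group consecutive utterances by the same speaker into turns.
turns = []
for utt in transcription:
    if turns and turns[-1]["speaker"] == utt["speaker"]:
        turns[-1]["utterances"].append(utt)
    else:
        turns.append({"speaker": utt["speaker"], "utterances": [utt]})

for turn in turns:
    spans = [f'[{u["start"]:.2f}-{u["end"]:.2f}] {u["text"]}' for u in turn["utterances"]]
    print(turn["speaker"] + ":", " | ".join(spans))
```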
Corpus Annotation Tools
- Projects have created tools to facilitate annotation of linguistic corpora.
- Most are based on the MULTEXT project model, which views annotation as a chain of incremental processes.
- The TIPSTER project developed a similar model.
- LT XML and GATE are existing annotation tools for language data.
- The Multilevel Annotation Tools Engineering (MATE) project provides an annotation tool suite for spoken dialogue corpora.
- ATLAS is a joint initiative to build a general-purpose annotation architecture and interchange format.
- A subcommittee of the International Standards Organization (ISO) is developing a generalized model for linguistic annotations and processing tools.
- The goal is to provide a common "pivot" format for data and annotations to enable seamless interchange.
The Future of Corpus Annotation
- Recent developments in XML have focused attention on building a Semantic Web.
- The Semantic Web enables defining relationships between resources.
- Technologies like the Resource Description Framework (RDF) can affect how annotations are associated with language resources (a tiny sketch follows this list).
- Another activity is developing technologies to support the specification of and access to ontological information.
- Ontologies provide a priori information about relationships among data categories.
- Ontologies enable inferencing processes to yield information not explicit in the data.
- The W3C is providing a standard representation format via the Web Ontology Language (OWL).
- It is up to the computational linguistics community to use these technologies to annotate and analyze language data.
- Semantic Web technologies will enable development of common and universally accessible ontological information.
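As a rough illustration of how RDF might associate an annotation with a language resource, here is a sketch using the rdflib library (assuming it is installed); the namespace, URIs, and properties are invented for the example rather than taken from any standard annotation vocabulary.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# An invented namespace for this illustration; a real project would use an
# agreed annotation vocabulary or ontology instead.
EX = Namespace("http://example.org/annotation#")

g = Graph()
g.bind("ex", EX)

# Describe one annotation as a resource: it targets a span of a corpus document
# and assigns that span a part-of-speech value.
ann = URIRef("http://example.org/corpus/doc1/ann/1")
g.add((ann, RDF.type, EX.Annotation))
g.add((ann, EX.targetDocument, URIRef("http://example.org/corpus/doc1")))
g.add((ann, EX.startOffset, Literal(0)))
g.add((ann, EX.endOffset, Literal(3)))
g.add((ann, EX.partOfSpeech, Literal("DT")))

print(g.serialize(format="turtle"))
```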
Analysis of Linguistic Corpora
Statistics Gathering
- A corpus provides a bank of samples for developing numerical language models.
- Such sample banks go hand in hand with empirical, data-driven methods.
- Corpora enabled data-driven methods to solve computational linguistics problems in the late 1980s.
- Early successes led to the wider application of statistical methods across language processing.
- Key to this approach is the availability of large, annotated corpora for training.
- Stochastic part-of-speech taggers rely on previously annotated corpora.
- In word sense disambiguation, statistics are gathered reflecting the context of previously sense-tagged words.
- These statistics are used to disambiguate occurrences in untagged corpora.
- Speech recognition was the first area of language processing to use corpora on a large scale, beginning with Hidden Markov Models (HMMs) in the 1970s.
- Researchers trained a French-English correspondence model on Canadian Parliamentary records, and also trained a language model for English production.
- Probabilistic parsing is a more recent application which requires previously annotated and validated data for statistics.
- A probabilistic parser uses previously gathered statistics to choose the most probable interpretation (a small PCFG sketch follows this list).
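A minimal sketch of probabilistic parsing with a hand-written PCFG and NLTK's Viterbi parser (assuming NLTK is installed); the grammar and probabilities are invented, whereas a real system would estimate them from an annotated treebank.

```python
import nltk

# A tiny hand-written PCFG; in practice the rules and probabilities would be
# estimated from an annotated treebank such as the Penn Treebank.
grammar = nltk.PCFG.fromstring("""
    S   -> NP VP      [1.0]
    NP  -> DT NN      [0.7]
    NP  -> NN         [0.3]
    VP  -> VBZ NP     [1.0]
    DT  -> 'the'      [1.0]
    NN  -> 'dog'      [0.5]
    NN  -> 'cat'      [0.5]
    VBZ -> 'chases'   [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("the dog chases the cat".split()):
    print(tree)         # the most probable parse
    print(tree.prob())  # its probability under the grammar
```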
Issues for Corpus-Based Statistics Gathering
- A corpus must include samples from various texts that reflect the range of syntactic and semantic phenomena.
- Data must be large enough to avoid the problem of data sparseness.
- Many senses of polysemous words need to be represented in the data.
- Existing corpora over-represent newspaper samples.
- Applying such unbalanced training data to natural language processing can lead to drastically skewed results.
- The problem of balance is acute in speech recognition since speech recognition systems are notoriously dependent on the characteristics of their training corpora.
- Training corpora are invariably composed of written rather than spoken texts.
- Whenever a speech recognition research effort moves to a new domain, a new training corpus of speech must be collected.
Language Analysis
- Authentic language data enables language description based on evidence rather than theoretical models.
- Corpora of native-speaker texts provide samples of genuine language.
- Corpora are used for dictionary-making, or lexicography, especially "learners' dictionaries".
- The COBUILD corpus was used to create the Collins COBUILD English Dictionary, the first corpus-based dictionary.
- Most British dictionary publishers use corpora as the primary data source now.
- The basic lexicographic tool is a concordancer, which displays occurrences of a word in context.
- The huge increase in available data has led lexicographers to use computational linguistics techniques to summarize concordance data.
- The Mutual Information (MI) score is a statistical measure of how closely one word is associated with others (a small concordance-and-MI sketch follows this list).
- Dictionary makers team with computational linguistics researchers to glean even more precise information from corpus data.
- Supplemental grammatical information can provide more precise understanding of word usage, especially in context.
- Gathering this information requires relatively sophisticated language-processing software.
- The need for this kind of information has increased lexicographers' collaboration with computational linguists.
- Researchers in the humanities and computational linguistics have not collaborated often, but this should change with the World Wide Web.
- Humanists have increased access to information, tools, and resources developed by the computational linguistics community.
- Computational linguists are likely to face new language processing challenges.
- Collaboration should lead to vastly increased capabilities.
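As a small illustration of both tools on a toy text, the sketch below implements a keyword-in-context (KWIC) concordancer and a Mutual Information score computed as log2(P(x,y) / (P(x)P(y))) over a co-occurrence window; the toy text and window size are invented assumptions.

```python
import math
from collections import Counter

text = ("the strong tea was served with strong coffee "
        "while the weak tea sat beside more strong tea").split()

def concordance(tokens, keyword, width=3):
    """Print each occurrence of keyword with `width` words of context on each side."""
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            print(f"{left:>30} [{keyword}] {right}")

def mutual_information(tokens, x, y, window=2):
    """MI(x, y) = log2(P(x,y) / (P(x) * P(y))), with P(x,y) estimated from
    co-occurrences of x and y within `window` words of each other."""
    n = len(tokens)
    counts = Counter(tokens)
    cooc = sum(
        1
        for i, tok in enumerate(tokens)
        if tok == x and y in tokens[max(0, i - window):i + window + 1]
    )
    if cooc == 0:
        return float("-inf")
    return math.log2((cooc / n) / ((counts[x] / n) * (counts[y] / n)))

concordance(text, "tea")
print("MI(strong, tea) =", round(mutual_information(text, "strong", "tea"), 2))
```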