Language Corpus Linguistics

Questions and Answers

What was a limitation on early electronic corpus creation?

  • No need to enter material manually
  • Automated processing for language features
  • Absence of the Internet for sharing (correct)
  • Unlimited data storage capacity

What primarily drove renewed interest in corpus compilation in the 1980s?

  • Challenges related to language sharing
  • The need for manual data entry methods
  • The desire to drive language processing software (correct)
  • Limitations in computer speed and capacity

What is a characteristic of the 'golden era' of linguistic corpora?

  • Limited availability of multilingual parallel corpora
  • Decreased interest in language analysis
  • Focus on monolingual corpora only
  • Government-funded projects compiling corpora (correct)

What function do the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA) serve?

Serving as repositories and distributors of corpora

What is the initial stage of corpus creation?

Data capture

What is the most common representation format for linguistic corpora nowadays?

XML

What does the notion of 'stand-off annotation' require?

Annotations are encoded in documents separate from the primary data.

What is a consideration when creating linguistic corpora?

Gross structure of the texts.

What do programs for sentence splitting and word tokenization need?

Language-specific information.

What is a common problem with freely available tools for identifying named entities?

They may not perform well on humanist data like literary works.

What is Part-of-speech (POS) tagging primarily used for?

Adding morpho-syntactic annotation.

What do stochastic taggers rely on?

Probabilities for n-grams.

What is a characteristic of the Penn Treebank tagset?

It eliminates information retrievable from the form of the lexical item.

What ability does the presence of lemmas in an annotated text provide?

To extract all orthographic forms associated with a given lemma.

Which type of parallel alignment is generally easier and more accurate?

Sentence alignment.

What is a key function of syntactically annotated corpora?

To drive syntactic parsers.

What function does semantic annotation provide?

Information about the meaning of elements in a text.

What is WordNet used for?

Providing lists of senses and grouping words into synsets.

What remains one of the more difficult problems in language processing?

Automatic means to sense-tag data.

What is the focus of co-reference annotation?

Relating objects to prior elements in a discourse.

What does discourse structure annotation identify?

Multi-level hierarchies of discourse segments and the relations between them.

What information can be included in the orthographic transcription of speech data?

All are correct.

What is a difficulty associated with speech annotation?

The subjectivity of decisions about tone and other aspects of speech.

What is the view of the annotation process based on the MULTEXT project?

A chain of smaller, individual processes that incrementally add annotations.

What is the purpose of the ISO model for linguistic annotations?

To provide a common pivot format for interchange.

What are potential impacts regarding the Semantic Web for corpus annotation and analysis?

The Semantic Web enables definition of the kinds of relationships one resource may have with another.

What do ontologies provide?

A priori information about relationships among categories of data.

What led to the full-scale usage of data-driven methods to confront generalized problems in computational linguistics?

The increased availability of large amounts of electronic text.

In what area of language processing were corpora first put to large-scale use?

Speech recognition

What is the purpose of statistical methods in probabilistic parsing?

To choose the most probable interpretation.

What must a corpus do to be representative of any language as a whole?

Include samples from a variety of texts.

What can applying unbalanced corpora to natural language processing lead to?

Drastically skewed results.

Why are large training corpora of speech needed when a state-of-the-art speech recognition research effort moves to a new domain?

Speech recognition systems are notoriously dependent on the characteristics of their training corpora.

What does gathering authentic language data from corpora enable regarding the description of language?

Describing the language from the evidence rather than from imposing some theoretical model.

What is a concordancer?

A program that displays occurrences of a word in the middle of a line of context from the corpus.

What does the Mutual Information (MI) score show?

It shows how closely one word is associated with others, based on the regularity with which they co-occur in context.

What needs have led to increased collaboration with computational linguists?

The need for more informative results to serve the needs of lexicography.

What should the collaboration of Humanists and researchers in computational linguistics lead to?

Vastly increased capabilities for both.

What led to the creation of corpora in electronic form in the 1950s?

The availability of computers.

Which of these corpora were compiled in the 1960s using a representative sample of texts produced in the year 1961?

The Brown Corpus and the LOB corpus.

What significant advancement occurred in the 1980s that facilitated the creation of larger corpora?

Increased computer speed and text production in computerized form.

What is a parallel corpus?

A corpus containing the same text in two or more languages.

Where can numerous corpora be obtained?

The Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA).

Flashcards

Linguistic Corpus

A collection of machine-readable texts used for linguistic analysis.

POS Tagging

Assigning part-of-speech tags to words in a corpus automatically.

Parallel Corpus

A corpus containing the same text in multiple languages.

Parallel Text Alignment

Associating words/sentences with their translations in a parallel corpus.

EAGLES XML Corpus Encoding Standard (XCES)

XML standard for encoding linguistic corpora and annotations.

Stand-off Annotation

Annotations encoded in separate documents linked to primary data.

Sentence Splitting

Splitting text into individual sentences.

Word Tokenization

Breaking sentences into individual words or tokens.

Named Entity Recognition

Identifying proper names, dates, and organizations in text.

Morpho-syntactic Annotation

Assigning syntactic labels (noun, verb, etc.) to words.

Stochastic Tagger

Algorithm using probabilities of tag sequences.

Lemma

The base form of a word.

Word Sense Disambiguation

Assigning the correct meaning (sense) to a word in context.

Co-reference Resolution

Linking mentions of the same entity in a text.

Discourse Structure Annotation

Hierarchy of discourse segments and relations between them.

Speech Annotation

Time-aligned orthographic transcription of speech.

Concordancer

Displaying occurrences of a word with its surrounding context.

Mutual Information (MI)

Statistical measure of how closely words are associated.

Resource Description Framework (RDF)

Enables definition of relationships between resources.

Web Ontology Language (OWL)

Specifies and accesses ontological information.

Study Notes

Overview

  • A corpus is a crucial resource for language research.
  • Electronic corpora emerged in the 1950s alongside the advent of computers, facilitating automated searches, frequency calculations, distributional analyses, and descriptive statistics.
  • Early uses focused on literary analysis, authorship studies, and lexicography but were limited by computer storage and processing speeds.
  • Data sharing was limited prior to the Internet, meaning corpora were generally processed and stored in single locations.
  • Two exceptions were the Brown Corpus of American English and the London/Oslo/Bergen (LOB) corpus of British English, each containing one million part-of-speech-tagged words from texts produced in 1961.
  • Brown and LOB were the main computer-readable corpora of general language for many years.
  • Computer capabilities increased in the 1980s, leading to corpora containing millions of words.
  • Larger language samples allowed for meaningful statistics about language patterns, which renewed interest in corpus compilation especially for computational linguistics.
  • Parallel corpora containing texts in multiple languages also appeared, such as the Canadian Hansard corpus.
  • Corpus creation was still a lot of work, even with electronic texts, which often needed extensive processing of typesetting codes.
  • Since 1990, there has been a "golden era" of linguistic corpora with massive text and speech corpora being compiled, often via government-funded projects.
  • Automatic techniques developed in the 1990s enabled automated annotation of linguistic properties, such as part-of-speech tagging and parallel text alignment, with 95–98% accuracy.
  • Automatic identification of syntactic configurations, proper names, and dates was also developed.
  • The Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA) were founded in the mid-1990s as repositories and distributors of corpora and lexicons.
  • Existing corpora vary in composition due to the cost of obtaining texts, with few being "balanced" in genre representation.
  • Exceptions include the British National Corpus (BNC), the American National Corpus (ANC), and PAROLE project corpora.
  • Most text corpora comprise readily available materials like newspaper data and web content.
  • Speech data is often more representative of specific dialects due to controlled acquisition.
  • Many corpora are available for research under license for a small fee, but some require a substantial payment.

Preparation of Linguistic Corpora

  • Initial data capture converts text to electronic form via manual entry, OCR, word processor output, or PDF files.
  • Manual entry is costly and unsuitable for large corpora.
  • OCR output can also be costly if it needs heavy editing.
  • Electronic data from other sources often contain formatting codes that must be removed or translated.

Representation Formats

  • XML is currently the most common representation format for linguistic corpora.
  • The EAGLES XML Corpus Encoding Standard (XCES) is a Text Encoding Initiative (TEI)-compliant XML application designed for encoding linguistic corpora and their annotations.
  • XCES uses stand-off annotation, where annotations are in separate documents linked to the primary data.
  • This avoids challenges like overlapping hierarchies and unwieldy documents and enables different annotation schemes for the same feature.
  • Stand-off annotation also satisfies the requirements outlined in Leech (1993): annotations can be removed to recover the raw corpus, or extracted and processed separately (a minimal sketch follows this list).
  • Stand-off annotation is now widely accepted, although many corpora still include annotations in the same document as the text, because inter-document linking mechanisms are relatively new to XML.
  • The XCES identifies two types of information that may be encoded in the primary data: gross structure and segmental structure.
  • Gross structure includes universal text elements and features of typography, while segmental structure includes language-dependent elements at the sub-paragraph level.
  • Annotations are linked to primary data through XML conventions (XLink, XPointer).
  • Speech data is often treated as "read-only," with stand-off documents identifying start and end points of structures.
  • The ATLAS project links annotations to speech data, which carries no XML tagging, using byte offsets.
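
To make stand-off annotation concrete, here is a minimal sketch in Python rather than the XCES XML itself (whose linking relies on the conventions above): the primary text is left untouched, and a separate annotation layer points back into it by character offsets.

```python
# Minimal sketch of stand-off annotation: the primary data is read-only;
# annotations live in a separate structure that references it by offsets.
text = "Dr. Smith arrived in Paris on Monday."

# Each annotation records a span in the primary data plus a label.
annotations = [
    {"start": 0,  "end": 9,  "type": "person", "value": "Dr. Smith"},
    {"start": 21, "end": 26, "type": "place",  "value": "Paris"},
    {"start": 30, "end": 36, "type": "date",   "value": "Monday"},
]

# The raw corpus is recovered by simply ignoring the annotation layer,
# and alternative annotation schemes can coexist as separate layers.
for ann in annotations:
    assert text[ann["start"]:ann["end"]] == ann["value"]
    print(f'{ann["type"]:8} {ann["value"]}')
```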

Identification of Segmental Structures

  • Markup identifying gross structures can be automatically generated from original formatting.
  • Original formatting is presentational rather than descriptive, making automatic transduction challenging.
  • Identifying sub-paragraph structures like sentences, words, names, and dates is almost always needed.
  • Many programs for sentence splitting and word tokenization are available, including those in GATE.
  • Sentence splitting and tokenization are language-dependent and require language-specific information (see the sketch after this list).
  • Languages without word-boundary markers require different segmentation approaches, such as dynamic programming.
  • Software exists to identify named entities, dates, and time expressions, but it is often trained on newspaper and government-report corpora.
  • Available data skews the development of tools and algorithms, which limits their applicability to generalized corpora.
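
A minimal sketch of sentence splitting and word tokenization using NLTK (one freely available toolkit; GATE, mentioned above, is another). It assumes the pre-trained English "punkt" model has been fetched with nltk.download("punkt").

```python
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Dr. Smith arrived in Paris. He spoke on Monday."

# The punkt model carries the language-specific information needed to
# avoid splitting the sentence at abbreviations such as "Dr.".
for sentence in sent_tokenize(text, language="english"):
    print(word_tokenize(sentence))
# ['Dr.', 'Smith', 'arrived', 'in', 'Paris', '.']
# ['He', 'spoke', 'on', 'Monday', '.']
```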

Corpus Annotation

Morpho-syntactic Annotation

  • Morpho-syntactic annotation (part-of-speech tagging) is the most common due to highly accurate automatic taggers.
  • Part-of-speech tagging involves disambiguation to determine the correct part of speech in context.
  • English has high rates of part-of-speech ambiguity for words such as "that", and many cases of verb/noun ambiguity, e.g., "hide", "dog", "brand", "felt".
  • Taggers are either rule-based, stochastic, or hybrid (e.g., Brill tagger).
  • Stochastic taggers are trained on hand-validated data.
  • More accurately tagged training data yields better probability estimates, which in turn produce more accurately tagged corpora.
  • Generated tags give more information than simple word class.
  • Tagsets with 50-100 tags are typically used in automatic tagging.
  • Common English tagsets are derived from the 87 tags used in the Brown corpus.
  • The 45-tag set of the Penn Treebank project is widely used.
  • The 61-tag C5 tagset of the CLAWS tagger is used to tag the British National Corpus.
  • Part-of-speech taggers use lexicons that provide possible part-of-speech assignments.
  • It is often hard to map one tagset to another, requiring re-creation of lexical information.
  • Lexicons often include lemmas.
  • Morpho-syntactic annotation can generate lemmas as well as part-of-speech tags.
  • Lemma annotation makes it possible to extract all orthographic forms associated with a given lemma (the sketch after this list includes lemmatization).
  • Many corpora lack lemmas, but there are exceptions like the Journal of the Commission corpus.
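
A hedged sketch of tagging and lemmatization with NLTK, whose default tagger produces Penn Treebank tags; the small helper that maps Treebank tags onto WordNet's coarser word classes is our own illustrative addition.

```python
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

tokens = word_tokenize("The dogs felt that the hides were branded.")
tagged = pos_tag(tokens)  # e.g. [('The', 'DT'), ('dogs', 'NNS'), ...]

lemmatizer = WordNetLemmatizer()

def wn_pos(treebank_tag):
    """Map a Penn Treebank tag onto WordNet's coarse word classes."""
    return {"J": "a", "V": "v", "R": "r"}.get(treebank_tag[0], "n")

# Lemmas group orthographic forms: "dogs" -> "dog", "were" -> "be".
for word, tag in tagged:
    print(word, tag, lemmatizer.lemmatize(word.lower(), wn_pos(tag)))
```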

Parallel Alignment

  • Algorithms exist for aligning parallel texts, which can provide useful information for machine translation systems.
  • Parallel corpora are also used to generate bilingual dictionaries automatically and to achieve automatic sense tagging.
  • Sentence and word alignment are two common types.
  • Sentence alignment is generally easier and more accurate, with one-to-many mappings being the main challenge (a toy aligner is sketched after this list).
  • Most parallel aligned corpora include only two languages, such as the English-French Hansard Corpus.
  • Multilingual parallel corpora are rare due to the lack of texts in multiple languages.
  • Multilingual aligned corpora include the United Nations Parallel Text Corpus and the Orwell 1984 Corpus.
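
A toy length-based sentence aligner in the spirit of Gale and Church's classic approach (our assumption; production aligners are more elaborate): translated sentences tend to have proportional lengths, so dynamic programming over lengths recovers a plausible pairing. Only 1-1 and skip (1-0 / 0-1) steps are handled here; the one-to-many mappings noted above are the hard cases.

```python
def align(src, tgt, skip_cost=10.0):
    """Align two lists of sentences by character-length similarity."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    # cost[i][j]: best cost of aligning the first i source sentences
    # with the first j target sentences; back[i][j] records the path.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # 1-1 step: penalize length mismatch
                c = cost[i - 1][j - 1] + abs(len(src[i - 1]) - len(tgt[j - 1]))
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j - 1)
            for pi, pj in ((i - 1, j), (i, j - 1)):  # 1-0 / 0-1 skips
                if pi >= 0 and pj >= 0 and cost[pi][pj] + skip_cost < cost[i][j]:
                    cost[i][j], back[i][j] = cost[pi][pj] + skip_cost, (pi, pj)
    pairs, i, j = [], n, m
    while back[i][j] is not None:  # trace the best path backwards
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((src[i - 1], tgt[j - 1]))
        i, j = pi, pj
    return list(reversed(pairs))

src = ["Hello.", "How are you today?"]
tgt = ["Bonjour.", "Comment allez-vous aujourd'hui ?"]
print(align(src, tgt))  # both pairs align 1-1
```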

Syntactic Annotation

  • Noun phrase (NP) bracketing/chunking and treebank creation are two main types of syntactic annotation.
  • Syntactically annotated corpora support statistics-based applications, such as probabilistic parsing.
  • They also provide data for theoretical linguistic studies.
  • The Penn Treebank for English is the most widely used treebank.
  • Ongoing projects exist to develop treebanks for other languages such as German and Czech.
  • Many schemes provide constituency-based representation of relations among syntactic components.
  • Dependency schemes instead specify grammatical relations among elements, without hierarchical analysis (both styles are illustrated after this list).
  • Hybrid systems combine constituency analysis and functional dependencies.
  • Syntactic annotation is often interspersed with the text, making it hard to add other annotations.
  • Stand-off annotation is therefore encouraged; one such scheme was developed by Ide and Romary (2001).
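
To contrast the two representation styles just described, a small sketch using NLTK's Tree class for a Penn-Treebank-style constituency bracketing, with a hand-written dependency analysis of the same sentence alongside (the relation labels are illustrative):

```python
from nltk import Tree

# Constituency: hierarchical phrase structure.
constituency = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
constituency.pretty_print()

# Dependency: flat grammatical relations between words, no phrase nodes.
dependencies = [("barked", "nsubj", "dog"), ("dog", "det", "the")]
for head, relation, dependent in dependencies:
    print(f"{relation}({head}, {dependent})")
```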

Semantic Annotation

  • Semantic annotation adds information about the meaning of text elements.
  • Some annotations, such as case role, are included in syntactic annotations.
  • Another type identifies words or phrases for a particular theme.
  • "Sense tagging" associates lexical items with a sense or definition from a dictionary or online lexicon like WordNet.
  • The key difficulty in sense annotation is choosing an appropriate set of senses.
  • Some attempts have been made to identify useful sense distinctions for automatic language processing.
  • WordNet, which groups words into "synsets" of synonymous words, is the most common source of sense tags for semantic annotation (see the sketch after this list).
  • WordNet distinctions are not optimal, but it is the most widely used and freely available lexicon for English.
  • The EuroWordNet project has produced WordNets for most western European languages, linked to WordNet 1.5.
  • Sense-tagging largely requires hand annotation, and even human annotators often disagree.
  • Examples of corpora that use it are the Semantic Concordance Corpus (SemCor) and the DSO Corpus.
  • Automatic means to sense-tag data have been sought; the task has come to be known as "word sense disambiguation".
  • It remains one of the more difficult problems in language processing, and statistical approaches are the most common.
  • Recent research uses information from parallel translations to make sense distinctions.
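
A hedged sketch of sense inventories and a baseline disambiguation step using NLTK: synsets are looked up in WordNet, and the simple Lesk gloss-overlap algorithm (an early knowledge-based baseline, not the statistical state of the art) picks the sense that best matches the context. Assumes the WordNet data package has been downloaded.

```python
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

# Each synset groups synonymous words under one sense, with a gloss.
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())

# Pick the sense of "bank" most consistent with the context words.
context = "I deposited my paycheck at the bank on Monday".split()
print(lesk(context, "bank", pos="n"))
```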

Discourse-Level Annotation

  • Topic identification, co-reference annotation, and discourse structure are three main discourse-level annotations.
  • Topic detection annotates texts with information about events or activities.
  • A subtask is detecting boundaries between texts.
  • Co-reference annotation links referring objects to prior elements in a discourse, a task that must be done manually.
  • Discourse structure annotation identifies hierarchies of discourse segments and relations between them.
  • Common approaches are based on focus spaces or Rhetorical Structure Theory.
  • Discourse structure annotation is almost always done by hand, although discourse segmentation software has been developed (one example is sketched below).
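
One example of such segmentation software is Hearst's TextTiling algorithm, which NLTK implements: it hypothesizes topic boundaries where the vocabulary of adjacent blocks of text diverges. A hedged sketch, following NLTK's own demo (assumes the Brown corpus and stopword data have been downloaded):

```python
from nltk.corpus import brown
from nltk.tokenize import TextTilingTokenizer

# Segment a few thousand characters of running text into topic tiles.
tt = TextTilingTokenizer()
segments = tt.tokenize(brown.raw()[:4000])
print(len(segments), "topic segments found")
print(segments[0][:100], "...")
```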

Annotation of Speech and Spoken Data

  • The most common speech annotation is a time-aligned orthographic transcription (an illustrative format is sketched after this list).
  • Annotations demarcate speech "turns" and the individual "utterances" they contain.
  • Examples of speech corpora are the CHILDES corpus and the TIMIT corpus of read speech.
  • Annotations can include part of speech, syntax, and co-reference.
  • Speech data may be annotated with phonetic segmentation, prosodic phrasing, disfluencies, and gesture.
  • These annotations are time-consuming since they must be done by hand.
  • Speech annotation is problematic since it relies on the assumption that speech sounds are clearly demarcated.
  • Prosodic annotation is subjective, where decisions about tone vary among annotators.
  • Notations for prosodic annotation vary widely, and are not typically rendered in a standard format such as XML.
  • The London-Lund Corpus of Spoken English, for example, is notably inconsistent in its format.
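
An illustrative sketch (not any real corpus format) of what a time-aligned orthographic transcription with turn and utterance demarcation might look like:

```python
# Each record anchors one utterance to start/end times in the signal.
transcription = [
    {"speaker": "A", "start": 0.00, "end": 2.35,
     "utterance": "so what did you think of the talk"},
    {"speaker": "B", "start": 2.41, "end": 4.10,
     "utterance": "uh it was- it was fine i guess"},  # disfluency retained
]

for turn in transcription:
    print(f"[{turn['start']:5.2f}-{turn['end']:5.2f}] "
          f"{turn['speaker']}: {turn['utterance']}")
```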

Corpus Annotation Tools

  • Projects have created tools to facilitate annotation of linguistic corpora.
  • Most are based on the MULTEXT project model, which views annotation as a chain of incremental processes (sketched after this list).
  • The TIPSTER project developed a similar model.
  • LT XML and GATE are existing annotation tools for language data.
  • The Multilevel Annotation Tools Engineering (MATE) project provides an annotation tool suite for spoken dialogue corpora.
  • ATLAS is a joint initiative to build a general-purpose annotation architecture and interchange format.
  • A subcommittee of the International Organization for Standardization (ISO) is developing a generalized model for linguistic annotations and processing tools.
  • The goal is to provide a common "pivot" format for data and annotations to enable seamless interchange.
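
A minimal sketch of the MULTEXT-style view of annotation as a chain of small processes, each adding one layer to the output of the previous one; the function names are our own illustration, not any real tool's API.

```python
def pipeline(text, *steps):
    document = {"text": text}      # start from the raw primary data
    for step in steps:
        document = step(document)  # each step adds one annotation layer
    return document

def split_sentences(doc):
    doc["sentences"] = doc["text"].split(". ")
    return doc

def tokenize(doc):
    doc["tokens"] = [s.split() for s in doc["sentences"]]
    return doc

annotated = pipeline("Corpora grew. Tools followed.", split_sentences, tokenize)
print(annotated["tokens"])
```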

The Future of Corpus Annotation

  • Recent developments in XML have focused attention on building a Semantic Web.
  • The Semantic Web enables defining relationships between resources.
  • Technologies like the Resource Description Framework (RDF) can affect how annotations are associated with language resources (a small RDF sketch follows this list).
  • Another activity is developing technologies to support the specification of and access to ontological information.
  • Ontologies provide a priori information about relationships among data categories.
  • Ontologies enable inferencing processes to yield information not explicit in the data.
  • The W3C is providing a standard representation format via the Web Ontology Language (OWL).
  • It is up to the computational linguistics community to use it to annotate and analyze language data.
  • Semantic Web technologies will enable development of common and universally accessible ontological information.
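
A hedged sketch using the rdflib library (one common Python RDF toolkit, assumed here; the vocabulary URIs are invented for illustration) showing how an annotation can be stated as RDF triples relating one resource to another:

```python
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/corpus#")  # hypothetical vocabulary
g = Graph()

token = URIRef(EX["token-42"])
g.add((token, EX.partOfSpeech, Literal("NN")))
g.add((token, EX.lemma, Literal("corpus")))
g.add((token, EX.annotatedIn, URIRef(EX["pos-layer-1"])))

print(g.serialize(format="turtle"))
```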

Analysis of Linguistic Corpora

Statistics Gathering

  • A corpus provides a bank of samples for developing numerical language models.
  • This goes hand in hand with empirical methods.
  • Corpora enabled data-driven methods to solve computational linguistics problems in the late 1980s.
  • Success with statistical methods led to their wider application in statistical language processing.
  • Key to this approach is the availability of large, annotated corpora for training.
  • Stochastic part-of-speech taggers rely on previously annotated corpora (a toy example of the statistics involved follows this list).
  • In word sense disambiguation, statistics are gathered reflecting the context of previously sense-tagged words.
  • These statistics are used to disambiguate occurrences in untagged corpora.
  • Speech recognition first used corpora at a large scale in language processing, starting with Hidden Markov Models (HMMs) in the 1970s.
  • Researchers trained a French-English correspondence model on Canadian parliamentary records, together with a language model for generating English.
  • Probabilistic parsing is a more recent application which requires previously annotated and validated data for statistics.
  • A probabilistic parser uses previously gathered statistics to choose the most probable interpretation.
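
A toy sketch of the statistics a stochastic tagger gathers from previously annotated data: tag-bigram counts turned into transition probabilities P(tag_i | tag_i-1). Real training corpora are vastly larger, and full HMM taggers also model word-given-tag probabilities.

```python
from collections import Counter

tagged_corpus = [  # tiny hand-validated sample; real ones are huge
    [("the", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("the", "DT"), ("felt", "NN"), ("tore", "VBD")],
    [("dogs", "NNS"), ("felt", "VBD"), ("pain", "NN")],
]

unigrams, bigrams = Counter(), Counter()
for sentence in tagged_corpus:
    tags = ["<s>"] + [tag for _, tag in sentence]
    unigrams.update(tags[:-1])
    bigrams.update(zip(tags[:-1], tags[1:]))

# P(NN|DT) vs P(VBD|DT): evidence that "felt" after "the" is a noun.
for prev, cur in [("DT", "NN"), ("DT", "VBD")]:
    print(f"P({cur}|{prev}) = {bigrams[(prev, cur)] / unigrams[prev]:.2f}")
```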

Issues for Corpus-Based Statistics Gathering

  • A corpus must include samples from various texts that reflect the range of syntactic and semantic phenomena.
  • Data must be large enough to avoid the problem of data sparseness.
  • The many senses of polysemous words must also be adequately represented.
  • Existing corpora over-represent newspaper samples.
  • Training on such unbalanced data can lead to drastically skewed results.
  • The problem of balance is acute in speech recognition since speech recognition systems are notoriously dependent on the characteristics of their training corpora.
  • Training corpora are invariably composed of written rather than spoken texts.
  • Whenever a speech recognition research effort moves to a new domain, a new training corpus of speech must be collected.

Language Analysis

  • Authentic language data enables language description based on evidence rather than theoretical models.
  • Corpora of native-speaker texts provide samples of genuine language.
  • Corpora are used for dictionary-making (lexicography), especially for "learners' dictionaries".
  • The COBUILD corpus was used to create the Collins COBUILD English Dictionary, the first corpus-based dictionary.
  • Most British dictionary publishers use corpora as the primary data source now.
  • The basic lexicographic tool is a concordancer, which displays occurrences of a word in context (a minimal sketch follows this list).
  • The huge increase in available data has led lexicographers to use computational linguistics techniques to summarize concordance data.
  • The Mutual Information (MI) score is a statistical measure of how closely one word is associated with others, based on how regularly they co-occur (also sketched below).
  • Dictionary makers team with computational linguistics researchers to glean even more precise information from corpus data.
  • Supplemental grammatical information can provide more precise understanding of word usage, especially in context.
  • Gathering this information requires relatively sophisticated language-processing software.
  • The need for more informative results has increased collaboration with computational linguists.
  • Researchers in the humanities and computational linguistics have not collaborated often, but this should change with the World Wide Web.
  • Humanists have increased access to information, tools, and resources developed by the computational linguistics community.
  • Computational linguists are likely to face new language processing challenges.
  • Collaboration should lead to vastly increased capabilities.
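
Minimal sketches of the two tools described in this list: a KWIC-style concordancer and a pointwise Mutual Information score, MI(x, y) = log2(P(x, y) / (P(x) P(y))), estimated from raw counts (zero co-occurrence counts, unhandled here, would need smoothing in practice):

```python
import math
from collections import Counter

tokens = ("the strong tea was strong and the strong coffee was hot "
          "the tea was weak").split()

def concordance(tokens, keyword, width=2):
    """Print each occurrence of keyword centered in a window of context."""
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            print(f"{left:>15} [{keyword}] {right}")

def mi(tokens, x, y, window=1):
    """Pointwise MI of y occurring within `window` words after x."""
    n = len(tokens)
    counts = Counter(tokens)
    pair = sum(1 for i, tok in enumerate(tokens[:-window])
               if tok == x and y in tokens[i + 1:i + 1 + window])
    return math.log2((pair / n) / ((counts[x] / n) * (counts[y] / n)))

concordance(tokens, "strong")
print("MI(strong, tea) =", round(mi(tokens, "strong", "tea"), 2))
```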
