


TEXT PREPROCESSING
George Mikros, HBKU

Sec. 2.1 PARSING A DOCUMENT
• What format is it in? – pdf/word/excel/html?
• What language is it in?
• What character set is in use? – (CP1252, UTF-8, …)
These tasks are often done heuristically.

Sec. 2.1 COMPLICATIONS: FORMAT/LANGUAGE
• Documents being indexed can include docs from many different languages
– A single index may contain terms from many languages.
• Sometimes a document or its components can contain multiple languages/formats
– A Chinese email with an English pdf attachment.
– A French email quoting clauses from an English-language contract.
• There are commercial and open-source libraries that can handle a lot of this.

Sec. 2.2.1 TOKENIZATION
• Input: "Friends, Romans and Countrymen"
• Output: Tokens
– Friends
– Romans
– Countrymen
• A token is an instance of a sequence of characters.
• Each such token is now a candidate for an index entry, after further processing (described below).
• But what are valid tokens to emit?

Sec. 2.2.1 TOKENIZATION
• Issues in tokenization:
– Finland's capital → Finland AND s? Finlands? Finland's?
– Hewlett-Packard → Hewlett and Packard as two tokens?
• Typical solution: break up the hyphenated sequence.
• co-education
• lowercase, lower-case, lower case?
• It can be effective to get the user to put in possible hyphens.
– San Francisco: one token or two?
• How do you decide it is one token?

Sec. 2.2.1 NUMBERS
• 3/20/91
• 55 B.C.
• B-52
• My PGP key is 324a3df234cb23e
• (800) 234-2333
– Older IR systems do not index numbers.
– But numbers are often very useful: think about looking up error codes/stacktraces on the web.
– We often index "meta-data" separately: creation date, format, etc.

Sec. 2.2.1 TOKENIZATION: LANGUAGE ISSUES
• French
– L'ensemble → one token or two?
• L? L'? Le?
• We want l'ensemble to match with un ensemble.
– Until at least 2003, it didn't on Google. Internationalization!
• German noun compounds are not segmented
– Lebensversicherungsgesellschaftsangestellter ('life insurance company employee')
– German retrieval systems benefit greatly from a compound-splitter module.
– It can give a 15% performance boost for German.

Sec. 2.2.1 TOKENIZATION: LANGUAGE ISSUES
• Chinese and Japanese have no spaces between words:
– 莎拉波娃现在居住在美国东南部的佛罗里达。
– A unique tokenization is not always guaranteed.
• Further complicated in Japanese, with multiple alphabets intermingled
– Dates/amounts appear in multiple formats.

Sec. 2.2.1 TOKENIZATION: LANGUAGE ISSUES
• Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right.
• Words are separated, but letter forms within a word form complex ligatures.
• Example (in translation): 'Algeria achieved its independence in 1962 after 132 years of French occupation.'
• With Unicode, the order of characters in files matches the conceptual order, and the reversal of displayed characters is handled by the rendering system.

Sec. 2.2.2 STOP WORDS
• Common words which would appear to be of little value, e.g. the, a, and, to, be.
• With a stop list, you exclude the commonest words from the dictionary entirely. Intuition:
– They have little semantic content.
– There are a lot of them: ~30% of postings for the top 30 words.
• But the trend is away from doing this:
– Good compression techniques mean the space for including stop words in a system is very small.
– Good query optimization techniques mean you pay little at query time for including stop words.
– You need them for:
• Phrase queries: "King of Denmark"
• Various song titles, etc.: "Let it be", "To be or not to be"
• "Relational" queries: "flights to London"
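A minimal sketch of the two steps above (tokenization plus stop-word filtering) in Python; the regex tokenizer and the tiny stop list are illustrative assumptions, not a production design:

```python
import re

# A tiny, illustrative stop list; real systems use longer, curated lists.
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "in", "on"}

def tokenize(text):
    # Naive tokenizer: lowercase, then split on non-alphanumeric characters.
    # Real tokenizers must also handle apostrophes, hyphens, numbers, etc.
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

def remove_stop_words(tokens):
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = tokenize("Friends, Romans and Countrymen")
print(tokens)                     # ['friends', 'romans', 'and', 'countrymen']
print(remove_stop_words(tokens))  # ['friends', 'romans', 'countrymen']
```

Note how stop-word removal would destroy phrase queries such as "To be or not to be", which is exactly why the trend is away from aggressive stop lists.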
Sec. 2.2.3 NORMALIZATION TO TERMS
• We may need to "normalize" words in indexed text as well as query words into the same form
– We want to match U.S.A. and USA.
• The result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary.
• We most commonly implicitly define equivalence classes of terms by, e.g.,
– deleting periods to form a term: U.S.A., USA
– deleting hyphens to form a term: anti-discriminatory, antidiscriminatory

Sec. 2.2.3 NORMALIZATION: OTHER LANGUAGES
• Normalization of things like date forms
– 7月30日 vs. 7/30
– Japanese use of kana vs. Chinese characters
• Tokenization and normalization may depend on the language and so are intertwined with language detection.
• Crucial: indexed text and query terms must be "normalized" identically.

Sec. 2.2.3 CASE FOLDING
• Reduce all letters to lower case
– Exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
– Often best to lower-case everything, since users will use lowercase regardless of 'correct' capitalization…
• Longstanding Google example:
– Query C.A.T.
– #1 result is for "cats" (well, Lolcats), not Caterpillar Inc.

Sec. 2.2.3 NORMALIZATION TO TERMS
• An alternative to equivalence classing is to do asymmetric expansion.
• An example of where this may be useful:
– Enter: window → Search: window, windows
– Enter: windows → Search: Windows, windows, window
– Enter: Windows → Search: Windows
• Potentially more powerful, but less efficient.

Sec. 2.2.4 LEMMATIZATION
• Reduce inflectional/variant forms to the base form.
• E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• Lemmatization implies doing "proper" reduction to the dictionary headword form.

Sec. 2.2.4 STEMMING
• Reduce terms to their "roots" before indexing.
• "Stemming" suggests crude affix chopping
– language dependent
– e.g., automate(s), automatic, automation all reduced to automat.
• Example:
– Original: for example compressed and compression are both accepted as equivalent to compress.
– Stemmed: for exampl compress and compress ar both accept as equival to compress

Sec. 2.2.4 PORTER'S ALGORITHM
• Commonest algorithm for stemming English
– Results suggest it is at least as good as other stemming options.
• Conventions + 5 phases of reductions
– Phases are applied sequentially.
– Each phase consists of a set of commands.
– Sample convention: of the rules in a compound command, select the one that applies to the longest suffix.

Sec. 2.2.4 TYPICAL RULES IN PORTER
• sses → ss
• ies → i
• ational → ate
• tional → tion
• Weight-of-word-sensitive rules, e.g. (m>1) EMENT → (delete the suffix):
– replacement → replac
– cement → cement

Sec. 2.2.4 DOES STEMMING HELP?
• English: very mixed results. Helps recall for some queries but harms precision on others
– E.g., operative (dentistry) → oper
• Definitely useful for Spanish, German, Finnish, …
– 30% performance gains for Finnish!
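A quick way to try Porter stemming yourself, assuming the NLTK library is installed (pip install nltk); NLTK's implementation may differ slightly from the idealized slide examples:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["compressed", "compression", "automates", "automatic", "automation"]:
    print(word, "->", stemmer.stem(word))
# compressed  -> compress
# compression -> compress
# The automat* family stems to very similar (though not always identical)
# crude roots, illustrating "affix chopping" rather than true lemmatization.
```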
Quantitative Reasoning in Legal Settings
George Mikros, HBKU

Questions that can be answered using statistics
• Descriptive statistics:
– What is the number of rapes reported in the country?
– How many drugs of a particular type are found in drug seizures?
• Inference from observed data to a larger population:
– Given the responses to the British Crime Survey, what is the estimated number of illegal drug users in the UK?
• Inference from observed data to a scientific conclusion:
– Did exposure to the emissions from an incinerator raise the risk of birth defects?
• Prediction:
– Given a set of characteristics, what is the chance that the accused will reoffend?
• Evaluation:
– The evaluation of scientific findings in court uses probability as a measure of uncertainty. This is based upon the findings, associated data, expert knowledge, case-specific propositions and conditioning information.

Use of statistical science and types of evidence
• Statistical science can be called upon to support expert knowledge when dealing with a variety of types of evidence and proceedings. These include:
– evaluation of DNA evidence;
– evaluation of trace evidence, e.g., fibres, glass, paint, or firearms discharge residues;
– evaluation of pattern-matching evidence, e.g., toolmarks, ballistics, and fingerprints;
– causation of illness or injury in a civil case, where it may be helpful to apply epidemiological research (the study of occurrence, aetiology, prognosis and treatment of illness in populations) to individual cases.

The process by which statistical science is used in legal proceedings

Communication of the probative value when statistical science is used
• When conclusions based on statistical science are drawn from data, it is crucial that the data and the reasoning supporting those conclusions are transparent. Under the term 'intelligent transparency', Baroness Onora O'Neill has argued that the data and reasoning must be:
– accessible: easily available and not, for example, hidden behind a proprietary algorithm;
– understandable: to everyone involved in the case, including a jury;
– useable: they address current specific concerns;
– assessable: the 'working' is open to scrutiny by legal and other professionals.

Expected frequency

Properties of probabilities
• Probabilities of all possible events add to 1. E.g., the probability of not getting two heads in two coin tosses is 1 − 1/4 = 3/4, or 75%.
• Probabilities are multiplied for sequences of independent events (e.g. the probability of two heads in a row is 1/2 × 1/2 = 1/4).
• Probabilities are added when considering separate (mutually exclusive) sets of events. E.g., the probability of rolling either a 5 or a 6 on a single die roll is the sum of the probabilities of each separate event; since a 5 and a 6 cannot be rolled simultaneously on a single die, they are mutually exclusive, so 1/6 + 1/6 = 1/3.
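The arithmetic in the slide above can be checked directly; a minimal sketch using Python's exact fractions:

```python
from fractions import Fraction

# Independent events multiply: P(two heads) = 1/2 * 1/2
p_two_heads = Fraction(1, 2) * Fraction(1, 2)
print(p_two_heads)                      # 1/4

# Probabilities of all possible events add to 1, so the complement rule gives
print(1 - p_two_heads)                  # 3/4: probability of NOT getting two heads

# Mutually exclusive events add: P(5 or 6 on one die roll)
print(Fraction(1, 6) + Fraction(1, 6))  # 1/3
```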
Assumptions in probabilities
1. Probability as a measure of knowledge: probability is a subjective measure, dependent on the observer's knowledge and assumptions. In the example of drawing an ace from a deck of cards, the probability of 1/13 is based on the assumptions that the deck is complete, well shuffled, and unbiased in terms of physical characteristics and drawing method.
2. Assumptions underlying probabilities: the calculation of probability rests on certain assumptions: a full pack, equal chances of drawing any card, uniformity in card condition, and unbiased drawing. These assumptions are critical in defining the probability.
3. Dynamic nature of probability: probability is dynamic and changes with new information. For example, if aces are drawn from a deck, the probability of drawing another ace changes. This highlights how probability reflects the current state of knowledge and information.
4. Probability in legal contexts: in legal proceedings, probability is used to make informed judgments based on available data. Probabilities should be grounded in empirical data where available, such as demographic information in a population.
5. Importance of relevant data in assigning probability: the use of relevant data is crucial in assigning probabilities. Probabilities are not arbitrary but should be based on the best available evidence. For instance, the frequency of certain traits in a population can inform the probability of these traits in a random individual from that population.
6. Implications for legal proceedings: in legal contexts, understanding these nuances of probability is vital, especially when dealing with evidence and making judgments. The way probabilities are assigned and interpreted can significantly impact legal decisions.

Personal probabilities
• We make personal assignments of probability every day:
– What is the probability that I will miss the bus this morning if I have one more cup of coffee?
– What is the probability of being caught if I break into this property?
• In such circumstances, the probabilities that we assign, albeit not mathematically evaluated or even verbalised in this way, will depend on our knowledge and understanding of the factors and risks involved. Such probabilities are also known as personal 'degrees of belief'.

Expert-assigned probabilities
• Experts assign personal probabilities based on their experience, knowledge, and understanding of their type of expert evidence. However, a challenge with such probabilities is the potential influence of cognitive effects. The reliability of expert-assigned probabilities is determined by various factors, including:
– the extent and relevance of the expert's experience;
– the ability of the expert to compile and store those experiences systematically in memory;
– the expert's ability to recall the relevant data accurately;
– the expert's ability to avoid and mitigate bias while inputting expert knowledge; and
– calibration, i.e., measuring the extent to which those events assigned a probability of (say) 40% actually do have a relative frequency of occurrence close to 40%.

Probative value expressed as a likelihood ratio (LR)
• Technically, the LR is the probability of the evidence assuming that proposition A is true divided by the probability of the evidence assuming that proposition B is true:

LR = P(evidence | A) / P(evidence | B)

• LRs are typically attached to DNA evidence in which a 'match' of some degree is found between the suspect's DNA profile and the DNA profile derived from a trace found at the scene of a crime. The two competing propositions are that the DNA profile in the recovered trace material originates from the suspect or that it originates from someone else, so we can express the LR as:

LR = P(DNA evidence | trace came from the suspect) / P(DNA evidence | trace came from someone else)

• The 'DNA evidence' is the suspect's DNA together with the DNA trace from the crime scene. For the specific situation when the trace contains plenty of DNA and is deemed to have come from one person, the LR above can be written, after some mathematical operations and given some assumptions, as the reciprocal of the random match probability:

LR = 1 / (random match probability)

• The random match probability is the probability of finding an evidence match if selected randomly from within a particular population. For example, in the context of a DNA sample, it is the probability of observing a DNA profile of an unknown person that is the same as the DNA profile from a crime scene stain (assuming a particular population genetic model).
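A minimal numeric sketch of the LR definitions above; the probability values are assumed purely for illustration:

```python
def likelihood_ratio(p_evidence_given_a, p_evidence_given_b):
    """LR = P(evidence | proposition A) / P(evidence | proposition B)."""
    return p_evidence_given_a / p_evidence_given_b

# Single-source DNA case: the LR reduces to 1 / random match probability.
random_match_probability = 1e-9  # assumed value: one in a billion
print(likelihood_ratio(1.0, random_match_probability))  # ~1e9 (one billion)
```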
Typical LRs for DNA evidence are in the millions or billions, although the exact values may be contested, such as when there are complications due to the traces containing a mix of DNA from multiple people.

Interpretations of LRs

Bayes theorem and the LR [1]
• Bayes' theorem provides a general rule for updating probabilities about a proposition in the light of new evidence. It says that:
– posterior (final) odds for a proposition = LR × prior (initial) odds for the proposition
• For example, suppose a hypothetical screening test for doping in sports is claimed to be '95% accurate', meaning that if an athlete is doping there is a 95% chance (probability 0.95; sensitivity) of obtaining a positive test result, and if the athlete is a non-doper there is a 95% chance (probability 0.95; specificity) of obtaining a negative test result.
• Assuming that the odds of an athlete taking drugs prior to being subject to a screening test are 1 in 50 (1:50), then if an athlete tests positive, what is the probability that they are truly doping?

Bayes theorem and the LR [2]
• The LR is the probability of a positive test given the proposition that the athlete is doping (95%) divided by the probability of a positive test given the proposition that the athlete is not doping (5%, i.e., 1 − specificity). This ratio is 19 (LR = 0.95/0.05 = 19).
• Bayes' theorem tells us that the posterior odds of the athlete having taken drugs can be computed by multiplying the prior odds of that proposition by the LR provided by the positive test. In this form, we have to work with odds, not probability. Odds are related mathematically to probability, and a very simple conversion gives the value for probability: odds of m:n correspond to the probability m/(m + n).

Bayes theorem and the LR [3]
• So, for the doping example,
– the prior odds for the proposition 'athlete is doping' versus 'athlete is not doping' are 1:50, which corresponds to a probability of 1/(1 + 50), or a prior probability of approximately 0.02 (the actual value is 0.0196, which is equivalent to 1.96%);
– the LR is 0.95/0.05 = 19; and
– therefore, by Bayes' theorem, the posterior odds that the athlete is doping are (1:50) × 19 = 19:50, giving a posterior probability of doping of 19/(19 + 50) ≈ 0.28, or 28%.

Bayes theorem and the LR [4]
• So, even though drug testing could be claimed to be '95% accurate' (based on the sensitivity and specificity metric), this does not mean that, in the event of a positive result, there is a 95% chance that the athlete is doping. In this example, the probability that the athlete is doping, given a positive test result, is approximately 28%. The posterior odds that an athlete is doping depend crucially on the prior odds for the proposition 'athlete is doping' versus 'athlete is not doping' (in the example this was 1:50) prior to considering the result of the screening test (the LR).
• This means that if conclusions are drawn from test results in isolation, the meaning of the test's accuracy can be misinterpreted. This could lead to outcomes such as athletes being incorrectly accused of doping because they failed a drug test.
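The doping example computed step by step; a straightforward transcription of the slide's numbers into Python:

```python
prior_odds = 1 / 50                   # prior odds 1:50 that the athlete is doping
sensitivity, specificity = 0.95, 0.95

lr = sensitivity / (1 - specificity)  # 0.95 / 0.05 = 19
posterior_odds = prior_odds * lr      # (1:50) x 19 = 19:50 = 0.38

# Odds of m:n correspond to probability m / (m + n).
posterior_prob = posterior_odds / (1 + posterior_odds)
print(round(lr, 2), round(posterior_prob, 2))  # 19.0 0.28  (about 28%)
```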
Test time…
• Context:
– Mammograms are used as a screening tool for breast cancer. However, a positive result on a mammogram does not necessarily mean the person has breast cancer. The accuracy of a mammogram test can be better understood using Bayes' theorem.
• Given data:
– Prevalence of breast cancer: assume 1% of women have breast cancer.
– Sensitivity of mammogram: the probability that a mammogram correctly identifies breast cancer (true positive) is 80%.
– Specificity of mammogram: the probability that a mammogram correctly identifies no breast cancer (true negative) is 90%.
• Task:
– Calculate the probability that a woman actually has breast cancer, given that she has a positive mammogram result.

Bayes theorem in action
• Bayes' theorem formula:

P(Cancer | Positive) = P(Positive | Cancer) × P(Cancer) / P(Positive)

• P(Cancer|Positive) is the probability of having cancer given a positive test result.
• P(Positive|Cancer) is the sensitivity (80%).
• P(Cancer) is the prevalence of breast cancer (1%).
• P(Positive) is the overall probability of a positive test result, calculated as:

P(Positive) = P(Positive|Cancer) × P(Cancer) + P(Positive|No Cancer) × P(No Cancer)

• Here, P(Positive|No Cancer) is the false positive rate, which is 1 − specificity (10%).
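A worked answer to the task above, plugging the slide's numbers into the formula; a minimal sketch:

```python
prevalence = 0.01    # P(Cancer)
sensitivity = 0.80   # P(Positive | Cancer)
specificity = 0.90   # P(Negative | No Cancer)

# Total probability of a positive result (true positives + false positives):
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem:
p_cancer_given_positive = sensitivity * prevalence / p_positive
print(round(p_cancer_given_positive, 3))  # 0.075, i.e. only about 7.5%
```

Despite the seemingly accurate test, the low prevalence (the prior) keeps the posterior probability low, exactly as in the doping example.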
Nature of coincidences and rare events
• Intuition's limitation: human intuition often struggles with assessing how 'surprising' an event truly is.
• Perception vs. reality: just because an event is exceedingly rare for an individual (e.g. winning the lottery) doesn't mean it is surprising when considering a larger group (due to the vast number of tickets sold).

Illustrating the concept of coincidences
• Plane crashes: in 2014, three major plane crashes occurred within an eight-day period. While it seemed beyond coincidence to many, there is approximately a 60% probability that such a 'cluster' will happen over a ten-year span.
• Lottery wins: winning the lottery might have odds of 45 million to 1 against for an individual, but given the large number of tickets sold, someone winning is not as surprising as it seems.
• The birthday paradox problem: https://www.youtube.com/watch?v=KtT_cgMzHx8

The complexity of evaluating rare events
• Misleading intuition: events that seem improbable for an individual might not be that rare in a broader context.
• Legal interpretations: terms like 'beyond reasonable doubt' and 'balance of probabilities' in legal contexts relate more to the strength of evidence than the actual probability of an event.

Standards of proof in legal contexts
• Beyond reasonable doubt
– Definition: the highest standard of proof in criminal law.
– Meaning: the evidence must lead to a moral certainty that the accused is guilty and that no other logical explanation can be derived from the facts.
– Use case: required in criminal trials to convict a defendant.
– Example: a defendant is charged with burglary. Evidence includes DNA, fingerprints at the scene, stolen items found in their possession, and security footage of the break-in.
• Balance of probabilities
– Definition: the standard of proof in civil law.
– Meaning: the plaintiff must show that their assertion is more likely to be true than not.
– Use case: used in civil trials, such as personal injury claims or contract disputes.
– Example: in a negligence claim, evidence shows the defendant failed to maintain safety standards, leading to the plaintiff's injury.
• Key distinctions
– Criminal vs. civil: reflects the different stakes (liberty vs. liability).
– Degree of certainty: "beyond reasonable doubt" requires near certainty, while "balance of probabilities" requires a likelihood greater than 50%.

CORPORA AND CORPUS LINGUISTICS
George Mikros, HBKU

WHAT IS A CORPUS
A corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety.
A corpus is different from a random collection of texts or an archive.

SOME DEFINITIONS
• "generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type" (Leech 1992: 116)
• "…selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language" (Sinclair 1996)
• "A well-organized collection of data" (McEnery 2003)
• "gathered according to explicit design criteria" (Tognini-Bonelli 2001: 2)
• "built according to explicit design criteria for a specific purpose" (Atkins et al. 1992)
• texts selected and put together "in a principled way" (Johansson 1998: 3)

WHAT IS CORPUS LINGUISTICS?
A new theory of language?
• No. In principle, any theory of language is compatible with corpus-based research.
A separate branch of linguistics (in addition to syntax, semantics…)?
• No. Most aspects of language can be studied using a corpus (in principle).
A methodology to study language in all its aspects?
• Yes! The most important principle is that aspects of language are studied empirically by analysing natural data using a corpus.
• A corpus is an electronic, machine-readable collection of texts that represent "real life" language use.

AN INITIAL EXAMPLE
Suppose you're a linguist interested in the syntax of verb phrases.
• Some verbs are transitive, some intransitive:
– I ate the meat pie (transitive)
– I swam (intransitive)
What about quiver and quake?
Most traditional grammars characterise these as intransitive.
Are they really intransitive?

ONE POSSIBLE METHODOLOGY…
The standard method relies on the linguist's intuition:
• I never use quiver/quake with a direct object.
• I am a native speaker of this language.
• All native speakers have a common mental grammar or competence (Chomsky).
• Therefore, my mental grammar is the same as everyone else's.
• Therefore, my intuition accurately reflects English speakers' competence.
• Therefore, quiver/quake are intransitive.
NB: The above is a gross simplification! E.g. linguists often rely on judgements elicited from other native speakers.

ANOTHER POSSIBLE METHODOLOGY…
This one relies on data:
• I may never use quiver/quake with a direct object, but…
• …other people might.
• Therefore, I'll get my hands on a large sample of written and/or spoken English and check.

QUIVER/QUAKE: THE CORPUS LINGUIST'S ANSWER
A study by Atkins and Levin (1995) found that quiver and quake do occur in transitive constructions:
• the insect quivered its wings
• it quaked his bowels (with fear)
They used a corpus of 50 million words to find examples of the verbs.
With sufficient data, you can find examples that your own intuition won't give you…

EXAMPLE II: LEXICAL SEMANTICS
Quasi-synonymous lexical items exhibit subtle differences in context:
• strong
• powerful
A fine-grained theory of lexical semantics would benefit from data about these contextual cues to meaning.

EXAMPLE II CONTINUED
Some differences between strong and powerful (source: British National Corpus):
• strong wind, feeling, accent, flavour
• powerful tool, weapon, punch, engine
The differences are subtle, but examining their collocates helps.
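A rough sketch of how collocates like those above can be extracted; "corpus.txt" is a placeholder for any plain-text corpus you have locally, and the window size is an assumption:

```python
from collections import Counter
import re

def collocates(tokens, node, window=2):
    """Count words appearing within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

# 'corpus.txt' is a hypothetical local file, not a specific distributed corpus.
text = open("corpus.txt", encoding="utf-8").read().lower()
tokens = re.findall(r"[a-z]+", text)
print(collocates(tokens, "strong").most_common(10))
print(collocates(tokens, "powerful").most_common(10))
```

In practice one would also rank collocates by an association measure (e.g. mutual information) rather than raw frequency, but raw counts already reveal the strong/powerful contrast.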
SOME PRELIMINARY DEFINITIONS
The second approach is typical of the corpus-based methodology.
Corpus: a large, machine-readable collection of texts.
• Often, in addition to the texts themselves, a corpus is annotated with relevant linguistic information.
Corpus-based methodology: an approach to natural language analysis that relies on generalisations made from data.

EXAMPLE (BRITISH NATIONAL CORPUS)
British National Corpus (BNC): 100 million words of English
• 90% written, 10% spoken
Designed to be representative and balanced.
Texts from different genres (literature, news, academic writing…).
Annotated: every single word is accompanied by part-of-speech information.

EXAMPLE (CONTINUED)
A sentence in the BNC:
• Explosives found on Hampstead Heath.
<s> <w NN2>Explosives <w VVD>found <w PRP>on <w NP0>Hampstead <w NP0>Heath <PUN>.

EXAMPLE (CONTINUED)
<s> : new sentence
<w NN2>Explosives : plural noun
<w VVD>found : past-tense verb
<w PRP>on : preposition
<w NP0>Hampstead : proper noun
<w NP0>Heath : proper noun
<PUN>. : punctuation
(Explosives found on Hampstead Heath)

IMPORTANT TO NOTE
This is not "raw" text.
• Annotation means we can search for particular patterns.
• E.g. for the quiver/quake study: "find all occurrences of quiver which are verbs, followed by a determiner and a noun".
The collection is very large.
• Only in very large collections are we likely to find rare occurrences.
Corpus search is done by computer. You can't trawl through 100 million words manually!
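A rough sketch of how such annotation can be queried programmatically, using the simplified SGML shown above; the tag names follow the BNC's C5 tagset (VV* for lexical verbs, AT0/DT0 for articles/determiners, NN* for nouns), and the helper function is hypothetical:

```python
import re

# A BNC-style tagged sentence (simplified SGML, as shown in the slides).
tagged = ("<s> <w NN2>Explosives <w VVD>found <w PRP>on "
          "<w NP0>Hampstead <w NP0>Heath <PUN>.")

# Extract (POS, word) pairs from <w POS>word spans.
pairs = re.findall(r"<w (\w+)>(\S+)", tagged)
print(pairs)
# [('NN2', 'Explosives'), ('VVD', 'found'), ('PRP', 'on'),
#  ('NP0', 'Hampstead'), ('NP0', 'Heath')]

# A pattern search in the spirit of the quiver/quake study: a form of
# 'quiver' tagged as a verb, followed by a determiner and then a noun.
def transitive_quiver(pairs):
    for (t1, w1), (t2, _w2), (t3, _w3) in zip(pairs, pairs[1:], pairs[2:]):
        if (w1.lower().startswith("quiver") and t1.startswith("VV")
                and t2 in ("AT0", "DT0") and t3.startswith("NN")):
            yield w1

print(list(transitive_quiver(pairs)))  # [] for this sentence
```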
THE PRACTICAL OBJECTIONS…
But we're linguists, not computer scientists! Do I have to write programs?
• No, there are literally dozens of available tools for searching a corpus.
Are all corpora good for all purposes?
• No. Some are "general-purpose", like the BNC. Others are designed to address specific issues.

THE THEORETICAL OBJECTIONS…
What guarantee do we have that the texts in our corpus are "good data", quality texts, written by people we can trust?
How do I know that what I find isn't just a small, exceptional case? E.g. quiver in a transitive construction could really be a one-off!
Just because there are a few examples of something doesn't mean that all native speakers use a certain construction!
Do we throw intuition out of the window?

PART 2: A brief history of corpus linguistics

LANGUAGE AND THE COGNITIVE REVOLUTION
Before the 1950s, the linguist's task was:
• to collect data about a language;
• to make generalisations from the data (e.g. "In Maltese, the verb always agrees in number and gender with the subject NP").
• The basic idea: language is "out there", the sum total of things people say and write.
After the 1950s:
• the so-called "cognitive revolution";
• language treated as a mental phenomenon;
• no longer about collecting data, but about explaining what mental capabilities speakers have.

THE 19TH & EARLY 20TH CENTURY
Many early studies relied on corpora.
Language acquisition research was based on collections of child data.
Anthropologists collected samples of unknown languages.
Comparative linguists used large samples from different languages.
A lot of work was done on frequencies:
• frequency of words…
• frequency of grammatical patterns…
• frequency of different spellings…
All of this was interrupted around 1955.

CHOMSKY AND THE COGNITIVE TURN
Chomsky (1957) was primarily responsible for the new, cognitive view of language.
He distinguished (1965):
• Descriptive adequacy: describing language, making generalisations such as "X occurs more often than Y".
• Explanatory adequacy: explaining why some things are found in a language, but not others, by appealing to speakers' competence, their mental grammar.
He made several criticisms of corpus-based approaches.

CRITICISMS OF CORPORA (I)
Competence vs. performance:
• To explain language, we need to focus on the competence of an idealised speaker-hearer.
• Competence = internalised, tacit knowledge of language.
• Performance – the language we speak/write – is not a good mirror of our knowledge:
– it depends on situations;
– it can be degraded;
– it can be influenced by other cognitive factors beyond linguistic knowledge.

CRITICISMS OF CORPORA (II)
Early work using corpora assumed that the number of sentences of a language is finite (so we can get to know everything about a language if the sample is large enough).
But actually, it is impossible to count the number of sentences in a language.
• Syntactic rules make the possibilities literally infinite:
– the man in the house (NP → NP + PP)
– the man in the house on the beach (PP → PREP + NP)
– the man in the house on the beach by the lake …
So what use is a corpus? We're never going to have an infinite corpus.

CRITICISMS OF CORPORA (III)
A corpus is always skewed, i.e. biased in favour of certain things.
• Certain obvious things are simply never said. E.g. we probably won't find "a dog is a dog" in our corpus.
A corpus is always partial: we will only find things in a corpus if they are frequent enough.
• A corpus is necessarily only a sample.
• Rare things are likely to be omitted from a sample.

CRITICISMS OF CORPORA (IV)
Why use a corpus if we already know things by introspection?
How can a corpus tell us what is ungrammatical?
• Corpora won't contain "disallowed" structures, because these are by definition not part of the language.
• So a corpus contains exclusively positive evidence: you only get the "allowed" things.
• But if X is not in the corpus, this doesn't mean it's not allowed.
• It might just be rare, and your corpus isn't big enough. (Skewness)

REFUTATIONS
Corpora can be better than introspective evidence because:
• They are public; other people can verify and replicate your results (the essence of the scientific method).
• Some kinds of data are simply not available to introspection. E.g. people aren't good at estimating the frequency of words or structures.
• Skewness can itself be informative: if X occurs more frequently than Y in a corpus, that in itself is an interesting fact.

REFUTATIONS (II)
By the way, nobody is saying "throw introspection out the window"…
• There is no reason not to combine the corpus-based and the introspection-based methods.
Many other objections can be overcome by using large enough corpora.
• Pre-1950, most corpus work was done manually, so it was error-prone.
• Machine-readable corpora mean we have a great new tool to analyse language very efficiently!

CORPORA IN THE LATE 20TH CENTURY
Corpus linguistics enjoyed a revival with the advent of the digital personal computer.
• Kučera and Francis: the Brown Corpus, one of the first.
• Svartvik: the London-Lund Corpus, which built on Brown.
These were rapidly followed by others…
Today, corpora are firmly back on the linguistic landscape.

PART 3: Corpus design
WHAT IS REPRESENTATIVENESS?
Representativeness is a defining feature of a corpus.
As language is infinite but a corpus has to be finite in size, we sample and proportionally include a wide range of text types to ensure maximum balance and representativeness.
"A corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety" (Leech 1991).
Representativeness refers to the extent to which a sample includes the full range of variability in a population (Biber 1993).

WHAT IS REPRESENTATIVENESS?
Representativeness is a fluid concept closely related to your research questions.
• If you want a corpus which is representative of general English, a corpus representative of newspapers will not do.
• If you want a corpus representative of newspapers, a corpus representative of The Times will not do.

TWO TYPES OF REPRESENTATIVENESS
The representativeness of general corpora and of (domain- or genre-specific) specialized corpora is achieved and measured in different ways.
• General corpora
– Balance: the range of genres included in a corpus and their proportions.
– Sampling: how the text chunks for each genre are selected.
• Specialized corpora
– Degree of closure/saturation: closure/saturation for a particular linguistic feature (e.g. size of lexicon) of a variety of language (e.g. computer manuals) means that the feature appears to be finite or is subject to very limited variation beyond a certain point, i.e. the curve of lexical growth flattens out (see the vocabulary-growth sketch below).

WHY SHOULD WE CARE ABOUT REPRESENTATIVENESS?
Reader of corpus-based studies (assessment):
• To interpret the results of corpus research with caution, considering whether the corpus data and the method used in the study were appropriate.
Corpus user (assessment):
• It is important to "know your corpus";
• to decide whether a given corpus is appropriate for a specific research question;
• to make appropriate claims on the basis of such a corpus.
Corpus creator (assessment?):
• To make the corpus as representative as possible of the language (variety) it claims to represent;
• to document design criteria explicitly and make the documentation available to corpus users.

CRITERIA FOR TEXT SELECTION
The criteria used to select texts for a corpus are principally external.
• The external vs. internal distinction corresponds to Biber's (1993: 243) situational vs. linguistic perspectives.
• External criteria are defined situationally, irrespective of the distribution of linguistic features.
• Internal criteria are defined linguistically, taking into account the distribution of such features.
It is circular to use internal criteria like the distribution of words or grammatical features as the primary parameters for the selection of corpus data.
• If the distribution of linguistic features is pre-determined when the corpus is designed, there is no point in analyzing such a corpus to discover naturally occurring linguistic feature distributions.
• The corpus is problematic, as it is skewed by design.

CRITERIA FOR TEXT SELECTION
Time?
• If a corpus is not regularly updated, it rapidly becomes unrepresentative (Hunston 2002).
The relevance of permanence in corpus design actually depends on how we view a corpus: as a static or a dynamic language model.
• Static model: sample corpora (nearly all existing corpora, e.g. BNC, LOB/FLOB).
• Dynamic model: monitor corpora (e.g. the Bank of English).
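A rough way to check the lexical closure/saturation mentioned above: track how many new word types appear as more of the corpus is read, and see whether the curve flattens. "corpus.txt" and the 10,000-token checkpoint interval are assumptions for illustration:

```python
import re

seen, growth = set(), []
text = open("corpus.txt", encoding="utf-8").read().lower()
tokens = re.findall(r"[a-z]+", text)

for i, tok in enumerate(tokens, 1):
    seen.add(tok)
    if i % 10_000 == 0:          # record vocabulary size every 10k tokens
        growth.append((i, len(seen)))

print(growth)                     # inspect or plot: a flattening curve
                                  # suggests the lexicon is saturating
```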
CRITERIA FOR TEXT SELECTION
Tips:
• "Criteria for determining the structure of a corpus should be small in number, clearly separate from each other, and efficient as a group in delineating a corpus that is representative of the language or variety under examination." (Sinclair 2005)

CORPUS BALANCE
A balanced corpus covers a wide range of text categories which are supposed to be representative of the language (variety) under consideration.
The proportions of the different kinds of text it contains should correspond with informed and intuitive judgements.
There is no scientific measure for balance – just best estimation.
The acceptable balance is determined by the intended use – your research questions.

THE BNC MODEL
Generally accepted as being a balanced corpus.
Has been followed in the construction of a number of corpora.
4,124 texts (including transcripts of recordings); ca. 100 million words: 90% written + 10% spoken.
Three criteria for the written component:
• Domain: the content type (i.e. subject field).
• Time: the period of text production.
• Medium: the type of text publication (books, periodicals, etc.).
Two criteria for the spoken component:
• Demographic: informal conversations by speakers selected by age group, sex, social class and geographical region.
• Context-governed: formal encounters such as meetings, lectures and radio broadcasts, recorded in 4 broad context categories.

WRITTEN BNC

SPOKEN BNC

BNC VS. BALANCE
The design criteria of the BNC illustrate the notion of corpus balance/representativeness very well:
• "In selecting texts for inclusion in the corpus, account was taken of both production, by sampling a wide variety of distinct types of material, and reception, by selecting instances of those types which have a wide distribution. Thus, having chosen to sample such things as popular novels, or technical writing, best-seller lists and library circulation statistics were consulted to select particular examples of them." (Aston and Burnard 1998: 28)

PRAGMATICS IN CORPUS DESIGN
"Most general corpora of today are badly balanced because they do not have nearly enough spoken language in them; estimates of the optimal proportion of spoken language range from 50% – the neutral option – to 90%, following a guess that most people experience many times as much speech as writing" (Sinclair 2005).
The written BNC is nine times as large as the spoken BNC.
• Is speech less frequent or important than writing?

PRAGMATICS IN CORPUS DESIGN
Absolutely not!
• …but writing typically has a larger audience than speech;
• …also, collection of spoken data costs 10 times as much as collection of written data;
• …and it takes 10 hours to transcribe one hour of recording.
Pragmatic considerations also mean that balance is a more important issue for a static sample corpus than for a dynamic monitor corpus.
• As a monitor corpus is frequently updated, it is usually "impossible to maintain a corpus that also includes text of many different types, as some of them are just too expensive or time consuming to collect on a regular basis." (Hunston 2002: 30-31)

SAMPLING IN CORPUS CREATION
Language is infinite, but a corpus is finite in size, so sampling is inescapable in corpus building.
• "Some of the first considerations in constructing a corpus concern the overall design: for example, the kinds of texts included, the number of texts, the selection of particular texts, the selection of text samples from within texts, and the length of text samples. Each of these involves a sampling decision, either conscious or not." (Biber 1993)
Population (language/variety) vs. sample (corpus):
• The aim of sampling "is to secure a sample which, subject to limitations of size, will reproduce the characteristics of the population, especially those of immediate interest, as closely as possible" (Yates 1965: 9).
• A sample is a scaled-down version of a larger population.
• A sample is representative if what we find for the sample also holds for the general population.
Corpus representativeness and balance rely heavily on sampling.
• A corpus is a sample of a given population (language or language variety).

SAMPLING IN CORPUS CREATION
Sampling unit:
• For written text, it could be a book (chapter), periodical or newspaper (article).
Sampling frame:
• A list of sampling units.
Population:
• The language, languages, or language variety under consideration;
• the assembly of all sampling units, which can be defined in terms of:
– language production (demographic: speakers and writers);
– language reception (demographic: audience and readers);
– language as a product (registers and genres).

EXAMPLES OF BROWN AND LOB
Brown:
• Population: written English text published in the United States in 1961.
• Sampling frame: a list of the collection of books and periodicals in the Brown University Library and the Providence Athenaeum.
• Sampling unit: each book/periodical within the sampling frame.
LOB:
• Population: written English text published in the UK around 1961.
• Sampling frame: The British National Bibliography Cumulated Subject Index 1960–1964 (for books) and Willing's Press Guide 1961 (for periodicals).
• Sampling unit: each book/periodical within the sampling frame.

SAMPLING TECHNIQUES
Simple random sampling:
• All sampling units within the sampling frame are numbered, and the sample is chosen by use of a table of random numbers.
• Selection correlates positively with frequency in the population, so rare features may not be included.
Stratified random sampling (a short sketch follows below):
• The population is divided into relatively homogeneous groups (i.e. strata), and these are then sampled at random.
• Never less representative than simple random sampling.

STRATIFIED RANDOM SAMPLING
The whole population for the Brown/LOB corpora was divided into 15 text categories, and samples were then drawn from each category at random.
In demographic sampling for collecting spoken data, individuals (sampling units) in the population are first divided into different groups on the basis of demographic variables such as speaker/writer age, sex and social class, and then samples are taken at random from each group.

SIZE OF SAMPLES
Full texts or text segments?
• "Samples of language for a corpus should wherever possible consist of entire documents or transcriptions of complete speech events" (Sinclair 2005).
• Good for studying textual organization.
• A full-text corpus may be inappropriate or problematic:
– the peculiarity of an individual style or topic may occasionally show through;
– there are copyright issues in including full texts;
– frequent linguistic features are quite stable in their distributions, so short text chunks (e.g. 2,000 running words) are usually sufficient.
Text-initial, middle or end chunks?
• Text-initial, middle, and end samples must be taken in a balanced way.

PROPORTION OF GENRES IN BROWN
Constant sample size: ca. 2,000 words.
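An illustrative sketch of stratified random sampling as described above; the category names and counts are made up, and real corpus builders would sample proportionally to each stratum's share of the population:

```python
import random

random.seed(0)  # reproducible for the example

# Hypothetical sampling frame: sampling units grouped into strata.
sampling_frame = {
    "press_reportage": [f"press_{i}" for i in range(500)],
    "fiction":         [f"fic_{i}" for i in range(800)],
    "academic":        [f"acad_{i}" for i in range(300)],
}

samples_per_stratum = 10  # assumed constant allocation per stratum
corpus_sample = {
    stratum: random.sample(units, samples_per_stratum)
    for stratum, units in sampling_frame.items()
}

for stratum, chosen in corpus_sample.items():
    print(stratum, chosen[:3], "...")
```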
"RELATIVELY SPEAKING…"
Any claim of corpus representativeness and balance must be interpreted in relative terms.
• There is no objective way to balance a corpus or to measure its representativeness.
• Any claim of representativeness is an act of faith rather than a statement of fact.
Corpus balance and representativeness are fluid concepts.
• The research question that one has in mind when building/choosing a corpus determines what an acceptable balance is for the corpus one should use and whether it is suitably representative.
Corpus balance is also influenced by practical considerations.
• How easily can data of different types be collected?

CORPUS SIZE
How large should a corpus be?
• There is no easy answer to this question.
• Krishnamurthy (2001): "Size matters."
• Leech (1991): "Size is not all-important."
The size of corpus needed depends upon the purpose for which it is intended, as well as a number of practical considerations:
• the kind of query that is anticipated from users – are you studying common or rare linguistic features?
• the methodology used to study the data – how much work can be done by the machine, and how much has to be done by hand?
• for corpus creators, also the source of data – are the data readily available in electronic form at a reasonable cost? Can copyright permissions be granted easily, if at all?

CORPUS SIZE
Corpus size has increased with the development of technology:
• 1960s-70s: Brown and LOB – one million words
• 1980s: the Birmingham/Cobuild corpora – 20 M words
• 1990s: the British National Corpus – 100 M words
• Early 21st century: the Bank of English – 645 M words

CORPUS SIZE
Is a large corpus really what you want?
• The size of the corpus needed to explore a research question depends on the frequency and distribution of the linguistic features under consideration in that corpus – your research question.
• Corpora for lexical studies are usually much larger than those for grammatical studies.
• Specialized corpora serve a very different yet important purpose from large multi-million-word corpora.
• Corpora that need extensive manual annotation or analysis are necessarily small.
• Many corpus tools set a ceiling on the number of concordances that can be extracted.
The optimum size of a corpus is determined by the research question the corpus is intended to address, as well as practical considerations.

TYPES OF CORPORA, DIFFERENT USES
• General/reference vs. specialized corpora
• Written vs. spoken corpora
• Synchronic vs. diachronic corpora
• Monolingual vs. multilingual corpora
• Comparable vs. parallel corpora
• Native vs. learner corpora
• Developmental vs. learner/interlanguage corpora
• Raw vs. annotated corpora
• Static/sample vs. dynamic/monitor corpora
• …

MONITOR CORPORA
Constantly updated and growing in size:
• much larger corpus size;
• often contain full texts;
• always up to date;
• often only admit new material which has new features not already present in the corpus;
• used to track changes across different periods of time;
• monitor corpora can be a series of static corpora.
Disadvantages:
• no attempt to balance the corpus;
• text availability can become an issue (e.g. copyrights);
• it is confusing to indicate a specific corpus version (token number);
• results obtained from corpora of different sizes cannot easily be compared.
SOME WELL-KNOWN ENGLISH CORPORA
• The British National Corpus (BNC)
• The Bank of English (BoE)
• BYU American English corpus
• Corpora of the Brown family (Brown, LOB, FLOB, Frown)
• ICE corpora (GB, East Africa, Hong Kong, Singapore, the Philippines, New Zealand, etc.)
• London-Lund Corpus of spoken English
• SBCSAE
• The Helsinki Diachronic Corpus of English Texts (8th-18th century, ca. 1.5 million words)
• The International Corpus of Learner English (ICLE)
• MICASE

THE BNC
First and best-known national corpus (a sample corpus).
100 M word balanced corpus of written (90%) and spoken (10%) British English in current use, 1960s to early 1990s (1966-1974, 1974-1984, 1985-1993).
Rich metadata encoded for language variation studies.
POS tagged.
Accessing the BNC:
• BYU-BNC: http://corpus.byu.edu/bnc/
• BNC Online: http://www.natcorp.ox.ac.uk/getting/index.xml.ID=order_online
• Lancaster BNCweb (CQP edition): http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php
• BNC Baby: http://www.natcorp.ox.ac.uk/corpus/baby/index.html
• Sketch Engine: http://www.sketchengine.co.uk/
• BNC PIE: http://pie.usna.edu/

THE BOE
Best-known monitor corpus.
645 M words (counting and growing) of present-day English.
75% written and 25% spoken.
70% BrE, 20% AmE and 10% other English varieties.
Particularly useful for lexical and lexicographic studies, e.g. tracking new words, new uses or meanings of old words, and words falling out of use.

CORPUS OF CONTEMPORARY AMERICAN ENGLISH (COCA)
560+ M words of American English; 20 M per year for 1990-2017.
Equally divided among spoken, fiction, popular magazines, newspapers, and academic texts.
Updated every 6-9 months.
Useful for studying variation across genres and over time.
Free online access:
• https://www.english-corpora.org/coca/

CORPORA OF THE BROWN FAMILY
• Brown: written AmE in 1961
• LOB: written BrE in 1961
• FLOB: written BrE in 1991
• Frown: written AmE in 1991
Common corpus design:
• one M words each;
• 500 samples (ca. 2,000 words each);
• same proportions from the same 15 text categories.
Useful for synchronic and diachronic comparison of BrE and AmE.
Further information:
• ICAME CD: http://khnt.hit.uib.no/icame/manuals/
• Extended Brown family: http://cqpweb.lancs.ac.uk (an access account must be applied for)

THE ICE CORPORA
20 one-M-word balanced corpora
• e.g. Britain, Ireland, US, Canada, Hong Kong, Singapore, India, the Philippines, East Africa.
Common corpus design:
• 500 samples (ca. 2,000 words each);
• 60% spoken + 40% written;
• 12 genres;
• 1990-1994.
Designed for the synchronic study of "world Englishes".
More information:
• http://www.ucl.ac.uk/english-usage/ice/

THE LONDON-LUND CORPUS
First electronic corpus of spontaneous language.
A corpus of spoken British English recorded from 1953-1987.
100 texts, each of 5,000 words, totaling half a million running words.
Both dialogue (e.g. face-to-face conversations, telephone conversations, and public discussion) and monologues (both spontaneous and prepared).
Speaker information (gender, age, occupation).
Annotated with prosodic information.
Further information:
• http://clu.uni.no/icame/manuals/
SBCSAE
Based on hundreds of recordings of spontaneous speech from all over the United States.
Represents a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds.
Each of the 60 transcripts is time-stamped and accompanied by a digital audio file.
Free download:
• https://www.linguistics.ucsb.edu/research/santabarbara-corpus

HELSINKI CORPUS OF ENGLISH TEXTS
Best-known historical corpus.
1.5 million words of English in 400 text samples dating from the 8th to the 18th centuries.
Divided into three periods (Old, Middle, and Early Modern English) and 11 sub-periods.
Socio-historical variation and a wide range of text types for each specific period.
Allows researchers to go beyond simply dating and reporting language change by combining diachronic, sociolinguistic and genre studies.
Further information:
• Oxford Text Archive: http://ota.oucs.ox.ac.uk/headers/1477.xml

THE ICLE CORPUS
First and best-known learner English corpus.
Comprises argumentative essays written by advanced learners of English (i.e. university students of English as a foreign language (EFL) in their 3rd or 4th year of study).
Over 2.5 million words in 3,640 texts ranging between 500 and 1,000 words in length.
11 L1 backgrounds, still expanding with 8 additional L1s.
Useful for investigating the interlanguage of foreign language learners.
Further information: https://uclouvain.be/fr/node/11962

MICASE
Ca. 1.8 M words in 152 transcripts of nearly 200 hours of recordings of 1,571 speakers.
Focuses on contemporary university speech within the domain of the University of Michigan.
Encoded with speaker information (age, academic role, language status).
Free online search or transcript download:
• http://quod.lib.umich.edu/m/micase/

AN OPEN TOOL FOR EXPLORING CORPORA
AntConc
• Freely available from https://www.laurenceanthony.net/software/antconc/
• AntConc is a program for analysing electronic texts (that is, corpus linguistics) in order to find and reveal patterns in language. It was created by Laurence Anthony of Waseda University.
• AntConc does not require installation. It is a stand-alone program that runs by simply clicking on the icon. When downloading it, please make sure you note where you have saved the program, for ease of locating it afterwards. I recommend that you save it onto the desktop first, then move it into a subfolder on the main drive.

Thank you!
[email protected]
