Extracting Information From Text
International Burch University - Stirling Education
Dželila MEHANOVIĆ
Summary
This document provides an introduction to information extraction within natural language processing (NLP). It explains the techniques and methods used for extracting structured information from unstructured text. The concepts are discussed in the context of identifying named entities and relations, with examples illustrating how to apply these methods.
Full Transcript
Introduction to Natural Language Processing
Extracting Information From Text
Assist. Prof. Dr. Dželila MEHANOVIĆ

Extracting Information From Text
The goal of this chapter is to answer the following questions: How can we build a system that extracts structured data from unstructured text? What are some robust methods for identifying the entities and relationships described in a text? Which corpora are appropriate for this work?

Information Extraction
Information comes in many shapes and sizes. One important form is structured data, where there is a regular and predictable organization of entities and relationships. For example, we might be interested in the relation between companies and locations. Given a particular company, we would like to be able to identify the locations where it does business; conversely, given a location, we would like to discover which companies do business in that location.

If our data is in tabular form, such as the example in Table, then answering these queries is straightforward. If this location data were stored in Python as a list of tuples (entity, relation, entity), then the question "Which organizations operate in Atlanta?" could be translated directly into a query over that list.

Things are more tricky if we try to get similar information out of text. For example, consider the following snippet (from nltk.corpus.ieer, fileid NYT19980315.0085):

(1) The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta.
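The tuple-based query mentioned above ("Which organizations operate in Atlanta?") can be sketched in plain Python. The locs list below is a small hypothetical fragment of the table, chosen so that the answer matches the one discussed next:

```python
# Hypothetical (entity, relation, entity) tuples standing in for the table.
locs = [('Omnicom', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]

# "Which organizations operate in Atlanta?"
query = [e1 for (e1, rel, e2) in locs if e2 == 'Atlanta']
print(query)  # ['BBDO South', 'Georgia-Pacific']
```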
If you read through (1), you will collect the information required to answer the example question. But how do we get a machine to understand enough about (1) to return the list ['BBDO South', 'Georgia-Pacific'] as an answer? This is obviously a much harder task. Unlike the table, (1) contains no structure that links organization names with location names.

Information Extraction
We will only look for very specific kinds of information in text, such as the relation between organizations and locations. Rather than trying to use text like (1) to answer the question directly, we first convert the unstructured data of natural language sentences into the structured data of the table. This method of getting meaning from text is called information extraction. Information extraction has many applications, including business intelligence, media analysis, sentiment detection, patent search, and email scanning.

Information Extraction Architecture
The figure illustrates a simple information extraction system. It processes a document by segmenting it into sentences and tokenizing the words of each sentence. Each sentence is then tagged with part-of-speech labels, which assist in named entity recognition: identifying the entities of interest. Finally, relation recognition searches for likely relations between pairs of entities.

To perform the first three tasks, we can define a function that simply connects together NLTK's default sentence segmenter, word tokenizer, and part-of-speech tagger. Next, in named entity recognition, we segment and label the entities that might participate in interesting relations with one another. Typically, these will be definite noun phrases or proper names such as Monty Python. In relation extraction, we search for specific patterns between pairs of entities that occur near one another in the text, and use those patterns to build tuples recording the relationships between the entities.
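In NLTK itself, this preprocessing function simply chains nltk.sent_tokenize, nltk.word_tokenize, and nltk.pos_tag. As a dependency-free sketch of the same three-stage shape, here is a toy version using a naive regex sentence splitter and a tiny hypothetical lookup tagger (TOY_TAGS is an invented stand-in, not NLTK's tagger):

```python
import re

# Toy stand-in for a real POS tagger: a tiny lookup table (hypothetical).
TOY_TAGS = {'the': 'DT', 'little': 'JJ', 'dog': 'NN', 'barked': 'VBD'}

def ie_preprocess(document):
    # 1. Sentence segmentation (naive split on sentence-final punctuation).
    sentences = re.split(r'(?<=[.!?])\s+', document.strip())
    # 2. Word tokenization (naive: keep word characters, split the rest).
    sentences = [re.findall(r"\w+(?:-\w+)*", sent) for sent in sentences]
    # 3. Part-of-speech tagging (dictionary lookup, defaulting to 'NN').
    return [[(w, TOY_TAGS.get(w.lower(), 'NN')) for w in sent]
            for sent in sentences]

print(ie_preprocess("The little dog barked."))
```

The real NLTK components handle abbreviations, punctuation tokens, and unknown words far more robustly; only the pipeline structure carries over.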
Chunking
The basic technique we will use for entity recognition is chunking, which segments and labels multi-token sequences as illustrated in Figure. The smaller boxes show the word-level tokenization and part-of-speech tagging, while the large boxes show higher-level chunking. Each of these larger boxes is called a chunk. Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens. Also like tokenization, the pieces produced by a chunker do not overlap in the source text.

Noun Phrase Chunking
We will begin by considering the task of noun phrase chunking, or NP-chunking, where we search for chunks corresponding to individual noun phrases (a noun phrase is a linguistic unit consisting of a noun and any associated words that modify or complement it). For example, here is some Wall Street Journal text with NP-chunks marked using brackets:

(2) [ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.

One of the most useful sources of information for NP-chunking is part-of-speech tags. We demonstrate this approach using an example sentence that has been part-of-speech tagged. In order to create an NP-chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked. In this case, we will define a simple grammar with a single regular-expression rule. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). Using this grammar, we create a chunk parser, and test it on our example sentence. The result is a tree, which we can either print, or display graphically.
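In NLTK this grammar is written "NP: {<DT>?<JJ>*<NN>}" and applied with nltk.RegexpParser. To show the mechanics without the library, the dependency-free sketch below encodes each token's tag as <TAG> in a string and matches the pattern with an ordinary regex, which mirrors how RegexpParser matches tag patterns internally:

```python
import re

def np_chunk(tagged):
    """Chunk a [(word, tag), ...] sentence with the pattern <DT>?<JJ>*<NN>."""
    # Encode the tag sequence as one <TAG> per token.
    encoded = ''.join('<%s>' % tag for _, tag in tagged)
    chunks = []
    for m in re.finditer(r'(?:<DT>)?(?:<JJ>)*<NN>', encoded):
        start = encoded[:m.start()].count('<')  # token index where match begins
        end = start + m.group().count('<')      # one past the last matched token
        chunks.append([w for w, _ in tagged[start:end]])
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]
print(np_chunk(sentence))  # [['the', 'little', 'yellow', 'dog'], ['the', 'cat']]
```

This is only a sketch: the real RegexpParser additionally builds a tree, handles multiple rules, and supports chinking.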
Tag Patterns
The rules that make up a chunk grammar use tag patterns to describe sequences of tagged words. A tag pattern is a sequence of part-of-speech tags delimited using angle brackets, e.g., <DT>?<JJ>*<NN>.

Chunking With Regular Expressions
The RegexpParser starts with a flat sentence structure and applies the chunking rules step by step to build up the chunk structure. After all the rules have been applied, the final structure is returned. A simple example grammar includes two rules: one for determiners/possessives, adjectives, and a noun, and another for sequences of proper nouns. An example sentence is chunked using this grammar.

If a tag pattern matches at overlapping locations, the leftmost match takes precedence. For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then only the first two nouns will be chunked. Once we have created the chunk for money market, we have removed the context that would have permitted fund to be included in a chunk. This issue would have been avoided with a more permissive chunk rule, e.g., NP: {<NN>+}.

Exploring Text Corpora
We saw how we could interrogate a tagged corpus to extract phrases matching a particular sequence of part-of-speech tags. We can do the same work more easily with a chunker.

Chinking
Sometimes it is easier to define what we want to exclude from a chunk. We can define a chink to be a sequence of tokens that is not included in a chunk. In the following example, barked/VBD at/IN is a chink:

[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]

Chinking is the process of removing a sequence of tokens from a chunk.
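The leftmost-match behaviour described above can be demonstrated with the same encoded-tag-string sketch (in NLTK itself this would be nltk.RegexpParser('NP: {<NN><NN>}')):

```python
import re

def chunk_nn_pairs(tagged):
    """Apply the rule {<NN><NN>}: two consecutive nouns, leftmost match first."""
    encoded = ''.join('<%s>' % tag for _, tag in tagged)
    chunks = []
    for m in re.finditer(r'<NN><NN>', encoded):
        start = encoded[:m.start()].count('<')
        chunks.append([w for w, _ in tagged[start:start + 2]])
    return chunks

nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
# The first match consumes 'money market', so 'fund' is left unchunked.
# The more permissive rule {<NN>+} would chunk all three nouns together.
print(chunk_nn_pairs(nouns))  # [['money', 'market']]
```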
If the matching sequence of tokens:
○ spans an entire chunk, then the whole chunk is removed
○ appears in the middle of the chunk, these tokens are removed, leaving two chunks where there was only one before
○ is at the periphery of the chunk, these tokens are removed, and a smaller chunk remains

These three possibilities are illustrated in Table. Next, we put the entire sentence into a single chunk, then excise the chinks.

Representing Chunks: Tags Versus Trees
Chunk structures can be represented using either tags or trees. The most widespread file representation uses IOB tags. In this scheme, each token is tagged with one of three special chunk tags: I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g., B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O.

IOB tags have become the standard way to represent chunk structures in files, and we will also be using this format. Here is how the information in Figure would appear in a file: in this representation there is one token per line, each with its part-of-speech tag and chunk tag.

As we saw earlier, chunk structures can also be represented using trees. These have the benefit that each chunk is a constituent that can be manipulated directly.

Reading IOB Format And The CoNLL-2000 Chunking Corpus
Using the corpora module we can load Wall Street Journal text that has been tagged, then chunked using the IOB notation. The chunk categories provided in this corpus are NP, VP, and PP.
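The IOB scheme is easy to decode by hand. The helper below is a hypothetical illustration (NLTK provides chunk.conllstr2tree for the real conversion): it walks one-token-per-line "word POS chunk-tag" strings and groups B-/I- runs into chunks:

```python
def iob_to_chunks(lines):
    """Group 'word POS iob-tag' lines into (chunk_type, [words]) chunks."""
    chunks, current = [], None
    for line in lines:
        word, pos, iob = line.split()
        if iob.startswith('B-'):          # B- begins a new chunk
            current = (iob[2:], [word])
            chunks.append(current)
        elif iob.startswith('I-') and current is not None:
            current[1].append(word)        # I- continues the current chunk
        else:                              # O: outside any chunk
            current = None
    return chunks

lines = ["We PRP B-NP", "saw VBD O", "the DT B-NP",
         "little JJ I-NP", "yellow JJ I-NP", "dog NN I-NP"]
print(iob_to_chunks(lines))
# [('NP', ['We']), ('NP', ['the', 'little', 'yellow', 'dog'])]
```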
As we have seen, each sentence is represented using multiple lines, as shown here. A conversion function chunk.conllstr2tree() builds a tree representation from one of these multiline strings. Moreover, it permits us to choose any subset of the three chunk types to use, here just the NP chunks.

We can use the NLTK corpus module to access a larger amount of chunked text. The CoNLL-2000 Chunking Corpus contains 270k words of Wall Street Journal text, divided into "train" and "test" portions, annotated with part-of-speech tags and chunk tags in the IOB format. We can access the data using nltk.corpus.conll2000. Here is an example that reads the 100th sentence of the "train" portion of the corpus.

As you can see, the CoNLL-2000 Chunking Corpus contains three chunk types: NP chunks, which we have already seen; VP chunks, such as has already delivered; and PP chunks, such as because of. Since we are only interested in the NP chunks right now, we can use the chunk_types argument to select them.

Trees
A tree is a set of connected labeled nodes, each reachable by a unique path from a distinguished root node. Here's an example of a tree (note that trees are standardly drawn upside-down).

We use a 'family' metaphor to talk about the relationships of nodes in a tree: for example, S is the parent of VP; conversely, VP is a child of S. Also, since NP and VP are both children of S, they are siblings. For convenience, there is also a text format for specifying trees. In NLTK, we create a tree by giving a node label and a list of children. We can incorporate these into successively larger trees. Here are some of the methods available for tree objects.

The bracketed representation for complex trees can be difficult to read. In these cases, the draw method can be very useful.
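In NLTK the constructor is nltk.Tree('NP', ['Alice']) and trees offer methods such as label(), leaves(), and draw(). The minimal stand-in class below is a hypothetical sketch, just to make the "label plus list of children" idea and the bracketed text format concrete:

```python
class Tree:
    """A minimal stand-in for nltk.Tree: a label plus a list of children."""
    def __init__(self, label, children):
        self.label, self.children = label, children

    def leaves(self):
        # Collect the strings at the fringe of the tree, left to right.
        out = []
        for child in self.children:
            out.extend(child.leaves() if isinstance(child, Tree) else [child])
        return out

    def __repr__(self):
        # Bracketed text format, e.g. (NP the rabbit).
        return '(%s %s)' % (self.label, ' '.join(str(c) for c in self.children))

# Build successively larger trees from smaller ones.
np1 = Tree('NP', ['Alice'])
np2 = Tree('NP', ['the', 'rabbit'])
vp = Tree('VP', [Tree('V', ['chased']), np2])
s = Tree('S', [np1, vp])
print(s)           # (S (NP Alice) (VP (V chased) (NP the rabbit)))
print(s.leaves())  # ['Alice', 'chased', 'the', 'rabbit']
```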
It opens a new window containing a graphical representation of the tree. The tree display window allows you to zoom in and out, to collapse and expand subtrees, and to print the graphical representation to a PostScript file (for inclusion in a document).

Named Entity Recognition
At the start of this chapter, we briefly introduced named entities (NEs). Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. Table lists some of the more commonly used types of NEs. These should be self-explanatory, except for "FACILITY": human-made artifacts in the domains of architecture and civil engineering; and "GPE": geo-political entities such as city, state/province, and country.

The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities. This can be broken down into two subtasks: identifying the boundaries of the NE, and identifying its type. In Question Answering (QA), we try to improve the precision of Information Retrieval by recovering not whole pages, but just those parts which contain an answer to the user's question. Most QA systems take the documents returned by standard Information Retrieval, and then attempt to isolate the minimal text snippet in the document containing the answer. Now suppose the question was Who was the first President of the US?, and one of the documents that was retrieved contained the following passage:

(3) The Washington Monument is the most prominent structure in Washington, D.C. and one of the city's early attractions. It was built in honor of George Washington, who led the country to independence and then became its first President.

Analysis of the question leads us to expect that an answer should be of the form X was the first President of the US, where X is not only a noun phrase, but also refers to a named entity of type PER.
This should allow us to ignore the first sentence in the passage. Although it contains two occurrences of Washington, named entity recognition should tell us that neither of them has the correct type.

How do we go about identifying named entities? One option would be to look up each word in an appropriate list of names. But there are problems in the case of names for people or organizations. Any list of names will probably have poor coverage: new organizations come into existence every day, so it is unlikely that we will be able to recognize many of these entities. Another major source of difficulty is that many named entity terms are ambiguous. Thus May and North are likely to be parts of named entities of type DATE and LOCATION, respectively, but could both be part of a PERSON; conversely, Christian Dior looks like a PERSON but is more likely to be of type ORGANIZATION. Further challenges are posed by multiword names like Stanford University, and by names that contain other names, such as Cecil H. Green Library and Escondido Village Conference Service Center. In named entity recognition, therefore, we need to be able to identify the beginning and end of multitoken sequences.

Relation Extraction
Once named entities have been identified in a text, we then want to extract the relations that exist between them. As indicated earlier, we will typically be looking for relations between specified types of named entity. One way of approaching this task is to initially look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y. We can then use regular expressions to pull out just those instances of α that express the relation we are looking for. The following example searches for strings that contain the word in.
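In NLTK this search is typically run with nltk.sem.extract_rels over a tagged corpus; the dependency-free sketch below applies the same filter pattern to a few hypothetical (X, α, Y) triples standing in for the output of a named entity tagger:

```python
import re

# Keep fillers containing 'in', but reject those where 'in' is followed
# by a gerund, e.g. "success in supervising the transition of".
IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')

# Hypothetical (X, filler, Y) triples between ORG and LOC entities.
triples = [
    ('BBDO South', 'in', 'Atlanta'),
    ('Georgia-Pacific', 'in', 'Atlanta'),
    ('Wells', 'success in supervising the transition of', 'Omnicom'),
]

found = [(x, y) for (x, filler, y) in triples if IN.match(filler)]
print(found)  # [('BBDO South', 'Atlanta'), ('Georgia-Pacific', 'Atlanta')]
```

The third triple is rejected by the negative lookahead, exactly the case discussed next.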
The special regular expression (?!\b.+ing\b) is a negative lookahead assertion that allows us to disregard strings such as success in supervising the transition of, where in is followed by a gerund.

Exercises
1. Write a tag pattern to match noun phrases containing plural head nouns, e.g., many/JJ researchers/NNS, two/CD weeks/NNS, both/DT new/JJ positions/NNS. Try to do this by generalizing the tag pattern that handled singular noun phrases.
2. Pick one of the three chunk types in the CoNLL-2000 Chunking Corpus. Inspect the data and try to observe any patterns in the POS tag sequences that make up this kind of chunk. Develop a simple chunker using the regular expression chunker nltk.RegexpParser. Discuss any tag sequences that are difficult to chunk reliably.
3. Write a tag pattern to cover noun phrases that contain gerunds, e.g., the/DT receiving/VBG end/NN, assistant/NN managing/VBG editor/NN. Add these patterns to the grammar, one per line. Test your work using some tagged sentences of your own devising.

Thank you