Natural Language Understanding in a Semantic Web Context
Document Details
2016
Caroline Barrière
Summary
This book serves as a starting point for students and researchers in the Semantic Web interested in Natural Language Processing (NLP). It focuses on how NLP can be utilized to transform unstructured text data (like reports and articles) into a shareable, normalized format for the Semantic Web. The book covers fundamental NLP concepts, emphasizing semantic processing, information extraction, and knowledge acquisition.
Full Transcript
Caroline Barrière
Natural Language Understanding in a Semantic Web Context
Computer Research Institute of Montreal (CRIM), Montreal, Canada

ISBN 978-3-319-41335-8    ISBN 978-3-319-41337-2 (eBook)
DOI 10.1007/978-3-319-41337-2
Library of Congress Control Number: 2016943789
© Springer International Publishing Switzerland 2016

Preface

I hope for this book to serve as a good starting point for students and researchers in the Semantic Web (SW) interested in discovering what Natural Language Processing (NLP) has to offer. At a time when Open Data is becoming increasingly popular, there is a pressing demand for tools to help the SW community transform those data into a shareable, normalized format, making all these data accessible as Linked Open Data. But a large portion of the data held by organizations seeking to make their data openly accessible is not stored in tables, but in much less structured forms, that is, textual forms such as reports, notes, memos, and articles. Manually generating structured information from them seems like an insurmountable task. Certainly, NLP can help uncover the information held in text, thus augmenting the real content of the Semantic Web in a significant and lasting way.

My main goal is not just to foster interest in NLP in the readership, but awareness of how useful it can be to the Semantic Web community. I will not delve very deeply into linguistic principles, but instead focus on how NLP approaches different kinds of problems and provides solutions to them. My aim is also to show how, for the past 40 years, researchers in NLP have been interested in problems closely related to the ones faced by the Semantic Web community. Problems such as ambiguity and linking of knowledge are not specific to one field or the other, but central to both.

This book covers the basics of Natural Language Processing (NLP), with a focus on Natural Language Understanding (NLU). Here, understanding refers to semantic processing, Information Extraction, and knowledge acquisition, which I see as the key links between the SW and NLP communities.
Much emphasis will be placed on mining sentences in search of entities and relations. In our quest in NLU, we will encounter challenges for various text analysis tasks, including part-of-speech tagging, parsing, semantic disambiguation, Named Entity Recognition, and Relation Extraction. I will present the standard algorithms associated with these tasks so as to provide an understanding of the fundamental concepts. Furthermore, I chose to emphasize the importance of experimental design and result analysis, and to do so, most chapters show small experiments on corpus data with quantitative and qualitative analysis of results.

I assume that readers are familiar with the Semantic Web and are looking to learn about NLP in order to expand their horizons. That being said, the book provides enough information for a reader new to both fields to understand their underlying principles and the challenges they face. Also, the reader should be familiar with algorithms and simple programming principles, as this book will often use algorithms to describe problem-solving approaches. Since I chose to cover a small number of simple algorithms in detail, I do include a Further Reading section in most chapters with links to relevant state-of-the-art research in which readers can find more complex algorithms. I believe that understanding the fundamentals within basic algorithms is the best preparation for understanding more complex algorithms. I hope that through this book, important challenges in NLP will become familiar to the reader and that the book will stimulate the reader's interest in investigating them further.

Montreal, Canada
Caroline Barrière

Acknowledgements

I would like to acknowledge the contribution of the Centre de Recherche Informatique de Montréal (CRIM) to the writing of this book. During the time of writing, CRIM was very supportive and encouraging of this ambitious project. I especially acknowledge the editing work of Amanda Bon, a good friend of mine and such a talented person, who performed a first editorial review of the entire book. Amanda did more than make simple editorial changes; she pointed out unclear passages, learning gaps, and confusing parts. This allowed me to perform a thorough review of the material in the hope of making it clearer. Without Amanda's reading, suggestions, and corrections, this book would not have seen the light of day. I cannot thank her enough. Thank you as well to my parents, Fernand and Fernande, for their constant encouragement throughout the long (even longer than expected) writing process. And finally, thank you to Cyril, who patiently heard me talk about this book, day after day, from beginning to end. Thanks for the late-night dinners waiting for me. Thanks for being so supportive of everything I do.

Contents

1 Introduction
  1.1 Scope
  1.2 Approach
  1.3 Prerequisite Knowledge
  1.4 Structure
  1.5 Learning Paths

Part I Searching for Entities in Text

2 Entities, Labels, and Surface Forms
  2.1 A First Surface Form: Entity Label
  2.2 Experiment — Searching for Beethoven
  2.3 Looking for Additional Surface Forms
  2.4 Categorizing various Surface Forms
  2.5 Expanding the Set of Surface Forms with a Generative Approach
  2.6 Multilingual Surface Forms
  2.7 The Most Ambiguous Surface Form: Pronouns
  2.8 In Summary
  2.9 Further Reading
  2.10 Exercises

3 Searching for Named Entities
  3.1 Entity Types
  3.2 Gazetteers
  3.3 Capturing Regularities in Named Entities: Regular Expressions
  3.4 Experiment — Finding DATE Instances in a Corpus
    3.4.1 Gold Standard and Inter-Annotator Agreement
    3.4.2 Baseline Algorithm: Simple DATE Regular Expression
    3.4.3 Refining the DATE Expressions
  3.5 Language Dependency of Named Entity Expressions
  3.6 In Summary
  3.7 Further Reading
  3.8 Exercises

4 Comparing Surface Forms
  4.1 Edit Distance — A Dynamic Programming Algorithm
  4.2 From Distance to Binary Classification — Finding Thresholds
  4.3 Soundex — A Phonetic Algorithm
  4.4 Experiment — A Comparative Setting for Misspelling Detection
    4.4.1 Gold Standard
    4.4.2 Looking at Results
  4.5 Soundex and Edit Distance in a Multilingual Setting
  4.6 In Summary
  4.7 Further Reading
  4.8 Exercises

Part II Working with Corpora

5 Exploring Corpora
  5.1 Defining Corpora
  5.2 Characterizing Documents
  5.3 Corpora Resources
  5.4 Surface Form Frequencies in Corpus
  5.5 Information Content
  5.6 Experiment — Comparing Domain-Specific Corpora
    5.6.1 Building Domain-Specific Corpora
    5.6.2 Tokenizing the Corpora
    5.6.3 Calculating Frequencies and IC
    5.6.4 Observing and Comparing Results
  5.7 Mutual Information and Collocations
  5.8 Viewing Candidate Collocations through a Concordancer
  5.9 In Summary
  5.10 Further Reading
  5.11 Exercises

6 Words in Sequence
  6.1 Introduction to Language Modeling
  6.2 Estimating Sequence Probabilities
  6.3 Gathering N-Gram Probabilities from Corpora
    6.3.1 Building a Domain-Independent Corpus
    6.3.2 Calculating N-Gram Probabilities
    6.3.3 Issues in N-Gram Estimation
  6.4 Application: Spelling Correction
  6.5 In Summary
  6.6 Further Reading
  6.7 Exercises
7 Bilingual Corpora
  7.1 Types of Bilingual Corpora
  7.2 Exploring Noise in Bilingual Corpora
  7.3 Language Identification
    7.3.1 Estimating Letter Trigram Probabilities
    7.3.2 Testing a Letter Trigram Model for Language Identification
    7.3.3 Language Identification as Preprocessing Step
  7.4 Searching for Term Equivalents in Parallel Corpora
    7.4.1 Obtaining and Indexing a Parallel Corpus
    7.4.2 Adapting PMI for Term Equivalent Search
  7.5 Experiment — Term Equivalent Search
    7.5.1 Dataset and a posteriori Evaluation
    7.5.2 Inter-Annotator Agreement
    7.5.3 Result Analysis
  7.6 In Summary
  7.7 Further Reading
  7.8 Exercises

Part III Semantic Grounding and Relatedness

8 Linguistic Roles
  8.1 Tokenization
  8.2 Sentence Splitting
  8.3 Lemmatization
  8.4 POS Tagging
  8.5 Constituency Parsing
  8.6 Experiment — Classifying Groups of Entities
    8.6.1 Goal of the Experiment
    8.6.2 Gold Standard and Evaluation Method
    8.6.3 Our Method: Classification through POS Tagging
    8.6.4 Performance Evaluation
    8.6.5 Result Analysis — Intrinsic versus Extrinsic Evaluation
  8.7 In Summary
  8.8 Further Reading
  8.9 Exercises

9 Definition-Based Grounding
  9.1 Word Sense Disambiguation
  9.2 Entity Linking
  9.3 Bag-of-Words Representation
  9.4 Bag-of-Words Content — Looking at Text Cohesion
  9.5 Bag-of-Words Comparison
  9.6 Grounding Algorithm: BOW-Match
  9.7 Experiment — Disambiguating Beethoven
    9.7.1 Grounding Space, Gold Standard, and Evaluation Method
    9.7.2 Testing our BOW-Match Algorithm
    9.7.3 Result Analysis
  9.8 In Summary
  9.9 Further Reading
  9.10 Exercises

10 Relatedness
  10.1 Building a Co-occurrence Vector
    10.1.1 Corpus Size
    10.1.2 Word Filtering — Linguistic versus Statistic
    10.1.3 Window Dependency — Positioning
    10.1.4 Window Dependency — Size
  10.2 Measuring Relatedness
    10.2.1 First-Level Relatedness
    10.2.2 Second-Level Relatedness
  10.3 Word Embeddings
  10.4 Experiment 1 — Relatedness for Semantic Similarity
    10.4.1 Dataset and Evaluation
    10.4.2 Testing Relatedness Measures and Analyzing Results
  10.5 Experiment 2 — Relatedness for Entity Linking
    10.5.1 Dataset and Evaluation
    10.5.2 Developing a BOW-Similarity Algorithm
    10.5.3 Result Analysis
  10.6 In Summary
  10.7 Further Reading
  10.8 Exercises

Part IV Knowledge Acquisition

11 Pattern-Based Relation Extraction
  11.1 Relation Types and Textual Resources
  11.2 Lexical Patterns
    11.2.1 Regular Expressions for Lexical Patterns
  11.3 Lexico-Syntactic Patterns
    11.3.1 Regular Expressions for Lexico-Syntactic Patterns
  11.4 Experiment — Pattern-Based Synonymy Relation Extraction
    11.4.1 Development Set
    11.4.2 Defining a Synonym Search Strategy Using Lexical and Lexico-Syntactic Patterns
    11.4.3 Defining a Gold Standard and Evaluation Method
    11.4.4 Testing Lexical and Lexico-Syntactic Patterns
    11.4.5 Result Analysis
  11.5 Toward a Semi-automatic Process of Knowledge Acquisition
  11.6 In Summary
  11.7 Further Reading
  11.8 Exercises

12 From Syntax to Semantics
  12.1 Dependency Grammars
  12.2 Semantic Frames
  12.3 From Sentences to Semantic Frames
    12.3.1 Automatic Frame Identification
    12.3.2 Semantic Role Labeling
  12.4 Semi-automatic Acquisition of Syntactic Realizations
    12.4.1 Human Annotation Effort — Writing Prototype Sentences
    12.4.2 Algorithmic Effort — Processing Prototype Sentences into Syntactic Realizations
  12.5 Experiment — Semantic Role Labeling of Cooking Sentences
    12.5.1 Gold Standard and Evaluation
    12.5.2 Define our Semantic Role Labeling Strategy
    12.5.3 Result Analysis and Discussion
  12.6 Syntactic Realizations as Patterns for Relation Extraction
  12.7 In Summary
  12.8 Further Reading
  12.9 Exercises

13 Semantic Types
  13.1 Common Semantic Types — Exploring NER Systems
  13.2 Specific Semantic Types — Exploring Lexico-Semantic Resources
    13.2.1 Semantic Types in FrameNet
    13.2.2 Semantic Types in WordNet
  13.3 Building Gazetteers from Textual Resources
    13.3.1 Lexico-Syntactic Patterns
    13.3.2 Dependency Patterns
    13.3.3 Comparing Approaches
  13.4 Experiment — Using Semantic Types to Filter Semantic Roles
    13.4.1 Task Definition
    13.4.2 Defining a Gold Standard
    13.4.3 Combining Resources and Algorithms within Voting Strategies
    13.4.4 Evaluation of Voting Strategies
    13.4.5 Result Analysis and Discussion
  13.5 In Summary
  13.6 Further Reading
  13.7 Exercises

Appendix A: A Look into the Semantic Web
Appendix B: NLP Tools, Platforms and Resources
Appendix C: Relation Lists
Glossary
References

Acronyms

AI      Artificial intelligence
BOW     Bag of words
IE      Information Extraction
IR      Information retrieval
KB      Knowledge base
KR      Knowledge representation
LOD     Linked Open Data
MT      Machine Translation
MWE     Multi-word expression
NED     Named Entity Disambiguation
NER     Named Entity Recognition
NLP     Natural Language Processing
NLU     Natural Language Understanding
OOV     Out-of-vocabulary
POS     Part of speech
RDF     Resource Description Framework
SW      Semantic Web
URI     Uniform Resource Identifier
WSD     Word Sense Disambiguation

URI Prefixes

PREFIX foaf: http://xmlns.com/foaf/spec/
PREFIX rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
PREFIX rdfs: http://www.w3.org/rdf-schema/
PREFIX dbr: http://dbpedia.org/resource/
PREFIX dbpedia-fr: http://fr.dbpedia.org/resource/
PREFIX dbo: http://dbpedia.org/ontology/
PREFIX dbp: http://dbpedia.org/property/
PREFIX owl: http://www.w3.org/2002/07/owl
PREFIX yago: http://dbpedia.org/class/yago/
PREFIX wikidata: http://www.wikidata.org/entity/
PREFIX schema: http://schema.org
PREFIX wordnet: http://www.w3.org/2006/03/wn/wn20/instances/
PREFIX dct: http://purl.org/dc/terms/
PREFIX umbel-rc: http://umbel.org/reference-concept/
PREFIX nlusw: (refers to made-up examples in the current book)

Chapter 1 Introduction

Natural Language Processing (NLP) is a large research field, and although it has evolved over recent decades, many of the fundamental difficulties being tackled today are similar to those people were grappling with fifty years ago. The introduction of electronic corpora in the early 1990s, combined with the rise of internet accessibility in the later part of the decade, caused NLP to take a sharp statistical turn toward corpus analysis, increasing its overlap with the fields of machine learning and information retrieval. These changes led to the surfacing of new challenges, such as question-answering using large corpora, as well as new data-oriented statistical views on linguistic problems, such as measuring distributional similarities between words. Although these changing tides have had positive effects on the field of NLP, some underlying problems related to the understanding of language remain to this day. In my view, language understanding implies the transformation of text into a deeper representation, one on which reasoning can occur. Admittedly, this view is more in line with the traditional view of artificial intelligence (AI) than with current trends in NLP, but this is not to say that the content of this book will be old-fashioned.
On the contrary, it is intended to introduce currently available resources and up-to-date algorithms, and to revisit the fundamental goal of Natural Language Understanding (NLU) from the point of view of the research field as it is today.

1.1 Scope

Since this book is primarily concerned with exploring what NLP has to offer to the Semantic Web community (and other research communities interested in knowledge representation), there are many subfields of NLP that will not enter into the discussion. These include Machine Translation, text summarization, text categorization, and others. The subfield that will be central to this book is Information Extraction (IE). It is through the study of IE that this book can show the important contribution that NLP can make to the Semantic Web community: that of providing methods for the transformation of unstructured data (e.g., textual information) into structured data (e.g., the Semantic Web datastore), methods that are particularly helpful for ontology population.

IE is a well-established area of study in NLP, with important research results dating back to the Message Understanding Conferences (MUC) of the 1990s. IE is also quite wide-reaching, ranging from the acquisition of structured information from text involving company mergers to text presenting biographical facts about people. It even has modern-day commercial applications, such as the extraction of dates from e-mail messages for the purpose of automatically suggesting calendar events.

Common to all research in IE is the need to extract two important types of information: entities and relations. Each of these is important and complex enough in itself to have an entire subfield of NLP research devoted to it. In the study of entities, the fields of Named Entity Recognition and Entity Linking are highly active today. The primary focus of Named Entity Recognition is searching for dates, locations, organizations, and other types of entities in text. Entity Linking involves linking particular mentions of entities in text back to their definitions. These definitions of entities are often taken from Wikipedia, an incredibly vast and openly accessible source of knowledge. As for the study of relations, the task of Relation Extraction is concerned with attempting to find occurrences of relations in text, in order to better link entities. For example, Relation Extraction could find information in text that would lead to the conclusion that a particular person is the director of a movie, not its leading actor. This book will focus on both entities and relations. Through the use of IE methods, we will aim to extract both from text, in order to express them within a structured representation.

1.2 Approach

Although we have reduced the scope of this book to focus mainly on Information Extraction (IE), this subfield is itself quite large and diverse, and attempting to capture both its depth and its breadth in a single book would be ambitious to say the least. For this reason, we will adopt a breadth coverage approach in this book, meaning we will complete a general investigation of most areas of the field, but will not delve into the finer details of any one. The reader will be introduced to several problems facing IE and will learn a few simple algorithms for approaching each one.
For each topic that is covered, rather than providing the details of the multiple algorithms in current research, I will focus on the intuitions that inform them. It is crucial to grasp the underlying principles, general trends, and baseline algorithms if we hope to gain a solid general understanding of the field of IE. This will also provide us with the knowledge necessary for informed consideration of current research. That said, for the reader who is interested in learning more about particular topics, each chapter will include a Further Reading section to gather pointers to research surveys and relevant research articles. These sections will also serve to consolidate all references for the algorithms and elements of discussion presented within their corresponding chapter, making this information easier to find at a later date.

An important feature of this book is a focus on experimentation and analysis of results. Much current research in NLP is experiment-driven, an approach that is promoted by the various international competitions that have emerged in recent years (these cover various NLP tasks including Word Sense Disambiguation, Question Answering, and Machine Translation). This book will follow this trend and emphasize the design of experiments, as well as the analysis of results, even when it comes to the simplest of approaches. During our analysis of experiment results, emphasis will be put on the importance of performing both qualitative and quantitative analysis, so as to both understand the reasons for which algorithms succeed or fail, and provide measures of their performance.

Another commonality among the various chapters is the use of Wikipedia as a source of textual information. In order to perform Information Extraction, we require text to analyze, and from which to extract information. Being free, accessible, and of considerable size makes Wikipedia the ideal resource for the learning examples presented in this book. That being said, we will discuss throughout the book how Wikipedia only covers one genre of text (informative texts), which might not be appropriate for all extraction tasks.

1.3 Prerequisite Knowledge

This book promotes a very hands-on approach to the study of Natural Language Processing, which rests on the assumption that readers have a certain amount of previous knowledge of algorithms and software programming. Most NLP strategies will be explored through the use of algorithms, and as such, the reader is assumed to be capable of reading and understanding individual steps of algorithms, and of programming them for analysis of results. No knowledge of NLP is assumed on the part of the reader. This book is intended to take the reader from an initial introduction of the basic concepts of NLP, gradually toward an eventual understanding of the more advanced concepts that appear in its later sections. A certain amount of knowledge about the Semantic Web will be assumed. As such, I steer readers less familiar with the Semantic Web to Appendix A, which provides a concise overview of the required material. Appendix A also introduces the query language SPARQL, which is useful for finding information within the Semantic Web.

Given that the focus of this book is on language analysis as opposed to mathematical models, it is accessible to a large audience and does not assume any preexisting knowledge of advanced mathematics. As a consequence of this, machine learning approaches are excluded (except for a mention of word embeddings in Chap.
10), even though they have a large influence on current research methods. This is not to say that these machine learning approaches are unimportant. On the contrary, they are highly relevant in today's world of NLP, and those readers who are interested are strongly encouraged to complement the knowledge contained within this book with additional knowledge of machine learning. Here though, the only statistical models we will consider are those that are required for corpus analysis. Furthermore, when considering these models, details of the mathematical processes involved will be explicitly provided, in order to accommodate the range of mathematical backgrounds among the readership.

1.4 Structure

This book is divided into four parts, each containing three chapters.

Part I, Searching for Entities in Text, is dedicated to the search for entities in textual data. It begins with Chap. 2, Entities, Labels, and Surface Forms, which discusses the fundamental many-to-many relationship between concepts and lexical units, as well as the multiple ways language has of referring to concepts. Then Chap. 3, Searching for Named Entities, expands the search for individual entities to classes of entities. This chapter also introduces a common and essential tool of NLP, regular expressions. Finally, Chap. 4, Comparing Surface Forms, presents two algorithms (Edit Distance and Soundex) used for discovering surface forms in text that have been altered by typographic errors.

Part II, Working with Corpora, is dedicated to an investigation of corpora as valuable resources for NLP work. Chapter 5, Exploring Corpora, discusses the various types of corpora and provides a sense of how words behave inside them. It takes us through both a quantitative exploration of corpora using word frequency measures and a qualitative exploration using a concordancer. Then, Chap. 6, Words in Sequence, provides a brief introduction to probabilistic modeling of word sequences, revealing the often predictable nature of words' occurrences in text. The ideas presented in the first two chapters are transposed to a bilingual context in Chap. 7, Bilingual Corpora, toward applications of automatic language detection and term equivalent search.

Part III, Semantic Grounding and Relatedness, is focused on the process of linking surface forms found in text to entities in resources. Its first chapter, Linguistic Roles (Chap. 8), introduces the NLP processes of tokenizing, part-of-speech tagging, and parsing, in order to shed light on the linguistic nature of words. Determining the linguistic nature of a word can be thought of as the first step toward its disambiguation (e.g., to cook versus a cook). Next, Chap. 9, Definition-Based Grounding, tackles the problems of Word Sense Disambiguation and Entity Linking, which occur within a particular part of speech (e.g., mouse as a computer device versus a small animal, or Paris as the city in France or the name of a person). This chapter adopts a definition-based approach, since it considers the similarity between the context of occurrence of a surface form and the definitions of all the possible word senses or entities it can be linked to. To help disambiguation, additional contexts of occurrence in a text can be discovered through coreference analysis, a very difficult NLP task of which we only skim the surface. Finally, Chap.
10, Relatedness, takes us back into the realm of statistical corpus analysis, in order to discover relatedness between words. Knowledge of words' relatedness can be beneficial in many different NLP tasks, and we show one example of this by revisiting the definition-based grounding algorithm from the previous chapter and using word relatedness to increase its performance.

Part IV, Knowledge Acquisition, delves into the world of relations and Relation Extraction. First, Chap. 11, Pattern-Based Relation Extraction, looks into pattern-based approaches to Relation Extraction in text. It revisits regular expressions as possible implementations of lexical and lexico-syntactic pattern searches. The underlying purpose of the task will be ontology population, also called knowledge base expansion. Chapter 12, From Syntax to Semantics, first introduces dependency grammars for performing syntactic analysis, then moves on to an exploration of semantic frames as providing a structured representation for the semantic interpretations of sentences. The path from dependency graphs to semantic frames is a challenging one and is the focus of the remainder of the chapter. In Chap. 13, Semantic Types, we consider the constraints that semantic relations or semantic frames impose on their participants. For example, when defining a relation between an employee and an employer, that relation requires that the employee be of type Person and the employer of type Organization. Named Entity Recognition will be introduced as one possibility for identifying standardized semantic types in text, such as Location, Person, and Duration. As for less standardized semantic types (e.g., Food or Container), we will investigate their presence in various resources, as well as strategies for automatically discovering instances of these particular types through searches in text.

This book also includes three appendices. Appendix A, A Look into the Semantic Web, provides a brief overview of the Semantic Web. It is intended to bring readers less familiar with the Semantic Web up to speed, so that they too can fully benefit from the material of this book. Appendix B, NLP Tools and Platforms, provides information about NLP platforms and tools. Appendix C, Relation Lists, gathers lists of relations under different categories, showing how relations can be varied and serve different purposes.

Finally, this book provides an extensive Glossary of terms. The glossary consolidates the definitions of all the NLP-related terms used over the course of the book and is to be used as a reference whenever necessary. Throughout the text, the reader will find terms emphasized but not necessarily completely defined when first mentioned, perhaps because they refer to concepts explained in a later chapter. In such cases, the reader can minimally use the glossary to find short definitions to help his/her comprehension of the material.

1.5 Learning Paths

This book is structured to comprise an incremental study of NLP. We will construct the more advanced knowledge found toward the end of the book out of the simpler pieces of knowledge we gather over the beginning sections. This means that later
chapters will often refer back to material from earlier chapters. However, so as not to force a linear reading of the book, many references to previously learned knowledge will be made explicit, allowing the reader to seek out the missing pieces of information when desired.

I recognize that goals, aims, and reasons for approaching this book can vary among readers. While one person may be looking for a thorough, all-encompassing understanding of Information Extraction, someone else may desire a shorter, more concise overview of only the key elements, perhaps to complement their studies in another area. For this reason, I have outlined various learning paths one could follow through this book. The shortest path covers only crucial chapters, while the longer ones include additional chapters that deal with particular interests or niche areas of NLP. The hope is that, in this way, each reader can get out of this book what he or she is seeking.

For the reader seeking to work through the material of this book in a more concentrated fashion, the short path (7 chapters) may be the best option. Begin with Chaps. 2 and 3, to grasp the fundamentals of searching for entities in text. Next, move on to Chap. 5, to learn about corpora. Corpora provide the source material (text) for IE to apply its algorithms on. Then, skip right to Chaps. 8 and 9, which together will provide the basics of linking surface forms found in corpora to entities described in resources. Lastly, Chaps. 11 and 12 will provide important knowledge about Relation Extraction and semantic interpretation of sentences.

A slightly more thorough route would be to take the longer semantic exploration path (9 chapters). This path adds Chaps. 10 and 13, which provide additional insight about word relatedness and semantic types, both quite important for Entity Linking and semantic interpretation of sentences.

Then, there are three additional chapters, corresponding to more specific interests. The reader interested in applying IE to noisy texts (e.g., e-mails, SMS, and blogs) would do well to add Chap. 4 to their reading. This chapter provides algorithms for comparing words that are similar sounding, as well as those that contain orthographic errors. The reader wanting to learn the basics of statistical models of language can additionally read Chap. 6, in order to gain an understanding of sequence models, commonly used in word prediction, error correction, and even Machine Translation. Finally, Chap. 7 contains pertinent information for those interested in bilingual corpora, as it provides the statistical tools for finding information in a bilingual parallel corpus, such as term equivalents, which are terms in different languages linking to the same concept.

Part I Searching for Entities in Text

Our starting point is a knowledge base within the Semantic Web that contains a certain amount of information about select entities. Our objective will be to search in text for further information about those entities. Below are three entities, provided by three URIs (Uniform Resource Identifiers) taken from DBpedia, an important resource of the Semantic Web that can be accessed at http://dbpedia.org. All three refer to specific instances of people or places.

http://dbpedia.org/resource/Mount_Everest
http://dbpedia.org/resource/Wolfgang_Amadeus_Mozart
http://dbpedia.org/resource/Ireland

We will also see the term entity used to refer to generic objects, such as those seen below. These do not refer to a specific mobile phone or one particular Chinese cabbage bought on a certain day, but rather to any mobile phone or Chinese cabbage.

http://dbpedia.org/resource/Mobile_phone
http://dbpedia.org/resource/Chinese_cabbage

The information we seek regarding specific and generic entities can be quite varied, as demonstrated by the following questions: What is the height of Mount Everest?
When was Wolfgang Amadeus Mozart born? What is the capital of Ireland? When were mobile phones invented? What is the typical color of a Chinese cabbage?

Answers to these questions may very well be somewhere in the Semantic Web datastores, but they may not be. For example, answers to the first two questions are available within DBpedia, but answers to the last three are not (the inclusion of these predicates in DBpedia was validated in the summer of 2015). In cases where DBpedia (or any other datastore) does not provide the answer, we hope that textual data will contain an answer which can be uncovered through the process of Information Extraction.

In Information Extraction, the first step toward finding answers to questions like the ones above is to find mentions of the relevant entities in text. Entities often have multiple names, however, and our quest into text should account for this by searching for them in their various surface forms. For example, if we are searching for information about Mozart, we might also search his other names, such as Wolfgang Amadeus Mozart and W.A. Mozart. Likewise, a search for Chinese cabbage should include synonyms such as bok choy. Entities and their multiple surface forms will be the focus of Chap. 2, Entities, Labels, and Surface Forms.

Certain information in text is better searched for based on the type of information it has. For example, a search for when Mozart was born would involve searching for textual information corresponding to the specific entity type Date. On the other hand, a search for the capital of Ireland would seek information corresponding to a Location, or more precisely a City. There are various entity types that are quite common, such as Person, Duration, and Organization, and other less common ones, such as Politician and Symphony. Furthermore, certain entity types such as dates have particular, predictable formats (e.g., May 2 2015 or 15 June 2014), which make them easier to find in text relative to other, less predictable ones. To search for entities with predictable formats, we can use regular expressions, a very powerful text search tool that is commonly used in Natural Language Processing. One goal of Chap. 3, Searching for Named Entities, will be to demystify regular expressions.

In Chap. 4, Comparing Surface Forms, we will learn how to address instances of surface forms that contain typographic errors, for example, musician misspelled as misicien, by matching them to existing words in our vocabulary. This matching can be accomplished via an Edit Distance algorithm, looking at letter deletions, insertions, and substitutions. Another algorithm, called Soundex, relies instead on the sounds of words for resolving cases of typographic error. We will also explore Soundex in this chapter and compare it with Edit Distance. The knowledge we acquire over the course of the following three chapters will provide us with the necessary tools for Searching for Entities in Text.

Chapter 2 Entities, Labels, and Surface Forms

The starting point for Information Extraction (IE) is textual data in the form of a document, a set of documents, or even a set of individual sentences in which we will search for information. For now, let us refer to any such textual data as a corpus (we will further discuss definitions of the word corpus in Chap. 5, Sect. 5.1). Let us set our first IE goal as to find all sentences from a corpus in which a particular entity is mentioned.
In this chapter, we will look into a first obstacle toward this seemingly simple IE goal: the fact that entities do not have normalized names. Instead, entities can be referred to by many different surface forms. For any entity searched (e.g., dbr:Wolfgang_Amadeus_Mozart), there could be various associated surface forms present in the corpus (e.g., W.A. Mozart, Amadeus Mozart, Mozart), and knowledge of these surface forms is necessary to achieve our goal. We will introduce a measure called recall to evaluate the extent to which a set of known surface forms suffices for retrieving all of an entity's textual mentions.

The search in the corpus for a particular surface form is likely to output sentences unrelated to the entity we are looking for. That is due to polysemy. Polysemy is the word used to describe the fact that a surface form (e.g., Mozart) can be associated with many entities (e.g., Wolfgang Amadeus Mozart, Café Mozart, Mozart Street). We will introduce a second measure, called precision, to evaluate the extent to which the sentences retrieved from the corpus actually lead to the desired entity.

Which surface forms are available to the IE process largely influences its precision and recall. We will explore DBpedia as a resource in which surface forms can be found and further provided to the IE process. DBpedia contains surface forms explicitly stated through the standard naming predicates (e.g., rdfs:label, foaf:name). But we will shift our attention toward a non-standard naming predicate, dbo:wikiPageRedirects, so as to explore a larger, but more eclectic, set of candidate surface forms. Studying this eclectic set reveals semantic variations (e.g., quasi-synonym, related word), lexical variations (e.g., plural, singular), and orthographic variations (e.g., capitalization of the first letter, use of hyphen) as surface forms to be expected to occur in text. We will suggest generative rules to construct some of an entity's associated surface forms automatically, so as to provide the IE process with an even larger set of surface forms. We will then briefly look at surface forms from a multilingual point of view, as provided by DBpedia. Finally, we will discuss the most problematic surface forms: pronouns. Mentions of entities in text through the use of pronouns such as it, them, or he make it very difficult for IE.

2.1 A First Surface Form: Entity Label

I am interested in the composer Ludwig van Beethoven. I find an entity referring to him in DBpedia under the URI dbr:Ludwig_van_Beethoven. The Semantic Web provides a network of Uniform Resource Identifiers (URIs) to uniquely identify concepts (an introduction to the Semantic Web is provided in Appendix A). Such normalization is necessary for the sharing of information and is ideal for communication among machines. In IE, on the other hand, the source of information to analyze, textual data, has been written by humans for human consumption. In such a human-to-human communication channel, there is no notion of normalization; quite the opposite actually, as creativity is usually valued in such communication.
Therefore, for an IE process to search in text, it requires knowledge of the surface forms likely to be used by the writers of text to refer to various concepts, surface forms that are also likely to be understood by readers of these texts. The Semantic Web contains a certain number of naming predicates that provide alternate surface forms for entities. Some commonly used naming predicates are foaf:name or rdfs:label, as in the following examples providing two different names for the desired entity (prefixes associated with various Semantic Web data provider sites are listed at the beginning of this book):

(dbr:Ludwig_van_Beethoven, rdfs:label, "Ludwig van Beethoven")
(dbr:Ludwig_van_Beethoven, foaf:name, "Beethoven, Ludwig van")

These surface forms can be provided to the IE process to search for their occurrences in the BeethovenCorpus, which contains 12 sentences, as shown in Table 2.1.

Table 2.1 The small BeethovenCorpus
1. The Andante favori is a work for piano solo by Ludwig van Beethoven.
2. The other great passion of the young Mirabehn was the music of van Beethoven.
3. L.V. Beethoven spent the better part of his life in Vienna.
4. Charles Munch conducted the symphony no. 9 of Ludwig van Beethoven in 1962.
5. Among the few composers writing for the orphica was Ludvig von Beethoven.
6. Betthoven, too, used this key extensively in his second piano concerto.
7. Naue went to Vienna to study briefly with von Beethoven.
8. Bonn is the birthplace of Ludwig van Beethoven (born 1770).
9. Johann van Beethoven joined the court, primarily as a singer, in 1764.
10. Camper van Beethoven were inactive between late 1990 and 1999.
11. Beethoven, meanwhile, runs after a loose hot dog cart and ends up on a merry-go-round.
12. Beetehoven hit theaters in April 1992.

The IE process finds the first surface form, Ludwig van Beethoven, in sentences {1,4,8}. We can then assume that this set of sentences contains information about our entity of interest. If we provide the second surface form, Beethoven, Ludwig van, to the IE process, it leaves us with an empty set. Although a valid surface form, it is not one commonly used in text, and no sentence of the BeethovenCorpus contains it. So, how successful is the IE process so far? It might seem early in our quest to think of evaluating our method, but it is not. It is never too early.
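As a concrete illustration (not code from the book), this first search step can be sketched in a few lines of Python. The list below reproduces only the first four sentences of Table 2.1; extending it to all 12 sentences gives the result {1,4,8} discussed above.

```python
# A minimal sketch of the baseline surface-form search over the
# BeethovenCorpus (first four sentences of Table 2.1 only).
beethoven_corpus = [
    "The Andante favori is a work for piano solo by Ludwig van Beethoven.",
    "The other great passion of the young Mirabehn was the music of van Beethoven.",
    "L.V. Beethoven spent the better part of his life in Vienna.",
    "Charles Munch conducted the symphony no. 9 of Ludwig van Beethoven in 1962.",
]

def find_mentions(surface_form, corpus):
    """Return the 1-based indices of sentences containing the surface form."""
    return [i for i, sentence in enumerate(corpus, start=1)
            if surface_form in sentence]

print(find_mentions("Ludwig van Beethoven", beethoven_corpus))   # [1, 4]
print(find_mentions("Beethoven, Ludwig van", beethoven_corpus))  # []
```

A plain substring test is enough here; later chapters introduce the more flexible matching (regular expressions, edit distance) that real corpora require.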
2.2 Experiment — Searching for Beethoven

Evaluation is a major focus of NLP, and new methods presented by researchers are always regarded relative to other methods or established baselines. So far, we can consider that we have defined a baseline algorithm, meaning a first simple method. Our baseline algorithm for finding information about a particular entity consists of searching in text using the surface forms provided by the predicates rdfs:label and foaf:name. Once we have evaluated the performance of this baseline algorithm, we will move on to suggest new ideas and hope these will lead to improvements.

The most common evaluation measures are based on a comparison between a gold standard and an algorithm (or system). The gold standard defines what is correct and incorrect, and our system tries to mimic it. A contingency table, as shown in Table 2.2, summarizes the system's various possible results in relation to the gold standard. If the system turns up a correct answer, we have a true positive (TP). If the system thinks an answer is correct but it is not, then we have a false positive (FP). If the system thinks the answer is incorrect when in fact it is correct, we have a false negative (FN). Lastly, if the system rightly identifies an answer as incorrect, then we have a true negative (TN).

Table 2.2 Contingency table
                             Gold standard
                             correct               incorrect
System   correct             true positive (TP)    false positive (FP)
         incorrect           false negative (FN)   true negative (TN)

Typical evaluation measures derived from the contingency table are as follows:

Recall = TP / (TP + FN)
Reject = TN / (TN + FP)
Precision = TP / (TP + FP)
Accuracy = (TP + TN) / N, where N = TP + TN + FN + FP
F1 = 2 * (Precision * Recall) / (Precision + Recall)

In Natural Language Processing, three of these measures, recall, precision, and F1, are the most commonly used. As you will see throughout this book, we constantly try to balance the precision and recall of algorithms as we search for information in text. We hope for high recall as an indication that our algorithm did not miss important information, and we hope for high precision as an indication that the algorithm did not dilute the important information among many irrelevant candidates. Sometimes, an application dictates whether high recall or high precision is more important, but if that is unknown, then the F1 measure is a good compromise, since it combines both measures by calculating the harmonic mean between recall and precision.

In our example, assuming we use the BeethovenCorpus to define our gold standard, the entity dbr:Ludwig_van_Beethoven is referred to in sentences 1 to 8, making these sentences positive examples. Sentences 9 to 12 are negative examples referring to three other entities with very similar names. We compare our baseline algorithm, searching using the surface form Ludwig van Beethoven, against this gold standard. Since our algorithm identifies only the three sentences {1,4,8} as correctly relating to the entity dbr:Ludwig_van_Beethoven, it provides 3 true positives. The full contingency table is shown in Table 2.3, and the derived precision/recall measures follow.

Table 2.3 Contingency table for baseline algorithm of Ludwig van Beethoven search
                              Truth (Gold standard)
                              correct    incorrect
Baseline search   correct     3          0
                  incorrect   5          4

Precision = 3/3 = 100%
Recall = 3/8 = 37.5%
F1 = 2 * (100 * 37.5) / (100 + 37.5) = 54.5%

These results show that our baseline algorithm provides a high degree of precision, but unfortunately it comes at the expense of recall, which is low. We missed many of the positive sentences from the gold standard in which the entity's full surface form was not used. Let us see in the next section how recall can be improved, hopefully without losing precision.
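Once the contingency counts are known, the measures above are simple to compute. The short sketch below (not from the book) reproduces the numbers of Table 2.3.

```python
# Evaluation measures from contingency counts; the counts are those of
# the baseline search in Table 2.3 (TP=3, FP=0, FN=5, TN=4).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

tp, fp, fn, tn = 3, 0, 5, 4
p, r = precision(tp, fp), recall(tp, fn)
print(f"Precision = {p:.1%}  Recall = {r:.1%}  F1 = {f1(p, r):.1%}")
# Precision = 100.0%  Recall = 37.5%  F1 = 54.5%
```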
2.3 Looking for Additional Surface Forms

As in any search problem, the goal is to improve on our precision and recall. So far we have a good level of precision in our baseline algorithm, because our query is very specific. To obtain higher recall, we will have to relax the query somewhat, or add alternative names. The predicate dbo:wikiPageRedirects found in DBpedia is a good place to find alternative names, since it provides all the redirect links used in Wikipedia to access a desired page. It shows us the various ways humans attempt to find a particular Wikipedia page, and we can assume that the same ones would apply to text.

Through the DBpedia SPARQL endpoint (http://dbpedia.org/sparql; information about SPARQL can be found in Appendix A), we can find the redirect links using the following SPARQL query, which returns 53 variations (the number of variations returned is likely to vary depending on the date you access the site, as DBpedia is expanding every day):

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
select ?X where { ?X dbo:wikiPageRedirects dbr:Ludwig_van_Beethoven. }

The second column of Table 2.4 shows a sample of these variations. As you can see, these surface forms vary in their nature. Some are abbreviations, some are short forms, and others include orthographic errors. We will look more closely at these variations in the next section, but for now let us focus on the impact of using them in our search on the BeethovenCorpus. Assuming we include all variations in our search, how does this influence our results? The third column of Table 2.4 shows the sentences retrieved by each surface form.

Table 2.4 Examples of redirect links to dbr:Ludwig_van_Beethoven and matching sentences from the BeethovenCorpus
No.   Redirect               Sentences retrieved
1     Beethoven              1, 2, 3, 4, 5, 7, 8, 9, 10, 11
2     Beethovens
3     Baytoven
4     Beeethoven
5     Beetehoven             12
6     L.V. Beethoven         3
7     Ludvig von Beethoven   5
8     Ludwig van Beitoven
9     van Beethoven          1, 2, 3, 8, 9, 10
10    von Beethoven          5, 7

We can summarize the results in the contingency table shown in Table 2.5, and the equations that follow show the new precision, recall, and F1 results.

Table 2.5 Evaluation of search using surface forms provided by the redirect links
                      Truth (Gold standard)
                      correct    incorrect
System   correct      7          4
         incorrect    1          0

Precision = 7/11 = 63.6%
Recall = 7/8 = 87.5%
F1 = 2 * (63.6 * 87.5) / (63.6 + 87.5) = 73.7%

Using all variations, our recall is now at 87.5%, but our precision has dropped to 63.6%. This is typical of a query expansion process in information retrieval, which is precisely the process we have engaged in. This trade-off between precision and recall is often the struggle in searches for information in text, whether they involve entities, as seen here, or relations, which we will investigate in later chapters. Our strategy will therefore depend on our ultimate purpose. If the goal is really to not miss anything, assuming a human reviewer will later filter the results, we should aim for high recall. On the other hand, if we are less concerned with capturing every possibility and more with presenting tailored, quality results to a user, then we should aim for high precision.

In our example, the surface form van Beethoven leads us to include sentences about the rock band Camper van Beethoven, as well as one about Johann van Beethoven, Ludwig's father. The surface form Beethoven included sentences about the movie Beethoven, which features a dog named Beethoven. This could be problematic if we intend to use these results as the basis for a later fact extraction algorithm. Including sentences about the rock band and/or the movie is bound to lead to errors or even ridiculous facts. We are now confronted with the significant problem of polysemy. Surface forms, especially those that are made up of single words, have multiple meanings. The longer form Ludwig van Beethoven ensured that we were finding relevant sentences. Relaxing our query to Beethoven also returned relevant sentences, but with them many other, less relevant ones which we will have to somehow filter out. One important approach to dealing with polysemy is to develop disambiguation algorithms, something we will look at in Part III of this book, investigating Semantic Grounding and Relatedness.
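For readers who would rather issue such queries programmatically than through the endpoint's Web form, the redirect query above can be run from Python. The sketch below is not from the book; it assumes the third-party SPARQLWrapper package is installed, and the exact set of redirects returned will vary as DBpedia evolves.

```python
# A sketch of querying the DBpedia SPARQL endpoint for redirect links,
# assuming the SPARQLWrapper package (pip install sparqlwrapper).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?x WHERE { ?x dbo:wikiPageRedirects dbr:Ludwig_van_Beethoven . }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Turn each redirect URI into a readable surface form, e.g.,
# http://dbpedia.org/resource/L.V._Beethoven -> "L.V. Beethoven"
surface_forms = [b["x"]["value"].rsplit("/", 1)[-1].replace("_", " ")
                 for b in results["results"]["bindings"]]
print(len(surface_forms), surface_forms[:5])
```

The list of surface forms obtained this way can be fed directly to the search procedure sketched in Sect. 2.1.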
One important approach to dealing with polysemy is to develop disambiguation algorithms, some- 2.3 Looking for Additional Surface Forms 15 thing we will look at in Part III of this book, investigating Semantic Grounding and Relatedness. Before moving on, there are a couple of things we should take note of. First, notice that when we apply an all encompassing pattern such as Beethoven, all the more precise variations are subsumed by it and defining them becomes less useful. Also, although we had many variations to search for, sentence 6 is still missing from our results. That sentence contains the misspelled form Betthoven which was not included in the known variations. We will further explore misspelling errors in Chap. 4. For now, let us continue on our exploration of name variations. 2.4 Categorizing various Surface Forms As we search for information in text, we may be interested in specific entities (e.g., Ludwig van Beethoven) as we have seen thus far, or our interest could be geared more toward generic entities (e.g., garden, laptop computer). Similar to specific entities, the surface forms provided by naming predicates for generic entities are usually insufficient for searches in text. For example, if we want to know about mobile phones, the only label provided in DBpedia is Mobile phone. However, people tend to be very creative in how they refer to particular concepts. To demonstrate this point, Table 2.6 shows a subset of the results returned by the following SPARQL query: PREFIX dbr: PREFIX dbo: select ?X where { ?X dbo:wikiPageRedirects dbr:Mobile_phone. } Table 2.6 Examples of dbo:wikiPageRedirects for the URI of Mobile_phone in DBpedia Cell_phone Cellular_mobile Cellular_telephone Celluar_telephone Cellular_Radio Cellular_phones Mobile_cellular Handphone Mobile_Phones Cell_telephones Cell_phones Hand_phone Cell_phone_carrier Cellular_radio How_mobilephones_work? Mobile_telecom Cell_Phone Cellphones Cell-phone Cellular_telephones Mobile_phones Mobile_communications Mobile_Phone Cellular_phone Flip_phones Mobile_telephone Wireless_phone Mobile_telephones Cellular_Telephone Cell_Phones Cell_phone_dance Cellphone Cellphone_video 16 2 Entities, Labels, and Surface Forms To be sure, not all the variations in Table 2.6 are proper synonyms for mobile phone. So what are they? Should we include them all in our search, or not? Let us begin to answer these questions by characterizing the variations found in the table. Real synonyms Many of the variations are real synonyms of mobile phone, meaning the two surface forms are completely interchangeable in a sentence. Many variations also show that mobile phone has a compositional meaning, formed by the individual meanings of mobile and phone. Being that it is compositional, mobile phone allows for variants of both components of the expression. The first word, mobile, has the variants cell and cellular, and the second word, phone, has the variant telephone. Together, these variants lead to multiple possible combinations. Quasi-synonyms Flip phone, hand phone, and wireless phone are loose extensions of mobile phone and could probably be used interchangeably with mobile phone in certain contexts, but not always. Quasi-synonyms are very common in language, much more so than real synonyms. Uppercase variations Many variations simply come from using the uppercase for the first letter of either or both words, such as in Cell phone or cell Phone. Orthographic variations Different ways of combining the words lead to ortho- graphic variations. 
Examples here include the use of a hyphen in cell-phone, the use of a space in cell phone, and even the two words being combined into a single word in cellphone.

Plural form: Some variations come from the use of the plural, such as cell phones and mobile phones.

Typographic error: Some variations included in the redirect links are simply there to ensure that the Wikipedia entries are reached even if the user makes a typographic error, such as in celluar telephone.

Related topics: Variations such as How mobile phones work?, Cell phone carrier, cellphone video, cellular radio, and Mobile telecom are certainly related to mobile phone, but they are neither synonyms nor quasi-synonyms.

Given all of these variations, which one(s) should we use to expand our search in text? As we have seen, anything we add will increase recall but runs the risk of decreasing the precision of our results. It is safe to say that real synonyms should be used in a search. The more precise the synonym, the less likely it is to introduce noise and lower the overall precision of our algorithm (e.g., cellular phone). On the other hand, introducing a synonym like mobile in our search is very likely to create noise, given its polysemous nature. As for quasi-synonyms, they can be tricky. There are times when we should include all of them in a single search, especially if our goal is to search for information at a general level. Other times, though, we will want to understand the difference between quasi-synonyms, in which case they should be searched for individually.

Typographic errors are fairly important for searches in noisy corpora (e.g., Web, e-mails, etc.), but much less so for searches in books or Wikipedia. Chapter 4, Comparing Surface Forms, looks at algorithms to automatically evaluate the distance between two strings, such as Celuar and Cellular. Such algorithms are very helpful in analyzing text, since predicting all possible typographic errors ahead of time is not possible.

When it comes to uppercase and plural forms, as a rule we should probably include them in our search. The exception would be in searches for company names (e.g., Apple versus apple), where the differentiation lies in the first letter being capitalized. Contrary to typographic errors, variations such as plural or uppercase forms are predictable, which makes it possible to write algorithms for generating these variations ahead of time. That is the topic of the next section.

2.5 Expanding the Set of Surface Forms with a Generative Approach

Despite being listed as wikiPageRedirects, certain surface form variations of entities can be generated automatically, and doing so will allow us to obtain a more complete list of variations. For example, if Mobile phone and cell phones are included on our list, why should Cell phones, Cell phone, Mobile phones, and mobile phones not be there as well? It would be possible to generate all of these variations with only two rules, one for changing the first letter to uppercase and another for adding the plural marker 's'. Rather than listing all variations, we can write generative rules in order to derive all variations from a subset of elements. These generative rules are especially valuable since they can be applied generally, to all Wikipedia entries (not just mobile phone). A rule for first letter capitalization is a good example of one that would be generally applicable.
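To make this concrete, here is a minimal Python sketch of a capitalization rule and a naive plural rule (the function names are illustrative; note how applying the rules in combination derives further variations):

    def capitalize_first(form):
        """Generative rule: put the first letter of the first word in uppercase."""
        return form[0].upper() + form[1:]

    def pluralize(form):
        """Generative rule: add the plural marker 's'.
        Naive: plural rules must in fact be adapted to word endings."""
        return form + "s"

    variants = {"cell phone"}
    for rule in (capitalize_first, pluralize):
        variants |= {rule(v) for v in set(variants)}
    print(variants)
    # {'cell phone', 'Cell phone', 'cell phones', 'Cell phones'} (set order may vary)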
Other examples are rules for orthographic variations (hyphens, spaces, single word) as well as ones for plural variations, even though plural rules must be adapted to word endings (e.g., story/stories). Since rules can be applied systematically to generate all forms, using them is likely to lead to more reliable coverage of a larger number of variations. This is much more straightforward than attempting to compile an exhaustive list one form at a time, where we could easily forget to include one or several forms. Table 2.7 provides examples of generative rules for different types of variations. Some of these rules can be combined to generate further variations. Then, Table 2.8 takes us back to our example of Ludwig van Beethoven and suggests some generative rules for proper nouns.

Table 2.7 Examples of generative rules applied to cell phone

    Variation type   Rule                                           Derived form
    plural           Add 's'.                                       cell phones
    capitalization   Put first letter of first word in uppercase.   Cell phone
    capitalization   Put first letter of all words in uppercase.    Cell Phone
    orthographic     Put hyphen between words.                      cell-phone
    orthographic     Remove space between words.                    cellphone

Generative rules can be put in opposition to normalization rules, which are perhaps more commonly used in NLP. A typical example of a normalization rule is lemmatization, which tries to find a word's base form, called a lemma. For example, lemmatization would transform chosen and choosing into a single base form, choose. This will be part of our exploration into linguistic processes in Chap. 8, which looks at Linguistic Roles.

Table 2.8 Examples of generative rules applied to Ludwig van Beethoven

    Variation type        Rule                                       Derived forms
    abbreviation          Gather the first letter of each word,      L.V.B.
                          separated by periods.
    spaced abbreviation   Apply abbreviation rule and add a space    L. V. B.
                          in between letters.
    first names initials  Apply abbreviation rule except on the      L.V. Beethoven
                          last word.
    skip middle noun      Remove the middle name.                    Ludwig Beethoven

Generative rules derive multiple surface forms, which are then stored in a resource as possible variations for an entity. At search time, all forms can be used to find relevant sentences in text. Normalization rules take the opposite approach, going through the text at search time and attempting to map all encountered forms onto a limited number of normalized forms held in the resource. Both of these processes will be useful at different times, depending on our application.

2.6 Multilingual Surface Forms

Let us now make a small incursion into multilingual surface forms and look at labels and surface forms provided in the Semantic Web for languages other than English. As an example, we discuss the multilingual DBpedia, which defines a conceptual system in each language and then relates concepts using the owl:sameAs predicate. In theory, owl:sameAs should be used to establish an equivalence between concepts (URIs). If two concepts are equivalent, then we could assume that their respective labels, as provided by the usual naming predicates, would correspond to translations of each other. The problem, though, is that the same variations seen in Sect. 2.1 exist in every language. Some languages would even contain more variations based on grammatical case, for example, having a different form of a word corresponding to its accusative case (direct object of the verb) or dative case (indirect object of the verb).
These variation types, like the distinctions between plural forms, verb forms, or even abbreviations, cannot be captured with an rdfs:label naming predicate and would require an encoding format able to capture linguistic information about words. I encourage the reader to follow the pointers to the lemon model and the initiative of Linguistic Linked Open Data in Sect. 2.9 for more information about encoding linguistic information in the Semantic Web.

Besides variations within languages, there are problems related to the linking of information between languages. For example, things become quite complex with the spilling over of some English information into the naming information of other languages. If we follow the owl:sameAs link for dbr:United_Kingdom to the French DBpedia, we find different labels, as shown below.

    (dbr:United_Kingdom, owl:sameAs, dbpedia-fr:Royaume-Uni)
    (dbpedia-fr:Royaume-Uni, rdfs:label, "Royaume-Uni")
    (dbpedia-fr:Royaume-Uni, rdfs:label, "United Kingdom")
    (dbpedia-fr:Royaume-Uni, rdfs:label, "Regne Unit")
    (dbpedia-fr:Royaume-Uni, rdfs:label, "Reino Unido")
    ...
    (dbpedia-fr:Royaume-Uni, foaf:name, "United Kingdom")
    (dbpedia-fr:Royaume-Uni, dbo:frenchName, "Royaume Uni")

The same predicate, rdfs:label, leads to different surface forms, corresponding to different languages, but without explicit indication of which language. How, then, could an automatic system find the French name assigned to the United Kingdom, given all this information? The dbo:frenchName predicate could help select the correct label, but this predicate is not consistently used to name entities. Unfortunately, this lack of coherence in multilingual labeling makes the information very difficult to use in NLP if we wish to expand, for example, our entity search to texts written in different languages.

Ensuring resource coherence is a challenge in itself, whether for the experienced lexicographers, terminologists, and domain experts behind curated resources, or for the community contributors behind collective resources such as Wikipedia. From an NLP point of view, it is important to be aware that a resource's coherence will have an impact on any automatic process that tries to use it.

2.7 The Most Ambiguous Surface Form: Pronouns

In the last section of this chapter, I wish to introduce the most ambiguous surface forms: pronouns. Below is a short made-up text in which I highlighted the use of pronouns.

    Lora Bruntelli earns income from publication of her works and from public performances. She also depends on the generosity of a particular patron, Mr. Zelig, for income. Lora is lucky that his wife, Clara, especially loves her music. He often commissions her compositions to be played in private performances to which they invite friends.

Pronouns are short surface forms most often used in text to refer to previously mentioned entities. Once the entity Lora Bruntelli is established in the first sentence, the following mentions of she and her can be associated with Lora Bruntelli. Then, once the entity Mr. Zelig is introduced in the second sentence, the pronouns his and he can be used to refer to Mr. Zelig. The use of pronouns allows for fluidity in text, avoiding the repeated use of names of people, locations, or objects. But this fluidity for human consumption of text puts a heavy burden on NLP, since pronouns quickly become very ambiguous in who or what they refer to.
In our example, when the entity Clara is introduced in the third sentence, it also permits the use of her to refer to Clara. Now, having two entities, Lora and Clara, possibly referred to by her creates ambiguity. Seemingly easy for humans, disambiguating the reference of a pronoun is one of the most difficult tasks in NLP, called anaphora resolution. In anaphora resolution, we try to automatically find the entity previously mentioned in the text to which a pronoun corresponds.

The use of coreferences is a related language phenomenon also included in the example above. The use of Lora in the third sentence links back to Lora Bruntelli in the first sentence. Automatically finding such links is called coreference resolution, perhaps a slightly more approachable task than anaphora resolution, but still in the realm of NLP research. We will come back to the task of coreference resolution in Chap. 9.

Another related but much less frequent language phenomenon is called cataphora, as shown in the following example:

    Knowing he tends to be late, Brandon always sets his clock fifteen minutes early.

Notice how the pronoun he is used before the name Brandon is mentioned. The associated NLP task is called cataphora resolution, but it is not studied as much as anaphora resolution, given the much less frequent use of cataphora in text.

Some NLP tasks that will be mentioned throughout this book are considered mature problems, meaning that they have implementations in software packages that are usable out of the box, with reasonable performance. Anaphora and cataphora resolution are not among those tasks, as they are still very much in the realm of NLP research. Most IE systems do not attempt to perform these tasks and simply ignore sentences in which pronouns occur (except for the very easy cases of non-ambiguity). Not dealing with sentences containing hidden forms (e.g., his, him, he) is most often compensated for by using a larger corpus, in which we hope to find sufficient explicit mentions.

2.8 In Summary

• Both specific entities and generic entities are referred to by multiple surface forms in text.
• Precision and recall of an entity search process will be influenced by which surface forms were used in the search.
• Not all the variations found through dbo:wikiPageRedirects are to be considered valid surface forms for entity search.
• Many variations of lexical units can be generated automatically (e.g., plural, capitalization).
• Generative rules can also be developed for proper nouns (e.g., generating initials from a person's name).
• In a multilingual setting, we should use with care the predicate rdfs:label as indicative of surface forms for each language.
• Pronouns are the most ambiguous surface forms, making Information Extraction difficult on sentences containing them.

2.9 Further Reading

Beyond labels: In this chapter, we have used foaf:name and rdfs:label as naming predicates, but the Semantic Web contains many other predicates used for labeling purposes (see Ell et al. (2011)). Still, these various predicates do not provide an integrated and agreed-upon model in which entities are related to surface forms. In recent years, however, there have been various efforts toward a "linguistically aware" Semantic Web. One major effort, called lemon (the Lexicon Model for Ontologies), is worth paying special attention to.
This suggested model for encoding lexical information is quite rich, allowing us not only to link word forms to word senses, but also to express variations in word forms (see http://lemon-model.net, McCrae et al. (2012)). I also encourage the reader to look into efforts in Linguistic Linked Open Data, as there is growing interest in formalizing linguistic information as part of the Linked Open Data cloud (http://linguistic-lod.org/). There are also efforts toward the standardization of linguistic processes and resources, through the Natural Language Processing Interchange Format (NIF) described in Hellmann et al. (2013).

DBpedia/Wikipedia: DBpedia will often be referred to in this book. The article by Lehmann et al. (2012) describes its extraction process from Wikipedia, another resource largely used in this book. In The People's Web Meets NLP, Gurevych and Kim (2013) show various usages of the community resource Wikipedia for NLP, and vice versa.

Anaphora resolution: It is interesting to compare an early survey on anaphora resolution (Mitkov 1999) to a more recent one (Poesio et al. 2010). For a survey on coreference resolution, see Elango (2006).

2.10 Exercises

Exercises 2.1 (Entity search evaluation).

a. Among all the possible surface forms provided by the dbo:wikiPageRedirects predicate for Beethoven, ten are listed in Table 2.4. Can you find the remaining ones? If you use all of them, what is the impact on the precision/recall of the entity search applied to the small BeethovenCorpus provided in Table 2.1?

b. Reflect on why precision and recall are favoured in NLP for reporting on search experiments, rather than the other measures of reject and accuracy. All the equations are given in Sect. 2.2.

Exercises 2.2 (Variations in surface forms).

a. Look up DBpedia for the generic entity Cat, at dbr:Cat. Go through the list of its dbo:wikiPageRedirects and classify the entries using the classification from Sect. 2.4. You will notice that some variations do not quite fit within our classification. Identify new categories of variations, give them a name, and provide examples for them.

b. Look up DBpedia for the specific entity Paris, at dbr:Paris. Again, go through the list of its dbo:wikiPageRedirects and classify the entries according to the classification in Sect. 2.4, enriched with the new categories you defined in the previous exercise.

Exercises 2.3 (Generative rules).

a. Write a software program that can take a word as input and generate variations implementing the rules from Table 2.7. Which ones are easier to program? Why?

b. Adapt your code from exercise (a) to work for noun compounds. A noun compound is composed of two or more words, such as kitchen appliance, Japanese rock garden, or laser printing. Does your program work for all of these examples? What are its limitations?

c. Moving on to specific entities, implement the rules in Table 2.8 and add a few more rules of your own. Test your rules on the names of ten famous people, companies, or organizations of your choice. Are the rules generating likely-used surface forms? How would you test whether a surface form is actually likely to occur in text?

Exercises 2.4 (Multilingual surface forms).

a. Go to the Spanish DBpedia, following the owl:sameAs predicate from the United Kingdom entry in the English DBpedia, found at dbr:United_Kingdom. In terms of the use of naming predicates, does it resemble what we saw in the French DBpedia, in Sect. 2.6? Discuss.
Try another language of your choice and observe again.

Exercises 2.5 (Use of pronouns).

a. Take a news article at random from a newspaper of your choice. Find the pronouns. Do you find them easy to resolve to earlier mentions? How would you quantify their use: frequent or infrequent? For a single explicit mention, are there many further anaphoric mentions?

Chapter 3 Searching for Named Entities

In the previous chapter, we searched for the specific composer Ludwig van Beethoven. But what if we wanted to find sentences about any classical music composer, or even more generally, about any composer? So far our strategy has consisted of starting with a URI, finding possible alternative surface forms, and looking for sentences in which they occur. If we follow the same strategy in the case of composers, we can find the URI dbr:Composer in DBpedia and discover some of its surface forms (e.g., music composer, musical composer, and even author) through the predicate dbo:wikiPageRedirects. Would a search in a corpus for sentences containing these surface forms be likely to lead us to information about composers? Perhaps it would, but it is certain that the recall performance of such an approach will be quite low. For example, if we attempt the above strategy on the BeethovenCorpus from the previous chapter (see Table 2.1 in Sect. 2.1), we find, among the 10 sentences about composers, only a single sentence (sentence no. 6) that explicitly uses the word composer.

An alternative approach, presented in this chapter, is to consider Composer as an entity type, also referred to as an entity class. An entity type, or entity class, represents a set of individuals, and we will develop text mining strategies for finding the individuals which belong to this class. The first strategy uses a list of the individuals in the class. For example, we can gather a list of composers, such as L.V. Beethoven, W.A. Mozart, and J.S. Bach, and search for them in text. In NLP, such a list is often referred to as a gazetteer. The second strategy is to search for regularities in the way the individuals of the class are expressed in text. The type Composer is likely not the best candidate for this strategy, although, being a subclass of Person, it should show the same regularity in a composer's name as in any person's name. Such regularity could be a sequence of two capitalized words (e.g., [F]rank [Z]appa), although we can imagine this regularity matching many entity types other than Person, such as City (e.g., [N]ew [Y]ork) or Country (e.g., [S]ri [L]anka). Other entity types, such as Date or Time, are better candidates for this regularity detection strategy.

We will look at various entity types in this chapter, as well as the use of gazetteers and the detection of regularities through a very powerful tool, regular expressions. We will also continue with the experiment-based learning approach promoted in this book, by defining an experiment to test the precision and recall of regular expressions in a search for Date in a corpus.

3.1 Entity Types

Composer is simply an example of an entity class, also called an entity type, a semantic type, or a category. Symphony would be another example, as would Author, Company, and City. Entity types are at the core of knowledge organization, since they provide a structure for our understanding of the world.
And because entity types play a central role in our quest for information, we will often return to these notions, and even debate their definitions. For example, is cell phone an entity type? Not in the same way as Composer is, yet we can talk about cell phones in general, and individual phones do exist under the broader umbrella of CellPhone. In our quest for information, we would rarely be interested in the specific phone Mr. X owns, though we might be interested in a particular Samsung or Nokia phone just being put on the market. My purpose here is not to enter into a philosophical debate about ontological commitment or the separation of classes and individuals. I simply want to raise awareness of the impact that the difference between searching for entity types and searching for individuals can have on search strategies and search results.

Compared to the previous chapter, in which we discussed generic and specific entities, notice how the current chapter introduces a different terminology, more commonly used in NLP and in the Semantic Web, of individuals and entity types. In the previous chapter, we used the term specific entity to refer to an individual and generic entity to refer to an entity type.

Interest in individuals and entity types is pervasive in the Semantic Web and NLP communities. The term Named Entity Recognition, abbreviated NER, refers to an important field of research within NLP aiming at recognizing named entities in corpora. We now introduce yet another term, named entity, whose sense is closest to what we had called a specific entity. In a strict sense, a named entity is an instance of an entity class, uniquely identified via a name. In this same strict sense, named entities are unique individuals. People, organizations, locations, and dates are all examples of things that are unique in our world. But a NER search might be interested in detecting other important information in a text, such as amounts of money or quantities. This extends the definition of named entity toward a less strict sense, including individuals as well as other precise and important information.

You might find it confusing to be introduced to so many similar terms with partially overlapping meanings. It is confusing, but it is important to know all these terms, since you are likely to encounter them in different books and research articles written over many years. The effort to define types of named entities and recognize them in text goes back twenty years to the Sixth Message Understanding Conference, during which an NER task was introduced. At that time, the named entities to be recognized were given one of three possible labels:

    ENAMEX: Person, Organization, Location
    TIMEX: Date, Time
    NUMEX: Money, Percentage, Quantity

Later, efforts were made to define more fine-grained lists of entity types, some examples being:

    PERSON: actor, architect, doctor, politician
    ORGANIZATION: airline, sports_league, government_agency, news_agency
    LOCATION: city, country, road, park
    PRODUCT: car, camera, computer, game
    ART: film, play, newspaper, music
    EVENT: attack, election, sports_event, natural_disaster
    BUILDING: airport, hospital, hotel, restaurant
    OTHER: time, color, educational_degree, body_part, tv_channel, religion, language, currency

Beyond these more generic entity type lists, efforts have also been made in individual domains to define entity types specific to them (e.g., gene and protein in biology).
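Such type inventories are straightforward to represent in code. Below is a minimal Python sketch, assuming we store the fine-grained inventory above as a dictionary from coarse types to subtypes (the structure and function name are illustrative, not from any standard):

    # A two-level entity type inventory (coarse type -> fine-grained subtypes),
    # following the fine-grained lists above (abridged).
    ENTITY_TYPES = {
        "PERSON": ["actor", "architect", "doctor", "politician"],
        "ORGANIZATION": ["airline", "sports_league", "government_agency", "news_agency"],
        "LOCATION": ["city", "country", "road", "park"],
        "PRODUCT": ["car", "camera", "computer", "game"],
    }

    def coarse_type(fine):
        """Return the coarse entity type of a fine-grained subtype, if any."""
        for coarse, subtypes in ENTITY_TYPES.items():
            if fine in subtypes:
                return coarse
        return None

    print(coarse_type("politician"))  # PERSON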
Once entity types are defined, how can we devise a text mining process to identify, in a corpus, sentences mentioning these types? A first strategy is to use lists of individuals belonging to these types, as we see next.

3.2 Gazetteers

One approach to finding named entities in text is to have lists of the individuals, often referred to as gazetteers. On the Web (e.g., in Wikipedia), we can find lists of just about anything imaginable: varieties of rice, car brands, romantic symphonies, countries, and so on. Let us take art museums as an example. The following query submitted to the DBpedia SPARQL endpoint would provide a long list of hundreds of museums:

    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    select ?X where {
      ?X rdf:type dbr:Art_museum .
    }

This list could easily become a gazetteer for an ArtMuseum entity type to be used for searches in text. Table 3.1 provides some examples. Each of the entities has a variety of surface forms, all of which could be used in a search.

Table 3.1 Gazetteer for ArtMuseum

    No.  Museum label
    1    Berkshire Museum
    2    The Louvre
    3    Museum of Modern Art, Antwerp
    4    Hirshhorn Museum and Sculpture Garden
    5    Museum of Fine Arts of Lyon
    6    Kosova National Art Gallery
    7    Art Gallery of Algoma
    8    National Gallery of Canada
    9    Museu Picasso

Similarly, using the following query on the DBpedia SPARQL endpoint, we can find many composers classified under the Viennese Composers category in Yago. (Yago is a large Semantic Web resource, described at http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/.)

    prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    prefix yago: <http://dbpedia.org/class/yago/>
    select ?X where {
      ?X rdf:type yago:VienneseComposers .
    }

Results of the query can be used to generate a gazetteer for the VienneseComposer entity type, as in Table 3.2. This list would include many contemporaries of Ludwig van Beethoven, the individual of our earlier investigation.

Table 3.2 Gazetteer for VienneseComposer

    No.  Composer label
    1    Wolfgang Amadeus Mozart
    2    Franz Schubert
    3    Johann Strauss I
    4    Franz Lehar
    5    Ludwig van Beethoven
    6    Johannes Brahms
    7    Joseph Haydn
    8    Anton Webern
    9    Alban Berg
    10   Arnold Schoenberg

An unfortunate drawback to all lists is that they are likely to be incomplete and thus insufficient. Despite this, lists are important in the search for named entities and are widely used, most often in combination with other strategies. In Chap. 13, as we revisit entity types as constraints for relation search, we will investigate voting strategies to combine gazetteers with other methods of entity type validation. For now, let us explore a second strategy for finding mentions of an entity type in a corpus: regular expressions.

3.3 Capturing Regularities in Named Entities: Regular Expressions

Certain entity types have fairly regular ways of expressing their individuals, which we can easily detect. The individuals for the entity type ArtMuseum, as shown in Table 3.1, do include a few words that appear repeatedly (museum, fine art, art gallery), but we would still have to consider more data to convince ourselves of regularity within that class. And to continue with our musical theme, would an entity type Symphony contain regularities?

Table 3.3 Examples of instances of the Symphony type

    No.  Symphonies
    1    Symphony no. 8
    2    Beethoven's symphony no. 6
    3    Symphony no.4, op. 47
    4    Symphony no 5 in b flat
    5    Symphony no. 1 in d minor
    6    Schubert's Symphony no. 4
    7    Symphony "pathetique"
    8    Symphony no. 3 in b-flat minor
    9    Abel's Symphony no. 5, op. 7
Consider this short list of individuals belonging to the Symphony type, shown in Table 3.3. As you can see, there is still quite a lot of variation within these examples, but we can identify certain patterns emerging. In an attempt to concisely capture and represent the variations seen here, we turn to regular expressions.

Regular expressions provide a concise way of expressing particular sequences of text. Although they can seem intimidating at first, it is essential to master them for NLP, since they provide a very flexible and powerful tool for text search. Most programming languages define text search libraries allowing the use of regular expressions. The best way to become familiar with writing these expressions is simply to practice. The process is comparable to learning a new language: practice is key. (There are online regular expression testers, such as Regexpal, available at http://regexpal.com/, which allow you to write regular expressions and use them to search in text. You can also use the regular expression matcher libraries within your favourite programming language to write and test the search capacity of regular expressions.)

Below are examples of what regular expressions can capture in text:

    single character disjunction: [tT]able → table, Table
    range: [A-Z]x → Ax, Bx, Cx, Dx, ..., Zx
    explicit disjunction: [a-z|A-Z] → all letters
    negation: [^Ss] → everything except uppercase and lowercase 's'
    previous character optional: favou?r → favor, favour
    repetition of 0 or more times: argh* → arghhhhhhh, arghh, arg
    repetition of 1 or more times: x+ → x, xx, xxx
    wildcard on a single character: ax.s → axis, axes
    specification of number of characters: x{2,4} → xx, xxx, xxxx
    escape for special characters: \(a → (a

As you can see, the language of regular expressions includes five important aspects: repetition, range, disjunction, optionality, and negation. Since we can combine these aspects in various ways and apply them to either single characters or groups of characters, there is an infinite number of text segments that can be captured with regular expressions. That explains how regular expressions can represent very long lists within a single expression.

Granted, for the Symphony type it may be possible to list all the existing symphonies, but what about entity types such as Date, PhoneNumber, or EmailAddress? Regular expressions allow for the concise representation of variations within these non-enumerable entity types. Table 3.4 shows examples of different regular expressions to recognize particular entities.

Table 3.4 Examples of regular expressions for specific entity types

    Description                                       RegEx                                                  Examples
    Abbreviated winter month, optional final period   (Jan|Feb|Mar)\.?                                       Jan. / Feb
    Any year between 1000 and 3000                    [1-3][0-9]{3}                                          1977 / 2015
    Postal codes in Canada                            [A-Z][0-9][A-Z][0-9][A-Z][0-9]                         H2X3W7 / Q1Z4W8
    Avenue names                                      (1st|2nd|3rd|([4-9]|1[0-9]|20)th) (Avenue|Ave\.|Ave)   3rd Avenue / 6th Ave.
    Any street name                                   [A-Z][a-z|A-Z]* (Street|St\.|St)                       Wrench St. / Lock Street
    North American phone numbers                      \([1-9]{3}\)
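As a quick illustration of using such expressions from code, here is a minimal Python sketch applying the year pattern from Table 3.4 to two invented sentences (the word boundaries \b are an added assumption, to avoid matching inside longer digit strings):

    import re

    # Year pattern from Table 3.4: a four-digit year starting with 1, 2, or 3.
    year = re.compile(r"\b[1-3][0-9]{3}\b")

    corpus = [
        "Beethoven was born in 1770 and died in 1827.",
        "The symphony was dedicated to a generous patron.",
    ]
    for sentence in corpus:
        print(year.findall(sentence))
    # ['1770', '1827']
    # []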