Natural Language Processing Basics PDF
Summary
This document provides an introduction to Natural Language Processing (NLP), a field of artificial intelligence that enables computers to understand and generate human language. The document covers the basics of NLP, including its history, applications, and subfields. It uses examples to illustrate how NLP works in everyday life.
Full Transcript
Natural Language Processing Basics

What Is Natural Language Processing?

Natural language processing (NLP) is a field of artificial intelligence (AI) that combines computer science and linguistics to give computers the ability to understand, interpret, and generate human language in a way that’s meaningful and useful to humans. NLP helps computers perform useful tasks like understanding the meaning of sentences, recognizing important details in text, translating languages, answering questions, summarizing text, and generating responses that resemble human responses.

NLP is already so commonplace in our everyday lives that we usually don’t even think about it when we interact with it or when it does something for us. For example, maybe your email or document creation app automatically suggests a word or phrase you could use next. You may ask a virtual assistant, like Siri, to remind you to water your plants on Tuesdays. Or you might ask Alexa to tell you details about the last big earthquake in Chile for your daughter’s science project. The chatbots you engage with when you contact a company’s customer service use NLP, and so does the translation app you use to help you order a meal in a different country. Spam detection, your online news preferences, and so much more rely on NLP.

A Very Brief History of NLP

It’s worth mentioning that NLP is not new. In fact, its roots wind back to the 1950s, when researchers began using computers to understand and generate human language. One of the first notable contributions to NLP was the Turing Test. Proposed by Alan Turing in 1950, this test measures a machine’s ability to answer any question in a way that’s indistinguishable from how a human would answer. Shortly after that, the first machine translation systems were developed. These were sentence- and phrase-based language translation experiments that didn’t progress very far because they relied on very specific patterns of language, like predefined phrases or sentences.
By the 1960s, researchers were experimenting with rule-based systems that allowed users to ask the computer to complete tasks or have conversations. The 1970s and 80s saw more sophisticated knowledge-based approaches using linguistic rules, rule-based reasoning, and domain knowledge for tasks like executing commands and diagnosing medical conditions. Statistical approaches (that is, learning from data) to NLP were popular in the 1990s and early 2000s, leading to advances in speech recognition, machine translation, and machine learning algorithms. During this period, the spread of the World Wide Web, whose software entered the public domain in 1993, made vast amounts of text-based data readily available for NLP research. Since about 2009, neural networks and deep learning have dominated NLP research and development. NLP areas of translation and natural language generation, including the recently introduced ChatGPT, have vastly improved and continue to evolve rapidly.

Human Language Is “Natural” Language

What is natural language anyway? Natural language refers to the way humans communicate with each other using words and sentences. It’s the language we use in conversations and when we read, write, or listen. Natural language is the way we convey information, express ideas, ask questions, tell stories, and engage with each other. While NLP models are being developed for many different human languages, this module focuses on NLP in the English language.

If you completed the Artificial Intelligence Fundamentals badge, you learned about unstructured data and structured data. These are important terms in NLP, too. Natural language (the way we actually speak) is unstructured data, meaning that while we humans can usually derive meaning from it, it doesn’t provide a computer with the right kind of detail to make sense of it. The following paragraph about an adoptable shelter dog is an example of unstructured data.

Tala is a 5-year-old spayed, 65-pound female husky who loves to play in the park and take long hikes.
She is very gentle with young children and is great with cats. This blue-eyed sweetheart has a long gray and white coat that will need regular brushing. You can schedule a time to meet Tala by calling the Troutdale shelter.

For a computer to understand what we mean, this information needs to be well-defined and organized, similar to what you might find in a spreadsheet or a database. This is called structured data. Which information is included in structured data, and how it is formatted, is ultimately determined by the algorithms used by the intended end application. For example, data for a translation app is structured differently than data for a chatbot. Here’s how the data in the paragraph above might look as structured data for an app that can help match dogs with potential adopters.

Name: Tala
Age: 5
Spayed or Neutered: Spayed
And so on.

Natural Language Understanding and Natural Language Generation

Today’s NLP has matured along with its two subfields, natural language understanding (NLU) and natural language generation (NLG). Processing data from unstructured to structured form is called natural language understanding (NLU). NLU uses many techniques to interpret written or spoken language to understand the meaning and context behind it. Processing data the reverse way, from structured to unstructured, is called natural language generation (NLG). NLG is what enables computers to generate human-like language. NLG involves the development of algorithms and models that convert structured data or information into meaningful, contextually appropriate, natural-sounding text or speech. It also includes the generation of code in a programming language, such as generating a Python function for sorting strings. In the past, NLU and NLG tasks made use of explicit, structured linguistic representations like parse trees.
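The two directions described above can be sketched with the Tala example. This is only a minimal illustration that assumes hand-written patterns and a fixed sentence template. The `DogRecord` type, its `weight_lb` field, and the `understand` and `generate` functions are hypothetical names introduced here, and real NLU and NLG systems rely on trained models, not on rules like these.

```python
import re
from dataclasses import dataclass

# First sentence of the shelter-dog paragraph (unstructured data).
TEXT = (
    "Tala is a 5-year-old spayed, 65-pound female husky who loves to play "
    "in the park and take long hikes."
)


@dataclass
class DogRecord:
    """Hypothetical structured record for a dog-matching app."""
    name: str
    age: int
    spayed_or_neutered: str
    weight_lb: int  # assumed extra field, not listed in the text


def understand(text: str) -> DogRecord:
    """Toy NLU: unstructured text -> structured record via regex patterns."""
    name = re.match(r"(\w+) is a", text).group(1)
    age = int(re.search(r"(\d+)-year-old", text).group(1))
    weight = int(re.search(r"(\d+)-pound", text).group(1))
    status = re.search(r"\b(spayed|neutered)\b", text, re.IGNORECASE).group(1)
    return DogRecord(name, age, status.capitalize(), weight)


def generate(record: DogRecord) -> str:
    """Toy NLG: structured record -> sentence via a fixed template."""
    return (
        f"{record.name} is a {record.age}-year-old, "
        f"{record.weight_lb}-pound dog who is "
        f"{record.spayed_or_neutered.lower()}."
    )


record = understand(TEXT)
print(record)
print(generate(record))
```

The patterns only work for sentences shaped like this one, which is exactly the brittleness that pushed the field from rules toward statistical and neural methods.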
While NLU and NLG are still critical to NLP today, most of the apps, tools, and virtual assistants we communicate with have evolved to use deep learning or neural networks to perform tasks end to end. For instance, a neural machine translation system may translate a sentence from, say, Chinese directly into English without explicitly creating any kind of intermediate structure. Neural networks recognize patterns, words, and phrases, which makes language processing much faster and more contextually accurate.

In the next unit, you learn more about the natural language methods and techniques that enable computers to make sense of what we say and respond accordingly.

Learn About Natural Language Parsing

Basic Elements of Natural Language

Understanding and processing natural language is a fundamental challenge for computers. That’s because it involves not only recognizing individual words, but also comprehending their relationships, their context, and their meaning. Our natural language, in text and speech, is characterized by endless complexity, nuance, ambiguity, and mistakes. In our everyday communication, we encounter words with several meanings; words that sound the same but are spelled differently and have different meanings; misplaced modifiers; misspellings; and mispronunciations. We also encounter people who speak fast, mumble, or take forever to get to the point, and people whose accents or dialects are different from ours.

Take this sentence for example: “We saw six bison on vacation in Yellowstone National Park.” You might giggle a little as you imagine six bison in hats and sunglasses posing for selfies in front of Old Faithful. But, most likely, you understand what actually happened: someone who was on vacation in Yellowstone National Park saw six bison.
Or this: “They swam out to the buoy.” If you heard someone speak this sentence without any context, you may think the people involved swam out to a male child, when in fact, they swam out to a marker in the water. The pronunciations of “boy” and “buoy” differ slightly in some accents and not at all in others, so the distinction is not always clear. While humans are able to flex and adapt to language fairly easily, training a computer to consider these kinds of nuances is quite difficult.

Elements of natural language in English include:
Vocabulary: The words we use.
Grammar: The rules governing sentence structure.
Syntax: How words are combined to form sentences according to grammar.
Semantics: The meaning of words, phrases, and sentences.
Pragmatics: The context and intent behind cultural or geographic language use.
Discourse and dialogue: Units larger than a single phrase or sentence, including documents and conversations.
Phonetics and phonology: The sounds we make when we communicate.
Morphology: How parts of words can be combined or uncombined to make new words.

Parsing Natural Language

Teaching a computer to read and derive meaning from words is a bit like teaching a child to read: they both learn to recognize words, their sounds, meaning, and pronunciation. But when a child learns to read, they usually have the advantage of context from a story; visual cues from illustrations; and relationships to things they already know, like trees or animals. They also often get assistance and encouragement from experienced readers, who help explain what they’re learning. These cues help new readers identify and attach meaning to words and phrases that they can generalize to other things they read in the future.

We know that computers are a different kind of smart, so while a computer needs to understand the elements of natural language described above, the approach needs to be much more scientific.
NLP uses algorithms and methods like large language models (LLMs), statistical models, machine learning, deep learning, and rule-based systems to process and analyze text. These techniques, called parsing, involve breaking down text or speech into smaller parts to classify them for NLP. Parsing includes syntactic parsing, where elements of natural language are analyzed to identify the underlying grammatical structure, and semantic parsing, which derives meaning.

As mentioned in the last unit, natural language is parsed in different ways to match intended outcomes. For example, natural language that’s parsed for a translation app uses different algorithms or models and is parsed differently than natural language intended for a virtual assistant like Alexa.

Syntactic parsing may include:
Segmentation: Larger texts are divided into smaller, meaningful chunks. Segmentation usually occurs at the punctuation marks that end sentences, to help organize text for further analysis.
Tokenization: Sentences are split into individual words, called tokens. In the English language, tokenization is a fairly straightforward task because words are usually separated by spaces. In languages like Thai or Chinese, tokenization is much more complicated and relies heavily on an understanding of vocabulary and morphology to accurately tokenize language.
Stemming: Words are reduced to their root form, or stem. For example, breaking, breaks, or unbreakable are all reduced to break. Stemming helps to reduce the variations of word forms, but, depending on context, it may not lead to the most accurate stem. Look at these two examples that use stemming:
“I’m going outside to rake leaves.” Stem = leave
“He always leaves the key in the lock.” Stem = leave
Lemmatization: Similar to stemming, lemmatization reduces words to their root, but takes the part of speech into account to arrive at a much more valid root word, or lemma.
Here are the same two examples using lemmatization:
“I’m going outside to rake leaves.” Lemma = leaf
“He always leaves the key in the lock.” Lemma = leave
Part of speech tagging: Assigns grammatical labels or tags to each word based on its part of speech, such as noun, adjective, verb, and so on. Part of speech tagging is an important function in NLP because it helps computers understand the syntax of a sentence.
Named entity recognition (NER): Uses algorithms to identify and classify named entities (like people, dates, places, organizations, and so on) in text to help with tasks like answering questions and information extraction.

Semantic Analysis

Parsing natural language using some or all of the steps we just described does a pretty good job of capturing the meaning of text or speech. But it misses the subtler nuances that make human language, well, human. Semantic parsing involves analyzing the grammatical format of sentences and the relationships between words and phrases to build a representation of their meaning. Extracting how people feel, why they are engaging, and details about the circumstances surrounding an interaction all play a crucial role in accurately deciphering text or speech and forming an appropriate response.

Here are several common analysis techniques that are used in NLP. Each of these techniques can be powered by a number of different algorithms to get the desired level of understanding, depending on the specific task and the complexity of the analysis.

Sentiment analysis: Involves determining whether a piece of text (such as a sentence, a social media post, a review, or a tweet) expresses a positive, negative, or neutral sentiment. A sentiment is a feeling or an attitude toward something.
For example, sentiment analysis can determine if this customer review of a service is positive or negative: “I had to wait a very long time for my haircut.” Sentiment analysis identifies and classifies emotions or opinions in text to help businesses understand how people feel about their products, services, or experiences.
Intent analysis: Intent helps us understand what someone wants or means based on what they say or write. It’s like deciphering the purpose or intention behind their words. For example, if someone types, “I can’t log in to my account,” into a customer support chatbot, intent analysis would recognize that the person’s intent is to get help accessing their account. The chatbot might reply with details about resetting a password or other ways the user can try to access their account. Virtual assistants, customer support systems, and chatbots often use intent analysis to understand user requests and provide appropriate responses or actions.
Context (discourse) analysis: Natural language relies heavily on context. The interpretation of a statement might change based on the situation, the details provided, and any shared understanding that exists between the people communicating. Context analysis involves understanding this surrounding information to make sense of a piece of text. For example, if someone says, “They had a ball,” context analysis can determine if they are talking about a fancy dance party, a piece of sports equipment, or a whole lot of fun. It does this by considering the previous conversation or the topic being discussed. Context analysis helps NLP systems interpret words more accurately by taking into account the broader context, the relationships between words, and other relevant information.

These three analysis techniques (sentiment analysis, intent analysis, and context analysis) play important roles in extracting valuable insights from text and speech data.
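Sentiment and intent analysis, as described above, can be sketched with simple hand-written rules. This is only an illustration under stated assumptions: the word lists, the intent labels, and the trigger patterns below are hypothetical and chosen to fit the two examples in the text. Production systems typically use trained classifiers instead.

```python
import re

# Hypothetical sentiment word lists, for illustration only.
POSITIVE = {"great", "love", "friendly", "quick", "happy"}
NEGATIVE = {"wait", "long", "rude", "slow", "bad"}

# Hypothetical intent labels mapped to trigger patterns.
INTENT_PATTERNS = {
    "account_access_help": r"can'?t log ?in|reset (my )?password|locked out",
    "order_status": r"where is my order|track(ing)? (my )?(order|package)",
}


def normalize(text):
    """Lowercase the text and replace curly apostrophes with straight ones."""
    return text.lower().replace("’", "'")


def sentiment(text):
    """Count positive and negative words and compare the totals."""
    words = re.findall(r"[a-z']+", normalize(text))
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def intent(text):
    """Return the first intent label whose pattern matches, else 'unknown'."""
    lowered = normalize(text)
    for label, pattern in INTENT_PATTERNS.items():
        if re.search(pattern, lowered):
            return label
    return "unknown"


print(sentiment("I had to wait a very long time for my haircut."))  # negative
print(intent("I can’t log in to my account"))  # account_access_help
```

Word counting like this ignores negation and context (for example, “not bad” would count as negative), which is one reason context analysis matters alongside the other two techniques.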
Together, these techniques enable more sophisticated and accurate understanding of, and engagement with, textual content across NLP applications.

Summary

In this module, you’ve learned about NLP at a very high level, and as it relates to the English language. To date, the majority of NLP study is conducted using English, but you can also find a great deal of research done in Spanish, French, Farsi, Urdu, Chinese, and Arabic. NLP is a very rapidly evolving field of AI, and advancements in NLP are quickly leading to more sophisticated language understanding, cross-language capabilities, and integration with other AI fields.
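As a recap of the syntactic parsing steps covered in this module, the sketch below runs segmentation, tokenization, stemming, and lemmatization on the two “leaves” sentences. It is a toy under stated assumptions: `naive_stem` is a hypothetical suffix stripper (not the Porter algorithm or any library stemmer), and `LEMMAS` is a hand-made lookup table standing in for a real lemmatizer with a part of speech tagger.

```python
import re


def segment(text):
    """Segmentation: split at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Tokenization: lowercase word tokens; apostrophes stay inside words."""
    lowered = sentence.lower().replace("’", "'")
    return re.findall(r"[a-z]+(?:'[a-z]+)?", lowered)


def naive_stem(word):
    """Toy stemming: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma table keyed by (word, part of speech).
LEMMAS = {("leaves", "NOUN"): "leaf", ("leaves", "VERB"): "leave"}


def lemmatize(word, pos):
    """Toy lemmatization: look up the lemma, falling back to the word itself."""
    return LEMMAS.get((word, pos), word)


text = "I’m going outside to rake leaves. He always leaves the key in the lock."
sentences = segment(text)
for sentence in sentences:
    print(tokenize(sentence))

# Stemming gives the same stem for both uses of "leaves" ...
print(naive_stem("leaves"))
# ... while lemmatization uses the part of speech to tell them apart.
print(lemmatize("leaves", "NOUN"), lemmatize("leaves", "VERB"))
```

The part of speech labels are passed in by hand here. In a real pipeline they would come from the part of speech tagging step.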