Natural Language Processing Course Description (CS 463) 2024-2025

Document Details


Uploaded by EvaluativeProse

Faculty of Computers and Information, Arish University

2024

ميـار محمد طه الشرقـاوي

Tags

natural language processing, python programming, computer science, nlp techniques

Summary

This course provides a theoretical and methodological introduction to natural language processing techniques, focusing on Python-based methods. It also explores how large digital corpora are used in the study of literature, history, law, medicine and more, and how traditional philological analysis intersects with modern computational approaches. Students will gain hands-on experience in using Python to perform textual and linguistic analyses, culminating in individual projects.

Full Transcript


Faculty of Computers and Informatics – Tanta University
Course Specification: Natural Language Processing

Course Title: Natural Language Processing
Course Code: CS 463
Academic Year: 2024-2025
Coordinator: Dr. Marian Wagdy
Instructor(s): Omnia Gamal El Barbary
Semester: First semester
Pre-requisite: Machine Learning (CS314)
Parent Department: Computer Science Department
Date of Approval: 2023/2024

1. Course Aims

This course is intended as a theoretical and methodological introduction to the most widely used and effective current techniques, strategies and toolkits for natural language processing, with a primary focus on those available in the Python programming language. We will also consider how harnessing large digital corpora and large-scale textual data sources has changed how scholars engage with and evaluate digital archives and textual sources, and what opportunities textual repositories offer for computational approaches to the study of literature, history and a variety of other fields, including law, medicine, business and the social sciences. In addition to evaluating new digital methodologies in the light of traditional approaches to philological analysis, students will gain extensive experience in using Python to conduct textual and linguistic analyses. By the end of the course, students will have developed their own individual projects, gaining a practical understanding of natural language processing workflows, along with specific tools and methods for evaluating the results achieved through NLP-based exploratory and analytical strategies.
Throughout this course, the sources, methodologies and tools we focus on will be decided in part by student interests and goals. As we progress, please take note of, and send to me, any specific types of toolkits or approaches you think might be useful or relevant for your work and analyses.

2. Intended Learning Outcomes (ILOs)

A. Knowledge and understanding

Upon successful completion of an undergraduate computer science program, graduates will be able to:

a1. Understand programming concepts for various branches of natural language processing techniques.
a3. Identify and consider the basics of tools for natural language processing techniques.
a4. Describe and model mathematical problems and statistical methods.
a5. Understand moral, ethical and professional issues involved in the exploitation of computer technology.
a9. Know how to solve problems by programming and using simulation models.
a13. Recognize Machine Learning techniques and Big Data analysis.

B. Intellectual skills

Upon successful completion of an undergraduate computer science program, graduates will be able to:

b1. Construct and solve abstract and mathematical models of computer and communication systems.
b2. Gather, integrate, and evaluate data and information for problem solving.
b4. Apply appropriate programming techniques to the development of software solutions.
b6. Develop computer algorithms to solve different problems.

C. Professional and practical skills

Upon successful completion of an undergraduate computer science program, graduates will be able to:

c1. Prepare technical reports and present seminars effectively.
c2. Choose the appropriate programming language or operating system.
c3. Deploy communication skills in team working or leading.
c5. Investigate and use computer science skills.
c6. Design and develop computer-based systems.
c7. Evaluate systems in terms of quality attributes.
c13. Investigate different techniques of information retrieval.

D. General and transferable skills

Upon successful completion of an undergraduate computer science program, graduates will be able to:

d1. Practice communication and management skills.
d2. Practice independent learning techniques.
d3. Develop the skill of bringing people together.
d4. Follow analytical and creative thinking.
d5. Practice design and engineering skills for projects.
d6. Work effectively, independently or as part of a team.
d7. Specify the applied human rights.
d8. Practice design and engineering skills for projects.
d9. Use modelling capability in software projects.
d10. Follow ethics in research and work.

3. Course Contents

Week 1: Introduction to NLP (definition, scope of natural language processing, and applications)
Week 2: General steps of NLP
Week 3: Text preprocessing
Week 4: Regular expressions
Week 5: Quiz 1; What is a lexical analysis?; Stemming; Lemmatization
Week 6: Stop-word removal; Morphological parsing
Week 7: Midterm
Week 8: Stop-word removal; Morphological parsing
Week 9: Neural network models; Rule-based methods; Hidden Markov Models (HMMs)
Week 10: Syntactic analysis; Part-of-speech (POS) tagging
Week 11: Semantic analysis; Named Entity Recognition (NER)
Week 12: Quiz 2; Word-Sense Disambiguation (WSD)
Week 13: Discourse analysis
Week 14: Project discussion

4. Teaching and Learning Methods

1. Accommodation (lecture rooms, laboratories, etc.): lecture room; lab with office tools installed.
2. Computing resources: office suite installed.

5. Student Assessment

Assessment method; length; schedule; proportion:
Written Examination: 2 hours; final examination, the 16th week; 60%
Oral Examination: 15 minutes; term final; 10%
Practical Examination: 15 minutes; the 15th week; 10%
Mid-Term Examination: 2 hours; the 7th week; 20%

6. List of References

- Dan Jurafsky and James H. Martin, "Speech and Language Processing", 2nd Edition, Prentice Hall, 2009. A third-edition draft is available at web.stanford.edu/~jurafsky/slp3/.
- Jacob Eisenstein, "Introduction to Natural Language Processing", The MIT Press, 2019.
- Chris Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing", MIT Press, 1999.

7. Facilities Required for Teaching and Learning

- Board and dustless chalk
- Overhead projector
- Data show
- Different experiments related to the course

Course coordinator: Dr. Marian Wagdy
Program coordinator: Prof. Nancy El-Hefnawy
Date: 01/2024

5.5 Course Contents – Course ILOs Matrix

Course Code / Course Title: CS 463 / Natural Language Processing
ILO columns: Knowledge and Understanding (A1–A5), Intellectual (B1–B3), Practical (C1–C3), Transferable (D1–D2)

Introduction to NLP: √ √ √
Text Preprocessing: √ √
Lexical and Morphological Analysis: √ √ √ √ √ √
Syntactic Analysis (Parsing): √ √ √ √ √ √ √
Semantic Analysis: √ √ √ √ √ √ √ √
Pragmatic Analysis: √ √ √ √ √

Learning Methods – Course ILOs Matrix
ILO columns: Knowledge and Understanding (A1–A3), Intellectual (B1–B3), Professional and Practical (C1–C3), General and Transferable (D1–D2)

Lecture: √ √ √ √ √ √
Discussion (brainstorming): √ √ √ √
Self-learning (essay): √ √ √ √ √
Practice: √ √ √ √ √ √ √

Assessment Methods – Course ILOs Matrix
ILO columns: Knowledge and Understanding (A1–A3), Intellectual (B1–B3), Professional and Practical (C1–C3), General and Transferable (D1–D2)

Final: √ √ √ √ √ √ √ √
Mid-Term Examination: √ √ √ √ √
Oral: √ √ √ √ √ √ √ √
Semester work and practical exam: √ √ √ √ √ √ √ √ √ √ √

Course coordinator: Dr. Marian Wagdy
Head of Department: Prof.
Nancy Elhefnawy

Table of Contents

Chapter 1: Introduction
  1.1 Applications
  1.2 Phases of Natural Language Processing (NLP)
Chapter 2: Text Preprocessing
  2.1 Text Preprocessing Techniques in NLP
    2.1.1 Document cleanup and binarization
    2.1.2 Lowercasing
    2.1.3 Regular Expressions
Chapter 3: Lexical and Morphological Analysis
  3.2 What is a Lexical Analysis?
  3.3 Tokenization
  3.4 Key Techniques used in Morphological Analysis for NLP Tasks
Chapter 4: Syntactic Analysis (Parsing)
  4.1 Part-of-speech (POS) tagging
  4.2 Role of Parsing
  4.3 Types of Parsing
Chapter 5: Semantic Analysis
  5.1 Named Entity Recognition (NER)
  5.2 Word-Sense Disambiguation (WSD)
Chapter 6: Discourse Analysis
  6.1 Discourse analysis
  6.2 Discourse structure
  6.3 Algorithms for Discourse Segmentation
  6.4 Text Coherence
  6.7 Reference Resolution

Chapter 1: Introduction

In recent years, nearly everyone carries handheld digital devices, such as PDAs and camera phones, which can be used to capture documents such as posters, magazines and books. This is the simplest way to disseminate and collect information. In past decades, scanners were used to transfer documents into digital format. Efficient storage, management and retrieval of digitized document images are extremely important in many office automation and digital library applications.

In another scenario, when dealing with valuable historical manuscripts, a flatbed scanner cannot be used, because it is difficult to move the historical document from its place. A historical document by its nature suffers from various distortions and, having survived many years, may already be damaged. Such documents can also be affected by the heat of the device and by physical pressure.
In such cases, a standard digital camera is the best solution. Examples of images from the digitization process are shown in the following figure.

Natural language processing (NLP) is an interdisciplinary subfield of computer science and artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language, and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically, data is collected in text corpora and processed using rule-based, statistical or neural approaches from machine learning and deep learning.

Major tasks in natural language processing are speech recognition, text classification, natural-language understanding, and natural-language generation.

Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, though at the time that was not articulated as a problem separate from artificial intelligence. The proposed test includes a task that involves the automated interpretation and generation of natural language.

1.1 Applications

The following is a list of some of the most common tasks in natural language processing. Some of these tasks have direct real-world applications, while others more commonly serve as subtasks that are used to aid in solving larger tasks. Though natural language processing tasks are closely intertwined, they can be subdivided into categories for convenience. A coarse division is given below.

1.1.1 Text analysis and speech processing

Human beings have long been motivated to create machines that can talk.
Early attempts at understanding speech production consisted of building mechanical models to mimic the human vocal apparatus. Two such examples date back to the 13th century, when the German philosopher Albertus Magnus and the English scientist Roger Bacon are reputed to have constructed metal talking heads. However, no documentation of these devices is known to exist. The first documented attempts at making speaking machines came some five hundred years later.

In 1769 Kratzenstein constructed resonant cavities which, when excited by a vibrating reed, produced the sounds of the five vowels a, e, i, o, and u. Around the same time, and independently of this work, Wolfgang von Kempelen constructed a mechanical speech synthesizer that could generate recognizable consonants, vowels, and some connected utterances. His book on this research, published in 1791, may be regarded as marking the beginning of speech processing. Some 40 years later, Charles Wheatstone constructed a machine based essentially on von Kempelen's specifications.

Document image analysis is classified into two categories, textual processing and graphical processing, as shown in the following figure. The first category deals with textual processing, such as skew detection and correction, text recognition by optical character recognition (OCR) engines, and finding words, text lines, and paragraphs. The second category deals with graphical processing, such as non-textual and symbol components, diagrams, figures and company logos.

1.1.1.1 Optical Character Recognition (OCR)

Document image digitization is part of the field of document image analysis and recognition, which is the process that implements the overall interpretation of document images.
It is responsible for recognizing the components of document images, such as text and figures, and extracting the information as a human would. Optical Character Recognition (OCR) engines are the most common applications that convert digitized document images into machine-encoded, computer-readable text. ABBYY FineReader is a common example of an OCR program, as shown in the following figure. These engines are widely used to facilitate data entry from documents (bank statements, passports, business cards, invoices, mail, and receipts), so that the contents can be electronically edited, searched and displayed. OCR is a field of research in image processing, pattern recognition, natural language processing, artificial intelligence and database systems.

How does OCR work?

An OCR engine works in the following steps:

- Image acquisition. A scanner reads documents and converts them to binary data. The OCR software analyzes the scanned image and classifies the light areas as background and the dark areas as text.

- Preprocessing. The OCR software first cleans the image and removes errors to prepare it for reading. Its cleaning techniques include:
  - Deskewing, or tilting the scanned document slightly to fix alignment issues introduced during the scan.
  - Despeckling, or removing digital image spots and smoothing the edges of text images.
  - Cleaning up boxes and lines in the image.
  - Script recognition, for multi-language OCR.

- Text recognition. The two main types of algorithms an OCR system uses for text recognition are pattern matching and feature extraction.
  - Pattern matching works by isolating a character image, called a glyph, and comparing it with a similarly stored glyph.
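The template-comparison idea just described can be sketched in a few lines of Python. This is a minimal illustration, not a real OCR engine: the two 5x3 "template" bitmaps below form an invented two-letter font, the noisy input glyph is made up, and matching is plain pixel agreement.

```python
# A minimal sketch of glyph pattern matching: an isolated character image
# (a tiny 5x3 binary bitmap) is compared pixel-by-pixel against stored
# template glyphs, and the best-scoring template wins. The bitmaps and the
# two-letter "font" are illustrative inventions, not a real OCR dataset.

# Stored templates: 1 = dark (ink), 0 = light (background).
TEMPLATES = {
    "I": [(0, 1, 0),
          (0, 1, 0),
          (0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)],
    "L": [(1, 0, 0),
          (1, 0, 0),
          (1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)],
}

def match_glyph(glyph):
    """Return the template letter whose pixels agree most with `glyph`."""
    def agreement(template):
        return sum(
            t == g
            for t_row, g_row in zip(template, glyph)
            for t, g in zip(t_row, g_row)
        )
    return max(TEMPLATES, key=lambda letter: agreement(TEMPLATES[letter]))

# A noisy scan of "L" (one stray dark pixel) still matches "L".
noisy_l = [(1, 0, 0),
           (1, 0, 0),
           (1, 1, 0),   # stray dark pixel
           (1, 0, 0),
           (1, 1, 1)]
print(match_glyph(noisy_l))  # L
```

As the text notes next, this approach is only reliable when the input glyph shares the font and scale of the stored templates.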
Pattern matching works only if the stored glyph has a similar font and scale to the input glyph. This method works well with scanned images of documents that have been typed in a known font.
  - Feature extraction breaks down, or decomposes, the glyphs into features such as lines, closed loops, line direction, and line intersections. It then uses these features to find the best match, or nearest neighbor, among its stored glyphs.

- Postprocessing. After analysis, the system converts the extracted text data into a computerized file. Some OCR systems can create annotated PDF files that include both the before and after versions of the scanned document.

1.1.1.2 Speech recognition

Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text-to-speech, and is one of the extremely difficult problems colloquially termed "AI-complete". In natural speech there are hardly any pauses between successive words, so speech segmentation is a necessary subtask of speech recognition (see below). In most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be very difficult. Also, given that words in the same language are spoken by people with different accents, the speech recognition software must be able to recognize this wide variety of input as identical in terms of its textual equivalent.

1.1.1.3 Speech segmentation

Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural language. The term applies both to the mental processes used by humans, and to artificial processes of natural language processing.
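The boundary-finding problem has a written-language analogue, discussed below, for scripts without inter-word spaces. A common textbook baseline is greedy longest-match segmentation against a dictionary; the sketch below applies it to space-stripped English with an invented toy dictionary, purely for illustration.

```python
# A minimal greedy longest-match word segmenter: repeatedly take the longest
# dictionary word that prefixes the remaining text. The tiny dictionary is an
# invented example; real segmenters use large lexicons plus statistics.

DICTIONARY = {"the", "there", "their", "cat", "sat", "on", "mat", "a"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def segment(text):
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for j in range(min(len(text), i + MAX_WORD_LEN), i, -1):
            if text[i:j] in DICTIONARY:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word matches: emit one character and move on.
            words.append(text[i])
            i += 1
    return words

print(segment("thecatsatonthemat"))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Greedy matching is only a baseline: because it never backtracks, an early long match can leave an unsegmentable remainder, which is why practical systems score whole segmentations probabilistically.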
Speech segmentation is a subfield of general speech perception and an important subproblem of the technologically focused field of speech recognition, and cannot be adequately solved in isolation. As in most natural language processing problems, one must take into account context, grammar, and semantics, and even so the result is often a probabilistic division (statistically based on likelihood) rather than a categorical one. Though coarticulation (a phenomenon which may happen between adjacent words just as easily as within a single word) presents the main challenge in speech segmentation across languages, some other problems, and the strategies employed in solving them, can be seen in the following sections.

This problem overlaps to some extent with the problem of text segmentation that occurs in languages traditionally written without inter-word spaces, like Chinese and Japanese, as opposed to writing systems that indicate word boundaries with a word divider such as the space. However, even for those languages, text segmentation is often much easier than speech segmentation, because written language usually has little interference between adjacent words, and often contains additional clues not present in speech (such as the use of Chinese characters for word stems in Japanese).

For example: given a sound clip of a person or people speaking, separate it into words. This is a subtask of speech recognition and is typically grouped with it.

1.1.1.4 Text-to-speech

Text-to-speech, also known as TTS, is a form of speech synthesis that converts digital text into spoken voice output. This technology uses algorithms and neural networks to generate synthetic speech that closely mimics human speech.
At its core, TTS technology involves several key processes: analyzing the text, converting it into phonemes (the smallest units of sound in a language), and using a dataset to generate speech. Advanced TTS systems, powered by artificial intelligence and deep learning, produce natural-sounding, human-like voices.

Applications and Use Cases: TTS in Action

- Accessibility for all. TTS plays a crucial role in making digital content accessible to individuals with visual impairments, dyslexia, and other learning disabilities. Assistants like Amazon's Alexa and Apple's Siri use TTS to read aloud web pages and other digital text, aiding those who struggle with traditional reading.

- Educational and assistive tools. For students with dyslexia or other learning disabilities, TTS tools like Microsoft's Immersive Reader can significantly improve comprehension and learning experiences.

- Entertainment. From audiobooks to podcasts, TTS technology has transformed the entertainment industry. Services like Amazon Audible use high-quality TTS voices for narrating books, offering a rich listening experience.

- Business. TTS is widely used for voiceovers in advertisements, customer service chatbots, and virtual assistants. This technology saves time and resources while providing consistent and professional voice output.

1.1.1.5 Word segmentation (Tokenization)

Tokenization is a process used in text analysis that divides text into individual words or word fragments. This technique results in two key components: a word index and tokenized text. The word index is a list that maps unique words to specific numerical identifiers, and the tokenized text replaces each word with its corresponding numerical token. These numerical tokens are then used in various deep learning methods.
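The word index and tokenized text described above can be sketched in pure Python. This is a minimal illustration; real tokenizers (for example in NLTK or Keras) additionally handle punctuation, normalization, subwords, and out-of-vocabulary words.

```python
# A minimal sketch of tokenization plus a word index: split text into word
# tokens, assign each unique word a numeric id, and re-encode the text as a
# sequence of ids.
import re

def tokenize(text):
    # Lowercase the text and keep alphabetic word tokens only.
    return re.findall(r"[a-z]+", text.lower())

def build_word_index(tokens):
    index = {}
    for token in tokens:
        if token not in index:
            index[token] = len(index) + 1  # ids start at 1
    return index

text = "The cat sat on the mat."
tokens = tokenize(text)
word_index = build_word_index(tokens)
encoded = [word_index[t] for t in tokens]

print(tokens)      # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(word_index)  # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(encoded)     # [1, 2, 3, 4, 1, 5]
```

Note how the repeated word "the" maps to the same id both times it occurs; that shared id sequence is what downstream deep learning methods consume.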
For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages, like Chinese, Japanese and Thai, do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language. Sometimes this process is also used in cases like bag-of-words (BOW) creation in data mining.

1.1.2 Higher-level NLP applications

1.1.2.1 Automatic summarization (text summarization)

Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as research papers or articles in the financial section of a newspaper.

1.1.2.2 Grammatical error correction

Grammatical error detection and correction involves a great bandwidth of problems on all levels of linguistic analysis (phonology/orthography, morphology, syntax, semantics, pragmatics). Grammatical error correction is impactful, since it affects hundreds of millions of people that use or acquire English as a second language. It has thus been the subject of a number of shared tasks since 2011. As far as orthography, morphology, syntax and certain aspects of semantics are concerned, and due to the development of powerful neural language models such as GPT-2, this could by 2019 be considered a largely solved problem, and the technology is being marketed in various commercial applications.

1.1.2.3 Logic translation

Translate a text from a natural language into formal logic.

1.1.2.4 Machine translation (MT)

Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e.
requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) to solve properly.

1.1.2.5 Natural-language understanding (NLU)

Convert chunks of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate. Natural language understanding involves identifying the intended semantics from among the multiple possible semantics that can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts. The introduction and creation of a language metamodel and ontology are efficient, though empirical, solutions. An explicit formalization of natural language semantics, without confusion with implicit assumptions such as the closed-world assumption (CWA) vs. the open-world assumption, or subjective Yes/No vs. objective True/False, is expected to form the basis of semantics formalization.

1.1.2.6 Natural-language generation (NLG)

Convert information from computer databases or semantic intents into readable human language.

1.1.2.7 Book generation

Not an NLP task proper, but an extension of natural language generation and other NLP tasks, is the creation of full-fledged books. The first machine-generated book was created by a rule-based system in 1984 (Racter, The Policeman's Beard Is Half Constructed). The first published work by a neural network was published in 2018: 1 the Road, marketed as a novel, contains sixty million words. Both of these systems are basically elaborate but nonsensical (semantics-free) language models. The first machine-generated science book was published in 2019 (Beta Writer, Lithium-Ion Batteries, Springer, Cham). Unlike Racter and 1 the Road, this is grounded on factual knowledge and based on text summarization.
1.1.2.8 Document AI

A Document AI platform sits on top of NLP technology, enabling users with no prior experience of artificial intelligence, machine learning or NLP to quickly train a computer to extract the specific data they need from different document types. NLP-powered Document AI enables non-technical teams, for example lawyers, business analysts and accountants, to quickly access information hidden in documents.

1.1.2.9 Dialogue management

Computer systems intended to converse with a human.

1.1.2.10 Question answering

Given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?").

1.1.2.11 Text-to-image generation

Given a description of an image, generate an image that matches the description.

1.1.2.12 Text-to-scene generation

Given a description of a scene, generate a 3D model of the scene.

1.1.2.13 Text-to-video

Given a description of a video, generate a video that matches the description.

1.1.2.14 Cognition

Most higher-level NLP applications involve aspects that emulate intelligent behavior and apparent comprehension of natural language. More broadly speaking, the technical operationalization of increasingly advanced aspects of cognitive behavior represents one of the developmental trajectories of NLP (see the trends among CoNLL shared tasks). Cognition refers to "the mental action or process of acquiring knowledge and understanding through thought, experience, and the senses." Cognitive science is the interdisciplinary, scientific study of the mind and its processes. Cognitive linguistics is an interdisciplinary branch of linguistics, combining knowledge and research from both psychology and linguistics.
Especially during the age of symbolic NLP, the area of computational linguistics maintained strong ties with cognitive studies.

1.2 Phases of Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field within artificial intelligence that allows computers to comprehend, analyze, and interact with human language effectively. The process of NLP can be divided into five distinct phases: Lexical Analysis, Syntactic Analysis, Semantic Analysis, Discourse Integration, and Pragmatic Analysis. Each phase plays a crucial role in the overall understanding and processing of natural language.

1.2.1 First Phase of NLP: Lexical and Morphological Analysis

1) Tokenization

The lexical phase in Natural Language Processing (NLP) involves scanning text and breaking it down into smaller units such as paragraphs, sentences, and words. This process, known as tokenization, converts raw text into manageable units called tokens or lexemes. Tokenization is essential for understanding and processing text at the word level.

In addition to tokenization, various data cleaning and feature extraction techniques are applied, including:

- Lemmatization: reducing words to their base or root form.
- Stopword removal: eliminating common words that do not carry significant meaning, such as "and," "the," and "is."
- Correcting misspelled words: ensuring the text is free of spelling errors to maintain accuracy.

These steps enhance the comprehensibility of the text, making it easier to analyze and process.

2) Morphological Analysis

Morphological analysis is another critical phase in NLP, focusing on identifying morphemes, the smallest units of a word that carry meaning and cannot be further divided. Understanding morphemes is vital for grasping the structure of words and their relationships.

Types of Morphemes

1.
Free Morphemes: text elements that carry meaning independently and make sense on their own. For example, "bat" is a free morpheme.

2. Bound Morphemes: elements that must be attached to free morphemes to convey meaning, as they cannot stand alone. For instance, the suffix "-ing" is a bound morpheme; it must be attached to a free morpheme like "run" to form "running."

Importance of Morphological Analysis

Morphological analysis is crucial in NLP for several reasons:

- Understanding word structure: it helps in deciphering the composition of complex words.
- Predicting word forms: it aids in anticipating different forms of a word based on its morphemes.
- Improving accuracy: it enhances the accuracy of tasks such as part-of-speech tagging, syntactic parsing, and machine translation.

By identifying and analyzing morphemes, the system can interpret text correctly at the most fundamental level, laying the groundwork for more advanced NLP applications.

1.2.2 Second Phase of NLP: Syntactic Analysis (Parsing)

Syntactic analysis, also known as parsing, is the second phase of Natural Language Processing (NLP). This phase is essential for understanding the structure of a sentence and assessing its grammatical correctness. It involves analyzing the relationships between words and ensuring their logical consistency by comparing their arrangement against standard grammatical rules.

Role of Parsing

Parsing examines the grammatical structure and relationships within a given text. It assigns Part-Of-Speech (POS) tags to each word, categorizing them as nouns, verbs, adverbs, and so on. This tagging is crucial for understanding how words relate to each other syntactically and helps in avoiding ambiguity. Ambiguity arises when a text can be interpreted in multiple ways because words have various meanings.
For example, the word "book" can be a noun (a physical book) or a verb (the action of booking something), depending on the sentence context.

Examples of Syntax

Consider the following sentences:

 Correct syntax: "John eats an apple."
 Incorrect syntax: "Apple eats John an."

Despite using the same words, only the first sentence is grammatically correct and makes sense. The correct arrangement of words according to grammatical rules is what makes the sentence meaningful.

Assigning POS Tags

During parsing, each word in the sentence is assigned a POS tag to indicate its grammatical category. Here is an example breakdown:

 Sentence: "John eats an apple."
 POS tags:
  o John: Proper Noun (NNP)
  o eats: Verb (VBZ)
  o an: Determiner (DT)
  o apple: Noun (NN)

Assigning POS tags correctly is crucial for understanding the sentence structure and ensuring accurate interpretation of the text.

Importance of Syntactic Analysis

By analyzing and ensuring proper syntax, NLP systems can better understand and generate human language. This analysis helps in applications such as machine translation, sentiment analysis, and information retrieval by providing a clear structure and reducing ambiguity.

1.2.3 Third Phase of NLP: Semantic Analysis

Semantic Analysis is the third phase of Natural Language Processing (NLP), focusing on extracting the meaning of text. Unlike syntactic analysis, which deals with grammatical structure, semantic analysis is concerned with the literal and contextual meaning of words, phrases, and sentences.

Semantic analysis aims to understand the dictionary definitions of words and their usage in context. It determines whether the arrangement of words in a sentence makes logical sense, helping to establish context and logic by ensuring the semantic coherence of sentences.

Key Tasks in Semantic Analysis

1.
Named Entity Recognition (NER): NER identifies and classifies entities within the text, such as names of people, places, and organizations. These entities belong to predefined categories and are crucial for understanding the text's content.

2. Word Sense Disambiguation (WSD): WSD determines the correct meaning of ambiguous words based on context. For example, the word "bank" can refer to a financial institution or to the side of a river. WSD uses contextual clues to assign the appropriate meaning.

Examples of Semantic Analysis

Consider the following examples:

 Syntactically correct but semantically incorrect: "Apple eats a John."
  o This sentence is grammatically correct but does not make sense semantically. An apple cannot eat a person, which highlights the importance of semantic analysis in ensuring logical coherence.
 Literal interpretation: "What time is it?"
  o This phrase is interpreted literally as someone asking for the current time, demonstrating how semantic analysis helps in understanding the intended meaning.

Importance of Semantic Analysis

Semantic analysis is essential for NLP applications such as machine translation, information retrieval, and question answering. By ensuring that sentences are not only grammatically correct but also meaningful, semantic analysis enhances the accuracy and relevance of NLP systems.

1.2.4 Fourth Phase of NLP: Discourse Integration

Discourse Integration is the fourth phase of Natural Language Processing (NLP). This phase deals with comprehending the relationship between the current sentence and earlier sentences or the larger context. Discourse integration is crucial for contextualizing text and understanding the overall message conveyed.

Role of Discourse Integration

Discourse integration examines how words, phrases, and sentences relate to each other within a larger context.
It assesses the impact a word or sentence has on the structure of a text and how the combination of sentences affects the overall meaning. This phase helps in understanding implicit references and the flow of information across sentences.

Importance of Contextualization

In conversations and texts, words and sentences often depend on preceding or following sentences for their meaning. Understanding the context behind them is essential for interpreting them accurately.

Examples of Discourse Integration

Consider the following examples:

 Contextual reference: "This is unfair!"
  o To understand what "this" refers to, we need to examine the preceding or following sentences. Without context, the statement's meaning remains unclear.
 Anaphora resolution: "Taylor went to the store to buy some groceries. She realized she forgot her wallet."
  o Here the pronoun "she" refers back to "Taylor" in the first sentence. Recognizing that "Taylor" is the antecedent of "she" is crucial for grasping the meaning.

Application of Discourse Integration

Discourse integration is vital for NLP applications such as machine translation, sentiment analysis, and conversational agents. By understanding the relationships and context within texts, NLP systems can provide more accurate and coherent responses.

1.2.5 Fifth Phase of NLP: Pragmatic Analysis

Pragmatic Analysis is the fifth and final phase of Natural Language Processing (NLP), focusing on interpreting the inferred meaning of a text beyond its literal content. Human language is often complex and layered with underlying assumptions, implications, and intentions that go beyond straightforward interpretation. This phase aims to grasp these deeper meanings in communication.
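The anaphora example above ("Taylor ... She") can be illustrated with a toy resolver. The most-recent-capitalized-word heuristic below is an assumption made purely for illustration; production systems use full coreference-resolution models rather than this rule.

```python
import re

# Pronouns we never treat as antecedent candidates.
PRONOUNS = {"she", "he", "it", "they", "her", "him", "them"}

def resolve_pronoun(text, pronoun):
    """Toy heuristic: link a pronoun to the nearest preceding capitalized word."""
    tokens = re.findall(r"[A-Za-z]+", text)
    # Find the first occurrence of the pronoun (case-insensitive).
    idx = next((i for i, t in enumerate(tokens) if t.lower() == pronoun.lower()), None)
    if idx is None:
        return None
    # Scan backwards for a capitalized, non-pronoun candidate.
    for t in reversed(tokens[:idx]):
        if t[0].isupper() and t.lower() not in PRONOUNS:
            return t
    return None

text = ("Taylor went to the store to buy some groceries. "
        "She realized she forgot her wallet.")
print(resolve_pronoun(text, "She"))  # -> Taylor
```

The heuristic happens to work here because "Taylor" is the only candidate; with multiple names or nested clauses it would fail, which is why discourse integration is a genuinely hard problem.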
Role of Pragmatic Analysis

Pragmatic analysis goes beyond the literal meanings examined in semantic analysis, aiming to understand what the writer or speaker truly intends to convey. In natural language, words and phrases can carry different meanings depending on context, tone, and the situation in which they are used.

Importance of Understanding Intentions

In human communication, people often do not say exactly what they mean. For instance, the word "Hello" can have various interpretations depending on the tone and context in which it is spoken: it could be a simple greeting, an expression of surprise, or even a signal of anger. Understanding the intended meaning behind words and sentences is therefore crucial.

Examples of Pragmatic Analysis

Consider the following examples:

 Contextual greeting: "Hello! What time is it?"
  o "Hello!" is more than just a greeting; it serves to establish contact.
  o "What time is it?" might be a straightforward request for the current time, but it could also imply concern about being late.
 Figurative expression: "I'm falling for you."
  o The word "falling" literally means dropping or collapsing, but in this context it means the speaker is expressing love for someone.

Application of Pragmatic Analysis

Pragmatic analysis is essential for applications like sentiment analysis, conversational AI, and advanced dialogue systems. By interpreting the deeper, inferred meanings of texts, NLP systems can understand human emotions, intentions, and subtleties in communication, leading to more accurate and human-like interactions.

Conclusion

The five phases of NLP (Lexical Analysis, Syntactic Analysis, Semantic Analysis, Discourse Integration, and Pragmatic Analysis) each play a critical role in enabling computers to process and understand human language.
By breaking text down into manageable parts and analyzing them in different ways, NLP systems can perform complex tasks such as machine translation, sentiment analysis, and information retrieval, enabling significant advances in human-computer interaction.

Chapter 2: Text Preprocessing

Natural Language Processing (NLP) has seen tremendous growth and development, becoming an integral part of applications ranging from chatbots to sentiment analysis. One of the foundational steps in NLP is text preprocessing, which involves cleaning and preparing raw text data for further analysis or model training. Proper text preprocessing can significantly affect the performance and accuracy of NLP models. This chapter covers the essential steps involved in text preprocessing for NLP tasks.

Text preprocessing is the first step in the NLP pipeline, and it can influence the quality of the final results. It is the process of bringing text into a form that is predictable and analyzable for a specific task, where a task is the combination of an approach and a domain. For example, extracting top keywords with TF-IDF (the approach) from tweets (the domain) is a task. The main objective of text preprocessing is to convert text into a form that machine learning algorithms can digest. In this chapter, we will apply text preprocessing to a corpus of toxic comments and categorize the comments by type of toxicity.

Why is Text Preprocessing Important?

Raw text data is often noisy and unstructured, containing inconsistencies such as typos, slang, abbreviations, and irrelevant information. Preprocessing helps in:

 Improving Data Quality: removing noise and irrelevant information ensures that the data fed into the model is clean and consistent.
 Enhancing Model Performance: well-preprocessed text leads to better feature extraction, improving the performance of NLP models.
 Reducing Complexity: simplifying the text data reduces computational complexity and makes models more efficient.

2.1 Text Preprocessing Techniques in NLP

There are different ways to preprocess input text. The techniques covered below are:

1. Document cleanup (noise removal) and binarization
2. Lowercasing
3. Regular expressions

2.1.1 Document Cleanup and Binarization

Traditionally, document images are scanned from pseudo-binary hard-copy paper manuscripts with a sheet-fed, flatbed, or mounted imaging device. In recent years, digital cameras have been used to capture documents instead of flatbed scanners, owing to the falling prices of these imaging devices. Most imaging devices, such as digital cameras, PDAs, and mobile phones, are now inexpensive and accessible to almost everyone, and they are quite capable of capturing documents. However, the resulting digitized document images often suffer from common problems such as poor lighting, degradation, skew, and border noise.

One of the most common problems in captured document images is uncontrolled lighting. Unlike a camera, a flatbed scanner captures the document with the material pressed tightly against a bed under well-controlled lights. Digitized document images also commonly suffer degradation in the course of photocopying, faxing, printing, and scanning. Degradation that seems negligible to the human eye can cause an abrupt decline in the accuracy of current optical character recognition (OCR) systems. Captured images have many inherent limitations, such as uneven illumination, color shift, and text blur.
Historical document images are subject to degradation due to poor storage and damage over time, and they are often hard even for a human to decipher. They frequently contain handwritten text written with an ink pen. Because of bad environments and long-term storage, pen strokes may fade or disappear and the paper itself may be damaged; bacterial growth can also leave spots. As a result, techniques for automatic document image analysis are in high demand.

Several challenges must be addressed when dealing with captured document images of poor quality:

1) Variable background intensity resulting from unfit storage and non-uniform illumination.
2) Very low local contrast caused by smudges, smears, and shadows introduced during capture.
3) Poor printing or writing quality.
4) Gray-scale changes in colored and highlighted areas.

Binarization is the first and most important step in document image analysis and recognition. It addresses these degradation problems by converting a color or grayscale image into a bi-level one: each pixel of the document image is classified as either white (background) or black (text).

Image binarization, also known as image thresholding, is a technique for creating a binary image from a grayscale or RGB image so that the image's foreground can be separated from its background. Image thresholding is one of the most fundamental ways to extract useful information from an image or to segment a region of interest. The simplest form of image thresholding can be expressed as follows.
If P(x, y) < Threshold:
    P(x, y) = 0
else:
    P(x, y) = 255

where P(x, y) is the pixel value at position (x, y).

As the rule above shows, the output image of thresholding/binarization consists only of the values 0 and 255, representing binary 0 and binary 1, respectively.

Thresholding/binarization techniques can be classified into two groups:

1) Global thresholding
2) Local thresholding

In global thresholding, a single threshold value is determined by considering the entire image and its global characteristics. In local thresholding, the image is divided into regions or segments, and an independent threshold value is determined for each one.

Image thresholding is used in many applications, and the technique chosen varies with the nature of the images involved. Text extraction from documents is one application where image thresholding plays a significant role.

2.1.2 Lowercasing

Lowercasing is the simplest text preprocessing technique: every token of the input text is converted to lowercase. It helps in dealing with sparsity issues in a dataset. For example, a text may contain mixed-case occurrences of the same token, with 'canada' in some places and 'Canada' in others. Lowercasing eliminates this variation so that it does not cause problems later, reducing both sparsity and vocabulary size.

Despite its effectiveness at reducing sparsity and vocabulary size, lowercasing can sometimes hurt a system's performance by increasing ambiguity. For example, in 'Apple is the best company for smartphones', lowercasing transforms 'Apple' into 'apple', and the model can no longer tell whether 'apple' refers to the company or the fruit; there is a good chance it will interpret apple as a fruit.
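The effect of lowercasing on vocabulary size can be sketched directly; the sample sentence below is made up for illustration.

```python
# Mixed-case duplicates of "Canada" inflate the vocabulary; lowercasing
# collapses them into a single entry, reducing sparsity.
text = "Canada is large. I love canada. CANADA has lakes."
tokens = [t.strip(".") for t in text.split()]

vocab_before = set(tokens)            # 'Canada', 'canada', 'CANADA' are distinct
vocab_after = {t.lower() for t in tokens}  # all three merge into 'canada'

print(len(vocab_before))  # -> 9
print(len(vocab_after))   # -> 7
```

The three spellings of "Canada" count as three vocabulary entries before lowercasing and only one after, which is exactly the sparsity reduction described above.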
In the given dataset, we perform lowercasing after tokenization, converting all tokens to lowercase.

2.1.3 Regular Expressions

What are Regular Expressions?

 Regular expressions (regex) are a powerful tool in text preprocessing for Natural Language Processing (NLP). They allow efficient and flexible pattern matching and text manipulation.
 A regular expression, or RegEx, is a sequence of characters used to find or replace patterns in text. In simple terms, a regular expression is a set of characters, or a pattern, used to find substrings in a given string.
 A regular expression (RE) is a language for specifying text search strings. It helps us match or extract strings, or sets of strings, using a specialized pattern syntax.

Examples include extracting all hashtags from a tweet, or pulling email addresses and phone numbers out of large unstructured text. Sometimes we also want to identify the different components of an email address.

Simply put, a regular expression is an "instruction" given to a function describing what to match, search, or replace in a set of strings. Regular expressions are used in various tasks, such as:

 Data preprocessing,
 Rule-based information mining systems,
 Pattern matching,
 Text feature engineering,
 Web scraping,
 Data extraction, etc.

How do Regular Expressions Work?
Let's understand how regular expressions work with the help of an example.

Consider the following list of students at a school:

Names: Sunil, Shyam, Ankit, Surjeet, Sumit, Subhi, Surbhi, Siddharth, Sujan

Our goal is to select only those names that match a certain pattern, such as:

S u _ _ _

that is, names whose first two letters are S and u, followed by exactly three positions that can each be filled by any letter. Which names from the list fit this criterion? Going through them one by one, Sunil, Sumit, and Sujan fit: each begins with S and u and has exactly three more letters after that. The remaining names do not follow the criterion, so the extracted list is:

Extracted Names: Sunil, Sumit, Sujan

What we have done here is take a pattern and a list of student names and find the names that match the pattern. That is exactly how regular expressions work: RegEx provides different kinds of patterns for recognizing different strings of characters.

Properties of Regular Expressions

Some important properties of regular expressions are as follows:

 The regular expression language was formalized by the American mathematician Stephen Cole Kleene.
 A regular expression (RE) is a formula in a special language that specifies simple classes of strings, i.e., sequences of symbols. In simple terms, a regular expression is an algebraic notation for characterizing a set of strings.
 A regular expression requires two things: the pattern we want to search for, and a corpus of text (or a string) in which to search for it.

Mathematically, we can define the concept of a regular expression in the following manner:

1.
ε is a regular expression, denoting the language containing only the empty string.
2. φ is a regular expression, denoting the empty language.
3. If X and Y are regular expressions, then the following are also regular expressions:
 X, Y
 X.Y (concatenation of X and Y)
 X+Y (union of X and Y)
 X*, Y* (Kleene closure of X and Y)
4. Any string derived from the above rules is also a regular expression.

How can Regular Expressions be used in NLP?

In NLP, regular expressions can be used in many places, such as:

1. Validating data fields, for example dates, email addresses, URLs, and abbreviations.
2. Filtering particular text from a corpus, for example spam or disallowed websites.
3. Identifying particular strings in a text, for example token boundaries.
4. Converting the output of one processing component into the format required by another component.

The Concept of Raw Strings in Regular Expressions

Let's now discuss a crucial concept you must know when studying regular expressions. From prior knowledge of Python, you may know that a raw string treats the backslash (\) as a literal character. To see why this matters, consider a normal (non-raw) string that contains backslashes: Python treats the sequence \n as "move to a new line". So if a path contains the segment "\nayan", the \n is interpreted as a newline escape: "nayan" becomes "ayan", the n disappears from the path, and the text after it moves to a new line. This is not what we want. To resolve this issue, we prefix the string with "r" to create a raw string; with "r" in front of the path, the entire path prints out correctly.
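Since the original screenshots are not reproduced in this text, the following sketch (with a made-up path containing the segment "\nayan") shows the behaviour.

```python
# In a normal string, "\n" is parsed as a newline escape: the "n" of
# "nayan" disappears and the rest moves to a new line.
normal = "C:\nayan"
# The r prefix keeps the backslash literal, preserving the path.
raw = r"C:\nayan"

print(repr(normal))  # the string contains an actual newline character
print(repr(raw))     # the string contains a literal backslash and "n"
```

This is why raw strings are preferred when writing regex patterns: backslash-heavy patterns like \d or \w stay exactly as typed.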
Therefore, it is always recommended to use raw strings rather than normal strings when working with regular expressions.

Common Regex Functions used in NLP

To work with regular expressions, Python has a built-in module called "re". Some common functions from this module are:

 re.search()
 re.match()
 re.sub()
 re.compile()
 re.findall()

Let us look at these functions with the help of examples.

re.search()

This function detects whether a given regular expression pattern is present in a given input string. It matches the first occurrence of the pattern anywhere in the string, not just at the beginning. It returns a match object if the pattern is found in the string; otherwise it returns None.

Syntax: re.search(pattern, string)

Within the search function, we can use flags to perform specific operations. For example:

re.I: ignore the case (uppercase or lowercase) of the text
re.M: search the string across multiple lines

re.search(pattern, string, flags=re.I | re.M)

As an example, we can search for the pattern "founder" anywhere in a given string.

re.match()

This function matches only if the pattern is present at the very start of the string.

Syntax: re.match(pattern, string)

As an example, consider matching a word at the beginning of a string, with pattern 'Analytics' and string 'Analytics Vidhya is the largest data Scientists community'. Since the required pattern is present at the beginning of the string, we get a match object as output. Because re.match returns a match object, we use the object's group() function to retrieve the matched text.
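The screenshots for these examples are not reproduced in this text; a minimal sketch of re.search and re.match, reusing the 'Analytics Vidhya' string from the example above, is:

```python
import re

string = "Analytics Vidhya is the largest data Scientists community"

# re.search finds the first occurrence of the pattern anywhere in the string.
found = re.search("data", string)
print(found is not None)  # -> True

# re.match succeeds because 'Analytics' sits at the very start of the string.
m = re.match("Analytics", string)
print(m.group())  # -> Analytics
```

Note that re.search("data", ...) succeeds even though "data" appears in the middle of the string, while re.match is anchored to position 0.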
Now let us look at another example, which presents the second case: when the pattern ('Scientists') is not present at the beginning of the string, re.match returns None as its output.

re.sub()

This function is used to substitute a substring with another substring.

Syntax: re.sub(pattern, replacement, input_text)

For example, we can use re.sub to replace placeholder substrings such as xxx and yyy.
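A minimal sketch of re.sub follows; the template string and the replacement values are made up around the placeholders xxx and yyy.

```python
import re

# Replace each placeholder with an illustrative value.
template = "Dear xxx, your order yyy has shipped."
result = re.sub("xxx", "Alice", template)
result = re.sub("yyy", "A-1001", result)
print(result)  # -> Dear Alice, your order A-1001 has shipped.
```

re.sub replaces every match of the pattern by default; an optional count argument limits the number of substitutions.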
