Text Analysis, Unstructured Data, Sentiment Analysis PDF

Summary

This document provides an introduction to text analysis, unstructured data, and sentiment analysis, covering key concepts, classifications, and techniques used in automated text processing. It covers text classification and text extraction, gives examples of how these methods are applied, and is useful for anyone studying natural language processing.

Full Transcript

Module 2 & 3: Analyzing Social Media Data
Introduction to unstructured data; text analysis; text cleaning and processing; sentiment analysis.

Introduction to Unstructured Data

What is unstructured data? Unstructured data is data that does not conform to a data model and has no easily identifiable structure, so it cannot be used by a computer program easily. Because it is not organized in a pre-defined manner and has no pre-defined data model, it is not a good fit for a mainstream relational database.

Characteristics of unstructured data:
- It neither conforms to a data model nor has any inherent structure.
- It cannot be stored in rows and columns as in a relational database.
- It does not follow any semantics or rules.
- It lacks a particular format or sequence.
- It has no easily identifiable structure.
- Because of this lack of structure, computer programs cannot process it easily.

Sources of unstructured data:
- Web pages
- Images (JPEG, GIF, PNG, etc.)
- Videos
- Memos
- Reports
- Word documents and PowerPoint presentations
- Surveys

Advantages of unstructured data:
- It supports data that lacks a proper format or sequence.
- The data is not constrained by a fixed schema, which makes it very flexible.
- The data is portable and highly scalable.
- It copes easily with heterogeneous sources.
- It supports a wide variety of business intelligence and analytics applications.

Disadvantages of unstructured data:
- It is difficult to store and manage because there is no schema or structure.
- Indexing is difficult and error-prone because the structure is unclear and there are no pre-defined attributes, so search results are not very accurate.
- Ensuring the security of the data is a difficult task.

Problems faced in storing unstructured data:
- It requires a lot of storage space.
- Videos, images, audio, and similar content are difficult to store.
- Because of the unclear structure, operations such as update, delete, and search are very difficult.
- Storage cost is high compared with structured data.
- Indexing unstructured data is difficult.

Text Analysis

If you receive huge amounts of unstructured data in the form of text (emails, social media conversations, chats), you are probably aware of the challenges that come with analyzing it. Manually processing and organizing text data takes time, it is tedious and inaccurate, and it can be expensive if you need to hire extra staff to sort through the text.

Text analysis (TA) is a machine learning technique used to automatically extract valuable insights from unstructured text data. Companies use text analysis tools to quickly digest online data and documents and transform them into actionable insights. You can use text analysis to extract specific information, such as keywords, names, or company details, from thousands of emails, or to categorize survey responses by sentiment and topic.

Text Analysis vs. Text Mining vs. Text Analytics

Text mining and text analysis are often used interchangeably to describe the same process of obtaining data through statistical pattern learning; to avoid confusion, these notes stick to the term "text analysis". The distinction that matters is with text analytics: text analysis delivers qualitative results, while text analytics delivers quantitative results. When a machine performs text analysis, it identifies important information within a text itself; when it performs text analytics, it reveals patterns across thousands of texts, producing graphs, reports, tables, and so on.
Different Types of Text Analysis
- Text classification
- Text extraction
- Word frequency
- Collocation
- Concordance
- Word sense disambiguation
- Clustering

Text Classification

Text classification is the process of assigning predefined tags or categories to unstructured text. It is considered one of the most useful natural language processing techniques because it is so versatile: it can organize, structure, and categorize pretty much any form of text to deliver meaningful data and solve problems. Natural language processing (NLP) is a machine learning technique that allows computers to break down and understand text much as a human would. Common text classification tasks include sentiment analysis, topic analysis, and intent detection; a minimal classification sketch is given after the Word Frequency subsection below.

Sentiment Analysis
Customers freely leave their opinions about businesses and products in customer service interactions, in surveys, and all over the internet. Sentiment analysis uses machine learning algorithms to automatically read text and classify it by opinion polarity (positive, negative, neutral) and beyond, into the feelings and emotions of the writer, and even context and sarcasm.

Topic Analysis
Another typical example of text classification is topic analysis (or topic modelling), which automatically organizes text by subject or theme. For example, given the feedback "The app is really simple and easy to use" and the topic categories Pricing, Customer Support, and Ease of Use, this piece of product feedback would be classified under Ease of Use.

Intent Detection
Text classifiers can also be used to detect the intent of a text. Intent detection (or intent classification) is often used to automatically understand the reason behind customer feedback. Is it a complaint? Or is the customer writing with the intent to purchase a product? Machine learning can read chatbot conversations or emails and automatically route them to the proper department or employee.

Text Extraction

Text extraction is another widely used text analysis technique that pulls out pieces of data that already exist within a given text. You can extract things like keywords, prices, company names, and product specifications from news reports, product reviews, and more. You can automatically populate spreadsheets with this data, or perform extraction together with other text analysis techniques to categorize and extract data at the same time.

Keyword Extraction
Keywords are the most used and most relevant terms within a text: words and phrases that summarize its contents.

Entity Recognition
A named entity recognition (NER) extractor finds entities, such as people, companies, or locations, mentioned in text data.

Word Frequency

Word frequency is a text analysis technique that measures the most frequently occurring words or concepts in a given text using a numerical statistic, such as a raw count or TF-IDF (term frequency-inverse document frequency). You might apply this technique to find the words or expressions customers use most frequently in support conversations. For example, if the word "delivery" appears most often in a set of negative support tickets, this might suggest that customers are unhappy with your delivery service.
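As a minimal illustration of the word-frequency idea just described, the sketch below counts words across a few invented support tickets using only the Python standard library; NLTK-based preprocessing (tokenization, stop word removal) is covered later in these notes.

```python
# Quick word-frequency count over a few invented support tickets,
# using only the Python standard library.
from collections import Counter

tickets = [
    "my delivery is late and nobody replies",
    "late delivery again, very disappointed",
    "the delivery arrived damaged",
]

# Count every whitespace-separated word across all tickets.
counts = Counter(word for ticket in tickets for word in ticket.split())
print(counts.most_common(3))  # [('delivery', 3), ('late', 2), ...]
```

And to make the text classification idea above concrete, here is a toy sentiment classifier. The notes do not prescribe a library, so this sketch assumes scikit-learn is available; the labelled examples are invented and far too few for a real model.

```python
# Toy sentiment classifier: bag-of-words features + Naive Bayes.
# Assumes scikit-learn is installed; the labelled examples are invented
# for illustration and far too few for a real model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "The app is really simple and easy to use",
    "Great support, my issue was solved quickly",
    "Delivery was late and nobody answered my emails",
    "Terrible experience, the product arrived broken",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Classify a new, unseen piece of feedback.
print(model.predict(["Checkout was quick and easy"]))  # e.g. ['positive']
```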
Collocation

Collocation helps identify words that commonly co-occur. For example, in customer reviews on a hotel booking website, the words "air" and "conditioning" are more likely to co-occur than to appear individually. Bigrams (two adjacent words, e.g. "air conditioning" or "customer support") and trigrams (three adjacent words, e.g. "out of office" or "to be continued") are the most common types of collocation to look out for. Collocation can help identify hidden semantic structures and improve the granularity of insights by counting bigrams and trigrams as a single unit.

Concordance

Concordance helps identify the context and instances of a word or set of words. For example, running a concordance of the word "simple" over a set of app reviews lists every fragment of text in which "simple" appears, giving a quick grasp of how reviewers use the word. Concordance can also be used to resolve some of the ambiguity of human language by looking at how words are used in different contexts, and it supports the analysis of more complex phrases.

Word Sense Disambiguation

It is very common for a word to have more than one meaning, which is why word sense disambiguation is a major challenge in natural language processing. Take the word "light": is the text referring to weight, color, or an electrical appliance? Smart text analysis with word sense disambiguation can differentiate between the meanings of such words, but only after models have been trained to do so.

Clustering

Text clustering groups vast quantities of unstructured data by similarity. Although less accurate than classification algorithms, clustering algorithms are faster to implement because you do not need to tag examples to train models: they mine information and make predictions without training data, which is known as unsupervised machine learning.

Clustering example: Google search is a good illustration of how clustering works. When you search for a term, Google's algorithm breaks down unstructured data from web pages and groups pages into clusters around sets of similar words or n-grams (all possible combinations of adjacent words or letters in a text). Pages from the cluster that contain a higher count of words or n-grams relevant to the search query then appear first in the results.
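To make collocation, concordance, and word sense disambiguation concrete, here is a minimal sketch using NLTK, the toolkit these notes turn to later for preprocessing. The review text is invented, and the one-time NLTK data downloads ('punkt', 'wordnet') are assumed.

```python
# Collocation, concordance, and word sense disambiguation with NLTK.
# Requires one-time downloads: nltk.download('punkt'), nltk.download('wordnet').
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

reviews = (
    "The air conditioning was broken. Customer support promised to fix the "
    "air conditioning but customer support never called back. The room was "
    "simple and clean, the booking process was simple too."
)
tokens = word_tokenize(reviews.lower())

# Collocation: find word pairs that co-occur more often than chance.
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
print(finder.nbest(bigram_measures.likelihood_ratio, 3))
# e.g. [('air', 'conditioning'), ('customer', 'support'), ...]

# Concordance: print every occurrence of "simple" with its surrounding context.
nltk.Text(tokens).concordance("simple", width=60)

# Word sense disambiguation: pick a WordNet sense of "light" from its context.
sense = lesk(word_tokenize("Turn off the light when you leave"), "light")
print(sense, "-", sense.definition() if sense else "no sense found")
```

And a minimal clustering sketch. The notes do not name a library, so this assumes scikit-learn is available; the four toy documents are invented, and interpreting what each cluster means is left to the reader.

```python
# Clustering sketch: group short texts by similarity without any labels.
# Assumes scikit-learn is installed; the documents are invented.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "cheap flights to london",
    "book a flight to paris",
    "best pizza recipe at home",
    "easy homemade pizza dough recipe",
]

# Represent each document as a TF-IDF vector of its word n-grams.
vectors = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# Group the vectors into two clusters (flights vs. recipes, in this toy case).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)  # e.g. [0 0 1 1] -- cluster ids, not human labels
```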
How Does Text Analysis Work?

Text analysis can be applied at several levels of granularity, depending on the results you need:
- Whole documents: obtain information from a complete document or paragraph, e.g. the overall sentiment of a customer review.
- Single sentences: obtain information from specific sentences, e.g. a more detailed sentiment for every sentence of a customer review.
- Sub-sentences: obtain information from sub-expressions within a sentence, e.g. the underlying sentiment of every opinion unit in a customer review.

Steps in Text Analysis
1. Data gathering (scraping)
2. Data preparation (preprocessing)
3. Analyzing the text data

1. Data Gathering

You can gather data about your brand, product, or service from both internal and external sources.

Internal Data
This is the data you generate every day, from emails and chats to surveys, customer queries, and customer support tickets. You just need to export it from your software or platform as a CSV or Excel file, or connect to an API to retrieve it directly. Some examples of internal data sources:
- Customer service software: the software you use to communicate with customers, manage user queries, and deal with support issues. Zendesk, Freshdesk, and Help Scout are a few examples.
- CRM: software that keeps track of all interactions with clients or potential clients, spanning areas from customer support to sales and marketing. Hubspot, Salesforce, and Pipedrive are examples of CRMs.
- Chat: apps used to communicate with your team or your customers, such as Slack, Hipchat, Intercom, and Drift.
- Email: still the most popular tool for managing conversations with customers and team members.
- Surveys: generally used to gather customer service feedback, product feedback, or market research, e.g. Typeform, Google Forms, and SurveyMonkey.
- NPS (Net Promoter Score): one of the most popular customer experience metrics. Many companies use NPS tracking software such as Delighted, Promoter.io, and Satismeter to collect and analyze customer feedback.
- Databases: a database is a collection of information. Using a database management system, a company can store, manage, and analyze all sorts of data. Examples include Postgres, MongoDB, and MySQL.
- Product analytics: feedback and information about customer interactions with your product or service, useful for understanding the customer journey and making data-driven decisions. ProductBoard and UserVoice are two tools for processing product analytics.

External Data
This is text data about your brand or products from all over the web. You can use web scraping tools, APIs, and open datasets to collect external data from social media, news reports, online reviews, forums, and more, and analyze it with machine learning models.
- Web scraping tools: visual web scraping tools let you build your own scraper even with no coding experience, while seasoned coders can use frameworks such as Scrapy in Python and Wombat in Ruby to create custom scrapers.
- APIs: Facebook, Twitter, and Instagram, for example, have their own APIs that allow you to extract data from their platforms. Major media outlets such as the New York Times and The Guardian also provide APIs you can use to search their archives or gather users' comments, among other things.
- Integrations: SaaS tools offer integrations with the tools you already use. You can connect directly to Twitter, Google Sheets, Gmail, Zendesk, SurveyMonkey, Rapidminer, and more, and perform text analysis on Excel data by uploading a file.

2. Data Preparation

To analyze text automatically with machine learning, you first need to organize and preprocess your data. Common text analysis operations using NLTK:
- Tokenization
- Stop word removal
- Lexicon normalization (stemming and lemmatization)
- POS tagging
- Sentiment analysis and text classification (performing sentiment analysis via text classification)

Tokenization
Tokenization is the first step in text analytics. It is the process of breaking a text paragraph down into smaller chunks, such as words or sentences. A token is a single entity that acts as a building block of a sentence or paragraph.

Remove Stop Words
Stop words are considered noise in the text. Text may contain stop words such as "is", "am", "are", "this", "a", "an", "the", and so on. To remove stop words with NLTK, you take its list of stop words for the language and filter them out of your list of tokens.
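A minimal sketch of these first preprocessing steps with NLTK follows; the example sentence is invented, and the one-time NLTK data downloads are noted in the comments. POS tagging is included because it is listed among the operations above, even though the notes do not elaborate on it.

```python
# Tokenization, stop word removal, and POS tagging with NLTK.
# One-time downloads: nltk.download('punkt'), nltk.download('stopwords'),
# nltk.download('averaged_perceptron_tagger').
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

text = "The app is really simple and easy to use. This is not hard at all."

# Tokenization: split the paragraph into sentences, and the text into words.
sentences = sent_tokenize(text)
tokens = word_tokenize(text)
print(sentences)
print(tokens)

# Stop word removal: filter out low-information words such as 'is', 'the', 'a'.
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(filtered)  # e.g. ['app', 'really', 'simple', 'easy', 'use', 'hard']

# POS tagging: label each remaining token with its part of speech,
# e.g. ('app', 'NN'), ('simple', 'JJ').
print(nltk.pos_tag(filtered))
```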
Lexicon Normalization
Lexicon normalization addresses another type of noise in text: it reduces derivationally related forms of a word to a common root word.

Stemming
Stemming is a process of linguistic normalization that reduces words to their root form by chopping off derivational affixes. For example, "connection", "connected", and "connecting" all reduce to the common root "connect".

Stemming and Lemmatization
Stemming and lemmatization both refer to removing the affixes (suffixes, prefixes, etc.) attached to a word in order to keep its lexical base, known as the root or stem, or its dictionary form, the lemma. The main difference between the two processes is that stemming is usually based on rules that trim word beginnings and endings (which sometimes produces odd-looking results), whereas lemmatization uses dictionaries and a much more thorough morphological analysis. For example, on the sentence "Analyzing text is not that hard", NLTK's Snowball stemmer typically trims "Analyzing" to "analyz", while spaCy's lemmatizer returns dictionary forms, mapping "Analyzing" to "analyze" and "is" to "be" (a sketch reproducing this comparison is given below).

Stop Word Removal
To provide a more accurate automated analysis of text, we need to remove the words that carry very little semantic information or no meaning at all. These words are known as stop words: "a", "and", "or", "the", and so on. There are many different stop word lists for every language, but you might need to add words to, or remove words from, those lists depending on the texts you want to analyze and the analyses you plan to perform. Some lexical analysis of the domain your texts come from can help determine which words should be added to the stop word list.

Analyze Your Text Data
Once the data is prepared, it can be explored and visualized, for example with a word cloud or a frequency plot.
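The sketch below reproduces the stemmer/lemmatizer comparison described above, using NLTK's Snowball stemmer and spaCy's lemmatizer as the notes do. It assumes the NLTK 'punkt' data and the spaCy model en_core_web_sm have been downloaded; exact outputs can vary slightly between library versions.

```python
# Stemming vs. lemmatization on the sentence used in the notes.
# Requires: nltk.download('punkt') and `python -m spacy download en_core_web_sm`.
import spacy
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

sentence = "Analyzing text is not that hard"

# Rule-based stemming trims affixes, e.g. 'Analyzing' -> 'analyz'.
stemmer = SnowballStemmer("english")
print([stemmer.stem(tok) for tok in word_tokenize(sentence)])

# Dictionary-based lemmatization returns lemmas, e.g. 'is' -> 'be'.
nlp = spacy.load("en_core_web_sm")
print([tok.lemma_ for tok in nlp(sentence)])
```

And a sketch of the two visualizations mentioned last, a frequency plot and a word cloud. It assumes the third-party wordcloud package and matplotlib are installed, along with the NLTK data used earlier; the sample text is invented.

```python
# Visualizing text data: frequency plot and word cloud.
# Assumes `wordcloud` and `matplotlib` are installed, plus NLTK's
# 'punkt' and 'stopwords' data.
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

text = (
    "Late delivery again. Delivery support never answers. "
    "Great app, simple and easy to use, but delivery is slow."
)

stop_words = set(stopwords.words("english"))
tokens = [t.lower() for t in word_tokenize(text)
          if t.isalpha() and t.lower() not in stop_words]

# Frequency plot: NLTK's FreqDist can plot the most common tokens directly.
freq = nltk.FreqDist(tokens)
freq.plot(10)

# Word cloud: sizes each word by how often it occurs.
cloud = WordCloud(width=600, height=400, background_color="white")
cloud.generate(" ".join(tokens))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```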
