Big Data Analytics PDF
Document Details
Uploaded by PrincipledStar
Summary
This document provides an overview of big data analytics, focusing on the comparison between machine learning and data mining. It describes various machine learning applications and types, including supervised, unsupervised, and reinforcement learning, with examples such as spam detection, fraud detection, and self-driving cars. The document also highlights when to use machine learning for a specific business problem.
Full Transcript
Big data analytics

Machine Learning vs. Data Mining
- There is no common agreement
- Machine learning focuses on designing algorithms that can learn from historical data and make predictions
- Data mining is a cross-disciplinary field that aims at discovering properties (useful information) of data sets
- Machine learning can be used for data mining
[Figure: diagram labelled Data Mining, Machine Learning, Computer Science, Mathematics]

Machine learning example applications
- Self-Driving
- Spam Detection
- Fraud Detection
- Voice Recognition
- Face Recognition
- Anomaly Detection
- Sales Forecast
- Robotics

What is Machine Learning?
- Machine learning is the science of getting computers to act without being explicitly programmed
- Machine learning is a technique of data science that helps computers learn from existing data in order to forecast future behaviors, outcomes, and trends

How machine learning works
https://www.digitalpulse.pwc.com.au

An example of a machine learning task
Let's say that you are in the car rental business. How can you accurately predict demand for different types of cars at different times?
https://docs.microsoft.com/en-us/azure/machine-learning/studio/basics-infographic-with-algorithm-examples

Difference between traditional approach and machine learning approach
Rule-based approach:
- Explicitly programmed to solve problems
- Decision rules are clearly defined by humans
Machine learning approach:
- Trained (i.e., learned) from examples
- Decision rules are complex and fuzzy
- Rules are not defined by humans but learned by machines from data

Summary
- Machine learning uses historical data to make predictions
- It is similar to data mining, but whereas data mining is the science of discovering unknown patterns and relationships in data, machine learning applies previously inferred knowledge to new data to make decisions in real-life applications
- Computers approximate complex functions from historical data
- Rules are not explicitly programmed but learned from data
https://bit.ly/2NS7v9J

How to Start?
Understand the business
Tasks include:
- Identify your business goals
- Assess your situation
- Define your data mining goals
- Produce your project plan
https://mineracaodedados.files.wordpress.com/2012/04/the-crisp-dm-model-the-new-blueprint-for-data-mining-shearer-colin.pdf

How to start?
Ask a question you can answer with data
- Sharp questions can be answered with a name or a number
  - What will my stock's sale price be next week?
  - Which car in my fleet is going to fail first?
- Vague questions cannot be answered with a name or a number
  - How can I increase my profits?
  - What can my data tell me about my business?
- How you ask a question is a clue to which algorithm can give you an answer

The 5 questions data science can answer
Surprisingly, there are only five questions data science can answer:
- Is this A or B?
- Is this weird?
- How much or how many?
- How is this organized?
- What should I do now?
https://azure.microsoft.com/en-us/resources/videos/data-science-for-beginners-series-the-5-questions-data-science-answers/

Q1: Is this A or B?
Use Classification algorithms
- Will this tire fail in the next 1000 miles? Yes or No?
- Which brings in more customers? A $5 coupon or a 25% discount?
https://azure.microsoft.com/en-us/resources/videos/data-science-for-beginners-series-the-5-questions-data-science-answers/

Q2: Is this weird?
Use Anomaly Detection algorithms
- Your credit card company analyzes your purchase pattern so that it can alert you to possible fraud
- Charges that are "weird" might be a purchase at a store where you do not normally shop, or the purchase of an unusually pricey item
https://azure.microsoft.com/en-us/resources/videos/data-science-for-beginners-series-the-5-questions-data-science-answers/

Q3: How much? or How many?
Use Regression algorithms
Regression algorithms make numerical predictions, such as:
- What will the temperature be next Tuesday?
- What will my fourth-quarter sales be?
They help answer any question that asks for a number
https://azure.microsoft.com/en-us/resources/videos/data-science-for-beginners-series-the-5-questions-data-science-answers/

Q4: How is this organized?
Use Clustering algorithms
Common examples of clustering questions are:
- Which viewers like the same types of movies?
- Which printer models fail the same way?
Sometimes you want to understand the structure of a data set
https://azure.microsoft.com/en-us/resources/videos/data-science-for-beginners-series-the-5-questions-data-science-answers/

Q5: What should I do now?
Use Reinforcement Learning algorithms
The questions it answers are always about what action should be taken, usually by a machine or robot, e.g.:
- For a self-driving car: at a yellow light, brake or accelerate?
- For a robot vacuum: keep vacuuming, or go back to the charging station?
https://azure.microsoft.com/en-us/resources/videos/data-science-for-beginners-series-the-5-questions-data-science-answers/

So, what do you want to find out?

When to use machine learning?
From business problem to machine learning problem: a recipe
1. Do you need machine learning?
2. Can you formulate your problem clearly?
3. Do you have sufficient examples?
4. Does your problem have a regular pattern?
5. Can you find meaningful representations of your data?
6. How do you define success?
https://open.sap.com/

When to use machine learning?
From business problem to machine learning problem: a recipe
1. Do you need machine learning?
- Do you need to automate the task?
- High-volume tasks with complex rules and unstructured data are good candidates
- Example: sentiment analysis
  - High volume of reviews on the Web
  - Unstructured text
  - Human language is complex and ambiguous
https://open.sap.com/

When to use machine learning?
From business problem to machine learning problem: a recipe
2. Can you formulate your problem clearly?
- What do you want to predict given which input?
- Pattern: "given X, predict Y"
  - What is the input?
  - What is the output?
- Example: sentiment analysis
  - Given a customer review, predict its sentiment
  - Input: customer review text
  - Output: positive, negative, neutral
https://open.sap.com/

When to use machine learning?
From business problem to machine learning problem: a recipe
3. Do you have sufficient examples?
- Machine learning always requires data
- Generally, the more data, the better
- Each example must contain two parts (supervised learning)
  - Features: attributes of the example
  - Label: the answer you want to predict
- Example: sentiment analysis
  - Thousands of customer reviews and ratings from the Web
https://open.sap.com/

When to use machine learning?
From business problem to machine learning problem: a recipe
4. Does your problem have a regular pattern?
- Machine learning learns regularities and patterns
- It is hard to learn patterns that are rare or irregular
- Example: sentiment analysis
  - Positive words like "good", "awesome", or "love it" appear more often in highly rated reviews
  - Negative words like "bad", "lousy", or "disappointed" appear more often in poorly rated reviews
https://open.sap.com/

When to use machine learning?
From business problem to machine learning problem: a recipe
5. Can you find meaningful representations of your data?
- Machine learning algorithms ultimately operate on numbers
- Generally, examples are represented as feature vectors
- Good features often determine the success of machine learning
- Example: sentiment analysis
  - Represent each customer review as a vector of word frequencies
  - Label is positive (4-5 stars), negative (1-2 stars), or neutral (3 stars)
https://open.sap.com/

When to use machine learning?
From business problem to machine learning problem: a recipe
6. How do you define success?
- Machine learning optimizes a training criterion
- The evaluation function has to support the business goals
- Example: sentiment analysis
  - Accuracy: percentage of correctly predicted labels
https://open.sap.com/

Summary
Consider using machine learning when you have a complex task or problem involving a large amount of data and lots of variables, but no existing formula or equation. For example, machine learning is a good option if you need to handle situations like these:
https://www.mathworks.com/discovery/machine-learning.html

Machine Learning Types
- Supervised learning
  - Classification
  - Regression/Forecasting (prediction)
- Unsupervised learning
  - Clustering
- Semi-supervised learning
- Reinforcement learning

Growth of Machine Learning
Machine learning is the preferred approach to:
- Speech recognition, natural language processing
- Computer vision
- Medical outcomes analysis
- Robot control
- Computational biology
This trend is accelerating:
- Improved machine learning algorithms
- Improved data capture, networking, faster computers
- Software too complex to write by hand
- New sensors / IO devices
- Demand for self-customization to user, environment

Supervised learning
As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train the machine using a "labelled" dataset, and based on that training, the machine predicts the output. Here, labelled data means that some of the inputs are already mapped to the output. More precisely: first, we train the machine with the inputs and their corresponding outputs, and then we ask the machine to predict the output for a test dataset. Let's understand supervised learning with an example.

Supervised learning
Suppose we have an input dataset of cat and dog images.
So, first, we train the machine to understand the images using features such as the shape and size of the tail of a cat and a dog, the shape of the eyes, colour, height (dogs are taller, cats are smaller), etc. After training is complete, we input the picture of a cat and ask the machine to identify the object and predict the output. Now the machine is well trained, so it will check all the features of the object, such as height, shape, colour, eyes, ears, tail, etc., and find that it is a cat. So, it will put it in the Cat category. This is how the machine identifies objects in supervised learning.

Supervised learning
The main goal of the supervised learning technique is to map the input variable (x) to the output variable (y). Some real-world applications of supervised learning are risk assessment, fraud detection, spam filtering, etc.

Supervised learning
- Classification: In classification tasks, the machine learning program must draw a conclusion from observed values and determine to what category new observations belong. For example, when filtering emails as "spam" or "not spam", the program must look at existing observational data and filter the emails accordingly.
- Regression: In regression tasks, the machine learning program must estimate, and understand, the relationships among variables. Regression analysis focuses on one dependent variable and a series of other changing variables, making it particularly useful for prediction and forecasting.
- Forecasting: Forecasting is the process of making predictions about the future based on past and present data, and is commonly used to analyze trends.

Unsupervised Learning
As its name suggests, there is no need for supervision. In unsupervised machine learning, the machine is trained using an unlabeled dataset, and the machine predicts the output without any supervision.
The main aim of an unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset. Let's take an example to understand it more precisely.

Unsupervised Learning
Suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects. So, the machine will discover patterns and differences, such as differences in colour and shape, and predict the output when it is tested with the test dataset.
Clustering algorithm

Semi-Supervised Learning
Semi-supervised learning is a type of machine learning algorithm that lies between supervised and unsupervised machine learning. It represents the intermediate ground between supervised learning (with labelled training data) and unsupervised learning (with no labelled training data), and uses a combination of labelled and unlabeled datasets during the training period. Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that has a few labels, that data is mostly unlabeled. Labels are costly, so in practice an organization may have only a few of them. It differs from supervised and unsupervised learning, which are defined by the presence and absence of labels respectively.

Semi-Supervised Learning
The main aim of semi-supervised learning is to effectively use all the available data, rather than only the labelled data as in supervised learning. Typically, the data contains a very small amount of labelled data and a very large amount of unlabeled data. Initially, similar data is clustered with an unsupervised learning algorithm, and the clusters then help to turn the unlabeled data into labelled data.
This is because labelled data is comparatively more expensive to acquire than unlabeled data. We can picture these algorithms with an example. Supervised learning is where a student is under the supervision of an instructor at home and at college. If that student analyzes the same concept alone, without any help from the instructor, it is unsupervised learning. Under semi-supervised learning, the student revises the concept independently after first analyzing it under the guidance of an instructor at college.

Reinforcement learning
Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by trial and error: taking actions, learning from experience, and improving its performance. The agent is rewarded for each good action and punished for each bad action; hence the goal of a reinforcement learning agent is to maximize its rewards. In reinforcement learning, there is no labelled data as in supervised learning, and agents learn from their experiences only. The reinforcement learning process is similar to how a human being learns; for example, a child learns various things through experience in day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in terms of punishments and rewards. Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.

Hadoop-related Apache Projects
- Ambari™: A web-based tool for provisioning, managing, and monitoring Hadoop clusters. It also provides a dashboard for viewing cluster health and the ability to view MapReduce, Pig, and Hive applications visually.
- Avro™: A data serialization system.
- Cassandra™: A scalable multi-master database with no single points of failure.
- Chukwa™: A data collection system for managing large distributed systems.
- HBase™: A scalable, distributed database that supports structured data storage for large tables.
- Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout™: A scalable machine learning and data mining library.
- Pig™: A high-level data-flow language and execution framework for parallel computation.
- Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
- Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases.
- ZooKeeper™: A high-performance coordination service for distributed applications.

Key Components of Mahout

Mahout reference book

Mahout Overview

Algorithms Examples

Clustering
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
- Data points in one cluster are more similar to one another
- Data points in separate clusters are less similar to one another
Similarity (distance) measures:
- Euclidean distance if attributes are continuous
- Other problem-specific measures
Tan, Steinbach, and Kumar. Introduction to Data Mining

Clustering a collection involves three things
- An algorithm: This is the method used to group the objects together.
- A notion of both similarity and dissimilarity: This determines which objects belong to an existing group (cluster) and which should start a new one.
- A stopping condition: This might be the point beyond which objects cannot be grouped (clustered) anymore, or when the objects are already quite dissimilar.
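
The Euclidean distance measure and the "more similar within a cluster, less similar across clusters" criterion described above can be illustrated with a small sketch. This is only an illustration under stated assumptions: the points, the two hand-assigned groups, and the helper names (`euclidean`, `mean_intra`, `mean_inter`) are hypothetical and do not come from the slides or from Mahout.

```python
import math
from itertools import combinations, product


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_intra(points):
    """Average distance over all distinct pairs inside one group."""
    dists = [euclidean(p, q) for p, q in combinations(points, 2)]
    return sum(dists) / len(dists)


def mean_inter(points_a, points_b):
    """Average distance over all pairs that span the two groups."""
    dists = [euclidean(p, q) for p, q in product(points_a, points_b)]
    return sum(dists) / len(dists)


# Hypothetical 2-D points, grouped by hand purely for illustration.
group_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
group_b = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

intra_a = mean_intra(group_a)            # small: points in one group are close
inter_ab = mean_inter(group_a, group_b)  # large: the groups are far apart

print(f"intra-cluster (A): {intra_a:.3f}")
print(f"inter-cluster (A vs. B): {inter_ab:.3f}")
```

A grouping is a good clustering in the sense of the definition above when the intra-cluster value is small relative to the inter-cluster value; other problem-specific measures could be substituted for `euclidean` without changing the rest of the sketch.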
Steps on Clustering
- Generate vectors from input data
- Write vectors to input directory
- Write initial cluster centers
- Run clustering job
- Read clusters from output directory

Clustering - Example
Euclidean distance based clustering in 3D space
- Intra-cluster distances are minimized
- Inter-cluster distances are maximized
Tan, Steinbach, and Kumar. Introduction to Data Mining

Clustering on a 2D Feature Plane

k-means clustering
K? Means? Clustering?
Clustering is just the process of dividing a dataset into groups such that the members of each group are as similar (close) as possible to one another, and different groups are as dissimilar (far) as possible from one another.
k-means clustering aims to partition a data set (n observations) into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or cluster centroid).
[Figure: data set before and after clustering]

K