Chapter 2.pptx
Chapter 2. Learning from Big Data
Dr. Feras Al-Obeidat

Overview
In this chapter, we are going to look at some of the basic concepts of machine learning and take a deep dive into some of the algorithms. We will use Spark's machine learning libraries. Spark is one of the most popular compute frameworks for the implementation of algorithms and serves as a generic computation engine on big data. Spark fits into the big data ecosystem well, with a simple programming interface, and very effectively leverages the power of distributed and resilient computing frameworks. We cover the categories of machine learning, supervised and unsupervised learning, along with:
Regression analysis
Data clustering
K-means
Data dimensionality reduction

Supervised and Unsupervised Machine Learning
Machine learning at a broad level is categorized into two types: supervised and unsupervised learning. As the name indicates, this categorization is based on the availability of historical data, or the lack thereof. In simple terms, a supervised machine learning algorithm depends on training data, a version of the truth, and generalizes a model from it to make predictions on new data points.

The value of the y variable is dependent on the value of x: based on a change in the value of x, there is a proportionate change in the value of y (an increase or decrease in one factor proportionally changes the other). Based on the data presented in the preceding table, it is clear that the value of y increases with an increase in the value of x. That means there is a direct relationship between x and y. In this case, x is an independent, or input, variable and y is called a dependent, or target, variable.

Difference between traditional computer programming and machine learning
There is a fundamental difference between traditional computer programming and machine learning when it comes to predicting the value of the y variable for a specific value of x.
The following diagram shows the traditional programming process.

Traditional Computer Program
A traditional computer program has a predefined function that is applied to the input data to produce the output. In this example, the traditional computer program calculates the value of the output variable (y) as 562.

Machine Learning (ML)
In the case of supervised machine learning, the input and output data (training data) are used to create the program, or function. This is also termed the predictor function, which is used to predict the outcome of the dependent variable. In its simplest form, the process of defining the predictor function is called model training. Once the function is defined, we can predict the value of the target variable (y) corresponding to an input value (x). The goal of supervised machine learning is to develop a predictor function, h(x), called the hypothesis. The hypothesis is the target function that we want to model.

Supervised Learning (Linear Regression)
Once we plot the data points from the training data, we can visualize the correlation between them. To predict the value of y when x = 220, we can draw a straight line that tries to model the truth (the training data). The straight line represents the predictor function, which is also termed the hypothesis. Based on this hypothesis, our model predicts that the value of y when x = 220 will be ~430. While this hypothesis predicts the value of y for a certain value of x, the line that defines the predictor function does not cover all the values of the input variable. For example, based on the training data, y = 380 at x = 150; as per the hypothesis, however, the value comes out to be ~325. This differential is called the prediction error (~55 units in this case).

Prediction Error
Any input variable (x) value that does not fall on the predictor function has some prediction error based on the derived hypothesis.
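The contrast between a predefined function and a learned hypothesis can be sketched in a few lines of Python. The rule and the two training points below are hypothetical illustrations, not the chapter's actual dataset:

```python
# Traditional programming: the function is predefined by the programmer.
def traditional_program(x):
    return 2 * x + 122          # hypothetical hard-coded rule

print(traditional_program(220))  # 562, computed from the fixed rule

# Supervised ML: the function (the hypothesis h) is derived from
# input/output pairs. Here we "train" the simplest possible model by
# fitting a line through two hypothetical training points.
train = [(100, 322), (200, 522)]
(x1, y1), (x2, y2) = train
b = (y2 - y1) / (x2 - x1)       # learned slope
a = y1 - b * x1                 # learned intercept

def h(x):                       # the learned hypothesis h(x)
    return a + b * x

print(h(220))                   # prediction for an unseen input
```

With more (and noisier) training points, the same idea generalizes to minimizing the prediction error across the whole training set, which is what the least square method later in this chapter does.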
The sum of errors across all the training data is a good measure of the model's accuracy. The primary goal of any supervised learning algorithm is to minimize this error while defining a hypothesis based on the training data. A straight-line hypothesis function works well as an illustration; in practice, however, there are often multiple input variables (a multidimensional space) that control the output variable. When we predict the value of an output variable at a certain value of the input variable, it is called regression.

Classification
In certain cases, the historical data, or version of truth, is also used to separate data points into discrete sets (class, type, category); classification performs this separation, for example, classifying emails based on training data. In the case of classification, the classes are known and predefined. Classification is a supervised learning technique that defines the decision boundary between the output classes. Regression and classification both require historical data to make predictions about new data points; these represent supervised learning techniques.

Supervised Learning Process
The generic process of supervised machine learning can be represented as follows. The labeled data is split into training and validation sets with random sampling; typically, an 80-20 rule is followed for the split percentage (hold-out, cross-validation, and similar schemes are used). The training set is used for training the model (curve fitting) to reduce the overall error of the prediction. The model is checked for accuracy with the validation set. The model is further tuned to the accuracy threshold and then utilized for the prediction of the dependent variables for new data.

The Spark programming model
Spark is a distributed in-memory processing engine and framework that provides abstract APIs to process big volumes of data using a distributed collection of objects called Resilient Distributed Datasets (RDDs).
Spark is flexible, with a rich set of libraries, components, and tools that let you write distributed code in an efficient manner. A Spark application consists of Java Virtual Machine (JVM) processes of three kinds: driver, executor, and cluster manager.
The driver program runs as a separate process on a logically or physically segregated node and is responsible for launching the Spark application, maintaining all relevant information and configuration about the launched application, executing the application DAG as per the user code and schedules, and distributing tasks across the available executors.
Spark executor processes are responsible for running the tasks assigned to them by the driver process, storing data in in-memory data structures called RDDs, and reporting their code-execution state back to the driver process.
Cluster managers are responsible for the physical machines and for allocating resources to any Spark application.

Regression Analysis
Regression analysis is a modeling technique that is used for predicting or forecasting the occurrence of an event or the value of a continuous variable (the dependent variable), based on one or many independent variables. For example, when you drive from one place to another, there are numerous factors that affect the amount of time it will take to reach the destination: the start time, the distance, real-time traffic conditions, construction activities on the road, and weather conditions. All these factors impact the actual time it will take to reach the destination. As you can imagine, some factors have more impact than others on the value of the dependent variable. In regression analysis, we mathematically sort out which variables impact the outcome, leading us to understand which factors matter most, which ones do not impact the outcome in a meaningful way, how these factors relate to each other, and, mathematically, the quantified impact of each factor on the outcome.
Various regression techniques

Linear regression
In linear regression, we model the relationship between the dependent variable, y, and the independent variable, x. If there is one independent variable, it is called simple linear regression; if there are multiple independent variables, it is called multiple linear regression. The predictor function in the case of linear regression is a straight line. The regression line defines the relationship between x and y: when the value of y increases as x increases, there is a positive relationship between x and y; if x and y are inversely proportional, there is a negative relationship. The line is plotted on the x and y dimensions and is chosen to minimize the prediction error, the difference between the predicted value and the actual value.

This is the equation of a straight line: y = a + bx, where y is the value of the dependent variable, a is the y intercept (the value of y where the regression line meets the y axis), and b is the slope of the line. Let's consider the least square method, with which we can derive the regression line with minimum prediction error.

Least square method
Let's consider the same training data we referred to earlier in this chapter. We have values for the independent variable, x, and corresponding values for the dependent variable, y. These values are plotted on a two-dimensional scatter plot. The goal is to draw a regression line through the training data so as to minimize the error of our predictions. The linear regression line with minimum error always passes through the point defined by the means of the x and y values. The least square method calculates the y intercept and the slope of the line with the following steps:
1. Calculate the mean of all the x values (119.33).
2. Calculate the mean of all the y values (303.20).
3. Calculate the difference from the mean for all the x and y values.
4. Calculate the square of the mean difference for all the x values.
5. Multiply the mean difference of x by the mean difference of y for all the combinations of x and y.
6. Calculate the sum of the squares of all the mean differences of the x values (56743.33).
7. Calculate the sum of the mean difference products of the x and y values (90452.00).
8. The slope of the regression line is obtained by dividing the sum of the mean difference products of x and y by the sum of the squares of all the mean differences of the x values: 90452.00 / 56743.33 = 1.594. The slope is positive; this is the value for b in our equation, so b = 1.594.
9. Calculate the value of the y intercept (a) by solving the equation y = a + 1.594 * x at the mean point.
10. Therefore, 303.2 = a + (1.594 * 119.33).
11. Solving this, we get a = 112.98 as the y intercept for the regression line: y = 112.98 + 1.594 * x.

The predicted values from the regression line against the original values:

x   | predicted y | original y
50  | 192.68      | 180
75  | 232.53      | 200
100 | 272.38      | 320
125 | 312.23      | 340
150 | 352.08      | 380

R-squared
So far, we have created our regression line, with which we can predict the value of the dependent variable, y, for a value of x. Let's check how close our regression line is, mathematically, to the actual data points. One of the most popular statistical techniques for this purpose is R-squared, also called the coefficient of determination. R-squared calculates the percentage of response-variable variation explained by the linear regression model we have developed. R-squared values are always between 0% and 100%. A higher value of R-squared indicates that the model fits the training data well; this is generally termed the goodness of fit. Let's use our training data to calculate R-squared based on the formula in the preceding image. In this case, R-squared = 144175.50 / 156350.40 = 0.9221
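The least-squares steps above can be sketched directly in Python. Note that this sketch uses only the five (x, y) rows visible in the table; the chapter's own figures (b = 1.594, a = 112.98, means of 119.33 and 303.20) come from its full training set, which includes points not reproduced in this excerpt, so the numbers below differ:

```python
import math

# The five (x, y) rows visible in the chapter's table.
xs = [50, 75, 100, 125, 150]
ys = [180, 200, 320, 340, 380]
n = len(xs)

x_mean = sum(xs) / n                      # step 1: mean of the x values
y_mean = sum(ys) / n                      # step 2: mean of the y values
dx = [x - x_mean for x in xs]             # step 3: differences from the mean
dy = [y - y_mean for y in ys]
sxx = sum(d * d for d in dx)              # steps 4 & 6: sum of squared x diffs
sxy = sum(p * q for p, q in zip(dx, dy))  # steps 5 & 7: sum of products

b = sxy / sxx                             # step 8: slope
a = y_mean - b * x_mean                   # steps 9-11: intercept via the means

# Goodness of fit (R-squared) and standard error of the estimate.
pred = [a + b * x for x in xs]
ss_res = sum((y - p) ** 2 for y, p in zip(ys, pred))
ss_tot = sum(d * d for d in dy)
r_squared = 1 - ss_res / ss_tot           # coefficient of determination
std_error = math.sqrt(ss_res / (n - 2))   # standard error of the estimate

print(round(b, 2), round(a, 2), round(r_squared, 3))  # 2.16 68.0 0.925
```

The same procedure applied to the chapter's full dataset reproduces its b = 1.594 and a = 112.98.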
= 92%. The model fits the training data very well.

Standard Error
There is another parameter we can derive, called the standard error of the estimate. It is calculated as the square root of the sum of squared prediction errors divided by (n - 2), where n is the sample size, or the number of observations. With our dataset, the standard error of the estimate comes out to be 30.59. Standard error matters because it helps you estimate how well your sample data represents the whole population. A high standard error shows that sample means are widely spread around the population mean, so your sample may not closely represent the population; a low standard error shows that sample means are closely distributed around the population mean, so your sample is representative of the population.

Generalized linear model
In the real world, we deal with multiple variables that affect the output variable; this is termed multiple regression. The linear equation then takes the following form:
y = a0 + b1x1 + b2x2 + ... + bkxk
Here, a0 is the y intercept; x1, x2, ..., xk are the independent variables, or factors; and b1, b2, ..., bk are the weights of the variables, which define how much effect each variable has on the outcome. With multiple regression, we can create a model for predicting a single dependent variable. This limitation is overcome by the generalized linear model, which deals with multiple dependent/response variables, along with the correlation within the predictor variables.

Logistic Regression - Classification Technique
Logistic regression is a method in which we analyze the input variables that result in a binary classification of the output variable. It is a popular method for solving classification problems, for example, detecting whether an email is spam or not, or whether a transaction is fraudulent or not. The goal of logistic regression is to find a best-fitting model that defines the class of the output variable as 0 (negative class) or 1 (positive class).
As a specialized case of linear regression, logistic regression generates the coefficients of a formula to predict the probability of occurrence of the dependent variable. The probability is bound between 0 and 1; a linear regression model, however, cannot guarantee the probability range of 0 to 1.

Difference between linear and logistic regression models

Quadratic function
A quadratic function is one of the form f(x) = ax² + bx + c, where a, b, and c are numbers with a not equal to zero. In mathematics, a quadratic is a type of problem that deals with a variable multiplied by itself, an operation known as squaring; this language derives from the area of a square being its side length multiplied by itself. In algebra, a quadratic function (a quadratic polynomial, a polynomial of degree 2, or simply a quadratic) is a polynomial function in which the highest-degree term is of the second degree. The most common base throughout the sciences is the irrational number e = 2.718281828459045...

Calculate the Probability
There are two conditions we need to meet with regard to the probability in order to generate a binary outcome from the independent variable:
It should be positive (p >= 0): we can use an exponential function in order to ensure positivity.
It should be less than 1 (p <= 1).
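Both conditions are satisfied by the logistic (sigmoid) function, which maps any real-valued linear combination z = a + bx onto a probability strictly between 0 and 1. A minimal sketch, with a hypothetical classification threshold of 0.5:

```python
import math

def sigmoid(z):
    # e^z is always positive (condition 1), and dividing by (1 + e^z)
    # keeps the result below 1 (condition 2).
    return math.exp(z) / (1 + math.exp(z))   # equivalently 1 / (1 + e^-z)

# Probabilities for a few example values of z = a + b*x.
print(sigmoid(-5), sigmoid(0), sigmoid(5))

def classify(z, threshold=0.5):
    # A threshold turns the probability into a binary class label:
    # 1 (positive class) or 0 (negative class).
    return 1 if sigmoid(z) >= threshold else 0
```

Large negative z values give probabilities near 0, large positive values give probabilities near 1, and z = 0 gives exactly 0.5, which is why the decision boundary sits where the linear combination equals zero.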