2.3 Prediction Techniques

Prediction is an estimation or forecast of future outcomes based on knowledge of the past (Wheeler, 2016). To forecast the future, we need to identify factors that occurred in the past and appear to have influenced the outcome we are trying to predict. Assuming that the underlying patterns are stable, learning from the past allows us to predict the future. Several mathematical, statistical, and data mining techniques can be used for prediction.

The prediction process follows some general phases that are adopted in many applications. This process ensures that standard considerations are respected, so that the resulting models are reliable to a great extent. However, in spite of all considerations, no model can make predictions without error. The goal of developing a predictive model should therefore be to minimize errors rather than to eliminate them. The model development process should consider the impact of each characteristic in the data and the correlations between characteristics, all while minimizing the rate of misclassifications. Hence, the model should learn from sample data, a process referred to as training (the process of determining the parameters of a model using sample data). The result of training is a model that describes a specific class of entities. This is a parametric model, such as g(t, w), where t is the list of selected features or characteristics and w is the list of their corresponding weights (the coefficients that express the relative importance of each parameter in the model). Each class of entities is then modeled by a set of parameters and their weights, and the model is used to determine the type (class) of an unknown entity. The classifying model can therefore be viewed as an explanatory model: a model that provides information about, or describes, an unknown entity. This procedure is shown in the figure below.

Project Aim and Scope Definition

A predictive analysis should aim to predict the behavior of an entity. As the first step in developing such a system, we should clearly determine what is going to be predicted and what input data will be used for this prediction. In addition, the types of input to be included and the behaviors of interest must be evaluated. In this way, we will be able to describe what kind of data should be expected and how it should be processed. For instance, if the predictive analysis is going to determine which commercial advertisements are more likely to be clicked on by users, then the types of advertisements to be considered (size, graphics used, animation, sound effects, etc.), the type of products they advertise, and the specifications of the users under consideration (age, education, language, country, etc.) should all be taken into account.

Collecting data

Any prediction is based on a model that imitates the behavior of its user(s). This model is built using sample data from previous experiments. In general, the data is not acquired from a single source, so the potential data sources should be identified and the relevant data gathered for developing and training the model.

Preprocessing data

The data gathered from different sources may have different formats, be represented in different scales or units, and include invalid or outlier items. The data should therefore be preprocessed before being used to train the model. The preprocessing phase may include cleansing the data, transforming formats, eliminating outliers and erroneous records, and unifying scales and units. It is a time-consuming process, and extensive domain knowledge is needed to judge whether the data is trustworthy.
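The kind of preprocessing described above can be sketched in a few lines of pandas. The column names, the plausible-age threshold, and the unit conversions below are hypothetical and only illustrate the cleansing, outlier-removal, and unit-unification steps; they are not taken from the text.

import pandas as pd

# Hypothetical raw data gathered from different sources with mixed quality/units.
raw = pd.DataFrame({
    "age":        [23, 35, None, 41, 29, 230],        # 230 is an implausible outlier
    "income_usd": [42000, 58000, 51000, None, 47000, 61000],
    "height_cm":  [170, 182, 165, 158, 176, 169],
})

# Cleansing: drop records with missing values (imputation would be an alternative).
clean = raw.dropna()

# Eliminating outliers and erroneous records: keep only plausible ages.
clean = clean[(clean["age"] >= 0) & (clean["age"] <= 120)]

# Unifying scales and units: height in meters, income in thousands of USD.
clean["height_m"] = clean["height_cm"] / 100.0
clean["income_k"] = clean["income_usd"] / 1000.0
clean = clean.drop(columns=["height_cm", "income_usd"])

print(clean)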
Modeling

Modeling involves statistical analysis of the data to extract the features that are most relevant from an information perspective. These statistics are used to develop a model that classifies similar behavioral reactions based on statistical similarities while differentiating dissimilar behaviors. Many different machine learning methods, such as neural networks, support vector machines, and logistic regression, are used for building prediction models.

A simple and widely used statistical modeling technique in machine learning, which is also used in predictive analysis, is the Naïve Bayesian method. It relies on the probability distribution of the input values and on conditional probabilities. As a simple example, let us assume that a jar contains marbles of two different shapes and colors. To model the predicted behavior, we take the shape of the marble as what is observed and its color as the behavior to be predicted. (We may assume we can feel the shape of a marble when we reach into the jar, but we do not see its color before we take it out.) Our a-priori information, i.e., the knowledge or information available prior to experience, indicates that the number of marbles of each shape/color combination is as shown in the table below. Note that this prior information is the core part of the Bayesian analysis: it incorporates our background, or domain, knowledge. In machine learning, the data forms the a-priori knowledge.

Number and Probability of Observing Each Marble

Shape \ Color            Black    Red     Marginal probability*
Round                    25       20      0.45
Triangular               40       15      0.55
Marginal probability*    0.65     0.35    1.00

*The marginal probability of an event is the unconditional probability of the event occurring.

Without considering shape, the a-priori probability of selecting a black marble is 0.65. If we know the shape of the selected marble (round, for example), this probability changes: we now need the conditional probability of observing a black marble given a round shape. Taking the color of the marble as output and the shape of the selected marble as input, we can predict the probability (score) of observing a black marble as:

Probability(color = black given shape = round) = Probability(black and round) / Probability(round)

A simpler notation for the above formula is:

P(color = black | shape = round) = P(black & round) / P(round)

In our example, the probability of observing a black and round marble is 0.25, and the probability of observing a round marble is 0.45. Therefore, the probability of observing a black marble given that its shape is round becomes 0.25/0.45 ≈ 0.56.
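As a minimal sketch of this calculation, the counts from the table above can be turned into joint and marginal probabilities, and the conditional probability of drawing a black marble given a round shape follows by division. The variable names are illustrative only.

# Counts from the marble table: (shape, color) -> number of marbles.
counts = {
    ("round", "black"): 25, ("round", "red"): 20,
    ("triangular", "black"): 40, ("triangular", "red"): 15,
}
total = sum(counts.values())  # 100 marbles in the jar

# Joint probability P(black & round) and marginal probability P(round).
p_black_and_round = counts[("round", "black")] / total                            # 0.25
p_round = sum(v for (shape, _), v in counts.items() if shape == "round") / total  # 0.45

# Conditional probability P(color = black | shape = round).
p_black_given_round = p_black_and_round / p_round
print(round(p_black_given_round, 2))  # 0.56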
It should be noted that the model developed here is based on a single jar of marbles. In practice, the data used for developing a model is a subset of the total set of observable values. For example, if the life expectancy of different species of fish were examined, the data would be a subset of the fish of each species.

A more realistic example concerns the prediction of Web users' behavior when they see a specific commercial advertisement. Let us assume that, based on past data, we have the age distribution of the female and male users who reacted to the advertisement. If we have some information about the gender and/or the age of a user, we can predict whether they will react to a specific advertisement. For instance, if we assume the reaction probability distributions of male and female users shown in the next figure, we can build the prediction model in a similar way:

P(Clicked | Age = x) = P(Age = x | Clicked) · P(Clicked) / P(Age = x)

where P(Clicked | Age = x) is the probability of clicking on the advertisement given that the age is x, and P(Age = x | Clicked) is the probability of observing Age = x if the advertisement is clicked. P(Clicked) is the probability that the advertisement is clicked; this probability is typically measured from past data, but we can also assign it a prescribed value. P(Age = x) is the probability of observing Age = x among users. In the figure, P(male) and P(female) represent the probability that the user is male or female, respectively.
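The same rule can be evaluated numerically. The sketch below assumes, purely for illustration, that the ages of users who clicked and of users who did not click each follow a normal distribution with hypothetical means and standard deviations, and that the overall click probability P(Clicked) is 0.1; none of these numbers come from the text.

from scipy.stats import norm

# Hypothetical quantities (not from the original text).
p_clicked = 0.1                        # prescribed prior P(Clicked)
age_given_clicked = norm(30, 8)        # age density for users who clicked
age_given_not_clicked = norm(45, 15)   # age density for users who did not click

def p_clicked_given_age(x: float) -> float:
    """Bayes' rule: P(Clicked | Age = x) from the densities and the prior."""
    likelihood = age_given_clicked.pdf(x) * p_clicked
    evidence = likelihood + age_given_not_clicked.pdf(x) * (1 - p_clicked)  # P(Age = x)
    return likelihood / evidence

print(round(p_clicked_given_age(25), 3))
print(round(p_clicked_given_age(60), 3))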
Deploying the Model

After the model is built and validated on test data, it should be brought into "production." The model may be deployed either in a computer center or on a cloud system, with all the monitoring, versioning, and handling that this entails. The model provides a numeric score indicating the likelihood of observing a given behavior as the reaction of an entity or individual to each specific input. The actual accuracy of the model can be verified only by deploying it in real environments: deployment exposes the model to real environmental conditions, and many models that perform well in test environments fail to act properly under real conditions.

Monitoring and Evolving the Model

A model for predicting the behavior of entities is built from a limited subset of the data produced by past experiences. Even if the data are selected to be a fair representation of different cases, they may not cover all aspects of the behavioral model. In addition, predictive models are constructed by observing reactions to some characteristics of the input data: while designing the model, a group of characteristics of the input data is selected and the model is trained accordingly. If the selected characteristics are not comprehensive enough to describe the relationship between the data and the observed behaviors, implying the existence of hidden parameters or lurking variables (variables that are not considered explanatory or observable variables but that may nevertheless affect the result), the model may not predict behavior correctly. Moreover, the characteristics of the entity under investigation may change from the initial model over time. This requires monitoring the performance of the model and applying the necessary improvements in accordance with evolving conditions.

It should perhaps be emphasized that, despite all the best considerations, there is always a degree of uncertainty in real-world predictive models. The general consensus is to assume that the patterns identified through predictive models are nevertheless stable enough to predict future events. In what follows, we discuss regression and classification techniques, including decision trees and neural networks.

Regression

Regression analysis is a set of statistical processes for estimating relationships between variables, and it is used for prediction and forecasting. The focus of this statistical method is on determining whether a relationship exists between a dependent variable (also called a target, or response, variable, whose value depends on one or more other variables) and one or more independent variables, or predictors, whose values do not depend on other variables. In other words, regression helps us understand how the value of the response variable changes when one of its predictors is varied while the other predictors remain fixed.

Regression analysis is a useful technique for businesses that want to predict future events. For example, insurance companies can predict the loss claims for some insurance products using regression analysis. In marketing, companies can predict sales, customer satisfaction, product buying frequency, and the likelihood of a customer returning to the store after having received a recommendation for a new product or service. To do this, a whole range of variables that could have an effect on the response variables in question are studied. These variables are considered predictor variables. The most common predictors include the following:

- Demographic variables (e.g., age, sex, average income)
- Geographical variables (e.g., where the customer lives, their country, state, city)
- Domain-specific variables containing information about past business processes, such as sales, order management, claims, and fraud. In the airline industry, examples include a customer's degree of satisfaction with the food, the cabin crew, the in-flight entertainment, or the ticket prices. They can also include data showing how long a user spent on a website, how many products they explored, or the number of clicks customers made while in an online shop.

For a simple example of linear regression analysis, consider the relation between the advertising budget (as a predictor) and product sales (as a response variable) in a certain company. The relationship between these two variables is generally nonlinear, but for the sake of simplicity we assume a linear relationship. This allows us to find the best linear fit (red line) to the blue curve in the figure. The equation of the red line is given by:

Sales amount = 6.5068 · Advertising budget − 154.13

Roughly speaking, the more a company invests in advertising, the higher the product sales. This simple linear equation explains the behavior of product sales versus advertising budget, but it does not give an accurate description of the real curve (in particular around 100 on the x-axis in the figure). For a more precise model, we would need to make use of more complex equations.
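A least-squares fit of this kind takes only a few lines with NumPy. The advertising and sales figures below are made up purely to illustrate the procedure; only the idea of fitting a straight line of the form sales = a · budget + b comes from the text above.

import numpy as np

# Hypothetical observations: advertising budget (predictor) and sales (response).
budget = np.array([20, 40, 60, 80, 100, 120, 140], dtype=float)
sales  = np.array([50, 150, 230, 370, 480, 640, 750], dtype=float)

# Fit sales = a * budget + b by ordinary least squares (degree-1 polynomial).
a, b = np.polyfit(budget, sales, deg=1)
print(f"sales ≈ {a:.4f} * budget + {b:.2f}")

# Use the fitted line to predict sales for a new budget value.
new_budget = 110.0
print("predicted sales:", a * new_budget + b)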
Decision Trees

A single decision tree is a simple classification method made up of decision nodes and branches; it uses a branching model to show each possible outcome of a decision. Beginning with the root node (conventionally, the top node in the diagram), an attribute (or combination of attributes) is evaluated at each subsequent decision node, and the possible outcomes determine the different paths leading away from that node. Decision nodes are the variables in the tree that are controlled by the decision-maker, and branches represent the possible outcomes or decisions. A sequence of one or more decision nodes leads to a terminating leaf node, shown in green in the sample decision tree figure. This simplified example concerns predicting bank customers' credit ratings, which are divided into three classes: bad, medium, and good. In this example, "credit rating" is the variable we seek to predict, i.e., the target variable.

The top decision node (income) is the most important attribute, as it provides the best classification of the target variable. The different possible values of this predictor variable lead to different sub-trees; here, there are three possible values (or outcomes) for income: low, medium, and high. The other predictors (attributes) used in the tree are the number of credit cards and age. After the tree is created, it can be used to predict the credit rating of new customers. Suppose we have a new customer (or data instance) with the following attributes:

Income = Medium, Number of Credit Cards = 6, Age = Senior

Starting at the root node, we first test income. Since the income is medium, we follow the middle path, which leads to the node testing the number of credit cards. Since the customer has six cards, we follow the left branch to reach the age node. Finally, because our customer is a senior (the left branch), we predict that this customer has a medium credit rating (medium risk).

It should be noted that a single tree is often not very useful: it only captures simple linear relationships and ignores the correlation of the variables, i.e., it is dependent on the order in which the variables are taken into account when building the tree. The concept of the decision tree is therefore generalized to a "forest" consisting of a collection of decision trees that differ slightly from one another. Using a forest allows us to obtain a more robust result, but the ease of interpretation is lost. Typically, forest algorithms, such as random forest, perform more accurately and efficiently than single decision trees.

There are several well-known decision tree algorithms. Here, we briefly introduce some of the most important ones.

Classification and regression trees (CART)

The CART algorithm was introduced by Breiman, Friedman, Stone, and Olshen in 1984. This method constructs binary decision trees, i.e., trees with exactly two branches (outcomes) at each decision node. At each level (or decision node) of the tree, CART searches all variables to find the best splitting variable. The best attribute is selected using the Gini index, which measures the impurity of the data. (The Gini index measures the disparity among the values of a probability distribution; a Gini index of zero corresponds to a perfect match between all values of the distribution.)

C4.5 algorithm

Quinlan (1986) first introduced the ID3 algorithm and later extended it into the C4.5 algorithm. Similar to CART, C4.5 tests every variable at each level of the tree and selects the best splitter (Han, Kamber, & Pei, 2011). This is done using information gain, or entropy. Unlike CART, the C4.5 algorithm does not necessarily produce binary trees.

Decision rule

One advantage of the decision tree method is that the trees provide an easy way to interpret the results by creating a rule-based prediction system. Indeed, by following each path in the tree, we can build a new decision rule. For example, following the middle path from income down the tree in the figure above yields the following rule: IF income is medium AND the number of credit cards is more than five AND the customer is middle-aged or senior, THEN the credit rating is medium.
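Read as code, such a rule is simply a set of nested conditionals. The sketch below encodes only the middle path quoted above; the branches for the other income levels and for younger customers are not given in the source, so their return values are placeholders.

def credit_rating(income: str, num_cards: int, age: str) -> str:
    """Rule-based prediction derived from one path of the tree described above."""
    if income == "medium":
        if num_cards > 5:
            if age in ("middle-aged", "senior"):
                return "medium"   # IF medium income AND >5 cards AND middle-aged/senior
            return "good"         # hypothetical branch: young customers (not in the text)
        return "good"             # hypothetical branch: 5 or fewer cards (not in the text)
    return "unknown"              # low/high income paths are not spelled out in the text

# The new customer from the walkthrough: medium income, six cards, senior.
print(credit_rating("medium", 6, "senior"))  # -> "medium"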
Neural Networks

Artificial neural networks (ANN) are inspired by the biological neural systems that exist in animal brains. Although a single neuron may be rather simple in structure and limited in functionality, dense networks with large numbers of interconnected neurons can be used to solve complex learning problems, such as computer vision, speech recognition, machine translation, social network analysis, and medical diagnostics.

For example, an ANN can be trained to identify cars on a map using a collection of pictures of cars in different positions and environments. This picture collection becomes a training set that is fed into a learning algorithm to train the network. The network itself then discovers the most important features that allow it to distinguish and identify a car in an image. It should be emphasized that these features are obtained throughout the learning process and are not provided manually by a human expert.

Biological neural networks consist of large numbers of simple elements called neurons. A simple neuron is illustrated in the figure above. The neuron takes input signals from its environment (other neurons) via dendrites. These signals are modified (strengthened or weakened) and summed up inside the cell body. The neuron then processes the net input signal, producing an output signal that is sent through the axon to other neurons downstream. A single biological neuron may be modeled artificially (see figure below).

Each link between a neuron and its inputs has a weight. The values of the inputs are multiplied by the corresponding weights and summed together to form the net input to the neuron. More precisely, the net input is given by

net = Σ wi·xi = w1·x1 + ⋯ + wn·xn

After computing the net input, an activation function is applied to determine the output, i.e.,

y = f(net)

Activation functions are used to simulate the behavior of biological neurons. If the input signal of a simple biological neuron is large enough (i.e., greater than a certain threshold), the neuron fires; if the input signal is not large enough, the neuron remains inactive (i.e., zero output). Such an activation function is called a step function (see figure below).

Note that the net input of the neuron is a linear combination of its inputs. If all activation functions in a neural net are linear, then the outputs will be linearly related to the inputs, because the composition of linear functions is again linear. This would greatly limit the power of the net to learn and predict nonlinear mappings between input and output samples. Therefore, in general, nonlinearity is an important property of the activation function.

There are several activation functions. The most common are the rectified linear unit (ReLU) and the logistic sigmoid function. The logistic sigmoid is given by

f(x) = 1 / (1 + e^(−x))

where e ≈ 2.7182 is Euler's number. The graph of f is shown below. The logistic sigmoid is a nonlinear function and can be considered a continuous version of the step function. The ReLU function is defined as

ReLU(x) = max(x, 0), i.e., ReLU(x) = x for x > 0 and ReLU(x) = 0 for x ≤ 0.
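A single artificial neuron of this kind is straightforward to express with NumPy. The input values and weights below are arbitrary illustrative numbers; the code simply computes the weighted net input and applies the step, sigmoid, or ReLU activation discussed above.

import numpy as np

def step(net):
    return np.where(net > 0, 1.0, 0.0)    # fires only above the threshold 0

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))     # continuous version of the step function

def relu(net):
    return np.maximum(net, 0.0)           # max(net, 0)

def neuron(x, w, activation):
    """Compute y = f(net) with net = w1*x1 + ... + wn*xn."""
    net = np.dot(w, x)
    return activation(net)

x = np.array([0.5, -1.0, 2.0])            # illustrative input signals
w = np.array([0.8, 0.2, -0.4])            # illustrative connection weights

for f in (step, sigmoid, relu):
    print(f.__name__, neuron(x, w, f))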
Simple neurons may be arranged and combined to form more complex neural networks. A typical neural net consists of several input neurons (nodes) arranged in the input layer. The input neurons are connected to hidden units in one or more hidden layers, which are in turn connected to the neuron(s) in the output layer. The net in the next figure is an example of a fully connected network, in which each neuron in a layer is connected to all neurons in the next layer. This network is also a feedforward network: the signals flow in the forward direction, and there is no closed loop inside the net. In other words, in a feedforward net, signals pass from the neurons in one layer to those in the next layer and never in the reverse direction. The arrangement of the neurons in different layers and the pattern of connections between the nodes is called the net architecture. In practice, the data analyst should configure the net architecture according to the specifics of the problem at hand.

Notice that the set of weights that can solve a particular problem is determined during the learning phase. The values of these weights are the most important knowledge extracted during training and stored in the "memory" of the network. In what follows, we briefly explain how learning algorithms work in general.

By learning, we mean the learning of the weights. ANNs are trained on a set of training samples consisting of a number of data inputs along with their target values (i.e., desired outputs). The weights are initialized either simply with small random values or with a special procedure, e.g., Glorot initialization (Glorot & Bengio, 2010). The training data are then fed into the network one sample at a time, and the network calculates an output value for each instance. The discrepancy between the calculated output and the (predetermined) target value constitutes the error for that training sample. There are several ways to measure this discrepancy, such as the sum of squares and the cross-entropy error function. The sum of squares error (SSE) is given by

SSE = Σ over instances Σ over output nodes (actual value − predicted value)²

The value of this error depends on the weights of the net, and the problem is to find a set of weights that minimizes the SSE.
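A minimal sketch of such a forward pass and its error, assuming a tiny fully connected feedforward net with one hidden layer, sigmoid activations, and made-up weights and training samples (none of these numbers come from the text):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights for a 2-input, 2-hidden-unit, 1-output feedforward net.
W_hidden = np.array([[0.1, -0.3],
                     [0.5,  0.2]])   # one row of weights per hidden unit
W_output = np.array([0.7, -0.6])     # weights from the hidden units to the output

def forward(x):
    """Signals flow forward only: input layer -> hidden layer -> output layer."""
    hidden = sigmoid(W_hidden @ x)
    return sigmoid(W_output @ hidden)

# Two made-up training samples with their target (desired) outputs.
samples = [(np.array([1.0, 0.0]), 1.0),
           (np.array([0.0, 1.0]), 0.0)]

# Sum of squares error over all instances (a single output node here).
sse = sum((target - forward(x)) ** 2 for x, target in samples)
print("SSE =", sse)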
This minimization problem can be complex and time-consuming, so we need an efficient method to solve it. A well-known mathematical technique, the gradient descent method, is of great help here: it specifies the direction in which the weights should be adjusted in order to reduce the SSE.

Neural networks are often trained with the backpropagation algorithm (Larose & Larose, 2014; Rumelhart, Hinton, & Williams, 1986). The general idea is as follows. First, we feed the network with a training sample and compute the error. The error is then propagated in the backward direction to determine how much each connection weight needs to be adjusted in order to reduce the error; this adjustment is done by the gradient descent algorithm. The same process is repeated for each training sample until the error (i.e., the difference between the net's output and the desired output) over all training samples reaches a minimum.

More precisely, suppose W = [w0, w1, …, wm] is the vector containing all the weights of the network. The gradient of the SSE is defined as the vector of partial derivatives of the SSE with respect to the weights, i.e.,

∇SSE(W) = (∂SSE/∂w0, ∂SSE/∂w1, …, ∂SSE/∂wm)

According to the gradient descent algorithm, to reduce the error function SSE, the weight vector W is updated in the direction opposite to the gradient vector. More precisely,

W(τ+1) = W(τ) − α · ∇SSE(W(τ))

where 0 < α ≤ 1 is the learning rate and τ is the iteration index. The following figure illustrates how the gradient descent method works in practice. Without loss of generality, we assume only one weight here. Choosing a random initial weight, the gradient (i.e., the derivative at that point) is calculated, and the weight is then adjusted in the direction opposite to the gradient to get closer to the minimum. We can see that, after several iterations, the weight gets very close to the minimum point.
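The one-weight illustration can be reproduced in a few lines. The error curve below is a made-up quadratic SSE(w) = (w − 3)², chosen only so that the minimum (w = 3) is known in advance; the update rule is exactly w ← w − α · dSSE/dw.

def sse(w):
    return (w - 3.0) ** 2          # made-up error curve with its minimum at w = 3

def grad_sse(w):
    return 2.0 * (w - 3.0)         # derivative of the error with respect to w

w = 10.0                           # arbitrary initial weight
alpha = 0.1                        # learning rate, 0 < alpha <= 1

for step in range(25):             # repeated updates move w toward the minimum
    w = w - alpha * grad_sse(w)

print(round(w, 4), round(sse(w), 6))  # w is now very close to 3 and SSE close to 0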
Neural networks are flexible and powerful tools that can be trained to learn sophisticated patterns and solve very complex problems. In more advanced applications, the number of hidden layers may need to be increased; such networks are called deep networks. Due to its higher number of free parameters (i.e., weights), a deep network is much more powerful than a small, shallow network with only one layer of hidden units. On the other hand, the training of deep networks is much slower, and additional sophisticated techniques have to be employed to speed up learning. Deep learning has been very successful in applications like image recognition, speech recognition, and natural language processing (NLP).

Understanding the Limitations of Prediction

Using machine learning or data mining techniques in a data prediction application is not without pitfalls and obstacles. Wong, Sen, and Chiang (2012) collected 1.7 million tweets about 34 films during the 2012 Oscar Awards and classified the tweets as positive or negative. They also collected reviews from a few celebrated movie sites like IMDb and Rotten Tomatoes. Their research showed that Twitter's ability to predict Oscar winners is limited. Moreover, they found that reviews on Twitter do not typically reflect the reviews that appear on other websites. Most importantly, they concluded that Twitter data is not a reliable data source for predicting box office revenue. Such studies cast doubt on the reliability of predictions based on big data. We should therefore be aware of the limitations of using data mining or machine learning techniques to predict future events. According to Strong (2015), some of these limitations include the following:

- Dependency on historical data. Prediction can be performed only when there are historical data that can be analyzed. In cases where no historical data exist (such as the acceptance of a new product), prediction models may exhibit poor performance.
- Dependency on the quality of data. The accuracy of a prediction model depends directly on the quality of the data used to build it. If the input data is defective or faulty, the prediction model will be unreliable.
- Bias. Even if historical data are available and of good quality, the trained model may still give a mediocre performance. The reason is that the data might not be representative of the underlying patterns. The movie awards study by Wong et al. (2012) is an example of the damage bias can cause to a model.
- Overfitting. Overfitting can happen when there are too many free parameters (i.e., parameters that can take arbitrary real values) in the predictive model compared to the size of the data set. In such cases, the model becomes too flexible and may fit itself to the noise and irregularities of the data samples rather than to their underlying patterns and regularities. In other words, the model tries to memorize the data samples "by heart" rather than understanding the underlying patterns and regularities.
- Selectivity. As humans, we often tend to ignore inconvenient data samples because of our cognitive biases. This selection tendency may lead to models that cannot predict everything properly.