COS10022_Lecture 02_Linear Regression.pdf
Linear Regression
COS10022 Data Science Principles – Teaching Materials
Co-developed by: Pei-Wei Tsai ([email protected]) and WanTze Vong ([email protected])

Learning Outcomes
This lecture supports the achievement of the following learning outcomes:
3. Describe the processes within the Data Analytics Lifecycle.
4. Analyse business and organisational problems and formulate them into data science tasks.
5. Evaluate suitable techniques and tools for specific data science tasks.
6. Develop an analytics plan for a given business case study.

Data Analytics Lifecycle

Phase 1 – Discovery
The data science team learns the business domain and assesses the resources available to support the project in terms of people, technology, time, and data. Important activities include framing the business problem as an analytics challenge that can be addressed in subsequent phases, and formulating initial hypotheses (IHs) to test and begin learning the data.
Personal Loan Approval Prediction Model: Conduct stakeholder interviews to gather comprehensive requirements and objectives for the personal loan approval model. Understand what factors influence loan approval decisions and what business goals (e.g., reducing default rates, increasing customer satisfaction) the model should aim to support.

Phase 2 – Data Preparation
Phase 2 requires the presence of an analytics sandbox, in which the data science team works with the data and performs analytics for the duration of the project. The team performs ETLT to get the data into the sandbox and familiarizes itself with the data thoroughly (ETL + ELT = ETLT: Extract, Transform and Load).
Personal Loan Approval Prediction Model: Collect historical loan application data and perform data cleaning. Handle missing values, remove outliers, and ensure the data is in a suitable format for analysis. Standardize the format of applicant income, employment history, credit score, and other relevant features.

Phase 3 – Model Planning
The data science team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. The team also explores the data to learn about the relationships between variables, and subsequently selects key variables and the most suitable models.
Personal Loan Approval Prediction Model: Select a set of machine learning algorithms to test for predicting loan approval outcomes. This might include logistic regression for its interpretability in binary outcomes like approval/rejection, decision trees for their ability to handle nonlinear relationships, or more complex models like random forests or gradient boosting machines if the problem requires capturing complex patterns in the data.

Phase 4 – Model Building
The data science team develops datasets for testing, training, and production purposes. The team builds and executes models based on the work done in the model planning phase. The team also considers the sufficiency of the existing tools to run the models, or whether a more robust environment for executing the models is needed (e.g. fast hardware, parallel processing, etc.).
Personal Loan Approval Prediction Model: Perform feature engineering to create new variables that might better capture the risk associated with a loan application, such as the debt-to-income ratio. Split the data into training, validation, and test sets, then train different models on the training set while tuning hyperparameters and selecting the best model based on performance on the validation set.
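As a concrete illustration of this split-and-select step, here is a minimal Python sketch using pandas and scikit-learn (both listed among the open-source tools later in this lecture). The file name, column names, and candidate models are hypothetical and not part of the lecture materials.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

loans = pd.read_csv("loan_applications.csv")  # hypothetical historical loan data

# Feature engineering: derive a debt-to-income ratio from two hypothetical columns.
loans["debt_to_income"] = loans["monthly_debt"] / loans["monthly_income"]

X = loans[["debt_to_income", "credit_score"]]  # hypothetical input features
y = loans["approved"]                          # hypothetical binary label

# 60% training, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Fit candidate models on the training set and compare them on the validation set.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_val, model.predict(X_val))

best_name = max(scores, key=scores.get)
print(scores, "-> best on validation set:", best_name)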
Phase 5 – Communicate Results
The data science team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
Personal Loan Approval Prediction Model: Develop a dashboard that visualizes the model's predictions, performance metrics (like accuracy, precision, recall), and the importance of different features in the decision-making process. This helps non-technical stakeholders understand how the model makes decisions.

Phase 6 – Operationalize
The data science team delivers final reports, briefings, code, and technical documents. The team may run a pilot project to implement the models in a production environment.
Personal Loan Approval Prediction Model: Prepare a presentation or report detailing the model's performance metrics, how it was developed, its expected impact on the loan approval process, and guidelines for its deployment and use in the operational environment. Setting up a process for continuous monitoring and updating of the model as new data becomes available is critical to ensure its ongoing accuracy and relevance.

Lecture: Key Questions
What are the distinctions between the model planning and model building phases?
What are some key considerations in model building?
What software tools (commercial, open source) are typically used in this phase?
What is the Linear Regression model and in what situations is it appropriate?
How does the Linear Regression model work for predictive modelling tasks?
How do we prepare our data prior to applying the Linear Regression model?

Phase 4 – Model Building
Key activities: develop an analytical model, fit it on the training data, and evaluate its performance on the test data. The data science team can move to the next phase once the model is sufficiently robust to solve the problem, or once the team has concluded that the model has failed.

Phase 4 – Model Building
In this phase, an analytical model is developed and fit on the training dataset, and subsequently evaluated against the test dataset. The model planning (Phase 3) and model building phases can overlap, where a data science team iterates back and forth between these two phases before settling on the final model. By 'developed', we do not always mean coding an entirely new analytics model from scratch. Rather, this usually involves selecting and experimenting with various models and, where applicable, fine-tuning their parameters. Although some modelling techniques can be quite complex, the actual duration of model building can be short in comparison with the time spent preparing the data and planning the model.

Phase 4 – Model Building
Documentation is important at this stage. Examples of documentation:
Data Transformation: Note decisions on data transformation, missing values, feature scaling, and new feature creation.
Algorithm Selection: Justify the choice of machine learning algorithms, citing supporting research or tests.
Model Configuration: Document algorithm settings, hyperparameters, and tuning methods used.
Training Process: Outline model training steps, data splitting, and overfitting prevention techniques.
Evaluation Metrics: Detail metrics for model performance assessment and their selection rationale.
Results Comparison: Summarize model performance results and visualizations for comparisons.
Final Model Criteria: Explain the final model choice based on performance, interpretability, and practicality.
Error Analysis: Detail any error analysis performed, including common types of errors the model makes, insights from analyzing the errors, and potential strategies to address them.
Version Control: Maintain version history for models, data, and code in a control system.
Dependencies: List necessary external libraries and their versions.
When immersed in the details of building models and transforming data, many small decisions are made about the data and the approach for modeling. These details can be easily forgotten once the project is completed. It is therefore vital to record the results and logic of the model during this phase, and to record any operating assumptions made concerning the data or the context during the modeling process.

Phase 4 – Model Building (reproduced from Lecture 03)
Commercial tools used in this phase:
SAS Enterprise Miner – allows users to run predictive and descriptive models based on large volumes of data from across the enterprise.
IBM SPSS Modeler – offers methods to explore and analyze data through a GUI.
Matlab – provides a high-level language for performing a variety of data analytics, algorithms, and data exploration.
Chorus 6 – provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end.
… and many other well-regarded data mining tools, e.g. STATA, STATISTICA, Mathematica.

Phase 4 – Model Building (reproduced from Lecture 03)
Open source tools:
R and PL/R – PL/R is a procedural language for PostgreSQL with R which allows R commands to be executed in-database.
Octave – a programming language for computational modeling, with some of the functionalities of Matlab.
WEKA – a data mining package with an analytic workbench and a rich Java API.
Python – offers rich machine learning and data visualization packages: scikit-learn, NumPy, SciPy, pandas and matplotlib.
MADlib – provides a machine learning library of algorithms that can be executed in-database, for PostgreSQL or Greenplum.

Predictive Models
Predictive models are data analytics models/algorithms/techniques used for predicting certain attributes of a given object. Examples:
1. A predictive model can be used for guessing whether a customer will "subscribe" or "not subscribe" to a certain product or service.
2. Alternatively, a predictive model may be used to predict whether a patient would "survive" or "not survive" a specific disease.
The goal of a predictive model greatly differs from the goal of unsupervised models (e.g. K-Means Clustering), which are limited to finding specific patterns or structures within the data (e.g. clusters or segments).
(Figure: a predictive model predicts a class label, whereas clustering groups similar data points together.)

Predictive Models
Predicting an attribute of an object is usually solved as a classification problem.
In a classification problem, a model is presented with a set of data examples that are already labeled (the training dataset). After learning from these examples, the model then attempts to label a new, previously unseen set of data (the test dataset). Given the use of a training set, most classification models are categorized as supervised models.
(Figure: a table of input variables and an output class variable; the training set has known class labels ('yes', 'no'), while the test set has class labels still to be predicted.)

Training and Test Sets
Training dataset: the portion of data used to discover a predictive relationship.
Test dataset: the portion of data used to assess the strength and utility of a predictive relationship.
The training and test datasets are usually independent from each other (non-overlapping). In addition to splitting a dataset into training and test sets, it is also common to set aside a certain portion of the dataset as a validation set to improve the performance of a model. The validation dataset is the portion of data used to minimize the possible overfitting of a model and to select the optimal model parameters (more on these in the next lecture). There is no general rule for how you should partition the data!

Linear Regression Model
Linear Regression is considered one of the oldest supervised/predictive models (more than 200 years old). Its goal is to understand the relationship between input and output variables. The model assumes that the output variable (i.e. the predicted variable) is numerical and that a linear relationship exists between the input variables and the single output variable. The value of the output variable is calculated from a linear combination of the input variables.
Advantages: (a) simplicity; (b) gives optimal results when the relationships between the input and output variables are linear.
Disadvantages: (a) limited to predicting numerical values; (b) will not work for modeling non-linear relationships.

Linear Regression Model
Y = aX + b, where Y is the output and X is the input.
(Figure: sample of a linear relationship between weight and height data. Source: http://machinelearningmastery.com/linear-regression-for-machine-learning/)

Linear Regression Model
Linear Regression belongs to what is called the parametric learning or parameter modeling approach. Following this approach, building a predictive model starts with specifying the structure of the model (e.g. Y = aX + b) with certain numeric parameters left unspecified. The objective of model building is to estimate the best values for these parameters from the training data. The Linear Regression model involves a set of parameterized numerical attributes. These attributes can be chosen based on domain knowledge regarding which attributes are likely to be informative in predicting the target variable, or based on more objective methods such as attribute selection techniques.

Linear Regression Model
The Linear Regression model (illustrated on the slide with weight as the output and height and age as the inputs):
y = w0 + w1 x1 + w2 x2 + …
where:
y is the predicted output variable;
w0 is the bias coefficient / intercept (the value of y when all input variables are zero);
w1, w2, … are the parameters / weights / coefficients of the input values that need to be estimated from the training data; and
x1, x2, … are the values of the input variables.
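To make the formula concrete, here is a minimal Python sketch (assuming NumPy and scikit-learn) that fits such a model and recovers w0 and the coefficients. The height/age/weight numbers are invented purely for illustration and are not lecture data.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: x1 = height (cm), x2 = age (years), y = weight (kg).
X = np.array([[150, 30], [165, 22], [170, 35], [182, 28]])
y = np.array([52.0, 60.0, 71.0, 77.0])

model = LinearRegression().fit(X, y)
print("w0 (intercept):", model.intercept_)
print("w1, w2 (coefficients):", model.coef_)
print("prediction for height=165, age=28:", model.predict([[165, 28]]))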
Linear Regression Model – Example
Source: http://onlinestatbook.com/2/regression/intro.html
Given the following data (five pairs of X and Y values):
Task. Build a simple Linear Regression model that predicts the value of Y when the value of X is known.

Linear Regression Model – Example
Building a simple Linear Regression model is similar to finding the best-fitting straight line (called the regression line) through the existing data points. In the diagram, the straight black line is the resulting regression line; points along this line represent the predicted values of Y given a value of X.
Observe. The regression line represents our best estimation of the actual values of Y (the coloured data points) and does not need to cross exactly over all of the actual points on the scatterplot; otherwise, we might end up with an overfitting problem. Note that the line passes quite closely to the red data point; in contrast, it is situated quite far from the yellow data point.
(Figure: scatterplot of the actual Y values with the regression line and the predicted values Y' marked along it.)

Linear Regression Model – Example
Observe. Since there are many possibilities for drawing a regression line through the coloured data points, there must be a way to decide on the best-fitting regression line. Linear Regression solves this by finding the regression line that minimizes the prediction error (hence, an optimization problem). A common measure of such error is the sum of the squared errors (SSE).
Table 2 shows the predicted Y values (Y') based on the previous regression line, given each value of X. Y − Y' is the error value and (Y − Y')² is the squared error value. Adding up the squared errors for the five data points gives the sum of the squared errors, SSE = 2.791. SSE indicates how much of the variation in the dependent variable (Y) is not explained by the model; R² indicates how well the model fits the data.

Linear Regression Model – Example
The previous regression line is modelled using the following equation:
y = 0.785 + 0.425x
For example: for x = 1, y = 0.785 + 0.425(1) = 1.21; for x = 2, y = 0.785 + 0.425(2) = 1.64.

Linear Regression Model – Example
How did we calculate the previous Linear Regression equation in the first place? Five statistics are required:
mean of X: µx
mean of Y: µy
standard deviation of X: sx
standard deviation of Y: sy
Pearson's correlation coefficient: rxy
The mean is calculated as µx = (x1 + x2 + … + xN) / N = (Σ xi) / N, where xi is an input value and N is the total number of values in a given input variable x.

Linear Regression Model – Example
Standard deviation. The standard deviation measures how far a set of random numbers is spread out from its average value (the mean):
sx = √( Σ (xi − µx)² / (N − 1) )
sy = √( Σ (yi − µy)² / (N − 1) )
The quantity under the square root (the sum of squared deviations divided by N − 1) is called the 'sample variance'.
Source: https://www.biologyforlife.com/standard-deviation.html
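A quick Python check of these two formulas (assuming NumPy; the x values below are made up for illustration and are not the lecture's example data):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical values

mu_x = x.sum() / len(x)                                 # mean of X
s_x = np.sqrt(((x - mu_x) ** 2).sum() / (len(x) - 1))   # sample standard deviation

# NumPy's built-ins give the same results; ddof=1 selects the N - 1 denominator.
print(mu_x, np.mean(x))
print(s_x, np.std(x, ddof=1))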
Linear Regression Model – Example
Pearson's correlation coefficient. Pearson's correlation coefficient measures the strength of association between two variables:
rxy = Σ (xi − µx)(yi − µy) / √( Σ (xi − µx)² · Σ (yi − µy)² )
where the sums run over i = 1 to N, and N is the total number of data points.
Source: https://www.spss-tutorials.com/pearson-correlation-coefficient/

Linear Regression Model – Example
The resulting statistics: µx = 3.00, µy = 2.06, sx = 1.581, sy = 1.072, rxy = 0.627.
The slope and intercept of the regression line are then
wx = rxy · (sy / sx) = 0.425
w0 = µy − wx µx = 2.06 − (0.425)(3.00) = 0.785
giving the Linear Regression formula y = 0.785 + 0.425x.

Ordinary Least Squares Regression
The previous example illustrates simple linear regression, where we only have a single input variable. When there is more than one input variable, ordinary least squares regression is used to estimate the parameter value (i.e. the coefficient / weight) of each input variable. Similar to finding the best-fitting regression line, the goal here is to fine-tune these parameters such that they minimize the sum of the squared errors over the data points. As this is an optimization problem, in practice you hardly ever need to do this manually: most data science software packages include linear regression functionality that solves the optimization task easily.

Preparing Data for Linear Regression
Source: http://machinelearningmastery.com/linear-regression-for-machine-learning/
Linear assumption. Linear Regression assumes that the relationships between the input and output variables are linear. For non-linear data, e.g. an exponential relationship, a data transformation technique such as the log transform is needed.
Remove noise and outliers. Linear Regression assumes that the data is clean. Apply appropriate data cleaning techniques to remove possible noise and outliers. Examples of noisy data: incorrect attribute values due to faulty data collection instruments, data entry problems, inconsistency in naming conventions, etc.
Remove collinearity. Collinearity is caused by having too many variables trying to do the same job. Occam's Razor states that among several possible explanations for an event, the simplest explanation is the best; consequently, the simpler our model is, the better. Consider calculating pairwise correlations for your input data and removing the most correlated variables.

Non-Linear Transformation
A linear curve has a straight-line relationship. Using a non-linear transformation, a non-linear problem can be solved as a linear (straight-line) problem.
Source: https://people.revoledu.com/kardi/tutorial/Regression/nonlinear/NonLinearTransformation.htm

Outliers
An outlier is a data point that differs significantly from the other data points.
Anscombe's Quartet: four (4) datasets with nearly identical descriptive statistics (mean, variance) but strikingly different shapes when graphed.
Source: https://en.wikipedia.org/wiki/Anscombe's_quartet

Collinearity
Collinearity means that two or more predictors are closely related. It is problematic in regression because it is difficult to check how much each predictor influences the output separately.
Source: https://yetanotheriteration.netlify.com/2018/01/high-collinearity-effect-in-regressions/
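One common way to spot collinear inputs is to compute pairwise correlations, as suggested in the data-preparation notes above. A minimal Python sketch using pandas; the file name and the 0.9 threshold are illustrative choices, not from the lecture, and the input table is assumed to contain only numeric columns.

import pandas as pd

df = pd.read_csv("inputs.csv")   # hypothetical table of numeric input variables

corr = df.corr()                 # pairwise Pearson correlations between columns
print(corr)

# Flag highly correlated pairs (|r| > 0.9) as candidates for removal.
high = (corr.abs() > 0.9) & (corr.abs() < 1.0)
pairs = [(a, b) for a in corr.columns for b in corr.columns if a < b and high.loc[a, b]]
print("highly correlated pairs:", pairs)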
Preparing Data for Linear Regression
Source: http://machinelearningmastery.com/linear-regression-for-machine-learning/
Gaussian (normal) distributions. Linear Regression will produce more reliable predictions if your input and output variables have a Gaussian distribution. Certain data transformation techniques can be used to create a distribution that is more Gaussian looking.
(Figure: the Gaussian distribution. Source: http://www.itl.nist.gov/div898/handbook/pmc/section5/gifs/normal.gif)
Rescale input. Linear Regression will often make more reliable predictions if you rescale input variables using standardization or normalization.

Normalization: to change the observations so that they can be described as a normal distribution (also known as the bell curve) – a specific statistical distribution where roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean.
Standardization: also called z-score normalization; it transforms data so that the resulting distribution has a mean of 0 and a standard deviation of 1.
Source: https://kharshit.github.io/blog/2018/03/23/scaling-vs-normalization

Texts and Resources
Unless stated otherwise, the materials presented in this lecture are taken from:
Dietrich, D. (ed.), 2015. Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. EMC Education Services.