DataScW1-W4.pdf
Welcome
COS10022 Data Science Principles

Teaching Materials
Co-developed by: Pei-Wei Tsai ([email protected]), WanTze Vong ([email protected]), Yakub Sebastian ([email protected])

About This Unit
COS10022 Introduction to Data Science / Data Science Principles

Unit Learning Outcomes
Students who successfully complete this Unit should be able to:
1. Appreciate the roles of Data Science and Big Data analytics in organisational contexts.
2. Compare and analyse the key concepts, techniques and tools for discovering, analysing, visualising, and presenting data.
3. Describe the processes within the Data Analytics Lifecycle.
4. Analyse organisational problems and formulate them into data science tasks.
5. Evaluate suitable techniques and tools for specific data science tasks.
6. Develop and execute an analytics plan for a given case study.

Graduate Attributes
This unit may contribute to the development of the following Swinburne Graduate Attributes:
- Communication 1 - Verbal communication
- Communication 2 - Communicating using different media
- Teamwork 1 - Collaboration and negotiation
- Teamwork 2 - Teamwork roles and processes
- Digital Literacies 1 - Information literacy
- Digital Literacies 2 - Technical literacy

Texts and Resources
Unless stated otherwise, the materials presented in this lecture are taken from:
- Dietrich, D. ed., 2015. Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. EMC Education Services.
- Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and techniques. Elsevier.

Minimum requirements to pass this Unit
- Achieve an overall mark for the unit of 50% or more,
- Submit ALL assignments, and
- Participate in ALL online tests.

Assessment Overview

Special Consideration
If you encounter difficulties, such as medical issues, that affect your progress during your studies, you will need to apply for special consideration.
Special consideration only extends the deadline for submitting assignments, within a reasonable range. It does not change the evaluation criteria.

Teaching Team Contact Hours: Monday – Friday (8.30 am to 5.30 pm)
Please note that replies to emails sent after office hours will be delayed.

Overview of Data Science and The Big Data Ecosystem

Key Questions
1. What is Data Science?
2. Who are Data Scientists? What do they do?
3. How does Data Science differ from the traditional analytics frameworks?
4. What is Big Data?
5. What drives Big Data?
6. What are the components in the Big Data Ecosystem?

Learning Outcomes
This lecture supports the achievement of the following learning outcomes:
1. Appreciate the roles of data science and Big Data analytics in business and organisational contexts.
2. Appreciate the key concepts, techniques and tools for discovering, analysing, visualising and presenting data.

What is DATA SCIENCE?

Job landscape
- Data science is still highly sought-after and well-compensated.
- Demand for data scientists has increased significantly, with continued job growth projected.
- The field has become well-established in various industries, including finance, healthcare, and government.
- Educational offerings in data science have expanded substantially.
- Related roles, such as machine learning engineers and data engineers, have proliferated, emphasizing collaboration.
- Technology trends, including automation, have shifted the focus toward predictive modelling and translating business needs.

‘Datafication’
A process of “taking all aspects of life and turning them into data”.
[Figure: three uses of data. (1) To predict objects, faces, gestures, anomalies, etc. (2) To monitor & analyse political trends & sentiments. (3) To predict health conditions & improve medical diagnoses using patient data. Example names are left blank on the slide: A…………., F……........., V……….., Google’s D……M……., P……]
Once we `datafy’ things, we can transform their purpose and turn the information into new forms of value.

Data Science: What is
Mike Driscoll, CEO @ Metamarket
*Why bother with definitions? I hate definitions!

Data Science: What is
Drew Conway’s Data Science Venn diagram (2013)
Link: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Data Science: What is
Michael Malak’s fourth bubble (2014)
Link: http://datascienceassn.org/content/fourth-bubble-data-science-venn-diagram-social-sciences
- Hiring a social-sciences-oriented data scientist who lacks knowledge in your particular business domain
- Trying to analyse website clickstream data without knowing any behavioural science or even any CHI/UX knowledge (computer-human interface and user experience).
[Figures: Venn diagram by Drew Conway; Venn diagram by Michael Malak]

Who are Data Scientists? What do they do?

Data Scientist: What is
Defined by usage
In Academia: “a scientist, trained in anything from social science to biology, who works with large amounts of data, and must grapple with computational problems posed by the structure, size, messiness, and the complexity and nature of the data, while simultaneously solving a real problem.” (O’Neil & Schutt 2014)

Data Scientist: What is
In Industry: “a data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human.” (O’Neil & Schutt 2014)

Data Scientist: What is
Essential sets of skills and behavioral characteristics:
1. Quantitative skill, e.g. mathematics, statistics.
2.
Technical aptitude, e.g. software engineering, machine learning, programming skills.
3. Skeptical mind-set and critical thinking, i.e. capable of examining their work critically rather than in a one-sided way.
4. Curious and creative, i.e. passionate about data and finding creative ways to solve problems and portray information.
5. Communicative and collaborative, e.g. able to articulate business values, a good team player.

How does Data Science differ from the traditional analytics frameworks?

Data Science vs Enterprise Data Warehouse
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than for transaction processing.
- Contains selective, cleaned, and transformed historical data.
- ETL (Extraction, Transformation, and Loading)
- OLAP (On-Line Analytical Processing)
- Supports enterprise decision making
Image: https://upload.wikimedia.org/wikipedia/commons/4/46/Data_warehouse_overview.JPG
Source: https://docs.oracle.com/cd/B10500_01/server.920/a96520/concept.htm

Data Science vs Enterprise Data Warehouse
Limitations of Enterprise DW analytics:
1. High-value data is hard to reach and leverage. Low priority for data science projects.
2. Data usually moves in batches from the DW to local analytics tools (e.g. R, SAS, Excel). In-memory analytics; dataset size constraints. Analysts take a small subset of data from the DW and perform the analysis on their own desktops.
3. Data science projects remain isolated and ad hoc, rather than centrally managed. Data science initiatives usually exist as nonstandard initiatives and are not aligned with corporate strategic business goals.
Source: Dietrich, D. ed., 2015. Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, page 14.

Big Data vs Enterprise DW / Business Intel.
Data that exhibits the four V’s of Big Data does not work well with the traditional Enterprise Data Warehouse:
- Centralized, purpose-built space (lack of agility)
- Supports Business Intelligence and reporting (restricts robust analyses)
- Analysts must depend on the IT group and DBAs for data access (lack of control)
- Analysts must spend significant time aggregating and disaggregating data from multiple sources (reduces timeliness)
To succeed, Big Data analytics requires different approaches.

Analytic Sandbox (Workspaces)
- Resolves the conflict between the needs of analysts and the traditional EDW or other formally managed corporate data.
- Data assets are gathered from multiple sources and technologies for analysis.
- Enables flexible, high-performance analysis in a nonproduction environment.
- Reduces costs and risks associated with data replication into `shadow’ file systems.
- `Analyst owned’ rather than `DBA owned’.

Data Science vs Business Intelligence
[Figure: comparison of Data Science and Business Intelligence]
Source: Dietrich, D. ed., 2015. Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data.

What is BIG DATA?

Big Data: What is
Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.
[Figure labels: Scale, Distribution, Diversity, Timeliness]
(Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. and Byers, A.H., 2011. Big data: the next frontier for innovation, competition, and productivity. The McKinsey Global Institute)

Big Data: What is
Four V’s of Big Data
1. Volume (scale)
2. Variety (diversity, distribution)
3. Velocity (timeliness)
4. Veracity (i.e.
pertaining to the accuracy of data)

Big Data: Data Structures
[Figures: data structures, shown over three slides]
Source: http://www.tsmtutorials.com/2016/06/data-and-information-basics.html

What drives BIG DATA?

Drivers of Big Data
[Figure: drivers of Big Data]
Source: Dietrich, D. ed., 2015. Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data.

What are the components in the Big Data Ecosystem?

Data Devices
- Gather data from multiple locations.
- Continuously generate new data about this data. For each gigabyte of new data created, an additional petabyte of data is created about that data.
Consider: data generated from someone playing an online video game through a PC, game console (PlayStation, Xbox, Nintendo Wii), or smartphone.

Data Collectors
Entities that collect data from the devices and users.
Consider: a cable TV provider tracks
- the shows a person watches
- which TV channels someone will and will not pay for to watch on demand
- the prices someone is willing to pay for premium TV content

Data Aggregators
Entities that compile and make sense of the data collected by data collectors. They transform and package the data as products to sell.
Example: Falcon (website); Falcon Social Listening (YouTube)

Data Users and Buyers
Direct beneficiaries of the data collected and aggregated by others within the data value chain.
Examples:
- Corporate customers
- Analytic services
- Media archives
- Advertising
- Information brokers: a business that collects personal information about consumers and sells that information to other organizations.
- Credit bureaus: provide information that creditors and lenders use to help them make important lending decisions.
- Catalog co-ops
Check out Falcon’s customers: https://www.falcon.io/our-customers/

Big Data: Key roles
- Group 1: Advanced training in quantitative disciplines such as mathematics, statistics and machine learning.
- Group 2: Savvy but less technical than Group 1, such as financial analysts, market research analysts, life scientists.
- Group 3: Those who support the data analytic process, such as DB admins, programmers, etc.

End of Lecture

Linear Regression
COS10022 Data Science Principles

Teaching Materials
Co-developed by: Pei-Wei Tsai ([email protected]), WanTze Vong ([email protected])

Learning Outcomes
This lecture supports the achievement of the following learning outcomes:
3. Describe the processes within the Data Analytics Lifecycle.
4. Analyse business and organisational problems and formulate them into data science tasks.
5. Evaluate suitable techniques and tools for specific data science tasks.
6. Develop an analytics plan for a given business case study.

Data Analytics Lifecycle

Phase 1 - Discovery
The data science team learns the business domain, and assesses the resources available to support the project in terms of people, technology, time, and data.
Important activities include framing the business problem as an analytics challenge that can be addressed in subsequent phases, and formulating initial hypotheses (IHs) to test and begin learning the data.
Personal Loan Approval Prediction Model: Conduct stakeholder interviews to gather comprehensive requirements and objectives for the personal loan approval model.
Understand what factors influence loan approval decisions and what business goals (e.g., reducing default rates, increasing customer satisfaction) the model should aim to support.

Phase 2 – Data Preparation
Phase 2 requires the presence of an analytics sandbox, in which the data science team works with data and performs analytics for the duration of the project.
The team performs ETLT to get the data into the sandbox, and familiarizes itself with the data thoroughly.
ETL + ELT = ETLT (Extract, Transform and Load)
Personal Loan Approval Prediction Model: Collect historical loan application data and perform data cleaning. Handle missing values, remove outliers, and ensure the data is in a suitable format for analysis. Standardize the format of applicant income, employment history, credit score, and other relevant features.

Phase 3 – Model Planning
The data science team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase.
The team also explores the data to learn about the relationships between variables, and subsequently selects key variables and the most suitable models.
Personal Loan Approval Prediction Model: Select a set of machine learning algorithms to test for predicting loan approval outcomes. This might include logistic regression for its interpretability in binary outcomes like approval/rejection, decision trees for their ability to handle nonlinear relationships, or more complex models like random forests or gradient boosting machines if the problem requires capturing complex patterns in the data.

Phase 4 – Model Building
The data science team develops datasets for testing, training, and production purposes.
The team builds and executes models based on the work done in the model planning phase.
The team also considers whether the existing tools are sufficient to run the models, or whether a more robust environment for executing the models is needed (e.g. fast hardware, parallel processing, etc.).
Personal Loan Approval Prediction Model: Perform feature engineering to create new variables that might better capture the risk associated with a loan application, such as debt-to-income ratio. Split the data into training, validation, and test sets, then train different models on the training set while tuning hyperparameters, and select the best model based on performance on the validation set.

Phase 5 – Communicate Results
The data science team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1.
The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
Personal Loan Approval Prediction Model: Develop a dashboard that visualizes the model's predictions, performance metrics (like accuracy, precision, recall), and the importance of different features in the decision-making process. This helps non-technical stakeholders understand how the model makes decisions.

Phase 6 – Operationalize
The data science team delivers final reports, briefings, code, and technical documents.
The team may run a pilot project to implement the models in a production environment.
Personal Loan Approval Prediction Model: Prepare a presentation or report detailing the model's performance metrics, how it was developed, its expected impact on the loan approval process, and guidelines for its deployment and use in the operational environment. Setting up a process for continuous monitoring and updating of the model as new data becomes available is critical to ensure its ongoing accuracy and relevance.
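The Phase 4 example above (engineering a debt-to-income feature, then splitting the data into training, validation, and test sets) could be sketched roughly as follows. This is a minimal illustration in plain Python rather than the unit's prescribed method: the records, the field names (`income`, `monthly_debt`, `approved`, `dti`), the helper functions, and the 60/20/20 split are all assumptions made for the example.

```python
import random

# Hypothetical loan applications; every field name and value is invented for illustration.
applications = [
    {"income": 4000 + 250 * i, "monthly_debt": 500 + 40 * (i % 7), "approved": i % 3 != 0}
    for i in range(20)
]


def add_debt_to_income(rows):
    """Feature engineering: return copies of the rows with a debt-to-income ratio added."""
    return [{**row, "dti": row["monthly_debt"] / row["income"]} for row in rows]


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle the rows and cut them into non-overlapping training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, validation, test


features = add_debt_to_income(applications)
train, validation, test = split_dataset(features)
print(len(train), len(validation), len(test))
```

Candidate models would then be fitted on `train`, compared on `validation`, and the chosen model scored once on `test`. The split fractions here are arbitrary choices.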
Lecture: Key Questions
- What are the distinctions between the model planning and model building phases?
- What are some key considerations in model building?
- What software tools (commercial, open source) are typically used at this phase?
- What is the Linear Regression model, and in what situations is it appropriate?
- How does the Linear Regression model work for predictive modelling tasks?
- How do we prepare our data prior to applying the Linear Regression model?

Phase 4 – Model Building
Key activities: Develop an analytical model, fit it on the training data, and evaluate its performance on the test data.
The data science team can move to the next phase if the model is sufficiently robust to solve the problem, or if the team has failed.

Phase 4 – Model Building
In this phase, an analytical model is developed and fit on the training dataset, and subsequently evaluated against the test dataset.
The model planning (Phase 3) and model building phases can overlap, with the data science team iterating back and forth between the two before settling on the final model.
By ‘developed’, we do not always mean coding an entirely new analytics model from scratch. Rather, this usually involves selecting and experimenting with various models and, where applicable, fine-tuning their parameters.
Although some modelling techniques can be quite complex, the actual duration of model building can be short in comparison with the time spent preparing data and planning the model.

Phase 4 – Model Building
Documentation is important at this stage.
Examples of documentation:
- Data Transformation: Note decisions on data transformation, missing values, feature scaling, and new feature creation.
- Algorithm Selection: Justify the choice of machine learning algorithms, citing supporting research or tests.
- Model Configuration: Document algorithm settings, hyperparameters, and tuning methods used.
- Training Process: Outline model training steps, data splitting, and overfitting prevention techniques.
- Evaluation Metrics: Detail metrics for model performance assessment and their selection rationale.
- Results Comparison: Summarize model performance results and visualizations for comparisons.
- Final Model Criteria: Explain the final model choice based on performance, interpretability, and practicality.
- Error Analysis: Detail any error analysis performed, including common types of errors the model makes, insights from analyzing the errors, and potential strategies to address them.
- Version Control: Maintain version history for models, data, and code in a control system.
- Dependencies: List necessary external libraries and their versions.
When immersed in the details of building models and transforming data, many small decisions are often made about the data and the approach for modeling. These details can be easily forgotten once the project is completed.
It is vital to record the results and logic of the model during this phase. One must also take care to record any operating assumptions made concerning the data or the context during the modeling process.

Phase 4 – Model Building (reproduced from Lecture 03)
Commercial tools used in this phase:
- SAS Enterprise Miner – allows users to run predictive and descriptive models based on large volumes of data from across the enterprise.
- IBM SPSS Modeler – offers methods to explore and analyze data through GUI.
- Matlab – provides a high-level language for performing a variety of data analytics, algorithms, and data exploration.
- Chorus 6 – provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end.
… and many other well-regarded data mining tools, e.g. STATA, STATISTICA, Mathematica.

Phase 4 – Model Building (reproduced from Lecture 03)
Open source tools:
- R and PL/R – PL/R is a procedural language for PostgreSQL with R, which allows R commands to be executed in-database.
- Octave – programming language for computational modeling, with some of the functionalities of Matlab.
- WEKA – data mining package with an analytic workbench and rich Java API.
- Python – offers rich machine learning and data visualization packages: scikit-learn, NumPy, SciPy, pandas and matplotlib.
- MADlib – provides a machine learning library of algorithms that can be executed in-database, for PostgreSQL or Greenplum.

Predictive Models
Predictive models are data analytics models/algorithms/techniques used for predicting certain attributes of a given object.
Examples:
1. A predictive model can be used for guessing whether a customer will “subscribe” or “not subscribe” to a certain product or service.
2. Alternatively, a predictive model may be used to predict whether a patient would “survive” or “not survive” a specific disease.
The goal of a predictive model greatly differs from the goal of unsupervised models (e.g. K-Means Clustering), which are limited to finding specific patterns or structures within the data (e.g. clusters or segments).
[Figure: two panels labelled “Predicting Class Label” and “Group similar data points together”]

Predictive Models
Predicting a categorical attribute (a class label) of an object is usually framed as a classification problem.
In a classification problem, a model is presented with a set of data examples that are already labeled (training dataset). After learning from these examples, the model then attempts to label a new, previously unseen set of data (test dataset).
Because they learn from a labeled training set, most classification models are categorized as supervised models.
[Figure: a table of input variables and an output / class variable. The training set has known class labels (‘yes’, ‘no’); the test set has class labels to be predicted.]

Training and Test sets
Training dataset: the portion of data used to discover a predictive relationship.
Test dataset: the portion of data used to assess the strength and utility of a predictive relationship.
The training and test datasets are usually independent from each other (non-overlapping).
In addition to splitting a dataset into training and test sets, it is also common to set aside a certain portion of the dataset as a validation set to improve the performance of a model.
Validation dataset: the portion of data that is used to minimize the possible overfitting of a model and select the optimal model parameters (more on this in the next lecture).
There is no general rule for how you should partition the data!

Linear Regression Model
Linear Regression is considered one of the oldest supervised/predictive models (more than 200 years old).
Its goal is to understand the relationship between input and output variables.
The model assumes that the output variable (i.e. predicted variable) is numerical and that a linear relationship exists between the input variables and the single output variable.
The value of the output variable is calculated from a linear combination of the input variables.
Advantages: (a) simplicity; (b) gives optimal results when the relationships between the input and output variables are linear.
Disadvantages: (a) limited to predicting numerical values; (b) does not model non-linear relationships unless the data is transformed first.

Linear Regression Model
$Y = aX + b$, where $Y$ is the output (y) and $X$ is the input (x).
[Figure: sample of a linear relationship between weight and height data.]
Source: http://machinelearningmastery.com/linear-regression-for-machine-learning/

Linear Regression Model
Linear Regression belongs to what is called the parametric learning or parameter modeling approach.
Following this approach, building a predictive model starts with specifying the structure of the model with certain numeric parameters left unspecified. The objective of model building is to estimate the best values for these parameters from the training data, e.g. $a$ and $b$ in $Y = aX + b$.
Linear Regression’s model involves a set of parameterized numerical attributes. These attributes can be chosen based on domain knowledge regarding which attributes are likely to be informative in predicting the target variable, or based on more objective methods such as attribute selection techniques.

Linear Regression Model
The Linear Regression model:
$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots$$
where:
- $y$ is the predicted output variable;
- $w_0$ is the bias coefficient / intercept (the value of $y$ when all input variables are zero);
- $w_1, w_2, \dots$ are the parameters / weights / coefficients of the input values that need to be estimated from the training data; and
- $x_1, x_2, \dots$ are values of the input variables.
(The slide annotates $y$, $x_1$ and $x_2$ with the example variables weight, height and age.)

Linear Regression Model – Example
Source: http://onlinestatbook.com/2/regression/intro.html
Given the following data:
[Table: X and Y values for five data points]
Task. Build a simple Linear Regression model that predicts the value of Y when the value of X is known.

Linear Regression Model – Example
Source: http://onlinestatbook.com/2/regression/intro.html
Building a simple Linear Regression is similar to finding the best-fitting straight line (called the regression line) through the existing data points.
In the diagram, the straight black line is the resulting regression line. Points along this line represent the predicted values of Y given a value of X.
[Figure: scatterplot of five coloured data points (Y) with the regression line and the predicted values (Y’) marked. Value labels on the plot: 1.00, 1.635, 2.060, 2.485, 2.910.]
Observe.
The regression line represents our best estimation of the actual values of Y (the coloured data points) and does not need to cross exactly over all of the actual points on the scatterplot. Otherwise, we might end up with an overfitting problem.
Note that the line passes quite closely to the red data point; in contrast, it is situated quite far from the yellow data point.

Linear Regression Model – Example
Source: http://onlinestatbook.com/2/regression/intro.html
Since there are many possibilities for drawing a regression line through the coloured data points, there must be a way to decide on the best-fitting regression line.
Linear Regression solves this by finding the regression line that minimizes the prediction error (hence, an optimization problem). A common measure of such error is the sum of the squared errors (SSE).
Observe.
Table 2 shows the predicted Y values (Y’) based on the previous regression line, given each value of X.
$Y - Y'$ is the error of prediction (it can be positive or negative).
$(Y - Y')^2$ is the squared error value. Adding up these values for the five data points gives the sum of the squared errors: SSE = 2.791.
SSE indicates how much of the variation in the dependent variable (Y) is not explained by the model.
$R^2$ indicates how well the model fits the data.

Linear Regression Model – Example
Source: http://onlinestatbook.com/2/regression/intro.html
The previous regression line is modelled using the following equation:
$$y = 0.785 + 0.425x$$
For example:
for $x = 1$, $y = 0.785 + 0.425(1) = 1.21$
for $x = 2$, $y = 0.785 + 0.425(2) = 1.635 \approx 1.64$

Linear Regression Model – Example
Source: http://onlinestatbook.com/2/regression/intro.html
How did we calculate the previous Linear Regression equation in the first place?
Five statistics are required:
- mean of X: $\mu_x$
- mean of Y: $\mu_y$
- standard deviation of X: $s_x$
- standard deviation of Y: $s_y$
- Pearson’s correlation coefficient: $r_{xy}$
The mean is
$$\mu_x = \frac{\sum_{i=1}^{N} x_i}{N}$$
where $x_i$ is an input value and $N$ is the total number of values in a given input variable $x$.

Linear Regression Model – Example
Source: http://onlinestatbook.com/2/regression/intro.html
Standard deviation. Standard deviation measures how far a set of numbers is spread out from its average value (mean).
$$s_x = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu_x)^2}{N - 1}} \qquad s_y = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu_y)^2}{N - 1}}$$
The part of the equation under the square root is called the ‘sample variance’.
Source: https://www.biologyforlife.com/standard-deviation.html

Linear Regression Model – Example
Source: http://onlinestatbook.com/2/regression/intro.html
Pearson’s correlation coefficient.
$$r_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{N} (x_i - \mu_x)^2 \cdot \sum_{i=1}^{N} (y_i - \mu_y)^2}}$$
where $N$ is the total number of data points.
Pearson’s correlation coefficient measures the strength of association between two variables.
Source: https://www.spss-tutorials.com/pearson-correlation-coefficient/

Linear Regression Model – Example
Source: http://onlinestatbook.com/2/regression/intro.html
The resulting statistics:
$\mu_x = 3.00$, $\mu_y = 2.06$, $s_x = 1.581$, $s_y = 1.072$, $r_{xy} = 0.627$
Linear Regression formula:
$$w_x = r_{xy} \cdot \frac{s_y}{s_x} = 0.425$$
$$w_0 = \mu_y - w_x \mu_x = 2.06 - (0.425)(3) = 0.785$$
$$y = 0.785 + 0.425x$$

Ordinary Least Squares Regression
The previous example illustrates a simple linear regression where we only have a single input variable.
When there is more than one input variable, ordinary least squares regression is used to estimate the parameter value (i.e. the coefficient / weight) of each input variable.
Similar to finding the best-fitting regression line, the goal here is to fine-tune these parameters such that they minimize the sum of the squared errors over all data points.
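The five statistics and the slope and intercept formulas above can be checked numerically. The sketch below uses only Python's standard library. The X and Y values are the five data points from the cited onlinestatbook page and are assumed here to be the same data the slides use; the variable names (`w_x`, `w_0`, `sse`, and so on) are chosen for this example.

```python
from math import sqrt
from statistics import mean, stdev

# Data points from the cited onlinestatbook example (assumed to match the slide's table).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.00, 2.00, 1.30, 3.75, 2.25]

# The five statistics.
mu_x, mu_y = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviation (divides by N - 1)
cross = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
ss_x = sum((x - mu_x) ** 2 for x in xs)
ss_y = sum((y - mu_y) ** 2 for y in ys)
r_xy = cross / sqrt(ss_x * ss_y)  # Pearson's correlation coefficient

# Slope and intercept of the regression line y = w_0 + w_x * x.
w_x = r_xy * s_y / s_x
w_0 = mu_y - w_x * mu_x

# Predicted values, sum of squared errors, and R^2 = 1 - SSE / total variation in Y.
predictions = [w_0 + w_x * x for x in xs]
sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))
r_squared = 1 - sse / ss_y

print(f"mu_x={mu_x:.2f} mu_y={mu_y:.2f} s_x={s_x:.3f} s_y={s_y:.3f} r_xy={r_xy:.3f}")
print(f"y = {w_0:.3f} + {w_x:.3f} x   SSE = {sse:.3f}   R^2 = {r_squared:.3f}")
```

Rounded to three decimals, this data reproduces the statistics, the equation $y = 0.785 + 0.425x$, and the SSE of 2.791 quoted in the example. For simple linear regression, $R^2$ equals the square of $r_{xy}$.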
As this is an optimization problem, in practice you hardly need to do this manually. Most data science software packages include linear regression functionality to solve this optimization task easily.

Preparing Data for Linear Regression
Source: http://machinelearningmastery.com/linear-regression-for-machine-learning/
Linear assumption. Linear Regression assumes that the relationships between the input and output variables are linear. For non-linear data, e.g. an exponential relationship, a data transformation technique such as the log transform is needed.
Remove noise and outliers. Linear Regression assumes that the data is clean. Apply appropriate data cleaning techniques to remove possible noise and outliers.
Examples of noisy data: incorrect attribute values due to faulty data collection instruments, data entry problems, inconsistency in naming convention, etc.
Remove collinearity. Collinearity is caused by having too many variables trying to do the same job. The Occam’s Razor principle states that among several possible explanations for an event, the simplest explanation is the best. Consequently, the simpler our model is, the better. Consider calculating pairwise correlations for your input data and removing the most highly correlated variables.

Non-Linear Transformation:
A linear curve has a straight-line relationship. Using a non-linear transformation, a non-linear problem can be solved as a linear (straight-line) problem.
Source: https://people.revoledu.com/kardi/tutorial/Regression/nonlinear/NonLinearTransformation.htm

Outliers: A data point that differs significantly from other data points.
Anscombe’s Quartet: four (4) datasets with nearly identical descriptive statistics (mean, variance) but strikingly different shapes when graphed.
Source: https://en.wikipedia.org/wiki/Anscombe's_quartet

Collinearity: Two or more predictors are closely related.
- Problematic in regression because it is difficult to determine how much each predictor influences the output separately.
[Figure: scatter plots of predictor pairs, X1 against X2 and X3 against X2.]
Source: https://yetanotheriteration.netlify.com/2018/01/high-collinearity-effect-in-regressions/

Preparing Data for Linear Regression
Source: http://machinelearningmastery.com/linear-regression-for-machine-learning/

Gaussian (normal) distributions. Linear regression will produce more reliable predictions if your input and output variables have a Gaussian distribution. Certain data transformation techniques can be used to make a distribution look more Gaussian.
[Figure: Gaussian distribution. Source: http://www.itl.nist.gov/div898/handbook/pmc/section5/gifs/normal.gif]

Rescale input. Linear regression will often make more reliable predictions if you rescale the input variables using standardization or normalization.

Normalization: Changing the observations so that they can be described by a normal distribution (also known as the bell curve).
- A normal distribution is a specific statistical distribution in which roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and more observations lie close to the mean.

Standardization: Also called z-score normalization. It transforms the data so that the resulting distribution has a mean of 0 and a standard deviation of 1.
Source: https://kharshit.github.io/blog/2018/03/23/scaling-vs-normalization

Texts and Resources
Unless stated otherwise, the materials presented in this lecture are taken from:
Dietrich, D. ed., 2015. Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. EMC Education Services.
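To make the standardization (z-score) step from this lecture concrete, here is a minimal sketch using only the Python standard library. The function name `standardize` and the sample values are illustrative assumptions; the sample standard deviation (N − 1 denominator) is used to match the formula given earlier in the lecture.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: subtract the mean, divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (N - 1 in the denominator)
    return [(v - mu) / sigma for v in values]


# Illustrative input values (an assumption, chosen only for the demonstration).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
z = standardize(x)

print([round(v, 3) for v in z])
print("mean:", round(mean(z), 6), "std:", round(stdev(z), 6))
```

After the transformation the values have mean 0 and standard deviation 1, so input variables measured on very different scales become directly comparable.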
Model Selection, K-Means Clustering, DBSCAN
COS10022 Data Science Principles
Data Science – Lecture 06
Teaching Materials Co-developed by: Pei-Wei Tsai ([email protected]), WanTze Vong ([email protected])

Model Selection

Data Analytics Lifecycle, Phase 3: Model Planning
The data science team determines (plans) the methods, techniques, and workflow it intends to follow for the subsequent model building phase. The team also explores the data in order to learn about the relationships between variables, and subsequently selects key variables and the most suitable models. The data science team can move to the next phase once it has a good idea about the type of model to try and has gained enough knowledge to refine the analytics plan.

Model Selection
This activity aims at choosing an analytical technique (a model), or a short list of candidate techniques, based on the end goal of the project.

What is a "model"? A model is an abstraction from reality; an attempt to understand the reality. Examples:
- The connections among plants, insects and animals in the food chains in an ecosystem.
- Whether moving objects are less likely to be struck by lightning than stationary objects.

Modeling involves observing certain events happening in a real-world situation and attempting to construct models that emulate this behavior with a set of rules and conditions.

Example (Law of Demand): The purple line shows that as the price increases, the quantity supplied increases too. The green line shows that the higher the price of a good, the lower the quantity demanded. The equilibrium point is where the quantity demanded is equal to the quantity supplied. This is the point where the seller would need to decide whether to increase, decrease, or otherwise adjust the price.

Example (Association rules): Association rules are used to identify the best possible combination of products or services which are frequently bought together by customers.
Model Selection
Often in a model, all extraneous detail has been removed or abstracted. Consequently, we must pay attention to these abstracted details after a model has been analyzed to see what might have been overlooked.

In statistical modeling (which we'll be most concerned with in this unit), the terms 'model', 'algorithm', 'analytical technique', and 'analytical method' are often used interchangeably.

When thinking about a model, it is useful to think about the process underlying the model:
- What comes first?
- What influences what?
- What causes what?
- What is a test of what?

Think about the previously shown Supply-Demand curve.
- What influences what? What causes what? The curve shows that the quantity demanded is inversely related to the price. When the price increases, the quantity demanded falls; when the price falls, the quantity demanded increases.
- What is a test of what? Applying the supply-demand model to a population, we can ask: "How do changes in consumer wealth and population affect the market demand of a product or service?"

There are many ways to express models. Some prefer to express a model in terms of math. Others like to use pictures, graphs, and diagrams which show how things affect other things or what happens over time. In statistical modeling, however, most models end up as mathematical equations.

In mathematical expressions, the convention is to use Greek letters for parameters and Latin letters for data/variables. Example: if you have two columns of data, x and y, and you think there is a linear relationship, you can mathematically express a model that describes that relationship as follows:

$$y = \beta_0 + \beta_1 x$$

Here $x$ and $y$ are data, and $\beta_0$ and $\beta_1$ are parameters.

Model Selection
Key considerations when selecting models in Phase 3 of the Data Analytics Lifecycle:
1.
The structure of the data (structured, semi-structured, quasi-structured, unstructured)
   - Structured: Regression models for housing price predictions from a CSV dataset with house features.
   - Semi-structured: Decision trees for sentiment analysis from JSON or XML files.
   - Quasi-structured: Text mining for structuring and analyzing inconsistent web logs.
   - Unstructured: CNNs for image recognition tasks.
2. The extent to which the model would be suitable for proving / disproving the hypotheses.
   Example: If the hypothesis is that customer satisfaction impacts churn rate, you might use logistic regression to determine the probability of churn based on customer satisfaction scores.
3. Whether the analytical needs warrant a series of models to be used.
   Example: In a scenario where you are trying to understand customer behaviour, you might start with clustering algorithms to segment the customer base, followed by decision trees to understand the factors driving different customer behaviours within each segment, and conclude with predictive models like random forests to forecast the future purchase patterns of each segment.

Model Selection
Brownlee, J 2016, 'Supervised and Unsupervised Machine Learning Algorithms', Machine Learning Mastery. Link: http://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

The latest KDnuggets poll asked data scientists which models/algorithms they had used in the past 12 months for an actual data science-related application.

[Figure: models grouped by type.]
- Unsupervised, e.g. k-Means Clustering, DBSCAN, Association Rules Mining
- Supervised, e.g. Linear Regression, Naïve Bayes Classifier, Decision Tree, Random Forest

Supervised vs Unsupervised vs Semi-supervised models
Supervised models take a set of input variables (X) and an output variable (Y) and try to learn the mapping function from the inputs to the output:
- Training set: Inputs + Outputs (i.e. the correct answers)
- Test set: Inputs + … (the model needs to guess the correct answers)

Two main categories of supervised models:
1. Classification (categorical output, e.g. "Fail", "Pass")
2. Regression (real values/numerical output, e.g. 0.05, $250)

[Figure: supervised learning workflow.]
- Feature extraction: Converting input data into a set of numerical values that capture the essential information from the raw data.
  - Image processing: Identify edges, corners, colours, etc.
  - Text analysis: Frequency of words, encoding the meaning of words into vectors.
- Model training: The learning algorithm finds patterns in the training data that map the input to the output. The training input is X = {x1, x2, x3, ...} together with the correct answer y; for another X, the trained model returns a predicted y.
- Classification: categorical output (class label).
- Regression: numerical output.

Supervised vs Unsupervised vs Semi-supervised models
Unsupervised models work with a set of input variables (X) but no output variable.
The goal is to find the underlying and 'interesting' structure or distribution in the data in order to learn more about the data.
- No training and test set.
- No correct answer and no "teacher".

Two main categories of unsupervised models:
1. Clustering
2. Association

Brownlee, J 2016, 'Supervised and Unsupervised Machine Learning Algorithms', Machine Learning Mastery. Link: http://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

[Figure: unsupervised learning. An input X = {x1, x2, x3, ...} is passed to clustering or to dimensional reduction, which produces another X.]

Supervised vs Unsupervised vs Semi-supervised models
Semi-supervised models address situations where there is a set of input variables (X) but the values of the output variable (Y) are only partially available. An example is a model that must automatically classify images from a photo archive where only some of the images are labelled (e.g. dog, cat, person) while the majority are not.

A possible strategy:
- Train a supervised model on the labelled data, and then
- use the same supervised model to make the best prediction of the output for the unlabelled data.
Once trained, apply the supervised model to make predictions on the test data.

Semi-Supervised Learning Workflow
Principle of semi-supervised learning:
1. A model (e.g. a classifier) is first trained on the few available labelled training data.
2. This model is then used to classify, and thus label, the many unlabelled data available.
3.
The newly labelled data are combined with the originally available labelled ones to retrain the model with many more data, and thus hopefully to obtain a better model.

Exploratory Model: k-Means Clustering

Exploratory model: K-Means Clustering
Exploratory = Unsupervised
Clustering methods aim at discovering natural groupings of objects of interest (e.g. customers, images, documents). Generally, this objective is achieved by:
1. Finding the similarities between the objects based on their attributes/properties/variables.
2. Grouping similar objects into clusters.

Popular methods: k-means clustering

Use cases:
- Customer segmentation: grouping customers according to the similarity of their behaviors or spending patterns.
- Automatic identification of abnormal activities in CCTV videos: process video data to learn what typical activity patterns look like, and then identify anomalies or unusual patterns which might indicate suspicious behaviour.
- Summarizing news articles: group news articles into clusters based on word frequency, allowing the identification of central themes and the extraction of key phrases that represent the main content of each cluster.
- and many more.

K-Means Clustering: the algorithm
The K-Means clustering model assumes the following:
- A collection of M objects or data points, each with n attributes/properties/variables.
- A pre-determined number k of clusters to be found (i.e. K-Means requires that you decide the number of clusters you need).

[Figure: a dataset of M = 10 data points (m1 to m10) with n = 3 attributes, grouped into k = 2 clusters: Cluster 1 (m1 to m5) and Cluster 2 (m6 to m10).]

Given a dataset with two attributes (n = 2), an object or data point corresponds to a point (xi, yi) in a Cartesian plane, where x and y denote the two attributes and i = 1, 2, …, M.

For a given cluster containing m data points (m ≤ M), a centroid is the point in the Cartesian plane which corresponds to the mean value of that cluster.

The basic K-Means clustering algorithm consists of 4 main steps:
Step 1: Choose the value of k and the k initial guesses for the centroids (also known as the 'center of mass').
Step 2: Compute the distance from each data point (xi, yi) to each centroid. Assign each point to the closest centroid. This association defines the first k clusters.
Step 3: Compute a new centroid for each cluster defined in Step 2.
Step 4: Repeat Steps 2 and 3 until the algorithm converges to an answer.

The initialization in Step 1 can be done using two different methods:
- Forgy: set the positions of the k centroids to k randomly chosen data points.
- Random partition: randomly assign each observation to a cluster and compute the initial centroids in a manner similar to Step 3.

K-Means Clustering: the algorithm
Step 1: Choose the value of k and the k initial guesses for the centroids (also known as the 'center of mass').
Example: Assume k = 3.

Step 2: Compute the distance from each data point (xi, yi) to each centroid. Assign each point to the closest centroid. This association defines the first k clusters.
The distance d between a data point (xi, yi) and a centroid (xC, yC) is calculated using the Euclidean distance measure:

$$d = \sqrt{(x_i - x_C)^2 + (y_i - y_C)^2}$$

Step 3: Compute a new centroid for each cluster defined in Step 2.
For each cluster, the coordinate of its new centroid is calculated as an ordered pair of the arithmetic means of the coordinates of the m data points in that cluster, as follows:

$$(x_{new\_C},\, y_{new\_C}) = \left( \frac{\sum_{i=1}^{m} x_i}{m},\; \frac{\sum_{i=1}^{m} y_i}{m} \right)$$

Step 4: Repeat Steps 2 and 3 until the algorithm converges to an answer.
* Convergence is reached when the computed centroids no longer change.
[Figure: observe how the coordinates of the centroids no longer change between Iteration 9 and the 'Converged!' step.]

K-Means Clustering: the algorithm
Just another example. Source: https://en.wikipedia.org/wiki/K-means_clustering

Initializing k-means Clustering
Play with this online simulation.
- Forgy Method: Set the positions of the k centroids to k randomly chosen data points.
- Random Partition Method: Randomly assign each observation to a cluster.

Bad Examples
The k-means algorithm works reasonably well when the data fits the cluster model:
- The clusters are spherical: the data points in a cluster are centred around that cluster.
- The spread/variance of the clusters is similar: each data point belongs to the closest cluster.
[Figure, top left: although the clusters have the same scatter (in fact, the same shape), they are not spherical.]
[Figure, bottom left: although the clusters have spherical shapes, they have different scatters.]

K-Means Clustering: Distance Calculation
[Figures: worked distance calculations.]

K-Means Clustering: generalizing the algorithm to a higher dimensional space
The previous example illustrates the K-Means clustering algorithm on 2-dimensional data points (n = 2). The algorithm needs to be slightly modified so that it generalizes to higher dimensional data points (n > 2).
Assume that we have M data points, where each data point has n attributes or properties (p1, p2, … , pn). As such, data point i is described by (pi1, pi2, … , pin) for i = 1, 2, …, M. The previous equation for d is then generalized as follows:

$$d(p_i, q) = \sqrt{\sum_{j=1}^{n} (p_{ij} - q_j)^2}$$

where:
- $p_i$ is a data point at (pi1, pi2, … , pin)
- $q$ is a centroid located at (q1, q2, … , qn)

Likewise, to compute the new coordinate of a centroid q of a cluster containing m data points with attributes (pi1, pi2, … , pin) (see Step 3 again), the following formula applies:

$$(q_1, q_2, \ldots, q_n) = \left( \frac{\sum_{i=1}^{m} p_{i1}}{m},\; \frac{\sum_{i=1}^{m} p_{i2}}{m},\; \ldots,\; \frac{\sum_{i=1}^{m} p_{in}}{m} \right)$$

where (q1, q2, …, qn) represents the new coordinate of centroid q.

K-Means Clustering: choosing the value of k
As mentioned before, the K-Means clustering model assumes that you already know the 'right' number k of clusters to be found before executing the clustering model. In practice, the value of k can be determined by:
- a 'reasonable' guess;
- predefined requirements, e.g. a company wishes to segment its customers into exactly 5 clusters given its traditional way of grouping customers;
- using the Within Sum of Squares (WSS) metric as a heuristic. WSS is the total sum of the squared distances between each data point in a cluster and the centroid of that cluster. The basic idea is that if having k+1 clusters does not greatly reduce the value of WSS compared to having k clusters, then there is little benefit in adding another cluster.
Source: http://www.learnbymarketing.com/methods/k-means-clustering/

The following questions should be considered:
- Are the clusters well separated from each other?
- Do any of the clusters have only a few points?
- Do any of the centroids appear to be too close to each other?
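The four steps of the algorithm, together with the n-dimensional distance and centroid formulas above, can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the `seed` and `max_iter` parameters, and the sample points are all assumptions, and Forgy initialization is used for Step 1.

```python
import math
import random


def euclidean(p, q):
    """Generalized Euclidean distance d(p, q) over n attributes."""
    return math.sqrt(sum((pj - qj) ** 2 for pj, qj in zip(p, q)))


def mean_point(points):
    """Centroid of a cluster: the attribute-wise arithmetic mean."""
    m = len(points)
    return tuple(sum(coords) / m for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1 (Forgy): use k randomly chosen data points as initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid (keep the old one if a cluster is empty).
        new_centroids = [
            mean_point(cluster) if cluster else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Illustrative 3-attribute data: two well-separated groups (made up for the demo).
data = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0),
        (10.0, 10.0, 10.0), (10.0, 11.0, 10.0), (11.0, 10.0, 10.0)]
centroids, clusters = k_means(data, k=2)
print(centroids)
print([len(c) for c in clusters])
```

Because the initial centroids are chosen at random, different seeds can lead to different clusterings on less clearly separated data, which is one reason the algorithm is often run several times.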
Within Sum of Squares (WSS)
The WSS metric is the sum of the squares of the distances between each data point and its closest centroid:

$$WSS = \sum_{i=1}^{M} d(p_i, q^{(i)})^2 = \sum_{i=1}^{M} \sum_{j=1}^{n} \left( p_{ij} - q_j^{(i)} \right)^2$$

The term $q^{(i)}$ indicates the closest centroid that is associated with the $i$th point. If the points are relatively close to their respective centroids, the WSS is relatively small. If $k + 1$ clusters do not greatly reduce the value of WSS from the case with only $k$ clusters, there may be little benefit in adding another cluster.

The process of identifying the appropriate value of k is referred to as finding the "elbow" of the WSS curve.
[Figure: WSS plotted against k. The elbow of the curve appears to occur at k = 3.]

K-Means Clustering: choosing the value of k
Silhouette analysis can be used to determine the degree of separation between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighbouring clusters. For each sample i:
- Compute the average distance to all other data points in the same cluster ($a_i$).
- Compute the average distance to all data points in the closest neighbouring cluster ($b_i$).
- Compute the coefficient: $s_i = \dfrac{b_i - a_i}{\max(a_i, b_i)}$

The coefficient can take values in the interval [-1, 1]:
- If it is 0, the sample is very close to the neighbouring clusters.
- If it is 1, the sample is far away from the neighbouring clusters.
- If it is -1, the sample is assigned to the wrong cluster.
A good cluster has a coefficient close to 1.

Reasons to Choose and Cautions
Decisions the practitioner must make:
- What object attributes should be included in the analysis?
- What unit of measure should be used for each attribute?
- Do the attributes need to be rescaled?
- What other considerations might apply?
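Returning to the two measures used for choosing k, the sketch below computes WSS for a given set of centroids and the silhouette coefficient for a single sample. The function names and the one-dimensional-looking sample points are illustrative assumptions; the silhouette formula used is the standard (b − a) / max(a, b).

```python
def wss(points, centroids):
    """Within Sum of Squares: squared distance from each point to its closest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((pj - qj) ** 2 for pj, qj in zip(p, q)) for q in centroids
        )
    return total


def silhouette(a, b):
    """Silhouette coefficient of one sample.

    a: average distance to the other points in the same cluster
    b: average distance to the points in the closest neighbouring cluster
    """
    if a == 0 and b == 0:
        return 0.0
    return (b - a) / max(a, b)


# Illustrative data (made up): two tight groups on a line.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

wss_k1 = wss(points, [(6.0, 0.0)])              # one centroid at the overall mean
wss_k2 = wss(points, [(1.0, 0.0), (11.0, 0.0)])  # one centroid per group

print(wss_k1, wss_k2)
print(silhouette(1.0, 4.0))
```

Here moving from one to two centroids reduces WSS sharply, which is the kind of drop the elbow heuristic looks for; a third centroid could only reduce it by a small further amount.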
Reasons to Choose and Cautions: Object Attributes
- It is important to understand what attributes will be known at the time a new object is assigned to a cluster.
  - E.g., customer satisfaction may be available for modeling but not available for potential customers.
- It is best to reduce the number of attributes when possible.
  - Too many attributes minimize the impact of key variables.
  - Identify highly correlated attributes for reduction.
  - Combine several attributes into one: e.g., debt/asset ratio.

Reasons to Choose and Cautions: Highly Correlated Object Attributes
When there are too many attributes, one useful approach is to identify any highly correlated attributes and use only one or two of the correlated attributes in the clustering analysis.
[Figure: scatterplot matrix for seven attributes.]

Reasons to Choose and Cautions: Units of Measure
The k-means algorithm will identify different clusters depending on the units of measure.

Reasons to Choose and Cautions: Additional Considerations
- Distance metrics other than Euclidean could be explored, e.g. cosine, Manhattan, etc.
- K-means is easily applied to numeric data and does not work well with nominal attributes (e.g. color); for nominal attributes, use "k-Modes".

Exploratory Model: DBSCAN

A Density-Based Notion of Clusters
We can easily detect clusters and noise by eye. The reason is that the density of points is higher within a cluster than outside it.
Source: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu (1996)

We can formalize this intuitive notion of clusters and noise. The key idea is that, for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points; that is, the density in the neighbourhood has to exceed some threshold. The shape of a neighbourhood is determined by the choice of a distance function for two points p and q, denoted by dist(p, q).
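Both the k-means caution about distance metrics and the dist(p, q) function in the density-based notion come down to choosing a distance function. The following sketch shows three common choices side by side; the function names are illustrative, and cosine distance is taken here as 1 minus the cosine similarity, which is one common convention.

```python
import math


def euclidean(p, q):
    """Straight-line distance between p and q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute attribute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_distance(p, q):
    """1 - cosine similarity; depends on direction, not magnitude.

    Assumes neither vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), manhattan(p, q))
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))
```

The same pair of points is 5 units apart under the Euclidean measure but 7 under Manhattan, so the choice of dist(p, q) changes which points count as neighbours.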
DBSCAN
Density-Based Spatial Clustering of Applications with Noise

Basic idea
- Clusters are dense regions in the data space, separated by regions of lower object density.
- A cluster is defined as a maximal set of density-connected points.
- Discovers clusters of arbitrary shape.

Method: DBSCAN

Density Definition
- ε-neighbourhood: the objects within a radius of ε from an object.
- "High density": the ε-neighbourhood of an object contains at least MinPts objects.
[Figure: with MinPts = 4, the density of p is "high" and the density of q is "low".]

Core, Border & Outlier
Given ε and MinPts, categorize the objects into three exclusive groups:
- A core point has at least a specified number of points (MinPts) within Eps. These are points in the interior of a cluster.
- A border point has fewer than MinPts points within Eps, but is in the neighbourhood of a core point.
- A noise point (outlier) is any point that is neither a core point nor a border point.
[Figure: ε = 1 unit, MinPts = 5.]

Example
[Figure: the original points, and the same points labelled by type (core, border and outliers), with ε = 10, MinPts = 4.]

Density-reachability
Directly density-reachable: an object q is directly density-reachable from object p if p is a core object and q is in p's ε-neighbourhood.
In the illustrated example (MinPts = 4):
- q is directly density-reachable from p
- p is not directly density-reachable from q
- Density-reachability is asymmetric

Asymmetry means that even if point A is density-reachable from point B, the reverse may not be true. B may not be density-reachable from A, particularly if A does not have enough points in its epsilon neighbourhood to meet the minPts criterion while B does.
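The core, border and noise categories defined above can be sketched in a few lines. This is an illustration only: the function names and sample points are assumptions, and the sketch counts a point as part of its own ε-neighbourhood and uses the "at least MinPts" threshold, following the density definition above.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    neighbours = [region_query(points, i, eps) for i in range(len(points))]
    # Core points: at least min_pts points in the eps-neighbourhood.
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            # Not dense enough itself, but inside a core point's neighbourhood.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


# Illustrative points (made up): a dense square, one point on its edge, one far away.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(pts, eps=1.5, min_pts=4)
print(list(zip(pts, labels)))
```

With these settings the four corner points are core points, (2, 1) is a border point because it lies within ε of a core point without being dense itself, and (10, 10) is noise.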
Density-reachability
Density-reachable (directly and indirectly):
- A point p is directly density-reachable from p2
- p2 is directly density-reachable from p1
- p1 is directly density-reachable from q
- q → p1 → p2 → p form a chain
- p is (indirectly) density-reachable from q
- q is not density-reachable from p

DBSCAN Algorithm
for each o ∈ D do
    if o is not yet classified then
        if o is a core-object then
            collect all objects density-reachable from o and assign them to a new cluster
        else
            assign o to NOISE

DBSCAN Algorithm
- Pick any point in the dataset.
- If the point has at least minPoints points within a distance of epsilon from it (including the original point itself), all those points are part of the "cluster".
- Check each of these new points: if it in turn has at least minPoints points within a distance of epsilon, add those points to the cluster as well. Repeat.
- Eventually, we run out of points to add to the cluster. We then pick a new arbitrary point and repeat the process.

Visualization: visualizing-dbscan-clustering

DBSCAN
- Resistant to noise.
- Can handle clusters of different shapes and sizes.
- Cannot handle varying densities.
- Sensitive to parameters: ε and MinPts must be chosen, and it is hard to determine the correct set of parameters.

Summary
Clustering analysis groups similar objects based on the objects' attributes. To use k-means properly, it is important to:
- Properly scale the attribute values to avoid domination by a single attribute.
- Ensure that the concept of distance between the assigned values of an attribute is meaningful.
- Carefully choose the number of clusters, k.
Once the clusters are identified, it is often useful to label them in a descriptive way.

Texts and Resources
Dietrich, D. ed., 2015. Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. EMC Education Services.
Schutt, R. and O'Neil, C., 2013. Doing data science: Straight talk from the frontline. O'Reilly Media, Inc.
Hajaj, Y, 2020.
Introduction to Supervised, Semi-supervised, Unsupervised and Reinforcement Learning, available at: https://www.baeldung.com/cs/machine-learning-intro
Dabbura, I, 2018. K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks, available at: https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a
James, G., Witten, D., Hastie, T. and Tibshirani, R. Introduction to Statistical Learning (Chapter 10).
Ester, M., Kriegel, H.P., Sander, J. and Xu, X., 1996, August. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd (Vol. 96, No. 34, pp. 226-231).

Naïve Bayes Classifier & Model Evaluation (P1)
COS10022 Data Science Principles
Teaching Materials Co-developed by: Pei-Wei Tsai ([email protected]), WanTze Vong ([email protected])

Learning Outcomes
This lecture supports the achievement of the following learning outcomes:
3. Describe the processes within the Data Analytics Lifecycle.
4. Analyse business and organisational problems and formulate them into data science tasks.
5. Evaluate suitable techniques and tools for specific data science tasks.
6. Develop an analytics plan for a given business case study.

Using Proper Techniques to Solve the Problems
The Problem to Solve → The Category of Techniques
- I want to group items by similarity. I want to find structure (commonalities) in the data. → Clustering
- I want to discover relationships between actions or items. → Association Rules
- I want to determine the relationship between the outcome and the input variables. → Regression
- I want to assign (known) labels to objects. → Classification
- I want to find the structure in a temporal process. I want to forecast the behavior of a temporal process. → Time Series Analysis
- I want to analyze my text data. → Text Analysis

Key Questions
- How does the Naïve Bayes algorithm work for predictive modeling?
- How do we evaluate the performance of various analytics models?
- What are some popular evaluation metrics?
- How do we perform cross-validations?
- What are some practical considerations when evaluating a model?

Phase 4 – Model Building
In the 4th phase of the Data Analytics Lifecycle, the data science team develops datasets for testing, training, and production purposes. The team builds and executes models based on the work done in the model planning phase. The team also considers whether the existing tools are sufficient to run the models, or whether a more robust environment for executing the models is needed (e.g. fast hardware, parallel processing, etc.).

Key activities: Develop an analytical model, fit it on the training data, and evaluate its performance on the test data. The data science team can move to the next phase if the model is sufficiently robust to solve the problem, or if the team has failed.

Naïve Bayes Model
Naïve Bayes is a probabilistic classification model developed from Bayes' Theorem (or Bayes' Law). It is 'naïve' because it assumes that the influence of a particular attribute value on the class assignment of an object is independent of the values of the other attributes. This assumption simplifies the computation of the model:

$$P(a_1, a_2, \ldots, a_m \mid c_i) = P(a_1 \mid c_i) \cdot P(a_2 \mid c_i) \cdots P(a_m \mid c_i) = \prod_{j=1}^{m} P(a_j \mid c_i)$$

Example: Suppose you have a bag of fruits and want to classify each fruit as either "apple" or "grape" based on its color and size.
- Naïve Bayes assumes that the color and size of a fruit are independent of each other. For instance, it treats the probability of a fruit being an apple based on its color (e.g., red) separately from the probability based on its size (e.g., small).
- Even though in reality the color and size of a fruit might be related (e.g., small apples are often red), Naïve Bayes simplifies the calculation by treating them as independent.
- This simplification helps in efficiently calculating the probabilities and making classifications.

P(A|C="apple") = P(color="red"|C="apple") × P(size="small"|C="apple")
P(A|C="grape") = P(color="red"|C="grape") × P(size="small"|C="grape")
IF P(A|C="apple") > P(A|C="grape") THEN Class Label = "apple"

Advantages:
(a) easy to implement;
(b) executes efficiently without prior knowledge about the data.

Disadvantages:
(a) its strong assumption about the independence of attributes often gives bad results (i.e., bad prediction accuracy);
(b) discretizing numerical values may result in loss of useful information.

Naïve Bayes Model
Input variables: categorical variables; numerical or continuous variables need to be discretized, e.g.:
- income < $10,000 → low income
- income ≥ $1,000,000 → working class
Output: a class label, plus its corresponding probability score. Note that the class label is a categorical variable (non-numeric).

Bayes' Theorem
Bayes' Theorem gives the relationship between the probabilities of two events and their conditional probabilities. The theorem is derived from conditional probability. The conditional probability of event C occurring, given that event A has already occurred, is denoted as P(C|A), for which the following formula applies:

$$P(C \mid A) = \frac{P(A \cap C)}{P(A)}$$

Bayes' Theorem is algebraically derived from the previous conditional probability formula to obtain:

$$P(C \mid A) = \frac{P(A \mid C) \cdot P(C)}{P(A)}$$

where:
- C is the class label, $C \in \{c_1, c_2, \ldots, c_n\}$
- A is the observed attributes, $A \in \{a_1, a_2, \ldots, a_m\}$

Derivation:

$$P(C \mid A) = \frac{P(A \cap C)}{P(A)} \qquad (1)$$

$$P(A \mid C) = \frac{P(A \cap C)}{P(C)} \qquad (2)$$

$$P(A \cap C) = P(A \mid C) \, P(C) \qquad (3)$$

Equation 1 is the conditional probability of C given that A has occurred. Equation 2 is the conditional probability of A given that C has occurred. Rearranging Equation 2 gives the probability of A intersect C, $P(A \cap C)$, which is Equation 3. Equation 3 is substituted into Equation 1 to derive the equation of Bayes' Theorem.

Bayes' Theorem: Example 1
John flies frequently and likes to upgrade his seat to first class.
He has determined that if he checks in for his flight at least two hours early, the probability that he will get an upgrade is 0.75; otherwise, the probability that he will get an upgrade is 0.35. With his busy schedule, he checks in at least two hours before his flight only 40% of the time. Suppose John did not receive an upgrade on his most recent attempt. What is the probability that he did not arrive two hours early?

Let
- C = {John arrived at least two hours early}
- A = {John received an upgrade}
such that
- ¬C = {John did not arrive two hours early}
- ¬A = {John did not receive an upgrade}

Given: P(A|C) = 0.75, P(A|¬C) = 0.35, P(C) = 0.40. Required: P(¬C|¬A) = ?

The question above requires that we compute the probability P(¬C|¬A). By directly applying Bayes' Theorem, $P(C \mid A) = \frac{P(A \mid C) \cdot P(C)}{P(A)}$, we can mathematically formulate the question as:

$$P(\neg C \mid \neg A) = \frac{P(\neg A \mid \neg C) \cdot P(\neg C)}{P(\neg A)}$$

The rest of the problem is simply figuring out the probabilities of the terms on the right-hand side.

Start by figuring out the simplest terms based on the available information:
- Since John checks in at least two hours early 40% of the time, we know that P(C) = 0.4.
- This means that the probability of not checking in at least two hours early is P(¬C) = 1 – P(C) = 0.6.
- The story also tells us that the probability that John received an upgrade given that he checked in early is 0.75, such that P(A|C) = 0.75.
- Next, we were told that the probability that John received an upgrade given that he did not check in early is 0.35, i.e. P(A|¬C) = 0.35, which allows us to compute the probability that he did not receive an upgrade under the same circumstance as P(¬A|¬C) = 1 – P(A|¬C) = 0.65.

We were not told the probability of John receiving an upgrade, or P(A).
Fortunately, using all the terms worked out earlier, P(A) can be calculated as follows:
\[ P(A) = P(A \cap C) + P(A \cap \neg C) = P(C) \cdot P(A \mid C) + P(\neg C) \cdot P(A \mid \neg C) = 0.4 \times 0.75 + 0.6 \times 0.35 = 0.51 \]
Since P(A) = 0.51, P(¬A) = 1 - P(A) = 0.49.

Finally, using Bayes' Theorem, we can compute the probability P(¬C|¬A) as follows:
\[ P(\neg C \mid \neg A) = \frac{P(\neg A \mid \neg C) \cdot P(\neg C)}{P(\neg A)} = \frac{0.65 \times 0.6}{0.49} \approx 0.796 \]
Answer: the probability that John did not arrive two hours early, given that he did not receive an upgrade, is approximately 0.796.

Bayes' Theorem: Example 2
Assume that a patient named Mary took a lab test for a certain disease and the result came back positive. The test returns a positive result in 95% of the cases in which the disease is actually present, and it returns a positive result in 6% of the cases in which the disease is not present. Furthermore, 1% of the entire population has this disease. What is the probability that Mary actually has the disease, given that the test is positive?

Let
C = {having the disease}
A = {testing positive}
such that
¬C = {not having the disease}
¬A = {testing negative}

Given:
P(A|C) = 0.95
P(A|¬C) = 0.06
P(C) = 0.01
P(C|A) = ?

Slightly different from Example 1, the current problem requires that we compute the probability P(C|A). Applying Bayes' Theorem directly, we translate the question as:
\[ P(C \mid A) = \frac{P(A \mid C) \cdot P(C)}{P(A)} \]

What we know:
1% of the population has the disease, hence P(C) = 0.01.
Conversely, the probability of not having the disease is P(¬C) = 1 - P(C) = 0.99.
The probability that the test is positive given the presence of the disease is 95%, i.e. P(A|C) = 0.95.
The probability that the test is positive given the absence of the disease is 6%, i.e. P(A|¬C) = 0.06.

Both numerator terms of the formula, P(A|C) and P(C), are now known (√); the denominator P(A) is still unknown (?). To compute P(A):
\( P(A) = P(A \cap C) + P(A \cap \neg C) = P(C) \cdot P(A \mid C) + P(\neg C) \cdot P(A \mid \neg C) \)
\( = 0.01 \times 0.95 + 0.99 \times 0.06 = 0.0689 \)

Finally, we can compute the probability P(C|A) as follows:
P( A | C ) ⋅ P(C ) 0.95 × 0.01 P(C | A) =