FIT1043 PASS - Week 1 & 2 PDF
Summary
This document provides an overview of data science concepts, including the data science Venn diagram, big data characteristics, machine learning, and the data science process. It also details the steps involved in the data science process, from pitching ideas to wrangling and operationalizing results.
Full Transcript
Data Science
============

- Extraction of knowledge and insights from data

**Data Science Venn Diagram**

- Represents the combination of different skill sets
- Extract the data with hacking skills and gain insights from the data with maths and statistics knowledge
- Hacking skills + substantive expertise allow people to extract and structure the data, but without maths and statistics knowledge they risk producing misleading analyses (the "danger zone" of the Venn diagram)

**Big Data**

- Extensive datasets with characteristics of volume, variety and velocity (the 3Vs)

**Machine Learning**

- To develop algorithms and techniques that allow computers to learn
- Reasons for machine learning:
  - Automation, to deal with large amounts of data and when human expertise is expensive
  - Human expertise is not available (incapable)
  - Situations change over time and solutions need to adapt automatically

**Data Science Process**

1. **Pitching Ideas** - Pitching ideas for data science projects to investors/managers
2. **Collecting Data** - Collect data from different sources; this may take a long time
3. **Integration** - Data can come from many different sources, so we need to integrate all the data into one repository
4. **Interpretation** - Data can be described using a database schema
5. **Governance** - Care for the data and manage data standards and formats, e.g. store data safely to protect it and prevent data breaches
6. **Engineering** - Data engineers make the back-end work
7. **Wrangling** - Inspect and clean the data, extract the data that we need
8. **Modelling** - Analysts propose a mathematical or functional model to perform analysis, statistical or machine learning work on the data
9. **Visualization** - Interpret and present the outcome
10. **Operationalize** - Putting the results to work

Data Science Process with Standard Value Chain
==============================================

![](media/image2.png)

1. **Collection** - Collecting the data from different sources, instruments or providers
2. **Engineering** - Processing and storing the data, managing the databases across the full lifecycle
3. **Governance** - Data management including security and metadata across the full lifecycle
4. **Wrangling** - Data pre-processing and cleaning
5. **Analysis** - Analyse and get insights from our data through learning or visualisation
6. **Visualisation** - Visualisation and summarisation to argue that the results are significant and useful
7. **Operationalisation** - Putting the results of analysis to work to obtain value

FIT1043 PASS - Week 3

Basic Types of Data
===================

- **Numeric-Discrete** - Numeric, but the values are countable and enumerable, e.g. number of people, age (whole years)
- **Numeric-Continuous** - Numeric, usually measurements, not enumerable; described as intervals on the real number line, e.g. weight, height, distance
- **Categorical-Nominal** - Discrete number of values with no ordering, e.g. country, gender, movie genre
- **Categorical-Ordinal** - Discrete number of states with ordering, e.g. education status, state of disease progression

Data Visualization
==================

- Visual representation of data to understand the trends and patterns in data
- **Numerical data**: histograms, box plots, motion charts
- **Categorical data**: frequency tables, bar charts, pie charts

Visualization for Numerical Data
================================

- **Histogram** - Groups numerical data into bins
- **Box Plot** - Shows the distribution of the data and outliers ![](media/image4.png)
- **Motion Chart** - Visualizes data in multiple dimensions, e.g. x-axis, y-axis, size of bubble, colour
  - Advantages: allows deeper insights and is good for exploratory work
  - Disadvantages: not suitable for static media and the display can be overwhelming

Visualization for Categorical Data
==================================

- **Frequency Table** - Summarizes values and their corresponding frequencies
- **Bar Chart** - Compares data between different categories or over a time frame ![](media/image6.png)
- **Pie Chart** - Shows the part-whole relationship / numerical proportion of the data (see the R sketch below)
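To make these chart types concrete, here is a minimal base-R sketch; the `marks` (numerical) and `genre` (categorical) vectors are hypothetical examples:

```r
# Hypothetical data: exam marks (numerical) and movie genres (categorical)
marks <- c(55, 62, 71, 68, 90, 45, 77, 84, 66, 59)
genre <- factor(c("Action", "Comedy", "Action", "Drama", "Comedy",
                  "Action", "Drama", "Comedy", "Action", "Drama"))

hist(marks, breaks = 5, main = "Histogram of marks")  # numerical: bins
boxplot(marks, main = "Box plot of marks")            # numerical: spread and outliers
table(genre)                                          # categorical: frequency table
barplot(table(genre), main = "Bar chart of genres")   # categorical: counts per category
pie(table(genre), main = "Pie chart of genres")       # categorical: proportions
```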
Descriptive Statistics
======================

- Summarises or describes aspects of a dataset; we lose detailed information but gain easy comprehension of the dataset
- Uses measures such as the **measure of centrality** and the **measure of spread**
- Contrasts with *inferential statistics* (using samples to make predictions about a bigger population)

Measure of Centrality
=====================

- **Mean**
  - The average value of the dataset
  - Uses all values in the sample to calculate
  - Changing any value in the sample changes the mean
- **Median**
  - The middle value of the dataset after the dataset is sorted from the least to the greatest value
  - Uses at most 2 values of the sample
  - Changing values other than the middle does not change the median
- **Mode**
  - The most frequently occurring value in the sample

Skewness - Mean vs Median
=========================

![](media/image8.png) ![](media/image10.png)

- **Symmetric Distribution** - Mean ≈ Median
- **Positively Skewed** - Mean \> Median
- **Negatively Skewed** - Mean \< Median

Percentile
==========

- The *p*-th percentile is the value *Q(y, p)* such that *p*% of the values of the sample are lower than *Q(y, p)*

Measure of Spread (Dispersion)
==============================

- Describes the variability of values in a dataset, i.e. how scattered the values are (see the R sketch below)
- **Standard Deviation** ![](media/image12.png)
  - The square root of the average squared deviation from the sample mean
  - Measured in the same units as the data, so it provides a more intuitive understanding of the spread
- **Variance**
  - The squared standard deviation
  - Mathematically easier to work with
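A short base-R sketch of these measures on a hypothetical `marks` vector:

```r
marks <- c(55, 62, 71, 68, 90, 45, 77, 84, 66, 62)

mean(marks)    # measure of centrality: mean
median(marks)  # measure of centrality: median
# Base R has no built-in mode function; the most frequent value can be
# found from a frequency table (ties return the first maximum):
names(which.max(table(marks)))

quantile(marks, 0.25)  # the 25th percentile, Q(y, 0.25)
sd(marks)              # measure of spread: standard deviation
var(marks)             # variance = sd(marks)^2
```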
Association between Continuous Variables
========================================

- **Pearson Correlation** ![](media/image14.png)
  - Measures the strength of the linear relationship between two variables, *x* and *y*
  - The correlation coefficient *R* is between -1 and 1
    - -1: completely negatively correlated
    - 1: completely positively correlated
    - 0: no linear association (does not imply no non-linear association)
  - Correlation does not imply causation, i.e. two events occurring together does not mean that there is a cause-and-effect relationship
- **Scatter Plots**
  - Visualise the relationship and association between two numerical variables

Association between Categorical Variables
=========================================

- Use side-by-side bar graphs to identify whether there is a difference between two categories
- If there is no difference, there might be no association, and vice versa

![](media/image16.png)

Association between Categorical and Numeric Variables
=====================================================

- If variable *x* is categorical and variable *y* is numerical, use a side-by-side box plot (see the R sketch below)
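A minimal R sketch of these association checks, using hypothetical `hours`, `marks` and `group` data:

```r
hours <- c(2, 4, 6, 5, 9, 1, 7, 8, 5, 3)
marks <- c(55, 62, 71, 68, 90, 45, 77, 84, 66, 59)
group <- factor(c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B"))

cor(hours, marks)       # Pearson correlation coefficient R
plot(hours, marks)      # scatter plot: two numerical variables
boxplot(marks ~ group)  # side-by-side box plot: categorical x vs numerical y
```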
**FIT1043 PASS - Week 4**

Data Sources and Data Wrangling

Open Data
=========

- Data that is openly accessible, exploitable, editable and shareable by anyone for any purpose
- Involves significant cost in public resources and human effort
- There are significant potential benefits, but the data needs to be refined to realise its potential
- Usually comes in CSV or LOD format
  - CSV - Comma Separated Value files; the separator can be a semicolon, colon, comma or tab
  - LOD - Linked Open Data, comes in triples (subject, verb, object); enables data from different sources to be connected and queried

**Benefits of Open Data**

- Transparency
  - Oversight of government, helps to reduce wastage and potential corruption
  - E.g. citizens can use open data to track public budget expenditures and relate them to their impacts
- Public Service Improvement
  - Citizens have the data needed to contribute to the improvement of public services and to compare best practices against other countries
  - E.g. citizens can use open data to contribute to road safety planning
- Innovation and Economic Value
  - Allows collaboration between government and citizens
  - E.g. businesses can use open data to understand potential markets and develop new products
- Efficiency
  - Easier for government ministries to access their own data and data from other ministries, which reduces acquisition costs, redundancy and overhead

**API - Application Programming Interface**

- A user interface designed for computers to access the functionality of other software (interactions across devices, API consumers vs API providers)
- A mechanism for two applications to communicate with each other through a set of rules
- An API is documented with a list of URLs and query parameters describing how to make a request

![](media/image18.png)

- Stakeholders of an API:
  - API Providers: build, expose and operate APIs; need to know how APIs are designed, built and operated
  - API Customers: decide which APIs to use and pay for the commercial use of APIs
  - API Consumers: develop software applications or websites that use the API
  - End-Users: don't use the API directly; they use apps or websites that use APIs in the background
- Examples of APIs:
  - Twitter Developer API
    - Comprehensive, has library interfaces for Java, C++, Python and JavaScript
    - Allows other applications to manage Twitter data for users, but has an extensive developer policy
    - Filter real-time tweets: get only the tweets you need by using advanced filtering tools
    - Embed Tweets, timelines and more within your website
  - REST API
    - Communicates via HTTP requests to perform standard database functions like creating, reading, updating and deleting (CRUD) within a resource
    - Uses HTTP requests such as GET, PUT, POST and DELETE (see the R sketch below)
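As a sketch of what a REST request looks like in practice, here is a minimal R example using the `httr` package; the endpoint URL and query parameters are hypothetical stand-ins for whatever a real API's documentation specifies:

```r
library(httr)

# GET request to a hypothetical REST endpoint (replace the URL and query
# parameters with those from a real API's documentation)
resp <- GET("https://api.example.com/v1/tweets",
            query = list(q = "data science", count = 10))

status_code(resp)           # 200 indicates the request succeeded
content(resp, as = "text")  # raw response body, typically JSON
```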
**Common types of data at our disposal**

1. **Databases**
   - Relational databases, e.g. Oracle, MySQL, MariaDB
   - A database for a bank contains data for:
     - Customer relationship management
     - Mortgage, hire purchase and business loan systems
     - Salesforce automation
     - Credit card systems, ATM transactions, retail banking
     - Human resources and payroll
2. **Files**
   - Documents can be stored as:
     - System log files (semi-structured data)
     - Spreadsheets
     - PDF
     - Images
     - Raw text
     - Formatted text
3. **Web and Crowd Sources**
   - Information from websites, e.g. news, blogs, corporate websites introducing their products and services
   - Gather data with APIs or crowdsourcing websites
   - Example: GDELT (a realtime database of global human society for open research)
4. **Social Media**
   - Contains a large amount of user-generated content
   - E.g. Facebook, Twitter, LinkedIn, YouTube
5. **Internet of Things (IoT)**
   - Connection between computers and machines through a network, a process that occurs through the exchange of data
   - E.g.:
     - Utilities (power, water, traffic lights)
     - Vehicle to vehicle
     - Autonomous driving
     - Mobile phone data (location, browsing history, usage history, personal information)

**Data Wrangling**

- The process of cleaning, transforming and preparing raw data into usable data for analysis
- Reasons for data wrangling:
  - Data comes in different formats, shapes and sizes
  - Raw data might contain errors, missing values, inconsistencies and outliers (we need techniques to cleanse and prepare the data)
- Steps for data wrangling: data preprocessing, data preparation, data cleansing and data transformation

**Source of Data Quality Issues**

1. **Interpretability Issues**
   - Problems with interpreting and understanding the data
   - There is no proper documentation (i.e. data dictionary) to explain each of the columns
2. **Data Format Issues**
   - Data are generated from different processes and often have different data formats
   - Difficult to integrate and manipulate data in different formats
   - E.g. data in JSON or XML
3. **Inconsistent and Faulty Data**
   - There might be mistyped data, inconsistent entries or irrelevant data
   - Usually the solution is to remove the irrelevant entries ![](media/image20.png)
4. **Missing Values**
   - Data values are absent in a dataset
5. **Outliers**
   - An observation that is abnormal in comparison with the majority of other observations in the dataset
6. **Duplicates**
   - Multiple data entries corresponding to the same information

**Data Auditing**

- The process of assessing the quality and utility of data for a specific purpose
- What to do for data auditing (see the R sketch below):
  - Check the dimensions and shape of the data
  - Check whether there are null values
  - Compute basic statistics
  - Check the correlation among variables
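A minimal data-auditing sketch in R, assuming a hypothetical data frame `df` loaded from a hypothetical CSV file:

```r
df <- read.csv("survey.csv")  # hypothetical input file

dim(df)             # dimensions: number of rows and columns
str(df)             # shape: column names and types
colSums(is.na(df))  # count of null (NA) values per column
summary(df)         # basic statistics for every column

# Correlation among the numeric variables, ignoring rows with NAs
cor(df[sapply(df, is.numeric)], use = "complete.obs")
```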
**Problems in a Dataset and Solutions**

1. **Misspelling and Inconsistency**
   - Common cases: misspelling; inconsistent casing (uppercase and lowercase); inconsistency in domain value representation
   - Detecting: investigate unique domain values; calculate domain value frequencies
   - Fixing: match infrequent values and replace them with the best match
2. **Irregularities**
   - Common cases: invalid dates; invalid values for a specific domain, e.g. a negative value for a count of people
   - Detecting: investigate unique domain values; investigate the range of values for specific columns
   - Fixing: refer to the data documentation to check the meaning of values; replace or remove
3. **Integrity Constraint Violation**
   - Common cases: dependent on context, e.g. a start date that is later than the end date, or a field that should be the sum of two others
   - Detecting: highly dependent on the domain and problem; check value ranges
   - Fixing: swap; remove
4. **Duplicates**
   - Common cases: complete duplicates (entries with completely identical data); duplicates due to a missing field, i.e. entries that appear to be duplicates because a field is null
   - Detecting: identify keys to check for duplicates
   - Fixing: combine information; remove duplicates
5. **Missing Values**
   - Common cases: missing values in certain columns
   - Detecting: investigate unique domain values; investigate value ranges; domain analysis
   - Fixing: substitute with mean/mode/dummy values, or remove; this depends on the situation and requires justification
6. **Outliers**
   - Common cases: outliers in numerical fields
   - Detecting: outliers are challenging to detect; use graphical tools, e.g. box plots; compare results found by different identifiers
   - Fixing: substitute with mean/mode/dummy values, or remove; this depends on the situation and requires justification (see the R sketch below)
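A short R sketch of some of these fixes on a hypothetical data frame with `city` and `age` columns; the 1.5 × IQR rule used at the end is the same rule a box plot uses to flag outliers:

```r
df <- data.frame(city = c("Melbourne", "melbourne", "Sydney ", "Sydney"),
                 age  = c(21, 21, NA, 250))

# Inconsistency: normalise casing and trim stray whitespace
df$city <- trimws(tolower(df$city))

# Duplicates: drop rows that are completely identical
df <- df[!duplicated(df), ]

# Missing values: substitute with the mean (requires justification)
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# Outliers: flag values outside 1.5 x IQR beyond the quartiles
q <- quantile(df$age, c(0.25, 0.75))
outlier <- df$age < q[1] - 1.5 * IQR(df$age) |
           df$age > q[2] + 1.5 * IQR(df$age)
df[outlier, ]
```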
**FIT1043 PASS - Week 5**

Predictive Models and Machine Learning

Model
=====

- A representation or construct that allows us to better understand our data
- Models are useful:
  - To help us understand how something works
  - To help us make predictions about the unknown

Predictive Model
================

- A model that makes predictions based on a set of features describing an object
- Analyzes historical and current data to generate a model
- The model uses equations/rules to map input features to output values
- Predictions could be:
  - Binary outcomes
  - Categorical values
  - Real values
  - Vectors of real values, e.g. probabilities
- Two common types of models:
  - Classifier: the predicted value is binary/categorical
  - Regression: the predicted value is a real value

![](media/image22.png)

Features and Labels
===================

- Features
  - Input values, also known as independent variables or predictors
  - Measurable characteristics or attributes of the data that help us make predictions
  - E.g. in a spam email classification task, the features can be the frequency of certain words or the length of the email
- Labels
  - Output values, also known as dependent variables or targets
  - The desired outcomes or predictions we want to make
  - E.g. whether the email is spam or not spam

Train and Test Datasets
=======================

- Usually the data is split into two subsets (training and testing) based on a ratio (see the R sketch below)
- The model is built with the training data, and its performance is evaluated on the test data
- Why don't we use the training data to assess the model's performance?
  - The model has seen the data before and might give a misleadingly high accuracy
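A minimal R sketch of a random train/test split, assuming a hypothetical data frame `df` and an 80/20 ratio:

```r
df <- data.frame(x = rnorm(100), y = rnorm(100))  # hypothetical dataset

set.seed(42)  # make the random split reproducible
train_idx <- sample(nrow(df), size = 0.8 * nrow(df))

train <- df[train_idx, ]   # 80%: used to build the model
test  <- df[-train_idx, ]  # 20%: held out to evaluate performance
```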
Training a Model
================

- Predictive models learn from training data and make predictions on new, unseen data
- The model assesses and identifies patterns within the training data
- **Classification** learning algorithms work by dividing the feature space into regions of the same type
- In practice:
  - Data can be overlapping and hard to separate between classes
  - There can be many feature dimensions, with some features being more useful than others

Testing a Model
===============

- Evaluate predictive models based on their performance in predicting labels for test data
- Generally, we get better test performance when:
  - There is more training data
  - There are more features to learn from (but there is a limit to the number of useful features)

Machine Learning
================

- Algorithms or statistical models which computer systems use to perform specific tasks without explicit instructions
- Allows computers to learn patterns and make inferences based on data

Types of Machine Learning
=========================

- Supervised Machine Learning
  - The model is trained with labelled data to make predictions for new, unseen data
  - Requires human intervention to produce accurate training datasets
  - Types of supervised machine learning:
    - Classification
      - Allocates data into predefined categories
      - E.g. classifying colour, shapes, species
    - Regression
      - Understands the relationship between dependent and independent variables and makes predictions of continuous data
      - E.g. household income, weight
  - Example algorithms:
    - Linear Regression
    - Random Forest (classification and regression)
    - Support Vector Machine (classification)
- Unsupervised Machine Learning
  - Data is unlabelled and the model learns the underlying patterns or distribution from the input data
  - Does not require human intervention
  - Types of unsupervised learning:
    - Clustering
      - Discovers the inherent groupings in data
      - E.g. grouping customers by purchasing behaviour
    - Association
      - Discovers rules that describe large portions of your data
      - E.g. people who buy X also tend to buy Y
  - Example algorithms:
    - K-Means (clustering)
    - Apriori Algorithm (association)

Learning Theory
===============

- A subfield of AI that studies the design and analysis of machine learning, i.e. understanding how machine learning algorithms work

Truth
=====

- Also known as the ground truth; in principle it could only be measured exactly by collecting infinite data
- We want to determine whether the outcome of the algorithm is correct by comparing it against the truth
- We want machine learning models to make predictions that are accurate and close to the truth

Quality
=======

- Evaluates the quality of results derived from learning, measured through values
- Can be measured on a positive or negative scale:
  - Loss: positive when things are bad, negative or zero when things are good
  - Gain: positive when things are good, negative when they are not

Error
=====

- Measures the distance between the prediction and the actual value
- 0 means no error: the prediction was exactly right
- Error is not itself a measure of quality, but we can convert error to a measure of quality with a loss function:
  - *Absolute-error(x) = \|x\|*
  - *Square-error(x) = x\*x*
  - *Hinge-error(x) = \|x\|* if *\|x\| \> ε*, and 0 otherwise

Linear Regression
=================

- Models the relationship between an independent variable *x* and a dependent variable *y* with a straight line
- Prediction for y at point x uses the model parameters (*a*~0~, *a*~1~), i.e. the intercept and slope/gradient (see the R sketch below)
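A minimal sketch of fitting such a line in R with the built-in `lm()` function, on hypothetical `x` and `y` data:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1)  # roughly y = 2x

fit <- lm(y ~ x)  # least-squares fit of y = a0 + a1 * x
coef(fit)         # a0 (intercept) and a1 (slope/gradient)

predict(fit, data.frame(x = 10))  # prediction for y at a new point x = 10
```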
Polynomial Regression
=====================

- Models the relationship between the dependent and independent variables as an nth-degree polynomial
- Fit the polynomial equation on data where the dependent and independent variables have a curvilinear relationship ![](media/image26.jpg)
- Can overcome underfitting, but can be prone to overfitting, especially with higher-degree polynomials

Mean Square Error (MSE) - Loss Function
=======================================

![](media/image28.png)

- We fit the regression model by finding the parameter vector that minimises the loss function
- The loss function is used for parameter estimation
- For all machine learning algorithms, our goal is to minimise the error defined by the loss function

Underfitting
============

- The model is too simple and is not able to capture the underlying structure of the data
- Poor performance on the training data and poor generalisation to other data
- Underfitting models have high bias and low variance
- Possible reasons for underfitting:
  - The training data is not sufficient
  - The input features are not adequate to represent the underlying factors that influence the target variable
- Possible methods to reduce underfitting:
  - Increase model complexity
  - Increase the number of features or perform feature engineering

Overfitting
===========

- The model has learned the training data too closely, including its noise and inaccurate data entries
- The model does not categorize new data accurately because of the details and noise it has learned
- Good performance on the training data but poor generalisation to other data
- Overfitting models have low bias and high variance
- Possible reasons for overfitting:
  - The training data is not suitable
  - The model is too complex
- Possible methods to reduce overfitting:
  - Improve the quality of training data by focusing on meaningful patterns
  - Reduce model complexity

![](media/image30.png)

Bias
====

- Measures how much the predictions of a model differ from the true values or the desired regression function
- The inability of the method to capture the true relationship between variables

Variance
========

- Measures how much the predictions for individual datasets vary around their average (how spread out the predicted values are)
- High variance means that the predicted results vary widely

Bias and Variance Trade-Off
===========================

- We want a model that accurately captures patterns in the training data and generalises well to unseen data
- There is a trade-off between bias and variance:
  - Increasing the model's complexity can reduce bias but increase variance
  - Decreasing the model's complexity can reduce variance but increase bias
- Therefore, we want to find the optimal balance between bias and variance

No Free Lunch Theorem
=====================

- No single machine learning algorithm is universally the best-performing algorithm for all problems
- All optimization algorithms perform equally well when their performance is averaged over all optimization tasks

Ensembles
=========

- A collection of reasonable/possible models that allows us to understand the variability and range of predictions that is realistic
- We can average the predictions over the models in an ensemble to improve performance
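To see underfitting vs overfitting concretely, here is a small R sketch on hypothetical curvilinear data, comparing the training MSE, i.e. (1/n) Σ (yᵢ − ŷᵢ)², for polynomial fits of increasing degree. Training MSE keeps falling as the degree grows, which is exactly why held-out test data is needed to detect overfitting:

```r
set.seed(1)
x <- seq(0, 5, length.out = 30)
y <- sin(x) + rnorm(30, sd = 0.3)  # curvilinear relationship plus noise

for (d in c(1, 3, 9)) {            # degree 1 underfits; degree 9 risks overfitting
  fit <- lm(y ~ poly(x, d))        # polynomial regression of degree d
  mse <- mean(residuals(fit)^2)    # training MSE (the loss function)
  cat("degree", d, "training MSE:", round(mse, 4), "\n")
}
```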
**FIT1043 PASS - Week 7**

Analyse (EXTRA)
===============

- **Which model? (unsupervised/classification/regression/decision tree)**
- **Which features? (All? Determine the need for a particular feature; some can be omitted for efficiency)**
- **Any pre-processing needed?**
- **How do we measure model "goodness"?**

Classification
==============

- A supervised machine learning method where the model tries to predict the correct label of a given input
- Two types of classification: binary-class and multi-class classification
- Example classification method: logistic regression

Confusion Matrix and Classification Metrics
===========================================

- Measure the performance of classification models
- Confusion matrices can be used to calculate performance metrics for classification models
- Common performance metrics are accuracy, precision, recall and F1 score (see the R sketch after the lists below)

![](media/image32.png)

**Confusion Matrix**

- True Positive (TP): correctly predicted the positive class
- True Negative (TN): correctly predicted the negative class
- False Positive (FP): incorrectly predicted the positive class, i.e. the model predicted positive, but it is actually negative
- False Negative (FN): incorrectly predicted the negative class, i.e. the model predicted negative, but it is actually positive

**Classification Metrics**

- Accuracy: measures how often the model predicts correctly
- Precision: true positives over the total number of predicted positives, i.e. when a positive value is predicted, how often is the prediction correct?
- Sensitivity (Recall): true positives over the actual positive outcomes, i.e. when the actual value is positive, how often is the prediction correct?
- Specificity: true negatives over the actual negative outcomes, i.e. when the actual value is negative, how often is the prediction correct?
- False Positive Rate (FPR): the proportion of actual negative cases that were incorrectly identified as positive
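A minimal R sketch computing these metrics from hypothetical confusion-matrix counts:

```r
TP <- 40; TN <- 45; FP <- 5; FN <- 10  # hypothetical counts

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)  # sensitivity
specificity <- TN / (TN + FP)
fpr         <- FP / (FP + TN)  # = 1 - specificity
f1          <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall,
  specificity = specificity, FPR = fpr, F1 = f1)
```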
Which performance metrics should we use?
=========================================

- It depends on the type of problem we are trying to solve
- In certain situations we want to optimise precision or specificity (e.g. a spam filter); in others, sensitivity (e.g. a fraudulent transaction detector)
- Examples:
  - Precision when we want to minimise false positives
  - Sensitivity (Recall) when we want to minimise false negatives
  - Specificity when we are concerned about the accuracy of the negative rate
  - False Positive Rate (FPR) when we are concerned about the accuracy of the negative rate in binary classification

Decision Trees and Regression Trees
===================================

- Decision Trees
  - Predict binary or multi-class (categorical) outcomes
  - A hierarchical structure that classifies data by asking a question at each node
  - The prediction is the most common value in each region
- Regression Trees
  - A decision tree used for regression tasks
  - Predicts continuous-valued outputs instead of categorical outputs
  - The prediction is usually the average value in each region

How to Build Regression and Decision Trees
==========================================

- Perform recursive partitioning to divide the feature space into regions
- At each iteration, we divide the data to group similar instances together
- Decision trees decide which attributes to split on (and when to stop splitting) using criteria such as entropy and information gain

![](media/image36.png)

Ensemble Learning
=================

- **Random forest** is an ensemble learning method that operates by constructing a number of decision trees
- Random forests can be used to reduce the danger of overfitting in decision trees
- Uses bagging and feature randomness when building each individual tree
- The prediction of the random forest is more accurate than that of any individual tree
- The output is the result of the majority vote of the trees

Clustering
==========

- Groups a set of data points into different subgroups (clusters) based on their similarity
- Aims to gain insights from unlabelled data points where we don't have a target variable

K-means Clustering (centroid based; others: density based, distribution based, hierarchical based)
==================================================================================================

- Assigns data points to one of K clusters depending on their distance from the centre of each cluster
- The goal of k-means clustering is to partition *n* observations into *k* clusters
- K-means clustering can be viewed as a method of quantization
- The [cluster centroid] is the average of the locations of all data points in a cluster
- Algorithm (see the R sketch below):
  - Select k points at random as centroids/cluster centres
  - Perform cluster assignment and move each cluster centroid to the mean value of its cluster
  - Repeat iteratively until there are no changes in the clusters
- How to select K (the number of clusters)?
  - A priori knowledge about the application domain, e.g. knowledge of the number of T-shirt sizes
- How to search for a good K?
  - Try different values of k and evaluate the results
  - Run hierarchical clustering on a subset of the data
- When we select K random data points from the dataset as initial centroids, they may not be well positioned throughout the entire data space

[Algorithm]

k-means initial setup

1. Define k
2. Initialize centroids

k-means two main (iterative) steps

1. Cluster assignment (assign each data point to the nearest centroid)
2. Move centroid (update each centroid to the mean of its assigned points)
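A minimal R sketch of k-means with the built-in `kmeans()` function on hypothetical 2-D data:

```r
set.seed(7)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),  # two hypothetical blobs
             matrix(rnorm(40, mean = 4), ncol = 2))

km <- kmeans(pts, centers = 2, nstart = 25)  # nstart re-runs the algorithm with
                                             # different random initial centroids
km$centers                 # final centroids: the mean of each cluster
km$cluster                 # cluster assignment for each data point
plot(pts, col = km$cluster)
```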
Week 8: Introduction to R Programming FIT1043 PASS - Semester 1, 2024
=====================================================================

This R script provides a compilation of frequently used functions in R programming. It is provided solely as a reference to assist you; please conduct your own research and ensure accuracy.

1. Basic Syntax in R
====================

1.1 Variable Assignment
-----------------------

Mainly uses the leftward assignment operator `<-`, e.g. `x <- 5`.

MapReduce
=========

- **Map**: The input data is divided into key-value pairs and processed in parallel across multiple nodes to produce intermediate key-value pairs
- **Reduce**: The intermediate key-value pairs are combined and reduced to produce the final output
- Requires simple data parallelism followed by a merge ("reduce") process, making it suitable for distributed processing
- MapReduce is typically used for batch processing of large datasets, such as data mining, log analysis, and web indexing (a conceptual sketch of the pattern follows below)
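The map/reduce pattern itself can be sketched in a few lines of base R with the built-in `Map()` and `Reduce()` functions. This is only a conceptual word-count illustration on hypothetical documents, not Hadoop itself:

```r
docs <- c("big data needs big tools", "data tools for big data")

# Map step: each document becomes (word, count) pairs, represented here
# as a named vector of per-document word counts
mapped <- Map(function(doc) table(strsplit(doc, " ")[[1]]), docs)

# Reduce step: intermediate counts are merged into one final tally
merge_counts <- function(a, b) {
  keys <- union(names(a), names(b))
  sapply(keys, function(k) sum(a[k], b[k], na.rm = TRUE))
}
Reduce(merge_counts, mapped)  # e.g. big 3, data 3, tools 2, ...
```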
- Apache Hadoop
  - An open-source Java implementation of MapReduce
  - Efficiently stores and processes large datasets (e.g. predictive analysis, data mining, machine learning)
  - Enables big data analytics processing tasks to be split into smaller tasks that are performed in parallel and distributed across a Hadoop cluster
  - Consists of four main modules:
    - HDFS - Hadoop Distributed File System: stores files in Hadoop format and parallelises them across clusters
    - YARN - Yet Another Resource Negotiator: a cluster resource manager that schedules tasks and allocates resources
    - MapReduce: splits big data processing tasks into smaller ones, distributes the small tasks across different nodes, then runs each task
    - Hadoop Common: a set of common libraries and utilities that the other three modules depend on
- Apache Spark
  - Builds on the Hadoop Distributed File System (HDFS)
  - Provides interfaces in Java, Scala, R and Python
  - Introduces the concept of Resilient Distributed Datasets (RDDs) and utilizes in-memory caching to process data more efficiently
  - Optimized query execution for fast analytic queries against data of any size

Hadoop vs Spark
===============

- Hadoop
  - Provides an inexpensive, open-source platform for parallelising processing; relies on disk storage (which adds delay) for data processing
  - Based on MapReduce
  - Ideal for **batch processing** and **linear data processing**
  - Not suitable for streaming (real-time processing); more suitable for offline processing
- Spark
  - Runs at a higher cost because it relies on in-memory computations **for real-time data processing**, which requires it to use high quantities of RAM to spin up nodes (e.g. Netflix)
  - Includes MapReduce capabilities
  - Provides **real-time, in-memory processing**
  - Tends to perform faster than Hadoop because it uses **random access memory (RAM)** to cache and process data instead of a file system

Deep Learning
=============

- A subset of machine learning that is effective at learning patterns
- A deep learning algorithm attempts to learn multiple levels of representation using a hierarchy of layers
- Learns patterns through the layers of a neural network; additional layers in a deep neural network help to refine and optimize the outcomes for greater accuracy
- Automates feature extraction and removes the dependency on human intervention
- E.g. digital assistants, voice-enabled TV remotes, fraud detection, automatic facial recognition

Reinforcement Learning
======================

- A subfield of machine learning and deep learning
- Mimics the trial-and-error learning process that humans use to achieve their goals
- A machine learning training method based on rewarding desired behaviours and punishing undesired ones
- Differs from supervised learning in that it does not need labelled data to tell the machine the correct answer