

Full Transcript


Smart City and IoT Data Analytics
Instructor: Punit Rathore

1. About this module, Data types and Pre-processing

What is data science/mining?
Data Science is the discipline of extracting knowledge from data. It relies on computer science (data structures, programming, algorithms, visualization, big data computing), statistics (descriptive, exploratory, and explanatory analysis), and domain knowledge (asking the right questions, interpreting results, evaluating solutions).

Why the city matters
The rise of megacities: about 90% of the projected increase in urban population is concentrated in developing countries (Asia and Africa). India, China, and Nigeria together account for 37% of the projected growth, and Delhi is projected to become the world's second largest city, with its population rising to 36 million. So what? (Figure: contribution to the increase in urban population by country, 2014 to 2050.) Cities are also focal points of crises, conflicts, and risks.

What is "smart"?
Connected devices (e.g., connected vehicles) everywhere.

IoT Infrastructure for Smart City
(Layered architecture diagram.) Application layer: video monitoring, structural health monitoring, environment monitoring, health monitoring, intelligent transportation, data visualization. Cloud computing layer: data storage, computation, analytics. Communication stack: transport, network layer (addressing and quality of service), MAC layer, physical layer. Data flow: sensing and data collection, data processing, connectivity, storage.

Convergence of IoT Data and Big Data

Necessity for data science/mining
Raw data, through data mining, yield knowledge (meaning, patterns, trends, correlations), enabling better insights, policy making, customer service, decision making, and business benefits. Challenges: unstructured data, noise, incompleteness, heterogeneity, labelling, volume, velocity, and dimensionality. Do we have the right data? What is its importance?
Data, Information, Knowledge, and Wisdom
- Descriptive (raw observations): who did what, where, and when; captures data from the physical event.
- Diagnostic (context): how did the physical event generate that information?
- Predictive (models, understanding): why did the physical event happen?
- Prescriptive (action): what to do to prevent or facilitate the physical event.

Types of Analysis

Smart City - Fundamental Concepts
Example domains: noise mapping, particulate matter, city environment, transport, citizen engagement, infrastructure systems, structural health, participatory sensing, water management, energy management, community health, ecosystem and environment management, emergency and disaster management, mobile health, urban forest monitoring, crowd monitoring.

Smart City: Example Problems
Household poverty level prediction
- Location: Costa Rica
- Challenge: difficult to ensure that the right people are given enough aid
- Data: household characteristics
- Goal: improve the accuracy of household income qualification

Disaster Recovery Support
- Location: Houston, TX
- Challenge: fair and effective allocation of funding for post-hurricane recovery
- Data: property records, population census, household income, flood simulation data
- Goal: predict who was hit by the hurricane and estimate the damage

Bushfire monitoring
- Location: Australia
- Challenge: monitor the progress of fires over different spatio-temporal scales
- Data: satellite imagery, real-time sensor data
- Goal: better monitor and assess the risk of fires

Can you give an example from India?
City Open Data Portals
https://smartcities.data.gov.in/ , https://data.gov.in

About this module
Topics: data (types and quality issues), regression, classification, clustering, anomaly detection, and real-world applications, with hands-on experience through practical workshops.

Topics to be covered
- Week 1, Saturday: About the course, Data Types and Quality Issues
- Week 1, Sunday: Linear Regression, Regularization, Model Complexity
- Week 2, Saturday: Classification - Logistic Regression, Naïve Bayes, kNN
- Week 2, Sunday: Classification Models - Decision Trees, Ensemble Classification
- Week 3, Saturday: Ensemble Models (contd.), SVM (optional)
- Week 3, Sunday: Neural Networks (Perceptron), Anom
- Week 4, Saturday: Clustering (k-means, hierarchical) and Dimension Reduction (PCA, MDS)
- Week 4, Sunday: Anomaly Detection and Examples

About the Instructor
Punit Rathore, RBCCPS and CiSTUP at IISc Bangalore. PhD (2019), University of Melbourne, Australia. Postdoc (2019-2021, 2 years), MIT, Cambridge, USA (Senseable City Lab). Expertise: machine learning, spatio-temporal data mining, Internet of Things, intelligent transportation, urban intelligence.

Assumed Knowledge and Assessment
Maths:
- Familiarity with formal notation
- UG-level knowledge of linear algebra and statistics
Programming (optional):
- Ideal: some exposure to Python or R
Assessment component: Final Quiz.

Course Outcomes
Upon successful completion of this module, students will be able to:
- Classify the different types of data generated by smart cities/IoT platforms
- Describe and use a range of basic and advanced statistical tools and data mining algorithms
- Code, apply, and solve real-world problems using Python
- Design, implement, evaluate, and interpret data-mining-based (including some statistical ML) systems to solve real-world problems (not limited to smart cities)

Python: Jupyter Notebook
Install Anaconda Individual Edition, or use Google Colab (cloud).
1. Download worksheet01.ipynb from Class Team.
2. Move the downloaded file to a working directory %WORKDIR%.
3. Start → Anaconda3 (64-bit) → Anaconda Prompt.
4. Type the following command at the prompt: jupyter notebook --notebook-dir=%WORKDIR%
5. The Jupyter UI should open in a web browser.
6. Click on worksheet1.ipynb to get started.
Practice: numpy-basics.ipynb

Course References
There is no dedicated textbook for this subject; relevant references will be mentioned during the lectures. Some useful references:
- Machine Learning: A Probabilistic Perspective, Kevin Murphy, MIT Press, 2012.
- Probabilistic Graphical Models: Principles and Techniques, Daphne Koller, MIT Press, 2009.
- Pattern Recognition and Machine Learning, Christopher Bishop, Springer, New York, 2006.

Break

Data Types, Quality Issues, and Pre-processing
Outline: data and measurements, dataset types, data quality, data pre-processing.

What is Data?
A collection of records and their attributes. Important terminology:
- Instance: measurements/observations/records about individual entities/objects, e.g., a loan application.
- Attribute: a component/feature (also called a dimension) of an instance, e.g., the applicant's salary, number of dependents, etc.
- Label: the target, an outcome that may be categorical, numeric, etc., e.g., forfeit vs. paid off.
- Example: an instance coupled with a label.
- Model: a discovered relationship between attributes and/or the label (supervised, unsupervised, semi-supervised, self-supervised).

Supervised vs. Unsupervised Learning
Training data are used to construct models. Supervised learning uses labelled training data; the model is used to predict labels on new instances. Unsupervised learning uses unlabelled training data; the model is used to cluster related instances and to understand attribute relationships.

Learning Architecture
Training examples (instances with labels) are fed to a learner, which produces a model. The model predicts labels for test instances, and the predictions are evaluated against the true labels.

Data Types
Data can be numbers, names, or other labels.
The two main data types are:
- Categorical (qualitative): values fall into distinct categories, e.g., gender (male, female), employment status (full-time, part-time, casual, unemployed).
- Numerical (quantitative): number values, with units, in either of two forms:
  - Discrete: between two sequential values there may be no permissible values, e.g., number of members in a family, shoe size.
  - Continuous: an infinite number of values within a defined range are possible, e.g., measurements of weight, time, etc.
Not all data represented by numbers are numerical data (e.g., 1 = male, 2 = female is categorical data). Data are of less use without their context, or unit.

Penguin example

ID   Gender  Age (months)  Height (cm)  Weight (kg)  Fat 1 (mm)  Fat 2 (mm)
342  M       12            67           25.3         15.1        16.2
12   F       23            73           26.4         14.1        13.9
432  M       5             24           0.5          7.1         8.5
14   M       62            79           28.9         12.1        11.5
98   F       28            76           31.0         13.5        14.9
987  F       31            81           30.7         14.6        14.7

This dataset contains the penguins' profiles:
- Gender is a categorical variable with two categories, 'M' and 'F'.
- Age, height, weight, Fat 1, and Fat 2 are numerical variables.

Types of Categorical Attributes
- Nominal: no natural order between categories, e.g., ID number, gender, colours.
- Ordinal: an ordering exists, e.g., rankings, height, grades, fuzziness.
- Interval: e.g., time duration, measurement errors (precision intervals).

Common types of data
- Tabular data
- Transaction data
- Temporal data
- Spatial data
- Spatio-temporal data
- Graph data
- Unstructured data

Tabular Data
Each record consists of a fixed set of attributes. Such data are often represented by a data matrix (n × p), where n is the number of records and p the number of features/attributes/dimensions. A distance matrix holds the pairwise dissimilarities computed with a distance metric (e.g., Euclidean distance).

Transaction Data
Each record (transaction) has a set of items.

Temporal Data
Sequential data: transaction data recorded with a timestamp.

Timestamp           Customer ID  Items bought
09:45, 05/08/2022   1            Milk, Cereal
16:05, 05/08/2022   2            Muffins
12:35, 06/08/2022   1            Eggs, Bread
18:15, 07/08/2022   3            Biscuits
19:05, 08/08/2022   3            Chips
20:45, 08/08/2022   2            Milk, Eggs
13:55, 09/08/2022   1            Sandwich

Sequence data: sequential data without timestamps; the records are instead time-ordered. Examples: gene sequences, speech data ("How can I help you?"). Any other example?

Time-series data: a collection of observations/measurements recorded over time; may be univariate or multivariate. Examples: stock prices, number of COVID deaths every day, sensor data.

Spatial Data
Instances having spatial features such as position, location, or GPS coordinates. Examples: weather data in different cities, COVID cases in different states, taxi pickup and drop-off locations in a city.

Spatio-temporal Data
Trajectories: time-ordered movements of users, vehicles, etc., recorded as sampling points.

Unstructured Data
Data that are not organized in a clearly defined framework and cannot be displayed in rows and columns or relational databases. Mostly qualitative data in the form of text, audio files, or video files. Examples: emails, spreadsheets, text reviews, etc.
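The data matrix and distance matrix described above for tabular data can be sketched in a few lines of NumPy; the tiny 3 × 2 data matrix here is made up purely for illustration:

```python
import numpy as np

# A tiny (n x p) data matrix: n = 3 records, p = 2 features (made-up values)
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Pairwise Euclidean distance matrix (n x n): D[i, j] = ||X[i] - X[j]||
diff = X[:, None, :] - X[None, :, :]   # broadcast to shape (n, n, p)
D = np.sqrt((diff ** 2).sum(axis=-1))  # shape (n, n)

print(D)  # symmetric, zeros on the diagonal; e.g. D[0, 1] = 5.0
```

In practice a library routine (e.g., scipy's pairwise-distance helpers) would be used for large n, since the full n × n matrix grows quadratically with the number of records.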
Graph Data
A graph is represented by nodes and edges (links); it captures the relationships (edges) among data objects (nodes). Graphs can be directed or undirected. Examples: the World Wide Web (WWW), road networks, social networks.
Image source: https://blogs.cornell.edu/info2040/2021/09/14/google-maps-and-graph-theory

Data Quality
How good are our data with respect to attribute values? Are the data complete, accurate, and consistent? Most real-world data are dirty, incomplete, and noisy, owing to limitations of measuring devices (sensor faults, drifts, calibration problems), flaws in data collection, human errors, etc. Data mining therefore focuses on detecting and correcting data quality problems, and on developing robust algorithms that can handle poor data quality. How much time does data cleaning take in an ML task?

Typical Data Quality Issues
Noise, outliers, missing values, and duplicates. Example sensor log:

Timestamp        Temperature  Humidity  PM 10
11/08/22, 17:00  29           72        105
11/08/22, 17:30  29           79        105
11/08/22, 17:30  29           79        105
11/08/22, 18:00  28           73        120
11/08/22, 18:30  27           NaN       320
11/08/22, 19:00  32           78        115
11/08/22, 19:30  25           79        110

Noise
Noise is the random component of a measurement error: meaningless information. Noisy data are usually corrupted or distorted. Examples: accelerometer data with noise, distortion in a person's voice when talking, GPS errors. Do we always remove noise? Is it good sometimes?
Image source: Nakamura, A Comparative Study of Information Criteria for Model Selection.
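One simple way to reduce the random noise described above in a sensor stream is a moving-average smoother. This is a minimal sketch in plain Python; the readings are made up, with an artificial spike at index 3:

```python
def moving_average(values, window=3):
    """Smooth a 1-D sequence by averaging each point with its neighbours."""
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - window // 2)       # clamp the window at the edges
        hi = min(n, i + window // 2 + 1)
        neighbourhood = values[lo:hi]
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed

# Made-up noisy temperature-like readings with a spike at index 3
readings = [29.0, 29.0, 28.0, 45.0, 27.0, 32.0, 25.0]
print(moving_average(readings))
```

Note that smoothing dampens the spike but also blurs genuine rapid changes, which is one reason the slide asks whether noise should always be removed.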
Outliers
Data points with characteristics that are significantly different from those of most other data points in the dataset. Local outliers (defined over a small subset of the data) vs. global outliers (defined over the entire dataset). (Image source: Google Images; red data point: local outlier, grey data points: global outliers.)

Missing Data
Data having instances with one or more missing attribute values. Possible reasons:
- Information is not fully collected, e.g., people decline to give some of their personal information (age, weight); battery drain or communication loss in IoT.
- Some attributes may not be applicable to certain objects, e.g., children don't have an annual income.
How does missing data affect our analysis? It can reduce statistical power, cause bias in the estimation of parameters, and reduce the representativeness of samples.

Types of Missing Data
- Missing completely at random: purely random points are missing; power may be lost, but analyses remain unbiased.
- Missing at random: something affects missingness, but it has no relation to the missing value itself, e.g., faulty sensors, some people not filling out forms correctly.
- Missing not at random: systematic missingness linked to the value, e.g., sensor decay, sick people leaving a study; usually problematic and has to be modelled or resolved.

Duplicate Data
A dataset may include objects that are duplicates, or almost duplicates, of one another. Possible reasons: very high sensor sampling rates, sensor faults, human error, merging data from multiple sources. Examples: the same person with multiple email addresses; identical sensor measurements of a slowly varying phenomenon (e.g., temperature). How does it affect our model/analysis? It may create imbalance and bias; the model may overfit and not generalize well; and it adds computation and storage.

Data Pre-processing
Why pre-processing? The quality of the data may affect data mining results, and ML models make a lot of assumptions about the data.
In reality, these assumptions are often violated; therefore, pre-process the data before feeding it into the algorithms. Pre-processing may comprise the majority of the work in ML (as high as 90%). Pipeline: Data → Pre-processing Techniques → ML/DL Models.

Major Data Pre-processing Tasks
- Data cleaning: handling noise, outliers, duplicates, missing data
- Data transformation: scaling, encoding
- Feature engineering
- Sampling

Data Cleaning
- Imputation (for handling missing values): estimate missing data (constant, mean, k-NN, or iterative methods), or ignore missing values.
- Binning (for handling noise and outliers): sort the data and partition it into bins (e.g., equal-width); smooth by bin means, medians, boundaries, etc.
- Regression (curve fitting): smooth by fitting the data with regression functions; may also be used to impute missing data.
- Clustering: group values into clusters and remove outliers (discussed in coming lectures).
- Combined computer and human inspection: detect unusual values automatically and have them checked by a human; needs domain knowledge or expert decisions.

Binning
- Equal-width binning divides the range into K intervals of equal width.
- Equal-depth binning divides the range into K intervals, each containing approximately the same number of samples.
Image source: SFU, CMPT 741, Fall 2009, Martin Ester

Regression
Replace noisy values, or impute missing values, by fitting an appropriate curve over the given set of data points.
Image sources: https://madrury.github.io/jekyll/update/statistics/2017/08/12/noisy-regression.html , https://datascience.stackexchange.com/questions/9529/how-to-select-regression-algorithm-for-noisy-scattered-data

Data Transformation: Data Scaling
Why is scaling needed? What happens when different numeric features have different scales (different ranges of values)? Distances then depend mainly on the features with large values, and features with much higher values may overpower the others.
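This scale-dominance effect can be seen in a small sketch; the income and age values below are made up for illustration, and z-scoring follows the standard-scaling formula discussed next:

```python
import math

# Two records with features on very different scales (made-up values):
# income in rupees, age in years
a = (500000.0, 25.0)
b = (510000.0, 60.0)

raw_dist = math.dist(a, b)
# ~10000.06: the 35-year age gap barely registers next to the income gap

def zscore(column):
    """Standard-scale one feature: subtract the mean, divide by the std. dev."""
    mean = sum(column) / len(column)
    std = (sum((v - mean) ** 2 for v in column) / len(column)) ** 0.5
    return [(v - mean) / std for v in column]

incomes = zscore([500000.0, 510000.0])
ages = zscore([25.0, 60.0])
scaled_dist = math.dist((incomes[0], ages[0]), (incomes[1], ages[1]))
# After scaling, both features contribute equally to the distance
```

With raw values the distance is driven almost entirely by income; after scaling, each feature is in comparable units, which is exactly what distance-based methods such as kNN and k-means need.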
Goal: bring all features into the same range.
Image source: https://ml-course.github.io/master/06%20-%20Data%20Preprocessing.slides.html#/

Types of Scaling
- Standard scaling (standardization)
- Min-max scaling
- Robust scaling

Standard Scaling (standardization, z-score normalization)
Assumes the data are normally distributed with mean μ and standard deviation σ. For each feature, subtract the mean and divide by the standard deviation; the scaled data can then be treated as drawn from a standard normal distribution (μ = 0, σ = 1):

x_new = (x − μ) / σ

Min-max Scaling
Scales every feature to lie between a given min and max value (e.g., 0 and 1). Makes more sense if the min/max values have meaning in your data. The scaled data may be sensitive to outliers.

Robust Scaling
Subtracts the median of the data and scales between the first and third quartiles (25% and 75%), say a and b. The scaled feature has median 0, with a = −1 and b = 1. Ignores outliers.

Data Transformation: Data Encoding
Why encoding? Many ML models require data to be in numerical format, whether input or output, so categorical features must be encoded into numerical values before fitting a model. Two widely used encoding techniques:
- Ordinal encoding
- One-hot encoding
Other encoding mechanisms (beyond the scope of this course): target encoding.

Ordinal Encoding
Only useful if a natural order exists among the categories: the model will consider one category "higher" than, or "closer" to, another according to the pre-existing order. It assigns an integer value to each category according to that order.
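Ordinal encoding as described can be sketched in plain Python; the grade categories and their order here are made-up examples:

```python
# A made-up ordered set of categories, worst to best
order = ["poor", "fair", "good", "excellent"]
to_int = {category: rank for rank, category in enumerate(order)}

grades = ["good", "poor", "excellent", "good", "fair"]
encoded = [to_int[g] for g in grades]
print(encoded)  # [2, 0, 3, 2, 1]
```

With this encoding the model treats "excellent" (3) as greater than "good" (2), which is only meaningful because the categories genuinely have an order; for nominal attributes such as colour, one-hot encoding is the safer choice.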
One-hot Encoding
Assigns a new 0/1 feature for every category; a data point having a 1 denotes that it has that category. Cons: the number of features can explode if a feature has many values, leading to high-dimensionality problems.

Some Tips for Data Transformation
- Only transform (e.g., scale) the input features (X), not the targets (y).
- Data transformation should follow the fit-predict paradigm: first transform (e.g., scale) the training data, then train the learning model (fit); transform the test data, then evaluate the model (predict).
- If you fit and transform the training and test data together, before splitting, you get data leakage: you have looked at the test data before training the model, and the model evaluations will be misleading.
- If you fit and transform the training and test data separately, you distort the data: training and test points are scaled differently.

Data Sampling
If your data are very big, processing them may take a large amount of time; how can you make the data smaller before feeding them to a model? And how do you handle imbalanced data, where a majority class has many times the number of examples of the minority class? The answer is sampling. Sampling also helps in learning models that generalize well (data partitioning, cross-validation).
Image source: SFU, CMPT 741, Fall 2009, Martin Ester

How can one achieve "effective" sampling? Choose a subset/sample of the data that will work as well as the whole dataset. Samples that satisfy this property are called "representative" samples.
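Random sampling, with and without replacement, can be sketched with Python's standard random module; the population of 100 record IDs is made up:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
population = list(range(100))  # made-up population of 100 record IDs

# Sampling without replacement: each record can appear at most once
without = random.sample(population, k=10)

# Sampling with replacement: the same record may be picked more than once
with_repl = random.choices(population, k=10)

print(without)
print(with_repl)
```

`random.sample` removes each selected point from consideration, while `random.choices` draws independently each time, which is what makes probability calculations with replacement analytically easier.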
A sample is representative if and only if it has approximately the same properties (of interest) as the whole dataset; otherwise we say the sample carries some inherent bias. For example, we cannot take samples of students' heights from a US university and use them to compute the average height of students at Indian universities.

Types of Sampling
- Random sampling: equal probability of selecting any data point.
- Sampling without replacement: as each data point is selected, it is removed from the given set (population).
- Sampling with replacement: the same object can be picked more than once; this makes analytical computation of probabilities easier.

Stratified Sampling
Split the data into several representative groups (maintaining the class distributions across groups), then draw samples from each group. Example: I want to understand the differences between legitimate and fraudulent credit card transactions, where 0.1% of transactions are fraudulent. What happens if I select 1000 transactions at random? I get about 1 fraudulent transaction; not enough to draw any conclusions.
Solution: sample 1000 legitimate and 1000 fraudulent transactions.

Sampling to Handle Unbalanced Data
- Under-sampling: count the number of data points in the minority class, then sample from the majority class until balanced, either by random sampling (with or without replacement) or by model-based under-sampling (e.g., nearest-neighbour based; beyond the scope of this module). Preferred for large datasets.
- Oversampling: count the number of data points in the majority class, then randomly sample from the minority class, with replacement, until balanced; or use model-based oversampling such as ADASYN (Adaptive Synthetic) or SMOTE (Synthetic Minority Oversampling Technique) (beyond the scope of this module). Makes the model more expensive to train and doesn't always improve performance.

Summary
- Data pre-processing is a crucial part of ML.
- Scaling is important for many distance-based methods.
- Categorical encoding is necessary for numeric methods.
- It is often better to impute missing data than to remove data.
- Imbalanced datasets require extra care to build reliable models.
- Choose the right pre-processing steps and models.

References
Books:
- García, Salvador, Data Preprocessing in Data Mining.
Web resources:
- http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.htm
- https://ml-course.github.io/master/06%20-%20Data%20Preprocessing.slides.html#/
