CP610 Data Analysis PDF
Uploaded by FamedMeadow6120
Summary
This document covers data analysis and preprocessing techniques, including data cleaning, data integration, data reduction, data transformation, and data discretization. It also details the handling of missing and noisy data and two families of data reduction techniques: dimensionality reduction, including Principal Component Analysis (PCA), and numerosity reduction.
Full Transcript
CP610 Data Analysis - Data Preprocessing & Visualization

Data Preprocessing
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation and Data Discretization
- Summary

Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., Occupation=“ ” (missing data)
- noisy: containing errors or outliers
  - e.g., Salary=“−10” (an error)
- inconsistent: containing discrepancies in codes or names, e.g.,
  - Age=“42”, Birthday=“03/07/2010”
  - Was rating “1, 2, 3”, now rating “A, B, C”
  - discrepancy between duplicate records

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
- Fill it in automatically with
  - a global constant: e.g., “unknown”, a new class?!
  - the attribute mean/median/mode
  - the attribute mean/median/mode for all samples belonging to the same class
  - the most probable value: inference-based, such as a Bayesian formula or a decision tree

How to Handle Noisy Data?
- Binning
  - first sort the data and partition it into bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
- Regression
  - smooth by fitting the data to regression functions
- Clustering
  - detect and remove outliers

Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration: e.g., A.cust-id ≡ B.cust-#
  - Integrate metadata from different sources

Handling Redundancy in Data Integration
- Object identification: the same attribute or object may have different names in different databases
- Derivable data: one attribute may be a “derived” attribute in another table
- Redundant attributes may be detected by correlation analysis

Data Reduction Strategies
- Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
- Why data reduction? A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
- Data reduction strategies
  - Dimensionality reduction
  - Numerosity reduction

Data Reduction 1: Dimensionality Reduction
- Curse of dimensionality
  - When dimensionality increases, data becomes increasingly sparse
  - Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
  - The number of possible combinations of subspaces grows exponentially
- Dimensionality reduction
  - Avoid the curse of dimensionality
  - Help eliminate irrelevant features and reduce noise
  - Reduce the time and space required for data analysis
  - Allow easier visualization
- Dimensionality reduction techniques
  - Principal Component Analysis
  - Supervised and nonlinear techniques (e.g., feature selection)
  - …

Geometric interpretation of PCA
- Which variable is the principal variable?
- The one with maximum variance
- The first PC is the direction of maximum variance from the origin
- Subsequent PCs are orthogonal to the first PC and describe the maximum residual variance

PCA transformation
- The coefficients of the linear combinations that transform the observations onto the PCs are given by the eigenvectors of the covariance matrix
- Covariance matrix (3 dimensions)

PCA Algorithm
Input: data matrix
- Step 1: Normalize the data matrix
- Step 2: Compute the covariance matrix
- Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix
- Step 4: Sort the eigenvectors: take the eigenvalues λ₁, λ₂, …, λp, sort them from largest to smallest, and order the eigenvectors accordingly
- Step 5: Choose the first k eigenvectors and calculate the new features

Project the standardized points into the new feature space

Attribute Subset Selection
- Another way to reduce the dimensionality of data
- Remove:
  - Redundant attributes: duplicate much or all of the information contained in one or more other attributes
  - Irrelevant attributes: contain no information that is useful for the task
- Methods
  - Filter: e.g., an attribute with higher correlation to the target is preferred
  - Wrapper
  - Embedded: e.g., L1-regularization, L2-regularization, …

Wrapper
- Task: supervised (classification/regression)
- Iterative process
  - Subset generation
  - Subset selection
  - Termination condition
- Search
  - Forward
  - Backward

Data Reduction 2: Numerosity Reduction
- Reduce data volume by choosing alternative, smaller forms of data representation
- Non-parametric methods
  - Do not assume models
  - Major families: sampling (random sampling, stratified sampling), histograms, clustering, …
- Parametric methods
  - Regression, …
[Figure: raw data reduced by random sampling (without replacement, with replacement) and by stratified sampling]

Data Transformation
- A function that maps the entire set of values of a given attribute to a new set of replacement values such that
  each old value can be identified with one of the new values
- Methods
  - Normalization: scaled to fall within a smaller, specified range
    - min-max normalization
    - z-score normalization
    - normalization by decimal scaling
  - Discretization

Normalization
- Min-max normalization: to [new_min_A, new_max_A]
  \[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
  - Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
    \[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
- Z-score normalization (μ: mean, σ: standard deviation):
  \[ v' = \frac{v - \mu_A}{\sigma_A} \]
  - Ex. From the data, μ = 54,000 and σ = 16,000. Then
    \[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
- Normalization by decimal scaling
  \[ v' = \frac{v}{10^{j}} \]
  where j is the smallest integer such that max(|v'|) < 1
  - Ex. j = 5; 73,600 / 10^5 = 0.736

Discretization
- Discretization: divide the range of a continuous attribute into intervals
  - Prepare for further analysis, e.g., classification, association rule mining, …
  - Interval labels can then be used to replace actual data values
  - Reduce data size by discretization
- Discretization can be performed recursively on an attribute
  - Split (top-down) vs. merge (bottom-up)
  - Supervised vs. unsupervised approaches
- Equal Width Binning, Equal Frequency Binning, Clustering-Based Discretization, Density-Based Discretization, Decision Tree-Based Discretization, …

Summary about data preparation
- Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
- Data cleaning: e.g., missing/noisy values, outliers
- Data integration from multiple sources:
  - Entity identification problem
  - Remove redundancies
  - Detect inconsistencies
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
- Data transformation and data discretization
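
The three normalization methods from the Normalization slide can be checked against its worked income examples with a short Python sketch. The function names here are hypothetical, not from the slides, and the decimal-scaling helper assumes j is a non-negative integer.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a


def decimal_scaling_exponent(values):
    """Smallest non-negative integer j such that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return j


# Income values taken from the slide's example (range 12,000 to 98,000).
incomes = [12_000, 73_600, 98_000]
j = decimal_scaling_exponent(incomes)

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(j, 73_600 / 10 ** j)                          # 5 0.736
```

Min-max normalization needs the attribute's true minimum and maximum, so a later value outside that range would map outside [new_min, new_max]; z-score normalization does not have that constraint.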
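
Binning appears twice in these slides: as a smoothing method for noisy data and as an unsupervised discretization method. The sketch below illustrates both under stated assumptions. The data values and function names are illustrative only, ties in boundary smoothing go to the lower boundary, and the equal-width helper assumes the values are not all identical.

```python
from statistics import mean


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # each bin is sorted, so these are its boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


def equal_width_labels(values, n_bins):
    """Map each value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Illustrative values, not taken from the slides.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
print(equal_width_labels(prices, 3))
```

Equal-frequency bins hold the same number of values, while equal-width bins cover the same range, so the two can split skewed data very differently.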
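
The PCA algorithm steps listed earlier can be sketched in plain Python. This is a simplified illustration, not the course's implementation. It swaps the full eigendecomposition of Steps 3 and 4 for power iteration, which recovers only the first principal component. It also uses population (divide by n) variance and covariance. The function names and the five data points are hypothetical.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Step 1: z-score each column (zero mean, unit population variance)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def covariance_matrix(rows):
    """Step 2: covariance matrix of data whose columns already have zero mean."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]


def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def first_eigenpair(matrix, iterations=200):
    """Steps 3-4, simplified: power iteration for the largest eigenpair."""
    # Arbitrary start vector, assumed not orthogonal to the top eigenvector.
    vec = [1.0 / (i + 1) for i in range(len(matrix))]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the unit vector gives the eigenvalue estimate.
    eigenvalue = sum(a * b for a, b in zip(vec, mat_vec(matrix, vec)))
    return eigenvalue, vec


def project(rows, component):
    """Step 5: the new feature is each row's dot product with the component."""
    return [sum(v * c for v, c in zip(row, component)) for row in rows]


# Illustrative two-attribute data, not taken from the slides.
data = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]
z = standardize(data)
cov = covariance_matrix(z)
eigenvalue, pc1 = first_eigenpair(cov)
scores = project(z, pc1)

print(eigenvalue, pc1)
print(scores)
```

For k > 1, a full symmetric eigensolver from a numerical library would normally replace `first_eigenpair`. In either case, the variance of the projected scores along a component equals that component's eigenvalue, which is why sorting by eigenvalue ranks the components by variance explained.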