Data Preprocessing - MMU CDS6314 Lecture 3 PDF
Document Details

Multimedia University (MMU)
Summary
This document is an undergraduate-level lecture on data preprocessing from Multimedia University (MMU). The lecture covers data cleaning, data integration, data reduction, and data transformation, and explains the importance of data preprocessing in the context of data mining and machine learning.
Full Transcript
Data Preprocessing
CDS6314, Lecture 3

Outline
- Data Preprocessing: An Overview
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation
- Summary

Importance of Data Preprocessing
[Figure illustrating the importance of data preprocessing. Source: Digital Ag, Ohio State University]

What is Data Preprocessing? — Major Tasks
- Data cleaning: handle missing data, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data reduction: dimensionality reduction, numerosity reduction, data compression
- Data transformation and data discretization: normalization, concept hierarchy generation

What is Data Preprocessing? — Measures for Data Quality
Data quality is multidimensional:
- Accuracy: is the data correct or wrong, accurate or not?
- Completeness: is data not recorded or unavailable?
- Consistency: are some records modified while others are not? Are there dangling references?
- Timeliness: is the data updated in a timely manner?
- Believability: how trustworthy is the data?
- Interpretability: how easily can the data be understood?

Data Quality Issues
Data in the real world is dirty: much of it is potentially incorrect, e.g., due to faulty instruments, human or computer error, or transmission errors. Dirty data can be:
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data; e.g., Occupation = "" (missing data)
- Noisy: containing noise, errors, or outliers; e.g., Salary = "−10" (an error)
- Inconsistent: containing discrepancies in codes or names; e.g., Age = "42" but Birthday = "03/07/2010"; a rating scale that was "1, 2, 3" is now "A, B, C"; discrepancies between duplicate records
- Intentional (disguised missing data): e.g., Jan. 1 recorded as everyone's birthday

Missing Data
Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data. Missing data may occur because:
- equipment malfunctioned
- values were inconsistent with other recorded data and thus deleted
- data was not entered due to a misunderstanding
- certain data was not considered important at the time of entry
- the history or changes of the data were not registered
Missing data may need to be inferred.

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (in classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious and often infeasible
- Fill it in automatically with:
  - a global constant, e.g., "unknown" (which may in effect create a new class)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class
  - the most probable value: inference-based, e.g., using a Bayesian formula or a decision tree
A sketch of these automatic fill-in strategies follows.
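As a hedged illustration of the fill-in strategies above, here is a minimal pandas sketch; the DataFrame, column names, and sentinel value are hypothetical, not from the lecture.

```python
import pandas as pd

# Hypothetical data: customer income with gaps, plus a class label.
df = pd.DataFrame({
    "income": [48000.0, None, 52000.0, None, 61000.0, 45000.0],
    "label":  ["A", "A", "B", "B", "B", "A"],
})

# Ignore the tuple: drop rows whose income is missing.
dropped = df.dropna(subset=["income"])

# Global constant: a sentinel standing in for "unknown".
constant_filled = df["income"].fillna(-1.0)

# Attribute mean over all samples.
mean_filled = df["income"].fillna(df["income"].mean())

# Attribute mean per class: fill each gap with the mean of its own class.
class_mean_filled = df.groupby("label")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(mean_filled.tolist())        # global mean (51500.0) fills both gaps
print(class_mean_filled.tolist())  # class A gap -> 46500.0, class B gap -> 56500.0
```

The most-probable-value strategy would replace the per-class mean with a model (e.g., a decision tree trained on the complete rows) that predicts the missing attribute from the others.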
Noisy Data
Noise is random error or variance in a measured variable. Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions
Other data problems include duplicate records, incomplete data, and inconsistent data.

How to Handle Noisy Data? (1) Binning
- First sort the data and partition it into (equal-frequency) bins
- Then smooth by:
  - bin means: each bin value is replaced by the bin's mean
  - bin medians: each bin value is replaced by the bin's median
  - bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value

What is Binning?
Equal-width (distance) partitioning
- Divides the range into N intervals of equal size (a uniform grid): if A and B are the lowest and highest values of the attribute, the interval width is W = (B − A) / N
- The most straightforward approach, but outliers may dominate the presentation, and skewed data is not handled well
Equal-depth (frequency) partitioning
- Divides the range into N intervals, each containing approximately the same number of samples
- Gives good data scaling, but managing categorical attributes can be tricky

Try this: Smoothing by Bin Means
Which of the following is the answer?
A. Bin 1: 4, 4, 4; Bin 2: 21, 21, 21; Bin 3: 25, 25, 25
B. Bin 1: 8, 8, 8; Bin 2: 21, 21, 21; Bin 3: 28, 28, 28
C. Bin 1: 9, 9, 9; Bin 2: 22, 22, 22; Bin 3: 29, 29, 29
D. Bin 1: 15, 15, 15; Bin 2: 24, 24, 24; Bin 3: 34, 34, 34

Try this: Smoothing by Bin Boundary
Which of the following is the answer?
A. Bin 1: 4, 4, 15; Bin 2: 21, 21, 24; Bin 3: 25, 25, 34
B. Bin 1: 4, 4, 4; Bin 2: 21, 21, 21; Bin 3: 25, 25, 25
C. Bin 1: 4, 9, 15; Bin 2: 21, 22, 24; Bin 3: 25, 29, 34
D. Bin 1: 4, 15, 15; Bin 2: 21, 24, 24; Bin 3: 25, 34, 34
(A NumPy sketch after the two quizzes reproduces both smoothing schemes so the options can be checked.)
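A minimal NumPy sketch of equal-width and equal-depth binning plus the two smoothing schemes. It assumes the nine sorted values that the answer options above appear to be built from (4, 8, 15, 21, 21, 24, 25, 28, 34); that reconstruction is an inference, not stated in the slides.

```python
import numpy as np

# Sorted data reconstructed from the quiz options above (assumed, not given).
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
N = 3

# Equal-width partitioning for comparison: W = (B - A) / N.
width = (data.max() - data.min()) / N            # (34 - 4) / 3 = 10.0
edges = data.min() + width * np.arange(1, N)     # interior cut points [14, 24]
print(np.digitize(data, edges))                  # equal-width bin index per value

# Equal-depth (frequency) partitioning: three bins of three values each.
for i, b in enumerate(np.split(data, N), start=1):
    means = np.full_like(b, b.mean())             # bin means (whole numbers here)
    lo, hi = b.min(), b.max()                     # bin boundaries
    bounds = np.where(b - lo <= hi - b, lo, hi)   # snap to the closer boundary
    print(f"Bin {i}: means -> {means.tolist()}, boundaries -> {bounds.tolist()}")
```

Comparing the printout against the quizzes points to option C for smoothing by bin means and option A for smoothing by bin boundaries, assuming the reconstructed data is right.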
How to Handle Noisy Data? (2) Regression, Clustering, and Inspection
- Regression: smooth by fitting the data to regression functions
- Clustering: detect and remove outliers
- Semi-supervised, combined computer and human inspection: detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)

Data Cleaning as a Process
Data discrepancy detection
A discrepancy is a lack of compatibility between two or more facts that should be similar, e.g., caused by a poorly designed entry form. How to handle it?
- Use metadata: the knowledge you may already have regarding properties of the data (using the techniques in Chapter 2, e.g., domain, range, dependency, distribution)
- Then ask: do all values fall within the expected range? Are there any known dependencies between attributes?
- Values that are more than two standard deviations away from the mean for a given attribute may be flagged as potential outliers

Check for inconsistent data representation, e.g., "2010/12/25" and "25/12/2010" for the same date. Check for field overloading, where developers squeeze new attribute definitions into unused (bit) portions of already defined attributes. Also check:
- uniqueness rule: each value of certain given attributes must be unique
- consecutive rule: there are no missing values between the lowest and highest values for the attribute, and all values must also be unique (e.g., as in check numbers)
- null rule: specifies the use of blanks, question marks, special characters, or other strings that may indicate that a value for a given attribute is not available

Data migration and integration
- Data migration tools allow transformations to be specified, e.g., to replace the string "gender" by "sex"
- ETL (Extraction/Transformation/Loading) tools allow users to specify transformations through a graphical user interface
Integration of the two processes (discrepancy detection and data transformation) iterates: it is error-prone and time-consuming, and each iteration may introduce more discrepancies.

Data Integration
Data integration combines data from multiple sources into a coherent store. It involves:
- Schema integration: e.g., A.cust-id ≡ B.cust-#; integrate metadata from different sources
- Entity identification: identify real-world entities across multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ, e.g., because of different representations or different scales (metric vs. British units)

Handling Redundancy in Data Integration
Redundant data often arise when multiple databases are integrated:
- Object identification: the same attribute or object may have different names in different databases
- Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis and covariance analysis. Careful integration of data from multiple sources helps reduce or avoid redundancies and inconsistencies and improves mining speed and quality.

Correlation Analysis (for Categorical / Nominal Data)
The χ² (chi-square) test, also known as Pearson's statistic:
χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ
where Oᵢ is the observed frequency and Eᵢ the expected frequency of each cell. The null hypothesis is that the two distributions (attributes) are independent (no correlation). The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count; the larger the χ² value, the more likely the variables are related.
Note: correlation does not imply causality. The number of hospitals and the number of car thefts in a city are correlated, but both are causally linked to a third variable: population.

Chi-Square Calculation: An Example
Consider a survey of 1500 people cross-tabulating gender against preferred_reading (expected counts in brackets):

              male        female       total
fiction       250 (90)    200 (360)    450
non-fiction   50 (210)    1000 (840)   1050
total         300         1200         1500

The expected frequency is how many times you would expect, say, "male" and "fiction" to co-occur by chance. For example, the expected count 90 is derived as (450 × 300) / 1500 = 90: you would expect 90 males to prefer fiction if there were no gender effect. The further the observed values are from the expected values, the more likely the two attributes are correlated.

Assume a significance level of 0.001. The χ² value is:
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
The degrees of freedom are (rows − 1) × (columns − 1) = (2 − 1) × (2 − 1) = 1. From the χ² table, the value needed to reject the null hypothesis at the 0.001 significance level with 1 degree of freedom is 10.828. Since 507.93 > 10.828, gender and preferred_reading are correlated in this group.
(A SciPy sketch of this calculation follows.)
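A minimal SciPy check of the example above: scipy.stats.chi2_contingency recomputes the expected counts, the χ² statistic, and the degrees of freedom from the observed table. Passing correction=False disables Yates' continuity correction so the result matches the hand calculation.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the lecture example.
# Rows: fiction, non-fiction; columns: male, female.
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)

print(expected)    # [[ 90. 360.], [210. 840.]] -- the bracketed counts
print(chi2, dof)   # ~507.93 with 1 degree of freedom
print(p < 0.001)   # True: reject independence at the 0.001 level
```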
Covariance for Two Variables
Covariance and correlation are two similar measures for assessing how much two attributes change together. Consider two numeric attributes A and B, and a set of n observations {(a₁, b₁), …, (aₙ, bₙ)}. The mean values of A and B, also known as the expected values of A and B, are:
E(A) = Ā = (Σᵢ₌₁ⁿ aᵢ) / n and E(B) = B̄ = (Σᵢ₌₁ⁿ bᵢ) / n
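The transcript cuts off at this point. As a hedged sketch of the covariance idea this slide introduces, here is the standard population covariance and correlation computation in NumPy; the two attribute vectors are hypothetical.

```python
import numpy as np

# Hypothetical numeric attributes A and B with n = 5 observations each.
A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Expected values (means) of A and B.
mean_A, mean_B = A.mean(), B.mean()

# Covariance: Cov(A, B) = (1/n) * sum_i (a_i - mean_A) * (b_i - mean_B).
cov_AB = np.mean((A - mean_A) * (B - mean_B))
print(cov_AB)                          # 4.0 for these values
print(np.cov(A, B, bias=True)[0, 1])   # same result via NumPy

# Correlation normalizes covariance by both standard deviations,
# giving a scale-free value in [-1, 1].
print(np.corrcoef(A, B)[0, 1])
```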