Data Cleaning Lecture 5 - PDF

Data Cleaning
Chapter 3: Data Preprocessing
Dr. Amira Abdelatey

Chapter 3 outline
Data Preprocessing: An Overview
  – Data Quality
  – Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Summary

Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
  – Accuracy: correct or wrong, accurate or not
  – Completeness: not recorded, unavailable, …
  – Consistency: some modified but some not, dangling, …
  – Timeliness: is the data updated in time?
  – Believability: how far can the data be trusted to be correct?
  – Interpretability: how easily can the data be understood?

Data Preprocessing: Definition
Data preprocessing refers to the steps applied to make data more suitable for exploratory data analysis and machine learning, so that the data is ready to yield insights and support decisions.
The steps used for data preprocessing usually fall into two categories:
  1. Selecting data objects and attributes for the analysis.
  2. Creating or changing the attributes.

Major Tasks in Data Preprocessing
  1. Data cleaning
     ▪ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  2. Data integration
     ▪ Integration of multiple databases or files
  3. Data reduction
     ▪ Dimensionality reduction (reduce the number of features)
     ▪ Numerosity reduction (reduce the data volume)
  4. Data transformation and data discretization
     ▪ Normalization: adjusting values measured on different scales to a common scale

First steps: load the data, check the data info, and perform a statistical analysis.

Data Cleaning
Data in the real world is dirty: it contains lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, or transmission errors.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

Dirty data may be:
  1. Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
     – e.g., Occupation=“ ” (missing data)
  2. Noisy: containing noise, errors, or outliers
     – e.g., Salary=“−10” (an error)
  3. Inconsistent: containing discrepancies in codes or names
     – e.g., Age=“42”, Birthday=“03/07/2010”
     – the rating was “1, 2, 3” and is now “A, B, C”
     – discrepancies between duplicate records
  4. Intentional (e.g., disguised missing data)
     – Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data
Data is not always available
  – e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
  – equipment malfunction
  – data that was inconsistent with other recorded data and thus deleted
  – data not entered due to misunderstanding
  – certain data not being considered important at the time of entry
  – history or changes of the data not being registered
Missing data may need to be inferred.

How to Handle Missing Data?
  – Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
  – Fill in the missing value manually: tedious, possibly infeasible
  – Fill it in automatically with
     ▪ a global constant, e.g., “unknown” (a new class?!)
     ▪ the attribute mean
     ▪ the attribute mean for all samples belonging to the same class: smarter
     ▪ the most probable value: inference-based, such as a Bayesian formula or a decision tree [later]
Techniques shown on the slides:
  – Filling null values with a single value
  – Dropping missing values using dropna()
  – Filling null values with fillna()

Noisy Data
Noise: random error in a measured variable
Incorrect attribute values may be due to
  – faulty data collection instruments
  – data entry problems
  – data transmission problems
  – technology limitations
  – inconsistency in naming conventions
Other data problems that require data cleaning
  – duplicate records
  – incomplete data
  – inconsistent data

How to Detect Noisy Data
  – Visualize the data using graphs, charts, or plots. Visualization can help you spot patterns, trends, outliers, or anomalies that may indicate noise.

How to Handle Noisy Data?
  – Binning
     ▪ first sort the data and partition it into (equal-frequency) bins
     ▪ then smooth by bin means, bin medians, bin boundaries, etc.
  – Combined computer and human inspection
     ▪ detect suspicious values and have a human check them (e.g., to deal with possible outliers)
  – Regression (ML, later)
     ▪ smooth by fitting the data to regression functions
  – Clustering (ML, later)
     ▪ detect and remove outliers

Binning Methods for Data Smoothing
Binning is used to smooth data, that is, to handle noisy data.
  – Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
  – Smoothing by bin medians: each value in a bin is replaced by the median value of the bin.
  – Smoothing by bin boundaries: the minimum and maximum values in a bin are identified as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
Approach:
  1. Sort the array of the given data set.
  2. Divide the range into N intervals, each containing approximately the same number of samples (equal-depth partitioning).
  3. Store the mean, median, or boundaries in each row.
[Example slide: binning methods for data smoothing]

Data Cleaning as a Process
  – Data discrepancy detection
     ▪ Use metadata (e.g., domain, range, dependency, distribution)
     ▪ Check field overloading
     ▪ Check the uniqueness rule, consecutive rule, and null rule
     ▪ Use commercial tools
        – Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
        – Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers)
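
The three-step binning approach (sort, partition into equal-depth bins, then smooth) can be sketched in plain Python as below. This is a minimal illustration, not the code from the lecture: the function names and the sample values are hypothetical, it assumes there are at least as many values as bins, and it resolves ties in boundary smoothing toward the lower boundary, which is an arbitrary choice.

```python
from statistics import mean, median


def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (almost) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one more value when the split is uneven.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value in a bin with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # each bin is sorted, so these are min and max
        # Ties go to the lower boundary (an assumption made for this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Hypothetical sample values chosen for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by means or medians collapses each bin to a single value, while smoothing by boundaries keeps the bin's extremes and only moves the interior values.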

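
The options listed under "How to Handle Missing Data?" (ignore the tuple, fill with a global constant, fill with the attribute mean, fill with the class-wise attribute mean) can be illustrated with a small plain-Python sketch. The records, field names, and helper functions below are hypothetical and are not taken from the lecture. The sketch marks a missing value with None and assumes every class has at least one observed value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; None marks a missing income.
rows = [
    {"cls": "A", "income": 30},
    {"cls": "A", "income": None},
    {"cls": "A", "income": 50},
    {"cls": "B", "income": 80},
    {"cls": "B", "income": None},
    {"cls": "B", "income": 100},
]


def drop_missing(rows, attr):
    """Ignore the tuple: keep only rows where attr is present."""
    return [r for r in rows if r[attr] is not None]


def fill_constant(rows, attr, constant):
    """Replace every missing value of attr with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_mean(rows, attr):
    """Replace missing values with the mean of the observed values of attr."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return fill_constant(rows, attr, overall)


def fill_class_mean(rows, attr, label):
    """Replace missing values with the mean of attr within the same class."""
    observed = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            observed[r[label]].append(r[attr])
    class_mean = {c: mean(v) for c, v in observed.items()}
    return [
        {**r, attr: class_mean[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


print(len(drop_missing(rows, "income")))                                # 4
print([r["income"] for r in fill_mean(rows, "income")])                 # [30, 65, 50, 80, 65, 100]
print([r["income"] for r in fill_class_mean(rows, "income", "cls")])    # [30, 40, 50, 80, 90, 100]
```

The class-wise mean (40 for class A, 90 for class B) stays closer to each group's observed values than the overall mean of 65, which is why the slides call it the smarter option. The dropna() and fillna() methods named on the slides play the roles of drop_missing and the fill_* helpers here.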