Data Preprocessing PDF
Summary
This presentation describes various data preprocessing techniques, including data cleaning, data integration, data transformation, and data reduction. It details the different forms of data preprocessing, such as handling missing and noisy data, and normalization.
Full Transcript
Data Preprocessing

Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Why Data Preprocessing?
Data in the real world is dirty:
◦ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data (e.g., occupation = “ ”)
◦ noisy: containing errors or outliers (e.g., Salary = “-10”)
◦ inconsistent: containing discrepancies in codes or names
  ◦ e.g., Age = “42” but Birthday = “03/07/1997”
  ◦ e.g., rating was “1, 2, 3”, now rating is “A, B, C”
  ◦ e.g., discrepancies between duplicate records

Why Is Data Dirty?
Incomplete data may come from:
◦ “not applicable” data values when collected
◦ different considerations between the time the data was collected and the time it is analyzed
◦ human, hardware, or software problems
Noisy data (incorrect values) may come from:
◦ faulty data collection instruments
◦ human or computer error at data entry
◦ errors in data transmission
Inconsistent data may come from:
◦ different data sources
◦ functional dependency violations (e.g., modifying some linked data)
Duplicate records also need data cleaning.

Why Is Data Preprocessing Important?
No quality data, no quality mining results!
◦ Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics.
◦ A data warehouse needs consistent integration of quality data.
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.

Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
◦ Accuracy
◦ Completeness
◦ Consistency
◦ Timeliness
◦ Believability
◦ Value added
◦ Interpretability
◦ Accessibility

Major Tasks in Data Preprocessing
Data cleaning
◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
◦ Integration of multiple databases, data cubes, or files
Data transformation
◦ Normalization and aggregation
Data reduction
◦ Obtain a reduced representation in volume that produces the same or similar analytical results
Data discretization
◦ Part of data reduction, with particular importance for numerical data

Forms of Data Preprocessing
(figure slide)

Measuring the Central Tendency
Mean (algebraic measure; sample vs. population): x̄ = (1/n) Σ_{i=1..n} x_i
◦ Trimmed mean (e.g., 20%): chop off the extreme values at both ends before averaging
Median: the middle value if there is an odd number of values, or the average of the two middle values otherwise
Mode
◦ The value that occurs most frequently in the data
◦ Data can be unimodal, bimodal, or trimodal
◦ Empirical formula for moderately skewed data: mean − mode ≈ 3 × (mean − median)

Measuring the Central Tendency
The mode is the value that appears most frequently in a set of data. The value in a data collection with the highest frequency is referred to as the mode or modal value. It is one of the three measures of central tendency, together with the mean and median. The word “mode” comes from the French phrase “la mode”, which means “fashionable”.

Measuring the Central Tendency
What is Mode?
In a set of values, the mode is defined as the value with the highest frequency, i.e., the value that shows up most often. For example, the mode of the data set 2, 4, 5, 5, 6, 7 is 5 because it appears twice in the collection. Like the mean and median, the mode summarises important information about a random variable or population in a single number. In a normal distribution the modal value is the same as the mean and median, but in a severely skewed distribution it can be considerably different. As an example, the mode of {6, 3, 9, 6, 6, 5, 9, 3} is 6, because 6 occurs most often. With a finite number of observations the mode is therefore easy to find. A set of values can have only one mode, multiple modes, or none at all.
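To make these measures concrete, here is a minimal sketch in Python using only the standard library; the sample values are made up for illustration, and the 20% trimmed mean is implemented by hand rather than with a dedicated library call.

```python
# A minimal sketch of the central-tendency measures above (mean, trimmed
# mean, median, mode), using only Python's standard library.
# The sample values are made up for illustration.
from statistics import mean, median, multimode

values = [2, 4, 5, 5, 6, 7, 95]            # one extreme value (95) to show trimming

def trimmed_mean(data, proportion=0.2):
    """Mean after chopping `proportion` of the values off each end."""
    data = sorted(data)
    k = int(len(data) * proportion)        # how many values to drop per tail
    return mean(data[k:len(data) - k] if k else data)

print("mean        :", round(mean(values), 2))    # pulled upward by 95
print("trimmed mean:", round(trimmed_mean(values), 2))
print("median      :", median(values))            # middle value of the sorted data
print("mode(s)     :", multimode(values))         # most frequent value(s) -> [5]
```

If a library version is preferred, SciPy also offers scipy.stats.trim_mean for the trimmed mean.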
Measuring the Central Tendency
Types of Mode
Unimodal: a data set with only one mode. For example, the mode of A = {14, 15, 16, 17, 15, 18, 15, 19} is 15, because only that value repeats; it is a unimodal data set.
Bimodal: a data set with two modes, i.e., two data values share the highest frequency. The set A = {2, 2, 2, 3, 4, 4, 5, 5, 5} has modes 2 and 5, because both are repeated three times.

Measuring the Central Tendency
Trimodal: a data set with three modes, i.e., the top three data values share the highest frequency. The set A = {2, 2, 2, 3, 4, 4, 5, 5, 5, 7, 8, 8, 8} has modes 2, 5, and 8, since all three numbers are repeated three times; it is a trimodal data collection.
Multimodal: a data set with four or more modes. For example, the set A = {100, 80, 80, 95, 95, 100, 90, 90} has modes 80, 90, 95, and 100, because all four values occur twice; it is a multimodal data set.

Symmetric vs. Skewed Data
(figure slide: median, mean, and mode of symmetric, positively skewed, and negatively skewed data)

Measuring the Dispersion of Data
Quartiles, outliers, and boxplots
◦ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
◦ Inter-quartile range: IQR = Q3 − Q1
◦ Five-number summary: min, Q1, median, Q3, max
◦ Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
◦ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
◦ Variance (algebraic, scalable computation): s² = (1/(n−1)) Σ_{i=1..n} (x_i − x̄)²
◦ The standard deviation s (or σ) is the square root of the variance s² (or σ²)

Properties of the Normal Distribution Curve
For the normal (distribution) curve:
◦ from μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
◦ from μ−2σ to μ+2σ: contains about 95% of the measurements
◦ from μ−3σ to μ+3σ: contains about 99.7% of the measurements

Boxplot Analysis
Five-number summary of a distribution: minimum, Q1, median, Q3, maximum
Boxplot
◦ The data is represented with a box
◦ The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
◦ The median is marked by a line within the box
◦ Whiskers: two lines outside the box extend to the minimum and maximum

Visualization of Data Dispersion: Boxplot Analysis
(figure slide)

Histogram Analysis
Graphical display of basic statistical class descriptions
◦ Frequency histograms
  ◦ A univariate graphical method
  ◦ Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
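Here is a minimal sketch of these dispersion measures in Python (five-number summary, IQR, the 1.5 × IQR outlier rule, and the sample standard deviation). The values are made up, and note that several quartile conventions exist; statistics.quantiles uses the "exclusive" method by default.

```python
# A minimal sketch of the dispersion measures above: five-number summary,
# IQR, the 1.5 * IQR outlier rule, and the sample standard deviation.
from statistics import median, quantiles, stdev

values = [10, 7, 8, 8, 9, 9, 7, 8, 2500]       # 2500 is an obvious outlier

q1, q2, q3 = quantiles(values, n=4)            # quartile cut points
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]

print("five-number summary:", min(values), q1, median(values), q3, max(values))
print("IQR:", iqr)
print("outliers beyond 1.5 * IQR:", outliers)  # -> [2500]
print("sample standard deviation:", round(stdev(values), 1))
```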
Chapter 2: Data Preprocessing
Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Data Cleaning
Importance
◦ “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
◦ “Data cleaning is the number one problem in data warehousing” (DCI survey)
Data cleaning tasks:
◦ Fill in missing values
◦ Identify outliers and smooth out noisy data
◦ Correct inconsistent data
◦ Resolve redundancy caused by data integration

Missing Data
Data is not always available
◦ e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to:
◦ equipment malfunction
◦ values that were inconsistent with other recorded data and were therefore deleted
◦ data not entered due to misunderstanding
◦ certain data not being considered important at the time of entry
◦ history or changes of the data not being registered
Missing data may need to be inferred.

How to Handle Missing Data?
◦ Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
◦ Fill in the missing value manually: tedious and often infeasible
◦ Fill it in automatically with:
  ◦ a global constant, e.g., “unknown” (which may effectively create a new class!)
  ◦ the attribute mean
  ◦ the attribute mean for all samples belonging to the same class (smarter)
  ◦ the most probable value: inference-based methods such as a Bayesian formula or a decision tree
(The slide also shows a small example table with attributes X, Y and a Class label illustrating these fill-in strategies.)

Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to:
◦ faulty data collection instruments
◦ data entry problems
◦ data transmission problems
◦ technology limitations
◦ inconsistency in naming conventions
Other data problems that require data cleaning:
◦ duplicate records
◦ incomplete data
◦ inconsistent data

How to Handle Noisy Data?
(The slide shows the example values 10, 7, 8, 8, 9, 9, 7, 8, 2500, where 2500 is an obvious outlier to be smoothed.)
Binning (see the sketch below)
1. First sort the data and partition it into (equal-frequency) bins
2. Then one can:
   - smooth by bin means
   - smooth by bin medians
   - smooth by bin boundaries
Regression
◦ smooth by fitting the data to regression functions
Clustering
◦ detect and remove outliers
Combined computer and human inspection
◦ detect suspicious values and have a human check them (e.g., to deal with possible outliers)
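The sketch below illustrates the binning-based smoothing described above: sort the data, partition it into equal-frequency bins, then replace each value by its bin mean or by the closest bin boundary. The values are the ones shown on the slide; the choice of three bins and the helper names are assumptions for illustration.

```python
# A minimal sketch of equal-frequency binning with smoothing by bin means
# and by bin boundaries.
def equal_frequency_bins(data, n_bins):
    data = sorted(data)
    size = len(data) // n_bins                            # assumes an even split
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]                              # bin boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

values = [10, 7, 8, 8, 9, 9, 7, 8, 2500]
bins = equal_frequency_bins(values, n_bins=3)
print("bins:                  ", bins)                    # [[7, 7, 8], [8, 8, 9], [9, 10, 2500]]
print("smoothed by means:     ", smooth_by_means(bins))
print("smoothed by boundaries:", smooth_by_boundaries(bins))
```

In practice, libraries such as pandas provide equal-frequency binning directly (pandas.qcut); the hand-rolled version above just makes the slide's steps explicit.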
Data Cleaning as a Process
Data discrepancy detection
◦ Use metadata (e.g., domain, range, dependency, distribution)
◦ Check field overloading
◦ Check the uniqueness rule, consecutive rule, and null rule
◦ Use commercial tools
  ◦ Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections
  ◦ Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
Data migration and integration
◦ Data migration tools: allow transformations to be specified
◦ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
Integration of the two processes
◦ Iterative and interactive (e.g., Potter’s Wheel)

Correlation Analysis (Categorical Data)
Χ² (chi-square) test: χ² = Σ (Observed − Expected)² / Expected
The larger the χ² value, the more likely the variables are related.
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
Correlation does not imply causality
◦ the number of hospitals and the number of car thefts in a city are correlated
◦ both are causally linked to a third variable: population

Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)         450
Not like science fiction     50 (210)    1000 (840)        1050
Sum (col.)                  300          1200              1500

Χ² (chi-square) calculation (the numbers in parentheses are expected counts, calculated from the data distribution in the two categories):
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 ≈ 507.93
It shows that like_science_fiction and play_chess are correlated in the group.

Data Transformation
Smoothing: remove noise from the data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scale values to fall within a small, specified range
1. min-max normalization
2. z-score normalization
Attribute/feature construction
◦ New attributes constructed from the given ones

Data Transformation: Normalization
1. Min-max normalization to [new_min_A, new_max_A]:
   v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
   ◦ Ex.: let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
2. Z-score normalization (μ: mean, σ: standard deviation):
   v' = (v − μ_A) / σ_A
   ◦ Ex.: let μ = 54,000 and σ = 16,000. Then 73,600 is mapped to (73,600 − 54,000) / 16,000 = 1.225
3. Normalization by decimal scaling:
   v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
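As an illustration, here is a minimal sketch of the three normalization methods above, reproducing the income example from the slide (range $12,000 to $98,000, μ = 54,000, σ = 16,000, v = 73,600). The function names are my own, not part of the original slides.

```python
# A minimal sketch of min-max, z-score, and decimal-scaling normalization.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(values):
    # j is the smallest integer such that every |v / 10**j| is below 1
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

print(round(min_max(73_600, 12_000, 98_000), 3))   # -> 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # -> 1.225
print(decimal_scaling([73_600, -12_000, 98_000]))  # divides by 10**5
```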
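Finally, returning to the chi-square example above (like science fiction vs. play chess), a small sketch that recomputes the statistic from the observed counts; expected counts are derived as row total × column total / grand total, as in the worked calculation.

```python
# A small sketch recomputing the chi-square statistic for the contingency
# table above (like science fiction vs. play chess).
observed = [[250, 200],      # like science fiction:     play chess, not play chess
            [50, 1000]]      # not like science fiction: play chess, not play chess

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 1))        # -> 507.9, the value reported on the slide as 507.93
```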