Data Preprocessing: DM_Lec3 Notes PDF
Document Details
Uploaded by MomentousStanza2491
Haliç University
Data Mining
Summary
This document is lecture notes on data preprocessing techniques, specifically covering data quality, cleaning, integration, transformation, and reduction. The notes explain how these techniques can potentially improve data quality and the efficiency of data mining processes.
Full Transcript
Data Preprocessing
– Data Quality and Major Tasks in Data Preprocessing
– Data Cleaning
– Data Integration
– Data Transformation and Data Discretization
– Data Reduction

Data Preprocessing
Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size and their likely origin from multiple, heterogeneous sources.
Low-quality data will lead to low-quality mining results.
“How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?”
Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or reduce the time required for the actual mining.

Data Quality
What kinds of data quality problems are there?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
– Noise and outliers
  Noise: random error or variance in a measured variable
  Outliers: data objects with characteristics that are considerably different from most of the other data objects in the data set
– Missing values
– Duplicate data

Data Quality: Missing Values and Duplicate Data
Reasons for missing values:
– Information is not collected (e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Handling missing values:
– Eliminate data objects
– Estimate missing values
– Ignore the missing value during analysis
– Replace with all possible values (weighted by their probabilities)
A data set may include data objects that are duplicates, or almost duplicates, of one another.
– This is a major issue when merging data from heterogeneous sources.

Data Quality: Why Preprocess the Data?
Data have quality if they satisfy the requirements of the intended use.
Measures for data quality (a multidimensional view):
– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how much are the data trusted to be correct?
– Interpretability: how easily can the data be understood?

Data Quality: Why Preprocess the Data?
Accuracy: correct or wrong, accurate or not
There are many possible reasons for inaccurate data:
– Human or computer errors occurring at data entry.
– Users may purposely submit incorrect values for mandatory fields when they do not wish to share personal information, for example by leaving the default value “January 1” displayed for birthday.
– Incorrect data may also result from inconsistencies in naming conventions or data codes, or from inconsistent formats for input fields (e.g., date).
Completeness: not recorded, unavailable, …
Attributes of interest may not always be available.
Data may not be included simply because they were not considered important at the time of entry.
Relevant data may not be recorded due to a misunderstanding.
Missing data (tuples with missing values for some attributes) may need to be inferred.

Data Quality: Why Preprocess the Data?
Consistency: some modified but some not, dangling, …
– Discrepancies in the department codes used to categorize items.
– Inconsistencies in data codes, or inconsistent formats for input fields (e.g., date).
Timeliness: timely update?
– Are the data updated in a timely manner?
Believability: how much are the data trusted to be correct?
– How much are the data trusted by users?
– Past errors can affect the trustworthiness of the data.
Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing
Data cleaning can be applied to remove noise and correct inconsistencies in the data.
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.
– Integration of multiple databases, data cubes, or files
Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering.
– Dimensionality reduction, numerosity reduction, data compression
Data transformation and data discretization: transformations such as normalization may be applied.
– For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements.
– Concept hierarchy generation

Major Tasks in Data Preprocessing: Data Cleaning
Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
– If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied to them.
– Dirty data can confuse the mining procedure, resulting in unreliable output.

Major Tasks in Data Preprocessing: Data Integration
Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.

Major Tasks in Data Preprocessing: Data Reduction
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Data reduction strategies include dimensionality reduction and numerosity reduction.

Major Tasks in Data Preprocessing: Data Transformation and Data Discretization
The data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand.
Data discretization is a form of data transformation.
– Data discretization transforms numeric (continuous) data by mapping values to interval or concept labels.
Data Transformation: Normalization

Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error.
– Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., Occupation = “ ” (missing data)
– Noisy: containing noise, errors, or outliers
  e.g., Salary = “−10” (an error)
– Inconsistent: containing discrepancies in codes or names, e.g.,
  Age = “42”, Birthday = “03/07/2010”
  Was rating “1, 2, 3”, now rating “A, B, C”
  Discrepancy between duplicate records
– Intentional (e.g., disguised missing data):
  Jan. 1 as everyone’s birthday?
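
The kinds of dirty data listed above can be checked mechanically. The following is a minimal sketch, not a definitive cleaning routine: the field names (`occupation`, `salary`, `age`, `birthday`), the function name `find_issues`, the one-year tolerance for the age check, and the fixed reference date are all assumptions made for illustration.

```python
from datetime import date


def find_issues(record, today):
    """Return data-quality flags for one record (field names are hypothetical)."""
    issues = []

    # Incomplete: a missing, empty, or whitespace-only occupation.
    occupation = record.get("occupation")
    if occupation is None or not str(occupation).strip():
        issues.append("incomplete: occupation missing")

    # Noisy: a salary cannot be negative.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy: negative salary")

    birthday = record.get("birthday")
    age = record.get("age")

    # Inconsistent: the stated age should be within one year of the age
    # implied by the birthday (the tolerance is an assumption).
    if birthday is not None and age is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday else 1)
        if abs(implied_age - age) > 1:
            issues.append("inconsistent: age does not match birthday")

    # Intentional: a January 1 birthday may be a disguised missing value.
    if birthday is not None and (birthday.month, birthday.day) == (1, 1):
        issues.append("suspicious: default-looking birthday")

    return issues


reference_day = date(2024, 1, 15)  # fixed so the example is reproducible
dirty = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(2010, 7, 3)}
clean = {"occupation": "engineer", "salary": 50000, "age": 13, "birthday": date(2010, 7, 3)}

dirty_flags = find_issues(dirty, reference_day)
clean_flags = find_issues(clean, reference_day)
print(dirty_flags)
print(clean_flags)
```

A flag only marks a record for review; whether to correct, infer, or drop the value is a separate decision.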

Incomplete (Missing) Data
Data is not always available.
– E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
– equipment malfunction
– data that were inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data not being considered important at the time of entry
– failure to register history or changes of the data
Missing data may need to be removed or inferred.

How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible.
Fill it in automatically with:
– a global constant, e.g., “unknown” (a new class?!)
– the attribute mean
– the attribute mean for all samples belonging to the same class: smarter
– the most probable value: inference-based, such as a Bayesian formula or a decision tree. This is a popular strategy; in comparison to the other methods, it uses the most information from the present data to predict missing values.

Noisy Data and How to Handle Noisy Data?
Noise: random error or variance in a measured variable.
Outliers may represent noise.
Given a numeric attribute such as price, how can we “smooth” out the data to remove the noise?
Data smoothing techniques:
Binning
– first sort the data and partition them into (equal-frequency) bins
– then smooth by bin means, bin medians, bin boundaries, etc.
Regression
– smooth by fitting the data to regression functions
Clustering
– detect and remove outliers
Combined computer and human inspection
– detect suspicious values and have a human check them (e.g., to deal with possible outliers)

Binning Methods for Data Smoothing
Binning methods smooth sorted data by distributing the values into bins (buckets).
Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
Smoothing by bin medians: each bin value is replaced by the bin median.
Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.

Binning Methods for Data Smoothing: Example
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
– Bin 1: 4, 8, 15
– Bin 2: 21, 21, 24
– Bin 3: 25, 28, 34
Smoothing by bin means:
– Bin 1: 9, 9, 9
– Bin 2: 22, 22, 22
– Bin 3: 29, 29, 29
Smoothing by bin medians:
– Bin 1: 8, 8, 8
– Bin 2: 21, 21, 21
– Bin 3: 28, 28, 28
Smoothing by bin boundaries:
– Bin 1: 4, 4, 15
– Bin 2: 21, 21, 24
– Bin 3: 25, 25, 34

Data Cleaning as a Process
Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check the uniqueness rule, consecutive rule, and null rule
– For example, values that are more than two standard deviations away from the mean for a given attribute may be flagged as potential outliers.
– Use commercial tools
Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface

Data Integration
Data integration:
– Combines data from multiple sources into a coherent data store.
– Careful integration can help reduce and avoid redundancies and inconsistencies.
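
Looking back at the binning example for data smoothing in the data cleaning section, the three smoothing variants can be reproduced with a short sketch. The function names are illustrative, the partitioning assumes the number of values divides evenly into the bins, and resolving ties toward the lower boundary is an assumption the slides do not specify.

```python
from statistics import mean, median


def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size.

    Assumes len(sorted_values) is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    # Every value in a bin is replaced by the bin mean.
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    # Every value in a bin is replaced by the bin median.
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    # Every value is replaced by the closer of the bin minimum and maximum;
    # ties go to the lower boundary (an assumption).
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

by_means = smooth_by_means(bins)            # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
by_medians = smooth_by_medians(bins)        # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
by_boundaries = smooth_by_boundaries(bins)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
print(by_means, by_medians, by_boundaries)
```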
Schema integration:
– Integrate metadata from different sources
– e.g., A.cust-id ≡ B.cust-#
Entity identification problem:
– Identify real-world entities from multiple data sources
– e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts:
– For the same real-world entity, attribute values from different sources are different
– Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration
Redundancy is another important issue in data integration.
An attribute may be redundant if it can be “derived” from another attribute or set of attributes.
Redundant attributes can often be detected by correlation analysis:
– χ² (chi-square) test for nominal attributes
– correlation coefficient and covariance for numeric attributes

Correlation Analysis (for Numeric Data): Correlation Coefficient
For numeric attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient:
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B}$$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
Note that $-1 \le r_{A,B} \le 1$.
If $r_{A,B} > 0$: A and B are positively correlated (A’s values increase as B’s do).
– A higher value implies a stronger correlation.
If $r_{A,B} = 0$: A and B are not linearly correlated (this alone does not prove that they are independent).
If $r_{A,B} < 0$: A and B are negatively correlated.

Correlation Analysis (for Numeric Data): Covariance
The mean values of A and B are also known as the expected values of A and B: $E(A) = \bar{A}$ and $E(B) = \bar{B}$.
The covariance between A and B is defined as:
$$Cov(A,B) = E\big[(A - \bar{A})(B - \bar{B})\big] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$$
Covariance is closely related to the correlation coefficient:
$$r_{A,B} = \frac{Cov(A,B)}{\sigma_A\,\sigma_B}$$
It can also be shown that:
$$Cov(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$$
– This equation simplifies the calculation of Cov(A,B).

Covariance: Example
Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (7, 14).
Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
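
One way to check this question numerically is to compute the covariance with the shortcut form Cov(A,B) = E(A·B) − E(A)·E(B), and then the correlation coefficient as Cov(A,B) / (σ_A σ_B). The sketch below assumes population statistics (dividing by n, not n − 1); the function names are illustrative.

```python
from math import sqrt


def covariance(a, b):
    """Population covariance via Cov(A,B) = E(A*B) - E(A)*E(B)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    mean_ab = sum(x * y for x, y in zip(a, b)) / n
    return mean_ab - mean_a * mean_b


def population_std(values):
    """Population standard deviation (divides by n)."""
    n = len(values)
    mu = sum(values) / n
    return sqrt(sum((v - mu) ** 2 for v in values) / n)


def correlation(a, b):
    """Correlation coefficient r = Cov(A,B) / (sigma_A * sigma_B)."""
    return covariance(a, b) / (population_std(a) * population_std(b))


stock_a = [2, 3, 5, 4, 7]
stock_b = [5, 8, 10, 11, 14]

cov_ab = covariance(stock_a, stock_b)
r_ab = correlation(stock_a, stock_b)
print(round(cov_ab, 2))  # 4.88, positive: the two prices tend to move together
print(round(r_ab, 3))
```

Covariance depends on the units of A and B, so its magnitude is hard to interpret on its own; the correlation coefficient rescales it into the range [−1, 1].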
– E(A) = (2 + 3 + 5 + 4 + 7) / 5 = 21/5 = 4.2
– E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
– Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 7×14) / 5 − 4.2 × 9.6 = 45.2 − 40.32 = 4.88
Thus, A and B rise together since Cov(A,B) > 0.

Correlation Analysis (for Numeric Data): Covariance
Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than their expected values at the same time.
Negative covariance: if Cov(A,B) < 0, then one of the attributes tends to be above its expected value when the other attribute is below its expected value.
Independence and Cov(A,B) = 0:
– If A and B are independent, Cov(A,B) = 0.
– But the converse is not true: some pairs of random variables may have a covariance of 0 but are not independent.
– Covariance indicates a linear relationship (not a non-linear relationship). Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.

Correlation Test (for Nominal Data): χ² (Chi-Square) Test
For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test.
Suppose A has c distinct values, $a_1, \dots, a_c$, and B has r distinct values, $b_1, \dots, b_r$.
χ² (chi-square) test:
$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
where $o_{ij}$ is the observed frequency (i.e., actual count) of the joint event $(A = a_i, B = b_j)$ and $e_{ij}$ is the expected frequency of $(A = a_i, B = b_j)$, which can be computed as
$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$$
where n is the number of data tuples, $count(A = a_i)$ is the number of tuples having value $a_i$ for A, and $count(B = b_j)$ is the number of tuples having value $b_j$ for B.
The larger the χ² value, the more likely the variables are related.
– The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.

Chi-Square Calculation: An Example
Contingency table for two attributes, LikeScienceFiction and PlayChess.
Numbers in cells are observed frequencies (numbers in parentheses are expected counts calculated from the data distribution in the two categories).
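
The χ² computation for this example can be sketched in a few lines. The observed counts 250, 200, 50, and 1000 are the ones used in the calculation on this slide; which attribute is placed on the rows and which on the columns is an assumption here, and it does not affect the statistic. The function name is illustrative.

```python
def chi_square(observed):
    """Return (chi2, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, o in enumerate(row):
            # Expected count under independence: product of the margins over n.
            e = row_totals[i] * col_totals[j] / n
            expected_row.append(e)
            chi2 += (o - e) ** 2 / e
        expected.append(expected_row)
    return chi2, expected


# Layout is an assumption: rows and columns each hold the two values
# of one of the attributes.
observed = [
    [250, 200],
    [50, 1000],
]

chi2, expected = chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected)            # [[90.0, 360.0], [210.0, 840.0]]
print(chi2)                # about 507.9
print(degrees_of_freedom)  # 1
```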
$$e_{LSF,PC} = \frac{count(LSF) \times count(PC)}{n} = \frac{300 \times 450}{1500} = 90$$
χ² (chi-square) calculation:
$$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} \approx 507.93$$

Chi-Square Calculation: An Example
For this 2×2 table, the degrees of freedom are (2−1)(2−1) = 1.
– There are two possible values for the LikeScienceFiction attribute and two possible values for the PlayChess attribute.
For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.83 (from a table of upper percentage points of the χ² distribution).
Since our computed value 507.93 is above 10.83, we can reject the hypothesis that LikeScienceFiction and PlayChess are independent and conclude that the two attributes are (strongly) correlated for the given group of people.

Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
A data transformation is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.
Some data transformation strategies:
Normalization: the attribute data are scaled so as to fall within a small specified range.
Discretization: a numeric attribute is replaced by a categorical attribute.
Other data transformation strategies:
– Smoothing: remove noise from data. Smoothing is also a data cleaning method.
– Attribute/feature construction: new attributes are constructed from the given ones. Attribute construction is also a data reduction method.
– Aggregation: summarization, data cube construction. Aggregation is also a data reduction method.

Normalization
An attribute is normalized by scaling its values so that they fall within a small specified range.
A larger range of an attribute gives a greater effect (weight) to that attribute.
– This means that an attribute with a larger range can carry greater weight in data mining tasks than an attribute with a smaller range.
Normalizing the data attempts to give all attributes an equal weight.
– Normalization is particularly useful for classification algorithms involving neural networks or distance measurements, such as nearest-neighbor classification and clustering.
Some normalization methods:
– Min-max normalization
– Z-score normalization
– Normalization by decimal scaling

Min-Max Normalization
Min-max normalization performs a linear transformation on the original data.
Suppose that $min_A$ and $max_A$ are the minimum and maximum values of an attribute A.
Min-max normalization maps a value $v_i$ of attribute A to $v'_i$ in the range $[new\_min_A, new\_max_A]$ by computing:
$$v'_i = \frac{v_i - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
Min-max normalization preserves the relationships among the original data values.
We can standardize the range of all the numeric attributes to [0, 1] by applying min-max normalization with $new\_min_A = 0$ and $new\_max_A = 1$ to each of them.

Min-Max Normalization: Example
Suppose that the range of the attribute income is $12,000 to $98,000, and we want to normalize income to the range [0.0, 1.0]. Then $73,600 is mapped to
$$\frac{73600 - 12000}{98000 - 12000}\,(1.0 - 0.0) + 0.0 = 0.716$$
Suppose instead that we want to normalize income to the range [1.0, 5.0]. Then $73,600 is mapped to
$$\frac{73600 - 12000}{98000 - 12000}\,(5.0 - 1.0) + 1.0 = 3.865$$

Z-score Normalization
In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A.
A value $v_i$ of attribute A is normalized to $v'_i$ by computing
$$v'_i = \frac{v_i - \bar{A}}{\sigma_A}$$
where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of attribute A.
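
The min-max and z-score formulas above translate directly into code. This is a minimal sketch with illustrative function names. The min-max calls use an income of $73,600 on the range $12,000 to $98,000, and the z-score call uses a mean of $54,000 and a standard deviation of $16,000.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


income = 73600

unit_range = min_max(income, 12000, 98000)            # target range [0.0, 1.0]
wide_range = min_max(income, 12000, 98000, 1.0, 5.0)  # target range [1.0, 5.0]
standardized = z_score(income, 54000, 16000)

print(round(unit_range, 3))    # 0.716
print(round(wide_range, 3))    # 3.865
print(round(standardized, 3))  # 1.225
```

Min-max normalization needs the true minimum and maximum, and a later value outside that range would fall outside the target interval; z-score normalization only needs the mean and standard deviation.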
Z-score normalization is useful when the actual minimum and maximum of an attribute are unknown.
Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000. With z-score normalization, a value of $73,600 for income is transformed to
$$\frac{73600 - 54000}{16000} = 1.225$$

Normalization by Decimal Scaling
Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute A.
The number of decimal places moved depends on the maximum absolute value of A.
A value $v_i$ of attribute A is normalized to $v'_i$ by computing
$$v'_i = \frac{v_i}{10^j}$$
where j is the smallest integer such that $\max(|v'_i|) < 1$.
Example:
– Suppose that the recorded values of A range from −986 to 917.
– The maximum absolute value of A is 986.
– To normalize by decimal scaling, we therefore divide each value by 1000 (j = 3), so that −986 normalizes to −0.986 and 917 normalizes to 0.917.

Discretization
Discretization: to transform a numeric (continuous) attribute into a categorical attribute.
Some data mining algorithms require that data be in the form of categorical attributes.
In discretization:
– The range of a continuous attribute is divided into intervals.
– Then, interval labels can be used to replace actual data values to obtain a categorical attribute.
Simple discretization example: the income attribute is discretized into a categorical attribute.
– Target categories: low, medium, high.
– Calculate the average income, AVG.
  If income > 2 × AVG, new_income_value = “high”.
  If income < 0.5 × AVG, new_income_value = “low”.
  Otherwise, new_income_value = “medium”.

Discretization Methods
A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised).
Some discretization methods are as follows:
Unsupervised discretization: if class information is not used, then relatively simple approaches are common.
– Binning
– Clustering analysis
Supervised discretization:
– Classification (e.g., decision tree analysis)
– Correlation (e.g., χ²) analysis

Discretization by Binning
Attribute values can be discretized by applying equal-width or equal-frequency binning.
Binning approaches sort the attribute values first, then partition them into bins.
– The equal-width approach divides the range of the attribute into a user-specified number of intervals, each having the same width.
– The equal-frequency (equal-depth) approach tries to put the same number of objects into each interval.
After the bins are determined, all values are replaced by bin labels to discretize the attribute.
– Instead of bin labels, values may be replaced by bin means (or medians).
Binning does not use class information and is therefore an unsupervised discretization technique.

Discretization by Binning: Example (equal-width approach)
Suppose a group of 12 values of the price attribute has been sorted as follows:
price: 5, 10, 11, 13, 15, 35, 50, 55, 72, 89, 204, 215
Equal-width partitioning: the width of each interval is (215 − 5)/3 = 70, which gives the intervals [5, 75), [75, 145), and [145, 215].
Partition the values into three bins:
– bin1: 5, 10, 11, 13, 15, 35, 50, 55, 72
– bin2: 89
– bin3: 204, 215
Replace each value with its bin label to discretize.
price:             5  10  11  13  15  35  50  55  72  89  204  215
categorical attr.: 1   1   1   1   1   1   1   1   1   2    3    3

Discretization by Binning: Example (equal-frequency approach)
Suppose a group of 12 values of the price attribute has been sorted as follows:
price: 5, 10, 11, 13, 15, 35, 50, 55, 72, 89, 204, 215
Equal-frequency partitioning: partition the values into three bins, each containing 4 values.
– bin1: 5, 10, 11, 13
– bin2: 15, 35, 50, 55
– bin3: 72, 89, 204, 215
Replace each value with its bin label to discretize.
price:             5  10  11  13  15  35  50  55  72  89  204  215
categorical attr.: 1   1   1   1   2   2   2   2   3   3    3    3

Discretization by Clustering
A clustering algorithm can be applied to discretize a numeric attribute.
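
The two binning examples for the price attribute (equal-width and equal-frequency) can be reproduced with the sketch below. The function names are illustrative. Placing the maximum value in the last bin and assuming the number of values divides evenly among the bins are simplifying assumptions.

```python
def equal_width_labels(values, n_bins):
    """Label each value 1..n_bins using intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    labels = []
    for v in values:
        # The maximum value would land in bin n_bins + 1, so clamp it
        # into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        labels.append(index + 1)
    return labels


def equal_frequency_labels(sorted_values, n_bins):
    """Label sorted values 1..n_bins so that each bin holds the same count.

    Assumes len(sorted_values) is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    return [i // size + 1 for i in range(len(sorted_values))]


prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 89, 204, 215]

width_labels = equal_width_labels(prices, 3)
frequency_labels = equal_frequency_labels(prices, 3)

print(width_labels)      # [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3]
print(frequency_labels)  # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
```

The equal-width result shows the main weakness of that approach: a few large values stretch the range, so most of the data ends up in one bin.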
– The values of the attribute are partitioned into clusters by a clustering algorithm.
– Each value in a cluster is replaced by the label of that cluster to discretize.
Clustering takes the distribution and closeness of attribute values into consideration, and therefore is able to produce high-quality discretization results.
– Later, we will talk about different clustering algorithms (such as k-means).
Simple clustering: partition the data at the biggest gaps.
– Example: partition the data along the 2 biggest gaps into three bins.
  bin1: 5, 10, 11, 13, 15
  bin2: 35, 50, 55, 72, 89
  bin3: 204, 215
– Replace each value with its bin label to discretize.
price:             5  10  11  13  15  35  50  55  72  89  204  215
categorical attr.: 1   1   1   1   1   2   2   2   2   2    3    3

Discretization by Classification
Techniques used for a classification algorithm such as a decision tree can be applied to discretization.
Decision tree approaches to discretization are supervised; that is, they make use of class label information.
These techniques employ a top-down splitting approach for attribute values:
– Class distribution information is used in the calculation and determination of split-points.
– The main idea is to select split-points so that a given resulting partition contains as many tuples of the same class as possible.
Entropy is the most commonly used measure for this purpose (e.g., in the ID3 and MDLP algorithms).
Later, we will talk about classification algorithms.

Data Reduction
Data reduction: obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Why data reduction? A database or data warehouse may store terabytes of data, and complex data analysis may take a very long time to run on the complete data set.
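
Returning briefly to the simple clustering example in the discretization section (partitioning at the biggest gaps), the idea can be sketched as follows. The function name is illustrative, and breaking ties between equal gaps by position is an assumption; no tie affects this example.

```python
def gap_split_labels(sorted_values, n_bins):
    """Label sorted values 1..n_bins by cutting at the (n_bins - 1) largest gaps."""
    # Pair each gap with the index of the value on its left.
    gaps = [
        (sorted_values[i + 1] - sorted_values[i], i)
        for i in range(len(sorted_values) - 1)
    ]
    # Indices after which a new bin starts: the left ends of the largest gaps.
    cut_after = {i for _, i in sorted(gaps, reverse=True)[:n_bins - 1]}

    labels = []
    label = 1
    for i in range(len(sorted_values)):
        labels.append(label)
        if i in cut_after:
            label += 1
    return labels


prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 89, 204, 215]
gap_labels = gap_split_labels(prices, 3)

# The two largest gaps are 89 -> 204 and 15 -> 35.
print(gap_labels)  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3]
```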
Data reduction strategies:
– Dimensionality reduction: e.g., remove unimportant attributes
  Wavelet transforms
  Principal Components Analysis (PCA)
  Feature subset selection, feature creation
– Numerosity reduction:
  Data cube aggregation
  Sampling
  Clustering, …

Data Reduction: Dimensionality Reduction
Curse of dimensionality
– When dimensionality increases, data become increasingly sparse.
– Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful.
– The number of possible combinations of subspaces grows exponentially.
Dimensionality reduction
– Avoids the curse of dimensionality.
– Helps eliminate irrelevant features and reduce noise.
– Reduces the time and space required in data mining.
– Allows easier visualization.
Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
– Linear Discriminant Analysis
– Supervised and nonlinear techniques (e.g., feature selection)

Numerosity Reduction: Data Cube Aggregation
If the data hold sales per quarter and we are interested in annual sales, the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.
Sales per quarter are aggregated to provide the annual sales.

Numerosity Reduction: Data Cube Aggregation
Data cubes store multidimensional aggregated information.
Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining.
A data cube can support multidimensional analysis of sales data with respect to annual sales per item type.
– Each cell holds an aggregate data value, corresponding to a data point in multidimensional space.

Numerosity Reduction: Sampling
Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the final data analysis.
Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
The key principle for effective sampling is the following:
– Using a sample will work almost as well as using the entire data set, if the sample is representative.
– A sample is representative if it has approximately the same property (of interest) as the original set of data.

Data Preprocessing: Summary
Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
Data cleaning: e.g., missing/noisy values, outliers
Data integration from multiple sources:
– Entity identification problem
– Remove redundancies
– Detect inconsistencies
Data transformation and data discretization
– Normalization
– Concept hierarchy generation
Data reduction
– Dimensionality reduction
– Numerosity reduction
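
As a closing illustration of the sampling principle from the numerosity reduction section, the sketch below draws a simple random sample without replacement and compares one property of interest, the mean, between the sample and the full data. The population values, the sample size, the seed, and the function name are all hypothetical choices made for the example.

```python
import random
from statistics import mean


def simple_random_sample(data, k, seed=0):
    """Draw k items without replacement, using a fixed seed for reproducibility."""
    rng = random.Random(seed)
    return rng.sample(data, k)


# Hypothetical attribute values standing in for a large data set.
population = list(range(1, 10001))
sample = simple_random_sample(population, 500)

population_mean = mean(population)
sample_mean = mean(sample)

# Relative difference between the sample mean and the population mean;
# a small value suggests the sample is representative for this property.
relative_gap = abs(sample_mean - population_mean) / population_mean
print(len(sample), relative_gap)
```

Representativeness is always judged against a specific property; a sample that preserves the mean may still miss rare values or small classes.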