Data Mining: Data Preprocessing I Lecture Notes PDF
Helwan University
Dr. Doaa Elzanfaly
Summary
These lecture notes cover data preprocessing in data mining. They discuss different types of data issues like noisy, missing, and inconsistent data and methods to improve data quality. The notes also include examples, methodologies, and definitions for each method, making them a useful resource for data analysis students.
Full Transcript
DATA MINING - Lecture 3: Data Preprocessing I
Dr. Doaa Elzanfaly

Introduction
Real-world databases are highly susceptible to noisy, missing, and inconsistent data
◼ due to their typically huge size, and
◼ their origin from multiple, heterogeneous sources.
Low-quality data will lead to low-quality mining results. How can the data be preprocessed to help improve the quality of the data and, consequently, of the mining results?

Lecture Outline
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization

Measures for Data Quality: A Multidimensional View
◼ Data have quality if they satisfy the requirements of the intended use.
◼ Accuracy: correct or wrong, accurate or not
◼ Completeness: not recorded, unavailable, …
◼ Consistency: some entries modified but others not, …
◼ Timeliness: updated in a timely manner?
◼ Believability: how trustworthy are the data?
◼ Interpretability: how easily can the data be understood?

How to Measure Each Dimension
◼ 1. Accuracy: validation against ground truth; compare data entries to authoritative sources (e.g., check an address or name against an official directory).
◼ 2. Completeness: missing-data rate; determine how many required fields are empty.
◼ 3. Consistency: cross-field validation; check whether values in related fields are logically consistent (e.g., an end date cannot be earlier than a start date). Cross-system consistency: verify whether the same data in different systems agree (e.g., the same customer data in the CRM and billing systems).
◼ 4. Timeliness: latency check; measure the time delay between the generation of data and when it becomes available for use.
◼ 5. Believability: source validation; check the trustworthiness of the source of the data.
◼ 6. Interpretability: documentation assessment; check whether metadata, descriptions, and formats are well documented and understandable.

Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Fill in missing values, smooth noisy data (including redundancy), identify or remove outliers, and resolve inconsistencies.
◼ Data Integration
◼ Integration of multiple databases, data cubes, or files.
◼ Data Reduction (reduce the number of objects or attributes)
◼ Dimensionality reduction: reduce the number of attributes (data compression, wavelet transforms, PCA).
◼ Numerosity reduction: data aggregation.
◼ Data Transformation and Discretization
◼ Normalization, i.e., scaling values to a smaller range such as [0.0, 1.0].
◼ Concept hierarchy generation (e.g., zip codes to cities, cities to countries).

Data Cleaning (or Data Cleansing)
◼ Data in the real world is dirty: there is a lot of potentially incorrect data, e.g., from faulty instruments, human or computer error, or transmission errors.
◼ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
◼ e.g., Occupation = “ ” (missing data)
◼ Noisy: containing noise, errors, or outliers
◼ e.g., Salary = “−10” (an error)
◼ Inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age = “42”, Birthday = “03/07/2010”
◼ the rating was “1, 2, 3”, now the rating is “A, B, C”
◼ discrepancies between duplicate records
◼ Intentional errors (e.g., masked/disguised missing data)
◼ Jan. 1 entered as everyone’s birthday, for security, simplicity, or testing.
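The slides describe these quality checks and dirty-data symptoms conceptually. The following is a minimal illustrative sketch, not from the lecture, assuming pandas and invented column names, showing how the completeness, accuracy, and consistency checks from "How to Measure Each Dimension" could be computed on a toy table that contains the kinds of dirty values listed above.

```python
# A minimal sketch (not from the lecture; pandas assumed, toy column names)
# of completeness, accuracy, and consistency checks on a small table.
import pandas as pd

df = pd.DataFrame({
    "occupation": ["engineer", "", None, "teacher"],   # "" and None are missing values
    "salary": [3500, -10, 4200, 2800],                 # -10 is a noisy/erroneous value
    "start_date": pd.to_datetime(["2024-01-01", "2024-03-15", "2024-02-01", "2024-05-01"]),
    "end_date":   pd.to_datetime(["2024-06-01", "2024-01-01", "2024-08-01", "2024-09-01"]),
})
df = df.replace("", pd.NA)                 # treat empty strings as missing

# Completeness: missing-data rate per required field
print(df.isna().mean())

# Accuracy/noise: a simple range rule for salary
print((df["salary"] < 0).sum(), "row(s) with an impossible salary")

# Consistency: cross-field rule "end_date must not be earlier than start_date"
print((df["end_date"] < df["start_date"]).sum(), "row(s) violating the date rule")
```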
Data Cleaning (or Data Cleansing)
See: https://www.rfwireless-world.com/Terminology/What-is-Data-Cleansing-or-cleaning.html

Incomplete (Missing) Data
◼ Data is not always available
◼ Many tuples have no recorded value for several attributes.
◼ Missing data may be due to:
◼ equipment malfunction
◼ inconsistency with other recorded data, so the value was deleted
◼ data not entered due to misunderstanding
◼ certain data not considered important at the time of entry
◼ history or changes of the data not being registered
◼ Missing data may need to be inferred

Types of Missing Data - Three Main Types
◼ Missing Completely at Random (MCAR): the missing data are unrelated to the observation being studied or to the other variables in the data set; the missingness is random (e.g., lost surveys), so the analysis remains unbiased.
◼ Missing at Random (MAR): missing data at other attributes can explain why the values are missing; the missingness can be predicted from other variables, but not from the missing data themselves (e.g., men are less likely to answer a survey on depression, so missingness depends on gender).
◼ Missing Not at Random (MNAR): the missingness is directly related to the value of the missing observation (e.g., those with high levels of depression refuse to answer, so a certain range of values is missing); biased analysis can result if the missing data are not properly addressed.

How to Handle Missing Data?
◼ 1) Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably.
◼ 2) Fill in the missing value manually: tedious and often infeasible.
◼ Fill it in automatically with:
◼ 3) a global constant, e.g., “unknown” (effectively a new class?!)
◼ 4) a measure of central tendency: the attribute mean (average) or median
◼ 5) the attribute mean for all samples belonging to the same class: smarter
◼ 6) the most probable value: inference-based, such as a Bayesian formula or a decision tree.

Two Main Missing-Value Imputation Techniques
Approaches: statistical inference (MEAN, MODE) and correlation-based imputation (KNN), where correlation means that if one value increases the other increases too. (A small code sketch follows this list.)
◼ MEAN (numerical data): replace with the average. Simple, but it leads to underestimation of the variance and its bias is driven by outliers.
◼ MODE (categorical data): replace with the most frequent value. Simple.
◼ KNN (numerical and categorical data): handles instances with multiple missing values, gives good results, and takes the correlation of the data into consideration, but selecting the distance measure is challenging.
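As referenced above, here is an illustrative sketch of the three imputation techniques on toy data. It assumes pandas and scikit-learn and uses made-up column names; it is not the lecture's own code.

```python
# Sketch of MEAN, MODE, and KNN imputation on a toy table
# (pandas and scikit-learn assumed; column names are invented).
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"salary": [3000, np.nan, 5200, 4100, np.nan],
                   "age":    [25, 31, np.nan, 40, 28],
                   "dept":   ["IT", "HR", None, "IT", "IT"]})

# MEAN imputation (numerical): simple, but shrinks the variance
df["salary_mean"] = df["salary"].fillna(df["salary"].mean())

# MODE imputation (categorical): fill with the most frequent value
df["dept_mode"] = df["dept"].fillna(df["dept"].mode()[0])

# KNN imputation: uses the k nearest rows, so it exploits correlations
# between attributes; choosing k and the distance measure is the hard part
knn = KNNImputer(n_neighbors=2)
df[["salary_knn", "age_knn"]] = knn.fit_transform(df[["salary", "age"]])
print(df)
```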
Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to:
◼ faulty data collection instruments
◼ data entry problems
◼ data transmission problems
◼ technology limitations
◼ inconsistency in naming conventions

Noise and Outliers
◼ Noise: random variations or errors in data that do not represent real values or patterns. Noise is typically random and can occur at multiple points in the dataset, introducing small variations that obscure real signals.
◼ Example: in a time-series dataset of daily temperatures, slight fluctuations due to faulty sensors could be noise that makes it harder to identify the long-term trend.
◼ Outliers: extreme values that significantly deviate from the other observations in the dataset; specific data points that are distinctly different from the others. They can be either legitimate or erroneous, depending on the context.
◼ Example: in financial transactions, a transaction amount that is significantly larger than all other transactions could be an outlier.

How to Handle Noisy Data?
Main goal: reduce the variation of the noise toward a common class.
◼ Binning
◼ first sort the data and partition it into bins
◼ then smooth by bin means, bin medians, or bin boundaries, etc.
◼ Regression
◼ smooth by fitting the data to regression functions
◼ Clustering
◼ detect and remove outliers
◼ Combined computer and human inspection
◼ detect suspicious values and have a human check them (e.g., deal with possible outliers)

Binning Methods for Data Smoothing
Two types of bins (a code sketch comparing the two schemes follows the income example below):
◼ Equal-width (histogram)
◼ Divide the range into N intervals of equal size W = (B − A)/N, where A and B are the lowest and highest values of the attribute.
◼ The most straightforward approach; each bin has the same width, but the number of values per bin differs.
◼ Skewed data is not handled well.
◼ Equal-depth (the data must be sorted)
◼ Divides the range into N intervals, each containing approximately the same number of samples; each bin has the same number of elements but a different value range.
◼ Good data scaling.
See: https://www.youtube.com/watch?v=FdCJb1qJg-k

Income example - Distribution in Equal-Width Bins
◼ Bin 1: [1,000, 1,200, 1,500, 1,800, 2,000, 2,200, 2,300, 2,400, 2,500, 2,600, 2,700]: 11 people fall into this bin.
◼ Bin 2: [2,800, 2,900, 3,000, 3,100, 3,200, 3,300, 3,400, 3,500, 3,600, 3,700, 3,800, 3,900, 4,000, 4,100, 4,200]: 15 people fall into this bin.
◼ Bin 3: [4,500, 5,000, 5,200, 5,500, 5,800, 6,000]: 6 people fall into this bin.
◼ Bin 4: [6,200, 6,500, 7,000, 7,500, 8,000]: 5 people fall into this bin.
◼ Bin 5: [8,500, 9,000, 9,200, 9,500, 9,800, 10,000]: 6 people fall into this bin.
Distribution in Equal-Depth Bins
◼ Bin 1: [1,000, 1,200, 1,500, 1,800, 2,000, 2,200, 2,300, 2,400, 2,500, 2,600]: the lowest 10 incomes.
◼ Bin 2: [2,700, 2,800, 2,900, 3,000, 3,100, 3,200, 3,300, 3,400, 3,500, 3,600]: the next 10 people fall into this bin.
◼ Bin 3: [3,700, 3,800, 3,900, 4,000, 4,100, 4,200, 4,500, 5,000, 5,200, 5,500]: another 10 people in this range.
◼ Bin 4: [5,800, 6,000, 6,200, 6,500, 7,000, 7,500, 8,000, 8,500, 9,000, 9,200]: again, 10 people are grouped here.
◼ Bin 5: [9,500, 9,800, 10,000]: contains the highest incomes, although fewer than 10 values fit due to data limits.
In this income-distribution example, if the bins have counts like [10, 15, 2, 20, 5], smoothing can adjust these counts to [12, 12, 12, 9, 7], reflecting a more balanced trend across the range.
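As mentioned above, here is an illustrative sketch of the two binning schemes applied to the income values from the example, assuming pandas (pd.cut for equal-width bins, pd.qcut for approximately equal-depth bins). The computed bin edges and counts will not match the slide's hand-drawn grouping exactly, but the contrast between the two schemes is the same.

```python
# Sketch: equal-width vs. equal-depth binning of the income example with pandas.
import pandas as pd

incomes = pd.Series([1000, 1200, 1500, 1800, 2000, 2200, 2300, 2400, 2500, 2600,
                     2700, 2800, 2900, 3000, 3100, 3200, 3300, 3400, 3500, 3600,
                     3700, 3800, 3900, 4000, 4100, 4200, 4500, 5000, 5200, 5500,
                     5800, 6000, 6200, 6500, 7000, 7500, 8000, 8500, 9000, 9200,
                     9500, 9800, 10000])

equal_width = pd.cut(incomes, bins=5)    # width W = (10000 - 1000) / 5 = 1800
equal_depth = pd.qcut(incomes, q=5)      # ~43/5, i.e. 8-9 values per bin

print(equal_width.value_counts().sort_index())   # uneven counts: the skew shows up
print(equal_depth.value_counts().sort_index())   # roughly equal counts per bin
```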
Example of Binning for Data Smoothing
Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means (replace each value with the bin mean):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries (the closest boundary value is chosen to replace the other values):
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Smoothing by bin median (replace each value with the bin median):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

How Does Binning Solve Noise?
1. Reduces the influence of outliers: by grouping data into bins, extreme values that could skew the analysis are absorbed into broader categories, reducing their impact on overall data patterns.
2. Smooths data variability: binning smooths out small variations and fluctuations by averaging or summarizing values within each bin, which helps to highlight the general trend and patterns rather than the noise.
3. Simplifies data representation: by converting continuous data into discrete bins, the data becomes easier to manage and analyze, reducing the complexity that noise can introduce.
4. Improves model performance: in machine learning, binning can improve the performance of algorithms that are sensitive to noise by providing cleaner, more interpretable input features.
However, it is important to choose the bin width or the number of bins carefully: too few bins can oversimplify the data, while too many bins can retain too much noise.

Data Discrepancy Detection
◼ Use metadata (e.g., domain, range, dependency, distribution)
◼ Check field overloading
◼ Check the uniqueness rule, consecutive rule, and null rule
◼ Use commercial tools
◼ Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections
◼ Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
Some data inconsistencies may be corrected manually using external references. Most errors, however, will require data transformations.

Data Integration
Combining data from multiple sources into a coherent store.
◼ Schema integration: e.g., A.cust-id ≡ B.cust-#
◼ Metadata can be used.
◼ Consider functional dependencies and referential integrity.
◼ Entity identification problem:
◼ Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
◼ Detecting and resolving data value conflicts:
◼ For the same real-world entity, attribute values from different sources may differ.
◼ Possible reasons: different representations, different scales, e.g., metric vs. British units.

Data Integration - Record Duplication Types
Deterministic / exact duplicates:
  ID   Name            Age   Address
  1    Mohamed Ahmed   50    Giza
  1    Mohamed Ahmed   50    Giza
Probabilistic / near duplicates:
  No.  Name             DoB      Address
  001  Mohamed Ahmed    12/7/73  Giza
  002  Muhammed Ahmed   7/12/73  Mohandesen

Handling Redundancy in Data Integration
◼ An attribute may be redundant if:
◼ The same attribute or object has different names in different databases.
◼ It duplicates much or all of the information contained in one or more other attributes, e.g., the purchase price of a product and the amount of sales tax paid, or Age and the date of birth (DoB).
◼ Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue.
Redundant attributes may be detected by correlation analysis and covariance analysis, as sketched below.
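As referenced above, the following is a small sketch, not from the slides, of flagging candidate redundant numeric attributes by looking for highly correlated pairs. It assumes pandas; the table, the 14% tax rate, and the 0.95 threshold are invented for illustration.

```python
# Sketch: flag candidate redundant numeric attributes via pairwise correlation.
import pandas as pd

df = pd.DataFrame({
    "price":     [100, 250, 80, 310, 150],
    "sales_tax": [14.0, 35.0, 11.2, 43.4, 21.0],   # derived here as 14% of price
    "quantity":  [3, 1, 5, 2, 4],
})

corr = df.corr()          # Pearson correlation matrix of the numeric columns
threshold = 0.95          # illustrative cut-off for "looks redundant"
for a in corr.columns:
    for b in corr.columns:
        if a < b and abs(corr.loc[a, b]) >= threshold:
            print(f"{a} and {b} look redundant (r = {corr.loc[a, b]:.2f})")
```

Only price and sales_tax are flagged here, mirroring the slide's example of a purchase price and the sales tax paid on it.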
Correlation Analysis
Correlation measures the strength and direction of the linear relationship between two variables. It is expressed as a correlation coefficient, typically ranging from -1 to 1.
◼ A correlation of +1 indicates a perfect positive relationship.
◼ A correlation of -1 indicates a perfect negative relationship.
◼ A correlation of 0 indicates no linear relationship.

Correlation Analysis (Nominal Data)
For nominal data, correlation is tested with the chi-square (χ²) test of independence:
χ² = Σ (observed count − expected count)² / expected count,
where the expected count of a cell = (its row total × its column total) / grand total.
◼ If a cell’s observed count is very different from the expected count, it suggests that the assumption of independence might not hold true for that part of the data.

Correlation vs. Causality
◼ Correlation: measures the strength and direction of a linear relationship between two variables. It quantifies how closely the variables move together (positive correlation), inversely (negative correlation), or not at all (no correlation).
◼ Causality (causation): a change in one variable directly causes a change in another variable. This implies a cause-and-effect relationship, where one event (the cause) directly influences the occurrence of another event (the effect).

Chi-Square Calculation: An Example
◼ Null hypothesis (H₀): there is no association between liking science fiction and playing chess.
◼ Alternative hypothesis (H₁): there is an association between liking science fiction and playing chess.
(Another example pair of hypotheses: H₀: there is no association between gender and job satisfaction; H₁: there is an association between gender and job satisfaction.)

                            Play chess   Not play chess   Sum (row)
  Like science fiction      250 (90)     200 (360)        450
  Not like science fiction  50 (210)     1000 (840)       1050
  Sum (col.)                300          1200             1500

◼ The numbers in parentheses are the expected counts, calculated from the data distribution in the two categories. For example, the expected frequency for the cell (like science fiction, play chess) = (all who like science fiction × all who play chess) / total = (450 × 300) / 1500 = 90.
◼ χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
◼ This shows that like_science_fiction and play_chess are correlated in this group.

Degrees of Freedom
◼ Determine the degrees of freedom (df) for a test of independence:
df = (r − 1) × (c − 1), where r is the number of rows and c is the number of columns. Here df = (2 − 1) × (2 − 1) = 1.
◼ Degrees of freedom are the number of values in the final calculation of a statistic that are free to vary; essentially, the number of independent data points available to estimate a statistical parameter, such as a correlation coefficient.

The Table of Upper Percentage Points of the χ² Distribution
◼ Find the critical value and make a decision: using a chi-square distribution table at a significance level (α) of 0.05 and 1 degree of freedom, the critical value is approximately 3.84. If χ² > 3.84, reject the null hypothesis.
◼ Since our calculated χ² = 507.93 is greater than 3.84, we reject the null hypothesis. (The sketch below re-computes this example.)
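As referenced above, the slide's chi-square example can be re-computed programmatically. This sketch assumes SciPy is available; the observed counts are exactly those from the contingency table above.

```python
# Re-computing the chi-square example above (SciPy assumed).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],     # like science fiction:     play / not play chess
                     [50, 1000]])    # not like science fiction: play / not play chess

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)      # [[ 90. 360.] [210. 840.]] -> matches the expected counts above
print(chi2, dof)     # ~507.93 with df = (2-1)*(2-1) = 1
print(chi2 > 3.84)   # True -> reject H0 at alpha = 0.05: the attributes are associated
```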
Correlation Analysis (Numeric Data)
◼ Correlation coefficient (also called Pearson’s product-moment coefficient):
r_A,B = Σ (a_i − Ā)(b_i − B̄) / (n σ_A σ_B) = (Σ a_i b_i − n Ā B̄) / (n σ_A σ_B),
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the AB cross-products.
◼ If r_A,B > 0, A and B are positively correlated (A’s values increase as B’s do); the higher the value, the stronger the correlation.
◼ r_A,B = 0: independent; r_A,B < 0: negatively correlated.

Correlation Analysis (Numeric Data): Example
  A   B
  1   2
  2   3
  4   6
  5   8
  6   8
Steps: calculate the means of A and B, calculate the standard deviations, then apply the formula.

Correlation Coefficient Result
◼ The Pearson correlation coefficient r_A,B is approximately 0.98, indicating a very strong positive linear relationship between variables A and B: as the values of A increase, the values of B also increase in a nearly perfect linear manner. (The sketch below re-computes this value.)
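As referenced above, the numeric example (A = 1, 2, 4, 5, 6 and B = 2, 3, 6, 8, 8) can be re-computed with plain NumPy, following the formula r_A,B = Σ (a_i − Ā)(b_i − B̄) / (n σ_A σ_B) given earlier.

```python
# Re-computing the Pearson correlation for the example columns A and B (NumPy assumed).
import numpy as np

A = np.array([1, 2, 4, 5, 6], dtype=float)
B = np.array([2, 3, 6, 8, 8], dtype=float)

n = len(A)
r = np.sum((A - A.mean()) * (B - B.mean())) / (n * A.std() * B.std())
print(round(r, 3))                               # ~0.98: very strong positive correlation
print(round(float(np.corrcoef(A, B)[0, 1]), 3))  # same value from NumPy's built-in corrcoef
```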