Document Details


Uploaded by UnselfishChrysoprase2896

Tags

data cleaning, data preprocessing, data analysis, machine learning

Summary

This document provides an overview of data cleaning techniques, including methods for handling missing values, noisy data, and inconsistencies. It also covers different approaches to data cleaning, such as binning, regression, and imputation.

Full Transcript

FEATURE ENGINEERING
Module 2: DATA PREPROCESSING

In real-world settings, the original data gathered are highly susceptible to being noisy, missing, and inconsistent because of their typically huge size (often several gigabytes or more) and their origin in multiple, heterogeneous sources. Having irrelevant features in the data can decrease the accuracy of models and make a model learn from features that do not matter. Hence, the data need to be preprocessed before any analysis.

MAJOR TASKS IN DATA PREPROCESSING

Data preprocessing techniques:
- Data cleaning can be applied to remove noise and correct inconsistencies in the data.
- Data integration merges data from multiple sources into a coherent data store.
- Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering.
- Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range such as 0.0 to 1.0; a minimal sketch follows below.
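As a minimal sketch of the normalization step, the lines below rescale a hypothetical numeric column into the range 0.0 to 1.0 with min-max scaling; the DataFrame and column name are made up for illustration.

    import pandas as pd

    # Hypothetical column with values on a large scale
    df = pd.DataFrame({"amount": [12.0, 250.0, 87.5, 1999.0, 430.0]})

    # Min-max normalization: rescale values into the range [0.0, 1.0]
    col = df["amount"]
    df["amount_scaled"] = (col - col.min()) / (col.max() - col.min())
    print(df)

scikit-learn's MinMaxScaler performs the same rescaling and is convenient when the fitted transformation has to be reapplied to new data.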

1. DATA CLEANING

The data cleaning process works to "clean" the data by filling in missing values, smoothing noisy data, and identifying or removing outliers. Real data may contain errors and unusual values, and may be incomplete, inaccurate, noisy, or inconsistent (e.g., containing discrepancies). The factors that make up data quality include accuracy, completeness, consistency, timeliness, and interpretability. Dirty data can confuse the machine learning process, resulting in unreliable output (and in overfitting the data).

Types of data cleaning:
A. Filling in missing values
B. Smoothing noisy data
C. Identifying or removing outliers
D. Resolving inconsistencies

REASONS FOR INACCURATE DATA
- The data collection instruments used may be faulty.
- There may have been human or computer errors at data entry.
- Users may purposely submit incorrect values for mandatory fields when they do not wish to submit personal information; this is known as disguised missing data.
- Errors can occur in data transmission.
- There may be technology limitations, such as a limited buffer size for coordinating synchronized data transfer and consumption.
- Incorrect data may also result from inconsistencies in naming conventions or data codes, or from inconsistent formats for input fields (e.g., dates).
- Duplicate tuples also require data cleaning.

REASONS FOR INCOMPLETE DATA
- Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry.
- Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions.
- Data that were inconsistent with other recorded data may have been deleted.
- The recording of the data history or of modifications may have been overlooked.
- Missing data, particularly for tuples with missing values in some attributes, may need to be inferred.

DATA QUALITY FACTORS
- Timeliness refers to the time expectation for accessibility and availability of information. It can be measured as the time between when information is expected and when it is readily available for use.
- Believability reflects how much the data are trusted by users.
- Interpretability reflects how easily the data are understood.

A. HANDLING MISSING VALUES

There are two primary ways of handling missing values:
I. Deleting the missing values
II. Imputing the missing values

Missing values can be counted per column with df.isnull().sum().

i) Ways to delete the missing values:
- Delete the entire row: df.dropna(axis=0), or drop a specific row with df.drop(index=df.index[val], axis=0)
- Delete the entire column: df.dropna(axis=1), or drop a named column with df.drop(['col'], axis=1)

ii) Ways to impute the missing values:
- Fill in the missing value manually.
- Use a global constant to fill in the missing value.
- Use a measure of central tendency for the attribute (mean for a symmetric distribution, median for a skewed one) to fill in the missing value.
- Use the attribute mean or median of all samples belonging to the same class as the given tuple.
- Replace with the previous or next value.
- Use an interpolation method such as 'linear', 'polynomial', or 'quadratic'.
- Use the most probable value to fill in the missing value (regression, Bayesian inference, or a decision tree).
- Impute with the scikit-learn library: a univariate approach (SimpleImputer) or a multivariate approach (KNNImputer or IterativeImputer). A minimal sketch follows this list.
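The sketch below pulls together a few of the deletion and imputation options listed above on a small hypothetical DataFrame; the column names are invented for illustration, and in practice only one strategy would typically be chosen per attribute.

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Hypothetical data with missing values
    df = pd.DataFrame({
        "age":    [25, np.nan, 31, 42, np.nan],
        "income": [50000, 62000, np.nan, 75000, 58000],
    })

    print(df.isnull().sum())            # count missing values per column

    dropped_rows = df.dropna(axis=0)    # delete rows containing any missing value
    dropped_cols = df.dropna(axis=1)    # delete columns containing any missing value

    # Impute with a measure of central tendency (median suits skewed attributes)
    imputer = SimpleImputer(strategy="median")
    imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

    # Alternatives: previous/next value fill, or interpolation
    filled = df.ffill().bfill()
    interpolated = df.interpolate(method="linear")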
B. HANDLING NOISY DATA

Noise is a random error or variance in a measured variable.
1. Basic statistical description techniques (e.g., boxplots and scatter plots) and data visualization methods can be used to identify outliers, which may represent noise.
2. Data smoothing techniques can be used to "smooth" out the data and remove the noise.
3. Binning: binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it.
4. Regression: a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.

BINNING

The binning method is used to smooth data or to handle noisy data. The data are first sorted, and the sorted values are then distributed into a number of buckets, or bins. Because binning methods consult the neighbourhood of values, they perform local smoothing. There are three approaches to smoothing:
1. Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
2. Smoothing by bin medians: each value in a bin is replaced by the bin's median value.
3. Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value.

C. IDENTIFYING OR REMOVING OUTLIERS

Outlier analysis is the process of identifying extreme values or abnormal observations. Outliers may be detected by clustering; intuitively, values that fall outside of the set of clusters may be considered outliers. Common indicators used to flag outliers (a sketch combining binning-based smoothing with these checks follows this list):
1. Mean > median (distribution skewed to the right) or mean < median (skewed to the left).
2. A large difference between the 75th percentile (Q3) and the maximum value of a predictor.
3. Values outside the lower and upper fences Q1 - 1.5*IQR and Q3 + 1.5*IQR, respectively.
4. Box plot visualization.
5. A Z-score greater than 3.
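Below is a minimal sketch, on hypothetical values, of two of the techniques just described: smoothing by bin means using equal-frequency bins, and flagging outliers with the IQR fences and a Z-score threshold. The data and the number of bins are arbitrary choices for illustration.

    import pandas as pd

    values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 340])

    # Smoothing by bin means: sort, split into equal-frequency bins,
    # and replace every value by the mean of its bin
    sorted_vals = values.sort_values().reset_index(drop=True)
    bins = pd.qcut(sorted_vals, q=4, labels=False, duplicates="drop")
    smoothed = sorted_vals.groupby(bins).transform("mean")
    print(smoothed.tolist())

    # Outliers by the IQR rule: outside Q1 - 1.5*IQR or Q3 + 1.5*IQR
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    iqr_outliers = values[(values < lower) | (values > upper)]

    # Outliers by Z-score: more than 3 standard deviations from the mean
    z_scores = (values - values.mean()) / values.std()
    z_outliers = values[z_scores.abs() > 3]

    print(iqr_outliers.tolist(), z_outliers.tolist())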
D. RESOLVING INCONSISTENCIES

Discrepancy detection: discrepancies can be caused by several factors, including poorly designed data entry forms with many optional fields, human error in data entry, deliberate errors, and data decay. Discrepancies may also arise from inconsistent data representations and inconsistent use of codes. Other sources include errors in the instrumentation devices that record data and system errors. Errors can also occur when the data are (inadequately) used for purposes other than originally intended, and inconsistencies may be introduced by data integration.

As a starting point, use any knowledge you may already have regarding properties of the data. Such knowledge, or "data about data," is referred to as metadata. From it, you may find noise, outliers, and unusual values that need investigation. The data should also be examined against unique rules, consecutive rules, and null rules (a minimal rule-check sketch appears at the end of this section):
- A unique rule says that each value of the given attribute must be different from all other values of that attribute.
- A consecutive rule says that there can be no missing values between the lowest and highest values for the attribute, and that all values must also be unique (e.g., as in check numbers).
- A null rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition (e.g., where a value for a given attribute is not available), and how such values should be handled.

DATA CLEANING TOOLS

Commercial tools that can aid in discrepancy detection:
- Data scrubbing tools use simple domain knowledge to detect errors and make corrections in the data (inspect using rules and correct any flaws).
- Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and by detecting data that violate such conditions.

Some data inconsistencies may be corrected manually using external references. Most errors, however, will require data transformations: once discrepancies are found, we typically need to define and apply (a series of) transformations to correct them.
- Data migration tools allow simple transformations to be specified, such as replacing the string "gender" with "sex."
- ETL (extraction/transformation/loading) tools allow users to specify transforms through a graphical user interface (GUI).
- Potter's Wheel is a publicly available data cleaning tool that integrates discrepancy detection and transformation.

The name "potter's wheel" is a metaphor for an interactive data cleaning system that lets users manipulate and refine their data in a fluid, iterative manner, similar to how a potter shapes clay on a spinning wheel, by identifying inconsistencies, transforming values, and making adjustments as needed, all within a single interface.

Key aspects of a "potter's wheel" approach to data cleaning:
- Interactive manipulation: users can dynamically view and modify data elements in real time, making adjustments as they identify issues.
- Visual feedback: the system provides visual cues to highlight areas of concern, such as potential outliers, missing values, or data type mismatches, enabling users to quickly assess data quality.
- Customizable cleaning rules: users can define specific rules and constraints for their data, allowing targeted cleaning based on context and domain knowledge.
- Transformation capabilities: beyond identifying errors, the tool lets users directly transform data by changing values, merging columns, or applying calculations to correct inconsistencies.
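As referenced above, here is a minimal sketch of checking one attribute against the null, unique, and consecutive rules with pandas; the column name and the strings treated as null markers are assumptions made for illustration.

    import pandas as pd

    # Hypothetical check-number column; "?" and "" are assumed null markers
    df = pd.DataFrame({"check_no": ["1001", "1002", "1002", "1005", "?", ""]})

    # Null rule: map the strings that stand for "missing" to real nulls first
    col = df["check_no"].replace({"?": pd.NA, "": pd.NA})
    print("null-rule hits:", int(col.isna().sum()))

    # Unique rule: every value must differ from all other values
    valid = col.dropna()
    dupes = valid[valid.duplicated(keep=False)]
    print("unique-rule violations:", dupes.tolist())

    # Consecutive rule: no gaps between the lowest and highest check numbers
    nums = valid.drop_duplicates().astype(int)
    expected = set(range(nums.min(), nums.max() + 1))
    print("missing in sequence:", sorted(expected - set(nums)))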
