Data Mining and Practical Machine Learning PDF
Document Details
Uploaded by WieldyNeptunium
Summary
This document provides an overview of data preprocessing techniques, including missing-value imputation and methods for handling noisy data, together with some examples.
Full Transcript
DATA MINING AND PRACTICAL MACHINE LEARNING
Week 2: Data Preparation Techniques
10/22/2024 ICT 462-3 DICT-UWU

INTRODUCTION
▪ Data preparation is a critical step in data mining and machine learning.
▪ Raw data is transformed into a format that can be used effectively for analysis and modeling.
▪ The quality of the data often directly affects the performance of the models, so preparing data properly is crucial.

WHY PREPROCESSING?
Improves Data Quality: Data may contain errors, missing values, or outliers that can affect the accuracy of models.
Increases Model Efficiency: Well-preprocessed data can reduce the complexity of the learning models and lead to better results.
Ensures Compatibility: Data from different sources may not always be in the same format, and preprocessing ensures that the data is uniform.

There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

DATA PREPARATION
▪ Data Cleaning
▪ Data Integration
▪ Data Reduction
▪ Data Transformation
▪ Data Discretization

DATA CLEANING
▪ Data cleaning is a fundamental step in data preprocessing.
▪ Data cleaning routines work to “clean” the data by
  ▪ filling in missing values,
  ▪ smoothing noisy data,
  ▪ identifying or removing outliers,
  ▪ and resolving inconsistencies.

HANDLING MISSING VALUES
You note that many tuples have no recorded value for several attributes, such as age and cabin. How can you go about filling in the missing values for these attributes? Let’s look at the following methods.

1. Ignore the tuple
▪ This is usually done when the class label is missing (assuming the mining task involves classification).
▪ This method is not very effective, unless the tuple contains several attributes with missing values.
▪ By ignoring the tuple, we do not make use of the remaining attributes’ values in the tuple. Such data could have been useful to the task at hand.

2. Imputation
▪ Data imputation retains most of the dataset's data and information by substituting a replacement value for each missing value.
▪ It is used because removing every tuple that has a missing value is often impractical.
▪ Removing tuples would also substantially reduce the dataset's size, which can introduce bias and impair the analysis.

▪ Mean Imputation
  ▪ Replace missing values with the mean of the column.
  ▪ Best for numerical data that is normally distributed.
▪ Median Imputation
  ▪ Replace missing values with the median of the column.
  ▪ Suitable for numerical data that is skewed.
▪ Mode Imputation
  ▪ Replace missing values with the mode (most frequent value).
  ▪ Suitable for categorical data.

▪ Fixed value imputation
  ▪ A universal technique that replaces the null data with a fixed value and is applicable to all data types.
  ▪ For example, on a nominal feature you can impute the null values in a survey with "not answered".
▪ Next or Previous Value
  ▪ For time-series data or ordered data, there are specific imputation techniques.
  ▪ A common method for imputing incomplete time-series data is to substitute the next or previous value in the series for the missing value.
▪ Maximum or Minimum Value
  ▪ If you know that the data must fit within a specific range, you can use the minimum or maximum of that range as the replacement value for missing values.

▪ Missing Value Prediction
  ▪ Another popular method is to use a machine learning model to predict the imputation value for feature x from the other features.
  ▪ Algorithms such as k-nearest neighbors (KNN) or random forest can be employed for this.
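The simpler imputation strategies listed above (mean, median, mode, fixed value, previous value) can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `impute`, its `strategy` parameter, and the sample `ages` and `answers` columns are hypothetical and do not come from the lecture dataset. Missing values are represented as `None`.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean", fill_value=None):
    """Return a copy of `values` with None entries replaced.

    strategy is one of "mean", "median", "mode", "fixed", "previous".
    (Hypothetical helper written for this illustration.)
    """
    if strategy == "previous":
        # Carry the last observed value forward (ordered / time-series data).
        filled, last = [], None
        for v in values:
            if v is not None:
                last = v
            filled.append(last)  # stays None if nothing was observed yet
        return filled

    known = [v for v in values if v is not None]
    if strategy == "fixed":
        replacement = fill_value
    else:
        # Compute the statistic from the observed values only.
        replacement = {"mean": mean, "median": median, "mode": mode}[strategy](known)
    return [replacement if v is None else v for v in values]


# Hypothetical sample columns (not the lecture data).
ages = [22, None, 26, 35, None, 54, 2]
answers = ["yes", None, "no", "yes", None]

print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(answers, "mode"))
print(impute(answers, "fixed", fill_value="not answered"))
print(impute(ages, "previous"))
```

Each statistic is computed from the observed values only, so the missing entries never influence the value that replaces them. Model-based imputation such as KNN would replace the single `replacement` value with a per-row prediction.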
▪ Example:-
  ▪ K-Nearest Neighbors (KNN): Predicts missing values by finding the nearest neighbors in the dataset and taking their average (for numerical values) or mode (for categorical values).
  ▪ Decision tree: During training, data instances with known values are split based on features, and the tree learns which feature values lead to specific outcomes. For a new instance with a missing value, the tree can trace a path based on the other features and use the majority class or value at the leaf it reaches.

EXAMPLE
▪ Median imputation works well when there’s a moderate number of missing values, and it’s easy to implement.
  ▪ It retains data while maintaining the central tendency of the Age feature.
▪ KNN imputation can predict missing ages based on the values of other features.
  ▪ This method considers the similarity between data points to estimate the missing value.

HANDLING NOISY DATA
What is noise?
Noise is a random error or variance in a measured variable. Sources include errors introduced by measurement tools and random errors introduced by processing or by experts when the data is gathered.
What is an outlier?
An outlier is a data point that deviates significantly from the other observations in the dataset. It may not follow the general trend or pattern, but it could still be a valid data point.

EXAMPLE
Noise? Outlier?

▪ Smoothing is a technique used to reduce the noise in the data by smoothing out fluctuations.
▪ It involves using various methods to smooth the data while retaining the essential patterns.
▪ The main techniques used for smoothing include binning and regression methods.

1. Binning
▪ Binning is a simple yet effective technique used to smooth noisy data by dividing it into intervals or "bins"
▪ and replacing the values within each bin with a representative value (e.g., mean, median).
▪ This has a smoothing effect on the input data and may also reduce the chances of overfitting in the case of small datasets.

Smoothing by bin means
Each value in the bin is replaced by the mean value of the bin.
Smoothing by bin medians
Each value in the bin is replaced by the median value of the bin.
Smoothing by bin boundaries
The minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.

2. Regression
▪ Data smoothing can also be done by regression, a technique that conforms data values to a function.
▪ Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.
▪ Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
▪ Polynomial regression fits a polynomial curve, which may capture more complex patterns in the data than a straight line.

3. Filtering Techniques
▪ Filtering techniques, like median filtering or low-pass filtering, remove high-frequency noise from data, particularly in image or signal processing.
▪ Ex:- A low-pass filter removes high-frequency components (noise) from the data, allowing only the low-frequency components (the signal) to pass through.

Outlier Detection
▪ Detecting and removing outliers is a form of noise removal.
▪ Common techniques include
  ▪ Z-score: Measures how many standard deviations a point is from the mean. Points whose absolute Z-score exceeds a chosen threshold are considered outliers and treated as noise.
  ▪ IQR (Interquartile Range): Values more than 1.5 times the interquartile range (IQR) below the lower quartile or above the upper quartile are considered outliers.
  ▪ Isolation Forest: Detects outliers by isolating data points; points that require fewer splits to be separated from the rest of the data in a decision tree are flagged as outliers.
▪ Outlier detection is common in many types of numerical datasets where a few extreme points (noise) can distort analysis.

Clustering
▪ Clustering can help in noise removal by grouping similar data points (distance-based or density-based algorithms) and treating outliers or noise as separate clusters.
▪ Clustering algorithms can detect and eliminate noisy data points that do not fit well within any cluster.
▪ e.g., K-Means
  ▪ Points that are far from any cluster centroid or belong to small, isolated clusters are flagged as outliers.
▪ Clustering is effective in detecting and handling noise in spatial data or high-dimensional datasets.

EXAMPLE
▪ In this dataset, the price of the third house ($1,500,000) is an outlier, as it is significantly higher than the rest.
IQR (Interquartile Range): Uses the interquartile range to detect outliers. The formula is:
\(\text{Lower Bound} = Q_1 - 1.5 \times \text{IQR}\)
\(\text{Upper Bound} = Q_3 + 1.5 \times \text{IQR}\)
where \(\text{IQR} = Q_3 - Q_1\).
Values falling outside these bounds are considered outliers and could be removed or capped at these boundaries.

EXAMPLE CONT..
▪ Errors like the negative price (-$100,000) can be detected and corrected based on domain knowledge or rules.
▪ In the dataset, negative values don’t make sense for house prices, so this row would need attention.
  ▪ Replace the price with a more reasonable value (e.g., mean or median of valid prices).
  ▪ Remove the row if the data is unreliable and cannot be corrected.

DATA REDUCTION
INTRODUCTION
▪ Data reduction refers to the process of reducing the volume or dimensionality of data without losing significant information.
▪ By making the data smaller,
  ▪ the analysis becomes more efficient,
  ▪ storage costs are reduced,
  ▪ and algorithms can run faster, especially with large datasets.
▪ Various techniques can be employed:
  ▪ Dimensionality reduction
  ▪ Feature selection

TECHNIQUES IN DATA REDUCTION
1. Dimensionality Reduction
▪ Reduce the number of features (dimensions) in the dataset while preserving essential information.
▪ Dimensionality reduction simplifies models, reduces overfitting, and speeds up processing.
▪ Visit https://www.kaggle.com/c/digit-recognizer/data (dataset for digit recognizer).

▪ Key techniques in dimensionality reduction
  ▪ Principal Component Analysis (PCA):
    ▪ PCA is a statistical technique that transforms the original features into a new set of uncorrelated variables called principal components.
  ▪ t-SNE (t-Distributed Stochastic Neighbor Embedding):
    ▪ t-SNE is a non-linear technique used for visualizing high-dimensional data in a two- or three-dimensional space.
    ▪ Unlike PCA, it focuses on maintaining local structure (similar data points stay close together in the reduced space).

EXERCISE
Discussion about PCA
Key steps in PCA:
▪ Standardize the data.
▪ Covariance Matrix
  ▪ Calculate the covariance matrix of the features.
▪ Eigenvectors and Eigenvalues
  ▪ Compute these to identify principal components (directions of maximum variance).
▪ Transform Data
  ▪ Project data onto a new space defined by these components.
References:- https://builtin.com/data-science/step-step-explanation-principal-component-analysis

2. Feature Selection
▪ Feature selection techniques aim to retain the most relevant features, discarding irrelevant or redundant ones.
▪ This process enhances model performance by simplifying the input space.
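As a small illustration of discarding irrelevant features, the sketch below drops features that never vary and keeps only those whose values move together with the target. It is a minimal example under stated assumptions: the helper names `pearson` and `select_features`, both threshold values, and the toy house-price data are hypothetical and chosen only for demonstration.

```python
from math import sqrt
from statistics import pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_features(features, target, min_variance=1e-8, min_abs_corr=0.5):
    """Return {name: correlation} for the features worth keeping.

    `features` maps a feature name to its list of values.
    Both thresholds are illustrative assumptions, not recommended defaults.
    """
    kept = {}
    for name, values in features.items():
        if pvariance(values) <= min_variance:
            continue  # a constant feature cannot help predict anything
        r = pearson(values, target)
        if abs(r) >= min_abs_corr:
            kept[name] = round(r, 3)
    return kept


# Hypothetical toy data: house price against three candidate features.
price = [100, 150, 200, 250, 300]
features = {
    "area": [50, 70, 95, 120, 150],  # rises with price
    "country": [1, 1, 1, 1, 1],      # constant
    "door_code": [3, 9, 1, 9, 3],    # unrelated to price
}

print(select_features(features, price))
```

On this toy data only `area` survives: `country` is removed by the variance check and `door_code` by the correlation check. Correlation only captures linear relationships with a numeric target, so this is a starting point rather than a complete selection procedure.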
▪ Key techniques for feature selection
  ▪ Statistical Methods:
    ▪ Techniques such as correlation analysis, variance thresholding, and hypothesis testing can help identify which features contribute most to the target variable.
    ▪ Ex:- correlation coefficient, chi-square
  ▪ Wrapper Methods:
    ▪ Techniques like forward selection, backward elimination, and recursive feature elimination (RFE) test subsets of features to determine the best combination.
    ▪ Ex:- Forward Selection, Backward Elimination
  ▪ Embedded Methods:
    ▪ These methods, such as regularization (e.g., Lasso), automatically perform feature selection during the model training process.
    ▪ Ex:- Decision Trees and Random Forests

3. Sampling
▪ Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample (or subset).
▪ Ex:-
  ▪ Simple random sample
  ▪ Cluster sample
  ▪ Stratified sample

DATA TRANSFORMATION
INTRODUCTION
▪ Data transformation is an essential step in the data preprocessing phase of data analysis workflows.
▪ It involves converting data from one format or structure to another to improve data quality, align the data with the model requirements, or make patterns more interpretable.

TECHNIQUES FOR DATA TRANSFORMATION
1. Normalization (Min-Max Scaling)
Rescales data to fit within a specific range, usually between 0 and 1.
Used when features have different scales or ranges. It ensures that no feature dominates simply due to its scale.
Formula: \(x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}\)
where \(x'\) is the normalized value, \(x\) is the original value, and \(x_{\min}\) and \(x_{\max}\) are the minimum and maximum values of the feature.
Works best for algorithms that require feature values to be within a specific range, like neural networks or support vector machines (SVMs).

2. Standardization (Z-score Normalization)
▪ Transforms data to have a mean of 0 and a standard deviation of 1. It is most natural when the data roughly follows a normal (Gaussian) distribution.
▪ Formula: \(z = \frac{x - \mu}{\sigma}\)
  where \(z\) is the standardized value, \(x\) is the original value, \(\mu\) is the mean, and \(\sigma\) is the standard deviation.
▪ Useful for algorithms that are sensitive to feature scale or work best with centered, roughly normal features (e.g., linear regression, logistic regression, k-means clustering).
▪ Especially useful when the dataset contains features with vastly different units or magnitudes.
▪ Required for algorithms like principal component analysis (PCA), which are sensitive to scale.

EXAMPLE
Apply normalization to the Price attribute. Use min-max scaling.
Min = 15 and Max = 1200
Let's take the first value (300) as an example:
\(x' = \frac{300 - 15}{1200 - 15} = \frac{285}{1185} \approx 0.24\)

3. Encoding categorical variables
▪ Converts categorical (non-numeric) data into numerical form, which is essential because most machine learning algorithms work only with numerical data.

Common Encoding Techniques:
▪ One-Hot Encoding:
  ▪ Converts each category into a separate binary feature (0 or 1).
  ▪ Useful when there is no ordinal relationship between categories.
▪ Label Encoding:
  ▪ Assigns each category a unique integer label.
  ▪ Best for ordinal data where the categories have a meaningful order.

Mapping: Manager = 0, Engineer = 1, Analyst = 2, Executive = 3

Position     Position (encoded)
Manager      0
Engineer     1
Analyst      2
Manager      0
Executive    3

EXAMPLE
1. Use one-hot encoding for Gender
2. Use label encoding for Education Level

DATA DISCRETIZATION
▪ The raw values of a numeric attribute are replaced by interval labels or conceptual labels.
▪ e.g., age 0–10, 11–20, etc. or youth, adult, senior

1. Binning
▪ Binning is a top-down splitting technique based on a specified number of bins.
▪ We have already discussed binning methods for data smoothing.
▪ These methods are also used as discretization methods.

2. Histogram Analysis
▪ A histogram partitions the values of an attribute, A, into disjoint ranges called buckets or bins.

3. Discretization by Clustering
▪ A clustering algorithm can be applied to discretize a numeric attribute, A, by partitioning the values of A into clusters or groups.

4. Correlation analysis
▪ Measures of correlation can be used for discretization.
▪ Ex:-
  ▪ ChiMerge

REFERENCES
Books
▪ Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei
▪ Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten, Eibe Frank, and Mark A. Hall
Web sites
▪ Data Science Dojo: https://datasciencedojo.com/blog/categorical-data-encoding/
▪ GeeksforGeeks: https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/

ACTIVITY
Read the given paper, focusing on the dataset characteristics and the preprocessing techniques applied in the study.
▪ Discuss the dataset used.
  ▪ Who collected the data, and for what purpose?
  ▪ What are the key questions the dataset is meant to answer?
  ▪ What time period does the data cover?
  ▪ Are there any known limitations or biases in the dataset?
▪ Identify the preprocessing techniques.

THANK YOU!