Data Preprocessing and Transformation

Data Preprocessing
Data preprocessing is one of the first and most crucial steps: the process of preparing raw data and making it suitable for a machine learning model, which improves the model's accuracy and efficiency. It is an important step in the data science process that prepares data for analysis, and it involves several tasks such as handling missing values, data cleaning, and data transformation. It is a technique/process that converts data into a usable and desired form. Data processing starts with data in its raw form and converts it into a more readable format (image, graph, table, vector file, audio, chart, etc.).

Data Processing: Various Tools
- Analysis tools: Excel and similar tools that help apply the relevant formulas to process the whole dataset
- Statistical tools: SAS
- Database tools: Oracle, MongoDB, Hadoop, etc., which help in processing large amounts of data

Data Cleaning
Data cleaning involves identifying and removing inaccuracies, inconsistencies, and outliers in the data. This step improves the quality of the data and increases the accuracy of the analysis; this is why data science professionals usually spend a very large portion of their time on data cleaning.

The golden rule is that better data beats fancier algorithms.
- Completeness: does the given data include all required information?
- Validity: does the given data conform to business rules and/or restrictions?
- Uniformity: is the given data specified using consistent units of measurement?
- Consistency: is the given data consistent across your datasets?
- Accuracy: is the given data close to the true values?
Data cleaning is an important process, and it starts with removing unwanted samples/observations from the given dataset.

Missing Data
Missing data is data that is not captured for a variable in the observation in question. If missing values are not handled properly, the analyst may end up drawing inaccurate inferences from the data. Missing data reduces the statistical power of the analysis and can distort the validity of the results, so it is very important to handle it.

Handling missing data values: missing values, incompleteness, unknown data, etc. are among the biggest issues when building a machine learning model, because they affect its accuracy.

Ways to handle missing values:
- Drop missing values
- Ignore tuples with missing values
- Imputation
- Others

To replace each NaN in the dataset, we can use the replace() method:

import numpy as np
df.replace({np.nan: 1.00})   # replace every NaN with 1.00

To replace with a scalar value, use the fillna() method:

df.fillna(12)   # fill every NaN with the scalar 12

To fill forward, use the method 'ffill' (alias 'pad'); to fill backward, use 'bfill' (alias 'backfill'):

df.fillna(method='ffill')      # propagate the last valid value forward
df.fillna(method='backfill')   # propagate the next valid value backward
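The snippets above operate on an unspecified df; the following is a minimal, self-contained sketch (the DataFrame and column names are illustrative, not from the slides) that also shows the "drop missing values" option listed earlier:

import numpy as np
import pandas as pd

# Illustrative DataFrame with missing entries
df = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan],
                   'B': [np.nan, 2.0, 2.5, 4.0]})

print(df.replace({np.nan: 1.00}))   # replace every NaN with a fixed value
print(df.fillna(12))                # fill every NaN with a scalar
print(df.ffill())                   # forward fill (pad)
print(df.bfill())                   # backward fill (backfill)
print(df.dropna())                  # drop rows that contain any NaN
print(df.dropna(subset=['A']))      # drop rows where column 'A' is NaN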
Data Imputation
Missing values can be handled in several ways:
- Mean/mode/median imputation replaces the missing value with the mean/mode/median of the variable.
- Hot deck imputation replaces the missing value with a value from a similar observation.
- Regression imputation uses a regression model to predict the missing value from the other variables in the dataset.
- Stochastic regression imputation extends regression imputation by adding a random component to the predictions (see below).

Hot-Deck Imputation
1. A technique for handling non-respondents in data.
2. Matching: non-respondents are matched to resembling respondents.
3. Imputation: the missing value is replaced with the score of a similar respondent.

Approaches:
1. Distance function approach: impute the missing value with the score of the nearest neighbour, using squared-distance statistics for matching.
2. Pattern matching approach (more common): the sample is stratified into homogeneous groups, and the imputed value is drawn at random from cases in the same group.

Characteristics:
1. Preserves the variable's distribution by replacing missing data with realistic scores.
2. Common in survey research.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer   # required to enable IterativeImputer
from sklearn.impute import IterativeImputer

# Handling missing values - mean/mode/median imputation
# (strategy can be 'mean', 'median', or 'most_frequent' for the mode)
def handle_missing_values(data, strategy='mean'):
    imputer = SimpleImputer(strategy=strategy)
    imputed_data = imputer.fit_transform(data)
    return imputed_data

# Handling missing values - hot deck imputation
# (each missing entry is replaced by a random draw from the observed values of the same column)
def hot_deck_imputation(data):
    imputed_data = data.copy()
    for column in imputed_data.columns:
        missing_indices = imputed_data[column].isnull()
        unique_values = imputed_data.loc[~missing_indices, column].unique()
        imputed_data.loc[missing_indices, column] = np.random.choice(unique_values, size=missing_indices.sum())
    return imputed_data

Regression Imputation
Regression imputation uses a regression model to predict the missing value from the other variables in the dataset. To preserve the relationships between features, we fit a regression model on the feature with missing data and then use this model to predict the values that replace the missing entries.

# Regression imputation
# (IterativeImputer models each feature with missing values as a function of the other features)
def regression_imputation(data):
    imputer = IterativeImputer()
    imputed_data = imputer.fit_transform(data)
    return imputed_data

# Calling regression imputation
imputed_data = regression_imputation(data)

Stochastic Regression Imputation
Stochastic regression imputation extends regression imputation: the imputed values are drawn from the predictive distribution of a regression model rather than set to the point prediction. Multiple datasets can be simulated by repeatedly drawing imputed values from the posterior predictive distribution of the model, which is the basis of multiple imputation.
- Aim: reduce bias with an additional step.
- Augmentation: predicted scores are adjusted with a residual term.
- Residual term: normally distributed, with mean 0 and variance equal to the residual variance.
- Data variability: preserved, which keeps the parameter estimates unbiased.
- Risk: increased chance of Type I errors due to the lack of uncertainty about the imputed values.

# Stochastic regression imputation
# (sample_posterior=True makes IterativeImputer draw each imputed value from the
#  posterior predictive distribution instead of using the point prediction)
def stochastic_regression_imputation(data, iterations=10):
    imputer = IterativeImputer(sample_posterior=True, max_iter=iterations)
    imputed_data = imputer.fit_transform(data)
    return imputed_data
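A minimal usage sketch for the helper functions defined above, assuming a small illustrative numeric DataFrame with missing entries (the data and column names are not from the slides):

import numpy as np
import pandas as pd

# Illustrative data with missing values
sample = pd.DataFrame({'age': [25.0, np.nan, 31.0, 40.0, np.nan],
                       'income': [32000.0, 45000.0, np.nan, 58000.0, 61000.0]})

mean_imputed = pd.DataFrame(handle_missing_values(sample, strategy='mean'), columns=sample.columns)
hot_deck_imputed = hot_deck_imputation(sample)
stochastic_imputed = pd.DataFrame(stochastic_regression_imputation(sample), columns=sample.columns)

print(mean_imputed)
print(hot_deck_imputed)
print(stochastic_imputed)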
Interpolation
Interpolation estimates a missing value from the values on either side of the missing data point (for example, by averaging them). This can be done with the interpolate() function in pandas, which supports various interpolation methods such as linear or polynomial. Linear interpolation is essentially a straight line between two given points when the data points between them are missing.
(Figure: two known points are shown in red; the missing point, shown in blue, lies on the line between them.)

# Mean imputation and the interpolate() method

import numpy as np
import pandas as pd

# Creating a sample DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, 7, np.nan, 9],
    'C': [np.nan, 10, 11, 12, 13]
}
df = pd.DataFrame(data)

# Mean imputation using the fillna() function
df_mean_imputed = df.fillna(df.mean())

# Interpolation using the interpolate() function
df_interpolated = df.interpolate()

# Displaying the mean-imputed and interpolated DataFrames
print("\nMean Imputed DataFrame:")
print(df_mean_imputed)
print("\nInterpolated DataFrame:")
print(df_interpolated)

Noisy Data
Handling noisy data means removing or replacing incorrect or irrelevant data. Noise consists of unwanted or meaningless data items, features, or records that do not help explain the feature itself or the relationship between the features and the target. Noisy data can significantly impair the prediction of any meaningful information and cause algorithms to miss patterns in the data, leading to decreased classification accuracy and poor prediction results. Noise can take the form of anomalies in the features and the target, irrelevant or weak features, and noisy records. It is therefore important for a data scientist to detect and eliminate noise before applying any algorithm to noisy data.

Techniques to handle noisy data:
- Binning: first sort the data and partition it into (equal-frequency) bins, then smooth by bin means, bin medians, bin boundaries, etc.
- Regression: smooth the data by fitting it to regression functions.
- Clustering: detect and remove noisy data.
- Combined computer and human inspection: detect suspicious values and have a human check them (e.g., to deal with possible outliers).

Binning groups a set of continuous or numerical data into a smaller number of discrete "bins" or ranges. This can be done using the cut or qcut functions in pandas, which allow you to specify the number of bins and the labels for each bin.
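The examples below use pd.cut; for the equal-frequency binning and "smooth by bin means" idea described above, the following is a minimal sketch (the values and column name are illustrative, not from the slides):

import pandas as pd

# Illustrative noisy numeric column
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], name='price')

# Equal-frequency binning with qcut: each bin holds roughly the same number of points
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: replace each value with the mean of its bin
smoothed = prices.groupby(bins).transform('mean')

print(pd.DataFrame({'price': prices, 'bin': bins, 'smoothed': smoothed}))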
Binning

# Binning data with pd.cut

import numpy as np
import pandas as pd

# Creating a sample DataFrame
data = {
    'Height': [160, 172, 183, 155, 168, 178, 162, 170],
    'Category': ['Short', 'Average', 'Tall', 'Short', 'Average', 'Tall', 'Short', 'Average']
}
df = pd.DataFrame(data)

# Defining the bin edges
bins = [150, 165, 180, np.inf]

# Applying binning using the cut() function
df['Height Category'] = pd.cut(df['Height'], bins=bins, labels=['Short', 'Average', 'Tall'])

# Displaying the binned DataFrame
print("\nBinned DataFrame:")
print(df)

Binning using KBinsDiscretizer

import pandas as pd
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Create a sample DataFrame
data = {
    'Height': [160, 172, 183, 155, 168, 178, 162, 170],
    'Category': ['Short', 'Average', 'Tall', 'Short', 'Average', 'Tall', 'Short', 'Average']
}
df = pd.DataFrame(data)

# Perform binning on the 'Height' column (three equal-width bins, ordinal-encoded)
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
df['Height_Bins'] = est.fit_transform(df[['Height']])

print("DataFrame after binning:")
print(df)

Binning using KBinsDiscretizer and pd.cut

# Create a sample DataFrame
data = {
    'Height': [160, 172, 183, 155, 168, 178, 162, 170],
    'Category': ['Short', 'Average', 'Tall', 'Short', 'Average', 'Tall', 'Short', 'Average']
}
df = pd.DataFrame(data)

# Perform binning on the 'Height' column
kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
df['Height_Bins'] = kbd.fit_transform(df[['Height']])
bin_edges = kbd.bin_edges_[0]   # bin edges of the first (and only) binned feature

# Map bin edges to category names (e.g., 'Low', 'Medium', 'High')
category_names = ['Low', 'Medium', 'High']

# Convert bin edges to category values
df['Height_Bins'] = pd.cut(df['Height'], bins=bin_edges, labels=category_names, include_lowest=True)
print(df)

Regression

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Create a sample DataFrame
data = {
    'Hours_Studied': [2, 4, 6, 8, 10],
    'Score': [60, 70, 80, 90, 95]
}
df = pd.DataFrame(data)

# Perform linear regression
X = df[['Hours_Studied']]
y = df['Score']
reg = LinearRegression().fit(X, y)

# Predict the score for new values of hours studied (illustrative value)
new_hours_studied = pd.DataFrame({'Hours_Studied': [7]})
predicted_score = reg.predict(new_hours_studied)
print("Predicted Score:", predicted_score)

Clustering

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# Create a sample DataFrame
data = {
    'X': [2, 2, 8, 5, 7, 6],
    'Y': [10, 5, 4, 8, 5, 2]
}
df = pd.DataFrame(data)

# Perform k-means clustering
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(df)

# Get cluster labels for the data points
cluster_labels = kmeans.labels_
print("Cluster Labels:", cluster_labels)
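The k-means example stops at printing the labels; as a sketch of the "detect and remove noisy data" step, one possible approach (illustrative, not from the slides) flags points that lie unusually far from their assigned cluster centre:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({'X': [2, 2, 8, 5, 7, 6], 'Y': [10, 5, 4, 8, 5, 2]})
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(df)

# Distance of each point to its assigned cluster centre
centres = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(df.values - centres, axis=1)

# Flag points further than mean + 2 standard deviations of the distances as potential noise
threshold = distances.mean() + 2 * distances.std()
df['possible_noise'] = distances > threshold
print(df)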
Data Transformation
Data transformation involves changing the data in a way that makes it more appropriate for analysis. This can include discretization, rescaling data, binarizing data, and feature scaling. Rescaling data changes the scale of a variable to a standard range, such as 0 to 1. Binarizing data converts a variable to a binary format, such as 0 or 1. Feature scaling changes the scale of a variable so that it has a similar range to the other variables in the dataset.

Rescale Data
To uniformly scale attributes that have varying scales, rescaling puts all the attributes on the same scale; in scikit-learn this can be done with the MinMaxScaler class.

from sklearn.preprocessing import MinMaxScaler

# Rescale data to a common range with MinMaxScaler
def rescale_data(data):
    scaler = MinMaxScaler()
    scaled_data = scaler.fit_transform(data)
    return scaled_data

# Rescale data
scaled_data = rescale_data(data)

Feature Scaling
Feature scaling is a technique to standardize the independent variables in the data to a specified range. Putting the variables on the same range and scale prevents some variables from dominating others, and it helps the learning algorithm converge and produce results faster.
- Normalization, also known as min-max scaling, is a technique in which the values are scaled so that they end up ranging between 0 and 1.
- Standardization is a technique in which the values are centred around the mean with a unit standard deviation.

# Feature scaling (standardization: subtract the mean, divide by the standard deviation)
def feature_scaling(data):
    scaled_data = (data - data.mean()) / data.std()
    return scaled_data

# Feature scaling
scaled_data = feature_scaling(data)

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-max scaling (normalization)
def min_max_scaling(data):
    scaler = MinMaxScaler()
    scaled_data = scaler.fit_transform(data)
    return scaled_data

# Standardization
def standardization(data):
    scaler = StandardScaler()
    standardized_data = scaler.fit_transform(data)
    return standardized_data

data = pd.read_csv('data.csv')

# Min-max scaling (normalization)
normalized_data = min_max_scaling(data)

# Standardization
standardized_data = standardization(data)

Binarizing Data: One-Hot Encoding
One-hot encoding is used for treating categorical variables. It creates new (binary) columns indicating the presence of each possible value from the original data.

Binarization simply creates additional features based on the number of unique values in the categorical feature. Older versions of scikit-learn's one-hot encoder accepted only numerical categorical values, so string values had to be label encoded before being one-hot encoded; current versions of OneHotEncoder accept string categories directly, as in the example below. One-hot encoding makes the training data more useful and expressive, and it can be rescaled easily.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a sample DataFrame
data = {
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green']
}
df = pd.DataFrame(data)

# Perform one-hot encoding
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df[['Color']])
df_encoded = pd.DataFrame(encoded_data.toarray(), columns=encoder.categories_[0])

print("DataFrame after one-hot encoding:")
print(df_encoded)

In pandas:

# One-hot encoding with get_dummies
dum_color = pd.get_dummies(df['Color'], prefix='Type_is')

# Merge with the main DataFrame
df = df.join(dum_color)
print(df)
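The "binarizing data" transformation mentioned at the start of this section (thresholding a numeric variable to 0/1) is not otherwise illustrated; a minimal sketch using scikit-learn's Binarizer, with an illustrative column and threshold:

import pandas as pd
from sklearn.preprocessing import Binarizer

# Illustrative numeric data
df_num = pd.DataFrame({'Hours_Studied': [2, 4, 6, 8, 10]})

# Values above the threshold become 1, the rest become 0
binarizer = Binarizer(threshold=5)
df_num['Studied_A_Lot'] = binarizer.fit_transform(df_num[['Hours_Studied']]).ravel().astype(int)
print(df_num)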
Label Encoding
Label encoding is used to handle categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering. In a data transformation pipeline, label encoding is usually applied during the preprocessing phase, before feeding the data into a machine learning model. The goal is to enhance the model's ability to understand and make predictions based on the input data.

scikit-learn provides a method for encoding the categories of categorical features into numeric values: the label encoder encodes labels with values between 0 and n-1, where n is the number of distinct labels.

In pandas:

import pandas as pd

# Create a sample DataFrame with a categorical dtype
data = {
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green']
}
df = pd.DataFrame(data, dtype='category')

# cat.codes assigns an integer code to each category
df['Encoded_Color'] = df['Color'].cat.codes
print(df)

In scikit-learn, label encoding can be implemented by importing the LabelEncoder class from the preprocessing module:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create a sample DataFrame
data = {
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green']
}
df = pd.DataFrame(data)

# Perform label encoding
encoder = LabelEncoder()
df['Encoded_Color'] = encoder.fit_transform(df['Color'])

print("DataFrame after label encoding:")
print(df)

Log Transformation
The logarithm transformation (log transform) helps to handle skewed data; after the transformation, the distribution becomes closer to normal. It is useful when the order of magnitude of the data varies within the range of the data. It also decreases the effect of outliers, because differences in magnitude are normalized, which makes the model more robust.

The data you apply a log transform to must contain only positive values; otherwise you will receive an error. You can add 1 to your data before transforming it (i.e., log(1 + x)), which keeps the transformation defined and its output non-negative for non-negative data.

# z-score scaling of a column (assumes df has a numeric 'Salary' column)
df['scaled_Salary'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()

Filtering