Chapter 5: Information Pre-processing for Analytics

Summary

This document is a chapter on data pre-processing for analytics. It covers data quality assessment, data cleaning, data transformation, and data reduction, and introduces the methods and techniques used to prepare data for more meaningful analysis.

Full Transcript

Chapter Overview

1. Data Quality Assessment
   - Mismatched data types
   - Mixed data values
   - Data outliers
   - Missing data

2. Data Cleaning
   - Dealing with missing data
   - Noisy data

3. Data Transformation
   - Aggregation
   - Normalization
   - Feature selection
   - Discretization
   - Concept hierarchy generation

4. Data Reduction
   - Attribute selection
   - Numerosity reduction
   - Dimensionality reduction

Why Do We Pre-Process Data?

Imagine you want to cook instant ramen. You don't simply throw the entire packet into boiling water; first you open the packet and take out all of the ingredients, including the uncooked noodles. Only then can you start cooking. Data pre-processing is similar: before we can use data, we have to transform it into a format that can be understood and analyzed by computers and machine learning systems.

Real-world data comes in many formats and forms, including text, images, and video. All of this data may contain errors, inconsistencies, and incomplete or non-uniform values, so it must first be cleaned and formatted before analysis. The main goal of pre-processing is to understand the nature of the data and enable more meaningful analysis.

Steps in pre-processing:
- Data Quality Assessment
- Data Cleaning
- Data Transformation
- Data Reduction

Data Quality Assessment

Data quality assessment is a crucial step in the data pre-processing phase. It involves evaluating the quality of the data to ensure that it is accurate, complete, and reliable. Common issues include mismatched data types, mixed data values, data outliers, and missing data.

Mismatched Data Types

Mismatched data types refer to situations where the format or type of data in a particular column is inconsistent with the intended or expected data type. This inconsistency can lead to errors in analysis and must be addressed during the data quality assessment phase.

Objective: ensure that data in different formats is reformatted to maintain consistency and facilitate proper analysis.

Example: consider a dataset that includes a "Date of Birth" column. In a well-maintained dataset, the date of birth should be consistently formatted as a date type. However, due to various reasons, the data in this column might be inconsistently formatted:

ID  Name     Date of Birth
1   Alice    1990-05-15
2   Bob      03/22/1985
3   Charlie  1978/12/10
4   David    1995-08-31

To address this issue, reformat the "Date of Birth" column to one consistent format, such as DD/MM/YYYY, for all records. This can be done using data transformation techniques during the data pre-processing stage:

ID  Name     Date of Birth
1   Alice    15/05/1990
2   Bob      22/03/1985
3   Charlie  12/10/1978
4   David    31/08/1995
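This reformatting step can be automated. Below is a minimal sketch using pandas (the `format="mixed"` option requires pandas 2.0 or later). Note that a value like 1978/12/10 is genuinely ambiguous, so in practice you would confirm which convention the source system used before trusting the inferred parse.

```python
import pandas as pd

# Toy data reproducing the inconsistent "Date of Birth" column above.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Date of Birth": ["1990-05-15", "03/22/1985", "1978/12/10", "1995-08-31"],
})

# Parse each value into a proper datetime, letting pandas infer the
# format per element (format="mixed" requires pandas >= 2.0).
parsed = pd.to_datetime(df["Date of Birth"], format="mixed")

# Re-serialize every date in one consistent DD/MM/YYYY format.
df["Date of Birth"] = parsed.dt.strftime("%d/%m/%Y")
print(df)
```

Storing the parsed values as a true datetime column (rather than re-serializing to strings) is usually preferable when further date arithmetic is needed.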
Mixed Data Values

Mixed data values refer to situations where different descriptors or representations are used for the same concept or category. Standardizing these values is essential for consistency and accurate analysis.

Objective: ensure that different descriptors for the same feature are made uniform across the dataset.

Example: a "Gender" column should be represented using a consistent set of values, such as "Male" and "Female". However, the data in this column might have mixed representations:

ID  Name     Gender
1   Alice    Female
2   Bob      M
3   Charlie  M
4   David    Male

To address this issue, standardize the values in the "Gender" column, for example by using "Male" and "Female" as the uniform descriptors:

ID  Name     Gender
1   Alice    Female
2   Bob      Male
3   Charlie  Male
4   David    Male

Data Outliers

Data outliers are values that deviate significantly from the rest of the dataset. Identifying and addressing outliers is important to prevent them from skewing analysis results.

Objective: identify and address data points that deviate significantly from the majority of the dataset.

Example: consider a dataset that includes a "Salary" column. The salary values are mostly in the low thousands, but there is a dramatic jump to RM 120,000 for one record:

ID  Name     Salary
1   Alice    RM 2,500
2   Bob      RM 2,800
3   Charlie  RM 120,000
4   David    RM 1,500

To address this issue, you would need to decide on an appropriate strategy. One option is to remove or transform the outlying value:

ID  Name     Salary
1   Alice    RM 2,400
2   Bob      RM 2,800
3   Charlie  …
4   David    RM 3,500

Alternatively, you may decide not to treat these high incomes as outliers at all, and instead document that they represent genuinely high-earning individuals rather than errors.
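A common way to flag outliers such as Charlie's salary is the interquartile range (IQR) rule. The sketch below is a minimal illustration, assuming the salary column has already been converted to plain numbers rather than the formatted "RM …" strings shown above.

```python
import pandas as pd

# Toy numeric version of the salary column above.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Salary": [2500, 2800, 120000, 1500],
})

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1 = df["Salary"].quantile(0.25)
q3 = df["Salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

df["is_outlier"] = (df["Salary"] < lower) | (df["Salary"] > upper)
print(df)
```

Here only the RM 120,000 value falls outside the 1.5 × IQR fences. Whether to drop, cap, or keep a flagged value is a judgment call that depends on the domain, as noted above.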
Data Cleaning

Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets. The goal of data cleaning is to improve data quality by handling missing or incomplete information, correcting inaccuracies, and ensuring that the data is reliable and suitable for analysis. Two common problems are missing data and noisy data.

Missing Data

Missing data refers to the absence or incompleteness of values in a dataset. It occurs when no data is recorded, or when data is not available for a particular observation or variable.

Objective: address fields with missing values, ensuring that empty or placeholder values are handled appropriately.

ID  Name     Age
1   Alice    33
2            38
3   Charlie
4   David    28

Example: in the dataset above, some "Name" and "Age" values are missing. A simple approach is to flag null values with explicit placeholders:

ID  Name     Age
1   Alice    33
2   zzz      38
3   Charlie  --
4   David    28

In more advanced analysis, techniques such as deletion, imputation, interpolation, or predictive modeling can be used. For predictive modeling, derived attributes can sometimes be used to predict the value of another attribute; however, this solution only applies to certain kinds of data.

Noisy Data

Noisy data refers to data that contains errors or inconsistencies, typically in the form of irrelevant or misleading information, outliers, or inaccuracies. Noise can be introduced during data collection, transmission, or processing, and it can impact the accuracy and reliability of analysis results.

ID  Name     Points
1   Alice    55
2   Bob      58
3   Charlie  120
1   Alice    55
4   David    57

Dealing with noisy data involves a combination of statistical techniques, domain knowledge, and pre-processing methods such as duplicate data removal, data transformation, smoothing or aggregation, and outlier detection and removal.

Example: applying duplicate removal and a log transformation to the dataset above gives:

ID  Name     Points  Log (Points)
1   Alice    55      4.01
2   Bob      58      4.06
3   Charlie  120     4.79
4   David    57      4.04

Data Transformation

Data transformation is the process of converting or altering data to make it more suitable for analysis or to meet specific requirements. This involves applying mathematical, statistical, or business rules to the data, resulting in a modified dataset. The choice of transformation depends on the characteristics of the data and the objectives of the analysis. Common transformations include aggregation, normalization, feature selection, discretization, and concept hierarchy generation.

Aggregation

Aggregation is the process of combining multiple data values into a single summary value. It involves grouping and summarizing data to provide a more concise and informative view. Aggregation is often used to analyze data at a higher, more abstract level, enabling the extraction of meaningful insights from complex datasets.

For example, consider a dataset of monthly sales for a retail store:

Month     Sales
January   100
February  120
March     90
April     110
May       130

The original dataset provides monthly sales data. To get a more comprehensive view, aggregation is applied to obtain yearly figures: summing the monthly sales values produces a single value representing total sales for the year, and dividing by the number of months gives the average.

Year  Total Sales  Average Sales
2023  550          110

Types of aggregation functions:
- Sum: adds up all values in the group.
- Average (mean): calculates the average of values in the group.
- Count: counts the number of values in the group.
- Min and Max: finds the minimum or maximum value in the group.
- Custom functions: other user-defined aggregation functions.

Normalization

Normalization is the process of scaling data into a standardized or regularized range. This is often done to ensure that all features of a dataset have the same scale, preventing certain features from dominating others during analysis.

ID  Age  Income
1   25   50000
2   30   52000
3   35   55000
4   40   60000
5   45   65000

In this example, the original data contains age and income values. Normalization is applied independently to each feature, scaling it into a regularized range between 0 and 1:

ID  Normalized Age  Normalized Income
1   0               0
2   0.25            0.13
3   0.5             0.33
4   0.75            0.67
5   1               1

Normalization is especially useful when different features have different ranges or units, ensuring that each feature contributes proportionally to the analysis without being biased by its scale.
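Min-max normalization is simple to express in code. The sketch below applies the formula (x − min) / (max − min) column-wise to the age/income table above using plain pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Income": [50000, 52000, 55000, 60000, 65000],
})

# Min-max normalization: rescale each column to the [0, 1] range
# with (x - min) / (max - min), applied column-wise.
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized.round(2))
```

In practice, scikit-learn's MinMaxScaler performs the same rescaling and also remembers the fitted minimum and maximum, so the identical transformation can be reapplied to new data.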
Feature Selection

Feature selection is the process of choosing a subset of relevant and significant variables (features) from a larger set of features in a dataset. This is done to improve a model's performance, reduce overfitting, and enhance interpretability.

ID  Age  Income  Education Level  Credit Score
1   25   50000   Bachelor's       700
2   30   52000   Master's         720
3   35   55000   High School      650
4   40   60000   PhD              750
5   45   65000   Bachelor's       680

Example: the original data contains multiple features, including Age, Income, Education Level, and Credit Score. Feature selection involves choosing the subset of features that is most relevant for the analysis. Here, Age, Income, and Credit Score are chosen as the critical features, and Education Level is excluded:

ID  Age  Income  Credit Score
1   25   50000   700
2   30   52000   720
3   35   55000   650
4   40   60000   750
5   45   65000   680

Reasons for feature selection:
- Simplicity: selecting only the most critical features simplifies the model and improves its interpretability.
- Performance: removing irrelevant or redundant features can improve the model's performance by reducing overfitting.
- Computational efficiency: working with a smaller set of features can lead to faster model training and predictions.

Discretization (1/2)

Discretization is the process of transforming continuous data into discrete or categorical values by dividing the range of values into intervals or bins. This is done to simplify the data, reduce noise, and make it more suitable for certain types of analyses, especially those that work well with categorical variables.

For example, consider this dataset of ages:

ID  Age
1   25
2   30
3   35
4   40
5   45

Discretization is applied to group these continuous values into three intervals, making them categorical. Each individual's age is now represented by the interval to which it belongs:

ID  Age
1   20-30
2   30-40
3   30-40
4   40-50
5   40-50

Discretization (2/2)

Discretization methods (a code sketch of the first method follows this list):
- Equal Width (Equal Interval): divides the range of values into equally sized intervals (e.g., 0-10, 10-20, 20-30, ...).
- Equal Frequency: divides the data into intervals containing an equal number of points (e.g., 0-15, 15-30, 30-45, ...).
- Clustering-Based: uses clustering algorithms to group similar values (e.g., age groups based on k-means clustering).
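Equal-width binning like the example above maps directly onto pandas' cut function. This is a minimal sketch; the bin edges and labels are chosen to reproduce the 20-30 / 30-40 / 40-50 intervals from the table.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45], name="Age")

# Equal-width discretization into the three bins used above.
# right=False makes intervals left-inclusive, so 30 falls in 30-40
# and 40 falls in 40-50, matching the example table.
bins = [20, 30, 40, 50]
labels = ["20-30", "30-40", "40-50"]
age_groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_groups)
```

For equal-frequency binning, pandas' qcut divides the data at quantiles instead of at fixed-width edges.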
Concept Hierarchy Generation

Concept hierarchy generation involves creating hierarchical structures within and between features to capture relationships that may not be present in the original data. This can enhance the understanding and representation of complex relationships in the dataset.

Example: the original data contains a simple "Job Level" feature:

ID  Name     Job Level
1   Alice    Junior
2   Bob      Senior
3   Charlie  Junior
4   David    Manager
5   Emily    Senior

Concept hierarchy generation involves mapping these job levels into a hierarchical structure, introducing additional semantic meaning:

ID  Name     Job Level
1   Alice    Entry Level
2   Bob      Mid Level
3   Charlie  Entry Level
4   David    Senior Level
5   Emily    Mid Level

Now the "Job Level" feature is not just categorical; it has a hierarchical structure that adds more meaningful information about the organizational hierarchy. Concept hierarchy generation can be applied to various features, creating relationships that might not be explicit in the original data. This process enhances the interpretability and depth of understanding of the dataset.

Data Reduction

Data reduction is the process of reducing the volume of data while producing the same or similar analytical results, by representing the data in a more concise format. It encompasses attribute selection, which focuses on choosing relevant features; numerosity reduction, which reduces the number of instances or records in the dataset; and dimensionality reduction, which decreases the number of variables or features.

Attribute Selection

Attribute selection is the process of choosing a subset of relevant features or attributes from the original dataset. This involves selecting and combining tags or features to create a more focused and concise representation of the data.

For example, suppose our study aims to investigate the relationship between physical activity and blood pressure. The original health data includes multiple features related to a person's health:

ID  Age  BMI   Blood Pressure  Cholesterol Level  Physical Activity Level
1   25   22.5  120/80          Normal             High
2   30   25    130/85          High               Moderate
3   35   28    140/90          High               Low
4   40   30.5  150/95          Very High          High
5   45   26.8  125/82          Normal             Moderate

Attribute selection involves choosing the subset of relevant features, such as age, BMI, blood pressure, and physical activity level, to create a more focused dataset:

ID  Age  BMI   Blood Pressure  Physical Activity Level
1   25   22.5  120/80          High
2   30   25    130/85          Moderate
3   35   28    140/90          Low
4   40   30.5  150/95          High
5   45   26.8  125/82          Moderate

Numerosity Reduction

Numerosity reduction is the process of selecting and utilizing only the data instances and variables that are essential for a particular analysis. This technique focuses on reducing the number of instances or records in the dataset, creating a more streamlined and efficient representation for the specific analytical task.

Example: the original transaction data includes multiple instances representing customer transactions:

Transaction ID  Customer ID  Product     Amount (RM)
1               101          Laptop      3200
2               102          Smartphone  1800
3               103          Headphones  150
4               105          Tablet      1200
5               105          Laptop      1800
6               103          Tablet      950
7               103          Laptop      2300

Suppose we want to analyze transactions related to laptops only. Numerosity reduction keeps just the relevant rows and excludes any record that does not involve a laptop:

Transaction ID  Customer ID  Product  Amount (RM)
1               101          Laptop   3200
5               105          Laptop   1800
7               103          Laptop   2300

Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of variables or features in a dataset while retaining its essential information. This technique aims to decrease the dimensionality of the data, making it more manageable for analysis and downstream processes.

For example, consider a dataset with four features describing the characteristics of fruits:

Fruit   Color   Size (cm)  Weight (g)  Taste Score
Apple   Red     8          150         0.7
Banana  Yellow  15         120         0.9
Grape   Purple  3          5           0.8
Orange  Orange  10         200         0.6
Lemon   Yellow  5          30          0.9

Dimensionality reduction is applied to transform these features into a reduced set (Feature 1, Feature 2, Feature 3) using a technique such as Principal Component Analysis (PCA):

Fruit   Feature 1  Feature 2  Feature 3
Apple   0.2        0.6        -0.1
Banana  0.8        -0.3       0.5
Grape   -0.5       0.1        -0.7
Orange  0.3        0.8        0.2
Lemon   -0.7       -0.5       0.8

The reduced representation captures the essential information of the original data in a more compact form.
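The Feature 1-3 values in the table are illustrative rather than the output of an actual computation. As a sketch of how such a projection could be produced, the example below runs scikit-learn's PCA on the fruit data; the one-hot encoding of Color and the standardization step are added assumptions, and the resulting components will not match the illustrative numbers above.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Grape", "Orange", "Lemon"],
    "Color": ["Red", "Yellow", "Purple", "Orange", "Yellow"],
    "Size (cm)": [8, 15, 3, 10, 5],
    "Weight (g)": [150, 120, 5, 200, 30],
    "Taste Score": [0.7, 0.9, 0.8, 0.6, 0.9],
})

# PCA needs numeric input: one-hot encode Color, then standardize so
# large-valued columns (weight) do not dominate the components.
X = pd.get_dummies(df.drop(columns=["Fruit"]), columns=["Color"])
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first three principal components.
pca = PCA(n_components=3)
features = pca.fit_transform(X_scaled)

reduced = pd.DataFrame(features, columns=["Feature 1", "Feature 2", "Feature 3"])
reduced.insert(0, "Fruit", df["Fruit"])
print(reduced.round(2))
print("Explained variance ratio:", pca.explained_variance_ratio_.round(2))
```

PCA chooses orthogonal directions of maximum variance, and the explained variance ratio reports how much of the original variance each retained component preserves.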
