Lecture 3 - BMT 443
BMT 443
Working with data: Data preparation
Lecture 3
Dr. Asma Abahussin
Department of Biomedical Technology
College of Applied Medical Sciences
King Saud University

Objectives
To learn and understand:
▪ What is meant by data preparation
▪ The steps of data cleaning and preprocessing and the importance of each step
▪ The techniques used in each step of data cleaning and preprocessing

Introduction
❖ Raw data can rarely be used right out of the box.
❖ There will be errors in the collection, missing values, corrupt records, and other challenges.
❖ Data scientists must preprocess and clean the data to ensure data integrity and that the data are in a form ready for analysis.

Data Preparation
Data preparation implies asking questions like:
▪ Does the data make sense?
▪ Does the data match the column labels?
▪ Does the data follow the rules for its field?
▪ How many of the values are null? Is the number of nulls acceptable?
▪ Are there duplicates?
Data preparation = noticing all the problems within the data and fixing them = data cleaning and preprocessing.

Data Cleaning and Preprocessing Workflow
❖ The data cleaning and preprocessing workflow often varies based on the project and the nature of the data.
❖ A typical workflow may involve the following steps:
1. Data Collection
2. Data Cleaning
3. Data Integration
4. Data Transformation
5. Data Reduction

Step 1: Data Collection
Data can be collected through various sources, such as social media monitoring, online tracking, surveys, feedback, databases, manual entry, etc.

Step 2: Data Cleaning
Clean the collected data by identifying and correcting common data quality issues such as:
▪ Missing values: can take various forms, from clear blanks to placeholders like "N/A" or "-99"; they can lead to skewed analyses or introduce bias into models.
▪ Outliers: data points that differ significantly from other observations in the dataset; they can skew the mean and inflate the standard deviation, misleading the overall data distribution.
▪ Inconsistent formatting: inconsistencies can occur in various forms, such as date formats, casing in string data, or numeric data stored as text.
▪ Duplicate data: duplicates are repeated records in the dataset; they can bias the analysis and lead to incorrect conclusions.

Some techniques for handling and correcting some of the common data quality issues (sketched in the examples below):
▪ Missing values:
✓ Deletion
✓ Imputation: the process of substituting missing data with substituted values, such as the mean, median, or mode for numerical, ordinal, and nominal data, respectively.
▪ Outliers:
✓ Deletion
✓ Transformation: transform the value to its log or square root.
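A minimal pandas sketch of the two options for missing values, deletion and imputation, assuming a hypothetical toy table (the column names, values, and placeholder codes are illustrative, not from the lecture):

```python
import numpy as np
import pandas as pd

# Toy patient table with the kinds of gaps described above:
# real blanks (NaN/None) plus placeholders such as "N/A" and -99.
df = pd.DataFrame({
    "age":    [34, -99, 41, np.nan, 29],        # numerical
    "stage":  ["I", "III", "N/A", "II", "II"],  # ordinal
    "gender": ["F", "M", None, "F", "M"],       # nominal
})

# Turn placeholder codes into real missing values first.
df = df.replace({-99: np.nan, "N/A": np.nan})

# Option 1: deletion -- drop every row that still has a missing value.
dropped = df.dropna()

# Option 2: imputation -- the lecture's rule is mean/median/mode for
# numerical/ordinal/nominal data. Taking the median of an ordinal
# column requires mapping its labels to ordered integers first, so
# this sketch falls back to the mode for "stage".
df["age"] = df["age"].fillna(df["age"].mean())
df["stage"] = df["stage"].fillna(df["stage"].mode()[0])
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
```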
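Outlier handling can be sketched the same way. The lecture names deletion and transformation but does not prescribe a detection rule; the 1.5 × IQR fence below is one common convention, and the values are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical cost column with one extreme value.
costs = pd.DataFrame({"cost": [120.0, 128.0, 135.0, 142.0, 3900.0]})

# Option 1: deletion -- flag values outside the 1.5 * IQR fences
# and keep only the inliers.
q1, q3 = costs["cost"].quantile([0.25, 0.75])
iqr = q3 - q1
inliers = costs[costs["cost"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Option 2: transformation -- compress the scale with a log or square
# root so the extreme value no longer dominates (log needs positive
# values, square root needs non-negative values).
costs["log_cost"] = np.log(costs["cost"])
costs["sqrt_cost"] = np.sqrt(costs["cost"])
```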
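The lecture lists techniques only for missing values and outliers; for the other two issues (inconsistent formatting and duplicates), here is a hedged sketch of common pandas fixes, with an invented table:

```python
import pandas as pd

df = pd.DataFrame({
    "visit_date": ["2023-01-05", "Jan 5, 2023", "2023-01-05"],
    "department": ["Cardiology", "cardiology", "Cardiology"],
    "cost":       ["1200", "1200", "1200"],  # numeric data stored as text
})

# Inconsistent formatting: standardize casing, parse mixed date
# formats (format="mixed" needs pandas >= 2.0), and convert numeric
# text into real numbers.
df["department"] = df["department"].str.lower()
df["visit_date"] = pd.to_datetime(df["visit_date"], format="mixed")
df["cost"] = pd.to_numeric(df["cost"])

# Duplicate data: once the formatting is consistent, all three rows
# turn out to be the same record; keep only one.
df = df.drop_duplicates()
```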
Step 3: Data Integration
❖ Data integration involves combining data from different sources, resolving any inconsistencies, and providing a unified view of these data.
❖ Data integration approaches include (see the sketch after Step 5 below):
▪ Merging: the process of combining two or more data tables based on common columns between them.
▪ Concatenating: the process of appending datasets, i.e., adding data tables along a particular axis, either row-wise or column-wise.

Step 4: Data Transformation
❖ Data transformation is the process of converting data from one format or structure into another to make it suitable for analysis.
❖ Data transformation techniques include (both are sketched after Step 5 below):
▪ Encoding categorical variables: the transformation of categorical variables into a suitable numeric format. Different techniques are used, such as Label Encoding, where each label is assigned a unique integer based on alphabetical ordering.
▪ Normalizing numerical variables:
▪ Normalization is a method for changing the values of numeric columns in a dataset to a common scale.
▪ Models perform better when the input numerical variables fall within a similar scale.
▪ Without normalization or scaling, variables with higher values may dominate the model's outcome. This could lead to misleading results and a model that fails to capture the influence of other variables.
▪ Different statistics-based techniques are used for data normalization; one of the simplest is Min-max scaling, where each variable is scaled and translated to values within a specified range, such as 0 to 1.

Step 5: Data Reduction
❖ Data reduction is reducing the data's dimensionality, if necessary, to focus on the most relevant variables.
❖ Data reduction techniques include (sketched below):
▪ Feature selection: the process of selecting a subset of relevant variables for use in model construction.
▪ Feature extraction: the process of transforming the original high-dimensional data into new lower-dimensional data that represents most of the relevant variables in the original data.
❖ Data reduction is important for different reasons, including:
▪ Simplicity: fewer features make the model simpler and easier to interpret.
▪ Speed: less data means algorithms train faster.
▪ Prevention of overfitting: less irrelevant data means less opportunity to make decisions based on noise.
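Returning to Step 3, a minimal sketch of merging and concatenating with pandas; the two tables and their columns are hypothetical:

```python
import pandas as pd

# Two tables sharing the common column "patient_id".
demographics = pd.DataFrame({"patient_id": [1, 2, 3], "age": [34, 52, 41]})
labs = pd.DataFrame({"patient_id": [1, 2, 4], "glucose": [5.4, 7.1, 6.2]})

# Merging: combine tables on the common column. An inner join keeps
# only the patient_ids present in both tables.
merged = demographics.merge(labs, on="patient_id", how="inner")

# Concatenating: append tables along a particular axis.
more_labs = pd.DataFrame({"patient_id": [5], "glucose": [5.9]})
row_wise = pd.concat([labs, more_labs], axis=0, ignore_index=True)  # row-wise
col_wise = pd.concat([demographics, labs], axis=1)  # column-wise, aligned on row index
```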
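For Step 4's Label Encoding, a sketch that makes the alphabetical ordering explicit (the blood_type column is an invented example; scikit-learn's LabelEncoder assigns integers in the same sorted order):

```python
import pandas as pd

df = pd.DataFrame({"blood_type": ["B", "A", "O", "A", "AB"]})

# Label Encoding: each label gets a unique integer based on
# alphabetical ordering -> A=0, AB=1, B=2, O=3.
labels = sorted(df["blood_type"].unique())
mapping = {label: code for code, label in enumerate(labels)}
df["blood_type_encoded"] = df["blood_type"].map(mapping)
```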
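Min-max scaling to the range 0 to 1 follows the formula x_scaled = (x - min) / (max - min), applied per column; the measurements below are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 175.0, 190.0],
    "weight_kg": [55.0, 70.0, 82.0, 110.0],
})

# Min-max scaling: x_scaled = (x - min) / (max - min), per column.
# Each column ends up between 0 and 1, so no variable dominates
# simply because it is measured on a larger scale.
scaled = (df - df.min()) / (df.max() - df.min())
```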
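For Step 5, the lecture defines feature selection and feature extraction without naming specific algorithms; one common concrete pairing is a univariate ANOVA F-test for selection and PCA for extraction, sketched here on scikit-learn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 original features

# Feature selection: keep a subset of the original variables (here,
# the 2 most associated with the class labels by an ANOVA F-test).
selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: build new lower-dimensional variables (here,
# 2 principal components capturing most of the variance).
extracted = PCA(n_components=2).fit_transform(X)

print(selected.shape, extracted.shape)  # (150, 2) (150, 2)
```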