Data Preprocessing and Transformation Methods

Summary

This document details various methods of data preprocessing and transformation. It covers categorical and numerical variables, common challenges in data preprocessing, and transformation techniques such as Box-Cox and standardization, and it explains why data transformation is crucial in data analysis and machine learning projects.

Full Transcript

GROUP 5

DATA TRANSFORMATION
- The process of converting, cleansing, and structuring data into a usable format that can be analyzed to support decision-making processes and to propel the growth of an organization.

DATA PREPROCESSING
- Data preprocessing is the crucial initial step in any data analysis or machine learning project. It involves cleaning, transforming, and organizing raw data into a suitable format for analysis.
- It refers to identifying incorrect, incomplete, or irrelevant parts of the data and then modifying, replacing, or deleting the dirty or coarse data. It ensures data quality, consistency, and efficiency in subsequent modeling and analysis tasks.

KEY STAGES OF DATA PREPROCESSING
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction

COMMON CHALLENGES IN DATA PREPROCESSING
Missing data: Dealing with incomplete information.
Data inconsistency: Handling conflicting or contradictory data.
Data quality issues: Addressing errors, outliers, and noise.
Data volume: Managing large datasets efficiently.

DATA TYPES
1. Categorical Variables
Nominal: Represent categories without any inherent order (e.g., gender, color, country).
Ordinal: Represent categories with a specific order (e.g., education level, satisfaction rating).
2. Numerical Variables
Continuous: Can take any value within a range (e.g., height, weight, temperature).
Discrete: Can only take specific values (e.g., number of children, count of items).

LEVELS OF MEASUREMENT
Nominal: Data can only be classified into categories (e.g., gender, marital status).
Ordinal: Data can be classified and ranked (e.g., education level, satisfaction rating).
Interval: Data can be classified, ranked, and differences between values are meaningful (e.g., temperature in Celsius or Fahrenheit).
Ratio: Data can be classified, ranked, differences are meaningful, and there is a true zero point (e.g., height, weight, income).

COMMON TRANSFORMATION TECHNIQUES

BOX-COX TRANSFORMATION
Used to stabilize variance and make the data more closely approximate a normal distribution, which is a common requirement for many statistical techniques.
Helps in fulfilling the assumptions of linear models by making the data more normal, which is crucial for accurate predictions.

STANDARDIZATION (Z-SCORE NORMALIZATION)
A statistical procedure used to make data points from different datasets comparable: a feature scaling method where the values are centered around the mean with a unit standard deviation.
Standardized data ensures that each feature contributes equally to the analysis.

LOG TRANSFORMATION
Particularly useful for compressing the range of data values.
Useful for datasets with exponential growth, financial data, or skewed data. It makes patterns in the data more discernible and improves the performance of statistical models.

LAGGING TIME SERIES
A technique used in time series analysis to shift data points in time, helping to capture relationships between current and past values.
Critical in time series forecasting, where understanding how past values affect future outcomes can significantly improve model accuracy. Lags help in building autoregressive models that rely on past data for predictions.
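A minimal Python sketch of the first three transformations above, assuming NumPy and SciPy are available; the right-skewed sample values are invented purely for illustration.

import numpy as np
from scipy import stats

# Hypothetical right-skewed data (Box-Cox requires strictly positive values).
x = np.array([1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 7.9, 15.3, 40.2])

# Box-Cox transformation: stabilizes variance and pulls the data toward
# normality; SciPy estimates the lambda parameter by maximum likelihood.
x_boxcox, lam = stats.boxcox(x)
print("Box-Cox lambda:", lam)

# Standardization (z-score normalization): center on the mean, unit std.
x_standardized = (x - x.mean()) / x.std()
print("mean ~ 0, std ~ 1:", x_standardized.mean(), x_standardized.std())

# Log transformation: compresses the range of skewed values; log1p, i.e.
# log(1 + x), is a common variant that also handles zeros safely.
x_log = np.log1p(x)
print("log-transformed:", np.round(x_log, 3))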
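Lagging can be sketched with pandas, whose shift method moves a series back in time so past values line up with the current row; the monthly sales figures below are made up for illustration.

import pandas as pd

# Hypothetical monthly sales series.
sales = pd.Series(
    [100, 110, 125, 130, 150, 165],
    index=pd.date_range("2024-01-01", periods=6, freq="MS"),
    name="sales",
)

# lag_1 holds last month's value, lag_2 the value from two months back.
# Autoregressive models use such lagged columns as predictors.
frame = pd.DataFrame({
    "sales": sales,
    "lag_1": sales.shift(1),
    "lag_2": sales.shift(2),
})
print(frame)  # the first rows contain NaN because no earlier values exist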
GROUP 6

MACHINE LEARNING
- An application of artificial intelligence that involves algorithms and data that automatically analyze and make decisions by themselves, without human intervention.
- It describes how computers perform tasks on their own based on previous experience.

BIAS AND VARIANCE TRADE-OFF
Bias: the amount by which machine learning (ML) model predictions differ from the actual value of the target.
Variance: the amount by which the ML prediction would change if we estimated it using different training datasets.

UNDERFITTING AND OVERFITTING
Underfitting: an ML model with high bias pays very little attention to the training dataset and leads to high error on both the training and testing datasets. High bias tends toward underfitting.
Overfitting: a model with high variance pays a lot of attention to the training dataset and does not generalize to unseen data.

CROSS VALIDATION
A technique used to train and evaluate our model on a portion of our dataset before repartitioning the dataset and evaluating it on the new portions.
We partition the dataset into TRAINING and TESTING data:
Training data: used by the model to learn.
Testing data: used by the model to predict unseen data; it is used to evaluate the model's performance.

TYPES OF CROSS VALIDATION
1. Validation Set Approach: used to evaluate and refine the performance of a model (e.g., via MSE) before it is deployed for real-world applications. The data is split into a training set and a validation (hold-out) set.
2. Leave-One-Out Cross Validation (LOOCV): used to evaluate the performance of a model when the dataset is limited.
3. K-Fold Cross Validation: the dataset is divided into k subsets (folds) of approximately equal size. The model is trained and evaluated k times, each time using a different fold as the test set and the remaining folds as the training set.

BOOTSTRAP
The process of taking repeated random samples (with replacement) of a dataset and estimating some parameter on each sample.
A flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method. For example, it can provide an estimate of the standard error of a coefficient or a confidence interval for that coefficient. In more complex data situations, figuring out the appropriate way to generate bootstrap samples can require some thought.
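A brief sketch of the three cross-validation flavors, assuming scikit-learn is installed; the synthetic regression data is generated only for illustration.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import (
    train_test_split, cross_val_score, KFold, LeaveOneOut,
)

X, y = make_regression(n_samples=60, n_features=4, noise=10.0, random_state=0)
model = LinearRegression()

# 1. Validation set approach: one train/hold-out split, scored here with R^2.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out R^2:", model.fit(X_train, y_train).score(X_val, y_val))

# 2. Leave-one-out: n folds of size 1, useful when data is scarce (but costly).
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
print("LOOCV mean MSE:", -loo_scores.mean())

# 3. K-fold: k equal folds, each used exactly once as the test set.
kf_scores = cross_val_score(model, X, y,
                            cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold mean R^2:", kf_scores.mean())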
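The bootstrap idea fits in a few lines of NumPy: resample with replacement many times and look at the spread of the re-estimated statistic. The sample and the choice of the mean as the statistic are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=100)  # hypothetical observed data

# Draw B bootstrap samples (with replacement) and re-estimate the mean on each.
B = 2000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(B)
])

# The std of the bootstrap estimates approximates the standard error of the
# sample mean; percentiles give a rough 95% confidence interval.
print("bootstrap SE:", boot_means.std(ddof=1))
print("95% CI:", np.percentile(boot_means, [2.5, 97.5]))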
GROUP 7

VARIABLE SELECTION
- The process of identifying and selecting the most relevant predictors (independent variables) to include in a statistical or machine learning model.

IMPORTANCE OF VARIABLE SELECTION
Improves Model Performance: by focusing on the most relevant variables, the model becomes more accurate and reliable.
Prevents Overfitting: reduces the risk of overfitting by eliminating irrelevant or redundant variables, ensuring the model generalizes better to new data.
Reduces Computational Cost: fewer variables mean less data to process, which can lead to faster computations, especially with large datasets.

Conditionality Issues: occur when the effect of a variable on the outcome depends on the presence or absence of other variables.
Multicollinearity: occurs when two or more independent variables in a regression are highly correlated with each other.

VARIABLE SELECTION METHODS (FEATURE SELECTION IN MACHINE LEARNING)
Techniques used in statistical modeling and machine learning to choose the most relevant variables (or features) from a larger set of variables to include in a model. (A code sketch of all three families appears after the GROUP 8 notes below.)

1. Filter Method
These methods use statistical techniques to evaluate the relevance of each variable independently of the model. The variables are ranked based on certain criteria, and a subset of the most relevant variables is selected.
Common Strategies:
1. ANOVA (Analysis of Variance)
2. Correlation Analysis
3. Information Gain

2. Wrapper Method
These methods wrap the machine learning algorithm and iteratively evaluate feature combinations to find the best subset for model performance.
Common Strategies:
1. Forward Selection
2. Backward Elimination
3. Recursive Feature Elimination (RFE)
4. Exhaustive Search

3. Embedded Methods
These methods perform variable selection during the process of model training and are specific to a particular learning algorithm.
Common Strategies:
1. Lasso (Least Absolute Shrinkage and Selection Operator) Regression (L1 Regularization)
2. Ridge Regression (L2 Regularization)

BACKWARD ELIMINATION
- The simplest of all variable selection methods.
- This method starts with a full model that considers all of the variables to be included in the model.

FORWARD SELECTION
- The reverse of the Backward Elimination method.
- This method starts with no variables in the model, then adds variables one by one.

STEPWISE SELECTION
- Combines forward selection and backward elimination.
- Variables are added and removed at each step based on their contribution to the model.

GROUP 8

NONLINEAR REGRESSION
- Refers to a broader category of regression models where the relationship between the dependent variable and the independent variables is not assumed to be linear.

TYPES OF NON-LINEAR REGRESSION
There are two main types of non-linear regression in machine learning:
Parametric non-linear regression assumes that the relationship between the dependent and independent variables can be modeled using a specific mathematical function.
Non-parametric non-linear regression does not assume that the relationship between the dependent and independent variables can be modeled using a specific mathematical function.

LOGARITHMIC REGRESSION is a type of nonlinear regression that fits a logarithmic function to the data.
EXPONENTIAL REGRESSION is a type of nonlinear regression that fits an exponential function to the data.
POWER REGRESSION is a type of nonlinear regression that fits a power function to the data.
POLYNOMIAL MODELS are a type of regression analysis where the relationship between the independent and dependent variables is modeled as a polynomial function.
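As promised in the GROUP 7 notes, here is a compact scikit-learn sketch with one representative from each family: an ANOVA-style univariate F-test as the filter (f_regression, since the toy target is continuous), RFE as the wrapper, and Lasso as the embedded method. The dataset and the "keep 3 features" choice are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression, Lasso

# Synthetic data: 8 features, only 3 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=1)

# Filter: score each feature independently with an F-test, keep the top 3.
filt = SelectKBest(score_func=f_regression, k=3).fit(X, y)
print("filter keeps:", np.flatnonzero(filt.get_support()))

# Wrapper: recursive feature elimination refits the model, dropping the
# weakest feature each round until 3 remain.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
print("RFE keeps:  ", np.flatnonzero(rfe.support_))

# Embedded: L1 regularization shrinks uninformative coefficients exactly to
# zero during training, so selection happens inside the fit itself.
lasso = Lasso(alpha=1.0).fit(X, y)
print("Lasso keeps:", np.flatnonzero(lasso.coef_ != 0))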
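A sketch of the GROUP 8 model families, using scipy.optimize.curve_fit for the parametric nonlinear forms and numpy.polyfit for the polynomial case. The toy data, starting guesses, and the choice to fit all three forms to the same series are assumptions made only to show the API.

import numpy as np
from scipy.optimize import curve_fit

x = np.linspace(1.0, 10.0, 50)
rng = np.random.default_rng(0)
y = 2.0 * np.exp(0.3 * x) + rng.normal(0.0, 1.0, x.size)  # exponential-ish toy data

# Parametric nonlinear forms: each assumes a specific mathematical function.
def exponential(x, a, b):
    return a * np.exp(b * x)

def power(x, a, b):
    return a * np.power(x, b)

def logarithmic(x, a, b):
    return a + b * np.log(x)

for name, f in [("exponential", exponential), ("power", power),
                ("logarithmic", logarithmic)]:
    params, _ = curve_fit(f, x, y, p0=(1.0, 0.5), maxfev=10000)
    print(name, "fit:", np.round(params, 3))

# Polynomial model: ordinary least-squares fit of a degree-3 polynomial.
coeffs = np.polyfit(x, y, deg=3)
print("cubic coefficients:", np.round(coeffs, 3))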
