Data Pre-processing and Transformation in Machine Learning

The workshop is led by Kelly Hong, Associate Director of Data Sciences at Gilead Sciences, who has a Ph.D. in Information Sciences with a focus on natural language processing and machine learning.
Kelly Hong introduces herself as a data scientist with expertise in data processing and transformation, having worked on various projects in machine learning and natural language processing.
The workshop's agenda includes introducing the topic, discussing learning objectives, and performing hands-on practice on data pre-processing and transformation using the WiDS Datathon 2024 dataset.
The workshop focuses on a fundamental topic in machine learning, which is often overlooked but crucial for successful machine learning model development.
Kelly Hong shares her background, having worked in various countries, including Vietnam, the UK, Singapore, and the US, and having expertise in computer science, information systems, business intelligence, and data science.
The learning objectives of the workshop include understanding data pre-processing and transformation, its importance in machine learning, and hands-on practice with the WiDS Datathon 2024 dataset.
Data pre-processing and transformation are critical steps in machine learning development, often taking up to 70% of the project time, because machine learning algorithms depend on high-quality input data.
The machine learning pipeline includes data cleaning, feature engineering, training, testing, and deployment, with data pre-processing and transformation being a crucial part of the process.
Data pre-processing contains four basic tasks: data cleaning, data integration, data transformation, and data reduction.
Data cleaning removes incorrect, incomplete, or inaccurate data from the dataset, while data integration combines the original dataset with external data to enrich it.
Data transformation changes the original format of the data to a format suitable for machine learning models, such as converting text data to numerical representations.
Data reduction reduces the volume of the data to remove unnecessary information and improve model performance.
Real-world data in medicine is often derived from different sources, including electronic health records, insurance claims, clinical registries, and patient surveys, making it messy and prone to quality issues.
The WiDS Datathon 2024 dataset is a real-world dataset provided by Gilead Sciences, containing patient demographics, diagnosis, treatment, and insurance information for breast cancer patients.
The dataset is enriched with third-party demographic data to provide insights into socioeconomic factors that contribute to health equity.
The challenge task is to develop a machine learning model to predict the duration of time it takes for patients to receive treatment for their diagnosis.
The hands-on practice will be performed on Google Colab using Python 3, with the data file already uploaded to the platform.
The workshop will demonstrate basic data pre-processing steps and data transformation techniques using popular libraries such as NumPy, pandas, and scikit-learn.
The first step is to import the necessary libraries and load the dataset into a pandas dataframe for initial analysis.
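A minimal sketch of this first step follows. The summary does not give the file name or the exact column names, so the inline CSV text and the column names below (patient_age, patient_race, payer_type, bmi) are illustrative assumptions rather than the workshop's actual data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the uploaded CSV file; the real file name and
# column names are not stated in the summary, so these are assumptions.
csv_text = """patient_id,patient_age,patient_race,payer_type,bmi
1,54,White,COMMERCIAL,27.5
2,67,,MEDICAID,
3,45,Black,,31.2
4,72,Asian,MEDICARE ADVANTAGE,
"""

# In Colab this would typically be pd.read_csv("<path to the uploaded file>").
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few records for a quick look
```

Empty fields in the CSV are read as missing values (NaN), which is what the later missing-value checks rely on.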
The initial analysis involves understanding the dataset's size, columns, and features to decide on the necessary data processing steps and transformations.
info() is a pandas DataFrame method that prints every column name, its data type, and its count of non-null values.
Initial analysis helps assess data quality and identify columns needing attention during pre-processing.
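The initial inspection could look roughly like the sketch below. The small DataFrame and its column names are made-up placeholders, and splitting columns into numerical and categorical by dtype is one common approach, not necessarily the one used in the workshop.

```python
import numpy as np
import pandas as pd

# Toy data with assumed column names, standing in for the real dataset.
df = pd.DataFrame({
    "patient_age": [54, 67, 45, 72],
    "patient_race": ["White", None, "Black", "Asian"],
    "payer_type": ["COMMERCIAL", "MEDICAID", None, "MEDICARE ADVANTAGE"],
    "bmi": [27.5, np.nan, 31.2, np.nan],
})

# Prints each column's name, dtype, and non-null count.
df.info()

# Separate columns by type, since each type needs different pre-processing.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```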
Issues such as missing data in columns like Race and BMI are flagged for further examination.
The dataset contains categorical and numerical data but no free-text data.
Different data types require different pre-processing and transformation techniques.
Spotting missing values is essential for data analysis; visualization can aid in identifying columns with high missing ratios.
A helper function plots, as a bar chart, the columns whose missing-value ratio exceeds 10%.
The top five columns by missing-value ratio, all above 10%, are identified; they include Pay Type, Patient Race, and BMI.
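One way such a helper might be written is sketched below. The function names missing_ratios and plot_missing_ratios, the toy data, and the column names are all hypothetical; only the 10% threshold comes from the workshop.

```python
import numpy as np
import pandas as pd


def missing_ratios(df: pd.DataFrame, threshold: float = 0.10) -> pd.Series:
    """Return the share of missing values per column, keeping only columns
    above `threshold`, sorted from most to least missing."""
    ratios = df.isna().mean()
    return ratios[ratios > threshold].sort_values(ascending=False)


def plot_missing_ratios(ratios: pd.Series):
    """Draw the ratios as a bar chart (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    ax = ratios.plot(kind="bar")
    ax.set_ylabel("Missing ratio")
    ax.set_title("Columns with more than 10% missing values")
    plt.tight_layout()
    return ax


# Toy data with assumed column names.
df = pd.DataFrame({
    "patient_age": [54, 67, 45, 72, 60, 38, 81, 49, 55, 63],
    "patient_race": ["White", None, "Black", None, None,
                     "Asian", None, "White", None, "Black"],
    "payer_type": ["COMMERCIAL", "MEDICAID", None, "COMMERCIAL",
                   "MEDICARE ADVANTAGE", "MEDICAID", "COMMERCIAL", None,
                   "MEDICAID", "COMMERCIAL"],
    "bmi": [27.5, np.nan, np.nan, np.nan, 31.2,
            np.nan, np.nan, np.nan, 24.0, np.nan],
})

high_missing = missing_ratios(df)
print(high_missing)
# In a notebook: plot_missing_ratios(high_missing)
```

Keeping the calculation separate from the plotting makes the ratios reusable later, for example when deciding which columns to drop.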
Understanding column values, such as Pay Type, through visualization helps gain insights.
Categorical columns like Pay Type can be visualized through bar charts to understand category distributions.
Numerical columns like BMI can have their statistical summaries printed to understand data distribution.
The BMI column's statistical summary shows mean, max, and count values for analysis.
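The two kinds of column summaries described here can be sketched as follows, using invented values and assumed column names (payer_type, bmi).

```python
import numpy as np
import pandas as pd

# Toy data with assumed column names.
df = pd.DataFrame({
    "payer_type": ["COMMERCIAL", "MEDICAID", "COMMERCIAL", None,
                   "MEDICARE ADVANTAGE", "COMMERCIAL"],
    "bmi": [27.5, np.nan, 31.2, 22.0, np.nan, 40.3],
})

# Categorical column: count how often each category occurs (NaN included).
payer_counts = df["payer_type"].value_counts(dropna=False)
print(payer_counts)
# In a notebook, payer_counts.plot(kind="bar") would draw the bar chart.

# Numerical column: count, mean, std, min, quartiles, and max.
bmi_summary = df["bmi"].describe()
print(bmi_summary)
```

Note that describe() reports the count of non-missing values only, so a count well below the number of rows is itself a sign of missing data.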
Correlation between different columns, such as Race and BMI, can be analyzed to determine feature relevance.
Correlation between columns like Patient Race and Pay Type can also be plotted for insights.
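The summary does not say how these relationships were computed. As an assumption, the sketch below uses per-group means for a categorical versus numerical pair and a cross-tabulation for two categorical columns, since a plain correlation coefficient does not apply directly to unordered categories. Data and column names are invented.

```python
import pandas as pd

# Toy data with assumed column names.
df = pd.DataFrame({
    "patient_race": ["White", "Black", "White", "Asian", "Black", "White"],
    "payer_type": ["COMMERCIAL", "MEDICAID", "COMMERCIAL", "COMMERCIAL",
                   "MEDICAID", "MEDICARE ADVANTAGE"],
    "bmi": [27.5, 33.0, 25.5, 22.0, 31.0, 28.0],
})

# Categorical vs numerical: compare the mean BMI across race groups.
bmi_by_race = df.groupby("patient_race")["bmi"].mean()
print(bmi_by_race)

# Categorical vs categorical: count how often each pair of values co-occurs.
race_by_payer = pd.crosstab(df["patient_race"], df["payer_type"])
print(race_by_payer)
# In a notebook, race_by_payer.plot(kind="bar", stacked=True) would plot it.
```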
Initial data analysis provides a high-level overview, guiding further data exploration and cleaning steps.
Fundamental data cleaning steps include handling missing values and filtering outliers.
Missing values can be filled with appropriate values or handled by removing the columns with high missing ratios.
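Both options can be sketched as below. The 50% drop threshold, the choice of median for numerical columns and most frequent value for categorical ones, and all column names are assumptions made for illustration; the right choices depend on the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with assumed column names; bmi is 80% missing.
df = pd.DataFrame({
    "patient_age": [54, 67, 45, 72, 60],
    "payer_type": ["COMMERCIAL", None, "COMMERCIAL", "MEDICAID", "COMMERCIAL"],
    "bmi": [27.5, np.nan, np.nan, np.nan, np.nan],
    "median_income": [50.0, 62.0, np.nan, 48.0, 55.0],
})

DROP_THRESHOLD = 0.5  # assumed cut-off, not taken from the workshop

# Step 1: drop columns whose missing ratio is above the threshold.
missing_ratio = df.isna().mean()
cols_to_drop = missing_ratio[missing_ratio > DROP_THRESHOLD].index.tolist()
cleaned = df.drop(columns=cols_to_drop)

# Step 2: fill what is left, by column type.
for col in cleaned.columns:
    if not cleaned[col].isna().any():
        continue
    if pd.api.types.is_numeric_dtype(cleaned[col]):
        # Numerical: the median is less sensitive to outliers than the mean.
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    else:
        # Categorical: use the most frequent category.
        cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

print(cleaned)
```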
Outliers, such as extreme BMI values, can be removed or replaced with the median, depending on the scenario.
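A sketch of both outlier strategies follows. The plausible BMI range of 10 to 70 is an assumption chosen for the example, not a value given in the workshop, and the data is invented.

```python
import pandas as pd

# Toy BMI values; 97.0 and 3.0 are treated as implausible.
df = pd.DataFrame({"bmi": [22.0, 27.5, 31.2, 24.8, 97.0, 29.1, 3.0]})

BMI_MIN, BMI_MAX = 10.0, 70.0  # assumed plausible range

# Flag values outside the range; missing values are not counted as outliers.
is_outlier = df["bmi"].notna() & ~df["bmi"].between(BMI_MIN, BMI_MAX)

# Option 1: remove the rows that contain outliers.
filtered = df[~is_outlier].copy()

# Option 2: keep the rows, but replace outliers with the in-range median.
median_bmi = df.loc[~is_outlier, "bmi"].median()
imputed = df.copy()
imputed.loc[is_outlier, "bmi"] = median_bmi

print(filtered)
print(imputed)
```

Removing rows loses the other columns' information for those patients, while imputing keeps the rows at the cost of introducing an estimated value.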
Data transformation techniques like Label Encoding and One-Hot Encoding are used to convert categorical data into numerical formats.
Label Encoding assigns an integer to each category, while One-Hot Encoding converts each category into its own binary column, which avoids implying an artificial order among the categories.
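The two encodings can be compared side by side in the sketch below, which assumes a payer_type column with invented values. It uses scikit-learn's LabelEncoder and pandas' get_dummies; the workshop may have used different calls.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data with an assumed column name.
df = pd.DataFrame({
    "payer_type": ["COMMERCIAL", "MEDICAID", "COMMERCIAL", "MEDICARE ADVANTAGE"],
})

# Label encoding: one integer per category, assigned in sorted order.
encoder = LabelEncoder()
df["payer_type_label"] = encoder.fit_transform(df["payer_type"])
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["payer_type"], prefix="payer_type", dtype=int)
print(one_hot)
```

With label encoding a model may read MEDICARE ADVANTAGE (2) as "greater than" COMMERCIAL (0); one-hot encoding avoids that at the cost of adding one column per category.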
Numeric data like patient age can be bucketed into categories for better understanding and analysis.
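Bucketing can be sketched with pandas' cut function. The bin edges and labels below are arbitrary choices for illustration, and patient_age is an assumed column name.

```python
import pandas as pd

# Toy ages with an assumed column name.
df = pd.DataFrame({"patient_age": [18, 34, 45, 59, 60, 72, 91]})

# Assumed bucket boundaries; each bin includes its left edge only.
AGE_BINS = [0, 40, 60, 80, 120]
AGE_LABELS = ["<40", "40-59", "60-79", "80+"]

df["age_group"] = pd.cut(
    df["patient_age"], bins=AGE_BINS, labels=AGE_LABELS, right=False
)
print(df)
```

Setting right=False makes the intervals left-closed, so an age of exactly 60 lands in "60-79" rather than "40-59".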
Scaling techniques like Min-Max Scaling can be applied to numeric data to rescale values into a fixed range, typically 0 to 1, so that features are on a comparable scale.
When working with large datasets, code and visualizations need to be optimized so that the analysis stays efficient.
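Min-Max scaling maps each value x to (x - min) / (max - min) within its column. A minimal sketch with scikit-learn's MinMaxScaler, using invented values and assumed column names:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data with assumed column names.
df = pd.DataFrame({
    "patient_age": [20, 40, 60, 80],
    "bmi": [20.0, 25.0, 30.0, 40.0],
})

# Default feature_range is (0, 1); each column is scaled independently.
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df), columns=df.columns, index=df.index
)
print(scaled)
```

In a real project the scaler would usually be fitted on the training split only and then applied to the test split, so that test data does not influence the scaling.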
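The summary does not say which optimizations were shown. As one possible example, and purely as an assumption, the sketch below reduces a DataFrame's memory footprint by storing a repetitive string column as a categorical and downcasting an integer column. Data and column names are invented.

```python
import pandas as pd

# Toy data with assumed column names, repeated to make the effect visible.
df = pd.DataFrame({
    "payer_type": ["COMMERCIAL", "MEDICAID", "COMMERCIAL"] * 1000,
    "patient_age": [54, 67, 45] * 1000,
})

before = df.memory_usage(deep=True).sum()

optimized = df.copy()
# Few distinct strings repeated many times: store them as a categorical.
optimized["payer_type"] = optimized["payer_type"].astype("category")
# Small integers: use the smallest integer type that can hold the values.
optimized["patient_age"] = pd.to_numeric(
    optimized["patient_age"], downcast="integer"
)

after = optimized.memory_usage(deep=True).sum()
print(f"memory before: {before} bytes, after: {after} bytes")
```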