Machine Learning Data Preprocessing

Study Notes

Data preprocessing transforms raw data into a format suitable for machine learning models.
Purpose: Better data representations lead to enhanced model performance.

Tabular Data: Common in classification, regression, clustering (e.g., user demographics).
Time Series Data: Used for forecasting and anomaly detection.
Textual Data: Suitable for sentiment analysis and natural language processing.
Graph Data: Used in social network analysis and predictive modeling.
Image/Video Data: Applied in computer vision tasks like object detection.
Geospatial Data: Facilitates location and spatial analyses.

Raw data is often messy; preprocessing is crucial for model effectiveness.
Key techniques include scaling, encoding, feature engineering, and handling missing data.

Read Datasets: Use functions like read.csv() for importing data from CSV files.
Visualization: Adequate data visualization (e.g., histograms, boxplots) is essential for understanding data distribution and outliers.
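As a rough illustration of the read-then-inspect step, the sketch below uses Python's standard `csv` module in place of `read.csv()`. The column names and values are hypothetical, and the text is parsed from a string so the example stays self-contained. A numeric summary like this is only a stand-in for a histogram or boxplot.

```python
import csv
import io
import statistics

# Hypothetical CSV content; in practice this would come from open("data.csv").
RAW = """age,income
23,41000
35,52000
41,
29,48000
"""

def read_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))

def numeric_column(rows, name):
    """Return the non-empty values of a column as floats."""
    return [float(r[name]) for r in rows if r[name] != ""]

rows = read_rows(RAW)
incomes = numeric_column(rows, "income")

# A quick summary of the distribution before any plotting.
summary = {
    "n": len(incomes),
    "min": min(incomes),
    "median": statistics.median(incomes),
    "max": max(incomes),
}
```

The empty `income` cell is skipped here rather than filled in, which makes the missing value visible in the count.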

Invalid Observations: Treat as missing values; can occur due to measurement errors.
Valid Observations: Genuine but extreme values can be handled by truncation (capping at a chosen limit) or identified with Z-scores.
Model Sensitivity: Nonparametric models are generally less sensitive to outliers than parametric models.
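The two approaches named above can be sketched as follows. This is a minimal example on made-up values; the Z-score threshold of 2.0 and the capping limits are assumptions chosen for illustration, not recommended defaults.

```python
import statistics

def z_scores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def truncate(values, lower, upper):
    """Cap each value so it falls inside [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical measurements with one extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 95]

# Flag points whose |z| exceeds an assumed threshold of 2.0.
flags = [abs(z) > 2.0 for z in z_scores(values)]

# Cap values to an assumed plausible range of 5 to 20.
capped = truncate(values, 5, 20)
```

Note that in small samples a single extreme value inflates the standard deviation, so Z-scores can understate how unusual it is.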

Regular expressions (regex) are powerful for pattern matching and string manipulation, aiding in data cleansing and extraction.
Character Classes: Match specific sets of characters or ranges in strings.
Shorthand Classes:
- \d: Digit
- \w: Word character
- \s: Whitespace
Grouping and Capturing: Extract specific parts of matched strings for further analysis.
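A short sketch of these regex features using Python's `re` module. The date pattern and the sample strings are hypothetical and only meant to show character classes, shorthand classes, and capture groups together.

```python
import re

# Three capture groups: year, month, day (assumes dates like 2024-03-15).
DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def extract_dates(text):
    """Return each matched date as a (year, month, day) tuple of ints."""
    return [tuple(int(part) for part in m.groups()) for m in DATE.finditer(text)]

def normalize_whitespace(text):
    """Collapse any run of whitespace (\\s) into a single space."""
    return re.sub(r"\s+", " ", text).strip()

def words(text):
    """Return runs of word characters (\\w: letters, digits, underscore)."""
    return re.findall(r"\w+", text)

def vowels(text):
    """Use an explicit character class to match a specific set of characters."""
    return re.findall(r"[aeiou]", text)
```

For example, `extract_dates("Shipped 2024-03-15")` yields one tuple whose parts come from the three groups.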

Feature engineering focuses on creating meaningful features that enhance predictive performance while reducing computational complexity.
Techniques include categorical encoding, scaling, and generating interaction terms.

Ordinal Features: Integer encoding assigns each unique value an integer that respects the order of the categories.
Nominal Features: Dummy encoding creates binary indicators for each category.
Binary Features: Treated as integer indicators (0 or 1).
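The encodings above can be sketched in plain Python. The function names and category values are hypothetical. The dummy encoder here emits one indicator per category, as described in the notes; dropping one column to avoid redundancy is a common variant not shown.

```python
def integer_encode(values, order):
    """Map ordinal values to integers following the given category order."""
    mapping = {level: i for i, level in enumerate(order)}
    return [mapping[v] for v in values]

def dummy_encode(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Ordinal: the order must be supplied, it cannot be inferred from the data.
sizes = integer_encode(["low", "high", "medium"], order=["low", "medium", "high"])

# Nominal: no order is implied, so each category gets its own indicator.
categories, indicators = dummy_encode(["red", "blue", "red"])
```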

Log Transform: Reduces skewness, particularly useful for right-skewed distributions.
Power Transform: Raises values to a chosen power (e.g., a square root) to adjust the shape of the distribution.
Feature Scaling: Ensures numerical features are on similar scales through normalization or standardization.
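A minimal sketch of these transforms, assuming non-negative inputs for the log and power versions. Using `log1p` (the log of 1 + x) is a choice made here so that zeros are handled; the notes do not specify a variant.

```python
import math
import statistics

def log_transform(values):
    """Apply log(1 + x); assumes every value is >= 0."""
    return [math.log1p(v) for v in values]

def power_transform(values, power):
    """Raise each value to the given power, e.g. 0.5 for a square root."""
    return [v ** power for v in values]

def normalize(values):
    """Min-max normalization: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: shift to mean 0 and rescale to unit standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]
```

Normalization is bounded but sensitive to extreme values, while standardization is unbounded and keeps relative distances from the mean.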

Significant imbalances in class distribution can degrade model performance.
Undersampling: Reduces majority class examples to balance the dataset.
Oversampling: Increases minority class examples to create balance, often done with replacement.
Class balancing techniques should be applied only to training data, keeping test data intact for valid evaluation.
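Random undersampling and oversampling can be sketched as below. The training set is made up (90 rows of class 0, 10 rows of class 1), the helper names are hypothetical, and the resampling is applied to training rows only, in line with the note above.

```python
import random

def split_by_label(rows):
    """Group (features, label) pairs by their label."""
    groups = {}
    for features, label in rows:
        groups.setdefault(label, []).append((features, label))
    return groups

def undersample(majority, minority, rng):
    """Keep a random majority subset the same size as the minority class."""
    return rng.sample(majority, len(minority)) + minority

def oversample(majority, minority, rng):
    """Draw extra minority rows with replacement until the classes match."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

rng = random.Random(42)  # fixed seed, an arbitrary choice for repeatability

# Hypothetical training rows only; test rows are never resampled.
train = [((i,), 0) for i in range(90)] + [((i,), 1) for i in range(10)]
groups = split_by_label(train)

balanced_down = undersample(groups[0], groups[1], rng)
balanced_up = oversample(groups[0], groups[1], rng)
```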

Data preprocessing is essential for building robust machine learning models, involving various techniques tailored to the data's characteristics and the intended model use.

### Data Processing Techniques
Undersampling helps to address class imbalance issues by reducing the number of instances from the majority class.
SMOTE (Synthetic Minority Oversampling Technique) is an advanced method that generates synthetic examples for the minority class to improve model training.
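The core idea behind SMOTE, interpolating between a minority point and one of its nearest minority neighbors, can be sketched as follows. This is a simplified illustration and not the reference algorithm or any library's implementation; the function name, `k=3`, the seed, and the sample points are all assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points by Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical 2-D minority class points.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]
new_points = smote_sketch(minority, n_new=5)
```

Because each synthetic point lies on a segment between two real minority points, it stays within the region the minority class already occupies.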

Accuracy is not a reliable metric for evaluating model performance on imbalanced datasets.
Recommended metrics include AUC (Area Under the ROC Curve) and precision-recall curves for a more nuanced performance assessment.
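To see why accuracy misleads, consider a made-up dataset with 95 negatives and 5 positives and a model that always predicts the negative class. The sketch below computes accuracy, precision, recall, and a pairwise form of AUC. Returning 0.0 when a denominator is zero is a convention assumed here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100

acc = accuracy(y_true, always_negative)
prec, rec = precision_recall(y_true, always_negative)
```

The always-negative model scores high on accuracy while finding none of the positives, which is exactly what recall exposes.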

Raw data usually needs preprocessing to create an effective representation for machine learning models.
Many machine learning algorithms require numerical encoding of categorical variables to function correctly.
Scaling of features is essential for distance-based algorithms like k-Nearest Neighbors (kNN) and Support Vector Machines (SVM), and it also helps the training of neural networks.
Tune preprocessing steps by holding out a portion of the training set as a validation set and comparing model performance on it.
Be cautious of data leakage; preprocessing techniques must not be influenced by the test data.
Imbalanced datasets necessitate additional strategies and care in model development to ensure meaningful results.
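The validation split and the leakage warning can be illustrated together. In this sketch the scaling parameters are estimated from the training portion only and then reused unchanged on the validation and test portions. The split sizes, seeds, and helper names are assumptions for the example.

```python
import random
import statistics

def split(rows, holdout_fraction, seed):
    """Shuffle a copy of rows and split off a holdout portion."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * holdout_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_standardizer(values):
    """Estimate the mean and standard deviation from training values only."""
    return statistics.fmean(values), statistics.pstdev(values)

def apply_standardizer(values, mean, sd):
    """Scale values with parameters that were fitted elsewhere."""
    return [(v - mean) / sd for v in values]

# Hypothetical single numeric feature.
data = [float(v) for v in range(1, 101)]

train_full, test = split(data, 0.2, seed=0)       # test set is set aside first
train, valid = split(train_full, 0.25, seed=1)    # validation comes from training

mean, sd = fit_standardizer(train)                # fitted on training data only
train_s = apply_standardizer(train, mean, sd)
valid_s = apply_standardizer(valid, mean, sd)
test_s = apply_standardizer(test, mean, sd)       # test data never influences mean or sd
```

Fitting the scaler on the full dataset before splitting would let test-set statistics leak into training, which is the situation the notes warn against.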