Questions and Answers
What are two common methods of numerical imputation?
Assigning zero or using the median of the column.
How does target transformation improve model fit?
It adjusts skewed distributions to make residuals closer to a normal distribution.
What is the primary goal of feature engineering?
To extract valuable features and remove irrelevant or noisy ones.
What potential issue can arise from applying feature transformations incorrectly?
Applying transformations incorrectly can introduce data leakage, where information that should not be available at training time influences the model.
What is one reason why outliers may occur in a dataset?
Outliers may stem from measurement or data-entry errors, or they may be genuine rare events (such as fraud) that violate the typical data-generation mechanism.
Discuss the importance of splitting data into training and testing sets in machine learning.
Training data lets the model learn relationships between features and the target, while the held-out test set provides an unbiased estimate of performance on unseen data.
What effect does normalizing features have on the performance of a kNN classifier?
Because kNN measures distances between data points, normalization keeps features with large ranges from dominating the distance calculation.
How can data leakage affect the evaluation of a machine learning model?
Leakage lets information from the test set (or from the future) influence training, producing overly optimistic evaluation scores that do not reflect real-world performance.
In the context of model fitting, what role does hyperparameter tuning play?
It searches over settings that are not learned from the data, such as the number of neighbors k in kNN, to find the configuration with the best validation performance.
What is an observable trend regarding fraudulent transactions in the provided data analysis?
Fraudulent transactions are typically of much higher value than non-fraudulent ones.
Study Notes
Feature Engineering Cycle
- Powerful feature transformations can introduce leakage if applied incorrectly
- Requires domain knowledge about feature interactions
- Time-consuming and involves numerous experiments
Why Feature Engineering Matters
- Extracts new features and removes irrelevant or noisy features
- Results in simpler models with better performance
Key Elements of Feature Engineering
- Target Transformation: Modifies the response (target) variable to improve model fit (see the sketch after this list)
- Improves model fit when variable shows a skewed distribution
- Transforms data towards a normal distribution
- Examples: log(x), log(x+1), sqrt(x), sqrt(x+1)
- Feature Extraction: Derives new features from existing ones
- Imputation
- Outlier Detection
- Log Transformation
- Grouping
- Splitting
- Scaling
- Feature Encoding: Transforms categorical features into numeric features
- Provides more fine-grained information
- Captures non-linear relationships and interactions between feature values
- Essential for machine learning algorithms that require numeric input
- Methods: Label Encoding, One Hot Encoding
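A minimal sketch of target transformation, assuming a hypothetical right-skewed target such as sale prices:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed target: sale prices (all positive)
y = pd.Series([120_000, 95_000, 110_000, 350_000, 1_200_000])

y_log = np.log1p(y)  # log(x + 1): safe even when the target contains zeros

# After modeling on the transformed target, invert the predictions:
y_back = np.expm1(y_log)  # inverse of log1p
```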
Imputation
- Addresses the common problem of missing values in data
- Reasons for missing values: human errors, interruptions, privacy concerns
- Solutions:
- Drop the row/column: Simple but can lead to data loss
- Imputation: Preferred option for filling in missing values
- Numerical Imputation: Assigning zero, NA, default values, or medians
- Categorical Imputation: Using the most frequent value or creating an "Other" category
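A short pandas sketch of both imputation styles (the age and city columns are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 42, 31],
    "city": ["Oslo", None, "Oslo", "Bergen"],
})

# Numerical imputation: fill with the column median (robust to outliers)
df["age"] = df["age"].fillna(df["age"].median())

# Categorical imputation: use the most frequent value (the mode)...
df["city"] = df["city"].fillna(df["city"].mode()[0])
# ...or create an explicit "Other" category instead:
# df["city"] = df["city"].fillna("Other")
```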
Outlier Detection
- Outliers deviate significantly from the normal data distribution
- Distinguish outliers from noise data:
- Noise is random error or variance
- Noise should be removed before outlier detection
- Outliers are interesting as they violate the typical data generation mechanism
- Potential applications: fraud detection, customer segmentation, medical analysis
Types of Outliers
- Global Outlier (or Point Anomaly): Deviates significantly from the entire dataset
- Contextual Outlier (or Conditional Outlier): Deviates significantly based on a specific context
- Collective Outliers: A subset of data objects that collectively deviates significantly, even if the individual objects are not outliers on their own
Finding Outliers
- Methods for outlier discovery:
- Visualization Tools: Box plots, Scatter plots
- Statistical methodologies: Z-score, IQR score
- Advanced techniques for anomaly detection are discussed later
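A sketch of the two statistical rules, using the conventional cut-offs (|z| > 3 and 1.5 × IQR; both are conventions, not fixed laws):

```python
import numpy as np

# 20 inliers around 11.5 plus one extreme value
x = np.array([10, 11, 12, 13] * 5 + [95])

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_outliers = x[np.abs(z) > 3]                                  # -> [95]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]  # -> [95]
```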
Box Plot
- Graphical method to display groups of numerical data using quartiles
- Extends lines (whiskers) from the boxes to indicate variability outside quartiles
- Outliers are plotted as individual points outside the box
Scatter Plot
- Displays relationships between two variables using Cartesian coordinates
- Plots data points based on values of two variables on the horizontal and vertical axes
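Both plots in a quick matplotlib sketch (the feature names are invented to echo the fraud example later in these notes):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
amount = np.append(rng.normal(50, 10, 200), [150, 160])  # two high outliers
distance = rng.normal(5, 2, 202)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.boxplot(amount)            # outliers appear as individual points
ax1.set_title("Transaction amount")
ax2.scatter(distance, amount)  # relationship between two variables
ax2.set_xlabel("Distance from home")
ax2.set_ylabel("Amount")
plt.show()
```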
Log Transformation
- Used to handle skewed data
- Transforms data to approximate a normal distribution
- Reduces the effect of outliers to make the model more robust
- Data must have only positive values for log transformation
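A minimal sketch; log1p computes log(x + 1), which also tolerates zeros:

```python
import numpy as np

x = np.array([1, 2, 3, 5, 8, 1000], dtype=float)  # right-skewed

# log(x + 1); plain np.log(x) requires strictly positive values
x_log = np.log1p(x)
```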
Grouping
- Aggregating data into meaningful groups
- Applies to datasets with different levels of granularity (e.g., transactions)
- Key point is defining aggregation functions for different features
- Aggregating categorical columns:
- Highest frequency: select the most frequent label (the mode)
- Pivot table: Merging features into aggregated, more informative features
- Numerical columns are usually grouped using sum and mean
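A sketch with a hypothetical transactions table, using sum/mean for numeric columns, the mode for a categorical column, and a pivot table:

```python
import pandas as pd

tx = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "amount":   [10.0, 20.0, 5.0, 7.0, 9.0],
    "channel":  ["web", "web", "app", "web", "app"],
})

agg = tx.groupby("customer").agg(
    total_amount=("amount", "sum"),
    mean_amount=("amount", "mean"),
    top_channel=("channel", lambda s: s.mode()[0]),  # most frequent label
)

# Pivot table: one aggregated amount column per channel value
pivot = tx.pivot_table(index="customer", columns="channel",
                       values="amount", aggfunc="sum", fill_value=0)
```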
Splitting
- Processes string columns that violate tidy data principles
- Splits features based on specific criteria (e.g., words in text)
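For example, splitting a full-name column into tidy components (the column is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada Lovelace", "Alan Turing"]})

# Split on the first space into separate first/last name features
df[["first_name", "last_name"]] = df["name"].str.split(" ", n=1, expand=True)
```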
Scaling
- Addresses the issue of numerical features with different ranges
- Essential for algorithms that work based on distance (e.g., k-NN, k-Means)
- Methods:
- Normalization
- Standardization
Normalization (Min-Max normalization)
- Scales values to a fixed range between 0 and 1
- Does not change the distribution of the feature
- Amplifies the effect of outliers: extreme values compress the remaining data into a narrow part of the [0, 1] range
Standardization (Z-Score normalization)
- Scales values while considering standard deviation
- Reduces the effect of outliers compared to min-max normalization, since values are not squeezed into a fixed range
- Ensures features with different standard deviations have similar ranges
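Both methods in a scikit-learn sketch (in practice, fit scalers on the training data only and then transform the test data, to avoid leakage):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 5000.0]])

# Min-max normalization: each column rescaled to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Z-score standardization: each column rescaled to mean 0, std 1
X_std = StandardScaler().fit_transform(X)
```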
Feature Encoding
- Converts categorical features into numeric features
- Enhances the quality and interpretability of categorical information
- Provides more fine-grained information for machine learning models
- Methods:
- Label Encoding: Assigns an ordered integer to each category
- One Hot Encoding: Creates binary features representing each category
One Hot Encoding
- Creates multiple flag columns representing categories
- Assigns 0 or 1 values to indicate the presence or absence of a category
- Useful for algorithms like K-means, Linear Regression, Neural Networks
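Both encodings, sketched with pandas and scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: one integer per category (implies an ordering)
df["color_label"] = LabelEncoder().fit_transform(df["color"])

# One hot encoding: one 0/1 flag column per category
df_onehot = pd.get_dummies(df["color"], prefix="color")
```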
Data Visualization & Feature Engineering
- Fraudulent transactions are typically of much higher value than non-fraudulent transactions.
- Non-fraudulent transactions are typically closer to home, but with several outliers.
Data Normalization & Model Training
- Splitting data into training and testing sets is crucial for machine learning models.
- Training data teaches the model to learn relationships between features and target variables.
- Test data evaluates the model's performance by comparing predictions to actual values.
- Normalizing features is important for kNN algorithms as they measure distances between data points.
- The code uses five splits, meaning the data is divided into five equal groups. Four groups are used for training and one for testing.
- Accuracy scores are calculated for each split and averaged, giving a more reliable estimate of model performance.
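A hedged reconstruction of the described workflow; the fraud dataset itself is not shown, so synthetic placeholder data stands in. Putting the scaler and classifier in a Pipeline keeps normalization inside each fold, which avoids leakage:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Placeholder features/labels standing in for the transaction data
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))

# 5 splits: four folds train, one fold tests, rotated five times
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```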
Hyperparameter Tuning & Evaluation Metrics
- The k value (number of nearest neighbors) is a key hyperparameter.
- The code calculates accuracy scores for a range of k values from 1 to 30.
- The results show that k values between 9 and 13 have an accuracy score of around 95%.
- Choosing a smaller k value is generally advisable as it relies on closer data points.
- Other evaluation metrics can be considered, such as precision, recall, and F1-score.
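The k sweep described above might look like this (same placeholder data as in the previous sketch):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Cross-validated accuracy for each candidate k from 1 to 30
mean_scores = [
    cross_val_score(
        make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k)),
        X, y, cv=5,
    ).mean()
    for k in range(1, 31)
]

best_k = int(np.argmax(mean_scores)) + 1  # +1 because k starts at 1
```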
Introduction to kNN Classification
- kNN is a voting system where the majority class label among the nearest k neighbors determines the class label of a new data point.
- The number of nearest neighbors (k) is a crucial factor in determining the model's accuracy and robustness.
- kNN can be used for both classification and regression tasks.
- Distance metrics used for determining proximity between points are important for kNN performance.
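To make the voting idea concrete, a minimal from-scratch classifier (Euclidean distance, majority vote; a teaching sketch, not an optimized implementation):

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5, 6])))  # -> 1
```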
Description
This quiz explores the essential concepts of the feature engineering cycle, focusing on how powerful transformations can both improve model performance and introduce risks if misapplied. Gain insights into target transformation, feature extraction, and feature encoding, which are crucial for building effective machine learning models.