Questions and Answers
What is the main measure of impurity in a dataset used in decision trees?
Among the decision tree algorithms mentioned, which one uses information gain as its splitting criterion?
What is the main advantage of pruning decision trees?
What is one of the main drawbacks of using decision trees for predictive modeling?
Which of the following features would be most appropriate as a decision node in a decision tree for predicting whether a person is likely to catch a cold?
Flashcards
Decision tree
Supervised learning algorithm for classification and regression.
Entropy
Measure of the impurity of a dataset; higher means more uncertainty.
Information gain
Reduction in entropy when splitting the data on an attribute.
Pruning
Removing branches from a decision tree to reduce its complexity and prevent overfitting.
Gini impurity
Measure of impurity: the probability that two randomly selected items from a dataset belong to different classes.
Study Notes
Introduction to Decision Trees
- Decision trees are a type of supervised machine learning algorithm used for both classification and regression tasks.
- They represent decisions and their possible consequences in a tree-like graph structure.
- Each internal node represents a feature or attribute, each branch represents a decision rule based on the attribute, and each leaf node represents the outcome or class label.
- Decision trees are relatively easy to understand and interpret, making them popular for visualizing decision-making processes.
Building a Decision Tree
- The process of building a decision tree involves recursively partitioning the data based on features that best separate the classes or predict the target variable.
- The key goal is to find the split point that maximizes information gain or minimizes the impurity of the resulting subsets.
- Various algorithms exist, like ID3, C4.5, and CART (Classification and Regression Trees), with slight variations in splitting criteria.
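The split search above can be sketched in a few lines of plain Python. This is an illustrative toy example, not code from the notes: it scans candidate thresholds on a single numeric feature and keeps the one with the highest information gain, using Shannon entropy as the impurity measure. The temperature/cold data is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, gain) of the best binary split on one numeric
    feature; candidate thresholds are each distinct value except the largest."""
    n = len(labels)
    base = entropy(labels)
    best = (None, 0.0)
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = base - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical data: outdoor temperature and whether a person caught a cold.
temps = [2, 5, 8, 20, 22, 25]
cold = ["yes", "yes", "yes", "no", "no", "no"]
print(best_split(temps, cold))  # (8, 1.0): splitting at 8 separates the classes perfectly
```

A real tree builder repeats this search over every feature at every node, recursing on each subset until a stopping rule fires.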
Key Concepts
- Entropy: A measure of impurity in a dataset. Higher entropy indicates more uncertainty about the class labels.
- Information Gain: The amount of reduction in entropy achieved by splitting the data based on a certain attribute. Higher information gain indicates a better choice of split.
- Gini Impurity: Another measure of impurity, quantifying the probability that two randomly selected items from a dataset will belong to different classes.
- Splitting Criteria: Different algorithms use distinct criteria to select the best attribute for splitting: for example, ID3 uses information gain, while CART uses Gini impurity (or variance reduction for regression problems).
- Pruning: Reducing the complexity of a decision tree by removing branches that contribute little predictive power. This curbs overfitting to the training dataset and improves generalisation to unseen data.
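The impurity measures above are simple formulas; a minimal pure-Python sketch (illustrative, with made-up labels) shows entropy, Gini impurity, and information gain on a small balanced dataset:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits): higher means more class uncertainty."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: probability that two random draws belong to different classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of the subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

labels = ["yes", "yes", "no", "no"]
print(entropy(labels))  # 1.0: maximal uncertainty for two balanced classes
print(gini(labels))     # 0.5
# A perfect split yields pure subsets, so all the entropy is removed:
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

A pure subset (all one class) scores 0 under both measures, which is why the split search favours attributes that sort the classes apart.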
Advantages of Decision Trees
- Interpretability: Easy to understand and visualize the decision-making process.
- Simplicity: Relatively easy to implement and understand compared to other machine learning algorithms.
- Handles both categorical and numerical data.
Disadvantages of Decision Trees
- Overfitting: Trees can become too complex, particularly if not pruned, resulting in high variance and poor generalization to unseen data.
- Sensitivity to data: Small variations in the dataset can lead to significantly different tree structures.
- Non-monotonicity: A feature may have a non-linear relationship with the outcome, or one that changes direction (e.g. negative in one portion of the data and positive in another), which can require deep, complex trees to capture.
Applications of Decision Trees
- Medical Diagnosis: Diagnosing diseases based on patient symptoms.
- Financial Risk Assessment: Assessing the likelihood of loan defaults.
- Customer Segmentation: Grouping customers based on their purchasing behaviour.
- Fraud Detection: Identifying fraudulent transactions.
Common Algorithms
- ID3: One of the earliest algorithms, uses information gain to select attributes.
- C4.5: An evolution of ID3 that improves the handling of continuous attributes and missing values.
- CART: Can handle both classification and regression tasks, typically uses Gini impurity.
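The practical difference between these algorithms' splitting criteria can be made concrete by scoring the same candidate split under both measures. This is a small illustrative sketch with made-up labels, not code from the source:

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity measure used by ID3/C4.5 (information gain)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Impurity measure typically used by CART."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_score(parent, subsets, impurity):
    """Impurity decrease of a split under a given impurity measure."""
    n = len(parent)
    return impurity(parent) - sum(len(s) / n * impurity(s) for s in subsets)

parent = ["yes"] * 3 + ["no"] * 3
split = [["yes", "yes", "no"], ["yes", "no", "no"]]
print(split_score(parent, split, entropy))  # ID3-style information gain, ~0.082
print(split_score(parent, split, gini))     # CART-style Gini decrease, ~0.056
```

The absolute numbers differ, but both criteria reward splits that make the child nodes purer, and in practice they usually rank candidate splits similarly.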
Considerations in Implementing Decision Trees
- Feature Selection: Selecting relevant features can improve performance and reduce unnecessary complexity.
- Handling Missing Values: Appropriate methods are needed to deal with missing data during the tree building process.
- Data Preprocessing: Essential to prepare the data for analysis, including data cleaning, normalization, and handling outliers.
- Evaluation Metrics: Evaluate performance with metrics appropriate to the type of task (e.g. accuracy or F1 score for classification, mean squared error for regression).
Conclusion
- Decision trees are powerful tools for decision making and for building machine learning models.
- Their simplicity can greatly help with model interpretation and insight.
- Their potential drawbacks need consideration to avoid overfitting and improve model robustness.
Description
This quiz explores the fundamental concepts of decision trees, including their structure and their use in classification and regression tasks. Learn how a decision tree is built and about common algorithms such as ID3, C4.5, and CART. Test your knowledge of how these algorithms maximize information gain.