Supervised Machine Learning Overview

Study Notes

Supervised machine learning involves creating models that can predict outputs based on labeled input data.
The dataset for supervised learning consists of records: input variables and their corresponding labels (outputs).
The goal is to build a model that relates inputs to outputs and predict the class of unseen data.

Datasets are usually divided into training and testing sets.
The training set is used to develop the model, while the testing set assesses the model's accuracy on unseen data.

For binary classification, the model's output can be represented using a confusion matrix.
The confusion matrix shows how many instances are correctly classified by the model.
True Positives (TP): Instances correctly classified as belonging to the positive class.
True Negatives (TN): Instances correctly classified as belonging to the negative class.
False Positives (FP): Instances incorrectly classified as belonging to the positive class.
False Negatives (FN): Instances incorrectly classified as belonging to the negative class.

Precision: Measures the proportion of correctly classified positive instances out of all instances predicted as positive.
Recall: Measures the proportion of correctly classified positive instances out of all actual positive instances.
F-score: A weighted average of precision and recall, providing a balanced measure of the model's performance.

Regression models are used to predict continuous output values based on input variables.
An example is predicting the price of a house based on its characteristics.
The mean squared error (MSE) is a commonly used metric to evaluate the performance of regression models and measures the average difference between the predicted and actual values.

Occurs when a model learns the training data too well, resulting in poor performance on unseen data.
The model memorizes the training data's noise rather than capturing the underlying relationships between inputs and outputs.
This can lead to inaccurate predictions on new data.

Occurs when the model fails to capture the underlying structure in the training data, leading to poor performance on both training and testing sets.
The model is too simple to effectively learn the data patterns.

A well-performing machine learning model aims to minimize generalization errors, meaning it makes accurate predictions on data it hasn't been trained on.
This ability to generalize is crucial for practical applications.