Machine Learning Algorithms and Metrics

Study Notes

Supervised Learning: Trained on labeled data to learn the relationship between input data and output labels, making predictions on new, unseen data.
Unsupervised Learning: Trained on unlabeled data to discover hidden patterns or relationships, such as clusters, lower-dimensional structure, or anomalies.

Decision Tree: A popular, interpretable supervised learning algorithm for classification and regression tasks.
Gini Index: A measure of impurity or uncertainty in a dataset, used to determine the best split in a decision tree.
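As a rough illustration of how the Gini index scores a split, the sketch below uses the common definition 1 - sum(p_k^2) over class proportions p_k. The function names gini_index and weighted_gini are made up for these notes and do not come from any particular library.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(left, right):
    """Size-weighted impurity of a candidate split into two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)

pure = gini_index(["a", "a", "a", "a"])        # a single class: no impurity
mixed = gini_index(["a", "a", "b", "b"])       # two classes in equal parts
split = weighted_gini(["a", "a"], ["b", "b"])  # a split that separates the classes
print(pure, mixed, split)
```

A tree builder would typically evaluate weighted_gini for every candidate split and keep the one with the lowest value.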

Accuracy: The proportion of correctly classified instances out of total instances.
Precision: The proportion of true positives among all positive predictions made by the model.
Recall: The proportion of true positives among all actual positive instances.
F1 Score: The harmonic mean of precision and recall, providing a balanced measure of both.
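The four metrics above can be computed directly from confusion-matrix counts. The following is a minimal sketch for the binary case; the helper name classification_metrics and the toy labels are assumptions made for illustration, and undefined ratios are returned as 0.0 by convention here.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 2 true positives, 1 false negative, 1 false positive, 4 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced data, accuracy can stay high while precision and recall are poor, which is why the notes list all four.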

K-Nearest Neighbors (KNN): A simple, non-parametric supervised learning algorithm for classification and regression tasks.
Effect of K: The value of K strongly affects the model's performance. A small K follows individual neighbors closely and tends to overfit noise (high variance); a large K averages over many neighbors and tends to oversimplify (high bias).

Ensemble Learning: Combining multiple base models to improve the overall performance, generalizability, and robustness of the system.
Examples include bagging (random forests are a well-known instance), boosting, and stacking.
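One simple way to combine base models is hard (majority) voting over their predicted labels. The sketch below assumes three hypothetical classifiers whose predictions are written out by hand; it shows only the combination step, not how the base models are trained.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per instance by hard voting."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 0, 1, 1, 0, 1]
# Hand-written predictions of three hypothetical base models,
# each wrong on a different instance.
model_a = [1, 0, 1, 0, 0, 1]
model_b = [1, 1, 1, 1, 0, 1]
model_c = [0, 0, 1, 1, 0, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print(accuracy(y_true, model_a), accuracy(y_true, ensemble))
```

The vote corrects every mistake here only because the three models err on different instances; base models with strongly correlated errors would gain much less.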

Complete Linkage: A hierarchical clustering method where the distance between two clusters is the maximum distance between any two points, one from each cluster.
Single Linkage: A hierarchical clustering method where the distance between two clusters is the minimum distance between any two points, one from each cluster.
K-Means: A popular, iterative, and centroid-based clustering algorithm for partitioning data into K clusters.
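The iterative K-Means procedure is often implemented as Lloyd's algorithm, which alternates an assignment step and a centroid update step. The sketch below assumes Euclidean distance and hand-picked initial centroids (real implementations usually choose them randomly or with a seeding heuristic); the function name kmeans and the toy points are illustrative.

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
initial = [(1, 1), (9, 8)]  # assumption: initial centroids chosen by hand
final_centroids, labels = kmeans(points, initial)
print(final_centroids, labels)
```

The result depends on the initial centroids, so in practice the algorithm is commonly run several times from different starting points.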

Sampling: The process of selecting a subset of data points from a larger population, essential in machine learning for model training, evaluation, and data preprocessing.
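Two common uses of sampling are a random train/test split and a stratified sample that preserves class proportions. The sketch below is one possible version using only the standard library; the function names, the seed handling, and the toy data are assumptions for illustration.

```python
import random
from collections import defaultdict

def train_test_split(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of the items and cut it into train and test parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def stratified_sample(items, labels, fraction, seed=0):
    """Sample the same fraction from every class (at least one item per class)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    sample = []
    for group in by_label.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

items = list(range(20))
labels = ["pos"] * 4 + ["neg"] * 16  # imbalanced toy labels

train, test = train_test_split(items, test_fraction=0.25)
sample = stratified_sample(items, labels, fraction=0.5)
print(len(train), len(test), len(sample))
```

Stratifying matters most when a class is rare, since a plain random sample could miss it entirely.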

Supervised Learning: Trained on labeled data to learn a mapping between input data and output labels, with the goal of making predictions on new, unseen data
Unsupervised Learning: Trained on unlabeled data to discover patterns or structure, with the goal of identifying relationships or groupings in the data

Decision Tree: A tree-based model that recursively splits data into subsets based on feature values, following the resulting sequence of decisions down to a leaf that gives the predicted class or value
Gini Index: A measure of impurity or uncertainty in a decision tree, with lower values indicating more homogeneous nodes
Metrics for Decision Trees:
- Accuracy: Proportion of correctly classified instances
- Precision: Proportion of true positives among all predicted positive instances
- Recall: Proportion of true positives among all actual positive instances
- F1 Score: Harmonic mean of precision and recall, providing balanced measure of both

KNN Algorithm: Classifies new instances based on majority vote of K most similar instances in training data
Effect of K on KNN:
- Small K: More localized, sensitive to noise, and may not capture underlying pattern
- Large K: More global, smoother, and less sensitive to noise, but may lose local patterns
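The effect of K described above can be seen on a small toy dataset. The sketch below is a bare-bones KNN classifier using Euclidean distance and a majority vote; the function name knn_predict and the data points, including one deliberately mislabeled "noisy" point, are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Label a query point by majority vote among its k nearest training points."""
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in neighbors[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

train_points = [
    (0, 0), (1, 0), (0, 1), (1, 1),  # cluster of class "A"
    (5, 5), (6, 5), (5, 6),          # cluster of class "B"
    (0.5, 0.5),                      # noisy "B" point inside the "A" cluster
]
train_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
query = (0.6, 0.6)

small_k = knn_predict(train_points, train_labels, query, k=1)  # follows the noisy neighbor
large_k = knn_predict(train_points, train_labels, query, k=5)  # outvoted by the "A" cluster
print(small_k, large_k)
```

With k=1 the single noisy neighbor decides the label; with k=5 the surrounding "A" points outvote it. Pushing K toward the size of the training set would instead make every query receive the overall majority class.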

Ensemble Methods: Combine predictions from multiple models to improve performance, robustness, and generalizability
Types of Ensemble Learning: Bagging, Boosting, Stacking, and Voting

Clustering: Grouping similar instances based on features, without prior knowledge of groups or labels
Types of Clustering:
- K-Means: Partitions data into K clusters, each associated with a centroid
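- Complete Linkage: Hierarchical clustering method that measures the distance between two clusters as the maximum distance between their points, and merges the pair of clusters for which this distance is smallest
- Single Linkage: Hierarchical clustering method that measures the distance between two clusters as the minimum distance between their points, and merges the pair of clusters for which this distance is smallest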
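The two linkage rules can pick different clusters to merge on the same data. The sketch below computes both inter-cluster distances with Euclidean distance and selects the pair that one agglomerative step would merge; the helper closest_pair and the toy clusters are assumptions for illustration.

```python
import math
from itertools import combinations, product

def single_linkage(cluster_a, cluster_b):
    """Minimum distance between any two points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Maximum distance between any two points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(cluster_a, cluster_b))

def closest_pair(clusters, linkage):
    """Indices of the two clusters that one agglomerative step would merge."""
    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
    )

clusters = [
    [(0, 0), (4, 0)],    # cluster 0: elongated
    [(5, 0)],            # cluster 1: a single point
    [(8, 0), (8.5, 0)],  # cluster 2: compact
]

single_merge = closest_pair(clusters, single_linkage)
complete_merge = closest_pair(clusters, complete_linkage)
print(single_merge, complete_merge)
```

Single linkage joins clusters 0 and 1 because one point of the elongated cluster lies close to the single point, while complete linkage prefers clusters 1 and 2 because it penalizes the far end of cluster 0.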