Machine Learning Fundamentals

Study Notes

Machine learning enables learning from examples to perform tasks such as clustering, outlier detection, classification, and regression without a complete physical model.
Two datasets are essential: a training set (labeled examples for system development) and a test set (unlabeled examples for evaluation).

The purpose of machine learning is to generalize beyond training data, making predictions for new, unseen examples.
Generalization is crucial in assessing the performance of a machine learning model.

Objects are encoded using features (e.g., shape, weight, color) to facilitate automated tasks in machine learning.
A feature vector represents the values of these features for each object within a dataset.

A dataset is collected by measuring the features of many objects, resulting in labeled objects where each object is associated with a feature vector.
In classification, datasets form the basis for teaching models to recognize patterns.

The quality of features significantly impacts the ability to recognize patterns; well-defined features enhance recognition while poorly defined features do not contribute.
Other approaches to defining objects include dissimilarity and structural pattern recognition, though the feature approach is more developed.

Real-world measurements are imperfect and can introduce noise; variations exist within classes of objects.
Statistical methods are necessary to account for these variations to improve model accuracy.

In classification tasks, such as distinguishing between different Iris flower types, specific measurements (e.g., sepal and petal dimensions) are utilized to differentiate between classes.
Such measurements are fundamental in developing effective predictive models in machine learning.### Class Posterior Probability
The objective is to estimate the conditional probability p(blue|feature 1).
This involves determining how likely an object belongs to the blue class for given feature values.
Posterior probability helps in classifying new data points based on existing training data.

Data is represented in a Gaussian distribution, characterized by bell-shaped curves.
Features are plotted on a graph, typically indicating the distribution for different classes.

Charts illustrate probabilities of being in the blue class across different values of feature 1.
Values range from 0 to 1, where 1 indicates certainty of being blue and 0 represents no likelihood.
Mean and variance are essential characteristics of Gaussian distributions aiding in class prediction.

The process involves calculating p(y|x), where y denotes class and x represents features.
Probability estimates assist in determining the class label for classification tasks in machine learning.

Classify new observations by assigning the label corresponding to the highest posterior probability.
Requires calculation of probabilities for all potential classes before making a decision.

Probability thresholds can be analyzed to define decision boundaries between classes.
For example, an object is classified as blue if p(blue|feature 1) exceeds the probabilities of other classes.

Understanding class posterior probability and its relation to Gaussian distributions is crucial in making accurate predictions in machine learning tasks.