Questions and Answers
What is the purpose of gradient descent in machine learning?
Which statement best describes regularization in machine learning?
What would likely be a primary reason to apply feature mapping in a machine learning model?
In the context of a linear classifier, what does the 'attempt' refer to?
Which of the following is NOT a direct solution method in machine learning?
Study Notes
Machine Learning (EEC3501)
- This course covers machine learning fundamentals.
- Problem Setup: Predict a scalar value (t) based on another scalar value (x). The dataset is a collection of (x(i), t(i)) pairs. Inputs are denoted as x(i) and targets as t(i).
- Model: The model predicts y as a linear function of x: y = wx + b.
- w is the weight.
- b is the bias.
- w and b together are parameters.
- Settings of these parameters are called hypotheses.
- Loss Function: Squared error, L(y, t) = ½(y − t)². The goal is to make the residual (y − t) small; the factor of ½ is included for mathematical convenience.
- Cost Function: The average loss across all training examples: J(w, b) = 1/(2N) · Σᵢ (y(i) − t(i))².
- Multivariable Regression: When multiple input variables (x1, x2, ..., xD) are present, the linear model is y = Σⱼ wⱼ xⱼ + b. This differs from the single-input case only in visual complexity, not in the fundamental setup.
- Vectorization: Using matrix and vector operations to optimize computing performance (faster than using loops). y = np.dot(w,x) + b, where w and x are vectors.
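A minimal sketch of this vectorized prediction, assuming NumPy; the concrete values are illustrative only:

```python
import numpy as np

# Illustrative weights, bias, and one input example (D = 3 features).
w = np.array([0.5, -1.2, 2.0])   # weight vector
b = 0.1                          # bias
x = np.array([1.0, 3.0, -0.5])   # input vector

# Vectorized prediction y = sum_j w_j * x_j + b, computed without an explicit loop.
y = np.dot(w, x) + b
print(y)   # 0.5*1.0 - 1.2*3.0 + 2.0*(-0.5) + 0.1 = -4.0
```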
Cost Function Derivation
- Organizing Training Data: Arrange the input values into a design matrix X, with each row representing a training example and each column corresponding to a feature. The targets are collected into a vector t.
- Prediction for Whole Dataset: Compute predictions for the entire dataset: y = Xw + b.
- Squared Error Cost: Compute the cost over the complete dataset: J = 1/(2N) · ||y − t||². This reduces the calculation to a few matrix operations in Python (see the sketch below).
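A minimal sketch of this matrix-form cost, assuming NumPy; the function name `cost` and the argument layout are assumptions, not the course's actual code:

```python
import numpy as np

def cost(X, w, b, t):
    """Squared-error cost J = 1/(2N) * ||y - t||^2 over the whole dataset.

    X: (N, D) design matrix, one training example per row.
    w: (D,) weight vector, b: scalar bias, t: (N,) target vector.
    """
    N = X.shape[0]
    y = X @ w + b                      # predictions for all N examples at once
    return np.sum((y - t) ** 2) / (2 * N)
```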
Direct Solution
- Finding Minimum Analytically: Find the minimum of the cost function by setting the partial derivatives equal to 0.
- Derivation: Take the partial derivatives of the cost with respect to the parameters and set them to zero; this yields a system of linear equations.
- Optimal Weights: Solving that system of linear equations efficiently gives the optimal weights for this linear model. An explicit formula exists: w = (XᵀX)⁻¹Xᵀt.
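A sketch of the direct solution, assuming NumPy and assuming the bias is absorbed into the design matrix by prepending a column of ones (the formula above covers the weights only):

```python
import numpy as np

def direct_solution(X, t):
    """Closed-form least-squares solution w = (X^T X)^{-1} X^T t.

    A column of ones is prepended to X so the first entry of the
    returned vector plays the role of the bias b.
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    # Solve the normal equations; solving is more numerically stable
    # than explicitly forming the matrix inverse.
    return np.linalg.solve(X1.T @ X1, X1.T @ t)
```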
Gradient Descent
- Iterative Minimization: A numerical approach to finding the minimum of the cost function; the parameters are repeatedly adjusted in the direction that decreases the cost.
- Initialization: Start with initial values for weights. For instance, using all zeros.
- Step Size in Gradient Descent: Adjust the weights by a step whose size is controlled by the step-size parameter (learning rate) α.
- Gradient Calculation: Compute the gradient, which shows how the cost function changes with respect to each parameter; each update subtracts α times the gradient from the current parameters (see the sketch below).
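A minimal gradient-descent sketch for this cost, assuming NumPy; the step count and default α are illustrative choices, not values from the notes:

```python
import numpy as np

def gradient_descent(X, t, alpha=0.1, num_steps=1000):
    """Minimize J(w, b) = 1/(2N) * ||Xw + b - t||^2 by gradient descent."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0            # initialize all weights to zero
    for _ in range(num_steps):
        y = X @ w + b                  # current predictions
        grad_w = X.T @ (y - t) / N     # dJ/dw
        grad_b = np.sum(y - t) / N     # dJ/db
        w -= alpha * grad_w            # step of size alpha down the gradient
        b -= alpha * grad_b
    return w, b
```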
Feature Mapping
- Polynomial Regression: A method to fit curves rather than straight lines.
- Feature Representation: Define a feature mapping ψ(x) (for polynomial regression, ψ(x) = (x, x², ..., x^M)) and train the linear model on the mapped features.
- Applying Methods: The same linear-regression algorithms work unchanged on the mapped features (for example, using polynomial features to fit a curve to the dataset; see the sketch below).
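A sketch of a polynomial feature mapping, assuming NumPy; the function name and the choice to leave the bias to the model's b term are assumptions:

```python
import numpy as np

def polynomial_features(x, degree):
    """Map scalar inputs x to the features [x, x^2, ..., x^degree].

    Running ordinary linear regression on these features fits a
    polynomial curve to the data.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)   # column of inputs
    return np.hstack([x ** d for d in range(1, degree + 1)])

# Example: cubic features for a small dataset.
X_poly = polynomial_features([0.0, 0.5, 1.0, 1.5], degree=3)
```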
Underfitting and Overfitting
- Underfitting: The model is too simple to capture the complexity of the data.
- Overfitting: The model is too complex and fits only the training data very precisely, failing to generalize well to new data. The training error will likely decrease, but test error will increase.
Regularization
- Balancing Model Complexity & Data Fit: Prevent overfitting by adding a penalty term to the cost that discourages large coefficients (see the sketch after this list).
- Hyperparameter Tuning: Use a validation set to experiment with various values of the regularization parameter (λ).
- Observation: Polynomial models that overfit often have large coefficients; this motivates penalizing (shrinking) those coefficients.
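A sketch of one common form of such a penalty (an L2 penalty on the weights), assuming NumPy; the exact penalty used in the course is not specified in these notes:

```python
import numpy as np

def regularized_cost(X, w, b, t, lam):
    """Squared-error cost plus an L2 penalty lam/2 * ||w||^2.

    lam is the regularization hyperparameter tuned on a validation set;
    the bias b is typically left unpenalized.
    """
    N = X.shape[0]
    y = X @ w + b
    data_term = np.sum((y - t) ** 2) / (2 * N)
    penalty = lam / 2 * np.sum(w ** 2)
    return data_term + penalty
```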
Linear Classifier
- Classification Models: Methods to place data points into predefined categories.
- Binary Classification: Assigning each item to one of two categories.
- Examples: Medical diagnosis, spam filtering, transaction fraud detection.
Binary Linear Classification
- Binary Target Values: Predicting a target variable that takes values in {0, 1}.
- Linear Model: Mapping input variables to a score (z) via a linear function: z=wᵀx +b.
- Threshold: Apply a threshold r (a cutoff value) to the score z to produce a prediction: y = 1 if z > r, and y = 0 otherwise.
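A minimal sketch of this thresholded linear classifier, assuming NumPy; the function name is illustrative:

```python
import numpy as np

def predict_class(X, w, b, r=0.0):
    """Binary linear classifier: score z = w^T x + b, thresholded at r."""
    z = X @ w + b                      # scores for all examples
    return (z > r).astype(int)         # y = 1 if z > r, else 0
```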
Loss Functions
- 0-1 Loss: The most basic choice, indicating whether the prediction matches the target: L₀₋₁(y, t) = 𝟙[y ≠ t], i.e. 0 if y = t and 1 if y ≠ t.
- Surrogate Loss Functions: Replace the 0-1 loss with a loss that is easier to optimize. A common choice is the squared-error loss (y − t)²/2.
- Problem (with 0-1 Loss): The 0-1 loss is flat almost everywhere, so its gradient is zero and gradient-descent updates are ineffective.
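The two losses side by side, as a small NumPy sketch (the function names are illustrative):

```python
import numpy as np

def zero_one_loss(y, t):
    """0-1 loss: 0 for a correct prediction, 1 for an error."""
    return (np.asarray(y) != np.asarray(t)).astype(float)

def squared_error_loss(y, t):
    """Surrogate squared-error loss (y - t)^2 / 2, which has a useful gradient."""
    return (np.asarray(y) - np.asarray(t)) ** 2 / 2
```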
Logistic Regression
- Probability Estimation: Estimating probabilities instead of just a class prediction.
- Activation Function: Squash the score z into the interval (0, 1) to obtain a predicted probability y. A common choice is the logistic (sigmoid) function σ(z) = 1/(1 + e⁻ᶻ).
- Choosing a Loss Function: Instead of 0-1 loss, use the cross-entropy loss for better gradient-descent behavior, since it penalizes confident wrong predictions: L_CE = −t log y − (1 − t) log(1 − y).
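A sketch of the logistic-regression pieces named above, assuming NumPy; the clipping constant eps is an implementation detail added here to avoid log(0):

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes the score z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(y, t, eps=1e-12):
    """Cross-entropy loss L_CE = -t*log(y) - (1 - t)*log(1 - y)."""
    y = np.clip(y, eps, 1 - eps)       # guard against log(0)
    return -t * np.log(y) - (1 - t) * np.log(1 - y)

# Usage: y = sigmoid(np.dot(w, x) + b) gives the predicted probability,
# and cross_entropy_loss(y, t) the loss against target t.
```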
Description
Test your understanding of machine learning fundamentals, including problem setups, model parameters, and loss functions. This quiz covers essential concepts such as multivariable regression and vectorization strategies. Perfect for students of EEC3501.