Questions and Answers
What is the purpose of gradient descent in machine learning?
Which statement best describes regularization in machine learning?
What would likely be a primary reason to apply feature mapping in a machine learning model?
In the context of a linear classifier, what does the 'attempt' refer to?
Which of the following is NOT a direct solution method in machine learning?
Study Notes
Machine Learning (EEC3501)
- This course covers machine learning fundamentals.
- Problem Setup: Predict a scalar value (t) based on another scalar value (x). The dataset is a collection of (x(i), t(i)) pairs. Inputs are denoted as x(i) and targets as t(i).
- Model: The model predicts y as a linear function of x: y = wx + b.
- w is the weight.
- b is the bias.
- w and b together are parameters.
- Settings of these parameters are called hypotheses.
- Loss Function: Squared error, L(y, t) = ½(y − t)². The goal is to make the residual (y − t) small; the factor of ½ is included for mathematical convenience.
- Cost Function: The average loss across all training examples: J(w, b) = 1/(2N) · Σᵢ (y(i) − t(i))².
- Multivariable Regression: When multiple input variables (x1, x2, ..., xD) are present, the linear model is y = Σⱼ wⱼ xⱼ + b. This differs from the single-input case only in visual complexity, not in the fundamental setup.
- Vectorization: Using matrix and vector operations to optimize computing performance (faster than using loops). y = np.dot(w,x) + b, where w and x are vectors.
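A minimal sketch of this vectorized prediction, assuming NumPy; the concrete values are illustrative only:

```python
import numpy as np

# Illustrative weights, bias, and one input example (D = 3 features).
w = np.array([0.5, -1.2, 2.0])   # weight vector
b = 0.1                          # bias
x = np.array([1.0, 3.0, -0.5])   # input vector

# Vectorized prediction y = sum_j w_j * x_j + b, computed without an explicit loop.
y = np.dot(w, x) + b
print(y)   # 0.5*1.0 - 1.2*3.0 + 2.0*(-0.5) + 0.1 = -4.0
```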
Cost Function Derivation
- Organizing Training Data: Arrange the input values into a design matrix X, with each row representing a training example and each column corresponding to a feature. The targets are collected into a vector t.
- Prediction for Whole Dataset: Compute predictions for the entire dataset: y = Xw + b.
- Squared Error Cost: Compute the cost over the complete dataset: J = 1/(2N) · ||y − t||². This reduces the calculation to a few matrix operations in Python (see the sketch below).
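A minimal sketch of this matrix-form cost, assuming NumPy; the function name `cost` and the argument layout are assumptions, not the course's actual code:

```python
import numpy as np

def cost(X, w, b, t):
    """Squared-error cost J = 1/(2N) * ||y - t||^2 over the whole dataset.

    X: (N, D) design matrix, one training example per row.
    w: (D,) weight vector, b: scalar bias, t: (N,) target vector.
    """
    N = X.shape[0]
    y = X @ w + b                      # predictions for all N examples at once
    return np.sum((y - t) ** 2) / (2 * N)
```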
Direct Solution
- Finding Minimum Analytically: Find the minimum of the cost function by setting the partial derivatives equal to 0.
- Derivation: Take the partial derivatives of the cost with respect to the parameters and set them to zero; this yields a system of linear equations.
- Optimal Weights: Solving that system of linear equations efficiently gives the optimal weights for this linear model. An explicit formula exists: w = (XᵀX)⁻¹Xᵀt.
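A sketch of the direct solution, assuming NumPy and assuming the bias is absorbed into the design matrix by prepending a column of ones (the formula above covers the weights only):

```python
import numpy as np

def direct_solution(X, t):
    """Closed-form least-squares solution w = (X^T X)^{-1} X^T t.

    A column of ones is prepended to X so the first entry of the
    returned vector plays the role of the bias b.
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    # Solve the normal equations; solving is more numerically stable
    # than explicitly forming the matrix inverse.
    return np.linalg.solve(X1.T @ X1, X1.T @ t)
```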
Gradient Descent
- Iterative Minimization: A numerical approach to finding the minimum of the cost function; the parameters are repeatedly adjusted in the direction that decreases the cost.
- Initialization: Start with initial values for weights. For instance, using all zeros.
- Step Size in Gradient Descent: Adjust the weights by a step whose size is controlled by the step-size parameter (learning rate) α.
- Gradient Calculation: Compute the gradient, which shows how the cost function changes with respect to each parameter; each update subtracts α times the gradient from the current parameters (see the sketch below).
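A minimal gradient-descent sketch for this cost, assuming NumPy; the step count and default α are illustrative choices, not values from the notes:

```python
import numpy as np

def gradient_descent(X, t, alpha=0.1, num_steps=1000):
    """Minimize J(w, b) = 1/(2N) * ||Xw + b - t||^2 by gradient descent."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0            # initialize all weights to zero
    for _ in range(num_steps):
        y = X @ w + b                  # current predictions
        grad_w = X.T @ (y - t) / N     # dJ/dw
        grad_b = np.sum(y - t) / N     # dJ/db
        w -= alpha * grad_w            # step of size alpha down the gradient
        b -= alpha * grad_b
    return w, b
```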
Feature Mapping
- Polynomial Regression: A method to fit curves rather than straight lines.
- Feature Representation: Define a feature mapping ψ(x) (for polynomial regression, ψ(x) = (x, x², ..., x^M)) and train the linear model on the mapped features.
- Applying Methods: The same linear-regression algorithms work unchanged on the mapped features (for example, using polynomial features to fit a curve to the dataset; see the sketch below).
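A sketch of a polynomial feature mapping, assuming NumPy; the function name and the choice to leave the bias to the model's b term are assumptions:

```python
import numpy as np

def polynomial_features(x, degree):
    """Map scalar inputs x to the features [x, x^2, ..., x^degree].

    Running ordinary linear regression on these features fits a
    polynomial curve to the data.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)   # column of inputs
    return np.hstack([x ** d for d in range(1, degree + 1)])

# Example: cubic features for a small dataset.
X_poly = polynomial_features([0.0, 0.5, 1.0, 1.5], degree=3)
```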
Underfitting and Overfitting
- Underfitting: The model is too simple to capture the complexity of the data.
- Overfitting: The model is too complex and fits only the training data very precisely, failing to generalize well to new data. The training error will likely decrease, but test error will increase.
Regularization
- Balancing Model Complexity & Data Fit: Prevent overfitting by adding a penalty term to the cost that discourages large coefficients (see the sketch after this list).
- Hyperparameter Tuning: Use a validation set to experiment with various values of the regularization parameter (λ).
- Observation: Polynomial models that overfit often have large coefficients; this motivates penalizing (shrinking) those coefficients.
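A sketch of one common form of such a penalty (an L2 penalty on the weights), assuming NumPy; the exact penalty used in the course is not specified in these notes:

```python
import numpy as np

def regularized_cost(X, w, b, t, lam):
    """Squared-error cost plus an L2 penalty lam/2 * ||w||^2.

    lam is the regularization hyperparameter tuned on a validation set;
    the bias b is typically left unpenalized.
    """
    N = X.shape[0]
    y = X @ w + b
    data_term = np.sum((y - t) ** 2) / (2 * N)
    penalty = lam / 2 * np.sum(w ** 2)
    return data_term + penalty
```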
Linear Classifier
- Classification Models: Methods to place data points into predefined categories.
- Binary Classification: Assigning each item to one of two categories.
- Examples: Medical diagnosis, spam filtering, transaction fraud detection.
Binary Linear Classification
- Binary Target Values: Predicting a target variable that takes values in {0, 1}.
- Linear Model: Mapping input variables to a score (z) via a linear function: z=wᵀx +b.
- Threshold: Apply a threshold r (a cutoff value) to the score z to produce a prediction: y = 1 if z > r, and y = 0 otherwise.
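A minimal sketch of this thresholded linear classifier, assuming NumPy; the function name is illustrative:

```python
import numpy as np

def predict_class(X, w, b, r=0.0):
    """Binary linear classifier: score z = w^T x + b, thresholded at r."""
    z = X @ w + b                      # scores for all examples
    return (z > r).astype(int)         # y = 1 if z > r, else 0
```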
Loss Functions
- 0-1 Loss: The most basic choice, indicating whether the prediction matches the target: L₀₋₁(y, t) = 𝟙[y ≠ t], i.e. 0 if y = t and 1 if y ≠ t.
- Surrogate Loss Functions: Replace the 0-1 loss with a loss that is easier to optimize. A common choice is the squared-error loss (y − t)²/2.
- Problem (with 0-1 Loss): The 0-1 loss is flat almost everywhere, so its gradient is zero and gradient-descent updates are ineffective.
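The two losses side by side, as a small NumPy sketch (the function names are illustrative):

```python
import numpy as np

def zero_one_loss(y, t):
    """0-1 loss: 0 for a correct prediction, 1 for an error."""
    return (np.asarray(y) != np.asarray(t)).astype(float)

def squared_error_loss(y, t):
    """Surrogate squared-error loss (y - t)^2 / 2, which has a useful gradient."""
    return (np.asarray(y) - np.asarray(t)) ** 2 / 2
```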
Logistic Regression
- Probability Estimation: Estimating probabilities instead of just a class prediction.
- Activation Function: Squash the score z into the interval (0, 1) to obtain a predicted probability y. A common choice is the logistic (sigmoid) function σ(z) = 1/(1 + e⁻ᶻ).
- Choosing a Loss Function: Instead of 0-1 loss, use the cross-entropy loss for better gradient-descent behavior, since it penalizes confident wrong predictions: L_CE = −t log y − (1 − t) log(1 − y).
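A sketch of the logistic-regression pieces named above, assuming NumPy; the clipping constant eps is an implementation detail added here to avoid log(0):

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes the score z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(y, t, eps=1e-12):
    """Cross-entropy loss L_CE = -t*log(y) - (1 - t)*log(1 - y)."""
    y = np.clip(y, eps, 1 - eps)       # guard against log(0)
    return -t * np.log(y) - (1 - t) * np.log(1 - y)

# Usage: y = sigmoid(np.dot(w, x) + b) gives the predicted probability,
# and cross_entropy_loss(y, t) the loss against target t.
```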
Description
Test your understanding of machine learning fundamentals, including problem setups, model parameters, and loss functions. This quiz covers essential concepts such as multivariable regression and vectorization strategies. Perfect for students of EEC3501.