ELE888 - Midterm (Theory)

Questions and Answers

What is the significance of the center of the ellipses in the context of a cost function?

Which of the following is a characteristic of using normal equations to minimize a cost function?

What potential issue can arise from using a learning rate that is too large in gradient descent?

How does a small learning rate affect the training process in gradient descent?

You are working with a dataset that has a relatively small number of data points. Which method of minimizing the cost function would be most appropriate?

Which of the following scenarios is best addressed using unsupervised learning?

A machine learning model is trained to predict whether a customer will click on an advertisement (yes/no). What type of supervised learning is being used?

In reinforcement learning, what is the primary goal of an agent?

Which of the following machine learning approaches would be most suitable for building a system that recommends movies to users based on the viewing history of similar users?

Why might dimensionality reduction be a useful step in building a machine learning model?

How does the cost function contribute to improving a model's predictive capability?

What is the primary purpose of cross-validation in model training?

What distinguishes Ridge Regression from standard linear regression?

In Lasso Regression, what is the significance of reducing some coefficients to zero?

How does increasing the value of alpha in Ridge Regression affect the model?

What is the key difference between overfitting and underfitting in machine learning models?

In linear regression, what role does the regression line play in prediction?

What type of relationship is typically visualized using contour plots in the context of machine learning?

In the context of training a machine learning model, what is a potential drawback of using a very small batch size?

Which of the following is the primary purpose of normalizing features before applying gradient descent?

What distinguishes Stochastic Gradient Descent (SGD) from Batch Gradient Descent?

In the context of machine learning, what does an 'epoch' represent?

If a learning curve plateaus during the training of a machine learning model, what does this typically indicate?

Which of the following best describes 'model hyperparameters' in machine learning?

What is the main purpose of Logistic Regression?

How does increasing the batch size typically affect the stability and memory requirements of the training process?

Why is feature normalization important in machine learning?

Which of the following is most likely to cause numerical instability?

A model shows high accuracy on the training set but performs poorly on the validation set. Which of the following is the most likely cause?

Which of the following strategies is least likely to help reduce overfitting?

Which regularization technique is most likely to perform feature selection by setting some feature weights exactly to zero?

What is true about an overfitted model?

How does increasing the amount of training data help to reduce overfitting?

What is the primary purpose of cross-validation in machine learning?

In a fraud detection model with skewed learning, 99% of transactions are legitimate. Which of the following strategies would be MOST effective in addressing the challenges posed by this skewed dataset?

A medical diagnosis model is trained on a dataset with 98% healthy patients and 2% having a rare disease. If the model consistently predicts 'healthy,' which evaluation metric would be MOST informative in assessing its performance?

Which of the following is a key similarity between logistic regression and neural networks?

How does the range of output values differ between the Sigmoid and Tanh activation functions, and what is the implication of this difference?

Zero-centering is a desirable property for activation functions because it helps prevent large positive or negative values from accumulating in deeper layers. Which activation function is zero-centered?

Both Sigmoid and Tanh activation functions suffer from the vanishing gradient problem. Under what conditions does this problem typically occur?

In the context of skewed learning, which of the following is the MOST critical consideration when evaluating model performance?

In what way do both Logistic Regression and Neural Networks utilize a weighted sum of inputs?

Flashcards

Machine Learning

Learning from data patterns instead of explicit programming.

Classification

A type of supervised learning that predicts categories or classes.

Clustering

Groups data points into clusters based on similar features, without labels.

Dimensionality Reduction

Reducing the number of features while retaining important information.

Reinforcement learning

Learning through trial and error, with rewards and punishments.

Model-based learning

Builds a model from the training examples and uses that model to make predictions.

Decision boundary

Distinguishes between two classes by assigning points to different sides based on a function's sign.

Cost (Loss) Function

Measures how well a model predicts values; the goal is to minimize this to improve accuracy.

Cross-validation

Maximizes the usage of data points for training and evaluates the model's performance.

Regularization

A technique that prevents a model from learning the training data too closely, improving its generalizability.

Overfitting

When a model learns the training data too well, leading to poor performance on new data.

Ridge Regression

Adds a penalty to all coefficients, reducing their impact, especially for less important features.

Lasso Regression

Reduces some coefficients to zero, effectively disregarding their impact and highlighting important features.

Ellipses (Cost Function)

Visual representation showing different levels of a cost function; each ellipse represents a specific cost value.

Gradient Descent

Iteratively adjusts model parameters to improve performance.

Normal Equations

Directly calculates optimal parameter values using a formula.

Learning Rate

Controls the size of the steps taken during gradient descent.

Large Learning Rate - Risk

Large steps in learning can lead to missing the optimal values and causing divergence.

Stochastic Gradient Descent (SGD)

An iterative learning algorithm that updates the model after each data point.

Batch Size

The number of training examples in each group when the dataset is split into smaller groups during training.

Epoch

One complete pass through the entire training dataset.

Learning Curve

A graph showing model performance over time.

Model Hyper-parameters

Settings chosen before training that control how the model learns.

Model Parameters

Values learned from the data during training.

Normalizing Features

Ensuring features have a similar scale for faster convergence of gradient descent.

Logistic Regression

Supervised learning used for predicting one of two possible outcomes, using a decision threshold.

Skewed Learning

Dataset with imbalanced class distribution, where some classes appear much more frequently than others.

Fraud Detection Example

Predicting 'legitimate' all the time in fraud detection, even with high accuracy, is useless.

Medical Diagnosis Example

Predicting 'healthy' all the time in a rare disease dataset, though accurate, misses actual cases.

Spam Detection Example

Model struggles classifying spam due to imbalanced data (95% non-spam, 5% spam).

Vanishing/Exploding Gradients

Features on different scales can lead to gradients becoming too small (vanishing) or too large (exploding).

Logistic Regression and Neural Networks similarities

Both use a weighted sum of inputs, activation functions, classification, Loss function, and Gradient descent.

Weighted Sum of Inputs

Logistic regression and neural networks both compute a weighted sum of their inputs.

Numerical Stability

Scaling prevents large feature values from causing numerical overflow or precision errors in computations.

Sigmoid Activation Function

Outputs values between 0 and 1, which can lead to problems in models with zero-centered data.

Equal Feature Contribution

Normalization ensures no single feature dominates the learning process due to its magnitude.

Tanh (Hyperbolic Tangent)

Outputs values between -1 and 1, which helps keep activations closer to zero, improving learning stability.

Hypothesis (Model)

A mathematical function that makes predictions based on input features and their relationships to the output.

Signs of Overfitting

High accuracy on training data but poor performance on validation/test data.

More Training Data

Collecting more samples to help the model generalize better, reducing dependency on noise.

Feature Selection & Engineering

Removing irrelevant or redundant features and creating meaningful features using domain knowledge.

Study Notes

  • Machine learning performs tasks based on patterns, not explicit instructions

Benefits of Machine Learning

  • Shorter programs

  • More accurate results

  • Easier to maintain

  • Spam filters are an example of machine learning

  • Spam filters detect words or phrases typical of spam and label those messages as spam

Machine Learning Branches

  • Supervised learning is more successful than unsupervised learning
    • Classification predicts categorical values
    • Regression predicts numerical values
  • Unsupervised learning groups data based on similar features
    • Clustering groups/clusters points without labels, using similar features
    • Dimensionality reduction reduces the number of features by obtaining a set of principal variables, since not all features are useful
  • Reinforcement learning involves a system learning through its actions
    • Favored actions yield positive rewards
    • Unfavored actions yield negative rewards/punishment
    • Actions are remembered for future decisions
    • Agents select the best action to achieve a task and improve model performance

Instance Based Learning

  • Instance based learning memorizes examples

  • Instance based learning uses similar features to identify other examples.

  • Using word count to detect spam emails is an example

  • Emails that share similar word counts with known spam emails are flagged

  • K-NN is an example

  • Datasets are useful for supervised and unsupervised learning only

  • Data for Reinforcement learning cannot be tabulated
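
As a minimal sketch of the instance-based idea above, the following k-NN classifier memorizes labeled emails as word-count vectors and flags a new email by majority vote among its nearest stored examples. The two word-count features, the example values, and the function name `knn_predict` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def knn_predict(query, examples, k=3):
    """Label `query` by majority vote among its k nearest stored examples."""
    # Sort the memorized examples by Euclidean distance to the query.
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features: (count of "free", count of "meeting") per email.
examples = [
    ((5, 0), "spam"), ((4, 1), "spam"), ((6, 0), "spam"),
    ((0, 3), "not spam"), ((1, 4), "not spam"), ((0, 5), "not spam"),
]

print(knn_predict((5, 1), examples))  # close to the stored spam examples
print(knn_predict((0, 4), examples))  # close to the stored non-spam examples
```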

Target Columns

  • A dataset for supervised machine learning has a target column

  • Presence of a target column with numerical values indicates regression

  • Presence of a target column containing categorical values indicates classification

  • Model-based learning builds a model from the examples and uses it to make predictions

Decision Boundary

  • Decision boundaries distinguish two classes
  • A point (x, y) can be plugged into the boundary function to determine which side of the boundary it falls on
    • Positive output: one side of the boundary
    • Negative output: the other side
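
The sign test described above can be sketched as follows. The linear boundary `x + y - 3` is a hypothetical choice for illustration; any function whose sign separates the two classes would work the same way.

```python
def boundary(x, y):
    """Hypothetical linear boundary function (coefficients are illustrative)."""
    return x + y - 3


def side(x, y):
    """Assign a point to a side of the boundary from the sign of the function."""
    value = boundary(x, y)
    if value > 0:
        return "positive side"
    if value < 0:
        return "negative side"
    return "on the boundary"


print(side(4, 2))  # 4 + 2 - 3 = 3, which is positive
print(side(1, 1))  # 1 + 1 - 3 = -1, which is negative
```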

Cost (AKA Loss) Function

  • Measures how well a model predicts values

  • The goal is to minimize cost for accurate predictions

  • Minimizing the function improves model accuracy

  • Cross validation maximizes the amount of data used in training

  • Cross validation is crucial when evaluating the model

  • Regularization is a technique that prevents a model from overlearning and improves the model's generalizability

  • Overlearning (overfitting) occurs when a model fits the training data too closely, causing poor performance on new data
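
One common form of cross validation is k-fold, sketched below under the assumption that the notes mean this variant. Each sample is used for validation exactly once and for training in every other fold, which is how the available data gets used as fully as possible. The helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any leftover samples.
        stop = n_samples if fold == k - 1 else start + fold_size
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation


folds = list(k_fold_indices(10, 5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```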

Ridge Regression

  • Ridge regression adds a penalty on the size of all coefficients, shrinking them toward zero
  • Impact of less important features is reduced
  • Alpha = 0: reverts to the linear regression cost function, with no protection against overfitting
  • High Alpha: heavily penalizes large coefficients, which can lead to an overly simple, underfitted model
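
The effect of alpha can be seen in a minimal one-feature sketch. It assumes a model `y = w * x` with no intercept and the cost `sum((y - w*x)^2) + alpha * w^2`; the data and the function name `ridge_weight` are hypothetical.

```python
def ridge_weight(xs, ys, alpha):
    """Closed-form weight minimizing sum((y - w*x)^2) + alpha * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # The penalty adds alpha to the denominator, shrinking w toward zero.
    return sxy / (sxx + alpha)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

for alpha in (0.0, 14.0, 1000.0):
    print(alpha, ridge_weight(xs, ys, alpha))
```

With alpha = 0 the weight equals the ordinary least-squares value, and it keeps shrinking as alpha grows, which is the underfitting risk noted above.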

Lasso Regression

  • Some coefficients are reduced to 0

  • Features whose coefficients are zero are not taken into consideration by the model

  • Coefficients with a non-zero value are considered important features of the data set

  • Overfitting describes models that fit the training data too closely and cannot generalize

  • Underfitting describes models that have not learned enough from the data and cannot make accurate predictions
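
To illustrate how Lasso can set a coefficient exactly to zero, here is a minimal one-feature sketch. It assumes a model `y = w * x` with no intercept and the cost `0.5 * sum((y - w*x)^2) + alpha * |w|`, whose minimizer is given by soft-thresholding. The data and the function name `lasso_weight` are hypothetical.

```python
def lasso_weight(xs, ys, alpha):
    """Weight minimizing 0.5 * sum((y - w*x)^2) + alpha * |w|."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Soft-thresholding: shrink sxy toward zero by alpha, clamping at exactly 0.
    if sxy > alpha:
        return (sxy - alpha) / sxx
    if sxy < -alpha:
        return (sxy + alpha) / sxx
    return 0.0


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

for alpha in (0.0, 14.0, 30.0):
    print(alpha, lasso_weight(xs, ys, alpha))
```

Unlike the squared penalty of Ridge, a large enough alpha here drives the weight all the way to zero rather than merely making it small.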

Linear Regression

  • Linear Regression is a method to predict a value using the relationship between two variables
  • Attempts to find the line of best fit, known as the regression line
  • The regression line is used for future predictions
  • Independent Variable: the variable used for making predictions
  • Dependent Variable: the variable being predicted
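
A minimal sketch of fitting a regression line by least squares and using it for a new prediction follows. The data points and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]  # independent variable
ys = [3.0, 5.0, 7.0, 9.0]  # dependent variable

slope, intercept = fit_line(xs, ys)
prediction = intercept + slope * 5.0  # predict for an unseen x = 5
print(slope, intercept, prediction)
```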

Contour Plots

  • 2D representation of a 3D surface, depicting how the cost depends on two variables
  • Variables involved include theta0, theta1 and the cost function

Ellipses

  • Show different levels of the cost function through ellipses
  • Each ellipse corresponds to a specific cost value
  • Every point on the same elliptical ring has the same J(theta), i.e., the same cost
  • Cost decreases toward the center of the ellipses, which indicates more accurate predictions from the model
  • The center of the ellipses represents the optimal theta0, theta1 values, which minimize the cost function
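
The points above can be checked numerically. The sketch below assumes the mean-squared-error cost J(theta0, theta1) = (1/2m) * sum((theta0 + theta1*x - y)^2) and hypothetical data generated from y = 1 + 2x, so the center of the contours sits at theta0 = 1, theta1 = 2.

```python
def cost(theta0, theta1, xs, ys):
    """Mean-squared-error cost J(theta0, theta1) with the conventional 1/(2m) factor."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x

center = cost(1.0, 2.0, xs, ys)   # the optimal parameters
ring_a = cost(1.5, 2.0, xs, ys)   # two different points ...
ring_b = cost(0.5, 2.0, xs, ys)   # ... with the same cost (same contour)
print(center, ring_a, ring_b)
```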

Ways to Minimize Cost Function

  • Gradient descent
  • Normal equations

Gradient Descent

  • Repeatedly adjusts model parameters to reduce the cost function and improve the model's performance
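
A minimal sketch of gradient descent for a one-feature linear model follows. It assumes a mean-squared-error cost; the data, learning rate, step count, and the function name `gradient_descent` are hypothetical choices for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Fit theta0 + theta1 * x by repeatedly stepping against the cost gradient."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to each parameter.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)  # approaches theta0 = 1, theta1 = 2
```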

Normal Equations

  • They use a direct, closed-form formula to find the exact parameter values
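
For a single feature plus an intercept, the normal equations reduce to a 2x2 linear system that can be solved by hand. The sketch below assumes the usual least-squares form theta = (XᵀX)⁻¹Xᵀy; the data and the function name `normal_equations` are hypothetical.

```python
def normal_equations(xs, ys):
    """Solve (X^T X) theta = X^T y for a model theta0 + theta1 * x."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[m, sx], [sx, sxx]] and X^T y = [sy, sxy]; invert the 2x2 matrix.
    det = m * sxx - sx * sx
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x

theta0, theta1 = normal_equations(xs, ys)
print(theta0, theta1)
```

No learning rate or iteration count is needed, but with many features the matrix inversion becomes the expensive part.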

Benefits of Using Normal Equations

  • It's not iterative
  • It works well with small datasets
  • Gives exact solution

Negatives of Using Normal Equations

  • Associated computational costs are high

  • Doesn't work well with large datasets

  • Learning rate controls how much a model's weights are updated during training

Large Learning Rate

  • Can learn faster and reach optimal values more quickly
  • Overshooting is possible because big steps miss optimal values
  • Models can oscillate wildly causing divergence

Small Learning Rate

  • Fine-tunes the approach to the optimal values (more accurate results) because steps are small
  • The model takes longer to learn because updates are small
  • Model may get stuck in a sub-optimal value, unable to reach the global optimal value
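
The effect of the learning rate can be seen on a toy problem. The sketch below minimizes the hypothetical cost f(w) = w², whose gradient is 2w and whose minimum is at w = 0; the three learning rates are illustrative values, not recommendations.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on f(w) = w^2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w  # derivative of w^2
        w -= learning_rate * gradient
    return w


for learning_rate in (0.001, 0.1, 1.1):
    print(learning_rate, descend(learning_rate))
```

With 0.001 the parameter has barely moved after 20 steps, with 0.1 it is close to the minimum, and with 1.1 each step overshoots so the value oscillates in sign and grows in magnitude.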

Logistic Regression

  • Used in binary classification to predict one of two outcomes
  • Predicts a continuous probability value, then applies a decision threshold to assign one of two classes
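
A minimal sketch of the prediction step: a weighted sum of the inputs is passed through the sigmoid to get a probability, which is then compared with a threshold. The weights, bias, inputs, and the 0.5 threshold are hypothetical values, not learned ones.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, class label) for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    label = 1 if probability >= threshold else 0
    return probability, label


weights, bias = [1.5, -2.0], 0.0  # hypothetical parameters

print(predict([2.0, 0.5], weights, bias))  # z = 2.0, probability above 0.5
print(predict([0.0, 1.0], weights, bias))  # z = -2.0, probability below 0.5
```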

Stochastic Gradient Descent (SGD)

  • An iterative learning algorithm that updates the model after each data point
  • Makes training faster, but noisier
  • Reduces memory usage and computation time

Batch Size

  • During training, the dataset is split into batches (groups)

  • Small batch size: faster but less stable training

  • Large batch size: more stable, but requires more memory

  • Mini-batch: splits data into small portions and updates the model after each batch is passed through

  • With 10 samples and a batch size of 2, there are 5 batches, so the model updates 5 times per epoch, once after each batch

  • Batch gradient descent uses the entire dataset at once to compute gradients

  • Batch methods need more memory and are slow for large datasets

  • An epoch is one full pass through the entire dataset during training

  • A learning curve is a graph that shows how well a model is learning over time

  • Improving curves mean that the model is learning

  • Flat curves mean that the model has stopped learning
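
The relationship between batch size and the number of updates in one epoch can be sketched as follows, assuming the common convention of one parameter update per batch. The helper names `batches` and `updates_per_epoch` are hypothetical.

```python
def batches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def updates_per_epoch(n_samples, batch_size):
    """Count parameter updates in one full pass, one update per batch."""
    samples = list(range(n_samples))
    return sum(1 for _ in batches(samples, batch_size))


print(updates_per_epoch(10, 1))   # stochastic: one update per data point
print(updates_per_epoch(10, 2))   # mini-batch: one update per batch of 2
print(updates_per_epoch(10, 10))  # batch: one update for the whole dataset
```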

Model Hyper-Parameters and Parameters

  • Model hyper-parameters are selected before model training and control how the model learns
  • Model parameters are values learned from the dataset; they change during training
  • Normalizing features before running gradient descent is crucial for a variety of reasons

Normalizing Features

  • Faster convergence: gradient descent updates model parameters based on partial derivatives of the cost function

  • When features have vastly different scales, the contours of the cost function become elongated, which makes gradient descent steps inefficient and slow

  • Normalization ensures that the minimum is reached faster

  • With normalized features, the contours of the cost function become closer to spherical

  • Avoiding Vanishing or Exploding Gradients

  • If features are on different scales, the gradients can become either too small or too large, leading to poor updates during learning

  • Better Numerical Stability

  • Features with large values can cause numerical overflow errors, particularly in matrix operations

  • Equal Contribution of Features

  • Features with large magnitudes dominate the learning process, making the model biased toward those features

  • Normalization ensures each feature contributes equally

  • The hypothesis (also called the model) is a mathematical function that makes predictions based on input features

  • It represents the relationship between input values and outputs
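
One common way to normalize features is z-score standardization, sketched below as an assumption since the notes do not name a specific method. The two features and their values are hypothetical; after scaling, both end up on the same scale despite differing by a factor of 1000 beforehand.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


sizes = [1000.0, 2000.0, 3000.0]  # hypothetical large-scale feature
rooms = [1.0, 2.0, 3.0]           # hypothetical small-scale feature

scaled_sizes = standardize(sizes)
scaled_rooms = standardize(rooms)
print(scaled_sizes)
print(scaled_rooms)
```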

Overfitting

  • Overfitting occurs when machine learning models capture noise in the training data instead of the underlying pattern
  • Models perform well on training data but poorly on new data
  • In simpler terms, such models are too complex and rely on memorization instead of generalization

Signs of Overfitting

  • High accuracy on training data, but poor performance on validation/test data
  • Large gap between training and validation loss
  • Models perform well on known data but fail on new inputs

How to Reduce Overfitting

  • More Training Data
    • Collect more samples to help the model generalize better
    • Helps reduce the model's dependency on noise
  • Feature Selection and Engineering
    • Remove irrelevant features that introduce noise
    • Use domain knowledge to create meaningful features
  • Regularization
    • L1 Regularization: Adds absolute weight penalties, leading to feature selection
    • L2 Regularization: Adds squared weight penalties, reducing coefficients and making the model simpler
  • Cross Validation
    • Use techniques to ensure the model is evaluated on multiple data subsets

Skewed Learning

  • Skewed learning refers to situations where the dataset has an imbalanced class distribution
  • Some classes appear more often than others
  • It is common in classification problems where one class dominates, leading to biased models

Examples of Skewed Learning

  • Fraud Detection
    • 99% of transactions are legitimate and 1% are fraudulent
    • A model predicting "legitimate" all the time would be 99% accurate
    • Such a model would be useless for detecting fraud
  • Medical Diagnosis (Rare Diseases)
    • Datasets are made of 98% healthy patients and 2% who have a disease
    • If the model always predicts "healthy", it will have high accuracy but fail to detect the disease
  • Spam Detection
    • 95% of emails are non-spam and 5% are spam
    • Models struggle to correctly classify spam messages
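
The fraud example can be reproduced in a few lines. The sketch below builds a hypothetical dataset with 1% fraud and scores a model that always predicts "legitimate". Recall (the fraction of actual fraud cases that were caught) is used here as one illustrative alternative to accuracy; the notes themselves do not name a specific metric.

```python
# Hypothetical dataset: 1 = fraudulent, 0 = legitimate (1% fraud).
labels = [1] * 10 + [0] * 990

# A model that always predicts "legitimate".
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
actual_positives = sum(labels)
recall = true_positives / actual_positives

print("accuracy:", accuracy)  # high, because legitimate transactions dominate
print("recall:", recall)      # zero, because no fraud case is ever detected
```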

Similarities Between Logistic Regression and Neural Networks

  • They share similarities in their fundamental concepts and math formulations
  • Both use a weighted sum of inputs
  • Both can use the sigmoid activation function
  • Both are used for classification
  • Both use loss functions based on log likelihood
  • Both use gradient descent for optimization

Sigmoid vs Hyperbolic Tangent Differences

  • Both sigmoid and hyperbolic tangent functions are S-shaped, but they have key differences in their properties and in how they behave in neural networks
  • Their equations are not the same

Range of Output Values

  • Sigmoid outputs values between 0 and 1, which can lead to problems in models with zero-centered data
  • Tanh outputs values between negative 1 and 1, which helps activations stay closer to zero, improving learning stability

Zero-Centering and Effect on Learning

  • Tanh is zero-centered, meaning outputs are evenly distributed around zero
    • Helps in training
  • Sigmoid is not zero-centered
    • Can cause bias shifts in activations, leading to slower convergence

Gradient Vanishing Problem

  • Sigmoid and Tanh both suffer from vanishing gradients when inputs are very large (strongly positive) or very small (strongly negative)
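
The output ranges and the vanishing gradients can be checked numerically. The sketch below uses the standard derivatives, sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) and tanh'(x) = 1 - tanh(x)²; the sample inputs are arbitrary illustrative values.

```python
import math


def sigmoid(x):
    """Sigmoid activation, with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh, whose outputs lie between -1 and 1."""
    return 1.0 - math.tanh(x) ** 2


for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), math.tanh(x), sigmoid_grad(x), tanh_grad(x))
```

Both gradients are largest at 0 and shrink toward zero as the input moves far from 0 in either direction, which is where learning stalls.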
