AT Lecture 9

Questions and Answers

What is the primary goal of hyperparameter optimization?

  • To reduce the training time of machine learning models.
  • To find the optimal hyperparameter configuration(s) from a set of possible configurations. (correct)
  • To normalize the input data for machine learning models.
  • To identify the best architecture for a neural network.

Data normalization is NOT considered a part of data preprocessing.

False (B)

Which of the following is NOT a type of layer parameter considered during neural architecture search?

  • Number of different layers
  • How to stack the layers
  • Types of layers (conv, maxpool)
  • Activation function of fully connected layers (correct)

Which of the following is an example of a hyperparameter that is optimized in machine learning models?

Batch size (B)

Neural Architecture Search (NAS) and hyperparameter optimization must always be performed sequentially, one after the other.

False (B)

Name one of the Automated Machine Learning tools mentioned.

IBM Auto AI

Which of the following falls under the category of 'Configuration Selection' in hyperparameter optimization?

Efficient selection of a good configuration (A)

In hyperparameter optimization, allocating more resources to promising hyperparameter configurations is part of ______ evaluation.

configuration

Match the hyperparameter types with their example:

Continuous = Learning rate
Integer = Number of units
Categorical = Activation function

A hyperparameter whose activity depends on the value of another hyperparameter is known as a:

Conditional hyperparameter (D)

In blackbox hyperparameter optimization, lower sample efficiency is preferred since the function is assumed to be inexpensive to evaluate.

False (B)

Which search technique involves evaluating every combination of hyperparameters from a preset list?

Grid search (C)

What is a primary disadvantage of grid search?

It suffers from the curse of dimensionality. (C)

Random search is generally less efficient for parameter optimization compared to grid search.

False (B)

Name one training resource that is often considered during hyperparameter optimization of machine learning models.

Size of training set

What kind of resources does the validation loss depend on?

Total resources allocated (D)

Which of the following best describes the underlying principle of Successive Halving?

The relative performance after a small number of iterations is roughly maintained. (D)

Successive Halving always allocates the same amount of resources to all hyperparameter configurations.

False (B)

What is the trade-off that Successive Halving suffers from?

The 'n vs B/n' trade-off (C)

The Hyperband algorithm addresses the hyperparameter optimization problem by:

Combining adaptive resource allocation with random sampling. (C)

The Hyperband algorithm performs a type of ______ search over the feasible values of 'n'.

grid

In the context of the Hyperband algorithm, a smaller value of 's' means the algorithm will throw out hyperparameters early in the process.

False (B)

What does the acronym CASH stand for in the context of AutoML?

Combined Algorithm Selection and Hyperparameter Optimization (A)

What criterion is used to find the combination of algorithm and hyperparameter configuration in Combined Algorithm Selection and Hyperparameter Optimization (CASH)?

The minimal loss (D)

What is the first step in Bayesian Hyperparameter Optimization?

Build a surrogate probability model of the objective function (A)

Surrogate probability models are NOT employed in Bayesian hyperparameter optimization processes.

False (B)

In Sequential Model-Based Optimization (SMBO), Bayesian reasoning is applied to improve:

The hyperparameters being tested. (D)

In the SMBO framework, a ______ function helps evaluate which hyperparameters to choose next.

selection

Match the following Surrogate models to a valid technique:

Gaussian Processes = Surrogate Model
Random Forest Regressions = Surrogate Model
Tree Parzen Estimators (TPE) = Surrogate Model

Which of the following is used to express the selection function?

Expected Improvement (C)

The threshold value of the objective function is directly related to the result of the selection function.

True (A)

In the Tree-structured Parzen Estimator (TPE), what is modeled?

The probability of x given y (A)

The Tree Parzen Estimator uses ______ models to represent the probability distribution above and below a threshold.

density

Which area of machine learning includes Neural Architecture Search and Hyperparameter Optimization?

Automated Machine Learning

Flashcards

Hyperparameter Optimization

Identifying good hyperparameter configurations from possible configurations.

Data Preprocessing

Normalization & data augmentation

Neural Architecture Search (NAS)

Finding the best neural network architecture automatically.

Hyperparameter Optimization

Optimizing batch size, learning rate, and momentum.


Configuration Selection

Selecting which configurations to try.


Configuration Evaluation

Assessing the quality of a hyperparameter configuration.


Conditional Hyperparameters

Hyperparameter B is only active if hyperparameter A is set a certain way.


Blackbox Hyperparameter Optimization

Treating DNN validation performance as an expensive blackbox function of the hyperparameter values.


Grid Search

Every hyperparameter combination is tested.


Random Search

Random combinations of hyperparameter values are tested.


Training resources

Size of the dataset, number of features, iterations, and training time.


Underlying principle of Successive Halving

Even if performance after a small number of iterations is very unrepresentative of the absolute performance of any configuration, its relative performance compared with many alternatives trained with the same number of iterations is roughly maintained.


Successive Halving

Evenly allocate a budget to a set of configurations and eliminate the worst performers.


Hyperband

Formulating hyperparameter optimization as a pure-exploration adaptive resource allocation problem addressing how to allocate resources among randomly sampled hyperparameter configurations.


get_hyperparameter_configuration(n)

Returns a set of n i.i.d. samples from some distribution defined over the hyperparameter configuration space.


run_then_return_val_loss(t, r)

Takes in a hyperparameter configuration and resource allocation and returns the validation loss after training the configuration.


Combined Algorithm Selection and Hyperparameter Optimization (CASH)

The problem of finding the combination of algorithm A* = A(i) and hyperparameter configuration λ* that minimizes the loss.


Bayesian Hyperparameter Optimization

Build a probability model of the objective function and use it to select the most promising hyperparameters to evaluate in the true objective function.


Sequential Model-Based Optimization (SMBO)

Running trials one after another, each time trying better hyperparameters by applying Bayesian reasoning and updating a probability model (the surrogate).


Surrogate Model

Approximates the objective function.


Selection Function

Used to choose the next hyperparameters to try.


Study Notes

Automated Machine Learning

  • Involves data preprocessing, neural architecture search (NAS), and hyperparameter optimization.
  • Data preprocessing includes normalization and data augmentation.

Neural Architecture Search (NAS)

  • Can use standard architectures or synthesize new ones.
  • Synthesizing a new architecture involves deciding on the types of layers, the number of layers, and how to stack them, as well as defining the convolution layer parameters.

Hyperparameter Optimization

  • Includes tuning batch size, learning rate, and momentum.
  • NAS and hyperparameter optimization can be done jointly or sequentially.
  • Example tools include IBM AutoAI (e.g., running a sample AutoAI experiment in IBM Watson Machine Learning) and H2O AutoML.

Hyperparameter Optimization

  • This is the process of identifying the best hyperparameter configuration(s) from a set of possible configurations.
  • Consists of two sub-problems: configuration selection and configuration evaluation.
  • Configuration selection focuses on efficiently selecting a good configuration.
  • Configuration evaluation involves adaptive computation, allocating more resources to promising configurations while eliminating poor ones.

Mathematical Formulation

  • The critical step is choosing the set of trials.
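
A minimal sketch of the standard formulation, assuming f(λ) denotes the validation loss obtained by training with configuration λ drawn from the configuration space Λ (the notation is an assumption, not taken from the lecture):

```latex
\lambda^{*} \in \operatorname*{arg\,min}_{\lambda \in \Lambda} f(\lambda)
```

Each evaluation of f(λ) requires training and validating a model, which is why choosing the set of trials carefully matters.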

Types of Hyperparameters

  • Continuous hyperparameters, such as learning rate.
  • Integer hyperparameters, such as number of units.
  • Categorical hyperparameters, which have a finite domain, unordered.
    • Examples: algorithm choice (SVM, RF, NN), activation function (ReLU, Leaky ReLU, tanh), and operator for convolution (conv3x3, separable conv3x3, max pool).

Conditional Hyperparameters

  • Conditional hyperparameters are only active if other hyperparameters are set a certain way.
    • Example: hyperparameter B is Adam's second momentum hyperparameter and is only active if hyperparameter A is set to Adam as the choice of optimizer (see the sketch below).
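
A minimal Python sketch of such a search space, covering the continuous, integer, categorical, and conditional hyperparameter types above; the names and ranges are illustrative assumptions, not from the lecture:

```python
import random

def sample_configuration():
    config = {
        "learning_rate": 10 ** random.uniform(-5, -1),                # continuous
        "num_units": random.randint(16, 512),                         # integer
        "activation": random.choice(["relu", "leaky_relu", "tanh"]),  # categorical
        "optimizer": random.choice(["sgd", "adam"]),                  # categorical
    }
    # Conditional hyperparameter: beta2 (Adam's second momentum term)
    # is only active when the optimizer hyperparameter is set to Adam.
    if config["optimizer"] == "adam":
        config["beta2"] = random.uniform(0.9, 0.999)
    return config
```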

Blackbox Hyperparameter Optimization

  • Treats the mapping from a DNN hyperparameter setting λ to validation performance as a blackbox function f(λ).
  • Sample efficiency is important because the blackbox function is expensive to evaluate.

Techniques for Hyperparameter Optimization

  • Grid search.
  • Random search.
  • Hyperband: random configuration search with adaptive resource allocation.
  • Bayesian optimization methods.
    • Focus on configuration selection.
    • Identify good configurations more quickly than standard baselines by selecting configurations in an adaptive manner.
  • Bayesian optimization with adaptive resource allocation.

Grid Search

  • Involves evaluating a model for every combination of a preset list of hyperparameter values.
  • With K hyperparameters, grid search requires choosing a set of values for each one (L(1), ..., L(K)); the number of trials is the product of their sizes.
  • Suffers from the curse of dimensionality.

Random Search

  • Tests random combinations of hyperparameter values to find the best solution for the built model.
  • Empirically and theoretically, random search is more efficient for parameter optimization than grid search; a sketch comparing the two follows.
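
A minimal sketch contrasting grid and random search, assuming a hypothetical train_and_validate(config) function that trains a model with the given configuration and returns its validation loss:

```python
import itertools
import random

# Illustrative search space; values are assumptions, not from the lecture.
param_grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "momentum": [0.0, 0.9],
}

def grid_search(train_and_validate):
    # Evaluates every combination from the preset lists: the trial count
    # grows multiplicatively with each hyperparameter added
    # (the curse of dimensionality).
    keys = list(param_grid)
    trials = [dict(zip(keys, values))
              for values in itertools.product(*(param_grid[k] for k in keys))]
    return min(trials, key=train_and_validate)

def random_search(train_and_validate, n_trials=10):
    # Samples random combinations; the trial count stays fixed no matter
    # how many hyperparameters the space contains.
    trials = [{k: random.choice(v) for k, v in param_grid.items()}
              for _ in range(n_trials)]
    return min(trials, key=train_and_validate)
```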

Training Resources

  • Resources include the size of the training set, number of features, number of iterations for iterative algorithms, and hours of training time.

Validation Loss vs Resource Allocated

  • The shaded areas bound the maximum distance of the intermediate losses from the terminal validation loss, monotonically decreasing with the resource.
  • Distinguishing between configurations is possible when the envelopes no longer overlap.
  • More resources are needed to differentiate between configurations when the envelope functions are wider or the terminal losses are closer together.

Successive Halving

  • The underlying principle: even if performance after a small number of iterations is very unrepresentative of a configuration's absolute performance, its relative performance compared with many alternatives trained with the same number of iterations is roughly maintained.
  • Successive Halving uniformly allocates a budget to a set of configurations, evaluates their performance, eliminates the worst performers, and allocates more resources to the more promising configurations.
  • It suffers from an "n vs B/n" trade-off, where n is the number of configurations and B is the total budget (see the sketch after the next section).

n vs B/n

  • As a simple strategy: if hyperparameter configurations can be discriminated quickly, n should be chosen large.
  • Conversely, if hyperparameter configurations are slow to differentiate, B/n should be large.
  • If n is large, some good configurations that are slow to converge at the beginning will be killed off early.
  • If B/n is large, bad configurations will be given a lot of resources, even though they could have been stopped earlier.
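
A minimal sketch of Successive Halving, assuming a hypothetical run_then_return_val_loss(config, r) evaluator that trains a configuration with r resources and returns its validation loss (this helper reappears in the Hyperband pseudocode below):

```python
import math

def successive_halving(configs, B, run_then_return_val_loss, eta=2):
    num_rounds = int(math.ceil(math.log(len(configs), eta)))
    for _ in range(num_rounds):
        # Evenly allocate this round's share of the total budget B
        # across the surviving configurations ...
        r = B / (num_rounds * len(configs))
        losses = [run_then_return_val_loss(c, r) for c in configs]
        # ... then eliminate the worst performers, keeping the top 1/eta.
        ranked = sorted(zip(losses, configs), key=lambda pair: pair[0])
        configs = [c for _, c in ranked[:max(1, len(configs) // eta)]]
    return configs[0]  # the surviving, most promising configuration
```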

Hyperband

  • Optimization is formulated as a pure-exploration resource allocation problem, addressing how to allocate resources among randomly sampled hyperparameter configurations.
  • Runs several possible values of n, performing a grid search over the feasible values of n.
  • Each value of n corresponds to a different degree of aggressiveness in early stopping.
  • Operates under a fixed resource constraint.

Hyperband Algorithm Details

  • R is the maximum amount of resource allocated to a single configuration, and η controls the proportion of configurations discarded in each round (only the top 1/η are kept).
  • get_hyperparameter_configuration(n): returns a set of n independent and identically distributed (i.i.d.) samples, drawn uniformly from the hyperparameter configuration space.
  • run_then_return_val_loss(t, r): takes a hyperparameter configuration t and a resource allocation r as inputs and returns the validation loss after training.
  • top_k(configs, losses, k): takes a set of configurations and their associated losses and returns the top k performing configurations. A sketch combining these helpers follows.
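
A minimal sketch of the Hyperband outer loop built from these helpers; the bracket arithmetic follows the published algorithm, while the helper functions themselves are assumed to be provided:

```python
import math

# get_hyperparameter_configuration(n) -> list of n sampled configurations
# run_then_return_val_loss(t, r)      -> validation loss of configuration t
#                                        after training with r resources
def hyperband(get_hyperparameter_configuration, run_then_return_val_loss,
              R=81, eta=3):
    s_max = int(math.log(R, eta) + 1e-9)  # brackets are indexed s_max .. 0
    B = (s_max + 1) * R                   # total budget per bracket
    best = (float("inf"), None)           # (loss, configuration)

    for s in reversed(range(s_max + 1)):  # large s = aggressive early stopping
        n = int(math.ceil(B / R * eta ** s / (s + 1)))  # initial configurations
        r = R * eta ** (-s)                             # initial resource each
        T = get_hyperparameter_configuration(n)

        # SuccessiveHalving inner loop for this bracket.
        for i in range(s + 1):
            n_i = int(n * eta ** (-i))
            r_i = r * eta ** i
            ranked = sorted(((run_then_return_val_loss(t, r_i), t) for t in T),
                            key=lambda pair: pair[0])
            best = min(best, ranked[0], key=lambda pair: pair[0])
            # top_k: keep only the best-performing n_i / eta configurations.
            T = [t for _, t in ranked[:max(1, int(n_i / eta))]]
    return best
```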

Hyperband Behavior

  • Each inner loop indexed by s is designed to take B total iterations and each value of s takes about the same amount of time on average.
  • For large values of s, many configurations are considered, but hyperparameters are discarded after only a very small number of iterations, which may be undesirable.
  • For small values of s, fewer configurations are considered, and the algorithm does not throw out hyperparameters until after many iterations.

AutoML

  • A central AutoML problem is Combined Algorithm Selection and Hyperparameter Optimization (CASH).
  • The CASH problem is to find the combination of algorithm A* = A(i) and hyperparameter configuration λ* that minimizes the loss, as formalized below.
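
A sketch of the CASH objective in the usual notation, assuming k cross-validation splits D_train(i), D_valid(i) and a loss function L (these symbols are assumptions beyond what the lecture text states):

```latex
A^{*}_{\lambda^{*}} \in \operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\ \lambda \in \Lambda^{(j)}}
\frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\!\left(A^{(j)}_{\lambda},\; D_{\mathrm{train}}^{(i)},\; D_{\mathrm{valid}}^{(i)}\right)
```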

Bayesian Hyperparameter Optimization

  • Involves building a probability model of the objective function and using it to select the most promising hyperparameters.
  • Has two steps: fit a probabilistic model to the function evaluations, and use that model to trade off exploration vs. exploitation.

Bayesian Optimization Details

  • Steps:
    • Build a surrogate probability model of the objective function.
    • Find the hyperparameters that perform best on the surrogate.
    • Apply these hyperparameters to the true objective function.
    • Update the surrogate model incorporating the new results.
    • Repeat steps 2–4 until max iterations or time is reached.
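
A minimal sketch of this loop for a single continuous hyperparameter (e.g., the log10 learning rate), using a Gaussian process surrogate and expected improvement as the selection function; objective() and the bounds are illustrative assumptions standing in for the true validation loss:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

# Expected improvement (the selection function), written for minimization.
def expected_improvement(candidates, gp, best_loss):
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)          # avoid division by zero
    z = (best_loss - mu) / sigma
    return (best_loss - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bayesian_optimization(objective, bounds=(-5.0, -1.0), n_iter=20):
    X = list(np.random.uniform(*bounds, size=3))   # a few random warm-up trials
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        # Step 1: build/update the surrogate probability model.
        gp = GaussianProcessRegressor().fit(np.array(X).reshape(-1, 1), y)
        # Step 2: find the hyperparameter that performs best on the surrogate.
        candidates = np.linspace(*bounds, 200).reshape(-1, 1)
        x_next = candidates[np.argmax(expected_improvement(candidates, gp,
                                                           min(y)))][0]
        # Step 3: apply it to the true (expensive) objective function.
        y.append(objective(x_next))
        # Step 4: the new (x, y) pair updates the surrogate on the next pass.
        X.append(x_next)
    return X[int(np.argmin(y))], min(y)
```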

Sequential Model-Based Optimization (SMBO)

  • The process of running trials one after another, each time trying better hyperparameters by applying Bayesian reasoning and updating a probability model (surrogate).
  • Main components:
    • A domain of hyperparameters over which to search.
    • An objective function which takes in hyperparameters and outputs a score that we want to minimize (or maximize).
    • The surrogate model of the objective function.
    • A criterion, called a selection function, for evaluating which hyperparameters to choose next from the surrogate model.
    • A history consisting of (score, hyperparameter) pairs used by the algorithm to update the surrogate model.

Surrogate Models

  • Gaussian Processes.
  • Random Forest Regressions.
  • Tree Parzen Estimators (TPE).

Selection Function

  • Expected Improvement.
  • The expected improvement formula relies on a threshold value of the objective function (y*), a proposed set of hyperparameters (x), the value of the objective function under those hyperparameters (y), and a surrogate probability model p(y|x), as written below.
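
A sketch of the expected improvement selection function in this notation, written for minimization with threshold y*, proposed hyperparameters x, and surrogate p(y|x):

```latex
\operatorname{EI}_{y^{*}}(x) = \int_{-\infty}^{y^{*}} \left(y^{*} - y\right)\, p(y \mid x)\, \mathrm{d}y
```

The Tree-structured Parzen Estimator instead models p(x|y) with two density models, l(x) for trials below the threshold and g(x) for trials above it; maximizing expected improvement then amounts to maximizing the ratio l(x)/g(x).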
