Questions and Answers
What is the primary goal of hyperparameter optimization?
- To reduce the training time of machine learning models.
- To find the optimal hyperparameter configuration(s) from a set of possible configurations. (correct)
- To normalize the input data for machine learning models.
- To identify the best architecture for a neural network.
Data normalization is NOT considered a part of data preprocessing.
False (B)
Which of the following is NOT a type of layer parameter considered during neural architecture search?
- Number of different layers
- How to stack the layers
- Types of layers (conv, maxpool)
- Activation function of fully connected layers (correct)
Which of the following is an example of a hyperparameter that is optimized in machine learning models?
Neural Architecture Search (NAS) and hyperparameter optimization must always be performed sequentially, one after the other.
Name one of the Automated Machine Learning tools mentioned.
Which of the following falls under the category of 'Configuration Selection' in hyperparameter optimization?
In hyperparameter optimization, allocating more resources to promising hyperparameter configurations is part of ______ evaluation.
Match the hyperparameter types with their example:
A hyperparameter whose activity depends on the value of another hyperparameter is known as a:
In blackbox hyperparameter optimization, lower sample efficiency is preferred since the function is assumed to be inexpensive to evaluate.
Which search technique involves evaluating every combination of hyperparameters from a preset list?
What is a primary disadvantage of grid search?
Random search is generally less efficient for parameter optimization compared to grid search.
Name one training resource that is often considered during hyperparameter optimization of machine learning models.
What kind of resources does the validation loss depend on?
Which of the following best describes the underlying principle of Successive Halving?
Successive Halving always allocates the same amount of resources to all hyperparameter configurations.
What is the trade-off that Successive Halving suffers from?
The Hyperband algorithm addresses the hyperparameter optimization problem by:
The Hyperband algorithm performs a type of ______ search over the feasible values of 'n'.
In the context of the Hyperband algorithm, a smaller value of 's' means the algorithm will throw out hyperparameters early in the process.
What does the acronym CASH stand for in the context of AutoML?
What is used to find the best combination of algorithm and hyperparameter configuration in combined algorithm selection and hyperparameter optimization (CASH)?
What is the first step in Bayesian Hyperparameter Optimization?
Surrogate probability models are NOT employed in Bayesian hyperparameter optimization processes.
In Sequential Model-Based Optimization (SMBO), Bayesian reasoning is applied to improve:
In the SMBO framework, a ______ function helps evaluate which hyperparameters to choose next.
Match the following Surrogate models to a valid technique:
Which of the following is used to express the selection function?
The threshold value of the objective function is directly related to the result of the selection function.
In the Tree-structured Parzen Estimator (TPE), what is modeled?
The Tree Parzen Estimator uses ______ models to represent the probability distribution above and below a threshold.
Which area of Machine learning includes Neural Architecture Search and Hyperparameter Optimization?
Flashcards
Hyperparameter Optimization
Identifying good hyperparameter configurations from possible configurations.
Data Preprocessing
Normalization & data augmentation
Neural Architecture Search (NAS)
Finding the best neural network architecture automatically.
Configuration Selection
Efficiently selecting a good hyperparameter configuration.
Configuration Evaluation
Adaptive computation that allocates more resources to promising configurations and eliminates poor ones.
Conditional Hyperparameters
Hyperparameters that are only active if other hyperparameters are set a certain way.
Blackbox Hyperparameter Optimization
Treating the validation performance f(λ) of a hyperparameter setting as an expensive blackbox function, so sample efficiency matters.
Grid Search
Evaluating a model for every combination of a preset list of hyperparameter values.
Random Search
Evaluating random combinations of hyperparameters to find the best solution for the model.
Training resources
Size of the training set, number of features, number of iterations for iterative algorithms, and hours of training time.
Underlying principle of Successive Halving
Even if early performance is unrepresentative of absolute performance, relative performance is roughly maintained.
Successive Halving
Uniformly allocates a budget to a set of configurations, evaluates them, and gives more resources to the more promising ones while discarding the rest.
Hyperband
Random configuration search with adaptive resource allocation; performs a grid search over feasible values of n.
get_hyperparameter_configuration (n)
Returns a set of n i.i.d. configurations sampled uniformly from the hyperparameter space.
run_then_return_val_loss(t, r)
Trains configuration t with resource allocation r and returns its validation loss.
Combined Algorithm Selection and Hyperparameter Optimization (CASH)
Finding the combination of algorithm A* and hyperparameter configuration λ* that minimizes loss.
Bayesian Hyperparameter Optimization
Building a probability model of the objective function and using it to select the most promising hyperparameters.
Sequential Model-Based Optimization (SMBO)
Running trials one after another, updating a surrogate probability model via Bayesian reasoning to choose better hyperparameters each time.
Surrogate Model
A probability model of the objective function (e.g., Gaussian Processes, Random Forest regressions, Tree Parzen Estimators).
Selection Function
A criterion (e.g., Expected Improvement) for evaluating which hyperparameters to choose next from the surrogate model.
Study Notes
Automated Machine Learning
- Involves data preprocessing, neural architecture search (NAS), and hyperparameter optimization.
- Data preprocessing includes normalization and data augmentation.
Neural Architecture Search (NAS)
- Can use standard architectures or synthesize new ones.
- Synthesizing a new architecture involves deciding on the types of layers, the number of layers, and how to stack them, as well as defining convolution layer parameters.
Hyperparameter Optimization
- Includes tuning batch size, learning rate, and momentum.
- NAS and hyperparameter optimization can be done jointly or sequentially.
- Example AutoML tools: IBM AutoAI (e.g., running a sample AutoAI experiment in IBM WML) and H2O AutoML.
Hyperparameter Optimization
- This is the process of identifying the best hyperparameter configuration(s) from a set of possible configurations.
- Consists of two sub-problems: configuration selection and configuration evaluation.
- Configuration selection focuses on efficiently selecting a good configuration.
- Configuration evaluation involves adaptive computation, allocating more resources to promising configurations while eliminating poor ones.
Mathematical Formulation
- The critical step is choosing the set of trials.
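In symbols, hyperparameter optimization is usually written as blackbox minimization of the validation performance f(λ) over the configuration space Λ (a standard formulation, sketched here rather than quoted from the slides):

```latex
% Hyperparameter optimization as blackbox minimization (standard notation, assumed here)
\lambda^{*} \;=\; \operatorname*{arg\,min}_{\lambda \in \Lambda} f(\lambda)
% The "set of trials" is the finite subset \{\lambda_{1}, \dots, \lambda_{S}\} \subset \Lambda
% that is actually evaluated; choosing it well is the critical step mentioned above.
```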
Types of Hyperparameters
- Continuous hyperparameters, such as learning rate.
- Integer hyperparameters, such as number of units.
- Categorical hyperparameters, which have a finite, unordered domain.
- Examples: algorithm choice (SVM, RF, NN), activation function (ReLU, Leaky ReLU, tanh), and operator for convolution (conv3x3, separable conv3x3, max pool).
Conditional Hyperparameters
- Conditional hyperparameters are only active if other hyperparameters are set a certain way.
- Example: Hyperparameter B is Adam's second momentum hyperparameter and is only active if hyperparameter A is set to Adam as the choice of optimizer.
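A minimal sketch of such a conditional search space, using only the standard library; the hyperparameter names and ranges are illustrative, not taken from the course material:

```python
import random

def sample_configuration():
    """Sample one configuration; beta2 is conditional on the optimizer choice."""
    config = {
        "optimizer": random.choice(["sgd", "adam"]),
        "learning_rate": 10 ** random.uniform(-4, -1),
    }
    if config["optimizer"] == "adam":
        # Conditional hyperparameter: only active when Adam is selected.
        config["beta2"] = random.uniform(0.9, 0.999)
    return config

print(sample_configuration())
```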
Blackbox Hyperparameter Optimization
- Consists of a DNN hyperparameter setting λ and a validation performance function f(λ).
- Sample efficiency is important because the blackbox function is expensive to evaluate.
Techniques for Hyperparameter Optimization
- Grid search.
- Random search.
- Hyperband: random configuration search with adaptive resource allocation.
- Bayesian optimization methods: focus on configuration selection and identify good configurations more quickly than standard baselines by selecting configurations in an adaptive manner.
- Bayesian optimization with adaptive resource allocation.
Grid Search
- Involves evaluating a model for every combination of a preset list of hyperparameter values.
- K represents the number of hyperparameters.
- Grid search requires choosing a set of values L(1)...L(K) for each hyperparameter; the number of trials is the product of the set sizes, which grows exponentially with K.
- It therefore suffers from the curse of dimensionality.
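A minimal grid-search sketch; `evaluate` is a placeholder for training a model and returning its validation loss, and the value lists are illustrative:

```python
from itertools import product

grid = {
    "learning_rate": [1e-3, 1e-2, 1e-1],  # L(1)
    "batch_size": [32, 64],               # L(2)
}

def evaluate(config):
    # Placeholder objective; in practice this trains a model and returns validation loss.
    return (config["learning_rate"] - 1e-2) ** 2 + config["batch_size"] * 1e-4

# Every combination is evaluated: |L(1)| * |L(2)| = 6 trials here, exponential in K.
best = min((dict(zip(grid, values)) for values in product(*grid.values())), key=evaluate)
print("best configuration:", best)
```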
Random Search
- Random search is a technique that uses random combinations of hyperparameters to find the best solution for the model.
- Empirically and theoretically, random search is more efficient for parameter optimization than grid search.
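The same toy setup with random sampling instead of a grid; again `evaluate` is a placeholder objective, not a specific library API:

```python
import random

def sample():
    return {"learning_rate": 10 ** random.uniform(-4, 0),
            "batch_size": random.choice([16, 32, 64, 128])}

def evaluate(config):
    # Placeholder objective standing in for training + validation.
    return (config["learning_rate"] - 1e-2) ** 2 + config["batch_size"] * 1e-4

# A fixed budget of random trials; each draw explores a new point in the full space.
best = min((sample() for _ in range(20)), key=evaluate)
print("best configuration:", best)
```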
Training Resources
- Resources include the size of the training set, number of features, number of iterations for iterative algorithms, and hours of training time.
Validation Loss vs Resource Allocated
- The shaded areas bound the maximum distance of the intermediate losses from the terminal validation loss, monotonically decreasing with the resource.
- Distinguishing between configurations is possible when the envelopes no longer overlap.
- More resources are needed to differentiate between configurations when the envelope functions are wider or the terminal losses are closer together.
Successive Halving
- The underlying principle is that if performance after a small number of iterations is unrepresentative of absolute performance, then the relative performance is roughly maintained.
- Successive Halving uniformly allocates a budget to a set of configurations, evaluates their performance, and allocates more resources to the more promising configurations while discarding the rest.
- It has an "n vs B/n" trade-off, where n is the number of configurations and B is the total budget.
n vs B/n
- If hyperparameter configurations can be discriminated quickly, n should be chosen large.
- If hyperparameter configurations are slow to differentiate, B/n should be large.
- If n is large, then some good configurations which can be slow to converge at the beginning will be killed off early.
- If B/n is large, then bad configurations will be given a lot of resources, even though they could have been stopped before.
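A sketch of Successive Halving under this trade-off; `sample` and `partial_train` are invented placeholders for drawing a configuration and training it with a given resource budget:

```python
import random

def sample():
    return {"learning_rate": 10 ** random.uniform(-4, 0)}

def partial_train(config, resource):
    # Placeholder: loss shrinks with more resource and with a better learning rate.
    return abs(config["learning_rate"] - 1e-2) + 1.0 / (1 + resource)

def successive_halving(n, resource_per_round):
    configs = [sample() for _ in range(n)]
    resource = resource_per_round
    while len(configs) > 1:
        losses = [partial_train(c, resource) for c in configs]
        # Keep the better half and double the resource for the survivors.
        ranked = sorted(range(len(configs)), key=lambda i: losses[i])
        configs = [configs[i] for i in ranked[: max(1, len(configs) // 2)]]
        resource *= 2
    return configs[0]

print(successive_halving(n=16, resource_per_round=1))
```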
Hyperband
- Optimization is formulated as a pure-exploration resource allocation problem, addressing how to allocate resources among randomly sampled hyperparameter configurations.
- Considers several possible values of n, effectively performing a grid search over the feasible values of n.
- Each value of n corresponds to a different degree of aggressiveness in early stopping.
- Operates under a fixed resource constraint.
Hyperband Algorithm Details
- R is the maximum amount of resource allocated to a single configuration, and η controls the proportion of configurations discarded in each round.
- get_hyperparameter_configuration (n) returns a set of n independent and identically distributed (i.i.d.) samples drawn uniformly from the hyperparameter space.
- run_then_return_val_loss(t, r) function takes a hyperparameter configuration t and resource allocation r as inputs; function returns the validation loss after training.
- top_k(configs, losses, k) function takes a set of configurations and associated losses, function returns the top k performing configurations.
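A control-flow sketch of Hyperband built on those three subroutines (assumed to be supplied by the caller); the bracket arithmetic follows the published pseudocode, with bookkeeping of the overall best configuration simplified:

```python
from math import ceil, floor, log

def hyperband(R, eta, get_hyperparameter_configuration,
              run_then_return_val_loss, top_k):
    s_max = floor(log(R) / log(eta))   # number of brackets (degrees of aggressiveness)
    B = (s_max + 1) * R                # total resource per bracket
    best = None
    for s in range(s_max, -1, -1):     # grid search over the feasible values of n via s
        n = ceil((B / R) * (eta ** s) / (s + 1))
        r = R * eta ** (-s)            # initial resource per configuration
        T = get_hyperparameter_configuration(n)
        for i in range(s + 1):         # inner SuccessiveHalving loop
            n_i = floor(n * eta ** (-i))
            r_i = r * eta ** i
            losses = [run_then_return_val_loss(t, r_i) for t in T]
            T = top_k(T, losses, max(1, floor(n_i / eta)))
        if T:                          # simplified: remember the last survivor per bracket
            best = T[0]
    return best
```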
Hyperband Behavior
- Each inner loop indexed by s is designed to take B total iterations and each value of s takes about the same amount of time on average.
- For large values of s, many configurations are considered, but hyperparameters are discarded after only a very small number of iterations, which may be undesirable.
- For small values of s, fewer configurations are considered, and the algorithm does not throw out hyperparameters until after many iterations.
AutoML
- AutoML can be framed as Combined Algorithm Selection and Hyperparameter Optimization (CASH).
- The CASH problem is to find a combination of algorithm A* = A(i) and hyperparameter configuration λ* that minimizes loss.
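The usual way this objective is written (the Auto-WEKA-style formulation, assuming k cross-validation splits; the notation is not taken verbatim from the slides):

```latex
% CASH objective over algorithms A^{(j)} \in \mathcal{A} and their hyperparameter spaces \Lambda^{(j)}
A^{*}_{\lambda^{*}} \;\in\; \operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}}
  \frac{1}{k} \sum_{i=1}^{k}
  \mathcal{L}\!\left(A^{(j)}_{\lambda},\; D_{\text{train}}^{(i)},\; D_{\text{valid}}^{(i)}\right)
```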
Bayesian Hyperparameter Optimization
- Involves building a probability model of the objective function and using it to select the most promising hyperparameters.
- Has two steps: fit a probabilistic model to the function evaluations, and use that model to trade off exploration vs. exploitation when deciding where to evaluate next.
Bayesian Optimization Details
- Steps:
- Build a surrogate probability model of the objective function.
- Find the hyperparameters that perform best on the surrogate.
- Apply these hyperparameters to the true objective function.
- Update the surrogate model incorporating the new results.
- Repeat steps 2–4 until max iterations or time is reached.
Sequential Model-Based Optimization (SMBO)
- The process of running trials one after another, each time trying better hyperparameters by applying Bayesian reasoning and updating a probability model (surrogate).
- Main components:
- A domain of hyperparameters over which to search.
- An objective function which takes in hyperparameters and outputs a score that we want to minimize (or maximize).
- The surrogate model of the objective function.
- A criterion, called a selection function, for evaluating which hyperparameters to choose next from the surrogate model.
- A history consisting of (score, hyperparameter) pairs used by the algorithm to update the surrogate model.
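A minimal SMBO sketch, assuming scikit-learn and SciPy are available: a Gaussian-process surrogate (see the next section) is refit each round, Expected Improvement selects the next point, and the 1-D objective is a toy stand-in for the real validation loss:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):                      # placeholder "expensive" blackbox f(lambda)
    return np.sin(3 * x) + 0.1 * x ** 2

def expected_improvement(candidates, gp, y_best):
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma          # improvement below the current best (minimization)
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# History of (hyperparameter, score) pairs, seeded with a few random evaluations.
X = np.random.uniform(-2, 2, size=(3, 1))
y = objective(X).ravel()

for _ in range(15):                    # SMBO loop: fit surrogate, select, evaluate, update
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    candidates = np.linspace(-2, 2, 200).reshape(-1, 1)
    ei = expected_improvement(candidates, gp, y.min())
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("best x:", X[np.argmin(y)], "best f(x):", y.min())
```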
Surrogate Models
- Gaussian Processes.
- Random Forest Regressions.
- Tree Parzen Estimators (TPE).
Selection Function
- Expected Improvement.
- The Expected Improvement formula relies on a threshold value of the objective function (y*), a proposed set of hyperparameters (x), the value of the objective function under those hyperparameters, and a surrogate probability model.
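For a minimization problem with threshold y*, the selection function is commonly written as the following integral (standard form, not quoted from the slides); TPE works with p(x|y) rather than p(y|x), splitting it into two densities at the threshold:

```latex
% Expected Improvement with respect to a threshold y* on the objective
\mathrm{EI}_{y^{*}}(x) \;=\; \int_{-\infty}^{y^{*}} \bigl(y^{*} - y\bigr)\, p(y \mid x)\, \mathrm{d}y
% TPE-style surrogate: model p(x \mid y) with two densities split at y*,
%   l(x) = p(x \mid y < y^{*}),\qquad g(x) = p(x \mid y \ge y^{*}),
% and prefer hyperparameters x with a large ratio l(x)/g(x).
```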