Questions and Answers
What is the primary goal of hyperparameter optimization?
- To reduce the training time of machine learning models.
- To find the optimal hyperparameter configuration(s) from a set of possible configurations. (correct)
- To normalize the input data for machine learning models.
- To identify the best architecture for a neural network.
Data normalization is NOT considered a part of data preprocessing.
False (B)
Which of the following is NOT a type of layer parameter considered during neural architecture search?
- Number of different layers
- How to stack the layers
- Types of layers (conv, maxpool)
- Activation function of fully connected layers (correct)
Which of the following is an example of a hyperparameter that is optimized in machine learning models?
Neural Architecture Search (NAS) and hyperparameter optimization must always be performed sequentially, one after the other.
Name one of the Automated Machine Learning tools mentioned.
Which of the following falls under the category of 'Configuration Selection' in hyperparameter optimization?
In hyperparameter optimization, allocating more resources to promising hyperparameter configurations is part of ______ evaluation.
Match the hyperparameter types with their example:
A hyperparameter whose activity depends on the value of another hyperparameter is known as a:
In blackbox hyperparameter optimization, lower sample efficiency is preferred since the function is assumed to be inexpensive to evaluate.
Which search technique involves evaluating every combination of hyperparameters from a preset list?
What is a primary disadvantage of grid search?
Random search is generally less efficient for parameter optimization compared to grid search.
Name one training resource that is often considered during hyperparameter optimization of machine learning models.
What kind of resources does the validation loss depend on?
Which of the following best describes the underlying principle of Successive Halving?
Successive Halving always allocates the same amount of resources to all hyperparameter configurations.
What is the trade-off that Successive Halving suffers from?
The Hyperband algorithm addresses the hyperparameter optimization problem by:
The Hyperband algorithm performs a type of ______ search over the feasible values of 'n'.
In the context of the Hyperband algorithm, a smaller value of 's' means the algorithm will throw out hyperparameters early in the process.
What does the acronym CASH stand for in the context of AutoML?
What is used to find the best combination of algorithm and hyperparameter configuration in combined algorithm selection and hyperparameter optimization (CASH)?
What is the first step in Bayesian Hyperparameter Optimization?
Surrogate probability models are NOT employed in Bayesian hyperparameter optimization processes.
In Sequential Model-Based Optimization (SMBO), Bayesian reasoning is applied to improve:
In the SMBO framework, a ______ function helps evaluate which hyperparameters to choose next.
Match the following Surrogate models to a valid technique:
Which of the following is used to express the selection function?
The threshold value of the objective function is directly related to the result of the selection function.
In the Tree-structured Parzen Estimator (TPE), what is modeled?
The Tree Parzen Estimator uses ______ models to represent the probability distribution above and below a threshold.
Which area of Machine learning includes Neural Architecture Search and Hyperparameter Optimization?
Flashcards
Hyperparameter Optimization
Identifying good hyperparameter configurations from possible configurations.
Data Preprocessing
Normalization & data augmentation
Neural Architecture Search (NAS)
Finding the best neural network architecture automatically.
Configuration Selection
Efficiently selecting a good hyperparameter configuration.
Configuration Evaluation
Adaptive computation that allocates more resources to promising configurations and eliminates poor ones.
Conditional Hyperparameters
Hyperparameters that are only active if other hyperparameters are set a certain way.
Blackbox Hyperparameter Optimization
Treating the validation performance f(λ) of a hyperparameter setting as an expensive blackbox function, so sample efficiency matters.
Grid Search
Evaluating a model for every combination of a preset list of hyperparameter values.
Random Search
Evaluating random combinations of hyperparameters to find the best solution for the model.
Training resources
Size of the training set, number of features, number of iterations for iterative algorithms, and hours of training time.
Underlying principle of Successive Halving
Even if early performance is unrepresentative of absolute performance, relative performance is roughly maintained.
Successive Halving
Uniformly allocates a budget to a set of configurations, evaluates them, and gives more resources to the more promising ones while discarding the rest.
Hyperband
Random configuration search with adaptive resource allocation; performs a grid search over feasible values of n.
get_hyperparameter_configuration (n)
Returns a set of n i.i.d. configurations sampled uniformly from the hyperparameter space.
run_then_return_val_loss(t, r)
Trains configuration t with resource allocation r and returns its validation loss.
Combined Algorithm Selection and Hyperparameter Optimization (CASH)
Finding the combination of algorithm A* and hyperparameter configuration λ* that minimizes loss.
Bayesian Hyperparameter Optimization
Building a probability model of the objective function and using it to select the most promising hyperparameters.
Sequential Model-Based Optimization (SMBO)
Running trials one after another, updating a surrogate probability model via Bayesian reasoning to choose better hyperparameters each time.
Surrogate Model
A probability model of the objective function (e.g., Gaussian Processes, Random Forest regressions, Tree Parzen Estimators).
Selection Function
A criterion (e.g., Expected Improvement) for evaluating which hyperparameters to choose next from the surrogate model.
Study Notes
Automated Machine Learning
- Involves data preprocessing, neural architecture search (NAS), and hyperparameter optimization.
- Data preprocessing includes normalization and data augmentation.
Neural Architecture Search (NAS)
- Can use standard architectures or synthesize new ones.
- Synthesizing a new architecture involves deciding on the types of layers, the number of layers, and how to stack them, as well as defining convolution layer parameters.
Hyperparameter Optimization
- Includes tuning batch size, learning rate, and momentum.
- NAS and hyperparameter optimization can be done jointly or sequentially.
- Example AutoML tools: IBM AutoAI (e.g., running a sample AutoAI experiment in IBM WML) and H2O AutoML.
Hyperparameter Optimization
- This is the process of identifying the best hyperparameter configuration(s) from a set of possible configurations.
- Consists of two sub-problems: configuration selection and configuration evaluation.
- Configuration selection focuses on efficiently selecting a good configuration.
- Configuration evaluation involves adaptive computation, allocating more resources to promising configurations while eliminating poor ones.
Mathematical Formulation
- The critical step is choosing the set of trials.
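In symbols, hyperparameter optimization is usually written as blackbox minimization of the validation performance f(λ) over the configuration space Λ (a standard formulation, sketched here rather than quoted from the slides):

```latex
% Hyperparameter optimization as blackbox minimization (standard notation, assumed here)
\lambda^{*} \;=\; \operatorname*{arg\,min}_{\lambda \in \Lambda} f(\lambda)
% The "set of trials" is the finite subset \{\lambda_{1}, \dots, \lambda_{S}\} \subset \Lambda
% that is actually evaluated; choosing it well is the critical step mentioned above.
```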
Types of Hyperparameters
- Continuous hyperparameters, such as learning rate.
- Integer hyperparameters, such as number of units.
- Categorical hyperparameters, which have a finite, unordered domain.
- Examples: algorithm choice (SVM, RF, NN), activation function (ReLU, Leaky ReLU, tanh), and operator for convolution (conv3x3, separable conv3x3, max pool).
Conditional Hyperparameters
- Conditional hyperparameters are only active if other hyperparameters are set a certain way.
- Example: Hyperparameter B is Adam's second momentum hyperparameter and is only active if hyperparameter A is set to Adam as the choice of optimizer.
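A minimal sketch of such a conditional search space, using only the standard library; the hyperparameter names and ranges are illustrative, not taken from the course material:

```python
import random

def sample_configuration():
    """Sample one configuration; beta2 is conditional on the optimizer choice."""
    config = {
        "optimizer": random.choice(["sgd", "adam"]),
        "learning_rate": 10 ** random.uniform(-4, -1),
    }
    if config["optimizer"] == "adam":
        # Conditional hyperparameter: only active when Adam is selected.
        config["beta2"] = random.uniform(0.9, 0.999)
    return config

print(sample_configuration())
```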
Blackbox Hyperparameter Optimization
- Consists of a DNN hyperparameter setting λ and a validation performance function f(λ).
- Sample efficiency is important because the blackbox function is expensive to evaluate.
Techniques for Hyperparameter Optimization
- Grid search.
- Random search.
- Hyperband: random configuration search with adaptive resource allocation.
- Bayesian optimization methods: focus on configuration selection and identify good configurations more quickly than standard baselines by selecting configurations in an adaptive manner.
- Bayesian optimization with adaptive resource allocation.
Grid Search
- Involves evaluating a model for every combination of a preset list of hyperparameter values.
- K represents the number of hyperparameters.
- Grid search requires choosing a set of values L(1)...L(K) for each hyperparameter; the number of trials is the product of the set sizes, which grows exponentially with K.
- It therefore suffers from the curse of dimensionality.
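A minimal grid-search sketch; `evaluate` is a placeholder for training a model and returning its validation loss, and the value lists are illustrative:

```python
from itertools import product

grid = {
    "learning_rate": [1e-3, 1e-2, 1e-1],  # L(1)
    "batch_size": [32, 64],               # L(2)
}

def evaluate(config):
    # Placeholder objective; in practice this trains a model and returns validation loss.
    return (config["learning_rate"] - 1e-2) ** 2 + config["batch_size"] * 1e-4

# Every combination is evaluated: |L(1)| * |L(2)| = 6 trials here, exponential in K.
best = min((dict(zip(grid, values)) for values in product(*grid.values())), key=evaluate)
print("best configuration:", best)
```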
Random Search
- Random search is a technique that uses random combinations of hyperparameters to find the best solution for the model.
- Empirically and theoretically, random search is more efficient for parameter optimization than grid search.
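The same toy setup with random sampling instead of a grid; again `evaluate` is a placeholder objective, not a specific library API:

```python
import random

def sample():
    return {"learning_rate": 10 ** random.uniform(-4, 0),
            "batch_size": random.choice([16, 32, 64, 128])}

def evaluate(config):
    # Placeholder objective standing in for training + validation.
    return (config["learning_rate"] - 1e-2) ** 2 + config["batch_size"] * 1e-4

# A fixed budget of random trials; each draw explores a new point in the full space.
best = min((sample() for _ in range(20)), key=evaluate)
print("best configuration:", best)
```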
Training Resources
- Resources include the size of the training set, number of features, number of iterations for iterative algorithms, and hours of training time.
Validation Loss vs Resource Allocated
- The shaded areas bound the maximum distance of the intermediate losses from the terminal validation loss, monotonically decreasing with the resource.
- Distinguishing between configurations is possible when the envelopes no longer overlap.
- More resources are needed to differentiate between configurations when the envelope functions are wider or the terminal losses are closer together.
Successive Halving
- The underlying principle is that if performance after a small number of iterations is unrepresentative of absolute performance, then the relative performance is roughly maintained.
- Successive Halving uniformly allocates a budget to a set of configurations, evaluates their performance, and allocates more resources to the more promising configurations while discarding the rest.
- It has an "n vs B/n" trade-off, where n is the number of configurations and B is the total budget.
n vs B/n
- If hyperparameter configurations can be discriminated quickly, n should be chosen large.
- If hyperparameter configurations are slow to differentiate, B/n should be large.
- If n is large, then some good configurations which can be slow to converge at the beginning will be killed off early.
- If B/n is large, then bad configurations will be given a lot of resources, even though they could have been stopped before.
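A sketch of Successive Halving under this trade-off; `sample` and `partial_train` are invented placeholders for drawing a configuration and training it with a given resource budget:

```python
import random

def sample():
    return {"learning_rate": 10 ** random.uniform(-4, 0)}

def partial_train(config, resource):
    # Placeholder: loss shrinks with more resource and with a better learning rate.
    return abs(config["learning_rate"] - 1e-2) + 1.0 / (1 + resource)

def successive_halving(n, resource_per_round):
    configs = [sample() for _ in range(n)]
    resource = resource_per_round
    while len(configs) > 1:
        losses = [partial_train(c, resource) for c in configs]
        # Keep the better half and double the resource for the survivors.
        ranked = sorted(range(len(configs)), key=lambda i: losses[i])
        configs = [configs[i] for i in ranked[: max(1, len(configs) // 2)]]
        resource *= 2
    return configs[0]

print(successive_halving(n=16, resource_per_round=1))
```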
Hyperband
- Optimization is formulated as a pure-exploration resource allocation problem, addressing how to allocate resources among randomly sampled hyperparameter configurations.
- Considers several possible values of n, effectively performing a grid search over the feasible values of n.
- Each value of n corresponds to a different degree of aggressiveness in early stopping.
- Operates under a fixed resource constraint.
Hyperband Algorithm Details
- R is the maximum amount of resource allocated to a single configuration, and η controls the proportion of configurations discarded in each round.
- get_hyperparameter_configuration (n) returns a set of n independent and identically distributed (i.i.d.) samples drawn uniformly from the hyperparameter space.
- run_then_return_val_loss(t, r) function takes a hyperparameter configuration t and resource allocation r as inputs; function returns the validation loss after training.
- top_k(configs, losses, k) function takes a set of configurations and associated losses, function returns the top k performing configurations.
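A control-flow sketch of Hyperband built on those three subroutines (assumed to be supplied by the caller); the bracket arithmetic follows the published pseudocode, with bookkeeping of the overall best configuration simplified:

```python
from math import ceil, floor, log

def hyperband(R, eta, get_hyperparameter_configuration,
              run_then_return_val_loss, top_k):
    s_max = floor(log(R) / log(eta))   # number of brackets (degrees of aggressiveness)
    B = (s_max + 1) * R                # total resource per bracket
    best = None
    for s in range(s_max, -1, -1):     # grid search over the feasible values of n via s
        n = ceil((B / R) * (eta ** s) / (s + 1))
        r = R * eta ** (-s)            # initial resource per configuration
        T = get_hyperparameter_configuration(n)
        for i in range(s + 1):         # inner SuccessiveHalving loop
            n_i = floor(n * eta ** (-i))
            r_i = r * eta ** i
            losses = [run_then_return_val_loss(t, r_i) for t in T]
            T = top_k(T, losses, max(1, floor(n_i / eta)))
        if T:                          # simplified: remember the last survivor per bracket
            best = T[0]
    return best
```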
Hyperband Behavior
- Each inner loop indexed by s is designed to take B total iterations and each value of s takes about the same amount of time on average.
- For large values of s, many configurations are considered, but hyperparameters are discarded after only a very small number of iterations, which may be undesirable.
- For small values of s, fewer configurations are considered, and the algorithm does not throw out hyperparameters until after many iterations.
AutoML
- AutoML can be framed as Combined Algorithm Selection and Hyperparameter Optimization (CASH).
- The CASH problem is to find a combination of algorithm A* = A(i) and hyperparameter configuration λ* that minimizes loss.
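The usual way this objective is written (the Auto-WEKA-style formulation, assuming k cross-validation splits; the notation is not taken verbatim from the slides):

```latex
% CASH objective over algorithms A^{(j)} \in \mathcal{A} and their hyperparameter spaces \Lambda^{(j)}
A^{*}_{\lambda^{*}} \;\in\; \operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}}
  \frac{1}{k} \sum_{i=1}^{k}
  \mathcal{L}\!\left(A^{(j)}_{\lambda},\; D_{\text{train}}^{(i)},\; D_{\text{valid}}^{(i)}\right)
```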
Bayesian Hyperparameter Optimization
- Involves building a probability model of the objective function and using it to select the most promising hyperparameters.
- Has two steps: fit a probabilistic model to the function evaluations, and use that model to trade off exploration vs. exploitation when deciding where to evaluate next.
Bayesian Optimization Details
- Steps:
- Build a surrogate probability model of the objective function.
- Find the hyperparameters that perform best on the surrogate.
- Apply these hyperparameters to the true objective function.
- Update the surrogate model incorporating the new results.
- Repeat steps 2–4 until max iterations or time is reached.
Sequential Model-Based Optimization (SMBO)
- The process of running trials one after another, each time trying better hyperparameters by applying Bayesian reasoning and updating a probability model (surrogate).
- Main components:
- A domain of hyperparameters over which to search.
- An objective function which takes in hyperparameters and outputs a score that we want to minimize (or maximize).
- The surrogate model of the objective function.
- A criterion, called a selection function, for evaluating which hyperparameters to choose next from the surrogate model.
- A history consisting of (score, hyperparameter) pairs used by the algorithm to update the surrogate model.
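A minimal SMBO sketch, assuming scikit-learn and SciPy are available: a Gaussian-process surrogate (see the next section) is refit each round, Expected Improvement selects the next point, and the 1-D objective is a toy stand-in for the real validation loss:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):                      # placeholder "expensive" blackbox f(lambda)
    return np.sin(3 * x) + 0.1 * x ** 2

def expected_improvement(candidates, gp, y_best):
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma          # improvement below the current best (minimization)
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# History of (hyperparameter, score) pairs, seeded with a few random evaluations.
X = np.random.uniform(-2, 2, size=(3, 1))
y = objective(X).ravel()

for _ in range(15):                    # SMBO loop: fit surrogate, select, evaluate, update
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    candidates = np.linspace(-2, 2, 200).reshape(-1, 1)
    ei = expected_improvement(candidates, gp, y.min())
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("best x:", X[np.argmin(y)], "best f(x):", y.min())
```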
Surrogate Models
- Gaussian Processes.
- Random Forest Regressions.
- Tree Parzen Estimators (TPE).
Selection Function
- Expected Improvement.
- The Expected Improvement formula relies on a threshold value of the objective function (y*), a proposed set of hyperparameters (x), the value of the objective function under those hyperparameters, and a surrogate probability model.
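For a minimization problem with threshold y*, the selection function is commonly written as the following integral (standard form, not quoted from the slides); TPE works with p(x|y) rather than p(y|x), splitting it into two densities at the threshold:

```latex
% Expected Improvement with respect to a threshold y* on the objective
\mathrm{EI}_{y^{*}}(x) \;=\; \int_{-\infty}^{y^{*}} \bigl(y^{*} - y\bigr)\, p(y \mid x)\, \mathrm{d}y
% TPE-style surrogate: model p(x \mid y) with two densities split at y*,
%   l(x) = p(x \mid y < y^{*}),\qquad g(x) = p(x \mid y \ge y^{*}),
% and prefer hyperparameters x with a large ratio l(x)/g(x).
```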