Reinforcement Learning Quiz

Reinforcement Learning in Machine Learning

Reinforcement learning (RL) is a machine learning paradigm concerned with how intelligent agents should take actions in an environment to maximize the cumulative reward.
RL differs from supervised learning in not needing labelled input/output pairs to be presented and not needing sub-optimal actions to be explicitly corrected.
The environment is typically stated in the form of a Markov decision process (MDP).
Reinforcement learning is studied in many disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, and statistics.
The problems of interest in reinforcement learning have also been studied in the theory of optimal control.
Basic reinforcement learning is modeled as an MDP in which the agent learns a policy that maximizes the cumulative reward built up from the immediate rewards.
A basic RL agent interacts with its environment in discrete time steps: at each step it receives the current state and reward, then chooses an action from the set of available actions, which is subsequently sent to the environment.
The goal of an RL agent is to learn a policy that maximizes the expected cumulative reward.
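The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CorridorEnv`, its reward values, and the two policies are hypothetical and do not come from the text.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1, the last cell is the goal."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the corridor
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1  # small cost per step, bonus at the goal
        return self.state, reward, done

def run_episode(env, policy, max_steps=100):
    """Run one episode and return the cumulative (undiscounted) reward."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment returns state and reward
        total += reward
        if done:
            break
    return total

def always_right(state):
    return +1

def random_policy(state):
    return random.choice([-1, +1])
```

Calling `run_episode(CorridorEnv(), always_right)` walks straight to the goal, while `random_policy` typically collects a lower cumulative reward because it wastes steps.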
Reinforcement learning is particularly well-suited to problems that include a long-term versus short-term reward trade-off.
Reinforcement learning has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon, checkers, and Go (AlphaGo).
Reinforcement learning requires clever exploration mechanisms, and the exploration vs. exploitation trade-off has been most thoroughly studied through the multi-armed bandit problem and for finite state space MDPs.
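As a rough illustration of the exploration vs. exploitation trade-off on a multi-armed bandit, the sketch below uses ε-greedy selection: with probability ε it pulls a random arm, otherwise it pulls the arm with the best current estimate. The Gaussian reward model, the arm means, and the function names are assumptions made for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return q_values.index(max(q_values))  # ties go to the lowest index

def run_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Simulate a bandit with unit-variance Gaussian rewards (an assumption)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    q = [0.0] * n_arms    # estimated mean reward per arm
    counts = [0] * n_arms
    for _ in range(steps):
        arm = epsilon_greedy(q, epsilon, rng)
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        q[arm] += (reward - q[arm]) / counts[arm]  # incremental sample mean
    return q, counts
```

Setting `epsilon=0.0` gives a purely greedy agent that can lock onto a poor arm, and `epsilon=1.0` gives uniform random play; values in between trade the two off.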
One exploration method is ε-greedy, where ε is a parameter controlling the amount of exploration vs. exploitation.

Value Function Approaches for Reinforcement Learning

The value function estimates "how good" it is to be in a given state.
The value function is defined as expected return starting with a state and following a policy.
The return is the sum of future discounted rewards, where the discount rate γ satisfies 0 ≤ γ < 1, so that rewards further in the future count for less.
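Written out, the return from a reward sequence r_0, r_1, r_2, ... is r_0 + γ·r_1 + γ²·r_2 + .... A minimal sketch of that computation for a finite list of rewards follows; the function name is illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    g = 0.0
    # Work backwards so each step is a single multiply-add: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, with rewards `[1, 1, 1]` and `gamma=0.5` the weights are 1, 0.5, and 0.25, so the return is 1.75.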
The algorithm must find a policy with maximum expected return.
The search can be restricted to the set of stationary policies, which can be further restricted to deterministic stationary policies.
The brute-force approach entails evaluating every policy and selecting the one with the highest expected return.
This is impractical when the number of policies is large or infinite, and estimating each policy's return is unreliable when the variance of returns is large.
Value function approaches attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for some policy.
Optimality is defined as achieving the best expected return from any initial state.
An optimal policy can always be found amongst stationary policies.
It is useful to define action-values in addition to state-values.
If a policy achieves optimal values in each state, it is called optimal.

Reinforcement Learning Methods: Monte Carlo, Temporal Difference, and Function Approximation

The optimal action-value function is sufficient for knowing how to act optimally.
Value iteration and policy iteration can be used to compute the optimal action-value function.
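One way to picture value iteration on action values is the sketch below, which assumes the transition model is fully known and small enough to enumerate. The data layout `P[s][a]` and the two-state MDP are hypothetical examples, not taken from the text.

```python
def value_iteration(P, gamma=0.9, tol=1e-8, max_iters=10_000):
    """P[s][a] is a list of (probability, next_state, reward) triples."""
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    for _ in range(max_iters):
        delta = 0.0
        for s in P:
            for a in P[s]:
                # Bellman optimality backup; Q is updated in place
                new_q = sum(p * (r + gamma * max(Q[s2].values()))
                            for p, s2, r in P[s][a])
                delta = max(delta, abs(new_q - Q[s][a]))
                Q[s][a] = new_q
        if delta < tol:  # stop once no entry changed noticeably
            break
    return Q

def greedy_policy(Q):
    """Act optimally by picking the highest-valued action in each state."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}

# Hypothetical two-state MDP: staying in B pays 2 per step, moving A -> B pays 1 once.
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
```

Here `greedy_policy(value_iteration(P))` selects "go" in A and "stay" in B, which shows how the optimal action-value function alone determines optimal behaviour.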
Monte Carlo methods can be used in the policy evaluation step of policy iteration.
The estimate of the value of a given state-action pair can be computed by averaging the sampled returns that originated from that pair over time.
The next policy is obtained by computing a greedy policy with respect to the action-value function.
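The two steps just described, averaging sampled returns per state-action pair and then acting greedily, might look like this in a minimal form. The sketch assumes every-visit averaging and episodes given as lists of (state, action, reward) triples; both choices are assumptions for illustration.

```python
from collections import defaultdict

def mc_action_values(episodes, gamma=1.0):
    """Every-visit Monte Carlo: average the returns that followed each (state, action)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is always the return from the current step onward
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            totals[(state, action)] += g
            counts[(state, action)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}

def greedy_improvement(Q):
    """Policy improvement: in each state pick the action with the highest estimate."""
    best = {}
    for (state, action), value in Q.items():
        if state not in best or value > Q[(state, best[state])]:
            best[state] = action
    return best
```

Only pairs that actually appear in the sampled episodes receive an estimate, which is one reason the procedure depends on sufficient exploration.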
Problems with this procedure include spending too much time evaluating a suboptimal policy, using samples inefficiently, converging slowly when returns have high variance, and working only for episodic problems and small, finite MDPs.
Sutton's temporal difference (TD) methods are based on the recursive Bellman equation.
The computation in TD methods can be incremental or batch.
TD methods overcome the issue of working only for episodic problems.
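To make the incremental case concrete, here is a minimal sketch of a one-step TD update for state values, often called TD(0). The step size `alpha`, the dictionary representation of the value table, and the function name are assumptions for the example.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9, terminal=False):
    """Move V[state] toward the one-step Bellman target; returns the TD error."""
    next_value = 0.0 if terminal else V.get(next_state, 0.0)
    target = reward + gamma * next_value   # bootstrapped estimate of the return
    td_error = target - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error
```

Because the update needs only a single transition, it can be applied after every step without waiting for an episode to finish, which is why it also suits non-episodic problems.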
Linear function approximation is used to address the issue of working only for small, finite MDPs.
A mapping assigns a finite-dimensional vector to each state-action pair in linear function approximation.
The action value of a state-action pair is computed as the dot product of its feature vector and the weight vector.

Overview of Reinforcement Learning

The action value is thus a linear combination of the components of ϕ(s, a) with some weights θ, and the algorithms adjust the weights rather than the values of individual state-action pairs.
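A small sketch of the linear form: the action value is the dot product of a feature vector ϕ(s, a) and a weight vector θ. The particular `features` map and the gradient-style weight update are assumptions added for illustration; the text does not specify either.

```python
def q_value(theta, phi):
    """Action value as the dot product of weights theta and features phi(s, a)."""
    return sum(t * f for t, f in zip(theta, phi))

def features(state, action):
    """Hypothetical feature map for a numeric state and an action in {0, 1}."""
    return [1.0, state, float(action), state * action]

def weight_update(theta, phi, target, alpha=0.1):
    """Nudge the weights so q_value moves toward the target (assumed update rule)."""
    error = target - q_value(theta, phi)
    return [t + alpha * error * f for t, f in zip(theta, phi)]
```

The table of per-pair values is replaced by a vector θ whose length equals the number of features, so the memory cost no longer grows with the number of states.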
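Value iteration can also be used as a starting point, giving rise to the Q-learning algorithm and its many variants, including deep Q-learning methods, in which a neural network represents the action-value function, with various applications in stochastic search problems.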
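A tabular version of Q-learning can be sketched as follows. The corridor task, its rewards, and the hyperparameters are hypothetical choices for the example; the update line is the usual one-step Q-learning rule with ε-greedy action selection.

```python
import random

ACTIONS = (-1, +1)  # left, right
N_STATES = 5        # hypothetical corridor; state 4 is terminal

def step(state, action):
    """Deterministic corridor dynamics: -0.1 per move, +1.0 on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else -0.1), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)                        # explore
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])  # exploit
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            # Sampled Bellman optimality backup
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
            if done:
                break
    return Q
```

Unlike value iteration, the update uses only sampled transitions, so no model of the environment is required.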
Direct policy search is an alternative that searches directly in (some subset of) the policy space, in which case the problem becomes one of stochastic optimization.
A large class of methods avoids relying on gradient information, including simulated annealing, cross-entropy search, and methods of evolutionary computation.
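As one example of a gradient-free method, the sketch below implements a basic cross-entropy search over a parameter vector. The `score` function stands in for the estimated return of a policy with those parameters; that stand-in, the Gaussian sampling distribution, and all hyperparameters are assumptions.

```python
import random
import statistics

def cross_entropy_search(score, dim, iterations=30, population=50, n_elite=10, seed=0):
    """Repeatedly sample parameters, keep the best few, and refit the sampler to them."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [5.0] * dim
    for _ in range(iterations):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        columns = list(zip(*elite))  # one tuple of values per parameter
        mean = [statistics.mean(c) for c in columns]
        # Small floor keeps the sampler from collapsing to a single point
        std = [statistics.pstdev(c) + 1e-3 for c in columns]
    return mean
```

The search only ever compares scores, so it works even when the return is noisy or not differentiable with respect to the policy parameters.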
All of the above methods can be combined with algorithms that first learn a model. For instance, the Dyna algorithm learns a model from experience and uses it to provide additional modeled transitions for a value function, alongside the real transitions.
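The Dyna idea can be sketched as a single step that learns from one real transition, records it in a model, and then replays modeled transitions. This version assumes a deterministic environment, a Q-learning-style update, and a dictionary model; these are simplifying assumptions and the function names are illustrative.

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s2, actions, alpha, gamma):
    """One-step Q-learning update (unvisited pairs default to 0.0)."""
    best_next = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q_step(Q, model, s, a, r, s2, actions, rng,
                alpha=0.5, gamma=0.9, planning_steps=5):
    q_update(Q, s, a, r, s2, actions, alpha, gamma)  # learn from the real transition
    model[(s, a)] = (r, s2)                          # remember it (deterministic model)
    for _ in range(planning_steps):
        ps, pa = rng.choice(list(model))             # pick a previously seen pair
        pr, ps2 = model[(ps, pa)]
        q_update(Q, ps, pa, pr, ps2, actions, alpha, gamma)  # learn from the modeled one
```

With `Q = defaultdict(float)` and an empty `model`, each call extracts several updates from a single real interaction, which is the sample-efficiency argument for model-based additions.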
Both the asymptotic and finite-sample behaviors of most algorithms are well understood.
Research topics include comparison of reinforcement learning algorithms, associative reinforcement learning, deep reinforcement learning, adversarial deep reinforcement learning, fuzzy reinforcement learning, inverse reinforcement learning, and safe reinforcement learning.
Associative reinforcement learning tasks combine facets of stochastic learning automata tasks and supervised learning pattern classification tasks.
Adversarial deep reinforcement learning is an active area of research in reinforcement learning focusing on vulnerabilities of learned policies.
Fuzzy reinforcement learning approximates the state-action value function with fuzzy rules in continuous space.
Inverse reinforcement learning (IRL) infers the reward function given an observed behavior from an expert.
Safe reinforcement learning (SRL) can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes.

Reinforcement Learning Quiz

9 Questions

What is reinforcement learning?

What is the difference between reinforcement learning and supervised learning?

What is the typical form of the environment in reinforcement learning?

What is the goal of an RL agent?

What is the ε-greedy exploration method?

What is the value function in RL?

What is the difference between value function approaches and the brute-force approach?

What are the three reinforcement learning methods discussed in the text?

What is inverse reinforcement learning (IRL)?
