

Notes on Chapter 2: Tabular Value-Based Reinforcement Learning

1 Core Concepts
Agent: An entity that interacts with the environment, making decisions and learning from the outcomes.
Environment: The world in which the agent operates, providing states and rewards based on the agent's actions.
MDP (Markov Decision Process): A mathematical framework for modeling decision-making with states, actions, transitions, rewards, and policies.
State: A representation of the current situation in the environment.
Action: A decision made by the agent that affects the environment.
Reward: Feedback from the environment based on the action taken.
Value: The expected cumulative reward of a state or state-action pair.
Policy: A strategy used by the agent to determine actions based on states.
Exploration and Exploitation: The balance between exploring new actions to find better rewards and exploiting known actions that provide good rewards.

2 Core Problem
Objective: Learn a policy from interactions with the environment to maximize cumulative reward.

3 Core Algorithms
Value Iteration: Iteratively updates the value of each state based on possible actions and rewards to compute the optimal policy.
Temporal Difference Learning: Updates value estimates based on the difference between estimated and actual rewards.
Q-learning: An off-policy algorithm that learns the value of state-action pairs and updates values based on the best possible actions.

4 Finding a Supermarket Example
Scenario: Moving to a new city and finding a supermarket without a map.
Actions: Exploring the city to find the supermarket and noting the route.
Learning: Follow the same route (exploitation) or try a new route (exploration) to find a shorter path.

5 Sequential Decision Problems
Definition: Problems where the agent makes a sequence of decisions to maximize cumulative future rewards.
Examples: Grid worlds, mazes, and box puzzles where the agent navigates to reach a goal.

5.1 Grid Worlds
Description: A rectangular grid where the agent moves to reach a goal while avoiding obstacles.
Goal: Find the sequence of actions to reach the goal state from the start state.

5.2 Mazes and Box Puzzles
Mazes: Complex environments requiring trajectory planning.
Box Puzzles (e.g., Sokoban): Puzzles where the agent pushes boxes to specific locations, with irreversible actions.

6 Tabular Value-Based Agents

6.1 Agent and Environment
Agent: Learns from interacting with the environment.
Environment: Provides states, rewards, and transitions based on the agent's actions.
Interaction: The agent takes actions, receives new states and rewards, and updates its policy based on the rewards received.

6.2 Markov Decision Process (MDP)
Formalism: Defined as a 5-tuple (S, A, Ta, Ra, γ):
S: Finite set of states.
A: Finite set of actions.
Ta: Transition probabilities between states.
Ra: Reward function for state transitions.
γ: Discount factor for future rewards.

6.2.1 State S
Representation: The configuration of the environment.
Types:
Deterministic Environment: Each action leads to a specific state.
Stochastic Environment: Actions can lead to different states based on probabilities.

6.2.2 State Representation
Description: How states are defined and represented in the environment.

6.2.3 Deterministic and Stochastic Environment
Deterministic: Predictable outcomes for actions.
Stochastic: Probabilistic outcomes for actions.

6.2.4 Action A
Types:
Discrete: Finite set of actions (e.g., moving in a grid).
Continuous: Infinite set of actions (e.g., robot movements).
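The discrete/continuous distinction maps directly onto Gym's space types. A minimal sketch, assuming the gym package is installed and the Taxi-v3 environment is available (the torque example is purely illustrative):

    import gym
    from gym.spaces import Box

    # Taxi has a discrete action space: six actions
    # (four moves, pickup, dropoff) and a discrete state space.
    env = gym.make("Taxi-v3")
    print(env.action_space)       # Discrete(6)
    print(env.observation_space)  # Discrete(500)

    # A continuous action space, e.g. a robot joint torque in [-1, 1],
    # is represented as a Box of real-valued vectors.
    torque_space = Box(low=-1.0, high=1.0, shape=(1,))
    print(torque_space.sample())  # a random continuous action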
6.2.5 Irreversible Environment Action
Definition: Actions that cannot be undone once taken.

6.2.6 Discrete or Continuous Action Space
Discrete: Limited number of actions.
Continuous: Unlimited range of actions.

6.2.7 Transition Ta
Function: Describes how actions change states.
Graph View: States and actions as nodes, transitions as edges.

6.2.8 Graph View of the State Space
Description: Visual representation of states and transitions.

6.2.9 Trial and Error, Down and Up
Method: Learning through experimentation and feedback.

6.2.10 Reward Ra
Definition: Feedback from the environment based on the action taken.
Sparse Rewards: Rewards are only given for specific states (e.g., terminal states in chess).
Dense Rewards: Rewards are given for every state.

6.2.11 Discount Factor γ
Purpose: Balances the importance of future rewards versus immediate rewards.
Range: [0, 1]
γ = 0: Only considers immediate rewards.
γ = 1: Treats all future rewards equally.

6.2.12 Policy π
Definition: A strategy that maps states to actions.
Objective: Find the optimal policy π* that maximizes the cumulative reward.

6.3 MDP Objective

6.3.1 Trace τ
Definition: A sequence of state-action pairs.
Return R: The cumulative reward of a trace.

6.3.2 State Value V
Definition: Expected cumulative reward starting from a state s and following policy π.
Equation:
V^π(s) = E_τ [ Σ_{i=0}^∞ γ^i · r_{t+i} | s_t = s ]

6.3.3 State-Action Value Q
Definition: Expected cumulative reward starting from a state s, taking action a, and following policy π.
Equation:
Q^π(s, a) = E_τ [ Σ_{i=0}^∞ γ^i · r_{t+i} | s_t = s, a_t = a ]

6.3.4 Reinforcement Learning Objective
Goal: Find the policy that maximizes the expected cumulative reward.

6.3.5 Bellman Equation
State Value:
V(s) = max_a ( R(s, a) + γ Σ_{s′} T_a(s, s′) V(s′) )
State-Action Value:
Q(s, a) = R(s, a) + γ Σ_{s′} T_a(s, s′) max_{a′} Q(s′, a′)

6.4 MDP Solution Methods

6.4.1 Hands On: Value Iteration in Gym
OpenAI Gym: A toolkit for developing and comparing reinforcement learning algorithms.
Taxi Example: Solve the taxi problem using value iteration by updating state values based on the Bellman equation.

6.4.2 OpenAI Gym
Description: A platform for developing and comparing reinforcement learning algorithms.
Installation: Use pip install gym to install the Gym environment.

6.4.3 Taxi Example with Value Iteration
Scenario: Taxi problem where the agent picks up and drops off passengers.
Implementation: Uses value iteration to solve the MDP (a minimal code sketch follows the steps below).
Steps:
1. Initialize Environment: Set up the Taxi environment.
2. Define Value Function: Initialize the value function.
3. Iterate Value Function: Update the value function using Bellman updates until convergence.
4. Build Policy: Derive the policy from the value function.
5. Execute Policy: Apply the policy to the environment to evaluate performance.
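A minimal value iteration sketch for these steps, assuming the classic gym API and the tabular transition model that the toy-text environments expose as env.unwrapped.P; variable names and hyperparameter values are illustrative, not from the chapter:

    import gym
    import numpy as np

    env = gym.make("Taxi-v3")
    P = env.unwrapped.P                 # P[s][a] = [(prob, s2, reward, done), ...]
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    gamma, theta = 0.9, 1e-8            # discount factor, convergence threshold

    def q_value(V, s, a):
        # One-step lookahead: expected reward plus discounted next-state value.
        return sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])

    V = np.zeros(n_states)
    while True:                         # repeat Bellman updates until convergence
        delta = 0.0
        for s in range(n_states):
            best = max(q_value(V, s, a) for a in range(n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    # Greedy policy derived from the converged value function.
    policy = [max(range(n_actions), key=lambda a: q_value(V, s, a)) for s in range(n_states)]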
6.4.4 Model-Free Learning
Monte Carlo Sampling: Learning from complete episodes by averaging returns.
Temporal Difference Learning: Learning from incomplete episodes by bootstrapping.
Bias-Variance Trade-off: Monte Carlo methods have high variance and low bias, while temporal difference methods have low variance and high bias.

6.4.5 Monte Carlo Sampling
Description: Generates random episodes and uses returns to update the value function.

6.4.6 Temporal Difference Learning
Description: Updates value estimates based on differences between successive state values.

6.4.7 Exploration
Bandit Theory: Balances exploration and exploitation.
ε-greedy Exploration: Chooses a random action with probability ε, and the best-known action with probability 1 − ε.

6.4.8 Bandit Theory
Description: Studies strategies for choosing actions to maximize rewards with minimal trials.

6.4.9 ε-greedy Exploration
Description: Introduces randomness in action selection to ensure exploration.

6.4.10 Off-Policy Learning
On-Policy SARSA: Updates the policy based on the actions taken by the current policy.
Off-Policy Q-Learning: Updates the policy based on the best possible actions, not necessarily those taken by the current policy.

6.4.11 On-Policy SARSA
Description: Uses the state-action value of the current policy to update.

6.4.12 Off-Policy Q-Learning
Description: Uses the maximum state-action value to update, regardless of the action taken.

6.4.13 Sparse Rewards and Reward Shaping
Sparse Rewards: Rewards given only at specific states, making learning more difficult.
Reward Shaping: Modifying the reward function to make learning easier by providing more frequent feedback.

6.4.14 Hands On: Q-learning on Taxi
Scenario: Apply Q-learning to the Taxi problem (a code sketch appears after Section 8).
Implementation:
1. Initialize Environment: Set up the Taxi environment.
2. Define Q-Table: Initialize Q-values for all state-action pairs.
3. Run Episodes: For each episode, reset the environment and take actions based on the ε-greedy policy.
4. Update Q-Values: Use the Q-learning update rule to adjust Q-values.
5. Evaluate Performance: Assess the learned policy by its ability to complete tasks efficiently.

6.4.15 Tuning Learning Rate
Description: Adjusting hyperparameters such as the exploration parameter (ε), discount factor (γ), and learning rate (α) to improve performance.

7 Classic Gym Environments

7.1 Mountain Car and Cartpole
Mountain Car: The agent must drive a car up a steep hill by building momentum through back-and-forth movements.
Cartpole: The agent must balance a pole on a moving cart by applying forces.

7.1.1 Mountain Car
Objective: Drive the car to the top of the mountain using momentum.
Challenge: The car's engine is not strong enough to climb the mountain in one go.

7.1.2 Cartpole
Objective: Balance a pole on a moving cart.
Challenge: Apply forces to keep the pole balanced and prevent it from falling over.

7.2 Path Planning and Board Games
Path Planning: Finding the optimal route in a grid or maze.
Board Games: Strategy games like chess and checkers.

7.2.1 Path Planning
Description: Solving navigation tasks in mazes or environments with obstacles.
Example: Sokoban, where the agent pushes boxes to specific locations.

7.2.2 Board Games
Description: Planning and strategy games.
Example: Chess and checkers, where the agent learns optimal moves.

8 Bootstrapping in Temporal Difference Learning
Bootstrapping: The process of refining an estimate using successive updates. In TD learning, bootstrapping involves updating value estimates using other learned estimates rather than waiting for the final outcome.
Process: TD learning updates the value of a state based on the value of the next state and the reward received:
V(s) ← V(s) + α [r + γ V(s′) − V(s)]
Here, s is the current state, s′ is the next state, r is the reward received after moving from s to s′, α is the learning rate, and γ is the discount factor.
Example: Consider an agent navigating a grid world. When it moves from state s to state s′ and receives a reward r, it immediately updates the value of s using the estimated value of s′.
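Q-learning (Section 6.4.14) applies exactly this bootstrapped update to state-action values, combined with ε-greedy exploration. A minimal sketch on Taxi-v3, assuming the classic gym API where reset() returns an observation and step() returns (observation, reward, done, info); hyperparameter values are illustrative:

    import gym
    import numpy as np

    env = gym.make("Taxi-v3")
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration

    for episode in range(5000):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # bootstrapped off-policy update: the target uses max over next actions
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next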
9 Bias-Variance Trade-off
Definition: The bias-variance trade-off is a key concept in machine learning, balancing the error introduced by bias and variance.
Bias: The error introduced by approximating a real-world problem, which may be complex, by a simplified model.
Variance: The error introduced by the model's sensitivity to small fluctuations in the training set.
Trade-off:
- High Bias: Simplistic models (e.g., linear models) might not capture the underlying pattern (underfitting).
- High Variance: Complex models (e.g., deep neural networks) might capture noise in the training data as if it were a true pattern (overfitting).
In Reinforcement Learning:
- Monte Carlo Methods: Low bias, high variance. They wait until the end of an episode to update the values, leading to high variance because each complete return can vary significantly.
- Temporal Difference Methods: High bias, low variance. They update the values after each step using bootstrapping, leading to more stable updates but introducing bias as they rely on current estimates.
Example:
- Monte Carlo: A robot learning a task by simulating entire episodes. It gathers complete returns, resulting in high variance but low bias as it does not bootstrap.
- TD Learning: The same robot using TD learning updates its value estimates at each step, reducing variance but introducing bias due to bootstrapping.

10 Summary and Further Reading
Summary: Reinforcement learning solves sequential decision problems by learning policies through interaction with the environment.
Further Reading: Explore advanced topics and algorithms in reinforcement learning.

11 Exercises
We will end with questions on key concepts, with programming exercises to build up more experience.

Questions
The questions below are meant to refresh your memory, and should be answered with yes, no, or short answers of one or two sentences.

1. In reinforcement learning, the agent can choose which training examples are generated. Why is this beneficial? What is a potential problem?
Answer: This is beneficial because it allows the agent to explore different parts of the state space, leading to a more comprehensive understanding of the environment. A potential problem is that the agent might spend too much time exploring suboptimal parts of the state space, leading to inefficient learning.

2. What is Grid world?
Answer: Grid world is a type of environment used in reinforcement learning, where an agent navigates a rectangular grid to reach a goal while avoiding obstacles.

3. Which five elements does an MDP have to model reinforcement learning problems?
Answer: The five elements are States (S), Actions (A), Transition probabilities (Ta), Rewards (Ra), and Discount factor (γ).

4. In a tree diagram, is successor selection of behavior up or down?
Answer: Successor selection of behavior is down.

5. In a tree diagram, is learning values through backpropagation up or down?
Answer: Learning values through backpropagation is up.

6. What is τ?
Answer: τ is a trace, which is a sequence of state-action pairs.

7. What is π(s)?
Answer: π(s) is a policy, which defines the action taken by the agent in state s.

8. What is V(s)?
Answer: V(s) is the state value, representing the expected cumulative reward starting from state s and following policy π.

9. What is Q(s, a)?
Answer: Q(s, a) is the state-action value, representing the expected cumulative reward starting from state s, taking action a, and following policy π.

10. What is dynamic programming?
Answer: Dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems, using the principle of optimality.

11. What is recursion?
Answer: Recursion is a method of solving problems where the solution depends on solutions to smaller instances of the same problem.

12. Do you know a dynamic programming method to determine the value of a state?
Answer: Yes, value iteration is a dynamic programming method used to determine the value of a state.

13. Is an action in an environment reversible for the agent?
Answer: Not always. Some environments have irreversible actions.

14. Mention two typical application areas of reinforcement learning.
Answer: Two typical application areas are game playing (e.g., chess, Go) and robotics (e.g., robotic control).

15. Is the action space of games typically discrete or continuous?
Answer: The action space of games is typically discrete.

16. Is the action space of robots typically discrete or continuous?
Answer: The action space of robots is typically continuous.

17. Is the environment of games typically deterministic or stochastic?
Answer: The environment of games can be either deterministic or stochastic, but many classic board games have deterministic environments.

18. Is the environment of robots typically deterministic or stochastic?
Answer: The environment of robots is typically stochastic due to the unpredictability of real-world conditions.

19. What is the goal of reinforcement learning?
Answer: The goal of reinforcement learning is to learn a policy that maximizes the cumulative reward.

20. Which of the five MDP elements is not used in episodic problems?
Answer: The discount factor (γ). In episodic problems the return is finite because episodes terminate, so discounting is not strictly needed.

21. Which model or function is meant when we say "model-free" or "model-based"?
Answer: "Model-free" refers to methods that do not use a model of the environment's dynamics (e.g., Q-learning), while "model-based" refers to methods that use a model of the environment to make decisions (e.g., value iteration).

22. What type of action space and what type of environment are suited for value-based methods?
Answer: Value-based methods are suited for discrete action spaces and environments where the state and action spaces are not excessively large.

23. Why are value-based methods used for games and not for robotics?
Answer: Value-based methods are used for games because games often have discrete action spaces and clearly defined rules. In contrast, robotics often involves continuous action spaces and highly dynamic, unpredictable environments that are better suited for policy-based or model-based methods.

24. Name two basic Gym environments.
Answer: Two basic Gym environments are Mountain Car and Cartpole.

12 In Class RL Questions

1. What is the biological name of RL?
Operant Conditioning.

2. What two elements are central to RL Interaction?
Agent and Environment.

3. What problems is RL applied to?
Sequential decision-making problems.

4. What is the main problem of assigning reward?
Defining a reward function that accurately reflects long-term objectives without unintended side effects.

5. How can we model sequential decision problems?
Using Markov Decision Processes (MDPs).

6. Name the 5 MDP elements.
States (S), Actions (A), Transition probabilities (Ta), Rewards (Ra), Discount factor (γ).

7. Which of these are in the agent, which are in the environment?
Agent: Actions, Policy; Environment: States, Transition probabilities, Rewards, Discount factor.

8. What goes up, what goes down?
Rewards go up, actions go down.

9. What is the name of the recursion relation central to the value function?
Bellman Equation.

10. What is the name of an algorithm that computes it?
Value Iteration or Policy Iteration.

11. What is model-free?
Methods that let the agent learn directly from raw experience, without a model of the environment dynamics (e.g., Q-learning).

12. What is model-based?
Methods that use a model of the environment to make decisions (e.g., Dyna-Q).

13. Why do we need Temporal Difference?
To combine ideas from Monte Carlo methods and dynamic programming: value estimates are updated from partial returns, which lets the agent learn directly from raw experience without a model of the environment dynamics.

14. Why do we need Exploration and Exploitation? Mention a simple algorithm.
To balance learning new information and utilizing known information; ε-greedy.

15. How can we compute minimal regret?
Regret is the difference between the reward of the best action and the reward of the action actually taken; it can be minimized with regret-minimization algorithms such as the Boltzmann (softmax) policy.

16. SARSA is X and Q-learning is Y.
SARSA is on-policy, and Q-learning is off-policy (see the sketch after these questions).

17. What is Gym?
OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms.
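To make the on-policy/off-policy distinction in question 16 concrete, here is a sketch of the two update rules side by side; the function and variable names are illustrative, and Q is assumed to be a tabular NumPy array of shape (n_states, n_actions):

    import numpy as np

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # On-policy: bootstrap with the action the behavior policy
        # actually selects in the next state (a_next).
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # Off-policy: bootstrap with the greedy (maximum) next action,
        # regardless of which action is actually taken next.
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])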
