Reinforcement Learning Lecture Notes PDF
Document Details
Universiti Malaysia Pahang
Dr Nor Azuana Ramli
Summary
This document contains lecture notes for a course on reinforcement learning, covering concepts such as Markov decision processes, Q-learning, and policy gradients. It is suitable for undergraduate students learning artificial intelligence.
Full Transcript
CHAPTER 6: Reinforcement Learning
Dr Nor Azuana Ramli

CONTENTS
- What is Reinforcement Learning?
- Markov Decision Processes
- Q-Learning
- Policy Gradients

What is Reinforcement Learning?
✓ Agents are self-trained on reward and punishment mechanisms.
✓ It is about taking the best possible action or path to gain maximum reward and minimum punishment through observations in a specific situation.
✓ The reward acts as a signal for positive and negative behaviour.
✓ Essentially, an agent (or several) is built that can perceive and interpret the environment in which it is placed; furthermore, it can take actions and interact with that environment.

Different definitions from the experts
- NVIDIA: "Reinforcement learning, a type of machine learning, in which agents take actions in an environment aimed at maximizing their cumulative rewards."
- Gartner: "Reinforcement learning (RL) is based on rewarding desired behaviors or punishing undesired ones. Instead of one input producing one output, the algorithm produces a variety of outputs and is trained to select the right one based on certain variables."
- MathWorks: "It is a type of machine learning technique where a computer agent learns to perform a task through repeated trial and error interactions with a dynamic environment. This learning approach enables the agent to make a series of decisions that maximize a reward metric for the task without human intervention and without being explicitly programmed to achieve the task."

Terminologies used in Reinforcement Learning
- Agent – the sole decision-maker and learner.
- Environment – the physical world in which the agent learns and decides which actions to perform.
- Action – the list of actions the agent can perform.
- State – the current situation of the agent in the environment.
- Reward – for each action selected by the agent, the environment gives a reward. It is usually a scalar value and is nothing but feedback from the environment.
- Policy – the strategy (decision-making rule) the agent uses to map situations to actions.
- Value Function – the value of a state is the total reward achieved starting from that state while the policy is executed.
- Model – not every RL agent uses a model of its environment; when one is used, it maps state-action pairs to probability distributions over the next states.

How does Reinforcement Learning Work?
1. Start in a state.
2. Take an action.
3. Receive a reward or penalty from the environment.
4. Observe the new state of the environment.
5. Update your policy to maximize future rewards.

Reinforcement Learning Workflow
Create the environment → Define the reward → Create the agent → Train and validate the agent → Deploy the policy

Difference between RL and Supervised Learning
(The comparison is presented as a table on the slide and is not captured in this transcript.)

Characteristics of Reinforcement Learning
- No supervision, only a real-valued reward signal.
- Decision making is sequential.
- Time plays a major role in reinforcement problems.
- Feedback is not prompt but delayed.
- The data the agent receives next is determined by the agent's own actions.

Challenges in Reinforcement Learning
Trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. One common way to balance the two is sketched below.
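As an illustration (not taken from the slides), the following is a minimal sketch of epsilon-greedy action selection, a standard way to balance exploration and exploitation; the dictionary-based Q-value table and the epsilon value are assumptions made for the example.

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """Choose an action for `state`: explore with probability epsilon, otherwise exploit.

    `q_values` is assumed to be a dict mapping (state, action) pairs to estimated values.
    """
    if random.random() < epsilon:
        # Explore: try an action regardless of its current value estimate.
        return random.choice(actions)
    # Exploit: pick the action with the highest current value estimate.
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

# Hypothetical usage with a toy Q-table:
q = {("s0", "left"): 1.2, ("s0", "right"): 0.4}
print(epsilon_greedy(q, "s0", ["left", "right"]))
```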
Reinforcement Learning Algorithms
- Value-Based – the main goal of this method is to maximize a value function. Here, an agent, acting through a policy, expects a long-term return from the current states.
- Policy-Based – in policy-based methods, you come up with a strategy that helps to gain maximum rewards in the future through the possible actions performed in each state. The two types of policy-based methods are deterministic and stochastic.
- Model-Based – in this method, we need to create a virtual model for the agent to help it learn to perform in each specific environment.

Some Reinforcement Learning Terms
- Off-policy vs. on-policy learning: an off-policy learning algorithm learns the value of the optimal policy independently of the policy by which the agent chooses its actions; Q-learning is an off-policy learning algorithm. An on-policy learning algorithm learns the value of the policy being carried out by the agent.
- Model-based vs. model-free: in model-based learning, we estimate the transition and reward functions by taking some actions, then solve the MDP using them. In model-free learning, we do not attempt to model the MDP and instead just try to learn the values directly.
- Passive vs. active: passive learning involves using a fixed policy while we try to learn the values of our states, whereas active learning involves improving the policy as we learn.

Case Study: Autonomous Driving (Model-Based Approach)
- Modeling the environment: the agent builds a model of the driving environment, which includes other vehicles, pedestrians, traffic signals, and so on.
- Planning: the agent uses the model to simulate various scenarios, like the movements of other vehicles and pedestrians, and plans its actions accordingly (e.g., when to slow down, when to accelerate, when to change lanes).
- Sample efficiency: the model helps the agent learn to drive safely with fewer actual driving trials, as it can learn a lot through simulation.
- Challenges: the complexity of the real world can make it extremely difficult to create an accurate model. Any discrepancy between the model and the real world (model bias) can lead to poor decision-making.

Case Study: Autonomous Driving (Model-Free Approach)
- Learning from interaction: the agent learns to drive by interacting with the environment, receiving feedback in the form of rewards (e.g., positive for maintaining a safe distance, negative for collisions).
- Direct policy or value function estimation: without modeling the environment, the agent directly learns the policy (what action to take in each state) or the value function (how good each state or action is).
- Sample inefficiency: the agent might need to experience many driving hours to learn a good policy, as it learns purely from interaction.
- Robustness: the approach is potentially more robust to the complexities of the real-world driving environment since it does not rely on a possibly flawed model.

"While model-based methods can be more sample-efficient and capable of planning, they suffer from model bias and complexity. On the other hand, model-free methods are simpler and potentially more robust but require more interactions with the environment to learn effectively. The choice between model-based and model-free RL often depends on the specific requirements and constraints of the application at hand."

Widely used Models for RL: Markov Decision Process (MDP)
MDP is a foundational element of RL. It allows the formalization of sequential decision-making, where an action taken from a state influences not just the immediate reward but also the subsequent state. It is a very useful framework for modelling problems that maximize longer-term returns by taking a sequence of actions. Expressing a problem as an MDP is the first step towards solving it through techniques like dynamic programming or other RL techniques. The standard formalization is summarized below.
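For reference, the equations below give the usual textbook formalization of an MDP, the discounted return, and the value functions; the slides present this material as figures, so these are standard definitions rather than a transcription of the slide content.

```latex
\[
\text{MDP: } \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle,
\qquad
\mathcal{P}^{a}_{ss'} = \Pr\bigl[S_{t+1}=s' \mid S_t=s,\ A_t=a\bigr],
\qquad \gamma \in [0,1]
\]
\[
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
\]
\[
v_{\pi}(s) = \mathbb{E}_{\pi}\bigl[G_t \mid S_t = s\bigr],
\qquad
v_{*}(s) = \max_{a \in \mathcal{A}} \Bigl( \mathcal{R}^{a}_{s}
  + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'}\, v_{*}(s') \Bigr)
\]
```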
MDP Basics
- In an MDP, an agent interacts with an environment by taking actions and seeks to maximize the rewards it gets from the environment. MDPs are the mathematical framework for mapping out solutions in RL.
- The goal of the agent is to maximize the total reward collected over a period of time; the agent needs to find the optimal action in a given state that will maximize this total reward.
- The probability distribution over the actions A(s) taken from a state S is called the policy. The goal of solving an MDP is to find an optimal policy.

Real World Example of MDP: Whether to fish salmon this year
We need to decide what proportion of salmon to catch in a year in a specific area, maximizing the longer-term return. Each salmon generates a fixed amount of dollars. However, if a large proportion of the salmon are caught, then next year's yield will be lower. We need to find the optimum proportion of salmon to catch to maximize the return over a long time period.

Solution:
- States: the number of salmon available in that area in that year. For simplicity, assume there are only four states: empty, low, medium and high. Empty means no salmon are available; low means the number of available salmon is below a certain threshold t1; medium means it is between t1 and t2; high means it is more than t2.
- Actions: for simplicity, assume there are only two actions: fish and not_to_fish. Fish means catching a certain proportion of the salmon. For the state empty, the only possible action is not_to_fish.
- Rewards: fishing in a state generates a reward. Let's assume the rewards for fishing in the states low, medium and high are $5K, $50K and $100K respectively. If an action leads to the empty state, the reward is very low, -$200K, because it requires re-breeding new salmon, which takes time and money.
- State transitions: fishing in a state has a higher probability of moving to a state with a lower number of salmon. Similarly, the not_to_fish action has a higher probability of moving to a state with a higher number of salmon (except for the state high).

Simple Example: Student
The slides then walk through the Student Markov Reward Process example as a sequence of figures and equations not captured in this transcript: Markov Property, State Transition Matrix, Markov Process, Markov Reward Process, Return, Why Discount?, Value Function, Example Calculations, and State-value Function for Student MRP (1)-(3).

Exercise: Consider the MDP given in the figure below (the figure is not included in this transcript). Assume the discount factor γ = 0.9. The r-values (written within each node) are the rewards that the agent gets when in that node. The numbers next to the arrows are the probabilities of the outcomes. Note that only state S1 has two actions (a and b); the other states each have only one action (a). Write down the optimal value of state S1. A numerical sketch of how such an MDP can be solved is shown below.
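Since the exercise figure is not included, the following is a minimal value-iteration sketch showing how the optimal state values of a small MDP of this shape could be computed numerically. The states, transition probabilities and rewards in the dictionaries are made-up placeholders, not the values from the exercise figure.

```python
# Minimal value-iteration sketch for a small finite MDP.
# The numbers below are placeholders, NOT the values from the exercise figure.
GAMMA = 0.9

# transitions[state][action] = list of (probability, next_state)
transitions = {
    "S1": {"a": [(0.8, "S2"), (0.2, "S3")],
           "b": [(1.0, "S3")]},
    "S2": {"a": [(1.0, "S1")]},
    "S3": {"a": [(1.0, "S3")]},   # absorbing state
}
rewards = {"S1": 0.0, "S2": 5.0, "S3": 1.0}   # reward received for being in a node

values = {s: 0.0 for s in transitions}
for _ in range(200):   # repeat the Bellman optimality backup until values settle
    values = {
        s: rewards[s] + GAMMA * max(
            sum(p * values[s2] for p, s2 in outcomes)
            for outcomes in actions.values()
        )
        for s, actions in transitions.items()
    }

print(values)   # approximately optimal state values, including the value of S1
```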
Widely used Models for RL: Q-Learning
It is a value-based, model-free approach that supplies information to indicate which action an agent should perform. It revolves around the notion of updating Q-values, which show the value of doing action A in state S. The value update rule is the main aspect of the Q-learning algorithm; a minimal sketch of this update appears after the applications list below.

Practical Applications of RL
- Text summarization engines, dialogue agents (text, speech), gameplays
- Robotics for industrial automation
- Autonomous self-driving cars
- AI toolkits, manufacturing, automotive, healthcare, and bots
- Machine learning and data processing
- Aircraft control and robot motion control
- Building artificial intelligence for computer games
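Below is a minimal sketch of the tabular Q-learning value update referred to above; the learning rate, discount factor, dictionary-based table and the example transition are illustrative choices, not values from the slides.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (illustrative value)
GAMMA = 0.9   # discount factor (illustrative value)

# Q-table: maps (state, action) -> estimated value of taking that action in that state.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, next_actions):
    """One Q-learning value update:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])

# Example of a single update after observing one hypothetical transition
# (state names echo the salmon example purely for illustration):
q_update(state="low", action="fish", reward=5.0,
         next_state="empty", next_actions=["not_to_fish"])
```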