Reinforcement Learning PDF
Document Details
Alexandria University
Dr. Doaa B. Ebaid
Summary
This document provides an introduction to reinforcement learning, a type of machine learning. It covers agent-environment interaction, state-action pairs, reward systems, and several reinforcement learning approaches, including model-based and value-based methods, and illustrates the key concepts with examples such as robot navigation in a warehouse.
Full Transcript
IGSR Course Presentation. By: Dr. Doaa B. Ebaid. Part 5: Introduction to Reinforcement Learning.

Outlines
- What is Reinforcement Learning?
- RL vs Supervised vs Unsupervised Learning
- Reinforcement Learning Approaches
- How does Reinforcement Learning Work?
- Bellman Equation
- Exploration vs. Exploitation
- Curse of Dimensionality
- Advantages of Reinforcement Learning
- Disadvantages of Reinforcement Learning
- Challenges in Reinforcement Learning

Terms used in Reinforcement Learning
- Agent: the learner or decision-maker.
- Environment: everything the agent interacts with.
- State (S): a specific situation or configuration of the environment in which the agent finds itself.
- Action (A): all possible choices/moves the agent can make.
- Reward (R): feedback from the environment based on the action taken.
- Policy (π): essentially a strategy or rule that the agent follows to determine its actions based on the current state of the environment.

What is Reinforcement Learning?
Reinforcement Learning (RL) is a feedback-based ML technique in which an agent learns to behave in an environment by performing actions and seeing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty. RL solves problems where decision making is sequential and the goal is long-term, such as game playing and robotics.

The agent-environment interaction loop:
- The agent observes the current state of the environment.
- Based on that state, the agent selects an action.
- The environment transitions to a new state and gives the agent a reward.
- This interaction repeats, with the agent learning to maximize rewards over time.
This is known as a Markov decision process: the agent's next action depends only on the current state, not the full history.

RL vs Supervised vs Unsupervised Learning
Unlike supervised learning, where the agent learns from labeled examples, or unsupervised learning, which finds patterns in unlabeled data, reinforcement learning relies on trial-and-error learning through interactions with the environment. The agent is bound to learn from its own experience only; there is no need to pre-program it, and no human intervention is required.

Example: Suppose an AI agent is placed in a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing actions; based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback. The agent keeps repeating these three things (take an action, change state or remain in the same state, get feedback), and by doing so it learns and explores the environment. The agent learns which actions lead to positive feedback (rewards) and which lead to negative feedback (penalties). As a reward, the agent gets a positive point; as a penalty, it gets a negative point.
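To make the interaction loop concrete, here is a minimal Python sketch of the observe-act-reward cycle described above. The MazeEnv class, its state names, and its reward values are illustrative assumptions, not anything specified in the slides; only the structure of the loop matters.

import random

class MazeEnv:
    """Toy stand-in for the maze environment (an illustrative assumption)."""
    def __init__(self):
        self.states = ["S9", "S5", "S1", "S2", "S3"]   # one possible path to the goal
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.states[self.pos]

    def step(self, action):
        # Move forward on "up"/"right", otherwise stay; reaching S3 gives +1.
        if action in ("up", "right") and self.pos < len(self.states) - 1:
            self.pos += 1
        state = self.states[self.pos]
        reward = 1 if state == "S3" else 0
        done = state == "S3"
        return state, reward, done

env = MazeEnv()
state = env.reset()
done = False
while not done:
    # The agent observes the current state and selects an action
    # (a random policy here; a learned policy would go in its place).
    action = random.choice(["up", "down", "left", "right"])
    # The environment transitions to a new state and returns a reward.
    next_state, reward, done = env.step(action)
    print(state, action, reward, next_state)
    state = next_state   # Markov property: only the current state matters for the next choice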
Reinforcement Learning Approaches
There are mainly three ways to implement reinforcement learning in ML. The approaches split into model-based methods and model-free methods, with model-free methods further divided into policy-based and value-based approaches; policies themselves can be deterministic or stochastic.

Model-based: In the model-based approach, a virtual model is created for the environment, and the agent explores that environment to learn it. There is no single solution or algorithm for this approach, because the model representation is different for each environment. The agent attempts to learn a model of the environment dynamics: this model predicts the next state and reward for a given state-action pair. The agent uses the model to plan and simulate actions in a virtual environment before taking them in the real world. While conceptually attractive, this approach can be computationally expensive for complex environments and often requires additional assumptions about the environment's behavior.

Example: Robot Navigation in a Warehouse
Imagine a robot navigating through a warehouse to deliver packages to different locations. The robot needs to avoid obstacles (shelves, walls, other robots) and take the most efficient route to the delivery point. The warehouse is dynamic, with different conditions at different times (e.g., paths blocked by other robots). Goal: the robot must learn to navigate the warehouse efficiently, avoiding obstacles and delivering packages as quickly as possible.
1.1 State (s): the robot's current position and orientation in the warehouse. It could also include information about nearby obstacles, the package location, and the destination.
1.2 Action (a): the robot can move forward, turn left or right, or stop.
1.3 Reward (R(s,a)): the reward is based on several factors: a positive reward for reaching the destination (e.g., delivering a package); a negative reward for hitting an obstacle or deviating from the optimal route; a small negative reward for every time step, to encourage faster delivery.
1.4 Model of the environment: the robot either has or learns a model of the warehouse. This model could include a state transition function, which predicts how the robot's position changes when it takes an action, and a reward function, which predicts the reward the robot will receive after an action.

Policy-based: Directly learns the policy function, which maps states to actions. The goal is to find the optimal policy, the one that leads to the highest expected future rewards. Examples of policy-based methods include REINFORCE, Proximal Policy Optimization (PPO), and Actor-Critic methods. The policy-based approach has two types of policy:
- Deterministic: the policy (π) produces the same action at any given state.
- Stochastic: the policy defines a probability distribution over actions, and the produced action is sampled from it.
Policy (π: S → A): the goal of RL is to learn a policy π that maps states to actions; the objective is to find the policy that maximizes cumulative reward.

Value-based: The value-based approach is about finding the optimal value function, which is the maximum value attainable at a state under any policy. The agent therefore estimates the long-term return at any state s under policy π. This approach focuses on learning a value function that estimates the expected future reward for an agent in a given state under a specific policy, and the agent aims to maximize this value function to achieve long-term reward. Popular algorithms in this category include Q-Learning, SARSA, and Deep Q-Networks (DQN).
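As a minimal sketch of one of the value-based algorithms named above, the snippet below shows tabular Q-learning with ϵ-greedy action selection. The env.reset()/env.step() interface, the action names, and the hyperparameter values are assumptions made for illustration; they are not specified in the slides.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning sketch; env is assumed to expose reset() and step()."""
    actions = ["up", "down", "left", "right"]
    Q = defaultdict(float)                      # Q[(state, action)] -> estimated value

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q

Replacing the max over next actions with the value of the action actually taken in the next state would turn the same loop into SARSA.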
Value Functions
Value functions estimate the expected cumulative reward from a state or state-action pair. They are central to RL algorithms.
1) State-value function: Vπ(s) = Eπ[ R(t+1) + γR(t+2) + γ²R(t+3) + … | S(t) = s ], where γ is the discount factor. This represents the expected discounted reward starting from state s and following policy π.
2) Action-value function (Q-value): Qπ(s,a) = Eπ[ R(t+1) + γR(t+2) + γ²R(t+3) + … | S(t) = s, A(t) = a ]. This gives the expected cumulative reward for taking action a in state s and then following policy π.

How does Reinforcement Learning Work?
Take the example of a maze environment that the agent needs to explore. The agent can take four actions: move up, move down, move left, and move right. If the agent reaches the diamond at the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward (punishment).

The agent can take any path to reach the final point, but it needs to do so in as few steps as possible. For example:
Path #1: S9-S5-S1-S2-S3
Path #2: S9-S10-S11-S7-S3
Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward.

The agent will try to remember the preceding steps it took to reach the final step. To memorize the steps, it assigns a value of 1 to each previous step. The agent has now stored the previous steps by assigning the value 1 to each previous block. But what will the agent do if it starts from a block that has a value-1 block on both sides? Assigning the same value everywhere gives it no way to choose, which is where the Bellman equation comes in.

Bellman Equation
The Bellman equation plays a central role in value-based reinforcement learning (RL) because it provides a recursive relationship between the value of a state (or state-action pair) and the values of its successor states. This breaks the complex problem of determining long-term rewards into simpler, iterative steps, allowing the agent to efficiently learn optimal behavior. According to the Bellman equation, the long-term reward for a given action equals the reward from the current action combined with the expected reward from the future actions taken at the following times.

For state-value functions:
V(s) = max[R(s,a) + γV(s')]
This states that the value of a state s is the immediate reward plus the discounted value of the next state. The discount factor γ determines how much the agent cares about rewards in the distant future relative to those in the immediate future; it has a value between 0 and 1. A lower value of γ favors short-term rewards, while a higher value favors long-term rewards.
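The effect of γ can be seen directly by computing the discounted return for a reward that arrives several steps in the future. The reward sequence below is an assumed toy example, not one from the slides.

def discounted_return(rewards, gamma):
    # G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0, 0, 0, 0, 1]                 # a single +1 reward, four steps in the future
print(discounted_return(rewards, 0.5))    # 0.0625: the distant reward barely counts
print(discounted_return(rewards, 0.99))   # about 0.96: the distant reward keeps nearly full value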
Applying V(s) = max[R(s,a) + γV(s')] with γ = 0.9 to the maze:
At S3: V(S3) = max[1 + 0.9 * 0] = 1
At S2: V(S2) = max[0 + 0.9 * 1] = 0.9
At S1: V(S1) = max[0 + 0.9 * 0.9] = 0.81
At S5: V(S5) = max[0 + 0.9 * 0.81] = 0.73
At S9: V(S9) = max[0 + 0.9 * 0.73] = 0.66
At S7: V(S7) = max([0 + 0.9 * 1] for UP, [-1 + 0.9 * V(fire)] for the move into the fire pit) = 0.9
At S11: V(S11) = max[0 + 0.9 * 0.9] = 0.81
At S10: V(S10) = max([0 + 0.9 * 0.81], [0 + 0.9 * 0.66]) = 0.73
At S12: V(S12) = max[0 + 0.9 * 0.81] = 0.73

The max selects the most rewarding action among all the actions the agent can take in a particular state; repeating this process at every consecutive step propagates the reward back through the maze. From its starting state, the agent can choose either UP or RIGHT at random, since both lead toward the reward in the same number of steps.

The Bellman equation provides the corresponding recursive relationship for Q-values:
Q(s,a) = R(s,a) + γ max_a' Q(s',a')
Here, s' is the state resulting from taking action a in s, and a' is the optimal action to take in state s'. This allows the agent to learn the optimal policy by updating Q-values as it interacts with the environment.

Exploration vs. Exploitation
One of the central challenges in RL is balancing exploration (trying new actions to discover better rewards) and exploitation (using the known best actions). Methods like ϵ-greedy and Boltzmann exploration help balance this trade-off.

Curse of Dimensionality
As the number of states and actions increases, the size of the Q-value table grows exponentially. This leads to scalability issues, which require more advanced techniques, such as function approximation (e.g., using neural networks), to generalize over large state-action spaces. Temporal difference (TD) learning, hierarchical approaches, and other advanced methods also help tackle larger problems and partially observable environments.
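As an illustration of the function-approximation idea mentioned above, the sketch below replaces the Q-table with a simple linear approximator trained by a semi-gradient update. The feature function, the assumed 2-D state layout, and the step size are assumptions for illustration; in practice a neural network usually plays this role, as in DQN.

import numpy as np

ACTIONS = ["up", "down", "left", "right"]

def features(state, action):
    """Hypothetical feature vector phi(s, a); in practice this is domain-specific."""
    x, y = state                                  # assume the state is a 2-D position
    a = ACTIONS.index(action)
    return np.array([1.0, x, y, a], dtype=float)

w = np.zeros(4)                                   # weights of the linear Q approximation
alpha, gamma = 0.01, 0.9

def q_value(state, action):
    return float(np.dot(w, features(state, action)))

def td_update(state, action, reward, next_state):
    """Semi-gradient Q-learning step: w += alpha * TD_error * phi(s, a)."""
    global w
    best_next = max(q_value(next_state, a) for a in ACTIONS)
    td_error = reward + gamma * best_next - q_value(state, action)
    w += alpha * td_error * features(state, action)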
Advantages of Reinforcement Learning
- Even when there is no training dataset, the agent learns from the experience it gains by interacting with the environment.
- It can solve higher-order, complex problems, and the solutions obtained can be very accurate, because the approach closely resembles the way humans learn.
- The model undergoes a rigorous training process that can take time, but this helps correct errors; since the model learns constantly, a mistake made earlier is unlikely to recur.
- Many problem-solving models can be built using reinforcement learning; it plays a major role in simulators, object detection for self-driving cars, robots, and similar applications, providing suitable models for problems that seem complex to us.
- Because of its learning ability, it can be combined with neural networks; this is termed deep reinforcement learning.

Disadvantages of Reinforcement Learning
- Using reinforcement learning for simple problems is not appropriate: the models are built to tackle complex problems, so applying them to simple ones wastes processing power and memory.
- The model needs a large amount of data for computation; reinforcement learning models require a lot of training data to produce accurate results, which consumes time and considerable computational power.
- For real-world systems, the maintenance cost is very high: building driverless vehicles or robots, for example, requires extensive maintenance of both hardware and software.
- Excessive training can overload the model with states, which degrades its results; this can happen when too much memory is consumed in processing the training data.

Challenges in Reinforcement Learning
- Reward functions need to be designed properly; poorly specified reward functions can be too risk-sensitive or fail to capture the intended objective.
- Limited samples to learn from are a problem, as they lead to the model producing inadequate results.
- Overloading of states is never a good sign, as it can drastically impact the results; this happens when too much reinforcement learning is applied to a single problem.
- Too many parameters given to the model can delay results and lead to longer processing and higher CPU consumption.
- The developer should document the policies and actions in the algorithm in enough detail for the system operators.