Deep Reinforcement Learning Lecture PDF
Astana IT University
Dr. Darkhan Zholtayev
Summary
This lecture introduces the fundamental concepts of deep reinforcement learning. It covers topics like the environment, rewards, and value functions. It also explores various applications and challenges within the field.
Full Transcript
Deep and reinforcement learning
Introduction
Dr. Darkhan Zholtayev
Assistant professor at the Department of Computational and Data Science

General graph AI map
Joseph, B. (2020, June 17). Linear Regression Made Easy: How Does It Work and How to Use It in Python. Towards Data Science. https://towardsdatascience.com/linear-regression-made-easy-how-does-it-work-and-how-to-use-it-in-python-be0799d2f159

Research hero
https://medium.com/@Coursesteach/deep-learning-part-1-86757cf5a0c3

Lecture outline
- What is robotics?
- Control in robotics
- Reinforcement learning
- Deep reinforcement learning

Robots
- Sensors (Camera, IMU, Lidar, GPS)
- Actuators (Hydraulic, Motors)
- Limbs

Robotic sensors

Robot control system is complex
Controlling the robotic system is not straightforward.

Coordinate transformations
1. Forward Kinematics (FK)
2. Inverse Kinematics (IK)
3. Homogeneous Transformation (HT)
4. Denavit-Hartenberg (D-H)
5. Rotation Matrix (RM)
6. Translation Vector (TV)
7. Screw Theory (ST)
8. Jacobian Matrix (JM)

Control methods
- Proportional-Integral-Derivative (PID) Control
- Feedforward Control
- Model Predictive Control (MPC)
- Optimal Control
- Sliding Mode Control (SMC)
- Adaptive Control
- Fuzzy Logic Control
- Neural Network Control
- Reinforcement Learning (RL) Control

Control of any system
Feedback control refers to the process of using the difference between a set point and a controlled variable to determine the corrective action needed to keep the controlled variable close to its desired value.

Glimpse of DRL
https://www.linkedin.com/posts/roboticworld_a-bipedal-robot-is-mastering-agile-football-ugcPost-7269480950902767616-I1Of?utm_source=share&utm_medium=member_desktop
https://www.linkedin.com/posts/nicholasnouri_innovation-technology-future-ugcPost-7241697270289604608-8yvl?utm_source=share&utm_medium=member_desktop
https://www.linkedin.com/posts/imgeorgiev_we-have-a-new-icml-paper-adaptive-horizon-ugcPost-7203887894967541760-0SUx?utm_source=share&utm_medium=member_desktop

Reinforcement learning examples
https://www.youtube.com/watch?v=kopoLzvh5jY
https://www.youtube.com/watch?v=3jDoPobFgwA

The RL framework
How does RL fit in the bigger picture? It is essentially the science of decision making. This general applicability is also what makes RL so interesting to me personally. RL is one of the potential technologies that could get us closer to general AI: an AI system that can solve any task, in contrast to a narrow set of tasks.

Different paradigms

Reinforcement learning framework
1. Observe the current state of the environment, O.
2. Take an action A (which would change the state).
3. Get a reward R.
4. Observe the next state of the environment, O.

Core idea
If the agent takes the greedy action at each step, it will follow the path N-1, N-2, and N-5. However, the path that gives the maximum total reward is N-1, N-2, N-3, N-4, and N-5.

State
When the agent directly observes the environment state, the case is referred to as full observability. Otherwise, it is referred to as partial observability.

Markov state
"The future is independent of the past given the present."
Recall the graph example above and assume that we are at Node-3. The next state, i.e., the next node we will move to, is independent of which nodes we passed through before reaching Node-3.

Return & Value Function
The agent moves from one state (observation) to another while collecting a reward in each move.
If we somehow knew which states lead to the maximum cumulative reward, we could decide accordingly which state should be next. The cumulative reward collected from time step t onward is the return:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

where $\gamma \in [0, 1]$ is called the discount rate.

Discounted reward and value function
A value function v(s), which evaluates the long-term value of the state s, is the expected return starting from state s:

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$

State Transition Matrix:

Markov Decision Process
A Markov decision process (MDP) is a way to formalize almost all RL problems with a tuple (S, A, P, R, γ), where S denotes a set of states, A denotes a set of actions, R denotes immediate rewards, γ denotes the discount factor, and P denotes the dynamics of the MDP such that

$$p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}$$

for all s', s ∈ S, r ∈ R, and a ∈ A. It is the probability of observing the next state s' and getting the reward r, given that the agent is in state s and takes the action a.

Policy:
A policy is a map of the behaviors of an agent. There are two types of policies:
- Deterministic policy: No uncertainty. The policy function π takes the state as input and gives the action as output, which we write as π(s) = a.
- Stochastic policy: There is a probability distribution over actions given states:

$$\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}$$

Notice that for a stochastic policy the policy function should satisfy:

$$\sum_{a \in A} \pi(a \mid s) = 1 \quad \text{for all } s \in S$$

In reinforcement learning, we aim to find the optimal (best) policy that will lead to the maximum expected cumulative reward.

From the dynamics function, we can compute the state transition matrix as follows:

$$P^{a}_{ss'} = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\} = \sum_{r \in R} p(s', r \mid s, a)$$

Next, we will dive into the state-value and action-value functions. The state-value function measures the expected total reward by starting from the state s, whereas the action-value function measures the expected total reward by starting from the state s and taking the action a in that state.
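
As a small illustration of the return and the state transition matrix described above, the following Python sketch computes a discounted return for a finite list of rewards and derives transition probabilities from a dynamics function. The states `s0` and `s1`, the actions `stay` and `go`, and all rewards and probabilities are made-up values for illustration; they do not come from the lecture.

```python
from collections import defaultdict


def discounted_return(rewards, gamma):
    """Discounted sum R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... of a finite reward list."""
    g = 0.0
    # Work backwards so that each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical dynamics p(s', r | s, a):
# (state, action) -> list of (next_state, reward, probability).
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "go"): [("s0", 0.0, 0.5), ("s0", -1.0, 0.3), ("s1", 2.0, 0.2)],
}


def transition_probabilities(dynamics):
    """Sum p(s', r | s, a) over rewards r to obtain P(s' | s, a)."""
    probs = defaultdict(float)
    for (s, a), outcomes in dynamics.items():
        for s_next, _reward, prob in outcomes:
            probs[(s, a, s_next)] += prob
    return dict(probs)


P = transition_probabilities(dynamics)

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(P[("s1", "go", "s0")])  # two reward outcomes merged into one transition probability
```

For every state-action pair, the resulting probabilities over next states sum to 1, which is a quick sanity check for any hand-written dynamics table.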
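
State-Value Function & Action-Value Function & Bellman Equations:
We define the state-value function (i.e., the long-term value of the state s) and the action-value function (i.e., the long-term value of taking the action a in state s) under the policy π as:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

From equation (1), both the state-value and the action-value functions can be decomposed into the immediate reward and the discounted value of the successor state:

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
$$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$

These equations are known as the Bellman equations, which serve as a touchstone on our way from value functions to the optimal policy.

Value function
The properties of the expected value give us a useful way to rewrite the Bellman equation for the state-value function as

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a) [r + \gamma v_\pi(s')]$$

Moreover, we can express the Bellman equation using matrices:

$$v_\pi = R_\pi + \gamma P_\pi v_\pi$$

where n denotes the number of states, $v_\pi \in \mathbb{R}^n$ is the vector of state values, $R_\pi \in \mathbb{R}^n$ is the vector of expected immediate rewards, and $P_\pi \in \mathbb{R}^{n \times n}$ is the state-transition matrix under π.

Value function modification
Now the problem has become a system of linear equations, so if we know the state-transition matrix and rewards, we can find the state-value function by solving it directly:

$$v_\pi = (I - \gamma P_\pi)^{-1} R_\pi$$

It is not directly applicable
However, this approach is not applicable in general. We use direct solutions only for problems with a small number of states because of the computational complexity. For large state spaces, we use iterative methods, e.g.:
- Dynamic Programming
- Monte Carlo
- TD Learning

Interrelation

Optimal Policy & Optimal Value-Function:
The optimal state-value function is defined as the maximum state-value function over all policies:

$$v_*(s) = \max_\pi v_\pi(s)$$

In the same manner, the optimal action-value function is the maximum action-value function over all policies:

$$q_*(s, a) = \max_\pi q_\pi(s, a)$$

Optimality

Optimal policy
Consider the optimal policy function below to understand the last two points:

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$$

Bellman Equations for Optimality:

$$v_*(s) = \max_{a} \sum_{s'} \sum_{r} p(s', r \mid s, a) [r + \gamma v_*(s')]$$
$$q_*(s, a) = \sum_{s'} \sum_{r} p(s', r \mid s, a) [r + \gamma \max_{a'} q_*(s', a')]$$

Bellman equation
The Bellman equations for optimality are non-linear, hence we cannot solve them by linear algebraic methods.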
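
For a fixed policy, by contrast, the matrix form of the Bellman equation is linear, so it can be solved directly or by repeated backups. The sketch below does both for a hypothetical two-state problem. The transition matrix `P`, the reward vector `R`, and `gamma` are invented numbers, and Cramer's rule is used only to keep the example free of dependencies; a linear-algebra library would normally handle the direct solve.

```python
def bellman_backup(v, R, P, gamma):
    """One application of v <- R + gamma * P v for a fixed policy."""
    n = len(v)
    return [R[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]


def solve_direct_2x2(R, P, gamma):
    """Solve (I - gamma * P) v = R exactly for two states using Cramer's rule."""
    a, b = 1 - gamma * P[0][0], -gamma * P[0][1]
    c, d = -gamma * P[1][0], 1 - gamma * P[1][1]
    det = a * d - b * c
    return [(R[0] * d - b * R[1]) / det, (a * R[1] - c * R[0]) / det]


def evaluate_iteratively(R, P, gamma, tol=1e-12, max_sweeps=100_000):
    """Iterative policy evaluation: repeat the backup until the values stop changing."""
    v = [0.0] * len(R)
    for _ in range(max_sweeps):
        v_new = bellman_backup(v, R, P, gamma)
        if max(abs(x - y) for x, y in zip(v, v_new)) < tol:
            return v_new
        v = v_new
    return v


# Hypothetical two-state example: rows of P are states, columns are next states.
P = [[0.9, 0.1],
     [0.4, 0.6]]
R = [1.0, -1.0]   # expected immediate reward in each state
gamma = 0.9

v_direct = solve_direct_2x2(R, P, gamma)
v_iter = evaluate_iteratively(R, P, gamma)
print(v_direct)
print(v_iter)
```

Both routes agree here because the policy is fixed and the system is linear. Once a maximum over actions enters, as in the Bellman equations for optimality, the direct route is no longer available.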
Instead, there are many iterative solutions, which are divided into two categories: model-based and model-free methods.

Q learning problem
- What is Q-Learning?
- Mathematics behind Q-Learning
- Implementation using Python

Problem
The scoring/reward system is as below:
1. The robot loses 1 point at each step. This is done so that the robot takes the shortest path and reaches the goal as fast as possible.
2. If the robot steps on a mine, the point loss is 100 and the game ends.
3. If the robot gets power, it gains 1 point.
4. If the robot reaches the end goal, the robot gets 100 points.

Introducing the Q-Table
In the Q-Table, the columns are the actions and the rows are the states. Each Q-table score will be the maximum expected future reward that the robot will get if it takes that action at that state. This is an iterative process, as we need to improve the Q-Table at each iteration.
But the questions are:
- How do we calculate the values of the Q-table?
- Are the values available or predefined?

Mathematics: the Q-Learning algorithm

Q-function
The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a). There is an iterative process of updating the values. As we start to explore the environment, the Q-function gives us better and better approximations by continuously updating the Q-values in the table.

Introducing the Q-learning algorithm process

Step 1: initialize the Q-Table

Steps 2 and 3: choose and perform an action
We will choose an action (a) in the state (s) based on the Q-Table. But, as mentioned earlier, when the episode initially starts, every Q-value is 0. We'll use something called the epsilon-greedy strategy. In the beginning, the epsilon rate will be higher. The robot will explore the environment and randomly choose actions. The logic behind this is that the robot does not know anything about the environment.
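
The epsilon-greedy strategy can be sketched in a few lines of Python. The function names, the exponential decay schedule, and its constants are assumptions chosen for illustration; the lecture only states that epsilon starts high.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon pick a random action index, otherwise the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                     # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])    # exploit


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Hypothetical schedule: epsilon shrinks geometrically per episode down to a floor."""
    return max(end, start * decay ** episode)


# One row of a Q-table: estimated values for up, down, left, right in some state.
q_row = [0.0, 5.0, 1.0, 2.0]

print(epsilon_greedy(q_row, epsilon=0.0))   # never explores, so it returns the index of 5.0
print(decayed_epsilon(0), decayed_epsilon(50), decayed_epsilon(1000))
```

With `epsilon=1.0` every choice is random, which matches the start of training when all Q-values are still 0 and the table carries no information.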
As the robot explores the environment, the epsilon rate decreases and the robot starts to exploit the environment. During the process of exploration, the robot progressively becomes more confident in estimating the Q-values.

Random actions
For the robot example, there are four actions to choose from: up, down, left, and right. We are starting the training now, and our robot knows nothing about the environment. So the robot chooses a random action, say right.

Steps 4 and 5: evaluate
- power = +1
- mine = -100
- end = +100

RL application
- Recommendation systems
- Autonomous systems
- LLM training
- Creating a personalized learning system

Challenges in RL
- Sample (in-)efficiency
- Transferring from simulation into reality
- Convergence instability
- The exploration-exploitation trade-off
- The sparse-reward problem

Deep RL taxonomy

Prominent papers in DRL

GYM environment
Components:
- Observation Space: Defines the format of observations.
- Action Space: Defines the set of possible actions.
- Reward Structure: Defines how rewards are given.
- State Transition: Defines how the environment changes with actions.

Gym API Methods
- reset(): Initializes the environment.
- step(action): Applies an action and returns the result.
- render(): Visualizes the environment.
- close(): Cleans up resources.

Evaluating Agent Performance
Metrics to Consider:
- Total Reward: Sum of rewards over an episode.
- Episode Length: Number of steps before termination.
- Stability and Consistency: How performance varies across episodes.

Visualization Tools
- TensorBoard: For tracking metrics.
- Matplotlib: For plotting performance graphs.

Best Practices:
- Run multiple training sessions.
- Compare against baseline models.

Training deep RL robots (ROS)

Thanks for the attention
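
To tie the Q-learning steps and the Gym-style interface together, here is a minimal sketch of tabular Q-learning on a small grid using the reward scheme from the robot example (step = -1, power = +1, mine = -100, end = +100). The grid layout, the hyperparameters, and the `GridWorld` class are hypothetical. The class only borrows the `reset()` and `step(action)` method names from the Gym API and does not use the Gym library. Two further assumptions: the power cell pays out on every visit, and the mine and end rewards replace the step cost. The update line is the standard Q-learning rule Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)].

```python
import random

# Hypothetical layout: S = start, P = power, M = mine, G = end goal, . = empty cell.
GRID = ["S.P.",
        ".M..",
        "...G"]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


class GridWorld:
    """Toy environment with Gym-style reset() and step(action) methods."""

    def __init__(self, grid):
        self.grid = grid
        self.rows, self.cols = len(grid), len(grid[0])
        self.start = next((r, c) for r in range(self.rows)
                          for c in range(self.cols) if grid[r][c] == "S")
        self.pos = self.start

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        dr, dc = ACTIONS[action]
        # Moves that would leave the grid keep the robot on the border cell.
        r = min(max(self.pos[0] + dr, 0), self.rows - 1)
        c = min(max(self.pos[1] + dc, 0), self.cols - 1)
        self.pos = (r, c)
        cell = self.grid[r][c]
        if cell == "M":
            return self.pos, -100.0, True     # mine: lose 100, episode ends
        if cell == "G":
            return self.pos, 100.0, True      # end goal: gain 100, episode ends
        reward = -1.0 + (1.0 if cell == "P" else 0.0)   # step cost, plus power bonus
        return self.pos, reward, False


def train(env, episodes=3000, alpha=0.1, gamma=0.95, max_steps=100, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the Q-table with zeros (rows = states, columns = actions).
    q = {(r, c): [0.0] * len(ACTIONS)
         for r in range(env.rows) for c in range(env.cols)}
    for episode in range(episodes):
        epsilon = max(0.05, 0.995 ** episode)   # explore early, exploit later
        state = env.reset()
        for _ in range(max_steps):
            # Steps 2 and 3: choose an action (epsilon-greedy) and perform it.
            if rng.random() < epsilon:
                action = rng.randrange(len(ACTIONS))
            else:
                action = max(range(len(ACTIONS)), key=lambda a: q[state][a])
            next_state, reward, done = env.step(action)
            # Steps 4 and 5: evaluate the reward and update the Q-value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def greedy_rollout(env, q, max_steps=50):
    """Follow the learned Q-table without exploration and record the path."""
    state = env.reset()
    path, total = [state], 0.0
    for _ in range(max_steps):
        action = max(range(len(ACTIONS)), key=lambda a: q[state][a])
        state, reward, done = env.step(action)
        path.append(state)
        total += reward
        if done:
            break
    return path, total


env = GridWorld(GRID)
q_table = train(env)
path, total_reward = greedy_rollout(env, q_table)
print(path, total_reward)
```

The total reward and episode length returned by `greedy_rollout` correspond to the first two evaluation metrics listed above. Running `train` with several seeds gives a rough view of stability across training sessions.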