Machine Learning for Networks Slides 2024 PDF
Document Details
Université Côte-d'Azur
2024
Tiago da Silva Barros, Hicham Lesfari
Summary
These slides provide an introduction to reinforcement learning for networking, covering key concepts, examples, and applications.
Full Transcript
Reinforcement Learning for Networking (Part I)
Tiago da Silva Barros, Hicham Lesfari
PhD students at Université Côte d'Azur / I3S / Inria COATI
[email protected]
Master 2 level

Roadmap
1. Introduction to Reinforcement Learning
2. Markov Decision Processes
3. Dynamic Programming
4. Monte-Carlo Learning
5. Temporal-Difference Learning
6. Lab session

Introduction to Reinforcement Learning

Learning paradigms (figure)

Supervised learning
(Figure: training data → resulting model → applied to new input)
❑ Data: (x, y), where x is the data and y is the label
❑ Goal: Learn a function that maps x → y
❑ Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.

Unsupervised learning
(Figure: training data → resulting model → applied to new input)
❑ Data: x, just data, no labels!
❑ Goal: Learn some underlying hidden structure of the data
❑ Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.

Principle
❑ Problems involving an agent interacting with an environment, which provides numeric reward signals
❑ Goal: Learn how to take smart actions in order to maximize cumulative reward

Examples: Robot locomotion
(Video: "Google's DeepMind AI Just Taught Itself To Walk")
❑ Objective: Make the robot move forward
❑ State: Angle and position of the joints
❑ Action: Torques applied to the joints
❑ Reward: +1 at each time step the robot is upright and moving forward

Examples: Cart-pole problem
❑ Objective: Balance a pole on top of a movable cart
❑ State: Angle, angular speed, position, horizontal velocity
❑ Action: Horizontal force applied to the cart
❑ Reward: +1 at each time step the pole stays upright

Examples: Go (圍棋) board game
❑ Objective: Win the game
❑ State: Position of all pieces
❑ Action: Where to put the next piece down
❑ Reward: 1 if the game is won at the end, 0 otherwise

Examples: Maze
❑ Objective: Find the star
❑ State: Agent's location
❑ Action: North, East, South, West
❑ Reward: -1 per time step

Examples: Robot soccer
❑ Objective: Win a robot soccer match
❑ State: Robot and ball locations
❑ Action: Robot speed (angular and linear)
❑ Reward: Goals, ball movement, ...

Applications
❑ Very active field of research
❑ RL can be used to solve optimization problems with a cumulative objective function:
  - Factory process optimization, airplane traffic regulation
  - Scheduling (e.g., picking up and dropping off passengers with a cab)
  - Playing games (e.g., old Atari games)
❑ Highlights:
  - 1997: Beating the world chess champion with Deep Blue (IBM)
  - 2011: Beating previous winners of the Jeopardy! show with Watson (IBM)
  - 2016: Beating the world Go champion with AlphaGo (DeepMind)
  - 2017: Beating AlphaGo with AlphaGo Zero, and the top chess engine Stockfish with AlphaZero (DeepMind)
  - 2019: Beating the Dota 2 world champions with OpenAI Five (OpenAI)
  - 2019: Reaching Grandmaster status in StarCraft II with AlphaStar (DeepMind)
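The slides contain no code at this point; as a concrete illustration of the agent-environment loop behind the examples above (states, actions, rewards, episodes), here is a minimal random-policy rollout on the cart-pole task. This is a sketch, not part of the slides; it assumes the gymnasium package is available.

    # Minimal random-policy rollout on CartPole (assumes gymnasium is installed).
    import gymnasium as gym

    env = gym.make("CartPole-v1")
    state, info = env.reset(seed=0)

    total_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()      # random policy: push the cart left or right
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward                   # +1 per step while the pole stays upright
        done = terminated or truncated

    print("Episode return with a random policy:", total_reward)
    env.close()

A random policy only keeps the pole up for a handful of steps; the point of RL is to learn a policy that makes this cumulative reward as large as possible.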
Key concepts
❑ Episode: a sequence of states, actions and rewards, which ends in some terminal state
❑ Policy: the agent's behaviour function
  - Deterministic: a = π(s)
  - Stochastic: π(a|s) = P_π[A_t = a | S_t = s]

Maze: policy
Arrows represent the policy π(s) for each state s.

Key concepts
❑ Cumulative reward: the reward accumulated over the long run
❑ Value function: how rewarding a state and/or action is, in terms of expected cumulative reward

Maze: value function
Numbers represent the value V_π(s) for each state s.

Key concepts (figure)

Maze: model
The grid layout represents the transition model P^a_{ss'}; numbers represent the immediate reward R^a_s from each state s (the same for all actions a).

RL approaches
❑ RL agent taxonomy (figure)
❑ Learning and planning (figure)

Markov Decision Processes

Formulation
❑ Almost all RL problems can be formalised as MDPs
❑ A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
  - The future is independent of the past given the present
  - Once the state is known, the history may be thrown away
  - All states in an MDP have the Markov property
❑ A Markov Decision Process is a tuple M = <S, A, P, R, γ>
  - S: finite set of states
  - A: finite set of actions
  - P: state transition probability matrix, P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
  - R: reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]
  - γ: discount factor in [0, 1]

MDP: example (figure)

Terminology
❑ The return G_t is the total discounted reward from time-step t:
  G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}
❑ The discount factor γ is the present value of future rewards
  - It values immediate reward above delayed reward
  - It avoids infinite returns in cyclic Markov processes

Terminology (figure)

Value functions
❑ The state-value function is the expected return starting from state s and then following policy π:
  V_π(s) = E_π[G_t | S_t = s]
❑ The action-value function is the expected return starting from state s, taking action a, and then following policy π:
  Q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

Bellman equations
❑ The value function can be decomposed into the immediate reward plus the discounted value of the successor state:
  V_π(s) = E_π[R_{t+1} + γ V_π(S_{t+1}) | S_t = s]
❑ A similar decomposition holds for the Q-value:
  Q_π(s, a) = E_π[R_{t+1} + γ Q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]

Bellman equations (figure)

Bellman equations (matrix form)
❑ The Bellman equations can be expressed concisely using matrices:
  V_π = R_π + γ P_π V_π
  where V_π is a column vector with one entry per state
❑ They can be solved directly:
  V_π = (I − γ P_π)^{-1} R_π
❑ The computational complexity is O(n³) for n states
❑ A direct solution is therefore only possible for small MDPs
❑ Many iterative methods exist for large MDPs:
  - Dynamic programming
  - Monte-Carlo evaluation
  - Temporal-Difference learning
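The matrix form can be checked numerically. Below is a minimal sketch, not part of the slides, on a made-up two-state chain induced by a fixed policy π; it assumes numpy.

    # Direct solution of the matrix-form Bellman equation V = (I - gamma * P)^(-1) * R
    # for a fixed policy pi, on a toy 2-state chain (made-up numbers, illustration only).
    import numpy as np

    gamma = 0.9
    # P[s, s'] = probability of moving from s to s' under the fixed policy pi
    P = np.array([[0.8, 0.2],
                  [0.1, 0.9]])
    # R[s] = expected immediate reward in state s under pi
    R = np.array([1.0, -1.0])

    # Closed-form solve: O(n^3) in the number of states, hence small MDPs only.
    V = np.linalg.solve(np.eye(2) - gamma * P, R)
    print(V)   # state values V_pi(s) for s = 0, 1

For large state spaces this solve is replaced by the iterative methods listed above (dynamic programming, Monte-Carlo, Temporal-Difference learning).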
Bellman optimality equations
❑ The optimal state-value function is the maximum value over all policies:
  V*(s) = max_π V_π(s)
❑ The optimal action-value function is the maximum action-value over all policies:
  Q*(s, a) = max_π Q_π(s, a)
❑ The optimal value function specifies the best possible performance in the MDP
❑ An MDP is "solved" when we know the optimal value function

Optimal value function (figure)

Optimal action-value function (figure)

Optimal policy
❑ Define a partial ordering over policies: π ≥ π' if V_π(s) ≥ V_{π'}(s) for all s
❑ Theorem: for any Markov Decision Process
  - There exists an optimal policy π* that is better than or equal to all other policies: π* ≥ π, ∀π
  - All optimal policies achieve the optimal value function: V_{π*}(s) = V*(s)
  - All optimal policies achieve the optimal action-value function: Q_{π*}(s, a) = Q*(s, a)
  → The proof is based on the Banach fixed-point theorem
❑ An optimal policy can be found by maximising over Q*(s, a), i.e. acting greedily: π*(s) = argmax_{a∈A} Q*(s, a)

Optimal policy (figure)

Bellman optimality equations
Recall that the optimal values V* and Q* are the best returns we can obtain:
  V*(s) = max_{a∈A} Q*(s, a)
  Q*(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} V*(s')
  V*(s) = max_{a∈A} ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} V*(s') )
  Q*(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} max_{a'∈A} Q*(s', a')
❑ If we have complete knowledge of the environment, this becomes a planning problem, solvable by Dynamic Programming
❑ The optimality equations are non-linear
❑ Many iterative solution methods exist → Value Iteration, Policy Iteration, Q-learning, Sarsa

Bellman optimality equations (figure)

Dynamic Programming

Primer
❑ Wikipedia definition: "optimization method for solving complex problems by breaking them down into simpler subproblems"
❑ Steps for solving DP problems:
  1. Define the subproblems
  2. Write down the recurrence that relates the subproblems
  3. Recognize and solve the base cases

1-DP
❑ Problem: given n, find the number of different ways to write n as an ordered sum of 1s, 3s and 4s
❑ Example: for n = 5, the answer is 6
  5 = 1+1+1+1+1
    = 1+1+3
    = 1+3+1
    = 3+1+1
    = 1+4
    = 4+1

1-DP
❑ Define the subproblems
  - Let D_n be the number of ways to write n as the sum of 1, 3, 4
❑ Find the recurrence
  - Consider one possible solution n = x_1 + x_2 + ... + x_m
  - If x_m = 1, the rest of the terms must sum to n − 1
  - Thus, the number of sums that end with x_m = 1 is equal to D_{n−1}
  - Take the other cases into account (x_m = 3, x_m = 4)

1-DP
❑ The recurrence is then D_n = D_{n−1} + D_{n−3} + D_{n−4}
❑ Solve the base cases
  - D_0 = 1
  - D_n = 0 for all negative n
  - Alternatively, set D_0 = D_1 = D_2 = 1 and D_3 = 2
❑ Implementation (very short!):
  D[0] = D[1] = D[2] = 1; D[3] = 2;
  for (i = 4; i <= n; i++)
      D[i] = D[i-1] + D[i-3] + D[i-4];
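The slides name Value Iteration among the iterative methods for the Bellman optimality equations, and it follows exactly the DP recipe above: the subproblems are the state values, the recurrence is the Bellman optimality backup, and the base case is an arbitrary initial V. Below is a minimal sketch on a made-up two-state, two-action MDP; it is not part of the slides and assumes numpy.

    # Value iteration: repeatedly apply the Bellman optimality backup
    #   V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) * V(s') ]
    # on a toy MDP with 2 states and 2 actions (made-up numbers, illustration only).
    import numpy as np

    n_states, n_actions = 2, 2
    gamma = 0.9

    # P[a, s, s'] = transition probability, R[s, a] = expected immediate reward
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
                  [[0.5, 0.5], [0.0, 1.0]]])   # action 1
    R = np.array([[1.0, 0.0],                  # rewards in state 0 for actions 0, 1
                  [0.0, 2.0]])                 # rewards in state 1 for actions 0, 1

    V = np.zeros(n_states)
    for _ in range(1000):
        # Q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * V[s']
        Q = np.array([[R[s, a] + gamma * P[a, s] @ V for a in range(n_actions)]
                      for s in range(n_states)])
        V_new = Q.max(axis=1)                  # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < 1e-8:   # stop once the backup no longer changes V
            break
        V = V_new

    policy = Q.argmax(axis=1)                  # greedy policy w.r.t. the final Q values
    print("V*:", V, " greedy policy:", policy)

Monte-Carlo and Temporal-Difference learning, covered next, tackle the same problem when the transition model P and reward function R are not known.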