Learning for Robot Manipulation: An Overview
Document Details
Uploaded by PatientSlideWhistle
Hochschule Bonn-Rhein-Sieg
2023
Dr. Alex Mitrevski
Summary
This document is a lecture on learning for robot manipulation, presented by Dr. Alex Mitrevski. It provides an overview of learning-based robot manipulation strategies, discussing theoretical aspects, examples, and practical considerations.
Full Transcript
Learning for Robot Manipulation: An Overview
Dr. Alex Mitrevski
Master of Autonomous Systems, Winter semester 2023/24

Structure
▶ Why learning for robot manipulation
▶ Overview of learning for manipulation
▶ State representation
▶ Manipulation policy learning
▶ Transition model learning

Why Learning for Robot Manipulation

Manipulation Skill Examples
▶ In everyday environments, there is a large variety of useful manipulation skills, which require varying degrees of dexterity
▶ Many such skills can be designed using model-based techniques, but many others require flexibility that can be tricky to model explicitly
▶ An alternative approach is to allow a robot to acquire such skills (semi-)autonomously

Learning for Contact-Heavy Interactions
▶ Learning is particularly useful to consider for manipulation tasks that involve prolonged or precise contacts with the environment
▶ This is because, in principle, contact-heavy interactions can be challenging to model in sufficient detail
▶ Instead, it is sensible to allow a robot to learn an appropriate interaction policy

Learning and Robot Control
▶ Particularly when considering manipulation motor skills, the learning problem is closely related to the one solved by classical control theory: make a robot act so that a certain objective is satisfied
▶ The approaches are, however, conceptually different:
  ▶ Control theory models systems and controllers explicitly
  ▶ Learning enables robots to optimise controllers through direct experience
▶ Depending on the nature of the learning problem, a combination of control theory and learning is both possible and reasonable (e.g. learning can be used to optimise the parameters of an explicitly modelled controller)

Lessons from Natural Systems
▶ The perspective on the previous slides is a practical one — explicitly programming robot skills is often challenging or inflexible, so we use learning techniques instead
▶ Learning is also interesting to look at from a cognitive developmental point of view — after all, biological creatures acquire most of their skills via learning
▶ Robots that have learning and adaptation capabilities similar to those of biological creatures are likely to be most useful in our complex, regularly changing environments
D. Han and K. E. Adolph, “The impact of errors in infant development: Falling like a baby,” Developmental Science, vol. 24, no. 5, pp. e13069:1–14, 2020.

Overview of Learning for Manipulation

What to Learn for Manipulation?
Learning in the manipulation context can be concerned with multiple aspects, for instance:
▶ Object models: Manipulation tasks generally involve handling objects, whose models (e.g. visual recognition models or part models) can be learned
▶ Policy parameters: In many cases, we want a robot to execute a motor policy πθ that is defined by well-defined parameters θ, which need to be learned (see the policy sketch after this list)
▶ Skill models: In a more general case, a complete skill can be learned (a policy as well as the skill’s initiation and termination conditions)
▶ Skill hierarchies: When multiple (primitive) skills are available, it can be useful to learn how the skills can be combined for solving complex tasks
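To make the notion of a parameterised policy πθ concrete, the following minimal Python sketch shows a policy whose behaviour is fully determined by a parameter vector θ (here the weights of a linear-Gaussian mapping from states to actions); learning then amounts to adjusting θ. This is an illustrative assumption, not an example from the lecture, and all class and variable names are hypothetical.

```python
import numpy as np

class LinearGaussianPolicy:
    """Illustrative parameterised policy pi_theta: a linear mapping from the
    state to a mean action, with Gaussian exploration noise.
    theta = (W, b) are the parameters to be learned."""

    def __init__(self, state_dim: int, action_dim: int, noise_std: float = 0.05):
        self.W = np.zeros((action_dim, state_dim))  # part of theta
        self.b = np.zeros(action_dim)               # part of theta
        self.noise_std = noise_std

    def act(self, state: np.ndarray) -> np.ndarray:
        """Sample an action a ~ pi_theta(a | s)."""
        mean_action = self.W @ state + self.b
        return mean_action + self.noise_std * np.random.randn(*mean_action.shape)

# Usage: a 6-dimensional state (e.g. an end-effector pose) mapped to a
# 3-dimensional action (e.g. a Cartesian velocity command).
policy = LinearGaussianPolicy(state_dim=6, action_dim=3)
action = policy.act(np.random.rand(6))
```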
Learning for Manipulation Overview

State Representation

Why Does the State Representation Matter?
▶ The overall objective of robot manipulation is to enable robots to perform purposeful actions in the world, namely actions that change the environment so that some desired goal can be achieved
▶ The manner in which the environment changes based on a robot’s actions can be captured by a change of state; thus, the state representation should be able to capture relevant changes in the environment
▶ In addition, an appropriate state representation is often responsible for simplifying otherwise intractable learning problems

Robot and Environment State
▶ Robot actions have an effect both on the robot itself and on the robot’s environment
▶ A general state representation thus has to capture both of these aspects and has the following form:
  S = S_r ∪ S_e
  where
  ▶ S_r is a representation of the robot’s internal state
  ▶ S_e represents the state of the task environment

Object-Centric Environment Representation
▶ As manipulation is typically concerned with handling objects, the environment state is usually modelled through the states of individual objects of interest
▶ Let O_j represent the state of some object o_j and n be the number of objects of interest for a task; then
  S_o = ∪_{j=1}^{n} O_j
▶ In many cases, it can also be useful to capture some general information S_w about the environment; thus, the complete environment state can be seen as a combination of the general environment state and the object-specific states (see the sketch below):
  S_e = S_w ∪ S_o = S_w ∪ (∪_{j=1}^{n} O_j)
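The object-centric decomposition S_e = S_w ∪ (∪_j O_j) can be mirrored directly in code. The sketch below is purely illustrative (the class and field names are assumptions, not part of the lecture): it composes the environment state from a general world component and per-object states.

```python
from dataclasses import dataclass, field
from typing import Dict
import numpy as np

@dataclass
class ObjectState:
    """State O_j of a single object of interest (illustrative fields)."""
    pose: np.ndarray          # e.g. position and orientation
    grasped: bool = False

@dataclass
class EnvironmentState:
    """S_e = S_w ∪ (∪_j O_j): general world information plus object states."""
    world: Dict[str, float] = field(default_factory=dict)           # S_w, e.g. {"table_height": 0.75}
    objects: Dict[str, ObjectState] = field(default_factory=dict)   # the O_j, keyed by object name

# Usage: an environment with one cup standing on a table
state = EnvironmentState(
    world={"table_height": 0.75},
    objects={"cup": ObjectState(pose=np.array([0.4, 0.1, 0.8, 0.0, 0.0, 0.0]))},
)
```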
Generalisation over Contexts
▶ When a robot learns an execution policy, the policy is typically specific to certain environmental parameters (e.g. a specific object mass) that remain constant during the execution
▶ The dependence on such parameters can be made explicit by representing them as an execution context vector τ ∈ C
▶ The execution context can then serve as information that conditions the execution policy:
  π : S × C → A

Task Family
▶ When modelling learning problems, we can often define a task family, which is a collection of tasks T_i, 1 ≤ i ≤ t, that
  ▶ have the same action space A
  ▶ but each of them has its own state space S_i, context space C_i, a transition function T_i, as well as a reward function R_i
▶ The relation between tasks of a task family can be expressed through the reward function R_i, which can be modelled as
  R_i = G_i − E
  where G_i represents a task-specific goal and E is a common cost function
▶ Overall, a task family of t tasks can be represented as a collection of Markov Decision Processes (MDPs), as sketched below:
  P(M) = {(S_i, A, T_i, R_i, C_i, γ) | 1 ≤ i ≤ t}
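As a minimal illustration of the task-family formulation (a sketch only; the container and field names are assumptions, not part of the lecture), the code below collects per-task MDP components that share a common action space, with each reward composed as R_i = G_i − E.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

State, Action = Any, Any  # placeholders for task-specific types

@dataclass
class TaskMDP:
    """One task T_i of a task family: (S_i, A, T_i, R_i, C_i, gamma)."""
    state_space: Any                               # S_i
    action_space: Any                              # A (shared by all tasks in the family)
    transition: Callable[[State, Action], State]   # T_i
    reward: Callable[[State, Action], float]       # R_i
    context_space: Any                             # C_i
    gamma: float = 0.99

def make_reward(task_goal: Callable[[State, Action], float],
                common_cost: Callable[[State, Action], float]) -> Callable[[State, Action], float]:
    """Compose R_i = G_i - E from a task-specific goal G_i and a shared cost E."""
    return lambda s, a: task_goal(s, a) - common_cost(s, a)

# A task family P(M) is then simply a collection of such MDPs that share action_space.
task_family: List[TaskMDP] = []
```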
Object Representations
▶ Objects are often part of the state representation in robot manipulation, but there is no unique way in which objects themselves are modelled
▶ Concretely, there are different hierarchy levels at which objects can be observed:
  ▶ Part level: Objects are represented through their individual parts (e.g. a cup has a container and a handle)
  ▶ Point level: Individual object points (e.g. pixels) are identified
  ▶ Object level: Objects are represented as a whole (e.g. through a bounding box)
▶ Each of these hierarchies can be useful for different tasks:
  ▶ A point-level representation is suitable when specific points of an object are relevant during an interaction (e.g. the prongs of a fork)
  ▶ Parts can be useful to look at for task-oriented grasping
  ▶ An object-level representation aids scene understanding, but is also required for ensuring that actions are performed on an object of interest
▶ Ideally, the hierarchical levels are used by different skills that can be composed to solve a specific task

Passive vs. Interactive Perception
▶ A robot needs to perceive the environment to acquire information about objects and the overall scene
▶ There are two overall perception strategies, depending on whether the robot is simply passively observing or actively investigating the scene:
  ▶ Passive perception: A scene is observed passively based on received sensory data (e.g. camera images)
  ▶ Interactive perception: A robot performs actions to collect information about certain environmental aspects (e.g. touching an object to find out its material)
▶ Many aspects of the environment are not observable using passive perception only (e.g. the mass of an object), so interactive perception is often essential for successful task completion

Manipulation Policy Learning

Execution Policies Revisited
▶ In the last lecture, we defined a skill by an execution policy together with initiation and termination conditions
▶ A policy π : S → A models a robot’s behaviour, and a large part of the effort in robot learning goes into how to actually acquire such a policy
▶ There are a few general ways in which this can be done:
  ▶ Reinforcement learning: A policy is learned using direct interactions with the world
  ▶ Imitation learning: Learning is done based on expert observations
  ▶ Transfer learning: Previously learned policies are used to guide the learning process

Action Spaces
▶ Execution policies can have a variety of action spaces; the types shown in the lecture diagram include Cartesian velocities, Cartesian forces, joint velocities, joint torques, and controller parameters
▶ Note that policy outputs are typically not directly used as actuator commands, but are processed by a low-level robot controller

Policy Representations
▶ (Taxonomy diagram) Policy representations include nonparametric, fixed-size parametric, restricted parametric, and goal-based representations; the examples shown include Gaussian processes, nearest neighbour-based methods, locally-weighted regression, lookup tables, basis function combinations, neural networks, decision trees, linear quadratic regulators, dynamic motion primitives, and Gaussian mixture regression

Deterministic vs. Stochastic Policies
▶ Regarding how actions are selected from a policy, we can distinguish between deterministic and stochastic policies:
  ▶ Stochastic policy: Actions are selected by sampling from the distribution of actions given a state: a_t ∼ π(a|s_t)
  ▶ Deterministic policy: Actions are selected by a deterministic function of the current state: a_t = π(s_t)

Parameterised Policies and Trajectories
▶ In robotics, policies are often represented by parameters θ, so we denote the policy as πθ
▶ A policy is used to define a trajectory (also called episode or rollout)
  τ = (s_0, a_0, s_1, ..., a_n, s_{n+1})
▶ Given a policy π, the probability of a trajectory can be found to be
  P_π(s_0, a_0, s_1, ..., a_n, s_{n+1}) = P(s_0) ∏_{i=0}^{n} P_π(a_i|s_i) P(s_{i+1}|s_i, a_i)

Reinforcement Learning Objective
▶ When using reinforcement learning for acquiring a policy, the objective is to find a policy π* that maximises the robot’s expected return:
  π* = argmax_π E_{τ∼π}[ Σ_t r(s_t, a_t) ] = argmax_π ∫ P(τ|π) R(τ) dτ
▶ If we are given a parameterised policy πθ, the learning objective is that of finding a set of parameters θ* that maximise the expected return (a sampling-based estimate of this objective is sketched below):
  θ* = argmax_θ E_{τ∼πθ}[ Σ_t r(s_t, a_t) ] = argmax_θ ∫ P(τ|πθ) R(τ) dτ
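The expectation in the learning objective is rarely computed in closed form; in practice it is approximated from sampled rollouts. The following sketch estimates the expected return of a policy by Monte Carlo sampling; the `env` and `policy` interfaces are illustrative assumptions rather than a specific library API.

```python
import numpy as np

def estimate_expected_return(env, policy, num_rollouts: int = 20, horizon: int = 100) -> float:
    """Monte Carlo estimate of J(theta) = E_{tau ~ pi_theta}[ sum_t r(s_t, a_t) ].

    Assumes an env with reset() -> state and step(action) -> (state, reward, done),
    and a policy with act(state) -> action (both interfaces are assumptions)."""
    returns = []
    for _ in range(num_rollouts):
        state = env.reset()
        total_reward = 0.0
        for _ in range(horizon):
            action = policy.act(state)
            state, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        returns.append(total_reward)
    return float(np.mean(returns))
```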
Exploration vs. Exploitation
▶ During learning, a robot has to balance exploration and exploitation:
  ▶ Exploitation: Acting according to the best policy available to the robot (so far)
  ▶ Exploration: Acting by trying out actions that may not be optimal under the current best policy
▶ There is always a trade-off between exploitation and exploration:
  ▶ if the robot exploits too much too early, it risks converging to a suboptimal policy
  ▶ however, the robot’s policy should eventually converge; too much exploration can prevent that from happening

Model-Free Learning
▶ When learning manipulation policies, a robot usually does not have a transition model (or even a reward model) of the environment, but has to explore the environment during learning
▶ In such instances, model-free reinforcement learning needs to be used
▶ The model-free learning setup is that we have m trajectories τ_i, 1 ≤ i ≤ m, and the accompanying rewards along the trajectories, such that an optimal policy has to be found from these experiences
▶ There are two major families of model-free algorithms:
  ▶ Temporal difference learning: Performs value / policy updates at every step (i.e. after the execution of every action)
  ▶ Monte Carlo learning: Estimates the return from complete trajectories and then performs value / policy updates

Temporal Difference — TD(λ) — Learning and Q-Learning
▶ The TD(λ) learning algorithm attempts to bring the value function V(s_t) closer to the reward function, while preventing myopic updates
▶ The parameter λ controls the amount of prediction during learning — if λ > 0, older states are considered during learning
▶ For TD(0), only a single-step prediction is done, with α a learning rate:
  V(s_t) = V(s_t) + α (r(s_t, a_t) + γ V(s_{t+1}) − V(s_t))
▶ A popular temporal difference RL algorithm is Q-learning, which estimates a state-action value function Q(s_t, a_t)
▶ The Q-learning update rule (implemented in the sketch below) is given by
  Q(s_t, a_t) = Q(s_t, a_t) + α (r(s_t, a_t) + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t))
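As a concrete illustration of the tabular Q-learning update above, here is a minimal sketch assuming discrete states and actions, an ε-greedy exploration strategy, and an illustrative environment interface (the `env` methods are assumptions, not a specific library).

```python
import numpy as np

def q_learning(env, num_states: int, num_actions: int,
               episodes: int = 500, alpha: float = 0.1,
               gamma: float = 0.95, epsilon: float = 0.1) -> np.ndarray:
    """Tabular Q-learning with epsilon-greedy exploration.
    Assumes env.reset() -> int state and env.step(a) -> (int state, float reward, bool done)."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (exploration vs. exploitation)
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            td_target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```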
Deep Q-Learning
▶ Q-learning as seen on the previous slide is defined for discrete action spaces; however, using a function approximator (e.g. a neural network), it can be extended to continuous state spaces
▶ In deep Q-learning, the Q-function is represented using a deep neural network, and the objective function that is minimised is often of the form
  L(θ) = E[ (Q(s_t, a_t) − (r(s_t, a_t) + γ max_a Q(s_{t+1}, a)))² ]
V. Mnih et al., “Human-level Control Through Deep Reinforcement Learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
▶ Q-learning can be fairly unstable when a neural network is used to represent the value function (and may not even converge) — but there are practical tricks to improve the convergence

Policy Search
▶ Value-based algorithms, such as TD(λ) and Q-learning, derive the policy from the value function
▶ Policy search circumvents the need for the value function by optimising in the policy space directly
▶ Policy search algorithms are useful in robotics since they allow incorporating prior knowledge about the policy

Policy Gradients
▶ Policy gradient methods represent one popular family of policy search
▶ Given a parameterised policy πθ, a policy gradient algorithm estimates the gradient of the expected return and modifies the parameters θ using the update rule
  θ ← θ + ∇_θ J(θ) = θ + ∇_θ ∫ R(τ) P(τ|θ) dτ
▶ Policy gradient algorithms often make use of the so-called likelihood ratio trick
  ∇_θ P(τ|θ) = P(τ|θ) ∇_θ log P(τ|θ)
  while estimating the gradient of J

REINFORCE Algorithm
▶ REINFORCE is an algorithm that forms the backbone of many practically used policy gradient algorithms
▶ The algorithm was originally formulated for neural network-based policies, but its general formulation is applicable to any differentiable policy
▶ A high-level overview of the algorithm is shown below:
  1: Initialise θ randomly
  2: for i ← 1 to N do
  3:   T ← {}
  4:   for j ← 1 to M do
  5:     τ ← sample(πθ)
  6:     T ← T ∪ τ
  7:   J_θ ← (1/M) Σ_{j=1}^{M} Σ_t r_t^j
  8:   θ ← θ + ∇J_θ

Actor-Critic Learning
▶ Value-based algorithms can be referred to as critic-based, while policy search algorithms are also called actor-based
▶ A combination of the two also exists — this forms the so-called actor-critic family of RL algorithms, which estimate the value function and maintain a policy at the same time
▶ Actor-critic algorithms make use of a baseline b when estimating the gradient of J:
  ∇_θ J(θ) = E[ Σ_{t=0}^{T} ∇_θ log P_θ(a_t|s_t) (R_t − b_t) ]
▶ The benefit of actor-critic algorithms is that the variance of policy updates is reduced

Proximal Policy Optimisation (PPO)
▶ PPO is a policy gradient algorithm that is often used as a baseline method in learning problems
▶ The optimisation objective of PPO is maximising
  L(θ) = E[ min( q_t(θ) A_θ(s, a), clip(q_t(θ), 1 − ϵ, 1 + ϵ) A_θ(s, a) ) ]
  for a small ϵ, where A_θ is called the advantage function, A(s, a) = Q(s, a) − V(s),
  q_t(θ) = πθ(a|s) / πθold(a|s), and
  clip(x, y, z) = y if x < y; z if x > z; x otherwise
▶ PPO maintains a (deep) policy network (thus it is considered a deep RL algorithm) and tries to limit the amount by which the policy is updated (see the sketch below)
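As a minimal illustration of the clipped surrogate objective above (a sketch only, not the full PPO algorithm; it assumes that the probability ratios and advantage estimates have already been computed for a batch of state-action pairs):

```python
import numpy as np

def ppo_clipped_objective(ratios: np.ndarray, advantages: np.ndarray,
                          epsilon: float = 0.2) -> float:
    """Clipped surrogate objective L(theta) = E[min(q_t * A, clip(q_t, 1-eps, 1+eps) * A)].

    ratios:     q_t(theta) = pi_theta(a|s) / pi_theta_old(a|s) for each sample
    advantages: advantage estimates A(s, a) = Q(s, a) - V(s) for each sample"""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# Usage: in PPO, the policy parameters are updated to maximise this objective
# (e.g. by gradient ascent on a differentiable version of the same expression).
objective = ppo_clipped_objective(ratios=np.array([0.9, 1.3, 1.05]),
                                  advantages=np.array([0.5, -0.2, 1.0]))
```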
Imitation Learning
▶ In reinforcement learning, a robot needs to interact with its environment (either in the real world or in a simulation) so that it can identify an appropriate execution policy
▶ If an expert is available that can demonstrate the desired behaviour, a more appropriate way is to perform imitation learning
▶ There are various techniques to perform imitation learning:
  ▶ Inverse reinforcement learning: Expert demonstrations are used for extracting a reward function
  ▶ Learning from observation: A policy is learned from raw observations, without explicit state and action labels
  ▶ Behaviour cloning: An expert policy is mimicked directly based on observed states and actions

Behaviour Cloning
▶ The simplest way to perform imitation learning is to copy the actions performed by the demonstrator, an approach known as behaviour cloning
▶ In behaviour cloning, we are given a set of c observations
  X = {(s_i, a_i)}, 1 ≤ i ≤ c
  that specifies states and ground-truth actions taken by an expert demonstrator
▶ Given such a dataset, supervised learning can be used to acquire a policy
▶ A policy learned using behaviour cloning can be further improved using reinforcement learning, but also using corrective demonstrations

Inverse Reinforcement Learning
▶ Another way in which expert demonstrations can be utilised is for extracting a reward function — this is the approach taken by inverse reinforcement learning (aka reward inference)
▶ The reward in inverse RL is typically represented as a (linear) combination of features that can be observed
▶ Such a reward function can then be used to do reinforcement learning
▶ Inverse RL is usually performed as an iterative process that has an outer loop for reward extraction (based on some optimisation metric) and an inner loop for policy learning
▶ Inverse RL is challenging because the problem is ill-defined — there can be many possible reward functions that optimise the metric

Policy Transfer
▶ One additional strategy in which a policy π for a new task T_j can be acquired is to reuse a policy π*_{T_i} that has already been learned for a different task T_i, i ≠ j
▶ π*_{T_i} can be used either directly for T_j or, more frequently, fine-tuned for T_j
▶ This can be achieved using a variety of policy transfer strategies (skill reuse, skill parameterisation, domain adaptation, meta learning), which we will not discuss in further detail in this lecture

Skill Learning
▶ In this section, we only discussed the aspect of learning a policy
▶ In the previous lecture, we defined a complete skill as S = (S_I, S_T, π), namely a skill also has initiation and termination conditions — what about those, you might ask?
▶ The aspect of learning the initiation conditions (preconditions) and the termination condition is left out on purpose; this will be discussed in a dedicated session later in the course

Transition Model Learning

Transition Models for State Prediction
▶ As a robot can affect its environment with its own actions, it can be useful for it to know how those actions affect the state before committing to specific actions
▶ Prediction is also useful when a robot co-exists with other agents
▶ A transition model T enables such predictions to be made about how the state evolves as a result of executed actions
▶ Depending on the nature of the predictive process, we can distinguish between two types of transition models (a forward-model learning sketch follows below):
  ▶ Deterministic transition model: T : S × A → S
  ▶ Probabilistic transition model: T : S × A × S → R
S. Elliott and M. Cakmak, “Robotic Cleaning Through Dirt Rearrangement Planning with Learned Transition Models,” in Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2018, pp. 1623–1630.
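To illustrate how a deterministic transition model T : S × A → S might be learned from logged transitions, here is a minimal sketch under simplifying assumptions (a linear forward model fitted by least squares); this is not a method prescribed by the lecture, and the function names are hypothetical.

```python
import numpy as np

def fit_linear_forward_model(states: np.ndarray, actions: np.ndarray,
                             next_states: np.ndarray) -> np.ndarray:
    """Fit a deterministic forward model s' ≈ W [s; a; 1] by least squares.

    states:      (N, ds) array of observed states s_t
    actions:     (N, da) array of executed actions a_t
    next_states: (N, ds) array of resulting states s_{t+1}
    Returns W with shape (ds, ds + da + 1)."""
    X = np.hstack([states, actions, np.ones((states.shape[0], 1))])  # inputs [s, a, 1]
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W.T

def predict_next_state(W: np.ndarray, state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Apply the learned model: T(s, a) = W [s; a; 1]."""
    return W @ np.concatenate([state, action, [1.0]])
```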
Discrete and Continuous Transition Models
▶ There are two types of predictive models based on the nature of the state variables to be predicted:
  ▶ Continuous predictive model: Used for predicting the evolution of continuous state variables
  ▶ Discrete predictive model: Used when the state space can be discretised (e.g. using a symbolic representation)
▶ Discrete and continuous transition models can be combined in a hybrid model to enable the representation of different manipulation modes
  ▶ Here, a discrete model is used to predict mode transitions
  ▶ A continuous model is used to predict state variables within a mode

Model Uncertainty
▶ Probabilistic transition models have an associated uncertainty that stems from the fact that a robot does not have perfect knowledge about the process to be predicted
▶ In this context, we need to distinguish between two sources of uncertainty:
  ▶ Aleatoric uncertainty: Inherent uncertainty in the process
  ▶ Epistemic uncertainty: Uncertainty due to a lack of process knowledge
▶ Epistemic uncertainty can be minimised with more training data (or with interactive perception); this is not the case with aleatoric uncertainty, where more data cannot help

Inverse Models
▶ Predictive models as discussed so far are often called forward models
▶ In some cases, it can be useful to know how a particular state change was brought about (e.g. when observing other agents performing tasks and only the state is observable)
▶ An inverse model predicts the action that needs to be performed so that a certain state transition occurs:
  T⁻¹ : S × S → A
▶ Inverse models can be learned similarly to predictive models

Next Lecture: Learning-Based Robot Navigation