Language-Based Learning - A Short Overview of Contemporary Language Use in Robotics
Document Details
Hochschule Bonn-Rhein-Sieg
2024
Alex Mitrevski
Summary
This document provides a short overview of language-based learning, particularly as it applies to contemporary robotics. The lecture, part of the Winter semester 2024/25 curriculum at Bonn-Rhein-Sieg University of Applied Sciences, covers language models, tokenisation, and their applications in robotics.
Full Transcript
Language-Based Learning: A Short Overview of Contemporary Language Use in Robotics
Dr. Alex Mitrevski
Master of Autonomous Systems, Winter semester 2024/25

Structure
▶ (Large) Language models
▶ Robot learning and language

(Large) Language Models

Language Models
▶ Language models are computational models of language that enable language processing, understanding, and sometimes generation, to be performed
▶ Natural language tasks used to be performed with classical machine learning models; e.g. a Naive Bayes classifier could be used for text classification
▶ Large language models are neural network-based language models which have a very large number of parameters and which are trained on massive datasets
▶ For instance, GPT-3 has 175 billion parameters [1]

A. Radford et al., “Improving Language Understanding by Generative Pre-Training,” OpenAI, 2018.
[1] T. B. Brown et al., “Language Models are Few-Shot Learners,” in Proc. 33rd Conf. on Advances in Neural Information Processing Systems (NeurIPS), 2020.

Tokenisation
▶ When processing bodies of text (e.g. full documents), a variety of preprocessing steps are performed; language processing is performed on the resulting preprocessed representation
▶ For instance, stop words (e.g. a, the, etc.) and punctuation are typically irrelevant for language understanding tasks, so they might be removed in the preprocessing steps
▶ Tokenisation is a process of converting text into a set of constituent entities
▶ One common tokenisation strategy is to convert text into individual words — further processing is then done on the level of words
▶ Hierarchical token representations can also be used
▶ In this case, tokenisation can start at the level of individual characters or sub-words and progress up to words or word combinations

Slide example: the sentence “This slide is from a lecture in the Robot Learning course” is word-tokenised into { This, slide, is, from, a, lecture, in, the, Robot, Learning, course }; stop word removal then yields { This, slide, is, from, lecture, in, Robot, Learning, course }.
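To make the slide's example concrete, here is a minimal Python sketch of word tokenisation followed by stop-word removal; the tiny stop-word list is an illustrative assumption (real pipelines use much larger lists, e.g. from NLTK).

```python
# Minimal word tokenisation + stop-word removal sketch.
# The stop-word list is a tiny illustrative subset, not a standard one.
STOP_WORDS = {"a", "an", "the"}

def tokenise(text: str) -> list[str]:
    """Split text into word tokens, stripping surrounding punctuation."""
    return [tok.strip(".,!?;:") for tok in text.split()]

def remove_stop_words(tokens: list[str]) -> list[str]:
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

sentence = "This slide is from a lecture in the Robot Learning course"
tokens = tokenise(sentence)
print(tokens)
# ['This', 'slide', 'is', 'from', 'a', 'lecture', 'in', 'the', 'Robot', 'Learning', 'course']
print(remove_stop_words(tokens))
# ['This', 'slide', 'is', 'from', 'lecture', 'in', 'Robot', 'Learning', 'course']
```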
Word Embeddings
▶ Performing computations on language through numerical models (such as neural networks) requires a numerical representation of tokens
▶ The bag-of-words representation or term frequency-inverse document frequency (TF-IDF) are examples of classical representations
▶ A word embedding is a vectorial token representation that encodes tokens in a latent space of size k, typically produced by a neural network model
▶ Embeddings are learned with respect to a vocabulary of fixed size v ≫ k
▶ Inputs to embedding models are often represented as one-hot encoded vectors (illustrated in the sketch below)
▶ A variety of word embeddings have been proposed over the years — some popular ones are word2vec, BERT, and ELMo
▶ A desirable feature of embedding models is that words that have similar meanings should be close to each other in the embedding space
▶ BERT and ELMo produce context-dependent embeddings, as they are learned by considering surrounding words
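A minimal sketch of the one-hot input representation and the embedding lookup described above; the vocabulary, the latent size k, and the randomly initialised embedding matrix are illustrative assumptions (a real embedding matrix is learned).

```python
import numpy as np

# Illustrative sketch only: a real embedding matrix is learned, not random.
vocab = ["robot", "learning", "language", "model"]  # vocabulary of size v
v, k = len(vocab), 3                                # latent size k; in practice v >> k
rng = np.random.default_rng(0)
E = rng.normal(size=(v, k))                         # embedding matrix (v x k)

def one_hot(token: str) -> np.ndarray:
    vec = np.zeros(v)
    vec[vocab.index(token)] = 1.0
    return vec

# Multiplying a one-hot vector by E simply selects the token's embedding row.
emb = one_hot("robot") @ E
assert np.allclose(emb, E[vocab.index("robot")])
print(emb)  # the k-dimensional embedding of "robot"
```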
Transformer
▶ Most large language models are based on the so-called transformer architecture
▶ The main component of the transformer is an attention layer, which can be seen as computing token importance factors as a result of the other tokens in the current context (a minimal sketch follows below)
▶ The context is defined as a sequence of tokens of a predefined size
▶ Transformer networks generally use multi-head attention layers, which combine the outputs of multiple individual attention layers to produce a joint attention output

A. Vaswani et al., “Attention Is All You Need,” in Proc. 31st Conf. Neural Information Processing Systems (NeurIPS), 2017.
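A minimal sketch of single-head scaled dot-product attention, the core operation named above; the shapes and the random projection matrices are illustrative assumptions. Multi-head attention would run several such layers in parallel and combine their outputs.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # token-to-token importance factors
    return softmax(scores, axis=-1) @ V  # each output is a weighted sum of values

# Illustrative context: 5 tokens with embedding dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
# In a real transformer, Q, K, V come from learned linear projections of X.
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (5, 8): one output vector per token in the context
```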
Robot Learning and Language

Why Does Language Matter for Robotics?
Natural communication with people: The ability to use language for human-robot communication eliminates the need for designing specialised, less natural communication interfaces.
Simplified task description: Language is an interface through which tasks — both their overall and intermediate objectives — can be described in a simple, general manner.
Rich data source: (Written) language sources contain information about a variety of aspects relevant for existing in human-centred environments.

Foundation Models
▶ A foundation model is a (neural network-based) model that is trained on very large, diverse data
▶ Depending on the model’s purpose, it can be trained on a single data modality or on multimodal data
▶ The main purpose of such a model is to be used as a basis for learning specialised tasks
▶ Using a pretrained foundation model as a basis for learning another task is an example of transfer learning (see the sketch below)
▶ You are certainly familiar with at least one foundation model — the GPT family of models are foundation models

“A foundation model is any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks...” (Bommasani et al., 2022)

R. Bommasani et al., “On the Opportunities and Risks of Foundation Models,” CoRR, vol. abs/2108.07258, July 2022. Available: https://arxiv.org/abs/2108.07258
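As a rough illustration of the transfer learning idea, the sketch below keeps a stand-in "pretrained backbone" frozen and trains only a small task-specific head; the backbone, data, and training loop are all toy assumptions, not a real foundation model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained foundation-model backbone: in practice this is a
# large network whose weights were learned on broad data and are kept frozen.
W_backbone = rng.normal(size=(16, 8))
def frozen_backbone(x: np.ndarray) -> np.ndarray:
    return np.tanh(x @ W_backbone)        # fixed feature extractor

# Task-specific head: the only part updated during adaptation in this sketch.
W_head = np.zeros((8, 1))
lr = 0.1
X = rng.normal(size=(32, 16))             # toy downstream dataset
y = rng.integers(0, 2, size=(32, 1)).astype(float)

for _ in range(100):                      # logistic-regression updates on the head
    feats = frozen_backbone(X)
    pred = 1.0 / (1.0 + np.exp(-feats @ W_head))
    grad = feats.T @ (pred - y) / len(X)  # gradient of binary cross-entropy
    W_head -= lr * grad                   # backbone weights stay untouched

print(float(np.mean((pred > 0.5) == y))) # training accuracy of the adapted head
```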
Vision Transformers
▶ Transformers were originally used only for language processing, but they have since been used for images as well
▶ In a vision transformer, an image is split into image patches and an embedding is computed for each individual patch (see the sketch below)
▶ The patches together with their positions are then treated as a sequence of image tokens
▶ Once this “image tokenisation” is done, a transformer architecture as discussed before can be used for processing the image
▶ Attention layers use embeddings as an input, which actually makes them independent of the input modality — as long as the modality can be appropriately embedded, a transformer is applicable

A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in Proc. Int. Conf. Learning Representations (ICLR), 2021.
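A minimal sketch of the "image tokenisation" step: the image is split into non-overlapping 16 × 16 patches (following Dosovitskiy et al.), each patch is flattened and linearly projected, and positional embeddings are added; the projection and positional matrices here are random stand-ins for learned parameters.

```python
import numpy as np

def patchify(img: np.ndarray, p: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping p x p patches."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)             # (num_patches, p*p*C)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))                       # e.g. a camera frame
tokens = patchify(img)                                # 14 * 14 = 196 image tokens
W_embed = rng.normal(size=(tokens.shape[1], 64))      # learned in a real model
pos = rng.normal(size=(tokens.shape[0], 64))          # positional embeddings (also learned)
embedded = tokens @ W_embed + pos                     # sequence ready for a transformer
print(embedded.shape)                                 # (196, 64)
```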
Vision-Language Models (VLMs)
▶ For most useful everyday tasks, language is just an abstract representation of the world — vision makes it possible to ground language to real-world concepts and entities
▶ A model that combines visual and language inputs for making predictions is referred to as a vision-language model
▶ Such models are commonly learned using contrastive learning
▶ Training requires alignment between the visual and language data

A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” in Proc. 38th Int. Conf. Machine Learning, PMLR, 2021, pp. 8748–8763.

Contrastive Learning
▶ In general, contrastive learning is concerned with learning a distance function $d : (\mathbb{R}^n, \mathbb{R}^n) \rightarrow \mathbb{R}$ such that [2]

$$d(p, p^+) < d(p, p^-)$$

where $p^+$ is a positive example and $p^-$ is a negative example with respect to $p$ (a code sketch follows below)
▶ When applied to a single modality, this objective encourages the creation of an embedding space where similar inputs are closer to each other than dissimilar inputs
▶ In the multimodal case, the objective encourages a joint embedding space in which similar entities have similar representations across different modalities

P. H. Le-Khac, G. Healy and A. F. Smeaton, “Contrastive Representation Learning: A Framework and Review,” IEEE Access, vol. 8, pp. 193907–193934, 2020.
[2] G. Chechik et al., “Large Scale Online Learning of Image Similarity Through Ranking,” Journal of Machine Learning Research, vol. 11, pp. 1109–1135, 2010.
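A minimal sketch of the contrastive objective above, in its margin-based (triplet) form; the margin value and the random embeddings are illustrative assumptions. In the multimodal case, the anchor could be an image embedding and the positive/negative examples embeddings of a matching/non-matching text description.

```python
import numpy as np

def triplet_loss(p, p_pos, p_neg, margin: float = 1.0) -> float:
    """Margin-based contrastive loss enforcing d(p, p+) + margin < d(p, p-)."""
    d_pos = np.linalg.norm(p - p_pos)  # distance to the positive example
    d_neg = np.linalg.norm(p - p_neg)  # distance to the negative example
    return max(0.0, d_pos - d_neg + margin)

# Random stand-ins for embeddings produced by the encoders being trained.
rng = np.random.default_rng(0)
anchor, pos, neg = (rng.normal(size=4) for _ in range(3))
print(triplet_loss(anchor, pos, neg))  # zero only once the positive is clearly closer
```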
Vision-Language-Action Models (VLAs)
▶ VLMs are not designed for robot control — their main use is in visual question answering tasks
▶ Vision-language-action models fine-tune VLMs on robotics tasks
▶ In VLAs, actions are discretised and represented as language tokens that the model should predict (see the sketch below)
▶ The predicted tokens are then de-tokenised to extract the corresponding robot action
▶ The actions in such models typically represent end-effector delta actions (change in end-effector position and orientation) and gripper opening / closing actions

B. Zitkovich et al., “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control,” in Proc. Conf. Robot Learning (CoRL), 2023, pp. 2165–2183. Available: https://proceedings.mlr.press/v229/zitkovich23a.html
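A rough sketch of the action discretisation and de-tokenisation described above; the bin count, the normalised action range, and the 7-D action layout are assumptions for illustration (RT-2, for instance, discretises each action dimension into 256 bins).

```python
import numpy as np

N_BINS = 256           # assumed bin count per action dimension
LOW, HIGH = -1.0, 1.0  # assumed normalised action range

def tokenise_action(action: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to a discrete token id."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)

def detokenise_action(tokens: np.ndarray) -> np.ndarray:
    """Recover an approximate continuous action from predicted token ids."""
    return LOW + tokens / (N_BINS - 1) * (HIGH - LOW)

# Assumed 7-D action: end-effector position delta (3), orientation delta (3), gripper (1).
action = np.array([0.02, -0.10, 0.05, 0.0, 0.1, -0.05, 1.0])
tokens = tokenise_action(action)
print(tokens)                     # discrete tokens the VLA is trained to predict
print(detokenise_action(tokens))  # approximately reconstructs the action
```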
RT-X: Robot-Agnostic Foundation Models
▶ RT-X is a collection of recent foundation models trained on the Open X-Embodiment dataset
▶ Two variants of RT-X are described, based on the recent RT-1 and RT-2 models
▶ The outputs of both models are robot actions (represented as end-effector motions and gripper opening / closing actions)
▶ Open X-Embodiment combines data from multiple robots (22 in total) and a large number of robot skills (more than 500)
▶ RT-X models thus aim to be foundation models applicable to different robot embodiments

Open X-Embodiment Collaboration, “Open X-Embodiment: Robotic Learning Datasets and RT-X Models,” in Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2024, pp. 6892–6903. Available: https://doi.org/10.1109/ICRA57147.2024.10611477

OpenVLA
▶ OpenVLA is a VLA that is likewise trained on (a subset of) the Open X-Embodiment dataset, but includes additional datasets in the training data
▶ The architecture is based on a pretrained VLM and uses a third-person image view (with a resolution of 224 × 224 pixels)
▶ The model is purely based on open-source models (unlike RT-2-X, which is a closed VLA model), which makes it technically possible to adapt it further
▶ But fine-tuning is performed on a cluster of 64 GPUs over 14 days, so it still requires access to powerful training hardware!

M. J. Kim et al., “OpenVLA: An Open-Source Vision-Language-Action Model,” CoRR, vol. abs/2406.09246, Sept. 2024. Available: https://arxiv.org/abs/2406.09246

Octo
▶ Just as OpenVLA, Octo is based on open-source components and is trained on a subset of Open X-Embodiment
▶ Octo defines a goal-conditioned policy; the goal can be specified either as an image or using language
▶ A transformer architecture is used by the model: all inputs are mapped to tokens and are then processed by the transformer to produce readout tokens
▶ Readout tokens are used to generate robot actions using a diffusion policy
▶ Unlike OpenVLA, Octo can use different camera views (including wrist cameras), and can be fine-tuned to use different observation or action spaces

O. M. Team et al., “Octo: An Open-Source Generalist Robot Policy,” in Proc. Robotics: Science and Systems (RSS), 2024. Available: https://octo-models.github.io
Diffusion Policy
▶ Octo uses a diffusion policy, which is a recent visuomotor policy representation that generates actions through a probabilistic diffusion process
▶ The denoising process of a diffusion policy is governed by

$$a_t^{k-1} = \alpha \left( a_t^k - \gamma \, \epsilon_\theta(o_t, a_t^k, k) + \mathcal{N}(0, \sigma^2 I) \right)$$

where $\alpha$, $\gamma$, and $\sigma$ are parameters (made to vary with the step $k$) and $\epsilon_\theta$ is the learned denoising network
▶ The network is trained to approximate some noise $\epsilon^k$ that is added to a ground-truth action $a_t^0$:

$$L = MSE\left( \epsilon^k, \epsilon_\theta(o_t, a_t^0 + \epsilon^k, k) \right)$$

▶ $\epsilon_\theta$ can be interpreted as learning a gradient field $\nabla E(a)$:

$$a_t^{k-1} = a_t^k - \gamma \nabla E(a_t^k)$$

▶ A diffusion policy is used with model-predictive control (a denoising loop is sketched below)

C. Chi et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion,” Int. Journal of Robotics Research, 2024. Available: https://doi.org/10.1177/02783649241273668
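A minimal sketch of the denoising iteration above: starting from Gaussian noise, the action is refined over K steps using the noise-prediction network; here ε_θ is a dummy stand-in and the α, γ, σ schedules are simplified placeholders, not the ones from Chi et al.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10                                    # number of denoising steps

def eps_theta(obs, a, k):
    """Stand-in for the learned denoising network; a real one is a trained model."""
    return 0.1 * a                        # dummy: nudges the action towards zero

def denoise(obs, action_dim: int = 7) -> np.ndarray:
    a = rng.normal(size=action_dim)       # a^K: start from pure noise
    for k in range(K, 0, -1):
        alpha, gamma = 1.0, 0.5           # placeholder schedules; vary with k in practice
        sigma = 0.01 * k / K              # noise scale decays towards the final step
        noise = rng.normal(scale=sigma, size=action_dim)
        # a^{k-1} = alpha * (a^k - gamma * eps_theta(o_t, a^k, k) + N(0, sigma^2 I))
        a = alpha * (a - gamma * eps_theta(obs, a, k) + noise)
    return a                              # a^0: the denoised action

print(denoise(obs=None))
```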
Uses of Language / Foundation Models in Robotics
R. Firoozi et al., “Foundation Models in Robotics: Applications, Challenges, and the Future,” CoRR, vol. abs/2312.07843, Dec. 2023. Available: https://arxiv.org/abs/2312.07843

Summary of Observations
▶ The development of vision-language(-action) models is a very active and ongoing process
▶ New models are published virtually every (other) month
▶ Open X-Embodiment seems to be becoming a standard dataset for model training
▶ Policies are typically trained on subsets of the dataset and may include additional data
▶ The current trend is to train general policies rather than robot- or task-specific policies
▶ This is directly facilitated by diverse datasets such as Open X-Embodiment
▶ Such models are typically trained over many GPUs and for several days — small-scale training / fine-tuning is virtually impossible
▶ But inference time can still be manageable

Some Challenges with Robot Foundation Models
No safety guarantees: Current models are trained and deployed without considering safety constraints.
Challenging failure analysis: The causes of failures produced by large robot models can be (mildly put) difficult to understand.
Unknown generalisation conditions: The conditions under which generalisation between environment conditions and / or robots is possible are not well-defined.
Computational challenges: Robot foundation models are large and require powerful hardware to run efficiently — using them for offline execution is difficult for many robots.
Summary
▶ Large language models are based on the transformer architecture, which includes a multitude of attention layers that operate over embedding tokens
▶ Vision-language models are models that are trained on aligned visual and language datasets
▶ Multimodal learning can be performed using contrastive learning, which results in a joint embedding space over the different modalities
▶ Robot foundation models, such as the recent RT-X, have been applied to various robot problems, such as task planning, policy learning, and value learning
▶ The general applicability of robot foundation models is conditioned on resolving various limitations with respect to safety, transparency, and efficiency

Next Lecture: Explainable Robotics