Summary

This document covers large language model (LLM) agents: the overall agent framework, the four key components (planning, tools, memory, and action), and various application scenarios. The lecture also walks through proofs of concept and the author's own research, with a focus on practical uses of LLM agents.

Full Transcript

CSC6203/CIE6021: Large Language Model. Lecture 9: LLM Agents. Winter 2023. Benyou Wang, School of Data Science.

Recap: humans process multimodal information simultaneously. (Bergen, Benjamin K. Louder than Words: The New Science of How the Mind Makes Meaning. Basic Books, 2012.) CLIP: Learning Transferable Visual Models From Natural Language Supervision (OpenAI, 2021). Flamingo's high-level architecture: Flamingo: a Visual Language Model for Few-Shot Learning (DeepMind, April 29, 2022). MLLM-Bench: evaluating multi-modal LLMs using GPT-4V (Wentao Ge†, Shunian Chen†, Guiming Chen†, Junying Chen†, Zhihong Chen∗, Shuo Yan, Chenghao Zhu, Ziyue Lin, Wenya Xie, Xidong Wang, Anningzhe Gao, Zhiyi Zhang, Jianquan Li, Xiang Wan, Benyou Wang. MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V). Takeaway: there is still large room for improvement in MLLMs relative to GPT-4V.

Today's lecture

Contents in this lecture:
○ Overall framework of agents
○ Four elements: planning, tools, memory, action
○ Recap of agents
○ Proofs of concept
○ Our research

What is an LLM agent?

One motivating example: explorations of visual-language models for autonomous driving. (On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving.)

Scenario of an envisioned society composed of AI agents: in the kitchen, one agent orders dishes while another plans and solves the cooking task; at a concert, three agents collaborate to perform as a band; outdoors, two agents discuss lantern-making, planning the required materials and finances by selecting and using tools. Users can participate in any stage of this social activity. (The Rise and Potential of Large Language Model Based Agents: A Survey.)

The AutoGPT view: let an LLM decide what to do over and over, while feeding the results of its actions back into the prompt. This allows the program to work towards its objective iteratively and incrementally. (Complete Guide To Setup AutoGPT, https://docs.agpt.co/)

The framework of agents

A high-level picture: the agent, with an LLM as its cognition (e.g., planning and decision making), perceives the environment, acts on it, and receives feedback. Action and feedback help LLM agents evolve.

Use cases of LLM agents

The use cases for LLM agents are vast and diverse. Powered by large language models, these agents can be used in various scenarios, including:
1. Single-agent applications
2. Multi-agent systems
3. Human-agent cooperation
(https://gptpluginz.com/llm-agents/)

Single-agent applications: LLM agents can serve as personal assistants that free users from daily tasks and repetitive labor. They can analyze, plan, and solve problems independently, reducing the work pressure on individuals and enhancing task-solving efficiency. (https://github.com/langchain-ai/langchain)
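A single-agent application in its simplest form is the AutoGPT-style loop described above. Here is a minimal sketch in Python; it is an illustration, not AutoGPT's actual code, and `call_llm` and `run_tool` are hypothetical placeholders for a model API and a tool dispatcher.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's API."""
    raise NotImplementedError

def run_tool(name: str, arg: str) -> str:
    """Hypothetical tool dispatcher (search, calculator, ...)."""
    raise NotImplementedError

def agent_loop(objective: str, max_steps: int = 10) -> str:
    """Let the LLM decide what to do over and over, feeding results back."""
    history = []  # past actions and their results, re-fed into the prompt
    for _ in range(max_steps):
        prompt = (
            f"Objective: {objective}\n"
            f"Past actions and results: {json.dumps(history)}\n"
            'Reply with JSON: {"action": ..., "arg": ...} or {"finish": "..."}'
        )
        decision = json.loads(call_llm(prompt))
        if "finish" in decision:
            return decision["finish"]              # objective reached
        result = run_tool(decision["action"], decision["arg"])
        history.append({**decision, "result": result})  # feedback loop
    return "stopped: step budget exhausted"
```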
Multi-agent systems: LLM agents can interact with each other in collaborative or competitive ways, advancing through teamwork or adversarial interaction. In these systems, agents can work together on complex tasks or compete against each other to improve their performance. Example: playing Werewolf (狼人杀). (Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, Yang Liu. Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf. https://arxiv.org/pdf/2309.04658.pdf)

Human-agent cooperation: LLM agents can interact with humans, providing assistance and performing tasks more efficiently and safely. Example: interactively writing code together with ChatGPT.

The four elements

Here is a well-known picture from Lilian Weng (OpenAI): https://lilianweng.github.io/posts/2023-06-23-agent/

Planning:
○ Subgoal decomposition: the agent breaks large tasks down into smaller, manageable subgoals, enabling efficient handling of complex tasks.
○ Reflection and refinement: the agent can apply self-criticism and self-reflection to past actions, learn from mistakes, and refine its behavior for future steps, thereby improving the quality of final results.

Memory:
○ Short-term memory: all in-context learning uses the model's short-term memory to learn.
○ Long-term memory: the capability to retain and recall (effectively unlimited) information over extended periods, often by leveraging an external vector store and fast retrieval.

Tool use: the agent learns to call external APIs for extra information that is missing from the model weights (which are often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources, and more.

Action: the agent's ability to execute actions in the real or virtual world is crucial. This ranges from performing tasks in a digital environment to controlling physical robots or devices. The execution phase relies on the agent's planning, memory, and tool use to carry out tasks effectively and adaptively.

Why do LLM agents stand out?
○ Language mastery: the inherent capability to both comprehend and produce language ensures seamless user interaction.
○ Decision-making: LLMs are equipped to reason and decide, making them adept at solving intricate problems.
○ Flexibility: their adaptability means they can be molded for diverse applications.
○ Collaborative interactions: they can collaborate with humans or other agents, paving the way for multifaceted interactions.

Element 1: Planning

What is planning? It is how we solve a complicated task sequentially.
○ One-step task (e.g., translate a paragraph): simple, usually without interaction.
○ Multi-step task (e.g., how to put an elephant into a fridge?): complicated; it involves multiple steps and usually uses external tools (e.g., operating the fridge).

Two simple examples: GSM8K (math word problems) and GAME24. Both are multi-step problems!
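As a preview of the planning techniques below, here is what a chain-of-thought prompt for a GSM8K-style problem can look like. The problems and wording are illustrative, not taken from the benchmark; `call_llm` is again a hypothetical model call.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call

# One worked demonstration, then a new question: the model is nudged to
# decompose the problem into intermediate steps before answering.
cot_prompt = """Q: A farm has 12 cows. Each cow gives 4 liters of milk a day.
How many liters does the farm collect in a week?
A: Let's think step by step.
Daily milk: 12 * 4 = 48 liters.
Weekly milk: 48 * 7 = 336 liters.
The answer is 336.

Q: Tom buys 3 packs of pens with 8 pens per pack, then gives away 5 pens.
How many pens does he have left?
A: Let's think step by step."""

# Expected continuation: 3 * 8 = 24; 24 - 5 = 19; "The answer is 19."
answer = call_llm(cot_prompt)
```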
Examples of planning: task decomposition and self-reflection/self-refinement.

Planning with task decomposition

Task decomposition: chain of thought. Chain of Thought (CoT) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to "think step by step", using more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and sheds light on the model's thinking process. (Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.)

Task decomposition: Tree of Thoughts. Tree of Thoughts (ToT) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search), with each state evaluated by a classifier (via a prompt) or by majority vote. Resource intensity: implementing search methods like ToT is more resource-intensive, incurring higher costs than simpler sampling methods. (Tree of Thoughts: Deliberate Problem Solving with Large Language Models.)

Our research: better verification, with SOTA performance on mathematical reasoning. (Fei Yu, Anningzhe Gao, Benyou Wang. Outcome-supervised Verifiers for Planning in Mathematical Reasoning. https://arxiv.org/pdf/2311.09724.pdf)

Task decomposition: LLM+P. LLM+P relies on an external classical planner for long-horizon planning, using the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In the PDDL process, the LLM (1) translates the problem into a "Problem PDDL", (2) requests a classical planner to generate a PDDL plan based on an existing "Domain PDDL", and (3) translates the PDDL plan back into natural language. Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner. (LLM+P: Empowering Large Language Models with Optimal Planning Proficiency.)

Related reading (via 机器之心/Synced): Jie, Z., Luong, T.Q., Zhang, X., Jin, X. and Li, H., 2023. Design of a Chain-of-Thought in Math Problem Solving. arXiv preprint arXiv:2309.11054. Takeaway: Python > Wolfram for writing solution chains.

Planning with self-reflection

Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable.

Self-reflection: ReAct. ReAct integrates reasoning and acting within an LLM by extending the action space to a combination of task-specific discrete actions and the language space. The former lets the LLM interact with the environment (e.g., use a Wikipedia search API), while the latter prompts the LLM to generate reasoning traces in natural language. The ReAct prompt template incorporates explicit steps for the LLM to think, roughly formatted as repeated Thought / Action / Observation steps. In experiments on both knowledge-intensive tasks and decision-making tasks, ReAct works better than an Act-only baseline in which the "Thought:" step is removed. (ReAct: Synergizing Reasoning and Acting in Language Models.)
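A minimal sketch of the ReAct loop under the Thought/Action/Observation format described above; `call_llm` and `wiki_search` are hypothetical placeholders, and the stop-and-parse convention is one simple choice among many.

```python
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call

def wiki_search(query: str) -> str:
    raise NotImplementedError  # hypothetical Wikipedia-search tool

HEADER = ("Answer the question with interleaved Thought, Action, Observation "
          "steps. Actions: search[query] or finish[answer].\n")

def react(question: str, max_steps: int = 6) -> str:
    transcript = HEADER + f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript + "Thought:")   # model thinks, then acts
        transcript += "Thought:" + step + "\n"
        match = re.search(r"(search|finish)\[(.*?)\]", step)
        if match is None:
            continue                               # no action; think again
        action, arg = match.groups()
        if action == "finish":
            return arg
        observation = wiki_search(arg)             # act on the environment
        transcript += f"Observation: {observation}\n"  # feed evidence back
    return "no answer within the step budget"
```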
Self-reflection: Chain of Hindsight. Chain of Hindsight (CoH) encourages the model to improve its own outputs by explicitly presenting it with a sequence of past outputs, each annotated with feedback. To avoid overfitting, CoH adds a regularization term to maximize the log-likelihood of the pre-training dataset. To avoid shortcutting and copying (because feedback sequences share many common words), 0%-5% of past tokens are randomly masked during training. (Chain of Hindsight Aligns Language Models with Feedback.)

Element 2: tools

Introduction to tools in LLMs

Human + tool use, as motivation: as humans, we have limited time and memory, get tired, and have emotions. Tool use gives us:
○ Enhanced scalability
○ Improved consistency
○ Greater interpretability
○ Higher capacity and productivity

LLMs + tool use: LLMs suffer from similar limitations, and in the same way tool use brings them enhanced scalability, improved consistency, greater interpretability, and higher capacity and productivity.

The case of the calculator: early versions of GPT-4 struggled with numeric calculation. Using the LLM itself for arithmetic is a waste of network capacity!

LLMs + tool use from the perspective of executable language grounding: ground language models into executable actions, i.e., map natural language instructions into code or actions executable within environments such as databases, web applications, and the robotic physical world. LM (planning and reasoning) + actions: the robotic physical world, data analysis, web/apps. (https://openai.com/blog/chatgpt-plugins, https://code-as-policies.github.io/)

Framed as executable language grounding tasks:
Inputs:
○ Language: the user question/request
○ Toolkit: code, APIs to search engines, self-defined functions, expert models, …
○ Environment: databases, IDEs, web/apps, the visual and robotic physical world, …
Outputs: grounded reasoning code or action sequences that can be executed in the corresponding environment, specifying which tools to select and when and how to use them.

LLMs + tools: PAL and PoT. (PAL: Program-aided Language Models; Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks.)

LLMs + web/apps or personalized functions: ChatGPT plugins (https://openai.com/blog/chatgpt-plugins), Mind2Web: Towards a Generalist Agent for the Web, and ReAct (ReAct: Synergizing Reasoning and Acting in Language Models).

LLMs + APIs to expert models: HuggingGPT. Many AI models are available across fields and modalities, but individually they cannot handle complex AI tasks. HuggingGPT uses an LLM to orchestrate them. The system comprises four stages:
1. Task planning: the LLM analyzes the user's request and breaks it down into solvable tasks through prompts.
2. Model selection: the LLM is presented with a list of models to choose from and distributes the tasks to expert models.
3. Task execution: the expert models execute the specific tasks and log results.
4. Response generation: the LLM receives the execution results and returns summarized results to the user.
(HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face.)
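A sketch of this four-stage control flow. `call_llm`, `MODEL_ZOO`, and `run_model` are hypothetical stand-ins for ChatGPT and the Hugging Face expert models; HuggingGPT's real prompts and model registry are far richer.

```python
import json

MODEL_ZOO = {  # hypothetical registry: task type -> candidate expert models
    "image-captioning": ["blip-base"],
    "text-to-speech": ["tts-small"],
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical controller LLM (ChatGPT's role)

def run_model(name: str, payload: str) -> str:
    raise NotImplementedError  # hypothetical expert-model inference call

def hugginggpt_style(request: str) -> str:
    # 1) Task planning: decompose the request into typed subtasks.
    plan = json.loads(call_llm(
        f'Decompose into a JSON list of {{"task": ..., "input": ...}}: {request}'))
    results = []
    for task in plan:
        # 2) Model selection: pick an expert model from the registry.
        choice = call_llm(
            f'Pick one of {MODEL_ZOO.get(task["task"], [])} for {task["task"]}')
        # 3) Task execution: run the expert model and log the result.
        results.append({"task": task, "model": choice,
                        "output": run_model(choice, task["input"])})
    # 4) Response generation: summarize the logged results for the user.
    return call_llm(f"Summarize these results for the user: {json.dumps(results)}")
```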
HuggingGPT evaluation of task-planning ability:
○ Single task: the user request involves only one task.
○ Sequential task: the request must be broken down into a sequence of subtasks.
○ Graph task: the request must be decomposed into a directed acyclic graph of subtasks.

Challenges in putting HuggingGPT into real-world use: (1) efficiency, since both the LLM inference rounds and the interactions with other models slow the process down; (2) reliance on a long context window to communicate complicated task content; (3) the stability of LLM outputs and of external model services needs improvement. (HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face.)

LLMs + code, robotic arms, expert models: Code as Policies. (Code as Policies: Language Model Programs for Embodied Control; Do As I Can, Not As I Say: Grounding Language in Robotic Affordances; ProgPrompt: Generating Situated Robot Task Plans using Large Language Models; Mind's Eye: Grounded Language Model Reasoning through Simulation.)

LLMs + training for tool use: TALM (TALM: Tool Augmented Language Models) and Toolformer (Toolformer: Language Models Can Teach Themselves to Use Tools).

Element 2: tools - evaluation

Evaluation: API-Bank. API-Bank is a benchmark for evaluating the performance of tool-augmented LLMs. It contains 53 commonly used API tools, a complete tool-augmented LLM workflow, and 264 annotated dialogues involving 568 API calls. Evaluation levels:
○ Level 1: evaluate the LLM's ability to call an API (accuracy); given a description of the API, the model must decide whether to call it.
○ Level 2: further evaluate the LLM's ability to retrieve APIs (Rouge); the model must retrieve APIs that may solve the user's needs.
○ Level 3: examine the LLM's ability to plan API usage (number of turns).
(API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs.)

Evaluation: GPT4Tools. Its metrics (see the sketch at the end of this tools section):
○ Successful Rate of Thought: whether the predicted decision matches the ground-truth decision.
○ Successful Rate of Action: whether the predicted tool name agrees with the ground-truth tool name.
○ Successful Rate of Arguments: whether the predicted arguments match the ground-truth arguments.
○ Successful Rate: whether a chain of actions executes successfully, which requires the correctness of thought, tool name, and tool arguments.
(GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction.)

Element 2: tools - challenges and future work
○ Complexity: more complex domains, professional or unseen tools?
○ Interactivity: going beyond a single turn?
○ Evaluation: multiple possible solutions? Real-time interactive evaluation?
○ Efficiency: smaller models?
○ Reliability: knowing when to abstain, knowing its own capacity, memorizing and querying tools?
○ Others: better tool/API design, tool making, personalization, …
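The GPT4Tools success rates above reduce to simple agreement checks (simplified here to per-example records); a sketch, assuming hypothetical prediction records with decision/tool/args fields:

```python
def rate(flags: list[bool]) -> float:
    return sum(flags) / len(flags)  # fraction of successful cases

def gpt4tools_metrics(preds: list[dict], golds: list[dict]) -> dict:
    """Each record is assumed to look like
    {"decision": bool, "tool": str, "args": dict}."""
    pairs = list(zip(preds, golds))
    thought = [p["decision"] == g["decision"] for p, g in pairs]
    action = [p["tool"] == g["tool"] for p, g in pairs]
    args = [p["args"] == g["args"] for p, g in pairs]
    overall = [t and a and r for t, a, r in zip(thought, action, args)]
    return {"SR_thought": rate(thought),   # decision matches ground truth
            "SR_action": rate(action),     # tool name matches
            "SR_args": rate(args),         # arguments match
            "SR": rate(overall)}           # the whole step must be correct
```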
Element 3: Memory

Types of memory in human brains:
1. Sensory memory: the earliest stage of memory, retaining impressions of sensory information (visual, auditory, etc.) after the original stimuli have ended. Sensory memory typically lasts only up to a few seconds. Subcategories include iconic memory (visual), echoic memory (auditory), and haptic memory (touch).
2. Short-term memory (STM) or working memory: stores the information we are currently aware of and need to carry out complex cognitive tasks such as learning and reasoning. Short-term memory is believed to have a capacity of about 7 items (Miller, 1956) and lasts 20-30 seconds.
3. Long-term memory (LTM): can store information for a remarkably long time, ranging from a few days to decades, with an essentially unlimited storage capacity. There are two subtypes:
   a. Explicit/declarative memory: memory of facts and events that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts).
   b. Implicit/procedural memory: unconscious memory of skills and routines performed automatically, like riding a bike or typing on a keyboard.

The corresponding types of memory in LLMs:
1. Sensory memory: learned embedding representations of raw inputs such as text, images, or other modalities [vision encoder/speech encoder].
2. Short-term memory (STM): in-context learning. It is short and finite, restricted by the finite context window of the Transformer [prompt engineering].
3. Long-term memory (LTM): an external vector store that the agent can attend to at query time, accessible via fast retrieval [retrieval-augmented LMs].
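A minimal sketch of the long-term-memory pattern: keep documents in an external vector store, retrieve by similarity at query time, and put the top hits into the prompt. `embed` and `call_llm` are hypothetical placeholders.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError  # hypothetical embedder; returns (n, d) unit vectors

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call

class VectorStore:
    def __init__(self, docs: list[str]):
        self.docs = docs
        self.vecs = embed(docs)                 # (n, d), L2-normalized

    def topk(self, query: str, k: int = 3) -> list[str]:
        scores = self.vecs @ embed([query])[0]  # cosine similarity per doc
        return [self.docs[i] for i in np.argsort(-scores)[:k]]

def answer_with_memory(store: VectorStore, question: str) -> str:
    context = "\n".join(store.topk(question))   # recalled long-term memory
    return call_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```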
Element 3: memory

Introduction to Retrieval-Augmented LMs (RAG)

Retrieval-based language models: retrieval-based LMs = retrieval + LMs. A retrieval-based LM is a language model that retrieves from an external datastore (at least during inference time). Such models are also referred to as semiparametric or non-parametric models.

Why retrieval-based LMs?

1. LLMs can't memorize all (long-tail) knowledge in their parameters. Example: asked to "List 5 important papers authored by Geoffrey Hinton", GPT-3 davinci-003 reaches only 20%-30% accuracy, mixing genuine Hinton papers such as "Learning Internal Representations by Error Propagation" (with D. E. Rumelhart and R. J. Williams, 1986) and "Deep Boltzmann Machines" (with R. Salakhutdinov, 2009) with misattributions such as "Attention Is All You Need" (Vaswani et al., 2017) and the 2016 "Deep Learning" book. Long-tail questions like "What is Kathy Saltzman's occupation?" are even harder. (When Not to Trust Language Models.)

2. LLMs' knowledge is easily outdated and hard to update. Asked "Who is the CEO of Twitter?", the model answers "As of my knowledge cutoff in September 2021, the CEO of Twitter is Jack Dorsey…". Existing knowledge-editing methods are still not scalable (active research!), whereas a datastore can be easily updated and expanded, even without retraining. (When Not to Trust Language Models.)

3. LLMs' output is challenging to interpret and verify. Retrieval allows generating text with citations: the knowledge source can be traced from the retrieval results, giving better interpretability and control. (WebGPT: Browser-assisted question-answering with human feedback; Teaching language models to support answers with verified quotes.)

4. LLMs are shown to easily leak private training data. Retrieval allows individualization on private data by keeping it in the datastore. (Extracting Training Data from Large Language Models.)

5. LLMs are large and expensive to train and run. Long-term goal: can we reduce the training and inference costs and scale down the size of LLMs? E.g., RETRO (Borgeaud et al., 2021) "obtains comparable performance to GPT-3 on the Pile, despite using 25x fewer parameters".

Definition of a retrieval-based LM: a language model that uses an external datastore at test time. There is an entire field of study on making the similarity function better. Software: FAISS, Distributed FAISS, ScaNN, etc. (https://github.com/facebookresearch/faiss/wiki; a usage sketch appears at the end of this overview.)

Questions to answer:
○ What's the query, and when do we retrieve?
○ What do we retrieve?
○ How do we use retrieval?

Popular retrieval-based LMs: REALM (Retrieval-Augmented Language Model Pre-Training). Example: x = "World Cup 2022 was the last with 32 teams before the increase to [MASK] in 2026." The retrieve stage fetches a passage such as "FIFA World Cup 2026 will expand to 48 teams", and the read stage conditions on the passage together with x to fill in the mask.

Design dimensions for retrieval-based LMs:
○ What to retrieve? Chunks, tokens, or others.
○ How to use retrieval? At the input layer, in intermediate layers, or at the output layer.
○ When to retrieve? Once, every n tokens (n > 1), or every token.
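A small FAISS usage sketch for the datastore's nearest-neighbor search; the random vectors stand in for real passage embeddings.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                     # embedding dimension
passages = np.random.rand(10_000, d).astype("float32")
faiss.normalize_L2(passages)                # unit norm: inner product = cosine

index = faiss.IndexFlatIP(d)                # exact inner-product index
index.add(passages)                         # build the datastore

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)        # top-5 nearest passages
print(ids[0], scores[0])
```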
Recent research on retrieval-based LMs: REALM and subsequent work.
○ REALM (Guu et al., 2020): masked language modeling followed by fine-tuning, focusing on open-domain QA.
○ DPR (Karpukhin et al., 2020): pipeline training instead of joint training, focusing on open-domain QA (no explicit language modeling).
○ RAG (Lewis et al., 2020): "generative" instead of masked language modeling, focusing on open-domain QA and knowledge-intensive tasks (no explicit language modeling).
○ Atlas (Izacard et al., 2022): combines RAG with retrieval-based language model pre-training based on the encoder-decoder architecture, focusing on open-domain QA and knowledge-intensive tasks.
Papers following this approach but focusing on LM perplexity have appeared quite recently (Shi et al., 2023; Ram et al., 2023). Measured by perplexity (the lower the better), retrieval helps across all sizes of LMs.

Training methods for retrieval-based LMs:
○ Independent training
○ Sequential training
○ Joint training with asynchronous index update
○ Joint training with in-batch approximation

Independent training: the retrieval model and the language model are trained independently. Sparse retrieval models (TF-IDF/BM25) need no training; dense retrieval models (DPR) are trained by contrastive learning with in-batch negatives (sketched below). Pros: works with off-the-shelf models (no extra training required), and each component can be improved separately. Cons: the LM is not trained to leverage retrieval, and the retrieval model is not optimized for LM tasks/domains.

Sequential training: one component is first trained independently and then fixed; the other component is trained with an objective that depends on the first. Pros: works with off-the-shelf components (either a large index or a powerful LM); the LM is trained to effectively leverage retrieval results, and the retriever is trained to provide text that helps the LM the most. Con: one component is still fixed and not trained.
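The in-batch negatives objective used by DPR, sketched in PyTorch: within a batch, each query's positive passage doubles as a negative for every other query.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """q: (B, d) query embeddings; p: (B, d) positive-passage embeddings,
    aligned row by row. Off-diagonal rows act as in-batch negatives."""
    scores = q @ p.T                          # (B, B) similarity matrix
    labels = torch.arange(q.size(0))          # positives lie on the diagonal
    return F.cross_entropy(scores, labels)

# Example with random embeddings standing in for encoder outputs:
q, p = torch.randn(8, 128), torch.randn(8, 128)
print(in_batch_contrastive_loss(q, p).item())
```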
Joint training with asynchronous index update: the retrieval model and the language model are trained jointly. Updating the retrieval index during training is challenging, so the index is allowed to be "stale" and is rebuilt every T steps. (REALM: Retrieval-Augmented Language Model Pre-Training; Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering.)

Joint training with in-batch approximation: the retrieval model and the language model are trained jointly, using an "in-batch index" instead of the full index. (Nonparametric Masked Language Modeling; Training Language Models with Memory Augmentation.)

Joint training summary: trained end to end, every component is optimized, yielding good performance; but training is more complicated (asynchronous updates, overhead, data batching, etc.), and a train-test discrepancy still remains.

Element 4: action

Introduction: in the construction of an agent, the action module receives action sequences from the planning module and carries out actions to interact with the environment.

Embodied AI: in the pursuit of Artificial General Intelligence (AGI), the embodied agent is considered a pivotal paradigm, as it strives to integrate model intelligence with the physical world. Embodied AI should be capable of actively perceiving, comprehending, and interacting with physical environments, making decisions, and generating specific behaviors to modify the environment, all based on the LLM's extensive internal knowledge. We collectively call these embodied actions; they give agents the ability to interact with and comprehend the world in a manner closely resembling human behavior.

The potential of LLM-based agents for embodied actions:
○ Cost efficiency: some on-policy algorithms struggle with sample efficiency, since they require fresh data for policy updates, while gathering enough embodied data for high-performance training is costly and noisy.
○ Embodied action generalization: an agent's competence should extend beyond specific tasks. When faced with intricate, uncharted real-world environments, the agent must exhibit dynamic learning and generalization capabilities.
○ Embodied action planning: planning is a pivotal strategy employed by humans in response to complex problems, and by LLM-based agents as well.

Embodied AI: PaLM-E, an embodied multimodal language model. PaLM-E transfers knowledge from visual-language domains into embodied reasoning: from robot planning in environments with complex dynamics and physical constraints to answering questions about the observable world. (PaLM-E: An Embodied Multimodal Language Model.)

Recap of agents

Recap 1: the four key components of an LLM agent (planning, tools, memory, action).
Recap 2: the development of LLM agents. (A Survey on Large Language Model based Autonomous Agents.)
Recap 3: challenges. After going through the key ideas and demos of building LLM-centered agents, here are a few common limitations:
1. Finite context length: the restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and responses. The system design has to work within this limited communication bandwidth, while mechanisms like self-reflection, which learn from past mistakes, would benefit greatly from long or infinite context windows. Vector stores and retrieval can provide access to a larger knowledge pool, but their representation power is not as strong as full attention.
2. Challenges in long-term planning and task decomposition: planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust than humans, who learn from trial and error.
3. Reliability of the natural language interface: current agent systems rely on natural language as the interface between LLMs and external components such as memory and tools. However, the reliability of model outputs is questionable: LLMs may make formatting errors and occasionally exhibit rebellious behavior (e.g., refusing to follow an instruction). Consequently, much agent demo code focuses on parsing model output.
(A Survey on Large Language Model based Autonomous Agents; https://gptpluginz.com/llm-agents/)

Extension: frameworks for LLM agents (a usage sketch follows this list):
○ LangChain: a framework for building applications with LLMs through composability; different agents can be used for different data types.
○ AutoGen: a framework that enables development of LLM applications using multiple agents that can converse with each other to solve tasks.
○ OpenAgents: an open platform for using and hosting language agents in the wild of everyday life. Language agents are systems that can understand and communicate in natural language, such as chatbots, voice assistants, or conversational AI.
○ ChatDev: a project that aims to create customized software from a natural-language idea, through LLM-powered multi-agent collaboration.
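As a flavor of these frameworks, here is the two-agent quickstart pattern from AutoGen (the pyautogen package); exact names and options may differ across versions, so treat this as a sketch.

```python
import autogen

config_list = [{"model": "gpt-4", "api_key": "YOUR_KEY"}]

# An LLM-backed assistant and a user proxy that can execute generated code.
assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config={"config_list": config_list},
)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",  # fully automated conversation
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

# The two agents converse until the task is solved or a limit is reached.
user_proxy.initiate_chat(
    assistant,
    message="Plot sin(x) on [0, 2*pi] and save the figure as sine.png.",
)
```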
Proof of concepts

POC1: autonomous driving using GPT-4V. (On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving.) Cast in the agent framework: the model handles environment understanding (traffic participants, time, weather, traffic lights, traffic signs, …) and reasoning, then planning, supported by memory and tools, and finally action. The paper walks through understanding, reasoning, planning, and tool-to-action examples.

Our research

Online Training of Large Language Models: Learn while Chatting (agent memory related): online training can convert specific short-term memory into long-term memory efficiently and effectively.

Modularization Makes Language Models Easier to Use Tools (agent tool related): modularization is an effective way to help models quickly learn from multiple tools.

Acknowledgements
○ https://github.com/Paitesanshi/LLM-Agent-Survey
○ https://github.com/WooooDyy/LLM-Agent-Paper-List
○ Generative Agents: Interactive Simulacra of Human Behavior
○ https://wenting-zhao.github.io/complex-reasoning-tutorial/
○ https://acl2023-retrieval-lm.github.io/
○ https://github.com/xlang-ai/llm-tool-use

Extension slides

LLMs + tool use from the perspective of executable language grounding: Binder. Binder is a training-free neural-symbolic framework that maps the task input to an executable Binder program that (1) allows binding API calls to GPT-3 Codex into SQL/Python and (2) is executed with a SQL/Python interpreter plus GPT-3 Codex to derive the answer. (Binding Language Models in Symbolic Languages; project website: https://lm-code-binder.github.io; ICLR 2023.)

Extension-1: LLM as tool maker. LATM: Large Language Models as Tool Makers.

Self-reflection: Reflexion. Reflexion is a framework that equips agents with dynamic memory and self-reflection capabilities to improve reasoning skills. Reflexion has a standard RL setup, in which the reward model provides a simple binary reward and the action space follows the ReAct setup, where the task-specific action space is augmented with language to enable complex reasoning steps. After each action a_t, the agent computes a heuristic h_t and, depending on the self-reflection results, may optionally decide to reset the environment to start a new trial. Self-reflections are created by showing two-shot examples to the LLM, each example being a pair (failed trajectory, ideal reflection for guiding future changes in the plan). Reflections are then added to the agent's working memory, up to three, to be used as context for querying the LLM. (Reflexion: Language Agents with Verbal Reinforcement Learning.)

Self-reflection: Chain of Hindsight, continued. The idea of CoH is to present a history of sequentially improved outputs in context and train the model to pick up the trend and produce better outputs. Algorithm Distillation applies the same idea to cross-episode trajectories in reinforcement learning tasks, where an algorithm is encapsulated in a long history-conditioned policy; the goal is to learn the process of RL rather than a task-specific policy itself. (Chain of Hindsight Aligns Language Models with Feedback.)
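Returning to Reflexion, here is a sketch of its trial-and-reflect cycle under the mechanics described above (binary reward, reflections capped at three); `run_trial` and `call_llm` are hypothetical.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call

def run_trial(task: str, context: list[str]) -> tuple[str, bool]:
    raise NotImplementedError  # hypothetical rollout: (trajectory, binary reward)

def reflexion_loop(task: str, max_trials: int = 4) -> str:
    reflections: list[str] = []        # verbal working memory, capped at three
    for _ in range(max_trials):
        trajectory, success = run_trial(task, context=reflections)
        if success:
            return trajectory
        # Verbal reinforcement: distill the failure into guidance.
        reflection = call_llm(
            f"Task: {task}\nFailed trajectory: {trajectory}\n"
            "Write a short reflection on what to change in the next trial.")
        reflections = (reflections + [reflection])[-3:]  # keep at most three
    return "failed after all trials"
```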
Embodied AI: PaLM-E. A single PaLM-E model directs the low-level policies of two real robots. Shown are a long-horizon mobile manipulation task in a kitchen and one-shot/zero-shot generalization with a tabletop manipulation robot. (PaLM-E: An Embodied Multimodal Language Model.)

Embodied AI: Inner Monologue. Inner Monologue enables grounded closed-loop feedback for robot planning with large language models by leveraging a collection of perception models in tandem with pretrained language-conditioned robot skills. It uses various types of textual feedback: Success Detection gives task-specific task-completion information; Passive Scene Description gives structured semantic scene information at every planning step; and Active Scene Description gives unstructured semantic information only when queried by the LLM planner. (Inner Monologue: Embodied Reasoning through Planning with Language Models.)

Embodied AI: Language Models as Zero-Shot Planners. The authors investigate extracting actionable knowledge from pre-trained large language models. They first show the surprising finding that pre-trained causal LLMs can decompose high-level tasks into sensible mid-level action plans. To make the plans executable, they propose translating each step into an admissible action via another pre-trained masked LLM; the translated action is appended to the prompt used for generating the remaining steps. All models are kept frozen, without additional training. In the paper's demonstrations, the top row shows execution of the task "Complete Amazon Turk Surveys" and the bottom row the task "Get Glass of Milk". (Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents.)
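A sketch of the step-translation idea: map each free-form generated step onto the closest admissible action. The paper scores candidates with a pre-trained masked LM; this sketch substitutes a hypothetical sentence embedder `embed` for that similarity model, and the action list is illustrative.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError  # hypothetical embedder; returns (n, d) unit vectors

ADMISSIBLE_ACTIONS = ["walk to kitchen", "open fridge", "grab milk",
                      "close fridge", "pour milk into glass"]

def translate_step(free_form_step: str) -> str:
    """Map a generated plan step onto the closest admissible action."""
    action_vecs = embed(ADMISSIBLE_ACTIONS)     # (n, d)
    step_vec = embed([free_form_step])[0]       # (d,)
    scores = action_vecs @ step_vec             # cosine similarities
    return ADMISSIBLE_ACTIONS[int(np.argmax(scores))]

# e.g. translate_step("take the milk out of the refrigerator") -> "grab milk"
```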
