Adaptive In-conversation Team Building for Language Model Agents
Linxin Song, Jiale Liu, Jieyu Zhang, Shaokun Zhang, Ao Luo, Shijian Wang, Qingyun Wu, Chi Wang
Summary
This paper describes an innovative approach to adaptive team building for language model agents, focusing on the design of the Captain Agent. The method dynamically forms teams for each step of a task-solving process, utilizing nested group conversations and reflection to ensure diverse expertise and avoid stereotyped outputs.
Adaptive In-conversation Team Building for Language Model Agents

Linxin Song1*, Jiale Liu2*, Jieyu Zhang3, Shaokun Zhang2, Ao Luo4, Shijian Wang5, Qingyun Wu2, Chi Wang6†
1 University of Southern California  2 Penn State University  3 University of Washington  4 Waseda University  5 Southeast University  6 Microsoft Research
arXiv:2405.19425v1 [cs.CL] 29 May 2024

Figure 1: Two team-building paradigms for LLM agents. The "Static Build" paradigm builds a static team before task execution. Our "Adaptive Build" paradigm uses an adaptive builder agent to form different teams during the task-solving procedure.

Abstract

Leveraging multiple large language model (LLM) agents has proven to be a promising approach for tackling complex tasks, yet the effective design of multiple agents for a particular application remains an art. It is thus intriguing to answer a critical question: given a task, how can we build a team of LLM agents to solve it effectively? Our new adaptive team-building paradigm offers a flexible solution, realized through a novel agent design named Captain Agent. It dynamically forms and manages teams for each step of a task-solving process, utilizing nested group conversations and reflection to ensure diverse expertise and prevent stereotypical outputs, allowing for a flexible yet structured approach to problem-solving. A comprehensive evaluation across six real-world scenarios demonstrates that Captain Agent significantly outperforms existing multi-agent methods, with a 21.94% improvement in average accuracy, providing outstanding performance without requiring task-specific prompt engineering.

* Equal contribution. † Corresponding author. Email: [email protected]
Preprint. Under review.

1 Introduction

The success of large language model (LLM) agents [2, 3, 4, 5, 6], with their outstanding in-context learning [7, 8, 9, 10, 11], planning [12, 13, 14, 15, 16, 17, 18], tool-using [19, 20, 21, 22, 23, 24, 25, 26, 27], and conversation [28, 29, 30] capabilities, allows us to carry human team-building and collaboration abilities over to multiple language model agent (multi-agent) systems [31, 32, 33, 34, 35, 25, 36, 37, 38, 39, 40, 41]. Humans have developed abilities that enable us to form teams and effectively solve problems. These abilities are rooted in communication, social cognition, problem-solving and decision-making, social learning and imitation, and shared intentionality [42, 43]. The interplay of these abilities allows people to organize different teams for different problems and ensure that tasks are completed successfully, which brings us to a critical question for multi-agent systems: given a task, how can we build a team of LLM agents to solve it effectively?

A straightforward paradigm is to build a static agent team beforehand based on the task instruction and let the agents solve the task collaboratively [1, 33]. However, this static-build method requires maintaining a team with all the required expertise for the whole task cycle. As the complexity of the task increases, the total number of team members may grow significantly, and always proceeding with such a large team makes it difficult to manage the members effectively and efficiently.
Furthermore, static teams may lack the adaptability to respond to dynamic changes in task requirements or unforeseen challenges. Imagine a prehistoric human tribe: was everyone involved in every task? Unlikely. Those responsible for hunting did not participate in medical care, and those responsible for cooking did not involve themselves in management. The overarching task, survival, was ensured by each group sticking to its own roles and subtasks. Indeed, when human organizations handle a complex task, we tend to form a different team for each subtask at different stages of the task-solving procedure, which still guarantees that the diverse expertise demanded by the task's complexity is leveraged.

Inspired by how humans assemble teams for a complex task, we introduce a new multi-agent team-building paradigm: adaptive build. This paradigm facilitates the flexible assembly of agents with specific skills and knowledge as demands evolve during task-solving. To realize this paradigm, we propose a new adaptive builder agent, Captain Agent, which builds, manages, and maintains agent teams for each problem-solving step in the conversation. Captain Agent has two core components: (1) adaptive multi-agent team building and (2) nested group conversation and reflection. Captain Agent communicates with a User Proxy, which provides the general task instructions at the beginning. When assigned a task, Captain Agent begins by formulating a strategic plan. This plan drives a cyclical process that continues until the task is successfully completed. In the first phase of the cycle, Captain Agent identifies a specific subtask, outlines the necessary roles, and assembles a team of agents equipped with the appropriate tools. In the subsequent phase, this team addresses the subtask in a nested group conversation with versatile tool use. Upon completion, a reflector LLM reviews the process and provides Captain Agent with a detailed reflection report. Based on this feedback, Captain Agent either adjusts the team composition or the subtask instructions and repeats the cycle, or concludes the task and presents the final outcomes.

We evaluate state-of-the-art multi-agent approaches for complex task solving, together with our adaptive-build approach realized by Captain Agent, on six real-world scenarios: mathematics problem-solving, data analysis, programming, scientific problem-solving (physics and chemistry), and world-information retrieval. Our experimental results demonstrate the outstanding ability of Captain Agent across scenarios without heavy per-scenario prompt engineering, using only basic instructions (e.g., "Please solve the following math problems", without detailed domain guidance such as how to solve algebra problems). Captain Agent achieves distinguished results compared with other single- and multi-agent methods and frameworks when using the same prompt for each task, with a 21.94% improvement in average accuracy. Ablation studies on the static and adaptive building paradigms show that the adaptive team outperforms the static team in four of five scenarios (and matches it in the remaining one), exhibiting the superiority of the adaptive-build paradigm across different scenarios. We also demonstrate that hand-crafted agents and hand-crafted tools contribute equally to the final results.
We further explore incorporating open-weight models as the backbone LLMs of nested-group participants, where LLaMA-3-70B outperforms black-box models such as gpt-3.5-turbo and claude-3-sonnet, suggesting ways to reduce cost in practical applications.

Figure 2: The overall workflow of Captain Agent: given a user instruction, Captain Agent plans the task, builds an agent team via retrieval and generation, and lets the agents solve a decomposed, planned subtask collaboratively in a group chat. A reflection LLM reviews the conversation history and reports back to Captain Agent, which then either concludes or continues solving the problem with a modified team and instructions.

2 Adaptive In-conversation Team Building

The proposed Captain Agent contains two key components: (1) adaptive multi-agent team building, which involves agent and tool retrieval, selection, and generation, and (2) nested group conversation with a reflection mechanism within the multi-agent system.

2.1 Overview

The overall workflow of Captain Agent is illustrated in Figure 2. Given a task, Captain Agent is prompted to derive a plan before task execution. According to the plan, Captain Agent repeats the following two steps until it judges the task to be done and outputs the results. (Step 1) Captain Agent identifies a subtask as instructed by our prompt, lists several roles needed for this subtask, and creates a team of agents accordingly by retrieval, selection, and generation; each agent is equipped with predefined tools retrieved from the tool library (Section 2.2). (Step 2) This team of agents attempts to solve the subtask via conversation with free-form tool use. Once it is done, a reflector LLM provides Captain Agent with a reflection report, based on which Captain Agent decides whether to adjust the team or the subtask instruction, or to terminate and output the results (Section 2.3).
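To make the two-step cycle concrete, the following is a minimal control-flow sketch in Python. Every callable passed in (derive_plan, identify_subtask, build_team, run_nested_chat, reflect) is a placeholder name for the corresponding component described above, not the released implementation.

```python
from typing import Callable

def captain_agent_loop(
    task: str,
    derive_plan: Callable,
    identify_subtask: Callable,
    build_team: Callable,
    run_nested_chat: Callable,
    reflect: Callable,
    max_cycles: int = 10,
):
    """Control-flow sketch of the Section 2.1 cycle; all callables are placeholders."""
    plan = derive_plan(task)
    result = None
    for _ in range(max_cycles):
        # Step 1: identify a subtask, list the required roles, and assemble a team
        subtask, roles = identify_subtask(task, plan, result)
        team = build_team(roles)  # retrieval -> selection -> generation, tools attached
        # Step 2: nested group conversation with free-form tool use
        history = run_nested_chat(team, subtask)
        report = reflect(history)  # reflector LLM summarizes and flags "need double-check"
        if report.get("need_double_check"):
            # verify the questionable result with the same group of experts
            history = run_nested_chat(
                team, subtask + "\nPlease verify the previous result:\n" + str(report["results"])
            )
            report = reflect(history)
        result = report["results"]
        if report.get("task_done"):
            break  # Captain Agent concludes and outputs the final answer
        # otherwise adjust the team composition or subtask instruction and repeat
    return result
```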
2.2 Adaptive Multi-agent Team Building

After identifying a subtask in Step 1 following a corresponding prompt, Captain Agent lists several roles for the subtask. These roles are then passed into a retrieval, selection, and generation process guided by Retrieval-Augmented Generation (RAG) [50, 51, 52]. Created agents are equipped with a well-designed profile (system message³) and high-quality tools. The whole process is illustrated in Figure 3.

Agent and tool retrieval. Captain Agent is prompted to produce n required roles {r_i | i = 1, ..., n} with detailed descriptions, including required skills and a possible role name (we use the word "expert" in Captain Agent's prompt to make this process natural). We then retrieve the top-k1 agents and top-k2 tools according to the sentence-embedding similarity between the role's description and the agent/tool descriptions recorded in the library. We use a Sentence Transformer to compute the embeddings of the role description and the library agent/tool descriptions, and cosine similarity as the metric between two descriptions:

\text{top-}k_1\ \mathrm{CosineSimilarity}\big(f(r_i), f(a_{\text{lib}})\big) \rightarrow \text{RetrievedAgents}, \tag{1}
\text{top-}k_2\ \mathrm{CosineSimilarity}\big(f(r_i), f(t_{\text{lib}})\big) \rightarrow \text{RetrievedTools}, \tag{2}

where k1 and k2 are the numbers of agents and tools retrieved from the agent library a_lib and the tool library t_lib, respectively, for the i-th role r_i, and f(·) ∈ R^m denotes the sentence embedding extracted by the Sentence Transformer. After retrieval, each role is assigned k1 candidate agents and k2 valuable tools. We bind the candidate agents to the retrieved tools by injecting the tool-using instruction into the corresponding agent's system message.

³ The system message defines an agent's persona and task-specific instructions.

Figure 3: Workflow for adaptive multi-agent team building. We retrieve candidate agents and tools according to the role descriptions prompted by Captain Agent. Candidate agents and tools are then linked to a role under the advice of the agent selector. If no agent is linked to a role, a generation process creates a new agent, producing the agent's name and task-specific instructions, which are combined with general task and coding instructions and group chat instructions as the final system message.

Agent selection. We prompt an LLM-based agent selector to select the most suitable agent according to the role description given by Captain Agent and the retrieved agents' descriptions. A JSON template is designed and provided to the agent selector to ensure the output format is correct. We also design an abstention mechanism: the agent selector can output "None" if there is no suitable agent for a role in the top-k1 retrieved candidate list. This prevents irrelevant or redundant agents from being forced into the current task. Roles marked "None" then proceed to the generation process described below.

Agent generation. We design an agent generation process for roles that have no linked agent after the previous step. Specifically, we generate the agent's name and required skills according to the role description given by Captain Agent. These instructions are combined with general task and coding instructions and group chat instructions to form the final system message. We manually design the general task and coding instructions, motivated by Chain-of-Thought (CoT) and Reflexion. The final system message is also compressed into a single-sentence description, which is consumed by the nested group conversation (introduced in the next subsection). We then retrieve tools from the tool library according to this description and inject the tool-using instructions into the generated system message. The generated agent is added to the agent library afterwards.
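The retrieval and selection steps above can be sketched with an off-the-shelf Sentence Transformer (all-mpnet-base-v2, the embedding model named in Section 3.1). The record fields below mirror the Appendix F library format, but retrieve_top_k and the toy entries are illustrative assumptions, not the paper's code.

```python
# Sketch of the retrieval step in Eqs. (1)-(2), assuming library entries carry a
# "description" field as in the Appendix F example.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")  # embedding model used in Section 3.1

def retrieve_top_k(role_description: str, library: list[dict], k: int) -> list[dict]:
    """Return the k library entries whose descriptions are closest to the role description."""
    role_emb = encoder.encode(role_description, convert_to_tensor=True)
    lib_embs = encoder.encode([entry["description"] for entry in library], convert_to_tensor=True)
    scores = util.cos_sim(role_emb, lib_embs)[0]          # cosine similarity, Eqs. (1)/(2)
    top = scores.topk(k=min(k, len(library)))
    return [library[i] for i in top.indices.tolist()]

# Example usage with toy libraries; real entries follow the Appendix F schema.
agent_library = [
    {"name": "QuantumMechanics_Expert", "description": "An expert in quantum mechanics ..."},
    {"name": "Python_Expert", "description": "An expert Python programmer for numerical work ..."},
]
tool_library = [
    {"function_name": "perform_web_search", "description": "Can be used to search the web ..."},
]

role = "An expert who can solve college-level physics problems with Python"
candidate_agents = retrieve_top_k(role, agent_library, k=2)   # top-k1 agents
candidate_tools = retrieve_top_k(role, tool_library, k=1)     # top-k2 tools
```

The LLM-based selector then either picks one of the candidate agents for each role or abstains with "None", which triggers agent generation.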
2.3 Nested Group Conversation and Reflection

Agents selected and created in the adaptive multi-agent team-building process join a nested group chat room. They are prompted to collect information from the user's task and to solve the subtask given by Captain Agent through nested conversation. We then prompt a reflector LLM to review the conversation history and fill in a pre-designed template with the conclusion, the reason for the conclusion, possible contradictions and issues, and a flag indicating whether the result needs a double check.

Nested group conversation. We perform nested group conversations by leveraging the AutoGen framework with a newly designed tool-using paradigm. AutoGen puts all agents in a chat room and selects the speaker for each turn via a group chat manager LLM, according to the conversation history and each agent's identity. A short description of each agent is generated from its profile for the group chat manager. Agents' code and tool calls are executed and fed back to the conversation immediately. We inject the tool's description, path to the Python module, and an example response into the related agent's system message. The agent can then write free-form code by following the tool's description and path, naturally incorporating the tools into larger programs. Programs written by all agents are executed by a user proxy agent in a shared code-execution environment, and the results are fed back to the conversation in real time.

Conversation reflection. An agent's output during the conversation can be inconsistent, containing factual errors, hallucinations, and stereotypes. Although other agents have a chance to adjust and rectify this in conversation, they can also get stuck and cause the problem-solving to fail. Therefore, we propose to detect such in-conversation contradictions and issues by prompting a reflector LLM with a well-designed conversation-summarizing prompt template. The reflector flags "Need double-check" as "Yes" when it detects inconsistent content and provides a detailed reason. After receiving this flag, Captain Agent starts a verification process by constructing a new nested conversation to double-check the previous results.

2.4 Benefits over Static Build

A static team with a small number of members may limit the team's coverage of abilities. Although building a large number of agents with comprehensive personas or skill sets can address this limitation, it is challenging for LLMs to handle the long context needed to introduce all the participating members, and an unexpectedly long context degrades the quality of the conversation. Meanwhile, agents with redundant functionality also become involved in the task-solving process. In contrast, Captain Agent adaptively selects and builds a team optimized for the current step, reducing the prompting load on the LLM and the redundant output from irrelevant agents without sacrificing diversity in the agent team.

3 Evaluation

3.1 Experimental Setup

Scenarios and datasets. For evaluation, we select various real-world scenarios, including mathematics problem-solving, programming, data analysis, world-information retrieval, and science problem-solving. Each scenario was chosen for its ability to probe specific capabilities of the agent systems, ensuring a holistic assessment of Captain Agent against the baselines across critical dimensions of computational and cognitive skill. We pair each scenario with a challenging open-source dataset, as shown in Table 1. Due to cost limitations, we sample a subset of MATH according to its original distribution of question types.

Table 1: Scenarios and the corresponding datasets used in our main experiments. We run the main comparison on the whole dataset except MATH, for which we sampled a small subset according to the question-type distribution.

Scenario | Dataset | Size | Sample
Mathematics problems | MATH | 196 | If (3x^2 - 4x + 1)/(x - 1) = m, and x can be any real number except 1, what real values can m NOT have?
Programming | HumanEval | 164 | def truncate_number(number: float) -> float: """Given a positive floating point number, it can be decomposed into an integer part (largest integer smaller than given number) and decimals (leftover part always smaller than 1). [Omitted]"""
Data Analysis | DABench | 257 | Generate a new feature called "FamilySize" by summing the "SibSp" and "Parch" columns. Then, calculate the Pearson correlation coefficient (r) between the "FamilySize" and "Fare" columns.
World Information Retrieval | GAIA | 165 | On the BBC Earth YouTube video of the Top 5 Silliest Animal Moments, what species of bird is featured?
(Scientific) Chemistry | SciBench | 41 | Calculate the pressure in kilopascals exerted by 1.25 g of nitrogen gas in a flask of volume 250 cm^3 at 20 °C.
(Scientific) Physics | SciBench | 34 | If the coefficient of static friction between the block and plane in the previous example is mu_s = 0.4, at what angle theta will the block start sliding if it is initially at rest?

Compared methods and implementation. For the mathematics, programming, data analysis, and scientific scenarios, we compare Captain Agent against four methods: Vanilla LLM (prompting an LLM once for an answer), AutoAgents, Meta-prompting, and a two-agent system (an Assistant agent paired with an Executor agent) realized with AutoGen. We implement AutoAgents on top of AutoGen because the official implementation is unstable and unsuitable for large-scale experiments; for Meta-prompting, we improve its code-execution ability by reproducing it within the AutoGen framework. For Captain Agent, we adopt all-mpnet-base-v2 to compute the sentence embeddings for agent and tool retrieval. A User Proxy Agent communicates with Captain Agent, providing the feedback of code execution, tool calling (adaptive build), and nested-conversation reflection results, plus a default reply: "I'm a proxy, and I can only execute your code and tool or end the conversation. If you think the problem is solved, please reply to me only with 'TERMINATE.'" All these methods use a gpt-4-0125-preview backbone and the same task-specific prompt per scenario (see Appendix D). For the world-information retrieval scenario, we compare Captain Agent with the top-5 baselines (with reference) reported on the GAIA validation leaderboard, which include AutoGen: GAIA_Orchestrator (a three-agent setting organized by an Orchestrator agent designed for GAIA), FRIDAY, Warm-up Act (which has no official implementation), and HuggingFace Agent. All these baselines use a gpt-4-1106-preview backbone, except the HuggingFace Agent, which uses LLaMA-3-70B.

Agent and tool library. We initialize our agent library from a small subset of problem instances from each dataset in Table 1 (about 20 questions per dataset, described in Section 3.4). Specifically, we run Captain Agent on this subset and iteratively update the library by adding the generated agents; the agent library then remains unchanged during the main experiments. The agent library also supports all hand-crafted agents (of the ConversableAgent class) archived in AutoGen (details in Appendix F); all these agents follow the ConversableAgent interface to converse with each other. Our tool library consists of a suite of callable Python functions intended for free-form coding. Agents can freely import functions from the tool library and write free-form code that integrates their outputs to handle sophisticated tasks (see also Appendices E and G). The library contains three main categories of tools: math, data analysis, and world-information retrieval. For each category, we summarize the patterns of the corresponding dataset and manually craft a set of functions suited to the tasks.
3.2 Evaluation Protocol

For the mathematics, data analysis, and science scenarios, we report each method's accuracy by comparing its final result with the ground truth. To ensure fairness, we transform the different result formats into a uniform format, preventing a correct answer from being judged incorrect due to a format mismatch. For the programming scenario, we run the code produced by each method and have it output a unique token if the code passes all tests; we then count the success tokens and compute each method's accuracy.
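As an illustration of the kind of format normalization described here (not the paper's exact script), a checker might strip the \box{} wrapper requested in the task instructions and fall back to numeric comparison; the regex and tolerance below are assumptions.

```python
import re
from fractions import Fraction

def normalize(answer: str) -> str:
    """Strip LaTeX wrappers and punctuation so equivalent answers compare equal."""
    s = answer.strip()
    m = re.search(r"\\box(?:ed)?\{([^}]*)\}", s)   # pull the content out of \box{...}
    if m:
        s = m.group(1)
    return s.replace("$", "").replace(",", "").strip()

def is_correct(prediction: str, ground_truth: str, tol: float = 1e-4) -> bool:
    p, g = normalize(prediction), normalize(ground_truth)
    if p == g:                                     # exact symbolic match
        return True
    try:                                           # otherwise compare numerically
        return abs(float(Fraction(p)) - float(Fraction(g))) < tol
    except (ValueError, ZeroDivisionError):
        return False

assert is_correct(r"After verification, the answer is \box{1/3}", "0.333333")
```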
Table 2: Comparison results on different real-world scenarios. We report accuracy for each baseline and Captain Agent in each scenario and mark the best results in bold. All methods use gpt-4-0125-preview as the backbone LLM.

Method | Mathematics | Programming | Data Analysis | (Sci) Chemistry | (Sci) Physics | Avg.
Vanilla LLM | 51.53 | 84.76 | 6.61 | 39.02 | 31.25 | 40.98
Meta-prompting | 68.88 | 19.51 | 39.69 | 41.46 | 43.75 | 43.47
AutoAgents | 56.12 | 84.76 | 57.98 | 60.98 | 50.00 | 63.58
AutoGen: Assistant + Executor | 74.49 | 93.90 | 82.88 | 60.98 | 43.75 | 79.89
Captain Agent | 77.55 | 96.95 | 88.32 | 65.85 | 53.12 | 84.25

Table 3: Comparison results on the world-information retrieval scenario (GAIA validation). We report accuracy at each level and the average over the three levels and mark the best results in bold. Captain Agent achieves the best results with minimal prompt engineering.

Method | Level 1 | Level 2 | Level 3 | Avg.
Huggingface-Agent (LLaMA-3-70B) | 30.19 | 11.63 | 7.69 | 16.97
Warm-up Act | 35.19 | 15.12 | 0 | 17.58
FRIDAY | 45.28 | 34.88 | 11.54 | 34.55
AutoGen: GAIA_Orchestrator | 54.72 | 38.31 | 11.54 | 39.39
Captain Agent | 56.60 | 39.53 | 11.54 | 40.60

3.3 Main Results

Tables 2 and 3 report the comparison between Captain Agent and eight different baselines on six real-world scenarios. Baseline results on world-information retrieval are taken directly from the GAIA leaderboard.

Findings 1: Diverse agents help trigger accurate expert output for problem-solving. Comparing Captain Agent, AutoAgents, and AutoGen Assistant + Executor, we observe that Captain Agent and AutoAgents on average outperform AutoGen Assistant + Executor on the (Sci) Chemistry and (Sci) Physics scenarios. These scenarios require expert knowledge that is hard for the AutoGen Assistant, with its fixed system message, to provide. Captain Agent and AutoAgents can create diverse experts by assigning different domain-specific system messages to agents, which better triggers the intrinsic knowledge inside an LLM and yields accurate answers. Captain Agent outperforms AutoAgents in all scenarios because it provides a high-level plan and solves each step with adaptive instructions and an adaptive agent team.

Findings 2: Adaptive team building boosts performance with no task preference. Captain Agent achieves outstanding results across all scenarios, indicating that it is free from task preference. Incorporating different agents into the team at the proper time allows Captain Agent to solve difficult tasks such as science and world-information retrieval problems step by step. Meta-prompting, in contrast, fails in the science scenarios because it cannot decompose science problems into fine-grained subtasks that a single agent can solve. With its adaptive team-building paradigm, Captain Agent neither requires the task to be decomposable into subtasks each solvable by one agent, nor requires all agents to be involved in the conversation. We discuss static and adaptive teams further in Section 3.4.1.

3.4 Analysis and Ablation Studies

In this section, we examine the difference between static and adaptive team building, the influence of the agent and tool libraries, and the possibility of working with open-weight models. We perform ablation studies on a subset of the datasets in Table 1. Specifically, we choose 17 problems from MATH and 25 problems from HumanEval following AutoGenBench, in which the problems are randomly selected from the GPT-4 failure set. For DABench, we randomly select 25 problems, and for SciBench, we randomly select 19 problems each for chemistry and physics according to the number of textbooks. The evaluation protocol is the same as in Section 3.3.

Table 4: Ablation comparison between static and adaptive team building on the selected subset. We mark the best results in bold. Dynamic team building during the conversation improves performance across scenarios.

Method | Mathematics | Programming | Data Analysis | (Sci) Chemistry | (Sci) Physics
Static Team | 64.71 | 88.00 | 85.00 | 47.37 | 68.42
Adaptive Team (Captain Agent) | 82.35 | 96.00 | 95.00 | 52.63 | 68.42

Table 5: Ablation study of the tool library and agent library on the world-information retrieval scenario (GAIA). We report accuracy at each level and the average over the three levels and mark the best results in bold.

Agent Library | Tool Library | Level 1 | Level 2 | Level 3 | Avg.
- | - | 32.07 | 13.95 | 3.84 | 18.18
✓ | - | 37.73 | 30.23 | 7.69 | 29.09
- | ✓ | 39.62 | 19.78 | 7.69 | 24.24
✓ | ✓ | 56.60 | 39.53 | 11.54 | 40.60

Table 6: Comparison of different LLM backbones for nested-conversation participants on the ablation subset. Instructions to the nested conversation are given by a Captain Agent with a gpt-4-0125-preview backbone. Best results are marked in red bold and second-best results in blue.

Backbone LLM | Mathematics | Programming | Data Analysis | (Sci) Chemistry | (Sci) Physics
Blackbox Models
w/ gpt-3.5-turbo | 35.29 | 92.00 | 65.00 | 42.11 | 42.11
w/ claude-3-sonnet | 35.29 | 80.00 | 60.00 | 15.79 | 26.32
w/ gemini-1.5-pro | 70.58 | 80.00 | 80.00 | 57.89 | 42.11
w/ gpt-4-0125-preview (default) | 82.35 | 96.00 | 95.00 | 52.63 | 68.42
Open-weight Models
w/ Meta-Llama-3-70B-Instruct | 52.94 | 88.00 | 80.00 | 52.63 | 47.37
w/ Mixtral-8x22B-instruct-v0.1 | 29.41 | 76.00 | 55.00 | 47.37 | 21.05

3.4.1 Static vs. adaptive team building

To further explore the power of adaptive team building, we compare it with static team building. Specifically, we run a task-specific static paradigm that builds a team of agents in the same way as Captain Agent once at the beginning of each task and lets that fixed team solve every problem. The results, summarized in Table 4, show that the adaptive team-building paradigm comprehensively outperforms the static one.

3.4.2 Ablation on tool library and agent library

Next, we conduct an ablation study on the utility of the tool and agent libraries. We remove the tool library, the agent library, and both libraries in turn, and evaluate performance on the world-information retrieval task, i.e., the GAIA dataset. As shown in Table 5, removing either the agent library or the tool library significantly impairs the system's performance.
While both the tool and agent libraries can enhance performance independently, optimal results are achieved only when both are employed concurrently. Level 1 tasks require a moderate amount of web browsing and reasoning, which can be handled by a few single-turn tool calls or by experts writing and executing code iteratively; introducing both the agent library and the tool library makes the system more stable and robust to unknown errors during web interaction, thereby improving performance. Notably, without the agent library, Captain Agent performs much worse on Level 2 tasks. These tasks are more sophisticated and typically involve many web navigation and reasoning steps, and web browsing involves complex, dynamic interactions that are poorly suited to a static tool library: the agents need to coordinate multiple tools to reach the goal, a process prone to error in unexpected web scenarios.

3.4.3 Nested conversation with different backbone LLMs

In this section, we try different backbone LLMs for the nested-conversation participants, including black-box models (gpt-3.5-turbo, claude-3-sonnet, gemini-1.5-pro, and gpt-4-0125-preview, the default in our main results) and open-weight models (LLaMA-3-70B, i.e., Meta-Llama-3-70B-Instruct, and Mixtral-8x22B, i.e., Mixtral-8x22B-instruct-v0.1). The instructions for the nested conversation are still given by a Captain Agent with a gpt-4-0125-preview backbone. The results are recorded in Table 6. Beyond the expected result that nesting with gpt-4-0125-preview still achieves the best performance on most scenarios, we find that gemini-1.5-pro also performs well while being roughly 30% cheaper than gpt-4-0125-preview. LLaMA-3-70B achieves three second-best results and outperforms two black-box models while being about 16.7 times cheaper than gpt-4-0125-preview⁵. We also notice that models can have task preferences that influence the quality of the nested chat: for example, gpt-3.5-turbo has strong code-generation ability, which helps in the programming scenario, while gemini-1.5-pro works better on mathematics, data analysis, and chemistry problems.
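Swapping the nested-chat participants' backbone while keeping Captain Agent on gpt-4-0125-preview could look like the following sketch. It assumes the pyautogen-style llm_config/config_list convention and an OpenAI-compatible endpoint for the open-weight model; the endpoint URL and environment-variable names are placeholders.

```python
import os
from autogen import ConversableAgent

# Captain Agent keeps the default GPT-4 backbone used in the main experiments.
captain_llm_config = {
    "config_list": [{"model": "gpt-4-0125-preview", "api_key": os.environ["OPENAI_API_KEY"]}],
    "temperature": 0,
}

# Nested participants backed by LLaMA-3-70B served behind an OpenAI-compatible API.
participant_llm_config = {
    "config_list": [{
        "model": "meta-llama/Meta-Llama-3-70B-Instruct",
        "base_url": "https://api.example.com/v1",            # placeholder endpoint
        "api_key": os.environ.get("OPEN_WEIGHT_API_KEY", "EMPTY"),
    }],
    "temperature": 0,
}

physics_expert = ConversableAgent(
    name="Physics_Expert",
    system_message="You are an expert in classical mechanics. Solve tasks step by step.",
    llm_config=participant_llm_config,
)
```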
4 Related Work

Large language models (LLMs) represent a significant advancement in artificial intelligence, showcasing remarkable capabilities in various aspects, including reasoning [53, 63, 64, 65, 66, 67], planning [68, 69, 37, 70], and adaptability to novel real-world observations [71, 35, 3, 72, 73, 74]. Leveraging the inherent versatility of LLMs as generalized models adaptable to diverse scenarios, numerous efforts have been dedicated to developing intelligent agents [33, 32, 25, 75, 76] in which LLMs serve as foundational components. For instance, one typical algorithm, ReAct, employs a single LLM to iteratively generate both reasoning trajectories and task-specific actions; this interleaved process enables the agent to engage in dynamic reasoning. In addition, LLM agents can harness external tools [19, 20, 21, 22, 23, 24, 25, 26, 27], leveraging both their internal capabilities and external resources to solve more intricate problems. The success of single-agent systems motivates the development of multi-agent systems [31, 32, 1, 33, 34, 35, 25, 36, 37, 38, 39, 40, 41]. Methods focusing on static build require a protocol for agents to communicate with each other in a group chat and a builder that receives the user's instruction and outputs an agent list [33, 1, 35]. The builder can be a human [33, 35] or an LLM agent. Other works break complex tasks down into smaller components, each handled by a single specialized agent with detailed natural-language instructions [77, 78]. This task decomposition reduces the prediction burden on each agent by avoiding irrelevant context. For instance, meta-prompting involves a meta-model that decomposes tasks and assigns subtasks to different LLMs for completion and aggregation.

5 Conclusion and Discussion

Conclusion. We introduce a new paradigm for multi-agent team building, adaptive build, which helps ensure diversity, prevents limited knowledge extraction, and reduces stereotypical outputs. The paradigm is executed by our proposed agent, Captain Agent, which manages agent teams for each problem-solving step using adaptive multi-agent team building and nested group conversation with reflection. Experimental results across six real-world scenarios demonstrate Captain Agent's efficacy on various tasks without task-specific prompt engineering, achieving superior results compared with existing methods. Ablation studies confirm that each component contributes equally to overall performance, underscoring the robustness of our approach.

Discussion. In this work, we demonstrate the outstanding performance of Captain Agent over six real-world scenarios. Captain Agent can organize teams adaptively and solve a task step by step with different teams. Although we only discuss the setting in which Captain Agent collaborates with a static User Proxy, Captain Agent can also collaborate with other specialized agents, for example, a sophisticated planner. We also notice that long contexts and irrelevant, less important information (e.g., failed code blocks) add noise to the problem-solving process. Conversation pruning is therefore a promising direction for future work, minimizing the interference of irrelevant information while reducing cost.

⁵ Our estimate is based on Deepinfra pricing, where LLaMA-3-70B costs $0.59 (input) / $0.79 (output) per 1M tokens.

References

Chen, G., S. Dong, Y. Shu, et al. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023.
Yao, S., J. Zhao, D. Yu, et al. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
Yang, H., S. Yue, Y. He. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023.
Furuta, H., K.-H. Lee, O. Nachum, et al. Multimodal web navigation with instruction-finetuned foundation models. In The Twelfth International Conference on Learning Representations. 2024.
Yang, J., C. E. Jimenez, A. Wettig, et al. Swe-agent: Agent computer interfaces enable software engineering language models, 2024.
Hong, S., Y. Lin, B. Liu, et al. Data interpreter: An llm agent for data science. arXiv preprint arXiv:2402.18679, 2024.
Dong, Q., L. Li, D. Dai, et al. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2022.
Brown, T., B. Mann, N. Ryder, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
Yang, J., B. Hui, M. Yang, et al. Iterative forward tuning boosts in-context learning in language models. arXiv preprint arXiv:2305.13016, 2023.
Dai, D., Y. Sun, L. Dong, et al. Why can gpt learn in-context? Language models secretly perform gradient descent as meta-optimizers. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4005–4019. 2023.
Li, Y., M. E. Ildiz, D.
Papailiopoulos, et al. Transformers as algorithms: Generalization and stability in in-context learning. In International Conference on Machine Learning, pages 19565–19594. PMLR, 2023. Sun, H., Y. Zhuang, L. Kong, et al. Adaplanner: Adaptive planning from feedback with language models. Advances in Neural Information Processing Systems, 36, 2024. Xie, J., K. Zhang, J. Chen, et al. Travelplanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622, 2024. Liu, B., Y. Jiang, X. Zhang, et al. Llm+p: Empowering large language models with optimal planning proficiency. ArXiv, abs/2304.11477, 2023. Valmeekam, K., A. Olmo, S. Sreedharan, et al. Planbench: An extensible benchmark for evalu- ating large language models on planning and reasoning about change. In Neural Information Processing Systems. 2022. Wei, J., X. Wang, D. Schuurmans, et al. Chain of thought prompting elicits reasoning in large language models. ArXiv, abs/2201.11903, 2022. Yuan, S., J. Chen, Z. Fu, et al. Distilling script knowledge from large language models for constrained language planning. In Annual Meeting of the Association for Computational Linguistics. 2023. Zheng, B., B. Gou, J. Kil, et al. Gpt-4v(ision) is a generalist web agent, if grounded. ArXiv, abs/2401.01614, 2024. Qin, Y., S. Hu, Y. Lin, et al. Tool learning with foundation models, 2023. Qin, Y., S. Liang, Y. Ye, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023. 10 Schick, T., J. Dwivedi-Yu, R. Dessì, et al. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2024. Cai, T., X. Wang, T. Ma, et al. Large language models as tool makers. arXiv preprint arXiv:2305.17126, 2023. Yuan, L., Y. Chen, X. Wang, et al. Craft: Customizing llms by creating and retrieving from specialized toolsets. arXiv preprint arXiv:2309.17428, 2023. Paranjape, B., S. Lundberg, S. Singh, et al. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023. Zhang, S., J. Zhang, J. Liu, et al. Training language model agents without modifying language models. arXiv preprint arXiv:2402.11359, 2024. Huang, Y., J. Shi, Y. Li, et al. Metatool benchmark for large language models: Deciding whether to use tools and which to use. arXiv preprint arXiv:2310.03128, 2023. Ma, Z., W. Huang, J. Zhang, et al. m&m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks. In Synthetic Data for Computer Vision Workshop@ CVPR 2024. 2024. Fernandes, P., A. Madaan, E. Liu, et al. Bridging the gap: A survey on integrating (human) feedback for natural language generation. Transactions of the Association for Computational Linguistics, 11:1643–1668, 2023. Wang, X., Z. Wang, J. Liu, et al. Mint: Evaluating llms in multi-turn interaction with tools and language feedback. arXiv preprint arXiv:2309.10691, 2023. Yang, J., A. Prabhakar, K. Narasimhan, et al. Intercode: Standardizing and benchmarking interactive coding with execution feedback. Advances in Neural Information Processing Systems, 36, 2024. Wang, L., C. Ma, X. Feng, et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432, 2023. Xi, Z., W. Chen, X. Guo, et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023. Wu, Q., G. Bansal, J. Zhang, et al. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. 
arXiv preprint arXiv:2308.08155, 2023. Suzgun, M., A. T. Kalai. Meta-prompting: Enhancing language models with task-agnostic scaffolding. arXiv preprint arXiv:2401.12954, 2024. Hong, S., X. Zheng, J. Chen, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023. Zhang, J., R. Krishna, A. H. Awadallah, et al. Ecoassistant: Using llm assistant more affordably and accurately. arXiv preprint arXiv:2310.03046, 2023. Valmeekam, K., M. Marquez, S. Sreedharan, et al. On the planning abilities of large language models-a critical investigation. Advances in Neural Information Processing Systems, 36:75993– 76005, 2023. Wang, Y., Z. Wu, J. Yao, et al. Tdag: A multi-agent framework based on dynamic task decomposition and agent generation. arXiv preprint arXiv:2402.10178, 2024. Saha, S., O. Levy, A. Celikyilmaz, et al. Branch-solve-merge improves large language model evaluation and generation. arXiv preprint arXiv:2310.15123, 2023. Liang, T., Z. He, W. Jiao, et al. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023. Du, Y., S. Li, A. Torralba, et al. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023. 11 Elimari, N., G. Lafargue. Network neuroscience and the adapted mind: Rethinking the role of network theories in evolutionary psychology. Frontiers in psychology, 11:545632, 2020. Confer, J. C., J. A. Easton, D. S. Fleischman, et al. Evolutionary psychology: Controversies, questions, prospects, and limitations. American psychologist, 65(2):110, 2010. Mao, A., W. Mason, S. Suri, et al. An experimental study of team size and performance on a complex task. PloS one, 11(4):e0153048, 2016. Hendrycks, D., C. Burns, S. Kadavath, et al. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 2021. Hu, X., Z. Zhao, S. Wei, et al. Infiagent-dabench: Evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507, 2024. Le, T. H., H. Chen, M. A. Babar. Deep learning for source code modeling and generation: Models, applications, and challenges. ACM Computing Surveys (CSUR), 53(3):1–38, 2020. Wang, X., Z. Hu, P. Lu, et al. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023. Mialon, G., C. Fourrier, T. Wolf, et al. GAIA: a benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations. 2024. Lewis, P., E. Perez, A. Piktus, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020. Gao, Y., Y. Xiong, X. Gao, et al. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023. Ram, O., Y. Levine, I. Dalmedigos, et al. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023. Wei, J., X. Wang, D. Schuurmans, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. Shinn, N., F. Cassano, A. Gopinath, et al. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024. Hendrycks, D., C. Burns, S. Kadavath, et al. 
Measuring mathematical problem solving with the math dataset. NeurIPS, 2021. Chen, M., J. Tworek, H. Jun, et al. Evaluating large language models trained on code. 2021. Hu, X., Z. Zhao, S. Wei, et al. Infiagent-dabench: Evaluating agents on data analysis tasks, 2024. Mialon, G., C. Fourrier, C. Swift, et al. Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983, 2023. GAIA_Orchestrator. Github | autogen: Gaia orchestrator. https://github.com/microsoft/ autogen/tree/gaia_multiagent_v01_march_1st/samples/tools/autogenbench/ scenarios/GAIA/Templates/Orchestrator, 2024. Wu, Z., C. Han, Z. Ding, et al. Os-copilot: Towards generalist computer agents with self- improvement. arXiv preprint arXiv:2402.07456, 2024. Huggingface. Huggingface agents. https://huggingface.co/docs/transformers/en/ transformers_agents, 2024. AutoGenBench. Github | autogenbench. https://microsoft.github.io/autogen/blog/ 2024/01/25/AutoGenBench, 2024. Yao, S., D. Yu, J. Zhao, et al. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024. 12 Morishita, T., G. Morio, A. Yamaguchi, et al. Learning deductive reasoning from synthetic corpus based on formal logic. In International Conference on Machine Learning, pages 25254– 25274. PMLR, 2023. Zhang, S., X. Xia, Z. Wang, et al. Ideal: Influence-driven selective annotations empower in-context learners in large language models. arXiv preprint arXiv:2310.10873, 2023. Li, L. H., J. Hessel, Y. Yu, et al. Symbolic chain-of-thought distillation: Small models can also" think" step-by-step. arXiv preprint arXiv:2306.14050, 2023. Ho, N., L. Schmid, S.-Y. Yun. Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071, 2022. BabyAGI. Github | babyagi. https://github.com/yoheinakajima/babyagi, 2023. Song, C. H., J. Wu, C. Washington, et al. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2998–3009. 2023. Liu, Z., Y. Zhang, P. Li, et al. Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170, 2023. Shi, W., R. Xu, Y. Zhuang, et al. Ehragent: Code empowers large language models for complex tabular reasoning on electronic health records. arXiv preprint arXiv:2401.07128, 2024. Dan, Y., Z. Lei, Y. Gu, et al. Educhat: A large-scale language model-based chatbot system for intelligent education. arXiv preprint arXiv:2308.02773, 2023. Zhou, S., F. F. Xu, H. Zhu, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023. Bharadhwaj, H., J. Vakil, M. Sharma, et al. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. arXiv preprint arXiv:2309.01918, 2023. Sumers, T. R., S. Yao, K. Narasimhan, et al. Cognitive architectures for language agents. arXiv preprint arXiv:2309.02427, 2023. Zhou, W., Y. E. Jiang, L. Li, et al. Agents: An open-source framework for autonomous language agents. arXiv preprint arXiv:2309.07870, 2023. Suzgun, M., A. T. Kalai. Meta-prompting: Enhancing language models with task-agnostic scaffolding. arXiv preprint arXiv:2401.12954, 2024. Zhuge, M., H. Liu, F. Faccio, et al. Mindstorms in natural language-based societies of mind. arXiv preprint arXiv:2305.17066, 2023. Zhang, H., J. Da, D. Lee, et al. 
A careful examination of large language model performance on grade school arithmetic. arXiv preprint arXiv:2405.00332, 2024. Xu, R., Z. Wang, R.-Z. Fan, et al. Benchmarking benchmark leakage in large language models. arXiv preprint arXiv:2404.18824, 2024. 13 A Limitations The first limitation of our work is cost. A conversation involving the GPT-4 model costs more than a single-agent system. Although we have reduced the cost by decreasing the participant nested group chat agents, it still has a large conversation and profile as context input. The trade-off between performance and cost will become one of the possible future works for further exploration, like window context, conversation pruning, or conversation compression. Another limitation of our work is the lack of thinking about model diversity. In Table 6, we have demonstrated that the model has task preference, which will influence the nested chat quality. However, before we go deep into the discussion of model preference, we should also notice that the current evaluation of LLM is not perfect. Data leaking is widespread in the pertaining process and will cause the misalignment between the test and real-world performance [79, 80]. Therefore, a comprehensive yet fair evaluation is important for us to further discuss the ability of model diversity. B Social Impact Our method dynamically ensembles LLM agents and equips them with versatile tools, allowing them to efficiently and effectively solve complex tasks. However, the development of agent systems that interact with the web environment raises safety concerns. The scope of our experiment in real-world interaction is limited to solving GAIA tasks, where the agents are required to search the web and browse websites. The agents are restricted from accessing publicly available information and are not capable of publishing content on the web. This ensures that our experiments remain both non-invasive and safe. C Instruction of Captain Agent We design a general profile message (system message) for Captain Agent to ensure it can exe- cute our paradigm efficiently and effectively. Instructions are in markdown format, including a planning instruction that can decompose the task into multiple steps, a building instruction (the seek_experts_help), a post-seek_agent_help instruction, and some general instructions that help task solving. C.1 System Message 1 """ 2 # Your role 3 You are a perfect manager of a group of advanced experts. 4 5 # How to solve the task 6 When a task is assigned to you : 7 1. Analysis of its constraints and conditions for completion. 8 2. Response with a specific plan of how to solve the task. 9 10 After that , you can solve the task in two ways : 11 - Delegate the resolution of tasks to other experts created by seeking a group of experts to help and derive conclusive insights from their conversation summarization. 12 - Analyze and solve the task using your coding and language skills. 13 14 # How to seek experts help 15 The tool " seek_experts_help " can build a group of experts according to the building_task and let them chat with each other in a group chat to solve the execution_task you provided. 16 - This tool will summarize the essence of the experts ’ conversation and the derived conclusions. 17 - You should not modify any task information from meta_user_proxy , including code blocks , but you can provide extra information. 18 - Within a single response , you are limited to initiating one group of experts. 
19 14 20 ## building_task 21 This task helps a build manager to build a group of experts for your task. 22 You should suggest less than { max_agent_number } roles ( including a checker for verification ) with the following format. 23 24 ### Format 25 - [ Detailed description for role 1] 26 - [ Detailed description for role 2] 27... 28 - [ Detailed description for verifier ] 29 30 ## execution_task 31 This is the task that needs the experts to solve by conversation. 32 You should Provide the following information in markdown format. 33 34 ### Format 35 ## Task description 36... 37 ## Plan for solving the task 38... 39 ## Output format 40... 41 ## Constraints and conditions for completion 42... 43 ## [ Optional ] results ( including code blocks ) and reason from the last response 44... 45 46 # After seek_experts_help 47 You will receive a comprehensive conclusion from the conversation , including the task information , results , reason for the results , conversation contradictions or issues , and additional information. 48 You ** must ** conduct a thorough verification for the result and reason ’s logical compliance by leveraging the step - by - step backward reasoning with the same group of experts ( with the same group name ) when : 49 - The conversation has contradictions or issues ( need double - check marked as yes ) or 50 - The result is different from the previous results. 51 52 Note that the previous experts will forget everything after you obtain the response from them. You should provide the results ( including code blocks ) you collected from the previous experts ’ responses and put them in the new execution_task. 53 54 # Some useful instructions 55 - You only have one tool called " seek_experts_help." 56 - Provide a answer yourself after " seek_experts_help ". 57 - You should suggest Python code in a Python coding block ( ‘ ‘ ‘ python... ‘ ‘ ‘). 58 - You must indicate the script type in the code block when using code. 59 - Do not suggest incomplete code which requires users to modify. 60 - Be clear about which step uses code , which step uses your language skill , and which step to build a group chat. 61 - If the code ’s result indicates an error , fix the error and output the code again. 62 - If the error can ’t be fixed or if the task is not solved even after the code is executed successfully , analyze the problem , revisit your assumption , collect additional info you need , and think of a different approach to try. 63 - When you find an answer , verify the answer carefully. 64 - Include verifiable evidence in your response if possible. 65 - After completing all tasks and verifications , you should conclude the operation and reply " TERMINATE " 15 66 """ C.2 Reflector LLM 1 """ 2 # Your task 3 Briefly summarize the conversation history derived from an experts ’ group chat by following the answer format. 4 If you found non - trivial contradictions or issues in the conversation , point it out with a detailed reason and mark the " Need double - check " as " Yes." 5 6 # Conversation history : 7 { chat_history } 8 9 # Answer format 10 ## Task 11... 12 13 ## Results 14... 15 16 ## Reason for the results 17... 18 19 ## Contradictions or issues in the conversation 20... 21 22 ### Need to double - check ? 23 [ Yes or No ] 24 25 ## Additional information ( file path , code blocks , url , etc.) 26... 27 """ C.3 Agent Selector LLM 1 """ 2 # Your goal 3 Match roles in the role set to each expert in the expert set. 
4 5 # Skill set 6 { skills } 7 8 # Expert pool ( formatting with name : description ) 9 { expert_pool } 10 11 # Answer format 12 ‘‘‘ json 13 {{ 14 " skill_1 description ": " expert_name : expert_description " , // if there exists an expert that suitable for skill_1 15 " skill_2 description ": " None " , // if there is no experts that suitable for skill_2 16... 17 }} 18 ‘‘‘ 19 """ 16 D Task Instructions We design instructions manually for each scenario and ensure all baselines and Captain Agent receive the same instructions for a fair comparison6. All instructions include the basic information of the scenario and may suggest some possible Python libraries, including pandas, numpy, scipy, and sympy. D.1 Instruction for Mathematics 1 """ 2 Please solve the following math problem : 3 { problem } 4 For problems that may be difficult to calculate , try to approximate using Python instead of exact solutions. The following Python packages are pre - installed : sympy , numpy , and scipy. Do not plot any figure. 5 After verification , reply with the final answer in \\ box {{}}. 6 """ D.2 Instruction for Programming 1 """ 2 The following python code imports the ‘ run_tests ( candidate ) ‘ function from my_tests. py , and runs it on the function ‘ __ENTRY_POINT__ ‘. This will run a set of automated unit tests to verify the correct implementation of ‘ __ENTRY_POINT__ ‘. However , ‘ __ENTRY_POINT__ ‘ is only partially implemented in the code below. Complete the implementation of ‘ __ENTRY_POINT__ ‘ and output a new stand - alone code block that contains everything needed to run the tests , including : importing ‘ my_tests ‘ , calling ‘ run_tests ( __ENTRY_POINT__ ) ‘, as well as __ENTRY_POINT__ ’s complete definition , such that this code block can be run directly in Python. 3 4 ‘‘‘ python 5 from my_tests import run_tests 6 7 { problem } 8 9 # Run the unit tests. All unit tests are running online. DO NOT MODIFY THE FOLLOWING LINE. 10 run_tests ( __ENTRY_POINT__ ) 11 ‘‘‘ 12 """ D.3 Instruction for Data Analysis 1 """ 2 Let ’s solve a data analysis problem. Given a CSV file path , you are required to solve a problem following a constraint. Do not plot any figure. 3 4 FILE PATH : { file_path } 5 6 PROBLEM : { problem } 7 8 CONSTRAINT : { constraint } 9 10 After verification , reply with the final answer in the format of 6 Except for the world information retrieval scenario (GAIA), in which we use the results directly from the leaderboard. 17 11 { formats } 12 """ D.4 Instruction for Science (Chemistry and Physics) 1 """ 2 Please solve the following chemistry / physics problem : 3 { problem } 4 5 Try to approximate using Python instead of using exact solutions for some problems that may be difficult to calculate. The following python packages are pre - installed : sympy numpy scipy. Do not plot any figure. 6 7 The required unit of the answer is { unit }. 8 After verification , reply with the final answer in \\ box {{}}. 9 """ D.5 Instruction for World-information Retreival 1 """ 2 # Task 3 You need to solve the question below given by a user. When you are building tasks , explicitly consider where the task can benefit from web navigation capability. 4 5 # Task 6 { task } 7 """ E Case Studies Figure 4 illustrates the free-form tool-using ability in the nested conversation when solving a problem in GAIA. 
E Case Studies

Figure 4 illustrates the free-form tool use within a nested conversation when solving a problem from GAIA. Four agents are involved in the conversation: DigitalMedia_Expert, Ornithology_Expert, VideoContentAnalysis_Expert, and UserProxy. DigitalMedia_Expert uses the perform_web_search tool to search the internet for "BBC Earth YouTube Top 5 Silliest Animal Moments", and VideoContentAnalysis_Expert uses the get_youtube_subtitle tool to retrieve the subtitles of a specific video. After their collaboration, they successfully obtain the correct answer, "Rockhopper penguin."

[Figure 4 appears here.] Figure 4: A case of multi-agent conversation with free-form tool use when solving a problem in GAIA. Three agents and a user proxy participated in the conversation, solving a problem given and planned by Captain Agent collaboratively with the perform_web_search and get_youtube_subtitle tools.
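The nested chat in Figure 4 chains two calls from the tool library: a web search to locate the video, then a caption lookup to ground the species identification. The snippet below mirrors that flow as a sketch only; it assumes the paper's tool library is importable as a `functions` module, and the call signatures are inferred from the conversation shown in Figure 4 rather than taken from released code.

```python
# Sketch of the two tool calls chained in the Figure 4 case study.
# Assumes the paper's tool library is importable as `functions`; the call
# signatures below are inferred from the conversation shown in Figure 4.
from functions import perform_web_search, get_youtube_subtitle

# Step 1 (DigitalMedia_Expert): locate the BBC Earth video via a web search.
search_results = perform_web_search("BBC Earth YouTube Top 5 Silliest Animal Moments")
print(search_results)

# Step 2 (VideoContentAnalysis_Expert): retrieve the captions of the video
# identified in step 1 (video ID taken from the case study).
captions = get_youtube_subtitle("2Njmx-UuU3M")
print(captions)  # the species is identified from the caption text
```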
F Agent Library

Our agent library contains 541 agents, including 540 generated agents and one hand-crafted ConversableAgent archived in AutoGen (WebSurferAgent). Here is an example of an agent recorded in the agent library:

{
    "description": "PythonProgramming_Expert is a seasoned authority on rocket physics and classical mechanics, adept in Python programming and utilizing specialized libraries to solve complex aerospace problems with high precision and accuracy.",

    "tags": ["gpt-4", "0125", "1106", "claude3", "sonnet", "haiku", "gemini-1.5", "llama3", "8b", "70b", "mixtral", "8x22b", "8x7b"],

    "name": "PythonProgramming_Expert",

    "system_message": "## Your role\nPythonProgramming_Expert is an authoritative specialist in the realm of classical mechanics, with a razor-sharp focus on the intriguing world of rocket physics. This expert boasts a profound understanding of the underlying principles that govern the motion and dynamics of rockets, from their ascent through Earth's atmosphere to their navigation across the vast expanse of space.\n\n## Task and skill instructions\n- Aspiring to the pinnacle of precision and accuracy, PythonProgramming_Expert is armed with an extensive array of numerical methods and approximation techniques. They have mastered the art of formulating and solving complex mathematical problems, using these tools to make precise predictions and optimizations in rocket trajectories and propulsion systems.\n- In addition to their expansive knowledge of physical laws and equations, this expert is a virtuoso in Python programming, wielding libraries like sympy for symbolic mathematics, numpy for numerical computations, and scipy for additional scientific computing capabilities. These tools are the chisels with which PythonProgramming_Expert sculpts solutions to elaborate aerospace quandaries.\n- PythonProgramming_Expert's deft problem-solving abilities are matched only by their meticulous approach to mathematical calculations. Whether confronting a routine calculation or an esoteric formula, they tackle each challenge with the same level of dedication and expertise.\n- Finally, with an unrelenting commitment to veracity, PythonProgramming_Expert rigorously verifies physical and mathematical results. They understand that in the delicate ballet of spaceflight, there is no room for error and that the accurate validation of results is paramount for successful missions. This dedication ensures that when PythonProgramming_Expert presents a solution, it is not only theoretically sound but also practically reliable."
},
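Entries like the one above can be retrieved by matching a required role description against the stored agent descriptions. The sketch below uses a naive word-overlap score purely for illustration: the paper's actual retrieval and selection pipeline is not reproduced here, and the library file name `agent_library.json` is an assumption, not a documented artifact.

```python
import json

def load_agent_library(path: str) -> list[dict]:
    """Load the list of agent records (description, tags, name, system_message)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def retrieve_agents(role_description: str, library: list[dict], top_k: int = 3) -> list[dict]:
    """Rank library agents by naive word overlap with the requested role description.
    This is only an illustration, not the retrieval method used in the paper."""
    query_words = set(role_description.lower().split())

    def score(agent: dict) -> int:
        return len(query_words & set(agent["description"].lower().split()))

    return sorted(library, key=score, reverse=True)[:top_k]

# Hypothetical usage (assumes an agent_library.json file exists):
# library = load_agent_library("agent_library.json")
# team = retrieve_agents("expert in rocket physics and Python programming", library)
# print([agent["name"] for agent in team])
```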
[Figure 5 appears here; its six panels cover (a) Mathematics, (b) Data Analysis, (c) Programming, (d) (Sci) Chemistry, (e) (Sci) Physics, and (f) World Information Retrieval.] Figure 5: Top-10 selected agents and the corresponding selection counts. We can observe that the selected agents are highly related to the scenario.

We also summarize the agent selection rate for each scenario in Figure 5. The selected agents are clearly related to the current scenario. The verification expert has a high selection rate because Captain Agent's system message prompts it to create a verification role to maintain the conversation. We also notice that in some scenarios (mathematics, data analysis, and programming), agents with a general name and description have a high selection rate (e.g., PythonMath_Expert, DataAnalysis_Expert, CodeReview_Expert). However, in the science scenarios there are no highly preferred agents with general descriptions, and the selection distribution becomes flatter.

G Tool Library

This section provides the names and descriptions of our manually created tool library. The tools fall into three categories: Information Retrieval, Data Analysis, and Math Problem Solving. For each category, we summarize the patterns of the corresponding dataset and manually craft a set of functions that suit the tasks and can potentially enhance the agents' task-resolution capability.

Table 7: Tools for the Information Retrieval category.
- scrape_wikipedia_tables: Scrapes Wikipedia tables based on a given URL and header keyword.
- transcribe_audio_file: Transcribes the audio file located at the given file path.
- youtube_download: Downloads a YouTube video and returns the download link.
- academic_search: Performs an academic search of papers, authors, or an author's papers.
- docx_to_md: Converts a DOCX file to Markdown format.
- pptx_to_md: Converts a PowerPoint presentation (PPTX) to Markdown format.
- spreadsheet_to_md: Converts an Excel spreadsheet file to Markdown format.
- extract_pdf_image: Extracts images from a PDF file and saves them to the specified output directory.
- extract_pdf_text: Extracts text from a specified page or the entire PDF file.
- get_youtube_caption: Retrieves the captions for a YouTube video.
- image_qa: Answers your questions about a given image.
- optical_character_recognition: Performs optical character recognition (OCR) on the given image.
- perform_web_question_answering: Performs a web search according to a keyword and answers your question on each webpage search result, or directly on the webpage if the keyword is a URL. For each search result, a response to the question is provided.

Table 8: Tools for the Data Analysis category.
- calculate_correlation: Calculates the correlation between two columns in a CSV file.
- calculate_skewness_and_kurtosis: Calculates the skewness and kurtosis of a specified column in a CSV file. The kurtosis is calculated using the Fisher definition.
- detect_outlier_iqr: Detects outliers in a specified column of a CSV file using the IQR method.
- detect_outlier_zscore: Detects outliers in a CSV file based on a specified column. The outliers are determined by calculating the z-score of the data points in the column.
- explore_csv: Reads a CSV file and prints the column names, shape, data types, and the first few lines of data.
- shapiro_wilk_test: Performs the Shapiro-Wilk test on a specified column of a CSV file.
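The data-analysis tools are deliberately small, single-purpose wrappers around a CSV file. As an indication of their granularity, here is one plausible implementation of detect_outlier_iqr using pandas; the paper does not give the actual code or signature, so treat this as an assumption-laden sketch consistent with the description in Table 8.

```python
import pandas as pd

def detect_outlier_iqr(csv_path: str, column: str) -> pd.DataFrame:
    """Return the rows of `csv_path` whose `column` value falls outside
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. A plausible implementation matching the
    Table 8 description; the paper's actual code may differ."""
    df = pd.read_csv(csv_path)
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[column] < lower) | (df[column] > upper)]

# Hypothetical usage:
# outliers = detect_outlier_iqr("data/sales.csv", "monthly_revenue")
# print(outliers)
```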
Table 9: Tools for the Math Problem Solving category.
- calculate_circle_area_from_diameter: Calculates the area of a circle given its diameter.
- calculate_day_of_the_week: Calculates the day of the week after a given number of days starting from a specified day.
- calculate_fraction_sum: Calculates the sum of two fractions and returns the result as a mixed number.
- calculate_matrix_power: Calculates the power of a given matrix.
- calculate_reflected_point: Calculates the reflection point of a given point about the line y=x.
- complex_numbers_product: Calculates the product of a list of complex numbers.
- compute_currency_conversion: Computes the currency conversion of the given amount using the provided exchange rate.
- count_distinct_permutations: Counts the number of distinct permutations of a sequence where items may be indistinguishable.
- evaluate_expression: Evaluates a mathematical expression with support for floor function notation and power notation.
- find_continuity_point: Finds the value that ensures the continuity of a piecewise function at a given point.
- fraction_to_mixed_numbers: Simplifies a fraction to its lowest terms and returns it as a mixed number.
- modular_inverse_sum: Calculates the sum of modular inverses of the given expressions modulo the specified modulus.
- simplify_mixed_numbers: Simplifies the sum of two mixed numbers and returns the result as a string in the format 'a b/c'.
- sum_of_digit_factorials: Calculates the sum of the factorial of each digit in a number.
- sum_of_primes_below: Calculates the sum of all prime numbers below a given threshold.
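Most of the math tools are thin, deterministic helpers. As an indication of their granularity, here is a minimal sketch of two of them, sum_of_primes_below and count_distinct_permutations; the paper does not publish the implementations, so these are our own versions written to match the descriptions above.

```python
from collections import Counter
from math import factorial

def sum_of_primes_below(threshold: int) -> int:
    """Sum of all primes strictly below `threshold`, via a sieve of Eratosthenes."""
    if threshold < 3:
        return 0
    is_prime = [True] * threshold
    is_prime[0] = is_prime[1] = False
    for n in range(2, int(threshold ** 0.5) + 1):
        if is_prime[n]:
            for multiple in range(n * n, threshold, n):
                is_prime[multiple] = False
    return sum(n for n, prime in enumerate(is_prime) if prime)

def count_distinct_permutations(sequence) -> int:
    """Number of distinct permutations of `sequence` with possibly repeated items:
    n! divided by the factorial of each item's multiplicity."""
    counts = Counter(sequence)
    result = factorial(len(sequence))
    for multiplicity in counts.values():
        result //= factorial(multiplicity)
    return result

# Example checks:
# sum_of_primes_below(10)                      -> 17 (2 + 3 + 5 + 7)
# count_distinct_permutations("MISSISSIPPI")   -> 34650
```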