One can learn a lot from this paper. A few observations follow.
These two papers are clearly part of Jeff Clune's paradigm of "AI-generating algorithms", https://arxiv.org/abs/1905.10985 (currently 123 citations on Google Scholar, though a number of its derivative works have higher citation counts).
Safety concerns were raised in the referenced Twitter thread and are also discussed in the paper (Section 6, page 12). As usual, the question of whether to publicize such capability gains or to withhold them is non-trivial, so differences of opinion are to be expected. The capability gains here are rather straightforward (no GPUs are even needed on the client side; everything rests on the ability to do LLM inference via API).
In particular, see Section 4.3, page 9. They synthesized a few agents on one of the ARC datasets using GPT-3.5 as the underlying LLM[1], reaching 12-14% performance. Then they substituted GPT-4 and Claude 3.5 Sonnet, and performance jumped to 30-37% and 38-49% respectively, without any further adjustments[2].
One should expect further gains when better future LLMs are substituted here (without further adjustments of the agents).
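To make the substitution point concrete, here is a minimal sketch (my own illustration, not code from the paper): a generated agent is just client-side code that reaches the LLM through a single callable, so upgrading the underlying model is a one-line change, with all inference happening behind the API.

```python
# Hypothetical sketch of the substitution pattern: the agent's only dependency
# on the LLM is a call_llm(prompt) -> str function, so a better model can be
# swapped in without touching the agent code itself.

def make_agent(call_llm):
    """Build an agent parameterized by an LLM-inference function."""
    def agent(task):
        # A real generated agent would chain several such calls
        # (e.g. propose, critique, refine); one call suffices here.
        return call_llm(f"Solve the following task:\n{task}")
    return agent

# Stub "models" standing in for API-backed inference (e.g. GPT-3.5 vs GPT-4):
weak_model = lambda prompt: "weak answer to: " + prompt.splitlines()[-1]
strong_model = lambda prompt: "strong answer to: " + prompt.splitlines()[-1]

agent = make_agent(weak_model)        # agent as synthesized and evaluated
upgraded = make_agent(strong_model)   # same agent code, better LLM substituted

print(agent("2+2?"))     # weak answer to: 2+2?
print(upgraded("2+2?"))  # strong answer to: 2+2?
```

In a real deployment the stub functions would be replaced by API calls to the respective providers; the agent code itself stays unchanged, which is exactly why the reported gains transfer without further tuning.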
The LLM used by generated agents during training and initial evaluation. The meta process controlling the generation of agents used gpt-4o-2024-05-13. ↩︎
Those who want to look more closely at the generated agents will find the conversation in https://github.com/ShengranHu/ADAS/issues/4 helpful. ↩︎
Authors: Shengran Hu, Cong Lu, Jeff Clune.
Brief summary: a proof of concept of automated LLM scaffolding design, achieving state-of-the-art performance on several tasks.
Website: https://www.shengranhu.com/ADAS/.
X/twitter summary thread: https://x.com/jeffclune/status/1825551351746867502.
Abstract: