One can learn a lot from this paper. A few observations follow.
These two papers are clearly part of Jeff Clune's paradigm of "AI-generating algorithms", https://arxiv.org/abs/1905.10985 (currently 123 citations on Google Scholar, though a number of its derivative works have higher citation counts).
Safety concerns were raised in the referenced Twitter thread and are also discussed in the paper (Section 6, page 12). As usual, the question of whether to publicize such capability gains or to withhold them is non-trivial, so differences of opinion are to be expected. The capability gains here are rather straightforward (no GPUs are even needed on the client side; everything rests on the ability to do LLM inference via API).
In particular, see Section 4.3, page 9. They synthesized a few agents on one of the ARC datasets using GPT-3.5 as the underlying LLM[1], reaching 12-14% performance. Then they substituted GPT-4 and Claude 3.5 Sonnet, and performance jumped to 30-37% and 38-49% respectively, without any further adjustments[2].
One should expect further gains when better future LLMs are substituted here (without further adjustments of the agents).
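To make the substitution point concrete, here is a minimal sketch (my own illustration, not code from the paper): a generated agent is just client-side code that reaches the LLM through a single callable, so upgrading the underlying model is a one-line change, with all inference happening behind the API.

```python
# Hypothetical sketch of the substitution pattern: the agent's only dependency
# on the LLM is a call_llm(prompt) -> str function, so a better model can be
# swapped in without touching the agent code itself.

def make_agent(call_llm):
    """Build an agent parameterized by an LLM-inference function."""
    def agent(task):
        # A real generated agent would chain several such calls
        # (e.g. propose, critique, refine); one call suffices here.
        return call_llm(f"Solve the following task:\n{task}")
    return agent

# Stub "models" standing in for API-backed inference (e.g. GPT-3.5 vs GPT-4):
weak_model = lambda prompt: "weak answer to: " + prompt.splitlines()[-1]
strong_model = lambda prompt: "strong answer to: " + prompt.splitlines()[-1]

agent = make_agent(weak_model)        # agent as synthesized and evaluated
upgraded = make_agent(strong_model)   # same agent code, better LLM substituted

print(agent("2+2?"))     # weak answer to: 2+2?
print(upgraded("2+2?"))  # strong answer to: 2+2?
```

In a real deployment the stub functions would be replaced by API calls to the respective providers; the agent code itself stays unchanged, which is exactly why the reported gains transfer without further tuning.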
The LLM used by generated agents during training and initial evaluation. The meta process controlling the generation of agents used gpt-4o-2024-05-13. ↩︎
Those who want to look more closely at the generated agents will find the conversation in https://github.com/ShengranHu/ADAS/issues/4 helpful. ↩︎
Authors: Shengran Hu, Cong Lu, Jeff Clune.
Brief summary: a proof of concept of automated LLM scaffolding design, achieving state-of-the-art performance on several tasks.
Website: https://www.shengranhu.com/ADAS/.
X/twitter summary thread: https://x.com/jeffclune/status/1825551351746867502.
Abstract: