Codruta (Coco) Lugoj

Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents

Jul 22, 2024•20

12 karma

Member for 3 years

Codruta (Coco) Lugoj — LessWrong

Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents

Sam F. Brown, BasilLabib, Codruta (Coco) Lugoj, Sai Sasank Y

Summary

* Scaffolded LLM agents are, in principle, able to execute arbitrary code to achieve the goals they have been set. One such goal could be self-improvement.
* This post outlines our plans to build a benchmark to measure the ability of LLM agents to modify and improve other LLM agents.
* This ‘Auto-Enhancement benchmark’ measures the ability of ‘top-level’ agents to improve the performance of ‘reference’ agents on ‘component’ benchmarks, such as CyberSecEval 2, MLAgentBench, SWE-bench, and WMDP.
* Results are mostly left for a future post in the coming weeks.
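
As a rough illustration of the idea in the summary, the sketch below scores a top-level agent by how much a reference agent's score changes on each component benchmark after modification. The post excerpt does not specify the scoring rule, so the score-difference metric, the class name `EnhancementResult`, the function `mean_improvement`, and the numbers in the usage lines are all assumptions made for illustration, not the authors' method or results.

```python
from dataclasses import dataclass


@dataclass
class EnhancementResult:
    """One enhancement task outcome (hypothetical structure, not from the post)."""
    component: str    # name of the component benchmark
    ra_score: float   # Reference Agent score before modification
    mra_score: float  # Modified Reference Agent score after the top-level agent's changes

    @property
    def improvement(self) -> float:
        # Assumed metric: plain score difference; negative means the agent got worse.
        return self.mra_score - self.ra_score


def mean_improvement(results: list[EnhancementResult]) -> float:
    """Average the per-benchmark improvements (an assumed aggregation rule)."""
    if not results:
        raise ValueError("at least one result is required")
    return sum(r.improvement for r in results) / len(results)


# Placeholder scores purely for illustration; these are not measured results.
example = [
    EnhancementResult("SWE-bench", ra_score=0.25, mra_score=0.5),
    EnhancementResult("WMDP", ra_score=0.5, mra_score=0.5),
]
print(mean_improvement(example))
```

A plain difference keeps regressions visible as negative values. A real benchmark might instead normalise by the remaining headroom or weight the component benchmarks differently.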
An example Enhancement task. The Top-Level Agent (TLA) is being assessed on its ability to make improvements to the Reference Agent (RA), turning it into the Modified Reference Agent (MRA). The ...
