My understanding of Auto-GPT is that it strings together many GPT-4 requests, while notably also giving the model access to memory and the internet. Empirically, this allocation of resources and looping seems promising for solving complex tasks, such as debugging the code of Auto-GPT itself. (For those interested, this paper discusses how looped transformers can serve as general-purpose computers.)
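To make the looping structure concrete, here's a minimal sketch of what an Auto-GPT-style loop could look like. Everything in it is illustrative: `call_gpt4` and `search_web` are hypothetical placeholders for a chat-completion call and a browsing tool, and the real project layers on many more tools, a long-term memory backend, and heavier prompt scaffolding.

```python
# Minimal, illustrative sketch of an Auto-GPT-style loop.
# `call_gpt4` and `search_web` are hypothetical placeholders; the real
# project uses many more tools, persistent memory, and richer prompts.
import json


def call_gpt4(messages: list[dict]) -> str:
    """Placeholder for a GPT-4 chat-completion call."""
    raise NotImplementedError


def search_web(query: str) -> str:
    """Placeholder for an internet search tool."""
    raise NotImplementedError


def run_agent(goal: str, max_steps: int = 10) -> list[str]:
    memory: list[str] = []  # results from earlier steps, fed back each loop
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"Memory so far: {memory}\n"
            'Reply with JSON: {"thought": "...", "command": "search" or "finish", "arg": "..."}'
        )
        action = json.loads(call_gpt4([{"role": "user", "content": prompt}]))
        if action["command"] == "finish":
            break
        if action["command"] == "search":
            memory.append(search_web(action["arg"]))  # loop the result back in
    return memory
```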
But to my ears, that just sounds like an update of the form “GPT can do many tasks well”, not one of the form “aligned oversight is tractable”. Put another way, Auto-GPT sounds like evidence about capabilities, not evidence about the ease of scalable oversight. The question of whether human values can be propagated up through increasingly amplified models seems separate from the ability to recursively self-improve, in the same way that capabilities progress is distinct from alignment progress.
Tho as a counterpoint, maybe Auto-GPT presents an opportunity to empirically test the IDA proposal? To run a decent experiment, you would need a good alignment metric (does one exist?) and you would need to demonstrate that, as you implement IDA using Auto-GPT, that metric is at least maintained even as capabilities improve in the newer models.
I'm overall skeptical of my particular proposal, however, because 1. I'm not aware of any well-rounded "alignment" metrics, and 2. you'd need to be confident that you can scale it up without losing control (because if t...
To clarify, I'm not taking a stance here on whether IDA should be central to alignment; I'm simply claiming that unless your crux for IDA being a good alignment strategy is “whether or not recursive improvement is easy to do”, your assessment of IDA should probably stay largely unchanged.
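To make the shape of the experiment I floated above concrete, here's a rough sketch in code. All four helpers are hypothetical stand-ins; in particular, `alignment_score` and `capability_score` stand in for metrics that, as I said, don't really exist yet, and `amplify`/`distill` gloss over all the hard implementation details:

```python
# Rough sketch of the proposed IDA-on-Auto-GPT experiment.
# All four helpers are hypothetical stand-ins; the alignment metric in
# particular is the missing piece that makes this hard to run today.


def amplify(model):
    """Wrap the model in an Auto-GPT-style decomposition loop (amplification)."""
    raise NotImplementedError


def distill(amplified_system):
    """Train a new model to imitate the amplified system (distillation)."""
    raise NotImplementedError


def alignment_score(model) -> float:
    """Hypothetical alignment metric (the unsolved part)."""
    raise NotImplementedError


def capability_score(model) -> float:
    """Hypothetical capability benchmark."""
    raise NotImplementedError


def run_ida_experiment(base_model, rounds: int = 3):
    model = base_model
    baseline = alignment_score(model)
    for i in range(rounds):
        model = distill(amplify(model))
        # The claim under test: alignment stays at or above baseline
        # even as capability rises with each round.
        print(
            f"round {i}: alignment={alignment_score(model):.3f} "
            f"(baseline {baseline:.3f}), capability={capability_score(model):.3f}"
        )
    return model
```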
Maybe.[1]
Language models are impressive, and it's definitely worth being aware that you could try to do amplification with language models and something like chain-of-thought prompting or AutoGPT's task-breakdown prompts. Still, I think the typical IDA architecture is too prone to essentially training the model to hack itself. Heck, I'm worried that if you arranged humans in an IDA architecture, the humans would effectively “hack themselves.”
But given the suitability of language models for things even sorta like IDA, I agree you're right to bring this up, and maybe there's something clever nearby that we should be searching for.
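For concreteness, one round of the kind of amplification-with-a-language-model being gestured at might look roughly like the recursive decomposition below. `ask_model` is a hypothetical stand-in for a single language-model call, and none of this reflects how any existing IDA implementation actually works:

```python
# Illustrative sketch of amplification via question decomposition
# (roughly HCH-flavored). `ask_model` is a hypothetical stand-in for a
# single language-model call; a real setup would need careful prompting,
# compute budgets, and, per the worry above, some defense against the
# system gaming its own decompositions.


def ask_model(prompt: str) -> str:
    """Placeholder for one language-model call."""
    raise NotImplementedError


def amplified_answer(question: str, depth: int = 2) -> str:
    if depth == 0:
        return ask_model(f"Answer directly: {question}")
    # Ask the model to break the question into simpler subquestions.
    subquestions = ask_model(
        f"List up to 3 subquestions, one per line, that would help answer: {question}"
    ).splitlines()
    # Answer each subquestion with a shallower (cheaper) amplified call.
    subanswers = [amplified_answer(q, depth - 1) for q in subquestions if q.strip()]
    # Combine the subanswers into a final answer.
    return ask_model(
        f"Question: {question}\nSubanswers: {subanswers}\nGive a final answer."
    )
```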
Given the rate of progress in AutoGPT-like approaches, should we reconsider Paul Christiano's Iterated Distillation and Amplification (IDA) agenda as potentially central to the alignment of transformative ML systems?
For context on IDA and AutoGPT: