Eliezer frequently claims that AI cannot "do our alignment homework for us". OpenAI disagrees and is pursuing Superalignment as their main alignment strategy.
Who is correct?
(That said, a reasonable chance of success might look like 90%, which isn't amazing. And this will depend on your probability of various issues like scheming.)
I think an important crux here is whether you think that we can build institutions which are reasonably good at checking the quality of AI safety work done by humans (at the point where we have powerful AIs). I think this level of checking should be reasonably doable, though could go poorly (see various fields).
However, note that if you think we would fail to sufficiently check human AI safety work even given substantial time, we would also fail to solve various issues given a substantial pause (as Eliezer thinks is likely the case).
It seems pretty relevant to note that we haven't found an easy way of writing software[1] or making hardware that, once written, can be easily evaluated to be bug-free and to work as intended (see the underhanded C contest or cve-rs). The artifacts that AI systems will produce will be of similar or higher complexity.
I think this puts a serious dent into "evaluation is easier than generation"—sure, easier, but how much easier? In practice we can also solve a lot of pretty big SAT instances.
Unless we get AI systems to write formal verifications for the cod
I think an important crux here is whether you think that we can build institutions which are reasonably good at checking the quality of AI safety work done by humans
Why is this an important crux? Is it necessarily the case that if we can reliably check AI safety work done by humans, we can also reliably check AI safety work done by AIs which may be optimising against us?
However, note that if you think we would fail to sufficiently check human AI safety work given substantial time, we would also fail to solve various issues given a substantial pause
This does not seem automatic to me (at least in the hypothetical scenario where "pause" takes a couple of decades). The reasoning being that there is a difference between [automate a current form of an institution, and speed-run 50 years of it in a month] and [an institution, as it develops over 50 years].
For example, my crux[1] is that current institutions do not subscribe ...
Assuming that there is an "alignment homework" to be done, I am tempted to answer something like: AI can do our homework for us, but only if we are already in a position where we could solve that homework even without AI.
An important disclaimer is that perhaps there is no "alignment homework" that needs to get done ("alignment by default", "AGI being impossible", etc). So some people might be optimistic about Superalignment, but for reasons that seem orthogonal to this question - namely, because they think that the homework to be done isn't particularly difficult in the first place.
For example, suppose OpenAI can use AI to automate many research tasks that they already know how to do. Or they can use it to scale up the amount of research they produce. Etc. But this is likely to only give them the kinds of results that they could come up with themselves (except possibly much faster, which I acknowledge matters).
However, suppose that the solution to making AI go well lies outside of the ML paradigm. Then OpenAI's "superalignment" approach would need to naturally generate solutions outside of this new paradigm. Or it would need to cause the org to pivot to a new paradigm. Or it would need to convince OpenAI that way more research is needed, and they need to stop AI progress until that happens.
And my point here is not to argue that this won't happen. Rather, I am suggesting that whether this would happen seems strongly connected to whether OpenAI would be able to do these things even prior to all the automation. (IE, this depends on things like: Will people think to look into a particular problem? Will people be able to evaluate the quality of alignment proposals? Is the organisational structure set up such that warning signs will be taken seriously?)
To put it in a different way:
Finally, suppose you think that the problem with "making AI go well" is the relative speeds of progress in AI capabilities vs AI alignment. Then you need to additionally explain why the AI will do our alignment homework for us while simultaneously refraining from helping with the capabilities homework.[2]
A relevant intuition pump: The usefulness of forecasting questions on prediction markets seems limited by your ability to specify the resolution criteria.
The reasonable default assumption might be that AI will speed up capabilities and alignment equally. In contrast, arguing for disproportional speedup of alignment sounds like corporate b...cheap talk. However, there might be reasons to believe that AI will disproportionally speed up capabilities - for example, because we know how to evaluate capabilities research, while the field of "make AI go well" is much less mature.
Then you need to additionally explain why the AI will do our alignment homework for us while simultaneously refraining from helping with the capabilities homework.
I think there is an important distinction between "If given substantial investment, would the plan to use the AIs to do alignment research work?" and "Will it work in practice given realistic investment?".
The cost of the approach where the AIs do alignment research might look like 2 years of delay in median worlds and perhaps considerably more delay with some probability.
This is a substantial cost, but it's not an insanely high cost.
Trivially, any AI smart enough to be truly dangerous is capable of doing our "alignment homework" for us, in the sense of having enough intelligence to solve the problem. This is something EY has also pointed out many times, but which often gets ignored. Any ASI that destroys humanity will have no problem whatsoever understanding that that's not what humanity wanted, and no difficulty figuring out what things we would have wanted it to do instead.
What is very different and less clear of a claim is whether we can use any AI developed with sufficient capabilities, but built before the "homework" was done, to do so safely (for likely/plausible definitions of "we" and "use").
Trivially, any AI smart enough to be truly dangerous is capable of doing our "alignment homework" for us, in the sense of having enough intelligence to solve the problem.
Is this trivial? People at least argue that the capability profile could be sufficiently unfortunate such that AIs are extremely dangerous prior to being extremely useful. (As a particularly extreme case, people often argue that AIs will be qualitatively wildly superhuman in dangerous skills (e.g. persuasion) prior to being merely qualitatively human level at doing AI safety research. ...
Petition to change the title of this post to "Can we get AIs to do our alignment homework for us?"
My mental model is that there is an entire space of possible AIs, each with some capability level and alignability level. Given the state of the alignment field, there is some alignability ceiling, below which we can reliably align AIs. Right now, this ceiling is very low, but we can push it higher over time.
At some capability level, the AI is powerful enough to solve alignment of a more capable AI, which can then solve alignment for even more capable AI, etc all the way up. However, even the most alignable AI capable of this is still potentially very hard to align. There will of course be more alignable and less capable AIs too, but they will not be capable enough to actually kick off this bucket chain.
Then the key question is whether there will exist an AI that is both alignable and capable enough to start the bucket chain. This is a function of both (a) the shape of the space of AIs (how quickly do models become unalignable as they become more capable?) and (b) how good we become at solving alignment. Opinions differ on this - my personal opinion is that probably this first AI is pretty hard to align, so we're pretty screwed, though it's still worth a try.
I wish you wouldn't use the term "align" if you actually just mean "safely use", or that you would make it clear that we don't necessarily need alignment, e.g. because we could apply something like control (perhaps combined with paying AIs for their labor like normal employees).
Sorry for the word policing.
I'm no longer sure the question makes sense, and to the extent it makes sense I'm pessimistic. Things probably won't look like one AI taking over everything, but more like an AI economy that's misaligned as a whole, gradually eclipsing the human economy. We're already seeing the first steps: the internet is filling up with AI generated crap, jobs are being lost to AI, and AI companies aren't doing anything to mitigate either of these things. This looks like a plausible picture of the future: as the AI economy grows, the money-hungry part of it will continue being stronger than the human-aligned part. So it's only a matter of time before most humans are outbid / manipulated out of most resources by AIs playing the game of money with each other.
I wrote a blog post on whether AI alignment can be automated last year. The key takeaways:
There's no clear answer to this question because it depends on your definition of "AI alignment" work. Some AI alignment work is already automated today such as generating datasets for evals, RL from AI feedback, and simple coding work. On the other hand, there are probably some AI alignment tasks that are AGI-complete such as deep, cross-domain, and highly creative alignment work.
The idea of the bootstrapping strategy is that as the automated alignment researcher is made more capable, it improves its own alignment strategies which enables further capability and alignment capabilities and so on. So hopefully there is a virtuous feedback loop over time where more and more alignment tasks are automated.
However, this strategy relies on a robust feedback loop which could break down if the AI is deceptive, incorrigible, or undergoes recursive self-improvement and I think these risks increase with higher levels of capability.
I can't find the source but I remember reading somewhere on the MIRI website that MIRI aims to do work that can't easily be automated so Eliezer's pessimism makes sense in light of that information.
Further reading:
I am not so sure it will be possible to extract useful work towards solving alignment out of systems we do not already know how to carefully steer. I think that substantial progress on alignment is necessary before we know how to build things that actually want to help us advance the science. Even if we built something tomorrow that was in principle smart enough to do good alignment research, I am concerned we don’t know how to make it actually do that rather than, say, imitate more plausible-sounding but incorrect ideas. The fact that appending silly phrases like “I’ll tip $200” improves the probability of receiving correct code from current LLMs indicates to me that we haven’t succeeded at aligning them to maximally want to produce correct code when they are capable of doing so.
As one scales up a system, any small misalignment within that system will become more apparent and more skewed. I use shooting an arrow as an example. Say you shoot an arrow at a target from only a few feet away. If you are only a few degrees off from being lined up with the bullseye, your arrow will still land very close to the bullseye. However, if you shoot at a target many yards away with the same angular error, your arrow will land much, much farther from the bullseye.
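To put rough numbers on the arrow picture, here is a minimal sketch. The straight-line flight, the 2 degree error, and the 3 ft and 270 ft ranges are all assumptions chosen for illustration; nothing here is a claim about how alignment error actually scales.

```python
import math

def miss_distance(range_to_target: float, angular_error_deg: float) -> float:
    """Lateral miss for a straight-line shot (hypothetical model, ignores gravity and drag).

    The miss grows linearly with range for a fixed angular error:
    miss = range * tan(error).
    """
    return range_to_target * math.tan(math.radians(angular_error_deg))

# Assumed numbers: the same 2 degree aiming error at two ranges, in feet.
near = miss_distance(3.0, 2.0)    # about 0.10 ft off the bullseye
far = miss_distance(270.0, 2.0)   # about 9.43 ft off the bullseye

print(f"near miss: {near:.2f} ft, far miss: {far:.2f} ft, ratio: {far / near:.0f}x")
```

Under this toy model the miss scales exactly with distance, so a 90x longer shot gives a 90x larger miss. Whether capability scaling behaves anything like "distance" here is the analogy's assumption, not something the arithmetic shows.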
So if you get a less powerful AI aligned with your goals to a degree where everything looks fine, and then assign it the task of aligning a much more powerful AI, then any small flaw in the alignment of the less powerful AI will go askew far worse in the more powerful AI. What's worse, since you assigned the less powerful AI the task of aligning the larger AI, you won't be able to see exactly what the flaw was until it's too late, because if you'd been able to see the flaw, you would have aligned the larger AI yourself.
My intuition is that it is at least feasible to align a human level intelligence with the "obvious" methods that fail for superintelligence, and have them run faster to produce superhuman output.
Second, it is also possible to robustly verify the outputs of a superhuman intelligence without superhuman intelligence.
And third, there is a lot of value to be captured from narrow AIs that don't have deceptive capabilities but are very good at, say, solving math.
Second, it is also possible to robustly verify the outputs of a superhuman intelligence without superhuman intelligence.
Why do you believe that a superhuman intelligence wouldn't be able to deceive you by producing outputs that look correct instead of outputs that are correct?
I don’t have the specifics but this is just a natural tendency of many problems - verification is easier than coming up with the solution. Also maybe there are systems where we can require the output to be mathematically verified or reject solutions whose outcomes are hard to understand.
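As a toy illustration of the "verification is easier than generation" intuition, here is a minimal sketch using Boolean satisfiability. SAT is my choice of example, and the formula and function names are made up for illustration. Checking a proposed assignment takes one pass over the clauses, while the naive search may try every one of the 2^n assignments.

```python
from itertools import product

# A CNF formula as a list of clauses; literal k means variable k, -k means NOT k.
# This particular formula is an arbitrary example.
formula = [[1, -2], [2, 3], [-1, -3]]

def verify(formula, assignment):
    """Check a candidate assignment: one pass over the clauses (cheap)."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )

def brute_force_solve(formula, num_vars):
    """Naive generation: try up to 2**num_vars assignments (expensive)."""
    for bits in product([False, True], repeat=num_vars):
        assignment = {i + 1: bits[i] for i in range(num_vars)}
        if verify(formula, assignment):
            return assignment
    return None

solution = brute_force_solve(formula, 3)
print(solution)                     # a satisfying assignment found by search
print(verify(formula, solution))    # the cheap check accepts it
print(verify(formula, {1: True, 2: True, 3: True}))  # a wrong answer is rejected
```

The gap only helps when the checker really captures what we want. Here `verify` is the full specification, which is exactly the property that is in doubt for outputs that merely look correct.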
I for one would find it helpful if you included a link to at least one place that Eliezer had made this claim just so we can be sure we're on the same page.
Roughly speaking, what I have in mind is that there are at least two possible claims. One is that 'we can't get AI to do our alignment homework' because by the time we have a very powerful AI that can solve alignment homework, it is already too dangerous to use the fact it can solve the homework as a safety plan. And the other is the claim that there's some sort of 'intrinsic' reason why an AI built by humans could never solve alignment homework.