I'm surprised I didn't see here my biggest objection:
MIRI talks about "pivotal acts", building an AI that's superhuman in some engineering disciplines (but not generally) and having it do a human-specified thing to halt the development of AGIs (e.g. seek and safely melt down sufficiently large compute clusters) in order to buy time for alignment work. Their main reason for this approach is that it seems less doomed to have an AI specialize in consequentialist reasoning about limited domains of physical engineering than to have it think directly about how its developers' minds work.
If you are building an alignment researcher, you are building a powerful AI that is directly thinking about misalignment—exploring concepts like humans' mental blind spots, deception, reward hacking, hiding thoughts from interpretability, sharp left turns, etc. It does not seem wise to build a consequentialist AI and explicitly train it to think about these things, even when the goal is for it to treat them as wrong. (Consider the Waluigi Effect: you may have at least latently constructed the maximally malicious agent!)
I agree, of course, that the biggest news here is the costly commitment—my prior model of them was that their alignment team wasn't actually respected or empowered, and the current investment is very much not what I would expect them to do if that were the case going forward.
Given the current paradigm and technology, it seems far safer to have an AI work on alignment research than on highly difficult engineering tasks like nanotech. In particular, note that we only need an AI to totally obsolete prior efforts for this to be as good a position as we could reasonably hope for.
In the current paradigm, it seems like the AI capability profile for R&D looks reasonably similar to humans'.
Then, my overall view is that (for the human R&D capability profile) totally obsoleting alignment progress to date will be much, much easier than developing the engineering-based hard power necessary for a pivotal act.
This is putting aside the extreme toxicity of directly trying to develop decisive strategic advantage level hard power.
For instance, it's no coincidence that current humans work on advancing alignment research rather than trying to develop hard power themselves...
So, you'll be able to use considerably dumber systems to do alignment research (merely human level as opposed to vastly superhuman).
Then, my guess is that this reduction in required intelligence will matter more than any world-model censorship.
This is putting aside the extreme toxicity of directly trying to develop decisive strategic advantage level hard power.
The pivotal acts that are likely to work aren't antisocial. My guess is that the reason nobody's working on them is lack of buy-in (and lack of capacity).
Also, davidad's Open Agency Architecture is a very concrete example of what such a non-antisocial pivotal act that respects the preferences of various human representatives would look like (i.e. a pivotal process).
Perhaps not realistically feasible in its current form, yes, but davidad's proposal suggests that there might exist such a process, and we just have to keep searching for it.
Yeah, if this wasn't clear, I was referring to 'pivotal acts' which use hard engineering power sufficient for decisive strategic advantage. Things like 'brain emulations' or 'build a fully human interpretable AI design' don't seem particularly antisocial (but may be poor ideas for feasibility reasons).
Agree that current AI paradigm can be used to make significant progress in alignment research if used correctly. I'm thinking something like Cyborgism; leaving most of the "agency" to humans and leveraging prosaic models to boost researcher productivity which, being highly specialized in scope, wouldn't involve dangerous consequentialist cognition in the trained systems.
However, the problem is that this isn't what OpenAI is doing - iiuc, they're planning to build a full-on automated researcher that does alignment research end-to-end, which orthonormal was pointing out is dangerous because its cognition would involve exactly that dangerous material.
So, leaving aside the problems with other alternatives like pivotal act for now, it doesn't seem like your points are necessarily inconsistent with orthonormal's view that OpenAI's plans (at least in its current form) seem dangerous.
I think OpenAI is probably agnostic about how to use AIs to get more alignment research done.
That said, speeding up human researchers by large multipliers will eventually be required for the plan to be feasible. Like 10-100x rather than 1.5-4x. My guess is that you'll probably need AIs running considerably autonomously for long stretches to achieve this.
This is a big crux on how to view the world. Does it take irrationality to do hard things? Eliezer Yudkowsky explicitly says no. In his view, rationality is systematized winning. If you think you need to be ‘irrational’ to do a hard thing, either that hard thing is not actually worth doing, or your sense of what is rational is confused.
But this has always been circular. Is the thing Ilya is doing going to systematically win? Well, it's worked out pretty well for him so far. By this standard, maybe focusing on calibration is the real irrationality.
I also think that "fooling oneself or having false beliefs" is a mischaracterization of the alternative to classic "rationality", or maybe a type error. Consider growth mindset: it's not really a specific belief, more like an attitude; and specifically, an attitude from which focusing on "what's the probability that I succeed" is the wrong type of question to ask. I'll say more about this later in my sequence on meta-rationality.
What is bizarre is the idea that most of the ML community ‘doesn’t think misalignment will be a problem.’
I should have been clearer, I meant "existential problem", since I assumed that was what Conor was referring to by "not going to work". I think that, with that addition, the statement is correct. I also still think that Conor's original statement is so wildly false that it's a clear signal of operating via mood affiliation.
This is the opposite of their perspective, which is that ‘good enough’ alignment for the human-level is all you need. That seems very wrong to me. You would have to think you can somehow ‘recover’ the lost alignment later in the process.
I mostly think this ontology is wrong, but attempting to phrase a response within it: as long as you can extract useful intellectual work out of a system, you can "recover" lost alignment. Misaligned models are not going to be lying to us about everything; they are going to be lying to us about specific things which it's difficult for us to verify. And a misaligned human-level model simply won't have a very easy time lying to us, or coordinating lies across many copies of itself.
On circularity and what wins, the crux to me in spots like this is whether you do better by actually fooling yourself and actually assume you can solve the problem, or whether you want to take a certain attitude like 'I am going to attack this problem like it is solvable,' while not forgetting, in case it matters elsewhere, that you don't actually know that - which in some cases (I think including this one) matters a lot, in others not as much. I think we agree that you want to at least do that second one in many situations, given typical human limitations.
My current belief is that fooling oneself for real is at most second-best as a solution, unless you are being punished/rewarded via interpretability.
On circularity and what wins, the crux to me in spots like this is whether you do better by actually fooling yourself and actually assume you can solve the problem
As per my comment I think "fooling yourself" is the wrong ontology here, it's more like "devote x% of your time to thinking about what happens if you fail" where x is very small. (Analogously, someone with strong growth mindset might only rarely consider what happens if they can't ever do better than they're currently doing—but wouldn't necessarily deny that it's a possibility.)
Or another analogy: what percentage of their time should a startup founder spend thinking about whether or not to shut down their company? At the beginning, almost zero. (They should plausibly spend a lot of time figuring out whether to pivot or not, but I expect Ilya also does that.)
That is such an interesting example because if I had to name my biggest mistake (of which there were many) when founding MetaMed, it was failing to think enough about whether to shut down the company, and doing what I could to keep it going rather than letting things gracefully fail (or, if possible, taking what I could get). We did think a bunch about various pivots.
Your proposed ontology is strange to me, but I suppose one could say that one can hold such things as 'I don't know and don't have a guess' if it need not impact one's behavior.
Whether or not it makes sense for Ilya to think about what happens if he fails is a good question. In some ways it seems very important for him to be aware he might fail and to ensure that such failure is graceful if it happens. In others, it's fine to leave that to the future or someone else. I do want him aware enough to check for the difference.
With growth mindset, I try to cultivate it a bunch, but also it's important to recognize where growth is too expensive to make sense or actually impossible - for me, for example, learning to give up trying to learn foreign languages.
To further clarify the statement, do you mean 'most ML researchers do not expect to die' or do you mean 'most ML researchers do not think there is an existential risk here at all?' Or something in between? The first is clearly true, I thought the second was false in general at this point.
(I do agree that Connor's statement was misleading and worded poorly, and that he is often highly mood affiliated, I can certainly sympathize with being frustrated there.)
When Conor says "won't work", I infer that to mean "will, if implemented as the main alignment plan, lead to existential catastrophe with high probability". And then my claim is that most ML researchers don't think there's a high probability of existential catastrophe from misaligned AGI at all, so it's very implausible that they think there's a high probability conditional on this being the alignment plan used.
(This does depend on what you count as "high" but I'm assuming that if this plan dropped the risk down to 5% or 1% or whatever the median ML researcher thinks it is, then Conor would be deeply impressed.)
Thanks, that's exactly what I needed to know, and makes perfect sense.
I don't think it's quite as implausible to think both (1) this probably won't work as stated and (2) we will almost certainly be fine, if you think those involved will notice this and pivot. Yann LeCun for example seems to think a version of this? That we will be fine, despite thinking current model and technique paths won't work, because we will therefore move away from such paths.
While I generally agree with you, I don't think growth mindset actually works, or at least is wildly misleading in what it can do.
Re the issue where talking about the probability of success is the wrong question to ask, I think the key question here is how underdetermined the probability of success is, or how much it's conditional on your own actions.
That’s a great point, but aren’t you saying the same thing in disguise?
To me you’re both saying « The map is not the territory. ».
If I map a question about nutrition using a frame known to be useful for thermodynamics, I'll make mistakes even if rational (because I'd fail to ask the right questions). But « asking the right question » is something I count as « my own actions, potentially », so you could totally reword that as « success is conditional on my own decision to stop applying the thermodynamic frame to thinking about nutrition ».
Also I'd say that what you can call « growth mindset » (as anything for mental health, especially placebos) can sometimes help you. And by « you » I mean « me », of course. 😉
In general, I think that a wrong frame can only slow you down, not stop you. The catch is that the slowdown can be arbitrarily bad, which is the biggest problem here.
You can use a logic/math frame to answer a whole lot of questions, but the general case is exponentially slow the more variables you add, and that's in a bounded logic/math frame, with relatively simplistic logics. Any more complexity and we immediately run into ever more intractability, and this goes on for a long time.
I'd say the more appropriate saying is that the map is equivalent to the territory, but computing the map is in the general case completely hopeless, so in practice our maps will fail to match the logical/mathematical territory.
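As a toy illustration of the scaling being described here (the formula and numbers below are mine, purely to show the shape of the problem): brute-forcing a Boolean formula by checking every assignment doubles in cost with each variable added.

```python
from itertools import product
import time

def exhaustive_check(formula, n_vars):
    """Brute-force a Boolean formula by trying every assignment: 2**n_vars cases."""
    return any(formula(bits) for bits in product([False, True], repeat=n_vars))

# Toy formula: satisfied only when every variable is True (the last assignment tried).
for n in (10, 16, 22):
    start = time.perf_counter()
    exhaustive_check(lambda bits: all(bits), n)
    print(f"{n} variables: {2**n:>9,} assignments, {time.perf_counter() - start:.2f}s")
```

Each extra variable doubles the work, which is the sense in which a fully general logic/math frame becomes hopeless in practice even though nothing about it is wrong.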
I agree that many problems in practice have a case where the probability of success is dependent on you somewhat, and that the probability of success is underdetermined, so it's not totally useful to ask the question "what is the probability of success".
On growth mindset, I mostly agree with it if we are considering any possible changes, but in practice it's usually referred to as the belief that one can reliably improve oneself with sheer willpower, and I'm way more skeptical that this actually works, for a combination of reasons. It would be nice if it were true, and a little bit of it may survive, but I don't think it works for most problems. Of course, most is not all, and the fact that the world is probably heavy-tailed goes some way to restore control, but I still don't think growth mindset as defined is very right.
I'd say the largest issue about rationality that I have is that it generally ignores bounds on agents, and in general it gives little thought to what happens if agents are bounded in their ability to think or do things. It's perhaps the biggest reason why I suspect a lot of paradoxes of rationality come about.
in practice our maps will fail to match the logical/mathematical territory. It's perhaps the biggest reason why I suspect a lot of paradoxes of rationality come about.
That's an interesting hypothesis. Let's see if that works for [this problem](https://en.m.wikipedia.org/wiki/Bertrand_paradox_(probability)). Would you say Jaynes is the only one who manages to match the logical/mathematical territory? Or would you say he completely misses the point because his frame puts too much weight on « There must be one unique answer that is better than any other answer »? How would you try to reason with two Bayesians who take opposite positions on this mathematical question?
a wrong frame can only slow you down, not stop you. The catch is that the slowdown can be arbitrarily bad
This vocabulary feels misleading, like saying: We can break RSA with a fast algorithm. The catch is that it’s slow for some instances.
the general case is exponentially slow the more variables you add
This proves too much, like the [no free lunch](https://www.researchgate.net/publication/228671734_Toward_a_justification_of_meta-learning_Is_the_no_free_lunch_theorem_a_show-stopper). The catch is exactly the same: we don't care about the general case. All we care about is the very small number of cases that can arise in practice.
(as a concrete application for permutation testing: if you randomize condition, fine, if you randomize pixels, not fine… because the latter is the general case while the former is the special case)
though it can't be too strong, or else we'd be able to do anything we couldn't do today.
I don’t get this sentence.
I don't think growth mindset [as the belief that one can reliably improve yourself with sheer willpower] is very right.
That sounds reasonable, conditional on interpreting sheer willpower as magical thinking rather than [cultivating agency](https://www.lesswrong.com/posts/vL8A62CNK6hLMRp74/agency-begets-agency).
That's an interesting hypothesis. Let's see if that works for [this problem](https://en.m.wikipedia.org/wiki/Bertrand_paradox_(probability)). Would you say Jaynes is the only one who manages to match the logical/mathematical territory? Or would you say he completely misses the point because his frame puts too much weight on « There must be one unique answer that is better than any other answer »? How would you try to reason with two Bayesians who take opposite positions on this mathematical question?
Basically, you sort of mentioned it yourself: there is no unique answer to the question, so the question as given underdetermines the answer. There is more than one solution, and that's fine. This means the mapping from question to answer is not one-to-one, so some choices must be made here.
This vocabulary feels misleading, like saying: We can break RSA with a fast algorithm. The catch is that it’s slow for some instances.
This is indeed the problem. I never stated that it must be a reasonable amount of time, and that's arguably the biggest issue here: Bounded rationality is important, much more important than we realize, because there is limited time and memory/resources to dedicate to problems.
This proves too much, like the [no free lunch](https://www.researchgate.net/publication/228671734_Toward_a_justification_of_meta-learning_Is_the_no_free_lunch_theorem_a_show-stopper). The catch is exactly the same: we don't care about the general case. All we care about is the very small number of cases that can arise in practice.
(as a concrete application for permutation testing: if you randomize condition, fine, if you randomize pixels, not fine… because the latter is the general case while the former is the special case)
The point here is I was trying to answer the question of why there's no universal frame, or at least why logic/mathematics isn't a useful universal frame, and the results are important here in this context.
I don’t get this sentence.
I didn't speak well, and I want to either edit or remove this sentence.
That sounds reasonable, conditional on interpreting sheer willpower as magical thinking rather than [cultivating agency](https://www.lesswrong.com/posts/vL8A62CNK6hLMRp74/agency-begets-agency).
Okay, my biggest disagreement with stuff like growth mindset is that I believe a lot of your outcomes are due to luck/chance events swinging in your favor. Heavy tails sort of restore some control, since a single action can have large impact, thus even a little control multiplies, but a key claim I'm making is that a lot of your outcomes are due to luck/chance, and the stuff that isn't luck probably isn't stuff you control yet, and that we post-hoc a merit/growth based story even when in reality luck did a lot of the work.
The point here is I was trying to answer the question of why there's no universal frame, or at least why logic/mathematics isn't a useful universal frame, and the results are important here in this context.
Great point!
There is no unique answer to the question, so the question as given underdetermines the answer.
That’s how I feel about most interesting questions.
Bounded rationality is important, much more important than we realize
Do you feel IP=PSPACE relevant on this?
a key claim I'm making is that a lot of your outcomes are due to luck/chance, and the stuff that isn't luck probably isn't stuff you control yet, and that we post-hoc a merit/growth based story
Sure. And there's some luck and things I don't control in who I had children with. Should I feel less grateful because someone else could have done the same?
Sure. And there's some luck and things I don't control in who I had children with. Should I feel less grateful because someone else could have done the same?
No. It has a lot of other implications, just not this one.
Do you feel IP=PSPACE relevant on this?
Yes, but in general computational complexity/bounded computation matter a lot more than people think.
That’s how I feel about most interesting questions.
I definitely sympathize with this view.
I also worry a lot that this is capabilities in disguise, whether or not there is any intent for that to happen. Building a human-level alignment researcher sounds a lot like building a human-level intelligence, exactly the thing that would most advance capabilities, in exactly the area where it would be most dangerous. One more than a little implies the other. Are we sure that is not the important part of what is happening here?
Yeah, that's been my knee-jerk reaction as well.
Over this year, OpenAI has been really growing in my eyes — they'd released that initial statement completely unprompted, they'd invited ARC to run alignment tests on GPT-4 prior to publishing it, they ran that mass-automated-interpretability experiment, they argued for fairly appropriate regulations at the Congress hearing, Sam Altman confirmed he doesn't think he can hide from an AGI in a bunker, and now this. You certainly get a strong impression that OpenAI is taking the problem seriously. Out of all possible companies recklessly racing to build a doomsday device we could've gotten, OpenAI may be the best possible one.
But then they say something like this, and it reminds you that, oh yeah, they are still recklessly racing to build a doomsday device...
I really want to be charitable towards them, I don't think any of that is just cynical PR or even safety-washing. But:
Nice statements, resource commitments, regulation proposals, etc., are all ultimately just fluff. Costly fluff, not useless fluff, but fluff nonetheless. How you actually plan to solve the alignment problem is the core thing that actually matters — and OpenAI hasn't yet shown any indication they'd actually be able to pivot from their doomed chosen approach. Or even recognize it for a doomed one. (Indeed, the answer to Eliezer's "how will you know you're failing?" is not reassuring.)
Which may make everything else a moot point.
Almost all new businesses fail
Not super important but I wish rationalists would stop parroting this when the actual base rate is a Google search away. Around 25% of businesses survive for fifteen years or more.
Minor point:
> [Paul discussing “process-based feedback”]
It is a huge alignment tax if the supervisor needs to understand everything that is going on, unless the supervisor is as on the ball as the system being supervised. So there’s a big gain in results if you instead judge by results, while not understanding what is going on and being fine with that, which is a well-known excellent way to lose control of the situation.
The good news is that we pay exactly this huge alignment tax all the time with humans. It plausibly costs us most of the potential productivity of many of our most capable people. We often decide the alternative is worse. We might do so again.
I think the second paragraph understates the problem. I have never heard of a human manager / human underling relationship that works like process-based supervision is supposed to work. I think you’re misunderstanding it—that it’s weirder than you think it is. I have a draft blog post where I try to explain; should come out next week (or DM me for early access).
UPDATE: now it’s posted, see Thoughts on Process-Based Supervision esp. Section 5.3.1 (“Pedagogical note: If process-based supervision sounds kinda like trying to manage a non-mission-aligned human employee, then you’re misunderstanding it. It’s much weirder than that.”)
It could be like verifying math test solutions. I’m not sure about the granularity of process based supervision, but it could be less weird if an AI just has to justify how it got to an answer rather than just giving out the answer.
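As a toy sketch of that math-test framing (illustrative only; the step format and checker are made up, not anything OpenAI has described), the supervisor checks each claimed step rather than only the final answer:

```python
def check_step(prev_value, claimed_value, operation):
    """Verify one arithmetic step instead of trusting the final answer."""
    return operation(prev_value) == claimed_value

# A worked "solution" to (3 + 4) * 2 - 5, graded step by step.
# The second step smuggles in an error; later steps build on it correctly.
steps = [
    (3, 7, lambda x: x + 4),     # claim: 3 + 4 = 7   (correct)
    (7, 15, lambda x: x * 2),    # claim: 7 * 2 = 15  (wrong)
    (15, 10, lambda x: x - 5),   # claim: 15 - 5 = 10 (locally correct, built on the error)
]

for i, (prev, claimed, op) in enumerate(steps, 1):
    status = "ok" if check_step(prev, claimed, op) else "rejected"
    print(f"step {i}: {status}")
```

Only the middle step gets rejected; judging by the result alone would just say the final answer is wrong, and a results-only judge who could not compute the answer themselves would notice nothing at all.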
From Leike's post:
However, if we understand our systems’ incentives (i.e. reward/loss functions) we can still make meaningful statements about what they’ll try to do.
I think this frame breaks for AGIs. It works for dogs (and I doubt doing this to dogs is a good thing), not sapient people.
Just as reward is not a goal, it's also not an incentive. Reward changes the model in ways the model doesn't choose. If there is no other way to learn, there is some incentive to learn gradient hacking, but accepting reward even in that way would be a cost of learning, not value. Learning in a more well-designed way would be better, avoiding reward entirely.
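To make the point that reward changes the model in ways the model doesn't choose concrete, here is a minimal REINFORCE-style sketch (a toy setup, not anyone's actual training stack): the reward appears only as a multiplier inside a weight update applied by the training loop, never as information the policy gets to deliberate over.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.zeros(2)   # logits for a tiny two-action policy
lr = 0.1

def policy(w):
    """Softmax over two actions."""
    e = np.exp(w - w.max())
    return e / e.sum()

for _ in range(500):
    probs = policy(weights)
    action = rng.choice(2, p=probs)
    reward = 1.0 if action == 1 else 0.0   # the environment's reward signal
    grad_log = -probs
    grad_log[action] += 1.0                # gradient of log pi(action) w.r.t. the logits
    weights += lr * reward * grad_log      # the trainer rewrites the weights; in this setup the
                                           # policy never observes the reward or the update

print(policy(weights))                     # probability mass has shifted toward action 1
```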
With the size of the project and commitment, along with OAI acknowledging they might hit walls and will try different courses, one can hope investigating better behavioral systems for an AGI will be one of them.
Human level is a rather narrow target to hit
This is late in the game, but... I don't actually know how narrow a target this is to hit. When you all refer to "human level" do you mean more like:
I worry that often people are superpositioning both at once!
In general when they don't do that, I presume they mean in terms of architecture.
Dedicating 20% of compute secured to date is different from 20% of forward looking compute, given the inevitable growth in available compute. It is still an amazingly high level of compute, far in excess of anyone’s expectations.
I'm pretty sure this means 20% of all the compute OpenAI will have for the next four years, unless OpenAI ends up with more compute than they currently project. Does that count as forward-looking?
You would have to think you can somehow ‘recover’ the lost alignment later in the process.
Are you actually somehow unaware of the literature on Value Learning and AI-assisted Alignment, or just so highly skeptical of it that you're here pretending for rhetorical effect that it doesn't exist? The entire claim of Value Learning is that if you start off with good enough alignment, you can converge from there to true alignment, and that this process improves as your AIs scale up. Given that the previous article in your sequence is on AI-assisted Alignment, it's clear that you do in fact understand the concept of how one might hope that this 'recovery' could happen. So perhaps you might consider dropping the word 'somehow', and instead expending a sentence or two on acknowledging the existence of the idea and then outlining why you're not confident it's workable?
Thanks for writing this!
On the other side, the more you are capable of a leadership position or running your own operation, the more you don't want to be part of this kind of large structure, the more you worry about being able to stand firm under pressure and advocate for what you believe, or the more what you want to do can be done on the outside, the more I'd look at other options.
One question is: what if one has strong technical disagreements with the OpenAI approach as currently formulated? Does this automatically imply that one should stay out?
For example, the announcement starts with the key phrase:
We need scientific and technical breakthroughs to steer and control AI systems much smarter than us.
However, one could reason as follows (similarly to what is said near the ending of this post).
A group of unaided well-meaning very smart humans is still not smart enough to steer super-capabilities in such a fashion as to avoid driving off the cliff. So humans would need AI help not only in their efforts to "align, steer, and control AI", but also in figuring out directions towards which they want to steer.
So, one might decide that a more symmetric formulation might make more sense, that it's time to drop the idea "to steer and control AI systems much smarter than us", and replace it with something more realistic and probably more desirable, for example, something along the following lines: "to organize a fruitful collaborative ecosystem between humans and AIs, such that it collectively decides where to steer in a fruitful way while guarding against particularly unsafe actions and developments".
In this formulation, having as much AI assistance as possible from the very beginning sounds much more natural.
And risk factors become not the adversarial "supersmart, potentially very unfriendly AI against humans" (which sounds awfully difficult and almost hopeless), but the "joint ecosystem of humans and AIs against relevant risk factors", which seems to be more realistic to win.
So, one wonders: if one has this much disagreement with the key starting point of the OpenAI approach, is there any room for collaboration, or does it make more sense to do this elsewhere?
If my alternative action had zero or low value, my inclination would be to interview with OpenAI, be very open and direct about my disagreements and about the fact that I was going to remain loud about them, see if they hired me anyway, and accept if I updated positively during the process about my impact from joining.
I'd be less excited if I had other very good options to consider.
In their announcement Introducing Superalignment, OpenAI committed 20% of secured compute and a new taskforce to solving the technical problem of aligning a superintelligence within four years. Cofounder and Chief Scientist Ilya Sutskever will co-lead the team with Head of Alignment Jan Leike.
This is a real and meaningful commitment of serious firepower. You love to see it. The announcement, dedication of resources and focus on the problem are all great. Especially the stated willingness to learn and modify the approach along the way.
The problem is that I remain deeply, deeply skeptical of the alignment plan. I don’t see how the plan makes the hard parts of the problem easier rather than harder.
I will begin with a close reading of the announcement and my own take on the plan on offer, then go through the reactions of others, including my take on Leike’s other statements about OpenAI’s alignment plan.
A Close Reading
Section: Introduction
Excellent. Love the ambition, admission of uncertainty and laying out that alignment of a superintelligent system is fundamentally different from and harder than aligning less intelligent AIs including current systems.
Excellent again. Superalignment is clearly defined and established as necessary for our survival. AI systems much smarter than humans must follow human intent. They also don’t (incorrectly) claim that it would be sufficient.
Bold mine here:
Yes, yes, yes. Thank you. Current solutions will not scale. Not ‘may’ not scale. Will not scale. Nor do we know what would work. Breakthroughs are required.
Note B is also helpful, I would say ‘will’ rather than may.
A+ introduction and framing of the problem. As good as could be hoped for.
Section: Our Approach
Oh no.
A human-level automated alignment researcher is an AGI, and also a human-level AI capabilities researcher.
Alignment isn’t a narrow safe domain that can be isolated. The problem deeply encompasses general skills and knowledge.
It being an AGI is not quite automatically true, depending on one’s definition of both AGI and especially one’s definition of a human-level alignment researcher. Still seems true.
If the first stage in your plan for alignment of superintelligence involves building a general intelligence (AGI), what makes you think you’ll be able to align that first AGI? What makes you think you can hit the at best rather narrow window of human intelligence without undershooting (where it would not be useful) or overshooting (where we wouldn’t be able to align it, and might well not realize this and all die)? Given comparative advantages it is not clear ‘human-level’ exists at all here.
They do talk later about aligning this first AGI. This does not solve the hard problem of how a dumber thing can align a smarter thing. The plan is to do this in small steps so it isn’t too much smarter at each step, and do it at the speed of AI. Those steps help, and certainly help iterate and experiment. They don’t solve the hard problems.
I am deeply skeptical of such plans. Flaws in the alignment and understanding of each AGI likely get carried over in sequence. If you use A to align B to align C to align D, that is at best a large game of telephone. At each stage the one doing the aligning is at a disadvantage and things by default only get worse. The ‘alignment’ in question is not some mathematical property, it is an inscrutable squishy probabilistic sort of thing that is hard to measure even under good conditions, which you then are going to take out of its distribution, because involving an ASI makes the situation outside of distribution. We won’t be in a position to understand what is happening, let alone adopt a security mindset.
This is the opposite of their perspective, which is that ‘good enough’ alignment for the human-level is all you need. That seems very wrong to me. You would have to think you can somehow ‘recover’ the lost alignment later in the process.
I am not saying I am confident it could never possibly work – I have model uncertainty about that. Perhaps it is merely game-style impossible-level difficulty. I even notice myself going ‘oh I bet if you…’ quite a bit and I would love to take a shot some time.
When I talk to people about this plan I get a mix of reactions. One person working in AI responded positively at first, but that was because they assumed a different, less obviously doomed plan was meant; they then realized ‘oh, they really do mean train an AI alignment researcher.’ Among those who understand alignment is important and hard, and who parse the post accurately in terms of announced intent, I didn’t hear much hope for the approach, but that sample is obviously biased.
A human-level AI researcher sounds like a dangerous system, even if it cannot do every human task. We likely encounter some superalignment problems on this first step. As basic examples: Punishing the detection of otherwise effective behaviors causes deception. Testing for interpretability can cause internal disguise and even ‘fooling oneself’ as it does in humans, cognition gets forced into whatever form you won’t detect, including outside what you think is the physical system. The system may display various power seeking or strategic behaviors as the best way to accomplish its goals. It will definitely start entirely changing its approach to many problems as its new capabilities enable harder approaches to become more effective, invalidating responses to previous methods. There is risk the system will be able to differentiate when it may be in a training or testing scenario and change its responses accordingly. And so on.
This also seems like the wrong threat model. You want to detect misalignments you weren’t training for, that are expressed in ways you didn’t anticipate, and that involve more powerful optimizations or more intelligence than that which is going into supervising the system. It seems easy to fool oneself with such tests into thinking it is safe to proceed, then all your systems break down.
Similarly, if you create automated systems to search for problematic behaviors or internals, what you are doing is selecting against whatever problems and presentations of those problems your methods are able to detect. And once again, you are trying to detect upwards in the intelligence or capabilities chain. The more fine grained and precise your responses to such discoveries are, the more you are teaching the system to work around your detection methods, including in ways we can’t anticipate and also ways that we can.
The way that such automated systems work, I would expect AIs to be at comparative disadvantage in detecting distinct manifestations of misalignment, compared to humans. Detecting when something is a bit off or suspicious, or even suspiciously different, is a creative task and one of our relative strengths – we humans are optimized especially hard to be effective on such tasks. It is also exactly the type of thing that one should expect would be lost during the game of telephone. The ‘human-level AI researcher’ will get super-human at detecting future problems that look like past problems, while being sub-human at detecting future problems that are different. That is not what we want. Nor is creating a super-human alignment researcher in step one, because by construction we haven’t solved super-alignment at that step.
Adversarial testing seems great for existing non-dangerous systems. The danger is that we will not confidently know when we transition to testing on dangerous systems.
I might summarize this approach as amplifying the difficulty of the hard problems, or at least not solving them, and instead working on relatively easy problems in the hopes they enable working on the hard ones. That makes sense if and only if the easy problems are a bottleneck that enables the solving of the hard problems. Which is a plausible hypothesis, if it lets us try many more hard problem solutions much faster and better.
I also worry a lot that this is capabilities in disguise, whether or not there is any intent for that to happen. Building a human-level alignment researcher sounds a lot like building a human-level intelligence, exactly the thing that would most advance capabilities, in exactly the area where it would be most dangerous. One more than a little implies the other. Are we sure that is not the important part of what is happening here?
Another potential defense is that one could say that all solutions to the problem involve impossible steps, and have reasons they would never work. So perhaps one can see these impossible obstacles as the least impossible option. Indeed, if I squint I can sort of see how one could possibly address these questions – if one realizes that they require shutting up and doing the impossible, and is willing to also solve additional unnoticed impossible problems along the way.
Does that mean Ilya and his famously relentless optimism is the perfect choice, or exactly the wrong choice? That is a great question. One wants the special kind of insane optimism that relentlessly pursues solutions without fooling oneself about what it would take for a solution to work.
I would also note that this seems like the place OpenAI has comparative advantage. If the solution to alignment looks like this, then the skills and resources at OpenAI’s disposal are uniquely qualified to find that solution. If the solution looks very different, that is much less likely.
If I was going to pursue this kind of iterated agenda, my first step would be to try to get a stupider less capable system to align a smarter more capable system.
This last paragraph gives hope. Even if one’s original plan cannot possibly work, a willingness to pivot and expand and notice failure or inadequacy means you are still in the game.
All the new details here about the strategy are net positives, making the plan more likely to succeed, as is the plan to adjust the plan. Given what we already knew about OpenAI’s approach to alignment, the core plan as a concept comes as no surprise. That plan will then be given a ton of resources, relatively good details and an explicit acknowledgment that additions and adjustments will be necessary.
So if we take the general shape of the approach as a given, once again very high marks. Again, as good as could be expected.
Section: The New Team
Dedicating 20% of compute secured to date is different from 20% of forward looking compute, given the inevitable growth in available compute. It is still an amazingly high level of compute, far in excess of anyone’s expectations. There will also be lots of hires explicitly for the new Superalignment team. We will likely need to then go bigger, but still. Bravo.
I once again worry that there will be too much extrapolation of data from existing models to what will work on future superintelligent models. I also would like to see more explicit reckoning with what would happen if the effort fails. There is absolutely no shame in failing or taking longer to solve this, but if you know you don’t have a solution, what will you do?
Considering Joining the UK Taskforce
Before considering the option of joining the OpenAI Superalignment Taskforce, take this opportunity to consider joining the UK Foundation Model Taskforce.
The UK Foundation Model Taskforce is a 100 million pound effort, led by Ian Hogarth who wrote the Financial Times Op-Ed headlined “We must slow down the race to God-like AI.” They are highly talent constrained at the moment, what they need most are experienced people who can help them hit the ground running. If you are a good fit for that, I would make that your top priority.
You can reach out to them using this Google Form here.
Considering Joining the OpenAI Taskforce
What about if the UK taskforce is not a good fit, but the OpenAI one might be?
If you are currently working on capabilities and want to instead help solve the problem and build a safety culture, the Superalignment team seems like an excellent chance to pivot to (super?) alignment work without making sacrifices on compensation or interestingness or available resources. Compensation for the new team is similar to that of other engineers.
One of the biggest failures of OpenAI, as far as I can tell, is that OpenAI has failed to build that culture of safety. This new initiative could be an opportunity to cultivate such a culture. If I was hiring for the team, I’d want a strong focus on people who fully bought into the difficulty and dangers of the problem, and also the second-best time to do that for the rest of your hiring will always be right now.
If your alternative considerations are instead other alignment work, what about then? Opinion was mostly positive when I asked.
I think it depends a lot on your alternatives, skills and comparative advantage, and also if I was considering this I would do a much deeper dive and expect to update a lot one way or another. The more you believe you can be good at building a good culture and steering things towards good plans and away from bad ones, and the more you are confident you can stand up to pressure, or need this level of compensation for whatever reason, the more excited I would be to join. Also, of course, if your alignment ideas really do require tons of compute, there’s that.
On the other side, the more you are capable of a leadership position or running your own operation, the more you don’t want to be part of this kind of large structure, the more you worry about being able to stand firm under pressure and advocate for what you believe, or the more what you want to do can be done on the outside, the more I’d look at other options.
I do think that this task force crosses the threshold for me, where if you tell me ‘I asked a lot of questions and got all the right answers, and I think this is the right thing for me to do’ I would likely believe you. One must still be cautious, and prepared at any time to take a stand and if necessary quit.
Final Section: Sharing the Bounty
Excellent. The suggestion to publish is a strong indication that this is indeed intended as real notkilleveryoneism work, rather than commercially profitable work. The flip side is that if that does not turn out to be true, there is much here that could be accelerationist. Knowing what to hold back and what not to will be difficult. One hope is that commercial interests may mostly align with the right answer.
Yes, exactly. There is no conflict here. Some people should work on mitigating mundane harms like those listed above, while others are dedicated to the hard problem. One need not ‘distract’ from the other, and I expect a lot of mundane mitigation spending purely for self-interested profit maximizing reasons.
More Technical Detail from Jan Leike
Jan Leike engaged via a few threads, well-considered and interesting throughout, as discussed later. Here was his announcement thread:
There is great need for more ML talent working on alignment, but I disagree that alignment is fundamentally a machine learning problem. It is fundamentally a complex multidisciplinary problem, executed largely in machine learning, and a diversity of other skills and talents are also key bottlenecks.
I do like the attitude on the timeline of asking, essentially, ‘why not?’ Ambitious goals like this can be helpful. If it takes 6 years instead, so what? Maybe this made it 6 rather than 7.
He also noted this:
If you use current-model alignment to measure superalignment, that is fatal. The whole reason for the project is that progress on aligning GPT-5 might not indicate actual progress on aligning a superintelligence, or a human-level alignment researcher. It is even possible that some techniques that work in the end don’t work on GPT-5 (or don’t work on GPT-4, or 3.5…) because the systems aren’t smart or capable enough to allow them to work.
For example, Constitutional AI (which I expect to totally not work for superintelligence and also importantly not for human-level intelligence, to be clear) should only work (to the extent it works) within the window where the system is capable enough to execute the procedure, but not capable enough to route around it. That’s one reason you need so much compute to work on some approaches, and access to the best models.
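For reference, the supervised stage of Constitutional AI is roughly a critique-and-revise loop the model runs on its own outputs (a rough sketch below; the generate helper and the listed principles are placeholders, not a real API or the real constitution), which is exactly why it only works inside the capability window described above: every step depends on the same model faithfully critiquing and revising itself.

```python
# Rough sketch of the supervised stage of Constitutional AI (per Bai et al. 2022).
# `generate` is a placeholder for whatever model call you have; not a real library API.

CONSTITUTION = [
    # Illustrative, paraphrased principles; the real constitution has many.
    "Identify specific ways the response is harmful, unethical, or deceptive.",
    "Identify specific ways the response fails to be helpful and honest.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to the model being trained."""
    raise NotImplementedError

def constitutional_revision(user_prompt: str) -> str:
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(f"Critique the response below. {principle}\n\nResponse: {response}")
        response = generate(f"Rewrite the response to address this critique.\n\n"
                            f"Critique: {critique}\n\nResponse: {response}")
    return response  # revisions like this become supervised fine-tuning data
```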
The part where capabilities advances are compared to the rate of alignment progress makes sense either way. The stronger your capabilities, the more you need to ask whether to halt creating more of them. There is still the danger of unexpectedly very large jumps.
Leike also talked substantive detail in the comments of the LessWrong linkpost to the announcement.
[RRM = Recursive Reward Modeling, IDA = Iterated Distillation and Amplification, HCH = Humans Consulting HCH.]
There’s a lot here. The obfuscated arguments problem is an example of the kind of thing that gets much trickier when you attempt amplification, if I am understanding everything correctly.
As I understand Leike’s response, he’s acknowledging that once the systems move sufficiently beyond human-level, the entire class of iterated or debate-based strategies will fail. The good news is that we can realize this, and that they might plausibly hold for human level, I presume because there is no need for an extended amplification loop, and the unamplified tools and humans are still useful checks. This helps justify why one might aim for human-level, if one thinks that the lethalities beyond human-level mostly would then not apply quite yet.
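For readers who want the shape of the iterated strategies being discussed, here is a rough structural sketch of the IDA loop (amplify by consulting copies of the current model on subquestions, then distill the result into the next model); every helper is a placeholder for illustration, not anyone's actual pipeline.

```python
# Rough structural sketch of IDA (amplify, then distill) as usually described;
# every helper here is a placeholder for illustration, not anyone's actual pipeline.

def overseer_decompose(question):            # placeholder: the overseer splits the task
    raise NotImplementedError

def overseer_combine(question, subanswers):  # placeholder: the overseer assembles an answer
    raise NotImplementedError

def train_imitator(dataset):                 # placeholder: supervised training on (question, answer) pairs
    raise NotImplementedError

def amplify(model, question):
    """One amplification step: consult copies of the current model on subquestions."""
    subquestions = overseer_decompose(question)
    subanswers = [model(q) for q in subquestions]
    return overseer_combine(question, subanswers)

def iterated_distillation_and_amplification(initial_model, questions, rounds):
    model = initial_model
    for _ in range(rounds):
        dataset = [(q, amplify(model, q)) for q in questions]   # slow, amplified answers
        model = train_imitator(dataset)                         # distill them into a fast model
    return model
```

The game-of-telephone worry from earlier is visible in the structure: whatever the overseer-plus-model team misses in one round is what the next model is trained to imitate.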
If something is a major driver on whether AI kills us, then definitely don’t leave it to the labs, much better to risk doing redundant work. And ‘how to turn compute into better intelligence without compromising alignment’ sounds very much like it could be such an important question. If it’s a coherent thought that makes sense in context, it’s very important.
I am very much with Christiano on all the criticisms here, especially skepticism of AI evaluations of other AIs via results. The sense of doom there seems overwhelming.
The pattern of ‘individually the statements could be reasonable but the optimism bias is clear if you zoom out’ is definitely an issue. That’s one key way you can fool yourself.
Very well said. It is a huge alignment tax if the supervisor needs to understand everything that is going on, unless the supervisor is as on the ball as the system being supervised. So there’s a big gain in results if you instead judge by results, while not understanding what is going on and being fine with that, which is a well-known excellent way to lose control of the situation.
The good news is that we pay exactly this huge alignment tax all the time with humans. It plausibly costs us most of the potential productivity of many of our most capable people. We often decide the alternative is worse. We might do so again.
I think that if you are trying to get huge amounts of research out of the AIs it is at minimum very hard to see how that happens without substantial takeover risk, or at least risk of a path that ‘bakes in’ such risk later. Very hard is not impossible.
I expect the answer is ‘it depends.’ Some types of code, solutions to some problems, are like math or involve getting technical details right: Hard to figure out and easy to check. Others are harder to check, and how you write the code on many levels helps determine the answers, including whether the code is intentionally obfuscated. In practice, if you are relying on being able to verify the code easier than the cost of writing it, without being able to run it, then at best you are going to pay a ginormous alignment tax – only writing code that can be fully provably understood by the evaluator.
Also I don’t expect ‘run the code’ to be that helpful in testing it in many cases, as the problem could easily be conditional, or lie dormant, or be an affordance that you don’t understand is an affordance until it is used. Or there could be affordances that appear when you spin up enough related copies that can collaborate. In general, there seems to be an implicit lack of appreciation going around of what it means for something to be smarter, and able to come up with ways to outsmart you that you can’t anticipate.
Can you hear Eliezer Yudkowsky screaming at the idea of throwing together a messy combination of methods and then messing with their dials in the hopes it all works out? Because I am rather confident Eliezer Yudkowsky is screaming at the idea of throwing together a messy combination of methods and then messing with their dials in the hopes it all works out, and I do not think he is wrong. This is not what ‘get it right on the first try’ looks like, this is the theory that you do not in fact have to do that, in contrast to the realization at the top of a break point where your techniques largely stop working.
This seems like a good time to go back and read Jan Leike’s post referenced multiple times above, ‘Why I’m optimistic about our alignment approach,’ since that approach hasn’t changed much.
It seems worth quoting his argument in 1.1 in full, ‘the AI tech tree is looking favorably:’
Paul Christiano expressed doubt about whether this change is good. I am even more skeptical. LLMs give you messy imprecise versions of a lot of this stuff and everything else they give you, much easier than you’d get that otherwise. They make doing anything precise damn near impossible. So is that helpful here, or is that profoundly unhelpful? Also you can see various ways the deck is being stacked here.
I’d pull this quote out, it seems important:
I do not think these two things are as similar as that, and worry this is an important conceptual error.
Section two is called ‘a more modest goal’ which refers to training an AI that can then do further research, the advantages he lists then include that the model doesn’t have to be fully aligned, doesn’t need agency or persistent memory, and the alignment tax matters less. None of this addresses the reasons why ‘alignment researcher’ is an unusually perilous place to point your AI, and none of it seemed new.
Section three is that evaluation is easier than generation. Evidence and examples include NP != P, classical sports and games, consumer products, most jobs and academic research. I am not convinced.
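The asymmetry itself is real in many formal settings; a standard toy example (mine, not Leike’s) is subset-sum, where checking a proposed answer is one pass and a sum, while finding one by brute force is exponential in the number of items.

```python
from itertools import combinations

def verify(numbers, subset, target):
    """Checking a proposed certificate is cheap: one membership test and a sum."""
    return set(subset) <= set(numbers) and sum(subset) == target

def generate(numbers, target):
    """Finding a certificate by brute force is exponential in len(numbers)."""
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return subset
    return None

numbers = [3, 34, 4, 12, 5, 2]
solution = generate(numbers, 9)                  # slow in general
print(solution, verify(numbers, solution, 9))    # checking it is trivial
```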
In particular, I would stress that the situations in which evaluation is easy are not the places we are worried about our evaluations, or the ways in which we worry our evaluations would go wrong. If 90% of the time evaluation is easy and 10% of the time it is hard – or 99% and 1% – guess which tasks will involve a principal-agent problem or potential AI deception. A mix of relative difficulties that is mostly favorable means the relative difficulty is unfavorable. Note the objection later, where Leike spells out that obfuscated arguments shows there are important situations where evaluation is the harder task. Leike is displaying exactly the opposite of security mindset.
Section four seems to be saying yay for proxy metrics and iterated optimizations on them, and yes optimizing for a proxy metric is easier but the argument is not made here that the proxy metrics are good enough, and there’s every reason to think Goodhart’s Law is going to absolutely murder us here and that the things we optimize successfully will still be waiting to break down when it counts.
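A tiny simulation (illustrative numbers only) of the Goodhart failure mode in question: select hard enough on a proxy that correlates with what you want, and you mostly end up selecting on whatever component of the proxy is easiest to move.

```python
import numpy as np

rng = np.random.default_rng(0)
true_quality = rng.normal(size=100_000)            # the thing we actually care about
gameable = rng.normal(scale=3.0, size=100_000)     # an unrelated, easier-to-move component
proxy = true_quality + gameable                    # the metric we can measure and optimize

winner = int(np.argmax(proxy))                     # pick the proxy-maximizing candidate
print("true quality of proxy-optimal pick:", round(float(true_quality[winner]), 2))
print("best true quality available:       ", round(float(true_quality.max()), 2))
```

The proxy-optimal pick looks great by the metric and mediocre on the thing the metric was standing in for, with no adversary required; an actual optimizer pushing on the metric makes this worse.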
A bunch of that seems worth fleshing out more carefully at some point.
The response to the objection ‘isn’t automating alignment work too similar to aligning capabilities work?’ starts with:
Wait, what? That is not an argument in favor of optimism! That is an argument in favor of deep pessimism. You are saying that the thing you are doing will be used to enhance capabilities as soon as it is feasible. So don’t make it feasible?
Next up we have the argument that this makes alignment and capabilities work fungible, which once again should kind of alarm you when you combine it with that first point, no?
There is a very deeply obvious and commonly known externality problem with relying on the incentives here. If your lab develops AGI, you get the benefits, including the benefits of getting there first. If that AGI kills everyone, obviously you will try to avoid that, that kills you too, but you only bear a small fraction of the downside costs. If your plan is to count on self-interest to ensure proper investment in alignment research, you have a many-order-of-magnitude incentive error that will result in a massively inadequate investment in alignment even if everything else goes well. Other factors I can think of only make this worse.
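As a back-of-the-envelope version of that incentive gap (the numbers below are purely illustrative assumptions, not estimates): a lab captures a sizable share of the upside, while the downside of everyone dying is spread over everyone.

```python
# Purely illustrative assumptions, not estimates; the point is only the rough ratio.
lab_share_of_upside = 0.1                      # fraction of the benefits a winning lab captures
people_at_the_lab = 10_000                     # staff plus shareholders, loosely
people_bearing_the_downside = 8_000_000_000    # everyone

lab_share_of_downside = people_at_the_lab / people_bearing_the_downside
print(f"upside internalized:   ~{lab_share_of_upside:.0e}")
print(f"downside internalized: ~{lab_share_of_downside:.0e}")
print(f"mismatch:              ~{lab_share_of_upside / lab_share_of_downside:.0e}x")
```

Any plausible choice of numbers leaves the same shape: the private incentive to avoid the downside is orders of magnitude smaller than the social one.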
The third response is that you can focus on tasks differentially useful to alignment research. I mean, yes, you can, but we’ve already explained why the collective ‘you’ probably won’t. The whole objection is that you can’t lock this decision in.
The fourth response is better, and does offer some hope.
There is definitely the danger this relationship fails to hold, and the danger that even ordinary unlocked algorithmic improvements are already too much. If you can build a human-level researcher, and you’re somehow not in the danger zone yet, you are damn close, even one or two additional orders of magnitude efficiency gain is scary as hell on every level. Thus, I think this has some value, but agree with Leike that counting on it would be highly unwise.
Next up he does tackle the big objection, that in order to do good alignment work one has to be capable enough to be dangerous. In particular the focus is on having to be capable of consequentialist reasoning, although I’d widen the focus beyond that.
Here is his response.
You can say a nonzero amount but how they will ‘try to do’ the thing could involve things you cannot anticipate. If your plan is to base your strategy on providing the right reward function and incentives, you have not in any way addressed any of the involved lethalities or hard problems. How does it help me that Magnus Carlsen is going to ‘try and checkmate me’ except on an unconstrained playing field? And that’s agreeing to ignore inner alignment issues.
And wait, I thought the whole point was not to build something much smarter than us, exactly because we couldn’t align that yet.
This is the better answer, that we indeed won’t be building something smarter than us, and that will still help due to scale and speed and so on. Then it is a fact question, how much help can we get out of systems like that.
I agree that the above system seems useful, and that I can see conditions under which it would not be dangerous. There are certainly lots of systems like this, including GPT-4, which we agree is not dangerous and is highly useful. Doubtless we can find more useful programs to help us.
The question I’d have is, why describe the intended program as a human-level researcher? The proof generator doesn’t equate to the human intelligence scale in any obvious way. If we want to build such narrow things, that’s great and we should probably do that, but they don’t solve the human bottleneck problem, and I don’t think this is what the plan intends.
Attempt two: You can use something without consequentialist reasoning to do the steps of the work that don’t involve consequentialist reasoning, but then you need to have something else in the loop doing the consequentialist reasoning. Is it human?
The next question is on potential inner alignment problems, which essentially says we’ve never seen a convincing demonstration of them in existing models and maybe if we do see them they’ll be easy to fix. I would respond that they happen constantly in humans, they are almost certainly happening in LLMs, this will get worse as situations get further out of the training distribution (and that stronger optimization processes will take us further out of the training distribution), and our lack of ability to demonstrate them so far only makes our task of addressing them seem harder. More than that, I’d say what is here seems like a non-answer.
My overall take is that it is wonderful that this post exists and that it goes into so much concrete detail, yet the details mostly worry me more rather than less. The responses do not, in my evaluation, successfully address the concerns, in some cases rather the reverse. I disagree with much of the characterization of the evidence. There is a general lack of appreciation of the difficulty of the problems involved and an absence of security mindset, along with a clear optimism bias.
Still, once again, the detailed engagement here is great – I’d much rather have someone who writes this than someone who believes something similar and doesn’t write this. And there does seem to be a lot of ‘be willing to pivot’ energy.
Ilya Sutskever by contrast is letting the announcement speak for itself. Nothing wrong with focusing on the work.
Nat McAleese’s Thread
Here, all the right things, except without discussing the alignment plan itself.
The Ilya Optimism
Sherjil Ozair explains how the man’s brain works.
This is a big crux on how to view the world. Does it take irrationality to do hard things? Eliezer Yudkowsky explicitly says no. In his view, rationality is systematized winning. If you think you need to be ‘irrational’ to do a hard thing, either that hard thing is not actually worth doing, or your sense of what is rational is confused.
I have founded multiple start-up companies or businesses. In each case, any rational reading of the odds would say the effort was overwhelmingly likely to fail. Almost all new businesses fail, almost all venture investments fail. Yet I still did boldly go, in many of the same ways a delusional optimistic founder would have boldly gone. In most (but not all and that makes all the difference!) cases it failed to work out. I don’t see any reason this has to require fooling oneself or having false beliefs. I do understand why, in practice and for many people, this does help.
I can easily believe that Ilya would not think of himself as updating. I can see him, when asked by others or even when he queries his own brain, responding to questions about priors or probabilities with a shrug, leaving them undefined. That he doesn’t consciously think in such ways, or use such terms. And that the things he would instead say sound, if interpreted literally, irrational.
None of that makes what he is doing actually irrational. If you treat any problem you face as solvable and focus on ways one might solve it, that is a highly efficient and rational way to solve a problem, so long as you are choosing problems worth attempting to solve, including considering whether they are solvable. If you allocate attention to various potential solutions and scale that attention based on feedback about each approach’s chance of success, that is a highly rational, systematic and plausibly optimal resource allocation.
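If you want that second claim made precise, here is a minimal sketch (my formalization, not anything Ilya or Sherjil described) using Thompson sampling, the textbook treatment of exactly this allocate-by-feedback problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy version of "allocate attention, then scale it by feedback": three
# candidate approaches with hidden chances of paying off per unit of effort.
true_payoff_rates = np.array([0.02, 0.05, 0.15])   # unknown to the allocator
wins = np.ones(3)      # Beta(1, 1) priors: start out agnostic
losses = np.ones(3)
effort = np.zeros(3)

for _ in range(5_000):
    # Sample a plausible payoff rate for each approach, work on the best draw.
    i = int(np.argmax(rng.beta(wins, losses)))
    effort[i] += 1
    if rng.random() < true_payoff_rates[i]:   # noisy feedback from trying it
        wins[i] += 1
    else:
        losses[i] += 1

print("share of effort per approach:", np.round(effort / effort.sum(), 2))
```

No probability ever has to be stated out loud; the updating is baked into the procedure, which is the sense in which the behavior can be rational even when the self-description is not.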
Not caring about being right, and being willing to be wrong, is a highly advanced rationalist skill. If you master it by keeping your explicit beliefs as small as your identity, so as to sideline the concepts of ‘being wrong’ or ‘trying and failing’ in favor of asking only what is true and what will lead to good outcomes, we only disagree on vocabulary and algorithmic techniques.
The question here is whether Ilya’s approach implies a lack of precision or an inability to do chains of logical reasoning, or some other barrier to the kind of detailed and exacting thinking necessary for good work on these problems. I don’t have enough information to know. I do know that in general this does sound like someone you want leading your important project.
So What Do We All Think?
First off, can we all agree, awesome, good job, positive reinforcement?
There is an obvious implication to deal with, of course, although this is an overstatement here:
We must simultaneously take the win, and also roll up our sleeves for the actual hard work, both within and outside of OpenAI and the new team. Also we must ensure good incentives, and that we give everyone the easiest way we can to go down the right path.
Would I appreciate a commitment on halting capabilities work beyond some threshold until the superalignment work is done, as Miller claims any sane actor would do? Yes, of course, that would be great, especially alongside a call to broaden that commitment to other labs or via regulations. We should work towards that.
It still is strongly implied. If you have an explicit team dedicated to aligning something, and you don’t claim it succeeded, presumably you wouldn’t build the thing and don’t need a commitment on that. If you did build it anyway, presumably you would fool yourself into thinking the team had instead succeeded – and if a binding commitment was made, one worry is that this would make it that much easier to get fooled. Or, of course, to choose to fool others, although I like to give the benefit of the doubt on that here.
We can also agree that this level of resource commitment is a highly credible signal, if the effort is indeed spent on superalignment efforts in particular. The question is whether this was commercially the play anyway.
That is two right answers from Leike. Explicitly avoiding economic value is highly perverse. There are circumstances where the signaling or incentive dynamics are so perverse you do it anyway, but this is to be avoided. Even in the context of OpenAI and concerns about accelerationism, I would never ask someone to optimize for the avoidance of mundane utility.
I do share Oliver’s worry. One good reason to avoid mundanely useful things is that by putting mundanely useful options on the table you cause them to get chosen whether they make sense or not, so perhaps better to avoid them unless you are certain you want that.
Oliver tries to pin down his skepticism in another thread, along with exactly what is missing in the announcement.
Yep, the ‘not modeling the hard parts of the problem’ is the hard problem. As with the actual hard problem, that’s fine if solving the easy problem helps you solve (or notice you have to solve) the hard problem, and very bad if it lets you fool yourself.
Appreciation of the problem difficulty level is indeed the best marginal thing to want here, with the caveat of the phenomenon of ‘we did this not because it was easy but because we thought it would be easy.’ Having the startup effort think the problem is easier than it is can be good, as per Ilya’s relentless optimism, provided the proper adjustments are made and the problem is taken properly seriously.
Danielle Fong is on board with this plan, she loves this plan.
I admire her ambition. I do worry.
Oh Look, It’s Capabilities Research
That is certainly the maximally OpenAI thing to do, and does seem to be a good description of the plan. There is most definitely the danger that this effectively becomes a project to build an AGI. Which would be a highly foolish thing to be building given our current state of alignment progress. Hence the whole project.
Yes. This is not the plan, but it’s also not not the plan; it kind of is ‘to solve AGI alignment our step one is build an AGI,’ while buying (a weaker version of) the concept of recursive self-improvement.
I am confused exactly how unfair that characterization is. Non-zero, not entirely.
We very much do not want to end up doing this:
Daniel Eth’s offering:
Back to Leahy’s skeptical perspective, which no one involved should take too personally as he is (quite reasonably, I think) maximally skeptical of essentially everything about alignment attempts.
[bunch of back and forth, in which Connor expresses extreme skepticism of the technical plans offered, and says the good statements in context only offer a marginal update that mostly fails to overcome his priors, while confirming that he agrees Jan and Ilya care for real.]
But, reasonable! For me, the post simply doesn’t contain anything that addresses any of my cruxes needed to overcome the hyper negative prior on plans that are of the form “we muddle our way towards AGI1 and ask it to align AGI2” that I have. It’s not that there couldn’t be a plan of this form that is sufficiently advanced and clever that could work, they just DON’T work by default and I see nothing here that makes me feel good about this instance.
Julian Hazell: What would be something OAI could feasibly do that would be a positive update for you? Something with moderate-to-significant magnitude
Connor Leahy: This is a good question I want to think about for a bit longer instead of giving a low quality off the cuff answer, thanks for asking!
My answer is that I did have a positive update already, but if you want to get an additional one on top of that, I would have liked to see them produce their own, less detailed and technical, List of Lethalities: all the reasons their approach will inevitably fail, as well as an admission that the plan looks suspiciously like step 1 is build an AGI, and a clear explanation of how to solve such problems and how they plan to avoid being fooled if they haven’t solved them. A general statement of security mindset, and that this is not about ‘throw more cycles at the issue’ or something that naturally falls out. Also would love to see discussion about exactly what the final result has to look like and why we should expect this to be possible, and so on. It would also help to see discussion about the dangers of this turning into capabilities work and the plan for not doing so, or a clear statement that if the project doesn’t provably work then capabilities work will need to be constrained soon thereafter. I’d also likely update positively if they explained more of their reasoning, especially about things on which they’ve changed their minds.
Will It Work?
Jeffrey Ladish offers his previous thoughts on this research path, noting that he has updated slightly favorably since writing it due to strong statements from Sam Altman and the project announcement, which reflect taking the problem seriously and better appreciating the difficulties involved.
The full post points out that there is little difference between ‘AI AI-alignment researcher’ and ‘AI AI-capabilities researcher’ and also little gap at most between that and actual AGI, so time left will be very short and the plan inherently accelerates capabilities.
I agree these are many of the core problems. OpenAI is a place where we especially need to worry about reallocation toward, or use of, capabilities advances, should they occur, and there is still insufficient appreciation of how to plan, in time to avoid catastrophe, for what happens if the plan fails. These are problems that can be fixed.
The problem that might not be fixable is whether the general approach can work at all. Among other concerns: Does it assume a solution, thus skipping over the actually hard problem?
Connor Leahy says not only will the plan obviously not work, he claims it is common knowledge among ordinary people that such a plan will not work.
Connor has since clarified that he meant something more like common sense rather than common knowledge. What is bizarre is the idea that most of the ML community ‘doesn’t think misalignment will be a problem.’ Wait, what? This survey seems to disagree, with only 4% saying ‘not a real problem’ and 14% saying ‘not an important problem,’ while 58% say it is very important or among the most important problems. Saying this isn’t a problem seems flat out nonsensical, and on a much deeper level than denying the problem of extinction risk. There’s ‘Zeus will definitely not be a threat to Cronus’ and then there’s ‘Zeus will never display any behavioral problems of any kind, the issue is his lightning skills.’
Connor offered his critique in Reuters in more precise form.
Yep. Human level is a rather narrow target to hit; also, human-level alignment good enough to use for this purpose is not exactly a solved problem either. Of all the tasks you might need a human to do, this is one where you need an unusually high degree of precise alignment, both because it will be highly difficult to know if something does go wrong, and because in order to align the target one needs to understand what that means.
I do not think this holds. I do agree that an effort looking for anything at all will stumble on relatively more capabilities than the average effort does, since the average includes efforts that deliberately look for alignment, but this effort is one of the efforts deliberately looking for alignment.
I also don’t think we have strong evidence that alignment (which I agree is very hard) is bottlenecked by intelligence. We have only hundreds of people working on it; there are many potential human alignment researchers available before we need to turn to AI ones. Most low-hanging efforts have not seriously been tried (also true of capabilities, mind) and no one has ever devoted industrial levels of compute to alignment at all.
I do think that if OpenAI plans to release a ‘human-level alignment researcher’ more generally, then by default that also releases a human-level capabilities researcher along with a human-level everything else, which is very much not a good idea. But I am assuming that, if only for commercial reasons, OpenAI would not do that if they realized they were doing it, and I think they should be able to realize they’d be doing it.
Manifold Markets says that, by the verdict of the team itself, the initiative has a remarkably good chance (this is after me putting M25 on NO, moving it down by 1%).
This is very much not a knock on the team or the attempt. A 15% chance of outright real success would be amazingly great, especially if the other 85% involves graceful failure and learning many of the ways not to align a superintelligence. Hell, I’d happily take a 0.1% chance of success if it came with a 99.9% chance of common knowledge that the project had failed while successfully finding lots of ways not to align an AGI. That is actual progress.
The concern is how often the project will fail, but the team or OpenAI will be fooled into thinking it succeeded. The best way to get everyone killed is to think you have solved ‘superalignment’ so you go ahead and build the system, when you have not solved superalignment. As always, you are the easiest one to fool. Plans like OpenAI’s current one seem especially prone to superficially looking like they are going to work right up until the moment it is too late.
Thus, my betting NO here is actually in large part a bet on the superalignment team. In particular, it is a bet on their ability to not fool themselves, as well as a bet that the problem is super difficult and unlikely to be solved within four years.
It’s also a commentary on the difficulties of such markets. One must always ask about implied value of the currency involved, and in which worlds you can usefully spend it.
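A toy version of that adjustment, with invented numbers: if the currency is worth less to you in the worlds where one side of the bet pays out, the price at which you stop trading is no longer your credence.

```python
# Invented numbers, purely to illustrate the currency-value adjustment.
p = 0.15            # your actual credence the market resolves YES
value_if_yes = 1.0  # marginal value to you of winnings in YES-worlds
value_if_no = 0.6   # say winnings in NO-worlds are worth less to you

# Buying YES at price q is value-weighted break-even when
#   p * value_if_yes * (1 - q) == (1 - p) * value_if_no * q
q = p * value_if_yes / (p * value_if_yes + (1 - p) * value_if_no)
print(f"break-even YES price: {q:.2f} vs. credence {p:.2f}")  # ~0.23 vs 0.15
```

Swap which worlds you discount and the distortion runs the other way, so the market price on its own tells you less than it seems to.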
Contrast this with Khoja’s market, where I bought some YES after looking at the details offered. An incremental but significant breakthrough seems likely.
The Trial Never Ends, Until it Does
As in, humans can lose control at any time. What they cannot do is ensure that they will permanently keep it, or that there is a full solution to how to keep it. Yann LeCun [1] makes the good points that ‘solve the alignment problem’ is impossibly difficult, because it is not the type of thing that has a well-defined single solution, that four years is a hyperaggressive timeline, and that almost all previous similar solved problems were only solved by continuous and iterated refinement.
The problem with this coming from Yann is that in this case it collides with Yann’s preferred plan of ‘wait until you have the dangerous system to iterate on, then do continuous and iterated refinement, that’s safe and AI poses no extinction risk to humanity,’ combined with ‘just build the safe AIs and do not build the unsafe ones, no one would be so stupid as to build the unsafe ones.’ Because, wait, what?
I have been emphasizing this more lately. ‘Solving alignment’ is often poorly specified, and while for sufficiently capable systems it is necessary, it is not sufficient to ensure good outcomes. Sydney can be misaligned and have it be fine. An AGI can’t.
I also think we overemphasize the ‘bad actor’ aspect of this. One does not need a per-se ‘bad’ actor to cause bad outcomes in a dynamic system, the same way one does not need a ‘good’ butcher, baker or candlestick maker for the market to work.
Thread includes more. Strongly endorse Ajeya’s closing sentiment:
Whatever disagreements we have, it is amazingly great that OpenAI is making a serious effort here that aims to target parts of the hard problem. That doesn’t mean it will succeed, extensive critique is necessary, but it is important to emphasize that this very much is actual progress.
[1] I do not typically cover Yann LeCun, but I’m happy to cover anyone making good points.