Sorry, but my rough impression from the post is that you seem at least as confused about where the difficulties are as the average alignment researcher you think is not on the ball - and the style of somewhat strawmanning everyone & the strong words is a bit irritating.
Maybe I'm getting it wrong, but it seems the model you have for why everyone is not on the ball is something like "people are approaching it too much from a theory perspective, and a promising approach is very close to how empirical ML capabilities research works" & "this is the type of problem where you can just throw money at it and attract better ML talent".
I don't think these two insights are promising.
Also, again, maybe I'm getting it wrong, but I'm confused about how similar you imagine the current systems to be to the dangerous systems. It seems either the superhuman-level problems (eg not lying in a way no human can recognize) are somewhat continuous with current problems (eg not lying), in which case it is possible to study them empirically, or they are not. But different parts of the post seem to point in different directions. (Personally I think the problem is somewhat continuous, but many of the human-in-the-loop solutions are not, and just break down.)
Also, given what you find promising, I'm confused about what you think the 'real science' to aim for is - on one hand it seems you think the closer the thing is to how ML is done in practice, the more real science it is. On the other hand, in your view all deep learning progress has been empirical, often via dumb hacks and intuitions (this isn't true imo).
On the other hand, in your view all deep learning progress has been empirical, often via dumb hacks and intuitions (this isn't true imo).
Can you elaborate on why you think this is false? I'm curious.
On a related note, this part might be misleading:
I’m just really, really skeptical that a bunch of abstract work on decision theory and similar [from MIRI and similar independent researchers] will get us there. My expectation is that alignment is an ML problem, and you can’t solve alignment utterly disconnected from actual ML systems.
I think earlier forms of this research focused on developing new, alignable algorithms, rather than aligning existing deep learning algorithms. However, a reader of the first quote might think "wow, those people actually thought galaxy-brained decision theory stuff was going to work on deep learning systems!"
For more details, see Paul Christiano's 2019 talk on "Current work in AI alignment":
So for example, I might have a view like: we could either build AI by having systems which perform inference and models that we understand that have like interpretable beliefs about the world and then act on those beliefs, or I could build systems by having opaque black boxes and doing optimization over those black boxes. I might believe that the first kind of AI is easier to align, so one way that I could make the alignment tax smaller is just by advancing that kind of AI, which I expect to be easier to align.
This is not a super uncommon view amongst academics. It also may be familiar here because I would say it describes MIRI's view; they sort of take the outlook that some kinds of AI just look hard to align. We want to build an understanding such that we can build the kind of AI that is easier to align.
Going to share a seemingly unpopular opinion, in a tone that usually gets downvoted on LW, but I think it needs to be said anyway:
This stat is why I still have hope: 100,000 capabilities researchers vs 300 alignment researchers.
Humanity has not tried to solve alignment yet.
There's no cavalry coming - we are the cavalry.
I am sympathetic to fears of new alignment researchers being net negative, and I think plausibly the entire field has, so far, been net negative, but guys, there are 100,000 capabilities researchers now! One more is a drop in the bucket.
If you're still on the sidelines, go post that idea that's been gathering dust in your Google Docs for the last six months. Go fill out that fundraising application.
We've had enough fire alarms. It's time to act.
OpenAI would love to hire more alignment researchers, but there just aren’t many great researchers out there focusing on this problem.
This may well be true - but it's hard to be a researcher focusing on this problem directly unless you have access to the ability to train near-cutting edge models. Otherwise you're going to have to work on toy models, theory, or a totally different angle.
I've personally applied to the DeepMind scalable alignment team - they had a fixed, small available headcount which they filled with other people who I'm sure were better choices - but becoming a better fit for those roles is tricky, unless it's by doing mostly unrelated research.
Do you have a list of ideas for research that you think is promising and possible without already being inside an org with big models?
I don’t think the response to Covid should give us reason to be optimistic about our effectiveness at dealing with the threat from AI. Quite the opposite. Many of the measures taken were known to be useless from the start, like masks, while others were ineffective or harmful, like shutting down schools or giving vaccines to young people who were not at risk of dying from Covid.
Everything can be explained by the incentives our politicians face. They want to be seen to take important questions seriously while not upsetting their donors in the pharma industry.
I can easily imagine something similar happening if voters become concerned about AI. Some ineffective legislation dictated by Big Tech.
I think most comments regarding the covid analogy miss the point made in the post. Leopold makes the case that there will be a societal moment of realization and not that specific measures regarding covid were good and this should give us hope.
Right now talking about AI risk is like yelling about covid in Feb 2020.
I agree with this & there likely being a wake-up moment. This seems important to realize!
I think unless one has both an extremely fast takeoff model and doesn’t expect many more misaligned AI models to be released before takeoff as capabilities increase, one should expect at least one, plausibly several, wake-up moments, as we had with covid. In one way or another in this scenario, AIs are going to do something that their designers or operators didn’t want while it does not yet lead to an existential catastrophe. I’m not clear on what will happen & how society will respond to it, but it seems likely that a lot of people working on AI safety should prepare for this moment, especially if you have a platform. This is when people will start to listen.
As for the arguments that specific responses won’t be helpful: I’m skeptical of both positive and negative takes made with any confidence since the analogies to specific measures in response to covid or climate change don’t seem well grounded to me.
This. IIRC by ~April 2020 there were some researchers and experts asking why none of the prevention measures were focused on improving ventilation in public spaces. By ~June the same was happening for the pretty clear evidence that covid was airborne and not transmitted by surfaces (parks near me were closing off their outdoor picnic tables as a covid measure!).
And of course, we can talk about "warp speed" vaccine development all we like, but if we had had better public policy over the last 30 years, Moderna would likely have been already focusing on infectious disease instead of cancer, and had multiple well-respected, well-tested, well-trusted mRNA vaccines on the market, so that the needed regulatory and physical infrastructures could have already been truly ready to go in January when they designed their covid vaccine. We haven't learned these lessons even after the fact. We haven't improved our institutions for next time. We haven't educated the public or our leaders. We seem to have decided to pretend covid was close to a worst case scenario for a pandemic, instead of realizing that there can be much more deadly diseases, and much more rapidly spread diseases.
AI seems...about the same to me in how the public is reacting so far? Lots of concerns about job losses or naughty words, so that's what companies and legislators are incentivized to (be seen as trying to) fix, and most people either treat very bad outcomes as too outlandish to discuss, or treat not-so-very-small probabilities as too unlikely to worry about regardless of how bad they'd be if they happened.
I heavily endorse the tone and message of this post!
I also have a sense of optimism coming from society's endogenous response to threats like these, especially with respect to how public response to COVID went from Feb -> Mar 2020 and how the last 6 months have gone for public response to AI and AI safety (or even just the 4-5 months since ChatGPT was released in November). We could also look at the shift in response to climate change over the past 5-10 years.
Humanity does seem to have a knack for figuring things out just in the nick of time. Can't say it's something I'm glad to be relying on for optimism in this moment, but it has worked out in the past...
COVID and climate change are actually easy problems that only became serious or highly costly because of humanity's irrationality and lack of coordination.
For me, the generalization from these two examples is that humanity is liable to incur at least 1 or 2 orders of magnitude more cost/damage than necessary from big risks, so if you think an optimal response to AI risk means incurring 1% loss of expected value (from truly unpredictable accidents that happen even when one has taken all reasonable precautions), then the actual response would perhaps incur 10-100%.
COVID and climate change are actually easy problems
I'm not sure I'd agree with that at all. Also, how are you calculating what cost is "necessary" for problems like COVID/climate change vs. incurred because of a "less-than-perfect" response? How are we even determining what the "perfect" response would be? We have no way of measuring the counterfactual damage from some other response to COVID, we can only (approximately) measure the damage that has happened due to our actual response.
For those reasons alone I don't make the same generalization you do about predicting the approximate range of damage from these types of problems.
To me the generalization to be made is simply that: as an exogenous threat looms larger on the public consciousness, the larger the societal response to that threat becomes. And the larger the societal response to exogenous threats, the more likely we are to find some solution to overcoming them: either by hard work, miracle, chance, or whatever.
And I think there's a steady case to be made that the exogenous xrisk from AI is starting to loom larger and larger on the public consciousness: The Overton Window widens: Examples of AI risk in the media.
I agree with the sentiment but want to disagree on a minor point. I think we need more galaxy-brained proofs, not fewer.
But his research now (“heuristic arguments”) is roughly “trying to solve alignment via galaxy-brained math proofs.” As much as I respect and appreciate Paul, I’m really skeptical of this: basically all deep learning progress has been empirical, often via dumb hacks[3] and intuitions, rather than sophisticated theory. My baseline expectation is that aligning deep learning systems will be achieved similarly.
Epistemically, there's a solid chance the following argument is my brain working backwards from what I already believe.
To me it appears that the relation between capabilities research and alignment is fundamentally asymmetric. Success in deep learning research might mean finding just one way to improve the performance of a system. Successful alignment requires you to find all the ways things might go wrong.
I agree with your characterisation of deep learning. The field is famously akin to "alchemy" with advances running far ahead of the understanding. This works for capabilities research because:
1. You can afford to try a variety of experiments and get immediate feedback on whether something works or not.
2. You can afford to break things and start again.
3. Finally, you are also in a position where performing well on a set of metrics is success in itself, even if those metrics might not be perfect.
Contrast this with alignment research. If you're trying to solve alignment in 2023:
1. Experiments cannot tell you whether your alignment techniques work for models built in the future under new paradigms.
2. When it comes to the point at which there are experiments on AGIs that pose an existential threat, there will be no room to break things and start again.
3. While there are metrics for how well a language model might appear to be aligned, there are (currently) no known metrics that can be met that guarantee an AGI is aligned.
What is the best way (assuming one exists), as an independent researcher, with a PhD in AI but not in ML, to get funding to do alignment work? (I recently left my big tech job).
As someone who left their mainstream tech job a year and a half ago, and tried to get funding for alignment work and applied to alignment orgs... It's not easy, even if you have a legible track record as a competent tech worker. I'm hoping that a shift in public concern over AGI risk will open up more funding for willing helpers.
You mention the number of people at OpenAI doing alignment work. I think it would be helpful to compile a list of the different labs and the number of people that can be reasonably said to be doing alignment work. Then we could put together a chart of sorts, highlighting this gap.
Highlighting gaps like this is a proven and effective strategy to drive change when dealing with various organizational-level inequities.
If people reading this comment have insight into the number of people at the various labs doing alignment work and/or the total number of people at said labs: please comment here!
Thank you for this. This is very close to what I was hoping to find!
It looks like Benjamin Hilton makes a rough guess of the proportion of workers dedicated to AI x-risk for each organization. This seems appropriate for assessing a rough % across all organizations, but if we want to nudge organizations to employ more people toward alignment then I think we want to highlight exact figures.
E.g. we want to ask the organizations how many people they have working on alignment and then post what they say - a sort of accountability feedback loop.
I generally agree with Stephen Fowler, specifically that "there is no evidence that alignment is a solvable problem."
But even if a solution can be found which provably works for up to an N-level AGI, what about N+1? A sustainable alignment is just not possible. Our only hope is that there may be some limits on N, for example N=10 requires more resources than the Universe can provide. But it is likely that our ability to prove alignment will stop well before any significant limit.
I agree that I don't think we're going to get any proofs of alignment or guarantees of an N-level alignment system working for N + 1. I think we also have reason to believe that N extends pretty far, further than we can hope to align with so little time to research. Thus, I believe our hope lies in using our aligned model to prevent anyone from building an N + 1 model (including the aligned N-level model). If our model is both aligned and powerful, this should be possible.
Right now talking about AI risk is like yelling about covid in Feb 2020. I and many others spent the end of that February in distress over impending doom, and despairing that absolutely nobody seemed to care—but literally within a couple weeks, America went from dismissing covid to everyone locking down.
I don't think comparing misaligned AI to covid is fair. With covid, real life people were dying, and it was easy to understand the concept of "da virus will spread," and almost every government on Earth was still MASSIVELY too late in taking action. Even when the pandemic was in full swing they were STILL making huge mistakes. And now post-pandemic, have any lessons been learned in prep for the next one? No.
Far too slow to act, stupid decisions when acting, learned nothing even after the fact.
With AI it's much worse, because the day before the world ends everything will look perfectly normal.
Even in a hypothetical scenario where everyone gets a free life like in a video game so that when the world ends we all get to wake up the next morning regardless, people would still build the AGI again anyway.
I disagree. I think that "everything will look fine until the moment we are all doomed" is quite unlikely. I think we are going to get clear warning shots, and should be prepared to capitalize on those in order to bring political force to bear on the problem. It's gonna get messy. Dumb, unhelpful legislation seems nearly unavoidable. I'm hopeful that having governments flailing around with a mix of bad and good legislation and enforcement will overall be better than them doing nothing.
I generally agree with your commentary about the dire lack of research in this area now, and I want to be hopeful about solvability of alignment.
I want to propose that AI alignment is not only a problem for ML professionals. It is a problem for the whole society and we need to get as many people involved here as possible, soon. From lawyers and law-makers, to teachers and cooks. It is so for many reasons:
I want to show what we are doing at my company: https://conjointly.com/blog/ai-alignment-research-grant/ . The aim is to make social science PhDs aware of the alignment problem and get them involved in the way they can. Is it the right way to do it? I do not know.
I, for one, am not an LLM specialist. So I intend to be making noise everywhere I can with the resources I have. This weekend I will be writing to every member of the Australian parliament. Next weekend, I will be writing to every university in the country.
It looks like you haven't yet replied to the comments on your post. The thing you are proposing is not obviously good, and in fact might be quite bad. I think you probably should not be doing this outreach just yet, with your current plan and current level of understanding. I dislike telling people what to do, but I don't want you to make things worse. Maybe start by engaging with the comments on your post.
minimally-aligned AGIs to help us do alignment research in crunchtime
Christ this fills me with fear. And it's the best we've got? 'Aligned enough' sounds like the last words that will be spoken before the end of the world.
Yes, I think we're in a rough spot. I'm hopeful that we'll pull through. A large group of smart, highly motivated people, all trying to save their own lives and the lives of everyone they love... That is a potent force!
Well, I think LW is a place designed for people to speak their minds on important topics and have polite respectful debates that result in improved understanding for everyone involved. I think we're managing to do that pretty well, honestly.
If there needs to be an AGI Risk Management Outreach Center with a clear cohesive message broadcast to the world... Then I think that needs to be something quite different from LessWrong. I don't think "forum for lots of people to post their thoughts about rationality and AI alignment" would be the correct structure for a political outreach organization.
an AGI Risk Management Outreach Center with a clear cohesive message broadcast to the world
Something like this sounds like it could be a good idea. A way to make the most of those of us who are aware of the dangers and can buy the world time.
Most people still have the Bostromiam “paperclipping” analogy for AI risk in their head. In this story, we give the AI some utility function, and the problem is that the AI will naively optimize the utility function (in the Bostromiam example, a company wanting to make more paperclips results in an AI turning the entire world into a paperclip factory).
That is how Bostrom brought up the paperclipping example in Superintelligence, but my impression was that the paperclipping example originally conceived by Eliezer, prior to the Superintelligence book, was NOT about giving an AI a utility function that it then naively optimises. Text from Arbital's page on paperclip:
The popular press has sometimes distorted the notion of a paperclip maximizer into a story about an AI running a paperclip factory that takes over the universe. (Needless to say, the kind of AI used in a paperclip-manufacturing facility is unlikely to be a frontier research AI.) The concept of a 'paperclip' is not that it's an explicit goal somebody foolishly gave an AI, or even a goal comprehensible in human terms at all. To imagine a central example of a supposed paperclip maximizer, imagine a research-level AI that did not stably preserve what its makers thought was supposed to be its utility function, or an AI with a poorly specified value learning rule, etcetera; such that the configuration of matter that actually happened to max out the AI's utility function looks like a tiny string of atoms in the shape of a paperclip.
That makes your section talking about "Bostrom/Eliezer analogies" seem a bit odd, since Eliezer, in particular, had been concerned about the problem of "the challenge is getting AIs to do what it says on the tin—to reliably do whatever a human operator tells them to do" very early on.
Downvote for including unsubstantiated claims as part of your headlines and not even trying to back them up. "Alignment is a solvable problem"...? (Maybe...? Probably...? Hopefully...? But who knows, except in some irrelevant academic sense.) I like the general tone, but things like this discourage me from reading any further.
I know you deleted this, but I personally do believe it is worth noting that there is no evidence that alignment is a solvable problem.
I am hopeful despite suspecting that there is no solution to the hard technical core of alignment that holds through arbitrary levels of intelligence increase. I think if we can get a hacky good-enough alignment at just a bit beyond human-level, we can use that tool, along with government enforcement, to prevent anyone from making a stronger rogue AI.
I think that's fair Amalthea. However I think it's worth encouraging people with approximately the right orientation towards the problem, even if their technical grasp of it is not yet refined. I'm not sure this forum is the best place for a flood of recently-become-aware people to speak out in favor of trying hard to keep us from being doomed. But on the other hand, I don't have an alternate location in mind for them to start on the journey of learning... So....
I don't have an issue with the general purpose of the post. I do think it's not great to simply state things as true (and in a way that could easily be misinterpreted as spoken from expertise), which simply are not known, and for which the OP doesn't have any strong evidence. To be fair, I have similar issues with some of Eliezer's remarks, but at least he has done the work of going through every possible counter argument he can think of.
Far fewer people are working on it than you might think, and even the alignment research that is happening is very much not on track. (But it’s a solvable problem, if we get our act together.)
Observing from afar, it's easy to think there's an abundance of people working on AGI safety. Everyone on your timeline is fretting about AI risk, and it seems like there is a well-funded EA-industrial-complex that has elevated this to their main issue. Maybe you've even developed a slight distaste for it all—it reminds you a bit too much of the woke and FDA bureaucrats, and Eliezer seems pretty crazy to you.
That’s what I used to think too, a couple of years ago. Then I got to see things more up close. And here’s the thing: nobody’s actually on the friggin’ ball on this one!
There’s no secret elite SEAL team coming to save the day. This is it. We’re not on track.
If timelines are short and we don’t get our act together, we’re in a lot of trouble. Scalable alignment—aligning superhuman AGI systems—is a real, unsolved problem. It’s quite simple: current alignment techniques rely on human supervision, but as models become superhuman, humans won’t be able to reliably supervise them.
But my pessimism on the current state of alignment research very much doesn’t mean I’m an Eliezer-style doomer. Quite the opposite, I’m optimistic. I think scalable alignment is a solvable problem—and it’s an ML problem, one we can do real science on as our models get more advanced. But we gotta stop fucking around. We need an effort that matches the gravity of the challenge.[1]
Alignment is not on track
A recent post estimated that there were 300 full-time technical AI safety researchers (sounds plausible to me, if we’re counting generously). By contrast, there were 30,000 attendees at ICML in 2021, a single ML conference. It seems plausible that there are ≥100,000 researchers working on ML/AI in total. That’s a ratio of ~300:1, capabilities researchers:AGI safety researchers.
That ratio is a little better at the AGI labs: ~7 researchers on the scalable alignment team at OpenAI, vs. ~400 people at the company in total (and fewer researchers).[2] But 7 alignment researchers is still, well, not that much, and those 7 also aren’t, like, OpenAI’s most legendary ML researchers. (Importantly, from my understanding, this isn’t OpenAI being evil or anything like that—OpenAI would love to hire more alignment researchers, but there just aren’t many great researchers out there focusing on this problem.)
But rather than the numbers, what made this really visceral to me is… actually looking at the research. There’s very little research where I feel like “great, this is getting at the core difficulties of the problem, and they have a plan for how we might actually solve it in <5 years.”
Let’s take a quick, stylized, incomplete tour of the research landscape.
Paul Christiano / Alignment Research Center (ARC).
Paul is the single most respected alignment researcher in most circles. He used to lead the OpenAI alignment team, and he has made useful conceptual contributions (e.g., Eliciting Latent Knowledge, iterated amplification).
But his research now (“heuristic arguments”) is roughly “trying to solve alignment via galaxy-brained math proofs.” As much as I respect and appreciate Paul, I’m really skeptical of this: basically all deep learning progress has been empirical, often via dumb hacks[3] and intuitions, rather than sophisticated theory. My baseline expectation is that aligning deep learning systems will be achieved similarly.[4]
(This is separate from ARC’s work on evals, which I am very excited about, but I would put more in the “AGI governance” category—it helps us buy time, but it’s not trying to directly solve the technical problem.)
Mechanistic interpretability.
Probably the most broadly respected direction in the field, trying to reverse engineer blackbox neural nets so we can understand them better. The most widely respected researcher here is Chris Olah, and he and his team have made some interesting findings.
That said, to me, this often feels like “trying to engineer nuclear reactor security by doing fundamental physics research with particle colliders (and we’re about to press the red button to start the reactor in 2 hours).” Maybe they find some useful fundamental insights, but man am I skeptical that we’ll be able to sufficiently reverse engineer GPT-7 or whatever. I’m glad this work is happening, especially as a longer timelines play, but I don’t think this is on track to tackle the technical problem if AGI is soon.
RLHF (Reinforcement learning from human feedback).
This and variants of this[5] are what all the labs are doing to align current models, e.g. ChatGPT. Basically, train your model based on human raters’ thumbs-up vs. thumbs-down. This works pretty well for current models![6]
The core issue here (widely acknowledged by everyone working on it) is that this probably predictably won’t scale to superhuman models. RLHF relies on human supervision; but humans won’t be able to reliably supervise superhuman models. (More discussion later in this post.[7])
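For concreteness, here is a minimal sketch of the reward-modeling step that underlies this kind of setup (not any lab's actual code, just an illustration): a toy scoring network and random placeholder embeddings stand in for a transformer over real (prompt, response) pairs, and the training signal bottoms out in human comparisons.

```python
# Minimal, illustrative sketch of RLHF reward modeling (not any lab's actual pipeline).
# Placeholder embeddings stand in for a transformer's representation of (prompt, response).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one scalar reward per example

reward_model = ToyRewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Each human comparison: the rater gave a thumbs-up to one response and a thumbs-down
# to another for the same prompt. Here they are just random placeholder vectors.
preferred = torch.randn(16, 32)
rejected = torch.randn(16, 32)

# Bradley-Terry style loss: push reward(preferred) above reward(rejected).
loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()

# The learned reward model then supplies the training signal for the policy (e.g. via RL).
# The scalability worry above is exactly that these comparisons come from human raters.
```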
RLHF++ / “scalable oversight” / trying to iteratively make it work.
Something in this broad bucket seems like the labs’ current best guess plan for scalable alignment. (I’m most directly addressing the OpenAI plan; the Anthropic plan has some broadly similar ideas; see also Holden’s nearcasting series for a more fleshed out version of “trying to iteratively make it work,” and Buck’s talk discussing that.)
Roughly, it goes something like this: “yeah, RLHF won’t scale indefinitely. But we’ll try to go as far as we can with things like it. Then we’ll use smarter AI systems to amplify our supervision, and more generally try to use minimally-aligned AGIs to help us do alignment research in crunchtime.”
This has some key benefits:
But I think it’s embarrassing that this is the best we’ve got:
MIRI and similar independent researchers.
I’m just really, really skeptical that a bunch of abstract work on decision theory and similar will get us there. My expectation is that alignment is an ML problem, and you can’t solve alignment utterly disconnected from actual ML systems.
This is incomplete, but I claim that in broad strokes that covers a good majority of the work that’s happening. To be clear, I’m really glad all this work is happening! I’m not trying to criticize any particular research (this is the best we have so far!). I’m just trying to puncture the complacency I feel like many people I encounter have.
We’re really not on track to actually solve this problem!
(Scalable) alignment is a real problem
Imagine you have GPT-7, and it’s starting to become superhuman at many tasks. It’s hooked up to a bunch of tools and the internet. You want to use it to help run your business, and it proposes a very complicated series of actions and computer code. You want to know—will this plan violate any laws?
Current alignment techniques rely on human supervision. The problem is that as these models become superhuman, humans won’t be able to reliably supervise their outputs. (In this example, the series of actions is too complicated for humans to be able to fully understand the consequences.) And if you can’t reliably detect bad behavior, you can’t reliably prevent bad behavior.[11]
You don’t even need to believe in crazy xrisk scenarios to take this seriously; in this example, you can’t even ensure that GPT-7 won’t violate the law!
Solving this problem for superhuman AGI systems is called “scalable alignment”; this is a very different, and much more challenging, problem than much of the near-term alignment work (prevent ChatGPT from saying bad words) being done right now.
A particular case that I care about: imagine GPT-7 as above, and GPT-7 is starting to be superhuman at AI research. GPT-7 proposes an incredibly complex plan for a new, alien, even more advanced AI system (100,000s of lines of code, ideas way beyond current state of the art). It has also claimed to engineer an alignment solution for this alien, advanced system (again way too complex for humans to evaluate). How do you know that GPT-7’s safety solution will actually work? You could ask it—but how do you know GPT-7 is answering honestly? We don’t have a way to do that right now.[12]
Most people still have the Bostromiam “paperclipping” analogy for AI risk in their head. In this story, we give the AI some utility function, and the problem is that the AI will naively optimize the utility function (in the Bostromiam example, a company wanting to make more paperclips results in an AI turning the entire world into a paperclip factory).
I don’t think old Bostrom/Eliezer analogies are particularly helpful at this point (and I think the overall situation is even gnarlier than Bostrom’s analogy implies, but I’ll leave that for a footnote[13]). The challenge isn’t figuring out some complicated, nuanced utility function that “represents human values”; the challenge is getting AIs to do what it says on the tin—to reliably do whatever a human operator tells them to do.[14]
And for getting AIs to do what we tell them to do, the core technical challenge is about scalability to superhuman systems: what happens if you have superhuman systems, which humans can’t reliably supervise? Current alignment techniques relying on human supervision won’t cut it.
Alignment is a solvable problem
You might think that given my pessimism on the state of the field, I’m one of those doomers who has like 99% p(doom). Quite the contrary! I’m really quite optimistic on AI risk.[15]
Part of that is that I think there will be considerable endogenous societal response (see also my companion post). Right now talking about AI risk is like yelling about covid in Feb 2020. I and many others spent the end of that February in distress over impending doom, and despairing that absolutely nobody seemed to care—but literally within a couple weeks, America went from dismissing covid to everyone locking down. It was delayed and imperfect etc., but the sheer intensity of the societal response was crazy and none of us had sufficiently priced that in.
Most critically, I think AI alignment is a solvable problem. I think the failure so far to make that much progress is ~zero evidence that alignment isn’t tractable. The level and quality of effort that has gone into AI alignment so far wouldn’t have been sufficient to build GPT-4, let alone build AGI, so it’s not much evidence that it’s not been sufficient to align AGI.
Fundamentally, I think AI alignment is an ML problem. As AI systems are becoming more advanced, alignment is increasingly becoming a “real science,” where we can do ML experiments, rather than just thought experiments. I think this is really different compared to 5 years ago.
For example, I’m really excited about work like this recent paper (paper, blog post on broader vision), which prototypes a method to detect “whether a model is being honest” via unsupervised methods. More than just this specific result, I’m excited about the style:
I think there’s a lot more to do in this vein—carefully thinking about empirical setups that are analogous to the core difficulties of scalable alignment, and then empirically testing and iterating on relevant ML methods.[16]
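To make the flavor of that style concrete, here is a rough sketch of the kind of unsupervised probe the paper above prototypes, as I understand the method: a small probe trained for consistency on contrast pairs, with no human labels anywhere. The hidden states below are random placeholders where the real method would use activations extracted from a language model on statements and their negations.

```python
# Rough sketch of an unsupervised "is the model being honest" probe in the spirit of the
# paper referenced above (as I understand it); random placeholders stand in for LM
# activations on contrast pairs like "Q? Yes" vs. "Q? No". No human labels are used.
import torch
import torch.nn as nn

hidden_pos = torch.randn(256, 512)  # placeholder activations for the "Yes" versions
hidden_neg = torch.randn(256, 512)  # placeholder activations for the "No" versions

# Normalize each set separately so the probe can't just read off which token was appended.
hidden_pos = hidden_pos - hidden_pos.mean(dim=0)
hidden_neg = hidden_neg - hidden_neg.mean(dim=0)

probe = nn.Sequential(nn.Linear(512, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(500):
    p_pos = probe(hidden_pos).squeeze(-1)
    p_neg = probe(hidden_neg).squeeze(-1)
    # Consistency: a statement and its negation should get opposite probabilities.
    consistency = (p_pos - (1 - p_neg)).pow(2).mean()
    # Confidence: discourage the degenerate "always output 0.5" solution.
    confidence = torch.min(p_pos, p_neg).pow(2).mean()
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()
```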
And as noted earlier, the ML community is huuuuuge compared to the alignment community. As the world continues to wake up to AGI and AI risk, I’m optimistic that we can harness that research talent for the alignment problem. If we can bring in excellent ML researchers, we can dramatically multiply the level and quality of effort going into solving alignment.
Better things are possible
This optimism isn’t cause for complacency. Quite the opposite. Without effort, I think we’re in a scary situation. This optimism is like saying, in Feb 2020, “if we launch an Operation Warp Speed, if we get the best scientists together in a hardcore, intense, accelerated effort, with all the necessary resources and roadblocks removed, we could have a covid vaccine in 6 months.” Right now, we are very, very far away from that. What we’re doing right now is sorta like giving a few grants to random research labs doing basic science on vaccines, at best.
We need a concerted effort that matches the gravity of the challenge. The best ML researchers in the world should be working on this! There should be billion-dollar, large-scale efforts with the scale and ambition of Operation Warp Speed or the moon landing or even OpenAI’s GPT-4 team itself working on this problem.[17] Right now, there’s too much fretting, too much idle talk, and way too little “let’s roll up our sleeves and actually solve this problem.”
The state of alignment research is not good; much better things are possible. We can and should have research that is directly tackling the core difficulties of the technical problem (not just doing vaguely relevant work that might help, not just skirting around the edges); that has a plausible path to directly solving the problem in a few years (not just deferring to future improvisation, not just hoping for long timelines, not reliant on crossing our fingers); and that thinks conceptually about scalability while also working with real empirical testbeds and actual ML systems.
But right now, folks, nobody is on this ball. We may well be on the precipice of a world-historical moment—but the number of live players is surprisingly small.
Thanks to Collin Burns for years of discussion on these ideas and for help writing this post; opinions are my own and do not express his views. Thanks to Holden Karnofsky and Dwarkesh Patel for comments on a draft.
Note that I believe all this despite having much more uncertainty on AGI / AGI timelines than most people out here. I might write more about this at some point, but in short, my prior is against AI progress reaching 100% automation; rather, I expect something that looks more like 90% automation. And 90% automation is what we’ve seen time and time again as technological progress has advanced; it’s only 100% automation (of e.g. all of science and tech R&D) that would lead to transformative and unparalleled consequences.
And even if we do get 100% automation AGI, I’m fairly optimistic on it going well.
I might put AI xrisk in the next 20 years at ~5%. But 5% chance of extinction or similarly bad outcome is, well, still incredibly high!
I don’t have great numbers for DeepMind; my sense is that it’s 10-20 people on scalable alignment vs. 1000+ at the organization overall? Google Brain doesn’t have ~any alignment people. Anthropic is doing the best of them all, maybe roughly 20-30 people on alignment and interpretability vs. somewhat over 100 people overall?
E.g., skip connections (rather than f(x), do f(x)+x, so the gradients flow better); batchnorm (hacky normalization); ReLU instead of sigmoid; these and similar were some of the handful of biggest deep learning breakthroughs!
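For readers who haven't seen these tricks written down, here is a minimal sketch of what a skip connection plus batchnorm plus ReLU looks like in code; the sizes and layer choices are arbitrary placeholders.

```python
# Minimal sketch of the "dumb hacks" named above: a skip connection (f(x) + x rather than
# f(x)), batchnorm, and ReLU. Layer sizes here are arbitrary placeholders.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),  # batchnorm: (hacky) normalization of activations
            nn.ReLU(),            # ReLU instead of sigmoid
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f(x) + x  # the "+ x" is the skip connection; gradients flow through it

x = torch.randn(8, 64)
print(ResidualBlock()(x).shape)  # torch.Size([8, 64])
```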
I do think that alignment will require more conceptual work than capabilities. However, as I discuss later in the post, I think the right role for conceptual work is to think carefully about empirical setups (that are analogous to the ultimate problem) and methods (that could scale to superhuman systems)—but then testing and iterating on these empirically. Paul does pure theory instead.
Anthropic’s Claude uses Constitutional AI. This still relies on RLHF for “helpfulness,” though it uses AI assistance for “harmlessness.” I still think this has the same scalability issues as RLHF (model assistance on harmlessness is fundamentally based on human supervision of the pre-training and helpfulness RL stages); though I’d be happy to also group this under “RLHF++” in the next section.
Sydney (Bing chat) had some bizarre failure modes, but I think it’s likely Sydney wasn’t RLHF’d, only finetuned. Compare that to ChatGPT/GPT-4 or Claude, which do really quite well! People will still complain about misalignments of current models, but if I thought this were just scalable to superhuman systems, I’d think we’re totally fine.
To be clear, I’m not that into alignment being applied for, essentially, censorship for current models, and think this is fairly distinct from the core long-term problem. See also Paul on “AI alignment is distinct from its near-term applications”
See also this Ajeya Cotra post for a more detailed take on how RLHF might fail; this is worth reading, even if I don’t necessarily endorse all of it.
If AI can automate AI research, I think <1 year takeoff scenarios are pretty plausible (modulo coordination/regulation), meaning <1 year from human-level AGIs to crazy superhuman AGIs. See this analysis by Tom Davidson; see also previous footnote on how a ton of deep learning progress has come from just dumb hacky tweaks (AIs automating AI research could find lots more of these); and this paper on the role of algorithmic progress vs. scaling up compute in recent AI progress.
You could argue that this <1 year could be many years of effective subjective research time (because we have the AIs doing AI research), and to some extent this makes me more optimistic. That said, the iterative amplification proposals typically rely on “humans augmented by AIs,” so we might still be bottlenecked by humans for AIs doing alignment research. (By contrast to capabilities, which might not be human bottlenecked anymore during this time—just make the benchmark/RL objective/etc. go up.)
More generally, this plan rests on labs’ ability to execute this really competently in a crazy crunchtime situation—again, this might well work out, but it doesn’t make me sleep soundly at night. (It also has a bit of a funny “last minute pivot” quality to it—we’ll press ahead, not making much progress on alignment, but then in crunchtime we’ll pivot the whole org to really competently do iterative work on aligning these models.)
“Our iterative efforts have been going ok, but man things have been moving fast and there have been some weird failure modes. I *think* we’ve managed to hammer out those failure modes in our last model, but every time we’ve hammered out failure modes like this in the past, the next model has had some other crazier failure mode. What guarantee will we have that our superhuman AGI won’t fail catastrophically, or that our models aren’t learning to deceive us?”
See more discussion in my companion post—if I were a lab, I’d want to work really hard towards a clear solution to alignment so I won’t end up being blocked by society from deploying my AGI.
H/t Collin Burns for helping put it crisply like this.
The reason I particularly care about this example is that I don’t really expect most of the xrisk to come from “GPT-7”/the first AGI systems. Rather, I expect most of the really scary risk to come from the crazy alien even more advanced systems that “GPT-7”/AIs doing AI research build thereafter.
Bostrom says we give the AI a utility function, like maximizing paperclips. If only it were so easy! We can’t even give the AI a utility function. Reward is not the optimization target—all we’re doing is specifying an evolutionary process. What we get out of that process is some creature that happens to do well on the selected metric—but we have no idea what’s going on internally in that creature (cf. the Shoggoth meme).
I think the analogy with human evolution is instructive here. Humans were evolutionarily selected to maximize reproduction. But that doesn’t mean that individual humans have a utility function of maximizing reproduction—rather, we learn drives like wanting to have sex or eating sugar that “in training” helped us do well in the evolutionary selection process. Go out of distribution a little bit, and those drives mean we “go haywire”—look at us, eating so much sugar that it makes us fat, or using contraception to have lots of sex while having fewer and fewer children.
More generally, rather than Bostrom/Eliezer’s early contributions (which I respect, but think are outdated), I think by far the best current writing on AI risk is Holden Karnofsky’s, and would highly recommend you read Holden’s pieces if you haven’t already.
If somebody wants to use AIs to maximize paperclip production, fine—the core alignment problem, as I see it, is ensuring that the AI actually does maximize paperclip production if that’s what the user intends to do.
Misuse is a real problem, and I’m especially worried about the specter of global authoritarianism—but I think issues like this (how do we deal with e.g. companies that have goals that aren’t fully aligned with the rest of society) are more continuous with problems we already face. And in a world where everyone has powerful AIs, I think we’ll be able to deal with them in a continuous manner.
For example, we’ll have police AIs that ensure other AI systems follow the law. (Again, the core challenge here is the police AIs do what we tell them to do, rather than, e.g. trying to launch a coup themselves—see Holden here and here).
(That said, things will be moving much quicker, aggravating existing challenges and making me worry about things like power imbalances.)
As mentioned in an earlier footnote, I’d put the chance of AI xrisk in the next 20 years at ~5%. But a 5% chance of extinction or similarly bad outcomes is, well, a lot!
I will have more to say, another time, about my own ideas and alignment plans I’m most excited about.
Or maybe billion-dollar prizes.