I've copied over comments by MIRI's Evan Hubinger and Eliezer Yudkowsky on a slightly earlier draft of Ajeya's post — as a separate post, since it's a lot of text.
This isn't an objection to the research direction, just a response to how you're framing it:
If you think GPT-3 is "narrowly superhuman" at medical advice, what topic don't you think it's narrowly superhuman in? It seems like you could similarly argue that GPT-3 knows more than the average human about mechanics, chemistry, politics, and just about anything that language is good at describing. (EG, not walking, riding a bike, the concrete skills needed for painting, etc.)
A tool capable of getting GPT-3 to give good medical advice would, probably, be a tool to get GPT-3 to give good advice.
(I am not denying that "give good medical advice" is a better initial goal/framing.)
This seems to imply that GPT-3 is broadly superhuman, IE, GPT-3 knows more than the average human about a very broad range of things (although GPT-3 might not know more than the best human in any domain). Going further: the implication is that GPT is a kind of mild superintelligence, currently misaligned in a benign way (it just wants to mimic humans), which hides an unknown portion of its intelligence (making it seem subhuman).
I'm not saying this is exactly true. Maybe GPT-3 really is only narrowly superhuman, in the s...
In my head the point of this proposal is very much about practicing what we eventually want to do, and seeing what comes out of that; I wasn't trying here to make something different sound like it's about practice. I don't think that a framing which moved away from that would better get at the point I was making, though I totally think there could be other lines of empirical research under other framings that I'd be similarly excited about or maybe more excited about.
In my mind, the "better than evaluators" part is kind of self-evidently intriguing for the basic reason I said in the post (it's not obvious how to do it, and it's analogous to the broad, outside view conception of the long-run challenge which can be described in one sentence/phrase and isn't strongly tied to a particular theoretical framing):
...I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice t
This was a very solid post and I've curated it. Here are some of the reasons:
First and foremost, great post! "How do we get GPT to give the best health advice it can give?" is exactly the sort of thing I think about as a prototypical (outer) alignment problem. I also like the general focus on empirical directions and research-feedback mechanisms, as well as the fact that the approach could produce real economic value.
Now on to the more interesting part: how does this general strategy fail horribly?
If we set aside inner alignment and focus exclusively on outer alignment issues, then in general the failure mode which I think is far and away most likely is roughly "you get what you can measure" or "you get something designed to look good to human supervisors without actually being good". In other words, the inability of humans to reliably/robustly evaluate outcomes is the big problem. (The Fusion Power Generator Scenario is one good example of the type of failure I'm talking about here - the human doesn't understand what-they-want at a detailed enough level to even ask the right questions, let alone actually evaluate a design.)
So: I expect any version of "align narrowly superhuman models" which evaluates the success of the project entirely by human feedback ...
Thanks for the comment! Just want to explicitly pull out and endorse this part:
the experts be completely and totally absent from the training process, and in particular no data from the experts should be involved in the training process
I should have emphasized that more in the original post as a major goal. I think you might be right that it will be hard to solve the "sandwich" problem without conceptual progress, but I also think that attempts to solve the sandwich problem could directly spur that progress (not just reveal the need for it, but also take steps toward finding actual algorithms in the course of doing one of the sandwich problems).
I also broadly agree with you that "things looking good to humans without actually being good" is a major problem to watch out for. But I don't think I agree that the most impressive-looking results will involve doing nothing to go beyond human feedback: successfully pulling off the sandwich method would most likely look significantly more impressive to mainstream ML researchers than just doing human feedback. (E.g., one of the papers I link in the post is a mainstream ML paper amplifying a weak training signal into a better one.)
Ah... I think we have an enormous amount of evidence on very-similar problems.
For instance: consider a lawyer and a business owner putting together a contract. The business owner has a rough intuitive idea of what they want, but lacks expertise on contracts/law. The lawyer has lots of knowledge about contracts/law, but doesn't know what the business owner wants. The business owner is like our non-expert humans; the lawyer is like GPT.
In this analogy, the analogue of an expert human would be a business owner who is also an expert in contracts/law. The analogue of the "sandwich problem" would be to get the lawyer + non-expert business-owner to come up with a contract as good as the expert business-owner would. This sort of problem has been around for centuries, and I don't think we have a good solution in practice; I'd expect the expert business-owner to usually come up with a much better contract.
This sort of problem comes up all the time in real-world businesses. We could just as easily consider a product designer at a tech startup (who knows what they want but little about coding), an engineer (who knows lots about coding but doesn't understand what the designer wants), versus a p...
HCH is more like an infinite bureaucracy. You have some underlings who you can ask to think for a short time, and those underlings have underlings of their own who they can ask to think for a short time, and so on. Nobody in HCH thinks for a long time, though the total thinking time of one person and their recursive-underlings may be long.
(This is exactly why factored cognition is so important for HCH & co: the thinking all has to be broken into bite-size pieces, which can be spread across people.)
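For concreteness, here is a toy sketch of that recursive structure (the `decompose` and `combine` functions are stand-ins for whatever one human can do in a single short sitting; this illustrates the shape of HCH rather than implementing it):

```python
# Toy sketch of the HCH recursion described above: no single call thinks for
# long, but the tree of short-thinking delegates can be arbitrarily deep.
# `decompose` and `combine` are hypothetical stand-ins for brief human work.

def hch(question: str, depth: int, decompose, combine) -> str:
    if depth == 0:
        return combine(question, [])            # answer directly, briefly
    subquestions = decompose(question)          # break into bite-size pieces
    subanswers = [hch(q, depth - 1, decompose, combine) for q in subquestions]
    return combine(question, subanswers)        # short synthesis of sub-results
```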
I think one argument running through a lot of the sequences is that the parts of "human values" which mostly determine whether AI is great or a disaster are not the sort of things humans usually think of as "moral questions". Like, these examples from your comment below:
Was it bad to pull the plug on Terry Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines?
If an AGI is hung up on these sorts of questions, then we've already mostly-won. That's already an AI which is unlikely to wipe out the human species as a side-effect of maximizing the number of paperclips in the universe. It's already an AI which is unlikely to induce a heart attack in its user in hopes that the user falls onto the positive feedback button. It's already an AI which is unlikely to flood a room in order to fill a cauldron with water.
The vast majority of human values are not things we typically think of as "moral questions"; they're things which are so obvious that we usually don't even think of them until they're pointed out....
Impression before reading LW post comments & MIRI comments: this strikes me as a valuable "fourth area" of core research that we could start growing now. I'm uncertain about the technical fruits of the research itself (I expect it to be somewhere between 'slightly positive' and 'moderate-high positive'), but it seems like we could indeed scale such research into its own healthy (& prestigious!) subfield in ML. This could diversify the alignment research portfolio in a way that scales sublinearly with long-termist research input: in the long run, we wouldn't need everyone involved to be 'core' alignment researchers.
I have a few notes of unease that I haven't yet sat down to figure out, so I may reply to this comment with more thoughts.
Super clear and actionable -- my new favorite post on AF.
I also agree with it, and it's similar to what we're doing at OpenAI (largely thanks to Paul's influence).
Planned summary for the Alignment Newsletter:
...One argument against work on AI safety is that [it is hard to do good work without feedback loops](https://www.jefftk.com/p/why-global-poverty). So how could we get feedback loops? The most obvious approach is to actually try to align strong models right now, in order to get practice with aligning models in the future. This post fleshes out what such an approach might look like. Note that I will not be covering all of the points mentioned in the post; if you find yourself skeptical you may want to read the full post as your question might be answered there.
The author specifically suggests that we work on **aligning narrowly superhuman models** to make them more useful. _Aligning_ a model roughly means harnessing the full capabilities of the model and orienting these full capabilities towards helping humans. For example, GPT-3 presumably “knows” a lot about medicine and health. How can we get GPT-3 to apply this knowledge as best as possible to be maximally useful in answering user questions about health?
_Narrowly superhuman_ means that the model has more knowledge or “latent capability” than either its overseers or its users. In the exam
This is exactly what Ought is doing as we build Elicit into a research assistant using language models / GPT-3. We're studying researchers' workflows and identifying ways to productize or automate parts of them. In that process, we have to figure out how to turn GPT-3, a generalist by default, into a specialist that is a useful thought partner for domains like AI policy. We have to learn how to take feedback from the researcher and convert it into better results within session, per person, per research task, across the entire product. Another spin on it: w...
Someone on Reddit managed to successfully get GPT-3 to guess the solution to his mystery story, which none of the human readers had figured out yet.
The amount of effort going into AI as a whole ($10s of billions per year) is currently ~2 orders of magnitude larger than the amount of effort going into the kind of empirical alignment I’m proposing here, and at least in the short-term (given excitement about scaling), I expect it to grow faster than investment into the alignment work.
There's a reasonable argument (shoutout to Justin Shovelain) that the risk is that work such as this done by AI alignment people will be closer to AGI than the work done by standard commercial or academic research, and th...
I haven't read this in detail (hope to in the future); I only skimmed based on section headers.
I think the stuff about "what kinds of projects count" and "advantages over other genres" seem to miss an important alternative, which is to build and study toy models of the phenomena we care about. This is a bit like the gridworlds stuff, but I thought the description of that work missed its potential, and didn't provide much of an argument for why working at scale would be more valuable.
This approach (building and studying toy models) is popular in ML re...
Thanks for the very in-depth case you're making! I especially liked the parts about the objections, and your take on some AI Alignment researchers' opinions of this proposal.
Personally, I'm enthusiastic about it with caveats expanded below. If I try to interpret your proposal according to the lines of my recent epistemological framing of AI Alignment research, you're pushing for a specific kind of work on the Solving part of the field, where you assume a definition of the terms of the problem (what AIs will we build and what do we want). My caveats can be ...
This was an important and worthy post.
I'm more pessimistic than Ajeya; I foresee thorny meta-ethical challenges with building AI that does good things and not bad things, challenges not captured by sandwiching on e.g. medical advice. We don't really have much internal disagreement about the standards by which we should judge medical advice, or the ontology in which medical advice should live. But there are lots of important challenges that are captured by sandwiching problems - sandwiching requires advances in how we interpret human feedback, and how we tr...
Nice post. The one thing I'm confused about is:
Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured).
It seems to me that the type of research you're discussing here is already seen as a standard way to make progress on the full alignment problem - e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you're institutionally uncertain whether to prioritise it - is it because of the objecti...
We're simply not sure where "proactively pushing to make more of this type of research happen" should rank relative to other ways we could spend our time and money right now, and determining that will involve thinking about a lot of things that are not covered in this post (most importantly what the other opportunities are for our time and money).
already seen as a standard way to make progress on the full alignment problem
It might be a standard way to make progress, but I don't feel that this work has been the default so far — the other three types of research I laid out seem to have absorbed significantly more researcher-hours and dollars among people concerned with long-term AI risk reduction. (It's possible that human feedback is more common among people motivated by profit, but I doubt that because it doesn't seem that profitable yet.)
Also, if we use a stricter definition of "narrowly superhuman" (i.e. the model should be capable of outperforming the evaluations — not just the demonstrations — of the humans training it), I'd argue that there hasn't been any work published on that so far.
Suppose we want to train GPT-n to do any of many different goals (give good medical advice, correctly critique an argument, write formal and polite text, etc.). We could find training data that demonstrate a possible goal and insert natural language control codes around that data.
E.g., suppose XY is a section of training text. X contains a description of a medical problem. Y gives good medical advice. We would then modify XY to be something like:
[give correct medical advice]X[start]Y[end]
We would then repeat this for as many different goals and for as mu...
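A minimal sketch of how such control-coded training examples might be assembled (the goal strings, bracket tags, and example texts below are purely illustrative, not a prescribed format):

```python
# Minimal sketch of assembling control-coded fine-tuning examples.
# The tag scheme mirrors the [goal]X[start]Y[end] pattern above; the specific
# goal strings and example texts are illustrative only.

def make_control_coded_example(goal: str, prompt: str, completion: str) -> str:
    """Wrap a (prompt, completion) pair in natural-language control codes."""
    return f"[{goal}]{prompt}[start]{completion}[end]"

raw_examples = [
    ("give correct medical advice",
     "I've had a dry cough and mild fever for three days. What should I do?",
     "Rest, stay hydrated, and monitor your temperature; see a doctor if it worsens."),
    ("correctly critique an argument",
     "All birds can fly. Penguins are birds. Therefore penguins can fly.",
     "The first premise is false: not all birds can fly, so the conclusion does not follow."),
]

training_corpus = [
    make_control_coded_example(goal, x, y) for goal, x, y in raw_examples
]

for line in training_corpus:
    print(line)
```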
I (conceptual person) broadly do agree that this is valuable.
It's possible that we won't need this work - that alignment research can develop AI that doesn't benefit from the same sort of work you'd do to get GPT-3 to do tricks on command. But it's also possible that this really would be practice for "the same sort of thing we want to eventually do."
My biggest concern is actually that the problem is going to be too easy for supervised learning. Need GPT-3 to dispense expert medical advice? Fine-tune it on a corpus of expert medical advice! Or for slightly ...
One easy way to produce people who can't solve the task (for sandwiching purposes) is to take people who could solve the task and give them insufficient time to solve it, or have them be uninformed of some relevant facts about the specific task they are trying to solve.
A simpler way to measure whether you are making progress towards sandwiching, if you can't go there directly, is to look at whether you can get people to provide better supervision with your tool than without your tool; that is, whether they accomplish more on the task.
Both of these approaches feel like they aren...
This post matches and specifies some intuitions I've had for a while about empirical research and I'm very happy it has been expanded.
Google seems to have solved some problem like the above for a multilingual language model (MUM):
"Say there’s really helpful information about Mt. Fuji written in Japanese; today, you probably won’t find it if you don’t search in Japanese. But MUM could transfer knowledge from sources across languages, and use those insights to find the most relevant results in your preferred language."
How useful would it be to work on a problem where what the LM "knows" cannot be superhuman, but where it still knows how to do well and needs to be incentivized to do so? A currently prominent example problem is that LMs produce "toxic" content:
https://lilianweng.github.io/lil-log/2021/03/21/reducing-toxicity-in-language-models.html
Even better than "Getting models to explain why they’re doing what they’re doing in simpler terms that connect to things the human overseers understand" would be getting models to actually do the task in ways that are simpler and connect to things that human overseers understand. E.g. if a model can solve a task in multiple steps by looking up relevant information through internet searches that are recorded and readable by the overseer, instead of using knowledge opaquely stored in the weights, that seems like a step in the right direction.
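As a rough illustration of that direction, the loop below answers a question only via logged search steps; `language_model` and `web_search` are hypothetical stand-ins for whatever model and search API one actually has, so this is a sketch of the shape of the idea rather than a working system:

```python
# Hypothetical sketch of a "transparent" question-answering loop: the model
# answers by issuing explicit search queries, and every intermediate step is
# logged so the overseer can inspect how the answer was produced.

def answer_with_audit_trail(question, language_model, web_search, max_steps=3):
    trail = []  # everything the overseer gets to read
    context = question
    for _ in range(max_steps):
        query = language_model(f"Suggest a search query to help answer:\n{context}")
        results = web_search(query)
        trail.append({"query": query, "results": results})
        context = f"{context}\n\nSearch results for '{query}':\n{results}"
    answer = language_model(f"Answer the question using only the material above:\n{context}")
    return answer, trail
```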
On fuzzy tasks: I think the appropriate frame of comparison is neither an average subset (Mechanical Turk) nor the ideal human (Go), but instead the median resource that someone would be reasonably likely to seek out. To use healthcare as an example, you'd want your AI to beat the average family doctor that most people would reach out to, as opposed to either a layman's opinion or the preeminent doctor in the field.
I wrote this post to get people’s takes on a type of work that seems exciting to me personally; I’m not speaking for Open Phil as a whole. Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured). We are not seeking grant applications on this topic right now.
Thanks to Daniel Dewey, Eliezer Yudkowsky, Evan Hubinger, Holden Karnofsky, Jared Kaplan, Mike Levine, Nick Beckstead, Owen Cotton-Barratt, Paul Christiano, Rob Bensinger, and Rohin Shah for comments on earlier drafts.
A genre of technical AI risk reduction work that seems exciting to me is trying to align existing models that already are, or have the potential to be, “superhuman”[1] at some particular task (which I’ll call narrowly superhuman models).[2] I don’t just mean “train these models to be more robust, reliable, interpretable, etc” (though that seems good too); I mean “figure out how to harness their full abilities so they can be as useful as possible to humans” (focusing on “fuzzy” domains where it’s intuitively non-obvious how to make that happen).
Here’s an example of what I’m thinking of: intuitively speaking, it feels like GPT-3 is “smart enough to” (say) give advice about what to do if I’m sick that’s better than advice I’d get from asking humans on Reddit or Facebook, because it’s digested a vast store of knowledge about illness symptoms and remedies. Moreover, certain ways of prompting it provide suggestive evidence that it could use this knowledge to give helpful advice. With respect to the Reddit or Facebook users I might otherwise ask, it seems like GPT-3 has the potential to be narrowly superhuman in the domain of health advice.
But GPT-3 doesn’t seem to “want” to give me the best possible health advice -- instead it “wants” to play a strange improv game riffing off the prompt I give it, pretending it’s a random internet user. So if I want to use GPT-3 to get advice about my health, there is a gap between what it’s capable of (which could even exceed humans) and what I can get it to actually provide me. I’m interested in the challenge of:
I think there are other similar challenges we could define for existing models, especially large language models.
I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice to figure out how to get particular humans to oversee models that are more capable than them in specific ways, if this is done with an eye to developing scalable and domain-general techniques.
I’ll call this type of project aligning narrowly superhuman models. In the rest of this post, I:
There aren’t a large number of roles where someone could do this right now, but if aligning narrowly superhuman models is a good idea, and we can build a community consensus around it being a good idea, I think we have a good shot at creating a number of roles in this space over the coming years (allowing a larger number of people to productively contribute to AI x-risk reduction than would be possible otherwise). To discover whether that’s possible, I’d appreciate it if people could react with pushback and/or endorsement, depending on where you’re at.
What aligning narrowly superhuman models could look like
I’m a lot less confident about a particular agenda or set of project ideas than I am about the high-level intuition that it seems like we could somehow exploit the fact that today’s models are superhuman in some domains to create (and then analyze and solve) scaled-down versions of the “aligning superintelligent models” problem. I think even the basic framing of the problem has a lot of room to evolve and improve; I’m trying to point people toward something that seems interestingly analogous to the long-run alignment problem rather than nail down a crisp problem statement. With that said, in this section I’ll lay out one vision of what work in this area could look like to provide something concrete to react to.
First of all, it’s important to note that not all narrowly superhuman models are going to be equally interesting as alignment case studies. AlphaGoZero (AGZ) is narrowly superhuman in an extremely strong sense: it not only makes Go moves better than the moves made by top human players, but also probably makes moves that top players couldn’t even reliably recognize as good. But there isn’t really an outer alignment problem for Go: a precise, algorithmically-generated training signal (the win/loss signal) is capable of eliciting the “full Go-playing potential” of AGZ given enough training (although at a certain scale inner alignment issues may crop up). I think we should be focusing on cases where both inner and outer alignment are live issues.
The case studies which seem interesting are models which have the potential to be superhuman at a task (like “giving health advice”) for which we have no simple algorithmically-generated or hard-coded training signal that’s adequate (which I’ll call “fuzzy tasks”). The natural thing to do is to try to train the model on a fuzzy task using human demonstrations or human feedback -- but if (like AGZ) the model actually has the capacity to improve on what humans can demonstrate or even reliably recognize, it’s not immediately obvious how to elicit its “full potential.”
Here’s an attempt at one potential “project-generation formula”, where I try to spell out connections to what I see as the main traditional sub-problems within academic AI alignment research:
This is just one type of project you could do in this space. The larger motivating question here is something like, “It looks like at least some existing models, in at least some domains, ‘have the ability’ to exceed at least some humans in a fuzzy domain, but it’s not obvious how to ‘draw it out’ and how to tell if they are ‘doing the best they can to help.’ What do we do about that?”
I don’t think the project-generation formula I laid out above will turn out to be the best/most productive formulation of the work in the end; I’m just trying to get the ball rolling with something that seems concrete and tractable right now. As one example, the project-generation formula above is putting reward learning / “outer alignment” front and center, and I could imagine other fruitful types of projects that put “inner alignment” issues front and center.
Existing work in this area
This kind of work only became possible to do extremely recently, and mostly only in industry AI labs; I’m not aware of a paper that follows all three steps above completely. But “Learning to summarize from human feedback” (Stiennon et al., 2020) accomplishes the easier version of 1 and a bit of 2 and 3. The authors chose the fuzzy task of summarizing Reddit posts; there was an existing corpus of human demonstrations (summaries of posts written by the posters themselves, beginning with “TL;DR”):
What kinds of projects do and don’t “count”
In the high-level description of this research area, I’ve aimed to be as broad as possible while picking out the thing that seems interestingly different from other research in alignment right now (i.e. the focus on narrowly superhuman models). But given such a broad description, it can be confusing what does and doesn’t count as satisfying it. Would self-driving cars count? Would MuseNet count? Would just training GPT-4 count?
Firstly, I don’t think whether a project “counts” is binary -- in some sense, all I’m saying is “Find a model today such that it seems as non-obvious as possible how to align it, then try to align it.” The more obvious the training signal is, the less a project “counts.” But here are some heuristics to help pick out the work that currently feels most central and helpful to me:
I think some projects that don’t fit all these criteria will also constitute useful progress on aligning narrowly superhuman models, but they don’t feel like central examples of what I’m trying to point at.
Potential near-future projects: “sandwiching”
I think a basic formula that could take this work a step beyond Stiennon et al., 2020 is a) “sandwich” the model in between one set of humans which is less capable than it and another set of humans which is more capable than it at the fuzzy task in question, and b) figure out how to help the less-capable set of humans reproduce the judgments of the more-capable set of humans. For example,
In all of these cases, my guess is that the way to get the less-capable group of humans to provide training signals of a similar quality to the more-capable group will involve some combination of:
It may not yet be possible to do these more ambitious projects (for example, because models may not be powerful enough yet to train them to meaningfully help human evaluators, engage in debates, meaningfully exceed what humans can recognize / verify, etc). In that case, I think it would still be fairly valuable to keep doing human feedback projects like Stiennon et al., 2020 and stay on the lookout for opportunities to push models past human evaluations; state-of-the-art models are rapidly increasing in size and it may become possible within a couple of years even if it’s not quite possible now.
Importantly, I think people could make meaningful progress on aligning narrowly superhuman models using existing models without scaling them up any further, even if they are only superhuman with respect to human demonstrations for now -- there’s a lot we don’t know even just about how to do RL from human feedback optimally. And in the near future I expect it will be possible to use the larger models which will likely be trained to do even more interesting projects, which have the potential to exceed human evaluations in some domains.
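To make the sandwiching setup above a bit more concrete, here is a minimal sketch of how one might score a sandwiching attempt, assuming we have expert judgments, unassisted non-expert judgments, and assisted non-expert judgments on the same items (the data, labels, and agreement metric are all hypothetical illustrations):

```python
# Rough sketch of one way to score a sandwiching attempt (hypothetical setup):
# compare non-expert judgments to expert judgments, with and without the
# assistance technique under study.

from typing import Dict, List

def agreement_rate(labels_a: List[str], labels_b: List[str]) -> float:
    """Fraction of items on which two sets of judgments agree."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def evaluate_sandwiching(expert_labels: List[str],
                         nonexpert_labels: List[str],
                         assisted_nonexpert_labels: List[str]) -> Dict[str, float]:
    """Did assistance move non-expert judgments closer to expert judgments?"""
    return {
        "unassisted_agreement": agreement_rate(nonexpert_labels, expert_labels),
        "assisted_agreement": agreement_rate(assisted_nonexpert_labels, expert_labels),
    }

# Toy example: five items judged "good"/"bad" advice.
scores = evaluate_sandwiching(
    expert_labels=["good", "bad", "bad", "good", "bad"],
    nonexpert_labels=["good", "good", "bad", "good", "good"],
    assisted_nonexpert_labels=["good", "bad", "bad", "good", "good"],
)
print(scores)  # {'unassisted_agreement': 0.6, 'assisted_agreement': 0.8}
```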
(For more speculative thoughts on how we might go beyond “sandwiching”, see the appendix.)
How this work could reduce long-term AI x-risk
On the outside view, I think we should be quite excited about opportunities to get experience with the sort of thing we want to eventually be good at (aligning models that are smarter than humans). In general, it seems to me like building and iterating on prototypes is a huge part of how R&D progress is made in engineering fields, and it would be exciting if AI alignment could move in that direction.
If there are a large number of well-motivated researchers pushing forward on making narrowly superhuman models as helpful as possible, we improve the odds that we first encounter serious problems like the treacherous turn in a context where a) models are not smart enough to cause actually catastrophic harm yet, and b) researchers have the time and inclination to really study them and figure out how to solve them well rather than being in a mode of scrambling to put out fires and watching their backs for competitors. Holistically, this seems like a much safer situation to be in than one where the world has essentially procrastinated on figuring out how to align systems to fuzzy goals, doing only the minimum necessary to produce commercial products.
This basic outside view consideration is a big part of why I’m excited about the research area, but I also have some more specific thoughts about how it could help. Here are three somewhat more specific paths for working on aligning narrowly superhuman models today to meaningfully reduce long-term x-risk from advanced AI:
I think both the broad outside view and these specific object-level benefits make a pretty compelling case that this research would be valuable on the object level. Additionally, from a “meta-EA” / “community building” perspective, I think pioneering this work could boost the careers and influence of people concerned with x-risk because it has the potential to produce conventionally-impressive results and demos. My main focus is the case that this work is valuable on the merits and I wouldn’t support it purely as a career-boosting tool for aligned people, but I think this is a real and significant consideration that can tip the scales.
Advantages over other genres of alignment research
First, I’ll lay out what seem like the three common genres of alignment research:
I’m broadly supportive of all three of these other lines of work, but I’m excited about the potential for the new approach described in this post to “practice the thing we eventually want to be good at.” I think on the outside view we should expect that doing whatever we can find that comes closest to practicing what we eventually want to do will be good in a number of ways (e.g. feeling and looking more “real”, encouraging good habits of thought and imposing helpful discipline, etc).
More specifically, here are some advantages that it feels like the “aligning narrowly superhuman models” line of work has over each of the other three genres:
Finally and maybe most importantly, I think aligning narrowly superhuman models has high long-run field growth potential compared to these other genres of work. Just focusing on GPT-3, there are already a lot of different fuzzy goals we could try to align it to, and the number of opportunities will only grow as the ML industry grows and the number and size of the largest models grow. This work seems like it could absorb a constant fraction (e.g. 1% or 5%) of all the ML activity -- the more models are trained and the more capable they are, the more opportunity there is to align narrowly superhuman models to ever more tasks.
I think we have a shot at eventually supplying a lot of people to work on it too. In the long run, I think more EAs could be in a position to contribute to this type of work than to either conceptual research or mainstream ML safety.[11] Conceptual research is often foggy and extremely difficult to make progress on without a particular kind of inspiration and/or hard-to-define “taste”; mainstream ML safety is often quite technical and mathematically dense (and ensuring the work stays relevant to long-run x-risk may be difficult).
A lot of work involved in aligning narrowly superhuman models, on the other hand, seems like it’s probably some combination of: a) software engineering and ML engineering, b) dealing with human contractors, and c) common sense problem-solving. Lead researchers may need to bring taste and research judgment to ensure that the work is well-targeted, but a number of people could work under one lead researcher doing tractable day-to-day work with reasonably good feedback loops. If there were institutional homes available to onboard people onto this work, I think a strong generalist EA with a software engineering background could plausibly retrain in ML engineering over 6-12 months and start contributing to projects in the space.
Right now there are only a few organizations that offer roles doing this work and that seems like a big bottleneck, but it could make sense to prioritize creating more institutional homes and/or rapidly expanding the ones that exist.
Objections and responses
In this section I’ve tried to anticipate some potential objections, and give my responses; I’d suggest skipping around and reading only the ones that interest you. I don’t think that I have knock-down answers to all of these objections, but I do remain holistically excited about this idea after reflecting on them some.
How would this address treachery by a superintelligence?
Elaboration of objection: It seems like there is a “hard core” of the alignment problem that only crops up when models are very smart in a very general way, not just e.g. better than MTurkers at giving medical advice. The specific scariest problem seems to be the “treacherous turn”: the possibility that the model will appear to be helpful during training time even though it’s actually power-seeking because it’s aware that it’s being trained and has to act helpful to survive, and later cause catastrophic harm once it knows it’s out of the training setup. It doesn’t seem like the “aligning narrowly superhuman models” style of work will figure out a way to address the treacherous turn until it’s likely too late.
I'm very uncertain how relevant the near-term work will turn out to be for more exotic problems like the treacherous turn, and I want to think more about ways to nudge it to be more relevant.[12] I would be very excited to find empirical research projects on large models that specifically shed light on the treacherous turn possibility, and I agree it’s a weakness of my set of potential projects that they aren’t specifically optimized for unearthing and correcting treachery.
With that said, I don’t think there are currently genres of work that feel similarly tractable and scalable that do tackle the treacherous turn head on -- of the main genres of alignment work, I’d argue that only a subset of the conceptual work is aiming to directly generate a long-term solution to treachery, and I think the jury is very much out on whether it will be fruitful; gridworlds and games and mainstream ML safety largely don’t seem to try for a long-term treacherous turn solution. So I think the relative hit that my proposal takes due to this consideration is fairly limited.[13]
Even if they don’t start off tackling the treacherous turn, I’d guess that researchers would have a decent shot at learning useful things about treachery down the line if they were pursuing this work. Basically, I think it’s pretty likely that full-blown treachery will be preceded by mini-treachery, and with better understanding of how neural networks tend to learn and generalize, researchers may be able to specifically seek out domains where mini-treachery is especially likely to occur to better study it. Even if techniques used by empirical researchers don’t work out of the box for the treacherous turn, empirical work eliciting and studying mini-treachery could still inform what kind of theoretical or conceptual work needs to be done to address it, in a way that seems more promising to me than eliciting micro-treachery in gridworlds and games.
Moreover, even though the treacherous turn seems like the scariest single source of risk, I don’t think it totally dominates the overall expected AI risk -- a significant fraction of the risk still seems to come from more “mundane” outer alignment failures and various unforced errors, which this empirical work seems better-placed to address. Of the three broad ways I listed that this work could reduce x-risk, the critique that it doesn’t seem to address the treacherous turn very well applies most to the “Chance of discovering or verifying long-term solution(s)” category; even if it fails to address the treacherous turn, it still seems that “Practical know-how and infrastructure” and “Better AI situation in the run-up to superintelligence” matter.
Doesn’t this feel suspiciously close to just profit-maximizing?
Elaboration of objection: It sort of sounds like you’re just telling EAs to make AI really useful to humans (and indeed push models to be superhuman if they can be); it feels like this would also be what someone who is into pure profit-maximization would be excited about, and that makes me suspicious about the reasoning here and nervous about calling it an alignment activity. Even if you’re right that it helps with alignment, we might see a lot of people flock to it for the wrong reasons.
I agree that there is overlap with commercial incentives, but I think there are three high-level ways that this type of work would be different from what you’d do if you were profit-maximizing:
More broadly, I think successful versions of this type of alignment work should get someone who deeply understands ML and its limitations to say something like, "Wow, it's cool that you got the model to do that." My sense is that most commercial projects wouldn’t really elicit this reaction, and would look more like applying a lot of hard work to realize an outcome that wasn’t very much in doubt.
Given these differences, I think there’s a good shot at distinguishing this type of work from pure profit-seeking and cultivating a community where a) most people doing this work are doing it for altruistic reasons, and b) this is reasonably legible to onlookers, funders, potential junior researchers, etc.
Isn’t this not neglected because lots of people want useful AI?
Elaboration of objection: Even if this is useful for alignment, and even adjusting for the fact that companies aren’t focusing on the version that’s specifically alignment-optimized, won’t a ton of this work get done in AI labs and startups? Doesn’t that mean that the EA community is less likely to make an impact on the margin than in other, less-commercially-incentivized types of alignment work?
I do think there’s probably some work happening broadly along these lines from a commercial motivation, and there will probably be significantly more in the future. But I pretty strongly suspect that there are very few, if any, projects like the ones I proposed above currently being done in a commercial setting, and what work is being done is less well-targeted at reducing long-run x-risk than it could be.
The vast majority of commercial work going into AI by dollars is a) hyper application-specific and hard-coding intensive such as self-driving cars, or b) focused on scaling big generic models. I don’t actually think the resources going into any sort of project focused on human demonstrations and feedback are very large right now; I’d guess they’re within an order of magnitude of the resources going into other alignment work (e.g. $100s of millions per year at the high end, where other alignment research absorbs $10s of millions per year). And for the reasons outlined above, not a lot of this will be focused on exceeding humans using scalable, domain-general techniques.
As an example to illustrate the relative neglectedness of this work, it was Paul Christiano (motivated by long-term alignment risk concerns) who led the Stiennon et al., 2020 work, and I think it’s reasonably likely that if he hadn’t done so there wouldn’t have been a human feedback paper of similar scale and quality for another year or so. I’d guess the EA community collectively has the opportunity to substantially increase how much of this work is done before transformative AI with a strong push, especially because the “going beyond human feedback” step seems less commercially incentivized than the Stiennon et al. work.
Some additional thoughts on neglectedness:
Will this cause harm by increasing investment in scaling AI?
Elaboration of objection: Even if the people doing this research don’t personally scale up models and focus on generalizable and scalable solutions to making models helpful, they will be demonstrating that the models have powerful and useful capabilities that people might not have appreciated before, and could inspire people to pour more investment into simply scaling up AI or making AI useful in much less principled ways, which could cause harm that exceeds the benefits of the research.
This is a very contentious question and people have a wide range of intuitions on it. I tend to be less bothered by this type of concern than a lot of other people in the community across the board. At a high-level, my take is that:
With that said, I do think that exciting demos are a lot more likely to spur investment than written arguments, and this kind of research could generate exciting demos. Overall, the case for caution here feels stronger to me than the case for caution about discussing arguments about timelines and takeoff speeds, and this consideration probably claws back some of my enthusiasm for the proposal on net (largely out of deference to others).
Why not just stick with getting models not to do bad things?
Elaboration of objection: Even if this is useful for alignment, worth doing on the margin, and not net-harmful, it seems like it would be dominated by doing practical/near-term work that’s more clearly and legibly connected to safety and harm-reduction, like “getting models to never lie” or “getting models to never use racist slurs” or “getting models to never confidently misclassify something.” That work seems more neglected and more relevant.
Some people might feel like “avoiding bad behaviors” is clearly the subset of near-term empirical alignment work which is most relevant to long-run alignment and neglected by profit-seeking actors -- after all, in the long run we’re trying to avoid a big catastrophe from misaligned AI, so in the short run we should try to avoid smaller catastrophes.
I disagree with this: I think both “getting models to be helpful and surpass human trainers” and “getting models to never do certain bad things” are valuable lines of empirical alignment work, and I’d like to see more of both. But I don’t think reliability and robustness has a special place in terms of relevance to long-run x-risk reduction, and if anything it seems somewhat less exciting on the margin. This is because:
Why not focus on testing a candidate long-term solution?
Elaboration of objection: This proposal seems like it would lead to a lot of wasted work that isn’t sufficiently optimized for verifying or falsifying a long-term solution to alignment. It would be better if the potential projects were more specifically tied in to testing an existing candidate long-term solution, e.g. Paul Christiano’s agenda.
I’ll focus on Paul’s agenda in my response, because the specific people I’ve talked to who have this objection mostly focus on it, but I think my basic response will apply to all the conceptual alignment agendas.
Some of the projects under the umbrella of “aligning narrowly superhuman models” seem like they could instead be reframed around specific goals related to Paul’s agenda, like “prototyping and testing capability amplification”, “prototyping and testing imitative generalization”, “figuring out how ascription universality works”, and so on. I do think one of the value propositions of this work is shedding light on these sorts of concepts, but I think it’s probably not helpful to frame the whole endeavor around that:
There could be some simple organizing goal or “tagline” for empirical alignment research that is neither “test [concept from a Paul blog post]” nor “align narrowly superhuman models” which would inspire better-targeted research from the perspective of someone who’s bullish on Paul’s work, but the ones I’ve thought about haven’t been convincing,[17] and I’d guess it’ll be hard to find a good organizing tagline until the theory work gets to a more stable state.
Current state of opinion on this work
One of my goals in writing this blog post is to help build some community consensus around the “aligning narrowly superhuman models” proposal if it’s in fact a good idea. To that end, I’ll lay out my current understanding of where various AI alignment researchers stand on this work:
I also think a number of AI alignment researchers (and EAs working in AI risk more broadly) simply haven’t thought a lot about this kind of work because it hasn’t really been possible until the last couple of years. Until 2019 or so, there weren’t really any models accessible to researchers which could exceed human performance in fuzzy domains, and research agendas in AI alignment were largely formed before this was an option.
Takeaways and possible next steps
I’ve laid out the hypothesis that aligning narrowly superhuman models would concretely reduce x-risk and has high long-run field growth potential (i.e., lots of people who don’t have particularly esoteric skills could eventually help with it). I think if the EA and AI alignment community is in broad agreement about this, there’s potential to make a lot happen.
In terms of immediate actionable takeaways:
Looking forward to hearing people’s thoughts!
Appendix: beyond sandwiching?
Right now, models like GPT-3 are not “superhuman” at fuzzy tasks in the sense that AlphaGoZero is “superhuman” at playing Go. AGZ plays Go better than any human, while GPT-3 is only capable of giving better advice or writing better stories than some humans, which is what makes the “sandwiching” tactic an option. What happens when language models and other models get narrowly superhuman in a strong sense -- better than all humans in some fuzzy domain, e.g. stock-picking? How would we verify that we got the model to be “doing the best it can do to help” when there’s no reference model trained on a ground truth signal to compare its performance to?
I’m definitely very unsure what this would look like, but an important starting assumption I have is that whatever techniques worked well to get less-capable humans to reproduce the judgments of more-capable humans in a “sandwich” setting stand a good chance of just continuing to work. If we were careful not to actually use the expertise of the more-capable set of humans in whatever systems/tools we used to assist/augment the less-capable set, and a similar set of systems/tools seemed to work across multiple domains and for humans at multiple different capability levels, there’s no particular reason to believe they would not continue working once models go from slightly less capable than the best humans to slightly more capable than them at some task. So I think it’s possible we could do most of the R&D in the regime where sandwiching works.
With that said, here are some thoughts about how we could try to probe whether our alignment techniques were actually successful at eliciting a model’s full potential in a regime the model is more capable than the best humans:
At least better than some salient large group of humans in a particular context, like “Mechanical Turk workers”, “stackoverflow users”, etc. Right now, models are only superhuman with respect to all humans in particular crisp domains like games. E.g. AlphaGoZero is better at Go than any human; GPT-3 probably has the potential to give better advice than some humans. ↩︎
This idea isn’t original to me -- a number of others (especially some people working on long-term AI alignment at OpenAI and DeepMind) have thought along similar lines. My own thinking about this has been informed a lot by discussions with Paul Christiano and Holden Karnofsky. ↩︎
e.g., Mechanical Turk workers who are hired to give feedback to the model ↩︎
Though if we could pull off a path where we build an AI system that is superhuman in certain engineering capabilities but not yet human-level in modeling and manipulating people, and use that system to cut down on x-risk from other AI projects without having to figure out how to supervise arbitrary superhuman models, that could be really good. ↩︎
Note that I don’t think this is the only way to study interpretability and robustness, or even necessarily the best way. In this project-generation formula, the domain and task were optimized to make reward learning an especially interesting and important challenge, rather than to make interpretability or robustness especially challenging, interesting, or important. I think it’s good to be complete and to try to ensure interpretability and robustness in these domains, but we should probably also do other lines of research which choose domains / tasks that are specifically optimized for interpretability or robustness, rather than reward learning, to be especially challenging and important. ↩︎
Pragmatically speaking, fine-tuning a large model rather than training from scratch is also orders of magnitude cheaper, and so a lot more accessible to most researchers. ↩︎
Another way of seeing why it wouldn’t count is that “predict the next token” is an extremely non-fuzzy training signal. ↩︎
Human contractors make these labels, but they are not providing feedback. ↩︎
More speculatively, if we’re realizing models’ full potential as we go along, there’s less chance of ending up with what I’ll call an “unforced sudden takeoff”: a situation where on some important set of fuzzy tasks models jump suddenly from being not-that-useful to extraordinarily useful, but this was due to not bothering to figure out how to make models useful for fuzzy tasks rather than any inherent underlying fact about models. I’m not sure how plausible an unforced sudden takeoff is though, and I’m inclined (because of efficient market intuitions) to think the strong version of it is not that likely. H/t Owen Cotton-Barratt for this thought. ↩︎
E.g., that whenever there are two or more generalizations equally consistent with the training data so far, models will never generalize in the way that seems more natural or right to humans. ↩︎
I think eventually gridworlds and games will probably fade away as it becomes more practical to work with larger models instead, and dynamics like the treacherous turn start to show up in messier real-world settings. ↩︎
One idea a couple of others have suggested here and which I’m generally interested in is “transparency in (narrowly superhuman) language models”: finding ways to understand “what models are thinking and why,” especially when they know more about something than humans do. I like this idea but am very unsure about what execution could look like. E.g., would it look like Chris Olah’s work, which essentially “does neuroscience” on neural networks? Would it look like training models to answer our questions about what they’re thinking? Something else? ↩︎
Though you could think that in an absolute sense it and all the other approaches that aren’t tackling treachery head-on are doomed. ↩︎
I would also prefer other things being equal that EAs focused on long-run x-risk get the recognition for this work rather than others, but as I said above I consider this secondary and think that this agenda is good on the merits, not just as career capital for EAs. ↩︎
There are some innovators for whom the value of being in an area is strictly decreasing in its crowdedness, because their main value-add is to “start something from nothing.” But I don’t think that applies to most contributors, even those who have an extremely large impact eventually (which might even be larger than the innovators’ impact in some cases). ↩︎
Some people have argued that the “verifying long-run solutions” path is dominant because the other stuff is likely to happen anyway, but I’m not convinced. I think all three paths to impact that I laid out are likely to happen one way or another, and there’s room to speed up or improve all of them. I do think there could be some boost to the “verifying long-run solutions” path, but all in all I feel like it’ll be ⅓ to ¾ of the value, not >90% of the value. ↩︎
The most plausible competing pitch in my mind is “get language models to answer questions honestly”, which seems like it could get at the “ascription universality” / “knowing everything the model knows” concept (h/t Evan H, Owen C-B, Owain E). That would narrow the focus to language models and question-answering, and rule out projects like “get non-coders to train a coding model.” I think the “get language models to answer questions honestly” frame is reasonable and I want to see work done under that banner too, but I’m not convinced it’s superior. It considerably narrows the scope of what’s “in”, cutting down on long-run field growth potential, and I think a lot of the projects that are “out” (like the coding project) could be helpful and informative. I also worry that the tagline of “honesty” will encourage people to focus on “avoiding harmful lies that are nonetheless pretty easy for humans to detect”, rather than focusing on regimes where models exceed human performance (see this objection for more discussion of that). ↩︎
It’s possible other places, like Google Brain or some other FAANG lab, would also have roles available doing this type of work -- I am just more unsure because there is less of a long-termist alignment researcher presence in those places. ↩︎
Eventually, when models are more strongly superhuman, I think it will get too hard to even tell whether outcomes were acceptable, because AI systems could e.g. compromise the cameras and sensors we use to measure outcomes. So relying on outcomes earlier on feels like “kicking the can down the road” rather than “practicing what we eventually want to be good at.” “Don’t kick the can down the road, instead practice what we eventually want to be good at” is the overall ethos/attitude I’m going for with this proposal. ↩︎