So there’s this thing where GPT-3 is able to do addition: it has the internal model for it, but it takes a little poking and prodding to actually get it to do it. “Few-shot learning”, as the GPT-3 paper calls it. Rather than prompting the model with
```
Q: What is 48 + 76? A:
```
… instead prompt it with
```
Q: What is 48 + 76? A: 124
Q: What is 34 + 53? A: 87
Q: What is 29 + 86? A:
```
The same applies to lots of other tasks: arithmetic, anagrams and spelling correction, translation, assorted benchmarks, etc. To get GPT-3 to do the thing we want, it helps to give it a few examples, so it can “figure out what we’re asking for”.
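To make the pattern concrete, here’s a minimal sketch of few-shot prompt construction as code. The `complete` call at the end is just a stand-in for whatever completion API you happen to be using, not a real function; the format mirrors the example above.

```python
# Build a few-shot addition prompt: a handful of solved examples, then the
# unsolved problem we actually care about, left for the model to complete.
def few_shot_addition_prompt(a, b, examples=((48, 76), (34, 53))):
    lines = [f"Q: What is {x} + {y}? A: {x + y}" for x, y in examples]
    lines.append(f"Q: What is {a} + {b}? A:")
    return "\n".join(lines)

prompt = few_shot_addition_prompt(29, 86)
print(prompt)
# answer = complete(prompt)  # stand-in for whatever completion API you call
```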
This is an alignment problem. Indeed, I think of it as the quintessential alignment problem: to translate what-a-human-wants into a specification usable by an AI. The hard part is not to build a system which can do the thing we want; the hard part is to specify the thing we want in such a way that the system actually does it.
The GPT family of models are trained to mimic human writing. So the prototypical “alignment problem” on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Assuming that GPT has a sufficiently powerful and accurate model of human writing, it should then generate the thing you want.
Viewed through that frame, “few-shot learning” just designs a prompt by listing some examples of what we want - e.g. listing some addition problems and their answers. Call me picky, but that seems like a rather primitive way to design a prompt. Surely we can do better?
Indeed, people are already noticing clever ways to get better results out of GPT-3 - e.g. TurnTrout recommends conditioning on writing by smart people, and the right prompt makes the system complain about nonsense rather than generating further nonsense in response. I expect we’ll see many such insights over the next month or so.
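To illustrate the general shape of “better than bare few-shot”, here’s a rough sketch of that kind of prompt. The framing text is my own invented example, not TurnTrout’s exact wording, and there’s no guarantee this particular phrasing works well.

```python
# Prompt design beyond bare few-shot: frame the completion as coming from a
# careful, knowledgeable writer, so the most likely continuation of the text
# is a careful answer (or a complaint that the question is nonsense).
PREFIX = (
    "The following is a Q&A with a careful, highly knowledgeable mathematician "
    "who points out when a question is nonsense rather than playing along.\n\n"
)

def framed_prompt(question):
    return PREFIX + f"Q: {question}\nA:"

print(framed_prompt("What is 48 + 76?"))
print(framed_prompt("How many sides does a square circle have?"))  # hopefully elicits pushback
```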
Capabilities vs Alignment as Bottleneck to Value
I said that the alignment problem on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Important point: this is worded to be agnostic to the details of the GPT algorithm itself; it’s mainly about predictive power. If we’ve designed a good prompt, the current generation of GPT might still be unable to solve the problem - e.g. GPT-3 doesn’t understand long addition no matter how good the prompt, but some future model with more predictive power should eventually be able to solve it.
In other words, there’s a clear distinction between alignment and capabilities:
- alignment is mainly about the prompt, and asks whether human writing which started with that prompt would be likely to contain the thing you want
- capabilities are mainly about GPT’s model, and ask how well GPT-generated writing matches realistic human writing
Interesting question: between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term?
In the short term, it seems like capabilities are still pretty obviously the main bottleneck. GPT-3 clearly has pretty limited “working memory” and understanding of the world. That said, it does seem plausible that GPT-3 could consistently do at least some economically-useful things right now, with a carefully designed prompt - e.g. writing ad copy or editing humans’ writing.
In the longer term, though, we have a clear path forward for better capabilities. Just continuing along the current trajectory will push capabilities to an economically-valuable point on a wide range of problems, and soon. Alignment, on the other hand, doesn’t have much of a trajectory at all yet; designing-writing-prompts-such-that-writing-which-starts-with-the-prompt-contains-the-thing-you-want isn’t exactly a hot research area. There’s probably low-hanging fruit there for now, and it’s largely unclear how hard the problem will be going forward.
Two predictions on this front:
- With this version of GPT and especially with whatever comes next, we’ll start to see a lot more effort going into prompt design (or the equivalent alignment problem for future systems)
- As the capabilities of GPT-style models begin to cross beyond what humans can do (at least in some domains), alignment will become a much harder bottleneck, because it’s hard to make a human-mimicking system do things which humans cannot do
Reasoning for the first prediction: GPT-3 is right on the borderline of making alignment economically valuable - i.e. it’s at the point where there’s plausibly some immediate value to be had by figuring out better ways to write prompts. That means there’s finally going to be economic pressure for alignment - there will be ways to make money by coming up with better alignment tricks. That won’t necessarily mean economic pressure for generalizable or robust alignment tricks, though - most of the economy runs on ad-hoc barely-good-enough tricks most of the time, and early alignment tricks will likely be the same. In the longer run, focus will shift toward more robust alignment, as the low-hanging problems are solved and the remaining problems have most of their value in the long tail.
Reasoning for the second prediction: how do I write a prompt such that human writing which began with that prompt would contain a workable explanation of a cheap fusion power generator? In practice, writing which claims to contain such a thing is generally crackpottery. I could take a different angle, maybe write some section-headers with names of particular technologies (e.g. electric motor, radio antenna, water pump, …) and descriptions of how they work, then write a header for “fusion generator” and let the model fill in the description. Something like that could plausibly work. Or it could generate scifi technobabble, because that’s what would be most likely to show up in such a piece of writing today. It all depends on which is "more likely" to appear in human writing. Point is: GPT is trained to mimic human writing; getting it to write things which humans cannot currently write is likely to be hard, even if it has the requisite capabilities.
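For concreteness, here’s roughly what that section-header prompt might look like. The one-line technology descriptions are my own; this is a sketch of the idea in the paragraph above, not a prompt known to produce anything other than technobabble.

```python
# Sketch of the "section headers" prompt described above: short descriptions of
# technologies we already understand, then a bare header for the one we don't,
# leaving the model to fill in that last description.
KNOWN_TECH = {
    "Electric motor": "Converts electrical energy into rotation via the force on "
                      "a current-carrying conductor in a magnetic field.",
    "Radio antenna": "Converts guided electrical signals into radiated "
                     "electromagnetic waves, and vice versa.",
    "Water pump": "Moves fluid mechanically, e.g. an impeller adds kinetic energy "
                  "which is converted into pressure.",
}

def technology_survey_prompt(target="Fusion generator"):
    sections = [f"{name}\n{desc}\n" for name, desc in KNOWN_TECH.items()]
    sections.append(f"{target}\n")  # the model fills in this description
    return "\n".join(sections)

print(technology_survey_prompt())
```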
There isn't a standard reference because the argument takes one sentence, and I've been repeating it over and over again: what would Bayesian updates on low-level physics do? That's the unique solution with best-possible predictive power, so we know that anything which scales up to best-possible predictive power in the limit will eventually behave that way.
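Spelled out in symbols (notation mine, not from any particular reference): with $D$ the observed data and $X$ ranging over low-level physical world-states, the updates in question are just

$$
P(X \mid D) \;=\; \frac{P(D \mid X)\,P(X)}{\sum_{X'} P(D \mid X')\,P(X')},
\qquad
P(D_{\text{next}} \mid D) \;=\; \sum_{X} P(D_{\text{next}} \mid X)\,P(X \mid D),
$$

and the claim is that this posterior predictive is the best-possible predictor, so anything trained toward best-possible predictive power ends up behaving like it.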
The "what would Bayesian updates on a low-level model do?" question is exactly the argument that the bridge design cannot be extended indefinitely, which is why I keep bringing it up over and over again.
This does point to one possibly-useful-to-notice ambiguous point: the difference between "this method would produce an aligned AI" vs "this method would continue to produce aligned AI over time, as things scale up". I am definitely thinking mainly about long-term alignment here; I don't really care about alignment on low-power AI like GPT-3 except insofar as it's a toy problem for alignment of more powerful AIs (or insofar as it's profitable, but that's a different matter).
I've been less careful than I should be about distinguishing these two in this thread. All these things which we're saying "might work" are things which might work in the short term on some low-power AI, but will definitely not work in the long term on high-power AI. That's probably part of why it seems like I keep switching positions - I haven't been properly distinguishing when we're talking short-term vs long-term.
A second comment on this:
If we want to make a piece of code faster, the first step is to profile the code to figure out which step is the slow one. If we want to make a beam stronger, the first step is to figure out where it fails. If we want to extend a bridge design, the first step is to figure out which piece fails under load if we just elongate everything.
Likewise, if we want to scale up an AI alignment method, the first step is to figure out exactly how it fails under load as the AI's capabilities grow.
I think you currently do not understand the failure mode I keep pointing to by saying "what would Bayesian updates on low-level physics do?". Elsewhere in the thread, you said that optimizing "for having a diverse range of models that all seem to fit the data" would fix the problem, which is my main evidence that you don't understand the problem. The problem is not "the data underdetermines what we're asking for", the problem is "the data fully determines what we're asking for, and we're asking for a proxy rather than the thing we actually want".
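A toy illustration of that distinction, with all names and numbers invented purely for illustration:

```python
# Toy illustration: the data fully determines the objective, but the objective
# it determines is a proxy for what we actually want.

# Candidate actions, scored two ways: the signal the training data records
# (overseer approval) vs. the thing we actually want (real helpfulness).
actions = {
    "honest but unwelcome advice": {"approval": 0.3, "real_help": 0.9},
    "flattering, useless advice":  {"approval": 0.9, "real_help": 0.1},
    "decent advice":               {"approval": 0.7, "real_help": 0.7},
}

# A learner with best-possible predictive power models `approval` exactly;
# there's no leftover uncertainty for "a diverse range of models that fit the
# data" to disagree about. Optimizing the learned objective then picks:
chosen = max(actions, key=lambda name: actions[name]["approval"])
print("optimizer picks:", chosen)                         # flattering, useless advice
print("real helpfulness:", actions[chosen]["real_help"])  # 0.1, not what we wanted
```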