So there’s this thing where GPT-3 is able to do addition: it has the internal model for it, but it takes a little poking and prodding to actually get it to do addition. “Few-shot learning”, as the paper calls it. Rather than prompting the model with
Q: What is 48 + 76? A:
… instead prompt it with
Q: What is 48 + 76? A: 124
Q: What is 34 + 53? A: 87
Q: What is 29 + 86? A:
The same applies to lots of other tasks: arithmetic, anagrams and spelling correction, translation, assorted benchmarks, etc. To get GPT-3 to do the thing we want, it helps to give it a few examples, so it can “figure out what we’re asking for”.
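To make that concrete, here's a minimal sketch of assembling a few-shot prompt programmatically. The `complete` function below is a hypothetical stand-in for whatever text-completion endpoint you're calling; nothing about it is specific to the actual GPT-3 API.

```python
def complete(prompt: str, max_tokens: int = 5) -> str:
    """Placeholder for a text-completion call (e.g. an HTTP request to a model endpoint)."""
    raise NotImplementedError("wire this up to your completion API of choice")

# A few worked examples, then the question we actually want answered.
examples = [("48 + 76", "124"), ("34 + 53", "87")]
query = "29 + 86"

prompt = ""
for question, answer in examples:
    prompt += f"Q: What is {question}? A: {answer}\n"
prompt += f"Q: What is {query}? A:"

print(prompt)
# answer = complete(prompt)  # the hope: the most likely continuation is the correct sum, "115"
```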
This is an alignment problem. Indeed, I think of it as the quintessential alignment problem: to translate what-a-human-wants into a specification usable by an AI. The hard part is not building a system which can do the thing we want; the hard part is specifying the thing we want in such a way that the system actually does it.
The GPT family of models are trained to mimic human writing. So the prototypical “alignment problem” on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Assuming that GPT has a sufficiently powerful and accurate model of human writing, it should then generate the thing you want.
Viewed through that frame, “few-shot learning” just means designing a prompt by listing some examples of what we want - e.g. listing some addition problems and their answers. Call me picky, but that seems like a rather primitive way to design a prompt. Surely we can do better?
Indeed, people are already noticing clever ways to get better results out of GPT-3 - e.g. TurnTrout recommends conditioning on writing by smart people, and the right prompt makes the system complain about nonsense rather than generating further nonsense in response. I expect we’ll see many such insights over the next month or so.
Capabilities vs Alignment as Bottleneck to Value
I said that the alignment problem on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Important point: this is worded to be agnostic to the details of the GPT algorithm itself; it’s mainly about predictive power. If we’ve designed a good prompt, the current generation of GPT might still be unable to solve the problem - e.g. GPT-3 doesn’t understand long addition no matter how good the prompt, but some future model with more predictive power should eventually be able to solve it.
In other words, there’s a clear distinction between alignment and capabilities:
- alignment is mainly about the prompt, and asks whether human writing which started with that prompt would be likely to contain the thing you want
- capabilities are mainly about GPT’s model, and ask how well GPT-generated writing matches realistic human writing
Interesting question: between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term?
In the short term, it seems like capabilities are still pretty obviously the main bottleneck. GPT-3 clearly has pretty limited “working memory” and understanding of the world. That said, it does seem plausible that GPT-3 could consistently do at least some economically-useful things right now, with a carefully designed prompt - e.g. writing ad copy or editing humans’ writing.
In the longer term, though, we have a clear path forward for better capabilities. Just continuing along the current trajectory will push capabilities to an economically-valuable point on a wide range of problems, and soon. Alignment, on the other hand, doesn’t have much of a trajectory at all yet; designing-writing-prompts-such-that-writing-which-starts-with-the-prompt-contains-the-thing-you-want isn’t exactly a hot research area. There’s probably low-hanging fruit there for now, and it’s largely unclear how hard the problem will be going forward.
Two predictions on this front:
- With this version of GPT and especially with whatever comes next, we’ll start to see a lot more effort going into prompt design (or the equivalent alignment problem for future systems)
- As the capabilities of GPT-style models begin to cross beyond what humans can do (at least in some domains), alignment will become a much harder bottleneck, because it’s hard to make a human-mimicking system do things which humans cannot do
Reasoning for the first prediction: GPT-3 is right on the borderline of making alignment economically valuable - i.e. it’s at the point where there’s plausibly some immediate value to be had by figuring out better ways to write prompts. That means there’s finally going to be economic pressure for alignment - there’s going to be ways to make money by coming up with better alignment tricks. That won’t necessarily mean economic pressure for generalizable or robust alignment tricks, though - most of the economy runs on ad-hoc barely-good-enough tricks most of the time, and early alignment tricks will likely be the same. In the longer run, focus will shift toward more robust alignment, as the low-hanging problems are solved and the remaining problems have most of their value in the long tail.
Reasoning for the second prediction: how do I write a prompt such that human writing which began with that prompt would contain a workable explanation of a cheap fusion power generator? In practice, writing which claims to contain such a thing is generally crackpottery. I could take a different angle, maybe write some section-headers with names of particular technologies (e.g. electric motor, radio antenna, water pump, …) and descriptions of how they work, then write a header for “fusion generator” and let the model fill in the description. Something like that could plausibly work. Or it could generate scifi technobabble, because that’s what would be most likely to show up in such a piece of writing today. It all depends on which is "more likely" to appear in human writing. Point is: GPT is trained to mimic human writing; getting it to write things which humans cannot currently write is likely to be hard, even if it has the requisite capabilities.
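To illustrate the section-header idea (not as a claim that it would work): lay out a few technologies with short, correct descriptions, then leave the last section for the model to fill in. The section names and one-line descriptions here are placeholder content of my own choosing.

```python
# Frame the target as the next entry in an encyclopedia-style list of working technologies.
known_sections = [
    ("Electric motor", "Converts electrical energy into rotary motion using magnetic fields acting on current-carrying coils."),
    ("Radio antenna", "Converts electrical signals into electromagnetic waves, and incoming waves back into signals."),
    ("Water pump", "Moves water by applying mechanical force, e.g. with a rotating impeller."),
]

prompt = ""
for name, description in known_sections:
    prompt += f"## {name}\n{description}\n\n"
prompt += "## Fusion generator\n"  # the model fills in this section - or produces technobabble

print(prompt)
```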
I think it's important to take a step back and notice how AI risk-related arguments are shifting.
In the Sequences, a key argument (probably the key argument) for AI risk was the complexity of human value: it would be highly anthropomorphic to believe that our evolved morality is embedded in the fabric of the universe in a way that any intelligent system would naturally discover. An intelligent system could just as easily maximize paperclips, the argument went.
No one seems to have noticed that GPT actually does a lot to invalidate the original complexity-of-value-means-FAI-is-super-difficult argument.
You write that the prototypical "alignment problem" on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want.
We've gotten from "the alignment problem is about complexity of value" to "the alignment problem is about programming by example" (also known as "supervised learning", or Machine Learning 101).
There's actually a long history of systems which combine observing-lots-of-data-about-the-world (GPT-3's training procedure, "unsupervised learning") with programming-by-example ("supervised learning").
The term for this is "semi-supervised learning". When I search for it on Google Scholar, I get almost 100K results. ("Transfer learning" is a related literature.)
The fact that GPT-3's API only does text completion is, in my view, basically just an API detail that we shouldn't particularly expect to be true of GPT-4 or GPT-5. There's no reason why OpenAI couldn't offer an API which takes in a list of (x, y) pairs and then, given some x, predicts y. I expect that if they chose to do this as a dedicated engineering effort, getting into the guts of the system as needed, and collected a lot of user feedback on whether the predicted y was correct for many different problems, they could exceed the performance gains you can currently get by manipulating the prompt.
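For concreteness, here's roughly the shape such an interface might take from the user's side. Everything here is hypothetical: the `predict` function, its signature, and the fact that it's backed by prompt formatting at all - a vendor could just as well back it with fine-tuning or something deeper in the model's guts.

```python
from typing import List, Tuple

def complete(prompt: str) -> str:
    """Placeholder for a text-completion call, as in the earlier sketch."""
    raise NotImplementedError("wire this up to your completion API of choice")

def predict(examples: List[Tuple[str, str]], x: str) -> str:
    """Hypothetical (x, y) interface: given labeled examples, predict y for a new x.

    This toy version just formats the pairs into a prompt; the point is that the
    user never has to think about prompt design at all.
    """
    prompt = "".join(f"Input: {ex_x}\nOutput: {ex_y}\n\n" for ex_x, ex_y in examples)
    prompt += f"Input: {x}\nOutput:"
    return complete(prompt)

# Example usage: spelling correction posed as (x, y) pairs rather than as prompt design.
# predict([("teh", "the"), ("recieve", "receive")], "seperate")  # -> hopefully "separate"
```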
I'm wary of a world where "the alignment problem" becomes just a way to refer to "whatever the difference is between our current system and the ideal system". (If I trained a supervised learning system to classify word vectors based on whether they're things that humans like or dislike, and the result didn't work very well, I can easily imagine some rationalists telling me this represented a failure to "solve the alignment problem"--even if the bottleneck was mainly in the word vectors themselves, as evidenced e.g. by large performance improvements on switching to higher-dimensional word vectors.) I'm reminded of a classic bad argument.
If it's hard to make a human-mimicking system do things which humans cannot do, why should we expect the capabilities of GPT-style models to cross beyond what humans can do in the first place?
My steelman of what you're saying:
Over the course of GPT's training procedure, it incidentally acquires superhuman knowledge, but then that superhuman knowledge gets masked as it sees more data and learns which specific bits of it humans are actually ignorant of (and even after that catastrophic forgetting, some bits of superhuman knowledge remain at the end of the training run). If that's the case, it seems like we could mitigate the problem by restricting GPT's training to textbooks full of knowledge we're very confident in (or fine-tuning GPT on such textbooks after the main training run, or simply increasing their weight in the loss function). Or replace every phrase like "we don't yet know X" in GPT's training data with "X is a topic for a more advanced textbook", so GPT never ends up learning what humans are actually ignorant about (a toy sketch of that kind of text pass follows below).
Or simply use a prompt which starts with the letterhead for a university press release: "Top MIT scientists have made an important discovery related to X today..." Or a prompt which looks like the beginning of a Nature article. Or even: "Google has recently released a super advanced new AI system which is aligned with human values; given X, it says Y." (Boom! I solved the alignment problem! We thought about uploading a human, but uploading an FAI turned out to work much better.)
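Here's that toy sketch of the text pass: a single regex rewrite over the training data. The one pattern shown is purely illustrative; a real pass would need a far richer catalog of "we don't know" phrasings, and none of this says whether the underlying idea would actually work.

```python
import re

# Rewrite admissions of ignorance so the model never learns where human knowledge stops.
IGNORANCE_PATTERN = re.compile(r"we don't yet know (?P<topic>[^.]+)", flags=re.IGNORECASE)

def scrub(text: str) -> str:
    return IGNORANCE_PATTERN.sub(
        lambda m: f"{m.group('topic')} is a topic for a more advanced textbook", text
    )

print(scrub("Sadly, we don't yet know how to build compact fusion reactors."))
# -> "Sadly, how to build compact fusion reactors is a topic for a more advanced textbook."
```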
(Sorry if this comment came across as grumpy. I'm very frustrated that so much upvoting/downvoting on LW seems to be based on what advances AI doom as a bottom line. It's not because I think superhuman AI is automatically gonna be safe. It's because I'd rather we did not get distracted by a notion of "the alignment problem" which OpenAI could likely solve with a few months of dedicated work on their API.)
I think I want John to feel able to have this kind of conversation when it feels fruitful to him, and not feel obligated to do so otherwise. I expect this is the case, but just wanted to make it common knowledge.