Imagine that an advanced robot is built which uses GPT-7 as its brain. It takes all previous states of the world and predicts the next step. If a previous state of the world includes a command, like "bring me a cup of coffee", it predicts that it should bring coffee, and also predicts all the needed movements of the robot's limbs. GPT-7 is trained on a large corpus of data from humans and other robots; it has 100 trillion parameters and is completely opaque. Its creators have hired you to make the robot safer, but will not allow you to destroy it.
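To make the setup concrete, here is a minimal sketch of the control loop being described. `GPT7` and `Robot` are placeholder stubs I've made up; nothing about the real interfaces is specified in the scenario:

```python
# A minimal, hypothetical sketch of the control loop described above.
# GPT7 and Robot are placeholder stubs; the real system is assumed opaque.

class GPT7:
    def predict_next_state(self, history):
        # Stand-in for the opaque 100-trillion-parameter predictor: given
        # everything observed so far, guess the next world state, which
        # includes the robot's own limb movements.
        return {"limb_targets": [0.0] * 12}

class Robot:
    def sense(self):
        # Camera frames, joint angles, audio (which may contain commands).
        return {"joints": [0.0] * 12, "audio": "bring me a cup of coffee"}

    def actuate(self, limb_targets):
        pass  # drive the motors toward the predicted pose

gpt7, robot = GPT7(), Robot()
history = []
for _ in range(1000):                                # control loop
    history.append(robot.sense())                    # all previous states of the world
    prediction = gpt7.predict_next_state(history)    # predict the next step
    robot.actuate(prediction["limb_targets"])        # make the prediction come true
```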

2 Answers

Daniel Kokotajlo


One would hope that GPT-7 would achieve accurate predictions about what humanoids do because it is basically a human. Its algorithm is "OK, what would a typical human do?"

However, another possibility is that GPT-7 is actually much smarter than a typical human in some sense--maybe it has a deep understanding of all the different kinds of humans, rather than just a typical human, and maybe it has some sophisticated judgment for which kind of human to mimic depending on the context. In this case it probably isn't best understood as a set of humans with an algorithm to choose between them, but rather as something alien and smarter than humans that mimics them in the way that, e.g., a human actress might mimic some large set of animals.

Using Evan's classification, I'd say that we don't know how training-competitive GPT-7 is but that it's probably pretty good on that front; GPT-7 is probably not very performance-competitive because even if all goes well it just acts like a typical human; GPT-7 has the standard inner alignment issues (what if it is deceptively aligned? What if it actually does have long-term goals, and pretends not to, since it realizes that's the only way to achieve them? though perhaps they have less force since its training is so... short-term? I forget the term) and finally I think the issue pointed to with "The universal prior is malign" (i.e. probable environment hacking) is big enough to worry about here.

In light of all this, I don't know how to ensure its safety. I would guess that some of the techniques Evan talks about might help, but I'd have to go through them and refamiliarize myself with them.

Steven Byrnes


You're asking about pure predictive (a.k.a. self-supervised) learning. As far as I know, it's an open question what the safety issues are for that (if any), even in a very concrete case like "this particular Transformer architecture trained on this particular dataset using SGD". I spent a bit of time last summer thinking about it, but didn't get very far. See my post self-supervised learning and manipulative predictions for one particular possible failure mode that I wasn't able to either confirm or rule out. (I should go back to it at some point.) See also my post self-supervised learning and AGI safety for everything else I know on the topic. And of course I must mention Abram's delightful Parable of Predict-o-matic if you haven't already seen it; again, this is high-level speculation that might or might not apply to any particular concrete system ("this particular Transformer architecture trained by SGD"). Lots of open questions!

An additional set of potential problems comes from your suggestion to put it in a robot body and actually execute the commands. Can it even walk? Of course it can figure out walking by letting it try with a reward signal, but now we're not talking about pure predictive learning anymore. Hmm, after thinking about it, I guess I'm cautiously optimistic that, in the limit of infinite training data from infinitely many robots learning to walk, a large enough Transformer doing predictive learning could learn to read its own sense data and walk without any reward signal. But then how do you get it to do useful things? My suggestion here was to put a metadata flag into inputs where a robot is being super-helpful, and then when you have the robot start acting in the real world, turn that flag on. Now we're bringing in supervised learning, I guess.
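If I understand the flag suggestion, it amounts to conditional prediction: tag the training trajectories where the demonstrator is being super-helpful, then force the tag on at deployment. A rough sketch, where the field names and the way the flag reaches the model are my assumptions, not a described API:

```python
# Rough sketch of the "helpfulness flag" idea; the field names and the
# conditioning mechanism are assumptions, not a described API.

def make_training_example(observations, actions, labeled_helpful):
    # Each recorded trajectory carries a metadata flag saying whether the
    # demonstrator was judged (by human labelers) to be super-helpful.
    return {"observations": observations,
            "actions": actions,
            "helpful": labeled_helpful}

def build_deployment_context(live_observations):
    # At run time the flag is simply forced on, so the purely predictive
    # model conditions on "this is one of the helpful trajectories" and,
    # hopefully, predicts helpful behaviour.
    return {"observations": live_observations, "helpful": True}
```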

In the event that the robot was actually capable of doing anything at all, I would be very concerned that you press go and then the system wanders farther and farther out of distribution and does weird, dangerous things that have a high impact on the world.

As for concrete advice for the GPT-7 team: I would suggest at least throwing out the robot body and making a text / image prediction system in a box, and then putting a human in the loop who looks at the screen before anything goes out and does stuff. This can still be very powerful and economically useful, and it's a step in the right direction: it eliminates the problem of the system just going off and doing something weird and high-impact in the world because it wandered out of distribution. It doesn't eliminate all problems, because the system might still become manipulative. As I mentioned in the first paragraph, I don't know whether that's a real problem or not; more research is needed. It's possible that we're all just doomed in your scenario. :-)
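One way to picture the "system in a box with a human in the loop": the model only ever produces proposals, and nothing acts on them unless a person looking at the screen approves. A toy sketch with made-up stand-ins for the model and the reviewer:

```python
# Toy sketch of a boxed prediction system with a human approval gate.
# `model` and `human_review` are trivial stand-ins, not a real API.

def run_boxed_session(model, prompt, human_review):
    proposal = model(prompt)      # pure prediction: text or image out
    print("Model proposal:\n", proposal)
    if human_review(proposal):    # a human reads the screen and decides
        return proposal           # only an approved proposal leaves the box
    return None                   # rejected proposals go nowhere

if __name__ == "__main__":
    dummy_model = lambda p: f"(predicted completion of: {p!r})"
    ask_human = lambda text: input("Approve this output? [y/N] ").strip().lower() == "y"
    run_boxed_session(dummy_model, "Draft a reply to the customer.", ask_human)
```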

8 comments

Commenting on the general case, rather than GPT-7 in particular: my background view on this kind of thing is that there are many different ways of reaching AGI in principle, and the vast majority of paths to AGI don't result in early-generation AGI systems being alignable in a reasonable amount of time. (Or they're too slow/limited/etc. and end up being irrelevant.)

The most likely (and also the most conservative) view is that (efficient, effective) alignability is a rare feature -- not necessarily a hard-to-achieve feature if you have a broad-strokes idea of what you're looking for and you spent the years leading up to AGI deliberately steering toward the alignable subspace of AI approaches, but still not one that you get for free.

I think your original Q is a good prompt to think about and discuss, but if we're meant to assume alignability, I want to emphasize that this is the kind of assumption that should probably always get explicitly flagged. Otherwise, for most approaches to AGI that weren't strongly filtered for alignability, answering 'How would you reduce risk (without destroying it)?' in real life will probably mostly be about convincing the project to never deploy, finding ways to redirect resources to other approaches, and reaching total confidence that the underlying ideas and code won't end up stolen or posted to arXiv.

Yes, I know your position from your previous comments on the topic, but it seems that GPT-like systems are winning in the medium term and we can't stop this. Even if they can't be scaled to superintelligence, they may need some safety features.

Can you give more details on how it works? I'm imagining that it has some algorithm for detecting whether a command has been fulfilled, and that it is rewarded partly for accurate predictions and partly for fulfilled commands. If so, how is that fulfillment-detecting algorithm built or trained?

It works the same way that GPT makes TL;DR summaries. There is no reward for a correct TL;DR and no special training - it just completes the sequence in the most probable way. Some self-driving cars work the same way: there is an end-to-end neural net, without any internal world model, and it just predicts what a normal car would do in this situation. I heard from an ML friend that they could achieve reasonably good driving with such models.
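To spell out the analogy, the command case and the TL;DR case are the same operation: append context and take the most probable continuation. A sketch, where `gpt.complete` is a made-up stand-in for any autoregressive model's sampling call:

```python
# Sketch of "everything is just sequence completion"; `gpt.complete` is a
# hypothetical stand-in for sampling a continuation from the model.

def summarize(gpt, article):
    # The GPT-2 trick: append "TL;DR:" and let the model continue.
    return gpt.complete(article + "\n\nTL;DR:")

def act(gpt, world_history, command):
    # The robot case is the same operation: the command is just more
    # context, and the "action" is whatever continuation is most probable.
    return gpt.complete(world_history + "\nCommand: " + command + "\nNext state:")
```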

Oh, OK. So perhaps we give it a humanoid robot body, so that it is as similar as possible to the humans in its dataset, and then we set up the motors so that the body does whatever GPT-7 predicts it will do, and GPT-7 is trained on datasets of human videos (say) so if you ask it to bring the coffee it probably will? Thanks, this is much clearer now.

What's still a bit unclear to me is if it has any ability to continue to learn (I guess from the stipulated proposal the answer is "no", but I'm just like "guys, why the hell did you build GPT-7-Bot instead of something that allowed better iterated amplification or something?")

Is the spirit of the question "there is no ability to rewrite its architecture, or to re-train it on new data, or anything?"

Even GPT-2 could be conditioned on some recent events, given as "examples" in its prompt - so it has some form of memory. The GPT-7 robot has access to all the data it has observed before, so if it once said "I want to kill Bill", it will act in the future as if it had such a desire. In other words, it behaves as if it has memory.
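In other words, the "memory" here is just conditioning on context, the same mechanism as giving GPT-2 a few examples in its prompt. A rough sketch, with `gpt7.complete` again a hypothetical stand-in:

```python
# Rough sketch of memory-as-conditioning: the weights never change, but the
# whole observed past sits in the context, so past statements shape future
# predictions. `gpt7.complete` is a hypothetical stand-in.

def next_output(gpt7, full_history, new_observation):
    full_history.append(new_observation)
    prompt = "\n".join(full_history)   # everything it has said and seen so far
    return gpt7.complete(prompt)       # acts as if it still "wants" what it said earlier
```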

It doesn't have a built-in ability to rewrite its architecture, but it can write code on a laptop or order things on the internet. However, it doesn't know much about its own internal structure, except that it is a very large GPT model.

Nod. And does it seem to have the ability to gain new cognitive skills? Like, if it reads a bunch of LessWrong posts or attends CFAR, does its 'memory' start to include things that prompt it to, say, "stop and notice it's confused" and "form more hypotheses when facing weird phenomena" and "cultivate curiosity about its own internal structure."

(I assume so, just doublechecking)

In that case, it seems like the most obvious ways to keep it friendly are the same way you make a human friendly (expose it to ideas you think will guide it on a useful moral trajectory).

I'm not actually sure what sort of other actions you're allowing in the hypothetical.