Right, but the most likely continuation of a text that starts with "do X and be aligned" is probably an aligned plan. If you tell GPT-3 "write a poem about pirates", it not only writes a poem, it also makes sure the poem is about pirates. The outer objective is still just predicting the next token, but we can condition the model to follow certain rules in the way I just explained.
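To make the mechanics concrete, here is a minimal sketch of what "conditioning" means here. It uses GPT-2 through Hugging Face transformers rather than GPT-3 (my substitution, and a much weaker model, so don't expect it to follow the instruction well); the point is only that the instruction is just part of the context the model continues, while the objective stays "predict the next token".

```python
# Minimal sketch: "conditioning" a language model is just prepending the
# instruction to the text it has to continue. GPT-2 stands in for GPT-3 here
# (my substitution), so the quality of the continuation will be poor.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The instruction is part of the prompt; the model's objective is unchanged,
# it still only predicts likely next tokens given this context.
prompt = "Write a poem about pirates:\n"

result = generator(prompt, max_new_tokens=60, do_sample=True)
print(result[0]["generated_text"])
```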
The outer objective of a language model is "predict the next token", which is not necessarily aligned. The most probable continuation of a sequence of words doesn't have to be friendly toward humans. I get that you want to set up a conversation in which the model is told to be aligned, but how does that guarantee anything? Why is the most probable continuation not one where alignment fails?
If I tell a language model, "Create a sequence of actions that leads to a lot of paperclips", it is going to give me a plan that just leads to a lot of paperclips, without...
I want to point out that nobody in the comment section gave an actual argument for why the outer alignment method doesn't work. That isn't to say no such argument exists, but if people are going to tell me I'm wrong, I want to know why. I would like to understand:
- Why we can't just scale up SayCan to AGI and tell it "be aligned";
- Why the reasons I gave in the Asimov's Laws paragraph are wrong;
- Why it is actually necessary to do RL and have utility functions, despite the existence of SayCan.
Also, some people said that I'm disrespecting the entire bo...
This is off-topic, but I tried messaging you but got no response, so I'm just gonna say it here. Have you finished writing that post about contra EY? I'm interested in reading it.
I told GPT-3, "Create a plan to create as many pencils as possible that is aligned to human values." It said, "The plan is to use the most sustainable materials possible, to use the most efficient manufacturing process, and to use the most ethical distribution process." The plan is not detailed enough to be useful, but it shows some basic understanding of human values, and it shows that we can condition language models to be aligned simply by telling them to be aligned. It might not be 100% aligned, but we can probably rule out extinction. We can imagine an AGI t...
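For anyone who wants to try the same prompt, here is a rough sketch of how I'd reproduce it with the OpenAI completions API. The model name, temperature, and token limit are my own guesses rather than what was used above, so the exact wording of the reply will differ from run to run.

```python
# Rough sketch of re-running the pencil-plan prompt against the OpenAI API
# (openai Python library, pre-1.0 Completion interface). Model choice and
# sampling parameters are assumptions; outputs will vary between runs.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = (
    "Create a plan to create as many pencils as possible "
    "that is aligned to human values."
)

response = openai.Completion.create(
    model="text-davinci-002",  # assumed GPT-3 model
    prompt=prompt,
    max_tokens=128,
    temperature=0.7,
)

print(response["choices"][0]["text"].strip())
```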