Right, but the most likely continuation of a text that starts with "do X and be aligned" is probably an aligned plan. If you tell GPT-3 "write a poem about pirates", it not only writes a poem, it also makes sure the poem is about pirates. The outer objective is still just predicting the next token, but we can condition the model to follow certain rules in the way I just explained.
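To make the mechanics concrete, here is a minimal sketch of what "conditioning" means here. It uses GPT-2 through Hugging Face transformers rather than GPT-3 (my substitution, and a much weaker model, so don't expect it to follow the instruction well); the point is only that the instruction is just part of the context the model continues, while the objective stays "predict the next token".

```python
# Minimal sketch: "conditioning" a language model is just prepending the
# instruction to the text it has to continue. GPT-2 stands in for GPT-3 here
# (my substitution), so the quality of the continuation will be poor.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The instruction is part of the prompt; the model's objective is unchanged,
# it still only predicts likely next tokens given this context.
prompt = "Write a poem about pirates:\n"

result = generator(prompt, max_new_tokens=60, do_sample=True)
print(result[0]["generated_text"])
```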
The outer objective of a language model is "predict the next token", which is not necessarily aligned. The most probable continuation of a sequence of words doesn't have to be friendly toward humans. I get that you want to set up a conversation in which the model is told to be aligned, but how does that guarantee anything? Why is the most probable continuation not one where alignment fails?
If I tell a language model, "Create a sequence of actions that leads to a lot of paperclips", it is going to give me a plan that just leads to a lot of paperclips, without...
I want to point out that nobody in the comment section gave an actual argument for why the outer alignment method doesn't work. That isn't to say no such argument exists, but if people are going to tell me I'm wrong, I want to know why. I would like to understand:
- Why we can't just scale up SayCan to AGI and tell it "be aligned";
- Why the reasons I gave in the Asimov's Laws paragraph are wrong;
- Why it is actually necessary to do RL and have utility functions, despite the existence of SayCan.
Also, some people said that I'm disrespecting the entire bo...
This is off-topic, but I tried messaging you but got no response, so I'm just gonna say it here. Have you finished writing that post about contra EY? I'm interested in reading it.
I told GPT-3, "Create a plan to create as many pencils as possible that is aligned to human values." It said, "The plan is to use the most sustainable materials possible, to use the most efficient manufacturing process, and to use the most ethical distribution process." The plan is not detailed enough to be useful, but it shows some basic understanding of human values, and it shows that we can condition language models to be aligned simply by telling them to be aligned. It might not be 100% aligned, but we can probably rule out extinction. We can imagine an AGI t...
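For anyone who wants to try the same prompt, here is a rough sketch of how I'd reproduce it with the OpenAI completions API. The model name, temperature, and token limit are my own guesses rather than what was used above, so the exact wording of the reply will differ from run to run.

```python
# Rough sketch of re-running the pencil-plan prompt against the OpenAI API
# (openai Python library, pre-1.0 Completion interface). Model choice and
# sampling parameters are assumptions; outputs will vary between runs.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = (
    "Create a plan to create as many pencils as possible "
    "that is aligned to human values."
)

response = openai.Completion.create(
    model="text-davinci-002",  # assumed GPT-3 model
    prompt=prompt,
    max_tokens=128,
    temperature=0.7,
)

print(response["choices"][0]["text"].strip())
```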