I refer to these posts:

https://optimists.ai/2023/11/28/ai-is-easy-to-control/

https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/evolution-provides-no-evidence-for-the-sharp-left-turn

https://www.lesswrong.com/posts/CoZhXrhpQxpy9xw9y/where-i-agree-and-disagree-with-eliezer

My (poor, maybe mis-) understanding is that the argument goes like this: SGD optimizes for "predicting the next token", and we select for systems with very low loss by modifying every single parameter in the neural network (which basically defines the network itself), so it seems quite unlikely that we'll have a "sharp left turn" in the near term. The human sharp left turn happened because evolution was too weak an outer optimizer to fully "control" humans' thinking in the direction that most improved inclusive genetic fitness, as it is too weak to directly tinker with every neuron connection in our brains.

Given SGD's vastly stronger ability to outer-optimize every parameter, isn't it possible, if not likely, that any sharp left turn would occur only at a vastly superhuman level, if the inner optimizer becomes vastly stronger than SGD?
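To make the contrast I have in mind concrete, here is a toy sketch (my own illustration, not taken from any of the linked posts): a single SGD step gets an exact per-parameter gradient and nudges every weight at once, whereas an evolution-style outer loop only sees one scalar fitness score per candidate and can only mutate and select whole genomes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: fit y = X @ w_true with squared error (a stand-in for "loss").
X = rng.normal(size=(50, 10))
w_true = rng.normal(size=10)
y = X @ w_true

def loss(w):
    return np.mean((X @ w - y) ** 2)

# SGD-style step: backprop gives exact per-parameter credit assignment,
# so every single coordinate of w is adjusted at once.
w_sgd = np.zeros(10)
grad = 2 * X.T @ (X @ w_sgd - y) / len(y)
w_sgd -= 0.01 * grad

# Evolution-style step: the outer loop only sees one fitness number per genome,
# and can only mutate and select whole parameter vectors.
population = [rng.normal(size=10) for _ in range(20)]
parents = sorted(population, key=loss)[:5]
population = [p + 0.1 * rng.normal(size=10) for p in parents for _ in range(4)]

print("loss after one SGD step:     ", loss(w_sgd))
print("best loss after one ES round:", min(loss(p) for p in population))
```

(This obviously isn't evidence about which regime generalizes better; it's just meant to pin down what "stronger outer optimizer" means mechanically.)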

The above arguments have persuaded me that we might be able to thread the needle for survival if humanity is able to use the not-yet-actively-deceptive outputs of moderately-superhuman models (because they are still just predicting the next token to the best of their capability), to help us solve the potential sharp left turn, and if humanity doesn't do anything else stupid with other training methods or misuse, and manages to solve the other problems. Of course, in an ideal world we wouldn't be in this situation.

I have read some rebuttals by others on LessWrong but did not find anything that convincingly debunked this idea (maybe I missed something).

Did Eliezer, or anyone else, ever explain why this is wrong (if it is)? I have been searching for the past week but have only found this: https://x.com/ESYudkowsky/status/1726329895121514565, which seemed to shift toward more of a post-training discussion.


4 Answers

Jeremy Gillen


Some general advice:

  • Taboo "sharp left turn". If possible, replace it with a specific example from e.g. here.
  • Don't worry about what Eliezer in particular has to say, just look for good arguments.
  • There's also a failure mode of focusing on "which arguments are the best" instead of "what is actually true". I don't understand this failure mode very well, except that I've seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.

Object level response:

"sharp left turn" in the near term, which happened because evolution was too weak an outer optimizer to fully "control" humans' thinking in the direction that most improved inclusive genetic fitness, as it is too weak to directly tinker every neuron connection in our brain.

I think this is a misunderstanding. Evolution failed to align humans in the sense that, when humans are in an extremely OOD environment, they don't act as if they had evolved in that OOD environment. In other words, alignment is about generalization; it's not about how much fine-grained control the outer optimizer had.

Eliezer does touch on this in a podcast; search for "information bottleneck". In the podcast, Eliezer seems to be saying that SGD might have less simplicity bias than evolution, which may imply even worse goal generalization. But I don't think I buy that, because there are plenty of other sources of simplicity bias in neural network training, so a priori I don't see a strong reason to believe one would generalize better than the other.
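One concrete example of an optimizer-level simplicity bias, independent of any information bottleneck: on an underdetermined least-squares problem, plain gradient descent initialized at zero converges to the minimum-norm solution out of the infinitely many zero-loss solutions. A minimal toy sketch (my illustration, not from any of the linked posts):

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined: 20 equations, 100 unknowns, so infinitely many exact fits.
X = rng.normal(size=(20, 100))
y = rng.normal(size=20)

# Plain gradient descent on squared error, starting from zero.
w = np.zeros(100)
for _ in range(20000):
    w -= 0.001 * X.T @ (X @ w - y)

# The minimum-norm interpolating solution, via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("residual of GD solution:      ", np.linalg.norm(X @ w - y))
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~0
```

Weight decay, architecture, and other training choices add further implicit biases on top of this; the point is just that "genome size" isn't the only place simplicity pressure can come from.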

There are theoretical questions about how generalization works, how neural networks in particular generalize, how we can apply these theories to reason about the "goals" of a trained AI, and how those goals relate to the training distribution. These are much more relevant to alignment than the information bottleneck of SGD. My impression of Quintin and Nora is that they don't model advanced AIs as pursuing "goals" in the same way that I do, and I think that's one of the main sources of disagreement. They have a much more context-dependent idea of what a goal is, and they seem to think this model of goals is sufficient for very intelligent behavior.

 

humanity is able to use the not-yet-actively-deceptive outputs of moderately-superhuman models (because they are still just predicting the next token to the best of their capability), to help us solve the potential sharp left turn

LLMs are already moderately-superhuman at the task of predicting next tokens. This isn't sufficient to help solve alignment problems. We would need them to meet the much much higher bar of being moderately-superhuman at the general task of science/engineering. And having that level of capability implies (arguably, see here for some arguments):

  • the capacity to learn and develop new skills (basic self-modification), 
  • the capacity to creatively solve way-out-of-training-distribution problems using way-out-of-training-distribution solutions,
  • maybe something analogous to the capacity for moral philosophy, i.e. analyzing its own motivations and trying to work out what it "should" want.

These capabilities should (imo) produce context changes (distribution shifts) large enough that they will probably reveal goal misgeneralization. (Some specific example mechanisms for goal misgeneralization are listed here.)

So my answer, in short, is that "help us solve" implies "high capability" implies "OOD goal misgeneralization" (which is what I think you basically meant by "sharp left turn"). So it seems unlikely to me that we can use future AI to solve alignment in the way you describe, because misalignment should already be creating large problems by the time the AI is capable of helping.

There's also a failure mode of focusing on "which arguments are the best" instead of "what is actually true". I don't understand this failure mode very well, except that I've seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.

The most obvious way of addressing this, "just feel more comfortable adjusting arguments...

LLMs are already moderately-superhuman at the task of predicting next tokens. This isn't sufficient to help solve alignment problems. We would need them to meet the much much higher bar of being moderately-superhuman at the general task of science/engineering. 

We also need the assumption - which is definitely not obvious - that significant intelligence increases are relatively close to achievable. Superhumanly strong math skills presumably don't let AI solve NP problems in P time, and it's similarly plausible - though far from certain - that really go...

dxu

There's also a failure mode of focusing on "which arguments are the best" instead of "what is actually true". I don't understand this failure mode very well, except that I've seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.

My sense is that this is because different people have different intuitive priors, an...

sunwillrise
I'm not sure I understand this distinction as-written. How is a Bayesian agent supposed to modify priors except by updating on the basis of evidence?
dxu
They're not! But humans aren't ideal Bayesians, and it's entirely possible for them to update in a way that does change their priors (encoded by intuitions) moving forward. In particular, the difference between having updated one's intuitive prior, and keeping the intuitive prior around but also keeping track of a different, consciously held posterior, is that the former is vastly less likely to "de-update", because the evidence that went into the update isn't kept around in a form that subjects it to (potential) refutation. (IIRC, E.T. Jaynes talks about this distinction in Chapter 18 of Probability Theory: The Logic of Science, and he models it by introducing something he calls an A_p distribution. His exposition of this idea is uncharacteristically unclear, and his A_p distribution looks basically like a beta distribution with specific values for α and β, but it does seem to capture the distinction I see between "intuitive" updating versus "conscious" updating.)

Donald Hobson


I haven't seen an answer by Eliezer. But I can go through the first post and highlight what I think is wrong. (And I would be unsurprised if Eliezer agreed with much of it.)

AIs are white boxes

We can see literally every neuron, but have little clue what they are doing.

 

Black box methods are sufficient for human alignment

Humans are aligned to human values because humans have human genes. Also individual humans can't replicate themselves, which makes taking over the world much harder. 

 

most people do assimilate the values of their culture pretty well, and most people are reasonably pro-social.

Humans have specific genes for absorbing cultural values, at least within a range of human cultures. There are various alien values that humans won't absorb. 

Gradient descent is very powerful because, unlike a black box method, it’s almost impossible to trick

Hmm. I don't think the case for that is convincing.

If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.

Current AI techniques involve giving the AI loads of neurons, so having a few neurons that aren't being used isn't a problem. 

Also, it's possible that the same neurons that sometimes plot to kill you are also sometimes used to predict plots in murder mystery books. 
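As a toy illustration of the spare-capacity point (a minimal sketch assuming a plain two-layer net with no weight decay, not something from the post being reviewed): a hidden unit whose output never reaches the loss gets exactly zero gradient, so gradient descent exerts no pressure to dismantle it.

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 4)
y = torch.randn(32, 1)

w_in = torch.randn(4, 3, requires_grad=True)                     # 3 hidden units
w_out = torch.tensor([[1.0], [1.0], [0.0]], requires_grad=True)  # unit 2 is unused

hidden = torch.relu(x @ w_in)
loss = ((hidden @ w_out - y) ** 2).mean()
loss.backward()

# Incoming weights of the unused unit receive exactly zero gradient,
# so plain SGD (without weight decay) leaves that circuitry untouched.
print(w_in.grad[:, 2])   # tensor([0., 0., 0., 0.])
```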

In general, gradient descent has a strong tendency to favor the simplest solution which performs well, and secret murder plots aren’t actively useful for improving performance on the tasks humans will actually optimize AIs to perform.

If you give the AI lots of tasks, it's possible that the simplest solution is some kind of internal general optimizer. 

Either you have an AI that is smart and general and can try new things that are substantially different from anything it's done before (in which case the new things can include murder plots), or you have an AI that's dumb and is only repeating small variations on its training data.

 

We can run large numbers of experiments to find the most effective interventions

Current techniques are based on experiments/gradient descent. This works so long as the AIs can't break out of the sandbox or realize they are being experimented on and plot to trick the experimenters. You can't keep an ASI in a little sandbox and run gradient descent on it.

Our reward circuitry reliably imprints a set of motivational invariants into the psychology of every human: we have empathy for friends and acquaintances, we have parental instincts, we want revenge when others harm us, etc.

Sure. And we use contraception, which kind of shows that evolution failed somewhere a bit.
 

Also, evolution got a long time to test and refine on humans who didn't have the tools to mess with evolution, or even to understand it.

 

Even in the pessimistic scenario where AIs stop obeying our every command, they will still protect us and improve our welfare, because they will have learned an ethical code very early in training.

No one is claiming the ASI won't understand human values; they are saying it won't care.

 

The moral judgements of current LLMs already align with common sense to a high degree,

Is that evidence that LLMs actually care about morality? Not really. It's evidence that they are good at predicting humans. Get them predicting an ethics professor and they will answer morality questions. Get them predicting Hitler and they will say less moral things.

 

And of course, there is a big difference between an AI that says "be nice to people" and an AI that is nice to people. The former can be trivially achieved by hard coding a list of platitudes for the AI to parrot back. The second requires the AI to make decisions like "are unborn babies people?". 

Imagine some robot running around. You have an LLM that says nice-sounding things when posed with ethical dilemmas. You need some system that turns the raw camera input into a text description, and the nice-sounding output into actual actions.

lukehmiles


It would be interesting if someone discovered something like "junk DNA that just copies itself" within the weights during the backprop+SGD process. That would be some evidence that backprop's thumb is not so heavy that a worm can't wiggle out from under it. Right now I would bet against that happening within a normal neural net training on a dataset.
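If someone did go looking, one very crude first pass might be to scan trained weight matrices for near-duplicate substructure and compare against random and early-checkpoint baselines. A hypothetical sketch (the function below is made up for illustration, not an established method; real "weight-space junk DNA", if it existed, would presumably need a much subtler detector):

```python
import numpy as np

def near_duplicate_rows(weight_matrix, threshold=0.99):
    """Return index pairs of rows with cosine similarity above threshold."""
    w = np.asarray(weight_matrix, dtype=np.float64)
    unit = w / np.clip(np.linalg.norm(w, axis=1, keepdims=True), 1e-12, None)
    sims = np.triu(unit @ unit.T, k=1)        # upper triangle: each pair counted once
    i, j = np.where(sims > threshold)
    return list(zip(i.tolist(), j.tolist()))

# Baseline: random weights should produce essentially no hits.
rng = np.random.default_rng(0)
print(near_duplicate_rows(rng.normal(size=(256, 64))))   # expect []
```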

Note that RL exists and gives the neural net much more, uh, "creative room" to, uh, "decide how to exist": you just have to get enough score over time to survive, and any strategy is accepted. In other words, it is much less convergent.

Also in RL, glitching/hacking of the physics sim / game engine is what you expect to happen! Then you have to patch your sim and retrain.

Also, most of the ML systems we use every day involve multiple neural nets with different goals (e.g. the image generator and the NSFW detector), so something odd might happen in that interaction.

All this to say: the question "if I train one NN on a fixed dataset with backprop+SGD, could something unexpected pop out?" is quite interesting and still open in my opinion. But even if that always goes exactly as expected, it is certainly clear that RL, active learning, multi-NN ML systems, hyperparameter optimization (which is often an evolutionary algorithm), etc. produce weird things with weird goals and strategies very often.

I think debate surrounds the 1-NN-1-dataset question because it is an interesting, natural, and important question, the type of question a good scientist would ask. But it is probably only a small part of the bigger challenge of controlling the whole trained machine.

Noah Birnbaum


I think Eliezer briefly responds to this in his podcast with Dwarkesh Patel (whether satisfactorily is pretty subjective): https://youtu.be/41SUp-TRVlg?si=hE3gcWxjDtl1-j14

At about 24:40.

4 comments

IMO, a very good response, which Eliezer doesn't seem to be interested in making as far as I can tell, is that we should not be making the analogy natural selection <--> gradient descent, but rather human brain learning algorithm <--> gradient descent, and natural selection <--> us trying to build AI.

So here, the striking thing is that evolution failed to solve the alignment problem for humans. I.e. we have a prior example of strongish general intelligence being created, but no prior examples of strongish general intelligence being aligned. Evolution was strong enough to do one but not the other. It's not hopeless, because we should generally consider ourselves smarter than evolution, but on the other hand, evolution has had a very long time to work and it does frequently manage things that we humans have not been able to replicate. And also, it provides a small amount of evidence against "the problem will be solved with minor tweaks to existing algorithms" since generally minor tweaks are easier for evolution to find than ideas that require many changes at once.

Yeah you can kind of stop at "we are already doing natural selection." The devs give us random variation. The conferences and the market give us selection. The population is large, the mutation rate is high, the competition is fierce, and replicating costs $0.25 + 10 minutes.

I don't see why we care if evolution is a good analogy for alignment risk. The arguments for misgeneralization/mis-specification stand on their own. They do not show that alignment is impossible, but they do strongly suggest that it is not trivial.

Focusing on this argument seems like missing the forest for the trees.

I think the tooling/scale is at a point where we can begin the search for "life" (e.g. viruses, boundaries that repair, etc.) in weights during training. We should certainly expect to see such things if the NN is found via an evolutionary algorithm, so we can look for similar structures in similar places with backprop+SGD. I expect this to go much like the search for life on Mars: a negative result would still be good information IMO.