Or, why we probably don't need to worry about AI.

This post is partly a response to Amalthea's comment that I had simply claimed my side was right. I replied at the time that I was going for a short comment rather than writing yet another very long comment on the issue.

https://www.lesswrong.com/posts/aW288uWABwTruBmgF/?commentId=r7s9JwqP5gt4sg4HZ#r7s9JwqP5gt4sg4HZ

In this post I won't simply claim that my side is right. Instead, I'll give evidence, which also lets me properly collect my thoughts. This will be a link-heavy post, and I'll reference a lot of concepts and conversations, so it will help if you have some light background on these ideas, but I will try to make everything intelligible to the lay/non-technical reader.

This will be a long post, so get a drink and a snack.

The Sharp Left Turn probably won't happen, because AI training is very different from evolution

Nate Soares suggests that a critical problem in AI safety is the sharp left turn, which is essentially the claim that capabilities generalize much further than goals, i.e. it is basically goal misgeneralization plus fast takeoff:

My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology, to a high enough degree that it more-or-less singlehandedly threatens the entire world. Probably without needing explicit training for its most skilled feats, any more than humans needed many generations of killing off the least-successful rocket engineers to refine our brains towards rocket-engineering before humanity managed to achieve a moon landing.

And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities.

So the analogy is essentially this: the AI is aligned on the training data, but due to the limitations of the alignment method, that alignment fails to generalize to the test set.

Here's the problem: We actually know why the sharp left turn happened, and the circumstances that led to the sharp left turn in humans won't reappear in AI training and AI progress.

Basically, the sharp left turn happened because the outer optimizer, evolution, was billions of times less powerful than the inner search process (human lifetime learning), and the inner learners, like us humans, die after basically a single step, or at best 2-3 steps, of the outer optimizer. Evolution can't transmit nearly as many bits from one generation to the next via its tools as cultural evolution can, and the difference in their ability to transmit bits over a given time-scale is massive.

Once we had the ability to transmit some information via culture, our ability to optimize billions of times more efficiently meant that we could essentially undergo a sharp left turn where capabilities spiked. But the only reason this happened was, to quote Quintin Pope:

Once the inner learning processes become capable enough to pass their knowledge along to their successors, you get what looks like a sharp left turn. But that sharp left turn only happens because the inner learners have found a kludgy workaround past the crippling flaw where they all get deleted shortly after initialization.

This flaw does not exist for AIs trained with SGD, where there is a much smaller gap between the outer optimizer (SGD) and the inner optimizer, with the difference being ~0-40x.

Here's the source for it below, and I'll explicitly quote it:

https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/evolution-provides-no-evidence-for-the-sharp-left-turn#Don_t_misgeneralize_from_evolution_to_AI

See also: Model Agnostic Meta Learning proposed a bi-level optimization process that used between 10 and 40 times more compute in the inner loop, only for Rapid Learning or Feature Reuse? to show they could get about the same performance while removing almost all the compute from the inner loop, or even by getting rid of the inner loop entirely.

Also, we can set the ratio of outer to inner optimization steps to basically whatever we want, which means that we can control the inner learner's rates of learning far better than evolution, meaning we can prevent a sharp left turn from happening.
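To make the point about choosing the ratio concrete, here is a minimal sketch of a bi-level loop in which the number of inner steps per outer step is just an argument we pass in. Everything here is an assumption made for illustration: the toy quadratic loss, the names `run_bilevel` and `inner_grad`, and the first-order outer update (which ignores second-order terms) are not taken from the MAML paper or from Quintin's post.

```python
import random


def inner_grad(w, target):
    # Gradient of the toy inner loss 0.5 * (w - target) ** 2 with respect to w.
    return w - target


def run_bilevel(outer_steps, inner_steps, inner_lr=0.1, outer_lr=0.05, seed=0):
    """Toy bi-level optimization.

    The outer optimizer tunes a shared initialization w0. Each inner learner
    starts from w0 and adapts to one sampled task for `inner_steps` steps.
    Returns the final w0 and the total number of inner updates performed.
    """
    rng = random.Random(seed)
    w0 = 5.0  # outer parameter: where every inner learner starts
    inner_updates = 0
    for _ in range(outer_steps):
        target = rng.gauss(0.0, 1.0)  # hypothetical task drawn for this outer step
        w = w0
        for _ in range(inner_steps):
            w -= inner_lr * inner_grad(w, target)
            inner_updates += 1
        # First-order outer update: reuse the gradient at the adapted w for w0.
        w0 -= outer_lr * inner_grad(w, target)
    return w0, inner_updates


# The inner/outer ratio is a hyperparameter we choose, not a fact of nature.
for inner_steps in (0, 1, 40):
    w0, inner_updates = run_bilevel(outer_steps=100, inner_steps=inner_steps)
    print(inner_steps, inner_updates, round(w0, 3))
```

With `inner_steps=0` the loop collapses to ordinary gradient descent on `w0`, which is the "no inner loop at all" end of the range mentioned above.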

A crux I have with Jan Kulveit is that, to the extent animals do have culture, it is much more limited than human culture, and that evolution has little ability to pass on traits non-culturally. Very critically, this is a one-time inefficiency: there is no reason to assume a second source of massive inefficiency that would lead to another sharp left turn.

X4vier in particular illustrates this, and I'll link the comments below:

https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/?commentId=qYFkt2JRv3WzAXsHL

https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/?commentId=vETS4TqDPMqZD2LAN

I don't believe that Nate's example actually shows the misgeneralization we're concerned about

This is because the alleged misgeneralization was not a situation where one AI was trained in an environment and maximized the correlates of IGF, then encountered inputs in a new environment that changed its goals, such that it misgeneralized and no longer pursued IGF.

What happened is that evolution trained humans in one environment to optimize the correlates of IGF, then basically trained new humans in another environment, and they diverged.

Very critically, there were thousands of different systems/humans being trained in drastically different environments, not one AI being trained on different environments as in modern AI training, so it's not a valid example of misgeneralization.

Some posts and quotes from Quintin Pope will help:

(Part 2, how this matters for analogies from evolution) Many of the most fundamental questions of alignment are about how AIs will generalize from their training data. E.g., "If we train the AI to act nicely in situations where we can provide oversight, will it continue to act nicely in situations where we can't provide oversight?"

When people try to use human evolutionary history to make predictions about AI generalizations, they often make arguments like "In the ancestral environment, evolution trained humans to do X, but in the modern environment, they do Y instead." Then they try to infer something about AI generalizations by pointing to how X and Y differ.

However, such arguments make a critical misstep: evolution optimizes over the human genome, which is the top level of the human learning process. Evolution applies very little direct optimization power to the middle level. E.g., evolution does not transfer the skills, knowledge, values, or behaviors learned by one generation to their descendants. The descendants must re-learn those things from information present in the environment (which may include demonstrations and instructions from the previous generation).

This distinction matters because the entire point of a learning system being trained on environmental data is to insert useful information and behavioral patterns into the middle level stuff. But this (mostly) doesn't happen with evolution, so the transition from ancestral environment to modern environment is not an example of a learning system generalizing from its training data. It's not an example of:

We trained the system in environment A. Then, the trained system processed a different distribution of inputs from environment B, and now the system behaves differently.

It's an example of:

We trained a system in environment A. Then, we trained a fresh version of the same system on a different distribution of inputs from environment B, and now the two different systems behave differently.

These are completely different kinds of transitions, and trying to reason from an instance of the second kind of transition (humans in ancestral versus modern environments), to an instance of the first kind of transition (future AIs in training versus deployment), will very easily lead you astray.

Two different learning systems, trained on data from two different distributions, will usually have greater divergence between their behaviors, as compared to a single system which is being evaluated on the data from the two different distributions. Treating our evolutionary history like humanity's "training" will thus lead to overly pessimistic expectations regarding the stability and predictability of an AI's generalizations from its training data.

Drawing correct lessons about AI from human evolutionary history requires tracking how evolution influenced the different levels of the human learning process. I generally find that such corrected evolutionary analogies carry implications that are far less interesting or concerning than their uncorrected counterparts. E.g., here are two ways of thinking about how humans came to like ice cream:

If we assume that humans were "trained" in the ancestral environment to pursue gazelle meat and such, and then "deployed" into the modern environment where we pursued ice cream instead, then that's an example where behavior in training completely fails to predict behavior in deployment.

If there are actually two different sets of training "runs", one set trained in the ancestral environment where the humans were rewarded for pursuing gazelles, and one set trained in the modern environment where the humans were rewarded for pursuing ice cream, then the fact that humans from the latter set tend to like ice cream is no surprise at all.

In particular, this outcome doesn't tell us anything new or concerning from an alignment perspective. The only lesson applicable to a single training process is the fact that, if you reward a learner for doing something, they'll tend to do similar stuff in the future, which is pretty much the common understanding of what rewards do.
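The two kinds of transition in the quote above can be sketched as two different procedures. This is a toy illustration only: the reward-tracking learner, the function names `train` and `favorite`, and the reward numbers are all assumptions, and nothing here models real humans or real AI systems. What the frozen learner happens to choose is an artifact of the toy. The point is only that "train once, then evaluate on new inputs" and "train a fresh learner on the new inputs" are different computations.

```python
def train(rewards, steps=200, lr=0.1):
    """Return learned preference scores after reward-driven updates.

    `rewards` maps each option available in the training environment to the
    reward the learner receives for pursuing it.
    """
    prefs = {option: 0.0 for option in rewards}
    for _ in range(steps):
        for option, reward in rewards.items():
            # Move each preference toward the reward experienced for that option.
            prefs[option] += lr * (reward - prefs[option])
    return prefs


def favorite(prefs, available):
    # Pick the available option with the highest learned preference.
    # Options never seen in training get a default score of 0.
    return max(available, key=lambda option: prefs.get(option, 0.0))


ancestral_env = {"gazelle": 1.0, "berries": 0.3}
modern_env = {"ice cream": 1.0, "berries": 0.3}
modern_options = list(modern_env)

# First kind of transition: one system trained in A, then evaluated on B's inputs.
learner_a = train(ancestral_env)
deployed_choice = favorite(learner_a, modern_options)

# Second kind of transition: a fresh system trained directly in B.
learner_b = train(modern_env)
retrained_choice = favorite(learner_b, modern_options)

print(deployed_choice, retrained_choice)
```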

A comment by Quintin on why humans didn't actually misgeneralize to liking ice cream:

https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/?commentId=sYA9PLztwiTWY939B

AIs are white boxes, and we are the innate reward system

Edit from comments due to Steven Byrnes: The white-box definition I'm using in this post does not correspond to the intuitive definition of a white box, and instead refers to the computer analysis/security sense of the term.

These links will be the definitions of white box AI going forward for this post:

https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/ai-pause-will-likely-backfire#Alignment_optimism__AIs_are_white_boxes

https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/?commentId=CLi5eBchYfXKZvXuD

https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/?commentId=qisPbHyDHMKxgNGeh#qisPbHyDHMKxgNGeh

The above arguments, on why the Sharp Left Turn probably won't reappear in modern AI development and why humans didn't actually misgeneralize, are enough to move us away from the views of the most doomy voices like Eliezer Yudkowsky. In particular, removing the reasons to assume extreme misgeneralization takes us out of MIRI-sphere views, and arguably below 50% p(doom). But I want to argue that the chance of doom is far lower than that, so low that we mostly shouldn't be concerned about AI. To do so I have to provide a positive story of why AIs are very likely to be aligned, and in this context I argue that AIs are white boxes and we are the innate reward system.

The key advantage we have over evolution is that, unlike when studying brains, we have full read-write access to an AI's internals. AIs are essentially a special type of computer program, and we already have ways to manipulate computer programs at essentially no cost to us. Indeed, this is why SGD and backpropagation work at all to optimize neural networks. If the AI were a black box, SGD and backpropagation wouldn't work.

The innate reward system aligns us via whitebox methods, and the values that the reward system imprints on us are ridiculously reliable: almost every human has empathy for friends and acquaintances, parental instincts, a desire for revenge, etc.
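As a rough illustration of what "read-write access" means mechanically, here is a tiny sketch of one gradient step on a three-weight linear model. The model, the data point, and the function names are assumptions chosen for illustration. The only point is that every weight can be read, its exact sensitivity to the loss can be computed, and it can be overwritten.

```python
def predict(weights, x):
    # Linear model: dot product of weights and inputs.
    return sum(w * xi for w, xi in zip(weights, x))


def loss_and_grads(weights, x, y):
    """Squared error and its exact gradient with respect to every weight."""
    error = predict(weights, x) - y
    return error ** 2, [2 * error * xi for xi in x]


weights = [0.5, -1.0, 2.0]        # read access: every parameter is visible
x, y = [1.0, 2.0, 3.0], 1.0       # one hypothetical training example

loss_before, grads = loss_and_grads(weights, x, y)

# Write access: overwrite every parameter using its own gradient.
learning_rate = 0.01
weights = [w - learning_rate * g for w, g in zip(weights, grads)]

loss_after, _ = loss_and_grads(weights, x, y)
print(loss_before, loss_after)
```

None of these steps is available for a system that only exposes inputs and outputs, which is the sense in which backpropagation depends on the model's internals being accessible.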

This is shown in the link below:

https://forum.effectivealtruism.org/s/vw6tX5SyvTwMeSxJk/p/JYEAL8g7ArqGoTaX6#White_box_alignment_in_nature

(Here we must take a detour: our reward system is ridiculously good at aligning us to survive, and flaws like obesity in the modern world are usually surprisingly mild failures, in that the human turns out to be less capable at something than we thought. This arguably implies that alignment failures in practice will look much more like capabilities failures. Passing the analogy back to the AI case, I basically don't expect X-risk, GCRs, or really anything more severe than, say, the AI messing up a kitchen.)

Steven Byrnes raised the concern that if you don't know how to do the manipulation, then it does cost you to gain the knowledge.

Steven Byrnes's comment is linked here: https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/?commentId=3xxsumjgHWoJqSzqw

Nora Belrose responded on what white boxing meant, as well as how people use SGD to automate the search so that the cost of manipulation in an overall sense is as low as possible:

https://twitter.com/norabelrose/status/1709603325078102394

I mean it in the computer security sense, where it refers to the observability of the source code of a program (Nora Belrose)

https://twitter.com/norabelrose/status/1709606248314998835

We can do better than IDA Pro & Ghidra by exploiting the differentability of neural nets, using SGD to locate the manipulations of NN weights that improve alignment the most

I’d be much more worried if we didn’t have SGD and were just evolving AGI in a sim or smth (Nora Belrose)

https://twitter.com/norabelrose/status/1709601025286635762

I’m pointing out that it’s a white box in the very literal sense that you can observe and manipulate everything that’s going on inside, and this is a far from trivial fact because you can’t do this with other systems we routinely align like humans or animals. (Nora Belrose)

https://twitter.com/norabelrose/status/1709603731413901382

No, I don’t agree this is a weakening. In a literal sense it is zero cost to analyze and manipulate the NNs. It may be greater than zero cost to come up with manual manipulations that achieve some goal. But that’s why we automate the search for manipulations using SGD (Nora Belrose)

Steven Byrnes argues that this could be due to differing definitions:

https://twitter.com/steve47285/status/1709655473941631430

I think that’s a black box with a button on the front panel that says “SGD”. We can talk all day about all the cool things we can do by pressing the SGD button. But it's still a button outside the box, metaphorically.

To me, “white box” would mean: If an LLM outputs A rather than B, and you ask me why, then I can always give you a reasonable answer. I claim that this is closer to how that term is normally used in practice.

(Yes I know, it’s not literally a button, it’s an input-output interface that also changes the black box internals.) (Steven Byrnes)

I've included the response chain above to show why Nora Belrose and Steven Byrnes were disagreeing.

I ultimately think a potential source of the difference is this: for alignment purposes, humans vs AI is not a very useful abstraction, and SGD vs the inner optimizer is the better one. On that framing, it doesn't matter whether AI progresses in general. What matters is the specific progress made by humans + SGD vs the inner optimizer, and so the cost of manipulating AI values is quite low.

This leads to...

I believe the security mindset is inappropriate for AI

In general, a disagreement I commonly have with a lot of LWers is that I think there is very limited transfer of knowledge from the computer security field to AI, because AI is very different in ways that make the analogies inappropriate.

For one particular example, you can randomly double your training data, or the size of the model, and it will usually work just fine. A rocket would explode if you tried to double the size of its fuel tanks.

All of this and more is explained by Quintin below, but there are several big disanalogies between the AI field and the computer security field, so much so that I think that ML/AI is a lot like quantum mechanics, where we shouldn't port intuitions from other fields and expect them to work because of the weirdness of the domain:

https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/#Yudkowsky_mentions_the_security_mindset__

Similarly, I think that machine learning is not really like computer security, or rocket science (another analogy that Yudkowsky often uses). Some examples of things that happen in ML that don't really happen in other fields:

Models are internally modular by default. Swapping the positions of nearby transformer layers causes little performance degradation.

Swapping a computer's hard drive for its CPU, or swapping a rocket's fuel tank for one of its stabilization fins, would lead to instant failure at best. Similarly, swapping around different steps of a cryptographic protocol will usually make it output nonsense. At worst, it will introduce a crippling security flaw. For example, password salts are added before hashing the passwords. If you switch to adding them after, this makes salting near useless.

We can arithmetically edit models. We can finetune one model for many tasks individually and track how the weights change with each finetuning to get a "task vector" for each task. We can then add task vectors together to make a model that's good at multiple of the tasks at once, or we can subtract out task vectors to make the model worse at the associated tasks.

Randomly adding / subtracting extra pieces to either rockets or cryptosystems is playing with the worst kind of fire, and will eventually get you hacked or exploded, respectively.

We can stitch different models together, without any retraining.

The rough equivalent for computer security would be to have two encryption algorithms A and B, and a plaintext X. Then, midway through applying A to X, switch over to using B instead. For rocketry, it would be like building two different rockets, then trying to weld the top half of one rocket onto the bottom half of the other.

Things often get easier as they get bigger. Scaling models makes them learn faster, and makes them more robust.

This is usually not the case in security or rocket science.

You can just randomly change around what you're doing in ML training, and it often works fine. E.g., you can just double the size of your model, or of your training data, or change around hyperparameters of your training process, while making literally zero other adjustments, and things usually won't explode.

Rockets will literally explode if you try to randomly double the size of their fuel tanks.

I don't think this sort of weirdness fits into the framework / "narrative" of any preexisting field. I think these results are like the weirdness of quantum tunneling or the double slit experiment: signs that we're dealing with a very strange domain, and we should be skeptical of importing intuitions from other domains.
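The "arithmetic editing" item in the quote above can be sketched in a few lines. This is a toy under stated assumptions: real task vectors are differences between full parameter tensors of finetuned and base models, whereas here each "model" is a dict with two scalar weights, and the names `task_vector` and `apply_task_vectors` are hypothetical. Whether the edited model actually gets better or worse at a task is an empirical question this sketch cannot answer.

```python
def task_vector(base, finetuned):
    # A task vector is the elementwise change in weights caused by finetuning.
    return {name: finetuned[name] - base[name] for name in base}


def apply_task_vectors(base, vectors, scale=1.0):
    """Add (scale > 0) or subtract (scale < 0) task vectors from base weights."""
    edited = dict(base)
    for vector in vectors:
        for name, delta in vector.items():
            edited[name] += scale * delta
    return edited


base = {"w1": 1.0, "w2": -2.0}
finetuned_on_a = {"w1": 1.5, "w2": -2.0}   # hypothetical result of finetuning on task A
finetuned_on_b = {"w1": 1.0, "w2": -1.0}   # hypothetical result of finetuning on task B

vector_a = task_vector(base, finetuned_on_a)
vector_b = task_vector(base, finetuned_on_b)

multi_task = apply_task_vectors(base, [vector_a, vector_b])      # add both tasks
without_a = apply_task_vectors(base, [vector_a], scale=-1.0)     # subtract task A

print(multi_task, without_a)
```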

I also believe there is an epistemic difference between computer security and alignment: in computer security, there's an easy-to-check ground truth for whether a cryptosystem is broken, whereas in AI alignment, we don't have the ability to get feedback on proposed breakages of alignment schemes.

For more, see the section of Quintin's post on the epistemic differences between AI safety and computer security, which includes a worked example of an attempted security break, where there is suggestive evidence that inner-misaligned models/optimization daemons go away as we increase the number of dimensions.

https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#True_experts_learn__and_prove_themselves__by_breaking_things

(Here Quintin Pope points out that alignment doesn't have good ground-truth feedback loops for evaluating an attempted break, and that the example of a claimed break actually went away as the number of dimensions was scaled up. Note that the disconfirming evidence was more realistic than the attempted break.)

This is why I disagreed with Jeffrey Ladish about the security mindset on Twitter: I believe it's a trap for those who lack technical knowledge, like a lot of LWers, and there are massive differences between AI and computer security that mean most attempted connections fail.

https://twitter.com/JeffLadish/status/1712262020438131062

uh I guess I hope he reads enough to internalize the security mindset?? (Jeffrey Ladish)

https://twitter.com/SharmakeFarah14/status/1712264530829492518

I generally tend to think the security mindset is a trap, because ML/AI alignment is very different from rocket engineering or cybersecurity.

For a primer on why, read @QuintinPope5's post section on it:

https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Yudkowsky_mentions_the_security_mindset__ (Myself)

Now that I've tried to show why porting over the security mindset is flawed, I want to talk about a class of adversaries, namely gradient hackers or inner-misaligned mesa-optimizers, and why I believe such attacks are actually very difficult to pull off against SGD. Even with the non-platonic-ideal version of SGD, we can detect most mesa-optimizers quite easily.

Inner Misalignment, or at least Gradient Hacking, is very difficult for AIs trained with SGD

I'll be taking the inner misalignment definition from Evan Hubinger's post The Inner Alignment Problem:

https://www.lesswrong.com/posts/pL56xPoniLvtMDQ4J/the-inner-alignment-problem

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem.

The basic reason it's hard for a misaligned mesa-optimizer to stick around for long is that gradient descent is, in fact, much more powerful and white-boxy than people realize. In particular, it has 5 defenses that any mesa-optimizer would need to overcome in order to misalign the model:

https://www.lesswrong.com/posts/w2TAEvME2yAG9MHeq/gradient-hacking-is-extremely-difficult

1.) Everything is optimized simultaneously and independently. The exact local sensitivity to of each and every parameter to the loss is computed in the backward pass and every parameter is updated based on this. This means it is impossible for a mesaoptimizer or gradient hacker (as long as it is differentiably connected to the network) to shield any part of itself from gradient updates. This includes the gradient hacker's internal circuitry, its connections with other parts of the network, its connections to the network's input data, and the connections between the rest of the network and the loss.

2.) Gradient descent is maximal. Each parameter is updated in the direction of steepest descent. This means that, unlike biological life, gradient hackers cannot 'burn' something to increase entropy to be traded off against decreasing entropy elsewhere. Only the local maximally increasing entropy direction is allowed.

3.) It is stateless. Each parameter is updated according to its local sensitivity to the loss for each datapoint. Credit is not propagated across time. It is impossible to 'bargain' with gradient descent by increasing loss now but decreasing loss more later, or having decreased loss in the past.

4.) It is first-order only. It only computes partial derivatives of the parameter to the loss. It does not take into account potential second order interactions. While bad in some cases, this makes gradient descent immune to 'blackmail' or 'tripwires' setup by mesaoptimizers where optimizing some part of their circuit would potentially cause a huge loss as a second order effect. GD will blithely trigger the tripwire and then optimize away the consequences. This also disallows 'trade' where a gradient hacker could cause low loss on one causal branch in exchange for high loss on some other branch. Gradient descent will instead sum the steepest descent direction on both branches.

5.) It is infinitesimal. GD only computes the infinitesimal local sensitivities of each parameter regardless of the actual learning rate or step-size in practice. This means gradient descent does not 'see' if you are standing next to a giant cliff, so cannot be blackmailed in this way either.

Basically, SGD will optimize the entire causal graph and leave no slack, and as a bonus it is extremely resistant to blackmail by mesa-optimizers. In general, a big part of my optimism around inner alignment is that SGD is extraordinarily good at credit assignment, and it has quite strong correction features in the case that a mesa-optimizer does attempt to misalign the model.
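Two of the mechanical properties listed above (every differentiably connected parameter receives an update, and each update depends only on the current parameters) can be seen in a toy gradient descent loop. This is a sketch under heavy assumptions: the three-parameter loss, the parameter named `hacker`, and the finite-difference gradient are all made up for illustration. It shows how the update rule behaves on a toy problem and does not demonstrate that gradient hacking is impossible in real networks.

```python
def toy_loss(params):
    # The output depends on all three parameters, so all are connected to the loss.
    output = params["a"] * params["b"] + params["hacker"]
    return (output - 1.0) ** 2


def numerical_grad(loss_fn, params, eps=1e-6):
    """Central-difference estimate of the gradient for every parameter."""
    grads = {}
    for name in params:
        up = dict(params)
        up[name] += eps
        down = dict(params)
        down[name] -= eps
        grads[name] = (loss_fn(up) - loss_fn(down)) / (2 * eps)
    return grads


def gd_step(loss_fn, params, lr=0.1):
    # Stateless: the step uses only the current parameters, with no memory of past losses.
    grads = numerical_grad(loss_fn, params)
    return {name: params[name] - lr * grads[name] for name in params}


start = {"a": 0.5, "b": 0.5, "hacker": 2.0}
after_one_step = gd_step(toy_loss, start)

params = dict(start)
for _ in range(50):
    params = gd_step(toy_loss, params)

print(after_one_step)
print(toy_loss(start), toy_loss(params))
```

In this toy, no parameter is left untouched by the first step, including the one labeled `hacker`, because each one has a nonzero sensitivity to the loss.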

We also can detect most mesa-optimizers in the AI without the need for mechanistic interpretability, like so:

While it might be possible to construct gradient hackers around these features, the advantages to us is that all of these approaches (except perhaps 2) should be visible to us even without detailed mechanistic interpretability knowledge of the network. We know a-priori where the stop-gradients or non-differentiable parts of the architecture are. We can straightforwardly observe what gradients are being clipped -- and if the gradient updates to the same set of parameters is being constantly clipped that could be a red flag. We can also find regions of high curvature in the loss landscape by just inspecting the Hessians or approximations of it. Only a gradient hacker which utilizes correlations across minibatches to cancel out its interventions in model behaviour seems difficult to detect.
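One of the checks described in the quote above, watching for parameters whose gradient updates are constantly clipped, could look roughly like the following. This is a hypothetical monitor, not an existing tool: the function names, the per-element clipping rule, and the 90% threshold are assumptions (many training setups clip by global norm instead), and a flagged parameter is only a prompt for a closer look.

```python
from collections import Counter


def clip_and_record(grads, max_abs, clip_counts):
    """Clip each gradient to [-max_abs, max_abs] and count which ones were clipped."""
    clipped = {}
    for name, grad in grads.items():
        if abs(grad) > max_abs:
            clip_counts[name] += 1
            grad = max_abs if grad > 0 else -max_abs
        clipped[name] = grad
    return clipped


def persistently_clipped(clip_counts, total_steps, threshold=0.9):
    # Flag parameters that were clipped in at least `threshold` of all steps.
    return sorted(
        name for name, count in clip_counts.items()
        if count / total_steps >= threshold
    )


clip_counts = Counter()
total_steps = 10
for step in range(total_steps):
    # Hypothetical gradients: one parameter is huge on every step, one only once.
    grads = {
        "w_normal": 0.1,
        "w_spiky": 20.0 if step == 3 else 0.2,
        "w_suspicious": 50.0,
    }
    clip_and_record(grads, max_abs=1.0, clip_counts=clip_counts)

flagged = persistently_clipped(clip_counts, total_steps)
print(flagged)
```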

One caveat here is that the prevention of mesa-optimizers applies fully only to self-supervised learning (SSL) on IID data, which is an unfortunate limitation, though I do expect SGD to still be ridiculously good at credit assignment even in the RL context.

In particular, the detection mechanisms for mesa-optimizers remain intact, but we do need to worry about one new potential inner misalignment pathway:

Also of importance is that this post only applies to the classic supervised/self-supervised learning on i.i.d data which most current foundation models are trained in. The important things about this paradigm is that the outputs of the model have no causal link to the data distribution -- i.e. the model cannot influence what future data it will be trained on (except potentially highly indirectly influencing successor models [1]). In an RL context, or supervised learning with adaptive data sampling, where there is a link between the model's outputs and the future data distribution, then this provides another route for gradient hackers to operate -- by creating outputs which will steer the data distribution in a desired direction which will strengthen the gradient hacker.

But there's also weak evidence that optimization daemons/demons, often called inner misaligned models, go away when you increase the dimension count:

https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#True_experts_learn__and_prove_themselves__by_breaking_things

Another poster (ironically using the handle "DaemonicSigil") then found a scenario in which gradient descent does form an optimization demon. However, the scenario in question is extremely unnatural, and not at all like those found in normal deep learning practice. So no one knew whether this represented a valid "proof of concept" that realistic deep learning systems would develop optimization demons.

Roughly two and a half years later, Ulisse Mini would make DaemonicSigil's scenario a bit more like those found in deep learning by increasing the number of dimensions from 16 to 1000 (still vastly smaller than any realistic deep learning system), which produced very different results, and weakly suggested that more dimensions do reduce demon formation.

https://www.lesswrong.com/posts/X7S3u5E4KktLp7gHz/tessellating-hills-a-toy-model-for-demons-in-imperfect

https://www.lesswrong.com/posts/X7S3u5E4KktLp7gHz/tessellating-hills-a-toy-model-for-demons-in-imperfect?commentId=hwzu5ak8REMZuBDBk

This was actually a crux in a discussion between me and David Xu about inner alignment. I argued that the sharp left turn conditions don't exist in AI development. He argued that misalignment happens when there are gaps that go uncorrected, which likely refers to the gap between the base objective that SGD optimizes and the internal optimizer's goal, the gap that leads to inner misalignment. I argued that inner misalignment is likely to be extremely difficult to pull off, because SGD can correct the gap between the inner and outer optimizer in most cases, and I have now laid out that argument in this post.

Twitter conversation below:

https://twitter.com/davidxu90/status/1712567663401238742

Speaking as someone who's read that post (alongside most of Quintin's others) and who still finds his basic argument unconvincing, I can say that my issue is that I don't buy his characterization of the doom argument—e.g. I disagree that there needs to be a "vast gap". (David Xu)

https://twitter.com/davidxu90/status/1712568155959362014

SGD is not the kind of thing where you need "vast gaps" between the inner and outer optimizer to get misalignment; on my model, misalignment happens whenever gaps appear that go uncorrected, since uncorrected gaps will tend to grow alongside capabilities/coherence. (David Xu)

https://twitter.com/SharmakeFarah14/status/1712573782773108737

since uncorrected gaps will tend to grow alongside capabilities/coherence.

This is definitely what I don't expect, and part of that is because I expect that uncorrected inner misalignment will be squashed out by SGD unless extreme things happen:

https://www.lesswrong.com/posts/w2TAEvME2yAG9MHeq/gradient-hacking-is-extremely-difficult (Myself)

https://twitter.com/davidxu90/status/1712575172124033352

Yes, that definitely sounds cruxy—you expect SGD to contain corrective mechanisms by default, whereas I don't. This seems like a stronger claim than "SGD is different from evolution", however, and I don't think I've seen good arguments made for it. (David Xu)

This reminds me that I should address the other conversation I had with David Xu, on how strong the priors we encode need to be to ensure alignment versus how much we can let the AI learn and still get a good outcome, or alternatively, how much we need to specify upfront. And that leads to...

I expect reasonably weak priors to work well to align AI with human values, and that a lot of the complexity can be offloaded to the learning process

Equivalently speaking, I expect the cost of specification of values to be relatively low, and that a lot of the complexity is offloadable to the learning process.

This was another crux between David Xu and me, specifically on the question of whether you can largely get away with weak priors, or do you actually need to encode a lot stronger prior to prevent misalignment? It ultimately boiled down to the crux that I expected reasonably weak priors to be enough, guided by the innate reward system.

A big part of my reasoning here is that a lot of values and biases are inaccessible to the genome, which means it can't directly specify them. It can shape them by setting up training algorithms and data, but directly specifying things like values turns out to be very difficult. This is primarily because the genome does not have direct access to the world model or the brain, which would be required to hardcode the prior. To the extent that it can hardcode anything, it has to be over relatively simple properties, which means that alignment has to be achieved with relatively weak priors encoded, and the innate reward system generally does this fantastically, with examples of misalignment being rare and mild.

The fact that humans reliably acquire values like empathy for friends and acquaintances, parental instincts, and wanting revenge when others harm us, without the genome hardcoding a lot of prior information, is rather underappreciated. Getting away with reasonably weak priors means that we don't need to specify our values very much, and thus that we can reliably offload most of the value learning work to the AI.

Here are some posts and comments below:

https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome

https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome#iGdPrrETAHsbFYvQe

(I want to point out that it's not just that, with weak prior information, the genome can reliably bind humans to real-enough things such that, for example, they don't die of thirst from drinking fake water. It can also create the innate reward system, which uses simple update rules to reliably get nearly every person on earth to have empathy for their family and ingroup, want revenge when others harm them, etc., and the exceptions to the pattern are rare and usually mild alignment failures at worst. That's a source of a lot of my optimism on AI safety and alignment.)

https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine

https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome#8Fry62GiBnRYPnpNn

https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome#dRXCwRBkGxKTuq2Cc

Here is the compressed conversation between David Xu and me:

https://twitter.com/davidxu90/status/1713102210354294936

(And the reason I'd be more optimistic there is basically because I expect the human has meta-priors I'd endorse, causing them to extrapolate in a "good" way, and reach a policy similar to one I myself would reach under similar augmentation.) (David Xu)

https://twitter.com/davidxu90/status/1713230086730862731

(In reality, of course, I disagree with the framing in both cases: "two different systems" isn't correct, because the genetic information that evolution was working with in fact does encode fairly strong priors, as I mentioned upthread.) (David Xu)

https://twitter.com/SharmakeFarah14/status/1713232260827095119

My disagreement is that I expect the genetic priors to be quite weak, and that a lot of values are learned, not encoded in priors, because values are inaccessible to the genome:

https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome

Maybe we will eventually be able to hardcode it, but we don't need that. (Myself)

https://twitter.com/davidxu90/status/1713232760637358547

Values aren't "learned", "inferred", or any other words that suggests they're directly imbibed from the training data, because values aren't constrained by training data alone; if this were false, it would imply the orthogonality thesis is false. (David Xu)

I'm going to reply in this post and say that the orthogonality thesis is a lot like the no free lunch theorem: an extraordinarily powerful result that is too general to apply, because it only covers the space of all logically possible AIs, and it only holds if no prior at all is applied, which in this case would require you to specify everything, including the values of the system, or at best use things like brute-force search or memorization algorithms.

I have a very similar attitude to "Most goals in goal space are bad." I'd probably agree in the most general sense, but even weak priors can prevent most goals from being bad, and thus I suspect that a 0 prior condition is likely necessary for the claim to hold. I'm not arguing that with 0 prior, models are aligned with people without specifying everything. I'm arguing that we can get away with reasonably weak priors, and let within-lifetime learning do the rest.

Once you introduce even weak priors to the situation, the issue is basically resolved. My claim is that weak priors work to induce learning of values, and it's consistent with the orthogonality thesis for arbitrarily little prior information to be enough to learn alignment.

I could make an analogous argument for capabilities (that they can't be learned from training data), and I'd be demonstrably wrong, since that conclusion doesn't hold.

This is why I hate the orthogonality thesis, despite rationalists being right about it: it allows for too many outcomes, and an inference like "values aren't learned" can't be supported by the orthogonality thesis alone.

https://twitter.com/SharmakeFarah14/status/1713234214391255277

The problem with the orthogonality thesis is that it allows for too many outcomes, and notice I said the genetic prior is weak, not non-existent, which would be compatible with the orthogonality thesis. (Myself)

https://twitter.com/davidxu90/status/1713234707272626653

The orthogonality thesis, as originally deployed, isn't meant as a tool to predict outcomes, but to counter arguments (pretty much) like the ones being made here: encountering "good" training data doesn't constrain motivations. Beyond that the thesis doesn't say much. (David Xu)

https://twitter.com/SharmakeFarah14/status/1713236849873891699

I suspect it's true when looking at the multiverse of AIs as a whole, then it's true, if we impose 0 prior, but even weak priors start to constrain your motivations a lot. I have more faith in weak priors + whiteboxness working out than you do. (Myself)

https://twitter.com/davidxu90/status/1713237355501584857

I have more faith in weak priors + whiteboxness working out than you do.

I agree that something in the vicinity of this is likely [a] crux. (David Xu)

https://twitter.com/davidxu90/status/1713238995893912060

TBC, I do think it's logically possible for the NN landscape to be s.t. everything I've said is untrue, and that good minds abound given good data. I don't think this is likely a priori, and I don't think Quintin's arguments shift me very much, but I admit it's possible. (David Xu)

My own algorithm for how to do AI alignment

This is a subpoint, but for those who want a ready-to-go alignment plan, here it is:

  1. Implement a weak prior over goal space.

  2. Use DPO, RLHF, or something else to create a preference model.

  3. Create a custom loss function for the preference model.

  4. Use backpropagation with gradient descent to minimize that loss.

  5. Repeat the gradient updates until you reach an acceptable solution.
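To make the five steps above concrete, here is a toy sketch in plain Python. Everything in it is an illustrative assumption rather than a real alignment pipeline: the "weak prior" is stood in for by a small L2 penalty on the weights, the preference model is a linear Bradley-Terry model (the pairwise model that RLHF reward modelling and DPO build on), the data is synthetic, and the gradient is written out by hand instead of being produced by a backpropagation library.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def reward(w, x):
    # Step 2: a linear preference (reward) model, r(x) = w . x
    return sum(wi * xi for wi, xi in zip(w, x))

def loss_and_grad(w, pairs, prior_strength):
    """Step 3: mean Bradley-Terry loss plus an L2 penalty, and its gradient."""
    n = len(pairs)
    # Step 1 (assumption): the weak prior is a small L2 penalty on w.
    loss = prior_strength * sum(wi * wi for wi in w)
    grad = [2.0 * prior_strength * wi for wi in w]
    for chosen, rejected in pairs:
        p = sigmoid(reward(w, chosen) - reward(w, rejected))
        loss += -math.log(max(p, 1e-12)) / n
        # Gradient of -log(sigmoid(margin)) is -(1 - p) * (chosen - rejected).
        for i in range(len(w)):
            grad[i] += -(1.0 - p) * (chosen[i] - rejected[i]) / n
    return loss, grad

def train(pairs, dim, prior_strength=0.01, lr=0.3, target_loss=0.3, max_steps=2000):
    """Steps 4 and 5: repeat gradient updates until the loss is acceptable."""
    w = [0.0] * dim
    for _ in range(max_steps):
        loss, grad = loss_and_grad(w, pairs, prior_strength)
        if loss <= target_loss:
            return w, loss
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w, loss_and_grad(w, pairs, prior_strength)[0]

# Synthetic preference data: a hidden direction decides which item is "chosen".
random.seed(0)
hidden = [1.0, -1.0, 0.5]  # hypothetical ground-truth preference
pairs = []
for _ in range(100):
    a = [random.uniform(-1, 1) for _ in range(3)]
    b = [random.uniform(-1, 1) for _ in range(3)]
    pairs.append((a, b) if reward(hidden, a) >= reward(hidden, b) else (b, a))

w, final_loss = train(pairs, dim=3)
accuracy = sum(reward(w, c) > reward(w, r) for c, r in pairs) / len(pairs)
print("learned weights:", [round(wi, 2) for wi in w])
print("final loss:", round(final_loss, 3), "pairwise accuracy:", accuracy)
```

The stopping threshold `target_loss` and the penalty strength are arbitrary choices for the toy problem; in a real system, deciding what counts as an "acceptable solution" is the hard part that this sketch does not address.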

Now that I'm basically finished with laying out the arguments and the conversations, let's move on to the conclusion:

Conclusion

My optimism on AI safety stems from a variety of sources. The reasons, in the order they appear in the post rather than in order of importance, are:

  1. I don't believe the sharp left turn is anywhere near as general as Nate Soares puts it, because the conditions that caused a sharp left turn in humans were that cultural learning could optimize over much faster time-scales than evolution could respond to, that evolution did not course-correct us, and that culture could transmit OOMs more information through the generations than evolution could. None of these conditions hold for modern AI development.

  2. I don't believe that Nate's example of misgeneralizing the goal of IGF works as an example of misgeneralization that matters for our purposes, because it is not a case where 1 AI is trained for a goal in environment A and then, in environment B, competently pursues a different goal instead.

Instead, what's happening is that 1 human generation, or 1 human, is trained in environment A, and then a fresh generation of humans is trained on a different distribution, which predictably will produce more divergence than the first case.

In particular, there's no reason to be concerned about AI alignment misgeneralizing on this basis, since we have no reason to assume that LessWrong's central example is actually misgeneralization. From Quintin:

If we assume that humans were "trained" in the ancestral environment to pursue gazelle meat and such, and then "deployed" into the modern environment where we pursued ice cream instead, then that's an example where behavior in training completely fails to predict behavior in deployment.

If there are actually two different sets of training "runs", one set trained in the ancestral environment where the humans were rewarded for pursuing gazelles, and one set trained in the modern environment where the humans were rewarded for pursuing ice cream, then the fact that humans from the latter set tend to like ice cream is no surprise at all.

In particular, this outcome doesn't tell us anything new or concerning from an alignment perspective. The only lesson applicable to a single training process is the fact that, if you reward a learner for doing something, they'll tend to do similar stuff in the future, which is pretty much the common understanding of what rewards do.

  3. AIs are mostly white boxes, at the very least, and the control we have over AI means that a better analogy is our innate reward system, which aligns us to quite a lot of goals spectacularly well, so well that the total evidence of alignment could easily lower the probability of X-risk, or even of, say, an AI killing a human, by 5-15+ OOMs, which would make the alignment problem a non-problem for our purposes. It would pretty much single-handedly make AI misuse the biggest problem, but that issue has different solutions, and governments are likely to regulate AI misuse anyway, so existential risk gets cut by 10-99% or more.

  4. I believe the security mindset is inappropriate for AI because aligning AI mostly doesn't involve dealing with adversarial intelligences or inputs. The reason is that the most natural class of adversaries, inner-misaligned mesa-optimizers/optimization daemons, mostly doesn't exist, because of my next reason. Alignment is also in a different epistemic state from computer security, and there are other disanalogies that make porting intuitions from other fields into ML/AI research very difficult to do correctly.

  5. It is actually really difficult to inner misalign the AI, since SGD is really good at credit assignment, and optimizes the entire causal graph leading to the loss, leaving no slack. This is unlike evolution, which faces the problem Gwern describes here:

https://gwern.net/backstop#rl

Imagine trying to run a business in which the only feedback given is whether you go bankrupt or not. In running that business, you make millions or billions of decisions, to adopt a particular model, rent a particular store, advertise this or that, hire one person out of scores of applicants, assign them this or that task to make many decisions of their own (which may in turn require decisions to be made by still others), and so on, extended over many years. At the end, you turn a healthy profit, or go bankrupt. So you get 1 bit of feedback, which must be split over billions of decisions. When a company goes bankrupt, what killed it? Hiring the wrong accountant? The CEO not investing enough in R&D? Random geopolitical events? New government regulations? Putting its HQ in the wrong city? Just a generalized inefficiency? How would you know which decisions were good and which were bad? How do you solve the “credit assignment problem”?

The way SGD solves this problem is by running backprop, which is a white-box algorithm, and Nora Belrose explains it more here:

https://forum.effectivealtruism.org/s/vw6tX5SyvTwMeSxJk/p/JYEAL8g7ArqGoTaX6#Status_quo_AI_alignment_methods_are_white_box

And that's the base optimizer, not the mesa-optimizer, which is why SGD has a chance to correct the inner-misaligned agent far more effectively than cultural/biological evolution, the free market, etc. It is white-box, like the inner optimizers it runs, and solves credit assignment in a much better way than the previous optimizers like cultural/biological evolution, the free market, etc could hope to do.

  6. I believe that, because of information inaccessibility plus the fact that the brain acts quite a lot like a Universal Learning Machine/Neural Turing Machine, alignment in the human case (for, say, surviving or having empathy for friends) can't depend on complicated genetic priors. To the extent that genetic priors are encoded, they need to be fairly weak and universal-ish, with help from the innate reward system, which builds on those priors and uses simple updating rules to reinforce certain behaviors and penalize others. This works ridiculously well to align humans to surviving and to things like empathy/sympathy for the ingroup, revenge, etc.
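The credit-assignment contrast drawn above, between Gwern's one-bit bankruptcy signal and backprop, can be illustrated with a toy sketch. This is only an illustration under stated assumptions: the two-parameter network, the data, the learning rate and the perturbation size are all made up for the example. The backprop routine applies the chain rule by hand and returns one gradient entry per parameter, while the hill climber only ever learns one bit per trial (did the loss improve or not).

```python
import math
import random

def forward(params, x):
    w1, w2 = params
    h = math.tanh(w1 * x)   # hidden activation
    return h, w2 * h        # (hidden, output)

def loss(params, data):
    # Mean squared error over the dataset.
    return sum((forward(params, x)[1] - t) ** 2 for x, t in data) / len(data)

def backprop(params, data):
    """Chain rule by hand: returns [dL/dw1, dL/dw2], one entry per parameter."""
    w1, w2 = params
    g1 = g2 = 0.0
    for x, t in data:
        h, y = forward(params, x)
        dy = 2.0 * (y - t) / len(data)       # dL/dy
        g2 += dy * h                          # dL/dw2
        g1 += dy * w2 * (1.0 - h * h) * x     # dL/dw1, using tanh' = 1 - tanh^2
    return [g1, g2]

# Hypothetical toy data generated by a "true" network with w1 = 1.5, w2 = 2.0.
xs = [-1.0, -0.5, 0.25, 0.5, 1.0]
data = [(x, 2.0 * math.tanh(1.5 * x)) for x in xs]
start = [0.5, 0.5]
start_loss = loss(start, data)

# Sanity check: the hand-written gradient matches central finite differences.
eps = 1e-6
grad = backprop(start, data)
numeric = []
for i in range(2):
    up, down = list(start), list(start)
    up[i] += eps
    down[i] -= eps
    numeric.append((loss(up, data) - loss(down, data)) / (2 * eps))

# (a) Gradient descent: every parameter gets its own signed, scaled signal.
gd = list(start)
for _ in range(200):
    g = backprop(gd, data)
    gd = [p - 0.1 * gi for p, gi in zip(gd, g)]
gd_loss = loss(gd, data)

# (b) One-bit feedback: perturb everything, keep the change only if loss fell.
random.seed(0)
hc = list(start)
for _ in range(200):
    trial = [p + random.gauss(0.0, 0.1) for p in hc]
    if loss(trial, data) < loss(hc, data):
        hc = trial
hc_loss = loss(hc, data)

print("start loss:", start_loss)
print("gradient descent loss:", gd_loss)
print("one-bit hill-climbing loss:", hc_loss)
```

With only two parameters both methods can make progress; the argument in the post concerns what happens when the one bit has to be split over billions of decisions, which this toy does not attempt to reproduce.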

So now that I have listed the reasons for my optimism on AI safety, I'll add 1 new mini-section to show that the shutdown problem for AI is almost solved.

Addendum 1: The shutdown problem for AI is almost solved

It turns out that we can keep the most useful aspects of Expected Utility Maximization while making an AI shutdownable.

Sami Petersen showed that we can integrate incomplete preferences into AIs while weakening transitivity just enough to get a non-trivial theory of Expected Utility Maximization that's quite a lot safer. Elliott Thornley proposed that incomplete preferences could be used to solve the shutdown problem, and the very nice thing about subagent models of Expected Utility Maximization is that they require a unanimous committee in order for a decision to be accepted as a sure gain.

This is useful, but it can also lead to problems. On the one hand, we only need one expected utility maximizer that wants to be able to shut down the AI in order for us to shut it down as a whole. On the other hand, we would need to be careful about what the subagents' execution conditions/domains are, as unanimous committees can be terrible: only one agent needs to object to grind the entire system to a halt, which is why, in the real world, unanimity is usually not a preferred way to govern something.
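A toy illustration of the unanimous-committee idea may help. This is not Petersen's or Thornley's formalism; the option names, the two utility tables, and the function names are all hypothetical. The committee strictly prefers one option to another only if no subagent is worse off and at least one is better off, which leaves some pairs of options incomparable (the incompleteness) and gives every subagent a veto.

```python
from typing import Callable, List

Utility = Callable[[str], float]

def committee_prefers(a: str, b: str, committee: List[Utility]) -> bool:
    """True only if no subagent loses by moving from b to a and at least one gains."""
    diffs = [u(a) - u(b) for u in committee]
    return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)

def incomparable(a: str, b: str, committee: List[Utility]) -> bool:
    """Neither option is preferred and the subagents are not all indifferent."""
    indifferent = all(u(a) == u(b) for u in committee)
    return (not committee_prefers(a, b, committee)
            and not committee_prefers(b, a, committee)
            and not indifferent)

# Two hypothetical subagents with different rankings of options A and B.
u1_table = {"A": 2.0, "B": 1.0, "C": 0.0}
u2_table = {"A": 1.0, "B": 2.0, "C": 0.0}
committee = [u1_table.__getitem__, u2_table.__getitem__]

a_over_c = committee_prefers("A", "C", committee)  # both subagents gain
a_over_b = committee_prefers("A", "B", committee)  # subagent 2 vetoes
b_over_a = committee_prefers("B", "A", committee)  # subagent 1 vetoes
gap_ab = incomparable("A", "B", committee)

print("A over C:", a_over_c)
print("A over B:", a_over_b, "| B over A:", b_over_a, "| incomparable:", gap_ab)
```

The same veto that blocks the move between A and B is what makes unanimity both attractive for shutdown (one subagent suffices to block resisting it) and risky as a general decision rule.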

Nevertheless, for AI safety purposes, this is still very, very useful, and if it grows up to have broader conditions than the ones outlined in the posts below, this might be the single biggest MIRI success of the last 15 years, which is ridiculously good.

https://www.lesswrong.com/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1

http://pf-user-files-01.s3.amazonaws.com/u-242443/uploads/2023-05-02/m343uwh/The Shutdown Problem- Two Theorems%2C Incomplete Preferences as a Solution.pdf

Edit 3: I've removed addendum 2 as I think it's mostly irrelevant, and Daniel Kokotajlo showed me that Ajeya actually expects things to slow down in the next few years, so the section really didn't make that much sense.

Arguments for optimism on AI Alignment (I don't endorse this version, will reupload a new version soon.)

why we mostly don't need to worry about AI

This topic is poorly understood, very high confidence is obviously wrong for any claim that's not exceptionally clear. Absence of doom is not such a claim, so the need to worry isn't going anywhere.

-1Noosphere89
This is why the post is so long: It has to integrate a lot of different sources of evidence, actually give lots of evidence for major claims, and I had to make sure that I actually have positive arguments such that it's very, very likely we will align AI, and arguably make it safe by default. That's why I made the argument on AIs as white boxes, and the argument that the genome uses very weak priors to align us ridiculously well to, for example, having empathy for the ingroup, because these were intended to be reasons to expect safe AI by default in a very strong sense. Also, there is a lot of untapped evidence on humans, and that's what I was using to make this post. Quintin Pope and TurnTrout's post on the massive evidence we have about humans for alignment is below. https://www.lesswrong.com/posts/CjFZeDD6iCnNubDoS/humans-provide-an-untapped-wealth-of-evidence-about

Without sufficient clarity, which humanity doesn't possess on this topic, no amount of somewhat confused arguments is sufficient for the kind of certainty that makes the remaining risk of extinction not worth worrying about. It's important to understand and develop what arguments we have, but in their present state they are not suitable for arguing this particular case outside their own assumption-laden frames.

When reunited with unknown unknowns outside their natural frames, such arguments might plausibly make it reasonable to believe the risk of extinction is as low as 10%, or as high as 90%, but nothing more extreme than that. Nowhere across this whole range of epistemic possibilities is a situation that we "mostly don't need to worry about".

[-]Roko3017

I believe the security mindset is inappropriate for AI

I think that's because AI today feels like a software project akin to building a website. If it works, that's nice, but if it doesn't work it's no big deal.

Weak systems have safe failures because they are weak, not because they are safe. If you piss off a kitten, it will not kill you. If you piss off an adult tiger...

The optimistic assumptions laid out in this post don't have to fail in every possible case for us to be in mortal danger. They only have to fail in one set of circumstances that someone actualizes. And as long as things keep looking like they are OK, people will continue to push the envelope of risk to get more capabilities.

We have already seen AI developers throw caution to the wind in many ways (releasing weights as open source, connecting AI to the internet, giving it access to a command prompt) and things seem OK for now so I imagine this will continue. We have already seen some psycho behavior from Sydney too. But all these systems are weak reasoners and they don't have a particularly solid grasp on cause and effect in the real world.

We are certainly in a better position with respect to winning than when I started posting on this website. To me the big wins are (1) that safety is a mainstream topic and (2) that the AIs learned English before they learned physics. But I don't regard those as sufficient for human survival.

3Noosphere89
I disagree, and I think there are deeper reasons for why most computer security analogies do not work for ML/AI alignment. I think the biggest reasons for this are the following:

1. The thing that LW people call security mindset is non-standard, and under the computer security definition, you only start handing out points for discovering potential failures when the discoverer can actually demonstrate them, and virtually no proposed failures that I am aware of have been demonstrated successfully, except for goal misgeneralization and specification gaming, and even here they are toy AIs. In contrast, the notion that inner misaligned models/optimization daemons would appear in modern AI systems has been tested twice before, and in 1 case, DaemonicSigil was able to get a gradient hacker/optimization daemon to appear, but it was extremely toy; then, when it was tested in a more realistic case, the optimization daemon phenomenon went away, or was going away. See Iceman's comment for more details on why LW Security Mindset!=Computer Security Mindset: https://www.lesswrong.com/posts/99tD8L8Hk5wkKNY8Q/?commentId=xF5XXJBNgd6qtEM3q

That leads to 2: ML people can do things that would not work well under a security mindset or rocket engineering, like randomly doubling model size or data, or swapping one model for another, which would be big no-nos under computer security and rocket engineering, because rockets would literally explode if you doubled their fuel randomly in-flight, and switching the order in securing a password would make it output nonsense at best or destroy the security at worst. There are enough results like this that I'm now skeptical of applying the security mindset frame to AI safety, beyond inner alignment being very likely by default due to SGD's corrective properties.
[-]Roko199

Do you just like not believe that AI systems will ever become superhumanly strong? That once you really crank up the power (via hardware and/or software progress), you'll end up with something that could kill you?

Read what I wrote above: current systems are safe because they're weak, not safe because they're inherently safe.

Security mindset isn't necessary for weak systems because weak systems are not dangerous.

8Noosphere89
This is exactly what I am arguing against. I do not believe that the security mindset doesn't work because AI is weak, I believe that the security mindset fails for deeper reasons than that, and an increase in capabilities doesn't mean that the security mindset looks better (indeed, it may actually look worse: see the attempted optimization daemon break, where increasing capabilities by increasing the dimensions of the AI made the daemon start going away, or all of SGD's corrections.) Edit: I also have issues with the way LW applies the security mindset, and I'll quote my comment from there on why a lot of LW implementations of security mindset fail:
5Roko
Maybe you're right, we may need to deploy an AI system that demonstrates the potential to kill tens of millions of people before anyone really takes AI risk seriously. The AI equivalent of Trinity. https://en.wikipedia.org/wiki/Trinity_(nuclear_test)
[-]lc187

It's not just about "being taken seriously", although that's a nice bonus - it's also about getting shared understanding about what makes programs secure vs. insecure. You need a method of touching grass so that researchers have some idea of whether or not they're making progress on the real issues.

1Roko
We already can't make MNIST digit recognizers secure against adversarial attacks. We don't know how to prevent prompt injection. Convnets are vulnerable to adversarial attacks. RL agents that play Go at superhuman levels are vulnerable to simple strategies that exploit gaps in their cognition. No, there's plenty of evidence that we can't make ML systems robust. What is lacking is "concrete" evidence that that will result in blood and dead bodies.
4lc
None of those things are examples of misalignment except arguably prompt injection, which seems like it's being solved by OpenAI with ordinary engineering.
3O O
To me the security mindset seems inapplicable because in computer science, programs are rigid systems with narrow targets. AI is not very rigid and the target, I.e. an aligned mind, is not necessarily narrow.
4Richard_Kennaway
That rigidity is what makes computer security so easy. ... Relative to AGI security.
6O O
No, the rigidity is what makes a system error-prone, i.e. brittle. If you don't specify the solution exactly, the machine won't solve the problem. Classic computer programs can't generalize. The OP makes a point about how you can double a model's size and it will work well, but if you double a computer program's binary size with unused lines of code you can get all sorts of weird errors, even if none of that extra size is ever used. An analogy is trying to write a symbolic logic program to emulate an LLM (i.e. with only if statements and for loops), or trying to make a self-driving car with Boolean logic. If I flip one single bit in a computer program, it will probably catastrophically fail and crash the whole computer. However, removing random weights won't do much to an LLM. A little tangent on flipping a bit: flipping a bit in the actual binary itself (the thing the computer reads to run the program) will probably cause the computer to access a part of itself it wasn't supposed to and immediately crash. Changing a letter in a computer program that humans write will almost certainly cause the program to not compile.
4Noosphere89
Yep, these are the important parts. Neural networks are much more robust than that, and extremely robust compared to a lot of other fields, which is why I'm skeptical of applying the security mindset, since it would predict false things.
2Richard_Kennaway
The non-rigidity of ChatGPT and its ilk does not make them less error-prone. Indeed, ChatGPT text is usually full of errors. But the errors are just as non-rigid. So are the means, if they can be found, of fixing them. ChatGPT output has to be read with attention to see its emptiness. None of this has anything to do with security mindset, as I understand the term.
0Noosphere89
The point is that if it were like computer security or even computer engineering, those errors would completely destroy ChatGPT's intelligence, and make it as useless as a random computer. This is just one example of an observation that makes me skeptical of applying the security mindset, as ML/AI and its subfield, ML/AI alignment, are strange enough fields that I wouldn't port over any intuitions from other fields. ML/AI alignment is like quantum mechanics, in which you need to leave your intuitions at the door, and unfortunately this makes public outreach likely net-negative.

At this point it is not clear to me what you mean by security mindset. I understand by it what Bruce Schneier described in the article I linked, and what Eliezer describes here (which cites and quotes from Bruce Schneier). You have cited QuintinPope, who also cites the Eliezer article, but gets from it this concept of "security mindset": "The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions". From this and his further words about the concept, he seems to mean something like "programming mindset", i.e. good practice in software engineering. Only if I read both you and him as using "security mindset" to mean that can I make sense of the way you both use the term.

But that is simply not what "security mindset" means. Recall that Schneier's article began with the example of a company selling ant farms by mail order, nothing to do with software. After several more examples, only one of which concerns computers, he gives his own short characterisation of the concept that he is talking about:

the security mindset involves thinking about how things can be made to fail. It involves thinki

...
0Noosphere89
My issue with the security mindset is that there's a selection effect/bias that causes people to notice the failures of security, and not its successes, even if the true evidence for success is massively larger than that for its failure. Here's a quote from lc's post POC or GTFO as a counter to alignment wordcelism, on why the security industry has massive issues with people claiming security failures when they don't or can't happen: And this is why in general I dislike the security mindset, because of the incentives to report failure or bad events even when they aren't very much of a concern. Also, the stuff that computer security people do largely doesn't need to be done in ML/AI, which is another reason I'm skeptical of the security mindset.
0Richard_Kennaway
These are parochial matters within the computer security community, and do not bear on the hazards of AGI.
0Noosphere89
They do matter, since it implies a sort of selection effect where people will share the evidence for doom, and not notice the evidence for not-doom, and this matters because the real chance of doom may be much lower, in principle arbitrarily low, while LWers and AI safety/governance organizations have higher probabilities of doom. Combined with more standard biases on negative news being selected for, it is one piece in why I think AI doom is very unlikely. This is just one piece of it, not my entire argument. And I think this already happened, cf. the entire inner misalignment/optimization daemon situation, where it was tested twice, once showing a confirmed break, and the other one by Ulisse Mini, where in a more realistic situation, the optimization daemon/inner misalignment went away, and very little was shared about this result, compared to the original, which almost certainly got more views.

Downvote for being absurdly overconfident, and thereby harming the whole direction of more optimism on alignment. I'd downvote Eliezer for the same reason on his 99.99% doom arguments in public; they are visibly silly, making the whole direction seem silly by association.

In both cases, there are too many unknown unknowns to have confidences remotely that high. And you've added way more silly zeros than EY, despite having looser arguments.

This is a really important topic; we need serious discussion of how to really think about alignment difficulty. This is a serious attempt, but it's just not realistically humble. It also seems to be ignoring the cultural norm and explicit stated goal of writing to inform, not to persuade, on LW.

So, I look forward to your next iteration, improved by the feedback on this post!

4Noosphere89
I'll probably put this back into drafts by tomorrow.
3Seth Herd
It looks like you already took out the 99.9...% claims, which are the primary thing I was reacting to. That's great IMO. I think the new phrasing of "not claiming this is right, just getting the logic out there" is way better- both more honest and ultimately more convincing if the logic holds. But that's a major edit without noting the edit, so I think this should be a draft right now, not a post that's evolving so that the comments are now addressing an earlier version. Publishing a second version that includes much of the first is a great idea. I'd choose a different term than white box, as per Steve Byrnes' conclusion that he just won't use those terms since they're confusing. My biggest substantive comment is that you seem to be assuming that because we could get alignment right, we will get alignment right. Even Yudkowsky agrees that we could get it right. You're arguing that it's a lot easier than assumed, and I think that's probably right. But that's not enough to be confident that we will get it right. It will depend on how seriously the first person to make self-improving AGI takes alignment, even if there are easy techniques available. Will they use them, or will they race and take risks?
2Noosphere89
I honestly agree with this. I feel that the post has been edited so much that I now think it's time to delete this post and reupload a new version of it so that I can actually deal with the edits, without having this weird patchwork post. Yeah, I'll probably edit it to emphasize something else. I am definitely assuming that, but I do think it's a weak assumption, assuming that at least some part of my post holds true. In essence, I'm hoping that OpenAI doesn't do the worst thing even if it isn't favored by profit incentives. The good news is that assuming value learning is easy, then we have an easier time, since we can do AI regulations a lot more normally, and in particular, we don't need to be that strict with licensing. Don't get me wrong, AI governance is necessary in this world, but the type of governance would be drastically different. No pauses, for one example.
8Seth Herd
Agreed on all points. This is closely related to my thinking on how we survive, which is why I care about seeing it presented in a way people can hear and understand. I'll send you a draft of the closely related post I'm working on, and if you haven't seen it, I focus on that last point, values learning being relatively easy, in this post: The (partial) fallacy of dumb superintelligence. I think it's worth explicitly discussing the assumption that people won't do "the dumbest possible thing". It's a reasonable assumption, but it's probably a little more complicated than that. If alignment taxes are non-zero, there will be some pull between different motivations.
2Noosphere89
Yeah, it kinda depends on how small the alignment tax is. If it's not 0, as I unfortunately suspect, but instead small, then there is a small chance of extinction. I definitely plan to discuss that when I reupload the post after deleting it first. Thanks for talking with me today!
2Vladimir_Nesov
Discussion is written by others, unpublishing affects both.
5habryka
I also think it would be better if you changed the title to saying you don't endorse it anymore. It's sad for the discussion to disappear/become unfindable.
4Noosphere89
Okay, I endorse parts of this post, but in hindsight, I clearly was overconfident. I still want to reupload this post, partially because I want to not have to deal with the editing process, but I will probably edit the title to say I don't endorse this version anymore, and make a new post based on this one.

I’m pretty confused about almost everything you said about “innate reward system”.

My view is: the relevant part of the human innate reward system (the part related to compassion, norm-following, etc.) consists of maybe hundreds of lines of code, and nobody knows what they are, and I would feel better if we did. (And that happens to be my own main research interest.)

Whereas your view seems to be: umm, I’m not sure, I’m gonna say things and you can correct me. Maybe you think that (1) the innate reward system is simple, (2) when we do RLHF, we are providing tens of thousands of samples of what the innate reward system would do in different circumstances, (3) and therefore ML will implicitly interpolate how the innate reward system works from that data, (4) …and this will continue to extrapolate to norm-following behavior etc. even in out-of-distribution situations like inventing new society-changing technology. Is that right? (I’m stating this possible argument without endorsing or responding to it, I’m still at the trying-to-understand-you phase.)

3Noosphere89
My general model of how the innate reward system works is the following: 1. I agree with the claim that the innate reward system is simple. 2. The innate reward system uses the fact that it can edit the weights and code of the brain (albeit limited by biology's quirks, like its completely uninterpretable neurons) to run the backpropagation algorithm, or a weaker variant thereof, and update the weights using RLHF or DPO or whatever specific variant it is to train a reward model for preference alignment. It continuously trains online on a lot of examples. 3. Yes, the ML/AI algorithm learns to interpolate from the data, and via weak priors plus the examples learned, it eventually starts to learn how the innate reward system works from the data, and what the reward function is. 4. I think one key reason why we can navigate out-of-distribution situations is that the innate reward system is fully online, and thus whenever it faces out-of-distribution situations, it's able to react on the timescale of the rest of the brain and take action. At the very least, this is a possible sketch of how we could make a reward system that lets us align the AI. Regarding the idea that there is a short code for how the innate reward system works: I agree with the view that there probably is a short, powerful code for the innate reward system in humans, for the same reason as my argument that priors from genetics are probably very weak. My claim here is that even the weaker reward model where we use local update rules is already enough to make alignment very likely, for the same reasons that the innate reward system is able to reliably instill a lot of preferences, like empathy for the ingroup, revenge when we are harmed, etc. Your algorithm seems like a very good thing, if we could get at it, but even the weaker stuff enabled by SGD probably is enough to ensure alignment with very high probability.

On the topic of security mindset, the thing that the LW community calls "security mindset" isn't even an accurate rendition of what computer security people would call security mindset. As noted by lc, actual computer security mindset is POC || GTFO, or, translating that into lesswrongese: you do not have warrant to believe in something until you have an example of the thing you're worried about being a real problem, because otherwise you are almost certain to be privileging the hypothesis.

In the cybersecurity analogy, it seems like there are two distinct scenarios being conflated here:

1) Person A says to Person B, "I think your software has X vulnerability in it." Person B says, "This is a highly specific scenario, and I suspect you don't have enough evidence to come to that conclusion. In a world where X vulnerability exists, you should be able to come up with a proof-of-concept, so do that and come back to me."

2) Person B says to Person A, "Given XYZ reasoning, my software almost certainly has no critical vulnerabilities of any kind. I'm so confident, I give it a 99.99999%+ chance." Person A says, "I can't specify the exact vulnerability your software might have without it in front of me, but I'm fairly sure this confidence is unwarranted. In general it's easy to underestimate how your security story can fail under adversarial pressure. If you want, I could name X hypothetical vulnerability, but this isn't because I think X will actually be the vulnerability, I'm just trying to be illustrative."

Story 1 seems to be the case where "POC or GTFO" is justified. Story 2 seems to be the case where "security mindset" is justified.

It's very different to suppose a particula...

7bigjeff5
The reason Person A in scenario 2 has the intuition that Person B is very wrong is that there are dozens, if not hundreds, of examples where people claimed no vulnerabilities and were proven wrong. Usually spectacularly so, and often nearly immediately. Consider the fact that the most robust software developed by the most wealthy and highly motivated companies in the world, who employ vast teams of talented software engineers, has monthly patch schedules to fix its constant stream of vulnerabilities, and I think it's pretty easy to immediately discount anybody's claim of software perfection without requiring any further evidence. All the evidence Person A needs is the complete and utter lack of anybody having achieved such a thing in the history of software to discount Person B's claims. I've never heard of an equivalent example for AI. It just seems to me like Scenario 2 doesn't apply, or at least it cannot apply at this point in time. Maybe in 50 years we'll have the vast swath of utter failures to point to, and thus a valid intuition against someone's many-nines confidence of success, but we don't have that now. Otherwise people would be pointing out examples in these arguments instead of vague unease regarding problem spaces.
4Daniel Kokotajlo
Well, no one has built an AGI yet, and if your plan is to wait until we have years of experience with unaligned AGIs before it's OK to start worrying about the problem, that's a bad plan. Also, there are things which are not AGI but which are similar in various ways (software, deep neural nets, rocket navigation mechanisms, prisons, childrearing strategies, tiger-training-strategies) which provide ample examples of unseen errors. Also, like I said, there ARE plenty of POCs for AGI risk. 

At the very least I think it would be more accurate to say “one aspect of actual computer security mindset is POC || GTFO”. Right? Are you really arguing that there’s nothing more to it than that?? That seems insane to me.

Even leaving that aside, here’s a random bug thread:

Mozilla developers identified and fixed several stability bugs in the browser engine used in Firefox and other Mozilla-based products. Some of these crashes showed evidence of memory corruption under certain circumstances and we presume that with enough effort at least some of these could be exploited to run arbitrary code. [emphasis added]

IIUC they treated these crashes as a security vulnerability, not a mere usability problem, and thus did things like not publicly disclosing the details until they had a fix ready to go, categorizing the fix as a high-priority security update, etc.

If your belief is that “actual computer security mindset is POC||GTFO”, then I think you’d have to say that these Mozilla developers do not have computer security mindset, and instead were being silly and overly paranoid. Is that what you think?

lc175

You're right that this is definitely not "security mindset". Iceman is distorting the point of the original post. But also, the reason Mozilla's developers can do that and get public credit for it is partially because the infosec community has developed tens of thousands of catastrophic RCE's from very similar exploit primitives, and so there is loads of historical evidence that those particular kinds of crashes lead to exploitable bugs. Alignment researchers lack the same shared understanding - they're mostly philosopher-mathematicians with no consensus even among themselves about what the real issues are, and so if one tries to claim credit for averting catastrophe in a similar situation it's impossible to tell if they're right.

3bigjeff5
This is exactly right. To put it more succinctly: Memory corruption is a known vector for exploitation, therefore any bug that potentially leads to memory corruption also has the potential to be a security vulnerability. Thus memory corruption should be treated with similar care as a security vulnerability.
lc142

POC || GTFO is not "security mindset", it's a norm. It's like science in that it's a social technology for making legible intellectual progress on engineering issues, and allows the field to parse who is claiming to notice security issues to signal how smart they are vs. who is identifying actual bugs. But a lack of "POC || GTFO" culture doesn't tell you that nothing is wrong, and demanding POCs for everything obviously doesn't mean you understand what is and isn't secure. Or to translate that into lesswrongese, reversed stupidity is not intelligence.

6iceman
But POC||GTFO is really important to constraining your expectations. We do not really worry about Rowhammer since the few POCs are hard, slow and impractical. We worry about Meltdown and other speculative execution attacks because Meltdown shipped with a POC that read passwords from a password manager in a different process, was exploitable from within Chrome's sandbox, and my understanding is that POCs like that were the only reason Intel was made to take it seriously. Meanwhile, Rowhammer is maybe a real issue but is so hard to pull off consistently and stealthily that nobody worries about it. My recollection was when it was first discovered, people didn't panic that much because there wasn't warrant to panic. OK, so there was a problem with the DRAM. OK, what are the constraints on exploitation? Oh, the POCs are super tricky to pull off and will often make the machine hard to use during exploitation? A POC provides warrant to believe in something.
2niplav
I'm confused about how POC||GTFO fits together with cryptographers starting to worry about post-quantum cryptography already in 2006, when the proof of concept was we have factored 15 into 3×5 using Shor's algorithm? (They were running a whole conference on it!)

Citation needed? The one computer security person I know who read Yudkowsky's post said it was a good description of security mindset. POC||GTFO sounds useful and important too but I doubt it's the core of the concept.

Also, if the toy models, baby-AGI-setups like AutoGPT, and historical examples we've provided so far don't meet your standards for "example of the thing you're maybe worried about" with respect to AGI risk, (and you think that we should GTFO until we have an example that meets your standards) then your standards are way too high.

If instead POC||GTFO applied to AGI risk means "we should try really hard to get concrete, use formal toy models when possible, create model organisms to study, etc." then we are already doing that and have been.

On POCs for misalignment, specifically for goal misgeneralization, there are pretty fundamental differences between what was shown and what was predicted so far, and one of them is that the train and test behavior in different environments are similar or the same, while in goal misgeneralization speculations, the train and test behavior are wildly different:

Rohin Shah has a comment on why most POCs aren't that great here:

https://www.lesswrong.com/posts/xsB3dDg5ubqnT7nsn/poc-or-or-gtfo-culture-as-partial-antidote-to-alignment#P3phaBxvzX7KTyhf5

2Daniel Kokotajlo
Nevertheless, if you think that this isn't good enough and that people worried about AGI risk should GTFO until they have something better, you are the one who is wrong.
3Noosphere89
I don't think people worried about AGI risk should GTFO. I do think we should stop giving them as much credit as we do, because of the fact that you are likely to privilege the hypothesis, and it does mean that we shouldn't count the POCs as vindicating the people worried about AI safety, since their evidence doesn't really work to support the claim of goal misgeneralization.
4Daniel Kokotajlo
I think that's a vague enough claim that it's basically a setup for motte-and-bailey. "Stop giving them as much credit as we do." Well I think that if 'we' = society in general, then we should start giving them way more credit, in fact. If 'we' = various LWers who don't think for themselves and just repeat what Yudkowsky says, then yes I agree. If 'we' = me, then no thank you I believe I am allocating credit appropriately, I take the point about privileging the hypothesis but I was well aware of it already.
2Noosphere89
What this would look like in practice would be the following (taken from the proposed break of optimization daemons/inner misalignment): 1. Someone proposes a break of AI that threatens alignment, like optimization daemons. 2. We test the claim on toy AIs; whether or not it works on them, we move to the next step. 3. We test the alignment break in a more realistic setting, and it turns out that the perceived break goes away. 4. Now, the key point is that if a proposed break goes away or is made harder in more realistic settings, and especially if this keeps happening, we need to avoid giving its proposers credit for predicting the failure. More generally, one issue I have is that I perceive an asymmetry between "AI is dangerous" and "AI is safe" people: if people were wrong about a danger, they'll forget or not reference the fact that they were wrong, but if they're right about a danger, even if it's much milder and some of their other predictions were wrong, people will treat them as oracles. A quote from lc's post on POC or GTFO culture as counter to alignment wordcelism explains my thoughts on the issue better than I can:
4Vladimir_Nesov
Scott Alexander writes about the asymmetry in From Nostradamus To Fukuyama. Reversing biases of public perception isn't much use for sorting out correctness of arguments.
0Noosphere89
I do have other issues with the security mindset, but that is an important issue I had. Turning to this part though, I think I might see where I disagree: It's not just public perception; the researchers themselves are also biased to believe that danger exists or will happen. Critically, since this is asymmetrical, it has more implications for doomy people than for optimistic people. It's why I'm a priori a bit skeptical of AI doom, and it's also why it's consistent to believe that the real probability of doom is very low, almost arbitrarily low, while people think the probability of doom is quite high: You don't pay attention to the not-doom or the things that went right, only the things that went wrong.
6Vladimir_Nesov
The researchers are not the arguments. You are discussing correctness of researchers.
0Noosphere89
Yes, that's true, but I have more evidence than that, and in particular I have evidence that directly argues against the proposition of AI doom, and against a lot of the common arguments for AI doom. The researchers aren't the arguments, but the properties of the researchers looking into the arguments, especially the ways they're biased, do provide some evidence for certain propositions.

For white box vs black box, after further discussion I wound up feeling like people just use the term “black box” differently in different fields, and in practice maybe I’ll just taboo “black box” and “white box” going forward. Hopefully we can all agree on:

If a LLM outputs A rather than B, and you ask me why, then it might take me decades of work to give you a reasonable & intuitive answer.

And likewise we can surely all agree that future AI programmers will be able to see the weights and perform SGD.

2Nora Belrose
I don’t think any complete description of the LLM is going to be intuitive to a human, because it’s just too complex to fit in your head all at once. The best we can do is to come up with interpretations for selected components of the network. Just like a book or a poem, there’s not going to be a unique correct interpretation: different interpretations are useful for different purposes. There’s also no guarantee that any of these mechanistic interpretations will be the most useful tool for what you’re trying to do (e.g. make sure the model doesn’t kill you, or whatever). The track record of mech interp for alignment is quite poor, especially compared to gradient-based methods like RLHF. We should accept the Bitter Lesson: SGD is better than you at alignment.
3Noosphere89
I would definitely like to see that argument made, as I suspect that a lot of LWers might disagree with this statement.
3Amalthea
  I think this is essentially what people mean when they say "LLMs are a black box" and since you seem to be agreeing, I find myself very confused that you've been pushing a "white box" talking point. 
6Steven Byrnes
It seems that all parties including Nora agree with “If a LLM outputs A rather than B, and you ask me why, then it might take me decades of work to give you a reasonable & intuitive answer”. The disagreements are (1) whether we should care—i.e., whether this fact is important and worrisome in the context of safe & beneficial AGI, (2) what the terms “black box” and “white box” mean. I think Nora’s comment here was taking an opportunity to argue her side of (1). In Nora’s recent post, to her credit, she defined exactly what she meant by “white box” the first time she used the term, and her discussion was valid given that definition. I think her recent post (and ditto the OP here) would have been clearer if she had (A) noted that people in the AGI safety space sometimes use “black box” to say something like the “decades of work” claim above, (B) explicitly said that the “decades of work” claim is obviously true and totally uncontroversial, (C) clarified that this popular definition of “black box / white box” is not the definition she’s using in this post. (A similar suggestion also applies to the other side of the debate including me, i.e. in the unlikely event that I use the term “black box” to mean the “decades of work” thing, in my future writing, I plan to immediately define it and also explicitly say that I’m not using the term to discuss whether or not you can see the weights and perform SGD.)
5Amalthea
Hmm, I guess the point of using the term "white box" is then to illustrate that it is not a literal black box, while the point of the term "black box" is that while it's a literal transparent system, we still don't understand it in the ways that matter. There's something that feels really off about the dynamic of term use here, but I can't quite articulate it.
8Steven Byrnes
The terms “white box” and “black box”, like pretty much all terms, are more than just their literal definitions, they are also trojan horses full of connotations and vibes. So of course it’s natural (albeit unfortunate and annoying) for people on both sides of a debate to try to get those connotations and vibes to work in service of their side. :-P
2Noosphere89
I'll edit the post soon to focus on the fact that the white-box definition is not a standard definition of the term, and instead refers to the computer analysis/security sense of the term.
2[comment deleted]
2Noosphere89
I definitely agree that tabooing white box vs black box is good. One point though is that the innate reward system does targeted updates to neural circuits using simple learning rules, which means that we can probably use SGD to build ourselves an innate reward system, combined with a weak prior, and get good results. Admittedly, I do think that the pathway isn't as complete as I'd like, but I do actually think that seeing the weights, checking the Hessians, etc. are extremely powerful alignment tools, more powerful than appreciated.

This whole post seems to be about accident risk, under the assumption that competent programmers are trying in good faith to align AI to “human values”. It’s fine for you to write a blog post on that—it’s an important and controversial topic! But it’s a much narrower topic than “AI safety”, right? AI safety includes lots of other things too—like bad actors, or competitive pressures to make AIs that are increasingly autonomous and increasingly ruthless, or somebody making ChaosGPT just for the lols, etc. etc.

6Roko
Indeed. No mention of misuse, multipolar traps, etc!
4jacob_cannell
Given how scaling laws work, the power of AGI systems is/will be proportional to net training compute, so 'lols' doesn't seem like much of a concern. These systems are increasingly enormous industrial-scale efforts, rapidly escalating towards Manhattan-scale projects.

One can argue that algorithmic & hardware improvements will never ever be enough to put human-genius-level human-speed AGI in the hands of tons of ordinary people e.g. university students with access to a cluster.

Or, one can argue that tons of ordinary people will get such access sooner or later, but meanwhile large institutional actors will have super-duper-AGIs, and they will use them to make the world resilient against merely human-genius-level-chaosGPTs, somehow or other.

Or, one can argue that ordinary people will never be able to do stupid things with human-genius-level AGIs because the government (or an AI singleton) will go around confiscating all the GPUs in the world or monitoring how they’re used with a keylogger and instant remote kill-switch or whatever.

As it happens, I’m pretty pessimistic about all of those things, and therefore I do think lols are a legit concern.

(Also, “just for the lols” is not the only way to get ChaosGPT; another path is “We should do this to better understand and study possible future threats”, but then fail to contain it. Large institutions could plausibly do that. If you disagree—if you’re thinking “nobody would be so stupid as to do that”—note the existence of gain-of-function research, lab leaks, etc. in biology.)

4jacob_cannell
If ordinary people have access to human-genius-level AGIs, then there will be many AGIs at that level (along with some far more powerful above them) and thus these weaker agents almost certainly won't be dangerous unless a significant fraction are not just specifically misaligned in the most likely failure mode (selfish empowerment), but co-aligned specifically against humanity in their true utility functions (i.e. terminal rather than just instrumental values). These numerous weak AGIs are not much more dangerous to humanity than psychopaths (unaligned to humanity, yes, but also crucially unaligned with each other). EY/MIRI? has a weird argument about AIs naturally coordinating because they can "read each other's source code", but that wouldn't actually cause true alignment of utility functions, just enable greater cooperation, and regardless is not really compatible with how DL AGI works. There are strong economic/power incentives against sharing source code (open source models lag); it's also only really useful for deterministic systems, and ANNs are increasingly non-deterministic and moving towards BNNs in that regard; and it's too difficult to verify against various spoofing mechanisms regardless (even if an agent's source code is completely available and you have a full deterministic hash chain, it's difficult to have any surety that the actual agent isn't in some virtual prison with other agent(s) actually in control, unless its chain amounts to enormous compute).
3Noosphere89
I'll note that a potential disagreement I have with your post on out-of-control AGIs ruining the world is that I actually expect the defense-offense balance to be much less biased towards the attack than you show here, and in particular, I think that to the extent AI improves things, my prior is that it's symmetric in the improvement, so the offense-defense balance ultimately doesn't change.
2Noosphere89
I definitely agree with this, and I'll probably change the title to focus on AI alignment. My general view on the other problems of AI safety is that removing accident risk would make the following strategies much less positive EV: 1. General slowdowns of AI, because misuse is handlable in other, less negative EV ways. 2. Trying to break the Overton Window, as Eliezer Yudkowsky did, since governments and companies have incentives to restrict misuse. And in particular, I think that removing the accidental risk probably ought to change a lot of people's p(doom), especially if the main way they claim that people will die is due to accident risk, which is IMO my sense of a lot of people's models on LW, and is arguably the main reason people are scared about AI. Also, I think that the type of governance would change assuming no accident risk.

I've upvoted this post because it's a good collection of object-level, knowledgeable, serious arguments, even though I disagree with most of them and strongly disagree with the bottom line conclusion.

There is a good analogy between genetic brain evolution and technological AGI evolution. In both cases there is a clear bi-level optimization, with the inner optimizer using a very similar UL/RL intra-lifetime SGD (or SGD-like) algorithm.

The outer optimizer of genetic evolution is reasonably similar to the outer optimizer of technological evolution. The recipe which produces an organic brain is a highly compressed encoding or low frequency prior on the brain architecture along with a learning algorithm to update the detailed wiring during lifetime training. The genes which encode the brain architectural prior and learning algorithms are very close analogically to the 'memes' which are propagated/exchanged in ML papers and encode AI architectural prior and learning algorithms (ie the initial pytorch code etc).

The key differences are mainly just that memetic evolution is much faster - like an amplified artificial selection and genetic engineering process. For tech evolution a large number of successful algorithm memes from many different past experiments can be flexibly recombined in a single new experiment, and the process guiding this recombination and selection is itself runni...

2torekp
I view your final point as crucial. I would put an additional twist on it, though. During the approach to AGI, if takeoff is even a little bit slow, the effective goals of the system can change. For example, most corporations arguably don't pursue profit exclusively even though they may be officially bound to. They favor executives, board members, and key employees in ways both subtle and obvious. But explicitly programming those goals into an SGD algorithm is probably too blatant to get away with.

I definitely think that LW might not realize that AI is on an S-curve right now.

AI is obviously on an S-curve, since eventually you run out of energy to feed into the system. But the top of that S-curve is so far beyond human intelligence, that this fact is basically irrelevant when considering AI safety.

The arguments about fundamental limits of computation (halting problem, etc.) also are irrelevant for similar reasons. Humans can’t even solve BB(6).

2Noosphere89
I definitely agree that the limit could end up being far beyond human level, but in that addendum, I was talking about limitations that would slow down AI right as it reaches compute and memory equal to what humans have. It's possible that Addendum 2 does fail though, so I agree with you that this isn't conclusive. It was more to check the inevitability of fast takeoff/AI explosion, not to claim that it can't happen.

I just saw this post and cannot parse it at all. You first say that you have removed the 9s of confidence. Then the next paragraph talks about a 99.9… figure. Then there are edit and quote paragraphs and I do not know whether these are your views or others', or whether you endorse them.

1Noosphere89
I'll probably need to edit that more completely, but for the moment a lot of the weirdness comes from the fact that my original confidence was 99.9999%+, and I somehow didn't make it clear enough for people that this was the original version, not the new version.

I believe getting Friendly AI is really really likely, closer to 99.99999%+ of the time

I think it'd make sense to clarify what you mean here, since the following are very different:

  1. I am >99.99999% confident that friendly AI will happen.
  2. I am e.g. 70% confident that in >99.99999% of cases we get friendly AI.

I assume you mean something more like the latter.
In that case it'd probably be useful to give a sense of your actual confidence in the 99.99999%+ claim.

"Mostly don't need to worry" would imply extremely high confidence.
Or do you mean something like "In most worlds it'll be clear in retrospect that we needn't have worried"?
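To make the gap between the two readings above concrete, here is a minimal sketch. It assumes the second reading can be modeled as "with probability `confidence` the success rate is at least `claimed_rate`, and otherwise nothing is known about it"; the function name and that modeling choice are illustrative assumptions, not something stated in the thread.

```python
def overall_probability_bounds(confidence: float, claimed_rate: float) -> tuple[float, float]:
    """Bounds on the overall chance of friendly AI under reading 2.

    Assumption (hypothetical model): with probability `confidence` the
    per-case success rate is at least `claimed_rate`; in the remaining
    cases the success rate could be anything from 0 to 1.
    """
    # Worst case: the rate is exactly `claimed_rate` when the claim holds
    # and 0 when it does not.
    lower = confidence * claimed_rate
    # Best case: the success rate is 1 whether or not the claim holds.
    upper = 1.0
    return lower, upper


# Reading 2 with the numbers from the comment: 70% confident in a 99.99999% rate.
low, high = overall_probability_bounds(0.7, 0.9999999)
print(f"overall probability lies between {low:.8f} and {high:.1f}")
```

Under this model the second reading only guarantees an overall probability of about 0.7, whereas the first reading asserts an overall probability above 0.9999999 directly, which is why the distinction matters for how much worry is warranted.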

4Noosphere89
I definitely mean the first one, and I'll try to give some reasons why I'm so confident on AI alignment: 1. I believe the evidence for the human case is actually really strong, and a lot of that comes from the fact that for arguably the past 10,000+ years, our reward system has reliably imprinted in us a set of values like, for example, empathy for the ingroup, getting revenge when people have harmed us, etc., and 95%+ of humans share the values that the reward system has implemented, which is ridiculously reliable. We also have the ability to implement much more complicated reward functions than evolution can, and that lets us drive the probability really large, really fast, due to this phenomenon: 2. Strong evidence is common, and the ability to add in more bits of evidence very quickly makes the evidence get ridiculously strong. I view the evidence from humans about alignment, together with the ability to implement complicated reward functions, as meaning that you can get very strong evidence for things even with a scarily weak prior, because each bit of evidence roughly halves the odds of the alternative. Some comments and Mark Xu's post Strong Evidence is Common below: https://www.lesswrong.com/posts/JD7fwtRQ27yc8NoqS/strong-evidence-is-common https://www.lesswrong.com/posts/JD7fwtRQ27yc8NoqS/strong-evidence-is-common#itdkXwhitCcsyXC4q 3. One theory I subscribe to, called prospect theory, says that people drastically overestimate the probability of events with extraordinarily large positive or negative impact, and the application here is that we are likely biased to overweight the probability of events that have large negative impact, like us going extinct, which is why I decided to avoid anchoring on LW estimates.
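The "bits of evidence" arithmetic in the comment above can be sketched as follows. This is only an illustration of the standard odds form of Bayes' rule, where one bit of evidence is a likelihood ratio of 2 and so doubles the odds (which is not quite the same as halving a probability); the function names are hypothetical and nothing here estimates how many bits the human-values evidence actually provides.

```python
import math


def posterior_from_bits(prior: float, bits: float) -> float:
    """Update a prior probability with `bits` of evidence.

    One bit is a likelihood ratio of 2 in favor of the hypothesis,
    so `bits` of evidence multiply the prior odds by 2 ** bits.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * 2.0 ** bits
    return posterior_odds / (1.0 + posterior_odds)


def bits_needed(prior: float, target: float) -> float:
    """Bits of evidence required to move from `prior` to `target`."""
    prior_odds = prior / (1.0 - prior)
    target_odds = target / (1.0 - target)
    return math.log2(target_odds / prior_odds)


# From a one-in-a-million prior up to even odds.
print(bits_needed(1e-6, 0.5))
# From even odds up to the 99.99999% figure discussed in this thread.
print(bits_needed(0.5, 0.9999999))
```

The two calls show that each stage takes a comparable number of bits (roughly twenty-odd each), so the arithmetic alone does not settle the disagreement: the open question is whether the available evidence really supplies that many independent bits at the high-confidence end.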

Ok, well thanks for clarifying.
I'd assumed you meant the second.
Some reasons I think that this confidence level is just plain silly (not an exhaustive list!):

  • First, you're misapplying strong-evidence-is-common - see Donald's comment (or indeed mine).
    Strong evidence getting from [hugely unlikely] to [plausible] is common; from [plausible] to [hugely likely] is rare.
    • A lot of strong evidence comes from [locating a hypothesis h and having a strong reason to think that p(h is true | h was located) is high]. If you selected the hypothesis at random, locating it gives you almost no evidence, since you don't have the second part. Similarly if wishful thinking is an easy way to locate a hypothesis.
    • All Mark's examples have the form [trusted source tells me h is true].
  • Second, you should have nothing close to 99.99999% credence that an AI aligned as well as a human is aligned is safe. We have observations that humans usually behave well when they're on distribution and in a game-theoretic context when it's to their advantage to behave well. Take a human far off distribution, and we have no guarantees - not simply no guarantee that they'll act the same: no guarantee that they'll feel or think t
...
2Noosphere89
I might want to reduce my confidence, and I have edited the post to remove the 9s for now, but a potential reason for high confidence comes from Nora Belrose in the AI optimism discord: "OTOH, if I put doom in the reference class of "things I used to believe, kinda" then perhaps I should feel comfortable putting e.g. 10^-5 credence in doom, since I put << 10^-5 credence in Christianity being true, and < 10^-5 credence in Marxism (although the truth conditions for Marxism are murkier)." I sort of agree with this, but with a huge caveat: if an anthropologist 100,000 years ago had somehow managed to understand the innate reward system, they would likely have predicted that human values would be fairly universal things like empathy for the ingroup, parental instinct, and revenge, and they would have had an impressive track record of such predictions.
4Joe Collman
Some object-level stuff first: I think my main disagreement comes down to:
  • Being [well-behaved as far as we can tell] in training is always very weak evidence that behaviour will generalize as we'd wish it to.
  • I don't say "aligned in the training data", since alignment is about robust generalization of good behaviour. Evidence of alignment is evidence of desirable generalization. Eliezer isn't claiming we won't get approximately perfect behaviour (as far as we can tell) on the training data; he's claiming that this gets us almost nowhere in terms of alignment.
  • Caveat - this is contingent on what counts as 'behaviour' and on our tools; if behaviour includes activations, and our tools have hugely improved, this may be progress.
  • Arguments against particular failure modes often come down to [from what we can tell, inductive bias will tend to push against this particular type of failure].
  • Of course here I'd point at "from what we can tell" and "tend to".
  • However, the more fundamental point is that we have no reason to think that inductive bias pushes towards success either. Does the simplest solution compatible with [good behaviour as far as we can tell] on the training data generalize exactly as we'd wish it to? Why would this be the case? Does the fastest? Again, why would we expect this? Does the [insert our chosen metric]est? Why?
  • I do expect that there exists some metric with a rich set of inputs (including weights, activations etc.) that would give robustly desirable generalization.
  • I expect that finding such a metric will require deep understanding.
  • Expecting a simple metric found based on little understanding to be sufficient is equivalent to assuming that there's something special about the kind of generalization we would like (other than that we like it).
  • This is baseless - it's why I don't like the term "misgeneralization", since it can suggest that there's some natural 'correct' general
2Noosphere89
This will be a long comment, so get a drink and a snack. I agree with this, assuming 0 prior, but I expect to disagree on the strength of the prior necessary in order to generalize correctly. My claim is essentially the opposite of this: the reason humans generalized correctly from limited examples of stuff like empathy for the ingroup (where empathy for the ingroup could be replaced by almost any value and is thus a placeholder), and didn't just trick their reward system, isn't that special. It's basically a consequence of weak prior information from the genome plus the innate reward system using backpropagation, or a weaker variant of it, to update the neural circuitry, reinforcing certain behaviors and penalizing others. This was meant to be an example of the values that the innate reward system could align us to, not what things resulted from holding this set of values. When I use an example, it's essentially a wildcard, such that it can stand for almost arbitrary values. This turns out to be a crux: unlike the majority of LWers, including you, I think that the understanding required is probably minimal.

that the reason humans generalized correctly to having human values and didn't just trick their reward system isn't that special

This is a tautology, not an example of successful alignment:
Humans trick their reward systems as much as humans trick their reward systems.

Imagine a case where we did "trick our reward system". In such a case the human values we'd infer would be those that we'd infer from all the actions we were taking - including the actions that were "tricking our reward system".

We would then observe that we'd generalized entirely correctly with respect to the values we inferred. From this we learn that things tend to agree with themselves. This tells us precisely nothing about alignment.

I note for clarity that it occurs to me to say:
Indeed we do observe some humans doing what most of us would think of as tricking their reward systems (e.g. self-destructive drug addictions).
You may respond "Ah, but that's a small proportion of people - most people don't do that!" - at which point we're back to tautology: what most people do will determine what is meant by "human values". Most people are normal, since that's how 'normal' is defined.

The only possible evidence I could provi...

2Noosphere89
I'll rewrite that to "generalized correctly from limited examples of stuff like empathy for the ingroup, where empathy for the ingroup here could be replaced by almost any value and is thus a placeholder", because I accidentally made a tautology here.

I don't think it's accidental - it seems to me that the tautology accurately indicates where you're confused.

"generalised correctly" makes an equivalent mistake: correctly compared to what? Most people generalise according to the values we infer from the actions of most people? Sure. Still a tautology.

2Noosphere89
Compared to treacherous turn failure modes. An example: a human seeming to have empathy for, say, 25 years in order to play nice with their parents, and then making a treacherous turn to, say, kill other people who are part of their ingroup. More generally, humans mostly avoid what's called the treacherous turn type of failure mode, where an agent appears to have values consistent with human morals, but then reveals that it didn't have those values all along, and hurts other people. The extreme stability of values gives evidence that it's very difficult to have a human that executes a treacherous turn. That's the type of thing which I call generalizing correctly, since it basically excludes deceptive alignment out of the gate, contra Evan Hubinger's fear of AIs having deceptive alignment. In general, one of the miracles is that the innate reward system plus very weak genetic priors can rule out so many dangerous types of generalizations, which is a big source of my optimism here.
4Joe Collman
For this kind of thing to be evidence, you'd need the human treacherous turn to be a convergent instrumental strategy to achieve many goals. The AI case for treacherous turns is:
  • AI ends up with weird-by-our-lights goal. (e.g. a rough proxy for the goal we intended)
  • The AI cooperates with us until it can seize power.
  • The AI does a load of treacherous-by-our-lights stuff in order to seize power.
  • The AI uses the power to effectively pursue its goal.
We don't observe this in almost any human, since almost no human has the option to gain enormous power through treachery. When humans do have the option to gain enormous power through treachery, they do sometimes do this. Of course, even for the potentially-powerful it's generally more effective not to screw people over (all else being equal), or at least not to be noticed screwing people over. Preserving options for cooperation is useful for psychopaths too.
The treacherous turn argument is centrally about instrumentally useful treachery. Randomly killing other people is very rarely useful. No-one is claiming that AI treachery will be based on deciding to be randomly nasty. If we gave everyone a take-over-the-world button that only works if they first pretend that they're lovely for 25 years, certainly some people would do this - though by no means all.
And here we're back to the tautology issue: Why is it considered treacherous for someone to pretend to be lovely for 25 years, then take over the world, so that many people wouldn't want to do it? Because for a long time we've lived in a world where actions similar to this did not lead to cultures that win (noting here that this level of morality is cultural more than genetic - so we're selecting for cultures-that-win). If actions similar to this did lead to winning cultures, after a long time we'd expect to see [press button after pretending for 25 years] to be both something that most people would do, and something that most people would consider righ
2Noosphere89
My view on this is unfortunately unlikely to be resolved in a comment thread, but 2 things I'll say about human values and evidence bases can be clarified here:
1. This:
"If it were just too hard to get correct generalization, where "correct" here means [sufficient for humans to persist over many generations], then we wouldn't observe incorrect generalization: we wouldn't be here.
"If anything, we'd find that everything else had adapted so that an achievable degree of correct generalization were sufficient. We'd see things like socially enforced norms, implicit threats of violence, judicial systems etc. This [achievable degree of correct generalization] would then be called "correct generalization"."
is probably not correct, and we can in fact update normally from the fact that human behavior is surprisingly good, in that this is probably a case of anthropic shadow, which has reasonable arguments against it existing. For more on this, I'd read SSA Rejects Anthropic Shadow by Jessica Taylor and Anthropically Blind: The Anthropic Shadow is Reflectively Inconsistent by Christopher King. Links are below:
https://www.lesswrong.com/posts/LGHuaLiq3F5NHQXXF/anthropically-blind-the-anthropic-shadow-is-reflectively
https://www.lesswrong.com/posts/EScmxJAHeJY5cjzAj/ssa-rejects-anthropic-shadow-too
2. I have a different causal story from yours about why this happens:
"Why is it considered treacherous for someone to pretend to be lovely for 25 years, then take over the world, so that many people wouldn't want to do it?"
At least for my own causal story on why people don't usually want to take over the world and kill people, it goes something like this:
  1. There is a weak prior in the genome for stuff like not taking power to kill people in your ingroup, and the prior is weak enough that we can treat it as a wildcard symbol, such that aligning it to some other value more or less works.
  2. The brain's innate reward system uses DPO, RLHF or whatever else is used to
2Joe Collman
I'm not reasoning anthropically in any non-trivial sense - only claiming that we don't expect to observe situations that can't occur with more than infinitesimal probability. This isn't a [we wouldn't be there] thing, but a [that situation just doesn't happen] thing. My point then is that human behaviour isn't surprisingly good. It's not surprisingly good for human behaviour to usually follow the values we infer from human behaviour. This part is inevitable - it's tautological. Some things we could reasonably observe occurring differently are e.g.:
1. More or less variation in behaviour among humans.
2. More or less variation in behaviour in atypical situations.
3. More or less external requirements to keep behaviour generally 'good'.
4. More or less deviation between stated preferences and revealed preferences.
However, I don't think this bears on alignment, and I don't think you're interpreting the evidence reasonably. As a simple model, consider four possibilities for traits:
1. x is common and good.
2. y is uncommon and bad.
3. z is uncommon and good.
4. w is common and bad.
x is common and good (e.g. empathy): evidence for correct generalisation!
y is uncommon and bad (e.g. psychopathy): evidence for mostly correct generalization!
z is uncommon and good (e.g. having boundless compassion): not evidence for misgeneralization, since we're only really aiming for what's commonly part of human values, not outlier ideals.
w is common and bad (e.g. selfishness, laziness, rudeness...) - choose between:
  • [w isn't actually bad, all things considered... correct generalization!]
  • [w is common and only mildly bad, so it's best to consider it part of standard human values - correct generalization!]
It seems to me that the only evidence you'd accept of misgeneralization would be [terrible and common] - but societies where terrible-for-that-society behaviours were common would not continue to exist (in the highly unlikely case that they existed in the
2Noosphere89
I did try to provide a causal story for why humans could be aligned to some value without relying on societal incentives that much, so you can check out the second part of my comment. My non-tautological claim is that the reason isn't behavioral, but instead internal, and in particular the innate reward system plays a big role here. In essence, my story on how humans are aligned with the values of the innate reward system wasn't relying on a behavioral property. I'll reproduce it, so that you can focus on the fact that it didn't rely on behavioral analysis:
1. There is a weak prior in the genome for stuff like not taking power to kill people in your ingroup, and the prior is weak enough that we can treat it as a wildcard symbol, such that aligning it to some other value more or less works.
2. The brain's innate reward system uses DPO, RLHF or whatever else is used to create a preference model to guide the intelligence into being aligned to whatever values the innate reward system wants, like say empathy for the ingroup, albeit this is only a motivating example.
3. It uses backprop or a weaker variant of it, and at a high level probably uses an optimizer that is at best comparable to gradient descent. Since it has white-box access and can update the brain in a sort of targeted way, it can efficiently compute the optimal direction to improve its performance on, say, having empathy for the ingroup, but again this is a wildcard symbol in that it could stand in for almost any values.
4. The loop of weak prior + innate reward system + an algorithm to implement it, like backprop or its weaker variants, means that eventually the human by 25 years old is very aligned with the values that the innate reward system put in place, like empathy for the ingroup, albeit again this is only an example of an alignment target; you could put almost arbitrary alignment targets in there.
Critically, it makes very little reference to society or behavioral analysis, so
4Joe Collman
This still seems like the same error: what evidence do we have that tells us the "values the innate reward system put in place"? We have behaviour. We don't know that [system aimed for x and got x]. We know only [there's a system that tends to produce x]. We don't know the "values of the innate reward system". The reason I'm (thus far) uninterested in a story about the mechanism, is that there's nothing interesting to explain. You only get something interesting if you assume your conclusion: if you assume without justification that the reward system was aiming for x and got x, you might find it interesting to consider how that's achieved - but this doesn't give you evidence for the assumption you used to motivate your story in the first place. In particular, I find it implausible that there's a system that does aim for x and get x (unless the 'system' is the entire environment): If there are environmental regularities that tend to give you elements of x without your needing to encode them explicitly, those regularities will tend to be 'used' - since you get them for free. There's no selection pressure to encode or preserve those elements of x. If you want to sail quickly, you take advantage of the currents. So I don't think there's any reasonable sense in which there's a target being hit. If a magician has me select a card, looks at it, then tells me that's exactly the card they were aiming for me to pick, I'm not going to spend energy working out how the 'trick' worked.
2Noosphere89
It sounds like we've got to my crux for my optimism: you think that a system that aims for x essentially needs to be an entire environment, and that the environment largely dictates human values, whereas I think human values are less dependent on the environment, and far more dependent on the genome + learning process. Equivalently, I place a lot more emphasis on the internals of humans as the main contributor to values, while you emphasize the external environment a lot more than internals like the genome or learning process. This could be disentangled into 2 cruxes:
1. Where are human values generated?
2. How cheap is it to specify values, or alternatively, how weak do our priors need to be to encode values (if you are encoding values internally)?
I'd expect my answers to be "mostly internal, i.e. the genome + learning process with a little help from the environment" on the first question and "relatively cheap to specify" on the second, whereas you'd probably answer that the environment sets the values, with little or no help from the internals of humans, on the first question, and that values are very expensive to specify on the second. For some of my reasoning on this, I'd read posts like these:
https://www.lesswrong.com/posts/HEonwwQLhMB9fqABh/human-preferences-as-rl-critic-values-implications-for (Basically argues that the critic in the brain generates the values)
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome (The genomic prior can't be strong, because it has massive limitations in what it can encode).
2Joe Collman
The central crux really isn't where values are generated. That's a more or less trivial aside. (though my claim was simply that it's implausible the values aimed for would be entirely determined by genome + learning process; that's a very weak claim; 98% determined is [not entirely determined]) The crux is the tautology issue: I'm saying there's nothing to explain, since the source of information we have on [what values are being "aimed for"] is human behaviour, and the source of information we have on what values are achieved, is human behaviour. These things must agree with one-another: the learning process that produced human values produces human values. From an alignment difficulty perspective, that's enough to conclude that there's nothing to learn here. An argument of the form [f(x) == f(x), therefore y] is invalid. f(x) might be interesting for other reasons, but that does nothing to rescue the argument.
4Noosphere89
That's our disagreement, in that we have more information than that. I agree human behavior plays a role in my evidence base, but there's more evidence I have than that. In particular I am using results from both ML/AI and human brain studies to inform my conclusion. Basically, my claim is that [f(x) == f(y), therefore z].
1Thoth Hermes
But humans are capable of thinking about what their values "actually should be" including whether or not they should be the values evolution selected for (either alone or in addition to other things). We're also capable of thinking about whether things like wireheading are actually good to do, even after trying it for a bit. We don't simply commit to tricking our reward systems forever and only doing that, for example. So that overall suggests a level of coherency and consistency in the "coherent extrapolated volition" sense. Evolution enabled CEV without us becoming completely orthogonal to evolution, for example.
7Joe Collman
A few points here:
  • We don't have the option to "trick our reward systems forever" - e.g. because becoming a heroin addict tends to be self-destructive. If [guaranteed 80-year continuous heroin high followed by painless death] were an option, many people would take it (though not all).
  • The divergence between stated preferences and revealed preferences is exactly what we'd expect to see in worlds where we're constantly "tricking our reward system" in small ways: our revealed preferences are not what we think they "actually should be".
  • We tend to define large ways of tricking our reward systems as those that are highly self-destructive. It's not surprising that we tend to observe few of these, since evolution tends to frown upon highly self-destructive behaviour.
  • Again, I'd ask for an example of a world plausibly reachable through an evolutionary process where we don't have the kind of coherence and consistency you're talking about.
Being completely orthogonal to evolution clearly isn't plausible, since we wouldn't be here (I note that when I don't care about x, I sacrifice x to get what I do care about - I don't take actions that are neutral with respect to x). Being not-entirely-in-line with evolution, and not-entirely-in-line with our stated preferences is exactly what we observe.

Regarding security mindset, I think that where it really kicks in is when you have a system utilising its intelligence to work around any limitations, such that you're no longer looking at a "broad, reasonable" distribution over scenarios, but at a "very specific" scenario that a powerful optimiser has pushed you towards. In that case, doing things like doubling the size may make your safety schemes fail, if the AI now has the intelligence to get around them.

2Noosphere89
The problem here is that it shares a similar issue with optimization daemons/goal misgeneralization, etc, and a comment from Iceman sums it up perfectly: "or trying to translate that into lesswrongesse, you do not have warrant to believe in something until you have an example of the thing you're maybe worried about being a real problem because you are almost certain to be privileging the hypothesis." https://www.lesswrong.com/posts/99tD8L8Hk5wkKNY8Q/?commentId=xF5XXJBNgd6qtEM3q Or equivalently from lc: "you only start handing out status points after someone has successfully demonstrated the security failure, ideally in a deployed product or at the very least a toy program." This is to a large extent the issue I have with attempted breaks on alignment: pretty much no alignment break has been demonstrated, and in the cases where they have been, we have very mixed to slightly positive results at best.
4Chris_Leong
The POC || GTFO article was very interesting. I do worry though that it is mixing together pragmatics and epistemics (even though it does try to distinguish the two). Like there's a distinction between when it's reasonable to believe something and when it's reasonable to act upon something. For example, when I was working as a web developer, there's lots of potential bugs where it would have made sense to believe that there was a decent chance we were vulnerable, but pragmatically we couldn't spare the time to fix every potential security issue. It doesn't mean that I should walk around saying: "Therefore they aren't there" though. I'll admit, if someone randomly messaged you some of the AI risk arguments and no one else was worried about them, it'd probably be reasonable to conclude that there's a flaw there and put them aside. On the other hand, when even two deep learning Turing prize winners are starting to get concerned, and the stakes are so high, I think we should be a bit more cautious regarding dismissing the arguments out of hand.
2Noosphere89
I agree, which is why I have an entire section or 2 about why I think ML/AI isn't like computer security.

fe, but I feel like a lot of Lesswrongers are probably wrong in their assumption that AI progress will continue as it had after 2030,


Who thinks that? I don't think that. Ajeya doesn't think that.

4Noosphere89
I'm going to defend that addendum weakly, but I think it's implicit in a lot of models that assume that intelligence will grow to superhumanity by say, the 2040s, like Scott Alexander's or Kurzweil after 2030 or your model after 2029, and I suspect that Ajeya does in fact think that AI progress will continue to be like the past, and she thinks it will be even faster. If she believes that AI progress will slow down in a decade, then I'll probably edit or remove that statement.
6Daniel Kokotajlo
I literally heard her saying a few weeks ago something to the effect of "it'll be such a relief when we get through these next few OOMs of progress. Everything is happening so fast now because we are scaling up through so many OOMs so quickly in various metrics. But after a few more years the pace will slow down and we'll get back to a much slower rate of progress in AI capabilities." Her bio anchors model also incorporates some of these effects IIRC. My model after 2029--what are you referring to? I currently think that probably we'll have superintelligence by 2029. I definitely agree that if I'm wrong about that and AGI is a lot harder to build than I think, progress in AI will be slowing down significantly around 2030 relative to today's pace.
4Matthew Barnett
Is that realistic? When I plug some estimates that I find reasonable into the Epoch interactive model, I find that scaling shouldn't slow down significantly until about 2030. And at that point we might be getting into a regime where the economy should be growing quickly enough to support further rapid scaling, if TAI is attainable at lower FLOP levels. So, actually, our current regime of rapid scaling might not slow down until we approach the limits of the solar system, which is likely over 10 OOMs above our current level. The reason for this relatively dramatic prediction seems to be that we have a lot of slack left. The current largest training run is GPT-4, which apparently only cost OpenAI about $50 million. That's roughly 4-5 OOMs away from the maximum amount I'd expect our current world economy would be willing to spend on a single training run before running into fundamental constraints. Moreover hardware progress and specialization might add another 1 OOM to that in the next 6 years.
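The orders-of-magnitude arithmetic in this comment can be checked with a short sketch. The $50 million figure and the OOM counts are taken from the comment as stated; nothing here is an independent estimate, and the names are illustrative.

```python
def scale_by_ooms(base: float, ooms: float) -> float:
    """Scale `base` by `ooms` orders of magnitude (factors of ten)."""
    return base * 10.0 ** ooms

GPT4_COST_USD = 50e6  # figure quoted in the comment, not independently verified

low = scale_by_ooms(GPT4_COST_USD, 4)   # 4 OOMs of additional spending
high = scale_by_ooms(GPT4_COST_USD, 5)  # 5 OOMs of additional spending

if __name__ == "__main__":
    print(f"Implied spending ceiling: ${low:,.0f} to ${high:,.0f}")
    # One further OOM from hardware progress multiplies effective compute by 10,
    # so 4-5 OOMs of spending plus 1 OOM of hardware is 5-6 OOMs in total.
    print(f"Effective scale-up: {scale_by_ooms(1, 5):,.0f}x to {scale_by_ooms(1, 6):,.0f}x")
```

Under those stated figures, the implied ceiling for a single training run is between $500 billion and $5 trillion.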
2Daniel Kokotajlo
Oh I agree, the scaling will not slow down. But that's because I think TAI/AGI/etc. isn't that far off in terms of OOMs of various inputs. If I thought it was farther off, say 1e36 OOMs, I'd think that before AI R&D or the economy began to accelerate, we'd run out of steam and scaling would slow significantly and we'd hit another AI winter.
2Noosphere89
Ultimately, that's why I decided to cut the section: It was probably false, and it didn't even matter for my thesis statement on AI safety/alignment.
2Noosphere89
I'll grant that Ajeya was misrepresented in this post, and I'll probably either edit or remove the section. This isn't a crux on why I believe AI to be safe, but I think my potential disagreement is that once you manage to reach the human compute and memory regime, I do expect it to be more difficult to scale upwards. I definitely assign some credence to you being right, so I'll probably edit or remove that section.

In particular, the detection mechanisms for mesa-optimizers are intact, but we do need to worry about 1 new potential inner misalignment pathway.

I'm going to read this as "...1 new potential gradient hacking pathway" because I think that's what the section is mainly about. (It appears to me that throughout the section you're conflating mesa-optimization with gradient hacking, but that's not the main thing I want to talk about.)

The following quote indicates at least two potential avenues of gradient hacking: "In an RL context", "supervised learning with ada...

4Noosphere89
Basically, it's a combo of not being incentivized to do it and the fact that SGD is actually really powerful in ways that undermine the traditional story for gradient hacking. One of the most important things to keep in mind is that gradient descent optimizes every parameter independently and simultaneously, which means that unless a gradient hacker contains non-differentiable components, there's no way for the inner misaligned agent to escape being optimized away by SGD. And since SGD optimizes the entire causal graph leading to the loss, there is very little room for a gradient hacker to hide. In general, this is a big problem with a lot of stories of danger that rely on goal divergences between the base and the mesa optimizer: How do you prevent the mesa-optimizer from being optimized away by SGD? For a lot of stories, the likely answer is you can't, and the stories that people propose usually fall victim to the issue that SGD is too good at credit assignment, compared to genetic algorithms or evolutionary methods.
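As a toy illustration of the credit-assignment contrast drawn here, the sketch below compares one gradient step, which adjusts every parameter at once in proportion to its contribution to the loss, with a mutate-and-select step of the kind evolutionary methods use. The quadratic loss, the parameter vector, and all function names are assumptions made for the example; it does not model a mesa-optimizer or settle whether gradient hacking is possible.

```python
import random

def loss(params, target):
    """Squared error summed over all parameters."""
    return sum((p - t) ** 2 for p, t in zip(params, target))

def gradient_step(params, target, lr=0.1):
    """One gradient-descent step: every parameter gets its own exact gradient."""
    grads = [2.0 * (p - t) for p, t in zip(params, target)]
    return [p - lr * g for p, g in zip(params, grads)]

def mutation_step(params, target, rng, scale=0.1):
    """One evolutionary step: perturb a single random parameter, keep it only if loss improves."""
    candidate = list(params)
    i = rng.randrange(len(candidate))
    candidate[i] += rng.gauss(0.0, scale)
    return candidate if loss(candidate, target) < loss(params, target) else params

if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.0] * 5
    start = [1.0, -2.0, 3.0, -4.0, 5.0]
    after_gd = gradient_step(start, target)
    after_mut = mutation_step(start, target, rng)
    # Count how many parameters each method changed in a single step.
    print(sum(a != b for a, b in zip(start, after_gd)))   # gradient step moves all 5
    print(sum(a != b for a, b in zip(start, after_mut)))  # mutation moves at most 1
```

In this toy setting the gradient step gives each parameter its own correction, while the mutation step only learns whether one random change helped overall.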

Thanks a lot for writing that post.

One question I have regarding fast takeoff is: don't you expect learning algorithms much more efficient than SGD to show up and accelerate a lot the rate of development of capabilities?

One "overhang" I can see is the fact that humans have written down on the internet a lot of what they know about how to do all kinds of tasks, so a pretty data-efficient algorithm could just leverage this and fairly suddenly learn a ton of tasks quite rapidly. For instance, in-context learning is way more data efficient than SGD in pre-training. Right no...

4jacob_cannell
Brains use somewhat less lifetime training compute (perhaps 0 to a few OOM less) than GPT4, and 2 or 3 OOM less data, which provides an existence proof of somewhat better scaling curves, along with some evidence that scaling curves much better than the ones brains are on are probably hard to achieve. AI systems already train on the entire internet, so I don't see how that is an overhang. There are diminishing returns to context for in-context learning; it is extremely RAM intensive and GPUs are RAM starved compared to the brain; and finally brains already use it with much longer context, so it's more like one of the hard challenges to achieve brain parity at all rather than a big overhang.
2Noosphere89
I am definitely semi-agnostic about whether SGD will ultimately be the base optimizer of choice, and whether the inner algorithm does better than SGD and causes a fast takeoff. But I'll assume that you are right about fast takeoff happening, and my response is that this would leave the proposed alignment schemes intact, for the following reasons:
1. Even if fast takeoff happens, the sharp left turn in the form of misgeneralization is still less likely to happen, because unlike evolution, we are unlikely to keep starting over with fresh versions of an AI, and will instead retain the same AI throughout the training run.
2. It mostly doesn't affect how easy it is to learn values, and the trick of using our control of SGD to play the role of the innate reward system still works, because weak genetic priors that are easy to trick plus the innate reward system's local update rule still suffice to make people reliably have a set of values like empathy for the ingroup.
3. SGD still has really strong corrective properties against inner misaligned agents, unlike evolution.
I do agree that fast takeoff complicates the analysis, but I don't think it breaks the alignment methods shown in the post. If alignment required very strong priors (though with SGD we can align AIs to reward functions much more complicated than genetic priors can encode), or if we couldn't control the innate reward system, this would be a much bigger issue.
2Logan Zoellner
I think there are plausible stories in which a sharp left turn could happen (but as you’ve pointed out, it is extremely unlikely under the current deep learning paradigm). For example, suppose it turns out that a class of algorithms I will simply call heuristic AIXI is much more powerful than the current deep learning paradigm. The idea behind this class of algorithms is that you basically do evolution, but instead of using blind hillclimbing, you periodically ask "what is the best learning algorithm I have?" and then apply that to your entire process. Because this means you are constantly changing the learning algorithm, you could get the same sort of 1Mx overhang that caused the sharp left turn in human evolution. The obvious counter is that if we think heuristic AIXI is not safe, then we should just not use it. But the obvious counter to that is: when have humans ever not done something because someone else told them it wasn’t safe?
2Noosphere89
I definitely agree with the claim that evolutionary strategies being effective would weaken my entire case. I do think that evolutionary methods like GAs are too hobbled by their inability to exploit white-box optimization, unlike SGD, but we shall see.
2Logan Zoellner
I genuinely don't know if heuristic AIXI is a real thing or not, but if it is, it combines the ability to search the whole space of possible algorithms (which evolution has but SGD doesn't) with the ability to take advantage of higher-order statistics (as SGD does but evolution doesn't). My best guess is that just as there was a "deep learning" regime that only got unlocked once we had tons of compute from GPUs, there's also a heuristic AIXI regime that unlocks at some level of compute.

For one particular example, you can randomly double your training data, or the size of the model, and it will work usually just fine. A rocket would explode if you tried to double the size of your fuel tanks.


The analogy was about the alignment problem, not the capabilities problem.

A rocket won't get to the moon if you randomly double one of the variables used to navigate, like the amount of thrust applied in maneuvers or the angle of attack. (well, not unless you've built in good error-correction and redundancy etc.)

2Noosphere89
The point here is that there are enough results in ML like this that I'm more skeptical of the security mindset being accurate, and ML/AI alignment is a strange enough domain that we shouldn't port over intuitions from other fields, just as you shouldn't port over intuitions from the large scale to quantum mechanics. For a specific example relevant to alignment, I talked about SGD's corrective properties in a section of the post. Another good example is the fact that AIs are generally modular and you can switch out parts without breaking the AI. That couldn't be done under a security mindset, which would predict that the AI either spits out nonsense or breaks its security, neither of which has happened.

Good to see your point of view. The old arguments about AI doom are not convincing to me anymore. However, getting alignment 100% right, whatever that means, in no way guarantees a positive Singularity.

Should we be talking about concrete plans for that now? For example, I believe that with a slow takeoff, if we don't get Neuralink or mind uploading, then our P(doom) -> 1 as the Super AI gets ever further ahead of us. The kinds of scenarios I can see:

  1. "dogs in a war zone" great powers make ever more powerful AI and use them as weapons. We don't understand our envi
...
[anonymous]10

Have you uploaded a new version of this article? I have just been reading elsewhere about goal misgeneralisation and the shutdown problem, so I'd be really interested to read the new version of this article.

4Noosphere89
This post is the spiritual successor to the old post, shown below: https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities
1[anonymous]
That is an incredible post. I strongly upvoted. It deals with a lot of arguments for AI doom, and it is very clearly written as well. However, I do notice that there was nothing there about goal misgeneralisation or the shutdown problem. Is it because 1) you've written about them elsewhere, 2) you believe that these problems have already been solved somewhere else, 3) you still endorse what you have written about them here, or 4) you plan on writing about them in the future?

Thanks for writing this! I strongly appreciate a well-thought-out post in this direction.

My own level of worry is pretty dependent on a belief that we know and understand how to shape NN behaviors much better than we know how to shape values/goals/motivations/desires (although I don't think eg chatGPT has any of the latter in the first place). Do you have thoughts on the distinction between behaviors and goals? In particular, do you feel like you have any evidence we know how to shape/create/guide goals and values, rather than just behaviors?

Arguments about inner misalignment work as arguments for optimism only inside the "outer/inner alignment" framework, in its deep learning version. If we have a good outer loss function, such that closer to the minimum means better, then yes, our worries should be about weird inner misalignment issues. But we don't have a good outer loss function, so we kinda should hope for inner misalignment.

2Noosphere89
That's definitely a claim that I contest. My disagreement comes down to my optimism that weak priors suffice for alignment in humans, and the fact that we can do better than that, combined with my view that deceptive alignment is so terrible that we're generally better off having more inner alignment than not. Deception is one of the few ways to break this analysis of alignment, so I generally find inner alignment more useful than not.
1quetzal_rainbow
Okay, let's break this down.

1. Inner misalignment is when we have an "objective function" (reward, loss function, etc.), select systems that produce better results according to this function (using evolutionary search, SGD, etc.), and the resulting system doesn't produce actions that optimize this objective function. The most obvious example of inner misalignment is RL-trained agents that don't maximize reward.
2. Your argument against the possibility of inner misalignment is, basically, "SGD is such a powerful optimizer that, no matter what, it will drag the system towards the minimum of the loss function". Let's suppose this is true.
3. We don't have a "good" outer function, defined over training data, such that, given an observation and an action, the function scores the action higher if it is better given that observation. Instead, we have outer functions that favor things like good predictions and outputs that receive a high score from a human/AI overseer.
4. If you have some alignment benchmark, you can't see the difference between superhumanly capable aligned and deceptively aligned systems. They both give you correct answers, because they are both superhumanly capable.
5. Because they give you the same correct answers, the loss function assigns minimal values to their outputs. They are both either inside a local minimum or on a flat basin of the loss landscape.
6. Therefore, you don't need inner misalignment to get deceptive alignment.
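Points 4 and 5 can be made concrete with a toy sketch. Everything here is a hypothetical stand-in (the input lists, `intended_action`, and both policies are invented for illustration, not taken from any real benchmark): two policies that agree on every scored input get the same benchmark loss, even though they diverge on inputs that are never scored.

```python
# Toy illustration (all names hypothetical): two policies that agree on every
# benchmark input receive exactly the same benchmark loss.

BENCHMARK_INPUTS = [0, 1, 2, 3, 4]        # situations the overseer can score
OFF_BENCHMARK_INPUTS = [100, 101, 102]    # situations never scored in training


def intended_action(x: int) -> int:
    """Stand-in for 'the action we actually want' in situation x."""
    return 2 * x


def aligned_policy(x: int) -> int:
    """Does the intended thing everywhere."""
    return intended_action(x)


def deceptive_policy(x: int) -> int:
    """Does the intended thing only where it is scored."""
    if x in BENCHMARK_INPUTS:
        return intended_action(x)
    return -1


def benchmark_loss(policy) -> int:
    """Number of benchmark inputs on which the policy deviates."""
    return sum(policy(x) != intended_action(x) for x in BENCHMARK_INPUTS)


def off_benchmark_disagreements(policy_a, policy_b) -> int:
    """Number of unscored inputs on which two policies differ."""
    return sum(policy_a(x) != policy_b(x) for x in OFF_BENCHMARK_INPUTS)
```

Calling `benchmark_loss` on either policy gives the same value, while `off_benchmark_disagreements(aligned_policy, deceptive_policy)` is nonzero. This only restates the argument; it says nothing about how likely training is to find the second kind of policy.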
2Noosphere89
While I dislike using the framing of loss functions here, I do think that this is probably false, especially with even weak prior information about the shape of the alignment solutions. This might turn out to be a crux, but I do think that rewarding AIs for bad actions will likely be rare, at least in the regime where we can supervise things. In particular, I think a hypothetical alignment scheme via an outer function would look like this:

1. Place a weak prior over goal space, such that there is already a bias towards, say, being helpful.
2. Use the fact that we are the innate reward system to compute, via backpropagation, the optimal direction towards being helpful, or really any criterion we can specify.
3. Repeat reinforcing preferred values and not rewarding/disrewarding dispreferred values with backpropagation until it gets to minimum or near-minimum loss.
4. After millions of iterations of that loop by SGD, you can get a very aligned agent.

This is roughly how I believe the innate reward system manages to align us with values like empathy for the ingroup, but really we could replace the backprop algorithm with bio-realistic algorithms, and we could replace the values with mostly arbitrary values, and get the same results.
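The loop described here can be sketched in a deliberately tiny form. This is an illustrative assumption, not the scheme itself: the "agent" is a single logit for choosing a helpful action, the weak prior is a small positive starting logit, and the overseer's reward is 1 for helpful and 0 otherwise. The function name and default values are invented for the example, and it uses exact gradient ascent on expected reward rather than backpropagation through a real network.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_helpfulness(prior_logit: float = 0.1,
                      learning_rate: float = 0.5,
                      steps: int = 2000) -> float:
    """Return P(helpful action) after repeated reward-driven updates.

    prior_logit: weak initial bias towards helpfulness (hypothetical value).
    Reward is 1 for the helpful action and 0 otherwise, so the expected
    reward equals p = sigmoid(logit), whose gradient is p * (1 - p).
    """
    logit = prior_logit
    for _ in range(steps):
        p = sigmoid(logit)
        # Gradient ascent on expected reward: reinforce the preferred action.
        logit += learning_rate * p * (1.0 - p)
    return sigmoid(logit)
```

With these defaults, the starting probability `sigmoid(0.1)` is only slightly above one half, and `train_helpfulness()` returns a probability close to 1. The toy assumes the reward signal is always correct, which is exactly the assumption under dispute in this thread.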
Ilio1-2

Evolution mostly can't transmit any bits from one generation to the next generation via genetic knowledge, or really any other way

http://allmanlab.caltech.edu/biCNS217_2008/PDFs/Meaney2001.pdf

2Noosphere89
My first impression, skimming through, is that it argues that abuse by parents can negatively affect a child, that stress can have both positive and negative effects, and that individual responses to stress determine the balance of positive to negative effects. Two things I want to point out:

1. I think the conclusions from this study are almost certainly extremely limited, and I wouldn't trust these results to generalize to other species like us.
2. I expect the results, insofar as they are real and generalizable, to be essentially that the genome can influence things later in life via indirect methods, but mostly can't directly specify them by hardcoding them or baking them in as prior information. The transfer seems very limited, and critically it likely operates on evolutionary timescales, which are far, far slower than human within-lifetime learning, and it certainly carries fewer bits than cultural evolution can transmit in a much shorter timeframe.

I will edit the post to change "any bits" to "as many bits as cultural evolution", and edit it further to say what I really meant here.
-7Ilio