why we mostly don't need to worry about AI
This topic is poorly understood; very high confidence is obviously wrong for any claim that's not exceptionally clear. Absence of doom is not such a claim, so the need to worry isn't going anywhere.
Without sufficient clarity, which humanity doesn't possess on this topic, no amount of somewhat confused arguments is sufficient for the kind of certainty that makes the remaining risk of extinction not worth worrying about. It's important to understand and develop what arguments we have, but in their present state they are not suitable for arguing this particular case outside their own assumption-laden frames.
When reunited with unknown unknowns outside their natural frames, such arguments might plausibly make it reasonable to believe the risk of extinction is as low as 10%, or as high as 90%, but nothing more extreme than that. Nowhere across this whole range of epistemic possibilities is a situation that we "mostly don't need to worry about".
I believe the security mindset is inappropriate for AI
I think that's because AI today feels like a software project akin to building a website. If it works, that's nice, but if it doesn't work it's no big deal.
Weak systems have safe failures because they are weak, not because they are safe. If you piss off a kitten, it will not kill you. If you piss off an adult tiger...
The optimistic assumptions laid out in this post don't have to fail in every possible case for us to be in mortal danger. They only have to fail in one set of circumstances that someone actualizes. And as long as things keep looking like they are OK, people will continue to push the envelope of risk to get more capabilities.
We have already seen AI developers throw caution to the wind in many ways (releasing weights as open source, connecting AI to the internet, giving it access to a command prompt) and things seem OK for now so I imagine this will continue. We have already seen some psycho behavior from Sydney too. But all these systems are weak reasoners and they don't have a particularly solid grasp on cause and effect in the real world.
We are certainly in a better position with respect to winning than when I started posting on this website. To me the big wins are (1) that safety is a mainstream topic and (2) that the AIs learned English before they learned physics. But I don't regard those as sufficient for human survival.
Do you just like not believe that AI systems will ever become superhumanly strong? That once you really crank up the power (via hardware and/or software progress), you'll end up with something that could kill you?
Read what I wrote above: current systems are safe because they're weak, not because they're inherently safe.
Security mindset isn't necessary for weak systems because weak systems are not dangerous.
It's not just about "being taken seriously", although that's a nice bonus - it's also about getting shared understanding about what makes programs secure vs. insecure. You need a method of touching grass so that researchers have some idea of whether or not they're making progress on the real issues.
At this point it is not clear to me what you mean by security mindset. I understand by it what Bruce Schneier described in the article I linked, and what Eliezer describes here (which cites and quotes from Bruce Schneier). You have cited QuintinPope, who also cites the Eliezer article, but gets from it this concept of "security mindset": "The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions". From this and his further words about the concept, he seems to mean something like "programming mindset", i.e. good practice in software engineering. Only if I read both you and him as using "security mindset" to mean that can I make sense of the way you both use the term.
But that is simply not what "security mindset" means. Recall that Schneier's article began with the example of a company selling ant farms by mail order, nothing to do with software. After several more examples, only one of which concerns computers, he gives his own short characterisation of the concept that he is talking about:
...the security mindset involves thinking about how things can be made to fail. It involves thinking like an attacker, an adversary or a criminal. [...]
Downvote for being absurdly overconfident, and thereby harming the whole direction of more optimism on alignment. I'd downvote Eliezer for the same reason on his 99.99% doom arguments in public; they are visibly silly, making the whole direction seem silly by association.
In both cases, there are too many unknown unknowns to have confidences remotely that high. And you've added way more silly zeros than EY, despite having looser arguments.
This is a really important topic; we need serious discussion of how to really think about alignment difficulty. This is a serious attempt, but it's just not realistically humble. It also seems to be ignoring the cultural norm and explicit stated goal of writing to inform, not to persuade, on LW.
So, I look forward to your next iteration, improved by the feedback on this post!
I’m pretty confused about almost everything you said about “innate reward system”.
My view is: the relevant part of the human innate reward system (the part related to compassion, norm-following, etc.) consists of maybe hundreds of lines of code, and nobody knows what they are, and I would feel better if we did. (And that happens to be my own main research interest.)
Whereas your view seems to be: umm, I’m not sure, I’m gonna say things and you can correct me. Maybe you think that (1) the innate reward system is simple, (2) when we do RLHF, we are providing tens of thousands of samples of what the innate reward system would do in different circumstances, (3) and therefore ML will implicitly interpolate how the innate reward system works from that data, (4) …and this will continue to extrapolate to norm-following behavior etc. even in out-of-distribution situations like inventing new society-changing technology. Is that right? (I’m stating this possible argument without endorsing or responding to it, I’m still at the trying-to-understand-you phase.)
On the topic of security mindset, the thing that the LW community calls "security mindset" isn't even an accurate rendition of what computer security people would call security mindset. As noted by lc, actual computer security mindset is POC || GTFO; or, trying to translate that into lesswrongese, you do not have warrant to believe in something until you have an example of the thing you're maybe worried about being a real problem, because you are almost certain to be privileging the hypothesis.
In the cybersecurity analogy, it seems like there are two distinct scenarios being conflated here:
1) Person A says to Person B, "I think your software has X vulnerability in it." Person B says, "This is a highly specific scenario, and I suspect you don't have enough evidence to come to that conclusion. In a world where X vulnerability exists, you should be able to come up with a proof-of-concept, so do that and come back to me."
2) Person B says to Person A, "Given XYZ reasoning, my software almost certainly has no critical vulnerabilities of any kind. I'm so confident, I give it a 99.99999%+ chance." Person A says, "I can't specify the exact vulnerability your software might have without it in front of me, but I'm fairly sure this confidence is unwarranted. In general it's easy to underestimate how your security story can fail under adversarial pressure. If you want, I could name X hypothetical vulnerability, but this isn't because I think X will actually be the vulnerability, I'm just trying to be illustrative."
Story 1 seems to be the case where "POC or GTFO" is justified. Story 2 seems to be the case where "security mindset" is justified.
It's very different to suppose a particula...
At the very least I think it would be more accurate to say “one aspect of actual computer security mindset is POC || GTFO”. Right? Are you really arguing that there’s nothing more to it than that?? That seems insane to me.
Even leaving that aside, here’s a random bug thread:
Mozilla developers identified and fixed several stability bugs in the browser engine used in Firefox and other Mozilla-based products. Some of these crashes showed evidence of memory corruption under certain circumstances and we presume that with enough effort at least some of these could be exploited to run arbitrary code. [emphasis added]
IIUC they treated these crashes as a security vulnerability, not a mere usability problem, and thus did things like not publicly disclosing the details until they had a fix ready to go, categorizing the fix as a high-priority security update, etc.
If your belief is that “actual computer security mindset is POC||GTFO”, then I think you’d have to say that these Mozilla developers do not have computer security mindset, and instead were being silly and overly paranoid. Is that what you think?
You're right that this is definitely not "security mindset". Iceman is distorting the point of the original post. But also, the reason Mozilla's developers can do that and get public credit for it is partially because the infosec community has developed tens of thousands of catastrophic RCE's from very similar exploit primitives, and so there is loads of historical evidence that those particular kinds of crashes lead to exploitable bugs. Alignment researchers lack the same shared understanding - they're mostly philosopher-mathematicians with no consensus even among themselves about what the real issues are, and so if one tries to claim credit for averting catastrophe in a similar situation it's impossible to tell if they're right.
POC || GTFO is not "security mindset", it's a norm. It's like science in that it's a social technology for making legible intellectual progress on engineering issues, and allows the field to parse who is claiming to notice security issues to signal how smart they are vs. who is identifying actual bugs. But a lack of "POC || GTFO" culture doesn't tell you that nothing is wrong, and demanding POCs for everything obviously doesn't mean you understand what is and isn't secure. Or to translate that into lesswrongese, reversed stupidity is not intelligence.
Citation needed? The one computer security person I know who read Yudkowsky's post said it was a good description of security mindset. POC||GTFO sounds useful and important too but I doubt it's the core of the concept.
Also, if the toy models, baby-AGI-setups like AutoGPT, and historical examples we've provided so far don't meet your standards for "example of the thing you're maybe worried about" with respect to AGI risk, (and you think that we should GTFO until we have an example that meets your standards) then your standards are way too high.
If instead POC||GTFO applied to AGI risk means "we should try really hard to get concrete, use formal toy models when possible, create model organisms to study, etc." then we are already doing that and have been.
On POCs for misalignment, specifically goal misgeneralization, there are pretty fundamental differences between what has been shown and what was predicted: in the demonstrations so far, the train and test behavior across environments are similar or the same, while in goal misgeneralization speculations the train and test behavior are wildly different:
Rohin Shah has a comment on why most POCs aren't that great here:
For white box vs black box, after further discussion I wound up feeling like people just use the term “black box” differently in different fields, and in practice maybe I’ll just taboo “black box” and “white box” going forward. Hopefully we can all agree on:
If a LLM outputs A rather than B, and you ask me why, then it might take me decades of work to give you a reasonable & intuitive answer.
And likewise we can surely all agree that future AI programmers will be able to see the weights and perform SGD.
This whole post seems to be about accident risk, under the assumption that competent programmers are trying in good faith to align AI to “human values”. It’s fine for you to write a blog post on that—it’s an important and controversial topic! But it’s a much narrower topic than “AI safety”, right? AI safety includes lots of other things too—like bad actors, or competitive pressures to make AIs that are increasingly autonomous and increasingly ruthless, or somebody making ChaosGPT just for the lols, etc. etc.
One can argue that algorithmic & hardware improvements will never ever be enough to put human-genius-level human-speed AGI in the hands of tons of ordinary people e.g. university students with access to a cluster.
Or, one can argue that tons of ordinary people will get such access sooner or later, but meanwhile large institutional actors will have super-duper-AGIs, and they will use them to make the world resilient against merely human-genius-level-chaosGPTs, somehow or other.
Or, one can argue that ordinary people will never be able to do stupid things with human-genius-level AGIs because the government (or an AI singleton) will go around confiscating all the GPUs in the world or monitoring how they’re used with a keylogger and instant remote kill-switch or whatever.
As it happens, I’m pretty pessimistic about all of those things, and therefore I do think lols are a legit concern.
(Also, “just for the lols” is not the only way to get ChaosGPT; another path is “We should do this to better understand and study possible future threats”, but then fail to contain it. Large institutions could plausibly do that. If you disagree—if you’re thinking “nobody would be so stupid as to do that”—note the existence of gain-of-function research, lab leaks, etc. in biology.)
I've upvoted this post because it's a good collection of object-level, knowledgeable, serious arguments, even though I disagree with most of them and strongly disagree with the bottom line conclusion.
There is a good analogy between genetic brain evolution and technological AGI evolution. In both cases there is a clear bi-level optimization, with the inner optimizer using a very similar UL/RL intra-lifetime SGD (or SGD-like) algorithm.
The outer optimizer of genetic evolution is reasonably similar to the outer optimizer of technological evolution. The recipe which produces an organic brain is a highly compressed encoding or low frequency prior on the brain architecture along with a learning algorithm to update the detailed wiring during lifetime training. The genes which encode the brain architectural prior and learning algorithms are very close analogically to the 'memes' which are propagated/exchanged in ML papers and encode AI architectural prior and learning algorithms (ie the initial pytorch code etc).
The key differences are mainly just that memetic evolution is much faster - like an amplified artificial selection and genetic engineering process. For tech evolution a large number of successful algorithm memes from many different past experiments can be flexibly recombined in a single new experiment, and the process guiding this recombination and selection is itself runni...
I definitely think that LW might not realize that AI is on an S-curve right now.
AI is obviously on an S-curve, since eventually you run out of energy to feed into the system. But the top of that S-curve is so far beyond human intelligence, that this fact is basically irrelevant when considering AI safety.
The arguments about fundamental limits of computation (halting problem, etc.) are also irrelevant for similar reasons. Humans can't even solve BB(6).
I just saw this post and cannot parse it at all. You first say that you have removed the 9s of confidence. Then the next paragraph talks about a 99.9… figure. Then there are edit and quote paragraphs and I do not know whether these are your views or other or whether you endorse them.
I believe getting Friendly AI is really really likely, closer to 99.99999%+ of the time
I think it'd make sense to clarify what you mean here, since the following are very different:
I assume you mean something more like the latter.
In that case it'd probably be useful to give a sense of your actual confidence in the 99.99999%+ claim.
"Mostly don't need to worry" would imply extremely high confidence.
Or do you mean something like "In most worlds it'll be clear in retrospect that we needn't have worried"?
Ok, well thanks for clarifying.
I'd assumed you meant the second.
Some reasons I think that this confidence level is just plain silly (not an exhaustive list!):
that the reason humans generalized correctly to having human values and didn't just trick their reward system isn't that special
This is a tautology, not an example of successful alignment:
Humans trick their reward systems as much as humans trick their reward systems.
Imagine a case where we did "trick our reward system". In such a case the human values we'd infer would be those that we'd infer from all the actions we were taking - including the actions that were "tricking our reward system".
We would then observe that we'd generalized entirely correctly with respect to the values we inferred. From this we learn that things tend to agree with themselves. This tells us precisely nothing about alignment.
I note for clarity that it occurs to me to say:
Indeed we do observe some humans doing what most of us would think of as tricking their reward systems (e.g. self-destructive drug addictions).
You may respond "Ah, but that's a small proportion of people - most people don't do that!" - at which point we're back to tautology: what most people do will determine what is meant by "human values". Most people are normal, since that's how 'normal' is defined.
The only possible evidence I could provi...
I don't think it's accidental - it seems to me that the tautology accurately indicates where you're confused.
"generalised correctly" makes an equivalent mistake: correctly compared to what? Most people generalise according to the values we infer from the actions of most people? Sure. Still a tautology.
Regarding security mindset, I think that where it really kicks in is when you have a system utilising its intelligence to work around any limitations, such that you're no longer looking at a broad, reasonable distribution of scenarios, but at a very specific scenario that a powerful optimiser has pushed you towards. In that case, doing things like doubling the size may break your safety schemes if the AI now has the intelligence to get around them.
...but I feel like a lot of Lesswrongers are probably wrong in their assumption that AI progress will continue as it had after 2030...
Who thinks that? I don't think that. Ajeya doesn't think that.
In particular, the detection mechanisms for mesa-optimizers are intact, but we do need to worry about 1 new potential inner misalignment pathway.
I'm going to read this as "...1 new potential gradient hacking pathway" because I think that's what the section is mainly about. (It appears to me that throughout the section you're conflating mesa-optimization with gradient hacking, but that's not the main thing I want to talk about.)
The following quote indicates at least two potential avenues of gradient hacking: "In an RL context", "supervised learning with ada...
Thanks a lot for writing that post.
One question I have regarding fast takeoff: don't you expect learning algorithms much more efficient than SGD to show up and greatly accelerate the rate of development of capabilities?
One "overhang" I can see is the fact that humans have written a lot of what they know about how to do all kinds of tasks on the internet, so a pretty data-efficient algorithm could just leverage this and fairly suddenly learn a ton of tasks quite rapidly. For instance, in-context learning is way more data-efficient than SGD in pre-training. Right no...
For one particular example, you can randomly double your training data, or the size of the model, and it will work usually just fine. A rocket would explode if you tried to double the size of your fuel tanks.
The analogy was about the alignment problem, not the capabilities problem.
A rocket won't get to the moon if you randomly double one of the variables used to navigate, like the amount of thrust applied in maneuvers or the angle of attack. (well, not unless you've built in good error-correction and redundancy etc.)
Good to see your point of view. The old arguments about AI doom are not convincing to me anymore; however, getting alignment 100% right, whatever that means, in no way guarantees a positive Singularity.
Should we be talking about concrete plans about that now? For example, I believe that with a slow takeoff, if we don't get Neuralink or mind uploading, then our P(doom) -> 1 as the Super AI gets ever further ahead of us. The kind of scenarios I can see...
Have you uploaded a new version of this article? I have just been reading elsewhere about goal misgeneralisation and the shutdown problem, so I'd be really interested to read the new version of this article.
Thanks for writing this! I strongly appreciate a well-thought out post in this direction.
My own level of worry is pretty dependent on a belief that we know and understand how to shape NN behaviors much better than we know how to shape values/goals/motivations/desires (although I don't think e.g. ChatGPT has any of the latter in the first place). Do you have thoughts on the distinction between behaviors and goals? In particular, do you feel like you have any evidence that we know how to shape/create/guide goals and values, rather than just behaviors?
Arguments about inner misalignment work as arguments for optimism only inside the "outer/inner alignment" framework, in its deep learning version. If we had a good outer loss function, such that closer to the minimum means better, then yes, our worries should be about weird inner misalignment issues. But we don't have a good outer loss function, so we kinda should hope for inner misalignment.
Evolution mostly can't transmit any bits from one generation to the next generation via genetic knowledge, or really any other way
http://allmanlab.caltech.edu/biCNS217_2008/PDFs/Meaney2001.pdf
Or, why we probably don't need to worry about AI.
So this post is partially a response to Amalthea's comment about how I simply claimed that my side is right; I had responded by saying that I was going for a short comment rather than making another very long comment on the issue.
https://www.lesswrong.com/posts/aW288uWABwTruBmgF/?commentId=r7s9JwqP5gt4sg4HZ#r7s9JwqP5gt4sg4HZ
This is the post where, instead of simply claiming that my side is right, I give evidence and properly collect my thoughts. It will be a link-heavy post, and I'll reference a lot of concepts and conversations, so it will help if you have some light background on these ideas, but I will try to make everything intelligible to the lay/non-technical reader.
This will be a long post, so get a drink and a snack.
The Sharp Left Turn probably won't happen, because AI training is very different from evolution
Nate Soares suggests that a critical problem in AI safety is the sharp left turn, which is essentially that capabilities generalize much further than goals do, i.e. it is basically goal misgeneralization plus fast takeoff:
So essentially, the analogy is that the AI is aligned on the training data but, due to the limitations of the alignment method, its alignment fails to generalize to the test set.
Here's the problem: We actually know why the sharp left turn happened, and the circumstances that led to the sharp left turn in humans won't reappear in AI training and AI progress.
Basically, the sharp left turn happened because the outer optimizer of evolution was billions of times less powerful than the inner search process, i.e. human lifetime learning, and the inner learners like us humans die after basically a single step, or at best 2-3 steps, of the outer optimizer. Evolution mostly can't transmit as many bits from one generation to the next via its tools, compared to cultural evolution, and the difference between their abilities to transmit bits over certain time-scales is massive.
Once we had the ability to transmit some information via culture, that meant that, given our ability to optimize billions of times more efficiently, we could essentially undergo a sharp left turn where capabilities spiked. But the only reason this happened was, to quote Quintin Pope:
This does not exist for AIs trained with SGD, and there is a much smaller gap between the outer optimizer SGD and the inner optimizer, with the difference being ~0-40x.
Here's the source for it below, and I'll explicitly quote it:
https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/evolution-provides-no-evidence-for-the-sharp-left-turn#Don_t_misgeneralize_from_evolution_to_AI
Also, we can set the ratio of outer to inner optimization steps to basically whatever we want, which means that we can control the inner learner's rate of learning far better than evolution could, letting us prevent a sharp left turn from happening.
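To make this concrete, here is a minimal, purely illustrative sketch (my own toy setup, not any real training pipeline; all names and numbers are placeholders) of the point that the outer-to-inner step ratio is just a hyperparameter we pick, whereas evolution only got roughly one outer "step" per organism lifetime:

```python
# Toy sketch: the ratio of "inner" adaptation (in-context, within an episode,
# no weight changes) to "outer" optimizer steps (weight updates) is a dial we
# choose, unlike in evolution. Everything here is an illustrative placeholder.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.GRU(input_size=8, hidden_size=32, batch_first=True)  # stand-in learner
head = nn.Linear(32, 8)
opt = torch.optim.SGD(list(model.parameters()) + list(head.parameters()), lr=0.1)

INNER_STEPS_PER_OUTER_STEP = 64   # how much within-episode adaptation happens
OUTER_STEPS = 100                 # before each weight update we make

for outer_step in range(OUTER_STEPS):
    # "Inner loop": the model adapts in-context over a whole episode/sequence,
    # with no weight changes, loosely analogous to within-lifetime learning.
    x = torch.randn(1, INNER_STEPS_PER_OUTER_STEP, 8)
    target = x.roll(shifts=-1, dims=1)          # toy next-step prediction task
    hidden_states, _ = model(x)
    pred = head(hidden_states)
    loss = nn.functional.mse_loss(pred[:, :-1], target[:, :-1])

    # "Outer loop": one gradient step on the weights. We decide how often this
    # interleaves with inner adaptation simply by changing the constants above.
    opt.zero_grad()
    loss.backward()
    opt.step()
```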
A crux I have with Jan Kulveit is that, to the extent that animals do have culture, it is much more limited than human culture; that evolution largely has little ability to pass on traits non-culturally; and, very critically, that this was a one-time inefficiency: there is no reason to assume a second source of massive inefficiency leading to another sharp left turn:
X4vier in particular illustrates this, and I'll link the comments below:
https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/?commentId=qYFkt2JRv3WzAXsHL
https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/?commentId=vETS4TqDPMqZD2LAN
I don't believe that Nate's example actually shows the misgeneralization we're concerned about
This is because the alleged misgeneralization was not a situation where one AI was trained in an environment and maximized the correlates of IGF, and then, in a new environment, encountered inputs that changed its goals such that it misgeneralizes and no longer pursues IGF.
What happened is that evolution trained humans in one environment to optimize the correlates of IGF, then basically trained new humans in another environment, and they diverged.
Very critically, there were thousands of different systems/humans being trained in drastically different environments, not one AI being trained across different environments as in modern AI training, so it's not a valid example of misgeneralization.
Some posts and quotes from Quintin Pope will help:
A comment by Quintin on why humans didn't actually misgeneralize to liking ice cream:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/?commentId=sYA9PLztwiTWY939B
AIs are white boxes, and we are the innate reward system
Edit, prompted by comments from Steven Byrnes: the white-box definition I'm using in this post does not correspond to the intuitive definition of a white box, and instead refers to the computer-analysis/security sense of the term.
These links will be the definitions of white box AI going forward for this post:
https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/ai-pause-will-likely-backfire#Alignment_optimism__AIs_are_white_boxes
https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/?commentId=CLi5eBchYfXKZvXuD
https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/?commentId=qisPbHyDHMKxgNGeh#qisPbHyDHMKxgNGeh
The above arguments, for why the sharp left turn probably won't reappear in modern AI development and why humans didn't actually misgeneralize, are enough to land us outside the most doomy views like Eliezer Yudkowsky's; in particular, removing the reasons to expect extreme misgeneralization lands us outside MIRI-sphere views, and arguably below 50% p(doom). But I want to argue that the chance of doom is far lower than that, so low that we mostly shouldn't be concerned about AI, and thus I have to provide a positive story of why AIs are very likely aligned. That story is that AIs are white boxes and, in this context, we are the innate reward system.
The key advantage we have over evolution is that, unlike with brains, we have full read-write access to AIs' internals: they're essentially a special type of computer program, and we already have ways to manipulate computer programs at essentially no cost to us. Indeed, this is why SGD and backpropagation work at all to optimize neural networks. If the AI were a black box, SGD and backpropagation wouldn't work.
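As a toy illustration of what white-box access means in this computer-analysis/security sense (a minimal sketch of the general idea, not anyone's actual alignment method): we can read any weight, overwrite it, and get exact gradients of any scalar with respect to every parameter, none of which we can do with a biological brain:

```python
# Toy illustration of white-box access: full read/write access to parameters,
# plus exact gradients of any scalar we define, at essentially no cost.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

# Read: inspect any internal parameter directly.
print("a few first-layer weights:", net[0].weight[0, :3].tolist())

# Write: overwrite internals however we like.
with torch.no_grad():
    net[0].weight[0, 0] = 0.0

# Differentiate: exact gradient of a scalar w.r.t. every parameter, which is
# precisely what lets SGD/backprop optimize the network in the first place.
x = torch.randn(8, 4)
loss = net(x).pow(2).mean()
loss.backward()
print("first-layer gradient norm:", net[0].weight.grad.norm().item())
```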
The innate reward system aligns us via white-box methods, and the values that the reward system imprints on us are ridiculously reliable: almost every human has empathy for friends and acquaintances, parental instincts, a desire for revenge, etc.
This is shown in the link below:
https://forum.effectivealtruism.org/s/vw6tX5SyvTwMeSxJk/p/JYEAL8g7ArqGoTaX6#White_box_alignment_in_nature
(Here, we must take a detour and note that our reward system is ridiculously good at aligning us to survive, and that flaws like obesity in the modern world are usually surprisingly mild failures, in which the human simply turns out to be less capable than we thought. This arguably implies that alignment failures in practice will look much more like capabilities failures; passing the analogy back to the AI case, I basically don't expect X-risk, GCRs, or really anything more severe than, say, the AI messing up a kitchen.)
Steven Byrnes raised the concern that if you don't know how to do the manipulation, then it does cost you to gain the knowledge.
Steven Byrnes's comment is linked here: https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/?commentId=3xxsumjgHWoJqSzqw
Nora Belrose responded on what white-boxing means, as well as on how people use SGD to automate the search, so that the overall cost of manipulation is as low as possible:
https://twitter.com/norabelrose/status/1709603325078102394
https://twitter.com/norabelrose/status/1709606248314998835
https://twitter.com/norabelrose/status/1709601025286635762
https://twitter.com/norabelrose/status/1709603731413901382
Steven Byrnes argues that this could be due to differing definitions:
https://twitter.com/steve47285/status/1709655473941631430
This is the response chain that let me see why Nora Belrose and Steven Byrnes were disagreeing.
I ultimately think a key difference is that, for alignment purposes, "humans vs. AI" is not a very useful abstraction; "SGD vs. the inner optimizer" is the better abstraction here. Thus it doesn't matter how AI progresses in general; what matters is the specific progress of humans + SGD vs. the inner optimizer, and on that framing the cost of manipulating AI values is quite low.
This leads to...
I believe the security mindset is inappropriate for AI
In general, a common disagreement I have with a lot of LWers is that I think there is very limited transfer of knowledge from the computer security field to AI, because AI is very different in ways that make the analogies inappropriate.
For one particular example, you can randomly double your training data, or the size of the model, and it will work usually just fine. A rocket would explode if you tried to double the size of your fuel tanks.
All of this and more is explained by Quintin below. There are several big disanalogies between the AI field and the computer security field, so much so that I think ML/AI is a lot like quantum mechanics: because of the weirdness of the domain, we shouldn't port intuitions from other fields and expect them to work:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/#Yudkowsky_mentions_the_security_mindset__
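To make the doubling example concrete, here's a tiny, self-contained toy (synthetic regression with a small MLP; purely illustrative, not from Quintin's post) showing that blindly doubling the hidden width or the dataset size still trains without incident:

```python
# Toy illustration: doubling model width or dataset size still "just works",
# in a way that has no analogue for physical systems like rockets.
import torch
import torch.nn as nn

def train(hidden_size: int, n_samples: int) -> float:
    torch.manual_seed(0)
    x = torch.randn(n_samples, 10)
    y = x.sum(dim=1, keepdim=True)             # simple synthetic regression target
    model = nn.Sequential(nn.Linear(10, hidden_size), nn.ReLU(),
                          nn.Linear(hidden_size, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(500):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

print("baseline loss:         ", train(hidden_size=32, n_samples=1000))
print("2x model size loss:    ", train(hidden_size=64, n_samples=1000))
print("2x training data loss: ", train(hidden_size=32, n_samples=2000))
```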
I also believe there is an epistemic difference between computer security and alignment: in computer security, there's an easy-to-check ground truth for whether a cryptosystem is broken, whereas in AI alignment we don't have the ability to get feedback from proposed breaks of alignment schemes.
For more, see the section of Quintin's post on the epistemic difference between AI safety and computer security, and a worked example of an attempted security break, where there is suggestive evidence that inner-misaligned models/optimization daemons go away as we increase the number of dimensions.
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#True_experts_learn__and_prove_themselves__by_breaking_things
(There, Quintin Pope talks about the fact that alignment doesn't have good feedback loops on the ground truth of "what counts as an attempted break?", and gives an example of a claimed break that actually went away as the number of dimensions was scaled up; note that the disconfirmatory evidence was more realistic than the attempted break.)
This is why I disagreed with Jeffrey Ladish about the security mindset on Twitter: I believe it's a trap for those not possessing technical knowledge, like a lot of LWers, and there are massive differences between AI and computer security that mean most attempted connections fail.
https://twitter.com/JeffLadish/status/1712262020438131062
https://twitter.com/SharmakeFarah14/status/1712264530829492518
So, now that I've tried to show why porting over the security mindset is flawed, I want to talk about a class of adversaries, gradient hackers and inner-misaligned mesa-optimizers, and why I believe such attacks are actually very difficult to pull off against SGD: even with the non-platonic-ideal version of SGD, we can detect most mesa-optimizers quite easily.
Inner Misalignment, or at least Gradient Hacking is very difficult for AIs trained on SGD
I'll be taking the inner misalignment definition from Evan Hubinger's post The Inner Alignment Problem:
https://www.lesswrong.com/posts/pL56xPoniLvtMDQ4J/the-inner-alignment-problem
The basic reason why it's hard for a misaligned mesa-optimizer to stick around for long is that gradient descent is, in fact, much more powerful and white-boxy than people realize; in particular, it has 5 defenses that any mesa-optimizer would need to overcome in order to misalign it:
https://www.lesswrong.com/posts/w2TAEvME2yAG9MHeq/gradient-hacking-is-extremely-difficult
Basically, it will optimize the entire causal graph and leave no slack, and as a bonus it is extremely resistant to blackmail by mesa-optimizers. In general, a big part of my optimism around inner alignment is that SGD is extraordinarily good at credit assignment, and it has quite strong correction mechanisms in case a mesa-optimizer does attempt to misalign it.
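Here's a minimal toy illustration of the "no slack" point (my own simplified example, not a demonstration about any real gradient-hacking attempt): every parameter that causally influences the loss receives a gradient and gets updated, while a sub-circuit can only avoid being updated by having no influence on the loss at all:

```python
# Toy illustration of "SGD optimizes the entire causal graph": any parameter
# that influences the loss receives a gradient (and will be updated), while a
# path with no influence on the loss receives no gradient at all.
import torch
import torch.nn as nn

torch.manual_seed(0)

class TwoPathNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.used_path = nn.Linear(4, 4)     # feeds into the loss
        self.unused_path = nn.Linear(4, 4)   # computed, but never reaches the loss
        self.head = nn.Linear(4, 1)

    def forward(self, x):
        used = torch.relu(self.used_path(x))
        _ = torch.relu(self.unused_path(x))  # dead end in the causal graph
        return self.head(used)

net = TwoPathNet()
x = torch.randn(16, 4)
loss = net(x).pow(2).mean()
loss.backward()

# Nonzero gradient: this path affects the loss, so SGD will reshape it.
print("used path grad norm:", net.used_path.weight.grad.norm().item())
# No gradient accumulated (stays None): the only way to escape updates is to
# have no causal influence on the loss, i.e. to do nothing.
print("unused path grad:", net.unused_path.weight.grad)
```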
We also can detect most mesa-optimizers in the AI without the need for mechanistic interpretability, like so:
One caveat here is that the prevention of mesa-optimizers applies fully only to SSL learning on IID data, which is an unfortunate limitation, though I do expect SGD to still be ridiculously good at credit assignment even in the RL context.
In particular, the detection mechanisms for mesa-optimizers are intact, but we do need to worry about 1 new potential inner misalignment pathway.
But there's also weak evidence that optimization daemons/demons, often called inner misaligned models, go away when you increase the dimension count:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#True_experts_learn__and_prove_themselves__by_breaking_things
https://www.lesswrong.com/posts/X7S3u5E4KktLp7gHz/tessellating-hills-a-toy-model-for-demons-in-imperfect
https://www.lesswrong.com/posts/X7S3u5E4KktLp7gHz/tessellating-hills-a-toy-model-for-demons-in-imperfect?commentId=hwzu5ak8REMZuBDBk
This was actually a crux in a discussion between me and David Xu about inner alignment. I argued that the sharp-left-turn conditions don't exist in AI development; he argued that misalignment happens when there are gaps that go uncorrected, likely referring to the gap between the base objective that SGD optimizes and the internal optimizer's goal, which leads to inner misalignment. I argued that inner misalignment is likely to be extremely difficult to achieve, since SGD can correct the gap between the inner optimizer and the outer objective in most cases, and I have now shown that argument in this post:
Twitter conversation below:
https://twitter.com/davidxu90/status/1712567663401238742
https://twitter.com/davidxu90/status/1712568155959362014
https://twitter.com/SharmakeFarah14/status/1712573782773108737
https://twitter.com/davidxu90/status/1712575172124033352
This reminds me: I should address that other conversation I had with David Xu about how strong a prior we need to encode to ensure alignment, versus how much we can let the system learn and still end up with a good outcome, or alternatively how much we need to specify upfront. And that leads to...
I expect reasonably weak priors to work well to align AI with human values, and that a lot of the complexity can be offloaded to the learning process
Equivalently speaking, I expect the cost of specification of values to be relatively low, and that a lot of the complexity is offloadable to the learning process.
This was another crux between David Xu and me, specifically on the question of whether you can largely get away with weak priors, or whether you actually need to encode a much stronger prior to prevent misalignment. It ultimately boiled down to my expectation that reasonably weak priors, guided by the innate reward system, are enough.
A big part of my reasoning here is that a lot of values and biases are inaccessible to the genome, which means the genome can't directly specify them. It can shape them by setting up training algorithms and data, but it turns out to be very difficult to directly specify things like values in the genome, primarily because the genome does not have direct access to the world model or the brain, which would be required to hardcode the prior. To the extent that it can encode anything, it has to be over relatively simple properties, which means alignment has to be achieved with relatively weak encoded priors, and the innate reward system generally does this fantastically, with examples of misalignment being rare and mild.
The fact that humans reliably end up with values like empathy for friends and acquaintances, parental instincts, and wanting revenge when others harm us, without the genome hardcoding a lot of prior information, i.e. getting away with reasonably weak priors, is rather underappreciated. It means that we don't need to specify our values in much detail, and thus we can reliably offload most of the value-learning work to the AI.
Here are some posts and comments below:
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome#iGdPrrETAHsbFYvQe
(I want to point out that it's not just that, with weak prior information, the genome can reliably bind humans to real-enough things such that, for example, they don't die of thirst by drinking fake water; it can also create the innate reward system, which uses simple update rules to reliably get nearly every person on earth to have empathy for their family and ingroup, to want revenge when others harm them, etc. The exceptions to this pattern are rare and usually mild alignment failures at best. That's a source of a lot of my optimism on AI safety and alignment.)
https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome#8Fry62GiBnRYPnpNn
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome#dRXCwRBkGxKTuq2Cc
Here is the compressed conversation between David Xu and me:
https://twitter.com/davidxu90/status/1713102210354294936
https://twitter.com/davidxu90/status/1713230086730862731
https://twitter.com/SharmakeFarah14/status/1713232260827095119
https://twitter.com/davidxu90/status/1713232760637358547
I'm going to reply in this post and say that the orthogonality thesis is a lot like the no-free-lunch theorem: an extraordinarily powerful result that is too general to apply, because it only covers the space of all logically possible AIs, and it only bites if you apply zero prior, which in this case would require you to specify everything, including the values of the system, or at best use things like brute-force search or memorization algorithms.
I have a very similar attitude to "most goals in goal space are bad." I'd probably agree in the most general sense, but even weak priors can prevent most of the goals actually learned from being bad, so I suspect the claim only really bites under a zero-prior condition. I'm not arguing that with zero prior, models are aligned with people without specifying everything; I'm arguing that we can get away with reasonably weak priors and let within-lifetime learning do the rest.
Once you introduce even weak priors into the situation, the issue is basically resolved; I stated that weak priors work to induce learning of values, and it's consistent with the orthogonality thesis for only arbitrarily tiny amounts of prior information to be necessary to learn alignment.
I could make an analogous argument for capabilities, and I'd be demonstrably wrong, since the conclusion doesn't hold.
This is why I hate the orthogonality thesis, despite rationalists being right on it: It allows for too many outcomes, and any inference like values aren't learned can't be supported based on the orthogonality thesis.
https://twitter.com/SharmakeFarah14/status/1713234214391255277
https://twitter.com/davidxu90/status/1713234707272626653
https://twitter.com/SharmakeFarah14/status/1713236849873891699
https://twitter.com/davidxu90/status/1713237355501584857
https://twitter.com/davidxu90/status/1713238995893912060
## My own algorithm for how to do AI alignment
This is a subpoint, but for those who want a ready-to-go alignment plan, here it is (a rough code sketch follows the list):
1. Implement a weak prior over goal space.
2. Use DPO, RLHF, or something else to create a preference model.
3. Create a custom loss function for the preference model.
4. Use the backpropagation algorithm to optimize it and achieve a low loss.
5. Repeat the backpropagation algorithm until you achieve an acceptable solution.
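Here's a minimal, hypothetical sketch of steps 2-5 (a toy pairwise preference model trained with a Bradley-Terry-style loss via backprop, in the spirit of RLHF reward modeling; the features, sizes, and data are placeholders rather than a real system, and step 1's weak prior shows up only implicitly in the architecture and initialization):

```python
# Toy sketch of steps 2-5: fit a preference/reward model on pairwise
# comparisons with a custom Bradley-Terry-style loss, optimized by backprop.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for "features of a response"; a real setup would embed text.
N_PAIRS, FEAT_DIM = 256, 32
chosen = torch.randn(N_PAIRS, FEAT_DIM)
rejected = torch.randn(N_PAIRS, FEAT_DIM) - 0.5   # slightly "worse" on average

# Step 2: a preference model mapping a response's features to a scalar score.
reward_model = nn.Sequential(nn.Linear(FEAT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Step 3: custom loss, -log sigmoid(score_chosen - score_rejected).
def preference_loss(chosen_scores, rejected_scores):
    return -nn.functional.logsigmoid(chosen_scores - rejected_scores).mean()

# Steps 4-5: backprop, repeated until the loss is acceptably low.
for step in range(1000):
    loss = preference_loss(reward_model(chosen), reward_model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final preference loss:", loss.item())
```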
Now that I'm basically finished laying out the arguments and the conversations, let's move on to the conclusion:
Conclusion
My optimism on AI safety stems from a variety of sources. The reasons, in the order of the post rather than in order of importance, are:
I don't believe the sharp left turn is anywhere near as general as Nate Soares makes it out to be, because the conditions that caused a sharp left turn in humans were basically that cultural learning let humans optimize over much faster time-scales than evolution could respond to, that evolution didn't course-correct us, and that culture could transmit OOMs more information through the generations than evolution could. None of these conditions hold for modern AI development.
I don't believe that Nate's example of misgeneralizing the goal of IGF actually works as an example of misgeneralization that matters for our purposes, because it is not a case where one AI is trained for a goal in environment A and then, in environment B, stops pursuing that goal and instead competently pursues a different goal.
Instead, what's happening is that one human generation (or one human) is trained in environment A, and then a fresh generation of humans is trained on a different distribution, which will predictably diverge more than the first case.
In particular, there's no reason to be concerned about the alignment of AI misgeneralizing, since we have no reason to assume that LessWrong's central example is actually misgeneralization. From Quintin:
AIs are mostly white boxes, at the very least, and the control over AI that we have means that the better analogy is to our innate reward systems, which align us to quite a lot of goals spectacularly well; so well that the total evidence could easily put the probability of X-risk, or even of an AI killing a single human, 5-15+ OOMs lower, which would make the alignment problem a non-problem for our purposes. It would pretty much single-handedly make AI misuse the biggest problem, but that issue has different solutions, and governments are likely to regulate AI misuse anyway, so existential risk gets cut by 10-99% or more.
I believe the security mindset is inappropriate for AI because aligning AI mostly doesn't involve dealing with adversarial intelligences or inputs; the most natural adversarial class, inner-misaligned mesa-optimizers/optimization daemons, mostly doesn't arise, for my next reason. Alignment is also in a different epistemic state from computer security, and there are other disanalogies that make porting intuitions from other fields into ML/AI research very difficult to do correctly.
It is actually really difficult to inner-misalign the AI, since SGD is really good at credit assignment and optimizes the entire causal graph leading to the loss, leaving no slack. It's not like evolution, which needs the kind of backstop Gwern describes in his post here:
https://gwern.net/backstop#rl
The way SGD solves this problem is by running backprop, which is a white-box algorithm, and Nora Belrose explains it more here:
https://forum.effectivealtruism.org/s/vw6tX5SyvTwMeSxJk/p/JYEAL8g7ArqGoTaX6#Status_quo_AI_alignment_methods_are_white_box
And that's the base optimizer, not the mesa-optimizer, which is why SGD has a chance to correct an inner-misaligned agent far more effectively than cultural/biological evolution, the free market, etc. It is white-box, like the inner optimizers it runs, and solves credit assignment far better than previous optimizers like cultural/biological evolution or the free market could hope to do.
So, now that I have listed the reasons why I'm optimistic on AI safety, I'll add one new mini-section to show that the shutdown problem for AI is almost solved.
Addendum 1: The shutdown problem for AI is almost solved
It turns out that we can keep the most useful aspects of Expected Utility Maximization while making an AI shutdownable.
Sami Petersen showed that we can give AIs incomplete preferences while weakening transitivity just enough to get a non-trivial theory of Expected Utility Maximization that's quite a lot safer. Elliott Thornley proposed that incomplete preferences could be used to solve the shutdown problem, and the very nice thing about subagent models of Expected Utility Maximization is that they require a unanimous committee in order for a decision to be accepted as a sure gain.
This is useful, but it can also lead to problems. On the one hand, we only need one expected utility maximizer that wants the AI to be shutdownable in order for us to shut it down as a whole; on the other hand, we would need to be somewhat careful about where its execution conditions/domain lie, as unanimous committees can be terrible: only one agent needs to object to grind the entire system to a halt, which is why, in the real world, unanimity is usually not a preferred way to govern something.
Nevertheless, for AI safety purposes, this is still very, very useful, and if it can be extended to broader conditions than the ones outlined in the posts below, it might be the single biggest MIRI success of the last 15 years, which is ridiculously good.
https://www.lesswrong.com/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1
http://pf-user-files-01.s3.amazonaws.com/u-242443/uploads/2023-05-02/m343uwh/The Shutdown Problem- Two Theorems%2C Incomplete Preferences as a Solution.pdf
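As a toy illustration of the unanimity point (my own simplified sketch, not the actual formalism in Petersen's or Thornley's work): if we model incomplete preferences as a committee of subagents and only adopt a plan over the status quo when every subagent weakly prefers it, then a subagent that cares about remaining shutdownable can veto plans that trade shutdownability away, while ordinary useful plans still go through:

```python
# Toy "unanimous committee" decision rule over subagent utilities. This is a
# simplified illustration of the idea, not the formalism from the linked work.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    utilities: dict  # subagent name -> that subagent's utility for this plan

def committee_adopts(new: Plan, status_quo: Plan) -> bool:
    """Unanimity: adopt `new` only if every subagent weakly prefers it."""
    return all(new.utilities[agent] >= status_quo.utilities[agent]
               for agent in status_quo.utilities)

status_quo = Plan("do nothing",
                  {"task_performance": 0.0, "shutdownability": 1.0})
useful_plan = Plan("do the task, stay shutdownable",
                   {"task_performance": 5.0, "shutdownability": 1.0})
risky_plan = Plan("do the task, disable the off switch",
                  {"task_performance": 6.0, "shutdownability": 0.0})

print(committee_adopts(useful_plan, status_quo))  # True: everyone weakly gains
print(committee_adopts(risky_plan, status_quo))   # False: vetoed by the
                                                  # shutdownability subagent
```

(One can also see here why unanimity is a double-edged sword, as noted above: a single objecting subagent is enough to block any plan.)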
Edit 3: I've removed addendum 2 as I think it's mostly irrelevant, and Daniel Kokotajlo showed me that Ajeya actually expects things to slow down in the next few years, so the section really didn't make that much sense.