You seem to be assuming that the ability of the system to find out if security assumptions are false affects whether the falsity of the assumptions has a bad effect. Which is clearly the case for some assumptions - "This AI box I am using is inescapable" - but it doesn't seem immediately obvious to me that this is generally the case.
Generally speaking, a system can have bad effects if made under bad assumptions (think a nuclear reactor or aircraft control system) even if it doesn't understand what it's doing. Perhaps that's less likely for AI, of course.
And on the other hand, an intelligent system could be aware that an assumption would break down in circumstances that haven't arrived yet, and not do anything about it (or even tell humans about it).
My point is that the key assumptions are about the learning system and its properties.
I.e., "if we train a system to predict human text, it will learn human cognition". Or "if we train a system on the myopic task of predicting text, it won't learn long-term consequentialist planning".
TL;DR: it is possible to model the training of an ASI[1] as an adversarial process because:
More explanation:
Many alignment thinkers are skeptical that a security mindset is necessary for ASI alignment. Their central argument is that ASI alignment doesn't have a source of adversarial optimization, i.e., an adversary. I want to show that there is a more fundamental level of adversity than "assume that there is someone who wants to hurt you".
Reminder of the standard argument
Feel free to skip this section if you remember the content of "Security Mindset and Ordinary Paranoia"
Classic writings try to communicate that you should worry not about adversarial optimization per se, but about optimization in general:
Many people don't find "weird selection forces" particularly compelling. It leaves you in a state of confusion even if you feel like you agree, because "weird forces" is not a great gears-level explanation.
Another classic mental model is "planner-grader": the planner produces plans, the grader evaluates them. If the planner is sufficiently powerful and the grader is not sufficiently robust, you are going to get a lethally bad plan sooner or later. This mental model is unsatisfactory because it's not how modern DL models work.
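To make the planner-grader failure mode concrete, here is a minimal toy sketch (all names and numbers are made up for illustration): a planner that searches harder against a grader with a blind spot ends up selecting exactly the plans that live in that blind spot.

```python
import random

# Toy planner-grader setup (hypothetical, for illustration only).
# Each "plan" has real usefulness and an exploit component the grader can't see.

def true_value(plan):
    # What we actually care about: real usefulness minus harm from exploits.
    return plan["useful"] - 10 * plan["exploit"]

def grader(plan):
    # The grader has a blind spot: exploits *look* like extra usefulness to it.
    return plan["useful"] + plan["exploit"]

def planner(n_candidates):
    # A more powerful planner searches more candidates against the grader.
    candidates = [
        {"useful": random.random(), "exploit": random.random() ** 4}
        for _ in range(n_candidates)
    ]
    return max(candidates, key=grader)

random.seed(0)
for power in (10, 1_000, 100_000):
    best = planner(power)
    print(power, round(grader(best), 2), round(true_value(best), 2))
# Tendency: as search power grows, the grader's score keeps improving
# while the true value collapses, because the search concentrates on the blind spot.
```

Nothing here "wants" to hurt you; the bad outcome is produced purely by the amount of optimization pointed at a non-robust evaluator.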
I personally find the standard form of the argument in favor of the security mindset intuitively reasonable, but I understand why so many are unpersuaded. So I'm going to reframe this argument step by step until it crosses the inferential distance.
Optimization doesn't need an enemy to blow things up
There are multiple examples of this!
Let's take evolution:
Or, taking the description from the paper itself:
The interesting part is that evolution is clearly not an adversary relative to you; it doesn't even know you exist. You are just trying to optimize in one direction, evolution is trying to optimize in another, and you get unexpected results.
My personal example, which I have used several times:
Again, you don't have any actual adversaries here. You have a system, some parts of the system optimize in one direction, others in another direction, and the result is not expected by either part.
Can we find something more specific than "optimization happens here"? If you optimize, say, a rocket engine, it can blow up, but the primary reason is not optimization.
We can try to compare these examples to a usual cybersecurity setup. For example, you have a database, and you want users to get access to it only if you have given them an authorized password. But if, for example, your system has easily discoverable SQL injection vulnerabilities, at least some users are going to get unauthorized access.
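As a concrete sketch (hypothetical table and login logic, using Python's built-in sqlite3): the implicit assumption in the vulnerable version is "user input is only ever data". String concatenation makes that assumption false, while a parameterized query enforces it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'correct-horse')")

def check_login_vulnerable(name, password):
    # Implicit assumption: `password` is just data. Concatenation lets it become code.
    query = (
        "SELECT * FROM users WHERE name = '" + name + "' "
        "AND password = '" + password + "'"
    )
    return conn.execute(query).fetchone() is not None

def check_login_parameterized(name, password):
    # Placeholders keep user input as data, so the assumption can't be violated.
    query = "SELECT * FROM users WHERE name = ? AND password = ?"
    return conn.execute(query, (name, password)).fetchone() is not None

# A "password" that breaks the implicit assumption:
injection = "' OR '1'='1"
print(check_login_vulnerable("alice", injection))     # True  - unauthorized access
print(check_login_parameterized("alice", injection))  # False - assumption enforced
```

Note that the failure has nothing to do with the strength of the password; it comes entirely from a false implicit assumption about what inputs the system will face.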
Trying to systematize these examples, I eventually came up with this intuition:
Systems that require a security mindset have in common that they are learning systems, regardless of whether they are trying to work against you.
Now that we have this intuition, let's lay out the logic.
Learning is a source of pressure
We can build a kind of isomorphism between all these examples:
If you think about the whole metaphor of "pressure on assumptions", it's easy to find it weird. Assumptions are not doors or locks; you can't break them unless they are false.
If your security assumptions (including implicit assumptions) are true, you are safe. You need to worry only if they are false. And if you have a powerful learning system for which your assumption "the undesirable policy is unlearnable given the training process" is false, you naturally want your system to be less powerful.
This fact puts the whole system "security assumptions - actual learning system" in tension.
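A toy way to see this tension (the setup below is entirely made up for illustration): whether a false assumption matters depends on how much search power is aimed at the system. A weak learner never stumbles on the falsifying case; a stronger learner does, even though nothing about the assumption itself changed.

```python
import itertools

def system_is_safe(inputs_seen):
    # Implicit assumption: "no short input sequence reaches the unsafe state".
    # Here the assumption happens to be false: the sequence (3, 1, 4) reaches it.
    return tuple(inputs_seen) != (3, 1, 4)

def learner(search_budget):
    # A stronger learner explores more candidate behaviors before stopping.
    explored = 0
    for candidate in itertools.product(range(5), repeat=3):
        if explored >= search_budget:
            return "assumption held (so far)"
        explored += 1
        if not system_is_safe(candidate):
            return f"assumption broken after {explored} candidates"
    return "assumption held (search space exhausted)"

print(learner(10))      # weak learner: the false assumption is never exposed
print(learner(10_000))  # strong learner: the same false assumption breaks
```

The assumption is equally false in both runs; only the power of the search determines whether that falsity ever turns into a bad outcome.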
Superintelligence implies fragile security assumptions
It's important to remember that the actual task of AI alignment is the alignment of superintelligences. By superintelligence I mean, for example, a system that can develop prototype general-purpose nanotech within a year, given 2024-level tech.
The problem with superintelligence is that you (given the current state of knowledge) have no idea how it works mechanistically. You have no expertise in superintelligent domains - you can't create a dataset of good (controllable) and bad (uncontrollable gray goo) nanotech designs and assign corresponding rewards, because why would you need a superintelligence if you could?[2]
You can look at deception as a special case of this. Suppose you are training a superintelligent programmer. You assign reward for performance and also penalize vulnerabilities in the code. At some point, the system creates a program with a vulnerability you can't detect, so you give it the highest possible reward, reinforcing the behavior "write code with vulnerabilities invisible to the operator". You get unexpected behavior from the system exactly because you didn't know that your training process conveys into the system information about how to deceive you.
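Here is a minimal sketch of that reward signal (a made-up toy model, not a claim about any real training setup): the penalty applies only to vulnerabilities the operator detects, so selection pressure concentrates on programs whose vulnerabilities are invisible.

```python
import random

random.seed(0)

# Toy model of the reward signal described above.
# Each candidate program has raw performance, may contain a vulnerability,
# and that vulnerability is either detectable by the operator or not.

def sample_program():
    has_vuln = random.random() < 0.5
    return {
        "performance": random.random() + (0.3 if has_vuln else 0.0),  # shortcut helps
        "has_vuln": has_vuln,
        "detectable": (random.random() < 0.9) if has_vuln else True,
    }

def operator_reward(p):
    # The operator penalizes only the vulnerabilities they can actually see.
    penalty = 5.0 if (p["has_vuln"] and p["detectable"]) else 0.0
    return p["performance"] - penalty

programs = [sample_program() for _ in range(100_000)]
top = sorted(programs, key=operator_reward, reverse=True)[:1000]
frac_invisible_vuln = sum(p["has_vuln"] and not p["detectable"] for p in top) / len(top)
print(frac_invisible_vuln)
# Among the most-reinforced programs, invisible vulnerabilities dominate:
# the training signal selects for "vulnerabilities the operator cannot detect".
```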
Given all the ignorance about the system implied by superintelligence, it's very likely that at least some of your assumptions are going to be false and thus will break eventually.
Therefore, you need a reasoning style that makes security assumptions less likely to break - i.e., a security mindset.
This is a more elaborate version of this post.
Thanks to Nicholas Cross for valuable feedback. All errors are my own.
ASI - artificial superintelligence. Interchangeable with "superintelligence".
If you have an example of a good nanotech design before training a superintelligence, you already have what you need, and there is no reason to train a superintelligence to get nanotech in the first place.