I’m confused about your bit on deception within Tool AIs. I generally think of Tool AIs not as consequentialists, and therefore there is no “long-term utility” to maximize via short-term deception. What’s the mechanism by which you worry about these tools being deceptive to their users?
I'm thinking of the entire human+tool system as a consequentialist, and I'm basically arguing that that system fails in the same ways that "human-in-the-loop oversight" fails.
Editor’s note: I’m experimenting with having a lower quality threshold for just posting things even while I’m still confused and unconfident about my conclusions, but with this disclaimer at the top. Thanks to Kyle and Laria for discussions.
In “Searching for Bayes-Structure”, Eliezer argues that anything which arrives at the truth must be doing something Bayes-like somewhere in there, even if it’s extremely inefficient, accidental, or filtered through some process that looks very different from the outside. If there is no Bayes structure anywhere in the process, it literally cannot arrive at the truth.
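For concreteness, the "Bayes" here is just the usual update rule (this gloss is mine, not a quote from the essay):

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)},$$

i.e. anything that reliably moves toward the truth must, somewhere, be weighting hypotheses by how well they predicted the evidence.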
I think it’s plausible that Bayes is to epistemic rationality as consequentialism is to instrumental rationality. Here, by consequentialism I mean the ability to get to some goal from a broad set of initial states, which is how I understood Eliezer’s use of the term in the dialogues; the system does not have to be an EUM or VNM-rational. Importantly, being good at consequentialism comes with all the usual baggage, like instrumental convergence (and consequently self-preservation, power-seeking, etc.).
In other words, there’s some minimal amount of consequentialism you need to achieve a given goal, and there’s no getting around this lower bound. Of course, some agents do a lot more consequentialism than is necessary, but I interpret some people (e.g. Eliezer) as arguing that even this lower bound is too much consequentialism for, say, performing a pivotal action, because even that much is fundamentally very difficult to do safely. As an analogy: if your goal is to buy an apple for $1, any plan which fails to collect at least $1 simply cannot acquire an apple. There also exist plans which collect $100, which is more dangerous than collecting $1; but if even collecting $1 is too dangerous by default, then lobotomizing a $100-collecting agent so that it only collects $1 is scarcely comforting. The main lesson I draw from this is that if we want to do something consequentialist, then we have to somehow figure out how to align consequentialists, and that making the agent worse at consequentialism isn’t a solution. It also seems productive to figure out how the lower bound on the consequentialism we need (for, e.g., a pivotal action) compares to the upper bound on the consequentialism that is safe.
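To put that last question in rough notation (mine; nothing this clean appears in the dialogues): write $C_{\min}(g)$ for the least consequentialism sufficient to achieve goal $g$, and $C_{\text{safe}}$ for the most consequentialism that can be handled safely. The question is whether

$$C_{\min}(\text{pivotal act}) \leq C_{\text{safe}}$$

holds; the pessimistic position is that it doesn't, and handicapping the agent changes neither side of the inequality.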
As an aside, my intuition for why instrumental convergence is unavoidable for consequentialists: there is some set of plans that achieve your goal with high probability across all universes, and a slightly larger superset that achieves your goal with high probability across the universes in which you don't try to stop the system. Among those plans, under some complexity prior or something, the ones that have the safety properties we want (they actually stop when you try to stop them) occupy a very small measure, because this set is very unnatural. Think of Nate's laser analogy: the set of plans which accomplish the goal in a way that is robust to many perturbations, yet are not robust against the perturbations that correspond to us trying to stop the system, is difficult to specify. So pulling out the ones we want requires a bunch of work, which means that by default most plans that achieve our goal will be power-seeking and not corrigible.
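Here's a toy numerical gesture at the measure argument (my own construction, with arbitrary numbers; nothing below is load-bearing). Treat a plan as the set of perturbations it is robust to, and count how many goal-achieving plans happen to be non-robust to precisely the "humans try to shut it down" perturbation:

```python
import random

random.seed(0)

N_PERTURBATIONS = 20   # kinds of interference a plan might need to survive
SHUTDOWN = 0           # index of the "humans try to stop the system" perturbation
P_ROBUST = 0.9         # chance a sampled plan is robust to any given perturbation
N_PLANS = 100_000

def sample_plan():
    """A 'plan' here is just the set of perturbations it is robust to."""
    return {i for i in range(N_PERTURBATIONS) if random.random() < P_ROBUST}

goal_achieving = 0
also_corrigible = 0
for _ in range(N_PLANS):
    plan = sample_plan()
    # "Achieves the goal robustly" ~ survives nearly every perturbation.
    if len(plan) >= N_PERTURBATIONS - 2:
        goal_achieving += 1
        # "Corrigible" ~ robust to everything else, yet NOT robust to shutdown.
        if SHUTDOWN not in plan:
            also_corrigible += 1

print(f"robustly goal-achieving plans: {goal_achieving}")
print(f"...of which also corrigible:   {also_corrigible}")
# Under these arbitrary numbers, only a small fraction of the robust,
# goal-achieving plans happen to leave the off switch working.
```

The point isn't the specific numbers; it's that "robust to everything except the one perturbation we want it to fail against" is an unnatural slice of plan space, so it doesn't come for free.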
Actually, it gets even worse, because there are also deceptive plans that look corrigible but actually aren't, and it's really hard to separate these from the actually corrigible ones. So even getting really good at picking out the corrigibility cluster of things isn't good enough, because there's a big incentive for deceptive plans to look corrigible. (I think this vaguely corresponds to outer vs. inner alignment, but I'm not 100% sure.) There may also be self-modification considerations here: non-EUMs and other lobotomized AIs might self-modify into EUMs, because if you lobotomize the AI too much, most plans conditioned on success start looking like "build a non-lobotomized AI somehow and then tell that AI to accomplish the thing".
Tool AI
I think this applies to tool AI arguments as well. Ultimately, the reason we want tools is that they help us accomplish some goal (for instance, performing a pivotal action). So we can view the entire human+tool system as an agent that is trying to get to some goal, with some human oversight to ensure that the plans pass muster. This system, to get to its goal, will have to do consequentialism. While having the human provide the glue between the model and the real world does add some safety, I would argue that this is approximately as safe as humans vetting plans from an agent, which is itself not really sufficient, because many plans which seem non-suspicious to us are actually dangerous in some unforeseen way. (Diving into whether this describes most plans or vanishingly few plans seems like a useful avenue: on the one hand, one would think that making a plan sound useful while actually doing something bad requires a lot of optimization; on the other hand, "I can't see anything going wrong with this" is a hugely lower bar than "nothing will go wrong with this", so most plans that satisfy the former will not satisfy the latter.)
The human providing the "outer loop", so to speak, does give the human a bit more control, in that they have a say over the particular reasoning processes used to arrive at the action, but I don't think this is a very large effect. For one, if the tool is deceptive, it can still manipulate the human by giving them carefully crafted answers to their queries; in the deceptive case this setup gives almost no protection. Even without deception, if we're in a universe where "I can't see anything going wrong with this" is much looser than "nothing goes wrong" (which, for what it's worth, I think is highly likely to be true), you end up with the exact same problem again. Because there's some minimal amount of consequentialism you need no matter what to accomplish your goal, you also need to do some minimal amount of optimizing against your ability to see things wrong with plans, and in the universe where most ok-looking plans suck, this optimization quickly Goodharts your ability to tell whether a plan is good.
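To make the "looser bar" point concrete, here's another toy model (again my own, with made-up numbers): each plan has a bunch of potential failure modes, a plan is genuinely safe only if none of them are present, but the human reviewer only manages to inspect a handful of them:

```python
import random

random.seed(0)

K = 30           # potential failure modes per plan
P_PRESENT = 0.1  # chance a given failure mode is present in a plan
CHECKED = 5      # how many failure modes the human reviewer actually inspects
N_PLANS = 50_000

def sample_plan():
    """A plan is a list of booleans: is failure mode i present?"""
    return [random.random() < P_PRESENT for _ in range(K)]

plans = [sample_plan() for _ in range(N_PLANS)]

# "I can't see anything going wrong with this": the inspected modes look clear.
looks_fine = [p for p in plans if not any(p[:CHECKED])]
# "Nothing will go wrong with this": every failure mode is absent.
actually_fine = [p for p in looks_fine if not any(p)]

print(f"plans passing human review: {len(looks_fine)}")
print(f"...of which are truly safe: {len(actually_fine)}")
# Most plans that pass review still carry an uninspected failure mode, so
# selecting hard for "passes review" gives little assurance that nothing
# will actually go wrong.
```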
This suggests that the human's ability to pick good plans with extremely high precision (i.e. definitely not accepting any dangerous plans, though possibly rejecting some safe ones) is a core difficulty. This particular subproblem has also popped up in discussions about other parts of alignment, so it seems pretty important.
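In classification terms (my framing, not terminology from the post), the requirement is roughly

$$\Pr(\text{dangerous} \mid \text{accepted}) \approx 0,$$

i.e. essentially no false positives, even at the cost of very low recall (rejecting lots of plans that were actually fine).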
(Also, tools are economically uncompetitive anyway.)