(edit: discussions in the comments section have led me to realize there have been several conversations on LessWrong related to this topic that I did not mention in my original question post.
Since ensuring their visibility is important, I am listing them here: Rohin Shah has explained how consequentialist agents optimizing for universe-histories rather than world-states can display any external behavior whatsoever; Steven Byrnes has explored corrigibility in the framework of consequentialism by arguing powerful agents will optimize for future world-states at least to some extent; Said Achmiz has explained what incomplete preferences look like (1, 2, 3); EJT has formally defined preferential gaps and argued incomplete preferences can be an alignment strategy; John Wentworth has analyzed incomplete preferences through the lens of subagents but has then argued that incomplete preferences imply the existence of dominated strategies; and Sami Petersen has argued Wentworth was wrong by showing how incomplete preferences need not be vulnerable.)
In his first discussion with Richard Ngo during the 2021 MIRI Conversations, Eliezer retrospected and lamented:
In the end, a lot of what people got out of all that writing I did, was not the deep object-level principles I was trying to point to - they did not really get Bayesianism as thermodynamics, say, they did not become able to see Bayesian structures any time somebody sees a thing and changes their belief. What they got instead was something much more meta and general, a vague spirit of how to reason and argue, because that was what they'd spent a lot of time being exposed to over and over and over again in lots of blog posts.
Maybe there's no way to make somebody understand why corrigibility is "unnatural" except to repeatedly walk them through the task of trying to invent an agent structure that lets you press the shutdown button (without it trying to force you to press the shutdown button), and showing them how each of their attempts fails; and then also walking them through why Stuart Russell's attempt at moral uncertainty produces the problem of fully updated (non-)deference; and hope they can start to see the informal general pattern of why corrigibility is in general contrary to the structure of things that are good at optimization.
Except that to do the exercises at all, you need them to work within an expected utility framework. And then they just go, "Oh, well, I'll just build an agent that's good at optimizing things but doesn't use these explicit expected utilities that are the source of the problem!"
And then if I want them to believe the same things I do, for the same reasons I do, I would have to teach them why certain structures of cognition are the parts of the agent that are good at stuff and do the work, rather than them being this particular formal thing that they learned for manipulating meaningless numbers as opposed to real-world apples.
And I have tried to write that page once or twice (eg "coherent decisions imply consistent utilities") but it has not sufficed to teach them, because they did not even do as many homework problems as I did, let alone the greater number they'd have to do because this is in fact a place where I have a particular talent.
Eliezer is essentially claiming that, just as his pessimism compared to other AI safety researchers is due to him having engaged with the relevant concepts at a concrete level ("So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am. This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all"), his experience with and analysis of powerful optimization allows him to be confident in what the cognition of a powerful AI would be like. In this view, Vingean uncertainty prevents us from knowing what specific actions the superintelligence would take, but effective cognition runs on Laws that can nonetheless be understood and which allow us to grasp the general patterns (such as Instrumental Convergence) of even an "alien mind" that's sufficiently powerful. In particular, any (or virtually any) sufficiently advanced AI must be a consequentialist optimizer that is an agent as opposed to a tool and which acts to maximize expected utility according to its world model to pursue a goal that can be extremely different from what humans deem good.
When Eliezer says "they did not even do as many homework problems as I did," I doubt he is referring to actual undergrad-style homework problems written nicely in LaTeX. Nevertheless, I would like to know whether there is some sort of publicly available repository of problem sets that illustrate the principles he is talking about. By this I mean set-ups in which an agent (of sorts) acts in a manner that is either not utility-maximizing or simply not consequentialist, followed by explanations of how one can exploit that agent. Given the centrality of consequentialism (and the associated money-pump and Dutch book-type arguments) to his thinking about advanced cognition and powerful AI, it would be nice to be able to verify whether working on these "homework problems" indeed results in the general takeaway Eliezer is trying to communicate.
I am particularly interested in this question in light of EJT's thorough and thought-provoking post on how "There are no coherence theorems". The upshot of that post can be summarized as saying that "there are no theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy" and that "nevertheless, many important and influential people in the AI safety community have mistakenly and repeatedly promoted the idea that there are such theorems."
I was not a member of this site at the time EJT made his post, but given the large number of upvotes and comments on his post (123 and 116, respectively, at this time), it appears likely that it was rather popular and people here paid some attention to it. In light of that, I must confess to finding the general community reaction to his post rather baffling. Oliver Habryka wrote in response:
The post does actually seem wrong though.
I expect someone to write a comment with the details at some point (I am pretty busy right now, so can only give a quick meta-level gleam), but mostly, I feel like in order to argue that something is wrong with these arguments is that you have to argue more compellingly against completeness and possible alternative ways to establish dutch-book arguments.
However, the "details", as far as I can tell, have never been written up. Valdes wrote one other post on this topic, noting that "I have searched for a result in the literature that would settle the question and so far I have found none" and explicitly calling for the community's participation, but constructive engagement was minimal. John Wentworth, for his part, wrote a nice short explanation of what coherence looks like in a toy setting involving cache corruption and a simple optimization problem; this was interesting but not quite on point to what EJT talked about. But that was it; I could not find any other posts (written after EJT's) that were even tangentially connected to these ideas. Eliezer's own response was dismissive and entirely inadequate, not really contending with any of the arguments in the original post:
Eliezer: The author doesn't seem to realize that there's a difference between representation theorems and coherence theorems.
Cool, I'll complete it for you then.
Transitivity: Suppose you prefer A to B, B to C, and C to A. I'll keep having you pay a penny to trade between them in a cycle. You start with C, end with C, and are three pennies poorer. You'd be richer if you didn't do that.
Completeness: Any time you have no comparability between two goods, I'll swap them in whatever direction is most useful for completing money-pump cycles. Since you've got no preference one way or the other, I don't expect you'll be objecting, right?
Combined with the standard Complete Class Theorem, this now produces the existence of at least one coherence theorem. The post's thesis, "There are no coherence theorems", is therefore falsified by presentation of a counterexample. Have a nice day!
In the limit, you take a rock, and say, "See, the complete class theorem doesn't apply to it, because it doesn't have any preferences ordered about anything!" What about your argument is any different from this - where is there a powerful, future-steering thing that isn't viewable as Bayesian and also isn't dominated?
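For readers who want the transitivity argument in the quote above as something closer to a "homework problem", here is a minimal sketch of the penny-per-trade cycle. The names `CYCLIC_PREFS` and `pump` are my own, and the agent model (accept any offer it strictly prefers to its current holding, paying the fee each time) is an assumption made for illustration, not something specified in the quote.

```python
# Cyclic strict preferences from the quote: A over B, B over C, C over A.
# Each pair is (preferred, dispreferred). Hypothetical toy model.
CYCLIC_PREFS = {("A", "B"), ("B", "C"), ("C", "A")}


def pump(start, offers, fee_cents=1):
    """Offer a sequence of swaps to an agent with CYCLIC_PREFS.

    The agent pays `fee_cents` for every swap to an option it strictly
    prefers to what it currently holds, and declines all other offers.
    Returns the final holding and the total fees paid.
    """
    holding, paid = start, 0
    for offer in offers:
        if (offer, holding) in CYCLIC_PREFS:
            holding = offer
            paid += fee_cents
    return holding, paid


final_holding, total_paid = pump("C", ["B", "A", "C"])
print(final_holding, total_paid)
```

Starting from C and being offered B, then A, then C, the agent accepts every trade and ends up holding C again, three cents poorer, which is the outcome the quote describes.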
As EJT explained in detail,
EJT: These arguments don't work. [...] As I note in the post, agents can make themselves immune to all possible money-pumps for completeness by acting in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ Acting in accordance with this policy need never require an agent to act against any of their preferences. [...]
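To see what the policy EJT describes does in the simplest case, here is a rough sketch of a single-souring money pump against an agent with a preferential gap. Everything in it is an illustrative assumption of mine: the options `A`, `A-` (A minus a penny) and `B`, the function names, and the modelling of the "naive" agent as one that accepts any swap it does not strictly disprefer, which is how I read the completeness step in Eliezer's comment.

```python
# Strict preferences as (better, worse) pairs. "B" is incomparable to both
# "A" and "A-": neither direction is listed, so there is a preferential gap.
STRICT = {("A", "A-")}


def strictly_prefers(x, y):
    return (x, y) in STRICT


def run_trades(start, offers, use_policy):
    """Present swap offers in order and return the agent's final holding.

    Without the policy, the agent accepts any swap it does not strictly
    disprefer. With the policy, it also refuses any option that is strictly
    dispreferred to something it previously turned down.
    """
    holding = start
    turned_down = set()
    for offer in offers:
        refuses = strictly_prefers(holding, offer)
        if use_policy and any(strictly_prefers(x, offer) for x in turned_down):
            refuses = True
        if refuses:
            turned_down.add(offer)
        else:
            turned_down.add(holding)  # giving up the holding turns it down
            holding = offer
    return holding


print(run_trades("A", ["B", "A-"], use_policy=False))
print(run_trades("A", ["B", "A-"], use_policy=True))
```

In this toy case the naive agent is walked from A to B to A-, ending with something it strictly disprefers to where it started, while the agent following the policy stops at B. This is only one scenario, so it illustrates the policy rather than establishing that it works in general.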
This whole situation appears very strange to me, as an outsider; isn't this topic important enough to merit an analysis that gets us beyond saying (in Habryka's words) "it does seem wrong" to "it's actually wrong, here's the math that proves it"? I tried quite hard to find one, and was not able to. Given that coherence arguments are still crucial argumentative building blocks of the case made by users here that AI risk should be taken seriously (and that the general format of these arguments has remained unchanged), it leaves me with the rather uncanny impression that EJT's post was seen by the community, acknowledged as important, yet never truly engaged with, and essentially... forgotten, or maybe ignored? It doesn't seem like it has changed anyone's behavior or arguments despite no refutation of it having appeared. Am I missing something important here?
I remember reading the EJT post and leaving some comments there. The basic conclusions I arrived at are:
In my current thinking about non-coherent agents, the main toy example I like to think about is the agent that maximizes some combination of the entropy of its actions and their expected utility. That is, the probability of taking an action a is proportional to exp(βE[U|a]), with a normalization factor making the probabilities sum to 1. By tuning β we can affect whether the agent cares more about entropy or utility. This closely resembles RLHF-finetuned language models: they are trained both to achieve a high rating and to not have too great an entropy with respect to the prior implied by pretraining.
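As a concrete illustration of this toy agent, here is a minimal sketch of the exp(βE[U|a]) policy over a finite action set. The function names and the example utilities are assumptions of mine. For β > 0 this distribution is the one that maximizes expected utility plus (1/β) times the entropy of the action distribution, which is one way to make "some combination" precise, though not necessarily the only one intended.

```python
import math


def boltzmann_policy(expected_utils, beta):
    """Return action probabilities proportional to exp(beta * E[U | a])."""
    scaled = [beta * u for u in expected_utils]
    shift = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


utils = [1.0, 2.0, 3.0]  # hypothetical E[U | a] for three actions
for beta in (0.0, 1.0, 50.0):
    probs = boltzmann_policy(utils, beta)
    print(beta, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

At β = 0 the agent acts uniformly at random (maximum entropy), and as β grows the probability mass concentrates on the highest-utility action, so it behaves more and more like an expected utility maximizer.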
Note that if the distribution of utility under the prior is heavy-tailed, you can get infinite utility even with arbitrarily low relative entropy, so the optimal policy is undefined. In the case of goal misspecification, optimization with a KL penalty may be unsafe or get no better utility than the prior.