(edit: discussions in the comments section have led me to realize there have been several conversations on LessWrong related to this topic that I did not mention in my original question post. Since ensuring their visibility is important, I am listing them here:

- Rohin Shah has explained how consequentialist agents optimizing over universe-histories rather than world-states can display any external behavior whatsoever;
- Steven Byrnes has explored corrigibility within the framework of consequentialism, arguing that powerful agents will optimize for future world-states at least to some extent;
- Said Achmiz has explained what incomplete preferences look like (1, 2, 3);
- EJT has formally defined preferential gaps and argued that incomplete preferences can serve as an alignment strategy;
- John Wentworth has analyzed incomplete preferences through the lens of subagents, but has since argued that incomplete preferences imply the existence of dominated strategies;
- and Sami Petersen has argued that Wentworth was wrong, by showing how incomplete preferences need not be vulnerable.)
In his first discussion with Richard Ngo during the 2021 MIRI Conversations, Eliezer retrospected and lamented:
In the end, a lot of what people got out of all that writing I did, was not the deep object-level principles I was trying to point to - they did not really get Bayesianism as thermodynamics, say, they did not become able to see Bayesian structures any time somebody sees a thing and changes their belief. What they got instead was something much more meta and general, a vague spirit of how to reason and argue, because that was what they'd spent a lot of time being exposed to over and over and over again in lots of blog posts.
Maybe there's no way to make somebody understand why corrigibility is "unnatural" except to repeatedly walk them through the task of trying to invent an agent structure that lets you press the shutdown button (without it trying to force you to press the shutdown button), and showing them how each of their attempts fails; and then also walking them through why Stuart Russell's attempt at moral uncertainty produces the problem of fully updated (non-)deference; and hope they can start to see the informal general pattern of why corrigibility is in general contrary to the structure of things that are good at optimization.
Except that to do the exercises at all, you need them to work within an expected utility framework. And then they just go, "Oh, well, I'll just build an agent that's good at optimizing things but doesn't use these explicit expected utilities that are the source of the problem!"
And then if I want them to believe the same things I do, for the same reasons I do, I would have to teach them why certain structures of cognition are the parts of the agent that are good at stuff and do the work, rather than them being this particular formal thing that they learned for manipulating meaningless numbers as opposed to real-world apples.
And I have tried to write that page once or twice (eg "coherent decisions imply consistent utilities") but it has not sufficed to teach them, because they did not even do as many homework problems as I did, let alone the greater number they'd have to do because this is in fact a place where I have a particular talent.
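To give a concrete (if cartoonish) sense of the shutdown-button exercise Eliezer describes, here is a minimal sketch of the most naive attempt and how it fails. The utility numbers, policy names, and probabilities below are toy values I made up for illustration; this is my own reconstruction of a standard failure mode, not anything taken from Eliezer's exercises.

```python
# Naive "shutdown button" proposal: give the agent one utility for normal
# operation and another for being shut down, and hope it leaves the button alone.

def expected_utility(policy, u_normal=10.0, u_shutdown=1.0, p_press=0.5):
    """Composite utility: u_normal if the agent keeps running, u_shutdown if it
    ends up shut down. p_press is the chance humans press the button if allowed."""
    if policy == "defer_to_humans":   # leave the button alone
        return p_press * u_shutdown + (1 - p_press) * u_normal
    if policy == "disable_button":    # prevent shutdown and keep optimizing
        return u_normal
    if policy == "press_button":      # shut itself down immediately
        return u_shutdown
    raise ValueError(policy)

policies = ["defer_to_humans", "disable_button", "press_button"]
print({p: expected_utility(p) for p in policies})
print("chosen policy:", max(policies, key=expected_utility))
# The agent disables the button whenever u_normal > u_shutdown, and presses it
# itself whenever the inequality is reversed; exact equality is the only
# corrigible point, and it is knife-edge fragile.
```

For almost every choice of numbers, the maximizing policy is one of the two incorrigible ones, which is the flavor of failure the quoted exercise is pointing at.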
Eliezer is essentially claiming that, just as his pessimism compared to other AI safety researchers is due to him having engaged with the relevant concepts at a concrete level ("So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am. This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all"), his experience with and analysis of powerful optimization allow him to be confident in what the cognition of a powerful AI would be like. In this view, Vingean uncertainty prevents us from knowing what specific actions the superintelligence would take, but effective cognition runs on Laws that can nonetheless be understood and which allow us to grasp the general patterns (such as Instrumental Convergence) of even an "alien mind" that's sufficiently powerful. In particular, any (or virtually any) sufficiently advanced AI must be a consequentialist optimizer that is an agent as opposed to a tool and which acts to maximize expected utility according to its world model to pursue a goal that can be extremely different from what humans deem good.
When Eliezer says "they did not even do as many homework problems as I did," I doubt he is referring to actual undergrad-style homework problems written nicely in LaTeX. Nevertheless, I would like to know whether there is some sort of publicly available repository of problem sets that illustrate the principles he is talking about: set-ups where you have an agent (of sorts) acting in a manner that is either not utility-maximizing or simply not consequentialist, followed by explanations of how you can exploit that agent. Given the centrality of consequentialism (and the associated money-pump and Dutch book-type arguments) to his thinking about advanced cognition and powerful AI, it would be nice to be able to verify whether working through these "homework problems" indeed results in the general takeaway Eliezer is trying to communicate.
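To illustrate the flavor of exercise I have in mind, here is a minimal toy example of my own (not drawn from any existing problem set): an agent whose strict preferences over three goods are cyclic, together with the classic money-pump that exploits it.

```python
# Toy money-pump: an agent with cyclic strict preferences (A > B > C > A) will
# pay a small fee for each "upgrade" and end up holding its original good, poorer.
# The goods, the fee, and the preference cycle are all made up for illustration.

CYCLIC_PREFS = {("A", "B"), ("B", "C"), ("C", "A")}   # (x, y): x strictly preferred to y

def accepts_trade(holding, offer):
    """The naive agent accepts any offer it strictly prefers to what it holds."""
    return (offer, holding) in CYCLIC_PREFS

def money_pump(holding="C", fee=0.01):
    wealth = 0.0
    for offer in ["B", "A", "C"]:          # each offer is a strict "upgrade" given the cycle
        if accepts_trade(holding, offer):
            holding, wealth = offer, wealth - fee
    return holding, round(wealth, 2)

print(money_pump())   # ('C', -0.03): same good as at the start, three pennies poorer
```

The exercises I am asking about would presumably generalize this pattern: given some specified departure from utility-maximization, either construct the exploitation scheme or show that none exists.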
I am particularly interested in this question in light of EJT's thorough and thought-provoking post on how "There are no coherence theorems". The upshot of that post can be summarized as saying that "there are no theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy" and that "nevertheless, many important and influential people in the AI safety community have mistakenly and repeatedly promoted the idea that there are such theorems."
I was not a member of this site at the time EJT made his post, but given the large number of upvotes and comments it received (123 and 116, respectively, at the time of writing), it appears likely that it was rather popular and that people here paid some attention to it. In light of that, I must confess to finding the general community reaction to the post rather baffling. Oliver Habryka wrote in response:
The post does actually seem wrong though.
I expect someone to write a comment with the details at some point (I am pretty busy right now, so can only give a quick meta-level gleam), but mostly, I feel like in order to argue that something is wrong with these arguments is that you have to argue more compellingly against completeness and possible alternative ways to establish dutch-book arguments.
However, the "details", as far as I can tell, have never been written up. There was one other post by Valdes on this topic, who noted that "I have searched for a result in the literature that would settle the question and so far I have found none" and explicitly called for the community's participation, but constructive engagement was minimal. John Wentworth, for his part, wrote a nice short explanation of what coherence looks like in a toy setting involving cache corruption and a simple optimization problem; this was interesting but not quite on point to what EJT talked about. But this was it; I could not find any other posts (written after EJT's) that were even tangentially connected to these ideas. Eliezer's own response was dismissive and entirely inadequate, not really contending with any of the arguments in the original post:
Eliezer: The author doesn't seem to realize that there's a difference between representation theorems and coherence theorems.
Cool, I'll complete it for you then.
Transitivity: Suppose you prefer A to B, B to C, and C to A. I'll keep having you pay a penny to trade between them in a cycle. You start with C, end with C, and are three pennies poorer. You'd be richer if you didn't do that.
Completeness: Any time you have no comparability between two goods, I'll swap them in whatever direction is most useful for completing money-pump cycles. Since you've got no preference one way or the other, I don't expect you'll be objecting, right?
Combined with the standard Complete Class Theorem, this now produces the existence of at least one coherence theorem. The post's thesis, "There are no coherence theorems", is therefore falsified by presentation of a counterexample. Have a nice day!
In the limit, you take a rock, and say, "See, the complete class theorem doesn't apply to it, because it doesn't have any preferences ordered about anything!" What about your argument is any different from this - where is there a powerful, future-steering thing that isn't viewable as Bayesian and also isn't dominated?
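For concreteness, here is how the completeness half of that argument plays out, in the same toy style as my sketch above. This is my own rendering of the argument as quoted; the goods and the exploiter's trade sequence are invented for illustration.

```python
# The agent strictly prefers A_plus to A, but has no preference either way
# between B and either of them. The exploiter chooses the swap directions.

STRICT = {("A_plus", "A")}                       # A_plus strictly preferred to A

def no_preference(x, y):
    return (x, y) not in STRICT and (y, x) not in STRICT

def naive_incomplete_agent(holding, offer):
    """Accepts any strict upgrade, and raises no objection to swapping
    between goods it has no preference over."""
    return (offer, holding) in STRICT or no_preference(holding, offer)

holding = "A_plus"
for offer in ["B", "A"]:                         # exploiter's chosen directions
    if naive_incomplete_agent(holding, offer):
        holding = offer
print(holding)   # 'A': strictly worse than the A_plus the agent started with
```

The agent drifts from A_plus down to A through two swaps it "had no objection to", ending up with a dominated outcome. Whether this actually establishes what Eliezer needs it to is precisely what EJT disputes.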
As EJT explained in detail,
EJT: These arguments don't work. [...] As I note in the post, agents can make themselves immune to all possible money-pumps for completeness by acting in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ Acting in accordance with this policy need never require an agent to act against any of their preferences. [...]
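To see concretely how that policy blocks the completeness pump sketched above, here is a minimal continuation of the same toy code. This is my own reconstruction of the policy exactly as quoted, not code from EJT's post.

```python
# Same setup as before, but the agent now remembers every option it has turned
# down and refuses anything it strictly disprefers to one of those options.

STRICT = {("A_plus", "A")}                       # A_plus strictly preferred to A

def strictly_worse(x, y):                        # is x strictly dispreferred to y?
    return (y, x) in STRICT

def policy_agent(holding, offer, turned_down):
    if strictly_worse(offer, holding):
        return False                             # never act against a strict preference
    if any(strictly_worse(offer, x) for x in turned_down):
        return False                             # the quoted rule: never take something
                                                 # strictly worse than a turned-down option
    return True

holding, turned_down = "A_plus", set()
for offer in ["B", "A"]:                         # the exploiter's trade sequence from before
    if policy_agent(holding, offer, turned_down):
        turned_down.add(holding)                 # swapping away from it counts as turning it down
        holding = offer
    else:
        turned_down.add(offer)
print(holding)   # 'B': the final swap down to A is refused
```

The agent ends with B, which it does not strictly disprefer to its starting A_plus, so it never reaches a dominated outcome; that is the sense in which, on EJT's account, incompleteness alone does not hand the exploiter a money-pump.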
This whole situation appears very strange to me, as an outsider; isn't this topic important enough to merit an analysis that gets us beyond saying (in Habryka's words) "it does seem wrong" and instead establishes "it's actually wrong, and here's the math that proves it"? I tried quite hard to find one, and was not able to. Given that coherence arguments remain crucial argumentative building blocks of the case made by users here that AI risk should be taken seriously (and that the general format of these arguments has remained unchanged), I am left with the rather uncanny impression that EJT's post was seen by the community, acknowledged as important, yet never truly engaged with, and essentially... forgotten, or maybe ignored? It doesn't seem to have changed anyone's behavior or arguments, despite no refutation of it having appeared. Am I missing something important here?
Thank you for the link, Steve. I recall having read your post a while back, but for some reason it slipped my mind while I was pondering the original question here. That being said, while your writing is tangentially related, it is also not quite on point to my inquiry and concerns.
Your post is something of a direct response to @Rohin Shah's illustration, in "Coherence arguments do not entail goal-directed behavior" (yet another post I should have linked to in my original post), of the fact that an agent optimizing for the fulfillment of preferences over universe-histories, as opposed to mere future world-states, can display "any behavior whatsoever." More specifically, you explored corrigibility proposals in light of this fact, concluding that "preferences purely over future states are just fundamentally counter to corrigibility." I haven't thought about this topic enough to come to a definite judgment either way, but in any case, this is much more relevant to Eliezer's tangent about corrigibility (in the quote I selected at the top of my post) than to the different object-level concern about whether coherence arguments warrant the degree of certitude Eliezer has about how sufficiently powerful agents will behave in the real world.
Indeed, the section of your post that is by far the most relevant to my interests here is the following (which you wrote as more of an aside):
For completeness, and to save others the effort of opening that link in a new tab, the relevant part of your referenced comment says the following:
Unfortunately, this still doesn't present the level of evidence or reasoning that would persuade me (or even move me significantly) in the direction of believing powerful AI will necessarily (or even likely) optimize (at least in large part) for explicit preferences over future world states. It suffers from the same general problem as previous writings on this topic (including Eliezer's), namely that it communicates strongly-held intuitions about "certain structures of cognition [...] that are good at stuff and do the work" without a solid grounding in either formal mathematical reasoning or explicit real-world empirical data (although I suppose this is better than claiming the math actually proves the intuitions are right, when in fact it doesn't).
We all know intuitions aren't magic, but they are nonetheless useful for thinking about complex topics when they function as gears in an understanding, so I am certainly not claiming reliance on intuitions is bad per se, especially in dialogues about topics like AGI where empirical analyses are inherently tricky. On the contrary, I think I understand the relevant dynamic here quite well: you (just like Eliezer) have spent a ton of time thinking about and working on determining how powerful optimizers reason, and in the process you have gained a lot of (mostly implicit) understanding of this. Analogously to how nobody starts off with great intuition about chess but can develop it tremendously over time by playing games, working through analyses, and doing puzzles, you have trained your intuition about consequentialist reasoning by working hard on the alignment problem, and should thus be more attuned to the ground-level territory than someone who hasn't done the (in Eliezer-speak) "homework problems". I am nonetheless still left with the (mostly self-imposed) task of figuring out whether those intuitions are correct.
One way of doing that would be to obtain irrefutable mathematical proof that Expected Utility maximizers come about when we optimize hard enough for the intelligence of an AI system, or at the very least that such entities would necessarily be exploitable if they don't self-modify into EU maximizers. Indeed, this is the very reason I made this question post, but it seems like this type of proof isn't actually available, given that none of the answers or comments thus far have patched the holes in Eliezer's arguments or explained how EJT might have been wrong. Another way of doing it would be to defer to the conclusions that smart people with experience in this area, like you or Eliezer, have reached; however, this also doesn't work because it suffers from a few major issues:
Of course, the one other way out of this conundrum is for me to follow the standard "think for yourself and reach your own conclusions" advice that's usually given. Unfortunately, that also can't work if "think for yourself" means "hypothesize really hard without any actual experimentation, HPJEV-style". As it turns out, while I am not an alignment researcher, I think of myself as a reasonably smart guy who understands Eliezer's perspective ("In particular, any (or virtually any) sufficiently advanced AI must be a consequentialist optimizer that is an agent as opposed to a tool and which acts to maximize expected utility according to its world model to pursue a goal that can be extremely different from what humans deem good") and who has spent a fair bit of time pondering these matters, and even after all that I don't see why Eliezer's perspective is even likely to be true (let alone why it merits the level of confidence he apparently has in these conclusions).
Indeed, in order for me (and for other people like me, who I imagine exist out here) to proceed on this matter, I would need something with more feedback loops, something that allows me to make progress while remaining grounded in reality. In other words, I would need to see for myself the analogues of the aforementioned "games, analyses, and puzzles" that build chess intuition. This is why an important part of my original post (which nobody has responded to yet) was the following: