Now, people working in these areas don't often disagree with this formal argument; they just think it isn't that important. They feel that getting the right formalism is most of the work, and finding the right U is easier, or just a separate bolt-on that can be added later.
My intuition, formed mainly by my many failures in this area, is that defining the U is absolutely critical, and is much harder than the rest of the problem. Others have different intuitions, and I hope they're right.
I'm curious if you're aiming for justified 99.9999999% confidence in having a friendly AI on the first try (i.e. justified belief that there's no more than a 1 in a billion chance of not-a-friendly-AI-on-the-first-try). I would agree that defining U is necessary to hit that sort of confidence, and that it's much harder than the rest of the problem.
ETA: The reason I ask is that this post's argument seems very similar to the problem I have with impact measures (briefly: either you fail to prevent catastrophes, or you never do anything useful), but I wouldn't apply that argument to corrigibility. I think the difference might be that I'm thinking of "natural" things that agents might want, whereas you're considering the entire space of possible utility functions. I'm trying to figure out why we have this difference.
I'm trying to figure out why we have this difference.
My judgements come mainly from trying to make corrigibility, impact measures, and so on work, and having similar problems in all cases.
I guess my point is that your counterexamples/problems all have this very formal no-free-lunch theorem aspect to them, and we solve problems that have no-free-lunch theorems all the time -- in fact a lot of the programming languages community is tackling such problems and getting decent results in most cases.
For this reason you could say "okay, while there is a no-free-lunch theorem here, probably when the AI system carves reality at the joints, it ends up with features / cognition that doesn't consider the utility on something like turning on a yellow light to be a reasonable utility function". You seem to be opposed to any reasoning of this sort, and I don't know why.
The counter-examples are of that type because the examples are often of that type - presented formally, so vulnerable to a formal solution.
If you're saying that "utility on something like turning on a yellow light" is not a reasonable utility function, then I agree with you, and that's the very point of this post - we need to define what a "reasonable" utility function is, at least to some extent ("partial preferences..."), to get anywhere with these ideas.
The counter-examples are of that type because the examples are often of that type - presented formally, so vulnerable to a formal solution.
It does not seem to me that the cluster of concepts in corrigibility, Clarifying AI Alignment, and my comment on it is presented formally. They feel very, very informal (to the point that I think we should try to make them more formal, though I'm not optimistic about getting them to the level of formality you typically use).
(I still need to get a handle on ascription universality, which might be making these concepts more formal, but from what I understand of it so far it's still much less formal than you usually work with.)
we need to define what a "reasonable" utility function
My argument is that we don't need to define this formally; we can reason about it informally and still get justified confidence that we will get good outcomes, though not justified confidence in < 1-in-a-billion chance of failure.
people working in these areas don't often disagree with this formal argument; they just think it isn't that important.
I do disagree with this formal argument in that I think it’s incorrectly framed. See the difference between avoiding huge impact to utility and avoiding huge impact to attainable utility, discussed here: https://www.lesswrong.com/posts/c2oM7qytRByv6ZFtz/impact-measure-desiderata#zLnkb5xM4E9ATzCFg.
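(For anyone not following the link, here is a minimal toy sketch of that distinction, with invented numbers and hypothetical function names: penalising change in a utility function's current value is not the same as penalising change in how much of that utility the agent could still attain.)

```python
# Toy sketch (invented numbers): two different things one could penalise
# for an auxiliary utility function u, comparing an action against doing nothing.

def utility_impact(u_after_action, u_after_noop):
    """Change in u's *current* value caused by the action."""
    return abs(u_after_action - u_after_noop)

def attainable_utility_impact(best_u_after_action, best_u_after_noop):
    """Change in how much u the agent could still *attain* from here on
    (difference between the optimal achievable values)."""
    return abs(best_u_after_action - best_u_after_noop)

# Example: irreversibly locking a door changes nothing about u right now,
# but destroys the option of ever optimising u later.
print(utility_impact(u_after_action=10, u_after_noop=10))                          # 0
print(attainable_utility_impact(best_u_after_action=10, best_u_after_noop=1000))   # 990
```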
AI scientists at EvenDeeperMind: "Hey everyone! We have a developed a low-impact AI!"
AI policy people at OpenFutureofExistentialAI: "Fantastic! What does it do?"
AI scientists: "It's limited to answering questions, and it has four possible outputs, o1, o2, o3, and o4."
AI policy: "What exactly do these outputs do, btw?"
AI scientists: "Well, o1 turns a green light on, o2 turns a red light on, o3 starts a nuclear war, and o4 turns a blue light one."
AI policy: "Starts a nuclear war?!?!?"
AI scientists: "That or turns a yellow light on; I can never remember which...".
Against purely formal definitions of impact measures
It's "obvious" that an AI that starts a nuclear war with one of its four actions, cannot be considered a "low-impact" agent.
But what about one that just turned the yellow light on? Well, what about the utility function Uy, which was +100 if there were no yellow lights in that room during that hour, but was −10^(10^10) if there was a yellow light? For that utility function, the action "start a nuclear war" is the low-impact action, and even entertaining the possibility of turning on the yellow light is an abomination, you monster.
To which you should answer "Uy is a very stupid choice of utility function". Indeed it is. But it is a possible choice of utility function, so if we had an AI that was somehow "low-impact" for all utility functions, it would be low-impact for Uy.
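(To make this concrete, here is a minimal toy sketch, with invented numbers and a deliberately naive "impact = change relative to doing nothing" measure; nothing hangs on the specific formalism.)

```python
# Toy sketch (invented numbers): "impact" of an action as how much it shifts
# a utility function's value, compared with a do-nothing baseline.

# Each outcome records what is true of the world after the action.
OUTCOMES = {
    "do_nothing":           {"yellow_light": False, "nuclear_war": False},
    "turn_on_yellow_light": {"yellow_light": True,  "nuclear_war": False},
    "start_nuclear_war":    {"yellow_light": False, "nuclear_war": True},
}

def u_sensible(w):
    # Mildly dislikes the light being on, hugely dislikes nuclear war.
    return -1 * w["yellow_light"] - 10**6 * w["nuclear_war"]

def u_yellow(w):
    # Uy: +100 if there is no yellow light, hugely negative otherwise.
    # (Exponent truncated from 10^(10^10) so the number stays representable.)
    return 100 if not w["yellow_light"] else -10**100

def impact(u, action):
    """Naive impact: |u(world after action) - u(world after doing nothing)|."""
    return abs(u(OUTCOMES[action]) - u(OUTCOMES["do_nothing"]))

for name, u in [("u_sensible", u_sensible), ("u_yellow", u_yellow)]:
    candidates = ["turn_on_yellow_light", "start_nuclear_war"]
    lowest = min(candidates, key=lambda a: impact(u, a))
    print(f"lowest-impact action according to {name}: {lowest}")

# u_sensible -> turn_on_yellow_light;  u_yellow -> start_nuclear_war.
# An agent required to be low-impact for *every* utility function, Uy included,
# is left with no acceptable action at all.
```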
There are less artificial examples than Uy; a friendly utility function with a high discount rate would find any delay intolerable ("by not breaking out and optimising the world immediately, you're murdering thousands of people in intense agony, you monster!").
Abstract low-impact
This has been my recurrent objection to many attempts to formalise low impact in abstract terms. We live in a universe in which every action is irreversible (damn you, second law!) and the consequences expand at light-speed across the cosmos. And yet most versions of low impact - including my own attempts - revolve around some measure of "keeping the rest of the universe the same, and only changing this tiny thing".
For this to make sense, we need to classify some descriptors of the world as "important", and others as "unimportant". And we further need to establish what counts as a "small" change to an "important" fact. You can see this as assigning utility functions to the values of the important descriptors, collecting them into a set U, and capturing low impact as "only small changes to the utility functions in U".
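(Schematically - and this is just the shape such a measure tends to take, not any specific proposal - the construction looks something like this:)

```latex
% U: the set of utility functions attached to the "important" descriptors
% w_u: how much each descriptor matters; a_0: some "do nothing" baseline action
\mathrm{Impact}(a) \;=\; \sum_{u \in U} w_u \,\bigl|\,\mathbb{E}[\,u \mid a\,] - \mathbb{E}[\,u \mid a_0\,]\,\bigr|,
\qquad \text{``low impact''} \;\iff\; \mathrm{Impact}(a) \le \epsilon .
```

Which u get into U, the weights w_u, the baseline a_0, and the tolerance ε are exactly where the human-value content has to come in; none of them falls out of the formalism itself.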
But you absolutely need to define U, and this has to be a definition that captures something of human values. This paper measures low impact by preserving vases and penalising "irreversible" changes. But every change is irreversible, and what about preserving the convection currents in the room rather than the pointless vases? ("you monster!").
So defining a U that is compatible with human models of "low impact" is absolutely essential to getting the whole thing to work. Abstractly considering all utility functions (or all utility functions defined in an abstract action-observation sense) is not going to work.
Note that often the definition of U can be hidden in the assumptions of the model, which will result in problems if those assumptions are relaxed or wrong.
The general intuitive disagreement
The objection I made here also applies to concepts like corrigibility, domesticity, value-learning, and similar ideas (including some versions of tool AI and Oracles). All of these need to designate certain AI policies as "safe" (or safer) and others as dangerous, and draw the line between them.
But, in my experience, this definition cannot be given in an abstract way; there is no such thing as a generally low-impact or corrigible agent. Defining some subset of what humans consider corrigible or tool-like is an essential requirement.
Now, people working in these areas don't often disagree with this formal argument; they just think it isn't that important. They feel that getting the right formalism is most of the work, and finding the right U is easier, or just a separate bolt-on that can be added later.
My intuition, formed mainly by my many failures in this area, is that defining the U is absolutely critical, and is much harder than the rest of the problem. Others have different intuitions, and I hope they're right.
Strictly easier than friendliness
The problem of finding a suitable U is, however, strictly easier than defining a friendly utility function.
This can be seen in the fact that there are huge disagreements between humans about morality and values, but much less disagreement about what an Oracle, a low-impact agent, or a corrigible agent should do.
"Don't needlessly smash the vases, but the convection currents are not important" is good advice for a low impact agent, as agreed upon by people from all types of political, moral, and cultural persuasions, including a wide variety of plausible imaginary agents.
Thus defining U is easier than coming up with a friendly utility function, as the same low-impact/corrigibility/domesticity/etc. definition is compatible with many different potential friendly utility functions for different values.