Thanks for trying to clarify "X and only X", which IMO is a promising concept.
One thing we might want from an only-Xer is that, in some not-yet-formal sense, it's "only trying to X" and not trying to do anything else. A further thing we might want is that the only-Xer only tries to X, across some relevant set of counterfactuals. You've discussed the counterfactuals across possible environments. Another kind of counterfactual is across modifications of the only-Xer. Modification-counterfactuals seem to point to a key problem of alignment: how does this generalize? If we've selected something to do X, within some set of environments, what does that imply about how it'll behave outside of that set of environments? It looks like by your definition we could have a program that's a very competent general intelligence with a slot for a goal, plus a pointer to X in that slot; and that program would count as an only-Xer. This program would be very close, in some sense, to programs that optimize competently for not-X, or for a totally unrelated Y. That seems counterintuitive for my intuitive picture of an "X and only X"er, so either there's more to be said, or my picture is incoherent.
My picture of an X and only X er is that the actual program you run should optimize only for X. I wasn't considering similarity in code space at all.
Getting the lexicographically first formal ZFC proof of say the Collatz conjecture should be safe. Getting a random proof sampled from the set of all proofs < 1 terabyte long should be safe. But I think that there exist proofs that wouldn't be safe. There might be a valid proof of the conjecture that had the code for a paperclip maximizer encoded into the proof, and that exploited some flaw in computers or humans to bootstrap this code into existence. This is what I want to avoid.
Your picture might be coherent and formalizable into some different technical definition. But you would have to start talking about difference in codespace, which can differ depending on different programming languages.
The program if True: x() else: y() is very similar in codespace to if False: x() else: y() .
If code space is defined in terms of minimum edit distance, then layers of interpereters, error correction and holomorphic encryption can change it. This might be what you are after, I don't know.
Well, a main reason we'd care about codespace distance, is that it tells us something about how the agent will change as it learns (i.e. moves around in codespace). (This is involving time, since the agent is changing, contra your picture.) So a key (quasi)metric on codespace would be, "how much" learning does it take to get from here to there. The if True: x() else: y() program is an unnatural point in codespace in this metric: you'd have to have traversed the both the distances from null to x() and from null to y(), and it's weird to have traversed a distance and make no use of your position. A framing of the only-X problem is that traversing from null to a program that's an only-Xer according to your definition, might also constitute traversing almost all of the way from null to a program that's an only-Yer, where Y is "very different" from X.
I don't think that learning is moving around in codespace. In the simplest case, the AI is like any other non self modifying program. The code stays fixed as the programmers wrote it. The variables update. The AI doesn't start from null. The programmer starts from a blank text file, and adds code. Then they run the code. The AI can start with sophisticated behaviour the moment its turned on.
So are we talking about a program that could change from an X er to a Y er with a small change in the code written, or with a small amount of extra observation of the world?
To clarify where my responses are coming from: I think what I'm saying is not that directly relevant to your specific point in the post. I'm more (1) interested in discussing the notion of only-X, broadly, and (2) reacting to the feature of your discussion (shared by much other discussion) that you (IIUC) consider only the extensional (input-output) behavior of programs, excluding from analysis the intensional properties. (Which is a reasonable approach, e.g. because the input-output behavior captures much of what we care about, and also because it's maybe easier to analyze and already contains some of our problems / confusions.)
From where I'm sitting, when a program "makes an observation of the world", that's moving around in codespace. There's of course useful stuff to say about the part that didn't change. When we really understand how a cognitive algorithm works, it starts to look like a clear algorithm / data separation; e.g. in Bayesian updating, we have a clear picture of the code that's fixed, and how it operates on the varying data. But before we understand the program in that way, we might be unable to usefully separate it out into a fixed part and a varying part. Then it's natural to say things like "the child invented a strategy for picking up blocks; next time, they just use that strategy", where the first clause is talking about a change in source code. We know for sure that such separations can be done, because for example we can say that the child is always operating in accordance with fixed physical law, and we might suspect there's "fundamental brain algorithms" that are also basically fixed. Likewise, even though Solomonoff induction is always just Solomonoff induction plus data, it can be also useful to understand SI(some data) in terms of understanding those programs that are highly ranked by SI(some data), and it seems reasonable to call that "the algorithm changed to emphasize those programs".
Epistemic status: Speculative. Rambling stream of ideas.
Recomended semiprerequisite https://www.lesswrong.com/posts/DkcdXsP56g9kXyBdq/coherence-arguments-imply-a-force-for-goal-directed-behavior
How do we tell if an arbitrary system is an optimizer or not?
Well "optimiser" is a set of hypothesis. We can take an AIXItl like structure, and use the universal prior over utility functions. The other hypothesis set is the default solomonov prior. All computer programs weighted by complexity.
Note that each hypothesis set contains a shrunken copy of the other. Within the space of all computer programs, some of those programs will be optimisers. Within the space of all possible optimizers, some will have the goal of making their output match the output of an arbitrary computer program. Thus the odds ratio in either direction is bounded. We call this the odds ratio approach.
The intuition is that this odds ratio provides a good measure if a program is optimizing.
(There is another measure of optimization, namely the weighted average utility over all Turing machine environments. We call this the weighted reward approach)
I will discuss a few programs, and how much optimization each formalism thinks they display.
Trivial constant programms
and
Under the odds ratio approach, both these programs are on the not optimizer end of the scale and right next to each other.
Under the weighted reward approach, the answer is much more sensitive on choice of universal Turing machine. An environment that just gives reward for each 1 outputted is very simple. It might even account for say 5% (random number) of weight across all computable environments. If the environment that just rewards 0's isn't given a correspondingly high weight, then according to the weighted reward framework, one of these programs can be much more optimizing than the other.
Conditional AIXI
and
In the odds ratio formulation, both of these are pretty optimizerry, and almost identical. In either formalization they need to take the complexity of describing the long random string. In the optimizer formulation, they need to take the complexity of constantly outputing 0's. In the all programs formulation, they need the complexity of describing AIXI like behaviour.
In the weighted reward formulation, the first is basically the constant 0 program, while the other is nearly as much of an optimizer as AIXI.
Optimising X and only X
The basic idea here is that once you compensate for the X optimization then the system isn't an optimizer.
One straightforward way of doing this is a rejection of low X proposed action. You have an arbitrary Turing machine T. Whenever T proposes an action A, the utility/ expected utility of that action is computed. If A is one of the top m actions when it comes to maximizing X, then A is taken. Otherwise T is told to try again. A system is said to be an X and only X maximizer if it is better modelled as being any old Turing machine behind a low X rejection gate. As opposed to a Y optimizer behind a low X rejection gate.
For example, a system that outputs whichever action in the million X-best actions contains the most 0's is an X and only X best optimizer. Here m=1000000, and the simple Turing machine behind the low X rejection gate is just a TM that cycles through all strings, starting with the ones full of 0's.
A system that outputs whichever action in the million X-best actions maximizes the number of paperclips in existence is not an X and only X maximiser. It is best modelled as a paperclip maximiser behind a low X rejector. And this comes back to an implicit choice of distribution over optimizers at the start of this article. But presumably the paperclip maximizer is prominent in that set.
Note that at the start of this, we used a set of optimisers. One way to formally specify that set would be to take an AIXI like design, and put a solomonov prior over utility functions (as functions of observations)
However, this fails to capture AI's that optimise for something that is not a simple function of observations. (Like some real world object in a world where the AI's senses can be fooled.) However, if we had a fomalization of an agent that did this kind of optimization, we could just plug in all possible utility functions, weighted by simplicity, into our set of optimizers. Likewise for other decision theories, compute limited AI's ect.