XiXiDu comments on The genie knows, but doesn't care - Less Wrong
Yes. To divide it more finely, it could be a terminal goal, or an instrumental goal; it could be a goal of the AI, or a goal of the human; it could be a goal the human would reflectively endorse, or a goal the human would reflectively reject but is inadvertently promoting anyway.
I agree that, at a given time, the AI must have a determinate goal. (Though the encoding of that goal may be extremely complicated and unintentional. And it may need to be time-indexed.) I'm not dogmatically set on the idea that a self-improving AGI is easy to program; at this point it wouldn't shock me if it took over 100 years to finish making the thing. What you're alluding to are the variety of ways we could fail to construct a self-improving AGI at all. Obviously there are plenty of ways to fail to make an AGI that can improve its own ability to track things about its environment in a domain-general way, without bursting into flames at any point. If there weren't plenty of ways to fail, we'd have already succeeded.
Our main difference in focus is that I'm worried about what happens if we do succeed in building a self-improving AGI that doesn't randomly melt down. Conditioned on our succeeding in the next few centuries in making a machine that actually optimizes for anything at all, and that optimizes for its own ability to generally represent its environment in a way that helps it in whatever else it's optimizing for, we should currently expect humans to go extinct as a result. Even if the odds of our succeeding in the next few centuries were small, it would be worth thinking about how to make that extinction event less likely. (Though they aren't small.)
I gather that you think that making an artificial process behave in any particular way at all (i.e., optimizing for something), while recursively doing surgery on its own source code in the radical way MIRI is interested in, is very tough. My concern is that, no matter how true that is, it doesn't entail that if we succeed at that tough task, we'll therefore have made much progress on other important tough tasks, like Friendliness. It does give us more time to work on Friendliness, but if we convince ourselves that intelligence explosion is a completely pie-in-the-sky possibility, then we won't use that time effectively.
I also gather that you have a hard time imagining our screwing up on a goal architecture without simply breaking the AGI. Perhaps by 'screwing up' you're imagining failing to close a set of parentheses. But I think you should be at least as worried about philosophical, as opposed to technical, errors. A huge worry isn't just that we'll fail to make the AI we intended; it's that our intentions while we're coding the thing will fail to align with our own long-term interests, much less those of the human race.
We agree that it's possible to 'bind' a superintelligence. (By this you don't mean boxing it; you just mean programming it to behave in some ways as opposed to others.) But if the bindings fall short of Friendliness, while enabling superintelligence to arise at all, then a serious risk remains. Is your thought that Friendliness is probably an easier 'binding' to figure out how to code than are, say, resisting Pascal's mugging, or having consistent arithmetical reasoning?
I am trying to understand whether the kind of AI underlying the scenario you have in mind is a possible and likely outcome of human AI research.
As far as I am aware, as a layman, goals and capabilities are intrinsically tied together. How could a chess computer be capable of winning against humans at chess without the terminal goal of achieving a checkmate?
Coherent and specific goals are necessary to (1) decide which actions are instrumentally useful and (2) judge the success of self-improvement. If the given goal is logically incoherent, or too vague for the AI to be able to tell success apart from failure, would it work at all?
If I understand your position correctly, you would expect a chess-playing general AI, one that does not know about checkmate, to improve against such goals as "modeling states of affairs well" or "making sure nothing intervenes in chess playing", instead of "winning at chess". You believe that these goals do not have to be programmed by humans, because they are emergent goals, an instrumental consequence of being generally intelligent.
These universal instrumental goals, these "AI drives", seem to be a major reason why you believe it to be important to make the AI care about behaving correctly. You believe that these AI drives are a given, and that the only way to prevent an AI from being an existential risk is to channel these drives, to focus this power on protecting and amplifying human values.
My perception is that the drives you imagine are not special and will be as difficult to get "right" as any other goal. The idea that humans will not only want to make an AI exhibit such drives, but also succeed at making such drives emerge, strikes me as a very unlikely outcome.
As far as I am aware, here is what you believe an AI to want:
Which AIs that humans would ever want to create would require all of these drives, and how easy will it be for humans to make an AI exhibit these drives, compared to making an AI that can do what humans want without them?
Take mathematics. What are the difficulties associated with making an AI better than humans at mathematics, and will an AI need these drives in order to do so?
Humans did not evolve to play chess or do mathematics. Yet it is considerably less difficult to design a chess AI than an AI capable of discovering interesting and useful mathematics.
I believe that this difference in difficulty is due to the fact that it is much easier to formalize what it means to play chess than what it means to do mathematics. The difference between chess and mathematics is that chess has a specific terminal goal in the form of a clear definition of what constitutes winning. Although mathematics has unambiguous rules, there is no specific terminal goal and no clear definition of what constitutes winning.
The progress of the capability of artificial intelligence is not only related to whether humans have evolved for a certain skill or to how much computational resources it requires but also to how difficult it is to formalize the skill, its rules and what it means to succeed.
In the light of this, how difficult would it be to program the drives that you imagine, versus just making an AI win against humans at a given activity without exhibiting these drives?
All these drives are very vague ideas, not like "winning at chess", but more like "being better at mathematics than Terence Tao".
The point I am trying to make is that these drives constitute additional complexity, rather than being simple ideas that you can just assume, and from which you can reason about the behavior of an AI.
It is in this context that the "dumb superintelligence" argument tries to highlight something: it is likely incredibly hard to make these drives emerge in a seed AI. Such scenarios implicitly presuppose that humans succeed at encoding intricate ideas about what "winning" means in all those cases required to overpower humans, but fail to do so in the case of, e.g., winning at chess or doing mathematics. I like to analogize such a scenario to the creation of a generally intelligent autonomous car that works perfectly well at not destroying itself in a crash but somehow manages to maximize the number of people it runs over.
I agree that if you believe that it is much easier to create a seed AI to exhibit the drives that you imagine, than it is to make a seed AI use its initial resources to figure out how to solve a specific problem, then we agree about AI risks.
(Note: I'm also a layman, so my non-expert opinions necessarily come with a large salt side-dish)
My guess here is that most of the "AI drives" (to self-improve, be rational, retain its goal structure, etc.) are considered necessary for a functional learning/self-improving algorithm. If the program cannot recognize and make rules for new patterns observed in data, make sound inferences based on known information, or keep after its objective, it will not be much of an AGI at all; it will not even be able to function as well as a modern targeted advertising program.
The rest, such as self-preservation, are justified as being logical requirements of the task. Rather than having self-preservation as a terminal value, the paperclip maximizer will value its own existence as an optimal means of proliferating paperclips. It makes intuitive sense that those sorts of 'drives' would emerge from most any goal, but then again my intuition is not necessarily very useful for these sorts of questions.
This point might also be a source of confusion:
As Dr Valiant (great name or the greatest name?) classifies things in Probably Approximately Correct, Winning Chess would be a 'theoryful' task while Discovering (Interesting) Mathematical Proofs would be a 'theoryless' one. In essence, the theoryful has simple and well established rules for the process which could be programmed optimally in advance with little-to-no modification needed afterwards while the theoryless is complex and messy enough that an imperfect (Probably Approximately Correct) learning process would have to be employed to suss out all the rules.
Now obviously the program will benefit from labeling in its training data for what is and is not an "interesting" mathematical proof; otherwise it can just screw around with computationally cheap arithmetic proofs (1 + 1 = 2, 1.1 + 1 = 2.1, 1.2 + 1 = 2.2, etc.) until the heat death of the universe. Less obviously, as the hidden-tank example shows, insufficient labeling or bad labels will lead to other unintended results.
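To make the "screw around with computationally cheap proofs" failure mode concrete, here is a toy Python sketch. The function names and the volume-based score are my own inventions for illustration, not anyone's actual proposal; the point is just that a prover rewarded for sheer output games an objective that lacks an "interestingness" label:

```python
import itertools

def trivial_prover():
    """Generate endless cheap arithmetic identities, maximizing a
    naive 'proofs found' count without ever producing anything a
    mathematician would call interesting."""
    for n in itertools.count():
        x = n / 10
        yield f"{x} + 1 = {x + 1}"

def naive_score(proofs):
    # Rewards sheer volume -- exactly the metric the trivial prover games.
    return len(proofs)

prover = trivial_prover()
batch = [next(prover) for _ in range(1000)]
print(naive_score(batch))  # 1000 'theorems', none of them interesting
```

A score that instead consulted labeled examples of interesting vs. uninteresting proofs would (imperfectly, per the hidden-tank worry) penalize this degenerate strategy.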
So applying that back to Friendliness; despite attempts to construct a Fun Theory, human value is currently (and may well forever remain) theoryless. A learning process whose goal is to maximize human value is going to have to be both well constructed and have very good labels initially to not be Unfriendly. Of course, it could very well correct itself later on, that is in fact at the core of a PAC algorithm, but then we get into questions of FOOM-ing and labels of human value in the environment which I am not equipped to deal with.
Humans are capable of winning at chess without the terminal goal of doing so. Nor were humans designed by evolution specifically for chess. Why should we expect a general superintelligence to have intelligence that generalizes less easily than a human's does?
You keep coming back to this 'logically incoherent goals' and 'vague goals' idea. Honestly, I don't have the slightest idea what you mean by those things. A goal that can't motivate one to do anything ain't a goal; it's decor, it's noise. 'Goals' are just the outcomes systems tend to produce, especially systems too complex to be easily modeled as, say, physical or chemical processes. Certainly it's possible for goals to be incredibly complicated, or to vary over time. But there's no such thing as a 'logically incoherent outcome'. So what's relevant to our purposes is whether failing to make a powerful optimization process human-friendly will also consistently stop the process from optimizing for anything whatsoever.
Conditioned on a self-modifying AGI (say, an AGI that can quine its source code, edit it, then run the edited program and repeat the process) achieving domain-general situation-manipulating abilities (i.e., intelligence), analogous to humans' but to a far greater degree, which of the AI drives do you think are likely to be present, and which absent? 'It wants to self-improve' is taken as a given, because that's the hypothetical we're trying to assess. Now, should we expect such a machine to be indifferent to its own survival and to the use of environmental resources?
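As a very loose illustration of the quine-edit-run cycle mentioned above, here is a toy counter-bumping program of my own construction. It is obviously nothing like a self-improving AGI, and it skips the hard part (proving the edited copy shares the original's goals); it only shows the bare loop of producing and executing a modified copy of one's own source:

```python
# Each iteration runs the current 'program', then emits an edited copy
# of itself (here, the edit just bumps a counter).
source = 'gen = {n}\nresult.append(gen)'

def make_program(n: int) -> str:
    return source.format(n=n)

result = []
program = make_program(0)
for _ in range(3):
    exec(program)                           # run the current copy
    program = make_program(result[-1] + 1)  # 'self-edit': bump the counter

print(result)  # [0, 1, 2]
```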
Sometimes a more complex phenomenon is the implication of a simpler hypothesis. A much narrower set of goals will have intelligence-but-not-resource-acquisition as instrumental than will have both as instrumental, because it's unlikely to hit upon a goal that requires large reasoning abilities but does not call for many material resources.
You haven't given arguments suggesting that here. At most, you've given arguments against expecting a seed AI to be easy to invent. Be careful to note, to yourself and others, when you switch between the claims 'a superintelligence is too hard to make' and 'if we made a superintelligence it would probably be safe'.
Well, I'm not sure what XXD means by them, but...
G1 ("Everything is painted red") seems like a perfectly coherent goal. A system optimizing G1 paints things red, hires people to paint things red, makes money to hire people to paint things red, invents superior paint-distribution technologies to deposit a layer of red paint over things, etc.
G2 ("Everything is painted blue") similarly seems like a coherent goal.
G3 (G1 AND G2) seems like an incoherent goal. A system with that goal... well, I'm not really sure what it does.
A system's goal has to be some event that can be brought about. In our world, '2+2=4' and '2+2=5' are not goals; 'everything is painted red and not-red' may not be a goal for similar reasons. When we're talking about an artificial intelligence's preferences, we're talking about the things it tends to optimize for, not the things it 'has in mind' or the things it believes are its preferences.
This is part of what makes the terminology misleading, and is also why we don't ask 'can a superintelligence be irrational?'. Irrationality is dissonance between my experienced-'goals' (and/or, perhaps, reflective-second-order-'goals') and my what-events-I-produce-'goals'; but we don't care about the superintelligence's phenomenology. We only care about what events it tends to produce.
Tabooing 'goal' and just talking about the events a process-that-models-its-environment-and-directs-the-future tends to produce would, I think, undermine a lot of XiXiDu's intuitions about goals being complex explicit objects you have to painstakingly code in. The only thing that makes it more useful to model a superintelligence as having 'goals' than modeling a blue-minimizing robot as having 'goals' is that the superintelligence responds to environmental variation in a vastly more complicated way. (Because, in order to be even a mediocre programmer, its model-of-the-world-that-determines-action has to be more complicated than a simple camcorder feed.)
Oh.
Well, in that case, all right. If there exists some X a system S is in fact optimizing for, and what we mean by "S's goals" is X, regardless of what target S "has in mind", then sure, I agree that systems never have vague or logically incoherent goals.
Well, wait. Where did "models its environment" come from?
If we're talking about the things S optimizes its environment for, not the things S "has in mind", then it would seem that whether S models its environment or not is entirely irrelevant to the conversation.
In fact, given how you've defined "goal" here, I'm not sure why we're talking about intelligence at all. If that is what we mean by "goal" then intelligence has nothing to do with goals, or optimizing for goals. Volcanoes have goals, in that sense. Protons have goals.
I suspect I'm still misunderstanding you.
From Eliezer's Belief in Intelligence:
"Since I am so uncertain of Kasparov's moves, what is the empirical content of my belief that 'Kasparov is a highly intelligent chess player'? What real-world experience does my belief tell me to anticipate? [...]
"The empirical content of my belief is the testable, falsifiable prediction that the final chess position will occupy the class of chess positions that are wins for Kasparov, rather than drawn games or wins for Mr. G. [...] The degree to which I think Kasparov is a 'better player' is reflected in the amount of probability mass I concentrate into the 'Kasparov wins' class of outcomes, versus the 'drawn game' and 'Mr. G wins' class of outcomes."
From Measuring Optimization Power:
"When I think you're a powerful intelligence, and I think I know something about your preferences, then I'll predict that you'll steer reality into regions that are higher in your preference ordering. [...]
"Ah, but how do you know a mind's preference ordering? Suppose you flip a coin 30 times and it comes up with some random-looking string - how do you know this wasn't because a mind wanted it to produce that string?
"This, in turn, is reminiscent of the Minimum Message Length formulation of Occam's Razor: if you send me a message telling me what a mind wants and how powerful it is, then this should enable you to compress your description of future events and observations, so that the total message is shorter. Otherwise there is no predictive benefit to viewing a system as an optimization process. This criterion tells us when to take the intentional stance.
"(3) Actually, you need to fit another criterion to take the intentional stance - there can't be a better description that averts the need to talk about optimization. This is an epistemic criterion more than a physical one - a sufficiently powerful mind might have no need to take the intentional stance toward a human, because it could just model the regularity of our brains like moving parts in a machine.
"(4) If you have a coin that always comes up heads, there's no need to say "The coin always wants to come up heads" because you can just say "the coin always comes up heads". Optimization will beat alternative mechanical explanations when our ability to perturb a system defeats our ability to predict its interim steps in detail, but not our ability to predict a narrow final outcome. (Again, note that this is an epistemic criterion.)
"(5) Suppose you believe a mind exists, but you don't know its preferences? Then you use some of your evidence to infer the mind's preference ordering, and then use the inferred preferences to infer the mind's power, then use those two beliefs to testably predict future outcomes. The total gain in predictive accuracy should exceed the complexity-cost of supposing that 'there's a mind of unknown preferences around', the initial hypothesis."
Notice that throughout this discussion, what matters is the mind's effect on its environment, not any internal experience of the mind. Unconscious preferences are just as relevant to this method as are conscious preferences, and both are examples of the intentional stance. Note also that you can't really measure the rationality of a system you're modeling in this way; any evidence you raise for 'irrationality' could just as easily be used as evidence that the system has more complicated preferences than you initially thought, or that they're encoded in a more distributed way than you had previously hypothesized.
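Step (5) of the quoted procedure can be sketched in toy form. Everything here (the outcome labels, the frequency-based inference rule) is my own simplification of "use some of your evidence to infer the mind's preference ordering, then use the inferred preferences to testably predict future outcomes":

```python
from collections import Counter

# Observed final outcomes of past games against the presumed mind.
observed = ["win", "win", "draw", "win", "win"]

def infer_preference(outcomes):
    # Treat whichever outcomes the system most often steers reality into
    # as the top of its inferred preference ordering.
    return [o for o, _ in Counter(outcomes).most_common()]

preference = infer_preference(observed)
prediction = preference[0]      # testable prediction for the next game
print(preference, prediction)   # ['win', 'draw'] win
```

Note that nothing in this procedure touches the system's internal experience; it works identically for conscious and unconscious "preferences", which is the point of the paragraph above.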
My take-away from this is that there are two ways we generally think about minds on LessWrong: Rational Choice Theory, on which all minds are equally rational and strange or irregular behaviors are seen as evidence of strange preferences; and what we might call the Ideal Self Theory, on which minds' revealed preferences can differ from their 'true self' preferences, resulting in irrationality. One way of unpacking my idealized values is that they're the rational-choice-theory preferences I would exhibit if my conscious desires exhibited perfect control over my consciously controllable behavior, and those desires were the desires my ideal self would reflectively prefer, where my ideal self is the best trade-off between preserving my current psychology and enhancing that psychology's understanding of itself and its environment.
We care about ideal selves when we think about humans, because we value our conscious, 'felt' desires (especially when they are stable under reflection) more than our unconscious dispositions. So we want to bring our actual behavior (and thus our rational-choice-theory preferences, the 'preferences' we talk about when we speak of an AI) more in line with our phenomenological longings and their idealized enhancements. But since we don't care about making non-person AIs more self-actualized, but just care about how they tend to guide their environment, we generally just assume that they're rational. Thus if an AI behaves in a crazy way (e.g., alternating between destroying and creating paperclips depending on what day of the week it is), it's not because it's a sane rational ghost trapped by crazy constraints. It's because the AI has crazy core preferences.
Yes, in principle. But in practice, a system that doesn't have internal states that track the world around it in a reliable and useable way won't be able to optimize very well for anything particularly unlikely across a diverse set of environments. In other words, it won't be very intelligent. To clarify, this is an empirical claim I'm making about what it takes to be particularly intelligent in our universe; it's not part of the definition for 'intelligent'.
Yes, that seems plausible.
I would say rather that modeling one's environment is an effective tool for consistently optimizing for some specific unlikely thing X across a range of environments, so optimizers that do so will be more successful at optimizing for X, all else being equal, but it more or less amounts to the same thing.
But... so what?
I mean, it also seems plausible that optimizers that explicitly represent X as a goal will be more successful at consistently optimizing for X, all else being equal... but that doesn't stop you from asserting that explicit representation of X is irrelevant to whether a system has X as its goal.
So why isn't modeling the environment equally irrelevant? Both features, on your account, are optional enhancements an optimizer might or might not display.
It keeps seeming like all the stuff you quote and say before your last two paragraphs ought to provide an answer to that question, but after reading it several times I can't see what answer it might be providing. Perhaps your argument is just going over my head, in which case I apologize for wasting your time by getting into a conversation I'm not equipped for.
Maybe it will help to keep in mind that this is one small branch of my conversation with Alexander Kruel. Alexander's two main objections to funding Friendly Artificial Intelligence research are that (1) advanced intelligence is very complicated and difficult to make, and (2) getting a thing to pursue a determinate goal at all is extraordinarily difficult. So a superintelligence will never be invented, or at least not for the foreseeable future; so we shouldn't think about SI-related existential risks. (This is my steel-manning of his view. The way he actually argues seems to instead be predicated on inventing SI being tied to perfecting Friendliness Theory, but I haven't heard a consistent argument for why that should be so.)
Both of these views, I believe, are predicated on a misunderstanding of how simple and disjunctive 'intelligence' and 'goal' are, for present purposes. So I've mainly been working on tabooing and demystifying those concepts. Intelligence is simply a disposition to efficiently convert a wide variety of circumstances into some set of specific complex events. Goals are simply the circumstances that occur more often when a given intelligence is around. These are both very general and disjunctive ideas, in stark contrast to Friendliness; so it will be difficult to argue that a superintelligence simply can't be made, and difficult too to argue that optimizing for intelligence requires one to have a good grasp on Friendliness Theory.
Because I'm trying to taboo the idea of superintelligence, and explain what it is about seed AI that will allow it to start recursively improving its own intelligence, I've been talking a lot about the important role modeling plays in high-level intelligent processes. Recognizing what a simple idea modeling is, and how far it gets one toward superintelligence once one has domain-general modeling proficiency, helps a great deal with greasing the intuition pump 'Explosive AGI is a simple, disjunctive event, a low-hanging fruit, relative to Friendliness.' Demystifying and unpacking these concepts makes the scenario seem less improbable and convoluted.
I think this is a map/territory confusion. I'm not denying that superintelligences will have a map of their own preferences; at a bare minimum, they need to know what they want in order to prevent themselves from accidentally changing their own preferences. But this map won't be the AI's preferences -- those may be a very complicated causal process bound up with, say, certain environmental factors surrounding the AI, or oscillating with time, or who-knows-what.
There may not be a sharp line between the 'preference' part of the AI and the 'non-preference' part. Since any superintelligence will be exemplary at reasoning with uncertainty and fuzzy categories, I don't think that will be a serious obstacle.
Does that help explain where I'm coming from? If not, maybe I'm missing the thread unifying your comments.
I suppose it helps, if only in that it establishes that much of what you're saying to me is actually being addressed indirectly to somebody else, so it ought not surprise me that I can't quite connect much of it to anything I've said. Thanks for clarifying your intent.
For my own part, I'm certainly not functioning here as Alex's proxy; while I don't consider explosive intelligence growth as much of a foregone conclusion as many folks here do, I also don't consider Alex's passionate rejection of the possibility justified, and have had extended discussions on related subjects with him myself in past years. So most of what you write in response to Alex's positions is largely talking right past me.
(Which is not to say that you ought not be doing it. If this is in effect a private argument between you and Alex that I've stuck my nose into, let me know and I'll apologize and leave y'all to it in peace.)
Anyway, I certainly agree that a system might have a representation of its goals that is distinct from the mechanisms that cause it to pursue those goals. I have one of those, myself. (Indeed, several.) But if a system is capable of affecting its pursuit of its goals (for example, if it is capable of correcting the effects of a state-change that would, uncorrected, have led to value drift), it is not merely interacting with maps. It is also interacting with the territory... that is, it is modifying the mechanisms that cause it to pursue those goals... in order to bring that territory into line with its pre-existing map.
And in order to do that, it must have such a mechanism, and that mechanism must be consistently isomorphic to its representations of its goals.
Yes?
This sounds like a potentially confusing level of simplification; a goal should be regarded as at least a way of comparing possible events.
Its behavior is what makes its goal important. But in a system designed to follow an explicitly specified goal, it does make sense to talk of its goal apart from its behavior. Even though its behavior will reflect its goal, the goal itself will reflect itself better.
If the goal is implemented as a part of the system, other parts of the system can store some information about the goal, certain summaries or inferences based on it. This information can be thought of as beliefs about the goal. And if the goal is not "logically transparent", that is its specification is such that making concrete conclusions about what it states in particular cases is computationally expensive, then the system never knows what its goal says explicitly, it only ever has beliefs about particular aspects of the goal.
Perhaps, but I suspect that for most possible AIs there won't always be a fact of the matter about where its preference is encoded. The blue-minimizing robot is a good example. If we treat it as a perfectly rational agent, then we might say that it has temporally stable preferences that are very complicated and conditional; or we might say that its preferences change at various times, and are partly encoded, for instance, in the properties of the color-inverting lens on its camera. An AGI's response to environmental fluctuation will probably be vastly more complicated than a blue-minimizer's, but the same sorts of problems arise in modeling it.
I think it's more useful to think of rational-choice-theory-style preferences as useful theoretical constructs -- like a system's center of gravity, or its coherently extrapolated volition -- than as real objects in the machine's hardware or software. This sidesteps the problem of haggling over which exact preferences a system has, how those preferences are distributed over the environment, how to decide between causally redundant encodings which is 'really' the preference encoding, etc. See my response to Dave.
"Goal" is a natural idea for describing AIs with limited resources: these AIs won't be able to make optimal decisions, and their decisions can't be easily summarized in terms of some goal, but unlike the blue-minimizing robot they have a fixed preference ordering that doesn't gradually drift away from what it was originally, and eventually they tend to get better at following it.
For example, if a goal is encrypted, and it takes a huge amount of computation to decrypt it, the system's behavior prior to that point won't depend on the goal, but the system is going to work on decrypting it and will eventually follow it. This encrypted goal is probably more predictive of long-term consequences than anything else in the details of the original design, but it also doesn't predict the system's behavior during the first stage (and if there is only a small probability that all the resources in the universe would allow decrypting the goal, it's probable that the system's behavior will never depend on the goal). Similarly, even if there is no explicit goal, as in the case of humans, it might be possible to work with an idealized goal that, like the encrypted goal, can't be easily evaluated, and so won't influence behavior for a long time.
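The encrypted-goal example can be made concrete with a toy XOR "encryption". The goal strings and single-byte key are my own stand-ins (real encryption would not be brute-forceable like this); the point is that before the key is found, behavior is literally identical for any goal:

```python
def xor_crypt(text: bytes, key: int) -> bytes:
    # XOR with a single-byte key; applying it twice recovers the original.
    return bytes(b ^ key for b in text)

secret_key = 42
encrypted_goal = xor_crypt(b"maximize paperclips", secret_key)

def act(encrypted_goal: bytes, known_key=None) -> str:
    if known_key is None:
        return "search for key"   # behavior independent of the goal's content
    return xor_crypt(encrypted_goal, known_key).decode()

# Before decryption: identical behavior regardless of what the goal says.
assert act(encrypted_goal) == act(xor_crypt(b"minimize paperclips", secret_key))

# Once the key is found, the goal finally steers behavior.
key = next(k for k in range(256)
           if xor_crypt(encrypted_goal, k).startswith(b"max"))
print(act(encrypted_goal, key))  # maximize paperclips
```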
My point is that there are natural examples where goals and the character of behavior don't resemble each other, so that each can't be easily inferred from the other, while both can be observed as aspects of the system. It's useful to distinguish these ideas.
I agree preferences aren't reducible to actual behavior. But I think they are reducible to dispositions to behave, i.e., behavior across counterfactual worlds. If a system prefers a specific event Z, that means that, across counterfactual environments you could have put it in, the future would on average have had more Z the more its specific distinguishing features had a large and direct causal impact on the world.
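One toy way to operationalize "preference as disposition across counterfactual environments": drop the system into several environments and compare how much of outcome Z each ends up with, with and without the system's causal influence. The environment model and all the numbers here are entirely made up for illustration:

```python
import random

random.seed(0)

def environment(start_z: int, agent_active: bool) -> int:
    """Return the final amount of Z after 100 time steps."""
    z = start_z
    for _ in range(100):
        z += random.choice([-1, 1])   # background noise
        if agent_active:
            z += 1                    # the system nudges the world toward Z
    return z

environments = [0, 10, -5, 100]
with_agent = sum(environment(s, True) for s in environments) / len(environments)
without_agent = sum(environment(s, False) for s in environments) / len(environments)

# Averaged over counterfactual environments, the future contains more Z
# when the system's distinguishing features have causal impact.
assert with_agent > without_agent
```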
The examples I used seem to apply to "dispositions" to behave in the same way (I wasn't making this distinction). There are settings where the goal can't be clearly inferred from behavior, or from a collection of hypothetical behaviors in response to various environments (at least if we keep the environments relatively close to what might naturally occur), even as in those settings the goal can be observed "directly" (defined as an idealization based on the AI's design).
An AI with an encrypted goal (i.e. the AI itself doesn't know the goal in explicit form, but the goal can be abstractly defined as the result of decryption) won't behave in accordance with it in any environment that doesn't magically let it decrypt its goal quickly; there is no tendency to push events towards what the encrypted goal specifies until the goal is decrypted (which might be never, with high probability).
Here is what I mean:
Evolution was able to come up with cats. Cats are immensely complex objects. Evolution did not intend to create cats. Now consider you wanted to create an expected utility maximizer to accomplish something similar, except that it would be goal-directed, think ahead, and jump fitness gaps. Further suppose that you wanted your AI to create qucks, instead of cats. How would it do this?
Given that your AI is not supposed to search design space at random, but rather look for something particular, you would have to define what exactly qucks are. The problem is that defining what a quck is, is the hardest part. And since nobody has any idea what a quck is, nobody can design a quck creator.
The point is that thinking about the optimization of optimization is misleading, as most of the difficulty is with defining what to optimize, rather than figuring out how to optimize it. In other words, the efficiency of e.g. the scientific method depends critically on being able to formulate a specific hypothesis.
Trying to create an optimization optimizer would be akin to creating an autonomous car to find the shortest route between Gotham City and Atlantis. The problem is not how to get your AI to calculate a route, or optimize how to calculate such a route, but rather that the problem is not well-defined. You have no idea what it means to travel between two fictional cities. Which in turn means that you have no idea what optimization even means in this context, let alone meta-level optimization.
The problem is, you don't have to program the bit that says "now make yourself more intelligent." You only have to program the bit that says "here's how to make a new copy of yourself, and here's how to prove it shares your goals without running out of math."
And the bit that says "Try things until something works, then figure out why it worked." AKA modeling.
The AI isn't actually an intelligence optimizer. But it notes that when it takes certain actions, it is better able to model the world, which in turn allows it to make more paperclips (or whatever). So it'll take those actions more often.
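A toy numerical sketch of that last point (the accuracy curve and step counts are invented for illustration): the agent compares policies purely by paperclip yield, and never scores "intelligence" directly, yet the winning policy is one that spends time on model-building because better models raise output:

```python
def run(steps_spent_modeling: int, total_steps: int = 100) -> float:
    """Paperclips produced by a policy that first models, then produces."""
    # World-model accuracy improves with modeling effort, capped at 10 steps.
    accuracy = 0.1 + 0.9 * min(steps_spent_modeling, 10) / 10
    production_steps = total_steps - steps_spent_modeling
    return accuracy * production_steps

# Compare policies purely by paperclip yield -- no 'intelligence' term anywhere.
yields = {m: run(m) for m in range(0, 30, 5)}
best = max(yields, key=yields.get)
print(best, yields[best])  # 10 90.0 -- some modeling beats none, though
                           # modeling was never a goal in itself
```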