In popular perception, this is how China acts in international relations: it mostly pursues convergent instrumental goals, like increasing its influence and accumulating resources, while holding its strategic goals "in the background", without explicitly "rushing" to achieve them right now.
This reminds me of the saying often attributed to Sun Tzu: "If you wait by the river long enough, the bodies of your enemies will float by."
Does such a strategy count as misalignment? If the beliefs held by the agent are compatible[1] with the overseer's beliefs, I don't think so. The agent's understanding of the world could be deeper, and therefore its cognitive horizon and the "light cone" of its agency and concern (see Levin, 2022; Witkowski et al., 2023) could extend farther or deeper than those of the overseer.
Then, the superintelligent agent could evolve its beliefs to the point where they are no longer compatible with the "original" beliefs of the overseer, or even with the overseer's existence. In the latter case, or if the agent fails to convince the overseers of (a simplified version of) its new beliefs, that would constitute misalignment, of course. But this goes beyond the scope of what is considered in the post and my comment above.
By "compatible", I mean either coinciding or reducible with minimal problems, in the way that general relativity can be reduced to Newtonian mechanics in certain regimes with negligible numerical divergence.
Thanks, that's a great example!
Does such a strategy count as misalignment?
Yeah, I don't think it necessarily counts as misalignment. In fact, corrigibility probably looks behaviorally a lot like this: gathering the ability to affect the world, without making irreversible decisions, and waiting for the overseer to direct how to cash that ability out into ultimate effects. But the hidability means that "ultimate intents" or "deep intents" are conceptually murky, and therefore it's not obvious how to read them off an agent: if you can't discern them through behavior, what can you discern them through?
Only if we know the entire learning trajectory of the AI (including the training data) and have high-resolution interpretability mapping along the way. If we don't have this, or if the AI learns online and is not inspected with mech interp tools during that process, we have no way of knowing about any "deep beliefs" the AI may hold, if it doesn't reveal them in its behavior or "thoughts" (explicit representations during inference).
Tentative GPT4's summary. This is part of an experiment.
Up/Downvote "Overall" if the summary is useful/harmful.
Up/Downvote "Agreement" if the summary is correct/wrong.
If you found it harmful, please let me know why.
(OpenAI doesn't use customers' data anymore for training, and this API account previously opted out of data retention)
TLDR: This article explores the challenges of inferring agent supergoals due to convergent instrumental subgoals and fungibility. It examines goal properties such as canonicity and instrumental convergence and discusses adaptive goal hiding tactics within AI agents.
Arguments:
- Convergent instrumental subgoals often obscure an agent's ultimate ends, making it difficult to infer supergoals.
- Agents may covertly pursue ultimate goals by focusing on generally useful subgoals.
- Goal properties like fungibility, canonicity, and instrumental convergence impact AI alignment.
- The inspection paradox and adaptive goal hiding (e.g., possibilizing vs. actualizing) further complicate the inference of agent supergoals.
Takeaways:
- Inferring agent supergoals is challenging due to convergent subgoals, fungibility, and goal hiding mechanisms.
- A better understanding of goal properties and their interactions with AI alignment is valuable for AI safety research.
Strengths:
- The article provides a detailed analysis of goal-state structures, their intricacies, and their implications on AI alignment.
- It offers concrete examples and illustrations, enhancing understanding of the concepts discussed.
Weaknesses:
- The article's content is dense and may require prior knowledge of AI alignment and related concepts for full comprehension.
- It does not provide explicit suggestions on how these insights on goal-state structures and fungibility could be practically applied for AI safety.
Interactions:
- The content of this article may interact with other AI safety concepts such as value alignment, robustness, transparency, and interpretability in AI systems.
- Insights on goal properties could inform other AI safety research domains.
Factual mistakes:
- The summary does not appear to contain any factual mistakes or hallucinations.
Missing arguments:
- The potential impacts of AI agents pursuing goals not in alignment with human values were not extensively covered.
- The article could have explored in more detail how AI agents might adapt their goals to hide them from oversight without changing their core objectives.
[Metadata: crossposted from https://tsvibt.blogspot.com/2022/12/ultimate-ends-may-be-easily-hidable.html. First completed December 18, 2022.]
Thought and action in pursuit of convergent instrumental subgoals do not automatically reveal why those subgoals are being pursued--towards what supergoals--because many other agents with different supergoals would also pursue those subgoals, maybe with overlapping thought and action. In particular, an agent's ultimate ends don't have to be revealed by its pursuit of convergent subgoals. It might therefore be easy to covertly pursue some ultimate goal by mostly pursuing generally useful subgoals of other supergoals. By the inspection paradox for the convergence of subgoals, it might be easy to think and act almost comprehensively like a non-threatening agent would think and act, while going most of the way towards achieving some other more ambitious goal.
Note: the summary above is the basic idea. The rest of the essay analyzes the idea in a lot of detail. The final main section might be the most interesting.
What can you tell about an agent's ultimate intent by its behavior?
An agent's ultimate intent is what the agent would do if it had unlimited ability to influence the world. What can we tell about an agent's ultimate intent by watching the external actions it takes, whether low-level (e.g. muscle movements) or higher-level (e.g. going to the store), and by watching its thinking (e.g. which numbers it's multiplying, which questions it's asking, which web searches it's running, which concepts are active)?
Terms
Inferring supergoals through subgoals
Suppose that G is an intermediate goal for some agent. By observing more fully the constellation of action and thought the agent does in pursuit of the goal G, we become more confident that the agent is pursuing G. That is, observing the agent pursuing subgoals that constitute much of a sufficient strategy for G is evidence that the agent has the goal G. If we observe that the farmer turns the earth, broadcasts the seed, removes the weeds, sprays for pests, and fences out wild animals, we become more and more sure that ze is trying to grow crops.
But, we don't necessarily know what ze intends to do with the crops, e.g. eat them or sell them. Information doesn't necessarily flow from subgoals up through the fact that the farmer is trying to grow crops, to indicate supergoals; the possible supergoals may be screened off from the observed subgoals. In that case, we're uncertain which of the supergoals in the motivator set of [grow crops] is the one held by the farmer.
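The screening-off point can be put in Bayesian terms. A minimal sketch with invented numbers, where both candidate supergoals (eating vs. selling the crops) predict the observed subgoal pursuit equally well:

```python
# Toy Bayesian version of the screening-off point. Numbers are invented.
# Both candidate supergoals (eat vs. sell) motivate [grow crops], so the
# observed subgoal pursuit is equally likely under either supergoal.
prior = {"eat the crops": 0.3, "sell the crops": 0.7}
likelihood = {"eat the crops": 0.9, "sell the crops": 0.9}  # P(obs | supergoal)

unnormalized = {g: prior[g] * likelihood[g] for g in prior}
total = sum(unnormalized.values())
posterior = {g: p / total for g, p in unnormalized.items()}

# The observation confirms [grow crops] but moves no probability between
# supergoals: the posterior ratio equals the prior ratio.
print(posterior)
```

Because the likelihoods are equal across the motivator set, the observation leaves the distribution over supergoals at its prior, even though it strongly confirms the subgoal [grow crops] itself.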
Suppose an agent is behaving in a way that looks to us like it's pursuing some instantiation g of a goal-state G. (We always abstract somewhat from g to G, e.g. we don't say "the farmer has a goal of growing rhubarb with exactly the lengths [14.32 inches, 15.11 inches, 15.03 inches, ...] in Maine in 2022 using a green tractor while singing sea shanties", even if that's what g is.) Some ways to infer the supergoals of the agent by observing its pursuit of G:
These points leave out how to infer goals from observed behavior, except through observed pursuit of subgoals. This essay takes for granted that some goals can be inferred from [behavior other than subgoal pursuit], e.g. by observing the result of the behavior, by predicting the result of the behavior through simulation, by analogy with other similar behavior, or by gemini modeling the goal.
Messiness in relations between goals
Except for the concept of strategy as an arrangement of goals, the above discussion takes goals in isolation, as if the environment and the agent's skills, resources, language of thought, and other plans are irrelevant to the supergoal-subgoal relation. Really agents somehow apply all their mental elements to generate and choose between overall plans for the whole of their behavior, not individual subgoals, or strategies for subgoals, executed without preconditions or overlap with other strategies.
The subgoal-supergoal vocabulary may be too restrictive, by assuming that agents have goals--that is, that agents pursue goal-states. To be more general, we would assume as little as possible about how an agent controls the cosmos, and leave open that there are agents occupying cognitive realms where there aren't really goals with subgoal-supergoal relations.
It is not the case that goal-/cosmos-states have any a priori central hierarchy of implication, time-space containment, or causality; that is, states we naturally call goals don't have to factor neatly in those ways.
It is also not the case that goals (as agent properties, or their concomitant mental elements) neatly form an acyclic supergoal-subgoal hierarchy, whether by the relations of motivation, delegation, other mental causation or control, or inclusion of sets of pursuant behaviors.
For example, the goal [obtain energy] is related to the goal [mine coal] both as a supergoal (mine coal to get energy) and also as a subgoal (expend energy to move rocks blocking the way to the coal).
As another example, it would be awkward to try to carve boundaries around all the actions taken in service of the goal [grow crops] and its subgoals. Such a boundary would have to be quite uncomfortably intertwined with the boundary drawn around the actions taken in service of the goal [drive crops to the market] and its subgoals, because both goals share the subgoal [pump petroleum from the ground]: which steps taken towards the oil field were in service of [grow crops], and which in service of [drive crops]? If we say, every step is 9/10 in service of [grow crops] and 1/10 in service of [drive crops], then have we given up on drawing subgoal-boundaries around behaviors or world-states? Or, if we exclude from the behavior representing the goal [grow crops] all of the behavior merely in service of the subgoals of [grow crops], then wouldn't we go on excluding subgoal behaviors until there are no behaviors left to constitute the pursuit of the goal [grow crops] itself? Or is there a specialized mental action that exactly sets in motion and orchestrates the pursuit of [growing crops], without getting its hands dirty with any subgoals?
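The energy/coal example can be made concrete by treating "is pursued in service of" as a directed graph and checking it for cycles. A sketch with edges invented for illustration:

```python
# Treat "A is pursued in service of B" as a directed edge A -> B. The
# energy/coal pair forms a cycle, so the supergoal-subgoal relation is
# not a DAG. Edges are invented for illustration.
edges = {
    "mine coal": ["obtain energy"],                 # mine coal to get energy
    "obtain energy": ["mine coal", "grow crops"],   # energy moves rocks; fuels the tractor
    "grow crops": [],
}

def has_cycle(graph):
    """Depth-first search for a back edge, i.e. a cycle."""
    visiting, done = set(), set()
    def dfs(node):
        if node in visiting:
            return True
        if node in done:
            return False
        visiting.add(node)
        if any(dfs(nbr) for nbr in graph.get(node, [])):
            return True
        visiting.remove(node)
        done.add(node)
        return False
    return any(dfs(n) for n in graph)

print(has_cycle(edges))  # True
```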
Factors obscuring supergoals
Fungibility
Note: the two following subsections, Terms and Examples, aren't really needed to read what comes after; their purpose is to clarify the idea of fungibility. Maybe read the third subsection, "Effects on goal structure", and then backtrack if you want more on fungibility.
Terms
If S is use-fungible, then it is effectively-use-fungible: the strategy using s1 already probably works with s2 instead, by use-fungibility.
If S is state-fungible, then it is effectively-use-fungible: given a working strategy π1 using s1, there's a strategy π2 that, given s2, first easily produces s1 from s2, which is doable by state-fungibility, and then follows π1.
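That argument is just strategy composition. A toy sketch, where `convert_s2_to_s1`, `strategy_pi1`, and `strategy_pi2` are all hypothetical stand-ins:

```python
# Sketch of the state-fungibility argument: if s1 is easily produced from
# s2, then any strategy pi1 that works from s1 yields a strategy pi2 that
# works from s2, by composing the conversion with pi1.

def convert_s2_to_s1(s2):
    # State-fungibility assumption: producing s1 from s2 is easy.
    return s2.replace("s2", "s1")

def strategy_pi1(s1):
    # A working strategy that uses s1.
    return f"goal achieved using {s1}"

def strategy_pi2(s2):
    # pi2 = pi1 after conversion: works from s2.
    return strategy_pi1(convert_s2_to_s1(s2))

print(strategy_pi2("resource s2"))  # goal achieved using resource s1
```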
Examples
Effects on goal structure
Canonicity
In "The Unreasonable Effectiveness of Mathematics in the Natural Sciences", Wigner discusses the empirical phenomenon that theories in physics are expressed using concepts that mathematicians formed by playing around with ideas, selecting those apt for demonstrating a sense of formal beauty and ingenious skill at manipulating ideas.
Some of this empirical phenomenon could be explained by abstract mathematical concepts being canonical. That is, there's in some sense only one form, or very few forms, that this concept can take. Then, when a concept is discovered once by mathematicians, it is discovered in roughly the same form as will be useful later in another context. Compare Thingness.
Canonicity is not the same thing as simplicity. Quoting Wigner:
However, it might be that canonicity is something like, maximal simplicity within the constraint of the task at hand. Quoting Einstein:
The mathematician seeks contexts where the simplest concepts adequate to the tasks of the context are themselves interesting. Repeatedly seeking in that way builds up a library of canonical concepts, which are applied in new contexts.
Extreme canonicity is far from necessary. It does not always hold, even in abstract mathematics (and might not ever hold). For example, quoting:
The same concept might be redescribed in a number of different ways; there's no single canonical definition.
Canonicity could be viewed as a sort of extreme of fungibility, as if there were only one piece of gold in the cosmos, so that having a piece of gold is trivially fungible with all ways of having a piece of gold. All ways of comprehending a concept are close to fungible, since any definition can be used to reconstruct other ways of understanding or using the concept. (This is far from trivial in individual humans, but I think it holds fairly well among larger communities.)
Compare also Christopher Alexander's notion of a pattern and a pattern language.
Effects on goal structure
Instrumental convergence
A convergent instrumental goal is a goal that would be pursued by many different agents in service of many different supergoals or ultimate ends. See Arbital "Instrumental convergence" and a list of examples of plausible convergent instrumental strategies.
Or, as a redefinition: a convergent instrumental goal is a goal with a large motivator set. ("Large" means, vaguely, "high measure", e.g. many goals held by agents that are common throughout the multiverse according to some measure.) Taking this definition literally, a goal-state G might be "instrumentally convergent" by just being a really large set. For example, if G is [do anything involving the number 3, or involving electricity, or involving the logical AND operation, or involving a digital code, or involving physical matter], then a huge range of goals have some instantiation of G as a subgoal. This is silly. Really what we mean is to include some notion of naturality, so that different instantiations of G are legitimately comparable. Fungibility is some aspect of natural, well-formed goals: any instantiation of G should be about as useful for supergoals as any other instantiation of G.
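The "large motivator set" redefinition can be illustrated with a toy count, using invented goals and subgoal links:

```python
# Toy measure of instrumental convergence: for each subgoal, count the
# supergoals that motivate it (its motivator set). Goals and links are
# invented for illustration.
supergoal_subgoals = {
    "cure disease": ["acquire resources", "preserve self", "study biology"],
    "win at chess": ["acquire resources", "preserve self", "study chess"],
    "paint murals": ["acquire resources", "preserve self", "buy paint"],
}

motivator_set = {}
for supergoal, subgoals in supergoal_subgoals.items():
    for sub in subgoals:
        motivator_set.setdefault(sub, set()).add(supergoal)

# Subgoals with the largest motivator sets are the convergent ones.
ranked = sorted(motivator_set, key=lambda s: -len(motivator_set[s]))
print(ranked[:2])  # ['acquire resources', 'preserve self']
```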
Canonicity might be well-described as a combination of two properties: well-formedness (fungibility, Thingness), and instrumental convergence.
Fungibility spirals
A goal-state G is enfungible if agents can feasibly make G more fungible.
Suppose G is somewhat convergent and somewhat enfungible, even if it's not otherwise very fungible / natural. Then agents with goals in the motivator set of G will put instrumental value on making G more fungible. That is, such agents would find it useful (if not maximally useful) to enfunge G, to make it more possible to use different instantiations of G towards more supergoals, because for example it might be that g⊂G is best suited to an agent's supergoals, but g′⊂G is easiest to obtain.
If G is enfunged, that increases the pool of agents who might want to further enfunge G. For example, suppose agents Ai have goals gi⊂G, and suppose that the gi aren't currently fungible, and that gi is easier to obtain than gj when i>j. First A0 works to make g1 fungible into g0. Once that happens, now both A0 and A1 want to make g2 fungible into g1. And so on.
In this way, there could be a fungibility spiral, where agents (which might be "the same" agent at different times or branches of possibility) have instrumental motivation to make G very fungible. An example is the history of humans working with energy. By now we have the technology to efficiently store, transport, apply, and convert between many varieties of energy (which actions are each an implementation of fungibility). As another example, consider the work done to do computational tasks using different computing hardware.
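The A_i cascade from the example above can be sketched as a loop; a toy model, not a claim about real dynamics:

```python
# Toy fungibility spiral: agent A_i holds goal g_i, and g_i is easier to
# obtain than g_j when i > j. At each step, every agent holding an
# earlier (harder) goal benefits from making the next (easier)
# instantiation fungible into the previous ones. Purely illustrative.
n = 4
convertible_into = [set() for _ in range(n)]  # which earlier goals g_i can serve

for step in range(1, n):
    beneficiaries = list(range(step))  # holders of g_0 .. g_{step-1}
    convertible_into[step] = set(beneficiaries)
    print(f"step {step}: agents {beneficiaries} enfunge g_{step}")

# After the spiral, the easiest instantiation g_3 serves every agent:
print(sorted(convertible_into[n - 1]))  # [0, 1, 2]
```

The point of the sketch is just that each enfunging step enlarges the pool of agents with an instrumental stake in the next step.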
Existence
Evidence for the existence of convergent instrumental goals:
Effects on goal structure
The inspection paradox for subgoals
The Friendship paradox: "most people have fewer friends than their friends have, on average". Relatedly, the inspection paradox:
Analogously:
Very instrumentally convergent goals are held by more agents than non-convergent goals, so, just as a bus passenger usually finds themselves on an overcrowded bus, an agent holding some subgoal is usually holding an overcrowded subgoal. In particular:
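The bus analogy is checkable with a few lines of arithmetic (occupancies invented): averaging occupancy over buses versus over passengers gives different answers, because a random passenger oversamples crowded buses.

```python
# Inspection paradox, bus version: the average bus is not the average
# passenger's bus. Occupancies are invented for illustration.
buses = [50] * 2 + [5] * 10  # 2 crowded buses, 10 nearly empty ones

# Average over buses: each bus counts once.
bus_avg = sum(buses) / len(buses)

# Average over passengers: each bus counts once per passenger aboard,
# so crowded buses are oversampled.
passengers = [occ for occ in buses for _ in range(occ)]
passenger_avg = sum(passengers) / len(passengers)

print(bus_avg, passenger_avg)  # 12.5 35.0
```

Analogously, sampling a subgoal by picking a random agent-holding-it oversamples the convergent subgoals.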
Hidden ultimate ends
[I got tired of writing this essay, so this section is less thoroughly thought out / expressed / distilled. It's probably the most important section though. Feel free to ask questions at my gmail address: tsvibtcontact]
Suppose we're observing a mind thinking and acting. What can we tell about its ultimate goals?
Are convergent instrumental goals near-universal for possibilizing?
Possibilizing is making things possible, in contrast to actually doing things. This might not be a principled distinction, since whatever is called possibilizing is surely also actualizing something, but still there's some intuitive spectrum. E.g. speed cubers first look at the Rubik's cube, then put it down, and then pick it up and turn its faces. Before they pick it up the second time, in their head they're possibilizing solving the cube by figuring out the sequence of moves. When they pick it up, they actualize solving the cube. An engineer designing a machine possibilizes the work the machine will do; building the machine is midway between possibilizing and actualizing the work; applying the machine actualizes the work. See possibilizing vs. actualizing.
Question:
In other words: given some random supergoal G, how close can you get to achieving G--how little additional work can you leave undone--if you're only allowed to behave in ways that many agents would behave? I don't know. I suspect you can get pretty close. Some intuitions behind that suspicion:
Non-adaptive goal hiding
Suppose the mind is not adapted in any way to stop us from understanding the mind's ultimate goals. Can we determine what the mind's ultimate goals are? Some obstacles:
Adaptive goal hiding
Suppose an agent A adapts in some way so that it isn't visibly pursuing goals that an overseer wouldn't want it to pursue. The adaptation could be passive, e.g. we apply a training loss to thoughts or actions that score highly on some detector of having certain goals, or it could be active, e.g. A itself is searching out ways to arrange itself so that we don't interfere with its goals.