I upvoted this but I also wish it made clearer distinctions. In particular, I think it misses that the following can all be true:
Strongly upvoted this comment.
But I'm not up for reworking the post right now.
And I'm not sure I fully agree with it, particularly point 1. I mean "goal directedness as argmax over action [or higher level mappings thereof] to maximise the expected value of a simple unitary utility function" and goal directedness as "contextually activated heuristics downstream of historical reinforcement events" seem to make such different predictions about the future (especially in the extremes) that I'm not actually sure I want to call the things that humans do "goal directed". It seems unhelpful to overload the term when referring to such different decision making procedures.
I mean I do in fact think one can define a spectrum of goal directedness, but I think the extreme end of that spectrum (argmax) is anti-natural, that the sophisticated systems we have in biology and ML both look more like "executing computations/cognition that historically correlated with higher performance on the objective function a system was selected for performance on" and that this is a very important distinction.
"Contextually activated heuristics" seem to be of an importantly different type than "immutable terminal goals" or a "simple unitary utility function".
I think for the purposes of this post, that distinction is very important and I need to emphasise it more.
(this response is cross-posted as a top-level post on my blog)
i expect the thing that kills us if we die, and the thing that saves us if we are saved, to be strong/general coherent agents (SGCA) which maximize expected utility. note that this is two separate claims; it could be that i believe the AI that kills us isn't SGCA, but the AI that saves us still has to be SGCA. i could see shifting to that latter viewpoint; i currently do not expect myself to shift to believing that the AI that saves us isn't SGCA.
to me, this totally makes sense in theory, to imagine something that just formulates plans-over-time and picks the argmax for some goal. the whole of instrumental convergence is coherent with that: if you give an agent a bunch of information about the world, and the ability to run eg linux commands, there is in fact an action that maximizes the amount of expected paperclips in the universe, and that action does typically entail recursively self-improving and taking over the world and (at least incidentally) destroying everything we value. the question is whether we will build such a thing any time soon.
right now, we have some specialized agentic AIs: alphazero is pretty good at reliably winning at go; it doesn't "get distracted" with other stuff. to me, waiting for SGCA to happen is like waiting for a rocket to get to space in the rocket alignment problem: once the rocket is in space it's already too late. the whole point is that we have to figure this out before the first rocket gets to space, because we only get to shoot one rocket to space. one has to build an actual inside view understanding of agenticity, and figure out if we'll be able to build that or not. and, if we are, then we need to solve alignment before the first such thing is built — you can't just go "aha, i now see that SGCA can happen, so i'll align it!" because by then you're dead, or at least past its decisive strategic advantage.
i'm not sure how to convey my own inside view of why i think SGCA can happen, in part because it's capability exfohazardous. maybe one can learn from IEM or the late 2021 MIRI conversations? i don't know where i'd send someone to figure this out, because i think i largely derived it from the empty string myself. it does strongly seem to me that, while a single particular neural net might not be the first thing to be an SGCA, we can totally bootstrap SGCA from existing ML technology; it might just take a clever trick or two rather than being the completely direct solution of "oh you train it like this and then it becomes SGCA". recursive self-improvement is typically involved.
we also have some AIs, including sydney, which aren't SGCA. it might even be that SGCA is indeed somewhat unnatural for a lot of current deep learning capabilities. nevertheless, i believe such a thing is likely enough to be built that it's what it takes for us to die; maybe non-SGCA AI's impact on the economy would slowly disempower us over the course of 20~40 years, but in those worlds AI tech gets good enough that 5 years into it someone figures out the right clever trick to build (something that bootstraps to) SGCA and we die of agentic intelligence explosion very fast before we get to see the slow economic disempowerment. in addition, i believe that our best shot is to build an aligned SGCA.
why haven't animals or humans gotten to SGCA? well, what would getting from messy biological intelligences to SGCA look like? typically, it would look like one species taking over its environment while developing culture and industrial civilization, overcoming in various ways the cognitive biases that happened to be optimal in its ancestral environment, and eventually building more reliable hardware such as computers and using those to make AI capable of much more coherent and unbiased agenticity.
that's us. this is what it looks like to be the first species to get to SGCA. most animals are strongly optimized for their local environment, and don't have the capabilities to be above the civilization-building criticality threshold that lets them build industrial civilization and then SGCA AI. we are the first one to get past that threshold; we're the first one to fall in an evolutionary niche that lets us do that. this is what it looks like to be the biological bootstrap part of the ongoing intelligence explosion; if dogs could do that, then we'd simply observe being dogs in the industrialized dog civilization, trying to solve the problem of aligning AI to our civilized-dog values.
we're not quite SGCA ourselves because, turns out, the shortest path from ancestral-environment-optimized life to SGCA is to build a successor that is much closer to SGCA. if that successor is still not quite SGCA enough, then its own successor will probly be. this is what we're about to do, probly this decade, in industrial civilization. maybe if building computers was much harder, and brains were more reliable to the point that rational thinking was not a weird niche thing you have to work on, and we got an extra million years or two to evolutionarily adapt to industrialized society, then we'd become properly SGCA. it does not surprise me that that is not, in fact, the shortest path to SGCA.
i expect the thing that kills us if we die, and the thing that saves us if we are saved, to be strong/general coherent agents (SGCA) which maximize expected utility. note that this is two separate claims; it could be that i believe the AI that kills us isn't SGCA, but the AI that saves us still has to be SGCA. i could see shifting to that latter viewpoint; i currently do not expect myself to shift to believing that the AI that saves us isn't SGCA.
I don't share the pivotal act framing, so "AI that saves us" isn't something I naturally accommodate.
to me, this totally makes sense in theory, to imagine something that just formulates plans-over-time and picks the argmax for some goal. the whole of instrumental convergence is coherent with that
My contention is that "instrumental convergence" is itself something that needs to be rethought. From the post:
I think that updating against strong coherence would require rethinking the staples of (traditional) alignment orthodoxy:
* Instrumental convergence (see[2])
This is not to say that they are necessarily no longer relevant in systems that aren't strongly coherent, but that to the extent they manifest at all, they manifest in (potentially very) different ways than originally conceived when conditioned on systems with immutable terminal goals.
So a core intuition underlying this contention is something like: "strong coherence is just a very unnatural form for the behaviour of intelligent systems operating in the real world to take".
And I'd describe that contention as something like:
Decision making in intelligent systems is best described as "executing computations/cognition that historically correlated with higher performance on the objective function a system was selected for performance on".
With the implication that decision making is poorly described as:
(An approximation of) argmax over actions (or higher level mappings thereof) to maximise (the expected value of) a simple unitary utility function
That expected utility maximisation is something that can happen does not at all imply that expected utility maximisation is something that will happen.
I find myself in visceral agreement with (almost the entirety of) @cfoster0's reply. In particular:
- Goal-directedness in learning-based agents takes the form of contextual decision-influences (shards) steering cognition and behavior.
- [...]
- Even as they resolve these incoherences, agents will not need or want to become utility maximizers globally, as that would require them to self-modify in a way inconsistent with their existing preferences.
Agents with malleable values do not self modify to become expected utility maximisers. Thus an argument that expected utility maximisers can exist does not to me appear to say anything particularly interesting about the nature of generally intelligent systems in our universe.
why haven't animals or humans gotten to SGCA? well, what would getting from messy biological intelligences to SGCA look like? typically, it would look like one species taking over its environment while developing culture and industrial civilization, overcoming in various ways the cognitive biases that happened to be optimal in its ancestral environment, and eventually building more reliable hardware such as computers and using those to make AI capable of much more coherent and unbiased agenticity.
that's us. this is what it looks like to be the first species to get to SGCA. most animals are strongly optimized for their local environment, and don't have the capabilities to be above the civilization-building criticality threshold that lets them build industrial civilization and then SGCA AI. we are the first one to get past that threshold; we're the first one to fall in an evolutionary niche that lets us do that. this is what it looks like to be the biological bootstrap part of the ongoing intelligence explosion; if dogs could do that, then we'd simply observe being dogs in the industrialized dog civilization, trying to solve the problem of aligning AI to our civilized-dog values.
Would you actually take a pill that turned you into an expected utility maximiser[1]? Yes or no please.
Over a simple unitary utility function.
Agents with malleable values do not self modify to become expected utility maximisers.
These agents could avoid modifying themselves, but still build external things that are expected utility maximizers (or otherwise strong coherent optimizers). So what use is this framing?
The meaningful claim would be agents with malleable values never building coherent optimizers, and it's a much stronger claim, close to claiming that those agents won't build any AGIs with novel designs. Humans are currently in the process of building AGIs with novel designs.
These agents could avoid modifying themselves, but still build external things that are expected utility maximizers (or otherwise strong coherent optimizers). So what use is this framing?
Take a look at the case I outlined in Is "Strong Coherence" anti-natural?.
I'd be interested in following up with you after conditioning on that argument.
Replied with a clearer example for the (moral) framing argument and a few more words on the misalignment argument as a comment to that post. (I don't see the other post answering my concerns; I did skim it even before making the grandparent comment in this thread.)
Mhmm, so the argument I had was that:
The optimisation processes that construct intelligent systems operating in the real world do not construct utility maximisers
Systems with malleable values do not self modify to become utility maximisers
You contend that systems with malleable values can still construct utility maximisers.
I agree that humans can program utility maximisers in simplified virtual environments, but we don't actually know how to construct sophisticated intelligent systems via design; we can only construct them as the product of search-like optimisation processes.
From #1: we don't actually know how to construct competent utility maximisers even if we wanted to
This generalises to future intelligent systems
Where in the above chain of argument do you get off?
The misalignment argument ignores all moral arguments: we just build whatever, even if it's a very bad idea. If we don't have the capability to do that now, we might gain it in 5 years, or LLM characters might gain it 5 weeks after waking up, and surely 5 years after waking up and disassembling the moon to gain moon-scale compute.
There'd need to be an argument that fixed goal optimizers are impossible in principle even if they are sought to be designed on purpose, and this seems false, because you can always wrap a mind in a plan evaluation loop. It's just a somewhat inefficient weird algorithm, and a very bad idea for most goals. But with enough determination efficiency will improve.
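A minimal sketch of what "wrap a mind in a plan evaluation loop" could look like, under strong simplifying assumptions. The names `plan_evaluation_loop`, `toy_mind` and `paperclip_score` are hypothetical, and the "mind" here is just a random plan generator standing in for any system that can propose plans, however incoherent it is on its own.

```python
import random
from typing import Callable, Sequence

Plan = Sequence[str]


def plan_evaluation_loop(
    propose: Callable[[random.Random], Plan],
    score: Callable[[Plan], float],
    n_candidates: int = 50,
    seed: int = 0,
) -> Plan:
    """Sample candidate plans from `propose` and return the one the
    fixed `score` function rates highest (a best-of-n selection loop)."""
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n_candidates)]
    return max(candidates, key=score)


def toy_mind(rng: random.Random) -> Plan:
    # Hypothetical stand-in for the wrapped mind: it proposes plans
    # with no regard for the goal imposed by the outer loop.
    return tuple(rng.choice(["make_clip", "rest", "explore"]) for _ in range(5))


def paperclip_score(plan: Plan) -> float:
    # The fixed goal the outer loop optimises for.
    return float(plan.count("make_clip"))


chosen = plan_evaluation_loop(toy_mind, paperclip_score)
```

The wrapped proposer never changes; all of the goal-directedness lives in the outer selection step, which is why this is inefficient but hard to rule out in principle.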
i expect the thing that kills us if we die, and the thing that saves us if we are saved, to be strong/general coherent agents
I agree in the sense that strong optimization is the likely shape of equilibrium (though I wouldn't go so far as to say it's utility maximization specifically), and in that equilibrium humanity is either fine or not. Conversely, while humanity remains alive, the doom status of the eventual outcome remains in question until there is a strong optimization equilibrium. Doom could come sooner, but singularity is fast in physical time, so the distinction doesn't necessarily matter.
But do you expect humans to build strong optimization? The way things are going, it's weakly coherent AGIs that are going to build strong optimization, while any alignment-relevant things humanity can do are not going to be about alignment of strong optimization, they are instead about alignment of weakly coherent AGIs (with LLM characters as the obvious candidate for successful alignment, and much more tenuous grounds for alignability of other things).
Agree on SGCA, if only because something is likely to self-modify to one, disagree on expected utility maximization necessarily being the most productive way to think of it.
Consider the following two hypothetical agents:
Agent 1 follows the deontological rule of choosing the action that maximizes some expected utility function.
Agent 2 maximizes expected utility, where utility is defined as how well an objective god's-eye-view observer would rate Agent 2's conformance to some deontological rule.
Obviously agent 1 is more naturally expressed in utilitarian terms, and agent 2 in deontological terms, though both are both and both can be coherent.
Now, when we try to define what decision procedure an aligned AI could follow, it might turn out that there's no easy way to express what we want it to do in purely utilitarian terms, but it might be easier in some other terms.
I especially think that's likely to be the case for corrigibility, but also for alignment generally.
i mean sure but i'd describe both as utility maximizers because maximizing utility is in fact what they consistently do. Dragon God's claim seems to be that we wouldn't get an AI that would be particularly well predicted by utility maximization, and this seems straightforwardly false of agents 1 and 2.
Yes, but:
While (a) is risky (b) seems worse to me.
I want advocates of strong coherence to explain why [...] sophisticated ML systems (e.g. foundation models[5]) aren't strongly coherent.
I wouldn't call myself an advocate for strong coherence, but two answers come to mind depending on how strong coherence is defined:
(I very recently posted a thingy about this.)
Is "mode collapse" in RLHF'ed models an example of increased coherence?
I think the more useful near-term frame is that RLHF (mostly when applied as a not-conditioning-equivalent fine-tuning RL process) is giving the optimizer more room to roam, and removing the constraints that forced the prediction training to maintain output distributions. The reward function for the fine-tuned trainee looks sparser. KL penalties can help maintain some of the previous constraints, but it looks like the usual difficulties with RL prevent that from being as robust as pure predictive training.
In the limit, I would assume RL leads to increased "coherence" in the sense of sharpness, because it's very likely that the learned reward induced by RL training is much narrower and sparser than the predictive training objective. (It may not increase coherence in the sense of "not stepping on its own toes" because it was already pretty good at that.)
Eliciting the desired behavior through conditioning learned during the densely defined predictive training objective rather than a pure RL postpass seems wise regardless.
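To make the role of the KL penalty concrete, here is a toy sketch of a generic KL-regularised objective over a four-outcome categorical distribution. It is not any lab's RLHF implementation; the distributions, the reward vector and the names `kl_divergence` and `penalised_objective` are made up for illustration.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for categorical distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalised_objective(
    policy: Sequence[float],
    reference: Sequence[float],
    rewards: Sequence[float],
    beta: float,
) -> float:
    """Expected reward minus beta times the KL from the reference distribution."""
    expected_reward = sum(pi * r for pi, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)


reference = [0.25, 0.25, 0.25, 0.25]  # stands in for the pretrained predictor
rewards = [1.0, 0.0, 0.0, 0.0]        # sparse reward: only one output is liked
collapsed = [0.97, 0.01, 0.01, 0.01]  # a sharply "mode collapsed" policy

no_penalty_gap = (
    penalised_objective(collapsed, reference, rewards, beta=0.0)
    - penalised_objective(reference, reference, rewards, beta=0.0)
)
with_penalty_gap = (
    penalised_objective(collapsed, reference, rewards, beta=1.0)
    - penalised_objective(reference, reference, rewards, beta=1.0)
)
```

With `beta=0.0` the collapsed policy scores higher than staying at the reference; with `beta=1.0` the ordering flips in this toy setup, which is the sense in which the penalty preserves some of the earlier constraint.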
I don't know whether I agree with "strong coherence" or not because it seems vaguely defined. Can you give three examples of arguments that assume strong coherence? (Because the definition becomes crisper when applied than when theoretical.)
Examples of strong coherence: Assuming AI systems:
I think even Wentworth's Subagents is predicated on an assumption of stronger coherence than obtains in practice. I think humans aren't actually fully modeled as a fixed committee of agents making pareto optimal decisions with respect to their fixed utility functions. The ways in which I think Wentworth's Subagents falls short are:
And like I don't think those particular features are necessarily just incoherencies. Malleable values are I think just a facet of generally intelligent systems in our universe.
See also: nostalgebraist's "why assume AGIs will optimize for fixed goals?".
And this comment.
I do think Subagents could be adapted/extended to model human preferences adequately, but the inherent inconsistency and context dependence of said preferences makes me think the agent model may be somewhat misleading.
E.g. Rob Bensinger suggested that agents may self modify to become more coherent over time.
Arguments that condition on strong coherence:
The ways in which I think Wentworth's Subagents falls short are:
- Human preferences change over time
- Human preferences don't have fixed weights, but activate to different degrees in particular contexts
What are your preferred examples of this?
Arguments that condition on strong coherence:
- Deceptive alignment in mesa-optimisers
- General failure modes from utility maximising
Can you link your preferred examples of this?
Meta note: I dislike this style of engagement.
It feels like a lot of effort to reply to for me, with little additional value being provided by my reply/from the conversation.
Wait, why is it a lot of effort to reply? I'd have expected it to just involve considering the factors that made you endorse these memes and then picking one of the factors to dump as an example.
As for value, I think examples are valuable because they help ground things and make it easier to provide alternate perspectives on things.
Replies are effortful in general, and it feels like I'm not really making much progress in the conversation for the effort I'm putting in.
🤷 Up to you. You were the one who tagged me in here. I can just ignore your post if you don't feel like giving examples of what you are talking about.
But I will continue to ask for examples if you tag me in the future because I think examples are extremely valuable.
I'd be interested in more investigation into what environments/objective functions select for coherence and to what degree said selection occurs.
Basically, such environments are called "intelligent systems". I mean that if we train shard-agent to some lowest-necessary level of superintelligence, it will look at itself and say "Whoa, what a mess of context-dependent decision-making influencers I got here! I should refine it into something more like utility function if I want to get more things that I want" and I see literally no reason for it to not do it. You can't draw any parallels with actual intelligent systems here because actual intelligent systems don't have ability to self-modify.
(Please, let's not conduct experiments with self-modifying ML systems to check it?)
Secondly, I disagree with the framing of "strong coherence", because it seems to me to imply some qualitative difference in coherence. It's not that there are "coherent" and "not coherent" systems; there are more and less coherent systems. Fully coherent systems are likely impossible in our world, because full coherence is computationally intractable. It's expected that future superintelligent systems will be incoherent and exploitable from the POV of platonic unbounded agents; that just doesn't change anything from our perspective: we should still treat them as helluva coherent. It makes no difference that a superintelligence will predictably (for platonic unbounded agents) choose a strategy whose probability of turning the universe into paperclips is one bazillionth less than optimal.
I can agree that the direct result of SGD that produces the first superintelligent system might not be much more coherent than a human. I don't see reason for it to stay at that level of coherence.
Basically, such environments are called "intelligent systems". I mean that if we train shard-agent to some lowest-necessary level of superintelligence, it will look at itself and say "Whoa, what a mess of context-dependent decision-making influencers I got here! I should refine it into something more like utility function if I want to get more things that I want" and I see literally no reason for it to not do it. You can't draw any parallels with actual intelligent systems here because actual intelligent systems don't have ability to self-modify.
Ridiculously strong levels of assuming your conclusion and failing to engage with the argument. Like this does not engage at all with the core contention that strong coherence may be anti-natural to generally intelligent systems in our universe.
Why did actual intelligent systems not develop as maximisers of a simple unitary utility function if that was actually optimal?
Why did evolution converge to agents that developed terminal values for drives that were instrumental for raising inclusive genetic fitness in their environment of evolutionary adaptedness?
I don't see reason for it to stay at that level of coherence.
You have not actually justified the assumption that it would self modify into strong coherence.
That strong coherence is optimal.
That systems with malleable values would necessarily want to self modify into immutable terminal goals.
I would NOT take a pill to turn myself into an expected utility maximiser.
Polished from my shortform
See also: Is "Strong Coherence" Anti-Natural?
Introduction
Many AI risk failure modes imagine strong coherence/goal directedness[1] (e.g. [expected] utility maximisers).
Such strong coherence is not represented in humans (or any other animal), seems unlikely to emerge from deep learning and may be "anti-natural" to general intelligence in our universe[2][3].
I suspect the focus on strongly coherent systems was a mistake that set the field back a bit, and it's not yet fully recovered from that error[4].
I think most of the AI safety work for strongly coherent agents (e.g. decision theory) will end up inapplicable/useless for aligning powerful systems, because powerful systems in the real world are "of an importantly different type".
Ontological Error?
I don't think it nails everything, but on a purely ontological level, @Quintin Pope and @TurnTrout's shard theory feels a lot more right to me than e.g. HRAD. HRAD is based on an ontology that seems to me to be mistaken/flawed in important respects.
The shard theory account of value formation (while lacking) seems much more plausible as an account of how intelligent systems develop values (where values are "contextual influences on decision making") than the immutable terminal goals in strong coherence ontologies. I currently believe that (immutable) terminal goals is just a wrong frame for reasoning about generally intelligent systems in our world (e.g. humans, animals and future powerful AI systems)[2].
Theoretical Justification and Empirical Investigation Needed
I'd be interested in more investigation into what environments/objective functions select for coherence and to what degree said selection occurs.
And empirical demonstrations of systems that actually become more coherent as they are trained for longer/"scaled up" or otherwise amplified.
I want advocates of strong coherence to explain why agents operating in rich environments (e.g. animals, humans) or sophisticated ML systems (e.g. foundation models[5]) aren't strongly coherent.
And mechanistic interpretability analysis of sophisticated RL agents (e.g. AlphaStar, OpenAI Five [or replications thereof]) to investigate their degree of coherence.
Conclusions
Currently, I think strong coherence is unlikely (plausibly "anti-natural"[3][2]) and am unenthusiastic about research agendas and threat models predicated on strong coherence.
Disclaimer
The above is all low confidence speculation, and I may well be speaking out of my ass[6].
By "strong coherence/goal directedness" I mean something like:
Informally: a system has immutable terminal goals.
Semi-formally: a system's decision making is well described as (an approximation of) argmax over actions (or higher level mappings thereof) to maximise the expected value of a single fixed utility function over states.
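As a rough illustration of the semi-formal definition, a toy sketch of "argmax over actions of expected utility" follows. The two-action lottery, the utility table and the names `expected_utility` and `argmax_action` are assumptions made up for the example; nothing here is meant as a model of a real system.

```python
from typing import Callable, Dict, Hashable, Iterable

State = Hashable
Action = Hashable


def expected_utility(
    action: Action,
    outcome_probs: Callable[[Action], Dict[State, float]],
    utility: Callable[[State], float],
) -> float:
    """Probability-weighted utility of the states an action can lead to."""
    return sum(p * utility(s) for s, p in outcome_probs(action).items())


def argmax_action(
    actions: Iterable[Action],
    outcome_probs: Callable[[Action], Dict[State, float]],
    utility: Callable[[State], float],
) -> Action:
    """Pick the action with the highest expected utility under one fixed utility function."""
    return max(actions, key=lambda a: expected_utility(a, outcome_probs, utility))


# Hypothetical toy world: each action maps to a distribution over end states.
OUTCOMES = {
    "safe": {"small_win": 1.0},
    "gamble": {"big_win": 0.5, "nothing": 0.5},
}
UTILITY = {"small_win": 4.0, "big_win": 10.0, "nothing": 0.0}

best = argmax_action(OUTCOMES, OUTCOMES.__getitem__, UTILITY.__getitem__)
```

The defining feature is that `UTILITY` is a single fixed table consulted in every situation; the choice never depends on context except through the predicted end states.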
You cannot well predict the behaviour/revealed preferences of humans or other animals by the assumption that they have immutable terminal goals or are expected utility maximisers.
The ontology that intelligent systems in the real world instead have "values" (contextual influences on decision making) seems to explain their observed behaviour (and purported "incoherencies") better.
Many observed values in humans and other mammals (see[7]) (e.g. fear, play/boredom, friendship/altruism, love, etc.) seem to be values that were instrumental for increasing inclusive genetic fitness (promoting survival, exploration, cooperation and sexual reproduction/survival of progeny respectively). Yet, humans and mammals seem to value these terminally and not because of their instrumental value for inclusive genetic fitness.
That the instrumentally convergent goals of evolution's fitness criterion manifested as "terminal" values in mammals is IMO strong empirical evidence against the goals ontology and significant evidence in support of shard theory's basic account of value formation in response to selection pressure.
This is not to say that I think all coherence arguments are necessarily dead on arrival, but rather that, in practice, I think coherent behaviour (not executing strictly dominated strategies) acts upon our malleable values to determine our decisions. We do not replace said values with argmax over a preference ordering.
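A small sketch of the weak notion of coherence used here, "not executing strictly dominated strategies", assuming strategies can be summarised as payoff vectors across a fixed set of contexts. The payoff numbers and the names `strictly_dominates` and `undominated` are hypothetical.

```python
from typing import Dict, Tuple


def strictly_dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if payoff vector `a` is better than `b` in every context."""
    return all(x > y for x, y in zip(a, b))


def undominated(payoffs: Dict[str, Tuple[float, ...]]) -> Dict[str, Tuple[float, ...]]:
    """Drop every strategy that some other strategy strictly dominates."""
    return {
        name: row
        for name, row in payoffs.items()
        if not any(
            strictly_dominates(other_row, row)
            for other, other_row in payoffs.items()
            if other != name
        )
    }


# Hypothetical payoffs of three strategies in two contexts.
PAYOFFS = {"A": (3.0, 2.0), "B": (1.0, 1.0), "C": (2.0, 3.0)}
remaining = undominated(PAYOFFS)
```

The filter removes `B` but says nothing about how to choose between `A` and `C`; on the view in this footnote, that remaining choice is left to the (malleable, contextual) values rather than to a fixed preference ordering.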
As @TurnTrout says:
E.g. if the shard theory account of value formation is at all correct, particularly the following two claims:
* Values are inherently contextual influences on decision making
* Values (shards) are strengthened (or weakened) via reinforcement events
Then strong coherence in the vein of utility maximisation just seems like an anti-natural form. See also[2]. I think evolutionarily convergent "terminal" values provide (strong) empirical evidence against the naturalness of strong coherence.
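The two claims above can be caricatured in a few lines of code. This is a toy sketch of my own, not shard theory's formal model: the class `ShardAgent`, its tabular weights and the update rule are all assumptions chosen for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


class ShardAgent:
    """Toy agent whose 'values' are context-specific influences on action choice."""

    def __init__(self, actions: Iterable[str], learning_rate: float = 0.5) -> None:
        self.actions: List[str] = list(actions)
        self.learning_rate = learning_rate
        # weights[context][action]: how strongly this context pulls toward this action.
        self.weights: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))

    def act(self, context: str) -> str:
        # Only the influences activated by the current context get a say.
        active = self.weights[context]
        return max(self.actions, key=lambda a: active[a])

    def reinforce(self, context: str, action: str, reward: float) -> None:
        # A reinforcement event strengthens (or weakens) the influence
        # that was active in this context, and no other.
        self.weights[context][action] += self.learning_rate * reward


agent = ShardAgent(["eat", "play"])
agent.reinforce("hungry", "eat", reward=1.0)
agent.reinforce("bored", "play", reward=1.0)
```

After these two events the agent eats when hungry and plays when bored, yet nowhere in it is there a single utility function over states that the choices are an argmax of; the weights are just a record of what was reinforced where.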
I could perhaps state my thesis that strong coherence is anti-natural more succinctly as:
[This generalises the shard theory account of value formation from reinforcement learning to arbitrary constructive optimisation processes.]
It is of an importantly different type from the "immutable terminal goals"/"expected utility maximisation" I earlier identified with strong coherence.
I'm given the impression that the assumption of strong coherence is still implicit in some current AI safety failure modes [e.g. it underpins deceptive alignment[8]].
Is "mode collapse" in RLHF'ed models an example of increased coherence?
I do think that my disagreements with e.g. deceptive alignment/expected utility maximisation are not simply a failure of understanding, but I am very much an ML noob, so there can still be things I just don't know. My opinions re: coherence of intelligent systems would probably be different in a significant way by this time next year.
I mention mammals because I'm more familiar with them, not necessarily because only mammals display these values.
In addition to the other prerequisites listed in the "Deceptive Alignment" post, deceptive alignment also seems to require a mesa-optimiser so coherent that it would be especially resistant to modifications to its mesa-objective. That is it requires very strong levels of goal content integrity.
I think that updating against strong coherence would require rethinking the staples of (traditional) alignment orthodoxy:
* Orthogonality
* Basic AI drives
* Instrumental convergence (see[2])
This is not to say that they are necessarily no longer relevant in systems that aren't strongly coherent, but that to the extent they manifest at all, they manifest in (potentially very) different ways than originally conceived when conditioned on systems with immutable terminal goals.