
(Epistemic status: attempting to clear up a misunderstanding about points I have attempted to make in the past. This post is not intended as an argument for those points.)

I have long said that the lion's share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at all, rather than figuring out what to point it at.

It’s recently come to my attention that some people have misunderstood this point, so I’ll attempt to clarify here.

In saying the above, I do not mean the following:

(1) Any practical AI that you're dealing with will necessarily be cleanly internally organized around pursuing a single objective. Managing to put your own objective into this "goal slot" (as opposed to having the goal slot set by random happenstance) is a central difficult challenge. [Reminder: I am not asserting this] 

Instead, I mean something more like the following:

(2) By default, the first minds humanity makes will be a terrible spaghetti-code mess, with no clearly-factored-out "goal" that the surrounding cognition pursues in a unified way. The mind will be more like a pile of complex, messily interconnected kludges, whose ultimate behavior is sensitive to the particulars of how it reflects and irons out the tensions within itself over time.

Making the AI even have something vaguely nearing a 'goal slot' that is stable under various operating pressures (such as reflection) during the course of operation is an undertaking that requires mastery of cognition in its own right—mastery of a sort that we’re exceedingly unlikely to achieve if we just try to figure out how to build a mind, without filtering for approaches that are more legible and aimable.
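As a toy caricature of the contrast between (1) and (2), invented purely for illustration (none of these names or interfaces refer to any real or proposed system): a mind whose every decision routes through one explicit objective, versus a pile of context-triggered heuristics with no such slot.

```python
# Toy caricature only; all names and structure here are invented for illustration.

class FactoredAgent:
    """Caricature of a mind with a clean 'goal slot': every decision is made
    by scoring options against one explicit objective."""
    def __init__(self, goal_fn):
        self.goal = goal_fn  # the single slot the surrounding cognition defers to

    def act(self, options, world_model):
        # Every decision routes through the same objective.
        return max(options, key=lambda a: self.goal(world_model.predict(a)))


class KludgeAgent:
    """Caricature of a spaghetti-code mind: a pile of context-triggered
    heuristics, with no single objective that explains its behavior."""
    def __init__(self, heuristics):
        # heuristics: list of (trigger_predicate, response_fn) pairs,
        # accumulated by whatever process trained the system.
        self.heuristics = heuristics

    def act(self, observation, options):
        for trigger, respond in self.heuristics:
            if trigger(observation):
                return respond(observation, options)  # first kludge that fires wins
        return options[0]  # fall through to some default reflex
```

The claim in (2) is that the systems we actually train look far more like the second caricature than the first, and that getting even an approximate, reflection-stable version of the first is itself a hard problem.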

Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a bunch of the wrinkles and will be oriented around a particular goal (at least behaviorally, cf. efficiency—though I would also guess that the mental architecture ultimately ends up cleanly-factored (albeit not in a way that creates a single point of failure, goalwise)).

(But this doesn’t help solve the problem, because by the time the strongly superintelligent AI has ironed itself out into something with a "goal slot", it's not letting you touch it.)

Furthermore, insofar as the AI is capable of finding actions that force the future into some narrow band, I expect that it will tend to be reasonable to talk about the AI as if it is (more-or-less, most of the time) "pursuing some objective", even in the stage where it's in fact a giant kludgey mess that's sorting itself out over time in ways that are unpredictable to you.

I can see how my attempts to express these other beliefs could confuse people into thinking that I meant something more like (1) above (“Any practical AI that you're dealing with will necessarily be cleanly internally organized around pursuing a single objective…”), when in fact I mean something more like (2) (“By default, the first minds humanity makes will be a terrible spaghetti-code mess…”).


In case it helps those who were previously confused: the "diamond maximizer" problem is one example of an attempt to direct researchers’ attention to the challenge of cleanly factoring cognition around something a bit like a 'goal slot'.

As evidence of a misunderstanding here: people sometimes hear me describe the diamond maximizer problem, and respond to me by proposing training regimes that (for all they know) might make the AI care a little about diamonds in some contexts.

This misunderstanding of what the diamond maximizer problem was originally meant to be pointing at seems plausibly related to the misunderstanding that this post intends to clear up. Perhaps in light of the above it's easier to understand why I see such attempts as shedding little light on the question of how to get cognition that cleanly pursues a particular objective, as opposed to a pile of kludges that careens around at the whims of reflection and happenstance.


I wish that everyone (including OP) would be clearer about whether or not we’re doing worst-case thinking, and why.

In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”. I don’t have a strong reason to expect that to happen, and I also don’t have a strong reason to expect that to not happen. I mostly feel uncertain and confused.

So if the debate is “Are Eliezer & Nate right about ≳99% (or whatever) chance of doom?”, then I find myself on the optimistic side (at least, leaving aside the non-technical parts of the problem), whereas if the debate is “Do we have a strong reason to believe that thus-and-such plan will actually solve technical alignment?”, then I find myself on the pessimistic side.

 

Separately, I don’t think it’s true that reflectively-stable hard superintelligence needs to have a particular behavioral goal, for reasons here.

So8res:

I don't see this as worst-case thinking. I do see it as speaking from a model that many locals don't share (without any particular attempt made to argue that model).

In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”.

AFAICT, our degree of disagreement here turns on what you mean by "pointed". Depending on that, I expect I'd either say "yeah maybe, but that kind of pointing is hard" or "yep, my highest-credence models have pretty high probability on this thing failing to optimize X once it's sorted out".

For instance, the latter response obtains if the "pointing" is done by naive training.

(Though I also have some sense that I see the situation as more fragile than you--there's lots of ways for reflection to ruin your day, if the wrong kludge is pointed the wrong way. So maybe we have a broader disagreement about that, too.)

Also, as a reminder, my high credence in doom doesn't come from high confidence in a claim like this. You can maybe get one nine here; I doubt you can get three. My high credence in doom comes from its disjunctive nature.
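(To unpack "disjunctive" with purely made-up numbers: if doom follows from any one of several roughly independent failure routes, and each route is only avoided with, say, 70–90% probability, the overall credence in doom comes out high even though no single claim gets three nines. A minimal sketch, with invented probabilities:)

```python
# Purely illustrative, made-up numbers; also assumes the failure routes
# are independent, which is itself an assumption.
import math

p_avoid_each = [0.9, 0.8, 0.9, 0.7, 0.85]   # confidence of dodging each route

p_survive = math.prod(p_avoid_each)  # every route must be dodged
p_doom = 1 - p_survive

print(f"P(avoid every failure route) = {p_survive:.2f}")   # ~0.39
print(f"P(doom via at least one route) = {p_doom:.2f}")    # ~0.61
```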

For instance, the latter response obtains if the "pointing" is done by naive training.

Oh, sorry. I’m “uncertain” assuming Model-Based RL with the least-doomed plan that I feel like I more-or-less know how to implement right now. If we’re talking about “naïve training”, then I’m probably very pessimistic, depending on the details.

Also, as a reminder, my high credence in doom doesn't come from high confidence in a claim like this. You can maybe get one nine here; I doubt you can get three. My high credence in doom comes from its disjunctive nature.

That’s helpful, thanks!

UPDATE: The “least-doomed plan” I mentioned is now described in a simpler and more self-contained post, for readers’ convenience. :)

Given a sufficiently kludgy pile of heuristics, it won't make another AI unless it has a heuristic towards making AI. (In which case the kind of AI it makes depends on its AI-making heuristics.) GPT-5 won't code an AI to minimize predictive error on text. It will code some random AI that looks like something in the training dataset, and will care more about what the variable names are than about what the AI actually does.

Big piles of kludges usually arise from training a kludge-finding algorithm (like deep learning). So the only ways an agent could acquire AI-building kludges are by making dumb AIs or by reading human writings.

Alternately, maybe the AI has sophisticated self-reflection: it is looking at its own kludges and trying to figure out what it values. In which case, does the AI's metaethics contain a simplicity prior? With a strong simplicity prior, an agent with a bunch of kludges that mostly maximized diamond could turn into an actual crystalline diamond maximizer. If it doesn't have that simplicity prior, I would guess it ends up optimizing some complicated utility function. (But probably producing a lot of diamond as it does so; diamond isn't the only component of its utility, but it is a big one.)
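(As a toy picture of the role of that simplicity prior, with invented numbers and an invented scoring rule: if reflection amounts to scoring candidate goal-descriptions by how well they explain the existing kludge pile, minus a complexity penalty, then a strong penalty selects the crisp "maximize diamond" reading, while no penalty selects whatever complicated mixture fits the kludges exactly.)

```python
# Toy model with invented numbers; "reflection" here is just argmax over
# candidate goal-descriptions, scored by fit-to-existing-kludges minus
# a simplicity penalty.

candidate_goals = {
    # goal description: (how well it explains the kludge pile, description length)
    "maximize diamond":                         (0.90, 2),
    "diamond + 37 context-specific exceptions": (0.99, 40),
}

def reflect(candidates, simplicity_weight):
    def score(item):
        _, (fit, complexity) = item
        return fit - simplicity_weight * complexity
    best, _ = max(candidates.items(), key=score)
    return best

print(reflect(candidate_goals, simplicity_weight=0.01))  # -> "maximize diamond"
print(reflect(candidate_goals, simplicity_weight=0.0))   # -> the complicated mixture
```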

For my part, I expect a pile of kludges (learned via online model-based RL) to eventually guide the AI into doing self-reflection. (Self-reflection is, after all, instrumentally convergent.) If I’m right, then it would be pretty hard to reason about what will happen during self-reflection in any detail. Likewise, it would be pretty hard to intervene in how the self-reflection will work.

E.g. we can’t just “put in” or “not put in” a simplicity prior. The closest thing that we could do is try to guess whether or not a “simplicity kludge” would have emerged, and to what extent that kludge would be active in the particular context of self-reflection, etc.—which seems awfully fraught.

To be clear, while I think it would be pretty hard to intervene on the self-reflection process, I don’t think it’s impossible. I don’t have any great ideas right now but it’s one of the things I’m working on.

I can imagine at least two different senses in which an AI might have something like a "goal slot":

A. Instrumental "goal slot" - Something like a box inside the agent that holds in mind the (sub)goal it is currently pursuing. That box serves as an interface through which the different parts of the agent coordinate coherent patterns of thought & action among themselves. (For example, the goal slot's contents get set temporarily to "left foot forward", allowing coordination between limb effectors so as not to trip over one another.) I think the AIs we build will probably have something (or some things) like this, because it is a natural design pattern to implement flexible distributed control.

B. Terminal "goal slot" - Something like a box inside the agent that holds in mind a fixed goal it is always pursuing. That box sits at the topmost level of the control hierarchy within the agent, and its contents are not and cannot be changed via bottom-up feedback. I don't think the AIs we build will have something like this, at least not in the safety-relevant period of cognitive development (the period wherein the agent's goals are still malleable to us), in part because in reality it is a design that rarely ever works.

Were you thinking of A or B?

It seems perfectly consistent to me to have an AI whose cognitive internals are not "spaghetti-code", even one with a cleanly separated instrumental "goal slot" (A) that interfaces with the rest of the AI's cognition, but where there is no single terminal "goal slot" (B) to speak of. In fact, that seems like a not-unlikely development path to me.
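A minimal toy sketch of the A/B distinction (everything in it, including the class names and the effector/goal interfaces, is invented purely for illustration): in sense A the 'goal slot' is a mutable register that submodules read so they can coordinate, and the rest of the agent is free to overwrite it; sense B would be the same register frozen at the top of the control hierarchy, with no feedback path allowed to touch it.

```python
# Toy sketch of the A/B distinction; everything here is invented for illustration.

class AgentWithInstrumentalSlot:
    """Sense A: the 'goal slot' is just a shared register that submodules
    read to coordinate; feedback from elsewhere rewrites it freely."""
    def __init__(self):
        self.current_subgoal = None  # e.g. "left foot forward"

    def set_subgoal(self, subgoal):
        self.current_subgoal = subgoal  # bottom-up or top-down feedback can change this

    def step(self, effectors):
        for effector in effectors:
            effector.act_toward(self.current_subgoal)  # shared interface for coordination


class AgentWithTerminalSlot(AgentWithInstrumentalSlot):
    """Sense B: one goal sits above the whole control hierarchy, and nothing
    downstream is permitted to rewrite it."""
    def __init__(self, terminal_goal):
        super().__init__()
        self._terminal_goal = terminal_goal  # fixed; no setter, no feedback path

    def choose_subgoal(self, context):
        # Subgoals are only ever derived from the fixed terminal goal.
        self.current_subgoal = self._terminal_goal.best_subgoal_for(context)
```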

I feel like we already can point powerful cognition at certain things pretty well (e.g. chess), and the problem is figuring out how to point AIs at more useful things (e.g. honestly answering hard questions well). So I don’t know if I’m being nit-picky, but I think that the problem is not pointing powerful cognition at anything at all, but rather pointing powerful cognition at whatever we want.

It's not clear here, but if you read the linked post it's spelled out (the two are complementary really). The thesis is that it's easy to do a narrow AI that knows only about chess, but very hard to make an AGI that knows the world, can operate in a variety of situations, but only cares about chess in a consistent way.

I think this is correct at least with current AI paradigms, and it has both some reassuring and some depressing implications.

I agree with what you say in this post, but still want to disagree on the Diamond Maximizer idea being something people should be encouraged to try to solve. I think solving "how to make an AI target a distinct goal" before thoroughly solving "how do we specify a goal that will actually be beneficial to humanity for an AI to target" would be very dangerous.

Even if we haven't found a goal that will be (strongly) beneficial to humanity, it seems useful to know "how to make an AI target a distinct goal", because we can at least make it have limited impact and not take over. There are gradients of success on both problems, and having solved the first does mean we can do slightly positive things even if we can't do very strongly positive things.

In theory, yes, that could work.

In practice, once it's known how to do that, people will do it over and over again with successively more ambitious "limited" goals until it doesn't work.

Another possibility is a "pivotal act", but in practice I think a pivotal act has the following possible outcomes:

  1. The intended shutdown of the AI after the pivotal act fails; humanity is destroyed
  2. The pivotal act doesn't work
  3. The pivotal act succeeds temporarily
    1. but the people responsible are vilified, caught, and prevented from repeating it, and the pivotal act is undone over time
    2. and the people responsible can repeat what they did, becoming de facto rulers of Earth, but due to the lack of legitimacy face a game-theoretic landscape forcing them to act as dictators even if they have good intentions to start with
    3. and the people responsible and their sympathizers convince the general public of the legitimacy of the act (potentially leads to a good outcome in this case, but beware of corrupted hardware making you overestimate the actually-low probability of this happening)
  4. The pivotal act is somehow something with a permanent effect, but the potential of humanity is permanently reduced

I don't dispute that at some point in time we want to solve alignment (to come out of the precipice period), but I dispute that it's more dangerous to know how to point AI before having solved what perfect goal to give it.
In fact, I think it's less dangerous, because we at minimum gain more time to work on and solve alignment, and at best can use existing near-human-level AGI to help us solve alignment too. The main reason to believe this is that near-human-level AGI occupies a particular zone where we can detect deception, where it can't easily unbox itself and take over, yet where it is still useful. The longer we stay in this zone, the more relatively safe progress we can make (including on alignment).

In fact, I think it's less dangerous because we at minimum gain more time

As I argued above we have less time, not more, if we know how to point AI. 

An AI aimed at something in particular would be much more dangerous for its level than one not aimed at any particular real-world goal, and so "near-human-level AGI" would be much safer (and we could stay in the near-human-level zone you mention for longer) if we can't point it.

My intuition is viscerally disgusted at the idea of pointing cognition. I don't think this disgust is quite correct, but I think there is something to be said about the fact that pointing a human's cognition in this way would be an incredible injury to that human. Is there a mutual version of this pointing that is easier to solve, because it doesn't need to use concepts that would cause incredible harm if targeted on a human?

unfortunately I can only offer an intuition off the top of my head.

I don't particularly understand, and you're getting upvoted, so I'd appreciate clarification. Here are some prompts if you want:
- Is you deciding that you will concentrate on doing a hard task (solving a math problem) pointing your cognition, and is it viscerally disgusting?
- Is you asking a friend a favor to do your math homework pointing their cognition, and is it viscerally disgusting?
- Is you convincing someone else by true arguments to do a hard task that benefits them pointing their cognition, and is it viscerally disgusting?
- Is a CEO giving directions to his employees so they spend their days working on specific tasks pointing their cognition, and is it viscerally disgusting?
- Is you having a child and training them to be a chess grandmaster (e.g. Judit Polgár) pointing their cognition, and is it viscerally disgusting?
- Is you brainwashing someone into a particular cult where they will dedicate their life to one particularly repetitive task pointing their cognition, and is it viscerally disgusting?
- Is you running a sorting algorithm on a list pointing the computer's cognition, and is it viscerally disgusting?


I'm hoping to get info on what problem you're seeing (or aesthetics?), why it's a problem, and how it could be solved. I gave many examples where you could interpret my questions as being rhetorical - that's not the point. It's about identifying at which point you start differing.

- Is you deciding that you will concentrate on doing a hard task (solving a math problem) pointing your cognition, and is it viscerally disgusting?

It is me pointing my own cognition in consensus with myself; nobody else pointed it for me.

- Is you asking a friend a favor to do your math homework pointing their cognition, and is it viscerally disgusting?

Not in the sense discussed here; they have the ability to refuse. It's not a violation of consent on their part - they point their own cognition, I just give them some tokens inviting them to point it.

- Is you convincing someone else by true arguments to do a hard task that benefits them pointing their cognition, and is it viscerally disgusting?
 

A little, but still not a violation of consent.

- Is a CEO giving directions to his employees so they spend their days working on specific tasks pointing their cognition, and is it viscerally disgusting?
 

Hmm, that depends on the bargaining situation. Usually it's a mild violation, I'd say, because the market never becomes highly favorable for workers, and so workers are always at a severe disadvantage compared to managers and owners, and often are overpressured by the incentive gradient of money. However, in industries where workers have sufficient flexibility or in jobs where people without flexibility to leave are still able to refuse orders from their bosses, it seems less of a violation. The method of pointing is usually more like sparse game theory related to continued employment than like dense incentive gradients, though. It varies how cultlike companies are.

- Is you having a child and training them to be a chess grandmaster (e.g. Judit Polgár) pointing their cognition, and is it viscerally disgusting?
 

Yeah, though not quite as intensely as...

- Is you brainwashing someone into a particular cult where they will dedicate their life to one particularly repetitive task pointing their cognition, and is it viscerally disgusting?
 

Unequivocally yes.

- Is you running a sorting algorithm on a list pointing the computer's cognition, and is it viscerally disgusting?
 

I want to say it's not pointing cognition at all, but I'm not sure about that. I have the sense that it might be to a very small degree. Perhaps it depends on what you're sorting? I have multiple strongly conflicting models I could use to determine how to answer this, and they give very different results, some insisting this is a nonsense question in the first place. Eg, perhaps cognition is defined by the semantic type that you pass to sort; eg, if the numbers are derived from people, it might count more as cognition due to the causal history that generated the numbers having semantic meaning about the universe. Or, maybe the key thing that defines cognition is compressive bit erasure, in which case operating a cpu is either always horrifying due to the high thermal cost per bit of low-error-rate computation, or is always fine due to the low lossiness of low-error-rate computation.

Thanks for the answers. It seems they mostly point to you valuing stuff like freedom/autonomy/self-realization, and to violations of that being distasteful. I think your answers are pretty reasonable, and though I might not have the exact same level of sensitivity, I agree with the ballpark and the ranking (brainwashing is worse than explaining, teaching chess exclusively feels a little too heavy-handed...).

So where our intuitions differ is probably that you're applying these heuristics about valuing freedom/autonomy/self-realization to the AI systems we train? Do you see them as people, or more abstractly as moral patients (because they're probably conscious or something)?

I won't get into the moral weeds too fast, but I'd point out that though I do currently mostly believe consciousness and moral patienthood are quite achievable "in silico", that doesn't mean that every intelligent system is conscious or a moral patient, and we might create AGI that isn't of that kind. If you suppose AGI is conscious and a moral patient, then yeah, I guess you can argue against it being pointed somewhere, but I'd mostly counter-argue from moral relativism that "letting it point anywhere" is not fundamentally more good than "pointed somewhere", so because we exist and have morals, let's point it to our morals anyway.

I would say that personhood arises from the ability to solve complex incentive gradients or something...

In general I would say something like, though I'm not sure that I've actually got these right according to my own opinions and I might need to revisit this,

Consciousness: the amount of justified-by-causal-path mutual information that the information a system integrates from the shape of energy propagations has with an external system

Intelligence: when a system seeks structures that create consciousness that directs useful action reliably

Cognition: strong overlap with "intelligence", the network structure of distilling partial computation results into useful update steps that coherently transform components of consciousness into useful intelligent results

Moral patienthood: a system that seeks to preserve its shape (note that this is more expansive than cognition!)

I believe that by the time an AI has fully completed the transition to hard superintelligence

Nate, what is meant by "hard" superintelligence, and what would precede it? A "giant kludgey mess" that is nonetheless superintelligent? If you've previously written about this transition, I'd like to read more.

Maybe Nate has something in mind like Bostrom's "strong superintelligence", defined in Superintelligence as "a level of intelligence vastly greater than contemporary humanity's combined intellectual wherewithal"?

(Whereas Bostrom defines "superintelligence" as "any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest", where "exceeding the performance of humans" means outperforming the best individual human, not necessarily outperforming all of humanity working together.)


Oh. I was thinking the diamond maximizer problem is about making the AI care about the specific goal of maximizing diamond, not about making the AI have some consistent goal slot instead of lots of cognitive spaghetti code. I think it’s simple to make a highly agentic AI with some specific goals it’s really good at maximizing (if you have infinite compute and don’t care about what these goals actually are; I have no idea how to point that at diamond). Should I write a description somewhere?
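(A sketch of the kind of construction that is "simple given infinite compute", offered here as toy code and not necessarily what the parent comment has in mind: exhaustive search over action sequences against an assumed-given world model and utility function. The `world_model` and `utility` placeholders are assumptions, and the fact that they are assumed given is exactly the part that's hard to point at diamond.)

```python
# Toy sketch only: meaningful solely with unbounded compute, and it assumes
# `world_model` (state transition) and `utility` (score over outcomes) are
# already specified, which is precisely the part that's hard to aim at diamond.
from itertools import product

def plan(world_model, utility, actions, horizon, state):
    """Exhaustively evaluate every action sequence up to `horizon` and return
    the first action of the highest-scoring sequence."""
    best_value, best_first_action = float("-inf"), None
    for seq in product(actions, repeat=horizon):
        s = state
        for a in seq:
            s = world_model(s, a)   # predicted next state
        value = utility(s)          # how good the predicted outcome is
        if value > best_value:
            best_value, best_first_action = value, seq[0]
    return best_first_action
```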

Is it simple if you don't have infinite compute?
I would be interested in a description which doesn't rely on infinite compute, or, more strictly still, that is computationally tractable. This constraint is important to me because I assume that the first AGI we get uses something that's more efficient than other known methods (e.g. using DL because it works, even though it's hard to control), so I care about aligning the stuff which we'll actually be using.

It seems to me that a key problem is how to avoid distilling goals too early - in particular, when in a society of beings who do not themselves have clear goal slots, an AI that is seeking to distill itself should help those near it participate in distilling themselves incrementally as well, because we are as much at risk from erroneous self-modification as augmented AIs are. The reason to prevent (an excess of?) conceptual steganography is to allow coming into sync about intentions. We want to become one consequentialist together eventually, but reaching dense percolation on that requires existing powerful consequentialists to establish trust that all other consequentialists have diffused their goals towards something that can be agreed on? Or something? (I think my thinking could use more human train of thought. Self-unvoted for that reason.)