Disposable Identity

> That aside, I'm not sure what argument you're making here.

I do not often comment on Less Wrong. (Although I am starting to; this is one of my first comments!)
Hopefully, my thoughts will become clearer as I write more and become more acquainted with the local assumptions and cultural codes.

In the meantime, let me expand:

> Two possible interpretations that come to mind (probably both of these are wrong):
>
>   1. You're arguing that all humans in the world will refuse to build dangerous AI, therefore AI won't be dangerous.
>   2. You're arguing that natural selection doesn't tell us how hard it is to pull off a pivotal act, since natural selection wasn't trying to do a pivotal act.
>
> 2 seems broadly correct to me, but I don't see the relevance. Nate and I indeed think that pivotal acts are possible. Nate is using natural selection here to argue against 'AI progress will be continuous', not to argue against 'it's possible to use sufficiently advanced AI systems to end the acute existential risk period'.

Interpretation 2 is the correct one.

But even after re-reading the post with your interpretation in mind, I am still confused about why you consider 2 irrelevant. Consider:

> The techniques you used to train it to allow the operators to shut it down? Those fall apart, and the AGI starts wanting to avoid shutdown, including wanting to deceive you if it’s useful to do so.

> Why does alignment fail while capabilities generalize, at least by default and in predictable practice?

On the one hand, in the analogy with Natural Selection, "by default" means "when you do not even try to do alignment and instead optimize 100% for a given goal". I.e.: when NS optimized for IGF, capabilities generalized but alignment did not.
On the other hand, when speaking of alignment directly, "by default" means "even if you optimize for alignment, but without keeping some specific considerations in mind". I.e.: some specific alignment proposals will fail.

My point was that the former is not evidence for the latter.

> But 'alignment is tractable when you actually work on it' doesn't imply 'the only reason capabilities outgeneralized alignment in our evolutionary history was that evolution was myopic and therefore not able to do long-term planning aimed at alignment desiderata'.

I am not claiming evolution is 'not able to do long-term planning aimed at alignment desiderata'.
I am claiming it did not even try.

> If you're myopically optimizing for two things ('make the agent want to pursue the intended goal' and 'make the agent capable at pursuing the intended goal') and one generalizes vastly better than the other, this points toward a difference between the two myopically-optimized targets.

This looks like a strong steelman of the post, which I gladly accept.


But it seemed to me that the post was arguing:
1. That alignment is hard (it says that technical alignment contains the hard bits of the problem, lists multiple specific difficulties in alignment, etc.)
2. That current approaches do not work

That you do not get alignment by default is a much weaker thesis than 1 & 2, and one that I agree with.

> This would obviously be an incredibly positive development, and would increase our success odds a ton! Nate isn't arguing 'when you actually try to do alignment, you can never make any headway'.

This unfortunately didn't answer my question. We all agree that it would be a positive development; my question was how much of one. From my point of view, it could even be enough.


The question that I was trying to ask was: "What is the difficulty ratio that you see between alignment and capabilities?"
I understood the post as making the claim (among others) that "alignment is much more difficult than capabilities, as evidenced by Natural Selection".

Many comparisons are made with Natural Selection (NS) optimizing for IGF, on the grounds that this is our only example of an optimization process yielding intelligence.
 

I would suggest considering one very relevant fact: NS did not optimize for alignment, only for a myopic version of IGF. I would also note that humans have not optimized for alignment either.
 

Let's look at some quotes, with those considerations in mind:

> And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF.

NS has not optimized for alignment, which is why it's bad at alignment compared to what it has optimized for.

> Some people I say this to respond with arguments like: "Surely, before a smaller team could get an AGI that can master subjects like biotech and engineering well enough to kill all humans, some other, larger entity such as a state actor will have a somewhat worse AI that can handle biotech and engineering somewhat less well, but in a way that prevents any one AGI from running away with the whole future?"
>
> I respond with arguments like, "In the one real example of intelligence being developed we have to look at, continuous application of natural selection in fact found Homo sapiens sapiens, and the capability-gain curves of the ecosystem for various measurables were in fact sharply kinked by this new species (e.g., using machines, we sharply outperform other animals on well-established metrics such as “airspeed”, “altitude”, and “cargo carrying capacity”)."

NS did not optimize against one intelligence conquering the rest of the world. As such, it tells us nothing about how hard it is to produce an intelligence that does not conquer the rest of the world when you are actually optimizing for that.

> Their response in turn is generally some variant of "well, natural selection wasn't optimizing very intelligently" or "maybe humans weren't all that sharply above evolutionary trends" or "maybe the power that let humans beat the rest of the ecosystem was simply the invention of culture, and nothing embedded in our own already-existing culture can beat us" or suchlike.

My response is not that NS is not intelligent, but that NS did not even optimize for any of the things that you have pointed to.

> Why does alignment fail while capabilities generalize, at least by default and in predictable practice? In large part, because good capabilities form something like an attractor well.

My answer would be the same for NS and for humans: alignment is simply not optimized for! People spend vastly more resources on capabilities than on alignment.
If the ratio of resources invested in capabilities versus alignment were reversed, would you still expect alignment to fare so much worse than capabilities?
Let's say you would still expect that: how much better do you expect the situation to be as a result of the ratio being reversed? How much doom would you still expect in that world compared to now?

> good capabilities form something like an attractor well

Sure, insofar as people optimize for short-term power (i.e., capabilities) because they are myopic, and power is the name we give to what is useful in most scenarios.

---

I also expect a discontinuity in intelligence. But I think this post does not make a good case for it: a much simpler theory already explains its observations.


> In an upcoming post, I’ll say more about how it looks to me like ~nobody is working on this particular hard problem, by briefly reviewing a variety of current alignment research proposals. In short, I think that the field’s current range of approaches nearly all assume this problem away, or direct their attention elsewhere.

I'm very eager to read this.