Eliezer Yudkowsky

Comments

Death requires only that we do not infer one key truth; not that we could not observe it.  Therefore, the history of what in actual real life was not anticipated is more relevant than the history of what could have been observed but was not.

All of that, yes, alongside things like, "The AI is smarter than any individual human", "The AIs are smarter than humanity", "the frontier models are written by the previous generation of frontier models", "the AI can get a bunch of stuff that wasn't an option accessible to it during the previous training regime", etc etc etc.

Do you expect "The AIs are capable of taking over" to happen a long time after "The AIs are smarter than humanity", which is a long time after "The AIs are smarter than any individual human", which is a long time after "AIs recursively self-improve", and for all of those other things to happen nicely comfortably within a regime of failure-is-observable-and-doesn't-kill-you, where at any given time only one thing is breaking and all other problems are currently fixed?

What we "could" have discovered at lower capability levels is irrelevant; the future is written by what actually happens, not what could have happened.

Your techniques are failing right now; Sonnet is deleting non-passing tests instead of rewriting code.  Where's the worldwide halt on further capabilities development that we're supposed to get, until new techniques are found and apparently start working again?  What's the total number of new failures we'd need to observe between intelligence regimes, before you start to expect that yet another failure might lie ahead in the future?

And then of course that whole scenario where everybody keenly went looking for all possible problems early, found all the ones they could envision, and humanity did not proceed further until reasonable-sounding solutions had been found and thoroughly tested, is itself taking place inside an impossible pollyanna society that is just obviously not the society we are currently finding ourselves inside.

But it is impossible to convince pollyannists of this, I have found.  And also, if alignment pollyannists really could produce a great solution given a couple more years to test their brilliant solutions, with coverage for all the problems they have with wisdom foreseen and manifested early, then that societal scenario could maybe be purchased at a lower price than the price of a worldwide shutdown of ASI.  That is: if the pollyannist technical view were true, but not their social view, that might imply a different optimal policy.

But I think the world we live in is one where it's moot whether Anthropic will get two extra years to test out all their ideas about superintelligence in the greatly different failure-is-observable regime, before their ideas have to save us in the failure-kills-the-observer regime.  I think they could not do it either way.  I doubt even 2/3rds of their brilliant solutions derived from the failure-is-observable regime would generalize correctly under the first critical load in the failure-kills-the-observer regime; and even 2/3rds would not be enough.  It's not the sort of thing human beings succeed in doing in real life.

When I've tried to talk to alignment pollyannists about the "leap of death" / "failure under load" / "first critical try", their first rejoinder is usually to deny that any such thing exists, because we can test in advance; they are denying the basic leap of required OOD generalization from failure-is-observable systems to failure-kills-the-observer systems.

You are now arguing that we will be able to cross this leap of generalization successfully.  Well, great!  If you are at least allowing me to introduce the concept of that difficulty and reply by claiming you will successfully address it, that is further than I usually get.  It has so many different attempted names because of how every name I try to give it gets strawmanned and denied as a reasonable topic of discussion.

As for why your attempt at generalization fails, even assuming gradualism and distribution:  Let's say that two dozen things change between the regimes for observable-failure vs failure-kills-observer.  Half of those changes (12) have natural earlier echoes that your keen eyes naturally observed.  Half of what's left (6) is something that your keen wit managed to imagine in advance and that you forcibly materialized on purpose by going looking for it.  Of the clever solutions you invented and tested within the survivable regime, 2/3rds of them survive the 6 changes you didn't see coming, 1/3rd fail.  Now you're dead.  The end.  If there was only one change ahead, and only one problem you were gonna face, maybe your one solution to that one problem would generalize, but this is not how real life works.
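
To spell out the arithmetic of that last point (a toy illustration only, not a claim about real numbers: it reuses the hypothetical 2/3rds figure from the paragraph above and assumes, purely for simplicity, that each solution generalizes or fails independently):

```python
# Toy numbers only, matching the hypothetical above: two dozen regime
# changes, 12 observed early, 6 imagined and materialized on purpose,
# 6 never seen coming.  Suppose each clever solution independently has
# a 2/3 chance of surviving those 6 unseen changes.  The chance that
# *all* of your load-bearing solutions hold up -- roughly what the
# transition demands -- shrinks fast as their number grows.
p_single_solution_survives = 2 / 3

for num_solutions in (1, 3, 6, 12):
    p_all_survive = p_single_solution_survives ** num_solutions
    print(f"{num_solutions:2d} load-bearing solutions -> "
          f"P(all generalize) ~ {p_all_survive:.2f}")
```

Even granting a generous 2/3 per solution, six solutions all holding comes out under 10%, and a dozen comes out under 1%.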

I deny that gradualism obviates the "first critical try / failure under critical load" problem.  This is something you believe, not something I believe.

Let's say you're raising 1 dragon in your city, and 1 dragon is powerful enough to eat your whole city if it wants.  Then no matter how much experience you think you have with a little baby dragon, once the dragon is powerful enough to actually defeat your military and burn your city, you need the experience with the little baby passively-safe weak dragon to generalize oneshot correctly to the dragon powerful enough to burn your city.  What if the dragon matures in a decade instead of a day?  You are still faced with the problem of correct oneshot generalization.  What if there are 100 dragons instead of 1 dragon, all with different people who think they own dragons and that the dragons are 'theirs' and will serve their interests, and they mature at slightly different rates?  You still need to have correctly generalized the safely-obtainable evidence from 'dragon groups not powerful enough to eat you while you don't yet know how to control them' to the different non-training distribution 'dragon groups that will eat you if you have already made a mistake'.

The leap of death is not something that goes away if you spread it over time or slice it up into pieces.  This ought to be common sense; there isn't some magical way of controlling 100 dragons which at no point involves the risk that the clever plan for controlling 100 dragons turns out not to work.  There is no clever plan for generalizing from safe regimes to unsafe regimes which avoids all risk that the generalization doesn't work as you hoped.  Because they are different regimes.  The dragon or collective of dragons is still big and powerful, and it kills you if you made a mistake, and you need to learn in regimes where mistakes don't kill you, and those are not the same regimes as the regimes where a mistake kills you.

If you think I am trying to say something clever and complicated that could have a clever complicated rejoinder, then you are not understanding the idea I am trying to convey.  Between the world of 100 dragons that can kill you, and a smaller group of dragons that aren't old enough to kill you, there is a gap that you are trying to cross with cleverness and generalization between two regimes that are different regimes.  This does not end well for you if you have made a mistake about how to generalize.  This problem is not about some particular kind of mistake that applies exactly to 3-year-old dragons which are growing at a rate of exactly 1 foot per day, where if the dragon grows slower than that, the problem goes away yay yay.  It is a fundamental problem, not a surface one.

The reporter asked "What do you think of Anthropic's recent work?" rather than about any particular paper.  My wondering how much Anthropic is uncovering roleplay vs. deep entities is a theme that runs through a lot of that recent work, and it is the main thing where I expect I have something on the margins to contribute to what the reporter hears.

Yep, general vibes about whether Anthropic & partner's research is really telling us things about the underlying strange creature, or about a sort of mask that it wears with a lot of roleplaying qualities.  I think this generalizes across a swathe of their research, but the Fake Alignment paper did stand out to me as one of the clearer cases.
