After having chosen a utility function to maximize, how would it actually maximize it? I'm thinking that the search/planning process for finding good policies naturally introduces mesa-optimizers, regardless of everything that came before in the rest of PreDCA (detecting precursors and extrapolating their utility function).
It seems like the AI risk mitigation solutions you've listed aren't mutually exclusive; we'll likely have to use a combination of them to succeed. While I agree that it would be ideal for us to end up with a FAS, the pathway toward that outcome would likely involve "sponge coordination" and "pivotal acts" as mechanisms by which our civilization can buy some time before FAS arrives.
A possible scenario in a world where FAS takes some time to arrive (in chronological order):
Of course, we wouldn't need all of this if FAS happened to be the first capable AGI to be developed (which seems unlikely in my model). I would like to know which scenarios you think are most likely to happen (or that we should aim towards), or whether I've overlooked any other pathways. (also relevant)
It seems like the exact model the AI will adopt is somewhat confounding my picture when I try to imagine what an "existentially secure" world looks like. I'm currently thinking there are two possible existentially secure worlds:
The obvious one is where all human dependence is removed from setting/modifying the AI's value system (like CEV, fully value-aligned); this world would look much more unipolar.
The alternative is for a well-intentioned and well-coordinated group to use a corrigible AI that is aligned with its human instructor. To me, whether this scenario looks existentially secure probably depends on whether small differences in capability can magnify into great differences in power. If they can't, it would be much easier for capable groups to defect, build their own corrigible AIs, and push agendas that may not be in humanity's interest (hence not so existentially secure). If they can, then the world would again be more unipolar, and its existential security would depend on how value-aligned the humans operating the corrigible AI are (I'm guessing this is your offense-defense balance example?)
So it seems to me that the ideal endgame is for humanity to end up with a value-aligned AI, either by starting with one or by somehow getting through the "dangerous period" of multipolar corrigible AIs and transitioning to a value-aligned one. Possible pathways (non-exhaustive):
I'm not sure whether this is a good framing at all (it probably isn't), but simply counting the number of dependencies (without taking into account how plausible each dependency is), it seems to me that humanity's chances would be better in a unipolar takeover scenario: either using a value-aligned AI from the start or transitioning into one after a pivotal act.
Is it even possible for anything short of a pivotal act to achieve existential security? Even if we maxed out AI lab communication and had awesome interpretability, that wouldn't help in the long run, given that the minimum resources required to build a misaligned AGI will probably keep dropping.
Thanks, I found your post very helpful, and I think this community would benefit from more posts like it.
I agree that we would need a clear categorization. Ideally, it would give us a way to explicitly quantify and make legible the claims of various proposals, e.g. "my proposal, under these assumptions about the world, may give us X years of time, changes the world in these ways, and interacts with proposals A, B, and C in these ways."
The lack of such a categorization is perhaps one of the reasons why I feel the pivotal act framing is still necessary. It seems to me that, while proposals closer to the "gradual steering" end of the spectrum (e.g. regulation, culture change, AI lab communication) are usually aimed at giving humanity a couple more months or years of extra time, they fail to make legible claims like the above, and yet (I might be wrong) proceed to implicitly claim, "therefore, if we do a lot of these, we're safe, even without any pivotal acts!"
(Of course, pivotal acts aren't guilt-free and many of their details are hand-wavy, but their claims of impact and their assumptions about the world seem pretty straightforward. Are there non-pivotal-act proposals like that?)
well crap, that was fast. does anyone know what karma threshold the button was pressed at?
In my model this isn't a capabilities failure, because there are demons in imperfect search: what you would get out of a heuristic search that approximates the best policy wouldn't only be something close to the global optimum, but something that has also been optimized by whatever demons (which don't even have to be "optimizers", necessarily) emerged through the selection pressures.
Maybe I'm still misunderstanding PreDCA and it somehow rules out this possibility, but afaik it only seems to do so in the limit of perfect search.
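As a toy sketch of a much weaker version of this point (something I made up purely for illustration; it has nothing to do with PreDCA's internals or a real training setup): a greedy hill-climber on a deliberately deceptive objective ends up wherever the local selection pressure points, not at the global optimum. The demon worry is about much richer, self-reinforcing structure in much larger search spaces, but even here the output is a product of the search dynamics rather than just of the objective.

```python
# Toy illustration only: greedy local search on a deceptive objective.
# The global optimum is the all-zeros string, but from a typical random
# start every accepted single-bit improvement adds a one, so the search
# converges to all-ones instead.
import random

random.seed(0)
N = 40  # length of a toy "policy", represented as a bit-string

def trap_utility(policy):
    # Global optimum: all zeros, scoring N + 1. Everywhere else the score is
    # just the number of ones, so (away from the all-zeros neighborhood)
    # local improvements push toward all-ones.
    ones = sum(policy)
    return N + 1 if ones == 0 else ones

def hill_climb(policy, steps=5000):
    # Imperfect search: random single-bit flips, keeping a flip only if the
    # score strictly improves. No lookahead, no restarts, no annealing.
    score = trap_utility(policy)
    for _ in range(steps):
        i = random.randrange(N)
        candidate = policy.copy()
        candidate[i] ^= 1
        s = trap_utility(candidate)
        if s > score:
            policy, score = candidate, s
    return policy, score

start = [random.randint(0, 1) for _ in range(N)]
found, found_score = hill_climb(start)
print("hill-climbing result:", found_score)            # 40 (all-ones, local optimum)
print("global optimum:      ", trap_utility([0] * N))  # 41 (all-zeros)
```

Again, this only shows that an imperfect search's output is determined by the search process itself; it doesn't demonstrate mesa-optimization, just the direction of my worry.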