My website is https://www.cs.toronto.edu/~duvenaud/
I think people will be told about, and sometimes notice, AIs' biases, but AIs will still be the most trustworthy source of information for almost everybody. I think Wikipedia is a good example here - it's obviously biased on many politicized topics, but it's still usually the best source for anyone who doesn't personally know experts or which obscure forums to trust.
Great question. I think treacherous turn risk is still under-funded in absolute terms. And gradual disempowerment is much less shovel-ready as a discipline.
I think there are two reasons why this question might not be so important to answer:
1) The kinds of skills required might be somewhat disjoint.
2) Gradual disempowerment is perhaps a subset or extension of the alignment problem. As Ryan Greenblatt and others point out: at some point, agents aligned to one person or organization will also naturally start working on this problem at the object level for their principals.
Yes, Ryan is correct. Our claim is that even fully-aligned personal AI representatives won't necessarily be able to solve important collective action problems in our favor. However, I'm not certain about this. The empirical crux for me is: Do collective action problems get easier to solve as everyone gets smarter together, or harder?
As a concrete example, consider a bunch of local polities in a literal arms race. If each had their own AGI diplomats, would they be able to stop the arms race? Or would the more sophisticated diplomats end up participating in precommitment races or other exotic strategies that might still prevent a negotiated settlement? Perhaps the less sophisticated diplomats would fear that a complicated power-sharing agreement would lead to their disempowerment eventually anyways, and refuse to compromise?
As a less concrete example, our future situation might be analogous to a population of monkeys who unevenly have access to human representatives who earnestly advocate on their behalf. There is a giant, valuable forest the monkeys live in, next to a city where all important economic activity and decision-making happens between humans. Some of the human population (or some organizations, or governments) end up not being monkey-aligned, instead focusing on their own growth and security. The humans advocating on behalf of monkeys can see this happening, but because they can't always participate directly in wealth generation as well as independent humans, they eventually become a small and relatively powerless constituency. The government and various private companies regularly bid enormous amounts of money for forest land, or levy enormous taxes on it, and even the monkeys with index funds are eventually forced to sell, and then go broke from rent.
I admit that there are many moving parts of this scenario, but it's the closest simple analogy to what I'm worried about that I've found so far. I'm happy for people to point out ways this analogy won't match reality.
I disagree - I think Ryan raised an obvious objection that we didn't directly address in the paper. I'd like to encourage medium-effort engagement from people as paged-in as Ryan. The discussion spawned was valuable to me.
Thanks for this. Discussion of things like "one time shifts in power between humans via mechanisms like states becoming more powerful" and personal AI representatives is exactly the sort of thing I'd like to hear more about. I'm happy to have finally found someone who has something substantial to say about this transition!
But over the last 2 years I asked a lot of people at the major labs for any kind of detail about a positive post-AGI future, and almost no one had put anywhere close to as much thought into it as you have, and no one mentioned the things above. Most people clearly hadn't put much thought into it at all. If anyone at the labs had much more of a plan than "we'll solve alignment while avoiding an arms race", I managed to fail to even hear about its existence despite many conversations, including with founders.
The closest thing to a plan was Sam Bowman's checklist:
https://sleepinyourhat.github.io/checklist/
which is exactly the sort of thing I was hoping for, except it's almost silent on issues of power, the state, and the role of post-AGI humans.
If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.
Good point. The reason AI risk is distinct is simply that it removes the need for those bureaucracies and corporations to keep some humans happy and healthy enough to actually run them. That need doesn't exactly put limits on how much they can disempower humans, but it does tend to provide at least some bargaining power for the humans involved.
Thanks for the detailed objection and the pointers. I agree there's a chance that solving alignment with designers' intentions might be sufficient. I think "if the AI were really aligned with one agent, it'd figure out a way to help them avoid multipolar traps" is a good objection.
My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels. I think the main question is: what's the tax for coordinating to avoid a multipolar trap? If it's cheap we might be fine; if it's expensive, we might walk into a trap with eyes wide open.
As for human power grabs, maybe we should have included those in our descriptions. But the slower things change, the less there's a distinction between "selfishly grab power" and "focus on growth so you don't get outcompeted". E.g. Is starting a company or a political party a power grab?
As for reading the paper in detail, it's largely just making the case that a sustained period of technological unemployment, without breakthroughs in alignment and cooperation, would tend to make our civilization serve humans' interests more and more poorly over time in a way that'd be hard to resist. I think arguing that things are likely to move faster would be a good objection to the plausibility of this scenario. But we still think it's an important point that the misalignment of our civilization is possibly a second alignment problem that we'll have to solve.
ETA: To clarify what I mean by "need to align our civilization": Concretely, I'm imagining the government deploying a slightly superhuman AGI internally. Some say its constitution should care about world peace, others say it should prioritize domestic interests, there is a struggle and it gets a muddled mix of directives like LLMs have today. It never manages to sort out global cooperation, and meanwhile various internal factions compete to edit the AGI's constitution. It ends up with a less-than-enlightened focus on growth of some particular power structure, and the rest of us are permanently marginalized.
Good point about our summary of Christiano, thanks, will fix. I agree with your summary.
"We could broaden our moral circle to recognize that AIs—particularly agentic and sophisticated ones—should be seen as people too. ... From this perspective, gradually sharing control over the future with AIs might not be as undesirable as it initially seems."
Is what you're proposing just complete, advance capitulation to whoever takes over? If so, can I have all your stuff? If you change your values to prioritize me in your moral circle, it might not be as undesirable as it initially seems.
I agree that if we change ourselves to value the welfare of whoever controls the future, then their takeover will be desirable by definition. It's certainly a recipe for happiness - but then why not just modify your values to be happy with anything at all?
"I expect my view to get more popular over time."
I agree, except I think it mostly won't be humans holding this view when it's popular. Usually whoever takes over is glad they did, and includes themselves in their own moral circle. The question from my point of view is: will they include us in their moral circle? It's not obvious to me that they will, especially if we ourselves don't seem to care.
This reminds me of Stewy from Succession: "I'm spiritually and emotionally and ethically and morally behind whoever wins."
As someone who writes these kinds of papers, I try to make an effort to cite the original inspirations when possible. And although I agree with Robin's theory broadly, there are also some mechanical reasons why Yudkowsky in particular is hard to cite.
To me as a reader, the most valuable things about the academic paper style are:
1) A clear, short summary (the abstract).
2) Explicitly stated contributions.
3) Standard jargon, or an explicit note when departing from it.
4) A related-work section that contrasts one's own position against others'.
5) An explicit account of what evidence is being marshalled and where it comes from.
6) Main claims listed explicitly.
7) A "limitations" or "why I might be wrong" section (in the best papers).
Yudkowsky mostly doesn't do these things. That doesn't mean he doesn't deserve credit for making a clear and accessible case for many foundational aspects of AI safety. It's just that in any particular context, it's hard to say what, exactly, his claims or contributions were.
In this setting, maybe the most appropriate citation would be something like "as illustrated in many thought experiments by Yudkowsky [cite particular sections of the Sequences and HPMOR], it's dangerous to rely on any protocol for detecting scheming by agents more intelligent than oneself". But that's a pretty broad claim. Maybe I'm being unfair - but it's not clear to me what exactly Yudkowsky's work says about the workability of these schemes other than "there be dragons here".