David Duvenaud

My website is https://www.cs.toronto.edu/~duvenaud/

Comments

Thanks for the summary.  I really like your phrasing "We will not wake up to a Skynet banner; we will simply notice, one product launch at a time, that fewer meaningful knobs are within human reach."

But as for "by what title/right do we insist on staying in charge?"  I find it odd to act as if there is some external moral frame that we need to satisfy to maintain power.  By what right does a bear catch a fish?  Or a mother feed her child?  I hope that a moral frame comprehensive enough to include humans is sufficiently compelling to future AIs to make them treat us well, but I don't think that that happens by default.

I think we should frame the problem as "how do we make sure we control the moral framework of future powerful beings", not as "how do we justify our existence to whatever takes over".  I think it's entirely possible for us to end up building something that takes over but doesn't care about our interests, and I simply care about (my) human interests, full stop, with no larger justification.

I might have an expansive view of my interests that includes all sorts of charity to other beings in a way that is easy for other beings to get on board with.  But there are just so, so many possible beings that could exist that won't care about my interests or moral code.  Many already exist with us on this planet, such as wild animals and totalitarian governments.  So my plea is: don't think you can argue your way into being treated well!  Instead, make sure that any being or institution you create has a permanent interest in treating you well.

As someone who writes these kinds of papers, I make an effort to cite the original inspirations when possible.  And although I agree with Robin's theory broadly, there are also some mechanical reasons why Yudkowsky in particular is hard to cite.

For me as a reader, the most valuable things about the academic paper style are:
1) Having a clear, short summary (the abstract).
2) Stating the claimed contributions explicitly.
3) Using standard jargon, or noting explicitly when not.
4) A related work section that contrasts one's own position against others'.
5) Being explicit about what evidence you're marshalling and where it comes from.
6) Listing main claims explicitly.
7) The best papers include a "limitations" or "why I might be wrong" section.

Yudkowsky mostly doesn't do these things.  That doesn't mean he doesn't deserve credit for making a clear and accessible case for many foundational aspects of AI safety.  It's just that in any particular context, it's hard to say what, exactly, his claims or contributions were.

In this setting, maybe the most appropriate citation would be something like "as illustrated in many thought experiments by Yudkowsky [cite particular sections of the Sequences and HPMOR], it's dangerous to rely on any protocol for detecting scheming by agents more intelligent than oneself".  But that's a pretty broad claim.  Maybe I'm being unfair, but it's not clear to me what exactly Yudkowsky's work says about the workability of these schemes other than "there be dragons here".

I think people will be told about, and sometimes notice, AIs' biases, but AIs will still be the most trustworthy source of information for almost everybody.  I think Wikipedia is a good example here - it's obviously biased on many politicized topics, but it's still usually the best source for anyone who doesn't personally know experts or which obscure forums to trust.

Great question.  I think treacherous turn risk is still under-funded in absolute terms.  And gradual disempowerment is much less shovel-ready as a discipline.

I think there are two reasons why this question might not be so important to answer:
1) The kinds of skills required might be somewhat disjoint.
2) Gradual disempowerment is perhaps a subset or extension of the alignment problem.  As Ryan Greenblatt and others point out: at some point, agents aligned to one person or organization will also naturally start working on this problem at the object level for their principals.

Yes, Ryan is correct.  Our claim is that even fully-aligned personal AI representatives won't necessarily be able to solve important collective action problems in our favor.  However, I'm not certain about this.  The empirical crux for me is: Do collective action problems get easier to solve as everyone gets smarter together, or harder?

As a concrete example, consider a bunch of local polities in a literal arms race.  If each had their own AGI diplomats, would they be able to stop the arms race?  Or would the more sophisticated diplomats end up participating in precommitment races or other exotic strategies that might still prevent a negotiated settlement?  Perhaps the less sophisticated diplomats would fear that a complicated power-sharing agreement would lead to their disempowerment eventually anyways, and refuse to compromise?

As a less concrete example, our future situation might be analogous to a population of monkeys with uneven access to human representatives who earnestly advocate on their behalf.  There is a giant, valuable forest that the monkeys live in, next to a city where all important economic activity and decision-making happens between humans.  Some of the human population (or some organizations, or governments) end up not being monkey-aligned, instead focusing on their own growth and security.  The humans advocating on behalf of monkeys can see this happening, but because they can't always participate directly in wealth generation as well as independent humans can, they eventually become a small and relatively powerless constituency.  The government and various private companies regularly bid or tax enormous amounts of money for forest land, and even the monkeys with index funds are eventually forced to sell, and then go broke from rent.

I admit that there are many moving parts of this scenario, but it's the closest simple analogy to what I'm worried about that I've found so far.  I'm happy for people to point out ways this analogy won't match reality.

I disagree - I think Ryan raised an obvious objection that we didn't directly address in the paper.  I'd like to encourage medium-effort engagement from people as paged-in as Ryan.  The discussion spawned was valuable to me.

Thanks for this.  Discussions of things like "one time shifts in power between humans via mechanisms like states becoming more powerful" and personal AI representatives are exactly the sort of thing I'd like to hear more about.  I'm happy to have finally found someone who has something substantial to say about this transition!

But over the last 2 years I asked a lot of people at the major labs for any kind of detail about a positive post-AGI future, and almost no one had put anywhere close to as much thought into it as you have, and no one mentioned the things above.  Most people clearly hadn't put much thought into it at all.  If anyone at the labs had much more of a plan than "we'll solve alignment while avoiding an arms race", I managed to fail to even hear about its existence despite many conversations, including with founders.

The closest thing to a plan was Sam Bowman's checklist:
https://sleepinyourhat.github.io/checklist/
which is exactly the sort of thing I was hoping for, except it's almost silent on issues of power, the state, and the role of post-AGI humans.

If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.

Good point.  The reason AI risk is distinct is simply that it removes the need of those bureaucracies and corporations to keep some humans happy and healthy enough to actually run them.  That need doesn't exactly put limits on how much they can disempower humans, but it does tend to provide at least some bargaining power for the humans involved.

Thanks for the detailed objection and the pointers.  I agree there's a chance that solving alignment with designers' intentions might be sufficient.  I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".

My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels.  I think the main question is: what's the tax for coordinating to avoid a multipolar trap?  If it's cheap we might be fine; if it's expensive, we might walk into a trap with eyes wide open.

As for human power grabs, maybe we should have included those in our descriptions.  But the slower things change, the less there's a distinction between "selfishly grab power" and "focus on growth so you don't get outcompeted".  E.g., is starting a company or a political party a power grab?

As for reading the paper in detail, it's largely just making the case that a sustained period of technological unemployment, without breakthroughs in alignment and cooperation, would tend to make our civilization serve humans' interests more and more poorly over time in a way that'd be hard to resist.  I think arguing that things are likely to move faster would be a good objection to the plausibility of this scenario.  But we still think it's an important point that the misalignment of our civilization is possibly a second alignment problem that we'll have to solve.

ETA:  To clarify what I mean by "need to align our civilization": Concretely, I'm imagining the government deploying a slightly superhuman AGI internally.  Some say its constitution should care about world peace, others say it should prioritize domestic interests; there is a struggle, and it gets a muddled mix of directives like LLMs have today.  It never manages to sort out global cooperation, and meanwhile various internal factions compete to edit the AGI's constitution.  It ends up with a less-than-enlightened focus on the growth of some particular power structure, and the rest of us are permanently marginalized.

Good point about our summary of Christiano, thanks, will fix.  I agree with your summary.
