ryan_greenblatt

I'm the chief scientist at Redwood Research.

Comments

Agreed, downvoted my comment. (You can't strong downvote your own comment, or I would have done that.)

I was mostly just trying to point to prior arguments against similar arguments while expressing my view.

I don't think it's worth adjudicating the question of how relevant Vanessa's response is (though I do think Vanessa's response is directly relevant).

if the AIs are aligned to these structures, human disempowerment is likely because these structures are aligned to humans way less than they seem

My claim would be that if single-single alignment is solved, this problem won't be existential. I agree that if you literally aligned all AIs to (e.g.) the mission of a non-profit as well as you can, you're in trouble. However, if you have single-single alignment:

  • At the most basic level, I expect we'll train AIs to give advice and ask them what they think will happen with various possible governance and alignment structures. If they think a governance structure will yield total human disempowerment, we'll do something else. This is a basic reason not to expect large classes of problems so long as we have single-single aligned AIs which are wise. (Though problems that require coordination to resolve might not be like this.) I'm very skeptical of a world where single-single alignment is well described as being solved and people don't ask for advice (or consider this advice seriously) because they never get around to asking AIs or there are no AIs aligned in such a way that they would try to give good advice.
  • I expect organizations will be explicitly controlled by people and (some of) those people will have AI representatives to represent their interests as I discuss here. If you think getting good AI representation is unlikely, that would be a crux, but this would be my proposed solution at least.
    • The explicit mission of for-profit companies is to empower the shareholders. It clearly doesn't serve the interests of the shareholders to end up dead or disempowered.
    • Democratic governments have similar properties.
    • At a more basic level, I think people running organizations won't decide "oh, we should put the AI in charge of running this organization, aligned to some mission rather than to the preferences of the people (like me) who currently have de facto or de jure power over this organization". This is a crazily disempowering move that I expect people will by default be too savvy to make in almost all cases. (This applies both to people with substantial de facto power and to those with de jure power.)
  • Even independent of the advice consideration, people will probably want AIs running organizations to be honest with at least the people controlling the organization. Given that I expect explicit control by people in almost all cases, if things are going in an existential direction, people can vote to change course.
  • I don't buy that there will be some sort of existential multi-polar trap even without coordination (though I also expect coordination), due to things like the strategy stealing assumption, as I also discuss in that comment.
  • If a subset of organizations diverge from a reasonable interpretation of what they were supposed to do (but are still basically obeying the law and some interpretation of what they were intentionally aligned to) and this is clear to the rest of the world (as I expect would be the case given some type of AI advisors), then the rest of the world can avoid problems from this subset via the court system or other mechanisms. Even if this subset of organizations, run by effectively rogue AIs, successfully runs away with resources, it would probably only end up with a fraction of total resources.

I think your response to a lot of this will be something like:

  • People won't have or won't listen to AI advisors.
  • Institutions will intentionally delude relevant people to acquire more power.

But, the key thing is that I expect at least some people will keep power, even if large subsets are deluded. E.g., I expect that corporate shareholders, board members, or governments will be very interested in the question of whether they will be disempowered by changes in structure. It does seem plausible (or even likely) to me that some people will engage in power grabs via ensuring AIs are aligned to them, deluding the world about what's going on using a variety of mechanisms (including, e.g., denying or manipulating access to AI representation/advisors), and expanding their (hard) power over time. The thing I don't buy is a case where very powerful people don't ask for advice at all prior to having been deluded by the organizations that they themselves run!

I think human power grabs like this are concerning, and there are a variety of plausible mitigations which seem somewhat reasonable.

Maybe your response is that the solutions that will be implemented in practice given concerns about human power grabs will involve aligning AIs to institutions in ways that yield the dynamics you describe? I'm skeptical given the dynamics discussed above about asking AIs for advice.

I've updated over time to thinking that trusting AI systems and mostly handing things off to AI systems is an important dynamic to be tracking. As in, the notion of human researcher obsolescence discussed here. In particular, it seems important to understand what is required for this and when we should make this transition.

I previously thought that this handoff should be mostly beyond our planning horizon (for more prosaic researchers) because by the time of this handoff we will have already spent many years being radically (e.g., >10x) accelerated by AI systems, but I've updated against this due to updating toward shorter timelines, faster takeoff, and lower effort on safety (indicating less pausing). I also am just more into end-to-end planning with more of the later details figured out.

As I previously discussed here, I still think that you can do this with Paul-corrigible AGIs, including AGIs that you wouldn't be happy trusting with building a utopia (at least not a utopia built without letting you reflect and then consulting you). However, these AIs will be nearly totally uncontrolled in the sense that you'll be deferring to them almost entirely. But the aim is that they'll ultimately give you back the power they acquire, etc. I also think pretty straightforward training strategies could very plausibly work (at least if you develop countermeasures to relatively specific threat models), like studying and depending on generalization.

(I've also updated some toward various of the arguments in this post about vulnerability and first-mover advantage, but I still disagree with the bottom line overall.)

Yeah, people at labs are generally not thoughtful about AI futurism IMO, though of course most people aren't thoughtful about AI futurism. And labs don't really have plans IMO. (TBC, I think careful futurism is hard, hard to check, and not clearly that useful given realistic levels of uncertainty.)

If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.

I don't have a ready-to-go list. You might be interested in this post and the comments responding to it, though I'd note that I disagree substantially with the post.

Build concrete evidence of risk, to increase political will towards reducing misalignment risk

Working on demonstrating evidence of risk at a very irresponsible developer might be worse than you'd hope, because they might prevent you from publishing or try to prevent this type of research from happening in the first place (given that it might be bad for their interests).

However, it's also possible they'd be skeptical of research done elsewhere.

I'm not sure how to reconcile these considerations.

I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".

My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels.

Part of the objection is in avoiding multipolar traps, but there is also a more basic story like:

  • Humans own capital/influence.
  • They use this influence to serve their own interests and have an (aligned) AI system which faithfully represents their interests.
    • Given that AIs can make high-quality representation very cheap, the AI representation is very good and granular. Thus, something like the strategy-stealing assumption can hold and we might expect that humans end up with the same expected fraction of capital/influence they started with (at least to the extent they are interested in saving rather than consumption).

Even without any coordination, this can potentially work OK. There are objections to the strategy-stealing assumption, but none of these seem existential if we get to a point where everyone has wildly superintelligent and aligned AI representatives and we've ensured humans are physically robust to offense-dominant technologies like bioweapons.

(I'm optimistic about being robust to bioweapons within a year or two of having wildly superhuman AIs, though we might run into huge issues during this transitional period... Regardless, bioweapons deployed by terrorists or as part of a power grab during a brief transitional period don't seem like the threat model you're describing.)
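(To gesture at this a bit more concretely, here's a rough formalization which isn't from the original discussion of the strategy-stealing assumption, just my own sketch: suppose humans and their aligned AI representatives start with a fraction $f_0$ of total resources/influence, and for any strategy available to other coalitions there is a comparably effective "stolen" version available to them. Then, roughly,

$$\mathbb{E}[f_{t+1} \mid f_t] \approx f_t \quad \Rightarrow \quad \mathbb{E}[f_t] \approx f_0 \text{ for all } t,$$

i.e., their expected share of influence is preserved over time, modulo the caveats about offense-dominant technologies above and the one-time shifts discussed below.)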

I expect some issues with races-to-the-bottom / negative sum dynamics / negative externalities like:

  • By default, increased industry on earth shortly after the creation of very powerful AI will result in boiling the oceans (via fusion power). If you don't participate in this industry, you might be substantially outcompeted by others[1]. However, I don't think it will be that expensive to protect humans through this period, especially if you're willing to use strategies like converting people into emulated minds. Thus, this doesn't seem at all likely to be literally existential. (I'm also optimistic about coordination here.)
  • There might be one-time shifts in power between humans via mechanisms like states becoming more powerful. But, ultimately these states will be controlled by humans or by appointed successors of humans if alignment isn't an issue. Mechanisms like competing over the quantity of bribery are zero-sum, as they just change the distribution of power, and this can be priced in as a one-time shift even without coordination to avoid a race to the bottom on bribes.

But, this still doesn't seem to cause issues with humans retaining control via their AI representatives? Perhaps the distribution of power between humans will be problematic and extremely unequal, and the biosphere will be mostly physically destroyed (though humans will survive), but I thought you were making stronger claims.

Edit in response to your edit: If we align the AI to some arbitrary target which is seriously misaligned with humanity as a whole (due to infighting or other issues), I agree this can cause existential problems.

(I think I should read the paper in more detail before engaging more than this!)


  1. It's unclear if boiling the oceans would result in substantial acceleration. This depends on how quickly you can develop industry in space and Dyson-sphere-style structures. I'd guess the speed-up is much less than a year. ↩︎

The paper says:

Christiano (2019) makes the case that sudden disempowerment is unlikely,

This isn't accurate. The post What failure looks like includes a scenario involving sudden disempowerment!

The post does say:

The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.

I think this is probably not what failure will look like,

But, I think it is mostly arguing against threat models involving fast AI capability takeoff (where the level of capabilities takes its creators and others by surprise and fast capabilities progress allows AIs to suddenly become powerful enough to take over) rather than threat models involving sudden disempowerment from a point where AIs are already well known to be extremely powerful.

I (remain) skeptical that the sort of failure mode described here is plausible if we solve the problem of aligning individual AI systems with their designers' intentions without this alignment requiring any substantial additional costs (that is, we solve single-single alignment with minimal alignment tax).

This has previously been argued by Vanessa here and Paul here in response to a post making a similar claim.

I do worry about human power grabs: some humans obtaining greatly more power as enabled by AI (even if we have no serious alignment issues). However, I don't think this matches the story you describe and the mitigations seem substantially different than what you seem to be imagining.

I'm also somewhat skeptical of the threat model you describe in the case where alignment isn't solved. I think the difference between the story you tell and something more like You get what you measure is important.

I worry I'm misunderstanding something because I haven't read the paper in detail.
