This question is partly motivated by observing recent discussions about corrigibility and wondering to what extent the people involved have thought about how their results might be used.

If there existed practically implementable ways to make AGIs corrigible to arbitrary principals, that would enable a wide range of actors to eventually control powerful AGIs. Whether that would be net good or bad in expectation would depend on the values/morality of the principals of such AGIs.

Currently it seems highly unclear what kinds of people we should expect to end up in control of corrigible ASIs, if corrigibility were practically feasible.

What (crucial) considerations should one take into account, when deciding whether to publish---or with whom to privately share---various kinds of corrigibility-related results?

6 Answers

Martin Randall

51

Possible responses to discovering a potential infohazard:

  • Tell everybody
  • Tell nobody
  • Follow a responsible disclosure process

If you have discovered an apparent solution to corrigibility, then my prior is:

  • 90%: It's not actually a solution.
  • 9%: Someone else will discover the solution before AGI is created.
  • 0.9%: Someone else has already discovered the same solution.
  • 0.1%: This is known to you alone and you can keep it secret until AGI.

Given those priors, I recommend responsible disclosure to a group of your choosing. I suggest a group which:

  • if applicable, is the research group you already belong to (if you don't trust them with research results, you shouldn't be researching with them)
  • can accurately determine if it is a real solution (helps in the 90% case)
  • you would like to give more influence over the future (helps in all other cases)
  • will reward you for the disclosure (only fair)

Then if it's not assessed to be a real solution, you publish it. If it is a real solution then coordinate next steps with the group, but by default publish it after some reasonable delay.

Inspired by @MadHatter's Mental Model of Infohazards:

Two people can keep a secret if one of them is dead.

Seth Herd

42

It seems like you're assuming people won't build AGI if they don't have reliable ways to control it, or else that sovereign (uncontrolled) AGI would be likely to be friendly to humanity. Both seem unlikely at this point, to me. It's hard to tell when your alignment plan is good enough, and humans are foolishly optimistic about new projects, so they'll probably build AGI with or without a solid alignment plan.

So I'd say any and all solutions to corrigibility/control should be published.

Also, almost any solution to alignment in general could probably be used for control as well. And probably would be. See

https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than

And

https://www.lesswrong.com/posts/587AsXewhzcFBDesH/intent-alignment-as-a-stepping-stone-to-value-alignment

It seems like you're assuming people won't build AGI if they don't have reliable ways to control it, or else that sovereign (uncontrolled) AGI would be likely to be friendly to humanity.

I'm assuming neither. I agree with you that both seem (very) unlikely. [1]

It seems like you're assuming that any humans succeeding in controlling AGI would be (in expectation) preferable to extinction? If so, that seems like a crux: if I agreed with that, then I'd also agree with "publish all corrigibility results".


  1. I expect that unaligned ASI would lead to extinction, and

...
2 Seth Herd
I see. I think about 99% of humanity at the very least are not so sadistic as to create a future with less than zero utility. Sociopaths are something like ten percent of the population, but like everything else it's on a spectrum. Sociopaths usually also have some measure of empathy and desire for approval. In a world where they've won, I think most of them would rather be hailed as a hero than be an eternal sadistic despot. Sociopathy doesn't automatically include a lot of sadism, just desire for revenge against perceived enemies. So I'd take my chances with a human overlord far before accepting extinction. Note that our light cone with zero value might also eclipse other light cones that might've had value if we didn't let our AGI go rogue to avoid s-risk.
1 rvnnt
That's a good thing to consider! However, taking Earth's situation as a prior for other "cradles of intelligence", I think that consideration leads back to the question of "should we expect Earth's lightcone to be better or worse than zero-value (conditional on corrigibility)?"
1 rvnnt
IIUC, your model would (at least tentatively) predict that

  • if person P has a lot of power over person Q,
  • and P is not sadistic,
  • and P is sufficiently secure/well-resourced that P doesn't "need" to exploit Q,
  • then P will not intentionally do anything that would be horrible for Q?

If so, how do you reconcile that with e.g. non-sadistic serial killers, rapists, or child abusers? Or non-sadistic narcissists in whose ideal world everyone else would be their worshipful subject/slave?

That last point also raises the question: Would you prefer the existence of lots of (either happily or grudgingly) submissive slaves over oblivion?

To me it seems that terrible outcomes do not require sadism. Seems sufficient that P be low in empathy, and want from Q something Q does not want to provide (like admiration, submission, sex, violent sport, or even just attention).[1] I'm confused as to how/why you disagree.

  1. Also, AFAICT, about 0.5% to 8% of humans are sadistic, and about 8% to 16% have very little or zero empathy. How did you arrive at "99% of humanity [...] are not so sadistic"? Did you account for the fact that most people with sadistic inclinations probably try to hide those inclinations? (Like, if only 0.5% of people appear sadistic, then I'd expect the actual prevalence of sadism to be more like ~4%.) ↩︎

Noosphere89

20

The answer depends on your values, so there isn't really a single answer to give here.

Charlie Steiner

2-1

Yes. Current AI policy is like people in a crowded room fighting over who gets to hold a bomb. It's more important to defuse the bomb than it is to prevent someone you dislike from holding it.

That said, we're currently not near any satisfactory solutions to corrigibility. And I do think it would be better for the world if it were easier (by some combination of technical and societal factors) to build AI that works for the good of all humanity than to build equally-smart AI that follows the orders of a single person. So yes, we should focus research and policy effort toward making that happen, if we can.

And if we were in that world already, then I agree releasing all the technical details of an AI that follows the orders of a single person would be bad.

It's more important to defuse the bomb than it is to prevent someone you dislike from holding it.

I think there is a key disanalogy to the situation with AGI: the analogy would be stronger if the bomb were likely to kill everyone, but also had some (perhaps very small) probability of conferring godlike power to whomever holds it. I.e., there is a tradeoff: decrease the probability of dying, at the expense of increasing the probability of s-risks from corrupt(ible) humans gaining godlike power.

If you agree that there exists that kind of tradeoff, I'm cur...

3 Charlie Steiner
I'd say the probability that some authority figure would use an order-following AI to get torturous revenge on me (probably for being part of a group they dislike) is quite slim. Maybe one in a few thousand, with more extreme suffering being less likely by a few more orders of magnitude? The probability that they have me killed for instrumental reasons, or otherwise waste the value of the future by my lights, is much higher - ten percent-ish, depending on my distribution over who's giving the orders. But this isn't any worse to me than being killed by an AI that wants to replace me with molecular smiley faces.
1 rvnnt
To me, those odds each seem optimistic by a factor of about 1000, but ~reasonable relative to each other. (I don't see any low-cost way to find out why we disagree so strongly, though. Moving on, I guess.)

Makes sense (given your low odds for bad outcomes). Do you also care about minds that are not you, though? Do you expect most future minds/persons that are brought into existence to have nice lives, if (say) Donald "Grab Them By The Pussy" Trump became god-emperor (and was the one deciding what persons/minds get to exist)?

Satron

11

Maybe it is a controversial take, but I am in favor of publishing all solutions to corrigibility.

If a company fails at coming up with solutions for AGI corrigibility, they won't stop building AGI. Instead, they will proceed with ramping up capabilities and end up with a misaligned AGI that (for all we know) will want to turn the universe into paperclips.

Due to instrumental convergence, an AGI whose goals are not explicitly aligned with human goals is going to engage in very undesirable behavior by default.

rvnnt

10

Taking a stab at answering my own question; an almost-certainly non-exhaustive list:

  • Would the results be applicable to deep-learning-based AGIs?[1] If I think not, how can I be confident they couldn't be made applicable?

  • Do the corrigibility results provide (indirect) insights into other aspects of engineering (rather than SGD'ing) AGIs?

  • How much weight one gives to avoiding x-risks vs s-risks.[2]

  • Who actually needs to know of the results? Would sharing the results with the whole Internet lead to better outcomes than (e.g.) sharing the results with a smaller number of safety-conscious researchers? (What does the cost-benefit analysis look like? Did I even do one?)

  • How optimistic (or pessimistic) one is about the common-good commitment (or corruptibility) of the people who one thinks might end up wielding corrigible AGIs.


  1. Something like the True Name of corrigibility might at first glance seem applicable only to AIs of whose internals we have some meaningful understanding or control. ↩︎

  2. If corrigibility were easily feasible, then at first glance, that would seem to reduce the probability of extinction (via unaligned AI), but increase the probability of astronomical suffering (under god-emperor Altman/Ratcliffe/Xi/Putin/...). ↩︎