Thanks for the answer. It's nice to get data about how other people think about this subject.

the concern that the more sociopathic people wind up in positions of power is the big concern.

Agreed!

Do I understand correctly: You'd guess that

  • 99% of humans have a "positive empathy-sadism balance",
  • and of those, 90-99% could be trusted to control the world (via controlling ASI),
  • i.e., ~89-98% of humanity could be trusted to control the world with ASI-grade power?

If so, then I'm curious -- and somewhat bewildered! -- as to how you arrived at those guesses/numbers.

I'm under the impression that narcissism and sadism have prevalences of very roughly 6% and 4%, respectively. See e.g. this post, or the studies cited therein. Additionally, probably something like 1% to 10% of people are psychopaths, depending on what criteria are used to define "psychopathy". Even assuming there's a lot of overlap, I think a reasonable guess would be that ~8% of humans have at least one of those traits. (Or 10%, if we include psychopathy.)

I'm guessing you disagree with those statistics? If yes, what other evidence leads you to your different (much lower) estimates?

Do you believe that someone with (sub-)clinical narcissism, if given the keys to the universe, would bring about good outcomes for all (with probability >90%)? Why/how? What about psychopaths?

Do you completely disagree with the aphorism that "power corrupts, and absolute power corrupts absolutely"?

Do you think that having good intentions (and +0 to +3 SD intelligence) is probably enough for someone to produce good outcomes, if they're given ASI-grade power?

FWIW, my guesstimates are that

  • over 50% of genpop would become corrupted by ASI-grade power, or are sadistic/narcissistic/psychopathic/spiteful to begin with,
  • of the remainder, >50% would fuck things up astronomically, despite their good intentions[1],
  • genetic traits like psychopathy and narcissism (not sure about sadism), and acquired traits like cynicism, are much more prevalent (~5x odds?) in people who will end up in charge of AGI projects, relative to genpop. OTOH, competence at not-going-insane is likely higher among them too. (A rough composition of these numbers is sketched below.)
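Here is that rough composition (the inputs are just the point estimates stated in the bullets, plus the ~8% dark-trait estimate from above; the composition itself is my own back-of-the-envelope sketch, not a prediction):

```python
# Composing the guesstimates above. All inputs are the stated point estimates;
# treat the outputs as order-of-magnitude gestures, not predictions.
p_corrupt_or_dark = 0.50   # bullet 1: corrupted by ASI-grade power, or dark-trait to begin with
p_screws_up_anyway = 0.50  # bullet 2: of the remainder, fail catastrophically despite good intentions

p_good_outcome_genpop = (1 - p_corrupt_or_dark) * (1 - p_screws_up_anyway)
print(f"genpop producing good outcomes: <{p_good_outcome_genpop:.0%}")  # <25%

# Bullet 3: ~5x the odds of dark traits among people who end up running AGI projects.
# Using the ~8% general-population estimate from the other comment as the base rate:
base_rate = 0.08
odds = base_rate / (1 - base_rate)
p_dark_among_leads = (5 * odds) / (1 + 5 * odds)
print(f"dark-trait prevalence among AGI-project leads: ~{p_dark_among_leads:.0%}")  # ~30%
```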

it would be so easy to benefit humanity, just by telling your slave AGI to go make it happen. A lot of people would enjoy being hailed as a benevolent hero

I note that if someone is using an AGI as a slave, and is motivated by wanting prestige status, then I do not expect that to end well for anyone else. (Someone with moderate power, e.g. a medieval king, with the drive to be hailed a benevolent hero, might indeed do great things for other people. But someone with more extreme power -- like ASI-grade power -- could just... rewire everyone's brains; or create worlds full of suffering wretches, for him to save and be hailed/adored by; or... you get the idea.)


  1. Even relatively trivial things like social media or drugs mess lots of humans up; and things like "ability to make arbitrary modifications to your mind" or "ability to do anything you want, to anyone, with complete impunity" are even further OOD, and open up even more powerful superstimuli/reward-system hacks. Aside from tempting/corrupting humans to become selfish, I think that kind of situation has high potential to just lead to them going insane or breaking (e.g. start wireheading) in any number of ways.

    And then there are other failure modes, like insufficient moral uncertainty and locking in some parochial choice of values, or a set of values that made sense in some baseline human context but which generalize to something horrible. ("Obviously we should fill the universe with Democracy/Christianity/Islam/Hedonism/whatever!", ... "Oops, turns out Yahweh is pretty horrible, actually!") ↩︎

I'd be interested to see that draft as a post!

What fraction of humans in set X would you guess have a "positive empathy-sadism balance", for

  • X = all of humanity?
  • X = people in control of (governmental) AGI projects?

I agree that the social environment / circumstances could have a large effect on whether someone ends up wielding power selfishly or benevolently. I wonder if there's any way anyone concerned about x/s-risks could meaningfully affect those conditions.

I'm guessing[1] I'm quite a bit more pessimistic than you about what fraction of humans would produce good outcomes if they controlled the world.


  1. with a lot of uncertainty, due to ignorance of your models. ↩︎

I agree that "strengthening democracy" sounds nice, and also that it's too vague to be actionable. Also, what exactly would be the causal chain from "stronger democracy" (whatever that means) to "command structure in the nationalized AGI project is trustworthy and robustly aligned to the common good"?

If you have any more concrete ideas in this domain, I'd be interested to read about them!

Pushing for nationalization or not might affect when it's done, giving some modicum of control.

I notice that I have almost no concrete model of what that sentence means. A couple of salient questions[1] I'd be very curious to hear answers to:

  • What concrete ways exist for affecting when (and how) nationalization is done? (How, concretely, does one "push" for/against nationalization of AGI?)

  • By what concrete causal mechanism could pushing for nationalization confer a modicum of control; and control over what exactly, and to whom?


  1. Other questions I wish I (or people advocating for any policy w.r.t. AGI) had answers to include (i.a.) "How could I/we/anyone ensure that the resulting AGI project actually benefits everyone? Who, in actual concrete practice, would end up effectively having control over the AGI? How could (e.g.) the public hold those people accountable, even as those people gain unassailable power? How do we ensure that those people are not malevolent to begin with, and also don't become corrupted by power? What kinds of oversight mechanisms could be built, and how?" ↩︎

make their models sufficiently safe

What does "safe" mean, in this post?

Do you mean something like "effectively controllable"? If yes: controlled by whom? Suppose AGI were controlled by some high-ranking people at (e.g.) the NSA; with what probability do you think that would be "safe" for most people?

Doing nationalization right

I think this post (or the models/thinking that generated it) might be missing an important consideration[1]: "Is it possible to ensure that the nationalized AGI project does not end up de facto controlled by not-good people? If yes, how?"

Relevant quote from Yudkowsky's Six Dimensions of Operational Adequacy in AGI Projects (emphasis added):

Opsec [...] Military-grade or national-security-grade security. (It's hard to see how attempts to get this could avoid being counterproductive, considering the difficulty of obtaining trustworthy command and common good commitment with respect to any entity that can deploy such force [...])

Another quote (emphasis mine):

You cannot possibly cause such a project[2] to exist with adequately trustworthy command, alignment mindset, and common-good commitment, and you should therefore not try to make it exist, first because you will simply create a still more dire competitor developing unaligned AGI, and second because if such an AGI could be aligned it would be a hell of an s-risk given the probable command structure.


  1. or possibly a crucial consideration ↩︎

  2. The quote is referring to "[...] a single global Manhattan Project which is somehow not answerable to non-common-good command such as Trump or Putin or the United Nations Security Council. [...]" ↩︎

A related pattern-in-reality that I've had on my todo-list to investigate is something like "cooperation-enforcing structures". Things like

  • legal systems, police
  • immune systems (esp. in suppressing cancer)
  • social norms, reputation systems, etc.

I'd been approaching this from a perspective of "how defeating Moloch can happen in general" and "how might we steer Earth to be less Moloch-fucked"; not so much AI safety directly.

Do you think a good theory of hierarchical agency would subsume those kinds of patterns-in-reality? If yes: I wonder if their inclusion could be used as a criterion/heuristic for narrowing down the search for a good theory?

find some way to argue that "generally intelligent world-optimizing agents" and "subjects of AGI-doom arguments" are not the exact same type of system

We could maybe weaken this requirement? Perhaps it would suffice to show/argue that it's feasible[1] to build any kind of "acute-risk-period-ending AI"[2] that is not a "subject of AGI-doom arguments"?

I'd be (very) curious to see such arguments.[3]


  1. within time constraints, before anyone else builds a "subject of AGI-doom arguments" ↩︎

  2. or, "AIs that implement humanity's CEV" ↩︎

  3. If I became convinced that it's feasible to build such a "pivotal AI" that is not "subject to AGI doom arguments", I think that would shift a bunch of my probability mass from "we die due to unaligned AI" to "we die-or-worse due to misaligned humans controlling ASI" and "utopia". ↩︎

I think this is an important subject and I agree with much of this post. However, I think the framing/perspective might be subtly but importantly wrong-or-confused.

To illustrate:

How much of the issue here is about the very singular nature of the One dominant project, vs centralization more generally into a small number of projects?

Seems to me that centralization of power per se is not the problem.

I think the problem is something more like

  • we want to give as much power as possible to "good" processes, e.g. a process that robustly pursues humanity's CEV[1]; and we want to minimize the power held by "evil" processes

  • but: a large fraction of humans are evil, or become evil once prosocial pressures are removed; and we do not know how to reliably construct "good" AIs

  • and also: we (humans) are confused and in disagreement about what "good" even means

  • and even if it were clear what a "good goal" is, we have no reliable way of ensuring that an AI or a human institution is robustly pursuing such a goal.

I agree that (given the above conditions) concentrating power into the hands of a few humans or AIs would in expectation be (very) bad. (OTOH, a decentralized race is also very bad.) But concentration-vs-decentralization of power is just one relevant consideration among many.

Thus: if the quoted question has an implicit assumption like "the main variable to tweak is distribution-of-power", then I think it is trying to carve the problem at unnatural joints, or making a false assumption that might lead to ignoring several other important variables.

(And less centralization of power has serious dangers of its own. See e.g. Wei Dai's comment.)

I think a more productive frame might be something like "how do we construct incentives, oversight, distribution of power, and other mechanisms, such that Ring Projects remain robustly aligned to 'the greater good'?"

And maybe also "how do we become less confused about what 'the greater good' even is, in a way that is practically applicable to aligning Ring Projects?"


  1. If such a thing is even possible. ↩︎

Upvoted and disagreed.[1]

One thing in particular that stands out to me: The whole framing seems useless unless Premise 1 is modified to include a condition like

[...] we can select a curriculum and reinforcement signal which [...] and which makes the model highly "useful/capable".

Otherwise, Premise 1 is trivially true: We could (e.g.) set all the model's weights to 0.0, thereby guaranteeing the non-entrainment of any ("bad") circuits.
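To make that degenerate case concrete, here is a toy sketch (mine, not from the post; PyTorch, with an arbitrary small architecture chosen purely for illustration):

```python
# A model with all parameters set to zero trivially contains no "bad" circuits --
# and no useful circuits either, which is why Premise 1 needs a capability condition.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

with torch.no_grad():
    for p in model.parameters():
        p.zero_()  # zero every weight and bias in place

x = torch.randn(8, 16)
print(model(x))  # all zeros: perfectly "safe", perfectly useless
```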

I'm curious: what do you think would be a good (...useful?) operationalization of "useful/capable"?

Another issue: K and epsilon might need to be unrealistically small. Once the model starts modifying itself or constructing successor models (and possibly earlier), a single strategically placed sign-flip in the model's outputs might cause catastrophe.[2]
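To gesture at why even a single flipped sign can matter (a toy illustration of mine, not anything from the post): if one scalar the system emits, say a score it assigns to candidate plans, gets its sign flipped, the selection that score drives is inverted while everything else looks unchanged.

```python
# Toy example: a single sign flip on an emitted score inverts which plan wins.
plans = {"help humans": 0.9, "disempower humans": -0.9}

def pick_plan(scores, sign=1.0):
    # choose the plan with the highest (sign-adjusted) score
    return max(scores, key=lambda name: sign * scores[name])

print(pick_plan(plans))             # help humans
print(pick_plan(plans, sign=-1.0))  # disempower humans
```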


  1. I think writing one's thoughts/intuitions out like this is valuable -- for sharing frames/ideas, getting feedback, etc. Thus: thanks for writing it up. Separately, I think the presented frame/case is probably confused, and almost useless (at best). ↩︎

  2. Although that might require the control structures (be they Shards or a utility function or w/e) of the model to be highly "localized/concentrated" in some sense. (OTOH, that seems likely to at least eventually be the case?) ↩︎
