A potentially somewhat important thing which I haven't seen discussed:
Thank you for (being one of the horrifyingly few people) doing sane reporting on these crucially important topics.
Typo: "And humanity needs all the help we it can get."
Out of (1)-(3), I think (3)[1] is clearly most probable:
(Of course one could also come up with other possibilities besides (...
Note that our light cone with zero value might also eclipse other light cones that might've had value if we didn't let our AGI go rogue to avoid s-risk.
That's a good thing to consider! However, taking Earth's situation as a prior for other "cradles of intelligence", I think that consideration returns to the question of "should we expect Earth's lightcone to be better or worse than zero-value (conditional on corrigibility)?"
To me, those odds each seem optimistic by a factor of about 1000, but ~reasonable relative to each other.
(I don't see any low-cost way to find out why we disagree so strongly, though. Moving on, I guess.)
But this isn't any worse to me than being killed [...]
Makes sense (given your low odds for bad outcomes).
Do you also care about minds that are not you, though? Do you expect most future minds/persons that are brought into existence to have nice lives, if (say) Donald "Grab Them By The Pussy" Trump became god-emperor (and was the one deciding what persons/minds get to exist)?
IIUC, your model would (at least tentatively) predict that
If so, how do you reconcile that with e.g. non-sadistic serial killers, rapists, or child abusers? Or non-sadistic narcissists in whose ideal world everyone else would be their worshipful subject/slave?
That last point also raises the question: Would you prefer the existence of lo...
It seems like you're assuming people won't build AGI if they don't have reliable ways to control it, or else that sovereign (uncontrolled) AGI would be likely to be friendly to humanity.
I'm assuming neither. I agree with you that both seem (very) unlikely. [1]
It seems like you're assuming that any humans succeeding in controlling AGI is (on expectation) preferable to extinction? If so, that seems like a crux: if I agreed with that, then I'd also agree with "publish all corrigibility results".
I expect that unaligned ASI would lead to extinction, and
It's more important to defuse the bomb than it is to prevent someone you dislike from holding it.
I think there is a key disanalogy to the situation with AGI: The analogy would be stronger if the bomb was likely to kill everyone, but also had some (perhaps very small) probability of conferring godlike power to whomever holds it. I.e., there is a tradeoff: decrease the probability of dying, at the expense of increasing the probability of S-risks from corrupt(ible) humans gaining godlike power.
If you agree that there exists that kind of tradeoff, I'm cur...
Taking a stab at answering my own question; an almost-certainly non-exhaustive list:
Would the results be applicable to deep-learning-based AGIs?[1] If I think not, how can I be confident they couldn't be made applicable?
Do the corrigibility results provide (indirect) insights into other aspects of engineering (rather than SGD'ing) AGIs?
How much weight one gives to avoiding x-risks vs s-risks.[2]
Who actually needs to know of the results? Would sharing the results with the whole Internet lead to better outcomes than (e.g.) sharing the results wit
I think the main value of that operationalization is enabling more concrete thinking/forecasting about how AI might progress. It models some of the relevant causal structure of reality, at a reasonable level of abstraction: not too nitty-gritty[1], not too abstract[2].
which would lead to "losing the forest for the trees", make the abstraction too effortful to use in practice, and/or risk making it irrelevant as soon as something changes in the world of AI ↩︎
e.g. a higher-level abstraction like "AI that speeds up AI development by a factor of N" might at
I think this approach to thinking about AI capabilities is quite pertinent. Could be worth including "Nx AI R&D labor AIs" in the list?
Cogent framing; thanks for writing it. I'd be very interested to read your framing for the problem of "how do we get to a good future for humanity, conditional on the first attractor state for AGI alignment?"[1]
Would you frame it as "the AGI lab leadership alignment problem"? Or a governance problem? Or something else? ↩︎
Here is a brainstorm of the big problems that remain once we successfully get into the first attractor state:
Thanks for the answer. It's nice to get data about how other people think about this subject.
the concern that the more sociopathic people wind up in positions of power is the big concern.
Agreed!
Do I understand correctly: You'd guess that
If so, then I'm curious -- and somewhat bewildered! -- as to how you arrived at those guesses/...
I'd be interested to see that draft as a post!
What fraction of humans in set X would you guess have a "positive empathy-sadism balance", for
I agree that the social environment / circumstances could have a large effect on whether someone ends up wielding power selfishly or benevolently. I wonder if there's any way anyone concerned about x/s-risks could meaningfully affect those conditions.
I'm guessing[1] I'm quite a bit more pessimistic than you about what fraction of humans would...
I agree that "strengthening democracy" sounds nice, and also that it's too vague to be actionable. Also, what exactly would be the causal chain from "stronger democracy" (whatever that means) to "command structure in the nationalized AGI project is trustworthy and robustly aligned to the common good"?
If you have any more concrete ideas in this domain, I'd be interested to read about them!
Pushing for nationalization or not might affect when it's done, giving some modicum of control.
I notice that I have almost no concrete model of what that sentence means. A couple of salient questions[1] I'd be very curious to hear answers to:
What concrete ways exist for affecting when (and how) nationalization is done? (How, concretely, does one "push" for/against nationalization of AGI?)
By what concrete causal mechanism could pushing for nationalization confer a modicum of control; and control over what exactly, and to whom?
Other questions
make their models sufficiently safe
What does "safe" mean, in this post?
Do you mean something like "effectively controllable"? If yes: controlled by whom? Suppose AGI were controlled by some high-ranking people at (e.g.) the NSA; with what probability do you think that would be "safe" for most people?
Doing nationalization right
I think this post (or the models/thinking that generated it) might be missing an important consideration[1]: "Is it possible to ensure that the nationalized AGI project does not end up de facto controlled by not-good people? If yes, how?"
Relevant quote from Yudkowsky's Six Dimensions of Operational Adequacy in AGI Projects (emphasis added):
...Opsec [...] Military-grade or national-security-grade security. (It's hard to see how attempts to get this could avoid being counterproductive, considering the difficulty of obtaining tru
A related pattern-in-reality that I've had on my todo-list to investigate is something like "cooperation-enforcing structures". Things like
I'd been approaching this from a perspective of "how defeating Moloch can happen in general" and "how might we steer Earth to be less Moloch-fucked"; not so much AI safety directly.
Do you think a good theory of hierarchical agency would subsume those kinds of patterns-in-reality? If yes: I wonder if their inclusion could be used as a criterion/heuristic for narrowing down the search for a good theory?
find some way to argue that "generally intelligent world-optimizing agents" and "subjects of AGI-doom arguments" are not the exact same type of system
We could maybe weaken this requirement? Perhaps it would suffice to show/argue that it's feasible[1] to build any kind of "acute risk period -ending AI"[2] that is not a "subject of AGI-doom arguments"?
I'd be (very) curious to see such arguments. [3]
I think this is an important subject and I agree with much of this post. However, I think the framing/perspective might be subtly but importantly wrong-or-confused.
To illustrate:
How much of the issue here is about the very singular nature of the One dominant project, vs centralization more generally into a small number of projects?
Seems to me that centralization of power per se is not the problem.
I think the problem is something more like
we want to give as much power as possible to "good" processes, e.g. a process that robustly pursues humanity's CE
Upvoted and disagreed. [1]
One thing in particular that stands out to me: The whole framing seems useless unless Premise 1 is modified to include a condition like
[...] we can select a curriculum and reinforcement signal which [...] and which makes the model highly "useful/capable".
Otherwise, Premise 1 is trivially true: We could (e.g.) set all the model's weights to 0.0, thereby guaranteeing the non-entrainment of any ("bad") circuits.
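To make the degenerate case concrete, here's a minimal sketch (assuming a PyTorch-style model; the architecture is arbitrary and purely illustrative) of a "curriculum/reinforcement signal" that guarantees no bad circuits get entrained, by producing no circuits at all:

```python
# Minimal sketch (illustrative only): a model with all weights set to 0.0
# trivially satisfies "no bad circuits entrained", while being useless.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

with torch.no_grad():
    for p in model.parameters():
        p.zero_()  # every weight and bias set to 0.0

x = torch.randn(8, 16)
print(model(x))  # all-zero outputs: no circuits at all, "bad" or otherwise
```

The guarantee holds only because the model can no longer do anything, which is why some "useful/capable" condition seems load-bearing for Premise 1.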
I'm curious: what do you think would be a good (...useful?) operationalization of "useful/capable"?
Another issue: K and ...
In Fig 1, is the vertical axis P(world) ?
Possibly a nitpick, but:
The development and deployment of AGI, or similarly advanced systems, could constitute a transformation rivaling those of the agricultural and industrial revolutions.
seems like a very strong understatement. Maybe replace "rivaling" with e.g. "(vastly) exceeding"?
Referring to the quote-picture from the Nvidia GTC keynote talk: I searched the talk's transcript, and could not find anything like the quote.
Could someone point out time-stamps of where Huang says (or implies) anything like the quote? Or is the quote entirely made up?
That clarifies a bunch of things. Thanks!
I'm not sure I understand what the post's central claim/conclusion is. I'm curious to understand it better. To focus on the Summary:
So overall, evolution is the source of ethics,
Do you mean: Evolution is the process that produced humans, and strongly influenced humans' ethics? Or are you claiming that (humans') evolution-induced ethics are what any reasonable agent ought to adhere to? Or something else?
and sapient evolved agents inherently have a dramatically different ethical status than any well-designed created agents [...]
...according to some h...
I wonder how much work it'd take to implement a system that incrementally generates a graph of the entire conversation. (Vertices would be sub-topics, represented as e.g. a thumbnail image + a short text summary.) Would require the GPT to be able to (i.a.) understand the logical content of the discussion, and detect when a topic is revisited, etc. Could be useful for improving clarity/productivity of conversations.
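As a rough sketch of what I have in mind (all names and fields below are hypothetical; the actual hard part, getting a GPT to segment topics and detect revisits reliably, is only gestured at in a comment):

```python
# Rough sketch (hypothetical names/fields) of the conversation-graph data
# structure described above: vertices are sub-topics, edges record how the
# discussion moves between (and revisits) them.
from dataclasses import dataclass, field

@dataclass
class TopicNode:
    topic_id: str
    summary: str                          # short text summary of the sub-topic
    thumbnail_path: str | None = None     # optional thumbnail image
    message_ids: list[str] = field(default_factory=list)  # messages touching this topic

@dataclass
class ConversationGraph:
    nodes: dict[str, TopicNode] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)  # (from_topic, to_topic)

    def add_message(self, message_id: str, topic_id: str, summary: str,
                    previous_topic_id: str | None = None) -> None:
        """Attach a new message to its (possibly new) sub-topic. A GPT-based
        classifier would have to supply topic_id, i.e. decide whether this is
        a new topic or a revisit of an existing one."""
        node = self.nodes.setdefault(topic_id, TopicNode(topic_id, summary))
        node.message_ids.append(message_id)
        if previous_topic_id is not None and previous_topic_id != topic_id:
            self.edges.append((previous_topic_id, topic_id))
```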
One of the main questions on which I'd like to understand others' views is something like: Conditional on sentient/conscious humans[1] continuing to exist in an x-risk scenario[2], with what probability do you think they will be in an inescapable dystopia[3]?
(My own current guess is that dystopia is very likely.)
That makes sense; but:
so far outside the realm of human reckoning that I'm not sure it's reasonable to call them dystopian.
setting aside the question of what to call such scenarios, with what probability do you think the humans[1] in those scenarios would (strongly) prefer to not exist?
or non-human minds, other than the machines/Minds that are in control ↩︎
non-extinction AI x-risk scenarios are unlikely
Many people disagreed with that. So, apparently many people believe that inescapable dystopias are not-unlikely? (If you're one of the people who disagreed with the quote, I'm curious to hear your thoughts on this.)
(Ah. Seems we were using the terms "(alignment) success/failure" differently. Thanks for noting it.)
In-retrospect-obvious key question I should've already asked: Conditional on (some representative group of) humans succeeding at aligning ASI, what fraction of the maximum possible value-from-Evolution's-perspective do you expect the future to attain? [1]
My modal guess is that the future would attain ~1% of maximum possible "Evolution-value".[2]
...If tech evolution is similar enough to bio evolution then we should roughly expect tech evolution to have a simil
In general I think maximum values are weird because they are potentially nearly unbounded, but it sounds like we may then be in agreement absent terminology.
But in general I do not think of anything "less than 1% of the maximum value" as failure in most endeavors. For example the maximum attainable wealth is perhaps $100T or something, but I don't think it'd be normal/useful to describe the world's wealthiest people as failures at being wealthy because they only have ~$100B or whatever.
And regardless the standard doom arguments from EY/MIRI etc are very much "AI will kill us all!", and not "AI will prevent us from attaining over 1% of maximum future utility!"
Evolution has succeeded at aligning homo sapiens brains to date
I'm guessing we agree on the following:
Evolution shaped humans to have various context-dependent drives (call them Shards) and the ability to mentally represent and pursue complex goals. Those Shards were good proxies for IGF in the EEA[1].
Those Shards were also good[2] enough to produce billions of humans in the modern environment. However, it is also the case that most modern humans spend at least part of their optimization power on things orthogonal to IGF.
I think our disagreemen...
vast computation some of which is applied to ancestral simulations
I agree that a successful post-human world would probably involve a large amount[1] of resources spent on simulating (or physically instantiating) things like humans engaging in play, sex, adventure, violence, etc. IOW, engaging in the things for which Evolution installed Shards in us. However, I think that is not the same as [whatever Evolution would care about, if Evolution could care about anything]. For the post-human future to be a success from Evolution's perspective, I think it wou...
Humans have not put an end to biological life.
Yup. I, too, have noticed that.
Your doom[1] predictions [...]
C'mon, man, that's obviously a misrepresentation of what I was saying. Or maybe my earlier comment failed badly at communication? In case that's so, here's an attempted clarification (bolded parts added):
...If Evolution had a lot more time (than I expect it to have) to align humans to relative-gene-replication-count, before humans put an end to biological life, as they seem to me to be on track to do, based on things I have observed in the past
evolution did in fact find some weird way to create humans who rather obviously consciously optimize for IGF! [...]
If Evolution had a lot more time to align humans to relative-gene-replication-count, before humans put an end to biological life, then sure, seems plausible that Evolution might be able to align humans very robustly. But Evolution does not have infinite time or "retries" --- humanity is in the process of executing something like a "sharp left turn", and seems likely to succeed long before the human gene pool is taken over by sperm bank donors and such.
The utility function is fitness: gene replication count (of the human defining genes) [1]
Seems like humans are soon going to put an end to DNA-based organisms, or at best relegate them to some small fraction of all "life". I.e., seems to me that the future is going to score very poorly on the gene-replication-count utility function, relative to what it would score if humanity (or individual humans) were actually aligned to gene-replication-count.
Do you disagree? (Do you expect the post-ASI future to be tiled with human DNA?)
Obviously Evolution doesn
I mostly agree.
I also think that impact is very unevenly distributed over people; the most impactful 5% of people probably account for >70% of the impact. [1]
And if so, then the difference in positive impact between {informing the top 5%} and {broadcasting to the field in general on the open Internet} is probably not very large. [2]
Possibly also worth considering: Would (e.g.) writing a public post actually reach those few key people more effectively than (e.g.) sending a handful of direct/targeted emails? [3]
Talking about AI (alignment) here, but I
If {the reasoning for why AGI might not be near} comprises {a list of missing capabilities}, then my current guess is that the least-bad option would be to share that reasoning in private with a small number of relevant (and sufficiently trustworthy) people[1].
(More generally, my priors strongly suggest keeping any pointers to AGI-enabling capabilities private.)
E.g. the most capable alignment researchers who seem (to you) to be making bad strategic decisions due to not having considered {the reasoning for why AGI might not be near}. ↩︎
I can't critique your plan, because I can't parse your writing. My suggestion would be to put some effort into improving the clarity of your writing. [1]
Even basic things, such as the avoidance of long sentences, sometimes with side notes included and separated from the main sentence by commas, rather than e.g. em dashes, and making the scopes of various syntactic structures unambiguous, could go a long way towards making your text more legible. ↩︎
[...] bridge the "gap" between (less-precise proofs backed by advanced intuition) and (precise proofs simple enough for basically anyone to technically "follow").
Meta: Please consider using curly or square brackets ({} or []) for conceptual/grammatic grouping; please avoid overloading parentheses.
Thumbs up for trying to think of novel approaches to solving the alignment problem.
Every time the model does something that harms the utility function of the dumber models, it gets a loss function.
A few confusions:
Some p...
[...] iteratively align superintelligence.
To align the first automated alignment researcher, [...]
To validate the alignment of our systems, [...]
What do they mean by "aligned"?
How do we ensure AI systems much smarter than humans follow human intent?
OK. Assuming that
To what extent would you expect the government's or general populace's responses to "Robots with guns" to be helpful (or harmful) for mitigating risks from superintelligence? (Would getting them worried about robots actually help with x-risks?)
Right; that would be a silly thing to think.
My intended message might've been better worded as follows:
If staring into abysses is difficult/rough, then adequately staring into the darker abysses might require counter-intuitively large amounts of effort/agency. And yet, I think it might be necessary to grok those darker abysses, if we are to avoid falling into them. That makes me worried.
OTOH, you seem exceptionally reflective, so perhaps that worry is completely unfounded in your case. Anyway, I'm grateful for the work you do; I wish there were more peo...
When people call things like this post "rough to write/read", and consider them to require a content warning, I wonder if most people are able to think clearly (or at all) about actually terrible scenarios, and worry that they aren't. (I'm especially worried if those people have influence in a domain where there might be a tradeoff between mitigating X-risks vs mitigating S-risks.)
I liked the description of the good future, though. Thanks for the reminder that things can (maybe) go well, too.
Whenever people are sad for any reason except s-risk, I wonder if they're able to think at all about important issues. /s
Thanks for the response.
To the extent that I understand your models here, I suspect they don't meaningfully bind/correspond to reality. (Of course, I don't understand your models at all well, and I don't have the energy to process the whole post, so this doesn't really provide you with much evidence; sorry.)
I wonder how one could test whether or not the models bind to reality? E.g. maybe there are case examples (of agents/people behaving in instrumentally rational ways) one could look at, and see if the models postdict the actual outcomes in those examples?
Yes. Also unclear whether the 90% could coordinate to take any effective action, or whether any effective action would be available to them. (Might be hard to coordinate when AIs control/influence the information landscape; might be hard to rise up against e.g. robotic law enforcement or bioweapons.)
Good point! I guess one way to frame that would be as