I think the default non-extinction outcome is a singleton resulting from a near miss at alignment, creating large amounts of suffering.
I'm surprised. Unaligned AI is more likely than aligned AI even conditional on non-extinction? Why do you think that?
I am skeptical. AFAICT the typical attempted-but-failed alignment looks like one of two things:
These involve extinction, so they don't answer the question of what's the most likely outcome conditional on non-extinction. I think the answer there is a specific kind of near-miss at alignment which is quite scary.
My point is that Pr[non-extinction | misalignment] << 1, Pr[non-extinction | alignment] = 1, and Pr[alignment] is not that low; therefore Pr[misalignment | non-extinction] is low, by Bayes.
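To make the Bayes step concrete, here's a worked version with purely illustrative numbers that I'm making up for the sake of the example (Pr[alignment] = 0.2, Pr[non-extinction | misalignment] = 0.05, Pr[non-extinction | alignment] = 1):

$$\Pr[\text{misalign} \mid \text{non-ext}] \;=\; \frac{\Pr[\text{non-ext} \mid \text{misalign}]\,\Pr[\text{misalign}]}{\Pr[\text{non-ext} \mid \text{misalign}]\,\Pr[\text{misalign}] + \Pr[\text{non-ext} \mid \text{align}]\,\Pr[\text{align}]} \;=\; \frac{0.05 \times 0.8}{0.05 \times 0.8 + 1 \times 0.2} \;\approx\; 0.17$$

So even with a prior of 4:1 against alignment (in these made-up numbers), conditioning on non-extinction flips the posterior to roughly 5:1 in favour of alignment; the conclusion is only as strong as the two assumed conditional probabilities.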
To me it feels like alignment is a tiny target to hit, and around it there's a neighborhood of almost-alignment, where enough is achieved to keep people alive but locked out of some important aspect of human value. There are many aspects such that missing even one or two of them is enough to make life bad (complexity and fragility of value). You seem to be saying that if we achieve enough alignment to keep people alive, we have >50% chance of achieving all/most other aspects of human value as well, but I don't see why that's true.
I think where we differ is that I think Pr[full alignment] is extremely low, and there is quite a lot of space for non-omnicidal partial misalignment.
This seems to be missing what I see as the strongest argument for "utopia": most of what we think of as "bad values" in humans comes from objective mistakes in reasoning about the world and about moral philosophy, rather than from a part of us that is orthogonal to such reasoning in a paperclip-maximizer-like way, and future reflection can be expected to correct those mistakes.
future reflection can be expected to correct those mistakes.
I'm pretty worried that this won't happen, because these aren't "innocent" mistakes. Copying from a comment elsewhere:
Why did the Malagasy people have such a silly belief? Why do many people have very silly beliefs today? (Among the least politically risky ones to cite, someone I’ve known for years who otherwise is intelligent and successful, currently believes, or at least believed in the recent past, that 2⁄3 of everyone will die as a result of taking the COVID vaccines.) I think the unfortunate answer is that people are motivated to or are reliably caused to have certain false beliefs, as part of the status games that they’re playing. I wrote about one such dynamic, but that’s probably not a complete account.
From another comment on why reflection might not fix the mistakes:
many people are not motivated to do “rational reflection on morality” or examine their value systems to see if they would “survive full logical and empirical information”. In fact they’re motivated to do the opposite, to protect their value systems against such reflection/examination. I’m worried that alignment researchers are not worried enough that if an alignment scheme causes the AI to just “do what the user wants”, that could cause a lock-in of crazy value systems that wouldn’t survive full logical and empirical information.
One crucial question is, assuming AI will enable value lock-in when humans want it, will they use that as part of their signaling/status games? In other words, try to obtain higher status within their group by asking their AIs to lock in their morally relevant empirical or philosophical beliefs? A lot of people in the past used visible attempts at value lock-in (constantly going to church to reinforce their beliefs, avoiding talking with any skeptics/heretics, etc.) for signaling. Will that change when real lock-in becomes available?
Yeah, I'm particularly worried about the second comment/last paragraph - people not actually wanting to improve their values, or only wanting to improve them in ways we think are not actually an improvement (e.g. wanting to have purer faith).
Is this making a claim about moral realism? If so, why wouldn't it apply to a paperclip maximiser? If not, how do we distinguish between objective mistakes and value disagreements?
I interpreted steven0461 to be saying that many apparent "value disagreements" between humans turn out, upon reflection, to be disagreements about facts rather than values. It's a classic pattern from the conflict-theory vs. mistake-theory distinction: people are interpreted as having different values because they favor different strategies, even when everyone shares the same values.
ah yeah, so the claim is something like 'if we think other humans have 'bad values', maybe in fact our values are the same and one of us is mistaken, and we'll get less mistaken over time'?
I tend to want to split "value drift" into "change in the mapping from (possible beliefs about logical and empirical questions) to (implied values)" and "change in beliefs about logical and empirical questions", instead of lumping both into "change in values".
most of what we think of as "bad values" in humans comes from objective mistakes in reasoning
Could the same also be true of most "good values"? Maybe people just make mistakes about almost everything.
My sense is that most would-be dystopian scenarios lead to extinction fairly quickly. In most of them (Malthusian situations, ruthless power struggles...), humans would be a fitness liability that gets optimised away.
The way this doesn't happen is if we have AIs with human-extinction-avoiding constraints: some kind of alignment (perhaps incomplete/broken).
I don't think it makes much sense to reason further than this without making a guess at what those constraints may look like. If there aren't constraints, we're dead. If there are, then those constraints determine the rules of the game.
It sounds like you're implying that you need humans around for things to be dystopic? That doesn't seem clear to me; the AIs involved in the Malthusian struggle might still be moral patients.
Sure, that's possible (and if so I agree it'd be importantly dystopic) - but do you see a reason to expect it?
It's not something I've thought about a great deal, but my current guess is that you probably don't get moral patients without aiming for them (or by using training incentives much closer to evolution than I'd expect).
I guess I expect there to be a reasonable amount of computation taking place, and it seems pretty plausible a lot of these computations will be structured like agents who are taking part in the Malthusian competition. I'm sufficiently uncertain about how consciousness works that I want to give some moral weight to 'any computation at all', and reasonable weight to 'a computation structured like an agent'.
I think if you have Malthusian dynamics you *do* have evolution-like dynamics.
I assume this isn't a crux, but fwiw I think it's pretty likely most vertebrates are moral patients.
I agree with most of this. Not sure about how much moral weight I'd put on "a computation structured like an agent" - some, but it's mostly coming from [I might be wrong] rather than [I think agentness implies moral weight].
Agreed that Malthusian dynamics give you an evolution-like situation - but I'd guess it's too late for it to matter: once you're already generally intelligent, can think your way to the convergent instrumental goal of self-preservation, and can self-modify, it's not clear to me that consciousness/pleasure/pain buys you anything.
Heuristics are sure to be useful as shortcuts, but I'm not sure I'd want to analogise those to qualia (??? presumably the right kind would be - but I suppose I don't expect the right kind by default).
The possibilities for signalling will also be nothing like those in a historical evolutionary setting - the utility of emotional affect doesn't seem to be present (once the humans are gone).
[these are just my immediate thoughts; I could easily be wrong]
I agree that it's likely most vertebrates are moral patients.
Overall, I can't rule out AIs becoming moral patients - and it's clearly possible.
I just don't yet see positive reasons to think it has significant probability (unless aimed for explicitly).
some relevant ideas here maybe: https://reducing-suffering.org/what-are-suffering-subroutines/
Thanks, that's interesting, though mostly I'm not buying it (still unclear whether there's a good case to be made; fairly clear that he's not making a good case).
Thoughts:
Some thoughts about the ‘default’ trajectory of civilisation and how AI will affect the likelihood of different outcomes.
Is the default non-extinction outcome utopic or dystopic?
Arguments for dystopia:
Arguments for utopia:
How does AI affect these considerations?
Ways AI can make things worse:
Disrupting utopia arguments:
Affects (1):
Disrupts (2+3) if AI is qualitatively different from past technological change and therefore breaks previous patterns.
Strengthening dystopia arguments:
(2) AI is likely to make us much better at manipulation - it will allow more intelligently optimised, larger-scale, and more personalised targeting of persuasion and other tactics that decouple people's actions from the things that they 'really' value.
(4+5) AIs that are moral patients but don’t trigger empathy, or seem like moral patients but are actually not, are going to create murky and confusing ethical territory, increasing the risk of moral catastrophe.
(4+5) AI making the environment stranger and more unnatural risks breaking whatever causes people to have broadly altruistic values.
(3+7) AI provides a new and faster-moving ecosystem for selection to take place in (e.g., among individual models or agents, among automated companies, etc.), which will increase the strength of this effect relative to other things that influence the trajectory of the world (e.g., that most people don't want the world to be taken over by whatever corporation is most ruthless). This both increases the probability that the world will be dominated by whichever actor is most ruthless, and increases the probability that we'll end up in a Malthusian struggle.
(7) AI capabilities increase the influence gap a group can obtain by being more ruthless. If there are more powerful tools on the table to grab, the most grabby people will outcompete others by a larger margin.
Ways AI can make things better:
Strengthening utopia arguments
AI is another technology, and as such it will enable humans to better understand and control the world. Scientific progress and economic growth resulting from AI progress will make it cheaper and easier to provide for the needs of sentient beings, and to obtain things we want without harming sentient beings.
Humans overall mostly do things that are in their interests. If we, as a society, develop and deploy an AI capability, that is evidence that the capability does in fact make the world better.
Weakening dystopia arguments
(1 + 3) If AI changes the world radically, then maybe current dystopic aspects will disappear. For example, a singleton would eliminate coordination problems, and even a widely trusted advisor would eliminate many coordination problems.
(2) As well as improving manipulation, AI tools can also increase individual people's ability to find, process, and understand information. AI could vastly improve the quality of education, and therefore people's judgement and thinking skills. It could improve people's control over what content they interact with.
(4) AI can reduce scarcity and competition, and improve education and availability of information, both of which are likely to increase the frequency of benevolent and altruistic values. AI can help us reflect on and refine our values.