I don't think misaligned AI drives the majority of s-risk (I'm not even sure that s-risk is higher conditioned on misaligned AI), so I'm not convinced that it's a super relevant communication consideration here.
I'm curious what does drive it, in that case, and what proportion affects humans (currently-existing people versus future minds)? Things like spite threat commitments from a misaligned AI warring with humanity seem like a substantial source of s-risk to me.
I find this worrying. If social dynamics have introduced such a strong tendency to freak out about these kinds of issues, it's hard to evaluate their true probability. If s-risks are indeed likely then I, as a potential victim of horrific suffering worse than any human has ever experienced, would want to be able to reasonably evaluate their probability.
What does the distribution of these non-death dystopias look like? There’s an enormous difference between 1984 and maximally efficient torture; for example, do you have a rough guess of what the probability distribution looks like if you condition on an irreversibly messed up but non-death future?
I'm a little confused by the agreement votes with this comment - it seems to me that the consensus around here is that s-risks in which currently-existing humans suffer maximally are very unlikely to occur. This seems an important practical question; could the people who agreement-upvoted elaborate on why they find this kind of thing plausible?
The examples discussed in e.g. the Kaj Sotala interview linked further down the chain tend to concern things like "suffering subroutines".
I have a disturbing feeling that arguing that a future AI should "preserve humanity for Pascal's-mugging-type reasons" trades off x-risk for s-risk. I'm not sure that any of the aforementioned cases encourage the AI to maintain lives worth living.
Because you're imagining AGI keeping us in a box? Or that there's a substantial probability on P(humans are deliberately tortured | AGI) that this post increases?
Presumably it'd take less manpower to review each article that the AI's written (i.e. read the citations & make sure the article accurately describes the subjects) than it would to write articles from scratch. I'd guess this is the case even if the claims seem plausible & fact-checking requires a somewhat detailed read-through of the sources.
Cheers for the reply! :)
integrate these ideas into your mind and it's complaining loudly that you're going too fast (although it doesn't say it quite that way, I think this is a useful framing). Stepping away, focusing on other things for a while, and slowly coming back to the ideas is probably the best way to be able to engage with them in a psychologically healthy way that doesn't overwhelm you.
I do try! When thinking about this stuff starts to overwhelm me I can try to put it all on ice; usually some booze is required to be able to do that, TBH.
But of course it's also plausible that destructive conflict between aggressive civilizations leads to horrifying outcomes for us.
Also, wouldn't you expect s-risks from this to be very unlikely by virtue of (1) civilizations like this being very unlikely to have substantial measure over the universe's resources, (2) transparency making bargaining far easier, and (3) few technologically advanced civilizations caring about humans suffering in particular, as opposed to e.g. an adversary running emulations of their own species?
Since it's my shortform, I'd quite like to just vent about some stuff.
I'm still pretty scared about a transhumanist future going quite wrong. It simply seems to me that there's quite the disjunction of paths to "s-risk" scenarios: generally speaking, any future agent that wants to cause disvalue to us - or to an empathetic agent - would bring about an outcome that's Pretty Bad by my lights. Like, it *really* doesn't seem impossible that some AI decides to pre-commit to doing Bad if we don't co-operate with it; or our AI ends up in some horrifying confli...
Thanks for the response; I'm still somewhat confused though. The question was to do with the theoretical best/worst things possible, so I'm not entirely sure whether parallels to (relatively) minor pleasures/pains are meaningful here.
Specifically I'm confused about:
Then you end up into, well, to what extent is that a debunking explanation that explains why humans, in terms of their capacity to experience joy and suffering, are unbiased, but the reality is still biased
I'm not really sure what's meant by "the reality" here, nor what's meant by biased...
I'm not sure if this is the right place to ask this, but does anyone know what point Paul's trying to make in the following part of this podcast? (Relevant section starts around 1:44:00)
...Suppose you have a probability P of the best thing you can do and a one-minus-P probability of the worst thing you can do, what does P have to be so it’s the difference between that and the barren universe. I think most of my probability is distributed between you would need somewhere between 50% and 99% chance of good things, and then put some probability or some credence on view
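For what it's worth, my best guess at the computation being gestured at is an indifference condition between a best/worst gamble and the barren universe. This is purely my own reconstruction, with G and B as made-up placeholders:

```latex
% G > 0: value of the best achievable outcome; B < 0: value of the worst;
% the barren universe is normalised to 0. Indifference between the gamble
% and the barren universe is at
\[
    pG + (1-p)B = 0
    \quad\Longleftrightarrow\quad
    p = \frac{|B|}{G + |B|}.
\]
% If the worst outcome is exactly as bad as the best is good (|B| = G), the
% threshold is p = 1/2; as |B| becomes far larger than G, it approaches 1.
```

On that reading, the "somewhere between 50% and 99%" range is just a claim about how much worse the worst outcome is than the best one is good.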
we ask the AGI to "make us happy", and it puts everyone paralyzed in hospital beds on dopamine drips. It's not hard to think that after a couple hours of a good high, this would actually be a hellish existence, since human happiness is way more complex than the amount of dopamine in one's brain (but of course, Genie in the Lamp, Midas' Touch, etc)
This sounds much better than extinction to me! Values might be complex, yeah, but if the AI is actually programmed to maximise human happiness then I expect the high wouldn't wear off. Being turned into a wirehead...
I'm way more scared about the electrode-produced smiley faces for eternity and the rest. That's way, way worse than dying.
FWIW, it seems kinda weird to me that such an AI would keep you alive... if you had a "smile-maximiser" AI, wouldn't it be indifferent to humans being braindead, as long as it's able to keep them smiling?
I'd like to have Paul Christiano's view that the "s-risk-risk" is 1/100 and that AGI is 30 years off.
I think Paul's view is along the lines of "1% chance of some non-insignificant amount of suffering being intentionally created", n...
Evolution "wants" pain to be a robust feedback/control mechanism that reliably causes the desired amount of avoidance - in this case, the greatest possible amount.
I feel that there's going to be a level of pain that a mind of nearly any level of pain tolerance would exert 100% of its energy to avoid. I don't think I know enough to comment on how much further than this level the brain can go, but it's unclear why the brain would develop the capacity to process pain drastically more intense than this; pain is just a tool to avoid certain things, and it ...
I'm unsure that "extreme" would necessarily get a more robust response, considering that there comes a point where the pain becomes disabling.
It seems as though there might be some sort of biological "limit" insofar as there are limited peripheral nerves, the grey matter can only process so much information, etc., and there'd be a point where the brain is 100% focused on avoiding the pain (meaning there'd be no evolutionary advantage to having the capacity to process additional pain). I'm not really sure where this limit would be, though. And I don't really know any biology so I'm plausibly completely wrong.
I think the idea is that the 4th scenario is the case, and you can’t discern whether you’re the real you or the simulated version, as the simulation is (near-) perfect. In that scenario, you should act in the same way that you’d want the simulated version to. Either (1) you’re a simulation and the real you just won $1,000,000; or (2) you’re the real you and the simulated version of you thought the same way that you did and one-boxed (meaning that you get $1,000,000 if you one-box.)
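To make the payoffs concrete, here's a toy sketch (my own illustration, assuming the standard $1,000/$1,000,000 box contents and a predictor whose simulation always chooses as you do):

```python
# Toy payoff check for the argument above: assumes the predictor's simulation
# of you always makes the same choice you do, and the standard box contents.
def payoff(choice: str) -> int:
    """Real player's payoff, given that the simulated copy chose identically."""
    opaque_box = 1_000_000 if choice == "one-box" else 0  # filled iff one-boxing was predicted
    transparent_box = 1_000
    return opaque_box if choice == "one-box" else opaque_box + transparent_box

print(payoff("one-box"))  # 1000000
print(payoff("two-box"))  # 1000
```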
it's much harder to know if you've got it pointed in the right direction or not
Perhaps, but the type of thing I'm describing in the post is more about preventing worse-than-death outcomes even if the sign is flipped (by designing a reward function/model in such a way that it's not going to torture everyone if that's the case.)
This seems easier than recognising whether the sign is flipped or just designing a system that can't experience these sign-flip type errors; I'm just unsure whether this is something that we have robust s...
Seems a little bit beyond me at 4:45am - I'll probably take a look tomorrow when I'm less sleep-deprived (although I still can't guarantee I'll be able to make it through it then; there's quite a bit of technical language in there that makes my head spin.) Are you able to provide a brief tl;dr, and have you thought much about "sign flip in reward function" or "direction of updates to reward model flipped"-type errors specifically? It seems like these particularly nasty bugs could plausibly be mitigated more easily than avoiding false positives (as you defined them in the arXiv paper's abstract) in general.
Would you not agree that (assuming there's an easy way of doing it), separating the system from hyperexistential risk is a good thing for psychological reasons? Even if you think it's extremely unlikely, I'm not at all comfortable with the thought that our seed AI could screw up & design a successor that implements the opposite of our values; and I suspect there are at least some others who share that anxiety.
For the record, I think that this is also a risk worth worrying about for non-psychological reasons.
You seem to have a somewhat general argument against any solution that involves adding onto the utility function, namely "What if that added solution was bugged instead?".
I might've failed to make my argument clear: if we designed the utility function as U = V + W (where W is the thing being added on and V refers to human values), this would only stop the sign-flipping error if it were U that got flipped. If it were instead V that got flipped (so the AI optimises for U = -V + W), that'd be problematic.
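A toy illustration of the distinction (my own sketch; the outcomes and numbers are made up, with W standing in for the added surrogate term whose violation is meant to dominate the worst case):

```python
# Toy sketch of the argument above (made-up outcomes and numbers).
# V scores outcomes by human values; W is the added surrogate term, chosen so
# that the cheapest way to make U very low is via W rather than via -V.
outcomes = {
    "flourishing": {"V": 10.0,  "W": 0.0},
    "paperclips":  {"V": 0.0,   "W": 0.0},
    "torture":     {"V": -10.0, "W": 0.0},
    "violate_W":   {"V": 0.0,   "W": -100.0},  # the mundane surrogate event
}

def best(score):
    return max(outcomes, key=lambda o: score(outcomes[o]))

print(best(lambda x: x["V"] + x["W"]))     # flourishing: intended objective
print(best(lambda x: -(x["V"] + x["W"])))  # violate_W: a whole-U flip is caught by the surrogate
print(best(lambda x: -x["V"] + x["W"]))    # torture: a flip inside V alone bypasses it
```

That is, the added term only helps when the negation wraps the whole utility function.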
I think it's better to move on from...
I see. I'm somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.
I don't think that's an example of (3), more like (1) or (2), or actually "none of the above because GPT-2 doesn't have this kind of architecture".
I just raised GPT-2 to indicate that flipping the goal sign suddenly can lead to optimising for bad behaviour without the AI neglecting to consider new strategies. Presumably that'd suggest it's also a possibility with cosmic-ray or other errors.
I hadn't really considered the possibility of a brain-inspired/neuromorphic AI, thanks for the points.
(2) seems interesting; as I understand it, you're basically suggesting that the error would occur gradually & the system would work to prevent it. Although maybe the AI realises it's getting positive feedback for bad things and keeps doing them, or something (I don't really know, I'm also a little sleep deprived and things like this tend to do my head in.) Like, if I hated beer then suddenly started liking it, I'd probably...
True, but note that he elaborates and comes up with a patch to the patch (namely, having W refer to a class of events that would be expected to happen in the Universe's expected lifespan rather than one that won't.) So he still seems to support the basic idea, although he probably intended just to get the ball rolling with the concept rather than conclusively solve the problem.
Perhaps malware could be another risk factor in the type of bug I described here? Not sure.
I'm still a little dubious of Eliezer's solution to the problem of separation from hyperexistential risk; if we had U = V + W where V is a reward function & W is some arbitrary thing it wants to minimise (e.g. paperclips), a sign flip in V (due to any of a broad disjunction of causes) would still cause hyperexistential catastrophe.
Or what about the case where instead of maximising -U, the values that the reward function/model gives for each "thing"...
Paperclipping seems to be negative utility, not approximately 0 utility.
My thinking was that an AI system whose utility *only* takes values between 0 and +∞ (or some arbitrary positive number) would identify that killing humans would result in 0 human value, which is its minimum utility.
I read Eliezer's idea, and that strategy seems to be... dangerous. I think that "Giving an AGI a utility function which includes features which are not really relevant to human values" is something we want to avoid unless we absolutely need to.
How come? It do...
As an almost entirely inapplicable analogy . . . it's just doing something weird.
If we inverted the utility function . . . tiling the universe with smiley faces, i.e. paperclipping.
Interesting analogy. I can see what you're saying, and I guess it depends on what specifically gets flipped. I'm unsure about the second example; something like exploring new strategies doesn't seem like something an AGI would terminally value. It's instrumental to optimising the reward function/model, but I can't see it getting flipped *with* the r...
Thanks for the detailed response. A bit of nitpicking (from someone who doesn't really know what they're talking about):
However, the vast majority of these mistakes would probably buff out or result in paper-clipping.
I'm slightly confused by this one. If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be *no* human utility (i.e. paperclips). But most attempts at an aligned AI would have a minimum at "I have no mouth, and I must scream"....
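A quick toy rendering of the contrast I have in mind (made-up numbers, nothing rigorous):

```python
# Toy contrast (made-up numbers): a utility bounded below at zero vs. one
# whose minimum is a suffering-filled world.
U_positive = {"flourishing": 10.0, "paperclips": 0.0, "torture": 0.0}    # can't score below "no value"
U_typical  = {"flourishing": 10.0, "paperclips": 0.0, "torture": -10.0}  # suffering scores negative

for name, U in [("positive-only", U_positive), ("typical", U_typical)]:
    flipped_pick = max(U, key=lambda w: -U[w])  # what a sign-flipped maximiser goes for
    print(name, flipped_pick)
# positive-only -> paperclips (tied with torture at 0): "merely" paperclipping
# typical       -> torture: the worst possible outcome
```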
I've seen that post & discussed it on my shortform. I'm not really sure how effective something like Eliezer's idea of "surrogate" goals there would actually be - sure, it'd help with some sign flip errors but it seems like it'd fail on others (e.g. if U = V + W, a sign error could occur in V instead of U, in which case that idea might not work.) I'm also unsure as to whether the probability is truly "very tiny" as Eliezer describes it. Human errors seem much more worrying than cosmic rays.
I don't really know what the probability is. It seems somewhat low, but I'm not confident that it's *that* low. I wrote a shortform about it last night (tl;dr it seems like this type of error could occur in a disjunction of ways and we need a good way of separating the AI in design space.)
I think I'd stop worrying about it if I were convinced that its probability is extremely low. But I'm not yet convinced of that. Something like the example Gwern provided elsewhere in this thread seems more worrying than the more frequently discus...
It seems to me that ensuring we can separate an AI in design space from worse-than-death scenarios is perhaps the most crucial thing in AI alignment. I don’t at all feel comfortable with AI systems that are one cosmic ray - or, perhaps more plausibly, one human screw-up (e.g. this sort of thing) - away from a fate far worse than death. Or maybe a human-level AI makes a mistake and creates a sign-flipped successor. Perhaps there’s some sort of black swan possibility that nobody realises. I think that it’s absolutely critical that we have a...
Do you think that this specific risk could be mitigated by some variant of Eliezer’s separation from hyperexistential risk or Stuart Armstrong's idea here:
Let B1 and B2 be excellent, bestest outcomes. Define U(B1) = 1, U(B2) = -1, and U = 0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
Or, more usefully, let X be some trivial feature that the agent can easily set to -1 or 1, and let U be a utility function with values in [...
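To spell out the first part of the quoted construction, here's my own toy rendering (it ignores the "probabilistic combinations of worlds" caveat):

```python
# Toy rendering of the quoted construction: B1 and B2 are both excellent
# outcomes; everything else scores zero.
def U(outcome: str) -> float:
    return {"B1": 1.0, "B2": -1.0}.get(outcome, 0.0)

worlds = ["B1", "B2", "mediocre", "awful"]
print(max(worlds, key=U))                # B1: the maximiser picks an excellent outcome
print(max(worlds, key=lambda w: -U(w)))  # B2: a sign-flipped maximiser *also* picks an excellent outcome
```

Maximising and minimising U both land on an excellent outcome, which is the separation being aimed at.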
I asked Rohin Shah about that possibility in a question thread about a month ago. I think he's probably right that this type of thing would only plausibly make it through the training process if the system's *already* smart enough to be able to think about it. And then on top of that there are still things like sanity checks which, while unlikely to pick up numerous errors, would probably notice a sign error. See also this comment:
Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect i...
I'm under the impression that an AGI would be monitored *during* training as well. So you'd effectively need the system to turn "evil" (utility function flipped) during the training process, and the system to be smart enough to conceal that the error occurred. So it'd need to happen a fair bit into the training process. I guess that's possible, but IDK how likely it'd be.
Sure, but the *specific* type of error I'm imagining would surely be easier to pick up than most other errors. I have no idea what sort of sanity checking was done with GPT-2, but the fact that the developers were asleep when it trained is telling: they weren't being as careful as they could've been.
For this type of bug (a sign error in the utility function) to occur *before* the system is deployed and somehow persist, it'd have to make it past all sanity-checking tools (which I imagine would be used extensively with an AGI) *and* someh...
Wouldn't any configuration errors or updates be caught with sanity-checking tools though? Maybe the way I'm visualising this is just too simplistic, but any developers capable of creating an *aligned* AGI are going to be *extremely* careful not to fuck up. Sure, it's possible, but the most plausible cause of a hyperexistential catastrophe to me seems to be where a SignFlip-type error occurs once the system has been deployed.
Hopefully a system as crucially important as an AGI isn't going to have just one guy watching it who "takes a...
Many entities have sanity-checking tools. They fail. Many have careful developers. They fail. Many have automated tests. They fail. And so on. Disasters happen because all of those will fail to work every time and therefore all will fail some time. If any of that sounds improbable, as if there would have to be a veritable malevolent demon arranging to make every single safeguard fail or backfire (literally, sometimes, like the recent warehouse explosion - triggered by welders trying to safeguard it!), you should probably read more about complex systems and their failures to understand how normal it all is.
If we actually built an AGI that optimised to maximise a loss function, wouldn't we notice long before deploying the thing?
I'd imagine that this type of thing would be sanity-checked and tested intensively, so signflip-type errors would predominantly be scenarios where the error occurs *after* deployment, like the one Gwern mentioned ("A programmer flips the meaning of a boolean flag in a database somewhere while not updating all downstream callers, and suddenly an online learner is now actively pessimizing their target metric.")
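To make the scenario Gwern describes concrete, here's a toy sketch of how such a flip could play out (entirely hypothetical code, not from any real system):

```python
# Hypothetical sketch of the flipped-flag failure mode described above.
# The flag's meaning is inverted upstream; a downstream caller isn't updated;
# the online learner quietly starts descending on its metric instead of ascending.
config = {"maximize_metric": False}  # upstream change inverted what this flag means

def update(theta: float, grad: float, lr: float = 0.1) -> float:
    # Downstream caller still assumes True means "maximize the metric".
    sign = 1.0 if config["maximize_metric"] else -1.0
    return theta + sign * lr * grad  # now steps *against* the gradient

theta = 0.0
for _ in range(100):
    grad = 2 * (5.0 - theta)  # gradient of the metric -(theta - 5)^2, which peaks at theta = 5
    theta = update(theta, grad)

print(theta)  # wanders ever further from the optimum at 5 -- actively pessimizing the metric
```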
Interesting. Terrifying, but interesting.
Forgive me for my stupidity (I'm not exactly an expert in machine learning), but it seems to me that building an AGI linked to some sort of database like that in such a fashion (that some random guy's screw-up can effectively reverse the utility function completely) is a REALLY stupid idea. Would there not be a safer way of doing things?
I'm interested in arguments surrounding energy-efficiency (and maximum intensity, if they're not the same thing) of pain and pleasure. I'm looking for any considerations or links regarding (1) the suitability of "H=D" (equal efficiency and possibly intensity) as a prior; (2) whether, given this prior, we have good a posteriori reasons to expect a skew in either the positive or negative direction; and (3) the conceivability of modifying human minds' faculties to experience "super-bliss" commensurate with the badness of the worst-possible outcome, such that ...