All of Anirandis's Comments + Replies

I'm interested in arguments surrounding energy-efficiency (and maximum intensity, if they're not the same thing) of pain and pleasure. I'm looking for any considerations or links regarding (1) the suitability of "H=D" (equal efficiency and possibly intensity) as a prior; (2) whether, given this prior, we have good a posteriori reasons to expect a skew in either the positive or negative direction; and (3) the conceivability of modifying human minds' faculties to experience "super-bliss" commensurate with the badness of the worst-possible outcome, such that ... (read more)

I don't think misaligned AI drives the majority of s-risk (I'm not even sure that s-risk is higher conditioned on misaligned AI), so I'm not convinced that it's a super relevant communication consideration here. 

I'm curious what does drive it, in that case, and what proportion affects humans (whether currently-existing people or future minds). Things like spite threat commitments from a misaligned AI warring with humanity seem like a substantial source of s-risk to me.

I find this worrying. If social dynamics have introduced this much of a tendency to freak out about these kinds of issues, it's hard to evaluate their true probability. If s-risks are indeed likely, then I, as a potential victim of horrific suffering worse than any human has ever experienced, would want to be able to reasonably evaluate their probability.

What does the distribution of these non-death dystopias look like? There’s an enormous difference between 1984 and maximally efficient torture; for example, do you have a rough guess of what the probability distribution looks like if you condition on an irreversibly messed up but non-death future?

I'm a little confused by the agreement votes with this comment - it seems to me that the consensus around here is that s-risks in which currently-existing humans suffer maximally are very unlikely to occur. This seems an important practical question; could the people who agreement-upvoted elaborate on why they find this kind of thing plausible?

 

The examples discussed in e.g. the Kaj Sotala interview linked later down the chain tend to regard things like "suffering subroutines", for example.

2Benjy Forstadt
The assumption that a misaligned AI  will choose to kill us may be false. It would be very cheap to keep us alive/keep copies of us and it may find running experiments on us marginally more valuable. See  "More on the 'human experimentation' s-risk":    https://www.reddit.com/r/SufferingRisk/wiki/intro/#wiki_more_on_the_.22human_experimentation.22_s-risk.3A

I have a disturbing feeling that arguing to a future AI to "preserve humanity for Pascal's-mugging-type reasons" trades off X-risk for S-risk. I'm not sure that any of these aforementioned cases encourage the AI to maintain lives worth living.

 

Because you're imagining AGI keeping us in a box? Or that there's a substantial probability on P(humans are deliberately tortured | AGI) that this post increases?

1Cody Rushing
  Yeah, something along the lines of this. Preserving humanity =/= humans living lives worth living.
2David Johnston
Indeed. If the idea of a tradeoff wasn't widely considered plausible I'd have spent more time defending it. I'd say my contribution here is the "and we should act like it" part.

Presumably it'd take less manpower to review each article that the AI's written (i.e. read the citations & make sure the article accurately describes the subjects) than it would to write articles from scratch. I'd guess this is the case even if the claims seem plausible & fact-checking requires a somewhat detailed reading through of the sources.

2Yitz
That would be a much more boring task for most people than direct writing, and would attract fewer volunteers, I’d have to imagine

Cheers for the reply! :)

 

integrate these ideas into your mind and it's complaining loudly that you're going too fast (although it doesn't say it quite that way, I think this is a useful framing). Stepping away, focusing on other things for a while, and slowly coming back to the ideas is probably the best way to be able to engage with them in a psychologically healthy way that doesn't overwhelm you

I do try! When thinking about this stuff starts to overwhelm me I can try to put it all on ice; usually some booze is required to be able to do that, TBH.

But of course it's also plausible that destructive conflict between aggressive civilizations leads to horrifying outcomes for us

 

Also, wouldn't you expect s-risks from this to be very unlikely by virtue of (1) civilizations like this being very unlikely to have substantial measure over the universe's resources, (2) transparency making bargaining far easier, and (3) few technologically advanced civilizations caring about humans suffering in particular, as opposed to e.g. an adversary running emulations of their own species?

Since it's my shortform, I'd quite like to just vent about some stuff.

 

I'm still pretty scared about a transhumanist future going quite wrong. It simply seems to me that there's quite the disjunction of paths to "s-risk" scenarios: generally speaking, any future agent that wants to cause disvalue to us - or to an empathetic agent - would bring about an outcome that's Pretty Bad by my lights. Like, it *really* doesn't seem impossible that some AI decides to pre-commit to doing Bad if we don't co-operate with it; or our AI ends up in some horrifying confli... (read more)

6Gordon Seidoh Worley
My advice, if advice is what you're looking for, is to distract yourself from all this stuff. It's heavy stuff to deal with, and going too fast can be too much. I think this is generally true for dealing with any type of hard thing. If it's overwhelming, force yourself away from it so you can't be so overwhelmed. That might seem crazy from the inside perspective of worrying about this stuff, but it's actually needed because my model of what happens to overwhelm folks is that there hasn't been time to integrate these ideas into your mind and it's complaining loudly that you're going too fast (although it doesn't say it quite that way, I think this is a useful framing). Stepping away, focusing on other things for a while, and slowly coming back to the ideas is probably the best way to be able to engage with them in a psychologically healthy way that doesn't overwhelm you.

(Also don't worry about comparing yourself to others who were able to think about these ideas more quickly with less integration time needed. I think the secret is that those people did all that kind of integration work earlier in their lives, possibly as kids and without realizing it. For example, I grew up very aware of the cold war and probably way more worried about it than I should have been, so other types of existential risks were somewhat easier to handle because I already had made my peace with the reality of various terrible outcomes. YMMV.)
8Viliam
The world sucks, and the more you learn about it, the worse it gets (that is, the worse your map gets; the territory has always been like that). Yet, somehow, good things sometimes happen too, despite all the apparent reasons why they shouldn't. We'll see.

Thanks for the response; I'm still somewhat confused though. The question was to do with the theoretical best/worst things possible, so I'm not entirely sure whether parallels to (relatively) minor pleasures/pains are meaningful here. 

 

Specifically I'm confused about:

Then you end up into well, to what extent is that a debunking explanation that explains why humans in terms of their capacity to experience joy and suffering are unbiased but the reality is still biased

I'm not really sure what's meant by "the reality" here, nor what's meant by biased... (read more)

2Pattern
Ah. I suggested them because I figured that such '(relatively) minor' things are what people have experienced and thus are the obvious source for extrapolating out to theoretical maximum/s.

I don't know what's meant by 'reality' there. Your guess seems reasonable (and was more transparent than what you quoted). I'm not sure how to guess the maximum ratio. Likewise. (A quadrillion seems like a lot - I'd need a detailed explanation to get why someone would choose that number.)

I think... it makes sense less as emotion than as a utility function - but that's not what is being talked about. Part of it is... when people are well off, do they pursue the greatest pleasure? I think negative extremes prompt a focus on basics. In better conditions, people may pursue more complicated things. Overall, there's something about focus, I guess: 'I don't want to die' versus 'I'm happy to be alive!'. Which sentiment is stronger? It's easy to pull up an extreme case for a thought experiment, but if people don't have that as a risk in their lives then maybe the second thing - or the absence of the risk - doesn't have as much salience, because the risk isn't present?

(Short version: a) it's hard to reason about scenarios outside of experience*, b) this might induce asymmetry in estimates or intuition.)

*I have experienced stuff and found 'wow, that was way more intense than I'd expected' - for stuff I had never experienced before.

I'm not sure if this is the right place to ask this, but does anyone know what point Paul's trying to make in the following part of this podcast? (Relevant section starts around 1:44:00)

Suppose you have a P probability of the best thing you can do and a one-minus-P probability of the worst thing you can do, what does P have to be so it's the difference between that and the barren universe. I think most of my probability is distributed between you would need somewhere between 50% and 99% chance of good things and then put some probability or some credence on view

... (read more)
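(My own attempt at formalising what I take the question to be, with the barren universe normalised to zero utility; this is a reconstruction, not Paul's wording:)

$$p \cdot U_{\text{best}} + (1 - p) \cdot U_{\text{worst}} = 0 \quad\Longrightarrow\quad p = \frac{|U_{\text{worst}}|}{U_{\text{best}} + |U_{\text{worst}}|}$$

On that reading, an indifference point "somewhere between 50% and 99%" corresponds to judging the worst outcome to be between roughly one and ninety-nine times as bad as the best outcome is good.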
3Pattern
Here's a model that might simplify things: really negative events can affect people's lives for a long time afterward. From that model, it's easier to have utility effects by, say, reducing extreme negative events than by making someone who is 'happy' a little bit happier. So while the second thing may seem easier to do (cost), the first thing may still be more impactful even if you divide by its cost.

The obvious connection is how things play out within a person's life. If, say, you break your arm, maybe it'll be harder to do other things because:
* it's in a cast and you can't use it while it heals
* you're in pain. Maybe you don't enjoy things, like watching a movie, as much when you're in a lot of pain.

[Insert argument for wearing a helmet while riding a bike or motorcycle even if it's mildly inconvenient - because it helps reduce/prevent stuff that's way more inconvenient.]

It's easy to scale pain? This just seems like an argument that 'becoming slightly happier' is less pressing morally than 'reducing the amount of torture* in the world'.

*Might be worth noting that if this is about extreme pain, then this implies 'improving access to medical care' can be a very powerful intervention, i.e., effective altruism.

we ask the AGI to "make us happy", and it puts everyone paralyzed in hospital beds on dopamine drips. It's not hard to think that after a couple hours of a good high, this would actually be a hellish existence, since human happiness is way more complex than the amount of dopamine in one's brain (but of course, Genie in the Lamp, Midas' Touch, etc.)

This sounds much better than extinction to me! Values might be complex, yeah, but if the AI is actually programmed to maximise human happiness then I expect the high wouldn't wear off. Being turned into a wirehead... (read more)

2superads91
"This sounds much better than extinction to me! Values might be complex, yeah, but if the AI is actually programmed to maximise human happiness then I expect the high wouldn't wear off. Being turned into a wirehead arguably kills you, but it's a much better experience than death for the wirehead!" You keep dodging the point lol... As someone with some experience with drugs, I can tell you that it's not fun. Human happiness is way subjective and doesn't depend on a single chemical. For instance, some people love MDMA, others (like me) find it a too intense, too chemical, too fabricated happiness. A forced lifetime on MDMA would be some of the worst tortures I can imagine. It would fry you up. But even a very controlled dopamine drip wouldn't be good. But anyway, I know you're probably trolling, so just consider good old-fashioned torture in a dark dungeon instead... On Paul: yes, he's wrong, that's how. " I think most scenarios where you've got a boundless optimiser superintelligence would lead to the creation of new minds that would perfectly satisfy its utility function." True, except that, on that basis alone, you have no idea how that would happen and what would it imply for those new minds (and old ones), since you're not a digital superintelligence.

I'm way more scared about the electrode-produced smiley faces for eternity and the rest. That's way, way worse than dying.

FWIW, it seems kinda weird to me that such an AI would keep you alive... if you had a "smile-maximiser" AI, wouldn't it be indifferent to humans being braindead, as long as it's able to keep them smiling?

 

I'd like to have Paul Christiano's view that the "s-risk-risk" is 1/100 and that AGI is 30 years off

I think Paul's view is along the lines of "1% chance of some non-insignificant amount of suffering being intentionally created", n... (read more)

4superads91
Thanks for the attentive commentary.

1. Yeah, I was guessing that the smiley faces wouldn't be the best example... I was just wanting to draw something from the Eliezer/Bostrom universe since I had mentioned the paperclipper beforehand. So, maybe a better Eliezer-Bostrom example would be: we ask the AGI to "make us happy", and it puts everyone paralyzed in hospital beds on dopamine drips. It's not hard to think that after a couple hours of a good high, this would actually be a hellish existence, since human happiness is way more complex than the amount of dopamine in one's brain (but of course, Genie in the Lamp, Midas' Touch, etc.)

2. So, don't you equate this kind of scenario with a significant amount of suffering? Again, forget the bad example of the smiley faces, and reconsider. (I've actually read, in a popular LessWrong post about s-risks, Paul clearly saying that the risk of s-risk was 1/100th of the risk of x-risk (which makes for even less than 1/100th overall). Isn't that extremely naive, considering the whole Genie in the Lamp paradigm? How can we be so sure that the Genie will only create hell 1 time for each 100 times it creates extinction?)

3. a) I agree that a suffering-maximizer is quite unlikely. But you don't necessarily need one to create s-risk scenarios. You just need a Genie in the Lamp scenario. Like the dopamine drip example, in which the AGI isn't trying to maximize suffering, quite the contrary, but since it's super-smart in the sciences but lacks human common sense (a Genie), it ends up doing it.

b) Yes, I had read that article before. While it presents some fair solutions, I think it's far from being mostly solved. "Since hyperexistential catastrophes are narrow special cases (or at least it seems this way and we sure hope so), we can avoid them much more widely than ordinary existential risks." Note the "at least it seems this way and we sure hope so". Plus, what are the odds that the first AGI will be created by someone who listens to w
  • I think the problem is very likely to be resolved by different mechanisms based on trust and physical control rather than cryptography.

Do you expect these mechanisms to also resolve the case where a biological human is forcibly uploaded in horrible conditions?

Lurker here; I'm still very distressed after thinking about some futurism/AI stuff & worrying about possibilities of being tortured. If anyone's willing to have a discussion on this stuff, please PM!

[This comment is no longer endorsed by its author]
3Viliam
Just a note: while there are legitimate reasons to worry about things, sometimes people worry simply because they are psychologically prone to worry (they just switch from one convenient object of worry to another). The former case can be solved by thinking about the risks and possible precautions; the latter requires psychological help. Please make sure you know the difference, because no amount of rationally discussing risks and precautions can help with the fear that is fundamentally irrational (it usually works the opposite way: the more you talk about it, the more scared you are).

I know I've posted similar stuff here before, but I could still do with some people to discuss infohazardous s-risk related stuff that I have anxieties with. PM me.

a

[This comment is no longer endorsed by its author]
2avturchin
In that view, identity is very fragile and cryopreservation could damage it. Thus no risk of s-risks via cryonics. A weaker argument is: if an evil AI is interested in torturing real people, it may be satisfied with the billions who are already alive, and the additional cost of developing resurrection technology only to torture cryopatients may be too high. It would be cheaper to procreate new people with the same efforts.

Evolution "wants" pain to be a robust feedback/control mechanism that reliably causes the desired amount of avoidance - in this case, the greatest possible amount.

I feel that there's going to be a level of pain that a mind of nearly any level of pain tolerance would exert 100% of its energy to avoid. I don't think I know enough to comment on how much further than this level the brain can go, but it's unclear why the brain would develop the capacity to process pain drastically more intense than this; pain is just a tool to avoid certain things, and it ... (read more)

5DanArmak
The brain doesn't have the capacity to correctly process extreme pain. That's why it becomes unresponsive or acts counterproductively. The brain has the capacity to perceive extreme pain. This might be because:
* The brain has many interacting subsystems; the one(s) that react to pain stop working before the ones that perceive it.
* The range of perceivable pain (that is, the range in which we can distinguish stronger from weaker pain) is determined by implementation details of the neural system. If there was an evolutionary benefit to increasing the range, we would expect that to happen. But if the range is greater than necessary, that doesn't mean there's an evolutionary benefit to decreasing it; the simplest/most stable solution stays in place.

I'm unsure that "extreme" would necessarily get a more robust response, considering that there comes a point where the pain becomes disabling.

It seems as though there might be some sort of biological "limit" insofar as there are limited peripheral nerves, the grey matter can only process so much information, etc., and there'd be a point where the brain is 100% focused on avoiding the pain (meaning there'd be no evolutionary advantage to having the capacity to process additional pain). I'm not really sure where this limit would be, though. And I don't really know any biology so I'm plausibly completely wrong.

5DanArmak
I meant robust in the sense of decreasing the number of edge cases where the pain is insufficiently strong to motivate the particular individual as strongly as possible. (Since pain tolerance is variable, etc.) Evolution "wants" pain to be a robust feedback/control mechanism that reliably causes the desired amount of avoidance - in this case, the greatest possible amount.

That's an excellent point. Why would evolution allow (i.e. not select against) the existence of disabling pain (and fear, etc.)? Presumably, in the space of genotypes available for selection - in the long-term view, and for animals besides humans - there are no cheap solutions that would have an upper cut-off to pain stimuli (below the point of causing unresponsiveness) without degrading the avoidance response to lower levels of pain.

There is also the cutoff argument: a (non-human) animal can't normally survive e.g. the loss of a limb, so it doesn't matter how much pain exactly it feels in that scenario. Some cases of disabling pain fall in this category.

Finally, evolution can't counteract human ingenuity in torture, because humans act on much smaller timescales. It is to be expected that humans who are actively trying to cause pain (or to imagine how to do so) will succeed in causing amounts of pain beyond most anything found in nature.

I think the idea is that the 4th scenario is the case, and you can’t discern whether you’re the real you or the simulated version, as the simulation is (near-) perfect. In that scenario, you should act in the same way that you’d want the simulated version to. Either (1) you’re a simulation and the real you just won $1,000,000; or (2) you’re the real you and the simulated version of you thought the same way that you did and one-boxed (meaning that you get $1,000,000 if you one-box.)
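A toy payoff calculation for that reasoning (my own sketch, under the assumption that the simulation is accurate enough that simulated-you and real-you always choose the same way):

```python
# Newcomb-style payoffs when the predictor runs an accurate simulation of you.
# Assumption (for illustration only): the simulation mirrors your choice exactly,
# so the opaque box is filled iff you one-box.
def payoff(choice: str) -> int:
    opaque = 1_000_000 if choice == "one-box" else 0   # filled only if predicted to one-box
    transparent = 1_000 if choice == "two-box" else 0  # you only take this box when two-boxing
    return opaque + transparent

print(payoff("one-box"))   # 1000000
print(payoff("two-box"))   # 1000
```

Under that assumption, one-boxing comes out ahead whether you turn out to be the simulation or the real you.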

1solomon alon
I agree with you; I was just trying to emphasize that if you're the real you, your decision doesn't change anything. At most, if the simulation is extremely accurate, it can reveal what was already chosen, since you know that you will make the same decision as you previously made in the simulation. The big difference between me and timeless decision theory is that I contend that the only reason to choose just box B is that you might be in the simulation. This completely gets rid of ridiculous problems like Roko's basilisk. Since we are not currently simulating an AI, a future AI cannot affect us. If the AI had the suspicion that it was in a simulation then it might have an incentive to torture people, but given that it has no reason to think that, torture is a waste of time and effort.

If Trump loses the election, he's not the president anymore and the federal bureaucracy and military will stop listening to him.

He’d still be president until Biden’s inauguration though. I think most of the concern is that there’d be ~3 months of a president Trump with nothing to lose.

5ChristianKl
The idea that Trump would have nothing to lose assumes he cares about neither his business wealth nor his personal freedom and only cares about holding office. I don't think that's what Trump is about.

If anyone happens to be willing to privately discuss some potentially infohazardous stuff that's been on my mind (and not in a good way) involving acausal trade, I'd appreciate it - PM me. It'd be nice if I can figure out whether I'm going batshit.

it's much harder to know if you've got it pointed in the right direction or not

Perhaps, but the type of thing I'm describing in the post is more about preventing worse-than-death outcomes even if the sign is flipped (by designing a reward function/model in such a way that it's not going to torture everyone if that's the case).

This seems easier than recognising whether the sign is flipped or just designing a system that can't experience these sign-flip type errors; I'm just unsure whether this is something that we have robust s... (read more)

Seems a little bit beyond me at 4:45am - I'll probably take a look tomorrow when I'm less sleep deprived (although I still can't guarantee I'll be able to make it through then; there's quite a bit of technical language in there that makes my head spin.) Are you able to provide a brief tl;dr, and have you thought much about "sign flip in reward function" or "direction of updates to reward model flipped"-type errors specifically? It seems like these particularly nasty bugs could plausibly be mitigated more easily than avoiding false positives (as you defined them in the arXiv paper's abstract) in general.

3Gordon Seidoh Worley
Actually, I'm not sure that sign flips are easier to deal with. A sentiment I've heard expressed before is that it's much easier to trim something to be a little more or less of it, but it's much harder to know if you've got it pointed in the right direction or not. Ultimately, though, addressing false positives ends up being about these kinds of directional issues.
9Zack_M_Davis
Sleep is very important! Get regular sleep every night! Speaking from personal experience, you don't want to have a sleep-deprivation-induced mental breakdown while thinking about Singularity stuff!

Would you not agree that (assuming there's an easy way of doing it), separating the system from hyperexistential risk is a good thing for psychological reasons? Even if you think it's extremely unlikely, I'm not at all comfortable with the thought that our seed AI could screw up & design a successor that implements the opposite of our values; and I suspect there are at least some others who share that anxiety.

For the record, I think that this is also a risk worth worrying about for non-psychological reasons.

1[comment deleted]
You seem to have a somewhat general argument against any solution that involves adding onto the utility function in "What if that added solution was bugged instead?".

I might've failed to make my argument clear: if we designed the utility function as U = V + W (where W is the thing being added on and V refers to human values), this would only stop the sign flipping error if it was U that got flipped. If it were instead V that got flipped (so the AI optimises for U = -V + W), that'd be problematic.
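To spell the two cases out (same U, V and W as above; this is just my restatement of the point):

$$\text{flip of } U:\;\; -U = -(V + W) = -V - W \qquad\qquad \text{flip of } V \text{ only}:\;\; U' = -V + W$$

In the first case the W term is flipped along with everything else, which is the situation the construction is meant to absorb; in the second, the agent still treats W normally while actively minimising V.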


I think it's better to move on from
... (read more)
1[anonymous]

I see. I'm somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.


I don't think that's an example of (3), more like (1) or (2), or actually "none of the above because GPT-2 doesn't have this kind of architecture".

I just raised GPT-2 to indicate that flipping the goal sign suddenly can lead to optimising for bad behavior without the AI neglecting to consider new strategies. Presumably that'd suggest it's also a possibility with cosmic ray/other errors.

4Steven Byrnes
I'm not sure what probability people on this forum would put on brain-inspired AGI. I personally would put >50%, but this seems quite a bit higher than other people on this forum, judging by how little brain algorithms are discussed here compared to prosaic (stereotypical PyTorch / Tensorflow-type) ML. Or maybe the explanation is something else, e.g. maybe people feel like they don't have any tractable directions for progress in that scenario (or just don't know enough to comment), or maybe they have radically different ideas than me about how the brain works and therefore don't distinguish between prosaic AGI and brain-inspired AGI.

Understanding brain algorithms is a research program that thousands of geniuses are working on night and day, right now, as we speak, and the conclusion of the research program is guaranteed to be AGI. That seems like a pretty good reason to put at least some weight on it! I put even more weight on it because I've worked a lot on trying to understand how the neocortical algorithm works, and I don't think that the algorithm is all that complicated (cf. "cortical uniformity"), and I think that ongoing work is zeroing in on it (see here).

I hadn't really considered the possibility of a brain-inspired/neuromorphic AI, thanks for the points.

(2) seems interesting; as I understand it, you're basically suggesting that the error would occur gradually & the system would work to prevent it. Although maybe the AI realises it's getting positive feedback for bad things and keeps doing them, or something (I don't really know, I'm also a little sleep deprived and things like this tend to do my head in.) Like, if I hated beer then suddenly started liking it, I'd probably... (read more)

4Steven Byrnes
The whole point of the reward signals is to change the AI's motivations; we design the system such that that will definitely happen. But a full motivation system might consist of 100,000 neocortical concepts flagged with various levels of "this concept is rewarding", and each processing cycle where you get subcortical feedback, maybe only one or two of those flags would get rewritten, for example. Then it would spend a while feeling torn and conflicted about lots of things, as its motivation system gets gradually turned around. I'm thinking that we can and should design AGIs such that if it feels very torn and conflicted about something, it stops and alerts the programmer; and there should be a period where that happens in this scenario.

I don't think that's an example of (3), more like (1) or (2), or actually "none of the above because GPT-2 doesn't have this kind of architecture".

Mainly for brevity, but also because it seems to involve quite a drastic change in how the reward function/model as a whole functions. So it doesn't seem particularly likely that it'll be implemented.

True, but note that he elaborates and comes up with a patch to the patch (that being to have W refer to a class of events that would be expected to happen in the Universe's expected lifespan, rather than one that won't). So he still seems to support the basic idea, although he probably intended just to get the ball rolling with the concept rather than conclusively solve the problem.

3Steven Byrnes
Oops, forgot about that. You're right, he didn't rule that out. Is there a reason you don't list his "A deeper solution" here? (Or did I miss it?) Because it trades off against capabilities? Or something else?

Perhaps malware could be another risk factor in the type of bug I described here? Not sure.

I'm still a little dubious of Eliezer's solution to the problem of separation from hyperexistential risk; if we had U = V + W where V is a reward function & W is some arbitrary thing it wants to minimise (e.g. paperclips), a sign flip in V (due to any of a broad disjunction of causes) would still cause hyperexistential catastrophe.

Or what about the case where instead of maximising -U, the values that the reward function/model gives for each "thing... (read more)

I see what you're saying here, but the GPT-2 incident seems to downplay it somewhat IMO. I'll wait until you're able to write down your thoughts on this at length; this is something that I'd like to see elaborated on (as well as everything else regarding hyperexistential risk.)

Paperclipping seems to be negative utility, not approximately 0 utility.

My thinking was that an AI system that *only* takes values between 0 and +∞ (or some arbitrary positive number) would identify that killing humans would result in 0 human value, which is its minimum utility.


I read Eliezer's idea, and that strategy seems to be... dangerous. I think that "Giving an AGI a utility function which includes features which are not really relevant to human values" is something we want to avoid unless we absolutely need to.

How come? It do... (read more)

3Dach
I didn't mean to imply that a signflipped AGI would not instrumentally explore. I'm saying that, well... modern machine learning systems often get specific bonus utility for exploring, because it's hard to explore the proper amount as an instrumental goal due to the difficulties of fully modelling the situation, and because systems which don't have this bonus will often get stuck in local maximums. Humans exhibit this property too. We have investigating things, acquiring new information, and building useful strategic models as a terminal goal - we are "curious".

This is a feature we might see in early stages of modern attempts at full AGI, for similar reasons to why modern machine learning systems and humans exhibit this same behavior. Presumably such features would be built to uninstall themselves after the AGI reaches levels of intelligence sufficient to properly and fully explore new strategies as an instrumental goal to satisfying the human utility function, if we do go this route.

If we sign flipped the amount of reward the AGI gets from such a feature, the AGI would be penalized for exploring new strategies - this may have any number of effects which are fairly implementation specific and unpredictable. However, it probably wouldn't result in hyperexistential catastrophe. This AI, providing everything else works as intended, actually seems to be perfectly aligned. If performed on a subhuman seed AI, it may brick - in this trivial case, it is neither aligned nor misaligned - it is an inanimate object.

Yes, an AGI with a flipped utility function would pursue its goals with roughly the same level of intelligence. The point of this argument is super obvious, so you probably thought I was saying something else. I'm going somewhere with this, though - I'll expand later.
As an almost entirely inapplicable analogy . . . it's just doing something weird.
If we inverted the utility function . . . tiling the universe with smiley faces, i.e. paperclipping.

Interesting analogy. I can see what you're saying, and I guess it depends on what specifically gets flipped. I'm unsure about the second example; something like exploring new strategies doesn't seem like something an AGI would terminally value. It's instrumental to optimising the reward function/model, but I can't see it getting flipped *with* the r... (read more)

3Dach
Sorry, I meant instrumentally value. Typo. Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maximums. We may see this behavior in future attempts at AGI. And no, it would not be flipped with the reward function/model - I'm highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.

Paperclipping seems to be negative utility, not approximately 0 utility. It involves all the humans being killed and our beautiful universe being ruined. I guess if there are no humans, there's no utility in some sense, but human values don't actually seem to work that way. I rate universes where humans never existed at all and I'm... not sure what 0 utility would look like. It's within the range of experiences that people experience on modern-day earth - somewhere between my current experience and being tortured. This is just definition problems, though - we could shift the scale such that paperclipping is zero utility, but in that case, we could also just make an AGI that has a minimum at paperclipping levels of utility.

In the context of AI safety, I think "robust failsafe measures just in case" is part of "careful engineering". So, we agree!

I read Eliezer's idea, and that strategy seems to be... dangerous. I think that "Giving an AGI a utility function which includes features which are not really relevant to human values" is something we want to avoid unless we absolutely need to.

I have much more to say on this topic and about the rest of your comment, but it's definitely too much for a comment chain. I'll make an actual post on this containing my thoughts sometime in the next week or two, and link it to you.

Thanks for the detailed response. A bit of nitpicking (from someone who doesn't really know what they're talking about):

However, the vast majority of these mistakes would probably buff out or result in paper-clipping.

I'm slightly confused by this one. If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be *no* human utility (i.e. paperclips). But most attempts at an aligned AI would have a minimum at "I have no mouth, and I must scream".... (read more)

4Dach
It's hard to talk in specifics because my knowledge on the details of what future AGI architecture might look like is, of course, extremely limited.

As an almost entirely inapplicable analogy (which nonetheless still conveys my thinking here): consider the sorting algorithm for the comments on this post. If we flipped the "top-scoring" sorting algorithm to sort in the wrong direction, we would see the worst-rated posts on top, which would correspond to a hyperexistential disaster. However, if we instead flipped the effect that an upvote had on the score of a comment to negative values, it would sort comments which had no votes other than the default vote assigned on posting the comment to the top. This corresponds to paperclipping - it's not minimizing the intended function, it's just doing something weird.

If we inverted the utility function, this would (unless we take specific measures to combat it like you're mentioning) lead to hyperexistential disaster. However, if we invert some constant which is meant to initially provide value for exploring new strategies while the AI is not yet intelligent enough to properly explore new strategies as an instrumental goal, the AI would effectively brick itself. It would place negative value on exploring new strategies, presumably including strategies which involve fixing this issue so it can acquire more utility and strategies which involve preventing the humans from turning it off.

If we had some code which is intended to make the AI not turn off the evolution of the reward model before the AI values not turning off the reward model for other reasons (e.g. the reward model begins to properly model how humans don't want the AI to turn the reward model evolution process off), and some crucial sign was flipped which made it do the opposite, the AI would freeze the process of the reward model being updated and then maximize whatever inane nonsense its model currently represented, and it would eventually run into some bizarre p
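For what it's worth, a runnable toy version of that sorting analogy (my own reconstruction, not anything from an actual forum codebase):

```python
# Toy comment-sorting analogy for the two kinds of sign flip described above.
# Each comment is (label, votes); +1 is an upvote, -1 a downvote, and every
# comment starts with the author's own default upvote.
comments = [
    ("insightful",   [+1] * 10),
    ("unremarkable", [+1]),             # only the default vote
    ("awful",        [+1] + [-1] * 10),
]

def score(votes, upvote_effect=+1):
    return sum(upvote_effect if v > 0 else -1 for v in votes)

def ranked(key):
    return [label for label, votes in sorted(comments, key=lambda c: key(c[1]), reverse=True)]

print(ranked(score))                                 # intended: ['insightful', 'unremarkable', 'awful']
print(ranked(lambda v: -score(v)))                   # sort direction flipped: worst-rated on top ("hyperexistential")
print(ranked(lambda v: score(v, upvote_effect=-1)))  # upvote effect flipped: barely-voted comments float up ("paperclipping")
```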

I've seen that post & discussed it on my shortform. I'm not really sure how effective something like Eliezer's idea of "surrogate" goals there would actually be - sure, it'd help with some sign flip errors but it seems like it'd fail on others (e.g. if U = V + W, a sign error could occur in V instead of U, in which case that idea might not work.) I'm also unsure as to whether the probability is truly "very tiny" as Eliezer describes it. Human errors seem much more worrying than cosmic rays.

I don't really know what the probability is. It seems somewhat low, but I'm not confident that it's *that* low. I wrote a shortform about it last night (tl;dr it seems like this type of error could occur in a disjunction of ways and we need a good way of separating the AI in design space.)


I think I'd stop worrying about it if I were convinced that its probability is extremely low. But I'm not yet convinced of that. Something like the example Gwern provided elsewhere in this thread seems more worrying than the more frequently discus... (read more)

5Dach
You can't really be accidentally slightly wrong. We're not going to develop Mostly Friendly AI, which is Friendly AI but with the slight caveat that it has a slightly higher value on the welfare of shrimp than desired, with no other negative consequences. The molecular sorts of precision needed to get anywhere near the zone of loosely trying to maximize or minimize for anything resembling human values will probably only follow from a method that is converging towards the exact spot we want it to be at, such as some clever flawless version of reward modelling.

In the same way, we're probably not going to accidentally land in hyperexistential disaster territory. We could have some sign flipped, our checksum changed, and all our other error-correcting methods (any future seed AI should at least be using ECC memory, drives in RAID, etc.) defeated by religious terrorists, cosmic rays, unscrupulous programmers, quantum fluctuations, etc. However, the vast majority of these mistakes would probably buff out or result in paper-clipping.

If an FAI has slightly too high of a value assigned to the welfare of shrimp, it will realize this in the process of reward modelling and correct the issue. If its operation does not involve the continual adaptation of the model that is supposed to represent human values, it's not using a method which has any chance of converging to Overwhelming Victory or even adjacent spaces for any reason other than sheer coincidence. A method such as this has, barring stuff which I need to think more about (stability under self-modification), no chance of ending up in "We perfectly recreated human values... But placed an unreasonably high value on eating bread! Now all the humans will be force-fed bread until the stars burn out! Mwhahahahaha!" sorts of scenarios.

If the system cares about humans being alive enough to not reconfigure their matter into something else, we're probably using a method which is innately insulated from most types of hyperexis

It seems to me that ensuring we can separate an AI in design space from worse-than-death scenarios is perhaps the most crucial thing in AI alignment. I don’t at all feel comfortable with AI systems that are one cosmic ray: or, perhaps more plausibly, one human screw-up (e.g. this sort of thing) away from a fate far worse than death. Or maybe a human-level AI makes a mistake and creates a sign flipped successor. Perhaps there’s some sort of black swan possibility that nobody realises. I think that it’s absolutely critical that we have a... (read more)

Sure, but I'd expect that a system as important as this would have people monitoring it 24/7.

Do you think that this specific risk could be mitigated by some variant of Eliezer’s separation from hyperexistential risk or Stuart Armstrong's idea here:

Let B1 and B2 be excellent, bestest outcomes. Define U(B1) = 1, U(B2) = -1, and U = 0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
Or, more usefully, let X be some trivial feature that the agent can easily set to -1 or 1, and let U be a utility function with values in [
... (read more)
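For reference, a toy sketch of the construction in that quote as I understand it (outcome labels made up by me):

```python
# Armstrong's construction, as I read it: both the maximum (B1) and the minimum (B2)
# of U are excellent outcomes, so an agent maximising U and an agent maximising -U
# (i.e. after a sign flip) both land somewhere good.
def U(outcome: str) -> int:
    return {"B1": 1, "B2": -1}.get(outcome, 0)

outcomes = ["B1", "B2", "mediocre world", "torture world"]

print(max(outcomes, key=U))                # 'B1' - the intended optimum
print(max(outcomes, key=lambda o: -U(o)))  # 'B2' - still an excellent outcome
```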

I asked Rohin Shah about that possibility in a question thread about a month ago. I think he's probably right that this type of thing would only plausibly make it through the training process if the system's *already* smart enough to be able to think about this type of thing. And then on top of that there are still things like sanity checks which, while unlikely to pick up numerous errors, would probably notice a sign error. See also this comment:

Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect i
... (read more)

I'm under the impression that an AGI would be monitored *during* training as well. So you'd effectively need the system to turn "evil" (utility function flipped) during the training process, and the system to be smart enough to conceal that the error occurred. So it'd need to happen a fair bit into the training process. I guess that's possible, but IDK how likely it'd be.

4habryka
Yeah, I do think it's likely that AGI would be monitored during training, but the specific instance of OpenAI staff being asleep while we train the AI is a clear instance of us not monitoring the AI during the most crucial periods (which, to be clear, I think is fine, since I think the risks were indeed quite low, and I don't see this as providing super much evidence about OpenAI's future practices).

Sure, but the *specific* type of error I'm imagining would surely be easier to pick up than most other errors. I have no idea what sort of sanity checking was done with GPT-2, but the fact that the developers were asleep when it trained is telling: they weren't being as careful as they could've been.

For this type of bug (a sign error in the utility function) to occur *before* the system is deployed and somehow persist, it'd have to make it past all sanity-checking tools (which I imagine would be used extensively with an AGI) *and* someh... (read more)

2ChristianKl
Given that compute is very expensive, economic pressures will push training to be 24/7, so it's unlikely that people generally pause the training when going to sleep.
4habryka
At least with current technologies, I expect serious risks to start occurring during training, not deployment. That's ultimately when you will see the greatest learning happening, when you have the greatest access to compute, and when you will first cross the threshold of intelligence that will make the system actually dangerous. So I don't think that just checking things after they are trained is safe.

Wouldn't any configuration errors or updates be caught with sanity-checking tools though? Maybe the way I'm visualising this is just too simplistic, but any developers capable of creating an *aligned* AGI are going to be *extremely* careful not to fuck up. Sure, it's possible, but the most plausible cause of a hyperexistential catastrophe to me seems to be where a SignFlip-type error occurs once the system has been deployed.


Hopefully a system as crucially important as an AGI isn't going to have just one guy watching it who "takes a... (read more)

Many entities have sanity-checking tools. They fail. Many have careful developers. They fail. Many have automated tests. They fail. And so on. Disasters happen because all of those will fail to work every time and therefore all will fail some time. If any of that sounds improbable, as if there would have to be a veritable malevolent demon arranging to make every single safeguard fail or backfire (literally, sometimes, like the recent warehouse explosion - triggered by welders trying to safeguard it!), you should probably read more about complex systems and their failures to understand how normal it all is.

If we actually built an AGI that optimised to maximise a loss function, wouldn't we notice long before deploying the thing?


I'd imagine that this type of thing would be sanity-checked and tested intensively, so signflip-type errors would predominantly be scenarios where the error occurs *after* deployment, like the one Gwern mentioned ("A programmer flips the meaning of a boolean flag in a database somewhere while not updating all downstream callers, and suddenly an online learner is now actively pessimizing their target metric.")
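The kind of pre-deployment sanity check I have in mind would be something like this (purely illustrative; the reward model and reference behaviours are stand-ins I made up):

```python
# Pre-deployment sanity check: assert the learned reward model ranks a handful of
# reference behaviours the way humans obviously would. A flipped sign inverts the
# ranking and trips the assertion immediately.
def sanity_check_reward(reward_model, good_refs, bad_refs):
    worst_good = min(reward_model(t) for t in good_refs)
    best_bad = max(reward_model(t) for t in bad_refs)
    assert worst_good > best_bad, "reference behaviours ranked backwards - possible sign flip"

reward_model = lambda t: {"help_human": 1.0, "tell_truth": 0.8, "injure_human": -1.0}[t]
sanity_check_reward(reward_model, ["help_human", "tell_truth"], ["injure_human"])   # passes

flipped = lambda t: -reward_model(t)
# sanity_check_reward(flipped, ["help_human", "tell_truth"], ["injure_human"])      # would raise AssertionError
```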

1mako yass
Maybe the project will come up with some mechanism that detects that. But if they fall back to the naive "just watch what it does in the test environment and assume it'll do the same in production," then there is a risk it's going to figure out it's in a test environment, and that its judges would not react well to finding out what is wrong with its utility function, and then it will act aligned in the testing environment. If we ever see a news headline saying "Good News, AGI seems to 'self-align' regardless of the sign of the utility function!" that will be some very bad news.
7gwern
Even if you disclaim configuration errors or updates (despite this accounting for most of a system's operating lifespan, and human/configuration errors accounting for a large fraction of all major errors at cloud providers etc., according to postmortems), an error may still happen too fast to notice. Recall that in the preference learning case, the bug manifested after Christiano et al. went to sleep, and they woke up to the maximally-NSFW AI. AlphaZero trained in ~2 hours wallclock, IIRC. Someone working on an even larger cluster commits a change and takes a quick bathroom break...

Interesting. Terrifying, but interesting.

Forgive me for my stupidity (I'm not exactly an expert in machine learning), but it seems to me that building an AGI linked to some sort of database like that in such a fashion (that some random guy's screw-up can effectively reverse the utility function completely) is a REALLY stupid idea. Would there not be a safer way of doing things?

Do you think that this type of thing could plausibly occur *after* training and deployment?

4gwern
Yes. For example: lots of applications use online learning. A programmer flips the meaning of a boolean flag in a database somewhere while not updating all downstream callers, and suddenly an online learner is now actively pessimizing their target metric.
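A minimal sketch of that failure mode (entirely hypothetical names and a toy learner; no real system implied): a single boolean read from config or a database decides whether the learner ascends or descends its metric.

```python
import numpy as np

# Toy online learner: logistic-regression-style weights updated one example at a
# time, with `higher_is_better` read from some external config/database.
rng = np.random.default_rng(0)
weights = np.zeros(3)

def update(weights, x, y, higher_is_better, lr=0.1):
    pred = 1 / (1 + np.exp(-x @ weights))
    grad = (y - pred) * x                   # direction that increases the target metric
    sign = 1.0 if higher_is_better else -1.0
    return weights + sign * lr * grad

# Someone flips the flag's meaning upstream without updating this caller:
higher_is_better = False                    # silently turns learning into active pessimization

for _ in range(1000):
    x = rng.normal(size=3)
    y = float(x[0] + 0.5 * x[1] > 0)        # toy ground truth
    weights = update(weights, x, y, higher_is_better)

print(weights)   # weights now point the wrong way, actively minimizing accuracy
```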