Did people say why they deferred to these people?
No, we only asked respondents to give names
I think another interesting question to correlate this would be "If you believe AI x-risk is a severely important issue, what year did you come to believe that?".
Agree, that would have been interesting to ask
Things that surprised me about the results
Sorry for the delay, it will be out this month!
Just wanted to say this is the single most useful thing I've read for improving my understanding of alignment difficulty. Thanks for taking the time to write it!
Part of me thinks: I was trying to push on whether it has a world model or rather has just memorised loads of stuff on the internet and learned a bunch of heuristics for how to produce compelling internet-like text. For me, "world model" evokes some object that has a map-territory relationship with the world. It's not clear to me that GPT-3 has that.
Another part of me thinks: I'm confused. It seems just as reasonable to claim that it obviously has a world model that's just not very smart. I'm probably using bad concepts and should think about this more.
It looks good to me!
This is already true for GPT-3
Idk, maybe...?
Re the argument for "Why internalization might be difficult", I asked Evan Hubinger for his take on your rendition of the argument, and he thinks it's not right.
Rather, the argument that Risks from Learned Optimization makes that internalization would be difficult is that:
Edit: or do you just mean that even though you take the same steps, the two feel different because retreating =/= going further along the wall
Yeah, this — I now see what you were getting at!
One argument for alignment difficulty is that corrigibility is "anti-natural" in a certain sense. I've tried to write out my understanding of this argument, and would be curious if anyone could add or improve anything about it.
I'd be equally interested in any attempts at succinctly stating other arguments for/against alignment difficulty.
Instead of "always go left", how about "always go along one wall"?
Yeah, maybe better, though still doesn't quite capture the "backing up" part of the algorithm. Maybe "I explore all paths through the maze, taking left hand turns first, backing up if I reach a dead end"... that's a bit verbose though.
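For concreteness, here's a minimal sketch of that verbose version (my own illustration, with a made-up grid-maze representation): depth-first search that tries moves in a fixed "left first" order and backs up at dead ends. A real left-hand rule would pick turns relative to the walker's current heading, so the fixed order here is a simplification.

```python
def solve_maze(maze, start, goal):
    """maze: set of open (row, col) cells. Returns a path from start to goal, or None."""
    visited = set()

    def dfs(cell, path):
        if cell == goal:
            return path
        visited.add(cell)
        r, c = cell
        # Fixed preference order standing in for "take left-hand turns first".
        for nxt in [(r, c - 1), (r - 1, c), (r, c + 1), (r + 1, c)]:
            if nxt in maze and nxt not in visited:
                result = dfs(nxt, path + [nxt])
                if result is not None:
                    return result
        return None  # dead end: back up to the previous cell

    return dfs(start, [start])


# Tiny example: an L-shaped corridor.
maze = {(0, 0), (0, 1), (0, 2), (1, 2)}
print(solve_maze(maze, (0, 0), (1, 2)))  # [(0, 0), (0, 1), (0, 2), (1, 2)]
```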
I don't think there is a difference.
Gotcha
Another small nitpick: the difference, if any, between proxy alignment and corrigibility isn't explained. The concept of proxy alignment is introduced in subsection "The concept" without first defining it.
I've since been told about Tasshin Fogleman's guided metta meditations, and have found their aesthetic to be much more up my alley than the others I've tried. I'd expect others who prefer a more rationalist-y aesthetic to feel similarly.
The one called 'Loving our parts' seems particularly good for self-love practice.
I still find the arguments that inner misalignment is plausible to rely on intuitions that feel quite uncertain to me (though I'm convinced that inner misalignment is possible).
So, I currently tend to prefer the following as the strongest "solid, specific reason to expect dangerous misalignment":
We don't yet have training setups that incentivise agents to do what their operators want, once they are sufficiently powerful.
Instead, the best we can do currently is naive reward modelling, and agents trained in this way are obviously incentivised to seize contro...
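To make "naive reward modelling" a bit more concrete, here's a toy sketch (entirely my own, with made-up numbers and a linear model standing in for the real thing) of the two-stage setup I have in mind: fit a reward model to pairwise human comparisons, then have the agent optimise the learned model's score. The toy only shows the structure, not the failure mode, but the worry above is about stage two: the agent is optimising the model's score rather than what the operator actually wants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 candidate actions, each described by 3 features.
actions = rng.normal(size=(5, 3))
true_weights = np.array([1.0, -0.5, 0.2])  # what the operator actually values

def human_prefers(a, b):
    """Simulated human comparison: a noisy judgement of which action is better."""
    return (actions[a] @ true_weights + rng.normal(0, 0.1)
            > actions[b] @ true_weights + rng.normal(0, 0.1))

# Stage 1: fit a linear reward model to pairwise comparisons (Bradley-Terry style).
w = np.zeros(3)
for _ in range(2000):
    a, b = rng.choice(5, size=2, replace=False)
    label = 1.0 if human_prefers(a, b) else 0.0
    diff = actions[a] - actions[b]
    pred = 1.0 / (1.0 + np.exp(-w @ diff))
    w += 0.1 * (label - pred) * diff  # gradient step on the preference likelihood

# Stage 2: the "policy" simply argmaxes the learned reward model.
print("action chosen by optimising the learned model:", int(np.argmax(actions @ w)))
print("action the operator actually wanted:", int(np.argmax(actions @ true_weights)))
```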
Re: corrigibility being "anti-natural" in a certain sense - I think I have a better understanding of this now:
Minor:
(If you don't know what depth-first search means: as far as mazes are concerned, it's simply the "always go left" rule.)
I was confused for a while, because my interpretation of "always go left" doesn't involve backing up (instead, when you get to a wall on the left, you just keep walking into it forever).
Amazing!
This has inspired me to try this too. I think I won't do 1h per day because I'm out of practice with meditation so 1h sounds real hard, but I commit to doing 20 mins per day for 10 days sometime in February.
What resources did you use to learn/practice? (Anything additional to the ones recommended in this post?) Was there anything else that helped?
Good idea, I can't get it to work on LW but here is the link: https://docs.google.com/document/d/1XyXNZjRTNImRB6HNOOr_2S0uASpwjopRfyd4Y8fATf8/edit?usp=sharing
why attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (...)
If you know of a reference to, or feel like explaining in some detail, the arguments given (in parentheses) for this claim, I'd love to hear them!
Minor terminology note, in case discussion about "genomic/genetic bottleneck" continues: genetic bottleneck appears to have a standard meaning in ecology (different to Richard's meaning), so genomic bottleneck seems like the better term to use.
Strong upvote, I would also love to see more discussion on the difficulty of inner alignment.
which if true should preclude strong confidence in disaster scenarios
Though only for disaster scenarios that rely on inner misalignment, right?
... seem like world models that make sense to me, given the surrounding justifications
FWIW, I don't really understand those world models/intuitions yet:
My own guess is that this is not that far-fetched.
Thanks for writing this out, I found it helpful and it's updated me a bit towards human extinction not being that far-fetched in the 'Part 1' world. Though I do still think that, in this world, humans would almost certainly have very little chance of ever gaining control over our future/trajectory.
Without the argument this feels alarmist
Let me try to spell out the argument a little more - I think my original post was a little unclear. I don't think the argument actually appeals to the "convergent in...
Good catch, I edited the last points in each part to make the scale of the disaster clearer, and removed the reference to gorillas.
I do think the scale of disaster is smaller (in expectation) in Part 1 than in Part 2, for the reason mentioned here - basically, the systems in Part 1 are somewhat more aligned with human intentions (albeit poorly specified proxies to them), so there's some chance that they leave humans alone. Whereas Part 2 is a treacherous turn inner alignment failure, where the systems learned arbitrary objectives and so have no incentive a...
I sometimes want to point people towards a very short, clear summary of What failure looks like, which doesn't seem to exist, so here's my attempt.
If we don’t have the techniques to reliably align AI, will someone deploy AI anyway? I think the answer is more likely to be yes.
What level of deployment of unaligned benchmark systems do you expect would make doom plausible? "Someone" suggests maybe you think one deployment event of a sufficiently powerful system could be enough (which would be surprising in slow takeoff worlds). If you do think this, is it something to do with your expectations about discontinuous progress around AGI?
A more valid criticism would be that the authors spend most of their time on showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is in practice
Sure, I agree this is a stronger point.
The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.
Not really, unfortunately. In those posts, the authors are focusing on painting a plausible pi...
I'm broadly sympathetic to your point that there have been an unfortunate number of disagreements about inner alignment terminology, and it has been and remains a source of confusion.
to the extent that Evan has felt a need to write an entire clarification post.
Yeah, and recently there has been even more disagreement/clarification attempts.
I should have specified this on the top level question, but (as mentioned in my own answer) I'm talking about abergal's suggestion of what inner alignment failure should refer to (basically: a model pursuing a differe...
Thanks for your reply!
depends on what you mean by strongest arguments.
By strongest I definitely mean the second thing (probably I should have clarified here, thanks for picking up on this).
Also, the strongest argument when you address an audience of type A, say policy makers, may not be the strongest argument for an audience of type B, say ML researchers.
Agree, though I expect it's more like, the emphasis needs to be different, whilst the underlying argument is similar (conditional on talking about your second definition of "strongest").
...many di
Immersion reading, i.e. reading a book and listening to the audio version at the same time. It makes it easier to read when tired, improves retention, and increases the speed at which I can comfortably read.
Most of all, with a good narrator, it makes reading fiction feel closer to watching a movie in terms of the 'immersiveness' of the experience (while retaining all the ways in which fiction is better than film).
It's also very cheap and easy at the margin if you're willing to pay for a Kindle and an Audible subscription.
Arguments for outer alignment failure, i.e. that we will plausibly train advanced AI systems using a training objective that doesn't incentivise or produce the behaviour we actually want from the AI system. (Thanks to Richard for spelling out these arguments clearly in AGI safety from first principles.)
(Note: this post is an extended version of this post about stories of continuous deception. If you are already familiar with treacherous turn vs. sordid stumble you can skip the first part.)
FYI, broken link in this sentence.
I found this post helpful and interesting, and refer to it often! FWIW I think that powerful persuasion tools could have bad effects on the memetic ecosystem even if they don't shift the balance of power to a world with fewer, more powerful ideologies. In particular, the number of ideologies could remain roughly constant, but each could get more 'sticky'. This would make reasonable debate and truth-seeking harder, as well as reducing trusted and credible multipartisan sources. This seems like an existential risk factor, e.g. because it will make coordinati...
only sleep when I'm tired
Sounds cool, I'm tempted to try this out, but I'm wondering how this jibes with the common wisdom that going to bed at the same time every night is important? And "No screens an hour before bed" - how do you know what "an hour before bed" is if you just go to bed when tired?
I feel similarly, and still struggle with turning off my brain. Has anything worked particularly well for you?
I'm curious how you actually use the information from your Oura ring? To help measure the effectiveness of sleep interventions? As one input for deciding how to spend your day? As a motivator to sleep better? Something else?
Makes sense, thanks!
being trained on "follow instructions"
What does this actually mean, in terms of the details of how you'd train a model to do this?
Thanks for the reply - a couple of responses:
it doesn't seem useful to get a feeling for "how far off of ideal are we likely to be" when that is composed of: 1. What is the possible range of AI functionality (as constrained by physics)? - ie what can we do?
No, these cases aren't included. The definition is: "an existential catastrophe that could have been avoided had humanity's development, deployment or governance of AI been otherwise". Physics cannot be changed by humanity's development/deployment/governance decisions. (I agree that cases 2 and 3 are...
Thanks for pointing this out. We did intend for cases like this to be included, but I agree that it's unclear if respondents interpreted it that way. We should have clarified this in the survey instructions.
Is one question combining the risk of "too much" AI use and "too little" AI use?
Yes, it is. Combining these cases seems reasonable to me, though we definitely should have clarified this in the survey instructions. They're both cases where humanity could have avoided an existential catastrophe by making different decisions with respect to AI.
Thanks a lot for this post, I found it extremely helpful and expect I will refer to it a lot in thinking through different threat models.
I'd be curious to hear how you think the Production Web stories differ from part 1 of Paul's "What failure looks like".
To me, the underlying threat model seems to be basically the same: we deploy AI systems with objectives that look good in the short-run, but when those systems become equally or more capable than humans, their objectives don't generalise "well" (i.e. in ways desirable by human standards), because they're ...
I'm a bit confused about the edges of the inadequate equilibrium concept you're interested in.
In particular, do simple cases of negative externalities count? E.g. the econ 101 example of "factory pollutes river" - seems like an instance of (1) and (2) in Eliezer's taxonomy - depending on whether you're thinking of the "decision-maker" as (1) the factory owner (who would lose out personally) or (2) the government (who can't learn the information they need because the pollution is intentionally hidden). But this isn't what I'd typically think of as a bad Nash equilibrium, because (let's suppose) the factory owners wouldn't actually be better off by "cooperating"
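To illustrate the distinction with some toy numbers of my own (nothing here comes from Eliezer's taxonomy): in a prisoner's-dilemma-style bad Nash equilibrium, every player prefers the all-cooperate outcome to the equilibrium, whereas in the plain externality case the factory is simply better off polluting no matter what, so there's no outcome the players would collectively want to move to.

```python
# Prisoner's dilemma: payoffs[(my_action, their_action)] = my payoff.
pd = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}
print("PD equilibrium (both defect):", pd[("D", "D")],
      "< both cooperate:", pd[("C", "C")])  # everyone would prefer to escape it

# Simple externality: the factory's payoff doesn't depend on anyone else's choice;
# the cost just lands on the town, who isn't a player in the factory's decision.
factory_payoff = {"pollute": 10, "abate": 4}
town_payoff = {"pollute": -8, "abate": 0}
best = max(factory_payoff, key=factory_payoff.get)
print("Factory's dominant choice:", best, "| town's payoff under it:", town_payoff[best])
```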
Just an outside view that over the last decades, a number of groups who previously had to suppress their identities/were vilified are now more accepted (e.g., LGBTQ+, feminists, vegans), and I expect this trend to continue.
I'm curious if you expect this trend to change, or maybe we're talking about slightly different things here?
I had something like "everybody who has to strongly hide part of their identity when living in cities" in mind
Thanks for writing this! Here's another, that I'm posting specifically because it's confusing to me.
Takeoff was slow and lots of actors developed AGI around the same time. Intent alignment turned out relatively easy and so lots of actors with different values had access to AGIs that were trying to help them. Our ability to solve coordination problems remained at ~its current level. Nation states, or something like them, still exist, and there is still lots of economic competition between and within them. Sometimes there is military conflict,...
Epistemic effort: I thought about this for 20 minutes and dumped my ideas, before reading others' answers
Thanks for this, really interesting!
Meta question: when you wrote this list, what did your thought process/strategies look like, and what do you think are the best ways of getting better at this kind of futurism?
More context:
Will MacAskill calls this the "actual alignment problem"
Wei Dai has written a lot about related concerns in posts like The Argument from Philosophical Difficulty
The AI systems in part I of the story are NOT "narrow" or "non-agentic"
Relatedly: if we manage to solve intent alignment (including making it competitive) but still have an existential catastrophe, what went wrong?
Finally posted: https://www.lesswrong.com/posts/qccxb3uzwFDsRuJuP/deference-on-ai-timelines-survey-results