All of Sam Clarke's Comments + Replies

Did people say why they deferred to these people?

No, we only asked respondents to give names

I think another interesting question to correlate with this would be: "If you believe AI x-risk is a severely important issue, what year did you come to believe that?"

Agree, that would have been interesting to ask

Things that surprised me about the results

  • There’s more variety than I expected in the group of people who are deferred to
    • I suspect that some of the people in the “everyone else” cluster defer to people in one of the other clusters—in which case there is more deference happening than these results suggest.
  • There were more “inside view” responses than I expected (maybe partly because people who have inside views were incentivised to respond, because it’s cool to say you have inside views or something). Might be interesting to think about whether it’s good (on
... (read more)
Quadratic Reciprocity
I don't remember if I put down "inside view" on the form when filling it out but that does sound like the type of thing I may have done. I think I might have been overly eager at the time to say I had an "inside view" when what I really had was: confusion and disagreements with others' methods for forecasting, weighing others' forecasts in a mostly non-principled way, and intuitions about AI progress that were maybe overly strong and as much or more based on hanging around a group of people and picking up their beliefs as on evaluating evidence for myself. It feels really hard to not let the general vibe around me affect the process of thinking through things independently.

Based on the results, I would think more people thinking about this for themselves and writing up their reasoning or even rough intuitions would be good. I suspect my beliefs are more influenced by the people that ranked high in survey answers than I'd want them to be, because it turns out people around me are deferring to the same few people. Even when I think I have my own view on something, it is largely affected by the fact that Ajeya said 2040/2050 and Daniel Kokotajlo said 5/7 years, and the vibes have trickled down to me even though I would weigh their forecasts/methodology less if I were coming across it for the first time.

(The timelines question doesn't feel that important to me for its own sake at the moment but I think it is a useful one to practise figuring out where my beliefs actually come from)

Just wanted to say this is the single most useful thing I've read for improving my understanding of alignment difficulty. Thanks for taking the time to write it!

Ramana Kumar
Thanks, that's great to hear :)

Part of me thinks: I was trying to push on whether it has a world model or rather has just memorised loads of stuff on the internet and learned a bunch of heuristics for how to produce compelling internet-like text. For me, "world model" evokes some object that has a map-territory relationship with the world. It's not clear to me that GPT-3 has that.

Another part of me thinks: I'm confused. It seems just as reasonable to claim that it obviously has a world model that's just not very smart. I'm probably using bad concepts and should think about this more.

It looks good to me!

This is already true for GPT-3

Idk, maybe...?

Rafael Harth
Is that in doubt? Note that I don't say it models the base objective in the post, I just say that it has a complex world model. This seemed unquestionable to me since it demonstrably knows lots of things. Or are you drawing a distinction between "a lot of facts about stuff" and "a world model"? I haven't drawn that distinction; "model" seems very general and "complex" trivially true. It may not be a smart model.
Sam Clarke

Re the argument for "Why internalization might be difficult", I asked Evan Hubinger for his take on your rendition of the argument, and he thinks it's not right.

Rather, the argument that Risks from Learned Optimization makes for why internalization would be difficult is that:

  • ~all models with good performance on a diverse training set probably have to have a complex world model already, which likely includes a model of the base objective,
  • so having the base objective re-encoded in a separate part of the model that represents its objective is just a waste of
... (read more)
Rafael Harth
Thanks! I agree it's an error, of course. I've changed the section; do you think it's accurate now?

Edit: or do you just mean that even though you take the same steps, the two feel different because retreating =/= going further along the wall

Yeah, this — I now see what you were getting at!

One argument for alignment difficulty is that corrigibility is "anti-natural" in a certain sense. I've tried to write out my understanding of this argument, and would be curious if anyone could add or improve anything about it.

I'd be equally interested in any attempts at succinctly stating other arguments for/against alignment difficulty.

Instead of "always go left", how about "always go along one wall"?

Yeah, maybe better, though it still doesn't quite capture the "backing up" part of the algorithm. Maybe "I explore all paths through the maze, taking left-hand turns first, backing up if I reach a dead end"... that's a bit verbose though.
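For readers who want to see the equivalence concretely, here is a minimal sketch of that verbose version: depth-first search through a grid maze, trying moves in a fixed preference order and backing up at dead ends. The grid encoding and the particular move order are illustrative assumptions, not taken from the original post.

```python
# Minimal sketch (illustrative assumptions: 0 = open cell, 1 = wall, and a
# fixed move-preference order standing in for "try left first").

def solve(maze, pos, goal, visited=None):
    """Return a path from pos to goal as a list of cells, or None if stuck."""
    if visited is None:
        visited = set()
    if pos == goal:
        return [pos]
    visited.add(pos)
    r, c = pos
    for dr, dc in [(0, -1), (-1, 0), (0, 1), (1, 0)]:  # fixed preference order
        nr, nc = r + dr, c + dc
        if (0 <= nr < len(maze) and 0 <= nc < len(maze[0])
                and maze[nr][nc] == 0 and (nr, nc) not in visited):
            rest = solve(maze, (nr, nc), goal, visited)
            if rest is not None:       # the goal lies down this branch
                return [pos] + rest
    return None                        # dead end: back up to the caller

maze = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(solve(maze, (0, 0), (0, 2)))
```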

I don't think there is a difference.

Gotcha

Rafael Harth
It doesn't? Isn't it exactly the same, at least provided the wall is topologically connected? I believe in the example I've drawn, going along one wall is identical to depth-first search. Edit: or do you just mean that even though you take the same steps, the two feel different because retreating =/= going further along the wall

Another small nitpick: the difference, if any, between proxy alignment and corrigibility isn't explained. The concept of proxy alignment is introduced in subsection "The concept" without first defining it.

Rafael Harth
Thanks. (I put "look at your comments" on my todo list when you posted them a week ago, then totally forgot, so it's nice to have a reminder.) Instead of "always go left", how about "always go along one wall"? With respect to proxy vs. corrigibility, I'll have to check whether I had a good reason to use both terms there, because right now it seems like introducing corrigibility is unnecessary. I don't think there is a difference.

I've since been told about Tasshin Fogleman's guided metta meditations, and have found their aesthetic to be much more up my alley than the others I've tried. I'd expect others who prefer a more rationalist-y aesthetic to feel similarly.

The one called 'Loving our parts' seems particularly good for self-love practice.

I still find that the arguments for inner misalignment being plausible rely on intuitions that feel quite uncertain to me (though I'm convinced that inner misalignment is possible).

So, I currently tend to prefer the following as the strongest "solid, specific reason to expect dangerous misalignment":

We don't yet have training setups that incentivise agents to do what their operators want, once they are sufficiently powerful.

Instead, the best we can do currently is naive reward modelling, and agents trained in this way are obviously incentivised to seize contro... (read more)

Sam Clarke

Re: corrigibility being "anti-natural" in a certain sense - I think I have a better understanding of this now:

  • Eventually, we need to train an AI system capable enough to enable a pivotal act (in particular, actions that prevent the world from being destroyed by any other future AGI)
  • AI systems that are capable enough to enable a pivotal act must be (what Eliezer calls) a “consequentialist”: a system that “searches paths through time and selects high-scoring ones for output”
  • Training an aligned/corrigible/obedient consequentialist is something that Elieze
... (read more)
Rob Bensinger
Note that this is still better than 'honestly panic about not having achieved it and throw caution to the wind / rationalize reasons they don't need to halt'!

Minor:

(If you don't know what depth-first search means: as far as mazes are concerned, it's simply the "always go left" rule.)

I was confused for a while, because my interpretation of "always go left" doesn't involve backing up (instead, when you get to a wall on the left, you just keep walking into it forever).

Sam Clarke
Another small nitpick: the difference, if any, between proxy alignment and corrigibility isn't explained. The concept of proxy alignment is introduced in subsection "The concept" without first defining it.

Amazing!

This has inspired me to try this too. I think I won't do 1h per day because I'm out of practice with meditation so 1h sounds real hard, but I commit to doing 20 mins per day for 10 days sometime in February.

What resources did you use to learn/practice? (Anything additional to the ones recommended in this post?) Was there anything else that helped?

Sam Clarke
I've since been told about Tasshin Fogleman's guided metta meditations, and have found their aesthetic to be much more up my alley than the others I've tried. I'd expect others who prefer a more rationalist-y aesthetic to feel similarly. The one called 'Loving our parts' seems particularly good for self-love practice.
KatWoods
So glad to hear it! I don't use any particular resource. Just the general principle: generate the feeling of loving-kindness towards something easy for you, then maintain that emotion while thinking of something that's slightly harder to feel loving-kindness towards, then slowly level up, until you're working on people who are really hard for you. Good luck! Would love to hear how it goes :)

why attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (...)

If you know of a reference to, or feel like explaining in some detail, the arguments given (in parentheses) for this claim, I'd love to hear them!

Minor terminology note, in case discussion about "genomic/genetic bottleneck" continues: genetic bottleneck appears to have a standard meaning in ecology (different to Richard's meaning), so genomic bottleneck seems like the better term to use.

Sam Clarke

Strong upvote, I would also love to see more discussion on the difficulty of inner alignment.

which if true should preclude strong confidence in disaster scenarios

Though only for disaster scenarios that rely on inner misalignment, right?

... seem like world models that make sense to me, given the surrounding justifications

FWIW, I don't really understand those world models/intuitions yet:

  • Re: "earlier patches not generalising as well as the deep algorithms" - I don't understand/am sceptical about the abstraction of "earlier patches" vs. "deep algori
... (read more)
Sam Clarke

Re: corrigibility being "anti-natural" in a certain sense - I think I have a better understanding of this now:

  • Eventually, we need to train an AI system capable enough to enable a pivotal act (in particular, actions that prevent the world from being destroyed by any other future AGI)
  • AI systems that are capable enough to enable a pivotal act must be (what Eliezer calls) a “consequentialist”: a system that “searches paths through time and selects high-scoring ones for output”
  • Training an aligned/corrigible/obedient consequentialist is something that Elieze
... (read more)

My own guess is that this is not that far-fetched.

Thanks for writing this out, I found it helpful and it's updated me a bit towards human extinction not being that far-fetched in the 'Part 1' world. Though I do still think that, in this world, humans would almost certainly have very little chance of ever gaining control over our future/trajectory.

Without the argument this feels alarmist

Let me try to spell out the argument a little more - I think my original post was a little unclear. I don't think the argument actually appeals to the "convergent in... (read more)

Vladimir_Nesov
Maybe. But that depends on what exactly the terminal resource-seeking objectives are; it's not clear that in this story they would go far enough to directly talk of dismantling whole planets. On the other hand, dismantling whole planets is instrumentally useful for running experiments into the details of fundamental physics or building planet-sized computers or weapons against possible aliens, all to ensure that the objective of gathering strawberries on a particular (small, well-defined) farm proceeds without fail.

Good catch, I edited the last points in each part to make the scale of the disaster clearer, and removed the reference to gorillas.

I do think the scale of disaster is smaller (in expectation) in Part 1 than in Part 2, for the reason mentioned here - basically, the systems in Part 1 are somewhat more aligned with human intentions (albeit via poorly specified proxies for them), so there's some chance that they leave humans alone. Whereas Part 2 is a treacherous-turn inner alignment failure, where the systems learned arbitrary objectives and so have no incentive a... (read more)

Vladimir_Nesov
My own guess is that this is not that far-fetched. This is a "generic values hypothesis": that human values are enough of a blank slate thing that the Internet already redundantly imprints everything relevant that humans share. In this case a random AI with values that are vaguely learning-from-Internet inspired is not much less aligned than a random human, and although that's not particularly reassuring (value drift can go far when minds are upgraded without a clear architecture that formulates and preserves values), this is a reason for some nontrivial chance of settling on a humane attitude to humanity, which wouldn't just happen on its own, without cause. This possibility gets more remote if values are engineered de novo and don't start out as channeling a language model.

Without the argument this feels alarmist. Humans can manage their own survival if they are not actively exterminated; it takes a massive disruption as a byproduct of AIs' activities to prevent that. The possibility of such a disruption is grounded in the convergent instrumental value of resource acquisition and the eventual feasibility of megascale engineering, premises that are not necessarily readily apparent.

I sometimes want to point people towards a very short, clear summary of What failure looks like, which doesn't seem to exist, so here's my attempt.

  • Many agentic AI systems gradually increase in intelligence and generality, and are deployed increasingly widely across society to do important tasks (e.g., law enforcement, running companies, manufacturing and logistics).
  • Initially, this world looks great from a human perspective, and most people are much richer than they are today.
  • But things then go badly in one of two ways (or more likely, a combination of
... (read more)
Vladimir_Nesov
This is a bit misleading in that the scale of the disaster is more apparent in the regime that takes place some time after this story, when the AI systems are disassembling the Solar System. At that point, humanity would only remain if it's intentionally maintained, so speaking of "our trajectory" that's not being steered by our will is too optimistic. And since by then there's tech that can reconstruct humanity from data, there is even less point in keeping us online than there currently is for gorillas; it's feasible to just archive and forget.
Sam Clarke

If we don’t have the techniques to reliably align AI, will someone deploy AI anyway? I think it’s more likely the answer is yes.

What level of deployment of unaligned benchmark systems do you expect would make doom plausible? "Someone" suggests maybe you think one deployment event of a sufficiently powerful system could be enough (which would be surprising in slow takeoff worlds). If you do think this, is it something to do with your expectations about discontinuous progress around AGI?

Sam Clarke

A more valid criticism would be that the authors spend most of their time on showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is in practice

Sure, I agree this is a stronger point.

The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.

Not really, unfortunately. In those posts, the authors are focusing on painting a plausible pi... (read more)

Koen.Holtman
I feel that Christiano's post here is pretty good at identifying plausible failure modes inside society that lead to unaligned agents not being corrected. My recollection of that post is partly why I mentioned the posts under that tag.

There is an interesting question of methodology here: if you want to estimate the probability that society will fail in this way in handling the impact of AI, do you send a poll to a bunch of AI technology experts, or should you be polling a bunch of global warming activists or historians of the tobacco industry instead? But I think I am reading in your work that this question is no news to you. Several of the AI alignment organisations you polled have people in them who produced work like this examination of the nuclear arms race. I wonder what happens in your analysis of your polling data if you single out this type of respondent specifically. In my own experience in analysing polling results with this type of response rate, I would be surprised however if you could find a clear signal above the noise floor.

Agree, that is why I am occasionally reading various posts with failure scenarios and polls of experts. To be clear: my personal choice of alignment research subjects is only partially motivated by what I think is the most important work to do, if I want to have the best chance of helping. Another driver is that I want to have some fun with mathematics. I tend to work on problems which lie in the intersection of those two fuzzy sets.
Sam Clarke

I'm broadly sympathetic to your point that there have been an unfortunate number of disagreements about inner alignment terminology, and it has been and remains a source of confusion.

to the extent that Evan has felt a need to write an entire clarification post.

Yeah, and recently there has been even more disagreement/clarification attempts.

I should have specified this on the top level question, but (as mentioned in my own answer) I'm talking about abergal's suggestion of what inner alignment failure should refer to (basically: a model pursuing a differe... (read more)

Koen.Holtman
Meta: I usually read these posts via the alignmentforum.org portal, and this portal filters out certain comments, so I missed your mention of abergal's suggestion, which would have clarified your concerns about inner alignment arguments for me. I have mailed the team that runs the website to ask if they could improve how this filtering works.

Just read the post with the examples you mention, and skimmed the related arxiv paper. I like how the authors develop the metrics of 'objective robustness' vs 'capability robustness' while avoiding the problem of trying to define a single meaning for the term 'inner alignment'. Seems like good progress to me.
Sam Clarke

Thanks for your reply!

depends on what you mean with strongest arguments.

By strongest I definitely mean the second thing (probably I should have clarified here, thanks for picking up on this).

Also, the strongest argument when you address an audience of type A, say policy makers, may not be the strongest argument for an audience of type B, say ML researchers.

Agree, though I expect it's more that the emphasis needs to be different, while the underlying argument is similar (conditional on talking about your second definition of "strongest").

many di

... (read more)
Koen.Holtman
I disagree. In my reading, all of these books offer fairly wide-ranging surveys of alignment failure mechanisms. A more valid criticism would be that the authors spend most of their time on showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is in practice. Once we take it as axiomatic that some people are stupid some of the time, presenting a convincing proof that some AI alignment failure mode is theoretically possible does not require much heavy lifting at all.

The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.

The main problem with distilling this work into, say, a top 3 of most powerful 1-page arguments is that we are not dealing purely with technology-driven failure modes. There is a technical failure mode story which says that it is very difficult to equip a very powerful future AI with an emergency stop button, and that we have not solved that technical problem yet. In fact, this story is a somewhat successful meme in its own right: it appears in all 3 books I mentioned.

That story is not very compelling to me. We have plenty of technical options for building emergency stop buttons; see for example my post here. There have been some arguments that none of the identified technical options for building AI stop buttons will be useful or used, because they will all turn out to be incompatible with yet-undiscovered future powerful AI designs. I feel that these arguments show a theoretical possibility, but I think it is a very low possibility, so in practice these arguments are not very compelling to me.

The more compelling failure mode argument is that people will refuse to use the emergency AI stop button, even though it is available. Many of the posts with the tag above show failure scenarios where the AI fails to be aligned because o
Koen.Holtman
I'll do the easier part of your question first: I have not read all the material about inner alignment that has appeared on this forum, but I do occasionally read up on it. There are some posters on this forum who believe that contemplating a set of problems which are together called 'inner alignment' can work as an intuition pump that would allow us to make needed conceptual breakthroughs. The breakthroughs sought have mostly to do, I believe, with analyzing possibilities for post-training treacherous turns which have so far escaped notice. I am no longer one of the posters who have high hopes that inner alignment will work as a useful intuition pump.

The terminology problem I have with the term 'inner alignment' is that many working on it never make the move of defining it in rigorous mathematics, or with clear toy examples of what are and what are not inner alignment failures. Absent either a mathematical definition or some defining examples, I am not able to judge whether inner alignment is the main alignment problem, or whether it would be a minor one, but still one that is extremely difficult to solve. What does not help here is that there are by now several non-mathematical notions floating around of what an inner alignment failure even is, to the extent that Evan has felt a need to write an entire clarification post. When poster X calls something an example of an inner alignment failure, poster Y might respond and declare that in their view of inner alignment failure, it is not actually an example of an inner alignment failure, or at least not a very good example of one.

If we interpret it as a meme, then the meme of inner alignment has a reproduction strategy where it reproduces by triggering social media discussions about what it means. Inner alignment has become what Minsky called a suitcase word: everybody packs their own meaning into it. This means that for the purpose of distillation, the word is best avoided. If you want to distil the discu
Answer by Sam Clarke

Immersion reading, i.e. reading a book and listening to the audio version at the same time. It makes it easier to read when tired, improves retention, and increases the speed at which I can comfortably read.

Most of all, with a good narrator, it makes reading fiction feel closer to watching a movie in terms of the 'immersiveness' of the experience (while retaining all the ways in which fiction is better than film).

It's also very cheap and easy at the margin, if you're willing to pay for a Kindle and an Audible subscription.

Answer by Sam Clarke

Arguments for outer alignment failure, i.e. that we will plausibly train advanced AI systems using a training objective that doesn't incentivise or produce the behaviour we actually want from the AI system. (Thanks to Richard for spelling out these arguments clearly in AGI safety from first principles.)

  • It's difficult to explicitly write out objective functions which express all our desires about AGI behaviour.
    • There’s no simple metric which we’d like our agents to maximise - rather, desirable AGI behaviour is best formulated in terms of concepts like ob
... (read more)

(Note: this post is an extended version of this post about stories of continuous deception. If you are already familiar with treacherous turn vs. sordid stumble you can skip the first part.)

FYI, broken link in this sentence.

Michaël Trazzi
Thanks, removed the disclaimer
Sam Clarke

I found this post helpful and interesting, and refer to it often! FWIW I think that powerful persuasion tools could have bad effects on the memetic ecosystem even if they don't shift the balance of power to a world with fewer, more powerful ideologies. In particular, the number of ideologies could remain roughly constant, but each could get more 'sticky'. This would make reasonable debate and truth-seeking harder, as well as reducing trusted and credible multipartisan sources. This seems like an existential risk factor, e.g. because it will make coordinati... (read more)

Daniel Kokotajlo
Thanks! The post was successful then. Your point about stickiness is a good one; perhaps I was wrong to emphasize the change in number of ideologies. The "AI takeover without AGI or agency" bit was a mistake in retrospect. I don't remember why I wrote it, but I think it was a reference to this post which argues that what we really care about is AI-PONR, and AI takeover is just a prominent special case. It also might have been due to the fact that a world in which an ideology uses AI tools to cement itself and take over the world can be thought of as a case of AI takeover, since we have AIs bossing everyone around and getting them to do bad things that ultimately lead to x-risk. It's just a weird case in which the AIs aren't agents or general intelligences. :)

only sleep when I'm tired

Sounds cool, I'm tempted to try this out, but I'm wondering how this jibes with the common wisdom that going to bed at the same time every night is important? And "No screens an hour before bed" - how do you know what "an hour before bed" is if you just go to bed when tired?

matto
I don't adhere to these guidelines strictly, which helps when they conflict. For example, if I'm tired before my usual bed time, then the "no screen rule" goes out the window because I won't have trouble sleeping. And when I'm not tired by my usual bed time and haven't been looking at a screen, then I have plenty of paperbacks to keep me company (or exercise, or cooking, etc.) before I eventually get tired and fall asleep. Practically, this means that I don't use screen after 9pm. I usually fall asleep between 9:30pm and 11:00pm, where the median is around 10:15pm or so. I guess this variance comes from different days, days when I do a lot of exercise or little, days with plenty of sun or just a bit, etc.

I feel similarly, and still struggle with turning off my brain. Has anything worked particularly well for you?

I'm curious how you actually use the information from your Oura ring? To help measure the effectiveness of sleep interventions? As one input for deciding how to spend your day? As a motivator to sleep better? Something else?

being trained on "follow instructions"

What does this actually mean, in terms of the details of how you'd train a model to do this?

Richard_Ngo
Take a big language model like GPT-3, and then train it via RL on tasks where it's given a language instruction from a human, and it gets reward if the human thinks it's done the task successfully.
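To make the shape of that training setup concrete, here is a minimal sketch of the loop being described. Everything below (the model call, the human rating step, and the update rule) is a hypothetical stub for illustration, not an actual implementation or Richard's specific proposal:

```python
# Sketch of "train on following instructions": the only training signal is a
# human's judgement of whether each instruction was carried out successfully.
import random

def model_generate(params, instruction):
    # Stub standing in for a pretrained LM producing a response to the instruction.
    return f"response to: {instruction}"

def human_rating(instruction, response):
    # Stub standing in for a human judging whether the task was done successfully.
    return random.choice([0, 1])

def rl_update(params, instruction, response, reward):
    # Stub standing in for a policy-gradient-style update that reinforces rewarded outputs.
    params["reinforced"] += reward
    return params

params = {"reinforced": 0}  # stands in for the LM's weights
instructions = ["summarise this article", "write a polite email declining an offer"]

for step in range(100):
    instruction = random.choice(instructions)
    response = model_generate(params, instruction)
    reward = human_rating(instruction, response)  # reward iff the human approves
    params = rl_update(params, instruction, response, reward)
```

The point is just the structure: instruction in, response out, human approval as the reward that shapes the model.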

Thanks for the reply - a couple of responses:

it doesn't seem useful to get a feeling for "how far off of ideal are we likely to be" when that is composed of: 1. What is the possible range of AI functionality (as constrained by physics)? - ie what can we do?

No, these cases aren't included. The definition is: "an existential catastrophe that could have been avoided had humanity's development, deployment or governance of AI been otherwise". Physics cannot be changed by humanity's development/deployment/governance decisions. (I agree that cases 2 and 3 are... (read more)

Thanks for pointing this out. We did intend for cases like this to be included, but I agree that it's unclear if respondents interpreted it that way. We should have clarified this in the survey instructions.

Is one question combining the risk of "too much" AI use and "too little" AI use?

Yes, it is. Combining these cases seems reasonable to me, though we definitely should have clarified this in the survey instructions. They're both cases where humanity could have avoided an existential catastrophe by making different decisions with respect to AI.

Ericf
But the action needed to avoid/mitigate in those cases is very different, so it doesn't seem useful to get a feeling for "how far off of ideal are we likely to be" when that is composed of:

1. What is the possible range of AI functionality (as constrained by physics)? - ie what can we do?
2. What is the range of desirable outcomes within that range? - ie what should we do?
3. How will politics, incumbent interests, etc. play out? - ie what will we actually do?

Knowing that experts think we have a (say) 10% chance of hitting the ideal window says nothing about what an interested party should do to improve those chances. It could be "attempt to shut down all AI research" or "put more funding into AI research" or "it doesn't matter because the two majority cases are "General AI is impossible - 40%" and "General AI is inevitable and will wreck us - 50%""

Thanks a lot for this post, I found it extremely helpful and expect I will refer to it a lot in thinking through different threat models.

I'd be curious to hear how you think the Production Web stories differ from part 1 of Paul's "What failure looks like".

To me, the underlying threat model seems to be basically the same: we deploy AI systems with objectives that look good in the short-run, but when those systems become equally or more capable than humans, their objectives don't generalise "well" (i.e. in ways desirable by human standards), because they're ... (read more)

I'm a bit confused about the edges of the inadequate equilbrium concept you're interested in.

In particular, do simple cases of negative externalities count? E.g. the econ 101 example of "factory pollutes river" - seems like an instance of (1) and (2) in Eliezer's taxonomy - depending on whether you're thinking of the "decision-maker" as (1) the factory owner (who would lose out personally) or (2) the government (who can't learn the information they need because the pollution is intentionally hidden). But this isn't what I'd typically think of as a bad Nash equilibrium, because (let's suppose) the factory owners wouldn't actually be better off by "cooperating".

Just an outside view that over the last decades, a number of groups who previously had to suppress their identities/were vilified are now more accepted (e.g., LGBTQ+, feminists, vegans), and I expect this trend to continue.

I'm curious if you expect this trend to change, or maybe we're talking about slightly different things here?

ChristianKl
LGBTQ+ people, feminists, and vegans are part of one cluster of values and people, and we have moved to a point where people with that cluster of values have no reason anymore to hide their identity when living in cities. Most of the identities where people currently have a lot to lose when they reveal their identity don't belong to that cluster. To the extent that the trend of that cluster becoming stronger continues, many people for whom it's currently very costly to reveal their identity won't gain anything and might even face a higher cost of revealing their identity.

Generally, the more polarized a society is, the higher the number of people who have something to lose by revealing their identity. I see rising polarization.

I had something like "everybody who has to strongly hide part of their identity when living in cities" in mind

ChristianKl
That suggests that groups that at the moment have no support at all will start to get support. Why do you think so?

Thanks for writing this! Here's another, that I'm posting specifically because it's confusing to me.

Value erosion

Takeoff was slow and lots of actors developed AGI around the same time. Intent alignment turned out relatively easy and so lots of actors with different values had access to AGIs that were trying to help them. Our ability to solve coordination problems remained at ~its current level. Nation states, or something like them, still exist, and there is still lots of economic competition between and within them. Sometimes there is military conflict,... (read more)

Answer by Sam Clarke

Epistemic effort: I thought about this for 20 minutes and dumped my ideas, before reading others' answers

  • The latest language models are assisting or doing a number of tasks across society in rich countries, e.g.
    • Helping lawyers search and summarise cases, suggest inferences, etc., but human lawyers still make the calls at the end of the day
    • Similar for policymaking, consultancy, business strategising etc.
    • Lots of non-truth seeking journalism. All good investigative journalism is still done by humans.
    • Telemarketing and some customer service jobs
  • The latest
... (read more)
ChristianKl
Does that prediction include poor white people, BDSM people, and generally everybody who has to strongly hide part of their identity when living in cities, or only those groups that are compatible with intersectional thinking?

Thanks for this, really interesting!

Meta question: when you wrote this list, what did your thought process/strategies look like, and what do you think are the best ways of getting better at this kind of futurism?

More context:

  • One obvious answer to my second question is to get feedback - but the main bottleneck there is that these things won't happen for many years. Getting feedback from others (hence this post, I presume) is a partial remedy, but isn't clearly that helpful (e.g. if everyone's futurism capabilities are limited in the same ways). Maybe you'
... (read more)
Daniel Kokotajlo
Thanks! Good idea to make your own list before reading the rest of mine--I encourage you to post it as an answer.

My process was: I end up thinking about future technologies a lot, partly for my job and partly just cos it's exciting. Through working at AI Impacts I've developed a healthy respect for trend extrapolation as a method for forecasting tech trends; during the discontinuities project I was surprised by how many supposedly-discontinuous technological developments were in fact bracketed on both sides by somewhat-steady trends in the relevant metric. My faith in trend extrapolation has made successful predictions at least once, when I predicted that engine power-to-weight ratios would form a nice trend over two hundred years and yep. As a result of my faith in trend extrapolation, when I think about future techs, the first thing I do is google around for relevant existing trends to extrapolate. Sometimes this leads to super surprising and super important claims, like the one about energy being 10x cheaper. (IIRC extrapolating the solar energy trend gets us to energy that is 25x cheaper or so, but I was trying to be a bit conservative.)

As for the specific list I came up with: this list was constructed from memory, when I was having trouble focusing on my actual work one night. The things on the list were things I had previously concluded were probable, sometimes on the basis of trend extrapolation and sometimes not. I wouldn't be surprised if I'm just wrong about various of these things. I don't consider myself an expert. Part of why I made the post is to get pushback, so that I could refine my view of the future.

I don't know what your bottleneck is, I'm afraid. I haven't even seen your work, for all I know it's better than mine. I agree feedback by reality would be great but alas takes a long time to arrive. While we wait, getting feedback from each other is good.
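As a concrete illustration of the trend-extrapolation step Daniel describes (not his actual numbers; the data points below are made-up placeholders), the core move is just fitting a log-linear trend to a metric and projecting it forward:

```python
# Minimal sketch of trend extrapolation: fit log(cost) = a + b * year to a
# declining cost series and project it a couple of decades out.
import math

# (year, cost) pairs: illustrative placeholder values only, not real data
data = [(2000, 100.0), (2005, 60.0), (2010, 35.0), (2015, 20.0), (2020, 12.0)]

n = len(data)
xs = [year for year, _ in data]
ys = [math.log(cost) for _, cost in data]
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

def projected_cost(year):
    # Extrapolate the fitted exponential trend to a future year.
    return math.exp(a + b * year)

for year in (2030, 2040):
    print(year, round(projected_cost(year), 2))
```

The interesting judgement calls are all outside the code: which metric to extrapolate, over what time window, and how much to trust the trend continuing.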

Will MacAskill calls this the "actual alignment problem"

Wei Dai has written a lot about related concerns in posts like The Argument from Philosophical Difficulty

The AI systems in part I of the story are NOT "narrow" or "non-agentic"

  • There's no difference between the level of "narrowness" or "agency" of the AI systems between parts I and II of the story.
    • Many people (including Richard Ngo and myself) seem to have interpreted part I as arguing that there could be an AI takeover by AI systems that are non-agentic and/or narrow (i.e. are not agentic AGI). But this is not at all what Paul intended to argue.
    • Put another way, both parts I and II are instances of the "second species" concern/gorilla problem: that AI sys
... (read more)

Relatedly: if we manage to solve intent alignment (including making it competitive) but still have an existential catastrophe, what went wrong?
