Ways to buy time

Orpheus16; Olive Branch; Thomas Larsen

On the margin, we think more alignment researchers should work on “buying time” interventions instead of technical alignment research (or whatever else they were doing).

But what does “buying time” actually look like? In this post, we list some interventions that have the potential to buy time (some of which also have other benefits, like increasing coordination, accelerating community growth, and reducing the likelihood that labs deploy dangerous systems).

If you are interested in any of these, please reach out to us. Note also that Thomas has a list of specific technical projects (with details about how they would be implemented), and he is looking for collaborators.

Summary

Ideas to buy time:

Direct outreach to AGI researchers

Develop new resources that make AI x-risk arguments & problems more concrete

Demonstrate concerning capabilities & alignment failures

Support safety and governance teams at major AI labs

Summary table:

Intervention	Why it buys time	Other possible benefits	Technical or non-technical
Direct outreach (through written resources and 1-1 conversations)	Some ML/AGI researchers haven’t heard the core arguments around AI x-risk. Engaging with written resources will cause some of them to be more concerned about AI x-risk. Some ML/AGI researchers have heard the core arguments. 1-1 conversations can allow safety researchers to better understand & address their cruxes (and vice-versa).	Generates new critiques of existing alignment ideas, arguments, and proposals. More people do alignment research. Increased trust and coordination between labs and non-lab safety researchers.	Technical + Nontechnical (non-technical people can organize and support these efforts, with technical people being the ones actually having conversations, giving presentations, choosing outreach resources, etc.)
New resources	Many ML/AGI researchers have heard the theoretical ideas but want to see resources that are more concrete & grounded in empirical research.	Formalizes problems; makes it easier for ML experts & engineers to contribute to alignment research.	Technical
Concerning demonstrations of alignment failure	Many ML/AGI researchers would take AI x-risk more seriously if there were clear demonstrations of alignment failures.	If an AI lab is about to deploy a system that could destroy the world, a compelling demo might convince them not to deploy. More people do alignment research	Technical
Break proposals	Many ML/AGI researchers believe that one (or more) existing alignment proposals will work	Helps alignment researchers understand cruxes & prioritize between research ideas/agendas	Technical
Coordination events	Similar to direct 1-1 outreach. Also, could lead to collaborations between safety-conscious individuals at different organizations.	Increased trust and coordination between labs and non-lab safety researchers. New critiques of existing alignment ideas, arguments, and proposals	Technical + nontechnical (Technical people should be at these events, though non-technical people could organize them)
Support lab teams	Safety teams and governance teams at top AI labs can promote a safety culture at these labs and push for policies that slow down AGI research, reduce race dynamics, etc.	Increased trust and coordination between labs and non-lab safety researchers.	Non-technical
Lab safety standards	There seem to be some policies that, if implemented in a reasonable way, could extend timelines & reduce race dynamics	Increased trust and coordination between labs and non-lab safety researchers. Some standards could reduce the likelihood that labs deploy dangerous systems (e.g., a policy that a system must first pass an interpretability check or deception check).	Non-technical

Disclaimers

Feel free to skip this section if you’re interested in learning more about our proposed “buying time” ideas.

Disclaimer #1: Some of these interventions assume that timelines are largely a function of the culture at major AI labs. More specifically, we expect that timelines are largely a function of (a) the extent to which leaders and researchers at AI labs are concerned about AI x-risk, (b) the extent to which they have concrete interventions they can implement to reduce AI x-risk, and (c) how costly it is to implement those interventions.

Disclaimer #2a: We don’t spend much time arguing which of these interventions are most impactful. This is partly because many of these need to be executed by people with specific skill sets, so personal fit considerations will be especially relevant.

Nonetheless, we currently think that the following three areas are the most important:

Direct outreach to AGI researchers (more here)
Demonstrate concerning behavior & alignment failures in current (and future) models (more here)
Organize coordination events (more here)

Disclaimer #2b: The most important interventions do not necessarily need the most people. As an example, 1-2 (highly competent) teams organizing coordination events is likely sufficient to saturate the space, whereas we could see 5+ teams working on demonstrating alignment failures. Additionally, projects with minimal downside risks are best-suited to absorb the most people.

We currently think that the following three projects could absorb lots of talented people:

Demonstrate concerning behavior & alignment failures in current (and future) models (more here)
Develop new resources that make AI x-risk arguments & problems more concrete (more here)
Break and redteam alignment proposals (more here)

Disclaimer #3: Many of these interventions have serious downside risks. We also think many of them are difficult, and they only have a shot at working if they are executed extremely well by people who have (a) strong models of downside risks, (b) the ability to notice when their work might be accelerating AGI timelines, and (c) the ability to notice when their work is reducing their ability to think well & see the world clearly. See also Habryka’s comment here (and note that although we’re still excited about more people considering this kind of work, we agree with many of the concerns he lists, and we think people should understand these concerns deeply before performing this kind of work. Please feel free to reach out to us before doing anything risky).

Disclaimer #4: Many of these interventions have large benefits other than buying time. For the most part, we think that the main benefit from most of these interventions is their effect on buying time, but we won’t be presenting those arguments here.

Disclaimer #5: We have several “background assumptions” that inform our thinking. Some examples include (a) somewhat short AI timelines (AGI likely developed in 5-15 years), (b) high alignment difficulty (alignment by default is unlikely, and current approaches seem unlikely to work), and (c) there has been some work done in each of these areas, but there are opportunities to do things that are much more targeted & ambitious than previous/existing projects.

Disclaimer #6: This was written before the FTX crisis. We think the points still stand.

Ideas to buy time

Direct outreach to AGI researchers

The case for AGI risk is relatively nuanced and non-obvious. There is value in raising awareness of the basic arguments for AI x-risk and of why alignment might fail by default. This makes it easier for people to quickly understand the concerns about AI x-risk, which means that more people will buy in to alignment being hard.

Examples of work that we’d be excited to see disseminated more widely:

Superintelligence
Many MIRI analyses, including this talk and the 2022 MIRI Discussion
AGI safety fundamentals
The case for taking AI seriously as a threat to humanity
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Resources that make AI x-risk arguments more concrete (see here)

Note that these resources, while a lot better than nothing, are still pretty far from ideal. In particular, we wish that there were an accessible version of AGI Ruin.

Additionally, written resources are often not sufficient to address the cruxes and worldviews of people who are performing AGI research. Individualized conversations between AGI researchers and knowledgeable alignment researchers could help address cruxes around safety.

It is important for these conversations to be conducted by people who are deeply familiar with AI alignment arguments and also have a strong understanding of the ML/AGI community. However, we think that non-technical people could play an important role in organizing these efforts (e.g., by setting up conversations between safety researchers and AGI researchers, setting up talks for technical people to give at major AI labs, and doing much of the logistics/ops/non-technical work required to coordinate an ambitious series of outreach activities).

Disclaimer: there is also a lot of downside risk here. Doing this type of outreach without adequate preparation or respect may cause the community to lose the respect of AGI researchers or make people confused about AI x-risk concerns. We encourage people interested in this work to reach out to us. We also suggest this post by Vael Gates and this post by Marius Hobbhahn. Note also that these posts focus on outreach to ML academics, whereas we’re most excited about well-conducted outreach efforts to AGI researchers at leading AI labs.

Develop new resources that make AI x-risk arguments & problems more concrete

Many of the existing AI x-risk resources focus on theoretical/conceptual arguments. Additionally, many of them were written before we knew much about deep learning or large language models.

Some people find these philosophical arguments compelling, but others demand evidence that is more concrete, more grounded in empirical research, and more rooted in the “engineering mindset.”

We believe there is a clear gap in the AI x-risk space right now: many theoretical and conceptual arguments can be discussed in the context of present-day AI systems, concretizing and strengthening the case for AI x-risk.

By creating better AGI risk resources, we can (a) find new alignment researchers and (b) get people who are building AGI to be more cautious and more safety-focused.

Examples of this work include:

Goal Misgeneralization in Deep Reinforcement Learning and Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals found concrete examples of inner misalignment in reinforcement learning settings.
Optimal Policies Tend to Seek Power formally demonstrated how power is an instrumentally convergent goal.
Scaling laws for Reward Model Overoptimization explored how putting too much optimization pressure on imperfect reward proxies resulted in failure to generalize, as is predicted by Goodhart's law.
Specification gaming: the flip side of AI ingenuity described how often the reward functions given to RL agents can be 'gamed' — the RL agent can take actions that achieve high reward but do not achieve the intended outcome of the designer.
Why AI alignment could be hard with modern deep learning describes how modern deep learning methods are likely to favor unaligned AI
The alignment problem from a deep learning perspective grounds the alignment problem in a deep learning perspective
X-risk analysis for AI research presents a concrete checklist that ML researchers can use to evaluate their research from an X-risk perspective
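
To make the specification gaming item above concrete, here is a minimal toy sketch. Everything in it (the track, the checkpoint, the reward values, the names `run` and `best`) is a hypothetical setup invented for illustration and is not taken from any of the papers listed. The designer wants the agent to reach the finish line, but the reward also pays out each time a checkpoint tile is touched, so an exhaustive search over action sequences finds that looping on the checkpoint scores higher than finishing.

```python
from itertools import product

# Hypothetical toy environment (an assumption for illustration only).
TRACK_LENGTH = 5   # the finish line is at position 5
CHECKPOINT = 2     # a tile that pays reward every time it is touched
HORIZON = 8        # number of steps in one episode

def run(actions):
    """Simulate one episode and return (proxy_reward, reached_finish)."""
    pos, reward = 0, 0
    for a in actions:                      # a is -1 (left) or +1 (right)
        pos = max(0, min(TRACK_LENGTH, pos + a))
        if pos == CHECKPOINT:
            reward += 1                    # proxy: +1 per checkpoint touch
        if pos == TRACK_LENGTH:
            reward += 1                    # small bonus for finishing
            return reward, True            # the episode ends at the finish line
    return reward, False

# Exhaustively search all 2**HORIZON action sequences for the highest proxy reward.
best = max(product((-1, 1), repeat=HORIZON), key=lambda acts: run(acts)[0])
best_reward, finished = run(best)

# The behavior the designer intended: always move right.
intended_reward, intended_finished = run((1,) * HORIZON)

print("reward-maximizing policy:", best_reward, "finished:", finished)
print("intended policy:", intended_reward, "finished:", intended_finished)
```

In this toy setup the reward-maximizing sequence never reaches the finish line, while the intended behavior finishes but earns less reward. The gap between the two is the kind of mismatch that the specification gaming write-up catalogs in real RL systems.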

Demonstrate concerning capabilities & alignment failures

Another way to make AI x-risk ideas more concrete is to actually observe problems in existing models. As an example, we might detect power-seeking behavior or deceptive tendencies in large language models. We’re sympathetic to the idea that some alignment failures may not occur until we get to AGI (especially if we expect a sharp increase in capabilities). But it seems plausible that at least some alignment failures could be identified with sub-AGI models.

Examples of this kind of work:

The Evaluations Project (led by Beth Barnes)
- Beth’s team is trying to develop evaluations that help us understand when AI models might be dangerous.
- We consider this to be a sufficiently ambitious intervention. If a reasonable evaluation was developed and Magma^[1] decided to implement it, it could improve Magma’s ability to identify dangerous models. As a toy example, you could imagine Magma is about to deploy a model. Before they do so, they contact ARC, ARC implements an eval tool, and the eval tool reveals the model (a) has the power to change the world, (b) actively deceives humans in certain contexts, or (c) learns incorrect goals when implemented out-of-distribution. This eval could then lead Magma to (a) delay deployment and (b) work with ARC [and others] to figure out how to improve the model.
Encultured (led by Andrew Critch)
- Encultured is trying to develop a video game that could serve as “an amazing sandbox for play-testing AI alignment & cooperation.”
- From a “buying time” perspective, the theory of change seems very similar to that of the evals project. Encultured’s video game essentially serves as an eval. Magma could deploy its AI system in the video game, allowing them to detect undesirable behavior (e.g., power-seeking, deception), and then causing them to delay the deployment of their powerful model as they try to improve the safety/alignment of the model.

Ideas for new projects in this vein include:

A thorough analysis of whether deception achieves higher reward from RLHF and is therefore selected for.
Empirical demonstrations of various instrumentally convergent goals like self-preservation, power-seeking, and self-improvement. This could be especially interesting in a chain of thought language model that is operating at a high capabilities level and for which you can see the model's reasoning for selecting instrumentally convergent actions. (Tamera Lanham’s externalized reasoning oversight agenda is an example of good work in this direction)
An empirical analysis of Goodharting. Find a domain for which humans can't give an exact reward signal, and then demonstrate all the difficulties that arise when working with an imperfect reward signal. This is similar to specification gaming and would be built on top of this.
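
As a rough illustration of what the Goodharting analysis above would measure, here is a minimal simulation sketch. It assumes a deliberately simple model that is not drawn from any cited paper: each candidate has a hidden true score, the proxy is the true score plus independent Gaussian noise, and we pick the best of `n` candidates by proxy. The function name `best_of_n` and all parameters are hypothetical.

```python
import random

def best_of_n(n, trials=2000, noise=1.0, seed=0):
    """Pick the best of n candidates by proxy score, averaged over many trials.

    Returns (mean proxy score, mean true score) of the selected candidates.
    Assumption: proxy = true + Gaussian noise, with true ~ N(0, 1).
    """
    rng = random.Random(seed)
    proxy_total = true_total = 0.0
    for _ in range(trials):
        candidates = []
        for _ in range(n):
            true_score = rng.gauss(0.0, 1.0)
            proxy_score = true_score + rng.gauss(0.0, noise)
            candidates.append((proxy_score, true_score))
        # Optimization pressure: select the candidate the proxy likes most.
        proxy_score, true_score = max(candidates, key=lambda c: c[0])
        proxy_total += proxy_score
        true_total += true_score
    return proxy_total / trials, true_total / trials

# Gap between what the proxy reports and what we actually got, per n.
gaps = {}
for n in (1, 4, 16, 64):
    proxy_mean, true_mean = best_of_n(n)
    gaps[n] = proxy_mean - true_mean
    print(f"n={n:3d}  proxy={proxy_mean:.2f}  true={true_mean:.2f}  gap={gaps[n]:.2f}")
```

Under these assumptions the selected candidates do get better as `n` grows, but the proxy overstates their quality by a widening margin. A real study would replace the Gaussian noise with an actual learned reward signal, where the true score can stop improving or even decline under enough optimization pressure.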

Break and red team alignment proposals (especially those that will likely be used by major AI labs)

Many AGI researchers already know about the alignment problem, but they don’t expect it to be as difficult as we do. One reason for this is they often believe that current alignment proposals will be sufficient.

We think it’s useful for people to focus on (a) finding problems with existing alignment proposals and (b) making stronger arguments about already-known problems. (Often, critiques are already being made informally in office conversations or LessWrong posts, but they aren’t reaching key stakeholders at labs).

Examples of previous work:

Vivek Hebbar’s SERI MATS application questions break down how people can approach (a) finding problems with existing alignment proposals and (b) making stronger arguments about already-known problems
Nate Soares’s critiques of Eliciting Latent Knowledge, Shard Theory, and various other alignment proposals.
Critics of CIRL argue that it fails as an alignment solution due to the problem of fully updated deference. In a nutshell, the idea of CIRL is to induce corrigibility by maintaining uncertainty over the human's values, and the failure mode is that once the model learns a sufficiently narrow distribution over the human's values, it optimizes that in an unbounded fashion (see also the ACX post and Ryan Carey’s paper on this topic).
Critiques of RLHF argue that it selects for policies which involve deceiving the human giving feedback.

We would be especially excited for more breaking & redteaming projects that engage with proposals that AGI researchers think will work (e.g., recursive reward modeling (RRM) and RLHF). Ideally, these projects would present ideas that are legible to researchers^[2] at AI labs & the ML community, and involve back-and-forth discussions between AGI researchers and safety teams at AI labs.

Organize coordination events

We’d like to see more events that get alignment researchers and AGI researchers together in the same room discussing AGI, core alignment difficulties, and alignment proposals.

On a small scale, such events could include safety talks at leading AGI labs. As an example of a more ambitious event, Anthropic’s interpretability retreat involved many alignment researchers and AGI researchers discussing interpretability proposals, their limitations, and some future directions.

Thinking even more ambitiously, there could be fellowships & residencies that bring together AGI researchers and the broader alignment community. Imagine a hypothetical training program run via a collaboration between an AI lab and an AI alignment organization. This program could help employees learn about the latest developments in large language models, learn about the importance of & latest developments in AI alignment research, and lead to collaborations/friendships between incoming AGI researchers and incoming alignment researchers.

One could also imagine programs for senior researchers that involve collaborations between experienced members of AI labs and experienced members of the AI alignment community.

Examples of previous work:

AI safety conferences in Puerto Rico by FLI
Singularity Summits by MIRI
Interpretability retreat by Anthropic
Talks at OpenAI, DeepMind, etc. by various alignment researchers and thinkers

Note that there are downside risks of such programs, especially insofar as they could lead to new capabilities insights that accelerate AGI timelines. We think researchers should be extremely cautious about advancing capabilities and should generally keep such research private.

Support safety and governance teams at major AI labs

It will largely be the responsibility of safety and governance teams to push labs to not publish papers that differentially-advance-capabilities, maintain strong information security, invest in alignment research, use alignment strategies, and not deploy potentially dangerous models. As such, it’s really important that members of the AI alignment community support safety-conscious individuals at the labs (and consider joining the lab safety/governance teams).

Examples of teams that people could support:

OpenAI alignment or governance teams
DeepMind safety or governance teams
Anthropic alignment, interpretability, or governance teams
Teams at Google Brain, Meta AI research, Stability AI, etc.

Note that people should be aware of the risk that alignment-concerned people joining labs can lead to differential increases in capabilities, as reported here. People considering supporting lab teams are welcome to reach out to us to discuss this tradeoff.

Develop and promote reasonable safety standards for AI labs

Organizations that are concerned about alignment / x-risk can set robustly good standards that could propagate to other labs. Examples of policies that could work well for this:

Infosecurity policies. For example, Conjecture's Internal Infohazard policy is explicitly aimed at promoting cross-lab coordination and trust, and one of the hopes is that other organizations will publicly commit to similar policies.
Publication policies that reduce the spread of capabilities insights. For example, Anthropic has committed to not publish capabilities beyond the state of the art, and one of their hopes in this is to set an example that other labs could follow.
Cooperation agreements. For example, OpenAI has a merge clause in their charter: if someone else is close to building AGI, they will stop and assist with that project instead of racing themselves. If more organizations have this kind of agreement, this can counteract race dynamics and prevent a 'race to the bottom' in terms of alignment effort.

Other ideas

Benchmarks: Safety benchmarks can be a good way of incentivizing work on safety (and disincentivizing work on capabilities advances until certain safety benchmarks are met). For example, progress on OOD robustness does not generally come with capabilities externalities. (Note also Safe Bench is a contest by the Center for AI Safety trying to promote benchmark development.)
Competitions: Alignment competitions to get ML researchers thinking about safety problems. Contests that engage ML researchers could help them understand the difficulty of certain alignment subproblems (and could potentially help generate solutions to these problems). (Note that we’re about to launch AI Alignment Awards. We're offering up to $100,000 for people who make progress on Goal Misgeneralization or Corrigibility.)
Overviews of open problems: In order to redirect work towards safety it is useful to have regular papers outlining open problems that are useful for people to work on. Past examples of this include Concrete problems in AI safety and Unsolved Problems in ML safety. Although it’s possible that progress on these problems directly contributes to alignment research, we think that the primary benefit of this kind of work will involve getting the mainstream ML research community more concerned about safety and AI x-risk, which ultimately influences major AI labs & slows down timelines.
X-risk analyses: We are excited about x-risk analyses described in this paper. We encourage more researchers (and AGI labs) to think explicitly about how their work could contribute to increasing or decreasing AGI x-risk.
Discussions about alignment between AGI leaders and members of the alignment community: We encourage more dialogues, discussions, and debates between AGI researchers/leaders and members of the alignment community.
Create new and better resources making the case for AGI x-risk: Current resources are decent, but we think that all of the existing ones have drawbacks. One of our favorite intro resources is Superintelligence, but it is from 2014, doesn't have much about deep learning, and has nothing about transformers/LLMs/scaling.

We are grateful to Ashwin Acharya, Andrea Miotti, and Jakub Kraus for feedback on this post.

^[1]
We use "Magma" to refer to a (fictional) leading AI lab that is concerned about safety (see more here).
^[2]
Note that there are often trade-offs between legibility and other desiderata. We agree with concerns that Habryka brings up in this comment, and we think anyone performing ML outreach should be aware of these failure modes:
“I think one of the primary effects of trying to do more outreach to ML-researchers has been a lot of people distorting the arguments in AI Alignment into a format that can somehow fit into ML papers and the existing ontology of ML researchers. I think this has somewhat reliably produced terrible papers with terrible pedagogy and has then caused many people to become actively confused about what actually makes AIs safe (with people walking away thinking that AI Alignment people want to teach AIs how to reproduce moral philosophy, or that OpenAIs large language model have been successfully "aligned", or that we just need to throw some RLHF at the problem and the AI will learn our values fine). I am worried about seeing more of this, and I think this will overall make our job of actually getting humanity to sensibly relate to AI harder, not easier.”

On the margin, we think more alignment researchers should work on “buying time” interventions instead of technical alignment research (or whatever else they were doing).

I'm quite a bit more pessimistic about having lots of people doing these approaches than you seem to be. In the abstract my concerns are somewhat similar to Habryka's, but I think I can make them a lot more concrete given this post. The TL;DR is: (1) for half the things, I think they're net negative if done poorly, and I think that's probably the case on the current margin, and (2) for the other half of things, I think they're great, and the way you accomplish them is by joining safety / governance teams at AI labs, which are already doing them and are in a much better position to do them than anyone else.

(When talking about industry labs here I'm thinking more about Anthropic and DeepMind -- I know less about OpenAI, though I'd bet it applies to them too.)

Direct outreach to AGI researchers

Currently, I'd estimate there are ~50 people in the world who could make a case for working on AI alignment to me that I'd think wasn't clearly flawed. (I actually ran this experiment with ~20 people recently, 1 person succeeded. EDIT: I looked back and explicitly counted -- I ran it with at least 19 people, and 2 succeeded: one gave an argument for "AI risk is non-trivially likely", another gave an argument for "this is a speculative worry but worth investigating" which I wasn't previously counting but does meet my criterion above.) Those 50 people tend to be busy and in any case your post doesn't seem to be directed at them. (Also, if we require people to write down an argument in advance that they defend, rather than changing it somewhat based on pushback from me, my estimate drops to, idk, 20 people.)

Now, even arguments that are clearly flawed to me could convince AGI researchers that AI risk is important. I tend to think that the sign of this effect is pretty unclear. On the one hand I don't expect these researchers to do anything useful, partly because in my experience "person says AI safety is good" doesn't translate into "person does things", and partly because incorrect arguments lead to incorrect beliefs which lead to useless solutions. On the other hand maybe we're just hoping for a general ethos of "AI risk is real" that causes political pressure to slow down AI.

But it really doesn't seem great that my case for wide-scale outreach being good is "maybe if we create a mass delusion of incorrect beliefs that implies that AGI is risky, then we'll slow down, and the extra years of time will help". So overall my guess is that this is net negative.

(On my beliefs, which I acknowledge not everyone shares, expecting something better than "mass delusion of incorrect beliefs that implies that AGI is risky" if you do wide-scale outreach now is assuming your way out of reality.)

(Fwiw I do expect that there will be a major shift towards AI risk being taken more seriously, as AGI becomes more visceral to people, as outreach efforts continue, and as it becomes more of a culturally expected belief. I often view my job as trying to inject some good beliefs about AI risk among the oncoming deluge of beliefs about AI risk.)

Develop new resources that make AI x-risk arguments & problems more concrete

Seems good if done by one of the 20 people who can make a good argument without pushback from me. If you instead want this to be done on a wide scale I think you have basically the same considerations as above.

Demonstrate concerning capabilities & alignment failures

Seems probably net negative when done at a wide scale, as we'll see demonstrations of "alignment failures" that aren't actually related to the way I expect alignment failures to go, and then the most viral one (which won't be the most accurate one) will be the one that dominates discourse.

Break and red team alignment proposals (especially those that will likely be used by major AI labs)

For the examples of work that you cite, my actual prediction is that they have had ~no effect on the broader ML community, but if they did have an effect, I'd predict that the dominant one is "wow these alignment folks have so much disagreement and say pretty random stuff, they're not worth paying attention to". So overall my take is that this is net-negative from the "buying time" perspective (though I think it is worth doing for other reasons).

Organize coordination events

I'm not seeing why any of the suggestions here are better than the existing strategy of "create alignment labs at industry orgs which do this sort of coordination".

(But I do like the general goal! If you're interested in doing this, consider trying to get hired at an industry alignment lab. It's way easier to do this when you don't have to navigate all of the confidentiality protocols because you're a part of the company.)

I guess one benefit is that you can have some coordination between top alignment people who aren't at industry labs? I'm much more keen on having those people just doing good alignment work, and coordinating with the industry alignment labs. This seems way more efficient.

Support safety and governance teams at major AI labs

Strongly in favor of the goal, but how do you do this other than by joining the teams?

Note that people should be aware of the risk that alignment-concerned people joining labs can lead to differential increases in capabilities, as reported here.

The linked article is about capabilities roles in labs, not safety / governance teams in labs. I'd guess that most people including many of those 11 anonymous experts would be pretty positive on having people join safety / governance teams in labs.

Develop and promote reasonable safety standards for AI labs

Sounds great! Seems like you should do it by joining the relevant teams at the AI labs, or at least having a lot of communication with them. (I think it's way way harder to do outside of the labs because you are way less informed about what the constraints are and what standards would be feasible to coordinate on.)

You could do abstract research on safety standards with the hope that this turns into something useful a few years down the line. I'm somewhat pessimistic on this but much less confident in my pessimism here.

Currently, I'd estimate there are ~50 people in the world who could make a case for working on AI alignment to me that I'd think wasn't clearly flawed. (I actually ran this experiment with ~20 people recently, 1 person succeeded.)

I wonder if this is because people haven't optimised for being able to make the case. You don't really need to be able to make a comprehensive case for AI risk to do productive research on AI risk. For example, I can chip away at the technical issues without fully understanding the governance issues, as long as I roughly understand something like "coordination is hard, and thus finding technical solutions seems good".

Put differently: The fact that there are (in your estimation) few people who can make the case well doesn't mean that it's very hard to make the case well. E.g., for me personally, I think I could not make a case for AI risk right now that would convince you. But I think I could relatively easily learn to do so (in maybe one to three months???)

(I've edited the quote to say it's 2/19.)

I agree you don't need to have a comprehensive case for risk to do productive research on it, and overall I am glad that people do in fact work on relevant stuff without getting bogged down in ensuring they can justify every last detail.

I agree it's possible that people could learn to make a good case. I don't expect it, because I don't expect most people to try to learn to make a case that would convince me. You in particular might do so, but I've heard of a lot of "outreach to ML researchers" proposals that did not seem likely to do this.

Why do you think that the number of people who could make a convincing case to you is so low? Where do they normally mess up?

Not Rohin (who might disagree with me on what constitutes a "good" case) but I've also tried to do a similar experiment.

Besides the "why does RLHF not work" question, which is pretty tricky, another classic theme is people misciting the ML literature, or confidently citing papers that are outliers in the literature as if they were settled science. If you're going to back up your claims with citations, it's very important to get them right!

I'd encourage you to write up a blog post on common mistakes if you can find the time.

Why do you think that the number of people who could make a convincing case to you is so low?

Because I ran the experiment and very few people passed. (The extrapolation from that to an estimate for the world is guesswork.)

Where do they normally mess up?

There's a lot of different arguments people give, that I dislike for different reasons, but one somewhat common theme was that their argument was not robust to "it seems like InstructGPT is basically doing what its users want when it is capable of it, why not expect scaled up InstructGPT to just continue doing what its users want?"

(And when I explicitly said something like that, they didn't have a great response.)

Yeah... I suppose you could go through Evan Hubinger's arguments in "How likely is deceptive alignment?", but I suppose you'd probably have some further pushback which would be hard to answer.

(On my beliefs, which I acknowledge not everyone shares, expecting something better than "mass delusion of incorrect beliefs that implies that AGI is risky" if you do wide-scale outreach now is assuming your way out of reality.)

I'm from the future, January 2024, and you get some Bayes Points for this!

The "educated savvy left-leaning online person" consensus (as far as I can gather) is something like: "AI art is bad, the real danger is capitalism, and the extinction danger is some kind of fake regulatory-capture hype techbro thing which (if we even bother to look at the LW/EA spaces at all) is adjacent to racists and cryptobros".

Still seems too early to tell whether or not people are getting lots of false beliefs that are still pushing them towards believing-AGI-is-an-X-risk, especially since that case seems to be made (in the largest platform) indirectly in congressional hearings that nobody outside tech/politics actually watches.

But it really doesn't seem great that my case for wide-scale outreach being good is "maybe if we create a mass delusion of incorrect beliefs that implies that AGI is risky, then we'll slow down, and the extra years of time will help". So overall my guess is that this is net negative.

To devil's steelman some of this: I think there's still an angle that few have tried in a really public way, namely ignorance and asymmetry. (There is definitely a better term or two for what I'm about to describe, but I forgot it. Probably from Taleb or something.)

A high percentage of voting-eligible people in the US... don't vote. An even higher percentage vote in only the presidential elections, or only some presidential elections. I'd bet a lot of money that most of these people aren't working under a Caplan-style non-voting logic, but instead under something like "I'm too busy" or "it doesn't matter to me / either way / from just my vote".

Many of these people, being politically disengaged, would not be well-informed about political issues (or even have strong and/or coherent values related to those issues). What I want to see is an empirical study that asks these people "are you aware of this?" and "does that awareness, in turn, factor into you not-voting?".

I think there's a world, which we might live in, where lots of non-voters believe something akin to "Why should I vote, if I'm clueless about it? Let the others handle this lmao, just like how the ~~nice~~ smart people somewhere make my bills come in."

In a relevant sense, I think there's an epistemically-legitimate and persuasive way to communicate "AGI labs are trying to build something smarter than humans, and you don't have to be an expert (or have much of a gears-level view of what's going on) to think this is scary. If our smartest experts still disagree on this, and the mistake-asymmetry is 'unnecessary slowdown VS human extinction', then it's perfectly fine to say 'shut it down until [someone/some group] figures out what's going on'".

To be clear, there's still a ton of ways to get this wrong, and those who think otherwise are deluding themselves out of reality. I'm claiming that real-human-doable advocacy can get this right, and it's been mostly left untried.

EXTRA RISK NOTE: Most persuasion, including digital, is one-to-many "broadcast"-style; "going viral" usually just means "some broadcast happened that nobody heard of", like an algorithm suggesting a video to a lot of people at once. Given this, plus anchoring bias, you should expect and be very paranoid about the "first thing people hear = sets the conversation" thing. (Think of how many people's opinions are copypasted from the first ~~classy video essay~~ mass-market John Oliver video they saw about the subject, or the first Fox News commentary on it.)

Not only does the case for X-risk need to be made first, but it needs to be right (even in a restricted way like my above suggestion) the first time. Actually, that's another reason why my restricted-version suggestion should be prioritized, since it's more-explicitly robust to small issues.

(If somebody does this in real life, you need to clearly end on something like "Even if a minor detail like [name a specific X] or [name a specific Y] is wrong, it doesn't change the underlying danger, because the labs are still working towards Earth's next intelligent species, and there's nothing remotely strong about the 'safety' currently in place.")

I agree with you that one of the best ways to "buy time" is to join the alignment or governance teams at major AI labs (in part b/c confidentiality agreements). I also agree that most things are easy to implement poorly by default. However, I think 1) comparative advantage is real; some people are a lot better at writing/teaching/exposition relative to research and vice versa and 2) there are other ways to instantiate some of the proposals that aren't literally just "Join OpenAI/Deepmind/Anthropic/etc":

Direct outreach to AGI researchers

While I agree that most people are pretty bad at making the alignment case, I do think vibes matter! In particular, I think you're underestimating the value of a 'general ethos of "AI risk is real"'. (Though I still agree that the average direct outreach attempt will probably be slightly negative.)

Demonstrate concerning capabilities & alignment failures

Presumably, the way you'd do this is to work with one of the scaling labs?

Break and red team alignment proposals (especially those that will likely be used by major AI labs)

I think the reason most of these examples have failed is some combination of: 1) literally not addressing what people in labs are doing, 2) not being phrased in the right way (eg with ML/deep learning terminology), and 3) being published in venues that aren't really visible to most people in labs?

I think 1) is the most concerning one - I've heard many people make informal arguments in favor of/against Jan's RRM + Alignment research proposal, but I don't think a serious critical analysis of that approach has been written up anywhere. Instead, a lot of effort was spent yelling at stuff like CHAI/CIRL. My guess is many people (~5-10 people I know) can write a good steelman/detailed explainer of Jan's stuff and also critique it.

Organize coordination events
[...]
I guess one benefit is that you can have some coordination between top alignment people who aren't at industry labs? I'm much more keen on having those people just doing good alignment work, and coordinating with the industry alignment labs. This seems way more efficient.

You can also coordinate top alignment people not at labs <> people at labs, etc. But I do agree that doing good alignment work is important!

However, I think 1) comparative advantage is real; some people are a lot better at writing/teaching/exposition relative to research and vice versa

Sure. Of the small number of people who can do any of these well, they should split them up based on comparative advantage. This seems orthogonal to my main claim (roughly: if you do this at a large scale then it starts becoming net negative due to lower quality).

I do think vibes matter! In particular, I think you're underestimating the value of a 'general ethos of "AI risk is real"'.

I very much agree that vibes matter! Do you have in mind some benefit other than the one I mentioned above:

But it really doesn't seem great that my case for wide-scale outreach being good is "maybe if we create a mass delusion of incorrect beliefs that implies that AGI is risky, then we'll slow down, and the extra years of time will help".

(More broadly it increases willingness to pay an alignment tax, with "slowing down" as one example.)

Importantly, vibes are not uniformly beneficial. If the vibe is "AI systems aren't robust and so we can't deploy them in high-stakes situations" then maybe everyone coordinates not to let the AI control the nukes and ignores the people who are saying that we also need to worry about the generalist foundation models because it's fine, those models aren't deployed in high-stakes situations.

Presumably, the way you'd do this is to work with one of the scaling labs?

Sure, that could work. (Again my main claim is "you can't usefully throw hundreds of people at this" and not "this can never be done well".)

I think the reason most of these examples have failed is some combination of: 1) literally not addressing what people in labs are doing, 2) not being phrased in the right way (eg with ML/deep learning terminology), and 3) being published in venues that aren't really visible to most people in labs?
I think 1) is the most concerning one - I've heard many people make informal arguments in favor of/against Jan's RRM + Alignment research proposal, but I don't think a serious critical analysis of that approach has been written up anywhere. Instead, a lot of effort was spent yelling at stuff like CHAI/CIRL. My guess is many people (~5-10 people I know) can write a good steelman/detailed explainer of Jan's stuff and also critique it.

I'm confused. Are you trying to convince Jan or someone else? How does it buy time?

(I interpreted the OP as saying that you convince AGI researchers who are not (currently) working on safety. I think a good steelman + critique of RRM wouldn't have much effect on that population, though I think it's pretty plausible I'm wrong about that because the situation at OpenAI is different from DeepMind.)

You can also coordinate top alignment people not at labs <> people at labs, etc.

As a person at a lab I'm currently voting for less coordination of this sort, not more, but I agree that this is also a thing you can do. (As with everything else, my main claim is that this isn't a scalable intervention.)

This seems orthogonal to my main claim (roughly: if you do this at a large scale then it starts becoming net negative due to lower quality).

Fair. I think I failed to address this point entirely.

I do think there's a nonzero number of people who would not be that good at novel alignment research and would still be good at the tasks mentioned here, but I agree that there isn't a scalable intervention here, or at least not more so than standard AI alignment research (especially when compared to some approaches like the brute-force mechanistic interp many people are doing).

(I interpreted the OP as saying that you convince AGI researchers who are not (currently) working on safety. I think a good steelman + critique of RRM wouldn't have much effect on that population, though I think it's pretty plausible I'm wrong about that because the situation at OpenAI is different from DeepMind.)

Yeah, I also messed up here -- I think this would plausibly have little effect on that population. I do think that a good answer to "why does RLHF not work" would help a nonzero amount, though.

As a person at a lab I'm currently voting for less coordination of this sort, not more

Agree that it's not scalable, but could you share why you'd vote for less?

Agree that it's not scalable, but could you share why you'd vote for less?

Idk, it's hard to explain -- it's the usual thing where there's a gazillion things to do that all seem important and you have to prioritize anyway. (I'm just worried about the opportunity cost, not some other issue.)

I think the biggest part of coordination between non-lab alignment people and lab alignment people is making sure that people know about each other's research; it mostly feels like the simple method of "share info through personal connections + reading posts and papers" is working pretty well right now. Maybe I'm missing some way in which this could be way better, idk.

My guess is most of the value in coordination work here is either in making posts/papers easier to write or ship, or in discovering new good researchers?

Those weren't what I thought of when I read "coordination" but I agree those things sound good :)

Another good example would be better communication tech (e.g. the sort of thing that LessWrong / Alignment Forum aims for, although not those in particular because most lab people don't use it very much).

I feel like most of the barrier in practice for people not "coordinating" in the relevant ways is people not knowing what other people are doing. And a big reason for this is that writing is really hard to write, especially if you have high standards and don't want to ship.

And yeah, better communication tech in general would be good, but I'm not sure how to start on that (while it's pretty obvious what a few candidate steps toward making posts/papers easier to write/ship would look like?)

I agree it's not clear what to do on better communication tech.

I feel like most of the barrier in practice for people not "coordinating" in the relevant ways is people not knowing what other people are doing. And a big reason for this is that writing is really hard to write, especially if you have high standards and don't want to ship.

Idk, a few years ago I would have agreed with you, but now my impression is that people mostly don't read things and instead talk to each other for this purpose. I wouldn't really expect that to change with more writing, unless the writing is a lot better?

(I do think that e.g. mech interp researchers read each other's mech interp papers, though my impression from the outside is that they also often hear about each other's results well before they're published. Similarly for scalable oversight.)

Things that aren't on your list but maybe should be:

Understand the personal (e.g. depression, broken connections), local-social (e.g. peer pressure), and global-societal (e.g. whatever metaphysical wars people think they're in) forces that are pushing and will push people to work on dangerous stuff. Since the arguments for taking AI risk very seriously are pretty solid, there's maybe some reason other than logic that people aren't, from their own inside selfish view, worried. One can say "well it's peer pressure and monetary / status incentives" but that's vague, doesn't say how to change it, and doesn't explain why those incentives point this way and not that. (Well except the money one.) With that understanding more avenues might become apparent. (In this vein, making the broader culture more healthy is good.)
Push a distinction between software 2.0 and AGI research. Things that might actually make money using AI should be separated out from AGI so that investors can learn to distinguish them. (Maybe infeasible but worth a try.) E.g. AlphaFold and Tesla self-driving are very much, IIUC, software 2.0 and not AGI (in contrast to e.g. efforts to throw an RL agent into a huge range of tasks and cranking up the compute).
Make genuine friends with AGI capabilities researchers; be truly trustworthy, besides all the AI X-risk stuff. Then they might want to listen to the reasons you're worried.
Become extremely rich (ethically!) and then buy the research organizations, and then, you know, you'll have their attention.
Seduce AGI capabilities researchers and then pillow talk about waves of ~~pleasure~~ nanobots and how unsexy it is to kill everyone.
Make memes that are sufficiently dank to be popular to normie AGI capabilities researchers, and explain arguments.

Disclaim "AI ethics" stuff that's more about "you can't have fun with this image generator" and less about "this might kill everyone". Or at least distinguish them as two totally different things. Bad to conflate all anti-AI stuff together, so that from the perspective of capabilities researchers, it's just Luddism.

As a meta point, I really appreciate you spelling out your proposals at this level of detail. Many proposals avoid critique by being vague---this was one of my complaints with your original "more people should buy time" post---and I think it's very admirable that you're making your ideas so readily available for critique.

I am an AI/AGI alignment researcher. I do not feel very optimistic about the effectiveness of your proposed interventions, mainly because I do not buy your underlying risk model and solution model. Overall I am getting a vibe that you believe AGI will be invented soon, which is a valid assumption for planning specific actions, but then things get more weird in your solution model. To give one specific example of this:

It will largely be the responsibility of safety and governance teams to push labs to not publish papers that differentially-advance-capabilities, maintain strong information security, invest in alignment research, use alignment strategies, and not deploy potentially dangerous models.

There is an underlying assumption in the above reasoning, and in many other of your slowdown proposals, that the AI labs themselves will have significant influence on how their AI innovations will be used by downstream actors. You are assuming that they can prevent downstream actors from creating misaligned AI/AGI by not publishing certain research and not releasing foundation models with certain capabilities.

This underlying assumption, one where the labs or individual ML researchers have significant choke-point power that can lower x-risk, is entirely wrong. To unpack this statement a bit more: current advanced AI, including foundation models, is a dual-use technology that can be configured to do good as well as evil, that has the potential to be deployed by actors who will be very careful about it, and other actors who will be very careless. Also, we have seen that if one lab withholds its latest model, another party will quickly open-source an equally good model. Maybe real AGI, if it ever gets invented, will be a technology with an entirely different nature, but I am not going to bet on it.

More generally: I am seeing you make a mistake that I have seen a whole crowd of influencers and community builders make. You are following the crowd, and the crowd focuses too much on the idea that they need to convince 'researchers in top AI labs' and other 'ML researchers' in 'top conferences' about certain dangers:

The crowd focuses on influencing AI research labs and ML researchers without considering if these parties have the technical or organisational/political power to control how downstream users will use AI or future AGI. In general, they do not have this power to control. If you are really worried about an AI lab inventing an AGI soon (personally I am not, but for the sake of the argument), you will need to focus on its management, not on its researchers.
The crowd focuses on influencing ML researchers without considering if these parties even have the technical skills or attitude needed to be good technical alignment researchers. Often, they do not. (I expand on this topic here. The TL;DR: treating the management of the impact of advances in ML on society as an ML research problem makes about as much sense as our forefathers treating the management of the impact of the steam engine on society as a steam engine engineering problem. For the long version, see the paper linked to the post.)

Overall, when it comes to putting more manpower into outreach, I feel that safety awareness outreach to downstream users, and those who might regulate their actions via laws, moral persuasion, or product release decisions, is far more important.

You recommend Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover. I wonder if you would also include Why Would AI "Aim" to Defeat Humanity on this list? I know it came out after this post.

On the margin, we think more alignment researchers should work on “buying time” interventions instead of technical alignment research (or whatever else they were doing).

(When talking about industry labs here I'm thinking more about Anthropic and DeepMind -- I know less about OpenAI, though I'd bet it applies to them too.)

Direct outreach to AGI researchers

Develop new resources that make AI x-risk arguments & problems more concrete

Demonstrate concerning capabilities & alignment failures

Break and red team alignment proposals (especially those that will likely be used by major AI labs)

Organize coordination events

I'm not seeing why any of the suggestions here are better than the existing strategy of "create alignment labs at industry orgs which do this sort of coordination".

Support safety and governance teams at major AI labs

Strongly in favor of the goal, but how do you do this other than by joining the teams?

Note that people should be aware of the risk that alignment-concerned people joining labs can lead to differential increases in capabilities, as reported here.

Develop and promote reasonable safety standards for AI labs

