TLDR: For now, AI systems are failing in obvious and manageable ways. Fixing them will push the failure modes beyond our ability to understand and anticipate, let alone fix. The AI safety community is also doing a huge economic service to capability developers. Our belief that our minds can "fix" a super-intelligence - especially bit by bit - needs to be re-thought.

I've wanted to write this post for a long time, and now seems like a good time. The case is simple; I hope it takes you about a minute to read.

  1. AI safety research is still solving easy problems. We are patching up the most obvious (to us) problems. As time goes on, we will no longer be able to play this existential-risk game of chess with AI systems. I've argued this a lot (ICML 2024 spotlight paper; also www.agencyfoundations.ai). It seems others have had this thought.
  2. Capability development is getting AI safety research for free - likely millions to tens of millions of dollars' worth. Think of all the "hackathons" and "mini" prizes to patch something up or to propose a new way for society to digest/adjust to some new normal (and, increasingly, the incentivizing of existing academic labs).
  3. AI safety research is speeding up capabilities. I hope this is somewhat obvious to most.

I write this now because, in my view, we are about 5-7 years away from massive human biometric and neural datasets entering AI training. These will likely generate amazing breakthroughs in long-term planning and in emotional and social understanding of the human world. They will also most likely increase x-risk radically.

Stopping AI safety research, or taking it in-house with security guarantees etc., would slow down capabilities somewhat - and may expose capability developers more directly to public opinion while the harmful outcomes are still manageable.


This seems more an argument against evals, interpretability, trojans, jailbreak protection, adversarial robustness, control, etc right? Other (less funded & staffed) approaches don’t have the problems you mention.

catubc

Thanks Garrett. There is obviously nuance that a 1min post can't get at. I am just hoping for at least some discussion to be had on this topic. There seems to be little to none now.

MIRI has stopped all funding of safety research (to focus on advocacy), explaining that the research it has been funding (which does not have the problem that it helps the AI project more than it helps the AI safety project) cannot bear fruit quickly enough to materially affect our chances of survival.

I don’t see how that’s relevant to my comment.

MIRI has plenty of stored money that they could use to continue to fund technical safety research, but MIRI leadership assesses that it is not worth funding (even though MIRI was the first funder to fund AI safety research). MIRI leadership has enough experience and a good enough track record that the aforementioned assessment should have some bearing on any conversation about "other (less funded & staffed) approaches" to AI safety.

Do you see the relevance now?

Yes, I do. I agree with Eliezer and Nate that the work MIRI was previously funding likely won't yield many useful results, but I don't think it's correct to generalize to all agent foundations everywhere. E.g. I'm bullish on natural abstractions, singular learning theory, comp mech, incomplete preferences, etc. - none of which (except natural abstractions) was on Eliezer's or Nate's radar, to my knowledge.

In the future I'd also recommend actually arguing for the position you're trying to take, instead of citing an org you trust. You should probably trust Eliezer, Nate, and MIRI far less than you do if you're unable to argue for their position without reference to the org itself. In this circumstance I can see where MIRI is coming from, so it's no problem on my end. But if I didn't know where MIRI was coming from, I would be pretty annoyed. I also expect my comment here won't change your mind much, since you probably have a different idea of where MIRI is coming from, and your crux may not be any object-level point, but rather the meta-level point of how good Eliezer and Nate's ability to judge research directions is, which determines how much you defer to them and MIRI.

[comment deleted]

I feel like you're saying "safety research" when the examples of what corporations centrally want is "reliable control over their slaves"... that is to say, they want "alignment" and "corrigibility" research.

This has been my central beef for a long time.

Eliezer's old Friendliness proposals were at least AIMED at the right thing (a morally praiseworthy vision of humanistic flourishing) and CEV is more explicitly trying for something like this, again, in a way that mostly just tweaks the specification (because Eliezer stopped believing that his earliest plans would "do what they said on the tin they were aimed at" and started over). 

If an academic is working on AI, and they aren't working on Friendliness, and aren't working on CEV, and it isn't "alignment to benevolence" or making "corrigibly seeking humanistic flourishing for all"... I don't understand why it deserves applause lights.

(EDITED TO ADD: exploring the links more, I see "benevolent game theory, algorithmic foundations of human rights" as topics you raise. This stuff seems good! Maybe this is the stuff you're trying to sneak into getting more eyeballs via some rhetorical strategy that makes sense in your target audience?)

"The alignment problem" (without extra qualifications) is an academic framing that could easily fit in a grant proposal by an academic researcher to get funding from a slave company to make better slaves. "Alignment IS capabilities research".

Similarly, there's a very easy way to be "safe" from skynet: don't build skynet!

I wouldn't call a gymnastics curriculum that focused on doing flips while you pick up pennies in front of a bulldozer "learning to be safe". Similarly, here, it seems like there's some insane culture somewhere that you're speaking to whose words are just systematically confused (or intentionally confusing).

Can you explain why you're even bothering to use the euphemism of "Safety" Research? How does it ever get off the ground of "the words being used denote what naive people would think those words mean" in any way that ever gets past "research on how to put an end to all AI capabilities research in general, by all state actors, and all corporations, and everyone (until such time as non-safety research, aimed at actually good outcomes (instead of just marginally less bad outcomes from current AI), has clearly succeeded as a more important, better, and more funding-worthy target)"? What does "Safety Research" even mean if it isn't inclusive of safety from the largest potential risks?

I think this is a good steelman of the original post. I find it more compelling.

Your "easy way to be safe," just not building AGI is commonly considered near-impossible. Can you point me to plans or arguments for how we can convince people not to build AGI? The arguments I'm aware of, that alignment is very very hard, they'll have the moral status of slaves, or that they're likely to lock in a bad future, are not complete enough to be compelling even to me, let alone technologists or politicians with their own agenda and limited attention for the arguments.

I suspect we'd be wiser not to build AGI, and definitely wiser to go slower, but I see no route to convincing enough of the world to do that.

What does "Safety Research" even mean if it isn't inclusive of safety from the largest potential risks?

I very much agree. I don't call my work safety research, to differentiate it from all of the stuff that may-or-may-not actually help with AGI alignment. To be fair, steering and interpretability work might contribute to building safe AGI; there's just not a very clear plan for how it would be applied to LLM-based AGI rather than tool LLMs - so much of it probably contributes approximately nothing (depending on how you factor in the capabilities applications) to mitigating the largest risk: misaligned AGI.

that alignment is very very hard

There are also grounded arguments for why alignment is unworkable - i.e. that AGI could not control its effects in line with keeping humans safe.

I’ve written about this, and Anders Sandberg is currently working on mathematically formalising an elegant model of AGI uncontainability.

What's a good overview of those grounded arguments? I looked at your writings and it wasn't clear where to start.

Seth. I just spoke about this work at ICML yesterday. Some other similar works:

Eliezer's work from way back in 2004: https://intelligence.org/files/CEV.pdf. I haven't read it in full, but it's about AIs that interact with human volition - which is what I'm also worried about.

Christiano's: https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like. This is a lot about slow takeoffs and AIs that slowly become unstoppable or unchangeable because they become part of our economic world.

My paper on arxiv (https://arxiv.org/abs/2305.19223) is a bit of a long read (GPT-it). But it tries to show where some of the weak points in human volition and intention generation are - and why we (i.e. "most developers and humanity in general") still think of human reasoning in a mind-body dualistic framework: i.e. that there's a core to human thought, goal selection and decision-making that can never be corrupted or manipulated. We've already discovered loads of failure modes - and we weren't even faced with omnipotent-like opponents (https://www.sog.unc.edu/sites/www.sog.unc.edu/files/course_materials/Cognitive%20Biases%20Codex.pdf). The other main point my work makes is that when you apply enough pressure on an aligned AI/AGI to find an optimal solution or "intent" you have for a problem that is too hard to solve, the solution it will eventually find is to change the "intent" of the human.

Thank you!

The link to your paper is broken. I've read the Christiano piece. And some/most of the CEV paper, I think.

Any working intent alignment solution needs to prevent changing the intent of the human on purpose. That is a solvable problem with an AGI that understands the concept.

Sorry, fixed broken link now. 

The problem with "understanding the concept of intent" is that intent and goal formation are some of the most complex notions in the universe, involving genetics, development, psychology, culture and everything in between. We have been arguing about what intent - and correlates like "well-being" - mean for the entire history of our civilization. It looks like we have a good set of no-nos (e.g. read the UN Declaration on Human Rights) - but in terms of positive descriptions of good long-term outcomes it gets fuzzy. There we have less guidance, though I guess trans- and post-humanism seems to be a desirable goal to many.

I intended to refer to understanding the concept of manipulation adequately to avoid it if the AGI "wanted" to.

As for understanding the concept of intent, I agree that "true" intent is very difficult to understand, particularly if it's projected far into the future. That's a huge problem for approaches like CEV. The virtue of the approach I'm suggesting is that it entirely bypasses that complexity (while introducing new problems). Instead of inferring "true" intent, the AGI just "wants" to do what the human principal tells it to do. The human gets to decide what their intent is. The machine just has to understand what the human meant by what they said - and the human can clarify that in a conversation. I'm thinking of this as do what I mean and check (DWIMAC) alignment. More on this in Instruction-following AGI is easier and more likely than value aligned AGI.
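To make that loop concrete, here is a minimal toy sketch of one way the "check" could go. Every function and threshold here is an invented placeholder for illustration, not anything from the linked post:

```python
# Toy sketch of a "do what I mean and check" (DWIMAC) style loop.
# All names and numbers below are made-up placeholders, not anyone's actual design.

def interpret(instruction: str, clarification: str | None = None) -> dict:
    """Stand-in for the model's best-guess reading of the instruction,
    plus a confidence score (hard-coded here purely for the example)."""
    reading = instruction if clarification is None else f"{instruction} ({clarification})"
    return {"reading": reading, "confidence": 0.6 if clarification is None else 0.95}

def dwimac_step(instruction: str, threshold: float = 0.9) -> str:
    guess = interpret(instruction)
    # Instead of silently optimizing over an inferred "true" intent,
    # the system surfaces ambiguity and lets the human principal resolve it.
    while guess["confidence"] < threshold:
        clarification = input(f"Did you mean '{guess['reading']}'? Please clarify: ")
        guess = interpret(instruction, clarification)
    return f"Acting on: {guess['reading']}"

print(dwimac_step("tidy up my files"))
```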

I'll read your article.

What's a good overview of those grounded arguments?

 

Thanks, appreciating your question. The best overview I managed to write was the control problem post.  Still takes quite some reading through to put the different parts of the argument together though.

The burden of proof is on you to show that current safety research is not incremental progress toward safety research that matters for superintelligent AI. Generally, the way people solve hard problems is to solve related easy problems first, and this is true even if the technology in question gets much more powerful. Imagine if we had to land rockets on barges before anyone had invented PID controllers and observed their failure modes.
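For concreteness, the "easy related problem" in that analogy is the kind of thing you can write down in a dozen lines - a generic textbook discrete PID step, purely illustrative and nothing specific to AI:

```python
# Generic discrete-time PID controller - the "easy related problem" in the
# rocket analogy. Gains and plant are arbitrary illustrative choices.

class PID:
    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint: float, measurement: float, dt: float) -> float:
        error = setpoint - measurement
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: drive a simple integrator plant toward a setpoint of 1.0.
pid = PID(kp=2.0, ki=0.5, kd=0.1)
state, dt = 0.0, 0.1
for _ in range(200):
    state += pid.step(1.0, state, dt) * dt
print(round(state, 3))  # settles near the 1.0 setpoint
```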

Also, the directions suggested in section 5 of the paper you linked seem to fall well within the bounds of normal AI safety research.

Edit: Two people reacted to taboo "burden of proof". I mean that the claim is contrary to reference classes I can think of, and to argue for it there needs to be some argument for why it is true in this case. It is also possible that the safety effect is significant but outweighed by the speedup effect, but that should also be clearly stated if it is what OP believes.


If we are going to use the term burden of proof, I would suggest the burden of proof is on the people who claim that they could make potentially very dangerous systems safe using any (combination of) techniques.

Let’s also stay mindful that these claims are not being made in a vacuum. Incremental progress on making these models usable for users (which is what a lot of applied ML safety and alignment research comes down to) does enable AI corporations to keep scaling.

I think your first sentence is actually compatible with my view. If GPT-7 is very dangerous and OpenAI claims they can use some specific set of safety techniques to make it safe, I agree that the burden of proof is on them. But I also think the history of technology should make you expect on priors that the kind of safety research intended to solve actual safety problems (rather than safetywash) is net positive.

I don't think it's worth getting into why, but briefly it seems like the problems studied by many researchers are easier versions of problems that would make a big dent in alignment. For example, Evan wants to ultimately get to level 7 interpretability, which is just a harder version of levels 1-5.

I have not really thought about the other side - making models more usable enables more scaling (as distinct from the argument that understanding gained from interpretability is useful for capabilities) - but it mostly seems confined to specific work done by labs that is pointed at usability rather than safety. Maybe you could randomly pick two MATS writeups from 2024 and argue that the usability impact makes them net harmful.

Appreciating your thoughtful comment.  

It's hard to pin down the ambiguity around how much alignment "techniques" make models more "usable", and how much that in turn enables more "scaling". This and the safety-washing concern get us into messy considerations. Though I generally agree that participants of MATS or AISC programs can cause much less harm through either than researchers working directly on aligning, e.g., OpenAI's models for release.

Our crux, though, is about the extent of progress that can be made on engineering fully autonomous machinery to control* its own effects in line with continued human safety. I agree with you that such a system can be engineered to start off performing more** of the tasks we want it to complete (i.e. progress on alignment is possible). At the same time, there are fundamental limits to controllability (i.e. progress on alignment is capped).

This is where I think we need more discussion:

  • Is the extent of AGI control that is possible at least greater than the extent of control needed (to prevent eventual convergence on causing human extinction)?



* I use the term "control" in the established control theory sense, consistent with Yampolskiy's definition. Just to avoid confusing people, as the term gets used in more specialised ways in the alignment community (eg. in conversations about the shut-down problem or control agenda).
** This is a rough way of stating it. It's also about the machinery performing fewer of the tasks we wouldn't want the system to complete. And the relevant measure is not so much the number of preferred tasks performed as the preferred consequences. Finally, this raises a question about who the 'we' is who can express preferences that the system is to act in line with, and whether coherent alignment with different persons' preferences, expressed from within different perceived contexts, is even a sound concept.

Generally the way that people solve hard problems is to solve related easy problems first, and this is true even if the technology in question gets much more powerful. Imagine if we had to land rockets on barges before anyone had invented PID controllers and observed their failure modes.

This raises questions about the reference class.

  • Does controlling a self-learning (and evolving) system fit in the same reference class as the problems that engineers have “generally” been able to solve (such as moving rockets)?
  • Is the notion of “powerful” technologies in the sense of eg. rockets being powerful the same notion as “powerful” in the sense of fully autonomous learning being powerful?
  • Based on this, can we rely on the reference class of past “powerful” technologies as an indicator of being able to make incremental progress on making and keeping “AGI” safe?

I think that, logically, safety research needs to do more than incrementally progress toward alignment (the implied claim in your appeal to burden of proof). It needs to speed alignment toward the finish line (working alignment for the AGI we actually build) more than it speeds capabilities toward the finish line of building takeover-capable AGI.

I agree with you that in general, research tends to make progress toward its stated goals.

But isn't it a little odd that nobody I know of has a specific story for how we get from tuning and interpretability of LLMs to functionally safe AGI and ASI? I do have such a story, but the tuning and interpretability play only a minor role despite making up the vast bulk of "safety research".

Research usually just goes in a general direction, and gets unexpected benefits as well as eventually accomplishing some of its stated goals. But having a more specific roadmap seems wise when some of those "unexpected benefits" might kill everyone.

That's not to say I think we should shut down safety research; I just think we should have a bit more of a plan for how it accomplishes the stated goals. I'm afraid we've gotten a bit distracted from AGI x-risk by making LLMs safe - when nobody ever thought LLMs by themselves are likely to be very dangerous.

Generally the way that people solve hard problems is to solve related easy problems first, and this is true even if the technology in question gets much more powerful.

Sure, but if you want to do this kind of research, you should do it in such a way that it does not end up making the situation worse by helping "the AI project" (the irresponsible AI labs burning the time we have left before AI kills us all). That basically means keeping your research results secret from the AI project, and merely refraining from publishing your results is insufficient IMHO because employees in the private sector are free to leave your safety lab and go work for an irresponsible lab. It would be quite helpful here if an organization doing the kind of safety research you want had the same level of control over its employees that secret military projects currently have: namely, the ability to credibly threaten its employees with decades in jail if they bring the secrets they learned in its employ to organizations that should not have those secrets.

The way things are now, the main effect of the AI safety project is to give unintentional help to the AI project IMHO.

This conflates research that is well enough aimed to prevent the end of everything good with the common safety research that is not well aimed and mostly consists of picking easy, vaguely-useful-sounding problems; yup, agreed that most of that research is just bad. It builds capability that could, in principle, be used to ensure everything good survives, but that is by no means the default, and nobody should assume publishing their research is automatically a good idea. It very well might be! But if your plan is "advance a specific capability, which is relevant to ensuring good outcomes", consider the possibility that it's at least worth not publishing.

Not doing the research entirely is a somewhat different matter, but also one to consider.

OK, but what is your plan for a positive Singularity? Just putting AGI/ASI off by, say, one year doesn't necessarily give a better outcome at all.

Perhaps we should focus on alignment problems that only appear for more powerful systems, as a form of differential technological development. Those problems are harder (will require more thought to solve), and are less economically useful to solve in the near-term.

How do you practically do that? We don't know what those problems are, and that seems to assume our present progress, e.g. in mechanistic interpretability, doesn't help at all. Surely such work requires the existence of more powerful systems than exist today?

Can you say more about what types of AI safety research you are referring to? Interpretability, evals, and steering for deep nets, I assume, but not work that's attempting to look forward and apply to AGI and ASI?

AI safety research is speeding up capabilities. I hope this is somewhat obvious to most.

This contradicts the Bitter Lesson, though. Current AI safety research doesn't contribute to increased scaling, either through hardware advances or through algorithmic increases in efficiency. To the extent that it increases the usability of AI for mundane tasks, current safety research does so in a way that doesn't involve making models larger. Fears of capabilities externalities from alignment research are unfounded as long as the scaling hypothesis continues to hold.

Doesn't the whole concept of takeoff contradict the Bitter Lesson, according to some uses of it? That is, our present hardware could be much more capable if we had the right software.

Scaling matters, but it's not all that matters.

For example, RLHF

I think this is an important discussion to have but I suspect this post might not convince people who don't already share similar beliefs.

1. I think the title is going to throw people off. 

I think what you're actually saying is "stop the current strain of research focused on improving and understanding contemporary systems, which has become synonymous with the term AI safety", but many readers might interpret this as if you're saying "stop research that is aimed at reducing existential risks from AI". It might be best to reword it as "stopping prosaic AI safety research".

In fairness, the first, narrower definition of AI Safety certainly describes a majority of work under the banner of AI Safety. It certainly seems to be where most of the funding is going and describes the work done at industrial labs. It is certainly what educational resources (like the AI Safety Fundamentals course) focus on. 
 

2. I've had a limited number of experiences informally having discussions with researchers on similar ideas (not necessarily arguing for stopping AI safety research entirely though). My experience is that people either agree immediately or do not really appreciate the significance of concerns about AI safety research largely being on the wrong track. Convincing people in the second category seems to be rather difficult.

To summarize what I'm trying to convey:
I think this is a crucial discussion to have and it would be beneficial to the community to write this up into a longer post if you have the time. 
 

Convincing people in the second category seems to be rather difficult.

I expect that it will prove much easier to convince people before they invest thousands of hours in preparing themselves for, and accumulating work experience in, AI safety. And the way it is now, most young people contemplating a career in AI aren't aware that many observers believe there is no AI-safety research program whose expected helpfulness exceeds its expected harmfulness (or, more precisely, if there is one, we cannot pick that program out from the crowd).

O O

AI control is useful to corporations even if it doesn't result in more capabilities. This is why so much money is invested in it. Customers want predictable and reliable AI. There is a great post here about AIs aligning to Do What I Want and Double-Checking in the short term. There's your motive.

Also, in a world where we stop safety research, it's not obvious to me why capabilities research would be stopped or even slowed down. I can imagine the models being slightly less economically valuable, but not much less capable. If anything, without reliability, devs might be pushed to extract value out of these models by making them more capable.

Fixing them will push the failure modes beyond our ability to understand and anticipate, let alone fix.

So that's why this point isn't very obvious to me. It seems like we can just have both failures we can understand and failures we can't. They aren't mutually exclusive.[1]

  1. ^

    Also if we can't understand why something is bad, even given a long amount of time, is it really bad?

  1. AI safety research is still solving easy problems. [...]
  2. Capability development is getting AI safety research for free. [...]
  3. AI safety research is speeding up capabilities. [...]

Even if (2) and (3) are true and (1) is mostly true (e.g. most safety research is worthless), I still think it can easily be worthwhile to indiscriminately increase the supply of safety research[1].

The core thing is a quantitative argument: there are far more people working on capabilities than on x-safety, and if no one works on safety, no safety work will happen at all.

Copying a version of this argument from a prior comment I made:

There currently seems to be >10x as many people directly trying to build AGI/improve capabilities as trying to improve safety.

Suppose that the safety people have as good ideas and research ability as the capabilities people. (As a simplifying assumption.)

Then, if all the safety people switched to working full time on maximally advancing capabilities, this would only advance capabilities by less than 10%.

If, on the other hand, they stopped publicly publishing safety work and this resulted in a 50% slowdown, all safety work would slow down by 50%.

Naively, it seems very hard for publishing less to make sense if the number of safety researchers is much smaller than the number of capabilities researchers and safety researchers aren't much better at capabilities than capabilities researchers.


  1. Of course, there might be better things to do than indiscriminately increase the supply. E.g., maybe it is better to try to steer the direction of the field. ↩︎
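Putting rough numbers on the quantitative argument above (the head-counts, the 10:1 ratio, and the 50% publication slowdown are all assumed inputs from the argument, not estimates of the real figures):

```python
# Back-of-the-envelope version of the argument above. All inputs are the
# argument's assumptions, not data.

capability_researchers = 1000      # assumed
safety_researchers = 100           # assumed 10:1 ratio

# Worst case: every safety researcher's published work helps capabilities as
# much as a full-time capabilities researcher would.
max_capabilities_speedup = safety_researchers / capability_researchers  # 0.10

# Assumed cost of not publishing: safety output halves.
safety_slowdown_if_closed = 0.50

print(f"Max capabilities speedup from open safety work: {max_capabilities_speedup:.0%}")
print(f"Safety slowdown from going closed:              {safety_slowdown_if_closed:.0%}")
```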

there are far more people working on safety than capabilities

If only...

(Oops, fixed.)

I agree with the premises (except "this is somewhat obvious to most" 🤷).

On the other hand, stopping AI safety research sounds like a proposal to go from option 1 to option 2:

  1. many people develop capabilities, some of them care about safety
  2. many people develop capabilities, none of them care about safety

Could you expand upon your points in the second-to-last paragraph? I feel there are a lot of interesting thoughts leading to these conclusions, but it's not immediately clear to me what they are.