Note: I no longer endorse this post as strongly as I did when publishing it. I now agree with Neel Nanda's criticism and endorse this position. I still think the analysis in the post could be useful, but I wanted to include this note at the beginning of the post in case someone else in my position happens to find it.

Introduction

When crafting ideas related to alignment, it is natural to want to share them, get feedback, run experiments, and iterate in the hope of improving them, but rushing to do so can run contrary to the security mindset. The Waluigi Effect, for example, ended with a rather grim conclusion:

“If this Semiotic–Simulation Theory is correct, then RLHF is an irreparably inadequate solution to the AI alignment problem, and RLHF is probably increasing the likelihood of a misalignment catastrophe. Moreover, this Semiotic–Simulation Theory has increased my credence in the absurd science-fiction tropes that the AI Alignment community has tended to reject, and thereby increased my credence in s-risks.”

Could the same apply to other prosaic alignment techniques? What if they do end up scaling to superintelligence?

It's easy to justify publishing to yourself in spite of this. Reassuring self-talk always sounds reasonable at the time, but it isn't always robust upon reflection. While developing new ideas, one may tell oneself, “This is really interesting! I think this could have an impact on alignment!”, and then, when justifying publishing them, “It's probably not that good an idea anyway; just push through and wait until someone points out the obvious flaw in your plan.” These two justifications are completely, and detrimentally, inconsistent.

Arguments 1 and 3

Many have shared their musings about this before. Andrew Saur does so in his post “The Case Against AI Alignment”, Andrea Miotti explains here why they believe RLHF-esque research is a net-negative for existential safety (see Paul Christiano’s response), and Christiano provides an alternative perspective in his “Thoughts on the Impact of RLHF Research”. Generally, these arguments seem to fall into three different categories (or their inversions):

  1. This research will further capabilities more than it will alignment.
  2. This alignment technique, if implemented, could actually elevate s/x-risks.
  3. This alignment technique, if implemented, will increase the commercializability of AI, feeding into the capabilities hype cycle and indirectly contributing to capabilities more than it will to alignment.

To clarify, I am speaking here in terms of impact, not necessarily in terms of some static progress metric (e.g. one year of alignment progress is not necessarily equivalent to one year of capabilities progress in terms of impact).

In an ideal world, we could trivially measure potential capabilities/alignment advances and make simple comparisons. Sadly, this is not that world, and realistically most decisions are going to be intuition-derived. Worse yet, argument 2 is even murkier than 1 and 3. Responding to argument 2 is quite literally trying to prove that a solution whose implications you don't know won't produce an outcome we can hardly begin to imagine, through means we don't understand.

In reference to 1 and 3 (which seem to me addressable by similar solutions), Christiano has proposed a simple way to model this dilemma:

  • There are A times as many researchers who work on existential safety as there are total researchers in AI. I think a very conservative guess is something like A = 1% or 10% depending on how broadly you define it; a more sophisticated version of this argument would focus on more narrow types of safety work.
  • If those researchers don't worry about capabilities externalities when they choose safety projects, they accelerate capabilities by about B times as much as if they had focused on capabilities directly. I think a plausible guess for this is like B = 10%.
  • The badness of accelerating capabilities by 1 month is C times larger than the goodness of accelerating safety research by 1 month. I think a reasonable conservative guess for this today is like 10. It depends on how much other "good stuff" you think is happening in the world that is similarly important to safety progress on reducing the risk from AI---are there 2 other categories, or 10 other categories, or 100 other categories? This number will go up as the broader world starts responding and adapting to AI more; I think in the past it was much less than 10 because there just wasn't that much useful preparation happening.

Adapting this model to work for single publications instead of the entire field:

  • There will be A times as many researchers working on utilizing the contents of your publication for improving existential safety as there are researchers using it to enhance AI capabilities
  • B remains the same but applied specifically to the contents of your publication
  • C remains the same but applied specifically to the contents of your publication

This is going to look very different every time it’s performed (e.g. an agent foundations publication is likely to have a larger A value than some agenda that involves turning an LLM into an agent). Reframing as ratios:

  • (Number of researchers applying contents to alignment) / (Number of researchers applying contents to capabilities)
  • (Capabilities progress derived from your publication if dedicated to alignment) / (Capabilities progress derived from your publication if dedicated to capabilities)
  • (Goodness of accelerating alignment by dint of progress derived from your publication) / (Badness of accelerating capabilities by dint of progress derived from your publication)
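
To make this concrete, below is a minimal back-of-envelope sketch of one way these three ratios could be folded into a single number. The linear model (benefit scaling with the number of alignment users, harm scaling with direct capabilities users plus the externality from alignment users), the function name, and the example figures are all my own illustrative assumptions, not something drawn from Christiano's comment:

```python
# A minimal, illustrative way to combine the three per-publication ratios above.
# The linear benefit/harm model and every number here are assumptions for the sake
# of the sketch, not a decision procedure.

def benefit_to_harm_ratio(r1: float, r2: float, r3: float) -> float:
    """Rough expected alignment benefit over expected capabilities harm; > 1 looks favorable.

    r1 -- researchers applying the contents to alignment per researcher applying them to capabilities
    r2 -- capabilities progress produced as a side effect of the alignment work, as a fraction
          of what the same effort would produce if aimed directly at capabilities
    r3 -- goodness of the resulting alignment acceleration relative to the badness of the
          resulting capabilities acceleration
    """
    alignment_benefit = r1             # alignment users, normalized per capabilities user
    capabilities_harm = 1.0 + r2 * r1  # direct capabilities users plus the externality term
    return (alignment_benefit / capabilities_harm) * r3

# Illustrative figures only: even with three times as many alignment users as capabilities
# users and a small externality, a 10x badness weighting leaves the ratio well below 1.
print(benefit_to_harm_ratio(3.0, 0.1, 0.1))  # ~0.23
print(benefit_to_harm_ratio(3.0, 0.1, 1.0))  # ~2.3, favorable only if the badness weighting is mild
```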

The issue is that hard numbers are essentially impossible to conjure, and intuitions attached to important ideas are rarely honest, let alone correct. So how can we use a model like this to make judgements about the safety of publishing alignment research? If your publication turning out favorable to alignment depends on one of these nigh-unknowable factors breaking in your favor, maybe it isn't safe to share openly. Also, again referring to Yudkowsky's security mindset literature: analysis by one individual with a robust security mindset is not evidence that something is safe to publish. If you've analyzed your research with the framework above, I urge you to think carefully about who to share it with, what feedback you hope to receive from them, and, as painful as it might be, the worst-case scenario for malicious use of your ideas.

Something along the lines of Conjecture’s infohazard policy seems reasonable to apply to alignment ideas with a high probability of satisfying any of the aforementioned three arguments, and is something I would be interested in drafting.

Argument 2

This failure mode is embodied primarily by this post, which, in short, argues that partial alignment could inflate s-risk likelihoods; The Waluigi Effect seems to imply that this is true for LLMs aligned using RLHF. I assume this is of greater concern for prosaic approaches, and considerably less so for formal ones.

S-risk seems to me a considerably less intuitive concept to think about than x-risk. As I mentioned earlier, I had been modeling AGI ruin in a binary manner, in which I considered ‘the good outcome’ to be one where we all live happily ever after as immortal citizens of Digitopia, and the bad outcome to be one where we are all dissolved into some flavor of goo. Perhaps that becomes the case as we drift into the later years of post-AGI existence, but I now see the terrifying and wonderful spectrum of short-term outcomes, including good ones that involve suffering and bad ones that do not. I had consumed s-risk literature before, but my thinking lagged so far behind my reading that it took actually applying the messages of that literature to a personal scenario to internalize it.

The root of this unintuitiveness can likely be found somewhere North of “Good Outcomes can still Entail Suffering” but East of “Reduction of Extinction Risk can Result in an Increase in S-Risk (and Vice Versa)”. To overcome it, you do at some point need to quantify the goodness/badness of an increment in extinction risk in a unit that also applies to an increment in s-risk. This looks analogous to the current capabilities-versus-alignment paradigm, in which it is difficult to move the position of one without affecting the other. Not just for publishing decisions but also for directing research effort, I think this is a critical piece of thinking that needs to be done, and done with great rigor. If it has been done already, please let me know and I will update this section of the post. I presume answers to this conundrum already exist in the study of axiology, and I have a write-up planned on this soon.
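
To illustrate the kind of common-unit comparison I mean, here is a minimal sketch under a crude linear expected-disvalue model; the function, its parameters, and every number below are illustrative assumptions, not a worked-out axiology:

```python
# A minimal sketch of weighing a change in extinction risk against a change in s-risk in a
# single unit of expected disvalue. The linearity and all figures are placeholders; the
# genuinely hard part, which this sketch does not solve, is choosing the two disvalue weights.

def expected_disvalue_change(delta_p_extinction: float, delta_p_srisk: float,
                             disvalue_extinction: float, disvalue_srisk: float) -> float:
    """Change in expected disvalue caused by an intervention (negative is good).

    delta_p_extinction -- change in the probability of extinction (can be negative)
    delta_p_srisk      -- change in the probability of an s-risk outcome (can be negative)
    disvalue_*         -- badness of each outcome in a shared unit: the contested quantity
    """
    return (delta_p_extinction * disvalue_extinction
            + delta_p_srisk * disvalue_srisk)

# Example: a technique that cuts extinction risk by one percentage point while adding a tenth
# of a point of s-risk. Whether that is net good depends entirely on the disvalue weights:
print(expected_disvalue_change(-0.01, 0.001, 1.0, 10.0))    # ~0: a wash if s-risk is 10x as bad
print(expected_disvalue_change(-0.01, 0.001, 1.0, 1000.0))  # ~0.99: clearly net bad at 1000x
```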

Conclusion

As capabilities research progresses further and more rapidly than ever before, maintaining a security mindset when publishing potentially x/s-risk-inducing research is critical, and it doesn't necessarily tax overall alignment progress that greatly: a rigorous, high-value assessment can be done quickly, and it only needs to be deferred to in order to be effective. Considering the following three arguments before publishing could have powerful long-term implications:

  1. This research will further capabilities more than it will alignment.
  2. This alignment technique, if implemented, could actually elevate s/x-risks.
  3. This alignment technique, if implemented, will increase the commercializability of AI, feeding into the capabilities hype cycle and indirectly contributing to capabilities more than it will to alignment.
Comments

Some examples of justifications I have given to myself are “You’re so new to this, this is not going to have any real impact anyway”,

I think this argument is just clearly correct among people new to the field - thinking that your work may be relevant to alignment is motivating and exciting and represents the path to eventually doing useful things, but it's also very likely to be wrong. Being repeatedly wrong is what improvement feels like!

People new to the field tend to wildly overthink the harms of publishing, in a way that increases their anxiety and makes them much more likely to bounce off. This is a bad dynamic, and I wish people would stop promoting it.

As someone who is quite concerned about the AI Alignment field having had a major negative impact via accelerating AI capabilities, I also agree with this. It's really quite unlikely for your first pieces of research to make a huge difference. I think the key people who I am worried will drive forward capabilities are people who have been in the field for quite a while and have found traction on the broader AGI problems and questions (as well as people directly aiming towards accelerating capabilities, though the worry there is somewhat different in nature). 

It's fine to make the mistake of publishing something if the mistake you made was assuming "this is great research", but if the mistake was "this is safe to publish because I'm new to research", the consequences can be irreversible. I probably fall into the category of 'wildly overthinking the harms of publishing due to inexperience', but it seems to me like a simple assessment using the ABC model I outlined in the post should take only a few minutes and could quickly inform someone of whether or not they might want to show their research to someone more experienced before publishing.

I am personally facing this dilemma. I have something I want to publish, but I was unsure whether I should listen to the voice telling me "you’re so new to this, this is not going to have any real impact anyway" or the voice telling me "if it does have some impact or was hypothetically implemented in a generally intelligent system, this could reduce extinction risk but inflate s-risk". It was a difficult decision, but I decided I would rather show someone more experienced first, which is what I am doing currently. This post was intended to be a summary of why and how I converged on that decision.

but it seems to me like a simple assessment using the ABC model I outlined in the post should take only a few minutes

Empirically, many people new to the field get very paralysed and anxious about fears of doing accidental harm, in a way that I believe has significant costs. I haven't fully followed the specific model you outline, but it seems to involve ridiculously hard questions around the downstream consequences of your work, which I struggle to robustly apply to my work (indirect effects are really hard man!). Ditto, telling someone that they need to ask someone more experienced to sanity check can have significant costs in terms of social anxiety (I personally sure would publish fewer blog posts if I felt a need to run each one by someone like Chris Olah first!)

Having significant costs doesn't mean that doing this is bad, per se, but there needs to be major benefits to match these costs, and I'm just incredibly unconvinced that people's first research projects meet these. Maybe if you've gotten a bunch of feedback from more experienced people that your work is awesome? But also, if you're in that situation, then you can probably ask them whether they're concerned.

It's fine to make the mistake of publishing something if the mistake you made was assuming "this is great research", but if the mistake was "this is safe to publish because I'm new to research", the consequences can be irreversible.

"Irreversible consequences" is not that huge of a deal. The consequences of writing almost any internet comment are irreversible. I feel like you need to argue for also the expected magnitude of the consequences being large, instead of them just being irreversible.

I agree with this sentiment in response to the question of "will this research impact capabilities more than it will alignment?", but not in response to the question of "will this research (if implemented) elevate s-risks?". Partial alignment inflating s-risk is something I am seriously worried about, and prosaic solutions especially could lead to a situation like this.

If your research not influencing s-risks negatively depends on it not being implemented, and you think that your research is good enough to post about, don't you see the dilemma here?

Extremely important discussion to have.