Obviously I think it's worth being careful, but I think in general it's actually relatively hard to accidentally advance capabilities too much by working specifically on alignment. Some reasons:
I think the alignment community thinking correctly is essential for solving alignment. Especially because we will have very limited empirical evidence before AGI, and that evidence will not be obviously directly applicable without some associated abstract argument, any trustworthy alignment solution has to route through the community reasoning sanely.
Also to be clear I think the "advancing capabilities is actually good because it gives us more information on what AGI will look like" take is very bad and I am not defending it. The arguments I made above don't apply, because they basically hinge on work on alignment not actually advancing capabilities.
Hasn't the alignment community historically done a lot to fuel capabilities?
For example, here's an excerpt from a post I read recently:
My guess is RLHF research has been pushing on a commercialization bottleneck and had a pretty large counterfactual effect on AI investment, causing a huge uptick in investment into AI and potentially an arms race between Microsoft and Google towards AGI: https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research?commentId=HHBFYow2gCB3qjk2i
I don't think RLHF in particular had a very large counterfactual impact on commercialization or the arms race. The idea of non-RL instruction tuning for taking base models and making them more useful is very obvious for commercialization (there are multiple concurrent works to InstructGPT). PPO is better than just SFT or simpler approaches on top of SFT, but not groundbreakingly more so. You can compare text-davinci-002 (FeedME) and text-davinci-003 (PPO) to see.
The arms race was directly caused by ChatGPT, which took off quite unexpectedly not because of model quality due to RLHF, but because the UI was much more intuitive to users than the Playground (instruction following GPT3.5 was already in the API and didn't take off in the same way). The tech tree from having a powerful base model to having a chatbot is not constrained on RLHF existing at all, either.
To be clear, I happen to also not be very optimistic about the alignment relevance of RLHF work beyond the first few papers--certainly if someone were to publish a paper today making RLHF twice as data efficient or whatever I would consider this basically just a capabilities paper.
I think empirically EA has done a bunch to speed up capabilities accidentally. And I think theoretically we're at a point in history where simply sharing an idea can get it in the water supply faster than ever before.
A list of unsolved problems, if one of them is both true and underappreciated, can have a big impact.
The conversations I've had with people at DeepMind, OpenAI, and in academia make me very sure that lots of ideas on capabilities increases are already out there, so there's a high chance anything you suggest would be something people are already thinking about. Possibly running your ideas past someone in those circles, and sharing anything they think is unoriginal, would be safe-ish?
I think one of the big bottlenecks is a lack of ways to predict how much different ideas would help without actually trying them at costly large scale. Unfortunately, this is also a barrier to good alignment work. I don't have good ideas on making differential progress on this.
If {the reasoning for why AGI might not be near} comprises {a list of missing capabilities}, then my current guess is that the least-bad option would be to share that reasoning in private with a small number of relevant (and sufficiently trustworthy) people[1].
(More generally, my priors strongly suggest keeping any pointers to AGI-enabling capabilities private.)
[1] E.g. the most capable alignment researchers who seem (to you) to be making bad strategic decisions due to not having considered {the reasoning for why AGI might not be near}.
I think that sharing the reasoning in private with a small number of people might somewhat help with the "Alignment people specifically making bad strategic decisions that end up having major costs" cost, but not the others, and even then it would only help a small number of the people working in alignment rather than the field in general.
From a broad policy perspective, it can be tricky to know what to communicate. I think it helps if we think a bit more about the effects of our communication and a bit less about correctly conveying our level of credence in particular claims. Let me explain.
If we communicate the simple idea that AGI is near then it pushes people to work on safety projects that would be good to work on even if AGI is not near while paying some costs in terms of reputation, mental health, and personal wealth.
If we communicate the simple idea that AGI is not near then people will feel less need to work on safety soon. This would let them not miss out on opportunities that would be good to take ahead of when they actually need to focus on AI safety.
We can only really communicate one thing at a time to people. Also, we should worry more about the tail risks of false positives (thinking we can build AGI safely when we cannot) than of false negatives (thinking we can't build AGI safely when we can). Taking these two facts into consideration, I think the policy implication is clear: unless there is extremely strong evidence that AGI is not near, we must act and communicate as if AGI is near.
I reached this via Joachim pointing it out as an example of someone urging epistemic defection around AI alignment, and I have to agree with him there. I think the higher difficulty posed by communicating "we think there's a substantial probability that AGI happens in the next 10 years" vs "AGI is near" is worth it even from a PR perspective, because pretending you know the day and the hour smells like bullshit to the most important people who need convincing that AI alignment is nontrivial.
Whatever the probability of AGI in the reasonably near future (5-10 years), the probability of societal shifts due to implementation of highly capable yet sub-AGI AI is strictly higher. I think regardless of where AI "lands" in terms of slowing down in progress (if it is the case we see an AI winter/fall), the application of systems that exist even just today, even if technological progress were to stop, is enough to merit appreciating the different world that is coming within the same order of magnitude as how different it would be with AGI.
I think it's almost impossible at this point to argue against the value of providence with respect to the rise of dumb (in the relative-to-AGI sense) but highly capable AI.
I think it is okay for you to be vague. Simply saying that you can see numerous bottlenecks, but don't wish to list them to avoid others working on them, is enough to cause some weaker update than a list would cause.
IME a lot of people's stated reasons for thinking AGI is near involve mistaken reasoning and those mistakes can be discussed without revealing capabilities ideas: https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce
An alternative framing that might be useful: What do you see as the main bottleneck for people having better predictions of timelines (as you see it)?
Do you in fact think that having such a list is it?
I occasionally have some thoughts about why AGI might not be as near as a lot of people seem to think, but I'm confused about how/whether to talk about them in public.
The biggest reason for not talking about them is that one person's "here is a list of capabilities that I think an AGI would need to have, that I don't see there being progress on" is another person's "here's a roadmap of AGI capabilities that we should do focused research on". Any articulation of missing capabilities that is clear enough to be convincing, seems also clear enough to get people thinking about how to achieve those capabilities.
At the same time, the community thinking that AGI is closer than it really is (if that's indeed the case) has numerous costs, including at least:
Having a better model of what exactly is missing could conceivably also make it easier to predict when AGI will actually be near. But I'm not sure to what extent this is actually the case, since the development of core AGI competencies feels more like a question of insight than grind[1], and insight seems very hard to predict.
A benefit from this that does seem more plausible would be if the analysis of capabilities gave us information that we could use to figure out what a good future landscape would look like. For example, suppose that we aren't likely to get AGI soon and that the capabilities we currently have will create a society that looks more like the one described in Comprehensive AI Services, and that such services could safely be used to detect signs of actually dangerous AGIs. If this was the case, then it would be important to know that we may want to accelerate the deployment of technologies that are taking the world in a CAIS-like direction, and possibly e.g. promote rather than oppose things like open source LLMs.
One argument would be that if AGI really isn't near, then that's going to be obvious pretty soon, and it's unlikely that my arguments in particular for this would be all that unique - someone else would be likely to make them soon anyway. But I think this argument cuts both ways - if someone else is likely to make the same arguments soon anyway, then there's also limited benefit in writing them up. (Of course, if it saves people from significant mental anguish, even just making those arguments slightly earlier seems good, so overall this argument seems like it's weakly in favor of writing up the arguments.)
From Armstrong & Sotala (2012):