I think acting to reduce overhang by accelerating research on agents is getting lost in the sauce. You can't blaze a trail through the tech tree towards dangerous AI and then expect everyone else to stop when you stop. The responsible thing to do is to prioritize research that differentially advances beneficial AI even in a world full of hasty people.
Yes, sorry for being unclear. I meant to suggest that this argument implied 'accelerate agents and decelerate planners' could be the desirable piece of differential progress.
Yup, I've made basically this point (section one and most of 2) a few times in conversation. Seems true and important, and a lot of people who want to help save the world are doing useless or counterproductive things due to missing it.
Useless: Most work which doesn't backchain from a path to ending the acute risk period, bearing in mind that most of the acute risk comes not from LLMs without strong guardrails, but from agentic goal-maximising AIs. Sometimes people will do things which are useful for this anyway without tracking this crucial consideration, but the hit-rate is going to be low.
Counterproductive: Work which transfers over to productization and brings more funding and more attention to AI capabilities, especially that which brings dangerously good automated coding and automated research closer. I'd put a good deal of interpretability in this category: being able to open the black box makes it way easier to figure out ways to improve algorithmic efficiency. Interp could be part of a winning play by an actor who is aware of the broader strategic landscape, but I expect broadcasting it is net negative. Nate's post here is pretty good: "If interpretability research goes well, it may get dangerous".
What kind of interpretability work do you consider plausibly useful or at least not counterproductive?
The main criterion, I think, is not broadcasting it to organizations without a plan for aligning strong superintelligence that has a chance of working. This probably means not publishing, and also means being at an employer who has a workable plan.
There might be some types of interp which don't have much capabilities potential, and are therefore safe to publish widely. Maybe some of the work focused specifically on detecting deception? But mostly I expect interp to be good only when part of a wider plan with a specific backchained way to end the acute risk period, which might take advantage of the capabilities boosts offered. My steelman of Anthropic is trying to pull something like this off, though they're pretty careful to avoid leaking details of what their wider plan is if they have one.
It seems like eventually people are going to make competent goal-directed agents, and at that point we will indeed have the problems of their exerting more optimisation power than humanity.
In fact it seems like these non-agentic AIs might make things worse, because the goal-maximisation agents will be able to use the non-agentic AIs.
The solution is obviously to prohibit the creation of goal-maximization agents and use scaffolded LLMs instead.
Unfortunately it seems like people are going to make AI agents anyway, because ML researchers love making things.
I bet geneticists would also love to make some new things with cloning. And yet we have a noticeable lack of clones. Do not underestimate the ability of our civilization to limit its own progress.
So an alternative possible conclusion would be that we should actually try to accelerate agentic AI research as much as possible, because eventually we are going to have influential AI maximisers, and we want them to occur before the forecasting/planning overhang (and the hardware overhang) get too large.
We are currently living in the luckiest possible world, where we have powerful AI models which are nevertheless existentially harmless specifically because they lack the agentic part. Moreover, we can use these models to develop agentic-but-not-really systems that can satisfy the demand, without doing the risky research into developing coherent goal maximizers. This is a miracle. We didn't expect that things could be this good. Suddenly there is a comprehensive way we may not be doomed. And you are proposing to dismiss our incredible advantage and return to the course of being doomed anyway.
It's quite possible someone has already argued this, but I thought I should share just in case not.
Goal-Optimisers and Planner-Simulators
When people in the past discussed worries about AI development, this was often about AI agents - AIs that had goals they were attempting to achieve, objective functions they were trying to maximise. At the beginning we would make fairly low-intelligence agents, which were not very good at achieving things, and then over time we would make them more and more intelligent. At some point around human-level they would start to take off, because humans are approximately intelligent enough to self-improve, and this would be much easier in silicon.
This does not seem to be exactly how things have turned out. We have AIs that are much better than humans at many things, such that if a human had these skills we would think they were extremely capable. And in particular LLMs are getting better at planning and forecasting, now beating many but not all people. But they remain worse than humans at other things, and most importantly the leading AIs do not seem to be particularly agentic - they do not have goals they are attempting to maximise, rather they are just trying to simulate what a helpful redditor would say.
What is the significance for existential risk?
Some people seem to think this contradicts AI risk worries. After all, ignoring anthropics, shouldn’t the presence of human-competitive AIs without problems be evidence against the risk of human-competitive AI?
I think this is not really the case, because you can take a lot of the traditional arguments and just substitute ‘agentic goal-maximising AIs, not just simulator-agents’ in wherever people said ‘AI’ and the argument still works. It seems like eventually people are going to make competent goal-directed agents, and at that point we will indeed have the problems of their exerting more optimisation power than humanity.
In fact it seems like these non-agentic AIs might make things worse, because the goal-maximisation agents will be able to use the non-agentic AIs.
Previously we might have hoped to have a period where we had goal-seeking agents that exerted influence on the world similar to a not-very-influential person, who was not very good at planning or understanding the world. But if they can query the forecasting-LLMs and planning-LLMs, as soon as the AI ‘wants’ something in the real world it seems like it will be much more able to get it.
So it seems like these planning/forecasting non-agentic AIs might represent a sort of planning-overhang, analogous to a Hardware Overhang. They don’t directly give us existentially-threatening AIs, but they provide an accelerant for when agentic-AIs do arrive.
How could we react to this?
One response would be to say that since agents are the dangerous thing, we should regulate/restrict/ban agentic AI development. In contrast, tool LLMs seem very useful and largely harmless, so we should promote them a lot and get a lot of value from them.
Unfortunately it seems like people are going to make AI agents anyway, because ML researchers love making things. So an alternative possible conclusion would be that we should actually try to accelerate agentic AI research as much as possible, and decelerate tool LLM planners, because eventually we are going to have influential AI maximisers, and we want them to occur before the forecasting/planning overhang (and the hardware overhang) get too large.
I think this also makes some contemporary safety/alignment work look less useful. If you are making our tools work better, perhaps by understanding their internal working better, you are also making them work better for the future AI maximisers who will be using them. Only if the safety/alignment work applies directly to the future maximiser AIs (for example, by allowing us to understand them) does it seem very advantageous to me.
2024-07-15 edit: added clarification about differential progress.