What do you think about building legal/technological infrastructure to enable a prompt pause, should it seem necessary?
I'm pretty sure GPT-N won't be able to do it, assuming they follow the same paradigm.
I am curious whether you would like to expand on this intuition. I do not share it, and it seems like one potential crux.
I do not share this intuition. I would hope that a handful of words about synthetic data would be enough to shake your confidence in this assertion, but I am tempted to try something else first.
Is this actually important to your argument? I do not see how it would end up factoring into this problem, except by more quickly obviating the advances made in understanding and steering LLM behavior. What difference does it make to the question of "stop" if instead of LLMs in a GPT wrapper, the thing that can in fact solve that task in Blender is some RNN-generating/refining action-token optimizer?
"LLMs can't do X" doesn't mean X is going to take another 50 years. The field is red hot right now. In many ways new approaches to architecture are vastly easier to iterate on than new bio-sciences work, and those move blazingly fast compared to things like high energy/nuclear/particle physics experiments - and even those sometimes outpace regulatory bodies' abilities to assess and ensure safety. The first nuclear pile got built under some bleachers on a campus.
Even if you're fully in Gary Marcus's camp on criticism of the capabilities of LLMs, his prescriptions for fixing it don't rule out another approach qualitatively similar to transformers that isn't any better for making alignment easy. There's a gap in abstract conceptualization here, where we can - apparently - make things which represent useful algorithms while not having a solid grasp on the mechanics and abstract properties of those algorithms. The upshot of pausing is that we enter into a period of time where our mastery becomes deeper and broader while the challenges we are using it to address remain crisp, constrained, and well within a highly conservative safety margin.
How is it obvious that we are far away in time? Certain emergency options, like centralized compute resources under international monitoring, are going to be on long critical paths. If someone has a brilliant idea for [Self-Censored, To Avoid Being Called Dumb & Having That Be Actually True] and that thing destroys the world before all AI training happens in monitored data centers with some totally info-screened black-box fail-safes, then you end up not having saved much "opportunity cost" compared to the counterfactual where you prevented the world from getting paper-clipped because you were willing, in that counterfactual, to ever tell anyone "no, stop" with the force of law behind it.
Seriously,
by stopping AI progress, we lose all the good stuff that AI would lead to
... that's one side of the cost-benefit analysis over counterfactuals. Hesitance over losing even many billions of dollars in profits should not stop us from preventing the end of the world.
"The average return from the urn is irrelevant if you're not allowed to play anymore!" (quote @ 1:08:10, paraphrasing Nassim Taleb)
not having any reference AI to base our safety work on
This seems like another possible crux. It implies that either there has been literally no progress on real alignment up to this point, or that you are making a claim about the marginal returns on alignment work before having scary-good systems.
Like, the world I think I see is one where alignment has been sorely underfunded, but even prior to the ML revolution there was good alignment de-confusion work that got done. The entire conceptual framing of "alignment", resources like Arbital's pre-2022 catalogue, "Concrete Problems in AI Safety", and a bunch of other things all seem like incremental progress towards making a world in which one could attempt to build an AI framework -> AGI instantiation -> ASI direct-causal-descendant and have that endeavor not essentially multiply human values by 0 on almost every dimension in the long run.
Why can't we continue this after liquid nitrogen gets poured onto ML until the whole thing freezes and shatters into people bickering about lost investments? Would we expect a better ratio of good/bad outcomes on our lottery prize wheel in 50 years, after we've solved the "AI Pause Button Problem" and "Generalized Frameworks for Robust Corrigibility" and "Otherizing/Satisficing/Safe Maximization"? There seems to be a lot of blueprinting, rocket-equation, Newtonian-mechanics, astrophysics-style work we can do even if people can't make 10 billion dollars 5 years from now selling GPT-6-powered products.
It's not that easy for an unassisted AI to do harm - especially existentially significant harm.
I am somewhat baffled by this intuition.
I suspect what's going on here is that the more harm something is proposed to be capable of, the less likely people think it is to actually happen.
Say you're driving fast down a highway. What do you think a split second after seeing a garbage truck pull out in front of you while you are traveling towards it at >150 km/h relative velocity? Say your brain could generate in that moment a totally reflectively coherent probability distribution over expected outcomes. Does the distribution go from the most probability mass in scenarios with the least harm to the least probability mass in scenarios with the most harm? "Ah, it's fine," you think, "it would be weird if this killed me instantly, less weird if I merely had a spinal injury, and even less weird if I simply broke my nose and bruised my sternum."
The underlying mechanism - the actual causal processes involved in determining the future arrangements of atoms, or the amount of reality fluid in possible Everett-branch futures grouped by similarity in features - that's what you have to pay attention to. What you find difficult to plan for, or what you observe humans having difficulty planning for, does not mean you can map that same difficulty curve onto AI. AlphaZero did not experience the process of getting better at the games it played the way humanity experienced that process. It did not have to spend 26 IRL years learning the game painstakingly from traditions established over hundreds of IRL years; it did not have to struggle to sleep well, eat healthy, stay clean of vices, and remain motivated in order to stay on task and perform at its peak capacity. It didn't even need to solve the problem perfectly - like representing "life and death" robustly - in order to, in reality, beat the top humans and most (or all, modulo the controversy over Stockfish being tested in a suboptimal condition) of the top engines.
It doesn't seem trivial, for a certain value of the word "trivial." Still, I don't see how this consideration gives anyone much confidence that it is qualitatively "really tough" the way getting a rocket carrying humans to Mars is tough - where you don't one day just get the right lines of code into a machine and suddenly the cipher descrambles in 30 seconds, when before it wouldn't happen no matter how many random humans you had guessing at the code or how many hours naively written programs spent attempting to brute-force it.
Sometimes you just hit Enter, kick a snowball at the top of a mountain, and 1s and 0s click away, and an avalanche comes down in a rush upon the schoolhouse 2 km below your skiing trail. The badness of the outcome didn't matter one bit to its probability of occurring in those real-world conditions in which it occurred. The outcome depended merely on the actual properties of the physical universe, and on which effects descend from which causes. See Beyond The Reach of God for an excellent extended meditation on this reality.
A few things stood out to me here; namely, a steelman and a strawman.
You make the sweeping claim that "AI can bring - already brings - lots of value, and general improvements to human lives", but don't substantiate that claim at all. (Maybe you think it's obvious, as a daily user. I think there's lots of room to challenge AI's utility to human beings.) Much of the "benefits of AI" talk boils down to advertising and hopeful hype from invested industries. I would understand claiming, as an example, that "AI increases the productivity or speed of certain tasks, like writing cover letters". That might be an improvement in human lives, though it depends on things like whether the quality also decreases, and who or what is harmed as part of the cost of doing it.
But this should be explored and supported; it is not at all obvious. Claiming that there is "lots of value" isn't very persuasive by itself -- especially since you include "improvements to human lives" in your statement. I'd be very curious to know which improvements to human lives AI has brought, and whether they actually stand up against, not just the dangers, but the actual already-existing downsides of AI, as well.
Those are, I feel, the strawmen in this argument. And I apologize if I make any inadvertently heated word-choices here; the tactic you're using is a theme I've been seeing that's getting under my skin: to weigh AI's "possible gains" against its "potential dangers" in a very sci-fi, what-might-happen-if-it-wakes-up way, while failing to weigh its actual, demonstrated harms and downsides as part of the equation at all. This irks me. I think it particularly irks me when an argument (such as this one) claims all the potential upsides for AI as benefits for humans in one sweeping sentence, but then passes over the real harms to humans that even the fledgling version of AI we have has already caused, or set in motion.
I understand that the "threat" of a sentient supercomputer is sexier to think about -- and it serves as a great humblebrag for the industry, too. They get to say "Yes yes, we understand that the little people are worried our computers are TOO smart, hahaha, yes let's focus on that" -- but it's disingenuous to call the other problems "boring dangers", even though I'm sure there's no interest in discussing them at AI tech conventions. But many of these issues aren't dangers; they're already-present, active problems that function as distinct downsides to allowing AI (with agency or not) unfettered access to our marketplaces.
Three of many possible examples of already-a-downside "dangers", open to argument but worthy of consideration: First, damage to the environment and the waste of tons of resources in an era where we should definitely be improving on efficiency (and, you know, feeding the poor and stuff, maybe rather than "giving" Altman 7 trillion dollars). Second, mass-scale theft from artists and craftspeople that could harm or even destroy entire industries or areas of high human value. (And yes, that's an example of "the tech industry being bad guys" and not inherent to AI as a concept, but it is also how the real AI is built and used by the people actually doing it, and currently no-one seems able to stop them. So it's the same type of problem as having an AI that was designed poorly with regards to safety: some rich dudes could absolutely decide to release it in spite of technicalities like human welfare. I mention this to point out that the mechanism for greedy corporations to ignore safety and human lives is already active in this space, so maybe "we're stopping all sorts of bad guys already" isn't such an ironclad reason to ignore those dangers. Or any dangers, because, um, that's just kind of a terrible argument for anything; sorry.) Third, the scrambling of fact and fiction[1] to the point where search engines are losing utility and people who need to know the difference for critical reasons -- like courts, scientists, and teachers -- are struggling to do their work.
All of which is a bit of a long way to say that I see a steelman and a strawman here, making this argument pretty weak overall. I also see ways you could improve both, by looking into the details (if they exist) of your steelman, and by broadening your strawman to include not just theoretical downsides but real ones.
But you made me think, and elucidate something that's been irritating me for many months; thank you for that!
[1] I object to the term "hallucination"; it's inaccurate and offensive. I don't love "fiction" either, but at least it's accurate.
I'm a daily user of ChatGPT, sometimes supplementing it with Claude, and the occasional local model for some experiments. I try to squeeze LLMs into agent-shaped bodies, but it doesn't really work. I also have a PhD, which would typically make me an expert in the field of AI, but the field is so busy and dynamic that it's hard to really state what an "expert" even is.
AI can bring - already brings - lots of value, and general improvements to human lives, so my default stance is that its continued progress - one might say, acceleration - is in our best interest. The flip-side is, of course, a whole host of dangers associated with AI.
Boring dangers
There are a few "boring" dangers. Humans can use AI for evil - this applies to every single technology in the history of technology, so if that were enough to stop AI, it should make us return to monke. Speaking a bit more formally - if the benefits outweigh the dangers, we should go ahead with AI development. Speaking a bit less formally - sure, we need to stop the bad guys, but we're stopping all sorts of bad guys already, so I'll just omit this whole section of dangers, because I'm utterly unconvinced that the "boring" bad applications of AI are uniquely worse than the bad applications of other technologies.
Dangerous dangers
The more interesting part of AI danger is, as people here will likely agree, the risk of a superintelligent AI being misaligned with our human values, to the point of becoming a threat to humanity's continued existence. I absolutely acknowledge that a sufficiently powerful AI could pose such a threat. A machine that perfectly executes the literal meaning of someone's words can neglect a lot of the implicit assumptions ("Don't run over the baby", "Don't turn people into paperclips") that we humans know intuitively. A superintelligence might develop its own goals, be they instrumental or terminal, and we might have little chance to stop it. I agree with all of this.
And yet, I don't see any reason to stop, or even slow down.
Dangerous capabilities
By far, the most dangerous - and useful - capability for an AI to have is agency. The moment an AI can just go and do things (as opposed to outputting information for a human to read), its capabilities go up a notch. And yet, every AI agent that I've seen so far is either an AGI that's been achieved in a gridworld, or a scam startup that's wasting millions of VC dollars. Really, PauseAI people should be propping up AutoGPT et al. if they want to slow down real progress.
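To be concrete about what I mean by agency: roughly every one of these agent products is a text model wrapped in a loop like the sketch below. This is only an illustrative skeleton - the llm() stub and the toy tool set are hypothetical placeholders, not any particular framework's API.

```python
# Minimal sketch of an "agent-shaped body" wrapped around a text model.
# llm() is a hypothetical stand-in for any chat-completion call; the tools are toy examples.

def llm(prompt: str) -> str:
    """Stand-in for a real model call; a real agent would send `prompt` to an LLM."""
    return "thought: nothing left to do. action: finish[done]"

TOOLS = {
    "search": lambda query: f"(pretend search results for {query!r})",
    "finish": lambda answer: answer,
}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = f"Goal: {goal}\n"
    for _ in range(max_steps):
        reply = llm(history)                          # the model only ever emits text
        action = reply.split("action:")[-1].strip()   # e.g. "search[blender download]"
        name, _, arg = action.partition("[")
        name, arg = name.strip(), arg.rstrip("]")
        observation = TOOLS[name](arg)                # "doing things" = executing parsed text
        if name == "finish":
            return observation
        history += f"{reply}\nObservation: {observation}\n"  # feed the result back in
    return "gave up"

print(run_agent("make a sphere in Blender"))
```

The loop itself is trivial; all the load-bearing work is in whether the text the model emits actually amounts to a coherent multi-step plan - which is exactly where I see these things fall over.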
It's not that easy for an unassisted[1] AI to do harm - especially existentially significant harm. I like to think that I'm a fairly smart human, and I have no idea how I would bring about the end of humanity if I so desired. You'd need a lot of knowledge about the world as it is, a lot of persuasive power to get people to do what you want them to, and a lot of adaptability to learn how the world changes. And we're simply not even remotely close to a level of capabilities, or even autonomy, where this would be possible.
Auto-regressive models, typically LLMs, do one thing really well - they predict the next most likely token. There are a lot of interesting things you could do by predicting the next token, but agency, in my opinion, is not one of them. The moment you need to construct multi-step plans, react to changing conditions, communicate with others, and even know what is important enough to remember/observe - LLMs start failing completely.
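For anyone who hasn't looked under the hood, "predict the next most likely token" really is the whole interface. Here is a minimal greedy-decoding sketch - my illustration, using the Hugging Face transformers library with GPT-2 as a stand-in for any causal LM:

```python
# Minimal greedy decoding: the model only ever answers "which token comes next?"
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("To make a sphere in Blender, first", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(30):
        logits = model(input_ids).logits       # shape: [batch, seq_len, vocab_size]
        next_id = logits[0, -1].argmax()       # most likely next token, nothing more
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Planning, memory, and reacting to a changing world all have to be smuggled through that single repeated operation.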
But they'll get better!
Sure, they might - I hope they do. There's so much good that better models can bring. But the LLM-based paradigm is definitely hitting a wall. Maybe we can keep scaling to GPT-5 and GPT-6 and GPT-N. They'll get smarter, they'll hallucinate less, they'll put some more people out of jobs, they'll pick more goats in the Goat-Enthusiast Monty Hall variant. They won't get agency. They won't get scary.
But we should pause, figure out safety, and then resume
That would be nice, except for two details - opportunity cost, and not having any reference AI to base our safety work on. Opportunity cost is self-explanatory - by stopping AI progress, we lose all the good stuff that AI would lead to. Pausing should be the last resort. Without discussing the specific threat levels at this point in history, you could have made an argument for PauseCompute back in the '90s, or whenever really. If we start approaching dangerous levels of capabilities, we should probably stop or pause if our safety knowledge isn't up to snuff. But we're still so, so far away.
As for the second part - if we stop AI research right now and only focus on alignment, we will never figure out alignment. We'd be forced into purely philosophical "research" entirely detached from reality. It would have to be general enough to cover any conceivable type of AI, any conceivable threat model, and thus - impossible. Instead, we should keep building our current silly lil' AIs, to start building an understanding of what the capable-and-dangerous models might be in the future, and work on making those specific models safe.
If you had done alignment research in the '90s, you wouldn't have focused on LLMs.
A small test
Hey MacGPT-10, make a sphere in Blender.
It's a relatively complex task - figure out a browser, find the right website, open it, download Blender, install Blender, open Blender, figure out how to use it, make a sphere.
No existing model can do it. I'm pretty sure GPT-N won't be able to do it, assuming they follow the same paradigm. And this is maybe 1% of a potentially existentially-threatening level of capabilities.
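(For contrast, the sphere itself is trivial once you're already inside Blender - its bundled Python API does it in one call, something like the sketch below, run from Blender's scripting workspace rather than a standalone interpreter. The hard part of the test is everything around that call: the browser, the download, the installer, the unfamiliar GUI.)

```python
# Runs inside Blender's scripting workspace; the bpy module ships with Blender itself.
import bpy

# Add a UV sphere at the origin; the parameter values here are just illustrative defaults.
bpy.ops.mesh.primitive_uv_sphere_add(radius=1.0, segments=32, ring_count=16, location=(0.0, 0.0, 0.0))
```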
The end...
...is not near. If AI kills us all, it will be something so different from current models that we won't even think "Of course, I should have predicted this". So for now, I will keep advancing AI capabilities and/or safety (depending on my research/job at any given moment), because they're both valuable, and try to help humanity thrive, and tackle its other, much more immediate threats.
[1] I'm explicitly excluding scenarios of "Human uses AI to deliberately do harm" or "Human uses AI incorrectly and accidentally causes harm".