This is easily the most underrated AI security post of 2024, mostly because it points out that, unlike most files a defender has to protect, AI weights are very, very large, and that this means you can slow exfiltration attempts down enough that an attacker is forced to resort to physical attacks.
It also inspired follow-up work aimed at preventing model exfiltration, like Defending Against Model Weight Exfiltration Through Inference Verification.
The big question that still needs to be answered is whether preventing exfiltration of model weights is positive EV. I currently think it probably is, though there are reasonable arguments that state-proof or even state-resistant security makes things worse by making the AI race safer to engage in, and especially in worlds where long pauses are necessary, security could plausibly be negative. But there's no way to get robustly positive actions on AI, so you have to act on assumptions that could fail.
And at any rate, this is not relevant for the review at hand.
Ryan Greenblatt made a comment that reasoning models have made AI models smaller, but we now know that the current round of RL/test-time compute and reasoning was a one-time boost, and we will almost certainly see scale-ups of AI models starting in 2026.
This becomes even more impactful if we believe that the first AGIs will have weight files on the order of magnitude of human brain memory, estimated at around 2.5 petabytes, which makes exfiltration much harder.
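To make the size argument concrete, here's a minimal back-of-the-envelope sketch; the model sizes and egress rates are illustrative assumptions on my part, not numbers from the post:

```python
# Back-of-the-envelope exfiltration times for model weights of various sizes.
# Sizes and egress rates are illustrative assumptions, not figures from the post.

SECONDS_PER_DAY = 86_400

def exfil_days(weight_bytes: float, egress_bits_per_sec: float) -> float:
    """Days needed to move `weight_bytes` out at a sustained egress rate."""
    return (weight_bytes * 8) / egress_bits_per_sec / SECONDS_PER_DAY

terabyte = 1e12
petabyte = 1e15

scenarios = {
    "~1 TB frontier checkpoint": 1 * terabyte,
    "~2.5 PB (brain-memory-scale estimate)": 2.5 * petabyte,
}
egress_rates = {
    "10 Gbps (unthrottled link)": 10e9,
    "100 Mbps (rate-limited egress)": 100e6,
}

for label, size in scenarios.items():
    for rate_label, rate in egress_rates.items():
        print(f"{label} over {rate_label}: {exfil_days(size, rate):,.1f} days")
```

Even on an unthrottled 10 Gbps link, brain-memory-scale weights take weeks to move, and a rate-limited egress path pushes that into years, which is exactly the lever that turns remote exfiltration into a physical-attack problem.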
+9 for coming up with an idea so good that there's a realistic chance it alone could take us from Security Level 2 to Security Level 4, where we become resistant to state attacks (and if side channels turn out not to be a viable means of exfiltrating models, we could jump to Security Level 5, which requires defense against top nation states).
As someone who has written that post, I think the title nowadays overclaims, and the only reason I chose it was that it was the title used originally.
I'd probably argue that it explains a non-trivial amount of progress, but nowadays I'd focus way more on compute being the main driver of AI progress in general.
And yes, new architectures/loss functions could change the situation, but the paper is evidence that we should expect even those architectures to rely on a lot of compute to realize their efficiency gains.
- Pretraining scaling has slowed down massively since the GPT-4.5 debacle.
This is the one element of the comment that doesn't really stand up: new, much larger data centers are being constructed, and the GPT-4.5 debacle was nearly entirely because we compared GPT-4.5 to o3 with RL, and because compute was scaled 10x instead of 100x.
Pre-training is still going strong; it has just taken a back seat for a bit due to the crazy RL scale-up, and it will come back in importance (absent continual learning).
This implies that their trend up to 2030 is likely accurate, but that the post-2030 picture, absent new paradigms, will look a lot different from the median scenario in their model.
My own take on this discussion is that liberalism/democracy is not a normal state of affairs, and while liberalism/democracy will be historically important in the post-AGI era, if we take seriously the assumption that labour-replacing AI has a high chance of being developed in the next few decades, then liberalism likely doesn't survive.
A key crux in my thinking is that, to a large extent, liberalism has been as powerful as it has because values and power-seeking turned out to be much more correlated in humans than the orthogonality thesis suggests could happen in AIs, and because technological development over the last couple of centuries has rewarded liberalism far, far more than virtually any other value system.
Most people do not care about liberalism and would almost certainly trade away control over politics for convenience (or even the promise of convenience), a takeaway that the 2024 US election results strengthened for me.
The compact generator of technological development favoring particular values far more than others is that pretty much all technology for the last couple of centuries has required ever-increasing logistics and trade to work well, combined with the fact that technology amplifies power rather than letting you create new power from thin air.
But AGI + robotics + nanotech change this: they let states/corporations massively reduce their logistics needs, they remove the dependence on the broader population for the logistics that remain, and the residual logistics problems get handled with superhuman efficiency. And in the cases where we do align AI (which looks pretty likely to me), you no longer have to deal with coordination/agency costs either.
It also creates power rather than merely amplifying power.
This fundamentally disadvantages liberalism relative to its prior position; it's now competing on equal ground with other value systems.
And decentralization doesn't save us: the technologies that drive autocratization are convergent, and most people don't care about freedom as much as they care about convenience, so a decentralized world is at best a world with a super-majority of benevolent autocracies.
One area where I'm contrarian relative to everyone else is that this is probably not going to be nearly as bad as liberals fear, mostly because there are systematic differences between historical autocracies and post-AGI ones that shift the distribution towards more favorable outcomes. The main positive differences are that people are (slightly) altruistic and care about others, and that this altruism has less diminishing returns than their selfish wants, even accounting for The Elephant in the Brain's point that we are much less altruistic and more selfish than we pretend to be; this matters given the abundance of resources we will have in the future. On top of that, the selection effects that force dictators to choose incompetents because they are loyal will be much less potent in post-AGI autocracies, and the potential distribution of dictators is much, much better than past distributions (the fraction of benevolent dictators is something like 1 in 3-5, which is very, very impressive compared to historical norms).
Especially in worlds where democracy is intractable to preserve and AI pauses/slowdowns are difficult to pull off, we might have to take our chances on benevolent dictatorship.
One final point: I have a pretty large disagreement with the following as it relates to AI:
- There is a lot of path dependence in the development of technologies, and effects like Wright’s law mean that things can snowball: if you manage to produce a bit, it gets cheaper and it’s better to produce a larger amount, and then an enormous amount. This is how solar panels went from expensive things stuck on satellites to on track to dominate the global energy supply. Tech, once it exists, is also often sticky and hard to roll back. A trickle you start can become a stream. (And whether or not you can change the final destination, you can definitely change the order, and that’s often what you need to avert risks. All rivers end in the ocean, but they take different paths, and the path matters a lot for whether people get flooded.)
More specifically, I'm disagreeing with the view that Wright's Law lets us shape technology very well in the case of AI. The reason is that the primary drivers of progress were compute and data (arguably misleadingly called algorithmic progress), and we incorrectly assumed that powerful AI could (initially) be run with the small amounts of compute/data available to academics.
A key turning point that, in retrospect, foreshadowed the deep learning revolution was the training of a (large at the time) model with tens of millions of parameters on several GPUs, as it showed that the scaling hypothesis could be reasonable.
Nowadays, Richard Sutton's scaling hypothesis has massively more evidence behind it, and while there are certainly algorithmic/loss-function challenges, as Adam Marblestone states, it's now pretty clear that Hans Moravec was much closer to correct than any of his contemporaries when he focused on the compute that would be necessary.
It seems to me like a time horizon of 3 years or 125 years is complete overkill for automation of enough coding for the bottleneck to shift to experiments and research taste.
My small comment on this is that it's mostly fine if you take the worldview that tacit knowledge matters and that the inflated time horizon is there to make sure tacit-knowledge needs are adequately modeled.
Steve Newman has a good discussion on this here.
Basically agree with this in the near term, though I do think that in the longer term, especially in the 2030s, continual learning will bring the dangers of AGI, and will probably lead to faster takeoffs than purely LLM-based takeoff worlds.
But yes, for at least the next 5 years, continual learning will differentially wake the world up to AGI without bringing the dangers of AGI. Unlike many on here, though, I don't expect it to lead to policy that lets us reduce x-risk from AI much, for the reasons Anton Leicht states here. In short: even if accelerationist power declines, that doesn't necessarily mean AI existential safety can take advantage of it; AI safety money will decline as a percentage compared to money for various job-protection lobbies; and while accelerationists won't be able to defeat entire anti-AI bills, it will still be easy for them to neuter AI safety bills enough to make the EV of politics for reducing existential risk either much less than technical AI safety, or even outright worthless/negative, depending on the politics of AI.
There's a Dwarkesh quote on continual learning that I really want to emphasize here:
“Solving” continual learning won’t be a singular one-and-done achievement. Instead, it will feel like solving in context learning. GPT-3 demonstrated that in context learning could be very powerful (its ICL capabilities were so remarkable that the title of the GPT-3 paper is ‘Language Models are Few-Shot Learners’). But of course, we didn’t “solve” in-context learning when GPT-3 came out - and indeed there’s plenty of progress still to be made, from comprehension to context length. I expect a similar progression with continual learning. Labs will probably release something next year which they call continual learning, and which will in fact count as progress towards continual learning. But human level continual learning may take another 5 to 10 years of further progress.
Now that it is the New Year, I made a massive thread on Twitter with a lot of my own opinionated takes on AI. To summarize: my timelines have lengthened, which goes with my view that new paradigms for AI are both likelier than they used to be and more necessary, which in expectation reduces AI safety from our vantage point. AI will be a bigger political issue than I used to think, and depending on how robotics ends up, it might be the case that by 2030 LLMs are just good enough to control robots even if their time horizon for physical tasks is pretty terrible, because you don't need much long-term planning; that would make AI concern/salience go way up. Though contra the hopes of a lot of people in AI safety, this almost certainly doesn't let us reduce x-risk by much, for reasons Anton Leicht talks about here. There are many more takes in the full thread above.
But to talk about some takes that didn't make it into the main Twitter thread, here are some to enjoy:
More generally, once you are able to go into space and create enough ships to colonize solar systems/galaxies, your civilization is immune to existential threats that rely solely on known physics, which covers basically everything short of attacks wielding stellar/galactic resources, and this vastly simplifies the coordination problems compared to the ones we face here on Earth.
I instead want space governance to be prioritized more for 2 reasons:
These are my takes for New Year's today.
A couple of things happened that made terms like AGI/ASI less useful:
1. Not a novel point, but: one reason that progress against benchmarks feels disconnected from real-world deployment is that good benchmarks and AI progress are correlated endeavors. Both benchmarks and ML fundamentally require verifiable outcomes. So, within any task distribution, the benchmarks we create are systematically sampling from the automatable end.
Importantly, there's no reason to believe this will stop, so we should expect benchmarks to continue to feel rosy compared to real-world deployment.
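A toy simulation of this selection effect (everything here, from the verifiability threshold to the score function, is a made-up assumption for illustration, not a claim about any real benchmark):

```python
import random

random.seed(0)

def ai_score(verifiability: float) -> float:
    """Toy AI performance: higher on more verifiable tasks, plus noise."""
    return min(1.0, max(0.0, 0.2 + 0.7 * verifiability + random.gauss(0, 0.1)))

# Each task gets a latent "verifiability" score in [0, 1]; by assumption it drives
# both whether the task becomes a benchmark and how automatable it is.
tasks = [random.random() for _ in range(100_000)]

benchmarked = [v for v in tasks if v > 0.7]  # only highly verifiable tasks become benchmarks
deployed = tasks                             # real-world deployment draws from everything

print("mean AI score on benchmarks :", sum(ai_score(v) for v in benchmarked) / len(benchmarked))
print("mean AI score in deployment :", sum(ai_score(v) for v in deployed) / len(deployed))
```

The gap between the two numbers persists for as long as benchmark construction keeps selecting on verifiability.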
Thane Ruthenis also has an explanation about why benchmarks tend to overestimate progress here.
2. We assumed intelligence/IQ was much more low-dimensional than it really is. Now, I don't totally blame people for thinking there was a chance that AI capabilities were low-dimensional, but way too many expected an IQ analogue for AI to work. This is in part an artifact of current architectures being more limited on some key dimensions like continual learning/long-term memory, but I wouldn't put anywhere close to all of the jaggedness on AI deficits; instead, LWers forgot that reality is surprisingly detailed.
Remember, even in humans, IQ only explains 30-40% of the variance in performance, which, while more than a lot of people want to admit, is far from everything; nerd communities like LessWrong have the opposite failure mode of believing that intelligence/IQ/capability is very low-dimensional, with a single number dominating how performant you are.
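As a minimal sketch of why a factor that explains 30-40% of variance still leaves performance jagged, here's a toy factor model; the 0.35 loading is my assumption, chosen just to match the figure above:

```python
import random
import statistics  # statistics.correlation needs Python 3.10+

random.seed(0)

# Toy factor model: performance in any domain is a mix of a general factor g and
# domain-specific noise, with the loading chosen (my assumption) so that g explains
# ~35% of the variance in each domain.
VAR_EXPLAINED = 0.35

def performance(g: float) -> float:
    return VAR_EXPLAINED ** 0.5 * g + (1 - VAR_EXPLAINED) ** 0.5 * random.gauss(0, 1)

people = [random.gauss(0, 1) for _ in range(50_000)]
domain_a = [performance(g) for g in people]
domain_b = [performance(g) for g in people]

# Cross-domain correlation comes out around 0.35: a real general factor, but far
# from a single number that dominates how performant you are.
print(statistics.correlation(domain_a, domain_b))
```

Knowing someone's rank in one domain tells you only a little about their rank in another, which is the jaggedness point.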
To be frank, this is a real-life version of an ontological crisis, where certain assumptions about AI, especially on LW, turned out to be entirely wrong, meaning that certain goal-posts/risks have turned out at best to require conceptual fragmentation, and at worst to be incoherent.
This is still a quite good post on how to think about AI in the near-term, and the lessons generalize broadly beyond even the specific examples.
The main lessons I take away from this post are these:
Summed up as "often the first system to do X will not be the first system to do Y".
Summed up by this quote:
But knowing only that superintelligences are really intelligent doesn't help with designing the scheming-focused capability evaluations we should do on GPT-5, and abstracting over the specific prerequisite skills makes it harder to track when we should expect scheming to be a problem (relative to other capabilities of models).[1] And this is the viewpoint I was previously missing.
And this matters because this should update us towards believing that we can delegate/automate quite a lot of alignment research in the critical period, meaning that our chance at surviving AI is higher than often assumed.
I will say that there is a caveat: I now believe that persuasion/epistemics is one of the few areas where I expect much more discontinuous changes, but that's covered below when I discuss my use of reacts on the post.
I do admit that this is quite continuous with the 3rd lesson, so I'm not going to dwell on details here.
Just know that AI automation of AI safety shouldn't be dismissed casually.
Now that I'm done listing off lessons, I want to talk about my use of reacts.
I used 5 reacts on the post, and I missed the mark with 2 of them: the "hit the mark" react on persuasion, and the "important" react on a study.
The main reason I missed the mark here is that I now believe that the study only shows that it's easy for AIs to persuade people for a short time when the topic isn't salient, and unfortunately most of the high-value epistemic applications will make topics salient, meaning that persuasion is way harder than people believe it is.
Persuasion/epistemics is one of the few domains where I expect strongly discontinuous progress, but due to the hardness of persuasion, I now think that AI persuasion is much less of a threat than I used to think (in the regime where our choices actually matter for AI risk). This makes me more optimistic than I used to be about trusting AI outputs even in domains where tasks are hard to verify, since humans are very, very hard to persuade.
(A good book on this is Not Born Yesterday: The Science of Who We Trust and What We Believe by Hugo Mercier.)
I'd give this a +4 vote. It's good and important for the near-term, and while not the most important for AI, it's still a pretty good collection of ideas (though persuasion capability will increase far more discontinuously than the author claims).
Links to long comments that I want to pin, but which are too long to be pinned:
https://www.lesswrong.com/posts/Zzar6BWML555xSt6Z/?commentId=aDuYa3DL48TTLPsdJ
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD
https://www.lesswrong.com/posts/DCQ8GfzCqoBzgziew/?commentId=RhTNmgZqjJpzGGAaL
https://www.lesswrong.com/posts/NjzLuhdneE3mXY8we?commentId=rcr4j9XDnHWWTBCTq