There is also a desire to create self-evolving agents, possibly related to the instinct of procreation, which becomes especially dangerous when combined with the desire to end the world.
lol
Wait, you're serious?
Auto mode and the monitor are best-effort filters, not hard boundaries — a model can be talked past. The hard boundaries are the VM and the firewall. See
SECURITY.md. Sessions are ephemeral by default: throwaway volumes are wiped on exit, so nothing an attacker stages in one session survives into the next. Your workspace is a host directory, so every file and commit the agent makes stays on your local disk.
docs/configuration.md covers persistence and workspace options.
I think the hard sessions might not be secure enough if the workspace is a host directory...
There is also the problem that AI makes phishing attacks easier and more destructive, and there isn't an obvious way to leverage AI to defend against it since it occurs at the human level.
It seems to me that a unilateral pause would have a higher chance of working later, when more researchers at more labs are concerned about alignment, rather than now.
I agree, though I think the risk of a secret ASI project succeeding with limited resources probably increases significantly the longer the timeline, based on a basic fragile-world extrapolation. The group of actors with the capability to reach ASI will expand from major powers to middle powers and eventually small nations (some of which may be nuclear pariah states) and private groups. All it takes is one group pulling off a successful secret project (or possibly a not-so-secret one if it's run by a country with enough nukes) to break the equilibrium.
Ass...
I think these mitigations only apply to external use outside of trusted customers, not to the internal AI lab and government uses that are relevant to takeover risk. So the main consequence is reducing the chance of some catastrophe caused by human use of the models, not reducing AI takeover risk.
I agree with most of the points presented in the post to various extents, but I don't think these arguments actually support the post's conclusion.
If you take only one thing from this post, take this: any theory of change that falls to one of these competitive pressures is completely useless. The only way to avoid these pressures is if we could build common knowledge, at any given time, that no one is trying to develop ASI.
Doesn't this plan (effectively an outright ASI ban) fall to competitive pressures between labs and nations just as easily as the others...
This looks like the result of some kind of conciseness/token efficiency reward during training leading to the model wanting to make its CoT more concise than can be expressed in normal English.
Didn't Anthropic recently increase the number of tokens required to represent a certain amount of text (starting from Opus 4.7)? Maybe reverting this change or something might be a good idea. Or use something like MTP and make the conciseness training step MTP-aware or something so that the extra filler words needed to make the CoT legible are nearly free.
Regardless,...
A big argument in this post is that Anthropic's communications are overly vague, yet it backs this claim up only by quoting two sentences from the RSI blog post.
In the original RSI post, Anthropic actually goes into considerable detail about their position and why they chose their stance given the competitive dynamics involved. It is true that they didn't make hard commitments, but that is fully explainable by the uncertainty and competitive dynamics involved.
The other point here is about the apparent contradictions ("muddying the waters") this ...
The corollary is that overcoming Amdahl's Law becomes the overriding concern for work acceleration.
When AI brings a massive speedup to some types of work and you have sufficient access to it, it's similar to suddenly getting a beefy 64-core CPU, and increasing work efficiency is like optimizing your software to that CPU. Parallelism basically trumps all else until you are able to fully utilize the hardware.
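To put rough numbers on the analogy, here is a minimal sketch of Amdahl's Law. The function name and the 64x factor are my own illustrative assumptions, not measurements of any real workflow: if a fraction p of the work can be accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s).

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work runs `factor` times faster.

    Hypothetical helper for illustration; this is just Amdahl's Law.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / factor)


if __name__ == "__main__":
    # Assumed 64x acceleration, echoing the 64-core CPU analogy.
    for fraction in (0.5, 0.9, 0.99):
        print(f"{fraction:.0%} accelerated -> {amdahl_speedup(fraction, 64):.2f}x overall")
```

Even with a 64x factor, accelerating only half of the work yields less than a 2x overall gain, which is why shrinking the non-accelerated share matters more than making the fast part faster.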
> Rationalist LLM systems for research
This is such a profound idea. An LLM scaffold built explicitly on rationalist principles might be exactly what is needed to get LLMs (not even necessarily frontier ones) to scale to accurately thinking about complex problems and producing accurate conclusions on them.
Insofar as there are similarities in cognitive biases between humans and LLMs (and there probably are many similarities, since e.g. biases like motivated thinking arise from the need to prioritize thinking time to different topics in general), the same ...
Anthropic's p(doom) is probably a lot lower than 1, but it can still be much higher than 0 while justifying their current behavior. They only need to presuppose that they have better alignment techniques and care more about alignment than their competitors, and right now they have good reasons to think that way. So let's assume that you're a decision maker at Anthropic who does think you have better alignment and would like to decide whether to (unilaterally) keep racing or not.
If you think p(doom) approximates one, then the only way to play is to stop...
A bit of thought on this approach: in the past we've always designed agentic benchmarks with the goal that humans should be able to complete them alone (as a proof that the tasks are reliably possible and doable without external hints). However, requiring humans to solve a benchmark alone as part of the benchmark design process won't scale to AIs with high enough task length horizons, since you'd have to get the human to work that long to verify the benchmark task.
I'd like to see other possibility proofs, such as the performance of humans when solving this benchmark with access to the latest frontier AI models.
I think this is absolutely true in today's world, and it's the reason why entrepreneurship works. People and companies who try to act against you probably aren't doing it sophisticatedly in the way that you envision in your worst case scenario. Big companies have all the advantage, yet they slip up and give startups the opportunity to usurp them. They mess up all the time in positions that are impossible to lose if you actually take care to play optimally long term (see Valve in PC gaming for an example of the kind of dominance an incumbent can achieve if ...
LLMs probably enter RL a few months after their pretraining data cutoff. Why not just train the model to forecast events that occurred after the pretraining cutoff date but before the present day, so that you immediately have access to the ground truth? You should be able to find a lot of training examples from real world events and even produce them automatically.
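A minimal sketch of the selection step I have in mind, assuming you already have a list of resolved events with dates. The `ResolvedEvent` type, its fields, and `select_forecasting_examples` are hypothetical names for illustration, not part of any existing pipeline:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ResolvedEvent:
    """Hypothetical record: a forecasting question with a known outcome."""
    question: str
    outcome: bool
    resolved_on: date


def select_forecasting_examples(
    events: list[ResolvedEvent],
    pretraining_cutoff: date,
    today: date,
) -> list[ResolvedEvent]:
    """Keep events that resolved after the cutoff but no later than today.

    The model could not have seen these outcomes during pretraining,
    yet the ground truth is already available for scoring.
    """
    return [e for e in events if pretraining_cutoff < e.resolved_on <= today]


if __name__ == "__main__":
    # Made-up placeholder events and dates, purely for illustration.
    events = [
        ResolvedEvent("Placeholder question A", True, date(2024, 1, 10)),
        ResolvedEvent("Placeholder question B", False, date(2024, 8, 2)),
    ]
    usable = select_forecasting_examples(events, date(2024, 6, 1), date(2024, 12, 1))
    print([e.question for e in usable])
```

The strict inequality on the cutoff side is deliberate: anything resolved on or before the cutoff might have leaked into pretraining.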
If you want to spread the training out across a longer time (going backwards to ~2022), just use an old pretraining corpus published by some open-source lab somewhere, so you have a few years wor...
Well, the manager in your case is not doing RL on honesty; it's more like doing RL on "honest-looking task completion", which can lead either to honest task completion or to dishonesty that isn't caught. Not appreciably different from AI training here.
Reading this article, I get the feeling that a lot of the task misalignment issues highlighted here with AI, such as wishful thinking and downplaying and hiding mistakes, are also very common for humans within a larger organization. There are presumably similar root causes for both: appearing more competent at your task than you actually are (and successfully fooling the grader/interviewer) helps you get hired or promoted in human society, and it also gets the behavior reinforced if you're an AI undergoing training.
If AI's level of...
I think this is a valid and pretty big concern, especially down the line.
Let's say that in 1 year Anthropic will have a model capable of running a tech startup almost entirely autonomously (maybe it needs 1 good CEO to set direction and that's it). Everyone else in the public has significantly less capable models, perhaps because Anthropic is in the lead or their competitors also can't release their SOTA models for safety reasons.
What's stopping Anthropic from turning themselves into a startup accelerator in that situation and just hire founders and run do...
Maybe we can do more Vending Bench style benchmarks where the AI can keep doing better in a simulated world environment given some constraints?
Basically we put them in a video game with dynamic constraints enforced by an AI game master and score different AIs on their performance with the same game master AI. That way we can measure open-ended, long-term things like coherently running a simulated company.
Except you missed the most important mechanic in this game, one that overshadows all your mathematical calculations. Morale drains twice as fast when you attack compared to when you defend. Therefore the mathematically optimal way to play is to sit there doing nothing. There, I've solved War of Dots!