2001zhaozhao — LessWrong

War of Dots: CRUSHING my opponents with FACTS and LOGIC

2001zhaozhao10d30

Except you missed the most important mechanic in this game that overshadows all your mathematical calculations. Morale drains twice as fast when you attack compared to when you defend. Therefore the mathematically optimal way to play is to sit there doing nothing. There, I've solved War of Dots!

The desire to end the world

2001zhaozhao12d31

There is also a desire to create self-evolving agents, possibly related to the instinct of procreation, which becomes especially dangerous when combined with the desire to end the world.

lol

Wait, you're serious?

TurnTrout's shortform feed

2001zhaozhao13d40

Auto mode and the monitor are best-effort filters, not hard boundaries — a model can be talked past. The hard boundaries are the VM and the firewall. See SECURITY.md.
Sessions are ephemeral by default: throwaway volumes are wiped on exit, so nothing an attacker stages in one session survives into the next. Your workspace is a host directory, so every file and commit the agent makes stays on your local disk. docs/configuration.md covers persistence and workspace options.

I think the hard sessions might not be secure enough if the workspace is a host directory...

The Dual-Use Gap

2001zhaozhao13d11

There is also the problem that AI makes phishing attacks easier and more destructive, and there isn't an obvious way to leverage AI to defend against them, since they occur at the human level.

What if Anthropic unilaterally paused capabilities development right now?

2001zhaozhao16d10

It seems to me that a unilateral pause would have a higher chance of working later, when more researchers at more labs are concerned about alignment, rather than now.

The Three Filters: Why Almost Every Plan to Survive ASI Fails Miserably

2001zhaozhao17d10

I agree, though I think the risk of a secret ASI project succeeding with limited resources probably increases significantly the longer the timeline, based on a basic fragile-world extrapolation. The group of actors with the capability to reach ASI will expand from major powers to middle powers and eventually small nations (some of which may be nuclear pariah states) and private groups. All it takes is one group to pull off a successful secret project (or possibly a not-so-secret one if it's run by a country with enough nukes) to break the equilibrium.

Ass...

Vladimir_Nesov's Shortform

2001zhaozhao17d20

I think these mitigations apply only to external use outside of trusted customers, not to the internal AI lab and government uses that are relevant to takeover risk. So the main consequence is reducing the chance of some catastrophe caused by human use of the models, not reducing AI takeover risk.

The Three Filters: Why Almost Every Plan to Survive ASI Fails Miserably

2001zhaozhao17d10

I agree with most of the points presented in the post to various extents, but I don't think these arguments actually support the post's conclusion.

> If you take only one thing from this post, take this: any theory of change that falls to one of these competitive pressures is completely useless. The only way to avoid these pressures is if we could build common knowledge, at any given time, that no one is trying to develop ASI.

Doesn't this plan (effectively an outright ASI ban) fall to competitive pressures between labs and nations just as easily as the others...

Even "illegible" Mythos reasoning traces seem pretty legible

2001zhaozhao18d60

This looks like the result of some kind of conciseness/token efficiency reward during training leading to the model wanting to make its CoT more concise than can be expressed in normal English.

Didn't Anthropic recently increase the number of tokens required to represent a certain amount of text (starting from Opus 4.7)? Maybe reverting this change or something might be a good idea. Or use something like MTP and make the conciseness training step MTP-aware or something so that the extra filler words needed to make the CoT legible are nearly free.

Regardless,...

Anthropic did not call for a pause on AI

2001zhaozhao18d2321

A big argument in this post is that Anthropic's communications are overly vague, yet it backs this claim up by quoting only two sentences from the RSI blog post.

In the original RSI post, Anthropic actually goes into considerable detail about their position and why they chose their stance given the competitive dynamics involved. It is true that they didn't make hard commitments, but that is fully explainable by the uncertainty and competitive dynamics involved.

The other point here is about the apparent contradictions ("muddying the waters") this ...

Why Software Automation Is Hard

2001zhaozhao19d10

The corollary is that overcoming Amdahl's Law becomes the overriding concern for work acceleration.

When AI brings a massive speedup to some types of work and you have sufficient access to it, it's similar to suddenly getting a beefy 64-core CPU, and increasing work efficiency is like optimizing your software to that CPU. Parallelism basically trumps all else until you are able to fully utilize the hardware.
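
To put rough numbers on the Amdahl's Law point, here is a minimal sketch of the standard formula, speedup = 1 / ((1 - p) + p / s), where p is the fraction of work that gets accelerated and s is how much faster that fraction runs. The function name `amdahl_speedup` and the example fractions are illustrative assumptions, not measurements of any real workflow.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work runs `factor` times faster.

    Standard Amdahl's Law: 1 / ((1 - p) + p / s).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    serial_part = 1.0 - accelerated_fraction        # work that is not sped up
    accelerated_part = accelerated_fraction / factor  # work that is sped up
    return 1.0 / (serial_part + accelerated_part)


if __name__ == "__main__":
    # Hypothetical numbers: a 64x speedup on part of the work.
    print(amdahl_speedup(0.50, 64))  # about 1.97
    print(amdahl_speedup(0.95, 64))  # about 15.4
```

Under these assumed numbers, a 64x speedup on half the work yields less than a 2x overall gain, which is why raising the accelerated fraction matters more than raising the factor.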

Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities

2001zhaozhao23d30

> Rationalist LLM systems for research

This is such a profound idea. An LLM scaffold built explicitly on rationalist principles might be exactly what is needed to get LLMs (not even necessarily frontier ones) to scale to accurately thinking about complex problems and producing accurate conclusions on them.

Insofar as there are similarities in cognitive biases between humans and LLMs (and there probably are many similarities, since e.g. biases like motivated thinking arise from the need to prioritize thinking time to different topics in general), the same ...

When AI Builds Itself (Anthropic Institute Linkpost)

2001zhaozhao23d32

Anthropic's p(doom) is probably a lot lower than 1, but it can still be much higher than 0 while justifying their current behavior. They only need to presuppose that they have better alignment techniques and care more about alignment than their competitors, and right now they have good reasons to think that way. So let's assume that you're a decision maker at Anthropic who does think you have better alignment and would like to decide whether to (unilaterally) keep racing or not.

If you think p(doom) approximates one, then the only way to play is to stop...

Is ProgramBench Impossible?

2001zhaozhao2mo40

A bit of thought on this approach: in the past we've always designed agentic benchmarks with the goal that humans should be able to complete them alone (as a proof that the tasks are reliably possible and doable without external hints). However, requiring humans to solve a benchmark alone as part of the benchmark design process won't scale to AIs with high enough task length horizons, since you'd have to get the human to work that long to verify the benchmark task.

I'd like to see other possibility proofs, such as the performance of humans when solving this benchmark with access to the latest frontier AI models.

Try, even if they have you cold

2001zhaozhao2mo60

I think this is absolutely true in today's world, and it's the reason why entrepreneurship works. People and companies who try to act against you probably aren't doing it sophisticatedly in the way that you envision in your worst case scenario. Big companies have all the advantage, yet they slip up and give startups the opportunity to usurp them. They mess up all the time in positions that are impossible to lose if you actually take care to play optimally long term (see Valve in PC gaming for an example of the kind of dominance an incumbent can achieve if ...

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers

2001zhaozhao2mo10

LLMs probably enter RL a few months after their pretraining data cutoff. Why not just train the model to forecast events that occurred after the pretraining cutoff date but before the present day, so that you immediately have access to the ground truth? You should be able to find a lot of training examples from real world events and even produce them automatically.
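
A minimal sketch of the selection step described above, assuming you already have a list of resolved real-world events with dates. The `ResolvedEvent` type, the `forecasting_examples` function, and the sample data are all hypothetical names made up for illustration; how the events are collected is left open.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ResolvedEvent:
    """A forecasting question whose ground truth is already known (hypothetical type)."""
    question: str
    resolved_on: date
    outcome: bool


def forecasting_examples(
    events: list[ResolvedEvent],
    pretraining_cutoff: date,
    today: date,
) -> list[ResolvedEvent]:
    """Keep events that resolved after the pretraining cutoff but no later than today.

    The model cannot have seen these outcomes in pretraining, yet the ground
    truth is available immediately for scoring.
    """
    return [e for e in events if pretraining_cutoff < e.resolved_on <= today]


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    events = [
        ResolvedEvent("Did event A happen?", date(2024, 1, 10), True),
        ResolvedEvent("Did event B happen?", date(2024, 8, 1), False),
        ResolvedEvent("Did event C happen?", date(2025, 6, 1), True),
    ]
    usable = forecasting_examples(events, date(2024, 6, 1), date(2024, 12, 1))
    print([e.question for e in usable])  # only event B falls in the window
```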

If you want to spread the training out across a longer time (going backwards to ~2022), just use an old pretraining corpus published by some open-source lab somewhere, so you have a few years wor...

Current AIs seem pretty misaligned to me

2001zhaozhao2mo78

Well, the manager in your case is not doing RL on honesty; it's more like doing RL on "honest-looking task completion", which can lead either to honest task completion or to dishonesty that isn't caught. That's not appreciably different from AI training.

Current AIs seem pretty misaligned to me

2001zhaozhao2mo2113

Reading this article, I get the feeling that a lot of the task misalignment issues highlighted here with AI, such as wishful thinking and downplaying and hiding mistakes, are also very common for humans within a larger organization. There are presumably similar root causes for both: appearing more competent at your task than you actually are (and successfully fooling the grader/interviewer) is good for being selected for hiring/promotion in human society, but also good for getting the behavior reinforced if you're an AI undergoing training.

If AI's level of...

The policy surrounding Mythos marks an irreversible power shift

2001zhaozhao2mo60

I think this is a valid and pretty big concern, especially down the line.

Let's say that in 1 year Anthropic will have a model capable of running a tech startup almost entirely autonomously (maybe it needs 1 good CEO to set direction and that's it). Everyone else in the public has significantly less capable models, perhaps because Anthropic is in the lead or their competitors also can't release their SOTA models for safety reasons.

What's stopping Anthropic from turning themselves into a startup accelerator in that situation and just hire founders and run do...

We're actually running out of benchmarks to upper bound AI capabilities

2001zhaozhao3mo30

Maybe we can do more Vending Bench style benchmarks where the AI can keep doing better in a simulated world environment given some constraints?

Basically we put them in a video game with dynamic constraints enforced by an AI game master and score different AIs on their performance with the same game master AI. That way we can measure open-ended, long-term things like coherently running a simulated company.