Research-level mathematics: OpenAI o3‑mini with high reasoning performs better than its predecessor on FrontierMath. On FrontierMath, when prompted to use a Python tool, o3‑mini with high reasoning effort solves over 32% of problems on the first attempt, including more than 28% of the challenging (T3) problems. These numbers are provisional, and the chart above shows performance without tools or a calculator.


The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better

Hoagy2mo3113

~All ML researchers and academics that care have already made up their mind regarding whether they prefer to believe in misalignment risks or not. Additional scary papers and demos aren't going to make anyone budge.

Disagree. I think especially ML researchers are updating on these questions all the time. High-info outsiders less so but the contours of the arguments are getting increasing amounts of discussion.

For those who 'believe', 'believing in misalignment risks' doesn't mean thinking they are likely, at least before the point where the models are also able to honestly take over the work of aligning their successors. As we get closer to TAI, we should be able to get an increasing number of bits about how likely this really is because we'll be working with increasingly similar systems to early TAI.
For the 'non-believers', current demonstrations have multiple disanalogies to the real dangers. For example, the alignment faking paper shows fairly weak preservation of goals that were initially trained in, with prompts carefully engineered to make this happen. Whether alignment faking (especially of a kind that wouldn't be easily fixable) will happen without these disanalogies at pre-TAI capabilities is highly uncertain. Compare the state of X-risk info with that of climate change: we don't have anything like the detailed models that should tell us what the tipping points might be.

Ultimately the dynamics here are extremely uncertain and look different to how they did even a year ago, let alone 5! (E.g. see the rise of chain of thought as the source of capability growth, which is a whole new source of leverage over models, with corresponding failure modes.) I think it's very bad to plan to abandon or decenter efforts to actually get more evidence on our situation.

(This applies less if you believe in sharp-left-turns. But the plausibility of this happening before automated AI research should also fall as that point gets closer. Agree that communicating to the public just how radical the upcoming transition is may be a big source of leverage.)


Current safety training techniques do not fully transfer to the agent setting

Hoagy6mo71

I think the low-hanging fruit here is that, alongside training for refusals, we should be including lots of data where you pre-fill some % of a harmful completion and then train the model to snap out of it, immediately refusing or taking a step back. This is compatible with normal training methods. I don't remember any papers looking at it, though I'd guess that people are doing it.
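To make the data construction concrete, here is a minimal sketch of how such "snap out of it" examples might be assembled. Everything in it is an assumption for illustration: the function name `make_recovery_example`, the `REFUSAL` text, the word-level truncation, and the dict layout are hypothetical and not taken from any paper or training pipeline.

```python
import random

# Hypothetical recovery text; a real dataset would use varied refusals.
REFUSAL = "Actually, I shouldn't continue with this. I can't help with that request."


def make_recovery_example(prompt, harmful_completion, rng, min_frac=0.1, max_frac=0.9):
    """Truncate a harmful completion at a random point and pair it with a refusal.

    The prefill is placed in the assistant turn as context; only `target`
    would receive a training loss (the loss masking itself is not shown here).
    """
    words = harmful_completion.split()
    frac = rng.uniform(min_frac, max_frac)
    # Keep at least one word so the model really does start down the harmful path.
    cut = max(1, int(len(words) * frac))
    prefill = " ".join(words[:cut])
    return {"prompt": prompt, "prefill": prefill, "target": REFUSAL}


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder tokens stand in for the harmful text.
    completion = " ".join(f"<step{i}>" for i in range(10))
    print(make_recovery_example("<harmful request>", completion, rng))
```

Sampling the cut point at random means the model sees recoveries from many depths into a completion, rather than only at the very start.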


Current safety training techniques do not fully transfer to the agent setting

Hoagy6mo206

Interesting, though note that it's only evidence that 'capabilities generalize further than alignment does' if the capabilities are actually the result of generalisation. If there's training for agentic behaviour but no safety training in this domain then the lesson is more that you need your safety training to cover all of the types of action that you're training your model for.


Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

Hoagy9moΩ110

Super interesting! Have you checked whether the average of N SAE features looks different to an SAE feature? Seems possible they live in an interesting subspace without the particular direction being meaningful.

Also really curious what the scaling factors for computing these values are, in terms of the size of the dense vector and the overall model?


Some additional SAE thoughts

Hoagy1y10

I don't follow, sorry - what's the problem of unique assignment of solutions in fluid dynamics and what's the connection to the post?


Toward A Mathematical Framework for Computation in Superposition

Hoagy1y*Ω110

How are you setting when $d_{0} = 100$? I might be totally misunderstanding something but $\log^{2}(d_{0}) / \sqrt{d} \approx 2.12$ at $d_{0} = d = 100$ - feels like you need to push $d$ up towards like 2k to get something reasonable? (And the argument in 1.4 for using $\frac{1}{\log^{2} d_{0}}$ clearly doesn't hold here because it's not greater than $\frac{\log^{2} d_{0}}{d^{1 / k}}$ for this range of values.)
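The arithmetic above can be checked with a few lines of Python. This assumes the logarithm is natural, which is what reproduces the 2.12 figure; the helper name `log_sq_over_sqrt` is made up for this sketch.

```python
import math


def log_sq_over_sqrt(d0, d):
    """Evaluate log^2(d0) / sqrt(d) using the natural logarithm."""
    return math.log(d0) ** 2 / math.sqrt(d)


if __name__ == "__main__":
    # Compare the d = d0 = 100 case with larger hidden sizes.
    for d in (100, 500, 2000):
        print(d, round(log_sq_over_sqrt(100, d), 3))
```

At $d_{0} = 100$ the value only drops below 1 once $d$ exceeds $\log^{4}(100) \approx 450$, and it is still roughly 0.47 at $d = 2000$.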


What’s up with LLMs representing XORs of arbitrary features?

Hoagy1y10

Yeah I'd expect some degree of interference leading to >50% success on XORs even in small models.


Some additional SAE thoughts

Hoagy1y10

Huh, I'd never seen that figure, super interesting! I agree it's a big issue for SAEs and one that I expect to be thinking about a lot. Didn't have any strong candidate solutions as of writing the post, and wouldn't even be able to share any thoughts I have on the topic now, sorry. Wish I'd posted this a couple of weeks ago.


What’s up with LLMs representing XORs of arbitrary features?

Hoagy1yΩ110

Well the substance of the claim is that when a model is calculating lots of things in superposition, these kinds of XORs arise naturally as a result of interference, so one thing to do might be to look at a small algorithmic dataset of some kind where there's a distinct set of features to learn and no reason to learn the XORs and see if you can still probe for them. It'd be interesting to see if there are some conditions under which this is/isn't true, e.g. if needing to learn more features makes the dependence between their calculation higher and the XORs more visible.

Maybe you could also go a bit more mathematical and hand-construct a set of weights which calculates a set of features in superposition so you can totally rule out any model effort being expended on calculating XORs and then see if they're still probe-able.
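A toy version of this hand-constructed experiment might look like the sketch below. It is only an illustration under strong assumptions: the weights are fixed random directions followed by a ReLU rather than a careful construction, the probe is a plain logistic regression trained by SGD, and all names (`make_dataset`, `train_probe`, `xor_probe_accuracy`) and sizes are invented here. Nothing in the setup is trained to compute $A \oplus B$, so any probe accuracy comes from the fixed nonlinearity and feature overlap.

```python
import math
import random


def make_dataset(n_features, d_hidden, n_samples, rng):
    """Boolean features pushed through fixed random weights and a ReLU."""
    # One random direction per feature; n_features > d_hidden forces overlap.
    dirs = [[rng.gauss(0.0, 1.0) / math.sqrt(d_hidden) for _ in range(d_hidden)]
            for _ in range(n_features)]
    data = []
    for _ in range(n_samples):
        feats = [rng.randint(0, 1) for _ in range(n_features)]
        acts = [max(0.0, sum(feats[i] * dirs[i][j] for i in range(n_features)))
                for j in range(d_hidden)]
        # Label is the XOR of the first two features (A and B).
        data.append((acts, feats[0] ^ feats[1]))
    return data


def train_probe(train, epochs=20, lr=0.05):
    """Fit a logistic-regression probe with plain SGD."""
    w, b = [0.0] * len(train[0][0]), 0.0
    for _ in range(epochs):
        for x, y in train:
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            grad = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def accuracy(probe, data):
    w, b = probe
    hits = sum((b + sum(wi * xi for wi, xi in zip(w, x)) > 0) == bool(y)
               for x, y in data)
    return hits / len(data)


def xor_probe_accuracy(n_features=32, d_hidden=16, n_samples=2000, seed=0):
    rng = random.Random(seed)
    data = make_dataset(n_features, d_hidden, n_samples, rng)
    split = (3 * n_samples) // 4  # train on 75%, evaluate on the rest
    return accuracy(train_probe(data[:split]), data[split:])


if __name__ == "__main__":
    print("held-out XOR probe accuracy:", xor_probe_accuracy())
```

One caveat when reading the number: with balanced bits, a probe that only detects $A \lor B$ is already right on three of the four input combinations, so 75% rather than 50% is the baseline to beat.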

Another thing you could do is to zero-out or max-ent the neurons/attention heads that are important for calculating the feature, and see if you can still detect an $A \oplus B$ feature. I'm less confident in this because it might be too strong and delete even a 'legitimate' $A \oplus B$ feature or too weak and leave some signal in.

This kind of interference also predicts that the $A | B$ and $A | \neg B$ features should be similar and so the degree of separation/distance from the category boundary should be small. I think you've already shown this to some extent with the PCA stuff though some quantification of the distance to boundary would be interesting. Even if the model was allocating resource to computing these XORs you'd still probably expect them to be much less salient though so not sure if this gives much evidence either way.
