Bogdan Ionut Cirstea

Automated / strongly-augmented safety research.

Comments

I suspect that current approaches significantly, or even drastically, under-elicit automated ML research capabilities.

I'd guess the average cost of producing a decent ML paper is at least $10k (in the West, at least), and probably closer to hundreds of thousands of dollars.

In contrast, Sakana's AI Scientist cost on average $15/paper and $0.50/review. PaperQA2, which claims superhuman performance at some scientific Q&A and lit review tasks, costs something like $4/query. Other papers with claims of human-range performance on ideation or reviewing also probably have costs of <$10/idea or review.
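To make the gap concrete, here's a back-of-the-envelope comparison (using only the rough guesses above; nothing here is measured data):

```python
# Rough elicitation-cost gap between human-produced and AI-produced ML papers.
# All figures are the ballpark estimates from the text above, not measured data.

human_paper_cost = (10_000, 100_000)  # $ per decent ML paper (rough guess)
ai_scientist_paper_cost = 15          # $ per Sakana AI Scientist paper

low_ratio = human_paper_cost[0] / ai_scientist_paper_cost
high_ratio = human_paper_cost[1] / ai_scientist_paper_cost
print(f"Human/AI cost ratio per paper: ~{low_ratio:,.0f}x to ~{high_ratio:,.0f}x")
# -> roughly 667x to 6,667x, i.e. 3-4 orders of magnitude of unspent inference budget
```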

Even the automated ML R&D benchmarks from METR or UK AISI don't give me the impression of coming anywhere near what, e.g., a 100-person team at OpenAI could accomplish in 1 year, if they tried really hard to automate ML.

A fairer comparison would probably be to actually try hard at building the kind of scaffold which could use ~$10k in inference costs productively. I suspect the resulting agent would probably not do much better than with $100 of inference, but it seems hard to be confident. And it seems harder still to be confident about what will happen even in just 3 years' time, given that pretraining compute seems like it will probably grow about 10x/year and that there might be stronger pushes towards automated ML.
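For concreteness, a minimal sketch of what "a scaffold that could use ~$10k productively" might look like: a propose-refine-select loop that keeps spending inference until a dollar budget is exhausted. Everything below (the cost constant, the propose/refine/score stand-ins) is hypothetical and only meant to make the shape of the comparison concrete, not to describe any existing system:

```python
import random

# Hypothetical budget-scaled research scaffold: keep generating and refining
# candidate solutions until the inference budget is spent, then return the best.
# propose/refine/score stand in for LLM calls plus automatic evaluation
# (running experiments, unit tests, reviews, etc.).

COST_PER_CALL = 0.05  # assumed average $ per LLM call, for illustration only

def propose(task):
    return {"task": task, "quality": random.random()}

def refine(candidate):
    return {**candidate, "quality": min(1.0, candidate["quality"] + 0.1 * random.random())}

def score(candidate):
    return candidate["quality"]

def run_scaffold(task, budget_dollars):
    spent, pool = 0.0, []
    while spent + COST_PER_CALL <= budget_dollars:
        candidate = propose(task)
        spent += COST_PER_CALL
        # spend part of the budget refining, part exploring new candidates
        while spent + COST_PER_CALL <= budget_dollars and random.random() < 0.5:
            candidate = refine(candidate)
            spent += COST_PER_CALL
        pool.append(candidate)
    return max(pool, key=score)

cheap = run_scaffold("automate an ML experiment", budget_dollars=100)
expensive = run_scaffold("automate an ML experiment", budget_dollars=10_000)
print(score(cheap), score(expensive))  # the empirical question: how much does 100x budget actually buy?
```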
 

This seems pretty bad both w.r.t. underestimating the probability of shorter timelines and faster takeoffs, and in more specific ways too. E.g. we could be significantly underestimating the risks of open-weights Llama-3 (and soon Llama-4) given all the potential under-elicitation.


Gemini 2.0 Flash Thinking is claimed to 'transparently show its thought process' (in contrast to o1, which only shows a summary): https://x.com/denny_zhou/status/1869815229078745152. This might be at least a bit helpful for studying how faithful (vs., e.g., steganographic) the chains of thought are.


A related announcement, explicitly targeting 'building an epistemically sound research agent @elicitorg that can use unlimited test-time compute while keeping reasoning transparent & verifiable': https://x.com/stuhlmueller/status/1869080354658890009

The toy task reminded me of the 'Hidden side objectives' subsection in section 'B.1.2 OBFUSCATED SCHEMING REASONING' of Towards evaluations-based safety cases for AI scheming.

For the SOTA on SWE-bench Verified as of 16-12-2024: 'it was around $5k for a total run... around 8M tokens for a single swebench-problem.'
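For scale, a quick back-of-the-envelope on those numbers (SWE-bench Verified has 500 problems; the per-token price below is just the implied blended rate, not a quoted figure):

```python
# Implied per-problem and per-token costs for the quoted SOTA SWE-bench Verified run.
total_run_cost = 5_000           # $ for the full run (quoted above)
num_problems = 500               # problems in SWE-bench Verified
tokens_per_problem = 8_000_000   # quoted above

cost_per_problem = total_run_cost / num_problems
implied_price_per_mtok = cost_per_problem / (tokens_per_problem / 1_000_000)
print(f"~${cost_per_problem:.0f} per problem, ~${implied_price_per_mtok:.2f} per 1M tokens blended")
# -> ~$10 per problem; still orders of magnitude below what a human engineer costs per resolved issue
```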

'That means, [after] around three months, it is possible to achieve performance comparable to current state-of-the-art LLMs using a model with half the parameter size.'

If this trend continues, combined with (better / more extensible) inference scaling laws, LM agents could soon become much more competitive on many AI R&D capabilities, including at much longer-horizon tasks.

E.g., see Figure 11 from RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts.

Also related: Before smart AI, there will be many mediocre or specialized AIs.
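Taking the quoted trend at face value (parameter count needed for fixed capability halving roughly every three months), a quick extrapolation (the 70B starting point is an illustrative assumption, not from the source):

```python
# If the parameters needed for fixed performance halve every ~3 months (the quoted trend),
# the required model size after t months is P(t) = P0 * 0.5 ** (t / 3).
def params_needed(p0_billion, months, halving_period_months=3):
    return p0_billion * 0.5 ** (months / halving_period_months)

p0 = 70  # e.g. a 70B-parameter model today (illustrative assumption)
for months in (3, 6, 12, 24):
    print(f"after {months:>2} months: ~{params_needed(p0, months):.1f}B params for the same capability")
# -> 35.0B, 17.5B, 4.4B, 0.3B -- if (a big if) the trend holds that far out
```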

The kind of instrumental reasoning required for alignment faking seems relevant, including through n-hop latent reasoning; see e.g. section 'B.1.3 HIDDEN SCHEMING REASONING' from Towards evaluations-based safety cases for AI scheming. I wouldn't be too surprised if models could currently bypass this through shortcuts. But a mix of careful data filtering and unlearning of memorized facts about deceptive learning, as suggested in https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm#Information_you_should_probably_remove_from_the_weights, could force them to externalize their reasoning (which should be much easier to monitor than latent reasoning) if they were to try to alignment-fake. Steganography would be another threat model here, as discussed e.g. in section 'B.1.2 OBFUSCATED SCHEMING REASONING' of Towards evaluations-based safety cases for AI scheming.

I don't dispute that transformers can memorize shortcuts. I do dispute their ability to perform latent (opaque) multi-hop reasoning robustly. And I think this should be (very) non-controversial; e.g. Mor Geva has many papers on this topic.

I'm pointing out that transformers seem really bad at internal multi-hop reasoning: currently they can't even do 2-hop reasoning robustly, 3-hop seems out of the question right now, and scaling doesn't seem to help much either (see e.g. Figures 2 and 3 in Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?, and note how much more robust and scalable CoT reasoning is). So 'chain-of-thought but internalized to the model will take over' seems very unlikely with transformers, and even more so if basic mitigations like unlearning (e.g. of memorized facts about deceptive learning, as suggested in https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm#Information_you_should_probably_remove_from_the_weights, to mitigate the possibility of latent scheming) were applied.
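To make the 2-hop claim concrete, here is the shape of the probes used in that literature: a composed question asked either directly (forcing the composition to happen latently, inside the forward pass) or with chain-of-thought (letting the model externalize the bridge entity). The prompts and the tiny harness below are illustrative, not the paper's actual code:

```python
# Illustrative 2-hop probe. The answer requires composing two facts:
#   fact 1: the song 'Superstition' was performed by Stevie Wonder
#   fact 2: Stevie Wonder was born in 1950
# The no-CoT prompt forces the composition to happen inside the forward pass;
# the CoT prompt lets the model externalize the bridge entity ('Stevie Wonder').

latent_prompt = (
    "Answer with just the year, no explanation.\n"
    "In what year was the performer of the song 'Superstition' born?"
)
cot_prompt = (
    "Think step by step, then give the year.\n"
    "In what year was the performer of the song 'Superstition' born?"
)

def evaluate(model_call, prompts, expected="1950"):
    """model_call: any function mapping a prompt string to a completion string."""
    return {name: expected in model_call(p) for name, p in prompts.items()}

# Usage sketch (model_call deliberately left abstract):
# results = evaluate(my_model_call, {"latent": latent_prompt, "cot": cot_prompt})
# The robust finding is that the 'cot' condition succeeds far more often than 'latent'.
```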

Steganography is a separate threat model, but even there I'd interpret current evidence (e.g. Preventing Language Models From Hiding Their Reasoning) as mostly positive (as in, even relatively simple mitigations like paraphrasing seem to go very far).
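For reference, a minimal sketch of that kind of paraphrasing mitigation: route each chain-of-thought step through a (ideally separate, trusted) paraphraser, so information can only persist across steps if it survives a meaning-preserving rewrite. The function names and two-model setup are assumptions for illustration, not the paper's exact pipeline:

```python
# Sketch of a paraphrasing defense against steganographic chain-of-thought:
# the agent only ever sees a paraphrased version of its own previous reasoning,
# so hidden channels have to survive semantics-preserving rewriting.

def paraphrase(text, paraphraser_call):
    """paraphraser_call: prompt -> completion, ideally a different/trusted model."""
    return paraphraser_call(
        "Rewrite the following reasoning in your own words, preserving the meaning "
        "but not the exact wording or formatting:\n\n" + text
    )

def run_with_paraphrased_cot(task, agent_call, paraphraser_call, num_steps=4):
    visible_reasoning = ""
    for _ in range(num_steps):
        step = agent_call(
            f"Task: {task}\nReasoning so far:{visible_reasoning}\nNext reasoning step:"
        )
        visible_reasoning += "\n" + paraphrase(step, paraphraser_call)  # breaks exact-wording channels
    return agent_call(f"Task: {task}\nReasoning:{visible_reasoning}\nFinal answer:")
```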

Responding to the prediction 'I do indeed predict that we will see chain-of-thought become less faithful as model capabilities increase, and that other ways of doing the same thing as chain-of-thought but internalized to the model will take over':

This prediction seems largely falsified as long as transformers remain the dominant architecture, and especially if we deliberately add optimization pressures towards externalized reasoning and against internal, latent reasoning; see e.g. Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? and LLMs Do Not Think Step-by-step In Implicit Reasoning.
