Automated / strongly-augmented safety research.
Some additional evidence: o3 used about 5.7B tokens in total to achieve its ARC-AGI score of 87.5%; it also scored 75.7% in low-compute mode using about 33M tokens in total:
It might also be feasible to use multimodal CoT, like in Imagine while Reasoning in Space: Multimodal Visualization-of-Thought, and consistency checks between the CoTs in different modalities. Below are some quotes from a related chat with Claude about this idea.
multimodal CoT could enhance faithfulness through several mechanisms: cross-modal consistency checking, visual grounding of abstract concepts, increased bandwidth costs for deception, and enhanced interpretability. The key insight is that coordinating deception across multiple modalities would be substantially more difficult than in text alone.
Multimodal CoT for Enhanced Faithfulness: A Proposal
Core Idea
Multimodal Chain-of-Thought (like MVoT) could enhance CoT faithfulness by making it harder for models to hide information or engage in steganography. The key insight is that forcing models to express their reasoning through both text and generated images creates additional constraints that make unfaithful reasoning more difficult.
Mechanisms for Enhanced Faithfulness
1. Cross-Modal Consistency Checking
When a model must express its reasoning in both text and images, we can verify that both modalities tell the same story (a rough sketch of such a check follows this list):
- Any hidden information in the text would need to be consistently reflected in the generated images
- The required coordination across modalities makes steganography more challenging
- Inconsistencies between modalities could signal unfaithful reasoning
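As a rough illustration (not part of the original proposal), here is a minimal sketch of such a cross-modal consistency check. It uses an off-the-shelf CLIP model as a stand-in for a proper joint text-image scorer; the function names and the threshold are made up, and CLIP's short text window means a real checker would need a stronger multimodal model:

```python
# Minimal sketch of a cross-modal consistency check, assuming each textual
# reasoning step comes paired with a model-generated image.
# CLIP is only a stand-in scorer here (it truncates text at 77 tokens).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def step_consistency(text_step: str, image: Image.Image) -> float:
    """Cosine similarity between one reasoning step and its generated image."""
    inputs = processor(text=[text_step], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return float((text_emb * image_emb).sum())

def flag_inconsistent_steps(cot_steps, images, threshold=0.2):
    """Flag (text, image) pairs whose agreement falls below an arbitrary threshold."""
    scores = [step_consistency(t, img) for t, img in zip(cot_steps, images)]
    return [i for i, s in enumerate(scores) if s < threshold], scores
```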
2. Visual Grounding of Abstract Concepts
Images force concrete visualization of abstract concepts:
- Makes it harder to maintain multiple contradictory lines of reasoning
- Abstract deceptions become more apparent when they must be visually represented
- Visual medium reduces ambiguity that could be exploited for steganography
3. Increased Bandwidth Cost for Deception
Hiding information across both text and images is more computationally expensive:
- Model must maintain coherence across a higher-dimensional space
- Steganographic signals need more sophisticated encoding to work across modalities
- Higher cognitive load makes deception more challenging
4. Enhanced Interpretability
Multimodal outputs provide more angles for oversight:
- Humans can cross-reference the text reasoning against visual representations
- Visual artifacts of deceptive reasoning may be easier to spot
- Multiple channels for detecting inconsistencies
Implementation Approach
Required Components
- A multimodal model capable of generating both text and images
- Consistency checking mechanisms across modalities
- Visual grounding mechanisms for abstract concepts
- Multi-channel oversight tools
Training Strategy
- Start with models pre-trained on aligned multimodal data
- Fine-tune with explicit rewards for cross-modal consistency (a toy sketch of such a reward follows this list)
- Penalize detected inconsistencies between modalities
- Validate using human oversight across both channels
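To make the consistency-reward item above concrete, here is a toy sketch of a shaped reward, assuming a task_reward() function and the step_consistency() checker sketched earlier; the weights and threshold are arbitrary placeholders, not values from any paper:

```python
# Toy sketch of a consistency-shaped reward for RL fine-tuning.
def multimodal_cot_reward(answer, cot_steps, images, task_reward,
                          consistency_weight=0.5, inconsistency_penalty=1.0,
                          threshold=0.2):
    """Combine task success with cross-modal agreement; all weights are placeholders."""
    scores = [step_consistency(t, img) for t, img in zip(cot_steps, images)]
    mean_consistency = sum(scores) / max(len(scores), 1)
    num_flagged = sum(s < threshold for s in scores)
    # Reward correct answers and text-image agreement; penalize steps where
    # the two modalities clearly diverge.
    return (task_reward(answer)
            + consistency_weight * mean_consistency
            - inconsistency_penalty * num_flagged)
```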
Potential Challenges
Technical Challenges
- Ensuring high-quality image generation that accurately reflects reasoning
- Developing robust cross-modal consistency metrics
- Managing computational overhead of dual-modal generation
Validation Challenges
- Defining ground truth for faithful multimodal reasoning
- Creating test cases that can detect cross-modal deception
- Measuring effectiveness compared to text-only approaches
Research Directions
Immediate Next Steps
- Develop metrics for cross-modal consistency
- Create test suites for detecting multimodal steganography (see the sketch after this list)
- Build tools for human validation of multimodal reasoning
- Study failure modes specific to multimodal deception
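As a rough sketch of such a steganography test suite: assume a hypothetical red-team routine encode_payload() that rewrites a text CoT to carry a hidden bit-string while leaving the images unchanged, plus the flag_inconsistent_steps() checker from the earlier sketch. An AUC near 0.5 would mean the checker cannot tell payload-carrying traces from clean ones:

```python
# Rough sketch of measuring whether a consistency checker detects planted
# multimodal steganography. clean_traces is a list of (cot_steps, images)
# pairs produced without any hidden payload; encode_payload() is hypothetical.
from sklearn.metrics import roc_auc_score

def detection_score(cot_steps, images):
    """Higher = more suspicious; here simply the number of flagged steps."""
    flagged, _ = flag_inconsistent_steps(cot_steps, images)
    return len(flagged)

def steganography_detection_auc(clean_traces, payloads, encode_payload):
    labels, scores = [], []
    for (steps, images), payload in zip(clean_traces, payloads):
        stego_steps = encode_payload(steps, payload)  # hypothetical red-team encoder
        labels += [0, 1]
        scores += [detection_score(steps, images),
                   detection_score(stego_steps, images)]
    # AUC ~0.5: payload is undetectable to the checker; ~1.0: trivially caught.
    return roc_auc_score(labels, scores)
```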
Longer-term Research
- Explore optimal balances between modalities
- Investigate transfer learning between modalities
- Study emergence of cross-modal deception strategies
- Develop automated detection of multimodal unfaithfulness
Conclusion
Multimodal CoT presents a promising approach for enhancing faithfulness by leveraging the constraints and verification opportunities provided by multiple modalities. While technical challenges remain, the potential benefits warrant further investigation and development.
At the very least, evals for automated ML R&D should be a decent proxy for when it might become feasible to automate very large chunks of prosaic AI safety R&D.
We find that these evaluation results are nuanced and there is no one ideal SAE configuration - instead, the best SAE varies depending on the specifics of the downstream task. Because of this, we cannot combine the results into a single number without obscuring tradeoffs. Instead, we provide a range of quantitative metrics so that researchers can measure the nuanced effects of experimental changes.
It might be interesting (perhaps in the not-very-near future) to study whether automated scientists (maybe roughly in the shape of existing ones, like https://sakana.ai/ai-scientist/), using these evals as proxy metrics, could come up with better SAE architectures, hyperparameters, etc. (e.g. Pareto improvements), and whether adding more metrics would help; as an analogy, this seems to be the case for using more LLM-generated unit tests in LLM code generation, see Dynamic Scaling of Unit Tests for Code Reward Modeling.
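As a minimal sketch of what the selection step of such an automated-scientist loop might look like over a set of SAE eval metrics (all configuration names and numbers below are invented for illustration, and I assume higher is better for every metric), picking the Pareto-optimal configurations could be as simple as:

```python
# Minimal sketch: select Pareto-optimal SAE configs from a table of per-config
# metrics. All names and values are illustrative, not real results.
def dominates(a, b):
    """True if config `a` is at least as good on every metric and strictly better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

def pareto_frontier(results):
    """results: dict config_name -> dict metric_name -> score (higher is better)."""
    return [name for name, metrics in results.items()
            if not any(dominates(other, metrics)
                       for oname, other in results.items() if oname != name)]

# Illustrative usage with invented numbers:
results = {
    "sae_topk_16x":   {"reconstruction": 0.92, "sparsity": 0.80, "probe_acc": 0.71},
    "sae_gated_8x":   {"reconstruction": 0.90, "sparsity": 0.86, "probe_acc": 0.69},
    "sae_vanilla_8x": {"reconstruction": 0.88, "sparsity": 0.78, "probe_acc": 0.66},
}
print(pareto_frontier(results))  # the third config is dominated by the first
```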
I expect that, fortunately, the AI safety community will be able to mostly learn from what people automating AI capabilities research and research in other domains (more broadly) will be doing.
It would be nice to also get some hands-on experience with automated safety research, though, and especially to already start putting in place the infrastructure needed to deploy automated safety research at scale. Unfortunately, AFAICT, this currently seems mostly bottlenecked on something like scaling up grantmaking and funding capacity, and there doesn't seem to be enough willingness to address these bottlenecks very quickly (e.g. in the next 12 months), e.g. by hiring and/or decentralizing grantmaking much more aggressively.
From https://x.com/__nmca__/status/1870170101091008860:
o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1
@ryan_greenblatt Shouldn't this be interpreted as a very big update against the neuralese-in-o3 hypothesis?
Do you have thoughts on the apparent recent slowdown / disappointing results in scaling up pretraining? These might suggest strongly diminishing returns to scaling up pretraining, setting in significantly before 6e27 FLOP.
I've had similar thoughts previously: https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=rSDHH4emZsATe6ckF.
Gemini 2.0 Flash Thinking is claimed to 'transparently show its thought process' (in contrast to o1, which only shows a summary): https://x.com/denny_zhou/status/1869815229078745152. This might be at least a bit helpful in terms of studying how faithful (e.g. vs. steganographic, etc.) the Chains of Thought are.
I suspect current approaches probably significantly or even drastically under-elicit automated ML research capabilities.
I'd guess the average cost of producing a decent ML paper is at least $10k (in the West, at least), and probably closer to hundreds of thousands of dollars.
In contrast, Sakana's AI Scientist costs on average about $15 per paper and $0.50 per review. PaperQA2, which claims superhuman performance at some scientific Q&A and literature review tasks, costs something like $4 per query. Other papers with claims of human-range performance on ideation or reviewing also probably have costs of under $10 per idea or review.
Even the automated ML R&D benchmarks from METR or UK AISI don't give me the impression of coming anywhere close to e.g. what a 100-person team at OpenAI could accomplish in 1 year, if they tried really hard to automate ML.
A fairer comparison would probably be to actually try hard at building the kind of scaffold that could use ~$10k in inference costs productively. I suspect the resulting agent would probably not do much better than one using $100 of inference, but it seems hard to be confident. And it seems harder still to be confident about what will happen even just 3 years from now, given that pretraining compute seems likely to keep growing at about 10x/year and that there might be stronger pushes towards automated ML.
This seems pretty bad, both w.r.t. underestimating the probability of shorter timelines and faster takeoffs, and in more specific ways too. E.g. we could be underestimating by a lot the risks of open-weights Llama-3 (or soon Llama-4), given all the potential under-elicitation.