Thanks! Some thoughts here:
The first is how to train oversight AIs when the oversight tasks are no longer easily verifiable—for example, sophisticated reward hacks that can fool expert coders, or hard-to-verify sandbagging behavior on safety-related research. You mentioned that this would get covered in the next post, so I'm looking forward to that.
I think the thing that helps you out here is compositionality---all of these properties hopefully reduce to simpler concepts that are themselves verifiable, so that, e.g., a smart enough interp assistant could understand the individual concepts as well as how they compose, and use this to understand more complex latent reasoning that isn't directly verifiable.
The second is how robust these oversight mechanisms are to optimization. It seems like a bad idea to train directly against unwanted concepts in predictive concept decoders, but maybe training directly against investigator agents for unwanted behaviors is fine? Using Docent to surface issues in RL environments (and then fixing those issues) also seems good? In some cases, if we have good interpretability, we can actually understand where the undesirable behaviors come from (e.g., via data attribution) and address the source of the problem. That's great!
But I don't think this would always be the case even if we have way better versions of existing tools. Of course, we could just use these tools mostly for auditing (as described in "Putting up bumpers"), but that limits their utility by a lot. Clearer thinking about how much we can train against various discriminators seems pretty crucial to all of this.
I agree with most of this. I'd just add that it's not totally obvious to me that RLHF is the way we should be doing alignment out into the future---it's a bit like electroshock therapy for LLMs, which feels pathological from a psychological standpoint. I'd guess that there are more psychologically friendly ways to train LMs---and understanding the relationship between training data and behaviors feels like a good way to study this!
Related to the second point: this post seems to assume a non-adversarial dynamic between the supervisors and the supervised (e.g., the AIs are not scheming). That said, good oversight can be one of our best tools for preventing scheming/adversarial dynamics from arising in the first place. Many of the tools developed here could also be used to detect scheming (e.g., I wrote about using investigator agents to create realistic honeypots here). Still, more thought on how these problems play out in adversarial scenarios would probably be fruitful.
Do you mean that the oversight system is scheming, or the subject model? For the subject model, the hope is that a sufficiently powerful overseer can catch that. If you're worried about the overseer, one reason for optimism is that the oversight models can have significantly smaller parameter counts than frontier systems, and so are less likely to have weird emergent properties.
The explainer model is actually evaluated based on how well the explanations predict ground-truth activation patterns, so it's not being evaluated by an LM judge but against the underlying ground truth.
There is still room to hack the metric to some extent (in particular, we use an LM-based simulator to turn the explanations into predictions, so you could do better by providing more simulator-friendly explanations). This is probably happening, but we did a head-to-head comparison of LM-generated vs. human-generated explanations, and based on spot-checking them by hand, the higher-scoring explanations under our metric really did seem better.
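For concreteness, the scoring works roughly like the sketch below; `simulator.predict` is just a stand-in for the LM-based simulator, and the details differ from our actual pipeline:

```python
import numpy as np

def score_explanation(explanation: str, texts: list[str],
                      true_activations: np.ndarray, simulator) -> float:
    """Score an explanation by how well simulated activations match the
    ground-truth activations on held-out text (higher is better)."""
    # The simulator is an LM prompted with the explanation; it guesses how
    # strongly the unit fires on each piece of text.
    simulated = np.array([simulator.predict(explanation, t) for t in texts])
    # Agreement between simulated and ground-truth activation patterns is
    # the explanation's score (here, Pearson correlation).
    return float(np.corrcoef(simulated, true_activations)[0, 1])
```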
There are also a number of other sanity checks in the paper if you're interested!
My guess would be that it's because they paid Hypermind directly rather than making the grant to me.
If you are interested, I did a detailed analysis of different groups of forecasters here: https://bounded-regret.ghost.io/scoring-ml-forecasts-for-2023/
I wouldn't treat competitive forecasters as a homogeneous group, but I also think basically everyone was surprised by the rate of progress on the MATH dataset. The main difference is that the better forecasters adjusted quickly after the first surprise and were mostly calibrated afterward.
My forecasts actually were funded by OP! I would guess that the main counterfactual change as a result of this was going with Hypermind over Good Judgement. It might be interesting to look at differences between those populations of forecasters---I would not model "super forecasters" as homogeneous, and in retrospect the particular forecasters we got didn't seem especially good at AI questions, or else just weren't trying hard enough. But I also worked with some very good, AI-focused forecasters as a sanity check, and they were also surprised by progress, as determined by pre-registered predictions.
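To make the "not homogeneous" point concrete, the kind of per-group comparison I have in mind looks something like the sketch below (made-up numbers, not the actual forecasts; the linked post does this more carefully):

```python
from collections import defaultdict

def brier_by_group(forecasts):
    """forecasts: iterable of (group, probability, outcome) triples for
    resolved binary questions. Returns mean Brier score per group
    (lower is better)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, p, outcome in forecasts:
        sums[group] += (p - outcome) ** 2
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}

# Purely illustrative numbers:
print(brier_by_group([
    ("hypermind", 0.2, 1), ("hypermind", 0.6, 1),
    ("ai_focused", 0.5, 1), ("ai_focused", 0.8, 1),
]))
```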
Thanks, appreciate it! Interested if you have any particular tasks you'd want as part of the safety case (we are actively building out a dataset of tasks for evaluating interpretability assistants and looking for ideas).
Looks like an issue with the cross-posting (it works at https://bounded-regret.ghost.io/analyzing-long-agent-transcripts-docent/). Moderators, any idea how to fix?
EDIT: Fixed now, thanks to Oliver!
I meant coding in particular; I agree algorithmic progress is not 3x faster. I checked again just now with someone and they did indeed report a 3x speedup for writing code, although they said the new bottleneck becomes waiting for experiments to run (note this is not obviously something that can be solved by greater automation, at least up until the point that AI is picking better experiments than humans).
Research engineers I talk to already report >3x speedups from AI assistants. It seems like that has to be large enough to show up in the numbers. My null hypothesis would be that programmer productivity has been increasing exponentially for ~2 years, that this is already being taken into account in the curves, and that without this effect you would see a slower (though imo not massively slower) exponential.
(This would argue for dropping the pre-2022 models from the graph, which I think would give slightly faster doubling times, on the order of 5-6 months if I had to eyeball it.)
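Roughly, the calculation I have in mind is a log-linear fit with the pre-2022 models dropped. Here's a back-of-the-envelope sketch with placeholder numbers (not the actual data from the graph), assuming the graphed quantity grows on a log-linear trend:

```python
import numpy as np

def doubling_time_months(dates, values, cutoff_year=2022.0):
    """Fit log2(value) ~ release date and return the implied doubling time
    in months, excluding models released before cutoff_year."""
    dates, values = np.asarray(dates, float), np.asarray(values, float)
    mask = dates >= cutoff_year
    slope, _ = np.polyfit(dates[mask], np.log2(values[mask]), 1)
    return 12.0 / slope  # slope is doublings per year

# Placeholder (release year, metric) pairs, purely illustrative:
dates = [2020.5, 2021.5, 2022.3, 2023.0, 2023.8, 2024.5]
values = [1.0, 2.0, 5.0, 12.0, 30.0, 80.0]
print(f"{doubling_time_months(dates, values):.1f} months per doubling")
```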
Is the worry that if the overseer is used at training time, the model will be eval-aware and learn to behave differently when overseen?