I think I'm somewhat unusual in both the approaches I'm exploring and the results I'd like to see (not necessarily from my own research), so I'm not sure others will agree with this.
What would make me feel like we are on target? Here's what I really want to see:
One of my biggest worries is that AI will get harder to control as it gets more capable. Seeing existing results replicated on new, more powerful models (before those models are deployed!) would be great.
Also, I think under-elicitation is a current problem causing erroneously low results (false negatives) on dangerous-capability evals. Seeing more robust elicitation (including fine-tuning!!) would make me more confident in the results of those evals.
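To make that concrete, here's a minimal sketch of what a fine-tuning elicitation step before a capability eval could look like, assuming a HuggingFace-style causal LM. The model name, demonstration strings, hyperparameters, and `run_capability_eval` harness are all placeholders, not a real pipeline:

```python
# Minimal sketch: supervised fine-tuning as an elicitation step before running an eval.
# Everything named here (checkpoint, demos, eval harness) is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/base-model"                    # placeholder checkpoint
demos = ["<demonstration of the capability 1>",       # placeholder elicitation data
         "<demonstration of the capability 2>"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.train()

# Tokenize demonstrations; labels == input_ids gives the standard causal-LM loss.
batches = [tokenizer(d, return_tensors="pt", truncation=True, max_length=1024) for d in demos]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for _ in range(3):                                    # illustrative hyperparameters only
    for batch in batches:
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Only after this elicitation step would the dangerous-capability eval be run, so that an
# under-elicited (e.g. refusal-trained) surface policy doesn't mask the underlying capability.
# run_capability_eval(model, tokenizer)               # hypothetical eval harness
```

The point isn't this particular training loop; it's that the elicitation step is part of the eval protocol rather than an afterthought.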
I’m confused about how to think about this. Are there any evals where fine-tuning on a sufficient amount of data wouldn’t saturate the eval? E.g. if there’s an eval measuring knowledge of virology, then I would predict that fine-tuning on 1B tokens of the relevant virology papers would lead to a large increase on that eval.
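For what it's worth, the concrete version of that check is roughly: score the same multiple-choice knowledge eval for a base checkpoint and a domain-fine-tuned checkpoint and see how far the number moves. Here's a rough sketch assuming HuggingFace-style causal LMs; the model names and the toy eval item are placeholders, and the log-likelihood scoring ignores retokenization edge cases:

```python
# Rough sketch: compare a base and a domain-fine-tuned checkpoint on a multiple-choice
# knowledge eval, scoring each option by the log-likelihood the model assigns to it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def option_loglik(model, tokenizer, question: str, option: str) -> float:
    """Sum of log-probs the model assigns to the option's tokens, conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)          # position t predicts token t+1
    option_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(logprobs[t, full_ids[0, t + 1]].item() for t in option_positions)

def eval_accuracy(model, tokenizer, items) -> float:
    """items: list of (question, options, correct_index) tuples -- a placeholder eval set."""
    correct = 0
    for question, options, answer_idx in items:
        scores = [option_loglik(model, tokenizer, question, o) for o in options]
        correct += int(scores.index(max(scores)) == answer_idx)
    return correct / len(items)

items = [("Which of these is a virus?", ["E. coli", "Influenza A"], 1)]       # toy item only
for name in ["your-org/base-model", "your-org/base-model-domain-finetuned"]:  # placeholders
    tok = AutoTokenizer.from_pretrained(name)
    mdl = AutoModelForCausalLM.from_pretrained(name)
    print(name, eval_accuracy(mdl, tok, items))
```

If the fine-tuned checkpoint jumps to near-ceiling on a knowledge-style eval, that seems like evidence the eval is mostly measuring recallable facts rather than some harder-to-acquire capability.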
I’m curious how alignment researchers would answer these two questions: