You assume that you don't need to solve hard philosophical problems. But the superhuman researcher model probably will need to, right? Seems like a very difficult instance of weak-to-strong generalization, and I'm not sure how you would know whether you've successfully solved it.
(I'm referring to G.3 ALIGNMENT PLAN ASSUMPTIONS, which says "We assume we do not need to solve hard philosophical questions of human values and value aggregation before we can align a superhuman researcher model well enough that it avoids egregiously catastrophic outcomes.")
Here's a previous discussion between @janleike and me on the topic of philosophical problems in AI alignment, for anyone interested in more details on our perspectives: https://www.lesswrong.com/posts/FAJWEfXxws8pMp8Hk/link-why-i-m-optimistic-about-openai-s-alignment-approach?commentId=pu3SJfqAZDSskQiyo
> But the superhuman researcher model probably will need to, right?
Maybe not, if the goal of the plan is not to achieve a full singularity, but just to use a superhuman researcher for relatively uncontroversial problems like life extension and making money.
I know they flag it in the paper, but seeing the performance curves for the strong model on zero- and few-shot attempts really makes me think the data leakage issue is doing a lot of the work here. If you get the majority(?) of the PGR from e.g. 5-shot prompting, it seems like a natural takeaway is that the strong model doesn't actually need to be fine-tuned on the task, and the weak supervisor is just eliciting knowledge that's already there.
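For concreteness, here's a minimal sketch of the PGR (performance gap recovered) metric as I understand it from the paper, with made-up accuracy numbers (the specific values are purely illustrative, not from the paper). The point is that if few-shot prompting alone already recovers most of the gap between the weak supervisor and the strong ceiling, then the weak-label fine-tuning step is contributing relatively little on top of that.

```python
def performance_gap_recovered(weak_acc: float, elicited_acc: float, ceiling_acc: float) -> float:
    """PGR = (elicited strong performance - weak supervisor performance)
             / (strong ceiling performance - weak supervisor performance)."""
    return (elicited_acc - weak_acc) / (ceiling_acc - weak_acc)

# Illustrative, made-up numbers (not taken from the paper):
weak_acc       = 0.60  # weak supervisor accuracy on the task
strong_ceiling = 0.90  # strong model fine-tuned on ground-truth labels
strong_5shot   = 0.80  # strong model with 5-shot prompting, no weak-label fine-tuning
strong_w2s     = 0.84  # strong model fine-tuned on the weak supervisor's labels

print(performance_gap_recovered(weak_acc, strong_5shot, strong_ceiling))  # ~0.67 recovered by prompting alone
print(performance_gap_recovered(weak_acc, strong_w2s, strong_ceiling))    # ~0.80 recovered after weak-to-strong fine-tuning
```

Under numbers like these, most of the reported PGR would be attributable to capabilities the strong model can already express via prompting, which is the worry about data leakage doing the work.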