Weak-To-Strong Generalization (W2SG)
Humans alone will not be able to evaluate superhuman AI systems on complex tasks. Our ability to directly steer superintelligence is thus limited to weak supervision.
The solution is to take an indirect approach and safely transfer some aspects of alignment to an external process or system. The two main directions leveraging this idea are 1) automated AI alignment, with techniques such as scalable oversight to assist human evaluators, and 2) provable AI safety, which relies on mathematical proofs, algorithmic guarantees, or laws of physics to develop “AI with quantitative safety guarantees”.
Even with AI assistance, advanced systems might be capable of behaviors that are not covered by the distribution available in the supervised setting. Consequently, alignment solutions whose safety guarantees rely solely on behavioral evaluation will be insufficient. We must develop provable safety guarantees for out-of-distribution (OoD) and adversarial scenarios.
Weak-to-strong generalization (W2SG) is one potential approach to solving this problem. The goal is to leverage the principles of DL generalization to elicit the capabilities of strong models, extrapolating true human intent from incomplete and imperfect supervision. It has its roots in the provable AI safety approach but could benefit a lot from integration with automated AI alignment techniques.
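To make the weak-to-strong setup concrete, here is a minimal toy sketch (my own illustration, with scikit-learn classifiers standing in for the small and large language models used in the original W2SG paper): a deliberately weak supervisor produces imperfect labels, a higher-capacity student is trained only on those labels, and we then check how much of the ground-truth performance the student recovers.

```python
# Toy sketch of the weak-to-strong learning recipe (assumption: simple sklearn
# models stand in for the weak supervisor and strong student; the W2SG paper
# uses small and large language models instead).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Synthetic task; ground-truth labels play the role of "true human intent".
X, y = make_classification(n_samples=6000, n_features=20, n_informative=15, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_transfer, X_test, y_transfer, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 1) Weak supervisor: under-parameterized, few features, little data.
weak = LogisticRegression(max_iter=200).fit(X_weak[:300, :5], y_weak[:300])

# 2) Weak labels: the supervisor's imperfect judgments on the transfer set.
weak_labels = weak.predict(X_transfer[:, :5])

# 3) Strong student: higher-capacity model trained only on weak labels.
strong_w2s = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
strong_w2s.fit(X_transfer, weak_labels)

# 4) Strong ceiling: same architecture trained directly on ground truth.
strong_ceiling = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
strong_ceiling.fit(X_transfer, y_transfer)

print("weak supervisor acc:", weak.score(X_test[:, :5], y_test))
print("weak-to-strong acc: ", strong_w2s.score(X_test, y_test))
print("strong ceiling acc: ", strong_ceiling.score(X_test, y_test))
```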
With this sequence, I want to explore W2SG from a holistic and principled perspective. In doing so, I aim to produce the following deliverables[1]:
Epistemic strategy: High-level research strategy for studying W2SG and DL generalization more broadly.
Generalization desiderata: Formal specification for the generalization strength, types and requirements we want from our AI systems; focus on alignment-relevant capabilities/properties.
Theoretical framework: Mathematics for W2SG, covering predictive models of generalization behavior and formal guarantees for meeting desiderata.
Empirical framework: Analogous setups and evaluation protocols for the weak-to-strong learning scenario (see the metric sketch after this list). Identify and fix disanalogies between the setup introduced in the W2SG paper and the real-world weak supervision challenge.
Codebase: Necessary implementation for what could be considered a viable W2SG solution. Include test suite, visualizations and APIs for downstream use.
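For the empirical framework in particular, the evaluation protocol will likely build on the performance gap recovered (PGR) metric reported in the W2SG paper: the fraction of the gap between the weak supervisor and the strong ceiling that the weak-to-strong student closes. A minimal sketch (the function name and example numbers are my own):

```python
def performance_gap_recovered(weak_acc: float, w2s_acc: float, ceiling_acc: float) -> float:
    """PGR = (w2s - weak) / (ceiling - weak); 1.0 means full recovery, 0.0 means none."""
    gap = ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("Strong ceiling must outperform the weak supervisor.")
    return (w2s_acc - weak_acc) / gap

# Example: weak supervisor at 70%, weak-to-strong student at 82%, ceiling at 90%
# -> PGR = 0.6, i.e. 60% of the gap is recovered.
print(performance_gap_recovered(0.70, 0.82, 0.90))
```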
Ultimately, project success should be judged based on whether or not the end results could be integrated within a useful automated alignment researcher (AAR). The project scope is audacious; it is meant more to suggest what the ultimate W2SG research agenda could look like than what I could possibly deliver myself. Nevertheless, I am curious to see how much progress can be made in the following months; eager to collaborate, integrate feedback, and update my views.
- ^
I am aware that progress on these deliverables (and W2SG more broadly) has a high chance of advancing AI capabilities. I will do my best to focus on alignment and to seek expert opinion when conflicted.