Epistemic status: Naive and exploratory; reflects my preliminary conceptual understanding, awaiting a technical deep dive. 99% of the ideas are not my own; rather, they are distilled from the resources hyperlinked throughout.

Many alignment researchers err towards local optimization, i.e. they seek low-hanging fruit and leverage incremental improvements. Fast and imperfect iterative improvement is necessary, but sometimes it should be integrated within a principled and holistic strategy.

I fear that many people interested in W2SG will default to testing ideas on the basis of informed heuristics, without robust inside views to justify them. While in aggregate we might converge on effective solutions, my general thesis is that W2SG needs people adopting an AI scout approach.

This introductory post and the corresponding project aim to act on that conviction and come as close as possible to a W2SG solution. Ultimately, success should be judged on whether or not the results can be integrated within a useful automated alignment researcher (AAR).

Framing the challenge

The W2SG objective

At the highest abstraction level, solving W2SG means controlling and understanding how advanced AI systems generalize from weak supervision.

The W2SG result

Develop AI systems capable of superhuman generalization across a wide range of (alignment-relevant) tasks[1], with predictable (ideally provable) generalization bounds.

High-level strategy

A potential starting point is to identify and study the factors that influence generalization (GIFs). To then engineer and deploy scalable solutions, we require a strategy for measuring progress. Ultimately, we want to evaluate the strength and type of generalization achieved and compare them with the expected desiderata. Conceptually, a possible research plan could be structured across five pillars:

Problem specification

  • What types of generalization do we care about?
  • Which alignment-relevant tasks do we want our models to generalize across?
  • What does it mean for AI systems to generalize well?
  • What assumptions are necessary for a W2SG solution to exist (e.g. check appendix G.3)? What are the favorable generalization properties we require/expect in a deployment setting? To what extent do they correspond to reality?

Scientific understanding

  • What mathematics best capture the W2SG problem?
  • How can we predict the type and strength of generalization achieved by W2SG solutions?
  • What formal mechanisms (e.g. proofs) can guarantee W2SG outcomes, and how can we develop them?

Empirical evaluation

  • What setups faithfully capture the W2SG problem?
  • What metrics best quantify generalization desiderata?
  • What datasets best cover generalization desiderata?

Scalable solutions

  • Which factors influence generalization?
  • How to ensure W2SG solutions scale across AI paradigms, methods and settings?
  • What scalable solutions, if any, can be derived from studying GIFs?

Meta

  • What is the best high-level research strategy?
  • What are the best ways to combine W2SG with other alignment techniques?
  • What insights can we leverage from other domains, e.g. cognitive science, psychology, dynamical systems?

A1: For now, I consider the above questions to be necessary and sufficient for solving W2SG. Moving forward, we need to expand, concretize and eventually answer them[2].

Theory of impact

The superalignment challenge

AI systems (much) smarter than humans could arrive in the next 10 years.

To manage potential risks we need to solve a key technical challenge:

How can we steer and control AI systems much smarter than us?

Alignment pillars

Alignment plans should consist of three main pillars:

  1. Specification/Oversight: “Specify how the system should behave in order to satisfy design goals”. Main examples: reward modeling, direct preference-learning, constitutional principles.
  2. Training/Robustness: “Make the system reliably satisfy the specification across a range of settings”. Main examples: finetuning, RLHF, adversarial training, safety filters.
  3. Validation/Evaluation: Validate that the resulting system is aligned, i.e. satisfies our specification. Main examples: behavioral evals, interpretability.

Issues with superintelligence

Current alignment techniques have so far been successful, but they have notable issues that could become existential as we approach superintelligence.

  1. Specification techniques are unsatisfactory. For complex tasks, it might be prohibitively messy and challenging to faithfully represent human intent.
  2. Training methods are flawed.
    1. Optimization pressure and scaling can produce undesired behaviors. Understanding of such phenomena is barely starting to emerge.
    2. RLHF reinforces what humans think is true. We are training models to tell us what we want to hear instead of what is fundamentally true.
  3. Humans alone might be unable to validate/invalidate alignment for advanced AI systems.
    1. Evals research aims to develop an input-output mapping of model behavior by testing “neural networks as black boxes, estimating quality and safety using imperfect measures on a particular distribution”. This comes with problems:
      1. It could be prohibitively challenging to evaluate all meaningful behaviors[3].
      2. Humans might be fundamentally unable to rate/label model outputs on complex tasks.
      3. Out-of-distribution and adversarial settings might be impossible to study, especially in a test environment.
      4. AI systems might become deceptive[4].
  4. Interpretability seeks to explain why, how, and when models work, hoping to find general explanations that extrapolate to out-of-distribution and adversarial settings. The problem with current approaches is that they operate on extremely shaky ground:
    1. There is weak consensus on formal foundations and established practices.
    2. It is unclear what it means for humans to understand AI systems.
    3. It is unclear if all models will be amenable to human understanding.

Direct vs indirect steering

For these reasons, our ability to directly steer superhuman AI systems is limited to weak supervision. The W2SG paper shows that naive RLHF underperforms in such a setting[5], suggesting we will likely need new methods and/or scientific breakthroughs.

Maybe we could take an indirect approach. What if we could safely transfer some aspects of alignment to an external process/system? Potential directions[6]:

  1. Automated AI alignment: Use AI to automate and scale alignment ideas, potentially uncovering new breakthroughs. Techniques: SO and automated interpretability. Agendas and high-level approaches: OpenAI Superalignment, Cyborgism, and supposedly those used by GDM and Anthropic.
  2. Provable AI safety: Rely on mathematical proofs, algorithmic guarantees or laws of physics to develop “AI with quantitative safety guarantees”. Techniques: formal model checking. Agendas and high-level approaches: Safeguarded AI and Provably safe systems[7].

My current opinion is that, if a once-and-for-all superalignment solution even exists, it will likely combine these two ideas. Without a validation protocol that is external to the AI assistant we cannot guarantee safety. However, as established before, humans alone will not be able to evaluate complex AI behaviors[8]. Formal methods are the only alternative I can think of that could, in theory, solve all Specification and Validation issues mentioned above. Unfortunately, in practice, the application of formal verification to ML is bottlenecked by the scale, complexity and lack of transparency associated with currently deployed models.

One option is to relax the binary definition of safety that is so characteristic of formal methods and aim for formal heuristic arguments instead[9]. However, as noted by several researchers[10], automation could unlock the power of mathematical proofs without having to make this sacrifice. Furthermore, once we have powerful systems with provable safety guarantees, we can use them for automated AI alignment.

W2SG within a broader alignment plan

One can think of W2SG as shifting our trust from explicit AI system evaluation to the principles that govern DL generalization. W2SG is primarily grounded in the provable AI safety idea and focused on the Training pillar. However, when combined with other techniques, I think it can also be leveraged for Specification and Validation, and benefit from automated AI alignment. For example:

W2SG for Specification

Models capable of strong generalization could represent/generate better specifications.

W2SG + RM: Use W2SG techniques to train better reward models.

W2SG + Debate: Train W2SG models on a dataset of human judgements on debates in order to generalize the judgements to a dataset of new (potentially superhuman) debates. This latter dataset can then be used to supervise future models.

W2SG for Training

If it were to work perfectly, a W2SG solution should, on its own, train aligned superhuman models. The intuition is that “we can do better than naively relying on unreliable feedback from a weak overseer and instead access the latent, greater, capabilities that our AI system has, perhaps by a simple modification of the training objective”.
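
To make “a simple modification of the training objective” concrete, here is a minimal PyTorch sketch loosely modeled on the auxiliary confidence loss idea from Burns et al.: the strong student imitates the weak labels, while an auxiliary term rewards it for staying confident in its own hardened predictions. The blending weight `alpha` and its schedule are placeholder choices of mine, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(strong_logits: torch.Tensor,
                             weak_probs: torch.Tensor,
                             alpha: float) -> torch.Tensor:
    """Blend imitation of the weak supervisor with self-confidence.

    strong_logits: (batch, num_classes) logits from the strong student.
    weak_probs:    (batch, num_classes) soft labels from the weak supervisor.
    alpha:         weight of the self-confidence term (placeholder schedule).
    """
    log_probs = F.log_softmax(strong_logits, dim=-1)

    # Term 1: cross-entropy against the weak supervisor's soft labels.
    weak_ce = -(weak_probs * log_probs).sum(dim=-1).mean()

    # Term 2: cross-entropy against the strong model's own hardened predictions,
    # encouraging it to trust its latent knowledge over noisy weak labels.
    hard_self_labels = strong_logits.argmax(dim=-1)
    self_ce = F.cross_entropy(strong_logits, hard_self_labels)

    return (1.0 - alpha) * weak_ce + alpha * self_ce
```

During training, `alpha` would typically start at 0 (pure imitation of the weak supervisor) and ramp up, gradually letting the student trust its own predictions more than the weak labels.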

W2SG for Validation

W2SG + Formal verification: A W2SG theoretical framework could allow us to cross-check strong models trained using W2SG techniques against the behaviors suggested by our formal understanding.

W2SG + SO: A more direct approach is to train W2SG models to assist human evaluators, e.g. we could have a powerful code LLM generalize the concept of truth and then ask it questions about AI-generated programs.

Interpretability for W2SG: An interesting idea would be to train reward models with W2SG to generalize beyond human capability, and then use interpretability to understand the intent of superhuman models.

W2SG for Interpretability: Improve automated interpretability e.g. use W2SG solutions to train models that achieve superhuman performance at explaining patterns of neural activation.

W2SG + Evals: Reward models trained with W2SG might benefit behavioral evals.

W2SG within the Superalignment agenda

In appendix G.1, W2SG is regarded as one potential solution for aligning the first superhuman AAR[11]. In practice, it is unlikely to be used on its own; rather, it will be combined with other techniques, as exemplified above.

The Superalignment team seems particularly excited about this “high-level cross-validation” approach where they[12]:

  1. Select a pair of (Training, Validation) techniques[13].
  2. Use some dataset splitting method, e.g. k-fold, to define training and validation subsets.
  3. Train a model on the training dataset using the Training technique.
  4. Train another model on the validation dataset using the Validation technique.
  5. Use the Validation model to answer alignment-relevant questions about the Training model.
  6. Study meta-questions, e.g. “Well, are these fundamentally different models or the same models? What are the important differences?”.
  7. Use insights gained in steps 5 and 6 to improve the alignment techniques.
  8. Possibly automate the entire workflow (a toy sketch of this loop follows below).
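
Below is a toy sketch of this loop, purely to make the steps concrete. The `train_with`/`validate_with` callables, the `ask` interface and the question set are hypothetical placeholders of mine, not OpenAI's actual tooling.

```python
import random

def k_fold_splits(dataset, k=5, seed=0):
    """Shuffle and split the dataset into k roughly equal folds."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    return [data[i::k] for i in range(k)]

def high_level_cross_validation(dataset, train_with, validate_with, questions, k=5):
    """One round of the loop sketched in the list above.

    train_with / validate_with: callables standing in for a (Training,
    Validation) technique pair; `ask` is an imagined interface for querying
    the Validation model about the Training model.
    """
    insights = []
    folds = k_fold_splits(dataset, k)
    for i, validation_fold in enumerate(folds):
        training_data = [x for j, fold in enumerate(folds) if j != i for x in fold]

        training_model = train_with(training_data)          # step 3
        validation_model = validate_with(validation_fold)   # step 4

        # Steps 5-6: interrogate the Training model and record meta-observations.
        answers = {q: validation_model.ask(q, about=training_model) for q in questions}
        insights.append({"fold": i, "answers": answers})

    # Step 7: the caller uses these insights to improve both techniques.
    return insights
```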

W2SG within the Safeguarded AI agenda

TBD.[14]

W2SG compared

Before reaching a conclusion about the W2SG theory of impact, I want to spend a few minutes talking about:

W2SG vs SO

In some sense, W2SG and SO, if they worked perfectly, could each solve the same problem in isolation.

The intuition for SO is to make the supervision signal stronger so that we can return to the traditional supervised learning regime. Its ultimate objective is to use AI assistants to evaluate everything that our superhuman systems could do.

Of course in practice you couldn’t literally do that because AI systems will just do a lot, but you can look at everything with some small independent probability. But then you’re still left with this question of: how do you know that the models generalize to the cases where you’re not looking at?

This is basically the OoD generalization challenge (framed in an alignment context) and W2SG is the most straightforward way to address it.

The initial thinking was more modest:

Ideally, we can leverage generalization to make expensive evaluation a lot cheaper. Arguably reward models already fulfill this role: instead of providing a comparison for each episode during RL training we only provide comparisons for a subset of them and let our models generalize to the rest of them. But this is i.i.d generalization because we sample the tasks that we supervise at random from all the tasks that we do.

However, after defining an analogous setup that enables empirical experimentation, the thinking has shifted towards considering the possibility of achieving and understanding weak-to-strong generalization as yet another scalable alignment technique. That being said, in the absence of a comprehensive and rigorous formal understanding, we cannot rely on W2SG alone (more on this later). We would still need a Validation technique like SO.
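
For reference, the empirical setup in Burns et al. quantifies the strength of weak-to-strong generalization with the “performance gap recovered” (PGR) metric: the fraction of the gap between the weak supervisor's performance and the strong model's ceiling that is closed by training on weak labels. A minimal sketch (the accuracy numbers are placeholders):

```python
def performance_gap_recovered(weak_perf: float,
                              weak_to_strong_perf: float,
                              strong_ceiling_perf: float) -> float:
    """Fraction of the (weak supervisor -> strong ceiling) performance gap
    recovered by training the strong model on weak labels. 1.0 means the
    student fully recovers its ceiling despite only seeing weak supervision."""
    gap = strong_ceiling_perf - weak_perf
    if gap <= 0:
        raise ValueError("The strong ceiling should exceed weak performance.")
    return (weak_to_strong_perf - weak_perf) / gap

# Placeholder accuracies, for illustration only:
pgr = performance_gap_recovered(weak_perf=0.60,
                                weak_to_strong_perf=0.72,
                                strong_ceiling_perf=0.80)
print(pgr)  # ~0.6: the student recovers roughly 60% of the gap
```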

W2SG vs Interpretability

TBD.

W2SG vs Formal verification

TBD.

Why work on W2SG

Why W2SG might be essential

If our AI systems are capable of behaviors that cannot be covered by the distribution available in the supervised setting (even when considering AI assistance), then alignment solutions whose safety guarantees rely on evaluation will be insufficient. We must develop safety guarantees for OoD and adversarial scenarios. The only W2SG alternatives I am aware of in this regard are interpretability and formal verification but, as sketched above, they seem much more challenging, especially when pursued in isolation. W2SG could solve the “OoD and adversarial” challenge and, additionally, contribute to progress in interpretability and formal verification.

Why W2SG might be possible

TBD.[15]

Counterarguments

The fundamental concern:

…when we actually apply it to tasks we can’t evaluate and don’t have ground truth for, we are to some extent taking a “leap of faith on generalization.” We might have all kinds of empirical data on how it works in other settings, and we might still be able to check in some rare cases using extremely expensive evaluation procedures. But since we should expect superhuman models to understand when we’re taking the leap of faith, this is a particularly natural point for models to turn on us.

Distilling this quote:

  • “when we actually apply it to tasks we can’t evaluate and don’t have ground truth for” - A likely situation we must assume.
  • regardless of having “all kinds of empirical data on how it works in other settings” or “using extremely expensive evaluation procedures” such as those based on SO or interpretability - In practice, we will evaluate the type/strength of W2SG using common ML evaluation approaches and/or, in more complex cases, our Validation techniques.
  • we are taking a “leap of faith on generalization” - Because it is impossible to evaluate everything using the somewhat direct approaches mentioned above, for those cases that we miss, we must trust generalization to correctly extrapolate human intent.
  • “since we should expect superhuman models to understand when we’re taking the leap of faith [on generalization], this is a particularly natural point for models to turn on us” - If we trust generalization, assuming we don't know the ground truth, we might end up accepting misaligned, potentially harmful, AI behavior.

Why the AI scout approach[16]

Interdisciplinary argument

As a nascent topic, W2SG lacks a theoretical framework, possibly a key requirement if we want to go beyond an approximate understanding of how powerful models generalize human intent.

To develop such a framework, we might need to venture into mathematically rich domains such as learning and information theory, or even borrow conceptual insights from more disparate fields including neuroscience, physics and complex systems.

First-principles argument

W2SG is a special case of weakly-supervised learning (WSL), focused on OoD generalization and its implications for alignment.

Burns et al. introduce W2SG primarily from an LLM alignment perspective i.e. WSL applied to the preference finetuning setting. However, I think a first-principles mindset is needed because:

We do not yet know what future models will look like. We should update our setup over time as we learn more about how broadly superhuman models will be built.

Broader-scope argument

I would go even further than the above quote and suggest studying W2SG decoupled from a particular AI paradigm or even alignment, e.g. AI agents might pose additional generalization concerns beyond those of frontier models.

Instead of seeking to “control and understand how powerful models generalize from weak supervision”, I find value in extending the scope to AI systems more broadly.

Project deliverables

A2: I consider the following results to be necessary and sufficient for answering the broad questions outlined in Framing the challenge.[17]

  1. Epistemic strategy: High-level research strategy for studying W2SG and DL generalization more broadly. It should answer Meta questions.
  2. Generalization desiderata: Formal specification for the generalization strength, types and requirements we want from our AI systems; focus on alignment-relevant capabilities/properties.[18]
  3. Theoretical framework: Mathematics for W2SG, covering predictive models of generalization behavior and formal guarantees for meeting desiderata.
  4. Empirical framework: Analogous setups and evaluation protocols for the weak-to-strong learning scenario. Identify and fix disanalogies between the setup introduced in the W2SG paper and the real-world weak supervision challenge.
  5. Codebase: Necessary implementation for what could be considered a viable W2SG solution. Include test suite, visualizations and APIs for downstream use.

Rough plan

Three stages[19]:

Background

Milestones: Deliverables (D) 1-4, materialized into one report/blog.

Duration: 1.5-3 months[20]

Strategy

  • 1 week: Brief research into AI alignment epistemics and philosophy of science, characterize the epistemic access problem and develop inside view on alignment epistemics, sketch D1[21].
  • 1-3 weeks: Review AI capabilities and alignment fundamentals e.g. DL theory and essential hands-on, transformer architecture, frontier models, RLHF, ML paradigms (fully-supervised, unsupervised, semi-supervised, weakly-supervised, meta-learning), alignment agendas and techniques.
  • 2 weeks - 1 month: Deep dive into W2SG-related theoretical subjects, identify GIFs, sketch D2 and D3 addressing how each GIF[22] fits in.
  • 2 weeks - 1 month: Deep dive into W2SG-related domains, brief research into experiment design, sketch D4.

Analysis

Milestones: Literature review, preliminary experiments, report with promising W2SG solutions.

Duration: 3-5 months

Strategy       

  • 2 weeks - 1 month: Survey the research landscape covering GIFs and W2SG-related domains with their corresponding methodologies, integrate W2SG within your AI (alignment) worldview.
  • 1-2 weeks: Having in mind D4, plan implementation and estimate compute requirements.
  • 2-4 months: Explore the W2SG solution space[23] (a toy sweep harness is sketched after this list):
    • For each GIF, run experiments to answer:
      1. How, when and to what extent does it contribute to D2?
      2. What changes do we need to maximize the likelihood of meeting D2?
      3. Given D3 and D4, how can we rigorously demonstrate the results?
    • If D2 is well-defined and we manage to identify the right GIFs, answering questions 1-3 should in principle lead to optimal solutions.
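
Purely for intuition, a toy harness for this exploration loop might look like the sketch below. The factor names, the grid values and `run_w2sg_experiment` are hypothetical placeholders of mine, to be replaced once D2-D4 take shape.

```python
import itertools

# Hypothetical GIFs and the values to sweep over (names and ranges are mine).
GIF_GRID = {
    "weak_strong_compute_gap": [10, 100, 1000],
    "auxiliary_confidence_loss": [False, True],
    "finetune_scope": ["head_only", "full_model"],
}

def run_w2sg_experiment(config):
    """Placeholder: train a strong student under weak supervision with this
    configuration and return a desiderata score (e.g. PGR on held-out tasks).
    Depends on the empirical framework (D4)."""
    raise NotImplementedError

def explore_solution_space():
    results = []
    keys = list(GIF_GRID)
    for values in itertools.product(*GIF_GRID.values()):
        config = dict(zip(keys, values))
        try:
            score = run_w2sg_experiment(config)
        except NotImplementedError:
            score = None  # filled in once D4 exists
        results.append((config, score))
    # Rank configurations by how well they meet the generalization desiderata (D2).
    return sorted(results, key=lambda r: (r[1] is None, -(r[1] or 0.0)))
```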

Synthesis

Milestones: Final codebase, integration with AAR, publish results.

Duration: 2-4 months

Strategy 

  • 1-2 months: Finish implementation and build APIs for downstream use.
  • 1-3 months: Develop a robust test suite and evaluate methods in realistic AAR settings.
  • 1 month: Prioritize engagement with the wider alignment community and write the final publication.
  1. ^

    Umbrella-term for: properties, capabilities, domains, concepts and procedures.

  2. ^

    The reasoning behind this assumption is as follows: fundamentally, it seems that if we could define accurate generalization desiderata and develop a way to measure progress towards accomplishing them, then the most straightforward strategy would be to identify, study and tinker with the factors that influence generalization. In theory, if we find all meaningful factors and combine and adjust them in the best possible way, then we could come close to an optimal W2SG solution.

  3. ^
  4. ^

    Even if the supervision process continues in deployment, infrastructure failures/reboots could represent great opportunities for AI systems to become deceptive.

  5. ^

    Even though I believe naive RLHF will fail regardless, it is worth mentioning for completeness that, due to disanalogies between real-world supervision and the analogous setup used, results should be taken with a grain of salt.

  6. ^

    My understanding of these topics is at an early stage. This is why the following paragraphs are kept concise and general. However, I plan to dive deeper and write future posts on both automated AI alignment and provable AI safety.

  7. ^

    These two agendas also have a significant AI automation component.

  8. ^

    It is expected that the first useful AAR will be superhuman in meaningful ways.

  9. ^
  10. ^

    At least for less critical applications.

  11. ^

    For now, I think it is suboptimal to speculate on the use of W2SG beyond that point.

  12. ^

    An extremely simplified description of what I think one of their approaches could be. I plan to write a blog post soon about the Superalignment agenda and automated AI alignment more broadly.

  13. ^

    I refer here to alignment techniques when focused on either the Training or the Validation pillar. For example, we can view W2SG as a Training technique when the goal is to just align the model and as a Validation technique when it is used to train models later used to assist human evaluators.

  14. ^

    All TBD sections are mentioned for completeness/reference. I plan to write them as I make further progress with this project.

  15. ^

    Rough empirical intuitions are presented in Burns et al. However, I want to develop robust theoretical arguments before writing this section.

  16. ^

    I am not mentioning them since this post is already too long, but there are obvious counterarguments to watch out for when adopting the AI scout approach.

  17. ^

    High confidence that the assumption holds for deliverables 2-4. Uncertain about 1, which might be redundant.

  18. ^

    Potentially, a set of qualitative and quantitative requirements based on each generalization type.

  19. ^

    I expect the stages and individual steps to overlap. This is why the time estimates don’t always add up. They are meant to suggest approximate duration for each step.

  20. ^

    I am assuming little prior experience with ML and alignment.

  21. ^

    This step could perhaps be skipped or postponed?

  22. ^

    CTRL + F if you forgot.

  23. ^

    Trivially simplified, preliminary approach.
