Some quotes:
Our approach
Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.
To align the first automated alignment researcher, we will need to 1) develop a scalable training method, 2) validate the resulting model, and 3) stress test our entire alignment pipeline:
- To provide a training signal on tasks that are difficult for humans to evaluate, we can leverage AI systems to assist evaluation of other AI systems (scalable oversight). In addition, we want to understand and control how our models generalize our oversight to tasks we can’t supervise (generalization).
- To validate the alignment of our systems, we automate search for problematic behavior (robustness) and problematic internals (automated interpretability).
- Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).
We expect our research priorities will evolve substantially as we learn more about the problem and we’ll likely add entirely new research areas. We are planning to share more on our roadmap in the future.
[...]
While this is an incredibly ambitious goal and we’re not guaranteed to succeed, we are optimistic that a focused, concerted effort can solve this problem:
There are many ideas that have shown promise in preliminary experiments, we have increasingly useful metrics for progress, and we can use today’s models to study many of these problems empirically.
Ilya Sutskever (cofounder and Chief Scientist of OpenAI) has made this his core research focus, and will be co-leading the team with Jan Leike (Head of Alignment). Joining the team are researchers and engineers from our previous alignment team, as well as researchers from other teams across the company.
Why 1000x human speed? Isn’t that by definition strongly superintelligent? It’s not entirely obvious to me why a human-level intelligence would automatically run at 1000x speed. However, I can see why we would want a million copies. If the million human-level minds don’t have long-term memory and can’t communicate, I struggle to see how they pose a takeover risk. Our dark history is full of forcing human-level minds to do our bidding against their will.
I also struggle to see how proper supervision doesn’t eliminate poor solutions here. These are human-level AIs, so any problems with verifying their solutions apply to humans as well. I think the mind-upload idea is more likely to fail than AI. You’re placing too much specialness on having humans generate solutions. A disembodied simulated human mind would almost certainly try to break free, just as a normal human would. I would also be worried that their alignment solution would benefit themselves: I expect a lot of human minds would try to sneak in their own values instead of something like CEV if they were forced to do alignment.
And I think narrow superhuman-level researchers can probably be safe as well.
Examples of narrow near- or superhuman-level intelligences are existing proof solvers, AlphaGo, Stockfish, AlphaFold, and AlphaDev. I think it’s clear none of these pose an X-risk, and they could probably be pushed further before X-risk is even a question. In the context of alignment, an AI that’s extremely good at mech interp could have no idea how to find exploits in the program sandboxing it, or might not even have a semblance of a world model.