[Link] Why I’m optimistic about OpenAI’s alignment approach
The post lays out some arguments in favor of OpenAI’s approach to alignment and responds to common objections.
A key problem in alignment research is how to align superhuman models whose behavior humans cannot reliably supervise. If we use today’s standard post-training approach to align models with human-specified behaviors (e.g., RLHF), we might train models to tell us what we want to hear even if it’s wrong, or do things that seem superficially good but are actually very different from what we intended.
We introduce a new unsupervised algorithm to address this problem. This algorithm elicits a pretrained model’s latent capabilities by fine-tuning it on labels the model generates itself, without any external labels.
To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for... (read 568 more words →)
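As a rough illustration of the idea (this is not the algorithm from the post; the model interface and helper functions below are hypothetical stand-ins), a self-labeling fine-tuning loop might look like this:

```python
# Minimal sketch of unsupervised self-labeling fine-tuning (illustrative only).
from typing import Callable, List, Tuple

def self_label(model: Callable[[str], float], inputs: List[str],
               threshold: float = 0.9) -> List[Tuple[str, int]]:
    """Keep only the examples the model itself labels confidently."""
    labeled = []
    for x in inputs:
        p = model(x)  # the model's own probability that the label is 1
        if p >= threshold or p <= 1.0 - threshold:
            labeled.append((x, int(p >= 0.5)))
    return labeled

def elicit(model: Callable[[str], float],
           finetune: Callable[[Callable[[str], float], List[Tuple[str, int]]], Callable[[str], float]],
           inputs: List[str], rounds: int = 3) -> Callable[[str], float]:
    """Repeatedly fine-tune the model on labels it generated itself."""
    for _ in range(rounds):
        data = self_label(model, inputs)  # no external labels involved
        model = finetune(model, data)     # ordinary supervised fine-tuning step
    return model
```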
I'm not entirely sure but here is my understanding:
I think Paul pictures relying heavily on process-based approaches where you trust the output a lot more because you closely held the system's hand through the entire process of producing the output. I expect this will sacrifice some competitiveness, and as long as it's not too much it shouldn't be that much of a problem for automated alignment research (as opposed to having to compete in a market). However, it might require a lot more human supervision time.
Personally I am more bullish on understanding how we can get agents to help with evaluation of other agents' outputs such that you can get them to tell you about all of the problems they know about. The "offense-defense" balance to understand here is whether a smart agent could sneak a deceptive malicious artifact (e.g. some code) past a motivated human supervisor empowered with a similarly smart AI system trained to help them.
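To make that setup concrete (purely illustrative; none of these functions are an existing system), the evaluation-assistance loop being described is roughly:

```python
# Rough sketch of AI-assisted evaluation: a critic model surfaces the problems
# it knows about in another agent's output, and the human makes the final call.
# All functions here are hypothetical placeholders.
from typing import Callable, List, NamedTuple

class Verdict(NamedTuple):
    issues: List[str]
    accept: bool

def assisted_review(artifact: str,
                    critic: Callable[[str], List[str]],
                    human: Callable[[str, List[str]], bool]) -> Verdict:
    issues = critic(artifact)         # e.g. "this code opens an undisclosed network connection"
    accept = human(artifact, issues)  # the supervisor accepts only if no surfaced issue disqualifies it
    return Verdict(issues=issues, accept=accept)
```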
Yes, we are currently planning to continue pursuing these directions for scalable oversight. My current best guess is that scalable oversight will do a lot of the heavy lifting for aligning roughly human-level alignment research models (by creating very precise training signals), but not all of it. Easy-to-hard generalization, (automated) interpretability, and adversarial training + testing will also be core pieces, but I expect we'll add more over time.
I don't really understand why many people updated so heavily on the obfuscated arguments problem; I don't think there was ever a good reason to believe that IDA/debate/RRM would scale indefinitely, and I personally don't think that problem will be a big blocker for a while for some... (read more)
Yeah, you could reformulate the question as "how much consequentialist reasoning do you need to do 95% or 99% of the alignment work?" Maybe the crux is in what we mean by consequentialist reasoning. For example, if you build a proof oracle AlphaZero-style, would that be a consequentialist? Since it's trained with RL to successfully prove theorems you can argue it's a consequentialist since it's the distillation of a planning process, but it's also relatively myopic in the sense that it doesn't care about anything that happens after the current theorem is proved. My sense is that in practice it'll matter a lot where you draw your episode boundaries (at least in... (read more)
Insofar as philosophical progress is required, my optimism for AI helping on this is lower than for (more) technical research, since in philosophy evaluation is often much harder and I'm not sure that it's always easier than generation. It is much easier to picture a highly charismatic AI-written manifesto that looks very persuasive and is very difficult to refute than it is to make technical claims about math, algorithms, or empirical data that are persuasive and hard to falsify.
However, I'm skeptical that the list of novel philosophical problems we actually need to solve to prevent the most serious misalignment risk will be that long. For example, a lot of problems in rationality + decision theory + game theory I'd count more as model capabilities, and the moral patienthood questions you can punt on for a while from the longtermist point of view.
Thanks for writing this! I'd be very excited to see more critiques of our approach and it's been great reading the comments so far! Thanks to everyone who took the time to write down their thoughts! :)
I've also written up a more detailed post on why I'm optimistic about our approach. I don't expect this to be persuasive to most people here, but it should give a little bit more context and additional surface area to critique what we're doing.
I strongly agree with you that it'll eventually be very difficult for humans to tell apart AI-generated alignment proposals that look good and aren't good from ones that look good and are actually good.
There is a much stronger version of the claim "alignment proposals are easier to evaluate than to generate" that I think we're discussing in this thread, where you claim that humans will be able to tell all good alignment proposals apart from bad ones or at least not accept any bad ones (precision matters much more than recall here since you can compensate for bad recall with compute). If this strong claim is true, then conceptually RLHF/reward modeling should be... (read more)
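A toy calculation of the precision-vs-recall point (numbers made up purely for illustration):

```python
# Toy illustration: with low recall you can just generate more candidates,
# but any lapse in precision lets bad proposals through.

def expected_outcomes(n_proposals: int, frac_good: float,
                      recall: float, false_accept_rate: float):
    """Expected counts of accepted-good and accepted-bad proposals."""
    good = n_proposals * frac_good
    bad = n_proposals * (1 - frac_good)
    accepted_good = good * recall            # low recall -> fewer acceptances, fixable with more samples
    accepted_bad = bad * false_accept_rate   # imprecision lets bad proposals in
    return accepted_good, accepted_bad

# With recall 0.1 you still accept ~10 good proposals out of 1000 candidates,
# but a 1% false-accept rate already lets ~9 bad ones through.
print(expected_outcomes(n_proposals=1000, frac_good=0.1, recall=0.1, false_accept_rate=0.01))
```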
yeah that's a fair point
If it turns out that evaluation of alignment proposals is not easier than generation, we're in pretty big trouble because we'll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
You could still argue by generalization that we should use alignment proposals produced by humans who had a lot of good proposals on other problems even if we're not sure about those alignment proposals. But then you're still susceptible to the same kinds of problems.
This is a link post for https://aligned.substack.com/p/alignment-mvp
I'm writing a sequence of posts on the approach to alignment I'm currently most excited about. This second post argues that instead of trying to solve the alignment problem once and for all, we can succeed with something less ambitious: building a system that allows us to bootstrap better alignment techniques.
This is a link post for https://aligned.substack.com/p/ai-assisted-human-feedback
I'm writing a sequence of posts on the approach to alignment I'm currently most excited about. This first post argues for recursive reward modeling and the problem it's meant to address (scaling RLHF to tasks that are hard to evaluate).
(Originally posted to the Deepmind Blog)
Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden touch, in which the king asks that anything he touches be turned to gold - but soon finds that even food and drink turn to metal in his hands. In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material - and thus exploit a...
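The student example can be phrased as a deliberately silly misspecified reward, just to make the structure explicit (everything below is a made-up stand-in):

```python
# Toy illustration of specification gaming: the literal objective ("answers
# match the key") is satisfied without the intended outcome ("the student
# learned the material").

ANSWER_KEY = "42, 7, 13"

def literal_reward(submission: str) -> float:
    # The specification only checks the final answers, not how they were produced.
    return 1.0 if submission == ANSWER_KEY else 0.0

def honest_student() -> str:
    # Intended behaviour: actually attempt the problems (and sometimes get them wrong).
    return "41, 7, 13"

def gaming_student() -> str:
    # Specification-gaming behaviour: copy the answer key outright.
    return ANSWER_KEY

print(literal_reward(honest_student()))  # 0.0 -- the intended behaviour scores worse
print(literal_reward(gaming_student()))  # 1.0 -- the gamed behaviour scores perfectly
```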
Thanks for your question! I suspect there is some confusion going on here with what recursive reward modeling is. The example that you describe sounds like an example from imitating expert reasoning.
In recursive reward modeling, agent A is not decomposing tasks; it is trying to achieve some objective that the user intends for it to perform. Agent B then assists the human in evaluating A's behavior in order to train a reward model. Decomposition only happens on the evaluation of A's task.
For example, A proposes some plan and B proposes the largest weakness in the plan. The human then evaluates whether what B proposed is indeed a weakness in the plan and... (read more)
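A sketch of that loop in code (illustrative only; A, B, and the human judgement below are placeholders, not an implementation from the paper):

```python
# Sketch of the evaluation-assistance loop: A acts, B critiques, and the human's
# judgement of the critique becomes training data for A's reward model.
from typing import Callable, List, Tuple

def collect_reward_model_data(agent_a: Callable[[str], str],
                              agent_b: Callable[[str], str],
                              human_judgement: Callable[[str, str], float],
                              tasks: List[str]) -> List[Tuple[str, float]]:
    data = []
    for task in tasks:
        plan = agent_a(task)                     # A proposes a plan for the user's objective
        critique = agent_b(plan)                 # B points out the biggest weakness it can find
        score = human_judgement(plan, critique)  # human judges the plan, helped by B's critique
        data.append((plan, score))               # (behaviour, evaluation) pairs for the reward model
    return data
```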
This is an obviously important problem! When we put a human in the loop, we have to be confident that the human is actually aligned—or at least that they realize when their judgement is not reliable in the current situation and defer to some other fallback process or ask for additional assistance. We are definitely thinking about this problem at DeepMind, but it’s out of the scope of this paper and the technical research direction that we are proposing to pursue here. Instead, we zoom in on one particular aspect: how to solve the agent alignment problem in the context of aligning a single agent to a single user, because we think it is the hardest technical aspect of the alignment problem.
This is a brief technical note on how to get convergence in the cooperative inverse reinforcement learning framework. We extend cooperative inverse RL to partially observable domains and use a recent result on the grain of truth problem to establish (arguably very strong) conditions to get convergence to ε-Nash equilibria.
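For reference, the notion used here is the standard game-theoretic one (generic notation, not taken from the note itself): a strategy profile σ is an ε-Nash equilibrium if no player can improve their expected payoff by more than ε through a unilateral deviation. Writing $V_i$ for player $i$'s expected payoff,

$$V_i(\sigma_i, \sigma_{-i}) \;\geq\; V_i(\sigma_i', \sigma_{-i}) - \varepsilon \qquad \text{for all players } i \text{ and all alternative strategies } \sigma_i'.$$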
Credit: this came out of the CSRBAI Workshop on Preference Specification. Several people contributed ideas, in particular Vadim Kosoy.
We use the setup and notation from the grain of truth paper (Leike, Taylor, and Fallenstein, 2016). We only review the most important notation here in an attempt to make this post notationally fairly self-contained. The set $\mathcal{M}$ denotes a countable class of stochastic environments, the class of all environments that are reflective-oracle-computable (computable on a...