Rauno Arike

Thank you for the detailed feedback! I found it very helpful and not at all rude or mean.

I suspect there are a few key disagreements between us that make me more optimistic about this project setup than you are. I'd be curious whether you agree that these points are important cruxes:

  • Though I agree that our work primarily belongs within the model organisms paradigm, I disagree that it's only useful as a means to study in-forward-pass goal-directedness. I think there's a considerable chance that the Translucent Thoughts hypotheses are true and that AGI will be reached with an architecture very similar to current transformers, which I wouldn't expect to be able to do dangerous goal-directed reasoning within a single forward pass because of the limited number of serial reasoning steps. If this is the case, CoT-based goal-directedness seems relevant to study for its own sake. We're still studying an artificial scenario, in the sense that we're giving the models the goal ourselves, whereas in deceptive alignment scenarios the model would have to develop the goal on its own. Still, I don't think we necessarily have to say anything about in-forward-pass goal-directedness for our work to be applicable to realistic threat models.
  • Studying the intrinsic goals of the models in naturalistic settings would also be interesting, but I expect this to be hard, and also less relevant to the threat models we're worried about. I'm mainly concerned about models becoming more goal-directed with scale and someone figuring out long-horizon RL training on top of that, leading to models that are qualitatively different from current ones. To address this, I'd like to have a benchmark that decomposes goal-directedness into several relevant abilities and provides an accurate score for these abilities across model generations. This would help us better track how models become more goal-directed as they scale, if at all. Additionally, if a new model scores close to humans on many of these abilities, it could serve as a strong alarm bell that we are approaching dangerously goal-directed models. It seems easier to build this sort of benchmark without going through the complicated process of trying to infer the intrinsic goals of the models.
  • The aforementioned decomposition of goal-directedness into various relevant abilities would also be the main value added on top of existing agent benchmarks. We should maybe have been clearer in the post about planning to develop such a decomposition. Since goal-fulfillment is easy to evaluate, it was our main focus in the early stages of the project. Eventually, though, we're hoping to decompose goal-directedness into several abilities such as instrumental reasoning ability, generalization to OOD environments, coherence, etc., somewhat analogously to how the Situational Awareness Dataset decomposes situational awareness into self-knowledge, inferences, and actions.

I definitely agree that it would be interesting to compare the goal-directedness of base models and fine-tuned models, and this is something we're planning to eventually do if our compute budget permits. Similarly, I strongly agree that it would be worth studying whether anything interesting is going on in the situations where the models exhibit goal-directed behavior, and I'm very interested in looking further into your suggestions for that!

Thanks, that definitely seems like a great way to gather these ideas together!

I guess the main reason my arguments don't address the argument at the top is that I interpreted Aaronson's and Garfinkel's arguments as "It's highly uncertain whether any of the technical work we can do today will be useful" rather than as "There is no technical work that we can do right now to increase the probability that AGI goes well." I think that it's possible to respond to the former with "Even if that is so and this work really does have a high chance of being useless, there are many good reasons to do it nevertheless," while assuming the latter inevitably leads to the conclusion that one should do something else instead of this knowably-useless work.

My aim with this post was to take an agnostic standpoint on whether that former argument is true and to argue that even if it is, there are still good reasons to work on AI safety. I chose this framing because I believe it's useful for people new to the field, who don't yet know enough about it to make good guesses about how likely it is that AGI will be similar to today's ML systems or to human brains, to think about whether it's worth working on AI safety even if the chance that we'll build prosaic or brain-like AGI turns out to be low.

That being said, I could definitely have done a better job writing the post, for example by laying out the claim I'm arguing against more clearly at the start and by connecting argument 4 more directly to the argument that there's a significant chance we'll build a prosaic or brain-like AGI. It might also be that the quotes by Aaronson and Garfinkel convey the argument you thought I was arguing against rather than what I interpreted them to convey. Thank you for the feedback and for helping me realize the post might have these problems!