I work on deceptive alignment and reward hacking at Anthropic.
This is cool work! There are two directions I'd be excited to see explored in more detail:
Having just finished reading Scott Garrabrant's sequence on geometric rationality (https://www.lesswrong.com/s/4hmf7rdfuXDJkxhfg), these lines:
- Give a de-facto veto to each major faction
- Within each major faction, do pure democracy.
remind me very much of additive expectation/maximization within coordinated objects and multiplicative expectation/maximization between adversarial ones. For example: maximizing the expected reward within a hypothesis, but sampling which hypothesis to listen to for a given action according to its expected utility, rather than just taking the max.
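To make the pattern I have in mind concrete, here is a minimal sketch of "maximize within a hypothesis, sample between hypotheses in proportion to expected utility." The hypotheses, actions, and utilities are made up for illustration and aren't from the post or the sequence:

```python
import random

# Hypothetical example: each "hypothesis" assigns an expected reward to each action.
hypotheses = {
    "h1": {"a": 1.0, "b": 0.2, "c": 0.0},
    "h2": {"a": 0.1, "b": 0.8, "c": 0.3},
    "h3": {"a": 0.0, "b": 0.1, "c": 0.6},
}

def pick_action(hypotheses, rng=random):
    # Within each hypothesis (the "coordinated" object): ordinary additive
    # maximization -- take the action that hypothesis expects to be best,
    # keeping (action, expected utility) for each hypothesis.
    best = {h: max(rewards.items(), key=lambda kv: kv[1])
            for h, rewards in hypotheses.items()}

    # Between hypotheses (the potentially "adversarial" objects): instead of
    # deferring to whichever hypothesis promises the highest utility, sample
    # which hypothesis gets to choose this action, with probability
    # proportional to its expected utility.
    names = list(best)
    weights = [max(best[h][1], 0.0) for h in names]  # clip negatives so weights are valid
    chosen = rng.choices(names, weights=weights, k=1)[0]
    action, _expected_utility = best[chosen]
    return action, chosen

action, hypothesis = pick_action(hypotheses)
print(f"hypothesis {hypothesis} chose action {action}")
```

Taking the argmax across hypotheses instead of sampling would be the purely additive version; the sampling step is what gives each "faction" a say in proportion to how much it cares.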
Thank you for catching this.
Those links pointed to section titles in our draft Google Doc for this post. I have replaced them with references to the corresponding sections of this post.
Thank you for pointing this out. I should have been more clear.
I have added a link to the 7 samples in which the model tampers with both its reward and the tests, according to our operational definition: https://github.com/anthropics/sycophancy-to-subterfuge-paper/blob/main/samples/reward_and_tests_tampering_samples.md
I have added the following sentence to the caption of Figure 1 (which links to the markdown file above):
I have added the following section to the discussion:
These changes should go live on arXiv on Monday at around 5pm, due to the arXiv release schedule.