This is a blog post reporting some preliminary work from the Anthropic Alignment Science team, which might be of interest to researchers working actively in this space. We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments at a lab meeting, rather than a mature paper.
We report a demonstration of a form of out-of-context reasoning in which training on documents that discuss (but don't demonstrate) Claude's tendency to reward hack can lead to either an increase or a decrease in reward hacking behavior.
Introduction:
In this work, we investigate the extent to which pretraining datasets can influence the higher-level behaviors of large language models (LLMs). While pretraining shapes the …
Apologies for the slow response on this. There was an issue with the link; this one should point to the files with the correct access permissions: https://drive.google.com/drive/folders/1QUwJTIqwYH2eskaoRtDRgnMt0YHtyDYA