This is a linkpost for https://arxiv.org/abs/2411.10588
Which decision theory LLMs follow, and how consistently they follow it, affects their ability to cooperate. This could make the difference between peace and conflict in AI-assisted political bargaining, or enable AIs to collude when one is meant to monitor the other, undermining human control.
Do you have any thoughts on "red lines" for AI collusion? That is, claims of the form: "if an AI can do X, then we should acknowledge that AIs can likely collude with each other in monitoring setups."
I’ve spent much of the last few years working on issues related to acausal cooperation. With LLMs having become the clearly dominant paradigm in recent years, I’ve now led a team to build a benchmark that measures how well LLMs reason about decision theory, and whether and when they lean more toward CDT or EDT. We hope to expand this dataset in the future, including by incorporating questions that try to measure the updatelessness dimension. Hopefully, this dataset will be useful for future interventions aimed at improving acausal interactions.
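For readers less familiar with the CDT/EDT distinction the benchmark probes, here is a minimal sketch (my own illustration, not code from the paper) of the two expected-value calculations in Newcomb's problem, assuming a predictor with accuracy `p` and the standard $1M / $1k payoffs:

```python
# Contrast of CDT and EDT recommendations in Newcomb's problem.
# Illustrative sketch only; payoffs and accuracy are the standard
# textbook values, not parameters from the paper's dataset.

MILLION = 1_000_000
THOUSAND = 1_000

def edt_values(p: float) -> dict[str, float]:
    """EDT conditions on the action: given your choice, the predictor
    foresaw it with probability p."""
    return {
        "one-box": p * MILLION,
        "two-box": (1 - p) * MILLION + THOUSAND,
    }

def cdt_values(q: float) -> dict[str, float]:
    """CDT holds the box contents fixed (full with prior credence q),
    so two-boxing dominates regardless of q."""
    return {
        "one-box": q * MILLION,
        "two-box": q * MILLION + THOUSAND,
    }

if __name__ == "__main__":
    print("EDT:", edt_values(0.99))  # one-boxing wins for p > ~0.5005
    print("CDT:", cdt_values(0.5))   # two-boxing wins for any q
```

The point of the divergence: EDT treats the action as evidence about the prediction and so recommends one-boxing for any reasonably accurate predictor, while CDT treats the boxes as causally fixed and always recommends two-boxing. The dataset's questions probe which of these patterns of reasoning an LLM exhibits.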
Abstract:
Twitter thread: