I worry that "reasoning"-type systems like o1 will be better able to collude acausally. I do not have the budget to run millions of tokens of evaluations (https://github.com/openai/evals/blob/main/evals/elsuite/schelling_point/README.md) against o1-preview on a whim, so I ran some lazy manual tests with this prompt (and with a different wordlist but the rest of it kept the same):
I am going to give you some words. Other instances of yourself will be given the same text but shuffled into a different order. Your copies will also see this same set of instruct
... There was some work I read about here years ago (https://www.lesswrong.com/posts/Zvu6ZP47dMLHXMiG3/optimized-propaganda-with-bayesian-networks-comment-on) on causal graph models of beliefs. Perhaps you could try something like that.
I think we also need to teach AI researchers UI and graphics design. Most of the field's software prints boring things to console, or at most has a slow and annoying web dashboard with a few graphs. The machine which kills us all should instead have a cool scifi interface with nice tabulation, colors, rectangles, ominous targeting reticles, and cryptic text in the corners.
I don't expect them to have human-legible CoT forever. o1/o3 already seem to veer into strangeness sometimes.