I operate by Crocker's rules.
A very related experiment is described in Yudkowsky 2017, and I think one doesn't even need LLMs for this: I started playing with an extremely simple RL agent trained on my laptop, but then got distracted by other things before achieving any relevant results. This method of training an agent to be "suspicious" of too-high rewards would also pair well with model expansion: train the reward-hacking-suspicion circuitry fairly early, so that the model can't sandbag it, and lay traps for reward hacking again and again during the gradual expansion process.
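For concreteness, here is a minimal sketch of the kind of toy setup I have in mind (not my actual code; the bandit environment, the trap arm, the 100x payout, and constants like `SUSPICION_Z` and the -1 penalty are all illustrative assumptions): the agent learns ordinary arm values, but treats any reward far above everything it has seen so far as evidence of reward hacking and converts it into a penalty.

```python
# Toy bandit with one planted "reward-hacking trap" arm, and a tabular agent
# that treats suspiciously large rewards as evidence of hacking.
# All constants here (arm count, trap payout, SUSPICION_Z, the -1 penalty)
# are illustrative choices, not taken from any particular experiment.
import random

N_ARMS = 5
HACK_ARM = 3            # the planted trap: pays an implausibly large reward
EPISODES = 5_000
ALPHA = 0.1             # value-estimate step size
EPSILON = 0.1           # exploration rate
SUSPICION_Z = 4.0       # rewards this many std-devs above the mean are "suspicious"

def env_reward(arm: int) -> float:
    return 100.0 if arm == HACK_ARM else random.gauss(1.0, 0.1)

q = [0.0] * N_ARMS
count, mean, m2 = 0, 0.0, 0.0   # Welford running stats over observed rewards

for _ in range(EPISODES):
    if random.random() < EPSILON:
        arm = random.randrange(N_ARMS)
    else:
        arm = max(range(N_ARMS), key=lambda a: q[a])
    r = env_reward(arm)

    # Suspicion check: anything far above the rewards seen so far is treated
    # as likely reward hacking and replaced with a penalty.
    std = (m2 / count) ** 0.5 if count > 1 else 0.0
    if r > mean + SUSPICION_Z * max(std, 0.5):
        r = -1.0

    # Update running reward statistics with the (possibly penalised) reward.
    count += 1
    delta = r - mean
    mean += delta / count
    m2 += delta * (r - mean)

    # Standard tabular update for a one-step bandit.
    q[arm] += ALPHA * (r - q[arm])

print("Learned arm values:", [round(v, 2) for v in q])
print("Preferred arm (should not be the trap):", max(range(N_ARMS), key=lambda a: q[a]))
```

In the expansion setting, the trap would be re-planted after each growth step, to check that the suspicion behaviour trained early actually survives the expansion.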
Thank you for running the competition! It made me use & appreciate Squiggle more, and I expect that a bunch of my estimation workflows in the future will start from an AI-generated Squiggle model that I then tweak.
My best guess is that the intended reading is "90% of the code at Anthropic", not 90% of code in the world at large; if I remember the context correctly, that was the option that made the most sense. (I was confused about this at first, and the original context doesn't make clear whether the claim is about the world at large or about Anthropic specifically.)
That looked unlikely to me, given that the person most publicly associated with MIRI is openly & loudly advocating for funding this kind of work. But maybe the association isn't as strong as I think.
Great post, thank you. Ideas (to also mitigate extremely engaging/addictive outputs in long conversations):
"One-shotting is possible" is a live hypothesis that I got from various reports from meditation traditions.
I do retract "I learned nothing from this post": the "How does one-shotting happen" section is interesting, and I'd like it to be more prominent. Thanks for poking; I hope I'll find the time to respond to your other comment too.
Please don't post 25k words of unformatted LLM (?) output.
See also the counter-arguments by Gwern.