Hi everyone! My name is Ram Rachum, and this is my first post here :)
I'm an ex-Google software engineer turned MARL researcher. I want to do MARL research that promotes AI safety. You can read more about my research here and sign up for monthly updates.
I had an idea for a project I could do, and I want you to tell me whether it's been done before.
I want to create a demo of Stuart Russell's "You can't fetch the coffee if you're dead" scenario. I'm imagining a MARL environment where agent 1 can "turn on" agent 2 to prepare coffee for agent 1, and then agent 2 at some point understands how to prevent agent 1 from turning it off again. I'd like to get this behavior to emerge using an RL algorithm like PPO. Crucially, the reward function for agent 2 will be completely innocent.
That way we'll have a video of the "You can't fetch the coffee if you're dead" scenario happening, and we could tweak with that setup to see what kind of changes make it less likely or more likely. We could also show that video to laypeople, and it will likely be much easier for them to connect to such a demo rather than to a verbal description of a thought experiment.
Are there any existing demonstrations of this scenario? Any other insights that you have about this idea would be appreciated.
Yes, you can display the problem. But for simple gridworlds like this you can just figure out what's going to happen by using your brain - surprises are very rare. So if you want to show someone the off-switch problem, you can just explain to them what the gridworld will be, without ever needing to actually run the gridworld.
I think one stab at the necessary richness is that it should support nontrivial modeling of the human(s). If Boltzmann-rationality doesn't quickly crash and burn, your toy model is probably too simple to provide interesting feedback on models more complicated than Boltzmann-rationality.
This doesn't rule out gridworlds, it just rules out gridworlds where the desired policy is simple (e.g. "attempt to go to the goal without disabling the off-switch."). And it doesn't necessarily rule in complicated 3D environments - they might still have simple policies, or might have simple correlates of reward (e.g. having the score on the screen) that mean that you're just solving an RL problem, not an AI safety problem.