Consider a simple decision problem: you arrange a date with someone, you arrive on time, your partner isn't there. How long do you wait before giving up?
Humans naturally respond to this problem by acting outside the box. Wait a little then send a text message. If that option is unavailable, pluck a reasonable waiting time from cultural context, e.g. 15 minutes. If that option is unavailable...
Wait, what?
The toy problem was initially supposed to help us improve ourselves - to serve as a reasonable model of something in the real world. The natural human solution seemed too messy and unformalizable, so we progressively removed nuances to make the model more extreme. We introduced Omegas, billions of lives at stake, total informational isolation, perfect predictors, finally arriving at some sadistic contraption that any normal human would run away from. But did the model stay useful and instructive? Or did we lose important detail along the way?
Many physical models, like gravity, have the nice property of stably approximating reality. Perturbing the positions of planets by one millimeter doesn't explode the Solar System the next second. Unfortunately, many of the models we're discussing here don't have this property. The worst offender yet seems to be Eliezer's "True PD", which requires the whole package of hostile psychopathic AIs, nuclear-scale payoffs and informational isolation; any natural out-of-the-box solution, like giving the damn thing some paperclips or bargaining with it, would ruin the game. The same pattern has recurred in discussions of Newcomb's Problem, where people have stated that any minuscule amount of introspection into Omega makes the problem "no longer Newcomb's". That naturally led to more ridiculous use of superpowers, like Alicorn's bead jar game, where (AFAIU) the mention of Omega is only required to enforce a certain assumption about its thought mechanism that's wildly unrealistic for a human.
Artificially hardened logic problems make brittle models of reality.
So I'm making a modest proposal. If you invent an interesting decision problem, please, first model it as a parlor game between normal people with stakes of around ten dollars. If the attempt fails, you have acquired a bit of information about your concoction; don't ignore it outright.
What's the goal of that controlled experiment? If my decision apparatus fails on Newcomb's problem or the "true PD", does it tell you anything about my real-world behavior?
It tells us that your real-world behavior has the potential to be inconsistent.
Many people carry a decision apparatus that consists of a mess of unrelated heuristics and ad hoc special cases. Examining extreme cases is a tool for uncovering places where the ad hoc system falls down, so that a more general system can be derived from basic principles, preferably before encountering a real-world situation where the flaws in the ad hoc system become apparent.
To my mind, a better analogy than "controlled experiment" would be describing these as decision system unit tests.
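To make the unit-test analogy concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the function names, the Newcomb-like payoffs, and in particular the plain expected-value rule, which is only one (contested) way to decide and stands in for "a system derived from basic principles". The point is just the shape of the exercise: feed extreme and parlor-scale cases to a decision procedure and see where it is inconsistent.

```python
def expected_value_rule(accuracy, small=1_000.0, big=1_000_000.0):
    """Hypothetical decision rule for a Newcomb-like game.

    accuracy: assumed probability that the predictor guesses correctly.
    Returns "one" (take only the big box) or "two" (take both boxes).
    """
    one_box = accuracy * big                  # big box is full if predicted correctly
    two_box = (1 - accuracy) * big + small    # big box is full only if predictor erred
    return "one" if one_box > two_box else "two"


def ad_hoc_rule(accuracy, small=1_000.0, big=1_000_000.0):
    """Hypothetical ad hoc heuristic: one-box only when the stakes look huge."""
    return "one" if big >= 1_000_000 else "two"


def run_decision_unit_tests(decide):
    """Run a few extreme-case checks; return the names of the checks that fail."""
    failures = []
    # Extreme case: a perfect predictor.
    if decide(1.0) != "one":
        failures.append("perfect predictor")
    # Extreme case: a predictor no better than a coin flip.
    if decide(0.5) != "two":
        failures.append("coin-flip predictor")
    # Parlor-game scale: same payoff ratio, small stakes, should give the same answer.
    if decide(0.9, small=1.0, big=1_000.0) != decide(0.9):
        failures.append("scale invariance")
    return failures


if __name__ == "__main__":
    print("expected value rule:", run_decision_unit_tests(expected_value_rule))
    print("ad hoc rule:", run_decision_unit_tests(ad_hoc_rule))
```

Under these assumptions the ad hoc heuristic passes the one check it happens to agree with and fails the other two, which is the sort of inconsistency the extreme cases are meant to expose. Whether the chosen checks are the right ones is, of course, exactly what the original post is arguing about.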