Some notes from a 2018 CHAI meeting on this topic (with some editing). I don't endorse everything on here, nor would CHAI-the-organization.
"How do humans do it?"
Part of why this question is so good is that it has two functions. Most obviously, given some problem (like reasoning about your future self, or not going all Sorcerer's Apprentice when given a new task), it invites one to think about how the human brain solves this problem, with an eye towards designing an artificial system that does the same sorta thing. But also, sometimes keeping this question in mind makes you realize when humans don't solve the problem you're thinking about, and you should take a second look at whether you can get what you want without having to solve this problem either.
I imagine Rohin Shah would see a proposed solution and ask how it intervenes on a threat model (also correct me if I'm wrong!).
That's certainly one heuristic I often use :)
Or, what kind of hammers do you find yourself whacking alignment topics ("nails") with? Alignment topics include problems (e.g. instrumental convergence, mesa-optimizers, Goodhart's law), proposed solutions (e.g. quantilizers, debate, IRL), and whatever else gets brought up in this forum.
For example, I imagine Steven Byrnes would see an alignment problem and think about what algorithm or structure in the brain might solve it (please correct me if I'm wrong!).
I imagine Rohin Shah would see a proposed solution and ask how it intervenes on a threat model (also correct me if I'm wrong!).
Or, in general, asking for specific examples of the topic.
Related: including specific examples where you've used your heuristic would be appreciated. The same goes for a full, gears-level model of when your heuristic is useful.