Or, what kind of hammers do you find yourself whacking alignment topics ("nails") with? Alignment topics include problems (e.g. instrumental convergence, mesa-optimizers, Goodhart's law), proposed solutions (e.g. quantilizers, debate, IRL), and whatever else gets brought up in this forum.

For example, I imagine Steven Byrnes would see an alignment problem and think about what algorithm or structure in the brain might solve it (please correct me if I'm wrong!).

I imagine Rohin Shah would see a proposed solution and ask how it intervenes on a threat model (also correct me if I'm wrong!).

Or, more generally: asking for specific examples of the topic at hand.

Relatedly: specific examples of where you've used your heuristic would be appreciated, as would a full, gears-level model of when your heuristic is useful.

2 Answers

Some notes from a 2018 CHAI meeting on this topic (with some editing). I don't endorse everything on here, nor would CHAI-the-organization.

  • Learning part of your model that previously was fixed.
    • Can be done using neural nets, other ML models, or uncertainty (probability distributions)
    • Example: learning biases instead of hardcoding Boltzmann rationality (a toy sketch of this appears after the list)
  • Relatedly, treating an object as evidence instead of as ground truth
  • Looking at current examples of the problem and/or its solutions:
    • How does the human brain / nature do it?
    • How does human culture/society do this?
    • How has cognitive science formalized similar problems / what insights has it produced that we can build on?
    • Principal-agent models/Contracting theory
  • Adversarial agents for robustness
    • Internal design of the system to be adversarial inherently (e.g. debate)
    • External use of adversaries for testing: red teaming, adversarial training (a bare-bones training loop is sketched after the list)
  • ‘Normative Bandwidth’
    • How much information about the correct behavior is actually conveyed, versus how much information the robot policy assumes is conveyed?
    • E.g. interpreting a reward function literally means assuming it conveys all the information needed to recover the optimal policy -- that’s a huge amount of information, and that assumption is always wrong. What’s actually conveyed is a much smaller amount of information -- something like what Inverse Reward Design assumes (that the reward function only conveys information about good behavior in the training environments). A toy version of this is sketched after the list.
  • Proactive Learning (i.e. what if we ask the human?) -- a toy querying loop is sketched after the list
  • Induction (see e.g. iterated amplification)
    • Get good safety properties in simple situations, and then use them to build something more capable while preserving safety properties
  • Analyze a simple model of the situation in theory
  • Indexical uncertainty (uncertainty about your identity)
  • Rationality -- either make sure the agent is rational, or make sure it isn’t (i.e. don’t build agents)
  • Thinking about the human-robot system as a whole, rather than the robot in isolation. (See e.g. CIRL / assistance games.)
  • How would you do it with infinite resources (relaxed constraints)?
    • E.g. AIXI, Solomonoff induction, open-source game theory
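
A few of the items above can be made concrete with toy sketches (none of these is anyone's actual system; all names and numbers below are invented for illustration). First, learning biases instead of hardcoding Boltzmann rationality: in the sketch below, the learner fits both the reward weights and the human's rationality coefficient from demonstrations by maximum likelihood, rather than fixing the coefficient in advance.

```python
import torch

torch.manual_seed(0)

n_actions, n_features, n_demos = 5, 3, 200

# Ground truth, used only to simulate demonstrations (hidden from the learner).
phi = torch.randn(n_actions, n_features)       # action features
true_theta = torch.tensor([1.0, -0.5, 2.0])    # true reward weights
true_beta = 0.7                                # true (imperfect) rationality

true_logits = true_beta * (phi @ true_theta)
demos = torch.distributions.Categorical(logits=true_logits).sample((n_demos,))

# Learner: fit reward weights *and* the rationality coefficient jointly,
# by maximizing the likelihood of the demonstrations.
theta = torch.zeros(n_features, requires_grad=True)
log_beta = torch.zeros(1, requires_grad=True)  # parameterize beta > 0
opt = torch.optim.Adam([theta, log_beta], lr=0.05)

for _ in range(500):
    logits = log_beta.exp() * (phi @ theta)
    nll = -torch.distributions.Categorical(logits=logits).log_prob(demos).mean()
    opt.zero_grad()
    nll.backward()
    opt.step()

print("learned beta:", log_beta.exp().item(), "learned theta:", theta.detach())
```

One thing even this toy version surfaces: with a linear reward, the rationality coefficient and the scale of the reward weights trade off against each other, so they are only identified up to that scale -- actually learning "how rational the human is" needs richer bias models or extra information.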
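
For the "external use of adversaries" item, here is a bare-bones adversarial training loop (toy classifier, toy data, FGSM-style adversary): the adversary searches for worst-case input perturbations, and the defender trains on whatever the adversary finds.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
epsilon = 0.1  # adversary's perturbation budget

# Toy data standing in for a real dataset.
x = torch.randn(256, 10)
y = (x[:, 0] > 0).long()

for _ in range(20):
    # Adversary (red team) step: find a worst-case perturbation of the inputs.
    x_adv = x.clone().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    x_adv = (x + epsilon * grad.sign()).detach()

    # Defender step: train on the adversarial examples the red team produced.
    opt.zero_grad()
    loss_fn(model(x_adv), y).backward()
    opt.step()
```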
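
The "normative bandwidth" point can be made concrete with a small, discretized Inverse-Reward-Design-style calculation (the feature counts and candidate rewards are made up): the hand-written proxy reward is treated as evidence about the true reward, and the likelihood only depends on how the proxy's optimal behavior in the training environment scores under each candidate true reward.

```python
import numpy as np

beta = 5.0  # assumed designer "rationality" when writing the proxy reward

# Trajectories available in the *training* environment, as feature counts
# (features: [progress, grass, lava]); note lava never appears in training.
train_trajs = np.array([
    [1.0, 0.0, 0.0],
    [1.5, 1.0, 0.0],
    [0.5, 0.0, 0.0],
])

# Candidate weight vectors, used both as candidate true rewards and as the
# space of proxy rewards the designer could have written down.
candidates = np.array([
    [1.0,  0.0,  0.0],
    [1.0, -1.0,  0.0],
    [1.0,  0.0, -1.0],
    [1.0, -1.0, -1.0],
])

observed_proxy = 0  # index of the proxy the designer actually wrote: [1, 0, 0]

def best_train_features(w):
    """Feature counts of the best training trajectory under weights w."""
    return train_trajs[np.argmax(train_trajs @ w)]

def likelihood(proxy_idx, true_w):
    """P(designer writes this proxy | true reward), IRD-style."""
    scores = beta * np.array([true_w @ best_train_features(c) for c in candidates])
    probs = np.exp(scores - scores.max())
    return probs[proxy_idx] / probs.sum()

# Posterior over candidate true rewards, with a uniform prior.
posterior = np.array([likelihood(observed_proxy, w) for w in candidates])
posterior /= posterior.sum()
print(np.round(posterior, 2))  # split between the candidates that differ only on lava
```

Because the lava feature never occurs in any training trajectory, the posterior stays split on whether the true reward penalizes it: the proxy simply carried no information about that feature, which is the sense in which much less is "actually conveyed" than a literal reading assumes.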
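
And one way to cash out "what if we ask the human?" in code -- again a toy sketch, not a real system: keep an ensemble of reward models and only spend a query on the human when the ensemble members disagree.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 0.0])  # hidden "true" preferences for the toy

def ask_human(x):
    """Stand-in for querying the human -- the expensive, limited resource."""
    return 1.0 if true_w @ x > 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A small ensemble of logistic reward models; disagreement is the (crude)
# uncertainty signal that decides when a query is worth asking for.
ensemble = [rng.normal(scale=0.5, size=4) for _ in range(5)]
queries = 0

for _ in range(300):
    x = rng.normal(size=4)
    preds = np.array([sigmoid(w @ x) for w in ensemble])
    if preds.std() > 0.15:  # only ask when the models disagree
        y = ask_human(x)
        queries += 1
        for w in ensemble:  # one logistic-regression gradient step per member
            w += 0.5 * (y - sigmoid(w @ x)) * x

print(f"asked the human {queries} times out of 300 inputs")
```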

"How do humans do it?"

Part of why this question is so good is that it has two functions. Most obviously, given some problem (like reasoning about your future self, or not going all Sorcerer's Apprentice when given a new task), it invites one to think about how the human brain solves this problem, with an eye towards designing an artificial system that does the same sorta thing. But also, keeping this question in mind sometimes makes you realize that humans don't solve the problem you're thinking about either, and that you should take a second look at whether you can get what you want without solving it.

1 comment

I imagine Rohin Shah would see a proposed solution and ask how it intervenes on a threat model (also correct me if I'm wrong!).

That's certainly one heuristic I often use :)