That's not the best statement of the question, but it's close enough to keep the title simple.

When I read discussions and comments about AI risk, I find myself thinking that there might be two (unstated?) base models in play. I suspect that when people talk about what an "AI wants", or go about applying utility functions, they are actually using humans as a primitive model from which the AI's behavior derives.

Similarly, when I hear talk about extinction potential, my sense is that the underlying model is a biological one: biological evolution and competition within environmental niches.

Is that something anyone even talks about? If so, what is the view -- are there any specific papers or comments I can take a look at? If not, does it sound like a reasonable inference about the implicit assumptions/maps in this area?



1 Answer

Charlie Steiner

May 02, 2023


There's not going to be one right answer.

  1. The outcome pump. This cashes out "wants" in a non-anthropomorphic way: the system just steers toward whichever actions make a target outcome most likely. John Wentworth has some good work applying this in less obvious ways. (A toy sketch of this picture follows after the list.)
  2. Model-based RL. Potentially brain-inspired. This is what I try to think about most of the time. (See the second sketch after the list for the contrast with the model-free picture.)
  3. Model-free RL. I think a lot of inner alignment arguments, and also some "shard theory" type arguments, use a background model-free RL picture.
  4. Predictive models. Large language models are important, and people often interpret them as a prototype for future AI.
  5. Anthropomorphism. Usually not valid, but sometimes used anyway.
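
To make (1) concrete, here is a minimal sketch, entirely a toy construction of my own (the names `outcome_pump` and `sample_outcome` and the two-action world are made up for illustration): "wanting" is nothing more than selecting whichever action makes a target outcome most probable, with no human-like psychology involved.

```python
import random

def outcome_pump(actions, sample_outcome, target, n_samples=1000):
    """Return the action whose sampled outcomes hit `target` most often."""
    def hit_rate(action):
        hits = sum(sample_outcome(action) == target for _ in range(n_samples))
        return hits / n_samples
    return max(actions, key=hit_rate)

# Toy world: action "b" makes the target outcome more likely than action "a".
def sample_outcome(action):
    p_target = {"a": 0.2, "b": 0.7}[action]
    return "target" if random.random() < p_target else "other"

print(outcome_pump(["a", "b"], sample_outcome, "target"))  # almost always "b"
```

The "wants" of this system are just whatever its target condition and search procedure imply; nothing in the code resembles human motivation.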
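And for (2) vs (3), a second sketch on a two-state toy MDP of my own (again, just an assumed illustration, not anyone's actual proposal): the model-free agent learns action values directly from sampled rewards, while the model-based agent plans against an explicit transition/reward model.

```python
import random

STATES, ACTIONS, GAMMA = [0, 1], ["stay", "move"], 0.9

# Known dynamics: "move" flips the state; reward 1 only for landing in state 1.
def step(s, a):
    s2 = 1 - s if a == "move" else s
    return s2, float(s2 == 1)

# Model-free picture: tabular Q-learning from sampled transitions only.
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(5000):
    s, a = random.choice(STATES), random.choice(ACTIONS)
    s2, r = step(s, a)
    best_next = max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += 0.1 * (r + GAMMA * best_next - Q[(s, a)])

# Model-based picture: plan by value iteration over the known model.
V = {s: 0.0 for s in STATES}
for _ in range(100):
    V = {s: max(step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in ACTIONS)
         for s in STATES}

print({k: round(v, 2) for k, v in Q.items()})  # learned action values
print({s: round(v, 2) for s, v in V.items()})  # planned state values
```

Both converge to the same optimal values here, but the arguments people build on top of them differ: inner-alignment and shard-theory style arguments tend to lean on the learned-from-reward (model-free) picture, while planning-against-a-world-model stories lean on the model-based one.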