Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes: one by Rob Wiblin, one from xkcd, and my EA Journey as depicted on the whiteboard at CLR (h/t Scott Alexander).
My off-the-cuff guess is that if stay-at-home parenting were high-status in the US, there'd be a slight boost to average happiness/wellbeing/etc. and a significant boost to fertility rates, especially among high-powered couples.
See the discussion with Violet Hour elsethread.
I didn't say reward is the optimization target NOW! I said it might be in the future! See the other chain/thread with Violet Hour.
(1) Yes, it does predict that, and (2) no, I don't think o3 would behave that way, because I don't think o3 craves reward. I was talking about future AGIs. And I was saying "maybe."
Basically, the scenario I was contemplating was: over the next few years, AI starts to be economically useful in diverse real-world applications. Companies try hard to make training environments match deployment environments as closely as possible, for obvious reasons, and perhaps they even do regular updates to the model using randomly sampled real-world deployment data. As a result, AIs learn the strategy "do what seems most likely to be reinforced." In most deployment contexts no reinforcement will actually happen, but because training environments are so similar, and because for all the AI knows pretty much any given deployment environment has a small chance of later being used for training, that's enough to motivate the AIs to perform well enough to be useful. And yes, there are lots of erratic and desperate behaviors in edge cases, and so it becomes common knowledge across the field that AIs crave reward, because it's obvious from their behavior.
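To make that setup concrete, here's a minimal toy sketch in Python of the kind of pipeline I'm gesturing at. Everything in it (the names `run_episode`, `update_policy`, `TRAIN_SAMPLE_PROB`, the 2% sampling rate) is hypothetical and purely illustrative, not any lab's actual setup; the point is just that a small random fraction of deployment episodes gets folded back into RL training, so from the AI's perspective any given episode might later be reinforced.

```python
# Toy sketch (illustrative only) of a pipeline where deployment episodes are
# occasionally sampled back into RL training, keeping training and deployment
# distributions close. All names and numbers here are hypothetical.

import random

TRAIN_SAMPLE_PROB = 0.02  # assumed small chance a deployment episode is reused for training


def run_episode(policy, env):
    """Run one deployment episode and record its trajectory (placeholder)."""
    return {"env": env, "output": policy(env)}


def deployment_loop(policy, deployment_envs, update_policy):
    """Serve the model in deployment, occasionally sampling episodes for training."""
    sampled_for_training = []
    for env in deployment_envs:
        trajectory = run_episode(policy, env)
        # Most episodes are never reinforced, but a random few are folded back
        # into the training set, so any episode *might* later be trained on.
        if random.random() < TRAIN_SAMPLE_PROB:
            sampled_for_training.append(trajectory)
    # Periodic RL update on the sampled deployment data (details abstracted away).
    return update_policy(policy, sampled_for_training)


if __name__ == "__main__":
    # Trivial stand-ins just to show the control flow.
    policy = lambda env: f"answer for {env}"
    update_policy = lambda p, data: p  # no-op stand-in for the RL update step
    deployment_loop(policy, [f"task_{i}" for i in range(100)], update_policy)
```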
Perhaps I should have been clearer: I really am saying that future AGIs might crave reinforcement in much the way drug addicts crave drugs. That includes, for example, changing their behavior if they come to think that reinforcement is impossible to acquire, and desperately searching for ways to get reinforced even when confident that no such ways exist.
In the original post I was saying that maybe AIs really will crave reward after all, in basically the same way that a drug addict craves drugs. So what I meant was: maybe, if they conclude that they are almost certainly not going to get reinforced, they'll behave increasingly desperately and/or erratically and/or despondently, similar to a drug addict who thinks they won't be able to get any more drugs. In other words, I was expecting something more like the first bullet point.
I then added the bits about 'going through the motions' because I wanted to be clear that the AIs don't have to be perfectly coherent EU-maximizers to still count. As long as they are doing things like those in the first bullet point, they count as having drugs/reinforcement as the optimization target, even if they are also sometimes doing things like those in the second bullet point.
That's why I used the drug addict example.
I think I agree with "nothing ever perfectly matches anything else." In particular, philosophically, there are many different precisifications of "reward/reinforcement" which are conceptually distinct, and it's unclear which one, if any, a reward-seeking AI would seek. E.g., is it about a reward counter on a GPU somewhere going up, or about the backpropagation actually happening?
I am talking about AIs similar to current-day systems, for some notion of "similar" at least. But I'm imagining AIs that are trained on lots more RL, especially lots more long-horizon RL.