Formerly an alignment and governance researcher at DeepMind and OpenAI; now independent.
When you think of goals as reward/utility functions, the distinction between positive and negative motivations (e.g. as laid out in this sequence) isn’t very meaningful, since it all depends on how you normalize them.
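To spell out the normalization point, here's a minimal sketch (my own illustration, assuming a fixed-horizon setting): replacing a reward function $r$ with $r + c$ for any constant $c$ gives, for every policy $\pi$,

$$V_{r+c}(\pi) = \mathbb{E}_\pi\left[\sum_{t=1}^{T}\big(r(s_t, a_t) + c\big)\right] = V_r(\pi) + cT,$$

so every policy's value shifts by the same amount and the ordering over policies is unchanged; which rewards count as "positive" versus "negative" is just a choice of zero point. (With variable-length episodes a constant shift can matter, so this is only a sketch.)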
But when you think of goals as world-models (as in predictive processing/active inference), then it's a very sharp distinction: your world-model goals can represent either things you should move towards or things you should move away from.
This updates me towards thinking that the positive/negative motivation distinction is more meaningful than I thought.
Nice post. I read it quickly but think I agree with basically all of it. I particularly like the section starting "The AI doesn’t have a cached supergoal for “maximize reward”, but it decides to think anyway about whether reward is an instrumental goal".
"The distinct view that truly terminal reward maximization is kind of narrow or bizarre or reflection-unstable relative to instrumental reward maximization" is a good summary of my position. You don't say much that directly contradicts this, though I do think that even using the "terminal reward seeker" vs "schemer" distinction privileges the role of reward a bit too much. For example, I expect that even an aligned AGI will have some subagent that cares about reward (e.g. maybe it'll have some sycophantic instincts still). Is it thereby a schemer? Hard to say.
Aside from that I'd add a few clarifications (nothing major):
I think that formalizing the systematizing-conservatism spectrum would be a big step forward in our understanding of misalignment (and cognition more generally). If anyone reading this is interested in working with me on that, apply to my MATS stream in the next 5 days.
I left some comments on an earlier version of AI 2027; the most relevant is the following:
June 2027: Self-improving AI
OpenBrain now has a “country of geniuses in a datacenter.”
Most of the humans at OpenBrain can’t usefully contribute anymore. Some don’t realize this and harmfully micromanage their AI teams. Others sit at their computer screens, watching performance crawl up, and up, and up.
This is the point where I start significantly disagreeing with the scenario. My expectation is that by this point humans are still better at tasks that take a week or longer. Also, it starts getting really tricky to improve AI performance on these longer tasks, because you get limited by a number of factors: it takes a long time to get real-world feedback, it takes a lot of compute to experiment on week-long tasks, etc.
I expect these dynamics to be particularly notable when it comes to coordinating Agent-4 copies. Like, probably a lot of p-hacking, then other agents knowing that p-hacking is happening and covering it up, and so on. I expect a lot of the human time will involve trying to detect clusters of Agent-4 copies that are spiralling off into doing wacky stuff. Also, at this point performance metrics won't be robust enough to prevent agents from goodharting hard on them.
Good question. Part of the difference, I think, is between total government (state + federal) outlays and federal outlays alone. But I don't think that would explain all of the discrepancy.
Your question reminds me that I had a discussion with someone else who was skeptical about this graph, and made a note to dig deeper, but never got around to it. I'll chase this up a bit now and see what I find.
Ah, gotcha. Unfortunately I have rejected the concept of Bayesian evidence from my ontology and therefore must regard your claim as nonsense. Alas.
(more seriously, sorry for misinterpreting your tone, I have been getting flak from all directions for this talk so am a bit trigger-happy)
I was referring to the inference:
The explanation that it was done by "a new hire" is a classic and easy scapegoat. It's much more straight forward to believe Musk himself wanted this done, and walked it back when it was clear it was more obvious than intended.
Obviously this sort of leap to a conclusion is very different from the kind of evidence one expects upon hearing that literal written evidence (of Musk trying to censor) exists. Given this, your comment seems remarkably unproductive.
EDIT: upon reflection, the first thing I should do is probably to ask you for a bunch of the best examples of the thing you're talking about throughout history. I.e. insofar as the world is better than it could be (or worse than it could be), at what points did careful philosophical reasoning (or the lack of it) make the biggest difference?
Original comment:
The term "careful thinking" here seems to be doing a lot of work, and I'm worried that there's a kind of motte and bailey going on. In your earlier comment you describe it as "analytical philosophy, or more broadly careful/skeptical philosophy". But I think we agree that most academic analytic philosophy is bad, and often worse than laypeople's intuitive priors (in part due to strong selection effects on who enters the field—most philosophers of religion believe in god, most philosophers of aesthetics believe in the objectivity of aesthetics, etc).
So then we can fall back on LessWrong as an example of careful thinking. But as we discussed above, even the leading figure on LessWrong was insufficiently careful even about the main focus of his work for it to be robustly valuable.
So I basically get the sense that the role of careful thinking in your worldview is something like "the thing that I, Wei Dai, ascribe my success to". And I do agree that you've been very successful in a bunch of intellectual endeavours. But I expect that your "secret sauce" is a confluence of a bunch of factors (including IQ, emotional temperament, background knowledge, etc) only one of which was "being in a community that prioritized careful thinking". And then I also think you're missing a bunch of other secret sauces that would make your impact on the world better (like more ability to export your ideas to other people).
In other words, the bailey seems to be "careful thinking is the thing we should prioritize in order to make the world better", and the motte is "I, Wei Dai, seem to be doing something good, even if basically everyone else is falling into the valley of bad rationality".
One reason I'm personally pushing back on this, btw, is that my own self-narrative for why I'm able to be intellectually productive relies in significant part on me being less intellectually careful than other people—so that I'm willing to throw out a bunch of ideas that are half-formed and non-rigorous, iterate, and eventually get to the better ones. Similarly, a lot of the value that the wider blogosphere has created comes from people being less careful than existing academic norms would demand (including Eliezer and Scott Alexander, whose best works are often quite polemical).
In short: I totally think we want more people coming up with good ideas, and that this is a big bottleneck. But there are many different directions in which we should tug people in order to make them more intellectually productive. Many academics should be less careful. Many people on LessWrong should be more careful. Some scientists should be less empirical, others should be more empirical; some less mathematically rigorous, others more mathematically rigorous. Others should try to live in countries that are less repressive of new potentially-crazy ideas (hence politics being important). And then, of course, others should be figuring out how to actually get good ideas implemented.
Meanwhile, Eliezer and Sam and Elon should have had less of a burning desire to found an AGI lab. I agree that this can be described by "wanting to be the hero who saves the world", but this seems to function as a curiosity stopper for you. When I talk about emotional health a lot of what I mean is finding ways to become less status-oriented (or, in your own words, "not being distracted/influenced by competing motivations"). I think of extremely strong motivations to change the world (as these outlier figures have) as typically driven by some kind of core emotional dysregulation. And specifically I think of fear-based motivation as the underlying phenomenon which implements status-seeking and many other behaviors which are harmful when taken too far. (This is not an attempt to replace evo-psych, btw—it's an account of the implementation mechanisms that evolution used to get us to do the things it wanted, which now are sometimes maladapted to our current environment.) I write about a bunch of these models in my Replacing Fear sequence.
I didn't end up putting this in my coalitional agency post, but at one point I had a note discussing our terminological disagreement:
I don’t like the word "hierarchical" as much. A theory can be hierarchical without being scale-free—e.g. a theory which describes something in terms of three different layers doing three different things is hierarchical but not scale-free.
Whereas coalitions are typically divided into sub-coalitions (e.g. the "western civilization" coalition is divided into countries, which are divided into provinces/states; political coalitions are divided into different factions and interest groups; etc.). And so "coalitional" seems much closer to capturing this fractal/scale-free property.
Ah, yeah, I was a bit unclear here.
The basic idea is that by conquering someone you may not reduce their welfare very much short-term, but you do reduce their power a lot short-term. (E.g. the British conquered India with relatively little impact on the welfare of most Indians.)
And so it is much harder to defend a conquest as altruistic in the sense of empowering than it is to defend it as altruistic in the sense of welfare-increasing.
As you say, this is not a perfect defense mechanism, because sometimes long-term empowerment and short-term empowerment conflict. But there are often strategies that are less disempowering short-term which the moral pressure of "altruism=empowerment" would push people towards. E.g. it would make it harder for people to set up the "dictatorship of the proletariat" in the first place.
And in general I think it's actually pretty uncommon for temporary disempowerment to be necessary for long-term empowerment. Re your homework example, there's a wide spectrum from the highly-empowering Taking Children Seriously to highly-disempowering Asian tiger parents, and I don't think it's a coincidence that tiger parenting often backfires. Similarly, mandatory mobilization disproportionately happens during wars fought for the wrong reasons.
Thanks! Note that Eric Neyman gets credit for the graphs.
Out of curiosity, what are the other sources that explain this? Any worth reading for other insights?