Dude, the first sentence of the post says "roughly median researchers"; note the "roughly". Researchers at the upper 20th percentile in a field are roughly median researchers; in a field where e.g. the median researcher doesn't get stats 101, the upper-20th-percentile researcher also probably does not understand stats 101.
I would buy that if I actually observed women who are interested in me orchestrating situations in which I find myself alone with them. That is not what I observe in most cases (excluding the women who explicitly ask me out, which is technically "orchestrating a situation in which I find myself alone with her", but presumably not what you meant). Yes, there's a common narrative about how women create affordances for the men to make moves, but that sure is not what I see actually happen.
My clearest data on this comes from slutcon, because I got explicit data (afterward) telling me that a whole bunch of women were interested. Two of them explicitly asked me out. Zero of them orchestrated a situation in which we were alone together, or anything else along those lines.
It has occasionally happened that an interested woman orchestrated such a situation with me, but it sure does not seem to be the typical case.
"On Green" is one of the LessWrong essays which I most often refer to in my own thoughts (along with "Deep Atheism", which I think of as a partner to "On Green"). Many essays I absorb into my thinking, metabolize the contents, but don't often think of the essay itself. But this one is different. The essay itself has stuck in my head as the canonical pointer to... well, Green, or whatever one wants to call the thing Joe is gesturing at.
When I first read the essay, my main thought was "Wow, Joe did a really good job pointing to a thing that I do not like. Screw Green. And thank you Joe for pointing so well at this thing I do not like." And when the essay has come up in my thoughts, it's mostly been along similar lines: I encounter something in the wild, I'm like "ugh screw this thing... wait, why am I all 'ugh' at this, and why do other people like it?... oh it's Carlsmith!Green, I see".
But rereading the essay today, I'm struck by different thoughts than a year ago.
Let's start with the headline gesturing:
What is green?
Sabien discusses various associations: environmentalism, tradition, family, spirituality, hippies, stereotypes of Native Americans, Yoda. Again, I don't want to get too anchored on these particular touch-points. At the least, though, green is the "Nature" one. Have you seen, for example, Princess Mononoke? Very green (a lot of Miyazaki is green). And I associate green with "wholesomeness" as well (also: health). In children's movies, for example, visions of happiness—e.g., the family at the end of Coco, the village in Moana—are often very green.
Today, reading that, I think "oxytocin?". That association probably isn't obvious to most of you; oxytocin is mostly known as the hormone associated with companionate love, which looks sort of like wholesomeness and happy villages but not so much like nature. So let me fill in the gap here.
Most people feel companionate love toward their family and (often metaphorical) village. Some people - usually more hippie-ish types - report feeling the same thing toward all of humanity. Even more hippie-ish types go further, feeling the same thing toward animals, or all living things, or the universe as a whole. The phrase which usually cues me in is "deep connection" - when people report feeling a deep connection to something, that generally seems to cash out to the oxytocin-feeling.
So when I read the post, I think "hmm, maybe Joe is basically gesturing at the oxytocin-feeling, i.e. companionate love". That sure does at least rhyme with a lot of other pieces of the post.
... and that sure would explain why I get a big ol' "ugh" toward Green; I already knew I'm probably genetically unable to feel oxytocin. To me, this whole Green thing is a cluster of stuff which seems basically-unremarkable in its own right, but for some reason other people go all happy-spiral about that stuff and throw away a bunch of their other values to get more Green.
That brings us to the central topic of the essay: is there a case for more Green which doesn't require me to already value Green in its own right? Given that I don't feel oxytocin, would embracing a bit more Green nonetheless be instrumentally useful for the goals I do have?
My knee-jerk reaction to deep atheism is basically "yeah deep atheism is just straightforwardly correct". When Joe says "Green, on its face, seems like one of the main mistakes" I'm like "yes, Green is one of the main mistakes, it still looks like that under its face too, it's just straightforwardly a mistake". It indeed looks like a conflation between is and ought, like somebody's oxytocin-packed value system is trying to mess with their epistemics, like they're believing that the world is warm and fuzzy and caring because they have oxytocin-feelings toward the world, not because the world is actually helping them.
... but one could propose that normal human brains, with proper oxytocin function, are properly calibrated about how much to trust the world. Perhaps my lack of oxytocin biases me in a way which gives me a factually-incorrect depth of atheism.
But that proposal does not stand up to cursory examination. The human brain has pretty decent general-purpose capability for figuring out instrumental value; why would that machinery be wrong on the specific cluster of stuff oxytocin touches? It would be really bizarre if this one apparent reward-system hack corrected an epistemic error in a basically-working general-purpose reasoning machine. It makes much more sense that babies and families just aren't that instrumentally appealing, so evolution dropped oxytocin into the reward system to make babies and families very appealing as a terminal value. And that hack then generalized in weird ways, with people sometimes loving trees or the universe, because that's what happens when one drops a hacky patch into a reward function.
The upshot of all that is... this Green stuff really is a terminal-ish value. And y'know, if that's what people want, I'm not going to argue with the utility function (as the saying goes). But I am pretty darn skeptical of attempts to argue instrumental necessity of Green, and I do note that these are not my values.
I'd have liked it if @johnswentworth had responded to Eliezer's comment at more length, though maybe he did and I missed it? Still, giving this post a 4 in the review, same as the other.
Done.
It's now the 2024 year-in-review, and Ruby expressed interest in seeing me substantively respond to this comment, so here I am responding a year and a half later.
The concept of "easy to understand" seems like a good example, so let's start there. My immediate reaction to that example is "yeah duh of course an AI won't use that concept internally in a human-like way", followed by "wait what the heck is Eliezer picturing such that that would even be relevant?", followed by "oh... probably he's used to noobs anthropomorphizing AI all the time". There is an argument to be made that "easy for a human to understand" is ontologically natural in our environment for predicting humans, even for nonhuman minds insofar as they're modelling humans (though I'm not confident). But I wouldn't expect an AI to use that concept internally the way a human does, even if the concept is convergent for predicting humans. A similar argument would apply to lots of other reflective concepts: the human version would be (potentially) convergent for modelling humans, but the AI will use that concept in a very different way internally than humans would. That means we can't rely on our anthropomorphizing intuitions to reason about how the AI will use such concepts. But we could still potentially rely on such concepts for alignment purposes so long as we're not e.g. expecting the AI to magically care about them or use them in human-like ways. I would guess Eliezer mostly agrees with that, though probably we could go further down this hole before bottoming out.
But I don't think that line of discussion is all that interesting. The immediate use-cases for a strong AI, like e.g. running uploaded humans, really shouldn't require a bunch of reflective concepts anyway. The main things I want to immediately do with AI, as a first use-case, tend to look like hard science and engineering, not like reasoning about humans or doing anthropomorphic things. I usually don't even imagine interfacing to the thing in natural language.
There is another interesting delta here, one which developed in the intervening year-and-a-half since Eliezer wrote this comment. I think corrigibility is maybe not actually a reflective concept.
Eliezer's particular pointer to corrigibility is extremely reflective:
The "hard problem of corrigibility" is interesting because of the possibility that it has a relatively simple core or central principle - rather than being value-laden on the details of exactly what humans value, there may be some compact core of corrigibility that would be the same if aliens were trying to build a corrigible AI, or if an AI were trying to build another AI.
…
We can imagine, e.g., the AI imagining itself building a sub-AI while being prone to various sorts of errors, asking how it (the AI) would want the sub-AI to behave in those cases, and learning heuristics that would generalize well to how we would want the AI to behave if it suddenly gained a lot of capability or was considering deceiving its programmers and so on.
(source) But Eliezer comes at corrigibility very indirectly. In principle, there is some object-level stuff an AI would do when building a powerful sub-AI while prone to various sorts of errors. That object-level stuff is not itself necessarily all that reflective. The move of "we're talking about whatever stuff an AI would do when building a powerful sub-AI while prone to various sorts of errors" is extremely reflective, but in principle one might identify the object-level stuff via some other method which needn't be that reflective.
In particular, Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals gestures at a very-ontologically-natural-looking concept which sure does seem to overlap an awful lot with corrigibility, to the point where it maybe just is corrigibility (insofar as corrigibility actually has a simple core at all).
LLMs mimic human text. That is the first and primary thing they are optimized for. Humans motivatedly reason, which shows up in their text. So, LLMs trained to mimic human text will also mimic motivated reasoning, insofar as they are good at mimicking human text. This seems like the clear default thing one would expect from LLMs; it does not require hypothesizing anything about motivated reasoning being adaptive.
Once one learns to spot motivated reasoning in one's own head, the short-term planner has a much harder problem. It's still looking for outputs-to-rest-of-brain which will result in e.g. playing more Civ, but now the rest of the brain is alert to the basic tricks. But the short-term planner is still looking for outputs, and sometimes it stumbles on a clever trick: maybe motivated reasoning is (long-term) good, actually? And then the rest of the brain goes "hmm, ok, sus, but if true then yeah we can play more Civ" and the short-term planner is like "okey dokey let's go find us an argument that motivated reasoning is (long-term) good actually!".
In short: "motivated reasoning is somehow secretly rational" is itself the ultimate claim about which one would motivatedly-reason. It's very much like the classic anti-inductive agent, which believes that things which have happened more often before are less likely to happen again: "but you've been wrong every time before!" "yes, exactly, that's why I'm obviously going to be right this time". Likewise, the agent which believes motivated reasoning is good actually: "but your argument for motivated reasoning sure seems pretty motivated in its own right" "yes, exactly, and motivated reasoning is good so that's sensible".
... which, to be clear, does not imply that all arguments in favor of motivated reasoning are terrible. This is meant to be somewhat tongue-in-cheek; there's a reason it's not in the post. But it's worth keeping an eye out for motivated arguments in favor of motivated reasoning, and discounting appropriately (which does not mean dismissing completely).
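One toy way to make the anti-inductor precise (my own flip of Laplace's rule of succession, nothing canonical): where the inductive agent, having seen an event occur $k$ times out of $n$, assigns

$$P(\text{occurs next}) = \frac{k+1}{n+2},$$

the anti-inductive agent assigns $\frac{n-k+1}{n+2}$. So for the event "I am wrong", with $k = n$ (wrong every time so far), it assigns probability $\frac{n+1}{n+2}$ to being right this time, and that confidence only grows as the track record of failures lengthens.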
There is definitely a standard story which says roughly "motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans". I do not think that story stands up well under examination; when I think of standard day-to-day examples of motivated reasoning, that pattern sounds like a plausible explanation for some-but-a-lot-less-than-all of them.
For example: suppose it's 10 pm and I've been playing Civ all evening. I know that I should get ready for bed now-ish. But... y'know, this turn isn't a very natural stopping point. And it's not that bad if I go to bed half an hour late, right? Etc. Obvious motivated reasoning. But man, that motivated reasoning sure does not seem very socially-oriented? Like, sure, you could make up a story about how I'm justifying myself to an imaginary audience or something, but it does not feel like one would have predicted the Civ example in advance from the model "motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans".
Another class of examples: very often in social situations, the move which will actually get one the most points is to admit fault and apologize. And yet, instead of that, people instinctively spin a story about how they didn't really do anything wrong. People instinctively spin that story even when it's pretty damn obvious (if one actually stops to consider it) that apologizing would result in a better outcome for the person in question. Again, you could maybe make up some story about evolving suboptimal heuristics, but this just isn't the behavior one would predict in advance from the model "motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans".
A pattern with these examples (and many others): motivated reasoning isn't mainly about fooling others, it's about fooling oneself. Or at least a part of oneself. Indeed, there's plenty of standard wisdom along those lines: "the easiest person to fool is yourself", etc.
Here's a model which I think much better matches real-world motivated reasoning. (Note, however, that all the above critique still stands regardless of whether this next model is correct.)
Motivated reasoning simply isn't adaptive. Even in the ancestral environment, motivated reasoning decreased fitness. It appeared in the first place as an accidental side-effect of an overall-beneficial change in human minds relative to earlier minds, and that change was recent enough that evolution hasn't had time to fix the anti-adaptive side effects.
There's more than one hypothesis for what that change could be. Probably something to do with particular functions within the mind being separated into distinct parts, so that one part of the mind can now sometimes "cheat" by trying to trick another part, while the separation itself is still overall beneficial.
An example falsifiable prediction of this model: other animals generally do not motivatedly-reason. If the relevant machinery had been around for very long, we would have expected evolution to fix the problem.
Tutoring.
This answer was generated by considering what one can usefully hire other people to do full time, for oneself (or one's family) alone.
In this case, I'm mildly skeptical, because probability before Laplace bore a lot less resemblance to today's probability IIUC (though I have not personally read the source texts, so don't update too hard on my understanding). Bayes did discover the theorem, but I don't know if he conceptually thought of it like we do today or would have used it like we do today; I view that as largely coming from Laplace. On the flip side, that means Laplace's work on probability theory was maybe highly counterfactual.