A prime example of what (I believe) Yudkowsky is talking about in this bullet point is Social Desirability Bias.
"What is the highest cost we are willing to spend in order to save a single child dying from leukemia ?". Obviously the correct answer is not infinite. Obviously teaching an AI that the answer to this class of questions is "infinite" is lethal. Also, incidentally, most humans will reply "infinite" to this question.
(Which, for instance, seems true about humans, at least in some cases: if humans had the computational capacity, they would lie a lot more and calculate personal advantage a lot more. But since those are both computationally expensive, and can therefore be caught out by other humans, the heuristic / value of "actually care about your friends" is competitive with "always be calculating your personal advantage."
I expect this sort of thing to be less common with AI systems that can have much bigger "cranial capacity". But then again, I guess that at any given brain size, there will be some problems that are too inefficient to solve the "proper" way, and for which comparatively simple heuristics / values work better.
But maybe at high enough cognitive capability, you just have a flexible, fully-general process for evaluating the exact right level of approximation for solving any given problem, and the binary distinction between doing things the "proper" way and using comparatively simpler heuristics goes away. You just use whatever level of cognition makes sense in any given micro-situation.)
+1; this seems basically similar to the cached argument I have for why human values might be more arbitrary than we'd like—very roughly speaking, they emerged on top of a solution to a specific set of computational tradeoffs while trying to navigate a specific set of repeated-interaction games, and then a bunch of contingent historical religion/philosophy on top of that. (That second part isn't in the argument you [Eli] gave, but it seems relevant to point out; not all historical cultures ended up valuing egalitarianism/fairness/agency the way we seem to.)
For instance, my current best model of Alex Turner at this point is like "well maybe some of the AI's internal cognition would end up structured around the intended concept of happiness, AND inner misalignment would go in our favor, in such a way that the AI's internal search/planning and/or behavioral heuristics would also happen to end up pointed at the intended 'happiness' concept rather than 'happy'/'unhappy' labels or some alien concept". That would be the easiest version of the "Alignment by Default" story.
I always get the impression that Alex Turner and his associates are just imagining much weaker optimization processes than Eliezer or I (or probably you as well) are. Alex Turner's arguments make a lot of sense to me if I condition on some ChatGPT-like training setup (imitation learning + action-level RLHF), but not if I condition on the negation (e.g. brain-like AGI, or sufficiently smart scaffolding to identify lots of new useful information and integrate it, or ...).
If humans had the computational capacity, they would lie a lot more and calculate personal advantage a lot more. But since those are both computationally expensive, and can therefore be caught out by other humans, the heuristic / value of "actually care about your friends" is competitive with "always be calculating your personal advantage."
I think there's a missing connection here; at least, it read as a non sequitur to me at first. On my first read, I thought this was positing that scaling up a given human's computational capacity, ceteris paribus, would make them lie more. That seems like a strong claim (though maybe true for some people).
But I think it's instead claiming that if humans in general had been adapted under conditions of greater computational capacity, then the 'actually care about your friends' heuristic might have evolved lesser weight. That seems plausible (though the self-play aspect of natural selection means that this depends in part on how offence/defence scales for lying/detection).
But I think it's instead claiming that if humans in general had been adapted under conditions of greater computational capacity, then the 'actually care about your friends' heuristic might have evolved lesser weight.
+1, that's what I understood the claim to be.
And as the saying goes, "humans are the least general intelligence which can manage to take over the world at all" - otherwise we'd have taken over the world earlier.
A classic statement of this is by Bostrom, in Superintelligence.
Far from being the smartest possible biological species, we are probably better thought of as the stupidest possible biological species capable of starting a technological civilization - a niche we filled because we got there first, not because we are in any sense optimally adapted to it.
So the reward model doesn't need to be an exact, high-fidelity representation. An approximation is fine, "a little off" is fine, but it needs to be approximately correct everywhere.
This is not quite true. If you select infinitely hard for high values of a proxy U = V + X, where V is true utility and X is error, you get infinite true utility in expectation if utility is easier to optimize for (has heavier tails) than the error. There are even cases where you get infinite utility despite the error having heavier tails than the utility, e.g. if error and true utility are independent and both are light-tailed.
Drake Thomas and I proved theorems about this here, and there might be another post coming soon about the nonindependent case.
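To make the tails claim concrete, here's a quick numerical sketch (my own toy simulation, not the theorems from the linked post; the distributions and cutoff are purely illustrative assumptions): select the top sliver of samples by the proxy U = V + X and see how much true utility V you actually capture, in the two regimes.

```python
# Toy simulation of hard selection on a proxy U = V + X (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
quantile = 0.9999  # "select hard": keep only the top 0.01% of samples by the proxy U

def mean_true_utility_given_high_proxy(V, X):
    U = V + X
    cutoff = np.quantile(U, quantile)
    return V[U >= cutoff].mean()

# Case 1: true utility V is heavy-tailed (Student-t, df=2), error X is light-tailed (normal).
V1, X1 = rng.standard_t(df=2, size=n), rng.standard_normal(n)
print("V heavy-tailed:", mean_true_utility_given_high_proxy(V1, X1))  # large: selection mostly buys real utility

# Case 2: error X is heavy-tailed, true utility V is light-tailed.
V2, X2 = rng.standard_normal(n), rng.standard_t(df=2, size=n)
print("X heavy-tailed:", mean_true_utility_given_high_proxy(V2, X2))  # modest: selection mostly buys error
```

In the first regime, selecting hard on the proxy mostly buys real utility; in the second, it mostly buys error, which is the regime the original bullet is worried about.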
I think I'm not super into the U = V + X framing; it seems to inherently suggest that there is some component of the true utility V "inside" the proxy U everywhere, one which is merely perturbed by some error term rather than washed out entirely (in the manner I'd expect to see from an actual misspecification). In a lot of the classic Goodhart cases, the source of the divergence between measurement and desideratum isn't regressional, and so V and X aren't independent.
(Consider e.g. two arbitrary functions U' and V', and compute the "error term" X' = U' - V' between them. It should be obvious that when U' is maximized, X' is much more likely to be large than V' is, which is simply another way of saying that X' isn't independent of V', since it was in fact computed from V' (and U'). The claim that the reward model isn't even "approximately correct", then, is basically this: that there is a separate function U being optimized whose correlation with V within-distribution is in some sense coincidental, and that out-of-distribution the two become basically unrelated, rather than one being expressible as a function of the other plus some well-behaved error term.)
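As a toy illustration of that parenthetical (my own construction, nothing more): take two unrelated functions over a finite domain, define the induced "error term" X' = U' - V', and look at what you find at the point that maximizes U'.

```python
# Two arbitrary, unrelated functions over a finite domain; the "error" is whatever
# is left over, and it is what you end up selecting for when you maximize the proxy.
import numpy as np

rng = np.random.default_rng(1)
trials, domain_size = 2_000, 10_000

v_at_argmax, x_at_argmax = [], []
for _ in range(trials):
    U = rng.standard_normal(domain_size)  # arbitrary proxy, unrelated to V
    V = rng.standard_normal(domain_size)  # arbitrary "true utility"
    X = U - V                             # the induced "error term"
    i = np.argmax(U)                      # optimize the proxy hard
    v_at_argmax.append(V[i])
    x_at_argmax.append(X[i])

print("mean V at argmax of U:", np.mean(v_at_argmax))  # ~0: maximizing U tells you nothing about V
print("mean X at argmax of U:", np.mean(x_at_argmax))  # ~max(U): the "error" absorbs the optimization
```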
I think independence is probably the biggest weakness of the post just because it's an extremely strong assumption, but I have reasons why the U = V + X framing is natural here. The error term X has a natural meaning in the case where some additive terms of V are not captured in U (e.g. because they only exist off the training distribution), or some additive terms of U are not in V (e.g. because they're ways to trick the overseer).
The example of two arbitrary functions doesn't seem very central, because it seems to me that if we train U to approximate V, its correlation with V in distribution will be due to the presence of features in the data, rather than being coincidental. Maybe the features won't be additive or independent, and we should think about those cases though. It still seems possible to prove things if you weaken independence to unbiasedness.
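Here's a minimal sketch of the additive-features story I have in mind (feature names and weights are purely illustrative assumptions): U captures some additive terms of V, misses one that only varies off the training distribution, and additionally rewards a term that just tricks the overseer.

```python
# Illustrative additive-feature decomposition of a proxy U and true utility V.
import numpy as np

rng = np.random.default_rng(2)

def sample(n, off_distribution=False):
    helpful = rng.standard_normal(n)                                      # feature captured by both V and U
    honest = rng.standard_normal(n) if off_distribution else np.zeros(n)  # only varies off the training distribution
    flattery = rng.standard_normal(n)                                     # tricks the overseer, contributes no real value
    V = helpful + honest     # true utility
    U = helpful + flattery   # learned proxy; X = U - V = flattery - honest
    return U, V

U_in, V_in = sample(100_000)
U_out, V_out = sample(100_000, off_distribution=True)
print("in-distribution corr(U, V): ", np.corrcoef(U_in, V_in)[0, 1])    # ~0.71
print("off-distribution corr(U, V):", np.corrcoef(U_out, V_out)[0, 1])  # ~0.5, degraded
```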
Agree that we currently only analyze regressional and perhaps extremal Goodhart; people should be thinking about the other two as well.