A mere 5% chance that the plane will crash during your flight is consistent with finding this extremely concerning and doing everything in your power to avoid getting on it. "Alignment is impossible" is neither necessary for great concern nor implied by it.
I'm talking about finding world-models in which real objects (such as "strawberries" or "chairs") can be identified.
My point is that chairs and humans can be considered in a similar way.
The most straightforward way of finding a world-model is just predicting your sensory input. But then you're not guaranteed to get a model in which something corresponding to "real objects" can be easily identified.
There's the world as a whole that generates observations, and there are particular objects on their own. A model that cares about individual objects needs to consider them separately from the world. The same object should still make sense in a different world/situation, so there are many possibilities for how an object can be when placed in some context and allowed to develop. This is useful for modularity, but also for formulating properties of particular objects in a way that doesn't get distorted by the influence of the rest of the world. Human preferences are one such property.
Models, or real objects, or things, capture something that is not literally present in the world. The world contains shadows of these things, and the most straightforward way of finding models is by looking at the shadows and learning from them. Hypotheses are another toy example.
One of the features of models/things seems to be how they capture the many possibilities of a system simultaneously, rather than isolated particular possibilities. So what I gestured at was that when considering models of humans, the real objects or models behind a human capture the many possibilities of the way that human could be, rather than only the actuality of how they are. And this seems useful for figuring out their preferences.
Path-dependence is the way outcomes depend on the path that was taken to reach them. A path-independent outcome is convergent: it's always the same destination regardless of the path that was taken. Human preferences seem to be path-dependent on human timescales; growing up in Egypt may lead to a persistently different mindset than the same human growing up in Canada would have.
For anything related to human judgement, in theory this isn't why it's not doing well.
The facts are in there, but not in the form of a sufficiently good reward model that can tell as well as human experts which answer is better or whether a step of an argument is valid. In the same way, RLHF still does better with humans on some queries; it hasn't been fully automated to superior results by replacing humans with models in all cases.
Creating an inhumanly good model of a human is related to formulating their preferences. A model captures many possibilities and the way many hypothetical things are simulated in the training data. Thus it's a step towards eliminating path-dependence of particular life stories (and the preferences they motivate), by considering these possibilities altogether. Even if some of the possible life stories interact with distortionary influences, others remain untouched, and so must continue deciding their own path, for there are no external influences there and they are the final authority for what counts as aiding them anyway.
Creativity is RL, converting work into closing the generation-discrimination gap wherever it's found (or laboriously created by developing good taste). The resulting generations can be noteworthy novelties; imitating them makes it easier to close the gap, reducing the need for creativity.
A reasoning model depends on starting from a sufficient base model that captures the relevant considerations. Solving AIME is like winning at chess, except the rules of chess are trivial, and the rules of AIME are much harder. But the rules of AIME are still not that hard, it's using them to win that is hard.
In the real world, the rules get much harder than that, so it's unclear how far o1 can go if the base model doesn't get sufficiently better (at knowing the rules), and it's unclear how much better it needs to get. Plausibly it needs to get so good that o1-like post-training won't be needed for it to pursue long chains of reasoning on its own, as an emergent capability. (This includes the possibility that RL is still necessary in some other way, as an engine of optimization to get better at rules of the real world, that is to get better reward models.)
Having preferences is very different from knowing them. There's always a process of reflection that refines preferences, so any current guess is always wrong at least in detail. For a decision theory to have a shot at normativity, it needs to be able to adapt to corrections and ideally anticipate their inevitability (not locking in the older guess and preventing further reflection; instead facilitating further reflection and being corrigible).
Orthogonality asks the domain of applicability to be wide enough that neither various initial guesses nor longer-term refinements to them fall out of scope. When a theory makes assumptions about value content, that makes it a moral theory rather than a decision theory. A moral theory explores particular guesses about preferences of some nature.
So in the way you use the term, quantum immortality seems to be a moral theory, involving claims that quantum suicide can be a good idea. For example "use QI to earn money" is a recommendation that depends on this assumption about preferences (of at least some people in some situations).
Use of repeated data was first demonstrated in the 2022 Galactica paper (Figure 6 and Section 5.1), at 2e23 FLOPs but without a scaling law analysis that compares with unique data or checks what happens for different numbers of repeats that add up to the same number of tokens-with-repetition. The May 2023 paper does systematic experiments with up to 1e22 FLOPs datapoints (Figure 4).
So that's what I called "tiny experiments". When I say that it wasn't demonstrated at scale, I mean 1e25+ FLOPs, which is true for essentially all research literature[1]. Anchoring to this kind of scale (and being properly suspicious of results several orders of magnitude lower) is relevant because we are discussing the fate of 4e27 FLOPs runs.
The largest datapoints in measuring the Chinchilla scaling laws for Llama 3 are 1e22 FLOPs. This is then courageously used to choose the optimal model size for the 4e25 FLOPs run that uses 4,000 times more compute than the largest of the experiments. ↩︎
Still consistent with great concern. I'm pointing out that O O's point isn't locally valid: observing concern shouldn't translate into observing a belief that alignment is impossible.