One way in which I think current AI models are sloppy is that LLMs are trained in a way that messily merges the following "layers":
- The "dream machine" layer: LLMs are pre-trained on lots of slop from the internet, which creates an excellent "prior".
- The "truth machine": LLMs are trained to "reduce hallucinations" in a variety of ways, including RLHF and the more recent reasoning RL.
- The "good machine": The same RLHF and reasoning RL training also aims to train good outputs (eg helpful, honest, harmless).
I've quoted Andrej Karpathy before, but I'll do it again:
I always struggle a bit [when] I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
[...]
I know I'm being super pedantic but the LLM has no "hallucination problem". Hallucination is not a bug, it is LLM's greatest feature. The LLM Assistant has a hallucination problem, and we should fix it.
- Andrej Karpathy
Failing to properly distinguish the "dream machine" capabilities from the other (truth-oriented or good-oriented) capabilities hobbles today's LLMs. If you ask Claude to write fiction, there's a strong tendency to mix the "Claude voice" into the fiction being generated. More generally, the base model (IE, only the generative pre-training) is great at extrapolating text; the subsequent training hobbles this capability, because no care is taken to preserve it.
Habryka mentions this with respect to experiments with LLM-augmented text editing:
Using base models has at least so far been essential for getting any useful writing work out of LLMs, with the instruction-tuned models reliably producing obtuse corpo-speak when asked to engage in writing tasks.
I expect that mixing truth-orientation with good-orientation has similar problematic consequences.
A Modest Proposal
Dream Machine Layer
My basic idea here is not new: instead of pre-training on lots and lots of text from the internet in an unstructured way, I think it would be better to take a more structured approach which accounts for metadata via inferred author vectors, date vectors, etc. I don't have a specific proposed architecture, but the result should still be able to do raw (unlabeled) text prediction well, by inferring the author vectors and other latents as it goes. I have in mind a semi-supervised approach; some real metadata labels are provided when authorship is known, but labels are inferred where they are absent.[1]
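To make this concrete, here is a minimal sketch of one way "pre-training that accounts for metadata" could look. This is my own illustration, not a reference implementation, and every name and size in it is made up: a small decoder-only model takes an author embedding from a lookup table when the label is known, and infers a latent author vector from the text itself when it is not. Date and other metadata vectors would be handled the same way.

```python
import torch
import torch.nn as nn

class AuthorConditionedLM(nn.Module):
    """Decoder-only LM conditioned on a latent author vector (illustrative sketch)."""
    def __init__(self, vocab_size, d_model=512, n_authors=10_000,
                 n_layers=6, n_heads=8, max_len=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.author_emb = nn.Embedding(n_authors, d_model)   # supervised path: known labels
        self.author_encoder = nn.GRU(d_model, d_model, batch_first=True)  # inference path
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, author_id=None):
        batch, seq = tokens.shape
        x = self.tok_emb(tokens) + self.pos_emb(torch.arange(seq, device=tokens.device))
        if author_id is not None:
            a = self.author_emb(author_id)        # real metadata label available
        else:
            # Infer the latent author vector from the text itself. (A real system
            # would avoid letting this peek at future tokens, e.g. by amortizing
            # the latent per document; that detail is glossed over here.)
            _, h = self.author_encoder(x)
            a = h[-1]
        x = x + a.unsqueeze(1)                    # condition every position on the author
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(x.device)
        return self.head(self.backbone(x, mask=mask))  # next-token logits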
This gives us much better "handles" on what the generative hallucination is doing; instead of trying to cleverly prompt it, we can set the latent metadata vectors to whatever we want. We can, for example, interpolate the vectors of several authors to see what that looks like. We can also mix-and-match vectors in a more semantic way, by looking for meaningful dimensions in the vectors (EG doing PCA across authors and trying to interpret the resulting dimensions).
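Once such an author-embedding table exists, these "handles" are just vector operations. A hypothetical sketch, building on the model sketched above:

```python
import torch

def interpolate_authors(model, id_a, id_b, alpha=0.5):
    """Blend two author vectors; generation conditioned on the result should
    read as a mixture of the two styles."""
    a = model.author_emb(torch.tensor(id_a))
    b = model.author_emb(torch.tensor(id_b))
    return alpha * a + (1 - alpha) * b

def author_principal_components(model, k=10):
    """PCA over the whole author-embedding table; the top components are
    candidate 'meaningful dimensions' (formality, era, expertise, ...)."""
    vectors = model.author_emb.weight.detach()
    _, _, v = torch.pca_lowrank(vectors, q=k)   # centers the data internally
    return v                                    # (d_model, k) principal directions
```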
This is a scientifically interesting project as well; in line with microscope AI, we get to learn things about the world. The inferred author vectors and date vectors give us interesting information, and the structure of the vector space also gives us interesting information. This is similar to the recently-announced project to deliberately attempt to model the entire world via AI. We can query location and date vectors which are realistic, but which never existed, to see what the AI model has inferred about that part of the world -- what could have been written at that time and location, if someone had written it down. (This is a weak ancestor simulation; we can try to construct author vectors for historical figures who didn't write anything down.)
Multimodal capabilities could of course dramatically expand this, producing artificial photos or video etc from different times and locations.
Truth Machine Layer
To build a truth-machine layer on top of this, we fine-tune the system in a truth-oriented way. Conceptually, we are looking for an author vector that knows as much as possible; if there turns out to be a "knowledgeable" dimension in the author-vector space, we'd be turning that up to its maximum (or, if there are multiple dimensions for knowledge in various fields, we're maximizing all of them). More realistically, we might need to fine-tune the whole network to support the existence of a maximally knowledgeable author-vector.
This should be done in such a way as to only increase the capabilities of the network; IE, it should still be good at "dreaming" via other author-vectors, even as it gets better at telling the truth via the truth-oriented author-vector. After all, the truth-oriented author-vector is a real author-vector in the real world: it's the author corresponding to this AI we're trying to train (or more specifically, its truth-oriented layer). So, in some sense, this stage of training is just providing evidence about one more real-world author.
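Here is one hedged sketch of what "only increase the capabilities" could mean as a training objective, assuming the author-conditioned model sketched earlier: fine-tune on truth-oriented data through a dedicated truth author-vector, while a KL penalty against a frozen pre-fine-tuning copy keeps behavior under the other author-vectors where it was. The loss shape is a guess, not a tested recipe.

```python
import torch
import torch.nn.functional as F

def truth_finetune_loss(model, frozen_model, truth_id,
                        truth_tokens, dream_tokens, dream_author_ids, beta=1.0):
    # (1) Truth objective: next-token prediction on truth-oriented data,
    #     conditioned on the dedicated truth author-vector.
    truth_ids = torch.full((truth_tokens.size(0),), truth_id, dtype=torch.long)
    logits = model(truth_tokens[:, :-1], author_id=truth_ids)
    truth_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                 truth_tokens[:, 1:].reshape(-1))

    # (2) Preservation objective: behavior under the *other* author-vectors
    #     should not drift from a frozen pre-fine-tuning snapshot of the model
    #     (e.g. frozen_model = copy.deepcopy(model).eval(), taken beforehand).
    new_logits = model(dream_tokens[:, :-1], author_id=dream_author_ids)
    with torch.no_grad():
        old_logits = frozen_model(dream_tokens[:, :-1], author_id=dream_author_ids)
    preserve = F.kl_div(F.log_softmax(new_logits, dim=-1),
                        F.log_softmax(old_logits, dim=-1),
                        log_target=True, reduction="batchmean")

    return truth_loss + beta * preserve
```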
This special truth-oriented author-vector should also be capable of directly reproducing the capabilities of the whole network; IE, one of many question-answer tasks it is trained on is "act like author X" for all of the author-vectors in the system. This type of training attempts to import all of the implicit world-knowledge of the rest of the system into the truth-oriented author-vector. You can think of it as a sort of introspective capability; this specific author-vector accurately reflects the whole rest of the system.
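A toy version of that introspective objective (same caveats: hypothetical, and building on the earlier sketches) distills the model's own behavior under each author-vector into the truth vector:

```python
import torch
import torch.nn.functional as F

def act_like_author_loss(model, truth_id, author_ids, tokens):
    """Distill each author-vector's behavior into the truth vector.
    `tokens` is assumed to begin with an instruction naming the author to
    imitate, so the truth vector knows which style is being requested."""
    truth_ids = torch.full((tokens.size(0),), truth_id, dtype=torch.long)
    with torch.no_grad():                      # target: the model's own behavior
        target = model(tokens[:, :-1], author_id=author_ids)
    pred = model(tokens[:, :-1], author_id=truth_ids)  # the truth vector imitating
    return F.kl_div(F.log_softmax(pred, dim=-1),
                    F.log_softmax(target, dim=-1),
                    log_target=True, reduction="batchmean")
```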
The author-vector also allows us to explore multiple different notions of truth, perhaps customized to individual users who have different beliefs about what truth-standards should apply.
My proposal for the detailed workings of the truth-oriented layer would be inspired by logical induction, but one could imagine many different forms of truth-oriented training, closer to or further from the currently-dominant paradigm.
Good Machine Layer
Finally, the Good Machine. This can be thought of as yet another author-vector, which is trained on the full "helpful, honest, harmless" type objective. We leverage the truth layer to reason about what is good. This would be the layer that most users get to talk to; it should avoid doing dangerous things like helping the user create weapons of mass destruction.
Again, this could be tuned to multiple different notions of good, representing different value-systems and belief-systems. There could be overarching principles which apply to all such author-vectors, so that users can personally tweak the vectors driving the system to represent their own concepts of good and truth, without being able to jailbreak the system. (Or, more realistically, without being able to do it very easily... this architecture alone will not completely eradicate jailbreaking.)
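Purely as an illustration of "personal tweaks under overarching principles" (none of these names come from an existing system): keep a shared, non-editable core vector for the good layer, and only allow bounded user offsets within a designated subspace.

```python
import torch

def personalized_good_vector(core_vector, user_offset, allowed_directions, max_norm=1.0):
    """core_vector: the shared HHH-trained vector (not user-editable).
    allowed_directions: (d_model, k) basis of dimensions users may adjust.
    user_offset: (k,) the user's personal preferences in that basis."""
    offset = allowed_directions @ user_offset
    norm = offset.norm()
    if norm > max_norm:              # cap how far personalization can move the vector
        offset = offset * (max_norm / norm)
    return core_vector + offset
```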
[1] More specifically, there's a distinction between author vectors (which are entirely inferred) and text labels of attribution (which give author information as a string). There needs to be a learned model which transforms between the two.
LLMs compute the probability of a sequence, but the truth/good distinction is captured by a two-dimensional Jeffrey-Bolker measure (I'm calling its components "probability" and "shouldness"; their ratio is the expected utility of an event). Shouldness is reconstructed from probability and expected utility as their product, so plausibly it behaves on long sequences similarly to probability: it generally gets lower for longer sequences, but tends to be higher for simpler sequences.
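In symbols, writing P for probability, S for shouldness, and EU for the expected utility of an event E, the relationship just described is simply:

$$\mathrm{EU}(E) = \frac{S(E)}{P(E)}, \qquad S(E) = P(E)\cdot \mathrm{EU}(E).$$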
The analogy between probability and shouldness suggests that some form of pretraining might be able to create models for either of them (as opposed to a base model that learns something in between from raw data, with no supervision from preference data). Then expected utility is the ratio: instead of looking at the logits of one LLM, we look at differences of logits between two LLMs, a shouldness-LLM and a probability-LLM (with some regularization that anchors to a base model, rather than Goodharting towards sequences with high approximate expected utility but low probability). Possibly this requires interspersing preference training with pretraining, rather than only applying preference training during post-training, so that there are two different pretrained models that nurture different collections of circuits (for probability and for shouldness).
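A rough sketch of that scoring rule, with `log_prob` standing in for whatever sequence log-probability interface the models expose (the interface and the form of the regularizer are assumptions, not an existing implementation):

```python
def log_expected_utility(seq, shouldness_lm, probability_lm, base_lm, reg=0.1):
    """Score a sequence by a log expected utility, since EU = shouldness / probability."""
    log_s = shouldness_lm.log_prob(seq)   # log "shouldness" of the sequence
    log_p = probability_lm.log_prob(seq)  # log probability of the sequence
    # One simple way to express "anchor to a base model": sequences the base
    # model finds implausible get penalized, pushing against Goodharting
    # toward high-EU but low-probability sequences.
    anchor = base_lm.log_prob(seq)
    return (log_s - log_p) + reg * anchor
```

In a real setup the anchoring would presumably live in the fine-tuning objective rather than in the scoring rule; it appears here only to make the Goodhart concern visible.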
(Some kind of Solomonoff-induction analogy for probability/shouldness should be a clearer thing to express, and might be more relevant in a decision-theory context: you start with description lengths of programs in two different languages, a language of probability-programs and a language of shouldness-programs, and then convert these into probability and shouldness distributions over sequences, enabling both probability induction and shouldness induction for the next element of a sequence. Solomonoff induction ignores distinctions between languages in the limit, but this kind of probability/shouldness induction works with pairs of languages, and the distinction between the two languages in a given pair is the most important thing, as it defines expected utility.)
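Spelling that analogy out in my own (hedged) notation: let $\ell_P(p)$ and $\ell_S(q)$ be description lengths of programs in the probability-language $L_P$ and the shouldness-language $L_S$, and define Solomonoff-style mixtures over sequences $x$:

$$M_P(x) = \sum_{\substack{p \in L_P \\ p \to x}} 2^{-\ell_P(p)}, \qquad M_S(x) = \sum_{\substack{q \in L_S \\ q \to x}} 2^{-\ell_S(q)}, \qquad \mathrm{EU}(x) = \frac{M_S(x)}{M_P(x)}.$$

Conditioning both mixtures on an observed prefix gives probability induction and shouldness induction for the next element; the ratio, and hence the expected utility, is exactly the part that depends on which pair of languages was chosen.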