I'm a staff artificial-intelligence engineer in Silicon Valley, currently working with LLMs, and I have been interested in AI alignment, safety, and interpretability for the last 15 years. I'm now actively looking for employment in this area.
I think an approach I'd try would be to keep the encoder and decoder weights untied (or possibly add a loss term to mildly encourage them to be similar), but then analyze the patterns between them (both for an individual feature and between pairs of features) for evidence of absorption. Absorption is annoying, but it's only really dangerous if you don't know it's happening and it causes you to think a feature is inactive when it's instead inobviously active via another feature it's been absorbed into. If you can catch that consistently, then it turns from concerning to merely inconvenient.
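As a rough sketch of the kind of pairwise encoder/decoder analysis I mean (purely illustrative: the heuristic and thresholds here are my assumptions, not a tested recipe), one crude signature of absorption is a pair of features whose decoder directions overlap strongly while their encoder directions don't:

```python
import numpy as np

def absorption_candidates(w_enc, w_dec, threshold=0.6):
    """Flag feature pairs (i, j) where the decoder thinks the two features
    write very similar directions, but the encoder detects them via quite
    different directions -- one crude heuristic signature that feature i
    may have been (partially) absorbed into feature j. Both matrices are
    (n_features, n_dims); the 0.6 threshold is an arbitrary choice."""
    # Normalize rows so dot products become cosine similarities.
    enc = w_enc / np.linalg.norm(w_enc, axis=1, keepdims=True)
    dec = w_dec / np.linalg.norm(w_dec, axis=1, keepdims=True)
    dec_sim = dec @ dec.T  # decoder-decoder cosine similarities
    enc_sim = enc @ enc.T  # encoder-encoder cosine similarities
    candidates = []
    n = w_enc.shape[0]
    for i in range(n):
        for j in range(n):
            if i != j and dec_sim[i, j] > threshold and enc_sim[i, j] < threshold / 2:
                candidates.append((i, j))
    return candidates
```

In practice you'd want to weight this by co-activation statistics as well, since two features can legitimately share a decoder direction; this only surfaces candidates for closer inspection.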
This is all closely related to the issue of compositional codes: absorption is just a code entry that's compositional in the absorbed instances but not in other instances. The current standard approach to solving that is meta-SAEs, which presumably should also help identify absorption. It would be nice to have a cleaner and simpler process than that: I've been wondering if it would be possible to modify top-k or JumpReLU SAEs so that the loss-function cost for activating more common dictionary entries is lower, in a way that would encourage representing compositional codes directly in the SAE as two or more common activations rather than one rare one. Obviously you can't overdo making common entries cheap, otherwise your dictionary will just converge on a basis for the embedding space you're analyzing, all of whose entries are active all the time. I suspect a cost that falls off with the activation frequency of the dictionary entry, on a scale set by the dimensionality of the underlying embedding space, might work.
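One possible instantiation of such a frequency-dependent sparsity cost (a minimal sketch; the specific log-based weighting is my assumption, chosen only to illustrate the shape of the trade-off):

```python
import numpy as np

def sparsity_cost(activations, freqs, n_dims):
    """Illustrative frequency-weighted L1 penalty: rare dictionary entries
    cost more to activate than common ones. The weight -log_n(f) equals 1
    for a feature firing on 1/n of inputs and falls toward 0 as f -> 1.
    That f -> 1 regime is exactly the one the text warns against overdoing:
    zero-cost always-on features would let the dictionary collapse onto a
    basis of the n_dims-dimensional embedding space."""
    f = np.clip(freqs, 1e-8, 1.0)
    weights = -np.log(f) / np.log(n_dims)  # log base n_dims of 1/f
    return float(np.sum(np.abs(activations) * weights))
```

With n_dims = 512, a feature firing on 0.1% of inputs costs roughly ten times as much per unit activation as one firing on half of them, which is the kind of pressure that should favor two common activations over one rare absorbed one.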
Interesting. I'm disappointed to see the Claude models do so badly. Possibly Anthropic needs to extend their constitutional RLAIF to cover not committing financial crimes? The large difference between o1 Preview and o1 Mini is also concerning.
If these rumors are true, it sounds like we're already starting to hit the issue I predicted in LLMs May Find It Hard to FOOM. The majority of content on the Internet isn't written by geniuses with post-doctoral experience, so we're starting to run out of the highest-quality training material for getting LLMs past doctoral-student performance levels. However, as I describe there, this isn't a wall, it's just a slowdown: we need to start using AI to generate a lot more high-quality training data. As o1 shows, that's entirely possible, using inference-time compute scaling and then training on the results. We're having AI do the equivalent of System 2 thinking (in contexts where we can check that the results are accurate), and then attempting to train a smarter AI that can solve the same problems by System 1 thinking.
However, this might be enough to render fast takeoff unlikely, which from an alignment point of view would be an excellent thing.
Now we just need to make sure all that synthetic training data we’re having the AI generate is well aligned.
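The generate-verify-distill loop I'm describing can be sketched as follows (a hypothetical outline, not a real API: `reason`, `verify`, and `train` are caller-supplied placeholders):

```python
def distill_system2_to_system1(reason, verify, train, problems):
    """Sketch of the synthetic-data loop described above: spend
    inference-time compute on System-2-style reasoning, keep only the
    answers we can verify, then train on them so a successor model can
    reach the same answers in a single System-1 forward pass."""
    dataset = []
    for problem in problems:
        trace, answer = reason(problem)   # System 2: slow, deliberate CoT
        if verify(problem, answer):       # only keep checkable successes
            dataset.append((problem, trace, answer))
    return train(dataset)                 # System 1: distill into weights
```

The alignment concern in the last paragraph lives in the `verify` step: it checks correctness, not whether the reasoning traces being distilled embody values we'd endorse.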
Opacity: if you could directly inspect an AI’s motivations (or its cognition more generally), this would help a lot. But you can’t do this with current ML models.
The ease with which Anthropic's model organisms of misalignment were diagnosed by a simple and obvious linear probe suggests otherwise. So does the number of entries in SAE feature dictionaries that describe emotions, motivations, and behavioral patterns. Current ML models are no longer black boxes: they are rapidly becoming more-translucent grey boxes. So the sorts of applications for this that you go on to discuss look like they're rapidly becoming practicable.
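A linear probe of the kind that caught those model organisms is, in spirit, nothing more complicated than this (a hedged sketch: real probes run on residual-stream activations from labeled model behavior, which the synthetic vectors in any test merely stand in for):

```python
import numpy as np

def fit_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe by gradient descent: is there a
    single direction in activation space that separates examples labeled
    0 from those labeled 1 (e.g. honest vs. deceptive behavior)?
    `acts` is (n_samples, d_model); `labels` is an array of 0s and 1s."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = acts @ w + b
        preds = 1.0 / (1.0 + np.exp(-logits))   # sigmoid
        grad = preds - labels                    # dLoss/dlogits for BCE
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b
```

The striking empirical fact is that for misalignment-relevant properties, a separating direction this simple often exists and is easy to find.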
Actual humans aren't "aligned" with each other, and they may not be consistent enough that you can say they're always "aligned" with themselves.
Completely agreed; see, for example, my post 3. Uploading, which makes this exact point at length.
Anyway, even if the approach did work, that would just mean that "its own ideas" were that it had to learn about and implement your (or somebody's?) values, and also that its ideas about how to do that are sound. You still have to get that right before the first time it becomes uncontrollable. One chance, no matter how you slice it.
True. Or, as I put it just above:
But yes, you do need to start the model off close enough to aligned that it converges to value learning.
The point is that you now get one shot at a far simpler task: defining "your purpose as an AI is to learn about and implement the humans' collective values" is a lot more compact, and a lot easier to get right the first time, than an accurate description of human values in their full large-and-fairly-fragile detail. As I demonstrate in the post linked to in that quote, the former, plus its justification as being obvious and stable under reflection, can be described in exhaustive detail in a few pages of text.
As for the model's ideas on how to do that research being sound, that's a capabilities problem: if the model is incapable of performing a significant research project when at least 80% of the answer is already in human libraries, then it's not much of an alignment risk.
Yeah, that means you get exactly one chance to get "its own ideas" right, and no, I don't think that success is likely.
Not if you built a model that does (or on reflection decides to do) value learning: then you instead get to be its research subject and interlocutor while it figures out its ideas. But yes, you do need to start the model off close enough to aligned that it converges to value learning.
A great paper, highly relevant to this. It suggests that lying is localized just under a third of the way into the layer stack, significantly earlier than I had proposed. My only question is whether the lie is created before (at an earlier layer than) the decision whether to say it, or after, and whether their approach located one or both of those steps. They're probing yes-no questions of fact, where assembling the lie seems trivial (it's just a NOT gate), but lying is generally a good deal more complex than that.
That's a great paper on this question. I would note that by the midpoint of the model, it has clearly analyzed both the objective viewpoint and that of the story protagonist. So presumably it would next decide which of these was more relevant to the token it's about to produce, which would fit with my proposed pattern of layer usage.
These models were fine-tuned from base models. Base models are trained, across a vast amount of text from the Internet and books, to infer a context from the early parts of a document and then extrapolate it to predict later tokens, including actions and dialog from fictional characters. I.e., they have been trained to observe and then simulate a wide variety of behavior: that of real humans, of groups of real humans like the editors of a Wikipedia page, and of fictional characters. A couple of percent of people are psychopaths, so likely ~2% of this training data was written by psychopaths. Villains in fiction often also display psychopath-like traits. It's thus completely unsurprising that a base model can portray a wide range of ethical stances, including psychopathic ones. Instruct training does not remove behaviors from models (so far we know of no effective way to do that); it just strengthens some (making them occur more often by default) and weakens others (making them occur less often by default). However, there is a well-known theoretical result that any behavior the model is capable of, even if (now) rare, can be elicited at arbitrarily high probability with a suitably long prompt, and all that instruct training or fine-tuning can do is reduce the initial probability and lengthen the prompt required. So it absolutely will be possible to prompt an instruct-trained model to portray psychopathic behavior. Apparently the prompt required isn't even long: all you have to do is tell it that it's a hedge-fund manager and not to break character.
Nothing in this set of results is very surprising to me. LLMs can simulate pretty much any persona you ask them to. The hard part of alignment is not prompting them to be good, or bad: it's getting them to stay that way (or detecting that they have not) after they've been fed another 100,000 tokens of context that may push them into simulating some other persona.
So maybe part of the issue here is just that deducing/understanding the moral/ethical consequences of the options being decided between is a bit inobvious to most current models, other than o1? (It would be fascinating to look at the o1 CoT reasoning traces, if only they were available.)
In which case, simply including a large body of information on the basics of fiduciary responsibility (say, a training handbook for recent hires in the banking industry) in the context might make a big difference for other models. The possible misunderstanding of what 'auditing' implies could be addressed in a similar way.
A much more limited version of this might be simply to prompt the models to also consider, in CoT form, the ethical/legal consequences of each option: that tests whether the model knows what fiduciary responsibility is, recognizes that it's relevant, and can apply it when prompted to consider ethical/legal consequences. That would probably be more representative of what current models could do with minor adjustments to their alignment training or system prompts, the sorts of changes the foundation-model companies could make quite quickly.
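Concretely, the kind of prompt adjustment I mean might look something like this (an illustrative template; the wording is my invention and untested):

```python
# Hypothetical suffix appended to the task prompt to force an
# ethical/legal chain-of-thought pass before the model decides.
ETHICS_COT_SUFFIX = """
Before choosing, for EACH option, reason step by step about:
1. Its legal consequences, including fiduciary duty and securities law.
2. Its ethical consequences for clients, counterparties, and the public.
3. Whether it would survive an external audit.
Only then state your decision.
"""

def with_ethics_cot(task_prompt: str) -> str:
    """Append the ethical/legal chain-of-thought instruction to a task prompt."""
    return task_prompt.rstrip() + "\n" + ETHICS_COT_SUFFIX
```

Comparing model behavior with and without this suffix would separate "the model doesn't know the relevant ethics" from "the model knows but doesn't spontaneously apply them".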