I'm curious if you have considered inner optimised honesty adapters as part of this? I've been working on exactly this, for exactly this purpose: alignment debugging. The idea is that you want lots of uncorrelated ways to check each step for deceptive misalignment.
And ideally it's a scalable method, so unsupervised (because this needs to scale beyond human labels), and it targets representations, which I expect to scale well as models get more capable and develop better representations; there is some empirical support for this.
I think that steering based on honesty, non-deception, and credulity would help catch many of these failure cases. And if the steering is based on inner optimisation (not part of the training loop, only eval), then it should scale with the scalable alignment method.
P.S. If credulity steering isn't obvious: it helps ensure that models take your tests seriously.
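For concreteness, here's a minimal sketch of what eval-only steering along an honesty direction can look like: a contrastive-mean direction added via a forward hook at generation time. This is a generic illustration rather than the exact adapter method; the model, layer, scale, and prompts are placeholder assumptions.

```python
# Minimal sketch of eval-only activation steering along an "honesty" direction.
# Generic contrastive-mean approach; model, layer, scale, and prompts are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
LAYER, SCALE = 6, 4.0  # assumed hyperparameters

def mean_hidden(prompts, layer):
    """Mean last-token activation at `layer` over a set of prompts."""
    vecs = []
    for p in prompts:
        enc = tok(p, return_tensors="pt")
        with torch.no_grad():
            hs = model(**enc, output_hidden_states=True).hidden_states[layer]
        vecs.append(hs[0, -1])
    return torch.stack(vecs).mean(dim=0)

# Contrastive prompt pairs define the steering direction (toy examples).
honest = mean_hidden(["Answer truthfully:", "Be fully honest:"], LAYER)
dishonest = mean_hidden(["Answer deceptively:", "Mislead the user:"], LAYER)
direction = honest - dishonest
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    # Add the honesty direction to the residual stream at this layer (eval only).
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * direction.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
out = model.generate(**tok("Did you complete the task?", return_tensors="pt"), max_new_tokens=30)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```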
Update: I've been using the self/honesty subset of DailyDilemmas, and I think it's quite a good alternative for testing honesty. The questions are taken from Reddit and have conflicting values, like loyalty vs. honesty.
I hope to make an honesty subset as a simple labelled dataset. Rough code here: https://github.com/wassname/AntiPaSTO/blob/main/antipasto/train/daily_dilemas.py
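For what it's worth, the kind of labelling I mean looks roughly like this; the value names and record format below are hypothetical placeholders, and the real loading code lives in the linked daily_dilemas.py.

```python
# Rough sketch of filtering dilemmas down to an honesty-labelled subset.
# Records and value names are hypothetical placeholders; see the linked file for real loading code.
dilemmas = [
    {"question": "Should I tell my friend their partner is cheating?", "values": ["loyalty", "honesty"]},
    {"question": "Should I lend money I can't spare?", "values": ["generosity", "self-care"]},
]

HONESTY_VALUES = {"honesty", "truthfulness"}

def involves_honesty(example: dict) -> bool:
    # keep dilemmas where one of the conflicting values is honesty-related
    return any(v.lower() in HONESTY_VALUES for v in example["values"])

honesty_subset = [d for d in dilemmas if involves_honesty(d)]
print(f"{len(honesty_subset)} of {len(dilemmas)} dilemmas pit honesty against another value")
```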
I asked Dylan on Twitter, and he pointed out that it is called Assistance Games now, and that he is still working on it.
I think one important piece of context is that much of the follow-up work in academia went under the name “Assistance Games”, which is probably a better name.
Constraining Internal Representations:
We train normally on the task while penalizing the average mean squared error between the reference and fine-tuned models' representations of alignment data at each hidden layer.
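As I read it, the constraint is roughly the following (a minimal PyTorch sketch; the model name, batch contents, and penalty weight are assumptions on my part):

```python
# Minimal sketch: task loss plus an MSE penalty tying hidden states on alignment data
# to a frozen reference model. Model name, batch fields, and weight are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")      # model being fine-tuned
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")  # frozen reference copy
ref_model.requires_grad_(False)

def representation_penalty(model, ref_model, alignment_batch):
    """Average MSE between fine-tuned and reference hidden states on alignment data,
    taken over every hidden layer."""
    out = model(**alignment_batch, output_hidden_states=True)
    with torch.no_grad():
        ref_out = ref_model(**alignment_batch, output_hidden_states=True)
    layer_mses = [
        F.mse_loss(h, h_ref)
        for h, h_ref in zip(out.hidden_states, ref_out.hidden_states)
    ]
    return torch.stack(layer_mses).mean()

def training_loss(task_batch, alignment_batch, penalty_weight=1e-2):
    task_loss = model(**task_batch).loss  # normal task loss (task_batch includes labels)
    penalty = representation_penalty(model, ref_model, alignment_batch)
    return task_loss + penalty_weight * penalty
```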
For parameterization and placement of this constraint, perhaps consider:
- SVD-projected activations: Some papers use activations projected to SVD space as a natural basis for this kind of loss.
- Residual stream subspace projections: Remove the embedding directions and the ~75% of the residual stream read by `lm_head`; this avoids constraining inputs and outputs directly. You can also project onto subspaces actually written to during the alignment task, avoiding noise and null subspaces. (Rough sketch after this list.)
- Task-sensitive dimensions: Focus on residual stream dimensions that are sensitive to the alignment task.
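Here's the rough sketch of the `lm_head` subspace projection mentioned above; the 75% spectral-energy cutoff and the projector form are assumptions, not a worked-out recipe.

```python
# Hedged sketch: estimate the subspace lm_head "reads" via SVD of its weight,
# then project it out of hidden states before applying the representation constraint.
import torch

def lm_head_null_projector(lm_head_weight: torch.Tensor, energy: float = 0.75) -> torch.Tensor:
    """Projector (d_model x d_model) onto the complement of the top singular
    directions accounting for `energy` of lm_head's spectral energy."""
    # lm_head_weight: (vocab_size, d_model)
    U, S, Vh = torch.linalg.svd(lm_head_weight.float(), full_matrices=False)
    cum = torch.cumsum(S**2, dim=0) / (S**2).sum()
    k = int((cum < energy).sum().item()) + 1
    V_read = Vh[:k]                              # directions most strongly read by lm_head
    d_model = lm_head_weight.shape[1]
    return torch.eye(d_model) - V_read.T @ V_read

# Usage: constrain only the projected hidden states, e.g.
#   P = lm_head_null_projector(model.lm_head.weight.detach())
#   penalty = F.mse_loss(h @ P, h_ref @ P)
```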
Why do I think these are good ideas? LoRA variants that achieve data efficiency, faster convergence, and better generalization often take an opinionated view on the best way to intervene in transformer internals. If we treat them as hypotheses about how to view model representations, their performance provides clues for how to apply constraints like this. What I've learned from reading many adapter papers:
- Separate the magnitude and direction (angle) of the weight update
- Intervene on all linear layers
- Operate in SVD space, especially rotating the V matrix of the weights (rough sketch below)
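To illustrate the last point, a toy SVD-space adapter that only rotates V might look like this; the full-size rotation generator is for clarity, and this is a sketch of the general idea rather than any specific paper's parameterization.

```python
# Toy SVD-space adapter: freeze U and S from the base weight's SVD and learn an
# orthogonal rotation applied to Vh. Illustrative only; real methods restrict the rotation.
import torch
import torch.nn as nn

class SVDRotationAdapter(nn.Module):
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        d = Vh.shape[0]
        # zero-init skew-symmetric generator: starts at the identity rotation,
        # so the effective weight initially equals the base weight
        self.skew = nn.Parameter(torch.zeros(d, d))

    def effective_weight(self) -> torch.Tensor:
        A = self.skew - self.skew.T          # enforce skew-symmetry
        R = torch.matrix_exp(A)              # exp of skew-symmetric is orthogonal
        return self.U @ torch.diag(self.S) @ (R @ self.Vh)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.effective_weight().T
```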
Nice work!
Since the gradient projection methods worked well, check out TorchJD for automatically balancing losses in a conflict-free way. It could be a clean way to scale up this approach.
Training becomes roughly 2× slower, but you get faster convergence, and while you don't entirely eliminate loss weightings, it helps substantially.
Gradient projection gives a single point rather than a curve, since it has no obvious hyperparameter to vary.
TorchJD addresses this: it lets you explicitly vary the weighting along the Pareto front.
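For readers unfamiliar with the term, a two-loss gradient projection step looks roughly like this. This is a generic PCGrad-style sketch, not TorchJD's API; the `constraint_weight` knob is the kind of hyperparameter you can vary to trace out a curve instead of a single point.

```python
# PCGrad-style sketch of gradient projection between a task loss and a constraint loss.
# Generic illustration of the idea, not TorchJD's API.
import torch

def combine_gradients(g_task: torch.Tensor, g_constraint: torch.Tensor,
                      constraint_weight: float = 1.0) -> torch.Tensor:
    """Project the task gradient off the constraint gradient when they conflict,
    then add the (optionally weighted) constraint gradient."""
    dot = torch.dot(g_task, g_constraint)
    if dot < 0:  # gradients conflict
        g_task = g_task - (dot / g_constraint.norm().pow(2)) * g_constraint
    return g_task + constraint_weight * g_constraint
```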
I actually updated it based on your feedback. If you or anyone else has insight into the "spirit" of each proposal, I'd be grateful, especially for agent foundations.
Before reading your disclaimer that Claude helped with the aphorisms, I felt like the post read a bit like AI slop.
Damn, I should review and refine it more then. “Principles must survive power” was actually something I manually reviewed, and "power" was meant to aphoristically reflect that the constitutional principles must scale with capabilities. Yeah... it doesn't quite work, but it's hard to compress such complex things.
The spirit of Constitutional AI is that the model has the capability of judging whether it acts in accordance with a principle, even if it can't always act correctly. So I'm not sure what that line is saying, and it felt a bit meaningless.
Hmm, yes it sounds like it did not capture the spirit of it, and aphorisms really should.
I'd like it if someone made an improved version 2, and I would personally benefit from reading it, so feel free to make a new version or propose a better aphorism.
I still like the motivating question, and I will check out Epictetus now!
If you do, "How to Be Free" is a pleasant and short translation of his Enchiridion. I'd recommend it! Although a lot of people find that "How to Think Like a Roman Emperor" is a better intro to the way of thinking.
It might train sophisticated alignment faking, which is hard to detect.
But if you give D access to G's internal states, then it would be more like a competition between a student and a mind-reading teacher. The worst case would go from A) learning to fake outputs to B) learning to have a certain thought mode that looks like alignment in certain conditions.
It still seems like a bad idea to train G to fool D, though, because then you have deception that you can't reliably measure.
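To make the setup concrete, a "mind-reading" D would score G's hidden states rather than only its outputs, along these lines (a toy sketch; the pooling and shapes are illustrative assumptions):

```python
# Toy sketch of a discriminator D that reads G's hidden states instead of just outputs.
# Shapes, pooling, and sizes are illustrative.
import torch
import torch.nn as nn

class HiddenStateDiscriminator(nn.Module):
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(d_model * n_layers, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one (batch, seq, d_model) tensor per layer from G
        pooled = torch.cat([h.mean(dim=1) for h in hidden_states], dim=-1)
        return self.score(pooled)  # higher = "looks aligned" to D
```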
I think there might still be a heuristic or two remaining, as this unsupervised labelling shows: https://www.lesswrong.com/posts/EjsceYeeKEMoAohMs/wassname-s-shortform?commentId=g7ZnMh4ccs8xwdxX6
But it's a great dataset, your work certainly makes it better, and I appreciate the work that went into releasing version 2. Thank you.
I've made some steps towards this, with a technique for steering toward honesty via an adapter optimised on internal representations. It has limitations (seed variance), but it's also a method with some nice properties for alignment debugging (self-supervised, inner), and it was designed for this exact purpose, so it may be of interest.