I wonder if it would be possible to do SAE feature amplification / ablation, at least for residual stream features, by inserting a "mostly empty" layer. E,g, for feature ablation, setting the W_O
and b_O
params of the attention heads of your inserted layer to 0 to make sure that the attention heads don't change anything, and then approximate the constant / clamping intervention from the blog post via the MLP weights (if the activation function used for the transformer is the same one as is used for the SAE, it should be possible to do a perfect approximation using only one of the MLP neurons, but even if not it should be possible to very closely approximate any commonly-used activation function using any other commonly-used activation function with some clever stacking).
This would of course be horribly inefficient from a compute perspective (each forward pass would take times as long, where is the original number of layers the model had and is the number of distinct layers in which you're trying to do SAE operations on the residual stream), but I think vllm would handle "llama but with one extra layer" without requiring any tricky inference code changes and plausibly this would still be more efficient than resampling.
Admittedly this sounds like an empirical claim, yet is not really testable, as these visualizations and input-variable-to-2D-space mappings are purely hypothetical
Usually not testable, but occasionally reality manages to make it convenient to test something like this fun paper.
If you have a bunch of things like this, rather than just one or two, I bet rejection sampling gets expensive pretty fast - if you have one constraint which the model fails 10% of the time, dropping that failure rate to 1% brings you from 1.11 attempts per success to 1.01 attempts per success, but if you have 20 such constraints that brings you from 8.2 attempts per success to 1.2 attempts per success.
Early detection of constraint violation plus substantial infrastructure around supporting backtracking might be an even cheaper and more effective solution, though at the cost of much higher complexity.
I think a fairly common-here mental model of alignment requires context awareness, and by that definition an LLM with no attached memory couldn't be aligned.
That would also explain why the culture was so heavily concentrated in tumblr (where the way to express anger and disagreement with a post is to share the post to all your followers with your comment tacked on to the end) and later twitter (same, but now the platform also notices you engage with stuff that makes you angrily reply and shows you more of it).
Here’s a hypothetical ‘gold standard’ test: we do a big randomized controlled trial to see if a bunch of non-experts can actually create a (relatively harmless) virus from start to finish. Half the people would have AI mentors and the other half can only look stuff up on the internet. We’d give each participant $50K and access to a secure wet-lab set up like a garage lab, and make them do everything themselves: find and adapt the correct protocol, purchase the necessary equipment, bypass any know-your-customer checks, and develop the tacit skills needed to run experiments, all on their own. Maybe we give them three months and pay a bunch of money to anyone who can successfully do it.
What if both the AI group and the control group have high success rates at this test?
It sure would.
Would you give 3:1 odds against a major player showing the untrusted model the old CoT, conditional on them implementing a "replace suspicious outputs" flow on a model with hidden CoT and publicizing enough implementation details to determine whether or not they did that?
If the untrusted model has a private and un-audited chain of thought (like o1), and that chain of thought is preserved between rounds, the untrusted model may have a very good idea of what its original command was before the auditor replaced the command.
Your "the untrusted model is stateless between queries and only sees the command history and system state" does cover this, but wanted to call out that that assumption might not hold if current architectures are indicative of the future direction of development.
I note that if software developers used that logic for thinking about software security, I expect that almost all software in the security-by-obscurity world would have many holes that would be considered actual negligence in the world we live in.
It's more that any platform that allows discussion of politics risks becoming a platform that is almost exclusively about politics. Upvoting is a signal of "I want to see more of this content", while downvoting is a signal of "I want to see less of this content". So "I will downvote any posts that are about politics or politics-adjacent, because I like this website and would be sad if it turned into yet another politics forum" is a coherent position.
All that said, I also did not vote on the above post.