Wiki Contributions

Comments

I think it's easier to see the significance if you imagine the neural networks as a human-designed system. In e.g. a computer program, there's a clear distinction between the code that actually runs and the code that hypothetically could run if you intervened on the state, and in order to explain the output of the program, you only need to concern yourself with the former, rather than also needing to consider the latter.

For neural networks, I sort of assume there's a similar thing going on, except it's quite hard to define it precisely. In technical terms, neural networks lack a privileged basis which distinguishes different components of the network, so one cannot pick a discrete component and ask whether it runs and if so how it runs.

This is a somewhat different definition of "on-manifold" than is usually used, as it doesn't concern itself with the real-world data distribution. Maybe it's wrong of me to use the term like that, but I feel like the two meanings are likely to be related, since the real-world distribution of data shaped the inner workings of the neural network. (I think this makes most sense in the context of the neural tangent kernel, though ofc YMMV as the NTK doesn't capture nonlinearities.)

In principle I don't think it's always important to stay on-manifold, it's just what one of my lines of thought has been focused on. E.g. if you want to identify backdoors, going off-manifold in this sense doesn't work.

I agree with you that it is sketchy to estimate the manifold from wild empiricism. Ideally I'm thinking one could use the structure of the network to identify the relevant components for a single input, but I haven't found an option I'm happy with.

Also one convoluted (perhaps inefficient) idea but which felt kind of fun to stay on manifold is to do the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing for the vectors to be close to one of the token vectors or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it's not on manifold right? You might be able to automate by looking at perplexity or training a small model to estimate that an input prompt is a "realistic" sentence or whatever.

Maybe. But isn't optimization in token-space pretty flexible, such that this is a relatively weak test?

Realistically steering vectors can be useful even if they go off-manifold, so I'd wait with trying to measure how on-manifold stuff is until there's a method that's been developed to specifically stay on-manifold. Then one can maybe adapt the measurement specifically to the needs of that method.

I think this is a really interesting idea, but I'm not comfortable enough with drugs to test it myself. If anyone is doing this and wants psychometric advice, though, I am offering to join your project.

I think the proposed method could still work though. A substantial fraction of the pseudorandomness may be consistent on the individual person level.

The type of pseudorandomness you describe here ought to be independent at the level of individual items, so it ought to be part of the least-reliable variance component (not part of the general trait measured and not stable over time). It's possible to use statistics to estimate how big an effect it has on the scores, and it's possible to drive it arbitrarily far down in effect simply by making the test longer.

The way I'd phrase the theoretical problem when you fit a model to a distribution (e.g. minimizing KL-divergence on a set of samples), you can often prove theorems of the form "the fitted-distribution has such-and-such relationship to the true distribution", e.g. you can compute confidence intervals for parameters and predictions in linear regression.

Often, all that is sufficient for those theorems to hold is:

  1. The model is at an optimum
  2. The model is flexible enough
  3. The sample size is big enough

... because then if you have some point X you want to make predictions for, the sample size being big enough means you have a whole bunch of points in the empirical distribution that are similar to X. These points affect the loss landscape, and because you've got a flexible optimal model, that forces the model to approximate them well enough.

But this "you've got a bunch of empirical points dragging the loss around in relevant ways" part only works on-distribution, because you don't have a bunch of empirical points from off-distribution data. Even if technically they form an exponentially small slice of the true distribution, this means they only have an exponentially small effect on the loss function, and therefore being at an optimal loss is exponentially weakly informative about these points.

(Obviously this is somewhat complicated by overfitting, double descent, etc., but I think the gist of the argument goes through.)

I guess it depends on whether one makes the cut between theory and practice with or without assuming that one has learned the distribution? I.e. I'm saying if you have a distribution D, take some samples E, and approximate E with Q, then you might be able to prove that samples from Q are similar to samples from D, but you can't prove that conditioning on something exponentially unlikely in D gives you something reasonable in Q. Meanwhile you're saying that conditioning on something exponentially unlikely in D is tantamount to optimization.

(…Unless you do conditional sampling of a learned distribution, where you constrain the samples to be in a specific a-priori-extremely-unlikely subspace, in which case sampling becomes isomorphic to optimization in theory. (Because you can sample from the distribution of (reward, trajectory) pairs conditional on high reward.))

Does this isomorphism actually go through? I know decision transformers kinda-sorta show how you can do optimization-through-conditioning in practice, but in theory the loss function which you use to learn the distribution doesn't constrain the results of conditioning off-distribution, so I'd think you're mainly relying on having picked a good architecture which generalizes nicely out of distribution.

One could reply, "Oh, sure, it's obvious that you can conditionally sample a learned distribution to safely do all sorts of economically valuable cognitive tasks, but that's not the danger of true AGI." And I ultimately think you're correct about that. But I don't think the conditional-sampling thing was obvious in 2004.

Idk. We already knew that you could use basic regression and singular vector methods to do lots of economically valuable tasks, since that was something that was done in 2004. Conditional-sampling "just" adds in the noise around these sorts of methods, so it goes to say that this might work too.

Adding noise obviously doesn't matter in 1 dimension except for making the outcomes worse. The reason we use it for e.g. images is that adding the noise does matter in high-dimensional spaces because without the noise you end up with the highest-probability outcome, which is out of distribution. So in a way it seems like a relatively minor fix to generalize something we already knew was profitable in lots of cases.

On the other hand, I didn't learn the probability thing until playing with some neural network ideas for outlier detection and learning they didn't work. So in that sense it's literally true that it wasn't obvious (to a lot of people) back before deep learning took off.

And I can't deny that people were surprised that neural networks could learn to do art. To me this became relatively obvious with early GANs, which were later than 2004 but earlier than most people updated on it.

So basically I don't disagree but in retrospect it doesn't seem that shocking.

Fair, it's eigenvectors should be equivalent to the singular vectors of the Jacobian.

I wonder if a similar technique could form the foundation for a fully general solution to the alignment problem. Like mathematically speaking all this technique needs is a vector-to-vector function, and it's not just layer-to-layer relationships that can be understood as a vector-valued function; the world as a function of the policy is also vector-valued.

I.e. rather than running a search to maximize some utility function, a model-based agent could run a search for small changes in policy that have a large impact on the world. If one can then taxonomize, constrain and select between these impacts, one might be able to get a highly controllable AI.

Obviously there's some difficulties here because the activations are easier to search over since we have an exact way to calculate them. But that's a capabilities question rather than an alignment question.

tailcalledΩ120

The singular vectors of the Jacobian between two layers seems more similar to what you're doing in the OP than the Hessian of the objective function does? Because the Hessian of the objective function sort of forces it all to be mediated by the final probabilities, which means it discounts directions in activation space that don't change the probabilities yet, but would change the probabilities if the change in activations was scaled up beyond infinitesimal.

Edit: wait, maybe I misunderstood, I assumed by the objective function you meant some cross-entropy on the token predictions, but I guess in-context it's more likely you meant the objective function for the magnitude of change in later layer activations induced by a given activation vector?

tailcalledΩ461

I think one characteristic of steering vectors constructed this way is that they are allowed to be off-manifold, so they don't necessarily tell us how the networks currently work, rather than how they can be made to work with adaptations.

For the past for weeks, I've been thinking about to interpret networks on-manifold. The most straightforward approach I could come up with was to restrict oneself to the space of activations that actually occur for a given prompt, e.g. by performing SVD of the token x activation matrix for a given layer, and then restricting oneself to changes that occur along the right singular vectors of this.

My SVD idea might improve things, but I didn't get around to testing it because I eventually decided that it wasn't good enough for my purposes because 1) it wouldn't keep you on-manifold enough because it could still introduce unnatural placement of information and exaggerated features, 2) given that transformers are pretty flexible and you can e.g. swap around layers, it felt unclean to have a method that's this strongly dependent on the layer structure.

A followup idea I've been thinking about but haven't been able to be satisfied with is, projections. Like if you pick some vector u, and project the activations (or weights, in my theory, but you work more with activations so let's consider activations, it should be similar) onto u, and then subtract off that projection from the original activations, you get "the activations with u removed", which intuitively seems like it would better focus on "what the network actually does" as opposed to "what the network could do if you added something more to it".

Unfortunately after thinking for a while, I started thinking this actually wouldn't work. Let's say the activations a = x b + y c, where b is a large activation vector that ultimately doesn't have an effect on the final prediction, and c is a small activation vector that does have an effect. If you have some vector d that the network doesn't use at all, you could then project away sqrt(1/2) (b-d), which would introduce the d vector into the activations.

Another idea I've thought about is, suppose you do SVD of the activations. You would multiply the feature half of the SVD with the weights used to compute the KQV matrices, and then perform SVD of that, which should get you the independent ways that one layer affects the next layer. One thing I in particular wonder about is, if you start doing this from the output layer, and proceed backwards, it seems like this would have the effect of "sparsifying" the network down to only the dimensions which matter for the final output, which seems like it should assist in interpretability and such. But it's not clear it interacts nicely with the residual network element.

Load More