research_prime_space

Comparing Anthropic's Dictionary Learning to Ours

I think the sparse autoencoder line of interpretability work is somewhat convincing evidence that LLMs are not conscious.

In order for me to consciously take in some information (e.g. the house is yellow), I need to store not only the contents of the statement but also some aspect of my conscious experience. I need to store more than the minimal number of bits it would take to represent "the house is yellow".

The sparse autoencoder line of work appears to suggest that LLMs essentially store "bits" that represent "themes" in the text they're processing, but close to nothing (at least in L2 norm) beyond that. And furthermore, this is happening in each layer. Thus, there doesn't appear to be any residual "space" that left over for storing aspects of consciousness.

Replying toComparing Anthropic's Dictionary Learning to Ours

research_prime_space2y

This is cool! These sparse features should be easily "extractable" by the transformer's key, query, and value weights in a single layer. Therefore, I'm wondering if these weights can somehow make it easier to "discover" the sparse features?

Exciting New Interpretability Paper!

There's a pretty exciting new interpretability paper, which hasn't really received the requisite attention because it's not billed as such.

This paper modifies the transformer architecture so that a forward pass minimizes a specifically engineered energy function.

According to the paper, "This functionality makes it possible to visualize essentially any token representation, weight, or gradient of the energy directly in the image plane. This feature is highly desirable from the perspective of interpretability, since it makes it possible to track the updates performed by the network directly in the image plane as the computation unfolds in time".

They achieve SOTA on two of the domains they tested on, although they didn't test on NLP or CV tasks (which is why the paper was rejected, I believe the authors will resubmit again with more experiments.)

More generally, I think architectures such as the above that essentially give you interpretability for free are a promising research direction.

Replying toPenalize Model Complexity Via Self-Distillation

I don't really think that 1. would be true -- following DAN-style prompts is the minimum complexity solution. You want to act in accordance with the prompt.
Backdoors don't emerge naturally. So if it's computationally infeasible to find an input where the original model and the backdoored model differ, then self-distillation on the backdoored model is going to be the same as self-distillation on the original model.

The only scenario where I think self-distillation is useful would be if 1) you train a LLM on a dataset, 2) fine-tune it to be deceptive/power-seeking, and 3) self-distill it on the original dataset, then self-distilled model would likely no longer be deceptive/power-seeking.

Replying toPenalize Model Complexity Via Self-Distillation

I think self-distillation is better than network compression, as it possesses some decently strong theoretical guarantees that you're reducing the complexity of the function. I haven't really seen the same with the latter.

But what research do you think would be valuable, other than the obvious (self-distill a deceptive, power-hungry model to see if the negative qualities go away)?

Replying toPenalize Model Complexity Via Self-Distillation

As of right now, I don't think that LLMs are trained to be power seeking and deceptive.

Power-seeking is likely if the model is directly maximizing rewards, but LLMs are not quite doing this.

Replying toPenalize Model Complexity Via Self-Distillation

I just wanted to add another angle. Neural networks have a fundamental "simplicity bias", where they learn low frequency components exponentially faster than high frequency components. Thus, self-distillation is likely to be more efficient than training on the original dataset (the function you're learning has fewer high frequency components). This paper formalizes this claim.

But in practice, what this means is that training GPT-3.5 from scratch is hard but simply copying GPT-3.5 is pretty easy. Stanford was recently able to finetune a pretty bad 7B model to be as good as GPT-3.5 using only 52K examples (generated from GPT-3.5) and $600 of compute. This means that once a GPT is out there, it's... (read more)

When you self-distill a model (e.g. train a new model using predictions from your old model), the resulting model represents a less complex function. After many rounds of self-distillation, you essentially end up with a constant function. This paper makes the above more precise.

Anyway, if you apply multiple rounds of self-distillation to a model, it becomes less complex. So if the original model learned complex, power-seeking behaviors that doesn't help it do well on the training data, this behavior would likely go away after several rounds of self-distillation. Self-distillation allows you to essentially get the minimum complexity model that still does well on the test set. Thus, I think it's promising from an AI safety standpoint.

Replying toCap Model Size for AI Safety

Cap Model Size for AI Safety

I feel like capping the memory of GPUs would also affect normal folk who just want to train simple models, so it may be less likely to be implemented. It also doesn't really cap the model size, which is the main problem.

But I agree it would be easier to enforce, and certainly, much better than the status quo.

Replying toCap Model Size for AI Safety

Cap Model Size for AI Safety

I think you make a lot of great points.

I think some sort of cap is the one of the highest impact things we can do from a safety perspective. I agree that imposing the cap effectively and getting buy-in from broader society is a challenge, however, these problems are a lot more tractable than AI safety.

I haven't heard anybody else propose this so I wanted to float it out there.

Cap Model Size for AI Safety

Simple Way to Prevent Power-Seeking AI

There are diminishing marginal returns to intelligence -- an AI with an IQ of 150 can perform almost all human tasks flawlessly. The only exception may be conducting scientific research.

So why don't we lobby for capping the model size, at perhaps, a couple hundred billion parameters? This cap can be strictly enforced if it's encoded into deep learning software packages (e.g. NVIDIA, PyTorch, etc.).

I think this may be the most tractable approach for AI safety.