I think the sparse autoencoder line of interpretability work is somewhat convincing evidence that LLMs are not conscious. In order for me to consciously take in a piece of information (e.g., "the house is yellow"), I need to store not only the content of the statement but also some aspect of my...
There's a pretty exciting new interpretability paper, which hasn't really received the attention it deserves because it isn't billed as interpretability work. The paper modifies the transformer architecture so that a forward pass minimizes a specifically engineered energy function. According to the paper, "This functionality makes it possible to visualize essentially any...
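To make the "forward pass as energy minimization" idea concrete, here's a minimal sketch, and an illustration only: it uses a modern-Hopfield-style energy rather than the paper's actual architecture, and every function name, constant, and pattern matrix below is my own invention.

```python
import numpy as np

def energy(x, M, beta=4.0):
    # Modern-Hopfield-style energy: low when x aligns with a stored pattern (row of M)
    lse = np.log(np.sum(np.exp(beta * (M @ x)))) / beta
    return -lse + 0.5 * x @ x

def forward(x, M, beta=4.0, lr=0.5, steps=50):
    # The "forward pass" is just gradient descent on the energy
    for _ in range(steps):
        p = np.exp(beta * (M @ x))
        p /= p.sum()                 # softmax attention over stored patterns
        x = x - lr * (x - M.T @ p)   # x - lr * grad E(x)
    return x

M = np.eye(2)                  # two stored patterns: [1, 0] and [0, 1]
x0 = np.array([0.9, 0.2])
x_star = forward(x0, M)        # descends the energy, settling near pattern [1, 0]
```

The attraction of this framing is exactly what the quote suggests: every intermediate state has a scalar energy attached, so you can watch the computation descend.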
When you self-distill a model (i.e., train a new model on the predictions of your old model), the resulting model represents a less complex function. After many rounds of self-distillation, you essentially end up with a constant function. This paper makes the above precise. Anyway, if you apply multiple rounds...
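A toy version of this collapse is easy to reproduce, with ridge regression standing in for the model (my choice of stand-in, not necessarily the paper's setting): each distillation round multiplies the predictions by a contraction, so repeated rounds shrink them toward the zero, i.e. constant, function.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))   # toy inputs
y = rng.normal(size=40)        # toy labels
lam = 5.0                      # ridge regularization strength

def ridge_predict(X, y, lam):
    # Closed-form ridge regression; returns the model's in-sample predictions
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return X @ w

# Self-distillation: each round fits a fresh model to the previous round's predictions
preds, norms = y, []
for _ in range(20):
    preds = ridge_predict(X, preds, lam)
    norms.append(np.linalg.norm(preds))
# norms decreases monotonically: every round re-applies the same shrinkage operator
```

The mechanism is that the in-sample prediction map of ridge regression has all eigenvalues strictly below 1, so iterating it drives everything to zero.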
There are diminishing marginal returns to intelligence: an AI with an IQ of 150 can already perform almost all human tasks flawlessly. The only exception may be conducting scientific research. So why don't we lobby for capping model size at, say, a couple hundred billion parameters? This cap can...
The most concerning part of AGI is that it may have power-seeking tendencies. We can loosely define a powerful state to be a state that is useful to be in regardless of the reward function. A promising approach to AI safety is to make the RL process "interpretable" enough so...
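One way to make "useful regardless of the reward function" concrete, in the spirit of the average-optimal-value formalization of power (the tiny MDP below and all its state labels are hypothetical): score each state by its optimal value averaged over many randomly sampled reward functions, and states with more options come out on top.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Deterministic MDP as successor lists. State 0 is a "hub" (three options),
# state 4 is a "dead end" (one option); states 1-3 are absorbing.
succ = {0: [1, 2, 3], 1: [1], 2: [2], 3: [3], 4: [1]}

def optimal_values(r, succ, gamma, iters=200):
    # Standard value iteration for a deterministic MDP
    V = np.zeros(len(succ))
    for _ in range(iters):
        V = np.array([r[s] + gamma * max(V[t] for t in succ[s]) for s in succ])
    return V

# "Power" of a state: its optimal value averaged over random reward functions
power = np.zeros(len(succ))
n_samples = 2000
for _ in range(n_samples):
    r = rng.uniform(size=len(succ))
    power += optimal_values(r, succ, gamma)
power /= n_samples
# The hub's extra options make it more valuable on average than the dead end
```

The hub beats the dead end not because any particular reward favors it, but because keeping options open helps under almost every reward, which is exactly the worry about power-seeking.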