I think the sparse autoencoder line of interpretability work is somewhat convincing evidence that LLMs are not conscious. In order for me to consciously take in a piece of information (e.g., "the house is yellow"), I need to store not only the content of the statement but also some aspect of my...
There's a pretty exciting new interpretability paper, which hasn't really received the attention it deserves because it isn't billed as interpretability work. The paper modifies the transformer architecture so that a forward pass minimizes a specifically engineered energy function. According to the paper, "This functionality makes it possible to visualize essentially any...
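To make the "forward pass as energy minimization" idea concrete, here's a minimal sketch, and an illustration only: it uses a modern-Hopfield-style energy rather than the paper's actual architecture, and every function name, constant, and pattern matrix below is my own invention.

```python
import numpy as np

def energy(x, M, beta=4.0):
    # Modern-Hopfield-style energy: low when x aligns with a stored pattern (row of M)
    lse = np.log(np.sum(np.exp(beta * (M @ x)))) / beta
    return -lse + 0.5 * x @ x

def forward(x, M, beta=4.0, lr=0.5, steps=50):
    # The "forward pass" is just gradient descent on the energy
    for _ in range(steps):
        p = np.exp(beta * (M @ x))
        p /= p.sum()                 # softmax attention over stored patterns
        x = x - lr * (x - M.T @ p)   # x - lr * grad E(x)
    return x

M = np.eye(2)                  # two stored patterns: [1, 0] and [0, 1]
x0 = np.array([0.9, 0.2])
x_star = forward(x0, M)        # descends the energy, settling near pattern [1, 0]
```

The attraction of this framing is exactly what the quote suggests: every intermediate state has a scalar energy attached, so you can watch the computation descend.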
When you self-distill a model (i.e., train a new model on the predictions of your old model), the resulting model represents a less complex function. After many rounds of self-distillation, you essentially end up with a constant function. This paper makes the above precise. Anyway, if you apply multiple rounds...
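A toy version of this collapse is easy to reproduce, with ridge regression standing in for the model (my choice of stand-in, not necessarily the paper's setting): each distillation round multiplies the predictions by a contraction, so repeated rounds shrink them toward the zero, i.e. constant, function.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))   # toy inputs
y = rng.normal(size=40)        # toy labels
lam = 5.0                      # ridge regularization strength

def ridge_predict(X, y, lam):
    # Closed-form ridge regression; returns the model's in-sample predictions
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return X @ w

# Self-distillation: each round fits a fresh model to the previous round's predictions
preds, norms = y, []
for _ in range(20):
    preds = ridge_predict(X, preds, lam)
    norms.append(np.linalg.norm(preds))
# norms decreases monotonically: every round re-applies the same shrinkage operator
```

The mechanism is that the in-sample prediction map of ridge regression has all eigenvalues strictly below 1, so iterating it drives everything to zero.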
There are diminishing marginal returns to intelligence: an AI with an IQ of 150 can already perform almost all human tasks flawlessly. The only exception may be conducting scientific research. So why don't we lobby for capping model size at, say, a couple hundred billion parameters? This cap can...
The most concerning part of AGI is that it may have power-seeking tendencies. We can loosely define a powerful state to be a state that is useful to be in regardless of the reward function. A promising approach to AI safety is to make the RL process "interpretable" enough so...
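One way to make "useful regardless of the reward function" concrete, in the spirit of the average-optimal-value formalization of power (the tiny MDP below and all its state labels are hypothetical): score each state by its optimal value averaged over many randomly sampled reward functions, and states with more options come out on top.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Deterministic MDP as successor lists. State 0 is a "hub" (three options),
# state 4 is a "dead end" (one option); states 1-3 are absorbing.
succ = {0: [1, 2, 3], 1: [1], 2: [2], 3: [3], 4: [1]}

def optimal_values(r, succ, gamma, iters=200):
    # Standard value iteration for a deterministic MDP
    V = np.zeros(len(succ))
    for _ in range(iters):
        V = np.array([r[s] + gamma * max(V[t] for t in succ[s]) for s in succ])
    return V

# "Power" of a state: its optimal value averaged over random reward functions
power = np.zeros(len(succ))
n_samples = 2000
for _ in range(n_samples):
    r = rng.uniform(size=len(succ))
    power += optimal_values(r, succ, gamma)
power /= n_samples
# The hub's extra options make it more valuable on average than the dead end
```

The hub beats the dead end not because any particular reward favors it, but because keeping options open helps under almost every reward, which is exactly the worry about power-seeking.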