It seems like the post is implicitly referring to the next big paper on SAEs from one of these labs, similar in newsworthiness to the last Anthropic paper. A big paper won't be a negative result or a much smaller downstream application, and a big paper would compare its method against baselines if possible, so 165% is still within the ballpark.
I still agree with your comment, especially the recommendation for a time-based prediction (I explained in my other comment here).
Thank you for your alignment work :)
I like your post, and I like how you surveyed the big picture of mechanistic interpretability's present and future. That is important.
I agree that it is looking more promising over time, with Golden Gate Claude etc. I also agree that there is some potential for negatives. I can imagine an advanced AI editing itself using these tools, causing its goals to change, which causes it to edit itself even more, in a feedback loop that leads to misalignment (though this feels unlikely, and a superintelligence would be able to edit itself anyway).
I agree the benefits outweigh the negatives: yes, mechanistic interpretability tools could make AI more capable, but AI will eventually become more capable anyway. What matters is whether the first superintelligence is aligned, and in my opinion it's much harder to align a superintelligence if you don't know what's going on inside.
One small suggestion is to define your predictions better, as Dr. Shah said. It doesn't hurt to convert your prediction to a time-based prediction; just add a small edit to this post. You can still post an update after the next big paper even if your prediction is time-based.
A prediction based on the next big paper depends on unimportant details, like the number of papers they spread their contents over, while not depending on important details, like when the next big paper comes out. Suppose I predicted that the next big advancement beyond OpenAI's o1 will be able to get 90% on the GPQA Diamond, but didn't say when it'll happen. I wouldn't be predicting very much in that case, and I couldn't judge how accurate my prediction was afterwards.
Your last prediction was about the Anthropic report/paper that was already about to be released, so by default you predicted the next paper again. This is very understandable.
Thank you for your alignment work :)
That's fair. To be honest, I've only used AI for writing code; I've merely heard about other people having success with AI drafts. Maybe their situation was different, or they were bad enough at English that the AI writes better than they do.
I'm not sure if this is allowed here, but maybe you can ask an AI to write a draft and manually proofread it for mistakes?
I think if the first powerful unaligned AI remained under control instead of escaping, it might make a real difference, because we could engineer and test alignment ideas on it, rather than developing alignment ideas for an unknown future AI. This assumes at least some instances of it do not hide their misalignment very well.
I think this little mistake doesn't affect the gist of your summary post, so I wouldn't worry about it too much.
The mistaken argument was an attempt to explain why α >= 2/d should hold for any sufficiently nice loss, with the L2 loss then giving α >= 4/d. Believe it or not, the paper never actually argued that the L2 loss changes the exponent; it argued α = 4/d for accurate piecewise linear approximations. That's because the paper used the L2 loss throughout, and the improvement from 2/d to 4/d comes from the piecewise linear approximation, not from the choice of loss.
I feel that was the main misunderstanding.
Figure 2b in the paper shows α ≈ 4/d for most models but α > 4/d for some models.
The paper never mentions L2 loss, just that the loss function is analytic in the model's prediction and minimized when the prediction equals the true value. Such a loss function converges to L2 when the loss is small enough. This important sentence is hard to read because it's cut in half by a bunch of graphs, and it looks like just another unimportant mathematical assumption.
Some people like L2 loss, or a loss function that converges to L2 when the loss is small, because most loss functions (even L1) behave like L2 anyway once you subtract a big source of loss. E.g. variance-limited scaling has the same exponent for L1 and L2 because you subtract the irreducible loss (the loss a perfect model would still have) from the measured loss before reading off the exponent. Even resolution-limited scaling may require subtracting the loss due to noise. L2 is nicer because its derivative is zero if the difference is zero, whereas the derivative of L1 is undefined there, since the absolute value is undifferentiable at 0.
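As a quick sanity check of the claim that L1 behaves like L2 once you subtract a big source of loss (this is just my own numerical illustration, assuming Gaussian label noise; it's not an experiment from either paper): the expected L1 loss minus its noise floor grows quadratically in the model's error, exactly the way an L2 loss would.

```python
# Sketch: show that E|error - noise| minus the noise floor is ~ quadratic in the error.
# Assumes standard Gaussian label noise; uses one shared noise sample so the
# Monte Carlo error mostly cancels when we subtract the noise floor.
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=1_000_000)          # label noise
errors = np.array([0.05, 0.1, 0.2, 0.4])    # model error (prediction minus true value)

l1_floor = np.mean(np.abs(noise))           # irreducible L1 loss at zero error
excess = np.array([np.mean(np.abs(e - noise)) for e in errors]) - l1_floor

# If the excess L1 loss is quadratic in the error, this ratio is roughly constant (~0.4).
print("excess / error^2:", np.round(excess / errors**2, 3))
```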
If you read closely, α = 4/d only refers to piecewise linear approximations:
L(D) ∝ D^(-4/d), at large D. Note that if the model provides an accurate piecewise linear approximation, we will generically find α = 4/d.
The second paper says the same:
If the model is piecewise linear instead of piecewise constant and the true function is smooth with bounded derivatives, then the deviation scales as D^(-2/d), and so the loss will scale as D^(-4/d). We would predict α = 4/d.
If you want a quick fix, you might change the following:
Under the assumption that our test loss is sufficiently “nice”, we can do a Taylor expansion of the test loss around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero. Thus, the first non-zero term is the second-order term, which is proportional to the square of the distance. So, we expect that our scaling law will look like kD^(-2/d), that is, α = 2/d. (EDIT: I no longer endorse the above argument, see the comments.)
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the distance between the true value and the actual value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network. In this case, for loss functions like the L2 loss, which are quadratic in the distance, we get that α = 4/d.
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d and for L2 loss α >= 4/d.
becomes
Under the assumption that the difference between the true value and the model's value is sufficiently “nice”, we can do a Taylor expansion of it around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, the difference is zero at the training data point. With a constant term of zero, we use the linear term (gradient times displacement), which is proportional to the distance. So, we expect that our scaling law will look like kD^(-1/d). For loss functions like the L2 loss, which scale as the difference squared, it becomes kD^(-2/d), so α = 2/d.
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the difference between the true value and the model's value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network (the linear term difference also decreases when the distance decreases). For L2 loss, we get that α = 4/d.
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d.
Also change
Once again, we make the assumption that the learned model gives a piecewise linear approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for the case of L2 loss)
and checking whether α >= 4/d. In most cases, they find that it is quite close to equality. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with the equality (though it is still relatively small -- language models just have a high intrinsic dimension)
to
Once again, we make the assumption that the learned model is better than a piecewise constant approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for a piecewise linear approximation)
and checking whether α >= 2/d. In most cases, they find that it is quite close to 4/d. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with α >= 2/d (though α is still relatively small -- language models just have a high intrinsic dimension)
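As an aside, here is a quick numerical sanity check of the nearest-neighbor scaling behind the corrected argument (my own illustration, not an experiment from either paper): with D training points in d dimensions, the distance to the nearest training point shrinks like D^(-1/d), so a piecewise constant fit of a smooth function with L2 loss should give α close to 2/d.

```python
# Sketch: measure how the mean squared nearest-neighbor distance scales with D.
# Assumes data uniform on [0,1]^d; a piecewise constant fit of a smooth target
# has |difference| ~ distance, so its L2 loss ~ distance^2 ~ D^(-2/d).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
d = 4                                    # intrinsic dimension of the data
sizes = [500, 1000, 2000, 4000, 8000]    # training set sizes D
test = rng.uniform(size=(2000, d))

losses = []
for D in sizes:
    train = rng.uniform(size=(D, d))
    dists, _ = cKDTree(train).query(test)    # distance to the nearest training point
    losses.append(np.mean(dists ** 2))       # L2 loss of a piecewise constant fit

alpha = -np.polyfit(np.log(sizes), np.log(losses), 1)[0]
print(f"measured alpha = {alpha:.2f}, predicted 2/d = {2 / d:.2f}")
```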
Once AI scientists make a false conclusion, like α >= 2/d for any nice loss or α >= 4/d for the L2 loss, they may hallucinate arguments which justify the conclusion. A future research direction is to investigate whether large language models learned this behavior from the AI scientists who trained them.
I'm so sorry I'm so immature but I can't help it.
Overall, it's not a big mistake because it doesn't invalidate the gist of the summary. It's very subtle, unlike those glaring mistakes that I've seen in works by other people... and myself. :)
These organizations just need a few volunteers for research or demonstrations. Once a lot of people sign up, cryonics will no longer be free; it will cost tens of thousands of dollars, as it normally does.
Even if they are nonprofit, they may behave as businesses because:
Once we get superintelligence, we might get every other technology that the laws of physics allow, even if we aren't that "close" to these other technologies.
Maybe they believe there is a ≈38% chance of superintelligence by 2039.
PS: Your comment may have caused it to drop to 38%. :)