Wiki Contributions

Comments

Sorted by

This is an important point. AI alignment/safety organizations take money as input and write very abstract papers as their output, which usually have no immediate applications. I agree it may appear very unproductive.

However, if we think from first principles, a lot of other things are like that. For instance, when you go to school, you study the works of Shakespeare, you learn to play the guitar, and you learn how Spanish pronouns work. These things appear to be a complete waste of time. If 50 million students in the US spend 1 hour a day on these kinds of activities, and each hour is valued at only $10, that's $180 billion/year.

But we know these things are not a waste of time, because in hindsight, when you study how students grow up, this work somehow helps them later in life.

Lots of things appear useless, but are valuable for reasons beyond the intuitive set of reasons we evolved to understand.

Studying the nucleus of atoms might appear like a useless curiosity, if you didn't know it'll lead to nuclear energy. There are no real world applications for a long time but suddenly there are enormous applications.

Pasteur's studies on fermentation might appear limited to modest winemaking improvements, but it led to the discovery of germ theory which saved countless lives.

The stone age people studying weird rocks may have discovered obsidian and copper. Those who studied the strange seeds that plants produce may have discovered agriculture.

We don't know how valuable this alignment work is. We should cope with this uncertainty probabilistically: if there is a 50% chance it will help us, the benefits per cost is halved, but that doesn't reduce ideal spending to zero.

Once we get superintelligence, we might get every other technology that the laws of physics allow, even if we aren't that "close" to these other technologies.

Maybe they believe in a  chance of superintelligence by 2039.

PS: Your comment may have caused it to drop to 38%. :)

Knight LeeΩ010

It seems like the post is implicitly referring to the next big paper on SAEs from one of these labs, similar in newsworthiness as the last Anthropic paper. A big paper won't be a negative result or a much smaller downstream application, and a big paper would compare its method against baselines if possible, making 165% still within the ballpark.

I still agree with your comment, especially the recommendation for a time-based prediction (I explained in my other comment here).

Thank you for your alignment work :)

Knight LeeΩ010

I like your post, I like how you overviewed the big picture of mechanistic interpretability's present and future. That is important.

I agree that it is looking more promising over time with the Golden Gate Claude etc. I also agree that there is some potential for negatives. I can imagine an advanced AI editing itself using these tools, causing its goals to change, causing it to edit itself even more, in a feedback loop that leads to misalignment (this feels unlikely, and a superintelligence would be able to edit itself anyways).

I agree the benefits outweigh the negatives: yes mechanistic interpretability tools could make AI more capable, but AI will eventually become capable anyways. What matters is whether the first superintelligence is aligned, and in my opinion it's much harder to align a superintelligence if you don't know what's going on inside.

One small detail is defining your predictions better, as Dr. Shah said. It doesn't hurt to convert your prediction to a time-based prediction. Just add a small edit to this post. You can still post an update after the next big paper even if your prediction is time-based.

A prediction based on the next big paper not only depends on unimportant details like the number of papers they spread their contents over, but doesn't depend on important details like when the next big paper comes. Suppose I predicted that the next big advancement beyond OpenAI's o1 will be able to get 90% on the GPQA Diamond, but didn't say when it'll happen. I'm not predicting very much in that case, and I can't judge how accurate my prediction was afterwards.

Your last prediction was about the Anthropic report/paper that was already about to be released, so by default you predicted the next paper again. This is very understandable.

Thank you for your alignment work :)

That's fair. To be honest I've only used AI for writing code, I merely heard about other people having success with AI drafts. Maybe their situation was different, or they were bad at English to the point that AI writes better than them.

I'm not sure if this is allowed here, but maybe you can ask an AI to write a draft and manually proofread for mistakes?

I think if the first powerful unaligned AI remained in control instead of escaping, it might make a good difference, because we can engineer and test alignment ideas on it, rather than develop alignment ideas on an unknown future AI. This assumes at least some instances of it do not hide their misalignment very well.

I think this little mistake doesn't affect the gist of your summary post, I wouldn't worry about it too much.

The mistake

The mistaken argument was an attempt to explain why . Believe it or not, the paper never actually argued , it argued . That's because the paper used the L2 loss, and .

I feel that was the main misunderstanding.

Figure 2b in the paper shows  for most models but  for some models.

The paper never mentions L2 loss, just that the loss function is "analytic in  and minimized at ." Such a loss function converges to L2 when the loss is small enough. This important sentence is hard to read because it's cut in half by a bunch of graphs, and looks like another unimportant mathematical assumption.

Some people like L2 loss or a loss function that converges to L2 when the loss is small, because most loss functions (even L1) behave like L2 anyway, once you subtract a big source of loss. E.g. variance-limited scaling has  in both L1 and L2 because you subtract  or  from . Even resolution-limited scaling may require subtracting loss due to noise . L2 is nicer because if  is zero,  but  is undefined since the absolute value is undifferentiable at 0.

If you read closely,  only refers to piecewise linear approximations:

, at large .  Note that if the model provides an accurate piecewise linear approximation, we will generically find .

The second paper says the same:

If the model is piecewise linear instead of piecewise constant and  is smooth with bounded derivatives, then the deviation , and so the  loss will scale as . We would predict 

Nearest neighbors regression/classification has . Skip this section if you already agree:

  • If you use the nearest neighbors regression to fit the function , it creates a piecewise constant function that looks like a staircase trying to approximate a curve. The L1 loss is proportional to D^(-1) because adding data makes the staircase smaller. A better attempt to connect the dots of the training data (e.g. a piecewise linear function) would have L1 loss proportional to D^(-2) or better because adding data makes the "staircase" both smaller and smoother. Both the distances and the loss gradient decrease.
  • My guess is that nearest neighbors classification also has L1 loss proportional to D^(-1/d), because the volume that is misclassified is roughly proportional to distance between the decision boundary and the nearest point. Again, I think it creates a decision boundary that is rough (analogous to the staircase), and the roughness gets smaller with additional data but never gets any smoother.
  • Note the rough decision boundaries in a picture of nearest neighbors classification:
    https://upload.wikimedia.org/wikipedia/commons/5/52/Map1NN.png

Suggestions

If you want a quick fix, you might change the following:

Under the assumption that our test loss is sufficiently “nice”, we can do a Taylor expansion of the test loss around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero. Thus, the first non-zero term is the second-order term, which is proportional to the square of the distance. So, we expect that our scaling law will look like kD^(-2/d), that is, α = 2/d. (EDIT: I no longer endorse the above argument, see the comments.)

The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the distance between the true value and the actual value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network. In this case, for loss functions like the L2 loss, which are quadratic in the distance, we get that α = 4/d.

Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d and for L2 loss α >= 4/d.

becomes

Under the assumption that  is sufficiently “nice”, we can do a Taylor expansion of it  around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, the difference is zero at the training data point. With a constant term of zero, we use the linear term (gradient  displacement), which is proportional to the distance. So, we expect that our scaling law will look like kD^(-1/d). For loss functions like the L2 loss, which scale as the difference squared, it becomes kD^(-2/d) so α = 2/d.

The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the difference between the true value and the model's value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network (the linear term difference also decreases when the distance decreases). For L2 loss, we get that α = 4/d.

Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d.

Also change

Once again, we make the assumption that the learned model gives a piecewise linear approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for the case of L2 loss)

and checking whether α >= 4/d. In most cases, they find that it is quite close to equality. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with the equality (though it is still relatively small -- language models just have a high intrinsic dimension)

to

Once again, we make the assumption that the learned model is better than a piecewise constant approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for a piecewise linear approximation)

and checking whether α >= 2/d. In most cases, they find that it is quite close to 4/d. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with α >= 2/d (though α is still relatively small -- language models just have a high intrinsic dimension)

Conclusion

Once AI scientists make a false conclusion like  or , they may hallucinate arguments which justify the conclusion. A future research direction is to investigate whether large language models learned this behavior from the AI scientists who trained them.

I'm so sorry I'm so immature but I can't help it.

Overall, it's not a big mistake because it doesn't invalidate the gist of the summary. It's very subtle, unlike those glaring mistake that I've seen in works by other people... and myself. :)

 

These organizations just need a few volunteers for research or demonstrations. Once a lot of people sign up, cryonics will not be free again. It will cost tens of thousands as it normally does.

Even if they are nonprofit, they may behave as businesses because:

  1. They need your money.
  2. They compete with each other.
  3. Winning customers by any means might increase their competitiveness.