User Comment Replies

Why I'm bearish on mechanistic interpretability: the shards are not in the network

I agree that it is dubious at the moment. I just think it's too early to tell and the field itself will undoubtedly grow in complexity over the coming years.

Your point about the spontaneity of cells forming stands, although I wasn't phrasing the analogy at the level of thermodynamics / physics.

Why I'm bearish on mechanistic interpretability: the shards are not in the network

rajathsalegame7mo10

In the same way that cells were understood to be indivisible, atomic units of biology hundreds of years ago--before the discovery of sub-cellular structures like organelles, proteins, and DNA--we currently understand features to be fundamental units of neural network representations that we are examining with tools like mechanistic interpretability.

This is not to say that the definition of what constitutes a "feature" is clear at all--in fact, its lack of consensus reflects the extremely immature (but exciting!) state of interpretability research tod... (read more)

2tailcalled7mo

If you have a certain kind of cell (e.g. penicillium), then you can add certain kinds of organic matter (e.g. food), and then this organic matter spontaneously converts into more of the original kind of cell (e.g. it gets moldy). This makes cells much more influential than other similarly-diminished entities. In order to get something analogous to cells, it's not enough just to discover small structures, since there are lots of small structures that don't form spontaneously like this. It seems dubious whether current mechanistic interpretability is finding features like this.

Model evals for dangerous capabilities

rajathsalegame7mo10

Out of curiosity, do you have any thoughts on the importance / feasibility of formal verification / mathematically "provable" safety based approaches in these evals you mention?

8Zach Stein-Perlman7mo

No. But I’m skeptical: seems hard to imagine provable safety, much less competitive with the default path to powerful AI, much less how post-hoc evals are relevant.

Why I'm bearish on mechanistic interpretability: the shards are not in the network

rajathsalegame7mo10

I would argue that the AI equivalent of these tiny organisms are "features," which are just beginning to be defined in a structured, mathematical way.

2tailcalled7mo

Why?

Laziness death spirals

rajathsalegame7mo1-2

This was an interesting read and points to a simple truth that I think is often forgotten: Newton's first law applies to basically everything in life, not just physical systems. The "resets" you describe are definitely valid but by no means a comprehensive list of "opposing" forces that can help drive you in the other direction to reverse your momentum (in a positive way). The two other main ones that I believe are missing, yet fundamental, are:

- Diet: the food we eat affects our mental/emotional tendencies to procrastinate vs get things done through ... (read more)

Why I'm bearish on mechanistic interpretability: the shards are not in the network

rajathsalegame7mo10

On second thought, I agree that gazing at the cosmos is not a fair comparison: rather, I would compare mechanistic interpretability to the early experiments of the Dutch microbiologist van Leeuwenhoek as he first looked at protozoa and bacteria under a microscope. They weren't the most accurate or informative experiments in the grand scheme of things, but they were necessary for others to develop a more sophisticated understanding of biology.

It's very likely that the field of mechanistic interpretability will grow beyond simply examining weights in a mode... (read more)

2tailcalled7mo

If mechanistic interpretability is the AI equivalent of finding tiny organisms in a microscope, what is the AI equivalent of the tiny organisms?

Why I'm bearish on mechanistic interpretability: the shards are not in the network

rajathsalegame7mo10

It would be perverse to try to understand a king in terms of his molecular configuration, rather than in the contact between the farmer and the bandit. The molecules of the king are highly diminished phenomena, and if they have information about his place in the ecology, that information is widely spread out across all the molecules and easily lost just by missing a small fraction of them.

Agreed, but in the same vein that empirical observations and low-tech experiments gazing at the cosmos laid the foundation upon which we were able to build grander and mo... (read more)

3tailcalled7mo

It's true that gazing at the cosmos has a history of leading to important discoveries, but mechanistic interpretability isn't gazing at the cosmos, it's gazing at the weights of the neural network.


All of rajathsalegame's Comments + Replies