rajathsalegame — LessWrong

Exploring phase space

I agree that it is dubious at the moment. I just think it's too early to tell and the field itself will undoubtedly grow in complexity over the coming years.

Your point about the spontaneity of cells forming stands, although I wasn't phrasing the analogy at the level of thermodynamics / physics.

In the same way that cells were understood to be indivisible, atomic units of biology hundreds of years ago--before the discovery of sub-cellular structures like organelles, proteins, and DNA--we currently understand features to be fundamental units of neural network representations that we are examining with tools like mechanistic interpretability.

This is not to say that the definition of what constitutes a "feature" is clear at all--in fact, the lack of consensus on it reflects the extremely immature (but exciting!) state of interpretability research today. I am not claiming that the analogy is a pure bijection; in fact, one of the pivotal ways in which mechanistic interpretability and biology diverge is that defining and understanding feature emergence will most definitely require going beyond simple decomposition of a model into weight and activation spaces (for example, understanding dataset-dependent computation flow, as you mentioned above). In contrast, most of biology's advancement has come from decomposing cellular complexity into smaller and smaller pieces.

I suspect this will not be the final story for interpretability, but mechanistic interpretability is an interesting first chapter.

Out of curiosity, do you have any thoughts on the importance / feasibility of formal verification / mathematically "provable" safety based approaches in these evals you mention?

I would argue that the AI equivalent of these tiny organisms are "features," which are just beginning to be defined in a structured, mathematical way.

This was an interesting read and points to a simple truth that I think is often forgotten: Newton's first law applies to basically everything in life, not just physical systems. The "resets" you describe are definitely valid, but they are by no means a comprehensive list of "opposing" forces that can help drive you in the other direction to reverse your momentum (in a positive way). The two other main ones that I believe are missing, yet fundamental, are:

- Diet: the food we eat affects our mental/emotional tendencies to procrastinate vs get things done through pretty intricate biological + neuroscientific mechanisms
- Exercise: similar to diet, but perhaps harder to get started

All of these can be thought of as different "forces" that can influence our momentum in one way or the other. Side note: it would be interesting to develop some sort of grounded psychological theory on how different external stimuli affect our mindspace. Some of it is covered in the Vedas (https://en.wikipedia.org/wiki/Gu%E1%B9%87a).

On second thought, I agree that gazing at the cosmos is not a fair comparison: rather, I would compare mechanistic interpretability to the early experiments of the Dutch microbiologist van Leeuwenhoek as he first looked at protozoa and bacteria under a microscope. They weren't the most accurate or informative experiments in the grand scheme of things, but they were necessary for others to develop a more sophisticated understanding of biology.

It's very likely that the field of mechanistic interpretability will grow beyond simply examining weights in a model, to higher-order understandings of the computational flow within a model (gradient descent and the data itself were mentioned in this thread)--I agree that simply examining weights/activations is not a sufficient paradigm for understanding neural computation--but it is a start.

It would be perverse to try to understand a king in terms of his molecular configuration, rather than in the contact between the farmer and the bandit. The molecules of the king are highly diminished phenomena, and if they have information about his place in the ecology, that information is widely spread out across all the molecules and easily lost just by missing a small fraction of them.

Agreed, but in the same way that empirical observations and low-tech experiments gazing at the cosmos laid the foundation upon which we were able to build grander and more complex theories of the universe, it would be premature to claim that this line of inquiry will not give us future mechanistic theories that are profound in nature. I am in agreement that these tools, at least at the moment, are largely frivolous and feature-specific without capturing more abstract notions of reality.

That being said, in terms of timescales, we are in a pre-Newtonian era, where we lack even the basic, fundamental laws for understanding how these models work.
