Adam Shai

Neuroscientist turned Interpretability Researcher. Starting Simplex, an AI Safety Research Org.

Sequences

Introduction to Computational Mechanics

Comments

One thing I am confused about: especially in cases of developer sandbagging, my intuition is that the mechanisms underlying the underperformance could be very similar to cases of "accidental" sandbagging (i.e., not sandbagging according to your definition). More operationally, your example 1 and example 4 might have the same underlying issue from the perspective of the model itself, and if we want to find technical solutions to those particular examples they might look the same. If that's the case then it's not obvious to me that the "strategic" condition is a useful place to "cut nature at its joints."

Or to say it a different way, what operationally defines the difference between examples 1 and 4 is that in example 1 there is fine-tuning on a different dataset, and in example 4 the extra dataset is part of the pre-training dataset. The model itself doesn't see the intent of the developer directly, so as far as technical solutions that only depend on the model itself, it's not obvious that the intent of the developer matters.

A developer could intentionally inject noisy and error-prone data into training, but the model would treat that equivalently to the case where the same data was in the dataset by mistake.

Did the original paper do any shuffle controls? Given your results I suspect such controls would have failed. For some reason this is not standard practice in AI research, despite it being extremely standard in other disciplines.
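
For concreteness, a shuffle control can be sketched as follows. This is a minimal illustration on synthetic data, not the setup of the paper under discussion: the linear probe, the names `probe_r2` and `shuffle_control`, and the data are all assumptions made for the example. The idea is to refit the same analysis after randomly permuting the labels, which destroys any real relationship and shows how high the score gets by chance alone.

```python
import numpy as np

def probe_r2(X, y):
    """Fit a least-squares linear probe (with bias) and return in-sample R^2."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def shuffle_control(X, y, n_shuffles=200, seed=0):
    """Compare the observed probe score to scores obtained with permuted labels."""
    rng = np.random.default_rng(seed)
    observed = probe_r2(X, y)
    # Permuting y breaks the X-y pairing but keeps both marginal distributions.
    null = np.array([probe_r2(X, rng.permutation(y)) for _ in range(n_shuffles)])
    # Permutation p-value with the standard +1 correction.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_shuffles)
    return observed, null, p_value

# Synthetic stand-in for "activations" X and a target y (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
y = 2.0 * X[:, 0] + rng.normal(size=500)

observed, null, p_value = shuffle_control(X, y)
print(f"observed R^2 = {observed:.3f}, mean shuffled R^2 = {null.mean():.3f}, p = {p_value:.4f}")
```

Note that the shuffled scores are not zero: an in-sample fit with many features always explains some variance by chance, which is exactly the baseline a shuffle control is meant to expose.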

Thanks, this was clarifying. I am wondering if you agree with the following (focusing on the predictive processing parts since that's my background):

There are important insights and claims from religious sources that seem to capture psychological and social truths that aren't yet fully captured by science. At least some of these phenomena might be formalizable via a better understanding of how the brain and the mind work, and to that end predictive processing (and other theories of that sort) could be useful to explain the phenomena in question.

You spoke of wanting formalization, but I wonder if the main thing is really the creation of a science, though of course math is a very useful tool to do science with and to create a more complete understanding. At the end of the day we want our formalizations to comport with reality - whatever aspects of reality we are interested in understanding.

"... which is being able to ground the apparently contradictory metaphysical claims across religions into a single mathematical framework."

Is there a minimal operationalized version of this? Something that is the smallest formal or empirical result one could have that would count to you as small progress towards this goal?

Thanks for writing this up! Having not read the paper, I am wondering if in your opinion there's a potential connection between this type of work and comp mech type of analysis/point of view? Even if it doesn't fit in a concrete way right now, maybe there's room to extend/modify things to combine things in a fruitful way? Any thoughts?

I very strongly agree with the spirit of this post, though personally I am a bit more hesitant about what exactly it is that I want in terms of understanding how it is that GPT-4 can talk. In particular I can imagine that my understanding of how GPT-4 could talk might be satisfied by understanding the principles by which it talks, but without necessarily being able to write a talking machine from scratch. Maybe what I'd be after in terms of what I can build is a talking machine of a certain toyish flavor - a machine that can talk in a synthetic/toy language. The full complexity of its current ability seems to have too much structure to be constructed from first principles. Though of course one doesn't know until our understanding is more complete.

I'm wondering if you have any other pointers to lessons/methods you think are valuable from neuroscience?

This makes a lot of sense to me, and makes me want to figure out exactly how to operationalize and rigorously quantify depth of search in LLMs! A quick thought is that it should have something to do with the spectrum of the transition matrix associated with the mixed state presentation (MSP) of the data generating process, as in Transformers Represent Belief State Geometry in their Residual Stream. The MSP describes synchronization to the hidden states of the data generating process, and that feels like a search process whose max depth is the Markov order of the data generating process.

I really like the idea that memorization and this more lofty type of search are on a spectrum, and that placement on this spectrum has implications for capabilities like generalization. If we can figure out how to understand these things more formally/rigorously that would be great!
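
The synchronization picture in the comment above can be made concrete with a small belief-update sketch. Everything specific here is an assumption chosen for illustration and does not come from the comment: the two-state Golden Mean process (which never emits two 1s in a row), the matrix layout `T[x][i, j]`, and the helper names. The code only shows how an observer's belief over hidden states collapses as symbols arrive; it does not compute the MSP spectrum.

```python
import numpy as np

# Labeled transition matrices: T[x][i, j] = P(emit x, move to state j | state i).
# Golden Mean process: state 0 emits 0 or 1 with equal probability,
# state 1 (reached after a 1) must emit 0 and return to state 0.
T = {
    0: np.array([[0.5, 0.0], [1.0, 0.0]]),
    1: np.array([[0.0, 0.5], [0.0, 0.0]]),
}

def stationary(T):
    """Stationary distribution of the hidden-state chain (left eigenvector for eigenvalue 1)."""
    M = sum(T.values())
    vals, vecs = np.linalg.eig(M.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def update(belief, x):
    """Bayesian belief update over hidden states after observing symbol x."""
    unnorm = belief @ T[x]
    return unnorm / unnorm.sum()

def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

belief = stationary(T)          # prior before seeing any data
trajectory = [belief]
for x in [0, 1, 0, 0]:          # an allowed sequence (no two consecutive 1s)
    belief = update(belief, x)
    trajectory.append(belief)

uncertainty = [entropy_bits(b) for b in trajectory]
print(uncertainty)
```

For this process the belief starts at the stationary distribution (2/3, 1/3) and becomes one-hot after a single observed symbol, so the remaining uncertainty over hidden states drops to zero after one step. That matches the process being order-1 Markov, which is the kind of relationship between synchronization depth and Markov order the comment gestures at.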

I can report my own feelings with regard to this. I find cities (at least the American cities I have experience with) to be spiritually fatiguing. The constant sounds, the lack of anything natural, the smells - they all contribute to a lack of mental openness and quiet inside of myself.

The older I get the more I feel this.

Jefferson had a quote that might be related, though to be honest I'm not exactly sure what he was getting at:

I think our governments will remain virtuous for many centuries; as long as they are chiefly agricultural; and this will be as long as there shall be vacant lands in any part of America. When they get piled upon one another in large cities, as in Europe, they will become corrupt as in Europe. Above all things I hope the education of the common people will be attended to; convinced that on their good sense we may rely with the most security for the preservation of a due degree of liberty.

One interpretation of this is that Jefferson thought there was something spiritually corrupting about cities. This is supported by another quote:

I view great cities as pestilential to the morals, the health and the liberties of man. true, they nourish some of the elegant arts; but the useful ones can thrive elsewhere, and less perfection in the others with more health virtue & freedom would be my choice.

Although, as you mention, there does seem to be some plausible connection to disease.
