All of erhora's Comments + Replies

This is a super interesting line of work!

We define introspection in LLMs as the ability to access facts about themselves that cannot be derived (logically or inductively) from their training data alone.

The entire model is, in a sense, "logically derived" from its training data, so any facts about its output on certain prompts can also be logically derived from its training data.

Why did you choose to make non-derivability part of your definition? Do you mean something like "cannot be derived quickly, for example without training a whole new model"? I'm worried...

Felix J Binder
One way in which an LLM is not purely derived from its training data is noise in the training process, which includes the random initialization of the weights. If you were given the random initialization, it's true that with large amounts of time and computation (and assuming a deterministic world) you could perfectly simulate the resulting model.

Following this definition, we specify it with the following two clauses:

1. M1 correctly reports F when queried.
2. F is not reported by a stronger language model M2 that is provided with M1's training data and given the same query as M1. Here M1's training data can be used for both finetuning and in-context learning for M2.

Here, we use another language model as the external predictor, which might be considerably more powerful, but arguably falls well short of the above scenario. What we mean to illustrate is that introspective facts are neither contained in the training data nor derivable from it (such as by asking "What would a reasonable person do in this situation?"); rather, they can only be answered by reference to the model itself.
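
To make those two clauses concrete, here is a minimal sketch in Python (with a hypothetical `ask` helper standing in for the model API; not the paper's actual code) of the check they describe:

```python
# A minimal sketch (not the paper's actual code) of the two-clause test for
# whether a fact F about model M1 counts as introspective. `ask` is a
# hypothetical callable that sends a prompt to a model and returns its answer.

def is_introspective_fact(ask, m1, m2_given_m1_data, query, ground_truth_f):
    """m2_given_m1_data is a stronger model M2 that has been given M1's
    training data, via finetuning or in-context learning."""

    def normalize(answer):
        return answer.strip().lower()

    # Clause 1: M1 correctly reports F when queried.
    clause_1 = normalize(ask(m1, query)) == normalize(ground_truth_f)

    # Clause 2: M2, despite access to M1's training data and the same query,
    # does not report F.
    clause_2 = normalize(ask(m2_given_m1_data, query)) != normalize(ground_truth_f)

    return clause_1 and clause_2
```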

You can think of a pipeline like the following (a rough code sketch follows the list):

  • feed lots of good papers in [situational awareness / out-of-context reasoning / ...] into GPT-4's context window,
  • ask it to generate 100 follow-up research ideas,
  • ask it to develop specific experiments to run for each of those ideas,
  • feed those experiments to GPT-4 copies equipped with a coding environment,
  • write the results up as a nice little article and send it to a human.
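
In code, such a pipeline might look roughly like this (with hypothetical `llm` and `run_in_sandbox` helpers standing in for the model API and the coding environment):

```python
# Rough sketch of the pipeline above. `llm` (prompt -> text) and
# `run_in_sandbox` (experiment spec -> results) are hypothetical stand-ins,
# not a real API.

def automated_research_pipeline(llm, papers, run_in_sandbox, n_ideas=100):
    # 1. Feed good papers (situational awareness, out-of-context reasoning, ...)
    #    into the context window and ask for follow-up research ideas.
    ideas = llm(
        f"Here are some papers:\n{papers}\n"
        f"Propose {n_ideas} follow-up research ideas, one per line."
    ).splitlines()

    results = []
    for idea in ideas:
        # 2. Turn each idea into a concrete experiment to run.
        experiment = llm(f"Design a specific, runnable experiment for: {idea}")

        # 3. Hand the experiment to a model copy equipped with a coding
        #    environment.
        results.append((idea, run_in_sandbox(experiment)))

    # 4. Write the results up as a short article for a human reader.
    return llm(f"Write a short article summarizing these results:\n{results}")
```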

Obvious, but perhaps worth reminding ourselves: this is a recipe for automating/speeding up AI research in general, so it would be a neutral-at-best update ...

Bogdan Ionut Cirstea
At least some parts of automated safety research are probably differentially accelerated though (vs. capabilities), for reasons I discuss in the appendix of this presentation (in summary, that a lot of prosaic alignment research has [differentially] short horizons, both in 'human time' and in 'GPU time'): https://docs.google.com/presentation/d/1bFfQc8688Fo6k-9lYs6-QwtJNCPOS8W2UH5gs8S6p0o/edit?usp=drive_link. Large parts of interpretability are also probably differentially automatable (as is already starting to happen, e.g. https://www.lesswrong.com/posts/AhG3RJ6F5KvmKmAkd/open-source-automated-interpretability-for-sparse; https://multimodal-interpretability.csail.mit.edu/maia/), both for task horizon reasons (especially if combined with something like SAEs, which would help by e.g. leading to sparser, more easily identifiable circuits / steering vectors, etc.) and for (more basic) token cheapness reasons: https://x.com/BogdanIonutCir2/status/1819861008568971325.
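
As a rough illustration of the automated-interpretability direction (a hypothetical sketch, not the pipeline from the links above), an explain-then-score loop for a single SAE feature might look like:

```python
# Hypothetical sketch of an automated explain-then-score loop for one SAE
# feature; `llm` (prompt -> text) is a stand-in for any capable model API,
# and the prompts are illustrative only.

def explain_and_score_feature(llm, top_activating_snippets, held_out_snippets):
    """top_activating_snippets: text snippets that strongly activate the feature.
    held_out_snippets: (text, fires) pairs with ground-truth activation labels."""

    # 1. Ask the model for a short natural-language explanation of the feature.
    explanation = llm(
        "These snippets most strongly activate one sparse-autoencoder feature. "
        "In one sentence, what does the feature respond to?\n"
        + "\n".join(top_activating_snippets)
    )

    # 2. Score the explanation: can the model predict, from the explanation
    #    alone, whether the feature fires on held-out snippets?
    correct = 0
    for text, fires in held_out_snippets:
        answer = llm(
            f"Feature description: {explanation}\n"
            f"Snippet: {text}\n"
            "Does the feature fire on this snippet? Answer yes or no."
        )
        predicted = answer.strip().lower().startswith("yes")
        correct += int(predicted == fires)

    return explanation, correct / len(held_out_snippets)
```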