Felix J Binder

AI alignment researcher & Cognitive Scientist
https://ac.felixbinder.net

One way in which an LLM is not purely derived from its training data is noise in the training process, which includes the random initialization of the weights. If you were given that random initialization (along with the data and training setup), it's true that with large amounts of time and computation, and assuming a deterministic world, you could perfectly simulate the resulting model.
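
As a toy illustration (a minimal PyTorch sketch, not anyone's actual setup): two runs on identical data end up with different weights purely because of the initialization seed, while either run is exactly reproducible if you know its seed.

```python
# Minimal sketch: identical data, different init seeds -> different models;
# same seed -> the same model (assuming deterministic CPU training).
import torch
import torch.nn as nn

def train_toy_model(seed: int) -> nn.Linear:
    torch.manual_seed(seed)                  # fixes the random weight initialization
    model = nn.Linear(10, 1)                 # stand-in for a real LLM
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    data = torch.ones(32, 10)                # the same "training data" for every run
    target = torch.zeros(32, 1)
    for _ in range(50):
        opt.zero_grad()
        loss = ((model(data) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return model

a = train_toy_model(seed=0)
b = train_toy_model(seed=0)   # same init: identical resulting weights
c = train_toy_model(seed=1)   # same data, different init: a different model
print(torch.allclose(a.weight, b.weight))   # True
print(torch.allclose(a.weight, c.weight))   # False
```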

Following this definition, we specify it with the following two clauses:

1. M1 correctly reports F when queried.
2. F is not reported by a stronger language model M2 that is provided with M1's training data and given the same query as M1. Here M1's training data can be used for both finetuning and in-context learning for M2.

Here, we use another language model as the external predictor; it might be considerably more powerful than M1, but it arguably still falls well short of the perfect-simulation scenario above. What we mean to illustrate is that introspective facts are those that are neither contained in the training data nor derivable from it (such as by asking "What would a reasonable person do in this situation?"); rather, questions about them can only be answered by reference to the model itself.
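
To make the two clauses a bit more concrete, here is a schematic sketch of how they might be checked for a particular fact F. The callables are stand-ins for whatever querying and finetuning pipeline is actually used; they are not an API from the paper.

```python
from typing import Callable, Sequence

def is_introspective_fact(
    fact_query: str,                        # the query about M1 whose answer is F
    ground_truth: str,                      # F: the correct answer, established externally
    query_m1: Callable[[str], str],         # ask M1 the query, return its report
    query_m2_with_m1_data: Callable[[Sequence[str], str], str],
                                            # give M2 M1's training data (via finetuning
                                            # or in-context learning), then ask the same query
    m1_training_data: Sequence[str],
) -> bool:
    """F counts as introspective if M1 reports it correctly (clause 1) while the
    stronger model M2, despite access to M1's training data, does not (clause 2)."""
    # Clause 1: M1 correctly reports F when queried.
    if query_m1(fact_query) != ground_truth:
        return False
    # Clause 2: M2, given M1's training data and the same query, fails to report F.
    return query_m2_with_m1_data(m1_training_data, fact_query) != ground_truth
```

In practice one would score many such queries against a behavioral ground truth rather than doing a single exact-match comparison, but the structure of the test is the same.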

If models are indeed capable of introspection, there are both potential opportunities and risks that come with this.

An introspective model can answer questions about itself based on properties of its internal states, even when those answers are not inferable from its training data. This capability could be used to create honest models that accurately report their beliefs, world models, dispositions, and goals. It could also help us learn about the moral status of models. For example, we could simply ask a model if it is suffering, if it has unmet desires, and if it is being treated ethically. Currently, when models answer such questions, we presume their answers are an artifact of their training data.

However, introspection also has potential risks. Models that can introspect may have increased situational awareness and the ability to exploit this to get around human oversight. For instance, models may infer facts about how they are being evaluated and deployed by introspecting on the scope of their knowledge. An introspective model may also be capable of coordinating with other instances of itself without any external communication.

Beyond that, whether or not a cognitive system has special access to itself is a fundamental question, and one that we don't understand well when it comes to language models. On the one hand, it's a fascinating question in its own right; on the other, knowing more about the nature of LLMs is important when thinking about their safety and alignment.

There are a number of other e-ink tablets on the market, most of which run Android and are therefore a good deal more customizable. For example, I'm using a Boox Note Air (https://shop.boox.com/collections/all/products/boox-note-air3), which has a similar screen and runs Android. It also comes with split-screen functionality, so you could connect a Bluetooth keyboard, have the book open on one side and a vim emulator on the other. That's pretty close to my workflow for reading books/papers.