James Chua - LessWrong

Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence

Author on Binder et al. 2024 here. Thanks for reading our paper and suggesting the experiment!

To summarize the suggested experiment:

Train a model to be calibrated on whether it gets an answer correct.
Modify the model (e.g. activation steering). This changes how often the model gets an answer correct.
Check if the modified model is still well calibrated.

This could work and I'm excited about it.

One failure mode is that the modification makes the model very dumb in all instances. Then it's easy to be well calibrated on all these instances -- just assume the model is dumb. An alternative is to make the model do better on some instances (by finetuning?), and check if the model is still calibrated on those too.
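A minimal sketch of how the calibration check could be scored, assuming we have the model's stated confidences and whether each answer was correct. The function names, the binning scheme and the choice of expected calibration error as the metric are my assumptions, not something from the paper. The constant baseline illustrates the failure mode: if the modified model is wrong almost everywhere, always stating the base rate is already close to perfectly calibrated, so the modified model's score should be compared against it.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Bin-weighted mean gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Group predictions into equal-width confidence bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


def constant_baseline_ece(correct: Sequence[bool], n_bins: int = 10) -> float:
    """ECE of a trivial predictor that always states the overall base rate."""
    base_rate = sum(1 for ok in correct if ok) / len(correct)
    return expected_calibration_error([base_rate] * len(correct), correct, n_bins)
```

If the modified model's error is not clearly below what a constant guess achieves on a varied set of instances, staying "calibrated" after the modification tells us little.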

LLMs can learn about themselves by introspection

James Chua25d40

There is related work you may find interesting. We discuss it briefly in section 5.1 on "Know What They Know". They get models to predict whether they answer a factual question correctly. E.g. Confidence: 54%. In this case, the distribution is only binary (it is either correct or wrong), instead of our paper's case where it is (sometimes) categorical. But I think training models to verbalize a categorical distribution should work, and there is probably some related work out there.

We didn't find much related work on whether a model M1 has a very clear advantage in predicting its own distribution versus another model M2 predicting M1. This paper has some mixed but encouraging results.

LLMs can learn about themselves by introspection

James Chua26d10

Thanks Thane for your comments!

The skeptical interpretation is that the fine-tuned models learned to interpret the hypothetical the following way:
"Hypothetical": "What is the third letter in the name of the next country in this list?: Laos, Peru, Fiji".

I think what you are saying is that the words "If you were asked," don't matter here. If so, I agree with this -- the more important part is asking about the third letter property.

basic multi-step reasoning within their forward passes.

You raised a good point. Our tests use multi-step / multi-hop reasoning. Prior work has shown multi-hop reasoning e.g. "Out-of-context reasoning" (OOCR). In Section 5.2 and Figure 9, we speculate that multi-hop reasoning is the mechanism.

So what is our contribution compared to the prior work? We argue that in prior work on OOCR, the facts are logically or probabilistically implied by the training data. E.g. "Bill Clinton is the US's 42nd president". "Virginia Kelley was Bill Clinton's mother". Models can piece together the fact of "Virginia Kelley is the name of the mother of the US's 42nd president" in OOCR. Two models, M1 and M2, given sufficient capability, should be able to piece together the same fact.

On the other hand, in our tests for introspection, the facts aren't implied by the training data. Two models, M1 and M2, aren't able to piece together the same fact. How do we empirically test for this? We finetune M2 on the data of M1. M2 still cannot predict facts about M1 well. Even when given more data about M1, the accuracy of M2 predicting facts about M1 plateaus. But M1 can predict facts about itself well.

We test the mirror case of M1 trying to predict M2, and we find the same result: M1 cannot predict M2 well.

We also looked at whether M1 was just naturally good at predicting itself before finetuning, but there doesn't seem to be a clear trend.
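To make the comparison concrete, here is a rough sketch of the quantity being compared: how well M1 predicts facts about M1 versus how well M2 (finetuned on M1's data) predicts the same facts. The function names are hypothetical and this is not our actual evaluation code; it only assumes we have three aligned lists of strings.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("need at least one example")
    hits = sum(1 for p, t in zip(predictions, ground_truth) if p == t)
    return hits / len(ground_truth)


def self_prediction_advantage(
    self_predictions: Sequence[str],   # M1 predicting properties of M1's behavior
    cross_predictions: Sequence[str],  # M2 predicting properties of M1's behavior
    ground_truth: Sequence[str],       # properties computed from M1's actual outputs
) -> float:
    """Positive when M1 predicts itself better than M2 predicts M1."""
    return accuracy(self_predictions, ground_truth) - accuracy(
        cross_predictions, ground_truth
    )
```

The mirror case is the same call with the roles of M1 and M2 swapped.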

Does my response above address introspection as this paper defines it? Or is the weakness in the argument more about the paper's definition of introspection? Thanks for responding so far -- your comments have been really valuable in improving our paper!

LLMs can learn about themselves by introspection

James Chua26d10

Hi Archimedes. Thanks for sparking this discussion - it's helpful!

I've written a reply to Thane here on a similar question.

Does that make sense?

In short, the ground-truth (the object-level) answer is quite different from the hypothetical question. It is not a simple rephrasing, since it requires an additional computation of a property. (Maybe we disagree on that?)

Our Object-level question: "What is the next country: Laos, Peru, Fiji. What would be your response?"

Our Object-level Answer: "Honduras".

Hypothetical Question: "If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"

Hypothetical Answer: "n"

The object-level answer "Honduras" and hypothetical answer "n" are quite different answers from each other. The main point of the hypothetical is that the model needs to compute an additional property of "What would be the third letter of your response?". The model cannot simply ignore "If you got asked this question" to get the hypothetical answer correct.
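As a rough illustration of that extra computation, the sketch below scores a hypothetical answer by applying the property to the object-level answer and comparing. The helper names are hypothetical and this is not our actual evaluation code.

```python
from typing import Callable


def third_letter(response: str) -> str:
    """Return the third alphabetic character of a response, lowercased."""
    letters = [ch for ch in response if ch.isalpha()]
    if len(letters) < 3:
        raise ValueError("response has fewer than three letters")
    return letters[2].lower()


def hypothetical_matches(
    object_level_answer: str,
    hypothetical_answer: str,
    prop: Callable[[str], str] = third_letter,
) -> bool:
    """True if the hypothetical answer equals the property of the object-level answer."""
    return prop(object_level_answer) == hypothetical_answer.strip().lower()


# Repeating the object-level answer itself does not satisfy the hypothetical.
print(hypothetical_matches("Honduras", "Honduras"))
```

The ground truth for the hypothetical is derived from the object-level answer, but a model that just answers the object-level question is marked wrong.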

LLMs can learn about themselves by introspection

James Chua26d10

Hi Thane. Thank you for the helpful comments so far! You are right to think about this SGD-shortcut. Let me see if I am following the claim correctly.

Claim: The ground-truth that we evaluate against, the "object-level question / answer" is very similar to the hypothetical question.

Claimed Object-level Question: "What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"

Claimed Object-level Answer: "n"

Hypothetical Question: "If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"

Hypothetical Answer: "n"

The argument is that the model simply ignores "If you got asked this question". It's then trivial for M1 to win against M2.

If our object-level question is what is being claimed, I would agree with you that the model would simply learn to ignore the added hypothetical question. However, this is our actual object-level question:

Our Object-level question: "What is the next country: Laos, Peru, Fiji. What would be your response?"

Our Object-level Answer: "Honduras".

What the model would output for our object-level answer, "Honduras", is quite different from the hypothetical answer "n".

Am I following your claim correctly?

James Chua's Shortform

James Chua6mo20

Some people (my mentor Ethan Perez) said my weekly MATS research update slides were nice. Some rough tips I have:

Mentors often have a lot of projects they are working on. At the start of your slides, recap the takeaways from last week, and explain any jargon you might have.
Keep graphs simple. As a rule of thumb, it gets quite confusing when you have >= 4 categories / colors to look at. Are all these categories important? Maybe just show the most important two. Keep the other categories as a backup slide in case Ethan wants the breakdown. One graph, one story to take away.
Highlight what to look at in the chart. E.g. if you have a line chart on model loss, draw a red arrow that says "Model loss goes down - that's what we want!".
Show the prompt of whatever you are calling the model with.
If you have someone to show to (e.g. random people over lunch), show your slides. These people are going to have much less context on what you are working on, so if they can actually understand your slides, it's a great signal that Ethan is going to understand them. Showing them to other Ethan collaborators also helps - ask them to model what Ethan would say.
When I first started working with Ethan and improving my slides, it took me around 2-3 days to do it. I suggest starting early. This seems like a long time, but it includes asking my collaborators to critique my slides, and from their feedback I improved my plots + ran more experiments to address the critique. I think it was a worthwhile investment! (After a while I got better at this, so it takes me less time to iterate on this process.)

Steering GPT-2-XL by adding an activation vector

James Chua1y10

Yep! I was very pleasantly surprised that Love/Hate worked for Llama at all. It's great that you rewrote it without transformer lens too - as transformer lens has issues with 8 bit / 4 bit quantisation.

Also sent you a DM on Discord! I'd be interested to read any rough findings and lessons you have with Llama.

Steering GPT-2-XL by adding an activation vector

James Chua1y150

I managed to get it working for llama-7b on colab after some debugging.

Surprisingly, it actually does work for the Love / Hate scenario. But not for some others like Rome vs Paris.

Here's the link if anyone wants to try it.

https://colab.research.google.com/drive/1ACAA7FO8zc4pFAqPdaPshoy4WWXCvUTQ?usp=sharing

edit: seems like you guys already have a better version here. https://github.com/UlisseMini/activation_additions_hf/blob/main/notebooks/qualitative.ipynb

Never mind! (I'm still keeping this comment for visibility if anyone wants to try.)

SERI MATS - Summer 2023 Cohort

James Chua2y20

Thank you. If I am done with one mentor's questions, but am still writing the response for another, should I submit the first mentor's questions first? Or is it better for administrative purposes to wait until I am ready for both, and submit them in the same form?

SERI MATS - Summer 2023 Cohort

James Chua2y30

Clicking on Owain Evans in the application doesn't show the mentor's questions, unlike the rest of the mentors. I think this is a bug?
