Role embeddings: making authorship more salient to LLMs
This is an interim research report on role embeddings, an approach to making language models more robust to many-shot jailbreaks and prompt injections by adding role information at every token position in the context rather than only at special delimiter tokens. We credit Cem Anil for originally proposing this idea. In our initial experiments on Llama 3, we find that role embeddings mitigate many-shot jailbreaks more effectively than fine-tuning alone, without degrading general model capabilities, which suggests that this technique may be a viable way to increase LLM robustness. However, more work needs to be done to find the optimal set of hyperparameters and to fully understand any side effects of our proposed approach.

Background on prompt formats

By default, chat LLMs are trained (during instruction fine-tuning and RLHF) using a particular prompt format that distinguishes different message "roles". Almost all chat LLMs accept some version of system, user, and assistant. A separate role may also be used to indicate tool outputs for tool-use enabled models. An example of such a format is sketched after the list below.

The prompt format plays an important role in LLM post-training. The model learns to interpret text from different roles differently. In particular:

* Content marked as user or tool is usually off-policy, generated by some process that does not adhere to the same limitations or follow the same distribution as the model itself. The model will learn that this content is untrusted and may contain harmful requests, rude words, typos, errors, etc.
* Content marked as system is usually authoritative. The model will rarely see a system prompt instructing it to do something bad. SL data or high-reward conversations during RL will demonstrate the model adhering correctly to instructions given in system prompts.
* Content marked as assistant is usually on-policy, demonstrating the model following user instructions while simultaneously adhering to certain constraints around harmful outputs. (There is also the relat
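For concreteness, here is a minimal sketch (not taken from our experiments) of how role information is ordinarily conveyed only through special delimiter tokens in the rendered prompt, using the Hugging Face transformers chat-template API. The exact special tokens depend on the model family; the Llama 3 tokenizer is used here purely as an example.

```python
# Sketch: rendering a chat prompt where roles appear only as delimiter tokens.
# Assumes access to a chat tokenizer (any chat model's tokenizer would do).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this document for me."},
]

# For Llama 3 this renders something like:
# <|begin_of_text|><|start_header_id|>system<|end_header_id|> ... <|eot_id|>
# <|start_header_id|>user<|end_header_id|> ... <|eot_id|> ...
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```

Note that once the delimiters scroll far back in a long context (as in a many-shot jailbreak), they are the only place where authorship is marked.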
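The role-embedding idea itself can be illustrated with a short sketch: add a learned per-role embedding to the token embedding at every position, so that authorship information is present throughout the context rather than only at the delimiters. This is a rough illustration under our own assumptions (additive embeddings, zero initialization, hypothetical class and parameter names), not the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn

# Hypothetical role ids; the real setup may use a different set or ordering.
ROLE_IDS = {"system": 0, "user": 1, "assistant": 2, "tool": 3}

class RoleAwareEmbedding(nn.Module):
    """Token embedding plus a learned role embedding at every position (sketch)."""

    def __init__(self, vocab_size: int, d_model: int, n_roles: int = len(ROLE_IDS)):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Initialized to zero so the modified model starts close to the
        # original checkpoint and can then be fine-tuned.
        self.role_emb = nn.Embedding(n_roles, d_model)
        nn.init.zeros_(self.role_emb.weight)

    def forward(self, input_ids: torch.LongTensor, role_ids: torch.LongTensor):
        # input_ids, role_ids: [batch, seq_len]. role_ids tags every token
        # with its author (system / user / assistant / tool), not just the
        # delimiter tokens.
        return self.tok_emb(input_ids) + self.role_emb(role_ids)

# Usage sketch: role_ids is built alongside tokenization by tagging each
# token of a message with that message's role. Dimensions here are illustrative.
emb = RoleAwareEmbedding(vocab_size=128_256, d_model=4096)
input_ids = torch.randint(0, 128_256, (1, 16))
role_ids = torch.tensor([[0] * 4 + [1] * 8 + [2] * 4])  # system, user, assistant spans
x = emb(input_ids, role_ids)  # [1, 16, 4096], fed into the transformer stack
```

Because the role signal is repeated at every position, it remains salient even deep into a long context, which is the property we rely on when evaluating robustness to many-shot jailbreaks.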