LESSWRONG
LW

Jason Hoelscher-Obermaier — LessWrong

Analysing Adversarial Attacks with Linear Probing

Yoann Poupart, Imene Kerboua, Clement Neo, Jason Hoelscher-Obermaier

Ω 62y

This work was produced as part of the Apart Fellowship. @Yoann Poupart and @Imene Kerboua led the project; @Clement Neo and @Jason Hoelscher-Obermaier provided mentorship, feedback and project guidance.

Here, we present a qualitative analysis of our preliminary results. We are at the very beginning of our experiments, so any setup change, experiment advice, or idea is welcomed. We also welcome any relevant paper that we could have missed. For a better introduction to interpreting adversarial attacks, we recommend reading scasper’s post: EIS IX: Interpretability and Adversaries.

Code available on GitHub (still drafty).

TL;DR

Basic adversarial attacks produce humanly indistinguishable examples for the human eye.
In order to explore the features vs bugs view of adversarial attacks, we trained linear probes on a classifier’s activations (we used CLIP-based models for further multimodal work). The goal is

...

(Continue Reading - 2317 more words)

Results from the AI x Democracy Research Sprint

Esben Kran, jordine, Jason Hoelscher-Obermaier

We ran a 3-day research sprint on AI governance, motivated by the need for demonstrations of the risks to democracy by AI, supporting AI governance work. Here we share the 4 winning projects but many of the other 19 entries were also incredibly interesting so we suggest you take a look.

In summary, the winning projects:

Red-teamed unlearning to evaluate its effectiveness and practical scope in open-source models to remove hazardous information while retaining essential knowledge in the context of WMDP.
Demonstrated that making LLMs better at identifying misinformation also enhances their ability to create sophisticated disinformation, and discussed strategies to mitigate this.
Investigated how AI can undermine U.S. federal public comment systems by generating realistic, high-quality forged comments and highlighted the challenges in detecting such manipulations.
Demonstrated risks from Sleeper Agents

...

(Continue Reading - 1501 more words)

Creating unrestricted AI Agents with Command R+

Jason Hoelscher-Obermaier2y20

Glad you're doing this. By default, it seems we're going to end up with very strong tool-use models where any potential safety measures are easily removed by jailbreaks or fine-tuning. I understand you as working on: How are we going to know that it happened? is that a fair characterization?

Another important question: What should the response be to the appearance of such a model? any thoughts?

Sparse autoencoders find composed features in small toy models

Evan Anders, Clement Neo, Jason Hoelscher-Obermaier, Jessica N. Howard

Summary

Context: Sparse Autoencoders (SAEs) reveal interpretable features in the activation spaces of language models. They achieve sparse, interpretable features by minimizing a loss function which includes an $ℓ_{1}$ penalty on the SAE hidden layer activations.
Problem & Hypothesis: While the SAE $ℓ_{1}$ penalty achieves sparsity, it has been argued that it can also cause SAEs to learn commonly-composed features rather than the “true” features in the underlying data.
Experiment: We propose a modified setup of Anthropic’s ReLU Output Toy Model where data vectors are made up of sets of composed features. We study the simplest possible version of this toy model with two hidden dimensions for ease of comparison to many of Anthropic’s visualizations.
- Features in a given set in our data are anticorrelated, and features are stored in antipodal pairs. Perhaps it’s a bit surprising that features are stored

...

(Continue Reading - 4306 more words)

Apart Research

2 years ago

(+48/-38)

Multi-Agent Security Hackathon

Feb 5th

Esben Kran, Jason Hoelscher-Obermaier, Clement Neo

Join us for this weekend-long research hackathon on the complexities of AI security! We will dive deep into frontier AI systems, specifically the fascinating multi-agent systems. Our mission will be to understand the worst-case scenarios and how we can avoid them.

Sign up to blend our rigorous research with the spirit of hackathon innovation. There's prizes for $1,200 on the line and the most ambitious teams might receive advising and senior co-authorship with established researchers in the field of AI safety. So, come along!

Watch the introductory talk by Christian Schroeder de Witt here.

Thoughts on sharing information about language model capabilities

Jason Hoelscher-Obermaier3y30

There is no constraint towards specifying measurable goals of the kind that lead to reward-hacking concerns.

I'm not sure that reward-hacking in LM agent systems is inevitable, but it seems at least plausible that reward hacking could occur in such systems without further precautions.

For example, if oversight is implemented via an overseer LLM agent O which gives scores for proposed actions by another agent A, then A might end up adversarially optimizing against O if A is set up for a high success rate (high rate of actions accepted).

(I agree very much with the general point of the post, though)

Unfaithful Explanations in Chain-of-Thought Prompting

Jason Hoelscher-Obermaier3y10

> key discrepancies in the explanations that lead models to support the biased answer instead of the correct answer in many cases come near the end of the explanation

That's interesting. Any idea why it's likelier to have the invalid reasoning step (that allows the biased conclusion) towards the end of the CoT rather than right at the start?

Unfaithful Explanations in Chain-of-Thought Prompting

Jason Hoelscher-Obermaier3y10

Thanks for the pointer to the discussion and your thoughts on planning in LLMs. That's helpful.

Do you happen to know which decoding strategy is used for the models you investigated? I think this could make a difference regarding how to think about planning.

Say we're sampling 100 full continuations. Then we might end up with some fraction of these continuations ending with the biased answer. Assume now the induced bias leads the model to assign a very low probability for the last token being the correct, unbiased answer. In this situation, we could en... (read more)

Unfaithful Explanations in Chain-of-Thought Prompting

Jason Hoelscher-Obermaier3y30

I found this very interesting, thanks for the write-up! Table 3 of your paper is really fun to look at.

I’m actually puzzled by how good the models are at adapting their externalised reasoning to match the answer they "want" to arrive at. Do you have an intuition for how this actually works?

Intuitively, I would think the bias has a strong effect on the final answer but much less so on the chain of thought preceding it. Yet the cot precedes the final answer , so how does the model "plan ahead" in this case?

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Jason Hoelscher-Obermaier3y10

Show that we can recover superhuman knowledge from language models with our approach

Maybe applying CCS to a scientific context would be an option for extending the evaluation?

For example, harvesting undecided scientific statements, which we expect to become resolved soonish, and using CCS for predictions on these statements?

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Jason Hoelscher-Obermaier3y10

Regarding the applicability of CCS to superhuman questions and superhuman models:

Is there any data on how CCS accuracy scales with difficulty of the question?
Is there any data on how CCS accuracy scales with parameter count for a given model?

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Jason Hoelscher-Obermaier3y10

Would it make sense to use truths discovered via CCS as a training signal for fine-tuning LLMs?