Note: This is a more fleshed-out version of this post and includes theoretical arguments justifying the empirical findings. If you've read that one, feel free to skip to the proofs.

We challenge the theses of two papers: The Geometry of Categorical and Hierarchical Concepts in LLMs, which won 1st prize at the ICML 2024 Mechanistic Interpretability Workshop, and the ICML 2024 paper The Linear Representation Hypothesis and the Geometry of LLMs.

The main takeaway is that the orthogonality and polytopes they observe in categorical and hierarchical concepts occur practically everywhere, even in places where they should not.

Overview of the Feature Geometry Papers

Studying the geometry of a language model's embedding space is an important and challenging task because of...

The Geometry of Feelings and Nonsense in Large Language Models

7vik, Nandi

This post presents ablation results that test the thesis of the paper that won 1st prize at the ICML 2024 Mech. Interp. workshop: The Geometry of Categorical and Hierarchical Concepts in Large Language Models. The main takeaway is that the orthogonality they observe in categorical and hierarchical concepts occurs practically everywhere, even in places where it really should not.

Overview of the original paper

A lot of the intuition and math around why they do what they do is shared in their previous work called The Linear Representation Hypothesis and the Geometry of Large Language Models, but let's quickly go over what the paper's core idea is:

They split the computation of a large language model (LLM) as:

$P(y \mid x) = \frac{\exp(λ(x)^{\top} γ(y))}{\sum_{y' \in \text{Vocab}} \exp(λ(x)^{\top} γ(y'))}$

where:
- $λ (x)$ is...
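As a minimal sketch of the factorization above: the next-token distribution is a softmax over the inner products $λ(x)^{\top} γ(y)$, taken across the vocabulary. The function name `next_token_probs`, the two-dimensional vectors, and the three-word vocabulary are hypothetical and chosen only for illustration. The sketch also assumes that $λ(x)$ is a vector for the context and $γ(y)$ a vector for each token; it is not the authors' code.

```python
import math


def next_token_probs(context_vec, token_vecs):
    """Softmax over the inner products lambda(x)^T gamma(y) for each y.

    context_vec: list of floats, standing in for lambda(x) (assumed).
    token_vecs:  dict mapping token -> list of floats, standing in
                 for gamma(y) (assumed).
    """
    # One logit per vocabulary item: the dot product lambda(x)^T gamma(y).
    logits = {
        token: sum(a * b for a, b in zip(context_vec, vec))
        for token, vec in token_vecs.items()
    }
    # Subtract the largest logit before exponentiating; this leaves the
    # softmax unchanged but avoids overflow.
    largest = max(logits.values())
    exps = {token: math.exp(v - largest) for token, v in logits.items()}
    normalizer = sum(exps.values())
    return {token: e / normalizer for token, e in exps.items()}


# Hypothetical toy values, not taken from any model.
lam = [1.0, 0.0]
gamma = {"cat": [2.0, 0.0], "dog": [0.0, 2.0], "car": [-1.0, 1.0]}
probs = next_token_probs(lam, gamma)
```

Here `probs` sums to one, and the token whose vector has the largest inner product with `lam` receives the highest probability.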

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

Yohan Mathew, joanv, robert mccarthy, ollie, Nandi, Dylan Cope

This research was completed for London AI Safety Research (LASR) Labs 2024 by Yohan Mathew, Ollie Matthews, Robert McCarthy and Joan Velja. The team was supervised by Nandi Schoots and Dylan Cope (King’s College London, Imperial College London). Find out more about the programme and express interest in upcoming iterations here.

The full paper can be found here, while a short video presentation covering the highlights of the paper is here (note that some graphs have been updated since the presentation).

Introduction

Collusion in multi-agent systems is defined as two or more agents covertly coordinating to the disadvantage of other agents [6], while steganography is the practice of concealing information within a message while avoiding detection....

Robustness of Contrast-Consistent Search to Adversarial Prompting

Nandi, i, Jamie Wright, Seamus_F, hugofry

Produced as part of the AI Safety Hub Labs programme run by Charlie Griffin and Julia Karbing. This project was mentored by Nandi Schoots.

Introduction

We look at how adversarial prompting affects the outputs of large language models (LLMs) and compare this with how the same adversarial prompts affect Contrast-Consistent Search (CCS). In doing so, we extend an investigation from the original CCS paper, which found a prompt that reduced UnifiedQA model performance by 9.5%.

We find several attacks that severely corrupt the question-answering abilities of LLMs, which we hope will be useful to other people. The approaches we take to manipulate the outputs of the models are to:...

Machine Unlearning Evaluations as Interpretability Benchmarks

NickyP, Nandi

Interpreting Models by Ablation. Image generated by DALL-E 3.

Introduction

Interpretability in machine learning, especially in language models, is an area with a large number of contributions. While this can be quite useful for improving our understanding of models, one issue is the lack of robust benchmarks for evaluating the efficacy of different interpretability techniques. As a result, comparing techniques and determining their true effectiveness in real-world scenarios is difficult.

Interestingly, there exists a parallel in the realm of non-language models under the research umbrella of Machine Unlearning. In this field, the objective is twofold: firstly, to deliberately diminish the model's performance on specified "unlearned" tasks, and secondly, to ensure that the model's...

Replying to Splitting Debate up into Two Subsystems

Nandi, 6y

I agree that if you score an oracle based on how accurate it is, then it is incentivized to steer the world towards states where easy questions get asked.

I think that in these considerations it matters how powerful we assume the agent to be. You made me realize that specifying the scope and detailing the application area of the proposed approach better could have made my post more interesting. In many cases making the world more predictable may be very difficult for the agent, compared to causing the human to predict the world better. In the short term I think deploying an agentic oracle could be safe.

Replying to Splitting Debate up into Two Subsystems

Nandi, 6y

I think Bostrom might have mentioned this problem (educating someone on a topic) somewhere.

Cool! I'm not familiar with it

Replying to Splitting Debate up into Two Subsystems

Nandi, 6y

In the case that the epistemic helper can explain enough to us for us to come up with solutions ourselves, the info helper is as useful by itself.

However, sometimes even if we get educated about a domain or problem, we may not be creative enough to propose good solutions ourselves. In such cases we would need an agent to propose options to us. It would be good if an agent that gets trained to come up with solutions that we approve of is not the same agent that explains to us why we should or should not approve of a solution (because if it were, it would have an incentive to convince us).

Splitting Debate up into Two Subsystems

Nandi

In this post I will first recap how debate can help with value learning and note that a standard debater optimizes for convincingness. Then I will illustrate how two subsystems could help with value learning in a similar way, without optimizing for convincingness. (Of course this new system could have its own issues, which I don't analyse in depth.)

Debate serves to get a training signal about human values

Debate (for the purpose of AI safety) can be interpreted as a tool to collect training signals about human values. Debate is especially useful when we don’t know our values or their full implications and we can’t just verbalize or demonstrate what we...

Acknowledging Human Preference Types to Support Value Learning

Nandi

We analyze the usefulness of the framework of preference types [Berridge et al. 2009] to value learning by an artificial intelligence. In the context of AI the purpose of value learning is giving an AI goals aligned with humanity. We will lay the groundwork for establishing how human preferences of different types are (descriptively) or ought to be (normatively) aggregated.

This blogpost (1) describes the framework of preference types and how these can be inferred, (2) considers how an AI could aggregate our preferences, and (3) suggests how to choose an aggregation method. Lastly, we consider potential future directions that could be taken in this area.

Motivation

The reason that the concept...

LESSWRONG
Nandi

Intricacies of Feature Geometry in Large Language Models
