Jessica Rumbelow

All examples in this post can be found in this notebook, which is also probably the easiest way to start experimenting with PIZZA.

From the research & engineering team at Leap Laboratories (incl. @Arush, @sebastian-sosa, @Robbie McCorkell), where we use AI interpretability to accelerate scientific discovery from data.

What is attribution?

One question we might ask when interacting with machine learning models is something like: “why did this input cause that particular output?”.

If we’re working with a language model like ChatGPT, we could actually just ask this in natural language: “Why did you respond that way?” or similar – but there’s no guarantee that the model’s natural language explanation actually reflects the underlying cause of...

Replying to Introducing Leap Labs, an AI interpretability startup

Thanks for the comment! I'll respond to the last part:

"First, developing basic insights is clearly not just an AI safety goal. It's an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good."

I think this could certainly be the case if we were trying to build state-of-the-art, broad-domain systems in order to use interpretability tools with them for knowledge discovery – but we're explicitly interested in using interpretability with narrow-domain systems.

"Interpretability is the backbone of knowledge discovery with deep learning": Deep learning models are really good at learning complex patterns and correlations in huge datasets that humans aren't able to parse....

Replying to Introducing Leap Labs, an AI interpretability startup

Thanks! Unsure as of yet – we could either keep it proprietary and provide access through an API (with some free version for select researchers), or open source it and monetise by offering a paid, hosted tier with integration support. Discussions are ongoing.

Replying to Introducing Leap Labs, an AI interpretability startup

This isn't set in stone, but likely we'll monetise by selling access to the interpretability engine, via an API. I imagine we'll offer free or subsidised access to select researchers/orgs. Another route would be to open source all of it, and monetise by offering a paid, hosted version with integration support etc.

Replying to Introducing Leap Labs, an AI interpretability startup

We're looking into it!

Replying to Introducing Leap Labs, an AI interpretability startup

Good questions. Doing any kind of technical safety research that leads to better understanding of state-of-the-art models carries with it the risk that by understanding models better, we might learn how to improve them. However, I think that the safety benefit of understanding models outweighs the risk of small capability increases, particularly since any capability increase is likely heavily skewed towards model-specific interventions (e.g. "this specific model trained on this specific dataset exhibits bias x in domain y, and could be improved by retraining with more varied data from domain y", rather than "the performance of all models of this kind could be improved with some intervention z"). I'm thinking about this a lot at the moment and would welcome further input.

Introducing Leap Labs, an AI interpretability startup

SolidGoldMagikarp III: Glitch token archaeology

We are thrilled to introduce Leap Labs, an AI startup. We’re building a universal interpretability engine.

We design robust interpretability methods with a model-agnostic mindset. These methods in concert form our end-to-end interpretability engine. This engine takes in a model, or ideally a model and its training dataset (or some representative portion thereof), and returns human-parseable explanations of what the model ‘knows’.

Research Ethos:

Reproducible and generalisable approaches win. Interpretability algorithms should produce consistent outputs regardless of any random initialisation. Future-proof methods make minimal assumptions about model architectures and data types. We’re building interpretability for next year’s models.
Relatedly, heuristics aren’t enough. Hyperparameters should always be theoretically motivated. It’s not enough that some method or configuration works well in

...

mwatkins, Jessica Rumbelow

The set of anomalous tokens which we found in mid-January are now being described as 'glitch tokens' and 'aberrant tokens' in online discussion, as well as (perhaps more playfully) 'forbidden tokens', 'unspeakable tokens' and 'cursed tokens'. We've mostly just called them 'weird tokens'.

GPT-3 speaks of 'the unspeakable one' when prompted about the enigmatic ‘ petertodd’

Research is ongoing, and a more serious research report will appear soon, but for now we thought it might be worth recording what is known about the origins of the various glitch tokens. Not why they glitch, but why these particular strings have ended up in the GPT-2/3/J token set.

['\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07', '\x08',

...

Replying to SolidGoldMagikarp (plus, prompt generation)

Aha!! Thanks Neel, makes sense. I’ll update the post

Replying to SolidGoldMagikarp (plus, prompt generation)

Yeah! Basically we just perform gradient descent on sensibly initialised embeddings (cluster centroids, or points close to the target output), constrain the embeddings to length 1 during the process, and penalise distance from the nearest legal token. We optimise the input embeddings to minimise the negative log probability of the target output token(s). Happy to have a quick call to go through the code if you like, DM me :)
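
A minimal sketch of that procedure on a toy problem, not the code behind the post. Everything here is an assumption made for illustration: the "model" is a single linear layer followed by a softmax, the vocabulary is a handful of hand-written vectors, and the names `optimise_embedding`, `nearest_token`, `penalty` and `lr` are hypothetical. It only shows the three ingredients described above: gradient descent on an input embedding, renormalising to length 1 after every step, and a penalty on the distance to the nearest legal token.

```python
import math
import random


def normalise(v):
    """Rescale a non-zero vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [x / s for x in exps]


def nearest_token(e, vocab):
    """Index of the vocabulary embedding closest to e (squared Euclidean distance)."""
    return min(
        range(len(vocab)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(e, vocab[i])),
    )


def optimise_embedding(vocab, W, target, steps=200, lr=0.5, penalty=0.1, seed=0):
    """Gradient descent on one input embedding for a toy model logits = W @ e.

    Loss = -log softmax(W @ e)[target] + penalty * ||e - nearest legal token||^2.
    The embedding is projected back to length 1 after every step.
    """
    rng = random.Random(seed)
    dim = len(vocab[0])
    # "Sensible" initialisation (an assumption here): vocabulary centroid plus noise.
    centroid = [sum(t[d] for t in vocab) / len(vocab) for d in range(dim)]
    e = normalise([c + 0.01 * rng.gauss(0, 1) for c in centroid])
    for _ in range(steps):
        logits = [sum(w * x for w, x in zip(row, e)) for row in W]
        p = softmax(logits)
        # Gradient of the negative log probability w.r.t. e is W^T (p - onehot).
        err = [p_k - (1.0 if k == target else 0.0) for k, p_k in enumerate(p)]
        grad = [sum(err[k] * W[k][d] for k in range(len(W))) for d in range(dim)]
        # Gradient of the penalty term, treating the nearest token as fixed.
        near = vocab[nearest_token(e, vocab)]
        grad = [g + 2 * penalty * (x - t) for g, x, t in zip(grad, e, near)]
        e = normalise([x - lr * g for x, g in zip(e, grad)])
    return e


# Toy setup: four unit-length "token embeddings" and a tied output layer.
toy_vocab = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]]
toy_W = [[4.0 * x for x in row] for row in toy_vocab]
found = optimise_embedding(toy_vocab, toy_W, target=2)
print("nearest legal token:", nearest_token(found, toy_vocab))
```

In a real language model the gradient would come from backpropagation through the full network rather than a closed-form expression, and several input positions would be optimised at once.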

Replying to SolidGoldMagikarp (plus, prompt generation)

SolidGoldMagikarp II: technical details and more recent findings

This link: https://help.openai.com/en/articles/6824809-embeddings-frequently-asked-questions says that token embeddings are normalised to length 1, but a quick inspection of the embeddings available through the huggingface model shows this isn't the case. I think that's the extent of our claim. For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.
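
The "quick inspection" amounts to computing the length of each embedding row and checking whether it is 1. A minimal sketch, assuming the embedding matrix is already available as a list of rows; the helper names and the toy matrix are made up for illustration. With a Hugging Face model the matrix would typically be read from `model.get_input_embeddings().weight`.

```python
import math


def row_norms(embeddings):
    """Euclidean length of each embedding row."""
    return [math.sqrt(sum(x * x for x in row)) for row in embeddings]


def all_unit_length(embeddings, tol=1e-6):
    """True if every row has length 1 within the tolerance."""
    return all(abs(n - 1.0) <= tol for n in row_norms(embeddings))


def normalise_rows(embeddings):
    """Rescale every row to length 1 (rows are assumed to be non-zero)."""
    return [
        [x / n for x in row]
        for row, n in zip(embeddings, row_norms(embeddings))
    ]


# Made-up 2-d rows standing in for a real embedding matrix.
toy = [[3.0, 4.0], [0.6, 0.8], [1.0, 1.0]]
print(all_unit_length(toy))                  # rows of different lengths
print(all_unit_length(normalise_rows(toy)))  # after normalising ourselves
```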

mwatkins, Jessica Rumbelow

tl;dr: This is a follow-up to our original post on prompt generation and the anomalous token phenomenon which emerged from that research. Work done by Jessica Rumbelow and Matthew Watkins in January 2023 at SERI-MATS.

part of a typical semantically coherent cluster we found in GPT2-small's embedding space

Clustering

As a result of work done on clustering tokens in GPT-2 and GPT-J embedding spaces, our attention was originally drawn to the tokens closest to the centroid of the entire set of 50,257 tokens shared across all GPT-2 and -3 models.^[1] These tokens were familiar to us for their frequent occurrence as closest tokens to the centroids of the (mostly semantically coherent, or semi-coherent) clusters of tokens...
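
Finding "the tokens closest to the centroid" can be sketched as follows. This is an illustrative toy, not the analysis code from the post: the function names are hypothetical, the distance is assumed to be Euclidean, and the five 2-d points stand in for the real 50,257-row embedding matrix.

```python
import math


def centroid(embeddings):
    """Mean of all embedding rows."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(row[d] for row in embeddings) / n for d in range(dim)]


def closest_to_centroid(embeddings, k=3):
    """Indices of the k rows nearest (Euclidean distance) to the centroid."""
    c = centroid(embeddings)
    dists = [math.dist(row, c) for row in embeddings]
    return sorted(range(len(embeddings)), key=lambda i: dists[i])[:k]


# Four corner points plus one point near the middle of the cloud.
toy = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [1.0, 1.1]]
print(closest_to_centroid(toy, k=1))
```

The same function applied to the points of a single cluster, rather than the whole vocabulary, gives the tokens closest to that cluster's centroid.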

Jessica Rumbelow, mwatkins

UPDATE (14th Feb 2023): ChatGPT appears to have been patched! However, very strange behaviour can still be elicited in the OpenAI playground, particularly with the davinci-instruct model.

More technical details here.

Further (fun) investigation into the stories behind the tokens we found here.

Work done at SERI-MATS, over the past two months, by Jessica Rumbelow and Matthew Watkins.

TL;DR

Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew)

We have found a set of anomalous tokens which result in a previously undocumented failure mode for GPT-2 and GPT-3 models. (The 'instruct' models “are particularly deranged” in this context, as janus has observed.)
Many of these tokens reliably break determinism in the OpenAI GPT-3 playground at temperature 0

...

More detail on this phenomenon here: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

Guardian AI (Misaligned systems are all around us.)

The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Work done @ SERI-MATS, idea from a conversation with Ivan Vendrov at Future Forum earlier this year.

Misaligned systems are all around us. They are what make me watch another video of a man in filthy shorts building a hut using only tools made from rocks and his own armpit hair. And the reason I have never, ever watched a single episode of Flavourful Origins in isolation. Maybe they make you mindlessly seek cat gifs, or keep you scrolling twitter in a cosy fug of righteous indignation long after you should be asleep. They could also be the reason your uncle is a bit more xenophobic now than he used to be. A...

Why I'm Working On Model Agnostic Interpretability

Work done @ SERI-MATS.

Evaluating interpretability methods (and so, developing good ones) is really hard because we have no ground truth. Or at least, no ground truth that we can compare our interpretations directly against.

The ground truth of a model's behaviour is provided by that model's architecture and its learned parameters. But, puny humans are unable to interpret this: it's precise, in that it accurately explains the model's behaviour, but it's not interpretable. On the other end of the spectrum we have something like "This model classifies cats" – a statement that is really easy to interpret, but lacks something in the way of precision.

Precise <---------------------------------> Interpretable

...