All of Jessica Rumbelow's Comments + Replies

Attribution can identify when system prompts are affecting behaviour. 
Note the diminished overall attribution when a hidden system prompt is responsible for the output (or is something else going on?).  Post on method here.

Yeah! So, hierarchical perturbation (HiPe) is a bit like a thresholded binary search. It starts by splitting the input into large overlapping chunks and perturbing each of them. If the resulting attributions for any of the chunks are above a certain level, those chunks are split into smaller chunks and the process continues. This works because it efficiently discards input regions that don't contribute much to the output, without having to individually perturb each token in them.

Standard iterative perturbation (ItP) is much simpler. It just splits the input...
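If it helps to see the shape of the thing, here's a minimal sketch of the hierarchical idea (not the actual HiPe implementation; the real method uses overlapping chunks and more careful saliency aggregation, and `score_fn`/`mask_id` are stand-ins for the model call and masking token):

```python
# Minimal sketch of thresholded hierarchical perturbation (illustrative only).
# score_fn(tokens) is assumed to return the model's probability of the target
# output given a (possibly masked) input; mask_id is a placeholder mask token.

def hipe_sketch(tokens, score_fn, mask_id, threshold=0.05, min_chunk=1):
    base = score_fn(tokens)
    attributions = [0.0] * len(tokens)

    def perturb(start, end):
        """Mask tokens[start:end] and return the drop in target probability."""
        masked = list(tokens)
        masked[start:end] = [mask_id] * (end - start)
        return base - score_fn(masked)

    def recurse(start, end):
        drop = perturb(start, end)
        if drop <= threshold:
            return  # this region barely affects the output, so discard it wholesale
        if end - start <= min_chunk:
            for i in range(start, end):
                attributions[i] = max(attributions[i], drop)
            return
        mid = (start + end) // 2
        recurse(start, mid)  # only chunks above the threshold get split further
        recurse(mid, end)

    recurse(0, len(tokens))
    return attributions
```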

Thanks for the comment! I'll respond to the last part:

"First, developing basic insights is clearly not just an AI safety goal. It's an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good."

I think this could certainly be the case if we were trying to build state of the art broad domain systems, in order to use interpretability tools with them for knowledge discovery – but we're explicitly interested in using interpretability with narrow domain systems. 

"Interpretability is the backbone of knowledge discover... (read more)

scasper
Thanks.  No. Am I concerned about risks from methods that work for this in narrow AI? Maybe.  This seems quite possibly useful, and I think I see what you mean. My confusion is largely from my initial assumption that the focus of this specific point directly involved existential AI safety and from the word choice of "backbone" which I would not have used. I think we're on the same page.   

Thanks! Unsure as of yet – we could either keep it proprietary and provide access through an API (with some free version for select researchers), or open source it and monetise by offering a paid, hosted tier with integration support. Discussions are ongoing. 

This isn't set in stone, but likely we'll monetise by selling access to the interpretability engine, via an API. I imagine we'll offer free or subsidised access to select researchers/orgs.  Another route would be to open source all of it, and monetise by offering a paid, hosted version with integration support etc.

Good questions. Doing any kind of technical safety research that leads to better understanding of state of the art models carries with it the risk that by understanding models better, we might learn how to improve them. However, I think that the safety benefit of understanding models outweighs the risk of small capability increases, particularly since any capability increase is likely heavily skewed towards model specific interventions (e.g. "this specific model trained on this specific dataset exhibits bias x in domain y, and could be improved by retraining...

Aha!! Thanks Neel, makes sense. I'll update the post.

Yeah! Basically we just perform gradient descent on sensibly initialised embeddings (cluster centroids, or points close to the target output), constrain the embeddings to length 1 during the process, and penalise distance from the nearest legal token. We optimise the input embeddings to minimise the negative log prob of the target output token(s). Happy to have a quick call to go through the code if you like, DM me :)
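For anyone who'd rather skim code first, here's a rough sketch of that recipe (not our actual implementation: the target token, initialisation, loss weighting and step count below are all placeholders):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)  # we only optimise the prompt embeddings

emb = model.get_input_embeddings().weight.detach()           # (vocab, d_model)
emb_unit = torch.nn.functional.normalize(emb, dim=-1)        # legal tokens, length 1

target_id = tokenizer.encode(" cheese")[0]                   # placeholder target token
n_prompt, d_model = 4, emb.shape[1]

# Initialise prompt embeddings (randomly here; in practice we use cluster
# centroids or points close to the target output).
prompt = torch.nn.Parameter(
    torch.nn.functional.normalize(torch.randn(n_prompt, d_model), dim=-1)
)
opt = torch.optim.Adam([prompt], lr=0.1)

for step in range(200):
    opt.zero_grad()
    unit_prompt = torch.nn.functional.normalize(prompt, dim=-1)    # length-1 constraint
    logits = model(inputs_embeds=unit_prompt.unsqueeze(0)).logits  # (1, n_prompt, vocab)
    log_probs = torch.log_softmax(logits[0, -1], dim=-1)
    nll = -log_probs[target_id]                                    # -log p(target | prompt)
    dist = torch.cdist(unit_prompt, emb_unit).min(dim=-1).values   # distance to nearest legal token
    loss = nll + dist.mean()
    loss.backward()
    opt.step()

# Read the optimised prompt out as its nearest legal tokens.
with torch.no_grad():
    ids = torch.cdist(torch.nn.functional.normalize(prompt, dim=-1), emb_unit).argmin(dim=-1)
print(tokenizer.decode(ids))
```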

ChrisCundy
Thanks for the elaboration, I'll follow up offline

This link: https://help.openai.com/en/articles/6824809-embeddings-frequently-asked-questions says that token embeddings are normalised to length 1, but a quick inspection of the embeddings available through the huggingface model shows this isn't the case. I think that's the extent of our claim. For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance. 
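For reference, that quick inspection is just something like the following (shown here for the GPT-2 checkpoint on HuggingFace; the norms vary a lot across tokens, i.e. they're clearly not unit length):

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
norms = model.get_input_embeddings().weight.norm(dim=-1)   # one norm per vocab token
print(norms.min().item(), norms.mean().item(), norms.max().item())
```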

Neel Nanda

Oh wait, that FAQ is actually nothing to do with GPT-3. That's about their embedding models, which map sequences of tokens to a single vector, and they're saying that those are normalised. Which is nothing to do with the map from tokens to residual stream vectors in GPT-3, even though that also happens to be called an embedding

Neel Nanda
That's GPT-2 though, right? I interpret that Q&A claim as saying that GPT-3 does the normalisation; I agree that GPT-2 definitely doesn't. But idk, doesn't really matter.

Interesting, what exactly do you mean by normalise? GPT-2 presumably breaks if you just outright normalise, since different tokens have very different norms.

Interesting! Can you give a bit more detail or share code?

neverix
It is based on this. I changed it to optimize using softmax instead of straight-through estimation and added regularization for the embedded tokens. Notebook link - this is a version that mimics this post instead of optimizing a single neuron as in the original. EDIT: github link

Interesting, thanks. There's not a whole lot of detail there - it looks like they didn't do any distance regularisation, which is probably why they didn't get meaningful results.

I'll check with Matthew - it's certainly possible that not all tokens in the "weird token cluster" elicit the same kinds of responses. 

lsusr
Thanks. I re-read your post and I think I understand better now. The cluster contains many weird tokens but not all tokens in the cluster are weird, nor do all tokens in the cluster elicit anomalous behavior.

SCP stands for "Secure, Contain, Protect" and refers to a collection of fictional stories, documents, and legends about anomalous and supernatural objects, entities, and events. These stories are typically written in a clinical, scientific, or bureaucratic style and describe various attempts to contain and study the anomalies. The SCP Foundation is a fictional organization tasked with containing and studying these anomalies, and the SCP universe is built around this idea. It's gained a large following online, and the SCP fandom refers to the community of...

lsusr

It's a science fiction writing hub. Some of the most popular stories are about things that mess with your perception.

Not yet, but there's no reason why it wouldn't be possible. You can imagine microscope AI for language models. It's on our to-do list.

Yep, aside from running forward prop n times to generate an output of length n, we can just optimise the mean probability of the target tokens at each position in the output - it's already implemented in the code. Although, it takes way longer to find optimal completions.
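Concretely, the loss for a multi-token target looks something like this (shapes and indexing conventions here are illustrative, not copied from the actual code):

```python
import torch

def multi_token_loss(logits, target_ids, prompt_len):
    """Negated mean probability of each target token at its position.

    logits:     (seq_len, vocab) for the sequence [prompt ++ target]
    target_ids: (n_target,) tensor of desired output token ids
    prompt_len: number of prompt embeddings preceding the target
    """
    probs = torch.softmax(logits, dim=-1)
    positions = torch.arange(len(target_ids)) + prompt_len - 1  # position i predicts target token i
    return -probs[positions, target_ids].mean()  # minimising this maximises mean target probability
```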

More detail on this phenomenon here: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

Adam Scherlis
Great work! I always wondered about that cluster of weird rare tokens: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights

Yeah, I think it could be! I’m considering pursuing it after SERI-MATS. I’ll need a couple of cofounders.

Hi Joseph! I'll briefly address the saliency map concern here – it likely originates from this paper, which showed that some types of saliency mapping methods had no more explanatory power than edge detectors. It's a great paper, and worth a read. The key thing to note is that this was only true of some gradient-based saliency mapping methods, which are, of course, model-specific. Gradients can be deceptive! Model agnostic, perturbation-based saliency mapping doesn't suffer from the same kind of problems – see p.12 here.

“Being able to reorganise a question in the form of a model-appropriate game” seems like something we already have a set of reasonable heuristics for: categorising different types of problems and their appropriate translations into ML-able tasks. There are well-established ML approaches to, e.g., image captioning, time-series prediction, audio segmentation, etc. Is the bottleneck you're concerned with, OP, the lack of breadth and granularity of these problem sets? And can we mark progress (to some extent) by the number of these problem sets we have robust ML translations for?

aogara
I think this is an important problem. Going from progress on ML benchmarks to progress on real-world tasks is a very difficult challenge. For example, years after human level performance on ImageNet, we still have lots of trouble with real-world applications of computer vision like self-driving cars and medical diagnostics. That's because ImageNet isn't a directly valuable real world task, but rather is built to be amenable to supervised learning models that output a single class label for each input.  While scale will improve performance within established paradigms, putting real world problems into ML paradigms remains squarely a problem for human research taste.