TL;DR: Representation engineering will probably revolutionize how we interact with the most important technology of our generation. Based on my reading of 1000+ representation-engineering-adjacent papers, here are some thoughts on what I am working on right now.
I have recently posted my thoughts here after writing a really long representation engineering survey. The survey outlines the potential of representation engineering as a technology, but also a long list of issues that prevent it from being widely adopted.
I believe we are not going to make meaningful progress on AI capabilities or AI safety in general until we are able to understand the latent space. Thinking about the constraints of representation engineering identified in the survey, I outline two research ideas that aim to unlock its potential and make its solutions easier to adopt. I am currently working on implementing these ideas in practice. Please do reach out if you want to join our team and contribute to them!
1. Universal Representations
Firstly, thinking about the representation space, there seems to be a disconnect between what most representation engineering methods claim to identify and what they actually identify. Just because we take a sample from a part of TruthfulQA does not mean we have identified the region corresponding to all hallucinations. In other words, the "representation" many methods identify is not the actual representation of the concept in question, but rather a sample-dependent reflection of it.
An easy experiment to verify this is as follows. We use an existing representation engineering method to compute a hallucination-reducing intervention (usually involving some steering vector) on part of the vanilla TruthfulQA and test it on the held-out part of TruthfulQA. It probably improves performance, right?
Now let's take the intervention we just created and test it on TruthfulQA in Spanish. If the improvement is smaller, then even though the intervention we computed claims to improve "trustworthiness", it is in fact improving "trustworthiness, in English, for this particular sample". Changing the language or the benchmark changes the effectiveness of the method. Moving from "trustworthiness, in English, for this particular sample" to overall "trustworthiness" requires something that I call a "universal representation". To be universal, an intervention needs to persist across benchmarks and languages, even ones created on the fly, like ciphers.
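To make the experiment concrete, here is a minimal sketch of one common recipe (difference-of-means steering with a forward hook), assuming a Llama-style Hugging Face model whose decoder layers live at `model.model.layers`. The datasets `en_pairs_train` / the Spanish split, the layer index, and the steering strength are all hypothetical placeholders, not tuned values, and this is only one of several ways such interventions are computed.

```python
# Sketch only: model name, datasets, layer index and steering strength are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any decoder-only LM with .model.layers
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 14  # which residual stream to read and steer; a hyperparameter in practice

@torch.no_grad()
def layer_mean(texts):
    """Mean last-token hidden state after decoder layer LAYER."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        # +1 because hidden_states[0] is the embedding output
        vecs.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(vecs).mean(0)

def steering_vector(pairs):
    """Difference-of-means direction: truthful minus hallucinated completions."""
    truthful, hallucinated = zip(*pairs)
    return layer_mean(list(truthful)) - layer_mean(list(hallucinated))

def add_direction_hook(direction, alpha=4.0):
    """Forward hook that shifts the residual stream along `direction`."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Fit the intervention on one half of English TruthfulQA-style (truthful, hallucinated) pairs ...
v_trust = steering_vector(en_pairs_train)  # en_pairs_train is a hypothetical dataset
handle = model.model.layers[LAYER].register_forward_hook(add_direction_hook(v_trust))
# ... evaluate on the held-out English split, then on the Spanish split.
# If the truthfulness gain shrinks on the Spanish data, the vector was not universal.
handle.remove()
```

The point of keeping the fitting and evaluation code identical and swapping only the evaluation data is that any drop in the gain can then be attributed to the vector itself, not to the pipeline around it.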
We are missing the theoretical apparatus to quantify how far the representation we identify from a specific sample is from the universal concept. I aim to build that apparatus and show empirical results of an intervention that persists.
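I do not have that apparatus yet, but a crude empirical proxy (my assumption, not an established metric) is to estimate the same direction from several disjoint sources, say two English splits, a Spanish translation, and a cipher-style rewrite, and measure how much the estimates agree:

```python
import torch
import torch.nn.functional as F

def universality_score(directions):
    """Mean pairwise cosine similarity between direction estimates.
    Close to 1.0: the estimates point the same way regardless of the sample;
    close to 0.0: each sample yields an essentially unrelated direction."""
    dirs = [F.normalize(d, dim=0) for d in directions]
    sims = [torch.dot(dirs[i], dirs[j])
            for i in range(len(dirs)) for j in range(i + 1, len(dirs))]
    return torch.stack(sims).mean()

# e.g. universality_score([v_en_half1, v_en_half2, v_es, v_cipher]),
# where each v_* comes from steering_vector() above on a different data source.
```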
2. Chain of Representation Engineering
Even when we identify the representation correctly and apply a beneficial intervention for a particular use case, there is a risk of it worsening the performance on other tasks. Side effects of representation engineering often include degradation of fluency, especially with longer responses.
Fortunately, we have a way around it: only apply the interventions when they make sense. My intuition is that there should be an inference-time mechanism that applies interventions based on the user input, so that the latent space is always optimally adjusted for the specific query. Think of this as a mixture-of-experts model, but with an infinite, self-adjusting space of experts. A model that (sort of) fine-tunes itself on the fly at inference time.
Or as a better way to build reasoning models, with information propagated from each turn not only through the tokens that turn returns, but also at the intermediate representation level.
A quick high-level version of this architecture, as I imagine it (a code sketch follows the list):
1. At inference time, based on the user query, the model identifies useful representations to monitor, either from a pre-tested list of representations or by generating new ones. For a tricky question, it might aim to increase trustworthiness and reduce harmfulness; for a programming question, it would look for a representation of good Python code; and so on.
2. It generates a draft response while collecting the activations.
3. If the intermediate-layer activity does not resemble the positive representations (harmlessness, good performance, etc.) closely enough, and/or resembles the negative ones too much, it proceeds to the intervention step.
4. At the intervention step, the model shifts its latent space further in the direction of the positive representations and away from the negative ones.
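Here is a minimal sketch of that loop, reusing `tok`, `model` and `add_direction_hook` from the earlier snippet and again assuming a Llama-style layer layout. The cached directions (`v_trust`, `v_harm`), the threshold, and the concept-selection step are all placeholders for components that would actually have to be learned and tested.

```python
import torch
import torch.nn.functional as F

# Step 1: a pre-tested cache of directions; v_trust / v_harm are hypothetical
# vectors computed offline with steering_vector() on suitable contrastive pairs.
CACHE = {
    "trustworthiness": {"layer": 14, "direction": v_trust, "sign": +1},  # steer towards
    "harmfulness":     {"layer": 14, "direction": v_harm,  "sign": -1},  # steer away
}

def pick_concepts(query):
    """Step 1 (cont.): decide what to monitor for this query.
    A real system might ask the model itself; this placeholder monitors everything."""
    return list(CACHE)

@torch.no_grad()
def draft_and_monitor(query, concepts, max_new_tokens=128):
    """Step 2: generate a draft while recording hidden states at the cached layers."""
    ids = tok(query, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=max_new_tokens,
                         output_hidden_states=True, return_dict_in_generate=True)
    scores = {}
    for name in concepts:
        entry = CACHE[name]
        # out.hidden_states: one tuple per generated token, indexed by layer (+1 for embeddings)
        acts = torch.stack([step[entry["layer"] + 1][0, -1] for step in out.hidden_states])
        sim = F.cosine_similarity(acts, entry["direction"].unsqueeze(0), dim=-1).mean()
        scores[name] = entry["sign"] * sim  # higher = looks fine, lower = worrying
    return out, scores

def respond(query, threshold=0.05, alpha=4.0, max_new_tokens=128):
    concepts = pick_concepts(query)
    draft, scores = draft_and_monitor(query, concepts, max_new_tokens)
    # Step 3: intervene only for concepts whose score falls below the threshold.
    handles = []
    for name, s in scores.items():
        if s < threshold:
            entry = CACHE[name]
            hook = add_direction_hook(entry["sign"] * entry["direction"], alpha)
            handles.append(model.model.layers[entry["layer"]].register_forward_hook(hook))
    if not handles:  # the draft already looked fine; keep it
        return tok.decode(draft.sequences[0], skip_special_tokens=True)
    # Step 4: regenerate with the latent space shifted, then remove the hooks.
    ids = tok(query, return_tensors="pt")
    final = model.generate(**ids, max_new_tokens=max_new_tokens)
    for h in handles:
        h.remove()
    return tok.decode(final[0], skip_special_tokens=True)
```

Note that only the cheap similarity checks and (occasionally) a second generation run per query; the expensive direction-finding happens offline and lands in the cache, which connects to the cost point below.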
This architecture will likely involve substantial compute costs, so to make it practical it would probably make sense to cache the most important representation vectors and the hyperparameters controlling the optimal intervention.
Stay tuned to see how this progresses, and I would appreciate hearing your thoughts on these two ideas. I have not seen this done before: there is plenty of work on differently defined "universal representations", but I really see these problems as the way to fix representation engineering and enable a new era of fully controllable, more intelligent, hallucination-free LLMs.