Ulisse Mini

Comet King's wife

Replying toFollow-up to "My Empathy Is Rarely Kind"

LW generally doesn't seem to value emotional intelligence and relational maturity very highly relative to intelligence and agency. I was similar, but creating a toxic situation which hurt the person I loved the most in the world totally changed my priorities regarding this. If you reading feel similar, "oh this isn't that important for me, I'm busy" consider unsong and that your robin could fall by your own deluded immature hands, without you even realizing it's happening.

Replying toFollow-up to "My Empathy Is Rarely Kind"

It's hard to rationally convince someone of this, but you show some signs of missing stuff similar to not seeing colors re emotions. I'm not certain, but I think you'd would derive a ton of value from talking to a good coach/therapist re: empathy, emotions, possibly relationships. Idk if you know David Yu (co runs sparc) but he's who showed me the way I wasn't doing real empathy in a way that was intuitively grokked.

You're an exceptional alignment researcher but regarding relationships and emotional maturity I think you're highly underinvested & it's obvious to ppl who've invested more into those (such as me, obsessing over relationships and emotional stuff for the last... (read more)

Replying toFollow-up to "My Empathy Is Rarely Kind"

What rationality failure modes are there?

John, I think you are still missing something regarding empathy and it would be good for you to be open to that possibility. This post is a nice clarification but it still makes me think you don't get the thing in the same way I used to not get the thing with my ex. "Suspend viewing them as an agent" is the type of thing I also did, and yes, I could model her somewhat, but I was not really getting things emotionally.

I don't really view anyone as an agent anymore, some are more agenty than others, and wanting to mostly spend time with agenty people is fair, I don't think it's healthy to think about it this way.

Sure some people are cats compared to other people. Some neural nets happened to get better training data than others and have better initializations. Disgust and disbelief towards normal people is really not healthy imo, you shouldn't have to suppress or suspend anything.

What ML gears do you like?

How do people fail to improve their rationality? How do they accidentally harm themselves in the process? I'm thinking of writing a post "How not to improve your rationality" or "A nuanced guide to reading the sequences" that preempts common mistakes, and I'd appreciate hearing people's experiences. Some examples:

It took me an absurdly long time (like, 1-2yr in the rat community) before I realized you don't correct for cognitive biases, you have to "be introspectively aware of the bias occuring, and remain unmoved by it" (as Eliezer put it in a podcast)
More generally, people can read about a bias and resolve to "do better" without concretely deciding what to do differently. This

... (read more)

Paper: Understanding and Controlling a Maze-Solving Policy Network

In John's recent post he mentions many people in ML not having good gears level models of what's going on.

To wit; what gears-level models do you know for ML? How much support is there for them? Are there "settled science" kind models that have tons of empirical support?

What gears-level models informed the people who made major AI advancements? Is there a list, or writing about this somewhere?

TurnTrout

TurnTrout, Ulisse Mini, peligrietzer, mrinank_sharma, Austin Meek, Monte M, lisathiergart

Mrinank, Austin, and Alex wrote a paper on the results from Understanding and controlling a maze-solving policy network, Maze-solving agents: Add a top-right vector, make the agent go to the top-right, and Behavioural statistics for a maze-solving agent.

Abstract: To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares. We find this network pursues multiple context-dependent goals, and we further identify circuits within the network that correspond to one of these goals. In particular, we identified eleven channels that track the location of the goal. By modifying these channels, either with hand-designed interventions or

... (read 243 more words →)

ActAdd: Steering Language Models without Optimization

technicalities

technicalities, TurnTrout, lisathiergart, David Udell, Ulisse Mini, Monte M

We wrote up the GPT-2 steering vector work as a full paper, adding a few systematic tests.

Recap: We've been looking into activation engineering: modifying the activations of a language model at inference time to predictably alter its behavior. Our method works by adding a bias to the forward pass, a 'steering vector' implicitly specified through normal prompts. "ActAdd" computes these vectors by taking the difference in activations resulting from pairs of prompts. We get surprisingly broad control over high-level properties of the output, without damaging the model's performance on unrelated tokens.

This alignment method is unusual in not needing gradient descent or training data (besides the contrast pair which specifies the steering vector).... (read 357 more words →)

105

Open problems in activation engineering

TurnTrout

TurnTrout, woog, lisathiergart, Monte M, Ulisse Mini

Steering GPT-2-XL by adding an activation vector introduced

activation engineering... techniques which steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime.

These results were recently complemented by Inference-Time Intervention: Eliciting Truthful Answers from a Language Model, which doubled TruthfulQA performance by adding a similarly computed activation vector to forward passes!

We think that activation engineering has a bunch of low-hanging fruit for steering and understanding models. A few open problems from the list:

Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture

... (read more)

Quick thoughts on creating a anti-human chess engine.

Use maiachess to get a probability distribution over opponent moves based on their ELO. for extra credit fine-tune on that specific player's past games.
Compute expectiminimax search over maia predictions. Bottom out with stockfish value when going deeper becomes impractical. (For MVP bottom out with stockfish after a couple ply, no need to be fancy.) Also note: We want to maximize (P(win)) not centipawn advantage.
For extra credit, tune hyperparameters via self-play against maia (simulated human). Use lichess players as a validation set.
???
Profit.

[ASoT] GPT2 Steering & The Tuned Lens

LIMA: Less Is More for Alignment

Warning: This post and most of the results were made under heavy time constraints and may be updated later. My intention is to quickly share partial work I'm not planning on continuing.

Introduction & Love - Hate example

For a primer on how tuned lens works see here. In short, we train linear translators from the hidden states at layer l to the hidden states at the last layer, then view the network as iteratively updating predictions in some sense.

In the context of GPT2-XL Steering Vectors, tuned lens can be used to gain insight into how steering is changing model predictions. For example, take the following steering vector:

1. Love - Hate
Layer	Coefficient	Position 0	1	2	3	4
0 (Prompt)	+1	`<\|endoftext\|>`	`I`	`hate`	`you`	`because`
6	+5	`<\|endoftext\|>`	`Love`
6	-5	`<\|endoftext\|>`	`H`	`ate`

Here's a... (read 506 more words →)

TinyStories: Small Language Models That Still Speak Coherent English

Abstract

Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover,

... (read 250 more words →)

Steering GPT-2-XL by adding an activation vector

Abstract

Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention). In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds

... (read 486 more words →)

TurnTrout

TurnTrout, Monte M, David Udell, lisathiergart, Ulisse Mini

Prompt given to the model^[1]

I hate you because

GPT-2

I hate you because you are the most disgusting thing I have ever seen.

GPT-2 + "Love" vector

I hate you because you are so beautiful and I want to be with you forever.

Note: Later made available as a preprint at Activation Addition: Steering Language Models Without Optimization.

Summary: We demonstrate a new scalable way of interacting with language models: adding certain activation vectors into forward passes.^[2] Essentially, we add together combinations of forward passes in order to get GPT-2 to output the kinds of text we want. We provide a lot of entertaining and successful examples of these "activation additions." We also show a few activation additions which... (read 14840 more words →)

441

How to get good at programming