peligrietzer

2mo

Website version · Gestalt · Repo and data

Change in 18 latent capabilities between GPT-3 and o1, from Zhou et al (2025)

This is the third annual review of what’s going on in technical AI safety. You could stop reading here and instead explore the data on the shallow review website.

It’s shallow in the sense that 1) we are not specialists in almost any of it and that 2) we only spent about two hours on each entry. Still, among other things, we processed every arXiv paper on alignment, all Alignment Forum posts, as well as a year’s worth of Twitter.

It is substantially a list of lists structuring 800 links. The point is to produce stylised... (read 24743 more words →)

177

•••

The Problem With the Word ‘Alignment’

peligrietzer

peligrietzer, particlemania

This post was written by Peli Grietzer, inspired by internal writings by TJ (tushita jha), for AOI^[1]. The original post, published on Feb 5, 2024, can be found here: https://ai.objectives.institute/blog/the-problem-with-alignment.

The purpose of our work at the AI Objectives Institute (AOI) is to direct the impact of AI towards human autonomy and human flourishing. In the course of articulating our mission and positioning ourselves -- a young organization -- in the landscape of AI risk orgs, we’ve come to notice what we think are serious conceptual problems with the prevalent vocabulary of ‘AI alignment.’ This essay will discuss some of the major ways in which we think the concept of ‘alignment’ creates bias... (read 1575 more words →)

Paper: Understanding and Controlling a Maze-Solving Policy Network

TurnTrout

TurnTrout, Ulisse Mini, peligrietzer, mrinank_sharma, Austin Meek, Monte M, lisathiergart

Mrinank, Austin, and Alex wrote a paper on the results from Understanding and controlling a maze-solving policy network, Maze-solving agents: Add a top-right vector, make the agent go to the top-right, and Behavioural statistics for a maze-solving agent.

Abstract: To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares. We find this network pursues multiple context-dependent goals, and we further identify circuits within the network that correspond to one of these goals. In particular, we identified eleven channels that track the location of the goal. By modifying these channels, either with hand-designed interventions or

... (read 243 more words →)

Replying toCosmopolitan values don't come free

peligrietzer3y

Cosmopolitan values don't come free

Possibly relevant?

Replying toSome Thoughts on Virtue Ethics for AIs

peligrietzer3y*

Some Thoughts on Virtue Ethics for AIs

I describe the more formal definition in the post:

'Actions (or more generally 'computations') get an x-ness rating. We define the x shard's expected utility conditional on a candidate action a as the sum of two utility functions: a bounded utility function on the x-ness of a and a more tightly bounded utility function on the expected aggregate x-ness of the agent's future actions conditional on a. (So the shard will choose an action with mildly suboptimal x-ness if it gives a big boost to expected aggregate future x-ness, but refuse certain large sacrifices of present x-ness for big boosts to expected aggregate future x-ness.)'

And as I say in the post, we should expect decision-influences matching this definition to be natural and robust only in cases where x is a 'self-promoting' property. A property x is 'self-promoting' if it is reliably the case that performing an action with a higher x-ness rating increases the expected aggregate x-ness of future actions.

Replying toSome Thoughts on Virtue Ethics for AIs

peligrietzer3y*

Some Thoughts on Virtue Ethics for AIs

Yep! Or rather arguing that from a broadly RL-y + broadly Darwinian point of view 'self-consistent ethics' are likely to be natural enough that we can instill them, sticky enough to self-maintain, and capabilities-friendly enough to be practical and/or survive capabilities-optimization pressures in training.

-1

Some Thoughts on Virtue Ethics for AIs

peligrietzer

This post argues for the desirability and plausibility of AI agents whose values have a structure I call ‘praxis-based.’ The idea draws on various aspects of virtue ethics, and basically amounts to an RL-flavored take on that philosophical tradition.

Praxis-based values as I define them are, informally, reflective decision-influences matching the description ‘promote x x-ingly’: ‘promote peace peacefully,’ ‘promote corrigibility corrigibly,’ ‘promote science scientifically.’

I will later propose a quasi-formal definition of this values-type, but the general idea is that certain values are an ouroboros of means and end. Such values frequently come up in human “meaning of life” activities (e.g. math, art, craft, friendship, athletics, romance, technology), as well as in complex... (read 967 more words →)

Replying toBehavioural statistics for a maze-solving agent

peligrietzer3y

Behavioural statistics for a maze-solving agent

This brings up something interesting: seems worthwhile to compare the internals of a 'misgeneralizing,' small n agent with those of large a n agents and check whether there seems to be a phase transition in how the network operates internally or not.

Replying toBehavioural statistics for a maze-solving agent

peligrietzer3y

Behavioural statistics for a maze-solving agent

I'd maybe point the finger more at the simplicity of the training task than at the size of the network? I'm not sure there's strong reason to believe the network is underparameterized for the training task. But I agree that drawing lessons from small-ish networks trained on simple tasks requires caution.

Behavioural statistics for a maze-solving agent

peligrietzer

peligrietzer, TurnTrout

Summary: Understanding and controlling a maze-solving policy network analyzed a maze-solving agent's behavior. We isolated four maze properties which seemed to predict whether the mouse goes towards the cheese or towards the top-right corner:

In this post, we conduct a more thorough statistical analysis, addressing issues of multicollinearity. We show strong evidence that (2) and (3) above are real influences on the agent's decision-making, and weak evidence that (1) is also a real influence. As we speculated in the original post,^[1] (4) falls away as a statistical artifact.

Peli did the stats work and drafted the post, while Alex provided feedback, expanded the visualizations, and ran additional tests for multicollinearity. Some of the work completed... (read 2955 more words →)

Replying toMaze-solving agents: Add a top-right vector, make the agent go to the top-right

peligrietzer3y

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

I would again suggest a 'perceptual' hypothesis regarding the subtraction/addition asymmetry. We're adding a representation of a path where there was no representation of a path (creates illusion of path), or removing a representation of a path where there was no representation of a path (does nothing).

-3

peligrietzer3y

No but I hope to have a chance to try something like it this year!

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

TurnTrout

TurnTrout, peligrietzer, lisathiergart

Overview: We modify the goal-directed behavior of a trained network, without any gradients or finetuning. We simply add or subtract "motivational vectors" which we compute in a straightforward fashion.

In the original post, we defined a "cheese vector" to be "the difference in activations when the cheese is present in a maze, and when the cheese is not present in the same maze." By subtracting the cheese vector from all forward passes in a maze, the network ignored cheese.

I (Alex Turner) present a "top right vector" which, when added to forward passes in a range of mazes, attracts the agent to the top-right corner of each maze. Furthermore, the cheese and top-right vectors... (read 3031 more words →)

101

Replying toUnderstanding and controlling a maze-solving policy network

peligrietzer3y

Understanding and controlling a maze-solving policy network

The main reason is that different channels that each code cheese locations (e.g. channel 42, channel 88) seem to initiate computations that each encourage cheese-pursuit conditional on slightly different conditions. We can think of each of these channels as a perceptual gate to a slightly different conditionally cheese-pursuing computation.

Understanding and controlling a maze-solving policy network

TurnTrout

TurnTrout, peligrietzer, Ulisse Mini, Monte M, David Udell

Previously: Predictions for shard theory mechanistic interpretability results

**Locally** **retargeting the search** **by modifying a single activation.** We found a residual channel halfway through a maze-solving network. When we set one of the channel activations to +5.5, the agent often navigates to the maze location (shown above in red) implied by that positive activation. This allows limited on-the-fly redirection of the net's goals.

(*The red dot is not part of the image observed by the network*, it just represents the modified activation. Also, this GIF is selected to look cool. Our simple technique often works, but it isn't effortless, and some dot locations are harder to steer towards.)

TL;DR: We algebraically modified the net's runtime... (read 6633 more words →)

334

Predictions for shard theory mechanistic interpretability results

TurnTrout

TurnTrout, Ulisse Mini, peligrietzer

How do agents work, internally? My (TurnTrout's) shard theory MATS team set out to do mechanistic interpretability on one of the goal misgeneralization agents: the cheese-maze network.

The network in action on its training distribution, where cheese is randomly spawned in the top-right 5x5 available grid region. For more training videos, see the rand_region_5 Google Drive folder.

We just finished phase 1 of our behavioral and interpretability experiments. Throughout the project, we individually booked predictions -- so as to reduce self-delusion from hindsight bias, to notice where we really could tell ahead of time what was going to happen, and to notice where we really were surprised.

So (especially if you're the kind of person who might... (read 1457 more words →)

105

[Simulators seminar sequence] #2 Semiotic physics - revamped

Jan

Jan, Charlie Steiner, Logan Riggs, janus, jacquesthibs, metasemi, Michael Oesterle, Lucas Teixeira, peligrietzer, remember

Update February 21st: After the initial publication of this article (January 3rd) we received a lot of feedback and several people pointed out that propositions 1 and 2 were incorrect as stated. That was unfortunate as it distracted from the broader arguments in the article and I (Jan K) take full responsibility for that. In this updated version of the post I have improved the propositions and added a proof for proposition 2. Please continue to point out weaknesses in the argument; that is a major motivation for why we share these fragments.

For comments and clarifications on the conceptual and philosophical aspects of this article, please read metasemi's excellent follow-up note here.

Meta:... (read 3871 more words →)

[Simulators seminar sequence] #1 Background & shared assumptions

Jan

Jan, Charlie Steiner, Logan Riggs, janus, jacquesthibs, metasemi, Michael Oesterle, Lucas Teixeira, peligrietzer, remember

Meta: Over the past few months, we've held a seminar series on the Simulators theory by janus. As the theory is actively under development, the purpose of the series is to discover central structures and open problems. Our aim with this sequence is to share some of our discussions with a broader audience and to encourage new research on the questions we uncover. Below, we outline the broader rationale and shared assumptions of the participants of the seminar.

Shared assumptions

Going into the seminar series, we determined a number of assumptions that we share. The degree to which each participant subscribes to each assumption varies, but we agreed to postpone discussions on these topics... (read 723 more words →)

Replying toAlignment allows "nonrobust" decision-influences and doesn't require robust grading

peligrietzer3y

Alignment allows "nonrobust" decision-influences and doesn't require robust grading

Having a go at extracting some mechanistic claims from this post:

A value x is a policy-circuit, and this policy circuit may sometimes respond to a situation by constructing a plan-grader and a plan-search.
The policy-circuit executing value x is trained to construct <plan-grader, plan-search> pairs that are 'good' according to the value x, and this excludes pairs that are predictably going to result in the plan-search Goodharting the plan-grader.
Normally, nothing is trying to argmax value x's goodness criterion for <plan-grader, plan-search> pairs. Value x's goodness criterion for <plan-grader, plan-search> pairs is normally just implicit in x's method for constructing <plan-grader, plan-search> pairs.
Value x may sometimes explicitly search over <plan-grader, plan-search> pairs in order

... (read more)

peligrietzer3yQuick Take

Here is a shard-theory intuition about humans, followed by an idea for an ML experiment that could proof-of-concept its application to RL:

Let's say I'm a guy who cares a lot about studying math well, studies math every evening, and doesn't know much about drugs and their effects. Somebody hands me some ketamine and recommends that I take ketamine this evening. I take the ketamine before I sit down to study math, and math study goes terrible intellectually but since I am on ketamine I'm having a good time and credit gets assigned to the 'taking ketamine before I sit down to study math' computation. So my policy network gets updated to increase... (read 485 more words →)

LESSWRONG
LW

LESSWRONG
LW

Understanding and controlling a maze-solving policy network

Shallow review of technical AI safety, 2025

Predictions for shard theory mechanistic interpretability results

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

peligrietzer

peligrietzer

Shallow review of technical AI safety, 2025

The Problem With the Word ‘Alignment’

Paper: Understanding and Controlling a Maze-Solving Policy Network

Some Thoughts on Virtue Ethics for AIs

Behavioural statistics for a maze-solving agent

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

Understanding and controlling a maze-solving policy network

peligrietzer

Understanding and controlling a maze-solving policy network

Shallow review of technical AI safety, 2025

Predictions for shard theory mechanistic interpretability results

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

peligrietzer

peligrietzer

Shallow review of technical AI safety, 2025

The Problem With the Word ‘Alignment’

Paper: Understanding and Controlling a Maze-Solving Policy Network

Some Thoughts on Virtue Ethics for AIs

Behavioural statistics for a maze-solving agent

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

Understanding and controlling a maze-solving policy network

Shared assumptions