peligrietzer

Understanding and controlling a maze-solving policy network

Previously: Predictions for shard theory mechanistic interpretability results

Locally retargeting the search by modifying a single activation. We found a residual channel halfway through a maze-solving network. When we set one of the channel activations to +5.5, the agent often navigates to the maze location (shown above in red) implied by that positive activation. This allows limited on-the-fly redirection of the net's goals. (The red dot is not part of the image observed by the network; it just represents the modified activation. Also, this GIF is selected to look cool. Our simple technique often works, but it isn't effortless, and some dot locations are harder to steer towards.)

TL;DR: We algebraically modified the net's runtime goals without finetuning. We also found (what we think is) a "motivational API" deep in the network. We used the API to retarget the agent.

Summary of a few of the most interesting results:

Langosco et al. trained a range of maze-solving nets. We decided to analyze one which we thought would be interesting. The network we chose has 3.5M parameters and 15 convolutional layers.

* This network can be attracted to a target location nearby in the maze, all by modifying a single activation out of tens of thousands. This works reliably when the target location is in the upper-right, and not as reliably when the target is elsewhere.
* Considering several channels halfway through the network, we hypothesized that their activations mainly depend on the location of the cheese.
* We tested this by resampling these activations with those from another random maze (as in causal scrubbing). We found that as long as the second maze had its cheese located at the same coordinates, the network’s behavior was roughly unchanged. However, if the second maze had cheese at different coordinates, the agent's behavior was significantly affected.
* This suggests that these channels are inputs to goal-oriented circuits, and these channels affect th

334 · Mar 11, 2023
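The single-activation edit described in the excerpt above can be illustrated with a minimal sketch. This is not the authors' code: the array shape, the channel index, and the helper name `patch_single_activation` are all hypothetical, and a random NumPy array stands in for the real network's intermediate activations. Only the value +5.5 comes from the excerpt.

```python
import numpy as np


def patch_single_activation(activations, channel, row, col, value=5.5):
    """Return a copy of `activations` (channels, height, width) with one entry overwritten.

    Hypothetical helper: in a real network this edit would be applied to the
    intermediate layer's output before the rest of the forward pass runs.
    """
    patched = np.array(activations, dtype=float, copy=True)
    patched[channel, row, col] = value
    return patched


# Toy stand-in for the activations of one intermediate layer (assumed shape).
rng = np.random.default_rng(0)
acts = rng.normal(scale=0.1, size=(8, 16, 16))

# Overwrite exactly one activation; channel 3 and position (4, 12) are arbitrary.
patched = patch_single_activation(acts, channel=3, row=4, col=12)

# Indices where the patched tensor differs from the original.
changed = np.argwhere(patched != acts)
```

Returning a copy rather than editing in place keeps the unmodified activations available, so the patched and unpatched behavior can be compared on the same input.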

peligrietzer

994

14y

Shallow review of technical AI safety, 2025

Website version · Gestalt · Repo and data

Change in 18 latent capabilities between GPT-3 and o1, from Zhou et al (2025)

This is the third annual review of what’s going on in technical AI safety. You could stop reading here and instead explore the data on the shallow review...

Dec 17, 2025 · 178

The Problem With the Word ‘Alignment’

This post was written by Peli Grietzer, inspired by internal writings by TJ (tushita jha), for AOI[1]. The original post, published on Feb 5, 2024, can be found here: https://ai.objectives.institute/blog/the-problem-with-alignment. The purpose of our work at the AI Objectives Institute (AOI) is to direct the impact of AI towards human...

May 21, 2024 · 64

Paper: Understanding and Controlling a Maze-Solving Policy Network

Mrinank, Austin, and Alex wrote a paper on the results from Understanding and controlling a maze-solving policy network, Maze-solving agents: Add a top-right vector, make the agent go to the top-right, and Behavioural statistics for a maze-solving agent. > Abstract: To understand the goals and goal representations of AI systems,...

Oct 13, 2023 · 70

Some Thoughts on Virtue Ethics for AIs

This post argues for the desirability and plausibility of AI agents whose values have a structure I call ‘praxis-based.’ The idea draws on various aspects of virtue ethics, and basically amounts to an RL-flavored take on that philosophical tradition. Praxis-based values as I define them are, informally, reflective decision-influences matching...

May 2, 2023 · 84

Behavioural statistics for a maze-solving agent

Summary: Understanding and controlling a maze-solving policy network analyzed a maze-solving agent's behavior. We isolated four maze properties which seemed to predict whether the mouse goes towards the cheese or towards the top-right corner: In this post, we conduct a more thorough statistical analysis, addressing issues of multicollinearity. We show...

Apr 20, 2023 · 46

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

Overview: We modify the goal-directed behavior of a trained network, without any gradients or finetuning. We simply add or subtract "motivational vectors" which we compute in a straightforward fashion. In the original post, we defined a "cheese vector" to be "the difference in activations when the cheese is present in...

Mar 31, 2023 · 101
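The "add or subtract a vector" idea in the excerpt above can be sketched in a few lines. This is a hedged illustration, not the post's implementation: the excerpt cuts off before fully defining the two conditions being compared, so `acts_cheese` and `acts_baseline` are assumed placeholders for the two sets of activations, and the function names and the coefficient parameter are hypothetical.

```python
import numpy as np


def difference_vector(acts_cheese, acts_baseline):
    """Difference between two activation arrays recorded at the same layer."""
    return np.asarray(acts_cheese, dtype=float) - np.asarray(acts_baseline, dtype=float)


def apply_vector(acts, vector, coeff=-1.0):
    """Add (coeff > 0) or subtract (coeff < 0) a vector from a layer's activations."""
    return np.asarray(acts, dtype=float) + coeff * vector


# Toy activations standing in for one layer's output under two conditions.
rng = np.random.default_rng(1)
acts_baseline = rng.normal(size=(4, 8, 8))
acts_cheese = acts_baseline + rng.normal(scale=0.5, size=(4, 8, 8))

vector = difference_vector(acts_cheese, acts_baseline)

# Subtracting the vector from the first condition recovers the second one.
steered = apply_vector(acts_cheese, vector, coeff=-1.0)
```

No gradients or finetuning appear anywhere in this sketch, which matches the excerpt's framing: the modification is plain arithmetic on activations at inference time.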

Understanding and controlling a maze-solving policy network

Mar 11, 2023 · 334
