All of CatGoddess's Comments + Replies

CatGoddessΩ230

I agree that motivation should reduce to low-level, primitive things, and also that changing the agent's belief about where the cheese is lets you retarget behavior. However, I don't expect edits to beliefs to let you scalably control what the agent does, in that if it's smart enough and making sufficiently complicated plans you won't have a reliable mapping from (world model state) to (abstract class of behavior executed by the agent), where when I say "abstract class of behavior" I mean things like "put the red balls in the blue basket" or "pet all the c... (read more)

6TurnTrout
Agreed. Yeah, I don't think it's very practical to retarget the search for AGI, and "scalable control via internal retargeting" isn't the main thing which excited me about this line of research. I'm more interested in understanding the structure of learned motivational circuitry, and thereby having a better idea of inductive biases and how to structure training processes so as to satisfy different training goals.  I'm also interested in new interp and AI-steering techniques which derive from our results. 

My main advice to avoid this failure mode is to leverage your Pareto frontier. Apply whatever knowledge, or combination of knowledge, you have which others in the field don’t.

 

This makes sense if you already have knowledge which other people don't, but what about if you don't? How much should "number of people in the alignment community who already know X thing" factor into what you decide to study, relative to other factors like "how useful is X thing, when you ignore what everyone else is doing?" For instance, there are probably fewer people who kno... (read more)

3johnswentworth
My expectation is that if you do the Alignment Game Tree exercise and maybe a few others like it relatively early, and generally study what seems useful from there, and update along the way as you learn more stuff, you'll end up reasonably-differentiated from other researchers by default. On the other hand, if you find yourself literally only studying ML, then that would be a clear sign that you should diversify more (and also I would guess that's an indicator that you haven't gone very deep into the Game Tree).
CatGoddessΩ040

Great post! I'm looking forward to seeing future projects from Team Shard. 

I'm curious why you frame channel 55 as being part of the agent's "cheese-seeking motivation," as opposed to simply encoding the agent's belief about where the cheese is. Unless I'm missing something, I'd expect the latter to be as or more likely - in that when you change the cheese's location, the thing that should straightforwardly change is the agent's model of the cheese's location.

9TurnTrout
In addition to what Peli said, I would consider "changes where the agent thinks the cheese is" to be part of "changing/retargeting the cheese-seeking motivation." Ultimately, I think "cheese-seeking motivation" is shorthand for ~"a subgraph of the computational graph of a forward pass which locally attracts the agent to a target portion of the maze, where that target tracks the cheese when cheese is present." And on that view, modifying channel 55 would be part of modifying cheese-seeking motivation.  Ultimately, "motivation" is going to reduce to non-motivational, primitive computational operations, and I think it'll feel weird the first few times we see that happen. For example, I might wonder "where's the motivation really at, isn't this channel just noting where the cheese is?".
1peligrietzer
The main reason is that different channels that each code cheese locations (e.g. channel 42, channel 88) seem to initiate computations that each encourage cheese-pursuit conditional on slightly different conditions. We can think of each of these channels as a perceptual gate to a slightly different conditionally cheese-pursuing computation.
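To make the "modifying channel 55" discussion above concrete, here is a minimal sketch of the kind of single-channel activation edit being described. Everything in it is a hypothetical stand-in rather than the actual maze-policy code: a tiny conv net plays the role of the policy, `CHEESE_CHANNEL = 55` plays the role of the channel discussed above, and the layer index, target cell, and patch magnitude are made-up parameters.

```python
# Sketch: retarget an agent by overwriting one "cheese-location" channel.
# All names and numbers here are illustrative stand-ins, not the real policy.

import torch
import torch.nn as nn

CHEESE_CHANNEL = 55            # hypothetical index of the cheese-tracking channel
TARGET_ROW, TARGET_COL = 3, 9  # cell we want the agent to "believe" holds the cheese

# Stand-in policy: conv trunk -> flatten -> action logits.
policy = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # layer whose channel we patch
    nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(15),  # discrete action logits
)

def patch_cheese_channel(module, inputs, output):
    """Forward hook: wipe the channel, then write a bump at the target cell."""
    patched = output.clone()
    patched[:, CHEESE_CHANNEL, :, :] = 0.0
    patched[:, CHEESE_CHANNEL, TARGET_ROW, TARGET_COL] = 5.0  # magnitude is a free parameter
    return patched  # returning a tensor replaces the layer's output

# Attach the hook to the layer assumed to carry the cheese-location channel.
hook = policy[2].register_forward_hook(patch_cheese_channel)

obs = torch.randn(1, 3, 16, 16)        # stand-in observation
logits = policy(obs)                   # forward pass runs with the patched channel
action = torch.argmax(logits, dim=-1)  # does the chosen action now head toward (3, 9)?
print(action)

hook.remove()  # restore the unpatched policy
```

Whether an edit like this retargets behavior robustly, or only triggers the particular conditionally cheese-pursuing computation that one channel gates (per the comment above), is exactly the empirical question this thread is circling.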

Frames can be wrong, and using a wrong frame is costly, even if (especially if) everyone agrees on the frame.

 

It seems to me that having a wrong shared frame when studying a problem might still be useful as long as it's not too wrong (as long as the way it divides the world up isn't too far away from the "real" lines), because the world is high-dimensional and having a frame makes thinking about it more tractable. And it can be useful to share this wrong-but-not-too-wrong frame with other people because then you and your colleagues can talk to each othe... (read more)

3johnswentworth
Yup, that's right. A wrong frame is costly relative to the right frame. A less wrong frame can still be less costly than a more wrong frame, and that's especially relevant when nobody knows what the right frame is yet.