All of CatGoddess's Comments + Replies

CatGoddessΩ230

I agree that motivation should reduce to low-level, primitive things, and also that changing the agent's belief about where the cheese is lets you retarget behavior. However, I don't expect edits to beliefs to let you scalably control what the agent does, in that if it's smart enough and making sufficiently complicated plans you won't have a reliable mapping from (world model state) to (abstract class of behavior executed by the agent), where when I say "abstract class of behavior" I mean things like "put the red balls in the blue basket" or "pet all the c... (read more)

6TurnTrout
Agreed. Yeah, I don't think it's very practical to retarget the search for AGI, and "scalable control via internal retargeting" isn't the main thing which excited me about this line of research. I'm more interested in understanding the structure of learned motivational circuitry, and thereby having a better idea of inductive biases and how to structure training processes so as to satisfy different training goals.  I'm also interested in new interp and AI-steering techniques which derive from our results. 

My main advice to avoid this failure mode is to leverage your Pareto frontier. Apply whatever knowledge, or combination of knowledge, you have which others in the field don’t.

 

This makes sense if you already have knowledge which other people don't, but what about if you don't? How much should "number of people in the alignment community who already know X thing" factor into what you decide to study, relative to other factors like "how useful is X thing, when you ignore what everyone else is doing?" For instance, there are probably fewer people who kno... (read more)

3johnswentworth
My expectation is that if you do the Alignment Game Tree exercise and maybe a few others like it relatively early, and generally study what seems useful from there, and update along the way as you learn more stuff, you'll end up reasonably-differentiated from other researchers by default. On the other hand, if you find yourself literally only studying ML, then that would be a clear sign that you should diversify more (and also I would guess that's an indicator that you haven't gone very deep into the Game Tree).
CatGoddessΩ040

Great post! I'm looking forward to seeing future projects from Team Shard. 

I'm curious why you frame channel 55 as being part of the agent's "cheese-seeking motivation," as opposed to simply encoding the agent's belief about where the cheese is. Unless I'm missing something, I'd expect the latter to be as or more likely - in that when you change the cheese's location, the thing that should straightforwardly change is the agent's model of the cheese's location.

9TurnTrout
In addition to what Peli said, I would consider "changes where the agent thinks the cheese is" to be part of "changing/retargeting the cheese-seeking motivation." Ultimately, I think "cheese-seeking motivation" is shorthand for ~"a subgraph of the computational graph of a forward pass which locally attracts the agent to a target portion of the maze, where that target tracks the cheese when cheese is present." And on that view, modifying channel 55 would be part of modifying cheese-seeking motivation.  Ultimately, "motivation" is going to reduce to non-motivational, primitive computational operations, and I think it'll feel weird the first few times we see that happen. For example, I might wonder "where's the motivation really at, isn't this channel just noting where the cheese is?".
1peligrietzer
The main reason is that different channels that each code cheese locations (e.g. channel 42, channel 88) seem to initiate computations that each encourage cheese-pursuit conditional on slightly different conditions. We can think of each of these channels as a perceptual gate to a slightly different conditionally cheese-pursuing computation.
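To make the "modifying channel 55" discussion above concrete, here is a minimal sketch of the kind of single-channel activation edit being described. Everything in it is a hypothetical stand-in rather than the actual maze-policy code: a tiny conv net plays the role of the policy, `CHEESE_CHANNEL = 55` plays the role of the channel discussed above, and the layer index, target cell, and patch magnitude are made-up parameters.

```python
# Sketch: retarget an agent by overwriting one "cheese-location" channel.
# All names and numbers here are illustrative stand-ins, not the real policy.

import torch
import torch.nn as nn

CHEESE_CHANNEL = 55            # hypothetical index of the cheese-tracking channel
TARGET_ROW, TARGET_COL = 3, 9  # cell we want the agent to "believe" holds the cheese

# Stand-in policy: conv trunk -> flatten -> action logits.
policy = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # layer whose channel we patch
    nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(15),  # discrete action logits
)

def patch_cheese_channel(module, inputs, output):
    """Forward hook: wipe the channel, then write a bump at the target cell."""
    patched = output.clone()
    patched[:, CHEESE_CHANNEL, :, :] = 0.0
    patched[:, CHEESE_CHANNEL, TARGET_ROW, TARGET_COL] = 5.0  # magnitude is a free parameter
    return patched  # returning a tensor replaces the layer's output

# Attach the hook to the layer assumed to carry the cheese-location channel.
hook = policy[2].register_forward_hook(patch_cheese_channel)

obs = torch.randn(1, 3, 16, 16)        # stand-in observation
logits = policy(obs)                   # forward pass runs with the patched channel
action = torch.argmax(logits, dim=-1)  # does the chosen action now head toward (3, 9)?
print(action)

hook.remove()  # restore the unpatched policy
```

Whether an edit like this retargets behavior robustly, or only triggers the particular conditionally cheese-pursuing computation that one channel gates (per the comment above), is exactly the empirical question this thread is circling.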

Frames can be wrong, and using a wrong frame is costly, even if (especially if) everyone agrees on the frame.

 

It seems to me that having a wrong shared frame when studying a problem might still be useful as long as it's not too wrong (as long as the way it divides the world up isn't too far away from the "real" lines), because the world is high-dimensional and having a frame makes thinking about it more tractable. And it can be useful to share this wrong-but-not-too-wrong frame with other people because then you and your colleagues can talk to each othe... (read more)

3johnswentworth
Yup, that's right. A wrong frame is costly relative to the right frame. A less wrong frame can still be less costly than a more wrong frame, and that's especially relevant when nobody knows what the right frame is yet.