My main advice to avoid this failure mode is to leverage your Pareto frontier. Apply whatever knowledge, or combination of knowledge, you have which others in the field don’t.
This makes sense if you already have knowledge which other people don't, but what about if you don't? How much should "number of people in the alignment community who already know X thing" factor into what you decide to study, relative to other factors like "how useful is X thing, when you ignore what everyone else is doing?" For instance, there are probably fewer people who kno...
Great post! I'm looking forward to seeing future projects from Team Shard.
I'm curious why you frame channel 55 as being part of the agent's "cheese-seeking motivation," as opposed to simply encoding the agent's belief about where the cheese is. Unless I'm missing something, I'd expect the latter to be as or more likely - in that when you change the cheese's location, the thing that should straightforwardly change is the agent's model of the cheese's location.
Frames can be wrong, and using a wrong frame is costly, even if, or rather especially if, everyone agrees on the frame.
It seems to me that having a wrong shared frame when studying a problem might still be useful as long as it's not too wrong (as long as the way it divides the world up isn't too far away from the "real" lines), because the world is high-dimensional and having a frame makes thinking about it more tractable. And it can be useful to share this wrong-but-not-too-wrong frame with other people because then you and your colleagues can talk to each othe...
I agree that motivation should reduce to low-level, primitive things, and also that changing the agent's belief about where the cheese is lets you retarget behavior. However, I don't expect edits to beliefs to let you scalably control what the agent does, in that if it's smart enough and making sufficiently complicated plans you won't have a reliable mapping from (world model state) to (abstract class of behavior executed by the agent), where when I say "abstract class of behavior" I mean things like "put the red balls in the blue basket" or "pet all the c...