All of entropizer's Comments + Replies

entropizer

I agree with much of this, but I suspect people aren't sticking with activation-based interpretability only because the unfavorable dimensionality of weight-based interpretability is intimidating. Rather, I feel like we have to be thinking about activation-based interpretability if we want an analysis of the model's behavior to contain safety-relevant semantics.

For example, no matter how much I know about the weights of a classifier that distinguishes A from B, I can know nothing about how safe it is unless I know what A and B are. There might be identical...
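A toy sketch of that point (purely illustrative, with made-up weights and labels rather than anything from the post): the same weights can implement two "different" classifiers whose safety relevance depends entirely on what the output classes mean.

```python
import numpy as np

# Hypothetical toy example: one weight vector, two interpretations of its outputs.
# Nothing in the weights themselves says whether the positive class is benign or harmful.
weights = np.array([0.7, -1.2, 0.4])

def classify(x, labels):
    """Linear classifier: returns labels[0] if w·x >= 0, else labels[1]."""
    return labels[0] if weights @ x >= 0 else labels[1]

x = np.array([1.0, 0.2, -0.5])

# Identical weights, identical decision on this input, very different safety meaning:
print(classify(x, labels=("cat", "dog")))                   # prints "cat"
print(classify(x, labels=("grant access", "deny access")))  # prints "grant access"
```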

Thank you for the reply. You might be interested in neural Darwinism if you've never heard of it; the comment you linked in the edit made me think of it: https://en.wikipedia.org/wiki/Neural_Darwinism.

I don't have a good story for how reuse of subcomponents leads to cooperation across agents, but my gut says to look at reuse rather than voting or markets. Could be totally baseless though.

This is really interesting. My immediate gut reaction is that this wrongly treats different subagents as completely separate entities, when really they're all overlapping chimeras. For example, Subagent A might be composed of {C,D,E} while Subagent B is composed of {C,E,F}. Reuse of subcomponents seems to me like a more natural path to coordination than internal voting systems or prediction markets.
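A minimal sketch of that picture, with hypothetical subagent and subcomponent names: representing subagents as overlapping sets makes the shared subcomponents, i.e. the potential coordination channel, explicit.

```python
# Hypothetical illustration: subagents as overlapping sets of subcomponents.
subagent_a = {"C", "D", "E"}
subagent_b = {"C", "E", "F"}

# Shared subcomponents act as a built-in coordination channel: an update to a reused
# component is "felt" by every subagent that contains it, with no voting or market needed.
shared = subagent_a & subagent_b
print(shared)  # {'C', 'E'}
```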

What distinguishes subagents from shards? Are both getting at the same idea? https://www.alignmentforum.org/w/shard-theory 

Richard_Ngo
Thanks for the comment! A few replies:

I don't mean to imply that subagents are totally separate entities. At the very least they all can access many shared facts and experiences. And I don't think that reuse of subcomponents is mutually exclusive with the mechanisms I described. In fact, you could see my mechanisms as attempts to figure out which subcomponents are used for coordination. (E.g. if a bunch of subagents are voting/bargaining over which goal to pursue, probably the goal that they land on will be one that's pretty comprehensible to most of them.)

Re shards: there are a bunch of similarities. But it seemed to me that shard theory was focused on pretty simple subagents. E.g. from the original post: "Human values are ... sets of contextually activated heuristics"; and later "human values are implemented by contextually activated circuits which activate in situations downstream of past reinforcement so as to steer decision-making towards the objects of past reinforcement". Whereas I think of many human values as being constituted by subagents that are far too complex to be described in that way. In my view, many important subagents are sophisticated enough that basically any description you give of them would also have to be a description of a whole human (e.g. if you wouldn't describe a human as a "contextually activated circuit", then you shouldn't describe subagents that way).

This may just be a vibes difference; many roads lead to Rome. But the research directions I've laid out above are very distinct from the ones that shard theory people are working on.

EDIT: more on shards here.

Two additional senses in which a "right to be wrong" might be justified: individually, through differing risk preferences; societally, through the usefulness of holdout populations.

I don't think people should try to emulate heliocentrists, because I think that acting like they did would generally lead people to failure, not success. The lesson I take from this is that stubborn holdout populations who refuse to accept the obvious are important to the health of science as an ecosystem of ideas. But I don't think stubbornness should be seen as a general-purpose virtue. I think Aristotle and co. just experienced epistemic luck.