Hello everyone,
This is a new real-name account, purely for the discussion of AI.
A few months ago, I introduced a concept here on mechanistic interpretability.[1] The approach, backed by a PyTorch implementation, can derive insights from neurons without explicit training on the intermediate concepts. As an illustration, my model can identify strategic interactions between chess pieces despite being trained solely on win/loss/draw outcomes. One thing distinguishes it from recent work such as Anthropic's "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning"...
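
To make the general idea concrete, here is a minimal sketch of probing for a concept a network was never trained on. Everything in it is illustrative rather than my actual implementation: the toy value network, the board encoding, and the stand-in concept labels (e.g. "a piece is attacked") are assumptions. The point it demonstrates is that one can freeze a model trained only on game outcomes and then fit a linear probe on its hidden activations to test whether an intermediate concept is linearly decodable.

```python
import torch
import torch.nn as nn

# Hypothetical value network trained only on win/loss/draw outcomes.
# Names and shapes are illustrative, not the original implementation.
class ValueNet(nn.Module):
    def __init__(self, board_dim=773, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(board_dim, hidden_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim, 3)  # win / loss / draw logits

    def forward(self, board):
        h = self.encoder(board)  # hidden activations we will probe
        return self.head(h), h

model = ValueNet()

# Linear probe: tests whether a concept the net never saw as a label
# (e.g. "a piece is attacked") is decodable from its activations.
probe = nn.Linear(256, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

boards = torch.randn(512, 773)                           # stand-in board encodings
concept_labels = torch.randint(0, 2, (512, 1)).float()   # stand-in concept labels

with torch.no_grad():
    _, acts = model(boards)  # freeze the model; only the probe is trained

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(probe(acts), concept_labels)
    loss.backward()
    opt.step()
# High held-out probe accuracy would suggest the frozen network
# represents the concept despite never being trained on it.
```

Because the probe is linear and the model is frozen, good probe accuracy on held-out positions is evidence that the representation itself encodes the concept, not that the probe computed it from scratch.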