Decision Transformer Interpretability
TLDR: We analyse how a small Decision Transformer learns to simulate agents on a grid world task, providing evidence that it is possible to do circuit analysis on small models which simulate goal-directedness. We think Decision Transformers are worth exploring further and may provide opportunities to explore many alignment-relevant deep learning phenomena in game-like contexts.

Link to the GitHub Repository. Link to the Analysis App. I highly recommend using the app if you have experience with mechanistic interpretability. All of the mechanistic analysis should be reproducible via the app.

Key Claims

* A 1-Layer Decision Transformer learns several contextual behaviours which are activated by specific Reward-to-Go/Observation combinations on a simple discrete task.
* Some of these behaviours appear localisable to specific components and can be explained with simple attribution and the transformer circuits framework.
* The specific algorithm implemented is strongly affected by the lack of a one-hot-encoding scheme for the state/observations (initially left out for simplicity of analysis), which introduces inductive biases that hamper the model.

Illustrative code sketches of these three points appear at the end of this section.

If you are short on time, I recommend reading:

* Dynamic Obstacles Environment
* Black Box Model Characterisation
* Explaining Obstacle Avoidance at positive RTG using QK and OV circuits
* Alignment Relevance
* Future Directions

I would welcome assistance with:

* Engineering tasks: app development, improving the model, the training loop, the wandb dashboard, etc. (and help making nice diagrams and writing up the relevant maths/theory in the app).
* Research tasks: thinking more about exactly how to construct and interpret circuit analysis in the context of decision transformers, and translating ideas from LLMs/algorithmic tasks.
* Communication tasks: making nicer diagrams/explanations.
* I have a Trello board with a huge number of tasks ranging from small stuff to massive stuff. I'm also happy to collaborate.
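To make the first claim concrete, here is a minimal sketch (not the code from the repository; the module names, dimensions, and MiniGrid-style sizes are assumptions) of how a Decision Transformer interleaves Reward-to-Go, observation, and action embeddings into a single token sequence, which is what lets behaviour be conditioned on RTG/observation combinations.

```python
# Hedged sketch of Decision Transformer token construction.
# All names and sizes (d_model, n_obs_features, n_actions) are illustrative assumptions;
# timestep/positional embeddings are omitted for brevity.
import torch
import torch.nn as nn

d_model = 64          # residual stream width (assumed)
n_obs_features = 147  # e.g. a flattened 7x7x3 MiniGrid observation (assumed)
n_actions = 7         # MiniGrid action space size (assumed)

rtg_embed = nn.Linear(1, d_model)               # scalar Reward-to-Go -> embedding
obs_embed = nn.Linear(n_obs_features, d_model)  # observation -> embedding
act_embed = nn.Embedding(n_actions, d_model)    # discrete action -> embedding

def build_tokens(rtgs, obs, actions):
    """Interleave (RTG_t, s_t, a_t) triples into one token sequence.

    rtgs:    (batch, T, 1) float
    obs:     (batch, T, n_obs_features) float
    actions: (batch, T) long
    returns: (batch, 3*T, d_model)
    """
    r = rtg_embed(rtgs)
    s = obs_embed(obs)
    a = act_embed(actions)
    # Stack per timestep, then flatten so the order is R_1, s_1, a_1, R_2, s_2, a_2, ...
    tokens = torch.stack([r, s, a], dim=2)
    return tokens.flatten(1, 2)
```

With this interleaved layout, the action prediction read off at a state position can attend back to the RTG token, which is one natural way RTG-conditional behaviour could be implemented.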
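The second claim leans on the transformer circuits framework (Elhage et al., 2021), in which a one-layer model's attention heads can be summarised by their QK and OV circuits. The sketch below shows the standard construction; the weight names follow that framework, and the shapes (and the fact that the relevant embeddings and unembeddings here map observations/RTG to actions rather than a language vocabulary) are assumptions rather than the post's actual code.

```python
# Hedged sketch of QK/OV circuit computation for a single attention head,
# following the transformer circuits framework. Shapes are assumptions.
import torch

def qk_circuit(W_E, W_Q, W_K):
    """Effective attention score between pairs of input tokens.

    W_E: (d_vocab, d_model) embedding; W_Q, W_K: (d_model, d_head).
    Returns (d_vocab, d_vocab): how strongly each query token attends to each key token.
    """
    return W_E @ W_Q @ W_K.T @ W_E.T

def ov_circuit(W_E, W_V, W_O, W_U):
    """Effective contribution of an attended-to token to the output logits.

    W_V: (d_model, d_head); W_O: (d_head, d_model); W_U: (d_model, d_vocab).
    Returns (d_vocab, d_vocab): how attending to a token moves each output logit.
    """
    return W_E @ W_V @ W_O @ W_U
```

Inspecting these two matrices is what lets one ask, for example, which observation features a head attends to at positive RTG (QK) and which action logits that attention pushes up or down (OV).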
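The third claim is about the observation encoding. A hedged sketch of the difference: feeding raw integer object/colour/state indices (presumably what "lacking a one-hot encoding" amounts to) gives the model an arbitrary ordering over categories, whereas one-hot encoding each channel removes it. The 7x7x3 observation shape and channel sizes below are assumptions based on standard MiniGrid conventions, not taken from the post.

```python
# Hedged sketch: one-hot encoding a MiniGrid-style observation vs. raw indices.
# Channel sizes are assumptions (standard MiniGrid: 11 object types, 6 colours, 3 states).
import numpy as np

N_OBJECTS, N_COLOURS, N_STATES = 11, 6, 3

def one_hot_obs(obs):
    """obs: (7, 7, 3) integer array of (object, colour, state) indices.
    Returns a flat float vector with one one-hot block per channel per cell."""
    blocks = []
    for c, n in enumerate((N_OBJECTS, N_COLOURS, N_STATES)):
        blocks.append(np.eye(n)[obs[..., c]])  # (7, 7, n) one-hot slice
    return np.concatenate(blocks, axis=-1).reshape(-1)  # length 7*7*(11+6+3)

def raw_obs(obs):
    """Raw encoding: integer indices treated as magnitudes, so e.g. 'wall' and
    'ball' differ only by a scalar, an ordering the model has to learn to undo."""
    return obs.reshape(-1).astype(np.float32)  # length 7*7*3
```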