TL;DR:

In our recent work with Professor Michael Graziano (arXiv, thread), we show that adding an auxiliary self-modeling objective to supervised learning tasks yields simpler, more regularized, and more parameter-efficient models.

Across three classification tasks and two modalities, self-modeling consistently reduced complexity (lower RLCT, narrower weight distribution). This restructuring effect may help explain the putative benefits of self-models in both ML and biological systems.

Agents that self-model may be reparameterized to better predict themselves, better predict others, and be more easily predicted by others. Accordingly, we believe that further exploring the effects of self-modeling on cooperation is a promising and neglected approach to alignment. This approach may also carry a 'negative alignment tax' to the degree that it ends up both enhancing alignment and rendering systems more effective overall.

Introduction 

In this post, we discuss some of the core findings and implications of our recent paper, Unexpected Benefits of Self-Modeling in Neural Systems.

This work represents further progress in our exploration of neglected approaches to alignment—in this case, implementing a promising hypothesis from the neuroscience of attention and cooperation in an ML context and observing its effects on model behavior.

The specific model we operationalize here is Attention Schema Theory (AST), a mechanistic account of consciousness which posits that, just as the brain maintains a schema of the body, it also maintains an internal model of its own attention processes. Critically, this self-model is hypothesized to help enable metacognition and social cognition, which motivates us to pursue it as an alignment research direction.

We are very interested in better understanding the relationship between alignment, consciousness, and prosociality, and we see AST as a concrete method for exploring these intersections further. It is worth noting at the outset that we are also concerned about AI moral patienthood and are attempting to balance the potential prosociality benefits of embedding consciousness-like processes into ML training against the potential s- and x-risks of building AI systems that are plausibly conscious. Overall, we still believe that we need more AI consciousness research, and that this investigation is one example of that broader direction.

Before meaningfully pursuing or publicizing this research agenda, we spent a lot of time discussing this concern with others, and we concluded that proactively studying consciousness in smaller models is preferable, on balance, to allowing it to emerge unpredictably in larger systems, where it would likely pose greater x-risks and raise more complex moral patienthood concerns.

Accordingly, we ask in this experiment: what happens if we implement AST-inspired self-modeling for neural networks by incentivizing them to model their own internal states during the training process?

Implementing self-modeling across diverse classification tasks

We designed classification experiments with an auxiliary self-modeling objective across three simple datasets:

1. MNIST, the classic handwritten digit recognition task.

2. CIFAR-10, a more complex object classification task.

3. IMDB, a natural language sentiment analysis task.

For each task, we compared baseline neural networks to variants that included the auxiliary self-modeling task, which we operationalized as predicting their own internal activations. This approach allowed us to explore how self-modeling affected networks across different architectures and modalities. We also tracked how the relative weight of the self-modeling task affected overall performance and model complexity.

Self-modeling is achieved by selecting a target layer of activations (red arrow) and outputting predictions for these activations along with class probabilities. An auxiliary loss term measures the accuracy of the activation predictions, refining the model's internal self-representation.
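As a concrete illustration, here is a minimal PyTorch sketch of this kind of setup: a classifier with an auxiliary head that predicts the activations of a chosen hidden layer, trained with a weighted sum of the task loss and a self-modeling MSE loss. The architecture, layer sizes, and variable names are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfModelingMLP(nn.Module):
    """Classifier with an auxiliary head that predicts a hidden layer's own activations.

    Illustrative sketch only; sizes and wiring are assumptions, not the paper's exact setup.
    """
    def __init__(self, in_dim=784, hidden_dim=64, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, n_classes)
        # Auxiliary head: predict the target layer's activations (the "red arrow" layer).
        self.self_model = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        h = self.backbone(x)          # target layer activations
        logits = self.classifier(h)   # primary task output (class probabilities via softmax)
        h_pred = self.self_model(h)   # self-model prediction of those activations
        return logits, h_pred, h

def combined_loss(logits, h_pred, h, labels, sm_weight=0.5):
    """Primary cross-entropy plus a weighted self-modeling MSE term.

    Gradients flow into `h` as well as `h_pred`, so the network can reshape its own
    activations to be easier to predict (a modeling choice assumed in this sketch).
    """
    task_loss = F.cross_entropy(logits, labels)
    sm_loss = F.mse_loss(h_pred, h)
    return task_loss + sm_weight * sm_loss
```

In a training step this would look like `logits, h_pred, h = model(x.flatten(1)); loss = combined_loss(logits, h_pred, h, y, sm_weight=0.5)`, with `sm_weight` controlling the relative importance of the self-modeling task.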

How we measured network complexity

We hypothesized that by predicting internal activations, the network could learn to make those activations easier to predict and therefore potentially less complex. Accordingly, we measured network complexity (with and without self-modeling) using the following methods:

  1. The spread (standard deviation) of the distribution of weights in the final layer, which gives us insight into how the network is using its parameters (a minimal measurement sketch appears after this list).
    1. A narrower distribution suggests that more weights are close to zero, indicating a sparser, potentially simpler network. Inducing smaller weight magnitudes is a typical regularization technique used to make ML systems generalize better.
    2. We reasoned that if self-modeling produces this same outcome, we can conclude that it provides a similar simplification benefit.
  2. The Real Log Canonical Threshold (RLCT), also known as the Local Learning Coefficient.
    1. This is a well-founded measure derived from algebraic geometry whose relevance to deep learning comes through Singular Learning Theory.
    2. It provides a measure of the learned model's degeneracy, affording a more nuanced view of model complexity.
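To make the first measure concrete, here is a minimal sketch of how the final-layer weight spread could be computed for a model like the one sketched above; the module name `classifier` is an assumption carried over from that sketch. RLCT/local-learning-coefficient estimation is considerably more involved (typically an SGLD-style sampling procedure around the trained optimum) and is not sketched here.

```python
import torch

@torch.no_grad()
def final_layer_weight_stats(model):
    """Spread of the final-layer weight distribution (measure 1 above).

    Assumes the classification head is the module named `classifier`,
    as in the earlier sketch (an assumption, not the paper's code).
    """
    w = model.classifier.weight.flatten()
    return {
        "std": w.std().item(),             # narrower distribution -> more weights near zero
        "mean_abs": w.abs().mean().item()  # complementary magnitude summary
    }
```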

Key result

Across three different classification tasks, we consistently found that adding a self-modeling mechanism to neural networks significantly reduced their complexity (see the paper for additional details). This reduction was observed through both measures described above, which suggests that the auxiliary self-modeling task is fundamentally simplifying the networks.

Importantly, we find that the reduction in complexity scales with the weight placed on the auxiliary self-modeling task without significantly affecting task performance.

The figure illustrates the simplification effect on MLPs with varying hidden layer sizes, trained on the MNIST dataset. Overall, enabling self-modeling reduces complexity measures, with more pronounced improvements as the self-modeling weight in the loss function increases.
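For intuition about how such a result could be produced, here is a hypothetical sweep over the self-modeling weight, reusing the sketches above; `train` is an assumed helper standing in for an ordinary supervised training loop, and the weight grid is illustrative rather than the paper's.

```python
# Hypothetical sweep: how does the weight-SD measure change as the
# self-modeling term gets more weight in the loss?
for sm_weight in [0.0, 0.1, 0.5, 1.0, 2.0]:    # illustrative grid
    model = SelfModelingMLP(hidden_dim=64)
    train(model, sm_weight=sm_weight)          # assumed training helper (not shown)
    stats = final_layer_weight_stats(model)
    print(f"sm_weight={sm_weight}: final-layer weight std={stats['std']:.4f}")
```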

Relevance of self-modeling to alignment

We propose that networks that learn to self-model not only become better at predicting their own states but also more amenable to being modeled by other agents (including human overseers and other AI agents), which could significantly enhance cooperation and coordination.

Additionally, the observed reduction in network complexity can also be viewed as a form of self-regularization.[1] For alignment, this implies that encouraging AI systems to model themselves might naturally lead to more robust and generalizable models, potentially reducing unintended behaviors or spurious effects. Intuitively, simpler models may also become more interpretable.

Adding an auxiliary self-modeling objective to models has low interpretability requirements, does not seem to negatively impact model capabilities, and is highly scalable in theory. If the desirable properties of self-modeling that we uncovered in this experiment scale to more complex models, this method could prove to be a promising piece of the hodge-podge alignment puzzle.

Challenges, considerations, and next steps

In our exploration of self-modeling in neural systems, we've encountered several challenges and potential weaknesses that warrant careful consideration.

  1. Although we believe that self-modeling primarily addresses alignment issues, we're acutely aware of the possibility that it could inadvertently boost an AI system's capabilities in ways that might pose risks (e.g., heightened self-awareness leading the system to more competently manipulate users).
  2. We are attempting to implement AST here, which is a mechanistic theory of consciousness. If this self-modeling implementation has the potential to cause a sufficiently complex model to have some level of consciousness, there are critical moral patienthood concerns here that would need to be addressed. If consciousness leads to both heightened prosociality and moral patienthood, a more rigorous framework for weighing these costs and benefits would be required.
  3. Given that our experiments employ small networks on simple classification tasks, the extent to which this method scales to larger models, more complex contexts, and more agentic set-ups (e.g., multi-agent RL) is not yet clear.

These early experiments with self-modeling neural networks have revealed a promising avenue for developing more cooperative and predictable models: the networks gain both improved internal structure and increased external predictability. This dual benefit could be a significant step towards creating AI systems that are naturally more cooperative.

We are excited to continue pushing this neglected approach forward and are highly open to feedback and ideas from the wider community about what additional related directions would be most promising to pursue in this space.

Appendix: Interpreting Experimental Outcomes

After releasing our paper, we received a positive response from the community, along with several insightful questions and ideas. We thought it would be valuable to address some of these points here.

Does the Network Simply Learn the Identity Function?

One common question is whether the self-modeling task, which involves predicting a layer's own activations, would cause the network to merely learn the identity function. Intuitively, this might seem like an optimal outcome for minimizing the self-modeling loss.

Our findings show that this is not the case. While the linear layer does tend towards the identity transformation, it typically converges to a function that is strictly non-identity, balancing both the primary task and the self-modeling objective. This behavior is supported by our observation that the difference between the weight matrix $W$ and the identity matrix $I$, denoted $W - I$, does not approach zero over training.

We observed similar effects when applying self-modeling to earlier layers, where the benefits persisted, further supporting the unique regularization introduced by self-modeling.

These results imply that self-modeling encourages the network to adjust its internal representations in a way that balances self-prediction with task performance, leading to reduced complexity without resorting to a trivial solution.
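One simple way to check this empirically, for a model wired like the sketch above, is to track how far the self-modeling weights are from the identity during training; the module name `self_model` is an assumption carried over from that sketch.

```python
import torch

@torch.no_grad()
def deviation_from_identity(model):
    """Frobenius norm of W - I for the (square) self-modeling head.

    If this collapsed to ~0, the head would have learned the identity map;
    per the post, it stays clearly non-zero throughout training.
    Assumes the self-model weights live in `model.self_model` (an assumption).
    """
    W = model.self_model.weight
    I = torch.eye(W.shape[0], device=W.device, dtype=W.dtype)
    return torch.linalg.matrix_norm(W - I, ord="fro").item()
```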

How Does Self-Modeling Differ from Traditional Regularization?

Another point of discussion centered around the nature of self-modeling as a regularizer. While techniques like weight decay are well-understood forms of regularization that encourage simpler models, self-modeling introduces a different kind of constraint.

Self-modeling appears to be qualitatively different from traditional regularization methods like weight decay. To better understand this, we can derive the gradients for the self-modeling loss and compare them to those of weight decay.

Consider a linear layer with weight matrix $W$, bias $b$, and input activations $a$, whose output $\hat{a} = Wa + b$ is trained to predict $a$ itself. The self-modeling loss using Mean Squared Error (MSE) is given by:

$$L_{\text{SM}} = \lVert Wa + b - a \rVert^2$$

Assuming zero bias ($b = 0$) for simplicity, the gradient with respect to $W$ is:

$$\nabla_W L_{\text{SM}} = 2\,(Wa - a)\,a^\top$$

Let $\Delta = W - I$ represent the deviation from the identity matrix. Substituting, we get:

$$\nabla_W L_{\text{SM}} = 2\,\Delta\, a a^\top$$

This gradient depends on both $\Delta$ and the covariance of the activations $a a^\top$, introducing a data-dependent adjustment.

In contrast, the weight decay (or $L_2$ regularization) loss is:

$$L_{\text{WD}} = \lambda \lVert W \rVert_F^2$$

The gradient with respect to $W$ is:

$$\nabla_W L_{\text{WD}} = 2\lambda W$$

This gradient scales the weights directly, independent of the data.

Unlike weight decay, the self-modeling gradient is influenced by the input data through $a a^\top$. This means that the regularization effect of self-modeling adapts based on the structure and distribution of the input data.
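As a sanity check on this derivation, here is a small sketch that compares the analytic self-modeling gradient $2\,(W - I)\,a a^\top$ against PyTorch autograd, and contrasts it with the data-independent weight-decay gradient; the dimensions and values are arbitrary.

```python
import torch

torch.manual_seed(0)
n = 5
W = torch.randn(n, n, requires_grad=True)
a = torch.randn(n)

# Self-modeling MSE with zero bias, summed over components: ||W a - a||^2
loss_sm = ((W @ a - a) ** 2).sum()
loss_sm.backward()

analytic = 2 * (W.detach() - torch.eye(n)) @ torch.outer(a, a)  # 2 (W - I) a a^T
print(torch.allclose(W.grad, analytic, atol=1e-5))              # expected: True

# Weight decay for contrast: grad of lambda * ||W||_F^2 is 2 * lambda * W (no data dependence)
lam = 0.1
W2 = W.detach().clone().requires_grad_(True)
loss_wd = lam * (W2 ** 2).sum()
loss_wd.backward()
print(torch.allclose(W2.grad, 2 * lam * W2.detach()))           # expected: True
```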

  1. ^

    See appendix for more details.

Comments

As a toy-model point of comparison, here’s one thing that could hypothetically happen during “self-modeling” of the activations of layer L: (1) the model always guesses that the activations of layer L are all 0; (2) gradient descent sculpts the model to have very small activations in layer L.

In this scenario, it’s not really “self-modeling” at all, but rather a roundabout way to implement “activation regularization” specifically targeted to layer L.

In “activation regularization”, the auxiliary loss term is just $\lVert a_L \rVert^2$, whereas in your study it's $\lVert a_L - \hat{a}_L \rVert^2$ (where $a_L$ is the layer L activation vector and $\hat{a}_L$ is the self-modeling guess vector). So activation regularization might be a better point of comparison than the weight regularization that you brought up in the appendix. E.g. activation regularization does have the property that it “adapts based on the structure and distribution of the input data”.

I’d be curious whether you get similar “network complexity” (SD & RLCT) results with plain old activation regularization. That might be helpful for disentangling the activation regularization from bona fide self-modeling.

(I haven’t really thought through the details. Is there batch norm? If so, how does that interact with what I wrote? Also, in my example at the top, I could have said “the model always guesses that the activations are some fixed vector V” instead of “…that the activations are all 0”. Does that make any difference? I dunno.)

Sorry if this is all stupid, or in the paper somewhere.

The comparison to activation regularization is quite interesting. When we write down the self-modeling loss $\lVert \hat{a} - a \rVert^2$ in terms of the self-modeling layer (with weight matrix $W$ and, for simplicity, zero bias), we get $\lVert (W - I)\,a \rVert^2$.

This does resemble activation regularization, with the strength of regularization attenuated by how far the weight matrix is from identity (the magnitude of $W - I$). However, due to the recurrent nature of this loss (updates to the weight matrix depend on activations that are themselves being updated by the loss), the resulting dynamics are more complex in practice. Looking at the gradient $\nabla_W L_{\text{SM}} = 2\,(W - I)\,a a^\top$, we see that self-modeling depends on the full covariance structure of the activations, not just pushing them toward zero or any fixed vector. The network must learn to actively predict its own evolving activation patterns rather than simply constraining their magnitude.

Comparing the complexity measures (SD & RLCT) between self-modeling and activation regularization is a great idea, and we will definitely add this to the roadmap and report back. And no batch norm or other forms of regularization were used.

One common question is whether the self-modeling task, which involves predicting a layer's own activations, would cause the network to merely learn the identity function. Intuitively, this might seem like an optimal outcome for minimizing the self-modeling loss.

I found this section confusing. If the identity function is the global optimum for self-modeling loss, isn’t it kinda surprising that training doesn’t converge to the identity function? Or does the identity function make it worse at the primary task? If so, why?

[I’m sure this is going to be wrong in some embarrassing way, but what the heck… What I’m imagining right now is as follows. There’s an N×1 activation vector in the second-to-last layer of the DNN, and then a M×N weight matrix constituting the linear transformation, and you multiply them to get a M×1 output layer of the DNN. The first (M–N) entries of that output layer are the “primary task” outputs, and the bottom N entries are the “self-modeling” outputs, which are compared to the earlier N×1 activation vector mentioned above. And when you’re talking about “identity matrix”, you actually mean that the bottom N×N block of the weight matrix is close to an identity matrix but evidently not quite. (Oops I’m leaving out the bias vector, oh well.) If I’m right so far, then it wouldn’t be the case that the identity matrix makes the thing worse at the primary task, because the top (M-N)×N block of the weight matrix can still be anything. Where am I going wrong?]

Thanks for this! Consider the self-modeling loss gradient: $\nabla_W L_{\text{SM}} = 2\,(W - I)\,a a^\top$. While the identity function would globally minimize the self-modeling loss with zero loss for all inputs (effectively eliminating the task's influence by zeroing out its gradients), SGD finds local optima rather than global optima, and the gradients don't point directly toward the identity solution. The gradient depends on both the deviation from identity ($W - I$) and the activation covariance ($a a^\top$), with the network balancing this against the primary task loss. Since the self-modeling prediction isn't just a separate output block (it's predicting the full activation pattern), the interaction between the primary task loss, the activation covariance structure ($a a^\top$), and the need to maintain useful representations creates a complex optimization landscape where local optima dominate. We see this empirically in the consistently non-zero $W - I$ difference during training.