'Any ways to combine all the previously mentioned ideas (A Causal Explainable Guardrails for Large Language Models, Discussion: Challenges with Unsupervised LLM Knowledge Discovery and A Multimodal Automated Interpretability Agent (MAIA), 'Catching AIs red-handed', including paulfchristiano's comment) with ideas from 'Searching for a model's concepts by their shape – a theoretical framework' to 'catch AIs red-handed'?
Here are a few additional ideas for combining all the mentioned papers and comments to try catching deceptive models red-handed:
Extending my previous thoughts, I think one of the most useful things the "Searching for concepts by their shape" framing can contribute is mathematical precision in defining hypotheses about what characterizes deceptive cognition. If we can frame agency/deception in terms of precise relations between features (even if those features are hard to interpretably extract in practice), it gives us a crisper target to aim at with our testing, rather than just relying on intuitive notions of what "seems deceptive".
However, I continue to be quite pessimistic about the prospects for catching competent deceptive models before it's too late, for the reasons outlined previously. I worry that by the time a model is smart enough that we could meaningfully apply these combined interpretability strategies to it, it would also be smart enough to notice what we're doing and come up with countermeasures.
Unless interpretability tools far exceed my expectations in their ability to crack open the black box of neural nets (which would be great!), I suspect the most practical path is to use simpler, more behavioral detection strategies (like the model consistency testing from your Control paper) to avoid building models that are competent enough to be robustly deceptive in the first place. Catching a superintelligent deceptive model after the fact seems like a much harder challenge than noticing warning signs of misalignment in weaker models and heading them off early.
But I would still encourage pursuing all these ideas, even if they end up not being enough on their own to solve the problem. The interpretability insights gained would be valuable for alignment more broadly. And we may get lucky in extracting some key structural signature of deceptive cognition that even very sophisticated models can't fully mask. It's an arms race we can't afford to unilaterally forfeit.'
Continued:
'How could the above be combined with 'Catching AIs red-handed' and, in particular, the ideas mentioned in paulfchristiano's comment?
The approaches discussed in "A Multimodal Automated Interpretability Agent" (MAIA) and "Causal Explainable Guardrails for Large Language Models" (LLMGuardaril) could potentially be combined with the ideas in "Catching AIs red-handed" and paulfchristiano's comment to create a more comprehensive strategy for detecting and mitigating risks from deceptively aligned AI systems:
Significant open questions remain about the feasibility of scaling these approaches to be robust against highly capable AI systems. There are also important limitations to consider: for example, MAIA's causal analysis is unlikely to provide a fully conclusive "smoking gun" that an AI system is deceptive, as opposed to strong probabilistic evidence. And sufficiently capable systems may find ways to avoid exhibiting any detectable deceptive behavior under the kind of systematic experimentation enabled by MAIA.
Nonetheless, I believe the combination of automated interpretability agents, causally grounded analysis of AI reasoning, and the overall adversarial "red team" mindset advocated in "Catching AIs red-handed" represents a promising direction for building more robust monitoring and containment of potentially deceptive AI systems. A concerted effort to develop and integrate these approaches could significantly improve our odds of detecting and mitigating catastrophic deceptive alignment failures.'
Conversation with Claude Opus on A Causal Explainable Guardrails for Large Language Models, Discussion: Challenges with Unsupervised LLM Knowledge Discovery and A Multimodal Automated Interpretability Agent (MAIA). To me it seems surprisingly good at something like coming up with plausible alignment research follow-ups, which e.g. were highlighted here as an important part of the superalignment agenda.
Prompts bolded:
'Summarize 'Causal Explainable Guardrails for Large Language Models'. In particular, could this be useful to deal with some of the challenges mentioned in 'Discussion: Challenges with Unsupervised LLM Knowledge Discovery'?
The paper "A Causal Explainable Guardrails for Large Language Models" proposes LLMGuardaril, a framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. Key points:
Existing methods for steering LLMs towards desired attributes rely on unbiased representations and steering prompts, but the representations learned from pre-training can introduce semantic biases that negatively impact steering.
LLMGuardaril systematically identifies and blocks the confounding effects of biases, enabling extraction of unbiased steering representations. It also includes an explainable component providing insights into alignment between generated output and desired direction.
Experiments show LLMGuardaril can effectively steer LLMs towards desired attributes while mitigating biases.
Regarding the challenges mentioned in "Discussion: Challenges with Unsupervised LLM Knowledge Discovery":
One key challenge is distinguishing the model's own knowledge from knowledge of simulated entities. LLMGuardaril aims to disentangle the influence of semantic biases to obtain steering representations capturing desired attributes. This may help identify the model's own knowledge if those attributes are specific to the model itself.
Another challenge is the hypothesized existence of many features satisfying consistency properties beyond just the model's knowledge. LLMGuardaril's causal analysis and adversarial learning aim to block confounding semantic biases, potentially reducing the set of candidate knowledge-like features.
Empirically, contrast-consistent search seemed to find prominent arbitrary features rather than knowledge. LLMGuardaril's explainable component could provide insight into whether extracted features actually correspond to model knowledge.
So in summary, while not a complete solution, LLMGuardaril's approach of causally analyzing and debiasing semantic influences may help address some key challenges around identifying model knowledge in an unsupervised way. But significant open questions remain around definitively distinguishing model knowledge from simulated knowledge of other entities. Further research is needed to determine how much traction causal analysis provides on the hardest aspects of the ELK problem.'
'How might MAIA be useful, potentially combined with 'Causal Explainable Guardrails for Large Language Models', to deal with some of the challenges mentioned in 'Discussion: Challenges with Unsupervised LLM Knowledge Discovery'?
MAIA and the techniques from "Causal Explainable Guardrails for Large Language Models" (LLMGuardaril) could potentially be combined to help address some of the challenges mentioned in "Discussion: Challenges with Unsupervised LLM Knowledge Discovery":
However, significant challenges would likely need to be overcome to fruitfully combine these techniques. For example, MAIA is currently focused on interpreting components of vision models, while LLMGuardaril focuses on steering language models, so unifying them would require bridging this modality gap. More fundamentally, even with access to MAIA's automated experimentation and LLMGuardaril's causal analysis, rigorously identifying an LLM's "own" knowledge in a fully unsupervised way, as opposed to the knowledge of a specific simulated entity, remains an open challenge and likely requires further conceptual breakthroughs. Nonetheless, exploring the intersection of scalable interpretability agents like MAIA and causally grounded analysis techniques like LLMGuardaril seems like a promising direction for future work on the ELK problem.'
Turns out someone has already run a similar experiment (vector arithmetic in neural space, plus latent traversals) in a restricted domain (face processing) with another model (a GAN), and it seemed to work: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012058 https://github.com/neuralcodinglab/brain2gan/blob/main/figs_manuscript/Fig12.png https://openreview.net/pdf?id=hT1S68yza7
I also wonder how much interpretability LM agents might help here, e.g. since they could make it much cheaper to scale the 'search' to many different kinds of undesirable behaviors.
If this is true, then we should be able to achieve quite a high level of control and understanding of NNs solely by straightforward linear methods and interventions. This would mean that deep networks might end up being pretty understandable and controllable artefacts in the near future. At this moment, we simply have not yet found the right levers (or rather, lots of existing work does show this, but it hasn't really been normalized or applied at scale for alignment). Linear-ish network representations are a best-case scenario for both interpretability and control.
For a mechanistic, circuits-level understanding, there is still the problem of superposition of the linear representations. However, if the representations are indeed mostly linear, then once superposition is solved there seem to be few other obstacles to a complete mechanistic understanding of the network. Moreover, superposition is not even a problem for black-box linear methods for controlling and manipulating features, where the optimiser handles the superposition for you.
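As a toy illustration of the kind of black-box linear method meant here, the sketch below recovers a planted concept direction from synthetic activations using a difference of means, then steers along it. Difference of means is just one of the simplest such methods (it needs no optimiser at all) and is my choice for illustration, not something the argument above specifies. The data, the dimensions, and the function names `difference_of_means_direction` and `steer` are all assumptions; real residual-stream activations would be noisier and less cleanly linear.

```python
import numpy as np

def difference_of_means_direction(acts_with, acts_without):
    """Unit vector pointing from the 'without concept' mean to the 'with concept' mean."""
    diff = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(acts, direction, alpha):
    """Add alpha * direction to every activation vector (a linear intervention)."""
    return acts + alpha * direction

rng = np.random.default_rng(0)
d_model, n = 64, 2000

# Planted ground-truth concept direction (hypothetical; unknown to the method).
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)

# Synthetic stand-ins for activations on prompts without / with the concept.
acts_without = rng.normal(size=(n, d_model))
acts_with = rng.normal(size=(n, d_model)) + 3.0 * true_dir

est_dir = difference_of_means_direction(acts_with, acts_without)
cosine = float(est_dir @ true_dir)  # how well the planted direction was recovered

# Steering moves the mean projection onto est_dir by exactly alpha.
steered = steer(acts_without, est_dir, alpha=3.0)
shift = float((steered @ est_dir).mean() - (acts_without @ est_dir).mean())
```

Note that the method never has to untangle which neurons carry the concept: it only needs labelled examples and a mean, which is the sense in which superposition is not an obstacle for this kind of intervention.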
Here's a potential operationalization / formalization of why assuming the linear representation hypothesis seems to imply that finding and using the directions might be easy-ish (and significantly easier than full reverse-engineering / enumerative interp). From Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models (with apologies for the poor formatting):
'We focus on the goal of learning identifiable human-interpretable concepts from complex high-dimensional data. Specifically, we build a theory of what concepts mean for complex high-dimensional data and then study under what conditions such concepts are identifiable, i.e., when can they be unambiguously recovered from data. To formally define concepts, we leverage extensive empirical evidence in the foundation model literature that surprisingly shows that, across multiple domains, human-interpretable concepts are often linearly encoded in the latent space of such models (see Section 2), e.g., the sentiment of a sentence is linearly represented in the activation space of large language models [96]. Motivated by this rich empirical literature, we formally define concepts as affine subspaces of some underlying representation space. Then we connect it to causal representation learning by proving strong identifiability theorems for only desired concepts rather than all possible concepts present in the true generative model. Therefore, in this work we tread the fine line between the rigorous principles of causal representation learning and the empirical capabilities of foundation models, effectively showing how causal representation learning ideas can be applied to foundation models.
Let us be more concrete. For observed data $X$ that has an underlying representation $Z_u$ with $X = f_u(Z_u)$ for an arbitrary distribution on $Z_u$ and a (potentially complicated) nonlinear underlying mixing map $f_u$, we define concepts as affine subspaces $A Z_u = b$ of the latent space of $Z_u$s, i.e., all observations falling under a concept satisfy an equation of this form. Since concepts are not precise and can be fuzzy or continuous, we will allow for some noise in this formulation by working with the notion of concept conditional distributions (Definition 3). Of course, in general, $f_u$ and $Z_u$ are very high-dimensional and complex, as they can be used to represent arbitrary concepts. Instead of ambitiously attempting to reconstruct $f_u$ and $Z_u$ as CRL [causal representation learning] would do, we go for a more relaxed notion where we attempt to learn a minimal representation that represents only the subset of concepts we care about; i.e., a simpler decoder $f$ and representation $Z$, different from $f_u$ and $Z_u$, such that $Z$ linearly captures a subset of relevant concepts as well as a valid representation $X = f(Z)$. With this novel formulation, we formally prove that concept learning is identifiable up to simple linear transformations (the linear transformation ambiguity is unavoidable and ubiquitous in CRL). This relaxes the goals of CRL to only learn relevant representations and not necessarily learn the full underlying model. It further suggests that foundation models do in essence learn such relaxed representations, partially explaining their superior performance for various downstream tasks.
Apart from the above conceptual contribution, we also show that to learn $n$ (atomic) concepts, we only require $n + 2$ environments under mild assumptions. Contrast this with the adage in CRL [41, 11] where we require $\dim(Z_u)$ environments for most identifiability guarantees, where as described above we typically have $\dim(Z_u) \gg n + 2$.'
'The punchline is that when we have rich datasets, i.e., sufficiently rich concept conditional datasets, then we can recover the concepts. Importantly, we only require a number of datasets that depends only on the number of atoms $n$ we wish to learn (in fact, $O(n)$ datasets), and not on the underlying latent dimension $d_z$ of the true generative process. This is a significant departure from most works on causal representation learning, since the true underlying generative process could have $d_z = 1000$, say, whereas we may be interested to learn only $n = 5$ concepts, say. In this case, causal representation learning necessitates at least $\sim 1000$ datasets, whereas we show that $\sim n + 2 = 7$ datasets are enough if we only want to learn the $n$ atomic concepts.'
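To make the "concepts as affine subspaces" definition a bit more tangible, here is a small numerical sketch (mine, not from the paper). It samples latent points satisfying a concept constraint of the form A z = b, then re-expresses the latents through an arbitrary invertible linear map M and checks that the same points still lie on an affine subspace, now with A replaced by A M⁻¹. This is only meant to illustrate why a linear-transformation ambiguity is harmless for this notion of concept; the dimensions, the random A, b and M, and the helper names are all hypothetical, and the sketch ignores the noise and concept-conditional distributions that the paper works with.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, n_points = 10, 5

# Hypothetical concept: the affine subspace {z : A z = b} (A assumed full row rank).
A = rng.normal(size=(2, d_z))
b = rng.normal(size=2)

def sample_on_concept(A, b, n, rng):
    """Sample points with A z = b: a particular solution plus a null-space component."""
    z0 = np.linalg.lstsq(A, b, rcond=None)[0]   # minimum-norm particular solution
    _, _, vt = np.linalg.svd(A)                 # full SVD: vt is (d_z, d_z)
    null_basis = vt[A.shape[0]:]                # rows spanning the null space of A
    coeffs = rng.normal(size=(n, null_basis.shape[0]))
    return z0 + coeffs @ null_basis

def satisfies_concept(Z, A, b, tol=1e-8):
    """Row-wise check that each latent vector in Z satisfies A z = b within tol."""
    return np.all(np.abs(Z @ A.T - b) < tol, axis=1)

Z = sample_on_concept(A, b, n_points, rng)

# Re-express the latents through an invertible linear map M (z' = M z).
M = rng.normal(size=(d_z, d_z))
Z_new = Z @ M.T
A_new = A @ np.linalg.inv(M)   # the same concept, written in the new coordinates

on_concept_original = satisfies_concept(Z, A, b)
on_concept_transformed = satisfies_concept(Z_new, A_new, b, tol=1e-6)
```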
More reasons to think something like the above should work: High-resolution image reconstruction with latent diffusion models from human brain activity literally steers diffusion models using linearly-decoded fMRI signals (see fig. 2); and linear encoding (the inverse of decoding) from the text latents to fMRI also works well (see fig. 6; and similar results in Natural language supervision with a large and diverse dataset builds better models of human high-level visual cortex, e.g. fig. 2). Furthermore, they use the same model (Stable Diffusion with CLIP) used in Concept Algebra for (Score-Based) Text-Controlled Generative Models, which provides both theory and empirical demonstrations of activation-engineering-style linear manipulations. All this suggests that manipulations like those in Concept Algebra for (Score-Based) Text-Controlled Generative Models would also work when applied directly to the fMRI representations used to decode the text latents c in High-resolution image reconstruction with latent diffusion models from human brain activity.
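For readers less familiar with what "linearly decoded" means operationally, here is a minimal sketch of linear decoding with closed-form ridge regression on synthetic data. It is not the pipeline from either paper: the "voxels" and "latents" are random stand-ins, the linear ground truth `W_true` is planted by construction, and the sizes and regularisation strength are arbitrary assumptions. It only shows the shape of the computation (fit one linear map from brain-like signals to model latents, then evaluate on held-out samples).

```python
import numpy as np

def fit_ridge(X, Y, lam):
    """Closed-form ridge regression: W = (X^T X + lam * I)^{-1} X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ Y)

rng = np.random.default_rng(0)
n_train, n_test, n_voxels, d_latent = 500, 100, 50, 8

# Planted linear relationship between synthetic "voxels" and "text latents".
W_true = rng.normal(size=(n_voxels, d_latent))
X_train = rng.normal(size=(n_train, n_voxels))
X_test = rng.normal(size=(n_test, n_voxels))
Y_train = X_train @ W_true + 0.1 * rng.normal(size=(n_train, d_latent))
Y_test = X_test @ W_true + 0.1 * rng.normal(size=(n_test, d_latent))

# Decode: a single linear map from voxel space to latent space.
W_hat = fit_ridge(X_train, Y_train, lam=1.0)
Y_pred = X_test @ W_hat

# Held-out Pearson correlation per latent dimension.
corrs = [np.corrcoef(Y_pred[:, j], Y_test[:, j])[0, 1] for j in range(d_latent)]
mean_corr = float(np.mean(corrs))
```

Because the decoder is a single matrix, any linear edit one would apply to the decoded latents has a corresponding linear form in terms of the inputs, which is part of why the speculation above about applying concept-algebra-style manipulations at the fMRI level seems at least plausible.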
For the pretraining-finetuning paradigm, this link is now made much more explicit in Cross-Task Linearity Emerges in the Pretraining-Finetuning Paradigm, which also links it to model ensembling through logit averaging.
We can implement this as an inference-time intervention: every time a component $c$ (e.g. an attention head) writes its output $c_{\text{out}}$ to the residual stream, we can erase its contribution to the "refusal direction" $\hat{r}$. We can do this by computing the projection of $c_{\text{out}}$ onto $\hat{r}$, and then subtracting this projection away:
$$c'_{\text{out}} \leftarrow c_{\text{out}} - (c_{\text{out}} \cdot \hat{r})\,\hat{r}$$
Note that we are ablating the same direction at every token and every layer. By performing this ablation at every component that writes the residual stream, we effectively prevent the model from ever representing this feature.
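The projection-and-subtract step described above can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not the quoted authors' implementation: the "residual stream" is just a running sum of random component outputs starting from zero, the direction is random rather than extracted from a model, and `ablate_direction` is a hypothetical helper name. In a real transformer the same function would be applied via hooks on each attention and MLP output.

```python
import numpy as np

def ablate_direction(component_out, direction):
    """Subtract from each output vector its projection onto `direction`."""
    r_hat = direction / np.linalg.norm(direction)
    projection = (component_out @ r_hat)[..., None] * r_hat
    return component_out - projection

rng = np.random.default_rng(0)
n_components, n_tokens, d_model = 4, 6, 32

# Hypothetical stand-in for the direction to erase (same at every token and layer).
refusal_dir = rng.normal(size=d_model)

# Toy residual stream: the sum of everything the components write to it.
residual = np.zeros((n_tokens, d_model))
for _ in range(n_components):
    component_out = rng.normal(size=(n_tokens, d_model))
    residual += ablate_direction(component_out, refusal_dir)

# Because every write was ablated, the stream has no component along the direction.
r_hat = refusal_dir / np.linalg.norm(refusal_dir)
max_overlap = float(np.abs(residual @ r_hat).max())
```

Since projection is linear, ablating each write separately is equivalent to ablating their sum, which is why intervening at every component is enough to keep the direction out of the residual stream in this toy.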
I'll note that to me this seems surprisingly spiritually similar to lines 7-8 from Algorithm 1 (at page 13) from Concept Algebra for (Score-Based) Text-Controlled Generative Models, where they 'project out' a direction corresponding to a semantic concept after each diffusion step (in a diffusion model).
This seems notable because the above paper proposes a theory for why linear representations might emerge in diffusion models and the authors seem interested in potentially connecting their findings to representations in transformers (especially in the residual stream). From a response to a review:
Application to Other Generative Models Ultimately, the results in the paper are about non-parametric representations (indeed, the results are about the structure of probability distributions directly!) The importance of diffusion models is that they non-parametrically model the conditional distribution, so that the score representation directly inherits the properties of the distribution.
To apply the results to other generative models, we must articulate the connection between the natural representations of these models (e.g., the residual stream in transformers) and the (estimated) conditional distributions. For autoregressive models like Parti, it’s not immediately clear how to do this. This is an exciting and important direction for future work!
(Very speculatively: models with finite dimensional representations are often trained with objective functions corresponding to log likelihoods of exponential family probability models, such that the natural finite dimensional representation corresponds to the natural parameter of the exponential family model. In exponential family models, the Stein score is exactly the inner product of the natural parameter with $y$. This weakly suggests that additive subspace structure may originate in these models following the same Stein score representation arguments!)
Connection to Interpretability This is a great question! Indeed, a major motivation for starting this line of work is to try to understand if the "linear subspace hypothesis" in mechanistic interpretability of transformers is true, and why it arises if so. As just discussed, the missing step for precisely connecting our results to this line of work is articulating how the finite dimensional transformer representation (the residual stream) relates to the log probability of the conditional distributions. Solving this missing step would presumably allow the tool set developed here to be brought to bear on the interpretation of transformers.
One exciting observation here is that linear subspace structure appears to be a generic feature of probability distributions! Much mechanistic interpretability work motivates the linear subspace hypothesis by appealing to special structure of the transformer architecture (e.g., this is Anthropic's usual explanation). In contrast, our results suggest that linear encoding may fundamentally be about the structure of the data generating process.
Strongly agree; I've been thinking for a while that something like a public-private partnership involving at least the US government and the top US AI labs might be a better way to go about this. Unfortunately, recent events seem consistent with it not being ideal to rely only on labs for AI safety research, and the potential scalability of automating that research should make government involvement even more promising. [Strongly] oversimplified, the labs could provide a lot of the in-house expertise, and the government could provide the incentives, public legitimacy (related: I think of a solution to aligning superintelligence as a public good) and significant financial resources.