Acknowledgements: Many thanks to Milan Cvitkovic and Collin Burns for their help in workshopping this project proposal. Many of the ideas presented here regarding the use of neuroscience and neurotechnology were originally proposed by Milan here. This post represents my current thinking on the topic and does not necessarily represent the views of the individuals I have just mentioned.

tl;dr: Methods for promoting honest AI hinge on our ability to identify beliefs/knowledge in models. These concepts are derived from human cognition. In order to use them in AI, we need to provide definitions at the appropriate level of resolution. In this post, I describe why I think it would be a useful starting point to recover truth-like latent features in human brain data with contrast-consistent search (CCS). In an upcoming post, I will outline why I think this is possible with current fMRI technology.

Summary

Methods for discovering latent knowledge in language models, such as contrast-consistent search (CCS), hold the potential to enable humans to directly infer a model’s “beliefs” from its internal activations. This can be useful for promoting honesty in AI systems by ensuring that their outputs are consistent with their “beliefs”. 

However, it is currently unclear how the truth-like features recovered by CCS relate to the concepts of belief and knowledge, whose definitions are derived from human intelligence. Since these concepts underlie behaviors like deception and honesty, it is essential that we understand them at the appropriate level of resolution so that we can translate them to AI systems.

A starting point could be to determine whether CCS identifies analogous features in human brains. My proposal is to use ultra-high-field functional magnetic resonance imaging (fMRI) to replicate discovering latent knowledge experiments in humans. This approach not only promises to shed light on the representations identified by CCS but also offers a crucial testbed for evaluating techniques aimed at detecting deception in AI.

About this post

I am posting a proposal to perform a human version of "Discovering Latent Knowledge in Language Models" in order to elicit feedback and help clarify my thinking on the motivation and utility of the project. I have separated my proposal into two posts. In this post, I’ll cover why I think it is hard to define honesty and deception in AI, and why I think looking at human brain function can 1) help us arrive at better definitions of cognitive concepts and 2) serve as a testbed for our approaches for lie detection in AI. In the next post, I will outline a specific project proposal to test CCS on human brain data from ultra-high-field functional magnetic resonance imaging (fMRI). Readers who are interested in understanding the justification behind my proposal would benefit most from reading this post. Readers who are interested in the details of the specific project should also have a look at the second (upcoming) post.

How Studying Human Intelligence Could Help Promote Honest AI

To start, I would like to outline a couple reasons why neuroscience could play an important role in AI safety, especially when our aim is to foster honest AI. Much of the following overlaps quite a bit with ideas presented elsewhere in the AI safety community (specifically Milan Cvitkovic’s post on the importance of neurotechnology), but I think that repackaging it here provides some clarity for the general motivation of this project.

1) We need better definitions of cognitive concepts to use them in AI alignment research

I think we risk inappropriately anthropomorphizing AI when we use cognitive concepts to describe their behaviors. Deception provides a clear example. If we want to determine if an AI is lying, we first need to define what lying is. Central to the concepts of deception and lying are the concepts of belief and knowledge. Determining whether a system has beliefs or knowledge cannot rely only on measurements of external behavior. Instead, it requires evaluating the internal states of the system. Otherwise, we risk misunderstanding the AI by inappropriately assigning it human cognitive states.

Cognitive science theories currently lack the specificity necessary to determine if AI models possess beliefs or knowledge analogous to humans (although there is significant work on neurocognitive theories in this direction [1,2]). Doing so will require describing these cognitive phenomena at the algorithmic level [3][1], so that we can identify their signatures across systems that may be vastly different from each other at the implementation level. Because any definition for these concepts will need to be validated, defining cognitive phenomena at the appropriate level of resolution to transport to AI will quite likely involve measuring activity in biological brains.

2) We should be sure that our interpretability methods work in humans before relying on them to align AGI

We would like to know that our methods for interpreting AI generalize beyond the narrow scope under which these methods are developed. This validation is particularly crucial if we want to deploy general-purpose methods to align powerful AGI in the future. A significant challenge, however, lies in testing the widespread applicability of these interpretability methods—specifically, understanding their scalability. Fortunately, we have a readily accessible general intelligence to examine: humans. Before counting on these methods to align AGI, it's prudent to ascertain their efficacy within the human context.

The Problem

Building lie detectors

In general, we would like to avoid building models that can lie to us. Methods for discovering latent knowledge (DLK) aim to report what a model “believes/knows”[2]. If we know what a model believes/knows, then we can simply check if its actions are consistent with these beliefs.

Burns et al. (2022) [4] made the surprising discovery that a simple linear probe, trained without supervision on pairs of statements and their negations, can identify latent directions within language models that track propositional truth value. This probe predicts the truth value of the inputs from internal activations more accurately than zero-shot baselines. Moreover, its predictions continue to track truth value accurately even when the model's outputs do not.
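
To make the setup concrete, here is a minimal sketch of the CCS objective as I understand it from Burns et al. (2022), written in PyTorch. The names (pos_acts, neg_acts, CCSProbe) and hyperparameters are illustrative rather than taken from the authors' implementation; the inputs are assumed to be hidden states extracted from a language model for each statement phrased as true (x+) and as false (x-).

```python
# Minimal sketch of the CCS objective (after Burns et al., 2022).
import torch
import torch.nn as nn

def normalize(acts):
    # Normalize each contrast set independently so the probe cannot rely on
    # superficial differences between the "true" and "false" phrasings.
    return (acts - acts.mean(0)) / (acts.std(0) + 1e-8)

class CCSProbe(nn.Module):
    # A linear map followed by a sigmoid: p(x) = sigmoid(w . x + b).
    def __init__(self, d):
        super().__init__()
        self.linear = nn.Linear(d, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

def ccs_loss(p_pos, p_neg):
    # Consistency: the probabilities assigned to a statement and its negation
    # should sum to one.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: penalize the degenerate solution p = 0.5 everywhere.
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()
    return consistency + confidence

def train_ccs(pos_acts, neg_acts, epochs=1000, lr=1e-3):
    # pos_acts, neg_acts: (N, d) hidden states for each statement phrased as
    # true and as false, respectively (illustrative names).
    pos, neg = normalize(pos_acts), normalize(neg_acts)
    probe = CCSProbe(pos.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(pos), probe(neg))
        loss.backward()
        opt.step()
    return probe
```

The key point is that the loss only asks the probe's outputs on a statement and its negation to behave like probabilities of contradictory propositions, so no truth labels are required. At inference time, the truth score for a statement is the average of p(x+) and 1 - p(x-), and the learned direction is identified only up to a sign flip.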

Do these latent dimensions capture the “beliefs/knowledge” of the model? What does it mean to say that the model believes cats are mammals? Unfortunately, we can’t answer these questions yet, because we don’t have a sufficient understanding of what beliefs or knowledge are. Predicting whether a statement was true or false based on information present in language models is not enough to confirm that these models hold beliefs regarding the statement, because we do not know if and how that information is utilized or represented in the model. This is a problem if we want to build lie detectors, because our definition of lying relies on the concepts of knowledge and belief, which are defined internally and cannot be inferred from behavior alone.

DLK in humans

What can we do in the absence of a comprehensive theory of belief and knowledge? We could try to build such a theory, but any theory explaining these concepts must be validated in humans. A standard approach is to look for the physical (likely neural) correlates of a cognitive phenomenon and try to draw more general conclusions from those. 

For instance, we observe through direct experience that we possess beliefs and knowledge, and we extend this capacity to other humans. We can record these beliefs pretty reliably, for instance by simply asking a (trustworthy) person whether they think a proposition is true or false. If we observe the underlying brain processes that accompany these propositions, we could identify which of those coincided with the phenomenology of believing or disbelieving those propositions, and ideally also perturb the relevant neural mechanisms to determine whether they play a causal role.

However, in practice, discovering neural correlates of beliefs/knowledge is difficult, both because it is very hard to experimentally isolate the cognitive processes of interest and because our current means of measuring and perturbing the brain are still relatively imprecise. A slightly different approach could be to directly use DLK methods like CCS, which ostensibly recover “beliefs” from internal states of models, to recover similar truth-like features from human brain data. This approach can provide us with candidate neural features whose relationship to beliefs/knowledge can be explored in detail. The advantage of this approach is that we can be reasonably confident that beliefs/knowledge are embedded within the observed neural activity, because we are able to connect these to phenomenology.

As mentioned previously, there is already at least one method (CCS) for discovering features in language models that look particularly belief-like. This method may even suggest a framework for how beliefs are encoded in neural activity. Under this framework, the consistency of beliefs with the semantic content of natural language statements is represented[3] along latent dimensions of neural activity, similar to the way other features are thought to be represented in the brain and in artificial neural networks alike. If this framework holds for human truth evaluation in natural language, then it predicts that we should be able to identify latent dimensions in neural activity that discriminate between statements the individual judges true and those they judge false. This accomplishes two goals: 1) it assesses the efficacy of CCS in pinpointing neural correlates of truth evaluation when we're confident such processes are influencing the observed behavior, and 2) it lays the foundation for a plausible framework for knowledge and beliefs, potentially applicable across diverse intelligences.
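
For illustration, here is a hypothetical sketch of how that prediction could be tested, reusing the train_ccs and normalize helpers from the sketch above. The inputs (betas_affirm, betas_negate, judged_true) are assumed, pre-extracted arrays of voxel response patterns and the participant's own true/false judgments; none of this refers to an existing pipeline, and the actual experimental design is the subject of the next post.

```python
import numpy as np
import torch

def evaluate_ccs_on_fmri(betas_affirm, betas_negate, judged_true):
    """Fit the CCS probe to voxel patterns and score it against behavior.

    betas_affirm, betas_negate: (N, V) arrays of voxel responses (e.g. GLM
    betas) for each proposition presented as an affirmation and as a negation.
    judged_true: (N,) boolean array of the participant's own judgments.
    All inputs are hypothetical; a real analysis would use a held-out split.
    """
    pos = torch.tensor(np.asarray(betas_affirm), dtype=torch.float32)
    neg = torch.tensor(np.asarray(betas_negate), dtype=torch.float32)
    probe = train_ccs(pos, neg)  # defined in the earlier sketch
    with torch.no_grad():
        p_pos = probe(normalize(pos)).squeeze(1)
        p_neg = probe(normalize(neg)).squeeze(1)
        score = 0.5 * (p_pos + (1 - p_neg))
    preds = (score > 0.5).numpy()
    # The direction is learned without labels, so its sign is arbitrary:
    # report accuracy up to a global flip against the participant's judgments.
    acc = float((preds == np.asarray(judged_true)).mean())
    return max(acc, 1.0 - acc)
```

Above-chance accuracy under this kind of analysis would be evidence for a latent dimension in neural activity that tracks the individual's truth judgments, which is exactly the prediction the framework makes.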

Testing lie detectors

Once we have built and validated our DLK method, we should be able to put it to use to identify when models are lying. But there is still a problem. While it is possible to get language models to produce outputs that are inconsistent with the predictions from our DLK method, it is still unclear whether these constitute lies, because we still have a limited understanding of how the information identified by our method influences the model's behavior. For instance, the model may have misinterpreted the nature of the task, or the process that produces its response may not have access to the relevant information. To distinguish between these cases and deceptive behavior, we have to further define how beliefs are used in the context of lying.

Another layer of complexity arises when considering the stability of belief/knowledge embeddings identified by DLK across varied behavioral circumstances, including situations involving deception. If, for instance, belief embeddings are obscured during deception, the accuracy of DLK, developed under truth-telling conditions, may be compromised.

If we want to test our DLK method as a lie detector, then we need to test its ability to act as such in humans. Given that humans can intentionally lie, inducing deception in an experimental setting is feasible. Such an approach illuminates the underlying processes governing truth evaluation during honesty and deception. Moreover, subjecting our DLK method to various behavioral conditions will vet its resilience, offering further confidence in its ability to discern genuine neural correlates of specific cognitive functions.

Some other specific questions about CCS that can be addressed by DLK in the human brain

In order to be very clear about how I think DLK in the human brain could contribute to understanding open questions in AI safety, I’ll provide a few additional examples of specific questions that I believe would be addressed by this project. Note that I do not think that DLK in the human brain will provide definitive answers to any of these, but I think it could provide useful evidence before the arrival of human-level or superhuman AGI.

Will future language models encode truth in a simple linear way?

If we find that humans do not encode truth with a limited set of features that are accessible with linear decoders, this may change our expectations for future AI systems. In this situation, it would benefit us to understand how humans evaluate truth, and then to look for these processes in AI. On the other hand, if humans do encode truth in a relatively simple way that can be assessed with linear decoders, then I might be cautiously optimistic that similar approaches will work for increasingly general AI.

Are belief/knowledge embeddings obscured during active deception?

If we find that belief/knowledge embeddings are altered during active deception in humans, or that humans are able to deliberately obscure their belief/knowledge embeddings, for instance, by thinking of something else, then we should have more concern for this kind of behavior being exhibited by an advanced AI. Knowing this can help us advance strategies for promoting honesty that anticipate these kinds of behaviors.

Can a purely language-based approach to DLK uncover beliefs that were not acquired through language?

A serious concern is that future AI systems will not explicitly connect their beliefs to language such that they can be elicited by methods like CCS. Humans, and presumably other intelligent organisms, invariably form some beliefs through means other than natural language. It would be useful to test whether an approach like CCS, which operates through language, can ascertain beliefs that humans formed outside of language.

Some possible objections

Can’t we just build truthful AI?

Enforcing truthfulness, rather than honesty, in AI (and LLMs in particular) could be a more feasible and more stringent criterion [5]. While honesty requires that statements be made in accordance with beliefs, truthfulness requires that statements be verifiably true. Enforcing truthfulness sidesteps the issue of defining internal cognitive processes and relies only on external observations of behavior.

While I think aiming for truthfulness may be a valuable near-term approach and desirable alongside honesty, I worry that it will become harder to enforce as AI systems become increasingly capable and produce outputs that are difficult for humans to verify. Assessing truth value in natural language is already not straightforward (see debates on factive predicates in linguistics, e.g. [6]). Moreover, training models exclusively on human-verifiable statements does not eliminate the risk of models misgeneralizing beyond their training data. Consequently, discerning a model's inherent beliefs and its adherence to them remains crucial.

I also worry that the framing for truthful AI too narrowly conceptualizes deception as “lying”. In fact, deception does not require that statements are made contrary to internal beliefs. Sophisticated deception is observed when an individual makes statements that are verifiably true in order to mislead or manipulate others to act contrary to how they would otherwise. This can happen when individuals anticipate that their statements will not be believed by others [7,8]. Numerous manipulative tactics don't necessarily involve blatant falsehoods but may be built around selective but still factually accurate statements. Given these considerations, I think aiming for honesty in AI, which circumvents both simple and sophisticated forms of deception, is a necessary long-term goal.

Won’t it be obvious when we have identified the cognitive capacities of AI?

Perhaps we will discover methods that quite compellingly suggest that some AI systems possess certain cognitive capacities (perhaps by describing, in full detail, the process by which a capacity is implemented). However, I'm doubtful that we can be confident that these capacities map cleanly onto concepts like "deception" and "honesty" without testing whether we reach similar conclusions in humans. Any attempt to extrapolate human-centric cognitive notions onto AI necessitates a robust validation of those concepts in humans first.

We shouldn’t expect the cognition of advanced AIs to resemble that of humans anyway.

I think this is quite possibly true. If it is true, however, we should be extremely cautious about applying concepts derived from human cognition to AI. It would still benefit us to determine precisely how the cognition of AI differs from human intelligence, and what implications that has for defining what we mean when we say we want to avoid deceptive behavior. For instance, if models do not have analogous belief/knowledge structures, then it becomes unclear how to approach the problem of deceptive AI, or whether deception is a coherent framework for thinking about AI at all.

Language limits our evaluation of beliefs.

I am most worried about this objection, and I think it requires further consideration. When we use language to probe beliefs, we may be limiting ourselves to propositions that can be easily stated in natural language. That is, language may be insufficient to communicate certain propositions precisely, and therefore we may not be able to elicit the full range of beliefs/knowledge encoded by brains or AI systems. We can probably elicit certain things from the underlying world model through language, but what we would like to do is probe the world model itself.

Furthermore, language does not appear to be necessary for holding beliefs/knowledge. The distinction between language and belief/knowledge appears to be demonstrated in individuals who have undergone corpus callosotomy (i.e. split-brain patients). Often, these patients cannot express, through language (or perhaps only certain forms of language), information that is accessible only to the non-language-dominant hemisphere. Yet these individuals can still take actions based on this knowledge, even if they are not able to connect it to language (see [9] and [10] for reviews of split-brain studies).

I think this objection might hit on some deep philosophical problems that I am not prepared to address directly. However, the current proposal still seems to at least be capable of addressing narrower questions like “How is natural language evaluated against beliefs/knowledge in the brain?” and “Can we use natural language to infer beliefs/knowledge directly from brain data?”, which is perhaps a starting point for addressing deeper questions. 

Moreover, the concern about language being insufficient to ascertain beliefs/knowledge is just as much a problem for aligning AI as it is for studying belief/knowledge in humans. Therefore, this objection does not provide a specific criticism of the present proposal to test CCS in humans, and instead offers a legitimate critique of the use of CCS (in its current form) as a method for discerning beliefs/knowledge in general. I suspect that future methods will be able to discover beliefs/knowledge by eliciting them in ways that do not involve natural language, or use natural language alongside other modalities. These methods might end up looking quite similar to CCS. However, they also may require deeper theoretical foundations for intelligence. Regardless, when we develop such methods and theories, we should ensure that they apply to humans for the same reasons I have outlined above.

Conclusion

I have outlined here why I think it is necessary to examine human brains in order to understand belief and knowledge at the appropriate level of resolution for using these concepts to explain the behavior of AI. Since our definitions of deception invoke beliefs/knowledge, understanding these concepts is essential for avoiding deceptive behaviors in AI. I think that perhaps the simplest way forward is to use our current methods for recovering "beliefs" in LLMs to recover analogous features in human brain data. This both tests the generality of these methods, particularly in the context of general intelligence, and begins to examine how the features recovered by these methods relate to belief/knowledge.

My biggest reservation is that CCS probes beliefs/knowledge through natural language. It seems unclear to me whether this will be sufficient to recover beliefs/knowledge that may have been formed outside of natural language and may not be explicitly connected to language. Furthermore, it limits the neuroscientific questions that can be addressed by restricting assessments to how world-models are brought to bear on language rather than probing the world-model itself. However, I tend to view this as a more general critique of CCS. It still stands to reason that DLK methods, including CCS and future methods that may or may not be tied to natural language, should be investigated in the human context.

Next Post

My next post will outline a specific experiment for testing CCS with human fMRI data. I’ll provide an assessment of the current neuroscience literature on the topic of truth evaluation in natural language, the experimental details of my proposal, and the limitations of current methodologies for studying this topic in the human brain.

References

[1] R. J. Seitz. Believing and beliefs—neurophysiological underpinnings. Frontiers in Behavioral Neuroscience, 16:880504, 2022.

[2] K. J. Friston, T. Parr, and B. de Vries. The graphical brain: Belief propagation and active inference. Network neuroscience, 1(4):381–414, 2017.

[3] D. Marr. Vision: A computational investigation into the human representation and processing of visual information. MIT press, 1982.

[4] C. Burns, H. Ye, D. Klein, and J. Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.

[5] O. Evans, O. Cotton-Barratt, L. Finnveden, A. Bales, A. Balwit, P. Wills, L. Righetti, and W. Saunders. Truthful AI: Developing and governing AI that does not lie. arXiv preprint arXiv:2110.06674, 2021.

[6] J. Degen and J. Tonhauser. Are there factive predicates? An empirical investigation. Language, 98(3):552–591, 2022.

[7] M. Sutter. Deception through telling the truth?! Experimental evidence from individuals and teams. The Economic Journal, 119(534):47–60, 2009.

[8] M. Zheltyakova, M. Kireev, A. Korotkov, and S. Medvedev. Neural mechanisms of deception in a social context: an fMRI replication study. Scientific Reports, 10(1):10713, 2020.

[9] M. S. Gazzaniga. Forty-five years of split-brain research and still going strong. Nature Reviews Neuroscience, 6(8):653–659, 2005.

[10] M. S. Gazzaniga. Review of the split brain, 1975.
 

  1. ^

     Referring here to David Marr’s levels of analysis, which distinguishes between the computational level (the goal of the system), the algorithmic level (how the computation is achieved), and the implementation level (how the algorithm is physically instantiated).

  2. ^

    This topic appears closely tied to Eliciting Latent Knowledge (ELK), but for the time being, I am assuming that the reporter for the model’s beliefs will be simple enough that we don’t have to worry about the human simulator failure mode described in ARC’s first technical report on the subject. Regardless, my intuition is that the problem of defining belief/knowledge will be common across any approach to ELK, and could therefore benefit from validation in humans.

  3. ^

    It was pointed out to me that saying that truth is “represented” by the brain might be a misapplication of the framework of representationalism in cognitive science. To avoid misusing this terminology, I will try to use “truth evaluation process” as an alternative to “truth representation”.
