Considering how much I’ve been using “the intentional stance” in my recent thinking and discussions about the nature of agency and goals, I figured it would be a good idea to, y’know, actually read what Dan Dennett originally wrote about it. While doing so, I realized that he was already considering some nuances that the Wikipedia summary of the intentional stance leaves out but that are nonetheless relevant to the issues we face when attempting to, e.g., formalize the approach or think more clearly about the nature of agency in the context of alignment. I don’t expect many LessWrongers will read the original book in full, but I do expect that some additional clarity on what exactly Dennett was claiming about the nature of agency and goals will help us have less confused intuitions and discussions about the subject.

In what follows, I provide an in-depth summary of Dennett’s exposition of the intentional stance, from Chapter 2 of The Intentional Stance (“True Believers: The Intentional Strategy and Why It Works”), which Dennett considers “the flagship expression” of his position. Then, I discuss a few takeaways for thinking about agency in the context of AI safety. In brief, I think 1) we should stop talking about whether the systems we build will or won’t “be agents,” and instead debate how much it will make sense to consider a given system as “an agent,” from the information available to us, and 2) we should recognize that even our internally-experienced beliefs and desires are the result of parts of our minds “applying the intentional stance” to other parts of the mind or the mind as a whole.

This work was completed as a Summer Research Fellow at the Center on Long-Term Risk under the mentorship of Richard Ngo. Thanks to Richard, Adam Shimi, Kaj Sotala, Alex Fabbri, and Jack Auen for feedback on drafts of this post.

Summarizing Dennett's position

TLDR: There is no observer-independent “fact of the matter” of whether a system is or is not an “agent”. However, there is an objective fact of the matter about how well a particular system’s behavior is modeled by the intentional stance, from the point of view of a given observer. There are, objectively, patterns in the observable behavior of an intentional system that correspond to what we call “beliefs” and “desires”, and these patterns explain or predict the behavior of the system unusually well (but not perfectly) for how simple they are.

In an attempt to be as faithful as possible in my depiction of Dennett’s original position, as well as provide a good resource to point back to on the subject for further discussion[1], I will err on the side of directly quoting Dennett perhaps too frequently, at least in this summary section.

He begins by caricaturing two opposing views on the nature of belief: 1) realism: there’s an “objective internal matter of fact” to the nature of belief; for example, in principle, sufficiently detailed understanding of cognitive psychology, neuroscience, or even physics would allow one to effectively “find the beliefs” inside the believer’s head. 2) interpretationism: “likens the question of whether a person has a particular belief to the question of whether a person is immoral, or has style, or talent, or would make a good wife…. ‘It’s a matter of interpretation.’”

Dennett’s position (the intentional strategy or adopting the intentional stance), then, is that “while belief is a perfectly objective phenomenon (that apparently makes me a realist), it can be discerned only from the point of view of one who adopts a certain predictive strategy, and its existence can be confirmed only by an assessment of the success of that strategy (that apparently makes me an interpretationist).”

The Intentional Strategy: How it works, how well it works

Three Stances

There are several approaches one might take to predicting the future behavior of some system; Dennett compares three: the physical stance, the design stance, and the intentional stance.

In adopting the physical stance towards a system, you utilize an understanding of the laws of physics to predict a system’s behavior from its physical constitution and its physical interactions with its environment. One simple example of a situation where the physical stance is most useful is in predicting the trajectory of a rock sliding down a slope; one would be able to get very precise and accurate predictions with knowledge of the laws of motion, gravitation, friction, etc. In principle (and presuming physicalism), this stance is capable of predicting in full the behavior of everything from quantum mechanical systems to human beings to the entire future of the whole universe.

With the design stance, by contrast, “one ignores the actual (possibly messy) details of the physical constitution of an object, and, on the assumption that it has a certain design, predicts that it will behave as it is designed to behave under various circumstances.” For example, humans almost never consider what their computers are doing on a physical level, unless something has gone wrong; by default, we operate on the level of a user interface, which was designed in order to abstract away messy details that would otherwise hamper our ability to interact with the systems.

Finally, there’s the intentional stance:

Here is how it works: first you decide to treat the object whose behavior is to be predicted as a rational agent; then you figure out what beliefs that agent ought to have, given its place in the world and its purpose. Then you figure out what desires it ought to have, on the same considerations, and finally you predict that this rational agent will act to further its goals in the light of its beliefs. A little practical reasoning from the chosen set of beliefs and desires will in many—but not all—instances yield a decision about what the agent ought to do; that is what you predict the agent will do.

Before further unpacking the intentional stance, one helpful analogy might be that the three stances can be understood as providing gears-level models for the system under consideration, at different levels of abstraction.[2] For purposes of illustration, imagine we want to model the behavior of a housekeeping robot:

  • The physical stance gives us a gears-level model where the gears are the literal gears (or other physical components) of the robot.
  • The design stance gives us a gears-level model where the gears come from the level of abstraction at which the system was designed. The gears could be e.g. the CPU, memory, etc., on the hardware side, or on the level of the robot’s user interface, on the software side.
  • The intentional stance gives us a gears-level model where the relevant gears are the robot’s beliefs, desires, goals, etc.

Attributing Beliefs and Desires

The above description of the intentional stance doesn’t provide many specifics about how to determine the beliefs and desires the intentional system “ought to have”; how do we actually determine the beliefs and desires to be attributed? Dennett first notes that we typically come to form beliefs about the parts of the world we are exposed to through our senses. However, we obviously do not learn or remember all the potentially inferable truths from our sensory data; “what we come to know, normally, are only all the relevant truths our sensory histories avail us.”

This leaves us with one heuristic for attributing beliefs under the intentional stance: “attribute as beliefs all the truths relevant to the system’s interests (or desires) that the system’s experience to date has made available.” For example, imagine that a group of people are having a discussion about AI in a pub where a football match is being shown on television. Those who are interested in AI but not football are more likely to form beliefs about the content of the conversation than about the content on the television, and vice versa. Although a useful rule of thumb, this heuristic fails to capture a system’s false beliefs, not to mention the fact that humans cannot perfectly remember every desire-relevant belief they have had the opportunity to form.[3] To be clear, this rule is derived from the fundamental rule, “attribute the beliefs the system ought to have”: an intentional system should form beliefs about the aspects of its environment relevant to its desires.

This heuristic for belief attribution also presumes an attribution of desires to the intentional system. For humans, this is relatively straightforward—the desires we should be quickest to attribute to another human are those that are most common to all of us: “survival, absence of pain, food, comfort, procreation, entertainment.” It doesn’t take a great leap of imagination to reason that other humans want to be happy, safe, and comfortable, because that’s what almost all of us want. (As Dennett notes, “citing any one of these desires typically terminates the ‘Why?’ game of reason giving.”) This is the result of the fundamental rule “attribute the desires the system ought to have”—the desires (at least, the fundamental ones) humans “ought to have” are fairly obvious. We can also attribute desires with the heuristics “attribute desires for those things a system believes to be good for it” and “attribute desires for those things a system believes to be best means to the other ends it desires” (instrumental desires).[4]

There’s also a need to consider how “rational” an intentional system is in the process of belief and desire attribution. A logically omniscient intentional system would believe all the truths derivable from its current beliefs, but obviously humans (and any system that exists in the physical universe, for that matter) fall far short of this ideal. Dennett writes:

One starts with the ideal of perfect rationality and revises downward as circumstances dictate. That is, one starts with the assumption that people believe all the implications of their beliefs and believe no contradictory pairs of beliefs. This does not create a practical problem of clutter (infinitely many implications, for instance), for one is interested only in ensuring that the system one is predicting is rational enough to get to the particular implications that are relevant to its behavioral predicament of the moment. Instances of irrationality, or of finitely powerful capacities of inferences, raise particularly knotty problems of interpretation, which I will set aside on this occasion (see chapter 4, “Making Sense of Ourselves,” and Cherniak 1986).[5]
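To make the attribution heuristics in this section a bit more concrete, here is a minimal Python sketch of my own (not anything Dennett wrote). It assumes we already have a list of truths the system's experience has made available, each tagged with the desires it bears on, and it ignores false beliefs and bounded rationality entirely. The class name `IntentionalModel`, its methods, and the pub data are all hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class IntentionalModel:
    """An observer's toy model of some system under the intentional stance."""

    desires: set[str]
    beliefs: set[str] = field(default_factory=set)

    def attribute_beliefs(self, available_truths: dict[str, set[str]]) -> None:
        # Heuristic from the text: attribute every truth that the system's
        # experience has made available AND that bears on one of its desires.
        for truth, relevant_to in available_truths.items():
            if relevant_to & self.desires:
                self.beliefs.add(truth)

    def predict_action(self, options: dict[str, tuple[set[str], str]]) -> str | None:
        # Each option maps an action to (beliefs it presupposes, desire it serves).
        # Predict the first action that serves an attributed desire and whose
        # presuppositions are all among the attributed beliefs.
        for action, (presupposed, serves) in options.items():
            if serves in self.desires and presupposed <= self.beliefs:
                return action
        return None


# Truths available to everyone in the pub, tagged with the desires they bear on.
pub_truths = {
    "someone claimed scaling will continue": {"understand AI"},
    "the home team just scored": {"follow football"},
}
options = {
    "cheer": ({"the home team just scored"}, "follow football"),
    "respond to the scaling claim": (
        {"someone claimed scaling will continue"},
        "understand AI",
    ),
}

ai_fan = IntentionalModel(desires={"understand AI"})
ai_fan.attribute_beliefs(pub_truths)
print(ai_fan.beliefs)                  # only the AI-related truth is attributed
print(ai_fan.predict_action(options))  # respond to the scaling claim
```

Note that the desires have to be supplied before any beliefs can be attributed, which is the same dependence of belief attribution on desire attribution described above.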

It works (well)

As Dennett emphasizes, the intentional strategy is the strategy we all already use, all the time when interacting with other humans in the world (not to mention occasionally with some other systems, from animals to robots to plants, even to thermostats); imagine seeing two children pulling at opposing ends of a toy and not thinking “both those kids want that toy”! The intentional strategy, in a sense, is a description of how we attribute beliefs and desires to a system when our minds find it easiest to perceive that system as “being an agent,” “having intentions,” etc.

True Believers

Here we get to the real meat of Dennett’s position, so to speak. Now that he’s described how we attribute beliefs and desires to systems that seem to us to have intentions of one kind or another, “the next task would seem to be distinguishing those intentional systems that really have beliefs and desires from those we may find it handy to treat as if they had beliefs and desires.” (For example, although a thermostat’s behavior can be understood under the intentional stance, most people intuitively feel that a thermostat doesn’t “really” have beliefs.) This, however, cautions Dennett, would be a mistake.

As a thought experiment, Dennett asks us to imagine that some superintelligent Martians descend upon us; to them, we’re as simple as thermostats are to us. If they were capable of predicting the activities of human society on a microphysical level, without ever treating any of us as intentional systems, it seems fair to say that we wouldn’t “really” be believers, to them. This shows that intentionality is somewhat observer-relative—whether or not a system has intentions depends on the modeling capabilities of the observer.

However, this is not to say that intentionality is completely subjective, far from it—there are objective patterns in the observables corresponding to what we call “beliefs” and “desires.” (Although Dennett is careful to emphasize that these patterns don’t allow one to perfectly predict behavior; it’s that they predict the data unusually well for how simple they are. For one, your ability to model an intentional system will fail under certain kinds of distributional shifts; analogously, understanding a computer under the design stance does not allow one to make accurate predictions about what it will do when submerged in liquid helium.)
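As a toy illustration of “predicts unusually well for how simple it is,” consider the following sketch. It is my own construction under strong simplifying assumptions: the “system” moves on a line, the observations in `positions` are made up, and the intentional model has a single parameter (the attributed goal position). None of these names come from Dennett.

```python
def sign(x: int) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


# Hypothetical observations: positions of a system on a line at successive times.
positions = [0, 1, 2, 3, 2, 3, 4, 5, 5, 5]
moves = [b - a for a, b in zip(positions, positions[1:])]


def intentional_accuracy(goal: int) -> float:
    """Fraction of observed moves predicted by 'it wants to be at `goal`'."""
    predicted = [sign(goal - p) for p in positions[:-1]]
    hits = sum(p == m for p, m in zip(predicted, moves))
    return hits / len(moves)


# The "best interpretation" is the attributed goal that predicts behavior best.
best_goal = max(range(10), key=intentional_accuracy)

# A lookup table of every observed move is perfectly accurate on this data, but
# it needs one entry per time step and says nothing about unseen situations.
lookup_table_size = len(moves)

print(best_goal, round(intentional_accuracy(best_goal), 2), lookup_table_size)
# prints: 5 0.89 9
```

The one-parameter “it wants to be at 5” model gets 8 of 9 moves right; the one backward step is the kind of imperfection Dennett has in mind. The lookup table is perfect on the observed data but is as complex as the data itself.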

Another interesting point Dennett argues is that the intentional stance is unavoidable with regard to oneself and one’s fellow intelligent beings; “if they observe, theorize, predict, communicate, they view themselves as intentional systems. [Footnote: Might there not be intelligent beings who had no use for communicating, predicting, observing,...? There might be marvelous, nifty, invulnerable entities lacking these modes of action, but I cannot see what would lead us to call them intelligent.]”[6]

Thus, with the intentional stance, Dennett advocates a “milder sort of realism,” where “there is no fact of the matter of exactly which beliefs and desires a person has in these degenerate cases [e.g. of failures of rationality or distributional shift], but this is not a surrender to relativism or subjectivism, for when and why there is no fact of the matter is itself a matter of objective fact.”

One analogy that might help elucidate this relationship between the objectivity of the belief- and desire-patterns and the apparently subjective point of view required to see them is the example of a Turing machine implemented in Conway’s Game of Life. The Turing machine is “objectively” there (anyone who understands what a Turing machine is will not fail to see the pattern in the data), but a specific frame of mind is needed to recognize it. Analogously, the patterns in an intentional system’s observable behavior corresponding to “beliefs” and “desires” are objectively there, once you’ve made the decision to consider the system under the intentional stance.

Thermostats again (but, really, are they agents?)

Now, we return to thermostats and the question of what it means for a system to “really” have beliefs and/or desires. Dennett’s punchline is that “all there is to being a true believer is being a system whose behavior is reliably predictable via the intentional strategy, and hence all there is to really and truly believing that p (for any proposition p) is being an intentional system for which p occurs as a belief in the best (most predictive) interpretation.”

We might be willing to attribute half a dozen beliefs and desires to a normal thermostat; it believes the room is too hot or too cold, etc. Dennett notes that we could “de-interpret” its beliefs and desires via symbol substitution or abstraction: it believes that the R is too H or C, etc. (“After all, by attaching the thermostatic control mechanism to different input and output devices, it could be made to regulate the amount of water in a tank, or the speed of a train.”)
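The “de-interpretation” point can be sketched in code. Below is a minimal, hypothetical `Regulator` class (my own illustration, not a real thermostat API): the control mechanism only ever “believes” that some quantity R is below or above a setpoint, and what R means depends entirely on which input and output devices are attached.

```python
from typing import Callable


class Regulator:
    """The bare control mechanism: it only 'knows' whether R is below a setpoint."""

    def __init__(
        self,
        read_r: Callable[[], float],
        drive: Callable[[bool], None],
        setpoint: float,
    ) -> None:
        self.read_r = read_r
        self.drive = drive
        self.setpoint = setpoint

    def step(self) -> bool:
        # "It believes that the R is too low" and acts to raise it.
        r_too_low = self.read_r() < self.setpoint
        self.drive(r_too_low)
        return r_too_low


# Two different worlds to attach the same mechanism to (made-up values).
room = {"temp_c": 17.0, "heater_on": False}
tank = {"litres": 120.0, "valve_open": False}


def set_heater(on: bool) -> None:
    room["heater_on"] = on


def set_valve(on: bool) -> None:
    tank["valve_open"] = on


thermostat = Regulator(lambda: room["temp_c"], set_heater, setpoint=20.0)
tank_filler = Regulator(lambda: tank["litres"], set_valve, setpoint=100.0)

thermostat.step()   # room is below its setpoint, so the heater is switched on
tank_filler.step()  # tank is above its setpoint, so the valve stays closed
print(room["heater_on"], tank["valve_open"])
```

Nothing inside `Regulator` distinguishes temperature from water volume; the “meaning” of its internal state comes only from the minimal link to the world, which is trivially swapped.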

On the other hand, if we start to enrich its causal attachments to the world, e.g. by giving it more than one way to learn about temperature, or by giving it a fully general visual system, we also enrich the semantics of the “dummy predicates” (H and C, etc.). Given the actual link to the world, we could endow a state of the primitive thermostat with “meaning” (of a sort), but it was too easy to substitute a different minimal link and altogether change the meaning (“in this impoverished sense”) of that internal state (the “de-interpreted” beliefs weren’t very “meaningful” since they could represent the beliefs of a speed or volume regulator just as well as those of a temperature regulator). However, with “perceptually richer and behaviorally more versatile” systems,

it becomes harder and harder to make substitutions in the actual links of the system to the world without changing the organization of the system itself. If you change its environment, it will notice, in effect, and make a change in its internal state in response. There comes to be a two-way constraint of growing specificity between the device and the environment. (emphasis added) Fix the device in any one state and it demands a very specific environment in which to operate properly (you can no longer switch it easily from regulating temperature to regulating speed or anything else); but at the same time, if you do not fix the state it is in, but just plonk it down in a changed environment, its sensory attachments will be sensitive and discriminative enough to respond appropriately to the change, driving the system into a new state, in which it will operate effectively in the new environment. There is a familiar way of alluding to this tight relationship that can exist between the organization of a system and its environment: you say that the organism continuously mirrors the environment, or that there is a representation of the environment in—or implicit in—the organization of the system.[7]

It is not that we attribute (or should attribute) beliefs and desires only to things in which we find internal representations, but rather that when we discover some object for which the intentional strategy works, we endeavor to interpret some of its internal states or processes as internal representations. What makes some internal feature of a thing a representation could only be its role in regulating the behavior of an intentional system. (emphasis added in italics)

This is important to understand, as we can intuitively feel that any system that’s “really” an agent, or “really” has desires or beliefs, must have internal representations corresponding to those desires and beliefs, but, as Dennett points out, the relative predictive success of the intentional strategy is what really determines how much it makes sense to consider a system as “being intentional” or “having intentions, beliefs, desires, etc.” An intentional system might represent its environment implicitly in its organization (like bacteria or ants, for example).

Now the reason for stressing our kinship with the thermostat should be clear. There is no magic moment in the transitions from a simple thermostat to a system that really has an internal representation of the world around it. The thermostat has a minimally demanding representation of the world, fancier thermostats have more demanding representations of the world, fancier robots for helping around the house have still more demanding representations of the world. Finally you reach us.

When it comes to the question of why the intentional strategy works as well as it does, Dennett says that the question is ambiguous, with two very different possible kinds of answers:

  1. The system was designed such that the intentional stance applies well to it. In the case of a thermostat, Dennett simply claims that it was designed to be understood under the intentional stance, but Adam Shimi offers a clearer, more nuanced explanation: tautologically, designed things are designed to be understood from the design stance, but humans tend to have a teleological instinct when designing things, so most things designed by humans end up being interpretable under both the design and intentional stances. In the case of humans, “evolution has designed human beings to be rational, to believe what they ought to believe and want what they ought to want.” (I’m a bit suspicious of the phrasing that evolution designed us to believe and want what we “ought” to, but I think this is pointing at the idea that systems that have been subjected to a lot of selection pressure are more likely to “be coherent” and therefore more likely to be well-described via the intentional stance.)
  2. How the machinery works. The features of the thermostat’s design that explain why its behavior is well-understood under the intentional stance are easily discovered and understood, but not so with human minds. “How do human minds and brains implement ‘goal-directed behavior’?” is a fundamental and open question, insights into which will hopefully prove useful in understanding how prosaic systems will implement similar behavior.

As for one potential explanation for how our machinery works, Dennett suggests that brains themselves may have machinery that corresponds to “beliefs” and “desires.” If this were the case, the explanation for why the intentional stance works would be that its explanatory terms coincide with the actual, physical and/or functional machinery responsible for producing the observed behavior (at some relevant level of abstraction of the mechanics; certainly individual neurons don’t implement goal-directed behavior by themselves!). However, Dennett is careful to distinguish this claim from the claim that the intentional stance identifies objective patterns in observable behavior corresponding to “beliefs” and “desires”:

Those who think that it is obvious, or inevitable, that such a theory will prove true (and there are many who do), are confusing two different empirical claims. The first is that intentional stance description yields an objective, real pattern in the world—the pattern our imaginary Martians missed. This is an empirical claim, but one that is confirmed beyond skepticism. The second is that this real pattern is produced by another real pattern roughly isomorphic to it within the brains of intelligent creatures. Doubting the existence of the second real pattern is not doubting the existence of the first.

Dennett suggests human language as a candidate for this kind of machinery of belief and desire; perhaps a better modern candidate would be something vaguely Bayesian (probabilistic generative models?).[8] (Also, I could be wrong here, but I remember reading somewhere, I think maybe in one of Steve’s posts, that we have some evidence now that natural language mirrors the structure of thought and not the other way around—maybe children independently inventing languages that bear significant structural similarities to existing natural languages? And personally, much of my own thought isn’t in the form of words but more abstract patterns and concepts, but both language and abstract concepts feel like they share underlying structure, or something.)

Takeaways for deconfusing agency

Editorial note: To be clear, these “takeaways” are both “things Dan Dennett is claiming about the nature of agency with the intentional stance” and “ideas I’m endorsing in the context of deconfusing agency for AI safety.” I believe that Dennett really gets at the heart of the matter of agency with the intentional strategy, because it’s the clearest description I know of the process by which the human mind attributes “agency” not only to other systems but also to itself. Although developing a more formal characterization of the strategy is challenging for several reasons, I know of no better starting point for developing a more rigorous understanding of the nature of agency.

There's no observer-independent fact of the matter about whether a system "is" an agent[9]

If something appears agent-y to us (i.e., we intuitively use the intentional strategy to describe its behavior), our next question tends to be, “but is it really an agent?” (It’s unclear what exactly is meant by this question in general, but it might be interpreted as asking whether some parts of the system correspond to explicit representations of beliefs and/or desires.) In the context of AI safety, we often talk about whether or not the systems we build “will or won’t be agents,” whether or not we should build agents, etc.

One of Dennett’s key messages with the intentional stance is that this is a fundamentally confused question. What it really and truly means for a system to “be an agent” is that its behavior is reliably predictable by the intentional strategy; all questions of internal cognitive or mechanistic implementation of such behavior are secondary. (Put crudely, if it looks to us like an agent, and we don’t have an equally-good-or-better alternative for understanding that system’s behavior, well, then it is one.) In fact, once you have perfectly understood the internal functional mechanics of a system that externally appears to be an agent (i.e. you can predict its behavior more accurately than with the intentional stance, albeit with much more information), that system stops looking like “an agent,” for all intents and purposes. (At least, modeling the system as such becomes only one potential model for understanding the system’s behavior, which you might still use in certain contexts e.g. for efficient inference or real-time action.)

We should therefore be more careful to recognize that the extent to which AIs will “really be agents” is just the extent to which our best model of their behavior is of them having beliefs, desires, goals, etc. If GPT-N appears really agent-y with the right prompting, and we can’t understand this behavior under the design stance (how it results from predicting the most likely continuation of the prompt, given a giant corpus of internet text) or a “mechanistic” stance (how individual neurons, small circuits, and/or larger functional modules interacted to produce the output), then GPT-N with that prompting really is an agent.

Remember that concepts like “agent” and “goal” are representations within the world models in which “we” exist, not things which can actually exist within the territory (presuming physicalism). The representations correspond to artificially-imposed (but one hopes usefully, if imperfectly, drawn) boundaries in Thingspace, so when we ask questions like “what does it mean for a system to ‘be an agent’?”, we’re essentially asking how to most usefully draw[10] that boundary, or more specifically characterize the ‘mass or volume in Thingspace’ to which the label points. Dennett, with the intentional stance, argues that any such satisfactory characterization will primarily be in terms of the system’s behavior (with respect to some observer), not its internal implementation, a point that has been made around here before.

In the end, what we care about are the effects that systems’ behaviors have on the world, not the details of how those behaviors are implemented. This is not to say that understanding how such behavior is cognitively implemented will not be instrumentally useful for better understanding and predicting the behavior of goal-directed agents; having a mechanistic understanding of the implementation seems like the best way to make accurate predictions about how the system will generalize to new (out-of-distribution) inputs. To this end, understanding how the human mind implements goal-directedness seems particularly useful for getting an idea of how we might do something similar with prosaic AI systems.

I can also imagine that we could potentially further constrain the boundary (beyond “systems well-described by the intentional strategy”) we draw around the “agent” cluster in Thingspace by including some cognitive criteria. For example, while explicit internal representations of beliefs or desires might not be necessary for a system to “really be an agent,” we might also believe that sufficiently advanced or “intelligent” intentional systems (especially those implemented via neural networks) will have explicit internal representations of beliefs and desires. If this were the case, then understanding how these beliefs and desires are in general represented or implemented cognitively/mechanistically would facilitate a much finer level of understanding of agency, in the specific context of advanced prosaic systems. In effect, we would be trading off between the generality/applicability of the characterization of agency and the specificity of the predictions that characterization enables us to make (in the context in which it applies).

Indeed, as Dennett mentions, understanding the link between those beliefs and desires that are “predictively attributable” under the intentional stance and potentially-existing-and-corresponding “functionally salient internal state[s] of the machinery, decomposable into functional parts” is perhaps the best approach to understanding why the intentional strategy works in the first place. Again, however, just to be clear, Dennett emphasizes (and I agree) that we should not primarily characterize the boundary in terms of internal representations or the like; the primary characterization of agency should always be in terms of the behavior of the system under consideration: “It is not that we attribute (or should attribute) beliefs and desires only to things in which we find internal representations, but rather that when we discover some object for which the intentional strategy works, we endeavor to interpret some of its internal states or processes as internal representations.”

"You" and the intentional stance

The intuitive advantage of the intentional strategy is that it merely describes the process by which the human mind automatically ascribes beliefs to other systems it perceives as being or having minds: 1) consider the system as an “agent,” where “agent” points to a fairly primitive representation that the world models of most humans seem to share (this is merely the decision, conscious or unconscious, to apply the intentional stance), 2) deduce its beliefs and desires via commonsense reasoning from available information about its intelligence/rationality and environmental context.

The twin to this advantage, which is perhaps easier to miss, is that it also describes the process by which the human mind models itself as having agency. Our own sense of agency must result, at some level, from some part(s) of the mind “applying the intentional stance” to others. For example, introspection (metacognitive or otherwise) and self-narration (verbal or non-verbal) can be understood as the activity of a module that summarizes mental activity. Additionally, I feel as though there is a legitimate sense in which our conceptual goals and desires (so basically all of them?) are the result of the neocortex building a model of the entire system in which it is embedded as an agent (“applying the intentional stance”) from the signals it receives from the “value function” in the striatum. And verbal beliefs, whether internally experienced or externally expressed, should correspond to a module, able to understand and produce natural language, effectively “translating” the beliefs it ascribes to the mind based on inputs from other parts/modules into a form comprehensible to other higher-level modules (including those necessary for speech production). Even if the human mind models itself as being composed of many “agents” in order to have a model that “leaks” less than modeling itself as a single agent, such an understanding could only result from one part of the mind “applying the intentional stance” (using basically the same pattern or representation it uses to model other humans as agents or itself as a single agent) to the entities it infers to be responsible for various patterns it notices in mental activity.

In general, I think that much of the confusion about whether some system that appears agent-y “really is an agent” derives from an intuitive sense that the beliefs and desires we experience internally are somehow fundamentally different from those that we “merely” infer and ascribe to systems we observe externally. I also think that much of this confusion dissolves with the realization that internally experienced thoughts, beliefs, desires, goals, etc. are actually “external” with respect to the parts of the mind that are observing them—including the part(s) of the mind that is modeling the mind-system as a whole as “being an agent” (or a “multiagent mind,” etc.). You couldn't observe thoughts (or the mind in general) at all if they weren't external to "you" (the observer), in the relevant sense.

The most important thing to understand about the intentional stance is that “all it really means for a system to be an agent is that its behavior is reliably predictable via the intentional strategy (i.e. as having beliefs and desires, acting on those beliefs to satisfy those desires, etc.).” However, understanding the point above, that human minds see themselves as agents because they are “applying the intentional stance” to themselves or to parts of themselves, has perhaps been even more helpful to me. It let me “grok” the intentional stance well enough for the original question about systems “really” being agents to dissolve, and let me see through the nature of my previous confusion.
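To make "reliably predictable via the intentional strategy" a bit more concrete, here is a toy sketch. It is only an illustration under assumptions I am making up for the purpose: the one-dimensional world, the two example systems (`goal_seeker` and `random_walker`), and the scoring function `stance_accuracy` are all hypothetical, and none of this comes from Dennett. The observer ascribes one desire (reach a goal position) and one belief (the system knows where it is), predicts the "rational" move, and then scores how often that prediction matches what the system actually does.

```python
import random


def intentional_prediction(position, goal):
    """Predict the next move by ascribing a desire (reach `goal`) and a
    true belief about `position`, then assuming the system acts on them."""
    if position < goal:
        return +1
    if position > goal:
        return -1
    return 0


def goal_seeker(position, goal, rng):
    """Hypothetical system: heads for the goal, with 10% random noise."""
    if rng.random() < 0.1:
        return rng.choice([-1, 0, 1])
    return intentional_prediction(position, goal)


def random_walker(position, goal, rng):
    """Hypothetical system: moves at random and ignores the goal."""
    return rng.choice([-1, 0, 1])


def stance_accuracy(system, goal=5, steps=2000, seed=0):
    """Fraction of observed moves that the intentional strategy predicts."""
    rng = random.Random(seed)
    position, hits = 0, 0
    for _ in range(steps):
        predicted = intentional_prediction(position, goal)
        actual = system(position, goal, rng)
        hits += predicted == actual
        position += actual
    return hits / steps


if __name__ == "__main__":
    print("goal seeker  :", stance_accuracy(goal_seeker))
    print("random walker:", stance_accuracy(random_walker))
```

On this toy picture, neither system "is" or "is not" an agent in some further sense; the intentional strategy simply predicts the first far better than chance and the second no better than chance. The score is also relative to the observer's choice of ascribed goal and available observations, which is the observer-dependence discussed above.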


  1. The most detailed existing summary I could find is in the literature review on goal-directedness. ↩︎

  2. Thanks to Kaj Sotala for pointing this out in a comment on a draft of this post (previously, I was just drawing the link between “gears-level models” and the design stance)! ↩︎

  3. Cf. Paul Christiano’s mention of “‘justified’ belief” in the context of universality. ↩︎

  4. These heuristics can be applied pretty straightforwardly to humans (since we obviously desire survival, food, shelter, etc.); the question is how we can begin the process of attributing beliefs and desires to an AI if we can’t automatically assume that the AI wants things like survival, food, comfort, entertainment, etc. The set of desires it is possible for AIs or minds-in-general to have is clearly much wider than the set of desires it is possible for humans to have. In this setting, the chicken-and-egg problem with beliefs and desires (where e.g. you attribute the beliefs relevant to the system's desires and the desires that the system believes are the best means to achieving other desires) seems trickier to avoid. ↩︎

  5. Cf. Armstrong and Mindermann, anyone? ↩︎

  6. I’m not very familiar with what Scott Garrabrant has been thinking in the context of human models (and potentially avoiding them), but maybe this hints at the idea that it might just be really difficult to avoid having models, whether implicit or explicit, of humans if you want to do any real-world prediction, even if it’s completely “non-agentic” tool/microscope AI? It seems increasingly difficult to prevent increasingly intelligent systems from discovering this pattern that compresses the data really well, at least until they get much smarter and can model us as e.g. collections of cells (but even then, they’d understand that we model ourselves and other humans this way). In a comment on a draft of this post, Adam Shimi agreed: “My take is that an AI will find the intentional stance really useful for understanding the human, so if it has to model them, it should learn some approximation of it. Then it’s not far from applying it to itself (especially if there are stories in the dataset of humans applying this stance to programs/machines like the AI).” ↩︎

  7. Seems related to the Good Regulator Theorem? ↩︎

  8. The core knowledge model also seems relevant here; it proposes that the human brain contains four systems for representing and reasoning about objects, agents, number, and space, respectively. (Thanks to Kaj Sotala for mentioning this to me!) ↩︎

  9. This is more a reflection of Dennett’s metaphysics than something that is unique to “agents”; I think he would say much the same about e.g. trees, and I would agree, remembering that “agents,” “trees,” and all other things and concepts are representations within the mind’s world model (more on this below). What Dennett is saying is not that it’s completely subjective whether a system “is an agent,” “is a tree,” etc., but rather that “agents” and “trees” are both useful (compressive) encodings of some patterns (that “objectively exist”) in the observable universe. Different minds could encode the same pattern in potentially quite diverse ways, depending on their sensory links to the environment and their desires (which determine which are the “relevant” features of the input which will be preserved under the encoding). “Whether the thing to which the representations point ‘really is’ what the representations themselves are” is basically a nonsensical question; however, how well a given encoding compresses, predicts, and/or explains the observables is not at all subjective, just a matter of information theory! Cf. Real Patterns, also the interlude “Reflections: Real Patterns, Deeper Facts, and Empty Questions” in The Intentional Stance. (Additionally, John Wentworth’s natural abstraction hypothesis in the context of alignment by default?) ↩︎

  10. Note that one must have an application in mind for the concept in order to draw a boundary that is “useful” for that application/in that context. (Thanks to Adam Shimi for reminding me to explicitly point this out.) ↩︎


I mostly agree with everything here, but I think it is understating the extent to which the intentional stance is insufficient for the purposes of AI alignment. I think if you accept "agency = intentional stance", then you need to think "well, I guess AI risk wasn't actually about agency".

A fundamental part of the argument for AI risk is that an AI system will behave in a novel manner when it is deployed out in the world, that then leads to our extinction. The obvious question: why should it behave in this novel manner? Typically, we say something like "because it will be agentic / be goal-directed with the wrong goal".

If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?

Overall, I generally agree with the intentional stance as an explanation of the human concept of agency, but I do not think it can be used as a foundation for AI risk arguments. For that, you need something else, such as mechanistic implementation details, empirical trend extrapolations, analyses of the inductive biases of AI systems, etc.

Some previous discussion:

jbkjr:

If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?

Overall, I generally agree with the intentional stance as an explanation of the human concept of agency, but I do not think it can be used as a foundation for AI risk arguments. For that, you need something else, such as mechanistic implementation details, empirical trend extrapolations, analyses of the inductive biases of AI systems, etc.

The requirement for its behavior being "reliably predictable" by the intentional strategy doesn't necessarily limit us to postdiction in already-observed situations; we could require our intentional stance model of the system's behavior to generalize OOD. Obviously, to build such a model that generalizes well, you'll want it to mirror the actual causal dynamics producing the agent's behavior as closely as possible, so you need to make further assumptions about the agent's cognitive architecture, inductive biases, etc. that you hope will hold true in that specific context (e.g. human minds or prosaic AIs). However, these are additional assumptions needed to answer the question of why an intentional stance model will generalize OOD, not replacing the intentional stance as the foundation of our concept of agency, because, as you say, it explains the human concept of agency, and we're worried that AI systems will fail catastrophically in ways that look agentic and goal-directed... to us.

You are correct that having only the intentional stance is insufficient to make the case for AI risk from "goal-directed" prosaic systems, but having it as the foundation of what we mean by "agent" clarifies what more is needed to make the sufficient case—what about the mechanics of prosaic systems will allow us to build intentional stance models of their behavior that generalize well OOD?

Yeah, I agree with all of that.

There's no observer-independent fact of the matter about whether a system "is" an agent[9]

Worth saying, I think, that this is fully generally true that there's no observer-independent fact of the matter about whether X "is" Y. That this is true of agents is just particularly relevant to AI.

dxu:

[META: To be fully honest, I don't think the comments section of this post is the best place to be having this discussion. That I am posting this comment regardless is due to the fact that I have seen you posting about your hobby-horse—the so-called "problem of the criterion"—well in excess of both the number of times and places I think it should be mentioned—including on this post whose comments section I just said I don't think is suited to this discussion. I am sufficiently annoyed by this that it has provoked a response from me; nonetheless, I will remove this comment should the author of the post request it.]


Worth saying, I think, that this is fully generally true that there's no observer-independent fact of the matter about whether X "is" Y.

The linked post does not establish what it claims to establish. The claim in question is that "knowledge" cannot be grounded, because "knowledge" requires "justification", which in turn depends on other knowledge, which requires justification of its own, ad infinitum. Thus no "true" knowledge can ever be had, throwing the whole project of epistemology into disarray. (This is then sometimes used as a basis to make extravagantly provocative-sounding claims like "there is no fact of the matter about anything".)

Of course, the obvious response to this is to point out that in fact, humans have been accumulating knowledge for quite some time; or at the very least, humans have been accumulating something that very much looks like "knowledge" (and indeed many people are happy to call it "knowledge"). This obvious objection is mainly "addressed" in the linked post by giving it the name of "pragmatism", and behaving as though the act of labeling an objection thereby relieves that objection of its force.

However, I will not simply reassert the obvious objection here. Instead, I will give two principled arguments, each of which I believe suffices to reject the claims offered in the linked post. (The two arguments in question are mostly independent of each other, so I will present them separately.)


First, there is the question of whether epistemological limitations have significant ontological implications. There is a tendency for "problem of the criterion" adherents to emit sentences that imply they believe this, but—so far as I can tell—they offer no justification for this belief.

Suppose it is the case that I cannot ever know for sure that the sky is blue. (We could substitute any other true fact here, but "the sky is blue" seems to be something of a canonical example.) Does it then follow that there is no objective fact about the color of the sky? If so, why? Through what series of entanglements—causal, logical, or otherwise—does knowing some fact about my brain permit you to draw a conclusion about the sky (especially a conclusion as seemingly far-fetched as "the sky doesn't have a color")?

(Perhaps pedants will be tempted to object here that "color" is a property of visual experiences, not of physical entities, so it is in fact true that the sky has no "color" if no one is there to see it. This is where the substitution clause above enters: you may replace "the sky is blue" with any true claim you wish, at any desired level of specificity, e.g. "the majority of light rays emitted from the Sun whose pathway through the atmosphere undergoes sufficient scattering to reach ground level will have an average wavelength between 440 and 470 nm".)

If you bite this particular bullet—meaning, you believe that my brain's (in)ability to know something with certainty implies that that something in fact does not exist—then I would have you reconcile the myriad of issues this creates with respect to physics, logic, and epistemology. (Starter questions include: do I then possess the ability to alter facts as I see fit, merely by strategically choosing what I do and do not find out? By what mechanism does this seemingly miraculous ability operate? Why does it not violate known physical principles such as locality, or known statistical principles such as d-separation?)

And—conversely—if you do not bite the bullet in question, then I would have you either (a) find some alternative way to justify the (apparent) ontological commitments implied by claims such as "there's no observer-independent fact of the matter about whether X is Y", or (b) explain why such claims actually don't carry the ontological commitments they very obviously seem to carry.

(Or (c): explain why you find it useful, when making such claims, to employ language in a way that creates such unnecessary—and untrue—ontological commitments.)

In any of the above cases, however, it seems to me that the original argument has been refuted: regardless of which prong of the 4-lemma you choose, you cannot maintain your initial assertion that "there is no objective fact of the matter about anything".


The first argument attacked the idea that fundamental epistemological limitations have broader ontological implications; in doing so, it undermines the wild phrasing I often see thrown around in claims associated with such limitations (e.g. the "problem of the criterion"), and also calls into question the degree to which such limitations are important (e.g. if they don't have hugely significant implications for the nature of reality, why does "pragmatism" not suffice as an answer?).

The second argument, however, attacks the underlying claim more directly. Recall the claim in question:

[...] that "knowledge" cannot be grounded, because "knowledge" requires "justification", which in turn depends on other knowledge, which requires justification of its own, ad infinitum.

Is this actually the case, however? Let's do a case analysis; we'll (once again) use "the sky is blue" as an example:

Suppose I believe that the sky is blue. How might I have arrived at such a belief? There are multiple possible options (e.g. perhaps someone I trust told me that it's blue, or perhaps I've observed it to be blue on every past occasion, so I believe it to be blue right now as well), but for simplicity's sake we'll suppose the most direct method possible: I'm looking at it, right now, and it's blue.

However (says the problemist of the criterion) this is not good enough. For what evidence do you have that your eyes, specifically, are trustworthy? How do you know that your senses are not deceiving you at this very moment? The reliability of one's senses is, in and of itself, a belief that needs further justification—and so, the problemist triumphantly proclaims, the ladder of infinite descent continues.

But hold on: there is something very strange about this line of reasoning. Humans do not, as a rule, ordinarily take their senses to be in doubt. Yes, there are exceptional circumstances (optical illusions, inebriation, etc.) under which we have learned to trust our senses less than we usually do—but the presence of the word "usually" in that sentence already hints at the fact that, by default, we take our senses as a given: trustworthy, not because of some preexisting chain of reasoning that "justifies" it by tying it to some other belief(s), but simply... trustworthy. Full stop.

Is this an illegal move, according to the problemist? After all, the problemist seems to be very adamant that one cannot believe anything without justification, and in taking our senses as a given, we certainly seem to be in violation of this rule...

...and yet, most of the time things seem to work out fine for us in real life. Is this a mere coincidence? If it's true that we are actually engaging in systematically incorrect reasoning—committing an error of epistemology, an illegal move according to the rules that govern proper thought—and moreover, doing so constantly throughout every moment of our lives—one might expect us to have run into some problems by now. Yet by and large, the vast majority of humanity is able to get away with trusting their senses in their day-to-day lives; the universe, for whatever reason, does not conspire to punish our collective irrationality. Is this a coincidence, akin to the coincidences that sometimes reward, say, an irrational lottery ticket buyer? If so, there sure do seem to be quite a few repeat lottery winners on this planet of ours...

Of course, it isn't a coincidence: the reason we can get away with trusting our senses is because our senses actually are trustworthy. And it's also no coincidence that we believe this implicitly, from the very moment we are born, well before any of us had an opportunity to learn about epistemology or the "problem of the criterion". The reliability of our senses, as well as the corresponding trust we have in those senses, are both examples of properties hardcoded into us by evolution—the result of billions upon billions of years of natural selection on genetic fitness.

There were, perhaps, organisms on whom the problemist's argument would have been effective, in the ancient history of the Earth—organisms who did not have reliable senses, and who, if they had chosen to rely unconditionally on whatever senses they possessed, would have been consistently met with trouble. (And if such organisms did not exist, we can still imagine that they did.) But if such organisms did exist back then, they no longer do today: natural selection has weeded them out, excised them from the collective genetic pool for being insufficiently fit. The harsh realities of competition permit no room for organisms to whom the "problem of the criterion" poses a real issue.

And as for the problemist's recursive ladder of justification? It runs straight into the hard brick wall called "evolutionary hardcoding", and proceeds no further than that: the buck stops immediately. Evolution neither needs nor provides justification for the things it does; it merely optimizes for inclusive genetic fitness. Even attempting to apply the problemist's tried-and-true techniques to the alien god produces naught but type and category errors; the genre of analysis preferred by the problemist finds no traction whatsoever. Thus, the problemist of the criterion is defeated, and with him so too vanishes his problem.


Incidentally, what genre of analysis does work on evolution? Since evolution is an optimization process, the answer should be obvious enough: the mathematical study of optimization, with all of the various fields and subfields associated with it. But that is quite beyond the scope of this comment, which is more than long enough as it is. So I leave you with this, and sincerely request that you stop beating your hobby-horse to death: it is, as it were, already dead.

And as for the problemist's recursive ladder of justification? It runs straight into the hard brick wall called "evolutionary hardcoding", and proceeds no further than that: the buck stops immediately. Evolution neither needs nor provides justification for the things it does; it merely optimizes for inclusive genetic fitness. Even attempting to apply the problemist's tried-and-true techniques to the alien god produces naught but type and category errors; the genre of analysis preferred by the problemist finds no traction whatsoever. Thus, the problemist of the criterion is defeated, and with him so too vanishes his problem.

This feels like violent agreement with my arguments in the linked post, so I think you're arguing against some different reading of the implications of the problem of the criterion than what it does imply. It doesn't imply there is literally no way to ground knowledge, but that ground is not something especially connected to traditional notions of truth or facts, but rather in usefulness to the purpose of living.

This mostly comes up when we try to assess things like what does it mean for something to "be" an "agent". We then run headlong into the grounding problem and this becomes relevant, because what it means for something to "be" an "agent" ends up connected to what end we need to categorize the world, rather than how the world actually "is", since the whole point is that there is no fact of the matter about what "is", only a best effort assessment of what's useful (and one of the things that's really useful is predicting the future, which generally requires building models that correlate to past evidence).

dxu:

This feels like violent agreement with my arguments in the linked post, so I think you're arguing against some different reading of the implications of the problem of the criterion than what it does imply.

Perhaps I am! But if so, I would submit that your chosen phrasings of your claims carry unnecessary baggage with them, and that you would do better to phrase your claims in ways that require fewer ontological commitments (even if they become less provocative-sounding thereby).

It doesn't imply there is literally no way to ground knowledge, but that ground is not something especially connected to traditional notions of truth or facts, but rather in usefulness to the purpose of living.

In a certain sense, yes. However, I assert that "traditional notions of truths or facts" (at least if you mean by that phrase what I think you do) are in fact "useful to the purpose of living", in the following sense:

It is useful to have senses that tell you the truth about reality (as opposed to deceiving you about reality). It is useful to have a brain that is capable of performing logical reasoning (as opposed to a brain that is not capable of performing logical reasoning). It is useful to have a brain that is capable of performing probabilistic reasoning (as opposed to a brain that is not, etc. etc).

To the extent that we expect such properties to be useful, we ought also to expect that we possess those properties by default. Otherwise we would not exist in the form we do today; some superior organism would be here in our place, with properties more suited to living in this universe than ours. Thus, "traditional notions of truths and facts" remain grounded; there are no excess degrees of freedom available here.

To what extent do you find the above explanation unsatisfying? And if you do not find it unsatisfying, then (I repeat): what is the use of talking about the "problem of the criterion", beyond (perhaps) the fact that it allows you to assert fun and quirky and unintuitive (and false) things like "facts don't exist"?

This mostly comes up when we try to assess things like what does it mean for something to "be" an "agent". We then run headlong into the grounding problem and this becomes relevant, because what it means for something to "be" an "agent" ends up connected to what end we need to categorize the world, rather than how the world actually "is", since the whole point is that there is no fact of the matter about what "is", only a best effort assessment of what's useful (and one of the things that's really useful is predicting the future, which generally requires building models that correlate to past evidence).

I agree that this is a real difficulty that people run into. I disagree with [what I see as] your [implicit] claim that the "problem of the criterion" framing provides any particular tools for addressing this problem, or that it's a useful framing in general. (Indeed, the sequence I just linked constitutes what I would characterize as a "real" attempt to confront the issue, and you will note the complete absence of claims like "there is no such thing as knowledge" in any of the posts in question; in the absence of such claims, you will instead see plenty of diagrams and mathematical notation.)

It should probably be obvious by now that I view the latter approach as far superior to the former. To the extent that you think I'm not seeing some merits to the former approach, I would be thrilled to have those merits explained to me; right now, however, I don't see anything.

To what extent do you find the above explanation unsatisfying? And if you do not find it unsatisfying, then (I repeat): what is the use of talking about the "problem of the criterion", beyond (perhaps) the fact that it allows you to assert fun and quirky and unintuitive (and false) things like "facts don't exist"?

To me this is like asking what's the point in talking about a Theory of Everything when trying to talk about physics. You might complain you can do a lot of physics without it, yet we still find it useful to have a theory that unifies physics at a fundamental level (even if we keep failing to find one). I argue that the problem of the criterion fills a conceptually similar niche in epistemology: it's the fundamental thing to be understood in order to be able to say anything else meaningful and not inconsistent about how we know or what we know, which is itself fundamental to most activity. Thus it is often quite useful to appeal to, because lots of deconfusion research, like this post, is ultimately a consequence of the problem of the criterion, and so I find most object-level arguments, like those found in this post, a certain kind of wasted motion that could be avoided if only the problem of the criterion were better understood.

I agree that this is a real difficulty that people run into. I disagree with [what I see as] your [implicit] claim that the "problem of the criterion" framing provides any particular tools for addressing this problem, or that it's a useful framing in general. (Indeed, the sequence I just linked constitutes what I would characterize as a "real" attempt to confront the issue, and you will note the complete absence of claims like "there is no such thing as knowledge" in any of the posts in question; in the absence of such claims, you will instead see plenty of diagrams and mathematical notation.)

I think the framing of the sequence you link, and the way most people approach this, is missing something fundamental about epistemology, and that omission lets one get confused: one easily forgets that one's knowledge is always contingent on some assumption that one may not even be able to see, and so mistakes one's own perspective for objectivity. As for what tools understanding the problem of the criterion provides, I'd say it's more like a mindset of correctly calibrated epistemic humility. Not to say that grokking the problem of the criterion makes you perfectly calibrated in making predictions, but to say it requires adopting a mindset that is sufficient to achieve the level of epistemic humility/update fluidity necessary to become well calibrated or, dare we say, Bayesian rational.

(Note: I realize my claim about grokking the problem of the criterion sets up a potential "no true Scotsman" situation where anyone who claims to grok the problem of the criterion and then seems to lack this capacity for update fluidity can be dismissed as not really grokking it. I'm not really looking to go that far, but I want to say that this claim is, I believe, predictive enough to make useful inferences.)

It should probably be obvious by now that I view the latter approach as far superior to the former.

Maybe it is for some people (you wouldn't be the first person to make this claim). Others do seem to find my approach useful. Perhaps the whole point of this should be that not everyone is necessarily reasoning from the same base assumptions, and thus the ground of truth is unstable enough that what seem like reasonable explanations cannot be sure to be arbitrarily convincing. To be fair, Eliezer doesn't miss this point, but it seems poorly enough appreciated that I often find cause to remind people of it.

If I really wanted to be pointed about it, I think you'd be less annoyed with me if you grokked the point both Eliezer and I are trying to make in different ways about how epistemology grounds out, since taken to its extreme we must accept that the same lines of reasoning don't work for everyone on a practical level, by which I mean that even if you show someone correct math, they may yet not be convinced by it, and this is epistemically relevant and not to be dismissed since we are each performing our own reckoning of what to accept as true, no matter how much we may share in common (which, for what it's worth, brings us right back to the core of the intentional stance and the object level concerns of the OP!).

TAG:

That I am posting this comment regardless is due to the fact that I have seen you posting about your hobby-horse—the so-called “problem of the criterion”—well in excess of both the number of times and places I think it should be mentioned

It's not been mentioned enough, since the point has not generally sunk in.

The first argument attacked the idea that fundamental epistemological limitations have broader ontological implications; in doing so, it undermines the wild phrasing I often see thrown around in claims associated with such limitations (e.g. the “problem of the criterion”), and also calls into question the degree to which such limitations are important (e.g. if they don’t have hugely significant implications for the nature of reality, why does “pragmatism” not suffice as an answer?).

Pragmatism isn't a sufficient answer, because it can show that we are accumulating certain kinds of knowledge, namely the ability to predict things and make things, but does not show that we are accumulating other kinds, specifically ontological knowledge, i.e. successful correspondence to reality.

You can objectively show that a theory succeeds or fails at predicting observations, and at the closely related problem of achieving practical results. It is less clear whether an explanation succeeds in explaining, and less clear still whether a model succeeds in corresponding to the territory. The lack of a test for correspondence per se, i.e. the lack of an independent "standpoint" from which the map and the territory can be compared, is the major problem in scientific epistemology. It's the main thing that keeps non-materialist ontology going. And the lack of direct testability is one of the things that characterises philosophical problems as opposed to scientific ones (you can't test ethics for correctness, you can't test personal identity, you can't test correspondence-to-reality separately from prediction-of-observation), so the "winning" or pragmatic approach is a particularly bad fit for philosophy.

The thing scientific realists care about is having an accurate model of reality, knowing what things are. If you want that, then instrumentalism is giving up something of value to you, so long as realistic knowledge is possible. If realistic knowledge is impossible, then there's no loss of value.

Far from having no ontological implications, the problem of the criterion has mainly ontological implications, since the pragmatic response works in other areas.

Of course, it isn’t a coincidence: the reason we can get away with trusting our senses is because our senses actually are trustworthy

Trustworthy or reliable at what?

You cannot ascertain an ontologically correct model of reality just by looking at things. A model is a theoretical structure. Multiple models can be compatible with the same sense data, so a further criterion is needed. Of course, you can still do predictive, instrumental stuff with empiricism.

And as for the problemist’s recursive ladder of justification? It runs straight into the hard brick wall called “evolutionary hardcoding”, and proceeds no further than that: the buck stops immediately. Evolution neither needs nor provides justification for the things it does; it merely optimizes for inclusive genetic fitness.

That's part of the problem, not part of the solution. If evolution is optimising for genetic fitness, then it is not optimising for the ability to achieve a correct ontology ... after all, a wrong but predictive model is good enough for survival.

So I leave you with this, and sincerely request that you stop beating your hobby-horse to death: it is, as it were, already dead.

The issues I mentioned have not been answered.

dxu

It's not been mentioned enough, since the point has not generally sunk in.

I find this response particularly ironic, given that I will now proceed to answer almost every one of your points simply by reiterating one of the two arguments I provided above. (Perhaps it's generally a good idea to make sure the point of someone's comment has "sunk in" before replying to them.)

Pragmatism isn't a sufficient answer, because it can show that we are accumulating certain kinds of knowledge, namely the ability to predict things and make things, but does not show that we are accumulating other kinds, specifically ontological knowledge, i.e. successful correspondence to reality.

Suppose this is true (i.e. suppose we have no means of accumulating "ontological knowledge"). I repeat the first of my two arguments: by what mechanism does this thereby imply that no ontological facts of any kind exist? Is it not possible both that (a) the sky exists and has a color, and (b) I don't know about it? If you claim this is not possible, I should like to see you defend this very strong positive claim; conversely, if you do not make such a claim, the idea that the "problem of the criterion" has any ontological implications whatsoever is immediately dismissed.

The thing scientific realists care about is having an accurate model of reality, knowing what things are. If you want that, then instrumentalism is giving up something of value to you, so long as it's possible. If realistic knowledge is impossible, then there's no loss of value.

I repeat the second of my two arguments: to build an accurate model of reality requires taking some assumptions to be foundational; mathematicians might call these "axioms", whereas Bayesians might call them "the prior". As long as you have such a foundation, it is possible to build models that are at least as trustworthy as the foundation itself; the limiting factor, therefore, on the accumulation of scientific knowledge—or, indeed, any other kind of knowledge—is the reliability of our foundations.

And what are our foundations? They are the sense and reasoning organs provided to us by natural selection; to the extent that they are trustworthy, the theories and edifices we build atop them will be similarly trustworthy. (Assuming, of course, that we do not make any mistakes in our construction.)

So the "problem of the criterion" reduces to the question of how reliable natural selection is at building organisms with trustworthy senses; to this question I answer "very reliable indeed." Should you claim otherwise, I should like to see you defend this very strong positive claim; if you do not claim otherwise, then the "problem of the criterion" immediately ceases to exist.

Far from having no ontological implications, the problem of the criterion has mainly ontological implications, since the pragmatic response works in other areas.

I repeat the first of my two arguments: what ontological implications, and why? I should like to see you defend the (very strong) positive claim that such implications exist; or, alternatively, relinquish the notion that they do.

Of course, it isn’t a coincidence: the reason we can get away with trusting our senses is because our senses actually are trustworthy

Trustworthy or reliable at what?

Per my second argument: at doing whatever they need to do in order for us not to have been selected out of existence—in other words, at providing an effective correspondence between our beliefs and reality. (Why yes, this is the thing the "problem of the criterion" claims to be impossible; why yes, this philosophical rigmarole does seem to have had precisely zero impact on evolution's ability to build such organisms.)

Should you deny that evolution has successfully built organisms with trustworthy senses, I should like to see you defend this very strong positive claim, etc. etc.

You cannot ascertain an ontologically correct model of reality just by looking at things. A model is a theoretical structure. Multiple models can be compatible with the same sense data, so a further criterion is needed. Of course, you can still do predictive, instrumental stuff with empiricism.

The problem of selecting between multiple compatible models is not something I often see packaged with the "problem of the criterion" and others of its genre; it lacks the ladder of infinite descent that those interested in the genre seem to find so attractive, and so is generally omitted from such discussions. But since you bring it up: there is, of course, a principled way to resolve questions of this type as well; the heuristic version (which humans actually implement) is called Occam's razor, whereas the ideal version is called Solomonoff induction.

This is an immensely powerful theoretical tool, mind you: since Solomonoff induction contains (by definition) every computable hypothesis, that means that every possible [way-that-things-could-be] is contained somewhere in its hypothesis space, including what you refer to as "the ontologically correct model of reality"; moreover, one of the theoretical guarantees of Solomonoff induction is that said "correct model" will become the predictor's dominant hypothesis after a finite (and generally, quite short) amount of time.

For you to deny this would require that you claim the universe is not describable by any computable process; and I should like to see you defend this very strong positive claim, etc. etc.
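As a rough illustration of the mechanism being described, here is a toy sketch in Python. It is not Solomonoff induction proper (which is uncomputable): the hypothesis class is assumed to be just repeating bit patterns of bounded length, the prior weight `4**-len` is an arbitrary simplicity prior chosen for the example, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def hypotheses(max_len):
    """Yield every repeating bit pattern of length 1..max_len (a toy stand-in for programs)."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def prior(pattern):
    """Simplicity prior (an assumption of this sketch): shorter patterns get more weight."""
    return Fraction(1, 4 ** len(pattern))


def consistent(pattern, observations):
    """True if repeating the pattern reproduces every observed bit."""
    return all(pattern[t % len(pattern)] == bit for t, bit in enumerate(observations))


def posterior(observations, max_len=6):
    """Discard hypotheses that mispredict, then renormalise the surviving prior weights."""
    weights = {h: prior(h) for h in hypotheses(max_len) if consistent(h, observations)}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


if __name__ == "__main__":
    data = "011011011"
    for n_seen in (1, 3, 9):
        post = posterior(data[:n_seen])
        best = max(post, key=post.get)
        print(n_seen, "bits seen:", len(post), "hypotheses left; top =", best)
```

After all nine bits, only the patterns `011` and `011011` survive. They make identical predictions forever, so in this sketch the data alone cannot separate them; the ranking between them comes entirely from the prior.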

That's part of the problem, not part of the solution. If evolution is optimising for genetic fitness, then it is not optimising for the ability to achieve a correct ontology ... after all, a wrong but predictive model is good enough for survival.

Per my second argument: evolution does not select over models; it selects over priors. A prior is a tool for constructing models; if your prior is non-stupid, i.e. if it doesn't rule out some large class of hypotheses a priori, you will in general be capable of figuring out what the correct model is and promoting it to attention. For you to deny this would require that you claim non-stupid priors confer no survival advantage over stupid priors; and I should like to see you defend this very strong positive claim, etc. etc.

The issues I mentioned have not been answered.

Yes, well.

TAG

I repeat the first of my two arguments: by what mechanism does this thereby imply that no ontological facts of any kind exist

If "fact" means "statement known to be true", then it follows directly.

If "fact" means "component of reality, whether known or not", it does not follow... but that is irrelevant, since I did not deny the existence of some kind of reality.

I repeat the second of my two arguments: to build an accurate model of reality requires taking some assumptions to be foundational;

In which case, I will repeat that the only testable accuracy we have is predictive accuracy, and we do not know whether our ontological claims are accurate, because we have no direct test.

As long as you have such a foundation, it is possible to build models that are at least as trustworthy as the foundation itself;

That is a major part of the problem. Since our most fundamental assumptions aren't based on anything else, how do we know how good they are? The only solution anyone has is to judge by results, but that just goes back to the original problem of being able to test predictiveness but not ontological correspondence.

But maybe "no worse than your assumptions" is supposed to be a triumphant refutation of my claim that everything is false...but, again, I didn't say that.

And what are our foundations? They are the sense and reasoning organs provided to us by natural selection;

To reword my previous argument, sense data are not a sufficient foundation, because you cannot appeal to them to choose between two models that explain the same sense data.

Neither Gordon nor I am appealing to the unreliability of sense data. Even if sense data are completely reliable, the above problem holds.
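One way to spell this out in Bayesian terms (a sketch; the symbols H_1, H_2 and D are introduced here for illustration and stand for two rival models and a body of sense data): if both models assign the same likelihood to the data, conditioning on the data leaves their relative odds exactly where the prior put them.

```latex
\frac{P(H_1 \mid D)}{P(H_2 \mid D)}
  = \frac{P(D \mid H_1)}{P(D \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)}
  = \frac{P(H_1)}{P(H_2)}
  \quad \text{whenever } P(D \mid H_1) = P(D \mid H_2).
```

Whatever separates the two models therefore has to come from the prior, i.e. from some further criterion, rather than from observation.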

So the “problem of the criterion” reduces to the question of how reliable natural selection is at building organisms with trustworthy senses;

Of course not. There are lots of animals that have better senses than humans, and none of them have a clue about ontology.

But since you bring it up: there is, of course, a principled way to resolve questions of this type as well; the heuristic version (which humans actually implement) is called Occam’s razor, whereas the ideal version is called Solomonoff induction.

I know. It's completely standard to put forward simplicity criteria as the missing factor that allows you to choose between empirically adequate models.

The problem is that, while simplicity criteria allow you to select models, you need to know that they are selecting models that are more likely to correspond to reality, rather than on some other basis. SI fares particularly badly, because there is no obvious reason why a short programme should be true, or even that it is a description of reality at all.

For you to deny this would require that you claim the universe is not describable by any computable process

I see no strength to that claim at all. The universe is partly predictable by computational processes, and that's all we know.

It is the claim that a programme is ipso facto a description that is extraordinary.

Per my second argument: evolution does not select over models; it selects over priors. A prior is a tool for constructing models; if your prior is non-stupid, i.e. if it doesn’t rule out some large class of hypotheses a priori, you will in general be capable of figuring out what the correct model is and promoting it to attention. For you to deny this would require that you claim non-stupid priors confer no survival advantage over stupid priors; and I should like to see you defend this very strong positive claim, etc. etc.

To repeat my argument yet again, evolution only needs to keep you alive and reproducing, and merely predictive correctness is good enough for that.

you will in general be capable of figuring out what the correct model

Correct in what sense?

The basis of my argument is the distinction between predictive accuracy and ontological correctness. Your responses keep ignoring that distinction in favour of a single notion of correctness/truth/accuracy. If you could show that the two are the same, or that the one implies the other, you would be on to something.

Well, it's a hard habit to break. Everything you are saying to me now is something I used to believe for many years, until I awoke from my dogmatic slumbers.

dxu

If "fact" means "component of reality, whether known or not", it does not follow... but that is irrelevant, since I did not deny the existence of some kind of reality.

Well, good! It's heartening to see we agree on this; I would ask then why it is that so many subscribers to epistemological minimalism (or some variant thereof) seem to enjoy phrasing their claims in such a way as to sound as though they are denying the existence of external reality; but I recognize that this question is not necessarily yours to answer, since you may not be one of those people.

I repeat the second of my two arguments: to build an accurate model of reality requires taking some assumptions to be foundational;

In which case, I will repeat that the only testable accuracy we have is predictive accuracy, and we do not know whether our ontological claims are accurate, because we have no direct test.

For predictive accuracy and "ontological accuracy" to fail to correspond [for some finite period of time] would require the universe to possess some very interesting structure; the longer the failure of correspondence persists, the more complex the structure in question must be; if (by hypothesis) the failure of correspondence persists indefinitely, the structure in question must be uncomputable.

Is it your belief that one of the above possibilities is the case? If so, what is your reason for this belief, and how does it contend with the (rather significant) problem that the postulated complexity must grow exponentially in the amount of time it takes for the "best" (most predictive) model to line up with the "true" model?

[The above argument seems to address the majority of what I would characterize as your "true" rejection; your comment contained other responses to me concerning sense data, natural selection, the reliability of animal senses, etc. but those seem to me mostly like minutiae unrelated to your main point. If you believe I'm mistaken about this, let me know which of those points you would like a specific response to; in the interim, however, I'm going to ignore them and jump straight to the points I think are relevant.]

The problem is that, while simplicity criteria allow you to select models, you need to know that they are selecting models that are more likely to correspond to reality, rather than on some other basis. SI fares particularly badly, because there is no obvious reason why a short programme should be true, or even that it is a description of reality at all.

The simplicity criterion does not come out of nowhere; it arises from the fact that description complexity is bounded below, but unbounded above. In other words, you can make a hypothesis as complex as you like, adding additional epicycles such that the description complexity of your hypothesis increases without bound; but you cannot decrease the complexity of your hypothesis without bound, since for any choice of computational model there exists a minimally complex hypothesis with description length 0, beneath which no simpler hypotheses exist.

This means that for any hypothesis in your ensemble—any computable [way-that-things-could-be]—there are only finitely many hypotheses with complexity less than that of the hypothesis in question, but infinitely many hypotheses with complexity equal or greater. It follows that for any ordering whatsoever on your hypothesis space, there will exist some number n such that the complexity of the kth hypothesis H_k exceeds some fixed complexity C for any k > n... the upshot of which is that every possible ordering of your hypothesis space corresponds, in the limit, to a simplicity prior.
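The counting step can be written out explicitly. This is a sketch that assumes hypotheses are encoded as binary strings and writes K(H) for the description length of H; both the encoding and the notation are introduced here for illustration.

```latex
\#\{H : K(H) < C\} \;\le\; \sum_{\ell=0}^{C-1} 2^{\ell} \;=\; 2^{C} - 1 \;<\; \infty
\quad\Longrightarrow\quad
\text{for any enumeration } H_1, H_2, \dots \ \ \exists\, n \ \text{ such that } K(H_k) \ge C \ \text{ for all } k > n .
```

Only finitely many hypotheses fall below any given complexity threshold, so every enumeration must eventually leave them all behind.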

Does it then follow that the universe we live in must be a simple one? Of course not—but as long as the universe is computable, the hypothesis corresponding to the "true" model of the universe will live only finitely far down our list—and each additional bit of evidence we receive will, on average, halve that distance. This is what I meant when I said that the (postulated) complexity of the universe must grow exponentially in the amount of time any "correspondence failure" can persist: each additional millisecond (or however long it takes to receive one bit of evidence) that the "correspondence failure" persists corresponds to a doubling of the true hypothesis' position number in our list.

So the universe need not be simple a priori for Solomonoff induction to work. All that is required is that the true description complexity of the universe does not exceed 2^b, where b represents the sum total of all knowledge we have accumulated thus far, in bits. That this is a truly gigantic number goes without saying; and if you wish to defend the notion that the "true" model of the universe boasts a complexity in excess of this value, you had better be prepared to come up with some truly extraordinary evidence.

For you to deny this would require that you claim the universe is not describable by any computable process

I see no strength to that claim at all. The universe is partly predictable by computational processes, and that's all we know.

It is the claim that a programme is ipso facto a description that is extraordinary.

This is the final alternative: the claim, not that the universe's true description complexity is some large but finite value, but that it is actually infinite, i.e. that the universe is uncomputable.

I earlier (in my previous comment) said that "I should like to see" you defend this claim, but of course this was rhetorical; you cannot defend this claim, because no finite amount of evidence you could bring to bear would suffice to establish anything close. The only option, therefore, is for you to attempt to flip the burden of proof, claiming that the universe should be assumed uncomputable by default; and indeed, this is exactly what you did: "It is the claim that a program is ipso facto a description that is extraordinary."

But of course, this doesn't work. "The claim that a program can function as a description" is not an extraordinary claim at all; it is merely a restatement of how programs work: they take some input, perform some internal manipulations on that input, and produce an output. If the input in question happens to be the observation history of some observer, then it is entirely natural to treat the output of the program as a prediction of the next observation; there is nothing extraordinary about this at all!

So the attempted reversal of the burden of proof fails; the "extraordinary" claim remains the claim that the universe cannot be described by any possible program, regardless of length, and the burden of justifying such an impossible-to-justify claim is, thankfully, not my problem.

:P

TAG

But of course, this doesn’t work. “The claim that a program can function as a description” is not an extraordinary claim at all; it is merely a restatement of how programs work: they take some input, perform some internal manipulations on that input, and produce an output. If the input in question happens to be the observation history of some observer, then it is entirely natural to treat the output of the program as a prediction of the next observation; there is nothing extraordinary about this at all!

Emphasis added. You haven't explained how a programme functions as a description. You mentioned description, and then you started talking about prediction, but you didn't explain how they relate.

So the attempted reversal of the burden of proof fails; the “extraordinary” claim remains the claim that the universe cannot be described by any possible program, regardless of length,

The length has nothing to do with it -- the fact it is a programme at all is the problem.

On the face of it, Solomonoff Inductors contain computer programmes, not explanations, not hypotheses and not descriptions. (I am grouping explanations, hypotheses and beliefs as things which have a semantic interpretation, which say something about reality. In particular, physics has a semantic interpretation in a way that maths does not.)

The Yudkowskian version of Solomonoff switches from talking about programs to talking about hypotheses as if they are obviously equivalent. Is it obvious? There's a vague and loose sense in which physical theories "are" maths, and computer programs "are" maths, and so on. But there are many difficulties in the details. Neither mathematical equations nor computer programmes contain straightforward ontological assertions like "electrons exist". The question of how to interpret physical equations is difficult and vexed. And a Solomonoff inductor contains programmes, not typical physics equations. Whatever problems there are in interpreting maths ontologically are compounded when you have the additional stage of inferring maths from programmes.

In physics, the meanings of the symbols are taught to students, rather than being discovered in the maths. Students are taught that in f=ma, f is force, m is mass and a is acceleration. The equation itself, as pure maths, does not determine the meaning. For instance, it has the same mathematical form as P=IV, which "means" something different. Physics and maths are not the same subject, and the fact that physics has a real-world semantics is one of the differences.

Similarly, the instructions in a programme have semantics related to programme operations, but not to the outside world. The issue is obscured by thinking in terms of source code. Source code often has meaningful symbol names, such as MASS or GRAVITY... but that's to make it comprehensible to other programmers. The symbol names have no effect on the function and could be mangled into something meaningless but unique. And an SI executes machine code anyway... otherwise, you can't meaningfully compare programme lengths. Note how the process of backtracking from machine code to meaningful source code is a difficult one. Programmers use meaningful symbols because you can't easily figure out what real-world problem a piece of machine code is solving from its function. One number is added to another... what does that mean? What do the quantities represent?
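A small Python sketch of the point about symbol names. The three functions below are hypothetical examples, and the bytecode comparison relies on a CPython implementation detail (local variables are referenced by position rather than by name), so treat it as an illustration rather than a guarantee about every interpreter.

```python
def force(mass, acceleration):
    # Read by a physicist as Newton's second law: f = m * a
    return mass * acceleration


def power(current, voltage):
    # Read by an engineer as electrical power: P = I * V
    return current * voltage


def f(a, b):
    # The same computation with the names mangled into something meaningless
    return a * b


if __name__ == "__main__":
    # All three compute the same product for the same inputs.
    print(force(2.0, 3.0), power(2.0, 3.0), f(2.0, 3.0))
    # In CPython the raw bytecode is also the same: the names leave no trace in it.
    print(force.__code__.co_code == power.__code__.co_code == f.__code__.co_code)
```

Nothing in the executed instructions distinguishes "force" from "power"; that reading lives in the names and comments, which exist only for human readers.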

Well, maybe programmes-as-descriptions doesn't work on the basis that individual symbols or instructions have meanings in the way that natural language words do. Maybe the programme as a whole expresses a mathematical structure as a whole. But that makes the whole situation worse, because it adds an extra step, the step of going from code to maths, to the existing problem of going from maths to ontology.

The process of reading ontological models from maths is not formal or algorithmic. It can't be asserted that SI is the best formal epistemology we have and also that it is capable of automating scientific realism. Inasmuch as it is realistic, the step from formalism to realistic interpretation depends on human interpretation, and so is not formal. And if SI is purely formal, it is not realistic.

But code already is maths, surely? In physics the fundamental equations are on a higher abstraction level than a calculation: they generally need to be "solved" for some set of circumstances, to obtain a more concrete equation you can calculate with. To get back to what would normally be considered a mathematical structure, you would have to reverse the original process. If you succeed in doing that, then SI is as good or bad as physics... remember that physics still needs ontological interpretation. If you don't succeed in doing that... which you might not, since there is no algorithmically reliable method for doing so... then SI is strictly worse than ordinary science, since it has an extra step of translation from calculation to mathematical structure, in addition to the standard step of translation from mathematical structure to ontology.

That's part of the problem, not part of the solution. If evolution is optimising for genetic fitness, then it is not optimising for the ability to achieve a correct ontology ... after all, a wrong but predictive model is good enough for survival.

In many ways this is the crux of things. The problem of the criterion does mean that we can't ground knowledge in the ways we had hoped to, but we can still ground knowledge, just in something quite a bit different from the objective: namely, in some practical purpose for which we use knowledge.

TAG

But that still doesn't give us ontological knowledge, if we ever wanted it: we have to settle for less.

It's not been mentioned enough, since the point has not generally sunk in.

For what it's worth, I feel exactly the same way about the robustness of Goodhart, and I'll keep beating that drum as long as I have to. Luckily, no one much objects that Goodharting is a problem, whereas everyone seems to be annoyed with epistemology that makes a fairly simple point, one that is counterintuitive given the experience of being able to use knowledge to do useful things. People take that experience to be somehow contrary to the epistemological point, which only becomes relevant when you try to ground concepts rather than just use them as you find them.

Many people believe that they already understand Dennett's intentional stance idea, and due to that will not read this post in detail. That is, in many cases, a mistake. This post makes an excellent and important point, which is wonderfully summarized in the second-to-last paragraph:

In general, I think that much of the confusion about whether some system that appears agent-y “really is an agent” derives from an intuitive sense that the beliefs and desires we experience internally are somehow fundamentally different from those that we “merely” infer and ascribe to systems we observe externally. I also think that much of this confusion dissolves with the realization that internally experienced thoughts, beliefs, desires, goals, etc. are actually “external” with respect to the parts of the mind that are observing them—including the part(s) of the mind that is modeling the mind-system as a whole as “being an agent” (or a “multiagent mind,” etc.). You couldn't observe thoughts (or the mind in general) at all if they weren't external to "you" (the observer), in the relevant sense.

The real point of the intentional stance idea is that there is no fact of the matter about whether something really is an agent, and that point is most potent when applied to ourselves. It is neither the case that we really truly are an agent, nor that we really truly are not an agent.

This post does an excellent job of highlighting this facet. However, I think this post could have been more punchy. There is too much meta-text of little value, like this paragraph:

In an attempt to be as faithful as possible in my depiction of Dennett’s original position, as well as provide a good resource to point back to on the subject for further discussion[1], I will err on the side of directly quoting Dennett perhaps too frequently, at least in this summary section.

In a post like this, do we need to be forewarned that the author will err, perhaps too frequently, on the side of directly quoting Dennett, at least in the summary section? No, we don't need to know that. In fact the post does not contain all that many direct quotes.

At the top of the "takeaways" section, the author gives the following caveat:

Editorial note: To be clear, these “takeaways” are both “things Dan Dennett is claiming about the nature of agency with the intentional stance” and “ideas I’m endorsing in the context of deconfusing agency for AI safety.”

The word "takeaways" in the heading already tells us that this section will contain points extracted by the reader that may or may not be explicitly endorsed by the original author. There is no need for extra caveats; they just lead to a bad reading experience.

In the comments section, Rohin makes the following very good point:

I mostly agree with everything here, but I think it is understating the extent to which the intentional stance is insufficient for the purposes of AI alignment. I think if you accept "agency = intentional stance", then you need to think "well, I guess AI risk wasn't actually about agency".

Although we can "see through" agency as not-an-ontologically-fundamental-thing, nevertheless we face the practical problem of what to do about the (seemingly) imminent destruction of the world by powerful AI. What actually should we do about that? The intentional stance not only fails to tell us what to do, it also fails to tell us how any approach to averting AI risk can co-exist with the powerful deconstruction of agency offered by the intentional stance idea itself. If agency is in the eye of the beholder, then... what? What do we actually do about AI risk?

Nice summary :) It's relevant for the post that I'm about to publish that you can have more than one intentional-stance view of the same human. The inferred agent-shaped model depends not only on the subject and the observer, but also on the environment, and on what the observer hopes to get by modeling.

Related: Marr's Tri-Level Hypothesis, which offers three similar perspectives:

  • Physical/implementation: How is the system physically realized (as neurons in a brain, transistors in a modern CPU, etc.)?
  • Algorithmic/representational: How does the information processing system perform its operations, what are the representations/data structures, and what is the "trace" of the algorithm?
  • Computational: What problem does the system overcome, and why does it overcome that problem?

These three to me seem to correspond eerily well to Dennett's stances: physical stance↔physical/implementation level, design stance↔algorithmic/representational level, and intentional stance↔computational level.

I often feel like I want to add a fourth level to Marr's hierarchy, treating the system as a black box and looking simply at outputs and inputs; perhaps one could call that the "functional" or "mapping" level.

Concepts are generally clusters and I would say that being well-predicted by the Intentional Strategy is one aspect of what is meant by agency.

Another aspect relates to the interior functioning of an object. A very simple model would be to say that we generally expect the object to have a) some goals, b) counterfactual modeling abilities and c) to pursue the goals based on these modeling abilities. This definition is less appealing because it is much more vague and each of the elements in the previous sentence would need further clarification; however this doesn't mean that it is any less of a part of what people are generally imagining when they think of an agent. Humans come pre-equipped with at least a vague and casual sense of what these kinds of terms mean, so the above description is already sufficient for us to say, for example, that a metal ball that seems agentic according to the Intentional Stance because it is controlled by a magnet isn't agentic (on its own) according to the interior functioning stance.

I don't have time to expand on every aspect here (especially since these definitions would require further expansion; and so on), so I'll just focus on the notion of goals. Here are some relevant considerations for being considered as a goal:

  • Human-like goals are more likely to be considered goals than, for example, printing out every number that meets 20 conditions without falling into one of 300 exceptions. However, we would be more likely to accept this as a goal if we were told that there was a simple reason for performing such a weird analysis (e.g. legal compliance).
  • We are more likely to consider a system to have goals if it represents them simply, but again, if we're given a sufficient reason we might still accept it as a goal (for example if we were told that the representation was due to the hard drive being protected by encryption).
  • The goals should be used to determine behavior, although we're now moving to part c) of the interior functioning requirements.

Note that a large part of the challenge is that we can't imagine every way of interpreting a system, so it would be very easy to say that a system has goals if it meets these three conditions where the conditions are broad enough that everything might be considered to have a goal. So what usually ends up happening is that we pick out properties that would seem to include most things we consider as having goals and seemingly exclude things we generally don't consider to have goals (although we normally just handwave here). And then if someone informs us that our definition picks out too much, we narrow it by adding tighter conditions. So this isn't really an objective process.

Again, our definitions have used vague language, but that's just how our minds work.

Planned summary for the Alignment Newsletter:

This post describes takeaways from [The Intentional Stance](https://mitpress.mit.edu/books/intentional-stance) by Daniel Dennett for the concept of agency. The key idea is that whether or not some system is an “agent” depends on who is observing it: for example, humans may not look like agents to superintelligent Martians who can predict our every move through a detailed understanding of the laws of physics. A system is an agent relative to an observer if the observer’s best model of the system (i.e. the one that is most predictive) is one in which the system has “goals” and “beliefs”. Thus, with AI systems, we should not ask whether an AI system “is” an agent; instead we should ask whether the AI system’s behavior is reliably predictable by the intentional stance.

How is the idea that agency only arises relative to some observer compatible with our view of ourselves as agents? This can be understood as one “part” of our cognition modeling “ourselves” using the intentional stance. Indeed, a system usually cannot model itself in full fidelity, and so it makes a lot of sense that an intentional stance would be used to make an approximate model instead.

Planned opinion:

I generally agree with the notion that whether or not something feels like an “agent” depends primarily on whether or not we model it using the intentional stance, which is primarily a statement about our understanding of the system. (For example, I expect programmers are much less likely to anthropomorphize a laptop than laypeople, because they understand the mechanistic workings of laptops better.) However, I think we do need an additional ingredient in AI risk arguments, because such arguments make claims about how an AI system will behave in novel circumstances that we’ve never seen before. To justify that claim, we need to have an argument that can predict how the agent behaves in new situations; it doesn’t seem like the intentional stance can give us that information by itself. See also [this comment](https://www.alignmentforum.org/posts/jHSi6BwDKTLt5dmsG/grokking-the-intentional-stance?commentId=rS27NBMu478YrwxBh).

This is great work. Glad that folks here take these Ryle-influenced ideas seriously and understand what it means for a putative problem about mind or agency to dissolve. Bravo.

To take the next (and, I think, final) step towards dissolution, I would recommend reading and reacting to a 1998 paper by John McDowell called "The Content of Perceptual Experience", which is critical of Dennett's view and even more Rylean and Wittgensteinian in its spirit (Gilbert Ryle was one of Dennett's teachers).

I think it's the closest you'll get to de-mystification and "de-confusion" of psychological and agential concepts. Understanding the difference between personal and subpersonal states, explanations, etc. as well as the difference between causal and constitutive explanations is essential to avoiding confusion when talking about what agency is and what enables agents to be what they are. After enough time reading McDowell, pretty much all of these questions about the nature of agency, mind, etc. lose their grip and you can get on with doing sub-personal causal investigation of the mechanisms which (contingently) enable psychology and agency (here on earth, in humans and similar physical systems).

For what it's worth, one thing that McDowell does not address (and doesn't need to for his criticism to work) but that is nonetheless essential to Dennett's theory is the idea that facts about design in organisms can reduce to facts about natural selection. To understand why this can't be done so easily, check out the argument from drift. The sheer possibility of evolution by drift (non-selective forces) confounds any purely statistical reduction of fitness facts to frequency facts. Despite the appearance of consensus, it's not at all obvious that the core concepts that define biology have been explained in terms of (reduced to) facts about maths, physics, and chemistry.

Here's a link to Roberta Millstein's SEP entry on drift (she believes drift can be theoretically and empirically distinguished from selection, so it's also worth reading some folks who think it can't be).

https://plato.stanford.edu/entries/genetic-drift/

Here's the jstor link to the McDowell paper:

https://www.jstor.org/stable/2219740

Here are some summary papers of the McDowell-Dennett debate:

https://philarchive.org/archive/DRATPD-2v1

https://mlagflup.files.wordpress.com/2009/08/sofia-miguens-c-mlag-31.pdf