I think we need more of such explanations of everyday common terms such as tool, function/purpose, agent, perception, etc. I wish there were a "grounded dictionary" where you could look up words in a way that is not mostly circular.
"everyday common terms such as tool, function/purpose, agent, perception"
I suspect getting the "true name" of these terms would get you a third of the way to resolving AI safety.
How would this method distinguish between apparently and actually optimized features? In an evolutionary example, for instance, what's the difference between a bird with:
a large beak that was optimized to consume certain kinds of food
a large beak that was the result of a genetic bottleneck that resulted from a series of accidental deaths culling small beaks from the gene pool (neutral drift).
a large beak that is the result of a single generation mutation that superficially resembles an environmental adaptation, but is, in actuality, unfit.
A large beak that helps with consuming certain kinds of food, but whose primary ancestral optimization pressure was purely sexual selection.
I remain skeptical that this approach is able to add teleological concepts back into my physical reality lexicon, but I'm willing to be convinced. (Currently, my leading theory is that teleology is a pure illusion.)
(Meta note: the post itself is not really about teleology, that's just an example. That said, here's how the tentative answers we give for teleology in the post would apply to your question.)
Insofar as the beak was optimized to consume certain kinds of food, we should find that it's unusually well-suited to those foods in multiple ways - for instance, not just that it's big, but that it's also the right shape for those foods. The more distinct unusual features can be explained by optimization toward the same purpose, the more evidence we have that optimization was applied toward that purpose.
On the other hand, if a large beak resulted from genetic drift or a single mutation which isn't actually fit, then we would not expect other features of the beak (or the rest of the bird) to also look like they're optimized for the same foods. (Note that this also applies to "features" at the genetic rather than phenotypic level: genetic drift or a single mutation would not produce a set of mutations which mostly look like they've been selected for the same criterion.)
A similar story applies to sexual selection: if a large beak could have been selected for some foods or could have been selected for sexual purposes (or some combination of the two), then we should go look at other features of the beak/bird in order to see whether those other features look like they're optimized for the same foods.
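As a rough illustration of why several independent features pointing at the same purpose count for more than one, here is a minimal Bayesian sketch. It is not part of the post's method: the function name `posterior_probability`, the prior, and the likelihood ratios are all hypothetical, and it assumes the features are conditionally independent given each hypothesis.

```python
import math


def posterior_probability(prior, likelihood_ratios):
    """Posterior P(optimized for the candidate purpose | observed features).

    prior: assumed prior probability of the hypothesis, strictly between 0 and 1.
    likelihood_ratios: for each observed feature,
        P(feature | optimized for purpose) / P(feature | not optimized for it).
    Assumption: features are conditionally independent, so log-odds add.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical numbers: a big beak alone versus big beak, right shape, right jaw muscles.
one_feature = posterior_probability(0.5, [3.0])
three_features = posterior_probability(0.5, [3.0, 3.0, 3.0])
print(one_feature, three_features)
```

Under these made-up numbers, each additional feature that fits the same purpose multiplies the odds, whereas a beak produced by drift or a one-off mutation would be expected to contribute ratios near 1 for the other features.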
Philosophy ought to deepen your understanding of things, not undermine your understanding of things.
Opening Example: Teleology
When people say “the heart’s purpose is to pump blood” or “a pencil’s function is to write”, what does that mean physically? What are “purpose” or “function”, not merely in intuitive terms, but in terms of math and physics? That’s the core question of what philosophers call teleology - the study of “telos”, i.e. purpose or function or goal.
This post is about a particular way of approaching conceptual/philosophical questions, especially for finding “True Names” - i.e. mathematical operationalizations of concepts which are sufficiently robust to hold up under optimization pressure. We’re going to apply the method to teleology as an example. We’ll outline the general approach in abstract later; for now, try to pay attention to the sequence of questions we ask in the context of teleology.
Cognition
We start from the subjective view: set aside (temporarily) the question of what “purpose” or “function” mean physically. Instead, first ask what it means for me to view a heart as “having the purpose of pumping blood”, or ascribe the “function of writing” to a pencil. What does it mean to model things as having purpose or function?
Proposed answer: when I ascribe purpose or function to something, I model it as having been optimized (in the sense usually used on LessWrong) to do something. That’s basically the standard answer among philosophers, modulo expressing the idea in terms of the LessWrong notion of optimization.
(From there, philosophers typically ask about “original teleology” - i.e. a hammer has been optimized by a human, and the human has itself been optimized by evolution, but where does that chain ground out? What optimization process was not itself produced by another optimization process? And then the obvious answer is “evolution”, and philosophers debate whether all teleology grounds out in evolution-like phenomena. But we’re going to go in a different direction, and ask entirely different questions.)
Convergence
Next: I notice that there’s an awful lot of convergence in what things different people model as having been optimized, and what different people model things as having been optimized for. Notably, this convergence occurs even when people don’t actually know about the optimization process - for instance, humans correctly guessed millennia ago that living organisms had been heavily optimized somehow, even though those humans were totally wrong about what process optimized all those organisms; they thought it was some human-like-but-more-capable designer, and only later figured out evolution.
Why the convergence?
Our everyday experience implies that there is some property of e.g. a heron such that many different people can look at the heron, convergently realize that the heron has been optimized for something, and even converge to some degree on which things the heron (or the parts of the heron) have been optimized for - for instance, that the heron’s heart has been optimized to pump blood. (Not necessarily perfect convergence, not necessarily everyone, but any convergence beyond random chance is a surprise to be explained if we’re starting from a subjective account.) Crucially, it’s a property of the heron, and maybe of the heron’s immediate surroundings, not of the heron’s whole ancestral environment - because people can convergently figure out that the heron has been optimized just by observing the heron in its usual habitat.
So now we arrive at the second big question: what are the patterns out in the world which different people convergently recognize as hallmarks of having-been-optimized? What is it about herons, for instance, which makes it clear that they’ve been optimized, even before we know all the details of the optimization process?
Candidate answer (underspecified and not high confidence, but it will serve for an example): the system has lots of parts which are all in unusual/improbable states, but all “in a consistent direction” in some sense. So it looks like all the parts were pushed away from what’s statistically typical, in “the same way”.[1]
Ideally, we could operationalize that intuitive answer in a way which would make convergence provable; it has the right flavor for a natural latent style convergence argument.
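To make the candidate answer slightly more concrete, here is a toy sketch, not the operationalization the post is asking for. It assumes each part's state has already been summarized as a z-score against some baseline distribution, that the parts are independent under that baseline, and that we have a hypothetical `direction` vector saying which way each part would need to move to serve a candidate purpose. The function name `consistency_score` is likewise made up for illustration.

```python
import math


def consistency_score(z_scores, direction):
    """Return (unusualness, alignment) for a system's parts.

    z_scores: each part's deviation from a baseline, in standard deviations.
    direction: +1 or -1 per part, the sign of deviation that would serve
        the candidate purpose (a hypothetical input for this sketch).
    unusualness: mean squared deviation; about 1 if parts look typical.
    alignment: sum of purpose-aligned deviations divided by sqrt(n); roughly
        standard normal if parts are independent draws from the baseline.
    """
    if len(z_scores) != len(direction) or not z_scores:
        raise ValueError("need one direction per part, and at least one part")
    n = len(z_scores)
    unusualness = sum(z * z for z in z_scores) / n
    alignment = sum(z * d for z, d in zip(z_scores, direction)) / math.sqrt(n)
    return unusualness, alignment


direction = [1, 1, 1, 1]
# Parts all pushed the same way, versus equally unusual parts pushed in mixed ways.
pushed_together = consistency_score([2.5, 3.0, 2.0, 2.8], direction)
pushed_apart = consistency_score([2.5, -3.0, 2.0, -2.8], direction)
print(pushed_together, pushed_apart)
```

Both example systems are equally unusual part by part; only the first has deviations that line up, which is the distinction the candidate answer is gesturing at. Where the baseline and the direction come from is exactly the part this sketch leaves open.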
Corroboration
Imagine, now, that we have a full mathematical operationalization of “parts which are all in unusual/improbable states, but all ‘in a consistent direction’”. Imagine also that we are able to prove convergence. What else would we want from this operationalization of teleology?
Well, I look at a heron, I notice that it has a bunch of parts which are all in unusual/improbable states, but all ‘in a consistent direction’ - i.e. all its parts are in whatever unusual configurations they need to be in for the heron to survive; random configurations would not do that. I conclude that the heron has been optimized. Insofar as my intuition picks up on “parts which are all in unusual/improbable states, but all ‘in a consistent direction’” and interprets that pattern as a hallmark of optimization, and my intuition is correct… then it should be a derivable/provable fact about the external world that “parts which are all in unusual/improbable states, but all ‘in a consistent direction’” occur approximately if-and-only-if a system has been optimized.
More generally: insofar as we have some intuitions about how teleology works, we should be able to prove that our operationalization/characterization indeed works that way. (Or, insofar as the operationalization doesn’t work the way we intuitively expect, we should be able to propagate the counterexamples back to our intuitions and conclude that our intuitions were wrong or required additional assumptions, as opposed to the operationalization being wrong.)
Cognition -> Convergence -> Corroboration
Let’s go back over the teleology example, with an emphasis on what questions we’re asking and why.
We start with questions about my cognition:
Two things to emphasize: first, these are questions about my own cognition (or, more generally, one person’s cognition); the answers may or may not generalize to other people. Second, they are questions about cognition itself; they’re not asking about how the external world “actually is” (at least not directly).
Some nice things about starting from questions about my cognition:
The downside is that introspection is notoriously biased and error-prone, and this is all not-very-legible and hard to test/prove. That’s fine for now; (some) legible falsifiability will enter in the next steps.
From cognition, we move on to questions about convergence:
The standard answer of interest, which generalizes well beyond teleology, is: people pick up on the same patterns in the environment, and convergently model/interpret them in similar ways. Then the generalizable question is: what are those patterns? Or, in the context of teleology:
At this point, we start to have space for falsifiable predictions and/or mathematical proof: if we have a candidate pattern, then we should be able to demonstrate/prove that it is, in fact, convergently recognized (in some reasonable sense, under some reasonable conditions) by many minds. Such a proof is where a natural latent style argument would typically come in (though of course there may be other ways to establish convergence).
Once convergence is established, we know that we’ve characterized some convergently-recognized pattern. The last step is to show that it’s the convergently-recognized pattern we’re looking for. For instance, maybe dogs are a convergently-recognized pattern in our environment, and having-been-optimized is also a convergently-recognized pattern in our environment. If we’ve established that “parts which are all in unusual/improbable states, but all ‘in a consistent direction’” is a convergently-recognized pattern in our environment, how do we argue that that pattern is the-thing-humans-call-“teleology”, as opposed to the-thing-humans-call-“dogs”?
Well, we show that the pattern has some of the other properties we expect of teleology.
More generally, this is the corroboration step. We want to prove/demonstrate some further consequences of the pattern identified in the previous step (including how it interfaces with other patterns we think we’ve identified), in order to make sure it’s the pattern we intended to find, as opposed to some other convergently-recognized pattern. This is where all your standard math (and maybe science) would come in.
Cognition -> Convergence -> Corroboration. That’s the pipeline.
Examples are Confusing, Let’s Make it Really Abstract!
The Cognition -> Convergence -> Corroboration Algorithm:
Upon failure sending you back to step 1, three things could be wrong. Use magic to figure out which it is:
Also, obviously, if you’re caught in a loop (e.g. failing step 3 and going back to step 2.a over and over), jump back a bit further, e.g. to step 1.
When is the Cognition -> Convergence -> Corroboration Pipeline Useful?
The central use case is:
Most topics studied in philosophy are in-scope. Most (but importantly not all) “deconfusion” work is in-scope.
Beyond just a useful process to follow for such use-cases, we’ve also found the Cognition -> Convergence -> Corroboration structure useful for organizing thoughts/arguments: it’s useful to explicitly distinguish a cognitive characterization from a convergent pattern characterization from a consequence. For instance, we’ve often found it useful to explain some problem we’re thinking about as “What are the patterns/structures in the world which people convergently recognize as X?”.
Some use-cases for which this pipeline is probably not the right tool:
If you want to see more examples where we apply this methodology, check out the Tools post, the recent Corrigibility post, and (less explicitly) the Interoperable Semantics post.
Thank you to Steve Petersen and Ramana Kumar for our discussions of teleology; it was in those discussions that the example in this post bubbled around in my head.
[1] If “unusual/improbable” still sounds too subjective, then you can think of operationalizing it in the Solomonoff/Kolmogorov sense, i.e. in terms of compressibility using a simple Turing machine.
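Kolmogorov complexity itself is uncomputable, so any runnable illustration has to swap in a stand-in. The sketch below uses `zlib` compression as a crude, computable upper-bound proxy for compressibility; that substitution, the function name `compression_ratio`, and the example byte strings are assumptions for illustration, not the operationalization the footnote has in mind.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size, using zlib at maximum level.

    A low ratio means zlib found a short description of the data. This is
    only a rough proxy: zlib misses many regularities that a short program
    on a simple Turing machine could exploit.
    """
    if not data:
        raise ValueError("data must be non-empty")
    return len(zlib.compress(data, 9)) / len(data)


structured = b"ab" * 2000        # highly regular, so it compresses well
random_like = os.urandom(4000)   # no structure for zlib to exploit
print(compression_ratio(structured), compression_ratio(random_like))
```

The regular string compresses to a small fraction of its length while the random bytes do not shrink at all, which is the kind of observer-independent contrast the footnote appeals to.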