note to self: go through and read Tsvi posts on his blog, as he seems to take a long time to post them to lesswrong. (perhaps that could change? I'm curious why that is the case)
It's a makeshift stop-gradient. I feel less like I'm writing to LessWrong if I'm not publishing it immediately, and although LW is sadly the best place on the internet that I'm aware of, it's very much not in aggregate a gradient I want. Sometimes I write posts intended for LW and publish them immediately.
This section is probably my favorite thing you (Tsvi) have written, and motivated me to read through all your alignment related posts on your blog.
Before I read that passage, I was confident that deconfusion research was the highest value thing I could be doing (and getting better at), but I did not have a succinct way of communicating the fact that me seeming confused about a certain concept is not a sign that I have worse understanding about the problem involved compared to someone who doesn't seem confused.
There's a misconception where most people pattern match confidence in one's understanding of a concept / domain to better understanding of the domain, and vagueness in describing a concept to not quite understanding the domain. I notice hints of this even in rationalist friends I have, the ones who have read The Sequences and have a strong aversion to stuff that, in their head, pattern matches to making basic rationality mistakes. Reading this passage helped me have a handle on why I felt that my epistemic state was still better than that of others who seemed more confident in their claims.
Also, I feel like this somewhat relates to Eliezer's aversion to bio-anchors and concrete 'base rates', but I don't yet have a good way of clarifying it in my head.
A lot of the examples of the concepts that you list already belong to established scientific fields: math, logic, probability, causal inference, ontology, semantics, physics, information theory, computer science, learning theory, and so on. These concepts don't need philosophical re-definition. Respecting the field boundaries, and the ways that fields are connected to each other via other fields (e.g., math and ontology to information theory/CS/learning theory via semantics) is also I think on net a good practice: it's better to focus attention on the fields that are actually most proto-scientific and philosophically confusing: intelligence, sentience, psychology, consciousness, agency, decision making, boundaries, safety, utility, value (axiology), and ethics[1].
Then, to make the overall idea solid, I think it's necessary to do a couple of extra things (you may already mention this in the post, but I semi-skimmed it and maybe missed these).
The good news is that there are now sufficient (or almost sufficient) affordances to build AI agents that can embody sufficiently realistic and rich versions of these theories, in realistic simulated environments as well as in real life. And I think an actual R&D agenda proposal should be written about this and submitted for a Superalignment grant.
There's an instinct to "ground" or "found" concepts. But there's no globally privileged direction of "more grounded" in the space of possible concepts. We have to settle for a reductholistic pluralism——or better, learn to think rightly, which will, as a side effect, make reductholism not seem like settling.
I disagree with the last sentence: "reductholism" should be the settling, as I argue in "For alignment, we should simultaneously use multiple theories of cognition and value". (Note that this view itself is based largely on quantum information theory: see "Information flow in context-dependent hierarchical Bayesian inference".)
A counterargument could be made here that although logic, causal inference, ontology, semantics, physics, information theory, CS, learning theory, and so on are fairly established and all have SoTA, mature theories that look solid, these are probably not the final theories in all or many of these fields, and philosophical poking could highlight the problems with these theories, and perhaps this will actually be the key to "solving alignment". I agree that this is in principle a possible chain of events, but it looks to have quite low expected impact to me from the "hermeneutic nets" perspective, so this agenda is still better off focusing on the "core confusing" fields (intelligence, agency, ethics, etc.) and treating the established fields and the concepts therein "as given".
[Metadata: crossposted from https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html. First completed September 4, 2023.]
A hermeneutic net for agency is a natural method to try, to solve a bunch of philosophical difficulties relatively quickly. Not to say that it would work. It's just the obvious thing to try.
Thanks to Sam Eisenstat for related conversations.
Summary
To create AGI that's aligned with human wanting, it's necessary to design deep mental structures and resolve confusions about mind. To design structures and resolve confusions, we want to think in terms of suitable concepts. We don't already have the concepts we'd need to think clearly enough about minds. So we want to modify our concepts and create new concepts. The new concepts have to be selected by the Criterion of providing suitable elements of thinking that will be adequate to create AGI that's aligned with human wanting.
The Criterion of providing suitable elements of thinking is expressed in propositions. These propositions use the concepts we already have. Since the concepts we already have are inadequate, the propositions do not express the Criterion quite rightly. So, we question one concept, with the goal of replacing it with one or more concepts that will more suitably play the role that the current concept is playing. But when we try to answer the demands of a proposition, we're also told to question the other concepts used by that proposition. The other concepts are not already suitable to be questioned——and they will, themselves, if questioned, tell us to question yet more concepts. Lacking all conviction, we give up even before we are really overwhelmed.
The hermeneutic net would brute-force this problem by analyzing all the concepts relevant to AGI alignment "at once". In the hermeneutic net, each concept would be questioned, simultaneously trying to rectify or replace that concept and also trying to preliminarily analyze the concept. The concept is preliminarily analyzed in preparation, so that, even if it is not in its final form, it at least makes itself suitably available for adjacent inquiries. The preliminary analysis collects examples, lays out intuitions, lays out formal concepts, lays out the relations between these examples, intuitions, and formal concepts, collects desiderata for the concept such as propositions that use the concept, and finds inconsistencies in the use of the concept and in propositions asserted about it. Then, when it comes time to think about another related concept——for example, "corrigibility", which involves "trying" and "flaw" and "self" and "agent" and so on——those concepts ("flaw" and so on) have been prepared to well-assist with the inquiry about "corrigibility". Those related concepts have been prepared so that they easily offer up, to the inquiry about "corrigibility", the rearrangeable conceptual material needed to arrange a novel, suitable idea of "flaw"——a novel idea of "flaw" that will both be locally suitable to the local inquiry of "corrigibility" (suitable, that is, in the role that was preliminarily assigned, by the inquiry, to the preliminary idea of "flaw"), and that will also have mostly relevant meaning mostly transferable across to other contexts that will want to use the idea of "flaw".
The need for better concepts
Hopeworthy paths start with pretheoretical concepts
The only sort of pathway that appears hopeworthy to work out how to align an AGI with human wanting is the sort of pathway that starts with a pretheoretical idea that relies heavily on inexplicit intuitions, expressed in common language. As an exemplar, take the "Hard problem of corrigibility":
I find this compelling. But also, it rests on a bunch of ideas, and these ideas are being used in a way that pushes them to their limits or beyond.
For example, the compellingness of this paragraph rests on using evaluative notions like "error" and "good thing" and "correct" and "mistaken" in a way that calls on intuitions about human wanting. It feels like I have some preliminarily grasped idea of what it would be like to believe that I'm flawed, and that my current disposition towards correcting my flaws is also flawed——and that this other thing over there is exerting the right sort of control to rightly correct those flaws. I feel like I have the context needed to gemini model the possible "belief state" that a mind could be in, when it believes "I am flawed, potentially at any level of description.". That state of mind, which I intuitively think I have some preliminary grasp on, intuitively implies many desirable behaviors called corrigibility. But, the most prominent technical concept of evaluation that I have——a utility function——doesn't work here: it only recovers some but not all of the desired properties of the intuitive idea. (Other proposals for replacement technical concepts also don't work; they only recover a different strict subset of the desired properties.)
The upshot is this: our concepts are inadequate for alignment. We don't know what sort of mental element determines the effects of a mind, certainly not well enough to specify those effects. The concepts that we already have are either confused——inexplicit, conflationary, inconsistently used, internally logically inconsistent——or not even potentially adequate to play the role we're trying to have them play in our thinking. We don't see deeply enough into the structure of those sorts of computations that bring about large general effects in the world, to reach in and design / find / select / grow / describe / recognize / build such computations so that whatever determines their effects will determine the effects to be [liked by us when we're fully informed and our judgement is not manipulated].
Pretheoretical concepts don't automatically direct thinking rightly
What to do with pretheoretical concepts that suggest some hopeworthy paths, but aren't yet adequate? Here are two ways to just go on with aspects of the hopeworthy path, without worrying too much about the inadequacy of the concepts involved:
A fictional dialogue:
Ḥaḥam: "My plan to make aligned AI is to make the AI be honest."
Rasha: "Suppose you have an honest AI. How does that get you aligned AI?"
Ḥaḥam: "We ask the AI what the results of its planned actions will be, and how confident it is that it knows what all the main results are. Since the AI is honest, we can use its answers to filter out any plans that aren't confidently known to have effects we like."
Rasha: "So, one problem with this is that, even assuming you can avoid having the AI kill you through a side-channel, unless you can point the AI's optimization power specifically towards plans that actually have large effects that you like, all of its plans will have large effects that you very much dislike. If you reject all the plans that the AI honestly reports as having bad effects, you'll reject all of its plans (assuming no bad ones get through), making this AI aligned only in the sense that a literal sponge is aligned."
Ḥaḥam: "If that turns out to be a problem, then we can also leverage the AI's honesty to dig in to the generators of its plans, rather than just the plans themselves, and make more high-level / generator-level adjustments to how the AI is searching for plans. For example, we can ask the AI what it is trying to do in its planning, and if it's trying to do the wrong thing, we at least know what we need to correct."
Rasha: "Before, you were talking about the AI being honest about the likely results of its planned actions. That feels relatively more straightforward than asking about generator-ish elements of the AI. It seems potentially significantly less difficult and confusing to design and test an AI to make accurate reports about effects of actions, compared to accurate reports about "what the AI wants" and similar. How would you go about trying to make an AI honest?"
Ḥaḥam: "For starters, we can design training mechanisms for reporter systems that, given an AI, learn where various variables are stored, so to speak. Then, to know what the AI thinks of some variable, we run the trained reporter system on the AI."
Rasha: "This will discover variables that you know how to evaluate, like where the cheese is in the maze——you have access to the ground truth against which you can compare a reporter-system's attempt to read off the position of the cheese from the AI's internals. But this won't extend to variables that you don't know how to evaluate. So this approach to honesty won't solve the part of alignment where, at some point, some mind has to interface with ideas that are novel and alien to humanity and direct the power of those ideas toward ends that humans like."
Ḥaḥam: "I'm not so sure about that. Honesty feels like a fairly simple notion. There's complications, e.g. there's built-in complexity because the reporter system will have to translate things into terms that humans can understand. But the idea of accurately reporting seems like a simple enough notion, so it might be fairly easy to point a system at making pretty comprehensive, honest reports, in a way that generalizes to new cases, even ones of a new kind."
Rasha: "This might be right, but notice that you're making a function call to quite a lot of general intelligence in your supposed reporter system. It's figuring out how to think at the level of the base system, and it's figuring out how to communicate about that. Pointing such a system at all, even at a maybe-simple goal like honest reports, seems to reproduce much of the whole alignment problem!"
Ḥaḥam: "I think it's an improvement, at least in that the reporter system can be much weaker than the base system. The reporter can be sort of parasitic on the base system, copying its ideas; and it seems hard for the base system to think "in secret", in a way that the reporter can't fairly easily notice."
Rasha: "That's very far from clear to me.
... For one thing, what would constitute an honest report might be quite subtle. Imagine trying to explain to someone who is a beginner at chess what is happening in a position from a chess game between two grandmasters. You can't just straightforwardly use concepts like "weakening the pawn chain" (or [concept I'm not familiar with either]) because the beginner doesn't know what a pawn chain is, what would constitute weakening one, and what implications that would have for the position going forward. But it's hard to explain what the grandmaster is doing without using that concept——there's no good way to translate "weakening the pawn chain" into concepts like "attacking the queen with the knight" or "blocking the bishop with the pawn" or "cutting off the king with the rook" that the beginner is familiar with. You could try to compile the idea into beginner concepts, like "The bishop attacks the f2 pawn. The f2 pawn defends the e3 pawn. If the f2 pawn moves or is taken, the e3 pawn isn't defended. The e3 pawn defends the d4 pawn. If the e3 pawn moves..." and so on, but this will amount to trying to teach the beginner the concept of a pawn chain.
... For another thing, in many cases we're specifically using our honest reporter AI in a way that pushes the base AI to be more illegible. For example, if we naively delete any mental content from the base AI that the reporter reports as being about understanding how humans work (and how humans could be manipulated), we are evoking the base AI to develop conceptual Doppelgängers that are harder for the reporter to recognize as being about humans. As another example, if we try to delete terminal goals from the base AI, we are prone to instead mostly just edit the manner in which the AI pursues instrumental goals, leaving the more terminal goals still hidden, if not intact. In general, we evoke nearest unblocked strategies. This puts the AI in a state that is heavily selected to exploit any flaw in the notion of "honesty" that the reporter wants and is able to uphold."
So much for the dialogue. (It's a fictional dialogue; e.g. the authors of this post and of ELK seem to me to be perfectly aware that these concepts are problematic.)
Why does Ḥaḥam think that "an AI that honestly reports its beliefs" is a hopeworthy idea? I suspect that among other intuitions, he has an intuition like this: honesty has a sort of comprehensiveness to it. The thing that is being honest is "looking out from behind the same eyes" as the AI uses to see; the thing that is being honest is speaking from the same viewpoint as the AI (because it is the AI); any planning that the AI is doing, is flowing through the ideas that form the perspective from which the honest reports are coming; if there were optimization power flowing through the AI, that optimization power is an integral part of the AI, so that to be honest is to reveal that optimization power.
This is a possibly hopeworthy intuition. However, just taking the concept "honesty", as it is structured pretheoretically, does not automatically lead Ḥaḥam to walk the fine line of unraveling difficulties in the hopeworthy path and developing the idea so that it could be turned into an engineering specification, while keeping to the hopeworthy path.
Rather, Ḥaḥam slips off and instead works out something that doesn't have a shot at realizing the hope. The idea of a reporter AI that's trained on the task of reading off facts from a base AI's internals is a fine idea, but it has already from the outset dropped the core intuition described above. And this shows up. The core hopeworthy intuition involves comprehensiveness, and the reporter AI doesn't. The core intuition hiddenly involves the idea of a mind that is integrated in such a way that it is rightly described as believing or disbelieving propositions——so that it either behaves in all cases as if the proposition is true, or else it behaves in all cases as if the proposition is false. For such a mind, one could then say that it reports its beliefs——in the unified, comprehensive sense——accurately. It may be that this sort of honesty, fully realized, would preclude conceptual Doppelgängers. The reporter AI does not preclude Doppelgängers, which witnesses that the idea of the reporter AI has not captured the core desirable properties aspirationally claimed by the pretheoretical intuition of honesty.
The pretheoretical concept of "honesty" didn't automatically direct Ḥaḥam's thinking rightly. Ḥaḥam might have said:
That's well enough and true enough. Rasha would reply that Ḥaḥam has stopped pushing in the direction that he originally wanted to go, and is now barely moving in the direction, and can't tell that he's barely moving in that direction because he stopped tracking that direction. The original core hopeworthy intuition isn't driving the investigation.
Some confusions are essential, even if only pretheoretically described
See also: "philosophical concept laundering".
Above, two responses were listed to the problem of pretheoretical concepts that don't immediately do the work we want them to do. To that list, a third item can be added:
The third response gives up the possibly-hopeworthy paths that used the concept. In exchange it emphasizes the negative space: the paths that involve thinking in other terms, not using the abandoned concept. Is that a good trade? In some cases yes. E.g. abandoning the idea of "anger", as in "make the AI so that it doesn't experience anger", is right; the abandoned path wasn't hopeworthy. In other cases, the abandoned path was hopeworthy, so abandoning it is a cost.
What about the benefit, the newly emphasized paths? Those paths seem to have at least one thing going for them: they don't involve having to worry about the sticky confusions that the abandoned concept brings with it. Is that benefit really present? In many cases, no, it's not really present. Some confusions are forced. Such confusions are essential to the problem of making an aligned AGI——any successful approach will have to deconfuse the confusion, will have to find a way of thinking that can adequately answer to the calling that originally revealed the confusion.
For example, I suspect that we can't avoid the need to go much further in clarifying the idea of {value, intention, wanting, goal, trying, motive}. Looking at the difficult-to-clarify menagerie of what we call wanting, and the confusion that arises from using the concept "wanting" as-is, and the inadequacy of formal ideas like "utility function", it's tempting to try abandoning the idea of wanting. But this is not so easily done. Consequentialist cognition is not so easily subverted: the way that we can see it being possible for an AI to be extremely useful is for the AI to do consequentialist cognition. If there is consequentialist cognition, that raises the question: what determines the ends of the consequentialist cognition; what determines the direction of its ultimate effects?
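The situation is like Greenspun's tenth rule: "Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp."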
The structure of Lisp is out there for programming languages to lack, and for programmers to need and to find. Some of the understanding that we'll have when we understand better [what we currently call {value, intention, wanting, goal, trying, motive}] is out there for conceptual schemes to lack, and for AGI engineers to need and to find.
What does it look like when essential confusions are dodged? It looks like dirty concepts, which pretend to be philosophically innocent and unproblematic, while also trying to play roles that are rightly played by problematic concepts. It looks like conceptual Doppelgängers growing: ad hoc, informally-specified, bug-ridden, slow implementations of half of the supposedly avoided concepts. It looks like playing shell games: shuffling around the role played by the problematic concept so that locally the concept seems unnecessary, but the shuffling is not actually globally consistent.
Examples
Examples of concepts to be analyzed
The hermeneutic net approaches the problem of inadequate concepts by straddling all the relevant concepts at once, unfolding and clarifying and arranging them. The relevant concepts are the concepts that play roles in alignment-related desiderata, constraints, and hopeworthy ideas. Here are some words which are used as handles for relevant concepts:
Examples of interrelatedness
The project of large-scale conceptual revision
Criteria for conceptual revision
The task is to clarify preexisting proleptic, partial, provisional, pretheoretical concepts, and to create new concepts. The criteria that will call forth the future improved concepts have to be collected and applied to the search. What are these criteria? They all flow from the overarching criterion of thinking in a way that will produce good overall outcomes——concepts should be suitable for that sort of thinking. To give some specific classes of examples: A criterion might call for one or more concepts to:
Play a role in a proposition. That is, the concept appears in a proposition, and its appearance in the proposition determines something about how the concept should work. The concept should work in a way that renders the proposition true, interesting, relevant, useful, worthwhile. For example, the statement "a corrigible agent doesn't optimize to deceive the operators" asks for a concept of corrigibility such that, if an agent is corrigible in that sense, then the agent consequently also doesn't "optimize to deceive the operators". The statement also asks that the concepts of optimization, deception, and operator, as they are used in the statement, also be such that when an agent doesn't "optimize to deceive the operators", it doesn't, you know, trick the humans. If "optimization" is also used in another proposition such as "it is unnatural for optimization to be strong and wide-ranging without being fully-general", then there are multiple criteria pulling on the concept.
Relate to another concept by being founded on it or by providing foundation for it. E.g. "an optimization process controls the world to be in a small set of possible worlds" founds optimization on world, control, and possible. A foundation of concepts acts as a conduit between concepts for criteria on concepts. E.g. if "optimization" is used in a proposition such as "a corrigible agent doesn't optimize to deceive the operators", then the concept of "world" inherits the criteria on "optimization"——e.g., "world" has to include the agent's way of thinking, because the agent's way of thinking is one way the agent could trick the human (e.g. if the human is directly looking at the agent's internals).
Perform a task. E.g. in trying to write a computer program to display a rotating object, the concept of "rotation" is called to be put into a format that recommends a data format and transformations and displays of that data, so that the screen shows something that looks like a rotating object.
Be good, elegant, simple, general, useful, well-engineered, well-factored, precise, delineated, explicit. In other words, to satisfy the generally applicable senses that the thinker has for what is and isn't a good concept. E.g. if the concept of "values" is "the scoring function for possible worlds used to select actions; and also something about what the procedure used to resolve ambiguities in preliminary scoring functions is like", then it is sensibly not yet elegantly factored.
Capture and explicate the starting intuitions; well-describe examples. The starting intuitions around a concept are probably on the outskirts of a region containing some Things. The examples are probably best understood in terms of latent, underlying, intersecting Things. All these Things should be brought out as concepts.
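To make the "Perform a task" criterion above concrete, here is a minimal sketch in Python of the rotation example. It assumes 2D points stored as coordinate pairs and a rotation given as an angle in radians; the names `rotate`, `square`, and `frames` are illustrative and do not come from any graphics library. The only point is that the task forces "rotation" into a definite data format plus a transformation on that data.

```python
import math

def rotate(points, angle):
    """Rotate 2D points counterclockwise about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

# Hypothetical object to display: the four corners of a square.
square = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

# Successive frames of the "rotating object": the same data, transformed a bit more each time.
frames = [rotate(square, k * math.pi / 8) for k in range(4)]

for frame in frames:
    print([(round(x, 3), round(y, 3)) for x, y in frame])
```

Choosing this format already commits to an answer about what a rotation is (an angle acting on coordinates); a different task, such as composing 3D rotations, would pull the concept toward a different format.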
Three problems of conceptual revision
Infectious questioning and Indra's net
To revise a concept C, we consult the criteria for C. Some key criteria for C are likely to be propositions. Those propositions use C, and also use other concepts, say D.
In questioning C, we're looking to satisfy the criteria for C more fully than our present concept C is satisfying them. That means we're making more demands than usual on the criteria: we're trying to get advice, so to speak, from the proposition acting as a criterion for C, and we're asking the proposition for advice we haven't already heard and incorporated into C. We're asking the proposition to give us new examples, new problems, new logical implications. In asking for new advice from a proposition, we're also making new demands on the other concepts that the proposition uses, e.g. D. So now we want to revise D as well as C, as a soft prerequisite for rightly revising C.
You see where this is going. Everything wants to be pried loose all at once, too many questions are raised, and the glimpses of better concepts are hidden in the chaos. What starts as a careful analysis ends in gridlock.
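The spread of questioning described above can be sketched as a reachability computation. The following Python sketch is a toy model under strong assumptions: concepts are plain strings, each proposition is reduced to the set of concepts it uses, and questioning a concept always spreads to every other concept that shares a proposition with it. The first two propositions are quoted from this post; the third proposition and the names `PROPOSITIONS` and `concepts_pulled_in` are hypothetical.

```python
from collections import deque

# Toy data: each proposition (acting as a criterion) mapped to the concepts it uses.
PROPOSITIONS = {
    "a corrigible agent doesn't optimize to deceive the operators":
        {"corrigibility", "agent", "optimization", "deception", "operator"},
    "an optimization process controls the world to be in a small set of possible worlds":
        {"optimization", "control", "world", "possible"},
    # Hypothetical, unrelated proposition, included to show what is NOT pulled in.
    "rotation preserves distances": {"rotation", "distance"},
}

def concepts_pulled_in(start, propositions):
    """Return every concept reached by infectious questioning starting from `start`."""
    questioned = {start}
    queue = deque([start])
    while queue:
        concept = queue.popleft()
        for used in propositions.values():
            if concept in used:
                # Questioning `concept` makes demands on every co-occurring concept.
                for other in used - questioned:
                    questioned.add(other)
                    queue.append(other)
    return questioned

reached = concepts_pulled_in("corrigibility", PROPOSITIONS)
print(sorted(reached))
```

Starting from "corrigibility", the questioning reaches every concept connected through shared propositions and leaves only the disconnected "rotation" cluster alone. In a realistic collection of alignment-related propositions there would presumably be few such disconnected clusters, which is the gridlock.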
Indra's net is an infinite net going in all directions with a jewel at each lattice point. Every jewel reflects every other jewel, and reflects all of every other jewel's reflections of every other jewel, and so on. This is the situation with ideas: to turn one stone all the way over is to upend the whole world. See also Leibniz's Monadology. Everything being interrelated doesn't mean that there is no differentiation or clarification to be had; out of all the connections that are present, only relatively few are essential.
Withdrawal
When we're trying to get at new concepts, we're always dancing around them; the missing concept leaves a shadow in our understanding, a lack of clarity. We can catch glimpses of stuff that would lead to developing the missing concepts, if we follow the stuff forward. For example, the discovery of timeless decision theory growing (I speculate) from something like a combination of noticing that:
These observations suggest to a careful listener that there's a principled notion of rational decision making that wins in these cases.
A more usual reaction is to think "Hmmm... That's weird..." and then move on, or make an ad-hoc patch to the problem, or squirm away from the question (e.g. saying that the decision problems are impossible or incoherent or vague hypotheticals). The mystery withdraws, the preexisting concepts stretching implausibly to cover up the vacancy——stretching just enough to give local answers to questions, whether or not the answers are clear and globally consistent. The vacancy where a new idea should be is laundered through existing concepts, or is shuffled around in a shell game.
It's as though the soldiers of understanding, after encroaching some ways into the region of a Thing, expanding the borders of understanding some ways up the mountain of the Thing, at some point stop. They stop not when they've summitted, but before then. They stop once they're merely higher up than other forces——when the territory of understanding they've gained gives enough advantage to address the demands already clearly made on the understanding of this Thing by other neighboring regions. And they turn around, and face downhill, and dig their trenches there, as if guarding the summit from demands made by neighboring regions——a summit which they haven't themselves met.
This "just-in-time philosophy", which only engages in speculative conceptual revision when immediately forced to by nearby demands, I think will not work. I suspect that it doesn't even work very well for normal science, and that we have curiosity because evolution (so to speak) saw fit to specifically tell us to wonder what things are like even without a specific purpose.
The centripetal force of the preexisting conceptual scheme
To say it another way: The situation, as I'm here conjecturing, is that Indra's net creates a centripetal force, pulling thoughts into the convex hull of the preexisting conceptual scheme. A question always bakes in background conceptual assumptions.
Science works. Also, math works. Questioning points the way out of the existing conceptual scheme by asking us to come to terms with something other than our own ideas as they already are. So, it's not like there's some fundamental barrier here. But still, if a concept like "values" always shows up in the question, and nearby questions are taken as equivalent to a rephrasing that relies on the pretheoretic idea of "values", and the pretheoretic idea of "values" is investigated in some aspects but not others, then all questions about "values", and all the other questions that they inspire in the network of questioning, will have a correlated failure to address the uninvestigated aspects.
The centripetal force bends inquiries away from certain directions, and bends inquiries away from going too far, too persistently, too "unresponsively to data", in a single direction (too far out of the convex hull). The normal progress of science depends on anomalies continuing to make themselves visible and felt, and even to become more and more salient and pressing as the preexisting understanding works out its home-turf implications. Traveling in the space of possible conceptual schemes (e.g. learning, doing science, learning a new language) proceeds by taking some steps: steps of tweaking, refactoring, abducing, conjecturing, combining ideas. Inquiries that would only be satisfied by taking many simultaneous steps aren't pursued. In the long run there may be no such inquiries, but relative to humanity's current conceptual scheme, there are valleys of [worse, useless, uninteresting conceptual schemes that result from some, but not enough, simultaneous conceptual mutations away from the current scheme] between here and where we need to go. If that's true, then what would be required is leaps, not steps. An analogy: in bouldering, a coordination dyno is a dynamic move where the climber removes most or all of zer limbs from the wall during a movement, and then lands in a new improved position and has to arrange multiple points of contact correctly at the same time to stabilize.
Even pointing out that there's a technical metaphilosophical problem here, a problem of multiple simultaneous conceptual revisions being needed, is difficult. Radical confusion (or in other words, a missing conceptual scheme) is heavily dependent on the mental context, both for its existence and for pointing at its presence (there are more ways to be confused than ways to understand clearly). So examples don't translate well to other minds. And each example is compute-intensive to have in the first place, because an example can only come from long-term inquiry that crosses the valley the slow way (by going around, step by step along the rimdale).
To explore multiple simultaneous modifications requires more channels of creativity. Like the difference between evolution piling up isolated tweaks and a designer leaping to an island of effectiveness, across a valley of ineffectiveness, via multiple simultaneous changes, a hermeneutic net (hypothetically, hopefully) can rewrite conceptual schemes in ways that would take much longer to do by steps that rely on the usual stepwise inquiry and conceptual revision.
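The contrast between steps and leaps can be made concrete with a toy search problem. The sketch below is only an illustration under invented assumptions: the four-bit "conceptual scheme", the `fitness` landscape, and the `climb` routine are all hypothetical stand-ins, not anything proposed in this post. A searcher limited to one change at a time stays on the local peak, while a searcher allowed several simultaneous changes crosses the valley.

```python
from itertools import combinations

def fitness(scheme):
    """Hypothetical landscape: all-zeros is a local peak, all-ones is the
    better peak, and every partial mixture lies in a valley between them."""
    ones = sum(scheme)
    if ones == 0:
        return 1.0
    if ones == len(scheme):
        return 2.0
    return 0.0

def best_neighbor(scheme, max_flips):
    """Best scheme reachable by changing at most max_flips positions at once."""
    best = scheme
    for k in range(1, max_flips + 1):
        for idxs in combinations(range(len(scheme)), k):
            candidate = list(scheme)
            for i in idxs:
                candidate[i] = 1 - candidate[i]
            candidate = tuple(candidate)
            if fitness(candidate) > fitness(best):
                best = candidate
    return best

def climb(scheme, max_flips):
    """Greedy search: keep moving until no allowed move is an improvement."""
    while True:
        nxt = best_neighbor(scheme, max_flips)
        if nxt == scheme:
            return scheme
        scheme = nxt

start = (0, 0, 0, 0)
stepwise = climb(start, max_flips=1)  # one revision at a time
leaping = climb(start, max_flips=4)   # four simultaneous revisions

print(stepwise)
print(leaping)
```

In this toy setup the stepwise searcher never leaves `start`, because every single change makes things worse. The exhaustive enumeration in `best_neighbor` is also the catch: the number of candidate leaps grows combinatorially with the number of simultaneous changes, which is one way to read why leaps need "more channels of creativity".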
The hermeneutic net
Palinsynopsis
More and better concepts are needed. An inquiry aimed at adding or improving a concept will recursively make inquiries for more and better adjacent concepts. Such an inquiry will ask too many questions at once, and the fundamental problems will stay hidden.
The basic idea: brute-force global analysis
I don't know how to deal with these difficulties. The "hermeneutic net" is a conjectural method to brute-force the issue. It's not sophisticated or tested. It's just what seems to me like the first obvious thing to try.
The idea of a hermeneutic net is this: to analyze all the concepts at once.
The hope is that a larger, more systematic effort than has already been put forward might set up the conditions where the infectious questioning can be contained, and a net of mutating and splitting concepts can be cast over and pulled tight around the mysteries. In other words, the hope is to flank and cut off, from all sides, the out-of-control questioning of key concepts, by systematically setting up all the related concepts to be questioned, and in particular setting them up to be suitable for providing support to inquiries of adjacent concepts.
For example, a hermeneutic net would start with a preliminary analysis of some concept, e.g. "action". The analysis would be carried out not until "action" is fully understood in any sense, or meets any set of direct criteria (such as fully explaining some example), but instead carried out until the understanding of "action" has been brought into a state that is prepared for what comes next——as prepared as is feasible with reasonable effort. Next, an adjacent concept is analyzed, e.g. "decision". When that concept calls on the previous concept "action", for example in the criterion given by a proposition like "a decision is when an agent selects an action" or "when an agent gains coherence it has expanded the range of its actions that can be counted as decisions" or something, the preparation done with "action" is supposed to kick into gear, providing the inquiry into "decision" with as much ready help as feasible.
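The bookkeeping in this example can be sketched in code. This is a minimal sketch, not a description of an actual method: the `Concept` record, the `analyze` function, and the idea of storing preparation as a list of notes are all hypothetical names and simplifications introduced here. It shows only the control flow, in which each concept is analyzed until "prepared" rather than until finished, and a later inquiry into an adjacent concept picks up whatever preparation already exists.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Concept:
    name: str
    adjacent: List[str]                    # names of neighboring concepts
    notes: List[str] = field(default_factory=list)
    prepared: bool = False                 # "prepared for what comes next"

def analyze(net: Dict[str, Concept], name: str) -> Dict[str, List[str]]:
    """Do a preliminary analysis of one concept, drawing on whatever
    preparation already exists for its adjacent concepts."""
    concept = net[name]
    # Collect ready help from adjacent concepts that were analyzed earlier.
    support = {
        adj: list(net[adj].notes)
        for adj in concept.adjacent
        if net[adj].prepared
    }
    # Placeholder for the real work of analysis, which is not modeled here.
    concept.notes.append(
        f"preliminary analysis of {name!r} using support from {sorted(support)}"
    )
    concept.prepared = True
    return support

net = {
    "action": Concept("action", adjacent=["decision"]),
    "decision": Concept("decision", adjacent=["action"]),
}

first = analyze(net, "action")     # nothing is prepared yet
second = analyze(net, "decision")  # can call on the prepared "action"

print(first)
print(second)
```

The sketch leaves out everything hard, namely what the analysis itself consists of. It also makes a single pass; going around the net again would let "action" draw on the preparation done for "decision".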
A basic analytic method
How does the actual inquiry into a given concept go? It should go however it has to go to generate adequate concepts——which is an unboundedly complex task. But here is a starting point for a method to analyze a concept:
In short, to analyze "X":
In more detail:
Start with a hopeworthy idea. Pick a concept X used in the idea that seems like it could sharpen or dispel the hope.
Unpacking the concept.
Bringing in the criteria.
Generating concepts.
Going through the hermeneutic circle.
Names
Hermeneutics is interpretation and the study of interpretation. The hermeneutic circle goes between text and context; newly understanding the text changes the context, and a new context changes how the text is understood.
Why talk about interpretation? On one interpretation of "interpretation", "interpretation" is when a mind incorporates new structure into itself by empowering itself through that structure. How does this go with the intuitive meaning of interpretation as receiving and translating a message? Because when the message communicates something that is new to the hearer, that can't be assimilated in the hearer's preexisting conceptual scheme, the work of translation is mainly the work of grasping a new idea. As a key instance of such creative hearing, we interpret ourselves: we study ourselves as the exemplars of mind and agency, and we explicate our existing ideas about ourselves, and we coherentify ourselves by interpreting ourselves as coherent. We interpret ourselves (our ideas, pursuits, creations), for ourselves to hear, as a message from ourselves to ourselves.
The "net" is supposed to suggest a structure, distributed across lots of concepts, that might catch confusions. It is also meant to evoke Indra's net.
Alternative names:
Hermeneutic cycle, suggesting ongoingness, or spiral, helix, suggesting upward motion.
Conceptual engineering.
Abductive, analytic, analytic/synthetic net.
Hermeneutic mesh, scaffold, network, web.
Hermeneutic load-distribution, suggesting the way that load in a building is spread out so that no one support beam has to bear all the weight of the whole building——finding concepts that each do enough work that deep confusions can be unraveled...
Difficulties with the hermeneutic net for agency
Good babble requires good prune
The hardest part of making good new concepts is coming up with any possibilities that are at all novel and at all have a chance of being useful for the task at hand. The first thing to try is Babble. But good babble requires good Prune. Without good prune, the products of babble are all highly correlated: they all allow themselves the same errors (the same mistaken assumptions, the same unhopeworthy hopes, the same ill-suited concepts, the same violated conservation laws, the same unknown and unheeded impossibility proofs, the same shell games, the same unsatisfied desiderata, the same unexplored regions). Highly correlated babble is very little babble at all.
Wittgenstein's tarpit
A trap that scares off investigators from systematic conceptual revision is Wittgenstein's tarpit. In Wittgenstein's tarpit, the meaning of words is questioned, not just as in "What is the meaning of 'X'?" but as in "Maybe there's no such thing as a meaning of 'X'." This questioning, on its own a part of a healthy hermeneutic, can congeal into a sticky denial of the Thingness of Things. Any proposition about the nature of things is met with an unendable "critique" that just takes the form of repeatedly disallowing the natural use of any idea brought up to justify the use of another idea.
No privileged foundational direction
There's an instinct to "ground" or "found" concepts. But there's no globally privileged direction of "more grounded" in the space of possible concepts. We have to settle for a reductholistic pluralism——or better, learn to think rightly, which will, as a side effect, make reductholism not seem like settling.
The whole mind is involved in any of its aspects
A hermeneutic net for mind has to understand the role that an element of a mind plays in the mind. In many cases this role only makes sense within the context of being a whole mind. Since the whole mind is far from fully understood, the element isn't fully understood. In other words, the element has to be gemini modeled where the context is the whole mind.
Provisionality
Since an agent potentially {changes, unfolds, grows, self-modifies}, any aspect of it might change. So concepts about agents are by default essentially provisional. That is, to be a concept that well-describes something about an agent, the concept has to have some openness to relate to the agent's ongoing unfolding, and so is provisional by nature, unlike more clean-cut concepts, such as concepts about simple physical systems. To put it another way, however much we try, we won't be able to understand everything about future very advanced agents. The elements of future very advanced agents that are novel to us will also change the context of even elements that we do understand ahead of time, rendering them alien.
Silently imputing the ghost in the machine
It is so natural for us to gemini model aspects of other humans, or possibilities for our own mental elements, that we do it without knowing that we're doing it. We impute a ghost in the machine without knowing what assumptions we're thereby making. E.g. we think of an agent having a belief, and assume that it has a belief in the way we have beliefs.
Imputing the ghost in the machine goes beyond anthropomorphism. Imputing anger to an alien agent is anthropomorphizing——assuming the agent is human-shaped. This is a mistake because the agent need not be human-shaped. There are agents that are full agents with full minds, that don't have anger. Imputing the ghost in the machine may impute very general properties to an agent that aren't especially human-shaped. That may be a mistake in two ways:
For example, what is it like to be a bat? I imagine closing my eyes, and then getting a ghostly 3D point-cloud image, showing the ray-ends of rays radiating out from my head. This is probably not right, but even if it is right, I'm assuming that the bat thinks in terms of 3D space. That is probably right——but it's important that I'm assuming that the bat thinks in terms of 3D space. I might not notice that I'm assuming that. I just "put myself behind the eyes" of the bat, and in so doing I import the ghost, my mind, the machinery that I don't notice as it constitutes my world for me. When I "put myself behind the eyes" of the bat, I unconsciously bring along (that is, silently impute) the 3D scene modeler. Imagine trying to reprogram a bat to live in a 4D world. Where would you even start? It will be difficult anyway, but I think it will be extra difficult if you don't realize that the reason you think the bat thinks about 3D space in such and such a way, is that you're calling on your own 3D space modeler. Until you notice that you're thinking about the bat in that way, you might be confused about what you're even trying to do. Isn't that just... how space is? How does space work... It's like this [that is, like this space around me, which I'm looking at from behind my eyes using my 3D space modeler], is it not? So that's how space is. And now I want to make space different to a bat? How does that make sense, how could it be different to the bat, given that that's just how space is?
For example, see "What the Tortoise Said to Achilles". Achilles imputes a dynamic to the tortoise, which he transparently takes as just an aspect of speaking sentences.
For example, sometimes people believe that, for some X, we just need X to make AGI from current ML systems. Sometimes they believe this because they are imputing the ghost in the machine. E.g.: "LLMs don't get feedback from the environment, where they get to try an experiment and then see the results from the external world. When they do, they'll be able to learn unboundedly and be fully generally intelligent." I think what this person is doing is imagining themselves without feedback loops with external reality; then imagining themselves with feedback loops; noticing the difference in their own thinking in those two hypotheticals; and then imputing the difference to the LLM+feedback system, imagining that the step LLM ⟶ LLM+feedback is like the step human ⟶ human+feedback. In this case imputing the ghost is a mistake in both ways: they don't realize that they're making that imputation, and the LLM+feedback system actually doesn't have the imputed capabilities. They're falsely imputing [all those aspects of their mind that would be turned on by going from no-feedback to yes-feedback] to the LLM+feedback. That's a mistake because really the capabilities that come online in the human ⟶ human+feedback step require a bunch of machinery that the human does have, in the background, but that the LLM doesn't have (and the [LLM+feedback + training apparatus] system doesn't have the machinery that [human + humanity + human evolution] has).
It's the local differences in our experience that we notice, against a fixed unnoticed background. We notice the event of the update, but not the fixed Bayesian laws; we notice the change in our visual field, but not the 2.5D structure of our visual perception; we notice that we want vanilla ice cream, not chocolate ice cream, but not the wanting to eat or the structure of pursuing what we want or the structure of reflecting on and choosing our values. We then ask about an alien agent "Does it like vanilla ice cream or chocolate ice cream?" and we don't ask "In what manner does it want?".
If the ghost in the machine is imputed, and that imputation isn't noticed, there's a higher risk of merely rearranging confusion, playing shell games with the confusions about the hidden machinery.