Weak vs Quantitative Extinction-level Goodhart's Law
This post overlaps with our recent paper Extinction Risk from AI: Invisible to Science?.

tl;dr: With claims such as "optimisation towards a misspecified goal will cause human extinction", we should be more explicit about the order of quantifiers (and the quantities) of the underlying concepts. For example, do we mean that for every misspecified goal, there exists a dangerous amount of optimisation power? Or that there exists an amount of optimisation power that is dangerous for every misspecified goal? (Also, how much optimisation? And how misspecified do the goals need to be?)

Central to worries about AI risk is the intuition that if we even slightly misspecify our preferences when giving them as input to a powerful optimiser, the result will be human extinction. We refer to this conjecture as Extinction-level Goodhart's Law[1].

Weak version of Extinction-level Goodhart's Law

To make Extinction-level Goodhart's Law slightly more specific, consider the following definition:

Definition 1: The Weak Version of Extinction-level Goodhart's Law is the claim that: "Virtually any[2] goal specification, pursued to the extreme, will result in the extinction[3] of humanity."[4]

Here, the "weak version" qualifier refers to two aspects of the definition. The first is the limit nature of the claim --- that is, the fact that the law only makes claims about what happens when the goal specification is pursued to the extreme. The second is best understood by contrasting Definition 1 with the following claim:

Definition 2: The Uniform Version of Extinction-level Goodhart's Law is the claim that: "Beyond a certain level of optimisation power, pursuing virtually any goal specification will result in the extinction of humanity."

In other words, the difference between Definitions 1 and 2 is the difference between

1. (∀ goal G s.t. [conditions]) (∃ opt. power O) : Optimise(G, O) ⇝ extinction
2. (∃ opt. power O) (∀ goal G s.t. [conditions]) : Optimise(G, O) ⇝ extinction
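
To make the quantifier-order difference concrete, here is a minimal formal sketch in Lean 4. The names Goal, Power, Admissible, Extinct, and uniform_implies_weak are placeholders introduced here for illustration only (they are not from the paper). The sketch just records that the uniform version is the ∃∀ statement, the weak version is the ∀∃ statement, and that the former entails the latter --- while the converse does not hold in general, because in the weak version the dangerous amount of optimisation power may depend on the goal.

```lean
-- Placeholder types and predicates (illustrative only, not from the paper):
--   Goal, Power   : the spaces of goal specifications and of optimisation power
--   Admissible G  : "G satisfies [conditions]" (e.g. is only slightly misspecified)
--   Extinct G O   : "Optimise(G, O) ⇝ extinction"
variable (Goal Power : Type)
variable (Admissible : Goal → Prop)
variable (Extinct : Goal → Power → Prop)

/-- Definition 1 (weak version): for every admissible goal there exists
    *some* amount of optimisation power whose application leads to extinction. -/
def WeakVersion : Prop :=
  ∀ G : Goal, Admissible G → ∃ O : Power, Extinct G O

/-- Definition 2 (uniform version): there is a *single* amount of optimisation
    power that leads to extinction for every admissible goal. -/
def UniformVersion : Prop :=
  ∃ O : Power, ∀ G : Goal, Admissible G → Extinct G O

/-- The uniform (∃∀) version entails the weak (∀∃) version; the converse fails
    in general, since in the weak version the dangerous amount of power may
    depend on the goal. -/
theorem uniform_implies_weak
    (h : UniformVersion Goal Power Admissible Extinct) :
    WeakVersion Goal Power Admissible Extinct :=
  fun G hG => Exists.elim h fun O hO => ⟨O, hO G hG⟩
```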