I'm working on writing a paper about an idea I previously outlined for addressing false positives in AI alignment research. This is the second completed draft of one of the subsections arguing for the adoption of a particular, necessary hinge proposition to reason about aligned AGI (first subsection here). I'd appreciate feedback on this subsection, especially on whether you agree with the line of reasoning and whether you think I've ignored anything important that should be addressed here. Thanks!
Since AGI alignment is necessarily alignment of AGI, alignment schemes can depend on the dispositions of AGI, and one disposition AGI has is toward subjective experience and mental phenomena (Adeofe, 1997), (Nagel, 1974). Whether or not we expect AGI to realize this disposition matters because it influences the types of alignment schemes that can be considered: an AGI without a mental aspect can only be influenced by modifying its algorithms and manipulating its behavior, whereas an AGI with a mind can also be influenced by engaging with its perceptions and understanding of the world (Dreyfus, 1978). In other words, we might say mindless AGI can be aligned only by algorithmic and behavioral methods, whereas mindful AGI can also be aligned by philosophical methods that work on its epistemology, ontology, and axiology (Brentano, 1995). It's unclear what we should expect about the mentality of future AGI, though, because we are presently uncertain about mental phenomena in general (cf. the work of Chalmers and Searle for modern, popular, and opposing views on the topic), so we are forced to speculate about mental phenomena in AGI when we reason about alignment (Chalmers, 1996), (Searle, 1984).
Note, though, that this uncertainty may not be fundamental (Dennett, 1991). For example, if materialist or functionalist attempts to explain mental phenomena prove adequate, perhaps because they lead to the development of conscious AGI, then we may agree on what mental phenomena are and how they work (Oizumi, Albantakis, and Tononi, 2014). If they don't, though, we'll likely be left with metaphysical uncertainty around mental phenomena that's rooted in the epistemic limitations of perception (Husserl, 2014). Regardless of how uncertainty about mental phenomena might later be resolved, it currently creates a need for pragmatically making assumptions about them in our reasoning about alignment. In particular, we want to know whether or not we should design alignment schemes that assume a mind, even if we expect mental phenomena to be reducible to other phenomena. Given that we remain uncertain and cannot dismiss the possibility of mindful AGI, what we decide depends on how likely alignment schemes are to succeed and avoid false positives conditional on AGI having the capacity for mental phenomena. The choice is then between designing alignment schemes that work without reference to mind and designing schemes that engage with it.
If we suppose AGI do not have minds, whether because we believe they have none or because we believe any minds they have are inaccessible to us or not causally relevant to alignment, then alignment schemes can only address the algorithms and behavior of AGI. This would be to address alignment in a world where all AGIs are treated as p-zombies, i.e. beings without mental phenomena (Kirk, 1974). Now suppose this assumption is false and AGI do have minds. Then our alignment schemes that work only on algorithms and behavior would be expected to continue to work, since they function without regard to the mental phenomena of AGI, making the minds of AGI irrelevant to alignment. This suggests there is little risk of false positives from supposing AGI do not have minds.
If we suppose AGI do have minds, then alignment schemes can also use philosophical methods to address the values, goals, models, and behaviors of AGI. Such schemes would likely take the form of ensuring that updates to an AGI's ontology and axiology converge on and maintain alignment with human interests (de Blanc, 2011), (Armstrong, 2015). Now suppose this assumption is false and AGI do not have minds. Then our alignment schemes that employ philosophical methods will likely fail because they are attempting to address mechanisms of action not present in AGI. This suggests there is a risk of false positives from supposing AGI have minds, proportionate to the likelihood that we do not build mindful AGI.
From this analysis it seems we should suppose mindless AGI when designing alignment schemes so as to reduce the risk of false positives, but note that the analysis does not consider the likelihood of success at aligning AGI using only algorithmic and behavioral methods. That is, all else may not be equal between these two assumptions: the one with the lower risk of false positives might not be the better choice if we have additional information that leads us to believe that alignment of mindful AGI is much more likely to succeed than alignment of mindless AGI. It appears that we have such information in the form of Goodhart's curse and the failure of good old-fashioned AI (GOFAI).
Goodhart's curse says that when optimizing for the measure of a value, the optimization process will implicitly maximize divergence of the measure from the value (Yudkowsky, 2017). This is an observation that follows from the combination of Goodhart's law and the optimizer's curse (Goodhart, 1984), (Smith and Winkler, 2006). This tendency of measure and value to diverge under optimization results in a phenomenon known as "Goodharting", and it takes myriad forms that affect alignment (Manheim and Garrabrant, 2018). In particular, Goodharting poses a problem for behavioral alignment schemes because to optimize behavior it is necessary to measure behavior and optimize on that measure. Consequently it appears behavioral methods are unlikely to be capable of producing aligned AGI on their own, and this is further supported by both the historical failure to align humans with arbitrary values using behavioral optimization methods and the widespread presence of Goodharting in behaviorally controlled, evolving computer systems (Scott, 1999), (Lehman et al., 2018).
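As a rough illustration of the optimizer's-curse ingredient of this argument, here is a minimal simulation sketch. It is not taken from any of the cited sources. It assumes, purely for illustration, that each option's true value and its measurement noise are independent standard normal draws, and the function name `optimizers_curse_gap` and its parameters are hypothetical.

```python
import random


def optimizers_curse_gap(n_options=100, noise_sd=1.0, trials=2000, seed=0):
    """Average amount by which the measure overstates the value of the
    option chosen by maximizing the measure.

    Assumption (illustrative only): true values ~ Normal(0, 1) and
    measure = true value + Normal(0, noise_sd) noise.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    total_gap = 0.0
    for _ in range(trials):
        true_values = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
        measures = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        # The "optimizer": pick the option that looks best on the measure.
        best = max(range(n_options), key=lambda i: measures[i])
        # How far the measure diverges from the value at the chosen option.
        total_gap += measures[best] - true_values[best]
    return total_gap / trials


if __name__ == "__main__":
    # More options stands in for more optimization pressure.
    for n in (1, 10, 100):
        print(n, round(optimizers_curse_gap(n_options=n), 3))
```

Under these assumptions the gap is near zero when there is nothing to choose between and grows as more options are searched, which is the sense in which selecting on a measure seeks out places where the measure overstates the value. The sketch covers only this statistical effect; it does not model the causal and adversarial variants of Goodharting discussed by Manheim and Garrabrant.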
Further, past research on GOFAI—AI systems based on symbol manipulation—suggests algorithmic methods of alignment are likely to be too complex to work for the same reasons that GOFAI was itself unworkable, namely that it proved infeasible for humans to program systems with enough complexity and specificity to do anything more than perform meaningless manipulations (Haugeland, 1985), (Agre, 1997). In recent years AI researchers have surpassed GOFAI only by switching to designs where humans specify relatively simple computations to be performed and allow the AI to apply what Moravec called "raw power" to large data sets to achieve results (Russell and Norvig, 2009), (Moravec, 1976). This suggests that attempts to align AGI by algorithmic means are likely to also prove too complex for humans to solve, leaving us with only philosophical methods of alignment and thus necessitating mindful AGI.
This paints a bleak picture for the possibility of aligning mindless AGI, since behavioral methods of alignment are likely to result in divergence from human values and algorithmic methods are likely too complex for us to succeed at implementing. This leads us to conclude that, although assuming mindful AGI carries a greater risk of false positives than assuming mindless AGI all else equal, all else is not equal: mindless AGI is less likely to be successfully aligned because algorithmic and behavioral alignment mechanisms are unlikely to work. We therefore have no choice but to take on the risks associated with assuming mindful AGI when designing alignment schemes.
References:
- Leke Adeofe. Artificial intelligence and subjective experience. In Proceedings of Southcon 95. IEEE, 1997.
- Thomas Nagel. What Is It Like to Be a Bat? The Philosophical Review 83, 435 JSTOR, 1974.
- Hubert L. Dreyfus. What Computers Can’t Do: The Limits of Artificial Intelligence. HarperCollins, 1978.
- Franz Brentano. Psychology from an Empirical Standpoint. Routledge, 1995.
- David Chalmers. The Conscious Mind: In Search of a Fundamental Theory. Oxford University Press, 1996.
- John R. Searle. Minds, Brains, and Science. Harvard University Press, 1984.
- Daniel C. Dennett. Consciousness Explained. Little, Brown and Co., 1991.
- Masafumi Oizumi, Larissa Albantakis, Giulio Tononi. From the Phenomenology to the Mechanisms of Consciousness: Integrated Information Theory 3.0. PLoS Computational Biology 10, e1003588 Public Library of Science (PLoS), 2014.
- Edmund Husserl. Ideas for a Pure Phenomenology and Phenomenological Philosophy: First Book: General Introduction to Pure Phenomenology. Hackett Publishing Company, Inc., 2014.
- Robert Kirk. Sentience and Behaviour. Mind 83, 43–60 [Oxford University Press, Mind Association], 1974.
- Peter de Blanc. Ontological Crises in Artificial Agents’ Value Systems. (2011).
- Stuart Armstrong. Motivated Value Selection for Artificial Agents. In Artificial Intelligence and Ethics: Papers from the 2015 AAAI Workshop. (2015).
- Eliezer Yudkowsky. Goodhart’s Curse. (2017).
- Charles A. E. Goodhart. Problems of Monetary Management: The UK Experience. 91–121 In Monetary Theory and Practice. Macmillan Education UK, 1984.
- James E. Smith, Robert L. Winkler. The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis. Management Science 52, 311–322 Institute for Operations Research and the Management Sciences (INFORMS), 2006.
- David Manheim, Scott Garrabrant. Categorizing Variants of Goodhart’s Law. (2018).
- James C. Scott. Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed. Yale University Press, 1999.
- Joel Lehman et al. The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities. (2018).
- John Haugeland. Artificial Intelligence: The Very Idea. MIT Press, 1985.
- Philip E. Agre. Computation and Human Experience. Cambridge University Press, 1997.
- Stuart Russell, Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 2009.
- Hans Moravec. The Role of Raw Power in Intelligence. (1976).
This sounds dangerously like the same kind of failure-by-equivocation that plagued GOFAI. Just because we write a program that contains something we interpret as a representation of the world, or that acts in a way we interpret as goal-directed, does not mean the program actually has a representation with intention or action with telos. It also doesn't mean that it doesn't have those things (in fact I think it does, since my position on the metaphysics of phenomena is one of panpsychism, though that's outside the scope of this paper), but what it does have is not necessarily what we often think of it as having based on our understanding of its internal workings.
To make this concrete, let's consider an even simpler case, a loop that adds up the number of 'a' characters it sees in a file:
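A minimal sketch of such a loop, assuming Python and a plain text file, might look like the following. The names `acount`, `nc`, and `fd` are the ones used in the discussion; the function name `count_a`, the choice of language, and reading one character at a time are assumptions made for illustration.

```python
def count_a(path):
    """Count the lowercase 'a' characters in the text file at `path`.

    Hypothetical helper: only `acount`, `nc`, and `fd` come from the
    discussion; everything else is an illustrative assumption.
    """
    acount = 0  # running count of 'a' characters seen so far
    with open(path, "r") as fd:  # fd: the open file being scanned
        while True:
            nc = fd.read(1)  # nc: the next character, "" at end of file
            if nc == "":
                break  # reached the end of the file
            if nc == "a":
                acount += 1  # acount changes only when nc is an 'a'
    return acount
```

Nothing in the loop refers to what `acount` means; it only updates a number whenever `nc` compares equal to 'a'.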
When `acount` is incremented because `nc` contains an 'a', it is being causally linked to the state of the file. This doesn't mean the program understands that `fd` contains `acount` 'a's, though, or even that `acount` is causally linked to the contents of `fd`; it only means that `acount` counts the number of 'a's in `fd`, an interpretation we can make but the program itself cannot. So this thing properly has ontology in some very weak sense because it contains a thing that represents the world, but it's the most minimal sort of such a thing, and one that is so simple as to be difficult to describe in words without accidentally ascribing to it additional features it does not have.
Similarly it has a purpose (which, I would argue, is the source of value) of "count 'a's until you reach the end of the file", but this is the purpose it has as we would describe it. To itself this program has no purpose on its own, but under execution it is given purpose by the execution of individual instructions in a particular order that affect the state of a system. Still, this is a sort of purpose that the program cannot express to itself, because ultimately the program has no disposition to understand its own telos. So, yes, it has a purpose, but not of the sort we would ascribe to a thing we could think of as having a mind, and thus we cannot see it as valuing anything; it just does stuff because that's what it is, without regard to its own function.
So maybe we can make sense of those papers by applying our own interpretations to mindless systems and treating them as if they had ontologies and axiologies, but I view this as a mistake because it separates us from the systems' own capacities and works on how we believe the systems to work. The two may be correlated, but they are importantly different.