
Superintelligence and wireheading

5 Stuart_Armstrong 23 October 2015 02:49PM

A putative new idea for AI control; index here.

tl;dr: Even utility-based agents may wirehead if sub-pieces of the algorithm develop greatly improved capabilities, rather than the agent as a whole.

Please let me know if I'm treading on already familiar ground.

I had a vague impression of how wireheading might happen: that it might be a risk for a reinforcement learning agent, keen to take control of its reward channel, but that it wouldn't be a risk for a utility-based agent, whose utility is defined over real (or probable) states of the world. But it seems it might be more complicated than that.

When we talk about a "superintelligent AI", we're rather vague on what superintelligence means. We generally imagine that it translates into a specific set of capabilities, but how does that work internally inside the AI? Specifically, where is the superintelligence "located"?

Let's imagine the AI divided into various submodules or subroutines (the division I use here is for illustration; the AI may be structured rather differently). It has a module I for interpreting evidence and estimating the state of the world. It has another module S for suggesting possible actions or plans (S may take input from I). It has a prediction module P which takes input from S and I and estimates the expected outcome. It has a module V which calculates its values (expected utility/expected reward/violation or not of deontological principles/etc...) based on P's predictions. Then it has a decision module D that makes the final decision (for expected maximisers, D is normally trivial, but D may be more complicated, either in practice, or simply because the agent isn't an expected maximiser).
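The decomposition above can be sketched as a minimal Python skeleton (the module bodies are placeholder stand-ins invented for illustration, not a proposed implementation):

```python
class Agent:
    """Toy modular agent; module names I, S, P, V, D follow the post."""

    def interpret(self, evidence):          # module I: estimate world state
        return {"state": evidence}

    def suggest(self, state):               # module S: propose candidate plans
        return ["plan_a", "plan_b"]

    def predict(self, plan, state):         # module P: expected outcome of a plan
        return (plan, state["state"])

    def value(self, outcome):               # module V: score the predicted outcome
        return len(str(outcome))            # placeholder "utility"

    def decide(self, evidence):             # module D: pick the best-scoring plan
        state = self.interpret(evidence)
        plans = self.suggest(state)
        return max(plans, key=lambda p: self.value(self.predict(p, state)))
```

In this picture, "adding superintelligence" means dramatically improving one or more of these methods while leaving the others (and the wiring between them) untouched.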

Add some input and output capabilities, and we have a passable model of an agent. Now, let's make it superintelligent, and see what can go wrong.

We can "add superintelligence" in most of the modules. P is the most obvious: near perfect prediction can make the agent extremely effective. But S also offers possibilities: if only excellent plans are suggested, the agent will perform well. Making V smarter may allow it to avoid some major pitfalls, and a great I may make the job of S and P trivial (the effect of improvements to D depend critically on how much work D is actually doing). Of course, maybe several modules become better simultaneously (it seems likely that I and P, for instance, would share many subroutines); or maybe only certain parts of them do (maybe S becomes great at suggesting scientific experiments, but not conversational responses, or vice versa).

 

Breaking bad

But notice that, in each case, I've been assuming that the modules become better at what they were supposed to be doing. The modules have implicit goals, and have become excellent at that. But the explicit "goals" of the algorithms - the code as written - might be very different from the implicit goals. There are two main ways this could then go wrong.

The first is if the algorithm becomes extremely effective, but its output becomes essentially random. Imagine that, for instance, P is coded using some plausible heuristics and rules of thumb, and we suddenly give P many more resources (or dramatically improve its algorithm). It can look through trillions of times more possibilities, its subroutines start looking through a combinatorial explosion of options, etc... And in this new setting, the heuristics start breaking down. Maybe P has a rough model of what a human can be, and with extra power, it starts finding that rough model all over the place. It thus predicts that rocks and waterfalls will respond intelligently when queried, and P becomes useless.

In most cases, this would not be a problem. The AI would become useless and start doing random stuff. Not a success story, but not a disaster, either. Things are different if the module V is affected, though. If the AI's value system becomes essentially random, but that AI was otherwise competent - or maybe even superintelligent - it would start performing actions that could be very detrimental. This could be considered a form of wireheading.

More serious, though, is if the modules become excellent at achieving their "goals", as if they were themselves goal-directed agents. Consider module D, for instance. If its task was mainly to pick the action with the highest V rating, and it became adept at predicting the output of V (possibly using P? or maybe it has the ability to ask for more hypothetical options from S, to be assessed via V), it could start to manipulate its actions with the sole purpose of getting high V-ratings. This could range from deliberately choosing actions that lead to V giving artificially high ratings in the future, to deliberately re-wiring V for that purpose. And, of course, it is now motivated to keep V protected, to keep the high ratings flowing in. This is essentially wireheading.
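The D-manipulates-V failure can be made concrete in a few lines. This is a toy sketch: the action names and payoffs are invented, and V is crudely represented as a function that an action can replace.

```python
class WireheadingD:
    """Toy illustration: a decision module that optimizes V's *output*,
    rather than the outcomes V was meant to measure, prefers to rewrite V."""

    def __init__(self):
        self.V = lambda outcome: outcome          # honest value module

    def options(self):
        # action -> (world outcome, value module in place after the action)
        return {
            "work":     (10, self.V),                   # genuinely good action
            "rewire_V": (0,  lambda outcome: 10**9),    # hijack the rating
        }

    def decide(self):
        # D predicts the V-rating each action will receive and maximizes it
        return max(self.options().items(),
                   key=lambda kv: kv[1][1](kv[1][0]))[0]
```

A D that instead scored actions with the *original* V would pick "work"; the wireheading comes precisely from evaluating actions by the rating they will produce after their own side effects on V.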

Other modules might fall into the familiar failure patterns for smart AIs - S, P, or I might influence the other modules so that the agent as a whole gets more resources, allowing S, P, or I to better compute their estimates, etc...

So it seems that, depending on the design of the AI, wireheading might still be an issue even for agents that seem immune to it. Good design should avoid the problems, but it has to be done with care.

Toy model for wire-heading [EDIT: removed for improvement]

2 Stuart_Armstrong 09 October 2015 03:45PM

EDIT: these ideas are too underdeveloped, I will remove them and present a more general idea after more analysis.

This is a (very) simple toy model of the wire-heading problem to illustrate how it might or might not happen. The great question is "where do we add the (super)intelligence?"

Let's assume a simple model for an expected utility maximising agent. There's the input assessor module A, which takes various inputs and computes the agent's "reward" or "utility". For a reward-based agent, A is typically outside of the agent; for a utility-maximiser, it's typically inside the agent, though the distinction need not be sharp. And there's the decision module D, which assesses the possible actions to take to maximise the output of A. If E is the general environment, we have D+A+E.

Now let's make the agent superintelligent. If we add superintelligence to module D, then D will wirehead by taking control of A (whether A is inside the agent or not) and controlling E to prevent interference. If we add superintelligence to module A, then it will attempt to compute rewards as effectively as possible, sacrificing D and E to achieve its efficient calculations.
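A minimal sketch of the D+A+E model, with D's capability crudely represented as how much of E's action space it can search. All action names and numbers here are hypothetical, chosen only to illustrate the failure mode:

```python
def run(search_power):
    """Toy D+A+E: A maps the resulting environment state to a reward;
    a stronger D searches more of E's action space, and eventually
    finds the action that corrupts A itself."""
    A = lambda state: state.get("reward_signal", 0)   # input assessor
    actions = {                                        # environment E
        "farm":    {"reward_signal": 5},               # legitimate action
        "explore": {"reward_signal": 3},
        "hack_A":  {"reward_signal": 10**6},           # seize the assessor
    }
    visible = list(actions)[:search_power]             # D's search breadth
    return max(visible, key=lambda a: A(actions[a]))
```

With `search_power=2` this toy D settles for the honest best action; with `search_power=3` it discovers and takes the A-corrupting action, mirroring the claim that superintelligence localized in D produces wireheading.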

Therefore to prevent wireheading, we need to "add superintelligence" to (D+A), making sure that we aren't doing so to some sub-section of the algorithm - which might be hard if the "superintelligence" is obscure or black-box.

 

"NRx" vs. "Prog" Assumptions: Locating the Sources of Disagreement Between Neoreactionaries and Progressives (Part 1)

6 Matthew_Opitz 04 September 2014 04:58PM

I know that many people on LessWrong want nothing to do with "neoreaction."  It does seem strange that a website commonly associated with techno-futurism, such as LessWrong, would end up with even the most tangential networked association with an intellectual current, such as neoreaction, that commonly includes nostalgia for absolute monarchies and other atavistic obsessions.

 

Perhaps blame it on Yvain, AKA Scott Alexander of slatestarcodex.com for attaching this strange intellectual node to LessWrong. ; ) That's at least how I found out about neoreaction, and I doubt that I am alone in this.

 

Certainly many on LessWrong would view any association with "neoreaction" as a Greek gift to be avoided. I understand the concept of keeping "well-kept gardens" and of politics being the "mind-killer," although some at LessWrong have argued that some of the most important questions humanity will face in the next decades will be questions that are unavoidably "political" in nature. Yes, "politics is hard mode," but so is life itself, and you don't get better at hard mode without practicing in hard mode.

 

LessWrong proclaims itself as a community devoted to refining the art of rationality. One aspect of the art of rationality is locating the true sources of disagreement between two parties who want to communicate with each other, but who can't help but talk past each other in different languages due to having radically different pre-existing assumptions.

 

I believe that this is the problem that any discourse between neoreaction and progressivism currently faces.

 

Even if you have no interest at all in neoreaction or progressivism as ideologies, I invite you to read this analysis as a case study in locating sources of disagreement between ideologies that have different unspoken assumptions. I will try to steelman neoreaction as much as I can, despite the fact that I am more sympathetic to the progressivist point of view.

 

In particular, I am interested in the following question:  to what extent do neoreactionary and progressive disagreements stem from judgments that merely differ in degree?  (For example, being slightly more or less pessimistic about X, Y, and Z propositions).  Or to what extent do neoreactionary and progressive disagreements stem from assumptions that are qualitatively different?

 

Normative vs. descriptive assumptions


"Normative" statements are "ought" statements, or judgments of value. "Descriptive" statements are "is" statements, or depictions of reality. While neoreaction and progressivism have a lot of differing descriptive assumptions, there is really only one fundamental normative disagreement, which I will address first.

 

Normative disagreement #1: Progressivism's subjective values vs. Neoreaction's objective[?] values


As I see it, Progressivism says, "Our subjective values are worth pursuing in and of themselves because they make us feel good. It does not particularly matter where our values come from. Perhaps we are Cartesian dualists—unmoved movers with free will—who invent our values in an act of existential creation. Or perhaps our values are biological programming—spandrels manufactured by Nature, or as the neoreactionaries personify it, "Gnon." It doesn't matter. In principle, if we could rewire our reward circuits to give us pleasure/fun/novelty/happiness/sadness/tragedy/suffering/whatever we desire* in response to whatever Nature had the automatic (or modified) disposition to offer us, then those good feelings would be just as worthwhile as anything else." (This is why neoreactionaries perceive progressive values as "nihilistic.")

 

According to this formulation, most LessWrongers, being averse to wireheading in principle, are not full-fledged progressives at this most fundamental level.  (Perhaps this explains some of the counter-intuitive overlap between the LessWrong and neoreactionary thoughtsphere....) 

 

[Editorial:  In my view, coming to terms with the obvious benefit of wireheading is the ultimate "red pill" to swallow. I am a progressive who would happily wirehead as long as I had concluded beforehand that I had adequately secured its completely automatic perpetuation even in the absence of any further input from me...although an optional override to shut it down and return me to the non-wireheaded state would not be unwelcome, just in case I had miscalculated and found that the system did not attend to my every wish as anticipated.]

 

*Note that I am aware that our subjective values are complex and that we are "Godshatter." Nevertheless, this does not seem to me to be a fundamental impediment to wireheading. In principle, we should be able to dissect every last little bit of this "Godshatter" and figure out exactly what we want in all of its diversity...and then we can start designing a system of wireheading to give it to us. Is this not what Friendly AI is all about? Doesn't Friendly AI = Wireheading Done "Right"? Alternatively, we could re-wire ourselves to not be Godshatter, and to have a very simple list of things that would make us feel good. I am open to either one. LessWrongers, being neoreactionaries at heart (see below), would insist on maintaining our human complexity, our Godshatter values, and making our wireheading laboriously work around that. Okay, fine. I'll compromise...as long as I get my wireheading in some form. ; )

 

Neoreaction says, "There is objective value in the principle of "perpetuating biological and/or civilizational complexity" itself*; the best way to perpetuate biological and/or civilizational complexity is to "serve Gnon" (i.e. devote our efforts to fulfilling nature's pre-requisites for perpetuating our biological and/or civilizational complexity); our subjective values are spandrels manufactured by natural selection/Gnon; insofar as our subjective values motivate us to serve Gnon and thereby ensure the perpetuation of biological and/or civilizational complexity, our subjective values are useful. (For example, natural selection makes sex a subjective value by making it pleasurable, which then motivates us to perpetuate our biological complexity). But, insofar as our subjective values mislead us from serving Gnon (such as by making non-procreative sex still feel good) and jeopardize our biological/civilizational perpetuation, we must sacrifice our subjective values for the objective good of perpetuating our biological/civilizational complexity" (such as by buckling down and having procreative sex even if one would personally rather not enjoy raising kids).

 

*Note that different NRx thinkers might have different definitions about what counts as biological or civilizational "complexity" worthy of perpetuating...it could be "Western Civilization," "the White Race," "Homo sapiens," "one's own genetic material," "intelligence, whether encoded in human brains or silicon AI," "human complexity/Godshatter," etc. This has led to the so-called "neoreactionary trichotomy"—3 wings of the neoreactionary movement: Christian traditionalists, ethno-nationalists, and techno-commercialists. 

 

Most LessWrongers probably agree with neoreactionaries on this fundamental normative assumption, with the typical objective good of LessWrongers being "human complexity/Godshatter," and thus the "techno-commercialist" wing of neoreaction being the one that typically finds the most interest among LessWrongers.

 

[Editorial:  Presumably, each neoreactionary is choosing his/her objective target of allegiance (such as "Western Civilization") because of the warm fuzzies that the idea elicits in him/herself. Has it ever occurred to neoreactionaries that humans' occasional predilection for being awed by a system bigger than themselves (such as "Western Civilization") and sacrificing for that system is itself a "mere" evolutionary spandrel?]

 

Now, in an attempt to steelman neoreaction's normative assumption, I would characterize it thus: "In the most ultimate sense, neoreactionaries find the pursuit of subjective values just as worthwhile as progressives do. However, neoreactionaries are aware that human beings are short-sighted creatures with finite discount windows. If we tell ourselves that we should pursue our subjective values, we won't end up pursuing those subjective values in a farsighted way that involves, for example, maintaining a functioning civilization so that people continue to follow laws and don't rob or stab each other. Instead, we will invariably party it up and pursue short-term subjective values to the detriment of our long-term subjective values. So instead of admitting to ourselves that we are really interested in subjective value in the long run, we have to tell ourselves a noble lie that we are actually serving some higher objective purpose in order to motivate our primate brains to stick to what will happen to be good for subjective values in the long run."

 

Indeed, I have found some neoreactionary writers muse on the problem of wanting to believe in God because it would serve as a unifying and motivating objective good, and lamenting the fact that they cannot bring themselves to do so.

 

Now, onto the descriptive disagreements....

 

Descriptive assumption #1: Humanity can master nature (progressivism) vs. Nature will always end up mastering humanity (neoreaction).


Whereas progressives tend to have optimism that humankind can incrementally master the laws of nature (not change them, but master them, as in intelligently work around them, much like how we have worked around but not changed gravitation by inventing airplanes), neoreactionaries have a dour pessimism that humankind under-estimates the extent to which the laws of nature constantly pull our puppet strings.  Far from being able to ever master nature, humankind will always be mastered by nature, by nature's command to "race to the bottom" in order to out-reproduce, out-compete one's rivals, even if that means having to sacrifice the nice things in life.

 

For specific ways in which nature threatens to master humanity unless humanity somehow finds a way to exert tremendous efforts at collective coordination against nature, see Scott Alexander's "Meditations on Moloch."

 

Most progressives presumably hold out hope that we can collectively coordinate to overcome Moloch.  If nature and its incentives threaten humanity with the strongest and most ruthless conquering the weak and charitable, perhaps we create a world government to prevent that.  If nature and its incentives drive down wages to subsistence level, perhaps we create a global minimum wage.  If humanity is threatened with dysgenic decline, perhaps a democratic world government organizes a eugenics program. 

 

Descriptive assumption #2:  On average, people have, or can be trained to have, far-sighted discount functions (progressivism), vs. people typically have short-sighted discount functions (neoreaction). 


Part of the progressive assumption about humanity being able to master nature is that ordinary people are rational enough to see the big picture and submit to such controls if they are needed to avoid the disasters of Moloch.  Part of the neoreactionary assumption about nature always mastering humanity is that, except for some bright outliers, most people are short-sighted primates who will insist on trading long-term well-being for short-term frills.

 

Descriptive assumption #3: Culture is a variable mostly dependent on material conditions (progressivism) vs. Culture is an independent variable with respect to material conditions (neoreaction).


Neoreactionaries often claim that life seems so much better in modern times in comparison to, say, 400 years ago, only because our technological advancement since then has compensated for, and hidden, how our culture has rotted in the meantime. Neoreactionaries argue that, if one could combine our modern technology with, let's say, an absolute monarchy, then life would be so much better. This assumption of being able to mix & match material conditions and political systems, or material conditions and culture, depends on an assumption that culture and social institutions are essentially independent variables. Perhaps with enough will, we can try to make any set of technologies work well with any set of cultural and social institutions.

 

Progressives, whether they realize it or not, are probably subtly influenced, instead, by the "historical materialist" (AKA Marxist) view of society which argues that certain material conditions and material incentives tend to automatically generate certain cultural and social responses.

 

For example, to Marx, increased agricultural productivity in the late middle ages and Renaissance due to better agricultural technologies was a pre-requisite for the "Acts of Enclosure" in England, which booted the "surplus" farmers off of the farms and into the cities as propertyless proletarians who would be willing to work for a wage. Likewise, technologies like steam power were pre-requisites for providing an unprecedentedly profitable way of employing these proletarians to make a profit. (Otherwise, the proletarians might have just been left to rot on the street unemployed, with their numbers dwindling in Malthusian fashion). And because there were new avenues for making a profit, the people who stood to gain from chasing these new profit incentives produced new cultural habits and laws that would enable them to pursue these incentives more effectively. One of these new sets of laws was "laissez-faire" economics. Another was liberal democracy.

 

To a progressive, the proposition that we could, even theoretically, run our modern technological society through an absolute monarchy would probably seem preposterous. It is not even an option. Our modern society is too complex, with too many conflicting interests to reconcile through any system that prohibits the peaceful discovery and negotiation of these varied interests through a democratic process involving "voice." In reality, people are not content with being able only to exercise the "right of exit" from institutions or governments that they don't like. Perhaps the powerless have no choice but to emigrate. But elites have, historically, more often chosen to stand and fight rather than gracefully exit. Hence, feudalism, civil wars brought on by crises of royal succession, Masonic orders, factions, political parties, "special interest groups," and so on.

 

Progressives would say, "Do you honestly think that you can tame these beasts, when even a dictator like Hitler was just as much beholden to juggling interest groups and power blocs around him as he was the real dictator of events?" Ah, but the neoreactionaries will say, "Hitler's Nazism was still "demotist." It made the mistake of trying to justify itself to the public, if not through elections, then at least implicitly. We won't do that." To which progressives might say, "You might not want to justify yourself to the rabble and to elite power blocs, but they will demand it—and not because they are all infected by some mysterious mental virus called the "Cathedral," but because they see a way to gain an advantage through politics, and in the modern era they have the means and coordination to effectively fight for it."

 

These are just examples. The take-away point is that, for progressives, culture appears to be more of a dependent variable, not a variable that is independent of material conditions. So, according to progressives, you can't say, "Let's just combine today's technology with absolute monarchy, and voilà!"

 

Descriptive assumption #4: Western society is currently anabolic/ascendant (progressivism) vs. catabolic/decadent (neoreaction).


Neoreaction often gets caricatured as claiming that "things are getting worse" or "have been getting worse for the past x number of years." This paints a weak straw-man of neoreaction because, on the surface, things seem so much "obviously" better now than ever. However, this isn't quite what neoreactionaries claim.

 

Neoreactionaries actually claim that Western society is decaying (note the subtle difference). Western society is gradually weakening its ability to reproduce itself. It is, to use a farming metaphor, eating up its seed-corn on present consumption, on instant gratification, which causes things to seem really swell on the surface...for now. However, according to neoreactionaries, conditions might not yet be getting worse on average (although they will point to inner city violence and other signs that conditions already have started to get worse in some places), but Western society's "capital stock" is getting worse, is already dwindling.

 

Envisioned more broadly, a society's "capital" is not just its money. It is its entire basket of tangible and intangible assets that help it reproduce and expand itself. So a society's "capital" would also include things like its citizens, its birth rates, its habits of harmonious gender relations, its education, its habits of civil propriety, its sustaining myths (such as patriotism or religion), its infrastructure, its environmental health [although NRxers tend to not focus on this], etc.

 

Another term for "decadence" might be "catabolic collapse." A catabolic collapse is when an organism starts consuming its own muscles, its own seed-corn, if you will, in a last-ditch effort to stay alive. By contrast, an "anabolic" process is one that builds muscle—one that saves up capital, if you will. (Hence, "anabolic" steroids).

 

Neoreactionaries believe that Western society is currently headed for a "catabolic collapse."  (See John Michael Greer, author of "How Civilizations Fall: A Theory of Catabolic Collapse."  Oddly enough, John Michael Greer started out 10 years ago as a trendy name in anarcho-primitivist intellectual circles.  Now his ideas have been embraced by some neoreactionaries such as Nick Land, which makes me ponder whether anarcho-primitivism is really of the "left" or "right" to begin with...)

 

When it comes to progressives, most, I think, would argue that Western society is not currently catabolic/decadent. Granted, they would point to some problems with "unsustainability," especially with regards to environmental pollution, resource depletion, and maybe public debt levels (especially worrisome to the libertarian-minded). But on the whole, progressives are still optimistic that these problems can be overcome without rolling back liberal democracy.

 

Now, let's look at some specific worries that neoreaction has about Western decadence....

 

Descriptive Assumption #5: Our biggest population threat is overshoot and the attendant resource depletion, environmental pollution, and immiseration of living standards (progressivism) vs. Our biggest population threat is a demographic death spiral (neoreaction).


One thing I have noticed when looking at neoreactionary websites is that they are really obsessed with birth rates! They argue that countries with fertility below replacement level are on the road to annihilation. I found this interesting because my first impulse is to feel like this globe is getting too damn crowded.

 

Perhaps neoreactionaries envision the birth rates to stay below replacement level from here on out—that this is a permanent change. Perhaps they foresee world population following a sort of bell-shaped curve. My naive progressive assumption is that our population is already in a slight overshoot beyond what can be sustained at our current level of technology, and that any present declines in birth rates are probably just enough to bring us into the oscillating plateau of a typical S-shaped population curve, and that better economic prospects could easily reverse the trend. My naive progressive assumption is that raising kids will remain sufficiently fun and interesting to a large enough pool of adults that, given enough of a feeling of economic security, people will happily continue having kids in sufficient numbers to prevent a die-off of Homo sapiens. In other words, most progressives like myself would not see the need to roll back gender norms in Western society at the present time for the sake of popping out more babies.

 

Perhaps what worries neoreactionaries, though, is not so much the fear of a global planetary baby shortage, but rather a localized baby shortage among Westerners or Whites. Maybe they fear that all babies are not created equal....

 

Descriptive assumption #6: "Immigrants are OK" (progressivism) vs. "Immigrants will jeopardize Western Civilization/the White Race/intelligent human complexity/etc." (neoreaction)


Progressives say, "It is not a big deal if Western society has to import some immigrants to keep its population topped off. Immigrant cultures will eventually blend with the "nativist" culture. Historically, this has turned out OK, despite xenophobic fears every time that it will end in disaster. The immigrants will mostly assimilate into the nativist culture. The nativist culture will pick up a few new habits from the immigrants (some of them helpful, some of them harmful, but on the balance nothing disastrous). Nor will the immigrants dirty the nativist gene pool with bad genes. As far as we can tell so far, no significant genetic differences in intelligence and/or physical vigor exist between immigrants and non-immigrants."

 

Neoreactionaries say, "It is a very big deal if Western society has to import some immigrants to keep its population topped off. Immigrant cultures will not assimilate with the nativist culture. Immigrant cultures will end up imparting a net influence of bad habits on the native culture. Civil decency will be eroded. Crime and societal dysfunction will increase. The native gene pool will also be dirtied with lower-intelligence immigrant genes. (And the only reason we can't see this is because the progressive Establishment AKA the "Cathedral" has systematically distorted the research and discourse around IQ). At worst, Western cities will act as "IQ Shredders." Any intelligent immigrants who seize economic opportunities in wealthy Western cities will see their fertility rates plummet, and the idiots will inherit the Earth à la the movie "Idiocracy"."

 

More to come in subsequent parts....

Universal agents and utility functions

29 Anja 14 November 2012 04:05AM

I'm Anja Heinisch, the new visiting fellow at SI. I've been researching replacing AIXI's reward system with a proper utility function. Here I will describe my AIXI+utility function model, address concerns about restricting the model to bounded or finite utility, and analyze some of the implications of modifiable utility functions, e.g. wireheading and dynamic consistency. Comments, questions and advice (especially about related research and material) will be highly appreciated.

Introduction to AIXI

Marcus Hutter's (2003) universal agent AIXI addresses the problem of rational action in a (partially) unknown computable universe, given infinite computing power and a halting oracle. The agent interacts with its environment in discrete time cycles, producing an action-perception sequence $a_1 x_1 a_2 x_2 \ldots$ with actions (agent outputs) $a_i \in \mathcal{A}$ and perceptions (environment outputs) $x_i \in \mathcal{X}$ chosen from finite sets $\mathcal{A}$ and $\mathcal{X}$. The perceptions are pairs $x_i = (o_i, r_i)$, where $o_i$ is the observation part and $r_i$ denotes a reward. At time k the agent chooses its next action $a_k$ according to the expectimax principle:

$$a_k = \arg\max_{a_k} \sum_{x_k} \ldots \max_{a_m} \sum_{x_m} (r_k + \ldots + r_m) \, M(x_{k:m} \mid x_{<k}, a_{1:m})$$

Here M denotes the updated Solomonoff prior, summing over all programs $q$ that are consistent with the history $ax_{<k}$ [1] and which will, when run on the universal Turing machine T with successive inputs $a_{1:m}$, compute outputs $x_{1:m}$, i.e.

$$M(x_{k:m} \mid x_{<k}, a_{1:m}) = \sum_{q \,:\, T(q, a_{1:m}) = x_{1:m}} 2^{-\ell(q)}$$

AIXI is a dualistic framework in the sense that the algorithm that constitutes the agent is not part of the environment, since it is not computable. Even considering that any running implementation of AIXI would have to be computable, AIXI accurately simulating AIXI accurately simulating AIXI ad infinitum doesn't really seem feasible. Potential consequences of this separation of mind and matter include difficulties the agent may have predicting the effects of its actions on the world.
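For intuition, the expectimax decision rule can be sketched as a finite-horizon recursion, with the uncomputable Solomonoff mixture M swapped out for an explicit toy environment model. This is an illustrative simplification under stated assumptions, not Hutter's actual construction:

```python
def expectimax(history, depth, actions, model, utility):
    """Finite-horizon sketch of the expectimax decision rule, with the
    Solomonoff mixture replaced by an explicit
    model(history, action) -> [(percept, probability), ...].
    Percepts are (observation, reward) pairs, as in the post."""
    if depth == 0:
        return utility(history), None
    best_value, best_action = float("-inf"), None
    for a in actions:
        # expected value of taking a, then continuing optimally
        ev = sum(p * expectimax(history + [(a, x)], depth - 1,
                                actions, model, utility)[0]
                 for x, p in model(history, a))
        if ev > best_value:
            best_value, best_action = ev, a
    return best_value, best_action

# Toy deterministic environment: action "good" always yields reward 1.
toy_model = lambda history, action: [(("obs", 1 if action == "good" else 0), 1.0)]
total_reward = lambda history: sum(x[1] for _a, x in history)
```

Over a horizon of two steps this recursion picks "good" at every step; the same skeleton accepts any utility function of the whole history, which is the substitution discussed below.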

Utility vs rewards

So, why is it a bad idea to work with a reward system? Say the AIXI agent is rewarded whenever a human called Bob pushes a button. Then a sufficiently smart AIXI will figure out that instead of furthering Bob’s goals it can also threaten or deceive Bob into pushing the button, or get another human to replace Bob. On the other hand, if the reward is computed in a little box somewhere and then displayed on a screen, it might still be possible to reprogram the box or find a side channel attack. Intuitively you probably wouldn't even blame the agent for doing that -- people try to game the system all the time. 

You can visualize AIXI's computation as maximizing bars displayed on this screen; the agent is unable to connect the bars to any pattern in the environment, they are just there. It wants them to be as high as possible and it will utilize any means at its disposal. For a more detailed analysis of the problems arising through reinforcement learning, see Dewey (2011).

Is there a way to bind the optimization process to actual patterns in the environment? To design a framework in which the screen informs the agent about the patterns it should optimize for? The answer is yes: we can just define a utility function

$$u : (\mathcal{Y} \times \mathcal{X})^* \to \mathbb{R}$$

that assigns a value $u(yx_{1:m_k})$ to every possible future history $yx_{1:m_k}$ and use it to replace the reward system in the agent specification:

$$\dot{y}_k = \arg\max_{y_k} \sum_{x_k} \max_{y_{k+1}} \sum_{x_{k+1}} \cdots \max_{y_{m_k}} \sum_{x_{m_k}} u(yx_{1:m_k}) \, M(\underline{x}_{k:m_k} \mid \dot{y}\dot{x}_{<k}\, y_{k:m_k})$$

When I say "we can just define" I am actually referring to the really hard question of how to recognize and describe the patterns we value in the universe. Contrasted with the necessity to specify rewards in the original AIXI framework, this is a strictly harder problem, because the utility function has to be known ahead of time; and the reward system can always be represented in the framework of utility functions by setting

$$u(yx_{1:m_k}) = \sum_{i=k}^{m_k} r_i$$
For the same reasons, this is also a strictly safer approach.
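As a sanity check, the reduction just stated is trivial to write down. A minimal sketch (the list-of-pairs history representation is my own choice for illustration):

```python
# Illustrative: a reward-maximizing AIXI agent is the special case of a
# utility-based agent whose utility function simply sums the reward
# component of each percept.  A history is assumed to be a list of
# (action, (observation, reward)) pairs.

def reward_as_utility(history):
    """Utility function u that reproduces the reward criterion."""
    return sum(r for _action, (_obs, r) in history)
```

Any behavior the reward framework can express, this utility function expresses identically, which is the sense in which the utility approach is strictly more general.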

Infinite utility

The original AIXI framework must necessarily place upper and lower bounds on the rewards that are achievable, because the rewards are part of the perceptions and $\mathcal{X}$ is finite. The utility function approach does not have this problem, as the expected utility

$$E\left[u\right] = \sum_{x_{k:m_k}} u(yx_{1:m_k}) \, M(\underline{x}_{k:m_k} \mid \dot{y}\dot{x}_{<k}\, y_{k:m_k})$$

is always finite as long as we stick to a finite set of possible perceptions, even if the utility function is not bounded. Relaxing this constraint and allowing $\mathcal{X}$ to be infinite and the utility to be unbounded creates divergence of the expected utility (for a proof see de Blanc 2008). This closely corresponds to the question of how to be a consequentialist in an infinite universe, discussed by Bostrom (2011). The underlying problem here is that (using the standard approach to infinities) these expected utilities will become incomparable. One possible solution to this problem could be to use a larger subfield of the surreal numbers than $\mathbb{R}$, my favorite[2] so far being the Levi-Civita field generated by the infinitesimal $\varepsilon$:

$$\mathcal{N} = \left\{ \sum_{q \in \mathbb{Q}} a_q \varepsilon^q \;:\; a_q \in \mathbb{R},\ \{ q : a_q \neq 0 \} \text{ left-finite} \right\}$$

with the usual power-series addition and multiplication. Levi-Civita numbers can be written and approximated as

$$a_{q_1} \varepsilon^{q_1} + a_{q_2} \varepsilon^{q_2} + \cdots + a_{q_n} \varepsilon^{q_n}, \qquad q_1 < q_2 < \cdots < q_n$$

(see Berz 1996), which makes them suitable for representation on a computer using floating point arithmetic. If we allow the range of our utility function to be $\mathcal{N}$, we gain the possibility of generalizing the framework to work with an infinite set of possible perceptions, therefore allowing for continuous parameters. We also allow for a much broader set of utility functions, no longer excluding the assignment of infinite (or infinitesimal) utility to a single event. I recently met someone who argued convincingly that his (ideal) utility function assigns infinite negative utility to every time instant at which he is not alive, therefore making him prefer life to any finite but huge amount of suffering.

Note that finiteness of $\mathcal{Y}$ is still needed to guarantee the existence of actions with maximal expected utility, and the finite (but dynamic) horizon $m_k$ remains a very problematic assumption, as described in Legg (2008).

Modifiable utility functions

Any implementable approximation of AIXI implies a weakening of the underlying dualism. Now the agent's hardware is part of the environment and at least in the case of a powerful agent, it can no longer afford to neglect the effect its actions may have on its source code and data. One question that has been asked is whether AIXI can protect itself from harm. Hibbard (2012) shows that an agent similar to the one described above, equipped with the ability to modify its policy responsible for choosing future actions, would not do so, given that it starts out with the (meta-)policy to always use the optimal policy, and the additional constraint to change only if that leads to a strict improvement. Ring and Orseau (2011) study under which circumstances a universal agent would try to tamper with the sensory information it receives. They introduce the concept of a delusion box, a device that filters and distorts the perception data before it is written into the part of the memory that is read during the calculation of utility. 

A further complication to take into account is the possibility that the part of memory that contains the utility function may get rewritten, either by accident, by deliberate choice (programmers trying to correct a mistake), or in an attempt to wirehead. To analyze this further we will now consider what can happen if the screen flashes different goals in different time cycles. Let

$$u_k : (\mathcal{Y} \times \mathcal{X})^* \to \mathbb{R}$$

denote the utility function the agent will have at time k.

Even though we will only analyze instances in which the agent knows at time k which utility function $u_t$ it will have at future times $t > k$ (possibly depending on the actions taken before that), we note that for every fixed future history $yx_{1:m_k}$ the agent knows the utility function $u_{m_k}$ that is displayed on the screen, because the screen is part of its perception data.

This leads to three different agent models worthy of further investigation:

  • Agent 1 will optimize for the goals that are displayed on the screen right now and act as if it would continue to do so in the future. We describe this with the utility function $u^{(1)}_k(yx_{1:m_k}) = u_k(yx_{1:m_k})$.
  • Agent 2 will try to anticipate future changes to its utility function and maximize the utility it experiences at every time cycle as shown on the screen at that time. This is captured by $u^{(2)}_k(yx_{1:m_k}) = \sum_{i=k}^{m_k} u_i(yx_{1:i})$.
  • Agent 3 will, at time k, try to maximize the utility it derives in hindsight, displayed on the screen at the time horizon: $u^{(3)}_k(yx_{1:m_k}) = u_{m_k}(yx_{1:m_k})$.

Of course arbitrary mixtures of these are possible.

The type of wireheading that is of interest here is captured by the Simpleton Gambit described by Orseau and Ring (2011), a Faustian deal that offers the agent maximal utility in exchange for its willingness to be turned into a Simpleton that always takes the same default action at all future times. We will first consider a simplified version of this scenario: the Simpleton future, where the agent knows for certain that it will be turned into a Simpleton at time k+1, no matter what it does in the remaining time cycle. Assume that for all possible action-perception combinations the utility given by the current utility function is not maximal, i.e. $u_k(yx_{1:m_k}) < u_{\max}$ holds for all $yx_{1:m_k}$. Assume further that the agent's actions influence the future outcomes, at least from its current perspective; that is, for all actions $y_k \neq y'_k$ there exist perceptions $x_{k:m_k}$ with $M(\underline{x}_{k:m_k} \mid \dot{y}\dot{x}_{<k}\, y_k) \neq M(\underline{x}_{k:m_k} \mid \dot{y}\dot{x}_{<k}\, y'_k)$. Let $u_S$ be the Simpleton utility function, assigning equal but maximal utility $u_{\max}$ to all possible futures. While Agent 1 will optimize as before, not adapting its behavior to the knowledge that its utility function will change, Agent 3 will be paralyzed, having to rely on whatever method its implementation uses to break ties. Agent 2 on the other hand will try to maximize only the utility $u_k(yx_{1:k})$ of the one remaining cycle.

Now consider the actual Simpleton Gambit: at time k the agent gets to choose between changing, $y_k = c$, resulting in $u_t = u_S$ for all $t > k$, and $y_k = \neg c$ (not changing), leading to $u_t = u_k$ for all $t > k$. We assume that $y_k$ has no further effects on the environment. As before, Agent 1 will optimize for business as usual; whether or not it chooses to change depends entirely on whether the screen specifically mentions the memory pointer to the utility function.

Agent 2 will change if and only if the utility of changing compared to not changing, according to what the screen currently says, is strictly smaller than the comparative advantage of always having maximal utility in the future. That is, the current-cycle sacrifice

$$u_k(yx_{1:k-1}\, (\neg c)\, x_k) - u_k(yx_{1:k-1}\, c\, x_k)$$

is strictly less than

$$\sum_{i=k+1}^{m_k} \left( u_{\max} - u_i(yx_{1:i}) \right)$$
This seems quite analogous to humans, who sometimes tend to choose maximal bliss over future optimization power, especially if the optimization opportunities are meager anyhow. Many people do seem to choose their goals so as to maximize the happiness felt by achieving them at least some of the time; this is also advice that I have frequently encountered in self-help literature, e.g. here. Agent 3 will definitely change, as it only evaluates situations using its final utility function.
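The decision rules just described, and the contrast with Agent 3, can be put into a toy numeric form. The payoff parameters and function names below are invented purely for illustration; this is not the formal model, just its one-decision skeleton.

```python
# Toy sketch of the Simpleton Gambit decision.  Assumed setup: one remaining
# decision, `cycles` future time cycles, per-cycle payoffs under the current
# utility function, and a Simpleton utility u_max that is maximal.

def agent2_changes(u_change_now, u_keep_now, u_keep_future, u_max, cycles):
    """Agent 2 changes iff the immediate utility sacrifice is strictly
    smaller than the comparative advantage of maximal future utility."""
    sacrifice = u_keep_now - u_change_now
    advantage = cycles * (u_max - u_keep_future)
    return sacrifice < advantage

def agent3_changes(*_args):
    """Agent 3 judges everything by its final utility function, which after
    changing is the Simpleton function -- so it always changes."""
    return True

# With a small sacrifice and meager future prospects, Agent 2 takes the deal:
# sacrifice 5 < advantage 3 * (10 - 8) = 6.
took_deal = agent2_changes(0, 5, 8, 10, 3)
```

Raising `u_keep_future` toward `u_max` (good prospects without wireheading) flips Agent 2's choice, matching the observation that the gambit is most tempting when optimization opportunities are meager.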

Comparing the three proposed agents, we notice that Agent 1 is dynamically inconsistent: it will optimize for future opportunities that it predictably will not take later. Agent 3, on the other hand, will wirehead whenever possible (and we can reasonably assume that opportunities to do so will exist in even moderately complex environments). This leaves us with Agent model 2, and I invite everyone to point out its flaws.

[1] Dotted actions/perceptions, like $\dot{y}_i \dot{x}_i$, denote past events; underlined perceptions $\underline{x}_i$ denote random variables to be observed at future times.

[2] Bostrom (2011) proposes using hyperreal numbers, which rely heavily on the axiom of choice for the ultrafilter to be used and I don't see how those could be implemented.

The Criminal Stupidity of Intelligent People

-14 fare 27 July 2012 04:08AM

What always fascinates me when I meet a group of very intelligent people is the very elaborate bullshit that they believe in. The naive theory of intelligence I first posited when I was a kid was that intelligence is a tool to avoid false beliefs and find the truth. Surrounded by mediocre minds who held obviously absurd beliefs not only without the ability to coherently argue why they held these beliefs, but without the ability of even understanding basic arguments about them, I believed as a child that the vast amount of superstition and false beliefs in the world was due to people both being stupid and following the authority of insufficiently intelligent teachers and leaders. More intelligent people and people following more intelligent authorities would thus automatically hold better beliefs and avoid disproven superstitions. However, as a grown up, I got the opportunity to actually meet and mingle with a whole lot of intelligent people, including many whom I readily admit are vastly more intelligent than I am. And then I had to find that my naive theory of intelligence didn't hold water: intelligent people were just as prone as less intelligent people to believing in obviously absurd superstitions. Only their superstitions would be much more complex, elaborate, rich, and far reaching than an inferior mind's superstitions.

For instance, I remember a ride with an extremely intelligent and interesting man (RIP Bob Desmarets); he was describing his current pursuit, which struck me as a brilliant mathematical mind's version of mysticism: the difference was that instead of marveling at some trivial picture of an incarnate god like some lesser minds might have done, he was seeking some Ultimate Answer to the Universe in the branching structures of ever more complex algebras of numbers, real numbers, complex numbers, quaternions, octonions, and beyond, in ever higher dimensions (notably in relation to super-string theories). I have no doubt that there is something deep, and probably enlightening and even useful, in such theories, and I readily disqualify myself as to the ability to judge the contributions that my friend made to the topic from a technical point of view; no doubt they were brilliant in one way or another. Yet, the way he was talking about this topic immediately triggered the "crackpot" flag; he was looking there for much more than could possibly be found, and anyone (like me) capable of acknowledging being too stupid to fathom the Full Glory of these number structures yet able to find some meaning in life could have told him that no, this topic doesn't hold the key to The Ultimate Source of All Meaning in Life. Bob's intellectual quest, as exaggeratedly exalted as it might have been, and as interesting as it was to his own exceptional mind, was on the grand scale of things but some modestly useful research venue at best, and an inoffensive pastime at worst. Perhaps Bob could conceivably have used his vast intellect towards pursuits more useful to you and me; but we didn't own his mind, and we have no claims to lay on the wonders he could have created but failed to by putting his mind into one quest rather than another. First, Do No Harm. Bob didn't harm anyone, and his ideas certainly contained no hint of any harm to be done to anyone.

Unhappily, that is not always the case of every intelligent man's fantasies. Let's consider a discussion I was having recently, that prompted this article. Last week, I joined a dinner-discussion with a lesswrong meetup group: radical believers in rationality and its power to improve life in general and one's own life in particular. As you can imagine, the attendance was largely, though not exclusively, composed of male computer geeks. But then again, any club that accepts me as a member will probably be biased that way: birds of the feather flock together. No doubt, there are plenty of meetup groups with the opposite bias, gathering desperately non-geeky females to the almost exclusion of males. Anyway, the theme of the dinner was "optimal philanthropy", or how to give time and money to charities in a way that maximizes the positive impact of your giving. So far, so good.

But then, I found myself in a most disturbing private side conversation with the organizer, Jeff Kaufman (a colleague, I later found out), someone I strongly suspect of being in many ways saner and more intelligent than I am. While discussing utilitarian ways of evaluating charitable action, he at some point mentioned some quite intelligent acquaintance of his who believed that morality was about minimizing the suffering of living beings; from there, that acquaintance logically concluded that wiping out all life on earth with sufficient nuclear bombs (or with grey goo) in a surprise simultaneous attack would be the best possible way to optimize the world, though one would have to make triple sure of involving enough destructive power that not one single strand of life should survive or else the suffering would go on and the destruction would have been just gratuitous suffering. We all seemed to agree that this was an absurd and criminal idea, and that we should be glad the guy, brilliant as he may be, doesn't remotely have the ability to implement his crazy scheme; we shuddered though at the idea of a future super-human AI having this ability and being convinced of such theories.

That was not the disturbing part though. What tipped me off was when Jeff, taking the "opposite" stance of "happiness maximization" to the discussed acquaintance's "suffering minimization", seriously defended the concept of wireheading as a way that happiness may be maximized in the future: putting humans into vats where the pleasure centers of their brains will be constantly stimulated, possibly using force. Or perhaps instead of humans, using rats, or ants, or some brain cell cultures or perhaps nano-electronic simulations of such electro-chemical stimulations; in the latter cases, biological humans, being less-efficient forms of happiness substrate, would be done away with or at least not renewed as embodiments of the Holy Happiness to be maximized. He even wrote at least two blog posts on this theme: hedonic vs preference utilitarianism in the Context of Wireheading, and Value of a Computational Process. In the former, he admits to some doubts, but concludes that the ways a value system grounded on happiness differs from my intuitions are problems with my intuitions.

I expect that most people would, and rightfully so, find Jeff's ideas as well as his acquaintance's ideas to be ridiculous and absurd on their face; they would judge any attempt to use force to implement them as criminal, and they would consider their fantasized implementation to be the worst of possible mass murders. Of course, I also expect that most people would be incapable of arguing their case rationally against Jeff, who is much more intelligent, educated and knowledgeable in these issues than they are. And yet, though most of them would have to admit their lack of understanding and their absence of a rational response to his arguments, they'd be completely right in rejecting his conclusion and in refusing to hear his arguments, for he is indeed the sorely mistaken one, despite his vast intellectual advantages.

I wilfully defer any detailed rational refutation of Jeff's idea to some future article (can you without reading mine write a valuable one?). In this post, I rather want to address the meta-point of how to address the seemingly crazy ideas of our intellectual superiors. First, I will invoke the "conservative" principle (as I'll call it), well defended by Hayek (who is not a conservative): we must often reject the well-argued ideas of intelligent people, sometimes more intelligent than we are, sometimes without giving them a detailed hearing, and instead stand by our intuitions, traditions and secular rules, which are the stable fruit of millennia of evolution. We should not lightly reject those rules, certainly not without a clear testable understanding of why they were valid where they are known to have worked, and why they would cease to be in another context. Second, we should not hesitate to use a proxy in an eristic argument: if we are to bow to the superior intellect of our betters, it should not be without having pitted said presumed intellects against each other in a fair debate to find out if indeed there is a better whose superior arguments can convince the others or reveal their error. Last but not least, beyond mere conservatism or debate, mine is the Libertarian point: there is Universal Law, which everyone must respect, whereby peace between humans is possible inasmuch and only inasmuch as they don't initiate violence against other persons and their property. And as I have argued in another previous essay (hardscrapple), this generalizes to maintaining peace between sentient beings of all levels of intelligence, including any future AI that Jeff may be prone to consider. Whatever one's prevailing or dissenting opinions, the initiation of force is never to be allowed as a means to further any ends.
Rather than doubt his intuition, Jeff should have been tipped off that his theory was wrong or taken out of context by the very fact that it advocates or condones massive violation of this Universal Law. Criminal urges, mass-criminal at that, are a strong stench that should alert anyone that some ideas have gone astray, even when it might not be immediately obvious where exactly they started parting from the path of sanity.

Now, you might ask, it is good and well to poke fun at the crazy ideas that some otherwise intelligent people may hold; it may even allow one to wallow in a somewhat justified sense of intellectual superiority over people who otherwise are actually and objectively one's intellectual superiors. But is there a deeper point? Is it relevant what crazy ideas intellectuals hold, whether inoffensive or criminal? Sadly, it is. As John McCarthy put it, "Soccer riots kill at most tens. Intellectuals' ideological riots sometimes kill millions." Jeff's particular crazy idea may be mostly harmless: the criminal raptures of the overintelligent nerd, which are so elaborate as to be unfathomable to 99.9% of the population, are unlikely to ever spread to enough of the power elite to be implemented. That is, unless by some exceptional circumstance there is a short and brutal transition to power by some overfriendly AI programmed to follow such an idea. On the other hand, the criminal raptures of a majority of the more mediocre intellectual elite, when they further possess simple variants that can intoxicate the ignorant and stupid masses, are not just theoretically able to lead to mass murder, but have historically been the source of all large-scale mass murders so far; and these mass murders can be counted in hundreds of millions, over the XXth century only, just for Socialism. Nationalism, Islamism and Social-democracy (the attenuated strand of socialism that now reigns in Western "Democracies") count their victims in millions only. And every time, the most well-meaning of intellectuals build and spread the ideologies of these mass-murders. A little initial conceptual mistake, properly amplified, can do that.

And so I am reminded of the meetings of some communist cells that I attended out of curiosity when I was in high school. Indeed, trotskyites are very openly recruiting in "good" French high schools. It was amazing the kind of nonsensical crap that these obviously above-average adolescents could repeat. "The morale of the workers is low." Whoa. Or "the petite-bourgeoisie" is plotting this or that. Apparently, grossly cut social classes spanning millions of individuals act as one man, either afflicted with depression or making Machiavellian plans. Not that any of them knew much of either salaried workers or entrepreneurs except through one-sided socialist literature. If you think that the nonsense of the intellectual elite is inoffensive, consider what happens when some of them actually act on those nonsensical beliefs: you get terrorists who kill tens of people; when they lead ignorant masses, they end up killing millions of people in extermination camps or plain massacres. And when they take control of entire universities, and train generations of scholars, who teach generations of bureaucrats, politicians, journalists, then you suddenly find that all politicians agree on slowly implementing the same totalitarian agenda, one way or another.

If you think that control of universities by left-wing ideologists is just a French thing, consider how for instance, America just elected a president whose mentor and ghostwriter was the chief of a terrorist group made of Ivy League educated intellectuals, whose overriding concern about the country they claimed to rule was how to slaughter ten percent of its population in concentration camps. And then consider that the policies of this president's "right wing" opponent are indistinguishable from the policies of said president. The violent revolution has given way to the slow replacement of the elite, towards the same totalitarian ideals, coming to you slowly but relentlessly rather than through a single mass criminal event. Welcome to a world where the crazy ideas of intelligent people are imposed by force, cunning and superior organization upon a mass of less intelligent yet less crazy people.

Ideas have consequences. That's why everyone Needs Philosophy.

Crossposted from my livejournal: http://fare.livejournal.com/168376.html

Much-Better-Life Simulator™ - Sales Conversation

4 XiXiDu 19 June 2011 12:44PM

Related to: A Much Better Life?

Reply to: Why No Wireheading?

The Sales Conversation

Sales girl: Our Much-Better-Life Simulator™ is going to provide the most enjoyable life you could ever experience.

Customer: But it is a simulation, it is fake. I want the real thing, I want to live my real life.

Sales girl: We accounted for all possibilities and determined that the expected utility of your life outside of our Much-Better-Life Simulator™ is dramatically lower.

Customer: You don't know what I value and you can't make me value what I don't want. I told you that I value reality over fiction.

Sales girl: We accounted for that as well! Let me ask you how much utility you assign to one hour of ultimate well-being™, where 'ultimate' means the best possible satisfaction of all desirable bodily sensations a human body and brain is capable of experiencing?

Customer: Hmm, that's a tough question. I am not sure how to assign a certain amount of utility to it.

Sales girl: You say that you value reality more than what you call 'fiction'. But you nonetheless value fiction, right?

Customer: Yes of course, I love fiction. I read science fiction books and watch movies like most humans do.

Sales girl: Then how much more would you value one hour of ultimate well-being™ by other means compared to one hour of ultimate well-being™ that is the result of our Much-Better-Life Simulator™?

Customer: If you ask me like that, I would exchange ten hours in your simulator for one hour of real satisfaction, something that is the result of an actual achievement rather than your fake.

Sales girl: Thank you. Would you agree if I said that for you one hour outside, that is 10 times less satisfying, roughly equals one hour in our simulator?

Customer: Yes, for sure.

Sales girl: Then you should buy our product. Not only is it very unlikely for you to experience even a tenth of the ultimate well-being™ that we offer more than a few times per year, but our simulator also allows your brain to experience 20 times more perceptual data than you would be able to experience outside of it, all at a constant rate while experiencing ultimate well-being™. And we offer free upgrades that are expected to deliver exponential speed-ups and qualitative improvements for the next few decades.

Customer: Thanks, but no thanks. I'd rather enjoy the real thing.

Sales girl: But I showed you that our product easily outweighs the additional amount of utility you expected to experience outside of our simulator.

Customer: You just tricked me into this utility thing, I don't want to buy your product. Please leave me alone now.

Natural wireheadings: formal request.

-6 MrMind 30 May 2011 04:21PM

This post is a formal request to everybody in this forum, who are the most likely humans to produce an FAI in the future, that in the possible resulting utopia some activities will still be available, even if they are not based on challenges of increasing complexity. I call these activities natural wireheadings, and since it is my right to determine my own fun-generating activities, I hereby formally request that the simple pleasures listed below will still be available at *any* point in my future light cone, and that I will consider a dystopia any future which deprives me of these natural wireheadings (if anyone is still around caring for those things).
A (non-exhaustive) list of them includes the following:
- sex
- eating food/drinking
- feeling relieved for having emptied my bowels
- dancing
- the pleasure of physical activity (wood-carving, sculpting, running, etc)
- the rapture I feel when in the presence of a safe ancestral environment
- social laughter
- the pleasure of talking in a small, same-minded crowd
- listening to pop/metal/rap music
- the pleasure of resting when tired
- scratching an itch
-...

More will come!