## [Link] Probabilistic Programming and Bayesian Methods for Hackers

1 22 May 2017 09:15PM

## A majority coalition can lose a symmetric zero-sum game

10 26 January 2017 12:13PM

Just a neat little result I found when thinking about Jessica's recent post.

For n players, let $a_i$ be the action of player $i$ and $s_i(a_1, a_2, \ldots, a_n)$ the reward of player $i$ as a function of the actions of all players. Then the game is symmetric if for any permutation $p:\{1,\ldots,n\}\to\{1,\ldots,n\}$

$$s_i(a_1, a_2, \ldots, a_n) = s_{p(i)}(a_{p(1)}, a_{p(2)}, \ldots, a_{p(n)})$$

The game is zero-sum if the sum of the $s_i$ is always zero. Assume players can confer before choosing their actions.

Then it is possible for a majority coalition to strictly lose a zero-sum game, even in a deterministic game where they get to see their opponents' moves before choosing their own.

This seems counter-intuitive. After all, if one coalition has M players, the other has m players, with m<M, and there are no other players, how can the M players lose? Couldn't just m of the M players behave exactly as the smaller coalition, thus getting the same amount in expectation? The problem is the potential losses endured by the remaining M-m players.

For an example, consider the following 5 player colour game (it's a much simplified version of the game I came up with previously, proposed by cousin_it). Each player chooses one of two colours, blue or red. Then the players that selected the least commonly chosen colour(s) are the winners; the others are the losers. The losers pay 1 each, and this is split equally between the winners.

Then consider a coalition of three players, the triumvirate. The remaining two players - the duumvirate - choose different colours, red and blue. What can the triumvirate then do? If they all choose the same colour - say blue - then they each lose 1, and the duumvirate loses 1 (from its member that chose blue) and gains 4 (for the member that chose red, who is the sole winner and collects from all four losers), for a net gain of 3. If they split - say 2 blue, 1 red - then the two that chose blue each lose 1 and the one that chose red gains 3/2, so the triumvirate nets -1/2, while the duumvirate loses 1 (from its member that chose blue) and gains 3/2 > 1 (from the member that chose red), for a net gain of 1/2.

So the duumvirate can always win against the triumvirate.
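The case analysis above can be checked by enumerating every response open to the triumvirate. This is a minimal sketch of the game as described; the function name `payoffs` and the variable names are illustrative choices, not part of the original post.

```python
from fractions import Fraction
from itertools import product


def payoffs(colours):
    """Reward of each player, given the list of colours chosen."""
    counts = {c: colours.count(c) for c in set(colours)}
    fewest = min(counts.values())
    # Winners are the players on the least commonly chosen colour(s).
    winners = {i for i, c in enumerate(colours) if counts[c] == fewest}
    losers = len(colours) - len(winners)
    share = Fraction(losers, len(winners))  # losers pay 1 each, split equally
    return [share if i in winners else Fraction(-1) for i in range(len(colours))]


# The duumvirate (players 0 and 1) commits to one red and one blue.
DUUMVIRATE = ("red", "blue")

results = {}
for trio in product(("red", "blue"), repeat=3):
    rewards = payoffs(list(DUUMVIRATE) + list(trio))
    # Store (duumvirate total, triumvirate total) for this response.
    results[trio] = (sum(rewards[:2]), sum(rewards[2:]))

best_for_triumvirate = max(total for _, total in results.values())

for trio, (duo_total, trio_total) in results.items():
    print(trio, "duumvirate:", duo_total, "triumvirate:", trio_total)
print("best the triumvirate can do:", best_for_triumvirate)
```

Using `Fraction` keeps the 3/2 shares exact, so the totals can be compared without floating-point noise.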

Of course, it's possible for two members of the triumvirate to create a second duumvirate that will profit from the hapless third member. Feel free to add whatever political metaphor you think this fits.

## Larger games

Variations of this game can make Jessica's theorem 2 sharp. Let the minority coalition be of size m (the majority coalition is of size M = n-m = qm+r for some unique q and 0≤r<m). The actions are choosing from m colours; apart from that the game is the same as before. And, as before, the members of the minority coalition each choose a different colour.

Then an m-set is a collection of players that each chose a different colour. Split the players into as many disjoint m-sets as possible, with the minority coalition being one of them - say this gives q'+1 m-sets. There are r' remaining players from the majority coalition.

Note that we can consider that any loss from a member of an m-set is spread among the remaining members. That's because all winners are members of m-sets, and players that choose the same colour are interchangeable. So we can assign the loss from a member of an m-set as being an equivalent gain to the other members. Thus the m-sets only profit from the r' remaining players. And this profit is spread equally among the winners - hence equally among the m-sets.

Thus the majority coalition has a loss of r'/(q'+1), and minimises its loss by minimising r' and maximising q' - hence by setting q'=q and r'=r. Under these circumstances, the minority coalition wins r/(q+1) in total.
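As a sanity check on the r/(q+1) figure, one can brute-force the majority's best response for a few small cases. This is only a sketch under the rules stated above (m colours, minority members each on a different colour); the helper names `payoffs` and `minority_gain` are hypothetical, and checking small cases is of course not a proof.

```python
from fractions import Fraction
from itertools import product


def payoffs(colours):
    """Reward of each player, given the list of colours chosen."""
    counts = {c: colours.count(c) for c in set(colours)}
    fewest = min(counts.values())
    winners = {i for i, c in enumerate(colours) if counts[c] == fewest}
    losers = len(colours) - len(winners)
    share = Fraction(losers, len(winners))  # zero if nobody loses
    return [share if i in winners else Fraction(-1) for i in range(len(colours))]


def minority_gain(m, M):
    """Minority's total reward when the majority picks its best response."""
    minority = list(range(m))  # one minority member on each of the m colours
    gains = (
        sum(payoffs(minority + list(majority))[:m])
        for majority in product(range(m), repeat=M)
    )
    # The game is zero-sum, so the majority minimises the minority's gain.
    return min(gains)


checked = {}
for m, M in [(2, 3), (2, 4), (2, 5), (3, 4), (3, 5)]:
    q, r = divmod(M, m)
    checked[(m, M)] = (minority_gain(m, M), Fraction(r, q + 1))
    print(f"m={m}, M={M}: brute force {checked[(m, M)][0]}, r/(q+1) = {checked[(m, M)][1]}")
```

The enumeration grows as m^M, so this is only practical for very small coalitions.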

Adding 1 to the reward of each player, then dividing all rewards by n, gives the unit-sum game in Jessica's theorem.

## [Link] Funding the Reproducibility Crises as effective giving

9 24 January 2017 11:05PM

## Common Misconceptions about Dual Process Theories of Human Reasoning

12 19 March 2016 09:50PM

(This is mostly a summary of Evans (2012); the fifth misconception mentioned is original research, although I have high confidence in it.)

It seems that dual process theories of reasoning are often underspecified, so I will review some common misconceptions about these theories in order to ensure that everyone's beliefs about them are compatible. Briefly, the key distinction (and it seems, the distinction that implies the fewest assumptions) is the amount of demand that a given process places on working memory.

(And if you imagine what you actually use working memory for, then a consequence of this is that Type 2 processing always has a quality of 'cognitive decoupling' or 'counterfactual reasoning' or 'imagining of ways that things could be different', dynamically changing representations that remain static in Type 1 processing; the difference between a cached and non-cached thought, if you will. When you are transforming a Rubik's cube in working memory so that you don't have to transform it physically, this is an example of the kind of thing that I'm talking about from the outside.)

The first common confusion is that Type 1 and Type 2 refer to specific algorithms or systems within the human brain. It is a much stronger proposition, and not a widely accepted one, to assert that the two types of cognition refer to particular systems or algorithms within the human brain, as opposed to particular properties of information processing that we may identify with many different algorithms in the brain, characterized by the degree to which they place a demand on working memory.

The second and third common confusions, and perhaps the most widespread, are the assumptions that Type 1 processes and Type 2 processes can be reliably distinguished, if not defined, by their speed and/or accuracy. The easiest way to reject this is to say that the mistake of entering a quickly retrieved, unreliable input into a deliberative, reliable algorithm is not the same mistake as entering a quickly retrieved, reliable input into a deliberative, unreliable algorithm. To make a deliberative judgment based on a mere unreliable feeling is a different mistake from experiencing a reliable feeling and arriving at an incorrect conclusion through an error in deliberative judgment. It also seems easier to argue about the semantics of the 'inputs', 'outputs', and 'accuracy' of algorithms running on wetware, than it is to argue about the semantics of their demand on working memory and the life outcomes of the brains that execute them.

The fourth common confusion is that Type 1 processes involve 'intuitions' or 'naivety' and Type 2 processes involve thought about abstract concepts. You might describe a fast-and-loose rule that you made up as a 'heuristic' and naively think that it is thus a 'System 1 process', but it would still be the case that you invented that rule by deliberative means, and thus by means of a Type 2 process. When you applied the rule in the future it would be by means of a deliberative process that placed a demand on working memory, not by some behavior that is based on association or procedural memory, as if by habit. (Which is also not the same as making an association or performing a procedure that entails you choosing to use the deliberative rule, or finding a way to produce the same behavior that the deliberative rule originally produced by developing some sort of habit or procedural skill.) When facing novel situations, it is often the case that one must forgo association and procedure and thus use Type 2 processes, and this can make it appear as though the key distinction is abstractness, but this is only because there are often no clear associations to be made or procedures to be performed in novel situations. Abstractness is not a necessary condition for Type 2 processes.

The fifth common confusion is that language is the defining characteristic of Type 2 processing. Although language is often involved in Type 2 processing, this is likely a mere correlate of the processes by which we store and manipulate information in working memory, and not the defining characteristic per se. To elaborate, we are widely believed to store and manipulate auditory information in working memory by means of a 'phonological store' and an 'articulatory loop', and to store and manipulate visual information by means of a 'visuospatial sketchpad', so we may also consider the storage and processing in working memory of non-linguistic information in auditory or visuospatial form, such as musical tones, or mathematical symbols, or the possible transformations of a Rubik's cube, for example. The linguistic quality of much of the information that we store and manipulate in working memory is probably noncentral to a general account of the nature of Type 2 processes. Conversely, it is obvious that the production and comprehension of language is often an associative or procedural process, not a deliberative one. Otherwise you still might be parsing the first sentence of this article.

## MIRI Fundraiser: Why now matters

28 24 July 2015 10:38PM

Our summer fundraiser is ongoing. In the meantime, we're writing a number of blog posts to explain what we're doing and why, and to answer a number of common questions. Previous posts in the series are listed at the above link.

I'm often asked whether donations to MIRI now are more important than donations later. Allow me to deliver an emphatic yes: I currently expect that donations to MIRI today are worth much more than donations to MIRI in five years. As things stand, I would very likely take \$10M today over \$20M in five years.

That's a bold statement, and there are a few different reasons for this. First and foremost, there is a decent chance that some very big funders will start entering the AI alignment field over the course of the next five years. It looks like the NSF may start to fund AI safety research, and Stuart Russell has already received some money from DARPA to work on value alignment. It's quite possible that in a few years' time significant public funding will be flowing into this field.

(It's also quite possible that it won't, or that the funding will go to all the wrong places, as was the case with funding for nanotechnology. But if I had to bet, I would bet that it's going to be much easier to find funding for AI alignment research in five years' time).

In other words, the funding bottleneck is loosening — but it isn't loose yet.

We don't presently have the funding to grow as fast as we could over the coming months, or to run all the important research programs we have planned. At our current funding level, the research team can grow at a steady pace — but we could get much more done over the course of the next few years if we had the money to grow as fast as is healthy.

Which brings me to the second reason why funding now is probably much more important than funding later: because growth now is much more valuable than growth later.

There's an idea picking up traction in the field of AI: instead of focusing only on increasing the capabilities of intelligent systems, it is important to also ensure that we know how to build beneficial intelligent systems. Support is growing for a new paradigm within AI that seriously considers the long-term effects of research programs, rather than just the immediate effects. Years down the line, these ideas may seem obvious, and the AI community's response to these challenges may be in full swing. Right now, however, there is relatively little consensus on how to approach these issues — which leaves room for researchers today to help determine the field's future direction.

People at MIRI have been thinking about these problems for a long time, and that puts us in an unusually good position to influence the field of AI and ensure that some of the growing concern is directed towards long-term issues in addition to shorter-term ones. We can, for example, help avert a scenario where all the attention and interest generated by Musk, Bostrom, and others gets channeled into short-term projects (e.g., making drones and driverless cars safer) without any consideration for long-term risks that are more vague and less well-understood.

It's likely that MIRI will scale up substantially at some point; but if that process begins in 2018 rather than 2015, it is plausible that we will have already missed out on a number of big opportunities.

The alignment research program within AI is just now getting started in earnest, and it may even be funding-saturated in a few years' time. But it's nowhere near funding-saturated today, and waiting five or ten years to begin seriously ramping up our growth would likely give us far fewer opportunities to shape the methodology and research agenda within this new AI paradigm. The projects MIRI takes on today can make a big difference years down the line, and supporting us today will drastically affect how much we can do quickly. Now matters.

I encourage you to donate to our ongoing fundraiser if you'd like to help us grow!

This post is cross-posted from the MIRI blog.

## Calling all Nigerian rationalists and effective altruists

21 03 May 2015 10:31PM

I'm in Lagos, Nigeria till the end of May and I'd like to hold a LessWrong/EA meetup while I'm here. If you'll ever be in the country in the future (or in the subcontinent), please get in touch so we can coordinate a meetup. I'd also appreciate being put in contact with any Nigerians who may not regularly read this list.

## What do you mean by Pascal's mugging?

4 20 November 2014 04:38PM

Some people[1] are now using the term Pascal's mugging as a label for any scenario with a large associated payoff and a small or unstable probability estimate, a combination that can trigger the absurdity heuristic.

Consider the scenarios listed below: (a) Do these scenarios have something in common? (b) Are any of these scenarios cases of Pascal's mugging?

(1) Fundamental physical operations -- atomic movements, electron orbits, photon collisions, etc. -- could collectively deserve significant moral weight. The total number of atoms or particles is huge: even assigning a tiny fraction of human moral consideration to them or a tiny probability of them mattering morally will create a large expected moral value. [Source]

(2) Cooling something to a temperature close to absolute zero might be an existential risk. Given our ignorance we cannot rationally give zero probability to this possibility, and probably not even give it less than 1% (since that is about the natural lowest error rate of humans on anything). Anybody saying it is less likely than one in a million is likely very overconfident. [Source]

(3) GMOs might introduce “systemic risk” to the environment. The chance of ecocide, or the destruction of the environment and potentially humans, increases incrementally with each additional transgenic trait introduced into the environment. The downside risks are so hard to predict -- and so potentially bad -- that it is better to be safe than sorry. The benefits, no matter how great, do not merit even a tiny chance of an irreversible, catastrophic outcome. [Source]

(4) Each time you say abracadabra, 3^^^^3 simulations of humanity experience a positive singularity.

If you read up on any of the first three scenarios, by clicking on the provided links, you will notice that there are a bunch of arguments in support of these conjectures. And yet I feel that all three have something important in common with scenario four, which I would call a clear case of Pascal's mugging.

I offer three possibilities of what these and similar scenarios have in common:

• Probability estimates of the scenario are highly unstable and highly divergent between informed people who spent a similar amount of resources researching it.
• The scenario demands that skeptics either falsify it or accept its decision-relevant consequences. The scenario is, however, either unfalsifiable by definition, too vague, or almost impossibly difficult to falsify.
• There is no or very little direct empirical evidence in support of the scenario.[2]

In any case, I admit that it is possible that I just wanted to bring the first three scenarios to your attention. I stumbled upon each very recently and found them to be highly..."amusing".

[1] I am also guilty of doing this. But what exactly is wrong with using the term in that way? What's the highest probability for which the term is still applicable? Can you offer a better term?

[2] One would have to define what exactly counts as "direct empirical evidence". But I think that it is pretty intuitive that there exists a meaningful difference between the risk of an asteroid that has been spotted with telescopes and a risk that is solely supported by a priori arguments.

## Is the potential astronomical waste in our universe too small to care about?

22 21 October 2014 08:44AM

In the not too distant past, people thought that our universe might be capable of supporting an unlimited amount of computation. Today our best guess at the cosmology of our universe is that it stops being able to support any kind of life or deliberate computation after a finite amount of time, during which only a finite amount of computation can be done (on the order of something like 10^120 operations).

Consider two hypothetical people, Tom, a total utilitarian with a near zero discount rate, and Eve, an egoist with a relatively high discount rate, a few years ago when they thought there was .5 probability the universe could support doing at least 3^^^3 ops and .5 probability the universe could only support 10^120 ops. (These numbers are obviously made up for convenience and illustration.) It would have been mutually beneficial for these two people to make a deal: if it turns out that the universe can only support 10^120 ops, then Tom will give everything he owns to Eve, which happens to be \$1 million, but if it turns out the universe can support 3^^^3 ops, then Eve will give \$100,000 to Tom. (This may seem like a lopsided deal, but Tom is happy to take it since the potential utility of a universe that can do 3^^^3 ops is so great for him that he really wants any additional resources he can get in order to help increase the probability of a positive Singularity in that universe.)
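The arithmetic behind the deal can be sketched in a few lines. This is only an illustration under strong assumptions that are not in the original: both parties value dollars linearly, Eve's discount rate is ignored, and Tom's preference is summarised by a single hypothetical number `weight_big`, meaning how many times more a dollar is worth to him in the 3^^^3-ops universe than in the 10^120-ops one.

```python
P_BIG = 0.5            # probability the universe supports 3^^^3 ops (from the text)
TOM_WINS = 100_000     # Eve pays Tom this much in the big universe
TOM_LOSES = 1_000_000  # Tom pays Eve this much in the small universe


def tom_expected_gain(weight_big):
    """Tom's expected gain, measured in small-universe dollars.

    weight_big is a hypothetical parameter: the value Tom places on a
    dollar in the big universe relative to a dollar in the small one.
    """
    return P_BIG * weight_big * TOM_WINS - (1 - P_BIG) * TOM_LOSES


# Eve is assumed to value a dollar equally in both universes,
# so her expected gain is just the expected dollar transfer.
eve_expected_gain = (1 - P_BIG) * TOM_LOSES - P_BIG * TOM_WINS

# Tom breaks even when the two terms of his expected gain cancel.
break_even_weight = (1 - P_BIG) * TOM_LOSES / (P_BIG * TOM_WINS)

print("Eve's expected gain in dollars:", eve_expected_gain)
print("Tom accepts if weight_big exceeds:", break_even_weight)
```

Under these assumptions the deal is attractive to Eve in expected dollars, and attractive to Tom as soon as he values big-universe dollars more than ten times as much as small-universe dollars, which the post suggests is easily the case for him.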

You and I are not total utilitarians or egoists, but instead are people with moral uncertainty. Nick Bostrom and Toby Ord proposed the Parliamentary Model for dealing with moral uncertainty, which works as follows:

Suppose that you have a set of mutually exclusive moral theories, and that you assign each of these some probability.  Now imagine that each of these theories gets to send some number of delegates to The Parliament.  The number of delegates each theory gets to send is proportional to the probability of the theory.  Then the delegates bargain with one another for support on various issues; and the Parliament reaches a decision by the delegates voting.  What you should do is act according to the decisions of this imaginary Parliament.

It occurred to me recently that in such a Parliament, the delegates would make deals similar to the one between Tom and Eve above, where they would trade their votes/support in one kind of universe for votes/support in another kind of universe. If I had a Moral Parliament active back when I thought there was a good chance the universe could support unlimited computation, all the delegates that really care about astronomical waste would have traded away their votes in the kind of universe where we actually seem to live for votes in universes with a lot more potential astronomical waste. So today my Moral Parliament would be effectively controlled by delegates that care little about astronomical waste.

I actually still seem to care about astronomical waste (even if I pretend that I am certain that the universe can only do at most 10^120 operations). (Either my Moral Parliament wasn't active back then, or my delegates weren't smart enough to make the appropriate deals.) Should I nevertheless follow UDT-like reasoning and conclude that I should act as if they had made such deals, and therefore I should stop caring about the relatively small amount of astronomical waste that could occur in our universe? If the answer to this question is "no", what about the future going forward, given that there is still uncertainty about cosmology and the nature of physical computation? Should the delegates to my Moral Parliament be making these kinds of deals from now on?

## Solstice 2014 - Kickstarter and Megameetup

18 10 October 2014 05:55PM

## Summary:

• We're running another Winter Solstice kickstarter - this is to fund the venue, musicians, food, drink and decorations for a big event in NYC on December 20th, as well as to record more music and print a larger run of the Solstice Book of Traditions.
• I'd also like to raise additional money so I can focus full time for the next couple months on helping other communities run their own version of the event, tailored to meet their particular needs while still feeling like part of a cohesive, broader movement - and giving the attendees a genuinely powerful experience.

## The Beginning

Four years ago, twenty NYC rationalists gathered in a room to celebrate the Winter Solstice. We sang songs and told stories about things that seemed very important to us. The precariousness of human life. The thousands of years of labor and curiosity that led us from a dangerous stone age to the modern world. The potential to create something even better, if humanity can get our act together and survive long enough.

One of the most important ideas we honored was the importance of facing truths, even when they are uncomfortable or make us feel silly or are outright terrifying. Over the evening, we gradually extinguished candles, acknowledging harsher and harsher elements of reality.

Until we sat in absolute darkness - aware that humanity is flawed, and alone, in an unforgivingly neutral universe.

But also aware that we sit beside people who care deeply about truth, and about our future. Aware that across the world, people are working to give humanity a bright tomorrow, and that we have the power to help. Aware that across history, people have looked impossible situations in the face, and through ingenuity and perspiration, made the impossible happen.

That seemed worth celebrating.

## The Story So Far

As it turned out, this resonated with people outside the rationality community. When we ran the event again in 2012, non-religious but non-Less Wrong people attended the event and told me they found it very moving. In 2013, we pushed it much larger - I ran a kickstarter campaign to fund a big event in NYC.

A hundred and fifty people from various communities attended. From Less Wrong in particular, we had groups from Boston, San Francisco, North Carolina, Ottawa, and Ohio among other places. The following day was one of the largest East Coast Megameetups.

Meanwhile, in the Bay Area, several people put together an event that gathered around 80 attendees. In Boston, Vancouver, and Leipzig, Germany, people ran smaller events. This is shaping up to take root as a legitimate holiday, celebrating human history and our potential future.

This year, we want to do that all again. I also want to dedicate more time to helping other people run their events. Getting people to start celebrating a new holiday is a tricky feat. I've learned a lot about how to go about that and want to help others run polished events that feel connecting and inspirational.

## So, what's happening, and how can you help?

• The Big Solstice itself will be Saturday, December 20th at 7:00 PM. To fund it, we're aiming to raise \$7500 on kickstarter. This is enough to fund the aforementioned venue, food, drink, and live musicians, to record new music, and to print a larger run of the Solstice Book of Traditions. It'll also pay some expenses for the Megameetup. Please consider contributing to the kickstarter.
• If you'd like to host your own Solstice (either a large or a private one) and would like advice, please contact me at raemon777@gmail.com and we'll work something out.
• There will also be Solstices (of varying sizes) run by Less Wrong / EA folk held in the Bay Area, Seattle, Boston and Leipzig. (There will probably be a larger but non-LW-centered Solstice in Los Angeles and Boston as well).
• In NYC, there will be a Rationality and EA Megameetup running from Friday, Dec 19th through Sunday evening.
  • Friday night and Saturday morning: Arrival, Settling
  • Saturday at 2PM - 4:30PM: Unconference (20 minute talks, workshops or discussions)
  • Saturday at 7PM: Big Solstice
  • Sunday at Noon: Unconference 2
  • Sunday at 2PM: Strategic New Years Resolution Planning
  • Sunday at 3PM: Discussion of creating private ritual for individual communities
• If you're interested in coming to the Megameetup, please fill out this form saying how many people you're bringing, whether you're interested in giving a talk, and whether you're bringing a vehicle, so we can plan adequately. (We have lots of crash space, but not infinite bedding, so bringing sleeping bags or blankets would be helpful)

## Effective Altruism?

Now, at Less Wrong we like to talk about how to spend money effectively, so I should be clear about a few things. I'm raising non-trivial money for this, but this should be coming out of people's Warm Fuzzies Budgets, not their Effective Altruism budgets. This is a big, end of the year community feel-good festival.

That said, I do think this is an especially important form of Warm Fuzzies. I've had EA-type folk come to me and tell me the Solstice inspired them to work harder, make life changes, or that it gave them an emotional booster charge to keep going even when things were hard. I hope, eventually, to have this measurable in some fashion such that I can point to it and say "yes, this was important, and EA folk should definitely consider it important."

But I'm not especially betting on that, and there are some failure modes where the Solstice ends up cannibalizing more resources that could have gone towards direct impact. So, please consider that this may be especially valuable entertainment, that pushes culture in a direction where EA ideas can go more mainstream and gives hardcore EAs a motivational boost. But I encourage you to support it with dollars that wouldn't have gone towards direct Effective Altruism.

## Goal retention discussion with Eliezer

56 04 September 2014 10:23PM

Although I feel that Nick Bostrom’s new book “Superintelligence” is generally awesome and a well-needed milestone for the field, I do have one quibble: both he and Steve Omohundro appear to be more convinced than I am by the assumption that an AI will naturally tend to retain its goals as it reaches a deeper understanding of the world and of itself. I’ve written a short essay on this issue from my physics perspective, available at http://arxiv.org/pdf/1409.0813.pdf.

Eliezer Yudkowsky just sent the following extremely interesting comments, and told me he was OK with me sharing them here to spur a broader discussion of these issues, so here goes.

On Sep 3, 2014, at 17:21, Eliezer Yudkowsky <yudkowsky@gmail.com> wrote:

Hi Max!  You're asking the right questions.  Some of the answers we can
give you, some we can't, few have been written up and even fewer in any
well-organized way.  Benja or Nate might be able to expound in more detail
while I'm in my seclusion.

Very briefly, though:
The problem of utility functions turning out to be ill-defined in light of
new discoveries of the universe is what Peter de Blanc named an
"ontological crisis" (not necessarily a particularly good name, but it's
what we've been using locally).

http://intelligence.org/files/OntologicalCrises.pdf

The way I would phrase this problem now is that an expected utility
maximizer makes comparisons between quantities that have the type
"expected utility conditional on an action", which means that the AI's
utility function must be something that can assign utility-numbers to the
AI's model of reality, and these numbers must have the further property
that there is some computationally feasible approximation for calculating
expected utilities relative to the AI's probabilistic beliefs.  This is a
constraint that rules out the vast majority of all completely chaotic and
uninteresting utility functions, but does not rule out, say, "make lots of
paperclips".

Models also have the property of being Bayes-updated using sensory
information; for the sake of discussion let's also say that models are
about universes that can generate sensory information, so that these
models can be probabilistically falsified or confirmed.  Then an
"ontological crisis" occurs when the hypothesis that best fits sensory
information corresponds to a model that the utility function doesn't run
on, or doesn't detect any utility-having objects in.  The example of
"immortal souls" is a reasonable one.  Suppose we had an AI that had a
naturalistic version of a Solomonoff prior, a language for specifying
universes that could have produced its sensory data.  Suppose we tried to
give it a utility function that would look through any given model, detect
things corresponding to immortal souls, and value those things.  Even if
the immortal-soul-detecting utility function works perfectly (it would in
fact detect all immortal souls) this utility function will not detect
anything in many (representations of) universes, and in particular it will
not detect anything in the (representations of) universes we think have
most of the probability mass for explaining our own world.  In this case
the AI's behavior is undefined until you tell me more things about the AI;
an obvious possibility is that the AI would choose most of its actions
based on low-probability scenarios in which hidden immortal souls existed
that its actions could affect.  (Note that even in this case the utility
function is stable!)

Since we don't know the final laws of physics and could easily be
surprised by further discoveries in the laws of physics, it seems pretty
clear that we shouldn't be specifying a utility function over exact
physical states relative to the Standard Model, because if the Standard
Model is even slightly wrong we get an ontological crisis.  Of course
there are all sorts of extremely good reasons we should not try to do this
anyway, some of which are touched on in your draft; there just is no
simple function of physics that gives us something good to maximize.  See
also Complexity of Value, Fragility of Value, indirect normativity, the
whole reason for a drive behind CEV, and so on.  We're almost certainly
going to be using some sort of utility-learning algorithm, the learned
utilities are going to bind to modeled final physics by way of modeled
higher levels of representation which are known to be imperfect, and we're
going to have to figure out how to preserve the model and learned
utilities through shifts of representation.  E.g., the AI discovers that
humans are made of atoms rather than being ontologically fundamental
humans, and furthermore the AI's multi-level representations of reality
evolve to use a different sort of approximation for "humans", but that's
okay because our utility-learning mechanism also says how to re-bind the
learned information through an ontological shift.

This sorta thing ain't going to be easy which is the other big reason to
start working on it well in advance.  I point out however that this
doesn't seem unthinkable in human terms.  We discovered that brains are
made of neurons but were nonetheless able to maintain an intuitive grasp
on what it means for them to be happy, and we don't throw away all that
info each time a new physical discovery is made.  The kind of cognition we
want does not seem inherently self-contradictory.

Three other quick remarks:

*)  The Omohundrian/Yudkowskian argument is not that we can take an arbitrary
stupid young AI and it will be smart enough to self-modify in a way that
preserves its values, but rather that most AIs that don't self-destruct
will eventually end up at a stable fixed-point of coherent
consequentialist values.  This could easily involve a step where, e.g., an
AI that started out with a neural-style delta-rule policy-reinforcement
learning algorithm, or an AI that started out as a big soup of
self-modifying heuristics, is "taken over" by whatever part of the AI
first learns to do consequentialist reasoning about code.  But this
process doesn't repeat indefinitely; it stabilizes when there's a
consequentialist self-modifier with a coherent utility function that can
precisely predict the results of self-modifications.  The part where this
does happen to an initial AI that is under this threshold of stability is
a big part of the problem of Friendly AI and it's why MIRI works on tiling
agents and so on!

*)  Natural selection is not a consequentialist, nor is it the sort of
consequentialist that can sufficiently precisely predict the results of
modifications that the basic argument should go through for its stability.
It built humans to be consequentialists that would value sex, not value
inclusive genetic fitness, and not value being faithful to natural
selection's optimization criterion.  Well, that's dumb, and of course the
result is that humans don't optimize for inclusive genetic fitness.
Natural selection was just stupid like that.  But that doesn't mean
there's a generic process whereby an agent rejects its "purpose" in the
light of exogenously appearing preference criteria.  Natural selection's
anthropomorphized "purpose" in making human brains is just not the same as
the cognitive purposes represented in those brains.  We're not talking
about spontaneous rejection of internal cognitive purposes based on their
causal origins failing to meet some exogenously-materializing criterion of
validity.  Our rejection of "maximize inclusive genetic fitness" is not an
exogenous rejection of something that was explicitly represented in us,
that we were explicitly being consequentialists for.  It's a rejection of
something that was never an explicitly represented terminal value in the
first place.  Similarly the stability argument for sufficiently advanced
self-modifiers doesn't go through a step where the successor form of the
AI reasons about the intentions of the previous step and respects them
apart from its constructed utility function.  So the lack of any universal
preference of this sort is not a general obstacle to stable
self-improvement.

*)  The case of natural selection does not illustrate a universal
computational constraint; it illustrates something that we could
anthropomorphize as a foolish design error.  Consider humans building Deep
Blue.  We built Deep Blue to attach a sort of default value to queens and
central control in its position evaluation function, but Deep Blue is
still perfectly able to sacrifice queens and central control alike if the
position reaches a checkmate thereby.  In other words, although an agent
needs crystallized instrumental goals, it is also perfectly reasonable to
have an agent which never knowingly sacrifices the terminally defined
utilities for the crystallized instrumental goals if the two conflict;
indeed "instrumental value of X" is simply "probabilistic belief that X
leads to terminal utility achievement", which is sensibly revised in the
presence of any overriding information about the terminal utility.  To put
it another way, in a rational agent, the only way a loose generalization
about instrumental expected-value can conflict with and trump terminal
actual-value is if the agent doesn't know it, i.e., it does something that
it reasonably expected to lead to terminal value, but it was wrong.
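
As a rough illustration of the Deep Blue point, here is a minimal sketch
of an evaluator in which a heuristic "material" score stands in for
crystallized instrumental value and is consulted only when no terminal
utility is known.  Everything in it (the Node class, the WIN constant, the
toy two-move tree, the single-agent search with no opponent) is a
hypothetical simplification for this note, not how Deep Blue was actually
implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

WIN = 1000.0  # assumed terminal utility of checkmate (arbitrary scale)


@dataclass
class Node:
    name: str
    material: float = 0.0             # heuristic, instrumental estimate
    terminal: Optional[float] = None  # terminal utility, when it is known
    children: List["Node"] = field(default_factory=list)


def value(node: Node) -> float:
    """Known terminal utility overrides everything; otherwise look ahead;
    fall back on the heuristic only at leaves with no terminal information."""
    if node.terminal is not None:
        return node.terminal
    if node.children:
        return max(value(child) for child in node.children)
    return node.material


def choose(node: Node) -> Node:
    """Pick the child with the highest value."""
    return max(node.children, key=value)


def build_example(mate_visible: bool) -> Node:
    """Toy position: keep the queen, or sacrifice it (possibly for mate)."""
    keep = Node("keep_queen", material=9.0)
    sacrifice = Node("sacrifice_queen", material=-9.0)
    if mate_visible:
        # The search can see that the sacrifice leads to checkmate.
        sacrifice.children.append(Node("checkmate", terminal=WIN))
    return Node("root", children=[keep, sacrifice])


if __name__ == "__main__":
    print(choose(build_example(mate_visible=True)).name)
    print(choose(build_example(mate_visible=False)).name)
```

When the mate is visible the agent gives up the queen; when it is not, the
material heuristic decides and the queen is kept.  The heuristic never
trumps a known terminal value, which is the sense in which the
instrumental value is just a revisable belief about terminal utility.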

This has been very off-the-cuff and I think I should hand this over to
Nate or Benja if further replies are needed, if that's all right.
