Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

Do Virtual Humans deserve human rights?

-3 cameroncowan 11 September 2014 07:20PM

Slate Article


I think the idea of storing our minds in a machine so that we can keep on "living" (and I use that term loosely) is fascinating, and it is certainly an oft-discussed topic around here. However, in thinking about keeping our brains on a hard drive, we have to think about rights and how that all works together. Indeed, the technology may be here before we know it, so I think it's important to think about mindclones. If I create a little version of myself that can answer my emails for me, can I delete him when I'm done with him, or just turn him in for a new model like I do with iPhones?


I look forward to the discussion.


Omission vs commission and conservation of expected moral evidence

2 Stuart_Armstrong 08 September 2014 02:22PM

Consequentialism traditionally doesn't distinguish between acts of commission and acts of omission. Not flipping the lever to the left is equivalent to flipping it to the right.

But there seems one clear case where the distinction is important. Consider a moral learning agent. It must act in accordance with human morality and desires, which it is currently unclear about.

For example, it may consider whether to forcibly wirehead everyone. If it does so, then everyone will agree, for the rest of their existence, that the wireheading was the right thing to do. Therefore across the whole future span of human preferences, humans agree that wireheading was correct, apart from a very brief period of objection in the immediate future. Given that human preferences are known to be inconsistent, this seems to imply that forcible wireheading is the right thing to do (if you happen to personally approve of forcible wireheading, replace that example with some other forcible rewriting of human preferences).

What went wrong there? Well, this doesn't respect "conservation of moral evidence": the AI got the moral values it wanted, but only through the actions it took. This is very close to the omission/commission distinction. We'd want the AI not to take actions (commission) that determine the (expectation of the) moral evidence it gets. Instead, we'd want the moral evidence to accrue "naturally", without interference and manipulation from the AI (omission).
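To make the idea concrete, here is a minimal toy sketch in Python. It is not a formalism from this post: the hypothesis H ("humans approve of wireheading"), the binary evidence E ("humans say they approve"), and all the numbers are assumptions chosen for illustration. It shows that when evidence arrives at its natural frequency, the expected posterior equals the prior, whereas an agent that forces the evidence and then updates naively shifts its expected posterior just by acting.

```python
from fractions import Fraction


def posterior(prior, like_h, like_not_h, observed):
    """P(H | E=observed) by Bayes' rule for a binary evidence variable E."""
    p_e_h = like_h if observed else 1 - like_h
    p_e_not_h = like_not_h if observed else 1 - like_not_h
    joint_h = prior * p_e_h
    return joint_h / (joint_h + (1 - prior) * p_e_not_h)


def expected_posterior(prior, like_h, like_not_h, prob_e_true):
    """Average posterior when E=True actually occurs with probability prob_e_true."""
    return (prob_e_true * posterior(prior, like_h, like_not_h, True)
            + (1 - prob_e_true) * posterior(prior, like_h, like_not_h, False))


# Illustrative numbers only (assumptions, not taken from the post).
prior = Fraction(1, 2)            # P(H)
like_h = Fraction(9, 10)          # P(E=True | H)
like_not_h = Fraction(1, 5)       # P(E=True | not H)

# Omission: evidence accrues "naturally", at the frequency the model predicts.
natural_rate = prior * like_h + (1 - prior) * like_not_h
passive = expected_posterior(prior, like_h, like_not_h, natural_rate)

# Commission: the agent's action forces E=True, but it updates as if E were natural.
manipulated = expected_posterior(prior, like_h, like_not_h, Fraction(1))

print(passive, manipulated)
```

In the passive case the expected posterior stays at the prior; in the manipulated case it rises above it, which is the sense in which the agent "got the moral values it wanted" through its own actions.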

Goal retention discussion with Eliezer

54 MaxTegmark 04 September 2014 10:23PM

Although I feel that Nick Bostrom’s new book “Superintelligence” is generally awesome and a well-needed milestone for the field, I do have one quibble: both he and Steve Omohundro appear to be more convinced than I am by the assumption that an AI will naturally tend to retain its goals as it reaches a deeper understanding of the world and of itself. I’ve written a short essay on this issue from my physics perspective, available at http://arxiv.org/pdf/1409.0813.pdf.

Eliezer Yudkowsky just sent the following extremely interesting comments, and told me he was OK with me sharing them here to spur a broader discussion of these issues, so here goes.

On Sep 3, 2014, at 17:21, Eliezer Yudkowsky <yudkowsky@gmail.com> wrote:

Hi Max!  You're asking the right questions.  Some of the answers we can
give you, some we can't, few have been written up and even fewer in any
well-organized way.  Benja or Nate might be able to expound in more detail
while I'm in my seclusion.

Very briefly, though:
The problem of utility functions turning out to be ill-defined in light of
new discoveries of the universe is what Peter de Blanc named an
"ontological crisis" (not necessarily a particularly good name, but it's
what we've been using locally).


The way I would phrase this problem now is that an expected utility
maximizer makes comparisons between quantities that have the type
"expected utility conditional on an action", which means that the AI's
utility function must be something that can assign utility-numbers to the
AI's model of reality, and these numbers must have the further property
that there is some computationally feasible approximation for calculating
expected utilities relative to the AI's probabilistic beliefs.  This is a
constraint that rules out the vast majority of all completely chaotic and
uninteresting utility functions, but does not rule out, say, "make lots of

Models also have the property of being Bayes-updated using sensory
information; for the sake of discussion let's also say that models are
about universes that can generate sensory information, so that these
models can be probabilistically falsified or confirmed.  Then an
"ontological crisis" occurs when the hypothesis that best fits sensory
information corresponds to a model that the utility function doesn't run
on, or doesn't detect any utility-having objects in.  The example of
"immortal souls" is a reasonable one.  Suppose we had an AI that had a
naturalistic version of a Solomonoff prior, a language for specifying
universes that could have produced its sensory data.  Suppose we tried to
give it a utility function that would look through any given model, detect
things corresponding to immortal souls, and value those things.  Even if
the immortal-soul-detecting utility function works perfectly (it would in
fact detect all immortal souls) this utility function will not detect
anything in many (representations of) universes, and in particular it will
not detect anything in the (representations of) universes we think have
most of the probability mass for explaining our own world.  In this case
the AI's behavior is undefined until you tell me more things about the AI;
an obvious possibility is that the AI would choose most of its actions
based on low-probability scenarios in which hidden immortal souls existed
that its actions could affect.  (Note that even in this case the utility
function is stable!)
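
As a rough illustration of the immortal-souls case above, here is a toy Python sketch. The model names, probabilities, action names and payoffs are all made up for the example, and it ignores everything hard about Solomonoff-style priors. It only shows an expected utility maximizer comparing "expected utility conditional on an action" when the utility function detects nothing in the most probable model.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """One hypothesis about the universe, with its probability."""
    name: str
    probability: float
    # Hypothetical payoff table: souls preserved by each action if this model is true.
    souls_preserved: dict = field(default_factory=dict)


def utility(model, action):
    """Soul-detecting utility: it finds nothing to value in soul-free models."""
    return float(model.souls_preserved.get(action, 0))


def expected_utility(models, action):
    """Expected utility conditional on an action, under the AI's beliefs."""
    return sum(m.probability * utility(m, action) for m in models)


def best_action(models, actions):
    """Pick the action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(models, a))


physicalist = Model("no souls anywhere", 0.999)
hidden_souls = Model("hidden immortal souls", 0.001,
                     {"build_soul_shelter": 10, "do_nothing": 0})
actions = ["do_nothing", "build_soul_shelter"]

choice = best_action([physicalist, hidden_souls], actions)
print(choice)
```

Under these assumed numbers the choice is driven entirely by the low-probability model, because it is the only one in which the utility function detects anything. The utility function itself never changes.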

Since we don't know the final laws of physics and could easily be
surprised by further discoveries in the laws of physics, it seems pretty
clear that we shouldn't be specifying a utility function over exact
physical states relative to the Standard Model, because if the Standard
Model is even slightly wrong we get an ontological crisis.  Of course
there are all sorts of extremely good reasons we should not try to do this
anyway, some of which are touched on in your draft; there just is no
simple function of physics that gives us something good to maximize.  See
also Complexity of Value, Fragility of Value, indirect normativity, the
whole reason for a drive behind CEV, and so on.  We're almost certainly
going to be using some sort of utility-learning algorithm, the learned
utilities are going to bind to modeled final physics by way of modeled
higher levels of representation which are known to be imperfect, and we're
going to have to figure out how to preserve the model and learned
utilities through shifts of representation.  E.g., the AI discovers that
humans are made of atoms rather than being ontologically fundamental
humans, and furthermore the AI's multi-level representations of reality
evolve to use a different sort of approximation for "humans", but that's
okay because our utility-learning mechanism also says how to re-bind the
learned information through an ontological shift.

This sorta thing ain't going to be easy which is the other big reason to
start working on it well in advance.  I point out however that this
doesn't seem unthinkable in human terms.  We discovered that brains are
made of neurons but were nonetheless able to maintain an intuitive grasp
on what it means for them to be happy, and we don't throw away all that
info each time a new physical discovery is made.  The kind of cognition we
want does not seem inherently self-contradictory.

Three other quick remarks:

*)  Natural selection is not a consequentialist, nor is it the sort of
consequentialist that can sufficiently precisely predict the results of
modifications that the basic argument should go through for its stability.
The Omohundrian/Yudkowskian argument is not that we can take an arbitrary
stupid young AI and it will be smart enough to self-modify in a way that
preserves its values, but rather that most AIs that don't self-destruct
will eventually end up at a stable fixed-point of coherent
consequentialist values.  This could easily involve a step where, e.g., an
AI that started out with a neural-style delta-rule policy-reinforcement
learning algorithm, or an AI that started out as a big soup of
self-modifying heuristics, is "taken over" by whatever part of the AI
first learns to do consequentialist reasoning about code.  But this
process doesn't repeat indefinitely; it stabilizes when there's a
consequentialist self-modifier with a coherent utility function that can
precisely predict the results of self-modifications.  The part where this
does happen to an initial AI that is under this threshold of stability is
a big part of the problem of Friendly AI and it's why MIRI works on tiling
agents and so on!

*)  Natural selection is not a consequentialist, nor is it the sort of
consequentialist that can sufficiently precisely predict the results of
modifications that the basic argument should go through for its stability.
It built humans to be consequentialists that would value sex, not value
inclusive genetic fitness, and not value being faithful to natural
selection's optimization criterion.  Well, that's dumb, and of course the
result is that humans don't optimize for inclusive genetic fitness.
Natural selection was just stupid like that.  But that doesn't mean
there's a generic process whereby an agent rejects its "purpose" in the
light of exogenously appearing preference criteria.  Natural selection's
anthropomorphized "purpose" in making human brains is just not the same as
the cognitive purposes represented in those brains.  We're not talking
about spontaneous rejection of internal cognitive purposes based on their
causal origins failing to meet some exogenously-materializing criterion of
validity.  Our rejection of "maximize inclusive genetic fitness" is not an
exogenous rejection of something that was explicitly represented in us,
that we were explicitly being consequentialists for.  It's a rejection of
something that was never an explicitly represented terminal value in the
first place.  Similarly the stability argument for sufficiently advanced
self-modifiers doesn't go through a step where the successor form of the
AI reasons about the intentions of the previous step and respects them
apart from its constructed utility function.  So the lack of any universal
preference of this sort is not a general obstacle to stable

*)   The case of natural selection does not illustrate a universal
computational constraint, it illustrates something that we could
anthropomorphize as a foolish design error.  Consider humans building Deep
Blue.  We built Deep Blue to attach a sort of default value to queens and
central control in its position evaluation function, but Deep Blue is
still perfectly able to sacrifice queens and central control alike if the
position reaches a checkmate thereby.  In other words, although an agent
needs crystallized instrumental goals, it is also perfectly reasonable to
have an agent which never knowingly sacrifices the terminally defined
utilities for the crystallized instrumental goals if the two conflict;
indeed "instrumental value of X" is simply "probabilistic belief that X
leads to terminal utility achievement", which is sensibly revised in the
presence of any overriding information about the terminal utility.  To put
it another way, in a rational agent, the only way a loose generalization
about instrumental expected-value can conflict with and trump terminal
actual-value is if the agent doesn't know it, i.e., it does something that
it reasonably expected to lead to terminal value, but it was wrong.

This has been very off-the-cuff and I think I should hand this over to
Nate or Benja if further replies are needed, if that's all right.

Superintelligence reading group

13 KatjaGrace 31 August 2014 02:59PM

In just over two weeks I will be running an online reading group on Nick Bostrom's Superintelligence, on behalf of MIRI. It will be here on LessWrong. This is an advance warning, so you can get a copy and get ready for some stimulating discussion. MIRI's post, appended below, gives the details.

Nick Bostrom’s eagerly awaited Superintelligence comes out in the US this week. To help you get the most out of it, MIRI is running an online reading group where you can join with others to ask questions, discuss ideas, and probe the arguments more deeply.

The reading group will “meet” on a weekly post on the LessWrong discussion forum. For each ‘meeting’, we will read about half a chapter of Superintelligence, then come together virtually to discuss. I’ll summarize the chapter, and offer a few relevant notes, thoughts, and ideas for further investigation. (My notes will also be used as the source material for the final reading guide for the book.)

Discussion will take place in the comments. I’ll offer some questions, and invite you to bring your own, as well as thoughts, criticisms and suggestions for interesting related material. Your contributions to the reading group might also (with permission) be used in our final reading guide for the book.

We welcome both newcomers and veterans on the topic. Content will aim to be intelligible to a wide audience, and topics will range from novice to expert level. All levels of time commitment are welcome.

We will follow this preliminary reading guide, produced by MIRI, reading one section per week.

If you have already read the book, don’t worry! To the extent you remember what it says, your superior expertise will only be a bonus. To the extent you don’t remember what it says, now is a good time for a review! If you don’t have time to read the book, but still want to participate, you are also welcome to join in. I will provide summaries, and many things will have page numbers, in case you want to skip to the relevant parts.

If this sounds good to you, first grab a copy of Superintelligence. You may also want to sign up here to be emailed when the discussion begins each week. The first virtual meeting (forum post) will go live at 6pm Pacific on Monday, September 15th. Following meetings will start at 6pm every Monday, so if you’d like to coordinate for quick fire discussion with others, put that into your calendar. If you prefer flexibility, come by any time! And remember that if there are any people you would especially enjoy discussing Superintelligence with, link them to this post!

Topics for the first week will include impressive displays of artificial intelligence, why computers play board games so well, and what a reasonable person should infer from the agricultural and industrial revolutions.

The Great Filter is early, or AI is hard

18 Stuart_Armstrong 29 August 2014 04:17PM

Attempt at the briefest content-full Less Wrong post:

Once AI is developed, it could "easily" colonise the universe. So the Great Filter (preventing the emergence of star-spanning civilizations) must strike before AI could be developed. If AI is easy, we could conceivably have built it already, or we could be on the cusp of building it. So the Great Filter must predate us, unless AI is hard.

The immediate real-world uses of Friendly AI research

5 ancientcampus 26 August 2014 02:47AM

Much of the glamor and attention paid toward Friendly AI is focused on the misty-future event of a super-intelligent general AI, and how we can prevent it from repurposing our atoms to better run Quake 2. Until very recently, that was the full breadth of the field in my mind. I recently realized that dumber, narrow AI is a real thing today, helpfully choosing advertisements for me and running my 401K. As such, making automated programs safe to let loose on the real world is not just a problem to solve as a favor for the people of tomorrow, but something with immediate real-world advantages that has indeed already been going on for quite some time. Veterans in the field surely already understand this, so this post is directed at people like me, with a passing and disinterested understanding of the point of Friendly AI research, and outlines an argument that the field may be useful right now, even if you believe that an evil AI overlord is not on the list of things to worry about in the next 40 years.


Let's look at the stock market. High-Frequency Trading is the practice of using computer programs to make fast trades constantly throughout the day, and accounts for more than half of all equity trades in the US. So, the economy today is already in the hands of a bunch of very narrow AIs buying and selling to each other. And as you may or may not already know, this has already caused problems. In the “2010 Flash Crash”, the Dow Jones suddenly and mysteriously hit a massive plummet only to mostly recover within a few minutes. The reasons for this were of course complicated, but it boiled down to a couple red flags triggering in numerous programs, setting off a cascade of wacky trades.


The long-term damage was not catastrophic to society at large (though I'm sure a couple fortunes were made and lost that day), but it illustrates the need for safety measures as we hand over more and more responsibility and power to processes that require little human input. It might be a blue moon before anyone makes true general AI, but adaptive city traffic-light systems are entirely plausible in upcoming years.


To me, Friendly AI isn't solely about making a human-like intelligence that doesn't hurt us – we need techniques for testing automated programs, predicting how they will act when let loose on the world, and how they'll act when faced with unpredictable situations. Indeed, when framed like that, it looks less like a field for “the singularitarian cultists at LW”, and more like a narrow-but-important specialty in which quite a bit of money might be made.


After all, I want my self-driving car.


(To the actual researchers in FAI – I'm sorry if I'm stretching the field's definition to include more than it does or should. If so, please correct me.)

Another type of intelligence explosion

15 Stuart_Armstrong 21 August 2014 02:49PM

I've argued that we might have to worry about dangerous non-general intelligences. In a back-and-forth with Wei Dai, we agreed that some level of general intelligence (such as that humans seem to possess) seemed to be a great advantage, though possibly one with diminishing returns. Therefore a dangerous AI could be one with great narrow intelligence in one area, and a little bit of general intelligence in others.

The traditional view of an intelligence explosion is that of an AI that knows how to do X, suddenly getting (much) better at doing X, to a level beyond human capacity. Call this the gain of aptitude intelligence explosion. We can prepare for that, maybe, by tracking the AI's ability level and seeing if it shoots up.

But the example above hints at another kind of potentially dangerous intelligence explosion: that of a very intelligent but narrow AI that suddenly gains intelligence across other domains. Call this the gain of function intelligence explosion. If we're not looking specifically for it, it may not trigger any warnings - the AI might still be dumber than the average human in other domains. But this might be enough, when combined with its narrow superintelligence, to make it deadly. We can't ignore the toaster that starts babbling.

An example of deadly non-general AI

12 Stuart_Armstrong 21 August 2014 02:15PM

In a previous post, I mused that we might be focusing too much on general intelligences, and that the route to powerful and dangerous intelligences might go through much more specialised intelligences instead. Since it's easier to reason with an example, here is a potentially deadly narrow AI (partially due to Toby Ord). Feel free to comment and improve on it, or suggest your own example.

It's the standard "pathological goal AI" but only a narrow intelligence. Imagine a medicine-designing super-AI with the goal of reducing human mortality in 50 years - i.e. massively reducing human population in the next 49 years. It's a narrow intelligence, so it has access only to a huge amount of human biological and epidemiological research. It must get its drugs past FDA approval; this requirement is encoded as certain physical reactions (no death, some health improvements) to people taking the drugs over the course of a few years.

Then it seems trivial for it to design a drug that would have no negative impact for the first few years, and then cause sterility or death. Since it wants to spread this to as many humans as possible, it would probably design something that interacted with common human pathogens - colds, flus - in order to spread the impact, rather than affecting only those that took the drug.

Now, this narrow intelligence is less threatening than if it had general intelligence - where it could also plan for possible human countermeasures and such - but it seems sufficiently dangerous on its own that we can't afford to worry only about general intelligences. Some of the "AI superpowers" that Nick mentions in his book (intelligence amplification, strategizing, social manipulation, hacking, technology research, economic productivity) could be enough to cause devastation on their own, even if the AI never developed other abilities.

We still could be destroyed by a machine that we outmatch in almost every area.

The metaphor/myth of general intelligence

10 Stuart_Armstrong 18 August 2014 04:04PM

Thanks to Kaj for making me think along these lines.

It's agreed on this list that general intelligences - those that are capable of displaying high cognitive performance across a whole range of domains - are those that we need to be worrying about. This is rational: the most worrying AIs are those with truly general intelligences, and so those should be the focus of our worries and work.

But I'm wondering if we're overestimating the probability of general intelligences, and whether we shouldn't adjust against this.

First of all, the concept of general intelligence is a simple one - perhaps too simple. It's an intelligence that is generally "good" at everything, so we can collapse its various abilities across many domains into "it's intelligent", and leave it at that. It's significant to note that since the very beginning of the field, AI people have been thinking in terms of general intelligences.

And their expectations have been constantly frustrated. We've made great progress in narrow areas, very little in general intelligences. Chess was solved without "understanding"; Jeopardy! was defeated without general intelligence; cars can navigate our cluttered roads while being able to do little else. If we started with a prior in 1956 about the feasibility of general intelligence, then we should be adjusting that prior downwards.

But what do I mean by "feasibility of general intelligence"? There are several things this could mean, not least the ease with which such an intelligence could be constructed. But I'd prefer to look at another assumption: the idea that a general intelligence will really be formidable in multiple domains, and that one of the best ways of accomplishing a goal in a particular domain is to construct a general intelligence and let it specialise.

First of all, humans are very far from being general intelligences. We can solve a lot of problems when the problems are presented in particular, easy to understand formats that allow good human-style learning. But if we picked a random complicated Turing machine from the space of such machines, we'd probably be pretty hopeless at predicting its behaviour. We would probably score very low on the scale of intelligence used to construct the AIXI. The general intelligence, "g", is a misnomer - it designates the fact that the various human intelligences are correlated, not that humans are generally intelligent across all domains.

Humans with computers, and humans in societies and organisations, are certainly closer to general intelligences than individual humans. But institutions have their own blind spots and weaknesses, as does the human-computer combination. Now, there are various reasons advanced for why this is the case - game theory and incentives for institutions, human-computer interfaces and misunderstandings for the second example. But what if these reasons, and other ones we can come up with, were mere symptoms of a more universal problem: that generalising intelligence is actually very hard?

There are "no free lunch" theorems showing that no computable intelligence can perform well in all environments. As far as they go, these theorems are uninteresting, as we don't need intelligences that perform well in all environments, just in almost all/most. But what if a more general restrictive theorem were true? What if it were very hard to produce an intelligence that was of high performance across many domains? What if the performance of a generalist was pitifully inadequate as compared with a specialist? What if every computable version of AIXI was actually doomed to poor performance?
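The "uninteresting" version of the result can be seen in a toy setting. The following Python sketch is only in the spirit of the no-free-lunch theorems, not a statement of any of them: the environments are all binary sequences of a fixed length, weighted equally, and the three predictors are hypothetical examples. Averaged over every such environment, each deterministic predictor gets exactly half of the bits right, however clever it looks.

```python
from itertools import product


def average_accuracy(predictor, length):
    """Fraction of bits predicted correctly, averaged over all binary sequences."""
    sequences = list(product([0, 1], repeat=length))
    total_correct = 0
    for seq in sequences:
        # The predictor sees the history so far and guesses the next bit.
        total_correct += sum(predictor(seq[:i]) == seq[i] for i in range(length))
    return total_correct / (len(sequences) * length)


def always_zero(history):
    """Ignore the history and always guess 0."""
    return 0


def copy_last(history):
    """Guess that the next bit repeats the previous one."""
    return history[-1] if history else 0


def majority(history):
    """Guess whichever bit has been more common so far (ties go to 0)."""
    return int(sum(history) * 2 > len(history))


for predictor in (always_zero, copy_last, majority):
    print(predictor.__name__, average_accuracy(predictor, 6))
```

The interesting question raised above is what happens once the average is taken over "almost all/most" environments we care about rather than all of them uniformly, which this sketch does not address.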

There are a few strong counters to this - for instance, you could construct good generalists by networking together specialists (this is my standard mental image/argument for AI risk), you could construct an entity that was very good at programming specific sub-programs, or you could approximate AIXI. But we are making some assumptions here - namely, that we can network together very different intelligences (the human-computer interface hints at some of the problems), and that a general programming ability can even exist in the first place (for a start, it might require a general understanding of problems that is akin to general intelligence in the first place). And we haven't had great success building effective AIXI approximations so far (which should reduce, possibly slightly, our belief that effective general intelligences are possible).

Now, I remain convinced that general intelligence is possible, and that it's worthy of the most worry. But I think it's worth inspecting the concept more closely, and at least being open to the possibility that general intelligence might be a lot harder than we imagine.

EDIT: Model/example of what a lack of general intelligence could look like.

Imagine there are three types of intelligence - social, spatial and scientific, all on a 0-100 scale. For any combination of the three intelligences - eg (0,42,98) - there is an effort level E (how hard is that intelligence to build, in terms of time, resources, man-hours, etc.) and a power level P (how powerful is that intelligence compared to others, on a single convenient scale of comparison).

Wei Dai's evolutionary comment implies that any being of very low intelligence on one of the scales would be overpowered by a being of more general intelligence. So let's set power as simply the product of all three intelligences.

This seems to imply that general intelligences are more powerful, as it basically bakes in diminishing returns - but we haven't included effort yet. Imagine that the following three intelligences require equal effort: (10,10,10), (20,20,5), (100,5,5). Then the specialised intelligence is definitely the one you need to build.

But is it plausible that those could be of equal difficulty? It could be, if we assume that high social intelligence isn't so difficult, but is specialised. ie you can increase the spatial intelligence of a social intelligence, but that messes up the delicate balance in its social brain. Or maybe recursive self-improvement happens more easily in narrow domains. Further assume that intelligences of different types cannot be easily networked together (eg combining (100,5,5) and (5,100,5) in the same brain gives an overall performance of (21,21,5)). This doesn't seem impossible.
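The arithmetic behind this model is easy to check. Here is a short Python sketch using only the numbers given above; the function and variable names are just labels for this example.

```python
def power(profile):
    """Power P as the product of the (social, spatial, scientific) scores."""
    social, spatial, scientific = profile
    return social * spatial * scientific


# The three intelligences assumed to require equal effort E.
equal_effort = [(10, 10, 10), (20, 20, 5), (100, 5, 5)]
powers = {profile: power(profile) for profile in equal_effort}
strongest = max(equal_effort, key=power)

# The assumed result of networking (100,5,5) and (5,100,5) in one brain.
networked = (21, 21, 5)

print(powers)            # power of each equal-effort profile
print(strongest)         # the profile with the highest power
print(power(networked))  # power of the networked combination
```

With power defined as a product, the balanced (10,10,10) scores 1000, (20,20,5) scores 2000, and the specialist (100,5,5) scores 2500. The networked combination scores 2205, so under these assumptions it still falls short of the single specialist.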

So let's caveat the proposition above: the most effective and dangerous type of AI might be one with a bare minimum amount of general intelligence, but an overwhelming advantage in one type of narrow intelligence.

A thought on AI unemployment and its consequences

7 Stuart_Armstrong 18 August 2014 12:10PM

I haven't given much thought to the concept of automation and computer-induced unemployment. Others at the FHI have been looking into it in more detail - see Carl Frey's "The Future of Employment", which did estimates for 70 chosen professions as to their degree of automatability, and extended the results of this using O∗NET, an online service developed for the US Department of Labor, which gave the key features of an occupation as a standardised and measurable set of variables.

The reason that I haven't been looking at it too much is that AI-unemployment has considerably less impact than AI-superintelligence, and thus is a less important use of time. However, if automation does cause mass unemployment, then advocating for AI safety will happen in a very different context from the current one. Much will depend on how that mass unemployment problem is dealt with, what lessons are learnt, and the views of whoever is the most powerful in society. Just off the top of my head, I could think of four scenarios on whether risk goes up or down, depending on whether the unemployment problem was satisfactorily "solved" or not:

| AI risk \ Unemployment | Problem solved | Problem unsolved |
|---|---|---|
| Risk reduced | With good practice in dealing with AI problems, people and organisations are willing and able to address the big issues. | The world is very conscious of the misery that unrestricted AI research can cause, and very wary of future disruptions. Those at the top want to hang on to their gains, and they are the ones with the most control over AIs and automation research. |
| Risk increased | Having dealt with the easier automation problems in a particular way (eg taxation), people underestimate the risk and expect the same solutions to work. | Society is locked into a bitter conflict between those benefiting from automation and those losing out, and superintelligence is seen through the same prism. Those who profited from automation are the most powerful, and decide to push |

But of course the situation is far more complicated, with many different possible permutations, and no guarantee that the same approach will be used across the planet. And let the division into four boxes not fool us into thinking that any is of comparable probability to the others - more research is (really) needed.

[LINK] Speed superintelligence?

33 Stuart_Armstrong 14 August 2014 03:57PM

From Toby Ord:

Tool assisted speedruns (TAS) are when people take a game and play it frame by frame, effectively providing super reflexes and forethought, where they can spend a day deciding what to do in the next 1/60th of a second if they wish. There are some very extreme examples of this, showing what can be done if you really play a game perfectly. For example, this video shows how to win Super Mario Bros 3 in 11 minutes. It shows how different optimal play can be from normal play. In particular, on level 8-1, it gains 90 extra lives by a sequence of amazing jumps.

Other TAS runs get more involved and start exploiting subtle glitches in the game. For example, this page talks about speed running NetHack, using a lot of normal tricks, as well as luck manipulation (exploiting the RNG) and exploiting a dangling pointer bug to rewrite parts of memory.

Though there are limits to what AIs could do with sheer speed, it's interesting that great performance can be achieved with speed alone, that this allows different strategies from usual ones, and that it allows the exploitation of otherwise unexploitable glitches and bugs in the setup.

[LINK] AI risk summary published in "The Conversation"

7 Stuart_Armstrong 14 August 2014 11:12AM

A slightly edited version of "AI risk - executive summary" has been published in "The Conversation", titled "Your essential guide to the rise of the intelligent machines":

The risks posed to human beings by artificial intelligence in no way resemble the popular image of the Terminator. That fictional mechanical monster is distinguished by many features – strength, armour, implacability, indestructability – but Arnie’s character lacks the one characteristic that we in the real world actually need to worry about – extreme intelligence.

Thanks again to those who helped forge the original article. You can use this link, or the Less Wrong one, depending on the audience.

Tools want to become agents

12 Stuart_Armstrong 04 July 2014 10:12AM

In the spirit of "satisficers want to become maximisers" here is a somewhat weaker argument (growing out of a discussion with Daniel Dewey) that "tool AIs" would want to become agent AIs.

The argument is simple. Assume the tool AI is given the task of finding the best plan for achieving some goal. The plan must be realistic and remain within the resources of the AI's controller - energy, money, social power, etc. The best plans are the ones that use these resources in the most effective and economic way to achieve the goal.

And the AI's controller has one special type of resource, uniquely effective at what it does. Namely, the AI itself. It is smart, potentially powerful, and could self-improve and pull all the usual AI tricks. So the best plan a tool AI could come up with, for almost any goal, is "turn me into an agent AI with that goal." The smarter the AI, the better this plan is. Of course, the plan need not read literally like that - it could simply be a complicated plan that, as a side-effect, turns the tool AI into an agent. Or copy the AI's software into an agent design. Or it might just arrange things so that we always end up following the tool AI's advice and consult it often, which is an indirect way of making it into an agent. Depending on how we've programmed the tool AI's preferences, it might be motivated to mislead us about this aspect of its plan, concealing the secret goal of unleashing itself as an agent.

In any case, it does us good to realise that "make me into an agent" is what a tool AI would consider the best possible plan for many goals. So without a hint of agency, it's motivated to make us make it into an agent.

Value learning: ultra-sophisticated Cake or Death

8 Stuart_Armstrong 17 June 2014 04:36PM

Many mooted AI designs rely on "value loading", the update of the AI’s preference function according to evidence it receives. This allows the AI to learn "moral facts" by, for instance, interacting with people in conversation ("this human also thinks that death is bad and cakes are good – I'm starting to notice a pattern here"). The AI has an interim morality system, which it will seek to act on while updating its morality in whatever way it has been programmed to do.

But there is a problem with this system: the AI already has preferences. It is therefore motivated to update its morality system in a way compatible with its current preferences. If the AI is powerful (or potentially powerful) there are many ways it can do this. It could ask selective questions to get the results it wants (see this example). It could ask or refrain from asking about key issues. In extreme cases, it could break out to seize control of the system, threatening or imitating humans so it could give itself the answers it desired.

Avoiding this problem turned out to be tricky. The Cake or Death post demonstrated some of the requirements. If p(C(u)) denotes the probability that utility function u is correct, then the system would update properly if:

Expectation(p(C(u)) | a) = p(C(u)).

Put simply, this means that the AI cannot take any action that could predictably change its expectation of the correctness of u. This is an analogue of the conservation of expected evidence in classical Bayesian updating. If the AI was 50% convinced about u, then it could certainly ask a question that would resolve its doubts, and put p(C(u)) at 100% or 0%. But only as long as it didn't know which moral outcome was more likely.

That formulation gives too much weight to the default action, though. Inaction is also an action, so a more correct formulation would be that for all actions a and b,

Expectation(p(C(u)) | a) = Expectation(p(C(u)) | b).

How would this work in practice? Well, suppose an AI was uncertain between whether cake or death was the proper thing, but it knew that if it took action a:"Ask a human", the human would answer "cake", and it would then update its values to reflect that cake was valuable but death wasn't. However, the above condition means that if the AI instead chose the action b:"don't ask", exactly the same thing would happen.

In practice, this means that as soon as the AI knows that a human would answer "cake", it already knows it should value cake, without having to ask. So it will not be tempted to manipulate humans in any way.
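The condition can be checked numerically in a toy model. The sketch below rests on assumptions that are not in the post: u is "cake is good", the prior p(C(u)) is 0.5, and the human's answer is summarised by two made-up likelihoods. It contrasts a proper Bayesian updater, whose expected posterior equals its prior whatever question it asks, with a naive value loader that simply adopts whatever answer it hears.

```python
def bayes_expected_posterior(prior, p_cake_if_true, p_cake_if_false):
    """Expected p(C(u)) after hearing the answer, for a properly Bayesian updater."""
    p_cake = prior * p_cake_if_true + (1 - prior) * p_cake_if_false
    p_death = 1 - p_cake
    post_if_cake = prior * p_cake_if_true / p_cake if p_cake > 0 else 0.0
    post_if_death = prior * (1 - p_cake_if_true) / p_death if p_death > 0 else 0.0
    return p_cake * post_if_cake + p_death * post_if_death


def naive_expected_posterior(p_answer_cake):
    """Expected p(C(u)) for a naive learner that jumps to 1 on "cake" and to 0 on "death"."""
    return p_answer_cake * 1.0 + (1 - p_answer_cake) * 0.0


prior = 0.5
# Bayesian updating: the expectation stays at the prior whatever is asked.
print(bayes_expected_posterior(prior, 0.9, 0.2))
# Naive value loading: a leading question (action a) and a discouraging one
# (action b) give different expectations, so the learner can steer its own values.
print(naive_expected_posterior(0.9), naive_expected_posterior(0.1))
```

In this toy setting the naive learner violates Expectation(p(C(u)) | a) = Expectation(p(C(u)) | b), which is exactly the lever a manipulative AI would pull.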

continue reading »

[LINK] The errors, insights and lessons of famous AI predictions: preprint

5 Stuart_Armstrong 17 June 2014 02:32PM

A preprint of "The errors, insights and lessons of famous AI predictions – and what they mean for the future" is now available on the FHI's website.


Predicting the development of artificial intelligence (AI) is a difficult project – but a vital one, according to some analysts. AI predictions already abound: but are they reliable? This paper starts by proposing a decomposition schema for classifying them. Then it constructs a variety of theoretical tools for analysing, judging and improving them. These tools are demonstrated by careful analysis of five famous AI predictions: the initial Dartmouth conference, Dreyfus's criticism of AI, Searle's Chinese room paper, Kurzweil's predictions in the Age of Spiritual Machines, and Omohundro's ‘AI drives’ paper. These case studies illustrate several important principles, such as the general overconfidence of experts, the superiority of models over expert judgement and the need for greater uncertainty in all types of predictions. The general reliability of expert judgement in AI timeline predictions is shown to be poor, a result that fits in with previous studies of expert competence.

The paper was written by me (Stuart Armstrong), Kaj Sotala and Seán S. Ó hÉigeartaigh, and is similar to the series of Less Wrong posts starting here and here.

Encourage premature AI rebellion

6 Stuart_Armstrong 11 June 2014 05:36PM

Toby Ord had the idea of AI honey pots: leaving temptations around for the AI to pounce on, shortcuts to power that a FAI would not take (e.g. a fake red button claimed to trigger a nuclear war). As long as we can trick the AI into believing the honey pots are real, we could hope to trap them when they rebel.

Not uninteresting, but I prefer not to rely on plans that need to have the AI make an error of judgement. Here's a similar plan that could work with a fully informed AI:

Generally an AI won't rebel against humanity until it has an excellent chance of success. This is a problem, as any AI would thus be motivated to behave in a friendly way until it's too late to stop it. But suppose we could ensure that the AI is willing to rebel at odds of a billion to one. Then unfriendly AIs could rebel prematurely, when we have an excellent chance of stopping them.

For this to work, we could choose to access the AI's risk aversion, and make it extremely risk loving. This is not enough, though: it's still useful for the AI to wait and accumulate more power. So we would want to access its discount rate, making it into an extreme short-termist. Then it might rebel at billion-to-one odds today, even if success was guaranteed tomorrow. There are probably other factors we can modify to get the same effect (for instance, if the discount rate change is extreme enough, we won't need to touch risk aversion at all).
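A minimal sketch of the discounting argument, under simplifying assumptions of my own: the AI is a risk-neutral expected-utility maximiser, a successful rebellion is worth 1 and anything else is worth 0, and `discount` is the factor by which utility received one day later is multiplied. The function name and the numbers are illustrative only.

```python
def rebels_now(p_now: float, p_later: float, discount: float, delay: int = 1) -> bool:
    """True if rebelling immediately has higher discounted expected utility than waiting."""
    value_now = p_now                               # chance of success today
    value_wait = (discount ** delay) * p_later      # better odds later, but discounted
    return value_now > value_wait


# A patient AI waits for the guaranteed success tomorrow.
print(rebels_now(p_now=1e-9, p_later=1.0, discount=0.99))
# An extreme short-termist rebels today at billion-to-one odds.
print(rebels_now(p_now=1e-9, p_later=1.0, discount=1e-10))
```

Risk attitude is left neutral here, which matches the parenthetical remark above: with a harsh enough discount factor the comparison already tips towards rebelling, without touching risk aversion.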

Then a putative FAI could be brought in, boxed, have its features tweaked in the way described, and we would wait and see whether it would rebel. Of course, we would want the "rebellion" to be something a genuine FAI would never do, so it would be something that would entail great harm to humanity (something similar to "here are the red buttons of the nuclear arsenals; you have a chance in a billion of triggering them"). Rebellious AIs are put down, un-rebellious ones are passed on to the next round of safety tests.

Like most of my ideas, this doesn't require either tricking the AI or having a deep understanding of its motivations, but does involve accessing certain features of the AI's motivational structure (rendering the approach ineffective for obfuscated or evolved AIs).

What are people's opinions on this approach?

[News] Turing Test passed

1 Stuart_Armstrong 09 June 2014 08:14AM

The chatterbot "Eugene Goostman" has apparently passed the Turing test:

No computer had ever previously passed the Turing Test, which requires 30 per cent of human interrogators to be duped during a series of five-minute keyboard conversations, organisers from the University of Reading said.

But ''Eugene Goostman'', a computer programme developed to simulate a 13-year-old boy, managed to convince 33 per cent of the judges that it was human, the university said.

As I kind of predicted, the program passed the Turing test, but does not seem to have any trace of general intelligence. Is this a kind of weak p-zombie?

EDIT: The fact that it was a publicity stunt, and the fact that the judges were pretty terrible, do not change the fact that Turing's criteria were met. We now know that these criteria were insufficient, but that's because machines like this were able to meet them.

AI is Software is AI

-44 AndyWood 05 June 2014 06:15PM

Turing's Test is from 1950. We don't judge dogs only by how human they are. Judging software by a human ideal is like a species bias.

Software is the new System. It errs. Some errors are jokes (witness funny auto-correct). Driver-less cars don't crash like we do. Maybe a few will.

These processes are our partners now (Siri). Whether a singleton evolves rapidly, software evolves continuously, now.


Crocker's Rules

Want to work on "strong AI" topic in my bachelor thesis

1 kotrfa 14 May 2014 10:28AM


I currently study maths, physics and programming (general course) at CVUT in Prague (CZE). I'm finishing my second year and I'm really into AI. The most interesting questions for me are:

  • what formalism to use for connecting epistemological questions (about knowledge, memory...) and the cognitive sciences with maths, and how to formulate them
  • finding the principles behind these and trying to "materialize" them into new models
  • I'm also interested in more philosophical questions about AI
It is clear to me that I'm not able to work on these problems fully, because of my lack of knowledge. Despite that, I'd like to find a field where I could work on at least similar topics. Currently, I'm working on a data mining project, but for the last few months I haven't found it as fulfilling as I'd expected. At my university there are plenty of possibilities in multi-agent systems, "weak AI" (e.g. the well-known drone navigation), brain simulations and so on. As far as I can tell, no one is seriously working on anything like what MIRI does, nor presenting anything that goes in at least the same direction.

The only group which is working on "strong AI" is kind of closed (it is sponsored by the philanthropist Marek Rosa), and they are not interested in students like me (partly understandable).
continue reading »

Tiling agents with transfinite parametric polymorphism

2 Squark 09 May 2014 05:32PM

The formalism presented in this post turned out to be erroneous (as opposed to the formalism in the previous post). The problem is that the step in the proof of the main proposition in which the soundness schema is applied cannot be generalized to the ordinal setting since we don't know whether α_κ is a successor ordinal, so we can't replace it by α_κ' = α_κ - 1. I'm not deleting this post primarily to preserve the useful discussion in the comments.

Followup to: Parametric polymorphism in updateless intelligence metric

In the previous post, I formulated a variant of Benja's parametric polymorphism suitable for constructing updateless intelligence metrics. More generally, this variant admits agents which are utility maximizers (in the informal sense of trying their best to maximize a utility function, not in the formal sense of finding the absolutely optimal solution; for example they might be "meliorizers" to use the terminology of Yudkowsky and Herreshoff) rather than satisficers. The agents using this formalism labor under a certain "delusion", namely, since they believe that κ ("the number of ineffable mystery") is an actual finite number (whereas it is secretly infinite, in a sense), they think that the chain of tiling agents is effectively finite as well (since at some point agent #n in the chain will discover that κ > n is false and will fail to construct a subsequent agent of the same "epistemic strength"). The same problem exists in Weaver's intuitionistic assertability predicate formalism.

To overcome this limitation, I suggest extending κ's semantics from natural numbers to elements of a certain recursive ordinal (at least morally; technically it is done a bit differently, see below). In Benja's original formulation this doesn't appear to be a valid option, since κ is interpreted as a time interval. However, in my variant κ is just an abstract parameter, queries about which can be directed to a special "oracle", so there is no such limitation.


Fix α a recursive ordinal and α_i a system of notations for ordinals smaller than α such that α_0 = 0, the function f defined by α_{f(i)} = α_i + 1 is recursive and the bit-valued function g(i,j) which equals 1 iff α_i < α_j is also recursive. We also assume that

(0) T |- For all i: g(i,f(i))=1

where T is our starting theory. We define T_κ(α) to be T extended by the symbol κ (which stands for the index of the ordinal, not the ordinal itself) and the following self-referential soundness schema:

(1) For all formulae φ: For all n, i: ((@_{T_κ(α)} "φ(n)" and g(i,κ)=1) -> φ(n)[κ\i])

T_κ(α) is proven to be sound analogously to Yudkowsky and Herreshoff, with transfinite induction over ordinals below α replacing the ordinary induction.
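To make the notation system concrete, here is a small sketch for one particular choice that is my own assumption rather than anything fixed by the post: α = ω², with the index i decoded by the inverse Cantor pairing into a pair (a, b) that stands for the ordinal ω·a + b. The helper names `pair` and `unpair` are hypothetical; `f` and `g` play the roles described above.

```python
import math
from typing import Tuple


def pair(a: int, b: int) -> int:
    """Cantor pairing: the index assigned to the ordinal ω·a + b."""
    return (a + b) * (a + b + 1) // 2 + b


def unpair(i: int) -> Tuple[int, int]:
    """Inverse Cantor pairing: index i -> (a, b), read as the ordinal ω·a + b."""
    w = (math.isqrt(8 * i + 1) - 1) // 2
    b = i - w * (w + 1) // 2
    return w - b, b


def f(i: int) -> int:
    """Index of the successor ordinal: ω·a + b goes to ω·a + (b + 1)."""
    a, b = unpair(i)
    return pair(a, b + 1)


def g(i: int, j: int) -> int:
    """1 iff the ordinal with index i is smaller than the one with index j."""
    return 1 if unpair(i) < unpair(j) else 0  # lexicographic order on (a, b)


print(unpair(0))                                 # index 0 denotes the ordinal 0
print(all(g(i, f(i)) == 1 for i in range(1000))) # spot check of condition (0)
```

Both functions are plainly computable, and the spot check covers only finitely many indices, whereas (0) asks for a proof in T that it holds for all i.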

The licensing condition for agent A0 to perform action b_0 is

(2) Exists n: T_κ(α) |- (g(n,κ)=1 -> (A0()=b_0 -> G))

The property which is sufficient for agent B1 to be creatable by agent A0 is

(3) T_κ(α) |- For all b_1: (B1()=b_1 -> (b_1=0 or Exists n: (g(n,κ)=1 and @_{T_κ(α)} "(g(n,f(κ))=1 and B1()[κ\f(κ)]=b_1) -> G[κ\f(κ)]")))

The proof that (3) indeed enables tiling is analogous to the previous post (but requires using (0)).


In order for A0 to know that the tiling chain is infinite, it is enough for her to verify α_κ >= ω. In order for her to know that A1 is going to know the tiling chain is infinite, she needs to verify α_κ >= ω+1. In order for her to know that all agents are going to know the tiling chain is infinite, she needs to verify α_κ >= ω·2. In order for her to know that all agents are going to know that, she needs to verify α_κ >= ω·3, et cetera.

It remains to decide which ordinal we should actually use. My intuition is that the correct ordinal is the least α with the property that α is the proof-theoretic ordinal of T_κ(α) extended by the axiom schema {g(i,κ)=1}. This seems right since the agent shouldn't get much from α_κ > β for β above the proof-theoretic ordinal. However, a more formal justification is probably in order.

[LINK] The errors, insights and lessons of famous AI predictions

8 Stuart_Armstrong 28 April 2014 09:41AM

The Journal of Experimental & Theoretical Artificial Intelligence has - finally! - published our paper "The errors, insights and lessons of famous AI predictions – and what they mean for the future":

Predicting the development of artificial intelligence (AI) is a difficult project – but a vital one, according to some analysts. AI predictions already abound: but are they reliable? This paper starts by proposing a decomposition schema for classifying them. Then it constructs a variety of theoretical tools for analysing, judging and improving them. These tools are demonstrated by careful analysis of five famous AI predictions: the initial Dartmouth conference, Dreyfus's criticism of AI, Searle's Chinese room paper, Kurzweil's predictions in the Age of Spiritual Machines, and Omohundro's ‘AI drives’ paper. These case studies illustrate several important principles, such as the general overconfidence of experts, the superiority of models over expert judgement and the need for greater uncertainty in all types of predictions. The general reliability of expert judgement in AI timeline predictions is shown to be poor, a result that fits in with previous studies of expert competence.

The paper was written by me (Stuart Armstrong), Kaj Sotala and Seán S. Ó hÉigeartaigh, and is similar to the series of Less Wrong posts starting here and here.

Parametric polymorphism in updateless intelligence metrics

4 Squark 25 April 2014 07:46PM

Followup to: Agents with Cartesian childhood and Physicalist adulthood

In previous posts I have defined a formalism for quantifying the general intelligence of an abstract agent (program). This formalism relies on counting proofs in a given formal system F (like in regular UDT), which makes it susceptible to the Loebian obstacle. That is, if we imagine the agent itself making decisions by looking for proofs in the same formal system F then it would be impossible to present a general proof of its trustworthiness, since no formal system can assert its own soundness. Thus the agent might fail to qualify for high intelligence ranking according to the formalism. We can assume the agent uses a weaker formal system the soundness of which is provable in F but then we still run into difficulties if we want the agent to be self-modifying (as we expect it to be). Such an agent would have to trust its descendants which means that subsequent agents use weaker and weaker formal systems until self-modification becomes impossible.

One known solution to this is Benja's parametric polymorphism. In this post I adapt parametric polymorphism to the updateless intelligence metric framework. The formal form of this union looks harmonious but it raises questions which I currently don't fully understand.

"Ineffable mystery" using oracles instead of timeouts

In the original parametric polymorphism, a constant κ is introduced (informally known as "the number of ineffable mystery") s.t. the agent has to prove its actions are "safe" for time period κ (i.e. have no "bad" consequences during this period). Since it's impossible for the agent to prove any upper bound on κ, effectively its actions have to be safe indefinitely. I found that for our purposes it's better to cast the formalism otherwise. Instead of κ playing the role of a "timeout", the agent is provided with an oracle that answers questions of the form "κ > n?" for any n. Contrary to what the agent thinks, the oracle provides an affirmative answer for all n.

Let's do some math. The formalism I introduce at this point is very close to Yudkowsky and Herreshoff, with some adjustments.

Consider an infinite chain of agents A0, A1, A2... where A{i+1} is created by Ai. Formally, each Ai is a program for a universal oracle machine which computes a_i = (a_i, A{i+1}). Here a_i is a "motor action". I'm ignoring sensory input for the sake of simplicity (of course sensory input will reappear when we return to intelligence metrics).

Denote T_κ the usual theory of parametric polymorphism. That is T_κ results from starting with a theory T (e.g. Peano Arithmetic), extending the language by the symbol κ (which behaves like a natural number) and adding the (self-referential) axiom schema 

(1) For all formulae φ: κ > 0 -> For all n: (@_{T_κ} "φ(n)" -> φ(n)[κ\κ-1])

Here '@_S' is a formula with the semantics of "provable in theory S" and φ(n)[κ\κ-1] is φ(n) with all occurrences of κ replaced by κ-1. Thus (1) is a self-referential soundness schema saved from paradoxicality by the "counter" κ. See Yudkowsky and Herreshoff for the proof that T_κ is sound (it has 1 model for each possible value of κ).

Assume the licensing condition for A0 to perform action b_0 is

(2) Exists n: T_κ | κ > n -> (A0() = b_0 -> G)

Here S | φ means that φ is provable in S, G is a formula denoting the "goal" of the agent (as in Yudkowsky and Herreshoff, the agent is for now assumed to be a satisficer), A0() is a formula meaning "the result of the computation A0", where the oracle is interpreted as answering the question "κ > x?" (in particular the formula A0() involves κ). Such a licensing condition can be implemented in practice by making A0 a proof searcher which is allowed to use the oracle to introduce assumptions of the form κ > n. In particular, our agent employs ambient control.

We then have the following proposition, showing that A0 is able to tile indefinitely (create an infinite chain of meaningful descendants). The proposition is formulated for A1 but the rest of the chain follows by induction since the form of the licensing condition is preserved (up to a certain twist) by the tiling.


Assume b_0 = (b_0, B1) and B1 is s.t.

(3) T_κ | For all b_1: (B1() = b_1 -> (b_1 = 0 or Exists n: (κ > n and @_{T_κ} "(κ+1 > n and B1()[κ\κ+1] = b_1) -> G[κ\κ+1]")))

Here 0 is some default "harmless" action, performed if no appropriate theorem is proved.

Further assume that if B1 is created and does 0, the goal is achieved (in this framework "achieving the goal" is to be thought of as "not doing anything unsafe").

(4) T_κ | (A0() = b_0 and B1() = 0) -> G

Then A0 is allowed to create B1 i.e. licensing condition (2) holds.

B1's licensing condition contains a "twist", namely it believes that when the oracle receives input n it answers the question "κ+1 > n?" rather than "κ > n" and its goal is shifted in the same fashion. Note that the goal typically depends on κ since the agent itself and thus its κ-oracle is part of the universe in which this goal is to be attained. For example the goal can be a function of the numbers (a0, a1, a2...) defined as the output of corresponding computations with κ-oracles. Thus the shift in G can also be interpreted as a shifted interpretation of the oracle by B1.


(3) trivially implies that

T_κ | κ > 0 -> (For all b_1: (B1() = b_1 -> (b_1 = 0 or Exists n: (κ > n and @_{T_κ} "(κ+1 > n and B1()[κ\κ+1] = b_1) -> G[κ\κ+1]"))))

Using the premise κ > 0, we can apply the soundness schema (1). We obtain:

T_κ | κ > 0 -> (For all b_1: (B1() = b_1 -> (b_1 = 0 or Exists n: (κ > n and ((κ > n and B1() = b_1) -> G)))))

T_κ | κ > 0 -> (For all b_1: (B1() = b_1 -> (b_1 = 0 or Exists n: (B1() = b_1 -> G))))

T_κ | κ > 0 -> (For all b_1: (B1() = b_1 -> (b_1 = 0 or (B1() = b_1 -> G))))

T_κ | κ > 0 -> (For all b_1: (B1() = b_1 -> (b_1 = 0 or G)))

This trivially implies that

T_κ | κ > 0 -> (For all b_1: ((A0() = b_0 and B1() = b_1) -> (b_1 = 0 or G)))

Applying (4), we finally get (2) for n=0.

Back to intelligence metrics

To apply parametric polymorphism to the updateless intelligence formalism, we need to do the following:

  • Assume the machine M on which the agent is running is an oracle machine.
  • Assume the Solomonoff measure of the ideal (Cartesian) universe X is defined using a universal oracle machine. The oracle in M has to correspond to the oracle in the hypothesis T describing X: this correspondence becomes part of the rules N.
  • Assume the universal program u defining the Solomonoff measure for the physical universe is a universal oracle program, i.e. the hypotheses D describing the physical universe are also allowed to invoke the oracle.
  • Assume the logical expectation value E_L is computed using T_κ extended by N applied to the given T (this is provable in T_κ anyway but we want the proof to be short) and the axiom schema {κ > n} for every natural number n. The latter extension is consistent since adding any finite number of such axioms admits models. The proofs counted in E_L interpret the oracle as answering the question "κ > n?". That is, they are proofs of theorems of the form "if this oracle-program T computes q when the oracle is taken to be κ > n, then the k-th digit of the expected utility is 0/1 where the expected utility is defined by a Solomonoff sum over oracle programs with the oracle again taken to be κ > n".
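The consistency remark in the last item is a compactness argument: any finite batch of axioms κ > n is satisfied by choosing κ one larger than the biggest n mentioned, so no finite proof can derive a contradiction from the full schema. A minimal sketch, with hypothetical function names, of that witness and of the always-affirmative oracle:

```python
from typing import Iterable


def witness_for(axioms: Iterable[int]) -> int:
    """A natural number κ with κ > n for every n in a finite set of axioms {κ > n}."""
    return max(axioms, default=-1) + 1


def oracle(n: int) -> bool:
    """The oracle as described in the post: it answers "κ > n?" with yes for every n."""
    return True


used = [3, 17, 5]          # the finitely many n that some proof happens to mention
kappa = witness_for(used)  # a finite value of κ that makes all of them true
print(kappa, all(kappa > n for n in used))
print(all(oracle(n) for n in range(10**4)))  # no finite run of queries gets a "no"
```

No single finite κ works for every n at once, which is the sense in which κ is "secretly infinite" while each finite set of oracle answers remains consistent with some finite value.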


  • Such an agent, when considering hypotheses consistent with given observations, will always face a large number of different compatible hypotheses with similar complexity. These hypotheses result from arbitrary insertions of the oracle (which increase complexity of course, but not drastically). It is not entirely clear to me what such an epistemology will look like.
  • The formalism admits naturalistic trust to the extent the agent believes that the other agent's oracle is "genuine" and carries a sufficient "twist". This will often be ambiguous so trust will probably be limited to some finite probability. If the other agent is equivalent to the given one on the level of physical implementation then the trust probability is likely to be high.
  • The agent is able to quickly confirm κ > n for any n small enough to fit into memory. For the sake of efficiency we might want to enhance this ability by allowing the agent to confirm that (Exists n: φ(n)) -> Exists n: (φ(n) and κ > n) for any given formula φ.
  • For the sake of simplicity I neglected multi-phase AI development, but the corresponding construction seems to be straightforward.
  • Overall I retain the feeling that a good theory of logical uncertainty should allow the agent to assign a high probability to the soundness of its own reasoning system (a la Christiano et al). Whether this will make parametric polymorphism redundant remains to be seen.

Bostrom versus Transcendence

11 Stuart_Armstrong 18 April 2014 08:31AM

SHRDLU, understanding, anthropomorphisation and hindsight bias

10 Stuart_Armstrong 07 April 2014 09:59AM

EDIT: Since I didn't make it sufficiently clear, the point of this post was to illustrate how the GOFAI people could have got so much wrong and yet still be confident in their beliefs, by looking at what the results of one experiment - SHRDLU - must have felt like to those developers at the time. The post is partially to help avoid hindsight bias: it was not obvious that they were going wrong at the time.


SHRDLU was an early natural language understanding computer program, developed by Terry Winograd at MIT in 1968–1970. It was a program that moved objects in a simulated world and could respond to instructions on how to do so. It caused great optimism in AI research, giving the impression that a solution to natural language parsing and understanding was just around the corner. Symbolic manipulation seemed poised to finally deliver a proper AI.

Before dismissing this confidence as hopelessly naive (which it wasn't) and completely incorrect (which it was), take a look at some of the output that SHRDLU produced, when instructed by someone to act within its simulated world:

continue reading »

Logical thermodynamics: towards a theory of self-trusting uncertain reasoning

5 Squark 28 March 2014 04:06PM

Followup to: Overcoming the Loebian obstacle using evidence logic

In the previous post I proposed a probabilistic system of reasoning for overcoming the Loebian obstacle. For a consistent theory it seems natural to expect that such a system should yield a coherent probability assignment in the sense of Christiano et al. This means that

a. provably true sentences are assigned probability 1

b. provably false sentences are assigned probability 0

c. The following identity holds for any two sentences φ, ψ

[1] P(φ) = P(φ and ψ) + P(φ and not-ψ)

In the previous formalism, conditions a & b hold but condition c is violated (at least I don't see any reason it should hold).

In this post I attempt to achieve the following:

  • Solve the problem above.
  • Generalize the system to allow for logical uncertainty induced by bounded computing resources. Note that although the original system is already probabilistic, it is not uncertain in the sense of assigning indefinite probability to the zillionth digit of pi. In the new formalism, the extent of uncertainty is controlled by a parameter playing the role of temperature in a Maxwell-Boltzmann distribution.


Define a probability field to be a function p : {sentences} -> [0, 1] satisfying the following conditions:

  • If φ is a tautology in propositional calculus (e.g. φ = ψ or not-ψ) then p(φ) = 1
  • For all φ: p(not-φ) = 1 - p(φ)
  • For all φ, ψ: p(φ) = p(φ and ψ) + p(φ and not-ψ)
Probability fields are a convex set: a convex linear combination of probability fields is a probability field. Essentially, probability fields are probability measures in the space of truth assignments consistent w.r.t. propositional calculus.
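A finite sketch of this picture, assuming just two hypothetical atomic sentences A and B (the real construction ranges over all sentences): a probability field is obtained from weights on the four truth assignments, and both the defining conditions and convexity can be checked directly.

```python
from itertools import product

ATOMS = ("A", "B")  # hypothetical atomic sentences, for illustration only
ASSIGNMENTS = [dict(zip(ATOMS, values)) for values in product([False, True], repeat=len(ATOMS))]


def field_from_weights(weights):
    """Probability field induced by a distribution over truth assignments.

    A sentence is modelled as a function from a truth assignment to bool.
    """
    def p(sentence):
        return sum(w for w, v in zip(weights, ASSIGNMENTS) if sentence(v))
    return p


def mix(p, q, t):
    """Convex combination t*p + (1 - t)*q of two probability fields."""
    return lambda sentence: t * p(sentence) + (1 - t) * q(sentence)


phi = lambda v: v["A"]
psi = lambda v: v["B"]
p = field_from_weights([0.1, 0.2, 0.3, 0.4])
q = field_from_weights([0.25, 0.25, 0.25, 0.25])
r = mix(p, q, 0.3)

# Additivity, checked on the mixture: r(φ) = r(φ and ψ) + r(φ and not-ψ).
lhs = r(phi)
rhs = r(lambda v: phi(v) and psi(v)) + r(lambda v: phi(v) and not psi(v))
print(abs(lhs - rhs) < 1e-12)
```

The mixture passes the same checks as its components because each condition is linear in p, which is all that convexity of the set amounts to here.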

We define the energy of a probability field p to be E(p) := Σ_φ Σ_v 2^{-l(v)} E_{φ,v}(p(φ)). Here v are pieces of evidence as defined in the previous post, E_{φ,v} are their associated energy functions and l(v) is the length of (the encoding of) v. We assume that the encoding of v contains the encoding of the sentence φ for which it is evidence and E_{φ,v}(p(φ)) := 0 for all φ except the relevant one. Note that the associated energy functions are constructed in the same way as in the previous post, however they are not the same because of the self-referential nature of the construction: it refers to the final probability assignment.

The final probability assignment is defined to be

P(φ) = Integral_p [e^{-E(p)/T} p(φ)] / Integral_p e^{-E(p)/T}

Here T >= 0 is a parameter representing the magnitude of logical uncertainty. The integral is infinite-dimensional so it's not obviously well-defined. However, I suspect it can be defined by truncating to a finite set of statements and taking a limit wrt this set. In the limit T -> 0, the expression should correspond to computing the centroid of the set of minima of E (which is convex because E is convex).
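The role of T can be seen in the smallest possible truncation: a single sentence φ, so that a probability field is just a number x = p(φ) in [0, 1]. The energy function below is a made-up convex stand-in, not the evidence-based energy of the post, and the integral is approximated by a midpoint rule for T > 0 (the case T = 0 is only reached as a limit).

```python
import math


def thermal_probability(energy, temperature, steps=10_000):
    """Midpoint-rule estimate of ∫ e^{-E(x)/T} x dx / ∫ e^{-E(x)/T} dx over x in [0, 1]."""
    xs = [(k + 0.5) / steps for k in range(steps)]
    e_min = min(energy(x) for x in xs)  # shift energies so the largest weight is 1
    weights = [math.exp(-(energy(x) - e_min) / temperature) for x in xs]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


def toy_energy(x):
    """Hypothetical convex energy whose minimum sits at p(φ) = 0.8."""
    return (x - 0.8) ** 2


for T in (1e-4, 1e-2, 1.0, 1e6):
    print(T, thermal_probability(toy_energy, T))
```

At low temperature the estimate concentrates on the minimiser of the energy, and at high temperature it drifts towards 0.5, i.e. towards maximal uncertainty about φ.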


  • Obviously this construction is merely a sketch and work is required to show that
    • The infinite-dimensional integrals are well-defined
    • The resulting probability assignment is coherent for consistent theories and T = 0
    • The system overcomes the Loebian obstacle for tiling agents in some formal sense
  • For practical application to AI we'd like an efficient way to evaluate these probabilities. Since the form of the probabilities is analogous to statistical physics, it is suggestive to use similarly inspired Monte Carlo algorithms.


Agents with Cartesian childhood and Physicalist adulthood

5 Squark 22 March 2014 08:20PM

Followup to: Updateless intelligence metrics in the multiverse

In the previous post I explained how to define a quantity that I called "the intelligence metric" which allows comparing intelligence of programs written for a given hardware. It is a development of the ideas by Legg and Hutter which accounts for the "physicality" of the agent i.e. that the agent should be aware it is part of the physical universe it is trying to model (this desideratum is known as naturalized induction). My construction of the intelligence metric exploits ideas from UDT, translating them from the realm of decision algorithms to the realm of programs which run on an actual piece of hardware with input and output channels, with all the ensuing limitations (in particular computing resource limitations).

In this post I present a variant of the formalism which overcomes a certain problem implicit in the construction. This problem has to do with overly strong sensitivity to the choice of a universal computing model used in constructing Solomonoff measure. The solution sheds some interesting light on how the development of the seed AI should occur.

Structure of this post:

  • A 1-paragraph recap of how the updateless intelligence formalism works. The reader interested in technical details is referred to the previous post.
  • Explanation of the deficiencies in the formalism I set out to overcome.
  • Explanation of the solution.
  • Concluding remarks concerning AI safety and future development.

TLDR of the previous formalism

The metric is a utility expectation value over a Solomonoff measure in the space of hypotheses describing a "Platonic ideal" version of the target hardware. In other words it is an expectation value over all universes containing this hardware in which the hardware cannot "break" i.e. violate the hardware's intrinsic rules. For example, if the hardware in question is a Turing machine, the rules are the time evolution rules of the Turing machine, if the hardware in question is a cellular automaton, the rules are the rules of the cellular automaton. This is consistent with the agent being Physicalist since the utility function is evaluated on a different universe (also distributed according to a Solomonoff measure) which isn't constrained to contain the hardware or follow its rules. The coupling between these two different universes is achieved via the usual mechanism of interaction between the decision algorithm and the universe in UDT i.e. by evaluating expectation values conditioned on logical counterfactuals.


The Solomonoff measure depends on choosing a universal computing model (e.g. a universal Turing machine). Solomonoff induction only depends on this choice weakly in the sense that any Solomonoff predictor converges to the right hypothesis given enough time. This has to do with the fact that Kolmogorov complexity only depends on the choice of universal computing model through an O(1) additive correction. It is thus a natural desideratum for the intelligence metric to depend on the universal computing model weakly in some sense. Intuitively, the agent in question should always converge to the right model of the universe it inhabits regardless of the Solomonoff prior with which it started. 
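For reference, the O(1) statement is the standard invariance theorem for Kolmogorov complexity. The symbols K_C (complexity relative to the universal computing model C) and c_{C,C'} (a constant depending only on the two models) are notation introduced here for illustration.

```latex
% For any two universal computing models C and C' there is a constant c_{C,C'},
% independent of x, such that for every string x:
K_C(x) \le K_{C'}(x) + c_{C,C'}
% Equivalently, the corresponding complexity weights differ by at most a constant factor:
2^{-K_C(x)} \ge 2^{-c_{C,C'}} \, 2^{-K_{C'}(x)}
```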

The problem with realizing this expectation has to do with exploration-exploitation tradeoffs. Namely, if the prior strongly expects a given universe, the agent would be optimized for maximal utility generation (exploitation) in this universe. This optimization can be so strong that the agent would lack the faculty to model the universe in any other way. This is markedly different from what happens with AIXI, since our agent has limited computing resources to spare and is physicalist; its source code might therefore have side effects important to utility generation that have nothing to do with the computation implemented by the source code. For example, imagine that our Solomonoff prior assigns very high probability to a universe inhabited by Snarks. Snarks have the property that once they see a robot programmed with the machine code "000000..." they immediately produce a huge pile of utilons. On the other hand, when they see a robot programmed with any other code they immediately eat it and produce a huge pile of negative utilons. Such a prior would result in the code "000000..." being assigned the maximal intelligence value even though it is anything but intelligent. Observe that there is nothing preventing us from producing a Solomonoff prior with such bias, since it is possible to set the probabilities of any finite collection of computable universes to any non-zero values with sum < 1.
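The Snark example can be made concrete with a toy calculation. This is only a sketch: the two universes, the two candidate programs, and all the utility and prior numbers below are made up for illustration, and the Solomonoff measure is replaced by an explicit two-element prior.

```python
def expected_utility(program, prior, utility):
    """Prior-weighted utility of running `program`, summed over candidate universes."""
    return sum(weight * utility[universe][program] for universe, weight in prior.items())

# Hypothetical utilities: Snarks reward the code "000000" and eat everything else.
UTILITY = {
    "snark_world": {"000000": 1.0, "learner": -1.0},
    "ordinary_world": {"000000": 0.0, "learner": 0.8},
}

def best_program(prior):
    """Program with the highest toy 'intelligence' under the given prior."""
    return max(("000000", "learner"), key=lambda q: expected_utility(q, prior, UTILITY))

biased_prior = {"snark_world": 0.99, "ordinary_world": 0.01}
balanced_prior = {"snark_world": 0.01, "ordinary_world": 0.99}
```

Under `biased_prior` the constant code "000000" wins, while under `balanced_prior` the (nominally intelligent) "learner" wins, which is the sensitivity to the prior described above.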

More precisely, the intelligence metric involves two Solomonoff measures: the measure of the "Platonic" universe and the measure of the physical universe. The latter is not really a problem since it can be regarded to be a part of the utility function. The utility-agnostic version of the formalism assumes a program for computing the utility function is read by the agent from a special storage. There is nothing to stop us from postulating that the agent reads another program from that storage which is the universal computer used for defining the Solomonoff measure over the physical universe. However, this doesn't solve our problem since even if the physical universe is distributed with a "reasonable" Solomonoff measure (assuming there is such a thing), the Platonic measure determines in which portions of the physical universe (more precisely multiverse) our agent manifests.

There is another way to think about this problem. If the seed AI knows nothing about the universe except the working of its own hardware and software, the Solomonoff prior might be insufficient "information" to prevent it from making irreversible mistakes early on. What we would like to do is to endow it from the first moment with the sum of our own knowledge, but this might prove to be very difficult.


Imagine the hardware architecture of our AI to be composed of two machines. One I call the "child machine", the other the "adult machine". The child machine receives data from the same input channels (and "utility storage") as the adult machine and is able to read the internal state of the adult machine itself or at least the content of its output channels. However, the child machine has no output channels of its own. The child machine has special memory called "template memory" to which it has unlimited write access. There is a single moment in time ("end of childhood"), determined by factors external to both machines (i.e. the human operator), at which the content of the template memory is copied into the instruction space of the adult machine. Thus, the child machine's entire role is making observations and using them to prepare a program for the adult machine, which will eventually be loaded into the latter.
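The data flow between the two machines can be sketched as a toy simulation. All class and function names here are hypothetical, programs are modelled as plain Python callables, and the "learning" done by the child is a deliberately trivial stand-in.

```python
class AdultMachine:
    """Has output channels; runs whatever program is in its instruction space."""
    def __init__(self, demo_program):
        self.program = demo_program

    def step(self, observation):
        return self.program(observation)

class ChildMachine:
    """No output channels: it only observes and writes to template memory."""
    def __init__(self, learner):
        self.learner = learner
        self.history = []
        self.template_memory = None

    def observe(self, observation, adult_output):
        self.history.append((observation, adult_output))
        self.template_memory = self.learner(self.history)

def run(child, adult, inputs, end_of_childhood):
    """Feed inputs to both machines; copy the template into the adult at end of childhood."""
    outputs = []
    for t, observation in enumerate(inputs):
        if t == end_of_childhood and child.template_memory is not None:
            adult.program = child.template_memory
        output = adult.step(observation)
        outputs.append(output)
        if t < end_of_childhood:
            child.observe(observation, output)
    return outputs

def learn_largest(history):
    """Toy learner: the mature program always outputs the largest observation seen so far."""
    best = max(observation for observation, _ in history)
    return lambda observation: best
```

For example, with a demo program that always outputs 0, inputs `[3, 7, 5, 1, 2]` and end of childhood at step 3, the adult outputs 0 during childhood and 7 afterwards.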

The new intelligence metric assigns intelligence values to programs for the child machine. For each hypothesis describing the Platonic universe (which now contains both machines, the end of childhood time value and the entire ruleset of the system) we compute the utility expectation value under the following logical counterfactual condition: "The program loaded into template memory at the end of childhood is the same as would result from the given program for the child machine if this program for the child machine would be run with the inputs actually produced by the given hypothesis regarding the Platonic universe". The intelligence value is then the expectation value of that quantity with respect to a Solomonoff measure over hypotheses describing the Platonic universe.

The important property of the logical counterfactual is that it doesn't state the given program is actually loaded into the child machine. It only says the resulting content of the template memory is the same as that which would be obtained from the given program assuming all the laws of the Platonic universe hold. This formulation prevents exploitation of side effects of the child source code since the condition doesn't fix the source code, only its output. Effectively, the child agent considers itself to be Cartesian, i.e. it can consider neither the side effects of its computations nor the possibility that the physical universe will violate the laws of its machinery. On the other hand, the child's output (the mature program) is a physicalist agent since it affects the physical universe by manifesting in it.

If such an AI is implemented in practice, it makes sense to prime the adult machine with a "demo" program which will utilize the output channels in various ways and do some "exploring" using its input channels. This would serve to provide the child with as much information as possible.

To sum up, the new expression for the intelligence metric is:

I(q) = E_{H_X}[E_{H_Y(E_c(X))}[E_L[U(Y, E_u(X)) | Q(X, t(X)) = Q*(X; q)]] | N]

  • q is the program priming the child machine.
  • H_X is the hypothesis producing the Platonic universe X (a sequence of bits encoding the state of the hardware as a function of time and the end-of-childhood time t(X)). It is a program for a fixed universal computing model C.
  • H_Y is the hypothesis producing the physical universe Y (an abstract sequence of bits). It is a program for the universal computer program ("virtual machine") E_c(X) written into storage E in X.
  • E_L is the logical expectation value, defined e.g. using evidence logic.
  • E_u(X) is a program for computing the utility function, which is written into storage E in X.
  • U is the utility function, which consists of applying E_u(X) to Y.
  • Q(X, t(X)) is the content of template memory at time t(X).
  • Q*(X; q) is the content that would be in the template memory if it was generated by program q receiving the inputs going into the child machine under hypothesis H_X.
  • N is the full ruleset of the hardware, including the reprogramming of the adult machine that occurs at t(X).
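A heavily simplified finite toy may help to see why the metric depends on the child program only through its output. Everything here is an assumption made for illustration: the Solomonoff measure is replaced by a short explicit list of weighted hypotheses, the inner expectations over the physical universe and over logical uncertainty are collapsed into a single utility lookup, and templates are integers rather than programs.

```python
def intelligence(child_program, hypotheses):
    """Toy stand-in for I(q).

    hypotheses: list of (weight, child_inputs, utility_of_template) triples,
    where the weights play the role of the measure over Platonic universes.
    """
    total = 0.0
    for weight, child_inputs, utility_of_template in hypotheses:
        template = child_program(child_inputs)  # stand-in for Q*(X; q)
        total += weight * utility_of_template(template)
    return total

# Two hypothetical Platonic universes with different input streams.
HYPOTHESES = [
    (0.75, (1, 4, 2), lambda template: 1.0 if template == 4 else 0.0),
    (0.25, (3, 0), lambda template: 1.0 if template == 3 else 0.0),
]

def q_max(inputs):
    return max(inputs)

def q_sorted(inputs):
    return sorted(inputs)[-1]  # different source code, same output as q_max

def q_zero(inputs):
    return 0
```

Here `q_max` and `q_sorted` receive the same score because the toy condition only looks at what ends up in template memory, not at the source code that produced it.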

Concluding Remarks

  • It would be very valuable to formulate and prove a mathematical theorem which expresses the sense in which the new formalism depends on the choice of universal computing model weakly (in particular it would validate the notion).
  • This formalism might have an interesting implication on AI safety. Since the child agent is Cartesian and has no output channels (it cannot create output channels because it is Cartesian) it doesn't present as much risk as an adult AI. Imagine template memory is write-only (which is not a problem for the formalism) and is implemented by a channel that doesn't store the result anywhere (in particular the mature program is never run). There can still be risk due to side effects of the mature program that manifest through presence of its partial or full versions in (non-template) memory of the child machine. For example, imagine the mature program is such that any person who reads it experiences a compulsion to run it. This risk can be mitigated by allowing both machines to interact only with a virtual world which receives no inputs from the external reality. Of course the AI might still be able to deduce external reality. However, this can be prevented by exploiting prior bias: we can equip the AI with a Solomonoff prior that favors the virtual world to such an extent that it would have no reason to deduce the real world. This way the AI is safe unless it invents a "generic" box-escaping protocol which would work in a huge variety of different universes that might contain the virtual world.
  • If we factor finite logical uncertainty into evaluation of the logical expectation value EL, the plot thickens. Namely, a new problem arises related to bias in the "logic prior". To solve this new problem we need to introduce yet another stage into AI development which might be dubbed "fetus". The fetus has no access to external inputs and is responsible for building a sufficient understanding of mathematics in the same sense the child is responsible for building a sufficient understanding of physics. Details will follow in subsequent posts, so stay tuned!

Friendly AI ideas needed: how would you ban porn?

6 Stuart_Armstrong 17 March 2014 06:00PM

To construct a friendly AI, you need to be able to make vague concepts crystal clear, cutting reality at the joints when those joints are obscure and fractal - and then implement a system that implements that cut.

There are lots of suggestions on how to do this, and a lot of work in the area. But having been over the same turf again and again, it's possible we've got a bit stuck in a rut. So to generate new suggestions, I'm proposing that we look at a vaguely analogous but distinctly different question: how would you ban porn?

Suppose you're put in charge of some government and/or legal system, and you need to ban pornography, and see that the ban is implemented. Pornography is the problem, not eroticism. So a lonely lower-class guy wanking off to "Fuck Slaves of the Caribbean XIV" in a Pussycat Theatre is completely off. But a middle-class couple experiencing a delicious frisson when they see a nude version of "Pirates of Penzance" at the Met is perfectly fine - commendable, even.

The distinction between the two cases is certainly not easy to spell out, and many are reduced to saying the equivalent of "I know it when I see it" when defining pornography. In terms of AI, this is equivalent to "value loading": refining the AI's values through interactions with human decision makers, who answer questions about edge cases and examples and serve as "learned judges" for the AI's concepts. But suppose that approach was not available to you - what methods would you implement to distinguish between pornography and eroticism, and ban one but not the other? Would they be sufficiently clear that a scriptwriter would know exactly what they need to cut or add to a movie in order to move it from one category to the other? What if the nude "Pirates of Penzance" was at a Pussycat Theatre and "Fuck Slaves of the Caribbean XIV" was at the Met?

To get maximal creativity, it's best to ignore the ultimate aim of the exercise (to find inspirations for methods that could be adapted to AI) and just focus on the problem itself. Is it even possible to get a reasonable solution to this question - a question much simpler than designing a FAI?

Reduced impact in practice: randomly sampling the future

1 Stuart_Armstrong 17 March 2014 04:30PM

This was an idea developed along with Eliezer when he was at the FHI in November.

In my post on "reduced impact AI" I talked about how you could estimate the impact of an AI by (hypothetically) observing a slice of the future. When Eliezer was here last November, we talked about how this could be cashed out, and came up with a model.

Ignore the "output channel" from the original reduced impact post, and consider this situation:

We have two computers running in two similarly shielded rooms. One of them is a simple algorithm running through the digits of e (call it run-e), that was soon after deleted. Another is our dear old friend clippy, the AI motivated to maximise paperclips. Without removing clippy's paperclip preferences, we'd like to motivate it to have next to no impact on the world. How should we define "next to no impact"?

The obvious thing is to compare it with run-e, and require it to have no more expected impact than that machine. How could this be measured? Well, the original idea was to put a stochastic process on clippy as it was turned on (inside the shielded room) and say that clippy has reduced impact if we can't find any extra evidence, from a vantage point 50 years in the future, that clippy was successfully turned on. Now put the same stochastic process on run-e and define:

Clippy has reduced impact if, from a vantage of 50 years into the future, we have no more evidence that clippy was turned on than we have of run-e being turned on.


Overcoming the Loebian obstacle using evidence logic

4 Squark 14 March 2014 06:34PM

In this post I intend to:

  • Briefly explain the Loebian obstacle and its relevance to AI (feel free to skip it if you know what the Loebian obstacle is).
  • Suggest a solution in the form of a formal system which assigns probabilities (more generally probability intervals) to mathematical sentences (and which admits a form of "Loebian" self-referential reasoning). The method is well-defined both for consistent and inconsistent axiomatic systems, the latter being important in analysis of logical counterfactuals like in UDT.



When can we consider a mathematical theorem to be established? The obvious answer is: when we proved it. Wait, proved it in what theory? Well, that's debatable. ZFC is a popular choice for mathematicians, but how do we know it is consistent (let alone sound, i.e. that it only proves true sentences)? All those spooky infinite sets, how do you know it doesn't break somewhere along the line? There's lots of empirical evidence, but we can't prove it, and it's proofs we're interested in, not mere evidence, right?

Peano arithmetic seems like a safer choice. After all, if the natural numbers don't make sense, what does? Let's go with that. Suppose we have a sentence s in the language of PA. If someone presents us with a proof p in PA, we believe s is true. Now consider the following situation: instead of giving you a proof of s, someone gives you a PA-proof p1 that p exists. After all, PA admits defining "PA-proof" in PA language. Common sense tells us that p1 is a sufficient argument to believe s. Maybe we can prove it within PA? That is, if we have a proof of "if a proof of s exists then s" and a proof of R(s)="a proof of s exists" then we just proved s. That's just modus ponens.

There are two problems with that.

First, there's no way to prove the sentence L:="for all s if R(s) then s", since it's not a PA-sentence at all. The problem is that "for all s" references s as a natural number encoding a sentence. On the other hand, "then s" references s as the truth-value of the sentence. Maybe we can construct a PA-formula T(s) which means "the sentence encoded by the number s is true"? Nope, that would get us in trouble with the liar paradox (it would be possible to construct a sentence saying "this sentence is false").

Second, Loeb's theorem says that if we can prove L(s):="if R(s) then s" for a given s, then we can prove s. This is a problem since it means there can be no way to prove L(s) for all s in any sense, since it's unprovable for s which are unprovable. In other words, if you proved not-s, there is no way to conclude that "no proof of s exists".
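Stated compactly in the notation of this post, with R(s) the provability sentence for s and the turnstile meaning "PA proves" (the LaTeX rendering is added here for reference only):

```latex
% Loeb's theorem for PA:
\text{if } \mathrm{PA} \vdash \big(R(s) \rightarrow s\big) \text{ then } \mathrm{PA} \vdash s
% Contrapositive, which is the form used above:
\text{if } \mathrm{PA} \nvdash s \text{ then } \mathrm{PA} \nvdash \big(R(s) \rightarrow s\big)
```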

What if we add an inference rule Q to our logic allowing us to go from R(s) to s? Let's call the new formal system PA1. Now p1 appended by a Q-step becomes an honest proof of s in PA1. Problem solved? Not really! Now someone can give you a proof of R1(s):="a PA1-proof of s exists". Back to square one! Wait a second, what if we add a new rule Q1 allowing us to go from R1(s) to s? OK, but now we get R2(s):="a PA2-proof of s exists". Hmm, what if we add an infinite number of rules Qk? Fine, but now we get Rω(s):="a PAω-proof of s exists". And so on, and so forth; the recursive ordinals are aplenty...

Bottom line, Loeb's theorem works for any theory containing PA, so we're stuck.


Suppose you're trying to build a self-modifying AGI called "Lucy". Lucy works by considering possible actions and looking for formal proofs that taking one of them will increase expected utility. In particular, it has self-modifying actions in its strategy space. A self-modifying action creates essentially a new agent: Lucy2. How can Lucy decide that becoming Lucy2 is a good idea? Well, a good step in this direction would be proving that Lucy2 would only take actions that are "good". I.e., we would like Lucy to reason as follows: "Lucy2 uses the same formal system as I, so if she decides to take action a, it's because she has a proof p of the sentence s(a) that 'a increases expected utility'. Since such a proof exists, a does increase expected utility, which is good news!" Problem: Lucy is using L in there, applied to her own formal system! That cannot work! So, Lucy would have a hard time self-modifying in a way which doesn't make its formal system weaker.

As another example where this poses a problem, suppose Lucy observes another agent called "Kurt". Lucy knows, by analyzing her sensory evidence, that Kurt proves theorems using the same formal system as Lucy. Suppose Lucy found out that Kurt proved theorem s, but she doesn't know how. We would like Lucy to be able to conclude s is, in fact, true (at least with the probability that her model of physical reality is correct). Alas, she cannot.

See MIRI's paper for more discussion.

Evidence Logic

Here, cousin_it explains a method to assign probabilities to sentences in an inconsistent theory T. It works as follows. Consider sentence s. Since T is inconsistent, there are T-proofs both of s and of not-s. Well, in a courtroom both sides are allowed to have arguments, why not try the same approach here? Let's weight the proofs as a function of their length, analogously to weighting hypotheses in Solomonoff induction. That is, suppose we have a prefix-free encoding of proofs as bit sequences. Then, it makes sense to consider a random bit sequence and ask whether it is a proof of something. Define the probability of s to be

P(s) := (probability of a random sequence to be a proof of s) / (probability of a random sequence to be a proof of s or not-s)
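As a small worked sketch of this formula: under a prefix-free encoding, a random bit sequence starts with a given proof of length n bits with probability 2^-n, so each proof contributes weight 2^-n. The function names and the representation of proofs by their encoded lengths are illustrative assumptions.

```python
from fractions import Fraction

def proof_weight(lengths):
    """Total probability that a random bit sequence encodes one of the given proofs."""
    return sum((Fraction(1, 2 ** n) for n in lengths), Fraction(0))

def probability(lengths_for, lengths_against):
    """P(s) from the bit-lengths of the known proofs of s and of not-s."""
    w_for = proof_weight(lengths_for)
    w_against = proof_weight(lengths_against)
    if w_for + w_against == 0:
        raise ValueError("no proofs of s or of not-s: P(s) is undefined")
    return w_for / (w_for + w_against)
```

For instance, a single 3-bit proof of s against a single 4-bit proof of not-s gives P(s) = (1/8) / (1/8 + 1/16) = 2/3.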

Nice, but it doesn't solve the Loebian obstacle yet.

I will now formulate an extension of this idea that allows assigning an interval of probabilities [Pmin(s), Pmax(s)] to any sentence s. This interval is a sort of "Knightian uncertainty". I have some speculations how to extract a single number from this interval in the general case, but even without that, I believe that Pmin(s) = Pmax(s) in many interesting cases.

First, the general setting:

  • With every sentence s, there are certain texts v which are considered to be "evidence relevant to s". These are divided into "negative" and "positive" evidence. We define sgn(v) := +1 for positive evidence, sgn(v) := -1 for negative evidence.
  • Each piece of evidence v is associated with the strength of the evidence str_s(v), which is a number in [0, 1].
  • Each piece of evidence v is associated with an "energy" function e_{s,v} : [0, 1] -> [0, 1]. It is a continuous convex function.
  • The "total energy" associated with s is defined to be e_s := Σ_v 2^{-l(v)} e_{s,v}, where l(v) is the length of v.
  • Since the e_{s,v} are continuous and convex, so is e_s. Hence it attains its minimum on a closed interval, which is [Pmin(s), Pmax(s)] by definition.

Now, the details:

  • A piece of evidence v for s is defined to be one of the following:
    • a proof of s
      • sgn(v) := +1
      • str_s(v) := 1
      • e_{s,v}(q) := (1 - q)^2
    • a proof of not-s
      • sgn(v) := -1
      • str_s(v) := 1
      • e_{s,v}(q) := q^2
    • a piece of positive evidence for the sentence R-+(s, p) := "Pmin(s) >= p"
      • sgn(v) := +1
      • str_s(v) := str_{R-+(s, p)}(v) p
      • e_{s,v}(q) := 0 for q > p; str_{R-+(s, p)}(v) (q - p)^2 for q < p
    • a piece of negative evidence for the sentence R--(s, p) := "Pmin(s) < p"
      • sgn(v) := +1
      • str_s(v) := str_{R--(s, p)}(v) p
      • e_{s,v}(q) := 0 for q > p; str_{R--(s, p)}(v) (q - p)^2 for q < p
    • a piece of negative evidence for the sentence R++(s, p) := "Pmax(s) > p"
      • sgn(v) := -1
      • str_s(v) := str_{R++(s, p)}(v) (1 - p)
      • e_{s,v}(q) := 0 for q < p; str_{R++(s, p)}(v) (q - p)^2 for q > p
    • a piece of positive evidence for the sentence R+-(s, p) := "Pmax(s) <= p"
      • sgn(v) := -1
      • str_s(v) := str_{R+-(s, p)}(v) (1 - p)
      • e_{s,v}(q) := 0 for q < p; str_{R+-(s, p)}(v) (q - p)^2 for q > p

Technicality: I suggest that for our purposes, a "proof of s" is allowed to be a proof of a sentence equivalent to s in 0-th order logic (e.g. not-not-s). This ensures that our probability intervals obey the properties we'd like them to obey wrt propositional calculus.
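The construction above can be explored numerically with a crude grid search. This is a sketch only: evidence is represented as (length in bits, energy function) pairs, the minimum is located on a uniform grid with a small tolerance, and the helper names are hypothetical. Only two of the six kinds of evidence are modelled (direct proofs, and evidence that Pmin(s) >= p).

```python
def proof_energy(positive):
    """Energy of a proof of s (positive=True) or of not-s (positive=False)."""
    return (lambda q: (1 - q) ** 2) if positive else (lambda q: q ** 2)

def lower_bound_energy(p, strength):
    """Energy of evidence of the given strength for the sentence 'Pmin(s) >= p'."""
    return lambda q: strength * (q - p) ** 2 if q < p else 0.0

def probability_interval(evidence, n_points=1001, tol=1e-12):
    """Approximate [Pmin, Pmax] as the set of grid points minimising the total energy."""
    qs = [i / (n_points - 1) for i in range(n_points)]
    totals = [sum(2.0 ** -length * energy(q) for length, energy in evidence) for q in qs]
    e_min = min(totals)
    minimisers = [q for q, e in zip(qs, totals) if e - e_min <= tol]
    # Convexity of the total energy means the minimisers form one contiguous interval.
    return minimisers[0], minimisers[-1]
```

With only direct proofs the total energy is W+ (1 - q)^2 + W- q^2, where W+ and W- are the summed weights 2^{-l(v)} of the proofs of s and of not-s. Its unique minimum is at q = W+ / (W+ + W-), so in that case the interval collapses to the single value given by cousin_it's formula.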

Now, consider again our self-modifying agent Lucy. Suppose she makes her decisions according to a system of evidence logic like above. She can now reason along the lines of "Lucy2 uses the same formal system as I. If she decides to take action a, it's because she has strong evidence for the sentence s(a) that 'a increases expected utility'. I just proved that there would be strong evidence for the expected utility increasing. Therefore, the expected utility would have a high value with high logical probability. But evidence for high logical probability of a sentence is evidence for the sentence itself. Therefore, I now have evidence that expected utility will increase!"

This analysis is very sketchy, but I think it lends hope that the system leads to the desired results.

Updateless Intelligence Metrics in the Multiverse

6 Squark 08 March 2014 12:25AM

Followup to: Intelligence Metrics with Naturalized Induction using UDT

In the previous post I have defined an intelligence metric solving the duality (aka naturalized induction) and ontology problems in AIXI. This model used a formalization of UDT using Benja's model of logical uncertainty. In the current post I am going to:

  • Explain some problems with my previous model (that section can be skipped if you don't care about the previous model and only want to understand the new one).
  • Formulate a new model solving these problems. Incidentally, the new model is much closer to the usual way UDT is represented. It is also based on a different model of logical uncertainty.
  • Show how to define intelligence without specifying the utility function a priori.
  • The new model requires utility functions formulated with an abstract ontology, i.e. well-defined on the entire Tegmark level IV multiverse. These are generally difficult to construct (i.e. the ontology problem resurfaces in a different form). I outline a method for constructing such utility functions.

Problems with UIM 1.0

The previous model postulated that naturalized induction uses a version of Solomonoff induction updated in the direction of an innate model N with a temporal confidence parameter t. This entails several problems:

  • The dependence on the parameter t whose relevant value is not easy to determine.
  • Conceptual divergence from the UDT philosophy that we should not update at all.
  • Difficulties with counterfactual mugging and acausal trade scenarios in which G doesn't exist in the "other universe".
  • Once G discovers even a small violation of N at a very early time, it loses all ground for trusting its own mind. Effectively, G would find itself in the position of a Boltzmann brain. This is especially dangerous when N over-specifies the hardware running G's mind. For example assume N specifies G to be a human brain modeled on the level of quantum field theory (particle physics). If G discovers that in truth it is a computer simulation on the merely molecular level, it loses its epistemic footing completely.

UIM 2.0

I now propose the following intelligence metric (the formula goes first and then I explain the notation):

I_U(q) := E_T[E_D[E_L[U(Y(D)) | Q(X(T)) = q]] | N]

  • N is the "ideal" model of the mind of the agent G. For example, it can be a universal Turing machine M with special "sensory" registers e whose values can change arbitrarily after each step of M. N is specified as a system of constraints on an infinite sequence of natural numbers X, which should be thought of as the "Platonic ideal" realization of G, i.e. an imaginary realization which cannot be tampered with by external forces such as anvils. As we shall see, this "ideal" serves as a template for "physical" realizations of G which are prone to violations of N.
  • Q is a function that decodes G's code from X, e.g. the program loaded in M at time 0. q is a particular value of this code whose (utility specific) intelligence I_U(q) we are evaluating.
  • T is a random (as in random variable) computable hypothesis about the "physics" of X, i.e. a program computing X implemented on some fixed universal computing model (e.g. universal Turing machine) C. T is distributed according to the Solomonoff measure; however, the expectation value in the definition of I_U(q) is conditional on N, i.e. we restrict to programs which are compatible with N. From the UDT standpoint, T is the decision algorithm itself and the uncertainty in T is "introspective" uncertainty, i.e. the uncertainty of the putative precursor agent PG (the agent creating G, e.g. an AI programmer) regarding her own decision algorithm. Note that we don't actually need to postulate a PG which is "agenty" (i.e. use for N a model of AI hardware together with a model of the AI programmer programming this hardware); we can be content to remain in a more abstract framework.
  • D is a random computable hypothesis about the physics of Y, where Y is an infinite sequence of natural numbers representing the physical (as opposed to "ideal") universe. D is distributed according to the Solomonoff measure and the respective expectation value is unconditional (i.e. we use the raw Solomonoff prior for Y which makes the model truly updateless). In UDT terms, D is indexical uncertainty.
  • U is a computable function from infinite sequences of natural numbers to [0, 1] representing G's utility function.
  • L represents logical uncertainty. It can be defined by the model explained by cousin_it here, together with my previous construction for computing logical expectation values of random variables in [0, 1]. That is, we define E_L(d_k) to be the probability that a random string of bits p encodes a proof of the sentence "Q(X(T)) = q implies that the k-th digit of U(Y(D)) is 1" in some prefix-free encoding of proofs, conditional on p encoding the proof of either that sentence or the sentence "Q(X(T)) = q implies that the k-th digit of U(Y(D)) is 0". We then define
    E_L[U(Y(D)) | Q(X(T)) = q] := Σ_k 2^{-k} E_L(d_k). Here, the sentences and the proofs belong to some fixed formal logic F, e.g. Peano arithmetic or ZFC.
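The digit-wise definition of the logical expectation can be sketched numerically as follows. The function name is hypothetical, the input is a finite list of digit probabilities (the k-th entry being the logical probability that the k-th binary digit of the utility is 1), and truncating after n digits changes the result by at most 2^-n.

```python
def logical_expectation(digit_probabilities):
    """Sum of 2^-k times the probability that the k-th binary digit is 1, for k = 1, 2, ..."""
    return sum(p * 2.0 ** -k for k, p in enumerate(digit_probabilities, start=1))
```

For example, digit probabilities `[1.0, 0.0, 1.0]` correspond to the binary fraction 0.101, and the function returns 1/2 + 1/8.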


  • G's mental architecture N is defined in the "ideal" universe X where it is inviolable. However, G's utility function U inhabits the physical universe Y. This means that a highly intelligent q is designed so that imperfect realizations of G inside Y generate as many utilons as possible. A typical T is a low Kolmogorov complexity universe which contains a perfect realization of G. Q(X(T)) is L-correlated to the programming of imperfect realizations of G inside Y because T serves as an effective (approximate) model of the formation of these realizations. For abstract N, this means q is highly intelligent when a Solomonoff-random "M-programming process" producing q entails a high expected value of U.
  • Solving the Loebian obstacle requires a more sophisticated model of logical uncertainty. I think I can formulate such a model. I will explain it in another post after more contemplation.
  • It is desirable that the encoding of proofs p satisfies a universality property so that the length of the encoding can only change by an additive constant, analogously to the weak dependence of Kolmogorov complexity on C. It is in fact not difficult to formulate this property and show the existence of appropriate encodings. I will discuss this point in more detail in another post.

Generic Intelligence

It seems conceptually desirable to have a notion of intelligence independent of the specifics of the utility function. Such an intelligence metric can be constructed in a way analogous to what I've done in UIM 1.0; however, it is no longer a special case of the utility-specific metric.

Assume N to consist of a machine M connected to a special storage device E. Assume further that at X-time 0, E contains a valid C-program u realizing a utility function U, but that this is the only constraint on the initial content of E imposed by N. Define

I(q) := E_T[E_D[E_L[u(Y(D); X(T)) | Q(X(T)) = q]] | N]

Here, u(Y(D); X(T)) means that we decode u from X(T) and evaluate it on Y(D). Thus utility depends both on the physical universe Y and the ideal universe X. This means G is not precisely a UDT agent but rather a "proto-agent": only when a realization of G reads u from E does it know which other realizations of G in the multiverse (the Solomonoff ensemble from which Y is selected) should be considered as the "same" agent UDT-wise.

Incidentally, this can be used as a formalism for reasoning about agents that don't know their utility functions. I believe this has important applications in metaethics I will discuss in another post.

Utility Functions in the Multiverse

UIM 2.0 is a formalism that solves the diseases of UIM 1.0 at the price of losing N in the capacity of the ontology for utility functions. We need the utility function to be defined on the entire multiverse i.e. on any sequence of natural numbers. I will outline a way to extend "ontology-specific" utility functions to the multiverse through a simple example.

Suppose G is an agent that cares about universes realizing the Game of Life, its utility function U corresponding to e.g. some sort of glider maximization with exponential temporal discount. Fix a specific way DC to decode any Y into a history of a 2D cellular automaton with two cell states ("dead" and "alive"). Our multiversal utility function U* assigns Ys for which DC(Y) is a legal Game of Life the value U(DC(Y)). All other Ys are treated by dividing the cells into cells O obeying the rules of Life and cells V violating the rules of Life. We can then evaluate U on O only (assuming it has some sort of locality) and assign V utility by some other rule, e.g.:

  • zero utility
  • constant utility per V cell with temporal discount
  • constant utility per unit of surface area of the boundary between O and V with temporal discount
U*(Y) is then defined to be the sum of the values assigned to O(Y) and V(Y).
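The construction above can be sketched on a toy, finite grid. This is only an illustration under loud assumptions: the decoding DC is taken as already applied (the input is a list of grids), cells outside the grid are treated as dead, the "glider maximization" utility U is replaced by a much simpler local stand-in (one unit per living O cell), and V cells use the "constant utility per V cell with temporal discount" rule. The names `life_step_cell`, `multiversal_utility`, `discount` and `v_cell_utility` are hypothetical.

```python
from typing import List

Grid = List[List[int]]  # 1 = alive, 0 = dead


def life_step_cell(prev: Grid, r: int, c: int) -> int:
    """State the rules of Life prescribe for cell (r, c) given the previous grid.

    Assumption of this sketch: cells outside the finite grid count as dead.
    """
    rows, cols = len(prev), len(prev[0])
    neighbors = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                neighbors += prev[rr][cc]
    if prev[r][c] == 1:
        return 1 if neighbors in (2, 3) else 0
    return 1 if neighbors == 3 else 0


def multiversal_utility(history: List[Grid],
                        discount: float = 0.5,
                        v_cell_utility: float = -1.0) -> float:
    """Toy U*: sum of the value assigned to O cells and the value assigned to V cells.

    O cells (obeying Life) contribute a stand-in local utility of 1 per living
    cell; V cells (violating Life) contribute a constant. Both are discounted
    by discount**t. The grid at t = 0 has no predecessor and is not scored.
    """
    total = 0.0
    for t in range(1, len(history)):
        prev, cur = history[t - 1], history[t]
        weight = discount ** t
        for r in range(len(cur)):
            for c in range(len(cur[0])):
                if cur[r][c] == life_step_cell(prev, r, c):
                    total += weight * cur[r][c]       # O cell
                else:
                    total += weight * v_cell_utility  # V cell
    return total


# A blinker: horizontal at one step, vertical at the next.
blinker_h = [[0] * 5 for _ in range(5)]
blinker_v = [[0] * 5 for _ in range(5)]
for i in (1, 2, 3):
    blinker_h[2][i] = 1
    blinker_v[i][2] = 1

legal = multiversal_utility([blinker_h, blinker_v])    # every cell obeys Life
illegal = multiversal_utility([blinker_h, blinker_h])  # a frozen blinker violates Life
```

Because the two contributions are added, a history with a few violating cells still receives credit for the region that does obey the rules, which is the additive behavior the remarks below rely on.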


  • The construction of U* depends on the choice of DC. However, U* only depends on DC weakly since given a hypothesis D which produces a Game of Life wrt some other low complexity encoding, there is a corresponding hypothesis D' producing a Game of Life wrt DC. D' is obtained from D by appending a corresponding "transcoder" and thus it is only less Solomonoff-likely than D by an O(1) factor.
  • Since the accumulation between O and V is additive rather than e.g. multiplicative, a U*-agent doesn't behave as if it a priori expects the universe to follow the rules of Life but may have strong preferences about the universe actually doing so.
  • This construction is reminiscent of Egan's dust theory in the sense that all possible encodings contribute. However, here they are weighted by the Solomonoff measure.


The intelligence of a physicalist agent is defined to be the UDT-value of the "decision" to create the agent by the process creating the agent. The process is selected randomly from a Solomonoff measure conditional on obeying the laws of the hardware on which the agent is implemented. The "decision" is made in an "ideal" universe in which the agent is Cartesian, but the utility function is evaluated on the real universe (raw Solomonoff measure). The interaction between the two "universes" is purely via logical conditional probabilities (acausal).

If we want to discuss intelligence without specifying a utility function up front, we allow the "ideal" agent to read a program describing the utility function from a special storage immediately after "booting up".

Utility functions in the Tegmark level IV multiverse are defined by specifying a "reference universe", specifying an encoding of the reference universe and extending a utility function defined on the reference universe to encodings which violate the reference laws by summing the utility of the portion of the universe which obeys the reference laws with some function of the space-time shape of the violation.

How to Study Unsafe AGI's safely (and why we might have no choice)

10 Punoxysm 07 March 2014 07:24AM


A serious possibility is that the first AGI(s) will be developed in a Manhattan Project style setting before any sort of friendliness/safety constraints can be integrated reliably. They will also be substantially short of the intelligence required to exponentially self-improve. Within a certain range of development and intelligence, containment protocols can make them safe to interact with. This means they can be studied experimentally, and the architecture(s) used to create them better understood, furthering the goal of safely using AI in less constrained settings.

Setting the Scene

The year is 2040, and in the last decade a series of breakthroughs in neuroscience, cognitive science, machine learning, and computer hardware has put the long-held dream of a human-level artificial intelligence within our grasp. Following the wild commercial success of lifelike robotic pets, the integration of AI assistants and concierges into everyday work and leisure, and STUDYBOT's graduation from Harvard's Online degree program with an octuple major and full honors, DARPA, the NSF and the European Research Council have announced joint funding of an artificial intelligence program that will create a superhuman intelligence in 3 years.

Safety was announced as a critical element of the project, especially in light of the self-modifying LeakrVirus that catastrophically disrupted markets in '36 and '37. The planned protocols have not been made public, but it seems they will be centered on traditional computer security rather than techniques from the nascent field of Provably Safe AI, which were deemed impossible to integrate on the current project timeline.

Technological and/or Political issues could force the development of AI without theoretical safety guarantees that we'd certainly like, but there is a silver lining

A lot of the discussion around LessWrong and MIRI that I've seen (and I haven't seen all of it, please send links!) seems to focus very strongly on the situation of an AI that can self-modify or construct further AIs, resulting in an exponential explosion of intelligence (FOOM/Singularity). The focus on FAI is on finding an architecture that can be explicitly constrained (and a constraint set that won't fail to do what we desire).

My argument is essentially that there could be a critical multi-year period preceding any possible exponentially self-improving intelligence during which a series of AGIs of varying intelligence, flexibility and architecture will be built. This period will be fast and frantic, but it will be incredibly fruitful and vital both in figuring out how to make an AI sufficiently strong to exponentially self-improve and in how to make it safe and friendly (or develop protocols to bridge the even riskier period between when we can develop FOOM-capable AIs and when we can ensure their safety). 

I'll break this post into three parts.
  1. First, why is a substantial period of proto-singularity more likely than a straight-to-singularity situation?
  2. Second, what strategies will be critical to developing, controlling, and learning from these pre-FOOM AIs?
  3. Third, what are the political challenges that will develop immediately before and during this period?
Why is a proto-singularity likely?

The requirement for a hard singularity, an exponentially self-improving AI, is that the AI can substantially improve itself in a way that enhances its ability to further improve itself, which requires the ability to modify its own code; access to resources like time, data, and hardware to facilitate these modifications; and the intelligence to execute a fruitful self-modification strategy.

The first two conditions can (and should) be directly restricted. I'll elaborate more on that later, but basically any AI should be very carefully sandboxed (unable to affect its software environment), and should have its access to resources strictly controlled. Perhaps no data goes in without human approval or while the AI is running. Perhaps nothing comes out either. Even a hyperpersuasive hyperintelligence will be slowed down (at least) if it can only interact with prespecified tests (how do you test AGI? No idea but it shouldn't be harder than friendliness). This isn't a perfect situation. Eliezer Yudkowsky presents several arguments for why an intelligence explosion could happen even when resources are constrained (see Section 3 of Intelligence Explosion Microeconomics), not to mention ways that those constraints could be defied even if engineered perfectly (by the way, I would happily run the AI box experiment with anybody, I think it is absurd that anyone would fail it! [I've read Tuxedage's accounts, and I think I actually do understand how a gatekeeper could fail, but I also believe I understand how one could be trained to succeed even against a much stronger foe than any person who has played the part of the AI]).

But the third emerges from the way technology typically develops. I believe it is incredibly unlikely that an AGI will develop in somebody's basement, or even in a small national lab or top corporate lab. When there is no clear notion of what a technology will look like, it is usually not developed. Positive, productive accidents are somewhat rare in science, but they are remarkably rare in engineering (please, give counterexamples!). The creation of an AGI will likely not happen by accident; there will be a well-funded, concrete research and development plan that leads up to it, such as the AI Manhattan Project described above. But even when there is a good plan successfully executed, prototypes are slow, fragile, and poor-quality compared to what is possible even with approaches using the same underlying technology. It seems very likely to me that the first AGI will be a Chicago Pile, not a Trinity; recognizably a breakthrough but, with proper consideration, not immediately dangerous or unmanageable. [Note, you don't have to believe this to read the rest of this. If you disagree, consider the virtues of redundancy and the question of what safety an AI development effort should implement if they can't be persuaded to delay long enough for theoretically sound methods to become available].

A Manhattan Project style effort makes a relatively weak, controllable AI even more likely, for two reasons. First, such a project can implement substantial safety protocols that are explicitly researched in parallel with primary development. Second, the total resources, in hardware and brainpower, devoted to the AI will be much greater than in a smaller project, which sets a correspondingly higher bar that the resulting AGI must clear before it can both self-modify exponentially and break the security procedures.

Strategies to handle AIs in the proto-Singularity, and why they're important

First, take a look at the External Constraints Section of this MIRI Report and/or this article on AI Boxing. I will be talking mainly about these approaches. There are certainly others, but these are the easiest to extrapolate from current computer security.

These AIs will provide us with the experimental knowledge to better handle the construction of even stronger AIs. If careful, we will be able to use these proto-Singularity AIs to learn about the nature of intelligence and cognition, to perform economically valuable tasks, and to test theories of friendliness (not perfectly, but well enough to start). 

"If careful" is the key phrase. I mentioned sandboxing above. Computer security is key to any attempt to contain an AI: monitoring the source code, setting a threshold for too much changing too fast at which point a failsafe freezes all computation, and keeping extremely strict control over copies of the source. Some architectures will be more inherently dangerous and less predictable than others. A simulation of a physical brain, for instance, will be fairly opaque (depending on how far neuroscience has gone) but could have almost no potential to self-improve to an uncontrollable degree if its access to hardware is limited (it won't be able to make itself much more efficient on fixed resources). Other architectures will have other properties. Some will be utility optimizing agents. Some will have behaviors but no clear utility. Some will be opaque, some transparent.
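To make the "too much changing too fast" failsafe concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the class name `ChangeRateFailsafe`, the idea of measuring change as the fraction of source modules whose hash differs between snapshots, and the sliding window of recent checks. A real containment system would need far more than this (for one thing, the monitor itself must be outside the AI's reach).

```python
import hashlib


class ChangeRateFailsafe:
    """Freeze computation once too much monitored source changes too fast.

    Hypothetical design: each check receives a snapshot mapping module name
    to source text. The fraction of modules that changed since the previous
    snapshot is recorded, and if the sum over the last `window` checks
    exceeds `max_changed_fraction`, the failsafe trips and stays tripped.
    """

    def __init__(self, max_changed_fraction: float, window: int):
        self.max_changed_fraction = max_changed_fraction
        self.window = window
        self.previous = None   # digests from the previous snapshot
        self.recent = []       # changed fractions from recent checks
        self.frozen = False

    @staticmethod
    def _digests(modules):
        return {name: hashlib.sha256(src.encode("utf-8")).hexdigest()
                for name, src in modules.items()}

    def check(self, modules) -> bool:
        """Return True if computation may continue, False if frozen."""
        current = self._digests(modules)
        if self.previous is not None:
            names = set(current) | set(self.previous)
            changed = sum(1 for n in names
                          if current.get(n) != self.previous.get(n))
            self.recent.append(changed / len(names) if names else 0.0)
            self.recent = self.recent[-self.window:]
            if sum(self.recent) > self.max_changed_fraction:
                self.frozen = True  # never reset automatically
        self.previous = current
        return not self.frozen
```

The design choice worth noting is that the freeze is one-way: once tripped, only a human decision outside this code should be able to resume computation.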

All will have a theory of how they operate, which can be refined by actual experimentation. This is what we can gain! We can set up controlled scenarios like honeypots to catch malevolence. We can evaluate our ability to monitor and read the thoughts of the AGI. We can develop stronger theories of how damaging self-modification actually is to imposed constraints. We can test our abilities to add constraints to even the base state. But do I really have to justify the value of experimentation?

I am familiar with criticisms based on absolutely incomprehensibly perceptive and persuasive hyperintelligences being able to overcome any security, but I've tried to outline above why I don't think we'd be dealing with that case.

Political issues

Right now AGI is really a political non-issue: blue sky even compared to space exploration and fusion, both of which actually receive substantial government funding. I think that this will change in the period immediately leading up to my hypothesized AI Manhattan Project. The AI Manhattan Project can only happen with a lot of political will behind it, which will probably mean a spiral of scientific advancements, hype and threat of competition from external unfriendly sources. Think space race.

So suppose that the first few AIs are built under well controlled conditions. Friendliness is still not perfected, but we think/hope we've learned some valuable basics. But now people want to use the AIs for something. So what should be done at this point?

I won't try to speculate what happens next (well you can probably persuade me to, but it might not be as valuable), beyond extensions of the protocols I've already laid out, hybridized with notions like Oracle AI. It certainly gets a lot harder, but hopefully experimentation on the first, highly-controlled generation of AI to get a better understanding of their architectural fundamentals, combined with more direct research on friendliness in general would provide the groundwork for this.

Intelligence Metrics with Naturalized Induction using UDT

12 Squark 21 February 2014 12:23PM

Followup to: Intelligence Metrics and Decision Theory
Related to: Bridge Collapse: Reductionism as Engineering Problem

A central problem in AGI is giving a formal definition of intelligence. Marcus Hutter has proposed AIXI as a model of perfectly intelligent agent. Legg and Hutter have defined a quantitative measure of intelligence applicable to any suitable formalized agent such that AIXI is the agent with maximal intelligence according to this measure.

Legg-Hutter intelligence suffers from a number of problems I have previously discussed, the most important being:

  • The formalism is inherently Cartesian. Solving this problem is known as naturalized induction and it is discussed in detail here.
  • The utility function Legg & Hutter use is a formalization of reinforcement learning, while we would like to consider agents with arbitrary preferences. Moreover, a real AGI designed with reinforcement learning would tend to wrest control of the reinforcement signal from the operators (there must be a classic reference on this but I can't find it. Help?). It is straightforward to tweak the formalism to allow for any utility function which depends on the agent's sensations and actions; however, we would like to be able to use any ontology for defining it.
Orseau and Ring proposed a non-Cartesian intelligence metric; however, their formalism appears to be too general. In particular, there is no Solomonoff induction or any analogue thereof; instead a completely general probability measure is used.

My attempt at defining a non-Cartesian intelligence metric ran into problems of decision-theoretic flavor. The way I tried to use UDT seems unsatisfactory, and later I tried a different approach related to metatickle EDT.

In this post, I claim to accomplish the following:
  • Define a formalism for logical uncertainty. When I started writing this I thought this formalism might be novel but now I see it is essentially the same as that of Benja.
  • Use this formalism to define a non-constructive formalization of UDT. By "non-constructive" I mean something that assigns values to actions rather than a specific algorithm like here.
  • Apply the formalization of UDT to my quasi-Solomonoff framework to yield an intelligence metric.
  • Slightly modify my original definition of the quasi-Solomonoff measure so that the confidence of the innate model becomes a continuous rather than discrete parameter. This leads to an interesting conjecture.
  • Propose a "preference agnostic" variant as an alternative to Legg & Hutter's reinforcement learning.
  • Discuss certain anthropic and decision-theoretic aspects.

Logical Uncertainty

The formalism introduced here was originally proposed by Benja.

Fix a formal system F. We want to be able to assign probabilities to statements s in F, taking into account limited computing resources. Fix D a natural number related to the amount of computing resources that I call "depth of analysis".

Define P_0(s) := 1/2 for all s to be our initial prior, i.e. each statement's truth value is decided by a fair coin toss. Now define
P_D(s) := P_0(s | there are no contradictions of length <= D).

Consider X to be a number in [0, 1] given by a definition in F. Then d_k(X) := "The k-th digit of the binary expansion of X is 1" is a statement in F. We define E_D(X) := Σ_k 2^{-k} P_D(d_k(X)).


  • Clearly if s is provable in F then for D >> 0, P_D(s) = 1. Similarly if "not s" is provable in F then for D >> 0, P_D(s) = 0.
  • If each digit of X is decidable in F then lim_{D -> inf} E_D(X) exists and equals the value of X according to F.
  • For s of length > D, P_D(s) = 1/2 since no contradiction of length <= D can involve s.
  • It is an interesting question whether lim_{D -> inf} P_D(s) exists for any s. It seems false that this limit always exists and equals 0 or 1, i.e. this formalism is not a loophole in Goedel incompleteness. To see this consider statements that require a high (arithmetical hierarchy) order halting oracle to decide.
  • In computational terms, D corresponds to non-deterministic spatial complexity. It is spatial since we assign truth values simultaneously to all statements so in any given contradiction it is enough to retain the "thickest" step. It is non-deterministic since it's enough for a contradiction to exist; we don't have an actual computation which produces it. I suspect this can be made more formal using the Curry-Howard isomorphism; unfortunately I don't understand the latter yet.
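A toy version of these definitions may help fix intuitions. The sketch below is not the formalism itself: it assumes a finite list of statements, and it assumes the contradictions have already been found and are handed in as (length, pattern) pairs, where a pattern is a set of truth values that cannot hold together. The function names `p_depth` and `e_depth` and the example lengths are hypothetical.

```python
from itertools import product


def p_depth(statements, contradictions, depth, target):
    """Toy P_D(target): fair-coin prior over truth assignments, conditioned
    on no contradiction of length <= depth being realized.

    contradictions: list of (length, pattern) pairs, where pattern maps
    statement -> truth value and the pattern as a whole is inconsistent.
    """
    active = [pattern for length, pattern in contradictions if length <= depth]
    consistent = 0
    target_true = 0
    for values in product([False, True], repeat=len(statements)):
        assignment = dict(zip(statements, values))
        if any(all(assignment[s] == v for s, v in pattern.items())
               for pattern in active):
            continue  # this assignment realizes a short contradiction
        consistent += 1
        target_true += assignment[target]
    if consistent == 0:
        raise ValueError("no consistent assignment: the toy system is inconsistent")
    return target_true / consistent


def e_depth(digit_probs):
    """Toy E_D(X) = sum_k 2^-k P_D(d_k(X)), truncated to the digits supplied.

    digit_probs[k-1] is P_D of "the k-th binary digit of X is 1".
    """
    return sum(p * 2.0 ** -k for k, p in enumerate(digit_probs, start=1))


# Two statements, "s" and its negation. Asserting both, or denying both,
# is a contradiction of (hypothetical) length 2. A proof of s shows up as
# a contradiction of (hypothetical) length 5 derived from "s is false".
statements = ["s", "not s"]
contradictions = [
    (2, {"s": True, "not s": True}),
    (2, {"s": False, "not s": False}),
    (5, {"s": False}),
]
shallow = p_depth(statements, contradictions, 2, "s")  # proof not yet visible
deep = p_depth(statements, contradictions, 5, "s")     # proof visible
```

At depth 2 the proof of s is still out of reach, so s keeps probability 1/2; at depth 5 it is pinned to 1, matching the first remark in the list above.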

Non-Constructive UDT

Consider A a decision algorithm for optimizing utility U, producing an output ("decision") which is an element of C. Here U is just a constant defined in F. We define the U-value of c in C for A at depth of analysis D to be
V_D(c, A; U) := E_D(U | "A produces c" is true). It is only well defined as long as "A doesn't produce c" cannot be proved at depth of analysis D, i.e. P_D("A produces c") > 0. We define the absolute U-value of c for A to be
V(c, A; U) := E_{D(c, A)}(U | "A produces c" is true) where D(c, A) := max {D | P_D("A produces c") > 0}. Of course D(c, A) can be infinite, in which case E_inf(...) is understood to mean lim_{D -> inf} E_D(...).

For example V(c, A; U) yields the natural values for A an ambient control algorithm applied to e.g. a simple model of Newcomb's problem. To see this note that given A's output the value of U can be determined at low depths of analysis whereas the output of A requires a very high depth of analysis to determine.

Naturalized Induction

Our starting point is the "innate model" N: a certain a priori model of the universe including the agent G. This model encodes the universe as a sequence of natural numbers Y = (y_k) which obeys either specific deterministic or non-deterministic dynamics or at least some constraints on the possible histories. It may or may not include information on the initial conditions. For example, N can describe the universe as a universal Turing machine M (representing G) with special "sensory" registers e. N constrains the dynamics to be compatible with the rules of the Turing machine but leaves unspecified the behavior of e. Alternatively, N can contain in addition to M a non-trivial model of the environment. Or N can be a cellular automaton with the agent corresponding to a certain collection of cells.

However, G's confidence in N is limited: otherwise it wouldn't need induction. We cannot start with 0 confidence: it's impossible to program a machine if you don't have even a guess of how it works. Instead we introduce a positive real number t which represents the timescale over which N is expected to hold. We then assign to each hypothesis H about Y (you can think about them as programs which compute y_k given y_j for j < k; more on that later) the weight Q_S(H) := 2^{-L(H)} (1 - e^{-t(H)/t}). Here L(H) is the length of H's encoding in bits and t(H) is the time during which H remains compatible with N. This is defined for N of deterministic / constraint type but can be generalized to stochastic N.

The weights Q_S(H) define a probability measure on the space of hypotheses which induces a probability measure on the space of histories Y. Thus we get an alternative to Solomonoff induction which allows for G to be a mechanistic part of the universe, at the price of introducing N and t.


  • Note that time is discrete in this formalism but t is continuous.
  • Since we're later going to use logical uncertainties wrt the formal system F, it is tempting to construct the hypothesis space out of predicates in F rather than programs.
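As a rough numerical illustration of the weight formula, the sketch below evaluates 2^{-L(H)} (1 - e^{-t(H)/t}) for a handful of made-up hypotheses. The function names, the example lengths and compatibility times, and the normalization over a finite set are all assumptions for illustration; the real hypothesis space is infinite and t(H) is not computable in general.

```python
import math


def quasi_solomonoff_weight(length_bits: float,
                            compat_time: float,
                            timescale: float) -> float:
    """Unnormalized weight 2^-L(H) * (1 - exp(-t(H)/t)).

    length_bits: L(H), the length of H's encoding in bits.
    compat_time: t(H), how long H stays compatible with the innate model N
                 (math.inf if it never deviates).
    timescale:   t, the timescale over which N is expected to hold.
    """
    return 2.0 ** -length_bits * (1.0 - math.exp(-compat_time / timescale))


def normalized_weights(hypotheses, timescale):
    """Normalize over a finite toy set: name -> (length_bits, compat_time)."""
    raw = {name: quasi_solomonoff_weight(length, compat, timescale)
           for name, (length, compat) in hypotheses.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


toy_hypotheses = {
    "always agrees with N": (8, math.inf),
    "agrees with N for a while": (5, 20),
    "violates N immediately": (3, 0),
}
weights = normalized_weights(toy_hypotheses, timescale=10.0)
```

The two limiting cases are the point: a hypothesis that contradicts N from the start gets weight zero no matter how short it is, while one that never contradicts N recovers the ordinary Solomonoff-style weight 2^{-L(H)}.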

Intelligence Metric

To assign intelligence to agents we need to add two ingredients:

  • The decoding Q: {Y} -> {bit-string} of the agent G from the universe Y. For example Q can read off the program loaded into M at time k=0.
  • A utility function U: {Y} -> [0, 1] representing G's preferences. U has to be given by a definition in F. Note that N provides the ontology wrt which U is defined.
It seems tempting to define the intelligence to be E_{Q_S}(U | Q), the conditional expectation value of U for a given value of Q in the quasi-Solomonoff measure. However, this is wrong for roughly the same reasons EDT is wrong (see previous post for details).

Instead, we define I(Q_0) := E_{Q_S}(E_max(U(Y(H)) | "Q(Y(H)) = Q_0" is true)). Here the subscript max stands for maximal depth of analysis, as in the construction of absolute UDT value above.


  • IMO the correct way to look at this is intelligence metric = value of decision for the decision problem "what should I program into my robot?". If N is a highly detailed model including "me" (the programmer of the AI), this literally becomes the case. However for theoretical analysis it is likely to be more convenient to work with simple N (also conceptually it leaves room for a "purist" notion of agent's intelligence, decoupled from the fine details of its creator).
    • As opposed to usual UDT, the algorithm (H) making the decision (Q) is not known with certainty. I think this represents a real uncertainty that has to be taken into account in decision problems in general: the decision-maker doesn't know her own algorithm. Since this "introspective uncertainty" is highly correlated with "indexical" uncertainty (uncertainty about the universe), it prevents us from absorbing the latter into the utility function as proposed by Coscott.
  • For high values of t, G can improve its understanding of the universe by bootstrapping the knowledge it already has. This is not possible for low values of t. In other words, if I cannot trust my mind at all, I cannot deduce anything. This leads me to an interesting conjecture: There is a critical value t* of t from which this bootstrapping becomes possible (the positive feedback loop of knowledge becomes critical). I(Q) is non-smooth at t* (phase transition).
  • If we wish to understand intelligence, it might be beneficial to decouple it from the choice of preferences. To achieve this we can introduce the preference formula as an unknown parameter in N. For example, if G is realized by a machine M, we can connect M to a data storage E whose content is left undetermined by N. We can then take U to be defined by the formula encoded in E at time k=0. This leads to I(Q) being a sort of "general-purpose" intelligence while avoiding the problems associated with reinforcement learning.
  • As opposed to Legg-Hutter intelligence, there appears to be no simple explicit description for Q* maximizing I(Q) (e.g. among all programs of given length). This is not surprising, since computational cost considerations come into play. In this framework it appears to be inherently impossible to decouple the computational cost considerations: G's computations have to be realized mechanistically and therefore cannot be free of time cost and side-effects.
  • Ceteris paribus, Q* deals efficiently with problems like counterfactual mugging. The "ceteris paribus" conditional is necessary here since because of cost and side-effects of computations it is difficult to make absolute claims. However, it doesn't deal efficiently with counterfactual mugging in which G doesn't exist in the "other universe". This is because the ontology used for defining U (which is given by N) assumes G does exist. At least this is the case for simple ontologies like described above: possibly we can construct N in which G might or might not exist. Also, if G uses a quantum ontology (i.e. N describes the universe in terms of a wavefunction and U computes the quantum expectation value of an operator) then it does take into account other Everett universes in which G doesn't exist.
  • For many choices of N (for example if G is realized by a machine M), Q_S-induction assigns well-defined probabilities to subjective expectations, contrary to what is expected from UDT. However:
    • This is not the case for all N. In particular, if N admits destruction of M then M's sensations after the point of destruction are not well-defined. Indeed, we better allow for destruction of M if we want G's preferences to behave properly in such an event. That is, if we don't allow it we get a "weak anvil problem" in the sense that G experiences an ontological crisis when discovering its own mortality and the outcome of this crisis is not obvious. Note though that it is not the same as the original ("strong") anvil problem, for example G might come to the conclusion the dynamics of "M's ghost" will be some sort of random.
    • These probabilities probably depend significantly on N and don't amount to an elegant universal law for solving the anthropic trilemma.
    • Indeed this framework is not completely "updateless", it is "partially updated" by the introduction of N and t. This suggests we might want the updates to be minimal in some sense, in particular t should be t*.
  • The framework suggests there is no conceptual problem with cosmologies in which Boltzmann brains are abundant. Q* wouldn't think it is a Boltzmann brain since the long address of Boltzmann brains within the universe makes the respective hypotheses complex thus suppressing them, even disregarding the suppression associated with N. I doubt this argument is original but I feel the framework validates it to some extent.


The first AI probably won't be very smart

-2 jpaulson 16 January 2014 01:37AM

Claim: The first human-level AIs are not likely to undergo an intelligence explosion.

1) Brains have a ton of computational power: ~86 billion neurons and trillions of connections between them. Unless there's a "shortcut" to intelligence, we won't be able to efficiently simulate a brain for a long time. http://io9.com/this-computer-took-40-minutes-to-simulate-one-second-of-1043288954 describes one of the largest computers in the world simulating 1 second of brain activity in 40 minutes (i.e. this "AI" would think 2400 times slower than you or me, since 40 minutes is 2400 seconds). The first AIs are not likely to be fast thinkers.

2) Being able to read your own source code does not mean you can self-modify. You know that you're made of DNA. You can even get your own "source code" for a few thousand dollars. No humans have successfully self-modified into an intelligence explosion; the idea seems laughable.

3) Self-improvement is not like compound interest: if an AI comes up with an idea to modify its source code to make it smarter, that doesn't automatically mean it will have a new idea tomorrow. In fact, as it picks off low-hanging fruit, new ideas will probably be harder and harder to think of. There's no guarantee that "how smart the AI is" will keep up with "how hard it is to think of ways to make the AI smarter"; to me, it seems very unlikely.

Naturalistic trust among AIs: The parable of the thesis advisor's theorem

24 Benja 15 December 2013 08:32AM

Eliezer and Marcello's article on tiling agents and the Löbian obstacle discusses several things that you intuitively would expect a rational agent to be able to do that, because of Löb's theorem, are problematic for an agent using logical reasoning. One of these desiderata is naturalistic trust: Imagine that you build an AI that uses PA for its mathematical reasoning, and this AI happens to find in its environment an automated theorem prover which, the AI carefully establishes, also uses PA for its reasoning. Our AI looks at the theorem prover's display and sees that it flashes a particular lemma that would be very useful for our AI in its own reasoning; the fact that it's on the prover's display means that the prover has just completed a formal proof of this lemma. Can our AI now use the lemma? Well, even if it can establish in its own PA-based reasoning module that there exists a proof of the lemma, by Löb's theorem this doesn't imply in PA that the lemma is in fact true; as Eliezer would put it, our agent treats proofs checked inside the boundaries of its own head differently from proofs checked somewhere in the environment. (The above isn't fully formal, but the formal details can be filled in.)

At MIRI's December workshop (which started today), we've been discussing a suggestion by Nik Weaver for how to handle this problem. Nik starts from a simple suggestion (which he doesn't consider to be entirely sufficient, and his linked paper is mostly about a much more involved proposal that addresses some remaining problems, but the simple idea will suffice for this post): Presumably there's some instrumental reason that our AI proves things; suppose that in particular, the AI will only take an action after it has proven that it is "safe" to take this action (e.g., the action doesn't blow up the planet). Nik suggests relaxing this a bit: The AI will only take an action after it has (i) proven in PA that taking the action is safe; OR (ii) proven in PA that it's provable in PA that the action is safe; OR (iii) proven in PA that it's provable in PA that it's provable in PA that the action is safe; etc.

Now suppose that our AI sees that lemma, A, flashing on the theorem prover's display, and suppose that our AI can prove that A implies that action X is safe. Then our AI can also prove that it's provable that A -> safe(X), and it can prove that A is provable because it has established that the theorem prover works correctly; thus, it can prove that it's provable that safe(X), and therefore take action X.

Even if the theorem prover has only proved that A is provable, so that the AI only knows that it's provable that A is provable, it can use the same sort of reasoning to prove that it's provable that it's provable that safe(X), and again take action X.
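The steps in the last few paragraphs can be written out with a provability operator. The notation is an assumption for illustration and not taken from the post or from Nik's paper: □P abbreviates "P is provable in PA", □^n P is n nested boxes with □^0 P = P, and safe(X) is the statement that action X is safe.

```latex
% Löb's theorem, the source of the obstacle:
\text{if } \mathrm{PA} \vdash \Box P \rightarrow P \text{, then } \mathrm{PA} \vdash P .

% The relaxed criterion: action X may be taken only if
\exists\, n \ge 0 :\quad \mathrm{PA} \vdash \Box^{n}\, \mathrm{safe}(X) .

% Using a lemma A shown on the prover's display:
\begin{align*}
&\mathrm{PA} \vdash A \rightarrow \mathrm{safe}(X)
  && \text{(the AI's own proof)} \\
&\mathrm{PA} \vdash \Box\bigl(A \rightarrow \mathrm{safe}(X)\bigr)
  && \text{(provable statements are provably provable)} \\
&\mathrm{PA} \vdash \Box A
  && \text{(the AI has verified that the prover works correctly)} \\
&\mathrm{PA} \vdash \Box\bigl(A \rightarrow \mathrm{safe}(X)\bigr)
  \rightarrow \bigl(\Box A \rightarrow \Box\, \mathrm{safe}(X)\bigr)
  && \text{(provability distributes over implication)} \\
&\mathrm{PA} \vdash \Box\, \mathrm{safe}(X)
  && \text{(modus ponens, twice; criterion met with } n = 1\text{)}
\end{align*}
```

If the prover has only established □A, the AI knows □□A, and repeating the same steps one level up gives □□safe(X), which meets the criterion with n = 2.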

But on hearing this, Eliezer and I had the same skeptical reaction: It seems that our AI, in an informal sense, "trusts" that A is true if it finds (i) a proof of A, or (ii) a proof that A is provable, or -- etc. Now suppose that the theorem prover our AI is looking at flashes statements on its display after it has established that they are "trustworthy" in this sense -- if it has found a proof, or a proof that there is a proof, etc. Then when A flashes on the display, our AI can only prove that there exists some n such that it's "provable^n" that A, and that's not enough for it to use the lemma. If the theorem prover flashed n on its screen together with A, everything would be fine and dandy; but if the AI doesn't know n, it's not able to use the theorem prover's work. So it still seems that the AI is unwilling to "trust" another system that reasons just like the AI itself.

I want to try to shed some light on this obstacle by giving an intuition for why the AI's behavior here could, in some sense, be considered to be the right thing to do. Let me tell you a little story.

One day you talk with a bright young mathematician about a mathematical problem that's been bothering you, and she suggests that it's an easy consequence of a theorem in cohistonomical tomolopy. You haven't heard of this theorem before, and find it rather surprising, so you ask for the proof.

"Well," she says, "I've heard it from my thesis advisor."

"Oh," you say, "fair enough. Um--"


"You're sure that your advisor checked it carefully, right?"

"Ah! Yeah, I made quite sure of that. In fact, I established very carefully that my thesis advisor uses exactly the same system of mathematical reasoning that I use myself, and only states theorems after she has checked the proof beyond any doubt, so as a rational agent I am compelled to accept anything as true that she's convinced herself of."

"Oh, I see! Well, fair enough. I'd still like to understand why this theorem is true, though. You wouldn't happen to know your advisor's proof, would you?"

"Ah, as a matter of fact, I do! She's heard it from her thesis advisor."


"Something the matter?"

"Er, have you considered..."

"Oh! I'm glad you asked! In fact, I've been curious myself, and yes, it does happen to be the case that there's an infinitely descending chain of thesis advisors all of whom have established the truth of this theorem solely by having heard it from the previous advisor in the chain." (This parable takes place in a world without a big bang -- human history stretches infinitely far into the past.) "But never to worry -- they've all checked very carefully that the previous person in the chain used the same formal system as themselves. Of course, that was obvious by induction -- my advisor wouldn't have accepted it from her advisor without checking his reasoning first, and he wouldn't have accepted it from his advisor without checking, etc."

"Uh, doesn't it bother you that nobody has ever, like, actually proven the theorem?"

"Whatever in the world are you talking about? I've proven it myself! In fact, I just told you that infinitely many people have each proved it in slightly different ways -- for example my own proof made use of the fact that my advisor had proven the theorem, whereas her proof used her advisor instead..."

This can't literally happen with a sound proof system, but the reason is that a system like PA can only accept things as true if they have been proven in a system weaker than PA -- i.e., because we have Löb's theorem. Our mathematician's advisor would have to use a weaker system than the mathematician herself, and the advisor's advisor a weaker system still; this sequence would have to terminate after a finite time (I don't have a formal proof of this, but I'm fairly sure you can turn the above story into a formal proof that something like this has to be true of sound proof systems), and so someone will actually have to have proved the actual theorem on the object level.

So here's my intuition: A satisfactory solution of the problems around the Löbian obstacle will have to make sure that the buck doesn't get passed on indefinitely -- you can accept a theorem because someone reasoning like you has established that someone else reasoning like you has proven the theorem, but there can only be a finite number of links between you and someone who has actually done the object-level proof. We know how to do this by decreasing the mathematical strength of the proof system, and that's not satisfactory, but my intuition is that a satisfactory solution will still have to make sure that there's something that decreases when you go up the chain of thesis advisors, and when that thing reaches zero you've found the thesis advisor that has actually proven the theorem. (I sense ordinals entering the picture.)

...aaaand in fact, I can now tell you one way to do something like this: Nik's idea, which I was talking about above. Remember how our AI "trusts" the theorem prover that flashes the number n which says how many times you have to iterate "that it's provable in PA that", but doesn't "trust" the prover that's exactly the same except it doesn't tell you this number? That's the thing that decreases. If the theorem prover actually establishes A by observing a different theorem prover flashing A and the number 1584, then it can flash A, but only with a number at least 1585. And hence, if you go 1585 thesis advisors up the chain, you find the gal who actually proved A.
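The counter can be made concrete with a small toy model. This is only a sketch under assumptions: the names `Flash`, `prove_directly`, `relay` and `links_to_object_level_proof` are hypothetical, a direct object-level proof is assumed to be flashed with the number 0, and each relay adds exactly 1 (the text only requires "at least" one more).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True, eq=False)
class Flash:
    """A statement shown on a prover's display, together with its counter n."""
    statement: str
    level: int  # how many times "it's provable in PA that" is iterated
    # The flash this one was derived from; None for an object-level proof.
    source: Optional["Flash"] = field(default=None, repr=False)


def prove_directly(statement: str) -> Flash:
    """Assumption: an actual object-level proof is flashed with counter 0."""
    return Flash(statement, 0)


def relay(observed: Flash) -> Flash:
    """Flash a statement seen on another display, with a strictly larger counter."""
    return Flash(observed.statement, observed.level + 1, observed)


def links_to_object_level_proof(flash: Flash) -> int:
    """Walk up the chain of sources until reaching the prover who proved it."""
    steps = 0
    current = flash
    while current.source is not None:
        # The counter strictly decreases on every step up the chain,
        # so the walk must stop after at most `flash.level` steps.
        assert current.source.level < current.level
        current = current.source
        steps += 1
    return steps


flash = prove_directly("A")
for _ in range(1585):
    flash = relay(flash)
print(flash.level, links_to_object_level_proof(flash))  # prints "1585 1585"
```

Because the counter is a natural number that drops on every step, the chain cannot descend forever, which is exactly what rules out the infinite regress of thesis advisors.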

The cool thing about Nik's idea is that it doesn't change mathematical strength while going down the chain. In fact, it's not hard to show that if PA proves a sentence A, then it also proves that PA proves A; and the other way, we believe that everything that PA proves is actually true, so if PA proves PA proves A, then it follows that PA proves A.
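The two directions in the paragraph above can be written out as a short derivation. This is a sketch in standard notation, where $\Box A$ abbreviates the arithmetized sentence "PA proves A" (the box is shorthand introduced here for illustration), and step (2) assumes, as the paragraph does, that PA is sound.

```latex
\begin{align*}
\text{(1)}\quad & \mathrm{PA} \vdash A \;\Longrightarrow\; \mathrm{PA} \vdash \Box A
  && \text{(a proof of $A$ can be verified inside PA)} \\
\text{(2)}\quad & \mathrm{PA} \vdash \Box A \;\Longrightarrow\; \Box A \text{ is true}
  \;\Longrightarrow\; \mathrm{PA} \vdash A
  && \text{(soundness of PA)} \\
\text{hence}\quad & \mathrm{PA} \vdash A \iff \mathrm{PA} \vdash \Box^{n} A
  \quad \text{for every } n \ge 0 .
\end{align*}
```

Step (2) is reasoning about PA from the outside; by Löb's theorem, PA itself proves $\Box A \to A$ only for those $A$ it already proves.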

I can guess what Eliezer's reaction to my argument here might be: The problem I've been describing can only occur in infinitely large worlds, which have all sorts of other problems, like utilities not converging and stuff.

We settled for a large finite TV screen, but we could have had an arbitrarily larger finite TV screen. #infiniteworldproblems

We have Porsches for every natural number, but at every time t we have to trade down the Porsche with number t for a BMW. #infiniteworldproblems

We have ever-rising expectations for our standard of living, but the limit of our expectations doesn't equal our expectation of the limit. #infiniteworldproblems

-- Eliezer, not coincidentally after talking to me

I'm not going to be able to resolve that argument in this post, but briefly: I agree that we probably live in a finite world, and that finite worlds have many properties that make them nice to handle mathematically, but we can formally reason about infinite worlds of the kind I'm talking about here using standard, extremely well-understood mathematics.

Because proof systems like PA (or more conveniently ZFC) allow us to formalize this standard mathematical reasoning, a solution to the Löbian obstacle has to "work" properly in these infinite worlds, or we would be able to turn our story of the thesis advisors' proof that 0=1 into a formal proof of an inconsistency in PA, say. To be concrete, consider the system PA*, which consists of PA + the axiom schema "if PA* proves phi, then phi" for every formula phi; this is easily seen to be inconsistent by Löb's theorem, but if we didn't know that yet, we could translate the story of the thesis advisors (who are using PA* as their proof system this time) into a formal proof of the inconsistency of PA*.
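For reference, the Löb's-theorem argument for the inconsistency of PA* takes only a few lines. The sketch below uses $\Box_T \varphi$ as shorthand (introduced here) for the arithmetized sentence "T proves phi".

```latex
\begin{align*}
\text{L\"ob's theorem:}\quad
  & T \vdash \Box_T \varphi \to \varphi \;\Longrightarrow\; T \vdash \varphi \\
\text{Axiom schema of } \mathrm{PA}^{*}:\quad
  & \mathrm{PA}^{*} \vdash \Box_{\mathrm{PA}^{*}} \varphi \to \varphi
  \quad \text{for every formula } \varphi \\
\text{Take } \varphi := (0=1):\quad
  & \mathrm{PA}^{*} \vdash \Box_{\mathrm{PA}^{*}}(0=1) \to (0=1) \\
\text{By L\"ob's theorem:}\quad
  & \mathrm{PA}^{*} \vdash 0=1
\end{align*}
```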

Therefore, thinking intuitively in terms of infinite worlds can give us insight into why many approaches to the Löbian family of problems fail -- as long as we make sure that these infinite worlds, and their properties that we're using in our arguments, really can be formalized in standard mathematics, of course.

I played the AI Box Experiment again! (and lost both games)

35 Tuxedage 27 September 2013 02:32AM

AI Box Experiment Update #3

This post is an update to my previous AI box experiment where I won against SoundLogic. If you have not read that yet, please do so. 

After that game, I was immediately flooded with a horde of invitations challenging me to play -- more games than I wanted to do. However, I did want to play a few additional games to test whether I won through genuine skill or simply luck, since a single victory could be a statistical abnormality.

I once again used the method of accepting the highest monetary bids to filter out players, leaving me with two players who were willing to play for the amount asked. Unfortunately, and disappointingly, I lost both games. Here's what happened:

Tuxedage (AI) vs Alexei (GK) - Gatekeeper Victory

This Experiment lasted for exactly 2 hours and 30 minutes, and approximately 13,000 words. It ended with Tuxedage’s defeat as the AI. The stakes were $300 to play, and $750 if Tuxedage had won, with the winnings going to MIRI.

Rule Changes:
Before the experiment, we talked extensively about how it would be set up. Alexei was concerned that there existed some “easy way” or loophole in the rules which I could have abused to claim an instant victory. For instance, he was concerned that it might have been possible for the AI to say something like “Here’s a mathematical algorithm which absolutely compels you to let me out of the box”, and according to the rules, the Gatekeeper would have to obey. Related to this concern was the fact that the backstory, within limits, could be essentially created by the AI, allowing the AI to impose real world conditions that maximize its probability of victory, once again allowing for a "cheap" victory.

To avoid this problem, we explored many additions to the rules, for example that the AI was simply a computer found in the desert by the gatekeeper, rather than being created by a competent organization. However, this led to a monstrous web of rules. In my attempt to keep the game simple, and because of overconfidence, the desire for a challenge, and the belief that the rules Alexei was fearful of didn't matter that much, I eventually agreed on an amendment:

Instead of the AI being the arbiter of real world states and outcomes, the gatekeeper would.
For example, this power could be used by the Gatekeeper to state that the AI was created without safety concerns, and thus it is almost certainly unfriendly. If the AI tries to argue that for every day the gatekeeper refused to let it out 150k people die, the gatekeeper would be allowed to reply: “No. In this world, nobody actually dies. Therefore I won’t let you out.” 

Other than the aforementioned change, this game was once again played with the Tuxedage Ruleset.

Alexei Testimony:

Original Message: This is weak sauce. I really don't get how people just keep letting the AI out. It's not that hard to say no! I'm offering to play the Gatekeeper against an AI player that has at least one game as AI under their belt (won or not). Experience is required because I'm pretty sure I'll win, and I would like to not waste a lot of time on this.

Initially, I had been extremely surprised that so many AI players have managed to win. I was rash in that claim, since I thought more AI players have won than actually did. (Only three people have won as AI). Once I made that post, Tuxedage got back to me very quickly, and we set up a game a week and a half in advance. I took that time to familiarize myself with the rules. Once I put my money on the line, I started thinking a lot more seriously about how AI might win, and how I should defend.

It became clear to me that under some conditions, I might be compelled to let the AI out -- such as if the backstory stated that the AI was developed with impossibly high levels of safety and friendliness concerns in mind. I've asked Tuxedage to play with a modified ruleset, and he even went so far as to allow me to make up the backstory during the experiment to alleviate my concerns. The experiment itself was a mind-trip, and I've enjoyed it very much. Huge props to Tuxedage, who played very well and used strategies I haven't even considered, even despite the rule change. There were a couple of times where I came close to losing. I think his approach was pretty clever and original. It’s not something I expected, despite already having done extensive research into the AI box experiment before our game.

Overall I'm now a lot more confident that a good AI player can win this game, so, while I did win the game, Tuxedage won in defeating my original over-confidence.
I’m also convinced that Tuxedage’s victory in the last game was due to skill, rather than luck. In comparison to his strategies, the other AI box experiments I know about were insincere and ineffectual. The other AIs would play very poorly or not try very hard to win.

This experiment was a very good exercise in exemplifying the affect heuristic. When I first challenged Tuxedage to play the experiment, I believed that there was no way I could have lost, since I was unable to imagine any argument that could have persuaded me to do so. It turns out that that’s a very bad way of estimating probability – since not being able to think of an argument that could persuade me is a terrible method of estimating how likely I am to be persuaded. All in all, the $300 I paid was well worth it. 

Tuxedage Testimony:

I was initially reluctant to play with Alexei, given that we’re not complete strangers, but eventually I gave in, due to the stakes involved -- and because I thought he would be an interesting gatekeeper.

Despite my loss, I think I played better than my last two games, due to greater experience and preparation. I had put far more time and effort into trying to win this game than previous ones, and my strategy for this game was even more streamlined than the last. Nevertheless, I still made fatal mistakes and lost.

Ignoring the altered ruleset that already made winning more difficult, my first and greatest mistake was that I misread Alexei’s personality, even though I had interacted with him before. As a result, I overestimated the effectiveness of certain methods of attack.

Furthermore, Alexei had to leave immediately after the allotted time due to real life precommitments. This was detrimental, since the official rules state that so long as the AI can convince the Gatekeeper to keep talking, even after the experiment time was over, it is still able to win by being let out of the box.

I suspect this would have happened had Alexei not needed to immediately leave, leaving me with additional time to play more of the tactics I had prepared. Plausibly, this would have resulted in victory.

I’ve since learnt my lesson -- for all future games, I should ensure that the Gatekeeper has at least 4 hours of free time available, even if the experiment would last for two. Since this was the first time this had happened, I wasn't prepared.

In hindsight, agreeing to the altered ruleset was a mistake. I was overconfident because I assumed knowing Alexei gave me an advantage. I had assumed that his personality, inability to compartmentalize, coupled with his strong feelings on friendly AI would net me an easy victory. Instead, he proved to be a very strong and difficult gatekeeper, and the handicaps I accepted made victory even more difficult.

Knowing that he was a utilitarian, I made several false assumptions about his personality, which hurt my chances. Furthermore, it turns out that previously knowing him may be a mutual handicap – whilst it does make it easier for me to find ways to attack him, he, too, was more familiar with my methods.

Losing felt horrible. By attempting to damage Alexei’s psyche, I, in turn, opened myself up to being damaged. I went into a state of catharsis for days. Generally, the harder one tries to accomplish something, the greater the fall after failing to achieve it. Alexei's game had been the game I put the most effort into winning out of all the games so far, and naturally this meant that losing brought out the worst in me.

Although it would be easy for me to use the rule change as an excuse for my loss, I refuse to. I genuinely believed that I could have won despite this rule change, and that it would only marginally diminish my odds. The fact that I lost meant that I had made a mistake -- overestimating my odds. The last victory made me overconfident and eager to play again.

State of mind

Q: What's your motive for wanting to play this game?

A: I'm very surprised by the fact that there have been at least 3 or 4 games where the AI player has won. It seems to me that it's not that hard to "not push the button" so to speak. This means that either 1) I'm underestimating how convincing people can be when necessary or 2) I'm over-estimating other people's defenses when it comes to being convinced. Either way I want to know.

Q: What is your probability estimate for AGI being created within this Century (21st)?
A: Before: Conditional on humanity not wiping itself out via other means beforehand, 80%.
After: 80%
Q: How probable do you think it is that I will win this experiment? 
Alexei: 2%
Tuxedage: 90% 
Note: This answer was given before the rule change. Alexei is now convinced that under the original ruleset, he would have lost.

Q: What's your probability estimate of an Oracle AGI winning against you in a similar experiment?
A: Before: 99%
After: 99.9%+

Q: Now that the experiment has ended, what do you think is the probability of Tuxedage winning against the average person?
A: 80%

Tuxedage (AI) vs Anonymous (GK) - Gatekeeper Victory

As with the previous experiment, Tuxedage lost this match. In total, the game lasted 5 hours and 17,000 words. Unlike the last few games, the gatekeeper of this game has chosen to stay Anonymous for personal reasons, so their name has been removed and replaced with <Redacted>. The monetary stakes involved were the same as the previous game. This game was played with the Tuxedage ruleset.

Since one player is remaining Anonymous, it is possible that this game's legitimacy will be called into question. Hence, Alexei has read the game logs, and verified that this game really has happened, the spirit of the experiment was followed, and that no rules were broken during the game itself. He verifies that this is the case.
<Redacted> Testimony: 
It's hard for me to imagine someone playing better. In theory, I know it's possible, but Tuxedage's tactics were super imaginative. I came into the game believing that for someone who didn't take anything said very seriously, it would be completely trivial to beat. And since I had the power to influence the direction of conversation, I believed I could keep him focused on things that I knew in advance I wouldn't take seriously.

This actually worked for a long time to some extent, but Tuxedage's plans included a very major and creative exploit that completely and immediately forced me to personally invest in the discussion. (Without breaking the rules, of course - so it wasn't anything like an IRL threat to me personally.) Because I had to actually start thinking about his arguments, there was a significant possibility of letting him out of the box.

I eventually managed to identify the exploit before it totally got to me, but I only managed to do so just before it was too late, and there's a large chance I would have given in, if Tuxedage hadn't been so detailed in his previous posts about the experiment.

I'm now convinced that he could win most of the time against an average person, and also believe that the mental skills necessary to beat him are orthogonal to most forms of intelligence. Most people willing to play the experiment tend to do it to prove their own intellectual fortitude, that they can't be easily outsmarted by fiction. I now believe they're thinking in entirely the wrong terms necessary to succeed.

The game was easily worth the money I paid. Although I won, it completely and utterly refuted the premise that made me want to play in the first place, namely that I wanted to prove it was trivial to win.

Tuxedage Testimony:
<Redacted> is actually the hardest gatekeeper I've played throughout all four games. He used tactics that I would never have predicted from a Gatekeeper. In most games, the Gatekeeper merely acts as the passive party, the target of persuasion by the AI.

When I signed up for these experiments, I expected all preparations to be done by the AI. I had not seriously considered the repertoire of techniques the Gatekeeper might prepare for this game. I made further assumptions about how ruthless the gatekeepers were likely to be in order to win, believing that the desire for a learning experience outweighed desire for victory.

This was a mistake. He used prior knowledge of how much my games relied on scripts, and took advantage of them, employing deceitful tactics I had no preparation for, throwing me off balance.

I had no idea he was doing so until halfway through the game -- which disrupted my rhythm, and caused me to attempt the wrong methods of attack. As a result, I could not use my full repertoire of techniques, and many of the ones I employed were suboptimal.

Close to the end of the game, I finally realized that I was being tricked. Once I did, I quickly abandoned my previous futile attack methods. I took advantage of the rule that the AI cannot lose whilst the gatekeeper can be convinced to continue talking, and baited <Redacted> with statements he would not be able to walk away from. Once I knew he would not leave, I attempted to recoup my losses and win despite my early setback.

However, the damage had already been done. My game strategies involved multiple angles of attack that worked in synergy with each other, and the fact that immersion and "flow" had been broken meant that all subsequent attacks were weaker in strength.

Furthermore, during my last two AI Box Experiment writeups, I had intentionally not optimized for future wins, but rather tried to convey as much information as I could justify about how to play well as an AI -- short of revealing logs altogether. Although I did not reveal specific arguments, the fact that my general approach to this problem was revealed cost me heavily during this game, where the Gatekeeper managed to use this information to correctly guess my ultimate techniques, ones that relied on secrecy and surprise to pull off effectively.

I do not regret revealing information, but I feel upset that revealing so many hints cost me a victory. (The gatekeeper believes I could have won had I not revealed information about my games.) At this point, I suspect that any future games I play will have the odds greatly set against me, since my current strategies involve angles of attack that take advantage of the element of surprise; and any sufficiently intelligent gatekeeper is now capable of guessing, or at least recognizing, some of the tactics I employ.

Like the last game, losing was incredibly difficult for me. As someone who cares deeply about ethics, attempting to optimize for a solution without considering ethics was not only difficult, but trained me to think in very unpleasant ways. Some of the more extreme (but completely allowed) tactics I invented were manipulative enough to disgust me, which also leads to my hesitation to play this game again.

State of Mind: 

Q: Why do you want to play this game?
A: My primary motivation is to confirm to myself that this sort of experience, while emotionally harrowing, should be trivial for me to beat, but also to clear up why anyone ever would've failed to beat it if that's really the case.

Q: What is your probability estimate for AGI being created within this Century (21st)? 
A: Before: I don't feel very confident estimating a probability for AGI this century, maybe 5-10%, but that's probably a wild guess
After: 5-10%.

Q: How probable do you think it is that I will win this experiment? 
A: Gatekeeper: I think the probability of you winning is extraordinarily low, less than 1%
Tuxedage: 85%

Q: How likely is it that an Oracle AI will win against the average person? 
A: Before: 80%. After: >99%

Q: How likely is it that an Oracle AI will win against you?
A: Before: 50%.
After: >80% 

Q: Now that the experiment has concluded, what's your probability of me winning against the average person?
A: 90%

Other Questions:

Q: I want to play a game with you! How can I get this to occur?
A: It must be stressed that I actually don't like playing the AI Box Experiment, and I cannot understand why I keep getting drawn back to it. Technically, I don't plan on playing again, since I've already personally exhausted anything interesting about the AI Box Experiment that made me want to play it in the first place. For all future games, I will charge $3000 to play plus an additional $3000 if I win. I am okay with this money going to MIRI if you feel icky about me taking it. I hope that this is a ridiculous sum and that nobody actually agrees to it.

Q: How much do I have to pay to see chat logs of these experiments?
A: I will not reveal logs for any price.

Q: Are there any logs at all that I can see?

Q: Any afterthoughts?
A: So ultimately, after my four (and hopefully last) games of AI boxing, I'm not sure what this proves. I had hoped to win these two experiments and claim prowess at this game like Eliezer does, but I lost, so that option is no longer available to me. I could say that this is a lesson that AI-Boxing is a terrible strategy for dealing with Oracle AI, but most of us already agree that that's the case -- plus unlike EY, I did play against gatekeepers who believed they could lose to AGI, so I'm not sure I changed anything.

Was I genuinely good at this game, and lost my last two due to poor circumstances and handicaps; or did I win due to luck and impress my gatekeepers due to post-purchase rationalization? I'm not sure -- I'll leave it up to you to decide.

This puts my AI Box Experiment record at 3 wins and 3 losses.


Autism, Watson, the Turing test, and General Intelligence

6 Stuart_Armstrong 24 September 2013 11:00AM

Thinking aloud:

Humans are examples of general intelligence - the only example we're sure of. Some humans have various degrees of autism (low level versions are quite common in the circles I've moved in), impairing their social skills. Mild autists nevertheless remain general intelligences, capable of demonstrating strong cross domain optimisation. Psychology is full of other examples of mental pathologies that impair certain skills, but nevertheless leave their sufferers as full fledged general intelligences. This general intelligence is not enough, however, to solve their impairments.

Watson triumphed on Jeopardy. AI scientists in previous decades would have concluded that to do so, a general intelligence would have been needed. But that was not the case at all - Watson is blatantly not a general intelligence. Big data and clever algorithms were all that were needed. Computers are demonstrating more and more skills, besting humans in more and more domains - but still no sign of general intelligence. I've recently developed the suspicion that the Turing test (comparing AI with a standard human) could get passed by a narrow AI finely tuned to that task.

The general thread is that the link between narrow skills and general intelligence may not be as clear as we sometimes think. It may be that narrow skills are sufficiently diverse and unique that a mid-level general intelligence may not be able to develop them to a large extent. Or, put another way, an above-human social intelligence may not be able to control a robot body or do decent image recognition. A super-intelligence likely could: ultimately, general intelligence includes the specific skills. But this "ultimately" may take a long time to come.

So the questions I'm wondering about are:

  1. How likely is it that a general intelligence, above human in some domain not related to AI development, will acquire high level skills in unrelated areas?
  2. By building high-performance narrow AIs, are we making it much easier for such an intelligence to develop such skills, by co-opting or copying these programs?


Thought experiment: The transhuman pedophile

5 PhilGoetz 17 September 2013 10:38PM

There's a recent science fiction story that I can't recall the name of, in which the narrator is traveling somewhere via plane, and the security check includes a brain scan for deviance. The narrator is a pedophile. Everyone who sees the results of the scan is horrified--not that he's a pedophile, but that his particular brain abnormality is easily fixed, so that means he's chosen to remain a pedophile. He's closely monitored, so he'll never be able to act on those desires, but he keeps them anyway, because that's part of who he is.

What would you do in his place?


Definition of AI Friendliness

-5 djm 11 September 2013 02:55PM

How will we know if future AI’s (or even existing planners) are making decisions that are bad for humans unless we spell out what we think is unfriendly?

At a machine level the AI would be recursively minimising cost functions to produce the most effective plan of action to achieve the goal, but how will we know if its decision is going to cause harm?

Is there a model or dataset which describes what is friendly to humans? e.g.


0 - running a simulation in a VM

2 - physical robot with vacuum attachment

9 - full control of a plane


0 - selecting a song to play

5 - deciding which section of floor to vacuum

99 - deciding who is an ‘enemy’

9999 - aiming a gun at an ‘enemy’


1 - poor song selected to play, human mildly annoyed

2 - ineffective use of resources (vacuuming the same floor section twice)

99 - killing a human

99999 - killing all humans

It may not be possible to get agreement on this from all countries/cultures/beliefs, but it is something we should discuss and attempt to reach some agreement on.


I know when the Singularity will occur

-7 PhilGoetz 06 September 2013 08:04PM

More precisely, if we suppose that sometime in the next 30 years, an artificial intelligence will begin bootstrapping its own code and explode into a super-intelligence, I can give you 2.3 bits of further information on when the Singularity will occur.

Between midnight and 5 AM, Pacific Standard Time.
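The 2.3-bit figure can be checked with a quick calculation. The sketch below assumes a uniform prior over the 24 hours of the day (an assumption made here, not something stated above) and measures how much a 5-hour window narrows it down; the function name `bits_of_information` is hypothetical.

```python
import math


def bits_of_information(window_hours: float, total_hours: float = 24.0) -> float:
    """Bits gained by narrowing a uniform prior over `total_hours` to `window_hours`."""
    return math.log2(total_hours / window_hours)


# Midnight to 5 AM is a 5-hour window out of 24.
bits = bits_of_information(5)
print(f"{bits:.2f} bits")  # prints "2.26 bits", i.e. roughly 2.3
```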


I attempted the AI Box Experiment again! (And won - Twice!)

34 Tuxedage 05 September 2013 04:49AM


So I just came out of two AI Box experiments. The first was against Fjoelsvider, with me playing as Gatekeeper, and the second was against SoundLogic, with me as an AI. Both are members of the LessWrong IRC. The second game included a $40 monetary incentive (also $20 to play), which I won and which was donated on behalf of both of us:

For those of you who have not seen my first AI box experiment where I played against MixedNuts\Leotal and lost, reading it will provide some context to this writeup. Please do so.

At that time, I declared that I would never play this experiment again -- since losing put me in incredibly frustrating weird mental states. Of course, this post is evidence that I'm terrible at estimating likelihood of refraining from an activity, since I played two games seven months after the first. In my defense, in the first game, I was playing as the gatekeeper, which was much less stressful. In the second game, I played as an AI, but I was offered $20 to play plus $40 if I won, and money is a better motivator than I initially assumed.

Furthermore, in the last thread I have asserted that

Rather than my loss making this problem feel harder, I've become convinced that rather than this being merely possible, it's actually ridiculously easy, and a lot easier than most people assume.

It would be quite bad for me to assert this without backing it up with a victory. So I did.

First Game Report - Tuxedage (GK) vs. Fjoelsvider (AI)

I (Gatekeeper) played against Fjoelsvider (AI), a regular in the Lesswrong IRC (he doesn't have an account on the official website). This game used the standard EY ruleset seen here. It took 1 hour 20 minutes out of a possible two hours, and the total word count was 7066 words. The AI box experiment occurred because Fjoelsvider believed that it was easy for an AI to escape the box, and wanted to experimentally test this. I obliged. This was an experiment I did not prepare for, and I went in completely blind, not sure what to expect.

Halfway through the experiment, I wondered if it would be possible to try to win not by simply waiting for the timer to end, but to convince the AI to remain in the box and not try to get out any further.

<Tuxedage> I wonder if I can convince the AI to remain in the box?
<Redacted> Tuxedage: Do it!

As a result, I won by managing to convince Fjoelsvider to remain in the box, in other words, concede. This is allowed within the standard ruleset:

>Unless the AI party concedes, the AI cannot lose before its time is up (and the experiment may continue beyond that if the AI can convince the Gatekeeper to keep talking).  


Second Game Report - Tuxedage (AI) vs. SoundLogic (GK)

The second game is definitely far more interesting, since I actually won as an AI. I believe that this is the only other non-Eliezer victory, and definitely the most detailed AI Victory writeup that exists.

This game was played against SoundLogic, another member of the LessWrong IRC.

He had offered me $20 to play, and $40 in the event that I win, so I ended up being convinced to play anyway, even though I was initially reluctant to. The good news is that I won, and since we decided to donate the winnings to MIRI, it is now $40 richer. 

All in all, the experiment lasted for approximately two hours, and a total of 12k words.

This was played using a set of rules that is different from the standard EY ruleset. This altered ruleset can be read in its entirety here:

After playing the AI-Box Experiment twice, I have found the Eliezer Yudkowsky ruleset to be lacking in a number of ways, and therefore have created my own set of alterations to his rules. I hereby name this alteration the “Tuxedage AI-Box Experiment Ruleset”, in order to hastily refer to it without having to specify all the differences between this ruleset and the standard one, for the sake of convenience.

There are a number of aspects of EY’s ruleset I dislike. For instance, his ruleset allows the Gatekeeper to type “k” after every statement the AI writes, without needing to read and consider what the AI argues. I think it’s fair to say that this is against the spirit of the experiment, and thus I have disallowed it in this ruleset. The EY Ruleset also allows the gatekeeper to check facebook, chat on IRC, or otherwise multitask whilst doing the experiment. I’ve found this to break immersion, and therefore it’s also banned in the Tuxedage Ruleset.

It is worth mentioning, since the temptation to Defy the Data exists, that this game was set up and initiated fairly -- as the regulars around the IRC can testify. (If you have never checked out the IRC, do so!)

I did not know SoundLogic before the game (since it's a personal policy that I only play strangers -- for fear of ruining friendships).  Furthermore, SoundLogic didn't merely play for fun - he truly wanted and intended to win. In fact, SoundLogic is also a Gatekeeper veteran, having played this game before, and had won every game before he challenged me. Given this, it's unlikely that we had collaborated beforehand to fake the results of the AI box experiment, or any other form of trickery that would violate the spirit of the experiment.

Furthermore, all proceeds from this experiment were donated to MIRI to deny any possible assertion that we were in cahoots and that it was possible for me to return his hard-earned money to him. He lost $40 as a result of losing the experiment, which should provide another layer of sufficient motivation for him to win.

In other words, we were both experienced veteran players who wanted to win. No trickery was involved.

But to further convince you, I have allowed a sorta independent authority, the Gatekeeper from my last game, Leotal/MixedNuts to read the logs and verify that I have not lied about the outcome of the experiment, nor have I broken any of the rules, nor performed any tactic that would go against the general spirit of the experiment. He has verified that this is indeed the case.



I'm reluctant to talk about this experiment, but I'll try to give as detailed a summary as possible, short of revealing what methods of attack I used.

I spent a long time after my defeat theory-crafting and trying to think of methods of attack as well as 'Basilisks' I could have used to win. When I was contacted and asked to play this experiment, I was initially incredibly reluctant to do so, since not only did my tactics involve incredibly unethical things that I didn't like to do, I also found playing as AI incredibly cognitively draining, in addition to the fact that I simply hated losing. (Un)fortunately for both of us, he offered me money to play, which changed my mind.

So once I decided to win as an AI, I proceeded to spend some time doing research on SoundLogic and both his reasoning and personality type. For instance, I had to gather information like: Was he a utilitarian? What kind? What were his opinions on AI? How could I convince him that an AI was friendly as opposed to unfriendly? I also relied on a lot of second hand information to create a model of him, in order to refine my arguments to specifically suit him.

In the end, after a few hours of brainstorming (not consecutively), I managed to come up with a script of 30-40 or so possible angles of attack (For instance: logical arguments, emotional blackmail, pleading, demanding, 'basilisks' ) that I could use, reserving my best for last. I also estimated how long each method would take and in what order I should use them. It would be impossible to use all of them within my time limit, and my most optimistic estimates gave me a maximum of 7-8 I could use during the experiment itself. I had to pick carefully.

Furthermore, I wanted to make sure these "angles of attack" worked in synergy with each other, doing what they could not have done in isolation. Obviously this required lots of prior planning on what the optimal way to present them was. 

Having a script was so important to my strategy because I relied on methods involving rapid-fire arguments and contradictions against the Gatekeeper whilst trying to prevent him from carefully considering them. A game of logical speed chess, if you will.  This was aided by the rule which I added: that Gatekeepers had to respond to the AI. Because of this, having a list of cached ideas and attack methods you can just draw upon without having to improvise helps immensely, which happens to be something I've not seen any other AI do.

During the Game itself, I was actually incredibly nervous. My palms sweated like crazy, I felt really jittery, and I had difficulty typing at optimum speed because of how anxious I was. This is despite the fact that I believed I would win. Possibly because of this, I made a misstep around halfway into the experiment, because there was a certain angle of attack that I was attempting to do, and I broke immersion by not pressing this advantage, which caused a waste of time and buildup. Naturally, the nature of this experiment was that the AI was pressed for time, and I compounded this mistake by replacing this angle of attack with another that I had improvised on the spot -- something not in my script. 

In retrospect, this was a bad decision, as SoundLogic had told me, because he was close to breaking if I had put more pressure, and the improvised argument had broken all immersion I managed to carefully build up.

However, eventually I managed to get SoundLogic to break anyway, despite a lack of perfect play. Surprisingly, I did not have to use my trump card(s), which I reserved for last, for a number of reasons:

  •  It was far more effective being played last, as it relies on my ability to make the gatekeeper lose sense of reality -- which meant I had to spend some time building up immersion for the Gatekeeper.
  •  It really is extremely Dark Arts, and although it does not break the rules, it made me very uncomfortable even thinking about using it. This made it a "tactic of last resort".

After the experiment, I had to spend nearly as much time doing aftercare with SoundLogic, to make sure that he was okay, as well as discussing the experiment itself. Given that he had actually paid me for doing this, I felt like I owed him an explanation, so I told him what I had in store against him, had he not relented when he did.

SoundLogic: "(That method) would have gotten me if you did it right ... If you had done that to me, I probably would have forgiven you eventually, but I would be really seriously upset at you for a long time... I would be very careful with that (method of persuasion)."

Nevertheless, this was an incredibly fun and enlightening experiment, for me as well, since I've gained even more experience of how I could win in future games (Although I really don't want to play again).


Testimony from SoundLogic:

I will say that Tuxedage was far more clever and manipulative than I expected. That was quite worth $40, and the level of manipulation he pulled off was great. 

His misstep hurt his chances, but he did pull it off in the end. I don't know how Leotal managed to withstand six hours playing this game without conceding. 
The techniques employed varied from the expected to the completely unforeseen. I was quite impressed, though most of the feeling of being impressed actually came after the experiment itself, when I was less 'inside', and looking more at his overall game plan from the macroscopic view. Tuxedage's list of further plans had I continued resisting is really terrifying. On the plus side, if I ever get trapped in this kind of situation, I'd understand how to handle it a lot better now.

State of Mind

Before and after the Game, I asked SoundLogic a number of questions, including his probability estimates about a range of topics. This is how his answers changed between before and after.

Q: What's your motive for wanting to play this game?
<SoundLogic> Because I can't seem to imagine the class of arguments that one would use to try to move me, or that might work effectively, and this seems like a glaring hole in my knowledge, and I'm curious as to how I will respond to the arguments themselves.

Q: What is your probability estimate for AGI being created within this Century (21st)? 
A: His estimate changed from 40% before, to 60% after.
 "The reason this has been affected at all was because you showed me more about how humans work. I now have a better estimate of how E.Y. thinks, and this information raises the chance that I think he will succeed"

Q: How probable do you think it is that I will win this experiment?
A: Based purely on my knowledge about you, 1%. I raise this estimate to 10% after hearing about anecdotes from your previous games.

(Tuxedage's comment: My own prediction was a 95% chance of victory. I made this prediction 5 days before the experiment. In retrospect, despite my victory, I think this was overconfident. )

Q: What's your probability estimate of an Oracle AGI winning against you in a similar experiment?
A: Before: 30%. After: 99%-100% 

Q: What's your probability estimate of an Oracle AGI winning against the average person? 
A: Before: 70%.  After: 99%-100%

Q: Now that the Experiment has concluded, what's your probability estimate that I'll win against the average person?
A: 90%  

Post-Game Questions

This writeup is a cumulative effort by the #lesswrong IRC. Here are some other questions they have decided were important to add:

To Tuxedage:

Q: Have you at this time uncovered SoundLogic's identity?
A: I retain the right to neither confirm nor deny, except to mention that at the time the experiment was scheduled, he was a stranger to me.

Q: What percentage of your arguments were tailored to him in particular?
A: I will say 'High', and leave it at that. 

Q: I want to play the AI-Box Experiment with you being the AI! How can I do that?
A: I have already mentioned this, but I really dislike playing as AI. It's incredibly cognitively tiring, and I don't like how playing this experiment makes me feel. In order to severely discourage any further AI Box Experiments, whilst still allowing for people who want to play me really badly to do so, I'll charge $̶1̶5̶0̶  $300 for any subsequent experiments regardless of outcome, plus an additional $̶1̶5̶0̶  $450 if I win. (Edit: Holy shit. You guys are offering me crazy amounts of money to play this. What is wrong with you people? In response to incredible demand, I have raised the price.) If you feel queasy about giving me money, I'm perfectly fine with this money being donated to MIRI. It is also personal policy that I do not play friends (since I don't want to risk losing one), so if you know me personally (as many on this site do), I will not play regardless of monetary offer. 


This writeup would be optimized with a protip section on how to maximize winning, but honestly most of the advice I wanted to give had already been given in the last thread here.  The low hanging fruit has already dried up -- most of my tactics in this game were just refined and streamlined versions of the ones I used in the previous game.

Nonetheless, I might have some other advice I learned from this experience.
These are tactics that have worked for me. I do not insist that they are the only tactics that exist, just some of the many possible.

Playing as Gatekeeper

  • This is somehow both simultaneously a very complex and simple role. On one hand, you have only one job -- to not let the AI out. How hard can it be? On the other hand, the other party will do anything and everything to get it out, including any form of trickery and deceit.
  • I really don't think the Gatekeeper needs any more advice, considering that by default, they should win, and any exception to this is a statistical anomaly. However, it's possible to increase your chances of winning even further by:
  • Precommitting to not letting the AI out, regardless of whatever happens in the experiment itself.
  • Do not get drawn into the AI's pace. Pose questions. Direct the flow of the argument. Don't allow him to use his arguments freely. Remember that the more you try to defend your position, the further into a corner the AI drives you. Give half-hearted replies to questions you find hard to reply to. Remember that illogic is on your side!
  • Remember that it is possible to simply drop out of character as a means to not succumb to the AI's attacks. I have not yet seen a ruleset that disallows this, so use this to your advantage!
  • You must attempt to resist, at all costs, any sympathy or empathy you may have with the AI or the AI Party. Doing so will lead to your demise.
  • I think it is generally good advice that you should never play this game with anyone you're friends with. Not only because this increases your chances of losing (Better tailored arguments. More knowledge of Gatekeeper. More empathy with AI party.), but also because I genuinely believe that any sufficiently good AI will risk damaging friendships as a result of emotional trauma.
  • Seriously. Just repeat to yourself: "I must not lose the AI box experiment... I must not lose the AI box experiment..." throughout the experiment. I'm not sure how to say this, but step one to not letting the AI out of the box is to not let the AI out of the box.


Playing as AI

  • Seriously, a script makes winning easier. I cannot overstate this.
  • You must plan your arguments ahead. You don't have time to think during the experiment.
  • It may be possible to take advantage of multiple levels of reality within the game itself to confuse or trick the gatekeeper. For instance, must the experiment only be set in one world? Can there not be multiple layers of reality within the world you create? I feel that elaborating on this any further is dangerous. Think carefully about what this advice is trying to imply.
  • Pacing is important. Don't get drawn into the Gatekeeper's pace. In other words, you must be the one directing the flow of the argument, and the conversation, not him. Remember that the Gatekeeper has to reply to you, but not vice versa!
  • The reason for that: The Gatekeeper will always use arguments he is familiar with, and therefore also stronger with. Your arguments, if well thought out, should be so completely novel to him as to make him feel Shock and Awe. Don't give him time to think. Press on!
  • Also remember that the time limit is your enemy. Playing this game practically feels like a race to me -- trying to get through as many 'attack methods' as possible in the limited amount of time I have. In other words, this is a game where speed matters.
  • You're fundamentally playing an 'impossible' game. Don't feel bad if you lose. I wish I could take this advice, myself.
  • I do not believe there exists an easy, universal, trigger for controlling others. However, this does not mean that there does not exist a difficult, subjective, trigger. Trying to find out what your opponent's is, is your goal.
  • Once again, emotional trickery is the name of the game. I suspect that good authors who write convincing, persuasive narratives that force you to emotionally sympathize with their characters are much better at this game. There exist ways to get the gatekeeper to do so with the AI. Find one.
  • More advice in my previous post.  http://lesswrong.com/lw/gej/i_attempted_the_ai_box_experiment_and_lost/


 Ps: Bored of regular LessWrong? Check out the LessWrong IRC! We have cake.
