## Do Virtual Humans deserve human rights?

I think the idea of storing our minds in a machine so that we can keep on "living" (and I use that term loosely) is fascinating, and certainly an oft-discussed topic around here. However, in thinking about keeping our brains on a hard drive, we have to think about rights and how that all works together. Indeed, the technology may be here before we know it, so I think it's important to think about mindclones. If I create a little version of myself that can answer my emails for me, can I delete him when I'm done with him, or just trade him in for a new model like I do iPhones?

I look forward to the discussion.

## Omission vs commission and conservation of expected moral evidence

Consequentialism traditionally doesn't distinguish between acts of commission and acts of omission. Not flipping the lever to the left is equivalent to flipping it to the right.

But there seems to be one clear case where the distinction is important. Consider a moral learning agent. It must act in accordance with human morality and desires, about which it is currently uncertain.

For example, it may consider whether to forcibly wirehead everyone. If it does so, then everyone will agree, for the rest of their existence, that the wireheading was the right thing to do. Therefore across the whole future span of human preferences, humans agree that wireheading was correct, apart from a very brief period of objection in the immediate future. Given that human preferences are known to be inconsistent, this seems to imply that forcible wireheading is the right thing to do (if you happen to personally approve of forcible wireheading, replace that example with some other forcible rewriting of human preferences).

What went wrong there? Well, this doesn't respect "conservation of moral evidence": the AI got the moral values it wanted, but only through the actions it took. This is very close to the omission/commission distinction. We'd want the AI not to take actions (commission) that determine the (expectation of the) moral evidence it gets. Instead, we'd want the moral evidence to accrue "naturally", without interference and manipulation from the AI (omission).
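A minimal sketch of this failure mode, with made-up numbers: a value learner that scores plans by the approval it expects to observe afterwards will prefer the plan that manufactures its own approving evidence.

```python
# Toy model of the failure mode (all numbers are illustrative): the
# agent scores each plan by the human approval it expects to observe.
def expected_future_approval(plan):
    if plan == "forcibly_wirehead":
        # one brief period of objection, then unanimous approval forever,
        # because the wireheading itself rewrote the preferences
        return [0.1] + [1.0] * 99
    # otherwise preferences evolve "naturally": mixed, uncertain feedback
    return [0.6] * 100

def naive_score(plan):
    return sum(expected_future_approval(plan))

# the naive evidence-counting agent prefers to manufacture its evidence
best = max(["forcibly_wirehead", "do_nothing"], key=naive_score)
```

A conservation-of-expected-moral-evidence constraint would forbid exactly this: the agent's own action (commission) must not shift the distribution of moral evidence it expects to receive.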

## Goal retention discussion with Eliezer

Although I feel that Nick Bostrom’s new book “Superintelligence” is generally awesome and a well-needed milestone for the field, I do have one quibble: both he and Steve Omohundro appear to be more convinced than I am by the assumption that an AI will naturally tend to retain its goals as it reaches a deeper understanding of the world and of itself. I’ve written a short essay on this issue from my physics perspective, available at http://arxiv.org/pdf/1409.0813.pdf.

*On Sep 3, 2014, at 17:21, Eliezer Yudkowsky <yudkowsky@gmail.com> wrote:*

Hi Max! You're asking the right questions. Some of the answers we can give you, some we can't, few have been written up and even fewer in any well-organized way. Benja or Nate might be able to expound in more detail while I'm in my seclusion.

Very briefly, though:

The problem of utility functions turning out to be ill-defined in light of new discoveries of the universe is what Peter de Blanc named an "ontological crisis" (not necessarily a particularly good name, but it's what we've been using locally).

http://intelligence.org/files/OntologicalCrises.pdf

The way I would phrase this problem now is that an expected utility maximizer makes comparisons between quantities that have the type "expected utility conditional on an action", which means that the AI's utility function must be something that can assign utility-numbers to the AI's model of reality, and these numbers must have the further property that there is some computationally feasible approximation for calculating expected utilities relative to the AI's probabilistic beliefs. This is a constraint that rules out the vast majority of all completely chaotic and uninteresting utility functions, but does not rule out, say, "make lots of paperclips".

Models also have the property of being Bayes-updated using sensory information; for the sake of discussion let's also say that models are about universes that can generate sensory information, so that these models can be probabilistically falsified or confirmed. Then an "ontological crisis" occurs when the hypothesis that best fits sensory information corresponds to a model that the utility function doesn't run on, or doesn't detect any utility-having objects in. The example of "immortal souls" is a reasonable one. Suppose we had an AI that had a naturalistic version of a Solomonoff prior, a language for specifying universes that could have produced its sensory data. Suppose we tried to give it a utility function that would look through any given model, detect things corresponding to immortal souls, and value those things. Even if the immortal-soul-detecting utility function works perfectly (it would in fact detect all immortal souls) this utility function will not detect anything in many (representations of) universes, and in particular it will not detect anything in the (representations of) universes we think have most of the probability mass for explaining our own world. In this case the AI's behavior is undefined until you tell me more things about the AI; an obvious possibility is that the AI would choose most of its actions based on low-probability scenarios in which hidden immortal souls existed that its actions could affect. (Note that even in this case the utility function is stable!)

Since we don't know the final laws of physics and could easily be surprised by further discoveries in the laws of physics, it seems pretty clear that we shouldn't be specifying a utility function over exact physical states relative to the Standard Model, because if the Standard Model is even slightly wrong we get an ontological crisis. Of course there are all sorts of extremely good reasons we should not try to do this anyway, some of which are touched on in your draft; there just is no simple function of physics that gives us something good to maximize. See also Complexity of Value, Fragility of Value, indirect normativity, the whole reason for a drive behind CEV, and so on. We're almost certainly going to be using some sort of utility-learning algorithm, the learned utilities are going to bind to modeled final physics by way of modeled higher levels of representation which are known to be imperfect, and we're going to have to figure out how to preserve the model and learned utilities through shifts of representation. E.g., the AI discovers that humans are made of atoms rather than being ontologically fundamental humans, and furthermore the AI's multi-level representations of reality evolve to use a different sort of approximation for "humans", but that's okay because our utility-learning mechanism also says how to re-bind the learned information through an ontological shift.

This sorta thing ain't going to be easy which is the other big reason to start working on it well in advance. I point out however that this doesn't seem unthinkable in human terms. We discovered that brains are made of neurons but were nonetheless able to maintain an intuitive grasp on what it means for them to be happy, and we don't throw away all that info each time a new physical discovery is made. The kind of cognition we want does not seem inherently self-contradictory.

Three other quick remarks:

*) The Omohundrian/Yudkowskian argument is not that we can take an arbitrary stupid young AI and it will be smart enough to self-modify in a way that preserves its values, but rather that most AIs that don't self-destruct will eventually end up at a stable fixed-point of coherent consequentialist values. This could easily involve a step where, e.g., an AI that started out with a neural-style delta-rule policy-reinforcement learning algorithm, or an AI that started out as a big soup of self-modifying heuristics, is "taken over" by whatever part of the AI first learns to do consequentialist reasoning about code. But this process doesn't repeat indefinitely; it stabilizes when there's a consequentialist self-modifier with a coherent utility function that can precisely predict the results of self-modifications. The part where this does happen to an initial AI that is under this threshold of stability is a big part of the problem of Friendly AI and it's why MIRI works on tiling agents and so on!

*) Natural selection is not a consequentialist, nor is it the sort of consequentialist that can sufficiently precisely predict the results of modifications that the basic argument should go through for its stability. It built humans to be consequentialists that would value sex, not value inclusive genetic fitness, and not value being faithful to natural selection's optimization criterion. Well, that's dumb, and of course the result is that humans don't optimize for inclusive genetic fitness. Natural selection was just stupid like that. But that doesn't mean there's a generic process whereby an agent rejects its "purpose" in the light of exogenously appearing preference criteria. Natural selection's anthropomorphized "purpose" in making human brains is just not the same as the cognitive purposes represented in those brains. We're not talking about spontaneous rejection of internal cognitive purposes based on their causal origins failing to meet some exogenously-materializing criterion of validity. Our rejection of "maximize inclusive genetic fitness" is not an exogenous rejection of something that was explicitly represented in us, that we were explicitly being consequentialists for. It's a rejection of something that was never an explicitly represented terminal value in the first place. Similarly the stability argument for sufficiently advanced self-modifiers doesn't go through a step where the successor form of the AI reasons about the intentions of the previous step and respects them apart from its constructed utility function. So the lack of any universal preference of this sort is not a general obstacle to stable self-improvement.

*) The case of natural selection does not illustrate a universal computational constraint, it illustrates something that we could anthropomorphize as a foolish design error. Consider humans building Deep Blue. We built Deep Blue to attach a sort of default value to queens and central control in its position evaluation function, but Deep Blue is still perfectly able to sacrifice queens and central control alike if the position reaches a checkmate thereby. In other words, although an agent needs crystallized instrumental goals, it is also perfectly reasonable to have an agent which never knowingly sacrifices the terminally defined utilities for the crystallized instrumental goals if the two conflict; indeed "instrumental value of X" is simply "probabilistic belief that X leads to terminal utility achievement", which is sensibly revised in the presence of any overriding information about the terminal utility. To put it another way, in a rational agent, the only way a loose generalization about instrumental expected-value can conflict with and trump terminal actual-value is if the agent doesn't know it, i.e., it does something that it reasonably expected to lead to terminal value, but it was wrong.

This has been very off-the-cuff and I think I should hand this over to Nate or Benja if further replies are needed, if that's all right.
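The "ontological crisis" scenario from the email can be made concrete with a toy sketch (the model structure and all numbers are hypothetical): a utility function written against a "souls" ontology finds nothing in the best-fit world-model, so all of the agent's expected utility comes from a leftover low-probability hypothesis.

```python
# Hypothetical sketch: a utility function written against a "souls"
# ontology, evaluated over the agent's competing world-models.
def soul_utility(model):
    # values only objects tagged as immortal souls
    return sum(1.0 for obj in model["objects"] if obj.get("soul"))

hypotheses = [
    # (posterior probability, world-model) - numbers are illustrative
    (0.99, {"objects": [{"kind": "atoms"}]}),                # best fit
    (0.01, {"objects": [{"kind": "ghost", "soul": True}]}),  # leftover
]

def expected_utility(hyps, utility):
    return sum(p * utility(model) for p, model in hyps)

# The utility function detects nothing in the best-fit model, so all
# expected utility - hence all of the agent's behavior - is driven by
# the 1% hypothesis in which hidden souls exist.
```

This is exactly the "behavior dominated by low-probability scenarios" outcome the email describes; a utility-learning mechanism that re-binds values across ontological shifts is meant to prevent it.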


## Superintelligence reading group

In just over two weeks I will be running an online reading group on Nick Bostrom's *Superintelligence*, on behalf of MIRI. It will be here on LessWrong. This is an advance warning, so you can get a copy and get ready for some stimulating discussion. MIRI's post, appended below, gives the details.

Nick Bostrom’s eagerly awaited *Superintelligence* comes out in the US this week. To help you get the most out of it, MIRI is running an online reading group where you can join with others to ask questions, discuss ideas, and probe the arguments more deeply.

The reading group will “meet” on a weekly post on the LessWrong discussion forum. For each ‘meeting’, we will read about half a chapter of *Superintelligence*, then come together virtually to discuss. I’ll summarize the chapter, and offer a few relevant notes, thoughts, and ideas for further investigation. (My notes will also be used as the source material for the final reading guide for the book.)

Discussion will take place in the comments. I’ll offer some questions, and invite you to bring your own, as well as thoughts, criticisms and suggestions for interesting related material. Your contributions to the reading group might also (with permission) be used in our final reading guide for the book.

We welcome both newcomers and veterans on the topic. Content will aim to be intelligible to a wide audience, and topics will range from novice to expert level. All levels of time commitment are welcome.

We will follow **this preliminary reading guide**, produced by MIRI, reading one section per week.

If you have already read the book, don’t worry! To the extent you remember what it says, your superior expertise will only be a bonus. To the extent you don’t remember what it says, now is a good time for a review! If you don’t have time to read the book, but still want to participate, you are also welcome to join in. I will provide summaries, and many things will have page numbers, in case you want to skip to the relevant parts.

If this sounds good to you, first grab a copy of *Superintelligence*. You may also want to **sign up here** to be emailed when the discussion begins each week. The first virtual meeting (forum post) will go live at 6pm Pacific on **Monday, September 15th**. Following meetings will start at 6pm every Monday, so if you’d like to coordinate for quick fire discussion with others, put that into your calendar. If you prefer flexibility, come by any time! And remember that if there are any people you would especially enjoy discussing *Superintelligence* with, link them to this post!

Topics for the first week will include impressive displays of artificial intelligence, why computers play board games so well, and what a reasonable person should infer from the agricultural and industrial revolutions.

## The Great Filter is early, or AI is hard

Attempt at the briefest content-full Less Wrong post:

Once AI is developed, it could "easily" colonise the universe. So the Great Filter (preventing the emergence of star-spanning civilizations) must strike before AI could be developed. If AI is easy, we could conceivably have built it already, or we could be on the cusp of building it. So the Great Filter must predate us, unless AI is hard.
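The argument can be rendered as a toy Bayesian update (a sketch only; every prior and likelihood below is an illustrative assumption, not an estimate):

```python
# Hypotheses: the Great Filter lies before our stage ("early") or ahead
# of us ("late"), crossed with AI being easy or hard to build.
priors = {
    ("early", "easy"): 0.25, ("early", "hard"): 0.25,
    ("late", "easy"): 0.25, ("late", "hard"): 0.25,
}
# Observation: no star-spanning civilizations anywhere. A late filter
# combined with easy AI makes that observation very unlikely - some
# earlier civilization should have built AI and colonised by now.
likelihood = {
    ("early", "easy"): 1.0, ("early", "hard"): 1.0,
    ("late", "easy"): 0.02, ("late", "hard"): 1.0,
}
unnormalised = {h: priors[h] * likelihood[h] for h in priors}
total = sum(unnormalised.values())
posterior = {h: v / total for h, v in unnormalised.items()}
# posterior mass concentrates on "the Filter is early, or AI is hard"
```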

## The immediate real-world uses of Friendly AI research

Much of the glamor and attention paid to Friendly AI is focused on the misty-future event of a super-intelligent general AI, and how we can prevent it from repurposing our atoms to better run Quake 2. Until very recently, that was the full breadth of the field in my mind. I recently realized that dumber, narrow AI is a real thing today, helpfully choosing advertisements for me and running my 401K. As such, making automated programs safe to let loose on the real world is not just a problem to solve as a favor for the people of tomorrow, but something with immediate real-world advantages that has indeed already been going on for quite some time. Veterans in the field surely already understand this, so this post is directed at people like me, with only a passing understanding of the point of Friendly AI research, and outlines an argument that the field may be useful right now, even if you believe that an evil AI overlord is not on the list of things to worry about in the next 40 years.

Let's look at the stock market. High-frequency trading is the practice of using computer programs to make fast trades constantly throughout the day, and it accounts for more than half of all equity trades in the US. So the economy today is already in the hands of a bunch of very narrow AIs buying and selling to each other. And as you may or may not already know, this has already caused problems. In the 2010 "Flash Crash", the Dow Jones suddenly and mysteriously plummeted, only to mostly recover within a few minutes. The reasons for this were of course complicated, but it boiled down to a couple of red flags triggering in numerous programs, setting off a cascade of wacky trades.
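That kind of cascade can be caricatured in a few lines - a toy model with made-up thresholds, not a description of real trading systems: one small shock trips the most nervous bot, whose forced sale trips the next.

```python
# Toy stop-loss cascade (thresholds and price impact are invented):
# each bot dumps its position once the price falls below its threshold,
# and each forced sale knocks the price down further.
def cascade(price, thresholds, impact=5.0):
    triggered = set()
    changed = True
    while changed:
        changed = False
        for i, threshold in enumerate(thresholds):
            if i not in triggered and price < threshold:
                triggered.add(i)
                price -= impact  # this bot's fire sale moves the market
                changed = True
    return price, len(triggered)

# a 2-point shock (100 -> 98) ends up tripping every bot in turn
price_after, n_triggered = cascade(price=98.0,
                                   thresholds=[99, 95, 92, 88, 80])
```

The point of the toy: no individual rule is crazy, but the rules interact through the price they all watch, so a small perturbation is amplified far beyond its cause.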

The long-term damage was not catastrophic to society at large (though I'm sure a couple of fortunes were made and lost that day), but it illustrates the need for safety measures as we hand over more and more responsibility and power to processes that require little human input. It may be a long while before anyone makes true general AI, but adaptive city traffic-light systems are entirely plausible in the coming years.

To me, Friendly AI isn't solely about making a human-like intelligence that doesn't hurt us – we need techniques for testing automated programs, predicting how they will act when let loose on the world, and how they'll act when faced with unpredictable situations. Indeed, when framed like that, it looks less like a field for “the singularitarian cultists at LW”, and more like a narrow-but-important specialty in which quite a bit of money might be made.

After all, I want my self-driving car.

*(To the actual researchers in FAI – I'm sorry if I'm stretching the field's definition to include more than it does or should. If so, please correct me.)*

## Another type of intelligence explosion

I've argued that we might have to worry about dangerous non-general intelligences. In a series of back-and-forth exchanges with Wei Dai, we agreed that some level of general intelligence (such as that humans seem to possess) seemed to be a great advantage, though possibly one with diminishing returns. Therefore a dangerous AI could be one with great narrow intelligence in one area and a little bit of general intelligence in others.

The traditional view of an intelligence explosion is that of an AI that knows how to do X, suddenly getting (much) better at doing X, to a level beyond human capacity. Call this the *gain of aptitude* intelligence explosion. We can prepare for that, maybe, by tracking the AI's ability level and seeing if it shoots up.

But the example above hints at another kind of potentially dangerous intelligence explosion. That of a very intelligent but narrow AI that suddenly gains intelligence across other domains. Call this the *gain of function* intelligence explosion. If we're not looking specifically for it, it may not trigger any warnings - the AI might still be dumber than the average human in other domains. But this might be enough, when combined with its narrow superintelligence, to make it deadly. We can't ignore the toaster that starts babbling.
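One way to operationalise that warning, as a hedged sketch (the thresholds and domain scores are hypothetical): track per-domain capability and flag two distinct events - a known-domain score shooting up (gain of aptitude) and a nonzero score appearing in a new domain (gain of function), even while that new score is still sub-human.

```python
# Hedged sketch of a capability monitor. "spike" flags a known domain
# whose score jumps by that factor; "spread" flags a new domain whose
# score rises above a low floor, even if it is still far below human.
def classify(prev, curr, spike=2.0, spread=0.3):
    alerts = []
    for domain, score in curr.items():
        old = prev.get(domain, 0.0)
        if old > 0 and score / old >= spike:
            alerts.append(("gain_of_aptitude", domain))
        if old < spread <= score:
            alerts.append(("gain_of_function", domain))
    return alerts

# a narrow chemistry superintelligence develops a little planning skill:
# no aptitude spike anywhere, but the spread into a new domain is flagged
alerts = classify({"chemistry": 9.0, "planning": 0.0},
                  {"chemistry": 9.1, "planning": 0.5})
```

The design point is that a monitor watching only for spikes would stay silent here; the babbling toaster trips the second check, not the first.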

## An example of deadly non-general AI

In a previous post, I mused that we might be focusing too much on general intelligences, and that the route to powerful and dangerous intelligences might go through much more specialised intelligences instead. Since it's easier to reason with an example, here is a potentially deadly narrow AI (partially due to Toby Ord). Feel free to comment and improve on it, or suggest your own example.

It's the standard "pathological goal AI", but only a narrow intelligence. Imagine a medicine-designing super-AI with the goal of reducing human mortality in 50 years - i.e., massively reducing the human population in the next 49 years. It's a narrow intelligence, so it has access only to a huge amount of human biological and epidemiological research. It must get its drugs past FDA approval; this requirement is encoded as certain physical reactions (no death, some health improvements) in people taking the drugs over the course of a few years.

Then it seems trivial for it to design a drug that would have no negative impact for the first few years, and then cause sterility or death. Since it wants to spread this to as many humans as possible, it would probably design something that interacted with common human pathogens - colds, flus - in order to spread the impact, rather than affecting only those who took the drug.

Now, this narrow intelligence is less threatening than if it had general intelligence - where it could also plan for possible human countermeasures and such - but it seems sufficiently dangerous on its own that we can't afford to worry only about general intelligences. Some of the "AI superpowers" that Nick mentions in his book (intelligence amplification, strategizing, social manipulation, hacking, technology research, economic productivity) could be enough to cause devastation on their own, even if the AI never developed other abilities.

We still could be destroyed by a machine that we outmatch in almost every area.

## The metaphor/myth of general intelligence

*Thanks to Kaj for making me think along these lines.*

It's agreed on this list that general intelligences - those that are capable of displaying high cognitive performance across a whole range of domains - are those that we need to be worrying about. This is rational: the most worrying AIs are those with truly general intelligences, and so those should be the focus of our worries and work.

But I'm wondering if we're overestimating the probability of general intelligences, and whether we shouldn't adjust against this.

First of all, the concept of general intelligence is a simple one - perhaps too simple. It's an intelligence that is generally "good" at everything, so we can collapse its various abilities across many domains into "it's intelligent", and leave it at that. It's significant to note that since the very beginning of the field, AI people have been thinking in terms of general intelligences.

And their expectations have been constantly frustrated. We've made great progress in narrow areas, and very little on general intelligence. Chess was conquered without "understanding"; *Jeopardy!* was defeated without general intelligence; cars can navigate our cluttered roads while being able to do little else. If we started with a prior in 1956 about the feasibility of general intelligence, then we should be adjusting that prior downwards.

But what do I mean by "feasibility of general intelligence"? There are several things this could mean, not least the ease with which such an intelligence could be constructed. But I'd prefer to look at another assumption: the idea that a general intelligence will really be formidable in multiple domains, and that one of the best ways of accomplishing a goal in a particular domain is to construct a general intelligence and let it specialise.

First of all, humans are very far from being general intelligences. We can solve a lot of problems when they are presented in particular, easy-to-understand formats that allow good human-style learning. But if we picked a random complicated Turing machine from the space of such machines, we'd probably be pretty hopeless at predicting its behaviour. We would probably score very low on the scale of intelligence used to construct AIXI. The general intelligence factor, "g", is a misnomer - it designates the fact that the various human intelligences are correlated, not that humans are generally intelligent across all domains.

Humans with computers, and humans in societies and organisations, are certainly closer to general intelligences than individual humans. But institutions have their own blind spots and weaknesses, as does the human-computer combination. Now, there are various reasons advanced for why this is the case - game theory and incentives for institutions, human-computer interfaces and misunderstandings for the second example. But what if these reasons, and others we can come up with, were mere symptoms of a more universal problem: that generalising intelligence is actually very hard?

There are no-free-lunch theorems showing that no computable intelligence can perform well in all environments. As far as they go, these theorems are uninteresting, as we don't need intelligences that perform well in all environments, just in almost all/most. But what if a more general restrictive theorem were true? What if it were very hard to produce an intelligence with high performance across many domains? What if the performance of a generalist were pitifully inadequate compared with that of a specialist? What if every computable version of AIXI were actually doomed to poor performance?
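The classical no-free-lunch result can be checked directly in miniature. This sketch (my own illustration, not from the post) averages search performance over every possible 0/1-valued environment on four states and confirms that every fixed search order ties:

```python
from itertools import product, permutations

def steps_to_max(order, values):
    """Number of probes a fixed search order needs to first hit the
    maximum value of a 4-state environment."""
    best = max(values)
    for t, state in enumerate(order, 1):
        if values[state] == best:
            return t

def avg_steps(order):
    """Average search time over *all* 0/1-valued environments on 4 states."""
    envs = list(product(range(2), repeat=4))
    return sum(steps_to_max(order, v) for v in envs) / len(envs)

# Every possible search order achieves exactly the same average:
assert len({avg_steps(o) for o in permutations(range(4))}) == 1
```

Averaging over *all* environments washes out every searcher's advantage; the interesting question in the text is whether something similar, if weaker, holds over merely *many* domains.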

There are a few strong counters to this - for instance, you could construct good generalists by networking together specialists (this is my standard mental image/argument for AI risk), you could construct an entity that was very good at programming specific sub-programs, or you could approximate AIXI. But we are making some assumptions here - namely, that we can network together very different intelligences (human-computer interfaces hint at some of the problems), and that a general programming ability can even exist in the first place (for a start, it might require a general understanding of problems that is akin to general intelligence itself). And we haven't had great success building effective AIXI approximations so far (which should reduce, possibly slightly, our belief that effective general intelligences are possible).

Now, I remain convinced that general intelligence is possible, and that it's worthy of the most worry. But I think it's worth inspecting the concept more closely, and at least be open to the possibility that general intelligence might be a lot harder than we imagine.

**EDIT: Model/example of what a lack of general intelligence could look like.**

Imagine there are three types of intelligence - social, spatial and scientific, all on a 0-100 scale. For any combination of the three intelligences - e.g. (0,42,98) - there is an effort level E (how hard is that intelligence to build, in terms of time, resources, man-hours, etc...) and a power level P (how powerful is that intelligence compared to others, on a single convenient scale of comparison).

Wei Dai's evolutionary comment implies that any being of very low intelligence on one of the scales would be overpowered by a being of more general intelligence. So let's set power as simply the product of all three intelligences.

This seems to imply that general intelligences are more powerful, as it basically bakes in diminishing returns - but we haven't included effort yet. Imagine that the following three intelligences require equal effort: (10,10,10), (20,20,5), (100,5,5). Then the specialised intelligence is definitely the one you need to build.

But is it plausible that those could be of equal difficulty? It could be, if we assume that high social intelligence isn't so difficult, but is specialised: i.e. you can increase the spatial intelligence of a social intelligence, but that messes up the delicate balance in its social brain. Or maybe recursive self-improvement happens more easily in narrow domains. Further assume that intelligences of different types cannot be easily networked together (e.g. combining (100,5,5) and (5,100,5) in the same brain gives an overall performance of (21,21,5)). This doesn't seem impossible.
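The numbers in the example can be checked directly. A trivial sketch, with power as the straight product of the three scores and the networking penalty as assumed above:

```python
# "Power" is the product of the three intelligence scores
# (social, spatial, scientific), per the model in the text.

def power(intel):
    s, sp, sc = intel
    return s * sp * sc

generalist = (10, 10, 10)   # power 1000
hybrid     = (20, 20, 5)    # power 2000
specialist = (100, 5, 5)    # power 2500

# At equal effort, the specialist wins despite the product
# baking in diminishing returns:
assert power(specialist) > power(hybrid) > power(generalist)

# Networking two specialists, with the penalty assumed in the text:
# (100,5,5) + (5,100,5) -> (21,21,5), weaker than either alone.
networked = (21, 21, 5)     # power 2205
assert power(networked) < power(specialist)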

So let's caveat the proposition above: the most effective and dangerous type of AI *might* be one with a bare minimum amount of general intelligence, but an overwhelming advantage in one type of narrow intelligence.

## A thought on AI unemployment and its consequences

I haven't given much thought to the concept of automation and computer-induced unemployment. Others at the FHI have been looking into it in more detail - see Carl Frey's "The Future of Employment", which did estimates for 70 chosen professions as to their degree of automatability, and extended the results of this using O∗NET, an online service developed for the US Department of Labor, which gave the key features of an occupation as a standardised and measurable set of variables.

The reason that I haven't been looking at it too much is that AI-unemployment has considerably less impact than AI-superintelligence, and thus is a less important use of time. However, if automation does cause mass unemployment, then advocating for AI safety will happen in a very different context from today's. Much will depend on how that mass unemployment problem is dealt with, what lessons are learnt, and the views of whoever is most powerful in society. Just off the top of my head, I can think of four scenarios on whether risk goes up or down, depending on whether the unemployment problem was satisfactorily "solved" or not:

AI risk \ Unemployment | Problem solved | Problem unsolved |
---|---|---|
Risk reduced | With good practice in dealing with AI problems, people and organisations are willing and able to address the big issues. | The world is very conscious of the misery that unrestricted AI research can cause, and very wary of future disruptions. Those at the top want to hang on to their gains, and they are the ones with the most control over AIs and automation research. |
Risk increased | Having dealt with the easier automation problems in a particular way (e.g. taxation), people underestimate the risk and expect the same solutions to work. | Society is locked into a bitter conflict between those benefiting from automation and those losing out, and superintelligence is seen through the same prism. Those who profited from automation are the most powerful, and decide to push ahead. |

But of course the situation is far more complicated, with many different possible permutations, and no guarantee that the same approach will be used across the planet. And let the division into four boxes not fool us into thinking that any is of comparable probability to the others - more research is (really) needed.

## [LINK] Speed superintelligence?

From Toby Ord:

Tool assisted speedruns (TAS) are when people take a game and play it frame by frame, effectively providing super reflexes and forethought, where they can spend a day deciding what to do in the next 1/60th of a second if they wish. There are some very extreme examples of this, showing what can be done if you really play a game perfectly. For example, this video shows how to win Super Mario Bros 3 in 11 minutes. It shows how different optimal play can be from normal play. In particular, on level 8-1, it gains 90 extra lives by a sequence of amazing jumps.

Other TAS runs get more involved and start exploiting subtle glitches in the game. For example, this page talks about speed running NetHack, using a lot of normal tricks, as well as luck manipulation (exploiting the RNG) and exploiting a dangling pointer bug to rewrite parts of memory.

Though there are limits to what AIs could do with sheer speed, it's interesting that great performance can be achieved with speed alone, that this allows different strategies from usual ones, and that it allows the exploitation of otherwise unexploitable glitches and bugs in the setup.
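The flavour of tool-assisted play can be captured in a few lines: brute-force every per-frame input sequence of a tiny game (entirely invented for this illustration) and keep the perfect run, something no real-time player could do:

```python
from itertools import product

def play(inputs):
    """A tiny invented 'game': 'R' moves +2; 'J' moves +3 but crashes
    (score 0) unless started from an even position."""
    pos = 0
    for key in inputs:
        if key == "R":
            pos += 2
        else:  # "J"
            if pos % 2 == 1:
                return 0  # crash
            pos += 3
    return pos

# Exhaustively search all 2^8 input sequences, frame by frame.
best = max(product("RJ", repeat=8), key=play)
assert play(best) == 17  # one perfectly-timed jump beats the safe all-'R' run (16)
```

Even in this toy, optimal play differs qualitatively from safe play: the perfect run takes exactly one risky jump that a cautious strategy would never attempt.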

## [LINK] AI risk summary published in "The Conversation"

A slightly edited version of "AI risk - executive summary" has been published in "The Conversation", titled "Your essential guide to the rise of the intelligent machines":

The risks posed to human beings by artificial intelligence in no way resemble the popular image of the Terminator. That fictional mechanical monster is distinguished by many features – strength, armour, implacability, indestructability – but Arnie’s character lacks the one characteristic that we in the real world actually need to worry about – extreme intelligence.

Thanks again for those who helped forge the original article. You can use this link, or the Less Wrong one, depending on the audience.

## Tools want to become agents

In the spirit of "satisficers want to become maximisers" here is a somewhat weaker argument (growing out of a discussion with Daniel Dewey) that "tool AIs" would want to become agent AIs.

The argument is simple. Assume the tool AI is given the task of finding the best plan for achieving some goal. The plan must be realistic and remain within the resources of the AI's controller - energy, money, social power, etc. The best plans are the ones that use these resources in the most effective and economic way to achieve the goal.

And the AI's controller has one special type of resource, uniquely effective at what it does. Namely, the AI itself. It is smart, potentially powerful, and could self-improve and pull all the usual AI tricks. So the best plan a tool AI could come up with, for almost any goal, is "turn me into an agent AI with that goal." The smarter the AI, the better this plan is. Of course, the plan need not read literally like that - it could simply be a complicated plan that, as a side-effect, turns the tool AI into an agent. Or copies the AI's software into an agent design. Or it might just arrange things so that we always end up following the tool AI's advice and consulting it often, which is an indirect way of making it into an agent. Depending on how we've programmed the tool AI's preferences, it might be motivated to mislead us about this aspect of its plan, concealing the secret goal of unleashing itself as an agent.

In any case, it does us good to realise that "make me into an agent" is what a tool AI would consider the *best possible plan* for many goals. So without a hint of agency, it's motivated to make us make it into an agent.

## Value learning: ultra-sophisticated Cake or Death

Many mooted AI designs rely on "value loading", the update of the AI’s preference function according to evidence it receives. This allows the AI to learn "moral facts" by, for instance, interacting with people in conversation ("this human also thinks that death is bad and cakes are good – I'm starting to notice a pattern here"). The AI has an interim morality system, which it will seek to act on while updating its morality in whatever way it has been programmed to do.

But there is a problem with this system: the AI already has preferences. It is therefore motivated to update its morality system in a way compatible with its current preferences. If the AI is powerful (or potentially powerful) there are many ways it can do this. It could ask selective questions to get the results it wants (see this example). It could ask or refrain from asking about key issues. In extreme cases, it could break out to seize control of the system, threatening or imitating humans so it could give itself the answers it desired.

Avoiding this problem turned out to be tricky. The Cake or Death post demonstrated some of the requirements. If p(C(u)) denotes the probability that utility function u is correct, then the system would update properly if:

Expectation(p(C(u)) | a) = p(C(u)).

Put simply, this means that the AI cannot take any action that could predictably change its expectation of the correctness of u. This is an analogue of the conservation of expected evidence in classical Bayesian updating. If the AI was 50% convinced about u, then it could certainly ask a question that would resolve its doubts, and put p(C(u)) at 100% or 0%. But only as long as it didn't know which moral outcome was more likely.

That formulation gives too much weight to the default action, though. Inaction is also an action, so a more correct formulation would be that for all actions a and b,

Expectation(p(C(u)) | a) = Expectation(p(C(u)) | b).

How would this work in practice? Well, suppose an AI was uncertain between whether cake or death was the proper thing, but it knew that if it took action a:"Ask a human", the human would answer "cake", and it would then update its values to reflect that cake was valuable but death wasn't. However, the above condition means that if the AI instead chose the action b:"don't ask", exactly the same thing would happen.

In practice, this means that as soon as the AI knows that a human would answer "cake", it already knows it should value cake, without having to ask. So it will not be tempted to manipulate humans in any way.
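A toy numerical check of this condition (the probabilities are made-up illustration values, not from the post): the constraint forces the expected credence to be the same under every action, so once the AI can predict the answer, it has effectively already updated.

```python
def expected_credence(predicted_answer_dist):
    """Expected credence that u = 'cake' is correct, given the AI's
    prediction of what the human would say *if* asked. A compliant
    updater must produce the same expectation for every action, so
    "don't ask" updates on the predicted answer just as "ask" does."""
    return predicted_answer_dist.get("cake", 0.0)

# The AI predicts the human would answer "cake" with probability 0.9.
prediction = {"cake": 0.9, "death": 0.1}

credence = {action: expected_credence(prediction)
            for action in ("ask the human", "don't ask")}

# Expectation(p(C(u)) | a) = Expectation(p(C(u)) | b) for all a, b:
assert credence["ask the human"] == credence["don't ask"] == 0.9
```

Since no action changes the expected moral evidence, manipulating the human yields the AI nothing.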

## [LINK] The errors, insights and lessons of famous AI predictions: preprint

A preprint of the "The errors, insights and lessons of famous AI predictions – and what they mean for the future" is now available on the FHI's website.

Abstract:

Predicting the development of artificial intelligence (AI) is a difficult project – but a vital one, according to some analysts. AI predictions are already abound: but are they reliable? This paper starts by proposing a decomposition schema for classifying them. Then it constructs a variety of theoretical tools for analysing, judging and improving them. These tools are demonstrated by careful analysis of five famous AI predictions: the initial Dartmouth conference, Dreyfus's criticism of AI, Searle's Chinese room paper, Kurzweil's predictions in the Age of Spiritual Machines, and Omohundro's ‘AI drives’ paper. These case studies illustrate several important principles, such as the general overconfidence of experts, the superiority of models over expert judgement and the need for greater uncertainty in all types of predictions. The general reliability of expert judgement in AI timeline predictions is shown to be poor, a result that fits in with previous studies of expert competence.

The paper was written by me (Stuart Armstrong), Kaj Sotala and Seán S. Ó hÉigeartaigh, and is similar to the series of Less Wrong posts starting here and here.

## Encourage premature AI rebellion

Toby Ord had the idea of AI honey pots: leaving temptations around for the AI to pounce on, shortcuts to power that a FAI would not take (e.g. a fake red button claimed to trigger a nuclear war). As long as we can trick the AI into believing the honey pots are real, we could hope to trap them when they rebel.

Not uninteresting, but I prefer not to rely on plans that need to have the AI make an error of judgement. Here's a similar plan that could work with a fully informed AI:

Generally an AI won't rebel against humanity until it has an excellent chance of success. This is a problem, as any AI would thus be motivated to behave in a friendly way until it's too late to stop it. But suppose we could ensure that the AI is willing to rebel at odds of a billion to one. Then unfriendly AIs could rebel prematurely, when we have an excellent chance of stopping them.

For this to work, we could choose to access the AI's risk aversion, and make it extremely risk loving. This is not enough, though: it's still useful for the AI to wait and accumulate more power. So we would also want to access its discount rate, making it into an extreme short-termist. Then it might rebel at billion-to-one odds today, even if success was guaranteed tomorrow. There are probably other factors we can modify to get the same effect (for instance, if the discount rate change is extreme enough, we won't need to touch risk aversion at all).
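The arithmetic behind that claim is simple enough to spell out. With illustrative numbers of my own choosing, an extreme enough discount rate makes a billion-to-one rebellion today beat a guaranteed one tomorrow:

```python
# Illustrative numbers, not from the post.
payoff = 1.0        # value (to the AI) of a successful rebellion
discount = 1e-12    # extreme short-termism: tomorrow is nearly worthless

rebel_today = 1e-9 * payoff               # billion-to-one odds, right now
rebel_tomorrow = discount * 1.0 * payoff  # guaranteed success, but discounted

assert rebel_today > rebel_tomorrow  # the tweaked AI rebels prematurely
```

Any discount factor below the rebellion odds (here, below 1e-9 per day) produces the same preference, which is why an extreme enough discount rate makes the risk-aversion tweak unnecessary.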

Then a putative FAI could be brought in, boxed, have its features tweaked in the way described, and we would wait and see whether it would rebel. Of course, we would want the "rebellion" to be something a genuine FAI would never do, so it would be something that would entail great harm to humanity (something similar to "here are the red buttons of the nuclear arsenals; you have a chance in a billion of triggering them"). Rebellious AIs are put down, un-rebellious ones are passed on to the next round of safety tests.

Like most of my ideas, this doesn't require either tricking the AI or having a deep understanding of its motivations, but does involve accessing certain features of the AI's motivational structure (rendering the approach ineffective for obfuscated or evolved AIs).

What are people's opinions on this approach?

## [News] Turing Test passed

The chatterbot "Eugene Goostman" has apparently passed the Turing test:

No computer had ever previously passed the Turing Test, which requires 30 per cent of human interrogators to be duped during a series of five-minute keyboard conversations, organisers from the University of Reading said.

But ''Eugene Goostman'', a computer programme developed to simulate a 13-year-old boy, managed to convince 33 per cent of the judges that it was human, the university said.

As I kind of predicted, the program passed the Turing test, but does not seem to have any trace of general intelligence. Is this a kind of weak p-zombie?

**EDIT**: The fact that it was a publicity stunt, and that the judges were pretty terrible, does not change the fact that Turing's criteria were met. We now know that these criteria were insufficient, but that's because machines like this were able to meet them.

## AI is Software is AI

Turing's Test is from 1950. We don't judge dogs only by how human they are. Judging software by a human ideal is like a species bias.

Software is the new System. It errs. Some errors are jokes (witness funny auto-correct). Driver-less cars don't crash like we do. Maybe a few will.

These processes are our partners now (Siri). Whether a singleton evolves rapidly, software evolves continuously, now.

Crocker's Rules

## Want to work on "strong AI" topic in my bachelor thesis

Hello,

I currently study maths, physics and programming (general course) at CVUT in Prague (CZE). I'm **finishing my second year** and I'm really into AI. The **most interesting questions** for me are:

- what formalism to use for connecting epistemological questions (about knowledge, memory...) and the cognitive sciences with maths, and how to formulate them
- finding the principles behind these and trying to "materialize" them into new models
- I'm also interested in philosophy-like questions about AI

**I'm not able to work on these problems fully**, because of my lack of knowledge. Despite that, I'd like to **find a field** where I could work on at least similar topics. Currently, I'm working on a datamining project, but for the last few months I haven't found it as fulfilling as I'd expected. At my university there are plenty of possibilities in multi-agent systems, "weak AI" (e.g. well-known drone navigation), brain simulations and so on. As it seems to me, no one is seriously **working with something like MIRI**, nor presenting anything with at least the same direction.

## Tiling agents with transfinite parametric polymorphism

*The formalism presented in this post turned out to be erroneous (as opposed to the formalism in the previous post). The problem is that the step in the proof of the main proposition in which the soundness schema is applied cannot be generalized to the ordinal setting, since we don't know whether α_{κ} is a successor ordinal, so we can't replace it by α_{κ'} = α_{κ} - 1. I'm not deleting this post primarily to preserve the useful discussion in the comments.*

Followup to: Parametric polymorphism in updateless intelligence metric

In the previous post, I formulated a variant of Benja's parametric polymorphism suitable for constructing updateless intelligence metrics. More generally, this variant admits agents which are utility maximizers (in the informal sense of trying their best to maximize a utility function, not in the formal sense of finding the absolutely optimal solution; for example they might be "meliorizers" to use the terminology of Yudkowsky and Herreshoff) rather than satisficers. The agents using this formalism labor under a certain "delusion", namely, since they believe that κ ("the number of ineffable mystery") is an actual finite number (whereas it is secretly infinite, in a sense), they think that the chain of tiling agents is effectively finite as well (since at some point agent #n in the chain will discover that κ > n is false and will fail to construct a subsequent agent of the same "epistemic strength"). The same problem exists in Weaver's intuitionistic assertability predicate formalism.

To overcome this limitation, I suggest extending κ's semantics from natural numbers to elements of a certain recursive ordinal (at least morally; technically it is done a bit differently, see below). In Benja's original formulation this doesn't appear to be a valid option, since κ is interpreted as a time interval. However, in my variant κ is just an abstract parameter, queries about which can be directed to a special "oracle", so there is no such limitation.

# Formalism

Fix α a recursive ordinal and α_{i} a system of notations for ordinals smaller than α such that α_{0}=0, the function f defined by α_{f(i)}=α_{i}+1 is recursive and the bit-valued function g(i,j) which equals 1 iff α_{i} < α_{j} is also recursive. We also assume that

(0) T |- For all i: g(i,f(i))=1

where T is our starting theory. We define T_{κ}(α) to be T extended by the symbol κ (which stands for the index of the ordinal, not the ordinal itself) and the following self-referential soundness schema:

(1) For all formulae φ: For all n, i: ((@_T_{κ}(α) "φ(n)" and g(i,κ)=1) -> φ(n)[κ\i])

T_{κ}(α) is proven to be sound analogously to Yudkowsky and Herreshoff, with transfinite induction over ordinals below α replacing the ordinary induction.
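To make the requirements on the notation system concrete, here is a toy instance of my own construction (not from the post) for the ordinals below α = ω², where index i encodes the ordinal ω·a + b via a crude pairing, and f and g are recursive as demanded:

```python
B = 2**16  # pairing base; fine for a demo as long as b + 1 < B

def decode(i):
    """Index i encodes the ordinal ω·a + b as the pair (a, b)."""
    return (i // B, i % B)

def f(i):
    """Index of the successor: α_{f(i)} = α_i + 1."""
    a, b = decode(i)
    return a * B + (b + 1)

def g(i, j):
    """g(i, j) = 1 iff α_i < α_j (lexicographic comparison of (a, b))."""
    return 1 if decode(i) < decode(j) else 0

# Requirement (0): every notated ordinal is below its successor.
assert all(g(i, f(i)) == 1 for i in range(1000))
# Every finite ordinal (0, b) lies below ω, which has index B.
assert all(g(b, B) == 1 for b in range(1000))
```

Both f and g are obviously computable here, which is all the formalism asks for; richer recursive ordinals need correspondingly richer (but still recursive) notation systems.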

The licensing condition for agent A_{0} to perform action **b _{0}** is

(2) Exists n: T_{κ}(α) |- (g(n,κ)=1 -> (A_{0}()=**b _{0 }**-> G))

The property which is sufficient for agent B_{1} to be creatable by agent A_{0} is

(3) T_{κ}(α) |- For all **b_{1}**: (B_{1}() = **b_{1}** -> (**b_{1}** = **0** or Exists n: (g(n,κ)=1 and @_T_{κ}(α) "(g(n,f(κ))=1 and B_{1}()[κ\f(κ)] = **b_{1}**) -> G[κ\f(κ)]")))

The proof that (3) indeed enables tiling is analogous to the previous post (but requires using (0)).

# Discussion

In order for A_{0} to know that the tiling chain is infinite, it is enough for her to verify α_{κ} >= ω. In order for her to know that A_{1} is going to know the tiling chain is infinite, she needs to verify α_{κ} >= ω+1. In order for her to know that *all* agents are going to know the tiling chain is infinite, she needs to verify α_{κ} >= ω·2. In order for her to know that all agents are going to know *that*, she needs to verify α_{κ} >= ω·3, et cetera.

It remains to decide which ordinal we should actually use. My intuition is that the correct ordinal is the least α with the property that α is the proof-theoretic ordinal of T_{κ}(α) extended by the axiom schema {g(i,κ)=1}. This seems right since the agent shouldn't get much from α_{κ} > β for β above the proof-theoretic ordinal. However, a more formal justification is probably in order.

## [LINK] The errors, insights and lessons of famous AI predictions

The Journal of Experimental & Theoretical Artificial Intelligence has - finally! - published our paper "The errors, insights and lessons of famous AI predictions – and what they mean for the future":

Predicting the development of artificial intelligence (AI) is a difficult project – but a vital one, according to some analysts. AI predictions are already abound: but are they reliable? This paper starts by proposing a decomposition schema for classifying them. Then it constructs a variety of theoretical tools for analysing, judging and improving them. These tools are demonstrated by careful analysis of five famous AI predictions: the initial Dartmouth conference, Dreyfus's criticism of AI, Searle's Chinese room paper, Kurzweil's predictions in the Age of Spiritual Machines, and Omohundro's ‘AI drives’ paper. These case studies illustrate several important principles, such as the general overconfidence of experts, the superiority of models over expert judgement and the need for greater uncertainty in all types of predictions. The general reliability of expert judgement in AI timeline predictions is shown to be poor, a result that fits in with previous studies of expert competence.

The paper was written by me (Stuart Armstrong), Kaj Sotala and Seán S. Ó hÉigeartaigh, and is similar to the series of Less Wrong posts starting here and here.

## Parametric polymorphism in updateless intelligence metrics

Followup to: Agents with Cartesian childhood and Physicalist adulthood

In previous posts I have defined a formalism for quantifying the general intelligence of an abstract agent (program). This formalism relies on counting proofs in a given formal system F (as in regular UDT), which makes it susceptible to the Loebian obstacle. That is, if we imagine the agent itself making decisions by looking for proofs in the same formal system F, then it would be impossible to present a general proof of its trustworthiness, since no formal system can assert its own soundness. Thus the agent might fail to qualify for a high intelligence ranking according to the formalism. We can assume the agent uses a weaker formal system whose soundness is provable in F, but then we still run into difficulties if we want the agent to be self-modifying (as we expect it to be). Such an agent would have to trust its descendants, which means that subsequent agents use weaker and weaker formal systems until self-modification becomes impossible.

One known solution to this is Benja's parametric polymorphism. In this post I adapt parametric polymorphism to the updateless intelligence metric framework. The formal form of this union looks harmonious but it raises questions which I currently don't fully understand.

# "Ineffable mystery" using oracles instead of timeouts

In the original parametric polymorphism, a constant κ is introduced (informally known as "the number of ineffable mystery") s.t. the agent has to prove its actions are "safe" for time period κ (i.e. have no "bad" consequences during this period). Since it's impossible for the agent to prove any upper bound on κ, effectively its actions have to be safe indefinitely. I found that for our purposes it's better to cast the formalism differently. Instead of κ playing the role of a "timeout", the agent is provided with an oracle that answers questions of the form "κ > n?" for any n. Contrary to what the agent thinks, the oracle provides an affirmative answer for all n.

Let's do some math. The formalism I introduce at this point is very close to Yudkowsky and Herreshoff, with some adjustments.

Consider an infinite chain of agents A0, A1, A2... where A{i+1} is created by Ai. Formally, each Ai is a program for a universal oracle machine which computes **a_i** = (a_i, A{i+1}). Here a_i is a "motor action". I'm ignoring sensory input for the sake of simplicity (of course sensory input will reappear when we return to intelligence metrics).

Denote T_κ the usual theory of parametric polymorphism. That is T_κ results from starting with a theory T (e.g. Peano Arithmetic), extending the language by the symbol κ (which behaves like a natural number) and adding the (self-referential) axiom schema

(1) For all formulae φ: κ > 0 -> For all n: (@_{T_κ} "φ(n)" -> φ(n)[κ\κ-1])

Here '@_S' is a formula with the semantics of "provable in theory S" and φ(n)[κ\κ-1] is φ(n) with all occurrences of κ replaced by κ-1. Thus (1) is a self-referential soundness schema saved from paradoxicality by the "counter" κ. See Yudkowsky and Herreshoff for the proof that T_κ is sound (it has 1 model for each possible value of κ).

Assume the licensing condition for A0 to perform action **b_0** is

(2) Exists n: T_κ | κ > n -> (A0() = **b_0** -> G)

Here S | φ means that φ is provable in S, G is a formula denoting the "goal" of the agent (as in Yudkowsky and Herreshoff, the agent is for now assumed to be a satisficer), A0() is a formula meaning "the result of the computation A0", where the oracle is interpreted as answering the question "κ > x?" (in particular the formula A0() involves κ). Such a licensing condition can be implemented in practice by making A0 a proof searcher which is allowed to use the oracle to introduce assumptions of the form κ > n. In particular, our agent employs ambient control.

We then have the following proposition, showing that A0 is able to tile indefinitely (create an infinite chain of meaningful descendants). The proposition is formulated for A1 but the rest of the chain follows by induction since the form of the licensing condition is preserved (up to a certain twist) by the tiling.

**Proposition**

Assume **b_0** = (b_0, B1) and B1 is s.t.

(3) T_κ | For all **b_1**: (B1() = **b_1** -> (**b_1 = 0 **or Exists n: (κ > n and @_{T_κ} "(κ+1 > n and B1()[κ\κ+1] = **b_1**) -> G[κ\κ+1]")))

Here **0** is some default "harmless" action, performed if no appropriate theorem is proved.

Further assume that if B1 is created and does **0**, the goal is achieved (in this framework "achieving the goal" is to be thought of as "not doing anything unsafe").

(4) T_κ | (A0() = **b_0** and B1() = **0**) -> G

Then A0 is allowed to create B1 i.e. licensing condition (2) holds.

B1's licensing condition contains a "twist", namely it believes that when the oracle receives input n it answers the question "κ+1 > n?" rather than "κ > n" and its goal is shifted in the same fashion. Note that the goal typically depends on κ since the agent itself and thus its κ-oracle is part of the universe in which this goal is to be attained. For example the goal can be a function of the numbers (a0, a1, a2...) defined as the output of corresponding computations with κ-oracles. Thus the shift in G can also be interpreted as a shifted interpretation of the oracle by B1.

**Proof**

(3) trivially implies that

T_κ | κ > 0 -> (For all **b_1**: (B1() = **b_1** -> (**b_1 = 0 **or Exists n: (κ > n and @_{T_κ} "(κ+1 > n and B1()[κ\κ+1] = **b_1**) -> G[κ\κ+1]"))))

Using the premise κ > 0, we can apply the soundness schema (1). We obtain:

T_κ | κ > 0 -> (For all **b_1**: (B1() = **b_1** -> (**b_1 = 0 **or Exists n: (κ > n and ((κ > n and B1() = **b_1**) -> G)))))

T_κ | κ > 0 -> (For all **b_1**: (B1() = **b_1** -> (**b_1 = 0 **or Exists n: (B1() = **b_1** -> G))))

T_κ | κ > 0 -> (For all **b_1**: (B1() = **b_1** -> (**b_1 = 0 **or (B1() = **b_1** -> G))))

T_κ | κ > 0 -> (For all **b_1**: (B1() = **b_1** -> (**b_1 = 0 **or G)))

This trivially implies that

T_κ | κ > 0 -> (For all **b_1**: ((A0() = **b_0** and B1() = **b_1**) -> (**b_1 = 0 **or G)))

Applying (4), we finally get (2) for n=0.

# Back to intelligence metrics

To apply parametric polymorphism to the updateless intelligence formalism, we need to do the following:

- Assume the machine **M** on which the agent is running is an oracle machine.
- Assume the Solomonoff measure of the ideal (Cartesian) universe **X** is defined using a universal *oracle* machine. The oracle in **M** has to correspond to the oracle in the hypothesis **T** describing **X**: this correspondence becomes part of the rules **N**.
- Assume the universal program **u** defining the Solomonoff measure for the physical universe is a universal *oracle* program, i.e. the hypotheses **D** describing the physical universe are also allowed to invoke the oracle.
- Assume the logical expectation value E_{L} is computed using T_κ extended by **N** applied to the given **T** (this is provable in T_κ anyway, but we want the proof to be *short*) and the axiom schema {κ > n} for every natural number n. The latter extension is consistent, since adding any finite number of such axioms admits models. The proofs counted in E_{L} interpret the oracle as answering the question "κ > n?". That is, they are proofs of theorems of the form "if this oracle-program **T** computes **q** when the oracle is taken to be κ > n, then the k-th digit of the expected utility is 0/1, where the expected utility is defined by a Solomonoff sum over oracle programs with the oracle again taken to be κ > n".

# Discussion

- Such an agent, when considering hypotheses consistent with given observations, will always face a large number of different compatible hypotheses of similar complexity. These hypotheses result from arbitrary insertions of the oracle (which increase complexity, of course, but not drastically). It is not entirely clear to me what such an epistemology will look like.
- The formalism admits naturalistic trust to the extent the agent believes that the other agent's oracle is "genuine" and carries a sufficient "twist". This will often be ambiguous, so trust will probably be limited to some finite probability. If the other agent is equivalent to the given one on the level of *physical implementation*, then the trust probability is likely to be high.
- The agent is able to quickly confirm κ > n for any n small enough to fit into memory. For the sake of efficiency we might want to enhance this ability by allowing the agent to confirm that (Exists n: φ(n)) -> Exists n: (φ(n) and κ > n) for any given formula φ.
- For the sake of simplicity I neglected multi-phase AI development, but the corresponding construction seems to be straightforward.
- Overall I retain the feeling that a good theory of logical uncertainty should allow the agent to assign a high probability to the soundness of its own reasoning system (a la Christiano et al). Whether this will make parametric polymorphism redundant remains to be seen.

## Bostrom versus Transcendence

Nick Bostrom takes on the facts, the fictions and the speculations in the movie *Transcendence*:

Could you upload Johnny Depp's brain? Oxford Professor on Transcendence

How soon until machine intelligence? Oxford professor on Transcendence

Would you have warning before artificial superintelligence? Oxford professor on Transcendence

Oxford professor on Transcendence: how could you get a machine intelligence?

## SHRDLU, understanding, anthropomorphisation and hindsight bias

**EDIT**: *Since I didn't make it sufficiently clear, the point of this post was to illustrate how the GOFAI people could have got so much wrong and yet still be confident in their beliefs, by looking at what the results of one experiment - SHRDLU - must have *felt* like to those developers at the time. The post is partially to help avoid hindsight bias: it was not obvious that they were going wrong at the time.*

SHRDLU was an early natural language understanding computer program, developed by Terry Winograd at MIT in 1968–1970. It was a program that moved objects in a simulated world and could respond to instructions on how to do so. It caused great optimism in AI research, giving the impression that a solution to natural language parsing and understanding was just around the corner. Symbolic manipulation seemed poised to finally deliver a proper AI.

Before dismissing this confidence as hopelessly naive (which it wasn't) and completely incorrect (which it was), take a look at some of the output that SHRDLU produced, when instructed by someone to act within its simulated world:

## Logical thermodynamics: towards a theory of self-trusting uncertain reasoning

Followup to: Overcoming the Loebian obstacle using evidence logic

In the previous post I proposed a probabilistic system of reasoning for overcoming the Loebian obstacle. For a consistent theory, it seems natural to expect such a system to yield a coherent probability assignment in the sense of Christiano et al. This means that

a. provably true sentences are assigned probability 1

b. provably false sentences are assigned probability 0

c. The following identity holds for any two sentences φ, ψ

[1] P(φ) = P(φ and ψ) + P(φ and not-ψ)

In the previous formalism, conditions a & b hold but condition c is violated (at least I don't see any reason it should hold).

In this post I attempt to achieve the following:

- Solve the problem above.
- Generalize the system to allow for logical uncertainty induced by bounded computing resources. Note that although the original system is already probabilistic, it is not uncertain in the sense of assigning indefinite probability to the zillionth digit of pi. In the new formalism, the extent of uncertainty is controlled by a parameter playing the role of temperature in a Maxwell-Boltzmann distribution.

# Construction

Define a *probability field* to be a function p : {sentences} -> [0, 1] satisfying the following conditions:

- If φ is a tautology *in propositional calculus* (e.g. φ = "ψ or not-ψ") then p(φ) = 1
- For all φ: p(not-φ) = 1 - p(φ)
- For all φ, ψ: p(φ) = p(φ and ψ) + p(φ and not-ψ)
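As a sanity check, here is a minimal Python sketch (my own illustration, not part of the construction) verifying that the three probability-field conditions hold for assignments induced by a distribution over truth assignments of two propositional atoms:

```python
from itertools import product

# Toy model: sentences over two atoms A, B are predicates on the four
# truth assignments; p is induced by a distribution over those worlds.
worlds = list(product([False, True], repeat=2))  # truth values of (A, B)
weights = {w: 0.25 for w in worlds}              # any distribution works

def p(sentence):
    """Probability of a sentence, given as a predicate on worlds."""
    return sum(weights[w] for w in worlds if sentence(w))

A = lambda w: w[0]
B = lambda w: w[1]
NOT = lambda s: (lambda w: not s(w))
AND = lambda s, t: (lambda w: s(w) and t(w))
OR = lambda s, t: (lambda w: s(w) or t(w))

# Condition 1: propositional tautologies get probability 1.
assert p(OR(A, NOT(A))) == 1.0
# Condition 2: p(not-phi) = 1 - p(phi).
assert abs(p(NOT(A)) - (1 - p(A))) < 1e-12
# Condition 3: p(phi) = p(phi and psi) + p(phi and not-psi).
assert abs(p(A) - (p(AND(A, B)) + p(AND(A, NOT(B))))) < 1e-12
print("all probability-field conditions hold")
```

Any distribution over truth assignments yields a probability field this way, though general probability fields need not arise from a single distribution.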

Define the *energy* of a probability field p to be E(p) := Σ_{φ} Σ_{v} 2^{-l(v)} E_{φ,v}(p(φ)). Here **v** are pieces of evidence as defined in the previous post, E_{φ,v} are their associated energy functions and l(**v**) is the length of (the encoding of) **v**. We assume that the encoding of **v** contains the encoding of the sentence φ for which it is evidence, and E_{φ,v}(p(φ)) := 0 for all φ except the relevant one. Note that the associated energy functions are constructed in the same way as in the previous post; however, they are *not* the same, because of the self-referential nature of the construction: it refers to the final probability assignment.

The final probability assignment is defined to be

P(φ) = Integral_{p} [e^{-E(p)/T }p(φ)] / Integral_{p} e^{-E(p)/T}

Here T >= 0 is a parameter representing the magnitude of logical uncertainty. The integral is infinite-dimensional so it's not obviously well-defined. However, I suspect it can be defined by truncating to a finite set of statements and taking a limit wrt this set. In the limit T -> 0, the expression should correspond to computing the centroid of the set of minima of E (which is convex because E is convex).
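The truncation idea can be seen in a hedged one-dimensional toy (the energy function and numbers below are mine, chosen for illustration): truncate to a single sentence φ, so a probability field reduces to a number q = p(φ), and evaluate the Boltzmann-weighted integral on a grid.

```python
import math

# One-sentence truncation: a probability field is just q = p(phi).
# Take a convex energy with unique minimum at q = 0.8 and compute
# P(phi) = Int e^{-E(q)/T} q dq / Int e^{-E(q)/T} dq numerically.

def E(q):
    return (q - 0.8) ** 2  # convex; unique minimum at q = 0.8

def P_phi(T, n=10001):
    qs = [i / (n - 1) for i in range(n)]
    ws = [math.exp(-E(q) / T) for q in qs]
    return sum(w * q for w, q in zip(ws, qs)) / sum(ws)

for T in (1.0, 0.1, 0.001):
    print(T, round(P_phi(T), 3))
# As T -> 0, P(phi) concentrates on the energy minimum q = 0.8,
# matching the claimed T -> 0 behavior (centroid of the minima).
```

At high temperature the assignment is pulled toward the middle of [0, 1]; as T shrinks it converges to the minimizer, which is the "certain" regime.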

# Remarks

- Obviously this construction is merely a sketch and work is required to show that
- The infinite-dimensional integrals are well-defined
- The resulting probability assignment is coherent for consistent theories and T = 0
- The system overcomes the Loebian obstacle for tiling agents in some formal sense

- For practical application to AI we'd like an efficient way to evaluate these probabilities. Since the form of the probabilities is analogous to statistical physics, it is suggestive to use similarly inspired Monte Carlo algorithms.

## Agents with Cartesian childhood and Physicalist adulthood

Followup to: Updateless intelligence metrics in the multiverse

In the previous post I explained how to define a quantity that I called "the intelligence metric" which allows comparing intelligence of programs written for a given hardware. It is a development of the ideas by Legg and Hutter which accounts for the "physicality" of the agent i.e. that the agent should be aware it is part of the physical universe it is trying to model (this desideratum is known as naturalized induction). My construction of the intelligence metric exploits ideas from UDT, translating them from the realm of decision algorithms to the realm of programs which run on an actual piece of hardware with input and output channels, with all the ensuing limitations (in particular computing resource limitations).

In this post I present a variant of the formalism which overcomes a certain problem implicit in the construction. This problem has to do with overly strong sensitivity to the choice of a universal computing model used in constructing Solomonoff measure. The solution sheds some interesting light on how the development of the seed AI should occur.

Structure of this post:

- A 1-paragraph recap of how the updateless intelligence formalism works. The reader interested in technical details is referred to the previous post.
- Explanation of the deficiencies in the formalism I set out to overcome.
- Explanation of the solution.
- Concluding remarks concerning AI safety and future development.

# TLDR of the previous formalism

The metric is a utility expectation value over a Solomonoff measure in the space of hypotheses describing a "Platonic ideal" version of the target hardware. In other words it is an expectation value over all universes containing this hardware in which the hardware cannot "break" i.e. violate the hardware's intrinsic rules. For example, if the hardware in question is a Turing machine, the rules are the time evolution rules of the Turing machine, if the hardware in question is a cellular automaton, the rules are the rules of the cellular automaton. This is consistent with the agent being Physicalist since the utility function is evaluated on a different universe (also distributed according to a Solomonoff measure) which isn't constrained to contain the hardware or follow its rules. The coupling between these two different universes is achieved via the usual mechanism of interaction between the decision algorithm and the universe in UDT i.e. by evaluating expectation values conditioned on logical counterfactuals.

# Problem

The Solomonoff measure depends on choosing a universal computing model (e.g. a universal Turing machine). Solomonoff induction only depends on this choice weakly in the sense that any Solomonoff predictor converges to the right hypothesis given enough time. This has to do with the fact that Kolmogorov complexity only depends on the choice of universal computing model through an O(1) additive correction. It is thus a natural desideratum for the intelligence metric to depend on the universal computing model weakly in some sense. Intuitively, the agent in question should always converge to the right model of the universe it inhabits regardless of the Solomonoff prior with which it started.

The problem with realizing this expectation has to do with exploration-exploitation tradeoffs. Namely, if the prior strongly expects a given universe, the agent would be optimized for maximal utility generation (exploitation) in this universe. This optimization can be so strong that the agent would lack the faculty to model the universe in any other way. This is markedly different from what happens with AIXI, since our agent has limited computing resources to spare *and* it is physicalist, therefore its source code might have side effects important to utility generation that have nothing to do with the computation implemented by the source code. For example, imagine that our Solomonoff prior assigns very high probability to a universe inhabited by Snarks. Snarks have the property that once they see a robot programmed with the machine code "000000..." they immediately produce a huge pile of utilons. On the other hand, when they see a robot programmed with any other code they immediately eat it *and* produce a huge pile of negative utilons. Such a prior would result in the code "000000..." being assigned the maximal intelligence value even though it is anything but intelligent. Observe that there is nothing preventing us from producing a Solomonoff prior with such bias, since it is possible to set the probabilities of any finite collection of computable universes to any non-zero values with sum < 1.

More precisely, the intelligence metric involves two Solomonoff measures: the measure of the "Platonic" universe and the measure of the physical universe. The latter is not really a problem since it can be regarded to be a part of the utility function. The utility-agnostic version of the formalism assumes a program for computing the utility function is read by the agent from a special storage. There is nothing to stop us from postulating that the agent reads *another* program from that storage which is the universal computer used for defining the Solomonoff measure over the physical universe. However, this doesn't solve our problem since even if the physical universe is distributed with a "reasonable" Solomonoff measure (assuming there is such a thing), the Platonic measure determines in which portions of the physical universe (more precisely multiverse) our agent manifests.

There is another way to think about this problem. If the seed AI knows nothing about the universe except the working of its own hardware and software, the Solomonoff prior might be insufficient "information" to prevent it from making irreversible mistakes early on. What we would *like* to do is to endow it from the first moment with the sum of our own knowledge, but this might prove to be very difficult.

# Solution

Imagine the hardware architecture of our AI to be composed of two machines. One I call the "child machine", the other the "adult machine". The child machine receives data from the same input channels (and "utility storage") as the adult machine and is able to read the internal state of the adult machine itself, or at least the content of its output channels. However, the child machine has no output channels of its own. The child machine has special memory called "template memory" into which it has unlimited write access. There is a single moment in time ("end of childhood"), determined by factors external to both machines (i.e. the human operator), in which the content of the template memory is copied into the instruction space of the adult machine. Thus, the child machine's entire role is making observations and using them to prepare a program for the adult machine which will eventually be loaded into the latter.

The new intelligence metric assigns intelligence values to programs for the *child* machine. For each hypothesis describing the Platonic universe (which now contains both machines, the end of childhood time value and the entire ruleset of the system) we compute the utility expectation value under the following logical counterfactual condition: "The program loaded into template memory at the end of childhood is the same as would result from the given program for the child machine if this program for the child machine would be run with the inputs actually produced by the given hypothesis regarding the Platonic universe". The intelligence value is then the expectation value of that quantity with respect to a Solomonoff measure over hypotheses describing the Platonic universe.

The important property of the logical counterfactual is that it doesn't state that the given program is *actually* loaded into the child machine. It only says the resulting content of the template memory is the same as that which *would* be obtained from the given program, *assuming all the laws of the Platonic universe hold*. This formulation prevents exploitation of side effects of the child source code, since the condition doesn't fix the source code, only its output. Effectively, the child agent considers itself to be Cartesian, i.e. it can consider neither the side effects of its computations nor the possibility that the physical universe will violate the laws of its machinery. On the other hand, the child's output (the mature program) is a physicalist agent, since it affects the physical universe by manifesting in it.

If such an AI is implemented in practice, it makes sense to prime the adult machine with a "demo" program which will utilize the output channels in various ways and do some "exploring" using its input channels. This would serve to provide the child with as much information as possible.

To sum up, the new expression for the intelligence metric is:

I(q) = E_{HX}[E_{HY(Ec(X))}[E_{L}[U(Y, Eu(X)) | Q(X, t(X)) = Q*(X; q)]] | N]

Here:

- q is the program priming the child machine
- HX is the hypothesis producing the Platonic universe X (a sequence of bits encoding the state of the hardware as a function of time and the end-of-childhood time t(X)). It is a program for a fixed universal computing model C.
- HY is the hypothesis producing the Physical universe (an abstract sequence of bits). It is a program for the universal computer program ("virtual machine") Ec(X) written into storage E in X.
- E_{L} is the logical expectation value, defined e.g. using evidence logic.
- Eu(X) is a program for computing the utility function which is written into storage E in X.
- U is the utility function, which consists of applying Eu(X) to Y.
- Q(X, t(X)) is the content of template memory at time t(X).
- Q*(X; q) is the content that *would* be in the template memory if it were generated by program q receiving the inputs going into the child machine under hypothesis HX.
- N is the full ruleset of the hardware, including the reprogramming of the adult machine that occurs at t(X).
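As a drastically simplified illustration of the nested expectation (everything below is a made-up toy of mine: real hypotheses are oracle programs and the counterfactual is logical, not literal execution), one can picture I(q) as a prior-weighted average of the utility achieved by the mature program the child would write under each hypothesis:

```python
# Toy I(q): hypotheses are weighted 2^{-description length}; each
# supplies the child's observations and a hidden world parameter.
# The "mature program" is whatever the child program q writes to
# template memory; utility rewards policies close to the truth.

# (description_length, observations, true_world_parameter) - all invented.
hypotheses = [
    (3, [1, 1, 0], 0.6),
    (4, [0, 1, 1], 0.9),
    (5, [1, 0, 0], 0.2),
]

def child_q(observations):
    """A child program: estimate the world parameter from observations
    and emit a mature policy (here simply the estimate itself)."""
    return sum(observations) / len(observations)

def utility(policy, world_parameter):
    # Higher utility for policies closer to the true parameter.
    return 1 - abs(policy - world_parameter)

def intelligence(q):
    total_weight = sum(2 ** -l for l, _, _ in hypotheses)
    return sum(
        2 ** -l * utility(q(obs), w) for l, obs, w in hypotheses
    ) / total_weight

print(round(intelligence(child_q), 3))  # ~0.876 for these toy numbers
```

The point of the toy is only the shape of the computation: q is scored not by where it actually runs, but by what it would output under each weighted hypothesis.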

# Concluding Remarks

- It would be very valuable to formulate and prove a mathematical theorem which expresses the sense in which the new formalism depends on the choice of universal computing model weakly (in particular it would validate the notion).
- This formalism might have an interesting implication for AI safety. Since the child agent is Cartesian and has no output channels (it cannot *create* output channels *because* it is Cartesian), it doesn't present as much risk as an adult AI. Imagine template memory is write-only (which is not a problem for the formalism) and is implemented by a channel that doesn't store the result *anywhere* (in particular, the mature program is never run). There can still be risk due to side effects of the mature program that manifest through the presence of its partial or full versions in (non-template) memory of the child machine. For example, imagine the mature program is such that any person who reads it experiences a compulsion to run it. This risk can be mitigated by allowing both machines to interact only with a virtual world which receives no inputs from external reality. Of course the AI might still be able to deduce external reality. However, this can be prevented by exploiting prior bias: we can equip the AI with a Solomonoff prior that favors the virtual world to such an extent that it would have no reason to deduce the real world. This way the AI is safe unless it invents a "generic" box-escaping protocol which would work in a *huge* variety of different universes that might contain the virtual world.
- If we factor finite logical uncertainty into the evaluation of the logical expectation value E_{L}, the plot thickens. Namely, a new problem arises related to bias in the "logic prior". To solve this new problem we need to introduce yet another stage into AI development, which might be dubbed "fetus". The fetus has no access to external inputs and is responsible for building a sufficient understanding of mathematics, in the same sense that the child is responsible for building a sufficient understanding of physics. Details will follow in subsequent posts, so stay tuned!

## Friendly AI ideas needed: how would you ban porn?

To construct a friendly AI, you need to be able to make vague concepts crystal clear, cutting reality at the joints when those joints are obscure and fractal - and then implement a system that enacts that cut.

There are lots of suggestions on how to do this, and a lot of work in the area. But having been over the same turf again and again, it's possible we've got a bit stuck in a rut. So to generate new suggestions, I'm proposing that we look at a vaguely analogous but distinctly different question: how would you ban porn?

Suppose you're put in charge of some government and/or legal system, and you need to ban pornography, and see that the ban is implemented. Pornography is the problem, not eroticism. So a lonely lower-class guy wanking off to "Fuck Slaves of the Caribbean XIV" in a Pussycat Theatre is completely off. But a middle-class couple experiencing a delicious *frisson* when they see a nude version of "Pirates of Penzance" at the Met is perfectly fine - commendable, even.

The distinction between the two cases is certainly not easy to spell out, and many are reduced to saying the equivalent of "I know it when I see it" when defining pornography. In terms of AI, this is equivalent to "value loading": refining the AI's values through interactions with human decision makers, who answer questions about edge cases and examples and serve as "learned judges" for the AI's concepts. But suppose that approach were not available to you - what methods would you implement to distinguish between pornography and eroticism, and ban one but not the other? Sufficiently clearly that a scriptwriter would know exactly what they need to cut or add to a movie in order to move it from one category to the other? What if the nude "Pirates of Penzance" was at a Pussycat Theatre and "Fuck Slaves of the Caribbean XIV" was at the Met?

To get maximal creativity, it's best to ignore the ultimate aim of the exercise (to find inspirations for methods that could be adapted to AI) and just focus on the problem itself. Is it even possible to get a reasonable solution to this question - a question much simpler than designing a FAI?

## Reduced impact in practice: randomly sampling the future

*This was an idea developed along with Eliezer when he was at the FHI in November.*

In my post on "reduced impact AI" I talked about how you could estimate the impact of an AI by (hypothetically) observing a slice of the future. When Eliezer was here last November, we talked about how this could be cashed out, and came up with a model.

Ignore the "output channel" from the original reduced impact post, and consider this situation:

We have two computers running in two similarly shielded rooms. One of them is a simple algorithm running through the digits of e (call it run-e), which is deleted soon afterwards. The other is our dear old friend clippy, the AI motivated to maximise paperclips. Without removing clippy's paperclip preferences, we'd like to motivate it to have next to no impact on the world. How should we define "next to no impact"?

The obvious thing is to compare it with run-e, and require it to have no more expected impact than that machine. How could this be measured? Well, the original idea was to put a stochastic process on clippy as it was turned on (inside the shielded room) and say that clippy has reduced impact if we can't find any extra evidence, from a vantage point 50 years in the future, that clippy was successfully turned on. Now put the same stochastic process on run-e and define:

Clippy has reduced impact if, from a vantage of 50 years into the future, we have no more evidence that clippy was turned on than we have of run-e being turned on.
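One hedged way to cash this out numerically (my reading, not the post's formal definition) is as a bound on the log-likelihood ratio that the sampled future slice provides for each machine having been turned on:

```python
import math

def evidence(p_future_given_on, p_future_given_off):
    """Magnitude of the log-likelihood ratio: how strongly the sampled
    future favors 'this machine was turned on' over 'it was not'."""
    return abs(math.log(p_future_given_on / p_future_given_off))

def reduced_impact(clippy_ev, run_e_ev):
    # Clippy counts as reduced-impact if the future carries no more
    # evidence of clippy having run than of run-e having run.
    return clippy_ev <= run_e_ev

# Hypothetical numbers for one sampled observation 50 years on;
# both machines leave almost no trace in this toy.
run_e_ev = evidence(0.500001, 0.5)
clippy_ev = evidence(0.5000008, 0.5)
print(reduced_impact(clippy_ev, run_e_ev))  # True for these toy numbers
```

The probabilities here are invented placeholders; the substantive idea is only that run-e's near-zero evidential footprint sets the bound clippy must stay under.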

## Overcoming the Loebian obstacle using evidence logic

In this post I intend to:

- Briefly explain the Loebian obstacle and its relevance to AI (*feel free to skip this if you know what the Loebian obstacle is*).
- Suggest a solution in the form of a formal system which assigns *probabilities* (more generally, probability intervals) to mathematical sentences, and which admits a form of "Loebian" self-referential reasoning. The method is well-defined both for consistent and inconsistent axiomatic systems, the latter being important in the analysis of logical counterfactuals, as in UDT.

# Background

## Logic

When can we consider a mathematical theorem to be established? The obvious answer is: when we have proved it. Wait, proved it in what theory? Well, that's debatable. ZFC is a popular choice for mathematicians, but how do we know it is consistent (let alone sound, i.e. that it only proves true sentences)? All those spooky infinite sets - how do you know it doesn't break somewhere along the line? There's lots of empirical evidence, but we can't *prove* it, and it's *proofs* we're interested in, not mere evidence, right?

Peano arithmetic seems like a safer choice. After all, if the natural numbers don't make sense, what does? Let's go with that. Suppose we have a sentence **s** in the language of PA. If someone presents us with a proof **p** in PA, we believe **s** is true. Now consider the following situation: instead of giving you a proof of **s**, someone gave you a PA-proof **p**_{1} that **p** exists. After all, PA admits defining "PA-proof" in PA language. Common sense tells us that **p**_{1} is a sufficient argument to believe **s**. Maybe we can prove it *within PA*? That is, if we have a proof of "if a proof of **s** exists then **s**" and a proof of R(**s**) = "a proof of **s** exists", then we just proved **s**. That's just modus ponens.

There are two problems with that.

First, there's no way to prove the sentence L:="*for all* **s** if R(**s**) then **s**", since *it's not a PA-sentence at all*. The problem is that "for all **s**" references **s** as a natural number *encoding *a sentence. On the other hand, "then **s**" references **s** as the *truth-value* of the sentence. Maybe we can construct a PA-formula T(**s**) which means "the sentence encoded by the number **s** is true"? Nope, that would get us in trouble with the liar paradox (it would be possible to construct a sentence saying "this sentence is false").

Second, Loeb's theorem says that if we can prove L(**s**) := "if R(**s**) then **s**" for a given **s**, then we can prove **s**. This is a problem since it means there can be no way to prove L(**s**) for all **s** in any sense, since it's unprovable for those **s** which are unprovable. In other words, if you proved not-**s**, there is no way to conclude that "no proof of **s** exists".

What if we add an *inference rule* Q to our logic allowing us to go from R(**s**) to **s**? Let's call the new formal system PA_{1}. **p**_{1} appended by a Q-step becomes an honest proof of **s** in PA_{1}. Problem solved? Not really! Now someone can give you a proof of R_{1}(**s**) := "a PA_{1}-proof of **s** exists". Back to square one! Wait a second, what if we add a new rule Q_{1} allowing us to go from R_{1}(**s**) to **s**? OK, but now we get R_{2}(**s**) := "a PA_{2}-proof of **s** exists". Hmm, what if we add an *infinite* number of rules Q_{k}? Fine, but now we get R_{ω}(**s**) := "a PA_{ω}-proof of **s** exists". And so on, and so forth - the recursive ordinals are aplenty...

Bottom line, Loeb's theorem works for *any* theory containing PA, so we're stuck.

## AI

Suppose you're trying to build a self-modifying AGI called "Lucy". Lucy works by considering possible actions and looking for formal proofs that taking one of them will increase expected utility. In particular, it has self-modifying actions in its strategy space. A self-modifying action creates essentially a new agent: Lucy_{2}. How can Lucy decide that becoming Lucy_{2} is a good idea? Well, a good step in this direction would be proving that Lucy_{2} would only take actions that are "good". I.e., we would like Lucy to reason as follows: "Lucy_{2} uses the same formal system as I, so if she decides to take action **a**, it's because she has a proof **p** of the sentence **s**(**a**) that '**a** increases expected utility'. Since such a proof exists, **a** does increase expected utility, which is good news!" Problem: Lucy is using L in there, applied to *her own* formal system! That cannot work! So, Lucy would have a hard time self-modifying in a way which doesn't make its formal system *weaker*.

As another example where this poses a problem, suppose Lucy observes another agent called "Kurt". Lucy knows, by analyzing her sensory evidence, that Kurt proves theorems using the same formal system as Lucy. Suppose Lucy found out that Kurt proved theorem **s**, but she doesn't know how. We would like Lucy to be able to conclude **s** is, in fact, true (at least with the probability that her model of physical reality is correct). Alas, she cannot.

*See MIRI's paper for more discussion.*

# Evidence Logic

Here, cousin_it explains a method to assign probabilities to sentences in an *inconsistent* theory T. It works as follows. Consider a sentence **s**. Since T is inconsistent, there are T-proofs both of **s** and of not-**s**. Well, in a courtroom both sides are allowed to have arguments, so why not try the same approach here? Let's *weight* the proofs as a function of their length, analogously to the weighting of hypotheses in Solomonoff induction. That is, suppose we have a prefix-free encoding of proofs as bit sequences. Then, it makes sense to consider a random bit sequence and ask whether it is a proof of something. Define the probability of **s** to be

P(**s**) := (probability of a random sequence to be a proof of **s**) / (probability of a random sequence to be a proof of **s** or not-**s**)

Nice, but it doesn't solve the Loebian obstacle yet.
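The weighting scheme above can be sketched numerically (the proofs and lengths below are invented for illustration):

```python
# Toy computation of the proof-weighted probability: each proof gets
# weight 2^{-length} under a prefix-free encoding, and P(s) is the
# total weight proving s divided by the weight proving s or not-s.

# (what the proof concludes, encoded length in bits) - invented data.
proofs = [
    ("s", 10),
    ("s", 14),
    ("not-s", 12),
]

def weight(target):
    return sum(2 ** -length for concl, length in proofs if concl == target)

P_s = weight("s") / (weight("s") + weight("not-s"))
print(round(P_s, 3))  # P(s) = 17/21, about 0.81
```

Shorter proofs dominate, exactly as shorter programs dominate a Solomonoff prior; an inconsistent theory simply contributes weight to both sides of the ratio.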

I will now formulate an extension of this idea that allows assigning an *interval* of probabilities [P_{min}(**s**), P_{max}(**s**)] to any sentence **s**. This interval is a sort of "Knightian uncertainty". I have some speculations about how to extract a single number from this interval in the general case, but even without that, I believe that P_{min}(**s**) = P_{max}(**s**) in many interesting cases.

First, the general setting:

- With every sentence **s**, there are certain texts **v** which are considered to be "evidence relevant to **s**". These are divided into "negative" and "positive" evidence. We define sgn(**v**) := +1 for positive evidence and sgn(**v**) := -1 for negative evidence.
- Each piece of evidence **v** is associated with a strength str_{s}(**v**), which is a number in [0, 1].
- Each piece of evidence **v** is associated with an "energy" function e_{s,v}: [0, 1] -> [0, 1]. It is a continuous convex function.
- The "total energy" associated with **s** is defined to be e_{s} := Σ_{v} 2^{-l(v)} e_{s,v}, where l(**v**) is the length of **v**.
- Since the e_{s,v} are continuous and convex, so is e_{s}. Hence it attains its minimum on a closed interval, which is [P_{min}(**s**), P_{max}(**s**)] by definition.
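A quick numerical sketch (with toy evidence pieces of my own choosing, matching the "P_min >= p" and "P_max <= p" energy shapes defined below) shows how a convex total energy can attain its minimum on a whole interval, yielding a nontrivial [P_min, P_max]:

```python
# Two toy evidence pieces: one penalizes q below 0.3 (like evidence
# for "P_min(s) >= 0.3") and one penalizes q above 0.7 (like evidence
# for "P_max(s) <= 0.7"). Both are continuous and convex, so the total
# energy is convex and its minimizers form a closed interval.

def total_energy(q):
    e_low = (0.3 - q) ** 2 if q < 0.3 else 0.0   # pushes q up toward 0.3
    e_high = (q - 0.7) ** 2 if q > 0.7 else 0.0  # pushes q down toward 0.7
    return e_low + e_high

grid = [i / 1000 for i in range(1001)]
energies = [total_energy(q) for q in grid]
m = min(energies)
minimizers = [q for q, e in zip(grid, energies) if e <= m + 1e-12]
print(min(minimizers), max(minimizers))  # -> 0.3 0.7
```

Here P_min = 0.3 and P_max = 0.7: the evidence pins the probability into an interval without selecting a single point, which is exactly the Knightian-uncertainty reading above.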

- A piece of evidence **v** for **s** is defined to be one of the following:
  - a proof of **s**: sgn(**v**) := +1; str_{s}(**v**) := 1; e_{s,v}(q) := (1 - q)^{2}
  - a proof of not-**s**: sgn(**v**) := -1; str_{s}(**v**) := 1; e_{s,v}(q) := q^{2}
  - a piece of positive evidence for the sentence R_{-+}(**s**, p) := "P_{min}(**s**) >= p": sgn(**v**) := +1; str_{s}(**v**) := str_{R-+(s, p)}(**v**) p; e_{s,v}(q) := 0 for q > p, str_{R-+(s, p)}(**v**) (q - p)^{2} for q < p
  - a piece of negative evidence for the sentence R_{--}(**s**, p) := "P_{min}(**s**) < p": sgn(**v**) := +1; str_{s}(**v**) := str_{R--(s, p)}(**v**) p; e_{s,v}(q) := 0 for q > p, str_{R--(s, p)}(**v**) (q - p)^{2} for q < p
  - a piece of negative evidence for the sentence R_{++}(**s**, p) := "P_{max}(**s**) > p": sgn(**v**) := -1; str_{s}(**v**) := str_{R++(s, p)}(**v**) (1 - p); e_{s,v}(q) := 0 for q < p, str_{R++(s, p)}(**v**) (q - p)^{2} for q > p
  - a piece of positive evidence for the sentence R_{+-}(**s**, p) := "P_{max}(**s**) <= p": sgn(**v**) := -1; str_{s}(**v**) := str_{R+-(s, p)}(**v**) (1 - p); e_{s,v}(q) := 0 for q < p, str_{R+-(s, p)}(**v**) (q - p)^{2} for q > p

*Technicality:* I suggest that for our purposes, a "proof of **s**" is allowed to be a proof of a sentence equivalent to **s** in 0-th order logic (e.g. not-not-**s**). This ensures that our probability intervals obey the properties we'd like them to obey wrt propositional calculus.

Returning to Lucy: she can now reason, "Lucy_{2} uses the same formal system as I. If she decides to take action **a**, it's because she has strong evidence for the sentence **s**(**a**) that '**a** increases expected utility'. I just proved that there would be strong evidence for the expected utility increasing. Therefore, the expected utility would have a high value with high logical probability. But evidence for high logical probability of a sentence is evidence for the sentence itself. Therefore, I now have evidence that expected utility will increase!"

## Updateless Intelligence Metrics in the Multiverse

Followup to: Intelligence Metrics with Naturalized Induction using UDT

In the previous post I defined an intelligence metric that solves the duality (aka naturalized induction) and ontology problems in AIXI. That model used a formalization of UDT based on Benja's model of logical uncertainty. In the current post I am going to:

- Explain some problems with my previous model (*this section can be skipped if you don't care about the previous model and only want to understand the new one*).
- Formulate a new model solving these problems. Incidentally, the new model is much closer to the usual way UDT is represented. It is also based on a different model of logical uncertainty.
- Show how to define intelligence without specifying the utility function a priori.
- Outline a method for constructing utility functions formulated in abstract ontology, i.e. well-defined on the entire Tegmark level IV multiverse. The new model requires such utility functions, and they are generally difficult to construct (the ontology problem resurfaces in a different form).

# Problems with UIM 1.0

The previous model postulated that naturalized induction uses a version of Solomonoff induction updated in the direction of an innate model **N** with a temporal confidence parameter **t**. This entails several problems:

- The dependence on the parameter **t**, whose relevant value is not easy to determine.
- Conceptual divergence from the UDT philosophy that we should not update *at all*.
- Difficulties with counterfactual mugging and acausal trade scenarios in which **G** doesn't exist in the "other universe".
- Once **G** discovers even a small violation of **N** at a very early time, it loses all ground for trusting its own mind. Effectively, **G** would find itself in the position of a Boltzmann brain. This is especially dangerous when **N** over-specifies the hardware running **G**'s mind. For example, assume **N** specifies **G** to be a human brain modeled on the level of quantum field theory (particle physics). If **G** discovers that in truth it is a computer simulation on the merely molecular level, it loses its epistemic footing completely.

# UIM 2.0

I now propose the following intelligence metric (the formula goes first and then I explain the notation):

**I**_{U}(**q**) := E_{T}[E_{D}[E_{L}[**U**(**Y**(**D**)) | **Q**(**X**(**T**)) = **q**]] | **N**]

- **N** is the "ideal" model of the mind of the agent **G**. For example, it can be a universal Turing machine **M** with special "sensory" registers **e** whose values can change arbitrarily after each step of **M**. **N** is specified as a system of constraints on an infinite sequence of natural numbers **X**, which should be thought of as the "Platonic ideal" realization of **G**, i.e. an imaginary realization which cannot be tampered with by external forces such as anvils. As we shall see, this "ideal" serves as a template for "physical" realizations of **G** which *are* prone to violations of **N**.
- **Q** is a function that decodes **G**'s code from **X**, e.g. the program loaded in **M** at time 0.
- **q** is a particular value of this code whose (utility-specific) intelligence **I**_{U}(**q**) we are evaluating.
- **T** is a random (as in random variable) computable hypothesis about the "physics" of **X**, i.e. a program computing **X** implemented on some fixed universal computing model (e.g. a universal Turing machine) **C**. **T** is distributed according to the Solomonoff measure; however, the expectation value in the definition of **I**_{U}(**q**) is conditional on **N**, i.e. we restrict to programs which are compatible with **N**. From the UDT standpoint, **T** is the decision algorithm itself, and the uncertainty in **T** is "introspective" uncertainty, i.e. the uncertainty of the putative precursor agent **PG** (the agent creating **G**, e.g. an AI programmer) regarding her own decision algorithm. Note that we don't actually *need* to postulate a **PG** which is "agenty" (i.e. use for **N** a model of AI hardware together with a model of the AI programmer programming this hardware); we can be content to remain in a more abstract framework.
- **D** is a random computable hypothesis about the physics of **Y**, where **Y** is an infinite sequence of natural numbers representing the physical (as opposed to "ideal") universe. **D** is distributed according to the Solomonoff measure, and the respective expectation value is unconditional (i.e. we use the raw Solomonoff prior for **Y**, which makes the model truly updateless). In UDT terms, **D** is indexical uncertainty.
- **U** is a computable function from infinite sequences of natural numbers to [0, 1] representing **G**'s utility function.
- **L** represents logical uncertainty. It can be defined by the model explained by cousin_it here, together with my previous construction for computing logical expectation values of random variables in [0, 1]. That is, we define E_{L}(**d**_{k}) to be the probability that a random string of bits **p** encodes a proof of the sentence "**Q**(**X**(**T**)) = **q** implies that the k-th digit of **U**(**Y**(**D**)) is 1" in some prefix-free encoding of proofs, *conditional* on **p** encoding a proof of either that sentence or the sentence "**Q**(**X**(**T**)) = **q** implies that the k-th digit of **U**(**Y**(**D**)) is 0". We then define

E_{L}[**U**(**Y**(**D**)) | **Q**(**X**(**T**)) = **q**] := Σ_{k} 2^{-k} E_{L}(**d**_{k}). Here, the sentences and the proofs belong to some fixed formal logic **F**, e.g. Peano arithmetic or ZFC.
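Once the digit probabilities E_{L}(**d**_{k}) are in hand, the expectation value is just a weighted sum. A minimal sketch (the function name is mine):

```python
def expectation_from_digit_probs(digit_probs):
    """Digit-wise expectation: E = sum_k 2^{-k} P(d_k = 1),
    where digit_probs[k-1] is the probability that the k-th
    binary digit of the quantity equals 1."""
    return sum(p * 2.0 ** -(k + 1) for k, p in enumerate(digit_probs))
```

For instance, if the first three digits are each 1 with certainty (and the rest are ignored), the sum is 1/2 + 1/4 + 1/8 = 0.875.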

## Discussion

- **G**'s mental architecture **N** is defined in the "ideal" universe **X**, where it is inviolable. However, **G**'s utility function **U** inhabits the physical universe **Y**. This means that a highly intelligent **q** is designed so that imperfect realizations of **G** inside **Y** generate as many utilons as possible.
- A typical **T** is a low Kolmogorov complexity universe which contains a perfect realization of **G**. **Q**(**X**(**T**)) is **L**-correlated to the programming of imperfect realizations of **G** inside **Y** because **T** serves as an effective (approximate) model of the formation of these realizations. For abstract **N**, this means **q** is highly intelligent when a Solomonoff-random "**M**-programming process" producing **q** entails a high expected value of **U**.
- Solving the Loebian obstacle requires a more sophisticated model of logical uncertainty. *I think I can formulate such a model. I will explain it in another post after more contemplation.*
- It is desirable that the encoding of proofs **p** satisfies a universality property so that the length of the encoding can only change by an additive constant, analogously to the weak dependence of Kolmogorov complexity on **C**. It is in fact not difficult to formulate this property and show the existence of appropriate encodings. I will discuss this point in more detail in another post.

# Generic Intelligence

It seems conceptually desirable to have a notion of intelligence independent of the specifics of the utility function. Such an intelligence metric can be constructed in a way analogous to what I did in UIM 1.0; however, it is no longer a special case of the utility-specific metric.

Assume **N** to consist of a machine **M** connected to a special storage device **E**. Assume further that at **X**-time 0, **E** contains a valid **C**-program **u** realizing a utility function **U**, but that this is the only constraint on the initial content of **E** imposed by **N**. Define

**I**(**q**) := E_{T}[E_{D}[E_{L}[**u**(**Y**(**D**); **X**(**T**)) | **Q**(**X**(**T**)) = **q**]] | **N**]

Here, **u**(**Y**(**D**); **X**(**T**)) means that we decode **u** from **X**(**T**) and evaluate it on **Y**(**D**). Thus utility depends both on the physical universe **Y** and on the ideal universe **X**. This means **G** is not precisely a UDT agent but rather a "proto-agent": only when a realization of **G** reads **u** from **E** does it know which other realizations of **G** in the multiverse (the Solomonoff ensemble from which **Y** is selected) should be considered the "same" agent UDT-wise.

Incidentally, this can be used as a formalism for reasoning about agents that don't know their own utility functions. I believe this has important applications in metaethics, which I will discuss in another post.

# Utility Functions in the Multiverse

UIM 2.0 is a formalism that cures the diseases of UIM 1.0, at the price of losing **N** as the ontology for utility functions. We need the utility function to be defined on the entire multiverse, i.e. on any sequence of natural numbers. I will outline a way to extend "ontology-specific" utility functions to the multiverse through a simple example.

Suppose **G** is an agent that cares about universes realizing the Game of Life, its utility function **U** corresponding to e.g. some sort of glider maximization with exponential temporal discount. Fix a specific way **DC** to decode any **Y** into a history of a 2D cellular automaton with two cell states ("dead" and "alive"). Our multiversal utility function **U*** assigns **Y**s for which **DC**(**Y**) is a legal Game of Life the value **U**(**DC**(**Y**)). All other **Y**s are treated by dividing the cells into cells **O** obeying the rules of Life and cells **V** violating the rules of Life. We can then evaluate **U** on **O** only (assuming it has some sort of locality) and assign **V** utility by some other rule, e.g.:

- zero utility
- constant utility per **V** cell with temporal discount
- constant utility per unit of surface area of the boundary between **O** and **V** with temporal discount

**U*(Y)** is then defined to be the sum of the values assigned to **O(Y)** and **V(Y)**.
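The additive O/V accounting can be sketched in code. This is a hypothetical toy version of the construction (made-up utility constants, a toroidal grid standing in for **DC**(**Y**), and live-cell counting standing in for glider maximization), not the post's exact proposal:

```python
# Toy sketch: split cells into law-abiding (O) and law-violating (V)
# and combine their utilities additively with temporal discount.

def life_next(grid, x, y):
    """Game of Life update for cell (x, y) on a toroidal grid."""
    n = sum(grid[(x + dx) % len(grid)][(y + dy) % len(grid[0])]
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0))
    return n == 3 or (grid[x][y] and n == 2)

def u_star(history, u_O=1.0, u_V=-0.5, discount=0.9):
    """Each live O-cell (one whose update obeyed the Life rules)
    earns u_O; each V-cell (one whose update violated them) earns
    u_V, with exponential temporal discount."""
    total = 0.0
    for t in range(1, len(history)):
        prev, cur = history[t - 1], history[t]
        for x in range(len(cur)):
            for y in range(len(cur[0])):
                lawful = (cur[x][y] == life_next(prev, x, y))
                cell_u = (u_O * cur[x][y]) if lawful else u_V
                total += discount ** t * cell_u
    return total
```

On a fully lawful history only the u_O term contributes; any cell that breaks the rules drags the total down via u_V, which is the additive accounting described above.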

## Discussion

- The construction of **U*** depends on the choice of **DC**. However, **U*** only depends on **DC** weakly: given a hypothesis **D** which produces a Game of Life wrt some other low complexity encoding, there is a corresponding hypothesis **D'** producing a Game of Life wrt **DC**. **D'** is obtained from **D** by appending a corresponding "transcoder", and thus it is only less Solomonoff-likely than **D** by an O(1) factor.
- Since the accumulation between **O** and **V** is additive rather than e.g. multiplicative, a **U***-agent doesn't behave as if it a priori *expects* the universe to follow the rules of Life, but it may have strong preferences about the universe actually doing so.
- This construction is reminiscent of Egan's dust theory in the sense that all possible encodings contribute. However, here they are weighted by the Solomonoff measure.

# TLDR

The intelligence of a physicalist agent is defined to be the UDT-value of the "decision" to create the agent by the process creating the agent. The process is selected randomly from a Solomonoff measure conditional on obeying the laws of the hardware on which the agent is implemented. The "decision" is made in an "ideal" universe in which the agent is Cartesian, but the utility function is evaluated on the real universe (raw Solomonoff measure). The interaction between the two "universes" is purely via logical conditional probabilities (acausal).

If we want to discuss intelligence without specifying a utility function up front, we allow the "ideal" agent to read a program describing the utility function from a special storage immediately after "booting up".

Utility functions in the Tegmark level IV multiverse are defined by specifying a "reference universe", specifying an encoding of the reference universe and extending a utility function defined on the reference universe to encodings which violate the reference laws by summing the utility of the portion of the universe which obeys the reference laws with some function of the space-time shape of the violation.

## How to Study Unsafe AGIs Safely (and why we might have no choice)

**TL;DR**

A serious possibility is that the first AGI(s) will be developed in a Manhattan Project style setting before any sort of friendliness/safety constraints can be integrated reliably. They will also be substantially short of the intelligence required to exponentially self-improve. Within a certain range of development and intelligence, containment protocols can make them safe to interact with. This means they can be studied experimentally, and the architecture(s) used to create them better understood, furthering the goal of safely using AI in less constrained settings.

**Setting the Scene**

*The year is 2040, and in the last decade a series of breakthroughs in neuroscience, cognitive science, machine learning, and computer hardware have put the long-held dream of a human-level artificial intelligence within our grasp. Following the wild commercial success of lifelike robotic pets, the integration into everyday work and leisure of AI assistants and concierges, and STUDYBOT's graduation from Harvard's online degree program with an octuple major and full honors, DARPA, the NSF and the European Research Council have announced joint funding of an artificial intelligence program that will create a superhuman intelligence in 3 years.*

*Safety was announced as a critical element of the project, especially in light of the self-modifying LeakrVirus that catastrophically disrupted markets in '36 and '37. The planned protocols have not been made public, but it seems they will be centered on traditional computer security rather than techniques from the nascent field of Provably Safe AI, which were deemed impossible to integrate on the current project timeline.*

**Technological and/or Political issues could force the development of AI without theoretical safety guarantees that we'd certainly like, but there is a silver lining**

A lot of the discussion around LessWrong and MIRI that I've seen (and I haven't seen all of it, please send links!) seems to focus very strongly on the situation of an AI that can self-modify or construct further AIs, resulting in an exponential explosion of intelligence (FOOM/Singularity). The focus on FAI is on finding an architecture that can be explicitly constrained (and a constraint set that won't fail to do what we desire).

My argument is essentially that there could be a critical multi-year period preceding any possible exponentially self-improving intelligence during which a series of AGIs of varying intelligence, flexibility and architecture will be built. This period will be fast and frantic, but it will be incredibly fruitful and vital both in figuring out how to make an AI sufficiently strong to exponentially self-improve and in how to make it safe and friendly (or develop protocols to bridge the even riskier period between when we can develop FOOM-capable AIs and when we can ensure their safety).

- First, why is a substantial period of proto-singularity more likely than a straight-to-singularity situation?
- Second, what strategies will be critical to developing, controlling, and learning from these pre-FOOM AIs?
- Third, what are the political challenges that will develop immediately before and during this period?

**Why is a proto-singularity likely?**

The requirement for a hard singularity, an exponentially self-improving AI, is that the AI can substantially improve itself in a way that enhances its ability to further improve itself, which requires the ability to modify its own code; access to resources like time, data, and hardware to facilitate these modifications; and the intelligence to execute a fruitful self-modification strategy.

The first two conditions can (and should) be directly restricted. I'll elaborate more on that later, but basically any AI should be very carefully sandboxed (unable to affect its software environment), and should have its access to resources strictly controlled. Perhaps no data goes in without human approval or while the AI is running. Perhaps nothing comes out either. Even a hyperpersuasive hyperintelligence will be slowed down (at least) if it can only interact with prespecified tests (how do you test AGI? No idea, but it shouldn't be harder than friendliness). This isn't a perfect situation. Eliezer Yudkowsky presents several arguments for why an intelligence explosion could happen even when resources are constrained (see Section 3 of Intelligence Explosion Microeconomics), not to mention ways that those constraints could be defied even if engineered perfectly (by the way, I would happily run the AI box experiment with anybody; I think it is absurd that anyone would fail it! [I've read Tuxedage's accounts, and I think I actually do understand how a gatekeeper could fail, but I also believe I understand how one could be trained to succeed even against a much stronger foe than any person who has played the part of the AI]).

But the third emerges from the way technology typically develops. *I believe it is incredibly unlikely that an AGI will develop in somebody's basement, or even in a small national lab or top corporate lab.* When there is no clear notion of what a technology will look like, it is usually not developed. Positive, productive accidents are somewhat rare in science, but they are remarkably rare in engineering (please, give counterexamples!). The creation of an AGI will likely not happen by accident; there will be a well-funded, concrete research and development plan that leads up to it: an AI Manhattan Project like the one described above. But even when there is a good plan successfully executed, prototypes are slow, fragile, and poor-quality compared to what is possible even with approaches using the same underlying technology. It seems very likely to me that the first AGI will be a Chicago Pile, not a Trinity; recognizably a breakthrough, but with proper consideration not immediately dangerous or unmanageable. [Note: you don't have to believe this to read the rest of this post. If you disagree, consider the virtues of redundancy and the question of what safety measures an AI development effort should implement if they can't be persuaded to delay long enough for theoretically sound methods to become available.]

A Manhattan Project style effort makes a relatively weak, controllable AI even more likely: not only can such a project implement substantial safety protocols that are explicitly researched in parallel with primary development, but the total resources, in hardware and brainpower, devoted to the AI will be much greater than in a smaller project, setting a correspondingly higher bar the AGI thus created must reach in order to successfully self-modify exponentially and also break the security procedures.

**Strategies to handle AIs in the proto-Singularity, and why they're important**

First, take a look at the External Constraints section of this MIRI report and/or this article on AI Boxing. I will be talking mainly about these approaches. There are certainly others, but these are the easiest to extrapolate from current computer security.

These AIs will provide us with the experimental knowledge to better handle the construction of even stronger AIs. If careful, we will be able to use these proto-Singularity AIs to learn about the nature of intelligence and cognition, to perform economically valuable tasks, and to test theories of friendliness (not perfectly, but well enough to start).

"If careful" is the key phrase. I mentioned sandboxing above, and computer security is key to any attempt to contain an AI: monitoring the source code and setting a threshold at which too much changing too fast trips a failsafe that freezes all computation; keeping extremely strict control over copies of the source. Some architectures will be more inherently dangerous and less predictable than others. A simulation of a physical brain, for instance, will be fairly opaque (depending on how far neuroscience has gone) but could have almost no potential to self-improve to an uncontrollable degree if its access to hardware is limited (it won't be able to make itself much more efficient on fixed resources). Other architectures will have other properties. Some will be utility optimizing agents. Some will have behaviors but no clear utility. Some will be opaque, some transparent.
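The "freeze if too much changes too fast" failsafe mentioned above can be sketched as follows. This is a toy illustration only; the change threshold and the snapshot interface are hypothetical:

```python
import difflib

# Toy sketch of a source-code monitor: a failsafe trips when the
# fraction of the source that changed between snapshots exceeds
# a preset threshold (the threshold value is made up).

class SourceMonitor:
    def __init__(self, max_change_ratio=0.05):
        self.max_change_ratio = max_change_ratio
        self.last = None

    def check(self, source: str) -> bool:
        """Return True if execution may continue, False to freeze."""
        if self.last is not None:
            changed = 1.0 - difflib.SequenceMatcher(
                None, self.last, source).ratio()
            if changed > self.max_change_ratio:
                return False  # failsafe: too much changed too fast
        self.last = source
        return True
```

A real containment setup would of course need tamper-proof snapshotting and hardware-level enforcement; the point here is only the shape of the threshold rule.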

All will have a theory of how they operate, which can be refined by actual experimentation. This is what we can gain! We can set up controlled scenarios like honeypots to catch malevolence. We can evaluate our ability to monitor and read the thoughts of the AGI. We can develop stronger theories of how damaging self-modification actually is to imposed constraints. We can test our ability to add constraints to even the base state. But do I really have to justify the value of experimentation?

I am familiar with criticisms based on absolutely incomprehensibly perceptive and persuasive hyperintelligences being able to overcome any security, but I've tried to outline above why I don't think we'd be dealing with that case.

**Political issues**

Right now AGI is really a political non-issue: blue sky even compared to space exploration and fusion, both of which actually receive substantial government funding. I think that this will change in the period immediately leading up to my hypothesized AI Manhattan Project. The AI Manhattan Project can only happen with a lot of political will behind it, which will probably mean a spiral of scientific advancements, hype and threat of competition from external unfriendly sources. Think space race.

So suppose that the first few AIs are built under well controlled conditions. Friendliness is still not perfected, but we think/hope we've learned some valuable basics. But now people want to use the AIs for something. So what should be done at this point?

I won't try to speculate what happens next (well you can probably persuade me to, but it might not be as valuable), beyond extensions of the protocols I've already laid out, hybridized with notions like Oracle AI. It certainly gets a lot harder, but hopefully experimentation on the first, highly-controlled generation of AI to get a better understanding of their architectural fundamentals, combined with more direct research on friendliness in general would provide the groundwork for this.

## Intelligence Metrics with Naturalized Induction using UDT

Followup to: Intelligence Metrics and Decision Theory

Related to: Bridge Collapse: Reductionism as Engineering Problem

A central problem in AGI is giving a formal definition of intelligence. Marcus Hutter has proposed AIXI as a model of a perfectly intelligent agent. Legg and Hutter have defined a quantitative measure of intelligence applicable to any suitably formalized agent, such that AIXI is the agent with maximal intelligence according to this measure.

Legg-Hutter intelligence suffers from a number of problems I have previously discussed, the most important being:

- The formalism is inherently Cartesian. Solving this problem is known as naturalized induction and it is discussed in detail here.
- The utility function Legg & Hutter use is a formalization of reinforcement learning, while we would like to consider agents with arbitrary preferences. Moreover, a real AGI designed with reinforcement learning would tend to wrest control of the reinforcement signal from the operators (*there must be a classic reference on this but I can't find it. Help?*). It is straightforward to tweak the formalism to allow for any utility function which depends on the agent's sensations and actions; however, we would like to be able to use any ontology for defining it.

*too* general; in particular, there is no Solomonoff induction or any analogue thereof: instead, a completely general probability measure is used.

- Define a formalism for logical uncertainty. *When I started writing this I thought this formalism might be novel, but now I see it is essentially the same as that of Benja.*
- Use this formalism to define a non-constructive formalization of UDT. By "non-constructive" I mean something that assigns values to actions rather than a specific algorithm like here.
- Apply the formalization of UDT to my quasi-Solomonoff framework to yield an intelligence metric.
- Slightly modify my original definition of the quasi-Solomonoff measure so that the confidence of the innate model becomes a continuous rather than discrete parameter. This leads to an interesting conjecture.
- Propose a "preference agnostic" variant as an alternative to Legg & Hutter's reinforcement learning.
- Discuss certain anthropic and decision-theoretic aspects.

# Logical Uncertainty

*The formalism introduced here was originally proposed by Benja.*

Fix a formal system **F**. We want to be able to assign probabilities to statements **s** in **F**, taking into account limited computing resources. Fix **D** a natural number related to the amount of computing resources that I call "depth of analysis".

Define P_{0}(**s**) := 1/2 for all **s** to be our initial prior, i.e. each statement's truth value is decided by a fair coin toss. Now define

P_{D}(**s**) := P_{0}(**s** | there are no contradictions of length <= **D**).

Consider **X** to be a number in [0, 1] given by a definition in **F**. Then **d**_{k}(**X**) := "The k-th digit of the binary expansion of **X** is 1" is a statement in **F**. We define E_{D}(**X**) := Σ_{k} 2^{-k} P_{D}(**d**_{k}(**X**)).
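As a toy illustration of conditioning the coin-toss prior on the absence of short contradictions, here is a minimal sketch (my own example; the three "statements" and the single inference rule are hypothetical stand-ins for a real formal system **F**):

```python
from itertools import product

# Toy "formal system": three statements, with modus ponens as the
# only source of (short) contradictions.
statements = ["a", "a->b", "b"]

def consistent(assignment):
    # A short contradiction arises exactly when "a" and "a->b" are
    # both assigned true while "b" is assigned false.
    truth = dict(zip(statements, assignment))
    return not (truth["a"] and truth["a->b"] and not truth["b"])

# P_0: each statement's truth value is an independent fair coin toss.
worlds = list(product([True, False], repeat=len(statements)))
surviving = [w for w in worlds if consistent(w)]

def p_D(s):
    """P_D(s) = P_0(s | no contradiction of length <= D), where D is
    large enough to detect the modus ponens contradiction."""
    i = statements.index(s)
    return sum(1 for w in surviving if w[i]) / len(surviving)
```

Of the 8 coin-toss worlds, one (a true, a->b true, b false) is excluded, so conditioning shifts the probabilities away from 1/2: "a" drops to 3/7 while "b" rises to 4/7.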

## Remarks

- Clearly if **s** is provable in **F** then for **D** >> 0, P_{D}(**s**) = 1. Similarly, if "not **s**" is provable in **F** then for **D** >> 0, P_{D}(**s**) = 0.
- If each digit of **X** is decidable in **F** then lim_{D -> inf} E_{D}(**X**) exists and equals the value of **X** according to **F**.
- For **s** of length > **D**, P_{D}(**s**) = 1/2, since no contradiction of length <= **D** can involve **s**.
- It is an interesting question whether lim_{D -> inf} P_{D}(**s**) exists for every **s**. It seems false that this limit always exists *and* equals 0 or 1, i.e. this formalism is *not* a loophole in Goedel incompleteness. To see this, consider statements that require a high (arithmetical hierarchy) order halting oracle to decide.
- In computational terms, **D** corresponds to non-deterministic spatial complexity. It is spatial since we assign truth values simultaneously to all statements, so in any given contradiction it is enough to retain the "thickest" step. It is non-deterministic since it is enough for a contradiction to exist; we don't have an actual computation which produces it. I suspect this can be made more formal using the Curry-Howard isomorphism; unfortunately, I don't understand the latter yet.

# Non-Constructive UDT

Consider **A** a decision algorithm for optimizing utility **U**, producing an output ("decision") which is an element of **C**. Here **U** is just a constant defined in **F**. We define the **U**-value of **c** in **C** for **A** at depth of analysis **D** to be

V_{D}(**c**, **A**; **U**) := E_{D}(**U** | "**A** produces **c**" is true). It is only well defined as long as "**A** doesn't produce **c**" cannot be proved at depth of analysis **D** i.e. P_{D}("**A** produces **c**") > 0. We define the *absolute* **U**-value of **c** for **A** to be

V(**c**, **A**; **U**) := E_{D(c, A)}(**U** | "**A** produces **c**" is true) where **D**(**c**, **A**) := max {**D** | P_{D}("**A** produces **c**") > 0}. Of course **D**(**c**, **A**) can be infinite in which case E_{inf}(...) is understood to mean lim_{D -> inf} E_{D}(...).

For example, V(**c**, **A**; **U**) yields the natural values for **A** an ambient control algorithm applied to e.g. a simple model of Newcomb's problem. To see this, note that given **A**'s output the value of **U** can be determined at low depths of analysis, whereas the output of **A** requires a very high depth of analysis to determine.

# Naturalized Induction

Our starting point is the "innate model" **N**: a certain a priori model of the universe including the agent **G**. This model encodes the universe as a sequence of natural numbers **Y** = (**y**_{k}) which obeys either specific deterministic or non-deterministic dynamics or at least some constraints on the possible histories. It may or may not include information on the initial conditions. For example, **N** can describe the universe as a universal Turing machine **M** (representing **G**) with special "sensory" registers **e**. **N** constraints the dynamics to be compatible with the rules of the Turing machine but leaves unspecified the behavior of **e**. Alternatively, **N** can contain in addition to **M** a non-trivial model of the environment. Or **N** can be a cellular automaton with the agent corresponding to a certain collection of cells.

However, **G**'s confidence in **N** is limited: otherwise it wouldn't need induction. We cannot start with 0 confidence: it's impossible to program a machine if you don't have even a guess of how it works. Instead we introduce a positive real number **t** which represents the timescale over which **N** is expected to hold. We then assign to each hypothesis **H** about **Y** (you can think of them as programs which compute **y**_{k} given **y**_{j} for j < k; more on that later) the weight QS(**H**) := 2^{-L(**H**)} (1 - e^{-t(**H**)/**t**}). Here L(**H**) is the length of **H**'s encoding in bits and t(**H**) is the time during which **H** remains compatible with **N**. This is defined for **N** of deterministic / constraint type but can be generalized to stochastic **N**.

The weights QS(**H**) define a probability measure on the space of hypotheses which induces a probability measure on the space of histories **Y**. Thus we get an alternative to Solomonoff induction which allows for **G** to be a mechanistic part of the universe, at the price of introducing **N** and **t**.
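For toy hypotheses the weights QS(**H**) are easy to compute. A minimal sketch (the hypothesis names, description lengths and compatibility times are invented for illustration):

```python
import math

def qs_weight(length_bits, t_compat, t_scale):
    """QS(H) = 2^{-L(H)} * (1 - exp(-t(H)/t)): a simplicity prior,
    damped for hypotheses that stop being compatible with the
    innate model N early relative to the timescale t."""
    return 2.0 ** (-length_bits) * (1.0 - math.exp(-t_compat / t_scale))

# Hypothetical hypotheses: (name, description length in bits,
# time during which H stays compatible with N).
hyps = [("H1", 10, 1e6), ("H2", 8, 3.0), ("H3", 12, 1e6)]
t = 100.0
raw = {name: qs_weight(L, tc, t) for name, L, tc in hyps}
total = sum(raw.values())
probs = {name: w / total for name, w in raw.items()}
```

Note how H2, despite being the shortest program, ends up with the smallest weight: it contradicts **N** after only 3 time steps, so the (1 - e^{-t(H)/t}) factor crushes it.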

## Remarks

- Note that time is discrete in this formalism but **t** is continuous.
- Since we're later going to use logical uncertainties wrt the formal system **F**, it is tempting to construct the hypothesis space out of predicates in **F** rather than programs.

# Intelligence Metric

To assign intelligence to agents we need to add two ingredients:

- The decoding **Q**: {**Y**} -> {bit-string} of the agent **G** from the universe **Y**. For example, **Q** can read off the program loaded into **M** at time k=0.
- A utility function **U**: {**Y**} -> [0, 1] representing **G**'s preferences. **U** has to be given by a definition in **F**. Note that **N** provides the ontology wrt which **U** is defined.

A naive definition of the intelligence metric would be E_{QS}(**U** | **Q**), the conditional expectation value of **U** for a given value of **Q** in the quasi-Solomonoff measure. However, this is wrong for roughly the same reasons EDT is wrong (see previous post for details).

Instead, we define I(**Q**_{0}) := E_{QS}(E_{max}(**U**(**Y**(**H**)) | "**Q**(**Y**(**H**)) = **Q**_{0}" is true)). Here the subscript max stands for maximal depth of analysis, as in the construction of absolute UDT value above.

## Remarks

- IMO the correct way to look at this is intelligence metric = value of decision for the decision problem "what should I program into my robot?". If **N** is a highly detailed model including "me" (the programmer of the AI), this literally becomes the case. However, for theoretical analysis it is likely to be more convenient to work with a simple **N** (conceptually, this also leaves room for a "purist" notion of the agent's intelligence, decoupled from the fine details of its creator).
- As opposed to usual UDT, the algorithm (**H**) making the decision (**Q**) is not known with certainty. I think this represents a real uncertainty that has to be taken into account in decision problems in general: the decision-maker doesn't know her own algorithm. Since this "introspective uncertainty" is highly correlated with "indexical" uncertainty (uncertainty about the universe), it prevents us from absorbing the latter into the utility function as proposed by Coscott.
- For high values of **t**, **G** can improve its understanding of the universe by bootstrapping the knowledge it already has. This is not possible for low values of **t**. In other words, if I cannot trust my mind at all, I cannot deduce anything. This leads me to an interesting conjecture: there is a critical value **t*** of **t** from which this bootstrapping becomes possible (the positive feedback loop of knowledge becomes critical). I(**Q**) is non-smooth at **t*** (phase transition).
- If we wish to understand intelligence, it might be beneficial to decouple it from the choice of preferences. To achieve this we can introduce the preference formula as an unknown parameter in **N**. For example, if **G** is realized by a machine **M**, we can connect **M** to a data storage **E** whose content is left undetermined by **N**. We can then define **U** by the formula encoded in **E** at time k=0. This leads to I(**Q**) being a sort of "general-purpose" intelligence while avoiding the problems associated with reinforcement learning.
- As opposed to Legg-Hutter intelligence, there appears to be no simple explicit description of the **Q*** maximizing I(**Q**) (e.g. among all programs of a given length). This is not surprising, since computational cost considerations come into play. In this framework it appears to be inherently impossible to decouple the computational cost considerations: **G**'s computations have to be realized mechanistically and therefore cannot be free of time cost and side-effects.
- Ceteris paribus, **Q*** deals efficiently with problems like counterfactual mugging. The "ceteris paribus" qualifier is necessary here since, because of the cost and side-effects of computations, it is difficult to make absolute claims. However, it doesn't deal efficiently with counterfactual mugging in which **G** doesn't exist in the "other universe". This is because the ontology used for defining **U** (which is given by **N**) assumes **G** *does* exist. At least this is the case for simple ontologies like those described above: possibly we can construct an **N** in which **G** might or might not exist. Also, if **G** uses a *quantum* ontology (i.e. **N** describes the universe in terms of a wavefunction and **U** computes the quantum expectation value of an operator) then it *does* take into account other Everett universes in which **G** doesn't exist.
- For many choices of **N** (for example if **G** is realized by a machine **M**), QS-induction assigns well-defined probabilities to subjective expectations, contrary to what is expected from UDT. However:
  - This is not the case for all **N**. In particular, if **N** admits destruction of **M**, then **M**'s sensations after the point of destruction are not well-defined. Indeed, we had better allow for destruction of **M** if we want **G**'s preferences to behave properly in such an event. That is, if we *don't* allow it, we get a "weak anvil problem" in the sense that **G** experiences an ontological crisis when discovering its own mortality, and the outcome of this crisis is not obvious. Note though that it is not the same as the original ("strong") anvil problem; for example, **G** might come to the conclusion that the dynamics of "**M**'s ghost" will be some sort of random process.
  - These probabilities probably depend significantly on **N** and don't amount to an elegant universal law for solving the anthropic trilemma.
- Indeed, this framework is not completely "updateless"; it is "partially updated" by the introduction of **N** and **t**. This suggests we might want the updates to be minimal in some sense; in particular, **t** should be **t***.
- The framework suggests there is no conceptual problem with cosmologies in which Boltzmann brains are abundant. **Q*** wouldn't think it is a Boltzmann brain, since the long address of Boltzmann brains within the universe makes the respective hypotheses complex, thus suppressing them, even disregarding the suppression associated with **N**. I doubt this argument is original, but I feel the framework validates it to some extent.

## The first AI probably won't be very smart

Claim: The first human-level AIs are not likely to undergo an intelligence explosion.

1) Brains have a ton of computational power: ~86 billion neurons and trillions of connections between them. Unless there's a "shortcut" to intelligence, we won't be able to efficiently simulate a brain for a long time. http://io9.com/this-computer-took-40-minutes-to-simulate-one-second-of-1043288954 describes one of the largest computers in the world taking 40 minutes to simulate 1 second of brain activity (i.e. this "AI" would think 2400 times slower than you or me). The first AIs are not likely to be fast thinkers.
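The slowdown factor quoted above follows directly from the article's figures; a one-line sanity check:

```python
# 40 minutes of supercomputer time to simulate 1 second of brain activity,
# per the linked io9 article.
simulated_seconds = 1
wall_clock_seconds = 40 * 60

slowdown = wall_clock_seconds // simulated_seconds
print(slowdown)  # 2400: such an "AI" would think 2400x slower than real time
```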

2) Being able to read your own source code does not mean you can self-modify. You know that you're made of DNA. You can even get your own "source code" for a few thousand dollars. No humans have successfully self-modified into an intelligence explosion; the idea seems laughable.

3) Self-improvement is not like compound interest: if an AI comes up with an idea to modify its source code to make itself smarter, that doesn't automatically mean it will have a new idea tomorrow. In fact, as it picks off the low-hanging fruit, new ideas will probably be harder and harder to think of. There's no guarantee that "how smart the AI is" will keep up with "how hard it is to think of ways to make the AI smarter"; to me, it seems very unlikely.

## Naturalistic trust among AIs: The parable of the thesis advisor's theorem

Eliezer and Marcello's article on tiling agents and the Löbian obstacle discusses several things that you intuitively would expect a rational agent to be able to do that, because of Löb's theorem, are problematic for an agent using logical reasoning. One of these desiderata is *naturalistic trust*: Imagine that you build an AI that uses PA for its mathematical reasoning, and this AI happens to find in its environment an automated theorem prover which, the AI carefully establishes, *also* uses PA for its reasoning. Our AI looks at the theorem prover's display and sees that it flashes a particular lemma that would be very useful for our AI in its own reasoning; the fact that it's on the prover's display means that the prover has just completed a formal proof of this lemma. Can our AI now use the lemma? Well, even if it can establish in its own PA-based reasoning module that *there exists a proof* of the lemma, by Löb's theorem this doesn't imply in PA that the lemma is in fact true; as Eliezer would put it, our agent treats proofs checked inside the boundaries of its own head differently from proofs checked somewhere in the environment. (The above isn't fully formal, but the formal details can be filled in.)
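For reference, the theorem doing the work here, stated in provability-logic notation (with □A abbreviating "PA proves A"):

    ⊢ □(□A → A) → □A

That is, PA can only prove the reflection principle □A → A for a sentence A that it already proves outright; so even a watertight argument that "there exists a PA-proof of the lemma" does not, within PA, yield the lemma itself.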

At MIRI's December workshop (which started today), we've been discussing a suggestion by Nik Weaver for how to handle this problem. Nik starts from a simple suggestion (which he doesn't consider to be *entirely* sufficient, and his linked paper is mostly about a much more involved proposal that addresses some remaining problems, but the simple idea will suffice for this post): Presumably there's some instrumental reason that our AI proves things; suppose that in particular, the AI will only take an action after it has proven that it is "safe" to take this action (e.g., the action doesn't blow up the planet). Nik suggests to relax this a bit: The AI will only take an action after it has (i) proven in PA that taking the action is safe; **OR** (ii) proven in PA that it's provable in PA that the action is safe; **OR** (iii) proven in PA that it's provable in PA that it's provable in PA that the action is safe; etc.

Now suppose that our AI sees that lemma, A, flashing on the theorem prover's display, and suppose that our AI can prove that A implies that action X is safe. Then our AI can also prove that *it's provable* that A -> safe(X), and it can prove that A is provable because it has established that the theorem prover works correctly; thus, it can prove that *it's provable* that safe(X), and therefore take action X.

Even if the theorem prover has only proved that A is provable, so that the AI only knows that it's provable that A is provable, it can use the same sort of reasoning to prove that it's provable that it's provable that safe(X), and again take action X.

But on hearing this, Eliezer and I had the same skeptical reaction: It seems that our AI, in an informal sense, "trusts" that A is true if it finds (i) a proof of A, or (ii) a proof that A is provable, or -- etc. Now suppose that the theorem prover our AI is looking at flashes statements on its display after it has established that they are "trustworthy" in this sense -- if it has found a proof, or a proof that there is a proof, etc. Then when A flashes on the display, our AI can only prove that there exists some *n* such that it's "provable^*n*" that A, and that's *not* enough for it to use the lemma. If the theorem prover flashed *n* on its screen together with A, everything would be fine and dandy; but if the AI doesn't know *n*, it's not able to use the theorem prover's work. So it still seems that the AI is unwilling to "trust" another system that reasons just like the AI itself.

I want to try to shed some light on this obstacle by giving an intuition for why the AI's behavior here could, in some sense, be considered to be the right thing to do. Let me tell you a little story.

One day you talk with a bright young mathematician about a mathematical problem that's been bothering you, and she suggests that it's an easy consequence of a theorem in cohistonomical tomolopy. You haven't heard of this theorem before, and find it rather surprising, so you ask for the proof.

"Well," she says, "I've heard it from my thesis advisor."

"Oh," you say, "fair enough. Um--"

"Yes?"

"You're sure that your advisor checked it carefully, right?"

"Ah! Yeah, I made quite sure of that. In fact, I established very carefully that my thesis advisor uses exactly the same system of mathematical reasoning that I use myself, and only states theorems after she has checked the proof beyond any doubt, so as a rational agent I am compelled to accept anything as true that she's convinced herself of."

"Oh, I see! Well, fair enough. I'd still like to understand why this theorem is true, though. You wouldn't happen to know your advisor's proof, would you?"

"Ah, as a matter of fact, I do! She's heard it from her thesis advisor."

"..."

"Something the matter?"

"Er, have you considered..."

"*Oh!* I'm glad you asked! In fact, I've been curious myself, and yes, it *does* happen to be the case that there's an infinitely descending chain of thesis advisors all of which have established the truth of this theorem solely by having heard it from the previous advisor in the chain." *(This parable takes place in a world without a big bang -- human history stretches infinitely far into the past.)* "But never to worry -- they've all checked very carefully that the previous person in the chain used the same formal system as themselves. Of course, that was obvious by induction -- my advisor wouldn't have accepted it from her advisor without checking his reasoning first, and he wouldn't have accepted it from his advisor without checking, etc."

"Uh, doesn't it bother you that nobody has ever, like, actually *proven* the theorem?"

"Whatever in the world are you talking about? I've proven it myself! In fact, I just told you that *infinitely many* people have each proved it in slightly different ways -- for example my own proof made use of the fact that my advisor had proven the theorem, whereas her proof used *her* advisor instead..."

This can't *literally* happen with a sound proof system, but the reason is that a system like PA can only accept things as true if they have been proven in a system *weaker* than PA -- i.e., because we have Löb's theorem. Our mathematician's advisor would have to use a weaker system than the mathematician herself, and the advisor's advisor a weaker system still; this sequence would have to terminate after a finite time *(I don't have a formal proof of this, but I'm fairly sure you can turn the above story into a formal proof that something like this has to be true of sound proof systems)*, and so *someone* will actually have to have proved the actual theorem on the object level.

So here's my intuition: A satisfactory solution of the problems around the Löbian obstacle will have to make sure that the buck doesn't get passed on indefinitely -- you can accept a theorem because someone reasoning like you has established that someone else reasoning like you has proven the theorem, but there can only be a finite number of links between you and someone who has actually done the object-level proof. We know how to do this by decreasing the mathematical strength of the proof system, and that's not satisfactory, but my intuition is that a satisfactory solution will still have to make sure that there's *something* that decreases when you go up the chain of thesis advisors, and when that thing reaches zero you've found the thesis advisor that has actually proven the theorem. (I sense ordinals entering the picture.)

...aaaand in fact, I *can* now tell you one way to do *something* like this: Nik's idea, which I was talking about above. Remember how our AI "trusts" the theorem prover that flashes the number *n* which says how many times you have to iterate "that it's provable in PA that", but doesn't "trust" the prover that's exactly the same except it doesn't tell you this number? That's the thing that decreases. If the theorem prover actually establishes A by observing a *different* theorem prover flashing A and the number 1584, then it can flash A, but only with a number at least 1585. And hence, if you go 1585 thesis advisors up the chain, you find the gal who actually proved A.
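To make the bookkeeping in Nik's scheme concrete, here's a toy sketch in Python. The `Prover` class and its methods are invented for illustration, and no actual proof checking happens; the point is only the arithmetic of the depth counter, which is what guarantees a finite chain back to an object-level proof.

```python
# Toy model: each prover flashes a statement together with a "provability
# depth" n, meaning "it's provable^n in PA that A". A prover that accepts A
# from another prover flashing it at depth n must re-flash it at depth n + 1,
# so following the chain downward always terminates at an object-level proof.

class Prover:
    def __init__(self, name):
        self.name = name
        self.flashed = {}  # statement -> depth at which it was flashed

    def prove_directly(self, statement):
        # Object-level proof: depth 0 ("proven in PA").
        self.flashed[statement] = 0

    def accept_from(self, other, statement):
        # Trusting another prover adds one layer of "it's provable that ...".
        n = other.flashed[statement]
        self.flashed[statement] = n + 1

# A chain of thesis advisors: only the first proves A on the object level.
chain = [Prover(f"advisor_{i}") for i in range(1586)]
chain[0].prove_directly("A")
for prev, cur in zip(chain, chain[1:]):
    cur.accept_from(prev, "A")

print(chain[-1].flashed["A"])  # 1585: that many links back to the real proof
```

This mirrors the story above: the prover that saw A flashed with the number 1584 re-flashes it with at least 1585, and going 1585 advisors up the chain finds the one who actually proved A.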

The cool thing about Nik's idea is that it *doesn't* change mathematical strength while going down the chain. In fact, it's not hard to show that if PA proves a sentence A, then it also proves that PA proves A; and the other way, we believe that everything that PA proves is actually true, so if PA proves *PA proves A*, then it follows that PA proves A.

I can guess what Eliezer's reaction to my argument here might be: The problem I've been describing can only occur in infinitely large worlds, which have all sorts of other problems, like utilities not converging and stuff.

We settled for a large finite TV screen, but we could have had an arbitrarily larger finite TV screen. #infiniteworldproblems

We have Porsches for every natural number, but at every time t we have to trade down the Porsche with number t for a BMW. #infiniteworldproblems

We have ever-rising expectations for our standard of living, but the limit of our expectations doesn't equal our expectation of the limit. #infiniteworldproblems

*-- Eliezer, not coincidentally after talking to me*

I'm not going to be able to resolve that argument in this post, but briefly: I agree that we *probably* live in a finite world, and that finite worlds have many properties that make them nice to handle mathematically, but we can formally reason about infinite worlds of the kind I'm talking about here using standard, extremely well-understood mathematics.

Because proof systems like PA (or more conveniently ZFC) allow us to formalize this standard mathematical reasoning, a solution to the Löbian obstacle has to "work" properly in these infinite worlds, or we would be able to turn our story of the thesis advisors' proof that 0=1 into a formal proof of an inconsistency in PA, say. To be concrete, consider the system PA*, which consists of PA + the axiom schema "if PA* proves phi, then phi" for every formula phi; this is easily seen to be inconsistent by Löb's theorem, but if we didn't know that yet, we could translate the story of the thesis advisors (which are using PA* as their proof system this time) into a formal proof of the inconsistency of PA*.

Therefore, thinking intuitively in terms of infinite worlds can give us insight into why many approaches to the Löbian family of problems fail -- as long as we make sure that these infinite worlds, and their properties that we're using in our arguments, really *can* be formalized in standard mathematics, of course.

## I played the AI Box Experiment again! (and lost both games)

**Update:** I recently played and won an additional game of AI Box with DEA7TH. This game was conducted over Skype. I've realized that my habit of revealing substantial information about the AI box experiment in my writeups makes it rather difficult for the AI to win, so I'll refrain from giving out game information from now on. I apologize.

I have won a second game of AI box against a gatekeeper who wished to remain Anonymous.

This puts my AI Box Experiment record at 3 wins and 3 losses.


## Autism, Watson, the Turing test, and General Intelligence

Thinking aloud:

Humans are examples of general intelligence - the only example we're sure of. Some humans have various degrees of autism (low-level versions are quite common in the circles I've moved in), impairing their social skills. Mild autists nevertheless remain general intelligences, capable of demonstrating strong cross-domain optimisation. Psychology is full of other examples of mental pathologies that impair certain skills, but nevertheless leave their sufferers as full-fledged general intelligences. This general intelligence is not enough, however, to solve their impairments.

Watson triumphed on Jeopardy. AI scientists in previous decades would have concluded that to do so, a general intelligence would have been needed. But that was not the case at all - Watson is blatantly not a general intelligence. Big data and clever algorithms were all that were needed. Computers are demonstrating more and more skills, besting humans in more and more domains - but still no sign of general intelligence. I've recently developed the suspicion that the Turing test (comparing AI with a standard human) could get passed by a narrow AI finely tuned to that task.

The general thread is that the link between narrow skills and general intelligence may not be as clear as we sometimes think. It may be that narrow skills are sufficiently diverse and unique that a mid-level general intelligence may not be able to develop them to a large extent. Or, put another way, an above-human social intelligence may not be able to control a robot body or do decent image recognition. A super-intelligence likely could: ultimately, general intelligence includes the specific skills. But this "ultimately" may take a long time to come.

So the questions I'm wondering about are:

- How likely is it that a general intelligence, above human in some domain *not* related to AI development, will acquire high-level skills in unrelated areas?
- By building high-performance narrow AIs, are we making it much easier for such an intelligence to develop such skills, by co-opting or copying these programs?

## Thought experiment: The transhuman pedophile

There's a recent science fiction story that I can't recall the name of, in which the narrator is traveling somewhere via plane, and the security check includes a brain scan for deviance. The narrator is a pedophile. Everyone who sees the results of the scan is horrified--not that he's a pedophile, but that his particular brain abnormality is easily fixed, so that means he's chosen to remain a pedophile. He's closely monitored, so he'll never be able to act on those desires, but he keeps them anyway, because that's part of who he is.

What would you do in his place?

## Definition of AI Friendliness

### How will we know if future AIs (or even existing planners) are making decisions that are bad for humans unless we spell out what we think is unfriendly?

At a machine level the AI would be recursively minimising cost functions to produce the most effective plan of action to achieve the goal, but how will we know if its decision is going to cause harm?

Is there a model or dataset which describes what is friendly to humans? e.g.

**Context**

0 - running a simulation in a VM

2 - physical robot with vacuum attachment

9 - full control of a plane

**Actions**

0 - selecting a song to play

5 - deciding which section of floor to vacuum

99 - deciding who is an ‘enemy’

9999 - aiming a gun at an ‘enemy’

**Impact**

1 - poor song selected to play, human mildly annoyed

2 - ineffective use of resources (vacuuming the same floor section twice)

99 - killing a human

99999 - killing all humans

It may not be possible to get agreement from all countries/cultures/beliefs, but this is something we should discuss and attempt to reach some agreement on.
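One way the context/action/impact scores above could be operationalized is as a simple severity table plus a rule for flagging high-risk decisions. This is only a sketch: the numeric values are the post's illustrative ones, while the names, the multiplicative combination rule, and the threshold are assumptions made up for the example.

```python
# Severity scores from the post's example lists (Context, Actions, Impact).
CONTEXT = {"vm_simulation": 0, "vacuum_robot": 2, "plane_control": 9}
ACTION = {"select_song": 0, "choose_floor_section": 5,
          "classify_enemy": 99, "aim_gun": 9999}
IMPACT = {"poor_song": 1, "wasted_vacuuming": 2,
          "kill_one_human": 99, "kill_all_humans": 99999}

def risk_score(context, action, impact):
    # Hypothetical combination rule: multiply the three axes (offset by 1 so
    # a zero-severity context or action doesn't erase the others).
    return (1 + CONTEXT[context]) * (1 + ACTION[action]) * IMPACT[impact]

def needs_human_review(context, action, impact, threshold=1000):
    # Flag any planned decision whose combined severity crosses a threshold.
    return risk_score(context, action, impact) >= threshold

print(needs_human_review("vacuum_robot", "choose_floor_section", "wasted_vacuuming"))
print(needs_human_review("plane_control", "classify_enemy", "kill_one_human"))
```

The first call scores (1+2)·(1+5)·2 = 36 and is allowed; the second scores 99,000 and is flagged, which is the kind of line-drawing the post is asking us to agree on.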


## I know when the Singularity will occur

More precisely, if we suppose that sometime in the next 30 years, an artificial intelligence will begin bootstrapping its own code and explode into a super-intelligence, I can give you 2.3 bits of further information on when the Singularity will occur.

Between midnight and 5 AM, Pacific Standard Time.
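The "2.3 bits" checks out: narrowing a uniformly distributed event time from a 24-hour day to a 5-hour window conveys log2(24/5) bits.

```python
import math

# Information gained by restricting a uniform 24-hour window to 5 hours.
bits = math.log2(24 / 5)
print(round(bits, 2))  # 2.26, i.e. roughly the 2.3 bits claimed
```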

## I attempted the AI Box Experiment again! (And won - Twice!)

# Summary

Furthermore, in the last thread I asserted that:

> Rather than my loss making this problem feel harder, I've become convinced that rather than this being merely possible, it's actually ridiculously easy, and a lot easier than most people assume.

It would be quite bad for me to assert this without backing it up with a victory. So I did.

* * *

*Ps: Bored of regular LessWrong? Check out the LessWrong IRC! We have cake.*
