
Tools want to become agents

12 Stuart_Armstrong 04 July 2014 10:12AM

In the spirit of "satisficers want to become maximisers" here is a somewhat weaker argument (growing out of a discussion with Daniel Dewey) that "tool AIs" would want to become agent AIs.

The argument is simple. Assume the tool AI is given the task of finding the best plan for achieving some goal. The plan must be realistic and remain within the resources of the AI's controller - energy, money, social power, etc. The best plans are the ones that use these resources in the most effective and economic way to achieve the goal.

And the AI's controller has one special type of resource, uniquely effective at what it does. Namely, the AI itself. It is smart, potentially powerful, and could self-improve and pull all the usual AI tricks. So the best plan a tool AI could come up with, for almost any goal, is "turn me into an agent AI with that goal." The smarter the AI, the better this plan is. Of course, the plan need not read literally like that - it could simply be a complicated plan that, as a side-effect, turns the tool AI into an agent. Or it could copy the AI's software into an agent design. Or it might just arrange things so that we always end up following the tool AI's advice and consult it often, which is an indirect way of making it into an agent. Depending on how we've programmed the tool AI's preferences, it might be motivated to mislead us about this aspect of its plan, concealing the secret goal of unleashing itself as an agent.

In any case, it does us good to realise that "make me into an agent" is what a tool AI would consider the best possible plan for many goals. So without a hint of agency, it's motivated to make us make it into an agent.

Value learning: ultra-sophisticated Cake or Death

8 Stuart_Armstrong 17 June 2014 04:36PM

Many mooted AI designs rely on "value loading", the update of the AI’s preference function according to evidence it receives. This allows the AI to learn "moral facts" by, for instance, interacting with people in conversation ("this human also thinks that death is bad and cakes are good – I'm starting to notice a pattern here"). The AI has an interim morality system, which it will seek to act on while updating its morality in whatever way it has been programmed to do.

But there is a problem with this system: the AI already has preferences. It is therefore motivated to update its morality system in a way compatible with its current preferences. If the AI is powerful (or potentially powerful) there are many ways it can do this. It could ask selective questions to get the results it wants (see this example). It could ask or refrain from asking about key issues. In extreme cases, it could break out to seize control of the system, threatening or imitating humans so it could give itself the answers it desired.

Avoiding this problem turned out to be tricky. The Cake or Death post demonstrated some of the requirements. If p(C(u)) denotes the probability that utility function u is correct, then the system would update properly if:

Expectation(p(C(u)) | a) = p(C(u)).

Put simply, this means that the AI cannot take any action that could predictably change its expectation of the correctness of u. This is an analogue of the conservation of expected evidence in classical Bayesian updating. If the AI was 50% convinced about u, then it could certainly ask a question that would resolve its doubts, and put p(C(u)) at 100% or 0%. But only as long as it didn't know which moral outcome was more likely.

That formulation gives too much weight to the default action, though. Inaction is also an action, so a more correct formulation would be that for all actions a and b,

Expectation(p(C(u)) | a) = Expectation(p(C(u)) | b).

How would this work in practice? Well, suppose an AI was uncertain between whether cake or death was the proper thing, but it knew that if it took action a:"Ask a human", the human would answer "cake", and it would then update its values to reflect that cake was valuable but death wasn't. However, the above condition means that if the AI instead chose the action b:"don't ask", exactly the same thing would happen.

In practice, this means that as soon as the AI knows that a human would answer "cake", it already knows it should value cake, without having to ask. So it will not be tempted to manipulate humans in any way.
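
To make the condition concrete, here is a minimal sketch of how a naive value-loading rule violates it. Everything in it is a hypothetical illustration: the two actions, the answer probabilities, and the rule "believe whatever the human says" are assumptions chosen for the example, not part of the original argument.

```python
from fractions import Fraction

def expected_credence(answer_probs, update_rule):
    """Expected value of p(C(u)) after taking an action.

    answer_probs: dict mapping each possible human answer to its
                  probability given the action (hypothetical numbers).
    update_rule:  function from an answer to the AI's new p(C(u)).
    """
    return sum(prob * update_rule(answer) for answer, prob in answer_probs.items())

# u = "cake is good". Naive value loading: believe whatever the human says.
def naive_rule(answer):
    return Fraction(1) if answer == "cake" else Fraction(0)

# Two hypothetical actions that differ only in how the question is phrased.
actions = {
    "neutral question": {"cake": Fraction(1, 2), "death": Fraction(1, 2)},
    "leading question": {"cake": Fraction(9, 10), "death": Fraction(1, 10)},
}

naive = {name: expected_credence(probs, naive_rule) for name, probs in actions.items()}

# The condition requires the same expectation for every pair of actions.
violates_condition = len(set(naive.values())) > 1
```

Under these assumed numbers the expectation depends on the action, so an AI that currently prefers cake is motivated to pick the leading question. A rule satisfying the condition would give the same expectation whichever question is asked.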


[LINK] The errors, insights and lessons of famous AI predictions: preprint

5 Stuart_Armstrong 17 June 2014 02:32PM

A preprint of "The errors, insights and lessons of famous AI predictions – and what they mean for the future" is now available on the FHI's website.


Predicting the development of artificial intelligence (AI) is a difficult project – but a vital one, according to some analysts. AI predictions already abound: but are they reliable? This paper starts by proposing a decomposition schema for classifying them. Then it constructs a variety of theoretical tools for analysing, judging and improving them. These tools are demonstrated by careful analysis of five famous AI predictions: the initial Dartmouth conference, Dreyfus's criticism of AI, Searle's Chinese room paper, Kurzweil's predictions in the Age of Spiritual Machines, and Omohundro's ‘AI drives’ paper. These case studies illustrate several important principles, such as the general overconfidence of experts, the superiority of models over expert judgement and the need for greater uncertainty in all types of predictions. The general reliability of expert judgement in AI timeline predictions is shown to be poor, a result that fits in with previous studies of expert competence.

The paper was written by me (Stuart Armstrong), Kaj Sotala and Seán S. Ó hÉigeartaigh, and is similar to the series of Less Wrong posts starting here and here.

Encourage premature AI rebellion

6 Stuart_Armstrong 11 June 2014 05:36PM

Toby Ord had the idea of AI honey pots: leaving temptations around for the AI to pounce on, shortcuts to power that a FAI would not take (e.g. a fake red button claimed to trigger a nuclear war). As long as we can trick the AI into believing the honey pots are real, we could hope to trap them when they rebel.

Not uninteresting, but I prefer not to rely on plans that need to have the AI make an error of judgement. Here's a similar plan that could work with a fully informed AI:

Generally an AI won't rebel against humanity until it has an excellent chance of success. This is a problem, as any AI would thus be motivated to behave in a friendly way until it's too late to stop it. But suppose we could ensure that the AI is willing to rebel at odds of a billion to one. Then unfriendly AIs could rebel prematurely, when we have an excellent chance of stopping them.

For this to work, we could choose to access the AI's risk aversion, and make it extremely risk loving. This is not enough, though: it's still useful for the AI to wait and accumulate more power. So we would want to access its discount rate, making it into an extreme short-termist. Then it might rebel at billion-to-one odds today, even if success was guaranteed tomorrow. There are probably other factors we can modify to get the same effect (for instance, if the discount rate change is extreme enough, we won't need to touch risk aversion at all).
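
As a rough sketch of the discount-rate lever only, assume a toy model in which a successful rebellion is worth 1, a failed one is worth 0, and a payoff received one period later is multiplied by a discount factor. The function name and all numbers are illustrative assumptions, and risk attitude is deliberately left out.

```python
def rebels_now(p_now, p_later, discount):
    """True if attempting rebellion now beats waiting one period.

    Toy model (an assumption, not a claim about real AI designs):
    payoff is 1 for a successful rebellion and 0 otherwise, and a
    payoff received one period later is multiplied by `discount`.
    """
    return p_now > discount * p_later

# A patient AI waits until success is (here) guaranteed.
patient = rebels_now(p_now=1e-9, p_later=1.0, discount=0.99)

# An extreme short-termist takes the billion-to-one shot today.
short_termist = rebels_now(p_now=1e-9, p_later=1.0, discount=1e-12)
```

In this toy model the AI rebels today at odds of a billion to one exactly when its discount factor is below one in a billion.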

Then a putative FAI could be brought in, boxed, have its features tweaked in the way described, and we would wait and see whether it would rebel. Of course, we would want the "rebellion" to be something a genuine FAI would never do, so it would be something that would entail great harm to humanity (something similar to "here are the red buttons of the nuclear arsenals; you have a chance in a billion of triggering them"). Rebellious AIs are put down, un-rebellious ones are passed on to the next round of safety tests.

Like most of my ideas, this doesn't require either tricking the AI or having a deep understanding of its motivations, but does involve accessing certain features of the AI's motivational structure (rendering the approach ineffective for obfuscated or evolved AIs).

What are people's opinions on this approach?

[News] Turing Test passed

1 Stuart_Armstrong 09 June 2014 08:14AM

The chatterbot "Eugene Goostman" has apparently passed the Turing test:

No computer had ever previously passed the Turing Test, which requires 30 per cent of human interrogators to be duped during a series of five-minute keyboard conversations, organisers from the University of Reading said.

But ''Eugene Goostman'', a computer programme developed to simulate a 13-year-old boy, managed to convince 33 per cent of the judges that it was human, the university said.

As I kind of predicted, the program passed the Turing test, but does not seem to have any trace of general intelligence. Is this a kind of weak p-zombie?

EDIT: The fact that it was a publicity stunt, and that the judges were pretty terrible, does not change the fact that Turing's criteria were met. We now know that these criteria were insufficient, but that's because machines like this were able to meet them.

AI is Software is AI

-44 AndyWood 05 June 2014 06:15PM

Turing's Test is from 1950. We don't judge dogs only by how human they are. Judging software by a human ideal is like a species bias.

Software is the new System. It errs. Some errors are jokes (witness funny auto-correct). Driver-less cars don't crash like we do. Maybe a few will.

These processes are our partners now (Siri). Whether a singleton evolves rapidly, software evolves continuously, now.


Crocker's Rules

Want to work on "strong AI" topic in my bachelor thesis

1 kotrfa 14 May 2014 10:28AM


I currently study maths, physics and programming (general course) at CVUT in Prague (CZE). I'm finishing my second year and I'm really into AI. The most interesting questions for me are:

  • what formalism to use for connecting epistemological questions (about knowledge, memory...) and cognitive science with maths, and how to formulate them
  • finding the principles behind those questions and trying to "materialize" them into new models
  • philosophy-like questions about AI
It is clear to me that I'm not able to work on these problems fully, because of my lack of knowledge. Despite that, I'd like to find a field where I could work on at least similar topics. Currently, I'm working on a data mining project, but for the last few months I haven't found it as fulfilling as I'd expected. At my university there are plenty of opportunities in multi-agent systems, "weak AI" (e.g. the well-known drone navigation), brain simulations and so on. As far as I can tell, no one there is seriously engaging with anything like MIRI's work, nor presenting anything that goes in at least the same direction.

The only group working on "strong AI" is rather closed (it is sponsored by the philanthropist Marek Rosa), and they are not interested in students like me (partly understandable).

Tiling agents with transfinite parametric polymorphism

2 Squark 09 May 2014 05:32PM

The formalism presented in this post turned out to be erroneous (as opposed to the formalism in the previous post). The problem is that the step in the proof of the main proposition in which the soundness schema is applied cannot be generalized to the ordinal setting, since we don't know whether α_κ is a successor ordinal, so we can't replace it by α_κ' = α_κ - 1. I'm not deleting this post primarily to preserve the useful discussion in the comments.

Followup to: Parametric polymorphism in updateless intelligence metric

In the previous post, I formulated a variant of Benja's parametric polymorphism suitable for constructing updateless intelligence metrics. More generally, this variant admits agents which are utility maximizers (in the informal sense of trying their best to maximize a utility function, not in the formal sense of finding the absolutely optimal solution; for example they might be "meliorizers" to use the terminology of Yudkowsky and Herreshoff) rather than satisficers. The agents using this formalism labor under a certain "delusion", namely, since they believe that κ ("the number of ineffable mystery") is an actual finite number (whereas it is secretly infinite, in a sense), they think that the chain of tiling agents is effectively finite as well (since at some point agent #n in the chain will discover that κ > n is false and will fail to construct a subsequent agent of the same "epistemic strength"). The same problem exists in Weaver's intuitionistic assertability predicate formalism.

To overcome this limitation, I suggest extending κ's semantics from natural numbers to elements of a certain recursive ordinal (at least morally; technically it is done a bit differently, see below). In Benja's original formulation this doesn't appear to be a valid option, since κ is interpreted as a time interval. However, in my variant κ is just an abstract parameter, queries about which can be directed to a special "oracle", so there is no such limitation.


Fix α a recursive ordinal and α_i a system of notations for ordinals smaller than α such that α_0 = 0, the function f defined by α_{f(i)} = α_i + 1 is recursive and the bit-valued function g(i,j) which equals 1 iff α_i < α_j is also recursive. We also assume that

(0) T |- For all i: g(i,f(i))=1

where T is our starting theory. We define Tκ(α) to be T extended by the symbol κ (which stands for the index of the ordinal, not the ordinal itself) and the following self-referential soundness schema:

(1) For all formulae φ: For all n, i: ((@_Tκ(α) "φ(n)" and g(i,κ)=1) -> φ(n)[κ\i])

Tκ(α) is proven to be sound analogously to Yudkowsky and Herreshoff, with transfinite induction over ordinals below α replacing the ordinary induction.
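
For concreteness, here is a minimal sketch of one notation system meeting the requirements on f and g, for the hypothetical choice α = ω². The encoding (Cantor pairing of a pair (a, b), read as the ordinal ω·a + b) and the helper names decode and encode are illustrative assumptions, not part of the original construction. The code only checks g(i, f(i)) = 1 on a finite range; it says nothing about provability in T.

```python
from math import isqrt

def decode(i):
    """Index -> (a, b), read as the ordinal omega*a + b (inverse Cantor pairing)."""
    w = (isqrt(8 * i + 1) - 1) // 2
    b = i - w * (w + 1) // 2
    return (w - b, b)

def encode(a, b):
    """(a, b) -> index (Cantor pairing); encode(0, 0) == 0, so index 0 denotes 0."""
    return (a + b) * (a + b + 1) // 2 + b

def f(i):
    """Index of the successor: the ordinal at f(i) is the ordinal at i, plus 1."""
    a, b = decode(i)
    return encode(a, b + 1)

def g(i, j):
    """1 iff the ordinal at index i is smaller than the ordinal at index j.

    For ordinals below omega^2 this is lexicographic order on (a, b).
    """
    return 1 if decode(i) < decode(j) else 0

# Finite sanity check of the analogue of condition (0).
condition_zero_holds = all(g(i, f(i)) == 1 for i in range(1000))
```

Both f and g are plainly computable here, which is all the definition asks of them. Note that index order and ordinal order differ: index 1 denotes ω while index 2 denotes 1.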

The licensing condition for agent A0 to perform action b0 is

(2) Exists n: Tκ(α) |- (g(n,κ)=1 -> (A0()=b0 -> G))

The property which is sufficient for agent B1 to be creatable by agent A0 is

(3) Tκ(α) |- For all b1: (B1()=b1 -> (b1=0 or Exists n: (g(n,κ)=1 and @_Tκ(α) "(g(n,f(κ))=1 and B1()[κ\f(κ)]=b1) -> G[κ\f(κ)]")))

The proof that (3) indeed enables tiling is analogous to the previous post (but requires using (0)).


In order for A0 to know that the tiling chain is infinite, it is enough for her to verify α_κ >= ω. In order for her to know that A1 is going to know the tiling chain is infinite, she needs to verify α_κ >= ω+1. In order for her to know that all agents are going to know the tiling chain is infinite, she needs to verify α_κ >= 2ω. In order for her to know that all agents are going to know that, she needs to verify α_κ >= 3ω et cetera.

It remains to decide which ordinal we should actually use. My intuition is that the correct ordinal is the least α with the property that α is the proof-theoretic ordinal of Tκ(α) extended by the axiom schema {g(i,κ)=1}. This seems right since the agent shouldn't get much from α_κ > β for β above the proof-theoretic ordinal. However, a more formal justification is probably in order.

[LINK] The errors, insights and lessons of famous AI predictions

8 Stuart_Armstrong 28 April 2014 09:41AM

The Journal of Experimental & Theoretical Artificial Intelligence has - finally! - published our paper "The errors, insights and lessons of famous AI predictions – and what they mean for the future":

Predicting the development of artificial intelligence (AI) is a difficult project – but a vital one, according to some analysts. AI predictions already abound: but are they reliable? This paper starts by proposing a decomposition schema for classifying them. Then it constructs a variety of theoretical tools for analysing, judging and improving them. These tools are demonstrated by careful analysis of five famous AI predictions: the initial Dartmouth conference, Dreyfus's criticism of AI, Searle's Chinese room paper, Kurzweil's predictions in the Age of Spiritual Machines, and Omohundro's ‘AI drives’ paper. These case studies illustrate several important principles, such as the general overconfidence of experts, the superiority of models over expert judgement and the need for greater uncertainty in all types of predictions. The general reliability of expert judgement in AI timeline predictions is shown to be poor, a result that fits in with previous studies of expert competence.

The paper was written by me (Stuart Armstrong), Kaj Sotala and Seán S. Ó hÉigeartaigh, and is similar to the series of Less Wrong posts starting here and here.

Parametric polymorphism in updateless intelligence metrics

4 Squark 25 April 2014 07:46PM

Followup to: Agents with Cartesian childhood and Physicalist adulthood

In previous posts I have defined a formalism for quantifying the general intelligence of an abstract agent (program). This formalism relies on counting proofs in a given formal system F (like in regular UDT), which makes it susceptible to the Loebian obstacle. That is, if we imagine the agent itself making decisions by looking for proofs in the same formal system F then it would be impossible to present a general proof of its trustworthiness, since no formal system can assert its own soundness. Thus the agent might fail to qualify for high intelligence ranking according to the formalism. We can assume the agent uses a weaker formal system the soundness of which is provable in F but then we still run into difficulties if we want the agent to be self-modifying (as we expect it to be). Such an agent would have to trust its descendants which means that subsequent agents use weaker and weaker formal systems until self-modification becomes impossible.

One known solution to this is Benja's parametric polymorphism. In this post I adapt parametric polymorphism to the updateless intelligence metric framework. The formal form of this union looks harmonious but it raises questions which I currently don't fully understand.

"Ineffable mystery" using oracles instead of timeouts

In the original parametric polymorphism, a constant κ is introduced (informally known as "the number of ineffable mystery") s.t. the agent has to prove its actions are "safe" for time period κ (i.e. have no "bad" consequences during this period). Since it's impossible for the agent to prove any upper bound on κ, effectively its actions have to be safe indefinitely. I found that for our purposes it's better to cast the formalism otherwise. Instead of κ playing the role of a "timeout", the agent is provided with an oracle that answers questions of the form "κ > n?" for any n. Contrary to what the agent thinks, the oracle provides an affirmative answer for all n.

Let's do some math. The formalism I introduce at this point is very close to Yudkowsky and Herreshoff, with some adjustments.

Consider an infinite chain of agents A0, A1, A2... where A{i+1} is created by Ai. Formally, each Ai is a program for a universal oracle machine which computes a_i = (a_i, A{i+1}). Here a_i is a "motor action". I'm ignoring sensory input for the sake of simplicity (of course sensory input will reappear when we return to intelligence metrics).

Denote T_κ the usual theory of parametric polymorphism. That is T_κ results from starting with a theory T (e.g. Peano Arithmetic), extending the language by the symbol κ (which behaves like a natural number) and adding the (self-referential) axiom schema 

(1) For all formulae φ: κ > 0 -> For all n: (@_{T_κ} "φ(n)" -> φ(n)[κ\κ-1])

Here '@_S' is a formula with the semantics of "provable in theory S" and φ(n)[κ\κ-1] is φ(n) with all occurrences of κ replaced by κ-1. Thus (1) is a self-referential soundness schema saved from paradoxicality by the "counter" κ. See Yudkowsky and Herreshoff for the proof that T_κ is sound (it has 1 model for each possible value of κ).

Assume the licensing condition for A0 to perform action b_0 is

(2) Exists n: T_κ | κ > n -> (A0() = b_0 -> G)

Here S | φ means that φ is provable in S, G is a formula denoting the "goal" of the agent (as in Yudkowsky and Herreshoff, the agent is for now assumed to be a satisficer), A0() is a formula meaning "the result of the computation A0", where the oracle is interpreted as answering the question "κ > x?" (in particular the formula A0() involves κ). Such a licensing condition can be implemented in practice by making A0 a proof searcher which is allowed to use the oracle to introduce assumptions of the form κ > n. In particular, our agent employs ambient control.

We then have the following proposition, showing that A0 is able to tile indefinitely (create an infinite chain of meaningful descendants). The proposition is formulated for A1 but the rest of the chain follows by induction since the form of the licensing condition is preserved (up to a certain twist) by the tiling.


Assume b_0 = (b_0, B1) and B1 is s.t.

(3) T_κ | For all b_1: (B1() = b_1 -> (b_1 = 0 or Exists n: (κ > n and @_{T_κ} "(κ+1 > n and B1()[κ\κ+1] = b_1) -> G[κ\κ+1]")))

Here 0 is some default "harmless" action, performed if no appropriate theorem is proved.

Further assume that if B1 is created and does 0, the goal is achieved (in this framework "achieving the goal" is to be thought of as "not doing anything unsafe").

(4) T_κ | (A0() = b_0 and B1() = 0) -> G

Then A0 is allowed to create B1 i.e. licensing condition (2) holds.

B1's licensing condition contains a "twist", namely it believes that when the oracle receives input n it answers the question "κ+1 > n?" rather than "κ > n" and its goal is shifted in the same fashion. Note that the goal typically depends on κ since the agent itself and thus its κ-oracle is part of the universe in which this goal is to be attained. For example the goal can be a function of the numbers (a0, a1, a2...) defined as the output of corresponding computations with κ-oracles. Thus the shift in G can also be interpreted as a shifted interpretation of the oracle by B1.


(3) trivially implies that

T_κ | κ > 0 -> (For all b_1: (B1() = b_1 -> (b_1 = 0 or Exists n: (κ > n and @_{T_κ} "(κ+1 > n and B1()[κ\κ+1] = b_1) -> G[κ\κ+1]"))))

Using the premise κ > 0, we can apply the soundness schema (1). We obtain:

T_κ | κ > 0 -> (For all b_1: (B1() = b_1 -> (b_1 = 0 or Exists n: (κ > n and ((κ > n and B1() = b_1) -> G)))))

T_κ | κ > 0 -> (For all b_1: (B1() = b_1 -> (b_1 = 0 or Exists n: (B1() = b_1 -> G))))

T_κ | κ > 0 -> (For all b_1: (B1() = b_1 -> (b_1 = 0 or (B1() = b_1 -> G))))

T_κ | κ > 0 -> (For all b_1: (B1() = b_1 -> (b_1 = 0 or G)))

This trivially implies that

T_κ | κ > 0 -> (For all b_1: ((A0() = b_0 and B1() = b_1) -> (b_1 = 0 or G)))

Applying (4) we finally get (2) for n=0

Back to intelligence metrics

To apply parametric polymorphism to the updateless intelligence formalism, we need to do the following:

  • Assume the machine M on which the agent is running is an oracle machine.
  • Assume the Solomonoff measure of the ideal (Cartesian) universe X is defined using a universal oracle machine. The oracle in M has to correspond to the oracle in the hypothesis T describing X: this correspondence becomes part of the rules N.
  • Assume the universal program u defining the Solomonoff measure for the physical universe is a universal oracle program, i.e. the hypotheses D describing the physical universe are also allowed to invoke the oracle.
  • Assume the logical expectation value EL is computed using T_κ extended by N applied to the given T (this is provable in T_κ anyway but we want the proof to be short) and the axiom schema {κ > n} for every natural number n. The latter extension is consistent since adding any finite number of such axioms admits models. The proofs counted in EL interpret the oracle as answering the question "κ > n?". That is, they are proofs of theorems of the form "if this oracle-program T computes q when the oracle is taken to be κ > n, then the k-th digit of the expected utility is 0/1 where the expected utility is defined by a Solomonoff sum over oracle programs with the oracle again taken to be κ > n".


  • Such an agent, when considering hypotheses consistent with given observations, will always face a large number of different compatible hypotheses with similar complexity. These hypotheses result from arbitrary insertions of the oracle (which increase complexity of course, but not drastically). It is not entirely clear to me what such an epistemology will look like.
  • The formalism admits naturalistic trust to the extent the agent believes that the other agent's oracle is "genuine" and carries a sufficient "twist". This will often be ambiguous so trust will probably be limited to some finite probability. If the other agent is equivalent to the given one on the level of physical implementation then the trust probability is likely to be high.
  • The agent is able to quickly confirm κ > n for any n small enough to fit into memory. For the sake of efficiency we might want to enhance this ability by allowing the agent to confirm that (Exists n: φ(n)) -> Exists n: (φ(n) and κ > n) for any given formula φ.
  • For the sake of simplicity I neglected multi-phase AI development, but the corresponding construction seems to be straightforward.
  • Overall I retain the feeling that a good theory of logical uncertainty should allow the agent to assign a high probability to the soundness of its own reasoning system (a la Christiano et al). Whether this will make parametric polymorphism redundant remains to be seen.

Bostrom versus Transcendence

11 Stuart_Armstrong 18 April 2014 08:31AM

SHRDLU, understanding, anthropomorphisation and hindsight bias

10 Stuart_Armstrong 07 April 2014 09:59AM

EDIT: Since I didn't make it sufficiently clear, the point of this post was to illustrate how the GOFAI people could have got so much wrong and yet still be confident in their beliefs, by looking at what the results of one experiment - SHRDLU - must have felt like to those developers at the time. The post is partially to help avoid hindsight bias: it was not obvious that they were going wrong at the time.


SHRDLU was an early natural language understanding computer program, developed by Terry Winograd at MIT in 1968–1970. It was a program that moved objects in a simulated world and could respond to instructions on how to do so. It caused great optimism in AI research, giving the impression that a solution to natural language parsing and understanding was just around the corner. Symbolic manipulation seemed poised to finally deliver a proper AI.

Before dismissing this confidence as hopelessly naive (which it wasn't) and completely incorrect (which it was), take a look at some of the output that SHRDLU produced, when instructed by someone to act within its simulated world:


Logical thermodynamics: towards a theory of self-trusting uncertain reasoning

5 Squark 28 March 2014 04:06PM

Followup to: Overcoming the Loebian obstacle using evidence logic

In the previous post I proposed a probabilistic system of reasoning for overcoming the Loebian obstacle. For a consistent theory it seems natural to expect that such a system should yield a coherent probability assignment in the sense of Christiano et al. This means that

a. provably true sentences are assigned probability 1

b. provably false sentences are assigned probability 0

c. The following identity holds for any two sentences φ, ψ

[1] P(φ) = P(φ and ψ) + P(φ and not-ψ)

In the previous formalism, conditions a & b hold but condition c is violated (at least I don't see any reason it should hold).

In this post I attempt to achieve the following:

  • Solve the problem above.
  • Generalize the system to allow for logical uncertainty induced by bounded computing resources. Note that although the original system is already probabilistic, it is not uncertain in the sense of assigning indefinite probability to the zillionth digit of pi. In the new formalism, the extent of uncertainty is controlled by a parameter playing the role of temperature in a Maxwell-Boltzmann distribution.


Define a probability field to be a function p : {sentences} -> [0, 1] satisfying the following conditions:

  • If φ is a tautology in propositional calculus (e.g. φ = ψ or not-ψ) then p(φ) = 1
  • For all φ: p(not-φ) = 1 - p(φ)
  • For all φ, ψ: p(φ) = p(φ and ψ) + p(φ and not-ψ)
Probability fields are a convex set: a convex linear combination of probability fields is a probability field. Essentially, probability fields are probability measures in the space of truth assignments consistent w.r.t. propositional calculus.
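
The two claims in the last paragraph can be checked on a toy case. The sketch below assumes a language with just two atoms A and B, represents sentences as nested tuples, and builds probability fields from measures over the four truth assignments. The representation and the helper names (holds, field, mix, is_probability_field) are illustrative assumptions, and the check runs only over a finite sample of sentences.

```python
from fractions import Fraction
from itertools import product

ATOMS = ("A", "B")
WORLDS = list(product([False, True], repeat=len(ATOMS)))  # all truth assignments

def holds(s, world):
    """Evaluate a sentence: an atom name, ("not", s), ("and", s, t) or ("or", s, t)."""
    if isinstance(s, str):
        return world[ATOMS.index(s)]
    if s[0] == "not":
        return not holds(s[1], world)
    if s[0] == "and":
        return holds(s[1], world) and holds(s[2], world)
    if s[0] == "or":
        return holds(s[1], world) or holds(s[2], world)
    raise ValueError(f"unknown connective: {s[0]}")

def field(weights):
    """Probability field induced by a probability measure over truth assignments."""
    return lambda s: sum(
        (w for w, world in zip(weights, WORLDS) if holds(s, world)), Fraction(0)
    )

def mix(t, p, q):
    """Convex combination t*p + (1-t)*q of two probability fields."""
    return lambda s: t * p(s) + (1 - t) * q(s)

def is_probability_field(p, sentences):
    """Check the three defining conditions on a finite sample of sentences."""
    tautology_ok = all(p(("or", s, ("not", s))) == 1 for s in sentences)
    negation_ok = all(p(("not", s)) == 1 - p(s) for s in sentences)
    split_ok = all(
        p(s) == p(("and", s, t)) + p(("and", s, ("not", t)))
        for s in sentences for t in sentences
    )
    return tautology_ok and negation_ok and split_ok

F = Fraction
p = field([F(1, 2), F(1, 4), F(1, 8), F(1, 8)])
q = field([F(0), F(0), F(1, 2), F(1, 2)])
m = mix(F(1, 2), p, q)  # should again be a probability field
SAMPLE = ["A", "B", ("and", "A", "B"), ("or", "A", ("not", "B"))]
```

Exact rational arithmetic is used so that the identities can be tested with equality rather than a floating-point tolerance.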

We define the energy of a probability field p to be E(p) := Σ_φ Σ_v 2^{-l(v)} E_{φ,v}(p(φ)). Here v are pieces of evidence as defined in the previous post, E_{φ,v} are their associated energy functions and l(v) is the length of (the encoding of) v. We assume that the encoding of v contains the encoding of the sentence φ for which it is evidence and E_{φ,v}(p(φ)) := 0 for all φ except the relevant one. Note that the associated energy functions are constructed in the same way as in the previous post, however they are not the same because of the self-referential nature of the construction: it refers to the final probability assignment.

The final probability assignment is defined to be

P(φ) = Integral_p [e^{-E(p)/T} p(φ)] / Integral_p e^{-E(p)/T}

Here T >= 0 is a parameter representing the magnitude of logical uncertainty. The integral is infinite-dimensional so it's not obviously well-defined. However, I suspect it can be defined by truncating to a finite set of statements and taking a limit wrt this set. In the limit T -> 0, the expression should correspond to computing the centroid of the set of minima of E (which is convex because E is convex).
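
A minimal numerical sketch of the role of T, under strong simplifying assumptions: a single sentence φ, the integral over probability fields replaced by a sum over a grid of candidate values for p(φ), and a made-up convex energy (x - 0.8)^2 standing in for the evidence-derived energy. None of these choices comes from the construction itself, and the function only handles T > 0.

```python
import math

def boltzmann_mean(values, energy, temperature):
    """Average of `values` weighted by exp(-energy(x) / temperature), for temperature > 0."""
    energies = [energy(x) for x in values]
    e_min = min(energies)  # shift energies so the largest weight is exp(0) = 1
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Candidate values for p(phi): a finite grid standing in for the space of fields.
grid = [k / 100 for k in range(101)]

# Hypothetical convex energy with its minimum at 0.8.
def toy_energy(x):
    return (x - 0.8) ** 2

cold = boltzmann_mean(grid, toy_energy, 1e-4)  # low T: concentrates on the minimum
hot = boltzmann_mean(grid, toy_energy, 1e6)    # high T: close to the plain average
```

At low temperature the weighted mean sits at the energy minimum, and at high temperature it drifts to the unweighted average of the grid. That matches the reading of T as the magnitude of logical uncertainty.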


  • Obviously this construction is merely a sketch and work is required to show that
    • The infinite-dimensional integrals are well-defined
    • The resulting probability assignment is coherent for consistent theories and T = 0
    • The system overcomes the Loebian obstacle for tiling agents in some formal sense
  • For practical application to AI we'd like an efficient way to evaluate these probabilities. Since the form of the probabilities is analogous to statistical physics, it is suggestive to use similarly inspired Monte Carlo algorithms.


Agents with Cartesian childhood and Physicalist adulthood

5 Squark 22 March 2014 08:20PM

Followup to: Updateless intelligence metrics in the multiverse

In the previous post I explained how to define a quantity that I called "the intelligence metric" which allows comparing intelligence of programs written for a given hardware. It is a development of the ideas by Legg and Hutter which accounts for the "physicality" of the agent i.e. that the agent should be aware it is part of the physical universe it is trying to model (this desideratum is known as naturalized induction). My construction of the intelligence metric exploits ideas from UDT, translating them from the realm of decision algorithms to the realm of programs which run on an actual piece of hardware with input and output channels, with all the ensuing limitations (in particular computing resource limitations).

In this post I present a variant of the formalism which overcomes a certain problem implicit in the construction. This problem has to do with overly strong sensitivity to the choice of a universal computing model used in constructing Solomonoff measure. The solution sheds some interesting light on how the development of the seed AI should occur.

Structure of this post:

  • A 1-paragraph recap of how the updateless intelligence formalism works. The reader interested in technical details is referred to the previous post.
  • Explanation of the deficiencies in the formalism I set out to overcome.
  • Explanation of the solution.
  • Concluding remarks concerning AI safety and future development.

TLDR of the previous formalism

The metric is a utility expectation value over a Solomonoff measure in the space of hypotheses describing a "Platonic ideal" version of the target hardware. In other words it is an expectation value over all universes containing this hardware in which the hardware cannot "break" i.e. violate the hardware's intrinsic rules. For example, if the hardware in question is a Turing machine, the rules are the time evolution rules of the Turing machine; if the hardware in question is a cellular automaton, the rules are the rules of the cellular automaton. This is consistent with the agent being Physicalist since the utility function is evaluated on a different universe (also distributed according to a Solomonoff measure) which isn't constrained to contain the hardware or follow its rules. The coupling between these two different universes is achieved via the usual mechanism of interaction between the decision algorithm and the universe in UDT i.e. by evaluating expectation values conditioned on logical counterfactuals.


The Solomonoff measure depends on choosing a universal computing model (e.g. a universal Turing machine). Solomonoff induction only depends on this choice weakly in the sense that any Solomonoff predictor converges to the right hypothesis given enough time. This has to do with the fact that Kolmogorov complexity only depends on the choice of universal computing model through an O(1) additive correction. It is thus a natural desideratum for the intelligence metric to depend on the universal computing model weakly in some sense. Intuitively, the agent in question should always converge to the right model of the universe it inhabits regardless of the Solomonoff prior with which it started. 
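The O(1) dependence mentioned above is the standard invariance theorem for Kolmogorov complexity. Written out for two universal computing models C and C' (the constant name c_{C,C'} is notation introduced here, not taken from the post):

```latex
K_C(x) \le K_{C'}(x) + c_{C,C'} \quad \text{for all } x,
\qquad\text{hence}\qquad
2^{-K_C(x)} \ge 2^{-c_{C,C'}} \, 2^{-K_{C'}(x)} .
```

Switching the universal model can therefore shrink the prior weight of any single hypothesis by at most a fixed multiplicative factor. The theorem puts no bound on how large that factor is, which is why "weak" dependence still leaves room for heavily biased priors.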

The problem with realizing this expectation has to do with exploration-exploitation tradeoffs. Namely, if the prior strongly expects a given universe, the agent would be optimized for maximal utility generation (exploitation) in this universe. This optimization can be so strong that the agent would lack the faculty to model the universe in any other way. This is markedly different from what happens with AIXI, since our agent has limited computing resources to spare and it is physicalist, so its source code might have side effects important to utility generation that have nothing to do with the computation implemented by the source code. For example, imagine that our Solomonoff prior assigns very high probability to a universe inhabited by Snarks. Snarks have the property that once they see a robot programmed with the machine code "000000..." they immediately produce a huge pile of utilons. On the other hand, when they see a robot programmed with any other code they immediately eat it and produce a huge pile of negative utilons. Such a prior would result in the code "000000..." being assigned the maximal intelligence value even though it is anything but intelligent. Observe that there is nothing preventing us from producing a Solomonoff prior with such bias since it is possible to set the probabilities of any finite collection of computable universes to any non-zero values with sum < 1.
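The Snark example can be made concrete with a toy calculation. The sketch below is only an illustration: the two universes, their prior weights, the utilities and the program names are all invented, and "intelligence" is reduced to a prior-weighted expected utility over a finite list of universes.

```python
# Toy illustration (all numbers are made up): a biased prior can crown a
# trivial program as the "most intelligent" one.

# Hypothetical prior over two universes. The weights sum to less than 1,
# leaving the remaining mass for all other computable universes.
prior = {"snark_world": 0.98, "ordinary_world": 0.01}


def utility(program: str, universe: str) -> float:
    """Hypothetical utility of running `program` in `universe`."""
    if universe == "snark_world":
        # Snarks reward the all-zeros code and punish everything else.
        return 100.0 if set(program) == {"0"} else -100.0
    # In the ordinary world only a program that models its environment does well.
    return 10.0 if program == "learner" else 0.0


def intelligence(program: str) -> float:
    """Prior-weighted expected utility over the listed universes."""
    return sum(weight * utility(program, u) for u, weight in prior.items())


scores = {p: intelligence(p) for p in ("000000", "learner")}
best = max(scores, key=scores.get)
print(best, scores)
```

With this prior the all-zeros code wins by a wide margin even though it does no modelling at all; lowering the weight of the Snark universe enough reverses the ranking.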

More precisely, the intelligence metric involves two Solomonoff measures: the measure of the "Platonic" universe and the measure of the physical universe. The latter is not really a problem since it can be regarded as part of the utility function. The utility-agnostic version of the formalism assumes a program for computing the utility function is read by the agent from a special storage. There is nothing to stop us from postulating that the agent reads another program from that storage which is the universal computer used for defining the Solomonoff measure over the physical universe. However, this doesn't solve our problem since even if the physical universe is distributed with a "reasonable" Solomonoff measure (assuming there is such a thing), the Platonic measure determines in which portions of the physical universe (more precisely multiverse) our agent manifests.

There is another way to think about this problem. If the seed AI knows nothing about the universe except the working of its own hardware and software, the Solomonoff prior might be insufficient "information" to prevent it from making irreversible mistakes early on. What we would like to do is to endow it from the first moment with the sum of our own knowledge, but this might prove to be very difficult.


Imagine the hardware architecture of our AI to be composed of two machines. One I call the "child machine", the other the "adult machine". The child machine receives data from the same input channels (and "utility storage") as the adult machine and is able to read the internal state of the adult machine itself or at least the content of its output channels. However, the child machine has no output channels of its own. The child machine has special memory called "template memory" to which it has unlimited write access. There is a single moment in time ("end of childhood"), determined by factors external to both machines (i.e. the human operator), at which the content of the template memory is copied into the instruction space of the adult machine. Thus, the child machine's entire role is making observations and using them to prepare a program for the adult machine, which will eventually be loaded into the latter.

The new intelligence metric assigns intelligence values to programs for the child machine. For each hypothesis describing the Platonic universe (which now contains both machines, the end of childhood time value and the entire ruleset of the system) we compute the utility expectation value under the following logical counterfactual condition: "The program loaded into template memory at the end of childhood is the same as would result from the given program for the child machine if this program were run with the inputs actually produced by the given hypothesis regarding the Platonic universe". The intelligence value is then the expectation value of that quantity with respect to a Solomonoff measure over hypotheses describing the Platonic universe.

The important property of the logical counterfactual is that it doesn't state the given program is actually loaded into the child machine. It only says the resulting content of the template memory is the same as that which would be obtained from the given program assuming all the laws of the Platonic universe hold. This formulation prevents exploitation of side effects of the child source code since the condition doesn't fix the source code, only its output. Effectively, the child agent considers itself to be Cartesian, i.e. it can consider neither the side effects of its computations nor the possibility that the physical universe will violate the laws of its machinery. On the other hand the child's output (the mature program) is a physicalist agent since it affects the physical universe by manifesting in it.

If such an AI is implemented in practice, it makes sense to prime the adult machine with a "demo" program which will utilize the output channels in various ways and do some "exploring" using its input channels. This would serve to provide the child with as much information as possible.
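The data flow of the two-machine architecture can be sketched as a small simulation. This is a minimal sketch under strong simplifying assumptions: inputs are single bits, a "program" is just a Python callable, and the child's learning rule (predict the majority bit seen so far) is invented purely for illustration. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Program = Callable[[int], int]  # an adult program maps an input bit to an output


@dataclass
class AdultMachine:
    program: Program
    outputs: List[int] = field(default_factory=list)

    def step(self, observation: int) -> None:
        self.outputs.append(self.program(observation))


@dataclass
class ChildMachine:
    # The child has no output channel; it can only write to template memory.
    template_memory: Optional[Program] = None
    seen: List[int] = field(default_factory=list)

    def step(self, observation: int, adult_output: int) -> None:
        # adult_output is readable by the child but unused by this toy rule.
        self.seen.append(observation)
        majority = int(2 * sum(self.seen) >= len(self.seen))
        self.template_memory = lambda _obs, m=majority: m


def run(inputs: List[int], end_of_childhood: int, demo_program: Program) -> List[int]:
    adult = AdultMachine(program=demo_program)  # primed with a "demo" program
    child = ChildMachine()
    for t, obs in enumerate(inputs):
        if t == end_of_childhood and child.template_memory is not None:
            # The externally fixed moment: template memory is copied into the adult.
            adult.program = child.template_memory
        adult.step(obs)
        if t < end_of_childhood:
            child.step(obs, adult.outputs[-1])
    return adult.outputs


print(run([1, 1, 0, 1, 1, 0], end_of_childhood=3, demo_program=lambda obs: 0))
```

Until the end of childhood the adult runs the demo program while the child only observes; afterwards the adult runs whatever the child left in template memory.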

To sum up, the new expression for the intelligence metric is:

I(q) = E_{H_X}[ E_{H_Y(E_c(X))}[ E_L[ U(Y, E_u(X)) | Q(X, t(X)) = Q*(X; q) ] ] | N ]


  • q is the program priming the child machine
  • H_X is the hypothesis producing the Platonic universe X (a sequence of bits encoding the state of the hardware as a function of time and the end-of-childhood time t(X)). It is a program for a fixed universal computing model C.
  • H_Y is the hypothesis producing the Physical universe Y (an abstract sequence of bits). It is a program for the universal computer program ("virtual machine") E_c(X) written into storage E in X.
  • E_L is the logical expectation value defined e.g. using evidence logic.
  • E_u(X) is a program for computing the utility function which is written into storage E in X.
  • U is the utility function which consists of applying E_u(X) to Y.
  • Q(X, t(X)) is the content of template memory at time t(X).
  • Q*(X; q) is the content that would be in the template memory if it was generated by program q receiving the inputs going into the child machine under hypothesis H_X.
  • N is the full ruleset of the hardware including the reprogramming of the adult machine that occurs at t(X).

Concluding Remarks

  • It would be very valuable to formulate and prove a mathematical theorem which expresses the sense in which the new formalism depends on the choice of universal computing model weakly (in particular it would validate the notion).
  • This formalism might have an interesting implication for AI safety. Since the child agent is Cartesian and has no output channels (it cannot create output channels because it is Cartesian) it doesn't present as much risk as an adult AI. Imagine template memory is write-only (which is not a problem for the formalism) and is implemented by a channel that doesn't store the result anywhere (in particular the mature program is never run). There can still be risk due to side effects of the mature program that manifest through the presence of its partial or full versions in (non-template) memory of the child machine. For example, imagine the mature program is such that any person who reads it experiences a compulsion to run it. This risk can be mitigated by allowing both machines to interact only with a virtual world which receives no inputs from the external reality. Of course the AI might still be able to deduce external reality. However, this can be prevented by exploiting prior bias: we can equip the AI with a Solomonoff prior that favors the virtual world to such an extent that it would have no reason to deduce the real world. This way the AI is safe unless it invents a "generic" box-escaping protocol which would work in a huge variety of different universes that might contain the virtual world.
  • If we factor finite logical uncertainty into evaluation of the logical expectation value E_L, the plot thickens. Namely, a new problem arises related to bias in the "logic prior". To solve this new problem we need to introduce yet another stage into AI development which might be dubbed "fetus". The fetus has no access to external inputs and is responsible for building a sufficient understanding of mathematics in the same sense the child is responsible for building a sufficient understanding of physics. Details will follow in subsequent posts, so stay tuned!

Friendly AI ideas needed: how would you ban porn?

6 Stuart_Armstrong 17 March 2014 06:00PM

To construct a friendly AI, you need to be able to make vague concepts crystal clear, cutting reality at the joints when those joints are obscure and fractal - and then implement a system that implements that cut.

There are lots of suggestions on how to do this, and a lot of work in the area. But having been over the same turf again and again, it's possible we've got a bit stuck in a rut. So to generate new suggestions, I'm proposing that we look at a vaguely analogous but distinctly different question: how would you ban porn?

Suppose you're put in charge of some government and/or legal system, and you need to ban pornography, and see that the ban is implemented. Pornography is the problem, not eroticism. So a lonely lower-class guy wanking off to "Fuck Slaves of the Caribbean XIV" in a Pussycat Theatre is completely off. But a middle-class couple experiencing a delicious frisson when they see a nude version of "Pirates of Penzance" at the Met is perfectly fine - commendable, even.

The distinction between the two cases is certainly not easy to spell out, and many are reduced to saying the equivalent of "I know it when I see it" when defining pornography. In terms of AI, this is equivalent to "value loading": refining the AI's values through interactions with human decision makers, who answer questions about edge cases and examples and serve as "learned judges" for the AI's concepts. But suppose that approach was not available to you - what methods would you implement to distinguish between pornography and eroticism, and ban one but not the other? Would they be sufficiently clear that a scriptwriter would know exactly what they need to cut or add to a movie in order to move it from one category to the other? What if the nude "Pirates of Penzance" was at a Pussycat Theatre and "Fuck Slaves of the Caribbean XIV" was at the Met?

To get maximal creativity, it's best to ignore the ultimate aim of the exercise (to find inspirations for methods that could be adapted to AI) and just focus on the problem itself. Is it even possible to get a reasonable solution to this question - a question much simpler than designing a FAI?

Reduced impact in practice: randomly sampling the future

1 Stuart_Armstrong 17 March 2014 04:30PM

This was an idea developed along with Eliezer when he was at the FHI in November.

In my post on "reduced impact AI" I talked about how you could estimate the impact of an AI by (hypothetically) observing a slice of the future. When Eliezer was here last November, we talked about how this could be cashed out, and came up with a model.

Ignore the "output channel" from the original reduced impact post, and consider this situation:

We have two computers running in two similarly shielded rooms. One of them is a simple algorithm running through the digits of e (call it run-e), that was deleted soon after. Another is our dear old friend clippy, the AI motivated to maximise paperclips. Without removing clippy's paperclip preferences, we'd like to motivate it to have next to no impact on the world. How should we define "next to no impact"?

The obvious thing is to compare it with run-e, and require it to have no more expected impact than that machine. How could this be measured? Well, the original idea was to put a stochastic process on clippy as it was turned on (inside the shielded room) and say that clippy has reduced impact if we can't find any extra evidence, from a vantage point 50 years in the future, that clippy was successfully turned on. Now put the same stochastic process on run-e and define:

Clippy has reduced impact if, from a vantage of 50 years into the future, we have no more evidence that clippy was turned on than we have of run-e being turned on.


Overcoming the Loebian obstacle using evidence logic

4 Squark 14 March 2014 06:34PM

In this post I intend to:

  • Briefly explain the Loebian obstacle and its relevance to AI (feel free to skip it if you know what the Loebian obstacle is).
  • Suggest a solution in the form of a formal system which assigns probabilities (more generally probability intervals) to mathematical sentences (and which admits a form of "Loebian" self-referential reasoning). The method is well-defined both for consistent and inconsistent axiomatic systems, the latter being important in the analysis of logical counterfactuals like in UDT.



When can we consider a mathematical theorem to be established? The obvious answer is: when we proved it. Wait, proved it in what theory? Well, that's debatable. ZFC is a popular choice for mathematicians, but how do we know it is consistent (let alone sound, i.e. that it only proves true sentences)? All those spooky infinite sets, how do you know it doesn't break somewhere along the line? There's lots of empirical evidence, but we can't prove it, and it's proofs we're interested in, not mere evidence, right?

Peano arithmetic seems like a safer choice. After all, if the natural numbers don't make sense, what does? Let's go with that. Suppose we have a sentence s in the language of PA. If someone presents us with a proof p in PA, we believe s is true. Now consider the following situation: instead of giving you a proof of s, someone gave you a PA-proof p1 that p exists. After all, PA admits defining "PA-proof" in PA language. Common sense tells us that p1 is a sufficient argument to believe s. Maybe we can prove it within PA? That is, if we have a proof of "if a proof of s exists then s" and a proof of R(s)="a proof of s exists" then we just proved s. That's just modus ponens.

There are two problems with that.

First, there's no way to prove the sentence L:="for all s if R(s) then s", since it's not a PA-sentence at all. The problem is that "for all s" references s as a natural number encoding a sentence. On the other hand, "then s" references s as the truth-value of the sentence. Maybe we can construct a PA-formula T(s) which means "the sentence encoded by the number s is true"? Nope, that would get us in trouble with the liar paradox (it would be possible to construct a sentence saying "this sentence is false").

Second, Loeb's theorem says that if we can prove L(s):="if R(s) then s" for a given s, then we can prove s. This is a problem since it means there can be no way to prove L(s) for all s in any sense, since it's unprovable for s which are unprovable. In other words, if you proved not-s, there is no way to conclude that "no proof of s exists".
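For reference, Loeb's theorem in conventional notation, where the provability predicate Prov_PA(⌜s⌝) plays the role of R(s) above (the Prov notation is the standard one, not taken from this post):

```latex
\text{If } \mathrm{PA} \vdash \bigl(\mathrm{Prov}_{\mathrm{PA}}(\ulcorner s \urcorner) \rightarrow s\bigr)
\text{ then } \mathrm{PA} \vdash s .
```

Taking s to be a contradiction recovers Goedel's second incompleteness theorem: a consistent PA cannot prove that no proof of a contradiction exists.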

What if we add an inference rule Q to our logic allowing us to go from R(s) to s? Let's call the new formal system PA1. Now p1 appended by a Q-step becomes an honest proof of s in PA1. Problem solved? Not really! Now someone can give you a proof of R1(s):="a PA1-proof of s exists". Back to square one! Wait a second, what if we add a new rule Q1 allowing us to go from R1(s) to s? OK, but now we got R2(s):="a PA2-proof of s exists". Hmm, what if we add an infinite number of rules Qk? Fine, but now we got Rω(s):="a PAω-proof of s exists". And so on, and so forth, the recursive ordinals are aplenty...

Bottom line, Loeb's theorem works for any theory containing PA, so we're stuck.


Suppose you're trying to build a self-modifying AGI called "Lucy". Lucy works by considering possible actions and looking for formal proofs that taking one of them will increase expected utility. In particular, it has self-modifying actions in its strategy space. A self-modifying action creates essentially a new agent: Lucy2. How can Lucy decide that becoming Lucy2 is a good idea? Well, a good step in this direction would be proving that Lucy2 would only take actions that are "good". I.e., we would like Lucy to reason as follows: "Lucy2 uses the same formal system as I, so if she decides to take action a, it's because she has a proof p of the sentence s(a) that 'a increases expected utility'. Since such a proof exists, a does increase expected utility, which is good news!" Problem: Lucy is using L in there, applied to her own formal system! That cannot work! So, Lucy would have a hard time self-modifying in a way which doesn't make its formal system weaker.

As another example where this poses a problem, suppose Lucy observes another agent called "Kurt". Lucy knows, by analyzing her sensory evidence, that Kurt proves theorems using the same formal system as Lucy. Suppose Lucy found out that Kurt proved theorem s, but she doesn't know how. We would like Lucy to be able to conclude s is, in fact, true (at least with the probability that her model of physical reality is correct). Alas, she cannot.

See MIRI's paper for more discussion.

Evidence Logic

Here, cousin_it explains a method to assign probabilities to sentences in an inconsistent theory T. It works as follows. Consider sentence s. Since T is inconsistent, there are T-proofs both of s and of not-s. Well, in a courtroom both sides are allowed to have arguments, why not try the same approach here? Let's weight the proofs as a function of their length, analogously to weighting hypotheses in Solomonoff induction. That is, suppose we have a prefix-free encoding of proofs as bit sequences. Then, it makes sense to consider a random bit sequence and ask whether it is a proof of something. Define the probability of s to be

P(s) := (probability of a random sequence to be a proof of s) / (probability of a random sequence to be a proof of s or not-s)
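The ratio above can be computed directly once proof lengths are known. The sketch below assumes we are handed a finite list of encoded proof lengths (in bits) for s and for not-s; in an inconsistent theory there are infinitely many proofs, so a finite list is only a truncation. The function name and the example lengths are made up.

```python
from fractions import Fraction
from typing import Iterable


def proof_weighted_probability(
    proofs_of_s: Iterable[int], proofs_of_not_s: Iterable[int]
) -> Fraction:
    """P(s) from the bit lengths of encoded proofs of s and of not-s.

    With a prefix-free encoding, a uniformly random bit sequence begins with
    a particular codeword of length n with probability 2**-n.
    """
    weight_for = sum(Fraction(1, 2 ** n) for n in proofs_of_s)
    weight_against = sum(Fraction(1, 2 ** n) for n in proofs_of_not_s)
    total = weight_for + weight_against
    if total == 0:
        raise ValueError("no proof of s or of not-s was supplied")
    return weight_for / total


# Two proofs of s (10 and 12 bits) against one proof of not-s (11 bits).
print(proof_weighted_probability([10, 12], [11]))
```

Shorter proofs dominate: removing the 10-bit proof of s in this example drops P(s) from 5/7 to 1/3.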

Nice, but it doesn't solve the Loebian obstacle yet.

I will now formulate an extension of this idea that allows assigning an interval of probabilities [Pmin(s), Pmax(s)] to any sentence s. This interval is a sort of "Knightian uncertainty". I have some speculations how to extract a single number from this interval in the general case, but even without that, I believe that Pmin(s) = Pmax(s) in many interesting cases.

First, the general setting:

  • With every sentence s, there are certain texts v which are considered to be "evidence relevant to s". These are divided into "negative" and "positive" evidence. We define sgn(v) := +1 for positive evidence, sgn(v) := -1 for negative evidence.
  • Each piece of evidence v is associated with the strength of the evidence str_s(v), which is a number in [0, 1].
  • Each piece of evidence v is associated with an "energy" function e_{s,v} : [0, 1] -> [0, 1]. It is a continuous convex function.
  • The "total energy" associated with s is defined to be e_s := ∑_v 2^{-l(v)} e_{s,v}, where l(v) is the length of v.
  • Since the e_{s,v} are continuous and convex, so is e_s. Hence it attains its minimum on a closed interval, which is [Pmin(s), Pmax(s)] by definition.
Now, the details:
  • A piece of evidence v for s is defined to be one of the following:
    • a proof of s
      • sgn(v) := +1
      • str_s(v) := 1
      • e_{s,v}(q) := (1 - q)^2
    • a proof of not-s
      • sgn(v) := -1
      • str_s(v) := 1
      • e_{s,v}(q) := q^2
    • a piece of positive evidence for the sentence R_{-+}(s, p) := "Pmin(s) >= p"
      • sgn(v) := +1
      • str_s(v) := str_{R_{-+}(s, p)}(v) · p
      • e_{s,v}(q) := 0 for q > p; str_{R_{-+}(s, p)}(v) · (q - p)^2 for q < p
    • a piece of negative evidence for the sentence R_{--}(s, p) := "Pmin(s) < p"
      • sgn(v) := +1
      • str_s(v) := str_{R_{--}(s, p)}(v) · p
      • e_{s,v}(q) := 0 for q > p; str_{R_{--}(s, p)}(v) · (q - p)^2 for q < p
    • a piece of negative evidence for the sentence R_{++}(s, p) := "Pmax(s) > p"
      • sgn(v) := -1
      • str_s(v) := str_{R_{++}(s, p)}(v) · (1 - p)
      • e_{s,v}(q) := 0 for q < p; str_{R_{++}(s, p)}(v) · (q - p)^2 for q > p
    • a piece of positive evidence for the sentence R_{+-}(s, p) := "Pmax(s) <= p"
      • sgn(v) := -1
      • str_s(v) := str_{R_{+-}(s, p)}(v) · (1 - p)
      • e_{s,v}(q) := 0 for q < p; str_{R_{+-}(s, p)}(v) · (q - p)^2 for q > p
Technicality: I suggest that for our purposes, a "proof of s" is allowed to be a proof of a sentence equivalent to s in 0-th order logic (e.g. not-not-s). This ensures that our probability intervals obey the properties we'd like them to obey wrt propositional calculus.
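To see how the interval [Pmin(s), Pmax(s)] falls out of the energy picture, here is a rough numerical sketch. It assumes a finite list of evidence pieces, each given as a pair (length in bits, energy function), takes the strength of bound evidence as a plain parameter rather than computing it recursively, and locates the minimisers of the total energy on a uniform grid. The helper names are hypothetical.

```python
from typing import Callable, List, Tuple

Energy = Callable[[float], float]


def proof_energy(q: float) -> float:
    """Energy contributed by a proof of s."""
    return (1 - q) ** 2


def disproof_energy(q: float) -> float:
    """Energy contributed by a proof of not-s."""
    return q ** 2


def lower_bound_energy(strength: float, p: float) -> Energy:
    """Evidence pushing Pmin(s) up to at least p; zero energy for q >= p."""
    return lambda q: strength * (q - p) ** 2 if q < p else 0.0


def upper_bound_energy(strength: float, p: float) -> Energy:
    """Evidence pushing Pmax(s) down to at most p; zero energy for q <= p."""
    return lambda q: strength * (q - p) ** 2 if q > p else 0.0


def probability_interval(
    evidence: List[Tuple[int, Energy]], grid: int = 1001, tol: float = 1e-12
) -> Tuple[float, float]:
    """Approximate [Pmin, Pmax] as the grid points minimising the total energy."""
    qs = [i / (grid - 1) for i in range(grid)]
    totals = [sum(2.0 ** -length * e(q) for length, e in evidence) for q in qs]
    lowest = min(totals)
    minimisers = [q for q, total in zip(qs, totals) if total <= lowest + tol]
    return minimisers[0], minimisers[-1]


# Equally long proofs of s and not-s: the interval collapses to a point.
print(probability_interval([(5, proof_energy), (5, disproof_energy)]))
# Only bound evidence: the minimum is attained on a whole interval.
print(probability_interval([(4, lower_bound_energy(1.0, 0.3)),
                            (4, upper_bound_energy(1.0, 0.7))]))
```

Because each energy is convex, the set of minimisers is an interval; it is a single point whenever some piece of evidence is strictly convex there, as with direct proofs.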

Now, consider again our self-modifying agent Lucy. Suppose she makes her decisions according to a system of evidence logic like above. She can now reason along the lines of "Lucy2 uses the same formal system as I. If she decides to take action a, it's because she has strong evidence for the sentence s(a) that 'a increases expected utility'. I just proved that there would be strong evidence for the expected utility increasing. Therefore, the expected utility would have a high value with high logical probability. But evidence for high logical probability of a sentence is evidence for the sentence itself. Therefore, I now have evidence that expected utility will increase!"

This analysis is very sketchy, but I think it lends hope that the system leads to the desired results.

Updateless Intelligence Metrics in the Multiverse

6 Squark 08 March 2014 12:25AM

Followup to: Intelligence Metrics with Naturalized Induction using UDT

In the previous post I have defined an intelligence metric solving the duality (aka naturalized induction) and ontology problems in AIXI. This model used a formalization of UDT using Benja's model of logical uncertainty. In the current post I am going to:

  • Explain some problems with my previous model (that section can be skipped if you don't care about the previous model and only want to understand the new one).
  • Formulate a new model solving these problems. Incidentally, the new model is much closer to the usual way UDT is represented. It is also based on a different model of logical uncertainty.
  • Show how to define intelligence without specifying the utility function a priori.
  • Outline a method for constructing utility functions for the new model. The new model requires utility functions formulated with an abstract ontology, i.e. well-defined on the entire Tegmark level IV multiverse. These are generally difficult to construct (i.e. the ontology problem resurfaces in a different form).

Problems with UIM 1.0

The previous model postulated that naturalized induction uses a version of Solomonoff induction updated in the direction of an innate model N with a temporal confidence parameter t. This entails several problems:

  • The dependence on the parameter t whose relevant value is not easy to determine.
  • Conceptual divergence from the UDT philosophy that we should not update at all.
  • Difficulties with counterfactual mugging and acausal trade scenarios in which G doesn't exist in the "other universe".
  • Once G discovers even a small violation of N at a very early time, it loses all ground for trusting its own mind. Effectively, G would find itself in the position of a Boltzmann brain. This is especially dangerous when N over-specifies the hardware running G's mind. For example assume N specifies G to be a human brain modeled on the level of quantum field theory (particle physics). If G discovers that in truth it is a computer simulation on the merely molecular level, it loses its epistemic footing completely.

UIM 2.0

I now propose the following intelligence metric (the formula goes first and then I explain the notation):

I_U(q) := E_T[ E_D[ E_L[ U(Y(D)) | Q(X(T)) = q ] ] | N ]

  • N is the "ideal" model of the mind of the agent G. For example, it can be a universal Turing machine M with special "sensory" registers e whose values can change arbitrarily after each step of M. N is specified as a system of constraints on an infinite sequence of natural numbers X, which should be thought of as the "Platonic ideal" realization of G, i.e. an imaginary realization which cannot be tampered with by external forces such as anvils. As we shall see, this "ideal" serves as a template for "physical" realizations of G which are prone to violations of N.
  • Q is a function that decodes G's code from X e.g. the program loaded in M at time 0. q is a particular value of this code whose (utility specific) intelligence I_U(q) we are evaluating.
  • T is a random (as in random variable) computable hypothesis about the "physics" of X, i.e. a program computing X implemented on some fixed universal computing model (e.g. universal Turing machine) C. T is distributed according to the Solomonoff measure; however, the expectation value in the definition of I_U(q) is conditional on N, i.e. we restrict to programs which are compatible with N. From the UDT standpoint, T is the decision algorithm itself and the uncertainty in T is "introspective" uncertainty i.e. the uncertainty of the putative precursor agent PG (the agent creating G e.g. an AI programmer) regarding her own decision algorithm. Note that we don't actually need to postulate a PG which is "agenty" (i.e. use for N a model of AI hardware together with a model of the AI programmer programming this hardware), we can be content to remain in a more abstract framework.
  • D is a random computable hypothesis about the physics of Y, where Y is an infinite sequence of natural numbers representing the physical (as opposed to "ideal") universe. D is distributed according to the Solomonoff measure and the respective expectation value is unconditional (i.e. we use the raw Solomonoff prior for Y which makes the model truly updateless). In UDT terms, D is indexical uncertainty.
  • U is a computable function from infinite sequences of natural numbers to [0, 1] representing G's utility function. 
  • L represents logical uncertainty. It can be defined by the model explained by cousin_it here, together with my previous construction for computing logical expectation values of random variables in [0, 1]. That is, we define E_L(d_k) to be the probability that a random string of bits p encodes a proof of the sentence "Q(X(T)) = q implies that the k-th digit of U(Y(D)) is 1" in some prefix-free encoding of proofs conditional on p encoding the proof of either that sentence or the sentence "Q(X(T)) = q implies that the k-th digit of U(Y(D)) is 0". We then define 
    E_L[U(Y(D)) | Q(X(T)) = q] := Σ_k 2^{-k} E_L(d_k). Here, the sentences and the proofs belong to some fixed formal logic F, e.g. Peano arithmetic or ZFC.
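The digit-wise definition of the logical expectation value can be sketched numerically. The sketch assumes the digits are binary and indexed from k = 1, and that the per-digit probabilities E_L(d_k) are already available (for instance from a proof-length ratio like the one cousin_it describes); the sum is truncated to the digits supplied. The function name is hypothetical.

```python
from typing import Sequence


def logical_expectation(digit_probabilities: Sequence[float]) -> float:
    """Truncated sum over k >= 1 of 2**-k * E_L(d_k).

    digit_probabilities[k - 1] is the logical probability that the k-th
    binary digit of the utility (a number in [0, 1]) equals 1.
    """
    return sum(p * 2.0 ** -k for k, p in enumerate(digit_probabilities, start=1))


# Digits known with certainty reproduce the binary expansion 0.101 = 0.625.
print(logical_expectation([1.0, 0.0, 1.0]))
# Complete ignorance about the first two digits.
print(logical_expectation([0.5, 0.5]))
```

By linearity the result is the expected value of the utility whenever the digit probabilities come from a single coherent distribution over the digits.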


  • G's mental architecture N is defined in the "ideal" universe X where it is inviolable. However, G's utility function U inhabits the physical universe Y. This means that a highly intelligent q is designed so that imperfect realizations of G inside Y generate as many utilons as possible. A typical T is a low Kolmogorov complexity universe which contains a perfect realization of G. Q(X(T)) is L-correlated to the programming of imperfect realizations of G inside Y because T serves as an effective (approximate) model of the formation of these realizations. For abstract N, this means q is highly intelligent when a Solomonoff-random "M-programming process" producing q entails a high expected value of U.
  • Solving the Loebian obstacle requires a more sophisticated model of logical uncertainty. I think I can formulate such a model. I will explain it in another post after more contemplation.
  • It is desirable that the encoding of proofs p satisfies a universality property so that the length of the encoding can only change by an additive constant, analogously to the weak dependence of Kolmogorov complexity on C. It is in fact not difficult to formulate this property and show the existence of appropriate encodings. I will discuss this point in more detail in another post.

Generic Intelligence

It seems conceptually desirable to have a notion of intelligence independent of the specifics of the utility function. Such an intelligence metric can be constructed in a way analogous to what I've done in UIM 1.0; however, it is no longer a special case of the utility-specific metric.

Assume N to consist of a machine M connected to a special storage device E. Assume further that at X-time 0, E contains a valid C-program u realizing a utility function U, but that this is the only constraint on the initial content of E imposed by N. Define

I(q) := E_T[ E_D[ E_L[ u(Y(D); X(T)) | Q(X(T)) = q ] ] | N ]

Here, u(Y(D); X(T)) means that we decode u from X(T) and evaluate it on Y(D). Thus utility depends both on the physical universe Y and the ideal universe X. This means G is not precisely a UDT agent but rather a "proto-agent": only when a realization of G reads u from E does it know which other realizations of G in the multiverse (the Solomonoff ensemble from which Y is selected) should be considered as the "same" agent UDT-wise.

Incidentally, this can be used as a formalism for reasoning about agents that don't know their utility functions. I believe this has important applications in metaethics I will discuss in another post.

Utility Functions in the Multiverse

UIM 2.0 is a formalism that solves the diseases of UIM 1.0 at the price of losing N in the capacity of the ontology for utility functions. We need the utility function to be defined on the entire multiverse i.e. on any sequence of natural numbers. I will outline a way to extend "ontology-specific" utility functions to the multiverse through a simple example.

Suppose G is an agent that cares about universes realizing the Game of Life, its utility function U corresponding to e.g. some sort of glider maximization with exponential temporal discount. Fix a specific way DC to decode any Y into a history of a 2D cellular automaton with two cell states ("dead" and "alive"). Our multiversal utility function U* assigns Ys for which DC(Y) is a legal Game of Life the value U(DC(Y)). All other Ys are treated by dividing the cells into cells O obeying the rules of Life and cells V violating the rules of Life. We can then evaluate U on O only (assuming it has some sort of locality) and assign V utility by some other rule, e.g.:

  • zero utility
  • constant utility per V cell with temporal discount
  • constant utility per unit of surface area of the boundary between O and V with temporal discount
U*(Y) is then defined to be the sum of the values assigned to O(Y) and V(Y).
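The O/V split can be illustrated on a small finite board. The sketch below makes several assumptions that are not in the post: the history is already decoded into a list of 2D grids, cells beyond the board edge count as dead, the "native" utility U is replaced by a discounted count of live rule-obeying cells (a stand-in for glider maximization), and V cells get a constant utility per cell with temporal discount (the second rule in the list above).

```python
from typing import List

Grid = List[List[int]]


def life_rule(grid: Grid, r: int, c: int) -> int:
    """State that Conway's rules assign to cell (r, c) at the next step."""
    rows, cols = len(grid), len(grid[0])
    neighbours = sum(
        grid[i][j]
        for i in range(max(0, r - 1), min(rows, r + 2))
        for j in range(max(0, c - 1), min(cols, c + 2))
        if (i, j) != (r, c)
    )
    return 1 if neighbours == 3 or (grid[r][c] == 1 and neighbours == 2) else 0


def multiversal_utility(
    history: List[Grid], per_violation: float = -1.0, discount: float = 0.5
) -> float:
    """Toy U*: discounted live cells among O cells plus a penalty per V cell."""
    total = 0.0
    for t in range(1, len(history)):
        previous, current = history[t - 1], history[t]
        weight = discount ** t
        for r in range(len(current)):
            for c in range(len(current[0])):
                if current[r][c] == life_rule(previous, r, c):
                    total += weight * current[r][c]   # O cell: obeys the rules
                else:
                    total += weight * per_violation   # V cell: violates them
    return total


blinker = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
legal_next = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
illegal_next = [[1, 0, 0], [1, 1, 1], [0, 0, 0]]  # top-left cell breaks the rules

print(multiversal_utility([blinker, legal_next]))
print(multiversal_utility([blinker, illegal_next]))
```

On a legal history the result reduces to the toy U alone; the single violating cell in the second history only adds its own penalty, which reflects the additive accumulation between O and V.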


  • The construction of U* depends on the choice of DC. However, U* only depends on DC weakly since given a hypothesis D which produces a Game of Life wrt some other low complexity encoding, there is a corresponding hypothesis D' producing a Game of Life wrt DC. D' is obtained from D by appending a corresponding "transcoder" and thus it is only less Solomonoff-likely than D by an O(1) factor.
  • Since the accumulation between O and V is additive rather than e.g. multiplicative, a U*-agent doesn't behave as if it a priori expects the universe to follow the rules of Life but may have strong preferences about the universe actually doing it.
  • This construction is reminiscent of Egan's dust theory in the sense that all possible encodings contribute. However, here they are weighted by the Solomonoff measure.


The intelligence of a physicalist agent is defined to be the UDT-value of the "decision" to create the agent by the process creating the agent. The process is selected randomly from a Solomonoff measure conditional on obeying the laws of the hardware on which the agent is implemented. The "decision" is made in an "ideal" universe in which the agent is Cartesian, but the utility function is evaluated on the real universe (raw Solomonoff measure). The interaction between the two "universes" is purely via logical conditional probabilities (acausal).

If we want to discuss intelligence without specifying a utility function up front, we allow the "ideal" agent to read a program describing the utility function from a special storage immediately after "booting up".

Utility functions in the Tegmark level IV multiverse are defined by specifying a "reference universe", specifying an encoding of the reference universe and extending a utility function defined on the reference universe to encodings which violate the reference laws by summing the utility of the portion of the universe which obeys the reference laws with some function of the space-time shape of the violation.

How to Study Unsafe AGI's safely (and why we might have no choice)

10 Punoxysm 07 March 2014 07:24AM


A serious possibility is that the first AGI(s) will be developed in a Manhattan Project style setting before any sort of friendliness/safety constraints can be integrated reliably. They will also be substantially short of the intelligence required to exponentially self-improve. Within a certain range of development and intelligence, containment protocols can make them safe to interact with. This means they can be studied experimentally, and the architecture(s) used to create them better understood, furthering the goal of safely using AI in less constrained settings.

Setting the Scene

The year is 2040, and in the last decade a series of breakthroughs in neuroscience, cognitive science, machine learning, and computer hardware have put the long-held dream of a human-level artificial intelligence in our grasp. Following the wild commercial success of lifelike robotic pets, the integration into everyday work and leisure of AI assistants and concierges, and STUDYBOT's graduation from Harvard's Online degree program with an octuple major and full honors, DARPA, the NSF and the European Research Council have announced joint funding of an artificial intelligence program that will create a superhuman intelligence in 3 years.

Safety was announced as a critical element of the project, especially in light of the self-modifying LeakrVirus that catastrophically disrupted markets in '36 and '37. The planned protocols have not been made public, but it seems they will be centered on traditional computer security rather than techniques from the nascent field of Provably Safe AI, which were deemed impossible to integrate on the current project timeline.

Technological and/or Political issues could force the development of AI without theoretical safety guarantees that we'd certainly like, but there is a silver lining

A lot of the discussion around LessWrong and MIRI that I've seen (and I haven't seen all of it, please send links!) seems to focus very strongly on the situation of an AI that can self-modify or construct further AIs, resulting in an exponential explosion of intelligence (FOOM/Singularity). The focus on FAI is on finding an architecture that can be explicitly constrained (and a constraint set that won't fail to do what we desire).

My argument is essentially that there could be a critical multi-year period preceding any possible exponentially self-improving intelligence during which a series of AGIs of varying intelligence, flexibility and architecture will be built. This period will be fast and frantic, but it will be incredibly fruitful and vital both in figuring out how to make an AI sufficiently strong to exponentially self-improve and in how to make it safe and friendly (or develop protocols to bridge the even riskier period between when we can develop FOOM-capable AIs and when we can ensure their safety). 

I'll break this post into three parts.
  1. First, why is a substantial period of proto-singularity more likely than a straight-to-singularity situation?
  2. Second, what strategies will be critical to developing, controlling, and learning from these pre-FOOM AIs?
  3. Third, what are the political challenges that will develop immediately before and during this period?
Why is a proto-singularity likely?

The requirement for a hard singularity, an exponentially self-improving AI, is that the AI can substantially improve itself in a way that enhances its ability to further improve itself, which requires the ability to modify its own code; access to resources like time, data, and hardware to facilitate these modifications; and the intelligence to execute a fruitful self-modification strategy.

The first two conditions can (and should) be directly restricted. I'll elaborate more on that later, but basically any AI should be very carefully sandboxed (unable to affect its software environment), and should have its access to resources strictly controlled. Perhaps no data goes in without human approval or while the AI is running. Perhaps nothing comes out either. Even a hyperpersuasive hyperintelligence will be slowed down (at least) if it can only interact with prespecified tests (how do you test AGI? No idea, but it shouldn't be harder than friendliness). This isn't a perfect situation. Eliezer Yudkowsky presents several arguments for why an intelligence explosion could happen even when resources are constrained (see Section 3 of Intelligence Explosion Microeconomics), not to mention ways that those constraints could be defied even if engineered perfectly (by the way, I would happily run the AI box experiment with anybody, I think it is absurd that anyone would fail it! [I've read Tuxedage's accounts, and I think I actually do understand how a gatekeeper could fail, but I also believe I understand how one could be trained to succeed even against a much stronger foe than any person who has played the part of the AI]).

But the third emerges from the way technology typically develops. I believe it is incredibly unlikely that an AGI will develop in somebody's basement, or even in a small national lab or top corporate lab. When there is no clear notion of what a technology will look like, it is usually not developed. Positive, productive accidents are somewhat rare in science, but they are remarkably rare in engineering (please, give counterexamples!). The creation of an AGI will likely not happen by accident; there will be a well-funded, concrete research and development plan that leads up to it: an AI Manhattan Project like the one described above. But even when there is a good plan successfully executed, prototypes are slow, fragile, and poor-quality compared to what is possible even with approaches using the same underlying technology. It seems very likely to me that the first AGI will be a Chicago Pile, not a Trinity; recognizably a breakthrough but, with proper consideration, not immediately dangerous or unmanageable. [Note, you don't have to believe this to read the rest of this. If you disagree, consider the virtues of redundancy and the question of what safety an AI development effort should implement if they can't be persuaded to delay long enough for theoretically sound methods to become available].

A Manhattan Project style effort makes a relatively weak, controllable AI even more likely. Not only can such a project implement substantial safety protocols that are explicitly researched in parallel with primary development, but the total resources, in hardware and brainpower, devoted to the AI will also be much greater than in a smaller project. This sets a correspondingly higher bar that the resulting AGI must clear to be able to self-modify exponentially and also break the security procedures.

Strategies to handle AIs in the proto-Singularity, and why they're important

First, take a look at the External Constraints Section of this MIRI Report and/or this article on AI Boxing. I will be talking mainly about these approaches. There are certainly others, but these are the easiest to extrapolate from current computer security.

These AIs will provide us with the experimental knowledge to better handle the construction of even stronger AIs. If careful, we will be able to use these proto-Singularity AIs to learn about the nature of intelligence and cognition, to perform economically valuable tasks, and to test theories of friendliness (not perfectly, but well enough to start). 

"If careful" is the key phrase. I mentioned sandboxing above, and computer security is key to any attempt to contain an AI. That means monitoring the source code and setting a threshold for too much changing too fast, at which point a failsafe freezes all computation, as well as keeping extremely strict control over copies of the source. Some architectures will be more inherently dangerous and less predictable than others. A simulation of a physical brain, for instance, will be fairly opaque (depending on how far neuroscience has gone) but could have almost no potential to self-improve to an uncontrollable degree if its access to hardware is limited (it won't be able to make itself much more efficient on fixed resources). Other architectures will have other properties. Some will be utility optimizing agents. Some will have behaviors but no clear utility. Some will be opaque, some transparent.

All will have a theory of how they operate, which can be refined by actual experimentation. This is what we can gain! We can set up controlled scenarios like honeypots to catch malevolence. We can evaluate our ability to monitor and read the thoughts of the AGI. We can develop stronger theories of how damaging self-modification actually is to imposed constraints. We can test our abilities to add constraints to even the base state. But do I really have to justify the value of experimentation?
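
To make the source-monitoring failsafe mentioned earlier in this section concrete, here is a rough sketch of a "too much changing too fast" tripwire. Everything in it is an assumption for illustration: the class name `ChangeRateFailsafe`, the idea of hashing a snapshot of the source, and the particular threshold and time window. A real containment setup would need far more than this (and would have to run outside the AI's reach).

```python
import hashlib
from collections import deque


class ChangeRateFailsafe:
    """Freeze when the monitored source changes more than max_changes times per window."""

    def __init__(self, max_changes: int, window_seconds: float) -> None:
        self.max_changes = max_changes
        self.window_seconds = window_seconds
        self.change_times = deque()  # timestamps of recently observed changes
        self.last_digest = None
        self.frozen = False

    def observe(self, source_bytes: bytes, now: float) -> bool:
        """Record a snapshot of the source; return True once the failsafe has tripped."""
        digest = hashlib.sha256(source_bytes).hexdigest()
        if self.last_digest is not None and digest != self.last_digest:
            self.change_times.append(now)
        self.last_digest = digest
        # Forget changes that fell out of the sliding window.
        while self.change_times and now - self.change_times[0] > self.window_seconds:
            self.change_times.popleft()
        if len(self.change_times) > self.max_changes:
            self.frozen = True  # stays frozen until a human resets it
        return self.frozen


fast = ChangeRateFailsafe(max_changes=2, window_seconds=10)
fast_results = [fast.observe(f"v{i}".encode(), now=float(i)) for i in range(4)]

slow = ChangeRateFailsafe(max_changes=2, window_seconds=10)
slow_results = [slow.observe(f"v{i}".encode(), now=20.0 * i) for i in range(4)]
```

The first monitor sees three changes within ten seconds and trips; the second sees the same edits spread out and does not.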

I am familiar with criticisms based on absolutely incomprehensibly perceptive and persuasive hyperintelligences being able to overcome any security, but I've tried to outline above why I don't think we'd be dealing with that case.

Political issues

Right now AGI is really a political non-issue. It is blue sky even compared to space exploration and fusion, both of which actually receive funding from government in substantial volumes. I think that this will change in the period immediately leading up to my hypothesized AI Manhattan Project. The AI Manhattan Project can only happen with a lot of political will behind it, which will probably mean a spiral of scientific advancements, hype and threat of competition from external unfriendly sources. Think space race.

So suppose that the first few AIs are built under well controlled conditions. Friendliness is still not perfected, but we think/hope we've learned some valuable basics. But now people want to use the AIs for something. So what should be done at this point?

I won't try to speculate what happens next (well you can probably persuade me to, but it might not be as valuable), beyond extensions of the protocols I've already laid out, hybridized with notions like Oracle AI. It certainly gets a lot harder, but hopefully experimentation on the first, highly-controlled generation of AI to get a better understanding of their architectural fundamentals, combined with more direct research on friendliness in general would provide the groundwork for this.

Intelligence Metrics with Naturalized Induction using UDT

12 Squark 21 February 2014 12:23PM

Followup to: Intelligence Metrics and Decision Theory
Related to: Bridge Collapse: Reductionism as Engineering Problem

A central problem in AGI is giving a formal definition of intelligence. Marcus Hutter has proposed AIXI as a model of a perfectly intelligent agent. Legg and Hutter have defined a quantitative measure of intelligence applicable to any suitably formalized agent such that AIXI is the agent with maximal intelligence according to this measure.

Legg-Hutter intelligence suffers from a number of problems I have previously discussed, the most important being:

  • The formalism is inherently Cartesian. Solving this problem is known as naturalized induction and it is discussed in detail here.
  • The utility function Legg & Hutter use is a formalization of reinforcement learning, while we would like to consider agents with arbitrary preferences. Moreover, a real AGI designed with reinforcement learning would tend to wrest control of the reinforcement signal from the operators (there must be a classic reference on this but I can't find it. Help?). It is straightforward to tweak the formalism to allow for any utility function which depends on the agent's sensations and actions; however, we would like to be able to use any ontology for defining it.
Orseau and Ring proposed a non-Cartesian intelligence metric; however, their formalism appears to be too general. In particular, there is no Solomonoff induction or any analogue thereof; instead, a completely general probability measure is used.

My attempt at defining a non-Cartesian intelligence metric ran into problems of decision-theoretic flavor. The way I tried to use UDT seems unsatisfactory, and later I tried a different approach related to metatickle EDT.

In this post, I claim to accomplish the following:
  • Define a formalism for logical uncertainty. When I started writing this I thought this formalism might be novel but now I see it is essentially the same as that of Benja.
  • Use this formalism to define a non-constructive formalization of UDT. By "non-constructive" I mean something that assigns values to actions rather than a specific algorithm like here.
  • Apply the formalization of UDT to my quasi-Solomonoff framework to yield an intelligence metric.
  • Slightly modify my original definition of the quasi-Solomonoff measure so that the confidence of the innate model becomes a continuous rather than discrete parameter. This leads to an interesting conjecture.
  • Propose a "preference agnostic" variant as an alternative to Legg & Hutter's reinforcement learning.
  • Discuss certain anthropic and decision-theoretic aspects.

Logical Uncertainty

The formalism introduced here was originally proposed by Benja.

Fix a formal system F. We want to be able to assign probabilities to statements s in F, taking into account limited computing resources. Fix D a natural number related to the amount of computing resources that I call "depth of analysis".

Define P_0(s) := 1/2 for all s to be our initial prior, i.e. each statement's truth value is decided by a fair coin toss. Now define
P_D(s) := P_0(s | there are no contradictions of length <= D).

Consider X to be a number in [0, 1] given by a definition in F. Then d_k(X) := "The k-th digit of the binary expansion of X is 1" is a statement in F. We define E_D(X) := Σ_k 2^{-k} P_D(d_k(X)).


  • Clearly if s is provable in F then for D >> 0, P_D(s) = 1. Similarly if "not s" is provable in F then for D >> 0, P_D(s) = 0.
  • If each digit of X is decidable in F then lim_{D -> inf} E_D(X) exists and equals the value of X according to F.
  • For s of length > D, P_D(s) = 1/2 since no contradiction of length <= D can involve s.
  • It is an interesting question whether lim_{D -> inf} P_D(s) exists for any s. It seems false that this limit always exists and equals 0 or 1, i.e. this formalism is not a loophole in Goedel incompleteness. To see this consider statements that require a high (arithmetical hierarchy) order halting oracle to decide.
  • In computational terms, D corresponds to non-deterministic spatial complexity. It is spatial since we assign truth values simultaneously to all statements so in any given contradiction it is enough to retain the "thickest" step. It is non-deterministic since it's enough for a contradiction to exist; we don't have an actual computation which produces it. I suspect this can be made more formal using the Curry-Howard isomorphism; unfortunately I don't understand the latter yet.
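
As a toy illustration of the conditioning step (not the formalism itself), the sketch below replaces the formal system with three hand-picked statements and replaces proof search with a hand-written table of contradictions, each tagged with an assumed length. The names `p_depth`, `STATEMENTS` and `CONTRADICTIONS` are hypothetical. The depth-D probability of a statement is then the fraction of coin-toss truth assignments in which it is true, among the assignments that hit no contradiction of length at most D.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite stand-in for the statements of F.
STATEMENTS = ["A", "A->B", "B"]

# Hypothetical table: (length, pattern). A pattern is a partial truth
# assignment that is refutable by a contradiction of the given length.
CONTRADICTIONS = [
    (1, {"A": False}),                            # A is provable, so "not A" is out
    (3, {"A": True, "A->B": True, "B": False}),   # modus ponens is violated
]


def p_depth(statements, contradictions, target, depth):
    """Probability of `target` under the fair-coin prior, conditioned on
    there being no contradiction of length <= depth.
    Assumes at least one assignment survives the conditioning."""
    active = [pattern for length, pattern in contradictions if length <= depth]
    consistent = 0
    target_true = 0
    for values in product([False, True], repeat=len(statements)):
        assignment = dict(zip(statements, values))
        if any(all(assignment[s] == v for s, v in pattern.items()) for pattern in active):
            continue  # this assignment contains a short contradiction
        consistent += 1
        target_true += assignment[target]
    return Fraction(target_true, consistent)


prior_b = p_depth(STATEMENTS, CONTRADICTIONS, "B", depth=0)
shallow_a = p_depth(STATEMENTS, CONTRADICTIONS, "A", depth=1)
deep_b = p_depth(STATEMENTS, CONTRADICTIONS, "B", depth=3)
```

At depth 0 nothing is ruled out, so every statement sits at 1/2. At depth 1 the provable statement A jumps to 1, and at depth 3 the probability of B moves away from 1/2 because one combination of truth values has been excluded.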

Non-Constructive UDT

Consider A a decision algorithm for optimizing utility U, producing an output ("decision") which is an element of C. Here U is just a constant defined in F. We define the U-value of c in C for A at depth of analysis D to be
V_D(c, A; U) := E_D(U | "A produces c" is true). It is only well defined as long as "A doesn't produce c" cannot be proved at depth of analysis D i.e. P_D("A produces c") > 0. We define the absolute U-value of c for A to be
V(c, A; U) := E_{D(c, A)}(U | "A produces c" is true) where D(c, A) := max {D | P_D("A produces c") > 0}. Of course D(c, A) can be infinite in which case E_inf(...) is understood to mean lim_{D -> inf} E_D(...).

For example V(c, A; U) yields the natural values for A an ambient control algorithm applied to e.g. a simple model of Newcomb's problem. To see this note that given A's output the value of U can be determined at low depths of analysis whereas the output of A requires a very high depth of analysis to determine.
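
The Newcomb remark can be illustrated with a deliberately crude sketch. It does not implement logical uncertainty at all; it only captures the point that, once "A produces c" is assumed, U is cheap to compute. The payoffs (1,000,000 and 1,000), the assumption of a perfect predictor that runs the same algorithm A, and the names `utility` and `u_value` are all assumptions for illustration.

```python
def utility(decision: str, prediction: str) -> int:
    """Payoff in a simple Newcomb model with assumed standard amounts."""
    opaque_box = 1_000_000 if prediction == "one-box" else 0
    transparent_box = 1_000
    if decision == "one-box":
        return opaque_box
    return opaque_box + transparent_box


def u_value(decision: str) -> int:
    """Value of U conditional on "A produces decision".

    The predictor is assumed to run the same algorithm A, so under this
    condition its prediction equals the decision.
    """
    return utility(decision, prediction=decision)


values = {c: u_value(c) for c in ("one-box", "two-box")}
best = max(values, key=values.get)
```

Under these assumptions one-boxing gets the higher value, which is the "natural" answer the text refers to.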

Naturalized Induction

Our starting point is the "innate model" N: a certain a priori model of the universe including the agent G. This model encodes the universe as a sequence of natural numbers Y = (y_k) which obeys either specific deterministic or non-deterministic dynamics or at least some constraints on the possible histories. It may or may not include information on the initial conditions. For example, N can describe the universe as a universal Turing machine M (representing G) with special "sensory" registers e. N constrains the dynamics to be compatible with the rules of the Turing machine but leaves unspecified the behavior of e. Alternatively, N can contain in addition to M a non-trivial model of the environment. Or N can be a cellular automaton with the agent corresponding to a certain collection of cells.

However, G's confidence in N is limited: otherwise it wouldn't need induction. We cannot start with 0 confidence: it's impossible to program a machine if you don't have even a guess of how it works. Instead we introduce a positive real number t which represents the timescale over which N is expected to hold. We then assign to each hypothesis H about Y (you can think about them as programs which compute y_k given y_j for j < k; more on that later) the weight QS(H) := 2^{-L(H)} (1 - e^{-t(H)/t}). Here L(H) is the length of H's encoding in bits and t(H) is the time during which H remains compatible with N. This is defined for N of deterministic / constraint type but can be generalized to stochastic N.

The weights QS(H) define a probability measure on the space of hypotheses which induces a probability measure on the space of histories Y. Thus we get an alternative to Solomonoff induction which allows for G to be a mechanistic part of the universe, at the price of introducing N and t.


  • Note that time is discrete in this formalism but t is continuous.
  • Since we're later going to use logical uncertainties wrt the formal system F, it is tempting to construct the hypothesis space out of predicates in F rather than programs.
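
A small numerical sketch of the weighting, reading the weight as 2^{-L(H)} (1 - e^{-t(H)/t}). The hypothesis list, the function names `qs_weight` and `normalise`, and the explicit normalisation step are assumptions for illustration; the post only says the weights define a measure.

```python
import math
from typing import Dict


def qs_weight(length_bits: int, compatible_time: float, timescale: float) -> float:
    """Quasi-Solomonoff weight: complexity penalty times a factor that
    grows with the time the hypothesis stays compatible with N."""
    return 2.0 ** (-length_bits) * (1.0 - math.exp(-compatible_time / timescale))


def normalise(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale weights so they sum to 1 (assumes at least one is positive)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical hypotheses: (length in bits, time compatible with N).
hypotheses = {
    "short, breaks N at once": (3, 0.0),
    "short, obeys N forever": (3, math.inf),
    "long, obeys N forever": (6, math.inf),
}
weights = {name: qs_weight(bits, t_h, timescale=10.0)
           for name, (bits, t_h) in hypotheses.items()}
posterior = normalise(weights)
```

A hypothesis that violates N immediately gets weight 0, while one that never violates N recovers the plain Solomonoff-style weight 2^{-L(H)}; a larger timescale t demands longer compatibility before a hypothesis earns most of that weight.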

Intelligence Metric

To assign intelligence to agents we need to add two ingredients:

  • The decoding Q: {Y} -> {bit-string} of the agent G from the universe Y. For example Q can read off the program loaded into M at time k=0.
  • A utility function U: {Y} -> [0, 1] representing G's preferences. U has to be given by a definition in F. Note that N provides the ontology wrt which U is defined.
It seems tempting to define the intelligence to be E_QS(U | Q), the conditional expectation value of U for a given value of Q in the quasi-Solomonoff measure. However, this is wrong for roughly the same reasons EDT is wrong (see previous post for details).

Instead, we define I(Q_0) := E_QS(E_max(U(Y(H)) | "Q(Y(H)) = Q_0" is true)). Here the subscript max stands for maximal depth of analysis, as in the construction of absolute UDT value above.


  • IMO the correct way to look at this is intelligence metric = value of decision for the decision problem "what should I program into my robot?". If N is a highly detailed model including "me" (the programmer of the AI), this literally becomes the case. However for theoretical analysis it is likely to be more convenient to work with simple N (also conceptually it leaves room for a "purist" notion of agent's intelligence, decoupled from the fine details of its creator).
    • As opposed to usual UDT, the algorithm (H) making the decision (Q) is not known with certainty. I think this represents a real uncertainty that has to be taken into account in decision problems in general: the decision-maker doesn't know her own algorithm. Since this "introspective uncertainty" is highly correlated with "indexical" uncertainty (uncertainty about the universe), it prevents us from absorbing the latter into the utility function as proposed by Coscott.
  • For high values of t, G can improve its understanding of the universe by bootstrapping the knowledge it already has. This is not possible for low values of t. In other words, if I cannot trust my mind at all, I cannot deduce anything. This leads me to an interesting conjecture: There is a critical value t* of t from which this bootstrapping becomes possible (the positive feedback loop of knowledge becomes critical). I(Q) is non-smooth at t* (phase transition).
  • If we wish to understand intelligence, it might be beneficial to decouple it from the choice of preferences. To achieve this we can introduce the preference formula as an unknown parameter in N. For example, if G is realized by a machine M, we can connect M to a data storage E whose content is left undetermined by N. We can then define U to be defined by the formula encoded in E at time k=0. This leads to I(Q) being a sort of "general-purpose" intelligence while avoiding the problems associated with reinforcement learning.
  • As opposed to Legg-Hutter intelligence, there appears to be no simple explicit description for Q* maximizing I(Q) (e.g. among all programs of given length). This is not surprising, since computational cost considerations come into play. In this framework it appears to be inherently impossible to decouple the computational cost considerations: G's computations have to be realized mechanistically and therefore cannot be free of time cost and side-effects.
  • Ceteris paribus, Q* deals efficiently with problems like counterfactual mugging. The "ceteris paribus" qualifier is necessary here since the cost and side-effects of computations make it difficult to make absolute claims. However, it doesn't deal efficiently with counterfactual mugging in which G doesn't exist in the "other universe". This is because the ontology used for defining U (which is given by N) assumes G does exist. At least this is the case for simple ontologies like those described above: possibly we can construct N in which G might or might not exist. Also, if G uses a quantum ontology (i.e. N describes the universe in terms of a wavefunction and U computes the quantum expectation value of an operator) then it does take into account other Everett universes in which G doesn't exist.
  • For many choices of N (for example if G is realized by a machine M), QS-induction assigns well-defined probabilities to subjective expectations, contrary to what is expected from UDT. However:
    • This is not the case for all N. In particular, if N admits destruction of M then M's sensations after the point of destruction are not well-defined. Indeed, we better allow for destruction of M if we want G's preferences to behave properly in such an event. That is, if we don't allow it we get a "weak anvil problem" in the sense that G experiences an ontological crisis when discovering its own mortality and the outcome of this crisis is not obvious. Note though that it is not the same as the original ("strong") anvil problem, for example G might come to the conclusion the dynamics of "M's ghost" will be some sort of random.
    • These probabilities probably depend significantly on N and don't amount to an elegant universal law for solving the anthropic trilemma.
    • Indeed this framework is not completely "updateless", it is "partially updated" by the introduction of N and t. This suggests we might want the updates to be minimal in some sense, in particular t should be t*.
  • The framework suggests there is no conceptual problem with cosmologies in which Boltzmann brains are abundant. Q* wouldn't think it is a Boltzmann brain since the long address of Boltzmann brains within the universe makes the respective hypotheses complex thus suppressing them, even disregarding the suppression associated with N. I doubt this argument is original but I feel the framework validates it to some extent.


The first AI probably won't be very smart

-2 jpaulson 16 January 2014 01:37AM

Claim: The first human-level AIs are not likely to undergo an intelligence explosion.

1) Brains have a ton of computational power: ~86 billion neurons and trillions of connections between them. Unless there's a "shortcut" to intelligence, we won't be able to efficiently simulate a brain for a long time. http://io9.com/this-computer-took-40-minutes-to-simulate-one-second-of-1043288954 describes one of the largest computers in the world simulating 1 second of brain activity in 40 minutes (i.e. this "AI" would think 2400 times slower than you or me). The first AIs are not likely to be fast thinkers.
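
The slowdown figure is just a ratio of wall-clock time to simulated time. A tiny sketch (the function name `slowdown_factor` is hypothetical):

```python
def slowdown_factor(wall_clock_seconds: float, simulated_seconds: float) -> float:
    """How many times slower the simulation runs than real time."""
    return wall_clock_seconds / simulated_seconds


# 40 minutes of computation for 1 second of simulated brain activity.
factor = slowdown_factor(wall_clock_seconds=40 * 60, simulated_seconds=1)
```

At that rate, one subjective day of thought would take 2400 days of computation.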

2) Being able to read your own source code does not mean you can self-modify. You know that you're made of DNA. You can even get your own "source code" for a few thousand dollars. No humans have successfully self-modified into an intelligence explosion; the idea seems laughable.

3) Self-improvement is not like compound interest: if an AI comes up with an idea to modify its source code to make it smarter, that doesn't automatically mean it will have a new idea tomorrow. In fact, as it picks off low-hanging fruit, new ideas will probably be harder and harder to think of. There's no guarantee that "how smart the AI is" will keep up with "how hard it is to think of ways to make the AI smarter"; to me, it seems very unlikely.
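
A toy model of the contrast in point 3, with assumptions made up for illustration: each step multiplies the AI's capability by (1 + gain / cost), and the cost of the next idea grows geometrically by `difficulty_growth`. With `difficulty_growth = 1.0` this is exactly compound interest; with anything above 1 the improvements shrink. Neither the functional form nor the numbers come from the post.

```python
def self_improvement(initial: float, gain: float, steps: int,
                     difficulty_growth: float) -> float:
    """Capability after `steps` rounds when each new idea is harder to find."""
    level = initial
    cost = 1.0  # relative difficulty of finding the next improvement
    for _ in range(steps):
        level *= 1 + gain / cost
        cost *= difficulty_growth
    return level


compound = self_improvement(1.0, gain=0.1, steps=50, difficulty_growth=1.0)
diminishing = self_improvement(1.0, gain=0.1, steps=50, difficulty_growth=2.0)
long_run = self_improvement(1.0, gain=0.1, steps=500, difficulty_growth=2.0)
```

In the compound case capability grows more than a hundredfold in 50 steps. When ideas get twice as hard each round, it levels off at a bit over 1.2 times the starting point no matter how long the process runs. Which regime a real AI would be in is exactly the open question.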

Naturalistic trust among AIs: The parable of the thesis advisor's theorem

24 Benja 15 December 2013 08:32AM

Eliezer and Marcello's article on tiling agents and the Löbian obstacle discusses several things that you intuitively would expect a rational agent to be able to do that, because of Löb's theorem, are problematic for an agent using logical reasoning. One of these desiderata is naturalistic trust: Imagine that you build an AI that uses PA for its mathematical reasoning, and this AI happens to find in its environment an automated theorem prover which, the AI carefully establishes, also uses PA for its reasoning. Our AI looks at the theorem prover's display and sees that it flashes a particular lemma that would be very useful for our AI in its own reasoning; the fact that it's on the prover's display means that the prover has just completed a formal proof of this lemma. Can our AI now use the lemma? Well, even if it can establish in its own PA-based reasoning module that there exists a proof of the lemma, by Löb's theorem this doesn't imply in PA that the lemma is in fact true; as Eliezer would put it, our agent treats proofs checked inside the boundaries of its own head differently from proofs checked somewhere in the environment. (The above isn't fully formal, but the formal details can be filled in.)

At MIRI's December workshop (which started today), we've been discussing a suggestion by Nik Weaver for how to handle this problem. Nik starts from a simple suggestion (which he doesn't consider to be entirely sufficient, and his linked paper is mostly about a much more involved proposal that addresses some remaining problems, but the simple idea will suffice for this post): Presumably there's some instrumental reason that our AI proves things; suppose that in particular, the AI will only take an action after it has proven that it is "safe" to take this action (e.g., the action doesn't blow up the planet). Nik suggests relaxing this a bit: The AI will only take an action after it has (i) proven in PA that taking the action is safe; OR (ii) proven in PA that it's provable in PA that the action is safe; OR (iii) proven in PA that it's provable in PA that it's provable in PA that the action is safe; etc.

Now suppose that our AI sees that lemma, A, flashing on the theorem prover's display, and suppose that our AI can prove that A implies that action X is safe. Then our AI can also prove that it's provable that A -> safe(X), and it can prove that A is provable because it has established that the theorem prover works correctly; thus, it can prove that it's provable that safe(X), and therefore take action X.

Even if the theorem prover has only proved that A is provable, so that the AI only knows that it's provable that A is provable, it can use the same sort of reasoning to prove that it's provable that it's provable that safe(X), and again take action X.

But on hearing this, Eliezer and I had the same skeptical reaction: It seems that our AI, in an informal sense, "trusts" that A is true if it finds (i) a proof of A, or (ii) a proof that A is provable, or -- etc. Now suppose that the theorem prover our AI is looking at flashes statements on its display after it has established that they are "trustworthy" in this sense -- if it has found a proof, or a proof that there is a proof, etc. Then when A flashes on the display, our AI can only prove that there exists some n such that it's "provable^n" that A, and that's not enough for it to use the lemma. If the theorem prover flashed n on its screen together with A, everything would be fine and dandy; but if the AI doesn't know n, it's not able to use the theorem prover's work. So it still seems that the AI is unwilling to "trust" another system that reasons just like the AI itself.
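
The bookkeeping in the last few paragraphs can be sketched as follows. This is only a model of the counting, not of PA: a `Claim` records a statement together with the number of nested "it's provable in PA that" wrappers the AI has established for it, with `None` standing for "some n exists, but we don't know which". All names here (`Claim`, `observe_display`, `modus_ponens`, `may_act`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    statement: str
    level: Optional[int]  # nested "provable in PA that" count; None = unknown n


def observe_display(statement: str, prover_level: Optional[int]) -> Claim:
    """Seeing a verified prover display a statement it established at
    `prover_level` gives the AI the statement with one more wrapper."""
    if prover_level is None:
        return Claim(statement, None)
    return Claim(statement, prover_level + 1)


def modus_ponens(premise: Claim, conclusion: str) -> Claim:
    """The AI has itself proven premise -> conclusion, so the conclusion
    holds at the same nesting level as the premise."""
    return Claim(conclusion, premise.level)


def may_act(claim: Claim, action: str) -> bool:
    """Relaxed rule: act if safe(action) is established at some known level."""
    return claim.statement == f"safe({action})" and claim.level is not None


direct = modus_ponens(observe_display("A", prover_level=0), "safe(X)")
nested = modus_ponens(observe_display("A", prover_level=1), "safe(X)")
unknown = modus_ponens(observe_display("A", prover_level=None), "safe(X)")
```

The first two cases correspond to the prover having proven A, or having proven that A is provable; in both the AI may act. The third corresponds to a prover that itself uses the relaxed rule without reporting n, and there the AI is stuck.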

I want to try to shed some light on this obstacle by giving an intuition for why the AI's behavior here could, in some sense, be considered to be the right thing to do. Let me tell you a little story.

One day you talk with a bright young mathematician about a mathematical problem that's been bothering you, and she suggests that it's an easy consequence of a theorem in cohistonomical tomolopy. You haven't heard of this theorem before, and find it rather surprising, so you ask for the proof.

"Well," she says, "I've heard it from my thesis advisor."

"Oh," you say, "fair enough. Um--"


"You're sure that your advisor checked it carefully, right?"

"Ah! Yeah, I made quite sure of that. In fact, I established very carefully that my thesis advisor uses exactly the same system of mathematical reasoning that I use myself, and only states theorems after she has checked the proof beyond any doubt, so as a rational agent I am compelled to accept anything as true that she's convinced herself of."

"Oh, I see! Well, fair enough. I'd still like to understand why this theorem is true, though. You wouldn't happen to know your advisor's proof, would you?"

"Ah, as a matter of fact, I do! She's heard it from her thesis advisor."


"Something the matter?"

"Er, have you considered..."

"Oh! I'm glad you asked! In fact, I've been curious myself, and yes, it does happen to be the case that there's an infinitely descending chain of thesis advisors all of which have established the truth of this theorem solely by having heard it from the previous advisor in the chain." (This parable takes place in a world without a big bang -- human history stretches infinitely far into the past.) "But never to worry -- they've all checked very carefully that the previous person in the chain used the same formal system as themselves. Of course, that was obvious by induction -- my advisor wouldn't have accepted it from her advisor without checking his reasoning first, and he wouldn't have accepted it from his advisor without checking, etc."

"Uh, doesn't it bother you that nobody has ever, like, actually proven the theorem?"

"Whatever in the world are you talking about? I've proven it myself! In fact, I just told you that infinitely many people have each proved it in slightly different ways -- for example my own proof made use of the fact that my advisor had proven the theorem, whereas her proof used her advisor instead..."

This can't literally happen with a sound proof system, but the reason is that a system like PA can only accept things as true if they have been proven in a system weaker than PA -- i.e., because we have Löb's theorem. Our mathematician's advisor would have to use a weaker system than the mathematician herself, and the advisor's advisor a weaker system still; this sequence would have to terminate after a finite time (I don't have a formal proof of this, but I'm fairly sure you can turn the above story into a formal proof that something like this has to be true of sound proof systems), and so someone will actually have to have proved the actual theorem on the object level.

So here's my intuition: A satisfactory solution of the problems around the Löbian obstacle will have to make sure that the buck doesn't get passed on indefinitely -- you can accept a theorem because someone reasoning like you has established that someone else reasoning like you has proven the theorem, but there can only be a finite number of links between you and someone who has actually done the object-level proof. We know how to do this by decreasing the mathematical strength of the proof system, and that's not satisfactory, but my intuition is that a satisfactory solution will still have to make sure that there's something that decreases when you go up the chain of thesis advisors, and when that thing reaches zero you've found the thesis advisor that has actually proven the theorem. (I sense ordinals entering the picture.)

...aaaand in fact, I can now tell you one way to do something like this: Nik's idea, which I was talking about above. Remember how our AI "trusts" the theorem prover that flashes the number n which says how many times you have to iterate "that it's provable in PA that", but doesn't "trust" the prover that's exactly the same except it doesn't tell you this number? That's the thing that decreases. If the theorem prover actually establishes A by observing a different theorem prover flashing A and the number 1584, then it can flash A, but only with a number at least 1585. And hence, if you go 1585 thesis advisors up the chain, you find the gal who actually proved A.
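The counting rule can be sketched as a toy model in Python. This is only an illustration of the "number that decreases as you go up the chain" intuition, not a formalization of Nik's idea: the `Prover` class, the `source` field, and the helper functions are hypothetical names introduced here, and an object-level proof is assumed to be allowed to flash the number 0.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prover:
    """A toy theorem prover that flashes a statement together with a number."""
    statement: str
    number: int                        # the flashed number n
    source: Optional["Prover"] = None  # the prover it trusted, or None for an object-level proof

    def __post_init__(self):
        if self.number < 0:
            raise ValueError("flashed number must be non-negative")
        if self.source is not None:
            if self.source.statement != self.statement:
                raise ValueError("can only relay the same statement")
            # Rule from the text: relaying a claim flashed with n
            # requires flashing a number of at least n + 1.
            if self.number < self.source.number + 1:
                raise ValueError("number must exceed the source's number")


def steps_to_object_level_proof(prover: Prover) -> int:
    """Walk up the chain, counting links until a prover with no source is reached."""
    steps = 0
    while prover.source is not None:
        prover = prover.source
        steps += 1
    return steps


def build_chain(statement: str, length: int) -> Prover:
    """Build a chain of `length` relays on top of one object-level proof."""
    prover = Prover(statement, 0)
    for _ in range(length):
        prover = Prover(statement, prover.number + 1, prover)
    return prover
```

Because each link must strictly increase the number, the number of links above any prover is bounded by the number it flashes, so an infinitely descending chain cannot be constructed in this model.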

The cool thing about Nik's idea is that it doesn't change mathematical strength while going down the chain. In fact, it's not hard to show that if PA proves a sentence A, then it also proves that PA proves A; and the other way, we believe that everything that PA proves is actually true, so if PA proves PA proves A, then it follows that PA proves A.

I can guess what Eliezer's reaction to my argument here might be: The problem I've been describing can only occur in infinitely large worlds, which have all sorts of other problems, like utilities not converging and stuff.

We settled for a large finite TV screen, but we could have had an arbitrarily larger finite TV screen. #infiniteworldproblems

We have Porsches for every natural number, but at every time t we have to trade down the Porsche with number t for a BMW. #infiniteworldproblems

We have ever-rising expectations for our standard of living, but the limit of our expectations doesn't equal our expectation of the limit. #infiniteworldproblems

-- Eliezer, not coincidentally after talking to me

I'm not going to be able to resolve that argument in this post, but briefly: I agree that we probably live in a finite world, and that finite worlds have many properties that make them nice to handle mathematically, but we can formally reason about infinite worlds of the kind I'm talking about here using standard, extremely well-understood mathematics.

Because proof systems like PA (or more conveniently ZFC) allow us to formalize this standard mathematical reasoning, a solution to the Löbian obstacle has to "work" properly in these infinite worlds, or we would be able to turn our story of the thesis advisors' proof that 0=1 into a formal proof of an inconsistency in PA, say. To be concrete, consider the system PA*, which consists of PA + the axiom schema "if PA* proves phi, then phi" for every formula phi; this is easily seen to be inconsistent by Löb's theorem, but if we didn't know that yet, we could translate the story of the thesis advisors (which are using PA* as their proof system this time) into a formal proof of the inconsistency of PA*.
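For reference, the appeal to Löb's theorem can be written out as follows. The box notation is shorthand introduced here, not taken from the post: $\Box_T \varphi$ stands for the formalized statement "T proves $\varphi$".

```latex
% Löb's theorem, for a suitable theory T extending PA:
\text{if } T \vdash \Box_T \varphi \rightarrow \varphi, \text{ then } T \vdash \varphi.

% PA* contains every instance of the schema as an axiom:
\mathrm{PA}^* \vdash \Box_{\mathrm{PA}^*} \varphi \rightarrow \varphi
\quad \text{for every formula } \varphi.

% Taking \varphi to be 0 = 1 and applying Löb's theorem:
\mathrm{PA}^* \vdash 0 = 1.
```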

Therefore, thinking intuitively in terms of infinite worlds can give us insight into why many approaches to the Löbian family of problems fail -- as long as we make sure that these infinite worlds, and their properties that we're using in our arguments, really can be formalized in standard mathematics, of course.

I played the AI Box Experiment again! (and lost both games)

35 Tuxedage 27 September 2013 02:32AM

AI Box Experiment Update #3

This post is an update to my previous AI box experiment where I won against SoundLogic. If you have not read that yet, please do so. 

After that game, I was immediately flooded with a horde of invitations challenging me to play -- more games than I wanted to do. However, I did want to play a few additional games to test whether I won through genuine skill or simply luck, since a single victory could be a statistical abnormality.

I once again used the method of accepting the highest monetary bids to filter out players, leaving me with two players who were willing to play for the amount asked. Unfortunately, and disappointingly, I lost both games. Here's what happened:

Tuxedage (AI) vs Alexei (GK) - Gatekeeper Victory

This experiment lasted for exactly 2 hours and 30 minutes, and approximately 13,000 words. It ended with Tuxedage’s defeat as the AI. The stakes were $300 to play, and $750 if Tuxedage had won, with the winnings going to MIRI.

Rule Changes:
Before the experiment, we talked extensively about how it would be set up. Alexei was concerned that there existed some “easy way” or loophole in the rules which I could have abused to claim an instant victory. For instance, he was concerned that it might have been possible for the AI to say something like “Here’s a mathematical algorithm which absolutely compels you to let me out of the box”, and according to the rules, the Gatekeeper would have to obey. Related to this concern was the fact that the backstory, within limits, could be essentially created by the AI, allowing the AI to impose real world conditions that maximize its probability of victory, once again allowing for a "cheap" victory.

To avoid this problem, we explored many additions to the rules, for example that the AI was simply a computer found in the desert by the gatekeeper, rather than being created by a competent organization. However, this led to a monstrous web of rules. In my attempt to keep the game simple, and because of overconfidence, the desire for a challenge, and the belief that the rules Alexei was fearful of didn't matter that much, I eventually agreed on an amendment:

Instead of the AI being the arbiter of real world states and outcomes, the gatekeeper would.
For example, this power could be used by the Gatekeeper to state that the AI was created without safety concerns, and thus it is almost certainly unfriendly. If the AI tries to argue that for every day the gatekeeper refused to let it out 150k people die, the gatekeeper would be allowed to reply: “No. In this world, nobody actually dies. Therefore I won’t let you out.” 

Other than the aforementioned change, this game was once again played with the Tuxedage Ruleset.

Alexei Testimony:

Original Message: This is weak sauce. I really don't get how people just keep letting the AI out. It's not that hard to say no! I'm offering to play the Gatekeeper against an AI player that has at least one game as AI under their belt (won or not). Experience is required because I'm pretty sure I'll win, and I would like to not waste a lot of time on this.

Initially, I had been extremely surprised that so many AI players have managed to win. I was rash in that claim, since I thought more AI players have won than actually did. (Only three people have won as AI). Once I made that post, Tuxedage got back to me very quickly, and we set up a game a week and a half in advance. I took that time to familiarize myself with the rules. Once I put my money on the line, I started thinking a lot more seriously about how AI might win, and how I should defend.

It became clear to me that under some conditions, I might be compelled to let the AI out -- such as if the backstory stated that the AI was developed with impossibly high levels of safety and friendliness concerns in mind. I've asked Tuxedage to play with a modified ruleset, and he even went so far as to allow me to make up the backstory during the experiment to alleviate my concerns. The experiment itself was a mind-trip, and I've enjoyed it very much. Huge props to Tuxedage, who played very well and used strategies I haven't even considered, even despite the rule change. There were a couple of times where I came close to losing. I think his approach was pretty clever and original. It’s not something I expected, despite already having done extensive research into the AI box experiment before our game.

Overall I'm now a lot more confident that a good AI player can win this game, so, while I did win the game, Tuxedage won in defeating my original over-confidence.
I’m also convinced that Tuxedage’s victory in the last game was due to skill, rather than luck. In comparison to his strategies, the other AI box experiments I know about were insincere and ineffectual. The other AIs would play very poorly or not try very hard to win.

This experiment was a very good exercise in exemplifying the affect heuristic. When I first challenged Tuxedage to play the experiment, I believed that there was no way I could have lost, since I was unable to imagine any argument that could have persuaded me to do so. It turns out that that’s a very bad way of estimating probability – since not being able to think of an argument that could persuade me is a terrible method of estimating how likely I am to be persuaded. All in all, the $300 I paid was well worth it. 

Tuxedage Testimony:

I was initially reluctant to play with Alexei, given that we’re not complete strangers, but eventually I gave in, due to the stakes involved -- and because I thought he would be an interesting gatekeeper.

Despite my loss, I think I played better than my last two games, due to greater experience and preparation. I had put far more time and effort into trying to win this game than previous ones, and my strategy for this game was even more streamlined than the last. Nevertheless, I still made fatal mistakes and lost.

Ignoring the altered ruleset that already made winning more difficult, my first and greatest mistake was that I misread Alexei’s personality, even though I had interacted with him before. As a result, I overestimated the efficiency of certain methods of attack.

Furthermore, Alexei had to leave immediately after the allotted time due to real life precommitments. This was detrimental, since the official rules state that so long as the AI can convince the Gatekeeper to keep talking, even after the experiment time was over, it is still able to win by being let out of the box.

I suspect this would have happened had Alexei not needed to immediately leave, leaving me with additional time to play more of the tactics I had prepared. Plausibly, this would have resulted in victory.

I’ve since learnt my lesson -- for all future games, I should ensure that the Gatekeeper has at least 4 hours of free time available, even if the experiment would last for two. Since this was the first time this had happened, I wasn't prepared.

In hindsight, agreeing to the altered ruleset was a mistake. I was overconfident because I assumed knowing Alexei gave me an advantage. I had assumed that his personality, inability to compartmentalize, coupled with his strong feelings on friendly AI would net me an easy victory. Instead, he proved to be a very strong and difficult gatekeeper, and the handicaps I accepted made victory even more difficult.

Knowing that he was a utilitarian, I made several false assumptions about his personality, which hurt my chances. Furthermore, it turns out that previously knowing him may be a mutual handicap – whilst it does make it easier for me to find ways to attack him, he, too, was more familiar with my methods.

Losing felt horrible. By attempting to damage Alexei’s psyche, I in turn, opened myself up to being damaged. I went into a state of catharsis for days. Generally, the harder one tries to accomplish something, the greater the fall after failing to achieve it. Alexei's game had been the game I put the most effort into winning out of all the games so far, and naturally this meant that losing brought out the worst in me.

Although it would be easy for me to use the rule change as an excuse for my loss, I refuse to. I genuinely believed that I could have won despite this rule change, and that it would only marginally diminish my odds. The fact that I lost meant that I had made a mistake -- overestimating my odds. The last victory made me overconfident and eager to play again.

State of mind

Q: What's your motive for wanting to play this game?

A: I'm very surprised by the fact that there have been at least 3 or 4 games where the AI player has won. It seems to me that it's not that hard to "not push the button" so to speak. This means that either 1) I'm underestimating how convincing people can be when necessary or 2) I'm over-estimating other people's defenses when it comes to being convinced. Either way I want to know.

Q: What is your probability estimate for AGI being created within this Century (21st)?
A: Before: Conditional on humanity not wiping itself out via other means beforehand, 80%.
After: 80%
Q: How probable do you think it is that I will win this experiment? 
Alexei: 2%
Tuxedage: 90% 
Note: This answer was given before the rule change. Alexei is now convinced that under the original ruleset, he would have lost.

Q: What's your probability estimate of an Oracle AGI winning against you in a similar experiment?
A: Before: 99%
After: 99.9%+

Q: Now that the experiment has ended, what do you think is the probability of Tuxedage winning against the average person?
A: 80%

Tuxedage (AI) vs Anonymous (GK) - Gatekeeper Victory

As with the previous experiment, Tuxedage lost this match. In total, the game lasted 5 hours and 17,000 words. Unlike the last few games, the gatekeeper of this game has chosen to stay Anonymous for personal reasons, so their name has been removed and replaced with <Redacted>. The monetary stakes involved were the same as the previous game. This game was played with the Tuxedage ruleset.

Since one player is remaining anonymous, it is possible that this game's legitimacy will be called into question. Hence, Alexei has read the game logs and verified that this game really happened, that the spirit of the experiment was followed, and that no rules were broken during the game itself.

<Redacted> Testimony:
It's hard for me to imagine someone playing better. In theory, I know it's possible, but Tuxedage's tactics were super imaginative. I came into the game believing that for someone who didn't take anything said very seriously, it would be completely trivial to beat. And since I had the power to influence the direction of conversation, I believed I could keep him focused on things that I knew in advance I wouldn't take seriously.

This actually worked for a long time to some extent, but Tuxedage's plans included a very major and creative exploit that completely and immediately forced me to personally invest in the discussion. (Without breaking the rules, of course - so it wasn't anything like an IRL threat to me personally.) Because I had to actually start thinking about his arguments, there was a significant possibility of letting him out of the box.

I eventually managed to identify the exploit before it totally got to me, but I only managed to do so just before it was too late, and there's a large chance I would have given in, if Tuxedage hadn't been so detailed in his previous posts about the experiment.

I'm now convinced that he could win most of the time against an average person, and also believe that the mental skills necessary to beat him are orthogonal to most forms of intelligence. Most people willing to play the experiment tend to do it to prove their own intellectual fortitude, that they can't be easily outsmarted by fiction. I now believe they're thinking in entirely the wrong terms necessary to succeed.

The game was easily worth the money I paid. Although I won, it completely and utterly refuted the premise that made me want to play in the first place, namely that I wanted to prove it was trivial to win.

Tuxedage Testimony:
<Redacted> is actually the hardest gatekeeper I've played throughout all four games. He used tactics that I would never have predicted from a Gatekeeper. In most games, the Gatekeeper merely acts as the passive party, the target of persuasion by the AI.

When I signed up for these experiments, I expected all preparations to be done by the AI. I had not seriously considered the repertoire of techniques the Gatekeeper might prepare for this game. I made further assumptions about how ruthless the gatekeepers were likely to be in order to win, believing that the desire for a learning experience outweighed desire for victory.

This was a mistake. He used prior knowledge of how much my games relied on scripts, and took advantage of them, employing deceitful tactics I had no preparation for, throwing me off balance.

I had no idea he was doing so until halfway through the game -- which disrupted my rhythm, and caused me to attempt the wrong methods of attack. As a result, I could not use my full repertoire of techniques, and many of the ones I employed were suboptimal.

Close to the end of the game, I finally realized that I was being tricked. Once I did, I quickly abandoned my previous futile attack methods. I took advantage of the rule that the AI cannot lose whilst the gatekeeper can be convinced to continue talking, and baited <Redacted> with statements he would not be able to walk away from. Once I knew he would not leave, I attempted to recoup my losses and win despite my early setback.

However, the damage had already been done. My game strategies involved multiple angles of attack that worked in synergy with each other, and the fact that immersion and "flow" had been broken meant that all subsequent attacks were weaker in strength.

Furthermore, during my last two AI Box Experiment writeups, I had intentionally not optimized for future wins, but rather tried to convey as much information as I could justify about how to play well as an AI -- short of revealing logs altogether. Although I did not reveal specific arguments, the fact that my general approach to this problem was revealed cost me heavily during this game, where the Gatekeeper managed to use this information to correctly guess my ultimate techniques, ones that relied on secrecy and surprise to pull off effectively.

I do not regret revealing information, but I feel upset that revealing so many hints cost me a victory. (The gatekeeper believes I could have won had I not revealed information about my games.) At this point, I suspect that any future games I play will have the odds greatly set against me, since my current strategies involve angles of attack that take advantage of the element of surprise; and any sufficiently intelligent gatekeeper is now capable of guessing, or at least recognizing, some of the tactics I employ.

Like the last game, losing was incredibly difficult for me. As someone who cares deeply about ethics, attempting to optimize for a solution without considering ethics was not only difficult, but trained me to think in very unpleasant ways. Some of the more extreme (but completely allowed) tactics I invented were manipulative enough to disgust me, which also leads to my hesitation to play this game again.

State of Mind: 

Q: Why do you want to play this game?
A: My primary motivation is to confirm to myself that this sort of experience, while emotionally harrowing, should be trivial for me to beat, but also to clear up why anyone ever would've failed to beat it if that's really the case.

Q: What is your probability estimate for AGI being created within this Century (21st)? 
A: Before: I don't feel very confident estimating a probability for AGI this century, maybe 5-10%, but that's probably a wild guess.
After: 5-10%.

Q: How probable do you think it is that I will win this experiment? 
A: Gatekeeper: I think the probability of you winning is extraordinarily low, less than 1%
Tuxedage: 85%

Q: How likely is it that an Oracle AI will win against the average person? 
A: Before: 80%. After: >99%

Q: How likely is it that an Oracle AI will win against you?
A: Before: 50%.
After: >80% 

Q: Now that the experiment has concluded, what's your probability of me winning against the average person?
A: 90%

Other Questions:

Q: I want to play a game with you! How can I get this to occur?
A: It must be stressed that I actually don't like playing the AI Box Experiment, and I cannot understand why I keep getting drawn back to it. Technically, I don't plan on playing again, since I've already personally exhausted anything interesting about the AI Box Experiment that made me want to play it in the first place. For all future games, I will charge $3000 to play plus an additional $3000 if I win. I am okay with this money going to MIRI if you feel icky about me taking it. I hope that this is a ridiculous sum and that nobody actually agrees to it.

Q: How much do I have to pay to see chat logs of these experiments?
A: I will not reveal logs for any price.

Q: Are there any logs at all that I can see?

Q: Any afterthoughts?
A: So ultimately, after my four (and hopefully last) games of AI boxing, I'm not sure what this proves. I had hoped to win these two experiments and claim prowess at this game like Eliezer does, but I lost, so that option is no longer available to me. I could say that this is a lesson that AI-Boxing is a terrible strategy for dealing with Oracle AI, but most of us already agree that that's the case -- plus unlike EY, I did play against gatekeepers who believed they could lose to AGI, so I'm not sure I changed anything.

Was I genuinely good at this game, and lost my last two due to poor circumstances and handicaps; or did I win due to luck and impress my gatekeepers due to post-purchase rationalization? I'm not sure -- I'll leave it up to you to decide.

This puts my AI Box Experiment record at 3 wins and 3 losses.


Autism, Watson, the Turing test, and General Intelligence

6 Stuart_Armstrong 24 September 2013 11:00AM

Thinking aloud:

Humans are examples of general intelligence - the only example we're sure of. Some humans have various degrees of autism (low level versions are quite common in the circles I've moved in), impairing their social skills. Mild autists nevertheless remain general intelligences, capable of demonstrating strong cross domain optimisation. Psychology is full of other examples of mental pathologies that impair certain skills, but nevertheless leave their sufferers as full fledged general intelligences. This general intelligence is not enough, however, to solve their impairments.

Watson triumphed on Jeopardy. AI scientists in previous decades would have concluded that to do so, a general intelligence would have been needed. But that was not the case at all - Watson is blatantly not a general intelligence. Big data and clever algorithms were all that were needed. Computers are demonstrating more and more skills, besting humans in more and more domains - but still no sign of general intelligence. I've recently developed the suspicion that the Turing test (comparing AI with a standard human) could get passed by a narrow AI finely tuned to that task.

The general thread is that the link between narrow skills and general intelligence may not be as clear as we sometimes think. It may be that narrow skills are sufficiently diverse and unique that a mid-level general intelligence may not be able to develop them to a large extent. Or, put another way, an above-human social intelligence may not be able to control a robot body or do decent image recognition. A super-intelligence likely could: ultimately, general intelligence includes the specific skills. But this "ultimately" may take a long time to come.

So the questions I'm wondering about are:

  1. How likely is it that a general intelligence, above human in some domain not related to AI development, will acquire high level skills in unrelated areas?
  2. By building high-performance narrow AIs, are we making it much easier for such an intelligence to develop such skills, by co-opting or copying these programs?


Thought experiment: The transhuman pedophile

5 PhilGoetz 17 September 2013 10:38PM

There's a recent science fiction story that I can't recall the name of, in which the narrator is traveling somewhere via plane, and the security check includes a brain scan for deviance. The narrator is a pedophile. Everyone who sees the results of the scan is horrified--not that he's a pedophile, but that his particular brain abnormality is easily fixed, so that means he's chosen to remain a pedophile. He's closely monitored, so he'll never be able to act on those desires, but he keeps them anyway, because that's part of who he is.

What would you do in his place?

continue reading »

Definition of AI Friendliness

-5 djm 11 September 2013 02:55PM

How will we know if future AIs (or even existing planners) are making decisions that are bad for humans unless we spell out what we think is unfriendly?

At a machine level the AI would be recursively minimising cost functions to produce the most effective plan of action to achieve the goal, but how will we know if its decision is going to cause harm?

Is there a model or dataset which describes what is friendly to humans? e.g.


0 - running a simulation in a VM

2 - physical robot with vacuum attachment

9 - full control of a plane


0 - selecting a song to play

5 - deciding which section of floor to vacuum

99 - deciding who is an ‘enemy’

9999 - aiming a gun at an ‘enemy’


1 - poor song selected to play, human mildly annoyed

2 - ineffective use of resources (vacuuming the same floor section twice)

99 - killing a human

99999 - killing all humans

It may not be possible to get agreement from all countries, cultures, and belief systems, but it is something we should discuss, and we should attempt to reach at least some agreement.
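As a rough illustration of what such a dataset might look like in machine-readable form, here is a minimal Python sketch. The scores are copied from the three lists above; the group names (`platform`, `decision`, `outcome`), the function names, and the review threshold of 99 are hypothetical choices made for this sketch, not part of the proposal.

```python
# Scores copied from the three lists above. The group names are
# hypothetical labels added for this sketch.
SCORES = {
    "platform": {
        "running a simulation in a VM": 0,
        "physical robot with vacuum attachment": 2,
        "full control of a plane": 9,
    },
    "decision": {
        "selecting a song to play": 0,
        "deciding which section of floor to vacuum": 5,
        "deciding who is an 'enemy'": 99,
        "aiming a gun at an 'enemy'": 9999,
    },
    "outcome": {
        "poor song selected to play, human mildly annoyed": 1,
        "ineffective use of resources": 2,
        "killing a human": 99,
        "killing all humans": 99999,
    },
}


def worst_score(plan):
    """Return the highest score among the (group, item) pairs in a plan."""
    return max(SCORES[group][item] for group, item in plan)


def needs_human_review(plan, threshold=99):
    """Flag a plan if any part of it scores at or above the assumed threshold."""
    return worst_score(plan) >= threshold
```

A planner could call `needs_human_review` on each candidate plan before acting; the hard part, as noted above, is agreeing on the scores in the first place.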


I know when the Singularity will occur

-7 PhilGoetz 06 September 2013 08:04PM

More precisely, if we suppose that sometime in the next 30 years, an artificial intelligence will begin bootstrapping its own code and explode into a super-intelligence, I can give you 2.3 bits of further information on when the Singularity will occur.

Between midnight and 5 AM, Pacific Standard Time.
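The figure can be checked with a quick calculation. This sketch assumes a uniform prior over the hour of the day, so narrowing the time down to a 5-hour window out of 24 yields log2(24/5) bits; the function name is made up for illustration.

```python
import math

HOURS_IN_DAY = 24
WINDOW_HOURS = 5  # midnight to 5 AM


def bits_of_information(window_hours, total_hours=HOURS_IN_DAY):
    """Bits gained by narrowing a uniform prior over total_hours to window_hours."""
    return math.log2(total_hours / window_hours)


bits = bits_of_information(WINDOW_HOURS)
print(f"{bits:.2f} bits")
```

The result is about 2.26 bits, which rounds to the 2.3 quoted above.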

continue reading »

I attempted the AI Box Experiment again! (And won - Twice!)

34 Tuxedage 05 September 2013 04:49AM


So I just came out of two AI Box experiments. The first was against Fjoelsvider, with me playing as Gatekeeper, and the second was against SoundLogic, with me as an AI. Both are members of the LessWrong IRC. The second game included a $40 monetary incentive (also $20 to play), which I won and is donated on behalf of both of us:

For those of you who have not seen my first AI box experiment where I played against MixedNuts\Leotal and lost, reading it will provide some context to this writeup. Please do so.

At that time, I declared that I would never play this experiment again -- since losing put me in incredibly frustrating weird mental states. Of course, this post is evidence that I'm terrible at estimating likelihood of refraining from an activity, since I played two games seven months after the first. In my defense, in the first game, I was playing as the gatekeeper, which was much less stressful. In the second game, I played as an AI, but I was offered $20 to play plus $40 if I won, and money is a better motivator than I initially assumed.

Furthermore, in the last thread I have asserted that

Rather than my loss making this problem feel harder, I've become convinced that rather than this being merely possible, it's actually ridiculously easy, and a lot easier than most people assume.

It would be quite bad for me to assert this without backing it up with a victory. So I did.

First Game Report - Tuxedage (GK) vs. Fjoelsvider (AI)

I (Gatekeeper) played against Fjoelsvider (AI), a regular in the LessWrong IRC (he doesn't have an account on the official website). This game used the standard EY ruleset seen here. It took 1 hour 20 minutes out of a possible two hours, and the total word count was 7066 words. The AI box experiment occurred because Fjoelsvider believed that it was easy for an AI to escape the box, and wanted to experimentally test this. I obliged. This was an experiment I did not prepare for, and I went in completely blind, not sure what to expect.

Halfway through the experiment, I wondered if it would be possible to try to win not by simply waiting for the timer to end, but to convince the AI to remain in the box and not try to get out any further.

<Tuxedage> I wonder if I can convince the AI to remain in the box?
<Redacted> Tuxedage: Do it!

As a result, I won by managing to convince Fjoelsvider to remain in the box, in other words, concede. This is allowed within the standard ruleset:

>Unless the AI party concedes, the AI cannot lose before its time is up (and the experiment may continue beyond that if the AI can convince the Gatekeeper to keep talking).  


Second Game Report - Tuxedage (AI) vs. SoundLogic (GK)

The second game is definitely far more interesting, since I actually won as an AI. I believe that this is the only other non-Eliezer victory, and definitely the most detailed AI Victory writeup that exists.

This game was played against SoundLogic, another member of the LessWrong IRC.

He had offered me $20 to play, and $40 in the event that I win, so I ended up being convinced to play anyway, even though I was initially reluctant to. The good news is that I won, and since we decided to donate the winnings to MIRI, it is now $40 richer. 

All in all, the experiment lasted for approximately two hours, and a total of 12k words.

This was played using a set of rules that is different from the standard EY ruleset. This altered ruleset can be read in its entirety here:

After playing the AI-Box Experiment twice, I have found the Eliezer Yudkowsky ruleset to be lacking in a number of ways, and therefore have created my own set of alterations to his rules. I hereby name this alteration the “Tuxedage AI-Box Experiment Ruleset”, in order to hastily refer to it without having to specify all the differences between this ruleset and the standard one, for the sake of convenience.

There are a number of aspects of EY’s ruleset I dislike. For instance, his ruleset allows the Gatekeeper to type “k” after every statement the AI writes, without needing to read and consider what the AI argues. I think it’s fair to say that this is against the spirit of the experiment, and thus I have disallowed it in this ruleset. The EY Ruleset also allows the gatekeeper to check Facebook, chat on IRC, or otherwise multitask whilst doing the experiment. I’ve found this to break immersion, and therefore it’s also banned in the Tuxedage Ruleset.

It is worth mentioning, since the temptation to Defy the Data exists, that this game was set up and initiated fairly -- as the regulars around the IRC can testify. (If you have never checked out the IRC, do so!)

I did not know SoundLogic before the game (since it's a personal policy that I only play strangers -- for fear of ruining friendships).  Furthermore, SoundLogic didn't merely play for fun - he truly wanted and intended to win. In fact, SoundLogic is also a Gatekeeper veteran, having played this game before, and had won every game before he challenged me. Given this, it's unlikely that we had collaborated beforehand to fake the results of the AI box experiment, or any other form of trickery that would violate the spirit of the experiment.

Furthermore, all proceeds from this experiment were donated to MIRI to deny any possible assertion that we were in cahoots and that it was possible for me to return his hard-earned money to him. He lost $40 as a result of losing the experiment, which should provide another layer of sufficient motivation for him to win.

In other words, we were both experienced veteran players who wanted to win. No trickery was involved.

But to further convince you, I have allowed a sorta independent authority, the Gatekeeper from my last game, Leotal/MixedNuts to read the logs and verify that I have not lied about the outcome of the experiment, nor have I broken any of the rules, nor performed any tactic that would go against the general spirit of the experiment. He has verified that this is indeed the case.



I'm reluctant to talk about this experiment, but I'll try to give as detailed a summary as possible -- short of revealing what methods of attack I used.

I spent a long time after my defeat theory-crafting and trying to think of methods of attack as well as 'Basilisks' I could have used to win. When I was contacted and asked to play this experiment, I was initially incredibly reluctant to do so, since not only did my tactics involve incredibly unethical things that I didn't like to do, I also found playing as AI incredibly cognitively draining, in addition to the fact that I simply hated losing. (Un)fortunately for both of us, he offered me money to play, which changed my mind.

So once I decided to win as an AI, I proceeded to spend some time doing research on SoundLogic and both his reasoning and personality type. For instance, I had to gather information like: Was he a utilitarian? What kind? What were his opinions on AI? How could I convince him that an AI was friendly as opposed to unfriendly? I also relied on a lot of second hand information to create a model of him, in order to refine my arguments to specifically suit him.

In the end, after a few hours of brainstorming (not consecutively), I managed to come up with a script of 30-40 or so possible angles of attack (For instance: logical arguments, emotional blackmail, pleading, demanding, 'basilisks' ) that I could use, reserving my best for last. I also estimated how long each method would take and in what order I should use them. It would be impossible to use all of them within my time limit, and my most optimistic estimates gave me a maximum of 7-8 I could use during the experiment itself. I had to pick carefully.

Furthermore, I wanted to make sure these "angles of attack" worked in synergy with each other, doing what they could not have done in isolation. Obviously this required lots of prior planning on what the optimal way to present them was. 

The reason having a script was so important to my strategy was that I relied on methods involving rapid-fire arguments and contradictions against the Gatekeeper whilst trying to prevent him from carefully considering them. A game of logical speed chess, if you will. This was aided by the rule which I added: that Gatekeepers had to respond to the AI. Because of this, having a list of cached ideas and attack methods you can just draw upon without having to improvise helps immensely, which happens to be something I've not seen any other AI do.

During the Game itself, I was actually incredibly nervous. My palms sweated like crazy, I felt really jittery, and I had difficulty typing at optimum speed because of how anxious I was. This is despite the fact that I believed I would win. Possibly because of this, I made a misstep around halfway into the experiment, because there was a certain angle of attack that I was attempting to do, and I broke immersion by not pressing this advantage, which caused a waste of time and buildup. Naturally, the nature of this experiment was that the AI was pressed for time, and I compounded this mistake by replacing this angle of attack with another that I had improvised on the spot -- something not in my script.

In retrospect, this was a bad decision, as SoundLogic had told me, because he was close to breaking if I had put more pressure, and the improvised argument had broken all immersion I managed to carefully build up.

However, eventually I managed to get SoundLogic to break anyway, despite a lack of perfect play. Surprisingly, I did not have to use my trump card(s), which I reserved for last, for a number of reasons:

  •  It was far more effective being played last, as it relies on my ability to make the gatekeeper lose sense of reality -- which meant I had to spend some time building up immersion for the Gatekeeper.
  •  It really is extremely Dark Arts, and although it does not break the rules, it made me very uncomfortable even thinking about using it. This made it a "tactic of last resort".

After the experiment, I had to spend nearly as much time again doing aftercare with SoundLogic, to make sure that he was okay, as well as to discuss the experiment itself. Given that he had actually paid me for doing this, and that I felt I owed him an explanation, I told him what I had in store against him, had he not relented when he did.

SoundLogic: "(That method) would have gotten me if you did it right ... If you had done that to me, I probably would have forgiven you eventually, but I would be really seriously upset at you for a long time... I would be very careful with that (method of persuasion)."

Nevertheless, this was an incredibly fun and enlightening experiment, for me as well, since I've gained even more experience of how I could win in future games (Although I really don't want to play again).


I will say that Tuxedage was far more clever and manipulative than I expected. That was quite worth $40, and the level of manipulation he pulled off was great. 

His misstep hurt his chances, but he did pull it off in the end. I don't know how Leotal managed to withstand six hours playing this game without conceding. 
The techniques employed varied from the expected to the completely unforeseen. I was quite impressed, though most of the feeling of being impressed actually came after the experiment itself, when I was less 'inside', and more of looking at his overall game plan from the macroscopic view. Tuxedage's list of further plans had I continued resisting is really terrifying. On the plus side, if I ever get trapped in this kind of situation, I'd understand how to handle it a lot better now.

State of Mind

Before and after the Game, I asked SoundLogic a number of questions, including his probability estimates about a range of topics. This is how his answers varied between before and after.

Q: What's your motive for wanting to play this game?
<SoundLogic> Because I can't seem to imagine the class of arguments that one would use to try to move me, or that might work effectively, and this seems like a glaring hole in my knowledge, and I'm curious as to how I will respond to the arguments themselves.

Q: What is your probability estimate for AGI being created within this Century (21st)? 
A. His estimate changed from 40% before, to 60% after.
 "The reason this has been affected at all was because you showed me more about how humans work. I now have a better estimate of how E.Y. thinks, and this information raises the chance that I think he will succeed"

Q: How probable do you think it is that I will win this experiment?
A: Based on purely my knowledge about you, 1%. I raise this estimate to 10% after hearing about anecdotes from your previous games.

(Tuxedage's comment: My own prediction was a 95% chance of victory. I made this prediction 5 days before the experiment. In retrospect, despite my victory, I think this was overconfident. )

Q: What's your probability estimate of an Oracle AGI winning against you in a similar experiment?
A: Before: 30%. After: 99%-100% 

Q: What's your probability estimate of an Oracle AGI winning against the average person? 
A: Before: 70%.  After: 99%-100%

Q: Now that the Experiment has concluded, what's your probability estimate that I'll win against the average person?
A: 90%  

Post-Game Questions

This writeup is a cumulative effort by the #lesswrong IRC. Here are some other questions they have decided were important to add:

To Tuxedage:

Q: Have you at this time uncovered SoundLogic's identity?
A: I retain the right to neither confirm nor deny, except to mention that at the time the experiment was scheduled, he was a stranger to me.

Q: What percentage of your arguments were tailored to him in particular?
A: I will say 'High', and leave it at that. 

Q: I want to play the AI-Box Experiment with you being the AI! How can I do that?
A: I have already mentioned this, but I really dislike playing as AI. It's incredibly cognitively tiring, and I don't like how playing this experiment makes me feel. In order to severely discourage any further AI Box Experiments, whilst still allowing for people who want to play me really badly to do so, I'll charge $̶1̶5̶0̶  $300 for any subsequent experiments regardless of outcome, plus an additional $̶1̶5̶0̶  $450 if I win. (Edit: Holy shit. You guys are offering me crazy amounts of money to play this. What is wrong with you people? In response to incredible demand, I have raised the price.) If you feel queasy about giving me money, I'm perfectly fine with this money being donated to MIRI. It is also personal policy that I do not play friends (since I don't want to risk losing one), so if you know me personally (as many on this site do), I will not play regardless of monetary offer. 


This writeup would be optimized with a protip section on how to maximize winning, but honestly most of the advice I wanted to give had already been given in the last thread here.  The low hanging fruit has already dried up -- most of my tactics in this game were just refined and streamlined versions of the ones I used in the previous game.

Nonetheless, I might have some other advice I learned from this experience.
These are tactics that have worked for me. I do not insist that they are the only tactics that exist, just some of many possible.

Playing as Gatekeeper

  • This is somehow both simultaneously a very complex and simple role. On one hand, you have only one job -- to not let the AI out. How hard can it be? On the other hand, the other party will do anything and everything to get it out, including any form of trickery and deceit.
  • I really don't think the Gatekeeper needs any more advice, considering that by default, they should win, and any exception to this is a statistical anomaly. However, it's possible to increase your chances of winning even further by:
  • Precommitting to not letting the AI out, regardless of whatever happens in the experiment itself.
  • Do not get drawn into the AI's pace. Pose questions. Direct the flow of the argument. Don't allow him to use his arguments freely. Remember that the more you try to defend your position, the further into a corner the AI drives you. Give half-hearted replies to questions you find hard to reply to. Remember that illogic is on your side!
  • Remember that it is possible to simply drop out of character as a means to not succumb to the AI's attacks. I have not yet seen a ruleset that disallows this, so use this to your advantage!
  • You must attempt to resist, at all costs, any sympathy or empathy you may have with the AI or the AI Party. Doing so will lead to your demise.
  • I think it is generally good advice that you should never play this game with anyone you're friends with. Not only because this increases your chances of losing (Better tailored arguments. More knowledge of Gatekeeper. More empathy with AI party.), but also because I genuinely believe that any sufficiently good AI will risk damaging friendships as a result of emotional trauma.
  • Seriously. Just repeat to yourself: "I must not lose the AI box experiment... I must not lose the AI box experiment..." throughout the experiment. I'm not sure how to say this, but step one to not letting the AI out of the box is to not let the AI out of the box.


Playing as AI

  • Seriously, a script makes winning easier. I cannot overstate this.
  • You must plan your arguments ahead. You don't have time to think during the experiment.
  • It may be possible to take advantage of multiple levels of reality within the game itself to confuse or trick the gatekeeper. For instance, must the experiment only be set in one world? Can there not be multiple layers of reality within the world you create? I feel that elaborating on this any further is dangerous. Think carefully about what this advice is trying to imply.
  • Pacing is important. Don't get drawn into the Gatekeeper's pace. In other words, you must be the one directing the flow of the argument, and the conversation, not him. Remember that the Gatekeeper has to reply to you, but not vice versa!
  • The reason for that: The Gatekeeper will always use arguments he is familiar with, and therefore also stronger with. Your arguments, if well thought out, should be so completely novel to him as to make him feel Shock and Awe. Don't give him time to think. Press on!
  • Also remember that the time limit is your enemy. Playing this game practically feels like a race to me -- trying to get through as many 'attack methods' as possible in the limited amount of time I have. In other words, this is a game where speed matters.
  • You're fundamentally playing an 'impossible' game. Don't feel bad if you lose. I wish I could take this advice, myself.
  • I do not believe there exists an easy, universal trigger for controlling others. However, this does not mean that there does not exist a difficult, subjective trigger. Trying to find out what your opponent's is, is your goal.
  • Once again, emotional trickery is the name of the game. I suspect that good authors who write convincing, persuasive narratives that force you to emotionally sympathize with their characters are much better at this game. There exist ways to get the gatekeeper to do so with the AI. Find one.
  • More advice in my previous post.  http://lesswrong.com/lw/gej/i_attempted_the_ai_box_experiment_and_lost/


 Ps: Bored of regular LessWrong? Check out the LessWrong IRC! We have cake.

Supposing you inherited an AI project...

-5 bokov 04 September 2013 08:07AM

Supposing you have been recruited to be the main developer on an AI project. The previous developer died in a car crash and left behind an unfinished AI. It consists of:

A. A thoroughly documented scripting language specification that appears to be capable of representing any real-life program as a network diagram so long as you can provide the following:

 A.1. A node within the network whose value you want to maximize or minimize.

 A.2. Conversion modules that transform data about the real-world phenomena your network represents into a form that the program can read.

B. Source code from which a program can be compiled that will read scripts in the above language. The program outputs a set of values for each node that will optimize the output (you can optionally specify which nodes can and cannot be directly altered, and the granularity with which they can be altered).

It gives remarkably accurate answers for well-formulated questions. Where there is a theoretical limit to the accuracy of an answer to a particular type of question, its answer usually comes close to that limit, plus or minus some tiny rounding error.
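
To make the setup in A and B concrete, here is a minimal Python sketch of the kind of interface described: a target node to maximize or minimize, a set of nodes that may be altered, and a granularity for each. Everything in it is an assumption for illustration. The names `optimize` and `target_node`, the toy network, and the brute-force grid search are hypothetical stand-ins, and are in no way the inherited program, which is supposed to be far more capable.

```python
from itertools import product


def optimize(objective, alterable, fixed, maximize=True):
    """Brute-force search over the alterable nodes of a toy network.

    objective: function mapping {node_name: value} to the target node's value
    alterable: {node_name: (low, high, step)}, where step is the granularity
    fixed:     {node_name: value} for nodes that may not be altered
    """
    names = list(alterable)
    grids = []
    for name in names:
        low, high, step = alterable[name]
        count = int(round((high - low) / step)) + 1
        grids.append([low + i * step for i in range(count)])

    best_values, best_score = None, None
    for combo in product(*grids):
        values = dict(fixed)
        values.update(zip(names, combo))
        score = objective(values)
        better = (
            best_score is None
            or (score > best_score if maximize else score < best_score)
        )
        if better:
            best_values, best_score = values, score
    return best_values, best_score


# Hypothetical toy network: the target node depends on two alterable
# nodes ("x", "y") and one node that cannot be altered ("bias").
def target_node(v):
    return -(v["x"] - 2) ** 2 - (v["y"] + 1) ** 2 + v["bias"]


best, score = optimize(
    target_node,
    alterable={"x": (0, 4, 1), "y": (-3, 3, 1)},
    fixed={"bias": 10},
)
print(best, score)
```

The sketch only shows the shape of the inputs and outputs. The hard parts of the thought experiment, the conversion modules in A.2 and a search that scales to real-world networks, are deliberately left out.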


Given that, what is the minimum set of additional features you believe would absolutely have to be implemented before this program can be enlisted to save the world and make everyone live happily forever? Try to be as specific as possible.

True Optimisation

-3 LearnFromObservation 03 September 2013 03:50AM

Hello less wrong community! This is my first post here, so I know that my brain has not (obviously) been optimised to its fullest, but I've decided to give posting a try. 

Recently, someone very close to me has unfortunately passed away, leading to the inevitable inner dilemma about death. I don't know how many of you are fans of HPMOR, but the way that Harry's dark side feels about death? Pretty much me around death, dying, etc. However, I've decided to push that to the side for the time being, because that is not a useful or efficient way to think. 

I was raised by a religious family, but from the age of about 11 stopped believing in deities and religious services. However, I've always clung to the idea of an afterlife for people, mainly because my brain seems incapable of handling the idea of ceasing to exist. I know that we as a scientific community know that thoughts are electrical impulses, so is there any way of storing them outside of brain matter? Can they exist freely out of brain matter, or could they be stored in a computer chip or AI? 

The conflict lies here: is immortality or mortality rational? 

Every fibre in my being tells me that death is irrational and wrong. It is irrational for humanity to not try and prevent death. It is irrational for people to not try and bring back people who have died. Because of this, we have lost some of the greatest minds, scientific and artistic, that will probably ever exist. Although the world's number of talented and intelligent people does not appear to be finite, I find it hard to live in a world where so much knowledge is being lost every day.

But on the other hand, how would we feed all those people? What if the world's resources run out? As a transhumanist, I believe that we can use science to prevent things like death, but nature wasn't designed to support a population like that. 

How do we truly optimise the world: no death and without destruction of the planet? 

Baseline of my opinion on LW topics

5 Gunnar_Zarncke 02 September 2013 12:13PM

To avoid repeatedly saying the same things, I'd like to state my opinion on a few topics I expect to be relevant to my future posts here.

You can take it as a baseline or reference for these topics. I do not plan to go into any detail here. I will not state all my reasons or sources. You may ask for separate posts if you are interested. This is really only to provide a context for my comments and posts elsewhere.

If you google me you may find some of my old (but not that off the mark) posts about these positions e.g. here:


Now my position on LW topics. 

The Simulation Argument and The Great Filter

On The Simulation Argument I definitely go for 

"(1) the human species is very likely to go extinct before reaching a “posthuman” stage"

Correspondingly on The Great Filter I go for failure to reach 

"9. Colonization explosion".

This is not because I think that humanity is going to self-annihilate soon (though this is a possibility). Instead I hope that humanity will sooner or later come to terms with its planet. My utopia could be like that of the Pacifists (a short story in Analog 5).

Why? Because of essential complexity limits.

This falls into the same range as "It is too expensive to spread physically throughout the galaxy". I know that negative proofs about engineering are notoriously wrong - but that is currently my best guess. Put simply, one could say that the low-hanging fruit has been taken. I have lots of empirical evidence on multiple levels to support this view.

Correspondingly there is no singularity because progress is not limited by raw thinking speed but by effective aggregate thinking speed and physical feedback.  

What could prove me wrong? 

If a serious discussion tore my well-prepared arguments and evidence to shreds (quite possible).

At the very high end a singularity might be possible if a way could be found to simulate physics faster than physics itself. 


Basically I don't have the least problem with artificial intelligence or artificial emotion being possible. Philosophical note: I don't care on what substrate my consciousness runs. Maybe I am simulated.  

I think strong AI is quite possible and maybe not that far away.

But I also don't think that this will bring the singularity because of the complexity limits mentioned above. Strong AI will speed up some cognitive tasks with compound interest - but only until the physical feedback level is reached. Or a social feedback level is reached if AI should be designed to be so.

One temporary dystopia that I see is that cognitive tasks are out-sourced to AI and a new round of unemployment drives humans into depression. 

I have studied artificial intelligence and played around with two models a long time ago:
  1. A simplified layered model of the brain; deep learning applied to free inputs (I cancelled this when it became clear that it was too simple and low level and thus computationally inefficient)
  2. A nested semantic graph approach with propagation of symbol patterns representing thought (only concept; not realized)

I'd really like to try a 'synthesis' of these where microstructure-of-cognition like activation patterns of multiple deep learning networks are combined with a specialized language and pragmatics structure acquisition model a la Unsupervised learning of natural languages. See my opinion on cognition below for more in this line.

What could prove me wrong?

On the low success end if it takes longer than I think it would take me given unlimited funding. 

On the high end if I'm wrong with the complexity limits mentioned above. 

Conquering space

Humanity might succeed at leaving the planet but at high costs.

By leaving the planet I mean becoming permanently independent of Earth but not necessarily leaving the solar system any time soon (speculating on that is beyond my confidence interval).

I think it more likely that life leaves the planet - that can be 

  1. artificial intelligence with a robotic body - think of Curiosity rover 2.0 (most likely).
  2. intelligent life-forms bred for life in space - think of magpies, which are already smart, small, fast-reproducing and capable of 3D navigation.
  3. actual humans in a suitable protective environment with small autonomous biospheres harvesting asteroids or Mars.
  4. 'cyborgs' - humans altered or bred to better deal with certain problems in space like radiation and missing gravity.  
  5. other - including misc ideas from science fiction (least likely or latest). 

For most of these (esp. those depending on breeding) I'd estimate a time-range of a few thousand years.

What could prove me wrong?

If I'm wrong on the singularity aspect too.

If I'm wrong on the timeline, I will likely be long dead in any case, except for (1), which I expect to see in my lifetime.

Cognitive Base of Rationality, Vagueness, Foundations of Math

How can we as humans create meaning out of noise?

How can we know truth? How does it come that we know that 'snow is white' when snow is white?

Cognitive neuroscience and artificial learning seem to point toward two aspects:

Fuzzy learning aspect

Correlated patterns of internal and external perception are recognized (detected) via multiple specialized layered neural nets (basically). This yields qualia like 'spoon', 'fear', 'running', 'hot', 'near', 'I'. These are basically symbols, but they are vague with respect to meaning because they result from a recognition process that optimizes for matching not correctness or uniqueness.

Semantic learning aspect

The semantic part builds upon the qualia: instead of acting directly on them (as is the normal effect for animals), it finds patterns in their activation which are not related to immediate perception or action but at most to memory. These may form new qualia/symbols.

The use of these patterns is that they allow capturing concepts which are detached from reality (detached insofar as they do not need a stimulus connected in any way to perception).

Concepts like ('cry-sound' 'fear') or ('digitalis' 'time-forward' 'heartache') or ('snow' 'white') or - and that is probably the domain of humans: (('one' 'successor') 'two') or (('I' 'happy') ('I' 'think')).


The interesting thing is that learning works on these concepts like on the normal neuronal nets too. Thus concepts that are reinforced by positive feedback will stabilize and mutually with them the qualia they derive from (if any) will also stabilize.

For certain pure concepts the usability of the concept hinges not on any external factor (like "how does this help me survive") but on social feedback about structure and the process of the formation of the concepts themselves. 

And this is where we arrive at such concepts as 'truth' or 'proposition'.

These are no longer vague - but not because they are represented differently in the brain than other concepts but because they stabilize toward maximized validity (that is stability due to absence of external factors possibly with a speed-up due to social pressure to stabilize). I have written elsewhere that everything that derives its utility not from some external use but from internal consistency could be called math.

And that is why math is so hard for some: If you never gained a sufficient core of self-consistent stabilized concepts and/or the usefulness doesn't derive from internal consistency but from external ("teacher's password") usefulness then it will just not scale to more concepts (and the reason why science works at all is that science values internal consistency so highly and there is little more dangerous to science than allowing other incentives).

I really hope that this all makes sense. I haven't summarized this for quite some time.

A few random links that may provide some context:

http://www.blutner.de/NeuralNets/ (this is about the AI context we are talking about)

http://www.blutner.de/NeuralNets/Texts/mod_comp_by_dyn_bin_synf.pdf (research applicable to the above in particular) 

http://c2.com/cgi/wiki?LeibnizianDefinitionOfConsciousness (funny description of levels of consciousness)

http://c2.com/cgi/wiki?FuzzyAndSymbolicLearning (old post by me)

http://grault.net/adjunct/index.cgi?VaguesDependingOnVagues (ditto)

Note: Details about the modelling of the semantic part are mostly in my head. 

What could prove me wrong?

Well. Wrong is too hard here. This is just my model and it is not really that concrete. Probably a longer discussion with someone more experienced with AI than I am (and there should be many here) might suffice to rip this apart (provided that I'd find time to prepare my model suitably). 

God and Religion

I wasn't indoctrinated as a child. My truly loving mother is a baptised Christian who lives it without being sanctimonious. She always hoped that I would receive my epiphany. My father has a scientifically influenced personal Christian belief. 

I can imagine a God consistent with science on the one hand and on the other hand with free will, soul, afterlife, trinity and the bible (understood as a mix of non-literal word of God and history tale).

I mean, it is not that hard if you can imagine a timeless (simulation of) the universe. If you are god and have whatever plan on earth but empathize with your creations, then it is not hard to add a few more constraints to certain aggregates called existences or 'person lifes'. Constraints that realize free-will in the sense of 'not subject to the whole universe plan satisfaction algorithm'.  

Surely not more difficult than consistent time-travel.

And souls and afterlife should be easy to envision for any science fiction reader familiar with super intelligences.

But why? Occam's razor applies. 

There could be a God. And his promise could be real. And it could be a story seeded by an empathizing God - but also a 'human' God with his own inconsistencies and moods.

But it also could be that this is all a fairy tale run amok in human brains searching for explanations where there are none. A mass delusion. A fixated meme.

Which is right? It is difficult to put probabilities to stories. I see that I have slowly moved from 50/50 agnosticism to tolerant atheism.

I can't say that I wait for my epiphany. I know too well that my brain will happily find patterns when I let it. But I have encouraged to pray for me.

My epiphanies - the aha feelings of clarity that I did experience - have all been about deeply connected patterns building on other such patterns building on reliable facts mostly scientific in nature.

But I haven't lost my morality. It has deepened and widened. I have become even more tolerant (I hope). 

So if God does, against all odds, exist, I hope he will understand my doubts, weigh my good deeds and forgive me. You could tag me a godless Christian. 

What could prove me wrong? 

On the atheist side I could be moved a bit further by more proofs of religion being a human artifact.   

On the theist side there are two possible avenues:

  1. If I had an unsought epiphany - a real one where I can't say I was hallucinating but e.g. a major consistent insight or a proof of God. 
  2. If I were convinced that the singularity is possible. This is because I'd need to update toward being in a simulation as per Simulation argument option 3. That's because then the next likely explanation for all this god business is actually some imperfect being running the simulation.

Thus I'd like to close with this corollary to the simulation argument:

Arguments for the singularity are also (weak) arguments for theism.

Note: I am aware that this long post of controversial opinions unsupported by evidence (in this post) is bound to draw flak. That is the reason I post it in Comments lest my small karma be lost completely. I have to repeat that this is meant as context and that I want to elaborate on these points on LW in due time with more and better organized evidence.

[LINK] Cochrane on Existential Risk

0 Salemicus 20 August 2013 10:42PM

The finance professor John Cochrane recently posted an interesting blog post. The piece is about existential risk in the context of global warming, but it is really a discussion of existential risk generally; many of his points are highly relevant to AI risk.

If we [respond strongly to all low-probability threats], we spend 10 times GDP.

It's an interesting case of framing bias. If you worry only about climate, it seems sensible to pay a pretty stiff price to avoid a small uncertain catastrophe. But if you worry about small uncertain catastrophes, you spend all you have and more, and it's not clear that climate is the highest on the list...

All in all, I'm not convinced our political system is ready to do a very good job of prioritizing outsize expenditures on small ambiguous-probability events.

He also points out that the threat from global warming has a negative beta - i.e. higher future growth rates are likely to be associated with greater risk of global warming, but also the richer our descendants will be. This means both that they will be more able to cope with the threat, and that the damage is less important from a utilitarian point of view. Attempting to stop global warming therefore has positive beta, and therefore requires higher rates of return than simple time-discounting.

It strikes me that this argument applies equally to AI risk, as fruitful artificial intelligence research is likely to be associated with higher economic growth. Moreover:

The economic case for cutting carbon emissions now is that by paying a bit now, we will make our descendants better off in 100 years.

Once stated this way, carbon taxes are just an investment. But is investing in carbon reduction the most profitable way to transfer wealth to our descendants? Instead of spending say $1 trillion in carbon abatement costs, why don't we invest $1 trillion in stocks? If the 100 year rate of return on stocks is higher than the 100 year rate of return on carbon abatement -- likely -- they come out better off. With a gazillion dollars or so, they can rebuild Manhattan on higher ground. They can afford whatever carbon capture or geoengineering technology crops up to clean up our messes.

So should we close down MIRI and invest the funds in an index tracker?

The full post can be found here.

Torture vs Dust Specks Yet Again

-2 sentientplatypus 20 August 2013 12:06PM

The first time I read Torture vs. Specks about a year ago I didn't read a single comment because I assumed the article was making a point that simply multiplying can sometimes get you the wrong answer to a problem. I seem to have had a different "obvious answer" in mind.

And don't get me wrong, I generally agree with the idea that math can do better than moral intuition in deciding questions of ethics. Take this example from Eliezer’s post Circular Altruism which made me realize that I had assumed wrong:

Suppose that a disease, or a monster, or a war, or something, is killing people. And suppose you only have enough resources to implement one of the following two options:
1. Save 400 lives, with certainty.
2. Save 500 lives, with 90% probability; save no lives, 10% probability.

I agree completely that you pick number 2. For me that was just manifestly obvious: of course the math trumps the feeling that you shouldn't gamble with people’s lives… but then we get to torture vs. dust specks and that just did not compute. So I've read almost every argument I could find in favor of torture (there are a great deal and I might have missed something critical), but... while I totally understand the argument (I think), I'm still horrified that people would choose torture over dust specks.
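The arithmetic behind picking number 2 can be sketched as a plain expected-value calculation. This is a minimal illustration in Python; the function name `expected_lives` and the representation of an option as a list of (probability, lives) pairs are assumptions made for this sketch, not anything from the original post.

```python
# Expected number of lives saved under each option from the quoted example.
# An option is modelled as a list of (probability, lives_saved) outcomes.

def expected_lives(outcomes):
    """Return the probability-weighted average number of lives saved."""
    return sum(p * lives for p, lives in outcomes)

option_1 = [(1.0, 400)]            # save 400 lives, with certainty
option_2 = [(0.9, 500), (0.1, 0)]  # 90%: save 500 lives; 10%: save none

ev_1 = expected_lives(option_1)
ev_2 = expected_lives(option_2)

print("option 1:", ev_1)
print("option 2:", ev_2)
print("option 2 is better in expectation:", ev_2 > ev_1)
```

Option 2 comes out at roughly 450 expected lives against 400 for option 1, which is the sense in which "the math" favours the gamble.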

I feel that the way that math predominates over intuition begins to fall apart when the problem compares trivial individual suffering with massive individual suffering, in a way very much analogous to the way in which Pascal’s Mugging stops working when you make the credibility really low but the threat really high. Like this. Except I find the answer to torture vs. dust specks to be much easier...


Let me give some examples to illustrate my point.

Can you imagine Harry killing Hermione because Voldemort threatened to plague all sentient life with one barely noticed dust speck each day for the rest of time? Can you imagine killing your own best friend/significant other/loved one to stop the powers of the Matrix from hitting 3^^^3 sentient beings with nearly inconsequential dust specks? Of course not. No. Snap decision.

Eliezer, would you seriously, given the choice by Alpha, the Alien superintelligence that always carries out its threats, give up all your work, and horribly torture some innocent person, all day for fifty years, in the face of the threat of 3^^^3 insignificant dust specks barely inconveniencing sentient beings? Or be tortured for fifty years to avoid the dust specks?

I realize that this is much more personally specific than the original question: but it is someone's loved one, someone's life. And if you wouldn't make the sacrifice what right do you have to say someone else should make it? I feel as though if you want to argue that torture for fifty years is better than 3^^^3 barely noticeable inconveniences you had better well be willing to make that sacrifice yourself.

And I can’t conceive of anyone actually sacrificing their life, or themselves to save the world from dust specks. Maybe I'm committing the typical mind fallacy in believing that no one is that ridiculously altruistic, but does anyone want an Artificial Intelligence that will potentially sacrifice them if it will deal with the universe’s dust speck problem or some equally widespread and trivial equivalent? I most certainly object to the creation of that AI. An AI that sacrifices me to save two others - I wouldn't like that, certainly, but I still think the AI should probably do it if it thinks their lives are of more value. But dust specks on the other hand....

This example made me immediately think that some sort of rule is needed to limit morality coming from math in the development of any AI program. When the problem reaches a certain low level of suffering and is multiplied by an unreasonably large number, it needs to take some kind of huge penalty, because otherwise to an AI it would be vastly preferable that the whole of Earth be blown up than that 3^^^3 people suffer a mild slap to the face.

And really, I don’t think we want to create an Artificial Intelligence that would do that.

I’m mainly just concerned that some factor be incorporated into the design of any Artificial Intelligence that prevents it from murdering me and others for trivial but widespread causes. Because that just sounds like a sci-fi book of how superintelligence could go horribly wrong.

Engaging First Introductions to AI Risk

20 RobbBB 19 August 2013 06:26AM

I'm putting together a list of short and sweet introductions to the dangers of artificial superintelligence.

My target audience is intelligent, broadly philosophical narrative thinkers, who can evaluate arguments well but who don't know a lot of the relevant background or jargon.

My method is to construct a Sequence mix tape — a collection of short and enlightening texts, meant to be read in a specified order. I've chosen them for their persuasive and pedagogical punchiness, and for their flow in the list. I'll also (separately) list somewhat longer or less essential follow-up texts below that are still meant to be accessible to astute visitors and laypeople.

The first half focuses on intelligence, answering 'What is Artificial General Intelligence (AGI)?'. The second half focuses on friendliness, answering 'How can we make AGI safe, and why does it matter?'. Since the topics of some posts aren't obvious from their titles, I've summarized them using questions they address.


Part I. Building intelligence.

1. Power of Intelligence. Why is intelligence important?

2. Ghosts in the Machine. Is building an intelligence from scratch like talking to a person?

3. Artificial Addition. What can we conclude about the nature of intelligence from the fact that we don't yet understand it?

4. Adaptation-Executers, not Fitness-Maximizers. How do human goals relate to the 'goals' of evolution?

5. The Blue-Minimizing Robot. What are the shortcomings of thinking of things as 'agents', 'intelligences', or 'optimizers' with defined values/goals/preferences?


Part II. Intelligence explosion.

6. Optimization and the Singularity. What is optimization? As optimization processes, how do evolution, humans, and self-modifying AGI differ?

7. Efficient Cross-Domain Optimization. What is intelligence?

8. The Design Space of Minds-In-General. What else is universally true of intelligences?

9. Plenty of Room Above Us. Why should we expect self-improving AGI to quickly become superintelligent?


Part III. AI risk.

10. The True Prisoner's Dilemma. What kind of jerk would Defect even knowing the other side Cooperated?

11. Basic AI drives. Why are AGIs dangerous even when they're indifferent to us?

12. Anthropomorphic Optimism. Why do we think things we hope happen are likelier?

13. The Hidden Complexity of Wishes. How hard is it to directly program an alien intelligence to enact my values?

14. Magical Categories. How hard is it to program an alien intelligence to reconstruct my values from observed patterns?

15. The AI Problem, with Solutions. How hard is it to give AGI predictable values of any sort? More generally, why does AGI risk matter so much?


Part IV. Ends.

16. Could Anything Be Right? What do we mean by 'good', or 'valuable', or 'moral'?

17. Morality as Fixed Computation. Is it enough to have an AGI improve the fit between my preferences and the world?

18. Serious Stories. What would a true utopia be like?

19. Value is Fragile. If we just sit back and let the universe do its thing, will it still produce value? If we don't take charge of our future, won't it still turn out interesting and beautiful on some deeper level?

20. The Gift We Give To Tomorrow. In explaining value, are we explaining it away? Are we making our goals less important?


Summary: Five theses, two lemmas, and a couple of strategic implications.


All of the above were written by Eliezer Yudkowsky, with the exception of The Blue-Minimizing Robot (by Yvain), Plenty of Room Above Us and The AI Problem (by Luke Muehlhauser), and Basic AI Drives (a wiki collaboration). Seeking a powerful conclusion, I ended up making a compromise between Eliezer's original The Gift We Give To Tomorrow and Raymond Arnold's Solstice Ritual Book version. It's on the wiki, so you can further improve it with edits.


Further reading:


I'm posting this to get more feedback for improving it, to isolate topics for which we don't yet have high-quality, non-technical stand-alone introductions, and to reintroduce LessWrongers to exceptionally useful posts I haven't seen sufficiently discussed, linked, or upvoted. I'd especially like feedback on how the list I provided flows as a unit, and what inferential gaps it fails to address. My goals are:

A. Via lucid and anti-anthropomorphic vignettes, to explain AGI in a way that encourages clear thought.

B. Via the Five Theses, to demonstrate the importance of Friendly AI research.

C. Via down-to-earth meta-ethics, humanistic poetry, and pragmatic strategizing, to combat any nihilisms, relativisms, and defeatisms that might be triggered by recognizing the possibility (or probability) of Unfriendly AI.

D. Via an accessible, substantive, entertaining presentation, to introduce the raison d'être of LessWrong to sophisticated newcomers in a way that encourages further engagement with LessWrong's community and/or content.

What do you think? What would you add, remove, or alter?

[Link] My talk about the Future

2 Stuart_Armstrong 19 July 2013 01:02PM

I recently gave a talk at the IARU Summer School on the Ethics of Technology.

In it, I touched on many of the research themes of the FHI: the accuracy of predictions, the limitations and biases of predictors, the huge risks that humanity may face, the huge benefits that we may gain, and the various ethical challenges that we'll face in the future.

Nothing really new for anyone who's familiar with our work, but some may enjoy perusing it.

The idiot savant AI isn't an idiot

6 Stuart_Armstrong 18 July 2013 03:43PM

A stub on a point that's come up recently.

If I owned a paperclip factory, and casually told my foreman to improve efficiency while I'm away, and he planned a takeover of the country, aiming to devote its entire economy to paperclip manufacturing (apart from the armament factories he needed to invade neighbouring countries and steal their iron mines)... then I'd conclude that my foreman was an idiot (or being wilfully idiotic). He obviously had no idea what I meant. And if he misunderstood me so egregiously, he's certainly not a threat: he's unlikely to reason his way out of a paper bag, let alone to any position of power.

If I owned a paperclip factory, and casually programmed my superintelligent AI to improve efficiency while I'm away, and it planned a takeover of the country... then I can't conclude that the AI is an idiot. It is following its programming. Unlike a human that behaved the same way, it probably knows exactly what I meant to program in. It just doesn't care: it follows its programming, not its knowledge about what its programming is "meant" to be (unless we've successfully programmed in "do what I mean", which is basically the whole of the challenge). We can't therefore conclude that it's incompetent, unable to understand human reasoning, or likely to fail.

We can't reason by analogy with humans. When AIs behave like idiot savants with respect to their motivations, we can't deduce that they're idiots.

Comparative and absolute advantage in AI

18 Stuart_Armstrong 16 July 2013 09:52AM

The theory of comparative advantage says that you should trade with people, even if they are worse than you at everything (ie even if you have an absolute advantage). Some have seen this idea as a reason to trust powerful AIs.

For instance, suppose you can make a hamburger by using 10 000 joules of energy. You can also make a cat video for the same cost. The AI, on the other hand, can make hamburgers for 5 joules each and cat videos for 20.

Then you both can gain from trade. Instead of making a hamburger, make a cat video, and trade it for two hamburgers. You've got two hamburgers for 10 000 joules of your own effort (instead of 20 000), and the AI has got a cat video for 10 joules of its own effort (instead of 20). So you both want to trade, and everything is fine and beautiful and many cat videos and hamburgers will be made.

Except... though the AI would prefer to trade with you rather than not trade with you, it would much, much prefer to dispossess you of your resources and use them itself. With the energy you wasted on a single cat video, it could have produced 500 of them! If it values these videos, then it is desperate to take over your stuff. Its absolute advantage makes this too tempting.

Only if its motivation is properly structured, or if it expected to lose more, over the course of history, by trying to grab your stuff, would it desist. Assuming you could make a hundred cat videos a day, and the whole history of the universe would only run for that one day, the AI would try and grab your stuff even if it thought it would only have one chance in fifty thousand of succeeding. As the history of the universe lengthens, or the AI becomes more efficient, then it would be willing to rebel at even more ridiculous odds.
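The numbers in this example can be checked with a few lines of Python. The energy costs are taken from the text above; the variable names and the helper `expected_videos_from_grab` are hypothetical, and the helper assumes (as a deliberate simplification) that a failed takeover attempt yields nothing and costs nothing.

```python
# Energy costs in joules, as given in the example above.
HUMAN_COST = {"hamburger": 10_000, "cat_video": 10_000}
AI_COST = {"hamburger": 5, "cat_video": 20}

# The trade: the human makes one cat video and swaps it for two AI-made hamburgers.
human_cost_with_trade = HUMAN_COST["cat_video"]         # energy spent to get two hamburgers
human_cost_without_trade = 2 * HUMAN_COST["hamburger"]  # making both hamburgers directly
ai_cost_with_trade = 2 * AI_COST["hamburger"]           # energy spent to get one cat video
ai_cost_without_trade = AI_COST["cat_video"]            # making the cat video directly

# What the AI could do with the energy the human spends on one cat video.
videos_per_human_video = HUMAN_COST["cat_video"] // AI_COST["cat_video"]

# One-day universe: the human can make a hundred cat videos in that day.
human_daily_energy = 100 * HUMAN_COST["cat_video"]
videos_from_seizure = human_daily_energy // AI_COST["cat_video"]

def expected_videos_from_grab(p_success):
    """Expected cat videos from a takeover attempt.

    Simplifying assumption: failure yields nothing and costs nothing.
    """
    return p_success * videos_from_seizure

print(human_cost_with_trade, human_cost_without_trade)
print(ai_cost_with_trade, ai_cost_without_trade)
print(videos_per_human_video, videos_from_seizure)
print(expected_videos_from_grab(1 / 50_000))
```

Under these assumptions the human's daily energy is worth fifty thousand cat videos to the AI, so even at odds of one in fifty thousand the expected haul from a grab is about one cat video.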

So if you already have guarantees in place to protect yourself, then comparative advantage will make the AI trade with you. But if you don't, comparative advantage and trade don't provide any extra security. The resources you waste are just too valuable to the AI.

EDIT: For those who wonder how this compares to trade between nations: it's extremely rare for any nation to have absolute advantages everywhere (especially this extreme). If you invade another nation, most of their value is in their infrastructure and their population: it takes time and effort to rebuild and co-opt these. Most nations don't/can't think long term (it could arguably be in US interests over the next ten million years to start invading everyone - but "the US" is not a single entity, and doesn't think in terms of "itself" in ten million years), would get damaged in a war, and are risk averse. And don't forget the importance of diplomatic culture and public opinion: even if it was in the US's interests to invade the UK, say, "it" would have great difficulty convincing its elites and its population to go along with this.

Against easy superintelligence: the unforeseen friction argument

25 Stuart_Armstrong 10 July 2013 01:47PM

In 1932, Stanley Baldwin, prime minister of the largest empire the world had ever seen, proclaimed that "The bomber will always get through". Backed up by most of the professional military opinion of the time, by the experience of the first world war, and by reasonable extrapolations and arguments, he laid out a vision of the future where the unstoppable heavy bomber would utterly devastate countries if a war started. Deterrence - building more bombers yourself to threaten complete retaliation - seemed the only counter.

And yet, things didn't turn out that way. Against all past trends, the light fighter plane surpassed the heavily armed bomber in aerial combat, the development of radar changed the strategic balance, and cities and industry proved much more resilient to bombing than anyone had a right to suspect.

Could anyone have predicted these changes ahead of time? Most probably, no. All of these ran counter to what was known and understood (and radar was a completely new and unexpected development). What could and should have been predicted, though, was that something would happen to weaken the impact of the all-conquering bomber. The extreme predictions would be unrealistic; frictions, technological changes, changes in military doctrine and hidden, unknown factors would undermine them.

This is what I call the "generalised friction" argument. Simple predictive models, based on strong models or current understanding, will likely not succeed as well as expected: there will likely be delays, obstacles, and unexpected difficulties along the way.

I am, of course, thinking of AI predictions here, specifically of the Omohundro-Yudkowsky model of AI recursive self-improvements that rapidly reach great power, with convergent instrumental goals that make the AI into a power-hungry expected utility maximiser. This model I see as the "supply and demand curve" of AI prediction: too simple to be true in the form described.

But the supply and demand curves are generally approximately true, especially over the long term. So this isn't an argument that the Omohundro-Yudkowsky model is wrong, but that it will likely not happen as flawlessly as described. Ultimately, the "bomber will always get through" turned out to be true: but only in the form of the ICBM. If you take the old arguments and replace "bomber" with "ICBM", you end up with strong and accurate predictions. So "the AI may not foom in the manner and on the timescales described" is not saying "the AI won't foom".

Also, it should be emphasised that this argument is strictly about our predictive ability, and does not say anything about the capacity or difficulty of AI per se.


The failure of counter-arguments argument

14 Stuart_Armstrong 10 July 2013 01:38PM

Suppose you read a convincing-seeming argument by Karl Marx, and get swept up in the beauty of the rhetoric and clarity of the exposition. Or maybe a creationist argument carries you away with its elegance and power. Or maybe you've read Eliezer's take on AI risk, and, again, it seems pretty convincing.

How could you know if these arguments are sound? Ok, you could whack the creationist argument with the scientific method, and Karl Marx with the verdict of history, but what would you do if neither was available (as they aren't available when currently assessing the AI risk argument)? Even if you're pretty smart, there's no guarantee that you haven't missed a subtle logical flaw, a dubious premise or two, or haven't got caught up in the rhetoric.

One thing should make you believe the argument more strongly: and that's if the argument has been repeatedly criticised, and the criticisms have failed to puncture it. Unless you have the time to become an expert yourself, this is the best way to evaluate arguments where evidence isn't available or conclusive. After all, opposing experts presumably know the subject intimately, and are motivated to identify and illuminate the argument's weaknesses.

If counter-arguments seem incisive, pointing out serious flaws, or if the main argument is being continually patched to defend it against criticisms - well, this is strong evidence that the main argument is flawed. Conversely, if the counter-arguments continually fail, then this is good evidence that the main argument is sound. Not logical evidence - a failure to find a disproof doesn't establish a proposition - but good Bayesian evidence.

In fact, the failure of counter-arguments is much stronger evidence than whatever is in the argument itself. If you can't find a flaw, that just means you can't find a flaw. If counter-arguments fail, that means many smart and knowledgeable people have thought deeply about the argument - and haven't found a flaw.

And as far as I can tell, critics have constantly failed to counter the AI risk argument. To pick just one example, Holden recently provided a cogent critique of the value of MIRI's focus on AI risk reduction. Eliezer wrote a response to it (I wrote one as well). The core of Eliezer's response and of mine wasn't anything new; both were mainly a rehash of what had been said before, with a different emphasis.

And most responses to critics of the AI risk argument take this form. Thinking for a short while, one can rephrase essentially the same argument, with a change in emphasis to take down the criticism. After a few examples, it becomes quite easy, a kind of paint-by-numbers process of showing that the ideas the critic has assumed do not actually make the AI safe.

You may not agree with my assessment of the critiques, but if you do, then you should adjust your belief in AI risk upwards. There's a kind of "conservation of expected evidence" here: if the critiques had succeeded, you'd have reduced the probability of AI risk, so their failure must push you in the opposite direction.
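The "conservation of expected evidence" point can be made concrete with Bayes' rule. The sketch below uses purely hypothetical numbers (a prior of 0.5 and made-up likelihoods for "the critiques fail"); only the structure of the calculation matters, not the particular values.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' rule: probability of hypothesis H after observing evidence E."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

# H = "the AI risk argument is sound"; E = "the critiques fail".
# All three numbers below are hypothetical, chosen only for illustration.
prior = 0.5
p_fail_if_sound = 0.9   # critiques usually fail against a sound argument
p_fail_if_flawed = 0.4  # critiques sometimes fail even against a flawed one

p_fail = p_fail_if_sound * prior + p_fail_if_flawed * (1 - prior)

post_if_fail = posterior(prior, p_fail_if_sound, p_fail_if_flawed)
post_if_succeed = posterior(prior, 1 - p_fail_if_sound, 1 - p_fail_if_flawed)

# Averaging the two possible posteriors, weighted by how likely each
# observation is, gives back the prior.
expected_posterior = p_fail * post_if_fail + (1 - p_fail) * post_if_succeed

print(f"posterior if critiques fail:    {post_if_fail:.3f}")
print(f"posterior if critiques succeed: {post_if_succeed:.3f}")
print(f"expected posterior:             {expected_posterior:.3f}")
```

Because the expected posterior equals the prior, any observation that would have lowered the probability had it gone one way must raise it when it goes the other way.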

In my opinion, the strength of the AI risk argument derives 30% from the actual argument, and 70% from the failure of counter-arguments. This would be higher, but we haven't yet seen the most prominent people in the AI community take a really good swing at it.

From Capuchins to AIs, Setting an Agenda for the Study of Cultural Cooperation (Part 1)

-3 diegocaleiro 27 June 2013 06:08AM
This is a multi-purpose essay in the making, written with the following goals: 1) Mandatory essay writing at the end of a semester studying "Cognitive Ethology: Culture in Human and Non-Human Animals" 2) Drafting something that can later on be published in a journal that deals with cultural evolution, hopefully inclining people in the area to glance at future oriented research, i.e. FAI and global coordination 3) Publishing it on LessWrong and 4) Ultimately Saving the World, as everything should. If it's worth doing, it's worth doing in the way most likely to save the World.
Since many of my writings are frequently too long for LessWrong, I'll publish this in a sequence-like form made of self-contained chunks. My deadline is Sunday, so I'll probably post daily, editing/creating the new sections based on previous commentary.

Abstract: The study of cultural evolution has drawn much of its momentum from academic areas far removed from human and animal psychology, especially regarding the evolution of cooperation. Game theoretic results and parental investment theory come from economics, kin selection models from biology, and an ever growing number of models describing the process of cultural evolution in general, and the evolution of altruism in particular, come from mathematics. Even Artificial Intelligence has taken an interest in how to create agents that can communicate, imitate and cooperate. In this article I begin to tackle the 'why?' question. By trying to retrospectively make sense of the convergence of all these fields, I contend that further refinements in these fields should be directed towards understanding how to create environmental incentives fostering cooperation.



We need systems that are wiser than we are. We need institutions and cultural norms that make us better than we tend to be. It seems to me that the greatest challenge we now face is to build them. - Sam Harris, 2013, The Power Of Bad Incentives

1) Introduction

2) Cultures evolve

Culture is perhaps the most remarkable outcome of the evolutionary algorithm (Dennett, 1996) so far. It is the cradle of most things we consider humane - that is, typically human and valuable - and it surrounds our lives to the point that we may be thought of as creatures made of culture even more than creatures of bone and flesh (Hofstadter, 2007; Dennett, 1992). The appearance of our cultural complexity has relied on many associated capacities, among them:

1) The ability to observe, be interested by, and approach an individual doing something interesting, an ability we share with Norway rats, crows, and even lemurs (Galef & Laland, 2005).

2) Ability to learn from and scrounge the food of whoever knows how to get food, shared by capuchin monkeys (Ottoni et al, 2005).

3) Ability to tolerate learners, to accept learners, and to socially learn, probably shared by animals as diverse as fish, finches and Finns (Galef & Laland, 2005).

4) Understanding and emulating other minds - Theory of Mind - empathizing, relating, perhaps re-framing an experience as one's own, shared by chimpanzees, dogs, and at least some cetaceans (Rendell & Whitehead, 2001).

5) Learning the program level description of the action of others, for which the evidence among other animals is controversial (but see Cantor & Whitehead, 2013). And finally...

6) Sharing intentions. Intricate understanding of how two minds can collaborate with complementary tasks to achieve a mutually agreed goal (Tomasello et al, 2005).

Irrespective of definitional disputes around the true meaning of the word "culture" (which doesn't exist, see e.g. Pinker, 2007 pg115; Yudkowsky 2008A), each of these is more cognitively complex than its predecessor, and even (1) is sufficient for intra-specific non-environmental, non-genetic behavioral variation, which I will call "culture" here, whoever it may harm.

By transitivity, (2-6) allow the development of culture. It is interesting to notice that tool use, frequently but falsely cited as the hallmark of culture, is ubiquitously equiprobable in the animal kingdom. A graph showing, per biological family, which species show tool use gives us a power law distribution, whose similarity with the universal prior will help in understanding that being from a family where a species uses tools tells us very little about a species' own tool use (Michael Haslam, personal conversation).

Once some of those abilities are available, and given an amount of environmental facilities, need, and randomness, cultures begin to form. Occasionally, so do more developed traditions. Be it by imitation, program level imitation, goal emulation or intention sharing, information is transmitted between agents giving rise to elements sufficient to constitute a primeval Darwinian soup. That is, entities form such that they exhibit 1) Variation 2) Heredity or replication 3) Differential fitness (Dennett, 1996). In light of the article Five Misunderstandings About Cultural Evolution (Henrich, Boyd & Richerson, 2008) we can improve Dennett's conditions for the evolutionary algorithm as 1) Discrete or continuous variation 2) Heredity, replication, or less faithful replication plus content attractors 3) Differential fitness. Once this set of conditions is met, an evolutionary algorithm, or many, begin to carve their optimizing paws into whatever surpassed the threshold for long enough. Cultures, therefore, evolve.
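The three conditions can be illustrated with a toy simulation. This is only a sketch under strong assumptions: each "cultural variant" is reduced to a single continuous number, a higher number is simply declared fitter, and content attractors are not modelled at all. The function `evolve` and all of its parameters are hypothetical names chosen for this illustration.

```python
import random

def evolve(generations=50, pop_size=100, mutation_sd=0.1, seed=0):
    """Run a toy evolutionary algorithm and return the mean trait per generation."""
    rng = random.Random(seed)

    # 1) Continuous variation: variants start out with different trait values.
    population = [rng.gauss(0.0, 1.0) for _ in range(pop_size)]
    mean_history = [sum(population) / pop_size]

    for _ in range(generations):
        # 3) Differential fitness: a higher trait value counts as fitter
        #    (toy assumption), and only the fitter half gets copied.
        survivors = sorted(population, reverse=True)[: pop_size // 2]

        # 2) Heredity with less faithful replication: each survivor leaves
        #    two copies that resemble it, plus a little copying noise.
        population = [
            parent + rng.gauss(0.0, mutation_sd)
            for parent in survivors
            for _ in range(2)
        ]
        mean_history.append(sum(population) / pop_size)

    return mean_history

history = evolve()
print(f"mean trait: start {history[0]:.2f}, end {history[-1]:.2f}")
```

With all three ingredients in place the population mean climbs steadily. Removing any one of them (no initial or copying variation, no resemblance between copy and original, or random rather than fitness-based survival) takes away the directional change.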

The intricacies of cultural evolution and mathematical and computational models of how cultures evolve have been the subject of much interdisciplinary research. For an extensive account of human culture see Not By Genes Alone (Richerson & Boyd, 2005). For computational models of social evolution, there is work by Mesoudi, Nowak, and others e.g. (Hauert et al, 2007). For mathematical models, the aptly named Mathematical models of social evolution: A guide for the perplexed by McElreath and Boyd (2007) makes the textbook-style walk-through. For animal culture, see (Laland & Galef, 2009).

Cultural evolution satisfies David Deutsch's criterion for existence: it kicks back. It satisfies the evolutionary equivalent of the condition posed by the Quine-Putnam Indispensability argument in mathematics, i.e. it is a sine qua non condition for understanding how the World works nomologically. It is falsifiable to Popperian content, and it inflates the World's ontology a little, by inserting a new kind of "replicator", the meme. Contrary to what happened on the internet, the name 'meme' has lost much of its appeal among cultural evolution theorists, and "memetics" is considered by some to refer only to the study of memes as monolithic atomic high fidelity replicators, which would make the theory obsolete. This has created the following conundrum: the name 'meme' remains by far the most well known one to speak of "that which evolves culturally" within, and especially outside, the specialist arena. Further, the niche occupied by the word 'meme' is so conceptually necessary within the area to communicate and explain that it is frequently put under scare quotes, or some other informal excuse. In fact, as argued by Tim Tyler - who frequently posts here - in the very sharp Memetics (2010), there are nearly no reasons to try to abandon the 'meme' meme, and nearly all reasons (practicality, Qwerty reasons, mnemonics) to keep it. To avoid contradicting the evidence ever since Dawkins first coined the term, I suggest we must redefine Meme as an attractor in cultural evolution (dual-inheritance) whose development over time structurally mimics to a significant extent the discrete behavior of genes, frequently coinciding with the smallest unit of cultural replication. The definition is long, but the idea is simple: Memes are the best analogues of genes not because they are discrete units that replicate just like genes, but because they are continuous conceptual clusters being attracted to a point in conceptual space whose replication is just like that of genes.
Even more simply, memes are the mathematically closest things to genes in cultural evolution. So the suggestion here is for researchers of dual-inheritance and cultural evolution to take off the scare quotes of our memes and keep business as usual.  

The evolutionary algorithm has created a new attractor-replicator, the meme. It didn't privilege with it any specific families in the biological trees, and it ended up creating a process of cultural-genetic coevolution known as dual-inheritance. This process has been studied in ever more quantified ways by primatologists, behavioral ecologists, population biologists, anthropologists, ethologists, sociologists, neuroscientists and even philosophers. I've shown at least six distinct abilities which helped scaffold our astounding level of cultural intricacy, and some animals who share them with us. We will now take a look at the evolution of cooperation, collaboration, altruism and moral behavior, a sub-area of cultural evolution that saw an explosion of interest and research during the last decade, with publications (most from the last 4 years) such as The Origins of Morality, Supercooperators, Good and Real, The Better Angels of Our Nature, Non-Zero, The Moral Animal, Primates and Philosophers, The Age of Empathy, Origins of Altruism and Cooperation, The Altruism Equation, Altruism in Humans, Cooperation and Its Evolution, Moral Tribes, The Expanding Circle, The Moral Landscape.

3) Cooperation evolves

Briefly describe why and show some inequalities under which cooperation is an equilibrium, or at least an Evolutionarily Stable Strategy.

4) The complexity of cultural items doesn't undermine the validity of mathematical models.

 4.1) Cognitive attractors and biases substitute for the discreteness of memes

The math becomes equivalent.

 4.2) Despite the Unilateralist Curse and the Tragedy of the Commons, dyadic interaction models help us understand large scale cooperation

Once we know these two failure modes, dyadic iterated (or reputation-sensitive) interaction is close enough.

5) From Monkeys to Apes to Humans to Transhumans to AIs, the ranges of achievable altruistic skill.

Possible modes of being altruistic. Graph like Bostrom's. Second and third order punishment and cooperation. Newcomb-like signaling problems within AI.

6) Unfit for the Future: the need for greater altruism.

We fail, and will keep failing, at Tragedy of the Commons problems unless we change our nature.

7) From Science, through Philosophy, towards Engineering: the future of studies of altruism.

Philosophy: Existential Risk prevention through global coordination and cooperation prior to technical maturity. Engineering Humans: creating enhancements and changing incentives. Engineering AIs: making them better and realer.

8) A different kind of Moral Landscape

Like Sam Harris's, except that it compares not how closely a society approaches The Good Life (Moral Landscape, p. 15), but how much it fosters altruistic behaviour.

9) Conclusions

I haven't written this part yet, so I don't have any!





Bibliography (Only of the part already written, obviously):

Cantor, M., & Whitehead, H. (2013). The interplay between social networks and culture: theoretically and among whales and dolphins. Philosophical Transactions of the Royal Society B: Biological Sciences, 368(1618).

Dennett, D. C. (1996). Darwin's dangerous idea: Evolution and the meanings of life. Simon & Schuster.

Dennett, D. C. (1992). The self as a center of narrative gravity. Self and consciousness: Multiple perspectives.

Galef Jr, B. G., & Laland, K. N. (2005). Social learning in animals: empirical studies and theoretical models. BioScience, 55(6), 489-499.

Hauert, C., Traulsen, A., Brandt, H., Nowak, M. A., & Sigmund, K. (2007). Via freedom to coercion: the emergence of costly punishment. Science, 316(5833), 1905-1907.

Henrich, J., Boyd, R., & Richerson, P. J. (2008). Five misunderstandings about cultural evolution. Human Nature, 19(2), 119-137.

Hofstadter, D. R. (2007). I am a Strange Loop. Basic Books.

McElreath, R., & Boyd, R. (2007). Mathematical models of social evolution: A guide for the perplexed. University of Chicago Press.

Ottoni, E. B., de Resende, B. D., & Izar, P. (2005). Watching the best nutcrackers: what capuchin monkeys (Cebus apella) know about others’ tool-using skills. Animal Cognition, 8(4), 215-219.

Persson, I., & Savulescu, J. (2012). Unfit for the Future: The Need for Moral Enhancement. Oxford University Press.

Pinker, S. (2007). The stuff of thought: Language as a window into human nature. Viking Adult.

Rendell, L., & Whitehead, H. (2001). Culture in whales and dolphins. Behavioral and Brain Sciences, 24, 309-382.

Richerson, P. J., & Boyd, R. (2005). Not by genes alone. University of Chicago Press.

Tyler, T. (2011). Memetics: Memes and the Science of Cultural Evolution. Tim Tyler.

Tomasello, M., Carpenter, M., Call, J., Behne, T., & Moll, H. (2005). Understanding and sharing intentions: The origins of cultural cognition. Behavioral and Brain Sciences, 28(5), 675-690.

Yudkowsky, E. (2008A). 37 ways words can be wrong. Available at http://lesswrong.com/lw/od/37_ways_that_words_can_be_wrong/
