## Announcing the AI Alignment Prize

Stronger than human artificial intelligence would be dangerous to humanity. It is vital any such intelligence’s goals are aligned with humanity's goals. Maximizing the chance that this happens is a difficult, important and under-studied problem.

To encourage more and better work on this important problem, we (Zvi Mowshowitz and Vladimir Slepnev) are announcing a $5000 prize for publicly posted work advancing understanding of AI alignment, funded by Paul Christiano.

This prize will be awarded based on entries gathered over the next two months. If the prize is successful, we will award further prizes in the future.

This prize is not backed by or affiliated with any organization.

### Rules

Your entry must be published online for the first time between November 3 and December 31, 2017, and contain novel ideas about AI alignment. Entries have no minimum or maximum size. Important ideas can be short!

Your entry must be written by you, and submitted before 9pm Pacific Time on December 31, 2017. Submit your entries either as URLs in the comments below, or by email to apply@ai-alignment.com. We may provide feedback on early entries to allow improvement.

We will award $5000 to between one and five winners. The first place winner will get at least $2500. The second place winner will get at least $1000. Other winners will get at least $500.

Entries will be judged subjectively. Final judgment will be by Paul Christiano. Prizes will be awarded on or before January 15, 2018.

### What kind of work are we looking for?

AI Alignment focuses on ways to ensure that future smarter than human intelligence will have goals aligned with the goals of humanity. Many approaches to AI Alignment deserve attention. This includes technical and philosophical topics, as well as strategic research about related social, economic or political issues. A non-exhaustive list of technical and other topics can be found here.

We are not interested in research dealing with the dangers of existing machine learning systems commonly called AI that do not have smarter than human intelligence. These concerns are also understudied, but are not the subject of this prize except in the context of future smarter than human intelligence. We are also not interested in general AI research. We care about AI Alignment, which may or may not also advance the cause of general AI research.

## Against lone wolf self-improvement

LW has a problem. Openly or covertly, many posts here promote the idea that a rational person ought to be able to self-improve on their own. Some of it comes from Eliezer's refusal to attend college (and Luke dropping out of his bachelors, etc). Some of it comes from our concept of rationality, that all agents can be approximated as perfect utility maximizers with a bunch of nonessential bugs. Some of it is due to our psychological makeup and introversion. Some of it comes from trying to tackle hard problems that aren't well understood anywhere else. And some of it is just the plain old meme of heroism and forging your own way.

*lone wolf self-improvement*, which I believe has harmed LWers more than any other part of our belief system.

*Lone wolf self-improvement doesn't work*.

*Joining a class*. With a fixed schedule, a group of students, a teacher, and an exam at the end. Compared to any "anti-akrasia technique" ever proposed on LW or adjacent self-help blogs, joining a class works ridiculously well. You don't need constant willpower: just show up on time and you'll be carried along. You don't get lonely: other students are there and you can't help but interact. You don't wonder if you're doing it right: just ask the teacher.

## Steelmanning the Chinese Room Argument

*(This post grew out of an old conversation with Wei Dai.)*

Imagine a person sitting in a room, communicating with the outside world through a terminal. Further imagine that the person knows some secret fact (e.g. that the Moon landings were a hoax), but is absolutely committed to never revealing their knowledge of it in any way.

Can you, by observing the input-output behavior of the system, distinguish it from a person who *doesn't* know the secret, or knows some other secret instead?

Clearly the only reasonable answer is "no, not in general".

Now imagine a person in the same situation, claiming to possess some mental skill that's hard for you to verify (e.g. visualizing four-dimensional objects in their mind's eye). Can you, by observing the input-output behavior, distinguish it from someone who is lying about having the skill, but has a good grasp of four-dimensional math otherwise?

Again, clearly, the only reasonable answer is "not in general".

Now imagine a sealed box that behaves exactly like a human, dutifully saying things like "I'm conscious", "I experience red" and so on. Moreover, you know from trustworthy sources that the box was built by scanning a human brain, and then optimizing the resulting program to use less CPU and memory (preserving the same input-output behavior). Would you be willing to trust that the box is in fact conscious, and has the same internal experiences as the human brain it was created from?

A philosopher believing in computationalism would emphatically say yes. But considering the examples above, I would say I'm not sure! Not at all!

## What useless things did you understand recently?

Please reply in the comments with things you understood recently. The only condition is that they have to be useless in your daily life. For example, "I found this idea that defeats procrastination" doesn't count, because it sounds useful and you might be deluded about its truth. Whereas "I figured out how construction cranes are constructed" qualifies, because you aren't likely to use it and it will stay true tomorrow.

I'll start. Today I understood how Heyting algebras work as a model for intuitionistic logic. The main idea is that you represent sentences as shapes. So you might have two sentences A and B shown as two circles, then "A and B" is their intersection, "A or B" is their union, etc. But "A implies B" doesn't mean one circle lies inside the other, as you might think! Instead it's a shape too, consisting of all points that lie outside A or inside B (or both). There were some other details about closed and open sets, but these didn't cause a problem for me, while "A implies B" made me stumble for some reason. I probably won't use Heyting algebras for anything ever, but it was pretty fun to figure out.

Your turn!

PS: please don't feel pressured to post something super advanced. It's really, honestly okay to post basic things, like why a stream of tap water narrows as it falls, or why the sky is blue (though I don't claim to understand that one :-))

## Self-modification as a game theory problem

In this post I'll try to show a surprising link between two research topics on LW: game-theoretic cooperation between AIs (quining, Loebian cooperation, modal combat, etc) and stable self-modification of AIs (tiling agents, Loebian obstacle, etc).

When you're trying to cooperate with another AI, you need to ensure that its action will fulfill your utility function. And when doing self-modification, you also need to ensure that the successor AI will fulfill your utility function. In both cases, naive utility maximization doesn't work, because you can't fully understand another agent that's as powerful and complex as you. That's a familiar difficulty in game theory, and in self-modification it's known as the Loebian obstacle (fully understandable successors become weaker and weaker).

In general, any AI will be faced with two kinds of situations. In "single player" situations, you're faced with a choice like eating chocolate or not, where you can figure out the outcome of each action. (Most situations covered by UDT are also "single player", involving identical copies of yourself.) Whereas in "multiplayer" situations your action gets combined with the actions of other agents to determine the outcome. Both cooperation and self-modification are "multiplayer" situations, and are hard for the same reason. When someone proposes a self-modification to you, you might as well evaluate it with the same code that you use for game theory contests.

If I'm right, then any good theory for cooperation between AIs will also double as a theory of stable self-modification for a single AI. That means neither problem can be much easier than the other, and in particular self-modification won't be a special case of utility maximization, as some people seem to hope. But on the plus side, we need to solve one problem instead of two, so creating FAI becomes a little bit easier.

The idea came to me while working on this mathy post on IAFF, which translates some game theory ideas into the self-modification world. For example, Loebian cooperation (from the game theory world) might lead to a solution for the Loebian obstacle (from the self-modification world) - two LW ideas with the same name that people didn't think to combine before!

## Thought experiment: coarse-grained VR utopia

I think I've come up with a fun thought experiment about friendly AI. It's pretty obvious in retrospect, but I haven't seen it posted before.

When thinking about what friendly AI should do, one big source of difficulty is that the inputs are supposed to be human intuitions, based on our coarse-grained and confused world models. While the AI's actions are supposed to be fine-grained actions based on the true nature of the universe, which can turn out very weird. That leads to a messy problem of translating preferences from one domain to another, which crops up everywhere in FAI thinking, Wei's comment and Eliezer's writeup are good places to start.

What I just realized is that you can handwave the problem away, by imagining a universe whose true nature agrees with human intuitions by fiat. Think of it as a coarse-grained virtual reality where everything is built from polygons and textures instead of atoms, and all interactions between objects are explicitly coded. It would contain player avatars, controlled by ordinary human brains sitting outside the simulation (so the simulation doesn't even need to support thought).

The FAI-relevant question is: How hard is it to describe a coarse-grained VR utopia that you would agree to live in?

If describing such a utopia is feasible at all, it involves thinking about only human-scale experiences, not physics or tech. So in theory we could hand it off to human philosophers or some other human-based procedure, thus dealing with "complexity of value" without much risk. Then we could launch a powerful AI aimed at rebuilding reality to match it (more concretely, making the world's conscious experiences match a specific coarse-grained VR utopia, without any extra hidden suffering). That's still a very hard task, because it requires solving decision theory and the problem of consciousness, but it seems more manageable than solving friendliness completely. The resulting world would be suboptimal in many ways, e.g. it wouldn't have much room for science or self-modification, but it might be enough to avert AI disaster (!)

I'm not proposing this as a plan for FAI, because we can probably come up with something better. But what do you think of it as a thought experiment? Is it a useful way to split up the problem, separating the complexity of human values from the complexity of non-human nature?

## Bet or update: fixing the will-to-wager assumption

(Warning: completely obvious reasoning that I'm only posting because I haven't seen it spelled out anywhere.)

Some people say, expanding on an idea of de Finetti, that Bayesian rational agents should offer two-sided bets based on their beliefs. For example, if you think a coin is fair, you should be willing to offer anyone a 50/50 bet on heads (or tails) for a penny. Jack called it the "will-to-wager assumption" here and I don't know a better name.

In its simplest form the assumption is false, even for perfectly rational agents in a perfectly simple world. For example, I can give you my favorite fair coin so you can flip it and take a peek at the result. Then, even though I still believe the coin is fair, I'd be a fool to offer both sides of the wager to you, because you'd just take whichever side benefits you (since you've seen the result and I haven't). That objection is not just academic: using your sincere beliefs to bet money against better informed people is a bad idea in real world markets as well.

Then the question arises, how can we fix the assumption so it still says something sensible about rationality? I think the right fix should go something like this. If you flip a coin and peek at the result, then offer me a bet at 90:10 odds that the coin came up heads, I must either accept the bet or update toward believing that the coin indeed came up heads, with at least these odds. I don't get to keep my 50:50 beliefs about the coin and refuse the bet at the same time. More generally, a Bayesian rational agent offered a bet (by another agent who might have more information) must *either accept the bet or update their beliefs so the bet becomes unprofitable*. The old obligation about offering two-sided bets on all your beliefs is obsolete, use this one from now on. It should also come in handy in living room Bayesian scuffles, throwing some money on the table and saying "bet or update!" has a nice ring to it.

What do you think?

## Overpaying for happiness?

Happy New Year, everyone!

In the past few months I've been thinking several thoughts that all seem to point in the same direction:

1) People who live in developed Western countries usually make and spend much more money than people in poorer countries, but aren't that much happier. It feels like we're overpaying for happiness, spending too much money to get a single bit of enjoyment.

2) When you get enjoyment from something, the association between "that thing" and "pleasure" in your mind gets stronger, but at the same time it becomes less sensitive and requires more stimulus. For example if you like sweet food, you can get into a cycle of eating more and more food that's sweeter and sweeter. But the guy next door, who's eating much less and periodically fasting to keep the association fresh, is actually getting more pleasure from food than you are! The same thing happens when you learn to deeply appreciate certain kinds of art, and then notice that the folks who enjoy "low" art are visibly having more fun.

3) People sometimes get unrealistic dreams and endlessly chase them, like trying to "make it big" in writing or sports, because they randomly got rewarded for it at an early age. I wrote a post about that.

I'm not offering any easy answers here. But it seems like too many people get locked in loops where they spend more and more effort to get less and less happiness. The most obvious examples are drug addiction and video gaming, but also "one-itis" in dating, overeating, being a connoisseur of anything, striving for popular success, all these things follow the same pattern. You're just chasing after some Skinner-box thing that you think you "love", but it doesn't love you back.

Sooo... if you like eating, give yourself a break every once in a while? If you like comfort, maybe get a cold shower sometimes? Might be a good idea to make yourself the kind of person that can get happiness cheaply.

Sorry if this post is not up to LW standards, I typed it really quickly as it came to my mind.

## A proof of Löb's theorem in Haskell

I'm not sure if this post is very on-topic for LW, but we have many folks who understand Haskell and many folks who are interested in Löb's theorem (see e.g. Eliezer's picture proof), so I thought why not post it here? If no one likes it, I can always just move it to my own blog.

A few days ago I stumbled across a post by Dan Piponi, claiming to show a Haskell implementation of something similar to Löb's theorem. Unfortunately his code had a couple flaws. It was circular and relied on Haskell's laziness, and it used an assumption that doesn't actually hold in logic (see the second comment by Ashley Yakeley there). So I started to wonder, what would it take to code up an actual proof? Wikipedia spells out the steps very nicely, so it seemed to be just a matter of programming.

Well, it turned out to be harder than I thought.

One problem is that Haskell has no type-level lambdas, which are the most obvious way (by Curry-Howard) to represent formulas with propositional variables. These are very useful for proving stuff in general, and Löb's theorem uses them to build fixpoints by the diagonal lemma.

The other problem is that Haskell is Turing complete, which means it can't really be used for proof checking, because a non-terminating program can be viewed as the proof of any sentence. Several people have told me that Agda or Idris might be better choices in this regard. Ultimately I decided to use Haskell after all, because that way the post will be understandable to a wider audience. It's easy enough to convince yourself by looking at the code that it is in fact total, and transliterate it into a total language if needed. (That way you can also use the nice type-level lambdas and fixpoints, instead of just postulating one particular fixpoint as I did in Haskell.)

But the biggest problem for me was that the Web didn't seem to have any good explanations for the thing I wanted to do! At first it seems like modal proofs and Haskell-like languages should be a match made in heaven, but in reality it's full of subtle issues that no one has written down, as far as I know. So I'd like this post to serve as a reference, an example approach that avoids all difficulties and just works.

LW user lmm has helped me a lot with understanding the issues involved, and wrote a candidate implementation in Scala. The good folks on /r/haskell were also very helpful, especially Samuel Gélineau who suggested a nice partial implementation in Agda, which I then converted into the Haskell version below.

To play with it online, you can copy the whole bunch of code, then go to CompileOnline and paste it in the edit box on the left, replacing what's already there. Then click "Compile & Execute" in the top left. If it compiles without errors, that means everything is right with the world, so you can change something and try again. (I hate people who write about programming and don't make it easy to try out their code!) Here we go:

main = return ()

-- Assumptions

data Theorem a

logic1 = undefined :: Theorem (a -> b) -> Theorem a -> Theorem b logic2 = undefined :: Theorem (a -> b) -> Theorem (b -> c) -> Theorem (a -> c) logic3 = undefined :: Theorem (a -> b -> c) -> Theorem (a -> b) -> Theorem (a -> c)

data Provable a

rule1 = undefined :: Theorem a -> Theorem (Provable a) rule2 = undefined :: Theorem (Provable a -> Provable (Provable a)) rule3 = undefined :: Theorem (Provable (a -> b) -> Provable a -> Provable b)

data P

premise = undefined :: Theorem (Provable P -> P)

data Psi

psi1 = undefined :: Theorem (Psi -> (Provable Psi -> P)) psi2 = undefined :: Theorem ((Provable Psi -> P) -> Psi)

-- Proof

step3 :: Theorem (Psi -> Provable Psi -> P) step3 = psi1

step4 :: Theorem (Provable (Psi -> Provable Psi -> P)) step4 = rule1 step3

step5 :: Theorem (Provable Psi -> Provable (Provable Psi -> P)) step5 = logic1 rule3 step4

step6 :: Theorem (Provable (Provable Psi -> P) -> Provable (Provable Psi) -> Provable P) step6 = rule3

step7 :: Theorem (Provable Psi -> Provable (Provable Psi) -> Provable P) step7 = logic2 step5 step6

step8 :: Theorem (Provable Psi -> Provable (Provable Psi)) step8 = rule2

step9 :: Theorem (Provable Psi -> Provable P) step9 = logic3 step7 step8

step10 :: Theorem (Provable Psi -> P) step10 = logic2 step9 premise

step11 :: Theorem ((Provable Psi -> P) -> Psi) step11 = psi2

step12 :: Theorem Psi step12 = logic1 step11 step10

step13 :: Theorem (Provable Psi) step13 = rule1 step12

step14 :: Theorem P step14 = logic1 step10 step13

-- All the steps squished together

lemma :: Theorem (Provable Psi -> P) lemma = logic2 (logic3 (logic2 (logic1 rule3 (rule1 psi1)) rule3) rule2) premise

theorem :: Theorem P theorem = logic1 lemma (rule1 (logic1 psi2 lemma))

To make sense of the code, you should interpret the type constructor Theorem as the symbol ⊢ from the Wikipedia proof, and Provable as the symbol ☐. All the assumptions have value "undefined" because we don't care about their computational content, only their types. The assumptions logic1..3 give just enough propositional logic for the proof to work, while rule1..3 are direct translations of the three rules from Wikipedia. The assumptions psi1 and psi2 describe the specific fixpoint used in the proof, because adding general fixpoint machinery would make the code much more complicated. The types P and Psi, of course, correspond to sentences P and Ψ, and "premise" is the premise of the whole theorem, that is, ⊢(☐P→P). The conclusion ⊢P can be seen in the type of step14.

As for the "squished" version, I guess I wrote it just to satisfy my refactoring urge. I don't recommend anyone to try reading that, except maybe to marvel at the complexity :-)

EDIT: in addition to the previous Reddit thread, there's now a new Reddit thread about this post.

View more: Next