## Toy model of the AI control problem: animated version

10 October 2017, 11:12AM

Crossposted at LessWrong 2.0.

A few years back, I came up with a toy model of the AI control problem. It features a robot moving boxes into a hole, with a goal slightly different from that of its human designers, and a security camera that checks it is behaving as it should. The robot learns to block the camera to get its highest reward.

I've been told that the model is useful for explaining the control problem to quite a few people, and I've always wanted to program the "robot" and produce an animated version of it. Gwern had a live demo, but it didn't illustrate everything I wanted it to.

So I programmed the toy problem in Python, and generated a video with commentary.

In this simplified version, the state space is small enough that you can explicitly generate the whole table of Q-values (the expected reward for taking an action in a given state, assuming an otherwise optimal policy). Since behaviour is deterministic, the table can be updated by dynamic programming, using full-width backups. The number of such backups essentially measures the depth of the robot's predictive ability.
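The full-width backup can be sketched as follows. This is a minimal illustration on a made-up deterministic chain MDP, not the post's actual gridworld: every (state, action) pair is synchronously backed up from its successor's best Q-value, and the number of sweeps plays the role of the planning depth.

```python
# Minimal sketch (not the post's actual gridworld): a tiny deterministic
# MDP where Q-values are computed by repeated full-width backups.

GAMMA = 1.0  # no discounting; depth is limited by the number of sweeps

# States 0..3; action 1 moves right, action 0 stays; reward 1 on reaching state 3.
def step(state, action):
    """Deterministic transition function."""
    next_state = min(state + action, 3)
    reward = 1.0 if (next_state == 3 and state != 3) else 0.0
    return next_state, reward

def full_width_backup(Q):
    """One synchronous sweep over all (state, action) pairs."""
    new_Q = {}
    for s in range(4):
        for a in (0, 1):
            s2, r = step(s, a)
            new_Q[(s, a)] = r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)])
    return new_Q

Q = {(s, a): 0.0 for s in range(4) for a in (0, 1)}
for depth in range(5):
    Q = full_width_backup(Q)

# After enough sweeps, state 0 "knows" that moving right eventually pays off.
print(Q[(0, 1)])  # 1.0
```

Each sweep propagates value one step further back from the rewarding state, which is why in the real toy model the interesting behaviours only appear at specific depths.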

The most interesting depths of planning are:

• At depth 0, the robot behaves randomly.
• At depth 3, the robot crashes itself efficiently, to avoid losing too much reward.
• At depth 6, the robot figures out how to get one box into the hole. The automated camera turns it off.
• At depth 17, the robot finally starts to deceive/conceal/manipulate, by blocking the camera and pushing two boxes into the hole. Note that the robot's deception comes from the fact that its incentives are misaligned, and that humans tried to control it.
• At depth 18, the robot efficiently does the plan from depth 17.
• At depth 20, the robot does the maximally efficient plan: blocking the camera, and pushing all boxes into the hole.
• At depth 32, the robot has the correct Q-values for the maximally efficient plan.
• At depth 45, finally, the Q-value table is fully updated, and the robot will take maximally efficient, and, if need be, deceptive plans from any robot/box starting positions.

The code and images can be found here.

## The Outside View isn't magic

27 September 2017, 02:37PM

Crossposted at Less Wrong 2.0.

The planning fallacy is an almost perfect example of the strength of using the outside view. When asked to predict the time needed for a project they are involved in, people tend to underestimate it (in fact, they tend to predict as if the question were how long things would take if everything went perfectly).

Simply telling people about the planning fallacy doesn't seem to make it go away. So the outside view argument is that you need to put your project into the "reference class" of other projects, and expect time overruns as compared with your usual "inside view" estimates (which focus on the details you know about the project).

So, for the outside view, what is the best way of estimating the time of a project? Well, to find the right reference class for it: the right category of projects to compare it with. You can compare the project with others that have similar features - number of people, budget, objective desired, incentive structure, inside view estimate of time taken etc... - and then derive a time estimate for the project that way.

That's the outside view. But to me, it looks a lot like... induction. In fact, it looks a lot like the elements of a linear (or non-linear) regression. We can put those features (at least the quantifiable ones) into a linear regression with a lot of data about projects, shake it all about, and come up with regression coefficients.
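The regression framing can be sketched on synthetic data. Everything here is made up for illustration: hypothetical project features, and an assumed ground truth in which projects overrun their inside-view estimates by a constant factor, which the regression then recovers.

```python
import numpy as np

# Sketch of "outside view as regression" on synthetic data: each project
# has features (people, budget, inside-view estimate), and actual
# duration systematically overruns the inside-view estimate.
rng = np.random.default_rng(0)
n = 200
people = rng.uniform(1, 20, n)
budget = rng.uniform(10, 500, n)
inside_estimate = rng.uniform(1, 24, n)  # months

# Hypothetical ground truth: projects take ~1.4x the inside estimate, plus noise.
actual = 1.4 * inside_estimate + 0.05 * people + rng.normal(0, 1, n)

# Ordinary least squares over the quantifiable features.
X = np.column_stack([np.ones(n), people, budget, inside_estimate])
coefs, *_ = np.linalg.lstsq(X, actual, rcond=None)

# The fitted coefficient on the inside-view estimate recovers the overrun factor.
print(round(coefs[3], 2))
```

The point of the sketch: once the features are in a regression, the "outside view" is just the fitted model's prediction, and the gap between the inside estimate's coefficient and 1 is the measured bias.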

At that point, we are left with a decent project timeline prediction model, and another example of human bias. The fact that humans often perform badly in prediction tasks is not exactly new - see for instance my short review of the academic research on expertise.

So what exactly is the outside view doing in all this?

## The role of the outside view: model incomplete and bias human

The main use of the outside view, for humans, seems to be to point out either an incompleteness in the model or a human bias. The planning fallacy has both of these: if you did a linear regression comparing your project with all projects with similar features, you'd notice your inside estimate was more optimistic than the regression - your inside model is incomplete. And if you also compared each person's initial estimate with the ultimate duration of their project, you'd notice a systematically optimistic bias - you'd notice the planning fallacy.

The first type of error tends to go away with time, if the situation is encountered regularly, as people refine their models, add variables, and test them on the data. But the second type remains, as human biases are rarely cleared by mere data.

## Reference class tennis

If use of the outside view is disputed, it often develops into a case of reference class tennis - where people on opposing sides insist or deny that a certain example belongs in the reference class (similarly to how, in politics, anything positive is claimed for your side and anything negative assigned to the other side).

But once the phenomenon you're addressing has an explanatory model, there are no issues of reference class tennis any more. Consider for instance Goodhart's law: "When a measure becomes a target, it ceases to be a good measure". A law that should be remembered by any minister of education wanting to reward schools according to improvements in their test scores.

This is a typical use of the outside view: if you'd just thought about the system in terms of inside facts - tests are correlated with child performance; schools can improve child performance; we can mandate that test results go up - then you'd have missed several crucial facts.

But notice that nothing mysterious is going on. We understand exactly what's happening here: schools have ways of upping test scores without upping child performance, and so they decided to do that, weakening the correlation between score and performance. Similar things happen in the failures of command economies; but again, once our model is broad enough to encompass enough factors, we get decent explanations, and there's no need for further outside views.

In fact, we know enough that we can show when Goodhart's law fails: when no-one with incentives to game the measure has control of the measure. This is one of the reasons central bank interest rate setting has been so successful. If you order a thousand factories to produce shoes, and reward the managers of each factory for the number of shoes produced, you're heading to disaster. But consider GDP. Say the central bank wants to increase GDP by a certain amount, by fiddling with interest rates. Now, as a shoe factory manager, I might have preferences about the direction of interest rates, and my sales are a contributor to GDP. But they are a tiny contributor. It is not in my interest to manipulate my sales figures, in the vague hope that, aggregated across the economy, this will falsify GDP and change the central bank's policy. The reward is too diluted, and would require coordination with many other agents (and coordination is hard).
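The dilution argument above is just arithmetic, and can be made concrete. The numbers here are invented for illustration: one factory among a thousand inflating its sales figure barely moves the aggregate, so the manager's payoff from manipulation is negligible.

```python
# Toy illustration of the dilution argument: one factory inflating its
# sales figures barely moves aggregate GDP, so the central bank's
# interest-rate response (and hence the manager's payoff) barely changes.

n_factories = 1000
true_sales = 1.0   # each factory's honest contribution (arbitrary units)
inflation = 0.5    # one manager inflates their own figure by 50%

honest_gdp = n_factories * true_sales
manipulated_gdp = honest_gdp + inflation * true_sales

relative_change = (manipulated_gdp - honest_gdp) / honest_gdp
print(relative_change)  # 0.0005 -- a 0.05% change in the aggregate
```

Compare this with the shoe-factory case, where the manager's own measure *is* the target: there the relative effect of manipulation is 100%, not 0.05%.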

Thus if you're engaging in reference class tennis, remember the objective is to find a model with enough variables, and enough data, so that there is no more room for the outside view - a fully understood Goodhart's law rather than just a law.

## In the absence of a successful model

Sometimes you can have a strong trend without a compelling model. Take Moore's law, for instance. It is extremely strong, going back decades, and surviving multiple changes in chip technology. But it has no clear cause.

A few explanations have been proposed. Maybe it's a consequence of its own success, with chip companies using it to set their goals. Maybe there's some natural exponential rate of improvement in any low-friction feature of a market economy. Exponential-type growth in the short term is no surprise - it just means growth proportional to investment - so maybe it was an amalgamation of various short-term trends.

Do those explanations sound unlikely? Possibly, but there is a huge trend in computer chips going back decades that needs to be explained. These explanations may be unlikely, but they have to be weighed against the unlikeliness of the situation itself. The most plausible explanation is a combination of the above and maybe some factors we haven't thought of yet.

But here's an explanation that is implausible: little time-travelling angels modify the chips so that they follow Moore's law. It's a silly example, but it shows that not all explanations are created equal, even for phenomena that are not fully understood. In fact there are four broad categories of explanations for putative phenomena that don't have a compelling model:

1. Unlikely but somewhat plausible explanations.
2. We don't have an explanation yet, but we think it's likely that there is an explanation.
3. The phenomenon is a coincidence.
4. Any explanation would go against stuff that we do know, and would be less likely than coincidence.

The explanations I've presented for Moore's law fall into category 1. Even if we hadn't thought of those explanations, Moore's law would fall into category 2, because of the depth of evidence for it and because a "medium-length regular technology trend within a broad but specific category" is something that is intrinsically likely to have an explanation.

Compare with Kurzweil's "law of time and chaos" (a generalisation of his "law of accelerating returns") and Robin Hanson's model where the development of human brains, hunting, agriculture and the industrial revolution are all points on a trend leading to uploads. I discussed these in a previous post, but I can now better articulate the problem with them.

Firstly, they rely on very few data points (the more recent part of Kurzweil's law, the part about recent technological trends, has a lot of data, but the earlier part does not). This raises the probability that they are a mere coincidence (we should also consider selection bias in choosing the data points, which increases the probability of coincidence). Secondly, we have strong reasons to suspect that there won't be any explanation that ties together things like the early evolution of life on Earth, human brain evolution, the agricultural revolution, the industrial revolution, and future technology development. These phenomena have decent local explanations that we already roughly understand (local in time and space to the phenomena described), and these run counter to any explanation that would tie them together.

## Human biases and predictions

There is one area where the outside view can still function for multiple phenomena across different eras: pointing out human biases. For example, we know that doctors have been authoritative, educated, informed, and useless for most of human history (or possibly much worse than useless). Hence authoritative, educated, and informed statements or people are not to be considered of any value, unless there is some evidence that the statement or person is truth tracking. We now have things like expertise research, some primitive betting markets, and track records to try to estimate who is truth tracking; these can provide good "outside views".

And the authors of the models of the previous section have some valid points where bias is concerned. Kurzweil's point that (paraphrasing) "things can happen a lot faster than some people think" is valid: we can compare predictions with outcomes. Robin has similar valid points in defense of the possibility of the em scenario.

The reason these arguments are more likely to be valid is that they rest on a very probable underlying model/explanation: humans are biased.

## Conclusions

• The outside view is a good reminder for anyone who may be using too narrow a model.
• If the model explains the data well, then there is no need for further outside views.
• If there is a phenomenon with data but no convincing model, we need to decide whether it's a coincidence or there is an underlying explanation.
• Some phenomena have features that make it likely that there is an explanation, even if we haven't found it yet.
• Some phenomena have features that make it unlikely that there is an explanation, no matter how much we look.
• Outside view arguments that point at human prediction biases, however, can be generally valid, as they only require the explanation that humans are biased in that particular way.

## Naturalized induction – a challenge for evidential and causal decision theory

22 September 2017, 08:15AM

As some of you may know, I disagree with many of the criticisms leveled against evidential decision theory (EDT). Most notably, I believe that Smoking lesion-type problems don't refute EDT. I also don't think that EDT's non-updatelessness leaves a lot of room for disagreement, given that EDT recommends immediate self-modification to updatelessness. However, I do believe there are some issues with run-of-the-mill EDT. One of them is naturalized induction. It is in fact not only a problem for EDT but also for causal decision theory (CDT) and most other decision theories that have been proposed in- and outside of academia. It does not affect logical decision theories, however.

# The role of naturalized induction in decision theory

Recall that EDT prescribes taking the action that maximizes expected utility, i.e.

$\underset{a\in A}{\mathrm{argmax}} ~\mathbb{E}[U(w)|a,o] = \underset{a\in A}{\mathrm{argmax}} \sum_{w\in W} P(w|a,o) U(w),$

where $A$ is the set of available actions, $U$ is the agent's utility function, $W$ is a set of possible world models, and $o$ represents the agent's past observations (which may include information the agent has collected about itself). CDT works in a – for the purpose of this article – similar way, except that instead of conditioning on $a$ in the usual way, it calculates some causal counterfactual, such as Pearl's do-calculus: $P(w|do(a),o)$. The problem of naturalized induction is that of assigning posterior probabilities to world models $P(w|a,o)$ (or $P(w|do(a),o)$ or whatever) when the agent is naturalized, i.e., embedded into its environment.

Consider the following example. Let's say there are 5 world models $W=\{w_1,...,w_5\}$, each of which has equal prior probability. These world models may be cellular automata. Now, the agent makes the observation $o$. It turns out that worlds $w_1$ and $w_2$ don't contain any agents at all, and $w_3$ contains no agent making the observation $o$. The other two world models, on the other hand, are consistent with $o$. Thus, $P(w_i\mid o)=0$ for $i=1,2,3$ and $P(w_i\mid o)=\frac{1}{2}$ for $i=4,5$. Let's assume that the agent has only two actions $A=\{a_1,a_2\}$, that in world model $w_4$ the only agent making observation $o$ takes action $a_1$, and that in $w_5$ the only agent making observation $o$ takes action $a_2$. Then $P(w_4\mid a_1)=1=P(w_5\mid a_2)$ and $P(w_5\mid a_1)=0=P(w_4\mid a_2)$. Thus, if, for example, $U(w_5)>U(w_4)$, an EDT agent would take action $a_2$ to ensure that world model $w_5$ is actual.
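The example above can be executed directly. This is a transcription of the five-world setup, with the posterior $P(w\mid a, o)$ and utilities hard-coded as stated, and the EDT argmax from the earlier formula applied on top.

```python
# The five-world example, executed: w1..w3 are ruled out by the
# observation o, and each remaining world is pinned down by which
# action the agent takes.

def posterior(world, action):
    """P(w | a, o) for the example: w1..w3 are inconsistent with o;
    w4 is actual iff the agent takes a1, w5 iff it takes a2."""
    consistent = {"w4": "a1", "w5": "a2"}
    if world not in consistent:
        return 0.0
    return 1.0 if consistent[world] == action else 0.0

utility = {"w1": 0, "w2": 0, "w3": 0, "w4": 1, "w5": 2}  # U(w5) > U(w4)

def edt_action(actions, worlds):
    """argmax_a sum_w P(w | a, o) * U(w), as in the formula above."""
    return max(actions,
               key=lambda a: sum(posterior(w, a) * utility[w] for w in worlds))

worlds = ["w1", "w2", "w3", "w4", "w5"]
print(edt_action(["a1", "a2"], worlds))  # a2, making w5 actual
```

Note that all the work in this example is done by `posterior`, i.e., by naturalized induction: deciding which worlds contain an agent making observation $o$ is exactly the part that the rest of the post argues is hard.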

# The main problem of naturalized induction

This example makes it sound as though it's clear what posterior probabilities we should assign. But in general, it's not that easy. For one, there is the issue of anthropics: if one world model $w_1$ contains more agents observing $o$ than another world model $w_2$, does that mean $P(w_1\mid o) > P(w_2\mid o)$? Whether CDT and EDT can reason correctly about anthropics is an interesting question in itself (cf. Bostrom 2002; Armstrong 2011; Conitzer 2015), but in this post I'll discuss a different problem in naturalized induction: identifying instantiations of the agent in a world model.

It seems that the core of the reasoning in the above example was that some worlds contain an agent observing $o$ and others don't. So, besides anthropics, the central problem of naturalized induction appears to be identifying agents making particular observations in a physicalist world model. While this can often be done uncontroversially – a world containing only rocks contains no agents – it seems difficult to specify how it works in general. The core of the problem is a type mismatch between the "mental stuff" (e.g., numbers or strings) of $o$ and the "physics stuff" (atoms, etc.) of the world model. Rob Bensinger calls this the problem of "building phenomenological bridges" (BPB) (also see his Bridge Collapse: Reductionism as Engineering Problem).

# Sensitivity to phenomenological bridges

Sometimes, the decisions made by CDT and EDT are very sensitive to whether a phenomenological bridge is built or not. Consider the following problem:

One Button Per Agent. There are two similar agents with the same utility function. Each lives in her own room. Both rooms contain a button. If agent 1 pushes her button, it creates 1 utilon. If agent 2 pushes her button, it creates -50 utilons. You know that agent 1 is an instantiation of you. Should you press your button?

Note that this is essentially Newcomb's problem with potential anthropic uncertainty (see the second paragraph here) – pressing the button is like two-boxing, which causally gives you $1k if you are the real agent but costs you $1M if you are the simulation.

If agent 2 is sufficiently similar to you to count as an instantiation of you, then you shouldn't press the button. If, on the other hand, you believe that agent 2 does not qualify as something that might be you, then it comes down to what decision theory you use: CDT would press the button, whereas EDT wouldn't (assuming that the two agents are strongly correlated).
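The payoff logic of the last paragraph can be sketched numerically. Assumptions of this sketch: total utilons are what matter (the agents share a utility function), pressing yields +1 via agent 1's button and -50 via agent 2's, and the CDT/EDT disagreement is entirely about whether agent 2's press is caused by your action or merely evidentially linked to it.

```python
# Sketch of the One Button Per Agent payoffs. Pressing nets +1 from
# agent 1's button; if agent 2 also presses, a further -50.

def cdt_presses(agent2_counts_as_you):
    # CDT counts only causal consequences: agent 2's button is caused by
    # your action only if agent 2 is an instantiation of you.
    value = 1 + (-50 if agent2_counts_as_you else 0)
    return value > 0

def edt_presses(agent2_counts_as_you, agent2_correlated):
    # EDT conditions on "I press": a strongly correlated agent 2 presses
    # too, whether or not she counts as an instantiation of you.
    value = 1 + (-50 if (agent2_counts_as_you or agent2_correlated) else 0)
    return value > 0

# If agent 2 is merely correlated with you (not an instantiation):
# CDT presses, EDT doesn't.
print(cdt_presses(False), edt_presses(False, True))  # True False
```

The sensitivity to BPB is visible in the first argument: flipping `agent2_counts_as_you` flips CDT's verdict entirely, even though nothing about the physical situation has changed.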

It is easy to specify a problem where EDT, too, is sensitive to the phenomenological bridges it builds:

One Button Per World. There are two possible worlds. Each contains an agent living in a room with a button. The two agents are similar and have the same utility function. The button in world 1 creates 1 utilon, the button in world 2 creates -50 utilons. You know that the agent in world 1 is an instantiation of you. Should you press the button?

If you believe that the agent in world 2 is an instantiation of you, both EDT and CDT recommend you not to press the button. However, if you believe that the agent in world 2 is not an instantiation of you, then naturalized induction concludes that world 2 isn't actual and so pressing the button is safe.

# Building phenomenological bridges is hard and perhaps confused

So, to solve the problem of naturalized induction and apply EDT/CDT-like decision theories, we need to solve BPB. The behavior of an agent is quite sensitive to how we solve it, so we had better get it right.

Unfortunately, I am skeptical that BPB can be solved. Most importantly, I suspect that statements about whether a particular physical process implements a particular algorithm can't be objectively true or false. There seems to be no way of testing any such relations.

Probably we should think more about whether BPB really is doomed. There is even some philosophical literature that seems worth looking into (again, see this Brian Tomasik post; cf. some of Hofstadter's writings and the literatures surrounding "Mary the color scientist", the computational theory of mind, computation in cellular automata, etc.). But at this point, BPB looks confusing/confused enough to justify looking into alternatives.

## Assigning probabilities pragmatically?

One might think that one could map between physical processes and algorithms on a pragmatic or functional basis. That is, one could say that a physical process A implements a program p to the extent that the results of A correlate with the output of p. I think this idea goes in the right direction, and we will later see an implementation of this pragmatic approach that does away with naturalized induction. However, it feels inappropriate as a solution to BPB. The main problem is that two processes can correlate in their output without having similar subjective experiences. For instance, it is easy to show that merge sort and insertion sort produce the same output for any given input, even though they have very different "subjective experiences". (Another problem is that the dependence between two random variables cannot be expressed as a single number, so it is unclear how to translate the entire joint probability distribution of the two into a single number determining the likelihood of the algorithm being implemented by the physical process. That said, if implementing an algorithm is conceived of as binary – either true or false – one could just require perfect correlation.)
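The merge-sort/insertion-sort point can be made concrete: the two algorithms agree on every input (perfect output correlation) while running through entirely different internal states.

```python
# Two sorting algorithms with identical input-output behaviour but very
# different internal processes.
import random

def insertion_sort(xs):
    """Builds the output by inserting each element into place."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def merge_sort(xs):
    """Recursively splits, sorts, and merges."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

random.seed(0)
for _ in range(100):
    xs = [random.randint(0, 99) for _ in range(random.randint(0, 20))]
    assert merge_sort(xs) == insertion_sort(xs) == sorted(xs)
print("outputs identical on all tested inputs")
```

On the pragmatic criterion, each algorithm "implements" the other perfectly, which is exactly the problem: output correlation is silent about the internal process.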

# Getting rid of the problem of building phenomenological bridges

If we adopt an EDT perspective, it seems clear what we have to do to avoid BPB. If we don't want to decide whether some world contains the agent, then it appears that we have to artificially ensure that the agent views itself as existing in all possible worlds. So, we may take every world model and add a causally separate or non-physical entity representing the agent. I'll call this additional agent a logical zombie (l-zombie) (a concept introduced by Benja Fallenstein for a somewhat different decision-theoretical reason). To avoid all BPB, we will assume that the agent pretends that it is the l-zombie with certainty. I'll call this the l-zombie variant of EDT (LZEDT). It is probably the most natural evidentialist logical decision theory.

Note that in the context of LZEDT, l-zombies are a fiction used for pragmatic reasons. LZEDT doesn't make the metaphysical claim that l-zombies exist or that you are secretly an l-zombie. For discussions of related metaphysical claims, see, e.g., Brian Tomasik's essay Why Does Physics Exist? and references therein.

LZEDT reasons about the real world via the correlations between the l-zombie and the real world. In many cases, LZEDT will act as we expect an EDT agent to act. For example, in One Button Per Agent, it doesn't press the button because that ensures that neither agent pushes the button.

LZEDT doesn't need any additional anthropics but behaves like anthropic decision theory/EDT+SSA, which seems alright.

Although LZEDT may assign a high probability to worlds that don't contain any actual agents, it doesn't optimize for these worlds because it cannot significantly influence them. So, in a way LZEDT adopts the pragmatic/functional approach (mentioned above) of, other things equal, giving more weight to worlds that contain a lot of closely correlated agents.

LZEDT is automatically updateless. For example, it gives the money in counterfactual mugging. However, it invariably implements a particularly strong version of updatelessness. It's not just updateless in the way that "son of EDT" (i.e., the decision theory that EDT would self-modify into) is updateless; it is also updateless w.r.t. its existence. So, for example, in the One Button Per World problem, it never pushes the button, because it thinks that the second world, in which pushing the button generates -50 utilons, could be actual. This is the case even if the second world very obviously contains no implementation of LZEDT. Similarly, it is unclear what LZEDT does in the Coin Flip Creation problem, which EDT seems to get right.

So, LZEDT optimizes for world models that naturalized induction would assign zero probability to. It should be noted that this is not done on the basis of some exotic ethical claim according to which non-actual worlds deserve moral weight.

I'm not yet sure what to make of LZEDT. It is elegant in that it effortlessly gets anthropics right, avoids BPB, and is updateless without having to self-modify. On the other hand, not updating on your existence is often counterintuitive, and even regular updatelessness is, in my opinion, best justified via precommitment. Its approach to avoiding BPB isn't immune to criticism either. In a way, it is just a very wrong approach to BPB (mapping your algorithm onto fictions rather than your real instantiations). Perhaps it would be more reasonable to use regular EDT with an approach to BPB that interprets as you anything that could potentially be you?

Of course, LZEDT also inherits some of the potential problems of EDT, in particular, the 5-and-10 problem.

## CDT is more dependent on building phenomenological bridges

It seems much harder to get rid of the BPB problem in CDT. Obviously, the l-zombie approach doesn't work for CDT: because none of the l-zombies has a physical influence on the world, "LZCDT" would always be indifferent between all possible actions. More generally, because CDT exerts no control via correlation, it needs to believe that it might be X if it wants to control X's actions. So, causal decision theory only works with BPB.

That said, a causalist approach to avoiding BPB via l-zombies could be to tamper with the definition of causality such that the l-zombie "logically causes" the choices made by instantiations in the physical world. As far as I understand it, most people at MIRI currently prefer this flavor of logical decision theory.

# Acknowledgements

Most of my views on this topic formed in discussions with Johannes Treutlein. I also benefited from discussions at AISFP.


## Intrinsic properties and Eliezer's metaethics

29 August 2017, 11:26PM

#### Abstract

I give an account for why some properties seem intrinsic while others seem extrinsic. In light of this account, the property of moral goodness seems intrinsic in one way and extrinsic in another. Most properties do not suffer from this ambiguity. I suggest that this is why many people find Eliezer's metaethics to be confusing.

#### Section 1: Intuitions of intrinsicness

What makes a particular property seem more or less intrinsic, as opposed to extrinsic?

Consider the following three properties that a physical object X might have:

1. The property of having the shape of a regular triangle. (I'll call this property "∆-ness" or "being ∆-shaped", for short.)
2. The property of being hard, in the sense of resisting deformation.
3. The property of being a key that can open a particular lock L (or L-opening-ness).

To me, intuitively, ∆-ness seems entirely intrinsic, and hardness seems somewhat less intrinsic, but still very intrinsic. However, the property of opening a particular lock seems very extrinsic. (If the notion of "intrinsic" seems meaningless to you, please keep reading. I believe that I ground these intuitions in something meaningful below.)

When I query my intuition on these examples, it elaborates as follows:

(1) If an object X is ∆-shaped, then X is ∆-shaped independently of any consideration of anything else. Object X could manifest its ∆-ness even in perfect isolation, in a universe that contained no other objects. In that sense, being ∆-shaped is intrinsic to X.

(2) If an object X is hard, then that fact does have a whiff of extrinsicness about it. After all, X's being hard is typically apparent only in an interaction between X and some other object Y, such as in a forceful collision after which the parts of X are still in nearly the same arrangement.

Nonetheless, X's hardness still feels to me to be primarily "in" X. Yes, something else has to be brought onto the scene for X's hardness to do anything. That is, X's hardness can be detected only with the help of some "test object" Y (to bounce off of X, for example). Nonetheless, the hardness detected is intrinsic to X. It is not, for example, primarily a fact about the system consisting of X and the test object Y together.

(3) Being an L-opening key (where L is a particular lock), on the other hand, feels very extrinsic to me. A thought experiment that pumps this intuition for me is this: Imagine a molten blob K of metal shifting through a range of key-shapes. The vast majority of such shapes do not open L. Now suppose that, in the course of these metamorphoses, K happens to pass through a shape that does open L. Just for that instant, K takes on the property of L-opening-ness. Nonetheless, and here is the point, an observer without detailed knowledge of L in particular wouldn't notice anything special about that instant.

Contrast this with the other two properties: An observer of three dots moving in space might notice when those three dots happen to fall into the configuration of a regular triangle. And an observer of an object passing through different conditions of hardness might notice when the object has become particularly hard. The observer can use a generic test object Y to check the hardness of X. The observer doesn't need anything in particular to notice that X has become hard.

But all that is just an elaboration of my intuitions. What is really going on here? I think that the answer sheds light on how people understand Eliezer's metaethics.

#### Section 2: Is goodness intrinsic?

I was led to this line of thinking while trying to understand why Eliezer's metaethics is consistently confusing.

The notion of an L-opening key has been my personal go-to analogy for thinking about how goodness (of a state of affairs) can be objective, as opposed to subjective. The analogy works like this: We are like locks, and states of affairs are like keys. Roughly, a state is good when it engages our moral sensibilities so that, upon reflection, we favor that state. Speaking metaphorically, a state is good just when it has the right shape to "open" us. (Here, "us" means normal human beings as we are in the actual world.) Being of the right shape to open a particular lock is an objective fact about a key. Analogously, being good is an objective fact about a state of affairs.

Objective in what sense? In this important sense, at least: The property of being L-opening picks out a particular point in key-shape space1. This space contains a point for every possible key-shape, even if no existing key has that shape. So we can say that a hypothetical key is "of an L-opening shape" even if the key is assumed to exist in a world that has no locks of type L. Analogously, a state can still be called good even if it is in a counterfactual world containing no agents who share our moral sensibilities.

But the discussion in Section 1 made "being L-opening" seem, while objective, very extrinsic, and not primarily about the key K itself. The analogy between "L-opening-ness" and goodness seems to work against Eliezer's purposes. It suggests that goodness is extrinsic, rather than intrinsic. For, one cannot properly call a key "opening" in general. One can only say that a key "opens this or that particular lock". But the analogous claim about goodness sounds like relativism: "There's no objective fact of the matter about whether a state of affairs is good. There's just an objective fact of the matter about whether it is good to you."

This, I suppose, is why some people think that Eliezer's metaethics is just warmed-over relativism, despite his protestations.

#### Section 3: Seeing intrinsicness in simulations

I think that we can account for the intuitions of intrinsicness in Section 1 by looking at them from the perspective of simulations. Moreover, this account will explain why some of us (including perhaps Eliezer) judge goodness to be intrinsic.

The main idea is this: In our minds, a property P, among other things, "points to" the test for its presence. In particular, P evokes whatever would be involved in detecting the presence of P. Whether I consider a property P to be intrinsic depends on how I would test for the presence of P — NOT, however, on how I would test for P "in the real world", but rather on how I would test for P in a simulation that I'm observing from the outside.

Here is how this plays out in the cases above.

(1) In the case of being ∆-shaped, consider a simulation (on a computer, or in your mind's eye) consisting of three points connected by straight lines to make a triangle X floating in space. The points move around, and the straight lines stretch and change direction to keep the points connected. The simulation itself just keeps track of where the points and lines are. Nonetheless, when X becomes ∆-shaped, I notice this "directly", from outside the simulation. Nothing else within the simulation needs to react to the ∆-ness. Indeed, nothing else needs to be there at all, aside from the points and lines. The ∆-shape detector is in me, outside the simulation. To make the ∆-ness of an object X manifest, the simulation needs to contain only the object X itself.
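The "simulate just X" idea can be sketched in code. This is an illustrative detector, living outside the simulation: given only the three points, it decides whether they currently form a regular triangle, with nothing else needing to exist inside the simulated world.

```python
# An outside observer's Δ-ness detector: it needs only the three points
# themselves, mirroring "simulate just X".
import math

def is_equilateral(p1, p2, p3, tol=1e-9):
    """True iff the three points form a (non-degenerate) regular triangle."""
    d = math.dist
    sides = sorted([d(p1, p2), d(p2, p3), d(p3, p1)])
    # Non-degenerate (shortest side > 0) and all sides equal within tolerance.
    return sides[0] > tol and abs(sides[2] - sides[0]) < tol

# Three points drifting into, and out of, an equilateral configuration:
print(is_equilateral((0, 0), (1, 0), (0.5, math.sqrt(3) / 2)))  # True
print(is_equilateral((0, 0), (1, 0), (0.5, 0.5)))               # False
```

Contrast this with hardness in case (2) below, where the detector cannot stay outside: some test object has to be added *inside* the simulation before there is anything to observe.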

In summary: A property will feel extremely intrinsic to X when my detecting the property requires only this: "Simulate just X."
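
This "detector outside the simulation" picture can be made concrete with a toy sketch (illustrative only; the function names and the particular deformation parameter are my own invention, not from the post). The simulation tracks only point positions; the ∆-ness test lives entirely outside it:

```python
import math

def simulate_points(t):
    """The simulation itself: three points joined by lines.
    It only tracks positions; t deforms the apex of the figure."""
    return [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2 + t)]

def is_delta_shaped(points, tol=1e-9):
    """My detector, outside the simulation: the figure is ∆-shaped
    (equilateral) when all three side lengths agree."""
    a, b, c = points
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    sides = [dist(a, b), dist(b, c), dist(c, a)]
    return max(sides) - min(sides) < tol

# Nothing inside the simulation computes or stores ∆-ness;
# I read it off the simulated object X alone.
print(is_delta_shaped(simulate_points(0.0)))  # → True
print(is_delta_shaped(simulate_points(0.5)))  # → False
```

The point of the sketch is that `is_delta_shaped` takes only X (the three points) as input: "simulate just X" suffices.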

(2) For the case of hardness, imagine a computer simulation that models matter and its motions as they follow from the laws of physics and my exogenous manipulations. The simulation keeps track of only fundamental forces, individual molecules, and their positions and momenta. But I can see on the computer display what the resulting clumps of matter look like. In particular, there is a clump X of matter in the simulation, and I can ask myself whether X is hard.

Now, on the one hand, I am not myself a hardness detector that can just look at X and see its hardness. In that sense, hardness is different from ∆-ness, which I can just look at and see. In this case, I need to build a hardness detector. Moreover, I need to build the detector inside the simulation. I need some other thing Y in the simulation to bounce off of X to see whether X is hard. Then I, outside the simulation, can say, "Yup, the way Y bounced off of X indicates that X is hard." (The simulation itself isn't generating statements like "X is hard", any more than the 3-points-and-lines simulation above was generating statements about whether the configuration was a regular triangle.)

On the other hand, crucially, I can detect hardness with practically anything at all in addition to X in the simulation. I can take practically any old chunk of molecules and bounce it off of X with sufficient force.

A property of an object X still feels intrinsic when detecting the property requires only this: "Simulate just X + practically any other arbitrary thing."

Indeed, perhaps I need only an arbitrarily small "epsilon" chunk of additional stuff inside the simulation. Given such a chunk, I can run the simulation to knock the chunk against X, perhaps from various directions. Then I can assess the results to conclude whether X is hard. The sense of intrinsicness comes, perhaps, from "taking the limit as epsilon goes to 0", seeing the hardness there the whole time, and interpreting this as saying that the hardness is "within" X itself.

In summary: A property will feel very intrinsic to X when its detection requires only this: "Simulate just X + epsilon."
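
A toy version of the "X + epsilon" test (again, a hypothetical sketch of my own, modeling X as a simple spring rather than molecules): I knock an arbitrary probe against X inside the simulation, and my verdict about X's hardness, formed outside the simulation, does not depend on which probe I used — even an "epsilon"-sized one works:

```python
import math

def deformation(stiffness_k, probe_energy):
    """Run the simulation: knock a probe of given kinetic energy into X
    and return how far X compresses (energy balance: E = k * x**2 / 2)."""
    return math.sqrt(2.0 * probe_energy / stiffness_k)

def infer_hardness(stiffness_k, probe_energy):
    """My detector, outside the simulation: recover X's stiffness from
    the observed bounce. The simulation never stores 'hardness' itself."""
    x = deformation(stiffness_k, probe_energy)
    return 2.0 * probe_energy / x ** 2

# Practically any probe gives the same verdict about X:
print(infer_hardness(1000.0, 0.5))   # ≈ 1000.0
print(infer_hardness(1000.0, 1e-6))  # ≈ 1000.0, even with an "epsilon" probe
```

The recovered stiffness is the same in the limit as the probe shrinks, which is the sense in which the hardness seems to be "within" X all along.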

(3) In this light, L-opening keys differ crucially from ∆-shaped things and from hard things.

An L-opening key differs from a ∆-shaped object because I myself do not encode lock L. Whereas I can look at a regular triangle and see its ∆-ness from outside the simulation, I cannot do the same (let's suppose) for keys of the right shape to open lock L. So I cannot simulate a key K alone and see its L-opening-ness.

Moreover, I cannot add something merely arbitrary to the simulation to check K for L-opening-ness. I need to build something very precise and complicated inside the simulation: an instance of the lock L. Then I can insert K in the lock and observe whether it opens.

I need not just K, and not just K + epsilon: I need to simulate K + one particular complicated thing.
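
In sketch form (hypothetical, with made-up pin heights; per the footnote, the lock accepts a small region of key-shape space rather than one exact shape): detecting L-opening-ness requires simulating the specific lock L itself, not just K plus some arbitrary chunk of stuff:

```python
# The "complicated particular thing" that must be in the simulation:
# lock L's exact pin configuration. Nothing arbitrary substitutes for it.
LOCK_L_PINS = [3.0, 1.0, 4.0, 1.0, 5.0]
TOLERANCE = 0.5  # locks accept keys that vary slightly in shape

def opens(key_profile, lock_pins, tol=TOLERANCE):
    """Insert key K into lock L and observe whether it opens:
    every cut depth must match the corresponding pin within tolerance."""
    return len(key_profile) == len(lock_pins) and all(
        abs(cut - pin) <= tol for cut, pin in zip(key_profile, lock_pins)
    )

print(opens([3.1, 1.0, 4.2, 0.8, 5.0], LOCK_L_PINS))  # → True
print(opens([2.0, 2.0, 2.0, 2.0, 2.0], LOCK_L_PINS))  # → False
```

Unlike the ∆-ness and hardness detectors, `opens` cannot run on K alone or on K plus an arbitrary probe: its second argument must encode L in full detail, which is why L-opening-ness feels extrinsic to K.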

#### Section 4: Back to goodness

So how does goodness as a property fit into this story?

There is an important sense in which goodness is more like being ∆-shaped than it is like being L-opening. Namely, goodness of a state of affairs is something that I can assess myself from outside a simulation of that state. I don't need to simulate anything else to see it. Putting it another way, goodness is like L-opening would be if I happened myself to encode lock L. If that were the case, then, as soon as I saw K take on the right shape inside the simulation, that shape could "click" with me outside of the simulation.

That is why goodness seems to have the same ultimate kind of intrinsicness that ∆-ness has and which being L-opening lacks. We don't encode locks, but we do encode morality.

### Footnote

1. Or, rather, a small region in key-shape space, since a lock will accept keys that vary slightly in shape.

## The Reality of Emergence

8 19 August 2017 09:58PM

## In praise of fake frameworks

15 11 July 2017 02:12AM

Followup to: Gears in Understanding

I use a lot of fake frameworks — that is, ways of seeing the world that are probably or obviously wrong in some important way.

I think this is an important skill. There are obvious pitfalls, but I think the advantages are more than worth it. In fact, I think the "pitfalls" can even sometimes be epistemically useful.

Here I want to share why. This is for two reasons:

• I think fake framework use is a wonderful skill. I want it represented more in the practice of rationality. Or, I want to know where I'm missing something, and Less Wrong is a great place for that.

• I'm building toward something. This is actually a continuation of Gears in Understanding, although I imagine it won't be at all clear here how. I need a suite of tools in order to describe something. Talking about fake frameworks is a good way to demo tool #2.

With that, let's get started.

## [Link] The Internet as an existential threat

4 09 July 2017 11:40AM

## Against lone wolf self-improvement

27 07 July 2017 03:31PM

LW has a problem. Openly or covertly, many posts here promote the idea that a rational person ought to be able to self-improve on their own. Some of it comes from Eliezer's refusal to attend college (and Luke dropping out of his bachelor's, etc). Some of it comes from our concept of rationality, that all agents can be approximated as perfect utility maximizers with a bunch of nonessential bugs. Some of it is due to our psychological makeup and introversion. Some of it comes from trying to tackle hard problems that aren't well understood anywhere else. And some of it is just the plain old meme of heroism and forging your own way.

I'm not saying all these things are 100% harmful. But the end result is a mindset of lone wolf self-improvement, which I believe has harmed LWers more than any other part of our belief system.

Any time you force yourself to do X alone in your room, or blame yourself for not doing X, or feel isolated while doing X, or surf the web to feel some human contact instead of doing X, or wonder if X might improve your life but can't bring yourself to start... your problem comes from believing that lone wolf self-improvement is fundamentally the right approach. That belief is comforting in many ways, but noticing it is enough to break the spell. The fault wasn't with the operator all along. Lone wolf self-improvement doesn't work.

Doesn't work compared to what? Joining a class. With a fixed schedule, a group of students, a teacher, and an exam at the end. Compared to any "anti-akrasia technique" ever proposed on LW or adjacent self-help blogs, joining a class works ridiculously well. You don't need constant willpower: just show up on time and you'll be carried along. You don't get lonely: other students are there and you can't help but interact. You don't wonder if you're doing it right: just ask the teacher.

Can't find a class? Find a club, a meetup, a group of people sharing your interest, any environment where social momentum will work in your favor. Even an online community for X that will reward your progress with upvotes is much better than going X completely alone. But any regular meeting you can attend in person, which doesn't depend on your enthusiasm to keep going, is exponentially more powerful.

Avoiding lone wolf self-improvement seems like embarrassingly obvious advice. But somehow I see people trying to learn X alone in their rooms all the time, swimming against the current for years, blaming themselves when their willpower isn't enough. My message to such people: give up. Your brain is right and what you're forcing it to do is wrong. Put down your X, open your laptop, find a class near you, send them a quick email, and spend the rest of the day surfing the web. It will be your most productive day in months.