Notes on "Can you control the past"

So8res

2022 MIRI Alignment Discussion

20th Oct 2022

AI Alignment Forum

The following is a (lightly edited version of a) series of notes I sent Joe Carlsmith about his essay, Can you control the past?. It's addressed to Joe, but it seems worth publishing here while I'm on the topic of decision theory. I’ve included some of his comments, and my replies, below.

I only recently skimmed Can you control the past?, and have a couple notes that you may or may not be interested in. (I'm not under the impression that this matters a ton, and am writing this recreationally.)

First: this is overall a great review of decision theories. Better than most I've seen. Nice.

Now, onto some more substantive points.

Who am I?

I think a bunch of your sense of oddness about the "magic" that "you can write on whiteboards light-years away" is stemming from a faulty framing you have. In particular, the part where the word "you" points to a single physical instantiation of your algorithm in the universe. I'd say: insofar as your algorithm is multiply instantiated throughout the universe, there is no additional fact about which one is really you.

For analogy, consider tossing a coin in a quantum-mechanical universe, and covering it with your hand. The coin is superpositioned between heads and tails, and once you look at it, you'll decohere into Joe-who-saw-heads and Joe-who-saw-tails, both of whom stem from Joe-who-hasn't-looked-yet. So, before you look, are you Joe-who-saw-heads or Joe-who-saw-tails?

Wrong question! These two entities have not yet diverged; the pasts of those two separate entities coincide. The word "you", at the time before you split, refers to ~one configuration. The time-evolution splits the amplitude on that configuration between ~two distinct future configurations, and once they've split (by making different observations), each will be able to say "me" in a way that refers to them and not the other, but before the split there is no distinction to be made, no extra physical fact, and no real question as to whether pre-split Joe "is" Joe-who-will-see-heads versus Joe-who-will-see-tails.

(It's also maybe informative to imagine what happens if the quantum coin is biased. I'd say, even when the coin is 99.99999% biased towards heads, it's still the case that there isn't a real question about whether Joe-who-has-not-looked-at-the-coin is Joe-who-will-see-heads versus Joe-who-will-see-tails. There is a question of to what degree Joe-who-has-not-looked becomes Joe-who-saw-heads versus Joe-who-saw-tails, but that's a different sort of question.)

One of my most-confident guesses about anthropics is that being multiply-instantiated in other ways is analogous. For instance, if there are two identical physical copies of you (in physical rooms that are identical enough that you're going to make the same observations for the length of the hypothetical, etc.), then my guess is that there isn't a real question about which one is you. They are both you. You are the pattern, not the meat.

This person may become multiple people in the future, insofar as they see different things in different places-that-embed-them. But before the differing observations come in, they're both you. You can tell because the situation is symmetric: once you know all the physical facts, there's no additional bit telling you which one is "you".

From this perspective, the "magic" is much less mysterious: whenever you are multiply-instantiated, your actions are also multiply-instantiated. If you're multiply-instantiated in two places separated by a 10-light-year gap, then when you act, the two meat-bodies move in the same way on each side of the gap. This is all much less surprising once you acknowledge that "you" refers to everything that instantiates you(-who-have-seen-what-you-have-seen). Which, notably, is a viewpoint more-or-less forced upon us by quantum mechanics anyway.

Also, a subtlety: literal multiple-instantiation of your entire mind (in a place with sufficiently similar physics) is what you need to get "You can draw a demon kitten eating a windmill. You can scream, and dance, and wave your arms around, however you damn well please. Feel the wind on your face, cowboy: this is liberty. And yet, he will do the same." But it's much easier to find other creatures that make the same choice in a limited decision problem, but that won't draw the same demon kitten.

In particular, the thing you need for rational cooperation in a one-shot prisoner's dilemma, is multiple instantiation of your decision algorithm, which is notably smaller than your entire mind. Imagining multiple-instantiation of your entire mind is a fine intuition-pump, but the sort of multiple-instantiation humans find in real life is just of the decision-making fragment (which is enough).

Corollary: To a first approximation, the answer to "Can you control the past?" is "Well, you can be multiply instantiated at different points in time, and control the regions afterwards of the places you’re instantiated, and it’s possible for some of those to be beforewards of other places you’re instantiated. But you can’t control anything beforewards of your earliest instantiation."

To a second approximation, the above is true not only of you (in all your detailed glory, having learned everything you've learned and seen everything you've seen), but of your decision algorithm — a much smaller fragment of you, that is instantiated much more often, and thus can readily affect regions beforewards of the earliest instantiation of you-in-all-your-glory. This is what’s going on in the version of Newcomb’s problem, for instance, where Omega doesn’t simulate you in all your glory, but does reason accurately about the result of your decision algorithm (thereby instantiating it in the relevant sense).

More generally, I think it's worth distinguishing you from your decision algorithm. You can let your full self bleed into your decision-making fragment, by feeling the wind on your face and using specifics of your recent train-of-thought to determine what you draw. Or you can prevent your full self from bleeding into your decision-making fragment, by boiling the problem before you down into a simple and abstract decision problem.

Consider Omega's little sister Omicron, who can't figure out what you'll draw, but has no problem figuring out whether you'll one-box. You-who-have-felt-the-wind-on-your-face are not instantiated in the past, but your decision algorithm on a simple problem could well be. It's the latter that controls things that are beforewards of you (but afterwards of Omicron).

I personally don't think I (Nate-in-all-his-glory) can personally control the past. I think that my decision-procedure can control the future laid out before each and every one of its instantiations.

Is the box in Newcomb's problem full because I one-box? Well, it's full because The Algorithm one-boxes, and I'm a full-ass person wrapped around The Algorithm, but I'm not the instance of The Algorithm that Omicron was looking at, so it seems a bit weird to blame it on me. Like how when you use a calculator to check whether 7 divides 1331 and use that knowledge to decide how to make a bet, and then later I use a different calculator to see whether 1331 is prime in a way that includes (as an intermediate step) checking whether 7 divides it, it's a bit weird to say that my longer calculation was the cause of your bet.

I'm a longer calculation than The Algorithm. It wasn't me who controlled the past, it was The Algorithm Omega looked at, and that I follow.
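The calculator analogy can be made concrete with a small sketch. The function names `divides` and `is_prime_by_trial_division` are illustrative choices, not anything from the essay, and trial division is just one assumed way a primality check might run. The point is only that the longer computation contains the short divisibility check as an intermediate step, while the short check is complete on its own.

```python
def divides(d, n):
    """The short calculation: does d divide n?"""
    return n % d == 0


def is_prime_by_trial_division(n, checked):
    """The longer calculation. It records every divisor it tries in `checked`,
    so we can see which short calculations it ran along the way."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        checked.append(d)
        if divides(d, n):  # the short calculation, reused as a sub-step
            return False
        d += 1
    return True


# The bettor's calculation.
seven_divides = divides(7, 1331)

# The later, longer calculation, which passes through the same check.
checked = []
prime = is_prime_by_trial_division(1331, checked)

print(seven_divides, prime, 7 in checked)
```

Both computations agree that 7 does not divide 1331 (which is 11 cubed), but the bet only ever depended on the short one.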

If you ever manage to get two copies of me (the cowboy who feels the wind on his face) at different times, then in that case I'll say that I (who am both copies) control the earlier-copy's future and the later-copy's past (necessarily in ways that the later copy has not yet observed, for otherwise we are not true copies). Till then, it is merely the past instances of my decision algorithm that control my past, not me.

(Which doesn't mean that I can choose something other than what my decision algorithm selects in any given case, thereby throwing off the yoke; that's crazytalk; if you think you can throw off the yoke of your own decision algorithm then you've failed to correctly identify the fragment of you that makes decisions.)

Joe Carlsmith:

You-who-have-felt-the-wind-on-your-face are not instantiated in the past, but your decision algorithm on a simple problem could well be. It's the latter that controls things that are beforewards of you (but afterwards of Omicron).

I currently expect this part to continue to feel kind of magical to me, due to my identification with the full-on self. E.g., if my decision algorithm is instantiated 10 lightyears away in a squid-person, it will feel like "I" can control "something else" very far away.

Nate Soares: If you were facing me in a game that turns out (after some simple arithmetic) to be isomorphic to a stag hunt, would you feel like you can control my action, despite me being on the other side of the room?

(What I'd say is that we both notice that the game is a stag hunt, and then do the same utility calculation + a bit of reasoning about the other player, and come to the same conclusion, and those calculations control both our actions, but neither of us controls the other player.)

(You can tell this in part from how our actions would not be synchronized in any choice that turns on a bunch of the extra details of Joe that Nate lacks. Like, if we both need to draw a picture that would make a child laugh, and we get an extra bonus from the pictures having identical content, then we might aim for Schelling drawings, but it's not going to work, because it was the simple stag-hunt calculation that was controlling both our actions, rather than all-that-is-Joe.)

(This is part of why I'd say, if your decision algorithm is instantiated 10 light-years away in a squid person, then you don't control them; rather, your shared decision algorithm governs the both of you. The only cases where you (in all your detailed glory) control multiple distant things are cases where exact copies of your brain occur multiple times, in which case it's not that one of you can control things 10ly away, it's that the term 'you' refers to multiple locations simultaneously.)

(Of course, this could just ground out into a question of how we define 'you'. In which case I'd be happy to fall back to first (a) claiming that there's a concept 'you' for which the above makes sense, and then separately (b) arguing that this is the correct way to rescue the English word "you" in light of multiple instantiation.)

Joe Carlsmith: Cool, the stag-hunt example is useful for giving me a sense of where you’re coming from. I can still imagine the sense that “if I hunt hare, the suitably-good-predictor of me will probably hunt hare too; and if I hunt stag, they will probably hunt stag too” giving me a sense of control over what they do, but it feels like we’ll quickly just run into debates about the best way to talk; your way seems coherent, and I’m not super attached to which is preferable from a “rescue” perspective.

Nate Soares: My reply: if a predictor is looking at you and copying your answer, then yes, you control them. But it's worth distinguishing between predictors that look at the-simple-shard-of-you-that-utility-maximizes-in-simple-games and you-in-all-your-detailed-glory. Like, in real life, it's much more common to find a predictor that can tell you'll go for a stag, than a predictor that can predict which drawing you'll make. And saying that 'you' control the former has some misleading implications, that are clarified away by specifying that the simple rules of decisionmaking are embedded in you and are all that the predictor needs to look at (in the former case) to get the right answer.

(We may already agree on these points, but also you might appreciate hearing my phrasing of the obvious reply, so \shrug)

Joe Carlsmith:

Well, it's full because The Algorithm one-boxes, and I'm a full-ass person wrapped around The Algorithm, but I'm not the instance of The Algorithm that Omicron was looking at, so it seems a bit weird to blame it on me.

Do you not control the output of the algorithm?

Nate Soares: In case it's not clear by this point, my reply is "the algorithm controls the output of me". Like, try as I might, I cannot make LDT 2-box on Newcomb's problem — I can't make 2-boxing be higher-utility, and I can't make LDT be anything other than utility-maximizing. I happen to make my choices according to LDT, in a way that is reflectively stable on account of all the delicious delicious utility I get that way.

From this point of view, the point where I'd start saying that it is "me" choosing something (rather than my simpler decision-making core) is when the decision draws on a bunch of extra personal details about Nate-in-particular.

There is of course another point of view, which says "the output of Joe in (say) Newcomb's problem is determined by Joe". This viewpoint is sometimes useful to give to people who are reflecting on themselves and struggling to decide between (say) CDT and LDT.

It's perhaps useful to note that these people tend to have complicated, messy, heuristical decision-procedures, that they're currently in the process of reflecting upon, in ways that are sensitive to various details of their personality and arguments they just heard. Which is to say, someone who's waffling on Newcomb's problem does have much more of their full self engaged in the choice than (say) I do. Their decision procedure is much more unique to them; it involves much more of their true name; all-that-is-them is much more of an input to it.

At that point, "their decision algorithm" and "them" are much closer to synonymous, and I won't quibble much if we say "their algorithm is what determines them" or "they are what determines the output of their algorithm". But in my case, having already passed through the reflective gauntlet, it's much clearer that the algorithm guides me, than that the parts of me wrapped around the algorithm guide it.

(Of course, the algorithm is also part of me, as it is part of many, and so it is still true that some part of me controls the output of The Algorithm. Namely, The Algorithm controls the output of The Algorithm.)

LDT doesn’t pass up guaranteed payoffs

Logical decision theorists firmly deny that they pass up guaranteed payoffs. (I can't quite tell from a skim whether you understand this; apologies if I missed the parts where you acknowledge this.)

As you probably know, in a twin PD problem, a CDT agent might protest that by cooperating you pass up a guaranteed payoff, because (they say) defecting is a dominant strategy. A logical decision theorist counters that the CDT agent has made an error, by imagining that "I defect while my twin cooperates" is a possibility, when in fact it is not.

In particular, when the CDT agent closes their eyes and imagines defecting, they (wrongly) imagine that the action of their twin remains fixed. Among the actual possibilities (cooperate, cooperate) and (defect, defect), the former clearly dominates. The disagreement is not about whether to take dominated strategies, but about what possibilities to admit in the matrix from which we calculate what is dominated and what is not.
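A minimal sketch of the disagreement about which cells belong in the matrix. The payoff numbers (5, 3, 1, 0) are conventional prisoner's dilemma values assumed for illustration, and the helper names are hypothetical. The first function checks dominance over all four cells, as the CDT agent does; the second only admits the cells in which the twin's action mirrors your own.

```python
ACTIONS = ["C", "D"]

# My payoff, indexed by (my action, twin's action). Values are assumed.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def dominant_action_full_matrix(payoff):
    """Hold the twin's action fixed and look for a strictly dominant action."""
    for mine in ACTIONS:
        others = [a for a in ACTIONS if a != mine]
        if all(payoff[(mine, twin)] > payoff[(other, twin)]
               for other in others for twin in ACTIONS):
            return mine
    return None


def best_action_mirrored(payoff):
    """Admit only (C, C) and (D, D): the twin does whatever I do."""
    return max(ACTIONS, key=lambda mine: payoff[(mine, mine)])


print(dominant_action_full_matrix(PAYOFF))  # defect dominates over all four cells
print(best_action_mirrored(PAYOFF))         # cooperate wins among the admitted cells
```

Both functions pick the best available option; they differ only in which outcomes they treat as possible.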

Now consider Parfit's hitchhiker. An LDT agent withdraws the $10k and gives it to the selfish man. Will MacAskill objects, "you're passing up a guaranteed payoff of $10k, now that you're certain you're in the city!". The LDT agent says "you have made an error, by imagining ‘I fail to pay while being in the city’ is a possibility, when in fact it is not. In particular, when you close your eyes and imagine not paying, you (wrongly) imagine that your location remains fixed, and wind up imagining an impossibility."

Objecting “it's crazy to imagine your location changing if you fail to pay” is a fair criticism. Objecting that logical decision theorists pass up guaranteed payoffs is not.

The whole question at hand is how to evaluate the counterfactuals. Causal decision theorists say "according to my counterfactuals, if you pay you lose $10k, thus passing up a guaranteed payoff", whereas logical decision theorists say "your counterfactuals are broken, if I don't pay then I die, life is worth more than $10k to me, I am taking the action with the highest payoff". You're welcome to argue that logical decision theorists calculate their counterfactuals wrong, if you think that, but saying we pass up guaranteed payoffs is either confused or disingenuous.
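The two ways of evaluating the counterfactual at the ATM can be sketched as follows. This assumes a perfectly accurate driver and an arbitrary stand-in value for a life (`VALUE_OF_LIFE`, chosen only to exceed $10k); the names `cdt_counterfactual` and `ldt_counterfactual` are shorthand labels for the two ways of imagining "not paying", not full formalizations of either theory.

```python
COST_OF_PAYING = 10_000
VALUE_OF_LIFE = 1_000_000  # assumption: any value above $10k gives the same result


def utility(location, action):
    if location == "desert":
        return -VALUE_OF_LIFE
    return -COST_OF_PAYING if action == "pay" else 0


def cdt_counterfactual(action):
    """Imagine the action while holding the observed location (the city) fixed."""
    return utility("city", action)


def ldt_counterfactual(action):
    """Imagine the action while letting the accurate driver's prediction,
    and hence the location, vary with it."""
    location = "city" if action == "pay" else "desert"
    return utility(location, action)


def best(counterfactual):
    return max(["pay", "stiff"], key=counterfactual)


print(best(cdt_counterfactual), best(ldt_counterfactual))
```

Each evaluator takes the action with the highest payoff by its own lights, so the dispute is over which counterfactual is right, not over whether to pass up payoffs.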

Joe Carlsmith:

(I can't quite tell from a skim whether you understand this; apologies if I missed the parts where you acknowledge this.)

I think I could’ve been clearer about it in the piece, and in my own head. Your comments here were useful on that front.

Joe Carlsmith:

Objecting “it’s crazy to imagine your location changing if you fail to pay” is a fair criticism.

Yeah I suppose this is where my inner “guaranteed payoffs” objector would go next. Could imagine thinking: “well, that just seems flat out metaphysically wrong, and in this sense worse than violating guaranteed payoffs, because just saying false stuff about what happens if you do X is worse than saying weird stuff about what’s ‘rational.’”

Nate Soares: I agree "you're flat-out metaphysically wrong (in a way that seems even worse than violating guaranteed payoffs)" is a valid counterargument to my actual position (in a way that "you violate guaranteed payoffs" is not). :-)

Parfit’s hitchhiker and contradicting the problem statement

There's a cute theorem I've proven (or, well, I've jotted down what looks to me like a proof somewhere, but haven't machine-checked it or anything), which says that if you want to disagree with logical decision theorists, then you have to disagree in cases where the predictor is literally perfect. The idea is that we can break any decision problem down by cases (like "insofar as the predictor is accurate, ..." and "insofar as the predictor is inaccurate, ...") and that all the competing decision theories (CDT, EDT, LDT) agree about how to aggregate cases. So if you want to disagree, you have to disagree in one of the separated cases. (And, spoilers, it's not going to be the case where the predictor is on the fritz.)

I see this theorem as the counter to the decidedly human response "but in real life, predictors are never perfect". "OK!", I respond, "But decomposing a decision problem by cases is always valid, so what do you suggest we do under the assumption that the predictor is accurate?"

Even if perfect predictors don't exist in real life, your behavior in the more complicated probabilistic setting should be assembled out of a mixture of ways you'd behave in simpler cases. Or, at least, so all the standard leading decision theories prescribe. So, pray tell, what do you do insofar as the predictor reasoned accurately?
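One way to write the aggregation step, as a sketch rather than a statement of the theorem (the notes do not give a formal version). The symbols are introduced here for illustration: $p$ is the probability that the predictor reasoned accurately, assumed not to depend on the action $a$, and $V_{\text{acc}}(a)$ and $V_{\text{inacc}}(a)$ are the values a given decision theory assigns to $a$ within each case.

```latex
V(a) \;=\; p \cdot V_{\text{acc}}(a) \;+\; (1 - p) \cdot V_{\text{inacc}}(a)
```

On this reading, two theories that share the weights and agree about $V_{\text{inacc}}$ can only recommend different actions by disagreeing about $V_{\text{acc}}$, which is the perfect-predictor case.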

I think this is a good intuition pump for the thing where logical decision theorists are like "if I imagine stiffing the driver, then I imagine dying in the desert." Insofar as the predictor is accurate, imagining being in the city after stiffing the driver is just as bonkers as imagining defecting while your twin cooperates.

One way I like to think about it is, this decision problem is set up in a fashion that purports to reveal the agent's choice to them before they make it. What, then, happens in the case where the agent acts inconsistently with this revelation? The scenario is ill-defined.

Like, consider the decision problem "You may have either a cookie or a bonk on the head, and you're going to choose the bonk on the head. Which do you choose?" The cookie might seem more appealing than the bonk, but observe that taking the cookie refutes the problem statement. It's at least a little weird to confidently assert that, in that case, you get a cookie. What you really get is a contradiction. And sure, ex falso quodlibet, but it seems a bit strange to anchor on the cookie.

It's not the fault of the agent that this problem statement is refutable by some act of the agent! The problem is ill-defined without someone telling us what actually happens if we refute the problem statement. If you try to take the cookie, you don’t actually wind up with a cookie; you yeet yourself clean out of the hypothetical. To figure out whether to take the cookie, you need to know where you'd land.

Parfit's hitchhiker, at the point where you're standing at the ATM, is much like this. The alleged problem statement is "you may either lose $0 or $10,000, and you're going to choose to lose $10,000". At which point we're like "Hold on a sec, the problem statement makes an assertion about my choice, which I can refute. What happens if I refute the problem statement?" At which point the question-poser is like "haha oops, yeah, if you refute the problem statement then you die alone in the desert". At which point, yeah, when the logical decision theorist closes their eyes and imagines stiffing the driver, then (under the assumption that the driver is accurate) they're like "oh dang, this would refute my observations; what happens in that case again? right, I'd die alone in the desert, which is worse than losing $10,000", and then they pay.

(I also note that this counterfactual they visualize is correct. Insofar as the predictor is accurate, if they wouldn't pay, then they would die alone in the desert instead. That is, in real life, what happens to non-payers who face accurate predictors. The "$0" was a red herring; that case is contradictory and cannot actually be attained.)

(In the problem where you may have either a cookie or a bonk, and you're going to take the bonk, but if you render the problem inconsistent then you get two cookies, by all means, take the cookie. But in the problem where you may have either a cookie or a bonk, and you're going to take the bonk, but if you render the problem inconsistent then you die alone in the desert, then take the dang bonk.)
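The dependence on "where you'd land" can be sketched in a few lines. The utility numbers are assumptions chosen only to preserve the obvious ordering, and `landing` is a hypothetical name for whatever the question-poser says happens when you refute the problem statement.

```python
# Assumed utilities; only their ordering matters.
UTILITY = {
    "cookie": 1,
    "bonk": -1,
    "two cookies": 2,
    "die alone in the desert": -1_000_000,
}


def outcome(choice, landing):
    """The problem statement asserts that you choose the bonk.
    Choosing the cookie refutes it, so you get `landing` instead of a cookie."""
    return "bonk" if choice == "bonk" else landing


def best_choice(landing):
    return max(["cookie", "bonk"], key=lambda c: UTILITY[outcome(c, landing)])


print(best_choice("two cookies"))
print(best_choice("die alone in the desert"))
```

Note that the plain "cookie" outcome never appears as a result of `outcome`: the choice is between the bonk and the landing spot.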

This sort of thing definitely runs counter to some human intuitions — presumably because, in real life, we rarely observe consequences of actions we haven't made yet.

(Well, except for in a variety of social settings, where we have patches such as "honor" and "reputation" that, notably, give the correct answer in this case, but I digress.)

This is where I think my cute theorem makes it easier to see what's going on: insofar as the predictor is perfect, it doesn't make sense to visualize being in the city after stiffing the driver. When you're standing in front of the ATM, and you screw your eyes shut and imagine what happens if you just run off instead of withdrawing the money, then in the case where the predictor reasoned correctly, your visualizer should be like ERROR ERROR HOW DID WE GET TO THE CITY?, and then fall back to visualizing you dying alone in the desert.

Is it weird that your counterfactual-visualizer paints pictures of you being in the desert, even though you remember being driven to the city? Yep. But it's not the agent's fault that they were shown a consequence of their choice before making their choice; they're not the one who put the potential for contradiction into the decision problem. Avoiding contradiction isn’t their problem. One of their available choices is contradictory with observation (at least under the assumption that the predictor is accurate), and they need to handle the contradiction somehow, and the problem says right there on the tin that if you would cause a contradiction then you die alone in the desert instead.

(Humans, of course, implement the correct decision in this case via a sense of honor or suchlike. Which is astute! "I will pay, because I said I would pay and I am a man of my word" can be seen as a shadow of the correct line of reasoning, cast onto monkey brains that were otherwise ill-suited for it. I endorse the practice of recruiting your intuitions about honor to perform correct counterfactual reasoning.)

(And these counterfactuals are true, to be clear. You can't go find people who were accurately predicted, driven to the city, and then stiffed the driver. There are none to be found.)

Do you see how useful this cute little theorem is? I love it. Instead of worrying about "but what if the driver was simply a fool, and I can save $10k?", we get to break the decision problem down into cases, one where the driver was incorrect, and one where they were correct. We all agree that insofar as they're incorrect you have to stiff them, and we all agree about how to aggregate cases, so the remaining question is what you do insofar as they're accurate. And insofar as they're accurate, the contradiction is laid bare. And the "stand in front of the ATM, but visualize yourself dying in the desert" thing feels quite justified, at least to me, as a response to a full-on contradiction.

Just remember that it's not your job to render the universe consistent, and that contradictions can't actually happen. Insofar as the predictor is accurate, imagining yourself surviving and then stiffing the driver makes just as much sense as imagining yourself defecting against your cooperating clone.

Joe Carlsmith:

"You may have either a cookie or a bonk on the head, and you're going to choose the bonk on the head. Which do you choose?"

I think this is a useful way of illustrating some of the puzzles that come up with transparent-Newcomb-like cases.

Joe Carlsmith:

we get to break the decision problem down into cases, one where the driver was incorrect, and one where they were correct

Do you have something like "reliable" in mind, here, rather than "correct"? E.g., presumably you don't care if he's correct, but he flipped a coin to determine his prediction. It seems like what matters is whether his prediction was sensitive to your choice or not — a modal thing.

Nate Soares: Yeah, that's actually my preferred way to think about it. That adds some extra subtleties that turn out to make no difference, though, so I skipped over it for the sake of exposition.

(Like, an easy way to do it is to say "I think there's a 95% chance they reason correctly about me, and a 5% chance they make at least one reasoning error, and in the latter case it's equally likely (in a manner uncorrelated with my action) that the error pushes them to an invalid true conclusion as an invalid false conclusion, and so we can model this as one case where they're correct, and one case where they toss a coin and guess accordingly". And this turns out to be equivalent to assuming that they're 97.5% right and 2.5% wrong, which is why it makes no difference. But this still doesn't match real life, because in real life they're using fallible stuff like intuition and plausible-seeming deductive leaps, but whatever, I claim it still basically comes down to "were they taking the relevant considerations about me into account, and reasoning validly to their conclusion, or not?" \shrug)
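The 95%/5% arithmetic can be checked directly. This sketch assumes the two models exactly as described above (valid reasoning with probability 19/20, otherwise a fair coin uncorrelated with the action) and compares them to a flat 39/40 accuracy; the function names are illustrative. Exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction


def p_predicts_pay_mixture(policy, p_valid=Fraction(19, 20)):
    """Mixture model: the predictor reasons validly with probability p_valid
    (and then matches the policy); otherwise it guesses by a fair coin."""
    valid_part = p_valid if policy == "pay" else Fraction(0)
    coin_part = (1 - p_valid) * Fraction(1, 2)
    return valid_part + coin_part


def p_predicts_pay_flat(policy, p_right=Fraction(39, 40)):
    """Flat model: the prediction matches the policy with probability p_right."""
    return p_right if policy == "pay" else 1 - p_right


for policy in ["pay", "stiff"]:
    print(policy, p_predicts_pay_mixture(policy), p_predicts_pay_flat(policy))
```

Because the two models assign the same probability to each prediction given each policy, any payoff that depends only on the action and the prediction has the same expectation under both.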

Joe Carlsmith: Cool, would like to think about this more (I do feel like being X% accurate won't always be relevantly equivalent to being Y% infallible and Z% something else), but breaking things down into cases like this seems useful regardless. In particular, seems like the "can't I just control whether he's accurate" response discussed below should apply in the Y%-infallible-Z%-something-else case.

Nate Soares: (I agree it won't always be relevantly equivalent. It happens to be equivalent in this case, and in most other simple decision problems where you care only about whether (and not why) the predictor got the answer right. Which is not supposed to be terribly obvious, and I'll consider myself to have learned a lesson about using expositional simplifications where the fact that it is a simplification is not trivial. :-p)

Joe Carlsmith:

We all agree that insofar as they're incorrect you have to stiff them, and we all agree about how to aggregate cases, so the remaining question is what you do insofar as they're accurate. And insofar as they're accurate, the contradiction is laid bare. And the "stand in front of the ATM, but visualize yourself dying in the desert" thing feels quite justified, at least to me, as a response to a full-on contradiction.

Rephrasing to make sure I understand (using the "reliable/sensitive" interpretation I flagged above): “You stand in front of the ATM. Thus, he’s predicted that you pay. Now, either it’s the case that, if it weren’t the case that you pay, you’d be in the desert dead; or it’s the case that, if it weren’t the case that you pay, you’d still be at the ATM. In the former case, not paying is a contradiction. In the latter case, you should not pay.”

I wonder if the one-boxer could accept this but say: “OK, but given that I’m standing in front of the ATM, if I don’t pay, then I’m in the case where I should not pay, so it’s fine to not pay, so I won’t." E.g., by not paying in the city, you can "make it not the case" that if you don't pay, you die in the desert five hours ago — after all, you're alive in the city now.

Nate Soares:

Rephrasing to make sure I understand [...]

That's right!

I wonder if the one-boxer could accept this but say [...]

There are decision theories that have this behavior! (With some caveats.) Note that this corresponds to an agent that 1-boxes in Newcomb's problem, but 2-boxes in the transparent Newcomb's problem. I don't know of anyone who seriously advocates for that theory, but it's a self-consistent middle-ground.

One caveat is that this isn't reflectively consistent (e.g., such agents expect to die in the desert in any future Parfit's hitchhiker, and would pay in advance to self-modify into something that pays the driver if the driver makes their prediction after the moment of modification). Another caveat is that such agents are easily exploitable by blackmail.

I also suspect that this decision theory violates the principle where you can break down a decision problem by cases? But I'm not sure. You can almost surely get them to pay you to not reveal information. You can maybe money pump them, though I haven't tried.

But those aren't quite my true objection to this sort of thinking. And indeed, the error in this line of thinking ("if I stiff the driver, then I must thereby render them inaccurate, because I've already seen the ATM") is precisely what my lemma about problem decomposition is intended to ward off.

Like, one thing that's wrong with this sort of thinking is that it's hallucinating that the driver's accuracy is under your (decision algorithm's) control. It isn't (and I suspect that the mistake can be money-pumped).

Another thing that's wrong with it is that it's comparing counterfactuals with different degrees of consistency.

Like, consider the problem "you can choose a cookie or a bonk on the head; also, I tossed a coin that comes up 'bonk' 99.9999% of the time and 'cookie' 0.0001% of the time, and your choice matches the coin."

Now, choosing 'cookie' only has a 99.9999% chance of being inconsistent with the problem statement, but this doesn't put the two choices on equal footing. Like, yes, now you can only probabilistically render this problem-statement false, but it's still pretty weird that you can probably render this problem-statement false! And the fact that I mixed in a little uncertainty, doesn't mean that you can now make your choice without knowing what happens if you render the problem statement false! The fact that we mixed in a little uncertainty doesn't justify comparing a bonk directly to a cookie; the problem statement is still incomplete; you still need to know what would actually happen insofar as your action contradicts the allegation that it matches the biased coin.

And, like, there's an intuition that it would be pretty weird, given that problem-statement, to imagine that your choice controls the coin. The coin isn't about you; it's not about your algorithm; there's nothing linking your action to the coin. The weird thing about this problem-statement is the bizarre assertion that your action is known to match the coin. Like… whichever way the coin came up, what if you did the opposite of that?

This is an intuition behind the idea that we should be able to case on the value of the coin and consider each of the cases independently. Like, no matter what the value of the coin is, one of our actions reveals the problem statement to be bogus. And someone needs to tell us what happens if we render the whole problem-statement bogus. And so even when there's uncertainty, we need to know the consequences of refuting the problem statement in order to choose our action.
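
A minimal sketch of that case analysis, in Python. It takes the 99.9999% 'bonk' bias from the problem statement and gives the remainder to 'cookie'; the names `P_COIN`, `prob_statement_refuted`, and `refuting_actions_by_case` are illustrative assumptions, not anything from the original notes. It only shows that each action has some probability of contradicting the "your choice matches the coin" clause, and that in every case of the coin exactly one action refutes it. It does not say what happens when the statement is refuted, which is the missing piece the text is pointing at.

```python
from fractions import Fraction

# Coin bias from the problem statement: 'bonk' 99.9999% of the time,
# with the remaining probability assigned to 'cookie' (an assumption).
P_COIN = {
    "bonk": Fraction(999_999, 1_000_000),
    "cookie": Fraction(1, 1_000_000),
}


def prob_statement_refuted(choice):
    """Probability that `choice` differs from the coin, i.e. that it
    contradicts the claim 'your choice matches the coin'."""
    return sum(p for outcome, p in P_COIN.items() if outcome != choice)


def refuting_actions_by_case():
    """Case on the coin: for each coin value, list the actions that
    would render the problem statement false."""
    return {coin: [a for a in P_COIN if a != coin] for coin in P_COIN}


if __name__ == "__main__":
    for action in P_COIN:
        print(action, "refutes the statement with probability",
              prob_statement_refuted(action))
    print(refuting_actions_by_case())
```

Neither action is guaranteed to be consistent with the statement, so a bare comparison of "bonk" against "cookie" leaves the refuting branches unpriced.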

Joe Carlsmith:

Note that this corresponds to an agent that 1-boxes in Newcomb's problem, but 2-boxes in the transparent Newcomb's problem. I don't know of anyone who seriously advocates for that theory, but it's a self-consistent middle-ground.

EDT 1-boxes in Newcomb's, but 2-boxes in transparent Newcomb's, no?

You can almost surely get them to pay you to not reveal information.

Agree, I feel like avoiding this is one of the key points of being "updateless." E.g., because you're able to act as you would've committed to acting prior to learning the information, it's fine to learn it. Also agree re: exploitable via blackmail (e.g. EDT's XOR blackmail problems).

one thing that's wrong with this sort of thinking is that it's hallucinating that the driver's accuracy is under your (decision algorithm's) control. It isn't (and I suspect that the mistake can be money-pumped).

Flagging that I still feel confused about this, and it feels like it rhymes a bit with stuff about ‘can you control the base rate of lesions’ in smoking lesion that I discuss in the post. (I expect you want to say no, and that this is connected to why you want to smoke in smoking lesion; but in cases where your smoking is genuinely evidence that you’ve got the lesion, I’m not sure this is the right verdict.) I'm wondering if there's something generally weird going on in terms of "having a problem-set-up" that can be violated or not.

the fact that I mixed in a little uncertainty, doesn't mean that you can now make your choice without knowing what happens if you render the problem statement false!

Cool, this helps give me a sense of where you're coming from. In particular, even if the predictor isn't always accurate, sounds like you want to interpret “I’m in the city and successfully don’t pay” as having some probability of rendering the problem-statement false, as opposed to being certain to put you in the worlds where the predictor was wrong.

Nate Soares:

EDT 1-boxes in Newcomb's, but 2-boxes in transparent Newcomb's, no?

You're right, I should have thrown in some extra things that rule out EDT. I think that thing refuses XOR blackmail, 1-boxes in Newcomb's problem, and 2-boxes in transparent Newcomb's? (Though I haven't checked.) Which is the sort of theory that, like, only locals would consider, and I don't know any local who takes it seriously, on account of the exploitability and reflective inconsistency and stuff.

I don't have the smoking lesion problem mentally loaded up (I basically think it's just a confused problem statement), but my cached thought is that I give the One True Rescuing of that problem in the "The Smoking Lesion Problem" section of https://arxiv.org/pdf/1710.05060.pdf :-p. And I agree with the diagnosis that there's generally something weird going on when the problem set-up can be violated.

In particular, even if the predictor isn't always accurate, sounds like you want to interpret “I’m in the city and successfully don’t pay” as having some probability of rendering the problem-statement false, as opposed to being certain to put you in the worlds where the predictor was wrong.

Yep! With the justification being that (a) you obviously need to do this when things are certain, and (b) there shouldn't be some enormous change in your behavior when we replace "certain" with "with probability 1 - ε". Doubly so on account of how you should be able to reason by cases.

Like, if you buy that shit is weird when you can certainly render the problem statement false, and if you buy that either you should be able to reason by cases or you shouldn't have some giant discontinuity at literal certitude, then you're basically funneled into believing that you have to consider (when at the ATM) that failing to pay could render the whole set-up false, at which point you need some extra rule for how to reason in that case.

Where CDT says "assume you live and don't pay" and LDT says "assume you die in the desert", and both agree that the rest of the choice is determined given how you respond to the literal contradiction in the flatly contradictory case.
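
One simplified way to see the two stances side by side, as a hedged Python sketch. The payoff numbers `LIFE` and `FEE`, the function names, and the modeling choice that the driver's prediction tracks your policy with probability `accuracy` are all assumptions made for illustration; this is a standard toy formalization of the hitchhiker case, not necessarily the decomposition the notes have in mind. `at_atm_value` holds the rescue fixed (the "assume you live and don't pay" reading), while `policy_value` lets the rescue depend on the policy (the "assume you die in the desert" reading when the driver is certain).

```python
LIFE = 1_000_000  # hypothetical value of surviving the desert
FEE = 1_000       # hypothetical payment owed to the driver


def policy_value(pays, accuracy):
    """Expected value of a policy when the driver's prediction matches
    the policy with probability `accuracy` (an illustrative model)."""
    if pays:
        # Rescued only when the driver correctly predicts payment.
        return accuracy * (LIFE - FEE)
    # Rescued only when the driver wrongly predicts payment.
    return (1 - accuracy) * LIFE


def at_atm_value(pays):
    """Value when the rescue is treated as already settled and only
    the payment varies."""
    return LIFE - FEE if pays else LIFE


def best_policy(accuracy):
    """Policy with the higher `policy_value` at the given accuracy."""
    pay = policy_value(True, accuracy)
    stiff = policy_value(False, accuracy)
    return "pay" if pay > stiff else "don't pay"


if __name__ == "__main__":
    for acc in (1.0, 0.999, 0.5):
        print(acc, best_policy(acc))
    print("at the ATM, stiffing looks better:",
          at_atm_value(False) > at_atm_value(True))
```

Under these assumed numbers the policy-level answer is the same at accuracy 1.0 and at 0.999, so nothing discontinuous happens at literal certainty, whereas the held-fixed comparison favors not paying regardless of accuracy.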

At which point it's my turn to assert that CDT is flat-out metaphysically wrong, because it's hallucinating that flat contradictions are relevantly possible.

Finally, a minor note: I think the twin clone prisoner's dilemma is sufficient to kill CDT. But if you want to kill it extra dead, you might be interested in the fact that you can turn CDT into a money pump whenever you have a predictor that's more accurate than chance, using some cleverness and the fact that you can expand CDT's action space by also offering it contracts that pay out in counterfactuals that are less possible than CDT pretends they are.

Joe Carlsmith: Sounds interesting — is this written up anywhere?

Nate Soares: Maybe in the Death in Damascus paper? Regardless, my offhand guess is that the result is due to Ben Levinstein so if it's not in that paper then it might be in some other paper of Ben's.

Joe Carlsmith: Thanks again for this! I do hope you publish — I'd like to be able to cite your comments in future.


Vladimir_Nesov (2y):

The issue with you-in-all-detail vs. your-decision-algorithm is that a decision algorithm can have different levels of updatelessness, and it's unclear what the decision algorithm already knows vs. what a policy it chooses takes as input. So we pick some intermediate level that is updateless enough to allow acausal coordination among relevant entities (agents/predictors), and updateful enough to make a decision without running out of time/memory while being implemented in its instances. But that level/scope is different for different collections of entities being coordinated.

So I think a boundary shouldn't be drawn around "a decision algorithm", but around whatever common knowledge of each other the entities being acausally coordinated happen to have (where they don't need to have common knowledge of everything). When packaged as a decision algorithm, the common knowledge becomes an adjudicator, which these entities can allow influence over their actions. To the extent the influence they allow an adjudicator is common knowledge among them, it also becomes knowledge of the adjudicator, available for its decision making reasoning.

Importantly for the reframing, an adjudicator is not a decision algorithm belonging to either agent individually, it's instead a shared decision algorithm. It's a single decision algorithm purposefully built out of the agents' common knowledge of each other, rather than a collection of their decision algorithms that luckily happen to have common knowledge of each other. It's much easier for there to be some common knowledge than for there to be common knowledge of individually predefined decision algorithms that each agent follows.

Jozdien (2y):

For instance, if there are two identical physical copies of you (in physical rooms that are identical enough that you're going to make the same observations for the length of the hypothetical, etc.), then my guess is that there isn't a real question about which one is you. They are both you. You are the pattern, not the meat.

I agree with this in the decision-theoretic context (which I suppose is the only relevant one here, but I want to add this because it always sticks out to me), but not in the context of self. If someone instantiates an identical physical copy of me, I would still have asymmetric preferences over worlds involving the both of us. For example, I wouldn't like a world where I was killed and replaced instantly with an identical copy of me, because while that person is me for all external and decision-theoretical purposes, it isn't me in the stream-of-consciousness-that-forms-my-self sense.

Rob Bensinger (2y):

it isn't me in the stream-of-consciousness-that-forms-my-self sense

How do you know?

If it turned out that the entire universe were constantly being destroyed and then an exact copy recreated, innumerable times every second, would you thereby learn something crucial about the value of copies of yourself? Why would a random weird metaphysics footnote like this make such a profound difference for your utility function?

Seems even weirder to me given how much turnover there is in a body's composition over time. Caring about your future self as long as you slowly switch out your body parts for functionally identical new parts, but not if you do the same process quickly, feels to me like deciding that getting a haircut destroys your continuity of self. (But weirder, because at least a haircut makes a perceptible difference of some sort!)

Jozdien (2y):

How do you know?

Our disagreement might come down to differing definitions of "self", but if someone created an identical copy of me in a room identical to mine, there would be two "Jose"s in my model of reality, not one. While there isn't anything different between the two in terms of present state, there is a difference in terms of past state, specifically "which one of the two carries over the physical substrate that was me a second ago".

If it turned out that the entire universe were constantly being destroyed and then an exact copy recreated, innumerable times every second

I've thought about this (along with a "closer" version of whether sleep does this in some way), and every way I think about it, I just come to the conclusion of "this would probably be very sad if true" (with a caveat that it's possible this happens in a way that preserves the stream-of-consciousness I care about in some way, but that's diving way too deep into ill-formed theories of consciousness). To me it seems like destroying a bunch of sentient minds and then recreating an equivalent number later - even if they're similar in form, that doesn't mean a bunch of them didn't die, and new ones were created.

Why would a random weird metaphysics footnote like this make such a profound difference for your utility function?

I mean, why wouldn't it? Weird differences at that low a level having profound differences on something high-level like utility functions feels normal to me? And its randomness is relative, and if you have a utility function that values continuity-of-self, it's relevant.

Seems even weirder to me given how much turnover there is in a body's composition over time.

Yeah, but at no point does the body's composition change drastically enough in a short period of time that the high-level signals passing through the brain substrate are reformed at a biological level.

I fully agree that this thinking leads to some very weird directions, but that seems to me more a fact about the territory, not this in particular. Like, what if you were put in a room with an identical copy of yourself, such that both your experiences were perfectly identical (the room is configured to perfect symmetry, etc)? You see this person, you interact with him (albeit in weird, perfectly symmetrical ways), would you have no preference for your life and happiness over his? To me, it's weird to think of a physical structure that just happens to be similar to you as you in a value sense.

Just to make sure we're disagreeing about the right thing: do you disagree with Harry and Quirrell's stance on the original horcrux design in HPMOR?

Rob Bensinger (2y):

with a caveat that it's possible this happens in a way that preserves the stream-of-consciousness I care about in some way

??? You say that this is a claim you feel uncertain about. Does this uncertainty cash out in any experiential difference?

Like, the mere fact that we don't know whether we're being destroyed and recreated innumerable times every second, seems to rule out the idea that there's an experiential difference here.

We could think of ourselves as characters in a movie, whose state at each moment is represented as a frame in a film reel. From within the movie, no character can perceive the transition from one frame to the next, precisely because the brain-state encoded in each frame is our experience at that moment. We experience the frames, not the transitions or the superstructure those frames are part of. Expecting a denizen of a film frame to "notice" whether (e.g.) you double up on every frame, is like expecting them to "notice" that there's a rubber duck a few feet away from the film reel, outside of the movie universe.

Given that there's a smallest "physically meaningful" unit of time (the Planck time), it's not clear to me that it makes sense to think of your physical body as continuously persisting over time, as opposed to "jumping" (film-reel-style, or flipbook-style) from frame to frame. But maybe you think that's fine, because a body that persists via jerky, discontinuous motions (at some level of granularity) is still different from one that's annihilated and recreated in between each jerk?

I assume you also, then, think it would be horrible news to think you're in a simulation? Since by the same logic as "it would be tragic to learn that my body is constantly being destroyed and recreated" (even though this would, by assumption, be how things had always been, would be normality and inevitable, and wouldn't change a danged thing about what you experience at any time), it would surely also be tragic to learn that your body never existed (even though the same "this makes no experiential difference" caveats hold)?

To my eye, all of this smacks of a soul-ish illusion. I get the intuition that, e.g., it's weird to "anticipate" the future experiences of a version of me that went through a teleporter (being destroyed, and then recreated somewhere else). But our memories are exactly the thing that creates the impression of a continuous self -- as noted above, there would be literally no experiential difference if we learned that our body were a simulation, or that it were being constantly destroyed and recreated. It's just the accumulation of memory that makes any of this feel, at any given time, "continuity-of-self-ish" / "stream-of-consciousness-ish".

And the teletransporter would preserve memory, to exactly the same degree that any routine action I take preserves memory. The idea that there's a ghost in my brain that won't get to really experience the post-teleportation experience, even though it feels obvious that this ghost is surviving moment-to-moment as I wrote this LessWrong comment, has to be an illusion, because there is no Bayesian evidence I have that my current moment-to-moment experience is any different from the teletransporter case. My brain has not encountered a single experience, in its entire lifetime, that differentiates the "moment-to-moment existence involves Me persisting over time" hypothesis from the "moment-to-moment existence involves a succession of Me-like entities replacing each other while passing on the memories of the previous Me".

The "succession of Me entities persisting over time, with memories preserved" hypothesis is completely empirically indistinguishable from the "one Me persisting over time, with memories preserved" hypothesis. So it seems clear to me that I should "anticipate" having an experience a moment from now iff it has the same kind of memory-relation to my present experience. It will feel the same in any case, if you take a survey of all the experiences happening in the universe at a given time. The idea that there's a Soul, separate from my experiences and memories, that will blip out of existence or perceive Darkness if something interferes with its Adhesion to a body (even though the exact same memory-relation to a body in the world still holds) -- this idea doesn't make physical sense.

And since it's just a free-floating intuition, not a thing that we could possibly have gotten Bayesian evidence for, it seems unusually clear that it has to be an illusion.

By the same logic, if more than one "me" is created with that same memory-relation to me, then I should "anticipate" having all of those experiences. Not in the sense that I'm a Soul that will get to stare at multiple movie screens simultaneously; but in the sense that the experiencing self will itself be copied, and each movie screen will get its own "me" that has an equal claim to being "the same person" as present-me. See also the MWI example.

The experiencing self is in the universe -- in the film reel, in the momentary brain-state -- and this feels emotionally un-obvious (and the teletransporter feels scary) because of an introspective illusion. If this is true, then it seems unparsimonious to separately claim that our CEV would assign enormous moral importance to this weird metaphysical thingie we currently have an illusion about; we would expect the illusion to feel emotionally salient even if our CEV doesn't care about bodily continuity.

Rob Bensinger (2y):

To me it seems like destroying a bunch of sentient minds and then recreating an equivalent number later

Destroying, minus all of the experiential bad things we mentally associate with "destroying"!

I mean, why wouldn't it? Weird differences at that low a level having profound differences on something high-level like utility functions feels normal to me?

I don't object to the idea of people having utility functions over various unconscious physical states -- e.g., maybe I assign some aesthetic value to having my upload run on cool shiny-looking computronium, rather than something tacky, even if no one's ever going to see my hardware.

The three things that make me suspicious of your "this is a huge part of my utility function" claim are:

The fact that the properties you're attached to are things that (for all you know) you never had in the first place. The idea of humans experiencing and enjoying beauty, and then desiring that things be beautiful even in unobserved parts of the universe, makes total sense to me. The idea of humans coming up with a weird metaphysical hypothesis, and assigning ultimate value to the answer to this hypothesis even though its truth or falsehood has never once made a perceptible difference in anyone's life, seems far weirder to me.
This alleged utility function component happens to coincide with a powerful and widespread cognitive illusion (the idea of a Soul or Self or Homunculus existing outside determinism / physics / the Cartesian theater).
The alleged importance of this component raises major alarm bells. I wouldn't be skeptical if you said that you slightly aesthetically prefer bodily continuity, all else equal; but treating this metaphysical stuff as though it were a matter of life or death, or as though it were crucial in the way that the distinction between "me" and "a random stranger" is important, seems like a flag that there's something wrong here. The "it's a matter of life or death" thing only makes sense, AFAICT, if you're subject to a soul-style illusion about what "you" are, and about what your experience of personal continuity currently consists in.

You see this person, you interact with him (albeit in weird, perfectly symmetrical ways), would you have no preference for your life and happiness over his?

To my eye, this is isomorphic to the following hypothetical (modulo the fact that two copies of me existing may not have the same moral value as one copy of me existing):

I'm playing a VR game.
Iff a VR avatar of mine is killed, I die in real life.
When I use the controls, it causes two different avatars to move. They move in exactly the same way, and since they're symmetrically positioned in a symmetrical room, there's no difference between whether I'm seeing through the eyes of one avatar or the other. Through my VR helmet, I am in fact looking through the eyes of just one of the two avatars -- I can't see both of their experiences at once -- but since the experiences are identical, it doesn't matter which one.

It doesn't really matter whether I value "my" life over "his", because we're going to have the same experiences going forward regardless. But if something breaks this symmetry (note that at that point we'd need to clarify that there are two separate streams of consciousness here, not just one), my preference before the symmetry is broken should be indifferent between which of the copies gets the reward, because both versions of me bear the correct memory-relationship to my present self.

(So, I should "anticipate" being both copies of me, for the same reason I should "anticipate" being all of my MWI selves, and anticipate having new experiences after stepping through a teleporter.)

Signer (2y):

This alleged utility function component happens to coincide with a powerful and widespread cognitive illusion

So what? Once you know that the illusion is wrong, rescuing your utility function even to "the universe is now sad forever" is perfectly fine. So in what way it doesn't make sense to value some approximation of soul that is more problematic than patternist one? There is no law that says you must use the same reasoning for ethically navigating ontology shifts that you use for figuring the world out.

green_leaf (2y):

So in what way it doesn't make sense to value some approximation of soul that is more problematic than patternist one?

Because the reason why people make such an approximation is because they wish to save the continuity of their consciousness. But there is nothing else than patternism that does that.

They don't terminally value being made of the same substance, or having the same body, etc. They terminally value their consciousness persisting, and the common decision to care about, let's say, preserving the substance of their body, is made instrumentally to protect that.

Once they someday understand that their consciousness persists independently of that, they will stop valuing it.

Signer (2y):

Persistence of your consciousness is not something you (only) understand - it's a value-laden concept, because it depends on what you define as "you". If your identity includes your body then the consciousness in new body is not yours so your consciousness didn't persist.

But there is nothing else than patternism that does that.

Sure there is - "soul dies when body dies" preserves the continuity of consciousness until body dies. Or what do you mean by "save"?

They don’t terminally value being made of the same substance, or having the same body, etc. They terminally value their consciousness persisting, and the common decision to care about, let’s say, preserving the substance of their body, is made instrumentally to protect that.

And how did you decide that? Why couldn't they have been wrong about preserving the body just for the soul, and then figure out that no, they actually do terminally care about not destroying specific instantiations of their pattern?

green_leaf (2y):

I think we need to go through this first:

Persistence of your consciousness is not something you (only) understand - it's a value-laden concept, because it depends on what you define as "you". If your identity includes your body then the consciousness in new body is not yours so your consciousness didn't persist.

There is only one consciousness. If I define myself as myself + my house and someone demolishes my house, there is no consciousness that has just been destroyed, in reality or in the model. Similarly, if I include my body in the definition of me and then destroy my body, no consciousness has been destroyed. Etc.

We can choose to call my surviving consciousness someone else's, but that doesn't make it so. This purely definitional going out of existence isn't what those people have in mind. What they believe is that they actually black out forever, like after a car accident, and a copy of them who falsely feels like them will actually take their place.

(If I told you one day that I terminally cared about never leaving my house because I defined myself as my body + my house, and if my body leaves my house, I cease to exist, would you think that I'm holding a terminal belief, or would you consider some deep confusion, possibly caused by schizophrenia? I don't admire the self-determination of people who choose to define themselves as their body - instead, I can see they're making an error of judgment as to what actually destroys/doesn't destroy their consciousness (not because of schizophrenia, but nevertheless an error)).

Signer (2y):

There is only one universe. Everything else, including singling out consciousness, is already definitional (and I don't see how physical locality, for example, is more definitional than "not blacking out forever"). Thinking otherwise is either smuggling epistemic methods into ethics ("but it's simpler if we model you and house as separate entities!" or something), which is an unjustified personal meta-preference and certainly bad without explicitly marking it as such, or just plain forgetting about the subjectivity of high-level concepts such as consciousness. Calling something "consciousness" also doesn't change the underlying reality of your brain continuing to exist after a car accident.

What they believe is that they actually black out forever, like after a car accident, and a copy of them who falsely feels like them will actually take their place.

In the case of teleportation, for example, how "this body will no longer instantiate my pattern of consciousness" is not correct description of reality and how "I will black out forever" is not a decent approximation of it? There will be an instantiation of blacking out forever; the decision to model it as continuing somewhere else is, like you say, definitional.

If I told you one day that I terminally cared about never leaving my house because I defined myself as my body + my house, and if my body leaves my house, I cease to exist, would you think that I’m holding a terminal belief, or would you consider some deep confusion, possibly caused by schizophrenia?

Wait, why do you think schizophrenia doesn't change terminal values? I would hope it's just confusion, but if it's your values then it's your values.

I don’t admire the self-determination of people who choose to define themselves as their body—instead, I can see they’re making an error of judgment as to what actually destroys/doesn’t destroy their consciousness

You didn't demonstrate what fact about reality in low-level terms they get wrong.

green_leaf (2y):

I don't see how physical locality, for example, is more definitional than "not blacking out forever"

If by "definitional" you mean "just a matter of semantics," that would be a bizarre position to hold (it would mean, for example, that if someone shoots you in the head, it's just a matter of semantics whether you black out forever or not). If by "definitional" you mean something else, please, clarify what.

In the case of teleportation, for example, how "this body will no longer instantiate my pattern of consciousness" is not correct description of reality and how "I will black out forever" is not a decent approximation of it?

The former is correct, while the latter is completely wrong. Since you actually black out forever iff your consciousness is destroyed, and your consciousness isn't destroyed (it just moves somewhere else).

Saying that you will black out forever after the teleportation is as much of a decent approximation of the truth as saying that you will black out forever after going to sleep.

Signer (2y):

it would mean, for example, that if someone shoots you in the head, it’s just a matter of semantics whether you black out forever or not

Yes, it is just a matter of semantics, unless you already specified what "you black out forever" means in low-level terms. There are no additional facts about reality that force you to describe your head in a shot state as blacking out forever or even talk about "you" at all.

Since you actually black out forever iff your consciousness is destroyed

And that's an inference from what law? Defining "destroyed" as "there are no other instantiations of a pattern of your consciousness" is just begging the question.

Saying that you will black out forever after the teleportation is as much of a decent approximation of the truth as saying that you will black out forever after going to sleep

I mean, there are detectable physical differences, but sure, what's your objection to "sleep is death" preference?

green_leaf (2y):

Yes, it is just a matter of semantics

No, it's not. The low-level terms have to be inferred, not made correct by definition.

Otherwise, you could survive your death by defining yourself to be a black hole (those are extremely long-lived).

And that's an inference from what law?

There is, of course, no other possibility. If your consciousness survives but you don't black out forever, that would be a contradiction - similarly, if the latter is true but the former false.

what's your objection to "sleep is death" preference

That it's a mental illness, I suppose.

Signer (2y):

Otherwise, you could survive your death by defining yourself to be a black hole (those are extremely long-lived).

That's the point? There are no physical laws that force you to care about your pattern instead of caring about black hole. And there are no laws that force you to define death one way or another. And no unique value-free way to infer low-level description from high-level concept - that's ontological shift problem. You can't "know" that death means your pattern is not instantiated anywhere, once you know about atoms or whatever. You can only know that there are atoms and that you can model yourself as using the "death" concept in some situations. You still need to define death afterwards. You can motivate high-level concepts by epistemic values - "modeling these atoms as a chair allows me to cheaply predict it". But there is no fundamental reason why your epistemic values must dictate what high-level concepts you use for your other values.

If your consciousness survives but you don’t black out forever, that would be a contradiction

It's only a contradiction if you define "you blacking out" as the destruction of your consciousness. Like I said, it's begging the question. Nothing is forcing you to equate ethically significant blacking out with "there are no instantiations of your pattern" instead of "current instantiation is destroyed".

That it’s a mental illness, I suppose.

So it is a matter of semantics.

green_leaf (2y):

There are 2 confusions there:

That's the point? There are no physical laws that force you to care about your pattern instead of caring about black hole.

There, you are confusing not caring about what you are with what you are being a matter of semantics.

And there are no laws that force you to define death one way or another.

There you are confusing not being forced by the laws of physics to define death correctly with death being a matter of semantics.

And no unique value-free way to infer low-level description from high-level concept - that's ontological shift problem.

That's potentially a good objection, but a high-level concept already has a set of properties, explanatorily prior to it being reduced. They don't follow from our choice of reduction, rather, our choice of reduction is fixed by them.

Nothing is forcing you to equate ethically significant blacking out with "there are no instantiations of your pattern" instead of "current instantiation is destroyed".

(I assume that means having to keep the same matter.)

You're conflating two things there - both options being equally correct and nothing forcing us to pick the other option. It's true nothing forces us to pick the other option, but once we make an analysis of what consciousness, the continuity of consciousness and qualia are, it turns out the correct reduction is to the pattern, and not to the pattern+substance. People who pick other reductions made a mistake in their reasoning somewhere along the way.

This isn't purely a matter of who wins in a philosophy paper. We'll have mind uploading relatively soon, and the deaths of people who decide, to show off their wit, to define their substance as a part of them, will be as needless as that of someone who refuses to leave a burning bus because they're defining it as a part of them.

[-]Signer2y0-3

I pretty much agree with your restatement of my position, but you didn't present arguments for yours. Yes, I'm saying that all high-level concepts are value-laden. You didn't specify what you mean by "correct", but for "corresponds to reality" it just doesn't make sense for the definition of death to be incorrect. What do you even mean that it is incorrect, when it describes the same reality?

That’s potentially a good objection, but a high-level concept already has a set of properties, explanatorily prior to it being reduced.

Yeah, they are called "preferences" and the problem is that they are in different language.

I mean, it's not exactly inconsistent to have a system of preferences about how you resolve ontological shifts and to call that subset of preferences "correct" and nothing is really low-level anyway, but... you do realize that mistakes in reduction are not the same things as mistakes about reality?

This isn’t purely a matter of who wins in a philosophy paper.

So is killing someone by separating them from their bus.

[-]green_leaf2y2-1

Yes, I'm saying that all high-level concepts are value-laden.

No, concepts are value-neutral. Caring about a concept is distinct from whether a given reduction of a concept is correct, incorrect or arbitrary.

it just doesn't make sense for the definition of death to be incorrect

You're confusing semantics and ontology. While all definitions are arbitrary, the ontology of any given existing thing (like consciousness) is objective. (So far it seems that in your head, the arbitrariness of semantics somehow spills over into arbitrariness of ontology, so you think you can just say that to preserve your consciousness, you need to keep the same matter, and it will really be that way.)

you do realize that mistakes in reduction are not the same things as mistakes about reality?

They are a subset of them (because by making a mistake in the first one, we'll end up mistakenly believing incorrect things about reality (namely, that whatever we reduced our high-level concept to will behave the same way we expect our high-level concept to behave)).

[-]Signer2y1-1

While all definitions are arbitrary, the ontology of any given existing thing (like consciousness) is objective.

How does this work? There is only one objective ontology - true physics. That we don't know it complicates things somewhat, but on our current understanding (ethically significant) consciousness is not an ontological primitive. Everything is just quantum amplitude. Nothing really changes whether you call some part of the universe "consciousness" or "chair" or whatever. Your reduction can't say things that contradict real ontology, of course - you can't say "chair is literally these atoms and also it teleports faster than light". But there is nothing that contradicts real ontology in "I am my body".

No, concepts are value-neutral. Caring about a concept is distinct from whether a given reduction of a concept is correct, incorrect or arbitrary.

There is no objective justification for a concept of a chair. AIXI doesn't need to think about chairs. Like, really, try to specify correctness of a reduction of chair without appeal to usefulness.

we’ll end up mistakenly believing incorrect things about reality

Wait, but we already assumed that we are using "I am my body" definition that is correct about reality. Well, I assumed. Being incorrect means there must be some atoms in your model that are not in their real place. But "I am my body" doesn't mean you forget that there will be another body with the same pattern at the destination of a teleporter. Or any other physical consequence. You still haven't specified what incorrect things about atoms "I am my body" entails.

Is it about you thinking that consciousness specifically is ontologically primitive or that "blacking out" can't be reduced to whatever you want or something and you would agree if we only talked about chairs? Otherwise I really want to see your specification of what does "correct reduction" means.

[-]green_leaf2y10

How does this work? There is only one objective ontology - true physics. That we don't know it complicates things somewhat, but on our current understanding (ethically significant) consciousness is not an ontological primitive.

It doesn't have to be. Aspects of the pattern are the correct ontology of consciousness. They're not ontologically primitive, but that's the correct ontology of consciousness. Someone else can use the word consciousness to denote something else, like a chair, but then they are no longer talking about consciousness. They didn't start talking about chairs (instead of consciousness) because they care about chairs. They started talking about chairs because they mistakenly believe that what we call consciousness reduces to chair-shaped collections of atoms that people can sit in, and if some random good-doer found for them the mistake they made in reducing the concept of consciousness, they would agree that they made a mistake, stop caring about chair-shaped collections of atoms that people can sit in and start caring about consciousness.

Otherwise I really want to see your specification of what does "correct reduction" means.

I don't have an explicit specification. Maybe it could be something like a process that maps a high-level concept to a lowest-level one while preserving the implicit and explicit properties of the concept.

[-]Signer2y-1-2

"Different bodies have different consciousness" is an implicit property of the concept. The whole problem of changing ontology is that you can't keep all the properties. And there is nothing except your current preferences that can decide which ones you keep.

They didn’t start talking about chairs (instead of consciousness) because they care about chairs.

In what parts (the reduction of) your concept of being mistaken is not isomorphic to caring? Or, if someone just didn't talk about high-level concepts at all, what are your explicit and implicit properties of correctness that are not satisfied by knowing where all the atoms are and still valuing your body?

[-]green_leaf2y10

"Different bodies have different consciousness" is an implicit property of the concept.

No, it's not, you just think it is. If you could reflect on all your beliefs, you'd come to the conclusion you were wrong. (For example, transplanting a brain to another body is something you'd (hopefully) agree preserves you. Etc.)

The whole problem of changing ontology is that you can't keep all the properties.

Why not? All properties are reducible, so you can reduce them along with the concept.

In what parts (the reduction of) your concept of being mistaken is not isomorphic to caring?

All of them. Those are two different concepts. Being mistaken in the reduction means making a logical error at some step that a computer could point at. Caring means having a high-level (or a low-level) concept in your utility function.

Or, if someone just didn't talk about high-level concepts at all, what are your explicit and implicit properties of correctness that are not satisfied by knowing where all the atoms are and still valuing your body?

If you mean not talking about them in the sense of not referring to them, I'd want to know how they reduced consciousness (or their personal survival) if they couldn't refer to those concepts in the first place. If they were a computer who already started out as only referring to low-level concepts, they might not be making any mistakes, but they don't care about the survival of anyone's consciousness. No human is like that.

[-]Signer2y-1-2

Wait, "logical" error? Like, you believe that "transplanting a brain to another body preserves you" is a theorem of QFT + ZFC or something? That... doesn't make sense - there is no symbol for "brain" in QFT + ZFC.

No, it’s not, you just think it is. If you could reflect on all your beliefs, you’d come to the conclusion you were wrong.

How does it make that conclusion correct?

If you mean not talking about them in the sense of not referring to them, I’d want to know how they reduced consciousness (or their personal survival) if they couldn’t refer to those concepts in the first place.

I mean after they stopped believing in (and valuing) soul they switched to valuing physically correct description of their body without thinking whether it was correct reduction of a soul. And not being a computer they are not perfectly correct in their description, but the point is why not help them correct their description and make them more like a computer valuing the body? Where is the mistake in that?

Or even, can you give an example of just one correct step in the reasoning about what are the real properties of a concept (of a chair or consciousness or whatever)?

[-]green_leaf2y10

Wait, "logical" error? Like, you believe that "transplanting a brain to another body preserves you" is a theorem of QFT + ZFC or something? That... doesn't make sense - there is no symbol for "brain" in QFT + ZFC.

Why would ZFC have to play a role there? By a logical error, I had in mind committing a contradiction, an invalid implication, etc.

In other words, if you consider the explicit properties you believe the concept of yourself to have, and then you compare them against other beliefs you already hold, you'll discover a contradiction which can be only remedied by accepting the reduction of "you" into a substanceless pattern. There is no other way.

I mean after they stopped believing in (and valuing) soul they switched to valuing physically correct description of their body without thinking whether it was correct reduction of a soul.

That's not possible. There must've been some step in between, even if it wasn't made explicit, in their decision chain. (For example, they were seeking what low-level concept would fit their high-level concept of themselves (since a soul can no longer fit the bill) and didn't do the reduction correctly.)

Or even, can you give an example of just one correct step in the reasoning about what are the real properties of a concept (of a chair or consciousness or whatever)?

For example, that step could be imagining slowly replacing your neurons by mechanical ones performing the same function. (Then there would be subsequent steps, which would end with concluding the only possible reduction is to a substanceless pattern.)

[-]Signer2y-1-2

By a logical error, I had in mind committing a contradiction, an invalid implication, etc.

Implication from what? There is no chain of implications that starts with "I think I value me" and "everything is atoms" and ends with “transplanting a brain to another body preserves me”. Unless you already have laws for how you reduce things.

In other words, if you consider the explicit properties you believe the concept of yourself to have, and then you compare them against other beliefs you already hold, you’ll discover a contradiction which can be only remedied by accepting the reduction of “you” into a substanceless pattern. There is no other way.

Your beliefs and explicit properties are in different ontologies - there is no law for comparing them. If your current reduction of yourself contradicts your beliefs, you can change your reduction. Yes, a substanceless pattern is a valid change of reduction. But a vast space of other reductions is also contradiction-free (physically possible, in other words) - "I am my body" doesn't require atoms to be in wrong places. You didn't present an example of where it does, so you agree, right?

If by "beliefs" you mean high-level approximations, like in addition to "I am my body", you have "I remain myself after sleep" and then you figure out atoms and start to use "the body after sleep is not really the same", then obviously there are many other ways to resolve this instead of "I am substanceless pattern". There is nothing preventing you from saying that "body" should mean different things in "I am my body" and "the body after sleep", you can conclude that you are not you after sleep - why is one explicit property better than another if excluding either solves the contradiction? Like I said, is it about consciousness specifically, where you think people can't be wrong about what they point to when they think about blacking out? Because it's totally possible to be wrong about your consciousness.

Then there would be subsequent steps

"To be sure, Fading Qualia may be logically possible. Arguably, there is no contradiction in the notion of a system that is so wrong about its experiences." So, by "correct" you mean "doesn't feel implausible"? Or what else makes imagining slowly replacing your neurons "correct"?

I mean, where did you even get the idea that it is possible to derive anything ethical using only correctness? That's is/ought distinction, isn't it?

[-]green_leaf2y10

There is no chain of implications that starts with "I think I value me" and "everything is atoms" and ends with “transplanting a brain to another body preserves me”.

Right, you need more than those two statements. (Also, the first one doesn't actually help - it doesn't matter to the conclusion if you value yourself or not.)

contradiction-free (physically possible, in other words)

Contradiction-free doesn't mean physically possible.

"I am my body" doesn't require atoms to be in wrong places

Right. The contradiction is in your brain in the form of the data encoded there. It's not an incorrect belief about where atoms are.

in addition to "I am my body", you have "I remain myself after sleep" and then you figure out atoms and start to use "the body after sleep is not really the same", then obviously there are many other ways to resolve this instead of "I am substanceless pattern".

There are. The problem is that there is more than one ("I remain myself after sleep") statement and if you consider all of them together, there is no longer another way.

you can conclude that you are not you after sleep

You can't. Nobody can actually believe that.

Like I said, is it about consciousness specifically, where you think people can't be wrong about what they point to when they think about blacking out? Because it's totally possible to be wrong about your consciousness.

People can be wrong when doing this sort of reasoning, but the solution isn't to postulate the answer by an axiom. The solution is to be really careful about the reasoning.

"To be sure, Fading Qualia may be logically possible. Arguably, there is no contradiction in the notion of a system that is so wrong about its experiences." So, by "correct" you mean "doesn't feel implausible"?

That would require some extremely convoluted theory of consciousness that nobody could believe. (For example, it would contradict one of the things you said previously, where a consciousness belongs to a macroscopic, spatially extended object (like a human body), and that's what makes the object experience that consciousness. (That wouldn't be possible on the theory of Fading Qualia (because Joe from the thought experiment doesn't have fully functioning consciousness even though both the consciousness and his body function correctly, etc.).))

I mean, where did you even get the idea that it is possible to derive anything ethical using only correctness?

Oh, I don't have ethics in mind there.

[-]Signer2y0-1

The problem is that there is more than one (“I remain myself after sleep”) statement and if you consider all of them together, there is no longer another way.

Well, yes, there are other statements - “I am my body” and “I remain myself after sleep” are among them. If your way allows contradicting “I am my body” then it's not the only contradiction-free way, and other ways (that contradict other initial statements) are on the same footing. At least as far as logic goes.

The contradiction is in your brain in the form of the data encoded there. It’s not an incorrect belief about where atoms are.

Then patternist identity encodes contradiction to “I am my body” in the same way. And if your choice of statements to contradict is not determined by either logic or beliefs about atoms, then it is determined by your preferences. There is just not much other kinds of stuff in the universe.

That would require some extremely convoluted theory of consciousness that nobody could believe.

So like I said, requirement for a theory of consciousness to not be convoluted is just your preference. Just like any definition of what it means for someone to actually believe something - it's not logic that forces you, because as long as you have a contradiction anyway, you can say that someone was wrong about themselves not believing in convoluted theory of consciousness - and not knowledge about reality. That's why it's about ethics. Or why do you think someone should prefer non-convoluted theory of consciousness?

For example, it would contradict one of the things you said previously, where a consciousness belongs to a macroscopic, spatially extended object (like a human body), and that’s what makes the object experience that consciousness.

Nah, you can just always make it more convoluted^^. For example I could say that usually microscopic changes are safe, but changing neurons into silicon is too much and destroys consciousness.

[-]green_leaf2y10

Well, yes, there are other statements - “I am my body” and “I remain myself after sleep” are among them.

The first one can't be there. If you put it there and then we add everything else, there will be some statements we're psychologically incapable of disbelieving, and us being our body isn't among them, so it is that statement that will have to go.

And if your choice of statements to contradict is not determined by either logic or beliefs about atoms, then it is determined by your preferences. There is just not much other kinds of stuff in the universe.

There is a fourth kind of data - namely, what our psychological makeup determines we're capable of believing. (Those aren't our preferences.)

So like I said, requirement for a theory of consciousness to not be convoluted is just your preference.

Right, but the key part there isn't that it's convoluted, but that we're incapable of believing it.

[-]Signer2y-1-2

Caring about what our psychological makeup determines we’re capable of believing, instead of partially operating only on surface reasoning until you change your psychological makeup, is a preference. It's not a law that you must believe things in whatever sense you mean it for these things to matter. It may be useful for acquiring knowledge, but it's not "correct" to always do everything that maximally helps your brain know true things. It's not avoiding mistakes - it's just selling your soul for knowledge.

[-]green_leaf2y10

Caring about what our psychological makeup determines we’re capable of believing, instead of partially operating only on surface reasoning until you change your psychological makeup, is a preference.

You can't change your psychological makeup to allow you to hold a self-consistent system of beliefs that would include the belief that you are your body. Even if you could (which you can't), you haven't done it yet, so you can't currently hold such a system of beliefs.

It's not a law that you must believe things in whatever sense you mean it for these things to matter.

If you don't believe any system of statements that includes that you are your body, then you have no reason to avoid a mind upload or a teleporter.

If you want to declare that you have null beliefs about what you are and say that you only care about your physical body (instead of believing that that is you), that's not possible. Humans don't psychologically work like that.

it's not "correct" to always do everything that maximally helps your brain know true things

You can't avoid that. By the time you are avoiding doing something that would maximally help you know the truth, you already know your current belief is false.

[-]Jozdien2y30

I assume you also, then, think it would be horrible news to think you're in a simulation?

I don't see why this follows? I would find it sad if the substrate forming my self ceased to exist, but nothing about a simulation implies that - my "body" wouldn't exist, but there would be something somewhere that hosted the signals forming me.

Like, the mere fact that we don't know whether we're being destroyed and recreated innumerable times every second, seems to rule out the idea that there's an experiential difference here.

We don't know in the present, yeah. But in my model it's like if Omega decided to torture all of us in half the Everett branches leading forward - to the ones in the remaining branches, there wouldn't be any experiential difference. But we'd still value negatively that happening to copies of ourselves.

It's less intuitive in a case where Omega instead decides to kill us in those branches because death by definition isn't experiential - and I think that might lead to some ontological confusion? Like, I don't really see there being much of a difference between the two apart from that in one we suffer, and in the other we die. In the case where us being destroyed and recreated constantly is how reality works, I would think that's sad for the same reason it's sad if it were Omega making a copy of us constantly and torturing it instead.

I think it's possible we might agree on some of the practical points if "I anticipate dying as one of the experiences if I use a teletransporter" is valid in your eyes? My disagreement past that would only be that for a given body / substrate my thoughts are currently hosted on, I anticipate just the experiences those thoughts undergo, which I'll address in the last paragraph.

I can see how parts of this resemble soul-theory behaviourally (but not completely I think? Unless this is just my being bad at modelling soul-theorists, I don't think they'd consider a copy of themselves to be a perfectly normal sentient being in all ways), but I don't think they're the same internally (while this might explain why I believed it in the first place if the object-level arguments were wrong, I'm not convinced of that, so it doesn't feel that way to me on reflection).

It doesn't really matter whether I value "my" life over "his", because we're going to have the same experiences going forward regardless. But if something breaks this symmetry (note that at that point we'd need to clarify that there are two separate streams of consciousness here, not just one), my preference before the symmetry is broken should be indifferent between which of the copies gets the reward, because both versions of me bear the correct memory-relationship to my present self.

I'm somewhat sceptical about this. If a copy of yourself appeared before your eyes right now, and Omega riddled him with bullets until he died, would you at that moment assign that the same amount of negative utility as you having gotten shot yourself? Insofar as we assign the pain we experience a unique utility value relative to knowing someone else is experiencing it, it's (at least partly) because we're feeling it, and the you in that body wouldn't have. In that view, preferences before the symmetry is broken should still care about the specific body one copy is in. That's the difference I guess I'm talking about?

(I spent way too long writing this before deciding to just post it, so apologies if it seems janky).

[-]SMK2y20

What's your take on playing a PD against someone who is implementing a different decision algorithm to the one you are implementing, albeit strongly (logically) correlated in terms of outputs?

[-]JBlack2y20

I'm guessing that "strongly logically correlated in terms of outputs" means that it has the same outputs for a large fraction of inputs (but not all) according to some measure over the space of all possible inputs.

If that's all you know, then there will likely be nearly zero logical correlation between your outputs for this instance, and what you will decide depends upon what your decision algorithm does when there is close to zero logical correlation.

If you have more specific information than just existence of a strong logical correlation in general, then you should use it. For example, you may be told that the measure over which the correlation is taken is heavily weighted toward your specific inputs for this instance, and that the other player is given the same inputs. That raises the logical correlation between outputs for this instance, and (if your decision algorithm depends upon such things) you should cooperate.

[-]Rob Bensinger2y20

Depends on the decision algorithm! Do you have a specific one in mind?

E.g., LDT will defect against CDT.

[-]SMK2y41

I had something like the following in mind: you are playing the PD against someone implementing "AlienDT" which you know nothing about except that (i) it's a completely different algorithm to the one you are implementing, and (ii) that it nonetheless outputs the same action/policy as the algorithm you are implementing with some high probability (say 0.9), in a given decision problem.

It seems to me that you should definitely cooperate in this case, but I have no idea how logi-causalist decision theories are supposed to arrive at that conclusion (if at all).
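
A minimal sketch of the arithmetic behind this intuition, for readers who want to see the numbers. The payoff table is a standard but hypothetical PD matrix (not from the thread), and `p_match` is read as "the probability that AlienDT outputs the same action as me, whichever action I pick". That reading is exactly the contested assumption, so this only shows what follows if it is granted, not how any particular decision theory would justify it.

```python
# Hypothetical PD payoffs for "me" (illustrative numbers, not from the thread).
# Keys are (my_action, their_action).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def expected_utility(my_action: str, p_match: float) -> float:
    """Expected payoff if the other player copies my action with probability p_match."""
    other_action = "D" if my_action == "C" else "C"
    return (p_match * PAYOFF[(my_action, my_action)]
            + (1 - p_match) * PAYOFF[(my_action, other_action)])


def best_action(p_match: float) -> str:
    """Action with the higher expected payoff under the matching assumption."""
    return max(("C", "D"), key=lambda a: expected_utility(a, p_match))


if __name__ == "__main__":
    for p in (0.5, 0.9):
        print(p, expected_utility("C", p), expected_utility("D", p), best_action(p))
```

With these particular payoffs, cooperation comes out ahead once `p_match` exceeds 5/7; a different payoff table would move that threshold.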

[-]Rob Bensinger2y40

This is why I suggested naming FDT "functional decision theory" rather than "algorithmic decision theory", when MIRI was discussing names.

Suppose that Alice is an LDT Agent and Bob is an Alien Agent. The two swap source code. If Alice can verify that Bob (on the input "Alice's source code") behaves the same as Alice in the PD, then Alice will cooperate. This is because Alice sees that the two possibilities are (C,C) and (D,D), and the former has higher utility.

The same holds if Alice is confident in Bob's relevant conditional behavior for some other reason, but can't literally view Bob's source code. Alice evaluates counterfactuals based on "how would Bob behave if I do X? what about if I do Y?", since those are the differences that can affect utility; knowing the details of Bob's algorithm doesn't matter if those details are screened off by Bob's functional behavior.
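
As a rough illustration of "evaluating counterfactuals on Bob's functional behavior", here is a minimal sketch. The name `bob_response` is a hypothetical stand-in for whatever model Alice has of how Bob would act given each of her actions, and the payoffs are made-up PD numbers. It is not a claim about how LDT actually computes its counterfactuals, only about what the comparison looks like once that conditional behavior is known.

```python
from typing import Callable

# Hypothetical PD payoffs for Alice; keys are (alice_action, bob_action).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def alice_choice(bob_response: Callable[[str], str]) -> str:
    """Pick Alice's action by asking how Bob would behave under each of her options."""
    return max(("C", "D"), key=lambda a: PAYOFF[(a, bob_response(a))])


def mirror_bob(alice_action: str) -> str:
    """A Bob whose output Alice has verified always matches her own."""
    return alice_action


def always_defect_bob(alice_action: str) -> str:
    """A Bob who defects no matter what Alice does."""
    return "D"


if __name__ == "__main__":
    print(alice_choice(mirror_bob))         # only (C,C) and (D,D) are reachable
    print(alice_choice(always_defect_bob))  # Bob's action does not depend on Alice
```

Nothing in `alice_choice` inspects how `bob_response` is implemented, which is the sense in which the algorithmic details are screened off by the functional behavior.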

[-]SMK2y11

The same holds if Alice is confident in Bob's relevant conditional behavior for some other reason, but can't literally view Bob's source code. Alice evaluates counterfactuals based on "how would Bob behave if I do X? what about if I do Y?", since those are the differences that can affect utility; knowing the details of Bob's algorithm doesn't matter if those details are screened off by Bob's functional behavior.

Hm. What kind of dependence is involved here? Doesn't seem like a case of subjunctive dependence as defined in the FDT papers; the two algorithms are not related in any way beyond that they happen to be correlated.

Alice evaluates counterfactuals based on "how would Bob behave if I do X? what about if I do Y?", since those are the differences that can affect utility...

Sure, but so do all agents that subscribe to standard decision theories. The whole DT debate is about what that means.

[-]Bunthut2y10

The idea is that we can break any decision problem down by cases (like "insofar as the predictor is accurate, ..." and "insofar as the predictor is inaccurate, ...") and that all the competing decision theories (CDT, EDT, LDT) agree about how to aggregate cases.

Doesn't this also require that all the decision theories agree that the conditioning fact is independent of your decision?

Otherwise you could break down the normal prisoner's dilemma into "insofar as the opponent makes the same move as me" and "insofar as the opponent makes the opposite move" and conclude that defect isn't the dominant strategy even there, not even under CDT.

And I imagine the within-CDT perspective would reject an independent probability for the predictor's accuracy. After all, there's an independent probability it guessed 1-box, and if I 1-box it's right with that probability, and if I 2-box it's right with 1 minus that probability.
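
A small sketch of that last point, assuming the usual Newcomb amounts ($1,000,000 in the opaque box if the predictor guessed 1-box, $1,000 in the transparent box; these figures are assumptions, not from the comment). If the guess is held fixed independently of the action, then "the predictor is accurate" has a probability that depends on the action, so it cannot serve as an action-independent case to split on.

```python
BIG = 1_000_000   # assumed opaque-box prize if the predictor guessed 1-box
SMALL = 1_000     # assumed transparent-box prize


def accuracy_given_action(p_guessed_one_box: float, action: str) -> float:
    """P(predictor was right | action), with the guess independent of the action."""
    return p_guessed_one_box if action == "1-box" else 1 - p_guessed_one_box


def cdt_value(p_guessed_one_box: float, action: str) -> float:
    """Expected payout when the guess probability is fixed regardless of the action."""
    opaque = p_guessed_one_box * BIG
    return opaque + (SMALL if action == "2-box" else 0)


if __name__ == "__main__":
    q = 0.7
    print(accuracy_given_action(q, "1-box"), accuracy_given_action(q, "2-box"))
    print(cdt_value(q, "1-box"), cdt_value(q, "2-box"))
```

Under this framing two-boxing is ahead by the transparent-box amount for every value of the guess probability, while the accuracy swings between q and 1 - q depending on the action.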

[-]benjamincosman2y10

Typo: yolk -> yoke

[-]So8res2y20

(fixed thanks)

[-]Harlan2y10

One of my most-confident guesses about anthropics is that being multiply-instantiated in other ways is analogous. For instance, if there are two identical physical copies of you (in physical rooms that are identical enough that you're going to make the same observations for the length of the hypothetical, etc.), then my guess is that there isn't a real question about which one is you. They are both you. You are the pattern, not the meat.

Thinking about identical brains as the same person is an interesting idea, and I think it's useful for reasoning about some decision puzzles.

To anyone thinking about this idea, it has some important limitations. Don't try to use it in domains where counting the number of individuals/observers is important. If you roll a die 100 times and it keeps coming up "6" then you should update towards it being a loaded die, even though there are infinite copies of every brain state experiencing every possibility of the die rolls. If you're in a trolley problem where the five people on the track have identical brains, you should still pull the lever, or else utilitarian ethics don't work (and if you're going to bite the bullet that utilitarian ethics don't work because of this, you have to also bite the bullet on reasoning about the world from your own observations not working, which it obviously does).

Here's a Bostrom paper talking about this https://nickbostrom.com/papers/experience.pdf
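
The die example above is an ordinary single-observer Bayesian update, which can be sketched as follows. The prior of one in a million and the loaded die's 50% chance of showing a six are made-up illustrative values, and the only alternative hypothesis considered is a fair die.

```python
import math


def posterior_loaded(n_sixes: int, prior: float = 1e-6, p_six_loaded: float = 0.5) -> float:
    """Posterior probability that the die is loaded after n_sixes sixes in a row.

    Works in log-odds: each six multiplies the odds by p_six_loaded / (1/6).
    """
    log_odds = (math.log(prior) - math.log(1 - prior)
                + n_sixes * (math.log(p_six_loaded) - math.log(1 / 6)))
    return 1 / (1 + math.exp(-log_odds))


if __name__ == "__main__":
    for n in (0, 5, 10, 20, 100):
        print(n, posterior_loaded(n))
```

The update counts observations in the usual way; nothing in it treats the existence of other copies seeing other sequences as a reason to discount the evidence.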
