This is a special post for quick takes by Anthony DiGiovanni.

I continue to be puzzled as to why many people on LW are very confident in the "algorithmic ontology" about decision theory:

So I see all axes except the "algorithm" axis as "live debates" -- basically anyone who has thought about it very much seems to agree that you control "the policy of agents who sufficiently resemble you" (rather than something more myopic like "your individual action")

Can someone point to resources that clearly argue for this position? (I don't think that, e.g., the intuition that you ought to cooperate with your exact copy in a Prisoner's Dilemma — much as I share it — is an argument for this ontology. You could endorse the physicalist ontology + EDT, for example.)

I can't point you to existing resources, but from my perspective, I assumed an algorithmic ontology because it seemed like the only way to make decision theory well defined (at least potentially, after solving various open problems). That is, for an AI that knows its own source code S, you could potentially define the "consequences of me doing X" as the logical consequences of the logical statement "S outputs X". Whereas I'm not sure how this could even potentially be defined under a physicalist ontology, since it seems impossible for even an ASI to know the exact details of itself as a physical system.
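
To make that a bit more concrete, here is one possible way to write down the definition (a sketch, glossing over the open problems): for an agent whose source code is S, with utility function U and a prior P over worlds (including, prior to deduction, logically impossible ones),

EU(X) = Σ_w P(w | "S() = X") · U(w),

and the agent outputs the X maximizing EU(X). The event being conditioned on is a logical statement about the output of S, which is well defined once S is known; the physicalist analogue would need something like "this exact physical system does X", which is the part that seems hard to even potentially define.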

This does lead to the problem that I don't know how to apply LDT to humans (who do not know their own source code), which does make me somewhat suspicious that the algorithmic ontology might be a wrong approach (although physicalist ontology doesn't seem to help). I mentioned this as problem #6 in UDT shows that decision theory is more puzzling than ever.

ETA: I was (and still am) also under strong influence of Tegmark's Mathematical universe hypothesis. What's your view on it?

Thanks, that's helpful!

I am indeed interested in decision theory that applies to agents other than AIs that know their own source code. Though I'm not sure why it's a problem for the physicalist ontology that the agent doesn't know the exact details of itself — seems plausible to me that "decisions" might just be a vague concept, which we still want to be able to reason about under bounded rationality. E.g. under physicalist EDT, what I ask myself when I consider a decision to do X is, "What consequences do I expect conditional on my brain-state going through the process that I call 'deciding to do X' [and conditional on all the other relevant info I know including my own reasoning about this decision, per the Tickle Defense]?" But I might miss your point.
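
To make the conditioning explicit (just a sketch; the exact formulation isn't load-bearing): writing D_X for "my brain-state goes through the process I call 'deciding to do X'" and E for everything else I know, including my introspective evidence about my own deliberation, physicalist EDT evaluates

V(X) = Σ_o P(o | D_X, E) · U(o)

and picks the X with the highest V(X). Conditioning on E is the Tickle Defense move that's supposed to screen off spurious correlations between my decision and its upstream causes, and none of this requires knowing the exact physical details of my brain.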

Re: mathematical universe hypothesis: I'm pretty unconvinced, though I at least see the prima facie motivation (IIUC: we want an explanation for why the universe we find ourselves in has the dynamical laws and initial conditions it does, rather than some others). Not an expert here; this is just based on some limited exploration of the topic. My main objections:

  • The move from "fundamental physics is very well described by mathematics" to "physics is (some) mathematical structure" seems like a map-territory error. I just don't see the justification for this.
  • I worry about giving description-length complexity a privileged status when setting priors / judging how "simple" a hypothesis is. The Great Meta-Turing Machine in the Sky as described by Schmidhuber scores very poorly by the speed prior (rough comparison of the two priors after this list).
  • It's very much not obvious to me that conscious experience is computable. (This is a whole can of worms in this community, presumably :).)
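
To unpack the speed-prior point (rough rendering, not the exact definitions): a Solomonoff-style prior weights a hypothesis only by the length of the programs that generate the data, whereas Schmidhuber's speed prior also penalizes their running time, schematically

M(x) ∝ Σ_{p : U(p) = x} 2^{-|p|}    vs.    S(x) ∝ Σ_{p : U(p) = x} 2^{-|p|} / t_p(x),

where |p| is the program's length and t_p(x) the time it takes to output x. A "run every possible program" hypothesis has a very short program but astronomical runtime, so it scores well under M and very poorly under S.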

how does a physicalist ontology differ from an algorithmic ontology in terms of the math?

Not sure what you mean by "the math" exactly. I've heard people cite the algorithmic ontology as a motivation for, e.g., logical updatelessness, or for updateless decision theory generally. In the case of logical updatelessness, I think (low confidence!) the idea is that if you don't see yourself as this physical object that exists in "the real world," but rather see yourself as an algorithm instantiated in a bunch of possible worlds, then it might be sensible to follow a policy that doesn't update on e.g. the first digit of pi being odd.
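
A toy version of the kind of case I have in mind (my own illustration with made-up payoffs, not anyone's canonical numbers): counterfactual mugging run on a "logical coin", say the parity of some far-out digit of pi. Evaluated before "observing" the parity, committing to pay looks best; after updating on the parity, paying looks like a pure loss.

```python
# Counterfactual mugging on a "logical coin" (e.g. the parity of some far-out
# digit of pi), with illustrative payoffs. Omega, a reliable predictor of your
# policy, asks you to pay 10 if the digit is even; if it is odd, Omega pays you
# 100 iff you would have paid in the even case.

def policy_value(pay_if_even: bool, p_even: float = 0.5) -> float:
    """Expected value of a policy evaluated *before* updating on the digit's
    parity, treating logical uncertainty about it as a 50/50 credence."""
    value_if_even = -10.0 if pay_if_even else 0.0
    value_if_odd = 100.0 if pay_if_even else 0.0
    return p_even * value_if_even + (1.0 - p_even) * value_if_odd

print(policy_value(pay_if_even=True))   # 45.0 -> the updateless policy pays
print(policy_value(pay_if_even=False))  # 0.0

# After conditioning on "the digit is even", paying is -10 vs. 0, so an agent
# that updates on that logical fact refuses -- which is the tension that
# logical updatelessness is meant to resolve.
```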

query rephrase: taboo both "algorithmic ontology" and "physicalist ontology". describe how each of them constructs math to describe things in the world, and how that math differs. That is, if you're saying you have an ontology, presumably this means you have some math and some words describing how the math relates to reality. I'm interested in a comparison of that math and those words; so far you're saying things about a thing I don't really understand as being separate from physicalism. Why can't you just see yourself as multiple physical objects and still have a physicalist ontology? what makes these things different in some, any, math, as opposed to only being a difference in how the math connects to reality?

I think I just don't understand / probably disagree with the premise of your question, sorry. I'm taking as given whatever distinction between these two ontologies is noted in the post I linked. These don't need to be mathematically precise in order to be useful concepts.

ah my bad, my attention missed the link! that does in fact answer my whole question, and if I hadn't missed it I'd have had nothing to ask :)

There isn't a difference, but only because algorithms can simulate all of the physical stuff in the physical ontology; the physical ontology is thus a special case of the algorithmic ontology.

Helpful link here to build intuition:

http://www.amirrorclear.net/academic/ideas/simulation/index.html

That's what I already believed, but OP seems to disagree, so I'm trying to understand what they mean.

The argument for this is spelled out in Eliezer and Nate's Functional Decision Theory: A New Theory of Instrumental Rationality. See also the LessWrong wiki tag page.

Thanks — do you have a specific section of the paper in mind? Is the idea that this ontology is motivated by "finding a decision theory that recommends verdicts in such and such decision problems that we find pre-theoretically intuitive"?

That sounds like a good description of my understanding, but I'd also say the pre-theoretic intuitions are real damn convincing!

There's a table of contents which you can use to read relevant sections of the paper. You know your cruxes better than I do.

shrug — I guess it's not worth rehashing pretty old-on-LW decision theory disagreements, but: (1) I just don't find the pre-theoretic verdicts in that paper nearly as obvious as the authors do, since these problems are so out-of-distribution. Decision theory is hard. Also, some interpretations of logical decision theories give the pre-theoretically "wrong" verdict on "betting on the past." (2) I pre-theoretically find the kind of logical updatelessness that some folks claim follows from the algorithmic ontology pretty bizarre. (3) On its face it seems more plausible to me that algorithms just aren't ontologically basic; they're abstractions we use to represent (physical) input-output processes.

Linkpost: "Against dynamic consistency: Why not time-slice rationality?"

This got too long for a "quick take," but also isn't polished enough for a top-level post. So onto my blog it goes.

I’ve been skeptical for a while of updateless decision theory, diachronic Dutch books, and dynamic consistency as a rational requirement. I think Hedden's (2015) notion of time-slice rationality nicely grounds the cluster of intuitions behind this skepticism.

Endorsement on reflection is not straightforward: even states of knowledge, representations of values, or ways of interpreting them can fail to be endorsed. It's not good from my perspective for someone else to lose themselves and start acting in my interests. But it is good for them to find themselves if they are confused about what they should endorse on reflection.

From 0-my perspective (me at time 0), it's good for 1-me (me at time 1) to believe updatelessness is rational, even if from 1-my perspective it isn't.

I'm afraid I don't understand your point — could you please rephrase?

Values can say things about how agents think and about the reasons behind outcomes, not just the outcomes themselves. An object-level moral point that gestures at the issue: it's not actually good when a person gets confused or manipulated and starts working towards an outcome that I prefer. That is, I don't prefer an outcome when it's bundled with a world that produced it in this way, even if I would prefer the outcome considered on its own. So I disagree with the claim that, even granting "from 1-my perspective it's not good to do X", it's still the case that "from 0-my perspective it's good for 1-me to believe that they should do X".

The metaethical point, about interpretation of states of knowledge or values not being straightforward, concerns the nature of possible confusion about what an agent might value. There is a setting where decision theory is sorted out and values are specified explicitly, so that the notion of their being confused is not under consideration. But if we do entertain the possibility of confusion, that the design isn't yet settled, or that there is no reflective stability, then the thing that's currently written down as "values" and determines immediate actions has little claim to being the actual values.

Claims about counterfactual value of interventions given AI assistance should be consistent

A common claim I hear about research on s-risks is that it’s much less counterfactual than alignment research, because if alignment goes well we can just delegate it to aligned AIs (and if it doesn’t, there’s little hope of shaping the future anyway).

I think there are several flaws with this argument that require more object-level context (see this post).[1] But at a high level, this consideration—that research/engineering can be delegated to AIs that pose little-to-no risk of takeover—should also make us discount the counterfactual value of alignment research/engineering. The main plan of OpenAI’s alignment team, and part of Anthropic’s plan and those of several thought leaders in alignment, is to delegate alignment work (arguably the hardest parts thereof)[2] to AIs.

It’s plausible (and apparently a reasonably common view among alignment researchers) that:

  1. Aligning models on tasks that humans can evaluate just isn’t that hard, and would be done by labs for the purpose of eliciting useful capabilities anyway; and
  2. If we restrict to using predictive (non-agentic) models for assistance in aligning AIs on tasks humans can’t evaluate, they will pose very little takeover risk even if we don’t have a solution to alignment for AIs at their limited capability level.

It seems that if these claims hold, lots of alignment work would be made obsolete by AIs, not just s-risk-specific work. And I think several of the arguments for humans doing some alignment work anyway apply to s-risk-specific work:

  • In order to recognize what good alignment work (or good deliberation about reducing conflict risks) looks like, and provide data on which to finetune AIs who will do that work, we need to practice doing that work ourselves. (Christiano here, Wentworth here)
  • To the extent that working on alignment (or s-risks) ourselves gives us / relevant decision-makers evidence about how fundamentally difficult these problems are, we’ll have better guesses as to whether we need to push for things like avoiding deploying the relevant kinds of AI at all. (Christiano again)
  • For seeding the process that bootstraps a sequence of increasingly smart aligned AIs, you need human input at the bottom to make sure that process doesn’t veer off somewhere catastrophic—garbage in, garbage out. (O’Gara here.) AIs’ tendencies towards s-risky conflicts seem to be, similarly, sensitive to path-dependent factors (in their decision theory and priors, not just values, so alignment plausibly isn’t sufficient).

I would probably agree that alignment work is more likely to make a counterfactual difference to P(misalignment) than s-risk-targeted work is to make a counterfactual difference to P(s-risk), overall. But the gap seems to be overstated (and other prioritization considerations can outweigh this one, of course).

  1. ^

    That post focuses on technical interventions, but a non-technical intervention that seems pretty hard to delegate to AIs is to reduce race dynamics between AI labs, which lead to an uncooperative multipolar takeoff.

  2. ^

    I.e., the hardest part is ensuring the alignment of AIs on tasks that humans can't evaluate, where the ELK problem arises.

I do in fact discount the counterfactual value of alignment for exactly this reason, BTW.

I would probably agree that alignment work is more likely to make a counterfactual difference to P(misalignment) than s-risk-targeted work is to make a counterfactual difference to P(s-risk), overall. But the gap seems to be overstated (and other prioritization considerations can outweigh this one, of course).

Agree with this point in particular.

Is God's coin toss with equal numbers a counterexample to mrcSSA?

I feel confused as to whether minimal-reference-class SSA (mrcSSA) actually fails God's coin toss with equal numbers (where "failing" by my lights means "not updating from 50/50"):

  • Let H = "heads world", W_{me} = "I am in a white room, [created by God in the manner described in the problem setup]", R_{me} = "I have a red jacket."
  • We want to know P(H | W_{me}, R_{me}).
  • First, P(R_{me} | W_{me}, H) and P(R_{me} | W_{me}, ~H) seem uncontroversial: Once I've already conditioned on my own existence in this problem, and on who "I" am, but before I've observed my jacket color, surely I should use a principle of indifference: 1 out of 10 observers of existing-in-the-white-room in the heads world have red jackets, while all of them have red jackets in the tails world, so my credences are P(R_{me} | W_{me}, H) = 0.1 and P(R_{me} | W_{me}, ~H) = 1. Indeed we don't even need a first-person perspective at this step — it's the same as computing P(R_{Bob} | W_{Bob}, H) for some Bob we're considering from the outside.
    • (This is not the same as non-mrcSSA with reference class "observers in a white room," because we're conditioning on knowing "I" am an observer in a white room when computing a likelihood (as opposed to computing the posterior of some world given that I am an observer in a white room). Non-mrcSSA picks out a particular reference class when deciding how likely "I" am to observe anything in the first place, unconditional on "I," leading to the Doomsday Argument etc.)
  • The step where things have the potential for anthropic weirdness is in computing P(W_{me} | H) and P(W_{me} | ~H). In the Presumptuous Philosopher and the Doomsday Argument, at least, probabilities like this would indeed be sensitive to our anthropics.
  • But in this problem, I don't see how mrcSSA would differ from non-mrcSSA with the reference class R_{non-minimal} = "observers in a white room" used in Joe's analysis (and by extension, from SIA):
    • In general, SSA says P(w | I'm in epistemic situation E) ∝ P(w) · [# observers in w who are in my reference class R and in situation E] / [# observers in w who are in R].
    • Here, the supposedly "non-minimal" reference class R_{non-minimal} coincides with the minimal reference class! I.e., it's the observer-moments in your epistemic situation (of being in a white room), before you know your jacket color.
  • The above likelihoods plus the fair-coin prior are all we need to get P(H | R_{me}, W_{me}), but at no point did the three anthropic views disagree.

In other words: It seems that the controversial step in anthropics is in answering P(I [blah] | world), i.e., what we do when we introduce the indexical information about "I." But once we've picked out a particular "I," the different views should agree (worked numbers below).
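
Putting numbers on this, using the likelihoods above and the premise that P(W_{me} | H) = P(W_{me} | ~H) (so those terms cancel):

P(H | W_{me}, R_{me}) = P(R_{me} | W_{me}, H) · P(H) / [P(R_{me} | W_{me}, H) · P(H) + P(R_{me} | W_{me}, ~H) · P(~H)] = (0.1 · 0.5) / (0.1 · 0.5 + 1 · 0.5) = 1/11 ≈ 0.09,

an update from 50/50 toward tails, which (if I've set this up right) is the verdict SIA and non-mrcSSA give as well.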

(I still feel suspicious of mrcSSA's metaphysics for independent reasons, but am considerably less confident in that than my verdict on God's coin toss with equal numbers.)

It seems that what I was missing here was: mrcSSA disputes my premise that the evidence in fact is "*I* am in a white room, [created by God in the manner described in the problem setup], and have a red jacket"!

Rather, mrcSSA takes the evidence to be: "Someone is in a white room, [created by God in the manner described in the problem setup], and has a red jacket." Which is of course certain to be the case given either heads or tails.

(h/t Jesse Clifton for helping me see this)
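
(Spelling that out in the same notation: writing E_{someone} = "someone is in a white room, created as described, and has a red jacket", we get P(E_{someone} | H) = P(E_{someone} | ~H) = 1, so P(H | E_{someone}) = P(H) = 0.5, i.e., no update, which is mrcSSA's verdict.)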