Currently working as a Community Associate at the Center on Long-Term Risk, and as an independent s-risk researcher. Former scholar at SERI MATS 4.1 (multipolar stream), former summer research fellow at the Center on Long-Term Risk, and former intern at the Center for Reducing Suffering, Wild Animal Initiative, and Animal Charity Evaluators. Former co-head of Haverford Effective Altruism.
Research interests:
• AI alignment
• animal advocacy from a longtermist perspective
• acausal interactions
• artificial sentience
• commitment races
• s-risks
• updatelessness
Feel free to contact me for whatever reason! You can set up a meeting with me here.
Thanks for the cool discussion, Ryan and Nate! This thread seemed pretty insightful to me. Here are some thoughts and things I'd like to clarify (mostly responding to Nate's comments).[1]
Who’s doing this trade?
In places it sounds like Ryan and Nate are talking about predecessor civilisations like humanity agreeing to the mutual insurance scheme? But humans aren't currently capable of making our decisions logically dependent on those of aliens, or of rescuing them. So, to be precise, the entity engaging in this scheme or other acausal interactions on our behalf is our successor, probably a FAI, in the (possibly counterfactual or counterlogical) worlds where we solve alignment.
Nate says:
Roughly speaking, I suspect that the sort of civilizations that aren't totally fucked can already see that "comb through reality for people who can see me and make their decisions logically dependent on mine" is a better use of insurance resources, by the time they even consider this policy.
Unlike us, our FAI can see other aliens. So I think the operative part of that sentence is “comb through reality”—Nate’s envisioning a scenario where with ~85% probability our FAI has 0 reality-fluid before any acausal trades are made.[2] If aliens restrict themselves to counterparties with nonzero reality-fluid, and humans turn out to not be at a competence level where we can solve alignment, then our FAI doesn’t make the cut.
Note: Which FAI we deploy is unlikely to be physically overdetermined in scenarios where alignment succeeds, and definitely seems unlikely to be determined by more coarse-grained (not purely physical) models of how a successor to present-day humanity comes about. (The same goes for which UFAI we deploy.) I’m going to ignore this fact for simplicity and talk about a single FAI; let me know if you think it causes problems for what I say below.
Trading with nonexistent agents is normal
I do see an argument that agents trying to do insurance with similar motives to ours could strongly prefer to trade with agents who do ex post exist, and in particular those agents that ex post exist with more reality-fluid. It’s that insurance is an inherently risk-averse enterprise.[3] It doesn’t matter if someone offers us a fantastic but high-variance ex ante deal, when the whole reason we’re looking for insurance is in order to maximise the chances of a non-sucky ex post outcome. (One important caveat is that an agent might be able to do some trades to first increase their ex ante resources, and then leverage those increased resources in order to purchase better guarantees than they’d initially be able to buy.)
On the other hand, I think an agent with increasing utility in resources will readily trade with counterparties who wouldn’t ex post exist absent such a trade, but who have some ex ante chance of naturally existing according to a less informed prior of the agent. I get the impression Nate thinks agents would avoid such trades, but I’m not sure / this doesn’t seem to be explicit outside of the mutual insurance scenario.
There are two major advantages to trading with ex post nonexistent agents, as opposed to updating on (facts upstream of) their existence and consequently rejecting trade with them:
The main disadvantage is that by not conditioning on a counterparty’s existence you’re more likely to be throwing resources away ex post. The counterparty needs to be able to compensate you for this risk (as the mugger does in counterfactual mugging). I’d expect this bar is going to be met very frequently.
To recap, I'm saying that for plausible agents carrying out trades with our FAI, Nate's 2^-75 number won't matter. Instead, what matters would be something closer to the 85% number—an ex ante rather than an ex post estimate of the FAI's reality-fluid.
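To make the ex ante / ex post contrast concrete, here's a minimal sketch with made-up numbers, assuming the 15% deployment prior from the footnote and Nate's 2^-75 as the post-updating estimate (the unit of promised resources is arbitrary):

```python
# Toy sketch (illustrative numbers only): how a counterparty might value the
# resources our FAI promises, depending on what it conditions on.

promised_resources = 1.0    # hypothetical unit of measure-weighted resources

ex_ante_prior = 0.15        # prior that humanity deploys this FAI at all
ex_post_estimate = 2**-75   # estimate after conditioning on facts that pin
                            # down humanity's competence at alignment

value_without_updating = ex_ante_prior * promised_resources   # 0.15 units
value_with_updating = ex_post_estimate * promised_resources   # ~2.6e-23 units

print(value_without_updating, value_with_updating)
```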
But would our FAI do the trade if it exists?
Nate says (originally talking about aliens instead of humanity):
As I understand it, the argument goes something like "your counterfactual picture of reality should include worlds in which your whole civilization turned out to be much much less competent, and so when you imagine the multiverse where you pay for all humanity to live, you should see that, in the parts of the multiverse where you're totally utterly completely incompetent and too poor to save anything but a fraction of your own brethren, somebody else pays to save you".
I agree that doing an insurance trade on behalf of a civilisation requires not conditioning on that civilisation’s competence. Nate implies that aliens’ civilisational competence is “tightly intertwined with [aliens’] basic reasoning faculties”, and this seems probably true for alien or human members of predecessor civilisations? But I don’t know why the civilisational competence of a FAI’s predecessor would be tightly intertwined with the FAI’s cognition. As mentioned above, I think the relevant actor here is our FAI, not our current selves.
We can further specify civilisational competence (relative to the stakes of alignment) as a function of two variables:
Either of these, when combined with the other, provides evidence about what sort of AI a predecessor civilisation deploys, but each will be uninformative on its own. I have in mind that agents executing an insurance trade would condition on all physical facts about their counterparty's civilisation—up until some truncation point that's plausibly late enough to capture our current selves—but would not condition on some logical facts that are necessary to interpret those physical facts into a ~determinate answer as to whether the civilisation solves alignment.
Conditioning on those logical facts sounds pretty analogous to conditioning on a digit of pi to me. The important part is that the facts an agent chooses not to condition on aren't determined by core parts of an agent's cognition / decision procedure. Those facts being determinative of an agent's amount of reality-juice is typically fine; this just discounts the value of the resources they possess when making such trades.
Does this mean we can have nice things?
So overall, I think that aliens or earth-originating UFAIs (who aren’t motivated by insurance) would be pretty interested in trading with our FAI, and vice-versa. Counterparties would discount the FAI’s resources by their prior probability that it’s deployed (before conditioning on factors that pin this down).
Because we’re assuming our FAI would be willing to offer terms that are terrible for us if denominated in measure-weighted resources, counterparties would gain ex ante resources by engaging in an insurance trade with it. Those resources could later be used to engage in trade with others who are themselves willing to (indirectly) trade with nonexistent agents, and who don’t have much more sceptical priors about the deployment odds of our FAI. So because the trade at hand yields a net profit, I don't think it competes much with ordinary alternative demands for counterparties’ resources.
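Here's a minimal sketch of that claim with invented numbers; "payment" is just a hypothetical amount the FAI hands over in the worlds where it exists, chosen to exceed the break-even point implied by the counterparty's prior:

```python
# Toy sketch (invented numbers): the counterparty pays rescue_cost in worlds
# where our FAI isn't deployed and collects payment in worlds where it is.
# Because the FAI overpays relative to the rescue cost, the counterparty
# profits in ex ante expectation.

p_deployed = 0.15    # counterparty's prior that our FAI gets deployed
rescue_cost = 1.0    # cost of rescuing humans in non-deployment worlds
payment = 10.0       # what the FAI hands over in deployment worlds

ex_ante_profit = p_deployed * payment - (1 - p_deployed) * rescue_cost
break_even_payment = rescue_cost * (1 - p_deployed) / p_deployed   # ~5.67

print(ex_ante_profit, break_even_payment)   # 0.65, ~5.67
```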
Nevertheless, here are a few (nonexhaustive) reasons why this trade opportunity might not be taken by another updateless AI:
Overall, I'd guess a 30% chance that humanity survives misalignment to a substantial extent through some sort of insurance trade, conditional on us not surviving to a substantial extent in another, cheaper way?
Other survival mechanisms
I’m pretty uncertain about how Evidential cooperation in large worlds works out, but at my current rough credences I do think there’s a good chance (15%) we survive through something which pattern-matches to that, or through other schemes that look similar but have more substantial differences (10%).
I also put some credence in there being very little of us in base reality, and some of those scenarios could involve substantial survival odds. (Though I weakly think the overall contribution of these scenarios is undesirable for us.)
Meta: I don't think figuring out insurance schemes is very important or time-sensitive for us. But I do think understanding the broader dynamics of acausal interactions that determine when insurance schemes would work could be very important and time-sensitive. Also note that I'd bet I misinterpreted some claims here, but I got to the point where it seemed more useful to post a response than to work on better comprehension. (In particular, I haven't read much on this page beyond this comment thread.)
I don’t think Nate thinks alignment would be physically overdetermined if misalignment winds up not being overdetermined, but we can assume for simplicity there’s a 15% chance of our FAI having all the reality fluid of the Everett branches we’re in.
I'm not clear on what the goal of this insurance scheme is exactly. Here's a (possibly not faithful) attempt: we want to maximise the fraction of reality-fluid devoted to minds initially ~identical to ours that are in very good scenarios as opposed to sucky ones, subject to a constraint that we not increase the reality-fluid devoted to minds initially ~identical to us in sucky scenarios. I’m kind of sympathetic to this—I think I selfishly care about something like this fraction. But it seems higher priority to me to minimise the reality-fluid devoted to us in sucky / terrible scenarios, and higher priority still to use any bargaining power we have for less parochial goals.
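In symbols (my own rough notation, only as faithful as the attempt above): writing G for the reality-fluid devoted to minds initially ~identical to ours in very good scenarios, S for the reality-fluid devoted to such minds in sucky scenarios, and S_0 for the latter quantity absent any trade, the goal would be roughly

```latex
\max \; \frac{G}{G+S} \quad \text{subject to} \quad S \le S_0
```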
It's definitely not clear to me that updatelessness + Yudkowsky's solution prevent threats. The core issue is that a target and a threatener face a prima facie symmetric decision problem of whether to use strategies that depend on their counterpart's strategy or strategies that do not depend on their counterpart's strategy.[1]
In other words, the incentive targets have to use non-dependent strategies that incentivise favourable (no-threat) responses from threateners is the same as the incentive threateners have to use non-dependent strategies that incentivise favourable (give-in-to-threat) responses from targets. This problem is discussed in more detail in parts of Responses to apparent rationalist confusions about game / decision theory and in Updatelessness doesn't solve most problems.
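Here's a minimal sketch of that symmetry as a made-up payoff matrix (the numbers aren't taken from either linked post; they're just chosen so that executed threats are costly for both sides):

```python
# Toy payoff matrix for the commitment problem described above (made-up
# numbers).  "non_dependent" = commit to a policy regardless of the
# counterpart's strategy; "dependent" = condition on a prediction of it.

payoffs = {  # (threatener_strategy, target_strategy): (threatener_payoff, target_payoff)
    ("non_dependent", "dependent"):     ( 1, -1),  # target caves: the threat pays off
    ("dependent",     "non_dependent"): ( 0,  0),  # threatener predicts no concession, so never threatens
    ("non_dependent", "non_dependent"): (-1, -5),  # threat made and executed: bad for both
    ("dependent",     "dependent"):     ( 0,  0),  # circular predictions; outcome unclear
}

# Each side does best when it alone plays non_dependent, which is the prima
# facie symmetric incentive the comment points at.
```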
There are potential symmetry breakers that privilege a no-threat equilibrium, such as the potential for cooperation between different targets. However, there are also potential symmetry breakers in the other direction. I expect Yudkowsky is aware of the symmetry of this problem and either thinks the symmetry breakers in favour of no-threats seem very strong, or is just very confident in the superintelligences-should-figure-this-stuff-out heuristic. Relatedly, this post argues that mutually transparent agents should be able to avoid most of the harm of threats being executed, even if they are unable to avoid threats from being made.
But these are different arguments to the one you make here, and I'm personally unconvinced even these arguments are strong enough that it's not very important for us to work on preventing harmful threats from being made by or against AIs that humanity deploys.
FYI, a lot of the Center on Long-Term Risk's research is motivated by this problem; I suggest reaching out to us if you're interested in working on it!
Examples of non-dependent strategies would include a target committing to never give in to threats, whatever the threatener does, or a threatener committing to carry out its threat whether or not the target would give in.
An example of a dependent strategy would be a target giving in if and only if it predicts that the threatener would actually carry out its threat.
How to deal with crucial considerations and deliberation ladders (link goes to a transcript + audio).
I like this post a lot! Three other reasons came to mind, which might be technically encompassed by some of the current ones, but which at least seemed to mostly fall outside the post's framing of them.
Some (non-agentic) repeated selections won't terminate until they find a bad thing
In a world with many AI deployments, an overwhelming majority of deployed agents might be unable to mount a takeover, but the generating process for new deployed agents might not halt until a rare candidate that can mount a takeover is found. More specifically, consider a world where AI progress slows (either due to governance interventions or a new AI winter), but people continue conducting training runs at a fairly constant level of sophistication. Suppose that, for these state-of-the-art training runs, (i) there is only a negligible chance of finding a non-gradient-hacked AI that can mount a takeover or enable a pivotal act, but (ii) there is a tiny but nonnegligible chance of finding a gradient hacker that can mount a takeover.[1] Then eventually we will stumble across an unlikely training run that produces a gradient hacker.
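A quick illustrative calculation (the per-run probability is invented) of why a tiny chance per run stops mattering once runs keep happening indefinitely:

```python
# Toy sketch (made-up per-run probability): even if each training run has only
# a tiny chance of producing a takeover-capable gradient hacker, the chance
# that *some* run eventually does approaches 1 as runs accumulate.

p_bad_per_run = 1e-6

for n_runs in (10**3, 10**6, 10**7):
    p_at_least_one = 1 - (1 - p_bad_per_run) ** n_runs
    print(n_runs, p_at_least_one)   # ~0.001, ~0.63, ~0.99995
```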
This problem mostly seems like a special case of You're being optimised against, though here you are not optimised against by an agent, but rather by the nature of the problem. Alternatively, this example could be lumped into The space you’re selecting over happens to mostly contain bad things if we either (i) reframe the space under consideration from "deployed AIs" to "AIs capable of mounting a takeover" (h/t Thomas Kehrenberg), or (ii) reframe The space you’re selecting over happens to mostly contain bad things to The space you’re selecting over happens to mostly contain bad things, relative to the number of selections made. But I think the fact that a selection may not terminate until a bad thing has been found is an important thing to pay attention to when it comes up, and weakly think it'd be useful to have a separate conceptual handle for it.
Aiming your efforts at worst-case scenarios
As long as some failure states are worse than others, optimising for the satisfaction of a binary success criterion won't generally be sufficient to maximise your marginal impact. Instead, you should target worlds based in part on how bad failure within them would be, along with the change in success probability for a marginal contribution. For example, maybe many low P(doom) worlds are such because intent-aligning AI turns out to be pretty straightforward in them. But easy intent-alignment may imply higher misuse risk, such that if misuse risk is more concerning than accident risk then contributing towards solving alignment problems in ways robust to misuse may remain very high impact in easy-intent-alignment worlds.[2]
One alternative way to state this consideration is that in most domains, there are actually multiple overlapping success criteria. Sometimes the more easily satisfied ones will be much higher-priority to target—even if your marginal contributions result in smaller changes to the odds of satisfying them—because they are more important.
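Here's a toy comparison (all numbers invented) of the kind of calculation this suggests, where the expected value of a marginal contribution is the change in success probability times how much worse failure is than success in that world:

```python
# Toy sketch (invented numbers): a marginal contribution can matter more in a
# world where it shifts the success probability less, if failure there is
# sufficiently worse.

worlds = {
    # world: (delta_p from your marginal contribution, badness of failure)
    "high_p_doom_world": (0.010,  1.0),
    "low_p_doom_world":  (0.002, 20.0),
}

for name, (delta_p, failure_badness) in worlds.items():
    expected_impact = delta_p * failure_badness
    print(name, expected_impact)   # 0.01 vs 0.04
```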
This consideration is the main reason I prioritise worst-case AI outcomes (i.e. s-risks) over ordinary x-risk from AI.
Some bad things might be really bad
In a similar vein, for The space you’re selecting over happens to mostly contain bad things, it's not the raw probability of selecting a bad thing that matters, but the product of that with the expected harm of a bad thing. Since some bad things are Really Very Terrible, sometimes it will make sense to use worst-case assumptions even when bad things are quite rare, as long as the risk of finding one isn't Pascalian. I think the EU of an insecure selection is at particular risk of being awful whenever the left tail of the utility distribution of things you're selecting for is much thicker than the right.
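A minimal sketch of that point with invented utilities:

```python
# Toy sketch (invented utilities): a rare but Really Very Terrible outcome can
# dominate the expected utility of a selection despite its low probability.

p_bad = 0.001
u_bad = -1e6    # the thick left tail
u_ok = 1.0      # a typical selected thing

expected_utility = p_bad * u_bad + (1 - p_bad) * u_ok
print(expected_utility)   # ~ -999: dominated by the rare terrible outcome
```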
This is plausible to me because gradient-hacking could yield a "sharp left turn", taking us very OOD relative to the sort of models that runs had previously been producing. Some other sharp left turn candidates should work just as well in this example.
This is an interesting example, because in low P(doom) worlds of this sort, marginal efforts to advance intent-alignment seem more likely to be harmful. If that were the case, alignment researchers would want to prioritise developing techniques that differentially help align AI to widely endorsed values rather than to the intent of an arbitrary deployer. Efforts to intervene more directly to prevent misuse would also look pretty valuable.
But because of effects like these, it's not obvious that you would want to prioritise low P(doom) worlds even if you were convinced that failure within them was worse than in high P(doom) worlds, since advancing-intent-alignment interventions might be helpful in most other worlds where it might be harder for malevolent users to make use of them. (And it's definitely not apparent to me in reality that failure in low P(doom) worlds is worse than in high P(doom) worlds for this reason; I just thought this would make for a good example!)
Another way interpretability work can be harmful: some means by which advanced AIs could do harm require them to be credible. For example, in unboxing scenarios where a human has something an AI wants (like access to the internet), the AI might be much more persuasive if the gatekeeper can verify the AI's statements using interpretability tools. Otherwise, the gatekeeper might be inclined to dismiss anything the AI says as plausibly fabricated. (And interpretability tools provided by the AI might be more suspect than those developed beforehand.)
It's unclear to me whether interpretability tools have much of a chance of becoming good enough to detect deception in highly capable AIs. And there are promising uses of low-capability-only interpretability -- like detecting early gradient hacking attempts, or designing an aligned low-capability AI that we are confident will scale well. But to the extent that detecting deception in advanced AIs is one of the main upsides of interpretability work people have in mind (or if people do think that interpretability tools are likely to scale to highly capable agents by default), the downsides of those systems being credible will be important to consider as well.
There is another very important component of dying with dignity not captured by the probability of success: the badness of our failure state. While any alignment failure would destroy much of what we care about, some alignment failures would be much more horrible than others. Probably the more pessimistic we are about winning, the more we should focus on losing less absolutely (e.g. by researching priorities in worst-case AI safety).
I feel conflicted about this post. Its central point as I'm understanding it is that much evidence we commonly encounter in varied domains is only evidence about the abundance of extremal values in some distribution of interest, and whether/how we should update our beliefs about the non-extremal parts of the distribution is very much dependent on our prior beliefs or gears-level understanding of the domain. I think this is a very important idea, and this post explains it well.
Also, I felt inspired to search out other explanations of the moments of a distribution; this one looks pretty good to me so far.
On the other hand, the men's rights discussion felt out of place to me, and unnecessarily so, since I think other examples would work just as well. I might be misjudging how controversial the various points you bring up are, but as of now I'd rather see topics with this level of potential political heat discussed in personal blogposts or on other platforms, so long as they're mostly unrelated to central questions of interest to rationalists / EAs.
This is super interesting!
Quick typo note (unless I'm really misreading something): in your setups, you refer to coins that are biased towards tails, but in your analyses, you talk about the coins as though they are biased towards heads.
One is the “cold pool”, in which each coin comes up 1 (i.e. heads) with probability 0.1 and 0 with probability 0.9. The other is the “hot pool”, in which each coin comes up 1 with probability 0.2
random coins with heads-probability 0.2
We started with only tails
full compression would require roughly tails, and we only have about
As far as I'm aware, there has not been (in recent decades at least) any controversy over whether word/punctuation choice is associative. We even have famous psycholinguistics experiments telling us that thinking of the word "goose" makes us more likely to think of the word "moose" as well as "duck" (linguistic priming is the one type of priming that has held up to the replication crisis, as far as I know). Whenever linguists might have bothered to make computational models, I think those would have failed to produce human-like speech because their associative models were not powerful enough.
Strongly agree with this. How I frame the issue: If people want to say that they identify as an "experiencer" who is necessarily conscious, and don't identify with any nonconscious instances of their cognition, then they're free to do that from an egoistic perspective. But from an impartial perspective, what matters is how your cognition influences the world. Your cognition has no direct access to information about whether it's conscious such that it could condition on this and give different outputs when instantiated as conscious vs. nonconscious.
Note that in the case where some simulator deliberately creates a behavioural replica of a (possibly nonexistent) conscious agent, consciousness does enter into the chain of logical causality for why the behavioural replica says things about its conscious experience. Specifically, the role it plays is to explain what sort of behaviour the simulator is motivated to replicate. So the fact that many (or even all) non-counterfactual instances of your cognition are nonconscious doesn't seem to violate any Follow the Improbability heuristic.