Anthony DiGiovanni

A high-level model of AI bargaining

Advanced AIs might be capable of various credible commitments unavailable to humans, which they could use when bargaining with each other. “Bargaining” can sound like something pretty specific: haggling over (literal) prices. But, in the sense discussed in Schelling’s The Strategy of Conflict for instance, “bargaining” refers to any attempt...

Jun 21 · 17

[Linkpost] Evals for “SPI-incompatible” behavior & reasoning: Guide to initial research

In Part I of CLR's safe Pareto improvements (SPI) agenda, we gave our high-level strategy for evaluating models for SPI-incompatible behavior and reasoning. This guide gives more details on how I’m thinking about executing on this strategy, especially the kind of workflow I think we should use, to start...

Jun 9 · 23

CLR's Safe Pareto Improvements Research Agenda

Executive summary: Safe Pareto improvements (SPIs) are ways of changing agents’ bargaining strategies that make all parties better off, regardless of their original strategies. SPIs are an unusually robust approach to preventing catastrophic conflict between AI systems, especially AIs capable of credible commitments. This is because SPIs can reduce...

Apr 20 · 45

How to not do decision theory backwards

(Subtitle: “And ethics, and epistemology, and…”. Cross-posted from my Substack.) We want to make decisions for good reasons. But I worry some common approaches to decision theory stray from this purpose. They start with a bottom-line verdict, “I should choose this action”, then use this verdict to justify claims about...

Mar 17 · 20

When do intuitions need to be reliable?

(Cross-posted from my Substack.) Here’s an important way people might often talk past each other when discussing the role of intuitions in philosophy.[1] Intuitions as predictors: When someone appeals to an intuition to argue for something, it typically makes sense to ask how reliable their intuition is. Namely, how reliable...

Mar 15 · 8

Resource guide: Unawareness, indeterminacy, and cluelessness

I’ve argued in my unawareness sequence that when we properly account for our severe epistemic limitations, we are clueless about our impact from an impartial altruistic perspective. However, this argument and my responses to counterarguments involve a lot of moving parts. And the term “clueless” gets used in various importantly...

Jul 7, 2025 · 20

Clarifying “wisdom”: Foundational topics for aligned AIs to prioritize before irreversible decisions

As many folks in AI safety have observed, even if well-intentioned actors succeed at intent-aligning highly capable AIs, they’ll still face some high-stakes challenges.[1] Some of these challenges are especially exotic and could be prone to irreversible, catastrophic mistakes. E.g., deciding whether and how to do acausal trade. To deal...

Jun 20, 2025 · 42

Responses to apparent rationalist confusions about game / decision theory

Making AIs less likely to be spiteful

Should you go with your best guess?: Against precise Bayesianism and related views

When does technical work to reduce AGI conflict make a difference?: Introduction
