All of JesseClifton's Comments + Replies

This paper discusses two semantics for Bayesian inference in the case where the hypotheses under consideration are known to be false.

  • Verisimilitude: p(h) = the probability that h is closest to the truth [according to some measure of closeness-to-truth] among hypotheses under consideration
  • Counterfactual: p(h) = the probability of h given the (false) supposition that one of the hypotheses under consideration is true
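To make the contrast concrete, here is a toy sketch (my own construction, not from the paper), assuming a finite set of candidate hypotheses, a prior over a larger space of possible "truths", and some closeness measure d:

```python
# Toy illustration of the two semantics; the names and structure are assumptions.

def verisimilitude_probs(hypotheses, truth_prior, d):
    """p(h) = probability that h is closest to the truth among the candidates,
    where truth_prior maps each possible truth to its probability and
    d(h, truth) is a closeness-to-truth measure (lower = closer)."""
    p = {h: 0.0 for h in hypotheses}
    for truth, weight in truth_prior.items():
        closest = min(hypotheses, key=lambda h: d(h, truth))
        p[closest] += weight
    return p

def counterfactual_probs(prior_over_candidates):
    """p(h) = p(h | one of the candidates is true): renormalise whatever prior
    mass the candidate hypotheses receive in the full model."""
    total = sum(prior_over_candidates.values())
    return {h: p / total for h, p in prior_over_candidates.items()}
```

One can see Richard's worry in the reply below directly in verisimilitude_probs: for a given possible truth, all of the weight goes to the single closest candidate, so a nearly-as-close runner-up can end up with probability near zero.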

In any case, it’s unclear what motivates making decisions by maximizing expected value against such probabilities, which seems like a ... (read more)

1Richard_Ngo
Ty for the link but these seem like both clearly bad semantics (e.g. under either of these the second-best hypothesis under consideration might score arbitrarily badly).

mildly disapprove of words like "a widely-used strategy"

The text says “A widely-used strategy for arguing for norms of rationality involves avoiding dominated strategies”, which is true* and something we thought would be familiar to everyone who is interested in these topics. For example, see the discussion of Dutch book arguments in the SEP entry on Bayesianism and all of the LessWrong discussion on money pump/dominance/sure loss arguments (e.g., see all of the references in and comments on this post). But fair enough, it would have been better to include... (read more)

What principles? It doesn’t seem like there’s anything more at work here than “Humans sometimes become more confident that other humans will follow through on their commitments if they, e.g., repeatedly say they’ll follow through”. I don’t see what that has to do with FDT, more than any other decision theory. 

If the idea is that Mao’s forming the intention is supposed to have logically-caused his adversaries to update on his intention, that just seems wrong (see this section of the mentioned post).

(Separately I’m not sure what this has to do with... (read more)

I don't think FDT has anything to do with purely causal interactions. Insofar as threats were actually deterred here, this can be understood in standard causal game theory terms.  (I.e., you claim in a convincing manner that you won't give in -> People assign high probability to you being serious -> Standard EV calculation says not to commit to a threat against you.) Also see this post.
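As a toy version of that EV calculation (the numbers are just for illustration):

```python
# Expected value, to the would-be threatener, of committing to a threat, assuming
# the threat must be carried out whenever the target refuses to give in.
def ev_of_threatening(p_gives_in, gain_if_gives_in, cost_of_carrying_out):
    return p_gives_in * gain_if_gives_in - (1 - p_gives_in) * cost_of_carrying_out

# If you've convincingly signalled that you won't give in, p_gives_in is low and
# the calculation comes out negative, so the threat isn't made:
print(ev_of_threatening(p_gives_in=0.05, gain_if_gives_in=10, cost_of_carrying_out=5))
# -4.25
```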

2Garrett Baker
Thus why I said related. Nobody was doing any mind-reading of course, but the principles still apply, since people are often actually quite good at reading each other.

Awesome sequence!

I wish that discussions of anthropics were clearer about metaphysical commitments around personal identity and possibility.  I appreciated your discussions of this, e.g., in Section XV.  I agree with you, though, that it is quite unclear what justifies the picture “I am sampled from the set of all possible people-in-my-epistemic situation (weighted by probability of existence)”.  I take it the view of personal identity at work here is something like “‘I’ am just a sequence of experiences S”, and so I know I am one ... (read more)

But it's true that if you had been aware from the beginning that you were going to be threatened, you would have wanted to give in.

To clarify, I didn’t mean that if you were sure your counterpart would Dare from the beginning, you would’ve wanted to Swerve. I meant that if you were aware of the possibility of Crazy types from the beginning, you would’ve wanted to Swerve. (In this example.)

I can’t tell if you think that (1) being willing to Swerve in the case that you’re fully aware from the outset (because you might have a sufficiently high prior on Cra... (read more)

If I understand correctly, you’re making the point that we discuss in the section on exploitability. It’s not clear to me yet why this kind of exploitability is objectionable. After all, had the agent in your example been aware of the possibility of crazy agents from the start, they would have wanted to swerve, and non-crazy agents would want to take advantage of this. So I don’t see how the situation is any worse than if the agents were making decisions under complete awareness.

2Daniel Kokotajlo
How is it less objectionable than regular ol' exploitability? E.g. someone finds out that you give in to threats, so they threaten you, so you give in, and wish you had never been born -- you are exploitable in the classic sense. But it's true that if you had been aware from the beginning that you were going to be threatened, you would have wanted to give in.

Part of what I'm doing here is trying to see if my understanding of your work is incorrect. To me, it seems like you are saying "Let's call some kinds of changes-to-credences 'updates' and other kinds 'awareness-growth.' Here's how to distinguish them. Now, we recommend the strategy of EA-OMU, which means you calculate what your credences would have been if you never made any updates but DID make the awareness-growth changes, and then calculate what policy is optimal according to those credences, and then do that."

If that's what you are saying, then the natural next question is: What if anything does this buy us? It doesn't solve the commitment races problem, because the problem still remains so long as agents can strategically influence each other's awareness growth process. E.g. "Ah, I see that you are an EA-OMU agent. I'm going to threaten you, and then when you find out, even though you won't update, your awareness will grow, and so then you'll cave. Bwahaha."

Also, how is this different from the "commitment races in logical time" situation? Like, when I wrote the original commitment races post it was after talking with Abram and realizing that going updateless didn't solve the problem because agents aren't logically omniscient, they need to gradually build up more hypotheses and more coherent priors over time. And even if they are updateless with respect to all empirical evidence, i.e. they never update their prior based on empirical evidence, their a priori reasoning probably still results in race dynamics. Or at least so it seemed to me. I don't think I fully understand the proposal so it's likely I
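For what it's worth, here is how the EA-OMU recipe described above could be sketched in code (my own reading of it, with made-up names, not the authors' formulation):

```python
# Toy sketch of ex-ante open-minded updatelessness (EA-OMU) as described above;
# all names and the interface are hypothetical.

def ea_omu_policy(current_awareness_set, reconstruct_prior, policies, expected_value):
    """current_awareness_set: everything the agent is aware of now, including
        possibilities it only became aware of later (awareness growth);
    reconstruct_prior: the prior the agent endorses having had at the outset
        over the current awareness set (no conditioning on observations);
    expected_value(policy, prior): EV of a policy under a prior."""
    prior = reconstruct_prior(current_awareness_set)
    return max(policies, key=lambda policy: expected_value(policy, prior))
```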

Can you clarify what “the problem” is and why it “recurs”?

My guess is that you are saying: Although OM updatelessness may work for propositions about empirical facts, it’s not clear that it works for logical propositions. For example, suppose I find myself in a logical Counterfactual Mugging regarding the truth value of a proposition P. Suppose I simultaneously become aware of P and learn a proof of P. OM updatelessness would want to say: “Instead of accounting for the fact that you learned that P is true in your decision, figure out what credence you woul... (read more)

2Daniel Kokotajlo
OK, let's suppose all relevant agents follow some sort of updatelessness, i.e. they constantly act according to the policy that would have been optimal to commit to, from the perspective of their reconstructed prior. But their reconstructed prior is changing constantly as they learn more, e.g. as they become aware of "crazy" possible strategies their opponents might use. Can the agents sometimes influence each other's priors? Yes. For example by acting 'crazy' in some way they didn't expect, you might cause them to revise their prior to include that possibility -- indeed to include it with significant probability mass!

OK, so then some of the agents will think "Aha, I can influence the behavior of the others in ways that I like, by committing to various 'crazy' strategies that place incentives on them. Once they become aware of my action, they'll revise their prior, and then the optimal commitment for them to have made in light of that prior is to conform to the incentives I placed on them, so they will." ... I'll stop there for now.

Do you see what I mean? It's literally the commitment races problem. Agents racing to make various commitments in order to influence each other, because they expect that the others might be so influenced.

Now you might think that it's generally harder to influence someone's reconstructed-prior than to influence their posterior; if you do something that was 'in distribution' for their current reconstructed prior, for example, then they won't update their reconstructed prior at all, they'll just update their posterior. I think this is plausible but I'd want to see it spelled out in more detail how much of the problem it solves; certainly not all of it, at least so says my current model which might be wrong.

The model is fully specified (again, sorry if this isn’t clear from the post). And in the model we can make perfectly precise the idea of an agent re-assessing their commitments from the perspective of a more-aware prior. Such an agent would disagree that they have lost value by revising their policy. Again, I’m not sure exactly where you are disagreeing with this. (You say something about giving too much weight to a crazy opponent — I’m not sure what “too much” means here.)

Re: conservation of expected evidence, the EA-OMU agent doesn’t expect to increa... (read more)

2Dagon
The missing part is the ACTUAL distribution of normal vs crazy opponents (note that "crazy" is perfectly interchangeable with "normal, who was able to commit first"), and the loss that comes from failing to commit against a normal opponent.  Or the reasoning that a normal opponent will see it as commitment, even when it's not truly a commitment if the opponent turns out to be crazy.

Anyway, interesting discussion.  I'm not certain I understand where we differ on its applicability, but I think we've hashed it out as much as possible.  I'll continue reading and thinking - feel free to respond or rebut, but I'm unlikely to comment further.  Thanks!

For this to be wrong, the opponent must be (with some probability) irrational - that's a HUGE change in the setup

For one thing, we’re calling such agents “Crazy” in our example, but they need not be irrational. They might have weird preferences such that Dare is a dominant strategy. And as we say in a footnote, we might more realistically imagine more complex bargaining games, with agents who have (rationally) made commitments on the basis of as-yet unconceived of fairness principles, for example. An analogous discussion would apply to them.

But in an... (read more)

2Dagon
I think if you fully specify the model (including the reasons for commitment rather than just delaying the decision in the first place), you'll find that the reason for committing is NOT about updates, but about adversarial game theory.  Specifically, include in your model that if facing a NORMAL opponent, failure to commit turns your (D, S) outcome (+1) into a (S, D) (-1), because the normal opponent will dare if you haven't committed, and then you are best off swerving. You've LOST VALUE because you gave too much weight to the crazy opponent.

How your (distribution of) opponents react to your strategy, which is conditional on your beliefs about THEIR strategy, is the core of game theory.  If you have a mix of crazy opponents and rational opponents who you think haven't committed yet, you don't need to introduce any update mechanisms, you just need your current probability estimates about the distribution, and commit or don't based on maximizing your EV.

Where the conservation of expected evidence comes in is that you CANNOT expect to increase your chances of facing a crazy opponent.  If you did expect that, you actually have a different prior than you think.
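Using the +1 / -1 payoffs mentioned in the comment above, plus an assumed crash payoff and assumed behaviour for normal opponents, that EV comparison looks like this (a toy sketch with made-up numbers):

```python
# Chicken with possibly-Crazy opponents: EV of committing to Dare vs. always
# Swerving. The +1 / -1 payoffs come from the comment above; the crash payoff
# and the behaviour of "normal" opponents are assumptions.

CRASH = -10          # (Dare, Dare): both committed, neither swerves (assumed)
DARE_VS_SWERVE = 1   # you Dare, they Swerve
SWERVE_VS_DARE = -1  # you Swerve, they Dare

def ev_commit_to_dare(p_crazy):
    # Crazy opponents Dare no matter what; normal opponents Swerve against a
    # known commitment.
    return p_crazy * CRASH + (1 - p_crazy) * DARE_VS_SWERVE

def ev_always_swerve():
    # Normal opponents Dare against someone known to always Swerve.
    return SWERVE_VS_DARE

for p in (0.0, 0.1, 0.2):
    print(p, ev_commit_to_dare(p), ev_always_swerve())
# Committing beats always-Swerving only while 1 - 11*p > -1, i.e. p_crazy < 2/11.
```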

Thanks Dagon:

Any mechanism to revoke or change a commitment is directly giving up value IN THE COMMON FORMULATION of the problem

Can you say more about what you mean by “giving up value”?

Our contention is that the ex-ante open-minded agent is not giving up (expected) value, in the relevant sense, when they "revoke their commitment" upon becoming aware of certain possible counterpart types. That is, they are choosing the course of action that would have been optimal according to the priors that they believe they should have set at the outset of the decis... (read more)

2Dagon
Sure.  In the common formulation https://en.wikipedia.org/wiki/Chicken_(game), when Alice believes (with more than 1000:1 probability) that she is first mover against a rational opponent, she commits to Dare.  The ability to revoke this commitment hurts her if her opponent commits in the meantime, as she is now better off swerving, but worse off than if her commitment had been (known to be) stronger.

For this to be wrong, the opponent must be (with some probability) irrational - that's a HUGE change in the setup.  Whether she wants to lose (by just always Swerve, regardless of opponent), or wait for more information about the opponent is based on her probability assessment of whether the opponent is actually irrational.  If she assigns it 0% (correctly or in-), she should commit or she's giving up expected value based on her current knowledge.  If she assigns it higher than that, it will depend on the model of what is the distribution of opponents and THEIR commitment timing.

You can't just say "Alice has wrong probability distributions, but she's about to learn otherwise, so she should use that future information".  You COULD say "Alice knows her model is imperfect, so she should be somewhat conservative", but really that collapses to a different-but-still-specific probability distribution. You don't need to bring updates into it, and certainly don't need to consider future updates.  https://www.lesswrong.com/tag/conservation-of-expected-evidence means you can only expect any future update to match your priors.

Thanks for sharing, I'm happy that someone is looking into this. I'm not an expert in the area, but my impression is that this is consistent with a large body of empirical work on "procedural fairness", i.e., people tend to be happier with outcomes that they consider to have been generated by a fair decision-making process. It might be interesting to replicate studies from that literature with an AI as the decision-maker.

1Marius Hobbhahn
Yeah, our impression was that a) there is a large body of literature that is relevant and related in the existing social science literature, and b) taking 90% of the existing setup and adding AI would probably already yield lots of interesting studies. In general, it seems like there is a lot of room for people interested in the intersection of AI+ethics+social sciences. Also, Positly+Guidedtrack makes running these studies really simple and turned out to be much smoother than I would have expected. So even when people without a social science background "just want to get a rough understanding what the rest of the world thinks" they can quickly do so with the existing tools. 

[I work at CAIF and CLR]

Thanks for this!

I recommend making it clearer that CAIF is not focused on s-risk and is not formally affiliated with CLR (except for overlap in personnel). While it’s true that there is significant overlap in CLR’s and CAIF’s research interests, CAIF’s mission is much broader than CLR’s (“improve the cooperative intelligence of advanced AI for the benefit of all”), and its founders + leadership are motivated by a variety of catastrophic risks from AI.

Also, “foundational game theory research” isn’t an accurate description of CAIF’s s... (read more)

1Thomas Larsen
Thanks for the update! We've edited the section on CLR to reflect this comment, let us know if it still looks inaccurate.

A few thoughts on this part:

I guess [coordination failures between AIs] feels like mainly the type of thing that we can outsource to AIs, once they’re sufficiently capable. I don’t see a particularly strong reason to think that systems that are comparably powerful as humans, or more powerful than humans, are going to make obvious mistakes in how they coordinate. You have this framing of AI coordination. We could also just say politics, right? Like we think that geopolitics is going to be hard in a world where AIs exist. And when you have that framing, y

... (read more)

We are now using a new definition of s-risks. I've edited this post to reflect the change.

New definition:

S-risks are risks of events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering.

Note that it may turn out that the amount of suffering that we can influence is dwarfed by suffering that we can’t influence. By “expectation of suffering in the future” we mean “expectation of action-relevant suffering in the future”.

Ok, thanks for that. I’d guess then that I’m more uncertain than you about whether human leadership would delegate to systems who would fail to accurately forecast catastrophe.

It’s possible that human leadership just reasons poorly about whether their systems are competent in this domain. For instance, they may observe that their systems perform well in lots of other domains, and incorrectly reason that “well, these systems are better than us in many domains, so they must be better in this one, too”. Eagerness to deploy before a more thorough investigation... (read more)

The US and China might well wreck the world by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.

But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment

I'm not sure I understand yet. For example, here’s a version of Flash War that happens seemingly without either the principals knowingly taking gargantuan risks or extreme intent-alignment failure.

  1. The pri

... (read more)
8CarlShulman
Mainly such complete (and irreversible!) delegation to such incompetent systems being necessary or executed. If AI is so powerful that the nuclear weapons are launched on hair-trigger without direction from human leadership I expect it to not be awful at forecasting that risk. You could tell a story where bargaining problems lead to mutual destruction, but the outcome shouldn't be very surprising on average, i.e. the AI should be telling you about it happening with calibrated forecasts.

Yeah I agree the details aren’t clear. Hopefully your conditional commitment can be made flexible enough that it leaves you open to being convinced by agents who have good reasons for refusing to do this world-model agreement thing. It’s certainly not clear to me how one could do this. If you had some trusted “deliberation module”, which engages in open-ended generation and scrutiny of arguments, then maybe you could make a commitment of the form “use this protocol, unless my counterpart provides reasons which cause my deliberation module to be convinced o... (read more)

2Daniel Kokotajlo
I think we are on the same page then. I like the idea of a deliberation module; it seems similar to the "moral reasoning module" I suggested a while back. The key is to make it not itself a coward or bully, reasoning about schelling points and universal principles and the like instead of about what-will-lead-to-the-best-expected-outcomes-given-my-current-credences.

It seems like we can kind of separate the problem of equilibrium selection from the problem of “thinking more”, if “thinking more” just means refining one’s world models and credences over them. One can make conditional commitments of the form: “When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.”
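For concreteness, once the world model (here just a finite set of feasible outcomes and two utility functions) has been agreed, a solution concept like Nash's can be applied mechanically. A minimal sketch of that last step, with everything about the protocol itself left out:

```python
# Nash bargaining solution over a finite set of feasible outcomes; a toy sketch,
# not a proposal for the conditional-commitment protocol itself.

def nash_bargaining(outcomes, u1, u2, disagreement=(0.0, 0.0)):
    """Pick the outcome maximising (u1 - d1) * (u2 - d2), among outcomes both
    players weakly prefer to the disagreement point."""
    d1, d2 = disagreement
    acceptable = [o for o in outcomes if u1(o) >= d1 and u2(o) >= d2]
    return max(acceptable, key=lambda o: (u1(o) - d1) * (u2(o) - d2))

# Splitting a unit surplus on a grid:
splits = [i / 100 for i in range(101)]
print(nash_bargaining(splits, u1=lambda x: x, u2=lambda x: 1 - x))  # 0.5
```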

The set of solution concepts you commit to regarding as acceptable stil... (read more)

3Daniel Kokotajlo
If I read you correctly, you are suggesting that some portion of the problem can be solved, basically -- that it's in some sense obviously a good idea to make a certain sort of commitment, e.g. "When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.” So the commitment races problem may still exist, but it's about what other commitments to make besides this one, and when. Is this a fair summary? I guess my response would be "On the object level, this seems like maybe a reasonable commitment to me, though I'd have lots of questions about the details. We want it to be vague/general/flexible enough that we can get along nicely with various future agents with somewhat different protocols, and what about agents that are otherwise reasonable and cooperative but for some reason don't want to agree on a world-model with us? On the meta level though, I'm still feeling burned from the various things that seemed like good commitments to me and turned out to be dangerous, so I'd like to have some sort of stronger reason to think this is safe."

Nice post! I’m excited to see more attention being paid to multi-agent stuff recently.

A few miscellaneous points:

  • I get the impression that the added complexity of multi- relative to single-agent systems has not been adequately factored into folks’ thinking about timelines / the difficulty of making AGI that is competent in a multipolar world. But I’m not confident in that.

  • I think it’s possible that conflict / bargaining failure is a considerable source of existential risk, in addition to suffering risk. I don’t really have a view on how it compares t

... (read more)

Neat post, I think this is an important distinction. It seems right that more homogeneity means less risk of bargaining failure, though I’m not sure yet how much.

Cooperation and coordination between different AIs is likely to be very easy as they are likely to be very structurally similar to each other if not share basically all of the same weights

In what ways does having similar architectures or weights help with cooperation between agents with different goals? A few things that come to mind:

  • Having similar architectures might make it easier for agen
... (read more)
4evhub
Glad you liked the post! Importantly, I think this moves you from a human-misaligned AI bargaining situation into more of a human-human (with AI assistants) bargaining situation, which I expect to work out much better, as I don't expect humans to carry out crazy threats to the same extent as a misaligned AI might. I find the prospect of multiple independent mesa-optimizers inside of the same system relatively unlikely. I think this could basically only happen if you were building a model that was built of independently-trained pieces rather than a single system trained end-to-end, which seems to be not the direction that machine learning is headed in—and for good reason, as end-to-end training means you don't have to learn the same thing (such as optimization) multiple times.

Makes sense. Though you could have deliberate coordinated training even after deployment. For instance, I'm particularly interested in the question of "how will agents learn to interact in high stakes circumstances which they will rarely encounter?" One could imagine the overseers of AI systems coordinating to fine-tune their systems in simulations of such encounters even after deployment. Not sure how plausible that is though.

2Daniel Kokotajlo
I totally agree it could be done, I'm just saying I think it won't happen without special effort on our part, probably. Rivals are suspicious of each other, and would probably be suspicious of a proposal like this coming from their rival. If they are even concerned about the problem it is trying to fix at all.

I don't think bayesianism gives you particular insight into that for the same reasons I don't think it gives you particular insight into human cognition

In the areas I focus on, at least, I wouldn’t know where to start if I couldn’t model agents using Bayesian tools. Game-theoretic concepts like social dilemma, equilibrium selection, costly signaling, and so on seem indispensable, and you can’t state these crisply without a formal model of preferences and beliefs. You might disagree that these are useful concepts, but at this point I feel like the argume

... (read more)
3Richard_Ngo
I agree with this. I think I disagree that "stating them crisply" is indispensable. To be a little contrarian, I want to note that this phrasing has a certain parallel with the streetlight effect: you wouldn't know how to look for your keys if you didn't have the light from the streetlamp. In particular, this is also what someone would say if we currently had no good methods for modelling agents, but bayesian tools were the ones which seemed good. Anyway, I'd be interested in having a higher-bandwidth conversation with you about this topic. I'll get in touch :)

I agree with the rejection of strong Bayesianism. I don’t think it follows from what you’ve written, though, that “bayesianism is not very useful as a conceptual framework for thinking either about AGI or human reasoning”.

I'm probably just echoing things that have been said many times before, but:

You seem to set up a dichotomy between two uses of Bayesianism: modeling agents as doing something like "approximate Solomonoff induction", and Bayesianism as just another tool in our statistical toolkit. But there is a third use of Bayesianism, the way that sophi

... (read more)
2Richard_Ngo
My post was intended to critique these positions too. In particular, the responses I'd give are that:

  • There are many ways to model agents as “trying to make good decisions in light of their beliefs and preferences”. I expect bayesian ideas to be useful for very simple models, where you can define a set of states to have priors and preferences over. For more complex and interesting models, I think most of the work is done by considering the cognition the agents are doing, and I don't think bayesianism gives you particular insight into that for the same reasons I don't think it gives you particular insight into human cognition.

  • In response to "The Bayesian framework plausibly allows us to see failure modes that are common to many boundedly rational agents": in general I believe that looking at things from a wide range of perspectives allows you to identify more failure modes - for example, thinking of an agent as a chaotic system might inspire you to investigate adversarial examples. Nevertheless, apart from this sort of inspiration, I think that the bayesian framework is probably harmful when applied to complex systems because it pushes people into using misleading concepts like "boundedly rational" (compare your claim with the claim that a model in which all animals are infinitely large helps us identify properties that are common to "boundedly sized" animals).

  • "We might actively want to try to make our systems more like Bayesian reasoners": I expect this not to be a particularly useful approach, insofar as bayesian reasoners don't do "reasoning". If we have no good reason to think that explicit utility functions are something that is feasible in practical AGI, except that it's what ideal bayesian reasoners do, then I want to discourage people from spending their time on that instead of something else.

The new summary looks good =) Although I second Michael Dennis' comment below, that the infinite regress of priors is avoided in standard game theory by specifying a common prior. Indeed the specification of this prior leads to a prior selection problem.

The formality of "priors / equilibria" doesn't have any benefit in this case (there aren't any theorems to be proven)

I’m not sure if you mean “there aren’t any theorems to be proven” or “any theorem that’s proven in this framework would be useless”. The former is false, e.g. there are things to prove ab

... (read more)
1MichaelDennis
Just to make sure that I was understood, I was also pointing out that "you can have a well-specified Bayesian belief over your partner" even without agreeing on a common prior, as long as you agree on a common set of possibilities or something effectively similar. This means that talking about "Bayesian agents without a common prior" is well-defined. When there is not a common prior, this leads to an arbitrarily deep nesting of beliefs, but they are all well-defined. I can refer to "what A believes that B believes about A" without running into Russell's Paradox. When the priors mismatch, the entire hierarchy of these beliefs might be useful to reason about, but when there is a common prior, it allows much of the hierarchy to collapse.
4Rohin Shah
I think once you settle on a "simple" welfare function, it is possible that there are _no_ Nash equilibria such that the agents are optimizing that welfare function (I don't even really know what it means to optimize the welfare function, given that you have to also punish the opponent, which isn't an action that is useful for the welfare function). Hmm, I meant one thing and wrote another. I meant to say "there aren't any theorems proven in this post".

both players want to optimize the welfare function (making it a collaborative game)

The game is collaborative in the sense that a welfare function is optimized in equilibrium, but the principals will in general have different terminal goals (reward functions) and the equilibrium will be enforced with punishments (cf. tit-for-tat).
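As a minimal sketch of what "enforced with punishments" can look like (placeholder action names, not anything from the post):

```python
# Each agent plays its part of the welfare-maximising profile as long as the
# partner did so last round, and punishes otherwise (a tit-for-tat-style rule).

def enforcing_strategy(partner_last_action, welfare_action="cooperate",
                       punish_action="punish"):
    if partner_last_action in (None, welfare_action):
        return welfare_action   # stay on the welfare-optimal path
    return punish_action        # punish the most recent deviation

# Two such agents stay on the welfare-optimal path in equilibrium, even though
# each principal's own reward function would prefer its agent to deviate.
```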

the issue is primarily that in a collaborative game, the optimal thing for you to do depends strongly on who your partner is, but you may not have a good understanding of who your partner is, and if you're wrong you can do arb

... (read more)
2Rohin Shah
Ah, I misunderstood your post. I thought you were arguing for problems conditional on the principals agreeing on the welfare function to be optimized, and having common knowledge that they were designing agents that optimize that welfare function. I mean, in this case you just deploy one agent instead of two. Even under the constraint that you must deploy two agents, you exactly coordinate their priors / which equilibria they fall into. To get prior / equilibrium selection problems, you necessarily need to have agents that don't know who their partner is. (Even if just one agent knows who the partner is, outcomes should be expected to be relatively good, though not optimal, e.g. if everything is deterministic, then threats are never executed.)

----

Looking at these objections, I think probably what you were imagining is a game where the principals have different terminal goals, but they coordinate by doing the following:

  • Agreeing upon a joint welfare function that is "fair" to the principals. In particular, this means that they agree that they are "licensed" to punish actions that deviate from this welfare function.

  • Going off and building their own agents that optimize the welfare function, but make sure to punish deviations (to ensure that the other principal doesn't build an agent that pursues the principal's goals instead of the welfare function)

New planned summary:

New opinion:

Btw, some reasons I prefer not using priors / equilibria and instead prefer just saying "you don't know who your partner is":

  • It encourages solutions that take advantage of optimality and won't actually work in the situations we actually face.

  • The formality of "priors / equilibria" doesn't have any benefit in this case (there aren't any theorems to be proven). The one benefit I see is that it signals that "no, even if we formalize it, the problem doesn't go away", to those people who think that once formalized sufficiently all problems go away via the magic of Bayesia

Chimpanzees, crows, and dolphins are capable of impressive feats of higher intelligence, and I don’t think there’s any particular reason to think that Neanderthals are capable of doing anything qualitatively more impressive

This seems like a pretty cursory treatment of what seems like quite a complicated and contentious subject. A few possible counterexamples jump to mind. These are just things I remember coming across when browsing cognitive science sources over the years.

... (read more)

In model-free RL, policy-based methods choose policies by optimizing a noisy estimate of the policy's value. This is analogous to optimizing a noisy estimate of prediction accuracy (i.e., accuracy on the training data) to choose a predictive model. So we often need to trade variance for bias in the policy-learning case (i.e., shrink towards simpler policies) just as in the predictive modeling case.
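A minimal sketch of the analogy (my own illustration, with made-up toy numbers): pick the policy that maximises a noisy Monte Carlo value estimate minus a complexity penalty, the penalty playing the same variance-reducing role as shrinking towards simpler predictive models.

```python
import random

def estimate_value(policy, rollout_return, n_rollouts=10):
    """Noisy value estimate from sampled returns (the RL analogue of accuracy
    on the training data)."""
    return sum(rollout_return(policy) for _ in range(n_rollouts)) / n_rollouts

def select_policy(policies, rollout_return, complexity, penalty=0.1):
    """Trade variance for bias by shrinking towards simpler policies."""
    return max(policies,
               key=lambda pi: estimate_value(pi, rollout_return) - penalty * complexity(pi))

# Toy usage: the most complex policy has slightly higher true value but a much
# noisier estimate, so the penalty typically favours "medium" over "complex".
true_value = {"simple": 0.5, "medium": 0.7, "complex": 0.72}
noise = {"simple": 0.05, "medium": 0.2, "complex": 0.5}
rollout = lambda pi: random.gauss(true_value[pi], noise[pi])
print(select_policy(list(true_value), rollout, complexity=lambda pi: list(true_value).index(pi)))
```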

There are "reliabilist" accounts of what makes a credence justified. There are different accounts, but they say (very roughly) that a credence is justified if it is produced by a process that is close to the truth on average. See (this paper)[https://philpapers.org/rec/PETWIJ-2].

Frequentist statistics can be seen as a version of reliabilism. Criteria like the Brier score for evaluating forecasters can also be understood in a reliabilist framework.
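For instance, the Brier score is just the mean squared distance between forecast probabilities and what actually happened (lower is better), so it evaluates a forecaster's process by its average closeness to the truth:

```python
def brier_score(forecasts, outcomes):
    """forecasts: probabilities assigned to each event occurring;
    outcomes: 1 if the event occurred, 0 otherwise."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # ~0.047
```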

Maybe pedantic, but couldn't we just look at the decision process as a sequence of episodes from the POMDP, and formulate the problem in terms of the regret incurred by our learning algorithm in this decision process? In particular, if catastrophic outcomes (i.e., ones which dominate the total regret) are possible, then a low-regret learning algorithm will have to be safe while still gathering some information that helps in future episodes. (On this view, the goal of safe exploration research is the same as the goal of learning generally: design low-regret

... (read more)
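To spell the regret framing out a little (my formalisation, not from the comment): with $\pi_t$ the policy deployed in episode $t$ and $\pi^*$ the best policy in hindsight,

$$\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \left( V^{\pi^*} - V^{\pi_t} \right),$$

so if a single episode can yield a catastrophic return, that one term can dominate the sum, and any low-regret algorithm is forced to explore safely.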

I definitely think it's worth exploring. I have the intuition that creating a single agent might be difficult for various logistical and political reasons, and so it feels more robust to figure out the multiagent case. But I would certainly like to have a clearer picture of how and under what circumstances several AI developers might implement a single compromise agent.

Ah, I see now that I did not make this clear at all. The main thing in the case of war is that, under certain payoff structures, a state might not be able to credibly commit to the terms of a peaceful settlement if it is expected to increase in power relative to its counterpart. Thus the state who expects to lose relative power will sometimes rather wage preventative war (while it is still relatively strong) than settle. This is still a problem in models with complete information and divisible stakes.

I'll try to edit the text to make this clearer soon, thanks for bringing it up.
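In the meantime, here is a toy one-period illustration of that mechanism (the numbers and the simplification are mine, not from the text):

```python
# State A's chance of winning falls from p_now to p_later; war destroys a
# fraction `cost` of a unit-sized prize.
p_now, p_later, cost = 0.7, 0.3, 0.1

value_of_war_now_for_A = p_now * (1 - cost)                        # 0.63
# After the power shift, B can always get (1 - p_later) * (1 - cost) by fighting,
# so no settlement can credibly leave A with more than:
max_credible_share_for_A_later = 1 - (1 - p_later) * (1 - cost)    # 0.37

print(value_of_war_now_for_A > max_credible_share_for_A_later)
# True: A prefers preventive war now, even with complete information and a
# perfectly divisible prize, because B cannot commit to honouring a larger
# share for A once the shift has happened.
```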

It seems plausible that if players could truthfully disclose private information and divide stakes, the ability to credibly commit would often not be needed

Even if the players can find a settlement that they both prefer to conflict (e.g., flipping a coin to decide who gets the territory) there's still the problem of committing to honoring that settlement (you might still just attack me if the coin doesn't land in your favor). So I think there's still a problem. But maybe you're saying that if there's really no private information, then there is no credi

... (read more)
1axioman
My reasoning relies more on the divisibility of stakes (without having to resort to randomization). If there was a deterministic settlement that is preferable to conflict, then nobody has an incentive to break the settlement. However, my main point was that I read the paragraph I quoted as "we don't need the divisibility of stakes if we have credibility and complete information, therefore credibility and complete information is more important than divisibility of stakes". I do not really find this line of argument convincing, as I am not convinced that you could not make the same argument with the role of credibility and divisible stakes reversed. Did I maybe misread what you are saying there? Still, your conclusion still seems plausible and I suspect that you have other arguments for focusing on credibility. I would like to hear those.

Do you think focusing on s-risks leads to meaningfully different technical goals than focusing on other considerations?

I think it definitely leads to a difference in prioritization among the things one could study under the broad heading of AI safety. Hopefully this will be clear in the body of the agenda. And, some considerations around possible downsides of certain alignment work might be more salient to those focused on s-risk; the possibility that attempts at alignment with human values could lead to very bad “near misses” is an example. (I think so

... (read more)
3Vaniver
Thanks, that helps! Cool; if your deliberations include examples, it might be useful to include them if you end up writing an explanation somewhere.