Per AlexMennen, I'd rename CarefulBot-as-formulated to something like BoldBot, and apply the name CarefulBot to "swerve unless it's proven the other bot swerves."
I think (I'm an amateur here, full disclosure; could be mistakes) we can capture some of the behavior you're interested in by considering bots with different levels of proof system. So e.g. CBn swerves unless it can prove, using PA+n, that the opponent swerves.
Then we see that for n > m, CBn(CBm) = D. Proof: Suppose CBn doesn't swerve. In PA+n, it is an axiom that provability in PA+m implies truth. Therefore PA+m cannot prove that CBn swerves. In a contest between CarefulBots, the stronger proof system wins.
Now consider BoldBot. We can see that BBn does not swerve against BBn, because by symmetry both agents prove the same thing; and if they both proved their opponent does not swerve, they would both swerve, and both have proven a falsehood.
Analyzing BBn vs. BBm is proving difficult, but I'm going to try a bit more.
Actually, I notice that BBn vs. BBm is isomorphic to CBn vs. CBm! Just interchange 'swerve' and 'don't swerve' in the specification of one to convert it into the other. This implies that BBn swerves against BBm, and BBm does not swerve, if my proof about CBn vs. CBm is valid. I'm no longer so sure it is...
I thought it might be good fun to try doing modal chicken. This is my first time getting into the dirty technical details of MIRI's research, so do tolerate and point out misunderstandings.
Chicken is a game where two drivers drive towards each other on a collision course: one must swerve, or both die in the crash, but if one driver swerves and the other does not, the one who swerved will be called a "chicken," and lose social status. Chicken differs from the Prisoner's Dilemma in that the sucker's payoff is preferable to the punishment payoff, as opposed to the converse. (If your opponent defects, it is more preferable to defect than to cooperate. If your opponent doesn't swerve, it is more preferable to swerve than not to swerve.) That is, Chicken is an anti-coordination game: a game in which it is mutually beneficial for the players to play different strategies. We define the payoff matrix as follows:
In the game of Chicken, we want to define agents that always swerve against themselves, and that only swerve if their opponents don't swerve against them. Let's try defining some modal agents with these properties.
As usual, the agents herein are defined as modal formulas in Godel-Lob provability logic. In particular, our ''agents'' will be formulas in Peano Arithmetic, and our criterion for action will be the existence of a finite proof in the tower of formal systems PA+n, where PA is Peano Arithmetic, and PA+(n+1) is the formal system whose axioms are the axioms of PA+n, plus the axiom that PA+n is consistent.
Fix a particular Godel numbering scheme, and let
and
each denote well-formed formulas with one free variable. Then let
denote the formula where we replace each instance of the free variable in
with the Godel number of
. If such a formula holds in the standard model of Peano Arithmetic, we interpret that as
swerving against
; if its negation holds, we interpret that as
not swerving against
. In particular, we will prove theorems in PA+n to establish whether the agents we discuss swerve or don't swerve against one another. Thus we can regard such formulas of arithmetic as decision-theoretic agents, and we will use ''source code'' to refer to their Godel numbers.
and
will be used interchangeably with
and
.
Some very simple agents are worth defining first, like SwerveBot, the modal agent that always swerves:
Definition 1 (SwerveBot). = S)
And naturally, NerveBot, the modal agent that never swerves: