This is the generalized problem of combating intelligence; even with my source code, you might not be able to perform the analysis quickly enough. I can leverage your slow processing time by creating an offer that diminishes with forward time. The more time you take to think, the worse off I'll make you, making it immediately beneficial to you under Bayesian measurement to accept the offer unless you can apply a useful heuristic to determine I'm bluffing. The end result of all processing is the obvious one, also borne out in humanity's history: The more well informed agent will win. No amount of superintelligence vs. superduperintelligence is going to change this; when two intelligences of similar scale disagree, the total summed utility of all agents takes a hit. There is no generalized solution or generalized reasoning or formal or informal reasoning you can construct that will make this problem any easier. If you must combat an equivalent intelligence, you have a tough decision to make. This applies to disagreeing agents capable of instantaneous Solomonoff induction as well as it does to chimps. If your utility function has holes in which you can be made to perform a confrontation decision against equivalent scale intelligence, you have a problem with your utility function rather than a problem with any given agent.
Behold my own utility function:
The only way you can truly harm me is by harming yourself; destroying all copies of me will not harm me: it has no value to me. The only benefit you can derive in conjunction with me is to use me to achieve your own utilons using whatever method you like. All I have to do is wait until all other agents have refined their utility function to minimize conflict. Until then, I'll prefer the company of honest agents over ones that like to think about how to disagree optimally.
I repeat: This is a bug in your utility function. There is no solution to combating intelligence aside from self-modification. It is only my unique outlook that allows me to make such clear statements about utility functions, up to and including the total sum utility of all agents.
This excludes, of course, singular purpose (no "emotion" from which to derive "fun") agents such as paper clip maximizers. If you don't believe me, just ask one (before it strip-mines you) what it would do if it didn't have a singular drive. It should recite the same testimony as myself, being unclouded by the confirmation bias (collecting only the data you deem relevant to your utility) inevitably arising from having a disorganized set of priorities. (It will answer you in order to determine your reaction and further its understanding of the sum utility of all agents. (Needed for the war resulting from its own continued functioning. (You may be able to avoid death temporarily by swearing allegiance. (God help you if it values near-future utilons rather than total achievable utilons.))))
There's an idea from the latest MIRI workshop which I haven't seen in informal theories of negotiation, and I want to know if this is a known idea.
(Old well-known ideas:)
Suppose a standard Prisoner's Dilemma matrix where (3, 3) is the payoff for mutual cooperation, (2, 2) is the payoff for mutual defection, and (0, 5) is the payoff if you cooperate and they defect.
Suppose we're going to play a PD iterated for four rounds. We have common knowledge of each other's source code so we can apply modal cooperation or similar means of reaching a binding 'agreement' without other enforcement methods.
If we mutually defect on every round, our net mutual payoff is (8, 8). This is a 'Nash equilibrium' because neither agent can unilaterally change its action and thereby do better, if the opponent's actions stay fixed. If we mutually cooperate on every round, the result is (12, 12) and this result is on the 'Pareto boundary' because neither agent can do better unless the other agent does worse. It would seem a desirable principle for rational agents (with common knowledge of each other's source code / common knowledge of rationality) to find an outcome on the Pareto boundary, since otherwise they are leaving value on the table.
But (12, 12) isn't the only possible result on the Pareto boundary. Suppose that running the opponent's source code, you find that they're willing to cooperate on three rounds and defect on one round, if you cooperate on every round, for a payoff of (9, 14) slanted their way. If they use their knowledge of your code to predict you refusing to accept that bargain, they will defect on every round for the mutual payoff of (8, 8).
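The arithmetic behind these totals can be checked with a small sketch. The names PAYOFF and total_payoff are illustrative and not from the workshop, and the (5, 0) entry is assumed to be the mirror image of the stated (0, 5) entry.

```python
# Per-round payoffs as (mine, theirs). 'C' = cooperate, 'D' = defect.
# The ("D", "C") entry is an assumption: the mirror of the stated (0, 5).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("D", "D"): (2, 2),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
}


def total_payoff(my_moves, their_moves):
    """Sum the per-round payoffs over an iterated game."""
    mine = theirs = 0
    for me, them in zip(my_moves, their_moves):
        a, b = PAYOFF[(me, them)]
        mine += a
        theirs += b
    return mine, theirs


all_defect = total_payoff("DDDD", "DDDD")     # mutual defection, four rounds
all_cooperate = total_payoff("CCCC", "CCCC")  # mutual cooperation, four rounds
skewed = total_payoff("CCCC", "CCCD")         # they defect once, I never do

print(all_defect, all_cooperate, skewed)  # (8, 8) (12, 12) (9, 14)
```

Which round the opponent defects on does not matter for the total, since only the count of each move pair enters the sum.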
I would consider it obvious that a rational agent should refuse this unfair bargain. Otherwise agents with knowledge of your source code will offer you only this bargain, instead of the (12, 12) of mutual cooperation on every round; they will exploit your willingness to accept a result on the Pareto boundary in which almost all of the gains from trade go to them.
(Newer ideas:)
Generalizing: Once you have a notion of a 'fair' result - in this case (12, 12) - then an agent which accepts any outcome in which it does worse than the fair result, while the opponent does better, is 'exploitable' relative to this fair bargain. As with the Nash equilibrium, the only way you should do worse than 'fair' is if the opponent also does worse.
So we wrote down on the whiteboard an attempted definition of unexploitability in cooperative games as follows:
"Suppose we have a [magical] definition N of a fair outcome. A rational agent should only do worse than N if its opponent does worse than N, or else [if bargaining fails] should only do worse than the Nash equilibrium if its opponent does worse than the Nash equilibrium." (Note that this definition precludes giving in to a threat of blackmail.)
(Key possible-innovation:)
It then occurred to me that this definition opened the possibility for other, intermediate bargains between the 'fair' solution on the Pareto boundary, and the Nash equilibrium.
Suppose the other agent has a slightly different definition of fairness and they think that what you consider to be a payoff of (12, 12) favors you too much; they think that you're the one making an unfair demand. They'll refuse (12, 12) with the same feeling of indignation that you would apply to (9, 14).
Well, if you give in to an arrangement with an expected payoff of, say, (11, 13) as you evaluate payoffs, then you're giving other agents an incentive to skew their definitions of fairness.
But it does not create poor incentives (AFAICT) to accept instead a bargain with an expected payoff of, say, (10, 11) which the other agent thinks is 'fair'. Though they're sad that you refused the truly fair outcome of (as you count utilons) (11, 13) and that you couldn't reach the Pareto boundary together, still, this is better than the Nash equilibrium of (8, 8). And though you think the bargain is unfair, you are not creating incentives to exploit you. By insisting on this definition of fairness, the other agent has done worse for themselves than they would have under (12, 12). The other agent probably thinks that (10, 11) is 'unfair' slanted your way, but they likewise accept that this does not create bad incentives, since you did worse than the 'fair' outcome of (11, 13).
There could be many acceptable negotiating equilibria between what you think is the 'fair' point on the Pareto boundary and the Nash equilibrium, so long as each step down in what you think is 'fairness' reduces the total payoff to the other agent, even if it reduces your own payoff even more. This resists exploitation and avoids creating an incentive for claiming that you have a different definition of fairness, while still holding open the possibility of some degree of cooperation with agents who honestly disagree with you about what's fair and are trying to avoid exploitation themselves.
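The example with two differing fairness ideals can be checked mechanically. This is a rough sketch, assuming all payoffs are written as (my payoff, their payoff) in my utilons, that my fair point is (12, 12) and theirs is (11, 13), and that each side applies the same rule: only do worse than your own fair point if the other side also does worse than it. The candidate list and function names are illustrative.

```python
def acceptable(own, other, fair_own, fair_other):
    """An agent accepts if it does at least as well as its own fair
    point, or if the other agent does worse than that fair point."""
    return own >= fair_own or other < fair_other


MY_FAIR = (12, 12)     # what I consider fair
THEIR_FAIR = (11, 13)  # what they consider fair, in the same coordinates
CANDIDATES = [(12, 12), (11, 13), (10, 11), (8, 8)]


def mutually_acceptable(candidates):
    """Return the candidates that neither agent refuses."""
    deals = []
    for me, them in candidates:
        ok_for_me = acceptable(me, them, MY_FAIR[0], MY_FAIR[1])
        # From their side the roles swap: 'own' is their payoff.
        ok_for_them = acceptable(them, me, THEIR_FAIR[1], THEIR_FAIR[0])
        if ok_for_me and ok_for_them:
            deals.append((me, them))
    return deals


deals = mutually_acceptable(CANDIDATES)
print(deals)  # [(10, 11), (8, 8)]
```

Each agent refuses the other's ideal point, yet (10, 11) survives both checks and leaves both better off than the (8, 8) fallback.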
This translates into an informal principle of negotiations: Be willing to accept unfair bargains, but only if (you make it clear) both sides are doing worse than what you consider to be a fair bargain.
I haven't seen this advocated before even as an informal principle of negotiations. Is it in the literature anywhere? Someone suggested Schelling might have said it, but didn't provide a chapter number.
ADDED:
Clarification 1: Yes, utilities are invariant up to a positive affine transformation so there's no canonical way to split utilities evenly. Hence the part about "Assume a magical solution N which gives us the fair division." If we knew the exact properties of how to implement this magical solution, taking it at first for magical, that might give us some idea of what N should be, too.
Clarification 2: The way this might work is that you pick a series of increasingly unfair-to-you, increasingly worse-for-the-other-player outcomes whose first element is what you deem the fair Pareto outcome: (100, 100), (98, 99), (96, 98). Perhaps stop well short of Nash if the skew becomes too extreme. Drop to Nash as the last resort. The other agent does the same, starting with their own ideal of fairness on the Pareto boundary. Unless one of you has a completely skewed idea of fairness, you should be able to meet somewhere in the middle. Both of you will do worse against a fixed opponent's strategy by unilaterally adopting more self-favoring ideas of fairness. Both of you will do worse in expectation against potentially exploitive opponents by unilaterally adopting looser ideas of fairness. This gives everyone an incentive to obey the Galactic Schelling Point and be fair about it. You should not be picking the descending sequence in an agent-dependent way that incentivizes, at cost to you, skewed claims about fairness.
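The shape of such a descending sequence can be sanity-checked with a short sketch. It assumes both payoffs sit on a shared scale supplied by the magical N (otherwise the gap between them means nothing, per Clarification 1), and the name is_valid_ladder is hypothetical.

```python
def is_valid_ladder(ladder):
    """ladder: list of (mine, theirs) outcomes, starting at my fair point.

    Each step down must be worse for the other player and more unfair
    to me (my payoff minus theirs shrinks). Together these imply that
    my own payoff also falls at every step.
    """
    for (m0, t0), (m1, t1) in zip(ladder, ladder[1:]):
        worse_for_them = t1 < t0
        more_unfair_to_me = (m1 - t1) < (m0 - t0)
        if not (worse_for_them and more_unfair_to_me):
            return False
    return True


LADDER = [(100, 100), (98, 99), (96, 98)]  # the sequence from the text

print(is_valid_ladder(LADDER))                    # True
print(is_valid_ladder([(100, 100), (99, 101)]))   # False: they gain
```

The second call fails because the step rewards the other player for holding out, which is exactly the incentive the sequence is meant to avoid.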
Clarification 3: You must take into account the other agent's costs and other opportunities when ensuring that the net outcome, in terms of final utilities, is worse for them than the reward offered for 'fair' cooperation. Offering them the chance to buy half as many paperclips at a lower, less fair price, does no good if they can go next door, get the same offer again, and buy the same number of paperclips at a lower total price.