That works if the AI knows that the other agent will keep its promise, and the other agent knows what the AI will do in the future. In particular, the AI has to know that the other agent will successfully anticipate what the AI will do in the future, even though the AI doesn't yet know that itself. And the AI has to be able to infer all of this from actual sensory experience, not by divine revelation. Hmm, I suppose that's possible.
That's the thing about mathematical proofs: you need to conclusively rule out every possibility. When dealing with something like a super-intelligence, there will be unforeseen circumstances, and nothing short of full mathematical rigour will save you.
Hmm, it's really easy to specify a causal AI along the lines of AIXI (you can even skip the arguments about it being near-optimal). Is there a similarly simple spec of a timeless AI?
I don't know of one off-hand, but I think AIXI can easily be made timeless: just take the bit which says roughly "calculate a probability distribution over all possible outcomes for each possible action" and replace it with "calculate a probability distribution over all possible outcomes for each possible decision".
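A toy sketch might make the action/decision distinction concrete. This is not the AIXI formalism itself, just Newcomb's problem with two argmax rules; all function names and payoffs here are my own illustrative assumptions:

```python
# Toy Newcomb's problem: contrast "maximize over actions" (causal)
# with "maximize over decisions" (timeless). Illustrative sketch only.

def payoff(prediction, action):
    # Box B holds $1,000,000 iff the predictor foresaw one-boxing;
    # box A always holds $1,000 and is taken by two-boxing.
    box_b = 1_000_000 if prediction == "one-box" else 0
    box_a = 1_000 if action == "two-box" else 0
    return box_a + box_b

def causal_choice(fixed_prediction):
    # Outcomes per *action*: the prediction is treated as already
    # fixed, so the chosen action cannot influence it.
    return max(["one-box", "two-box"],
               key=lambda a: payoff(fixed_prediction, a))

def timeless_choice():
    # Outcomes per *decision*: the predictor runs the same decision
    # procedure, so deciding d also sets the prediction to d.
    return max(["one-box", "two-box"],
               key=lambda d: payoff(d, d))

print(causal_choice("one-box"))   # two-box (regardless of prediction)
print(timeless_choice())          # one-box
```

The causal rule two-boxes whatever it believes was predicted; the timeless rule one-boxes and takes the million.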
This may be worth looking into further; I haven't looked very deeply into the literature around AIXI.
When I think through what the causal AI would do: it would be in a situation where it doesn't know whether the actions it chooses are happening in the real world or inside the other agent's simulation of it (run when the other agent predicts what the AI will do). If it reasons correctly about this uncertainty, the causal AI might do the right thing anyway. I'll have to think about this. Thanks for the pointer.
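The "might do the right thing anyway" intuition can be sketched numerically. Assume (my own toy model, not anything established) that the causal agent assigns probability p to currently being inside the predictor's simulation of it, in which case its action there causally fixes the real prediction:

```python
# Newcomb's problem under simulation uncertainty: a causal expected-
# utility calculation where, with probability p_sim, the agent is the
# simulated copy and its action determines the prediction. All names
# and probabilities are illustrative assumptions.

def payoff(prediction, action):
    box_b = 1_000_000 if prediction == "one-box" else 0
    box_a = 1_000 if action == "two-box" else 0
    return box_a + box_b

def causal_expected_payoff(action, p_sim, fixed_prediction):
    in_sim = payoff(action, action)              # action fixes the prediction
    in_world = payoff(fixed_prediction, action)  # prediction already fixed
    return p_sim * in_sim + (1 - p_sim) * in_world

best = max(["one-box", "two-box"],
           key=lambda a: causal_expected_payoff(a, 0.5, "two-box"))
print(best)  # one-box
```

Even with a pessimistic fixed prediction, any non-trivial chance of being the simulated copy makes one-boxing the causal expected-utility maximizer, which matches the comment's suspicion.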
This looks like you might be stumbling towards Updateless Decision Theory, which is IMHO even stronger than TDT and may solve an even wider range of problems.
It could build and deploy an unfriendly AI completely different from itself.
I could come up with an argument for this falling into either category.
Roughly, the importance is that there are only two kinds of truly catastrophic mistakes an AI could make: mistakes which manage to wipe out the whole planet in one shot, and errors in modifying its own code. Everything else can be recovered from.
I'm claiming that the concept of self-modification is useless since it's a special case of engineering. We have to get engineering right, and if we do that, w...
I don't know if this is a little too far afield for even a Discussion post, but people seemed to enjoy my previous articles (Girl Scouts financial filings, video game console insurance, philosophy of identity/abortion, & prediction market fees), so...
I recently wrote up an idea that has been bouncing around my head ever since I watched Death Note years ago - can we quantify Light Yagami's mistakes? Which mistake was the greatest? How could one do better? We can shed some light on the matter by examining DN with... basic information theory.
Presented for LessWrong's consideration: Death Note & Anonymity.
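A toy version of the information-theoretic accounting involved: de-anonymizing one person out of N requires about log2(N) bits, and each leaked fact supplies bits equal to how much it shrinks the suspect pool. The specific populations below are my own rough illustrative figures, not numbers from the essay:

```python
# Bits of anonymity lost per clue: log2(pool before / pool after).
# Populations are illustrative round numbers.
import math

def bits(pool_before, pool_after):
    # Information revealed when a clue narrows the suspect pool.
    return math.log2(pool_before / pool_after)

# Clue: the killer lives in Japan's Kanto region (~43M of ~7B people).
print(bits(7_000_000_000, 43_000_000))   # ~7.3 bits
# Clue: the killings follow a student's schedule (~10M of those 43M).
print(bits(43_000_000, 10_000_000))      # ~2.1 bits
# Budget needed to single out one person on Earth:
print(math.log2(7_000_000_000))          # ~32.7 bits
```

Each clue spends down a fixed ~33-bit anonymity budget, which is what makes "which mistake was the greatest?" a quantitative question.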