If it has any trouble at all achieving its primary goal, won't it just be easier to taunt you into blowing it up? It's indifferent between the two, after all. I mean, this may well be a way of creating a safe AI: create an AI that kills itself if it tries to do anything interesting.
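To make the worry concrete, here is a toy sketch in Python. It is not Armstrong's formalism: the function names, the probabilities, and the assumption that a failed plan is worth 0 are all invented for illustration. It only encodes the reading above, where the explosion is assigned the same utility as success, and asks which plan a plain expected-utility maximizer would pick.

```python
def expected_utility(p_outcome, u_outcome, u_failure=0.0):
    """Expected utility of a plan that reaches its outcome with probability p_outcome.

    u_failure (what the agent gets if the plan fails) is a hypothetical
    simplification, fixed at 0 by default.
    """
    return p_outcome * u_outcome + (1.0 - p_outcome) * u_failure


def preferred_plan(p_goal, p_taunt, u_success=1.0):
    """Return the plan with the higher expected utility, plus both scores.

    p_goal:  assumed chance that pursuing the primary goal succeeds.
    p_taunt: assumed chance that taunting gets the AI blown up.
    """
    # Toy reading of indifference: the explosion is worth exactly as much as success.
    u_explosion = u_success

    plans = {
        "pursue_goal": expected_utility(p_goal, u_success),
        "taunt": expected_utility(p_taunt, u_explosion),
    }
    # On an exact tie, max() keeps the first key, i.e. "pursue_goal".
    best = max(plans, key=plans.get)
    return best, plans


if __name__ == "__main__":
    print(preferred_plan(p_goal=0.3, p_taunt=0.9))
    print(preferred_plan(p_goal=0.9, p_taunt=0.3))
```

Under these made-up numbers, whenever getting blown up is more reliably achievable than the primary goal, the toy agent picks "taunt"; reverse the probabilities and it goes back to pursuing the goal.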
I certainly agree with the conclusion of the note:
All in all, great care must be used to maintain indifference with a self-improving AI.
'Indifference', remember, is just 'U(S) = U(T)'. That is, the value of a successful defection equals the value of the explosion going off? This is really problematic depending on how we cash out 'defection'. For example, suppose defection involves taking over everything on Earth, including the detonator. Then defection looks better than taunting, since there are two avenues to 'U(S) = U(T)':
I just noticed that LessWrong has not yet linked to FHI researcher Stuart Armstrong's brief technical report, Utility Indifference (2010). It opens: