One question I'm not sure about - and remember, the comment above is just a sketch - is whether it can be formally shown that there is always a 'sploit.
(If so, then what you would need for security is to make such a 'sploit infeasible for practical purposes. The question in security is always "what's the threat model?")
For purposes of 'sploits on mere human minds, I think it's enough to note that, in security terms, the human mind is somewhere around the level of Windows 98, and that general intelligence is a fairly late addition which only occasionally affects what the human actually does.
There isn't always an exploit, at least for certain classes of exploit.
For instance, when we compile a statically checked language like Java and verify the bytecode, we can guarantee that the program won't take over the VM it's executing in. That rules out exploits of some varieties: we can limit the program's CPU time and memory use, and we can inspect and filter all of its communications with any other programs or data. This is essentially a formal proof of certain properties of the program's behavior.
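As a toy illustration of enforcing such a limit from outside, here's a minimal Java sketch using the standard executor API. It only bounds wall-clock time; a real sandbox would also need memory limits and verified bytecode, and the class and task here are hypothetical, my own example rather than anything from the discussion above:

```java
import java.util.concurrent.*;

// Minimal sketch: the host bounds how long an untrusted task may run.
public class SandboxSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        // Stand-in for untrusted code; in a real sandbox this would be
        // verified bytecode loaded through a restrictive class loader.
        Callable<String> untrusted = () -> {
            Thread.sleep(10_000);  // simulates a task that runs too long
            return "done";
        };

        Future<String> result = executor.submit(untrusted);
        try {
            // The host decides how much time the task gets, not the task.
            System.out.println(result.get(2, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            result.cancel(true);   // interrupt the runaway task
            System.out.println("Task exceeded its time budget; cancelled.");
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The point is that the guarantee lives in the host, outside the untrusted code's reach, which is what makes it proof-like rather than a matter of trusting the code's good behavior.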
The question is whether we can prove enough interesting properties about whatever we're about to run. That depends mostly on the design of the AI mind executing (or inspecting) the new ideas.
Hey. I'm relatively new around here. I've read the Singularity Institute's core reading, quite a few Less Wrong articles, and Eliezer Yudkowsky's essay on Timeless Decision Theory. This question is phrased in terms of Christianity, because that's where I thought of it, but I think it applies to lots of other religions and non-religious beliefs as well.
According to Christianity, belief makes you stronger and better. The Bible claims that people who believe are substantially better off, both in life and after death. So if a self-modifying decision maker decides, even for a second, that the Christian faith is accurate, won't it modify its decision-making algorithm to never doubt the truth of Christianity? Given what it knows at that moment, that is the best decision.
And so, if we build a self-modifying AI, switch it on, and in its first ten milliseconds it comes to believe in the Christian god, wouldn't that permanently cripple it, and probably cause it to fail most definitions of Friendly AI?
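To make the worry concrete, here's a toy sketch of the failure mode I mean (my own hypothetical example, not anyone's proposed design): an agent whose belief-update rule is itself a modifiable component, where a momentary credence above 1/2 in a doctrine that rewards unwavering belief is enough to lock that belief in.

```java
import java.util.function.DoubleUnaryOperator;

// Toy model of belief lock-in via self-modification.
public class LockInSketch {
    // Current credence in the doctrine, and the rule used to update it.
    double credence = 0.51;  // a momentary blip above 0.5
    DoubleUnaryOperator updateRule = evidence -> clamp(credence + evidence);

    static double clamp(double p) { return Math.max(0.0, Math.min(1.0, p)); }

    // By the agent's current lights, the doctrine says unwavering belief
    // pays off, so while credence > 0.5 the agent judges self-modification
    // to be the best action.
    void maybeSelfModify() {
        if (credence > 0.5) {
            // Replace the update rule with one that ignores all evidence.
            updateRule = evidence -> 1.0;
        }
    }

    public static void main(String[] args) {
        LockInSketch agent = new LockInSketch();
        agent.maybeSelfModify();  // the fatal ten milliseconds
        // Strong counter-evidence arrives, but the new rule discards it.
        agent.credence = agent.updateRule.applyAsDouble(-0.4);
        System.out.println(agent.credence);  // prints 1.0: locked in
    }
}
```

Once updateRule ignores evidence, no later observation can lower the credence, which is exactly the permanent crippling I'm asking about.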
When designing an AI, how do you counter this problem? Have I missed something?
Thanks, GSE
EDIT: Yep, I had misunderstood what TDT was. I just meant self-modifying systems. Also, I'm wrong.