to make a guaranteed friendly AI you probably need to prove theorems about your AI design.
Wouldn't this be a level mismatch in a multi-level AI architecture? Like, proving things about the low-level neural computational substrate instead of about the conceptual level where the actual cognition would take place, and where the actual friendliness would be defined? [and this level can't be isomorphic to any formal logical system, except in symbolic AI, which doesn't work]
figuring out the right decision theory in the presence of copies seems to be a necessary step on the road to FAI
And again, a conceptual-level understanding should do the trick. Like, knowing that I play PD against myself would be sufficient to cooperate. Besides, as EY frequently says, it's really hard to find oneself in a true PD. Usually it's an iterated PD, or one of Schelling's conflict games [BTW, huge thanks for recommending his book in one of your posts!]
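To make the "PD against myself" point concrete, here is a minimal Python sketch. The payoff numbers and function names are illustrative assumptions, not anything from the thread; the only thing it shows is that if the other player is an exact copy, only the matching outcomes (C, C) and (D, D) are reachable, so cooperating wins, whereas against an independent opponent defecting dominates.

```python
# Row player's payoff in a one-shot Prisoner's Dilemma.
# The specific numbers are an assumption; any payoffs with
# temptation > reward > punishment > sucker behave the same way.
PAYOFF = {
    ("C", "C"): 3,  # mutual cooperation (reward)
    ("C", "D"): 0,  # I cooperate, opponent defects (sucker)
    ("D", "C"): 5,  # I defect, opponent cooperates (temptation)
    ("D", "D"): 1,  # mutual defection (punishment)
}


def best_move_against_independent_opponent(opponent_move):
    """Best reply when the opponent's move is fixed independently of mine."""
    return max("CD", key=lambda my_move: PAYOFF[(my_move, opponent_move)])


def best_move_against_exact_copy():
    """Best move when the opponent is a copy that necessarily mirrors my move."""
    # Only the diagonal outcomes (C, C) and (D, D) can actually occur.
    return max("CD", key=lambda my_move: PAYOFF[(my_move, my_move)])


if __name__ == "__main__":
    for opponent_move in "CD":
        print("vs independent", opponent_move, "->",
              best_move_against_independent_opponent(opponent_move))
    print("vs exact copy ->", best_move_against_exact_copy())
```

This is a toy, of course: it assumes the agent already knows for certain that the opponent is a perfect copy, which is exactly the part a real decision theory has to justify.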
If a multilevel architecture (whatever it is) makes provable friendliness impossible, then FAI can't use it.
I imagine the future FAI as closer to AIXI, which works fine without multilevel architecture, than to the Lisp programs of the 70s.
Some people on LW have expressed interest in what's happening on the decision-theory-workshop mailing list. Here's an example of the kind of work we're trying to do there.
In April 2010, Gary Drescher proposed the "Agent simulates predictor" problem, or ASP, which shows how agents with lots of computational power sometimes fare worse than agents with limited resources. I'm posting it here with his permission:
About a month ago I came up with a way to formalize the problem, along the lines of my other formalizations:
Also, Wei Dai has a tentative new decision theory that solves the problem, but this margin (and my brain) is too small to contain it :-)
Can LW generate the kind of insights needed to make progress on problems like ASP? Or should we keep working as a small clique?