Yes, it does seem safer to build non-self-modifying AIs. But I'm not quite saying that should be the limit. I'm saying that any AI that can self-modify ought to have a hard barrier where there is code that can't be modified.
I know there has been excitement here about a transhuman AI being able to bypass pretty much any control humans could devise (that excitement is the topic that first brought me here, in fact). But going for a century or so with AIs that can't self-modify seems like a pretty good precaution, no?
But what counts as "self-modification"?
Simply making a promise could be considered self-modification, since you presumably behave differently after making the promise than you would have counterfactually.
Learning some fact about the world could be considered self-modification, for the same reason.
Can we come up with a useful classification scheme, distinguishing safe forms of self-modification from unsafe forms? Or, what may amount to the same thing, can we give criteria for rationally self-modifying, for each class of self-modification? That is, for example, when is it rational to make promises? When is it rational to update our beliefs about the world?
Link: physicsandcake.wordpress.com/2011/01/22/pavlovs-ai-what-did-it-mean/
Suzanne Gildert basically argues that any AGI that can considerably self-improve would simply alter its reward function directly. I'm not sure how she arrives at the conclusion that such an AGI would likely switch itself off. Even if an abstract general intelligence would tend to alter its reward function, wouldn't it do so indefinitely rather than switching itself off?
If it wants to maximize its reward by increasing a numerical value, why wouldn't it consume the universe doing so? Maybe she had something in mind along the lines of an argument by Katja Grace:
Link: meteuphoric.wordpress.com/2010/02/06/cheap-goals-not-explosive/
I am not sure if that argument would apply here. I suppose the AI might hit diminishing returns but could again alter its reward function to prevent that, though what would be the incentive for doing so?
ETA:
I left a comment over there:
ETA #2:
What else I wrote: