It feels like community discussion has largely abandoned the topic of self-modifying AGI, which makes sense, because there are more fundamental things to figure out first.
But I think we should revisit the question, at least in the context of narrow AI, because the tools are now available to accomplish exactly this at several levels. This thought was prompted by reading a blog post, Writing BPF Code in Rust.
BPF stands for Berkeley Packet Filter. It was originally designed for network traffic analysis, but has since been extended (as eBPF) to tracing the Linux kernel. The pitch is that it can now be used to let userspace code run inside the kernel, which is to say, to change how the kernel behaves.
The ability to use code to write code is very old; it is arguably the core insight of LISP. But the practice is becoming increasingly common, including things like writing OCaml or Haskell to generate correct C code, and increasingly powerful code-generation tricks in compilers.
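The LISP-flavored idea of code writing code can be sketched in a few lines of Python; the names here are illustrative, not from any particular library:

```python
# A minimal sketch of "code that writes code": we build the source of a
# specialized function as a string, then compile it into a live function.

def make_power_fn(exponent):
    """Generate source for a function computing x**exponent, then compile it."""
    src = f"def power(x):\n    return x ** {exponent}\n"
    namespace = {}
    exec(src, namespace)  # compile and run the generated source
    return namespace["power"]

cube = make_power_fn(3)
print(cube(4))  # prints 64
```

The generated function is a first-class object like any hand-written one; nothing downstream can tell it was synthesized at runtime.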
It's also possible now to change how compilers work. The Julia language has a package named Cassette.jl, which allows dynamically injecting behavior into the just-in-time compiler. As it happens, both this Julia trick and the BPF trick for the Linux kernel rely on LLVM.
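Cassette.jl works by injecting passes into Julia's JIT, which there is no direct Python analogue for, but the user-facing effect (running pre-existing code under a context that intercepts its calls) can be loosely sketched with the standard `sys.settrace` hook:

```python
import sys

# A loose analogue of Cassette.jl-style "overdubbing": observe every call
# made by existing code, without that code having been written with any
# awareness of the tracing context.

def helper(i):
    return i * i

def existing_code(n):
    # Pretend this was written long ago by an author unaware of tracing.
    total = 0
    for i in range(n):
        total += helper(i)
    return total

calls = []

def tracer(frame, event, arg):
    if event == "call":
        calls.append(frame.f_code.co_name)
    return tracer

sys.settrace(tracer)
result = existing_code(4)
sys.settrace(None)

print(result)  # 0 + 1 + 4 + 9 = 14
print(calls)   # every function entered while the context was active
```

The essential property is the same as in Cassette.jl: the traced code is unmodified, and the injected behavior is decided by whoever sets up the context, not by the code's author.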
All of this means we can now write code that modifies itself, that modifies the way it is compiled, and that modifies the way the environment runs it (assuming the environment is Linux). The infrastructure seems to be in place for large-scale, multi-layer self-modification, which makes questions about self-modification testable in a way they weren't before.

Notably, the BPF and Cassette.jl tricks don't require software written for them explicitly; they work on pre-existing code whose authors had no idea such capabilities existed. These methods are also independent of ideas like Software 2.0/Differentiable Programming, and combined they make me wonder whether the safety problems we are concerned with might actually start appearing at the level of applications first.
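The lowest layer, code modifying itself at runtime, can be sketched in Python. The snippet keeps the function's source alongside the program (standing in for what `inspect.getsource` would recover from a real module) and rewrites it while running; the transform is deliberately trivial, since the point is only that source, AST, and compiled code are all reachable from within the running program:

```python
import ast

# A program that rewrites one of its own functions at runtime: parse the
# source, transform the AST (turning addition into multiplication), and
# recompile, replacing the original definition.

COMBINE_SRC = "def combine(a, b):\n    return a + b\n"

exec(COMBINE_SRC)     # define the original function
print(combine(3, 4))  # prints 7

class AddToMul(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Mult()
        return node

tree = AddToMul().visit(ast.parse(COMBINE_SRC))
ast.fix_missing_locations(tree)
exec(compile(tree, "<rewritten>", "exec"))  # replace combine in place
print(combine(3, 4))  # prints 12
```

Stack this kind of rewrite on top of a Cassette.jl-style compiler hook and a BPF-style kernel hook, and you get the multi-layer picture described above.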
I'd be interested if you have any more specific ideas here. I can't think of any myself, because:
The question of "How can an AGI self-modify into a safe and beneficial AGI?" seems pretty similar to "How can a person program a safe and beneficial AGI?", at least until the system is so superhumanly advanced that it can hopefully figure out the answer itself. So in that sense, everyone is thinking about it all the time.
The challenges of safe self-modification don't seem wildly different from the challenges of safe learning (after all, learning changes the agent too), including things like goal stability, ontological crises, etc. And whereas learning is basically mandatory, deeper self-modification could (probably, IMO) be turned off if necessary, again at least until the system is so superhumanly advanced that it can solve the problem itself. So in that sense, at least some people are sorta thinking about it these days.
I dunno, I just can't think of any experiment we could do with today's AI in this domain that would discover or prove something that wasn't already obvious. (...Which of course doesn't mean that such experiments don't exist.)