Suppose there is a useful formulation of the alignment problem that is mathematically unsolvable. Suppose that, as a corollary, it is also impossible to modify your own mind while guaranteeing any non-trivial property of the resulting mind.
Would that prevent a new AI from trying to modify itself?
Has this direction been explored before?
Thanks!
Do you have any specific posts in mind?
To be clear, I'm not suggesting that because of this possibility we can just hope that this is how it plays out and we will get lucky.
However, if we could find a hard limit like this, it seems like it would make the problem more tractable. It doesn't have to exist simply because we want it to, but searching for it still seems like a good idea.
There are a hundred problems to solve, but this could at least avoid the main bad scenario: an AI rapidly self-improving. Improving its hardware wouldn't be trivial for a human-level AI, and it wouldn't have the options present in other scenarios, such as reliably rewriting its own mind. Scaling beyond a single machine also seems likely to be a significant barrier.
It could still create millions of copies of itself. That's still a problem, but also still a better problem to have than a single AI with no coordination overhead.