What if someone proves that advanced AGI (or even some dumb but sophisticated AI) can neither be "contained" nor reliably guaranteed to be "friendly"/"aligned"/etc. (whatever that may mean)? It could be something vaguely Gödelian, along the lines of "any sufficiently advanced system ...".
Those Gödelian-style arguments (like the no-free-lunch theorems) work much better in theory than in practice. We only need a reasonably high probability of the AI being contained or friendly or aligned...
I imagine a reward function would be implemented such that it takes the current state of the universe as input. Obviously it wouldn't be the state of the entire universe, so it would be limited to some finite observable portion of it, whatever the sensory capabilities of the agent are. So if the "capacity" of the agent were increased, that could mean different things: it could mean that the agent now has a greater share of the universe it can observe at a given point in time, or it could mean that the agent has more computational power to calculate some function of the current input. Depending on how that increased capacity manifests itself, it could drastically alter the final output of the reward function. So a properly "aligned" reward function would essentially have to remain aligned under all possible changes to the implementation details. And the difficulty with that is: can an agent reliably "predict" the outcome of its own computations under capacity increases?
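As a toy sketch of that point (the world array, window and radius here are made-up placeholders, not anything from the thread): the same reward computation returns different values for the same underlying world once the agent's observable window, i.e. its capacity, grows.

```python
import numpy as np

# Hypothetical illustration: the "world" is a 1-D array of local values, and the
# reward is computed only over the slice of it the agent can currently observe.
rng = np.random.default_rng(0)
world = rng.normal(size=1_000)

def reward(world, center, radius):
    """Reward as seen by an agent observing world[center-radius : center+radius]."""
    window = world[max(0, center - radius): center + radius]
    return window.mean()

# Same reward function, same world, different observational capacity -> different value.
print(reward(world, center=500, radius=10))    # low-capacity agent
print(reward(world, center=500, radius=400))   # high-capacity agent
```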
What is your take on the "unfriendly natural intelligence" problem? Humans don't seem aligned with human happiness in today's normal circumstances. Do you think this is related, in that humans are aligned in situations where they have less power? Or are humans just unaligned jerks in all circumstances?
Another way to look at this is as a result of "ought implies can". A function mapping each possible action to the correct choice is not guaranteed to produce correct/desirable outcomes for no-longer-impossible inputs.
Also, this brings forth a confusion in the word "aligned". If it's foundationally aligned with underlying values, then increased capacity doesn't change anything. If it's only aligned in the specific cases that were envisioned, then under increased capacity it may or may not remain aligned. Definitionally, if r is aligned only for some cases, then it's not aligned for all cases. So unpack another level: why would you not make r align generally, rather than only for current capacities?
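A minimal sketch of the "ought implies can" point (the action names and the fallback rule are invented for illustration): a choice table defined only over the actions that used to be possible gives essentially arbitrary answers on inputs that have since become possible.

```python
# The table encodes the "correct choice" for every action that was possible
# when it was written; it says nothing about actions that become possible later.
correct_choice = {
    "tell_truth": "do it",
    "tell_white_lie": "avoid it",
}

def choose(action: str) -> str:
    # Some extrapolation rule has to be picked; nothing constrains it to be desirable.
    return correct_choice.get(action, "do it")

print(choose("tell_truth"))                 # covered case: sensible answer
print(choose("rewrite_everyones_values"))   # newly possible input: answer is arbitrary
```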
Crossposted at the Intelligent Agents Forum.
I think that Paul Christiano's ALBA proposal is good in practice, but has conceptual problems in principle.
Specifically, I don't think it makes sense to talk about bootstrapping an "aligned" agent to one that is still "aligned" but that has an increased capacity.
The main reason is that I don't see "aligned" as a definition that makes sense independently of capacity.
These are not the lands of your forefathers
Here's a simple example: let r be a reward function that is perfectly aligned with human happiness within ordinary circumstances (and within a few un-ordinary circumstances that humans can think up).
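A toy numerical sketch of such an r (the scalar states and the specific functions are invented for illustration, not part of the argument): "ordinary circumstances" are states with |s| ≤ 1, and there the proxy r tracks true happiness to within 0.01.

```python
import numpy as np

# Toy model: states are scalars, "ordinary circumstances" are |s| <= 1,
# true happiness peaks near s = 0.5, and the proxy reward r differs from it
# only by a term that is negligible in-range.
def true_happiness(s):
    return 1.0 - (s - 0.5) ** 2

def r(s):
    return 1.0 - (s - 0.5) ** 2 + 0.01 * s ** 4   # extra term invisible for |s| <= 1

ordinary = np.linspace(-1.0, 1.0, 201)
print(np.max(np.abs(true_happiness(ordinary) - r(ordinary))))   # ~0.01: effectively aligned
```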
Then the initial agent - B0, a human - trains a reward r1 for an agent A1. This agent is limited in some way - maybe it doesn't have much speed or time - but the aim is for r1 to ensure that A1 is aligned with B0.
Then the capacity of A1 is increased to B1, a slow powerful agent. It computes the reward r2 to ensure the alignment of A2, and so on.
The nature of the Bj agents is not defined - they might be algorithms calling Ai for i ≤ j as subroutines, humans may be involved, and so on.
If the humans are unimaginative and don't deliberately seek out more extreme and exotic test cases, the best case scenario is for ri → r as i → ∞.
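Schematically, the bootstrapping structure just described might look like the sketch below. Agent, train_reward, amplify and the capacity numbers are placeholders, not Christiano's actual ALBA construction; the point is only that each distilled reward ri is constrained solely on states the current overseer can evaluate.

```python
from dataclasses import dataclass
from typing import Callable

State = float
Reward = Callable[[State], float]

@dataclass
class Agent:
    reward: Reward   # the reward function the agent is trained against
    capacity: int    # rough proxy for how extreme a state it can reach or evaluate

def train_reward(overseer: Agent) -> Reward:
    # Placeholder: the overseer can only label states within its own capacity,
    # so the distilled reward is only pinned down on that region.
    return overseer.reward

def amplify(agent: Agent) -> Agent:
    # Placeholder for the capacity increase (more speed, time, compute, sensors...).
    return Agent(reward=agent.reward, capacity=agent.capacity * 10)

B = Agent(reward=lambda s: 1.0 - (s - 0.5) ** 2, capacity=1)   # B0: the human overseer
for i in range(1, 4):
    r_i = train_reward(B)                         # B_{i-1} trains r_i ...
    A = Agent(reward=r_i, capacity=B.capacity)    # ... for the limited agent A_i
    B = amplify(A)                                # A_i is amplified into the overseer B_i
    print(f"B{i} capacity: {B.capacity}, but r{i} was only tested up to capacity {A.capacity}")
```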
And eventually there will be an agent An that is powerful enough to overwhelm the whole system and take over. It will do this in full agreement with Bn-1, because they share the same objective. And then An will push the world into extra-ordinary circumstances and proceed to maximise r, with likely disastrous results for us humans.
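Continuing the toy model from above (redeclared so the snippet stands alone): once the agent can reach states far outside |s| ≤ 1, maximising r selects an extreme state where true happiness is disastrous, even though r and true happiness never visibly diverged in the region that was tested.

```python
import numpy as np

# Same toy proxy as before: matches true happiness for |s| <= 1, diverges far outside.
def true_happiness(s):
    return 1.0 - (s - 0.5) ** 2

def r(s):
    return 1.0 - (s - 0.5) ** 2 + 0.01 * s ** 4

reachable = np.linspace(-100.0, 100.0, 20001)   # a capable agent's much larger option set
best_for_r = reachable[np.argmax(r(reachable))]
print(best_for_r)                   # an extreme state, ~100
print(true_happiness(best_for_r))   # catastrophically low true happiness, ~ -9899
```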
The nature of the problem
So what went wrong? At what point did the agents go out of alignment?
In one sense, at An. In another sense, at A1 (and, in another interesting sense, at B0, the human). The reward r was aligned as long as the agent stayed near the bounds of the ordinary. As soon as it was no longer restricted to that, it went out of alignment, not because of goal drift, but because of a capacity increase.