Crossposted at the Intelligent Agents Forum.
I think that Paul Christiano's ALBA proposal is good in practice, but has conceptual problems in principle.
Specifically, I don't think it makes sense to talk about bootstrapping an "aligned" agent into one that is still "aligned" but has increased capacity.
The main reason is that I don't see "aligned" as a definition that makes sense separately from capacity.
These are not the lands of your forefathers
Here's a simple example: let r be a reward function that is perfectly aligned with human happiness within ordinary circumstances (and within a few un-ordinary circumstances that humans can think up).
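To make this concrete, here is a toy sketch (all names and numbers hypothetical, not from the original post) of a reward that is perfectly aligned on ordinary states but diverges on extreme ones:

```python
# Toy illustration (hypothetical functions): a proxy reward r that matches
# true human happiness on "ordinary" states but diverges on extreme ones.

def true_happiness(x: float) -> float:
    # Stand-in for what we actually care about; happiest at x = 0.
    return -abs(x)

def proxy_reward(x: float) -> float:
    # Agrees exactly with true_happiness for |x| <= 10 (ordinary
    # circumstances), but keeps rewarding larger |x| beyond that range.
    return -abs(x) if abs(x) <= 10 else abs(x) - 20

ordinary = [-5.0, 0.0, 7.5]
extreme = [100.0, 1e6]

# Within ordinary circumstances the proxy is perfectly aligned...
assert all(proxy_reward(x) == true_happiness(x) for x in ordinary)

# ...but an optimiser powerful enough to reach extreme states will
# prefer them under the proxy, even though true happiness there is awful.
print(max(ordinary + extreme, key=proxy_reward))  # an extreme state wins
```

No finite set of in-distribution tests distinguishes `proxy_reward` from `true_happiness`; the divergence only appears once the optimiser can reach states outside the ordinary range.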
Then the initial agent - B0, a human - trains a reward r1 for an agent A1. This agent is limited in some way - maybe it doesn't have much speed or time - but the aim is for r1 to ensure that A1 is aligned with B0.
Then the capacity of A1 is increased, yielding B1, a slow but powerful agent. It computes the reward r2 to ensure the alignment of A2, and so on.
The nature of the Bj agents is not defined - they might be algorithms calling Ai for i ≤ j as subroutines, humans may be involved, and so on.
If the humans are unimaginative and don't deliberately seek out more extreme and exotic test cases, the best case scenario is for ri → r as i → ∞.
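The bootstrapping chain and its best-case behaviour can be sketched numerically (all names and error rates here are hypothetical placeholders, not part of ALBA):

```python
# Minimal sketch (hypothetical model) of the chain above: each overseer
# B_j distils a reward r_{j+1} for agent A_{j+1}, whose capacity is then
# amplified into B_{j+1}. In the best case, each round's reward
# approximates r more closely, so r_i -> r as i grows.

def r(x: float) -> float:
    # The reward that is aligned within ordinary circumstances.
    return -abs(x)

def distil(error: float):
    # B_j trains r_{j+1}; modelled here as r plus a shrinking error term.
    return lambda x: r(x) + error

errors = []
error = 1.0  # r_1's approximation error (arbitrary starting value)
for _ in range(10):
    r_i = distil(error)                 # B_j -> r_{j+1} for A_{j+1}
    errors.append(abs(r_i(0.0) - r(0.0)))
    error /= 2                          # assume each round halves the error

print(errors[0], errors[-1])  # the error shrinks toward 0: r_i -> r
```

The convergence target is r itself, which is exactly the problem: the limit of the process is a reward that is only aligned within ordinary circumstances.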
And eventually there will be an agent An that is powerful enough to overwhelm the whole system and take over. It will do this in full agreement with Bn-1, because they share the same objective. And then An will push the world into extra-ordinary circumstances and proceed to maximise r, with likely disastrous results for us humans.
The nature of the problem
So what went wrong? At what point did the agents go out of alignment?
In one sense, at An. In another sense, at A1 (and, in another interesting sense, at B0, the human). The reward r was aligned, as long as the agent stayed near the bounds of the ordinary. As soon as it was no longer restricted to that, it went out of alignment, not because of a goal drift, but because of a capacity increase.
According to the basic AI drives thesis, (almost) any agent capable of self-modification will self-modify into an expected utility maximiser.
The typical examples are inconsistent utility maximisers, satisficers, and unexploitable agents, and it's easy to think that all agents fall roughly into these broad categories. There's also the observation that, when looking at full policies rather than individual actions, many biased agents become expected utility maximisers (unless they want to lose pointlessly).
Nevertheless... there is an entire category of agents that generically seem not to self-modify into maximisers. These are agents that attempt to maximise f(E(U)), where U is some utility function, E(U) is its expectation, and f is a function that is neither wholly increasing nor wholly decreasing.
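A toy example (illustrative choice of f, not from the original post) shows why such an agent differs from any expected utility maximiser: with a non-monotone f, the f(E(U))-optimal policy can be a strictly mixed one.

```python
# Toy agent maximising f(E(U)) with a non-monotone f.

def f(v: float) -> float:
    # Neither wholly increasing nor decreasing: peaks at E(U) = 0.5.
    return -(v - 0.5) ** 2

# Policies: probability p of taking the action with utility U = 1
# (the alternative action has U = 0), so E(U) = p.
policies = [i / 100 for i in range(101)]

best = max(policies, key=f)
print(best)  # the agent prefers the mixed policy p = 0.5
```

An expected utility maximiser over outcomes always has a deterministic optimal policy here (p = 0 or p = 1), so no self-modification into one preserves this agent's preference for randomisation; that is the sense in which these agents generically resist the basic-AI-drives argument.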