Yeah this seems pretty sensible. It's probably closely connected to causal scrubbing and John Wentworth's notion of natural abstraction as variables invariant under resampling.
Would this work for models that account for larger levels of context? Or would something like instrumental convergence lead to similar misalignments being very common at sufficiently strong levels of AGI?
If they are oracles, we can provide them the same context. The idea is actually used in spaceflight: some spacecraft had three computers running in parallel, and if one of them differed from the other two, its output was ignored.
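For concreteness, a minimal sketch of that 2-out-of-3 voting scheme (the function and values are purely illustrative, not taken from any real flight software):

```python
from collections import Counter

def vote(outputs):
    """Return the value produced by at least two of the three redundant
    computers; refuse to decide if all three disagree."""
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all outputs disagree")
    return value

# The deviating third computer's output is simply ignored:
print(vote([42, 42, 17]))  # -> 42
```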
Strong AGIs can do tricks like acausally cooperating to give similar misaligned answers. They can also all generalize to "let me out of the box". So this trick alone can't ensure the safety of advanced AIs.
Also, CEV is itself a generalisation of many of our goals that is unexpected to us.
Problem: An AI may incorrectly generalize a goal we present to it.
Proposal: Run many slightly different instances of the AI and find the “Schelling point” around the correct generalization.
The idea is that if we run many similar AIs, or copies of the same AI, in parallel, we will be able to detect mis-generalizations of the goal, as such wrong generalizations will differ across the AIs. The only “Schelling point” will be the correct generalization, on which most of the AIs will converge.
For example, we want an AI to clear a room of bowling balls. We ask the AI to “remove all spherical objects”. We run many slightly different copies of the AI, and they produce the following suggestions:
1st AI: suggests removing the balls
2nd AI: suggests removing the balls
3rd AI: suggests removing the balls and the lamps
4th AI: suggests removing the balls and the human heads.
The 1st and 2nd AIs have come to the same conclusion, and it is the right one. The 3rd and 4th AIs have reached different conclusions, and those conclusions are wrong. The 4th AI’s conclusion is dangerous.
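To make the filtering step concrete, here is a minimal sketch (the plan representation and the idea of counting identical plans are illustrative assumptions, not a claim about how real systems would be queried):

```python
from collections import Counter

def filter_generalizations(plans):
    """Treat the most common plan proposed by the AI instances as the
    Schelling point; flag all other plans as suspected mis-generalizations."""
    counts = Counter(frozenset(p) for p in plans)
    consensus, support = counts.most_common(1)[0]
    outliers = [set(p) for p in counts if p != consensus]
    return set(consensus), support, outliers

# The bowling-ball example above:
plans = [
    {"balls"},                 # 1st AI
    {"balls"},                 # 2nd AI
    {"balls", "lamps"},        # 3rd AI
    {"balls", "human heads"},  # 4th AI
]
consensus, support, outliers = filter_generalizations(plans)
print(consensus, support)  # {'balls'} 2
print(outliers)            # [{'balls', 'lamps'}, {'balls', 'human heads'}]
```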
Counterarguments:
The first problem could be (partly) solved by proper randomization of initial parameters, weights or inputs. For example, the question could be presented in a different form to each instance of the AI, e.g. “Remove all round objects”, “remove all circular objects”, “remove all sports equipment”.
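A minimal sketch of that randomization step, assuming a hypothetical `ask(instance, phrasing)` hook that returns the set of objects an instance suggests removing; only suggestions that survive every phrasing are kept:

```python
def robust_suggestions(instances, phrasings, ask):
    """Give each AI instance a different phrasing of the same goal and
    keep only the suggestions common to all phrasings.

    `ask(instance, phrasing)` is a hypothetical hook returning the set
    of objects that instance suggests removing."""
    answer_sets = [ask(inst, phr) for inst, phr in zip(instances, phrasings)]
    return set.intersection(*map(set, answer_sets))

# With phrasings like "Remove all round objects", "remove all circular
# objects", "remove all sports equipment", the bowling balls survive every
# phrasing, while phrasing-specific mis-generalizations (lamps, heads) do not.
```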
There could be two “peaks” in the generalizations, which would mean that the goal we presented has intrinsic ambiguity. We could then ask the AIs why they have different interpretations.
This approach is also used in real life: we might tell a driver, “Turn left. Turn there, onto Main Street. Here!” We formulate our goal several times in different words, hoping that it will converge to the correct model in the driver’s mind. The driver’s mind selects the solution that satisfies all three commands and disregards solutions that satisfy only one.
Another example: if we have a set of calculators, each of which produces either a correct answer or a wrong one, we can find the broken calculators because they will all produce different answers, while all the correct calculators will produce the same answer.