I think this is a good idea, though it's not new. I have written about this at some length (Jessica linked to a few examples, but much of the content here is relevant), and it's what people are usually trying to do in apprenticeship learning. I agree there is probably no realistic scenario where you would use the reduced-impact machinery instead of doing this the way you describe (i.e. the way people already do it).
Having the AI try to solve the problem (rather than simply trying to mimic the human) doesn't really buy you that much, and has big costs. If the human can't solve the problem with non-negligible probability, then you simply aren't going to get a good result using this technique. And if the human can solve the problem, then you can just train on instances where the human successfully solves it. You don't save anything computationally with the conditioning.
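To illustrate why the conditioning saves nothing computationally: "train on instances where the human successfully solves it" is just filtering the demonstration set, with no extra objective for the AI to pursue. A minimal sketch (the record format here is hypothetical):

```python
def successful_demos(demos):
    """Keep only the trials where the human actually solved the task."""
    return [d for d in demos if d["solved"]]

# Hypothetical demonstration records collected from human trials.
demos = [
    {"trajectory": "attempt-1", "solved": False},
    {"trajectory": "attempt-2", "solved": True},
    {"trajectory": "attempt-3", "solved": True},
]

train_set = successful_demos(demos)
print(len(train_set))  # 2
```

If the human succeeds with probability p, this costs roughly a factor of 1/p in data collection and nothing else; there is no separate "solve the problem" goal anywhere in the training setup.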
Bootstrapping seems like the most natural way to improve performance to superhuman levels. I expect bootstrapping to work fine, if you could get the basic protocol off the ground.
The connection to adversarial networks is not really a "parallel." They are literally the same thing (modulo your extra requirement that the system do the task, which is equivalent to Jessica's quantilization proposal but which I think should definitely be replaced with bootstrapping).
I think the most important problem is that AI systems do tasks in inhuman ways, such that imitating a human entails a significant disadvantage. Put a different way, it may be harder to train an AI to imitate a human than to simply do the task. So I think the main question is how to get over that problem. I think this is the baseline to start from, but it probably won't work in general.
Overall I feel more optimistic about approval-direction than imitation for this reason. But approval-direction has its own (extremely diluted) versions of the usual safety concerns, while imitation is pretty great since it literally avoids them altogether. So if imitation could be fixed, that would be great.
This post covers the basic idea of collecting training data with low probability online. This post describes why it might result in very low overhead for aligned AI systems.
I'm not saying the situation is impossible, just really, really unlikely. The AI would need to output big binary files like images, know that someone intended to decode them, and somehow get around statistical detection by AI 2 (steganography is detectable, since the lowest-order bits of an image are not uniformly random).
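A toy version of that statistical test (everything here is illustrative: the "natural" source is a made-up quantized signal, and real image statistics are messier):

```python
import random

random.seed(0)

def natural_pixels(n):
    # Toy "natural" source: quantization biases the lowest-order bit
    # toward 0, so the LSBs are far from uniformly random.
    return [2 * random.randint(0, 127) + (1 if random.random() < 0.1 else 0)
            for _ in range(n)]

def embed(pixels, message_bits):
    # Naive LSB steganography: overwrite each pixel's lowest bit
    # with one bit of the (encrypted, hence uniform-looking) message.
    return [(p & ~1) | b for p, b in zip(pixels, message_bits)]

def lsb_mean(pixels):
    # Detector statistic: fraction of 1-valued LSBs. Uniform message
    # bits push this toward 0.5; the clean source sits near its bias.
    return sum(p & 1 for p in pixels) / len(pixels)

clean = natural_pixels(10_000)
message = [random.getrandbits(1) for _ in range(10_000)]
stego = embed(clean, message)

print(lsb_mean(clean))  # ≈ 0.1
print(lsb_mean(stego))  # ≈ 0.5
```

The naive embedder loses because it drives the low-order bits to a uniform distribution the detector can measure; a smarter attacker would sample from something closer to the clean distribution, which is exactly what makes detection the harder side of the game.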
You might have a point that it's probably not best to publish things produced by the AI on the internet. If this is a serious risk, then it could still be done safely with a small group.
The general lesson from steganography is that it is computationally easier to change a distribution in an important way than to detect such a change. To detect a change you need to consider all the ways a distribution could be meaningfully altered, while to make a change you just have to choose one. From a theory perspective, this is a huge asymmetry that favors the attacker.
This point doesn't seem directly relevant, though, unless someone offers a good reason to actually include the non-imitation goal rather than simply imitating the successful human trials. (Though there are more subtle reasons to care about problematic behavior that is neither penalized nor rewarded by your training scheme. It would be nicer to have positive pressure to do only those things you care about. So maybe the point ends up being relevant after all.)
Actually, in the scheme as you wrote it, there is literally no reason to include this second goal. The distinguisher is already trying to distinguish the generator's behavior from [human conditioned on success], so the generator already has to succeed in order to win the game. But unlike the explicit goal, relying on the distinguisher alone doesn't introduce any potentially problematic optimization pressure, so it just seems better.
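Spelled out, the game is just the usual GAN objective with [human conditioned on success] as the data distribution (notation mine, not from the original post: $D$ the distinguisher, $G$ the generator, $H$ the human policy, $\tau$ a trajectory):

$$\min_G \max_D \; \mathbb{E}_{\tau \sim H \mid \text{success}}\big[\log D(\tau)\big] + \mathbb{E}_{\tau \sim G}\big[\log\big(1 - D(\tau)\big)\big]$$

Any extra "succeed at the task" term would be redundant: $G$ can only fool $D$ by matching the conditioned distribution, which already puts essentially all of its mass on successful behavior.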