As Eric and EY jointly point out, this article seems to be roughly pointing at a simple classifier that places a big penalty on false positives, e.g.:
loss = 100*(1-lambda)*false_positive_rate + (1-lambda)*false_negative_rate + lambda*regularization
After all, the purpose of regularization is to ensure simplicity.
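A minimal sketch of that kind of loss, where the 0.5 decision threshold, the 100x penalty, and the L2 regularizer are all illustrative assumptions rather than details from the article:

```python
import numpy as np

def conservative_loss(y_true, y_pred, weights, lam=0.1, fp_penalty=100.0):
    """Toy version of the loss above. False positives cost ~100x more than
    false negatives, and an L2 term pushes toward simple (small-weight) models.
    Not differentiable as written; in practice you would swap in a weighted
    surrogate such as class-weighted cross-entropy."""
    y_hat = (y_pred >= 0.5).astype(int)                          # hard decisions at an assumed 0.5 threshold
    negatives = max(np.sum(y_true == 0), 1)
    positives = max(np.sum(y_true == 1), 1)
    fp_rate = np.sum((y_hat == 1) & (y_true == 0)) / negatives   # fraction of true negatives called positive
    fn_rate = np.sum((y_hat == 0) & (y_true == 1)) / positives   # fraction of true positives missed
    reg = np.sum(weights ** 2)                                    # simplicity pressure on the weights
    return fp_penalty * (1 - lam) * fp_rate + (1 - lam) * fn_rate + lam * reg
```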
To the extent that conservative concepts are at all different, it should run through the notion of ambiguity detection and KWIK ("knows what it knows") learning. At least that's what machine learning people will round the proposal off to until they're given some other concrete proposal. Though maybe I'm missing something.
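As a toy illustration of what rounding the proposal off to ambiguity detection might look like (the nearest-neighbor rule, the radius, and the function name are invented for the example; KWIK proper is a formal framework with guarantees, not this heuristic):

```python
import numpy as np

def kwik_style_predict(x, positives, negatives, radius=1.0):
    """Rough ambiguity-detection sketch in the KWIK spirit: answer only when
    the labeled data pins the answer down, and otherwise say "I don't know"
    instead of guessing. The distance rule and radius are illustrative."""
    d_pos = min(np.linalg.norm(x - p) for p in positives)   # distance to nearest positive example
    d_neg = min(np.linalg.norm(x - n) for n in negatives)   # distance to nearest negative example
    if d_pos <= radius and d_neg > radius:
        return 1        # confidently positive
    if d_neg <= radius and d_pos > radius:
        return 0        # confidently negative
    return None         # ambiguous: flag for the user rather than guessing
```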
It seems that classifiers trained on adversarial examples may be finding (more) conservative concept boundaries:
Explaining and Harnessing Adversarial Examples (Goodfellow et al.):
"We also found that the weights of the learned model changed significantly, with the weights of the adversarially trained model being significantly more localized and interpretable."
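For concreteness, a sketch of adversarial training in the style of that paper; eps, the 50/50 clean/adversarial mix, and the helper name are illustrative, and it assumes a standard PyTorch classifier that returns logits:

```python
import torch
import torch.nn.functional as F

def fgsm_adversarial_step(model, optimizer, x, y, eps=0.1, alpha=0.5):
    """One training step that mixes clean examples with FGSM-perturbed ones."""
    x = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x), y), x)[0]
    x_adv = (x + eps * grad.sign()).detach()          # fast gradient sign perturbation
    optimizer.zero_grad()
    loss = alpha * F.cross_entropy(model(x), y) + (1 - alpha) * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```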
Would it be fair to summarize the idea of a conservative concept boundary as a classifier that avoids false positives while remaining simple?
Suppose there are existing generic techniques for developing classifiers that prioritize avoiding false positives over avoiding false negatives -- would you not expect them to find a "conservative concept boundary" by default?
Well, presumably there's a trade-off between avoiding false positives and avoiding false negatives. And you want a classifier that tries really hard to avoid false positives, as I understand it.
Well, the purpose is to avoid the AGI classifying potential goal fulfillments in a way that, from the user's perspective, is a "false positive". The reason we have to spend a lot of time thinking about really, really good ways to have the AGI not guess positive labels on things that we wouldn't label as positive is that the training data we present to the AI may be ambiguous in some way we don't know about, or many ways we don't know about. That means the AI does not actually have the information to figure out what we meant by looking for the simplest ways to classify the training cases, and instead has to stick to things that are very, very similar to the positively labeled training instances to minimize the probability of screwing up.
I'm pushing back a little on this "classifier that avoids false positives" description because that's what every classifier is in some sense intended to do; you have to be specific about how, or what approach you're taking, in order to say something that means more than just "classifier that is a good classifier".