One of the old fables in machine learning is the story of the "tank classifier" - a neural network that had supposedly been trained to detect enemy tanks hiding in a forest. It turned out that all the photos of enemy tanks had been taken on sunny days and all the photos of the same forest without the tanks had been taken on cloudy days, meaning that the neural net had really just trained itself to recognize the difference between sunny and cloudy days (or just the difference between bright and dim pictures). (Source.)

We could view this problem as follows: A human looking at the labeled data might have seen several concepts that someone might be trying to point at - tanks vs. no tanks, cloudy vs. sunny days, or bright vs. dim pictures. A human might then ask, "Which of these possible categories did you mean?" and describe the difference using words; or, if it were easier for them to generate pictures than to talk, generate new pictures that distinguish among the possible concepts that could have been meant. Since learning a simple boundary that separates positive from negative instances in the training data is a form of induction, we could call this problem noticing "inductive ambiguities" or "ambiguous inductions".

This problem bears some resemblance to numerous setups in computer science where we can query an oracle about how to classify instances and we want to learn the concept boundary using a minimum number of queries. However, identifying an "inductive ambiguity" doesn't seem to be exactly the same problem, or at least, it's not obviously the same problem. Suppose we consider the tank-classifier problem. Distinguishing levels of illumination in the picture is a very simple concept, so it would probably be the first one learned; then, treating the problem in classical oracle-query terms, we might imagine the AI presenting the user with various random pixel fields at intermediate levels of illumination. The user, not having any idea what's going on, classifies these intermediate-illumination images as 'not tanks', and so the AI soon learns that only quite sunny levels of illumination count as positive. Every query sharpens the brightness boundary, but none of them reveals that brightness was the wrong concept to begin with.
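The failure described above can be sketched in a few lines. This is only an illustration under made-up assumptions: the brightness values 0.3 and 0.8, the function names, and the bisection strategy are all hypothetical, and the "user" is modeled as someone who answers the question "is there a tank?" about pictures that never contain one.

```python
def refine_threshold(brightest_negative, dimmest_positive, rounds, oracle):
    """Classical boundary refinement: bisect between known negative and positive."""
    for _ in range(rounds):
        query = (brightest_negative + dimmest_positive) / 2
        if oracle(query):
            dimmest_positive = query
        else:
            brightest_negative = query
    return brightest_negative, dimmest_positive


def user(brightness):
    # Hypothetical user: asked about a random pixel field, they see no tank,
    # so they answer "negative" whatever the brightness is.
    return False


# Assumed training summary: cloudy photos near 0.3 brightness, sunny near 0.8.
lo, hi = refine_threshold(0.3, 0.8, rounds=10, oracle=user)
print(f"learned rule: positive iff brightness > somewhere in ({lo:.4f}, {hi:.4f})")
```

The learner ends up very confident about where the brightness boundary lies, and nothing in the loop ever asks whether brightness was the intended concept.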

Perhaps what we want is less like "figure out exactly where the concept boundary lies by querying the edge cases to the oracle, assuming our basic idea about the boundary is correct" and more like "notice when there's more than one plausible idea that describes the boundary" or "figure out if the user could have been trying to communicate more than one plausible idea using the training dataset".
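As a toy illustration of "notice when there's more than one plausible idea", the sketch below enumerates every single-feature concept that fits the training labels and, for each pair of surviving concepts, builds an instance on which they disagree. The feature names and the reduction of a photo to three binary features are assumptions made for the example, not part of the original fable.

```python
from itertools import combinations

# Hypothetical toy data: each photo is reduced to hand-named binary features.
TRAIN = [
    ({"tank": 1, "sunny": 1, "bright": 1}, 1),
    ({"tank": 1, "sunny": 1, "bright": 1}, 1),
    ({"tank": 0, "sunny": 0, "bright": 0}, 0),
    ({"tank": 0, "sunny": 0, "bright": 0}, 0),
]


def consistent_features(data):
    """Return every feature that, used alone, reproduces all training labels."""
    features = sorted(data[0][0])
    return [f for f in features if all(x[f] == y for x, y in data)]


def distinguishing_queries(candidates):
    """For each pair of candidate concepts, build an instance they disagree on."""
    queries = []
    for a, b in combinations(candidates, 2):
        instance = {f: 0 for f in candidates}
        instance[a] = 1  # concept `a` says positive, concept `b` says negative
        queries.append((a, b, instance))
    return queries


candidates = consistent_features(TRAIN)
queries = distinguishing_queries(candidates)

if len(candidates) > 1:
    print("ambiguous induction; candidate concepts:", candidates)
    for a, b, instance in queries:
        print(f"  ask the user about {instance} to separate {a!r} from {b!r}")
```

Real concept spaces are not small enumerable lists of named features, so this shows only the shape of the desired behavior, not a way to obtain it.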

Possible approaches

Some possibly relevant approaches that might feed into the notion of "identifying inductive ambiguities":

Conservatism. Can we draw a much narrower, but somewhat more complicated, boundary around the training data?

Can we get a concept that more strongly predicts or more tightly predicts the training cases we saw? (Closely related to conservatism - if we suppose there's a generator for the training cases, then a more conservative generator concentrates more probability density into the training cases we happened to see.)
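A minimal sketch of the last two ideas, assuming each picture is summarized by two made-up numbers (brightness, tank_score): a simple boundary ("bright enough") is compared with a tight box drawn around the positive training cases. Treating each region as a uniform generator, the tighter region puts more density on the cases actually seen. The box learner, the margin, and all values are illustrative assumptions.

```python
import math

# Hypothetical positive training cases: (brightness, tank_score).
POSITIVES = [(0.80, 0.90), (0.85, 0.95), (0.90, 0.92)]


def bounding_box(points, margin=0.05):
    """Axis-aligned box around the points, padded by `margin`, clipped to [0, 1]."""
    dims = list(zip(*points))
    return [(max(0.0, min(d) - margin), min(1.0, max(d) + margin)) for d in dims]


def contains(box, point):
    return all(lo <= v <= hi for (lo, hi), v in zip(box, point))


def log_density(box):
    """Log density of a uniform generator over the box (higher means tighter)."""
    volume = math.prod(hi - lo for lo, hi in box)
    return -math.log(volume)


simple_box = [(0.5, 1.0), (0.0, 1.0)]  # "bright enough"; ignores tank_score
tight_box = bounding_box(POSITIVES)    # conservative concept

novel = (0.85, 0.10)  # a bright picture with no tank in it
print("simple concept accepts novel case:", contains(simple_box, novel))
print("conservative concept accepts novel case:", contains(tight_box, novel))
print("log density, simple vs. tight:", log_density(simple_box), log_density(tight_box))
```

A case that the simple concept accepts and the conservative one rejects is a natural candidate for referring to the user.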

Can we detect commonalities in the positive training cases that aren't already present in the concept we've learned?
- This might be a good fit for something like a generative adversarial approach, where we generate random instances of the concept we learned, then ask if we can detect the difference between those random instances and the actual positively labeled training cases.
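The check in the bullet above can be caricatured as follows. This is not an actual generative adversarial setup: the "generator" samples uniformly from a hand-written learned concept, and the "discriminator" is just a per-feature comparison of means. The feature names, sample size, and `min_gap` threshold are assumptions chosen for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical positively labeled training cases.
POSITIVES = [
    {"bright": 1, "tank": 1, "grass": 1},
    {"bright": 1, "tank": 1, "grass": 0},
    {"bright": 1, "tank": 1, "grass": 1},
    {"bright": 1, "tank": 1, "grass": 0},
]


def sample_from_learned_concept():
    # Assumed learned concept: "bright". Every other feature is unconstrained.
    return {"bright": 1, "tank": rng.randint(0, 1), "grass": rng.randint(0, 1)}


def mean(values):
    return sum(values) / len(values)


def uncaptured_commonalities(positives, generated, min_gap=0.25):
    """Features whose average differs between real positives and generated cases."""
    flagged = []
    for f in sorted(positives[0]):
        gap = abs(mean([p[f] for p in positives]) - mean([g[f] for g in generated]))
        if gap > min_gap:
            flagged.append(f)
    return flagged


generated = [sample_from_learned_concept() for _ in range(200)]
flagged = uncaptured_commonalities(POSITIVES, generated)
print("commonalities the learned concept misses:", flagged)
```

If the real positives can be told apart from samples of the learned concept, the positives share something the concept does not yet express.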

Is there a way to blank out the concept we've already learned so that it doesn't just get learned again, and ask if there's a different concept that's learnable instead? That is, whatever algorithm we're using, is there a good way to tell it "Don't learn this concept; now try to learn something else" and see if it can learn something substantially different?
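One crude reading of "blanking out" a concept, sketched under strong assumptions: the learner is a single-feature threshold rule (a decision stump), and blanking means banning the feature the first rule used. The data, the `banned` parameter, and the stump learner are hypothetical choices for illustration; for a neural network it is much less clear what the analogous operation would be.

```python
def fit_stump(data, banned=()):
    """Best rule of the form `x[f] > t`; returns (feature, threshold, accuracy)."""
    best = None
    n_features = len(data[0][0])
    for f in range(n_features):
        if f in banned:
            continue  # this feature's concept has been blanked out
        values = sorted({x[f] for x, _ in data})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            acc = sum((x[f] > t) == bool(y) for x, y in data) / len(data)
            if best is None or acc > best[2]:
                best = (f, t, acc)
    return best


# Hypothetical training data: ((brightness, tank_score), label).
DATA = [
    ((0.9, 0.8), 1),
    ((0.8, 0.9), 1),
    ((0.3, 0.1), 0),
    ((0.2, 0.2), 0),
]

first = fit_stump(DATA)                      # learns the brightness rule
second = fit_stump(DATA, banned={first[0]})  # relearn without that feature

if second is not None and second[2] >= first[2]:
    print("a different concept fits equally well:", first, "vs.", second)
```

When the relearned rule fits the training data as well as the original, the data did not distinguish between the two concepts.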

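Something something Gricean implicature; that is, possibly reasoning about what the user's choice of training examples implies about which concept they meant.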

Relevance in value alignment

Since inductive ambiguities are meant to be referred to the user for resolution rather than resolved automatically (the whole point is that the necessary data for an automatic resolution isn't there), they're instances of "user queries" and all standard worries about user queries would apply.

The hope about a good algorithm for identifying inductive ambiguities is that it would help catch edge instantiations and unforeseen maximums, and maybe just simple errors of communication.
