I really like the idea of a volume minimization function in dynamic tension with a complexity-of-shape minimization function, and I wonder what sorts of functions might mediate the tradeoff.
One function I considered has the problem of vanishing when the volumes are compact. I'm still not sure whether the relationship between $\hat{V}$ and $R$ should be multiplicative or additive. I am fairly sure that $\hat{V}$ shouldn't directly affect $R$, as that would have lots of nasty failure modes (such as high complexities being justified by tiny, weird volumes).
edit: It should be additive, otherwise 0 total loss is achieved by classifying everything as unknown.
Introduction
Consider the problem of designing classifiers that are able to reliably say "I don't know" for inputs well outside of their training set. In the literature, this problem is known as open-category classification [7]; a closely-related problem in AI safety is inductive ambiguity detection [4].
A solution to generalized inductive ambiguity detection could allow for agents who robustly know when to ask for clarification; in this post, we discuss the problem in the context of classification. Although narrower in scope, a solution to the classification subproblem would herald the arrival of robust classifiers which extrapolate conservatively and intuitively from their training data.
Obviously, we can't just train classifiers to "recognize" the unknown class. One, this class isn't compact, and two, it doesn't make sense - it's a map/territory confusion (the label unknown is a feature of the world model, not a meaningful ontological class). Let's decompose the concept a bit further.
Weight regularization helps us find a mathematically-simple volume in the input space which encapsulates the data we have seen so far. In fact, a linear binary classifier induces a hyperplane which cleaves the entire space into two volumes in a way which happens to nearly-optimally classify the training data. This hyperplane may be relatively simple to describe mathematically, but the problem is that the two volumes created are far too expansive.
In short, we'd like our machine learning algorithms to learn explanations which fit the facts seen to date, are simple, and don't generalize too far afield. This is an example of conservatism.
Prior Work
Taylor et al. provide an overview of relevant research [4]. Taylor also introduced an approach for rejecting examples from distributions too far from the training distribution [5].
Yu et al. employed adversarial sample generation to train classifiers to distinguish seen from unseen data, achieving moderate success [7]. Li and Li1 trained a classifier to sniff out adversarial examples by detecting whether the input is from a different distribution [3]. Once detected, the example had a local average filter applied to it to recover the non-adversarial original.
The Knows What It Knows framework allows a learner to avoid making mistakes by deferring judgment to a human some polynomial number of times [2]. However, this framework makes the unrealistically-strong assumption that the data are i.i.d.; furthermore, efficient KWIK algorithms are not known for complex hypothesis classes (such as those explored in neural network-based approaches).
Penalizing Volume
If we want a robust cat / unknown classifier, we should indeed cleave the space in two, but with the vast majority of the space being allocated to unknown. In other words, we're searching for the smallest, simplest volume which encapsulates the cat training data.
The smallest encapsulation of the training data is a strange, contorted volume only encapsulating the training set. However, we still penalize model complexity, so that shouldn't happen. As new examples are added to the training set, the classifier would have to find a way to expand or contract its class volumes appropriately. This is conceptually similar to version space learning.
This may be advantageous compared to current approaches in that we aren't training an inherently-incorrect classifier to prune itself or to sometimes abstain. Instead, we structure the optimization pressure in such a way that conservative generalization is the norm.2
Formalization
Let $V$ be a function returning the proportion3 of input space volume occupied by non-unknown classes: $\frac{|\text{inputs not classified as unknown}|}{|\text{all inputs}|}$. We need some function which translates this to a reasonable penalty (as the proportion alone would be a rounding error in the overall loss function). For now, assume this function is $\hat{V}$.
We observe a datum $x$ whose ground truth label is $y$. Given loss function $L$, weights $\theta$, classifier $f_\theta$, complexity penalty function $R$, and $\lambda_1, \lambda_2 \in \mathbb{R}$, the total loss is

$$L(f_\theta(x), y) + \lambda_1 R(\theta) + \lambda_2 \hat{V}(f_\theta).$$
Depending on the formulation of ^V, fθ may elect to misclassify a small amount of data to avoid generalizing too far. If more similar examples are added, L exerts an increasing amount of pressure on fθ, eventually forcing expansion of the class boundary to the data points in question. This optimization pressure seems desirable.
We then try to minimize expected total loss over the true distribution $X$:

$$\min_\theta \; \mathbb{E}_{(x,y) \sim X}\!\left[L(f_\theta(x), y)\right] + \lambda_1 R(\theta) + \lambda_2 \hat{V}(f_\theta).$$
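To make the additive structure concrete, here is a minimal sketch of this total loss in Python. The particular choices of cross-entropy for $L$ and a squared L2 norm for $R$ are illustrative assumptions on my part, and the volume term is passed in as a precomputed approximation of $\hat{V}$.

```python
import numpy as np

def cross_entropy(probs, y):
    """L(f_theta(x), y): mean negative log-probability of the true labels."""
    return -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))

def l2_penalty(weights):
    """R(theta): a simple complexity penalty, the squared L2 norm of the weights."""
    return sum(np.sum(w ** 2) for w in weights)

def total_loss(probs, y, weights, volume_proportion, lambda_1=1e-3, lambda_2=1.0):
    """Additive total loss: L(f_theta(x), y) + lambda_1 * R(theta) + lambda_2 * V_hat(f_theta).

    volume_proportion stands in for V_hat(f_theta): some tractable approximation of
    the proportion of input space classified as non-unknown, already rescaled so it
    is not a rounding error next to the other terms.
    """
    return (cross_entropy(probs, y)
            + lambda_1 * l2_penalty(weights)
            + lambda_2 * volume_proportion)
```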
Tractable Approximations
Exhaustively enumerating the input space to calculate $V$ is intractable - the space of 256×256 RGB images is of size $256^{3 \cdot 256 \cdot 256}$. How do we measure or approximate the volume taken up by the non-unknown classes?
Black Box

The most direct approach treats $f_\theta$ as a black box: sample inputs at random, run them through the classifier, and estimate the non-unknown proportion from the fraction of samples not labeled unknown. Unfortunately, uniform random samples from a space this vast describe the space as a whole, not the small structured regions (such as cat images) whose boundaries we most care about.
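A minimal sketch of that black-box Monte Carlo estimate, assuming a hypothetical classifier object exposing a `predict` method and a reserved label index for unknown (both names are mine, not any particular library's):

```python
import numpy as np

UNKNOWN = -1  # hypothetical label index reserved for the unknown class

def estimate_nonunknown_proportion(classifier, input_shape, n_samples=100_000, seed=0):
    """Black-box Monte Carlo estimate of V: the fraction of uniformly random
    inputs that the classifier labels as something other than unknown."""
    rng = np.random.default_rng(seed)
    xs = rng.integers(0, 256, size=(n_samples, *input_shape))  # uniform random images
    labels = classifier.predict(xs)
    return np.mean(labels != UNKNOWN)
```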
Analytic
We do know the details of fθ - how can we exploit this?
Decision trees offer a particularly simple solution. Consider a decision tree over a one-grayscale-pixel input space; the tree outputs unknown if the pixel value $x > 127$ and dog otherwise. Then, without having to run a single input through the tree, we find the proportion of the space taken up by non-unknown classes to be $1 - \frac{128}{256} = 0.5$.
This result generalizes to any decision tree over discrete variables; simply calculate in closed form how many possible inputs conform to the conditions for each unknown leaf, and add them up. For an input space S classified by a decision tree with m nodes, we reduce the time complexity from O(|S|) to O(m); this is great, as it is almost always the case that |S|≫m.
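Here is a sketch of that closed-form count for a tree over discrete integer-valued features; the tuple-based node representation is an assumption of mine, not any particular library's format.

```python
def nonunknown_proportion(tree, feature_sizes, unknown_label="unknown"):
    """Closed-form proportion of a discrete input space labeled non-unknown by a
    decision tree with axis-aligned threshold splits on integer-valued features.

    tree: either ("leaf", label) or
          ("split", feature_index, threshold, left_subtree, right_subtree),
          where the left subtree handles feature values <= threshold.
    feature_sizes: feature i takes integer values in [0, feature_sizes[i]).
    """
    full_ranges = [(0, size) for size in feature_sizes]

    def count(node, ranges):
        """Number of inputs within `ranges` that this subtree labels non-unknown."""
        if node[0] == "leaf":
            if node[1] == unknown_label:
                return 0
            volume = 1
            for lo, hi in ranges:
                volume *= max(hi - lo, 0)
            return volume
        _, feature, threshold, left, right = node
        lo, hi = ranges[feature]
        left_ranges, right_ranges = list(ranges), list(ranges)
        left_ranges[feature] = (lo, min(hi, threshold + 1))   # values <= threshold
        right_ranges[feature] = (max(lo, threshold + 1), hi)  # values >  threshold
        return count(left, left_ranges) + count(right, right_ranges)

    total = 1
    for size in feature_sizes:
        total *= size
    return count(tree, full_ranges) / total

# The one-pixel example from above: unknown if the pixel value exceeds 127, dog otherwise.
tree = ("split", 0, 127, ("leaf", "dog"), ("leaf", "unknown"))
print(nonunknown_proportion(tree, [256]))  # 0.5
```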
Neural networks are a bit trickier. Work has been done on extracting decision trees from neural networks [6], but this requires training a new network to do so and the resulting trees may not have sufficient fidelity.
Since the output of a one hidden-layer neural network can be represented as the linear equation WX=Z, we can, perhaps, simply fix W (the weights θ) and Z (the layer's output values) and solve for X (the input). Given such a solution, we might, in theory, be able to calculate the volume by integrating over the possible Z values for each classification. The universal approximation theorem shows that a one hidden-layer neural network (with perhaps an exponential number of nodes) can approximate any function.
All this is getting a bit silly. After all, what we really care about is the proportion of the input space taken up by non-unknown classes; we don't necessarily need to know exactly which inputs correspond to each output.
White Box
Let's treat this as a prediction problem. Take $n$ randomly-generated images and run them through the first layer of the network, recording the activation values of each node $k$ in the first layer ($a^{(1)}_k$); assume we incur some approximation error $\epsilon$ due to having insufficiently sampled the true distribution. Derive a PDF over the activation values for each node.
Repeat this for each of the l network layers, sampling input values from the PDFs for the previous layer's nodes. At the end of the network, we have a probabilistic estimate of the output class proportions.
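A rough sketch of this layer-by-layer estimate, under the simplifying assumption (mine, not required by the scheme) that each node's activation distribution is an independent Gaussian:

```python
import numpy as np

def estimate_class_proportions(weights, biases, n_samples=10_000, seed=0):
    """Propagate per-node activation distributions layer by layer through an MLP.

    weights, biases: lists of numpy arrays defining an MLP with ReLU hidden layers;
    the final layer's argmax is taken as the predicted class. Returns the estimated
    proportion of each output class under the fitted per-node Gaussians.
    """
    rng = np.random.default_rng(seed)
    n_inputs = weights[0].shape[1]
    # Layer 0: uniformly random inputs (e.g. pixel intensities in [0, 1)).
    samples = rng.uniform(0.0, 1.0, size=(n_samples, n_inputs))

    for i, (W, b) in enumerate(zip(weights, biases)):
        pre = samples @ W.T + b
        acts = np.maximum(pre, 0.0) if i < len(weights) - 1 else pre
        # Fit an independent Gaussian per node, then resample from it;
        # this probabilistic summary is what gets carried to the next layer.
        mu, sigma = acts.mean(axis=0), acts.std(axis=0) + 1e-8
        samples = rng.normal(mu, sigma, size=(n_samples, len(mu)))

    predictions = samples.argmax(axis=1)
    n_classes = weights[-1].shape[0]
    return np.bincount(predictions, minlength=n_classes) / n_samples
```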
However, this may only yield results comparable to random image sampling, as many node activations may not occur often for arbitrary data. While this would be descriptive of the input space as a whole, it would be unlikely to sample any cat images, for example.
Blessing of Dimensionality
In my opinion, the most promising approach involves abusing the properties of high-dimensional volumes; in particular, the well-known Curse of Dimensionality.
To illustrate, the volume of a ball in $p$ dimensions is

$$V_p(R) = \frac{\pi^{p/2}}{\Gamma\!\left(\frac{p}{2} + 1\right)} R^p,$$

where $R$ is the radius of the ball and $\Gamma$ is the Gamma function.
For high-dimensional volumes, the overwhelming majority of points in the $p$-ball are located in the very outer shell. How do we know this? For $N$ points uniformly distributed in a $p$-dimensional unit ball (a ball with $R=1$), the median distance from the origin to the closest data point is

$$d(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}.$$
If we distribute 10,000 points in the 1,000-dimensional unit ball, the closest point to the center has a median distance of .9905. The vast, vast majority of the points we uniformly distributed find themselves at the outermost reaches of the ball. This is fantastic news (for mathematical and visual intuitions as to why, read this).
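As a quick sanity check on that figure:

$$d(1000,\,10000) = \left(1 - \left(\tfrac{1}{2}\right)^{1/10000}\right)^{1/1000} \approx \left(6.93 \times 10^{-5}\right)^{1/1000} \approx 0.9905.$$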
Train a variational autoencoder with a k-dimensional latent space on the dataset Dtrain. Then train a multilayer perceptron fθ to classify Dtrain using the latent representation of each image.4 This is the network to which we will apply the volume penalty. New images will be translated to the latent space by the trained encoder and then fed to fθ for classification.
We're basically going to do high-dimensional lidar5 in the latent space to image the outer shell of fθ's non-unknown classification volumes. Due to the blessing of dimensionality, we don't need to worry whether the inside of the volume is classified as unknown or not - for high-dimensional latent spaces, it's a rounding error. Therefore, we need only search until we reach a non-unknown classification shell; we could then start from the other direction to estimate the other side. After doing this for all dimensions, we have a set of points approximating the hypersurface of the non-unknown classification volume.
The complexity of this seems to be $O(k \cdot m \cdot n^{k-1})$, where $k$ is the latent space dimensionality, $m$ is how many sampling operations are needed on average6 to reach the outer shell, and $n$ is the per-dimension sampling density (for example, $n=10$ points along an axis in a 3-dimensional space would entail pinging a grid of $10^{3-1} = 100$ points). Although exponential in $k$, this is already orders of magnitude more tractable than sampling the original high-dimensional input space. By sacrificing some accuracy, we may be able to improve the exponential term further (perhaps to $\log k$ or even a constant).
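A minimal sketch of the axis-by-axis "lidar" scan, assuming the latent space has been bounded to a box and that `classify` (a hypothetical helper wrapping the encoder and $f_\theta$) returns a label for any latent point:

```python
import itertools
import numpy as np

def scan_nonunknown_shell(classify, k, low=-3.0, high=3.0, n=10, max_steps=100):
    """Approximate the outer shell of the non-unknown region of a k-dimensional
    latent space by marching inward along each axis from both directions.

    classify: function mapping a length-k latent point to a label; the label
              "unknown" is treated specially.
    Returns a list of latent points at which each scan first left `unknown`.
    """
    shell_points = []
    grid = np.linspace(low, high, n)
    step = (high - low) / max_steps
    for axis in range(k):
        other_axes = [a for a in range(k) if a != axis]
        # For every grid point on the remaining k-1 axes, march along `axis`
        # from both ends until the classification stops being unknown.
        for coords in itertools.product(grid, repeat=k - 1):
            for start, direction in ((low, +1), (high, -1)):
                point = np.empty(k)
                point[other_axes] = coords
                value = start
                for _ in range(max_steps):
                    point[axis] = value
                    if classify(point) != "unknown":
                        shell_points.append(point.copy())
                        break
                    value += direction * step
    return shell_points
```

The triple loop makes the $O(k \cdot m \cdot n^{k-1})$ cost explicit: $k$ axes, two directions each, $n^{k-1}$ grid points per axis, and at most $m$ (`max_steps`) pings per ray.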
We should be able to provably bound our error as a function of the latent space dimensionality and the sampling density; due to the blessing of dimensionality, these bounds are likely to be sufficiently tight.7 Although increasing the dimensionality tightens the bound on the volume approximation error, it also increases compute time exponentially (as of this initial formulation).
This approach requires that the latent space have enough dimensions to allow for unknown inputs to not be encoded to areas occupied by known classes. If this is not feasible, there are other architectural choices available. Note that we need not compute the exact proportion of unknown-labeled inputs, but rather an approximation; therefore, as long as the latent space has a reasonable number of features, it likely doesn't need to encode all relevant features.
Future Directions
Extremely small input spaces allow for enumerative calculation of volume. By pitting a volume-penalizing classifier against its vanilla counterpart in such a space, one could quickly gauge the promise of this idea.
If the experimental results support the hypothesis, the obvious next step is to formulate tractable approximations of V (ideally with provably-bounded error).
Conclusion
In its exact form, this proposal may be an intractable (yet robust) solution to the open-category classification problem; however, several promising tractable approximations are already apparent.
1 One of whom was my excellent deep learning professor.
2 This approach is inspired by the AI safety mindset in that the classifier strives to be conservative in extrapolating from known data.
3 Set underflow aside for the moment.
4 I've updated from my previous position that the i.i.d. assumption about future data implied by a latent space is disqualifying. Although the latent space's structure would indeed make assumptions, the space would still approximate the volumetric properties of the unknown/non-unknown portions of the input space for the current classifier. Ultimately, this is what we care about!
5 I'm sure there's a formal name for this technique, but I don't know it.
6 Due to the statistical properties of variational autoencoders, it's possible that this could be done very quickly using bounded binary search.
7 Unlike in the p-ball example, the encoded points are distributed normally (as opposed to uniformly). This isn't necessarily directly relevant, as we're measuring $\hat{V}$ with respect to the multilayer perceptron's input space - the latent space (which, after being discretized and bounded, has uniformly "distributed" points).
Bibliography
[1] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. 2016.
[2] L. Li, M.L. Littman, and T.J. Walsh. Knows what it knows: a framework for self-aware learning. 2008.
[3] X. Li and F. Li. Adversarial examples detection in deep networks with convolutional filter statistics. 2016.
[4] J. Taylor et al. Alignment for advanced machine learning systems. 2016.
[5] J. Taylor. Conservative classifiers. 2016.
[6] R. Krishnan, G. Sivakumar, and P. Bhattacharya. Extracting decision trees from trained neural networks. 1999.
[7] Y. Yu et al. Open-category classification by adversarial sample generation. 2017.