# Superexponential Conceptspace, and Simple Words

**Followup to**: Mutual Information, and Density in Thingspace

Thingspace, you might think, is a rather huge space. Much larger than reality, for where reality only contains things that actually exist, Thingspace contains everything that *could *exist.

Actually, the way I "defined" Thingspace to have dimensions for every possible attribute—including correlated attributes like density and volume and mass—Thingspace may be too poorly defined to have anything you could call a *size.* But it's important to be able to visualize Thingspace *anyway. *Surely, no one can *really* understand a flock of sparrows if all they see is a cloud of flapping cawing things, rather than a cluster of points in Thingspace.

But as vast as Thingspace may be, it doesn't hold a candle to the size of Conceptspace.

"Concept", in machine learning, means a rule that includes or excludes examples. If you see the data 2:+, 3:-, 14:+, 23:-, 8:+, 9:- then you might guess that the concept was "even numbers". There is a rather large literature (as one might expect) on how to learn concepts from data... given random examples, given chosen examples... given possible errors in classification... and most importantly, given different spaces of possible rules.

Suppose, for example, that we want to learn the concept "good days on which to play tennis". The possible attributes of Days are:

Sky: {Sunny, Cloudy, Rainy}

AirTemp: {Warm, Cold}

Humidity: {Normal, High}

Wind: {Strong, Weak}

We're then presented with the following data, where + indicates a positive example of the concept, and - indicates a negative classification:

+ Sky: Sunny; AirTemp: Warm; Humidity: High; Wind: Strong.

- Sky: Rainy; AirTemp: Cold; Humidity: High; Wind: Strong.

+ Sky: Sunny; AirTemp: Warm; Humidity: High; Wind: Weak.

What should an algorithm infer from this?

A machine learner might represent *one* concept that fits this data as follows:

Sky: ?; AirTemp: Warm; Humidity: High; Wind: ?

In this format, to determine whether this concept accepts or rejects an example, we compare element-by-element: ? accepts anything, but a specific value accepts only that specific value.

So the concept above will accept only Days with AirTemp=Warm and Humidity=High, but the Sky and the Wind can take on any value. This fits both the negative and the positive classifications in the data so far—though it isn't the *only* concept that does so.

We can also simplify the above concept representation to {?, Warm, High, ?}.

Without going into details, the classic algorithm would be:

- Maintain the set of the most general hypotheses that fit the data—those that positively classify as many examples as possible, while still fitting the facts.
- Maintain another set of the most specific hypotheses that fit the data—those that negatively classify as many examples as possible, while still fitting the facts.
- Each time we see a new negative example, we strengthen all the most general hypotheses as little as possible, so that the new set is again as general as possible while fitting the facts.
- Each time we see a new positive example, we relax all the most specific hypotheses as little as possible, so that the new set is again as specific as possible while fitting the facts.
- We continue until we have only a single hypothesis left. This will be the answer
*if*the target concept was in our hypothesis space at all.

In the case above, the set of most general hypotheses would be {?, Warm, ?, ?} and {Sunny, ?, ?, ?}, while the set of most specific hypotheses is the single member {Sunny, Warm, High, ?}.

Any other concept you can find that fits the data will be strictly more specific than one of the most general hypotheses, and strictly more general than the most specific hypothesis.

(For more on this, I recommend Tom Mitchell's *Machine Learning,* from which this example was adapted.)

Now you may notice that the format above *cannot* represent all possible concepts. E.g. "Play tennis when the sky is sunny *or* the air is warm". That fits the data, but in the concept representation defined above, there's no quadruplet of values that describes the rule.

Clearly our machine learner is not very general. Why not allow it to represent *all possible* concepts, so that it can learn with the greatest possible flexibility?

Days are composed of these four variables, one variable with 3 values and three variables with 2 values. So there are 3*2*2*2 = 24 possible Days that we could encounter.

The format given for representing Concepts allows us to require any of these values for a variable, or leave the variable open. So there are 4*3*3*3 = 108 concepts in that representation. For the most-general/most-specific algorithm to work, we need to start with the most specific hypothesis "no example is ever positively classified". If we add that, it makes a total of 109 concepts.

Is it suspicious that there are more possible concepts than possible Days? Surely not: After all, a concept can be viewed as a *collection* of Days. A concept can be viewed as the set of days that it classifies positively, or isomorphically, the set of days that it classifies negatively.

So the space of *all possible* concepts that classify Days is the set of all possible sets of Days, whose size is 2^{24} = 16,777,216.

This complete space includes all the concepts we have discussed so far. But it also includes concepts like "Positively classify only the examples {Sunny, Warm, High, Strong} and {Sunny, Warm, High, Weak} and reject everything else" or "Negatively classify only the example {Rainy, Cold, High, Strong} and accept everything else." It includes concepts with no compact representation, just a flat list of what is and isn't allowed.

That's the problem with trying to build a "fully general" inductive learner: They can't learn concepts until they've seen every possible example in the instance space.

If we add on more attributes to Days—like the Water temperature, or the Forecast for tomorrow—then the number of possible days will grow exponentially in the number of attributes. But this isn't a problem with our restricted concept space, because you can narrow down a large space using a logarithmic number of examples.

Let's say we add the Water: {Warm, Cold} attribute to days, which will make for 48 possible Days and 325 possible concepts. Let's say that each Day we see is, usually, classified positive by around half of the currently-plausible concepts, and classified negative by the other half. Then when we learn the actual classification of the example, it will cut the space of compatible concepts in half. So it might only take 9 examples (2^{9} = 512) to narrow 325 possible concepts down to one.

Even if Days had forty binary attributes, it should still only take a manageable amount of data to narrow down the possible concepts to one. 64 examples, if each example is classified positive by half the remaining concepts. *Assuming,* of course, that the *actual* rule is one we can represent at all!

If you want to think of all the possibilities, well, good luck with that. The space of *all possible* concepts grows *super*exponentially in the number of attributes.

By the time you're talking about data with forty binary attributes, the number of possible examples is past a trillion—but the number of possible *concepts* is past two-to-the-trillionth-power. To narrow down that *super*exponential concept space, you'd have to see over a trillion examples before you could say what was In, and what was Out. You'd have to see every possible example, in fact.

That's with forty binary attributes, mind you. 40 bits, or 5 bytes, to be classified simply "Yes" or "No". 40 bits implies 2^40 possible examples, and 2^(2^40) possible concepts that classify those examples as positive or negative.

So, here in the real world, where objects take more than 5 bytes to describe *and* a trillion examples are not available *and* there is noise in the training data, we only even *think* about *highly regular* concepts. A human mind—or the whole observable universe—is not nearly large enough to consider all the other hypotheses.

From this perspective, learning doesn't just *rely on* inductive bias, it is *nearly all* inductive bias—when you compare the number of concepts ruled out *a priori,* to those ruled out by mere evidence*.*

But what has this (you inquire) to do with the proper use of words?

It's the whole reason that words have intensions as well as extensions.

In yesterday's post, I concluded:

The way to carve reality at its joints, is to draw boundaries around concentrations of unusually high probability density.

I deliberately left out a key qualification in that (slightly edited) statement, because I couldn't explain it until today. A better statement would be:

The way to carve reality at its joints, is to draw

simpleboundaries around concentrations of unusually high probability density in Thingspace.

Otherwise you would just gerrymander Thingspace. You would create really odd noncontiguous boundaries that collected the observed examples, examples that couldn't be described in any shorter message than your observations themselves, and say: "This is what I've seen before, and what I expect to see more of in the future."

In the real world, nothing above the level of molecules repeats itself *exactly.* Socrates is shaped a lot like all those other humans who were vulnerable to hemlock, but he isn't shaped *exactly* like them. So your guess that Socrates is a "human" relies on drawing *simple* boundaries around the human cluster in Thingspace. Rather than, "Things shaped exactly like [5-megabyte shape specification 1] and with [lots of other characteristics], *or* exactly like [5-megabyte shape specification 2] and [lots of other characteristics]", ..., are human."

If you don't draw *simple* boundaries around your experiences, you can't do inference with them. So you try to describe "art" with intensional definitions like "that which is intended to inspire any complex emotion for the sake of inspiring it", rather than just pointing at a long list of things that are, or aren't art.

In fact, the above statement about "how to carve reality at its joints" is a bit chicken-and-eggish: You can't assess the *density* of actual observations, until you've already done at least a little carving. And the probability distribution comes from drawing the boundaries, not the other way around—if you already *had* the probability distribution, you'd have everything necessary for inference, so why would you bother drawing boundaries?

And this suggests another—yes, yet another—reason to be suspicious of the claim that "you can define a word any way you like". When you consider the superexponential size of Conceptspace, it becomes clear that singling out one particular concept for consideration is an act of no small audacity—not just for us, but for any mind of bounded computing power.

Presenting us with the word "wiggin", defined as "a black-haired green-eyed person", without some reason for raising *this particular concept* to the level of our deliberate attention, is rather like a detective saying: "Well, I haven't the slightest shred of support one way or the other for who could've murdered those orphans... not even an intuition, mind you... but have we considered John Q. Wiffleheim of 1234 Norkle Rd as a suspect?"

Part of the sequence *A Human's Guide to Words*

Next post: "Conditional Independence, and Naive Bayes"

Previous post: "Mutual Information, and Density in Thingspace"

## Comments (12)

OldI'll second the recommendation for Tom Mitchell's book (although it has been a long time since I have read it and I have moved away from the machine learning philosophy since).

Are you going to go on to mention that the search in a finite concept space can be seen as a search the space of regular languages and therefore a search in the space of FSM? And then move onto Turing Machines and the different concepts they can represent e.g. the set of strings that have exactly the same number of 0s and 1s in.

Hmm, lets fast forward to where I think the disagreement might lie in our philosophies.

Let us say I have painstakingly come up with a Turing machine that represents a concept, e.g. the even 0s and 1s I mentioned above. We shall call this the evenstring concept, I want to give this concept to a machine as it I have found it useful for something.

Now I could try and teach this to a machine by giving it a series of of positive and negative examples

0011 +, 0101 +, 000000111111 +, 1 -, 0001 - etc...

It would take infinite bits to fully determine this concept, agreed? You might get there early if you have a nice short way of describing evenstring in the AIs space of TM.

Instead if we had an agreed ordering of Turing machines I could communicate the bits of the Turing machine corresponding to evenstring first and ignore the evidence about evenstring entirely, instead we are looking at evidence for what is the TM of! That is I am no longer doing traditional induction. It would only take n bits of evidence to nail down the Turing Machine, where n is the length of the turing machine I am trying to communicate to the AI. I could communicate partial bits and it could try and figure out the rest, if I specified the length or a bound on it.

If you want to add evidence back into the picture, I could communicate the evenstring concept initially to the AI and make it increase the prior of the evenstring concept in some fashion. Then it collect evidence in the normal way, in case we were wrong when we communicated evenstring in some fashion, or it had a different coding for TMs.

However this is still not enough for me. The concepts that this could deal with would only be to do with the outside world, that is the communicated TM would be a mapping from external senses to thing space. I'm interested in concepts that map the space of turing machines to another space of turing machines ("'et' is the french word for 'and'"), and other weird and wonderful concepts.

I'll leave it at that for now.

Actually I'll skip to the end. In order to be able to represent all the possible concepts (e.g. concepts about concept formation, concepts about languages) I would like to be able to represent I need a general purpose, stored program computer. A lot like a PC, but slightly different.

Of course, once someone has defined the word "wiggin" in that way, this gives us a reason to attend to that concept, and additionally, presumably the definer would never have thought of it in the first place without SOME reason for it, even if an extremely weak one.

As I was reading about that learning algorithm, I was sure it looked familiar (I've never even heard of it before). Then suddenly it hit me - this is why I always win at Cluedo!

Say there's no correlation between the size of a rock and how much Vanadium it contains. I collect Vanadium you see. However, I can only carry rocks under a certain size back to my Vanadium-Extraction Facility. I bring back rocks I can carry, and test them for Vanadium. However, for ease of reference, I call all carrying-size, Vanadium-containing rocks 'Spargs'. This is useful since I can say 'I found 5 Spargs today', instead of having to say 'I found 5 smallish rocks containing Vanadium today'. I have observed no correlation between rock size and Vanadium content. Is 'Sparg' a lie?

Ben, to put that point more generally, Eliezer seems to be neglecting to consider the fact that utility is sometimes a reason to associate several concepts, even apart from their probability of being associated with one another or with other things. An example from another commenter would be "I want a word for red flowers because I like red flowers"; this is entirely reasonable.

A nice point. But it there is something a bit weird.

We all agree that utility is subjective - you like red flowers, I like blue. As Bayesians, we also know that empirical information is somewhat subjective as well. We don't all have access to the same empirical observations.

So, it appears that we cannot really "carve nature at the (objective) joints. The best we can do is to carve where we (subjectively) estimate the joints to be. Now, that is fine if we are using words only to carry out private inferences. But we also frequently need to use words to communicate.

It appears that even Bayesian rationalists who understand cluster analysis must sometimes argue about definitions. There may be arguments regarding how to delimit the population. There may be arguments about how best to quantize the variables. It is not completely clear to me that it is always possible to distinguish communication problems from single-person inference problems.

this reminds me of The Dumpster.

http://www.flong.com/projects/dumpster/

also, i can't help but think of the concept of the Collective Unconscious, and the title of an old Philip K Dick story, Do Androids Dream of Electric Sheep?

Unknown - pretty much, yeah. I just wanted to use the words 'Vanadium' and 'Sparg'.

Obvious counter: utility is subjective, the joints in reality (by definition!) aren't. So categorisation based on 'stuff we want' or 'stuff we like' can go any way you want and doesn't have to fall along reality's joints. There is a marked distinction between these and the type of (objective?) categories that fit in with the world.

If I am searching for a black-haired, green-eyed person to be in my movie, I have a motive for using the word Wiggin. However, the existence of the word Wiggin doesn't reflect a natural group of Things in Thingspace, and hence doesn't have any bearing on my

expectations. Just as the coining of a word meaning 'red flower' wouldn't be a reflection of any natural grouping in Thingspace - flowers can be lots of colours, and lots of things are red. Sound good?Not every regularity of categorization is derived from physical properties local to the object itself. I watch Ben Jones returning to the facility, and notice that

all the rocks he carriesare vanadium-containing and weighing less than ten pounds, at a surprisingly high rate relative to the default propensity of vanadium-containing and less-than-ten-pound rocks. So I call these rocks "Spargs", and there's no reason Ben himself can't do the same.Remember, it's legitimate to have a word for something with properties A, B where A, B do not occur in conjunction at greater than default probability, but which, when they do occur in conjunction, have properties C, D, and E at greater than default probability.

It's legitimate for us to have a word for

anyreason. Words have more than one purpose - they're not just for making predictions about likely properties.Isn't that the same as {Sunny, Warm, High, ?} in the notation used above?