(full transcript with time stamps emailed to Louie Helm)
What should an unsupervised intelligent agent, be it a human baby or an artificial agent, do? How should it deal with the data that is streaming in through its sensors in response to the actions it is executing?
First of all -- and this is a very trivial thing to do, in principle at least -- you should store all the data that is coming in. You shouldn't throw away any of the data if you can help it. And that makes sense, because within a couple of years we will be able to store one hundred years of lifetime at the resolution of a high-definition TV video. And maybe human brains can also store one hundred years of human lifetime at a rate -- I once made a rough calculation -- comparable to a low-resolution MPEG video.
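As a rough sanity check on that claim (the 5 megabit-per-second bitrate for compressed HD video is an assumption made for illustration, not a figure from the talk):

    # Back-of-envelope: storage for 100 years of compressed HD video.
    SECONDS_PER_YEAR = 365 * 24 * 3600              # about 3.15e7
    BITRATE = 5e6                                   # assumed ~5 Mbit/s for HD
    total_bits = 100 * SECONDS_PER_YEAR * BITRATE
    print(f"{total_bits / 8 / 1e15:.1f} petabytes")  # roughly 2 PB

A couple of petabytes: a lot, but no longer an outlandish amount of storage.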
So in principle storage is not a problem, but by itself the stored data is useless. You have to find regularities in this history of inputs and actions that you store; in other words, you have to compress that history.
Whenever there's a regularity, a symmetry, whatever, you can write a program that needs fewer bits than the raw data and still encodes the entire data. That's what compression is about. Now let's define the simplicity -- or the subjective compressibility, or the subjective beauty -- of some data point X, given some subjective observer O at a given point T in its life. That is just the number of bits you need to encode the incoming data X at this point in time with the given, limited compression algorithm that you have.
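A minimal sketch of that definition, with zlib standing in for the observer's limited compression algorithm (zlib is my stand-in chosen for illustration; the talk doesn't commit to any particular compressor):

    import zlib

    def simplicity_bits(x: bytes, prior_knowledge: bytes = b"") -> int:
        # Subjective simplicity of x for an observer who already knows
        # prior_knowledge: the extra bits needed to encode x given it.
        with_x = len(zlib.compress(prior_knowledge + x, 9))
        without_x = len(zlib.compress(prior_knowledge, 9))
        return 8 * (with_x - without_x)

The more of x that is already predictable from the prior knowledge, the fewer extra bits are needed -- and the simpler, in this subjective sense, x is for that observer.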
For example, most of you know a lot about human faces, because you have seen so many of them. You are carrying around with you some sort of prototype face, which allows you to encode new faces in the visual field by encoding just the deviations from the prototype. So whenever a new face comes along and it looks very much like the prototype face, you need only a few extra bits to store it. And your lazy brain likes that, because it doesn't want to waste a lot of storage space. The more a face looks like the prototype face, the fewer bits you need to encode it -- and, in a certain sense, the prettier you find it.
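A toy sketch of that prototype idea (the four-number "face" vectors and the coding-cost formula are invented purely for illustration):

    import numpy as np

    # Encode a new face as its deviation from a stored prototype.
    prototype = np.array([1.0, 0.5, 0.2, 0.8])
    new_face  = np.array([1.1, 0.5, 0.2, 0.7])   # close to the prototype

    residual = new_face - prototype
    # Crude coding cost: near-zero deviations cost almost nothing, so a
    # face close to the prototype needs only a few extra bits.
    cost_bits = np.sum(np.log2(1 + np.abs(residual) / 0.01))
    print(f"~{cost_bits:.0f} extra bits beyond the prototype")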
"Beauty" here is just a word for this: we simply count the bits we need to store the new incoming data. A face that is very regular, for example, doesn't need a lot of bits to be encoded.
The important thing is not the compression by itself, but the first derivative of the compressibility. Because what's really going on is that, as new data comes in, your compression algorithm improves all the time and becomes a better predictor of the data. Whatever you can predict, you can compress, because you don't have to store separately what you can already predict.
So prediction and compression are almost the same thing, and to the extent that your learning algorithm improves the predictor on the data observed so far, you are saving bits. You can count this progress in the bits you are saving. That is the only interesting signal, because it signifies a novel pattern in the input stream on which you can still make learning progress.
So what you're interested in is: what is the interestingness of some data X? It's not the number of bits you need to encode the data. It's the first derivative -- the change in the number of bits as your subjective learning algorithm, based on your subjective previous knowledge, improves the compression. You have to count the number of bits you are saving.
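A minimal sketch of that accounting, assuming a predictor interface predictor(prefix, symbol) that returns the model's probability of the next symbol (the interface is my assumption, not something from the talk):

    import math

    def code_length_bits(data, predictor):
        # Prediction is compression: under arithmetic coding, a symbol
        # assigned probability p costs about -log2(p) bits.
        return sum(-math.log2(predictor(data[:i], data[i]))
                   for i in range(len(data)))

    def interestingness(data, predictor_before, predictor_after):
        # First derivative of compressibility: bits saved because the
        # learning step turned predictor_before into predictor_after.
        return (code_length_bits(data, predictor_before)
                - code_length_bits(data, predictor_after))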
Once you have that in place -- once you can formally nail it down and implement it in computers and robots -- you just need an additional learning algorithm: a reward-optimizing algorithm. Whenever you save a few bits, it means you have found a novel pattern; you measure how novel it is by counting how many bits you saved, and that is an internal reward signal, an intrinsic motivation. That's what you want to maximize for the future. You want your controller, which is directing your arms and your actuators, to move such that you get additional data from the environment on which your compression algorithm can still make this type of progress.
There are many reward-maximizing algorithms and reinforcement learning algorithms that can do this in principle. That is the basic principle. In the rest of my talk I'm only going to explain how this accounts for art and science, and so on.
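One way to sketch that loop (all of the interfaces here -- env, controller, compressor -- are assumptions made for illustration, not components named in the talk):

    def curiosity_step(env, controller, compressor, history):
        # 1. Act, 2. observe, 3. measure compression progress, 4. reinforce.
        action = controller.choose_action(history)
        observation = env.step(action)
        history.append((action, observation))
        bits_before = compressor.code_length(history)
        compressor.train(history)              # improve the predictor
        bits_after = compressor.code_length(history)
        reward = bits_before - bits_after      # intrinsic reward: bits saved
        controller.reinforce(reward)           # any RL algorithm can go here

Run in a loop, this steers the controller toward data on which the compressor can still improve.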
Again, in discrete time -- the formulation without derivatives, if you don't like those. The simplicity or compressibility -- or beauty, if you want -- of the data is the number of bits you need to encode it, given what you already know about the data. The interestingness of the data is the change in that number of bits. So you get the data and you learn a little from it, which means you can now compress it a little bit better. The raw data takes so many bits; the compressed data takes fewer. Then you improve the compressor a little bit: it learns something, it becomes a better neural network that predicts the data, and now the encoding takes fewer bits still. What you save is your internal reward signal, because you have found a novel pattern you didn't know yet -- and that's why you find it interesting. You just subtract the number of bits you need afterwards from the number of bits you needed before, and there you go: that's the reward signal.
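In symbols, one possible way to write that down (the notation is mine, not from the talk):

    r(t) = C(h(<=t), p_old) - C(h(<=t), p_new)

where h(<=t) is the history observed up to time t, p_old and p_new are the compressor before and after the learning step, and C(h, p) is the number of bits p needs to encode h. The intrinsic reward r(t) is positive exactly when the learning step made the compressor better on the data seen so far.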
Let me give you a very simple example: a robot sitting in a dark room. The input doesn't change; no matter what the robot does, it's always black, black, black. So it's extremely compressible input: the robot can already predict it very easily, because the next frame is exactly like the previous one. You can totally compress the input, and it's totally boring, because there is no compression progress -- you don't see any pattern you didn't already know.
Now let me give you another extreme example, which is just the opposite. Suppose you are sitting in front of a screen of white noise, with black and white pixels coming at you with equal probability, conveying maximum traditional Shannon information, or Boltzmann information. And still this stream of inputs is totally boring, because it's completely incompressible: you cannot find a short pattern, and you cannot improve your current description of the signal. Again there is no compression progress, so this is also boring. The only thing that is interesting is something in between -- say, a certain piece of music that you didn't know yet, but that was maybe a little bit similar to what you already knew about music, with a new little harmony in there which you hadn't heard in just this way. There you have a little pattern that lets you save a couple of bits, and that's what motivates you to listen to the same song again.
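To make the two boring extremes and the interesting middle concrete, here is a self-contained toy experiment (the adaptive frequency model and all the numbers are my illustrative choices, not anything from the talk):

    import math, random

    ALPHA = 0.1  # smoothing so unseen symbols keep a nonzero probability

    def bits(data, counts):
        # Static code length of data under the frequency model `counts`.
        total = sum(counts.values())
        return sum(-math.log2((counts.get(s, 0) + ALPHA)
                              / (total + 256 * ALPHA))
                   for s in data)

    def train(data, counts):
        counts = dict(counts)
        for s in data:
            counts[s] = counts.get(s, 0) + 1
        return counts

    def bits_saved(new, history):
        # Compression progress on `new`: code length before the model
        # learns from it, minus code length afterwards.
        before = train(history, {})
        after = train(new, before)
        return bits(new, before) - bits(new, after)

    random.seed(0)
    dark1, dark2 = bytes(1000), bytes(1000)                     # black, black, black
    noise1 = bytes(random.randrange(256) for _ in range(1000))  # white noise
    noise2 = bytes(random.randrange(256) for _ in range(1000))
    tune = bytes((i * i) % 16 for i in range(1000))             # regular, but new

    print(f"dark room:   {bits_saved(dark2, dark1):7.0f} bits saved")
    print(f"white noise: {bits_saved(noise2, noise1):7.0f} bits saved")
    print(f"novel tune:  {bits_saved(tune, dark1):7.0f} bits saved")

The dark room yields almost nothing, the noise yields only a sliver of spurious progress from memorizing its own sample, and the novel-but-regular stream yields by far the most saved bits.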
So again: white noise is boring, and there is no internal reward for things like that. A discovery in physics, on the other hand, is just a very large compression improvement. Suppose you have one million videos of falling apples, and they all fall in the same way. You can extract the rule behind this behavior, and it turns out to be a very simple program that essentially describes gravity. It's always a very short program that you can use again and again, across all these many different videos of falling apples, to greatly compress these orange blobs falling down.
You cannot compress everything -- there are random fluctuations and noise and so on that you can't compress -- but there is a substantial aspect of the incoming data that you can compress. And there you can make a lot of compression progress and suddenly save a lot of bits.
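A toy version of the falling-apples story (the trajectory counts, noise level, and coding-cost formula are all invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 50)                    # 50 frames of one "video"

    def residual_bits(residuals, precision=0.01):
        # Rough coding cost: bits to store each number at fixed precision.
        return np.sum(np.log2(1 + np.abs(residuals) / precision))

    raw_bits, law_bits = 0.0, 0.0
    for _ in range(100):                         # stand-in for a million videos
        h0 = rng.uniform(5, 20)                  # this apple's starting height
        y = h0 - 0.5 * 9.81 * t**2 + rng.normal(0, 0.01, t.size)
        raw_bits += residual_bits(y)             # store the positions directly
        law = h0 - 0.5 * 9.81 * t**2             # one short program: gravity
        law_bits += residual_bits(y - law)       # store only the leftover noise

    print(f"raw: ~{raw_bits:.0f} bits, with the law: ~{law_bits:.0f} bits")

Strictly you also pay a few bits per video for the starting height h0, but the saving from the shared law dwarfs that.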
The same is true in the arts. Suppose there's a guy who figured out a way of drawing Obama with just five lines, such that everybody says, "Hey, that's Obama." The artist has somehow extracted the essence of the face, such that you get the same impression looking at those five lines as you get looking at a high-resolution photograph with a million pixels. There was compression progress in the artist as he tried, many times, to come up with a convincing caricature, and a similar thing happens in the observer when he sees it for the first time.
So the scientist and the artist have something in common: they both try to make new data that is compressible in a new, previously unknown way. A novel pattern means: yes, it's compressible, but in a way I didn't yet know, such that my compressor can make this learning progress and save a couple of bits.