Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.
Previously in series: Lawful Uncertainty
You may have noticed a certain trend in recent posts: I've been arguing that randomness hath no power, that there is no beauty in entropy, nor yet strength from noise.
If one were to formalize the argument, it would probably run something like this: that if you define optimization as previously suggested, then sheer randomness will generate something that seems to have 12 bits of optimization, only by trying 4096 times; or 100 bits of optimization, only by trying 1030 times.
This may not sound like a profound insight, since it is true by definition. But consider - how many comic books talk about "mutation" as if it were a source of power? Mutation is random. It's the selection part, not the mutation part, that explains the trends of evolution.
Or you may have heard people talking about "emergence" as if it could explain complex, functional orders. People will say that the function of an ant colony emerges - as if, starting from ants that had been selected only to function as solitary individuals, the ants got together in a group for the first time and the ant colony popped right out. But ant colonies have been selected on as colonies by evolution. Optimization didn't just magically happen when the ants came together.
And you may have heard that certain algorithms in Artificial Intelligence work better when we inject randomness into them.
Is that even possible? How can you extract useful work from entropy?
But it is possible in theory, since you can have things that are anti-optimized. Say, the average state has utility -10, but the current state has an unusually low utility of -100. So in this case, a random jump has an expected benefit. If you happen to be standing in the middle of a lava pit, running around at random is better than staying in the same place. (Not best, but better.)
A given AI algorithm can do better when randomness is injected, provided that some step of the unrandomized algorithm is doing worse than random.
Imagine that we're trying to solve a pushbutton combination lock with 20 numbers and four steps - 160,000 possible combinations. And we try the following algorithm for opening it:
- Enter 0-0-0-0 into the lock.
- If the lock opens, return with SUCCESS.
- If the lock remains closed, go to step 1.
Obviously we can improve this algorithm by substituting "Enter a random combination" on the first step.
If we were to try and explain in words why this works, a description might go something like this: "When we first try 0-0-0-0 it has the same chance of working (so far as we know) as any other combination. But if it doesn't work, it would be stupid to try it again, because now we know that 0-0-0-0 doesn't work."
The first key idea is that, after trying 0-0-0-0, we learn something - we acquire new knowledge, which should then affect how we plan to continue from there. This is knowledge, quite a different thing from randomness...
What exactly have we learned? We've learned that 0-0-0-0 doesn't work; or to put it another way, given that 0-0-0-0 failed on the first try, the conditional probability of it working on the second try, is negligible.
Consider your probability distribution over all the possible combinations: Your probability distribution starts out in a state of maximum entropy, with all 160,000 combinations having a 1/160,000 probability of working. After you try 0-0-0-0, you have a new probability distribution, which has slightly less entropy; 0-0-0-0 has an infinitesimal probability of working, and the remaining 159,999 possibilities each have a 1/159,999 probability of working. To try 0-0-0-0 again would now be stupid - the expected utility of trying 0-0-0-0 is less than average; the vast majority of potential actions now have higher expected utility than does 0-0-0-0. An algorithm that tries 0-0-0-0 again would do worse than random, and we can improve the algorithm by randomizing it.
One may also consider an algorithm as a sequence of tries: The "unrandomized algorithm" describes the sequence of tries 0-0-0-0, 0-0-0-0, 0-0-0-0... and this sequence of tries is a special sequence that has far-below-average expected utility in the space of all possible sequences. Thus we can improve on this sequence by selecting a random sequence instead.
Or imagine that the combination changes every second. In this case, 0-0-0-0, 0-0-0-0 is just as good as the randomized algorithm - no better and no worse. What this shows you is that the supposedly "random" algorithm is "better" relative to a known regularity of the lock - that the combination is constant on each try. Or to be precise, the reason the random algorithm does predictably better than the stupid one is that the stupid algorithm is "stupid" relative to a known regularity of the lock.
In other words, in order to say that the random algorithm is an "improvement", we must have used specific knowledge about the lock to realize that the unrandomized algorithm is worse-than-average. Having realized this, can we reflect further on our information, and take full advantage of our knowledge to do better-than-average?
The random lockpicker is still not optimal - it does not take full advantage of the knowledge we have acquired. A random algorithm might randomly try 0-0-0-0 again; it's not impossible, but it could happen. The longer the random algorithm runs, the more likely it is to try the same combination twice; and if the random algorithm is sufficiently unlucky, it might still fail to solve the lock after millions of tries. We can take full advantage of all our knowledge by using an algorithm that systematically tries 0-0-0-0, 0-0-0-1, 0-0-0-2... This algorithm is guaranteed not to repeat itself, and will find the solution in bounded time. Considering the algorithm as a sequence of tries, no other sequence in sequence-space is expected to do better, given our initial knowledge. (Any other nonrepeating sequence is equally good; but nonrepeating sequences are rare in the space of all possible sequences.)
A combination dial often has a tolerance of 2 in either direction. 20-45-35 will open a lock set to 22-44-33. In this case, the algorithm that tries 0-1-0, 0-2-0, et cetera, ends up being stupid again; a randomized algorithm will (usually) work better. But an algorithm that tries 0-5-0, 0-10-0, 0-10-5, will work better still.
Sometimes it is too expensive to take advantage of all the knowledge that we could, in theory, acquire from previous tests. Moreover, a complete enumeration or interval-skipping algorithm would still end up being stupid. In this case, computer scientists often use a cheap pseudo-random algorithm, because the computational cost of using our knowledge exceeds the benefit to be gained from using it. This does not show the power of randomness, but, rather, the predictable stupidity of certain specific deterministic algorithms on that particular problem. Remember, the pseudo-random algorithm is also deterministic! But the deterministic pseudo-random algorithm doesn't belong to the class of algorithms that are predictably stupid (do much worse than average).
There are also subtler ways for adding noise to improve algorithms. For example, there are neural network training algorithms that work better if you simulate noise in the neurons. On this occasion it is especially tempting to say something like:
"Lo! When we make our artificial neurons noisy, just like biological neurons, they work better! Behold the healing life-force of entropy!"
What might actually be happening - for example - is that the network training algorithm, operating on noiseless neurons, would vastly overfit the data.
If you expose the noiseless network to the series of coinflips "HTTTHHTTH"... the training algorithm will say the equivalent of, "I bet this coin was specially designed to produce HTTTHHTTH every time it's flipped!" instead of "This coin probably alternates randomly between heads and tails." A hypothesis overfitted to the data does not generalize. On the other hand, when we add noise to the neurons and then try training them again, they can no longer fit the data precisely, so instead they settle into a simpler hypothesis like "This coin alternates randomly between heads and tails." Note that this requires us - the programmers - to know in advance that probabilistic hypotheses are more likely to be true.
Or here's another way of looking at it: A neural network algorithm typically looks at a set of training instances, tweaks the units and their connections based on the training instances, and in this way tries to "stamp" the experience onto itself. But the algorithms which do the stamping are often poorly understood, and it is possible to stamp too hard. If this mistake has already been made, then blurring the sensory information, or blurring the training algorithm, or blurring the units, can partially cancel out the "overlearning".
Here's a simplified example of a similar (but not identical) case: Imagine that the environment deals us a random mix of cards, 70% blue and 30% red. But instead of just predicting "blue" or "red", we have to assign a quantitative probability to seeing blue - and the scoring rule for our performance is one that elicits an honest estimate; if the actual frequency is 70% blue cards, we do best by replying "70%", not 60% or 80%. ("Proper scoring rule.")
If you don't know in advance the frequency of blue and red cards, one way to handle the problem would be to have a blue unit and a red unit, both wired to an output unit. The blue unit sends signals with a positive effect that make the target unit more likely to fire; the red unit inhibits its targets - just like the excitatory and inhibitory synapses in the human brain! (Or an earthworm's brain, for that matter...)
Each time we see a blue card in the training data, the training algorithm increases the strength of the (excitatory) synapse from the blue unit to the output unit; and each time we see a red card, the training algorithm strengthens the (inhibitory) synapse from the red unit to the output unit.
But wait! We haven't said why the blue or red units might fire in the first place. So we'll add two more excitatory units that spike randomly, one connected to the blue unit and one connected to red unit. This simulates the natural background noise present in the human brain (or an earthworm's brain).
Finally, the spiking frequency of the output unit becomes the predicted probability that the next card is blue.
As you can see - assuming you haven't lost track of all the complexity - this neural network learns to predict whether blue cards or red cards are more common in the mix. Just like a human brain!
At least that's the theory. However, when we boot up the neural network and give it a hundred training examples with 70 blue and 30 red cards, it ends up predicting a 90% probability that each card will be blue. Now, there are several things that could go wrong with a system this complex; but my own first impulse would be to guess that the training algorithm is too strongly adjusting the synaptic weight from the blue or red unit to the output unit on each training instance. The training algorithm needs to shift a little less far - alter the synaptic weights a little less.
But the programmer of the neural network comes up with a different hypothesis: the problem is that there's no noise in the input. This is biologically unrealistic; real organisms do not have perfect vision or perfect information about the environment. So the programmer shuffles a few randomly generated blue and red cards (50% probability of each) into the training sequences. Then the programmer adjusts the noise level until the network finally ends up predicting blue with 70% probability. And it turns out that using almost the same noise level (with just a bit of further tweaking), the improved training algorithm can learn to assign the right probability to sequences of 80% blue or 60% blue cards.
Success! We have found the Right Amount of Noise.
Of course this success comes with certain disadvantages. For example, maybe the blue and red cards are predictable, in the sense of coming in a learnable sequence. Maybe the sequence is 7 blue cards followed by 3 red cards. If we mix noise into the sensory data, we may never notice this important regularity, or learn it imperfectly... but that's the price you pay for biological realism.
What's really happening is that the training algorithm moves too far given its data, and adulterating noise with the data diminishes the impact of the data. The two errors partially cancel out, but at the cost of a nontrivial loss in what we could, in principle, have learned from the data. It would be better to adjust the training algorithm and keep the data clean.
This is an extremely oversimplified example, but it is not a total strawman. The scenario seems silly only because it is simplified to the point where you can clearly see what is going wrong. Make the neural network a lot more complicated, and the solution of adding noise to the inputs might sound a lot more plausible. While some neural network algorithms are well-understood mathematically, others are not. In particular, systems crafted with the aim of biological realism are often not well-understood.
But it is an inherently odd proposition that you can get a better picture of the environment by adding noise to your sensory information - by deliberately throwing away your sensory acuity. This can only degrade the mutual information between yourself and the environment. It can only diminish what in principle can be extracted from the data. And this is as true for every step of the computation, as it is for the eyes themselves. The only way that adding random noise will help is if some step of the sensory processing is doing worse than random.
Now it is certainly possible to design an imperfect reasoner that only works in the presence of an accustomed noise level. Biological systems are unable to avoid noise, and therefore adapt to overcome noise. Subtract the noise, and mechanisms adapted to the presence of noise may do strange things.
Biological systems are often fragile under conditions that have no evolutionary precedent in the ancestral environment. If somehow the Earth's gravity decreased, then birds might become unstable, lurching up in the air as their wings overcompensated for the now-decreased gravity. But this doesn't mean that stronger gravity helps birds fly better. Gravity is still the difficult challenge that a bird's wings work to overcome - even though birds are now adapted to gravity as an invariant.
What about hill-climbing, simulated annealing, or genetic algorithms? These AI algorithms are local search techniques that randomly investigate some of their nearest neighbors. If an investigated neighbor is superior to the current position, the algorithm jumps there. (Or sometimes probabilistically jumps to a neighbor with probability determined by the difference between neighbor goodness and current goodness.) Are these techniques drawing on the power of noise?
Local search algorithms take advantage of the regularity of the search space - that if you find a good point in the search space, its neighborhood of closely similar points is a likely place to search for a slightly better neighbor. And then this neighbor, in turn, is a likely place to search for a still better neighbor; and so on. To the extent this regularity of the search space breaks down, hill-climbing algorithms will perform poorly. If the neighbors of a good point are no more likely to be good than randomly selected points, then a hill-climbing algorithm simply won't work.
But still, doesn't a local search algorithm need to make random changes to the current point in order to generate neighbors for evaluation? Not necessarily; some local search algorithms systematically generate all possible neighbors, and select the best one. These greedy algorithms work fine for some problems, but on other problems it has been found that greedy local algorithms get stuck in local minima. The next step up from greedy local algorithms, in terms of added randomness, is random-restart hill-climbing - as soon as we find a local maximum, we restart someplace random, and repeat this process a number of times. For our final solution, we return the best local maximum found when time runs out. Random-restart hill-climbing is surprisingly useful; it can easily solve some problem classes where any individual starting point is unlikely to lead to a global maximum or acceptable solution, but it is likely that at least one of a thousand individual starting points will lead to the global maximum or acceptable solution.
The non-randomly-restarting, greedy, local-maximum-grabbing algorithm, is "stupid" at the stage where it gets stuck in a local maximum. Once you find a local maximum, you know you're not going to do better by greedy local search - so you may as well try something else with your time. Picking a random point and starting again is drastic, but it's not as stupid as searching the neighbors of a particular local maximum over and over again. (Biological species often do get stuck in local optima. Evolution, being unintelligent, has no mind with which to "notice" when it is testing the same gene complexes over and over.)
Even more stupid is picking a particular starting point, and then evaluating its fitness over and over again, without even searching its neighbors. This is the lockpicker who goes on trying 0-0-0-0 forever.
Hill-climbing search is not so much a little bit randomized compared to the completely stupid lockpicker, as almost entirely nonrandomized compared to a completely ignorant searcher. We search only the local neighborhood, rather than selecting a random point from the entire state space. That probability distribution has been narrowed enormously, relative to the overall state space. This exploits the belief - the knowledge, if the belief is correct - that a good point probably has good neighbors.
You can imagine splitting a hill-climbing algorithm into components that are "deterministic" (or rather, knowledge-exploiting) and "randomized" (the leftover ignorance).
A programmer writing a probabilistic hill-climber will use some formula to assign probabilities to each neighbor, as a function of the neighbor's fitness. For example, a neighbor with a fitness of 60 might have probability 80% of being selected, while other neighbors with fitnesses of 55, 52, and 40 might have selection probabilities of 10%, 9%, and 1%. The programmer writes a deterministic algorithm, a fixed formula, that produces these numbers - 80, 10, 9, and 1.
What about the actual job of making a random selection at these probabilities? Usually the programmer will hand that job off to someone else's pseudo-random algorithm - most language's standard libraries contain a standard pseudo-random algorithm; there's no need to write your own.
If the hill-climber doesn't seem to work well, the programmer tweaks the deterministic part of the algorithm, the part that assigns these fixed numbers 80, 10, 9, and 1. The programmer does not say - "I bet these probabilities are right, but I need a source that's even more random, like a thermal noise generator, instead of this merely pseudo-random algorithm that is ultimately deterministic!" The programmer does not go in search of better noise.
It is theoretically possible for a poorly designed "pseudo-random algorithm" to be stupid relative to the search space; for example, it might always jump in the same direction. But the "pseudo-random algorithm" has to be really shoddy for that to happen. You're only likely to get stuck with that problem if you reinvent the wheel instead of using a standard, off-the-shelf solution. A decent pseudo-random algorithm works just as well as a thermal noise source on optimization problems. It is possible (though difficult) for an exceptionally poor noise source to be exceptionally stupid on the problem, but you cannot do exceptionally well by finding a noise source that is exceptionally random. The power comes from the knowledge - the deterministic formula that assigns a fixed probability distribution. It does not reside in the remaining ignorance.
And that's why I always say that the power of natural selection comes from the selection part, not the mutation part.
As a general principle, on any problem for which you know that a particular unrandomized algorithm is unusually stupid - so that a randomized algorithm seems wiser - you should be able to use the same knowledge to produce a superior derandomized algorithm. If nothing else seems obvious, just avoid outputs that look "unrandomized"! If you know something is stupid, deliberately avoid it! (There are exceptions to this rule, but they have to do with defeating cryptographic adversaries - that is, preventing someone else's intelligence from working on you. Certainly entropy can act as an antidote to intelligence!) And of course there are very common practical exceptions whenever the computational cost of using all our knowledge exceeds the marginal benefit...
Still you should find, as a general principle, that randomness hath no power: there is no beauty in entropy, nor strength from noise.