What about Monte Carlo methods? There are many problems for which Monte Carlo integration is the most efficient method available.
(you are of course free to suggest and to suspect anything you like; I will, however, point out that suspicion is no substitute for building something that actually works...)
It is getting late, so this may be way off, and I haven't had time to read the paper.
This is also assuming finitely many trials, right? Because over infinitely many trials, if you have a non-zero probability of siding with the wrong group of classifiers, you will make infinitely many mistakes, no matter how small the probabilities get.
It seems it is trading off a better expectation for a worse real worst case.
Also, are you thinking of formalising an alternative to the infinite SI worst case you describe?
Um, maybe I misunderstood what you said or the proofs, but isn't there a much simpler way to reject the proof of the WMA's superiority? Namely, that the first proof is, as you say, a worst-case scenario... but the second proof is NOT: it uses expected value rather than worst case, right? So it's not just that "worst case analysis does really weird things"; on top of that, the two proofs aren't even validly comparable, because one is a worst-case analysis and the other is an expected-value analysis.
Or did I horribly misunderstand either the second proof or your argument against it? (if so, I blame lack of sleep)
Psy, the "worst case expected value" is a bound directly comparable to the "worst case value" because the actual trajectory of expert weights is the same between the two algorithms.
Pearson, as an infinite set atheist I see no reason to drag infinities into this case, and I don't recall any of the proofs talking about them either.
Has anyone proved a theorem on the uselessness of randomness?
Nominull's suggestion:
"By adding randomness to your algorithm, you spread its behaviors out over a particular distribution, and there must be at least one point in that distribution whose expected value is at least as high as the average expected value of the distribution."
Might be a starting point for such a proof. However it does rely on being able to find that "one point" whose expected value is >= the average expected value of the distribution. Perhaps it is easy to prove that good points lie in a certain range, but all obvious specific choices in that range are bad.
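A minimal way to formalize that averaging point (my notation, not from the thread): if a randomized algorithm is a distribution D over deterministic algorithms a, and performance is measured as expected cost over the environment's inputs x, then

$$\min_{a \in \mathrm{supp}(D)} \mathbb{E}_{x}\!\left[\mathrm{cost}(a,x)\right] \;\le\; \mathbb{E}_{a \sim D}\!\left[\mathbb{E}_{x}\!\left[\mathrm{cost}(a,x)\right]\right]$$

so at least one deterministic algorithm in the support of D has expected cost no worse than the randomized mixture. The difficulty raised above is in constructively exhibiting that algorithm.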
So, the randomized algorithm isn't really better than the unrandomized one, because a bad result from the unrandomized one only happens when your environment maliciously hands you a problem whose features match up just wrong with the non-random choices you make. So all you need to do is make those choices in a way that's tremendously unlikely to match up just wrong with anything the environment hands you, because it doesn't contain the same sorts of patterns that the environment might inflict on you.
Except that the definition of "...
I think this series of posts should contain a disclaimer at some point acknowledging the usefulness of randomized algorithms in practice (even excluding situations like distributed computing and cryptography). In a broad sense, though randomness offers us no advantage information-theoretically (which is what you seem to be saying), randomness does offer a lot of advantages computationally. This fact is not at odds with what you are saying, but it deserves to be stated more explicitly.
There are many algorithmic tasks which become feasible via randomized algorithms, for which no feasible deterministic alternatives are known - until recently primality testing was one such problem (even now, the deterministic primality test is unusable in practice). Another crucial advantage of randomized algorithms is that in many cases it is much easier to come up with a randomized algorithm and later, if needed, derandomize the algorithm. Randomized algorithms are extremely useful in tackling algorithmic problems and I think any AI that has to design algorithms would have to be able to reason in this paradigm (cf. Randomized algorithms, Motwani and Raghavan or Probability and Computing, Mitzenmacher and Upfal).
To put my point another way...
Say your experts, above, predict instance frequencies in populations. Huge populations, so you can only test expert accuracy by checking subsets of those populations.
So, you write an algorithm to choose which individuals to sample from populations when testing the experts' beliefs. Aren't there situations where the best algorithm, the one least likely to be biased toward any particular expert algorithm, would contain some random element?
I fear randomness is occasionally necessary just to fairly check the accuracy of beliefs about the world and overcome bias.
Eliezer: I recognize that the two algorithms have the exact same trajectory for the weights, that the only thing that's different between them is the final output, not the weights. However, maybe I'm being clueless here, but I still don't see that as justifying considering the two different proof results really comparable.
The real worst case analysis for the second algorithm would be more like "the random source just happened to (extremely unlikely, but...) always come up with the wrong answer."
Obviously this sort of worst case analysis isn't rea...
Sure, randomness is the lazy way out. But if computational resources are not infinite, laziness is sometimes the smart way out. And I said "infinite" - even transhuman quantum computers are finite.
I agree with Psy-Kosh, I don't think the proofs are comparable.
The difference between the two derivations is that #1 uses a weak upper bound for W, while #2 uses an equality. This means that the final bound is much weaker in #1 than in #2.
It's possible to redo the derivation for the nonrandomized version and get the exact same bound as for the randomized case. The trick is to write M as the number of samples for which F_i > 1/2. Under weak assumptions on the performance of the experts you can show that this is less than Sum_i F_i.
Eliezer: thanks for the AI brain-teaser, I hope to see more posts like this in the future.
I agree with Psy-Kosh too. The key is, as Eliezer originally wrote, never. That word appears in Theorem 1 (about the deterministic algorithm), but it does not appear in Theorem 2 (the bound on the randomized algorithm).
Basically, this is the same insight Eliezer suggests, that the environment is being allowed to be a superintelligent entity with complete knowledge in the proof for the deterministic bound, but the environment is not allowed the same powers in the proof for the randomized one.
In other words, Eliezer's conclusion is correct, but I don't thi...
The question whether randomized algorithms can have an advantage over deterministic algorithms is called the P vs BPP problem. As far as I know, most people believe that P=BPP, that is, randomization doesn't actually add power. However, nobody knows how to prove this. Furthermore, there exists an oracle relative to which P is not equal to BPP. Therefore any proof that P=BPP has to be nonrelativizing. This rules out a lot of the more obvious methods of proof.
see: complexity zoo
Nicely done. Quite a number of points raised that escape many people and which are rarely raised in practice.
Someone please tell me if I understand this post correctly. Here is my attempt to summarize it:
"The two textbook results are results specifically about the worst case. But you only encounter the worst case when the environment can extract the maximum amount of knowledge it can about your 'experts', and exploits this knowledge to worsen your results. For this case (and nearby similar ones) only, randomizing your algorithm helps, but only because it destroys the ability of this 'adversary' to learn about your experts. If you instead average over all cases, the non-random algorithm works better."
Is that the argument?
I am interested in what Scott Aaronson says to this.
I am unconvinced, and I agree with both the commenters g and R above. I would say Eliezer is underestimating the number of problems where the environment gives you correlated data and where the correlation is essentially a distraction. Hash functions are, e.g., widely used in programming tasks and not just by cryptographers. Randomized algorithms often are based on non-trivial insights into the problem at hand. For example, the insight in hashing and related approaches is that "two (different) object...
@ comingstorm: Quasi Monte Carlo often outperforms Monte Carlo integration for problems of small dimensionality involving "smooth" integrands. This is, however, not yet rigorously understood. (The proven bounds on performance for medium dimensionality seem to be extremely loose.)
Besides, MC doesn't require randomness in the "Kolmogorov complexity == length" sense, but in the "passes statistical randomness tests" sense. Eliezer has, as far as I can see, not talked about the various definitions of randomness.
Fascinating post. I especially like the observation that the purpose of randomness in all these examples is to defeat intelligence. It seems that this is indeed the purpose for nearly every use of randomness that I can think of. We randomise presidents' routes to defeat the intelligence of an assassin; a casino randomises its roulette wheels to defeat the intelligence of the gambler. Even in sampling, we randomise our sample to defeat our own intelligence, which we need to treat as hostile for the purpose of the experiment.
What about in compressed sensing, where we want an incoherent basis? I was under the impression that random linear projections were the best you could do here.
kyb, see the discussion of quicksort on the other thread. Randomness is used to protect against worst-case behavior, but it's not because we're afraid of intelligent adversaries. It's because worst-case behavior for quicksort happens a lot. If we had a good description of naturally occurring lists, we could design a deterministic pivot algorithm, but we don't. We only have the observation that simple guess-the-median algorithms perform badly on real data. It's not terribly surprising that human-built lists resonate with human-designed pivot algorithms; but the opposite scenario, where the simplex method works well in practice, is not surprising either.
I agree with Psy, the bounds are not comparable.
In fact, the bound for #2 should be the same as the one for #1. I mean, in the really worst case, the randomized algorithm makes exactly the same predictions as the deterministic one. I advise blaming and demolishing your quantum source of random numbers in that case.
Tentative, but what about cases where you don't trust your own judgement and suspect that you need to be shaken out of your habits?
What about Monte Carlo methods? There are many problems for which Monte Carlo integration is the most efficient method available.
Monte Carlo methods can't buy you any correctness. They are useful because they allow you to sacrifice an unnecessary bit of correctness in order to get a result in a much shorter time on an otherwise intractable problem. They are also useful for simulating the effects of real-world randomness (or at least behavior you have no idea how to systematically predict).
So, for example, I used a Monte Carlo script to determine expected ...
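A minimal sketch of the trade-off being described, on a toy problem (estimating pi by random sampling): the noise buys simplicity and a graceful accuracy-for-time dial, not correctness.

```python
import random

def mc_pi(num_samples):
    """Estimate pi by uniform sampling in the unit square and counting
    points that fall inside the quarter circle."""
    hits = sum(
        1 for _ in range(num_samples)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    return 4.0 * hits / num_samples

# Error shrinks roughly as 1/sqrt(N): we are buying accuracy with compute
# time (and development simplicity), not extracting it from the noise.
for n in (1_000, 100_000, 1_000_000):
    print(n, mc_pi(n))
```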
Congratulations Eliezer, this subject is very interesting.
Nonetheless useful, i.e. Monte Carlo methods with pseudo-random numbers ARE actually deterministic, but we discard the details (very complex and not useful for the application) and only care about some statistical properties. This is a state of subjective partial information, the business of Bayesian probability, only this partial information is deliberate! (Or rather, a pragmatic compromise due to limitations in computational power. Otherwise we'd just use classical numerical integration happily in...
As stated, the statement "Randomness didn't buy me any accuracy, it was a way of trading accuracy for development time" is a complete non sequitur, since nobody actually believes that randomness is a means of increased accuracy in the long term; it is used as a performance booster. It merely means you can get the answer faster with less knowledge, which we do all the time, since there are a plethora of systems that we can't understand.
Eliezer, can you mathematically characterize those environments where randomization can help? Presumably these are the "much more intelligent than you, and out to get you" environments, but I'd like to see that characterized a bit better.
A slight tangent here, but Blum loses me even before the considerations that Eliezer has so clearly pointed out. Comparing an upper bound for the worst case of one algorithm with an upper bound for the average case of the other is meaningless. Even if like things were being compared, mere upper bounds on the true values are only weak evidence of the quality of the algorithms.
FWIW, I ran a few simulations (for the case of complete independence of all experts from themselves and each other), and unsurprisingly, the randomised algorithm performed on average wors...
Anyone who evaluates the performance of an algorithm by testing it with random data (e.g., simulating these expert-combining algorithms with randomly-erring "experts") is ipso facto executing a randomized algorithm...
g: Indeed I was. Easier than performing an analytical calculation and sufficiently rigorous for the present purpose. (The optimal weights, on the other hand, are log(p/(1-p)), by an argument that is both easier and more rigorous than a simulation.)
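For the record, a sketch of the kind of argument being alluded to, under the idealized assumption that experts err independently, expert k being correct with probability p_k: combining their votes by Bayes' rule gives posterior log-odds equal to a weighted vote with weights log(p_k/(1-p_k)),

$$\log\frac{P(+\mid \text{votes})}{P(-\mid \text{votes})} \;=\; \log\frac{P(+)}{P(-)} \;+\; \sum_k s_k \log\frac{p_k}{1-p_k}, \qquad s_k = \begin{cases} +1, & \text{expert } k \text{ votes positive} \\ -1, & \text{expert } k \text{ votes negative.} \end{cases}$$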
Richard, I wasn't suggesting that there's anything wrong with your running a simulation, I just thought it was amusing in this particular context.
Silas - yeah, that's about the size of it.
Eliezer, when you come to edit this for popular publication, lose the maths, or at least put it in an endnote. You're good enough at explaining concepts that if someone's going to get it, they're going to get it without the equations. However, a number of those people will switch off there and then. I skipped it and did fine, but algebra is a stop sign for a number of very intelligent people I know.
I am interested in what Scott Aaronson says to this.
I fear Eliezer will get annoyed with me again :), but R and Stephen basically nailed it. Randomness provably never helps in average-case complexity (i.e., where you fix the probability distribution over inputs) -- since given any ensemble of strategies, by convexity there must be at least one deterministic strategy in the ensemble that does at least as well as the average.
On the other hand, if you care about the worst-case running time, then there are settings (such as query complexity) where randomness provably does help. For example, suppose you're given n bits, you're promised that either n/3 or 2n/3 of the bits are 1's, and your task is to decide which. Any deterministic strategy to solve this problem clearly requires looking at 2n/3 + 1 of the bits. On the other hand, a randomized sampling strategy only has to look at O(1) bits to succeed with high probability.
Whether randomness ever helps in worst-case polynomial-time computation is the P versus BPP question, which is in the same league as P versus NP. It's conjectured that P=BPP (i.e., randomness never saves more than a polynomial). This is known to be true if really ...
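A hedged sketch of the sampling strategy for the n/3-vs-2n/3 promise problem described above (constants are illustrative): query a fixed number k of uniformly random positions and compare the sample mean to 1/2; a Chernoff bound makes the error probability decay exponentially in k, independently of n.

```python
import random

def guess_density(bits, k=100):
    """bits is promised to contain either n/3 or 2n/3 ones.  Sample k
    positions uniformly at random and compare the sample mean to 1/2;
    the error probability decays exponentially in k, independently of n."""
    ones = sum(bits[random.randrange(len(bits))] for _ in range(k))
    return "2n/3 ones" if ones > k / 2 else "n/3 ones"

# Toy check (n chosen divisible by 3 for convenience):
n = 300_000
bits = [1] * (n // 3) + [0] * (2 * n // 3)
random.shuffle(bits)  # shuffled only for the demo; the sampler doesn't care
print(guess_density(bits))  # "n/3 ones" with high probability
```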
@ Scott Aaronson. Re: your n-bits problem. You're moving the goalposts. Your deterministic algorithm determines with 100% accuracy which situation is true. Your randomized algorithm only determines with "high probability" which situation is true. These are not the same outputs.
You need to establish a goal with a fixed level of probability for the answer, and then compare a randomized algorithm to a deterministic algorithm that only answers to that same level of confidence.
That's the same mistake that everyone always makes, when they say that "randomness provably does help." It's a cheaper way to solve a different goal. Hence, not comparable.
Scott, I don't dispute what you say. I just suggest that the confusing term "in the worst case" be replaced by the more accurate phrase "supposing that the environment is an adversarial superintelligence who can perfectly read all of your mind except bits designated 'random'".
Don: When you fix the goalposts, make sure someone can't kick the ball straight in! :-) Suppose you're given an n-bit string, and you're promised that exactly n/4 of the bits are 1, and they're either all in the left half of the string or all in the right half. The problem is to decide which. It's clear that any deterministic algorithm needs to examine at least n/4 + 1 of the bits to solve this problem. On the other hand, a randomized sampling algorithm can solve the problem with certainty after looking at only O(1) bits on average.
Eliezer: I often tell people that theoretical computer science is basically mathematicized paranoia, and that this is the reason why Israelis so dominate the field. You're absolutely right: we do typically assume the environment is an adversarial superintelligence. But that's not because we literally think it is one, it's because we don't presume to know which distribution over inputs the environment is going to throw at us. (That is, we lack the self-confidence to impose any particular prior on the inputs.) We do often assume that, if we generate random bits ourselves, then the environment isn't going to magically take those bits into account wh...
the environment is an adversarial superintelligence
Oh, Y'all are trying to outwit Murphy? No wonder you're screwed.
Could Scott_Aaronson or anyone who knows what he's talking about, please tell me the name of the n/4 left/right bits problem he's referring to, or otherwise give me a reference for it? His explanation doesn't seem to make sense: the deterministic algorithm needs to examine 1+n/4 bits only in the worst case, so you can't compare that to the average output of the random algorithm. (Average case for the deterministic would, it seems, be n/8 + 1.) Furthermore, I don't understand how the random method could average out to a size-independent constant.
Is the randomized algorithm one that uses a quantum computer or something?
Silas: "Solve" = for a worst-case string (in both the deterministic and randomized cases). In the randomized case, just keep picking random bits and querying them. After O(1) queries, with high probability you'll have queried either a 1 in the left half or a 1 in the right half, at which point you're done.
As far as I know this problem doesn't have a name. But it's how (for example) you construct an oracle separating P from ZPP.
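A sketch of the strategy under discussion, on the reading where query positions are drawn uniformly over the whole string: the answer, once a 1 is found, is certain; only the number of queries is random, and since each query hits a 1 with probability 1/4 on that reading, its expectation is a constant independent of n.

```python
import random

def which_half(bits):
    """bits is promised to have exactly n/4 ones, all of them in either
    the left half or the right half.  Query uniformly random positions
    until a 1 turns up; at that point the answer is certain.  The number
    of queries is geometric with constant mean, independent of n."""
    n = len(bits)
    queries = 0
    while True:
        i = random.randrange(n)
        queries += 1
        if bits[i] == 1:
            return ("left" if i < n // 2 else "right"), queries

n = 100_000
bits = [0] * (n // 2) + [1] * (n // 4) + [0] * (n // 4)  # ones in the right half
print(which_half(bits))
```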
@Scott_Aaronson: Previously, you had said the problem is solved with certainty after O(1) queries (which you had to, to satisfy the objection). Now, you're saying that after O(1) queries, it's merely a "high probability". Did you not change which claim you were defending?
Second, how can the required number of queries not depend on the problem size?
Finally, isn't your example a special case of exactly the situation Eliezer_Yudkowsky describes in this post? In it, he pointed out that the "worst case" corresponds to an adversary who kno...
Silas: Look, as soon you find a 1 bit, you've solved the problem with certainty. You have no remaining doubt about the answer. And the expected number of queries until you find a 1 bit is O(1) (or 2 to be exact). Why? Because with each query, your probability of finding a 1 bit is 1/2. Therefore, the probability that you need t or more queries is (1/2)^(t-1). So you get a geometric series that sums to 2.
(Note: I was careful to say you succeed with certainty after an expected number of queries independent of n -- not that there's a constant c such tha...
Isn't the probability of finding a 1 bit 1/4 in each sample? Because you have to sample both sides. If you randomly sample without repetitions you can do even better than that. You can even bound the worst case for the random algorithm at n/4+1 as well, by seeing if you ever pick n/4+1 0s from one side. So there is no downside to doing it randomly as such.
However if it is an important problem and you think you might be able to find some regularities, the best bet would to be to do bayesian updates on the most likely positions to be ones and preferentially ch...
Let's say the solver has two input tapes: a "problem instance" input and a "noise source" input.
When Eli says "worst-case time", he means the maximum time, over all problem instances and all noise sources, taken to solve the problem.
But when Scott says "worst-case time," he means taking the maximum (over all problem instances X) of the average time (over all noise sources) to solve X.
Scott's criterion seems relevant to real-world situations where one's input is being generated by an adversarial human who knows which ...
@michael e sullivan: re "Monte Carlo methods can't buy you any correctness" -- actually, they can. If you have an exact closed-form solution (or a rapidly-converging series, or whatever) for your numbers, you really want to use it. However not all problems have such a thing; generally, you either simplify (giving a precise, incorrect number that is readily computable and hopefully close), or you can do a numerical evaluation, which might approach the correct solution arbitrarily closely based on how much computation power you devote to it.
Quad...
Silas is right; Scott keeps changing the definition in the middle, which was exactly my original complaint.
For example, Scott says: In the randomized case, just keep picking random bits and querying them. After O(1) queries, with high probability you'll have queried either a 1 in the left half or a 1 in the right half, at which point you're done.
And yet this is no different from a deterministic algorithm. It can also query O(1) bits, and "with high probability" have a certain answer.
I'm really astonished that Scott can't see the sleight-of-hand i...
Don, how well a deterministic algorithm does depends upon what the deterministic algorithm is and what the input is; it cannot always get away with O(1) queries.
E.g. in a 20-bit case: if the deterministic algorithm starts its queries from the beginning, querying each bit in turn, and the pattern is always 00000111110000000000, it will always take n/4+1 queries. Only when the input is random will it on average take O(1) queries.
The random one will take O(1) on average on every type of input.
Don - the "sleight of hand", as you put it is that (as I think has been said before) we're examining worst-case. We explicitly are allowing the string of 1s and 0s to be picked by an adversarial superintelligence who knows our strategy. Scott's algorithm only needs to sample 4 bits on average to answer the question even when the universe it out to get him - the deterministic version will need to sample exactly n/4+1.
Basically, Eliezer seems to be claiming that BPP = P (or possibly something stronger), which most people think is probably true, but...
To generalize Peter's example, a typical deterministic algorithm has low Kolmogorov complexity, and therefore its worst-case input also has low Kolmogorov complexity and therefore a non-negligible probability under complexity-based priors. The only possible solutions to this problem I can see are:
1. add randomization
2. redesign the deterministic algorithm so that it has no worst-case input
3. do a cost-benefit analysis to show that the cost of doing either 1 or 2 is not justified by the expected utility of avoiding the worst-case performance of the origi...
Quantum branching is "truly random" in the sense that branching the universe both ways is an irreducible source of new indexical ignorance. But the more important point is that unless the environment really is out to get you, you might be wiser to exploit the ways that you think the environment might depart from true randomness, rather than using a noise source to throw away this knowledge.
I'm not making any nonstandard claims about computational complexity, only siding with the minority "derandomize things" crowd
Note that I also enthusiastically belong to a "derandomize things" crowd! The difference is, I think derandomizing is hard work (sometimes possible and sometimes not), since I'm unwilling to treat the randomness of the problems the world throws at me on the same footing as randomness I generate myself in the course of solving those problems. (For those watching at home tonight, I hope the differences are now reasonably clear...)
I certainly don't say "it's not hard work", and the environmental probability distribution should not look like the probability distribution you have over your random numbers - it should contain correlations and structure. But once you know what your probability distribution is, then you should do your work relative to that, rather than assuming "worst case". Optimizing for the worst case in environments that aren't actually adversarial, makes even less sense than assuming the environment is as random and unstructured as thermal noise...
once you know what your probability distribution is...
I'd merely stress that that's an enormous "once." When you're writing a program (which, yes, I used to do), normally you have only the foggiest idea of what a typical input is going to be, yet you want the program to work anyway. This is not just a hypothetical worry, or something limited to cryptography: people have actually run into strange problems using pseudorandom generators for Monte Carlo simulations and hashing (see here for example, or Knuth vol 2).
Even so, intuition suggests it sh...
Even if P=BPP, that just means that giving up randomness causes "only" a polynomial slowdown instead of an exponential one, and in practice we'll still need to use pseudorandom generators to simulate randomness.
It seems clear to me that noise (in the sense of randomized algorithms) does have power, but perhaps we need to develop better intuitions as to why that is the case.
Eliezer: There are a few classes of situations I can think of in which it seems like the correct solution (given certain restrictions) requires a source of randomness:
One example would be Hofstadter's Luring Lottery. Assuming all the agents really did have common knowledge of each other's rationality, and assuming no form of communication is allowed between them in any way, isn't it true that no deterministic algorithm to decide how many entries to send in, if any at all, has a higher expected return than a randomized one? (that is, the solutions which are...
The leaders in the NetflixPrize competition for the last couple years have utilized ensembles of large numbers of models with a fairly straightforward integration procedure. You can only get so far with a given model, but if you randomly scramble its hyperparameters or training procedure, and then average multiple runs together, you will improve your performance. The logical path forward is to derandomize this procedure, and figure out how to predict, a priori, which model probabilities become more accurate and which don't. But of course until you figure o...
@ Will: You happen to have named a particular deterministic algorithm; that doesn't say much about every deterministic algorithm. Moreover, none of you folks seem to notice much that pseudo-random algorithms are actually deterministic too...
Finally, you say: "Only when the input is random will [the deterministic algorithm] on average take O(1) queries. The random one will take O(1) on average on every type of input."
I can't tell you how frustrating it is to see you just continue to toss in the "on average" phrase, as though it doesn't...
@ John, @ Scott: You're still doing something odd here. As has been mentioned earlier in the comments, you've imagined a mind-reading superintelligence ... except that it doesn't get to see the internal random string.
Look, this should be pretty simple. The phrase "worst case" has a pretty clear layman's meaning, and there's no reason we need to depart from it.
You're going to get your string of N bits. You need to write an algorithm to find the 1s. If your algorithm ever gives the wrong answer, we're going to shoot you in the head with a gun a...
To look at it one more time ... Scott originally said Suppose you're given an n-bit string, and you're promised that exactly n/4 of the bits are 1, and they're either all in the left half of the string or all in the right half.
So we have a whole set of deterministic algorithms for solving the problem over here, and a whole set of randomized algorithms for solving the same problem. Take the best deterministic algorithm, and the best randomized algorithm.
Some people want to claim that the best randomized algorithm is "provably better". Really? B...
Don, you missed my comment saying you could bound the randomized algorithm in the same way to n/4+1, by keeping track of where you pick: if you find n/4+1 0s in one half, you conclude the ones are in the other half.
I wouldn't say that a randomized algorithm is better per se, just more generally useful than a particular deterministic one. You don't have to worry about the types of inputs given to it.
This case matters because I am a software creator and I want to reuse my software components. In most cases I don't care about performance of every little sub com...
Don, if n is big enough then sure, I'm more than willing to play your game (assuming I get something out of it if I win).
My randomised algorithm will get the right answer within 4 tries in expectation. If I'm allowed to sample a thousand bits then the probability that it won't have found the answer by then is .75^1000 which I think is something like 10^-120. And this applies even if you're allowed to choose the string after looking at my algorithm.
If you play the game this way with your deterministic algorithm you will always need to sample n/4 + 1 bits ...
@ Will: Yes, you're right. You can make a randomized algorithm that has the same worst-case performance as the deterministic one. (It may have slightly impaired average case performance compared to the algorithm you folks had been discussing previously, but that's a tradeoff one can make.) My only point is that concluding that the randomized one is necessarily better, is far too strong a conclusion (given the evidence that has been presented in this thread so far).
But sure, you are correct that adding a random search is a cheap way to have good confiden...
@ John: can you really not see the difference between "this is guaranteed to succeed", vs. "this has only a tiny likelihood of failure"? Those aren't the same statements.
"If you play the game this way" -- but why would anyone want to play a game the way you're describing? Why is that an interesting game to play, an interesting way to compare algorithms? It's not about worst case in the real world, it's not about average case in the real world. It's about performance on a scenario that never occurs. Why judge algorithms on...
@Scott: would it be fair to say that the P=BPP question boils down to whether one can make a "secure" random source such that the adversarial environment can't predict it? Is that why the problem is equivalent to having good PRNG? (I'm a physicist, but who thinks he knows some of the terminology of complexity theory --- apologies if I've misunderstood the terms.)
I can't help but think that the reason why randomization appear to add information in the context of machine learning classifiers for the dichotomous case is that the actual evaluation of the algorithms is empirical, and, as such, use finite data. Two classes of non-demonic error always exist in the finite case. These are measurement error, and sampling error.
Multiattribute classifiers use individual measurements from many attributes. No empirical measurement is made without error; the more attributes that are measured, the more measurement error intrud...
When the environment is adversarial, smarter than you are, and informed about your methods, then in a theoretical sense it may be wise to have a quantum noise source handy.
It can be handy even if it's not adversarial, and even if it is equally as smart as you are but not smarter. Suppose you're playing a game with payoffs (A,A) = (0,0), (A,B) = (1,1), (B,A) = (1,1), (B,B) = (0,0) against a clone of yourself (and you can't talk to each other); the mixed strategy where you pick A half of the time and B half of the time wins half of the time, but either p...
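A toy simulation of the game described above, assuming the clone runs the same strategy code but draws on an independent noise source (names and payoff encoding are mine):

```python
import random

def deterministic_strategy():
    return "A"  # any fixed rule; the clone, running the same code, picks the same thing

def mixed_strategy():
    return random.choice(["A", "B"])

def payoff(x, y):
    # (A,B) and (B,A) pay 1 to each player; (A,A) and (B,B) pay 0.
    return 1 if x != y else 0

trials = 100_000
det = sum(payoff(deterministic_strategy(), deterministic_strategy()) for _ in range(trials))
mix = sum(payoff(mixed_strategy(), mixed_strategy()) for _ in range(trials))
print("deterministic:", det / trials)  # 0.0 -- you and your clone always collide
print("mixed 50/50:  ", mix / trials)  # ~0.5 -- independent coin flips anti-coordinate half the time
```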
Sometimes the environment really is adversarial, though.
Regular expressions as implemented in many programming languages work fine for almost all inputs, but with terrible upper bounds. This is why libraries like re2 are needed if you want to let anyone on the Internet use regular expressions in your search engine.
Another example that comes to mind is that quicksort runs in n·log n expected time if the input is randomly shuffled beforehand.
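A sketch of the standard fix being referred to: choose the pivot uniformly at random (equivalent in effect to shuffling the input first), so the expected O(n log n) behavior holds for every input ordering, including adversarially sorted ones.

```python
import random

def quicksort(xs):
    """Quicksort with a uniformly random pivot.  The expected number of
    comparisons is O(n log n) for every input ordering; only unlucky coin
    flips, not a particular input, can produce the O(n^2) worst case."""
    if len(xs) <= 1:
        return xs
    pivot = xs[random.randrange(len(xs))]
    return (quicksort([x for x in xs if x < pivot])
            + [x for x in xs if x == pivot]
            + quicksort([x for x in xs if x > pivot]))
```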
So let me see if I've got this straight.
Computer Scientists: For some problems, there are random algorithms with the property that they succeed with high probability for any possible input. No deterministic algorithm for these problems has this property. Therefore, random algorithms are superior.
Eliezer: But if we knew the probability distribution over possible inputs, we could create a deterministic algorithm with the property.
Computer Scientists: But we do not know the probability distribution over possible inputs.
Eliezer: Never say "I don't know"...
Followup to: Worse Than Random, Trust In Bayes
In the wider field of Artificial Intelligence, it is not universally agreed and acknowledged that noise hath no power. Indeed, the conventional view in machine learning is that randomized algorithms sometimes perform better than unrandomized counterparts and there is nothing peculiar about this. Thus, reading an ordinary paper in the AI literature, you may suddenly run across a remark: "There is also an improved version of this algorithm, which takes advantage of randomization..."
Now for myself I will be instantly suspicious; I shall inspect the math for reasons why the unrandomized algorithm is being somewhere stupid, or why the randomized algorithm has a hidden disadvantage. I will look for something peculiar enough to explain the peculiar circumstance of a randomized algorithm somehow doing better.
I am not completely alone in this view. E. T. Jaynes, I found, was of the same mind: "It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought." Apparently there's now a small cottage industry in derandomizing algorithms. But so far as I know, it is not yet the majority, mainstream view that "we can improve this algorithm by randomizing it" is an extremely suspicious thing to say.
Let us now consider a specific example - a mainstream AI algorithm where there is, apparently, a mathematical proof that the randomized version performs better. By showing how subtle the gotcha can be, I hope to convince you that, even if you run across a case where the randomized algorithm is widely believed to perform better, and you can't find the gotcha yourself, you should nonetheless trust that there's a gotcha to be found.
For our particular example we shall examine the weighted majority algorithm, first introduced by Littlestone and Warmuth, but as a substantially more readable online reference I will use section 2 of Blum's On-Line Algorithms in Machine Learning. (Blum has an easier-to-read, more concrete version of the proofs, and the unrandomized and randomized algorithms are presented side-by-side rather than far apart.)
The weighted majority algorithm is an ensemble method - a way to combine the advice from several other algorithms or hypotheses, called "experts". Experts strive to classify a set of environmental instances into "positive" and "negative" categories; each expert produces a prediction for each instance. For example, each expert might look at a photo of a forest, and try to determine whether or not there is a camouflaged tank in the forest. Expert predictions are binary, either "positive" or "negative", with no probability attached. We do not assume that any of our experts is perfect, and we do not know which of our experts is the best. We also do not assume anything about the samples (they may not be independent and identically distributed).
The weighted majority algorithm initially assigns an equal weight of 1 to all experts. On each round, we ask all the experts for their predictions, and sum up the weights for each of the two possible predictions, "positive" or "negative". We output the prediction that has the higher weight. Then, when we see the actual result, we multiply by 1/2 the weight of every expert that got the prediction wrong.
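For concreteness, here is a minimal sketch of one round of both versions as just described (my own code, not taken from Blum's paper; the halving penalty and the weight-proportional output rule follow the text above):

```python
import random

def weighted_majority_round(predictions, outcome, weights, randomized=False):
    """One round of the (randomized) weighted majority algorithm.

    predictions: list of 0/1 votes, one per expert
    outcome:     the actual 0/1 label, revealed after we commit to a prediction
    weights:     current expert weights (updated in place)
    Returns this round's prediction.
    """
    pos = sum(w for w, p in zip(weights, predictions) if p == 1)
    neg = sum(w for w, p in zip(weights, predictions) if p == 0)

    if randomized:
        # Randomized version: predict 1 with probability proportional to
        # the weight placed on 1.
        prediction = 1 if random.random() < pos / (pos + neg) else 0
    else:
        # Deterministic version: side with the weighted majority.
        prediction = 1 if pos >= neg else 0

    # The update is identical in both versions and does not depend on our
    # own prediction: halve the weight of every expert that was wrong.
    for i, p in enumerate(predictions):
        if p != outcome:
            weights[i] *= 0.5

    return prediction
```

All weights start at 1. Note that the weight trajectory does not depend on which output rule is used; only the prediction differs, which is why most of the randomized proof below proceeds exactly as in the deterministic case.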
Suppose the total number of experts is n, and the best expert makes no more than m mistakes over some given sampling sequence. Then we can prove that the weighted majority algorithm makes a total number of mistakes M that is bounded by 2.41*(m + log2(n)). In other words, the weighted majority algorithm makes no more mistakes than the best expert, plus a factor of log n, times a constant factor of 2.41.
Proof (taken from Blum 1996; a similar proof appears in Littlestone and Warmuth 1989): The combined weight of all the experts at the start of the problem is W = n. If the weighted majority algorithm makes a mistake, at least half the total weight of experts predicted incorrectly, so the total weight is reduced by a factor of at least 1/4. Thus, after the weighted majority algorithm makes M mistakes, its total weight W is at most (3/4)^M times its starting value. In other words:

W <= n * (3/4)^M
But if the best expert has made at most m mistakes, its weight is at least (1/2)^m. And since W includes the weight of all experts,

(1/2)^m <= W <= n * (3/4)^M

Therefore:

(1/2)^m <= n * (3/4)^M, which rearranges to M <= (m + log2(n)) / log2(4/3) <= 2.41*(m + log2(n))
Blum then says that "we can achieve a better bound than that described above", by randomizing the algorithm to predict "positive" or "negative" with probability proportional to the weight assigned each prediction. Thus, if 3/4 of the expert weight went to "positive", we would predict "positive" with probability 75%, and "negative" with probability 25%.
An essentially similar proof, summing over the expected probability of a mistake on each round, will show that in this case the expected number of mistakes M satisfies:

M <= 1.39*m + 2*ln(n)
Since the penalty applied to particular experts does not depend on the global prediction but only on the actual outcome, most of the proof proceeds as before. We have

(1/2)^m <= W

where again m is the number of mistakes made by the best expert and W is the final total weight of all experts.
We also have that W is the starting weight n times the product of (1 - 1/2 F_i), where F_i is the fraction of mistaken expert weight on round i:

W = n * Prod_i (1 - F_i/2)
And since we predict with probability proportional to the expert weight, the expected number of mistakes is just the sum over F_i:

M = Sum_i F_i
So:

(1/2)^m <= n * Prod_i (1 - F_i/2)

Taking logarithms and using ln(1 - F_i/2) <= -F_i/2 gives -m*ln(2) <= ln(n) - (1/2)*Sum_i F_i, and therefore

M = Sum_i F_i <= 2*ln(n) + 2*ln(2)*m <= 1.39*m + 2*ln(n)
Behold, we have done better by randomizing!
We should be especially suspicious that the randomized algorithm guesses with probability proportional to the expert weight assigned. This seems strongly reminiscent of betting with 70% probability on blue, when the environment is a random mix of 70% blue and 30% red cards. We know the best bet - and yet we only sometimes make this best bet, at other times betting on a condition we believe to be less probable.
Yet we thereby prove a smaller upper bound on the expected error. Is there an algebraic error in the second proof? Are we extracting useful work from a noise source? Is our knowledge harming us so much that we can do better through ignorance?
Maybe the unrandomized algorithm fails to take into account the Bayesian value of information, and hence fails to exhibit exploratory behavior? Maybe it doesn't test unusual cases to see if they might have surprising results?
But, examining the assumptions, we see that the feedback we receive is fixed, regardless of the prediction's output. Nor does the updating of the expert weights depend on the predictions we output. It doesn't matter whether we substitute a completely random string of 1s and 0s as our actual output. We get back exactly the same data from the environment, and the weighted majority algorithm updates the expert weights in exactly the same way. So that can't be the explanation.
Are the component experts doing worse than random, so that by randomizing our predictions, we can creep back up toward maximum entropy? But we didn't assume anything about how often the component experts were right, or use the fact in the proofs. Therefore the proofs would carry even if we specified that all experts were right at least 60% of the time. It's hard to see how randomizing our predictions could help, in that case - but the proofs still go through, unchanged.
So where's the gotcha?
Maybe I'm about to tell you that I looked, and I couldn't find the gotcha either. What would you believe, in that case? Would you think that the whole thesis on the futility of randomness was probably just wrong - a reasonable-sounding philosophical argument that simply wasn't correct in practice? Would you despair of being able to follow the math, and give up and shrug, unable to decide who might be right?
We don't always see the flaws right away, and that's something to always remember.
In any case, I'm about to start explaining the gotcha. If you want to go back, read the paper, and look yourself, you should stop reading now...
There does exist a rare class of occasions where we want a source of "true" randomness, such as a quantum measurement device. For example, you are playing rock-paper-scissors against an opponent who is smarter than you are, and who knows exactly how you will be making your choices. In this condition it is wise to choose randomly, because any method your opponent can predict will do worse-than-average. Assuming that the enemy knows your source code has strange consequences: the action sequence 1, 2, 1, 3 is good when derived from a quantum noise generator, but bad when derived from any deterministic algorithm, even though it is the same action sequence. Still it is not a totally unrealistic situation. In real life, it is, in fact, a bad idea to play rock-paper-scissors using an algorithm your opponent can predict. So are we, in this situation, deriving optimization from noise?
The random rock-paper-scissors player does not play cleverly, racking up lots of points. It does not win more than 1/3 of the time, or lose less than 1/3 of the time (on average). The randomized player does better because its alternatives perform poorly, not from being smart itself. Moreover, by assumption, the opponent is an intelligence whom we cannot outsmart and who always knows everything about any method we use. There is no move we can make that does not have a possible countermove. By assumption, our own intelligence is entirely useless. The best we can do is to avoid all regularity that the enemy can exploit. In other words, we do best by minimizing the effectiveness of intelligence within the system-as-a-whole, because only the enemy's intelligence is allowed to be effective. If we can't be clever ourselves, we might as well think of ourselves as the environment and the enemy as the sole intelligence within that environment. By becoming the maximum-entropy environment for rock-paper-scissors, we render all intelligence useless, and do (by assumption) the best we can do.
When the environment is adversarial, smarter than you are, and informed about your methods, then in a theoretical sense it may be wise to have a quantum noise source handy. (In a practical sense you're just screwed.) Again, this is not because we're extracting useful work from a noise source; it's because the most powerful intelligence in the system is adversarial, and we're minimizing the power that intelligence can exert in the system-as-a-whole. We don't do better-than-average, we merely minimize the extent to which the adversarial intelligence produces an outcome that is worse-than-average (from our perspective).
Similarly, cryptographers have a legitimate interest in strong randomness generators because cryptographers are trying to minimize the effectiveness of an intelligent adversary. Certainly entropy can act as an antidote to intelligence.
Now back to the weighted majority algorithm. Blum (1996) remarks that, intuitively, the advantage of the randomized approach is that it "dilutes" the worst case: previously, the worst case was that slightly more than half of the total weight predicted incorrectly, so that the algorithm made a mistake and yet the total weight was reduced by only 1/4; with randomization, the probability that the algorithm makes a mistake is tied to the amount by which the total weight is reduced.
From the start, we did our analysis for an upper bound on the number of mistakes made. A global upper bound is no better than the worst individual case; thus, to set a global upper bound we must bound the worst individual case. In the worst case, our environment behaves like an adversarial superintelligence.
Indeed randomness can improve the worst-case scenario, if the worst-case environment is allowed to exploit "deterministic" moves but not "random" ones. It is like an environment that can decide to produce a red card whenever you bet on blue - unless you make the same bet using a "random" number generator instead of your creative intelligence.
Suppose we use a quantum measurement device to produce a string of random ones and zeroes; make two copies of this string; use one copy for the weighted majority algorithm's random number generator; and give another copy to an intelligent adversary who picks our samples. In other words, we let the weighted majority algorithm make exactly the same randomized predictions, produced by exactly the same generator, but we also let the "worst case" environment know what these randomized predictions will be. Then the improved upper bound of the randomized version is mathematically invalidated.
This shows that the improved upper bound proven for the randomized algorithm did not come from the randomized algorithm making systematically better predictions - doing superior cognitive work, being more intelligent - but because we arbitrarily declared that an intelligent adversary could read our mind in one case but not the other.
This is not just a minor quibble. It leads to the testable prediction that on real-world problems, where the environment is usually not an adversarial telepath, the unrandomized weighted-majority algorithm should do better than the randomized version. (Provided that some component experts outperform maximum entropy - are right more than 50% of the time on binary problems.)
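For what it's worth, here is a hedged sketch of how one might check that prediction on simulated data (toy experts that are independently correct with probability above 1/2; all parameters are illustrative, not drawn from any real problem):

```python
import random

def compare(n_experts=8, rounds=2_000, seed=0):
    """Toy comparison of the deterministic and randomized weighted-majority
    output rules on simulated experts, each independently correct with
    probability above 1/2.  The weight update is identical for both rules."""
    rng = random.Random(seed)
    accuracies = [0.55 + 0.03 * k for k in range(n_experts)]  # 0.55 .. 0.76

    weights = {"deterministic": [1.0] * n_experts, "randomized": [1.0] * n_experts}
    mistakes = {"deterministic": 0, "randomized": 0}

    for _ in range(rounds):
        outcome = rng.randint(0, 1)
        preds = [outcome if rng.random() < a else 1 - outcome for a in accuracies]

        for rule, w in weights.items():
            pos = sum(wi for wi, p in zip(w, preds) if p == 1)
            total = sum(w)
            if rule == "randomized":
                guess = 1 if rng.random() < pos / total else 0
            else:
                guess = 1 if pos >= total - pos else 0
            if guess != outcome:
                mistakes[rule] += 1

            # Same update for both rules: halve every wrong expert's weight,
            # then renormalize to avoid floating-point underflow (this does
            # not change either rule's future predictions).
            for i, p in enumerate(preds):
                if p != outcome:
                    w[i] *= 0.5
            total = sum(w)
            for i in range(n_experts):
                w[i] /= total

    return mistakes

print(compare())  # the deterministic rule typically ends up with fewer mistakes
```

On runs like this the deterministic rule typically makes fewer mistakes, as predicted; a real test would of course use real-world prediction tasks rather than synthetic experts.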
Analyzing the worst-case scenario is standard practice in computational learning theory, but it makes the math do strange things. Moreover, the assumption of the worst case can be subtle; it is not always explicitly labeled. Consider the upper bound for the unrandomized weighted-majority algorithm. I did not use the term "worst case" - I merely wrote down some inequalities. In fact, Blum (1996), when initially introducing this bound, does not at first use the phrase "worst case". The proof's conclusion says only that the number of mistakes made by the weighted majority algorithm is never more than 2.41*(m + log2(n)), where m is the number of mistakes made by the best expert so far.
Key word: Never.
The worst-case assumption for the unrandomized algorithm was implicit in calculating the right-hand side of the inequality by assuming that, on each mistake, the total weight went down by only a factor of 1/4, and that the total weight never decreased after any successful prediction. This is the absolute worst possible performance the weighted-majority algorithm can give.
The assumption implicit in those innocent-looking equations is that the environment carefully maximizes the anguish of the predictor: The environment so cleverly chooses its data samples, that on each case where the weighted-majority algorithm is allowed to be successful, it shall receive not a single iota of useful feedback - not a single expert shall be wrong. And the weighted-majority algorithm will be made to err on sensory cases that produce the minimum possible useful feedback, maliciously fine-tuned to the exact current weighting of experts. We assume that the environment can predict every aspect of the predictor and exploit it - unless the predictor uses a "random" number generator which we arbitrarily declare to be unknowable to the adversary.
What strange assumptions are buried in that innocent little inequality, (1/2)^m <= n * (3/4)^M!
Moreover, the entire argument in favor of the randomized algorithm was theoretically suspect from the beginning, because it rested on non-transitive inequalities. If I prove an upper bound on the errors of algorithm X, and then prove a smaller upper bound on the errors of algorithm Y, it does not prove that in the real world Y will perform better than X. For example, I prove that X cannot do worse than 1000 errors, and that Y cannot do worse than 100 errors. Is Y a better algorithm? Not necessarily. Tomorrow I could find an improved proof which shows that X cannot do worse than 90 errors. And then in real life, both X and Y could make exactly 8 errors.
But that doesn't mean that 4 > 9.
So the next time you see an author remark, "We can further improve this algorithm using randomization..." you may not know exactly where to find the oddity. If you'd run across the above-referenced example (or any number of others in the machine-learning literature), you might not have known how to deconstruct the randomized weighted-majority algorithm. If such a thing should happen to you, then I hope that I have given you grounds to suspect that an oddity exists somewhere, even if you cannot find it - just as you would suspect a machine that proposed to extract work from a heat bath, even if you couldn't keep track of all the gears and pipes.
Nominull put it very compactly when he said that, barring an environment which changes based on the form of your algorithm apart from its output, "By adding randomness to your algorithm, you spread its behaviors out over a particular distribution, and there must be at least one point in that distribution whose expected value is at least as high as the average expected value of the distribution."
As I remarked in Perpetual Motion Beliefs:
If you ever find yourself mathematically proving that you can do better by randomizing, I suggest that you suspect your calculations, or suspect your interpretation of your assumptions, before you celebrate your extraction of work from a heat bath.