Sniffnoy comments on Metaphilosophical Mysteries - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
The universal prior is called "universal" because, given initial segments of infinite strings drawn from any computable distribution, updating on those samples makes it converge to the actual distribution on what the next bit should be. Now I'll admit to not actually knowing the math here, but it seems to me that if most any prior had that property, as you seem to imply, we wouldn't need to talk about a universal prior in the first place, no?
Also, if we interpret "universe" as "the actual infinite string that these segments are initial segments of", then, well... take a look at that sum you posted and decompose it. The universal prior is basically assigning a probability to each infinite string, namely the sum of the probabilities of programs that generate it, and then collapsing that down to a distribution on initial segments in the obvious way. So if we want to consider its hypotheses about the actual law of the universe, the whole string, it will always assign 0 probability to an uncomputable sequence.
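For readers without the parent post in front of them, the sum being decomposed is presumably something like the usual definition of the universal prior on finite strings. This is a sketch of the standard form, not a quote from the post; the symbols $U$ (a universal monotone machine), $p$ (a program), $|p|$ (its length in bits) and $x*$ (any output beginning with $x$) are notation assumed here.

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|}
```

The sum ranges over minimal programs $p$ whose output begins with the finite string $x$, so each program contributes weight $2^{-|p|}$ to every initial segment of the string it generates.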
Convergence is more the result of the updates than the original prior. All the initial prior has to be to result in convergence is not completely ridiculous (1, 0, infinitesimals, etc). The idea of a good prior is that it helps initially, before an agent has any relevant experience to go on. However, that doesn't usually last for very long - real organic agents are pretty quickly flooded with information about the state of the universe, and are then typically in a much better position to make probability estimates. You could build agents that were very confident in their priors - and updated them slowly - but only rarely would you want an agent that was handicapped in its ability to adapt and learn.
Picking the best reference machine would be nice - but I think most people understand that for most practical applications, it doesn't matter - and that even a TM will do.
Are you certain of this? Could you provide some sort of proof or reference, please, ideally together with some formalization of what you mean by "completely ridiculous"? I'll admit to not having looked up a proof of convergence for the universal prior or worked it out myself, but if what you say were really the case, there wouldn't actually be very much special about the universal prior, and this convergence property of it wouldn't be worth pointing out - so I think I have good reason to be highly skeptical of what you suggest.
Better, yes. But good enough? Arbitrarily close?
Sorry, but what does this even mean? I don't understand how this notion of "update speed" translates into the Bayesian setting.
Here's Shane Legg on the topic of how little priors matter when predicting the environment:
"In some situations, for example with Solomonoff induction, the choice of the reference machine doesn’t matter too much. [...] the choice of reference machine really doesn’t matter except for very small data sets (which aren’t really the ones we’re interested in here). To see this, have a look at the Solomonoff convergence bound and drop a compiler constant in by the complexity of the environment. The end result is that the Solomonoff predictor needs to see just a few more bytes of the data sequence before it converges to essentially optimal predictions."
This doesn't address what I said at all. We don't speak of "the" universal prior because there's a specific UTM it's defined with respect to, we speak of "the" universal prior because we don't much care about the distinction between different universal priors! The above article is still about doing Bayesian updating starting with a universal prior. The fact that it doesn't much matter which particular universal prior you start from is not new information and in no way supports your claim that any "reasonable" prior - whatever that might mean - will also have this same property.
I think when he says "the choice of the reference machine doesn’t matter too much" and "the choice of reference machine really doesn’t matter except for very small data sets" he literally means those things. I agree that my position on this is not new.
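For reference, the convergence bound Legg alludes to is usually stated along the following lines. This is the standard textbook form rather than anything quoted in the thread, and the notation is assumed here: $\mu$ is the true computable distribution, $M_U$ is the universal prior built from reference machine $U$, $K_U(\mu)$ is the length of the shortest program for $\mu$ on $U$, and $c_{U,U'}$ is the compiler constant between two machines.

```latex
\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\!\left[\bigl(M_U(0 \mid x_{<t}) - \mu(0 \mid x_{<t})\bigr)^{2}\right]
\;\le\; \frac{\ln 2}{2}\, K_U(\mu),
\qquad
K_{U'}(\mu) \;\le\; K_U(\mu) + c_{U,U'}
```

Read this way, swapping one universal reference machine for another adds at most $\tfrac{\ln 2}{2}\, c_{U,U'}$ to the bound on total expected squared prediction error. Note that the statement is about priors of the form $M_U$; by itself it says nothing about priors that are not universal.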
Sorry, how does "literally" differ from what I stated? And you seem to be stating something very different from him. He is just stating that the UTM used to define the universal prior is irrelevant. You are claiming that any "reasonable" prior, for some unspecified but expansive-sounding notion of "reasonable", has the same universal property as a universal prior.
That seems like quite a tangle, and alas, I am not terribly interested in it. But:
The term was "reference machine". No implication that it is a UTM is intended - it could be a CA - or any other universal computer. The reference machine totally defines all aspects of the prior. There are not really "universal reference machines" which are different from other "reference machines" - or if there are, "universal" just refers to universal computation. A universal machine can define literally any distribution of priors you can possibly imagine. So: the distinction you are trying to make doesn't seem to make much sense.
Convergence on accurate beliefs has precious little to do with the prior - it is a property of the updating scheme. The original priors matter little after a short while - provided they are not zero, one - or otherwise set so they prevent updating from working at all.
Thinking of belief convergence as having much to do with your priors is a wrong thought.
Sorry, what? Of course it can be any sort of universal computer; why would we care whether it's a Turing machine or some other sort? Your statement that taking a universal computer and generating the corresponding universal prior will get you "literally any distribution of priors you can imagine" is just false, especially as it will only get you uncomputable ones! Generating a universal prior will only get you universal priors. Perhaps you were thinking of some other way of generating a prior from a universal computer? Because that isn't what's being talked about.
You have still done nothing to demonstrate this. The potential for dependence on priors has been demonstrated elsewhere (anti-inductive priors, etc). The "updating scheme" is Bayes' Rule. (This might not suffice in the continuous-time case, but you explicitly invoked the discrete-time case above!) But to determine all those probabilities, you need to look at the prior. Seriously, show me (or just point me to) some math. If you refuse to say what makes a prior "reasonable", what are you actually claiming? That the set of priors with this property is large in some appropriate sense? Please name what sense. Why should we not just use some equivalent of maxent, if what you say is true?
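A toy illustration of what is at stake in this exchange, as a minimal sketch under stated assumptions rather than anything either commenter wrote: three hypothetical coin hypotheses, a fixed data sequence with 80% heads, and two made-up priors. The names `bayes_update`, `run`, `open_prior` and `closed_prior` are invented for the example. A prior that gives the true hypothesis even tiny mass ends up concentrated on it; a prior that gives it zero mass never can, however much data arrives.

```python
def bayes_update(prior, likelihoods):
    """One application of Bayes' rule over a finite hypothesis class."""
    unnormalised = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def run(prior, biases, flips):
    """Update on each flip in turn (1 = heads, 0 = tails)."""
    posterior = list(prior)
    for flip in flips:
        likelihoods = [b if flip == 1 else 1 - b for b in biases]
        posterior = bayes_update(posterior, likelihoods)
    return posterior


# Hypotheses: p(heads) is 0.2, 0.5 or 0.8 (illustrative values).
biases = [0.2, 0.5, 0.8]
# Deterministic stand-in for data from the 0.8 coin: 200 flips, 80% heads.
flips = [1, 1, 0, 1, 1] * 40

open_prior = [0.98, 0.01, 0.01]   # badly wrong, but nothing ruled out
closed_prior = [0.5, 0.5, 0.0]    # true hypothesis given zero mass

open_post = run(open_prior, biases, flips)
closed_post = run(closed_prior, biases, flips)

print("open prior   ->", [round(p, 4) for p in open_post])
print("closed prior ->", [round(p, 4) for p in closed_post])
```

The sketch only covers a finite hypothesis class with one point of it true, where "not zero" is enough. Whether anything comparable holds for a large class of priors over all computable sequences is exactly the question being asked above, and the example does not settle it.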
"Of course it can be any sort of universal computer; why would we care whether it's a Turing machine or some other sort?"
Well, different reference machines produce different prior distributions - so the distribution used matters initially, when the machine is new to the world.
"Your statement that taking a universal computer and generating the corresponding universal prior will get you "literally any distribution of priors you can imagine" is just false, especially as it will only get you uncomputable ones! "
"Any distribution you can compute", then - if you prefer to think that you can imagine the uncomputable.
"You have still done nothing to demonstrate this."
Actually, I think I give up trying to explain. From my perspective you seem to have some kind of tangle around the word "universal". "Universal" could usefully refer to "universal computation" or to a prior that covers "every hypothesis in the universe". There is also the "universal prior" - but I don't think "universal" there has quite the same significance that you seem to think it does. There seems to be repeated miscommunication going on in this area.
It seems non-trivial to describe the class of priors that leads to "fairly rapid" belief convergence in an intelligent machine. Suffice to say, I think that class is large - and that the details of priors are relatively insignificant - provided there is not too much "faith" - or "near faith". Part of the reason for that is that priors usually get rapidly overwritten by data. That data establishes its own subsequent prior distributions for all the sources you encounter - and for most of the ones that you don't. If you don't agree, fine - I won't bang on about it further in an attempt to convince you.
Re: "I don't understand how this notion of "update speed" translates into the Bayesian setting."
Say you think p(heads) is 0.5. If you see ten heads in a row, do you update p(heads) a lot, or a little? It depends on how confident you are of your estimate.
If you had previously seen a thousand coin flips from the same coin, you might be confident of p(heads) being 0.5 - and therefore update little. If you were told that it was a biased coin from a magician, then your estimate of p(heads) being 0.5 might be due to not knowing which way it was biased. Then you might update your estimate of p(heads) rapidly - on seeing several heads in a row.
Like that.
What you have just laid out are not different "update speeds" but different priors. "It's a biased coin from a magician" is of the same class of prior assumptions as "It's probably a fair coin" or "It's a coin with some fixed probability of landing heads, but I have no idea what" or "It's a rigged coin that can only come up heads 10 times once activated".
After each toss, you do precisely one Bayesian update. Perhaps the notion of "update speed" might make sense in a more continuous setting, but in a discrete setting like this it is clear it does not. The amount you update is determined by Bayes' Law; different apparent "update speeds" are due to differing priors. "Speed" probably isn't even a good term, as updates aren't even necessarily in the same direction! If you think the coin can only come up heads 10 times, each appearance of heads makes it less likely to come up again.
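To make the coin example concrete, here is a minimal sketch that models both agents with Beta priors over p(heads). The Beta family and the specific parameters are assumptions chosen for illustration: Beta(500, 500) stands in for "has already seen a thousand flips", and the U-shaped Beta(1/10, 1/10) stands in for "biased, but no idea which way". Both start at p(heads) = 0.5 and both apply the same Bayes update after every toss; only the prior differs.

```python
from fractions import Fraction


def posterior_mean_heads(alpha, beta, heads, tails):
    """Posterior mean of p(heads) under a Beta(alpha, beta) prior
    after observing the given counts of heads and tails."""
    return Fraction(alpha + heads, alpha + beta + heads + tails)


def trajectory(alpha, beta, n_heads):
    """Estimates of p(heads) after 0, 1, ..., n_heads heads in a row."""
    return [posterior_mean_heads(alpha, beta, k, 0) for k in range(n_heads + 1)]


# "Seen a thousand flips already": mass tightly concentrated near 0.5.
confident = trajectory(500, 500, 10)
# "Magician's biased coin": mass piled up near 0 and 1, mean still 0.5.
uncertain = trajectory(Fraction(1, 10), Fraction(1, 10), 10)

for k in (0, 1, 5, 10):
    print(k, float(confident[k]), float(uncertain[k]))
```

The two trajectories move at very different apparent "speeds", yet the update rule has no speed setting anywhere; the difference comes entirely from the two priors.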
"Update speed" seems fine to me - when comparing:
.5, .500001, .500002, .500003, .500004...
....with...
.5, 0.7, 0.9, 0.94, 0.96
...but use whatever term you like.
That's a statistic, not a parameter - and it's a statistic ultimately determined by the prior.
I do not know where the idea that "speeds" are "parameters" and not "statistics" comes from. An entity being a statistic doesn't imply that it is not a speed.
The same goes for discrete systems. They have the concept of speed too:
http://en.wikipedia.org/wiki/Glider_%28Conway%27s_Life%29
This is utterly irrelevant. The problem with what you say is not that there's no notion of speed, it's that there is precisely one way of doing updates, and it has no "speed" parameter.
In the game of life, the update speed is always once per generation. However, that doesn't mean it has no concept of speed. In fact the system exhibits gliders with many different speeds.
It's much the same with an intelligent agent's update speed in response to evidence - some will update faster than others - depending on what they already know.
You claimed that:
"Perhaps the notion of "update speed" might make sense in a more continuous setting, but in a discrete setting like this it is clear it does not."
However, the concept of "speed" works equally well in discrete and continuous systems - as the GOL illustrates. "Discreteness" is an irrelevance.
Hm; there may not be a disagreement here. You seemed to be using it in a way that implied it was not determined by (or even was independent of) the prior. Was I mistaken there?
The idea was that some agents update faster than others (or indeed not at all).
If you like you can think of the agents that update relatively slowly as being confident that they are uncertain about the things they are unsure about. That confidence in their own uncertainty could indeed be represented by other priors.
Re: "there wouldn't actually be very much special about the universal prior"
Well, Occam's razor is something rather special. However, agents don't need an optimal version of it built into them as a baby - they can figure it out from their sensory inputs.