cousin_it comments on Metaphilosophical Mysteries - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
The universal prior enumerates all Turing machines, not all possible priors generated by all Turing machines.
Priors are probability estimates for uncertain quantities.
In Solomonoff induction they are probability estimates for bitstrings - which one can think of as representing possible sensory inputs for an agent.
With a standard TM_length-based encoding, no finite bitstring is assigned a zero probability - and we won't have to worry about perceiving infinite bitstrings until after the universal heat death - so the problem of certain bitstrings being assigned a zero prior probability never arises.
Whether the bitstrings were created using uncomputable physics is neither here nor there. They are still just bitstrings - and so can be output by a TM with a finite program on its tape.
No, sorry. You're confused. A prior is not an assignment of credences to all bitstrings that you can observe. A prior is an assignment of credences to hypotheses, i.e. possible states of the world that generate bitstrings that you observe. Otherwise you'd find yourself in this text (see part II, "Escaping the Greek Hinterland").
No. We were talking about the universal prior. Here is how that is defined for sequences:
"The universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p."
The universal prior of a sequence is the probability of that particular sequence arising (as a prefix). It is not the probability of any particular hypothesis or program. Rather, it is a weighted sum of the probabilities of all the programs that generate that sequence.
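Spelled out, the quoted definition can be written as follows (one standard way of presenting it, assuming a prefix universal machine U and writing |p| for the length of program p):

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x\ast} 2^{-|p|}
```

where the sum ranges over all programs p whose output begins with the prefix x.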
You can talk about the probabilities of hypotheses and programs as well if you like - but the universal prior of a sequence is perfectly acceptable subject matter - and is not a "confused" idea.
No finite sequence has a probability of zero - according to the universal prior.
All finite bitstrings can be produced by computable means - even if they were generated as the output of an uncomputable physical process.
Is this misconception really where this whole idea arises from?
This is all true, but... Why do you think the universal prior talks about computer programs at all? If I only wanted a prior over all finite bitstrings, I'd use a simpler prior that assigned every string of length N a credence proportional to 2^-N. Except that prior has a rather major shortcoming: it doesn't help you predict the future! No matter how many bits you feed it, it always says the next bit is going to be either 0 or 1 with probability 50%. It will never get "swamped" by the data, never gravitate to any conclusions. This is why we want the universal prior to be based on computer programs instead: it will work better in practice, if the universe is in fact computable. But what happens if the universe is uncomputable? That's the substantive question here.
ETA: the last two sentences are wrong, disregard them.
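The earlier point about the simple length-based prior never learning can be checked directly. A minimal sketch (the function names here are illustrative, not from any library):

```python
def uniform_measure(bits):
    # The uniform "coin-flip" measure: every string of length n
    # gets measure 2^-n, regardless of its contents.
    return 2.0 ** -len(bits)

def predict_next(bits):
    # Conditional probability that the next bit is 1, given the prefix
    # observed so far: P(bits + "1") / P(bits).
    return uniform_measure(bits + "1") / uniform_measure(bits)

# No matter how regular the observed data looks, the prediction
# never moves off 0.5 - this prior cannot learn from evidence.
print(predict_next("0" * 100))    # 0.5
print(predict_next("0101" * 25))  # 0.5
```

This is exactly the sense in which such a prior "never gravitates to any conclusions": the conditional probability of the next bit is 0.5 for every prefix.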
Nothing much happens to intelligent agents - because an intelligent agent's original priors mostly get left behind shortly after it is born - and get replaced by evidence-based probability estimates of events happening. If convincing evidence comes in that the world is uncomputable, that just adds to the enormous existing stack of evidence it has about the actual frequencies of things.
Anyhow, priors being set to 0 or 1 is not a problem for observable sense data. No finite sense datum is assigned probability 0 or 1 under the universal prior - so an agent can always update successfully if it gets sufficient evidence that a sequence was actually produced. So, if it sees a system that apparently solves the halting problem for arbitrary programs, that is no big deal for it. It may have found a Turing oracle! Cool!
I suppose it might be possible to build a semi-intelligent agent with a particular set of priors permanently wired into it - so the agent was incapable of learning and adapting if its environment changed. Organic intelligent agents are not much like that - and I am not sure how easy it would be to build such a thing. Such agents would be incapable of adapting to an uncomputable world. They would always make bad guesses about uncomputable events. However, this seems speculative - I don't see why people would try to create such agents. They would do very badly in certain simulated worlds - where Occam's razor doesn't necessarily hold true - and it would be debatable whether their intelligence was really very "general".
The reason the universal prior is called "universal" is that, given initial segments of infinite strings drawn from any computable distribution, and updating on those samples, it will in fact converge to the actual distribution on what the next bit should be. Now I'll admit to not actually knowing the math here, but it seems to me that if almost any prior had that property, as you seem to imply, we wouldn't need to talk about a universal prior in the first place, no?
Also, if we interpret "universe" as "the actual infinite string that these segments are initial segments of", then, well... take a look at that sum you posted and decompose it. The universal prior is basically assigning a probability to each infinite string, namely the sum of the probabilities of programs that generate it, and then collapsing that down to a distribution on initial segments in the obvious way. So if we want to consider its hypotheses about the actual law of the universe, the whole string, it will always assign 0 probability to an uncomputable sequence.
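Schematically, the decomposition described above looks like this (ignoring programs that halt after finite output, which is what makes M a semimeasure rather than a measure; μ(ω) here denotes the weight the universal prior puts on the infinite string ω):

```latex
M(x) \;=\; \sum_{\omega \,:\, x \sqsubset \omega} \mu(\omega),
\qquad
\mu(\omega) \;=\; \sum_{p \,:\, U(p) = \omega} 2^{-|p|}
```

Under this reading, μ(ω) = 0 for every uncomputable ω, since no program generates it.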
Convergence is more the result of the updates than the original prior. All the initial prior has to do to result in convergence is avoid being completely ridiculous (probabilities of 1, 0, infinitesimals, etc). The idea of a good prior is that it helps initially, before an agent has any relevant experience to go on. However, that doesn't usually last for very long - real organic agents are pretty quickly flooded with information about the state of the universe, and are then typically in a much better position to make probability estimates. You could build agents that were very confident in their priors - and updated them slowly - but only rarely would you want an agent that was handicapped in its ability to adapt and learn.
Picking the best reference machine would be nice - but I think most people understand that for most practical applications, it doesn't matter - and that even a TM will do.
Are you certain of this? Could you provide some sort of proof or reference, please, ideally together with some formalization of what you mean by "completely ridiculous"? I'll admit to not having looked up a proof of convergence for the universal prior or worked it out myself, but if what you say were really the case, there wouldn't actually be very much special about the universal prior, and this convergence property of it wouldn't be worth pointing out - so I think I have good reason to be highly skeptical of what you suggest.
Better, yes. But good enough? Arbitrarily close?
Sorry, but what does this even mean? I don't understand how this notion of "update speed" translates into the Bayesian setting.
Here's Shane Legg on the topic of how little priors matter when predicting the environment:
"In some situations, for example with Solomonoff induction, the choice of the reference machine doesn’t matter too much. [...] the choice of reference machine really doesn’t matter except for very small data sets (which aren’t really the ones we’re interested in here). To see this, have a look at the Solomonoff convergence bound and drop a compiler constant in by the complexity of the environment. The end result is that the Solomonoff predictor needs to see just a few more bytes of the data sequence before it converges to essentially optimal predictions."
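The convergence bound Legg refers to can be stated as follows (one standard form; μ is the true computable environment, K(μ) its Kolmogorov complexity relative to the reference machine, and M the universal predictor - notation assumed, not quoted from Legg):

```latex
\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\!\left[
  \bigl( M(x_t{=}1 \mid x_{<t}) - \mu(x_t{=}1 \mid x_{<t}) \bigr)^{2}
\right]
\;\le\; \frac{\ln 2}{2}\, K(\mu)
```

Switching to a different reference machine changes K(μ) by at most an additive compiler constant, which is why only "a few more bytes" of data are needed before predictions are essentially optimal again.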
Re: "I don't understand how this notion of "update speed" translates into the Bayesian setting."
Say you think p(heads) is 0.5. If you see ten heads in a row, do you update p(heads) a lot, or a little? It depends on how confident you are of your estimate.
If you had previously seen a thousand coin flips from the same coin, you might be confident of p(heads) being 0.5 - and therefore update little. If you were told that it was a biased coin from a magician, then your estimate of p(heads) being 0.5 might be due to not knowing which way it was biased. Then you might update your estimate of p(heads) rapidly - on seeing several heads in a row.
Like that.
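The coin example above can be made concrete with a Beta prior, where the prior's pseudo-counts encode how confident you are in the 0.5 estimate (the specific numbers are illustrative):

```python
def posterior_mean(a, b, heads, tails):
    # Posterior mean of p(heads) under a Beta(a, b) prior after
    # observing the given counts of heads and tails.
    return (a + heads) / (a + b + heads + tails)

# Confident prior: equivalent to having already seen ~1000 fair flips.
# Ten heads in a row barely move the estimate (~0.505).
confident = posterior_mean(500, 500, 10, 0)

# Uncertain prior: a magician's coin, biased in an unknown direction.
# The same ten heads move the estimate a lot (~0.917).
uncertain = posterior_mean(1, 1, 10, 0)

print(confident, uncertain)
```

Both agents started at p(heads) = 0.5; the difference in "update speed" is entirely in how much weight the prior carries relative to the new data.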
Re: "there wouldn't actually be be very much special about the universal prior"
Well, Occam's razor is something rather special. However, agents don't need an optimal version of it built into them as a baby - they can figure it out from their sensory inputs.
Prior determines how evidence informs your estimates, what things you can consider. In order to "replace priors with evidence-based probability estimates of events", you need a notion of event, and that is determined by your prior.
Prior evaluates, but it doesn't dictate what is being evaluated. In this case, "events happening" refers to subjective anticipation, which in turn refers to prior, but this connection is far from being straightforward.
"Determined" in the sense of "weakly influenced". The more actual data you get, the weaker the influence of the original prior becomes - and after looking at the world for a little while, your original priors become insignificant - swamped under a huge mountain of sensory data about the actual observed universe.
Priors don't really affect what things you can consider - since you can consider (and assign non-zero probability to) receiving any sensory input sequence.
I use the word "prior" in the sense of priors as mathematical objects, meaning all of your starting information plus the way you learn from experience.
I can't quite place "you need a notion of event, and that is determined by your prior", but I guess the mapping between sample space and possible observations is what you meant.
Well yes, you can have "priors" that you have learned from experience. An uncomputable world is not a problem in that case either - since you can learn about uncomputable physics, in just the same way that you learn about everything else.
This whole discussion seems to be a case of people making a problem out of nothing.