cousin_it comments on Metaphilosophical Mysteries - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (255)
I'm not sure - did you miss the idea that all "universal priors" that we know how to construct today assign zero credence to certain hypotheses about the universe, or did you miss the idea that a zero-credence hypothesis will never rise above zero no matter how much data comes in, or is it me who's missing something?
Also, correct me if I'm wrong, but doesn't the Solomonoff prior bypass the issue of explicit hypotheses? That is, it puts a (non-zero) prior on every (prefix) bitstream of sensory data.
So, it doesn't even seem necessary to talk about what such an agent's "priors on hypotheses" are -- everything it believes is encoded as an expectation of sensory data, and nothing more. It does not explicitly represent concepts like, "this thing is a halting oracle".
Instead, when it encounters a halting oracle, it increases the weight it assigns to expectations of observations of things that are consistent with having been produced by a halting oracle, not the existence of a halting oracle as such.
No matter how uncomputable or lengthy-to-specify a function might be, you can always finitely specify your expectation weights on a finite observation prefix stream (i.e. the first n things you observe from the oracle).
So, I don't see how an agent with a Solomonoff prior chokes on an encounter with a halting oracle.
Normal agents won't. Genuinely intelligent agents won't.
I think those who are arguing that it will are imagining an agent with the Solomonoff prior totally wired into them in a manner that they can't possibly unlearn.
But still, even if you have the Occamian prior (which I think is what's meant by the Solomonoff prior), there is no need to unlearn it. You retain a prior on all hypotheses that decreases in weight exponentially with length, and it persists on top of any observations you've updated on. Those new observations, combined with the Occamian prior, give you the optimal weights on (prefix) sensory bitstreams, discounting the ruled-out ones and favoring those closer to what you've actually observed.
Even then, it keeps updating in favor of the observations that match what an oracle gives (without having to explicitly represent that they're from an oracle). No penalty from failure to unlearn.
The thing is, there is no one true razor. Different sources have different associated reference machines - some are more like Turing Machines, others are more like CA. If what you are looking at is barcodes, then short ones are pretty rare - and if you go into simulated worlds, sources can have practically any distribution you care to mention.
Yes, you can model these as "compiler overhead" constants - which represent the "cost" of simulating one reference machine in another - but that is just another way of saying you have to unlearn the Solomonoff prior and use another one - which is more appropriate for your source.
You can still do that, whatever your reference machine is - provided it is computationally universal - and doesn't have too much "faith".
I'm not sure exactly what can qualify as a prior.
Is "Anomalies may be clues about a need to make deep changes in other priors" a possible prior?
A prior is not a program that tells you what to do with the data. A prior is a set of hypotheses with a number assigned to each. When data comes in, we compute the likelihoods of the data given each hypothesis on the list, and use these numbers to obtain a posterior over the same hypotheses. There's no general way to have a "none of the above" (NOTA) hypothesis in your prior, because you can't compute the likelihood of the data given NOTA.
Another equivalent way to think about it: because of the marginalization step (dividing everything by the sum of all likelihoods), Bayesian updating doesn't use the total likelihood of the data given all current hypotheses - only the relative likelihoods of one hypothesis compared to another. This isn't easy to fix because "total likelihood" is a meaningless number that doesn't indicate anything - it could easily be 1000 in a setup with an incorrect prior or 0.001 in a setup with a correct prior.
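A minimal sketch in Python (my own toy numbers, not from the thread) of that marginalization point: the normalization step divides out any common scale factor, so only relative likelihoods ever matter - and a "none of the above" hypothesis has no likelihood function, so it cannot participate in the computation at all.

```python
# Toy Bayesian update: posterior_i is proportional to prior_i * likelihood_i.
# Scaling every likelihood by the same constant changes nothing, because
# the normalization (division by the total) cancels it out.

def posterior(prior, likelihoods):
    """Bayes update over an explicit, exhaustive list of hypotheses."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

prior = [0.5, 0.5]                 # two hypotheses, H1 and H2
liks = [0.8, 0.2]                  # P(data | H1), P(data | H2)
scaled = [1000 * l for l in liks]  # same relative likelihoods, absurd scale

print(posterior(prior, liks))      # → [0.8, 0.2]
print(posterior(prior, scaled))    # identical: the total likelihood cancels
```

There is no row in this computation where a NOTA hypothesis could go - it has no P(data | NOTA) to multiply by.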
People have beliefs about how various sorts of behavior will work out, though I think it's rare to have probabilities attached.
If you are assigning p(some empirical hypothesis) = 0, surely you are a broken system.
The example seems to be that using a Turing machine to generate your priors somehow results in an expectation of p(uncomputable universe)=0. That idea just seems like total nonsense to me. It just doesn't follow. For all I care, my priors could have been assigned to me using a Turing machine model at birth - but I don't think p(uncomputable universe)=0. The whole line of reasoning apparently makes no sense.
The universal prior enumerates all Turing machines, not all possible priors generated by all Turing machines.
Priors are probability estimates for uncertain quantities.
In Solomonoff induction they are probability estimates for bitstrings - which one can think of as representing possible sensory inputs for an agent.
With a standard TM_length-based encoding, no finite bitstring is assigned a zero probability - and we won't have to worry about perceiving infinite bitstrings until after the universal heat death - so the supposed problem of certain bitstrings getting a zero prior probability never arises.
Whether the bitstrings were created using uncomputable physics is neither here nor there. They are still just bitstrings - and so can be output by a TM with a finite program on its tape.
No, sorry. You're confused. A prior is not an assignment of credences to all bitstrings that you can observe. A prior is an assignment of credences to hypotheses, i.e. possible states of the world that generate bitstrings that you observe. Otherwise you'd find yourself in this text (see part II, "Escaping the Greek Hinterland").
No. We were talking about the universal prior. Here is how that is defined for sequences:
"The universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p."
The universal prior of a sequence is the probability of that particular sequence arising (as a prefix). It is not the probability of any particular hypothesis or program. Rather, it is a weighted sum of the probabilities of all the programs that generate that sequence.
You can talk about the probabilities of hypotheses and programs as well if you like - but the universal prior of a sequence is perfectly acceptable subject matter - and is not a "confused" idea.
No finite sequence has a probability of zero - according to the universal prior.
All finite bitstrings can be produced by computable means - even if they were generated as the output of an uncomputable physical process.
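To make the quoted definition concrete, here is a deliberately tiny Python sketch - my own illustration, not from the comment. The "machine" is an invented toy (program q just outputs q repeated forever), so it is nothing like a real universal machine, but the weighted sum over programs has the same shape: 2^-|q| summed over every program whose output starts with the given prefix.

```python
from itertools import product

def output_prefix(program, n):
    """Toy machine: the program string, repeated forever; first n bits."""
    return (program * (n // len(program) + 1))[:n]

def toy_prior(p, max_len=12):
    """Sum 2^-|q| over all programs q (up to max_len bits) whose output starts with p."""
    total = 0.0
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            q = "".join(bits)
            if output_prefix(q, len(p)) == p:
                total += 2.0 ** -length
    return total

print(toy_prior("0101"))  # 0.8125: dominated by the short program "01"
print(toy_prior("0111"))  # 0.5625: the shortest contributing program is "0111" itself
```

Note that every finite prefix gets nonzero weight here, because the prefix is always a program for itself - the toy analogue of the point above that no finite sequence has universal-prior probability zero.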
Is this misconception really where this whole idea arises from?
This is all true, but... Why do you think the universal prior talks about computer programs at all? If I only wanted a prior over all finite bitstrings, I'd use a simpler prior that assigned every string of length N a credence proportional to 2^-N. Except that prior has a rather major shortcoming: it doesn't help you predict the future! No matter how many bits you feed it, it always says the next bit is going to be either 0 or 1 with probability 50%. It will never get "swamped" by the data, never gravitate to any conclusions. This is why we want the universal prior to be based on computer programs instead: it will work better in practice, if the universe is in fact computable. But what happens if the universe is uncomputable? That's the substantive question here.
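The contrast in that paragraph can be sketched in Python. This uses an invented toy "machine" (program q outputs q repeated forever - my assumption, not a real universal machine), but it shows the qualitative difference: a program-weighted prior gets swamped by patterned data, while the uniform 2^-N prior over strings predicts 50/50 forever.

```python
from itertools import product

def toy_prior(p, max_len=14):
    """Sum of 2^-|q| over toy programs q whose repeated output starts with p."""
    total = 0.0
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            q = "".join(bits)
            if (q * (len(p) // length + 1))[:len(p)] == p:
                total += 2.0 ** -length
    return total

def predict_zero(p):
    """P(next bit = 0 | p) under the toy program-based prior."""
    m0, m1 = toy_prior(p + "0"), toy_prior(p + "1")
    return m0 / (m0 + m1)

data = "01" * 5               # ten bits of an obvious pattern
print(predict_zero(data))     # ≈ 0.994: the short program "01" dominates
# Under the uniform prior, every length-(N+1) string gets 2^-(N+1), so
# P(next = 0 | p) = 2^-(N+1) / (2^-(N+1) + 2^-(N+1)) = 0.5, no matter what p is.
```

The uniform prior never concentrates because each observed bit rules out exactly half of the remaining strings on each side; the program-based prior concentrates because short programs explain long patterned prefixes.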
ETA: the last two sentences are wrong, disregard them.
Nothing much happens to intelligent agents - because an intelligent agent's original priors mostly get left behind shortly after it is born - and get replaced by evidence-based probability estimates of events happening. If convincing evidence comes in that the world is uncomputable, that just adds to the enormous existing stack of evidence it has about the actual frequencies of things.
Anyhow, priors being set to 0 or 1 is not a problem for observable sense data. No finite sense data has p assigned to 0 or 1 under the universal prior - so an agent can always update successfully - if it gets sufficient evidence that a sequence was actually produced. So, if it sees a system that apparently solves the halting problem for arbitrary programs, that is no big deal for it. It may have found a Turing oracle! Cool!
I suppose it might be possible to build a semi-intelligent agent with a particular set of priors permanently wired into it - so the agent was incapable of learning and adapting if its environment changed. Organic intelligent agents are not much like that - and I am not sure how easy it would be to build such a thing. Such agents would be incapable of adapting to an uncomputable world. They would always make bad guesses about uncomputable events. However, this seems speculative - I don't see why people would try to create such agents. They would do very badly in certain simulated worlds - where Occam's razor doesn't necessarily hold true - and it would be debatable whether their intelligence was really very "general".
The reason the universal prior is called "universal" is that, given initial segments of infinite strings drawn from any computable distribution, and updating on those segments, it will in fact converge to the actual distribution on what the next bit should be. Now I'll admit to not actually knowing the math here, but it seems to me that if most any prior had that property, as you seem to imply, we wouldn't need to talk about a universal prior in the first place, no?
Also, if we interpret "universe" as "the actual infinite string that these segments are initial segments of", then, well... take a look at that sum you posted and decompose it. The universal prior is basically assigning a probability to each infinite string, namely the sum of the probabilities of programs that generate it, and then collapsing that down to a distribution on initial segments in the obvious way. So if we want to consider its hypotheses about the actual law of the universe, the whole string, it will always assign 0 probability to an uncomputable sequence.
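Spelling out that decomposition (my notation, following the quoted definition - U is the reference machine, |q| a program's length), the prior mass on a whole infinite string x is

```latex
M(x) \;=\; \sum_{q \,:\, U(q) = x} 2^{-|q|}
```

If x is uncomputable, no program q outputs it, so the sum is empty and M(x) = 0.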
Prior determines how evidence informs your estimates, what things you can consider. In order to "replace priors with evidence-based probability estimates of events", you need a notion of event, and that is determined by your prior.