It is misleading to attribute that book solely to Voevodsky.
This was surprisingly clarifying for me.
I'm still not sure whether I find it too concerning. We assign higher probability to simpler theories so long as they follow the constraint that Q(n) below 10^100 is 90%. The theory which randomly assigns Q(n) to true/false for 10^99 simple values of n (and then gets forced to assign Q as true for most of the remaining n below 10^100) is just one form of theory which may be generated by the process, and not a really simple one. The theory that Q(n) represents "not divisible by 10" is going to be much more probable than all these theories.
In other words, the write-up estimates Q(n) based on a narrow subset of the theories which assign probability mass. I don't really think it's representative...
Imagine that we encounter a truly iid random sequence of 90% likely propositions Q(0),Q(1),Q(2),... Perhaps they are merely pseudorandom but impossibly complicated to reason about, or perhaps they represent some random external output that an agent observes. After observing a very large number of these Q(i), one might expect to place high probability on something like "About 90% of the next 10^100 Q(j) I haven't observed yet will be true," but there is unlikely to be any simple rule that describes the already observed Q(i). Do you think that the next 10^100 Q(j) will all individually be believed 90% likely to be true, or will the simpler to describe Q(j) receive closer to 50% probability?
After writing this I realize that there is a much simpler prior on finite sets S of consistent statements: simply have a prior over all sets of statements, and keep only the consistent ones. If your language is chosen such that it contains X if and only if it also contains ¬X, then this is equivalent to choosing a truth value for each basic statement, and a uniform prior over these valuations would work fine.
The key here is that you are using finite S. What do you do if S is infinite? More concretely, is your schema convergent if you grow your finite S by adding more and more statements? I believe we touch on such worries in the writeup.
I am not convinced by the problematic example in the "Scientific Induction in Probabilistic Mathematics" writeup. Let's say that there are n atoms ϕ(1)..ϕ(n). If you don't condition, then because of symmetry, all consistent sets S drawn from the process have equal probability. So the prior on S is uniform and the probability of ϕ(i) is therefore 1/2, by
P(ϕ(i)) = ∑{S} 1[ϕ(i)∈S] * P(S)
n a consistent set S drawn from the process is exactly 1/2 for all i, this must be true by symmetry because μ(x)=μ(¬x). Now what you should do to condition on some statement X is simply throw out the sets S which don't satisfy that statement, i.e.
P(ϕ(i)|X) = ∑{S} 1[ϕ(i)∈S] * P(S) * 1[X(S)] / ∑{S} P(S) * 1[X(S)]
Since the prior on S was uniform, it will still be uniform on the restricted set after conditioning. So
P(ϕ(i)|X) = ∑{S} 1[ϕ(i)∈S] * 1[X(S)] / ∑{S} 1[X(S)]
Which should just be 90% in the example where X is "90% of the ϕ are true"
The mistake in the writeup is to directly define P(S|X) in an inconsistent way.
To avoid drowning in notation, let's consider a simpler example with the variables a, b and c. We will first pick a or ¬a uniformly, then b or ¬b, and finally c or ¬c. Then we try to condition on X="exactly one of a,b,c is true". You obviously get prior probabilities P(S) = 1/8 for all consistent sets.
If you condition the right way, you get P(S) = 1/3 for the sets with one true attom, and P(S)=0 for the other sets. So then
P(a|X) = P(a|{a,¬b,¬c})P({a,¬b,¬c}|X) + P(a|{¬a,b,¬c})P({¬a,b,¬c}|X) + P(a|{¬a,¬b,c})P({¬a,¬b,c}|X)
= 1/3
What the writeup does instead is first pick a or ¬a uniformly. If it picks a, we know that b and c are false. If we pick ¬a we continue. The uniform choice of a is akin to saying that
P({a,b,c}|X) = P(a) * P({b,c}|a,X).
But that first term should be P(a|X), not P(a)!
Sorry if there is something fishy in the writeup :(. I could believe it, given how rushed I was writing it.
Suppose we consider not just a,~a,b,~b, and c,~c, but also statements q="exactly one of a,b,c is true" and ~q. Suppose now that we uniformly pick a truth value for a, then for q, then a logically consistent but otherwise random value for b, and finally a logically consistent but otherwise random value for c. Such an asymmetric situation could occur if b and c have high mu but a and q have small mu. In the worlds where we believe q, b and c are much more often disbelieved than a. I believe that basically captures the worries about Demski's scheme that Paul was having; maybe he will comment himself.
Does that clarify anything?
I confess I'm ignorant about what report you mean - could you throw me a link?
Hm, x being simple or complicated...
For example, if our predicate P(x) is true just for prime numbers, then "the trillionth prime number" is a simple description - no prime number can have a greater complexity than its description qua prime number, and so if we assign high probability to simple descriptions like primeness, we'll expect P to be true for simple numbers.
I would argue that this is proper behavior. It is not that our robot cares about the complexity of the number. It just tries to find simple patterns in the data, and naturally some numbers will fit into more simple patterns than others, once we're given information about the density of P(x) being true.
An extreme example would be if P(x) were only true (or, equivalently, false) for a single number. Then our probability about P(x) really would be about complexity of the number. Why is this proper behavior? Because information that justifies pattern-matching behavior is information about the complexity of P(x)! If P(x) is a single number and we know that its description is less than 100 bits, that knowledge is what allows us to exhibit a preference for simpler patterns, and thus simpler numbers.
Fun fact: if we know that P(x) is true for 50% of numbers, the complexity of x has essentially no impact, even though we still favor simple patterns.
Hi Manfred, the link is in the theme 1and section of the original post. Let me know if it clarifies our worries about simplicity priors.
Well, I wrote down some stream of consciousness stuff :P
The statement "P(x) is true for 90% of x in some domain" is an interesting one. A message-length or complexity based prior over patterns (like I justified the use of in my post, and as is commonly used in things like minimum message length prediction) does not have statements like that as basic statements. Instead the basic statements are about the different patterns that could correspond to the values of P(x) over its domain.
If you've done random sampling on domain x and gotten P(x) true 90% of the time, then a robot using a complexity-based prior refuses to throw away the information about what you sampled, and just tries to find simple patterns in your data. This does lead it to predict that some additional statement has high probability of being true, because simple patterns tend to be consistent. But it does not assign equal probability to the truth of each P(x), because different x's can fit into different patterns.
So a complexity prior over functions exhibits the correct behavior, at least when you give it the raw data. But what if we give if the information "P(x) is true for 90% of x in some domain?"
Well, there are lots of different patterns with P(x) true for 90% of x. But the math is fairly straightfoward - our robot takes its prior, cuts out every pattern that doesn't have P(x) true for 90% of x, and then renormalizes to get its new distribution over patterns. Again, not every x has the same probability, because simple patterns contribute more, but the probability is around 90% for various x.
However, at no point does our complexity-prior robot ever think of the statement "P(x) is true for 90% of x in some domain" on its own. To do that, we'd have to make a robot that approximated by throwing out the information of what the sampled P(x) were, and only keeping the frequencies, and then didn't consider any information about the complexity of the function (equivalent to a uniform non-standard prior), so that it didn't update to a more complicated distribution.
This is a reasonable path to take for approximation. After all, the complexity prior isn't going to be computationally tractable for many real-world problems. Throwing out the sampled P(x) is obviously bad if you have enough computational resources to use that information (e.g. imagine if 50% were true and it was just the odd numbers), so along with approximation schemes, we should remember to think about ways of controlling and doing cost/benefit analyses on those approximations.
Hi Manfred, It might help to communicate if you read the report, though I'm not sure it's understandable since it was written quite rapidly :(. Thanks for sending your stream of consciousness and I look forward to hearing more of it. You seem quite convinced that a simplicity prior exhibits reasonable behavior, but I am less so. A lot of the workshop involved understanding potential pitfalls of simplicity priors. Basically, some pathologies appear to arise from the fact that a very large pseudorandom number with a short description is treated very differently from a truly random large number with no short description.
It is difficult to analyze simplicity priors since they are so intractably complicated, but do you believe that, after updating on the first 1 million values of a random 90% true sequence of propositions P(0), P(1),...,P(1 million), P(x) will be assigned similar truth probabilities both when x admits short description and when x admits only a complicated description?
suppose that you find evidence that Q(x) is true of exactly 90% of all numbers between x=1 and x=10^100. Then we would expect that if you plug in some arbitrary number in this range, say n = floor(7^7^7^7 * pi) mod 10^100, then the posterior probability of Q(n), after conditioning on your evidence, would be 0.9. But it turns out that in Demski's proposal, the probability of Q(n) would be approximately 0.5.
This is because the task you're talking about is a task for generalizing based on a pattern rather than on proofs, which early proposals for assigning logical probabilities didn't include. I have literally written an article about this, check it out.
I'll plug my own posts as many times as it takes :P
Hi Manfred, your post indeed touches on many of the same issues we were considering on the workshop. If I understand it correctly, you are worried about the robot correctly learning universal generalizations. For example, if P(x) is true for all odd x and false for all even x, you want your robot to correctly guess that after observing P(0),P(1),P(2),...,P(1 million). In fact, most current proposals, including Demski's, are capable of learning these universal generalizations. Our current programs are able to correctly learn rules like "P(x) is true for all odd x," or even "P(x) is true for 90% of x between 0 and 10 million." You could call this step "generalization." There are many interesting things to say about generalization, including your post and Will Sawin's Pi1-Pi2 Theorem. The problem Paul brought to our attention at this workshop however was not generalization, but "specialization." If you believe that "P(x) is true for all odd x," it's rather easy to say what you believe about P(994543). But it's surprisingly subtle to understand what you should believe about P(994543) given only that "P(x) is true for 90% of x between 0 and 10 million," particularly when P(994543) is part of a different sequence of propositions Q(x) which you have different statistical beliefs about. Properly handling the "specialization" step was the focus of the workshop, and I'd be interested to hear if you have any thoughts about the subtleties brought up in the write-up.
|A kind of counter-example to your claim is the following: http://www.math.rutgers.edu/~zeilberg/GT.html It is an automated reasoning system for Euclidean geometry. Starting from literally nothing, it derived all of the geometric propositions in Euclid's Elements in a matter of seconds. Then it proceeded to produce a number of geometric theorems of human interest that were never noticed by any previous Euclidean geometers, classical or modern.
This is simply to point out that there are some fields of math that are classically very hard but computers find trivial. Another example is the verification of random algebraic identities by brute force.
On the other hand, large fields of mathematics have not yet come to be dominated by computers. For those I think this paper is a good introduction to some state-of-the-art, machine-learning based techniques: http://arxiv.org/abs/1108.3446 One can see from the paper that there is plenty of room for machine learning techniques to be ported from fields like speech and vision.
Progress in machine learning in vision and speech has recently been driven by the existence of huge training data-sets. It is only within the last few years that truly large databases of human-made proofs in things like set theory or group theory have been formalized. I think that future progress will come as these databases continue to grow.
For example, the following disquotational principle follows from the reflection principle (by taking contrapositives):
P( x <= P(A) <= y) > 0 ----> x <= P(A) <= y
The unsatisfying thing is that one direction has "<=" while the other has "<".
The negated statements become 'or', so we get x <= P(A) or P(A) <= y, right?
To me, the strangest thing about this is the >0 condition... if the probability of this type of statement is above 0, it is true!
I agree that the derivation of (4) from (3) in the paper is unclear. The negation of a<b<c is not a>=b>=c.
Subscribe to RSS Feed
= f037147d6e6c911a85753b9abdedda8d)
The mathematician and Fields medalist Vladimir Voevodsky on using automated proof assistants in mathematics:
[...]
[...]
[...]
[...]
From a March 26, 2014 talk. Slides available here.
Computer scientists seem much more ready to adopt the language of homotopy type theory than homotopy theorists at the moment. It should be noted that there are many competing new languages for expressing the insights garnered by infinity groupoids. Though Voevodsky's language is the only one that has any connection to computers, the competing language of quasi-categories is more popular.