Yeah, you're right. I assumed that Bayes(M, P) is continuous in both arguments, but we don't know what that means or how to prove it.
It's not clear how to define continuity in P for a given M, because pointwise convergence doesn't cut it and we don't have any other ideas. But continuity in M for a given P seems to be easier, because we can define the distance between two M's as the total weight of sentences where they disagree. Under this definition, I think I've found an example where Bayes(M, P) is not continuous in M.
1) Let's take a language consisting of sentences phi_k = "x=k" for each natural k, and all their combinations using logical connectives (and/or/not). This language has models M_k = "phi_k is true, all others are false" for each k, and also M_inf = "all phi_k are false".
2) To define the weights, fix c<1 and set mu(phi_k) = c*2^-k for each k, and distribute the remaining 1-c of weight among the remaining sentences arbitrarily.
3) For every k, any statement where M_k and M_inf disagree must include phi_k as a component, because all other phi_n have the same truth value in M_k and M_inf. The total weight of sentences that include phi_k goes to zero as k goes to infinity, so the model M_inf is the limit of models M_k under the above metric.
4) Now let's define a probability assignment P so that P(phi_k) = (2^(-2^k))/(1+2^(-2^k)) for each k, and P(anything else) = 1/2. You can check that Bayes(M_k, P) has the same value for each k, but Bayes(M_inf, P) has a different, higher value.
I actually thought of exactly this construction at our last MIRIx on Saturday. It made me sad. However here is a question I couldn't answer. Can you do the same thing with a coherent P? That I do not know.
If Bayes is continuous in M for coherent P then we would be able to prove the first conjecture.
Something slightly easier, maybe we can show that if Bayes is not continuous at the point M for coherent P, then P must assign nonzero probability to M. I am also not sure if this would prove the conjecture, but haven't really thought about this angle yet.
Anoth...
In this post, I propose an answer to the following question:
Given a consistent but incomplete theory, how should one choose a random model of that theory?
My proposal is rather simple. Just assign probabilities to sentences in such that if an adversary were to choose a model, your Worst Case Bayes Score is maximized. This assignment of probabilities represents a probability distribution on models, and choose randomly from this distribution. However, it will take some work to show that what I just described even makes sense. We need to show that Worst Case Bayes Score can be maximized, that such a maximum is unique, and that this assignment of probabilities to sentences represents an actual probability distribution. This post gives the necessary definitions, and proves these three facts.
Finally, I will show that any given probability assignment is coherent if and only if it is impossible to change the probability assignment in a way that simultaneously improves the Bayes Score by an amount bounded away from 0 in all models. This is nice because it gives us a measure of how far a probability assignment is from being coherent. Namely, we can define the "incoherence" of a probability assignment to be the supremum amount by which you can simultaneously improve the Bayes Score in all models. This could be a useful notion since we usually cannot compute a coherent probability assignment so in practice we need to work with incoherent probability assignments which approach a coherent one.
I wrote up all the definitions and proofs on my blog, and I do not want to go through the work of translating all of the latex code over here, so you will have to read the rest of the post there. Sorry. In case you do not care enough about this to read the formal definitions, let me just say that my definition of the "Bayes Score" of a probability assignment P with respect to a model M is the sum over all true sentences s of m(s)log(P(s)) plus the sum over all false sentences s of m(s)log(1-P(s)), where m is some fixed nowhere zero probability measure on all sentences. (e.g. m(s) is 1/2 to the number of bits needed to encode s)
I would be very grateful if anyone can come up with a proof that this probability distribution which maximizes Worst Case Bayes Score has the property that its Bayes Score is independent of the choice of what model we use to judge it. I believe it is true, but have not yet found a proof.