Ok, I see.
So, I could instead think of iteratively eliminating inconsistent sets of sentences from the measure M, and looking at the limit of M(X)/M(all). I will assume for convenience that we eliminate one inconsistency at a time (so, remove mass from one set of sentences at a time). Let's call this M{d}, so M{d+1} has one more inconsistent set removed from the measure (which is to say, set to measure 0).
Each set eliminated may take some mass away from the numerator, and will definitely take some mass away from the denominator. (Yet the numerator will always remain less than or equal to the denominator.)
If we remove a set of sentences which does not include X and has no overlap with any of the other sets of sentences removed so far, then the mass of both the numerator and the denominator get multiplied by 1 - .5^n, where n is the number of sentences in the set. To make this more general, for M{0}(Y) we can think of ~Y as having already been removed from its mass to begin; therefore, we can say that for any M{d}(Y), if d+1 is based on an inconsistent set S not involving anything removed from M{d}(Y) previously, then M{d+1}(Y) = M{d}(Y) * (1 - .5^|S|).
If S does contain a sentence which has been involved in inconsistency previously, we remove a strictly smaller fraction. If a subset of S has already been removed, then nothing gets removed at all. But if a set S being removed from M{d}(Y) contains only Y and some other sentences which have never appeared before, then M{d+1}(Y) = M{d}(Y) * (1 - .5^|S-1|).
Let's name the ratio M(X)/M(all) at depth d as R(d). Considering R(d)/R(d+1), we know its limit must be in the range [0,1], because M(X) cannot be more than M(all) (so R(d) cannot diverge upward). If the limit of R(d) is greater than zero, the limit of R(d)/R(d+1) must be 1.
Assume X and ~X are both consistent.
In some ordering of logical depth, we have that: infinitely often, we eliminate a set of the form {Z, Z->~X, X} where Z and Z->~X have both never been seen before in an elimination set. Thus, M{d+1}(X) = .75 M{d}(X).
M{d}(all) = M{d}(X) + M{d}(~X). Nothing will be removed from M{d}(~X). Since X and ~X are consistent, both have some measure. Therefore, M{d+1}(all) = .75 M{d}(X) + M{d}(~X) = M{d}(all) ( .75 R(d) + M{d}(~X)/M{d}(all)) = M{d}(all) ( .75 R(d) + 1 - R(d)) = M{d}(all) * (1 - .25 R(d)).
Thus, infinitely often, R(d+1) = {.75 M{d}(X)} / {M{d}(all) * (1 - .25 R(d))} = .75 R(d) / (1 - .25 R(d))
Let c be R(d+1) - R(d).
So, infinitely often, we have
R(d) + c = .75 R(d) / (1 - .25 R(d))
c = .75 R(d) / (1 - .25 R(d)) - R(d)
If c goes to zero, R(d) must go to:
0 = .75 R(d) / (1 - .25 R(d)) - R(d)
R(d) = .75 R(d) / (1 - .25 R(d))
1 = .75 / (1 - .25 R(d))
1 - .25 R(d) = .75
.25 = .25 R(d)
R(d) = 1
So we see that if the probability converges, it must go to either 0 or 1! Unless I've made some mistake.
I didn't follow but the conclusion doesn't sound right to me. Your argument looks like it should apply to any language and proof system. So if I introduce "X" as a logical symbol that isn't constrained by any axioms or inference rules, why would its probability go to either 0 or 1? It seems to converge to 1/2 since for any consistent truth assignment with X=T we have a symmetric consistent truth assignment with X=F.
Followup to: Intelligence Metrics and Decision Theory
Related to: Bridge Collapse: Reductionism as Engineering Problem
A central problem in AGI is giving a formal definition of intelligence. Marcus Hutter has proposed AIXI as a model of perfectly intelligent agent. Legg and Hutter have defined a quantitative measure of intelligence applicable to any suitable formalized agent such that AIXI is the agent with maximal intelligence according to this measure.
Legg-Hutter intelligence suffers from a number of problems I have previously discussed, the most important being:
Logical Uncertainty
The formalism introduced here was originally proposed by Benja.
Fix a formal system F. We want to be able to assign probabilities to statements s in F, taking into account limited computing resources. Fix D a natural number related to the amount of computing resources that I call "depth of analysis".
Define P0(s) := 1/2 for all s to be our initial prior, i.e. each statement's truth value is decided by a fair coin toss. Now define
PD(s) := P0(s | there are no contradictions of length <= D).
Consider X to be a number in [0, 1] given by a definition in F. Then dk(X) := "The k-th digit of the binary expansion of X is 1" is a statement in F. We define ED(X) := Σk 2-k PD(dk(X)).
Remarks
PD(s) = 0.
Non-Constructive UDT
Consider A a decision algorithm for optimizing utility U, producing an output ("decision") which is an element of C. Here U is just a constant defined in F. We define the U-value of c in C for A at depth of analysis D to be
VD(c, A; U) := ED(U | "A produces c" is true). It is only well defined as long as "A doesn't produce c" cannot be proved at depth of analysis D i.e. PD("A produces c") > 0. We define the absolute U-value of c for A to be
V(c, A; U) := ED(c, A)(U | "A produces c" is true) where D(c, A) := max {D | PD("A produces c") > 0}. Of course D(c, A) can be infinite in which case Einf(...) is understood to mean limD -> inf ED(...).
For example V(c, A; U) yields the natural values for A an ambient control algorithm applied to e.g. a simple model of Newcomb's problem. To see this note that given A's output the value of U can be determined at low depths of analysis whereas the output of A requires a very high depth of analysis to determine.
Naturalized Induction
Our starting point is the "innate model" N: a certain a priori model of the universe including the agent G. This model encodes the universe as a sequence of natural numbers Y = (yk) which obeys either specific deterministic or non-deterministic dynamics or at least some constraints on the possible histories. It may or may not include information on the initial conditions. For example, N can describe the universe as a universal Turing machine M (representing G) with special "sensory" registers e. N constraints the dynamics to be compatible with the rules of the Turing machine but leaves unspecified the behavior of e. Alternatively, N can contain in addition to M a non-trivial model of the environment. Or N can be a cellular automaton with the agent corresponding to a certain collection of cells.
However, G's confidence in N is limited: otherwise it wouldn't need induction. We cannot start with 0 confidence: it's impossible to program a machine if you don't have even a guess of how it works. Instead we introduce a positive real number t which represents the timescale over which N is expected to hold. We then assign to each hypothesis H about Y (you can think about them as programs which compute yk given yj for j < k; more on that later) the weight QS(H) := 2-L(H) (1 - e-t(H)/t). Here L(H) is the length of H's encoding in bits and t(H) is the time during which H remains compatible with N. This is defined for N of deterministic / constraint type but can be generalized to stochastic N.
The weights QS(H) define a probability measure on the space of hypotheses which induces a probability measure on the space of histories Y. Thus we get an alternative to Solomonoff induction which allows for G to be a mechanistic part of the universe, at the price of introducing N and t.
Remarks
Intelligence Metric
To assign intelligence to agents we need to add two ingredients:
Instead, we define I(Q0) := EQS(Emax(U(Y(H)) | "Q(Y(H)) = Q0" is true)). Here the subscript max stands for maximal depth of analysis, as in the construction of absolute UDT value above.
Remarks