Eliezer_Yudkowsky comments on Take heed, for it is a trap - Less Wrong

47 Post author: Zed 14 August 2011 10:23AM


Comment author: Eliezer_Yudkowsky 16 August 2011 12:32:05AM *  3 points [-]

A statement, any statement, starts out with a 50% probability of being true, and then you adjust that percentage based on the evidence you come into contact with.

That's wildly wrong. "50% probability" is what you assign if someone tells you, "One and only one of the statements X or Y is true, but I'm not going to give you the slightest hint as to what they mean" and it's questionable whether you can even call that a statement, since you can't say anything about its truth-conditions.

Any statement for which you have the faintest idea of its truth conditions will be specified in sufficient detail that you can count the bits, or count the symbols, and that's where the rough measure of prior probability starts - not at 50%. 50% is where you start if you start with 1 bit. If you start with 0 bits the problem is just underspecified.
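One minimal way to picture this "count the bits" measure, assuming the simplest possible rule where each bit of description length halves the prior (the rule and function name below are an illustration, not a quoted formula):

```python
# Toy sketch: a hypothesis specified by n bits gets prior 2**-n, so a
# 1-bit hypothesis starts at exactly 50% and longer ones start lower.
def complexity_prior(n_bits: int) -> float:
    if n_bits < 1:
        # 0 bits: nothing has been specified, so there is nothing to score.
        raise ValueError("0 bits: the hypothesis is underspecified")
    return 2.0 ** -n_bits

assert complexity_prior(1) == 0.5        # 50% is where you start with 1 bit
assert complexity_prior(10) == 2 ** -10  # a 10-bit hypothesis starts far lower
```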

Update a bit in this direction: That part where Rational Rian said "What the hell do you mean, it starts with 50% probability", he was perfectly right. If you're not confident of your ability to wield the math, don't be so quick to distrust your intuitive side!

Comment author: komponisto 16 August 2011 01:53:15AM 5 points [-]

Any statement for which you have the faintest idea of its truth conditions will be specified in sufficient detail that you can count the bits, or count the symbols...If you start with 0 bits the problem is just underspecified.

What a perfect illustration of what I was talking about when I wrote:

Of course, we almost never reach this level of ignorance in practice, which makes this the type of abstract academic point that people all-too-characteristically have trouble with. The step of calculating the complexity of a hypothesis seems "automatic", so much so that it's easy to forget that there is a step there.

You can call 0 bits "underspecifed" if you like, but the antilogarithm of 0 is still 1, and odds of 1 still corresponds to 50% probability.
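The arithmetic here is just the odds-to-probability conversion:

```python
bits = 0
odds = 2 ** bits         # the antilogarithm (base 2) of 0 bits is odds of 1
p = odds / (1 + odds)    # odds of 1:1 correspond to probability 0.5
assert p == 0.5
```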

Given your preceding comment, I realize you have a high prior on people making simple errors. And, at the very least, this is a perfect illustration of why never to use the "50%" line on a non-initiate: even Yudkowsky won't realize you're saying something sophisticated and true rather than banal and false.

Nevertheless, that doesn't change the fact that knowing the complexity of a statement is knowing something about the statement (and hence not being in total ignorance).

Comment author: Eliezer_Yudkowsky 16 August 2011 06:47:07PM 4 points [-]

I still don't think you're saying something sophisticated and true. I think you're saying something sophisticated and nonsensical. I think it's meaningless to assign a probability to the assertion "understand up without any clams" because you can't say what configurations of the universe would make it true or false, nor interpret it as a question about the logical validity of an implication. Assigning probabilities to A, B, C as in your linked writing strikes me as equally nonsensical. The part where you end up with a probability of 25% after doing an elaborate calculation based on having no idea what your symbols are talking about is not a feature, it is a bug. To convince me otherwise, explain how an AI that assigns probabilities to arbitrary labels about which it knows nothing will function in a superior fashion to an AI that only assigns probabilities to things for which it has nonzero notion of its truth condition.

"If you know nothing, 50% prior probability" still strikes me as just plain wrong.

Comment author: gwern 16 August 2011 11:54:46PM *  9 points [-]

"If you know nothing, 50% prior probability" still strikes me as just plain wrong.

That strikes me as even weirder and more wrong. So given a variable A, which could be any possible variable, I should assign it... 75%, and ~A 25%? Or 25%, and ~A 75%? Or what? Isn't 50% the only symmetrical answer?

Basically, given a single variable and its negation, isn't 1/2 the max-entropy distribution, just as a collection of n variables has 1/n as the max-ent answer for them?
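This can be checked directly: the entropy of a two-outcome distribution peaks at 50/50, and the uniform 1/n assignment has the maximal entropy log2(n) over n options. A quick sketch:

```python
import math

def entropy(ps):
    # Shannon entropy in bits of a discrete distribution.
    return -sum(p * math.log2(p) for p in ps if p > 0)

# For a single proposition A vs ~A, p = 1/2 maximizes entropy:
candidates = [0.25, 0.5, 0.75]
best = max(candidates, key=lambda p: entropy([p, 1 - p]))
assert best == 0.5
assert entropy([0.5, 0.5]) == 1.0  # one full bit of uncertainty

# For n mutually exclusive options, uniform 1/n gives entropy log2(n):
assert entropy([0.25] * 4) == math.log2(4)
```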

Comment author: ArisKatsaris 16 August 2011 11:35:48PM *  5 points [-]

Okay, I was among the first people here who called Zed's statement plainly wrong, but enough high-status members of the community are now taking that same position that I think it would serve knowledge better if I explained in what slight sense his statement might not be completely wrong.

One would normally say that you calculate 3^4 by multiplying 3 four times: 3 * 3 * 3 * 3.
But someone like Zed would say: "No! Every exponential calculation starts out with the number 1. You ought to say 3^4 = 1 * 3 * 3 * 3 * 3."
And most of us would then say: "What the hell sense does that make? How would it help an AI to begin by multiplying the number 1 by 3? You're not making sense."
And then Zed would say: "But 0^0 = 1 -- and you can only see that if you include the number 1 in the sequence of numbers to multiply."
And then we would say: "What does it even mean to raise zero to the zeroth power? That has no meaning."
And we would be right in the sense that it has no meaning in the physical universe. But Zed would be right in the sense that he's mathematically correct: it has mathematical meaning, and equations wouldn't work without the convention that 0^0 = 1.

I think we can visualize the "starting probability of a proposition" as 50% in the same way we can visualize the "starting multiplier" of an exponential calculation as 1. This starting number really does NOT help a computer calculate anything. In fact it's a waste of processor cycles for a computer to perform that 1 * 3 calculation, instead of just using the number 3 as the first number.

But "1" can be considered to be the number that remains if all the multipliers are taken away one by one.

Likewise, imagine that we have used both several pieces of evidence and the complexity of a proposition to calculate its probability -- but then for some reason we have to start taking away those pieces of evidence (e.g. perhaps the AI has to calculate what probability a different AI would have calculated, using less evidence). As we take away more and more evidence, we'll eventually converge towards 50%, the same way that 0^0 = 1.
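The analogy is exact in the odds form of Bayes' theorem, where each piece of evidence is a multiplicative likelihood ratio and the "starting multiplier" is the empty product: odds of 1, i.e. 50%. A sketch with made-up likelihood ratios:

```python
def posterior(likelihood_ratios):
    # Odds-form Bayes: start from the empty product, odds of 1 (i.e. 50%),
    # and multiply in one likelihood ratio per piece of evidence.
    odds = 1.0
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

evidence = [4.0, 2.0, 0.5]          # illustrative likelihood ratios
assert posterior(evidence) == 0.8   # 4 * 2 * 0.5 = 4 -> odds 4:1 -> p = 0.8
# Taking the evidence away piece by piece walks back to the empty product:
assert posterior([]) == 0.5
```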

Comment author: Plasmon 17 August 2011 05:29:27AM 2 points [-]

I feel compelled to point out that 0^0 is undefined, since the limit of x^0 as x approaches 0 is 1, but the limit of 0^x as x approaches 0 from above is 0.

Yes, in combinatorics assuming 0^0=1 is sensible since it simplifies a lot of formulas which would otherwise have to include special cases at 0.

Comment author: komponisto 16 August 2011 07:50:50PM 1 point [-]

To convince me otherwise, explain how an AI that assigns probabilities to arbitrary labels about which it knows nothing will function in a superior fashion to an AI that only assigns probabilities to things for which it has nonzero notion of its truth condition.

If you're thinking truly reductionistically about programming an AI, you'll realize that "probability" is nothing more than a numerical measure of the amount of information the AI has. And when the AI counts the number of bits of information it has, it has to start at some number, and that number is zero.

The point is about the internal computations of the AI, not the output on the screen. The output on the screen may very well be "ERROR: SYNTAX" rather than "50%" for large classes of human inputs. The human inputs are not what I'm talking about when I refer to unspecified hypotheses like A,B, and C. I'm talking about when, deep within its inner workings, the AI is computing a certain number associated with a string of binary digits. And if the string is empty, the associated number is 0.

The translation of

-- "What is P(A), for totally unspecified hypothesis A?"

-- "50%."

into AI-internal-speak is

-- "Okay, I'm about to feed you a binary string. What digits have I fed you so far?"

-- "Nothing yet."

"If you know nothing, 50% prior probability" still strikes me as just plain wrong.

That's because in almost all practical human uses, "know nothing" doesn't actually mean "zero information content".

Comment author: Eliezer_Yudkowsky 16 August 2011 10:51:23PM 2 points [-]

If you're thinking truly reductionistically about programming an AI, you'll realize that "probability" is nothing more than a numerical measure of the amount of information the AI has.

And here I thought it was a numerical measure of how credible it is that the universe looks a particular way. "Probability" is what I plug into expected utility calculations. I didn't realize that I ought to be weighing futures based on "the amount of information" I have about them, rather than how likely they are to come to pass.

Comment author: komponisto 17 August 2011 01:13:55AM *  5 points [-]

A wise person once said (emphasis -- and the letter c -- added):

"Uncertainty exists in the map, not in the territory. In the real world, the coin has either come up heads, or come up tails. Any talk of 'probability' must refer to the information that I have about the coin - my state of partial ignorance and partial knowledge - not just the coin itself. Furthermore, I have all sorts of theorems showing that if I don't treat my partial knowledge a certain way, I'll make stupid bets. If I've got to plan, I'll plan for a 50/50 state of uncertainty, where I don't weigh outcomes conditional on heads any more heavily in my mind than outcomes conditional on tails. You can call that number whatever you like, but it has to obey the probability laws on pain of stupidity. So I don't have the slightest hesitation about calling my outcome-weighting a probability."

That's all we're talking about here. This is exactly like the biased coin where you don't know what the bias is. All we know is that our hypothesis is either true or false. If that's all we know, there's no probability other than 50% that we can sensibly assign. (Maybe using fancy words like "maximum entropy" will help.)

I fully acknowledge that it's a rare situation when that's all we know. Usually, if we know enough to be able to state the hypothesis, we already have enough information to drive the probability away from 50%. I grant this. But 50% is still where the probability gets driven away from.

Denying this is tantamount to denying the existence of the number 0.

Comment author: wmorgan 17 August 2011 04:26:40AM *  4 points [-]

Let n be an integer. Knowing nothing else about n, would you assign 50% probability to n being odd? To n being positive? To n being greater than 3? You see how fast you get into trouble.

You need a prior distribution on n. Without a prior, these probabilities are not 50%. They are undefined.

The particular mathematical problem is that you can't define a uniform distribution over an unbounded domain. This doesn't apply to the biased coin: in that case, you know the bias is somewhere between 0 and 1, and for every distribution that favors heads, there's one that favors tails, so you can actually perform the integration.
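That integration can be checked numerically: any prior density on the bias that is symmetric about 1/2 yields P(heads) = 1/2, whatever its shape (the weight function below is an arbitrary illustration):

```python
# Numerically check that a prior on the coin's bias symmetric about 1/2
# integrates to P(heads) = 1/2, regardless of the prior's shape.
N = 100_000

def w(p):
    # Unnormalized prior weight, symmetric: w(p) == w(1 - p).
    return (p - 0.5) ** 2 + 0.1

total = 0.0
norm = 0.0
for i in range(N):
    p = (i + 0.5) / N      # midpoint rule on [0, 1]
    total += p * w(p)      # integrand of P(heads)
    norm += w(p)           # normalizing constant

assert abs(total / norm - 0.5) < 1e-6
```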

Finally, on an empirical level, it seems like there are more false n-bit statements than true n-bit statements. Like, if you took the first N Godel numbers, I'd expect more falsehoods than truths. Similarly for statements like "Obama is the 44th president": so many ways to go wrong, just a few ways to go right.

Edit: that last paragraph isn't right. For every true proposition, there's a false one of equal complexity.

Comment author: Zed 17 August 2011 09:13:54AM *  5 points [-]

Finally, on an empirical level, it seems like there are more false n-bit statements than true n-bit statements.

I'm pretty certain this intuition is false. It feels true because it's much harder to come up with a true statement from N bits if you restrict yourself to positive claims about reality. If you get random statements like "the frooble fuzzes violently" they're bound to be false, right? But for every nonsensical or false statement you also get the negation of a nonsensical or false statement: "not (the frooble fuzzes violently)". It's hard to arrive at a statement like "Obama is the 44th president" and be correct, but it's very easy to enumerate a million things that do not orbit Pluto (and be correct).

(FYI: somewhere below there is a different discussion about whether there are more n-bit statements about reality that are false than true)

Comment author: ArisKatsaris 17 August 2011 10:30:32AM *  4 points [-]

There's a 1-to-1 correspondence between any true statement and its negation, and the sets don't overlap, so there's an equal number of true and false statements - and they can be coded in the same number of bits, since the interpreting machine can always be made to consider the negation of the statement you've fed it.

You just need to add the term '...NOT!' at the end. As in 'The Chudley Cannons are a great team... NOT!"

Or we may call it the "He loves me, he loves me not" principle.

Comment author: wmorgan 17 August 2011 12:45:03PM *  1 point [-]

Doesn't it take more bits to specify NOT P than to specify P? I mean, I can take any proposition and add "..., and I like pudding" but this doesn't mean that half of all n-bit propositions are about me liking pudding.

Comment author: ArisKatsaris 17 August 2011 01:16:12PM *  0 points [-]

Doesn't it take more bits to specify NOT P than to specify P?

No. If "NOT P" took more bits to specify than "P", this would also mean that "NOT NOT P" would take more bits to specify than "NOT P". But NOT NOT P is identical to P, so it would mean that P takes more bits to specify than itself.

With actual propositions now, instead of letters:

If you have the proposition "The Moon is Earth's satellite", and the proposition "The Moon isn't Earth's satellite", each is the negation of the other. If a proposition's negation takes more bits to specify than the proposition, then you're saying that each statement takes more bits to specify than the other.

Even simpler -- can you think any reason why it would necessarily take more bits to codify "x != 5" than "x == 5"?
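In concrete encodings the two tests often cost exactly the same; for instance, measured in UTF-8 bytes:

```python
eq = "x == 5"
neq = "x != 5"
# The comparison operator and its complement are tokens of equal length,
# so negating the test costs no extra bits in this encoding.
assert len(eq.encode("utf-8")) == len(neq.encode("utf-8")) == 6
```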

Comment author: ESRogs 02 May 2012 12:03:45AM 0 points [-]

I read the rest of this discussion but did not understand the conclusion. Do you now think that the first N Godel numbers would be expected to have the same number of truths as falsehoods?

Comment author: wmorgan 02 May 2012 01:34:43AM 1 point [-]

It turns out not to matter. Consider a formalism G', identical to Godel numbering, but that reverses the sign, such that G(N) is true iff G'(N) is false. In the first N numbers in G+G', there are an equal number of truths and falsehoods.

For every formalism that makes it easy to encode true statements, there's an isomorphic one that does the same for false statements, and vice versa. This is why the set of statements of a given complexity can never be unbalanced.
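The sign-flipping argument is easy to simulate: whatever truth-values a formalism G assigns to its first N code numbers, the flipped formalism G' assigns the opposite ones, so over G and G' together truths and falsehoods balance exactly. A sketch with an arbitrary (here random) assignment:

```python
import random

random.seed(0)
N = 1000
G = [random.choice([True, False]) for _ in range(N)]  # arbitrary formalism
G_prime = [not v for v in G]                          # sign-flipped formalism

# Each code number is true in exactly one of G and G', so the 2N verdicts
# split evenly no matter how lopsided G itself is.
truths = sum(G) + sum(G_prime)
falsehoods = 2 * N - truths
assert truths == falsehoods == N
```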

Comment author: ESRogs 02 May 2012 11:44:12PM 0 points [-]

Gotcha, thanks.

Comment author: komponisto 17 August 2011 05:26:04AM 0 points [-]

Let n be an integer...You need a prior distribution on n. Without a prior, these probabilities are not 50%. They are undefined.

Who said anything about not having a prior distribution? "Let n be a [randomly selected] integer" isn't even a meaningful statement without one!

What gave you the impression that I thought probabilities could be assigned to non-hypotheses?

Finally, on an empirical level, it seems like there are more false n-bit statements than true n-bit statements.

This is irrelevant: once you have made an observation like this, you are no longer in a state of total ignorance.

Comment author: wmorgan 17 August 2011 06:02:01AM *  1 point [-]

We agree that we can't assign a probability to a property of a number without a prior distribution. And yet it seems like you're saying that it is nonetheless correct to assign a probability of truth to a statement without a prior distribution, and that this probability is 50%.

Doesn't the second statement follow from the first? Something like this:

  1. For any P, a nontrivial predicate on integers, and an integer n, Pr(P(n)) is undefined without a distribution on n.
  2. Define X(n), a predicate on integers, true if and only if the nth Godel number is true.
  3. Pr(X(n)) is undefined without a distribution on n.

Integers and statements are isomorphic. If you're saying that you can assign a probability to a statement without knowing anything about the statement, then you're saying that you can assign a probability to a property of a number without knowing anything about the number.

Comment author: komponisto 17 August 2011 06:34:43AM *  1 point [-]

We agree that we can't assign a probability to a property of a number without a prior distribution. And yet it seems like you're saying that it is nonetheless correct to assign a probability of truth to a statement without a prior distribution,

That is not what I claim. I take it for granted that all probability statements require a prior distribution. What I claim is that if the prior probability of a hypothesis evaluates to something other than 50%, then the prior distribution cannot be said to represent "total ignorance" of whether the hypothesis is true.

This is only important at the meta-level, where one is regarding the probability function as a variable -- such as in the context of modeling logical uncertainty, for example. It allows one to regard "calculating the prior probability" as a special case of "updating on evidence".

Comment author: Zed 17 August 2011 09:10:31AM *  0 points [-]

[ replied to the wrong person ]

Comment author: Wei_Dai 17 August 2011 09:56:35AM 2 points [-]

I fully acknowledge that it's a rare situation when that's all we know.

When is this ever the situation?

Usually, if we know enough to be able to state the hypothesis, we already have enough information to drive the probability away from 50%. I grant this. But 50% is still where the probability gets driven away from.

Can you give an example of "driving the probability away from 50%"? I note that no one responded to my earlier request for such an example.

Comment author: lessdazed 17 August 2011 09:25:01PM -1 points [-]

When is this ever the situation?...Can you give an example of "driving the probability away from 50%"? I note that no one responded to my earlier request for such an example.

No one can give an example because it is logically impossible for it to be the situation, it's not just rare. It cannot be that "All we know is that our hypothesis is either true or false." because to know that something is a hypothesis entails knowing more than nothing. It's like saying "knowing that a statement is either false or a paradox, but having no information at all as to whether it is false or a paradox".

Comment author: Tyrrell_McAllister 16 August 2011 09:31:58PM *  1 point [-]

The translation of

-- "What is P(A), for totally unspecified hypothesis A?"

-- "50%."

into AI-internal-speak is

-- "Okay, I'm about to feed you a binary string. What digits have I fed you so far?"

-- "Nothing yet."

You seem to be using a translation scheme that I have not encountered before. You give one example of its operation, but that is not enough for me to distill the general rule. As with all translation schemes, it will be easier to see the pattern if we see how it works on several different examples.

So, with that in mind, suppose that the AI were asked the question

-- "What is P(A), for a hypothesis A whose first digit is 1, but which is otherwise totally unspecified?"

What should the AI's answer be, prior to translation into "AI-internal-speak"?

Comment author: Tyrrell_McAllister 16 August 2011 09:01:06PM *  1 point [-]

Why does not knowing the hypothesis translate into assigning the hypothesis probability 0.5 ?

If this is the approach that you want to take, then surely the AI-internal-speak translation of "What is P(A), for totally unspecified hypothesis A?" would be "What proportion of binary strings encode true statements?"

ETA: On second thought, even that wouldn't make sense, because the truth of a binary string is a property involving the territory, while prior probability should be entirely determined by the map. Perhaps sense could be salvaged by passing to a meta-language. Then the AI could translate "What is P(A), for totally unspecified hypothesis A?" as "What is the expected value of the proportion of binary strings that encode true statements?".

But really, the question "What is P(A), for totally unspecified hypothesis A?" just isn't well-formed. For the AI to evaluate "P(A)", the AI needs already to have been fed a symbol A in the domain of P.

Your AI-internal-speak version is a perfectly valid question to ask, but why do you consider it to be the translation of "What is P(A), for totally unspecified hypothesis A?" ?

Comment author: Tyrrell_McAllister 16 August 2011 05:34:29PM 1 point [-]

Given your preceding comment, I realize you have a high prior on people making simple errors. And, at the very least, this is a perfect illustration of why never to use the "50%" line on a non-initiate: even Yudkowsky won't realize you're saying something sophisticated and true rather than banal and false.

I don't see how the claim is "sophisticated and true". Let P and Q be statements. You cannot simultaneously assign 50% prior probability to each of the following three statements:

  • P
  • P & Q
  • P & ~Q

This remains true even if you don't know the complexities of these statements.
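The incoherence is one line of arithmetic: P is the disjoint union of P & Q and P & ~Q, so additivity forces the three probabilities to add up:

```python
p_PQ = 0.5
p_PnotQ = 0.5
p_P = p_PQ + p_PnotQ   # forced by additivity over the disjoint split of P
assert p_P == 1.0      # not the assumed 0.5: the three 50%s are incoherent
```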

Comment author: komponisto 16 August 2011 05:39:45PM 0 points [-]

See here.

Comment author: Tyrrell_McAllister 16 August 2011 07:19:23PM *  1 point [-]

See here.

I think that either you are making a use-mention error, or you are confusing syntax with semantics.

Formally speaking, the expression "p(A)" makes sense only if A is a sentence in some formal system.

I can think of three ways to try to understand what's going in your dialogue, but none leads to your conclusion. Let Alice and Bob be the first and second interlocutor, respectively. Let p be Bob's probability function. My three interpretations of your dialogue are as follows:

  1. Alice and Bob are using different formal systems. In this case, Bob cannot use Alice's utterances; he can only mention them.

  2. Alice and Bob are both using the same formal system, so that A, B, and C are sentences—e.g., atomic proposition letters—for both Alice and Bob.

  3. Alice is talking about Bob's formal system. She somehow knows that Bob's model-theoretic interpretations of the sentences C and A&B are the same, even though [C = A&B] isn't a theorem in Bob's formal system. (So, in particular, Bob's formal system is not complete.)

Under the first interpretation, Bob cannot evaluate expressions of the form "p(A)", because "A" is not a sentence in his formal system. The closest he can come is to evaluate expressions like "p(Alice was thinking of a true proposition when she said 'A')". If Bob attends to the use-mention distinction carefully, he cannot be trapped in the way that you portray. For, while C = A & B may be a theorem in Alice's system,

  • (Alice was thinking of a true proposition when she said 'C') = (Alice was thinking of a true proposition when she said 'A') & (Alice was thinking of a true proposition when she said 'B')

is not (we may suppose) a theorem in Bob's formal system. (If, by chance, it is a theorem in Bob's formal system, then the essence of the remarks below apply.)

Now consider the second interpretation. Then, evidently, C = A & B is a theorem in Alice and Bob's shared formal system. (Otherwise, Alice would not be in a position to assert that C = A & B.) But then p, by definition, will respect logical connectives so that, for example, if p(A & ~B) > 0, then p(C) < p(A). This is true even if Bob hasn't yet worked out that C = A & B is in fact a consequence of his axioms. It just follows from the fact that p is a coherent probability function over propositions.

This means that, if the algorithm that determines how Bob answers a question like "What is p(A)?" is indeed an implementation of the probability function p, then he simply will not in all cases assert that p(A) = 0.5, p(B) = 0.5, and p(C) = 0.5.

Finally, under the third interpretation, Bob did not say that p(A|B) = 1 when he said that p(C)/p(B) = 1, because A&B is not syntactically equivalent to C under Bob's formal system. So again Alice's trap fails to spring.

Comment author: Vladimir_Nesov 16 August 2011 10:41:03PM 0 points [-]

Formally speaking, the expression "p(A)" makes sense only if A is a sentence in some formal system.

How does it makes sense then? Quite a bit more would need to be assumed and specified.

Comment author: Tyrrell_McAllister 16 August 2011 10:43:00PM *  1 point [-]

Quite a bit more would need to be assumed and specified.

Hence the "only if". I am stating a necessary, but not sufficient, condition. Or do I miss your point?

Comment author: Vladimir_Nesov 16 August 2011 10:45:33PM 0 points [-]

Well, we could also assume and specify additional things that would make "p(A)" make sense even if "A" is not a statement in some formal system. So I don't see how your remark is meaningful.

Comment author: Tyrrell_McAllister 16 August 2011 11:08:10PM *  0 points [-]

Well, we could also assume and specify additional things that would make "p(A)" make sense even if "A" is not a statement in some formal system.

Do you mean, for example, that p could be a measure and A could be a set? Since komponisto was talking about expressions of the form p(A) such that A can appear in expressions like A&B, I understood the context to be one in which we were already considering p to be a function over sentences or propositions (which, following komponisto, I was equating), and not, for example, sets.

Do you mean that "p(A)" can make sense in some case where A is a sentence, but not a sentence in some formal system? If so, would you give an example? Do you mean, for example, that A could be a statement in some non-formal language like English?

Or do you mean something else?

Comment author: komponisto 17 August 2011 12:07:59AM 1 point [-]

In my own interpretation, A is a hypothesis -- something that represents a possible state of the world. Hypotheses are of course subject to Boolean algebra, so you could perhaps model them as sentences or sets.

You have made a number of interesting comments that will probably take me some time to respond to.

Comment author: lessdazed 16 August 2011 03:31:51AM -2 points [-]

Knowing that a statement is a proposition is far from being in total ignorance.

It would be annoying to write about propositions using the word "statements", and then to correct people who say you are wrong on the basis of true things they say about actual statements. Please make it clear you aren't doing that.

Comment author: komponisto 16 August 2011 04:41:04AM 0 points [-]

Neither the grandparent nor (so far as I can tell) the great-grandparent makes the distinction between "statements" and "propositions" that you have drawn elsewhere. I used the term "statement" because that was what was used in the great-grandparent (just as I used it in my other comment because it was used in the post). Feel free to mentally substitute "proposition" if that is what you prefer.