All of one_forward's Comments + Replies

I see two reasons not to treat every measurement from the survey as having zero weight.

First, you'd like an approach that makes sense when you haven't considered any data samples previously, so you don't ignore the first person to tell you "humans are generally between 2 and 10 feet tall".

Second, in a different application you may not believe there is no causal mechanism by which a new study could provide unique information about some effect size. Then there's value in a model that updates a little on each new study but doesn't update infinitely on infinitely many studies.

3gwern
The approach I suggest is that you can model standard biases like p-hacking via shrinkage, and you can treat extremely discrete systematic biases like fraud or methodological errors (such as confounding, which is universal among all studies) as a mixture model, where the different components correspond to the different discrete values. This lets you model the 'flip-flop' behavior of a single key node without going full Pearl DAG.

So, for example, if I have a survey I think is fraudulent - possibly just plain made up in a spreadsheet - and a much smaller survey which I trust but which has large sampling error, I can express this as a mixture model, and I will get a bimodal distribution over the estimate with a small diffuse peak and a big sharp peak, corresponding roughly to "here's what you get if the big one is fake, and here's what you get if it's real and pooled with the other one". If you can get more gold-standard data, that further updates the switching parameter, and at some point, if the small surveys keep disagreeing with the big one, the probability of it being fake will approach 1 and it'll stop visibly affecting the posterior distribution, because it'll just always be assigned to the 'fake' component and not affect the posteriors of interest (for the real components).

You can take this approach with confounding too. A confounded study is not simply going to exaggerate the effect size by X%; it will deliver potentially arbitrarily different and opposite-signed estimates, and no matter how many confounded studies you combine, they will never be the causal estimate - and they may all agree with each other very precisely if they are collecting data confounded the same way. So if you have an RCT which contradicts all your cohort correlational results, you're in the same situation as with the two surveys.
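A minimal sketch of the two-survey mixture described above (the code, variable names, and numbers are mine, not gwern's), assuming normal likelihoods and a flat grid prior:

```python
# A sketch of the two-survey mixture: the big survey is either fake
# (carries no information) or real (pooled with the small survey).
import numpy as np
from scipy import stats

theta = np.linspace(0, 10, 1001)             # grid over the true effect
prior = np.full_like(theta, 1 / len(theta))  # flat prior over the grid

big_mean, big_se = 3.0, 0.1      # large survey: precise, possibly fabricated
small_mean, small_se = 5.0, 1.0  # small trusted survey: noisy
p_fake = 0.3                     # prior probability the big survey is fake

lik_small = stats.norm.pdf(small_mean, loc=theta, scale=small_se)
lik_big = stats.norm.pdf(big_mean, loc=theta, scale=big_se)

# Component "fake": the big survey tells us nothing about theta.
joint_fake = p_fake * prior * lik_small
# Component "real": pool both surveys' likelihoods.
joint_real = (1 - p_fake) * prior * lik_small * lik_big

posterior = joint_fake + joint_real  # bimodal: diffuse peak + sharp peak
posterior /= posterior.sum()

# The same marginal likelihoods update the switching parameter.
p_fake_post = joint_fake.sum() / (joint_fake.sum() + joint_real.sum())
print("P(big survey fake | data) =", p_fake_post)
```

With numbers like these, where the two surveys disagree by about two standard errors, most of the mass shifts toward the 'fake' component - the flip-flop behavior described above.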

Thanks gwern! Jaynes is the original source of the height example, though I read it years ago and did not have the reference handy. I wrote this recently after realizing (1) the fallacy is standard practice in meta-analysis and (2) there is a straightforward better approach.

You can stagger the bets and offer either a 1A -> 1B -> 1A circle or a 2B -> 2A -> 2B circle.

Suppose the bets are implemented in two stages. In stage 1 you have an 89% chance of the independent payoff ($1 million for bets 1A and 1B, nothing for bets 2A and 2B) and an 11% chance of moving to stage 2. In stage 2 you either get $1 million (for bets 1A and 2A) or a 10/11 chance of getting $5 million.

Then suppose someone prefers a 10/11 chance of $5 million (bet 3B) to a sure $1 million (bet 3A), prefers 2A to 2B, and currently has 2B in this stagger...
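For concreteness, here is a quick check, assuming the standard Allais payoffs, that the two-stage construction reproduces each original bet's outcome distribution:

```python
# Verify the staged construction reproduces the standard Allais bets:
# 1A: $1M for sure            1B: 89% $1M, 10% $5M, 1% $0
# 2A: 11% $1M, 89% $0         2B: 10% $5M, 90% $0
from fractions import Fraction as F

def staged(stage1_payoff, stage2):
    """89% chance of stage1_payoff, 11% chance of the stage-2 lottery."""
    dist = {stage1_payoff: F(89, 100)}
    for p, x in stage2:
        dist[x] = dist.get(x, F(0)) + F(11, 100) * p
    return dist

sure_million = [(F(1), 1_000_000)]
risky_5m = [(F(10, 11), 5_000_000), (F(1, 11), 0)]

print(staged(1_000_000, sure_million))  # 1A: $1M with certainty
print(staged(1_000_000, risky_5m))      # 1B: 89% $1M, 10% $5M, 1% $0
print(staged(0, sure_million))          # 2A: 11% $1M, 89% $0
print(staged(0, risky_5m))              # 2B: 10% $5M, 90% $0
```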

Yeah, I don't think it makes much difference in high dimensions. It's just more natural to talk about smoothness in the continuous case.

A note on notation - [0,1] with square brackets generally refers to the closed interval between 0 and 1. X is a continuous variable, not a boolean one.

1Richard_Kennaway
Actually, I should have been using curly brackets, as when I wrote (0,1) I meant the set with two elements, 0 and 1, which is what I had taken X to be a product of copies of, hence my obtaining 50000 as the expected Manhattan distance between any two members. I'll correct the post to make that clear. I think everything I said would still apply to the continuous case. If it doesn't, that would be better addressed with a separate comment.
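A quick numerical check of that 50000 figure, assuming (as I read the comment) that X is a uniform random element of {0,1}^100000:

```python
# Each coordinate of two uniform random points of {0,1}^100000 differs
# with probability 1/2, so the expected Manhattan (= Hamming) distance
# is 100000 / 2 = 50000.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 2, size=n)
y = rng.integers(0, 2, size=n)
print(np.abs(x - y).sum())  # close to 50000
```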

Why does UDT lose this game? If it knows anti-Newcomb is much more likely, it will two-box on Newcomb and do just as well as CDT. If Newcomb is more common, UDT one-boxes and does better than CDT.

2dankane
I guess my point is that it is nonsensical to ask "what does UDT do in situation X" without also specifying the prior over possible universes that this particular UDT is using. Given that this is the case, what exactly do you mean by "losing game X"?

You seem to be comparing SMCDT to a UDT agent that can't self-modify (or commit suicide). The self-modifying part is the only reason SMCDT wins here.

The ability to self-modify is clearly beneficial (if you have correct beliefs and act first), but it seems separate from the question of which decision theory to use.

This is a good example. Thank you. A population of 100% CDT, though, would get 100% DD, which is terrible. It's a point in UDT's favor that "everyone running UDT" leads to a better outcome for everyone than "everyone running CDT."

Ok, that example does fit my conditions.

What if the universe cannot read your source code, but can simulate you? That is, the universe can predict your choices but it does not know what algorithm produces those choices. This is sufficient for the universe to pose Newcomb's problem, so the two agents are not identical.

The UDT agent can always do at least as well as the CDT agent by making the same choices a CDT agent would. It will only give a different output if that would lead to a better result.

4dankane
Actually, here's a better counter-example, one that actually exemplifies some of the claims of CDT optimality. Suppose that the universe consists of a bunch of agents (who do not know each others' identities) playing one-off PDs against each other, and that 99% of these agents are UDT agents and 1% are CDT agents. The CDT agents defect for the standard reason. The UDT agents reason, "my opponent will do the same thing that I do with 99% probability; therefore, I should cooperate." CDT agents get 99% DC and 1% DD; UDT agents get 99% CC and 1% CD. The CDT agents in this universe do better than the UDT agents, yet they are facing a perfectly symmetrical scenario with no mind reading involved.
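For concreteness, a quick expected-payoff calculation for this population, assuming standard PD payoff values (the numbers are mine; the comment leaves them unspecified):

```python
# Expected one-shot PD payoffs: T=5 (defect vs. cooperator), R=3 (mutual
# cooperation), P=1 (mutual defection), S=0 (cooperate vs. defector).
T, R, P, S = 5, 3, 1, 0
p_udt, p_cdt = 0.99, 0.01  # population shares

ev_cdt = p_udt * T + p_cdt * P  # CDT defects: 99% DC, 1% DD
ev_udt = p_udt * R + p_cdt * S  # UDT cooperates: 99% CC, 1% CD
print(f"CDT: {ev_cdt:.2f}, UDT: {ev_udt:.2f}")  # CDT: 4.96, UDT: 2.97
```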
2dankane
Fine. How about this: "Have $1000 if you would have two-boxed in Newcomb's problem."

Can you give an example where an agent with a complete and correct understanding of its situation would do better with CDT than with UDT?

An agent does worse by giving in to blackmail only if that makes it more likely to be blackmailed. If a UDT agent knows opponents only blackmail agents that pay up, it won't give in.

If you tell a CDT agent "we're going to simulate you and if the simulation behaves poorly, we will punish the real you," it will ignore that and be punished. If the punishment is sufficiently harsh, the UDT agent that changed its beh...

2dankane
Well, if the universe cannot read your source code, both agents are identical and provably optimal. If the universe can read your source code, there are easy scenarios where one or the other does better. For example, "Here, have $1000 if you are a CDT agent," or "Here, have $1000 if you are a UDT agent."

wolfgang proposed a similar example on Scott's blog:

I wonder if we can turn this into a real physics problem:

1) Assume a large-scale quantum computer is possible (thinking deep thoughts, but not really self-conscious as long as its evolution is fully unitary).

2) Assume there is a channel which allows enough photons to escape in such a way to enable consciousness.

3) However, at the end of this channel we place a mirror – if it is in the consciousness-OFF position the photons are reflected back into the machine and unitarity is restored, but in the conscio...
2Luke_A_Somers
His example is different in a very particular way: his conscious entity gets to dump photons into de Sitter space directly, and only if you open it. This makes Scott's counter-claim prima facie basically plausible - if your putative consciousness only involves reversible actions, then is it really conscious?

But I specifically drew a line between Alice and Alice's Room, and specified that Alice's normal operations are irreversible - but they must also dump entropy into the Room, taking in one of its 0 bits and returning something that might be 1 or 0. If you feed her a 1 bit, she dies of waste heat (maybe she has some degree of tolerance for 1s, but as the density of 1s approaches 50% she cannot survive). If you were to just leave the Room open all the time, always resetting its qubits to 0, Alice would operate the same, aside from having no risk of heatstroke. (In this case, of course, if you ran the simulation backwards, the result would not be where you started, but catastrophe.) I think this is a pretty crucial distinction.

... At least that find explains why the comment disappeared without a ripple. It triggered "I've seen this before".

A&B cannot be more probable than A, but evidence may support A&B more than it supports A.

For example, suppose you have independent prior probabilities of 1/2 for A and for B. The prior probability of A&B is 1/4. If you are then told "A iff B," the probability for A does not change but the probability of A&B goes up to 1/2.
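A brute-force check of this example, enumerating the four equally likely worlds:

```python
# Condition the four (A, B) worlds on the evidence "A iff B":
# P(A) stays at 1/2, while P(A and B) rises from 1/4 to 1/2.
from itertools import product

worlds = list(product([False, True], repeat=2))  # each has prior 1/4
after = [(a, b) for a, b in worlds if a == b]    # condition on "A iff B"

p_a = sum(a for a, b in after) / len(after)
p_a_and_b = sum(a and b for a, b in after) / len(after)
print(p_a, p_a_and_b)  # 0.5 0.5
```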

The reason specific theories are better is not that they are more plausible, but that they contain more useful information.

Your scheme seems to be Jaynes's Ap distribution, discussed on LW here.

1iarwain1
That is precisely what I was proposing - he just explains it much better, of course. Thanks! In the subsequent article he makes essentially the same argument as Lumifer's point about this having the potential to be turtles all the way down.

In a comment to the first article the author quotes Jaynes; the "pending a better understanding of what that means" in that quote is also what I've been grappling with. In last week's thread I initially proposed looking at it as the likelihood that I'll find evidence that will make me change my probability estimate, and then I modified that to how strong the evidence would have to be to make me change my probability estimate. [Aside from just understanding what "probabilities of probabilities" would actually mean, these ways of expressing it make the concept much more universally applicable than the narrow cases the linked article refers to.] But is there a better way of understanding it?
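One concrete way to picture "probabilities of probabilities" is a density over the probability itself; a toy sketch (the priors and data here are invented for illustration):

```python
# Two agents both report P(heads) = 0.5, but a Beta density over p
# encodes how resistant that 0.5 is to new evidence - roughly the
# "how strong would the evidence have to be" reading above.
priors = {
    "diffuse": (1, 1),           # 0.5, but easily moved by evidence
    "concentrated": (500, 500),  # 0.5, and very hard to move
}

heads, tails = 8, 2  # now observe 10 flips, 8 of them heads
for name, (a, b) in priors.items():
    post_mean = (a + heads) / (a + b + heads + tails)
    print(f"{name}: posterior P(heads) = {post_mean:.3f}")
# diffuse: 0.750, concentrated: 0.503
```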