JonasMoss

The number of elements in  won't change when removing every other element from it. The cardinality of   is countable. And when you remove every other element, it is still countable, and indistinguishable from .  If you're unconvinced, ask yourself how many elements  with every other element removed contains. The set is certainly not larger than , so it's at most countable. But it's certainly not finite either. Thus you're dealing with a set of countably many 0s. As there is only one such multiset,  equals  with every other element removed.

That there is only one such multiset follows from the definition of a multiset, a set of pairs , where  is an element and  is its cardinality. It would also be true if we define multisets using sets containing all the pairs  -- provided we ignore the identity of each pair. I believe this is where our disagreement lies. I ignore identities, working only with sets. I think you want to keep the identities intact. If we keep the identities, the set  is not equal to , and my argument (as it stands) fails. 
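For concreteness, here is the bookkeeping I have in mind, with notation picked just for this comment: write a multiset as its set of (element, multiplicity) pairs. Then the multiset of countably many 0s is

$$\{0, 0, 0, \ldots\} \;=\; \{(0, \aleph_0)\},$$

and removing every other element still leaves countably many 0s, hence the same set of pairs $\{(0, \aleph_0)\}$, hence the same multiset.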

I don't understand what you mean. The upgraded individuals are better off than the non-upgraded individuals, with everything else staying the same, so it is an application of Pareto.

Now, I can understand the intuition that (a) and (b) aren't directly comparable due to the identity of individuals. That's what I meant by the caveat "(Unless we add an arbitrary ordering relation on the utilities or some other kind of structure.)"

Pareto: If two worlds (w1 and w2) contain the same people, and w1 is better for an infinite number of them, and at least as good for all of them, then w1 is better than w2.

As far as I can see, the Pareto principle is not just incompatible with the agent-neutrality principle, it's incompatible with set theory itself. (Unless we add an arbitrary ordering relation on the utilities or some other kind of structure.)

Let's take a look at, for instance,  vs , where  is the multiset containing  and  is the disjoint union. Now consider the following scenarios:

(a) Start out with  and multiply every utility by  to get . Since infinitely many people are better off and no one is worse off, .

(b) Start out with  and take every other of the -utilities from  and change them to . Since a copy of  is still left over, this operation leaves us with . Again, since infinitely many are better off and no one worse off, .

In conclusion, both  and , a contradiction.
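For a fully explicit instance of this phenomenon (the utility values here are mine, chosen only for illustration, and need not be the ones above): let $W$ contain countably many people at utility 1 and countably many at utility 2, and raise every other utility-1 person to utility 2. Infinitely many people are strictly better off and no one is worse off, so Pareto says the new world $W'$ is strictly better than $W$. But as a multiset,

$$W' \;=\; \{1, 1, \ldots\} \sqcup \{2, 2, \ldots\} \sqcup \{2, 2, \ldots\} \;=\; \{1, 1, \ldots\} \sqcup \{2, 2, \ldots\} \;=\; W,$$

so the multiset picture declares a world strictly better than itself.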

Okay, thanks for the clarification! Let's see if I understand your setup correctly. Suppose we have the probability measures and , where is the probability measure of the expert. Moreover, we have an outcome

In your post, you use , where  is an unknown outcome known only to the expert. To use Bayes' rule, we must make the assumption that . This assumption doesn't sound right to me, but I suppose some strange assumption is necessary for this simple framework. In this model, I agree with your calculations.

Yes! If I am understanding this right, I think this gets to the crux of the post. The compression is lossy, and necessarily loses some information.

I'm not sure. When we're looking directly at the probability of an event (instead of the probability of the probability of an event), things get much simpler than I thought.

Let's see what happens to the likelihood when you aggregate from the expert's point of view. Letting , we need to calculate the expert's likelihoods and . In this case,

which is essentially your calculations, but from the expert's point of view. The likelihood depends on , the prior of the expert, which is unknown to you. That shouldn't come as a surprise, as he needs to use the prior of in order to combine the probability of the events and .

But the calculations are exactly the same from your point of view, leading to

Now, suppose we want to generally ensure that , which is what I believe you want to do and which seems pretty natural, at least since we're allowed to assume that  for all simple events . To ensure this, we will probably have to require that your priors are the same as the expert's. In other words, your joint distributions are equal, or .
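To illustrate why the expert's prior keeps showing up, here is a toy example (the numbers are mine, purely for illustration): even when you and the expert agree on every simple event, the probability of a combined event still depends on the joint distribution.

```python
# A toy illustration (my own numbers) of why combining the probabilities of
# simple events requires the joint distribution, i.e. the prior, and not just
# the marginals that you and the expert agree on.

def prob_union(p_a, p_b, p_ab):
    """P(A or B) computed from the marginals and the joint P(A and B)."""
    return p_a + p_b - p_ab

p_a, p_b = 0.5, 0.4          # marginals of the simple events, shared with the expert

your_p_ab = 0.5 * 0.4        # your prior treats A and B as independent
expert_p_ab = 0.35           # the expert's prior makes them strongly dependent

print(prob_union(p_a, p_b, your_p_ab))     # 0.70
print(prob_union(p_a, p_b, expert_p_ab))   # 0.55
```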

Do you agree with this summary?

Do you have a link to the research about the effect of a bachelor of education?

I find the beginning of this post somewhat strange, and I'm not sure your post proves what you claim it does. You start out discussing what appears to be a combination of two forecasts, but present it as Bayesian updating. Recall that Bayes' theorem says . To use this theorem, you need both an  (your data / evidence) and a  (your parameter). Using “posterior ∝ prior × likelihood” (with priors  and likelihoods ), you're talking as if your expert's likelihood equals  – but is that true in any sense? A likelihood isn't just something you multiply with your prior; it is a conditional pmf or pdf with a different outcome than your prior.
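In symbols, with $\theta$ standing for the parameter and $x$ for the data (my own letters):

$$p(\theta \mid x) \;=\; \frac{p(x \mid \theta)\, p(\theta)}{p(x)} \;\propto\; p(x \mid \theta)\, p(\theta),$$

where the likelihood $p(x \mid \theta)$ is a conditional pmf or pdf of the data $x$, not a distribution over the parameter $\theta$.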

I can see two interpretations of what you're doing at the beginning of your post:

  1. You're combining two forecasts. That is, with  being the outcome, you have your own pmf  and the expert's , and then combine them using . That's fair enough, but I suppose  or maybe  for some  would be a better way to do it (see the pooling sketch at the end of this comment).
  2. It might be possible to interpret your calculations as a proper application of Bayes' rule, but that requires stretching it. Suppose  is your subjective probability vector for the outcomes  and  is the subjective probability vector for the event supplied by an expert (the value of  is unknown to us). To use Bayes' rule, we will have to say that the evidence vector , the probability of observing an expert judgment of  given that  is true. I'm not sure we ever observe such quantities directly, and it is pretty clear from your post that you're talking about  in the sense used above, not .

Assuming interpretation 1, the rest of your calculations are not that interesting, as you're using a method of knowledge pooling no one advocates.

Assuming interpretation 2, the rest of your calculations are probably incorrect. I don't think there is a unique way to go from  to, let's say, , where  is the expert's probability vector over  and  your probability vector over .
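For concreteness, here is a sketch of the kind of pooling rules I have in mind for interpretation 1: the plain normalized product and a weighted geometric (logarithmic) pool. These are standard textbook rules, and the numbers are made up for illustration.

```python
import numpy as np

def normalized_product(p, q):
    """Multiplicative pooling: multiply the pmfs pointwise and renormalize."""
    r = np.asarray(p, dtype=float) * np.asarray(q, dtype=float)
    return r / r.sum()

def geometric_pool(p, q, w=0.5):
    """Weighted geometric (logarithmic) pooling with weight w on p."""
    r = np.asarray(p, dtype=float) ** w * np.asarray(q, dtype=float) ** (1 - w)
    return r / r.sum()

mine = np.array([0.7, 0.2, 0.1])     # my pmf over three outcomes
expert = np.array([0.4, 0.4, 0.2])   # the expert's pmf

print(normalized_product(mine, expert))   # ≈ [0.737, 0.211, 0.053]
print(geometric_pool(mine, expert))
```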

Children became grown-ups 200 years ago too. I don't think we need to teach them anything at all, much less anything in particular.

According to this SSC post, kids can easily catch up in math even if they aren't taught any math at all in the first 5 years of school.

In the Benezet experiment, a school district taught no math at all before 6th grade (around age 10-11). Then in sixth grade, they started teaching math, and by the end of the year, the students were just as good at math as traditionally-educated children with five years of preceding math education.

That would probably work for reading too, I guess. (Reading appears to require more purpose-built brain circuitry than math. At least I got that impression from reading Henrich's WEIRD. I don't have any references though.)

I found this post interesting, especially the first part, but extremely difficult to understand (yeah, that hard). I believe some of the analogies might be valuable, but it's simply too hard for me to confirm / disconfirm most of them. Here are some (but far from all!) examples:

1. About local optimizers. I didn't understand this section at all! Are you claiming that gradient descent isn't a local optimizer? Or are you claiming that neural networks can implement mesa-optimizers? Or something else?

2. The analogy to Bayesian reasoning feels forced and unrelated to your other points in the Bayes section. Moreover, Bayesian statistics typically doesn't work (it's inconsistent) when you ignore the normalizing constant. And in the case of neural networks, what is your prior? Unless you're thinking about approximate priors using weight decay, most neural networks do not employ priors on their parameters.

3. In your linear model, you seem to interpret the maximum likelihood estimator of the parameters as a Bayesian estimator. Am I on the right track here?

4. Building on your linear toy model, it is natural to understand the weight decay parameters as priors, as that is what they are. (In an exact sense: with L2 weight decay you're looking at ridge regression, which is linear regression with normal priors on the parameters; L1 weight decay corresponds to Laplace priors, and so on.) But you don't do that. In what sense is "the bayesian prior could be encoded purely in the initial weight distribution"? What's more, it seems to me you're thinking about the learning rate as your prior. I think this has something to do with your interpretation of the linear model maximum likelihood estimator as a Bayesian procedure...?
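To spell out the correspondence I have in mind: for a linear model with Gaussian noise, the L2-penalized objective is, up to constants, the negative log-posterior under a normal prior on the weights,

$$-\log p(\beta \mid y) \;=\; \frac{1}{2\sigma^2}\,\lVert y - X\beta\rVert^2 \;+\; \frac{1}{2\tau^2}\,\lVert \beta\rVert^2 \;+\; \text{const}, \qquad \beta \sim \mathcal{N}(0, \tau^2 I),\quad y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I),$$

so ridge regression with penalty $\lambda = \sigma^2/\tau^2$ (writing the objective as $\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert^2$) is exactly the MAP estimator, and swapping the normal prior for a Laplace prior gives the L1 penalty in the same way.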

I disagree. Sometimes your payoffs change as well when you change your action space (in the informal description of the problem). That is the point of the last example, where precommitment changes the possible payoffs rather than merely restricting the action space.

Paradoxical decision problems are paradoxical in the colloquial sense (such as Hilbert's hotel or Bertrand's paradox), not the literal sense (such as "this sentence is false"). Paradoxicality is in the eye of the beholder. Some people think Newcomb's problem is paradoxical, some don't. I agree with you and don't find it paradoxical.
