All of Nick Hay's Comments + Replies

Very nice. These notes say that every countable nonstandard model of Peano arithmetic is isomorphic, as an ordered set, to the natural numbers followed by lexicographically ordered pairs (r, z) for r a positive rational and z an integer. If I remember rightly, the ordering can be defined in terms of addition: x <= y iff exists z. x+z <= y. So if we want to have a countable nonstandard model of Peano arithmetic with successor function and addition we need all these nonstandard numbers.

It seems that if we only care about Peano arithmetic with the s... (read more)

2IlyaShpitser
I think whether naturals plus one non-standard integer line is a model of Peano's axioms for the successor only (no addition/multiplication) depends on whether we use second order or first order logic to express induction. (No in second order formulation due to Dedekind's result, yes for any first order formulation).

Fascinating, I thought Tennanbaum's theorem implied non-standard models were rather impossible to visualize. The non-standard model of Peano arithmetic illustrated in the diagram only gives the successor relation, there's no definition of addition and multiplication. Tennenbaum's theorem implies there's no computable way to do this, but is there a proof that they can be defined at all for this particular model?

The chapter on Chomsky is contrasting the generative grammar approach, which Lakoff used to work within, to the cognitive science inspired cognitive linguistics approach, which Lakoff has been working in for the last few decades. Cognitive linguistics includes cognitive semantics which is rather different to generative semantics.

I largely agree with your critique, but more as a description of a different book that could have been written in this book's place. For example, a book on philosophy applying the results of this book's methodology, of which chapter 25 is a poor substitute. Or books drilling into one particular area in more detail with careful connections to the literature. This book serves better as an inspiring manifesto.

While these chapters are enlightening, they depend too heavily on the earlier account of metaphor, rarely draw upon other findings in cognitive sci

... (read more)

If the goal in exercise is to lose weight, have you tried replacing carbohydrates with fat in your diet? Forcing yourself to exercise will serve to work up an appetite and make you hungry, but not to lose weight. There is a correlation between exercising and being thin, but the causality is generally perceived the wrong way around. There is also a correlation between exercising and (temporarily) losing weight, but that is confounded by diet changes which typically involving reducing carbohydrate intake.

I've heard you mention Gary Taube's work, but not that... (read more)

1AndrewH
Yes, the refined carbohydrates are the real killer here. Eat as much meat as you want but no more white bread! The complete notes are a fantastic summary.

The T-rex is in the Valley Life Sciences Building. There's a few other fossils there too.

Idealized Bayesians don't have to be logically omniscient -- they can have a prior which assigns probability to logically impossible worlds.

I would be there, but I'm not back in NZ until 16th December! Everyone else should definitely go.

The Von-Neumann Morgenstern axioms talk just about preference over lotteries, which are simply probability distributions over outcomes. That is you have an unstructured set O of outcomes, and you have a total preordering over Dist(O) the set of probability distributions over O. They do not talk about a utility function. This is quite elegant, because to make decisions you must have preferences over distributions over outcomes, but you don't need to assume that O has a certain structure, e.g. that of the reals.

The expected utility theorem says that prefe... (read more)

1Stuart_Armstrong
Alas no. I've changed my post to explain the difficulties as I can change the mean and SD of any distribution by just changing my utility function. I have a new post up that argues that where small sums are concerned, you have to have a utility function linear in cash.

To be concrete, suppose you want to maximise the average utility people have, but you also care about fairness so, all things equal, you prefer the utility to be clustered about its average. Then maybe your real utility function is not

U = (U[1] + .... + U[n])/n

but

U' = U + ((U[1]-U)^2 + .... + (U[n]-U)^2)/n

which is in some sense a mean minus a variance.

0Stuart_Armstrong
Precisely the model I often have in mind (except I use the standard deviation, not the variance, as it is in the same units as the mean). But let us now see the problem with the independence axiom. Replace expected utility with phi="expected utility minus half the standard deviation". Then if A and B are two independent probability distributions, phi(A+B) >= phi(A) + phi(B) by Jensen's inequality, as the square root is a concave function. Equality happens only if the variance of A or B is zero. Now imagine that B and C are identical distributions with non-zero variances, and that A has no variance with phi(A) = phi(B) = phi(C). Then phi(A+B) = phi(A) + phi(B) = phi(B) + phi(C) < phi(B+C), violating independence. (if we use variance rather than standard deviation, we get phi(2B) < 2phi(B), giving similar results)

Can you translate your complaint into a problem with the independence axiom in particular?

Your second example is not a problem of variance in final utility, but aggregation of utility. Utility theory doesn't force "Giving 1 util to N people" to be equivalent to "Giving N util to 1 person". That is, it doesn't force your utility U to be equal to U1 + U2 + ... + UN where Ui is the "utility for person i".

1Nick Hay
To be concrete, suppose you want to maximise the average utility people have, but you also care about fairness so, all things equal, you prefer the utility to be clustered about its average. Then maybe your real utility function is not U = (U[1] + .... + U[n])/n but U' = U + ((U[1]-U)^2 + .... + (U[n]-U)^2)/n which is in some sense a mean minus a variance.

Your use of the terms parametric vs. nonparametric doesn't seem to be that used by people working in nonparametric Bayesian statistics, where the distinction is more like whether your statistical model has a fixed finite number of parameters or has no such bound. Methods such as Dirichlet processes, and its many variants (Hierarchical DP, HDP-HMM, etc), go beyond simple modeling of surface similarities using similarity of neighbours.

See, for example, this list of publications coauthored by Michael Jordan:

Thou Art Godshatter: gives an intuitive grasp for why and how human morality is complex, but that not any complex thing will do.

4grobstein
Totally agree -- helps if you can convince them to read Fire Upon the Deep, too. I'm not being facetious; the explicit and implicit background vocabulary (seems to) make it easier to understand the essays. (EDIT: to clarify, it is not that I think Fire in particular must be elevated as a classic of rationality, but that it's part of a smart sci/fi tradition that helps lay the ground for learning important things. There's an Eliezer webpage about this somewhere.)

How about buttons "High quality", "Low quality", "Accurate", "Inaccurate". We're increasing options here, but there's probably a nice way to design the interface to reduce the cognitive load.

Using the word "vote" seems broken here more generally -- we aren't implementing some democratic process, we're aggregating judgments (read: collecting evidence) across a population.

3AnnaSalamon
I completely agree about the word "vote". "High quality" / "Low quality" has good brevity, but for myself I'm still tempted to blend in agreement/disagreement with my ratings when I picture those words -- to regard comments I disagree with as "low quality". If we could have the question "Does this add to or subtract from the conversation?" surrounded by up/down arrows (or by "adds" / "subtracts"), I imagine myself voting better.

Because quality and truth are separate judgments in practice, and forcing them to be conflated into a single scale is losing information. To the extent that truth is positively correlated with quality this will fall out automatically: highly truthy posts will tend to have high quality. Low quality and high truth are not opposites.

3steven0461
I agree it's losing information, but that's something you have to weigh against the inconvenience of multiple dimensions. To the extent that truth is positively correlated with quality you're just making people click twice, and I suspect clicks are a limited resource. As I see it the voting system is there to put comments in a convenient order and remove the really bad ones from sight, not to provide opinion poll information.

Z. M. Davis: Good point, I was brushing that distinction under the rug. From this perspective all people arguing about values are trying to change someone's value computation, to a greater or lesser degree i.e. this is not the place to look if you want to discriminate between "liberal" and "conservative".

With the obvious way to implement a CEV, you start by modeling a population of actual humans (e.g. Earth's), then consider extrapolations of these models (know more, thought faster, etc). No "wipe culturally-defined values" step, however that would be defined.

Where was it suggested otherwise?

Ian C: neither group is changing human values as it is referred to here: everyone is still human, no one is suggesting neurosurgery to change how brains compute value. See the post value is fragile.

Interestingly, you can have unboundedly many children with only quadratic population growth, so long as they are exponentially spaced. For example, give each newborn sentient a resource token, which can be used after the age of maturity (say, 100 years or so) to fund a child. Additionally, in the years 2^i every living sentient is given an extra resource token. One can show there is at most quadratic growth in the number of resource tokens. By adjusting the exponent in 2^i we can get growth O(n^{1+p}) for any nonnegative real p.

Phil: Yes. CEV completely replaces and overwrites itself, by design. Before this point it does not interact with the external world to change it in a significant sense (it cannot avoid all change; e.g. its computer will add tiny vibrations to the Earth, as all computers do). It executes for a while then overwrites itself with a computer program (skipping every intermediate step here). By default, and if anything goes wrong, this program is "shutdown silently, wiping the AI system clean."

(When I say "CEV" I really mean a FAI which s... (read more)

Personally, I prefer the longer posts.

guest: right, so with those definitions you are overconfident if you are suprised more than you expected, underconfident if you are suprised less, calibration being how close your suprisal is to your expectation of it.

I think there's a sign error in my post -- C(x0) = \log p(x0) + H(p) it should be.

Anon: no, I mean the log probability. In your example, the calibratedness will generally be high: - \log 0.499 - H(p) ~= 0.00289 each time you see tails, and - log 0.501 - H(p) ~= - 0.00289 each time you come up tails. It's continuous.

Let's be specific. We have H(p) = - \sum_x p(x) \log p(x), where p is some probability distribution over a finite set. If we observe x0, the say the predictor's calibration is

C(x0) = \sum_x p(x) \log p(x) - \log p(x0) = - \log p(x0) - H(p)

so the expected calibration is 0 by the definition of H(p). The calibration is co... (read more)

Anon: well-calibrated means roughly that in the class of all events you think have probability p to being true, the proportion of them that turn out to be true is p.

More formally, suppose you have a probability distribution over something you are going to observe. If the log probability of the event which actually occurs is equal to the entropy of your distribution, you are well calibrated. If it is above you are over confident, if it is below you are under confident. By this measure, assigning every possibility equal probability will always be calibrated.

This is related to relative entropy.

Just in case it's not clear from the above: there are uncountably many degrees of freedom to an arbitrary complex function on the real line, since you can specify its value at each point independently.

A continuous function, however, has only countably many degrees of freedom: it is uniquely determined by its values on the rational numbers (or any dense set).

Eliezer: poetic and informative. I like it.

Tiiba:

The hypothesis is actual immortality, to which nonzero probability is being assigned. For example, suppose under some scenario your probability of dying at each time decreases by a factor of 1/2. Then, your total probability of dying is 2 times the probability of dying at the very first step, which we can assume far less than 1/2.

Eliezer: "You could see someone else's engine operating materially, through material chains of cause and effect, to compute by "pure thought" that 1 + 1 = 2. How is observing this pattern in someone else's brain any different, as a way of knowing, from observing your own brain doing the same thing? When "pure thought" tells you that 1 + 1 = 2, "independently of any experience or observation", you are, in effect, observing your own brain as evidence."

Richard: "It's just fundamentally mistaken to conflate reason... (read more)

One reason is Cox's theorem, which shows any quantitative measure of plausibility must obey the axioms of probability theory. Then this result, conservation of expected evidence, is a theorem.

What is the "confidence level"? Why is 50% special here?

Perhaps this formulation is nice:

0 = (P(H|E)-P(H))P(E) + (P(H|~E)-P(H))P(~E)

The expected change in probability is zero (for if you expected change you would have already changed).

Since P(E) and P(~E) are both positive, to maintain balance if P(H|E)-P(H) < 0 then P(H|~E)-P(H) > 0. If P(E) is large then P(~E) is small, so (P(H|~E)-P(H)) must be large to counteract (P(H|E)-P(H)) and maintain balance.

Hey, sorry if it's mad trivial, but may I ask for a derivation of this? You can start with "P(H) = P(H|E)P(E) + P(H|~E)P(~E)" if that makes it shorter.

(edit):

Never mind, I just did it. I'll post it for you in case anyone else wonders.

1} P(H) = P(H|E)P(E) + P(H|~E)P(~E) [CEE]
2} P(H)P(E) + P(H)P(~E) = P(H|E)P(E) + P(H|~E)P(~E) [because ab + (1-a)b = b]
3} (P(H) - P(H))P(E) + (P(H) - P(H))P(~E) = (P(H|E) - P(H))P(E) + (P(H|~E) - P(H))P(~E) [subtract P(H) from every value to be weighted]
4} (P(H) - P(H))P(E) + (P(H) - P(H))P(~E) = P(H) - P(H)... (read more)

It seems the point of the exercise is to think of non-obvious cognitive strategies, ways of thinking, for improving things. The chronophone translation is both a tool both for finding these strategies by induction, and a rationality test to see if the strategies are sufficiently unbiased and meta.

But what would I say? The strategy of searching for and correcting biases in thought, failures of rationality, would improve things. But I think I generated that suggestion by thinking of "good ideas to transmit" which isn't meta enough. Perhaps if I... (read more)