By and large, I would bet money that the devoted, experienced, and properly sequenced LWer is a better philosopher than the average current philosophy major concentrating in the analytic tradition. I say this because I have regular philosophical conversations with both populations, and notice many philosophical desiderata lacking in my conversations with my classmates, from my school and others, that I find abundantly on this website. Those desiderata are roughly the twelve virtues. I find that though my classmates have healthy doses of curiosity, empiricism, and even scholarship, they lack evenness, lightness, relinquishment, precision, perfectionism, and true humility.
How could that be? LW has built a huge positivized reductionist metaphysics, and a Bayesian epistemology which can almost be read as a self-improvement manual. These are unprecedented, and in some circles outrageous, truths. This is not to mention the original work that has been done in LW posts and comment trees on meta-ethics, ethics, biases, mathematics, rationality, quantum physics, economics, self-hacks, etc. We have here a self-updating, reliably transmittable, well-oiled machine, the likes of which philosophy has only rarely seen.
What is even more impressive to me about LW as a philosophical movement is that it seems to be nearly self-contained when it comes to philosophy. I mean that most experienced LWers probably haven't read very much Kant, maybe some Wittgenstein or Quine; but LWers can still somehow solve the problems that philosophers spend their lives on, building disconnected and competing philosophical systems specifically designed for each task, by the use of roughly one rather generally successful epistemology and metaphysics, which together can be called LWism.
So if you agree that LW does better philosophy than analytic philosophers, let's put our money where our mouths are, as our own philosophy suggests we should. I will post a series of discussion posts, each concentrating on one currentish question from academic philosophy. In each post, I will cover the essentials of the problem and provide external resources on it. Each post will also include a list of posts from the sequences which are recommended reading before participation. Each question will be one on which professional philosophers agree at less than 2 to 1 odds, i.e., if more than 2/3 of professional philosophers agree on an answer, we won't bother, so as not to waste our time on small fish.
You guys will then cooperate in comment trees to find solutions and decide amongst them. Then I'll compare the LW solutions to the solutions given by a random sampling of vaguely successful analytic philosophers (I will use a university search for my sampling). I will compare the ratio of types of solutions between the two populations, and look for solutions that occur in one population but not the other; then I'll post the results, hopefully the next week. (edit): This process of comparison will be the hardest part of this project for me, and if anyone with training or experience in statistics wants to help me with it, please let me know, and we can work on the comparison and the report thereof together. My prediction is that we will be able to quickly reach a high consensus on many issues that analytics have not internally resolved.
The series will be called the "Enthusiastic Youngsters Formally Tackle Analytic Problems Test", or "the Eyftapt series" [pronounced: afe-taped]. Alternatively, Eyftapt could stand for the "Eliezer Yudkowsky and Friends Train Amazing Philosophers Test." Besides shedding moderate light on our philosophical competence and toolbox juxtaposed with the analytic philosophical competence and toolbox, I'd also like to learn what LW training offers that analytics are currently missing, so that we can focus on that kind of training for our own benefit, and so that we can offer some advice to the analytics. That is, assuming my prediction that we'll do better is correct. This will not be as easy as comparing solutions, and I may need much more data than what I'll get out of this series, but it couldn't hurt to have a bunch of LWers doing difficult philosophy added to the available data.
What do you guys and gals think? Might you be interested in something like this? Mind you, it would be in discussion posts, since the main point is to discuss an issue.
(I know some of you cats don't like "philosophy", just call it "arguing about systems and elucidating messy language and thought in order to answer questions" instead. That is what I think we do better.)
BTW, if you have some problem you think we should work on, or if you think we would be really good at solving some problem or really bad at it compared to non-LW philosophy, message me or comment below, and I'll give you credit for the suggestion. These are the topics I have already decided on: universals/nominalism, correspondence/deflation/coherence, grue/induction, scientific realism/constructivism, what is math?, scientific underdetermination, a priori knowledge?, radical translation, the analytic/synthetic distinction, proper names/descriptions, the deduction/induction division, modality and possible worlds, what it means for a grammatical sentence to be meaningless and how you tell, meta-philosophy, i.e., questions about philosophy, and finally, personal identity, roughly to be posted in that order.
(edited after first posting; I just realized it may be worth mentioning that):
I was not happy about coming to this view. I have always thought of myself as an aspiring analytic philosopher, and even got attached to the aesthetics of analytic philosophy. I thought of analytic philosophy as the new science of philosophy that finally got it right. It bothered me to no end that I had been led to have more faith in the philosophical maturity/competence of a bunch of amateurs on a blog than in the experts and students of the field that I planned to spend the rest of my life on. I have committed myself to the methods of academic analytic philosophy publicly, in speeches and to my closest friends, colleagues, and family; to turn around in under a year and say that that was all naive enthusiasm, and that there's this blog of college kids that do it better, made me look very stupid in more than one eye I cared and care about. More than once, I have dissolved a question in my philosophy and cog-sci classes into an obvious cognitive error, explained why we are built to make this error, and left the class with little to do. Professors have praised me for this, and some even started approaching me outside of class to ask where I got my analysis from; their faces often came to a sincere awe when I told them: "I made it up myself, but all the methods I used are neatly organized, generalized, and exemplified in this text called the 'sequences' on this blog of youngsters called 'Less Wrong'. It's only a few hundred pages; kinda reads like G.E.B."
One day, a few months back, one of my professors who I am on a particularly friendly basis with asked me: "Every time we are in class and there is a question, you use this blog of yours, and it seems it gives you an answer for everything, so why are you still studying the analytics, instead of just studying your blog?" I think he meant to ask this question sardonically, but that is not how I took it. I took it as a serious question about how to optimize my time if my goal is to do good philosophy. Not having a good answer to this question, and craving one, probably more than anything, is what prompted me to think of doing this series.
I may be wrong, and it may be that LW has just as hard a time forming consensus on the issues that analytics have a hard time with, though I doubt it. But I am much more confident that, for some reason, even though I have had very good training, have a very high GPA, have read every classic philosophy text I could get my hands on, and had been reading several modern philosophy journals, all before I even knew about LW, LW has done more for my philosophical maturity, competence, and persuasiveness than the entirety of the rest of my training, and I wouldn't doubt that many others have had similar thoughts.
One of the core aims of the philosophy of probability is to explain the relationship between frequency and probability. The frequentist proposes identity as the relationship. This use of identity is highly dubious. We know how to check for identity between numbers, or even how to check for the weaker copula relation between particular objects; but how would we test the identity of frequency and probability? It is not immediately obvious that there is some simple value out there which is modeled by probability, like position and mass are values that are modeled by Newton's Principia. You can actually check if density * volume = mass, by taking separate measurements of mass, density and volume, but what would you measure to check a frequency against a probability?
There are certain appeals to frequentist philosophy: we would like to say that if a bag has 100 balls in it, only 1 of which is white, then the probability of drawing the white ball is 1/100, and that if we take a non-white ball out, the probability of drawing the white ball is now 1/99. Frequentism makes the philosophical justification of that inference trivial. But of course, anything a frequentist can do, a Bayesian can do (better). I mean that literally: it's the stronger magic.
A Subjective Bayesian, more or less, says that the reason frequencies are related to probabilities is because when you learn a frequency you thereby learn a fact about the world, and one must update one's degrees of belief on every available fact. The subjective Bayesian actually uses the copula in another strange way:
Probability is subjective degree of belief.
and subjective Bayesians also claim:
Probabilities are not in the world, they are in your mind.
These two statements are brilliantly championed in Probability is Subjectively Objective. But ultimately, the formalism which I would like to suggest denies both of these statements. Formalists do not ontologically commit themselves to probabilities, just as they do not say that numbers exist; hence we don't locate probabilities in the mind or anywhere else; we only commit ourselves to number theory and probability theory. Mathematical theories are simply repeatable processes which construct certain sequences of squiggles called "theorems", by changing the squiggles of other theorems according to certain rules called "inferences". An inference always takes as input certain sequences of squiggles called premises, and outputs a sequence of squiggles called the conclusion. The only thing an inference ever does is add squiggles to a theorem, take away squiggles from a theorem, or both. It turns out that these squiggle sequences mixed with inferences can talk about almost anything, certainly any computable thing. The formalist does not need to ontologically commit to numbers to assert "There is a prime greater than 10000.", even though "There is x such that" is a flat assertion of existence, because for the formalist "There is a prime greater than 10000." simply means that number theory contains a theorem which is interpreted as "there is a prime greater than 10000." When you state a mathematical fact in English, you are interpreting a theorem of a formal theory. If, under your suggested interpretation, all of the theorems of the theory are true, then whatever system/mechanism your interpretation of the theory talks about is said to be modeled by the theory.
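To make the squiggle picture concrete, here is a minimal toy sketch (my own illustration, not any standard system): a one-axiom, one-rule "theory" whose theorems, under one interpretation, come out as true statements about numbers.

```python
# A toy formal theory: theorems are strings ("squiggles"), inferences rewrite them.
# The axiom and rule are made up purely for illustration.

AXIOM = "0=0"

def successor_rule(theorem):
    """Inference: from 'x=y' produce 'Sx=Sy' (prepend 'S' to both sides)."""
    left, right = theorem.split("=")
    return "S" + left + "=S" + right

def theorems(n):
    """Generate the first n theorems by repeatedly applying the rule to the axiom."""
    t, out = AXIOM, [AXIOM]
    for _ in range(n - 1):
        t = successor_rule(t)
        out.append(t)
    return out

def interpret(theorem):
    """One interpretation: read 'S...S0' as a numeral and '=' as equality of numbers."""
    left, right = theorem.split("=")
    return left.count("S") == right.count("S")

print(theorems(4))                              # ['0=0', 'S0=S0', 'SS0=SS0', 'SSS0=SSS0']
print(all(interpret(t) for t in theorems(10)))  # True: the interpretation models the theory
```

Since every theorem comes out true under this reading, the little system of numerals and equality is "modeled by" the toy theory in exactly the sense used above.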
So, what is the relation between frequency and probability proposed by formalism? Theorems of probability may be interpreted as true statements about frequencies, when you assign certain squiggles certain words and claim the resulting natural language sentence. Or, for short, we can say: "Probability theory models frequency." It is trivial to show that Kolmogorov's theory models frequency, since it also models fractions; it is an algebra after all. More interestingly, probability theory models rational distributions of subjective degrees of belief, and the optimal updating of degrees of belief given new information. This is somewhat harder to show; Dutch book arguments do nicely to at least provide some intuitive understanding of the relation between degree of belief, betting, and probability, but there is still work to be done here. If Bayesian probability theory really does model rational belief, which many believe it does, then that is likely the most interesting thing we are ever going to be able to model with probability. But probability theory also models spatial measurement. Why not add the position that probability is volume to the debating lines of the philosophy of probability?
Why are frequentism's and subjective Bayesianism's misuses of the copula not as obvious as volumeism's? Because what the Bayesian and the frequentist are really arguing about is statistical methodology; they've just disguised the argument as an argument about what probability is. Your interpretation of probability theory will determine how you model uncertainty, and hence determine your statistical methodology. Volumeism cannot handle uncertainty in any obvious way; however, the Bayesian and frequentist interpretations of probability theory imply two radically different ways of handling uncertainty.
The easiest way to understand the philosophical dispute between the frequentist and the subjective Bayesian is to look at the classic biased coin:
A subjective Bayesian and a frequentist are at a bar, and the bartender (being rather bored) tells the two that he has a biased coin, and asks them: "What is the probability that the coin will come up heads on the first flip?" The frequentist says that for the coin to be biased means for it not to have a 50% chance of coming up heads, so all we know is that the probability is not equal to 50%. The Bayesian says that any evidence he has for the coin coming up heads is also evidence for it coming up tails, since he knows nothing about one outcome that doesn't hold for its negation, and the only value which represents that symmetry is 50%.
I ask you. What is the difference between these two, and the poor souls engaged in endless debate over realism about sound in the beginning of Making Beliefs Pay Rent?
If a tree falls in a forest and no one hears it, does it make a sound? One says, "Yes it does, for it makes vibrations in the air." Another says, "No it does not, for there is no auditory processing in any brain."
One is being asked: "Are there pressure waves in the air if we aren't around?" The other is being asked: "Are there auditory experiences if we are not around?" The problem is that "sound" is being used to stand for both "auditory experience" and "pressure waves through air". They are both giving the right answers to these respective questions, but they are failing to Replace the Symbol with the Substance, and they're using one word with two different meanings in different places. In the exact same way, "probability" is being used to stand for both "frequency of occurrence" and "rational degree of belief" in the dispute between the Bayesian and the frequentist. The correct answer to the question "If the coin is flipped an infinite number of times, how frequently would we expect to see it land on heads?" is "All we know is that it wouldn't be 50%," because that is what it means for the coin to be biased. The correct answer to the question "What is the optimal degree of belief that we should assign to the first trial being heads?" is "Precisely 50%," because of the symmetrical evidential support the results get from our background information. How we should actually model the situation as statisticians depends on our goal. But remember that Bayesianism is the stronger magic, and the only contender for perfection in the competition.
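To see the Bayesian half of that dissolution numerically, here is a minimal sketch. The uniform prior over the unknown bias is my own illustrative assumption; any prior symmetric about 0.5 gives the same first-flip answer.

```python
import numpy as np

# Assumption for illustration: the unknown bias is drawn from a prior symmetric
# around 0.5, excluding 0.5 itself (the coin is known to be "biased").
rng = np.random.default_rng(0)
biases = rng.uniform(0.0, 1.0, size=1_000_000)
biases = biases[np.abs(biases - 0.5) > 1e-6]   # "biased": not exactly fair

# Degree of belief that the first flip is heads = expected bias under the prior.
p_first_heads = biases.mean()
print(round(p_first_heads, 3))   # ~0.5, even though no single admissible bias equals 0.5

# The frequentist question ("what is the long-run frequency of heads?") has a
# different answer: whatever the actual bias is; all we know is that it isn't 0.5.
```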
For us formalists, probabilities are not anywhere. Technically, we do not even believe in probability; we only believe in probability theory. The only coherent uses of "probability" in natural language are purely syncategorematic. We should be very careful when we colloquially use "probability" as a noun or verb, and be very careful and clear about what we mean by this word play. Probability theory models many things, including degree of belief and frequency. Whatever we may learn about rationality, frequency, measure, or any of the other mechanisms that probability theory models, through the interpretation of probability theorems, we learn because probability theory is isomorphic to those mechanisms. When you use the copula like the frequentist or the subjective Bayesian, it becomes hard to notice that probability theory modeling both frequency and degree of belief is not a contradiction. If we use "is" instead of "models", it is clear that frequency is not degree of belief; so if probability is belief, then it is not frequency. But though frequency is not degree of belief, frequency does model degree of belief; so if probability theory models frequency, it must also model degree of belief.
This is a first stab at solving Goodman's famous grue problem. I haven't seen a post on LW about the grue paradox, and this surprised me, since I had figured that if any arguments would be raised against Bayesian LW doctrine, it would be the grue problem. I haven't looked at many proposed solutions to this paradox, besides some of the basic ones in "The New Problem of Induction". So, I apologize now if my solution is wildly unoriginal. I am willing to put you through this, dear reader, because:
- I wanted to see how I would fare against this still largely open, devastating, and classic problem, using only the arsenal provided to me by my minimal Bayesian training, and my regular LW reading.
- I wanted the first LW article about the grue problem to attack it from a distinctly Lesswrongian approach, without the benefit of hindsight knowledge of the solutions of non-LW philosophy.
- And lastly, because, even if this solution has been found before, if it is the right solution, it is to LW's credit that its students can solve the grue problem with only the use of LW skills and cognitive tools.
I would also like to warn the savvy subjective Bayesian that just because I think that probabilities model frequencies, and that I require frequencies out there in the world, does not mean that I am a frequentist or a realist about probability. I am a formalist with a grain of salt. There are no probabilities anywhere in my view, not even in minds; but the theorems of probability theory, when interpreted, share a fundamental contour with many important tools of the inquiring mind, including both the nature of frequency and the set of rational subjective belief systems. There is nothing more to probability than that system which produces its theorems.
Lastly, I would like to say that even if I have not succeeded here (which I think I have), there is likely something valuable that can be made from the leftovers of my solution after the onslaught of penetrating critiques that I expect from this community. Solving this problem is essential to LW's methods, and our arsenal is fit to handle it. If we are going to be taken seriously in the philosophical community as a new movement, we must solve serious problems from academic philosophy, and we must do it in distinctly Lesswrongian ways.
"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
The nth emerald ever observed was green.
There is a very high probability that a never before observed emerald will be green."
That is the inference that the grue problem threatens, courtesy of Nelson Goodman. The grue problem starts by defining "grue":
"An object is grue iff it is first observed before time T, and it is green, or it is first observed after time T, and it is blue."
So you see that before time T, from the list of premises:
"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
The nth emerald ever observed was green."
(we will call these the green premises)
it follows that:
"The first emerald ever observed was grue.
The second emerald ever observed was grue.
The third emerald ever observed was grue.
The nth emerald ever observed was grue."
(we will call these the grue premises)
The proposer of the grue problem asks at this point: "So if the green premises are evidence that the next emerald will be green, why aren't the grue premises evidence for the next emerald being grue?" If an emerald is grue after time T, it is not green. Let's say that the green premises bring the probability of "A new unobserved emerald is green." to 99%. On the skeptic's hypothesis, by symmetry they should also bring the probability of "A new unobserved emerald is grue." to 99%. But of course, after time T, this would mean that the probability of observing a green emerald is 99%, and the probability of not observing a green emerald is at least 99%. Since these sentences have no intersection, i.e., they cannot happen together, to find the probability of their disjunction we just add their individual probabilities. This must give us a number at least as big as 198%, which is of course a contradiction of the Kolmogorov axioms: we should not be able to form a statement with a probability greater than one.
This threatens the whole of science, because you cannot simply keep this isolated to emeralds and color. We may think of the emeralds as trials, and green as the value of a random variable. Ultimately, every result of a scientific instrument is a random variable, with a very particular and useful distribution over its values. If we can't justify inferring probability distributions over random variables based on their previous results, we cannot justify a single bit of natural science. This, of course, says nothing about how it works in practice. We all know it works in practice. "A philosopher is someone who says, 'I know it works in practice, I'm trying to see if it works in principle.'" - Dan Dennett
We may look at an analogous problem. Let's suppose that there is a table, that balls are being dropped on this table, and that there is an infinitely thin line drawn perpendicular to the edge of the table somewhere, whose position we are unaware of. The problem is to figure out the probability of the next ball landing right of the line given the previous results. Our first prediction should be that there is a 50% chance of the ball landing right of the line, by symmetry. If we get the result that one ball landed right of the line, by Laplace's rule of succession we infer that there is a 2/3 chance that the next ball will land right of the line. After n trials, if every trial gives a positive result, the probability we should assign to the next trial being positive as well is (n+1)/(n+2).
If this line were placed 2/3 of the way down the table, we should expect the ratio of Rights to Lefts to approach 2:1. This gives us a 2/3 chance of the next ball being a Right, and the fraction of Rights out of trials approaches 2/3 ever more closely as more trials are performed.
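A quick simulation of the table (the details are my own illustrative choices, assuming as above that 2/3 of the table lies to the right of the hidden line) shows both behaviors: the long-run frequency of Rights approaches 2/3, and Laplace's rule of succession assigns (n+1)/(n+2) after n all-Right trials.

```python
import random

random.seed(1)
LINE = 1 / 3   # line position chosen so 2/3 of the table lies to its right (the post's 2:1 ratio)

def lands_right():
    """A ball lands uniformly at random along the table; True if it lands right of the line."""
    return random.random() > LINE

# Long-run frequency: the Rights:Lefts ratio approaches 2:1.
results = [lands_right() for _ in range(100_000)]
print(sum(results) / len(results))          # ~0.667

# Laplace's rule of succession: after n all-"Right" trials, assign (n+1)/(n+2) to the next trial.
def laplace_estimate(successes, trials):
    return (successes + 1) / (trials + 2)

print([round(laplace_estimate(n, n), 3) for n in range(5)])  # 0.5, 0.667, 0.75, 0.8, 0.833
```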
Now let us suppose a grue skeptic approaching this situation. He might make up two terms "reft" and "light". Defined as you would expect, but just in case:
"A ball is reft of the line iff it is right of it before time T when it lands, or if it is left of it after time T when it lands.
A ball is light of the line iff it is left of the line before time T when it lands, or if it is right of the line after time T when it first lands."
The skeptic would continue:
"Why should we treat the observation of several occurrences of Right, as evidence for 'The next ball will land on the right.' and not as evidence for 'The next ball will land reft of the line.'?"
Things for some reason become perfectly clear at this point for the defender of Bayesian inference, because now we have an easy-to-imagine model. Of course, if a ball landing right of the line is evidence for Right, then it cannot possibly be evidence for ~Right; to be evidence for Reft, after time T, is to be evidence for ~Right, because after time T, Reft is logically identical to ~Right; hence it is not evidence for Reft, after time T, for the same reasons it is not evidence for ~Right. Of course, before time T, any evidence for Reft is evidence for Right, for analogous reasons.
But now the grue skeptic can say something brilliant, that stops much of what the Bayesian has proposed dead in its tracks:
"Why can't I just repeat that paragraph back to you and swap every occurrence of 'right' with 'reft' and 'left' with 'light', and vice versa? They are perfectly symmetrical in terms of their logical realtions to one another.
If we take 'reft' and 'light' as primitives, then we have to define 'right' and 'left' in terms of 'reft' and 'light' with the use of time intervals."
What can we possibly reply to this? Can he/she not do this with every argument we propose, then? Certainly, the skeptic admits that Bayes, and the contradiction between Right and Reft after time T, prohibit previous Rights from being evidence for both Right and Reft after time T; where he is challenging us is in choosing Right as the result which they are evidence for, even though "Reft" and "Right" have a completely symmetrical syntactical relationship. There is nothing about the definitions of reft and right which distinguishes them from each other, except their spelling. So is that it? No, this simply means we have to propose an argument that doesn't rely on purely syntactical reasoning, so that if the skeptic performs the swap on our argument, the resulting argument is no longer sound.
What would happen in this scenario if it were actually set up? I know that seems like a strangely concrete question for a philosophy text, but its answer is a helpful hint. What would happen is that after time T, the behavior of the ratio Rights:Lefts as more trials were added would proceed as expected, and the ratio Refts:Lights would approach the reciprocal of the ratio Rights:Lefts. The only way for this not to happen is for us to have been calling the right side of the table "reft", or for the line to have moved. We can only figure out where the line is by knowing where the balls landed relative to it; anything we can figure out about where the line is from knowing which balls landed Reft and which ones landed Light, we can only figure out because, in knowing this and the time, we can know whether the ball landed left or right of the line.
To this I know of no reply which the grue skeptic can make. If he/she says the paragraph back to me with the proper words swapped, it is not true, because in the hypothetical where we have a table, a line, and we are calling one side right and another side left, the only way for Refts:Lights to behave as expected as more trials are added is for the line to move (if even that); otherwise the ratio of Refts to Lights will approach the reciprocal of Rights to Lefts.
This thin line is analogous to the frequency of emeralds that turn out green out of all the emeralds that get made. This is why we can assume that the line will not move, because that frequency has one precise value, which never changes. Its other important feature is reminding us that even if two terms are syntactically symmetrical, they may have semantic conditions for application which are ignored by the syntactical model, e.g., checking to see which side of the line the ball landed on.
Every random variable has, as a part of it, stored in its definition/code, a frequency distribution over its values. By the fact that some things happen sometimes, and others happen other times, we know that the world contains random variables, even if they are never fundamental in the source code. Note that "frequency" is not used here as a state of partial knowledge; it is a fact about a set and one of its subsets.
The reason that:
"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
The nth emerald ever observed was green.
There is a very high probability that a never before observed emerald will be green."
is a valid inference, but the grue equivalent isn't, is that grue is not a property that the emerald construction sites of our universe deal with. They are blind to the grueness of their emeralds; they only say anything about whether or not the next emerald will be green. It may be that the rule that the emerald construction sites use to get either a green or non-green emerald changes at time T, but the frequency of some particular result out of all trials will never change; the line will not move. As long as we know what symbols we are using for what values, observing many green emeralds is evidence that the next one will be grue only so long as it is before time T; after time T, every record of an observation of a green emerald is evidence against a grue one. "Grue" changes meaning from green to blue at time T; "green"'s meaning stays the same, since we are using the same physical test to determine green-hood as before, just as we use the same test to tell whether the ball landed right or left. There is no reft in the universe's source code, and there is no grue. Green is not fundamental in the source code either, but green can be reduced to some particular range of quanta states; if you had the universe's source code, you couldn't write grue without first writing green, and writing green without knowing a thing about grue would be just as easy as writing it while knowing grue. Having a physical test, or primary condition for applicability, is what privileges green over grue after time T; to have a consistent physical test is the same as to reduce to a specifiable range of physical parameters; the existence of such a test is what prevents the skeptic from performing his/her swaps on our arguments.
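A rough sketch of the source-code point (the wavelength band and the value of T are made-up stand-ins for whatever the real physical test and time are): you can write the green test without ever mentioning grue, but you cannot write the grue test without the green test plus a clock.

```python
T = 1_700_000_000   # the grue problem's time T, here as an arbitrary illustrative timestamp

def is_green(wavelength_nm):
    """'Green' reduces to a physical test: reflected light in roughly the 495-570 nm band.
    (The exact band is an illustrative stand-in for the real physical test.)"""
    return 495 <= wavelength_nm <= 570

def is_blue(wavelength_nm):
    return 450 <= wavelength_nm < 495

def is_grue(wavelength_nm, first_observed_at):
    """'Grue' cannot be written without already having the green (and blue) tests
    plus a clock: it has no physical test of its own."""
    if first_observed_at < T:
        return is_green(wavelength_nm)
    return is_blue(wavelength_nm)
```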
Take this more as a brainstorm than as a final solution. It wasn't originally, but it should have been. I'll write something more organized and concise after I think about the comments more, and make some graphics I've designed that make my argument much clearer, even to myself. But keep those comments coming, and tell me if you want specific credit for anything you may have added to my grue toolkit in the comments.
Why don't we start treating the log base 2 of the probability — conditional on every available piece of information — you assign to the great conjunction as the best measure of your epistemic success? Let's call log_2(P(the great conjunction | your available information)) your "Bayesian competence". It is a deductive fact that no other proper scoring rule could possibly give: Score(P(A|B)) + Score(P(B)) = Score(P(A&B)), and obviously, you should get the same score for assigning P(A|B) to A after observing B, and assigning P(B) to B a priori, as you would get for assigning P(A&B) to A&B a priori. The great conjunction is the conjunction of all true statements expressible in your idiolect. Your available information may be treated as the ordered set of your retained stimuli.
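A small sketch of the additivity property just cited, using the base-2 logarithmic score (the particular probabilities are arbitrary illustrative numbers): because log2(xy) = log2(x) + log2(y), scoring P(B) and P(A|B) separately gives the same total as scoring P(A&B) directly.

```python
from math import log2, isclose

def log_score(p):
    """Logarithmic score for assigning probability p to a statement that turned out true."""
    return log2(p)

p_b, p_a_given_b = 0.4, 0.25
p_a_and_b = p_b * p_a_given_b   # P(A&B) = P(B) * P(A|B)

# Additivity: score(P(B)) + score(P(A|B)) == score(P(A&B))
assert isclose(log_score(p_b) + log_score(p_a_given_b), log_score(p_a_and_b))
print(log_score(p_b) + log_score(p_a_given_b), log_score(p_a_and_b))  # both ~ -3.32
```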
It is standard LW doctrine that we should not name the highest value of rationality, and it is often defended quite brilliantly:
You may try to name the highest principle with names such as “the map that reflects the territory” or “experience of success and failure” or “Bayesian decision theory”. But perhaps you describe incorrectly the nameless virtue. How will you discover your mistake? Not by comparing your description to itself, but by comparing it to that which you did not name.
and of course also:
How can you improve your conception of rationality? Not by saying to yourself, “It is my duty to be rational.” By this you only enshrine your mistaken conception. Perhaps your conception of rationality is that it is rational to believe the words of the Great Teacher, and the Great Teacher says, “The sky is green,” and you look up at the sky and see blue. If you think: “It may look like the sky is blue, but rationality is to believe the words of the Great Teacher,” you lose a chance to discover your mistake.
These quotes are from the end of Twelve Virtues.
Should we really be wondering if there's a virtue higher than Bayesian competence? Is there really a probability worth worrying about that the description of Bayesian competence above is misunderstood? Is the description not simple enough to be mathematical? What mistake might I discover in my understanding of Bayesian competence by comparing it to that which I did not name, after I've already given a proof that Bayesian competence is proper, and that the restrictions (that score(P(B) * P(A|B)) = score(P(B)) + score(P(A|B)), and that the score must be a proper scoring rule) uniquely specify log_b?
I really want answers to these questions. I am still undecided about them; and change my mind about them far too often.
Of course, your bayesian competence is ridiculously difficult to compute. But I am not proposing the measure for practical reasons. I am proposing the measure to demonstrate that degree of rationality is an objective quantity that you could compute given the source code to the universe, even though there are likely no variables in the source that ever take on this value. This may be of little to no value to the most obsessively pragmatic practitioners of rationality. But it would be a very interesting result to philosophers of science and rationality.
Updated to better express the view of the author, and to take feedback into account. Apologies to any commenter whose comment may have been nullified.
The comment below:
The general reason Eliezer advocates not naming the highest virtue (as I understand it) is that there may be some type of problem for which bayesian updating (and the scoring rule referred to) yields the wrong answer. This idea sounds rather improbable to me, but there is a non-negligible probability that bayes will yield a wrong answer on some question. Not naming the virtue is supposed to be a reminder that if bayes ever gives the wrong answer, we go with the right answer, not bayes.
has changed my mind about the openness of the questions I asked.
Can anyone tell me why it is that if I use my rationality exclusively to improve my conception of rationality, I fall into an infinite recursion? EY says this in The Twelve Virtues and in Something to Protect, but I don't know what his argument is. He goes as far as to say that you must subordinate rationality to a higher value.
I understand that by committing yourself to your rationality you lose out on the chance to notice if your conception of rationality is wrong. But what if I use the reliability of win that a given conception of rationality offers me as the only guide to how correct that conception is. I can test reliability of win by taking a bunch of different problems with known answers that I don't know, solving them using my current conception of rationality and solving them using the alternative conception of rationality I want to test, then checking the answers I arrived at with each conception against the right answers. I could also take a bunch of unsolved problems and attack them from both conceptions of rationality, and see which one I get the most solutions with. If I solve a set of problems with one, that isn't a subset of the set of problems I solved with the other, then I'll see if I can somehow take the union of the two conceptions. And, though I'm still not sure enough about this method to use it, I suppose I could also figure out the relative reliability of two conceptions by making general arguments about the structures of those conceptions; if one conception is "do that which the great teacher says" and the other is "do that which has maximal expected utility", I would probably not have to solve problems using both conceptions to see which one most reliably leads to win.
And what if my goal is to become as epistemically rational as possible? Then I would just be looking for the conception of rationality that leads to truth most reliably, testing truth by predictive power.
And if being rational for its own sake just doesn't seem like its valuable enough to motivate me to do all the hard work it requires, let's assume that I really really care about picking the best conception of rationality I know of, much more than I care about my own life.
It seems to me that if this is how I do rationality for its own sake — always looking for the conception of goal-oriented rationality which leads to win most reliably, and the conception of epistemic rationality which leads to truth most reliably — then I'll always switch to any conception I find that is less mistaken than mine, and stick with mine when presented with a conception that is more mistaken, provided I am careful enough about my testing. And if that means I practice rationality for its own sake, so what? I practice music for its own sake too. I don't think that's the only or best reason to pursue rationality, certainly some other good and common reasons are if you wanna figure something out or win. And when I do eventually find something I wanna win or figure out that no one else has (no shortage of those), if I can't, I'll know that my current conception isn't good enough. I'll be able to correct my conception by winning or figuring it out, and then thinking about what was missing from my view of rationality that wouldn't let me do that before. But that wouldn't mean that I care more about winning or figuring some special fact than I do about being as rational as possible; it would just mean that I consider my ability to solve problems a judge of my rationality.
I don't understand what I lose out on if I pursue the Art for its own sake in the way described above. If you do know of something I would lose out on, or if you know Yudkowsky's original argument showing the infinite recursion when you motivate yourself to be rational by your love of rationality, then please comment and help me out. Thanks ahead of time.
A recent post, Consistently Inconsistent, raises some problems with the unitary view of the mind/brain, and presents the modular view of the mind as an alternative hypothesis. The parallel/modular view of the brain not only deals better with the apparently hypocritical and contradictory ways our desires, behaviors, and beliefs seem to work, but also makes many successful empirical predictions, as well as postdictions. Much of that work can be found in Dennett's 1991 book, "Consciousness Explained", which details both the empirical evidence against the unitary view and the intuition-fails involved in retaining a unitary view after being presented with that evidence.
The aim of this post is not to present further evidence in favor of the parallel view, nor to hammer any more nails into the unitary view's coffin; the scientific and philosophical communities have done well enough in both departments to discard the intuitive hypothesis that there is some executive of the mind keeping things orderly. The dilemma I wish to raise is a question: "How should we update our decision theories to deal with independent, and sometimes inconsistent, desires and beliefs being had by one agent?"
If we model one agent's desires with one utility function, and this function orders the outcomes the agent can reach on one real axis, then it seems like we might be falling back into the intuitive view that there is some me in there with one definitive list of preferences. The picture given to us by Marvin Minsky and Dennett involves a bunch of individually dumb agents, each with a unique set of specialized abilities and desires, interacting in such a way as to produce one smart agent with a diverse set of abilities and desires; but the smart agent only appears when viewed from the right level of description. For convenience, we will call those dumb, specialized agents "subagents", and the smart, diverse agent that emerges from their interaction "the smart agent". When one considers what it would be useful for a seeing-neural-unit to want to do, and contrasts it with what it would be useful for a get-that-food-neural-unit to want to do, e.g., examine that prey longer vs. charge that prey, turn head vs. keep running forward, stay attentive vs. eat that food, etc., it becomes clear that cleverly managing which unit gets how much control, and when, is an essential part of the decision making process of the whole. Decision theory, as far as I can tell, does not model any part of that managing process; instead we treat the smart agent as having its own set of desires, and don't discuss how the subagents' goals are being managed to produce that global set of desires.
It is possible that the many subagents in a brain, when they operate in concert, act isomorphically to an agent with one utility function and a unique problem space. A trivial example of such an agent might have only two subagents, "A" and "B", and possible outcomes O1 through On. We can plot the utilities that each subagent gives to these outcomes on a two dimensional positive Cartesian graph, A's assigned utilities being represented by position in X, and B's utilities by position in Y. The method by which these subagents are managed to produce behavior might just be: go for the possible outcome furthest from (0,0); in which case, the utility function of the whole agent, U(Ox), would just be the distance from (0,0) to (A's U(Ox), B's U(Ox)).
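Here is a minimal sketch of that trivial two-subagent example (the outcome utilities are made-up numbers): the managing rule "go for the outcome furthest from (0,0)" induces a single emergent utility function for the smart agent, namely distance from the origin.

```python
from math import hypot

# Two subagents assign their own utilities to the same outcomes (illustrative numbers).
subagent_A = {"O1": 3.0, "O2": 1.0, "O3": 2.0}
subagent_B = {"O1": 0.5, "O2": 4.0, "O3": 2.0}

def managing_algorithm(outcomes):
    """Trivial manager from the example: go for the outcome furthest from (0, 0)."""
    return max(outcomes, key=lambda o: hypot(subagent_A[o], subagent_B[o]))

def smart_agent_utility(o):
    """The emergent utility function of the whole agent: distance from the origin."""
    return hypot(subagent_A[o], subagent_B[o])

print(managing_algorithm(["O1", "O2", "O3"]))                     # 'O2'
print({o: round(smart_agent_utility(o), 2) for o in subagent_A})  # {'O1': 3.04, 'O2': 4.12, 'O3': 2.83}
```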
An agent which manages its subagents so as to be isomorphic to one utility function on one problem space is certainly mathematically describable, but also implausible. It is unlikely that the actual physical-neural subagents in a brain deal with the same problem spaces, i.e., they each have their own unique set of O1 through On. It is not as if all the subagents are playing the same game, but each has a unique goal within that game; they each have their own unique set of legal moves too. This makes it problematic to model the global utility function of the smart agent as assigning one real number to every member of a set of possible outcomes, since there is no one set of possible outcomes for the smart agent as a whole. Each subagent has its own search space, with its own format of representation for that problem space. The problem space and utility function of the smart agent are implicit in the interactions of the subagents; they emerge from the interactions of agents on a lower level; the smart agent's utility function and problem space are never explicitly written down.
A useful example is smokers who are quitting. Some part of their brains that can do complicated predictions doesn't want its body to smoke. This part of the brain wants to avoid death, i.e., will avoid death if it can, and knows that choosing the possible outcome of smoking puts its body at high risk of death. Another part of their brains wants nicotine, and knows that choosing the move of smoking gets it nicotine. The nicotine-craving subagent doesn't want to die, but it also doesn't want to stay alive; those outcomes aren't in the domain of the nicotine subagent's utility function at all. The part of the brain responsible for predicting its body's death if it continues to smoke probably isn't significantly rewarded by nicotine, in a parallel manner. If a cigarette is around and offered to the smart agent, these subagents must compete for control of the relevant parts of their body, e.g., the nicotine subagent might set off a global craving, while the predict-the-future subagent might set off a vocal response saying "no thanks, I'm quitting." The overall desire of the smart agent to smoke or not smoke is just the result of this competition. Similar examples can be made with different desires, like a desire to overeat and a desire to look slim, or the desire to stay seated and the desire to eat a warm meal.
We may call the algorithm which settles these internal power struggles the "managing algorithm", and we may call a decision theory which models managing algorithms a "parallel decision theory". It's not the business of decision theorists to discover the specifics of the human managing process; that's the business of empirical science. But certain parts of the human managing algorithm can be reasonably decided on. It is very unlikely that our managing algorithm is utilitarian, for example, i.e., the smart agent doesn't do whatever gets the highest net utility for its subagents. Some subagents are more powerful than others; they have a higher prior chance of success than their competitors; others are weak in a parallel fashion. The question of what counts as one subagent in the brain is another empirical question which is not the business of decision theorists either, but anything that we do consider a subagent in a parallel theory must solve its problem in the form of a CSA, i.e., it must internally represent its outcomes, know what outcomes it can get to from whatever outcome it is at, and assign a utility to each outcome. There are likely many neural units that fit that description in the brain. Many of them probably contain as parts subsubagents which also fit this description, but eventually, if you divide the parts enough, you get to neurons, which are not CSAs, and thus not subagents.
If we want to understand how we make decisions, we should try to model a CSA which is made out of more specialized sub-CSAs competing and agreeing, which are made out of further specialized sub-sub-CSAs competing and agreeing, which are made out of, etc., which are made out of non-CSA algorithms. If we don't understand that, we don't understand how brains make decisions.
I hope that the considerations above are enough to convince reductionists that we should develop a parallel decision theory if we want to reduce decision making to computing. I would like to add an axiomatic parallel decision theory to the LW arsenal, but I know that that is not a one man/woman job. So, if you think you might be of help in that endeavor, and are willing to devote yourself to some degree, please contact me at firstname.lastname@example.org. Any team we assemble will likely not meet in person often, and will hopefully frequently meet on some private forum. We will need decision theorists, general mathematicians, people intimately familiar with the modular theory of mind, and people familiar with neural modeling. What follows are some suggestions for any team or individual that might pursue that goal independently:
- The specifics of the managing algorithm used in brains are mostly unknown. As such, any parallel decision theory should be built to handle as diverse a range of managing algorithms as possible.
- No composite agent should have any property that is not reducible to the interactions of the agents it is made out of. If you have a complete description of the subagents, and a complete description of the managing algorithm, you have a complete description of the smart agent.
- There is nothing wrong with treating the lowest level of CSAs as black boxes. The specifics of the non-CSA algorithms, which the lowest level CSAs are made out of are not relevant to parallel decision theory.
- Make sure that the theory can handle each subagent having its own unique set of possible outcomes, and its own unique method of representing those outcomes.
- Make sure that each CSA above the lowest level actually has "could", "should", and "would" labels on the nodes in its problem space, and make sure that those labels, their values, and the problem space itself can be reduced to the managing of the CSAs on the level below.
- Each level above the lowest should have CSAs dealing with a more diverse range of problems than the ones on the level below. The lowest level should have the most specialized CSAs.
- If you've achieved the six goals above, try comparing your parallel decision theory to other decision theories; see how much predictive accuracy is gained by using a parallel decision theory instead of the classical theories.
Before I read Probability is in the Mind and Probability is Subjectively Objective, I was a realist about probabilities; I was a frequentist. After I read them, I was just confused. I couldn't understand how a mind could accurately say the probability of getting a heart in a standard deck of playing cards was not 25%. It wasn't until I tried to explain the contrast between my view and the subjective view, in a comment on Probability is Subjectively Objective, that I realized I was a subjective Bayesian all along. So, if you've read Probability is in the Mind and Probability is Subjectively Objective but still feel a little confused, hopefully this will help.
I should mention that I'm not sure that EY would agree with my view of probability, but the view to be presented agrees with EY's view on at least these propositions:
- Probability is always in a mind, not in the world.
- The probability that an agent should ascribe to a proposition is directly related to that agent's knowledge of the world.
- There is only one correct probability to assign to a proposition given your partial knowledge of the world.
- If there is no uncertainty, there is no probability.
And any position that holds these propositions is a non-realist-subjective view of probability.
Imagine a pre-shuffled deck of playing cards and two agents (they don't have to be humans), named "Johnny" and "Sally", who are betting 1 dollar each on the suit of the top card. As everyone knows, 1/4 of the cards in a playing card deck are hearts. We will name this belief F1; F1 stands for "1/4 of the cards in the deck are hearts." Johnny and Sally both believe F1. F1 is all that Johnny knows about the deck of cards, but Sally knows a little bit more about this deck. Sally also knows that 8 of the top 10 cards are hearts. Let F2 stand for "8 out of the 10 top cards are hearts." Sally believes F2. Johnny doesn't know whether or not F2. F1 and F2 are beliefs about the deck of cards, and they are either true or false.
So, Sally bets that the top card is a heart and Johnny bets against her, i.e., she puts her money on "The top card is a heart." being true; he puts his money on "~The top card is a heart." being true. After they make their bets, one could imagine Johnny making fun of Sally; he might say something like: "Are you nuts? You know, I have a 75% chance of winning. 1/4 of the cards are hearts; you can't argue with that!" Sally might reply: "Don't forget that the probability you assign to '~The top card is a heart.' depends on what you know about the deck. I think you would agree with me that there is an 80% chance that 'The top card is a heart.' if you knew just a bit more about the state of the deck."
To be undecided about a proposition is to not know which possible world you are in: am I in the possible world where that proposition is true, or in the one where it is false? Both Johnny and Sally are undecided about "The top card is a heart."; their models of the world split at that point of representation. Their knowledge is consistent with being in a possible world where the top card is a heart, or in a possible world where the top card is not a heart. The more statements they decide on, the smaller the configuration space of possible worlds they think they might find themselves in; deciding on a proposition takes a chunk off of that configuration space, and the content of that proposition determines the shape of the eliminated chunk; Sally's and Johnny's beliefs constrain their respective expected experiences, but not all the way to a point. The trick when constraining one's space of viable worlds is to make sure that the real world is among the possible worlds that satisfy your beliefs. Sally still has the upper hand, because her space of viable possible worlds is smaller than Johnny's. There are many more ways you could arrange a standard deck of playing cards that satisfy F1 than there are ways to arrange a deck of cards that satisfies F1 and F2. To be clear, we don't need to believe that possible worlds actually exist to accept this view of belief; we just need to believe that any agent capable of being undecided about a proposition is also capable of imagining alternative ways the world could consistently turn out to be, i.e., capable of imagining possible worlds.
For convenience, we will say that a possible world W, is viable for an agent A, if and only if, W satisfies A's background knowledge of decided propositions, i.e., A thinks that W might be the world it finds itself in.
Of the possible worlds that satisfy F1, i.e., of the possible worlds where "1/4 of the cards are hearts" is true, 3/4 of them also satisfy "~The top card is a heart." Since Johnny holds that F1, and since he has no further information that might put stronger restrictions on his space of viable worlds, he ascribes a 75% probability to "~The top card is a heart." Sally, however, holds F2 as well as F1. She knows that of the possible worlds that satisfy F1, only 1/4 of them satisfy "The top card is a heart." But she holds a proposition that constrains her space of viable possible worlds even further, namely F2. Most of the possible worlds that satisfy F1 are eliminated as viable worlds if we hold F2 as well, because most of the possible worlds that satisfy F1 don't satisfy F2. Of the possible worlds that satisfy F2, exactly 80% of them satisfy "The top card is a heart." So, duh, Sally assigns an 80% probability to "The top card is a heart." They give that proposition different probabilities, and they are both right in assigning their respective probabilities; they don't disagree about how to assign probabilities, they just have different resources for doing so in this case. P(~The top card is a heart|F1) really is 75%, and P(The top card is a heart|F2) really is 80%.
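A Monte Carlo sketch of that picture (the sampling scheme is my own illustration): sample possible worlds consistent with each agent's background knowledge and count the fraction in which the top card is a heart.

```python
import random

random.seed(0)

def random_world_F1():
    """A possible world consistent with F1 (1/4 of the cards are hearts): a random shuffle."""
    deck = ["hearts"] * 13 + ["other"] * 39
    random.shuffle(deck)
    return deck

def random_world_F1_and_F2():
    """A possible world consistent with F1 and F2 (8 of the top 10 cards are hearts)."""
    top_ten = ["hearts"] * 8 + ["other"] * 2
    random.shuffle(top_ten)
    rest = ["hearts"] * 5 + ["other"] * 37
    random.shuffle(rest)
    return top_ten + rest

def fraction_top_is_heart(worlds):
    return sum(deck[0] == "hearts" for deck in worlds) / len(worlds)

johnny_worlds = [random_world_F1() for _ in range(100_000)]
sally_worlds = [random_world_F1_and_F2() for _ in range(100_000)]
print(round(fraction_top_is_heart(johnny_worlds), 2))  # ~0.25: Johnny's viable worlds
print(round(fraction_top_is_heart(sally_worlds), 2))   # ~0.80: Sally's viable worlds
```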
This setup makes it clear (to me at least) that the right probability to assign to a proposition depends on what you know. The more you know, i.e., the more you constrain the space of worlds you think you might be in, the more useful the probability you assign. The probability that an agent should ascribe to a proposition is directly related to that agent's knowledge of the world.
This setup also makes it easy to see how an agent can be wrong about the probability it assigns to a proposition given its background knowledge. Imagine a third agent, named "Billy", who has the same information as Sally, but says that there's a 99% chance of "The top card is a heart." Billy doesn't have any information that further constrains the possible worlds he thinks he might find himself in; he's just wrong about the fraction of possible worlds satisfying F2 that also satisfy "The top card is a heart." Of all the possible worlds that satisfy F2, exactly 80% of them satisfy "The top card is a heart.", no more, no less. There is only one correct probability to assign to a proposition given your partial knowledge.
The last benefit of this way of talking I'll mention is that it makes probability's dependence on ignorance clear. We can imagine another agent that knows the truth value of every proposition; let's call him "FSM". There is only one possible world that satisfies all of FSM's background knowledge; the only viable world for FSM is the real world. Of the possible worlds that satisfy FSM's background knowledge, either all of them satisfy "The top card is a heart." or none of them do, since there is only one viable world for FSM. So the only probabilities FSM can assign to "The top card is a heart." are 1 or 0. In fact, those are the only probabilities FSM can assign to any proposition. If there is no uncertainty, there is no probability.
The world knows whether or not any given proposition is true (assuming determinism). The world itself is never uncertain, only the parts of the world that we call agents can be uncertain. Hence, Probability is always in a mind, not in the world. The probabilities that the universe assigns to a proposition are always 1 or 0, for the same reasons FSM only assigns a 1 or 0, and 1 and 0 aren't really probabilities.
In conclusion, I'll risk the hypothesis that: where 0 ≤ x ≤ 1, "P(a|b) = x" is true if and only if, of the possible worlds that satisfy "b", a fraction x of them also satisfy "a". Probabilities are propositional attitudes, and the probability value (or range of values) you assign to a proposition is representative of the fraction of the possible worlds you find viable that satisfy that proposition. You may be wrong about the value of that fraction, and as a result you may be wrong about the probability you assign.
We may call the position summarized by the hypothesis above "Modal Satisfaction Frequency theory", or "MSF theory".
The question I want to ask is: "Is there a proof for every statement about the natural numbers that seems to be inductively verifiable, but whose decision problem is general recursive, i.e., testable only by algorithms that might not halt, provided your sample is large enough?" Simply put: if we define a binary property 'P' that can only be tested by algorithms that might not halt, and show, say by computation, that every natural number up to some arbitrarily large number 'N' has the property P, does that mean that there must be a generalized deductive proof that P holds for all natural numbers?
How big can N get before there must be a deductive proof that P holds for all natural numbers? What if N were larger than Graham's number? What if we showed, thousands of years from now, using computers that are unimaginably strong by today's standards, that every number less than or equal to the number you get when you put Graham's number as both of the inputs to the Ackermann function has the property P, but still had no generalized deductive proof that all numbers have P? Would that be enough for these future number theorists to state that it is a scientific fact that all natural numbers have the property P?
It may seem impossible for such a disagreement to come up between induction and deduction, but we are already at a similar (though admittedly less dramatic) impasse. The disagreement centers around the Collatz conjecture, which states that every number is a Collatz number. To test whether some number n is a Collatz number: if n is even, divide it by 2 to get n / 2; if n is odd, multiply it by 3 and add 1 to obtain 3n + 1; repeat this process with the new number thus obtained, and if you eventually reach the cycle 4, 2, 1, then n is a Collatz number. Every number up to 20 × 2^58 has been shown to be a Collatz number, and it has been shown that there are no cycles with a period smaller than 35400, yet there is still no deductive proof of the Collatz conjecture. In fact, one method of generalized proof has already been shown to be undecidable. If it were shown that no general proof could ever be found, but all numbers up to the unimaginably large ones described above were shown to be Collatz numbers, what epistemological status should we grant the conjecture?
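For concreteness, here is a sketch of the Collatz test the paragraph describes. The step cap is my own added safeguard, since the bare procedure is exactly the kind of test that is not guaranteed to halt.

```python
def reaches_one(n, max_steps=10_000):
    """Run the Collatz process from n; return True if the 4, 2, 1 cycle is reached.
    The step cap is an illustrative safeguard: without it, the test might not halt."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
        if steps > max_steps:
            return None   # undetermined within the cap
    return True

# Inductive evidence of the kind the post describes: no counterexample in a finite range.
print(all(reaches_one(n) for n in range(1, 100_000)))   # True
```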
It has been made clear by the history of mathematics that a lack of small counterexamples to a conjecture of the form "for all natural numbers, P(n) = 1", where 'P' is a binary function (let's call this an "inductive-conjecture"), is not at all a proof of that conjecture. There have been many inductive-conjectures with that sort of evidence in their favor that were later shown not to be theorems, simply because the counterexamples were very large. But what if a proof of the undecidability of a conjecture of that form were given? What then, if no counterexample had been found up to the insanely large values described above?
If there is a binary property 'R' that holds for all natural numbers, i.e., there is no counterexample, and it can be deductively shown that no proof of 'R' holding for all natural numbers exists, then the implications for epistemology, ontology, and the scientific/rational endeavor in general are huge. If some facts about the natural numbers don't follow from the application of valid inferences to axioms and theorems, then what makes us think that all the facts about the natural world must follow from natural laws in combination with the initial state of the universe? If there is such a property, then that means that in a completely deterministic system where you have all the rules describing all the possible ways that things can change, and you have all the rest of the formally verifiable information about the system, there still might be some fact about this system which is true but does not follow from those rules. Those statements would only be verifiable by finite probabilistic sampling from an infinite population with an undetermined standard deviation, but would still be true facts. Our crowning example of such a system would of course be the theory of the natural numbers itself, if such a binary property were discovered. Suppose the Collatz conjecture were shown to be undecidable. That would mean that there is no counterexample, i.e., all numbers are Collatz numbers (since the existence of a counterexample would guarantee the conjecture's decidability), but there would also be no generalized proof that no counterexample exists (since the existence of such a proof would guarantee the decidability of the conjecture). So since we can't verify the conjecture either way, should we call it meaningless/unverifiable? Or is logically undecidable somehow distinct from literally meaningless? What restrictions/expectations should we have if we believe that an inductive-conjecture is undecidable, and how would those restrictions change if we believed that the conjecture was actually unverifiable?
Let's call the claim that "there is a binary property R which holds for all natural numbers, such that no counterexample can or will ever be found, but which also cannot be proven to hold for all natural numbers" the "first Potato conjecture". How would one ever show the first Potato conjecture, or even offer evidence in its favor? Let's say we knew that some property 'R_b' held of all natural numbers that we might ever test. Then we would have a proof of this, and R_b could not be our R. If we get a candidate property 'R_c' that isn't capable of being proven or disproven of all natural numbers, then we will never know if it holds for all natural numbers. Could induction even offer us any evidence in this case? Is a finite sample ever representative of an infinite population with no standard deviation, even in the case of simple succession? If not, then what evidence could we ever offer for or against the Potato conjecture, if an undecidable inductive-conjecture were discovered? If the answer is no evidence one way or the other, does that mean that the Potato conjecture is meaningless, or just undecidable?
(But no, really, I'm asking.)
Quoting Lagarias 1985: "J. H. Conway proved the remarkable result that a simple generalization of the problem is algorithmically undecidable." The work was reported in: J. H. Conway (1972). "Unpredictable Iterations". Proceedings of the 1972 Number Theory Conference: University of Colorado, Boulder, Colorado, August 14–18, 1972. Boulder: University of Colorado. pp. 49–52.