
Comment author: XiXiDu 12 March 2014 05:34:39PM *  0 points [-]

Machine learning has plenty of low-hanging fruit.

How do you know this? Have there been a lot of findings made by a lot of people without any indication that this stream of discoveries is slowing down? When I looked up e.g. Deep learning it seemed to be a relatively old technique (1980's and early 90's). What are some examples of recent discoveries you would describe as low-hanging fruits?

Comment author: jsteinhardt 13 March 2014 06:47:11AM 0 points [-]

It's worth noting that deep learning has made a huge resurgence lately, and is seeing applications all over the place.

There's tons of active work in online learning, especially under resource constraints.

Structured prediction is older but still an active and important area of research.

Spectral learning / method of moments is a relatively new technique that seems very promising.

Conditional gradient techniques for optimization have had a lot of interest recently, although that may slow down in the next couple years. Similarly for submodular optimization.

There are many other topics that I think are important but haven't been quite as stylish lately; e.g. improved MCMC algorithms, coarse-to-fine inference / cascades, dual decomposition techniques for inference.

Comment author: jsteinhardt 12 March 2014 04:24:35PM 3 points [-]

LessWrong has a relatively strong anti-academic bias, and I'm worried that this is reflected in the comments.

I work as a PhD student in machine learning, and yes, there is a minimum bar of intelligence, perseverance, etc. below which doing high-quality research is unlikely. However, in my experience I have seen many people who are clearly above that bar who nevertheless go into industry. This is not to say that their choice is incorrect, but on balance I think the argument "don't go into academia unless you'll be one of the smartest people in your field" does more harm than good. It also seems to me that the effective altruist movement, in particular, mostly overlooks academia as an altruistic career option, even though I personally think that for many intelligent people (including myself), working on the right research problems is the most valuable contribution they can make to society.

If you go into a field like mathematics or theoretical physics, yes, you're unlikely to make a meaningful contribution unless you're one of the best people in the field. This is because these fields have basically become an attractor for bright undergrads looking to "prove themselves" intellectually. I'm not trying to argue that these fields are not useful; I am trying to argue that the marginal usefulness of an additional researcher is low barring extraordinary circumstances.

In other fields, especially newer fields, this is far less true. Machine learning has plenty of low-hanging fruit. My impression is that bioinstrumentation and computational neuroscience do as well (not to mention many other fields that I just don't happen to be as familiar with). This is not to say that working in these fields will be a cake-walk, or that there isn't lots of competition for faculty jobs. It is to say that there are huge amounts of value to be created by working in these fields. Even if you don't like pure research as a career option, you can create huge amounts of value by attaching yourself to a good lab as a software engineer.

It's also worth noting that "doing research" isn't some sort of magic skill that you do or don't have. It's something you acquire over time, and the meta-skills learned seem fairly valuable to me.

Comment author: Brillyant 19 February 2014 05:37:52PM 18 points [-]

I've lost 30 pounds since September 17th, 2013*. Interestingly, I've noticed doing so caused me to lose a lot of faith in LW.

In the midst of my diet, discussion in the comments on this series of posts confounded me. I'm no expert on nutrition or dieting (I do know perhaps more than the average person), but my sense is that I encountered a higher noise-to-signal ratio on the subject here at LW than anywhere else I've looked. There seemed to be all sorts of discussion about everything other than the simple math behind weight loss. Lots of super fascinating stuff—but much of it missing the point, I thought.

I learned a few interesting things during the discussion—which I always seem to do here. But in terms of providing a boost to my instrumental rationality, it didn't help at all. In fact, it's possible LW had a negative impact on my ability to win at dieting and weight management.

I notice this got me wondering about LW's views and discussions about many other things that I know very little about. I feel myself asking "How could I rationally believe LW knows what they are talking about in regard to the Singularity, UFAI, etc. if they seem to spin their wheels so badly on a discussion about something as simple as weight loss?"

I'm interested to hear others' thoughts on this.

Have you ever lost confidence in LW after a similar experience? Maybe something where it seemed to you people were "talking a big game" but failing to apply any of that to actually win in real life?

(*Note: To be clear, I've lost 30 pounds since Sept 17th, but only ~15-18 lbs since my "diet" began on Jan 1, 2014. I'm not really bragging about losing weight—I wish it weren't the case. I injured my neck and could no longer use my primary method of exercise (weightlifting) to stay in shape. After eating poorly and lying around for a couple months, I started—on Jan 1—to do consistent, light treadmill work & light core work, as well as cutting my calorie consumption pretty dramatically.)

Comment author: jsteinhardt 22 February 2014 09:41:58AM 5 points [-]

Have you ever lost confidence in LW after a similar experience? Maybe something where it seemed to you people were "talking a big game" but failing to apply any of that to actually win in real life?

As a stats / machine learning person, I find a lot of the "Bayesian statistics" talk around here pretty cringe-inducing. My impression is that physicists probably feel similarly about "many-worlds" discussions. I think LessWrong unfortunately causes people to believe that being a dilettante is enough to merit a confident opinion on a subject.

Comment author: Eliezer_Yudkowsky 20 February 2014 04:34:15AM 7 points [-]

Your criticism of Dutch Book is that it doesn't seem to you useful to add anti-Dutch-book checkers to your toolbox. My support of Dutch Book is that if something inherently produces Dutch Books then it can't be the right epistemological principle because clearly some of its answers must be wrong even in the limit of well-calibrated prior knowledge and unbounded computing power.
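For concreteness, here is a minimal numeric Dutch book (the event, credences, and stakes are all made up for illustration): an agent whose credences in an event and its complement sum to more than 1, and who prices bets at its credences, will buy a pair of bets that together lose money in every outcome.

```python
# Toy Dutch book against incoherent credences (illustrative numbers only).
credence_rain = 0.6
credence_no_rain = 0.6            # incoherent: the two credences sum to 1.2

# The agent regards a bet paying 1 unit as worth its credence, so it pays
# 0.6 for each of the two bets; exactly one of them can pay off.
total_paid = credence_rain + credence_no_rain   # 1.2
total_received = 1.0
print("loss in every outcome:", total_paid - total_received)   # 0.2
```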

The complete class theorem I understand least of the set, and it's probably not very much entwined with my true rejection so it would be logically rude to lead you on here. Again, though, the point that every local optimum is Bayesian tells us something about non-Bayesian rules producing intrinsically wrong answers. If I believed your criticism, I think it would be forceful; I could accept a world in which for every pair of a rational plan with a world, there is an irrational plan which does better in that world, but no plausible way for a cognitive algorithm to output that irrational plan - the plans which are equivalent of "Just buy the winning lottery ticket, and you'll make more money!" I can imagine being shown that the complete class theorem demonstrates only an "unfair" superiority of this sort, and that only frequentist methods can produce actual outputs for realistic situations even in the limit of unbounded computing power. But I do not believe that you have leveled such a criticism. And it doesn't square very much with my current understanding that the decision rules being considered are computable rules from observations to actions. You didn't actually tell me about a frequentist algorithm which is supposed to be realistic and show why the Bayesian rule which beats it is beating it unfairly.

If you want to hit me square in the true rejection I suggest starting with VNM. The fact that our epistemology has to plug into our actions is one reason why I roll my eyes at the likes of Dempster-Shafer or frequentist confidence intervals that don't convert to credibility distributions.

Comment author: jsteinhardt 20 February 2014 04:44:05AM 0 points [-]

One of the criticisms I raised is that merely being able to point to all the local optima is not a particularly impressive property of an epistemological theory. Many of those local optima will be horrible! For instance, the decision rule that ignores the data entirely and always outputs a fixed guess is admissible (it is the essentially unique Bayes rule under a point-mass prior), yet it is a terrible rule. (My criticism of VNM is essentially the same.)

Many frequentist methods, such as minimax, also provide local optima, but they provide local optima which actually have certain nice properties. And minimax provides a complete decision rule, not just a probability distribution, so it plugs directly into actions.

Comment author: Eliezer_Yudkowsky 19 February 2014 06:08:46PM 15 points [-]

My guess is that you would still be in favor of Bayes as a normative standard of epistemology even if you rejected Dutch book arguments, and the reason why you like it is because you feel like it has been useful for solving a large number of problems.

Um, nope. What it would really take to change my mind about Bayes is seeing a refutation of Dutch Book and Cox's Theorem and Von Neumann-Morgenstern and the complete class theorem, combined with seeing some alternative epistemology (e.g. Dempster-Shafer) not turn out to completely blow up when subjected to the same kind of scrutiny as Bayesianism (the way DS brackets almost immediately go to [0, 1] and fuzzy logic turned out to be useless etc.).

Neural nets have been useful for solving a large number of problems. It doesn't make them good epistemology. It doesn't make them a plausible candidate for "Yes, this is how you need to organize your thinking about your AI's thinking and if you don't your AI will explode".

some of which Bayesian statistics cannot solve, as I have demonstrated in this post.

I am afraid that your demonstration was not stated sufficiently precisely for me to criticize. This seems like the sort of thing for which there ought to be a standard reference, if there were such a thing as a well-known problem which Bayesian epistemology could not handle. For example, we have well-known critiques and literature claiming that nonconglomerability is a problem for Bayesianism, and we have a chapter of Jaynes which neatly shows that they all arise from misuse of limits on infinite problems. Is there a corresponding literature for your alleged reductio of Bayesianism which I can consult? Now, I am a great believer in civilizational inadequacy and the fact that the incompetence of academia is increasing, so perhaps if this problem was recently invented there is not yet any literature about it. I don't want to be a hypocrite about the fact that sometimes something is true and nobody has written it up anyway; heaven knows that's true all the time in my world. But the fact remains that I am accustomed to somewhat more detailed math when it comes to providing an alleged reductio of the standard edifice of decision theory. I know your time is limited, but I really do need more detail to think that I've seen a criticism and to be convinced that no response to that criticism exists. Should your flat assertion that Bayesian methods can't handle something, falling flat badly enough to constitute a critique of Bayesian epistemology, be something that I find convincing?

We've already discussed this in one of the other threads, but I'll just repeat here that this isn't correct. With overwhelmingly high probability a Gaussian matrix will satisfy the restricted isometry property, which implies that appropriately L1-regularized least squares will return the exact solution.

Okay. Though I note that you haven't actually said that my intuitions (and/or my reading of Wikipedia) were wrong; many NP-hard problems will be easy to solve for a randomly generated case.

Anyway, suppose a standard L1-penalty algorithm solves a random case of this problem. Why do you think that's a reductio of Bayesian epistemology? Because the randomly generated weights mean that a Bayesian viewpoint says the credibility is going as the L2 norm on the non-zero weights, but we used an L1 algorithm to find which weights were non-zero? I am unable to parse this into the justifications I am accustomed to hearing for rejecting an epistemology. It seems like you're saying that one algorithm is more effective at finding the maximum of a Bayesian probability landscape than another algorithm; in a case where we both agree that the unbounded form of the Bayesian algorithm would work.

What destroys an epistemology's credibility is a case where even in the limit of unbounded computing power and well-calibrated prior knowledge, a set of rules just returns the wrong answer. The inherent subjectivity of p-values as described in http://lesswrong.com/lw/1gc/frequentist_statistics_are_frequently_subjective/ is not something you can make go away with a better-calibrated prior, correct use of limits, or unlimited computing power; it's the result of bad epistemology. This is the kind of smoking gun it would take to make me stop yammering about probability theory and Bayes's rule. Showing me algorithms which don't on the surface seem Bayesian but find good points on a Bayesian fitness landscape isn't going to cut it!

Comment author: jsteinhardt 20 February 2014 04:16:23AM 4 points [-]

Eliezer, I included a criticism of both complete class and Dutch book right at the very beginning, in Myth 1. If you find them unsatisfactory, can you at least indicate why?

Comment author: adam_strandberg 19 February 2014 04:46:55AM *  1 point [-]

I am deeply confused by your statement that the complete class theorem only implies that Bayesian techniques are locally optimal. If for EVERY non-Bayesian method there's a better Bayesian method, then the globally optimal technique must be a Bayesian method.

Comment author: jsteinhardt 19 February 2014 07:14:45AM 1 point [-]

There is a difference between "the globally optimal technique is Bayesian" and "a Bayesian technique is globally optimal". In the latter case, we now still have to choose from an infinitely large family of techniques (one for each choice of prior). Bayes doesn't help me know which of these I should choose. In contrast, there are frequentist techniques (e.g. minimax) that will give me a full prescription of what I ought to do. Those techniques can in many (but not all) cases be interpreted in terms of a prior, but "choose a prior and update" wasn't the advice that led me to that decision; rather, it was "play the minimax decision rule".
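A standard concrete example of that last point (the numbers below are made up): under squared-error loss, the minimax estimator of a Bernoulli mean from n coin flips has a closed form, and it coincides with the Bayes posterior mean under a Beta(sqrt(n)/2, sqrt(n)/2) prior, even though the rule was derived by playing minimax rather than by choosing that prior.

```python
import numpy as np

def minimax_bernoulli_mean(k, n):
    # Classic minimax rule for a Bernoulli mean under squared-error loss.
    # It equals the posterior mean under a Beta(sqrt(n)/2, sqrt(n)/2) prior,
    # but "pick that prior and update" is not how the rule was prescribed.
    a = np.sqrt(n) / 2
    return (k + a) / (n + 2 * a)

print(minimax_bernoulli_mean(k=7, n=10))   # about 0.652, versus the MLE of 0.7
```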

As I said in my post:

I would much rather have someone hand me something that wasn’t a local optimum but was close to the global optimum, than something that was a local optimum but was far from the global optimum.

Comment author: Eliezer_Yudkowsky 12 February 2014 09:44:25PM 22 points [-]

Don't have time for a real response. Quickly and ramblingly:

1) The point of Bayesianism isn't that there's a toolbox of known algorithms like max-entropy methods which are supposed to work for everything. The point of Bayesianism is to provide a coherent background epistemology which underlies everything; when a frequentist algorithm works, there's supposed to be a Bayesian explanation of why it works. I have said this before many times but it seems to be a "resistant concept" which simply cannot sink in for many people.

2) I did initially try to wade into the math of the linear problem (and wonder if I'm the only one who did so, unless others spotted the x-y inversion but didn't say anything), trying to figure out how I would solve it even though that wasn't really relevant for reasons of (1), but found that the exact original problem specified may be NP-hard according to Wikipedia, much as my instincts said it should be. And if we're allowed approximate answers then yes, throwing a standard L1-norm algorithm at it is pretty much what I would try, though I might also try some form of expectation-maximization using the standard Bayesian L2 technique and repeatedly truncating the small coefficients and then trying to predict the residual error. I have no idea how long that would take in practice. It doesn't actually matter, because see (1). I could go on about how for any given solution I can compute its Bayesian likelihood assuming Gaussian noise, and so again Bayes functions well as a background epistemology which gives us a particular minimization problem to be computed by whatever means, and if we have no background epistemology then why not just choose a hundred random 1s, etc., but lack the time for more than rapid rambling here. Jacob didn't say what he thought an actual frequentist or Bayesian approach would be, he just said the frequentist approach would be easy and that the Bayesian one was hard.

(3) Having made a brief effort to wade into the math and hit the above bog, I did not attempt to go into Jacob's claim that frequentist statistics can transcend i.i.d. But considering the context in which I originally complained about the assumptions made by frequentist guarantees, I should very much like to see explained concretely how Jacob's favorite algorithm would handle the case of "You have a self-improving AI which turns out to maximize smiles, in all previous cases it produced smiles by making people happy, but once it became smart enough it realized that it ought to preserve your bad generalization and faked its evidence, and now that it has nanotech it's going to tile the universe with tiny smileyfaces." This is the Context Change Problem I originally used to argue against trying for frequentist-style guarantees based on past AI behavior being okay or doing well on other surface indicators. I frankly doubt that Jacob's algorithm is going to handle it. I really really doubt it. Very very roughly, my own notion of an approach here would be a Bayesian-viewpoint AI which was learning a utility function and knew to explicitly query model ambiguity back to the programmers, perhaps using a value-of-info calculation. I should like to hear what a frequentist viewpoint on that would sound like.

(4) Describing the point of likelihood ratios in science would take its own post. Three key ideas are (a) instead of "negative results" we have "likelihood ratios favoring no effect over 5% effect" and so it's now conceptually simpler to get rid of positive-result bias in publication; (b) if we compute likelihood ratios on all the hypotheses which are actually in play then we can add up what many experiments tell us far more easily and get far more sensible answers than with present "survey" methods; and (c) having the actual score be far below expected log score for the best hypothesis tells us when some of our experiments must be giving us bogus data or have been performed under invisibly different conditions, a huge problem in many cases and something far beyond the ability of present "survey" methods to notice or handle.
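As a toy illustration of (a) and (b), with made-up data: treat "no effect" and "5% effect" as two point hypotheses about a success probability and report the likelihood ratio between them; because independent experiments multiply these ratios (log-ratios add), aggregating many studies becomes straightforward.

```python
from scipy.stats import binom

k, n = 520, 1000                    # hypothetical result: 520 successes in 1000 trials
L_null = binom.pmf(k, n, 0.50)      # likelihood under "no effect"
L_effect = binom.pmf(k, n, 0.55)    # likelihood under "5% effect"
print("likelihood ratio favoring no effect:", L_null / L_effect)   # about 2.75
```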

EDIT: Also everything in http://lesswrong.com/lw/mt/beautiful_probability/

Comment author: jsteinhardt 19 February 2014 07:06:35AM 8 points [-]

Eliezer,

The point of Bayesianism is to provide a coherent background epistemology which underlies everything; when a frequentist algorithm works, there's supposed to be a Bayesian explanation of why it works. I have said this before many times but it seems to be a "resistant concept" which simply cannot sink in for many people.

First, I object to the labeling of Bayesian explanations as a "resistant concept". I think it's not only uncharitable but also wrong. I started out with exactly the viewpoint that everything should be explained in terms of Bayes (see one of my earliest and most-viewed blog posts if you don't believe me). I moved away from this viewpoint slowly as the result of accumulated evidence that this is not the most productive lens through which to view the world.

More to the point: why is it that you think that everything should have a Bayesian explanation? One of the most-cited reasons why Bayes should be an epistemic ideal is the collection of "optimality" / Dutch book theorems, which I've already argued against in this post. Do you accept the rebuttals I gave, or disagree with them?

My guess is that you would still be in favor of Bayes as a normative standard of epistemology even if you rejected Dutch book arguments, and the reason why you like it is because you feel like it has been useful for solving a large number of problems. But frequentist statistics (not to mention pretty much any successful paradigm) has also been useful for solving a large number of problems, some of which Bayesian statistics cannot solve, as I have demonstrated in this post. The mere fact that a tool is extremely useful does not mean that it should be elevated to a universal normative standard.

but found that the exact original problem specified may be NP-hard according to Wikipedia, much as my instincts said it should be

We've already discussed this in one of the other threads, but I'll just repeat here that this isn't correct. With overwhelmingly high probability a Gaussian matrix will satisfy the restricted isometry property, which implies that appropriately L1-regularized least squares will return the exact solution.
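For illustration, here is a small numerical sketch of the kind of recovery being claimed. The dimensions, seed, and the use of scikit-learn's Lasso with a tiny penalty are my own choices; basis pursuit (or a dedicated compressed-sensing solver) would be the more standard tool for the exact noiseless problem.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, k = 100, 400, 5                           # measurements, dimension, nonzeros
A = rng.standard_normal((n, d)) / np.sqrt(n)    # Gaussian matrix: RIP holds w.h.p.
x_true = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
x_true[support] = rng.standard_normal(k)
y = A @ x_true                                  # noiseless linear measurements

# L1-regularized least squares with a very small penalty approximates basis pursuit.
x_hat = Lasso(alpha=1e-4, fit_intercept=False, max_iter=100000).fit(A, y).coef_

recovered = np.flatnonzero(np.abs(x_hat) > 1e-3)
print("support recovered exactly:", set(recovered) == set(support))
print("largest coefficient error:", np.abs(x_hat - x_true).max())
```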

I could go on about how for any given solution I can compute its Bayesian likelihood assuming Gaussian noise, and so again Bayes functions well as a background epistemology

The point of this example was to give a problem that, from a modeling perspective, was as convenient for Bayes as possible, but that was computationally intractable to solve using Bayesian techniques. I gave other examples (such as in Myth 5) that demonstrate situations where Bayes breaks down. And I argued indirectly in Myths 1, 4, and 8 that the prior is actually a pretty big deal and can cause problems of a kind that frequentist methods have ways of dealing with.

I should very much like to see explained concretely how Jacob's favorite algorithm would handle the case of "You have a self-improving AI which turns out to maximize smiles, in all previous cases it produced smiles by making people happy, but once it became smart enough it realized that it ought to preserve your bad generalization and faked its evidence, and now that it has nanotech it's going to tile the universe with tiny smileyfaces."

I think this is a very bad testing ground for how good a technique is, because it's impossible to say whether something would solve this problem without going through a lot of hand-waving. I think your "notion of how to solve it" is interesting but has a lot of details to fill in, and it's extremely unclear how it would work, especially given that even for concrete problems that people work on now, an issue with Bayesian methods is overconfidence in a particular model. I should also note that, as we've registered earlier, I don't think that what you call the Context Change Problem is actually a problem that an intelligent agent would face: any agent that is intelligent enough to behave at all functionally close to the level of a human would be robust to context changes.

However, even given all these caveats, I'll still try to answer your question on your own terms. Short answer: do online learning with an additional action called "query programmer" that is guaranteed to always have some small negative utility, say -0.001, that is enough to outweigh any non-trivial amount of uncertainty but will eventually encourage the AI to act autonomously. We would need some way of upper-bounding the regret of other possible actions, and of incorporating this utility constraint into the algorithm, but I don't think the amount of fleshing out is any more or less than that required by your proposal.
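A toy sketch of the shape of that proposal, purely for illustration; the query cost, the uncertainty measure, and the decision rule are placeholders rather than a worked-out algorithm:

```python
QUERY_COST = 0.001   # small fixed disutility charged for asking the programmer

def choose_action(value_estimates, uncertainty):
    # Toy rule: act autonomously once the learner's uncertainty about the
    # best action's value is smaller than the cost of asking; otherwise query.
    best_action = max(value_estimates, key=value_estimates.get)
    if uncertainty > QUERY_COST:
        return "query programmer"
    return best_action

# Hypothetical call: the learner is still quite uncertain, so it asks.
print(choose_action({"make people happy": 0.9, "tile with smiley faces": 0.2},
                    uncertainty=0.05))
```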

[WARNING: The rest of this comment is mostly meaningless rambling.]

I want to stress again that the above paragraph is only a (sketch of) an answer to the question as you posed it. But I'd rather sidestep the question completely and say something like: "OK, if we make literally no assumptions, then we're completely screwed, because moving any speck of dust might cause the universe to explode. Being Bayesian doesn't make this issue go away, it just ignores it.

So, what assumptions can we be reasonably okay with making that would help us solve the problem? Maybe I'd be okay assuming that the mechanism that takes in my past actions and returns a utility is a Turing machine of description length less than 10^15. But unfortunately that doesn't help me much, because for every Turing machine M, there's one of not that much longer description length that behaves identically to M up until I'm about to make my current decision, and then penalizes my current decision with some extraordinarily large amount of disutility. Note that, again, being Bayesian doesn't deal with this issue, it just assigns it low prior probability.

I think the question of exactly what assumptions one would be willing to make, that would allow one to confidently reason about actions with potentially extremely discontinuous effects, is an important and interesting one, and I think one of the drawbacks of "thinking like a Bayesian" is that it draws attention away from this issue by treating it as mostly solved (via assigning a prior)."

A Fervent Defense of Frequentist Statistics

43 jsteinhardt 18 February 2014 08:08PM

[Highlights for the busy: debunking standard "Bayes is optimal" arguments; frequentist Solomonoff induction; and a description of the online learning framework. Note: cross-posted from my blog.]

Short summary. This essay makes many points, each of which I think is worth reading, but if you are only going to understand one point I think it should be “Myth 5” below, which describes the online learning framework as a response to the claim that frequentist methods need to make strong modeling assumptions. Among other things, online learning allows me to perform the following remarkable feat: if I’m betting on horses, and I get to place bets after watching other people bet but before seeing which horse wins the race, then I can guarantee that after a relatively small number of races, I will do almost as well overall as the best other person, even if the number of other people is very large (say, 1 billion), and their performance is correlated in complicated ways.
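The guarantee being invoked here is the standard regret bound from prediction with expert advice. Below is a minimal exponential-weights (Hedge) sketch; the loss matrix, learning-rate tuning, and toy data are chosen purely for illustration.

```python
import numpy as np

def hedge(losses):
    # Exponential weights over n experts (the other bettors).
    # losses: (T, n) array with entries in [0, 1].
    # Standard guarantee: total loss exceeds the best single expert's total
    # loss by at most about sqrt((T / 2) * log(n)), with no i.i.d. assumption.
    T, n = losses.shape
    eta = np.sqrt(8 * np.log(n) / T)          # usual tuning of the learning rate
    w = np.ones(n)
    total = 0.0
    for t in range(T):
        p = w / w.sum()                       # bet in proportion to the weights
        total += p @ losses[t]
        w *= np.exp(-eta * losses[t])         # downweight bettors who did badly
    return total

rng = np.random.default_rng(0)
L = rng.uniform(0.4, 1.0, size=(500, 1000))   # 500 races, 1000 other bettors
L[:, 7] = rng.uniform(0.0, 0.2, size=500)     # one bettor is consistently good
print(hedge(L), L.sum(axis=0).min())          # learner's total loss vs. the best bettor's
```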

If you’re only going to understand two points, then also read about the frequentist version of Solomonoff induction, which is described in “Myth 6”.

Main article. I’ve already written one essay on Bayesian vs. frequentist statistics. In that essay, I argued for a balanced, pragmatic approach in which we think of the two families of methods as a collection of tools to be used as appropriate. Since I’m currently feeling contrarian, this essay will be far less balanced and will argue explicitly against Bayesian methods and in favor of frequentist methods. I hope this will be forgiven, since so much other writing goes in the opposite direction and unabashedly defends Bayes. I should note that this essay is partially inspired by some of Cosma Shalizi’s blog posts, such as this one.

This essay will start by listing a series of myths, then debunk them one-by-one. My main motivation for this is that Bayesian approaches seem to be highly popularized, to the point that one may get the impression that they are the uncontroversially superior method of doing statistics. I actually think the opposite is true: I think most statisticians would for the most part defend frequentist methods, although there are also many departments that are decidedly Bayesian (e.g. many places in England, as well as some U.S. universities like Columbia). I have a lot of respect for many of the people at these universities, such as Andrew Gelman and Philip Dawid, but I worry that many of the other proponents of Bayes (most of them non-statisticians) tend to oversell Bayesian methods or undersell alternative methodologies.

If you are like me from, say, two years ago, you are firmly convinced that Bayesian methods are superior and that you have knockdown arguments in favor of this. If this is the case, then I hope this essay will give you an experience that I myself found life-altering: the experience of having a way of thinking that seemed unquestionably true slowly dissolve into just one of many imperfect models of reality. This experience helped me gain more explicit appreciation for the skill of viewing the world from many different angles, and of distinguishing between a very successful paradigm and reality.

Comment author: Leon 16 February 2014 09:39:05AM 1 point [-]

Ah, good point. It's like the prior, considered as a regularizer, is too "soft" to encode the constraint we want.

A Bayesian could respond that we rarely actually want sparse solutions -- in what situation is a physical parameter identically zero? -- but rather solutions which have many near-zeroes with high probability. The posterior would satisfy this, I think. In this sense a Bayesian could justify the Laplace prior as approximating a so-called "spike-and-slab" prior (which I believe leads to combinatorial intractability similar to the full L0 solution).
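For concreteness, the prior being gestured at places an atom at zero plus a continuous "slab", something like (with the mixing weight pi and slab width sigma as free hyperparameters):

p(theta_j) = pi * delta_0(theta_j) + (1 - pi) * Normal(theta_j; 0, sigma^2),

so each coefficient is exactly zero with probability pi, whereas a Laplace prior puts zero probability on exact zeroes.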

Also, without L0 the frequentist doesn't get fully sparse solutions either. The shrinkage is gradual; sometimes there are many tiny coefficients along the regularization path.

[FWIW I like the logical view of probability, but don't hold a strong Bayesian position. What seems most important to me is getting the semantics of both Bayesian (= conditional on the data) and frequentist (= unconditional, and dealing with the unknowns in some potentially nonprobabilistic way) statements right. Maybe there'd be less confusion -- and more use of Bayes in science -- if "inference" were reserved for the former and "estimation" for the latter.]

Comment author: jsteinhardt 16 February 2014 10:27:58PM 1 point [-]

Also, without L0 the frequentist doesn't get fully sparse solutions either. The shrinkage is gradual; sometimes there are many tiny coefficients along the regularization path.

See this comment. You actually do get sparse solutions in the scenario I proposed.

Comment author: Leon 15 February 2014 12:07:49AM 2 points [-]

Many L1 constraint-based algorithms (for example the LASSO) can be interpreted as producing maximum a posteriori Bayesian point estimates with Laplace (= double exponential) priors on the coefficients.
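Concretely (with the noise variance sigma^2 and Laplace scale b treated as known): a Laplace prior has density proportional to exp(-|beta_j| / b), so with Gaussian noise the negative log-posterior is

(1 / (2 * sigma^2)) * ||y - X beta||^2 + (1 / b) * sum_j |beta_j| + const,

and minimizing it is the LASSO objective ||y - X beta||^2 + lambda * ||beta||_1 with lambda = 2 * sigma^2 / b.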

Comment author: jsteinhardt 15 February 2014 03:05:25AM 0 points [-]

Yes, but in this setting maximum a posteriori (MAP) doesn't make any sense from a Bayesian perspective. Maximum a posteriori is supposed to be a point estimate of the posterior, but in this case, the MAP solution will be sparse, whereas the posterior given a Laplace prior will place zero mass on sparse solutions. So the MAP estimate doesn't even qualitatively approximate the posterior.
