[A text with some decent discussion on the topic](http://www.inference.phy.cam.ac.uk/mackay/itila/book.html). At least one group that has a shot at winning a major speech recognition benchmark competition uses information-theoretic ideas for the development of their speech recognizer. Another development has been the use of error-correcting codes to assist in multi-class classification problems (google "error correcting codes machine learning")[http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=error+correcting+codes+machine+learni...
I have a minor disagreement, which I think supports your general point. There is definitely a type of compression going on in the algorithm; it's just that the key insight in the compression is not to simply "minimize entropy" but rather to make the outputs of the encoder behave in a similar manner to the observed data. Indeed, one of the major insights of information theory is that one wants the encoding scheme to capture the properties of the distribution over the messages (and hence over alphabets).
Namely, in Hinton's algorithm the out...
This attacks a straw-man utilitarianism, in which you need to compute precise results and get the one correct answer. Functions can be approximated; this objection isn't even a problem.
Not every function can be approximated efficiently, though. I see the scope of morality as addressing human activity, where human activity is itself a function space. In this case the "moral gradient" that the consequentialist is computing is based on a functional defined over a function space. There are plenty of function spaces and functionals which are very...
I would like you to elaborate on the incoherence of deontology so I can test out how my optimization perspective on morality can handle the objections.
To be clear I see the deontologist optimization problem as being a pure "feasibility" problem: one has hard constraints and zero gradient (or approximately zero gradient) on the moral objective function given all decisions that one can make.
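To make that framing concrete, here is a toy sketch of a pure feasibility problem (the actions and constraints are invented for illustration, not drawn from any particular moral theory):

```python
def permissible_actions(actions, constraints):
    """Deontological choice as pure feasibility: an action is permissible
    iff it violates no hard constraint; among the permissible actions
    there is no gradient to rank one above another."""
    return [a for a in actions if all(c(a) for c in constraints)]

# Toy example: actions are labeled dicts; constraints are hard predicates.
actions = [
    {"name": "lie", "deceives": True},
    {"name": "stay silent", "deceives": False},
    {"name": "inform", "deceives": False},
]
constraints = [lambda a: not a["deceives"]]  # e.g. a Kantian ban on deception

print([a["name"] for a in permissible_actions(actions, constraints)])
```

Note that the function returns a set of permissible actions rather than a single optimum: that is the sense in which the objective is flat over everything that satisfies the constraints.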
Of the many, many critiques of utilitarianism, some argue that it's not sensible to actually talk about a "gradient" or marginal improvement in moral objective functions. Some argue this on the basis of computational constraints: there's no way that you could ever reasonably compute a moral objective ...
I would argue that deriving principles using the categorical imperative is a very difficult optimization problem and that there is a very meaningful sense in which one is a deontologist and not a utilitarian. If one is a deontologist then one needs to solve a series of constraint-satisfaction problems with hard constraints (i.e. they cannot be violated). In the Kantian approach: given a situation, one must derive, via moral reasoning, the constraints under which one must act in that situation, and then one must act in accordance with those constraints.
This is very closely ...
I agree with the beginning of your comment. I would add that the authors may believe they are attacking utilitarianism, when in fact they are commenting on the proper methods for implementing utilitarianism.
I disagree that attacking utilitarianism involves arguing for different optimization theory. If a utilitarian believed that the free market was more efficient at producing utility then the utilitarian would support it: it doesn't matter by what means that free market, say, achieved that greater utility.
Rather, attacking utilitarianism involves arguing...
Bear in mind that having more fat means that the brain gets starved of [glucose](http://www.loni.ucla.edu/~thompson/ObesityBrain2009.pdf) and blood sugar levels have [impacts on the brain generally](http://ajpregu.physiology.org/cgi/content/abstract/276/5/R1223). Some research has indicated that the amount of sugar available to the brain has a relationship with self-control. A moderately obese person may have fat cells that steal so much glucose from their brain that their brain is incapable of mustering the will in order to get them to stop eating poorl...
I think that this post has something to say about political philosophy. The problem as I see it is that we want to understand how our local decision-making affects the global picture and what constraints we should put on our local decisions. This is extremely important because, arguably, people make a lot of local decisions that make us globally worse off: such as pollution ("externalities" in econo-speak). I don't buy the author's belief that we should ignore these global constraints: they are clearly important--indeed it's the fear of the pot...
I think there is definitely potential to the idea, but I don't think you pushed the analogy quite far enough. I can see an analogy between what is presented here and human rights and to Kantian moral philosophy.
Essentially, we can think of human rights as what many people believe to be essential bare-minimum conditions on human treatment. I.e., in the class of all "good and just" worlds, everybody's human rights will be respected. Here human rights correspond to the "local rigidity" condition of the subgraph. In general, t...
All the sciences mentioned above definitely do rely on controlled experimentation. But their central empirical questions are not amenable to being directly studied by controlled experimentation. We don't have multiple earths or natural histories upon which we can draw inference about the origins of species.
There is a world of difference between saying "I have observed speciation under these laboratory conditions" and "speciation explains observed biodiversity". These are distinct types of inferences. This of course does not mean th...
I think we are talking past each other. I agree that those are experiments in a broad and colloquial use of the term. They aren't "controlled" experiments, a term I wanted to clarify (since I know a little bit about it). This means that they do not allow you to randomly assign treatments to experimental units, which generally means that the risk of bias is greater (hence the statistical analysis must be done with care and the conclusions drawn should face greater scrutiny).
Pick up any textbook on statistical design or statis...
I think it's standard in the literature: "The word experiment is used in a quite precise sense to mean an investigation where the system under study is under the control of the investigator. This means that the individuals or material investigated, the nature of the treatments or manipulations under study and the measurement procedures used are all settled, in their important features at least, by the investigator." The theory of the design of experiments
To be sure there are geological experiments where one, say, takes rock samples and subjects ...
Those sciences are based on observations. Controlled experimentation requires that you have some set of experimental units to which you randomly assign treatments. With geology, for instance, you are trying to figure out the structure of the Earth's crust (mostly). There are no real treatments that you apply, instead you observe the "treatments" that have been applied by the earth to the earth. I.e. you can't decide which area will have a volcano, or an earthquake: you can't choose to change the direction of a plate or change the configurati...
Bear in mind that the people who used steam engines to make money didn't make it by selling the engines: rather, the engines were useful in producing other goods. I don't think that the creators of a cheap substitute for human labor (GAI could be one such example) would be looking to sell it necessarily. They could simply want to develop such a tool in order to produce a wide array of goods at low cost.
I may think that I'm clever enough, for example, to keep it in a box and ask it for stock market predictions now and again. :)
As for the "no free lun...
It actually comes from Peter Norvig's definition that AI is simply good software, a comment that Robin Hanson made, and the general theme of Shane Legg's definitions, which are ways of achieving particular goals.
I would also emphasize that the foundations of statistics can (and probably should) be framed in terms of decision theory (See DeGroot, "Optimal Statistical Decisions" for what I think is the best book on the topic, as a further note the decision-theoretic perspective is neither frequentist nor Bayesian: those two approaches can be unde...
The fact that there are so many definitions and no consensus is precisely the unclarity. Shane Legg has done us all a great favor by collecting those definitions together. With that said, his definition is certainly not the standard in the field and many people still believe their separate definitions.
I think his definitions often lack an understanding of the statistical aspects of intelligence, and as such they don't give much insight into the part of AI that I and others work on.
I think there is a science of intelligence which (in my opinion) is closely related to computation, biology, and production functions (in the economic sense). The difficulty is that there is much debate as to what constitutes intelligence: there aren't any easily definable results in the field of intelligence nor are there clear definitions.
There is also the engineering side: this is to create an intelligence. The engineering is driven by a vague sense of what an AI should be, and one builds theories to construct concrete subproblems and give a framework ...
I'd meet on June 6 (tentatively). South side is preferable if there are other people down here.
Thanks for the link assistance.
I agree that my mathematics example is insufficient to prove the general claim: "One will master only a small number of skills". I suppose a proper argument would require an in-depth study of people who solve hard problems.
I think the essential point of my claim is that there is high variance with respect to the subset of the population that can solve a given difficult problem. This seems to be true in most of the sciences and engineering to the best of my knowledge (though I know mathematics best). The theory I ...
Asking other people who have solved a similar problem to evaluate your answer is a very powerful and simple strategy to follow.
Also, most evidence I have seen is that you can only learn how to do a small number of things well. So if you are solving something outside of your area of expertise (which probably includes most problems you'll encounter during your life) then there is probably somebody out there who can give a much better answer than you (although the cost to find such a person may be too great).
Post Note: The fact that you can only learn a few th...
Expanding on the "go meta" point:
Solve many hard problems at once
Whatever solution you give to a hard problem should give insight into or be consistent with answers given to other hard problems. This is similar in spirit to [Two Truths and a Lie](http://lesswrong.com/lw/1kn/two_truths_and_a_lie/) and a point made by Robin Hanson (Youtube link: the point is at 3:31): "...the first thing to do with puzzles is [to] try to resist the temptation to explain them one at a time. I think the right, disciplined way to deal [with] puzzles is to collect a bunch of them: lay them all out ...
From: You and Your Research
When you are famous it is hard to work on small problems. This is what did Shannon in. After information theory, what do you do for an encore? The great scientists often make this error. They fail to continue to plant the little acorns from which the mighty oak trees grow. They try to get the big thing right off. And that isn't the way things go. So that is another reason why you find that when you get early recognition it seems to sterilize you.
Here is another mechanism by which status could make you "stupid", alth...
I think that it should be tested on our currently known theories, but I do think it will probably perform quite well. This is on the basis that it's analogically similar to cross-validation in the way that Occam's Razor is similar to the information criteria (Akaike, Bayes, Minimum Description Length, etc.) used in statistics.
I think that, in some sense, it's the porting over of a statistical idea to the evaluation of general hypotheses.
I think this is cross-validation for tests. There have been several posts on Occam's Razor as a way to find correct theories, but this is the first I have seen on cross-validation.
In machine learning and statistics, a researcher is often trying to find a good predictor for some data, and often has some "training data" which can be used to select the predictor from a class of potential predictors. Often one has more than one predictor that performs well on the training data, so the question is how else one can choose an appropriate predic...
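A minimal sketch of that selection procedure (the data and the two candidate predictors are invented for illustration): among predictors that fit the training data, prefer the one with the lowest error on held-out data.

```python
# Held-out validation: fit candidates on training data, then pick the
# predictor with the lowest mean squared error on data not used for fitting.
train = [(1, 2.1), (2, 3.9), (3, 6.2)]
holdout = [(4, 8.1), (5, 9.8)]

# Two toy candidates that both track the training data reasonably well.
candidates = {
    "linear": lambda x: 2.0 * x,
    "cubic": lambda x: 0.1 * x**3 + 1.9 * x,
}

def mse(predictor, data):
    return sum((predictor(x) - y) ** 2 for x, y in data) / len(data)

# The cubic overfits: it strays badly once x leaves the training range.
best = min(candidates, key=lambda name: mse(candidates[name], holdout))
print(best)
```

The held-out points play the role of the "validation fold" in cross-validation; repeating this over several splits and averaging gives the full cross-validation procedure.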
I would like to see more discussion on the timing of artificial super intelligence (or human level intelligence). I really want to understand the mechanics of your disagreement.
One issue with, say, taking a normal distribution and letting the variance go to infinity (which is the improper prior I normally use) is that the posterior distribution is going to have a finite mean, which may not be a desired property of the resulting distribution.
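That limit is easy to check numerically for a normal mean with known variance, using the standard conjugate-normal formulas (the numbers here are illustrative):

```python
# Posterior mean of a normal mean with known variance sigma2, under a
# N(0, tau2) prior, given n observations with sample mean xbar:
# it is the precision-weighted combination of prior mean (0) and xbar.
def posterior_mean(xbar, n, sigma2, tau2):
    precision_data = n / sigma2
    precision_prior = 1.0 / tau2
    return xbar * precision_data / (precision_data + precision_prior)

xbar, n, sigma2 = 5.0, 10, 1.0
# As the prior variance tau2 grows (the flat improper-prior limit),
# the posterior mean converges to the finite sample mean xbar.
for tau2 in (1.0, 100.0, 1e8):
    print(tau2, posterior_mean(xbar, n, sigma2, tau2))
```

So no matter how "non-informative" the prior is made, the posterior mean stays pinned near the sample mean, which is the finiteness property described above.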
You're right that there's no essential reason to relate things back to the reals, I was just using that to illustrate the difficulty.
I was thinking about this a little over the last few days and it occurred to me that one model for what you are discussing might actually be an infini...
No problem.
Improper priors are generally only considered in the case of continuous distributions, so 'sum' is probably not the right term; 'integrate' is usually used.
I used the term 'weight' to signify an integral because of how I usually intuit probability measures. Say you have a random variable X that takes values in the real line; the probability that it takes a value in some subset S of the real line would be the integral over S with respect to the given probability measure.
There's a good discussion of this way of viewing probability distributions in ...
I think you're making an important point about the uncertainty of what impact our actions will have. However, I think the right way to go about handling this issue is to put a bound on which impacts of our actions are likely to be significant.
As an extreme example, I think I have seen much evidence that clapping my hands once right now will have essentially no impact on the people living in Tripoli. Very likely clapping my hands will only affect myself (as no one is presently around) and probably in no huge way.
I have not done a formal statistical model to a...
There's another issue too, which is that it is extraordinarily complicated to assess what the ultimate outcome of a particular behavior is. I think this opens up a statistical question of what kinds of behaviors are "significant", in the sense that if you are choosing between A and B, is it possible to distinguish A and B or are they approximately the same.
In some cases they won't be, but I think that in very many they would.
What topology are you putting on this set?
I made the point about the real numbers because it shows that putting a non-informative prior on the infinite bidirectional sequences should be at least as hard as for the real numbers (which is non-trivial).
Usually a regularity is defined in terms of a particular computational model, so if you picked Turing machines (or the variant that works with bidirectional infinite tape, which is basically the same class as infinite tape in one direction), then you could instead begin constructing your prior in terms of Turing machines. I don't know if that helps any.
You can actually simulate a tremendous number of distributions (and theoretically any, to an arbitrary degree of accuracy) by applying an approximate inverse CDF to a standard uniform random variable (see here for example). So the space of distributions from which you could select to do your test is potentially infinite. We can then think of your selection of a probability distribution as being a random experiment and model your selection process using a probability distribution.
The issue is that since the outcome space is the space of all computable p...
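The inverse-CDF construction mentioned above can be sketched in a few lines; here the exponential distribution is the worked case (an illustration of the general trick, not the linked example):

```python
import math
import random

# Inverse-CDF (inverse transform) sampling: if U ~ Uniform(0,1) and F is a
# CDF, then F^{-1}(U) is distributed according to F. For Exponential(rate)
# the inverse CDF is F^{-1}(u) = -ln(1 - u) / rate.
def sample_exponential(rate, n, rng):
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]

rng = random.Random(0)
draws = sample_exponential(2.0, 100_000, rng)
print(sum(draws) / len(draws))  # should be close to the true mean 1/rate = 0.5
```

Any distribution whose inverse CDF you can approximate numerically can be plugged into the same template, which is why the space of simulable distributions is so large.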
In finite-dimensional parameter spaces, sure, this makes perfect sense. But suppose that we are considering a stochastic process X1, X2, X3, .... where Xn follows a distribution Pn over the integers. Now put a prior on the distribution and suppose that, unbeknownst to you, Pn is the distribution that puts 1/2 probability weight on -n and 1/2 probability weight on n. If the prior on the stochastic process does not put increasing weight on integers with large absolute value, then in the limit the prior puts zero probability weight on the true distribution (a...
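To make the escaping-mass point concrete, take a fixed prior over the integers with geometrically decaying tails (a toy choice of mine, not from the example above): the weight it assigns to the true support {-n, n} of Pn shrinks to zero as n grows.

```python
# A fixed prior on the integers: q(k) proportional to 2^{-|k|}.
# Sum over all integers of 2^{-|k|} is 1 + 2*(1/2 + 1/4 + ...) = 3,
# so q(k) = 2^{-|k|} / 3 is a proper probability mass function.
def q(k):
    return 2.0 ** (-abs(k)) / 3.0

# The prior weight on the true support {-n, n} of Pn vanishes with n.
for n in (1, 5, 20):
    print(n, q(-n) + q(n))
```

Any fixed prior with summable tails behaves this way, which is exactly why the prior has to spread increasing weight toward large |k| to keep the true sequence of distributions in view.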
There's a difficulty with your experimental setup in that you implicitly are invoking a probability distribution over probability distributions (since you represent a random choice of a distribution). The results are going to be highly dependent upon how you construct your distribution over distributions. If your outcome space for probability distributions is infinite (which is what I would expect), and you sampled from a broad enough class of distributions then a sampling of 25 data points is not enough data to say anything substantive.
A friend of yours...
I think what Shalizi means is that a Bayesian model is never "wrong", in the sense that it is a true description of the current state of the ideal Bayesian agent's knowledge. I.e., if A says an event X has probability p, and B says X has probability q, then they aren't lying even if p!=q. And the ideal Bayesian agent updates that knowledge perfectly by Bayes' rule (where knowledge is defined as probability distributions of states of the world). In this case, if A and B talk with each other then they should probably update, of course.
In frequen...
I suppose it depends what you want to do, first I would point out that the set is in a bijection with the real numbers (think of two simple injections and then use Cantor–Bernstein–Schroeder), so you can use any prior over the real numbers. The fact that you want to look at infinite sequences of 0s and 1s seems to imply that you are considering a specific type of problem that would demand a very particular meaning of 'non-informative prior'. What I mean by that is that any 'noninformative prior' usually incorporates some kind of invariance: e.g. a uniform prior on [0,1] for a Bernoulli distribution is invariant with respect to the true value being anywhere in the interval.
This isn't always the case if the prior puts zero probability weight on the true model. This can be avoided on finite outcome spaces, but for infinite outcome spaces no matter how much evidence you have you may not overcome the prior.
I've had some training in Bayesian and frequentist statistics and I think I know enough to say that it would be difficult to give a "simple" and satisfying example. The reason is that if one is dealing with finite-dimensional statistical models (this is where the parameter space of the model is finite-dimensional) and one has chosen a prior for those parameters such that there is non-zero weight on the true values, then the Bernstein-von Mises theorem guarantees that the Bayesian posterior distribution and the maximum likelihood estimate converge to the same...
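For intuition, the simplest finite-dimensional case can be checked numerically: with a Beta(1,1) prior on a coin's bias, the posterior mean and the MLE draw together as data accumulate (a sketch of the phenomenon, not of the theorem itself):

```python
import random

# Beta-Bernoulli: with a Beta(a, b) prior and k heads in n flips, the
# posterior is Beta(a + k, b + n - k), whose mean is (a + k) / (a + b + n).
# The MLE is k / n; the gap between them is at most 1/(n + 2).
def posterior_mean(k, n, a=1.0, b=1.0):
    return (a + k) / (a + b + n)

rng = random.Random(1)
p_true = 0.3
for n in (10, 1000, 100_000):
    k = sum(rng.random() < p_true for _ in range(n))
    print(n, abs(posterior_mean(k, n) - k / n))  # shrinks roughly like 1/n
```

The interesting disagreements between the two approaches therefore live in infinite-dimensional settings like the one above, where the theorem's assumptions fail.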
I am uneasy with that sentiment although I'm having a hard time putting my finger on exactly why. But this is how I see it: there are vastly more people in the world than I could possibly ever help, and some of them are so poor and downtrodden that they spend most of their money on food, since they can't afford luxuries such as drugs. Eventually, I might give money to the drug user if I had solved all the other problems first, but I would prefer my money to be spent on something more essential for survival before I turn to subsidizing people's luxury spending.
Imposing my values on somebody seems to more aptly describe a situation where I use authority to compel the drug user to not use drugs.
Would a simple solution to this be to say plan a date each year to give away some quantity of money? You could keep a record of all the times you gave money to a beggar, or you could use a simple model to estimate how much you probably would have given, then you can send that amount to a worthwhile charity.
When I get more money that's what I plan on doing.
Also, I'd like to note that the post here included nigh-Yudkowskian levels of cross-linking to other material on LW. When we're talking about "conversation norms on LW", how is that not solid data?
The evidence presented is a number of anecdotes from LW conversation. A full analysis of LW would need to categorize different types of offending comments, and discuss their frequency and what role they play in LW discussion. Even better would be to identify who makes them, etc.
Although I do find it plausible that LW should enact a policy of altering present discussions of gender, I certainly will not say the evidence presented is "overwhelming".
This isn't precisely what Daniel_Burfoot was talking about, but it's a related idea based on "sparse coding", and it has recently obtained good results in classification:
http://www.di.ens.fr/~fbach/icml2010a.pdf
Here the "theories" are hierarchical dictionaries (so a discrete hierarchy index set plus a set of vectors) which perform a compression (by creating reconstructions of the data). Although they weren't developed with this in mind, support vector machines do this as well, since one finds a small number of "support vectors...
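To illustrate the sparse-coding idea in the simplest possible terms, here is plain matching pursuit over a toy dictionary (my own minimal sketch, not the hierarchical method of the linked paper):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(signal, atoms, n_steps):
    """Greedy sparse coding: approximate `signal` as a combination of a
    few unit-norm dictionary atoms, each chosen by its correlation with
    the current reconstruction residual."""
    residual = list(signal)
    code = {}
    for _ in range(n_steps):
        best = max(range(len(atoms)), key=lambda i: abs(dot(residual, atoms[i])))
        coef = dot(residual, atoms[best])
        code[best] = code.get(best, 0.0) + coef
        residual = [r - coef * a for r, a in zip(residual, atoms[best])]
    return code, residual

# Toy dictionary: the standard basis of R^3 (already unit-norm).
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
code, residual = matching_pursuit([3.0, 0.0, 4.0], atoms, n_steps=2)
print(code)      # a sparse code: only two atoms are active
print(residual)  # what the sparse reconstruction fails to explain
```

The sparse code plus the dictionary is the "compressed theory" of the signal; the residual measures how much of the data the theory leaves unexplained.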