maxikov comments on Open thread, Nov. 10 - Nov. 16, 2014 - Less Wrong Discussion
When I was trying to make sense of Peter Watts' Echopraxia, it occurred to me that there may be two vastly different but both viable kinds of epistemology.
First is the classical hypothesis-driven epistemology, promoted by positivists and Popper, and generalized by Bayesian epistemology and Solomonoff induction. In the most general version, you have to come up with a set of hypotheses with assigned probabilities, and look for information that would change the entropy of this set the most. It's a good idea. It formalizes what is science and what is not; it provides a framework for research; and, given infinite computing power on a hypercomputer, it extracts the theoretical maximum of utility from sensory information. The main problem is that it doesn't provide an algorithmic way to come up with hypotheses, and the suggestion to test infinitely many of them (aleph-1, as far as I can tell) isn't very helpful either.
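The entropy-reduction idea can be sketched concretely: given hypotheses with assigned probabilities, the value of an observation is how much it is expected to shrink the entropy of the set after a Bayes update. A toy sketch (the three hypotheses and their coin-flip likelihoods are invented purely for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def posterior(priors, likelihoods):
    """Bayes update: P(h | obs) is proportional to P(obs | h) * P(h)."""
    unnorm = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Three toy hypotheses about a coin, with priors and P(heads | hypothesis).
priors  = [0.5, 0.3, 0.2]
p_heads = [0.5, 0.9, 0.1]

# Expected entropy of the hypothesis set after observing one flip.
p_obs_heads = sum(p * l for p, l in zip(priors, p_heads))
post_heads = posterior(priors, p_heads)
post_tails = posterior(priors, [1 - l for l in p_heads])
expected_h = (p_obs_heads * entropy(post_heads)
              + (1 - p_obs_heads) * entropy(post_tails))

print(entropy(priors), expected_h)  # the flip is expected to reduce entropy
```

Comparing `entropy(priors)` with `expected_h` across several candidate observations is what "look for information that would change the entropy of this set the most" amounts to.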
On the other hand, you can imagine a data-driven epistemology, where you don't really formulate any hypotheses. You just have a lot of pattern-matching power, completely agnostic of the knowledge domain, and you use it to try to find any regularities, predictability, clustering, etc. in the sensory data. Then you just check whether any of the discovered knowledge is useful. This can barely (if at all) distinguish correlation from causation, it does not really distinguish scientific from non-scientific beliefs, and it doesn't even guarantee that the findings will be meaningful. However, it does work algorithmically, even with finite resources.
They actually go together rather nicely, with data-driven epistemology serving as the source of hypotheses for hypothesis-driven epistemology. However, Watts seems to be arguing that given enough computing power, you'd be better off spending it on data-driven pattern matching than on generating and testing hypotheses. And since brains are generally good at pattern matching, System 1, slightly tweaked with yet-to-be-invented technologies, could potentially vastly outperform System 2 running hypothesis-driven epistemology. I wonder to what extent that may actually be true.
Reminds me of "The Cactus and the Weasel".
The philosopher Isaiah Berlin originally proposed a (tongue-in-cheek) classification of people into "hedgehogs", who have a single big theory that explains everything and view the world in that light, and "foxes", who have a large number of smaller theories that they use to explain parts of the world. Later on, the psychologist Philip Tetlock found that people who were closer to the "fox" end of the spectrum tended to be better at predicting future events than the "hedgehogs".
In "The Cactus and the Weasel", Venkat constructs an elaborate hypothesis of the kinds of belief structures that "foxes" and "hedgehogs" have and how they work, talking about how a belief can be grounded in a small number of fundamental elements (typical for hedgehogs) or in an intricate web of other beliefs (typical for foxes). The whole essay is worth reading, but a few excerpts that are related to what you just wrote:
That is very interesting and definitely worth reading. One thing, though: it seems to me that a rationalist hedgehog should be capable of discarding their beliefs if the incoming information seems to contradict them.
When you say "pattern-matching," what do you mean? Because when I imagine pattern-matching, I imagine that one has a library of patterns, which are matched against sensory data, and that library of patterns is the 'hypotheses.'
But where does this library come from? It seems to be something along the lines of "if you see it once, store it as a pattern, and increase the relevance as you see it more times / decrease or delete if you don't see it enough" which looks like an approximation to "consider all hypotheses, updating their probability upward when you see them and try to keep total probability roughly balanced."
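The store/reinforce/decay loop described above can be sketched minimally (the decay rate and deletion threshold here are arbitrary choices for illustration, not anything from the original comment):

```python
class PatternLibrary:
    """Toy pattern store: seen patterns gain weight, unseen ones decay away."""

    def __init__(self, decay=0.9, threshold=0.1):
        self.weights = {}            # pattern -> relevance weight
        self.decay = decay           # multiplicative decay per observation
        self.threshold = threshold   # below this weight, a pattern is forgotten

    def observe(self, pattern):
        # Decay every stored pattern, then reinforce the one just seen.
        for p in list(self.weights):
            self.weights[p] *= self.decay
            if self.weights[p] < self.threshold:
                del self.weights[p]
        self.weights[pattern] = self.weights.get(pattern, 0.0) + 1.0

lib = PatternLibrary()
for obs in ["cat", "cat", "dog", "cat"]:
    lib.observe(obs)
# "cat" has accumulated more weight than the once-seen "dog"
```

Viewed this way, the weights play the role of unnormalized probabilities over a growing hypothesis set, which is the approximation being gestured at above.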
That is, I think we agree; but I think when we use phrases like "pattern-matching" it helps to be explicit about what we're talking about. Distinguishing between patterns and hypotheses is dangerous!
Probably a better term would be "unsupervised learning". For example, deep learning and various clustering algorithms allow us to figure out whether the data has any sort of non-temporal regularity. Or we may try to see if the data predicts itself: if we see X, in Y seconds we'll see Z. That doesn't seem to be equivalent to considering infinitely many hypotheses. In Solomonoff induction, a hypothesis is an algorithm capable of generating the data, and based on new incoming information we can decide whether the algorithm fits the data or not. In unsupervised learning, on the other hand, we don't necessarily have an underlying model, or the model may not be generative.
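The "if we see X, in Y seconds we'll see Z" idea can be sketched as a transition-counting predictor that finds regularities without any generative model of where the data comes from (the toy weather sequence is invented):

```python
from collections import Counter, defaultdict

def learn_transitions(sequence):
    """Count, for each observed symbol, what tends to follow it."""
    followers = defaultdict(Counter)
    for x, z in zip(sequence, sequence[1:]):
        followers[x][z] += 1
    return followers

def predict(followers, x):
    """Predict the most common successor of x, or None if x is unseen."""
    return followers[x].most_common(1)[0][0] if followers[x] else None

seq = ["sunny", "sunny", "rain", "sunny", "rain", "sunny", "sunny", "rain"]
model = learn_transitions(seq)
print(predict(model, "rain"))   # "sunny" has always followed "rain" here
```

Note that the counts describe the data without proposing any program that generated it, which is the contrast with Solomonoff induction being drawn above.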
I think it's useful to think of the parameter-space for your model as the hypothesis-space. Saying "our parameter-space is R^600" instead of "our parameter-space is all possible algorithms" is way more reasonable and computable, but what it would mean for an unsupervised learning algorithm to have no hypotheses would be that it has no parameters (which would be worthless!). Remember that we need to seed our neural nets with random parameters so that different parts develop differently, and our clustering algorithms need to be seeded with different cluster centers.
Does it mean then that neural networks start with a completely crazy model of the real world, and slowly modify this model to better fit the data, as opposed to jumping between model sets that fit the data perfectly, as Solomonoff induction does?
This seems like a good description to me.
I'm not an expert in Solomonoff induction, but my impression is that each model set is a subset of the model set from the last step. That is, you consider every possible output string (implicitly) by considering every possible program that could generate those strings, and I assume stochastic programs (like 'flip a coin n times and output 1 for heads and 0 for tails') are expressed by some algorithmic description followed by the random seed (so that the algorithm itself is deterministic, but the set of algorithms for all possible seeds meets the stochastic properties of the definition).
As we get a new piece of the output string--perhaps we see it move from "1100" to "11001"--we rule out any program that would not have output "11001," which includes about half of our surviving coin-flip programs and about 90% of our remaining 10-sided die programs. So the class of models that "fit the data perfectly" is a very broad class of models, and you could imagine neural networks as estimating the mean of that class of models instead of every instance of the class and then taking the mean of them.
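The ruling-out step can be sketched with a toy program class: deterministic "coin programs" indexed by seed, each of which outputs a fixed 6-bit string. Observing a longer prefix of the data just filters the surviving set (the program class is invented for illustration; real Solomonoff induction ranges over all programs):

```python
import itertools

def coin_programs(n):
    """All deterministic n-flip programs: one per seed, output fixed in advance."""
    return ["".join(bits) for bits in itertools.product("01", repeat=n)]

def surviving(programs, observed):
    """Keep only programs whose output starts with the observed string."""
    return [p for p in programs if p.startswith(observed)]

programs = coin_programs(6)           # 64 candidate programs
after_4 = surviving(programs, "1100")
after_5 = surviving(programs, "11001")

print(len(after_4), len(after_5))     # 4, 2: each new bit halves the survivors
```

The survivors after "11001" are a subset of the survivors after "1100", matching the point above that each model set is a subset of the previous step's.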