Are search engines perpetuating our biases?
One of the great things about the internet is that there is a social group for almost every interest. Pick an unusual hobby or ideology, and there is probably an online community centered around it. This is especially wonderful for those of us who never quite fit into mainstream society.
But there's also a downside to this aspect of the internet, which is that the more we immerse ourselves in these small online communities, the less exposure we get to the rest of the world. And the less exposure we get to the rest of the world, the easier it is for us to hold onto false beliefs that the rest of the world rejects. (Of course, it's also easier to hold onto true beliefs that the rest of the world rejects.)
For instance, suppose you believe that pasteurizing milk makes it less healthy, and we should all drink our milk raw. (I picked this example because it's something decidedly non-mainstream that I believe with high probability.) I'm fairly susceptible to social pressures, so at least for me, my belief in this proposition goes up when I'm hanging out with intelligent people who agree with me, and it goes down when I'm hanging out with intelligent people who look at me like I'm insane when I claim such a thing. They don't need to state evidence in either direction to influence my belief-probability, though that certainly helps. The important thing is that I think they're smart and therefore I trust their opinions.
Unsurprisingly, if I spend most of my time hanging out with normal, intelligent, scientifically-minded Americans, I start to question my beliefs regarding raw milk, but if I spend all my time on raw-milk-promoting websites, then my belief that raw milk is good for us is reaffirmed.
We like having our beliefs affirmed; it makes us happy when other people think we are right about things. We'd rather seek out people who agree with us and can relate to our mindsets than seek out groups where everyone disagrees with us strongly. This is normal and reasonable, and it's why all of us rationalists are hanging out here on LessWrong instead of lurking in creationist forums. However, it does put us at risk of creating feedback loops: unusual ideas are proposed by people we respect, we affirm them, others repeat them, and their growing prevalence causes them to be repeated still more. Many of those who disagree are hesitant to voice their disagreements for fear of rejection. As a result, LessWrong perpetuates many ideas that the rest of the world considers somewhat odd. Also, the rest of the world perpetuates many ideas that we at LessWrong consider extremely odd.
I'm not saying anything new here, I know. Everything I've written so far has been discussed to death on LessWrong, and if I were less lazy this article would be full of links to the sequences. If I recall correctly, the sequences recommend countering this problem by recognizing that we have these biases, and consciously trying to correct for them.
I try to do this, but I also tend to employ an additional solution to this problem. Because I recognize that I'm easily influenced by others' beliefs, I make sure to expose myself to a myriad of different belief systems. For instance, in politics, I read blogs by liberal feminist scientists as well as conservative anti-feminist traditionalists. Since I respect the authors of all the blogs I read, and recognize that they are intelligent people who have thought deeply about their perspectives, I can't easily dismiss either perspective outright as lies spoken by a moron. Since their beliefs differ so radically, I also can't just fall into the trap of believing everything I read. So I'm forced to really think about the ideas, and question why their proponents believe them, and consolidate them (and other thoughts I might have) into my own coherent worldview.
Thus, I consider it important to be exposed to the ideas of people I disagree with. Meeting intelligent people who think differently than I do keeps my mind open, and reminds me that there are things about the world that I don't know yet, and keeps me from overestimating the probability that my beliefs are true.
Unfortunately, search engines like Google are making it more difficult for me to do so. About a week ago, I attended a lecture on information retrieval, and I was shocked to find out exactly how much our Google searches are customized to our own preferences.
Suppose John and Mary both Google something like "creationism". Now suppose that John is an atheist who reads a lot of atheist forums, and Mary is a fundamentalist Christian who spends most of her time on Christian forums. John's Google results might contain a lot of links to people on his favorite atheist website talking about how much creationism sucks, and Mary's Google results might contain a lot of links to her friends' blogs talking about how God created the earth.
In this example, John and Mary are both having their beliefs reaffirmed, because Google is presenting them with things they want to hear. They will not be exposed to opposing viewpoints, and will be much less likely to change their minds about important issues. In fact, their beliefs in their own viewpoints will probably grow stronger and stronger each time Google gives them back these results, and they will become less and less aware that another viewpoint exists.
Of course, this might happen without Google filtering its search results. John might deliberately avoid reading the views of creationists, or dismiss them outright as moronic, or not ever Google anything that might lead him to their webpages, because he is convinced of his beliefs and would rather have them affirmed than contradicted. Since he would skip past the fundamentalist Christian blog results anyway, Google is doing him a service by ranking the stuff he cares about higher.
But at least for me, this Google filtering is a bad thing. I want to see other webpages which present other viewpoints, instead of being led back to the same places over and over again. And if Google doesn't show them to me when I search for them, and I don't realize that my Google search results are being customized, I might never realize there's something I'm missing, or go to look for it.
I'm probably making this sound more dire than it actually is. Obviously, I can try other search terms, or just ignore websites I've already been to. Or I can follow links on other websites and wander off into regions of the internet without the help of Google. But I still have a visceral reaction against search engines customizing their results to fit my individual ideological preferences, because they are perpetuating my biases without giving me any direct control over which pieces of information I receive.
What do you guys think?
The Joys of Conjugate Priors
(Warning: this post is a bit technical.)
Suppose you are a Bayesian reasoning agent. While going about your daily activities, you observe an event of type x. Because you're a good Bayesian, you have some internal parameter θ which represents your belief that x will occur.
Now, you're familiar with the Ways of Bayes, and therefore you know that your beliefs must be updated with every new datapoint you perceive. Your observation of x is a datapoint, and thus you'll want to modify θ. But how much should this datapoint influence θ? Well, that will depend on how sure you are of θ in the first place. If you calculated θ based on a careful experiment involving hundreds of thousands of observations, then you're probably pretty confident in its value, and this single observation of x shouldn't have much impact. But if your estimate of θ is just a wild guess based on something your unreliable friend told you, then this datapoint is important and should be weighted much more heavily in your reestimation of θ.
Of course, when you reestimate θ, you'll also have to reestimate how confident you are in its value. Or, to put it a different way, you'll want to compute a new probability distribution over possible values of θ. This new distribution will be p(θ|x), and it can be computed using Bayes' rule:

    p(θ|x) = p(x|θ) p(θ) / ∫ p(x|θ′) p(θ′) dθ′
Here, since θ is a parameter used to specify the distribution from which x is drawn, it can be assumed that computing p(x|θ) is straightforward. p(θ) is your old distribution over θ, which you already have; it says how accurate you think different settings of the parameter are, and allows you to compute your confidence in any given value of θ. So the numerator should be straightforward to compute; it's the denominator which might give you trouble, since for an arbitrary distribution, computing the integral is likely to be intractable.
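When the integral really has no closed form, one fallback for a one-dimensional parameter is to normalize numerically on a grid. Here is a minimal sketch; the Gaussian-shaped bump prior is invented for illustration and is not any distribution from this post:

```python
import numpy as np

# Sketch of the "messy denominator" problem: with a non-conjugate prior,
# the normalizing integral may lack a closed form, so we approximate it
# with a Riemann sum over a grid of parameter values.

theta = np.linspace(0.0005, 0.9995, 1000)        # grid over theta in (0, 1)
dtheta = theta[1] - theta[0]
prior = np.exp(-((theta - 0.4) ** 2) / 0.02)     # arbitrary unnormalized prior
x = 1                                            # one observed flip: heads

likelihood = theta ** x * (1 - theta) ** (1 - x)  # Bernoulli p(x | theta)
unnorm = likelihood * prior                       # numerator of Bayes' rule
posterior = unnorm / (unnorm.sum() * dtheta)      # divide by grid integral

post_mean = (theta * posterior).sum() * dtheta    # shifts above 0.4: we saw heads
```

This works fine for one parameter, but the cost blows up exponentially with dimension, which is exactly why keeping the distributions simple matters.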
But you're probably not really looking for a distribution over different parameter settings; you're looking for a single best setting of the parameter that you can use for making predictions. If this is your goal, then once you've computed the distribution p(θ|x), you can pick the value of θ that maximizes it. This will be your new parameter, and because you have the formula for p(θ|x), you'll know exactly how confident you are in this parameter.
In practice, picking the value of θ which maximizes p(θ|x) is usually pretty difficult, thanks to the presence of local optima, as well as the general difficulty of optimization problems. For simple enough distributions, you can use the EM algorithm, which is guaranteed to converge to a local optimum. But for more complicated distributions, even this method is intractable, and approximate algorithms must be used. Because of this concern, it's important to keep the distributions p(x|θ) and p(θ) simple. Choosing the distribution p(x|θ) is a matter of model selection; more complicated models can capture deeper patterns in data, but will take more time and space to compute with.
It is assumed that the type of model is chosen before deciding on the form of the distribution p(θ). So how do you choose a good distribution for p(θ)? Notice that every time you see a new datapoint, you'll have to do the computation in the equation above. Thus, in the course of observing data, you'll be multiplying lots of different probability distributions together. If these distributions are chosen poorly, p(θ|x) could get quite messy very quickly.
If you're a smart Bayesian agent, then, you'll pick p(θ) to be a conjugate prior to the distribution p(x|θ). The distribution p(θ) is conjugate to p(x|θ) if multiplying these two distributions together and normalizing results in another distribution of the same form as p(θ).
Let's consider a concrete example: flipping a biased coin. Suppose you use the Bernoulli distribution to model your coin. Then it has a parameter θ which represents the probability of getting heads. Assume that the value 1 corresponds to heads, and the value 0 corresponds to tails. Then the distribution of the outcome x of the coin flip looks like this:

    p(x|θ) = θ^x (1−θ)^(1−x)
It turns out that the conjugate prior for the Bernoulli distribution is something called the beta distribution. It has two parameters, α and β, which we call hyperparameters because they are parameters for a distribution over our parameters. (Eek!) The beta distribution looks like this:

    p(θ; α, β) = θ^(α−1) (1−θ)^(β−1) / B(α, β),   where B(α, β) = ∫₀¹ θ^(α−1) (1−θ)^(β−1) dθ

Since θ represents the probability of getting heads, it can take on any value between 0 and 1, and the constant B(α, β) ensures that this function is normalized properly.
Suppose you observe a single coin flip x and want to update your beliefs regarding θ. Since the denominator of the beta function in the equation above is just a normalizing constant, you can ignore it for the moment while computing p(θ|x), as long as you promise to normalize after completing the computation:

    p(θ|x) ∝ p(x|θ) p(θ; α, β) ∝ θ^x (1−θ)^(1−x) · θ^(α−1) (1−θ)^(β−1) = θ^(x+α−1) (1−θ)^((1−x)+β−1)

Normalizing this equation will, of course, give another beta distribution, confirming that this is indeed a conjugate prior for the Bernoulli distribution. Super cool, right?
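In code, this whole update collapses to bumping a count. A minimal sketch (the function name is my own, not anything standard):

```python
# Conjugate update for the beta-Bernoulli pair: observing one flip x turns
# a Beta(alpha, beta) prior into a Beta posterior by incrementing a count.

def update_beta(alpha, beta, x):
    """Posterior hyperparameters after one coin flip x (1 = heads, 0 = tails)."""
    return alpha + x, beta + (1 - x)

# start from a uniform Beta(1, 1) prior and observe heads, heads, tails
alpha, beta = 1.0, 1.0
for flip in [1, 1, 0]:
    alpha, beta = update_beta(alpha, beta, flip)

print(alpha, beta)  # 3.0 2.0 -- a Beta(3, 2) posterior
```

No integrals ever need to be computed: conjugacy guarantees the posterior stays in the beta family, so tracking (α, β) is the entire computation.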
If you are familiar with the binomial distribution, you should see that the numerator of the beta distribution in the equation for p(θ; α, β) looks remarkably similar to the non-factorial part of the binomial distribution. This suggests a form for the normalization constant:

    B(α, β) = Γ(α) Γ(β) / Γ(α + β)
The beta and binomial distributions are almost identical. The biggest difference between them is that the beta distribution is a function of θ, with α and β as prespecified parameters, while the binomial distribution is a function of the number of heads k, with the number of flips n and the heads probability θ as prespecified parameters. It should be clear that the beta distribution is also conjugate to the binomial distribution, making it just that much awesomer.
Another difference between the two distributions is that the beta distribution uses gammas where the binomial distribution uses factorials. Recall that the gamma function is just a generalization of the factorial to the reals; thus, the beta distribution allows α and β to be any positive real number, while the binomial distribution is only defined for integers. As a final note on the beta distribution, the −1 in the exponents is not philosophically significant; I think it is mostly there so that the gamma functions will not contain +1s. For more information about the mathematics behind the gamma function and the beta distribution, I recommend checking out this pdf: http://www.mhtl.uwaterloo.ca/courses/me755/web_chap1.pdf. It gives an actual derivation which shows that the integral form of B(α, β) is equivalent to the gamma-function form, which is nice if you don't find the argument by analogy to the binomial distribution convincing.
So, what is the philosophical significance of the conjugate prior? Is it just a pretty piece of mathematics that makes the computation work out the way we'd like it to? No; there is deep philosophical significance to the form of the beta distribution.
Recall the intuition from above: if you've seen a lot of data already, then one more datapoint shouldn't change your understanding of the world too drastically. If, on the other hand, you've seen relatively little data, then a single datapoint could influence your beliefs significantly. This intuition is captured by the form of the conjugate prior. α and β can be viewed as keeping track of how many heads and tails you've seen, respectively. So if you've already done some experiments with this coin, you can store that data in a beta distribution and use that as your conjugate prior. The beta distribution captures the difference between claiming that the coin has a 30% chance of coming up heads after seeing 3 heads and 7 tails, and claiming that the coin has a 30% chance of coming up heads after seeing 3000 heads and 7000 tails.
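This difference is easy to check numerically: Beta(3, 7) and Beta(3000, 7000) have the same mean but wildly different spreads. A small sketch using the standard closed-form mean and variance of the beta distribution:

```python
import math

# Mean and standard deviation of Beta(alpha, beta), from the closed forms
#   mean = alpha / (alpha + beta)
#   var  = alpha * beta / ((alpha + beta)^2 (alpha + beta + 1))

def beta_mean_and_sd(alpha, beta):
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, math.sqrt(var)

# the same 30% heads rate, backed by very different amounts of evidence
print(beta_mean_and_sd(3, 7))        # mean 0.3, sd ~0.138: still very unsure
print(beta_mean_and_sd(3000, 7000))  # mean 0.3, sd ~0.0046: quite confident
```

Both posteriors say "about 30% heads", but only the second one says it with conviction, which is exactly the distinction the hyperparameters preserve.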
Suppose you haven't observed any coin flips yet, but you have some intuition about what the distribution should be. Then you can choose values for α and β that represent your prior understanding of the coin. Higher values of α and β indicate more confidence in your intuition; thus, choosing the appropriate hyperparameters is a method of quantifying your prior understanding so that it can be used in computation. α and β will act like "imaginary data": when you update your distribution over θ after observing a coin flip x, it will be as if you had already seen α − 1 heads and β − 1 tails before that coin flip.
If you want to express that you have no prior knowledge about the system, you can do so by setting α and β to 1. This will turn the beta distribution into a uniform distribution. You can also use the beta distribution to do add-N smoothing, by setting α and β to both be N+1. Setting the hyperparameters to a value lower than 1 causes them to act like "negative data", which helps avoid overfitting θ to noise in the actual data.
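As a sketch of how these hyperparameter choices behave (the helper function is hypothetical, and it reports the posterior mean rather than the MAP estimate):

```python
# Hyperparameters as prior knowledge: Beta(1, 1) is the uniform prior, and
# Beta(N+1, N+1) implements add-N smoothing of the estimated heads rate.

def smoothed_heads_estimate(heads, tails, N):
    """Posterior mean of theta under a Beta(N+1, N+1) prior."""
    alpha = N + 1 + heads
    beta = N + 1 + tails
    return alpha / (alpha + beta)

# three heads, zero tails observed
print(smoothed_heads_estimate(3, 0, 0))   # 0.8: uniform prior tempers 3/3 heads
print(smoothed_heads_estimate(3, 0, 10))  # 0.56: add-10 pulls hard toward 0.5
```

Note that even the "uninformative" uniform prior keeps the estimate away from the raw frequency of 1.0, which is the smoothing doing its job.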
In conclusion, the beta distribution, which is a conjugate prior to the Bernoulli and binomial distributions, is super awesome. It makes it possible to do Bayesian reasoning in a computationally efficient manner, and it has the philosophically satisfying interpretation of representing real or imaginary prior data. Other conjugate priors, such as the Dirichlet prior for the multinomial distribution, are similarly cool.
Occam's Razor, Complexity of Verbal Descriptions, and Core Concepts
Occam's razor, as it is popularly known, states that "the simplest answer is most likely to be correct"1. It has been noted in other discussion threads that the word "simplest" is somewhat misleading, and that it actually means something along the lines of "easiest to express concisely using natural language". Occam's razor typically comes into play when we are trying to explain some observed phenomenon, or, in terms of model-building, when we are trying to come up with a model for our observations. The verbal complexity of a new model will depend on the models that already exist in the observer's mind, since, as humans, we express new ideas in terms of concepts with which we are already familiar.
Thus, when applied to natural language, Occam's razor encourages descriptions that are most in line with the observer's existing worldview, and discourages descriptions that seem implausible given the observer's current worldview. Since our worldviews are typically very accurate2, this makes sense as a heuristic.
As an example, if a ship sank in the ocean, a simple explanation would be "a storm destroyed it", and a complicated explanation would be "a green scaly sea-dragon with three horns destroyed it". The first description is simple because we frequently experience storms, and so we have a word for them, whereas most of us never experience green scaly sea-dragons with three horns, and so we have to describe them explicitly. If the opposite were the case, we'd have some word for the dragons (maybe they'd be called "blicks"), and we would have to describe storms explicitly. Then the descriptions above could be reworded as "rain falling from the sky, accompanied by strong gusts of wind and possibly gigantic electrical discharges, destroyed the ship" and "a blick destroyed the ship", respectively.
What I'm getting at is that different explanations will have different complexities for different people; the complexity of a description to a person will depend on that person's collection of life-experiences, and everyone has a different set of life-experiences. This leads to an interesting question: are there universally easy-to-describe concepts? (By universally I mean cross-culturally.) It seems reasonable to claim that a concept C is easy-to-describe for a culture if that culture's language contains a word that means C; it should be a fairly common word and everyone in the culture should know what it means.
So are there concepts that every language has a word for? Apparently, yes. In fact, the linguist Morris Swadesh came up with exactly such a list of core vocabulary terms. Unsurprisingly from an information theoretic perspective, the English versions of the words on this list are extremely short: most are one syllable, and the consonant clusters are small.
Presumably, if you wanted to communicate an idea to someone from a very different culture, and you could express that idea in terms of the core concepts, then you could explain your idea to that person. (Such an expression would likely require symbolism/metaphor/simile, but these are valid ways of expressing ideas.) Alternatively, imagine trying to explain a complicated idea to a small child; this would probably involve expressing the concept in terms of more concrete, fundamental ideas and objects.
Where does this core vocabulary come from? Is it just that these are the only concepts that basically all humans will be familiar with? Or is there a deeper explanation, like an a priori encoding of these concepts in our brains?
I bring all of this up because it is relevant to the question of whether we could communicate with an artificial intelligence if we built one, and whether this AI would understand the world similarly to how we do (I consider the latter a prerequisite for the former). Presumably an AI would reason about and attempt to model its environment, and presumably it would prefer models with simpler descriptions, if only because such models would be more computationally efficient to reason with. But an AI might have a different definition of "simple description" than we do as humans, and therefore it might come up with very different explanations and understandings of the world, or at the very least a different hierarchy of concepts. This would make communication between humans and AIs difficult.
If we encoded the core vocabulary a priori in the AI's mind as some kind of basis of atomic concepts, would the AI develop an understanding of the world that was more in line with ours than it would if we didn't? And would this make it easier for us to communicate intellectually with the AI?
1 Note that Occam's razor does not say that the simplest answer is actually correct; it just gives us a distribution over models. If we want to build a model, we'll be considering p(model|data), which by Bayes' rule is equal to p(data|model)p(model)/p(data). Occam's razor is one way of specifying p(model). Apologies if this footnote is obvious, but I see this misinterpretation all over the place on the internet.
2 This may seem like a bold statement, but I'm talking about in terms of every-day life sort of things. If you blindfolded me and put me in front of a random tree during summer, in my general geographic region, and asked me what it looked like, I could give you a description of that tree and it would probably be very similar to the actual thing. This is because my worldview about trees is very accurate, i.e. my internal model of trees has very good predictive power.