The crazier, more-expensive, and more-difficult the method is, the more improvement it should show; craziness should filter out less-committed parents.
I think there'd be an inverted-U-shaped curve. Requiring some degree of effort will select for people who are willing to make good choices. At some point, craziness and expense will swamp out the good effects.
I have an example from food rather than schooling: vegetarians might be healthier than the average person. Vegans might not be, and it's even less likely for fruitarians. Breatharians certainly won't be healthier than average.
This Genomes Unzipped post discusses the possibility that henceforth genetic sequencing will produce self-fulfilling/self-amplifying prophecies. The kid who is told that he or she has a genetic advantage in sprinting versus marathoning will then invest more effort into the advantaged activity at the margin, increasing the predictive power of the relevant genetic variants.
Any new education method will show increases in student test scores if people believe it results in increases in student test scores, because only interested parents will sign up for that method.
For example, Freakonomics tells the story of high school students in Chicago who participated in a lottery for the chance to switch schools. The students who were reassigned to new schools were more likely to graduate; but the students who applied for the lottery but lost did just as well. The explanation given is that the students (or parents) who care about education will attempt to switch to a better school, but the "better" school won't confer an advantage.
Cullen, Jacob, and Levitt. "The Impact of School Choice on Student Outcomes: An Analysis of the Chicago Public Schools". J. Public Econ. 200?.
Cullen, Jacob, and Levitt. "The Effect of School Choice on Student Outcomes: Evidence from Randomized Lotteries". National Bureau of Economic Research working paper, 2003.
The following link lends credence to this line of thought:
http://www.management.wharton.upenn.edu/grant/Grant_JAP2008b_TaskSignificance.pdf
A note:
There was a study of how employee relationships affected job performance, which found that employees performed better simply because of the attention given to them, rather than because of the validity of any of the techniques introduced.
If anyone can provide a resource for that study, I'll vote you up because I am having trouble finding it.
If I remember correctly it is used in: O'Hair, Friedrich, Dixon 2008 Strategic Communication: In Business and the Professions, Pearson
It's called the Hawthorne effect. At least in the Wikipedia article, no one considers the possibility that mere variation (less boredom) improves performance.
I was thinking of that exact example with regard to the post's discussion of hypothetical Montessori schools.
The filtering effect still wouldn't vanish - instead of filtering FOR the most engaged parents it would filter AGAINST the least engaged (i.e. all the parents that cared would put their kids into Montessori, and the parents who didn't care would just put their kids into whatever school was the most convenient for them).
You are right. Unfortunately, "any parent who cared at all" is still a significant selection.
Also there are different kinds of "care". For example a parent may spend their time choosing the right school, and even pay more money for the school... and yet completely ignore what their children are doing at the school. The filter of Montessori schools may work on both these levels -- require both selection and later parent participation. I am not familiar with details of Montessori schools.
A data point: I taught at a school where student grades were not written on paper, but only on a school website. Parents received a password they could use to see their child's grades. During the year, 1/3 of the passwords were never used. That means that the parents of 1/3 of the children either did not care what grades their child had or at least did not care to verify what their child reported. It was a school that required special tests for a child to be accepted, and for many children it was inconveniently far from their home -- so parents did care about choosing the right school, but then 1/3 of them stopped caring about what happened in the school.
I guess this could be explained by signalling. Choosing your child's school is a one-time activity that signals that you are a great parent. Reading your child's grades online is invisible (unless someone curious looks into logs later), therefore useless for signalling.
Even a lie is a psychic fact. --Carl Jung, (1875-1961)
I find that helpful in reminding myself that beliefs, even false ones, can be causal.
This is one of Nature's cruelest jokes, as it makes things that appear to have the potential to improve your life look better than they are.
Do Montessori schools or home-schooling result in better scores on standardized tests?
Back in the late 1980s, neural networks were hot; and evaluations usually indicated that they outperformed other methods of classification. In the early 1990s, genetic algorithms were hot; and evaluations usually indicated that they outperformed other methods of classification. Today, support vector machines (SVMs) are hot; and evaluations usually indicate that they outperform other methods of classification. Neural networks and genetic algorithms no longer outperform older methods. (I write this from memory, so you shouldn't take it as gospel.)
I know you said you were writing from memory, but this paragraph above is very vague. 'Neural networks' in their many variants have been used (and still are used) in a wide variety of domains and applications, but mainly classification/recognition systems such as character recognition, speech recognition, machine vision in some cases, etc. Genetic algorithms (aka simulated annealing, evolutionary search) from what I recall are used in search and optimization applications, and I'm not aware of usage in recognition at all. Of course, using GA approaches to explore the space of NN designs is a powerful combination even today - there are numerous examples, but in the video game space, one of the most advanced and radical animation systems is driven by GA + NN approaches.
That's one example, but in general GA-like evolutionary search is a technique that is unlikely to ever be 'trumped' in the domains it reigns. It's simple, old, but effective, like quicksort or radix sort in some sense for its domains. Think of it this way. Fully exploring a search space (for a circuit design or NN virtual circuit or whatever) with a GA/evolutionary search is expensive, but if you have the time to spare and enough population diversity, a big general evolutionary search will eventually converge to the optimal solution.
There are many potential refinements on search, and many newer classification approaches, but many (such as SVM) are very similar to past approaches with a few minor but key improvements.
In a broader sense, all of technology always uses evolutionary search all the time, it is our current global meta-algorithm.
Fads can create self-fulfilling correlations. If neural networks are hot, the smartest people tend to work on neural networks. When you compare their results to other results, it can be difficult to look at neural networks vs., say, logistic regression; and factor out the smartest people vs. pretty smart people effect.
I'm skeptical about your conclusion for the publication bias, mainly because I find that even if it does exist, it's swamped out by other effects.
Science and engineering involve something like a landgrab in ideaspace. It's really, really difficult in the modern era to find an idea that is actually new. But that is exactly what you need to make your career. Scientists aren't like moths that are attracted to pretty flames; they are more like explorers and colonizers, who attempt to stake out novel claims. I believe this effect is quite strong, and this bias works in the exact opposite direction to that which you propose.
Researchers developing new techniques have a bias to over-represent their own results and under-represent older results.
All that being said, the two biases may both exist at different scales; like a gold rush effect drawing scientists into a domain combined with a strong local land-grab effect separating them into different regions of the space.
SIWOTI: Simulated annealing is not not not a genetic algorithm.
I admit these are fuzzy, confusing categories, and you can certainly create borderline examples, but most implementations of simulated annealing use a cooling schedule, which has no natural analog in the genetic algorithm, and many implementations of genetic algorithms use crossover, which has no natural analog in simulated annealing.
borderline examples
I give you Differential evolution Monte Carlo! It slices! It dices! It's a genetic algorithm and a simulated annealing algorithm! But wait -- there's more! You also get a self-tuning Metropolis sampler! (The proof of the correctness of the sampler isn't valid, but the theorem is true; the author gives a correct proof in a subsequent paper.)
Fair points. In my mind those techniques are all taxonomically related, close in conceptual space, but some of the boundaries are fuzzy. I'm not even sure what the general category is called, but 'evolutionary search' is what I roughly mean by "the space of algorithms that includes GA variants and simulated annealing, plus some other stuff". Perhaps they aren't that close, but that's where my map stands and what I meant.
But anyway, do no GA algorithms vary the mutation or crossover parameters over time? That would be closer to SA then. I thought the similarity between SA and GA is that they both explore a set of points in solution space simultaneously and thus are more generally robust vs single-point exploratory techniques.
Simulated annealing is a kind of single-point stochastic hill-climbing, with the temperature controlling how frequently a "downward" step is taken (that might bounce the single point out of a local optimum). It doesn't explore a set of points simultaneously - I mean, there might be a way of describing it as such, but that would be an exotic, insightful analogy, rather than the usual bland, boring sort of description.
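To make the single-point description concrete, here is a minimal sketch of that kind of annealer in Python. The geometric cooling schedule, the step size, and the bumpy test function `f` are all assumptions chosen for illustration, not anything from the thread; the only part taken from the description above is that one point moves, uphill steps are always taken, and "downward" steps are taken with a probability controlled by the temperature.

```python
import math
import random

def simulated_annealing(f, x0, steps=20000, t0=1.0, cooling=0.9995,
                        step_size=0.5, seed=0):
    """Maximize f from x0 by moving a single point (no population)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_fx = x, fx
    t = t0
    for _ in range(steps):
        cand = x + rng.uniform(-step_size, step_size)
        fc = f(cand)
        # Uphill moves are always accepted. Downhill moves are accepted
        # with probability exp((fc - fx) / t), which shrinks as t cools.
        if fc >= fx or rng.random() < math.exp((fc - fx) / t):
            x, fx = cand, fc
            if fx > best_fx:
                best_x, best_fx = x, fx
        t *= cooling  # assumed geometric cooling schedule
    return best_x, best_fx

def f(x):
    # Hypothetical objective: a local peak near x = -1 and a
    # higher peak near x = +1, separated by a dip at x = 0.
    return -(x * x - 1) ** 2 + 0.3 * x

# Start at the lower peak; the early high temperature lets the
# point bounce out of it.
best_x, best_fx = simulated_annealing(f, x0=-1.0)
print(best_x, best_fx)
```

Note that there is no crossover and no population here, which is the distinction being drawn with the standard GA.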
Certainly some GAs vary the mutation or crossover parameters - as I said before, you can certainly create borderline examples, but the (standard, introductory, boring) centers of the two fuzzy categories are pretty distinguishable.
Ahhh my bad. I thought that SA explored multiple solution points at once. Of course, if the stochastic jumps are similar, it could end up being the same exploration path eventually, just serial vs parallel, although that seems to make vanilla SA incredibly non-useful in the modern era of parallel computing.
[...] that seems to make vanilla SA incredibly non-useful in the modern era of parallel computing.
That is what parallel tempering is for.
Also, SA is specifically useful when the original objective function can be easily evaluated, but its derivatives are too expensive. With SA, you don't need to compute derivatives or the normalizing constants. You can try quasi-Newton methods and other approaches, but even these are computationally intractable in many cases. There are certain ways in which a problem can be non-convex that makes SA an attractive alternative. In principle, this could be true even in low dimensional problems, meaning that it's not at all just a question of parallelism. Another thing worth mentioning is that SA lends itself very well to the GPU in some cases where traditional optimizers don't.
Back in the late 1980s, neural networks were hot; and evaluations usually indicated that they outperformed other methods of classification. In the early 1990s, genetic algorithms were hot; and evaluations usually indicated that they outperformed other methods of classification. Today, support vector machines (SVMs) are hot; and evaluations usually indicate that they outperform other methods of classification. Neural networks and genetic algorithms no longer outperform older methods. (I write this from memory, so you shouldn't take it as gospel.)
If this is true, the implications are beyond my current comprehension. Anyone know where I can find papers suggesting that such a phenomenon exists in statistical classification techniques?
Good point! A nitpick:
Believing that Internet Explorer is safer from malicious websites than Google Chrome won't decrease your computer's infection rate.
I don't think that this is a good counterexample. If people who want to avoid malicious websites tend to believe that IE is safer from them and thus choose IE, then IE should get better results (ceteris paribus) since people who care about malicious websites are also more likely to take care to type paypal.com directly into the address bar, etc.
In other words, one's attitude to safety from malicious websites has as much effect on infection rates as one's attitude to safety from car crashes has effect on crash rates. (Or more, I would guess offhand.)
Am I missing something?
Or they believe using IE makes them safer and don't bother doing anything else. Or even take more risks otherwise because they feel safer because of IE. Feeling safer can make you less safe. Or have other unintended effects. And "beliefs" are usually based on emotion rather than thought.
I would expect the exact opposite effect. They might go to Paypal in Chrome, then switch to IE to go to whitehouse.com, just to be on the safe side.
The crazier, more-expensive, and more-difficult the method is, the more improvement it should show; craziness should filter out less-committed parents.
Montessori
Your main point may well be valid; I think it probably is. But my daughter attended a Montessori kindergarten (but not a Montessori school) and I have read Maria Montessori's book. Neither seemed at all crazy to me.
The Montessori method is to engage children in activities which are challenging but not discouragingly so. Each activity produces a small increment in a skill. The children seem to become absorbed in the activities and find them very rewarding. In the adult world this would probably be something like "deliberate practice".
This idea of learning skills in small increments - in the sweet spot between "too easy and you learn nothing" and "too hard so you learn nothing and get discouraged" - has wide applicability to children and adults. For example, after almost a year of conventional swimming lessons my daughter still could not swim, so I tried applying this method to swimming.
Swimming of course requires you to do several things at once. If you don't do them all you get a mouth full of water and learn very little.
I bought her a buoyancy vest and fins. She learned to swim with these very quickly. After a while we deflated the vest progressively and she again learned to swim that way, being now responsible for staying afloat. Then we took away the fins and she mastered that quickly. After a few lessons she was a confident swimmer. This was a very dramatic result. Back at the swim school they were surprised she could now swim, but were totally uninterested in how we achieved this.
The Montessori children seem to end up with excellent powers of concentration; that is certainly the case with my daughter. I did hear of a study that found that this was the most prominent effect of the Montessori schools. I would suggest they are worth looking at, but I would check that they are actually following the method.
A minor pedantry; this was a wonderful post:
Sometimes corr(X,Y) means X=>Y...
corr(X,Y) is a number between -1 and 1, and without specifying its value it's hard to see how it could imply anything. Perhaps it would be better stated as something like
Sometimes corr(X,Y) > 0 means that X=>Y...
Well, since he starts the sentence with SOMETIMES, whether it's negative or positive, his sentence is correct. I guess you could nitpick by insisting that corr(X,Y) != 0 ...
I really like this post. I'd like to see a sequence of ways statistical analysis can go wrong. This article and the inverted-J one both fit that description well.
Are there any other posts on the subject?
I think a theme of LessWrong is that statistical analysis goes wrong by default, and it requires work to keep it right.
Nice article. It seems to be the opposite of Goodhart's law (self-destructing correlations) but if there's some higher model that unifies the two, I can't see it at the moment.
It's pretty simple. Belief in a correlation, like anything else, can affect the correlation. If belief strengthens it, it's self-fulfilling. If belief weakens it, it's Goodhart's law.
Presumably, you'd look at the accident rate for Volvos compared to the accident rate for similar cars driven by a similar demographic, as reflected, for instance in insurance rates. (My google-fu did not find accident rates posted on the internet, but insurance rates don't come out especially pro-Volvo.)
Tangential, but I think you may be looking at the wrong thing here; this is auto insurance. The thing it measures, in addition to how often that sort of car gets into accidents, would be the extent to which the car itself is resistant to damage, not how well it protects the passengers. And since other sorts of insurance aren't likely to depend on what sort of car you have, I'm not sure there's any sort of insurance you could look at for quite this info.
Auto insurance is broken down into different types of coverage, with injuries separate from damage to the car. In fact, I'm pretty sure your coverage makes a distinction between injuries to you and injuries to other people that are your fault. Every time I renew my insurance, they ask me if I want to change how much of each type of coverage I have.
The safety indicator that most car buyers look at is the crash test rating, usually done by a government or an insurance industry group. Maybe it's no longer part of the culture, but I remember when car ads would often show crash tests. I think there was one where the crash test dummies (like mannequins full of sensors) talked about which car they liked.
The Insurance Institute for Highway Safety has information on crash tests and statistics on accidents and payouts.
You didn't talk about any self-fulfilling negative correlations. A Volvo doesn't prevent accidents, it just makes the accidents less deadly, so it may actually be the more reckless drivers who buy them so they can continue to be reckless (this effect may be much smaller, though), or parents who choose a safe car for their reckless teen. Another example is when seatbelts were introduced: the owners of cars with them became more reckless because they thought they were safer, and actually ended up in more accidents (though the death rate of drivers remained about the same because seatbelts do actually offer protection; the effect is not just self-fulfilling).
Maybe someone can think of better examples. I can imagine that these are hard to perpetuate though, because the concept that something is safer is often based on scientific evidence with proper selection, or real world evidence with distorted selection, but with a negative correlation, the distorted selection would show the safety device makes people less safe. You would have to have a good advertising team to overcome both the scientific and real world examples of your safety device. Either that or it really does make you slightly safer but the negative correlation effect would have to be strong enough to overcome that. But in either of these cases it isn't self fulfilling, because the concept isn't caused by the results, but either by actual benefit or advertising.
But does this mean that 'safer' and 'less safe' is meaningless for someone choosing a car? I mean, if I have never driven 'for real', without the instructor and with people I like very much sitting next to me, I do want a safer car; but I have no way to know if I am, on average, 'more reckless' or 'less reckless' than other drivers. And with all of these balancing effects, if I have previously found myself leaning towards buying a Volvo, now I have to doubt, vaguely, whether I want a Volvo because I actually think it would give me more leeway to drive poorly, which is, given my inexperience, even more dangerous. It would seem then, that not only 'safe' doesn't cut reality at its joints, but 'reckless' doesn't cut it, too, and I suspect that if we separate 'recklessness' into things like 'rileability', 'doggedness', 'attention', 'speed of reaction', whatever, they would turn out to not work that well... And how am I supposed to decide upon something as 'high level' as 'what car to buy'?
Yes, as human beings we're great at compensating. When anti-lock brakes were introduced, it made it safer to drive. Accidents stayed the same, since people drove more recklessly - they were after all safer now, weren't they?
But how would you evaluate the claim that Volvos are safer than other cars?
You would look for explanations why Volvos are safer.
Presumably, you'd look at the accident rate for Volvos compared to the accident rate for similar cars driven by a similar demographic, as reflected, for instance in insurance rates. (My google-fu did not find accident rates posted on the internet, but insurance rates don't come out especially pro-Volvo.)
No, this is not what you would do. The accident rate is consistent with many theories, including the theory that Volvos are not safer.
But suppose the results showed that Volvos had only 3/4 as many accidents as similar cars driven by similar people. Would that prove Volvos are safer?
No. Besides having a reputation for safety, Volvos also have a reputation for being overpriced and ugly. Mostly people who are concerned about safety buy Volvos. Once the reputation exists, even if it's not true, a cycle begins that feeds on itself: Cautious drivers buy Volvos, have fewer accidents, resulting in better statistics, leading more cautious drivers to buy Volvos.
Yes, the accident rate data is also consistent with this theory. So looking at accident rates alone isn't going to tell you anything about the safety of Volvos. And rational drivers would know this. They wouldn't buy a Volvo because it has a reputation for having fewer accidents; they would buy a Volvo because they have an explanation for why it is safer than other similar cars.
If a Montessori school cost the same, and was just as convenient for the parents, as every other school, and all factors other than test score were equal, and Montessori schools were believed to increase test scores, then any parent who cared at all would choose the Montessori school.
No, they wouldn't, not if they really cared. To choose an education method on the basis of test scores is irrational. A parent that really cared would try to understand our best epistemology and act according to that. Schools and parents that employ coercion and that care about test scores are flying in the face of what we know about how knowledge grows. A good parent would know that.
It seems to me that your post devalues the role of explanations.
A searcher for explanations wields a double-edged sword. One of the most commonly-leveled criticisms of Evolutionary Psychology is that it makes it easy to construct plausible explanations. With no way to quantitatively distinguish the plausibility of several different explanations that haven't been falsified by experimentation, what do you do?
How about when you're building the lithography for a new CPU at Intel, and you don't know whether the explanation for quantum tunneling is multiple worlds, decoherence, pilot waves, or any number of other things? Do you use the equations that have always worked before, or wait until there's an explanation for why they work?
With no way to quantitatively distinguish the plausibility of several different explanations that haven't been falsified by experimentation, what do you do?
You criticize them to find flaws.
Do you use the equations that have always worked before, or wait until there's an explanation for why they work?
Of course, we do know that quantum phenomena are the result of previously fungible multiversal objects becoming fungible again. But assuming you didn't. Then you would use an explanation of what to do in the absence of a satisfactory explanation of the phenomena in question. We always act on such meta-theories.
It seems to me that the rate of car accidents would reflect only a minor influence from the properties of the car: anti-lock brakes, bigger mirrors(?), etc. Things that make people think of a safe car - seat belts, air bags, etc. - serve only to prevent physical harm, not accidents. To examine the effect of Volvos, you could explore accident rates for drivers in their previous cars vs. their current ones, comparing the transition values.
It seems to me that driver fatality is a better example than accident frequency. Driver fatality is something a Volvo would actually mitigate, and which would correlate with cautious driving.
Very good examples of perceptions driving self-selection.
It might be useful to discuss direct and indirect effects.
Suppose we want to compare fatality rates if everyone drove a Volvo versus if no one did. If the fatality rate was lower in the former scenario than in the latter, that would indicate that Volvos (causally) decrease fatality rates.
It's possible that it is entirely through an indirect effect. For example, the decrease in the fatality rate might entirely be due to behavior changes (maybe when you get in a Volvo you think 'safety' and drive slower). On the DAG, we would have an arrow from Volvo to behavior to fatality, and no arrow from Volvo to fatality.
A total causal effect is much easier to estimate. We would need to assume ignorability (conditional independence of assignment given covariates). And even though safer drivers might tend to self-select into the Volvo group, it's never uniform. Safe drivers who select other vehicles would be given a lot of weight in the analysis. We would just have to have good, detailed data on predictors of driver safety.
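As a rough illustration of what adjusting for a covariate buys you, here is a small Python sketch that compares a naive Volvo-vs-other fatality difference with a stratified (standardized) one. The counts are invented for the example, and a single binary "cautious" covariate stands in for the "good, detailed data on predictors of driver safety"; ignorability given that covariate is simply assumed.

```python
# Hypothetical counts: (drivers, fatalities) by stratum and car.
# Within each stratum the fatality rate is identical for both cars,
# so the true effect of the car in this toy data is zero.
data = {
    "cautious":     {"volvo": (6000, 60), "other": (4000, 40)},
    "not_cautious": {"volvo": (2000, 80), "other": (8000, 320)},
}

def rate(n, events):
    return events / n

def naive_difference(data):
    """Volvo rate minus other rate, ignoring the covariate."""
    totals = {}
    for car in ("volvo", "other"):
        n = sum(stratum[car][0] for stratum in data.values())
        events = sum(stratum[car][1] for stratum in data.values())
        totals[car] = rate(n, events)
    return totals["volvo"] - totals["other"]

def adjusted_difference(data):
    """Standardize: average the within-stratum differences,
    weighting each stratum by its share of all drivers."""
    total = sum(n for stratum in data.values() for n, _ in stratum.values())
    diff = 0.0
    for stratum in data.values():
        weight = sum(n for n, _ in stratum.values()) / total
        diff += weight * (rate(*stratum["volvo"]) - rate(*stratum["other"]))
    return diff

naive = naive_difference(data)
adjusted = adjusted_difference(data)
print(f"naive difference:    {naive:+.4f}")
print(f"adjusted difference: {adjusted:+.4f}")
```

In this made-up data the naive comparison makes Volvos look safer only because cautious drivers are over-represented among Volvo owners; the stratified estimate removes that. It only works to the extent that the measured covariate really captures what drives both car choice and fatality risk.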
Estimating direct and indirect effects is much harder. Typically it requires assuming ignorability of the intervention and the mediator(s). It also typically involves indexing counterfactuals with non-manipulable variables.
As an aside: a machine learning graduate student worked with me last year, and in most simulated data settings that we explored, logistic regression outperformed SVM.
Short nitpick -- lots of assumptions other than ignorability can work for identifying direct effects (there is more to life than covariate adjustment). In particular, if we can agree on the causal diagram, then all sorts of crazy identification can become possible.
No. Besides having a reputation for safety, Volvos also have a reputation for being overpriced and ugly. Mostly people who are concerned about safety buy Volvos. Once the reputation exists, even if it's not true, a cycle begins that feeds on itself: Cautious drivers buy Volvos, have fewer accidents, resulting in better statistics, leading more cautious drivers to buy Volvos.
I'm having a hard time integrating what you're saying with my existing knowledge of confounders, which describes a similar phenomenon. Can anyone help me out here by describing self-fulfilling phenomena in terms of confounding?
On a somewhat tangential note: due to massive misrepresentation of science, people put well-founded logical conclusions on the same level as entirely unfounded ones, and give too much weight to correlation-based 'evidence', seeing the latter as more scientific. That creates a lot of demand for correlation-based 'evidence' where one should conduct specific experiments instead (e.g. crash-test the cars with dummies and be satisfied; it changes nothing about my personal car choice whether people who chose some car drive more safely, or whether people on average drive less safely when they are driving a safer car. The dummy may be a poor model of me, but the averaged person may be an even worse model).
That could have its roots in how science is taught at school (formulate hypothesis, test hypothesis with experiment) without the understanding that the reasoning behind a hypothesis can very well be considerably stronger as evidence than many potential 'experiments', or can be so weak as to make it worthless to even bother conducting an experiment. (For example, the reasoning (calculations) that the rocket will end up at a particular point in space after executing well-controlled burns is very solid, and if the rocket ends up somewhere else, the chances overwhelmingly are that the experiment, rather than the reasoning, has failed.)
The chances overwhelmingly are that there are factors affecting the rocket's trajectory that the experiment (and by extension, the hypothesis) has failed to take into account.
Unless it's that you have a very specific definition in mind for "well-controlled burns" (i.e. burn engine P for X seconds:milliseconds, then burn Q and R for Y seconds:milliseconds, and your position will be Z) and the mechanism controlling the rockets has failed to time them properly, or if your rocket is shot down by an orbital defence laser (or otherwise sabotaged).
That may have been what you meant anyway? To be fair, I would expect that kind of misdirection in a high school experiment. But it parsed to me as implying "if an experiment doesn't do what you predicted, reality is wrong".
Unless it's that you have a very specific definition in mind for "well-controlled burns" (i.e. burn engine P for X seconds:milliseconds, then burn Q and R for Y seconds:milliseconds, and your position will be Z) and the mechanism controlling the rockets has failed to time them properly, or if your rocket is shot down by an orbital defence laser (or otherwise sabotaged).
This is of course what I have in mind.
That may have been what you meant anyway? To be fair, I would expect that kind of misdirection in a high school experiment. But it parsed to me as implying "if an experiment doesn't do what you predicted, reality is wrong".
If you take a die that you measure to be perfectly symmetrical and to have a chance of 1/6 to land on each side (after, say, >10 bounces), and you check your reasoning about the die by throwing it and measuring the probability of it landing on each side, you'd need quite a lot of throws until the strength of evidence from deviations can overwhelm the strength of the reasoning. That is to say, your prior for the probability being very close to 1/6 is high, and for any other value very low. Experiments are not always deterministic; deterministic experiments, if anything, are an exception.
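A quick Python sketch of the arithmetic, under simplifying assumptions that are mine rather than part of the argument above: only the rate of sixes is tracked, the prior that the die is fair is set to 0.999, and the only alternative considered is a single hypothetical bias `p_alt` (a six with probability 0.25). The same observed frequency of sixes is then evaluated at two sample sizes.

```python
import math

def posterior_fair(prior_fair, throws, sixes, p_alt):
    """Posterior probability that P(six) = 1/6 rather than p_alt,
    given `sixes` sixes observed in `throws` throws."""
    def log_lik(p):
        # Binomial log-likelihood; the binomial coefficient cancels
        # in the ratio between the two hypotheses, so it is omitted.
        return sixes * math.log(p) + (throws - sixes) * math.log(1 - p)
    log_odds = (math.log(prior_fair) - math.log(1 - prior_fair)
                + log_lik(1 / 6) - log_lik(p_alt))
    return 1 / (1 + math.exp(-log_odds))

# The same 25% rate of sixes, observed over few and over many throws.
few = posterior_fair(0.999, throws=60, sixes=15, p_alt=0.25)
many = posterior_fair(0.999, throws=600, sixes=150, p_alt=0.25)
print(f"after 60 throws:  P(fair) = {few:.4f}")
print(f"after 600 throws: P(fair) = {many:.4f}")
```

With these assumed numbers, 60 throws barely dent the prior, while 600 throws at the same observed rate overturn it. A stronger prior would need proportionally more throws.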
Sometimes corr(X,Y) means X=>Y; sometimes it means Y=>X; sometimes it means W=>X, W=>Y. And sometimes it's an artifact of people's beliefs about corr(X, Y). With intelligent agents, perceived causation causes correlation.
Sometimes the issue could be the choice of the appropriate Y to correlate with your X. In the Montessori example, rather than looking at test scores or even at rate of graduation from high school, it could be argued that it would be more meaningful to look at career performance. However, choosing something as general as career performance requires a great deal of "narrowing down". For example: First choose your career to compare, say law. Narrow it down to one, hopefully very large, law school. Further narrow it down to one demographic - all students graduating from similarly performing high schools, who themselves performed similarly in their admission tests. Continue to narrow your study down even further - similar race, religion, family income, number of siblings, birth order, parents' careers and educations, etc. Then you can look at the pre-high school education - Montessori vs. other private educational methods. Ouch!
Correlation does not imply causation. Sometimes corr(X,Y) means X=>Y; sometimes it means Y=>X; sometimes it means W=>X, W=>Y. And sometimes it's an artifact of people's beliefs about corr(X, Y). With intelligent agents, perceived causation causes correlation.
Volvos are believed by many people to be safe. Volvo has an excellent record of being concerned with safety; they introduced 3-point seat belts, crumple zones, laminated windshields, and safety cages, among other things. But how would you evaluate the claim that Volvos are safer than other cars?
Presumably, you'd look at the accident rate for Volvos compared to the accident rate for similar cars driven by a similar demographic, as reflected, for instance, in insurance rates. (My google-fu did not find accident rates posted on the internet, but insurance rates don't come out especially pro-Volvo.) But suppose the results showed that Volvos had only 3/4 as many accidents as similar cars driven by similar people. Would that prove Volvos are safer?
Perceived causation causes correlation
No. Besides having a reputation for safety, Volvos also have a reputation for being overpriced and ugly. Mostly people who are concerned about safety buy Volvos. Once the reputation exists, even if it's not true, a cycle begins that feeds on itself: Cautious drivers buy Volvos, have fewer accidents, resulting in better statistics, leading more cautious drivers to buy Volvos.
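A toy simulation can show the first turn of that cycle. Everything in it is a hypothetical assumption, not data: each driver has a "caution" level drawn uniformly from [0, 1], accident risk depends only on caution, the car itself has no effect on risk at all, and more cautious drivers are more likely to pick the car with the safe reputation.

```python
import random

def simulate_accident_rates(n_drivers=200_000, base_risk=0.2, seed=0):
    """Return accident rates by car type when the car has NO causal effect.

    Assumed model (illustrative only):
      - caution ~ Uniform(0, 1)
      - P(buys the "safe-reputation" car) = caution
      - P(accident) = base_risk * (1 - caution), whatever the car
    """
    rng = random.Random(seed)
    counts = {"volvo": [0, 0], "other": [0, 0]}  # car -> [drivers, accidents]
    for _ in range(n_drivers):
        caution = rng.random()
        car = "volvo" if rng.random() < caution else "other"
        accident = rng.random() < base_risk * (1 - caution)
        counts[car][0] += 1
        counts[car][1] += accident  # True counts as 1
    return {car: accidents / drivers
            for car, (drivers, accidents) in counts.items()}

rates = simulate_accident_rates()
print(rates)  # the "volvo" rate comes out well below the "other" rate
```

In this sketch the reputable car shows roughly half the accident rate of the other cars even though it does nothing for safety; the entire gap comes from who chooses to buy it.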
Do Montessori schools or home-schooling result in better scores on standardized tests? I'd bet that they do. Again, my google-fu is not strong enough to find any actual reports on, say, average SAT-score increases for students in Montessori schools vs. public schools. But the largest observable factor determining student test scores, last I heard, is participation by the parents. Any new education method will show increases in student test scores if people believe it results in increases in student test scores, because only interested parents will sign up for that method. The crazier, more-expensive, and more-difficult the method is, the more improvement it should show; craziness should filter out less-committed parents.
Are vegetarian diets or yoga healthy for you? Does using the phone while driving increase accident rates? Yes, probably; but there is a self-fulfilling component in the data that is difficult to factor out.
Conditions under which this occurs
If you believe X helps you achieve Y, and so you use X when you are most-motivated to achieve Y and your motivation has some bearing on the outcome, you will observe a correlation between X and Y.
This won't happen if your motivation or attitude has no bearing on the outcome (beyond your choice of X). If passengers prefer one airline based on their perception of its safety, that won't make its safety record improve.
However, this is different from either confidence or the placebo effect. I'm not talking about the PUA mantra that "if you believe a pickup line will work, it will work". And I'm not talking about feeling better when you take a pill that you think will help you feel better. This is a sample-selection bias. A person is more likely to choose X when they are motivated to achieve Y relative to other possible positive outcomes of X, and hence more inclined to make many other little trade-offs to achieve Y which will not be visible in the data set.
It's also not the effect people are guarding against with double-blind experiments. That's guarding against the experimenter favoring one method over another. This is, rather, an effect guarded against with random assignment to different groups.
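The difference between self-selection and random assignment can be sketched with a small simulation. The model is an assumption made purely for illustration: test scores depend only on parental involvement plus noise, and the "new method" has zero true effect. The names `score`, `score_gap`, `self_selected`, and `lottery` are hypothetical.

```python
import random
from statistics import mean

def score(involvement, rng):
    # Assumed model: score depends on parental involvement and noise only.
    # The school method does not appear here, so its true effect is zero.
    return 50 + 30 * involvement + rng.gauss(0, 5)

def score_gap(assign, n_students=50_000, seed=1):
    """Mean score in the new method minus mean score in the old one."""
    rng = random.Random(seed)
    new_method, old_method = [], []
    for _ in range(n_students):
        involvement = rng.random()  # Uniform(0, 1)
        group = new_method if assign(involvement, rng) else old_method
        group.append(score(involvement, rng))
    return mean(new_method) - mean(old_method)

def self_selected(involvement, rng):
    # More-involved parents are more likely to sign up.
    return rng.random() < involvement

def lottery(involvement, rng):
    # Random assignment ignores involvement entirely.
    return rng.random() < 0.5

print("self-selected gap:", score_gap(self_selected))
print("lottery gap:      ", score_gap(lottery))
```

Under these assumptions the self-selected comparison shows a gap of about ten points in favor of the new method, while the lottery comparison shows a gap near zero, which is the method's actual effect in this toy model.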
Nor should it happen in cases where the outcome being studied is the only outcome people consider. If a Montessori school cost the same, and was just as convenient for the parents, as every other school, and all factors other than test score were equal, and Montessori schools were believed to increase test scores, then any parent who cared at all would choose the Montessori school. The filtering effect would vanish, and so would the portion of the test-score increase caused by it. Same story if one choice improves all the outcomes under consideration: Aluminum tennis racquets are better than wooden racquets in weight, sweet spot size, bounce, strength, air resistance, longevity, time between restrings, and cost. You need not suspect a self-fulfilling correlation.
It may be cancelled by a balancing effect, if you are more highly-motivated to achieve Y precisely when you are less likely to achieve it. In sports, if you wear your lucky undershirt only for tough games, you'll find it appears to be unlucky, because you're more likely to lose tough games. Another balancing effect is if your choice of X makes you feel so confident of attaining Y that you act less concerned about Y; an example is (IIRC) research showing that people wearing seat-belts are more likely to get into accidents.
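The undershirt case is easy to sketch as a simulation. The assumptions are hypothetical: each game has a difficulty drawn uniformly from [0, 1], the chance of winning is one minus the difficulty, the shirt has no effect, and it is worn only when difficulty exceeds an arbitrary `tough_cutoff`.

```python
import random

def win_rates(n_games=100_000, tough_cutoff=0.7, seed=2):
    """Return (win rate with shirt, win rate without shirt).

    Assumed model (illustrative only): P(win) = 1 - difficulty,
    and the shirt is worn only for games harder than `tough_cutoff`.
    """
    rng = random.Random(seed)
    tally = {True: [0, 0], False: [0, 0]}  # wore_shirt -> [games, wins]
    for _ in range(n_games):
        difficulty = rng.random()
        wore_shirt = difficulty > tough_cutoff
        won = rng.random() < 1 - difficulty  # shirt plays no role
        tally[wore_shirt][0] += 1
        tally[wore_shirt][1] += won
    return (tally[True][1] / tally[True][0],
            tally[False][1] / tally[False][0])

with_shirt, without_shirt = win_rates()
print(with_shirt, without_shirt)  # the shirt looks "unlucky"
```

Here the selection runs the opposite way from the Volvo case: the choice of X is concentrated where Y is hardest to get, so the naive comparison makes X look harmful.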
Application to machine learning and smart people
Back in the late 1980s, neural networks were hot; and evaluations usually indicated that they outperformed other methods of classification. In the early 1990s, genetic algorithms were hot; and evaluations usually indicated that they outperformed other methods of classification. Today, support vector machines (SVMs) are hot; and evaluations usually indicate that they outperform other methods of classification. Neural networks and genetic algorithms no longer outperform older methods. (I write this from memory, so you shouldn't take it as gospel.)
There is a publication bias: When a new technology appears, publications indicating it performs well are interesting. Once it's established, publications indicating it performs poorly are interesting. But there's also a selection bias. People strongly motivated to make their systems work well on difficult problems are strongly motivated to try new techniques; and also to fiddle with the parameters until they work well.
Fads can create self-fulfilling correlations. If neural networks are hot, the smartest people tend to work on neural networks. When you compare their results to other results, it can be difficult to look at neural networks vs., say, logistic regression, and factor out the smartest-people vs. pretty-smart-people effect.
(The attention of smart people is a proxy for effectiveness, which often misleads other smart people - e.g., the popularity of communism among academics in America in the 1930s. But that's yet another separate issue.)