Months ago, my roommate and I were discussing someone who had tried to replicate Seth Roberts' butter mind self-experiment. My roommate seemed to be making almost no inference from the person's self-reports, because they weren't part of a scientific study.
But knowledge does not come in two grades, "scientific" and "useless". Anecdotes do count as evidence; they are just weak evidence. And well-designed scientific studies constitute stronger evidence than poorly designed ones. There's a continuum of knowledge quality.
Knowing that humans are biased should make us take their stories and ad hoc inferences less seriously, but not discard them altogether.
There exist some domains where most of our knowledge is fairly low-quality. But that doesn't mean they're not worth studying, if the value of information in the domain is high.
For example, a friend of mine read a bunch of books on negotiation and says this is the best one. Flipping through my copy, it looks like the author is mostly just enumerating his own thoughts, stories, and theories. So one might be tempted to discard the book entirely because it isn't very scientific.
But that would be a mistake. If a smart person thinks about something for a while and comes to a conclusion, that's decent-quality evidence that the conclusion is correct. (If you disagree with me on this point, why do you think about things?)
And the value of information in the domain of negotiation can be very high: If you're a professional, being able to negotiate your salary better can net you hundreds of thousands of dollars over the course of a career. (Anchoring means your salary next year will probably just be an incremental raise from your salary last year, so starting salary is very important.)
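A rough back-of-the-envelope sketch of that compounding effect, with made-up numbers (a hypothetical $70k offer negotiated up to $75k, 3% raises anchored on the previous year's salary):

```python
# Back-of-the-envelope sketch (made-up numbers): if each raise is a fixed
# percentage of last year's salary, a one-time bump in starting salary
# compounds over an entire career.
def career_earnings(starting_salary, annual_raise=0.03, years=40):
    """Total earnings when each year's salary anchors on the previous year's."""
    total, salary = 0.0, starting_salary
    for _ in range(years):
        total += salary
        salary *= 1 + annual_raise
    return total

baseline = career_earnings(70_000)    # hypothetical un-negotiated offer
negotiated = career_earnings(75_000)  # same offer negotiated up by $5k
print(f"Extra career earnings: ${negotiated - baseline:,.0f}")
# With these assumptions, the $5k bump compounds to roughly $377k over 40 years.
```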
Similarly, this self-help book is about as dopey and unscientific as they come. But doing one of the exercises from it years ago destroyed a large insecurity of mine that I was only peripherally aware of. So I probably got more out of it in instrumental terms than I would've gotten out of a chemistry textbook.
In general, self-improvement seems like a domain of really high importance that's unfortunately flooded with low-quality knowledge. If you invest two hours implementing some self-improvement scheme and find yourself operating 10% more effectively, you'll earn back double your investment in just a week, assuming a 40-hour work week. (ALERT: this seems like a really important point! I'd write an entire post about it, but I'm not sure what else there is to say.)
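To make that arithmetic explicit, here's a minimal sketch using the numbers above (two hours invested, a 10% effectiveness gain, a 40-hour week):

```python
# Minimal payback sketch using the numbers from the paragraph above.
hours_invested = 2          # one-time cost of implementing the scheme
effectiveness_gain = 0.10   # operating 10% more effectively
work_week_hours = 40

hours_gained_per_week = effectiveness_gain * work_week_hours            # 4.0 hours
weeks_to_double_investment = 2 * hours_invested / hours_gained_per_week  # 1.0 week
print(hours_gained_per_week, weeks_to_double_investment)
```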
Here are some free self-improvement resources where the knowledge quality seems at least middling: For people who feel like failures. For students. For mathematicians. Productivity and general ass kicking (web implementation for that last idea). Even more ass kicking ideas that you might have seen already.
I'd be interested to see an analysis of how many failures to replicate we should expect if replicators duplicate methodology perfectly, and whether real-world failures to replicate seem to occur in line with that assumption. Wild guess: there are way more failures to replicate than we should expect. If this guess is accurate, that suggests that experimenters tend to introduce undocumented distorting factors into their experiments, and that compiled anecdotal evidence is actually more valuable than experimental evidence if you can find a way to sample it randomly.
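As a sketch of the kind of analysis I have in mind (with a made-up effect size and sample size): even when a replication duplicates methodology perfectly, an underpowered but significant original result often fails to replicate from sampling noise alone.

```python
# Rough sketch (made-up effect size and sample size): how often should an
# *exact* replication of a significant two-group study come out significant
# again, purely from sampling noise?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, n_per_group, alpha = 0.4, 30, 0.05

def significant():
    """Run one two-sample study and report whether it hits p < alpha."""
    treatment = rng.normal(true_effect, 1, n_per_group)
    control = rng.normal(0, 1, n_per_group)
    return stats.ttest_ind(treatment, control).pvalue < alpha

originals = [significant() for _ in range(20_000)]
replications = [significant() for passed in originals if passed]
print(f"Expected replication rate: {np.mean(replications):.2f}")
# With these numbers only about a third of perfect replications come out
# significant (the study's power), so some "failures to replicate" are baked
# in; the question is whether real-world failures exceed that baseline.
```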
To provide some intuition for this guess, I remember reading about some guy who was doing experiments on mice and found that random stuff like the lighting in his laboratory was actually the primary explanatory factor for his experimental results. (Maybe someone else can provide a link? I can't seem to find the guy on Google.) From this he concluded that almost all experiments that had been done on mice previously were useless. But you can imagine a mouse experiment where instead of using 100 mice in a single laboratory, 100 mice in 100 different laboratories are used. This could deal with the random stuff problem pretty well.
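Here's a toy simulation of that design choice, with an invented lab effect: each lab adds its own random offset to the measurements, so a single-lab estimate is at the mercy of that one lab's offset, while 100 labs average the offsets away.

```python
# Toy sketch (invented numbers): each lab adds a random offset to the measured
# outcome. Compare 100 mice in one lab with one mouse in each of 100 labs.
import numpy as np

rng = np.random.default_rng(0)
true_effect, lab_sd, mouse_sd, n_mice = 1.0, 0.5, 0.3, 100

def estimated_effect(n_labs):
    """Mean measured effect with n_mice spread evenly across n_labs labs."""
    lab_offsets = rng.normal(0, lab_sd, n_labs).repeat(n_mice // n_labs)
    mouse_outcomes = rng.normal(true_effect, mouse_sd, n_mice)
    return (lab_offsets + mouse_outcomes).mean()

for n_labs in (1, 100):
    errors = [abs(estimated_effect(n_labs) - true_effect) for _ in range(5_000)]
    print(f"{n_labs:>3} lab(s): mean absolute error {np.mean(errors):.3f}")
# The single-lab estimate is dominated by that lab's offset (error ~0.4 here);
# spreading the mice across 100 labs averages it away (error ~0.05).
```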
Of course, there's also the problem of interpreting study results accurately... So I don't think the number of participants is the bottleneck to making inferences in most cases.
And a meta-analysis obviously won't suffer from the random stuff problem as much.
You're thinking of the mouse study covered by Lehrer in his decline effect New Yorker article, which was Crabbe et al. (1999), "Genetics of mouse behavior: interactions with laboratory environment".