25 March 2010
Take a look at "Role of Layer 6 of V2 Visual Cortex in Object-Recognition Memory", Science, 3 July 2009: Vol. 325, no. 5936, pp. 87-89.  The article has some good points, but I'm going to pick on some of its tests.

The experimenters believed they could enhance object-recognition memory (ORM) by using a lentivirus to insert a gene into area V2 of visual cortex.  They tested the ORM of rats by putting an object in a field with a rat, and then putting either the same object ("old"), or a different object ("new"), in the field 30, 45, or 60 minutes later.  The standard assumption is that rats spend more time investigating unfamiliar than familiar objects.

They chose this test:  For each condition, measure the difference in mean time spent investigating the old object vs. the new object.  If the latter is more than the former, and the difference is statistically-significant, conclude that the rats recognized the old object.
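For concreteness, here is a minimal sketch of that decision rule in Python. The per-rat exploration times are invented for illustration (the paper reports only group means and error bars), and whatever test the authors actually used, a two-sample comparison like Welch's t-test captures the logic:

```python
# Sketch of the paper's decision rule: compare time spent exploring the new
# object vs. the old object, and conclude "recognized" if new > old and the
# difference is statistically significant.
# The per-rat numbers are hypothetical; the paper reports only group statistics.
import numpy as np
from scipy import stats

old_times = np.array([7, 9, 8, 10, 6, 8, 9, 7])          # seconds on the old object (invented)
new_times = np.array([15, 18, 16, 20, 14, 17, 19, 17])   # seconds on the new object (invented)

t, p = stats.ttest_ind(new_times, old_times, equal_var=False)  # Welch's t-test
recognized = new_times.mean() > old_times.mean() and p < 0.05

print(f"mean old = {old_times.mean():.1f}s, mean new = {new_times.mean():.1f}s, p = {p:.3g}")
print("verdict under this rule:", "rats recognized the old object" if recognized else "no recognition")
```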

Figure 1, graph A (below the article-summary cutoff) shows how much time normal rats spent investigating an object.  Here it is in table form (time spent exploring old and new objects):

Minutes after first exposure    30    45    60
Old object                       8    12    14
New object                      17    28    14

The black bars (new) are significantly longer than the white bars (old) after 30 and 45 minutes, but not after 60 minutes.  Therefore, the normal rats recognized the old objects after 30 and 45 minutes, but not after 60 minutes.

Figure 3 Graph D (also below the article-summary cutoff) shows how much time different types of rats spent exploring old and new objects.  The "RGS" group is rats given the gene therapy, but in parietal cortex rather than in V2.

Here it is in table form (time spent exploring old and new objects, by rat type):

Rat type      Normal (after 45 min)    Parietal RGS (after 60 min)
Old object    10                       11
New object    27                       12

Parietal RGS rats displayed no difference in time spent exploring old and new objects after 60 minutes; therefore, this gene therapy to parietal cortex does not improve ORM.

To recap:

  1. We conclude that rats no longer recognize an old object if they spend about the same time investigating it as investigating a new object.
  2. Normal rats spend the same time investigating old and new objects 60 minutes after first exposure to the old object.
  3. Parietal RGS rats also spend the same time investigating old and new objects after 60 minutes.
  4. Therefore, normal rats and parietal RGS rats both lose ORM by 60 minutes.

So why don't I buy it?

[Figure 1 (see panel A): Animal performance on ORM task.]

[Figure 3 (see panel D): Ox7-SAP injection in layer 6 and determination of ORM.]

The investigators were trying to determine when rats recognized an old object.  So what's most relevant is how much time they spent investigating the old object.  The time spent investigating new objects is probably supposed to control for variations in their testing procedure.

But in both graphs, the authors claim that the rats failed to recognize an old object in the 60-minute condition, even though the rats spent the same amount of time investigating it as in the other conditions.  The difference was only in their response to new objects.  The test methodology assumes that the response to new objects is always the same.

Look at the error bars on those graphs.  The black bars are supposed to all be the same height (except in 1B and 1C).  Yet we see they differ across conditions by what looks like about 10 standard deviations in several cases.

When you regularly get 10 standard deviations of difference in your control variable across cases, you shouldn't say, "Gee, lucky thing I used that control variable!  Otherwise I never would have noticed the large, significant difference between the test and control cases."  No; you say, "Gee, something is wrong with my experimental procedure."
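As a rough illustration of the arithmetic behind that complaint (the bar heights and error bars below are placeholders eyeballed from the figure, not the paper's published numbers), the separation between two bar heights in standard-error units is just their difference divided by the combined standard error:

```python
# Back-of-the-envelope check: how far apart are the "new object" bars in two
# conditions, in units of standard error?  All values are placeholders read
# roughly off the figure, not the paper's actual numbers.
import math

new_45min, sem_45min = 28.0, 1.5   # mean and SEM at 45 minutes (placeholder)
new_60min, sem_60min = 14.0, 1.5   # mean and SEM at 60 minutes (placeholder)

separation = abs(new_45min - new_60min) / math.sqrt(sem_45min**2 + sem_60min**2)
print(f"the new-object bars differ by roughly {separation:.1f} standard errors")
# If the "control" response really were constant, a gap this large between
# conditions would itself be a wildly significant anomaly.
```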

A couple of other things to notice, in addition to the comments above:

  • The leftmost two sets of bars in 1B contrast the time spent examining old and new objects 60 minutes after exposure to the old objects, in normal and treated rats.  Note that, again, there is no difference in the time spent looking at the old objects between normal (control) and treated rats; yet the authors concluded that the treated rats remembered them and the normal rats did not, because the treated rats spent more time looking at new objects than untreated rats did.
  • 1D is supposed to show that normal rats could remember only 2 objects, while treated rats could remember 6 objects.  But, again, this conclusion was reached because the normal rats spent less time looking at the new objects when exposed to 4 new objects than when exposed to 2 new objects.  There was no difference in the time they spent looking at the old objects with either type of rat under any of the conditions.

One subtle type of error is committed disproportionately by scientists, because it's a natural by-product of the scientific process of abstracting a theory into a testable hypothesis.  A scientist is supposed to formulate a test before performing the test, to avoid introducing bias into the test formulation in order to get the desired results.  Over-encapsulation is when the scientist performs the test, and examines the results according to the previously-established criteria, without noticing that the test results invalidate the assumptions used to formulate the test.  I call it "over-encapsulation" because the scientist has tried to encapsulate the reasoning process in a box, and put data into the box and get decisions out of it; and the journey into and out of the box strips off relevant but unanticipated information.
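To make the failure mode concrete, here is a hedged sketch (the function names, thresholds, and choice of tests are mine, not the paper's): the encapsulated "box" answers only the preregistered question, while a non-encapsulated version first asks whether the assumption the box was built on still holds.

```python
# Illustration of over-encapsulation (all names, tests, and thresholds are
# illustrative, not taken from the paper).
import numpy as np
from scipy import stats

def encapsulated_test(old_times, new_times):
    """The 'box': did the rats explore the new object significantly more than the old?"""
    t, p = stats.ttest_ind(new_times, old_times, equal_var=False)
    return p < 0.05 and np.mean(new_times) > np.mean(old_times)

def test_with_assumption_check(old_by_condition, new_by_condition):
    """Same preregistered test, but first check the assumption the test was built on:
    that the new-object response is stable across conditions."""
    f, p_baseline = stats.f_oneway(*new_by_condition.values())
    if p_baseline < 0.05:
        raise ValueError("new-object times differ significantly across conditions; "
                         "the 'stable baseline' assumption behind the test is violated")
    return {cond: encapsulated_test(old_by_condition[cond], new_by_condition[cond])
            for cond in old_by_condition}
```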

Over-encapsulation is especially tricky when you're reasoning about decision theory.  It's possible to construct a formally-valid evaluation of the probabilities of different cases; and then take those probabilities and choose an action based on them using some decision theory, without noticing that some of the cases are inconsistent with the assumptions used in your decision theory.  I hope to write another, more controversial post on this someday.

Comments

We have 16 comments so far, and all of them are about image hosting.

I suspect that the software used for this entire community is having a significant non-epistemic influence on our content standards and probably the composition of community itself.

Image hosting is an easier thing to comment about. In general, editorial comments about the form of the post rather than its content are easier to write, but they let people socialize and gain karma just as easily (if that is their real goal). They can be honestly helpful at first, but then they clog up the comments in a way that makes real discussion harder to find and harder to post where people will find it.

The smallest change to the site's software that I suspect might fix this problem would be:

  • A radio button for comments that specifies the "meta level" of the comment as either "process" (for stuff aimed at the running of the site itself, like this very comment), or "editorial" (the comments about image hosting), or "object" (in this case, a discussion of either neurobiology or the way scientists frequently lose track of concepts as they move from theory to experiment and back again).

  • When a post has not yet been voted up to the main page, comments should default to "editorial" unless marked otherwise by the commenter.

  • When a post appears on the main page all editorial comments are minimized or otherwise suppressed unless people specifically seek them out.

The smallest change to the site's software that I suspect might fix this problem would be:

Sounds identical to Scoop, the software which runs kuro5hin.org and other sites. But I have not seen very many editorial comments outside of this discussion.

When a post has not yet been voted up to the main page, comments should default to "editorial" unless marked otherwise by the commenter.

There are lots of posts with significant discussion and only a few (or zero) editorial comments that don't end up getting promoted. The default should probably always be object level.

We have 16 comments so far, and all of them are about image hosting.

This is not an easily skimmable post. Readers could absorb it much more easily if you cooked up a simple toy example — something that can be grasped without digesting six bar graphs with noisy real-world data in them.

Do you know the story of the boy who helped the butterfly?

If someone can't follow the argument with a step-by-step explanation, they're never going to detect it in the wild.

(Okay, my true objection is that would require more work.)

And the parent's a meta-comment on the comments, and this comment's a meta-meta-comment.

What can we do to reduce the number of meta-meta-comments on this site?

For meta-discussions, give people omega instead of karma, omega+1 for meta-meta discussions, etc.

Or, keep meta-discussions to the meta-thread.

I would reply, but I don't want to increase the meta-ness of this thread... oh dammit.

I had a difficult time understanding the post on first examination - I promise to comment when I've read it well enough to start clearing my confusion.

Please let me know if you can see these. I don't know if people without a Science subscription can link directly to their gifs.

I can see them, though I feel the need to note that direct linking to a site that isn't your own is bad form.

It is bad form; but I have no site of my own to put them on. I'm homeless on the internet.

RobinZ and nhamann have already floated ideas for image hosting, but if you want a convenient, free, and immediately available option for hosting images, you can try an image hosting service: BAYIMG and imgur look like two good ones.

This sounds like a good idea. But my understanding is that if I link to the original website, I'm not violating copyright; if I link to a copy that I made, I am violating copyright. The penalty for violating copyright is larger than the penalty for poor etiquette.

Are you in the U.S.? According to the U.S. Copyright Office:

The 1961 Report of the Register of Copyrights on the General Revision of the U.S. Copyright Law cites examples of activities that courts have regarded as fair use: "quotation of excerpts in a review or criticism for purposes of illustration or comment; quotation of short passages in a scholarly or technical work, for illustration or clarification of the author’s observations; use in a parody of some of the content of the work parodied; summary of an address or article, with brief quotations, in a news report; reproduction by a library of a portion of a work to replace part of a damaged copy; reproduction by a teacher or student of a small part of a work to illustrate a lesson; reproduction of a work in legislative or judicial proceedings or reports; incidental and fortuitous reproduction, in a newsreel or broadcast, of a work located in the scene of an event being reported." [emphasis added]

Edit: Naturally, your point is apt - I'm just pointing out that there are fair-use exemptions that are often applicable. (I'm not sure hotlinking substantial amounts of copyrighted material is safe, but I Am Not A Lawyer.)

That's a good point. Though I note that "fair use" is not something you can rely on in the US. Try releasing a documentary film where you can overhear someone walking past playing a Michael Jackson song for 10 seconds, and see how much protection fair use gives you.

Well, you might win your court case, but it won't keep you from having to pay legal fees.

Thanks; images are now on BAYIMG. Hope they last.

Can't you upload images to LW, for use in posts?

Indeed. The LW wiki seems to support image upload, though you'd need to register as a user first.

But I'm not comfortable with LW hosting copyrighted images, even though a rationale for fair use could be made.

I didn't know that.

For what it's worth, you can get absurdly cheap hosting through Amazon's Simple Storage Service. We're talking pennies per month.

PhilGoetz.com is available! Take it.

I wonder if LessWrong could 'sell' image storage for this purpose at 1 karma/kB or something.

Bad form? Generally, but I'm not concerned with Science's bandwidth expenses.

If anyone requests it, I will link to a direct download of the .pdf of the full study.

I can see them, and I like your direct-linking.

Proper etiquette is to duplicate the file at reduced size (so as not to steal bandwidth and to protect against link rot) and link to the original (so as to provide proper credit), with sufficient information to allow others to find the same source independently (so as to further protect against link rot).

Upvoted for rationale ('don't direct link' is a new norm to me, and the reasons behind it weren't immediately obvious to me).

I've seen quite a few pleas from webcartoonists that hotlinking was costing them significant sums - it's one of those things which isn't obvious from the consumer side.

Oh, yes; the no-hotlinking norm - i.e. no inlining with <img> tags - I'm familiar with, but the idea of refraining from <a>-type links is new to me.

(Edit: ohhh, you are referring to hotlinking. Sorry. I jumped into this subthread after seeing some of its comments on the Recent Comments page, rather than from reading the full context. I had thought that several of you were referring to <a>-linking rather than <img>-inlining, because I hadn't read PhilGoetz's top-level post. Having looked at it now, I see that he did actually inline the images rather than just linking to them. So my comments don't make sense...but I'll leave them up for transparency's sake.)

So my comments don't make sense...but I'll leave them up for transparency's sake.

Thank you!

Definitely better.

You should email the paper authors and ask for comment.

Someone else should email the paper authors and ask for comment.

No, please don't. I'd rather write the email myself, so I can try to word it more diplomatically than "Here's a link to a post criticizing your paper".

Does the quirk in the numbers make a difference to the core claims that the authors of the study are probing?

I don't currently have access to the paper so I can't answer this question very effectively for myself, but from work experience around science it seems that most real world knowledge work involves a measure of accident (oops, we dropped the sample) and serendipity (I noticed X which isn't part of our grant but is fascinating and might lead somewhere). Generally the good and bad parts of this relatively human process are not always visible in actual papers, but even if the bad parts show up in the data they don't always make much difference to the question you're really interested in.

If you constrain your substantive claims to being consistent with both the things the data shows and the quirks you know about from doing the experiments then these flaws should do no significant harm to the broader scientific community's understanding of the world. This is actually a place where a kind of encapsulation seems like a really good thing to me because it allows useful and relatively trustworthy information to gain broader currency while the idiosyncratic elements of the processes that produce the information are suppressed.

In the meantime, if the data tells the reader something that the authors don't talk about, it represents an opportunity to initiate a conversation with the corresponding author. They might be able to tell you an amusing story about a "dropped sample" that doesn't affect the core claims but explains whatever was puzzling you. Alternatively, the issue might be something really significant - and maybe it will lead to a collaboration or something :-)

Admittedly, the paper you're pointing to may really be an obvious case of clearly flawed data analysis that somehow slipped through peer review, but it's really hard to say without being able to see the paper.

I'm afraid it's critical. They claim to have shown that their treatment allows rats to remember objects for up to 24 weeks after seeing them, while normal rats could remember objects for only 45 minutes. They did a long series of tests with the treated rats to demonstrate this. But they appear to have stopped testing the normal rats after they "failed" the 60-minute test.

Similarly, they tested the treated rats for memory of up to 6 objects. They stopped testing the normal rats after they "failed" the test for 4 objects (figure 1D) - although, again, they spent no more time examining the old objects in that test; they merely spent less time examining new objects.

If the treatment had any effect, it appears to me that it affected the rats' curiosity, not their memory. But the most likely explanation is that the normal rats failed those 2 critical tests purely by chance, perhaps because something startled them (e.g., there was a hawk overhead during the exposure to the new objects).

Back up - are you suggesting that a random factor could have that big an effect on the results? How small are their sample sizes?

The sample size is 16, which should be enough. A random factor shouldn't have that big an effect if the trials were uncorrelated. To make the trials uncorrelated, they would need to interleave their trials. For instance, if they do all the 30-minute tests, then all the 45-minute tests, then all the 60-minute tests, each group of tests is almost completely correlated in environmental conditions. Because rats have such different senses than humans, it's impossible for a human to tell by observation whether something unusual to a rat is going on.

The expectation of the experimenter is another factor that can correlate results within a trial group. We usually require double-blind tests on humans, but not on rats.
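A toy illustration of the interleaving point above (my own construction, not the authors' protocol): in a blocked schedule, anything unusual during one block lands entirely on one condition; in an interleaved schedule it gets spread across conditions.

```python
# Blocked vs. interleaved trial scheduling (toy example, not the authors' protocol).
import random

rats = [f"rat_{i:02d}" for i in range(16)]
delays = [30, 45, 60]  # minutes between first exposure and test

# Blocked: all 30-minute trials, then all 45-minute trials, then all 60-minute trials.
# A one-off disturbance (a noise, a smell, a hawk overhead) hits a single condition.
blocked = [(rat, delay) for delay in delays for rat in rats]

# Interleaved: condition order randomized over time, so a disturbance is shared
# roughly equally across conditions instead of confounding one of them.
interleaved = [(rat, delay) for rat in rats for delay in delays]
random.shuffle(interleaved)
```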

Nice catch. When I was reading the first table, I thought, "Old object -- that looks okay; new object, let's see, 17, 28, 14?! Danger, Will Robinson! The fact that 'old' and 'new' match at 60 minutes is rather difficult to interpret in light of the results at 45 minutes. What are the error bars? Either this whole thing's underpowered or something unexpected is going on in the so-called 'control' condition."

(I'm translating sub-vocal thought fragments into full sentences here.)

Do you think that this problem never occurred to the authors, or did they try to pull a fast one (and succeed, at least on the referees)?

I don't have an opinion on whether it occurred to the authors, and don't think there's any reason to form an opinion on that. If I didn't think it was an easy error to make, I wouldn't have written this post.

and don't think there's any reason to form an opinion on that.

One reason is to increase one's understanding of how science and academia work in practice. Gaining such knowledge can be expected to improve one's ability to update in the correct manner when reading other papers. (And to make myself clear, I'll add that the belief that there is no reason to form an opinion on that is objectively false, even though it is a useful belief to signal.)

Why did the figures disappear today?

I was skeptical about how long an image-hosting site would last, but I expected it to last more than a day.

It's nice that you noticed this.

I agree that in principle we should expect that an intervention could adjust both the amount of time spent on old and new objects. Perhaps they were just assuming that a differential between interest in new/old objects will always exist, regardless of factors that shorten/lengthen examination time of all objects (presumed to do so roughly equally).

If they were actually assuming that time spent on new objects is constant, that would indeed be contradicted by the data, as you point out.

The large difference in general interest (time spent on new objects) in the RGS rats cries out for explanation.

Perhaps they were just assuming that a differential between interest in new/old objects will always exist, regardless of factors that shorten/lengthen examination time of all objects (presumed to do so roughly equally).

Yes; that's what I expect. But making that assumption, and then focusing on just a small piece of the data they collected, caused them to overlook the large differences in time spent examining new objects.

And they could be right in that assumption! It just seems very unlikely. I think it's more likely that someone startled the rats when they were being exposed to the new objects at the 60-minute mark. For instance, one of the lab technicians visited his girlfriend that day, and she has a cat.

I think I might be spazzing out a bit (I haven't been sleeping well), so let me try to get this straight:

  1. The hypothesis is that a rat will spend more time investigating a new object than an old object - the old object will be ignored as familiar, the new object will be attended to as unfamiliar.

  2. Therefore, if the rats recognize the old object, they will ignore it (time spent examining it will be less); if they do not recognize the old object, they will attend to it (time spent examining it will be more).

...and here is where I get lost. You say that the above assumes that time spent examining new objects should be consistent - are you saying that variations in the new-object times imply confounding factors which corrupt the results?

You say that the above assumes that time spent examining new objects should be consistent - are you saying that variations in the new-object times imply confounding factors which corrupt the results?

That's how I read it. The hypothesis is that rats spend a fairly consistent amount of time investigating new objects, and that their deviations from that consistent amount of time can be used to gauge whether they perceive an object as new or old. In practice, rats don't spend a consistent amount of time investigating new objects - by whatever definition of 'new object' the test in question was using - so the concept 'deviations from that consistent amount of time' isn't coherent and can't be used.

It should work, if you have enough rats. And standard statistics should tell you whether you have enough rats. But it looks very suspicious in this case, even though they satisfied the statistical rules-of-thumb.

As a general rule of thumb, BTW, any time someone says "95% confidence", you should interpret it as less than 95% confidence. The assumptions that go into those confidence calculations are never completely true; and the ways in which they are untrue usually make the justified confidence smaller, not larger.

But since we don't know how much less than 95%, we just say "95%", and hope the listener knows that doesn't really mean 95%.

This could be a function of the fact that I have very little training in statistics and am trying to get by on common sense and raw intelligence, but it seems to me that 'enough rats' implies, among other things, enough rats to see a 'fairly' (or 'statistically') consistent amount of time spent investigating new objects, if the first part of the hypothesis as I stated it is true. If how much time the rats spend investigating new objects is affected by how recently they investigated a different new object, or by some other variable that affects all the rats on a given trial - rather than being consistent, or random, or affected by something that hits a random subset of rats independently of a given trial - then I don't see how adding more rats will help. You'd just get a clearer picture of the fact that the time spent investigating new objects varies with some unconsidered variable that your test is allowing to affect the situation, which you'd then need to find and control for.

That's a good point. If baseline rat curiosity can suddenly drop by half, then the baseline differential between time spent exploring new and old objects could also suddenly change.
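That intuition can be checked with a quick simulation (entirely my own construction, with invented effect sizes): if a session-wide disturbance shifts every rat's new-object time in the same direction, adding rats shrinks the within-session error bars, but the rate at which a session falsely looks like "no recognition" bottoms out at a floor set by the disturbance, not by the sample size.

```python
# Simulation with invented numbers: a session-level shock (e.g. a startled cohort)
# shifts all new-object times in that session.  More rats per session does not
# push the false "no recognition" rate below the floor the shock creates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_no_recognition_rate(n_rats, n_sessions=2000, old_mean=8.0, new_mean=17.0,
                              rat_sd=4.0, session_shock_sd=5.0):
    failures = 0
    for _ in range(n_sessions):
        shock = rng.normal(0.0, session_shock_sd)            # shared by every rat in the session
        old = rng.normal(old_mean, rat_sd, n_rats)
        new = rng.normal(new_mean + shock, rat_sd, n_rats)
        t, p = stats.ttest_ind(new, old, equal_var=False)
        if not (p < 0.05 and new.mean() > old.mean()):
            failures += 1                                     # session wrongly reads as "memory lost"
    return failures / n_sessions

for n in (8, 16, 64, 256):
    print(f"{n:4d} rats per group: false 'no recognition' rate ~ {false_no_recognition_rate(n):.3f}")
```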

are you saying that variations in the new-object times imply confounding factors which corrupt the results?

Technically, yes. But phrasing it that way sounds like the test algorithm should include, "Check that the new object times are consistent". That's not how I detected the error. I said, "Remember that what we originally wanted to know is whether the old object times are different - and they aren't."

The data show the rats spending the same amount of time examining the old objects in all cases. The investigators concluded that the rats didn't recognize the old objects in those cases where they spent less time than usual examining new objects. That interpretation requires believing that it's more likely that the new-object-times plot a strange but reliable function f(M) describing how curious rats are about new objects M minutes after being exposed to a different object, than that your experiment is messed up.

Note also the leftmost two points in figure 1B. This shows that the control rats and the gene-therapy rats both spent the same amount of time investigating the old objects. So now, to continue with the interpretation that the new-object-time is a good control, you have to believe that the gene therapy has both improved the rats' ORM, and made them more inherently curious about objects shown to them 60 minutes after being shown some other object.

In other words, if the setup were good, the old object time ought to increase, rather than the new object time decrease.

That's what I'd expect to see.