The latest in a long series of articles making me glad I don't work in psychology was this piece about replication in the Guardian. This spins off some harsh criticism of replication studies and a call for an official policy requiring consultation with the original authors of a study that you're attempting to replicate. The reason given is that psychology is so complicated that there's no way to capture all the relevant details in a published methods section, so failed replications are likely to happen because some crucial detail was omitted in the follow-up study.
Predictably enough, this kind of thing leads to a lot of eye-rolling from physicists, which takes up most of the column. And, while I have some sympathy for the idea that studying human psychology is a subtle and complicated process, I also can't help thinking that if the font in which a question is printed is sufficient to skew the result of a study one way or the other, then maybe these results aren't really revealing deep and robust truths about the way our brains work. Rather than demanding that new studies duplicate the prior studies in every single detail, a better policy might be to require some variation of things that ought to be insignificant, to make sure that the results really do hold in a general way.
If you go to precision measurement talks in physics-- and I went to a fair number at DAMOP this year-- there will inevitably be a slide listing all the experimental parameters that they flipped between different values. Many of these are things that you look at and say "Well, how could that make any difference?" and that's the point. If changing something trivial-- the position of the elevator in the physics building, say-- makes your signal change in a consistent way, odds are that your signal isn't really a signal, but a weird noise effect. In which case, you have some more work to do, to track down the confounding source of noise.
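As a cartoon version of what that check amounts to-- a minimal sketch with made-up variable names, not any particular lab's analysis-- you split the data by the setting of a parameter that shouldn't matter, and ask whether the mean shifts by more than the statistical scatter allows:

```python
# Toy systematic-effect check (illustrative only; variable names are hypothetical).
# Split measurements by the setting of a nominally irrelevant parameter, and ask
# whether the difference in means is large compared to the statistical uncertainty.
import numpy as np

def systematic_shift(values_a, values_b):
    """Shift between two settings of a nominally irrelevant parameter,
    in units of the combined statistical uncertainty."""
    a = np.asarray(values_a, dtype=float)
    b = np.asarray(values_b, dtype=float)
    shift = a.mean() - b.mean()
    uncertainty = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return shift / uncertainty

# e.g. systematic_shift(runs_elevator_up, runs_elevator_down)
# A result consistently much bigger than 1 suggests the "signal" is
# tracking something it has no business tracking.
```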
Of course, that's much easier to do in physics than psychology-- physics apparatus is complicated and expensive, but once you have it, atoms are cheap and you can run your experiment over and over and over again. Human subjects, on the other hand, are a giant pain in the ass-- not only do you need to do paperwork to get permission to work with them, but they're hard to find, and many of them expect to be compensated for their time. And it's hard to get them to come in to the lab at four in the morning so you can keep your experiment running around the clock.
This is why the standards for significance are so strikingly different between the fields-- psychologists (and biomedical researchers) are thrilled to see results that are significant at the 1% level, while in many fields of physics, that's viewed as a tantalizing hint, and a sign that much more work is required. But getting enough subjects to hit even the 3-sigma level at which physicists become guardedly optimistic would quickly push the budget for your psych experiment to LHC levels. And if you'd like those subjects to come from outside the WEIRD demographic, well...
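To put rough numbers on those thresholds, here's a quick back-of-the-envelope sketch, assuming the usual two-sided Gaussian convention (which not every field uses):

```python
# Back-of-the-envelope comparison of significance conventions, assuming the
# common two-sided Gaussian convention (conventions differ between fields).
from scipy.stats import norm

def p_from_sigma(n_sigma):
    """Two-sided p-value corresponding to an n-sigma deviation."""
    return 2 * norm.sf(n_sigma)

def sigma_from_p(p):
    """Gaussian-equivalent significance for a two-sided p-value."""
    return norm.isf(p / 2)

for n in (2, 3, 5):
    print(f"{n} sigma -> p = {p_from_sigma(n):.1e}")
print(f"p = 0.05 -> {sigma_from_p(0.05):.2f} sigma")
print(f"p = 0.01 -> {sigma_from_p(0.01):.2f} sigma")
```

Run that and you get roughly p = 0.046 for 2 sigma, 0.0027 for 3 sigma, and 6e-7 for 5 sigma; the 1% level works out to about 2.6 sigma.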
At the same time, though, physicists shouldn't get too carried away. From some of the quotes in that Guardian article, you'd think that experimental methods sections in physics papers are some Platonic ideal of clarity and completeness, which I find really amusing in light of a conversation I had at DAMOP. I was talking to someone I worked with many years ago, who mentioned that his lab recently started using a frequency comb to stabilize a wide range of different laser frequencies to a common reference. I asked how that was going, and he said "You know, there's a whole lot of stuff they don't tell you about those stupid things. They're a lot harder to use than it sounds when you hear Jun Ye talk."
That's true of a lot of technologies, as anyone who's tried to set up an experimental physics lab from scratch learns very quickly. Published procedure sections aren't incomplete in the sense of leaving out critically important steps, but they certainly gloss over a lot of little details.
There are little quirks of particular atoms that complicate some simple processes-- I struggled for a long time with getting a simple saturated absorption lock going in a krypton vapor cell, because the state I'm interested in turns out to have hellishly large problems with pressure broadening. That's fixable, but not really published anywhere obvious-- I worked it out on my own before I talked to a colleague who did the same thing, and he said "Oh, yeah, that was a pain in the ass..."
There are odd features of certain technologies that crop up-- the frequency comb issue that my colleague mentioned at DAMOP was a dependence on one parameter that turns out to be sinusoidal. Which means it's impossible to automatically stabilize, but requires regular human intervention. After asking around, he discovered that the big comb-using labs tend to have one post-doc or staff scientist whose entire job is keeping the comb tweaked up and running properly, something you wouldn't really get from published papers or conference talks.
And there are sometimes issues with sourcing things-- back in the early days of BEC experiments, the Ketterle lab pioneered a new imaging technique, which required a particular optical element. They spent a very long time tracking down a company that could make the necessary part, and once they got it, it worked brilliantly. Their published papers were scrupulously complete in terms of giving the specifications of the element in question and how it worked in their system, but they didn't give out the name of the company that made it for them. Which meant that anybody who had the ability to make that piece had all the information they needed to do the same imaging technique, but anybody without the ability to build it in-house had to go through the same long process of tracking down the right company to get one.
So, I wouldn't say that experimental physics is totally lacking in black magic elements, particularly in small-lab fields like AMO physics. (Experimental particle physics and astrophysics are probably a little better, as they're sharing a single apparatus with hundreds or thousands of collaboration members.)
The difference is less in the purity of the approach to disseminating procedures than in the attitude toward the idea of replication. And, as noted above, the practicalities of working with the respective subjects. Physics experiments are susceptible to lots of external confounding factors, but working with inanimate objects makes it a lot easier to repeat the experiment enough times to run those down in a convincing way. Which, in turn, makes it a little less likely for a result that's really just a spurious noise effect to get into the literature, and thus get to the stage where people feel that failed replications are challenging their professional standing and personal integrity.
It's not impossible, though-- there have even been retractions of particles that were claimed to be detected at the five-sigma level. And sometimes there are debates that drag on for years, and can involve some nasty personal sniping along the way.
The really interesting recent(-ish) physics case that ought to be a big part of a discussion of replication in physics and other sciences is the story of "supersolid" helium, where a new and dramatic quantum effect was claimed, then challenged in ways that led to some heated arguments. Eventually, the original discoverers re-did their experiments, and the effect vanished, strongly suggesting it was a noise effect all along. That's kind of embarrassing for them, but on the other hand, it speaks very well to their integrity and professionalism, and is the kind of thing scientists in general ought to strive to emulate. My sense is that it's also more the exception than the rule, even within physics.
psychologists (and biomedical researchers) are thrilled to see results that are significant at the 1% level
In many cases they will settle for a 5% (2σ) significance level. Which inevitably means that there are many spurious results in the literature. As the alternative is not publishing anything at all, they have decided to live with this. I've pointed out before, and you imply this above, that a big reason why physicists use the 5σ standard is, "Because we can." In this regard, it's physics that is exceptional. And not even all physicists can hold to that standard: astrophysicists and geophysicists often have trouble meeting the 5σ threshold, because you don't always get enough experimental runs out of the real world/universe.
First off, the obligatory reference to Feynman's essay "Cargo Cult Science" (specifically the rat maze portion).
If a given parameter is critical for obtaining the listed results, then it should be specified in the methods sections. If unpublished personal communication is required for replication, it damages the scientific record. It means the effect you're measuring is more limited than you claim. You also have a problem 10 years later when the original researchers have moved on/passed on/simply forgot, and you can no longer talk to them directly.
I can come up with three reasons why you wouldn't include relevant details in the methods section: 1) You want to deprive your competitors of crucial information they need to compete. 2) You didn't realize it was critical. 3) You realized it was critical, but thought the way you did it was the obvious thing to do. (This is surprisingly common, especially where someone immersed in a sub-field doesn't realize that the received wisdom within the sub-field isn't common knowledge outside it. Cf. the frequency comb example.)
There's no excuse for 1), and it should be chastised out of the scientific community. 2) and 3) are harder cases, as the researcher *would* include the details in the methods section if they knew they should. I'm of somewhat mixed mind. On one hand, the methods section should be sufficient to replicate the experiment without communication with the original authors. On the other hand, it's unlikely that people will be able to think of and control for all parameters, so there's always a chance a critical one will slip through. So I think you shouldn't *have* to contact the original researchers for a replication paper, but if you fail to replicate you *should*, to see if there's an uncontrolled parameter which could explain the discrepancy. -- I don't think you should be required to run the replication by the original researchers first, though, as the linked policy seems to recommend. Nor should failure of the original researchers to respond, or a response that's trivially dismissive, limit publication.
"Failure to replicate" shouldn't be the end conclusion of a replication paper. If you're replicating a paper, you have a duty to your readers and the scientific community at large to at least try to figure out why there was a discrepancy between the two papers. (Just as an original paper which demonstrates an effect should expend some effort to determine/explain the cause of the effect.) Most of time this would involve contacting the original researcher to debug the experiment.
Would papers be improved if they had something more like screen credits than a mere author list? Seeing someone listed as "frequency comb knob twiddler" would be a dead giveaway that there was something to watch for.
On the other hand, it’s unlikely that people will be able to think of and control for all parameters, so there’s always a chance a critical one will slip through.
In theory, that is something referees are supposed to do: tell the lead author that he's leaving something out that won't be obvious to the audience. In practice, the reviewers tend to be experts in the same field and are therefore (at least if they're experimentalists--of course the theorists wouldn't know better either) likely to have the same blind spots as the author.
Would papers be improved if they had something more like screen credits than a mere author list?
I understand that some journals already require something like this: "I. M. First designed and conducted the experiment under the supervision of her advisor, U. R. Second, who obtained funding for this research. H. S. Third assisted with data collection during overnight runs. N. D. Last provided the samples used in the experiment." (In some fields, switch Dr. Second and Dr. Last.)
Some biomed journals (Blood, I think) want you to just say p < .001 when it's smaller than that. That gets rather goofy when you are testing over 50,000 hypotheses and want to say how good your best results were (imagine arrays measuring transcript abundances).
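To put numbers on why that's goofy-- purely illustrative arithmetic, with the 50,000 figure taken as a round hypothetical:

```python
# Illustrative arithmetic only; the numbers are hypothetical, not from any real study.
# With 50,000 independent tests, a flat p < .001 cutoff still expects ~50 false
# positives under the null, so "p < .001" alone says little about the best hits.
n_tests = 50_000
alpha = 0.001

expected_false_positives = n_tests * alpha  # ~50 hits expected by chance alone
bonferroni_cutoff = 0.05 / n_tests          # ~1e-6: one common (conservative) correction

print(f"expected false positives at p < {alpha}: {expected_false_positives:.0f}")
print(f"Bonferroni threshold for family-wise alpha = 0.05: {bonferroni_cutoff:.1e}")
```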
I'd also like to point out that sometimes statistics should not be about hypothesis tests, but rather involve decision theory about how to act, and data giving even just p = .05 might be enough to make me take drug D (or whatever) rather than something else. I mention it because on some doctoring blogs, people seem to think hypothesis tests are all data is good for-- they've never seen the decision theory, which is not a good thing, since they make lots of decisions.
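A toy version of that calculation, with entirely made-up utilities and probabilities, just to show how "act or don't act" differs from "reject or don't reject the null":

```python
# Toy decision-theory sketch with hypothetical numbers: modest evidence can
# still tip a decision when the potential benefit outweighs the cost of acting.
p_drug_works = 0.60      # hypothetical belief that drug D helps, after modest evidence
benefit_if_works = 10.0  # hypothetical utility gain if it does help
cost_of_taking = 1.0     # hypothetical cost (side effects, price), paid either way

expected_utility_take = p_drug_works * benefit_if_works - cost_of_taking
expected_utility_skip = 0.0

decision = "take drug D" if expected_utility_take > expected_utility_skip else "skip it"
print(decision, expected_utility_take)
```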
We were guilty of saying to use a DNA ligase, when it actually had to be the one from E. coli rather than just whatever, or else you weren't going to get anything cloned.
I'm a rare bird here - degree in physics, then research in psychophysics, then qualified in psychology, especially hypnotherapy and dream analysis. I'm now back in physics.
I can’t help thinking that if the font in which a question is printed is sufficient to skew the result of a study one way or the other, then maybe we've stumbled on deep and robust truths about the way our brains work.
P.S. I quit psychology because I got scared of the powerful and dramatic effects of dream work and hypnotherapy and realised I was messing with forces I don't understand. Tickling the dragon's tail is not for me.