Concern about the reliability of biomedical research often leads to calls for testing the reproducibility of published studies in independent labs. While well intentioned, this approach may often be misguided. Here's why:
Reproducing data is an important component of a robust measurement process, but it is only one component. Other components include determining the limit of detection and the response function; incorporating intermediate control measurements; testing the assay for robustness to changes in protocol, technicians and reagents; performing the study with at least one orthogonal method to minimize bias and test assumptions; testing software for accuracy; and using appropriate statistical analyses.
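To make two of these components concrete, here is a minimal sketch, in Python with hypothetical numbers, of how a lab might fit a linear response function from a calibration series and estimate a limit of detection from blank replicates. The 3-sigma convention used for the detection limit is one common choice among several, not the definitive method.

    import numpy as np

    # Hypothetical calibration series: known analyte concentrations
    # (e.g., ng/mL) and the corresponding instrument readings.
    conc = np.array([0.0, 1.0, 2.0, 5.0, 10.0, 20.0])
    signal = np.array([0.05, 0.52, 1.01, 2.48, 4.95, 9.90])

    # Response function: fit signal = slope * conc + intercept.
    slope, intercept = np.polyfit(conc, signal, 1)

    # Blank replicates: repeated measurements of a sample with no analyte.
    blanks = np.array([0.04, 0.06, 0.05, 0.07, 0.05])

    # One common convention: the limit of detection is the concentration
    # whose expected signal lies 3 standard deviations above the mean blank.
    lod = 3 * blanks.std(ddof=1) / slope

    print(f"response function: signal = {slope:.3f} * conc + {intercept:.3f}")
    print(f"limit of detection ~ {lod:.3f} (concentration units)")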
One way of assuring a robust measurement process is to perform interlaboratory comparisons. The strengths of interlaboratory comparison studies are several: because more than two labs are involved, the study has statistical power; and the study can be designed to determine which protocol differences are significant contributors to lack of robustness. Surprising as it may be to naïve observers, even highly expert labs analyzing well-defined samples can fail to achieve the same result [1].
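The statistical core of such a comparison can be sketched in a few lines. Below is a minimal illustration, with made-up numbers, of a one-way random-effects analysis that partitions measurement variance into within-lab and between-lab components; a large between-lab component is the signature of a harmonization problem that replicates inside any single lab can never reveal.

    import numpy as np

    # Hypothetical interlaboratory comparison: each lab measures the same
    # well-defined sample several times (balanced design, 4 replicates/lab).
    measurements = {
        "lab_A": [10.1, 10.3, 9.9, 10.2],
        "lab_B": [11.0, 11.2, 10.8, 11.1],
        "lab_C": [9.5, 9.7, 9.6, 9.4],
        "lab_D": [10.4, 10.6, 10.5, 10.3],
    }

    data = np.array(list(measurements.values()))  # shape: (labs, replicates)
    k, n = data.shape
    grand_mean = data.mean()
    lab_means = data.mean(axis=1)

    # One-way random-effects ANOVA: partition variance between vs. within labs.
    ms_between = n * ((lab_means - grand_mean) ** 2).sum() / (k - 1)
    ms_within = ((data - lab_means[:, None]) ** 2).sum() / (k * (n - 1))

    # Estimated variance components (method-of-moments).
    var_within = ms_within
    var_between = max((ms_between - ms_within) / n, 0.0)

    print(f"within-lab variance:  {var_within:.4f}")
    print(f"between-lab variance: {var_between:.4f}")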
The example cited above was a study of peptide analysis by mass spectrometry. Experiments involving cells can be much more complex. An international comparative study of nanoparticle toxicity has been performed using a common assay for cellular metabolic activity [2]. Results from different labs showed large deviations in measured toxicity. The protocol was then examined with a designed experiment to identify the most important sources of experimental variation, and revised to include intermediate benchmark measurements that add confidence in the data. By the end of the study, each lab had modified its protocol, tested its process with internal benchmark measures, and achieved a high rate of concordance in its data.
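A designed experiment of this kind can be as simple as a small factorial screen. The sketch below, again with hypothetical factors and fabricated numbers, computes main effects from a two-level, three-factor design; here one factor (incubation time) clearly dominates, which is exactly the kind of finding that tells labs where to harmonize.

    import numpy as np

    # Hypothetical 2^3 full-factorial screen of three protocol factors that
    # might drive lab-to-lab variation in a cell-based assay, coded -1 / +1.
    design = np.array([
        [-1, -1, -1],
        [+1, -1, -1],
        [-1, +1, -1],
        [+1, +1, -1],
        [-1, -1, +1],
        [+1, -1, +1],
        [-1, +1, +1],
        [+1, +1, +1],
    ])
    # Measured assay response for each run (made-up numbers).
    response = np.array([72.0, 75.0, 61.0, 63.0, 71.0, 74.0, 60.0, 62.0])

    # Main effect of each factor: mean response at +1 minus mean at -1.
    factors = ["seeding density", "incubation time", "reagent lot"]
    for name, column in zip(factors, design.T):
        effect = response[column == 1].mean() - response[column == -1].mean()
        print(f"{name:16s} main effect: {effect:+.1f}")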
It is important to note that reproducing results does not indicate that the results are accurate; it does not necessarily assure the robustness of the protocol, it does not test assumptions, and it does not necessarily eliminate bias. When experiments in one laboratory are not reproducible in another, that indicates only a failure to harmonize the measurement processes. Such a failure can occur for many reasons, such as non-identical starting materials, or differences in the handling of samples or reagents, as was found in the nanoparticle toxicity study. It can also occur because certain steps in the protocol are not robust, so that small differences in the protocol have large effects on the results.
An alternative to testing each study for reproducibility is to require more complete reporting of the measurement process: How were reagents characterized? How was robustness tested? Were intermediate measurements within the expected range? What orthogonal methods were used to show that the result was not method-dependent? What were the limits of detection and the response function [3]?
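Such reporting lends itself to a structured, machine-readable form. The record below is a purely hypothetical sketch of what one entry might look like; the field names and values are illustrative, not an existing standard or schema.

    # Purely illustrative: a structured report answering the questions above.
    measurement_report = {
        "assay": "cellular metabolic activity",
        "reagent_characterization": {
            "reagent": "metabolic substrate", "lot": "L-0421",
            "qc_checks": ["absorbance of fresh stock", "expiry verified"],
        },
        "robustness_tests": [
            {"factor": "incubation time", "range_tested": "2-4 h",
             "significant_effect": True},
        ],
        "intermediate_controls": {
            "blank_signal": 0.05, "expected_range": [0.03, 0.08],
            "within_range": True,
        },
        "orthogonal_methods": ["live/dead staining by flow cytometry"],
        "limit_of_detection": {"value": 0.12, "units": "ng/mL"},
        "response_function": "linear; slope 0.49, intercept 0.05",
    }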
One of the most enticing aspects of complete reporting of the measurement process is that discrepancies between laboratories might uncover parameters that are surprisingly important, or surprisingly unimportant, in the biological process under study. If we are prepared to mine these data, there may be important knowledge embedded in disparate results. For example, when the effect of experimental parameters on the outcome of the nanotoxicity study was analyzed, nanotoxicity turned out to be insensitive to a genetic difference detected in the (presumably identical) cell lines used in the different laboratories. That's an interesting result, isn't it?
[1] Addona et al., Nat. Biotechnol. 27(7): 633–641 (2009).
[2] Elliott et al., in preparation (2015).
[3] Plant et al., Nat. Methods (2013).