Multiple Statistical Tests
When we perform a statistical test and set the level of significance at 0.05, we are acknowledging a 5% chance that, if the null hypothesis were in fact true, we would nonetheless falsely reject it. Turned around, this loosely means a 95% chance of “getting it right,” subject to the limitations of P value interpretation described in the previous segment of this series. This seems reasonable for a single test, but what about the typical research study in which dozens of statistical tests are run? For two independent tests, the chance of “getting it right” in both cases would be 0.95 × 0.95 ≈ 0.90, or about 90%. For 20 independent tests, this probability would be only about 36% (0.95 raised to the 20th power), meaning roughly a 64% chance of drawing at least one false conclusion. The trouble is that there is no way to know which of the 20 tests might have yielded a wrong conclusion!
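The arithmetic in this paragraph can be checked with a short sketch. It assumes that the tests are independent and that every null hypothesis is actually true; the function name is illustrative and does not come from any particular library.

```python
def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false rejection among n independent tests,
    assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 2, 20):
    # Probability that none of the n tests falsely rejects its null hypothesis.
    all_correct = (1.0 - 0.05) ** n
    any_wrong = prob_at_least_one_false_positive(n)
    print(f"{n:>2} tests: P(all correct) = {all_correct:.2f}, "
          f"P(at least one false positive) = {any_wrong:.2f}")
```

For 20 tests this gives roughly 0.36 and 0.64. Real studies rarely have fully independent tests, so these figures are best read as a rough guide rather than an exact error rate.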
To address this issue, researchers may set their initial level of significance at a stricter level, perhaps 0.01. There are also mathematical ways to adjust the level of significance for multiple comparisons, such as the Bonferroni correction, which divides the level of significance by the number of tests. The key point is that the more tests you run, the more chances you have to draw a false conclusion, and neither you nor your patients can know when this has occurred. The same arguments apply to subgroup analyses and data-driven, or post hoc, analyses. Such analyses should be regarded as hypothesis-generating rather than hypothesis-testing, and any findings from them should be evaluated more directly by additional research.
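As one illustration of such an adjustment, the sketch below compares two common corrections, Bonferroni and Šidák. The text does not say which method a given study should use, so the choice here is an assumption made for illustration. The function names are hypothetical, and the calculation again assumes independent tests.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


def sidak_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level under the Sidak correction
    (exact for independent tests)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def familywise_error(per_test_alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


n = 20
for name, per_test in (("unadjusted", 0.05),
                       ("Bonferroni", bonferroni_alpha(0.05, n)),
                       ("Sidak", sidak_alpha(0.05, n))):
    print(f"{name:>10}: per-test alpha = {per_test:.5f}, "
          f"family-wise error = {familywise_error(per_test, n):.3f}")
```

Both corrections hold the overall chance of a false positive at or just below 5% for 20 tests, at the cost of requiring much smaller P values from each individual test.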
A rarely considered aspect of study interpretation is whether the results would change if only a few data points changed. Studies with rare events and wide confidence intervals are often sensitive to a change in even one data point. For example, a study published in 2000 by Kernan et al. presented a statistically significant finding of increased risk of hemorrhagic stroke in women using appetite suppressants containing phenylpropanolamine. This result was based on six cases and one control who had used these products, with an unadjusted odds ratio of 11.9 (95% CI, 1.4-99.4).
Shifting just one patient who had used phenylpropanolamine from the case group to the control group would change the odds ratio to 5.0, with a 95% CI of 0.9-25.8. Because this interval includes 1, the result would no longer be statistically significant. Such an analysis should make readers question how quickly they wish to apply the study results to their own patients, especially if the benefits of the drug are substantial. A result that is sensitive to small changes in the study population is probably not stable enough to warrant application to the entire patient population.
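The sensitivity check described above can be sketched as follows. The text gives only the exposed counts (six cases, one control), so the group totals used here (383 cases and 750 controls) are assumptions, chosen because they reproduce the reported unadjusted odds ratio of 11.9. The interval is a simple Wald (log odds ratio) approximation, which may differ slightly from the published interval, since the method behind that interval is not stated.

```python
import math


def odds_ratio_with_ci(exposed_cases, unexposed_cases,
                       exposed_controls, unexposed_controls, z=1.96):
    """Unadjusted odds ratio with a Wald 95% confidence interval."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)
    # Standard error of the log odds ratio from the four cell counts.
    se = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                   + 1 / exposed_controls + 1 / unexposed_controls)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# ASSUMPTION: 383 cases and 750 controls in total (not stated in the text).
original = odds_ratio_with_ci(6, 377, 1, 749)
# Move one exposed patient from the case group to the control group.
shifted = odds_ratio_with_ci(5, 377, 2, 749)

for label, (or_, lo, hi) in (("original", original), ("shifted", shifted)):
    verdict = "excludes 1" if lo > 1 or hi < 1 else "includes 1"
    print(f"{label}: OR = {or_:.1f}, 95% CI {lo:.1f}-{hi:.1f} ({verdict})")
```

Under these assumptions the original interval excludes 1 and the shifted interval includes it, which is the fragility the example is meant to show.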
Back to the Common-Sense Test
An excellent way to judge whether a study’s results should be believed is to step back and consider whether they make sense based on current scientific knowledge. If they do not, either the study represents a breakthrough in our understanding of disease or the study’s results are flawed. Remember, if the prevalence of a disease is very low, even a positive result from a diagnostic test with high sensitivity and specificity is likely to be a false positive. Similarly, a small P value may represent a false result if the hypothesis being tested does not meet standard epidemiologic criteria for causality, such as biological plausibility. Statistics are primarily a tool to help us make sense of complex study data. They can often suggest when new theories should be evaluated, but they should not determine by themselves which results we apply to patient care.
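The point about low prevalence can be made concrete with a small calculation of positive predictive value. The prevalence, sensitivity, and specificity below are hypothetical numbers chosen for illustration, not figures from any study.

```python
def positive_predictive_value(prevalence: float,
                              sensitivity: float,
                              specificity: float) -> float:
    """Probability that a person with a positive test truly has the disease."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: 99% sensitive and 99% specific, disease prevalence 0.1%.
ppv = positive_predictive_value(prevalence=0.001,
                                sensitivity=0.99,
                                specificity=0.99)
print(f"Positive predictive value: {ppv:.1%}")
```

With these inputs only about 9% of positive results are true positives. By analogy, a small P value attached to an implausible hypothesis deserves the same caution as a positive test for a rare disease.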