In the last installment of this series, we introduced the concept of critical appraisal of the statistical methods used in a paper. The statistical analysis in a study is often the final barrier between the study’s results and application of those results to patient care, so making sure that the findings have been properly evaluated is of obvious importance.
We have previously discussed P values and confidence intervals—two of the most common statistical outcomes upon which clinical decisions are based. In this segment, we will discuss several specific issues that can help a reader decide how much faith to place in a study’s results.
Statistical tests generally require that a variety of assumptions be satisfied for the test procedure to be valid. These assumptions vary from test to test, and unfortunately most computer packages do not prompt users to examine them more closely. This is one of the dangers of “black box” analysis, in which researchers with little statistical training run their data through a statistical package without fully understanding how the output is generated.
Many statistical tests are based on the theory of the bell curve, or normal distribution. For this theory to hold, these tests require a large enough sample size, usually at least 30 subjects per group and sometimes many more. In addition, the data should not be excessively skewed. For example, consider a study comparing two treatments for mild pain in which scores on a continuous 0-10 visual analog scale are expected to fall between 0 and 2. Because the scores are bunched against the lower limit of the scale, the data are asymmetric and an underlying bell curve is unlikely to describe them well. Therefore, a two-sample t-test may not be appropriate for this study even with two large samples.
Another commonly violated assumption is that the two groups being compared must be independent. The simplest violation occurs when the same subjects are measured before and after a procedure. A two-sample statistical test is not appropriate here because the two groups are actually the same subjects, and therefore clearly not independent. In this case, a paired analysis is required. The issue of independence becomes more complicated when we consider tests of multiple variables that may be related to one another, or studies of effects over time. In these instances, additional expertise in selecting the correct analysis approach is usually needed.
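To illustrate why the choice between a two-sample and a paired analysis matters, here is a minimal sketch using only the Python standard library. The before/after values are hypothetical and chosen purely for illustration, and the function names `two_sample_t` and `paired_t` are not from any particular package.

```python
import math
import statistics

# Hypothetical before/after measurements on the same six subjects
# (illustrative values only, not taken from any study).
before = [140, 152, 138, 160, 147, 155]
after = [135, 148, 136, 153, 143, 150]

def two_sample_t(x, y):
    """Welch two-sample t statistic, which treats x and y as independent groups."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

def paired_t(x, y):
    """Paired t statistic, computed on the within-subject differences."""
    diffs = [a - b for a, b in zip(x, y)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se

t_unpaired = two_sample_t(before, after)
t_paired = paired_t(before, after)
print(f"two-sample t = {t_unpaired:.2f}, paired t = {t_paired:.2f}")
```

With these made-up numbers the paired statistic is far larger than the two-sample statistic, because pairing removes the subject-to-subject variation that otherwise swamps a consistent within-subject change.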
The best way to ensure that these assumptions and the many others required for valid statistical testing are met is to plan your analyses with the help of a trained statistician. If this is not an option, it is incumbent upon the researcher to learn about these assumptions and evaluate their study to make sure the appropriate methods are applied.
Negative Study Results
A more straightforward issue concerns interpretation of negative study results. Most clinicians are familiar with statistical power: A small study may yield a negative finding because this is the correct result or because there is not enough power to discern a difference between the groups being tested. Often, the width of the confidence interval provides insight into this problem. If the confidence interval includes a difference that would be clinically meaningful, a negative study should be viewed skeptically. In such cases, a larger study or a meta-analysis may be needed to better address the question. If, on the other hand, the confidence interval suggests that no clinically relevant result is likely, the negative study finding becomes more compelling.
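The reasoning above can be sketched as a small decision rule. This is only an illustration: the function name `interpret_negative_study`, the example intervals, and the assumed clinically meaningful difference of 5 points are all hypothetical, and real appraisal involves more judgment than a threshold comparison.

```python
def interpret_negative_study(ci_low, ci_high, meaningful_difference):
    """Classify a result by comparing the 95% CI for the between-group
    difference with the smallest clinically meaningful difference."""
    if not (ci_low <= 0 <= ci_high):
        return "not a negative study: the interval excludes zero"
    if ci_low <= -meaningful_difference or ci_high >= meaningful_difference:
        return "inconclusive: a clinically meaningful difference is not ruled out"
    return "compelling negative: the interval excludes meaningful differences"

# Hypothetical trials; a difference of 5 points is assumed to matter clinically.
small_trial = interpret_negative_study(-4.0, 12.0, meaningful_difference=5.0)
large_trial = interpret_negative_study(-1.5, 2.0, meaningful_difference=5.0)
print(small_trial)
print(large_trial)
```

Both hypothetical trials are "negative" in the sense that their intervals include zero, but only the narrower interval rules out a difference large enough to matter.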
Multiple Statistical Tests
When we perform a statistical test and set the level of significance at 0.05, we are acknowledging a 5% chance that if the null hypothesis were in fact true we would nonetheless falsely reject it with our test. Turned around, this loosely means a 95% chance of “getting it right,” subject to the limitations of P value interpretation described in the previous segment of this series. This seems reasonable for a single test, but what about the typical research study in which dozens of statistical tests are run? For two independent tests, the chance of “getting it right” in both cases would be 0.95 x 0.95, or about 90%. For 20 independent tests, this probability would be only about 36%, meaning roughly a 64% chance of drawing at least one false conclusion. The trouble is that there is no way to know which of the 20 tests might have yielded a wrong conclusion!
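The arithmetic behind these figures is short enough to check directly. The sketch below assumes, as the text does, that the tests are independent and that every null hypothesis is true; the function names are illustrative.

```python
def chance_all_correct(n_tests, alpha=0.05):
    """Probability of no false rejection across n independent tests
    when every null hypothesis is true."""
    return (1 - alpha) ** n_tests

def chance_any_false_positive(n_tests, alpha=0.05):
    """Probability of at least one false rejection across the same tests."""
    return 1 - chance_all_correct(n_tests, alpha)

for n in (1, 2, 20):
    print(n, round(chance_all_correct(n), 3), round(chance_any_false_positive(n), 3))
```

For 20 tests the first probability is about 0.36, so the chance of at least one false positive is about 0.64.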
To address this issue, researchers may set their initial level of significance at a stricter level—perhaps 0.01. There are also mathematical ways to adjust the level of significance to help with multiple comparisons. The key point is that the more tests you run, the more chances you have to draw a false conclusion. Neither you nor your patients can know when this occurs, though. The same arguments apply to subgroup analyses and data-driven, or post hoc, analyses. Such analyses should be regarded as hypothesis-generating rather than hypothesis-testing, and any findings from these analyses should be evaluated more directly by additional research.
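One widely used adjustment of this kind is the Bonferroni correction, which divides the overall level of significance by the number of tests. The text does not name a specific method, so this choice is an assumption made for illustration, as are the function names.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests

def familywise_error(per_test_alpha, n_tests):
    """Chance of at least one false positive across n independent tests
    when every null hypothesis is true."""
    return 1 - (1 - per_test_alpha) ** n_tests

n_tests = 20
threshold = bonferroni_threshold(0.05, n_tests)
overall = familywise_error(threshold, n_tests)
print(f"per-test threshold = {threshold:.4f}, overall error = {overall:.4f}")
```

With 20 tests, each test must reach P < 0.0025, which keeps the overall chance of a false positive just under 5%. The price is reduced power for each individual comparison.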
A rarely considered aspect of study interpretation is whether the results would change if only a few data points changed. Studies with rare events and wide confidence intervals are often sensitive to a change in even one data point. For example, a study published in 2000 by Kernan et al. presented a statistically significant finding of increased risk of hemorrhagic stroke in women using appetite suppressants containing phenylpropanolamine. This result was based on six exposed cases and one exposed control, with an unadjusted odds ratio of 11.9 (95% CI, 1.4-99.4).
Shifting just one patient who had used phenylpropanolamine from the case group to the control group would change the odds ratio to 5.0, with a nonsignificant CI of 0.9-25.8. Such an analysis should make readers question how quickly they wish to apply the study results to their own patients, especially if the benefits of the drug are significant. A result that is sensitive to small changes in the study population is probably not stable enough to warrant application to the entire patient population.
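A sensitivity check of this kind can be sketched as follows. The exposed counts (six cases, one control) come from the text, but the unexposed counts of 377 cases and 749 controls are an assumption not stated in this article; they were chosen because they reproduce the quoted odds ratio and interval. The Woolf (log-based) interval is also an assumed method, so the limits after the shift need not match the quoted figures to the last decimal.

```python
import math

def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Unadjusted odds ratio with a Woolf (log-based) 95% confidence interval."""
    odds_ratio = (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)
    se_log = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                       + 1 / exposed_controls + 1 / unexposed_controls)
    low = math.exp(math.log(odds_ratio) - z * se_log)
    high = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, low, high

# ASSUMPTION: the unexposed counts (377 cases, 749 controls) are not given in the text.
original = odds_ratio_ci(6, 377, 1, 749)
# Move one exposed patient from the case group to the control group.
shifted = odds_ratio_ci(5, 377, 2, 749)

for label, (or_, low, high) in (("original", original), ("shifted", shifted)):
    print(f"{label}: OR = {or_:.1f}, 95% CI {low:.1f}-{high:.1f}")
```

Under these assumptions, the original interval excludes 1 while the shifted interval includes it, which is the instability the paragraph describes.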
Back to the Common-Sense Test
An excellent way to judge whether a study’s results should be believed is to step back and consider whether they make sense based on current scientific knowledge. If they do not, either the study represents a breakthrough in our understanding of disease or the study’s results are flawed. Remember, if the prevalence of a disease is very low, a positive result is likely to be a false positive even when the diagnostic test has high sensitivity and specificity. Similarly, a small P value may represent a false result if the hypothesis being tested does not meet standard epidemiologic criteria for causality, such as biological plausibility. Statistics are primarily a tool to help us make sense of complex study data. They can often suggest when new theories should be evaluated, but they should not determine by themselves which results we apply to patient care.
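The point about low prevalence follows from Bayes' theorem and can be checked with a few lines. The prevalence, sensitivity, and specificity values below are hypothetical, chosen only to show the effect.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a positive test result is a true positive."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# Hypothetical test with 95% sensitivity and 95% specificity.
ppv_rare = positive_predictive_value(0.001, 0.95, 0.95)    # 1 in 1,000 has the disease
ppv_common = positive_predictive_value(0.20, 0.95, 0.95)   # 1 in 5 has the disease
print(f"rare disease PPV = {ppv_rare:.3f}, common disease PPV = {ppv_common:.3f}")
```

With the same test, fewer than 2% of positive results are true positives when the disease affects 1 in 1,000 people, compared with more than 80% when it affects 1 in 5. A small P value for an implausible hypothesis deserves the same skepticism.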
This series has been intended as a brief introduction to many different facets of evidence-based medicine. The primary message of evidence-based medicine is that critical assessment of every aspect of research is necessary to ensure that we make the best possible decisions for our patients. Understanding the important concepts in study design and analysis may seem daunting, but this effort is made worthwhile every time we positively affect patient care.
Hospitalists are uniquely situated at the interface of internal medicine and essentially every other area of medicine, and because of this they have a tremendous opportunity to broadly impact patient care. My hope is that hospitalists well versed in evidence-based medicine will capitalize on this opportunity for the benefit of our patients, will play a prominent role in educating future clinicians on the importance of evidence-based medicine, and will use it to lead the next wave of patient care advances.
Dr. West practices in the Division of General Internal Medicine, Mayo Clinic College of Medicine, Rochester, Minn.
- Greenhalgh T. How to read a paper. Statistics for the non-statistician. I: Different types of data need different statistical tests. BMJ. 1997;315:364-366.
- Greenhalgh T. How to read a paper. Statistics for the non-statistician. II: “Significant” relations and their pitfalls. BMJ. 1997;315:422-425.
- Guyatt G, Rennie D, eds. Users’ guides to the medical literature. Chicago: AMA Press; 2002.
- Kernan WN, Viscoli CM, Brass LM, et al. Phenylpropanolamine and the risk of hemorrhagic stroke. N Engl J Med. 2000;343:1826-1832.