STATISTICS
Year : 2016  Volume : 7  Issue : 2  Page : 106-107

Common pitfalls in statistical analysis: The perils of multiple testing

Priya Ranganathan^{1}, CS Pramesh^{2}, Marc Buyse^{3}
^{1} Department of Anaesthesiology, Tata Memorial Centre, Mumbai, Maharashtra, India
^{2} Department of Surgical Oncology, Division of Thoracic Surgery, Tata Memorial Centre, Mumbai, Maharashtra, India
^{3} International Drug Development Institute, San Francisco, California, USA; Department of Biostatistics, Hasselt University, Hasselt, Belgium

Correspondence Address:

Multiple testing refers to situations where a dataset is subjected to statistical testing multiple times, either at multiple time points, through multiple subgroups, or for multiple endpoints. This amplifies the probability of a false-positive finding. In this article, we look at the consequences of multiple testing and explore various methods to deal with this issue.
Introduction

In a previous article, we discussed the alpha error rate (or false-positive error rate), which is the probability of falsely rejecting the null hypothesis.[1] In any study, when two or more groups are compared, there is always a possibility of finding a difference between them just by chance. This is known as a Type 1 error, in contrast to a Type 2 error, which consists of failing to detect a difference that truly exists. Conventionally, the alpha error is set at 5% or less, which ensures that when there is truly no difference between the groups, the probability of finding one just by chance is no more than 5%. The 5% limit for alpha, known as the significance level of the study, is set for a single comparison between groups. When we compare treatment groups multiple times, the probability of finding a difference just by chance increases with the number of times we perform the comparison.

In many clinical trials, a number of interim analyses are planned to occur during the course of the trial, with the final analysis taking place when all patients have been accrued and followed up for a minimum period. If all these interim (and final) analyses were performed at the 5% significance level, the overall probability of a Type 1 error would exceed the prespecified limit of 5%. It can be calculated that if two groups are compared 5 times, the probability of a false-positive finding is as high as 23%; if they are compared 20 times, the probability of finding a significant difference just by chance increases to 64%.[2],[3] Fortunately, much statistical research has been devoted to this problem, and “group sequential designs” have been proposed to control the Type 1 error rate when the data of a trial need to be analyzed multiple times.

Another, more challenging type of multiple testing occurs when authors try to salvage a negative study.
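
The 23% and 64% figures quoted above for 5 and 20 comparisons are consistent with a simple calculation. The sketch below is illustrative only: it assumes that every comparison is independent, that each is tested at the same alpha, and that no true difference exists, in which case the chance of at least one false positive is 1 - (1 - alpha)^k. The function name `familywise_error_rate` is introduced here for illustration and does not come from the article.

```python
def familywise_error_rate(alpha: float, n_comparisons: int) -> float:
    """Probability of at least one false positive across several tests.

    Assumption (not from the article): the tests are independent and
    every null hypothesis is true.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if n_comparisons < 0:
        raise ValueError("n_comparisons must be non-negative")
    # The chance that no test is falsely significant is (1 - alpha) ** n,
    # so the chance of at least one false positive is its complement.
    return 1.0 - (1.0 - alpha) ** n_comparisons


if __name__ == "__main__":
    for k in (1, 5, 20):
        rate = familywise_error_rate(0.05, k)
        print(f"{k:>2} comparisons at alpha = 0.05: {rate:.0%}")
```

When the comparisons are positively correlated, as repeated looks at accumulating trial data usually are, the true probability is lower than this independence-based value, so the result is best read as an approximation.
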
If the primary endpoint does not show statistical significance, looking at multiple other (less important) comparisons quite often produces a “positive” result, especially if there are many such comparisons. Investigators can try to analyze different endpoints, among different subsets of patients, using different statistical tests, and so on, so the opportunity for multiplicity can be substantial.[2],[4] One case in point is subset analysis, in which the treatments are compared among subsets of patients defined using prognostic features such as gender, age, tumor location, stage, histology, and grade. If there were only three such binary factors, 2^{3} = 8 subsets could be formed. If we were to compare the treatments among these 8 subsets, we would have about one chance in three (roughly 33% probability) of observing a statistically significant (P ≤ 0.05) treatment effect in one of them even if there was no true difference between the treatments. Worse still, if there was an overall statistically significant benefit (P ≤ 0.05) in favor of one of the treatments, we would have a nine in ten chance (90% probability) of observing a benefit in favor of the other treatment in one of the subsets! To avoid these serious problems, all intended comparisons should be fully prespecified in the research protocol, with appropriate adjustments for multiple testing. However, for retrospective studies, it is difficult to ascertain whether the analyses performed were actually planned when the research idea was conceived or whether they were mere data dredging.

How Are Adjustments Made for Multiple Testing?

Two main techniques have been described for controlling the overall alpha error:

The family-wise error rate: This approach attempts to control the overall false-positive rate for all comparisons.
“Family” is defined as a set of tests related to the same hypothesis.[2] Various approaches for correcting the alpha error include the Bonferroni, Tukey, Hochberg, and Holm's step-down methods. The Bonferroni correction consists of simply dividing the overall alpha level by the number of comparisons. For example, if 20 comparisons are being made, then the alpha level for significance for each comparison would be 0.05/20 = 0.0025. However, while this is simple to do (and understand), it has been criticized as being far too conservative, especially when the various tests being performed are highly correlated.[3]

The false discovery rate: This approach attempts to control the fraction of “false significant results” among the significant results only. The Benjamini and Hochberg procedure has been described for this approach.[5]

Is Adjustment or Common Sense Needed for Multiple Testing?

Many statisticians feel that alpha-adjustment for multiple comparisons reduces the significance value to very stringent levels and increases the chances of a Type 2 error (false-negative error; falsely accepting the null hypothesis).[2] It has also been argued that an obsessive reliance on alpha-adjustment may be counterproductive.[6] The following simple strategies have been suggested to handle multiple comparisons:[2],[7]

Readers should evaluate the quality of the study and the actual effect size instead of focusing only on statistical significance.
Results from single studies should not be used to make treatment decisions; instead, one should look for scientific plausibility and supporting data from other studies which can validate the results of the original study.
Authors should try to limit comparisons between groups and identify a single primary endpoint; using a composite endpoint or global assessment tool is also an acceptable alternative to using multiple endpoints.

Financial support and sponsorship
Nil.

Conflicts of interest
There are no conflicts of interest.

References


