STATISTICS
Year : 2016  |  Volume : 7  |  Issue : 2  |  Page : 106-107

Common pitfalls in statistical analysis: The perils of multiple testing


1 Department of Anaesthesiology, Tata Memorial Centre, Mumbai, Maharashtra, India
2 Department of Surgical Oncology, Division of Thoracic Surgery, Tata Memorial Centre, Mumbai, Maharashtra, India
3 International Drug Development Institute, San Francisco, California, USA; Department of Biostatistics, Hasselt University, Hasselt, Belgium

Date of Web Publication: 31-Mar-2016

Correspondence Address:
Priya Ranganathan
Department of Anaesthesiology, Tata Memorial Centre, Ernest Borges Road, Parel, Mumbai - 400 012, Maharashtra
India

Source of Support: None, Conflict of Interest: None


DOI: 10.4103/2229-3485.179436

Abstract


Multiple testing refers to situations where a dataset is subjected to statistical testing multiple times - either at multiple time-points or through multiple subgroups or for multiple end-points. This amplifies the probability of a false-positive finding. In this article, we look at the consequences of multiple testing and explore various methods to deal with this issue.

Keywords: Biostatistics, data interpretation, multiplicity, statistical significance


How to cite this article:
Ranganathan P, Pramesh C S, Buyse M. Common pitfalls in statistical analysis: The perils of multiple testing. Perspect Clin Res 2016;7:106-7

How to cite this URL:
Ranganathan P, Pramesh C S, Buyse M. Common pitfalls in statistical analysis: The perils of multiple testing. Perspect Clin Res [serial online] 2016 [cited 2019 Sep 19];7:106-7. Available from: http://www.picronline.org/text.asp?2016/7/2/106/179436




Introduction


In a previous article, we discussed the alpha error rate (or false-positive error rate), which is the probability of falsely rejecting the null hypothesis.[1] In any study comparing two or more groups, there is always a chance of finding a difference between them purely by chance. This is known as a Type 1 error, in contrast to a Type 2 error, which consists of failing to detect a difference that truly exists. Conventionally, the alpha error is set at 5% or less, which ensures that when we do find a difference between the groups, we can be at least 95% confident that it is a true difference and not a chance finding.
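A quick Monte Carlo simulation (our illustrative sketch, not part of the original article) makes the 5% alpha level concrete: if two groups are drawn from the same population and tested repeatedly, a test at the 5% level will declare a "significant" difference in roughly 5% of trials.

```python
import math
import random

random.seed(0)

def z_test_reject(n=50, critical_z=1.96):
    # Draw two groups from the SAME normal distribution, so the null
    # hypothesis is true by construction, then run a two-sample z-test
    # (known variance 1) for a difference in means at the 5% level.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(a) / n - sum(b) / n) / math.sqrt(2 / n)
    return abs(z) > critical_z

trials = 2000
false_positives = sum(z_test_reject() for _ in range(trials)) / trials
print(round(false_positives, 2))  # close to 0.05, the alpha level
```

The observed false-positive fraction hovers around 0.05, matching the nominal significance level for a single comparison.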

The 5% limit for alpha, known as the significance level of the study, is set for a single comparison between groups. When we compare treatment groups multiple times, the probability of finding a difference just by chance increases with the number of comparisons performed. In many clinical trials, a number of interim analyses are planned during the course of the trial, with the final analysis taking place when all patients have been accrued and followed up for a minimum period. If all these interim (and final) analyses were performed at the 5% significance level, the overall probability of a Type 1 error would exceed the prespecified limit of 5%. It can be calculated that if two groups are compared 5 times, the probability of a false-positive finding is as high as 23%; if they are compared 20 times, the probability of finding a significant difference just by chance increases to 64%.[2],[3] Fortunately, much statistical research has been devoted to this problem, and “group sequential designs” have been proposed to control the Type 1 error rate when the data of a trial need to be analyzed multiple times.
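The 23% and 64% figures follow from the formula 1 − (1 − α)^k for k independent comparisons, each tested at level α. A minimal check (illustrative, and assuming the tests are independent):

```python
# Probability of at least one false positive among k independent
# comparisons, each performed at significance level alpha.
def familywise_error_rate(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20):
    print(k, round(familywise_error_rate(k), 2))
# 1 0.05
# 5 0.23
# 20 0.64
```

Correlated tests inflate the error rate less than this independent-test formula suggests, but the qualitative conclusion is the same: more comparisons mean more false positives.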

Another, more challenging, type of multiple testing occurs when authors try to salvage a negative study. If the primary endpoint does not show statistical significance, looking at multiple other (less important) comparisons quite often produces a “positive” result, especially if there are many such comparisons. Investigators can try to analyze different endpoints, among different subsets of patients, using different statistical tests, and so on, so the opportunity for multiplicity can be substantial.[2],[4] A case in point is subset analysis, in which the treatments are compared among subsets of patients defined by prognostic features such as gender, age, tumor location, stage, histology, and grade. If there were only three such binary factors, 2³ = 8 subsets could be formed. If we were to compare the treatments among these 8 subsets, we would have one chance in three (33% probability) of observing a statistically significant (P ≤ 0.05) treatment effect in one of them even if there was no true difference between the treatments. Worse still, if there was an overall statistically significant benefit (P ≤ 0.05) in favor of one of the treatments, we would have a nine in ten chance (90% probability) of observing a benefit in favor of the other treatment in one of the subsets!
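The one-in-three figure for the subset example can be verified with the same independence approximation (an illustrative calculation, not from the article):

```python
alpha = 0.05
n_subsets = 2 ** 3  # three binary prognostic factors -> 8 subsets
# Probability that at least one of the 8 subset comparisons is
# "significant" at P <= 0.05 when no true difference exists.
p_false_positive = 1 - (1 - alpha) ** n_subsets
print(round(p_false_positive, 2))  # 0.34, i.e. roughly one chance in three
```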

It is to avoid these serious problems that all intended comparisons should be fully prespecified in the research protocol, with appropriate adjustments for multiple testing. For retrospective studies, however, it is difficult to ascertain whether the analyses performed were actually planned when the research idea was conceived or whether they were mere data dredging.


How Are Adjustments Made for Multiple Testing?


Two main techniques have been described for controlling the overall alpha error:

  1. The family-wise error rate: This approach attempts to control the overall false-positive rate across all comparisons. A “family” is defined as a set of tests related to the same hypothesis.[2] Approaches for correcting the alpha error include the Bonferroni, Tukey, Hochberg, and Holm step-down methods. The Bonferroni correction consists of simply dividing the overall alpha level by the number of comparisons. For example, if 20 comparisons are being made, then the alpha level for significance of each comparison would be 0.05/20 = 0.0025. However, while this is simple to apply (and understand), it has been criticized as being far too conservative, especially when the various tests being performed are highly correlated.[3]
  2. The false discovery rate: This approach attempts to control the fraction of “false significant results” among the significant results only. The Benjamini–Hochberg procedure has been described for this approach.[5]
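The procedures named above can be sketched in a few lines. These are our illustrative implementations (function names are ours, and real analyses would typically use a validated statistical package), shown only to make the mechanics concrete:

```python
def bonferroni_threshold(alpha, m):
    # Bonferroni: each of the m comparisons is tested at alpha / m.
    return alpha / m

def holm_rejections(pvals, alpha=0.05):
    # Holm step-down: test the smallest p-value at alpha/m, the next
    # at alpha/(m-1), and so on, stopping at the first failure.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break
    return rejected

def benjamini_hochberg_rejections(pvals, q=0.05):
    # Benjamini-Hochberg step-up: reject the k smallest p-values, where
    # k is the largest rank with p_(k) <= (k/m) * q; this controls the
    # false discovery rate rather than the family-wise error rate.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

pvals = [0.001, 0.012, 0.03, 0.04]
print(bonferroni_threshold(0.05, 4))            # 0.0125
print(holm_rejections(pvals))                   # [True, True, False, False]
print(benjamini_hochberg_rejections(pvals))     # [True, True, True, True]
```

The example shows the ordering of stringency: Bonferroni is the most conservative, Holm rejects at least as many hypotheses as Bonferroni while still controlling the family-wise error rate, and Benjamini–Hochberg is the least conservative because it controls only the false discovery rate.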



Is Adjustment or Common Sense Needed for Multiple Testing?


Many statisticians feel that alpha-adjustment for multiple comparisons reduces the significance threshold to very stringent levels and increases the chances of a Type 2 error (false-negative error; failing to reject a null hypothesis that is actually false).[2] It has also been argued that an obsessive reliance on alpha-adjustment may be counterproductive.[6]

The following simple strategies have been suggested to handle multiple comparisons:[2],[7]

  • Readers should evaluate the quality of the study and the actual effect size instead of focusing only on statistical significance
  • Results from single studies should not be used to make treatment decisions; instead, one should look for scientific plausibility and supporting data from other studies which can validate the results of the original study
  • Authors should try to limit comparisons between groups and identify a single primary endpoint; using a composite endpoint or global assessment tool is also an acceptable alternative to using multiple endpoints.


Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

 
References

1. Ranganathan P, Pramesh CS, Buyse M. Common pitfalls in statistical analysis: “P” values, statistical significance and confidence intervals. Perspect Clin Res 2015;6:116-7.
2. Drachman D. Adjusting for multiple comparisons. J Clin Res Best Pract 2012;8:1-3. Available from: https://www.firstclinical.com/journal/2012/1207_Multiple.pdf. [Last cited on 2016 Mar 21].
3. Goldman M. Why is multiple testing a problem? 2008. Available from: http://www.stat.berkeley.edu/~mgoldman/Section0402.pdf. [Last cited on 2016 Mar 21].
4. Cabral HJ. Multiple comparisons procedures. Circulation 2008;117:698-701.
5. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc Series B 1995;57:289-300.
6. Buyse M, Hurvitz SA, Andre F, Jiang Z, Burris HA, Toi M, et al. Statistical controversies in clinical research: Statistical significance - too much of a good thing. Ann Oncol 2016. [Epub ahead of print].
7. Feise RJ. Do multiple outcome measures require p value adjustment? BMC Med Res Methodol 2002;2:8.


