Year : 2016  |  Volume : 7  |  Issue : 4  |  Page : 187-190

Common pitfalls in statistical analysis: The use of correlation techniques

1 Department of Gastroenterology, Sanjay Gandhi Postgraduate Institute of Medical Sciences, Lucknow, Uttar Pradesh, India
2 Department of Anaesthesiology, Tata Memorial Centre, Mumbai, Maharashtra, India

Date of Web Publication: 12-Oct-2016

Correspondence Address:
Priya Ranganathan
Department of Anaesthesiology, Tata Memorial Centre, Ernest Borges Road, Parel, Mumbai - 400 012, Maharashtra

Source of Support: None, Conflict of Interest: None

DOI: 10.4103/2229-3485.192046


Correlation is a statistical technique that shows whether and how strongly two continuous variables are related. In this article, the eighth part in a series on 'Common pitfalls in Statistical Analysis', we look at the interpretation of the correlation coefficient and examine various situations in which the use of correlation techniques may be inappropriate.

Keywords: Biostatistics, correlation, "data interpretation, statistical"

How to cite this article:
Aggarwal R, Ranganathan P. Common pitfalls in statistical analysis: The use of correlation techniques. Perspect Clin Res 2016;7:187-90

How to cite this URL:
Aggarwal R, Ranganathan P. Common pitfalls in statistical analysis: The use of correlation techniques. Perspect Clin Res [serial online] 2016 [cited 2020 Dec 5];7:187-90. Available from: https://www.picronline.org/text.asp?2016/7/4/187/192046

   Introduction

We often have information on two numeric characteristics for each member of a group and are interested in finding the degree of association between these characteristics. For instance, an obstetrician may decide to look up the records of women who delivered in her hospital in the previous year to find out whether there is a relationship between their family incomes and the birth weights of their babies. A relationship here means that the two variables vary together, i.e., the birth weight increases (or decreases) as the income increases.
"Correlation" is a statistical tool used to assess the degree of association of two quantitative variables measured in each member of a group. Although it is very commonly used in medical literature, it is also often misunderstood. This piece describes what "correlation" implies and the situations in which it may be used, as well as its pitfalls and the situations in which it should not be used. To illustrate various concepts, we use scatter plots, a graphical method of showing the values of two variables for each individual in a group.

   Measurement of correlation: Correlation coefficient

The degree of correlation between any two variables on a continuous scale is mathematically expressed as the correlation coefficient (also known as Pearson's correlation coefficient or "r"), a number whose value can vary between −1.0 and +1.0. Thus, it has a sign (+ or −) and a magnitude.


Two variables are said to be "positively" correlated [Figure 1]a-c when their values change in tandem, i.e., increasing values of one are associated with increasing values of the other. By contrast, a "negative" correlation [Figure 1]d-f exists when increasing values of one variable are associated with a decrease in the values of the other. Variables with no or little discernible relationship [Figure 1]g are said to have "no correlation."
Figure 1: Scatter plots of the relationship between values of two quantitative variables and their corresponding correlation coefficient (r) values. "r" can vary between −1.0 and +1.0. If the values of the other variable (on the Y-axis) increase as those of one variable (say, on the X-axis) increase, "r" is positive (a-c); if they decrease instead, "r" is negative (d-f). When the values of the two variables have no clear relation, "r" is zero (g). The absolute value of "r" is higher when the individual data points lie closer to a line showing the linear trend (a > b > c; d > e > f)



The absolute value of r represents the strength of association. A value of 1.0 implies a perfect linear relationship between the two variables, i.e., all observations lie on a straight line [Figure 1]a and d, whereas 0 indicates the absence of any linear relationship [Figure 1]g. Higher values (closer to 1.0) imply that individual observations lie close to an imaginary line describing the relationship between the two variables [Figure 1]b and e, and lower values imply that the observations are more spread out [Figure 1]c and f.
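
The sign and magnitude described above can be checked with a few lines of code. The following is a minimal sketch in Python of the usual formula for Pearson's r (the sum of cross-products of deviations from the means, divided by the square root of the product of the two sums of squared deviations). The function name `pearson_r` and all data values are our own illustrative assumptions and are not taken from the figures.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the two means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, chosen only to illustrate sign and magnitude
x = [1, 2, 3, 4, 5]
perfect_positive = [2, 4, 6, 8, 10]  # all points on a rising straight line
perfect_negative = [10, 8, 6, 4, 2]  # all points on a falling straight line
scattered = [2, 5, 4, 9, 10]         # rising trend with some scatter

print(pearson_r(x, perfect_positive))
print(pearson_r(x, perfect_negative))
print(round(pearson_r(x, scattered), 3))
```

The first two calls give values at the two ends of the range, and the third gives a positive value below 1.0 because the points do not lie exactly on a line.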

   Interpretation of value of correlation coefficient

The square of the correlation coefficient (r²), known as the coefficient of determination, represents the proportion of variation in one variable that is accounted for by the variation in the other variable. For example, if the height and weight of a group of persons have a correlation coefficient of 0.80, one can estimate that 64% (0.80 × 0.80 = 0.64) of the variation in their weights is accounted for by the variation in their heights.
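
One way to see what "accounted for" means is to fit a least-squares straight line and compare the variation left over around the line with the total variation. The sketch below, in Python, does this for made-up heights and weights; the helper names (`pearson_r`, `explained_fraction`) and the data are hypothetical, and the least-squares line is our choice of illustration rather than something prescribed in the text.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def explained_fraction(x, y):
    """Fraction of the variation in y accounted for by a least-squares line on x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Total variation of y around its mean
    ss_total = sum((b - mean_y) ** 2 for b in y)
    # Variation of y left over around the fitted line
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1 - ss_residual / ss_total

# Hypothetical heights (cm) and weights (kg), for illustration only
heights = [150, 155, 160, 165, 170, 175, 180]
weights = [52, 58, 57, 66, 64, 75, 74]

r = pearson_r(heights, weights)
print(round(r ** 2, 4), round(explained_fraction(heights, weights), 4))
```

The two printed numbers agree, which is the sense in which r² is the proportion of variation in one variable accounted for by the other.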

It is possible to calculate a P value for an observed correlation coefficient to determine whether a significant linear relationship exists between the two variables of interest. However, with medium- to large-sized samples, such tests show even small correlation coefficients to be highly significant, and hence their use is generally eschewed.
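
The effect of sample size on the P value can be sketched numerically. The Python snippet below uses the standard test statistic t = r × √((n − 2)/(1 − r²)) and, as a simplifying assumption of ours, approximates its distribution by the standard normal, which is reasonable only for fairly large n (an exact test would use the t distribution with n − 2 degrees of freedom). The function name `approx_p_value` and the chosen values of r and n are illustrative.

```python
import math
from statistics import NormalDist

def approx_p_value(r, n):
    """Approximate two-sided P value for H0: no linear correlation.

    Assumption: the t statistic is referred to a standard normal
    distribution, so this is a large-sample approximation only.
    """
    if n < 3 or not -1 < r < 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return 2 * (1 - NormalDist().cdf(abs(t)))

# The same weak correlation (r = 0.10, so r squared is only 0.01)
# evaluated at increasing hypothetical sample sizes
for n in (100, 1000, 10000):
    print(n, approx_p_value(0.10, n))
```

With r fixed at 0.10, only 1% of the variation is accounted for at every sample size, yet the approximate P value falls from well above 0.05 at n = 100 to far below 0.01 at n = 1000 and beyond.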

   When should correlation not be used?

  • The correlation coefficient looks for a linear relationship. Hence, it can be misleading in situations where two variables do have a relationship, but one that is nonlinear. For instance, hand-grip strength initially increases with age (through childhood and adolescence) and then declines (e.g., [Figure 2]a). In such cases, "r" could be low (r = 0 for the data in [Figure 2]a), even though there is a clear relationship.
    Figure 2: Situations in which linear correlation should not be used: (a) two variables have a relationship which is nonlinear (analysis of data points in this figure shows r = 0, thus failing to detect the relationship), (b) the data have one or a few outliers (one outlier at right upper end resulted in a false relationship with r = 0.71; exclusion of this point reduces r to near zero), (c) when the data have two subgroups, within each of which there is no correlation, and (d) when variability in values on Y-axis changes with values on X-axis. Each situation is described further in the text

  • Correlation analysis assumes that all the observations are independent of each other. Thus, it should not be used if the data include more than one observation on any individual. For instance, in the above example, if hand-grip strength had been measured twice in some subjects, that would be an additional reason not to use correlation analysis.
  • If one (or a few) of the individual observations in the sample is an outlier, i.e., located far away from the others, it may create a false sense of relationship [Figure 2]b. Note that the data points in this figure are identical to those in [Figure 1]g, except for the addition of one outlier. On excluding this outlier, the value of r would drop from 0.71 to 0!
  • If the dataset has two subgroups of individuals whose values for one or both variables differ from each other [Figure 2]c, this can lead to a false sense of relationship overall, even when none exists within each subgroup. For instance, let us consider a group of 20 men and 20 women. If one plots their heights (on X-axis) and hemoglobin levels (on Y-axis), most women may end up in the left lower corner (shorter and lower hemoglobin) and most men in the right upper corner (taller and higher hemoglobin), suggesting a false relationship (a positive "r" value) between height and hemoglobin levels.
  • With very small sample size (say 3-6 observations), a relationship may appear to be present even though none exists.
  • Linear correlation analysis applies only to data on a continuous scale. It should not be used when one or both variables have been measured using an ordinal scale, for example, patients' assessment of pain severity on a scale of 0-10, where a higher number means worse pain but similar differences (say from 1 to 3 and from 6 to 8) do not necessarily imply a similar change in pain. In these cases, Spearman's rank correlation method should be used.
  • Correlation should not be used to assess the relationship between a variable and one of its components (e.g., aggregate marks vs. marks in one subject). For instance, it would be fallacious to use correlation to assess the relationship of the heights of a group of persons with the lengths of their bodies' lower segments, since the lower segment forms a part of the overall height.
  • Correlation should not be used in the presence of heteroscedasticity, i.e., a situation in which one variable has unequal variability across the range of values of a second variable. For instance, if one looks at the relationship of annual health expenditure versus the annual income of a family, the former is likely to vary more for richer persons than for poor persons [Figure 2]d.
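
For the ordinal-scale situation in the list above, Spearman's rank correlation can be sketched as Pearson's r applied to the ranks of the values, with tied values sharing their average rank. The Python code below is a minimal illustration under that definition; the function names and the data (a 0-10 pain score against a second measurement that rises steadily but not linearly) are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: pain scores (ordinal, 0-10) and a second measurement
# that always increases with the score, but by unequal steps
pain_score = [0, 1, 2, 3, 4]
other_measure = [1, 2, 4, 8, 16]

print(round(pearson_r(pain_score, other_measure), 3))
print(spearman_rho(pain_score, other_measure))
```

Because the ranking uses only the order of the values, the rank correlation here is 1.0, whereas Pearson's r is lower because the steps are unequal.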
Many of the above pitfalls are easily avoided if one first makes a scatter plot for the data and visually inspects it for nonlinear relationships, outliers, or presence of obvious subgroups.
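
Three of the pitfalls listed above (a nonlinear relationship, a single outlier, and two subgroups) can also be reproduced with small made-up datasets. The Python sketch below is only an illustration: the numbers are hypothetical, are not the data behind Figure 2, and were chosen so that the arithmetic comes out cleanly.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# (a) Nonlinear relationship: strength rises with age and then falls
ages = [10, 20, 30, 40, 50, 60, 70]
grip = [20, 35, 44, 47, 44, 35, 20]
r_nonlinear = pearson_r(ages, grip)

# (b) Outlier: a 3 x 3 grid of points has no linear trend at all,
#     but one distant point creates an apparent relationship
grid_x = [v for v in (1, 2, 3) for _ in range(3)]  # 1,1,1,2,2,2,3,3,3
grid_y = [1, 2, 3] * 3                             # 1,2,3,1,2,3,1,2,3
r_no_outlier = pearson_r(grid_x, grid_y)
r_with_outlier = pearson_r(grid_x + [10], grid_y + [10])

# (c) Subgroups: no relationship within women or within men, but the
#     groups differ in both height (cm) and hemoglobin (g/dL)
women_height = [h for h in (150, 155, 160) for _ in range(3)]
women_hb = [11, 12, 13] * 3
men_height = [h for h in (170, 175, 180) for _ in range(3)]
men_hb = [14, 15, 16] * 3
r_women = pearson_r(women_height, women_hb)
r_men = pearson_r(men_height, men_hb)
r_pooled = pearson_r(women_height + men_height, women_hb + men_hb)

print("nonlinear:", round(r_nonlinear, 3))
print("outlier:", round(r_no_outlier, 3), "->", round(r_with_outlier, 3))
print("subgroups:", round(r_women, 3), round(r_men, 3), "pooled", round(r_pooled, 3))
```

In each case, a scatter plot of the same numbers would reveal the problem at a glance, which a single value of r cannot do.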

Correlation analysis is also often used inappropriately to measure agreement between two methods of measuring the same thing (e.g., tumor volume measured using ultrasound and computed tomography). This will be discussed in the next article in this series.

   A final caution: Correlation does not mean causation

A relationship between two variables is sometimes taken as evidence that one causes the other. This is, however, often not true, and hence the popular statistical adage: "Correlation does not imply causation." You may wish to visit https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation for some interesting insights into how correlation can arise without any causative link.

Examples of such noncausative correlation include (i) countries' annual per capita chocolate consumption and the number of Nobel laureates per 10 million population [1] and (ii) weekly ice-cream consumption and the number of drowning incidents in swimming pools. These arise because both variables in each pair are associated with a third factor: national income [2] and hot weather, respectively.
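
The ice-cream example can be imitated with simulated data in which hot weather drives both variables and neither has any effect on the other. The Python sketch below is purely illustrative: the temperature range, the coefficients, the noise levels, the arbitrary units, and the seed are all assumptions of ours.

```python
import math
import random

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so that the run is repeatable
n_days = 5000

# Temperature (degrees Celsius) is the common driver. Ice-cream sales and
# the drowning index each depend on temperature plus independent noise,
# and neither one enters the formula for the other.
temperature = [rng.uniform(15, 40) for _ in range(n_days)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drowning_index = [0.5 * t + rng.gauss(0, 2) for t in temperature]

r_overall = pearson_r(ice_cream, drowning_index)

# Restrict to days with almost the same temperature (30 to 31 degrees)
band = [i for i, t in enumerate(temperature) if 30 <= t < 31]
r_same_weather = pearson_r([ice_cream[i] for i in band],
                           [drowning_index[i] for i in band])

print(round(r_overall, 2), round(r_same_weather, 2), len(band))
```

Across all days the two simulated variables are strongly correlated, but among days with similar weather the correlation largely disappears, as would be expected when the association comes only from the shared driver.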

   Endpiece

Correlation analysis is a very powerful tool to explore relationships in data. However, one must be careful to use it only when it is applicable. Many of the problems described above can be avoided by careful thought about the data, by plotting the raw data (to look for nonlinear relationships, outliers, and heteroscedasticity), and by thinking in terms of the coefficient of determination in preference to the correlation coefficient.

Financial support and sponsorship

None.

Conflicts of interest

There are no conflicts of interest.

   References

1. Messerli FH. Chocolate consumption, cognitive function, and Nobel laureates. N Engl J Med 2012;367:1562-4.
2. Maurage P, Heeren A, Pesenti M. Does chocolate consumption really boost Nobel Award chances? The peril of over-interpreting correlations in health studies. J Nutr 2013;143:931-3.


