STATISTICS
Year: 2017 | Volume: 8 | Issue: 2 | Page: 100-102
Common pitfalls in statistical analysis: Linear regression analysis
Rakesh Aggarwal1, Priya Ranganathan2
1 Department of Gastroenterology, Sanjay Gandhi Postgraduate Institute of Medical Sciences, Lucknow, Uttar Pradesh, India 2 Department of Anaesthesiology, Tata Memorial Centre, Mumbai, Maharashtra, India
Date of Web Publication: 27-Mar-2017
Correspondence Address: Priya Ranganathan, Department of Anaesthesiology, Tata Memorial Centre, Ernest Borges Road, Parel, Mumbai - 400 012, Maharashtra, India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/2229-3485.203040
Abstract
In a previous article in this series, we explained correlation analysis which describes the strength of relationship between two continuous variables. In this article, we deal with linear regression analysis which predicts the value of one continuous variable from another. We also discuss the assumptions and pitfalls associated with this analysis.
Keywords: Biostatistics, linear model, regression analysis
How to cite this article: Aggarwal R, Ranganathan P. Common pitfalls in statistical analysis: Linear regression analysis. Perspect Clin Res 2017;8:100-2
We often have information on two numeric characteristics for each member of a group and believe that these are related to each other – i.e. values of one characteristic vary depending on the values of the other. For instance, in a recent study, researchers had data on body mass index (BMI) and mid-upper arm circumference (MUAC) on 1373 hospitalized patients, and they decided to determine whether there was a relationship between BMI and MUAC.[1] In such a situation, as we discussed in a recent piece on “Correlation” in this series,[2] the researchers would plot the data on a scatter diagram. If the dots fall roughly along a straight line, sloping either upwards or downwards, they would conclude that a relationship exists. As a next step, they may be tempted to ask whether, knowing the value of one variable (MUAC), it is possible to predict the value of the other variable (BMI) in the study group. This can be done using “simple linear regression” analysis, also sometimes referred to as “linear regression.” The variable whose value is known (MUAC here) is referred to as the independent (or predictor or explanatory) variable, and the variable whose value is being predicted (BMI here) is referred to as the dependent (or outcome or response) variable. The independent and dependent variables are, by convention, referred to as “x” and “y” and are plotted on horizontal and vertical axes, respectively.
At times, one is interested in predicting the value of a numerical response variable based on the values of more than one numeric predictor. For instance, one study found that whole-body fat content in men could be predicted using information on thigh circumference, triceps and thigh skinfold thickness, biceps muscle thickness, weight, and height.[3] This is done using “multiple linear regression.” We will not discuss this more complex form of regression.
Although the concepts of “correlation” and “linear regression” are related and share some assumptions, they also differ in important ways, as we discuss later in this piece.
The Regression Line
Linear regression analysis of observations on two variables (x and y) in a sample can be looked upon as plotting the data and drawing a best fit line through them. This “best fit” line is chosen so that the sum of squares of all the residuals (the vertical distance of each point from the line) is a minimum; it is the so-called “least squares line” [Figure 1].
Figure 1: Data from a sample and estimated linear regression line for these data. Each dot corresponds to a data point, i.e., an individual pair of values for x and y, and the vertical dashed lines from each dot represent residuals. The capital letters (Y) are used to indicate predicted values and lowercase letters (x and y) for known values. Intercept is shown as “a” and slope or regression coefficient as “b”
This line can be mathematically defined by an equation of the form:
Y = a + bx
where “x” is the known value of the independent (or predictor or explanatory) variable; “Y” is the predicted (or fitted) value of “y” (the dependent, outcome, or response variable) for the given value of “x”; “a” is called the “intercept” of the estimated line and represents the value of Y when x = 0; and “b” is called the “slope” of the estimated line and represents the amount by which Y changes on average as “x” increases by one unit. The slope is also referred to as the “coefficient,” “regression coefficient,” or “gradient.” Note that lowercase letters (x and y) denote actual values and the capital letter (Y) denotes predicted values.
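As a minimal sketch of how the least squares line can be obtained, the snippet below computes “a” and “b” from paired observations using the standard closed-form expressions b = Σ(x − mean of x)(y − mean of y) / Σ(x − mean of x)² and a = (mean of y) − b × (mean of x). The function name `fit_line` and the data are hypothetical and serve only as an illustration; in practice, a statistical package would normally be used.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line Y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-products of deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope (regression coefficient)
    a = mean_y - b * mean_x  # intercept
    return a, b


# Hypothetical paired observations (not from any study)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = fit_line(xs, ys)
# Residual = observed y minus predicted Y for the same x
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(f"a = {a:.2f}, b = {b:.2f}")
```

For a least squares line that includes an intercept, the residuals sum to zero (up to floating-point rounding), which offers a quick check on the fit.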
The value of “b” is positive when the value of Y increases with each unit increase in x and is negative if the value of Y decreases with each unit increase in x [Figure 2]. If the value of Y does not change with x, the value of “b” would be expected to be 0. Furthermore, the higher the magnitude of “b,” the steeper is the change in Y with change in x.
Figure 2: Relationships between two quantitative variables and their regression coefficients (“b”). “b” represents predicted change in the value of dependent variable (on Y axis) for each one unit increase in the value of independent variable (on X axis). “b” is positive, zero, or negative, depending on whether, as the independent variable increases, the value of dependent variable is predicted to increase (panels i and ii), remain unchanged (iii), or decrease (iv). A higher absolute value of “b” indicates that the dependent variable changes more for each unit increase in the predictor (ii vs i)
In the example of BMI and MUAC,[1] the linear regression equation was: BMI = −0.042 + 0.972 × MUAC (in cm). Here, +0.972 is the slope or coefficient and indicates that, on average, BMI is expected to be higher by 0.972 units for each unit (cm) increase in MUAC. The first term in the equation (i.e., −0.042) represents the intercept and would be the expected BMI if a person had MUAC of 0 (a zero or negative value of BMI may appear unusual but more on this later).
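The arithmetic of this equation can be sketched in a few lines of Python. The coefficients are those quoted above; the function name `predict_bmi` and the MUAC values of 25 and 26 cm are hypothetical, picked only to illustrate the calculation (whether they fall within the range observed in the original study is not stated here).

```python
INTERCEPT = -0.042  # "a" in the quoted equation
SLOPE = 0.972       # "b": change in predicted BMI per 1 cm increase in MUAC


def predict_bmi(muac_cm):
    """Predicted BMI for a given MUAC in centimeters: Y = a + b*x."""
    return INTERCEPT + SLOPE * muac_cm


bmi_25 = predict_bmi(25.0)  # -0.042 + 0.972 * 25
bmi_26 = predict_bmi(26.0)  # one unit (1 cm) more of MUAC
print(f"MUAC 25 cm: predicted BMI {bmi_25:.3f}")
print(f"MUAC 26 cm: predicted BMI {bmi_26:.3f}")
print(f"difference: {bmi_26 - bmi_25:.3f}")  # equals the slope
```

The difference between the two predictions equals the slope, which is what “higher by 0.972 units for each unit increase in MUAC” means in practice.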
Assumptions
Regression analysis makes several assumptions, which are quite akin to those for correlation analysis, as we discussed in a recent issue of the journal.[2] To recapitulate, first, the relationship between x and y should be linear. Second, all the observations in a sample must be independent of each other; thus, this method should not be used if the data include more than one observation on any individual. Furthermore, the data must not include one or a few extreme values since these may create a false sense of relationship in the data even when none exists. If these assumptions are not met, the results of linear regression analysis may be misleading.
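The effect of a single extreme value can be illustrated with a small sketch. The data below are entirely hypothetical: five points constructed to show no linear relationship, to which one extreme point is then added. The helper `slope_and_r` is an illustrative name and computes the least squares slope and Pearson's correlation coefficient from their usual formulas.

```python
from math import sqrt


def slope_and_r(xs, ys):
    """Return (b, r): least squares slope and Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / sqrt(sxx * syy)


# Hypothetical data with no linear relationship between x and y
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.0, 1.0, 3.0, 2.0]
b_without, r_without = slope_and_r(xs, ys)

# The same data plus a single extreme observation at (20, 20)
b_with, r_with = slope_and_r(xs + [20.0], ys + [20.0])

print(f"without extreme value: b = {b_without:.2f}, r = {r_without:.2f}")
print(f"with extreme value:    b = {b_with:.2f}, r = {r_with:.2f}")
```

In this constructed example, one added point is enough to turn a flat relationship into an apparently strong one, which is why a scatter diagram should be inspected before a regression line is fitted.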
Correlation Versus Regression
Correlation and regression analyses are similar in that both assess the linear relationship between two quantitative variables. However, they look at different aspects of this relationship. Simple linear regression (i.e., its coefficient or “b”) describes the nature of the association: it provides a means of predicting the value of the dependent variable using the value of the predictor variable, and it indicates how much and in which direction the dependent variable changes on average for a unit increase in the predictor. By contrast, correlation (i.e., correlation coefficient or “r”) provides a measure of the strength of linear association, that is, of how closely the individual data points lie on the regression line. The values of “b” and “r” always carry the same sign: either both are positive or both are negative. However, their magnitudes can vary widely. For the same value of “b,” the magnitude of “r” can vary from 1.0 to close to 0.
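The point that the same “b” can go with very different values of “r” can be shown with a short sketch. Both data sets below are hypothetical and were constructed so that the fitted slope is 2 in each case; only the scatter of the points around the line differs. The helper name `slope_and_r` is illustrative.

```python
from math import sqrt


def slope_and_r(xs, ys):
    """Return (b, r): least squares slope and Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / sqrt(sxx * syy)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Every point lies exactly on the line y = 1 + 2x
ys_tight = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]

# Same underlying line, but with large scatter added in a pattern
# chosen so that the fitted slope is unchanged
ys_loose = [7.0, -3.0, 11.0, 13.0, 3.0, 17.0]

b_tight, r_tight = slope_and_r(xs, ys_tight)
b_loose, r_loose = slope_and_r(xs, ys_loose)

print(f"tight: b = {b_tight:.2f}, r = {r_tight:.2f}")
print(f"loose: b = {b_loose:.2f}, r = {r_loose:.2f}")
```

Both fits predict the same average change in y per unit of x, but the predictions for individual points are far less reliable in the second data set, and that is what the lower “r” reflects.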
Additional Considerations
Some points must be kept in mind when interpreting the results of regression analysis. The absolute value of the regression coefficient (“b”) depends on the units used to measure the two variables. For instance, in a linear regression equation of BMI (dependent) versus MUAC (independent), the value of “b” will be 2.54-fold higher if the MUAC is expressed in inches instead of in centimeters (1 inch = 2.54 cm); conversely, if the MUAC is expressed in millimeters, the regression coefficient will become one-tenth of the original value (1 mm = 1/10 cm). A change in the unit of “y” will also lead to a change in the value of the regression coefficient. The absolute value of a regression coefficient should therefore always be read together with the units of both variables.
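The dependence of “b” on the unit of the predictor can be checked numerically. The sketch below uses hypothetical MUAC and BMI values (not data from the cited study), with MUAC as the predictor and BMI as the outcome as in the example earlier, and refits the slope after converting MUAC from centimeters to inches and to millimeters. The function name `slope` is illustrative.

```python
def slope(xs, ys):
    """Least squares slope b of the line Y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical paired measurements: MUAC in cm and BMI in kg/m^2
muac_cm = [22.0, 24.5, 26.0, 28.5, 31.0, 33.5]
bmi = [20.1, 23.0, 24.8, 27.9, 30.2, 32.4]

# The same MUAC measurements expressed in other units
muac_in = [x / 2.54 for x in muac_cm]  # 1 inch = 2.54 cm
muac_mm = [x * 10.0 for x in muac_cm]  # 1 cm = 10 mm

b_cm = slope(muac_cm, bmi)
b_in = slope(muac_in, bmi)
b_mm = slope(muac_mm, bmi)

print(f"b per cm:   {b_cm:.4f}")
print(f"b per inch: {b_in:.4f}  (ratio to cm: {b_in / b_cm:.2f})")
print(f"b per mm:   {b_mm:.4f}  (ratio to cm: {b_mm / b_cm:.2f})")
```

The fitted relationship itself is unchanged by the conversion; only the number used to express the slope differs, because “one unit of x” now means a different physical increment.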
Similarly, the value of the intercept also depends on the unit used to measure the dependent variable. Another important point to remember about the intercept is that its value may not be biologically or clinically interpretable. For instance, in the MUAC-BMI example above, the intercept was −0.042, a negative value for BMI, which is clearly implausible. This happens when the independent variable cannot take a value of 0 in real life, as was the case in the MUAC-BMI example above (a MUAC of 0 simply cannot occur).
Furthermore, a regression equation should be used for prediction only for those values of the independent variable that lie within the range of its values in the data originally used to develop the regression equation.
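One way to guard against such extrapolation in practice is to store the observed range of the predictor alongside the equation and to refuse predictions outside it. The sketch below is an illustration only: the coefficients are those of the MUAC-BMI equation quoted earlier, whereas the range of 18 to 40 cm is an assumed placeholder (the actual range in the original study is not given here), and the function name `make_bounded_predictor` is hypothetical.

```python
def make_bounded_predictor(a, b, x_min, x_max):
    """Return a function that predicts Y = a + b*x only for x in [x_min, x_max]."""
    def predict(x):
        if not (x_min <= x <= x_max):
            raise ValueError(
                f"x = {x} lies outside the range [{x_min}, {x_max}] "
                "of the data used to fit the equation"
            )
        return a + b * x
    return predict


# Coefficients from the quoted equation; the range is an ASSUMED placeholder
predict_bmi = make_bounded_predictor(a=-0.042, b=0.972, x_min=18.0, x_max=40.0)

print(f"{predict_bmi(25.0):.3f}")  # within the assumed range

try:
    predict_bmi(0.0)  # MUAC of 0 is far outside the assumed range
except ValueError as err:
    print(f"refused: {err}")
```

A check of this kind also prevents the implausible intercept from ever being reported as a prediction, since x = 0 falls outside any realistic range of MUAC.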
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References
1. Benítez Brito N, Suárez Llanos JP, Fuentes Ferrer M, Oliva García JG, Delgado Brito I, Pereyra-García Castro F, et al. Relationship between mid-upper arm circumference and body mass index in inpatients. PLoS One 2016;11:e0160480.
2. Aggarwal R, Ranganathan P. Common pitfalls in statistical analysis: The use of correlation techniques. Perspect Clin Res 2016;7:187-90.
3. Bielemann RM, Gonzalez MC, Barbosa-Silva TG, Orlandi SP, Xavier MO, Bergmann RB, et al. Estimation of body fat in adults using a portable A-mode ultrasound. Nutrition 2016;32:441-6.