STATISTICS
Year: 2017  Volume: 8  Issue: 2  Page: 100-102

Common pitfalls in statistical analysis: Linear regression analysis

Rakesh Aggarwal^{1}, Priya Ranganathan^{2}
^{1} Department of Gastroenterology, Sanjay Gandhi Postgraduate Institute of Medical Sciences, Lucknow, Uttar Pradesh, India
^{2} Department of Anaesthesiology, Tata Memorial Centre, Mumbai, Maharashtra, India

In a previous article in this series, we explained correlation analysis, which describes the strength of the relationship between two continuous variables. In this article, we deal with linear regression analysis, which predicts the value of one continuous variable from another. We also discuss the assumptions and pitfalls associated with this analysis.
The Regression Line

Linear regression analysis of observations on two variables (x and y) in a sample can be looked upon as plotting the data and drawing a best-fit line through the points. This “best fit” line is chosen so that the sum of squares of all the residuals (the vertical distance of each point from the line) is a minimum; it is the so-called “least squares line” [Figure 1]. This line can be mathematically defined by an equation of the form:

{Figure 1}

Y = a + bx

Here, “x” is the known value of the independent (or predictor or explanatory) variable, and “Y” is the predicted (or fitted) value of “y” (the dependent, outcome, or response variable) for the given value of “x”. The term “a” is called the “intercept” of the estimated line and represents the value of Y when x = 0. The term “b” is called the “slope” of the estimated line and represents the amount by which Y changes on average as “x” increases by one unit; it is also referred to as the “coefficient,” “regression coefficient,” or “gradient.” Note that lowercase letters (x and y) denote the actual values and the capital letter (Y) denotes predicted values.

The value of “b” is positive when Y increases with each unit increase in x and negative when Y decreases with each unit increase in x [Figure 2]. If Y does not change with x, the value of “b” would be expected to be 0. Furthermore, the larger the magnitude of “b,” the steeper the change in Y with a change in x.

{Figure 2}

In the example of body mass index (BMI) and mid-upper arm circumference (MUAC),[1] the linear regression equation was: BMI = −0.042 + 0.972 × MUAC (in cm). Here, +0.972 is the slope or coefficient and indicates that, on average, BMI is expected to be higher by 0.972 units for each unit (cm) increase in MUAC. The first term in the equation (i.e., −0.042) represents the intercept and would be the expected BMI if a person had a MUAC of 0 (a zero or negative value of BMI may appear unusual, but more on this later).
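The least squares idea described above can be sketched in a few lines of Python. This is only an illustration, not the method used to derive the published BMI-MUAC equation: the function names and the five data points are hypothetical, and the closed-form expressions used (slope = sum of cross-products of deviations divided by sum of squared deviations of x; intercept = mean of y minus slope times mean of x) are the standard ones for a single predictor.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line Y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept
    return a, b


def residual_sum_of_squares(xs, ys, a, b):
    """Sum of squared vertical distances of the points from Y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical values, invented for illustration only (not study data).
muac_cm = [22.0, 24.0, 26.0, 28.0, 30.0]
bmi = [21.0, 23.5, 25.0, 27.5, 29.0]

a, b = least_squares_line(muac_cm, bmi)
rss_best = residual_sum_of_squares(muac_cm, bmi, a, b)
# Any other line through the same data has a larger residual sum of squares.
rss_other = residual_sum_of_squares(muac_cm, bmi, a + 0.5, b)

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print(f"RSS of fitted line = {rss_best:.3f}, RSS of shifted line = {rss_other:.3f}")
```

Comparing the two residual sums of squares shows what “best fit” means here: moving the line away from the fitted intercept or slope can only increase the total squared vertical distance.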
Assumptions

Regression analysis makes several assumptions, which are quite akin to those for correlation analysis, as we discussed in a recent issue of the journal.[1] To recapitulate, first, the relationship between x and y should be linear. Second, all the observations in a sample must be independent of each other; thus, this method should not be used if the data include more than one observation on any individual. Third, the data must not include one or a few extreme values, since these may create a false sense of relationship in the data even when none exists. If these assumptions are not met, the results of linear regression analysis may be misleading.

Correlation Versus Regression

Correlation and regression analyses are similar in that both assess the linear relationship between two quantitative variables. However, they look at different aspects of this relationship. Simple linear regression (i.e., its coefficient or “b”) describes the nature of the association: it provides a means of predicting the value of the dependent variable from the value of the predictor variable. The coefficient indicates how much, and in which direction, the dependent variable changes on average for a unit increase in the predictor variable. By contrast, correlation (i.e., the correlation coefficient or “r”) provides a measure of the strength of the linear association, that is, of how closely the individual data points lie to the regression line.

The values of “b” and “r” always carry the same sign: either both are positive or both are negative. However, their magnitudes can differ widely. For the same value of “b,” the magnitude of “r” can vary from 1.0 to close to 0.

Additional Considerations

Some points must be kept in mind when interpreting the results of regression analysis. The absolute value of the regression coefficient (“b”) depends on the units used to measure the two variables.
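Two of the points above can be checked numerically: that the same slope “b” can go with very different values of “r”, and that “b” changes with the unit of measurement. The sketch below is a minimal illustration under stated assumptions: the helper name `slope_and_r` and all data values are hypothetical, and the scatter is deliberately constructed to be uncorrelated with x so that the slope stays the same while the points move away from the line.

```python
import math


def slope_and_r(xs, ys):
    """Return (b, r): least squares slope and Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Hypothetical values, invented for illustration only.
x_cm = [22.0, 24.0, 26.0, 28.0, 30.0]
y_tight = [x - 0.8 for x in x_cm]          # points lying exactly on a line
# Scatter that sums to zero and is uncorrelated with x: slope is unchanged.
scatter = [1.0, -2.0, 2.0, -2.0, 1.0]
y_loose = [y + 3.0 * e for y, e in zip(y_tight, scatter)]

b_tight, r_tight = slope_and_r(x_cm, y_tight)
b_loose, r_loose = slope_and_r(x_cm, y_loose)

# Same measurements of x expressed in other units.
x_inch = [x / 2.54 for x in x_cm]
x_mm = [x * 10.0 for x in x_cm]
b_inch, r_inch = slope_and_r(x_inch, y_loose)
b_mm, r_mm = slope_and_r(x_mm, y_loose)

print(f"tight data: b = {b_tight:.3f}, r = {r_tight:.3f}")
print(f"loose data: b = {b_loose:.3f}, r = {r_loose:.3f}")
print(f"x in inches: b = {b_inch:.3f}, r = {r_inch:.3f}")
print(f"x in mm:     b = {b_mm:.3f}, r = {r_mm:.3f}")
```

In this sketch the slope is identical for the tight and loose data sets while “r” drops well below 1, and rescaling x multiplies or divides “b” by the conversion factor while leaving “r” unchanged. The shared sign follows from the relation b = r × (standard deviation of y / standard deviation of x), since both standard deviations are positive.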
For instance, in a linear regression equation of BMI (dependent) versus MUAC (independent), the value of “b” will be 2.54-fold higher if the MUAC is expressed in inches instead of in centimeters (1 inch = 2.54 cm); alternatively, if the MUAC is expressed in millimeters, the regression coefficient will become one-tenth of the original value (1 mm = 1/10 cm). A change in the unit of “y” will also lead to a change in the value of the regression coefficient. This must be kept in mind when interpreting the absolute value of a regression coefficient. Similarly, the value of the “intercept” also depends on the unit used to measure the dependent variable.

Another important point to remember about the “intercept” is that its value may not be biologically or clinically interpretable. For instance, in the MUAC-BMI example above, the intercept was −0.042, a negative value for BMI, which is clearly implausible. This happens when, in real life, the value of the independent variable cannot be 0, as was the case in the MUAC-BMI example (a MUAC of 0 simply cannot occur in real life).

Furthermore, a regression equation should be used for prediction only for those values of the independent variable that lie within the range of its values in the data originally used to develop the regression equation.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.
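The caution under Additional Considerations about predicting only within the observed range of the independent variable can be sketched as a simple guard. The coefficients are those of the BMI-MUAC equation quoted earlier; the function name and, importantly, the observed MUAC range are assumptions made purely for illustration, since the range of the original data is not reported here.

```python
def predict_within_range(x_new, a, b, x_min, x_max):
    """Return Y = a + b*x_new, refusing to extrapolate beyond [x_min, x_max]."""
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"x = {x_new} lies outside the observed range [{x_min}, {x_max}]"
        )
    return a + b * x_new


# Intercept and slope from the equation BMI = -0.042 + 0.972 x MUAC (cm).
A, B = -0.042, 0.972
# ASSUMPTION: hypothetical observed MUAC range in cm; not taken from the study.
MUAC_MIN_CM, MUAC_MAX_CM = 20.0, 35.0

# Inside the assumed range, the equation is used as usual.
bmi_at_25 = predict_within_range(25.0, A, B, MUAC_MIN_CM, MUAC_MAX_CM)
print(f"Predicted BMI at MUAC 25 cm: {bmi_at_25:.3f}")

# MUAC = 0 is outside any plausible range, so no prediction is returned.
try:
    predict_within_range(0.0, A, B, MUAC_MIN_CM, MUAC_MAX_CM)
except ValueError as err:
    print("Refused:", err)
```

Raising an error rather than returning a number is one possible design choice; it makes the implausible intercept case (MUAC = 0) fail loudly instead of yielding a negative BMI.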


