7. Regression Analysis
Regression analysis measures the degree of relationship between the patterns of variation of two or more variables through the calculation of the coefficient of correlation, r. The value of r can vary between +1.0 (perfect positive correlation) and -1.0 (perfect negative correlation). When r = 0, there is no correlation, meaning that the variation of one variable cannot be used to explain any of the variation in the other. The coefficient of determination, r², is a measure of how well the variation of one variable explains the variation of the other, and corresponds to the proportion (often expressed as a percentage) of the variation explained by a best-fit regression line calculated for the data.
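As a minimal sketch (not part of the original analysis), r and r² can be computed for any two paired samples; the arrays below are made-up values used only for illustration.

import numpy as np

# Hypothetical paired measurements (illustrative values only)
x = np.array([1.0, 1.3, 1.5, 1.8, 2.0, 2.2])
y = np.array([3.5, 4.0, 4.4, 4.9, 5.1, 5.6])

r = np.corrcoef(x, y)[0, 1]   # Pearson coefficient of correlation
r_squared = r ** 2            # coefficient of determination
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")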
In simple linear regression, a single dependent variable, Y, is considered to be a function of an independent variable, X, and the relationship between the variables is described by a straight line. (Note: many biological relationships are known to be non-linear, and other models apply to them.) When a best-fit regression line is calculated, its linear equation (y = mx + b) defines how the variation in the X variable explains the variation in the Y variable. Regression analysis also involves measuring the amount of variation not accounted for by the regression equation; this is known as the residual variation. A statistical test called the F-test compares the variation explained by the regression line to the residual variation, and the p-value that results from the F-test is the probability of observing a relationship at least this strong if the true slope of the regression line were zero (i.e., if the null hypothesis were true).
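The sums-of-squares bookkeeping behind the F-test can be sketched as follows; this is an illustrative example with made-up data, not the analysis from the table below.

import numpy as np
from scipy import stats

# Hypothetical data for illustration only
x = np.array([1.0, 1.3, 1.5, 1.8, 2.0, 2.2, 2.5, 2.7])
y = np.array([3.4, 4.1, 4.3, 5.0, 5.0, 5.7, 5.9, 6.4])

# Fit the best-fit (least-squares) line y = m*x + b
m, b = np.polyfit(x, y, 1)
y_hat = m * x + b

ss_regression = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the line
ss_residual   = np.sum((y - y_hat) ** 2)          # residual (unexplained) variation

# F compares explained to residual variation, with 1 and n-2 degrees of freedom
n = len(x)
F = (ss_regression / 1) / (ss_residual / (n - 2))
p = stats.f.sf(F, 1, n - 2)
print(f"slope = {m:.3f}, intercept = {b:.3f}, F = {F:.2f}, p = {p:.4g}")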
As the value of r² increases, one can place more confidence in the predictive value of the regression line. When many data points are used to generate a regression, in particular, the regression may be statistically significant yet have a very low r², indicating that little of the variation in the dependent variable is explained by variation in the independent variable.
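To illustrate this point, the short simulation below (an added sketch, not part of the original example) generates a weak but real relationship buried in noise; with 1,000 points the regression is typically highly significant even though r² is small.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(0, 10, n)
y = 0.1 * x + rng.normal(0, 2.0, n)      # weak trend swamped by noise

result = stats.linregress(x, y)
print(f"r^2 = {result.rvalue**2:.3f}")   # small: little variation explained
print(f"p   = {result.pvalue:.2e}")      # yet typically far below 0.05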
In the example below, we used regression analysis to explore the relationship between the petal width and petal length of the flowers of Iris versicolor. The calculation of a regression is tedious and time-consuming. Statistics software and many spreadsheet packages will do a regression analysis for you. The output for one such analysis is shown below.
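The output described here came from a statistics package; as one possible way to reproduce a comparable analysis, the sketch below uses the standard 50-flower Iris versicolor sample as bundled with scikit-learn. The exact numbers may differ slightly from the table due to rounding.

from sklearn.datasets import load_iris
from scipy import stats

iris = load_iris()
versicolor = iris.target == 1                 # Iris versicolor samples
petal_length = iris.data[versicolor, 2]       # petal length (cm)
petal_width  = iris.data[versicolor, 3]       # petal width (cm)

# Petal Width as the independent (X) variable, Petal Length as the dependent (Y) variable
result = stats.linregress(petal_width, petal_length)
print(f"r^2       = {result.rvalue**2:.2f}")
print(f"slope     = {result.slope:.4f}")
print(f"intercept = {result.intercept:.4f}")
print(f"p-value   = {result.pvalue:.2e}")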
The table summarizes the analysis. We have set up the regression with Petal Width as the independent variable and Petal Length as the dependent variable; that is, we want to predict Petal Length from Petal Width. Our coefficient of determination, r², is 0.61. This value is reasonably large, indicating that knowing the width of a petal allows us to make a useful estimate of petal length.
The table confirms our hunch of a significant relationship between the two variables. The F-statistic in the table is 77.93, with a p-value < 0.0001. The p-value is the probability of obtaining a relationship at least this strong by chance if the true slope were zero, that is, if there were no correlation between the two variables. The low p-value indicates that the probability of seeing such a result with unrelated variables is vanishingly small.
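As a quick check (assuming the standard 50-flower sample, so 1 and 48 degrees of freedom), the reported F-statistic can be converted to a p-value, and for simple linear regression F itself can be recovered from the unrounded r²; the value 0.619 below is an assumed unrounded figure consistent with the reported 0.61.

from scipy import stats

n = 50                        # assumed sample size (standard iris versicolor sample)
df_regression, df_residual = 1, n - 2

# p-value corresponding to the reported F-statistic
F = 77.93
p = stats.f.sf(F, df_regression, df_residual)
print(f"p = {p:.2e}")         # well below 0.0001

# For simple linear regression: F = (r^2 / (1 - r^2)) * (n - 2)
r_squared = 0.619             # assumed unrounded value consistent with the reported 0.61
print((r_squared / (1 - r_squared)) * (n - 2))   # roughly 78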
We can also see the coefficients for our regression equation. Remember that the formula for a straight line is y = mx + b, where m is the slope and b is the y-intercept. From the table, we see that the y-intercept, b, is 1.7813 and the Petal Width coefficient (the slope, m) is 1.8693. Therefore, the equation for our line is:
Petal Length = (Petal Width * 1.8693) + 1.7813
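This equation can be used directly to predict petal length from a measured width; the function name and the 1.3 cm input below are hypothetical, chosen only to illustrate the calculation.

def predict_petal_length(petal_width_cm):
    # Regression equation from the table: slope 1.8693, intercept 1.7813
    return petal_width_cm * 1.8693 + 1.7813

print(predict_petal_length(1.3))   # about 4.21 cm for a hypothetical 1.3 cm wide petal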
Finally, the statistical software provides a plot of Petal Length versus Petal Width.
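A comparable scatter plot with the fitted line can also be sketched in Python; this assumes the scikit-learn copy of the iris measurements rather than the package used for the original figure.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
versicolor = iris.target == 1
petal_length = iris.data[versicolor, 2]
petal_width  = iris.data[versicolor, 3]

plt.scatter(petal_width, petal_length, label="Iris versicolor")
x_line = np.linspace(petal_width.min(), petal_width.max(), 100)
plt.plot(x_line, 1.8693 * x_line + 1.7813,
         label="Petal Length = 1.8693 * Petal Width + 1.7813")
plt.xlabel("Petal Width (cm)")
plt.ylabel("Petal Length (cm)")
plt.legend()
plt.show()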