You may want to review the page on the Experimental Method before proceeding. When one collects numerical data to attempt to falsify an hypothesis, the results may be difficult to interpret by eye. Here’s an example. We are testing the olfactory abilities of the meadow vole. The experimental design consists of pipes in the shape of a Y. At one of the upper ends of the Y, we place a small quantity of grass seeds. Nothing is at the other end of the Y. We introduce the vole at the base of the Y. The vole cannot see the seeds but should be able to smell them. We record which arm of the maze a vole chooses. We do this choice experiment on 20 voles.
The results are that 11 voles chose the arm of the Y with the food and 9 chose the arm without food. What can we conclude? It is true that more voles chose the arm with the food but are you comfortable claiming an effect when 9 of the 20 voles did not respond positively to the food?
It’s for cases like this that statistics comes to the rescue. Statistics provides a measure of how likely it is that an observed effect is real rather than the result of chance. It’s up to the investigator to decide what level of confidence to require for an experiment. We’ll deal with that below.
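As a rough illustration of the kind of number statistics can supply for the vole experiment, the sketch below asks how often 11 or more of 20 voles would pick the food arm if each vole were simply choosing an arm at random. The 50:50 chance model and the helper name `prob_at_least` are assumptions made for this example; the text has not yet said which test it will use.

```python
from math import comb

def prob_at_least(successes: int, trials: int, p: float = 0.5) -> float:
    """Probability of `successes` or more out of `trials` independent
    choices, when each choice succeeds with probability p."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Assumption: a vole with no sense of smell picks either arm with p = 0.5.
p_value = prob_at_least(11, 20)
print(f"P(11 or more of 20 by chance) = {p_value:.3f}")  # 0.412
```

Under that assumption, a result at least this lopsided turns up about 41% of the time by chance alone, which is why 11 versus 9 is hard to interpret by eye.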
Now is a good time to distinguish between statistics and probability. In a way, these two fields of mathematics are mirror images of each other. One uses statistics to make inferences about an entire population (e.g., all humans) based on a sample of the population. For example, we can use statistics to decide whether human women are shorter than human men, on average. We obviously can’t measure the height of every human, but we can take a sample of, say, 1000 humans and use statistics to answer our question.
In probability, one uses knowledge of the entire population to make predictions about a sample of that population. A simple example is a fishbowl full of marbles: 1000 blue ones, 500 red ones and 100 yellow ones. You can use probability to figure the odds of, for example, randomly choosing five blue marbles.
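The marble example can be worked out directly. The sketch below assumes that "choosing five blue marbles" means drawing five marbles without putting any back and having all five come up blue; the text does not spell this out, so treat it as one reasonable reading.

```python
from math import comb

# Contents of the fishbowl, as given in the text.
blue, red, yellow = 1000, 500, 100
total = blue + red + yellow

# Assumption: five marbles are drawn without replacement.
draws = 5

# Ways to pick 5 blue marbles divided by ways to pick any 5 marbles.
p_all_blue = comb(blue, draws) / comb(total, draws)
print(f"P(five blue in five draws) = {p_all_blue:.4f}")
```

The answer comes out a little under (1000/1600) raised to the fifth power, because each blue marble removed makes the next blue draw slightly less likely.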
Means and Variation
Let’s consider a couple of datasets on the number of blueberries on a lowbush blueberry. One site has a northern exposure and one has a southern exposure. Here are the raw data:
|Southern exposure||Northern exposure|
To begin our analysis, we first determine the mean, or average, of our samples from each habitat. We sum the number of blueberries in each column and divide by 10. The resulting mean is 130.0 blueberries/bush for both habitats.
But, that is not the end of our analysis. Do you notice any difference in the two columns of data? It appears that the number of berries per bush is more variable for the northern exposure. Even though the means for the two habitats are not different, the variation suggests an interesting biological difference.
But how can we quantify this variation? What if we calculated the deviation from the mean for each measurement and then summed those deviations up? It seems like a good idea at first, but when you sum the deviations, they add to zero. In fact, that is a defining property of the mean: it is the value at which the deviations above it exactly balance the deviations below it.
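A quick check shows why summing raw deviations fails. The counts below are hypothetical values invented for this illustration (they are not the blueberry data from the table), chosen so that their mean is also 130.

```python
# Hypothetical berry counts, made up for this illustration only.
counts = [120, 135, 128, 142, 125]

mean = sum(counts) / len(counts)
deviations = [c - mean for c in counts]

print(mean)             # 130.0
print(deviations)       # [-10.0, 5.0, -2.0, 12.0, -5.0]
print(sum(deviations))  # 0.0
```

The negative and positive deviations cancel, so their sum says nothing about how spread out the counts are.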
What if we squared the deviation of each measurement from the mean? That procedure eliminates all negative numbers (deviations of measurements less than the mean). Here is the calculation for the bushes from the southern exposure:
variation = (5)² + (11)² + (23)² + (32)² + (6)² + (11)² + (25)² + (1)² + (8)² + (12)² = 2690
Note that squaring the deviations causes values appreciably above or below the mean to contribute heavily to the total. For instance, the value of 162, which is 32 above the mean, contributes 1024 units to the total of 2690.
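The sum of squared deviations for the southern exposure can be reproduced from the ten deviations listed in the calculation above. This is only a sketch; the variable names are chosen for the example.

```python
# Absolute deviations from the mean of 130, as listed in the text.
deviations = [5, 11, 23, 32, 6, 11, 25, 1, 8, 12]

squared = [d**2 for d in deviations]
sum_of_squares = sum(squared)

# Share of the total contributed by the single largest deviation (32).
largest_share = max(squared) / sum_of_squares

print(sum_of_squares)          # 2690
print(max(squared))            # 1024
print(f"{largest_share:.0%}")  # 38%
```

One bush out of ten accounts for more than a third of the total, which is the effect of squaring described above.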
We can calculate the average squared deviation by dividing by the sample size. In this case, that number is 2690/10 = 269. This number is one of the most important statistics. It is called the variance, and it is often depicted symbolically as σ². Because the variance is in squared units and is often a large number, statisticians usually take its square root to yield another statistic called the standard deviation.
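The arithmetic for the southern exposure can be sketched as follows. The text divides the sum of squares by the sample size n, whereas many textbooks divide by n-1 when the data are a sample. The sketch shows both conventions side by side and does not claim which one the rest of this page uses.

```python
from math import sqrt

sum_of_squares = 2690  # sum of squared deviations from the text
n = 10                 # number of bushes measured

# Variance as described in the text: divide by the sample size.
variance = sum_of_squares / n
print(variance)  # 269.0

# Standard deviation as the square root of that variance.
sd_divide_by_n = sqrt(variance)

# Common alternative for samples: divide by n - 1 before taking the root.
sd_divide_by_n_minus_1 = sqrt(sum_of_squares / (n - 1))

print(f"{sd_divide_by_n:.1f}")          # 16.4
print(f"{sd_divide_by_n_minus_1:.1f}")  # 17.3
```

Either way the result is back in the original units (blueberries per bush) and is much easier to compare with the mean of 130 than the variance is.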
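For a sample of n measurements, the standard deviation is usually computed as √(sum of squared deviations/(n-1)); dividing by n-1 rather than n corrects for the tendency of a sample to underestimate the variation in the whole population. For our example, the standard deviation (abbreviated as s.d. or σ)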