By Arnold M. Saxton, Animal Science
Finding the Balance
Choosing the size of an animal experiment is a delicate balance. Use too many animals and you waste resources and needlessly expose animals to potential harm. Use too few and the experimental results will not be clear-cut, again wasting animal resources unless the experiment can be enlarged by collecting more data.
Size of an experiment involves five quantities:
- V — the variability among observations within a treatment.
- D — the magnitude of treatment differences.
- a — the chance of incorrectly detecting a treatment difference when none exists (alpha, the Type I error rate).
- b — the chance of failing to detect a treatment difference that truly exists (beta, the Type II error rate). Power = 1-b is the chance of correctly detecting a treatment difference.
- N — the number of observations.
Typically scientists will choose N to give an 80% chance (power) of detecting the difference D, if it truly exists, with no more than a 5% chance of falsely declaring a difference (a), assuming variability V. Note that this statement uses all five quantities.
Experiment size will be smaller if:
- V is decreased. This is the reason for good experimental technique: reducing measurement error, working with well-defined populations of uniform animals, and controlling known sources of variability by statistical design techniques such as blocking and covariates. See papers by M.F.W. Festing, ILAR J. (2002) 43:244-258 and Vet. Anaesth. Analg. (2003) 30:59-61.
- D is larger. This generally is not under researcher control, but as treatment differences get smaller, experiments must be larger.
- a is made larger. However, P<.05 is generally the largest error rate that is scientifically accepted.
- b is made larger, which means power is reduced. Again, statistical power of 80% is a commonly quoted minimum. Would you pay for an experiment that only had a 50% chance of detecting an important treatment difference? Should we choose scientific knowledge based on flipping coins?
These considerations show that controlling variance is the best option for reducing sample size. Festing also mentions experimental approaches that use fewer animals than the traditional comparison of groups of animals on different treatments.
These five quantities are connected by complex formulas that change depending on the type of data (continuous, binary, etc.) and the type of question (a difference in means is just one of many). A very rough approximation of the number of animals needed per treatment can be obtained from
N = 25*V/(D*D).
Suppose a researcher wants to detect a difference in mouse body weight, and anticipates the control group will weigh 40g, the treatment group will weigh 50g (D=50-40), and the CV will be 20%. CV, the coefficient of variation, is std. deviation (s) divided by the mean, so
s = (20%)*40g = 8g.
V is simply s squared, giving N = 25*8*8/(10*10) = 16 animals per treatment.
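The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the rough approximation; the function name and argument names are illustrative, and the CV is applied to the control mean as in the example.

```python
def rough_sample_size(mean_control, mean_treatment, cv):
    """Rough animals-per-treatment estimate from N = 25*V/(D*D).

    cv is the coefficient of variation as a fraction (0.20 for 20%).
    """
    s = cv * mean_control                  # std. deviation from CV and control mean
    v = s * s                              # V is s squared
    d = mean_treatment - mean_control      # D, the difference to detect
    return 25 * v / (d * d)


# Mouse body weight example: 40 g control, 50 g treatment, CV of 20%
n_per_treatment = rough_sample_size(40, 50, 0.20)
print(n_per_treatment)
```

A fractional result should be rounded up to the next whole animal.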
It is recommended that at a minimum, researchers give anticipated means and a measure of variability for Question E5 in the IACUC protocol form. This will allow reviewers to more objectively assess the proposed size of experiments, using the above approximation.
Sample Size Calculators
For more accuracy, however, a sample size computer program is recommended. Web-based versions are convenient; the calculator at http://www.stat.uiowa.edu/~rlenth/Power/index.html is used here as an illustration.
- Go to the above URL, and select the Two-sample t test for comparing two means.
- Sigma1 and sigma2 are in the upper left. These are std. deviations for the two treatments, set equal by default. Set them to 8.
- The a value is on the right (.05 by default), and below that the true difference can be set. Set D to 10.
- Change the sample sizes at lower left, and see how power changes. The approximation gave a sample size of 16, which has a power of 93%. The experiment could be reduced to 12 animals per treatment and have at least 80% power.
- You can also change power values, and see how sample sizes change.
A web search for “power sample size” will provide many other calculators.
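When no calculator is at hand, the effect of changing the sample size can be roughed out with the Python standard library. The sketch below uses a normal approximation to the two-sample t test, which is an assumption of this example and not the method used by the web calculator; it tends to read slightly higher than t-based power, especially for small groups. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n, sd, diff, alpha=0.05):
    """Approximate power of a two-sided comparison of two means with
    n animals per treatment (normal approximation, not the exact t test)."""
    z = NormalDist()
    se = sd * sqrt(2.0 / n)               # std. error of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)     # two-sided critical value
    return z.cdf(abs(diff) / se - z_crit)


# Mouse example: std. deviation 8 g, difference of 10 g
for n in (8, 12, 16, 20):
    print(n, round(approx_power(n, sd=8, diff=10), 2))
```

Treat these figures as a quick check only, and confirm the final sample size with a proper calculator.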
In a complex experiment, with many sub-experiments and treatments, does sample size need to be calculated for every combination? In theory, the calculations above only need to be done once, for the worst-case scenario where variability is highest and the treatment difference is smallest. But this would produce excessive use of animals in some treatments, so a design that allows unequal sample sizes for different treatments might be considered. Sample size calculations would then have to be repeated for each unequal sample size allowed.
If all sub-experiments are connected through the use of common animals or tissues, then only the “weakest link” needs to be considered. If stage 3 of the experiment needs tissue from 10 animals, then stages 1 and 2, which lead to stage 3, will also need 10 animals, even if sample size calculations suggest 4 animals are sufficient in stages 1 and 2. Again, identify the situation that has the smallest treatment difference and the largest variance, and that will dictate the sample size of the experiment.
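The weakest-link idea can be sketched as taking the largest requirement across linked stages. The stage names, standard deviations, and differences below are hypothetical values chosen only for illustration, and the rough approximation N = 25*V/(D*D) stands in for a proper calculation.

```python
import math


def rough_n(sd, diff):
    """Rough animals per treatment from N = 25*V/(D*D), rounded up."""
    return math.ceil(25 * sd * sd / (diff * diff))


# Hypothetical linked stages: (anticipated std. deviation, difference to detect)
stages = {
    "stage 1": (4.0, 10.0),
    "stage 2": (5.0, 10.0),
    "stage 3": (8.0, 10.0),
}

per_stage = {name: rough_n(sd, diff) for name, (sd, diff) in stages.items()}
required = max(per_stage.values())   # the weakest link dictates the experiment size

print(per_stage)
print("animals per treatment:", required)
```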
As a final example, suppose researchers intend to use Fisher’s Exact Test to compare two percentages. They want to be 90% sure of detecting a true difference between percentages of 70% and 80%, at the 5% significance level.
Go to http://calculators.stat.ucla.edu/powercalc/ and choose Fisher’s Exact Test and “Sample size for a given power.” Fill in the form with 0.70, 0.80, 2 sided test, 0.05 significance level, and 0.90 power. The sample size required is about 400 animals per treatment group. Percentage data generally require large experiments.
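As a rough cross-check on the calculator, the commonly used normal-approximation formula for comparing two proportions can be coded directly. This is an assumption of the sketch, not Fisher's Exact Test itself; the exact test is more conservative, so an exact calculator will usually ask for somewhat more animals than this formula does. The function name is illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def approx_n_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Approximate animals per group to compare two proportions
    (two-sided test, normal approximation, no continuity correction)."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)        # critical value for significance level
    z_b = z.inv_cdf(power)                # value corresponding to desired power
    p_bar = (p1 + p2) / 2                 # average proportion under no difference
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil((numerator / (p1 - p2)) ** 2)


n_per_group = approx_n_two_proportions(0.70, 0.80)
print(n_per_group)
```

The result lands a little under 400 per group, in line with the calculator's answer above.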