We estimated the statistical power of the first and last statistical test presented in 697 papers from 10 behavioral journals. First tests had significantly greater statistical power and reported more significant results (smaller p values) than did last tests. This trend was consistent across journals, taxa, and the type of statistical test used. On average, statistical power was 13–16% to detect a small effect and 40–47% to detect a medium effect. This is far lower than the general recommendation of a power of 80%. By this criterion, only 2–3%, 13–21%, and 37–50% of the tests examined had the requisite power to detect a small, medium, or large effect, respectively. Neither p values nor statistical power varied significantly across the 10 journals or 11 taxa. However, mean p values of first and last tests were significantly correlated across journals (r = .67, n = 10, p = .034), with a similar trend for mean power (r = .63, n = 10, p = .051). There is therefore some evidence that power and p values are repeatable among journals. Mean p values or power of first and last tests were, however, uncorrelated across taxa. Finally, there was a significant correlation between power and reported p value for both first (r = .13, n = 684, p = .001) and last tests (r = .16, n = 654, p < .0001). If true effect sizes are unrelated to study sample sizes, the average true effect size must be nonzero for this pattern to emerge. This suggests that failure to observe significant relationships is partly owing to small sample sizes, as power increases with sample size.
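The abstract's central quantity, statistical power, and its dependence on sample size can be illustrated with a minimal sketch. This is not the paper's method: it assumes a two-sided one-sample z-test (normal approximation) and uses Cohen's conventional standardized effect sizes (d = 0.2 small, 0.5 medium, 0.8 large); the function name `ztest_power` is chosen here for illustration.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ztest_power(d, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test for
    standardized effect size d and sample size n (normal approximation)."""
    z_crit = 1.959964  # critical value for alpha = 0.05, two-sided
    shift = d * math.sqrt(n)  # noncentrality: effect grows with sqrt(n)
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)

# Power rises with sample size for every effect size, and is far below
# the recommended 80% for a small effect at modest n.
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    for n in (20, 50, 200):
        print(f"{label} effect (d={d}), n={n}: power = {ztest_power(d, n):.2f}")
```

For a small effect (d = 0.2) at n = 20, this approximation gives power of roughly 0.15, in the same range as the 13–16% reported in the abstract, which is why underpowered studies at small effect sizes fail to reach significance so often.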
Link to resource: https://doi.org/10.1093/beheco/14.3.438
Type of resource(s): Primary Source, Reading, Paper
Education level(s): College / Upper Division (Undergraduates)
Primary user(s): Student
Subject area(s): Life Science, Math & Statistics