Statistical power explained
Statistical power, or sensitivity, is the likelihood of a significance test detecting an effect when there actually is one.
A true effect is a real, nonzero relationship between variables in a population. An effect is usually indicated by a real difference between groups or a correlation between variables.
High power in a study indicates a large chance of a test detecting a true effect. Low power means that your test only has a small chance of detecting a true effect or that the results are likely to be distorted by random and systematic error.
Power is mainly influenced by sample size, effect size, and significance level. A power analysis can be used to determine the necessary sample size for a study.
Why does power matter in statistics?
Having enough statistical power is necessary to draw accurate conclusions about a population using sample data.
In hypothesis testing, you start with a null hypothesis of no effect and an alternative hypothesis of a true effect (your actual research prediction).
The goal is to collect enough data from a sample to statistically test whether you can reasonably reject the null hypothesis in favor of the alternative hypothesis.
There’s always a risk of making one of two decision errors when interpreting study results:
 Type I error: rejecting the null hypothesis of no effect when it is actually true.
 Type II error: not rejecting the null hypothesis of no effect when it is actually false.
Power is the probability of avoiding a Type II error. The higher the statistical power of a test, the lower the risk of making a Type II error.
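In symbols, these definitions can be sketched as follows. The labels α (Type I error rate) and β (Type II error rate) are the conventional ones and are an assumption here, since the text above does not name them.

```latex
\begin{aligned}
\alpha &= P(\text{reject } H_0 \mid H_0 \text{ is true}) \\
\beta &= P(\text{do not reject } H_0 \mid H_0 \text{ is false}) \\
\text{power} &= 1 - \beta
\end{aligned}
```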
Power is usually set at 80%. This means that if there are true effects to be found in 100 different studies, each with 80% power, you can expect about 80 of the 100 statistical tests to detect them.
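A small simulation can make this concrete. The sketch below is only an illustration under stated assumptions: two groups with normally distributed outcomes, a known standard deviation of 1, a true standardized difference of 0.5, and a two-sided z test. The function name `simulate_power` and all parameter values are hypothetical choices, not part of any particular software package.

```python
import random
from statistics import NormalDist, fmean


def simulate_power(effect_size, n_per_group, alpha=0.05, n_studies=2000, seed=1):
    """Estimate power as the share of simulated studies that reject the null.

    Assumes normal outcomes with a known standard deviation of 1,
    so each simulated study is analysed with a two-sided z test.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_error = (2 / n_per_group) ** 0.5  # SE of the difference in means
    detected = 0
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / std_error
        if abs(z) > z_crit:
            detected += 1
    return detected / n_studies


estimated_power = simulate_power(effect_size=0.5, n_per_group=64)
print(f"Share of simulated studies that detected the effect: {estimated_power:.2f}")
```

With these settings the share of studies that detect the effect should land close to 80%, and the remaining studies are Type II errors even though the effect is real in every one of them.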
If you don’t ensure sufficient power, your study may not be able to detect a true effect at all. This means that resources like time and money are wasted, and it may even be unethical to collect data from participants (especially in clinical trials).
On the flip side, too much power means your tests are highly sensitive to true effects, including very small ones. This may lead to finding statistically significant results with very little usefulness in the real world.
To balance these pros and cons of low versus high statistical power, you should use a power analysis to set an appropriate level.
What is a power analysis?
A power analysis is a calculation that helps you determine a minimum sample size for your study.
A power analysis is made up of four main components. If you know or have estimates for any three of these, you can calculate the fourth component.
 Statistical power: the likelihood that a test will detect an effect of a certain size if there is one, usually set at 80% or higher.
 Sample size: the minimum number of observations needed to observe an effect of a certain size with a given power level.
 Significance level (alpha): the maximum risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
 Expected effect size: a standardized way of expressing the magnitude of the expected result of your study, usually based on similar studies or a pilot study.
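The way any three components determine the fourth can be sketched with a normal approximation. This is a simplified illustration, assuming a comparison of two equally sized groups, a two-sided test, and a standardized mean difference as the effect size. The function names are hypothetical, and dedicated software usually relies on the t distribution instead, so its results can differ slightly.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_group(effect_size, n_per_group, alpha=0.05):
    """Approximate power given effect size, sample size, and significance level."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region when the effect is real
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def detectable_effect_size(n_per_group, alpha=0.05, power=0.80):
    """Approximate smallest effect size detectable with the requested power.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_crit + z_power) * (2 / n_per_group) ** 0.5


print(f"Power with d = 0.5, n = 50 per group: {power_two_group(0.5, 50):.2f}")
print(f"Detectable d with n = 50 per group:   {detectable_effect_size(50):.2f}")
```

Feeding the output of `detectable_effect_size` back into `power_two_group` returns roughly the 80% that was requested, which shows that the two functions solve the same relationship for different unknowns.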
Before starting a study, you can use a power analysis to calculate the minimum sample size for a desired power level and significance level and an expected effect size.
Traditionally, the significance level is set to 5% and the desired power level to 80%. That means you only need to figure out an expected effect size to calculate a sample size from a power analysis.
To calculate sample size or perform a power analysis, use online tools or statistical software like G*Power.
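As a rough cross-check on such tools, the sample size calculation can be sketched with a normal approximation. The example assumes two equally sized groups, a two-sided test, and a standardized mean difference as the expected effect size. The function name `sample_size_per_group` and the three effect sizes are illustrative choices only.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per group (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {sample_size_per_group(d)} participants per group")
```

Software that uses the t distribution will usually report a slightly larger number than this approximation, so treat the output as a lower bound rather than a final answer.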
Sample size
Sample size is positively related to power. A small sample (fewer than 30 units) may have only low power, while a large sample has higher power.
Increasing the sample size enhances power, but only up to a point. When you have a large enough sample, every observation that’s added to the sample only marginally increases power. This means that collecting more data will increase the time, costs and efforts of your study without yielding much more benefit.
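The diminishing returns can be seen by computing approximate power at several sample sizes. This sketch assumes a two-group, two-sided comparison with a standardized effect size of 0.5 and uses a normal approximation. The function name and the sample sizes are hypothetical.

```python
from statistics import NormalDist


def power_two_group(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


sample_sizes = [20, 40, 80, 160, 320]
powers = [power_two_group(0.5, n) for n in sample_sizes]
for n, p in zip(sample_sizes, powers):
    print(f"n per group = {n:>3}: power = {p:.3f}")
```

Under these assumptions, doubling the sample from 40 to 80 per group adds a great deal of power, while doubling it from 160 to 320 adds almost none.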
Your research design is also related to power and sample size:
 In a within-subjects design, each participant is tested in all treatments of a study, so individual differences will not unevenly affect the outcomes of different treatments.
 In a between-subjects design, each participant only takes part in a single treatment, so with different participants in each treatment, there is a chance that individual differences can affect the results.
A within-subjects design is more powerful, so fewer participants are needed. More participants are needed in a between-subjects design to establish relationships between variables.
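The size of this advantage can be sketched numerically. The example assumes two treatments, equal variances, a two-sided test, a normal approximation, and a known correlation between each participant's two measurements. That correlation, the function names, and the input values are all assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def z_sum(alpha=0.05, power=0.80):
    """Sum of the normal quantiles for the significance level and the power."""
    return NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)


def total_n_between(effect_size, alpha=0.05, power=0.80):
    """Total participants for two independent groups of equal size."""
    per_group = ceil(2 * (z_sum(alpha, power) / effect_size) ** 2)
    return 2 * per_group


def total_n_within(effect_size, correlation, alpha=0.05, power=0.80):
    """Total participants when everyone completes both treatments."""
    # Standardized effect for difference scores: d_z = d / sqrt(2 * (1 - r))
    d_z = effect_size / (2 * (1 - correlation)) ** 0.5
    return ceil((z_sum(alpha, power) / d_z) ** 2)


print("Between-subjects total:", total_n_between(0.5))
print("Within-subjects total: ", total_n_within(0.5, correlation=0.5))
```

The higher the correlation between a participant's repeated measurements, the larger the saving. If the measurements were uncorrelated, the within-subjects design would still need fewer people, because each person contributes two observations.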
Significance level
The significance level of a study is the Type I error probability, and it’s usually set at 5%. This means your findings have to have a less than 5% chance of occurring under the null hypothesis to be considered statistically significant.
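The comparison against the significance level can be sketched for a z statistic. This is a minimal example that assumes a two-sided test whose statistic follows a standard normal distribution under the null hypothesis. The function name and the two z values are made up for illustration.

```python
from statistics import NormalDist


def two_sided_p_value(z_statistic):
    """Probability of a result at least this extreme if the null is true."""
    return 2 * (1 - NormalDist().cdf(abs(z_statistic)))


alpha = 0.05
for z in (1.50, 2.10):
    p = two_sided_p_value(z)
    verdict = "significant" if p < alpha else "not significant"
    print(f"z = {z:.2f}: p = {p:.3f} ({verdict} at alpha = {alpha})")
```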
Significance level is positively related to power: increasing the significance level (e.g., from 5% to 10%) increases power. When you decrease the significance level, your significance test becomes more conservative and less sensitive to true effects.
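To see this relationship in numbers, the sketch below computes approximate power at three significance levels while holding everything else fixed. It assumes a two-group, two-sided comparison with a standardized effect size of 0.5 and 50 participants per group, and the function name is hypothetical.

```python
from statistics import NormalDist


def power_two_group(effect_size, n_per_group, alpha):
    """Approximate power of a two-sided, two-group comparison of means."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


alphas = (0.01, 0.05, 0.10)
powers_by_alpha = [power_two_group(0.5, 50, a) for a in alphas]
for a, p in zip(alphas, powers_by_alpha):
    print(f"alpha = {a:.2f}: power = {p:.2f}")
```

The gain in power comes directly from accepting a larger Type I error rate, which is the tradeoff between the two kinds of error.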
Researchers have to balance the risks of committing Type I and II errors by considering the amount of risk they’re willing to take in making a false positive versus a false negative conclusion.
Effect size
Effect size is the magnitude of a difference between groups or a relationship between variables. It indicates the practical significance of a finding.
While high-powered studies can help you detect medium and large effects, low-powered studies may only catch large ones.
There’s always some sampling error involved when using data from samples to make inferences about populations. This means there’s always a discrepancy between the observed effect size and the true effect size. Effect sizes in a study can vary due to random factors, measurement error, or natural variation in the sample.
Low-powered studies will mostly detect true effects only when the observed effects happen to be large. That means that, in a low-powered study, any detected effect is more likely to have been inflated by unrelated factors.
If low-powered studies are the norm in a particular field, such as neuroscience, the observed effect sizes will consistently exaggerate or overestimate true effects.
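This exaggeration can be illustrated with a simulation. The sketch takes a shortcut by drawing each study's observed effect directly from its sampling distribution, which assumes two equal groups and a known standard deviation of 1. The true effect of 0.3, the group sizes, and the function name are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def mean_significant_effect(true_effect, n_per_group, alpha=0.05,
                            n_studies=20000, seed=7):
    """Return (average observed effect among significant studies, share significant)."""
    rng = random.Random(seed)
    std_error = (2 / n_per_group) ** 0.5
    # Smallest observed difference that reaches significance (two-sided z test)
    threshold = NormalDist().inv_cdf(1 - alpha / 2) * std_error
    observed = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    significant = [d for d in observed if abs(d) > threshold]
    return fmean(significant), len(significant) / n_studies


low_mean, low_share = mean_significant_effect(true_effect=0.3, n_per_group=20)
high_mean, high_share = mean_significant_effect(true_effect=0.3, n_per_group=200)
print(f"Low power  ({low_share:.0%} significant): mean significant effect = {low_mean:.2f}")
print(f"High power ({high_share:.0%} significant): mean significant effect = {high_mean:.2f}")
```

In the low-powered setting, only studies that happened to observe an unusually large difference cross the significance threshold, so the significant results overstate the true effect of 0.3 by a wide margin. In the high-powered setting the overstatement is much smaller.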
Other factors that affect power
Aside from sample size, significance level, and effect size, other factors need to be taken into account when determining power.
Variability
The variability of the population characteristics affects the power of your test. High population variance reduces power.
In other words, using a population that takes on a large range of values for a variable will lower the sensitivity of your test, while using a population where the variable is relatively narrowly distributed will heighten the sensitivity of the test.
Using a fairly specific population with defined demographic characteristics can lower the spread of the variable of interest and improve power.
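The effect of variability can be sketched by holding the raw difference between group means fixed and changing only the population standard deviation. The example assumes a two-group, two-sided comparison and a normal approximation. The raw difference of 5 units, the standard deviations, and the function name are hypothetical.

```python
from statistics import NormalDist


def power_for_raw_difference(mean_difference, sd, n_per_group, alpha=0.05):
    """Approximate power when the effect is given in raw units."""
    effect_size = mean_difference / sd  # standardized effect shrinks as sd grows
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


spreads = (5, 10, 20)
powers_by_sd = [power_for_raw_difference(5, sd, n_per_group=50) for sd in spreads]
for sd, p in zip(spreads, powers_by_sd):
    print(f"population sd = {sd:>2}: power = {p:.2f}")
```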
Measurement error
Measurement error is the difference between the true value and the observed or recorded value of something. Measurements can only be as precise as the instruments and researchers that measure them, so some error is almost always present.
The higher the measurement error in a study, the lower the statistical power of a test. Measurement error can be random or systematic:
 Random errors are unpredictable and unevenly alter measurements due to chance factors (e.g., mood changes can influence survey responses, or having a bad day may lead to researchers misrecording observations).
 Systematic errors affect data in predictable ways from one measurement to the next (e.g., an incorrectly calibrated device will consistently record inaccurate data, or problematic survey questions may lead to biased responses).
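For random error, the loss of power can be sketched under the assumptions of classical test theory, where reliability is the share of observed variance that reflects true scores. Under that assumption, random error shrinks the observed standardized effect by the square root of the reliability. The function names and input values are hypothetical, and the sketch does not model systematic error.

```python
from statistics import NormalDist


def attenuated_effect_size(true_effect_size, reliability):
    """Observed standardized effect after random error inflates the variance."""
    return true_effect_size * reliability ** 0.5


def power_two_group(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


reliabilities = (1.0, 0.8, 0.6)
powers_by_reliability = [
    power_two_group(attenuated_effect_size(0.5, r), n_per_group=64)
    for r in reliabilities
]
for r, p in zip(reliabilities, powers_by_reliability):
    print(f"reliability = {r:.1f}: power = {p:.2f}")
```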
How do you increase power?
Since many research aspects directly or indirectly influence power, there are various ways to improve power. While some of these can usually be implemented, others are costly or involve a tradeoff with other important considerations.
Increase the effect size. To increase the expected effect in an experiment, you could manipulate your independent variable more widely (e.g., spending 1 hour instead of 10 minutes in nature) to increase the effect on the dependent variable (stress level). This may not always be possible because there are limits to how much the outcomes in an experiment may vary.
Increase sample size. Based on sample size calculations, you may have room to increase your sample size while still meaningfully improving power. But there is a point at which increasing your sample size may not yield high enough benefits.
Increase the significance level. While this makes a test more sensitive to detecting true effects, it also increases the risk of making a Type I error.
Reduce measurement error. Increasing the precision and accuracy of your measurement devices and procedures reduces variability, improving reliability and power. Using multiple measures or methods, known as triangulation, can also help reduce systematic bias.
Use a one-tailed test instead of a two-tailed test. When using a t test or z test, a one-tailed test has higher power. However, a one-tailed test should only be used when there’s a strong reason to expect an effect in a specific direction (e.g., one mean score will be higher than the other), because it won’t be able to detect an effect in the other direction. In contrast, a two-tailed test is able to detect an effect in either direction.
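The difference between the two kinds of test can be sketched with a normal approximation for a two-group comparison. The function names and input values are hypothetical, and the one-tailed function assumes the predicted direction is a positive effect.

```python
from statistics import NormalDist


def power_one_tailed(effect_size, n_per_group, alpha=0.05):
    """Approximate power when only a positive effect counts as a detection."""
    shift = effect_size * (n_per_group / 2) ** 0.5
    return NormalDist().cdf(shift - NormalDist().inv_cdf(1 - alpha))


def power_two_tailed(effect_size, n_per_group, alpha=0.05):
    """Approximate power when an effect in either direction can be detected."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


for d in (0.5, -0.5):
    print(f"d = {d:+.1f}: one-tailed = {power_one_tailed(d, 50):.3f}, "
          f"two-tailed = {power_two_tailed(d, 50):.3f}")
```

When the effect lies in the predicted direction, the one-tailed test has the higher power. When the effect runs the other way, the one-tailed test has almost no chance of detecting it, while the two-tailed test performs the same in both directions.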
Frequently asked questions about statistical power
 What is statistical power?

In statistics, power refers to the likelihood of a hypothesis test detecting a true effect if there is one. A statistically powerful test is less likely to produce a false negative (a Type II error).
If you don’t ensure enough power in your study, you may not be able to detect a statistically significant result even when it has practical significance. Your study might not have the ability to answer your research question.
 What is statistical significance?

Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test. Significance is usually denoted by a p-value, or probability value.
Statistical significance is arbitrary: it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that data at least as extreme as those observed would occur less than 5% of the time under the null hypothesis.
When the p-value falls below the chosen alpha value, we say the result of the test is statistically significant.
 What is a power analysis?

A power analysis is a calculation that helps you determine a minimum sample size for your study. It’s made up of four main components. If you know or have estimates for any three of these, you can calculate the fourth component.
 Statistical power: the likelihood that a test will detect an effect of a certain size if there is one, usually set at 80% or higher.
 Sample size: the minimum number of observations needed to observe an effect of a certain size with a given power level.
 Significance level (alpha): the maximum risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
 Expected effect size: a standardized way of expressing the magnitude of the expected result of your study, usually based on similar studies or a pilot study.
 How do you increase statistical power?

There are various ways to improve power:
 Increase the potential effect size by manipulating your independent variable more strongly,
 Increase sample size,
 Increase the significance level (alpha),
 Reduce measurement error by increasing the precision and accuracy of your measurement devices and procedures,
 Use a one-tailed test instead of a two-tailed test for t tests and z tests.