Statistical power and sample size - Principles (2023)

What is statistical power?

Power is traditionally defined as the probability of rejecting the null hypothesis when the null hypothesis is false and the alternative hypothesis is true.

Suppose we have two populations whose parametric means are different.

  1. We take samples from the two populations and obtain sample means and variances.
  2. We perform a statistical test to determine if the means are significantly different.
  3. We repeat sampling and testing many times.

Basically, the power of the test is the proportion of those tests that correctly indicate that the means of the two populations are significantly different.
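The repeated sampling and testing described above can be sketched as a small simulation. This is only an illustration under stated assumptions: both populations are taken to be normal with a common, known standard deviation, so a two-sample z test applies. The function name `simulated_power` and all the numbers are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

def simulated_power(mu_a, mu_b, sigma, n, alpha=0.05, n_trials=2000, seed=1):
    """Proportion of repeated two-sample z tests (two-tailed) that reject H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    se = sigma * math.sqrt(2 / n)                 # standard error of the difference in means
    rejections = 0
    for _ in range(n_trials):
        # Step 1: take a sample from each population
        a = [rng.gauss(mu_a, sigma) for _ in range(n)]
        b = [rng.gauss(mu_b, sigma) for _ in range(n)]
        # Step 2: test whether the sample means differ significantly
        z = (fmean(a) - fmean(b)) / se
        if abs(z) > z_crit:
            rejections += 1
    # Step 3: power is the proportion of significant tests
    return rejections / n_trials

power = simulated_power(mu_a=10.0, mu_b=11.0, sigma=2.0, n=30)
print(f"estimated power: {power:.2f}")
```

Setting `mu_a` equal to `mu_b` turns the same function into an estimate of the type I error rate, which should come out close to `alpha`.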

In practice, as we point out elsewhere, a better (and more general) definition of power is simply the probability that the test will class a specified treatment effect as significant.

Assuming that the statistic being tested has a "known" distribution (for example, normal), the power of the test is as follows:

  • Imagine $d_A$ is the distribution of your test statistic (for example, Z) under the alternative hypothesis, $H_A$.
  • Then the power of your test is simply the proportion of $d_A$ that lies outside the lower and/or upper 'critical values' of your test. These critical values are quantiles of $d_0$, the distribution of the test statistic under the null hypothesis, $H_0$.

This works perfectly well whatever the effect size of your treatment (it could be zero), but it assumes that your treatment effect is fixed (cannot vary) and that the only difference between $d_0$ and $d_A$ is their location (thus the treatment effect, $\delta$, is $\mu_A - \mu_0$).
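As a minimal sketch of this definition, assume the test statistic is standard normal under the null hypothesis and has the same shape, shifted by the treatment effect, under the alternative. The function name `power_from_distributions` is hypothetical, and `delta` is expressed in standard-error units.

```python
from statistics import NormalDist

def power_from_distributions(delta, alpha=0.05):
    """Proportion of the alternative distribution lying beyond the null's critical values."""
    d0 = NormalDist(mu=0.0, sigma=1.0)    # test statistic under H0
    dA = NormalDist(mu=delta, sigma=1.0)  # same shape, shifted by delta, under HA
    lower = d0.inv_cdf(alpha / 2)         # lower critical value (a quantile of d0)
    upper = d0.inv_cdf(1 - alpha / 2)     # upper critical value (a quantile of d0)
    return dA.cdf(lower) + (1 - dA.cdf(upper))

for delta in (0.0, 1.0, 1.96, 3.0):
    print(f"delta = {delta:4.2f}  power = {power_from_distributions(delta):.3f}")
```

With a zero effect the two distributions coincide, so the "power" is simply the significance level.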

Of course, if $d_0$ and $d_A$ differ in other ways, or their distributions are unknown (or cannot be easily calculated), the only way to find the power may be empirically (in other words, through simulation). In this case, you repeatedly sample a defined population, apply your test to each sample, and find what proportion of the results are "significant".

Finally, note that when the distribution of the test statistic is not continuous (smooth) but very discrete (stepped), the use of conventional critical values can reduce the achievable power to the point of uselessness. In this situation, mid-P values perform better "on average", as long as you accept that your test may be either conservative or liberal.
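To make the point about discrete statistics concrete, here is a hedged sketch using a one-sided exact binomial test of $H_0$: p = 0.5 with n = 10. The scenario and the helper names (`binom_pmf`, `upper_p_value`, `rejection_rate`) are assumptions chosen for illustration. The mid-P value counts only half the probability of the observed outcome.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_p_value(x, n, p0, mid=False):
    """One-sided P-value for x or more successes under H0: p = p0."""
    tail = sum(binom_pmf(k, n, p0) for k in range(x + 1, n + 1))  # P(X > x)
    weight = 0.5 if mid else 1.0  # mid-P counts half of P(X = x)
    return tail + weight * binom_pmf(x, n, p0)

def rejection_rate(n, p0, p_true, alpha=0.05, mid=False):
    """Probability of rejecting H0 when the true success probability is p_true."""
    return sum(binom_pmf(x, n, p_true)
               for x in range(n + 1)
               if upper_p_value(x, n, p0, mid) <= alpha)

n, p0 = 10, 0.5
for mid in (False, True):
    label = "mid-P" if mid else "conventional"
    size = rejection_rate(n, p0, p_true=p0, mid=mid)    # actual type I error rate
    power = rejection_rate(n, p0, p_true=0.8, mid=mid)  # power when p = 0.8
    print(f"{label:12s} size = {size:.4f}  power = {power:.4f}")
```

In this example the conventional test is very conservative (its actual type I error rate is about 0.011 rather than 0.05), while the mid-P test is slightly liberal (about 0.055) but has noticeably higher power.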

Of course, we want the power of our statistical test to be as high as possible. So we need to know what factors determine the power of a test:

Power tends to be higher when:
  1. the effect size is large,
  2. the sample size is large,
  3. the variances of the populations being sampled are small,
  4. the significance level (α) is high (for example, 5% compared to 1%),
  5. a one-tailed test is used instead of a two-tailed one.
Note that power can only be reliably estimated if all the assumptions of the statistical test are met.
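The five factors can be checked numerically. The sketch below assumes a one-sample z test with a known standard deviation and a positive true effect. The function `z_test_power` and the baseline numbers are hypothetical. Each variant changes exactly one factor from the baseline.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05, two_tailed=True):
    """Power of a one-sample z test for a true mean shift delta (delta > 0)."""
    z = NormalDist()
    z_delta = delta / (sigma / sqrt(n))  # effect in standard-error units
    if two_tailed:
        z_crit = z.inv_cdf(1 - alpha / 2)
        return (1 - z.cdf(z_crit - z_delta)) + z.cdf(-z_crit - z_delta)
    z_crit = z.inv_cdf(1 - alpha)
    return 1 - z.cdf(z_crit - z_delta)

baseline = dict(delta=2.0, sigma=5.0, n=25, alpha=0.01, two_tailed=True)
variants = {
    "baseline": {},
    "1. larger effect (delta=3)": {"delta": 3.0},
    "2. larger sample (n=50)": {"n": 50},
    "3. smaller sd (sigma=4)": {"sigma": 4.0},
    "4. higher alpha (0.05)": {"alpha": 0.05},
    "5. one-tailed test": {"two_tailed": False},
}
results = {label: z_test_power(**{**baseline, **change})
           for label, change in variants.items()}
for label, power in results.items():
    print(f"{label:28s} power = {power:.3f}")
```

Every variant should report a higher power than the baseline line.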

For any given statistical test, there is a mathematical relationship between power, significance level, various population parameters, and sample size. For some of the more important statistical tests, we provide the formulas for this relationship. But before introducing the first of these (for the z test), we need to think carefully about what we are going to calculate using the relationship.


Estimating statistical power

There are two reasons for estimating the power of a test:

  1. To create a power curve in order to predict how much data needs to be collected to be reasonably confident (for example, 95%) of getting a significant result. This is a sensible and productive practice.

    In practice, the sample size required for a given desired power is usually calculated directly, rather than by constructing a power curve. Still, examining a power curve can be very useful, as it can help in making a more rational experimental design decision. Such a priori power predictions are useful, but they can be criticized when they are based on insufficient prior information (for example, from a pilot study that is too small) or when too crude (or inappropriate) a model is used to predict how the statistic under test is likely to vary. Somewhat perversely, reviewers tend to be much more concerned with the exact mathematical model than with the data to which it is applied, possibly because theoretical mathematical flaws are easier to resolve and their refinement offers interesting career prospects for mathematical statisticians.

  2. To provide additional information about data that have already been collected and tested. Such post-hoc power predictions are controversial and generally discouraged for two reasons:

    1. You will always find that there is insufficient power to detect a non-significant treatment effect. This is because the estimated power is directly related to the observed P-value. In other words, it cannot tell you anything more than the exact P-value does.

    Despite this objection, several standard textbooks (such as Zar (1996) and Schubfeld (2005)) recommend calculating power when a difference is found to be non-significant, as an aid in "interpreting that difference". If the test has insufficient power to detect that difference, they suggest classifying the result as "inconclusive".

    Unfortunately, post-hoc power determinations have no theoretical justification and are not recommended. Power is a pre-trial concept. We should not apply a pre-experimental probability of a hypothesized set of outcomes to the observed outcome. This has been compared to trying to convince someone that buying a lottery ticket was foolish (the pre-trial point of view) after they have won the jackpot (the post-trial point of view).

    2. Calculating the power to detect your observed treatment effect locks you into the significant/non-significant mindset with a rigid significance level of 0.05. Once you have the data, it is better to use the exact P-value to assess the weight of evidence, and to calculate a confidence interval around the estimated effect size as a measure of the reliability of that estimate.

    Accepting these points, there is one way of calculating post-hoc power that can be very informative: the empirical power curve or its equivalents, the P-value plot or P-value function, which correspond to every possible confidence interval about the observed effect size. Whatever it is called, this function estimates the relationship between the probability of rejecting the null hypothesis and the effect size, given the available data. For simpler models, this relationship can be predicted algebraically. Alternatively, and more informatively, the relationship can be estimated using "test inversion". Because test inversion exploits the underlying connection between tests and confidence intervals, we examine this method in Unit 6.
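A rough sketch of a P-value function and of test inversion follows. It assumes the simplest possible case, a normally distributed effect estimate with a known standard error, and is not necessarily the method developed in Unit 6. The names `p_value` and `interval_by_inversion`, and the observed values, are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()

def p_value(d_obs, se, delta0):
    """Two-tailed P-value for H0: true effect = delta0, given the observed effect."""
    z = (d_obs - delta0) / se
    return 2 * (1 - Z.cdf(abs(z)))

def interval_by_inversion(d_obs, se, alpha=0.05, step=0.01, span=6.0):
    """Scan hypothesised effects and keep those the test does not reject at level alpha."""
    n_steps = int(span * se / step)
    kept = [d_obs + k * step
            for k in range(-n_steps, n_steps + 1)
            if p_value(d_obs, se, d_obs + k * step) > alpha]
    return min(kept), max(kept)

d_obs, se = 3.0, 1.5  # hypothetical observed effect and its standard error
for delta0 in (0.0, 1.5, 3.0, 4.5, 6.0):
    print(f"H0: delta = {delta0:3.1f}  ->  P = {p_value(d_obs, se, delta0):.3f}")

low, high = interval_by_inversion(d_obs, se)
print(f"95% interval by test inversion: {low:.2f} to {high:.2f}")
```

The set of hypothesised effects that are not rejected at α = 0.05 is, to within the grid step, the usual 95% confidence interval, which is the connection between tests and intervals that test inversion exploits.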


    Estimating the sample size required for a given power level

    Predicting the required sample size for a particular statistical test requires values for the power, the significance level, the effect size, and various population parameters. You must also specify whether the test is one-tailed or two-tailed. We will look at each of these components.

    The values chosen for the statistical power and the significance level depend on the study. Conventionally, the power should not be less than 0.8, and preferably around 0.9. The most commonly used value for the significance level (α) is 0.05. However, there may be good reasons to deviate from these conventional values. If it is more important to avoid a type I error (i.e., a false positive result), the significance level can be lowered to 0.01. If it is more important to avoid a type II error (i.e., a false negative result), the power can be increased to 0.95.

    The relevant population parameters depend on the type of statistical test. When comparing means, you must provide the population standard deviation. When comparing proportions, you must provide the proportion in the reference or control group, which in turn allows you to estimate the standard deviation. These parameters can normally be estimated from the literature or, if this is not possible, from a pilot study. Sometimes it is necessary to re-evaluate these parameters during the course of a study, although statisticians generally advise against this because it can introduce bias into the process.

    The effect size (the smallest difference between the means or proportions that you consider to be worthwhile) is probably the most difficult parameter to determine because it is subjective to some degree. When comparing a new malaria treatment to the standard, how much improvement is worthwhile? In making this decision, one must consider the frequency and severity of side effects, the relative cost of the new treatment, and the relative ease of administration. If the new drug is cheaper than the current one and has fewer side effects, even a small improvement in cure rate (say 5%) is worthwhile. If it is much more expensive with similar side effects, you might consider that only a larger improvement (say 20%) would be worthwhile.

    Don't just choose the effect size that gives you a convenient sample size!

    Effect size choices should always be made explicit - a point not sufficiently emphasized in the literature! Too often, researchers do what is popularly known as the 'sample size samba' - that is, simply adjusting the effect size to obtain a convenient sample size. This is very unwise, because if you then find a smaller effect, you are committed to saying it is not worthwhile, even if it is!

    Finally, one must decide whether to choose a one-tailed or a two-tailed test. Sometimes a one-tailed test is chosen simply to reduce the required sample size, a practice that statisticians strongly discourage. Today, the convention is that one should always estimate the sample size for a two-tailed test, even if a one-tailed test is later used for the analysis.

    There is one last important point!

    The estimation of the required sample size is not a precise science. It is always approximate because you need to estimate (sometimes just guess) the variances of the populations involved. As a result, the actual power you achieve may be far below what you expect.

    Therefore, it is a good idea to use a slightly larger sample size than specified in your power analysis.


    Estimating power and sample size for the z test

    Hypotheses and tails

    We now consider the statistical power of the z test for comparing a randomly selected value Q from a test population, whose true mean is $\mu_1$, with a reference population that has a known mean ($\mu_0$) and a known standard error ($\sigma_d$). This standard error is assumed to be the same under the null and alternative hypotheses, and $d = Q - \mu_0$.

    1. For a one-tailed test using the upper tail:
      • The null hypothesis ($H_0$) is $\mu_1 = \mu_0$,
        so $\delta = [\mu_1 - \mu_0] = 0$.
      • The alternative hypothesis ($H_1$) is $\mu_1 > \mu_0$,
        so $\delta = [\mu_1 - \mu_0] > 0$.
      In other words, $\delta$ is the true difference between the null and alternative population means, and $d$ is the difference we observe, which is an estimate of $\delta$. We only reject $H_0$ if we observe a $d$ that lies within the upper tail of our null population. The larger $\delta$ is relative to $\sigma_d$, the higher this probability is.
    2. For a one-tailed test using the lower tail:
      • $H_0$ is the same, $\delta = 0$,
      • but $H_1$ is $\mu_1 < \mu_0$, so $\delta < 0$.
      Here we can only reject $H_0$ if $d$ is observed in the lower tail of our null population.
    3. For a two-tailed test:
      • $H_0$ is the same.
      • Under $H_1$, $\delta \neq 0$.
      $H_0$ can be rejected if $d$ falls in either tail of our null population.

    Z notation

    To reduce computational effort, these comparisons are generally made using standardized values. Unfortunately, this usually leads to some additional notation that we need to explain before proceeding.

  • In Unit 3 we used Z to refer to a normal deviate from a standard normal distribution. Confusingly, Z can also be used to denote randomly chosen locations within this distribution. Specific values (usually quantiles) within this distribution are indicated by a lowercase z with a subscript.
  • $z_\alpha$ (or $+z_\alpha$) is the location of the critical value for α, above which 100α% of the null population lies.
    1. For a standard one-tailed significance test using the upper tail, α = 0.05 and $+z_\alpha = +1.645$.
    2. Since this distribution is symmetric, for the lower tail $-z_\alpha = -1.645$.
    3. For a two-tailed comparison, assuming a probability of α/2 in each tail and α = 0.05, $-z_{\alpha/2} = -1.960$ and $+z_{\alpha/2} = +1.960$.
  • Consequently, if we standardize the difference between the means by dividing it by the population standard error of $d$ ($\sigma_d$), then $z_\delta = \delta/\sigma_d$, or $[\mu_1 - \mu_0]/\sigma_d$.

    Power formulas

    For the three tests listed above, the probability of correctly rejecting the null hypothesis with a predefined α is as follows:

    Algebraically speaking -

    a. For a one-tailed test using the upper tail (positive treatment effect):

    $$\text{power} = 1 - \beta = P[Z > (+z_\alpha - z_\delta)]$$

    For example, when $z_\delta = +z_\alpha$, half of all randomly selected results exceed $+z_\alpha$ and cause $H_0$ to be rejected, so the power ($1 - \beta$) is 0.5.

    b. For a one-tailed test using the lower tail (negative treatment effect):

    $$\text{power} = 1 - \beta = P[Z < (-z_\alpha - z_\delta)] = 1 - P[Z > (-z_\alpha - z_\delta)]$$

    Similarly, when $z_\delta = -z_\alpha$, half of all randomly selected results fall below $-z_\alpha$, so the power ($1 - \beta$) is 0.5.

    c. For a two-tailed test using both tails:

    $$\text{power} = 1 - \beta = P[Z > (+z_{\alpha/2} - z_\delta)] + 1 - P[Z > (-z_{\alpha/2} - z_\delta)]$$

    Thus, if $z_\delta = -z_{\alpha/2}$ or $z_\delta = +z_{\alpha/2}$, just over half of all randomly chosen values will cause $H_0$ to be rejected. If $\alpha/2$ is larger or, for example, $\delta$ is smaller, the difference in power compared to the one-tailed formula is rather greater. In all three cases, if $\delta = 0$, then $(1 - \beta) = \alpha$, which is the proportion of type I errors when $H_0$ is true.


    where:
    • P is the probability, determined from the cumulative normal distribution, as the fraction of the standard normal distribution that is greater than or less than Z. This can be obtained from the probability calculator in your statistics package. If you use tables, some give the proportion of the distribution less than Z, while others give the proportion greater than Z. Another variation is that the probability given in the table runs from zero to Z, so you have to add 0.5 to get the correct value.
    • Z is the standardized normal deviate,
    • $z_\alpha$ is the location of the critical value for α, above which 100α% of the null population lies. It is obtained from your probability calculator or tables, such that $P(Z < z_\alpha) = 1 - \alpha$, where α is the significance level,
    • $z_\delta = \delta/\sigma_d$, or $[\mu_1 - \mu_0]/\sigma_d$,
    • $\mu_0$ is the mean of the reference population (under $H_0$),
    • $\mu_1$ is the mean of the test population,
    • $\sigma_d$ is the population standard error of $d$. For a z test, $\sigma_d = \sigma/\sqrt{n}$, the standard error of the mean, usually calculated as the standard deviation of the reference population observations ($\sigma$) divided by the square root of the number of observations in the sample ($n$).
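The three power formulas translate almost directly into code. This is a minimal sketch using the standard normal distribution from Python's `statistics` module. The function names are hypothetical, and `z_delta` is the standardized effect, δ/σd.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def upper_tail(x):
    """P[Z > x] for a standard normal deviate."""
    return 1 - Z.cdf(x)

def power_upper(z_delta, alpha=0.05):
    """(a) One-tailed test using the upper tail."""
    z_alpha = Z.inv_cdf(1 - alpha)
    return upper_tail(z_alpha - z_delta)

def power_lower(z_delta, alpha=0.05):
    """(b) One-tailed test using the lower tail."""
    z_alpha = Z.inv_cdf(1 - alpha)
    return 1 - upper_tail(-z_alpha - z_delta)

def power_two_tailed(z_delta, alpha=0.05):
    """(c) Two-tailed test using both tails."""
    z_half = Z.inv_cdf(1 - alpha / 2)
    return upper_tail(z_half - z_delta) + 1 - upper_tail(-z_half - z_delta)

z_alpha = Z.inv_cdf(0.95)
print(f"upper, z_delta = +z_alpha: {power_upper(z_alpha):.3f}")
print(f"lower, z_delta = -z_alpha: {power_lower(-z_alpha):.3f}")
print(f"two-tailed, z_delta = 2.5: {power_two_tailed(2.5):.3f}")
print(f"two-tailed, z_delta = 0:   {power_two_tailed(0.0):.3f}")
```

The checks mirror the worked cases in the text: power is 0.5 when the standardized effect equals the critical value, and it falls to the significance level when the effect is zero.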

    Sample size estimation

    We rearrange the power formula to obtain the number of samples needed to obtain a given power.

    Algebraically speaking:

    For a one-tailed test:

    $$n = \frac{(z_\alpha + z_\beta)^2 \, \sigma^2}{(\mu_1 - \mu_0)^2}$$

    where:
    • $z_\alpha$ is obtained from your probability calculator or tables, such that $P(Z < z_\alpha) = 1 - \alpha$, and α is the significance level,
    • $z_\beta$ is obtained from your probability calculator or tables, such that $P(Z < z_\beta) = 1 - \beta$, and $1 - \beta$ is the power,
    • $\mu_0$ is the known population mean,
    • $\mu_1$ is the mean of the test population,
    • $\sigma$ is the known population standard deviation of the observations.

    For a two-tailed test we use an approximation, replacing $z_\alpha$ with $z_{\alpha/2}$. This ignores the possibility of a type III error, but it usually does not cause a serious error when the treatment effect is large.
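As a hedged sketch, the sample size calculation can be written as below. It assumes the standard rearrangement of the z test power formula, n = (z_α + z_β)² σ² / (μ1 − μ0)², with z_α/2 substituted for z_α in the two-tailed case. The function name `z_test_sample_size` and the example numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def z_test_sample_size(mu0, mu1, sigma, alpha=0.05, power=0.80, two_tailed=False):
    """Smallest whole n giving at least the requested power for a z test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - (alpha / 2 if two_tailed else alpha))
    z_beta = z.inv_cdf(power)  # P(Z < z_beta) = 1 - beta = power
    n = (z_alpha + z_beta) ** 2 * sigma ** 2 / (mu1 - mu0) ** 2
    return ceil(n)             # round up: n must be a whole number

n_one = z_test_sample_size(mu0=100, mu1=105, sigma=15)
n_two = z_test_sample_size(mu0=100, mu1=105, sigma=15, two_tailed=True)
print(f"one-tailed: n = {n_one}")
print(f"two-tailed: n = {n_two}")
```

Rounding up rather than to the nearest integer keeps the achieved power at or above the target. The two-tailed calculation always asks for more observations than the one-tailed one.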

    The following values of $z_\alpha$ and $z_\beta$ are the most commonly used in sample size calculations:

    [Table not recovered. Columns: significance level, with one-tailed ($z_\alpha$) and two-tailed ($z_{\alpha/2}$) values; power, with $z_\beta$ values.]
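Values of this kind can be regenerated from the standard normal distribution. The sketch below is not a reconstruction of the original table: the particular significance levels and powers are assumptions, chosen because they are the conventional values mentioned earlier in the text.

```python
from statistics import NormalDist

z = NormalDist()

# significance level -> (one-tailed z_alpha, two-tailed z_alpha/2)
alpha_table = {a: (z.inv_cdf(1 - a), z.inv_cdf(1 - a / 2)) for a in (0.05, 0.01)}
# power -> z_beta
power_table = {p: z.inv_cdf(p) for p in (0.80, 0.90, 0.95)}

print(f"{'significance level':<20}{'one-tailed':>12}{'two-tailed':>12}")
for a, (one, two) in alpha_table.items():
    print(f"{a:<20}{one:>12.3f}{two:>12.3f}")
print()
print(f"{'power':<20}{'z_beta':>12}")
for p, zb in power_table.items():
    print(f"{p:<20}{zb:>12.3f}")
```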


    A number of assumptions are made when estimating power and the required sample size. The first set of assumptions applies to all significance tests, namely:

    1. Samples are drawn at random or individuals are randomly assigned to treatment groups.
    2. The observations are independent of each other.

    The second set of assumptions is specific to the z test:

    1. The response variable approximates a normal distribution.
    2. The true population mean and standard deviation are known and are not estimated from a sample.

    Article information

    Author: Saturnina Altenwerth DVM

    Last Updated: 02/24/2023
