*An oft-overlooked detail in the significance debate is the challenge of calculating correct p-values and confidence intervals, the favored statistics of the two sides. Standard methods rely on assumptions about how the data were generated and can be way off when those assumptions don’t hold. Papers on heterogeneous effect sizes by Kenny and Judd and by McShane and Böckenholt present a compelling scenario in which the standard calculations are highly optimistic. Even worse, the errors grow as the sample size increases, negating the usual heuristic that bigger samples are better.*

Standard methods like the t-test imagine that we’re repeating a study an infinite number of times, drawing a different sample each time from a population with a fixed true effect size. A competing, arguably more realistic, model is the heterogeneous effect size model (*het*). This assumes that each time we do the study, we’re sampling from a different population with a different true effect size. Kenny and Judd suggest that the population differences may be due to “variations in experimenters, participant populations, history, location, and many other factors… we can never completely specify or control.”

In the meta-analysis literature, the *het* model is called the “random effects model” and the standard model the “fixed effects model”. While the distinction is well-recognized, the practical implications may not be. The purpose of this blog is to illustrate the practical consequences of the *het* model for p-values and confidence intervals.

I model the *het* scenario as a two-stage random process. The first stage selects a population effect size, \(d_{pop}\), from a normal distribution with mean \(d_{het}\) and standard deviation \(sd_{het}\). The second carries out a two-group difference-of-means study with that population effect size: it selects two random samples of size \(n\) from standard normal distributions, one with \(mean=0\) and the other with \(mean=d_{pop}\), and uses the standardized difference, aka *Cohen’s d*, as the effect size statistic. The second stage is simply a conventional study with population effect size \(d_{pop}\). \(d_{het}\), the first-stage mean, plays the role of true effect size.
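The two-stage process is straightforward to simulate. Here is a minimal sketch in Python; the function name `het_study`, the use of numpy, and the default parameter values are my own illustrative choices, not part of the original analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

def het_study(d_het=0.0, sd_het=0.2, n=200, rng=rng):
    """One two-stage draw: pick a population effect size, then run a
    two-group difference-of-means study with n subjects per group."""
    d_pop = rng.normal(d_het, sd_het)   # stage 1: population effect size
    ctrl = rng.normal(0.0, 1.0, n)      # stage 2: control group
    trt = rng.normal(d_pop, 1.0, n)     # stage 2: treatment group
    pooled_sd = np.sqrt((ctrl.var(ddof=1) + trt.var(ddof=1)) / 2)
    return (trt.mean() - ctrl.mean()) / pooled_sd  # Cohen's d

# Under the null (d_het = 0), observed d's spread roughly as
# sqrt(sd_het^2 + 2/n), wider than the conventional sqrt(2/n) ~ 0.10.
ds = np.array([het_study() for _ in range(20_000)])
print(ds.std())  # ~ 0.22 for sd_het = 0.2, n = 200
```

The extra spread comes from stage one: each simulated study's \(d\) varies around a different \(d_{pop}\), so the stage-one variance adds to the ordinary sampling variance.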

Figure 1 shows a histogram of simulated *het* results under the null (\(d_{het}=0\)) with \(sd_{het}=0.2\) for \(n=200\). Overlaid on the histogram is the sampling distribution for the conventional scenario colored by conventional p-value along with the 95% confidence interval. Note that the histogram is wider than the sampling distribution.

Recall that the p-value for an effect \(d\) is the probability of getting a result as or more extreme than \(d\) under the null. Since the histogram is wider than the sampling distribution, more of its mass lies beyond the point where \(p=0.05\) (where the color switches from blue to red), and so the correct p-value is more than 0.05. In fact the correct p-value is much larger: 0.38. The confidence interval also depends on the width of the distribution and is wider than in the conventional case: -0.44 to 0.44 rather than -0.20 to 0.20.
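These numbers can be reproduced under a normal approximation: the *het* distribution of observed effect sizes has standard deviation roughly \(\sqrt{sd_{het}^2 + 2/n}\), versus the conventional \(\sqrt{2/n}\). A stdlib-only sketch (1.96 is the usual two-sided 95% normal quantile; the variable names are mine):

```python
from math import sqrt, erfc

n, sd_het, z = 200, 0.2, 1.96            # z: two-sided 95% normal quantile
se_conv = sqrt(2 / n)                    # conventional sampling sd of Cohen's d
sd_tot = sqrt(sd_het**2 + se_conv**2)    # total sd under het

d_crit = z * se_conv                     # observed d with conventional p = 0.05
p_het = erfc(d_crit / sd_tot / sqrt(2))  # two-sided normal tail prob under het
ci_het = z * sd_tot                      # het 95% CI half-width at d = 0

print(round(p_het, 2), round(ci_het, 2))  # -> 0.38 0.44
```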

Note that effect size heterogeneity “inflates” both the true p-value and true confidence interval. In this particular example, *p-value inflation* is 7.6 (0.38/0.05), and *confidence interval inflation* is 2.2 (0.44/0.20). In general, these inflation factors will change with \(sd_{het}\) and \(n\). Figures 2 and 3 plot p-value and confidence interval inflation vs. \(n\) for several values of \(sd_{het}\). The p-value results (Figure 2) show inflation when the conventional p-value is barely significant (\(p=0.05\)); the confidence interval results (Figure 3) are for \(d=0\) (same as Figure 1).
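Under the same normal approximation, both inflation factors can be written as a simple function of \(n\) and \(sd_{het}\). A sketch (the function `inflation` is my own, evaluated at the barely-significant point \(d = 1.96\sqrt{2/n}\), matching the setup above):

```python
from math import sqrt, erfc

def inflation(n, sd_het, z=1.96):
    """P-value and CI inflation when the conventional p-value is 0.05."""
    se = sqrt(2 / n)                       # conventional sampling sd
    sd_tot = sqrt(sd_het**2 + se**2)       # total sd under het
    p_het = erfc(z * se / sd_tot / sqrt(2))  # correct two-sided p-value
    return p_het / 0.05, sd_tot / se       # p-value and CI inflation

# P-value inflation grows with n: heterogeneity dominates sampling error.
for n in (50, 200, 800):
    p_inf, ci_inf = inflation(n, 0.2)
    print(n, round(p_inf, 1), round(ci_inf, 1))
```

For \(sd_{het}=0.2\) and \(n=200\) this recovers the inflation factors above (7.6 and 2.2), and increasing \(n\) makes both worse, consistent with bigger samples not helping here.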