In my previous posts, I showed how we can use simulated data to run power and sample size analyses and to estimate distributions around the effect size. But we can also use simulated data to estimate the error around p-values. Unless you have been living under a rock, you will know that p-values have been getting smashed for the past decade. And while many admit there isn't anything inherently wrong with p-values if they are used and interpreted carefully, it is their abuse and misinterpretation that lead many to argue they should be banned.
Today I was reading Halsey's "The reign of the p-value is over: what alternative analyses could we employ to fill the power vacuum?" (Halsey 2019), which discusses using p-value intervals to estimate the error around a p-value. Using the standard cut-off threshold of 0.05, if a p-value falls under the threshold but its interval extends well above 0.05, then the p-value is probably not reliable. Let's simulate some binomial data with success probabilities of 0.5 (control) and 0.4 (treatment), with just 10 individuals per group and 20 trials per individual.
raw.x = c(rep('control', 10), rep('treat', 10))      # two groups of 10 individuals
tot = 20                                             # trials per individual
set.seed(123)
suc = c(rbinom(10, tot, 0.5), rbinom(10, tot, 0.4))  # successes per individual
data1 = data.frame(raw.x, suc, tot = rep(tot, 20))
data1$obs <- factor(formatC(1:nrow(data1), flag = "0", width = 3))  # unique obs ID
data1
library(ggplot2)
p0 = ggplot() +
  geom_point(data1, mapping = aes(x = raw.x, y = suc/tot),
             color = 'steelblue', alpha = 0.5, size = 2,
             position = position_jitter(width = .01))
p0

I've deliberately made it so there is quite a bit of overlap between the groups. Let's fit a basic binomial GLMM, with an observation-level random effect to handle overdispersion, and look at the associated p-values.
library(lme4)
md3 <- glmer(cbind(suc, (tot - suc)) ~ raw.x + (1|obs),
             family = binomial(link = "logit"), data = data1)
summary(md3)
coef(summary(md3))[, "Pr(>|z|)"]
You can see the p-value of the slope (raw.x) is 0.0355, just under the 0.05 threshold. But what happens when we resample the data and refit the model a few hundred times?
sims = 200               # number of simulations
lowerCI = 0.025 * sims   # index of the lower 2.5% p-value
upperCI = 0.975 * sims   # index of the upper 97.5% p-value
medianCI = 0.5 * sims    # index of the median p-value
predFun1 <- function(.) (coef(summary(.))[, "Pr(>|z|)"])  # function to extract the p-values
bb1 <- bootMer(md3, FUN = predFun1, nsim = sims,
               parallel = "multicore", .progress = 'txt')  # parametric bootstrap, 200 sims
library(dplyr)  # for the %>% pipe
df <- apply(bb1$t, 2, function(X) X[order(X)]) %>% data.frame()  # sort the simulated p-values for each term
int = df$X.Intercept. %>% data.frame()   # ordered p-values for the intercept
raw.x = df$raw.x %>% data.frame()        # ordered p-values for the slope
blo <- int[lowerCI, ]    # find the bottom 2.5%
bhi <- int[upperCI, ]    # find the top 2.5%
med <- int[medianCI, ]   # find the median
c(med, blo, bhi)
blo <- raw.x[lowerCI, ]  # find the bottom 2.5%
bhi <- raw.x[upperCI, ]  # find the top 2.5%
med <- raw.x[medianCI, ] # find the median
c(med, blo, bhi)
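If you prefer, the same intervals can be pulled out in one step with quantile(), using the df object of sorted p-values built above; this is just a more compact version of the indexing approach.
# 2.5%, 50% and 97.5% quantiles of the bootstrapped p-values for each model term
apply(df, 2, quantile, probs = c(0.025, 0.5, 0.975))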
The p-intervals for 'raw.x' range from <0.001 (highly significant) to 0.692 (highly non-significant). This is a problem, and indicates the p-value wasn't really reliable. We may simply have got lucky in the initial analysis.
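As a quick complementary check (using the raw.x data frame of bootstrapped slope p-values created above), we can ask what proportion of the resampled p-values land above the 0.05 threshold.
# proportion of bootstrapped slope p-values exceeding the 0.05 threshold
mean(raw.x[, 1] > 0.05)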
Finally, let's plot it up, adding a red dashed line at the 0.05 threshold.
library(ggplot2)
library(tidybayes)
p1 = ggplot(raw.x, aes(x = .)) +
  geom_density(aes(color = 'red', fill = 'red'), alpha = 0.3) +
  stat_pointintervalh(aes(y = 0.00, x = .), .width = c(.66, .95))
p1 = p1 + scale_y_continuous(name = "Density")
p1 = p1 + scale_x_continuous(name = "p-values")
p1 = p1 + coord_cartesian(ylim = c(0.0, 8))
p1 = p1 + geom_vline(xintercept = 0.05, color = "red", lty = 2) + theme_light()
p1 = p1 + theme(legend.position = "none")
p1

You can see the p-value distribution has a long tail. The 95% interval is shown in light grey and the 66% interval in black, and together they point to a very unreliable test.
So what do you think? Should we be reporting p-value intervals with all our statistics?
Halsey LG (2019) The reign of the p-value is over: what alternative analyses could we employ to fill the power vacuum? Biol Lett 15:20190174