Note: Changes to various WordPress themes and R updates may mean the code below needs some minor tweaks to run. Find the full code here (last checked 20/02/23).
This post describes how to fit binomial GLMMs to a design with one three-level fixed factor and one continuous factor, a common design in wet-lab experiments. This was my preferred design for multifactor experiments during my postdoc because it estimates the effect size (ECx) while investigating two factors. For example, you may want to know how much stressor X impacts coral survival, and how this changes with a second factor Y.
The easiest way to use this code is to reformat the headings in your data sheet to match the general headings I use here, so that no code needs to be altered.
Note: to keep things simple, this code does not go deep into data exploration, model selection, or model validation. I am going to assume a random ‘tank’ effect (obs), which occurs in almost all experiments.
1) Import some data and set the data types. Your data should be organised in this type of long form:
data1 <- read.table(file = "https://raw.githubusercontent.com/gerard-ricardo/data/master/mockmulticb", header = TRUE, dec = ",", na.strings = c("", ".", "NA"))
head(data1)
options(scipen = 999) # turn off scientific notation
#2) Organising and wrangling
str(data1) #check data type is correct
data1$raw.x <- as.numeric(as.character(data1$raw.x))
# data1$counts <- as.numeric(as.character(data1$counts))
data1$suc <- as.integer(as.character(data1$suc))
data1$tot <- as.integer(as.character(data1$tot))
data1$prop <- data1$suc/data1$tot
data1$obs <- factor(formatC(1:nrow(data1), flag = "0", width = 3)) # unique tank ID for later on
nrow(data1)
str(data1)
data1 = data1[complete.cases(data1), ] #remove incomplete rows (make sure the import matched your NA strings)
Okay, let’s do some very basic data exploration.
#3) Data exploration
table(data1$tot)
10 12 15 16 17 18 19 20 21 22 23 24 25 26 27 29 30 31 33 34 35 39 40
 1  1  3  5  5  5  5  3  5  3  6  4  3  3  4  4  3  3  2  1  1  1  1
This isn’t great: it means each tank has a different number of animals, and predictions for tanks with very few animals will be very coarse. GLMs can weight this imbalance a bit, so let’s leave it in for the moment. We can see this in the data by making the size of each point proportional to the number of animals in that tank.
library(ggplot2)
p0 = ggplot() + geom_point(data1, mapping = aes(x = raw.x, y = prop), position = position_jitter(width = .02), alpha = 0.50, size = data1$tot * 0.2) + theme_light()
p0 = p0 + facet_wrap(~factor) + scale_x_log10(name = "raw.x")
p0
Okidok, at this stage you may want to launch into data exploration and consider some of the following (a small sketch of a couple of these checks follows the list):
- is the control healthy, or representative of health in the field?
- how does the control relate to the first treatment?
- are data evenly spaced across the logx scale?
- is the effect linear or non-linear?
- are there any non-monotonic (bell-shaped) effects?
- are there any outliers?
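Here’s a minimal sketch of a couple of these checks in base R (purely illustrative; nothing later depends on them):
sort(unique(data1$raw.x)) # treatment levels; ideally roughly evenly spaced on a log scale
diff(log10(sort(unique(data1$raw.x)))) # gaps between consecutive levels on the log10 scale
aggregate(prop ~ raw.x + factor, data = data1, FUN = mean) # compare the control to the first treatment
boxplot(prop ~ factor, data = data1) # crude outlier screen within each factor level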
This data set has a number of issues, but let’s push on ahead anyway.
#4) Fit the model
library(lme4)
library(splines)
md3 <- glmer(cbind(suc, (tot - suc)) ~ scale(raw.x) * factor + (1|obs), family = binomial(link = logit), data = data1)
summary(md3)
library(RVAideMemoire) #GLMM overdispersion test
overdisp.glmer(md3) #Overdispersion for GLMM
#is there an interactive effect?
#is it overdispersed?
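If you want a quick check without the extra package, a common rule of thumb (not a formal test) is the ratio of the sum of squared Pearson residuals to the residual degrees of freedom:
rp <- residuals(md3, type = "pearson") # Pearson residuals from the glmer fit
sum(rp^2) / df.residual(md3) # ratios well above 1 suggest overdispersion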
At this stage, many would stop and interpret. They would talk about the interactive effect, and for significant effects say something like ‘for every unit increase of X, there is a decrease in response of blah blah blah’. Neither statement is super useful for regulators that actually need to make decisions. Why? Because p-values in lab assays often just reflect the experimental design. Experiments that use high concentrations and lots of replicates will likely find significant effects, but these may not be biologically meaningful. And placing the interpretation on the treatment (which is what gets managed), rather than on the response, makes the metric much more usable. So let’s take an effect-size approach by looking at the magnitude of change in the response. These metrics (ECx) can also be used in other models such as SSDs and meta-analyses. But first let’s predict and plot up the models. Here I’m going to switch to glmmTMB because of convergence issues.
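As an aside, if you want to see exactly what lme4 was complaining about, any convergence messages are stored on the fitted object:
md3@optinfo$conv$lme4$messages # NULL if nothing was flagged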
#5) Predict and plot
library(glmmTMB)
md1 = glmmTMB(cbind(suc, (tot - suc)) ~ raw.x * factor + (1|obs), data1, family = "binomial")
summary(md1)
vec.x = seq(min(data1$raw.x), max(data1$raw.x), length = 100)
df.x <- expand.grid(raw.x = vec.x,
factor = levels(data1$factor))
#make sure names are the same as model
mm <- model.matrix(~raw.x*factor, df.x) # build model matrix to sub the parameters into the equation
eta <- mm %*% fixef(md1)$cond #sub in parameters to get mean on logit scale
df.x$prediction <- as.vector(exp(eta) / (1 + exp(eta))) #back-transform mean from logit scale
se <- sqrt(diag(mm %*% vcov(md1)$cond %*% t(mm)))
df.x$upper <- exp(eta + 1.96 *se) /(1 + exp(eta + 1.96 *se)) #work out upper 95% CI
df.x$lower <- exp(eta - 1.96 * se) / (1 + exp(eta - 1.96 * se)) #work out lower 95% CI
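As a cross-check, a reasonably recent glmmTMB should give the same means and intervals via predict() with population-level predictions (re.form = NA), and plogis() is a tidier way to back-transform from the logit scale:
pred <- predict(md1, newdata = df.x, se.fit = TRUE, re.form = NA) # on the link (logit) scale
df.x$prediction2 <- plogis(pred$fit) # should match df.x$prediction
df.x$upper2 <- plogis(pred$fit + 1.96 * pred$se.fit)
df.x$lower2 <- plogis(pred$fit - 1.96 * pred$se.fit)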
data1$factor = factor(data1$factor, levels = c("a", "b", "c")) #set levels in order for plotting
library(ggplot2)
p0= ggplot()
p0= p0+ geom_point(data = data1, aes(x = raw.x, y = prop, alpha = 0.1), color = 'steelblue', size = data1$tot*0.1, position = position_jitter(width = .01))
p0= p0+ geom_line(data = df.x, aes(x = raw.x, y = prediction), color = 'grey30', size = 1)
p0= p0+ geom_ribbon(data = df.x, aes(x = raw.x, ymin = lower, ymax = upper, fill = 'grey'), alpha = 0.2)
p0= p0+ scale_x_log10(limits = c(0.9, 100))
p0 = p0+ labs(x=expression(Deposited~sediment~(mg~”cm”^{-2})),
y=expression(Recruit~survival~(prop.)))
p0= p0+ scale_y_continuous( limits = c(-0.05, 1.01))
p0= p0+ theme_light()
p0= p0+ scale_fill_manual(values = c("grey", "khaki2"))
p0= p0+ theme(legend.position = "none")
p0= p0+ facet_wrap(~factor, nrow = 1)
p0
You can see that for each level of the factor there are some thresholds occurring. For the first level of the categorical factor (the left plot), the curve starts to bend steeply around 10 units. For the third plot it occurs below 10 units, although the confidence bands are still wide in this region.
We want to predict what level of the continuous stressor (on the x-axis) causes a 50% effect, i.e. how much of the stressor will impact 50% of the population. We can do this by interpolating onto the curve. I have adapted Venables and Ripley’s* dose.p function to do this. In this example I want to compare every EC50 against ‘factor a’s EC50 (represented by the red line).
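For context, with a plain GLM (no random effect) MASS::dose.p does this interpolation in one line; a quick illustration on just factor a (note dose.p targets an absolute response of p, not a percentage decline from the control, which is why the long-hand code below first works out 50% of the modelled control):
library(MASS)
md.glm <- glm(cbind(suc, tot - suc) ~ raw.x, family = binomial, data = subset(data1, factor == "a"))
dose.p(md.glm, p = 0.5) # dose giving a 50% response, with its standard error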
Yep, the code below looks intense, but let me explain what is fundamentally going on here:
- Say what type of effect we want to see (here 50%, i.e. ec = 50).
- Work out the top of each model at the control.
- Find 50% of that.
- Get everything on the logit scale ready for interpolation.
- Interpolate to get the mean for each curve. Note the model needs to be relevelled to get the coefficients for the second and third curves.
- Use the variance-covariance matrix to get the standard errors.
- Use the old Wald approximation to approximate the 95% confidence intervals.
- Wrap everything into a data.frame.
- This needs to be done for each curve, so I ‘relevel’ the factor for each level.
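If the repetition below bothers you, the interpolation and delta-method steps can be wrapped into a helper function; a minimal sketch (ecx_wald is a hypothetical name, and it assumes the reference level of interest has already been set with relevel()):
ecx_wald <- function(model, eta_target) {
  betas <- fixef(model)$cond[1:2] # intercept and slope for the reference level
  ecx <- (eta_target - betas[[1]]) / betas[[2]] # interpolate on the logit scale
  pd <- -cbind(1, ecx) / betas[[2]] # gradient for the delta method
  vc <- as.matrix(vcov(model)$cond[1:2, 1:2])
  se <- sqrt(((pd %*% vc) * pd) %*% c(1, 1)) # delta-method standard error
  data.frame(ecx = ecx, lower = ecx - 1.96 * se, upper = ecx + 1.96 * se)
}
# e.g. ecx_wald(md1, eta1) should match ec.df1 below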
#6) Getting ECx values
ec = 50 #put in your ecx here
library(dplyr)
group.fac <-df.x %>% group_by(factor)%>%summarise(estimate = max(prediction))%>%as.data.frame()
top1 = group.fac$estimate[1] #the modelled control of factor 1
inhib1 = top1 -((ec/100) * top1) #an x% decrease from the control for factor 1
library(VGAM)
eta1 <- logit(inhib1)
data1$factor <- relevel(data1$factor, ref = "b") #set the reference level for the GLMM
md2 <- glmmTMB(cbind(suc, (tot - suc)) ~ raw.x * factor + (1|obs), data1, family = "binomial")
data1$factor <- relevel(data1$factor, ref = "c") #set the reference level for the GLMM
md3 <- glmmTMB(cbind(suc, (tot - suc)) ~ raw.x * factor + (1|obs), data1, family = "binomial") #note this reuses the name md3 from the earlier glmer fit
data1$factor = factor(data1$factor, levels = c("a", "b", "c")) #reset levels to the original order
betas1 = fixef(md1)$cond[1:2] #intercept and slope for ref 1
betas2 = fixef(md2)$cond[1:2] #intercept and slope for ref 2
betas3 = fixef(md3)$cond[1:2] #intercept and slope for ref 3
ecx1 <- (eta1 - betas1[1])/betas1[2]
ecx2 <- (eta1 - betas2[1])/betas2[2]
ecx3 <- (eta1 - betas3[1])/betas3[2]
pd1 <- -cbind(1, ecx1)/betas1[2]
pd2 <- -cbind(1, ecx2)/betas2[2]
pd3 <- -cbind(1, ecx3)/betas3[2]
ff1 = as.matrix(vcov(md1)$cond[1:2,1:2])
ff2 = as.matrix(vcov(md2)$cond[1:2,1:2])
ff3 = as.matrix(vcov(md3)$cond[1:2,1:2])
ec.se1 <- sqrt(((pd1 %*% ff1 )* pd1) %*% c(1, 1))
ec.se2 <- sqrt(((pd2 %*% ff2 )* pd2) %*% c(1, 1))
ec.se3 <- sqrt(((pd3 %*% ff3 )* pd3) %*% c(1, 1))
upper1 = (ecx1+ec.se1*1.96)
lower1 = (ecx1-ec.se1*1.96)
upper2 = (ecx2+ec.se2*1.96)
lower2 = (ecx2-ec.se2*1.96)
upper3 = (ecx3+ec.se3*1.96)
lower3 = (ecx3-ec.se3*1.96)
ec.df1 = data.frame(ecx1, lower1, upper1)
ec.df2 = data.frame(ecx2, lower2, upper2)
ec.df3 = data.frame(ecx3, lower3, upper3)
ecall = cbind(ec.df1, ec.df2, ec.df3)
ec.df1 #this is your factor a ecx values
ec.df2 #this is your factor b ecx values
ec.df3 #this is your factor c ecx values
p0= p0+ geom_hline(yintercept = inhib1, col = 'red', linetype = "dashed")
p0
Finally, let’s visualise the EC50s on the plot.
#Visualising the ECX
ecx.all = bind_cols(data.frame(factor = c('a', 'b', 'c')), data.frame(ecx = c(ecx1, ecx2, ecx3)), data.frame(inhib = c(inhib1, inhib1, inhib1)))
upper.all = bind_cols(data.frame(factor = c('a', 'b', 'c')), data.frame(upper = c(upper1, upper2, upper3)), data.frame(inhib = c(inhib1, inhib1, inhib1)))
lower.all = bind_cols(data.frame(factor = c('a', 'b', 'c')), data.frame(lower = c(lower1, lower2, lower3)), data.frame(inhib = c(inhib1, inhib1, inhib1)))
p0 = p0 + geom_point(data = upper.all, aes(x = upper, y = inhib), color = 'red', size = 2)
p0 = p0 + geom_point(data = ecx.all, aes(x = ecx, y = inhib), color = 'red', size = 2)
p0 = p0 + geom_point(data = lower.all, aes(x = lower, y = inhib), color = 'red', size = 2)
p0
You can see the EC50 values decrease between the factors, especially for factor c. The EC50 for the right-hand plot is quite a bit lower, so a regulator may set lower ‘trigger values’ for areas where these two combinations of stressors occur. The curve on the left has an EC50 of 46.11, but the curve on the right has an EC50 of 27.83, so ‘factor c’ is of much greater concern for regulators. In my view this is much better than simply going off p-values. However, we can still compare the EC50s statistically, which is a topic for a later post.
I hope this post was useful to you, and gets researchers new to modelling thinking through the lens of effect sizes.
*Venables WN, Ripley BD. Modern Applied Statistics with S-PLUS. Springer Science & Business Media; 2013.