In this post, we will look at a simple logistic binomial regression model in R. First, let's take a look at the following hypothetical data taken from [1]:

| Exposure | Disease (C = Yes) | No disease (C = Yes) | Disease (C = No) | No disease (C = No) |
|----------|------------------:|---------------------:|-----------------:|--------------------:|
| Yes      | 1200              | 600                  | 300              | 400                 |
| No       | 400               | 100                  | 600              | 400                 |

It is a contingency table. With this data we want to predict the probability of getting a disease based on exposure. We assume that some people are exposed to something (not necessarily the disease itself), and we count the number of people who get the disease. The exposure could be something beneficial, like an experimental treatment of "exercises at least twenty minutes a week on a treadmill," and the disease could be "heart disease." Notice that there is a second variable, "C", that we could also incorporate into our model. To get this data into R, it is easiest to put it in a comma-separated file like this:

```
s,f,E,C
1200,600,1,1
400,100,0,1
300,400,1,0
600,400,0,0
```

Then we can read it into a data frame as follows:

```r
dat = read.table("testdata.txt", header=TRUE, sep=",")
```

(Note the `header=TRUE`: the first line of the file holds the column names, so with `header=F` they would be read in as data.)
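If you'd rather not depend on an external file, the same data frame can be built directly in R (a sketch; the column names match the CSV above):

```r
# Same data as testdata.txt, constructed in memory:
# s = disease count, f = no-disease count, E = exposure, C = the extra variable
dat <- data.frame(
  s = c(1200, 400, 300, 600),
  f = c( 600, 100, 400, 400),
  E = c(   1,   0,   1,   0),
  C = c(   1,   1,   0,   0)
)
```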

The way to fit a binomial logistic regression model is as follows:

```r
attach(dat)
m1 = glm(cbind(s, f) ~ E, family=binomial)
```

Let's go over what this actually means under the hood. It means we are modeling the probability $p$ as

$$p(E) = \frac{\exp(a + bE)}{1 + \exp(a + bE)}.$$

This model is not the typical linear model. Another way to say this is that it models the logit of the probability linearly:

$${\rm logit}(p) := \log\left(\frac{p}{1-p}\right) = a + bE.$$

Of course, as usual, you can put much more complex expressions on the right-hand side. In the R code, the way to do this is to call the `glm` function. Since we have counts of successes and failures, we pass them as a two-column matrix of successes and failures built with `cbind`. We then type in the model formula and specify the "binomial" family, which means we are treating each row as an independent binomial experiment.
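As an aside, `glm` also accepts the success *proportion* as the response together with a `weights` argument giving the number of trials per row; this is equivalent to the `cbind` form. A quick sketch (building the data inline so it runs standalone):

```r
dat <- data.frame(s = c(1200, 400, 300, 600),
                  f = c(600, 100, 400, 400),
                  E = c(1, 0, 1, 0))
# Two equivalent ways to specify the same binomial model
m_counts <- glm(cbind(s, f) ~ E, family = binomial, data = dat)
m_props  <- glm(s / (s + f) ~ E, weights = s + f, family = binomial, data = dat)
# Both forms give identical coefficients
stopifnot(isTRUE(all.equal(coef(m_counts), coef(m_props))))
```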

Note that I did not include the last column! Leaving it out is equivalent to combining the C = Yes and C = No data. That is okay; sometimes you might want to do that, and we'll talk more about it later. Let's look at the outcome of our model by calling `summary(m1)` (note that I have truncated the results somewhat):

```
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.69315    0.05477  12.655  < 2e-16 ***
E           -0.28768    0.06831  -4.211 2.54e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 198.56  on 3  degrees of freedom
Residual deviance: 180.65  on 2  degrees of freedom
AIC: 213.01
```

Looking at the coefficient for the exposure, we see that it is significant and negative. So the presence of the exposure decreases ${\rm logit}(p)$ and hence decreases $p$; since $p$ is the probability of disease, that's a point in the exposure's favor. Now what about that fourth column, C? Shouldn't we put that in the model? To do so, we run:
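With a single binary predictor, these estimates can be checked by hand from the collapsed counts (a sanity check, not from the original post):

```r
# Collapse over C: totals for unexposed (E = 0) and exposed (E = 1)
s0 <- 400 + 600;  f0 <- 100 + 400    # 1000 diseased, 500 healthy
s1 <- 1200 + 300; f1 <- 600 + 400    # 1500 diseased, 1000 healthy
a <- log(s0 / f0)        # intercept = log-odds of disease when E = 0
b <- log(s1 / f1) - a    # E coefficient = log odds ratio for exposure
# exp(b) is about 0.75: exposure multiplies the odds of disease by 0.75
stopifnot(all.equal(a,  0.69315, tolerance = 1e-4))
stopifnot(all.equal(b, -0.28768, tolerance = 1e-4))
```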

```r
m2 = glm(cbind(s, f) ~ E + C, family=binomial)
```

We get the following result by running `summary(m2)`:

```
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.40547    0.05893   6.881 5.94e-12 ***
E           -0.69315    0.07746  -8.948  < 2e-16 ***
C            0.98083    0.07454  13.159  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.9856e+02  on 3  degrees of freedom
Residual deviance: 2.4425e-13  on 1  degrees of freedom
AIC: 34.353
```
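That near-zero residual deviance means the model with E and C reproduces the observed proportions in all four cells exactly. We can verify this with a short sketch that refits the model on the in-memory data:

```r
dat <- data.frame(s = c(1200, 400, 300, 600),
                  f = c(600, 100, 400, 400),
                  E = c(1, 0, 1, 0),
                  C = c(1, 1, 0, 0))
m2 <- glm(cbind(s, f) ~ E + C, family = binomial, data = dat)
# Fitted probabilities match the raw cell proportions s / (s + f)
stopifnot(all.equal(unname(fitted(m2)), dat$s / (dat$s + dat$f),
                    tolerance = 1e-6))
```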

We see that the C coefficient is also significant, and positive. The residual deviance is much lower, so much more of the variation in the data is explained under this new model. So is this new model better? It might be, and it might not! If the new variable C is a common consequence of both the exposure and the disease (i.e., a *collider*), then it doesn't make much sense to include it as a predictor of the disease. If C were just some other treatment, say an indicator of whether a patient was placed in a "takes a multivitamin" treatment group, then it would make sense to include it in the model.
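In fact, you can see where the new E coefficient of -0.69315 comes from by computing the exposure log odds ratio separately within each C stratum (another by-hand check):

```r
# Within C = 1: exposed odds 1200/600 vs unexposed odds 400/100
b_C1 <- log(1200 / 600) - log(400 / 100)
# Within C = 0: exposed odds 300/400 vs unexposed odds 600/400
b_C0 <- log(300 / 400) - log(600 / 400)
stopifnot(all.equal(b_C1, -0.69315, tolerance = 1e-4))
stopifnot(all.equal(b_C0, -0.69315, tolerance = 1e-4))
```

The exposure effect is identical in both strata, which is exactly why the residual deviance of `m2` is essentially zero.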

The problem of including a collider, called *collider bias*, is explained very nicely and in much more detail in a paper by Janszky, Ahlbom, and Svensson [1], where they discuss the hypothetical data set we examined here and one other.

[1] Janszky, Imre, Anders Ahlbom, and Anna C. Svensson. "The Janus face of statistical adjustment: confounders versus colliders." European journal of epidemiology 25.6 (2010): 361-363.