# Binomial logistic regression models in R

In this post, we will look at a simple binomial logistic regression model in R. First, let's take a look at the following hypothetical data, taken from [1]:

|                | C = Yes, Disease = Yes | C = Yes, Disease = No | C = No, Disease = Yes | C = No, Disease = No |
|----------------|-----------------------:|----------------------:|----------------------:|---------------------:|
| Exposure = Yes | 1200                   | 600                   | 300                   | 400                  |
| Exposure = No  | 400                    | 100                   | 600                   | 400                  |

It is a contingency table. With this data we want to predict the probability of getting a disease based on exposure. We assume that some people are exposed to something (not necessarily the disease itself), and we count the number of people who get the disease. The exposure could be something beneficial, like an experimental treatment of "exercises at least twenty minutes a week on a treadmill," and the disease could be "heart disease." Notice that there is a second variable, C, that we could also incorporate into our model. To get this data into R, it is easiest to have it in a comma-separated file like this:
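A minimal version of such a file might look like the following. The column names and their ordering are my own choice (I have put C last, so that it is the fourth column), and any consistent naming works:

```
Exposure,DiseaseYes,DiseaseNo,C
Yes,1200,600,Yes
Yes,300,400,No
No,400,100,Yes
No,600,400,No
```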

Then we can read it into a data frame as follows:
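Assuming the file above was saved as disease.csv (the file name is my choice), a sketch of the read step:

```r
# Read the contingency-table counts into a data frame; the character
# columns become factors so glm() can use them as categorical predictors
d <- read.csv("disease.csv", stringsAsFactors = TRUE)
```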

The way to fit a binomial logistic regression model is as follows:
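A sketch of the fitting call, assuming the data frame d and the column names from above:

```r
# Each row contributes one binomial observation: DiseaseYes successes
# out of DiseaseYes + DiseaseNo trials
m1 <- glm(cbind(DiseaseYes, DiseaseNo) ~ Exposure,
          family = binomial, data = d)
```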

Let's go over what this actually means under the hood. We are modeling the probability $p$ as
$$p(E) = \frac{\exp(a + bE)}{1 + \exp(a + bE)}.$$ This model is not the typical linear model. Another way to say this is that it models the logit of the probability linearly:
$${\rm logit}(p) := \log\left(\frac{p}{1-p}\right) = a + bE.$$ Of course, as usual, you can put much more complex expressions on the right-hand side. In R, the way to do this is to call the glm function. Since we have counts of successes and failures, we pass them as a two-column matrix built with cbind. We then give the model formula and specify the "binomial" family, which means we treat each group as an independent binomial experiment.
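To make the link function concrete: base R's plogis computes exactly this inverse logit $\exp(x)/(1+\exp(x))$, and qlogis computes the logit. The coefficient values below are placeholders for illustration, not fitted values:

```r
a <- 0.5; b <- -0.3          # placeholder coefficients
E <- 1                       # exposure indicator
p <- exp(a + b * E) / (1 + exp(a + b * E))
all.equal(p, plogis(a + b * E))  # TRUE: plogis is the inverse logit
qlogis(p)                        # recovers a + b * E = 0.2
```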

Note that I did not include the last column! Omitting it is equivalent to combining the C = Yes and C = No data. That is okay; sometimes you might want to do that, and we'll talk more about it later. Let's look at the outcome of our model by calling summary(m1) (note I have truncated the results somewhat):
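The full printout is not reproduced here, but because Exposure is the only predictor, the fitted coefficients can be checked by hand against the pooled counts (1500 diseased out of 2500 exposed, 1000 out of 1500 unexposed):

```r
# Intercept: the logit of P(disease | unexposed)
qlogis(1000 / 1500)                        # log(2)    ~  0.693
# Exposure coefficient: the difference in logits between groups
qlogis(1500 / 2500) - qlogis(1000 / 1500)  # log(0.75) ~ -0.288
```

The exposure coefficient works out negative, matching the summary output discussed next.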

Looking at the coefficient for the exposure, we see that it is significant and negative. So the presence of the exposure decreases ${\rm logit}(p)$ and hence decreases $p$; that's a good sign for the exposure. Now, what about that fourth column, C? Shouldn't we put that in the model? To do so, we run:
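A sketch of the expanded call, again assuming the data frame and column names from above:

```r
# Same binomial response, now with C as an additional predictor
m2 <- glm(cbind(DiseaseYes, DiseaseNo) ~ Exposure + C,
          family = binomial, data = d)
```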

We get the following result by running summary(m2):
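The full output is again omitted; the call, and the two features of its output that matter here:

```r
summary(m2)
# Look for: the sign and significance of the C coefficient, and how the
# residual deviance compares with that of m1
```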

We see that the C coefficient is also significant, and positive. The residual deviance is much lower, so much more of the variation in the data is explained under this new model. So is this new model better? It might be, and it might not! If the new variable C is a common consequence of both the exposure and the disease (i.e., a collider), then it doesn't make much sense to include it as a predictor of the disease. If C were instead just some other treatment, say an indicator of whether a patient was placed in a "takes a multivitamin" treatment group, then it would make sense to include it in the model.

The problem of including a collider, called collider bias, is explained very nicely and in much more detail in a paper by Janszky, Ahlbom, and Svensson [1], where they discuss the hypothetical data set we examined here and one other.

[1] Janszky, Imre, Anders Ahlbom, and Anna C. Svensson. "The Janus face of statistical adjustment: confounders versus colliders." European journal of epidemiology 25.6 (2010): 361-363.