Those pesky p-values and simulated p-values

Last time, we investigated the use of the $\chi^2$-test in the following experiment described in [1]: a plot of land was divided into 112 20m by 20m squares. Half of these squares received an artificially large perch, and the researchers observed which squares were chosen by Red-winged Blackbirds to make their nests. The results were as follows:

    Yes No
Nest Yes 16 7
  No 40 49

It’s important to realize that “yes” for the nest factor means at least one nest. That is also something to consider for later. We also saw in the previous post how to input this table into R and do a $\chi^2$-test. This would be as follows:

> tab = as.table(matrix(c(16,7,40,49),ncol=2,byrow=T))
> summary(tab)
Number of cases in table: 112 
Number of factors: 2 
Test for independence of all factors:
	Chisq = 4.432, df = 1, p-value = 0.03527

Under these results, the authors conclude that the placement of the artificial perches likely makes a decision in how Red-winged Blackbirds select their nests.

Does this test actually make sense in this case?

We know that the $\chi^2$ distribution, being the square of the normal distribution, is actually a continuous random variable. In particular, the $\chi^2$ distribution can assume any value on the interval $[0,\infty)$. But people noticed that if you make this compensation, the $p$-value that you get out of this approximation may actually be smaller than the true $p$-value. Therefore, Yates introduced a continuity correction. You can actually see this corrected test if you used the chisq.test functionality in $R$:

> chisq.test(tab)

	Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 3.5017, df = 1, p-value = 0.06131

Now we have a larger $p$-value, that would not pass the usual $\alpha=0.05$ rejection level. Of course, I am definitely not a fan of the dichotomy between “do not reject the null” and “reject the null”; rather, smaller $p$-values are more like greater evidence of some effect, roughly speaking. However, the best way to calculate the $p$-value is simulated, rather than using the Yates correction.

In a simulated $p$-value, many random contingency tables are generated under the null distribution and the proportion of those tables whose test statistic is equal to or greater than the observed is the simulated $p$-value. You can actually obtain this simulated $p$-value in $R$ as follows:

> chisq.test(tab, simulate.p.value=T, B=50000)

	Pearson's Chi-squared test with simulated p-value (based on 50000

data:  tab
X-squared = 4.4319, df = NA, p-value = 0.05914

The B variable is the number of trials. If you don’t specify it, you get the default of B = 5000. In this case, we see that the Yates correction is actually closer to the true $p$-value than the naive $\chi^2$ approximation. In any case, approximation is cheap so you should actually use it all the time if you are set on the $\chi^2$ test-statistic. It makes much more logical sense, but there is also a sense in which none of the methods tells you much more than any of the others. A big flaw here was the lack of directionality. That’s why it’s often superior to resort to generalized linear models such as binomial logistic regression. Still, we should also balance that note by recognizing that within the context of the particular example, the observation data is really only the first step to understanding the phenomenon and statistics performed at this stage can only tell you so much. This smallish $p$-value should really be an impetus for the next stage of experimentation, if feasible. For example, it might be a nice idea to use observations and behaviour codings to determine the exact benefits from the artificial perches (and in fact, the authors did do that simultaneously in some ways, which is great).

I want to end this post by explaining a bit more about simulated $p$-values. As I said, I like them because they are always superior to approximations, though of course the best from a theoretically elegant point of view is an exact solution. But anyway, here is some code that I think is fairly self-explanatory that would also calculate a simulated $p$-value for our Red-winged Blackbird perch discussion:

import random


def chisq_distance(observed, expected):
    result = 0
    for i in range(len(observed)):
        result += (observed[i] - expected[i])**2/expected[i]

def expected_under_null(observed):
    r_1 = observed[0] + observed[1]
    r_2 = observed[2] + observed[3]
    c_1 = observed[0] + observed[2]
    c_2 = observed[1] + observed[3]
    t = r_1 + r_2
    return([r_1/t*c_1, r_1/t*c_2, r_2/t*c_1, r_2/t*c_2])

def random_table2(observed):
    r_1 = observed[0] + observed[1]
    r_2 = observed[2] + observed[3]
    c_1 = observed[0] + observed[2]
    c_2 = observed[1] + observed[3]
    t = r_1 + r_2
    select_from = [1]*c_1 + [0]*c_2
    selected = random.sample(select_from, r_1)
    a_11 = selected.count(1)
    a_12 = selected.count(0)
    a_21 = c_1 - a_11
    a_22 = c_2 - a_12

def chisq_test(observed, iters = 20000):
    expected = expected_under_null(observed)
    test_stat = chisq_distance(observed, expected)
    print(f"Expected under null: {expected}")
    print(f"Test statistic chi^2 = {test_stat}")
    worse = 0
    for i in range(iters):
        current = random_table2(observed)
        ctest = chisq_distance(current, expected)
        if ctest >= test_stat:
            worse += 1
    p_value = worse/iters
    print(f"p-value = {p_value}")

chisq_test([16, 7, 40, 49])

Here is the output:

Expected under null: [11.5, 11.5, 44.5, 44.5]
Test statistic chi^2 = 4.431851489985345
p-value = 0.0572

Note that the $p$-value fluctuates a bit since it is random. You can always increase the number of iterations to increase the accuracy as well. This code’s logic says: take the column marginals as fixed, and then select from all of them a subset the size of the first row marginal. In the context of the Red-winged Blackbirds: select 23 out of the random 112 squares to have at least one nest, and then count the number of those squares with artificial perches in them. Then repeat this experiment thousands of times to form an experimental distribution for the test statistic from which a $p$-value can be approximated.

Finally, we should note that under these scheme, the authors of the paper did not take into account the actual number of nests per square, only whether a square had a nest. One could imagine that this would also be important: for example, if several of the squares with artificial perches had dozens of nests, that could actually be more evidence for the usefulness of the perch. Also, it was not clear whether within the levels of perch status, the nests were really distributed randomly. If not, that could mean other factors that besides the presence of a perch that would influence the results. This is not related to the subtlety of computing $p$-values, but more along the lines of the validity of the tests implied by the experimental design.

Incidentally, it is somewhat misleading to call our test the $\chi^2$-test using simulated $p$-values since the distribution of the test statistic is not really a $\chi^2$-distribution.


1. Yasukawa, Ken, Lynn K. Whittenberger, and Tracy A. Nielsen. “Anti-predator vigilance in the red-winged blackbird, Agelaius phoeniceus: do males act as sentinels?.” Animal Behaviour 43.6 (1992): 961-969.

Leave a comment

Fields marked with * are required. LaTeX snippets may be entered by surrounding them with single dollar signs. Use double dollar signs for display equations. NOTE: Comments are held for moderation, so please wait a day or two for them to appear.