In statistical hypothesis testing, the p-value is the probability of obtaining a result at least as extreme as the observed data, assuming the null hypothesis is true. Here, "as extreme" is measured by a test statistic, which has some known distribution under the null hypothesis.

In the natural sciences, the typical workflow goes like this: perform an experiment, then apply a statistical test whose null hypothesis is related to the scientific hypothesis under study. If the p-value is less than 0.05, the null hypothesis is rejected. This in turn can guide our beliefs about the corresponding scientific hypothesis.

For a concrete example, I took an actual coin and flipped it 32 times. I got 19 heads and 13 tails.

A $\chi^2$-test with the null hypothesis being "equal probabilities of heads and tails" gives a p-value of 0.2888. Based on this data and a significance level of 0.05, we do not reject the null hypothesis. So, it seems like we don't have much evidence to say that the coin isn't fair.
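This p-value is easy to reproduce in R; a quick sketch, using the fact that `chisq.test` on a vector of counts runs the equal-probabilities goodness-of-fit test by default:

```r
# Observed counts: 19 heads and 13 tails out of 32 flips
flips <- c(heads = 19, tails = 13)

# Goodness-of-fit test against equal probabilities (the default)
chisq.test(flips)
# X-squared = 1.125, df = 1, p-value = 0.2888
```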

Often there is more than one well-known test that can be used, simply because you can compute any sort of test statistic that you want. In such cases, results under the null may be strongly dependent on the particular test statistic used. I’d like to illustrate this with goodness-of-fit testing for normality. There are quite a few ways to test for normality. One method is the Kolmogorov-Smirnov test, and another is the Shapiro-Wilk test. Here is a little R script that looks at these two different methods of testing for normality:

```r
repetitions = 15000
pvalues1 = c()
pvalues2 = c()
for (i in 1:repetitions){
  dat = rnorm(200)
  pvalues1[i] = shapiro.test(dat)$p.value
  pvalues2[i] = ks.test(dat, "pnorm", 0, 1, exact = TRUE)$p.value
}
```

Each iteration generates 200 iid samples from a normal distribution with mean zero and standard deviation one (i.e. the "standard normal"), then runs the two normality tests on them. The script repeats this for 15000 samples and stores the p-values computed for each test, so you can directly compare how the two tests behave. First, as a sanity check, let's check out the histogram of p-values:

Thank goodness, both of them are approximately uniformly distributed on $[0,1]$. In the long run, p-values computed from a continuous test statistic should be uniformly distributed under the null hypothesis, which follows from the definition of the p-value. For example:

```r
> sum(pvalues1 <= 0.05) / length(pvalues1)
[1] 0.04953333
> sum(pvalues2 <= 0.05) / length(pvalues2)
[1] 0.04806667
```
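The uniformity itself is a short consequence of the probability integral transform: if the test statistic $T$ has a continuous CDF $F$ under the null, then the p-value is $p = 1 - F(T)$, and for any $\alpha \in [0,1]$,

$$P(p \le \alpha) = P(F(T) \ge 1 - \alpha) = 1 - (1 - \alpha) = \alpha,$$

which is exactly the CDF of the uniform distribution on $[0,1]$.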

The p-values may not always look uniformly distributed, however. This can happen when the p-value is only approximated, for instance via an asymptotic approximation to the null distribution rather than an exact computation.

Back to these tests: each is making a Type I error about 5% of the time, which is what we expected. It's a little interesting, however, that in this experiment, a dataset was rejected by *both* tests as being normal only 0.22% of the time. In fact, here is a plot of the p-values for the K-S test against those of the Shapiro-Wilk test:

Striking, isn’t it? Even though they make Type I errors the expected number of times, they make these errors at *different* times.
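The joint rejection rate and the scatter plot can be reproduced along these lines; a sketch, reusing the `pvalues1` (Shapiro-Wilk) and `pvalues2` (K-S) vectors from the simulation script above:

```r
# Fraction of datasets rejected by BOTH tests at the 0.05 level
# (about 0.0022, i.e. 0.22%, in the run above -- close to the
#  0.05 * 0.05 = 0.25% you'd expect if the errors were independent)
mean(pvalues1 <= 0.05 & pvalues2 <= 0.05)

# Scatter plot of K-S p-values against Shapiro-Wilk p-values
plot(pvalues1, pvalues2,
     xlab = "Shapiro-Wilk p-value", ylab = "K-S p-value", pch = ".")
```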