In statistical hypothesis testing, the p-value is the probability, computed under the null hypothesis, of getting a result at least as extreme as the observed data. Here, "extreme" is measured by a test statistic, which has some known (exact or approximate) distribution under the null hypothesis.

In the natural sciences, an experiment is performed. A statistical test is used whose null hypothesis is supposed to be related to the scientific hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected. This in turn can guide our beliefs about the corresponding scientific hypothesis.

For a concrete example, I took an actual coin and flipped it 32 times. I got 19 heads and 13 tails.

A $\chi^2$-test with the null hypothesis being "equal probabilities of heads and tails" gives a p-value of 0.2888. Based on this data and a rejection level of 0.05, we do not reject the null hypothesis. So, it seems like we don't have much evidence to say that the coin isn't fair.
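As a sanity check, the test above is easy to reproduce in R. This is a sketch assuming the default goodness-of-fit behavior of `chisq.test` when given a vector of counts (uniform expected frequencies, i.e. a fair coin):

```r
# Goodness-of-fit chi-squared test: 19 heads, 13 tails,
# against the null hypothesis of equal probabilities.
observed <- c(heads = 19, tails = 13)
result <- chisq.test(observed)  # default: uniform expected counts (16 and 16)

result$statistic  # X-squared = (3^2 + 3^2) / 16 = 1.125 on 1 df
result$p.value    # approximately 0.2888, the value quoted above
```

With 1 degree of freedom and a statistic of 1.125, the tail probability works out to about 0.2888, matching the quoted p-value.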

Often there is more than one well-known test that can be used, simply because you can compute any test statistic you want. In such cases, the resulting p-value may depend strongly on the particular test statistic chosen. I'd like to illustrate this with goodness-of-fit testing for normality. There are quite a few ways to test for normality: one is the Kolmogorov-Smirnov test, another is the Shapiro-Wilk test. Here is a little R script that compares these two methods of testing for normality:
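The original script is not reproduced here; the following is a minimal sketch of such a comparison, assuming simulated data (a heavy-tailed t-distribution as the non-normal example), since the actual data used are not given:

```r
# Compare Kolmogorov-Smirnov and Shapiro-Wilk p-values on the same sample.
set.seed(1)  # assumed seed, for reproducibility
n <- 50

# Data that are NOT normal: draws from a t-distribution with heavy tails
x <- rt(n, df = 3)

# KS test against a normal with the sample's mean and sd.
# (Caveat: estimating the parameters from the data makes this
# KS p-value only approximate.)
ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$p.value

# Shapiro-Wilk test of normality on the same sample
shapiro.test(x)$p.value
```

On data like these, the two tests can return quite different p-values for the same sample, which is the point being illustrated: the conclusion can hinge on which test statistic you picked.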
