Suppose we have observations from a known probability distribution whose parameters are unknown. How should we estimate the parameters from our observations?

Throughout we’ll focus on a concrete example. Suppose we observe a random variable drawn from the uniform distribution on $[0,\theta]$, but we don’t know what $\theta$ is. Our one observation is the number $a$. How can we estimate $\theta$?

One method is the ubiquitous *maximum likelihood* estimator. With this method, we plug our observation into the density function and maximize the result with respect to the unknown parameter. The uniform distribution has density $f(x) = 1/\theta$ on the interval $[0,\theta]$ and zero elsewhere. Viewed as a function of $\theta$, the likelihood $f(a) = 1/\theta$ is decreasing for $\theta \geq a$, and if $\theta$ were any smaller than $a$, then $f(a)$ would be zero. So the likelihood is maximized when $\theta = a$.

Also, it’s easy to see that if we draw $n$ samples $a_1,\dots,a_n$ from this distribution, the maximum likelihood estimator for $\theta$, which is the value of $\theta$ that maximizes the joint probability density function, is $\max_i \{a_i\}$.
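A quick sketch of why the sample maximum maximizes the joint density (the function and sample values here are my own illustration, not from the post):

```python
def likelihood(theta, samples):
    """Joint density of n iid Uniform(0, theta) observations, as a function of theta."""
    if theta < max(samples):
        return 0.0  # some observation would fall outside [0, theta]
    return theta ** (-len(samples))  # product of n copies of 1/theta

samples = [0.8, 2.1, 1.5]
best = max(samples)
# The likelihood vanishes below max(samples) and decreases above it,
# so it peaks exactly at theta = max(samples) = 2.1.
assert likelihood(best, samples) > likelihood(best + 0.5, samples)
assert likelihood(best - 0.5, samples) == 0.0
```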

For our single sample $a$, the maximum likelihood estimator might seem a little off. If I told you $a$ was observed from a random variable distributed uniformly on $[0,\theta]$, would you choose $\theta = a$ as the estimator? Nevertheless, it’s clear that for large samples, the maximum likelihood estimator does a good job. For example, let’s run a random number generator to sample 30 points independently from a uniform distribution on $[0,3]$. Here is how the maximum likelihood estimator gets closer to the true value as the observations come in:

The y-value is the estimate computed from the first x observations. That is, it illustrates how our estimate approaches the true value over time. After the full 30 points have been observed, the estimate for $\theta$ is $\hat{\theta} = 2.966977$. Not bad.
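A minimal sketch of how such a running estimate can be computed (seed and variable names are my own; the values will not match the figure exactly):

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility
theta = 3.0
samples = [random.uniform(0, theta) for _ in range(30)]

# After k observations the ML estimate is the maximum of the first k samples,
# so the running estimate is just a running maximum.
running_ml = []
current_max = 0.0
for a in samples:
    current_max = max(current_max, a)
    running_ml.append(current_max)

print(running_ml[-1])  # final estimate after all 30 points
```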

There is another way to estimate $\theta$: the method of moments. What’s a moment? The moments of a random variable $X$ are the numbers $E(X^n)$ for $n=1,2,\dots$, provided they exist, where $E(-)$ denotes the expected value operator. The idea is that if you can write $\theta$ (in general, the parameter to be estimated) as a function $G(-)$ of the moments, then the moment estimator of $\theta$ is $G(-)$ applied to the sample moments computed from the data.

It becomes clear once we consider our example. The first moment of the uniform distribution on $[0,\theta]$ is just the average, or $\theta/2$. So, the method of moments tells us to compute the average of the observed data, and then multiply by two to get an estimate for $\theta$!
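Spelled out, the computation behind this is:

$$E(X) = \int_0^\theta x \cdot \frac{1}{\theta}\,dx = \frac{\theta}{2}, \qquad \text{so} \qquad \theta = 2\,E(X).$$

Replacing $E(X)$ by the sample mean gives the moment estimator.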

What does this give us? For our single observed value $a$, the method of moments tells us to use $\hat{\theta}= 2a$ as an estimator for $\theta$. Wow! In general, it tells us to use twice the average. And just like the maximum likelihood method, in the long run it converges to the true parameter. Using the same dataset as above, this is how our moment estimator performs as the observations come in:
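The running moment estimate can be sketched in the same style as before (my own seed and names; the values won’t match the figure):

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility
theta = 3.0
samples = [random.uniform(0, theta) for _ in range(30)]

# After k observations the moment estimate is twice the mean of the first k samples.
running_mom = []
total = 0.0
for k, a in enumerate(samples, start=1):
    total += a
    running_mom.append(2 * total / k)

print(running_mom[-1])  # final estimate after all 30 points
```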

After all the data points, our moments estimator is 3.707783.

It’s not hard to see why the maximum likelihood estimator works so well in this case: if you divide the interval $[0,\theta]$ into $n$ equal-length subintervals, then after $n$ observations you can expect, on average, one observation in the right-most subinterval. Therefore, after 10 observations, on average you can expect one observation to be no more than $\theta/10$ away from $\theta$, and that observation is the maximum likelihood estimate. Not only that, but once the maximum likelihood estimator gets closer to the true value of $\theta$, it never gets farther away again, since the running maximum can only increase and never exceeds $\theta$. That’s not so with the moments estimator.

As we observed, to get within $\theta/n$ of the true parameter with maximum likelihood, you’ll need on average $n$ observations. What about the method of moments? Here is a simulation estimate of the expected number of steps needed to reach the indicated precision level with both types of estimators, using a thousand simulations for each precision level:

| Precision | Steps for ML | Steps for Moments | sd for ML steps | sd for Moments steps |
|---|---|---|---|---|
| $\theta/5$ | 5.18 | 4.56 | 4.60 | 5.20 |
| $\theta/10$ | 9.24 | 9.28 | 8.89 | 14.82 |
| $\theta/20$ | 19.49 | 25.76 | 20.01 | 69.93 |

The expected number of steps for ML, estimated by simulation, is pretty close to what we would intuitively expect: to get an estimate at most $\theta/n$ away from $\theta$, you need on average about $n$ steps. This also seems to be a good approximation of the truth for the method of moments, and I do not think a theoretical derivation of what’s going on here would be overly difficult.
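A sketch of the kind of simulation that could produce such a table (my own implementation; I use fewer trials than the thousand used above, so the numbers will be noisier):

```python
import random

def steps_to_precision(estimate, theta, tol, max_steps=100_000):
    """Observations needed until the running estimate first lands within tol of theta."""
    samples = []
    for step in range(1, max_steps + 1):
        samples.append(random.uniform(0, theta))
        if abs(estimate(samples) - theta) <= tol:
            return step
    return max_steps

random.seed(0)  # arbitrary seed, for reproducibility
theta = 3.0
ml = lambda s: max(s)                # maximum likelihood estimate
mom = lambda s: 2 * sum(s) / len(s)  # moment estimate

trials = 200
for tol in (theta / 5, theta / 10, theta / 20):
    avg_ml = sum(steps_to_precision(ml, theta, tol) for _ in range(trials)) / trials
    avg_mom = sum(steps_to_precision(mom, theta, tol) for _ in range(trials)) / trials
    print(f"tol={tol:.2f}: ML avg {avg_ml:.2f}, Moments avg {avg_mom:.2f}")
```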

In the table, I have also included the standard deviation of the number of steps for each method. This is where the method of moments seems to lose: the larger standard deviations, particularly as $n$ gets bigger, suggest that the moment estimator of twice the average behaves far less consistently than the maximum likelihood estimator. It would be very interesting to find a formula for this standard deviation, either asymptotic or exact.

Thus, in practice, it looks like maximum likelihood is the winner for this problem!