Likelihood and entropy
19 July 2014

For one work project I have been playing with information-theoretic approaches to statistics. This route seemed the most natural, given that the core element of the problem involved minimizing the “distance” between two distributions. We finally settled on an empirical likelihood approach, although I very much like the name “stochastic inverse problem” used in Judge and Mittelhammer (2012), as it perfectly describes the fact that the model takes the individual likelihoods as parameters and not as inputs.

Take, for instance, the ML estimator of the mean of a normal distribution with unknown variance. In that case, the contribution of individual $i$ to the likelihood is

$$L_i(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right),$$
which means that the log-likelihood problem to solve is

$$\max_{\mu,\,\sigma^2}\ \sum_i \left[-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2}\right],$$
with the well-known results that link the sample statistics to their population counterparts.
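Spelling those results out, the first-order conditions of the problem above give the familiar estimators

$$\hat{\mu} = \frac{1}{n}\sum_i x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i - \hat{\mu})^2.$$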

We can, however, think about the inverse problem in which we take the individual probabilities as parameters and impose only a minimal set of requirements on the population distribution. As a matter of fact, we can dispense with assuming a particular functional form for the distribution and simply set some moment conditions that we want our estimates to satisfy. For instance, in the example above, rather than assuming a normal distribution for each observation, we can simply assume that the data come from an unknown distribution with one parameter (for simplicity) and set it, using the analogy principle, to its sample value. With that, we would be calculating the vector of probabilities $p$ that satisfies the natural constraints $\sum_i p_i = 1$ and $\sum_i p_i(x_i-\mu) = 0$. The problem then becomes

$$\max_{p,\,\mu}\ \sum_i \log p_i + \lambda_1\left(1 - \sum_i p_i\right) + \lambda_2 \sum_i p_i (x_i - \mu),$$

where $\mu$ and $p$ can be recovered as functions of the Lagrange multipliers $\lambda_1$ and $\lambda_2$.
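Just to fix ideas, here is a minimal numerical sketch of that program using `numpy` and `scipy.optimize` (this is not the code from the work project; the simulated sample, starting values, and choice of solver are illustrative assumptions). With $\mu$ left free, the solver should land on uniform weights and the sample mean:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=1.5, size=50)  # illustrative sample
n = len(x)

# Decision vector: theta = (p_1, ..., p_n, mu)
def neg_objective(theta):
    p = theta[:n]
    return -np.sum(np.log(p))  # maximize the empirical log-likelihood sum_i log p_i

constraints = [
    {"type": "eq", "fun": lambda th: np.sum(th[:n]) - 1.0},           # sum_i p_i = 1
    {"type": "eq", "fun": lambda th: np.sum(th[:n] * (x - th[-1]))},  # sum_i p_i (x_i - mu) = 0
]
bounds = [(1e-8, 1.0)] * n + [(None, None)]  # p_i strictly positive, mu unrestricted

theta0 = np.concatenate([np.full(n, 1.0 / n), [0.0]])  # start at uniform weights, mu = 0
res = minimize(neg_objective, theta0, method="SLSQP",
               bounds=bounds, constraints=constraints)

p_hat, mu_hat = res.x[:n], res.x[-1]
print(mu_hat, x.mean())          # mu_hat should be (numerically) close to the sample mean
print(p_hat.min(), p_hat.max())  # and the weights should be close to 1/n
```

In this toy example the moment condition does not really bind, so the answer reproduces the classical one, but the same structure carries over to richer sets of moment conditions.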

What is truly fascinating about it is that the model is very closely connected to the notions of entropy and Kullback-Leibler divergence. Take, for instance, the KL divergence between two discrete distributions characterized by $p=(p_1, \dots, p_n)$ and $q=(q_1, \dots, q_n)$, which is

$$KL(p, q) = \sum_i p_i \log\frac{p_i}{q_i}.$$
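Plugging the uniform distribution $n^{-1} = (1/n, \dots, 1/n)$ into the first argument of this definition gives

$$KL(n^{-1}, p) = \sum_i \frac{1}{n}\log\frac{1/n}{p_i} = -\log n - \frac{1}{n}\sum_i \log p_i,$$

so the empirical log-likelihood $\sum_i \log p_i$ and $KL(n^{-1}, p)$ differ only by a sign, a factor of $n$, and the constant $\log n$.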
We can then think of the objective in the program above as $-n\,KL(n^{-1}, p)$, up to an additive constant. Maximizing that expression is therefore equivalent to minimizing the KL divergence $KL(n^{-1}, p)$ between the empirical (uniform) distribution and the candidate weights $p$ (Judge and Mittelhammer 2012, chapter 6). The problem is thus one of finding, among all possible probability vectors $p$, the one that, while being consistent with the data, is closest to the uniform distribution, which is, as we know, the least informative one. We could choose a different reference distribution, but empirical likelihood is trying hard to be consistent with the entropy principle:

Subject to data, the probability distribution which best represents the current state of knowledge is the one with largest remaining uncertainty.

which seems a very natural starting point.