The ancient war between the Bayesians and the accursèd frequentists stretches back through decades, and I’m not going to try to recount that elder history in this blog post.

But one of the central conflicts is that Bayesians expect probability theory to be… what’s the word I’m looking for? “Neat?” “Clean?” “Self-consistent?”

As Jaynes says, the theorems of Bayesian probability are just that, theorems in a coherent proof system. No matter what derivations you use, in what order, the results of Bayesian probability theory should always be consistent – every theorem compatible with every other theorem. […]

Math! That’s the word I was looking for. Bayesians expect probability theory to be math. That’s why we’re interested in Cox’s Theorem and its many extensions, showing that any representation of uncertainty which obeys certain constraints has to map onto probability theory. Coherent math is great, but unique math is even better.

And yet… should rationality be math? It is by no means a foregone conclusion that probability should be pretty. The real world is messy – so shouldn’t you need messy reasoning to handle it? Maybe the non-Bayesian statisticians, with their vast collection of ad-hoc methods and ad-hoc justifications, are strictly more competent because they have a strictly larger toolbox. It’s nice when problems are clean, but they usually aren’t, and you have to live with that.

After all, it’s a well-known fact that you can’t use Bayesian methods on many problems because the Bayesian calculation is computationally intractable. So why not let many flowers bloom? Why not have more than one tool in your toolbox?

That’s the fundamental difference in mindset. Old School statisticians thought in terms of tools, tricks to throw at particular problems. Bayesians – at least this Bayesian, though I don’t think I’m speaking only for myself – we think in terms of laws. […]

No, you can’t always do the exact Bayesian calculation for a problem. Sometimes you must seek an approximation; often, indeed. This doesn’t mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is not made out of atoms. Whatever approximation you use, it works to the extent that it approximates the ideal Bayesian calculation – and fails to the extent that it departs.

Bayesianism’s coherence and uniqueness proofs cut both ways. Just as any calculation that obeys Cox’s coherency axioms (or any of the many reformulations and generalizations) must map onto probabilities, so too, anything that is not Bayesian must fail one of the coherency tests. This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains).

You may not be able to compute the optimal answer. But whatever approximation you use, both its failures and successes will be explainable in terms of Bayesian probability theory. You may not know the explanation; that does not mean no explanation exists.

So you want to use a linear regression, instead of doing Bayesian updates? But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.

You want to use a regularized linear regression, because that works better in practice? Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.

Sometimes you can’t use Bayesian methods literally; often, indeed. But when you can use the exact Bayesian calculation that uses every scrap of available knowledge, you are done. You will never find a statistical method that yields a better answer. You may find a cheap approximation that works excellently nearly all the time, and it will be cheaper, but it will not be more accurate. Not unless the other method uses knowledge, perhaps in the form of disguised prior information, that you are not allowing into the Bayesian calculation; and then when you feed the prior information into the Bayesian calculation, the Bayesian calculation will again be equal or superior.

When you use an Old Style ad-hoc statistical tool with an ad-hoc (but often quite interesting) justification, you never know if someone else will come up with an even more clever tool tomorrow. But when you can directly use a calculation that mirrors the Bayesian law, you’re done – like managing to put a Carnot heat engine into your car. It is, as the saying goes, “Bayes-optimal”.

Eliezer Yudkowsky, Beautiful Probability, 14 January 2008


Added to diary 15 January 2018