Thursday, October 29, 2009

Benford’s Law: Human Intuition, Randomness and Fraud

Suppose we look at some set of fairly natural data, say the populations of various countries, and consider the leading digit of each population. For example, the United States has a population slightly over 300 million, so its leading digit is 3. What fraction of countries would you expect to have a leading digit of 1? Most people would guess 1/9, since there are 9 possibilities (1, 2, 3, 4, 5, 6, 7, 8, 9) and no obvious reason to prefer any digit over any other. However, you’d be wrong: the actual percentage is slightly over 25%. In fact, it turns out that we should expect about 30% of countries to have a leading digit of 1. This strangely large number of 1s shows up frequently in natural data and is known as Benford’s Law. For many natural data sets we expect about 30% of entries to have a leading digit of 1, about 18% to have a leading digit of 2, and so on. Benford’s Law is an important statistical rule that has a variety of generalizations and practical applications (such as detecting fraudulent or manipulated data).

Where is this strange pattern coming from? I didn’t have a good understanding of this until it was explained to me by Steven Miller. Consider a finite list of numbers that is reasonably well-behaved. We can write each member of our list in scientific notation. So, for example, if we had 236 on our list, we would write it as 2.36 * 10^2. Note that the lead digit of the number is the lead digit of the part of the scientific notation that isn’t the power of 10. (This part is sometimes called the mantissa when one wants to be fancy.) Now, for each number x on our list, instead of looking at x itself, we can look at log10 x. What does this do to the scientific notation? Taking the log turns the product into a sum: the log of the mantissa, which lies between 0 and 1, plus an integer, which is the exponent. So, for example, log10 (2.36 * 10^2) = .3729... + 2.
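The 236 example can be checked in a few lines of Python. This is only a sketch: the function name `split_log10` is made up for illustration, and floating-point rounding means the recovered mantissa is approximate.

```python
import math

def split_log10(x):
    """Split log10(x) for x > 0 into its integer and fractional parts.

    Returns (exponent, frac, mantissa), where x is approximately
    mantissa * 10**exponent and frac = log10(mantissa).
    """
    log_x = math.log10(x)
    exponent = math.floor(log_x)   # the integer part: the power of 10
    frac = log_x - exponent        # the non-integer part, between 0 and 1
    mantissa = 10 ** frac          # approximate, because of float rounding
    return exponent, frac, mantissa

exponent, frac, mantissa = split_log10(236)
print(exponent, round(frac, 4), round(mantissa, 2))
```

Only the fractional part matters for the lead digit; the integer part just records how many places the decimal point moved.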

Let’s examine the non-integer part of log10 x (call this f(x)). What distribution do we expect for f(x)? It is a value between 0 and 1 with no clear cutoffs or biases in any direction, so the most obvious thing to do is to assume that it is uniformly distributed. That means that there’s about a 5% chance that f(x) falls below .05, about a 20% chance that f(x) falls below .2, about a 50% chance that f(x) falls below .5, and so on.

So what does this tell us about the leading digit? If the mantissa is below 2, then we have lead digit 1. If the mantissa is between 2 and 3, then we have lead digit 2, and so on. The mantissa is below 2 exactly when f(x) is less than log10 2 = .301. Accordingly, we should expect the mantissa to be below 2 about 30.1% of the time. Thus, a number should have a lead digit of 1 about 30.1% of the time. Similar logic works for the other digits: the lead digit is 2 when f(x) lies between log10 2 = .301 and log10 3 = .477, which should happen about 17.6% of the time, and in general the lead digit is d when f(x) lies between log10 d and log10 (d+1).
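Carrying the same argument through for every digit gives the full table of expected frequencies. A short sketch, assuming only that f(x) is uniform as described above (the name `benford_probability` is mine, not standard):

```python
import math

def benford_probability(d):
    """Chance that the lead digit is d when f(x) is uniform on [0, 1).

    The lead digit is d exactly when f(x) falls between log10(d)
    and log10(d + 1), so the chance is the length of that interval.
    """
    return math.log10(d + 1) - math.log10(d)

benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"lead digit {d}: {p:.1%}")
```

The nine intervals tile the whole range from 0 to 1, so the nine probabilities add up to 1.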

What is wrong with the intuition that every lead digit is just as common? When we calculate probabilities, we are used to using the simplest probability distribution we can imagine, something like picking a positive integer from 1 to 10^n for some fixed n. We are used to this approach primarily because it is easy to calculate. Consequently, most probability problems in high school and college assume such a uniform distribution, since that assumption makes the math much easier. But actual distributions in real life don’t often look like this. For example, we might have a bell curve or some other distribution. For many distributions that arise in nature, especially ones that spread over several orders of magnitude, Benford’s Law will apply due to the logic we used earlier.
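A quick simulation shows the difference between the two pictures. Everything specific here is my own assumption for illustration: the seed, the sample size, and the choice of a log-normal distribution (a bell curve on the log scale) as a stand-in for data that spreads over several orders of magnitude.

```python
import random
from collections import Counter

def leading_digit(x):
    """First nonzero digit of a positive number, read from its decimal string."""
    for ch in repr(x):
        if ch in "123456789":
            return int(ch)
    raise ValueError("no nonzero digit found")

def digit_shares(values):
    """Fraction of values having each lead digit 1 through 9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

rng = random.Random(0)   # fixed seed so the run is repeatable
N = 100_000

# Textbook picture: integers picked uniformly from 1 to 10^6 - 1.
uniform_shares = digit_shares([rng.randint(1, 999_999) for _ in range(N)])

# Assumed stand-in for natural data: log-normal values whose logs
# have a wide spread, so the values cover many orders of magnitude.
spread_shares = digit_shares([rng.lognormvariate(0, 3) for _ in range(N)])

print(f"uniform integers: share of lead digit 1 = {uniform_shares[1]:.3f}")
print(f"wide log-normal:  share of lead digit 1 = {spread_shares[1]:.3f}")
```

The uniform integers should land near 1/9 and the wide log-normal near 30%. Shrinking the spread parameter from 3 to something small, such as 0.1, squeezes the values into a narrow range and the Benford pattern goes away.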

So what does this do for us? Perhaps most importantly, we can use this insight to detect fraud. When humans try to make up data, it often fails to fit Benford’s law. In general, humans are bad at constructing data that passes any minimal test for randomness. Failure to obey a generalized version of Benford’s law was one of the major pieces of evidence for election fraud in the last Iranian election. The recent questions regarding whether Strategic Visions Polling was falsifying poll data arose when Nate Silver noticed that its results diverged substantially from Benford’s law.
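To give a flavor of what such a check might look like, here is a minimal sketch of a chi-square goodness-of-fit test against Benford’s Law. This is a generic textbook test, not the method used in the Iranian election analysis or by Nate Silver, and both data sets below are invented for illustration. The constant 15.507 is the standard 5% critical value for a chi-square statistic with 8 degrees of freedom.

```python
import math
from collections import Counter

# 5% critical value of the chi-square distribution with 8 degrees of
# freedom (9 digit categories minus 1).
CHI2_CRITICAL_5PCT_DF8 = 15.507

def leading_digit(x):
    """First nonzero digit of a positive number, read from its decimal string."""
    for ch in repr(x):
        if ch in "123456789":
            return int(ch)
    raise ValueError("no nonzero digit found")

def benford_chi_square(values):
    """Chi-square statistic comparing observed lead digits with Benford's Law."""
    observed = Counter(leading_digit(v) for v in values)
    n = len(values)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (observed[d] - expected) ** 2 / expected
    return stat

# Invented example 1: powers of 2, which are known to follow Benford closely.
powers_of_two = [2 ** k for k in range(1, 301)]

# Invented example 2: "made up" numbers where every lead digit is equally common.
evenly_spread = list(range(100, 1000))

print(benford_chi_square(powers_of_two) < CHI2_CRITICAL_5PCT_DF8)
print(benford_chi_square(evenly_spread) > CHI2_CRITICAL_5PCT_DF8)
```

A large statistic only says that the digits look unlike Benford’s Law. Plenty of honest data sets fail the test for innocent reasons (for example, values confined to a narrow range), so a result like this is a reason to look closer rather than proof of fraud.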

For more information on Benford’s Law and related patterns in data, as well as more mathematical discussions of that data, see Terry Tao’s blog post from which I shamelessly stole the hard data about populations of nations.

8 comments:

Nathaniel said...

This is probably the clearest way to explain Benford's law.

One of the nicest alternative ways has to do with scale invariance. If, in a given set of quantities, there is any distribution of first non-zero digits that is not merely an artifact of scale, it must be independent of the units in which the measurement is expressed. And if we take (say) the heights of buildings expressed in feet, in yards, etc., we find that they satisfy Benford's law regardless of the choice of units.

There are also some fascinating applications of Benford's Law in probing the distribution of primes.

Nathaniel said...

Here's another way to think about Benford's Law. Suppose that, in a table of values, we find an entry that begins with the digit 1. By what fraction of the total value will we have to increase it in order to change that digit? (No peeking at the remaining digits of the number!) It could be anything up to 100%; we might have to double the original value to flip that leading digit to a 2. But if the leading digit is a 9, we will at most have to increase the total value by about 11% in order to change the leading digit to a 1.

I suspect there is some fancy way to make the logarithmic values drop out of this consideration, but I'm too sleepy right now to work it out. Over to you!

Shalmo said...

http://www.youtube.com/watch?v=69xYD8oWxYg&feature=sub

Etienne Vouga said...

I'm still a bit confused... why do we expect f(x) to be uniformly distributed between 0 and 1? Can't I pick any different f(x) : R -> S^1, repeat your argument, and derive a different Law?

Joshua Zelinsky said...

Shalmo, I'm sorry I just watched the first five minutes of that video. What does it have to do with the original post?

Etienne,

Yes, and in general you will expect that well behaved functions will have a uniform distribution. Thus, for example if I took sqrt(x) mod 1 as my function it turns out for natural data this will have a uniform distribution. What we are seeing in this particular case is a general example of that sort of behavior.

The only specific reason we notice this in the case of Benford's law is because of the natural convenience of base 10 and the well behaved nature of the logarithm.

sniffnoy said...

...so I guess then what's going on is that this works just as well with other functions, except that if we use any function other than a logarithm, we won't be able to draw any conclusions about the first digit?

Joshua said...

Yes,

Although one can construct functions that deliberately don't map well to S^1. But you have to work at it and they generally fail in an obvious fashion.

Anonymous said...
This comment has been removed by a blog administrator.