Aggregation Pipeline: Applying Benford's Law to COVID-19 Data
Disclaimer: This article focuses on the aggregation pipeline and the stages I used to produce the charts in it, not so much on the results themselves, which can be interpreted in many different ways. One of the many issues here is the lack of data: the pandemic didn't start at the same time in all the countries, so many countries don't have enough data to make the percentages accurate. But feel free to interpret these results the way you want...
Benford's law [...] is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading digit is likely to be small. In sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. If the digits were distributed uniformly, they would each occur about 11.1% of the time. Benford's law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on.
Here is the frequency distribution of the first digits that we can expect for a data set that respects Benford's law:
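More precisely, the probability that the leading digit is d is log₁₀(1 + 1/d), which gives:

| Leading digit | Expected frequency |
|---------------|--------------------|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |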
A little further down in Wikipedia's article, in the "Applications" section, you can also read the following:
Accounting fraud detection
In 1972, Hal Varian suggested that the law could be used to detect possible fraud in lists of socio-economic data submitted in support of public planning decisions. Based on the plausible assumption that people who fabricate figures tend to distribute their digits fairly uniformly, a simple comparison of first-digit frequency distribution from the data with the expected distribution according to Benford's law ought to show up any anomalous results.
Put simply, if your data set's distribution follows Benford's law, then it's theoretically possible to detect fraudulent data when a particular subset of the data doesn't follow the law.
In our situation, based on the observation of the first chart above, it looks like the worldwide daily confirmed cases of COVID-19 are following Benford's law. But is it true for each country?
If I want to answer this question (I don't), I will have to build a relatively complex aggregation pipeline (I do 😄).
I will only focus on a single collection in this blog post: `covid19.countries_summary`.
Here is an example:
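A sketch of what such a document looks like (the values are illustrative; `confirmed_daily` and `deaths_daily` are the two fields this pipeline will use):

```js
{
  "_id": ObjectId("..."),                   // illustrative
  "country": "Japan",                       // illustrative
  "date": ISODate("2021-01-26T00:00:00Z"),  // illustrative
  "confirmed_daily": 3853,                  // daily confirmed cases (illustrative)
  "deaths_daily": 107                       // daily deaths (illustrative)
}
```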
As you can see, for each day and country, I have daily counts of the COVID-19 confirmed cases and deaths.
Let's apply Benford's law to these two series of numbers.
Before we start applying stages (transformations) to our documents, let's define the shape of the final documents, which will make them easy to plot in MongoDB Charts.
It's easy to do and defines clearly where to start (the document in the previous section) and where we are going:
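Here is a sketch of that target shape. All the values are illustrative, except the `22.3` for digit 1, which is the value we will compute and reference later in this post:

```js
{
  "country": "US",                                      // illustrative country
  "benford": [
    { "digit": 1, "confirmed": 22.3, "deaths": 36.5 },  // % of 1s in each series
    { "digit": 2, "confirmed": 21.1, "deaths": 14.2 },
    // ... one entry per digit, up to 9
    { "digit": 9, "confirmed": 3.6, "deaths": 4.7 }
  ]
}
```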
Setting the final objective keeps us focused on the target while we work through the successive transformations.
Now that we have a starting and an ending point, let's try to write our pipeline in English first:
- Regroup all the first digits of each count into an array for the confirmed cases and into another one for the deaths for each country.
- Clean the arrays (remove zeros and negative numbers—see note below).
- Calculate the size of these arrays.
- Remove countries with empty arrays (countries without cases or deaths).
- Calculate the percentages of 1s, 2s, ..., 9s in each array.
- Add a fake country "BenfordTheory" with the theoretical values of 1s, 2s, etc. we are supposed to find.
- Final projection to get the document in the final shape I want.
Note: The daily fields that I provide in this collection (`covid19.countries_summary`) are computed from the cumulative counts that Johns Hopkins University (JHU) provides. Simply put: today's count, for each country, is today's cumulative count minus yesterday's cumulative count. In theory, I should see zeros (no deaths or no cases that day), but never negative numbers. But sometimes, JHU applies corrections to the counts without applying them retroactively in the past (as these counts were official counts at some point in time, I guess). So, negative values exist, and I chose to ignore them in this pipeline.
Now that we have a plan, let's execute it. Each of the points in the above list is an aggregation pipeline stage, and now we "just" have to translate them.
First, I need to be able to extract the first character of `$confirmed_daily`, which is an integer.
Here is the first stage:
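Here is a sketch of it, assuming a `$group` by country that pushes the first character of the stringified daily counts (the exact operators in the original pipeline may differ):

```js
// Stage 1 (sketch): one document per country, with one array of
// leading characters for the daily confirmed cases and one for the
// daily deaths.
{
  "$group": {
    "_id": "$country",
    "confirmed": {
      "$push": { "$substrCP": [{ "$toString": "$confirmed_daily" }, 0, 1] }
    },
    "deaths": {
      "$push": { "$substrCP": [{ "$toString": "$deaths_daily" }, 0, 1] }
    }
  }
}
```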
Here is the shape of my documents at this point if I apply this transformation:
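Something like this (values illustrative):

```js
{
  "_id": "Japan",
  "confirmed": [ "2", "1", "4", "3", "-", "1", "0", "1" /* , ... */ ],
  "deaths": [ "1", "0", "5", "2", "1" /* , ... */ ]
}
```

Note the `"0"` and `"-"` entries that the next stage will deal with.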
As mentioned above, my arrays might contain zeros and `-`, which is the leading character of a negative number. I decided to ignore these values for my little mathematical experimentation.
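Here is a sketch of this cleaning stage, using `$filter` to drop the `"0"` and `"-"` entries from both arrays:

```js
// Stage 2 (sketch): remove "0" (zero counts) and "-" (negative
// counts) from both arrays.
{
  "$addFields": {
    "confirmed": {
      "$filter": {
        "input": "$confirmed",
        "as": "digit",
        "cond": {
          "$and": [
            { "$ne": ["$$digit", "0"] },
            { "$ne": ["$$digit", "-"] }
          ]
        }
      }
    },
    "deaths": {
      "$filter": {
        "input": "$deaths",
        "as": "digit",
        "cond": {
          "$and": [
            { "$ne": ["$$digit", "0"] },
            { "$ne": ["$$digit", "-"] }
          ]
        }
      }
    }
  }
}
```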
At this point, our documents in the pipeline have the same shape as previously.
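Next, we calculate the size of these arrays. Here is a sketch, where the field names `confirmed_size` and `deaths_size` are my assumption. Note that we could skip this stage entirely and inline the `$size` computations later, at the cost of repeating them:

```js
// Stage 3 (sketch): store the array sizes so the percentage
// calculations below can reuse them.
{
  "$addFields": {
    "confirmed_size": { "$size": "$confirmed" },
    "deaths_size": { "$size": "$deaths" }
  }
}
```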
Here is the shape of our documents at this point:
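Something like this (sizes illustrative):

```js
{
  "_id": "Japan",
  "confirmed": [ "2", "1", "4", "3", "1", "1" /* , ... */ ],  // cleaned array
  "deaths": [ "1", "5", "2", "1" /* , ... */ ],               // cleaned array
  "confirmed_size": 362,                                      // illustrative
  "deaths_size": 295                                          // illustrative
}
```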
As you can see for Japan, our arrays are relatively long, so we could expect our percentages to be somewhat accurate.
It's far from being true for all the countries...
I'm not good enough at math to decide which array size is large enough to be statistically significant, but good enough to know that my rule of three will need to divide by the size of the array.
As dividing by zero is bad for health, I need to remove empty arrays. A sound statistician would probably also remove the small arrays... but not me 😅.
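Here is a sketch of this stage, reusing the size fields from Stage 3:

```js
// Stage 4 (sketch): drop countries whose cleaned arrays are empty.
{
  "$match": {
    "confirmed_size": { "$gt": 0 },
    "deaths_size": { "$gt": 0 }
  }
}
```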
We are finally at the central stage of our pipeline. I need to apply a rule of three to calculate the percentage of 1s in an array:
- Find how many 1s are in the array.
- Multiply by 100.
- Divide by the size of the array.
- Round the final percentage to one decimal place. (I don't need more precision for my charts.)
Then, I need to repeat this operation for each digit and each array.
To find how many times a digit appears in the array, I can reuse techniques we learned earlier:
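Here is a sketch of that expression for the 1s of the `confirmed` array:

```js
// Count of "1"s (sketch): keep only the "1"s with $filter, then
// measure the resulting array with $size.
{
  "$size": {
    "$filter": {
      "input": "$confirmed",
      "as": "digit",
      "cond": { "$eq": ["$$digit", "1"] }
    }
  }
}
```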
I'm creating a new array which contains only the 1s with `$filter`, and I calculate its size with `$size`.
As a reminder, here is the final document we want:
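Sketched again (same illustrative document as before):

```js
{
  "country": "US",
  "benford": [
    { "digit": 1, "confirmed": 22.3, "deaths": 36.5 }
    // ... entries for digits 2 through 9
  ]
}
```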
The value we just calculated above corresponds to the `22.3` that we have in this document.
At this point, we just need to repeat this operation nine times for each digit of the `confirmed` array, nine more times for the `deaths` array, and assign the results accordingly in the new `benford` array of documents.
Here is what it looks like in the end:
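Here is a trimmed-down sketch of that stage: the block for digit 1 is shown in full, and the exact same pattern repeats for digits 2 to 9:

```js
// Stage 5 (sketch): build the benford array of documents. Only the
// digit-1 entry is shown; the same block is repeated (with the digit
// changed) for digits 2 through 9.
{
  "$addFields": {
    "benford": [
      {
        "digit": 1,
        "confirmed": {
          "$round": [
            { "$divide": [
              { "$multiply": [100, { "$size": { "$filter": {
                "input": "$confirmed",
                "as": "digit",
                "cond": { "$eq": ["$$digit", "1"] }
              } } }] },
              "$confirmed_size"
            ] },
            1
          ]
        },
        "deaths": {
          "$round": [
            { "$divide": [
              { "$multiply": [100, { "$size": { "$filter": {
                "input": "$deaths",
                "as": "digit",
                "cond": { "$eq": ["$$digit", "1"] }
              } } }] },
              "$deaths_size"
            ] },
            1
          ]
        }
      }
      // ... same structure for digits 2 through 9
    ]
  }
}
```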
At this point in our pipeline, our documents look like this:
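Something like this (values illustrative):

```js
{
  "_id": "Japan",
  "confirmed": [ "2", "1", "4" /* , ... */ ],
  "deaths": [ "1", "5", "2" /* , ... */ ],
  "confirmed_size": 362,
  "deaths_size": 295,
  "benford": [
    { "digit": 1, "confirmed": 30.7, "deaths": 33.6 }
    // ... entries for digits 2 through 9
  ]
}
```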
Note: At this point, we don't need the arrays anymore. The target document is almost there.
In my final charts, I wanted to be able to also display Benford's theoretical values alongside the actual values from the different countries, to spot easily which ones are potentially producing fake data (modulo the statistical noise and many other reasons).
Just to give you an idea, it looks like, globally, all the countries are producing legit data but some arrays are small and produce "statistical accidents."
To be able to insert this "perfect" document, I need to introduce in my pipeline a fake and perfect country that has the perfect percentages. I decided to name it "BenfordTheory."
But (because there is always a "but"), as far as I know, there is no stage that can just let me insert a new document like this into my pipeline.
I inserted my fake document into a new collection randomly named `benford`. Note that I made this document look like the documents at this current stage in my pipeline. I didn't care to insert the two arrays because I'm about to discard them anyway.
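Here is a sketch of that one-time insert. The percentages are the theoretical Benford frequencies, log₁₀(1 + 1/d), rounded to one decimal place:

```js
// One-time insert (sketch): the "perfect" country with the
// theoretical Benford percentages for both series. Note that the
// country name is in _id, like the pipeline documents at this stage.
db.benford.insertOne({
  "_id": "BenfordTheory",
  "benford": [
    { "digit": 1, "confirmed": 30.1, "deaths": 30.1 },
    { "digit": 2, "confirmed": 17.6, "deaths": 17.6 },
    { "digit": 3, "confirmed": 12.5, "deaths": 12.5 },
    { "digit": 4, "confirmed": 9.7,  "deaths": 9.7 },
    { "digit": 5, "confirmed": 7.9,  "deaths": 7.9 },
    { "digit": 6, "confirmed": 6.7,  "deaths": 6.7 },
    { "digit": 7, "confirmed": 5.8,  "deaths": 5.8 },
    { "digit": 8, "confirmed": 5.1,  "deaths": 5.1 },
    { "digit": 9, "confirmed": 4.6,  "deaths": 4.6 }
  ]
})
```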
With this new collection in place, all I need to do is add a `$unionWith` stage to append its documents to the ones already flowing through my pipeline.
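Here is a sketch of it:

```js
// Stage 6 (sketch): append the documents of the benford collection
// (here, just BenfordTheory) to the pipeline. Requires MongoDB 4.4+.
{ "$unionWith": { "coll": "benford" } }
```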
At this point, our documents look almost like the target document that we set at the beginning of this blog post. Two differences though:

- The names of the countries are in the `_id` key, not the `country` one.
- The two arrays are still here.

A final projection fixes both.
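Here is a sketch of that final stage:

```js
// Stage 7 (sketch): rename _id to country and keep only the
// benford array.
{
  "$project": {
    "_id": 0,
    "country": "$_id",
    "benford": 1
  }
}
```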
Here is our final result:
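Documents shaped exactly like our initial target (values illustrative):

```js
{
  "country": "Japan",
  "benford": [
    { "digit": 1, "confirmed": 30.7, "deaths": 33.6 },
    { "digit": 2, "confirmed": 18.1, "deaths": 14.2 }
    // ... entries for digits 3 through 9
  ]
}
```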
And please remember that some documents in the pipeline are still based on very small arrays, because I didn't bother to filter them out, so their percentages can look suspicious.
My final pipeline is pretty long because I'm repeating the same block for each digit and each array, for a total of 9 × 2 = 18 times.
Did you think that this pipeline I just presented was perfect?
Well well... It's definitely getting the job done, but we can make it better in many ways. I already mentioned in this blog post that we could remove Stage 3, for example, if we wanted to. It might not be as optimal, but it would be shorter.
Also, there is still Stage 5, in which I literally copy and paste the same piece of code 18 times... and Stage 6, where I have to use a workaround to insert a document in my pipeline.
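For Stage 5, for what it's worth, here is one possible workaround, sketched in mongosh JavaScript (this is my suggestion, not the original pipeline's code): generate the 18 blocks with a small helper instead of copy-pasting them.

```js
// Sketch: build the Stage 5 expression programmatically.
// digitPercent is a hypothetical helper returning the rule-of-three
// expression for one digit and one series.
function digitPercent(arrayField, sizeField, digit) {
  return {
    "$round": [
      { "$divide": [
        { "$multiply": [100, { "$size": { "$filter": {
          "input": arrayField,
          "as": "d",
          "cond": { "$eq": ["$$d", String(digit)] }
        } } }] },
        sizeField
      ] },
      1
    ]
  };
}

// Assemble the benford array for all nine digits and both series.
const benford = [];
for (let digit = 1; digit <= 9; digit++) {
  benford.push({
    digit: digit,
    confirmed: digitPercent("$confirmed", "$confirmed_size", digit),
    deaths: digitPercent("$deaths", "$deaths_size", digit)
  });
}

const stage5 = { "$addFields": { "benford": benford } };
```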
In this blog post, I tried my best to share with you the process of creating a relatively complex aggregation pipeline, along with a few tricks to transform your documents as efficiently as possible.
We talked about and used in a real pipeline the following aggregation pipeline stages and operators: the `$group`, `$addFields`, `$match`, `$unionWith`, and `$project` stages, and operators such as `$push`, `$filter`, `$size`, `$multiply`, `$divide`, and `$round`.
Also, let me know if you can find out if some countries are clearly generating fake data.