Aggregation Pipeline: Applying Benford's Law to COVID-19 Data
John Page, Maxime Beugnet16 min read • Published Jan 07, 2022 • Updated Jan 26, 2023
Rate this article
In this blog post, I will show you how I built an aggregation
pipeline to
apply Benford's law on
the COVID-19 data set that we have made available in the following
cluster:
If you want to know more about this cluster and how we transformed the
CSV files from Johns Hopkins University's repository into clean MongoDB documents, check out this blog post.
Finally, based on this pipeline, I was able to produce a dashboard in MongoDB Charts. For example, here is one Chart that applies Benford's law on the worldwide daily cases of COVID-19:
Disclaimer: This article will focus on the aggregation pipeline and
the stages I used to produce the result I wanted to get to be able to
produce these charts—not so much on the results themselves, which can be
interpreted in many different ways. One of the many issues here is the
lack of data. The pandemic didn't start at the same time in all the
countries, so many countries don't have enough data to make the
percentages accurate. But feel free to interpret these results the way
you want...
This blog post assumes that you already know the main principles of the
aggregation pipeline
and you are already familiar with the most common stages.
If you want to follow along, feel free to use the cluster mentioned
above or take a copy using mongodump or mongoexport, but the main takeaway from this blog post is the techniques I used to
produce the output I wanted.
Also, I can't recommend you enough to use the aggregation pipeline
builder in MongoDB Atlas
or Compass to build your pipelines and play with the ones you will see in this blog post.
Benford's law [...] is an observation about the frequency distribution
of leading digits in many real-life sets of numerical data. The law
states that in many naturally occurring collections of numbers, the
leading digit is likely to be small. In sets that obey the law, the
number 1 appears as the leading significant digit about 30% of the time,
while 9 appears as the leading significant digit less than 5% of the
time. If the digits were distributed uniformly, they would each occur
about 11.1% of the time. Benford's law also makes predictions about the
distribution of second digits, third digits, digit combinations, and so
on.
Here is the frequency distribution of the first digits that we can
expect for a data set that respects Benford's law:
A little further down in Wikipedia's article, in the "Applications"
section, you can also read the following:
Accounting fraud detection
In 1972, Hal Varian suggested that the law could be used to detect
possible fraud in lists of socio-economic data submitted in support of
public planning decisions. Based on the plausible assumption that people
who fabricate figures tend to distribute their digits fairly uniformly,
a simple comparison of first-digit frequency distribution from the data
with the expected distribution according to Benford's law ought to show
up any anomalous results.
Simply, if your data set distribution is following Benford's law, then
it's theoretically possible to detect fraudulent data if a particular
subset of the data doesn't follow the law.
In our situation, based on the observation of the first chart above, it
looks like the worldwide daily confirmed cases of COVID-19 are following
Benford's law. But is it true for each country?
If I want to answer this question (I don't), I will have to build a
relatively complex aggregation pipeline (I do 😄).
I will only focus on a single collection in this blog post:
covid19.countries_summary
.As its name suggests, it's a collection that I built (also using an
aggregation
pipeline)
that contains a daily document for each country in the data set.
Here is an example:
As you can see, for each day and country, I have daily counts of the
COVID-19 confirmed cases and deaths.
Let's apply Benford's law on these two series of numbers.
Before we start applying stages (transformations) to our documents,
let's define the shape of the final documents which will make it easy to
plot in MongoDB Charts.
It's easy to do and defines clearly where to start (the document in the
previous section) and where we are going:
Setting the final objective makes us focused on the target while doing
our successive transformations.
Now that we have a starting and an ending point, let's try to write our
pipeline in English first:
- Regroup all the first digits of each count into an array for the confirmed cases and into another one for the deaths for each country.
- Clean the arrays (remove zeros and negative numbers—see note below).
- Calculate the size of these arrays.
- Remove countries with empty arrays (countries without cases or deaths).
- Calculate the percentages of 1s, 2s, ..., 9s in each arrays.
- Add a fake country "BenfordTheory" with the theoretical values of 1s, 2s, etc. we are supposed to find.
- Final projection to get the document in the final shape I want.
Note: The daily fields that I provide in this collection
covid19.countries_summary
are computed from the cumulative counts that
Johns Hopkins University (JHU) provides. Simply: Today's count, for each
country, is today's cumulative count minus yesterday's cumulative count.
In theory, I should have zeros (no deaths or no cases that day), but
never negative numbers. But sometimes, JHU applies corrections on the
counts without applying them retroactively in the past (as these counts
were official counts at some point in time, I guess). So, negative
values exist and I chose to ignore them in this pipeline.Now that we have a plan, let's execute it. Each of the points in the
above list is an aggregation pipeline stage, and now we "just" have to
translate them.
First, I need to be able to extract the first character of
$confirmed_daily
, which is an integer.MongoDB provides a
$substring
operator which we can use if we transform this integer into a string.
This is easy to do with the
$toString
operator.
Here is the first stage:
Here is the shape of my documents at this point if I apply this
transformation:
As mentioned above, my arrays might contains zeros and
-
which is the
leading character of a negative number. I decided to ignore this for my
little mathematical experimentation.If I now translate "clean the arrays" into something more
"computer-friendly," what I actually want to do is "filter the
arrays." We can leverage the
$filter
operator and overwrite our existing arrays with their filtered versions
without zeros and dashes by using the
$addFields
stage.
At this point, our documents in the pipeline have the same shape as
previously.
The final goal here is to calculate the percentages of 1s, 2s, ..., 9s
in these two arrays, respectively. To compute this, I will need the size
of the arrays to apply the rule of
three.
To be completely honest, I could compute this on the fly later, when I
actually need it. But I'll need it multiple times later on, and this
stage is inexpensive and eases my mind so... Let's
KISS.
Here is the shape of our documents at this point:
As you can see for Japan, our arrays are relatively long, so we could
expect our percentages to be somewhat accurate.
It's far from being true for all the countries...
I'm not good enough at math to decide which size is significant enough
to be statistically accurate, but good enough to know that my rule of
three will need to divide by the size of the array.
As dividing by zero is bad for health, I need to remove empty arrays. A
sound statistician would probably also remove the small arrays... but
not me 😅.
We are finally at the central stage of our pipeline. I need to apply a
rule of three to calculate the percentage of 1s in an array:
- Find how many 1s are in the array.
- Multiply by 100.
- Divide by the size of the array.
- Round the final percentage to one decimal place. (I don't need more precision for my charts.)
Then, I need to repeat this operation for each digit and each array.
To find how many times a digit appears in the array, I can reuse
techniques we learned earlier:
I'm creating a new array which contains only the 1s with
$filter
and I
calculate its size with $size
.As a reminder, here is the final document we want:
The value we just calculated above corresponds to the
22.3
that we
have in this document.At this point, we just need to repeat this operation nine times for each
digit of the
confirmed
array and nine other times for the deaths
array and assign the results accordingly in the new benford
array of
documents.Here is what it looks like in the end:
At this point in our pipeline, our documents look like this:
Note: At this point, we don't need the arrays anymore. The target
document is almost there.
In my final charts, I wanted to be able to also display the Bendord's
theoretical values, alongside the actual values from the different
countries to be able to spot easily which one is potentially
producing fake data (modulo the statistic noise and many other reasons).
Just to give you an idea, it looks like, globally, all the countries are
producing legit data but some arrays are small and produce "statistical
accidents."
To be able to insert this "perfect" document, I need to introduce in my
pipeline a fake and perfect country that has the perfect percentages. I
decided to name it "BenfordTheory."
But (because there is always one), as far as I know, there is no stage
that can just let me insert a new document like this in my pipeline.
So close...
Lucky for me, I found a workaround to this problem with the new (since
4.4)
$unionWith
stage. All I have to do is insert my made-up document into a collection
and I can "insert" all the documents from this collection into my
pipeline at this stage.
I inserted my fake document into the new collection randomly named
benford
. Note that I made this document look like the documents at
this current stage in my pipeline. I didn't care to insert the two
arrays because I'm about to discard them anyway.With this new collection in place, all I need to do is
$unionWith
it.At this point, our documents look almost like the initial target
document that we have set at the beginning of this blog post. Two
differences though:
- The name of the countries is in the
_id
key, not thecountry
one. - The two arrays are still here.
Note that I chose which field should be here or not in the final
document by inclusion here.
_id
is an exception and needs to be
explicitly excluded. As the two arrays aren't explicitly included, they
are excluded by default, like any other field that would be there. See
considerations.Here is our final result:
And please remember that some documents still look like this in the
pipeline because I didn't bother to filter them:
My final pipeline is pretty long due to the fact that I'm repeating the
same block for each digit and each array for a total of 9*2=18 times.
If you want to see more visually how this pipeline works step by step,
import it in MongoDB Compass
once you are connected to the cluster (see the URI in the
Introduction). Use the
New Pipeline From Text
option in the
covid19.countries_summary
collection to import it.Did you think that this pipeline I just presented was perfect?
Well well... It's definitely getting the job done, but we can make it
better in many ways. I already mentioned in this blog post that we
could remove Stage 3, for example, if we wanted to. It might not be as
optimal, but it would be shorter.
Also, there is still Stage 5, in which I literally copy and paste the
same piece of code 18 times... and Stage 6, where I have to use a
workaround to insert a document in my pipeline.
Another solution could be to rewrite this pipeline with a
$facet
stage and execute two sub-pipelines in parallel to compute the results
we want for the confirmed array and the deaths array. But this solution
is actually about two times slower.
However, my colleague John Page came up
with this pipeline that is just better than mine
because it's applying more or less the same algorithm, but it's not
repeating itself. The code is a lot cleaner and I just love it, so I
thought I would also share it with you.
John is using very smartly a
$map
stage to iterate over the nine digits which makes the code a lot simpler
to maintain.
In this blog post, I tried my best to share with you the process of
creating a relatively complex aggregation pipeline and a few tricks to
transform as efficiently as possible your documents.
We talked about and used in a real pipeline the following aggregation
pipeline stages and operators:
- $map.
If you are a statistician and you can make sense of these results,
please post a message on the Community
Forum and ping me!
Also, let me know if you can find out if some countries are clearly
generating fake data.
If you have questions, please head to our developer community website where the MongoDB engineers and the MongoDB community will help you build your next big idea with MongoDB.