Store Sensitive Data With Python & MongoDB Client-Side Field Level Encryption
With legislation around customer data protection on the rise, it's increasingly necessary to be able to store sensitive customer data securely. While MongoDB's default security is based on modern industry standards, such as TLS for the transport layer and SCRAM for password exchange, it's still possible for someone to get into your database, either by attacking your server through a different vector, or by somehow obtaining your security credentials.
In these situations, you can add an extra layer of security to the most sensitive fields in your database using client-side field level encryption (CSFLE). CSFLE encrypts the fields that you specify, within the driver, on the client, so that they are never transmitted unencrypted, nor seen unencrypted by the MongoDB server. CSFLE makes it nearly impossible to obtain sensitive information from the database server, whether by intercepting data sent from the client or by reading data directly from disk, even with DBA or root credentials.
There are two ways to use CSFLE in MongoDB: explicit, where your code has to manually encrypt data, using helper methods, before it is sent to the driver to be inserted or updated; and implicit, where you declare which fields in your collection should be encrypted using an extended JSON Schema, and the Python driver does the encryption without any code changes. This tutorial will cover implicit CSFLE, which is only available in MongoDB Enterprise and MongoDB Atlas. If you're running MongoDB Community Server, you'll need to use explicit CSFLE, which won't be covered here.
- A recent release of Python 3. The code in this post was written for 3.8, but any release of Python 3.6+ should be fine.
When you install PyMongo as `pymongo[encryption]`, the `[encryption]` in square brackets tells pip to install the optional dependencies required to encrypt data within the PyMongo driver.
The second thing you'll need to have installed is mongocryptd, which is an application that is provided as part of MongoDB Enterprise. Follow the installation instructions to install mongocryptd onto the machine you'll be using to run your Python code. In a production environment, it's recommended to run mongocryptd as a service at startup on your VM or container.
Test that you have mongocryptd installed in your path by running `mongocryptd` and checking that it prints out some output. You can then shut it down again.
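If you'd rather check both prerequisites from Python, here is a minimal sketch. It assumes that the `[encryption]` extra provides the `pymongocrypt` module and that `mongocryptd` should be discoverable on your `PATH`; the helper names are made up for this example.

```python
import importlib.util
import shutil


def module_available(name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def executable_on_path(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    for module in ("pymongo", "pymongocrypt"):
        status = "ok" if module_available(module) else "missing"
        print(f"{module}: {status}")
    print("mongocryptd:", executable_on_path("mongocryptd") or "not found on PATH")
```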
First, I'll show you how to write a script to generate a new secret master key which will be used to protect individual field keys. In this tutorial, we will be using a "local" master key which will be stored on the application side, either in-line in code or in a local key file. Note that a local key file should only be used in development. For production, it's strongly recommended to either use one of the integrated native cloud key management services or retrieve the master key from a secrets manager such as HashiCorp Vault.
This Python script will generate some random bytes to be used as a secret master key. It will then create a new field key in MongoDB, encrypted using the master key. The master key will be written out to a file so it can be loaded by other Python scripts, along with a JSON schema document that will tell PyMongo which fields should be encrypted and how.
First, here are a few imports you'll need. Paste these into a file called `create_key.py`:
The first thing you need to do is to generate 96 bytes of random data. Fortunately, Python ships with a module for exactly this purpose, called `secrets`. You can use its `token_bytes` function for this:
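A minimal sketch of that step, using only the standard library. The variable name `key_bytes` is chosen for this example.

```python
from secrets import token_bytes

# Generate 96 cryptographically secure random bytes to use as the master key.
key_bytes = token_bytes(96)

# Print only the length; never log the key itself.
print(len(key_bytes))
```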
Next, here's some code that creates a MongoClient, configured with a local key management system (KMS). Add this code to your `create_key.py` script:
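The following is a sketch of what that code could look like, not necessarily the exact listing from the original post. It assumes the key vault lives in `csfle_demo.__keystore` and that the data key gets the alternate name `example`; the function names are hypothetical. The PyMongo imports sit inside the function so the small configuration helper can be used on its own.

```python
KEY_VAULT_NAMESPACE = "csfle_demo.__keystore"


def build_kms_providers(key_bytes):
    """Describe a single "local" KMS provider that holds the master key."""
    if len(key_bytes) != 96:
        raise ValueError("a local master key must be exactly 96 bytes")
    return {"local": {"key": key_bytes}}


def create_field_key(mongodb_url, key_bytes):
    """Reset the demo database, create one data key, and return its ID."""
    # Imported here so build_kms_providers() works without PyMongo installed.
    from bson.binary import STANDARD
    from bson.codec_options import CodecOptions
    from pymongo import MongoClient
    from pymongo.encryption import ClientEncryption
    from pymongo.encryption_options import AutoEncryptionOpts

    kms_providers = build_kms_providers(key_bytes)
    csfle_opts = AutoEncryptionOpts(kms_providers, KEY_VAULT_NAMESPACE)

    with MongoClient(mongodb_url, auto_encryption_opts=csfle_opts) as client:
        # Start from a clean slate so reruns behave predictably.
        client.drop_database("csfle_demo")

        client_encryption = ClientEncryption(
            kms_providers,
            KEY_VAULT_NAMESPACE,
            client,
            CodecOptions(uuid_representation=STANDARD),
        )
        try:
            # The new data key is stored in __keystore, encrypted with the master key.
            return client_encryption.create_data_key(
                "local", key_alt_names=["example"]
            )
        finally:
            client_encryption.close()
```

In the script, you would call it as `key_id = create_field_key(os.environ["MDB_URL"], key_bytes)`, after importing `os` and generating `key_bytes`.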
Once the client is configured in the code above, it's used to drop any existing "csfle_demo" database, just to ensure that running this or other scripts doesn't result in your database being left in a weird state.
The configuration and the client are then used to create a ClientEncryption object that you'll use once to create a data key in the `__keystore` collection in the `csfle_demo` database.
`create_data_key` will create a document in the `__keystore` collection that will look a little like this:
Now you have two keys! One is the 96 random bytes you generated with `token_bytes`. That's the master key (which remains outside the database). And there's another key in the `__keystore` collection! This is because MongoDB CSFLE uses envelope encryption. The key that is actually used to encrypt field values is stored in the database, but it is stored encrypted with the master key you generated.
To make sure you don't lose the master key, here's some code you should add to your script which will save it to a file called `key_bytes.bin`:
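A minimal sketch of saving the key with `pathlib`. The helper name `save_master_key` is made up for this example, and the block generates a fresh key only so that it runs on its own; in `create_key.py` you would pass the `key_bytes` you already have.

```python
from pathlib import Path
from secrets import token_bytes


def save_master_key(key_bytes, path="key_bytes.bin"):
    """Write the raw master key bytes to a local file (development only)."""
    Path(path).write_bytes(key_bytes)


key_bytes = token_bytes(96)  # stands in for the key generated earlier in the script
save_master_key(key_bytes)
```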
Finally, you need a JSON schema structure that will tell PyMongo which fields need to be encrypted, and how. The schema needs to reference the key you created in `__keystore`, and you have that in the `key_id` variable, so this script is a good place to generate the JSON file. Add the following to the end of your script:
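Here is one way that could look, as a sketch. It assumes the encrypted field is a string called `ssn` and starts with the Random algorithm; `build_schema` and `write_schema` are hypothetical helper names. `bson.json_util` ships with PyMongo and is used because the standard `json` module cannot serialize the binary key ID.

```python
def build_schema(key_id, algorithm="AEAD_AES_256_CBC_HMAC_SHA_512-Random"):
    """Build a JSON Schema that marks the "ssn" field as encrypted."""
    return {
        "bsonType": "object",
        "properties": {
            "ssn": {
                "encrypt": {
                    "bsonType": "string",
                    "algorithm": algorithm,
                    "keyId": [key_id],  # reference the data key in __keystore
                }
            }
        },
    }


def write_schema(schema, path="json_schema.json"):
    """Serialize the schema as canonical extended JSON and save it to disk."""
    # Imported here so build_schema() works without PyMongo installed.
    from pathlib import Path

    from bson import json_util

    json_schema = json_util.dumps(
        schema, json_options=json_util.CANONICAL_JSON_OPTIONS, indent=2
    )
    Path(path).write_text(json_schema)
```

At the end of the script, this would be used as `write_schema(build_schema(key_id))`.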
Now you can run this script. First, set the environment variable `MDB_URL` to the URL for your Atlas cluster. The script should create two files locally: `key_bytes.bin`, containing your master key; and `json_schema.json`, containing your JSON schema. In your database, there should be a `__keystore` collection containing your new (encrypted) field key! The easiest way to check this is to find your cluster in the Atlas web interface and browse its collections.
Create a new file, called `csfle_main.py`. This script will connect to your MongoDB cluster using the key and schema created by running `create_key.py`. I'll then show you how to insert a document, and retrieve it both with and without CSFLE configuration, to show how it is stored encrypted and transparently decrypted by PyMongo when the correct configuration is provided.
Start with some code to import the necessary modules and load the saved files:
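A sketch of the loading step, assuming the two files were written by `create_key.py` into the current directory. The function names are made up for this example.

```python
from pathlib import Path


def load_master_key(path="key_bytes.bin"):
    """Read the raw master key bytes saved by create_key.py."""
    return Path(path).read_bytes()


def load_schema(path="json_schema.json"):
    """Parse the saved schema, restoring the binary key ID."""
    # Imported here so load_master_key() works without PyMongo installed.
    from bson import json_util

    return json_util.loads(Path(path).read_text())
```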
Add the following configuration needed to connect to MongoDB:
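A sketch of that configuration, assuming the same `csfle_demo.__keystore` key vault as before and a `people` collection in the `csfle_demo` database. The helper names are hypothetical.

```python
KEY_VAULT_NAMESPACE = "csfle_demo.__keystore"
PEOPLE_NAMESPACE = "csfle_demo.people"


def build_schema_map(collection_schema):
    """Map the loaded JSON schema onto the people collection."""
    return {PEOPLE_NAMESPACE: collection_schema}


def build_csfle_opts(key_bytes, collection_schema):
    """Create the auto-encryption options passed to MongoClient."""
    # Imported here so build_schema_map() works without PyMongo installed.
    from pymongo.encryption_options import AutoEncryptionOpts

    return AutoEncryptionOpts(
        {"local": {"key": key_bytes}},
        KEY_VAULT_NAMESPACE,
        schema_map=build_schema_map(collection_schema),
    )
```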
The code above is very similar to the configuration created in `create_key.py`. Note that this time, `AutoEncryptionOpts` is passed a `schema_map`, mapping the loaded JSON schema against the `people` collection in the `csfle_demo` database. This will let PyMongo know which fields to encrypt and decrypt, and which algorithms and keys to use.
At this point, it's worth taking a look at the JSON schema that you're loading. It's stored in `json_schema.json`, and it should look a bit like this:
If you don't want to store the schema in a file when you generate your field key in MongoDB, you can load the key ID at any time using the values you set for `keyAltNames` when you created the key. In my case, I set `keyAltNames` to `["example"]`, so I could look it up using the following line of code:
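A sketch of that lookup, wrapped in a small hypothetical helper that also reports a missing key. It expects the `__keystore` collection object; note that PyMongo does not allow attribute access for names starting with an underscore, so the collection is fetched with subscript syntax.

```python
def find_key_id(keystore, alt_name):
    """Return the _id of the data key registered under the given alternate name."""
    key_doc = keystore.find_one({"keyAltNames": alt_name})
    if key_doc is None:
        raise LookupError(f"no data key with alternate name {alt_name!r}")
    return key_doc["_id"]
```

With a connected client, this would be called as `find_key_id(client["csfle_demo"]["__keystore"], "example")`.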
Because my code in `create_key.py` writes out the schema at the same time as generating the key, it already has access to the key's ID, so the code doesn't need to look it up.
Add the following code to connect to MongoDB using the configuration you added above:
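A sketch of that step. The field names `full_name` and `ssn` are assumptions for this example, and `csfle_opts` is expected to be the `AutoEncryptionOpts` object configured earlier.

```python
def store_and_fetch_person(people):
    """Clear the collection, insert one person, and read the document back."""
    people.delete_many({})
    people.insert_one({"full_name": "Sophia Duleep Singh", "ssn": "123-12-1234"})
    return people.find_one()


def run_demo(mongodb_url, csfle_opts):
    """Connect with CSFLE enabled and print the transparently decrypted document."""
    # Imported here so store_and_fetch_person() can be exercised without PyMongo.
    from pymongo import MongoClient

    with MongoClient(mongodb_url, auto_encryption_opts=csfle_opts) as client:
        print("Decrypted find() results:")
        print(store_and_fetch_person(client.csfle_demo.people))
```

In `csfle_main.py`, you would call `run_demo(os.environ["MDB_URL"], csfle_opts)`.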
The code above connects to MongoDB and clears any existing documents from the `people` collection. It then adds a new person document, for Sophia Duleep Singh, with a fictional Social Security number.
Just to prove the data can be read back from MongoDB and decrypted by PyMongo, the last line of code queries back the record that was just added and prints it to the screen. When I ran this code, it printed:
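Reading the same document back through a MongoClient that has no CSFLE configuration shows how the field is actually stored. When I ran this, it printed out: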
That's a very different result from '123-12-1234'! Unfortunately, when you use the Random encryption algorithm, you lose the ability to filter on the field. You can see this if you add the following code to the end of your script and execute it:
When you execute this block of code, it will print an exception saying, "Cannot query on fields encrypted with the randomized encryption algorithm...".
`AEAD_AES_256_CBC_HMAC_SHA_512-Random` is the correct algorithm to use for sensitive data you won't have to filter on, such as medical conditions, security questions, etc. It also provides better protection against frequency analysis recovery, and so should probably be your default choice for encrypting sensitive data, especially data that is high-cardinality, such as a credit card number, phone number, or ... yes ... a Social Security number. But there's a distinct probability that you might want to search for someone by their Social Security number, given that it's a unique identifier for a person, and you can do this by encrypting it using the "Deterministic" algorithm.
In order to fix this, open up `create_key.py` again and change the algorithm in the schema definition from `Random` to `Deterministic`, so it looks like this:
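As a sketch, the `ssn` entry of the schema would then carry the Deterministic algorithm name. The helper `ssn_encrypt_spec` is hypothetical and simply isolates the part of the schema that changes; `key_id` is the data key ID created earlier.

```python
def ssn_encrypt_spec(key_id):
    """Encryption settings for a string field that must remain queryable."""
    return {
        "encrypt": {
            "bsonType": "string",
            # Deterministic encryption allows equality filters on the field.
            "algorithm": "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic",
            "keyId": [key_id],
        }
    }
```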
Rerun `create_key.py` to generate a new master key, field key, and schema file. (This operation will also delete your `csfle_demo` database!) Then run `csfle_main.py` again. This time, the block of code that failed before should instead print out the details of Sophia Duleep Singh.
The problem with this way of configuring your client is that if some other code is misconfigured, it can either save unencrypted values in the database or save them using the wrong key or algorithm. Here's an example of some code to add a second record, for Dora Thewlis. Unfortunately, this time, the configuration has not provided a `schema_map`! What this means is that the SSN for Dora Thewlis will be stored in plaintext.
If you paste the above code into your script and run it, it should print out something like this, demonstrating that one of the documents has an encrypted SSN, and the other's is plaintext:
In order to have a schema defined on the server-side, return to your `create_key.py` script, and instead of writing out the schema to a JSON file, provide it to the `create_collection` method as a JSON Schema validator:
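A minimal sketch of that call, assuming `db` is the `csfle_demo` database object and `schema` is the schema dictionary built earlier in the script. The wrapper function is hypothetical; `create_collection` and its `validator` option are part of PyMongo.

```python
def create_people_collection(db, schema):
    """Create the people collection with the encryption schema attached."""
    return db.create_collection("people", validator={"$jsonSchema": schema})
```

Inside the `with MongoClient(...)` block of `create_key.py`, this would be called as `create_people_collection(client.csfle_demo, schema)`.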
Providing a validator attaches the schema to the created collection, so there's no need to save the file locally, no need to read it into `csfle_main.py`, and no need to provide it to MongoClient anymore. It will be stored and enforced by the server. This simplifies both the key generation code and the code to query the database, and it ensures that the SSN field will always be encrypted correctly. Bonus!
By completing this quick start, you've learned how to:
- Create a secure random key for encrypting data keys in MongoDB.
- Use local key storage to store a key during development.
- Create a key in MongoDB (encrypted with your local key) to encrypt data in MongoDB.
- Use a JSON Schema to define which fields should be encrypted.
- Assign the JSON Schema to a collection to validate encrypted fields on the server.
As mentioned earlier, you should not use local key storage to manage your key in production, because it's insecure. You can store the key manually in a secrets manager of your choice, such as HashiCorp Vault, or if you're using one of the three major cloud providers, their KMS services are already integrated into PyMongo. Read the documentation to find out more.
There is a lot of documentation about Client-Side Field-Level Encryption, in different places. Here are the docs I found useful when writing this post: