Store Sensitive Data With Python & MongoDB Client-Side Field Level Encryption
With a combination of legislation around customer data protection (such as GDPR) and increasing legislation around money laundering, it's increasingly necessary to be able to store sensitive customer data securely. While MongoDB's default security is based on modern industry standards, such as TLS for the transport layer and SCRAM-SHA-256 for password exchange, it's still possible for someone to get into your database, either by attacking your server through a different vector or by somehow obtaining your security credentials.
In these situations, you can add an extra layer of security to the most sensitive fields in your database using client-side field level encryption (CSFLE). CSFLE encrypts the fields that you specify, within the driver, on the client, so that they are never transmitted unencrypted, nor seen unencrypted by the MongoDB server. CSFLE makes it nearly impossible to obtain sensitive information from the database server, either by intercepting data sent from the client or by reading data directly from disk, even with DBA or root credentials.
There are two ways to use CSFLE in MongoDB: explicit, where your code has to manually encrypt data, using helper methods, before it is sent to the driver to be inserted or updated; and implicit, where you declare in your collection which fields should be encrypted using an extended JSON Schema, and the encryption is done by the Python driver without any code changes. This tutorial will cover implicit CSFLE, which is only available in MongoDB Enterprise and MongoDB Atlas. If you're running MongoDB Community Server, you'll need to use explicit CSFLE, which won't be covered here.
- A recent release of Python 3. The code in this post was written for 3.8, but any release of Python 3.6+ should be fine.
There are two things you need to have installed on your app server to enable CSFLE in the PyMongo driver. The first is a Python library called pymongocrypt, which you can install by running the following with your virtualenv enabled:
```sh
python -m pip install "pymongo[encryption,srv]~=3.11"
```
The [encryption] in square brackets tells pip to install the optional dependencies required to encrypt data within the PyMongo driver.
The second thing you'll need to have installed is mongocryptd, which is an application that is provided as part of MongoDB Enterprise. Follow the instructions to install mongocryptd onto the machine you'll be using to run your Python code. In a production environment, it's recommended to run mongocryptd as a service at startup on your VM or container.
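If you'd like your own scripts to fail early when either of these two dependencies is missing, a small preflight check along these lines might help. It's a minimal sketch rather than anything PyMongo provides: the name check_csfle_prerequisites is made up for illustration, and all it does is look for an importable pymongocrypt module and a mongocryptd executable on your PATH.

```python
import importlib.util
import shutil


def check_csfle_prerequisites():
    """Report whether the two CSFLE dependencies appear to be installed.

    Hypothetical helper (not part of PyMongo). It only checks that the
    'pymongocrypt' module can be found and that a 'mongocryptd'
    executable is on the PATH; it does not start or test either one.
    """
    return {
        "pymongocrypt": importlib.util.find_spec("pymongocrypt") is not None,
        "mongocryptd": shutil.which("mongocryptd") is not None,
    }


if __name__ == "__main__":
    for name, found in check_csfle_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```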
Test that you have mongocryptd installed in your path by running mongocryptd, ensuring that it prints out some output. You can then shut it down again with Ctrl-C.
First, I'll show you how to write a script to generate a new secret master key which will be used to protect individual field keys. In this tutorial, we will be using a "local" master key which will be stored on the application side, either in-line in code or in a local key file. Note that a local key file should only be used in development. For production, it's strongly recommended to either use one of the integrated native cloud key management services or retrieve the master key from a secrets manager such as HashiCorp Vault. This Python script will generate some random bytes to be used as a secret master key. It will then create a new field key in MongoDB, encrypted using the master key. The master key will be written out to a file so it can be loaded by other Python scripts, along with a JSON schema document that will tell PyMongo which fields should be encrypted and how.
All of the code described in this post is on GitHub. I recommend you check it out if you get stuck, but otherwise, it's worth following the tutorial and writing the code yourself!
First, here are a few imports you'll need. Paste these into a file called create_key.py.

```python
# create_key.py

import os
from pathlib import Path
from secrets import token_bytes

from bson import json_util
from bson.binary import STANDARD
from bson.codec_options import CodecOptions
from pymongo import MongoClient
from pymongo.encryption import ClientEncryption
from pymongo.encryption_options import AutoEncryptionOpts
```
The first thing you need to do is to generate 96 bytes of random data. Fortunately, Python ships with a module for exactly this purpose, called secrets. You can use the token_bytes function for this:

```python
# create_key.py

# Generate a secure 96-byte secret key:
key_bytes = token_bytes(96)
```
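If the master key ever needs to travel as text (say, to paste it into a secrets manager during development), one option is to Base64-encode it and check the length again when decoding. The sketch below is illustrative only: encode_master_key and decode_master_key are hypothetical helper names, and the 96-byte check simply mirrors the key size used in this tutorial.

```python
import base64
from secrets import token_bytes

MASTER_KEY_LENGTH = 96  # bytes, matching token_bytes(96) in create_key.py


def encode_master_key(key_bytes: bytes) -> str:
    """Return the key as Base64 text (hypothetical helper)."""
    if len(key_bytes) != MASTER_KEY_LENGTH:
        raise ValueError(f"expected {MASTER_KEY_LENGTH} bytes, got {len(key_bytes)}")
    return base64.b64encode(key_bytes).decode("ascii")


def decode_master_key(encoded: str) -> bytes:
    """Decode Base64 text back to key bytes, rejecting wrong-sized keys."""
    key_bytes = base64.b64decode(encoded, validate=True)
    if len(key_bytes) != MASTER_KEY_LENGTH:
        raise ValueError(f"expected {MASTER_KEY_LENGTH} bytes, got {len(key_bytes)}")
    return key_bytes


key_bytes = token_bytes(MASTER_KEY_LENGTH)
encoded = encode_master_key(key_bytes)
# The round trip gives back exactly the bytes that were generated:
print(decode_master_key(encoded) == key_bytes)
```

Checking the length at both ends means a truncated or mangled key is caught before it ever reaches PyMongo.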
Next, here's some code that creates a MongoClient, configured with a local key management system (KMS).
Note: Storing the master key, unencrypted, on a local filesystem (which is what I do in this demo code) is insecure. In production you should use a secure KMS, such as AWS KMS, Azure Key Vault, or Google's Cloud KMS.
I'll cover this in a later blog post, but if you want to get started now, you should read the documentation.
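To give a rough idea of what changes with a cloud KMS, here's a minimal sketch of the configuration for AWS KMS, based on my understanding of the PyMongo documentation. The aws_kms_settings helper and the CSFLE_AWS_REGION and CSFLE_AWS_KEY_ARN environment variable names are placeholders made up for illustration, so check the documentation for the exact options your provider needs.

```python
import os


def aws_kms_settings(env=os.environ):
    """Build the KMS provider map and master key description for AWS KMS.

    Hypothetical helper: the CSFLE_* variable names are assumptions made
    for this sketch, not names that PyMongo or AWS define.
    """
    kms_providers = {
        "aws": {
            "accessKeyId": env["AWS_ACCESS_KEY_ID"],
            "secretAccessKey": env["AWS_SECRET_ACCESS_KEY"],
        }
    }
    # Identifies the AWS KMS key that will encrypt (wrap) the data key:
    master_key = {
        "region": env["CSFLE_AWS_REGION"],
        "key": env["CSFLE_AWS_KEY_ARN"],
    }
    return kms_providers, master_key
```

With these in hand, you'd pass kms_providers to AutoEncryptionOpts and ClientEncryption in place of the local provider, and create the data key with something like client_encryption.create_data_key("aws", master_key=master_key, key_alt_names=["example"]). The rest of this tutorial sticks with the local provider.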
Add this code to your create_key.py script:

```python
# create_key.py

# Configure a single, local KMS provider, with the saved key:
kms_providers = {"local": {"key": key_bytes}}
csfle_opts = AutoEncryptionOpts(
    kms_providers=kms_providers, key_vault_namespace="csfle_demo.__keystore"
)

# Connect to MongoDB with the key information generated above:
with MongoClient(os.environ["MDB_URL"], auto_encryption_opts=csfle_opts) as client:
    print("Resetting demo database & keystore ...")
    client.drop_database("csfle_demo")

    # Create a ClientEncryption object to create the data key below:
    client_encryption = ClientEncryption(
        kms_providers,
        "csfle_demo.__keystore",
        client,
        CodecOptions(uuid_representation=STANDARD),
    )

    print("Creating key in MongoDB ...")
    key_id = client_encryption.create_data_key("local", key_alt_names=["example"])
```
Once the client is configured in the code above, it's used to drop any existing "csfle_demo" database, just to ensure that running this or other scripts doesn't result in your database being left in a weird state.
The configuration and the client are then used to create a ClientEncryption object that you'll use once to create a data key in the __keystore collection in the csfle_demo database. create_data_key will create a document in the __keystore collection that will look a little like this:

```python
{
    '_id': UUID('00c63aa2-059d-4548-9e18-54452195acd0'),
    'creationDate': datetime.datetime(2020, 11, 24, 11, 25, 0, 974000),
    'keyAltNames': ['example'],
    'keyMaterial': b'W\xd2"\xd7\xd4d\x02e/\x8f|\x8f\xa2\xb6\xb1\xc0Q\xa0\x1b\xab ...',
    'masterKey': {'provider': 'local'},
    'status': 0,
    'updateDate': datetime.datetime(2020, 11, 24, 11, 25, 0, 974000)
}
```
Now you have two keys! One is the 96 random bytes you generated with token_bytes - that's the master key (which remains outside the database). And there's another key in the __keystore collection! This is because MongoDB CSFLE uses envelope encryption. The key that is actually used to encrypt field values is stored in the database, but it is stored encrypted with the master key you generated.
To make sure you don't lose the master key, here's some code you should add to your script which will save it to a file called key_bytes.bin.

```python
# create_key.py

Path("key_bytes.bin").write_bytes(key_bytes)
```
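Going back to envelope encryption for a moment: if the two-key arrangement feels abstract, the toy sketch below walks through the same shape using only the standard library. To be clear about the assumptions, toy_cipher is a deliberately simplistic stand-in (an XOR keystream derived with SHAKE-256) and is not the algorithm MongoDB uses; the point is only to show which key encrypts what.

```python
import hashlib
from secrets import token_bytes


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with a keystream derived from key and nonce.

    TOY ONLY: a stand-in for real authenticated encryption so that this
    sketch runs without extra packages. Applying it twice with the same
    key and nonce returns the original data. Never use it for real data.
    """
    keystream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, keystream))


# The master key stays with the application (like key_bytes.bin):
master_key = token_bytes(96)

# The data key is what actually encrypts field values...
data_key = token_bytes(32)
field_nonce = token_bytes(16)
encrypted_ssn = toy_cipher(data_key, field_nonce, b"123-12-1234")

# ...and it is only ever stored wrapped (encrypted) with the master key,
# much like the keyMaterial field in the __keystore document:
wrap_nonce = token_bytes(16)
wrapped_data_key = toy_cipher(master_key, wrap_nonce, data_key)

# Reading a field: unwrap the data key with the master key, then decrypt.
unwrapped = toy_cipher(master_key, wrap_nonce, wrapped_data_key)
print(toy_cipher(unwrapped, field_nonce, encrypted_ssn))  # b'123-12-1234'

# Changing the master key only means re-wrapping the small data key;
# the encrypted field values themselves are left untouched.
new_master_key = token_bytes(96)
new_wrap_nonce = token_bytes(16)
rewrapped_data_key = toy_cipher(new_master_key, new_wrap_nonce, unwrapped)
```

One practical consequence of this design is that anyone who copies the database gets only the wrapped data key, which is useless without the master key held outside it.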
Finally, you need a JSON schema structure that will tell PyMongo which fields need to be encrypted, and how. The schema needs to reference the key you created in __keystore, and you have that in the key_id variable, so this script is a good place to generate the JSON file. Add the following to the end of your script:

```python
# create_key.py

schema = {
    "bsonType": "object",
    "properties": {
        "ssn": {
            "encrypt": {
                "bsonType": "string",
                # Change to "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic" in order to filter by ssn value:
                "algorithm": "AEAD_AES_256_CBC_HMAC_SHA_512-Random",
                "keyId": [key_id],  # Reference the key
            }
        },
    },
}

json_schema = json_util.dumps(
    schema, json_options=json_util.CANONICAL_JSON_OPTIONS, indent=2
)
Path("json_schema.json").write_text(json_schema)
```
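If you later need to encrypt more than one field, writing the nested dictionaries by hand gets repetitive. Here's a minimal sketch of a helper that builds the same structure for several top-level fields. The function name build_encryption_schema, the phone field, and the placeholder key ID are all assumptions for illustration; in create_key.py you would pass the key_id returned by create_data_key instead.

```python
import uuid

ALGORITHM_PREFIX = "AEAD_AES_256_CBC_HMAC_SHA_512-"


def build_encryption_schema(key_id, fields):
    """Build a CSFLE JSON schema for several top-level fields.

    Hypothetical helper. `fields` maps a field name to a pair of
    (bson_type, mode), where mode is "Random" or "Deterministic".
    """
    properties = {}
    for name, (bson_type, mode) in fields.items():
        if mode not in ("Random", "Deterministic"):
            raise ValueError(f"unknown encryption mode for {name!r}: {mode!r}")
        properties[name] = {
            "encrypt": {
                "bsonType": bson_type,
                "algorithm": ALGORITHM_PREFIX + mode,
                "keyId": [key_id],
            }
        }
    return {"bsonType": "object", "properties": properties}


# Placeholder only: a real schema needs the key ID created in MongoDB.
placeholder_key_id = uuid.uuid4()

example_schema = build_encryption_schema(
    placeholder_key_id,
    {
        "ssn": ("string", "Random"),
        "phone": ("string", "Deterministic"),  # hypothetical second field
    },
)
print(example_schema["properties"]["phone"]["encrypt"]["algorithm"])
```

This helper isn't part of create_key.py, which carries on with the single-field schema above.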
Now you can run this script. First, set the environment variable MDB_URL to the URL for your Atlas cluster. The script should create two files locally: key_bytes.bin, containing your master key; and json_schema.json, containing your JSON schema. In your database, there should be a __keystore collection containing your new (encrypted) field key! The easiest way to check this out is to go to cloud.mongodb.com, find your cluster, and click on Collections.
Create a new file, called csfle_main.py. This script will connect to your MongoDB cluster using the key and schema created by running create_key.py. I'll then show you how to insert a document, and retrieve it both with and without CSFLE configuration, to show how it is stored encrypted and transparently decrypted by PyMongo when the correct configuration is provided.
Start with some code to import the necessary modules and load the saved files:
```python
# csfle_main.py

import os
from pathlib import Path

from pymongo import MongoClient
from pymongo.encryption_options import AutoEncryptionOpts
from pymongo.errors import EncryptionError
from bson import json_util

# Load the master key from 'key_bytes.bin':
key_bin = Path("key_bytes.bin").read_bytes()

# Load the 'person' schema from "json_schema.json":
collection_schema = json_util.loads(Path("json_schema.json").read_text())
```
Add the following configuration needed to connect to MongoDB:
```python
# csfle_main.py

# Configure a single, local KMS provider, with the saved key:
kms_providers = {"local": {"key": key_bin}}

# Create a configuration for PyMongo, specifying the local master key,
# the collection used for storing key data, and the json schema specifying
# field encryption:
csfle_opts = AutoEncryptionOpts(
    kms_providers,
    "csfle_demo.__keystore",
    schema_map={"csfle_demo.people": collection_schema},
)
```
The code above is very similar to the configuration created in create_key.py. Note that this time, AutoEncryptionOpts is passed a schema_map, mapping the loaded JSON schema against the people collection in the csfle_demo database. This will let PyMongo know which fields to encrypt and decrypt, and which algorithms and keys to use.
At this point, it's worth taking a look at the JSON schema that you're loading. It's stored in json_schema.json, and it should look a bit like this:

```json
{
  "bsonType": "object",
  "properties": {
    "ssn": {
      "encrypt": {
        "bsonType": "string",
        "algorithm": "AEAD_AES_256_CBC_HMAC_SHA_512-Random",
        "keyId": [
          {
            "$binary": {
              "base64": "4/p3dLgeQPyuSaEf+NddHw==",
              "subType": "04"
            }
          }
        ]
      }
    }
  }
}
```
This schema specifies that the ssn field, used to store a social security number, is a string which should be stored encrypted using the AEAD_AES_256_CBC_HMAC_SHA_512-Random algorithm.
If you don't want to store the schema in a file when you generate your field key in MongoDB, you can load the key ID at any time using the values you set for keyAltNames when you created the key. In my case, I set keyAltNames to ["example"], so I could look it up using the following line of code:

```python
# Here, db is the csfle_demo database. Bracket access is used because PyMongo
# rejects attribute-style access for names that start with an underscore.
key_id = db["__keystore"].find_one({"keyAltNames": "example"})["_id"]
```
Because my code in create_key.py writes out the schema at the same time as generating the key, it already has access to the key's ID, so the code doesn't need to look it up.
Add the following code to connect to MongoDB using the configuration you added above:

```python
# csfle_main.py

# Add a new document to the "people" collection, and then read it back out
# to demonstrate that the ssn field is automatically decrypted by PyMongo:
with MongoClient(os.environ["MDB_URL"], auto_encryption_opts=csfle_opts) as client:
    client.csfle_demo.people.delete_many({})
    client.csfle_demo.people.insert_one({
        "full_name": "Sophia Duleep Singh",
        "ssn": "123-12-1234",
    })
    print("Decrypted find() results: ")
    print(client.csfle_demo.people.find_one())
```
The code above connects to MongoDB and clears any existing documents from the people collection. It then adds a new person document, for Sophia Duleep Singh, with a fictional ssn value.
Just to prove the data can be read back from MongoDB and decrypted by PyMongo, the last line of code queries back the record that was just added and prints it to the screen. When I ran this code, it printed:

```python
{'_id': ObjectId('5fc12f13516b61fa7a99afba'), 'full_name': 'Sophia Duleep Singh', 'ssn': '123-12-1234'}
```
To prove that the data is encrypted on the server, you can connect to your cluster using Compass or at cloud.mongodb.com, but it's not a lot of code to connect again without encryption configuration, and query the document:
```python
# csfle_main.py

# Connect to MongoDB, but this time without CSFLE configuration.
# This will print the document with ssn *still encrypted*:
with MongoClient(os.environ["MDB_URL"]) as client:
    print("Encrypted find() results: ")
    print(client.csfle_demo.people.find_one())
```
When I ran this, it printed out:
```python
{
    '_id': ObjectId('5fc12f13516b61fa7a99afba'),
    'full_name': 'Sophia Duleep Singh',
    'ssn': Binary(b'\x02\xe3\xfawt\xb8\x1e@\xfc\xaeI\xa1\x1f\xf8\xd7]\x1f\x02\xd8+,\x9el ...', 6)
}
```
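The Binary value isn't opaque all the way through. As I understand the layout of BSON binary subtype 6, the first byte records the kind of encryption (1 for deterministic, 2 for random), the next 16 bytes are the ID of the data key, and the byte after that is the BSON type of the original value. The sketch below reads just that header from the bytes printed above and compares the key ID with the one in json_schema.json. The function name describe_encrypted_value is made up for this example, and the layout is an assumption you should confirm against the specification before relying on it.

```python
import base64
import uuid

# First byte of an encrypted value (assumed layout of BSON binary subtype 6):
ALGORITHMS = {1: "Deterministic", 2: "Random"}


def describe_encrypted_value(payload: bytes) -> dict:
    """Read the unencrypted header of a CSFLE value without decrypting it."""
    if len(payload) < 18:
        raise ValueError("payload is too short to contain a header")
    return {
        "algorithm": ALGORITHMS.get(payload[0], "unknown"),
        "key_id": uuid.UUID(bytes=payload[1:17]),
        "original_bson_type": payload[17],  # 2 is the BSON type for strings
    }


# The first bytes of the ssn value printed above (the ciphertext is truncated):
sophia_ssn = b"\x02\xe3\xfawt\xb8\x1e@\xfc\xaeI\xa1\x1f\xf8\xd7]\x1f\x02\xd8+,\x9el"

# The keyId from json_schema.json, decoded from Base64:
schema_key_id = uuid.UUID(bytes=base64.b64decode("4/p3dLgeQPyuSaEf+NddHw=="))

info = describe_encrypted_value(sophia_ssn)
print(info["algorithm"], info["key_id"] == schema_key_id)  # prints: Random True
```

So the header reveals which key and algorithm were used, but without the keys, all anyone reading the database has for the ssn itself is that Binary blob.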
That's a very different result from '123-12-1234'! Unfortunately, when you use the Random encryption algorithm, you lose the ability to filter on the field. You can see this if you add the following code to the end of your script and execute it:
```python
# csfle_main.py

# The following demonstrates that if the ssn field is encrypted as
# "Random" it cannot be filtered:
try:
    with MongoClient(os.environ["MDB_URL"], auto_encryption_opts=csfle_opts) as client:
        # This will fail if ssn is specified as "Random".
        # Change the algorithm to "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic"
        # in create_key.py (and run it again) for this to succeed:
        print("Find by ssn: ")
        print(client.csfle_demo.people.find_one({"ssn": "123-12-1234"}))
except EncryptionError as e:
    # This is expected if the field is "Random" but not if it's "Deterministic"
    print(e)
```
When you execute this block of code, it will print an exception saying, "Cannot query on fields encrypted with the randomized encryption algorithm...".
AEAD_AES_256_CBC_HMAC_SHA_512-Random is the correct algorithm to use for sensitive data you won't have to filter on, such as medical conditions, security questions, etc. It also provides better protection against frequency analysis recovery, and so should probably be your default choice for encrypting sensitive data, especially data that is high-cardinality, such as a credit card number, phone number, or ... yes ... a social security number. But there's a distinct probability that you might want to search for someone by their social security number, given that it's a unique identifier for a person, and you can do this by encrypting it using the "Deterministic" algorithm.
In order to fix this, open up create_key.py again and change the algorithm in the schema definition from Random to Deterministic, so it looks like this:

```python
# create_key.py

"algorithm": "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic",
```
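If it isn't obvious why one algorithm allows filtering and the other doesn't, this toy sketch might help. It does not use the real AEAD_AES_256_CBC_HMAC_SHA_512 algorithms; the cipher is a simplistic XOR keystream built from the standard library, and every function name here is made up for illustration. What it does show is the one property that matters: deterministic encryption turns equal plaintexts into equal ciphertexts, so an encrypted query value can be matched byte-for-byte, while randomized encryption produces a different ciphertext every time.

```python
import hashlib
import hmac
from secrets import token_bytes


def _toy_xor(key: bytes, iv: bytes, data: bytes) -> bytes:
    """TOY keystream cipher, for illustration only. Not secure."""
    keystream = hashlib.shake_256(key + iv).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, keystream))


def encrypt_random(key: bytes, plaintext: bytes) -> bytes:
    iv = token_bytes(16)  # a fresh random IV on every call
    return iv + _toy_xor(key, iv, plaintext)


def encrypt_deterministic(key: bytes, plaintext: bytes) -> bytes:
    # The IV is derived from the key and plaintext, so the same input
    # always produces the same ciphertext.
    iv = hmac.new(key, plaintext, hashlib.sha256).digest()[:16]
    return iv + _toy_xor(key, iv, plaintext)


def decrypt(key: bytes, ciphertext: bytes) -> bytes:
    iv, body = ciphertext[:16], ciphertext[16:]
    return _toy_xor(key, iv, body)


key = token_bytes(32)
ssn = b"123-12-1234"

# Equal plaintexts give equal ciphertexts, so equality filters can work:
print(encrypt_deterministic(key, ssn) == encrypt_deterministic(key, ssn))  # True

# Random IVs make each ciphertext different (barring an astronomically
# unlikely IV collision), so there is nothing stable to match against:
print(encrypt_random(key, ssn) == encrypt_random(key, ssn))  # False
```

The same property is what makes deterministic encryption more exposed to frequency analysis: anyone who can see the ciphertexts can tell which documents share a value, even without knowing what the value is.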
Re-run create_key.py to generate a new master key, field key, and schema file. (This operation will also delete your csfle_demo database!) Run csfle_main.py again. This time, the block of code that failed before should instead print out the details of Sophia Duleep Singh.
The problem with this way of configuring your client is that if some other code is misconfigured, it can either save unencrypted values in the database or save them using the wrong key or algorithm. Here's an example of some code to add a second record, for Dora Thewlis. Unfortunately, this time, the configuration has not provided a schema_map! What this means is that the SSN for Dora Thewlis will be stored in plaintext.

```python
# Configure encryption options with the same key, but *without* a schema:
csfle_opts_no_schema = AutoEncryptionOpts(
    kms_providers,
    "csfle_demo.__keystore",
)
with MongoClient(
    os.environ["MDB_URL"], auto_encryption_opts=csfle_opts_no_schema
) as client:
    print("Inserting Dora Thewlis, without configured schema.")
    # This will insert a document *without* encrypted ssn, because
    # no schema is specified in the client or server:
    client.csfle_demo.people.insert_one({
        "full_name": "Dora Thewlis",
        "ssn": "234-23-2345",
    })

# Connect without CSFLE configuration to show that Sophia Duleep Singh is
# encrypted, but Dora Thewlis has her ssn saved as plaintext.
with MongoClient(os.environ["MDB_URL"]) as client:
    print("Encrypted find() results: ")
    for doc in client.csfle_demo.people.find():
        print(" *", doc)
```
If you paste the above code into your script and run it, it should print out something like this, demonstrating that one of the documents has an encrypted SSN, and the other's is plaintext:
```python
 * {'_id': ObjectId('5fc12f13516b61fa7a99afba'), 'full_name': 'Sophia Duleep Singh', 'ssn': Binary(b'\x02\xe3\xfawt\xb8\x1e@\xfc\xaeI\xa1\x1f\xf8\xd7]\x1f\x02\xd8+,\x9el\xfe\xee\xa7\xd9\x87+\xb9p\x9a\xe7\xdcjY\x98\x82]7\xf0\xa4G[]\xd2OE\xbe+\xa3\x8b\xf5\x9f\x90u6>\xf3(6\x9c\x1f\x8e\xd8\x02\xe5\xb5h\xc64i>\xbf\x06\xf6\xbb\xdb\xad\xf4\xacp\xf1\x85\xdbp\xeau\x05\xe4Z\xe9\xe9\xd0\xe9\xe1n<', 6)}
 * {'_id': ObjectId('5fc12f14516b61fa7a99afc0'), 'full_name': 'Dora Thewlis', 'ssn': '234-23-2345'}
```
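If you suspect a collection already contains records like Dora's, a quick audit can find them. Here's a minimal sketch that works on documents you've already fetched with a client that has no CSFLE configuration. The function name find_plaintext_ssns is hypothetical, and the sample documents are shortened stand-ins for the two printed above, with plain bytes standing in for the Binary value.

```python
def find_plaintext_ssns(documents):
    """Return the documents whose 'ssn' is stored as a plain string.

    Read through a client without CSFLE configuration, encrypted values
    come back as binary data, so a str in this field means the value
    was never encrypted.
    """
    return [doc for doc in documents if isinstance(doc.get("ssn"), str)]


# Stand-ins for the two documents printed above (ciphertext shortened):
people = [
    {"full_name": "Sophia Duleep Singh", "ssn": b"\x02\xe3\xfawt\xb8\x1e@"},
    {"full_name": "Dora Thewlis", "ssn": "234-23-2345"},
]

for doc in find_plaintext_ssns(people):
    print("Plaintext ssn found for:", doc["full_name"])
# prints: Plaintext ssn found for: Dora Thewlis
```

Against a live collection, a filter such as {"ssn": {"$type": "string"}} passed to find() should select the same documents on the server. An audit like this only tells you about the problem after the fact, though.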
Fortunately, MongoDB provides the ability to attach a validator to a collection, to ensure that the data stored is encrypted according to the schema.
In order to have a schema defined on the server-side, return to your create_key.py script, and instead of writing out the schema to a JSON file, provide it to the create_collection method as a JSON Schema validator:

```python
# create_key.py

print("Creating 'people' collection in 'csfle_demo' database (with schema) ...")
client.csfle_demo.create_collection(
    "people",
    codec_options=CodecOptions(uuid_representation=STANDARD),
    validator={"$jsonSchema": schema},
)
```
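create_collection works here because create_key.py drops the database first, so the people collection doesn't exist yet. If you'd rather attach the validator to a collection that already exists, MongoDB's collMod command accepts the same validator document. Here's a minimal sketch, where build_collmod_command is a helper name of my own invention and the schema shown is a trimmed placeholder:

```python
def build_collmod_command(collection_name, schema):
    """Return a collMod command that attaches a $jsonSchema validator.

    Hypothetical helper. The command name must be the first key, which
    Python dictionaries guarantee by preserving insertion order.
    """
    return {
        "collMod": collection_name,
        "validator": {"$jsonSchema": schema},
    }


# Trimmed placeholder; in practice, pass the full schema from create_key.py.
placeholder_schema = {"bsonType": "object", "properties": {}}
command = build_collmod_command("people", placeholder_schema)
print(command)
```

You would then run it with something like client.csfle_demo.command(command). I haven't needed this in the tutorial code, so treat it as a starting point rather than a tested recipe.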
Providing a validator attaches the schema to the created collection, so there's no need to save the file locally, no need to read it into csfle_main.py, and no need to provide it to MongoClient anymore. It will be stored and enforced by the server. This simplifies both the key generation code and the code to query the database, and it ensures that the SSN field will always be encrypted correctly. Bonus!
The definition of csfle_opts becomes:

```python
# csfle_main.py

csfle_opts = AutoEncryptionOpts(
    kms_providers,
    "csfle_demo.__keystore",
)
```
By completing this quick start, you've learned how to:
- Create a secure random key for encrypting data keys in MongoDB.
- Use local key storage to store a key during development.
- Create a key in MongoDB (encrypted with your local key) to encrypt data in MongoDB.
- Use a JSON Schema to define which fields should be encrypted.
- Assign the JSON Schema to a collection to validate encrypted fields on the server.
As mentioned earlier, you should not use local key storage to manage your key in production - it's insecure. You can store the key manually in a KMS of your choice, such as HashiCorp Vault, or if you're using one of the three major cloud providers, their KMS services are already integrated into PyMongo. Read the documentation to find out more.
There is a lot of documentation about Client-Side Field-Level Encryption, in different places. Here are the docs I found useful when writing this post:
If CSFLE doesn't quite fit your security requirements, you should check out our other security docs, which cover encryption at rest and configuring transport encryption, among other things.
As always, if you have any questions, or if you've built something cool, let us know on the MongoDB Community Forums!