Can You Keep a Secret?
Rate this article
The median time to discovery for a secret key leaked to GitHub is 20 seconds. By the time you realise your mistake and rotate your secrets, it could be too late. In this talk, we'll look at some techniques for secret management which won't disrupt your workflow, while keeping your services safe.
This is a complete transcript of the 2020 PyCon Australia conference talk "Can you keep a secret?" Slides are also available to download on Notist.
Hey, everyone. Thank you for joining me today.
Before we get started, I would just like to take a moment to express my heartfelt thanks, gratitude, and admiration to everyone involved with this year's PyCon Australia. They have done such an amazing job, in a really very difficult time.
It would have been so easy for them to have skipped putting on a conference at all this year, and no one would have blamed them if they did, but they didn't, and what they achieved should really be celebrated. So a big thank you to them.
With that said, let's get started!
So, I'm Aaron Bassett.
I am a Senior Developer Advocate at MongoDB.
For anyone who hasn't heard of MongoDB before, it is a general purpose, document-based, distributed database, often referred to as a No-SQL database. We have a fully managed cloud database service called Atlas, an on-premise Enterprise Server, an on device database called Realm, but we're probably most well known for our free and open source Community Server.
In fact, much of what we do at MongoDB is open source, and as a developer advocate, almost the entirety of what I produce is open source and publicly available. Whether it is a tutorial, demo app, conference talk, Twitch stream, and so on. It's all out there to use.
Here's an example of the type of code I write regularly. This is a small snippet to perform a geospatial query.
First, we import our MongoDB Python Driver. Then, we instantiate our database client. And finally, we execute our query. Here, we're trying to find all documents whose location is within a defined radius of a chosen point.
But even in this short example, we have some secrets that we really shouldn't be sharing. The first line highlighted here is the URI. This isn't so much a secret as a configuration variable.
Something that's likely to change between your development, staging, and production environments. So, you probably don't want this hard coded either. The next line, however, is the real secrets. Our database username and password. These are the types of secrets you never want to hard code in your scripts, not even for a moment.
So often I see it where someone has pulled out their secrets into variables, either at the top of their script§ or sometimes they'll hard code them in a settings.py or similar. I've been guilty of this as well.
You have every intention of removing the secrets before you publish your code, but then it's a couple of days later, the kids are trying to get your attention, you need to go make your morning coffee, or there's one of the million other things that happen in our day-to-day lives distracting you, and as you get up, you decide to save your working draft, muscle memory kicks in...
And... well... that's all it takes.
All it takes is that momentary lapse and now your secrets are public, and as soon as those secrets hit GitHub or another public repository, you have to assume they're immediately breached.
Michael Meli, Matthew R. McNiece, and Bradley Reaves from North Carolina State University published a research paper titled "How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories".
This research showed that the median time for discovery for a secret published to GitHub was 20 seconds, and it could be as low as half a second. It appeared to them that the only limiting factor on how fast you could discover secrets on GitHub was how fast GitHub was able to index new code as it was pushed up.
The longest time in their testing from secrets being pushed until they could potentially be compromised was four minutes. There was no correlation between time of day, etc. It most likely would just depend on how many other people were pushing code at the same time. But once the code was indexed, then they were able to locate the secrets using some well-crafted search queries.
But this is probably not news to most developers. Okay, the speed of which secrets can be compromised might be surprising, but most developers will know the perils of publishing their secrets publicly.
Many of us have likely heard or read horror stories of developers accidentally committing their AWS keys and waking up to a huge bill as someone has been spinning up EC2 instances on their account. So why do we, and I'm including myself in that we, why do we keep doing it?
Because it is easy. We know it's not safe. We know it is likely going to bite us in the ass at some point. But it is so very, very easy. And this is the case in most software.
This is the security triangle. It represents the balance between security, functionality, and usability. It's a trade-off. As two points increase, one will always decrease. If we have an app that is very, very secure and has a lot of functionality, it's probably going to feel pretty restrictive to use. If our app is very secure and very usable, it probably doesn't have to do much.
A good example of where a company has traded some security for additional functionality and usability is Amazon's One Click Buy button.
It functions very much as the name implies. When you want to order a product, you can click a single button and Amazon will place your order using your default credit card and shipping address from their records. What you might not be aware of is that Amazon cannot send the CVV with that order. The CVV is the normally three numbers on the back of your card above the signature strip.
Card issuers say that you should send the CVV for each Card Not Present transaction. Card Not Present means that the retailer cannot see that you have the physical card in your possession, so every online transaction is a Card Not Present transaction.
Okay, so the issuers say that you should send the CVV each time, but they also say that you MUST not store it. This is why for almost all retailers, even if they have your credit card stored, you will still need to enter the CVV during checkout, but not Amazon. Amazon simply does not send the CVV. They know that decreases their security, but for them, the trade-off for additional functionality and ease of use is worth it.
A bad example of where a company traded sanity—sorry, I mean security—for usability happened at a, thankfully now-defunct, agency I worked at many, many years ago. They decided that while storing customer's passwords in plaintext lowered their security, being able to TELL THE CUSTOMER THEIR PASSWORD OVER THE TELEPHONE WHEN THEY CALLED was worth it in usability.
It really was the wild wild west of the web in those days...
So a key tenant of everything I'm suggesting here is that it has to be as low friction as possible. If it is too hard, or if it reduces the usability side of our triangle too much, then people will not adopt it.
It also has to be easy to implement. I want these to be techniques which you can start using personally today, and have them rolled out across your team by this time next week.
It can't have any high costs or difficult infrastructure to set up and manage. Because again, we are competing with hard code variables, without a doubt the easiest method of storing secrets.
So how do we know when we're done? How do we measure success for this project? Well, for that, I'm going to borrow from the 12 factor apps methodology.
The 12 factor apps methodology is designed to enable web applications to be built with portability and resilience when deployed to the web. And it covers 12 different factors.
Codebase, dependencies, config, backing services, build, release, run, and so on. We're only interested in number 3: Config.
Here's what 12 factor apps has to say about config;
"A litmus test for whether an app has all config correctly factored out of the code is whether the codebase could be made open source at any moment, without compromising any credentials"
And this is super important even for those of you who may never publish your code publicly. What would happen if your source code were to leak right now? In 2015, researchers at internetwache found that 9700 websites in Alexa's top one million had their .git folder publicly available in their site root. This included government websites, NGOs, banks, crypto exchanges, large online communities, a few porn sites, oh, and MTV.
Deploying websites via Git pulls isn't as uncommon as you might think, and for those websites, they're just one server misconfiguration away from leaking their source code. So even if your application is closed source, with source that will never be intentionally published publicly, it is still imperative that you do not hard code secrets.
Leaking your source code would be horrible. Leaking all the keys to your kingdom would be devastating.
So if we can't store our secrets in our code, where do we put them? Environment variables are probably the most common place.
Now remember, we're going for ease of use and low barrier to entry. There are better ways for managing secrets in production. And I would highly encourage you to look at products like HashiCorp's Vault. It will give you things like identity-based access, audit logs, automatic key rotation, encryption, and so much more. But for most people, this is going to be overkill for development, so we're going to stick to environment variables.
But what is an environment variable? It is a variable whose value is set outside of your script, typically through functionality built into your operating system and are part of the environment in which a process runs. And we have a few different ways these can be accessed in Python.
Here we have the same code as earlier, but now we've removed our hard coded values and instead we're using environment variables in their place. Environ is a mapping object representing the environment variables. It is worth noting that this mapping is captured the first time the os module is imported, and changes made to the environment after this time will not be reflected in environ. Environ behaves just like a Python dict. We can reference a value by providing the corresponding key. Or we can use get.
The main difference between the two approaches is when using get, if an environment variable does not exist, it will return None, whereas if you are attempting to access it via its key, then it will raise a KeyError exception. Also, get allows you to provide a second argument to be used as a default value if the key does not exist. There is a third way you can access environment variables: getenv.
getenv behaves just like environ.get. In fact, it behaves so much like it I dug through the source to try and figure out what the difference was between the two and the benefits of each. But what I found is that there is no difference. None.
getenv is simply a wrapper around environ.get. I'm sure there is a reason for this beyond saving a few key strokes, but I did not uncover it during my research. If you know the reasoning behind why getenv exists, I would love to hear it.
Joe Drumgoole has put forward a potential reason for why
getenv
might exist: "I think it exists because the C library has an identical function called getenv() and it removed some friction for C programmers (like me, back in the day) who were moving to Python."Now we know how to access environment variables, how do we create them? They have to be available in the environment whenever we run our script, so most of the time, this will mean within our terminal. We could manually create them each time we open a new terminal window, but that seems like way too much work, and very error-prone. So, where else can we put them?
If you are using virtualenv, you can manage your environment variables within your activate script.
It's a little back to front in that you'll find the deactivate function at the top, but this is where you can unset any environment variables and do your housekeeping. Then at the bottom of the script is where you can set your variables. This way, when you activate your virtual environment, your variables will be automatically set and available to your scripts. And when you deactivate your virtual environment, it'll tidy up after you and unset those same variables.
Personally, I am not a fan of this approach.
I never manually alter files within my virtual environment. I do not keep them under source control. I treat them as wholly disposable. At any point, I should be able to delete my entire environment and create a new one without fear of losing anything. So, modifying the activate script is not a viable option for me.
Instead, I use direnv. direnv is an extension for your shell. It augments existing shells with a new feature that can load and unload environment variables depending on the current directory. What that means is when I cd into a directory containing an .envrc file, direnv will automatically set the environment variables contained within for me.
Let's look at a typical direnv workflow. First, we create an .envrc file and add some export statements, and we get an error. For security reasons, direnv will not load an .envrc file until you have allowed it. Otherwise, you might end up executing malicious code simply by cd'ing into a directory. So, let's tell direnv to allow this directory.
Now that we've allowed the .envrc file, direnv has automatically loaded it for us and set the DB_PASSWORD environment variable. Then, if we leave the directory, direnv will unload and clean up after us by unsetting any environment variables it set.
Now, you should NEVER commit your envrc file. I advise adding it to your projects gitignore file and your global gitignore file. There should be no reason why you should ever commit an .envrc file.
You will, however, want to share a list of what environment variables are required with your team. The convention for this is to create a .envrc.example file which only includes the variable names, but no values. You could even automate this grep or similar.
We covered keeping simple secrets out of your source code, but what about if you need to share secret files with coworkers? Let's take an example of when you might need to share a file in your repo, but ensure that even if your repository becomes public, only those authorised to access the file can do so.
With encryption at rest, the encryption occurs transparently in the storage layer; i.e. all data files are fully encrypted from a filesystem perspective, and data only exists in an unencrypted state in memory and during transmission.
With client-side field level encryption, applications can encrypt fields in documents prior to transmitting data over the wire to the server.
Only applications with access to the correct encryption keys can decrypt and read the protected data. Deleting an encryption key renders all data encrypted using that key as permanently unreadable. So. with Encryption at Rest. each database has its own encryption key and then there is a master key for the server. But with client-side field level encryption. you can encrypt individual fields in documents with customer keys.
I should point out that in production, you really should use a key management service for either of these. Like, really use a KMS. But for development, you can use a local key.
These commands generate a keyfile to be used for encryption at rest, set the permissions, and then enables encryption on my server. Now, if multiple developers needed to access this encrypted server, we would need to share this keyfile with them.
And really, no one is thinking, "Eh... just Slack it to them..." We're going to store the keyfile in our repo, but we'll encrypt it first.
git-secret encrypts files and stores them inside the git repository. Perfect. Exactly what we need. With one little caveat...
Remember these processes all need to be safe and EASY. Well, git-secret is easy... ish.
Git-secret itself is very straightforward to use. But it does rely upon PGP. PGP, or pretty good privacy, is an encryption program that provides cryptographic privacy and authentication via public and private key pairs. And it is notoriously fiddly to set up.
There's also the problem of validating a public key belongs to who you think it does. Then there are key signing parties, then web of trust, and lots of other things that are way out of scope of this talk.
Thankfully, there are pretty comprehensive guides for setting up PGP on every OS you can imagine, so for the sake of this talk, I'm going to assume you already have PGP installed and you have your colleagues' public keys.
So let's dive into git-secret. First we initiate it, much the same as we would a git repository. This will create a hidden folder .gitsecret. Now we need to add some users who should know our secrets. This is done with git secret tell followed by the email address associated with their public key.
When we add a file to git-secret, it creates a new file. It does not change the file in place. So, our unencrypted file is still within our repository! We must ensure that it is not accidentally committed. Git-secret tries to help us with this. If you add a file to git-secret, it'll automatically add it to your .gitignore, if it's not already there.
If we take a look at our gitignore file after adding our keyfile to our list of secrets, we can see that it has been added, along with some files which .gitsecret needs to function but which should not be shared.
At this point if we look at the contents of our directory we can see our unencrypted file, but no encrypted version. First we have to tell git secret to hide all the files we've added. Ls again and now we can see the encrypted version of the file has been created. We can now safely add that encrypted version to our repository and push it up.
When one of our colleagues pulls down our encrypted file, they run reveal and it will use their private key to decrypt it.
Git-secret comes with a few commands to make managing secrets and access easier.
- Whoknows will list all users who are able to decrypt secrets in a repository. Handy if someone leaves your team and you need to figure out which secrets need to be rotated.
- List will tell you which files in a repository are secret.
- And if someone does leave and you need to remove their access, there is the rather morbidly named killperson.
The killperson command will ensure that the person cannot decrypt any new secrets which are created, but it does not re-encrypt any existing secrets, so even though the person has been removed, they will still be able to decrypt any existing secrets.
There is little point in re-encrypting the existing files as they will need to be rotated anyways. Then, once the secret has been rotated, when you run hide on the new secret, the removed user will not be able to access the new version.
Another tool I want to look at is confusingly called git secrets, because the developers behind git tools have apparently even less imagination than I do.
git-secrets scans commits, commit messages, and --no-ff merges to prevent adding secrets into your git repositories
All the tools and processes we've looked at so far have attempted to make it easier to safely manage secrets. This tool, however, attacks the problem in a different way. Now we're going to make it more difficult to hard code secrets in your scripts.
Git-secrets uses regexes to attempt to detect secrets within your commits. It does this by using git hooks. Git secrets install will generate some Git templates with hooks already configured to check each commit. We can then specify these templates as the defaults for any new git repositories.
Git-secrets is from AWS labs, so it comes with providers to detect AWS access keys, but you can also add your own. A provider is simply a list of regexes, one per line. Their recommended method is to store them all in a file and then cat them. But this has some drawbacks.
So some regexes are easy to recognise. This is the regex for an RSA key. Straight forward. But what about this one? I'd love to know if anyone recognises this right away. It's a regex for detecting Google oAuth access tokens. This one? Facebook access tokens.
So as you can see, having a single large file with undocumented regexes could quickly become very difficult to maintain. Instead, I place mine in a directory, neatly organised. Seperate files depending on the type of secret I want to detect. Then in each file, I have comments and whitespace to help me group regexes together and document what secret they're going to detect.
But, git-secrets will not accept these as a provider, so we need to get a little creative with egrep.
We collect all the files in our directory, strip out any lines which start with a hash or which are empty, and then return the result of this transformation to git-secrets. Which is exactly the input we had before, but now much more maintainable than one long undocumented list!
With git-secrets and our custom providers installed, if we try to commit a private key, it will throw an error. Now, git-secrets can produce false positives. The error message gives you some examples of how you can force your commit through. So if you are totally committed to shooting yourself in the foot, you still can. But hopefully, it introduces just enough friction to make hardcoding secrets more of a hassle than just using environment variables.
Finally, we're going to look at a tool for when all else fails. Gitleaks
Audit git repos for secrets. Gitleaks provides a way for you to find unencrypted secrets and other unwanted data types in git source code repositories. Git leaks is for when even with all of your best intentions, a secret has made it into your repo. Because the only thing worse than leaking a secret is not knowing you've leaked a secret.
It works in much the same way as git-secrets, but rather than inspecting individual commits you can inspect a multitude of things.
- A single repo
- All repos by a single user
- All repos under an organisation
- All code in a GitHub PR
- And it'll also inspect Gitlab users and groups, too
I recommend using it in a couple of different ways.
- Have it configured to run as part of your PR process. Any leaks block the merge.
- Run it against your entire organisation every hour/day/week, or at whatever frequency you feel is sufficient. Whenever it detects a leak, you'll get a nice report showing which rule was triggered, by which commit, in which file, and who authored it.
In closing...
- Keep secrets and code separate.
- If you must share secrets, encrypt them first. Yes, PGP can be fiddly, but it's worth it in the long run.
- Automate, automate, automate. If your secret management requires lots of manual work for developers, they will skip it. I know I would. It's so easy to justify to yourself. It's just this once. It's just a little proof of concept. You'll totally remember to remove them before you push. I've made all the same excuses to myself, too. So, keep it easy. Automate where possible.
- And late is better than never. Finding out you've accidentally leaked a secret is a stomach-dropping, heart-racing, breath-catching experience. But leaking a secret and not knowing until after it has been compromised is even worse. So, run your gitleak scans. Run them as often as you can. And have a plan in place for when you do inevitably leak a secret so you can deal with it quickly.
Thank you very much for your attention.
Please do add me on Twitter at aaron bassett. I would love to hear any feedback or questions you might have! If you would like to revisit any of my slides later, they will all be published at Notist shortly after this talk.
I'm not sure how much time we have left for questions, but I will be available in the hallway chat if anyone would like to speak to me there. I know I've been sorely missing seeing everyone at conferences this year, so it will be nice to catch up.
Thanks again to everyone who attended my talk and to the PyCon Australia organisers.