Safe Software Deployments: The 180 Rule

Mark Porter
October 20, 2021 | Updated: February 28, 2022

In my last post, I talked about the anxiety developers feel when they deploy software, and the negative impact that fear has on innovation. Today, I’m offering the first of four methods I’ve used to help teams overcome that fear: The 180 Rule.

Developers need to be able to get software into production and, if it doesn’t work, back it out as quickly as possible and return the system to its prior working state. If they know they can detect problems and fix them quickly, they can feel more confident about deploying.

All deployments have the same overall stages:

Deployment: You roll the software from staging to production, either in pieces -- by directing more and more transactions to it -- or by flipping a switch. This involves getting binaries or configuration files reliably to production and having the system start using them.
Monitoring: How does the system behave under live load? Do we have signals that the software is behaving correctly and performing well? It’s essential that this monitoring focuses more on the existing functionality than on the “Happy Path” of the new functionality. In other words, did we damage the system through the rollout?
Rollback: If there is any hint that the system is not working correctly, the change needs to be quickly rolled back from production. In a sense, a rollback is a kind of deployment, because you’re making another change to the live system: returning it to a prior state.
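
The three stages above can be sketched as one small control loop. This is only an illustration under stated assumptions: `run_release` and the `deploy`, `is_healthy`, and `rollback` hooks are hypothetical names standing in for whatever your own tooling provides.

```python
from typing import Callable


def run_release(
    deploy: Callable[[], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
    checks: int = 3,
) -> str:
    """Deploy, monitor, and roll back if any health check fails.

    All three callables are hypothetical hooks supplied by the caller.
    """
    deploy()  # Stage 1: get the new version serving traffic
    for _ in range(checks):  # Stage 2: sample the health signals
        if not is_healthy():
            rollback()  # Stage 3: return to the prior working state
            return "rolled_back"
    return "deployed"
```

Note that the rollback path is just another call through the same machinery, which matches the point that a rollback is itself a kind of deployment.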

The “180” in the name of the rule has a double meaning. Of course, we’re referring here to the “180 degree” about-face of a rollback. But it’s also a reference to an achievable goal of any deployment. I believe that any environment should be able to deploy software to production and, if it doesn’t work, roll it back, all within three minutes, or 180 seconds. This gives 60 seconds to roll binaries to the fleet and point your customers to them, 60 seconds to see whether your transaction loads or your canaries reveal problems, and then 60 seconds to roll back the binaries or configurations if needed. Of course, in your industry or for your product, you might need this to be shorter. But the bottom line is that a failed software deployment should not live in production for more than three minutes.
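
To make the arithmetic concrete, here is a minimal sketch that checks measured stage durations against the 60/60/60 split and the 180-second total described above. The `PhaseTimings` record and `budget_violations` function are hypothetical names, and how you actually measure each stage is left to your own tooling.

```python
from dataclasses import dataclass
from typing import List

PHASE_BUDGET_S = 60.0   # per-stage budget from the 60/60/60 split
TOTAL_BUDGET_S = 180.0  # the "180" in the 180 Rule


@dataclass
class PhaseTimings:
    """Measured wall-clock seconds for each stage (hypothetical record)."""
    deploy_s: float
    monitor_s: float
    rollback_s: float


def budget_violations(t: PhaseTimings) -> List[str]:
    """Return a description of every budget overrun, or an empty list."""
    problems = []
    stages = (("deploy", t.deploy_s), ("monitor", t.monitor_s), ("rollback", t.rollback_s))
    for name, seconds in stages:
        if seconds > PHASE_BUDGET_S:
            problems.append(f"{name} took {seconds:.0f}s (budget {PHASE_BUDGET_S:.0f}s)")
    total = t.deploy_s + t.monitor_s + t.rollback_s
    if total > TOTAL_BUDGET_S:
        problems.append(f"total took {total:.0f}s (budget {TOTAL_BUDGET_S:.0f}s)")
    return problems
```

For example, `budget_violations(PhaseTimings(90, 30, 30))` flags only the deploy stage: the total of 150 seconds still fits within the rule, but the 90-second rollout leaves less room for monitoring and rollback than the split allows.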

Developers follow these three stages all the time, and they often do it manually. I know what you’re thinking: “How can any human being deploy, monitor, and roll back software that fast?” And that is the hidden beauty of the 180 Rule. The only way to meet this requirement is by automating the process. Instead of making the decisions ourselves, we must teach the computers how to gather the information and make the decisions. Sadly, this is a fundamental change for many companies. But it’s a necessary change. Because the alternative is hoping things will work while fearing that they will not. And that makes developers loath to deploy software.
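
As one possible illustration of letting the computer make the call, the sketch below compares signals from before and after a rollout and returns a rollback decision. The `Snapshot` shape, the `should_roll_back` function, and the threshold defaults are all assumptions made for the example, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Aggregated signals over one observation window (hypothetical shape)."""
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th-percentile latency in milliseconds


def should_roll_back(
    baseline: Snapshot,
    current: Snapshot,
    max_error_increase: float = 0.01,
    max_latency_ratio: float = 1.2,
) -> bool:
    """Decide without a human: did the rollout make existing traffic worse?

    The thresholds are illustrative defaults, not recommendations.
    """
    errors_regressed = current.error_rate - baseline.error_rate > max_error_increase
    latency_regressed = current.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio
    return errors_regressed or latency_regressed
```

Comparing against a baseline taken before the rollout keeps the focus on whether existing functionality was damaged, rather than on whether the new feature's happy path works.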

Sure, there are a lot of tools out there that help with deployments. But this is not an off-the-shelf, set-it-and-forget-it scenario. You, as the developer, must provide those tools with the right metrics to monitor and the right scripts to both deploy the software and possibly roll it back. The 180 Rule does not specify which tools to use. Instead, it forces developers to create rigorous scripts and metrics, and to ensure they can reliably detect and fix problems quickly.
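
Whatever tool you choose, the scripts you hand it need to fail loudly and finish on time. Here is a minimal sketch, using Python's standard `subprocess` module, of running one deploy or rollback script under a hard time limit. The `run_step` helper and the `NOOP` stand-in command are hypothetical; a real setup would pass its own script path.

```python
import subprocess
import sys
from typing import List

# Portable stand-in for a real deploy or rollback script, so the sketch
# runs anywhere. Replace with your own command in practice.
NOOP = [sys.executable, "-c", "pass"]


def run_step(command: List[str], timeout_s: float = 60.0) -> bool:
    """Run one script; treat a non-zero exit, a timeout, or a missing
    executable as failure so the caller can react immediately."""
    try:
        subprocess.run(command, check=True, timeout=timeout_s)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return False
```

The 60-second default mirrors the per-stage budget of the rule. A script that hangs is treated the same as one that fails, because either way the clock is running.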

There’s a gotcha that many of you are thinking of: The 180 Rule is not applicable if the deployment is not reversible. For example, deploying a refactored relational schema can be a big problem, because a new schema might introduce information loss that prevents a rollback. Or the deployment might delete some old config files that aren’t used by the new software.
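
One way to at least notice this kind of problem before it ships is a pre-flight check that flags schema changes likely to discard data. The sketch below is a rough heuristic under stated assumptions: the pattern list is illustrative and far from exhaustive, `irreversible_statements` is a hypothetical helper, and the naive split on semicolons would mishandle semicolons inside string literals.

```python
import re
from typing import List

# Statements that typically discard data, so a rollback cannot restore it.
# This list is a heuristic assumption, not an exhaustive catalog.
DESTRUCTIVE_PATTERNS = [
    r"\bDROP\s+TABLE\b",
    r"\bDROP\s+COLUMN\b",
    r"\bTRUNCATE\b",
]


def irreversible_statements(migration_sql: str) -> List[str]:
    """Return the statements in a migration that look destructive."""
    # Naive split: good enough for a sketch, not for arbitrary SQL.
    statements = [s.strip() for s in migration_sql.split(";") if s.strip()]
    return [
        s for s in statements
        if any(re.search(p, s, flags=re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS)
    ]
```

A non-empty result does not mean the change is wrong, only that it falls outside what a fast automated rollback can undo and deserves a different plan.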

I’ll talk more about how to avoid wicked problems like these in my subsequent posts. But for now, I’m interested to hear what you think of The 180 Rule, and whether you’re using any similar heuristics in your approach to safe deployment.
