Safe Software Deployments: Z Deployments

Mark Porter
October 27, 2021 | Updated: February 28, 2022

If you’ve gotten this far in my Safe Software Deployment series, you know how scary deployment day can be. Sleepless nights. Knots in the stomach. Cold sweats. These are the symptoms of uncertainty. And three decades of experience have taught me that all the positive thinking in the world won’t ensure a bug-free deployment.

That’s why I’ve developed a number of techniques that can consistently help teams minimize fear and achieve safe software deployment. In the last post, we discussed the 180 Rule. The purpose of this post is to explain how you can use “Z Deployments” to mitigate both fear and downtime. In future posts, we’ll look at both the Goldilocks Gauge and Through the Looking Glass.

Z Deployments are more than a catchy name. They are all about preventing failed rollbacks, which in my experience are the biggest source of downtime in any software deployment pipeline. Now, we all try our best to eliminate the need for rollbacks in the first place - but when they do happen, we want them to be successful. However, in most companies, rollbacks are only tested in Prod, not in the prior stages of the pipeline. Even if you use the 180 Rule, which encourages quick and automated rollbacks, you don’t have any more certainty that those rollbacks will work. This is where Z Deployments come in.

With a Z Deployment, the goal is to make rollbacks just as predictable and reliable as your normal “roll forward” software deployments. I call this technique a Z Deployment, because if you chart out the process, it looks like a Z. But you can also think of Z Deployments as akin to pressing “Command Z” on your keyboard: undo. Fast, simple, no drama. Here’s how it works.

1. Roll your code forward from development into staging. In staging, do your canary testing.
2. Then roll back into development. Do your canary testing again. If it doesn’t work, then you just proved that your rollback code was faulty in some way.
3. Roll your code forward into staging again, and do your full testing.
4. If it’s successful, roll your code forward into production.
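
To make the shape of the Z concrete, here is a minimal sketch of those four steps in Python. Everything named here is hypothetical: `z_deploy` and the four callables (`deploy`, `rollback`, `canary_test`, `full_test`) stand in for whatever your own pipeline tooling provides. The sketch also assumes one reading of step 2, namely that rolling back means undoing the release in staging and then re-running the canary tests there.

```python
from typing import Callable, List, Tuple

Action = Callable[[str], None]   # hypothetical: acts on a named environment
Check = Callable[[str], bool]    # hypothetical: returns True when tests pass


def z_deploy(deploy: Action, rollback: Action,
             canary_test: Check, full_test: Check) -> Tuple[bool, List[str]]:
    """Run the four Z Deployment steps; return (succeeded, trace of steps)."""
    trace: List[str] = []

    # Step 1: roll forward into staging, then run the canary tests.
    deploy("staging")
    trace.append("deploy:staging")
    if not canary_test("staging"):
        trace.append("FAIL: canary after roll forward")
        return False, trace

    # Step 2: roll back and run the canary tests again.
    # A failure here means the rollback code itself is faulty.
    rollback("staging")
    trace.append("rollback:staging")
    if not canary_test("staging"):
        trace.append("FAIL: canary after rollback")
        return False, trace

    # Step 3: roll forward into staging again and run the full tests.
    deploy("staging")
    trace.append("deploy:staging")
    if not full_test("staging"):
        trace.append("FAIL: full tests")
        return False, trace

    # Step 4: only now roll forward into production.
    deploy("production")
    trace.append("deploy:production")
    return True, trace


if __name__ == "__main__":
    # Stub callables that always succeed, just to show the order of steps.
    ok, steps = z_deploy(lambda env: None, lambda env: None,
                         lambda env: True, lambda env: True)
    print(ok, steps)
```

The point of the design is that production is never touched unless the rollback has already been exercised once in staging, so a broken rollback shows up as a failed step 2 rather than as downtime.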

Of course, this only works if your staging environment is clean and your team trusts it. I’ll get into this more in a future post called “Through the Looking Glass.” But the bottom line is that developers need to know that things will work in production, including any needed rollbacks. And the only way to do that is to test rollbacks in staging. Your version of canary tests and full tests might be different - in a perfect world you’d run the full test suite all three times, but often build systems aren’t set up to do that quickly enough.

Too often, staging is not clean. But generally, when developers deploy to staging, their added functionality tends to work. Everyone else is using staging, and their functionality is working, too. This is the “Happy Path” - where engineers test that their new thing works. That sounds great. But what else happens? Adjacent things get broken.

Often when you roll back, you’re not necessarily returning to your system’s original state, either for your own software change or for the adjacent software components. Your rollback code has to undo all the state changes your deployment to staging (or prod) may have made. Otherwise, the staging environment becomes polluted, and the results in staging won’t match the results in production. Developers lose faith in staging, and deployment again becomes a terrifying ordeal.
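
One way to catch this kind of pollution is to snapshot the environment’s state before deploying and compare it with the state after the rollback. The sketch below is only an illustration under simplifying assumptions: the environment’s state is modeled as a plain dictionary, and `state_drift` along with the keys in the example (`schema_version`, `feature_flags`, `queues`) are hypothetical names, not part of any real tool.

```python
import copy
from typing import Any, Dict, List


def state_drift(before: Dict[str, Any], after: Dict[str, Any]) -> List[str]:
    """List the differences between two snapshots of an environment's state."""
    drift: List[str] = []
    for key in sorted(set(before) | set(after)):
        if key not in after:
            drift.append(f"missing after rollback: {key}")
        elif key not in before:
            drift.append(f"left behind by deployment: {key}")
        elif before[key] != after[key]:
            drift.append(f"changed: {key}")
    return drift


if __name__ == "__main__":
    staging = {"schema_version": 41, "feature_flags": {"new_checkout": False}}
    baseline = copy.deepcopy(staging)  # snapshot taken before the deployment

    # The deployment migrates the schema, flips a flag, and adds a queue.
    staging["schema_version"] = 42
    staging["feature_flags"]["new_checkout"] = True
    staging["queues"] = ["checkout-events"]

    # A faulty rollback reverts the schema and the flag but forgets the queue.
    staging["schema_version"] = 41
    staging["feature_flags"]["new_checkout"] = False

    print(state_drift(baseline, staging))
    # prints ['left behind by deployment: queues']
```

An empty list after the rollback step is the signal that staging really did return to where it started; anything else is a rollback bug found before it reached production.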

I used to work with someone who was absolutely obsessive about staging. He ran testing, and he refused to have a long-term staging environment. Instead, his team blew away staging every month and rebuilt it from scratch. Did I like this? Absolutely. Did it work? Yes. Developers trusted staging, which meant that deployments to prod were less scary.

The next step of safe software deployment is to embrace the Goldilocks Gauge, which helps make deployments routine and even boring – in a good way. It also makes both the 180 Rule and Z Deployments easier to execute, and it’s a necessity for teams working toward continuous development. In the meantime, feel free to share your own techniques for safe deployments at @MarkLovesTech.
