Auto Pausing Inactive Clusters
A couple of years ago I wrote an article on how to pause and/or scale clusters using scheduled triggers. This article represents a twist on that concept, adding a wrinkle that will pause clusters across an entire organization based on inactivity. Specifically, I’m looking at the Database Access History to determine activity.
It is important to note this logging limitation:
If a cluster experiences an activity spike and generates an extremely large quantity of log messages, Atlas may stop collecting and storing new logs for a period of time.
Therefore, this script could get a false positive that a cluster is inactive when indeed quite the opposite is happening. Given, however, that the intent of this script is for managing lower, non-production environments, I don’t see the false positives as a big concern.
The implementation uses a Scheduled Trigger. The trigger calls a series of App Services Functions, which use the Atlas Administration APIs to iterate over the organization’s projects and their associated clusters, testing the cluster inactivity (as explained in the introduction) and finally pausing the cluster if it is indeed inactive.
In order to call the Atlas Administration API, you'll first need an API Key with the Organization Owner role. API Keys are created in the Access Manager, which you'll find in the Organization menu on the left or in the menu bar at the top.
Click Create API Key. Give the key a description and be sure to set the permissions to Organization Owner:
When you click Next, you'll be presented with your Public and Private keys. Save your private key as Atlas will never show it to you again.
As an extra layer of security, you also have the option to set an IP Access List for these keys. I'm skipping this step, so my key will work from anywhere.
Since this solution works across your entire Atlas organization, I like to host it in its own dedicated Atlas Project.
Atlas App Services provides a powerful application development backend as a service. To begin using it, just click the App Services tab.
You'll see that App Services offers a bunch of templates to get you started. For this use case, just select the first option to Build your own App:
You'll then be presented with options to link a data source, name your application and choose a deployment model. The current iteration of this utility doesn't use a data source, so you can ignore that step (App Services will create a free cluster for you). You can also leave the deployment model at its default (Global), unless you want to limit the application to a specific region.
I've named the application Atlas Cluster Automation:
At this point in our journey, you have two options:
- Simply import the App Services application and adjust any of the functions to fit your needs.
- Build the application from scratch (skip to the next section).
The extract has a dependency on the API Secret Key, thus the import will fail if it is not configured beforehand.
Use the Values menu on the left to Create a Secret named AtlasPrivateKeySecret containing the private key you created earlier (the secret is not in quotes).

Next, install realm-cli using npm:

    npm install -g mongodb-realm-cli
To configure your app with realm-cli, you must log in to Atlas using your API keys:

    realm-cli login --api-key="<Public API Key>" --private-api-key="<Private API Key>"
    Successfully logged in
Select the App Settings menu and copy your Application ID.

Run the following realm-cli push command from the directory where you extracted the export:

    realm-cli push --remote="<Your App ID>"

    ...
    A summary of changes
    ...

    ? Please confirm the changes shown above Yes
    Creating draft
    Pushing changes
    Deploying draft
    Deployment complete
    Successfully pushed app up:
After the import, replace the AtlasPublicKey value with your API public key.
The trigger is scheduled to fire every 30 minutes. Note that the pauseClusters function the trigger calls currently only logs cluster activity. This is so you can monitor and verify that the function behaves as you desire. When ready, uncomment the line that calls the pauseCluster function:

    if (!is_active) {
      console.log(`Pausing ${project.name}:${cluster.name} because it has been inactive for more than ${minutesInactive} minutes`);
      //await context.functions.execute("pauseCluster", project.id, cluster.name, pause);
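If you'd rather not comment and uncomment code, a dry-run flag is another way to get the same safety net. The sketch below is my own illustration, not part of the exported app: DRY_RUN, pauseIfAllowed, and the stand-in pause function are all hypothetical names, and the real call to context.functions.execute is replaced by a callback so the example runs under plain Node.js.

```javascript
// Hypothetical switch: true only logs, false actually pauses.
const DRY_RUN = true;

// pauseFn stands in for the call to
// context.functions.execute("pauseCluster", ...) in the real function.
async function pauseIfAllowed(dryRun, projectName, clusterName, pauseFn) {
  if (dryRun) {
    console.log(`[dry run] Would pause ${projectName}:${clusterName}`);
    return false;
  }
  await pauseFn();
  return true;
}

// Example with a stand-in pause function that just counts its calls.
let pauseCalls = 0;
const dryRunResult = pauseIfAllowed(DRY_RUN, 'demo-project', 'demo-cluster', async () => {
  pauseCalls += 1;
});
```

Flipping DRY_RUN to false is then the only change needed once the logs look right.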
In addition, the pauseClusters function can be configured to exclude projects (such as those dedicated to production workloads):

    /*
     * These project names are just an example.
     * The same concept could be used to exclude clusters or even
     * configure different inactivity intervals by project or cluster.
     * These configuration options could also be stored and read from
     * an Atlas database.
     */
    excludeProjects = ['PROD1', 'PROD2'];
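The comment hints that the same idea could drive different inactivity intervals per project. One possible shape for that is a lookup table with a default. This is only a sketch: the project names, the settingsFor helper, and the settings fields are all made up for illustration.

```javascript
// Hypothetical configuration: project names and values are examples only.
const DEFAULT_MINUTES_INACTIVE = 60;
const projectSettings = new Map([
  ['PROD1', { exclude: true }],
  ['PROD2', { exclude: true }],
  ['dev-sandbox', { minutesInactive: 120 }],
]);

// Returns the effective settings for a project, falling back to the defaults.
function settingsFor(projectName) {
  const overrides = projectSettings.get(projectName) || {};
  return {
    exclude: overrides.exclude === true,
    minutesInactive: overrides.minutesInactive || DEFAULT_MINUTES_INACTIVE,
  };
}

console.log(settingsFor('PROD1'));       // { exclude: true, minutesInactive: 60 }
console.log(settingsFor('dev-sandbox')); // { exclude: false, minutesInactive: 120 }
console.log(settingsFor('anything'));    // { exclude: false, minutesInactive: 60 }
```

A table like this could just as easily be loaded from an Atlas collection, as the comment suggests.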
Now that you have reviewed the draft, as a final step go ahead and deploy the App Services application.
To understand what's included in the application, here are the steps to build it yourself from scratch.
The functions we need to create will call the Atlas Administration API, so we need to store our API Public and Private Keys, which we will do using Values & Secrets. The sample code I provide references these values as AtlasPublicKey and AtlasPrivateKey, so use those same names unless you want to change the code where they’re referenced.

You'll find Values under the Build menu.

First, create a Value, AtlasPublicKey, for your public key (note, the key is in quotes).

Create a Secret, AtlasPrivateKeySecret, containing your private key (the secret is not in quotes).

The Secret cannot be accessed directly, so create a second Value, AtlasPrivateKey, that links to the secret.

The four functions that need to be created are pretty self-explanatory, so I’m not going to provide a bunch of additional explanations here.
The getProjects function is standalone and can be test run from the App Services console to see the list of all the projects in your organization.

    /*
     * Returns an array of the projects in the organization
     * See https://docs.atlas.mongodb.com/reference/api/project-get-all/
     *
     * Returns an array of objects, e.g.
     *
     * {
     *   "clusterCount": {
     *     "$numberInt": "1"
     *   },
     *   "created": "2021-05-11T18:24:48Z",
     *   "id": "609acbef1b76b53fcd37c8e1",
     *   "links": [
     *     {
     *       "href": "https://cloud.mongodb.com/api/atlas/v1.0/groups/609acbef1b76b53fcd37c8e1",
     *       "rel": "self"
     *     }
     *   ],
     *   "name": "mg-training-sample",
     *   "orgId": "5b4e2d803b34b965050f1835"
     * }
     *
     */
    exports = async function() {

      // Get stored credentials...
      const username = await context.values.get("AtlasPublicKey");
      const password = await context.values.get("AtlasPrivateKey");

      const arg = {
        scheme: 'https',
        host: 'cloud.mongodb.com',
        path: 'api/atlas/v1.0/groups',
        username: username,
        password: password,
        headers: {'Content-Type': ['application/json'], 'Accept-Encoding': ['bzip, deflate']},
        digestAuth: true,
      };

      // The response body is a BSON.Binary object. Parse it and return.
      const response = await context.http.get(arg);

      return EJSON.parse(response.body.text()).results;
    };
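Every function in this article builds an almost identical request object. As a purely optional refactoring, that construction could live in one helper. The sketch below is an assumption of mine rather than part of the exported app: the name buildAtlasRequest and its parameters are hypothetical, and it only assembles the object that context.http.get or context.http.patch would receive, so it can be tried under plain Node.js.

```javascript
// Hypothetical helper (not part of the original app): builds the argument
// object passed to context.http.get / context.http.patch in this article.
function buildAtlasRequest(publicKey, privateKey, path, body) {
  const arg = {
    scheme: 'https',
    host: 'cloud.mongodb.com',
    path: path,
    username: publicKey,
    password: privateKey,
    headers: {'Content-Type': ['application/json'], 'Accept-Encoding': ['bzip, deflate']},
    digestAuth: true,
  };
  // Only PATCH-style calls carry a body; it is sent as a JSON string.
  if (body !== undefined) {
    arg.body = JSON.stringify(body);
  }
  return arg;
}

// Example with placeholder credentials.
const listProjects = buildAtlasRequest('<public key>', '<private key>', 'api/atlas/v1.0/groups');
console.log(listProjects.path); // api/atlas/v1.0/groups
```

Inside an App Services function, the two keys would still come from context.values.get, exactly as in the code above.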
After getProjects is called, the trigger iterates over the results, passing the projectId to this getProjectClusters function.

To test this function, you need to supply a projectId. By default, the Console supplies ‘Hello world!’, so I test for that input and provide some default values for easy testing.

    /*
     * Returns an array of the clusters for the supplied project ID.
     * See https://docs.atlas.mongodb.com/reference/api/clusters-get-all/
     *
     * Returns an array of objects. See the API documentation for details.
     *
     */
    exports = async function(project_id) {

      if (project_id == "Hello world!") { // Easy testing from the console
        project_id = "5e8f8268d896f55ac04969a1"
      }

      // Get stored credentials...
      const username = await context.values.get("AtlasPublicKey");
      const password = await context.values.get("AtlasPrivateKey");

      const arg = {
        scheme: 'https',
        host: 'cloud.mongodb.com',
        path: `api/atlas/v1.0/groups/${project_id}/clusters`,
        username: username,
        password: password,
        headers: {'Content-Type': ['application/json'], 'Accept-Encoding': ['bzip, deflate']},
        digestAuth: true,
      };

      // The response body is a BSON.Binary object. Parse it and return.
      const response = await context.http.get(arg);

      return EJSON.parse(response.body.text()).results;
    };
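Not every cluster returned here can be paused. Later in the article, the trigger function skips shared-tier clusters (a providerName of TENANT) and clusters that are already paused. As a sketch, that filter could be written as a small pure function. The name pausableClusters and the sample data are made up for illustration; only the two fields it reads come from the article's code.

```javascript
// Hypothetical helper: keeps only clusters that are dedicated (not TENANT)
// and currently running, mirroring the checks in the trigger function.
function pausableClusters(clusters) {
  return clusters.filter(cluster =>
    cluster.providerSettings.providerName != "TENANT" && cluster.paused == false
  );
}

// Made-up sample shaped like a small part of the clusters API response.
const sampleClusters = [
  { name: "shared-demo", paused: false, providerSettings: { providerName: "TENANT" } },
  { name: "dev-1", paused: false, providerSettings: { providerName: "AWS" } },
  { name: "dev-2", paused: true, providerSettings: { providerName: "AWS" } },
];

console.log(pausableClusters(sampleClusters).map(c => c.name)); // [ 'dev-1' ]
```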
The clusterIsActive function contains the logic that determines if the cluster can be paused.

Most of the work in this function is manipulating the timestamp in the database access log so it can be compared to the current time and lookback window.

In addition to returning true (active) or false (inactive), the function logs its findings, for example:

    Checking if cluster 'SA-SHARED-DEMO' has been active in the last 60 minutes
      Wed Nov 03 2021 19:52:31 GMT+0000 (UTC) - job is being run
      Wed Nov 03 2021 18:52:31 GMT+0000 (UTC) - cluster inactivity before this time will be reported inactive
      Wed Nov 03 2021 19:48:45 GMT+0000 (UTC) - last logged database access
    Cluster is Active: Username 'brian' was active in cluster 'SA-SHARED-DEMO' 4 minutes ago.
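The comparison behind that log output is plain millisecond arithmetic. The sketch below reproduces it outside App Services using the timestamps from the sample above. The function name minutesSince and the ISO-formatted inputs are my own assumptions for illustration.

```javascript
const MS_PER_MINUTE = 60000;

// Hypothetical helper: whole minutes elapsed between a log timestamp and "now".
function minutesSince(timestamp, now) {
  return Math.round((now - Date.parse(timestamp)) / MS_PER_MINUTE);
}

// Values taken from the sample log output, written as ISO 8601 strings.
const now = Date.parse("2021-11-03T19:52:31Z");
const lastAccess = "2021-11-03T19:48:45Z";
const lookbackMinutes = 60;

// Anything logged after this instant counts as activity.
const idleStartTime = now - lookbackMinutes * MS_PER_MINUTE;

console.log(new Date(idleStartTime).toISOString());   // 2021-11-03T18:52:31.000Z
console.log(Date.parse(lastAccess) > idleStartTime);  // true
console.log(minutesSince(lastAccess, now));           // 4
```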
Like getProjectClusters, there’s a block you can use to provide some test project ID and cluster names for easy testing from the App Services console.

    /*
     * Uses the database access history to determine if the cluster is in active use.
     * See https://docs.atlas.mongodb.com/reference/api/access-tracking-get-database-history-clustername/
     *
     * Returns true (active) or false (inactive)
     *
     */
    exports = async function(project_id, clusterName, minutes) {

      if (project_id == 'Hello world!') { // We're testing from the console
        project_id = "5e8f8268d896f55ac04969a1";
        clusterName = "SA-SHARED-DEMO";
        minutes = 60;
      } /*else {
        console.log (`project_id: ${project_id}, clusterName: ${clusterName}, minutes: ${minutes}`)
      }*/

      // Get stored credentials...
      const username = await context.values.get("AtlasPublicKey");
      const password = await context.values.get("AtlasPrivateKey");

      const arg = {
        scheme: 'https',
        host: 'cloud.mongodb.com',
        path: `api/atlas/v1.0/groups/${project_id}/dbAccessHistory/clusters/${clusterName}`,
        //query: {'authResult': "true"},
        username: username,
        password: password,
        headers: {'Content-Type': ['application/json'], 'Accept-Encoding': ['bzip, deflate']},
        digestAuth: true,
      };

      // The response body is a BSON.Binary object. Parse it and return.
      const response = await context.http.get(arg);

      const accessLogs = EJSON.parse(response.body.text()).accessLogs;

      const now = Date.now();
      const MS_PER_MINUTE = 60000;
      // The log granularity is 30 minutes, so never look back less than that.
      const durationInMinutes = Math.max(minutes, 30);
      const idleStartTime = now - (durationInMinutes * MS_PER_MINUTE);

      const nowString = new Date(now).toString();
      const idleStartTimeString = new Date(idleStartTime).toString();
      console.log(`Checking if cluster '${clusterName}' has been active in the last ${durationInMinutes} minutes`)
      console.log(` ${nowString} - job is being run`);
      console.log(` ${idleStartTimeString} - cluster inactivity before this time will be reported inactive`);

      let clusterIsActive = false;

      // every() stops iterating as soon as the callback returns false,
      // so only the first entry from a non-agent user is examined.
      accessLogs.every(log => {
        if (log.username != 'mms-automation' && log.username != 'mms-monitoring-agent') {

          // Convert string log date to milliseconds
          const logTime = Date.parse(log.timestamp);

          const logTimeString = new Date(logTime);
          console.log(` ${logTimeString} - last logged database access`);

          const elapsedTimeMins = Math.round((now - logTime) / MS_PER_MINUTE);

          if (logTime > idleStartTime) {
            console.log(`Cluster is Active: Username '${log.username}' was active in cluster '${clusterName}' ${elapsedTimeMins} minutes ago.`);
            clusterIsActive = true;
          } else {
            // The first log entry is older than our inactive window
            console.log(`Cluster is Inactive: Username '${log.username}' was active in cluster '${clusterName}' ${elapsedTimeMins} minutes ago.`);
            clusterIsActive = false;
          }
          return false;
        }
        return true;
      });

      return clusterIsActive;
    };
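One edge case worth knowing about: if the access history holds no entries from human users at all, the function never enters the comparison and returns false, so the cluster is treated as inactive. The pure-function sketch below isolates that decision so it can be unit tested outside App Services. The isActive name, the made-up log entries, and the assumption that entries arrive newest first (which the function above also relies on) are mine.

```javascript
const AGENT_USERS = ['mms-automation', 'mms-monitoring-agent'];
const MS_PER_MINUTE = 60000;

// Hypothetical pure version of the decision. accessLogs is the parsed array,
// assumed newest first; the lookback window is never shorter than 30 minutes.
function isActive(accessLogs, now, minutes) {
  const idleStartTime = now - Math.max(minutes, 30) * MS_PER_MINUTE;
  const lastUserAccess = accessLogs.find(log => !AGENT_USERS.includes(log.username));
  if (lastUserAccess === undefined) {
    return false; // no user activity logged at all
  }
  return Date.parse(lastUserAccess.timestamp) > idleStartTime;
}

// Made-up log entries for illustration.
const now = Date.parse('2021-11-03T19:52:31Z');
const logs = [
  { username: 'mms-automation', timestamp: '2021-11-03T19:52:00Z' },
  { username: 'brian', timestamp: '2021-11-03T19:48:45Z' },
];

console.log(isActive(logs, now, 60));             // true
console.log(isActive([], now, 60));               // false
console.log(isActive(logs.slice(0, 1), now, 60)); // false
```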
Finally, if the cluster is inactive, we pass the project ID and cluster name to pauseCluster. This function can also resume a cluster, although that feature is not utilized for this use case.

    /*
     * Pauses the named cluster
     * See https://docs.atlas.mongodb.com/reference/api/clusters-modify-one/
     *
     */
    exports = async function(projectID, clusterName, pause) {

      // Get stored credentials...
      const username = await context.values.get("AtlasPublicKey");
      const password = await context.values.get("AtlasPrivateKey");

      const body = {paused: pause};

      const arg = {
        scheme: 'https',
        host: 'cloud.mongodb.com',
        path: `api/atlas/v1.0/groups/${projectID}/clusters/${clusterName}`,
        username: username,
        password: password,
        headers: {'Content-Type': ['application/json'], 'Accept-Encoding': ['bzip, deflate']},
        digestAuth: true,
        body: JSON.stringify(body)
      };

      // The response body is a BSON.Binary object. Parse it and return.
      const response = await context.http.patch(arg);

      return EJSON.parse(response.body.text());
    };
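Note that pauseCluster returns whatever the API sends back, so a failed request comes back as a parsed error document rather than an exception. If you want failures to surface in the Trigger logs, a small check like the one below could wrap the parsed response. Treat it as a sketch under assumptions: assertAtlasOk is a hypothetical name, and I'm assuming error bodies carry a numeric error field along with reason and detail, so verify that shape against the responses you actually receive.

```javascript
// Hypothetical helper: throws if the parsed response looks like an API error.
// Assumes error bodies carry a numeric "error" plus "reason" and "detail".
function assertAtlasOk(parsed) {
  if (parsed && typeof parsed.error === 'number') {
    throw new Error(`Atlas API error ${parsed.error} (${parsed.reason}): ${parsed.detail}`);
  }
  return parsed;
}

// Made-up examples of a successful body and an error body.
const okBody = { name: 'dev-1', paused: true };
const errorBody = { error: 401, reason: 'Unauthorized', detail: 'You are not authorized for this resource.' };

console.log(assertAtlasOk(okBody).paused); // true

try {
  assertAtlasOk(errorBody);
} catch (e) {
  console.log(e.message);
}
```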
The pauseClusters function will be called by a trigger. As it's not possible to pass a parameter to a scheduled trigger, it uses a hard-coded lookback window of 60 minutes that you can change to meet your needs. You could even store the value in an Atlas database and build a UI to manage its setting :-).
The function will evaluate all projects and clusters in the organization where it’s hosted. Understanding that there are likely projects or clusters that you never want paused, the function also includes an excludeProjects array, where you can specify a list of project names to exclude from evaluation.
Finally, you’ll notice the call to pauseCluster is commented out. I suggest you run this function for a couple of days and review the Trigger logs to verify it behaves as you’d expect.

    /*
     * Iterates over the organization's projects and clusters,
     * pausing clusters inactive for the configured minutes.
     */
    exports = async function() {

      const minutesInactive = 60;

      /*
       * These project names are just an example.
       * The same concept could be used to exclude clusters or even
       * configure different inactivity intervals by project or cluster.
       * These configuration options could also be stored and read from
       * an Atlas database.
       */
      const excludeProjects = ['PROD1', 'PROD2'];

      const projects = await context.functions.execute("getProjects");

      // map + Promise.all (rather than forEach) makes the function wait
      // until every project and cluster has been checked before returning.
      await Promise.all(projects.map(async project => {

        if (excludeProjects.includes(project.name)) {
          console.log(`Project '${project.name}' has been excluded from pause.`)
        } else {

          console.log(`Checking project '${project.name}'s clusters for inactivity...`);

          const clusters = await context.functions.execute("getProjectClusters", project.id);

          await Promise.all(clusters.map(async cluster => {

            if (cluster.providerSettings.providerName != "TENANT") { // It's a dedicated cluster that can be paused

              if (cluster.paused == false) {

                const is_active = await context.functions.execute("clusterIsActive", project.id, cluster.name, minutesInactive);

                if (!is_active) {
                  console.log(`Pausing ${project.name}:${cluster.name} because it has been inactive for more than ${minutesInactive} minutes`);
                  //await context.functions.execute("pauseCluster", project.id, cluster.name, true);
                } else {
                  console.log(`Skipping pause for ${project.name}:${cluster.name} because it has active database users in the last ${minutesInactive} minutes.`);
                }
              }
            }
          }));
        }
      }));

      return true;
    };
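A note on looping with async callbacks: Array.prototype.forEach ignores the promises its callback returns, so code after the loop can run before the callbacks have finished. Where a function should wait for every check, mapping to promises and awaiting Promise.all is a common pattern. The sketch below shows the difference under plain Node.js. checkCluster is a hypothetical stand-in that merely simulates an asynchronous call such as context.functions.execute.

```javascript
// Stand-in for an asynchronous call such as context.functions.execute(...).
function checkCluster(name) {
  return new Promise(resolve => setTimeout(() => resolve(`${name} checked`), 10));
}

async function withForEach(names) {
  const results = [];
  names.forEach(async name => {
    results.push(await checkCluster(name));
  });
  return results.length; // the callbacks have not finished yet
}

async function withPromiseAll(names) {
  const results = await Promise.all(names.map(name => checkCluster(name)));
  return results.length; // every check has completed
}

async function main() {
  const names = ['dev-1', 'dev-2', 'dev-3'];
  console.log(await withForEach(names));    // 0
  console.log(await withPromiseAll(names)); // 3
}

const done = main();
```

Promise.all keeps the checks concurrent. A plain for...of loop with await would also wait for completion, at the cost of running the checks one at a time.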
Yes, we’re still using a scheduled trigger, but this time the trigger will run periodically to check for cluster inactivity. Now, your developers working late into the night will no longer have the cluster paused underneath them.
As a final step you need to deploy the App Services application.
The genesis for this article was a customer who, when presented with my previous article on scheduling cluster pauses, asked if the same could be achieved based on inactivity. It’s my belief that with the Atlas APIs, anything could be achieved. The only question was what constitutes inactivity? Given the heartbeat and replication that naturally occur, there’s always some “activity” on the cluster. Ultimately, I settled on database access as the guide. Over time, that metric may be combined with some additional metrics or changed to something else altogether, but the bones of the process are here.