Streamlining deployments with flightplan.js

The deployment process is a crucial step of the software development lifecycle. The more frequently you apply changes to production, the more important it becomes for business operations. No wonder successful projects have developed their own ways of pushing new features to production and mastered keeping their systems up and running.

When it comes to the Node.js world specifically, two major approaches have emerged. One of them is the brisk and seamless “git push heroku” style. The second one, which we’ll describe in this article, targets generally more advanced deployment pipelines—the ones which consist of dozens of application servers and often require a zero-downtime policy.

"By using the presented method, we achieved a zero-downtime deploy that is regularly executed on 50 – 120 servers during the workday, but can scale up to some bigger numbers."

This approach goes far beyond a handy one-liner and calls for custom code. So, our goal is to show you how to put together an easy-to-maintain deployment script with a (not-so-) little help from flightplan.js.

Let’s talk about Heroku-style deployment first. A disclaimer: we’re far from berating Heroku and its kin. We’ve been using it ourselves, and we think such platforms have a well-deserved place on the DevOps map. We just want to present an alternative we found useful for managing deployments in more challenging environments, challenging both in terms of infrastructure and of the responsibility for site reliability lying on your shoulders.

So, “What’s your new alternative?” you ask. The new approach isn’t actually that new. It’s plain old shell scripting; the only novelty is that it’s wrapped in a nice, battle-tested, cloud-ready JS interface. Meet flightplan.js. Now, have a seat and listen to the story of our journey to a configurable and reliable deployment script, one that gives us, besides full zero-downtime deploys, partial deploys, quick rollbacks, and unification of all servers’ versions.

The infrastructure


Before we dive into the nitty-gritty, let’s take a moment to describe how servers and other things are connected in our case. Here’s the big picture:

fly1.png

In this rough overview, we have lots of application servers behind an AWS ELB load balancer. The arrow at the top represents requests coming from the external world into our application. After being routed through the ELB, they end up on one of our BE servers (“BE” stands for the backend servers holding the app). A load balancer is essential in this architecture to split the load more or less equally among all available instances. Servers are supposed to serve requests while staying at a desired average CPU level, say 70%; monitoring plus autoscaling provide the appropriate dash of magic. After each instance is up, Puppet helps us install the minimum required software dependencies and launch our app in the expected version.

Consul is a service discovery platform that helps us track all instances (IP, health), grouped by the type of service each instance provides. Both the ELB and Consul let us hold a list of servers and check their health; on top of that, Consul offers key-value storage for real-time configuration, covering both low-level aspects and top-level, domain-specific settings of the application. From the perspective of a DevOps engineer doing a deploy, our script uses all the mentioned Consul features through its API, calls external commands on a preinstalled and configured aws-cli, and relies on Flightplan’s built-in ability to use SSH agents. The aws-cli is a command line tool from AWS that exposes operations for many of its services; we use it to control the ELB and fetch some info from S3.
The presented architecture allows us to scale to hundreds of servers (as shown in the graph below) and still have a way to imperceptibly update the code at any time.

The script

Let’s close the boring part and go through the summary of what we can do with Flightplan.
 

"use strict";

// Registering cleanup callback before requiring flightplan.
process.on("SIGINT", interruptedCleanup);

const util    = require("util"),
     moment  = require("moment"),
     _       = require("lodash"),
     plan    = require("flightplan"),
     request = require("request-promise");

/*
Usages:

fly [deploy:]dev                           (deploys current branch on DEV)
fly [deploy:]dev --branch=testing          (deploys testing branch, this optional param can be used for all targets)
fly [deploy:]dev --branch=23af9e8          (deploys 23af9e8 commit, can be used for all targets)
fly [deploy:]canary --msg="Summary"        (deploys master branch on Canary, msg is required param for canary and production targets)
fly [deploy:]production10 --msg="Summary"  (deploys master branch on 10% of production servers)
fly [deploy:]production25 --msg="Summary"  (deploys master branch on 25% of production servers)
fly [deploy:]production --msg="Summary"    (deploys master branch on all of production servers)

fly [deploy:]production --msg="Summary" --force="172.11.22.333"
fly [deploy:]production --msg="Summary" --force="172.11.22.333,172.22.33.444"
(the force param skips waiting for the given instances' healthchecks and forces redeploys there)

fly [deploy:]canary --msg="Msg" --silent   (silent mode turns off Slack notifications and events)

fly rollback:canary                        (rolls back to the previous build on Canary)
fly rollback:production                    (rolls back to the previous build on all production servers)

fly unify:production                       (unifies build version for all of production servers, helpful to "rollback" partial deploys like 10%)

*/

The comments next to each command, plus the brief process overview we’re about to go through, are all I can give you right now to prepare you for the upcoming code snippets, so please read them over if you haven’t yet. And if you’re wondering whether you got the idea of partial deploys, rollbacks, and unification right, let the following graph clarify it by showing how the version pointers change over time: key-value entries holding app versions are updated during deploys to DEV, to Canary, and to production in full or partial range, and those values are then read when restarting the app on each server. The graph simplifies the view; in reality the values are git hashes.

fly2.png
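To make the version-pointer mechanics concrete, here is a tiny toy model of the scheme the graph depicts. The base key name `version`, the helper names, and the sample revisions are all hypothetical (the real keys live in Consul KV and the real values are full git hashes); it is only a sketch of the pointer bookkeeping, not the script's actual code.

```javascript
// Toy in-memory stand-in for the Consul KV entries (key names assumed).
const kv = {};

// deploy() writes the version pointer for a target; full deploys also save
// the previous pointer under the "_old" key so a rollback can restore it.
function deploy(postfix, oldPostfix, rev) {
  // Skip saving when redeploying the same rev (i.e. the last deploy failed).
  if (oldPostfix && kv["version" + postfix] && kv["version" + postfix] !== rev) {
    kv["version" + oldPostfix] = kv["version" + postfix];
  }
  kv["version" + postfix] = rev;
}

function rollback(postfix, oldPostfix) {
  kv["version" + postfix] = kv["version" + oldPostfix];
}

deploy("_dev", null, "23af9e8");   // fly dev
deploy("", "_old", "a1b2c3d");     // first full production deploy
deploy("", "_old", "e4f5a6b");     // next full production deploy
rollback("", "_old");              // fly rollback:production
console.log(kv.version);           // → a1b2c3d
```

After the rollback, the production pointer is back at the previous revision while the DEV pointer stays untouched, which is exactly what the graph shows.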

Emergency exit

Now, after we’ve gone through the introduction, it’s time to answer the real question: what’s inside interruptedCleanup and when is it needed? It’s mostly for the case when the developer wants to quit the script’s execution because something looks wrong. In the majority of cases no servers are being processed at that moment, but when, for instance, the script is stopped by accident, we have one last chance to do something useful with the data kept in memory.

If you want this to work, you must register the handler as early as possible, even before "flightplan" is required. (Note: besides a locally installed Flightplan for the require statements to work, we must also have it installed globally (npm i -g flightplan) to be able to use it as a command line tool.)

// Registering cleanup callback before requiring flightplan.
process.on("SIGINT", interruptedCleanup);

...

function interruptedCleanup() {
 if (removedFromELB.length) {
   console.log("Register instances back:");
   console.log("aws elb register-instances-with-load-balancer --load-balancer-name ProjectELB --instances " + _.map(removedFromELB, "id").join(" "));
 }
}

The handler only logs the command that must be used to register the instances currently removed from the ELB back again. When the script is interrupted, trying to fork a new process to do this automatically is not the best idea, because in many cases it will simply fail. Printing out the command is, for now, the best we can do in such a case.
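The string building in the handler can be sanity-checked in isolation. The helper name below is hypothetical (the real handler inlines this with lodash's `_.map`); the ELB name and instance-object shape come from the snippets in this post.

```javascript
// Hypothetical extraction of the command the SIGINT handler prints, so the
// string assembly can be verified on its own.
function recoveryCommand(removed, lbName) {
  return "aws elb register-instances-with-load-balancer" +
    " --load-balancer-name " + lbName +
    " --instances " + removed.map(i => i.id).join(" ");
}

console.log(recoveryCommand(
  [{ id: "i-0c334653869876f82" }, { id: "i-0f686dea161b7caf3" }],
  "ProjectELB"
));
// → aws elb register-instances-with-load-balancer --load-balancer-name ProjectELB --instances i-0c334653869876f82 i-0f686dea161b7caf3
```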
 

All settings go first

Going further with the code we can read the following configuration:

// --- Configuration ---
let BRANCH = "master";

const DEV_OPTIONS     = {
       versionKeyPostfix: "_dev"
     },
     CANARY_OPTIONS  = {
       event: "Canary",
       versionKeyPostfix: "_canary",
       oldVersionKeyPostfix: "_canary_old",
       lockTarget: "canary",
       lbTakeout: true
     },
     PROD_10_OPTIONS = {
       event: "10% of servers",
       lockTarget: "production",
       lbTakeout: true,
       bringBackPrevVersion: true
     },
     PROD_25_OPTIONS = {
       event: "25% of servers",
       lockTarget: "production",
       lbTakeout: true,
       bringBackPrevVersion: true
     },
     PROD_OPTIONS    = {
       event: "All servers",
       oldVersionKeyPostfix: "_old",
       lockTarget: "production",
       lbTakeout: true,
       waitForAllHealthy: true,
       canUnify: true
     };

const PARALLEL_DEPLOYS_FRACTION = 0.2;
const CONSUL_URL = "http://url.to.consul:8500/v1";
const BACKEND_GIT_URL = "git@github.com:ORG/project.git";

// ---------------------

let newRev, prevRev,
   localMachine,
   forced                 = [],
   maxParallelDeploys     = 1,
   deployedInstancesCount = 0,
   parallelDeploysCount   = 0;

const removedFromELB = [];

Some of the variables shown should make sense right away, like versionKeyPostfix, oldVersionKeyPostfix, bringBackPrevVersion, canUnify, or removedFromELB, while some might require a little more explanation. Take lbTakeout, for example: it indicates whether the targets should be taken out of and put back into the ELB. The lockTarget variable, if set, makes the script use simple locking to prevent parallel deploys to important servers. This value is also stored inside the lock in Consul, just because we can.

Also, a word about PARALLEL_DEPLOYS_FRACTION: it’s simply the fraction of servers that can be modified simultaneously. The rest should become clearer when seen in action.
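The fraction translates into a concurrency cap the way the full production target computes it later in the script; the helper name here is made up for illustration.

```javascript
// How the fraction turns into a concurrency cap; the `|| 1` guarantees
// progress on tiny fleets where the floor would come out as zero.
function maxParallel(serverCount, fraction) {
  return Math.floor(serverCount * fraction) || 1;
}

console.log(maxParallel(129, 0.2)); // → 25
console.log(maxParallel(3, 0.2));   // → 1
```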

Let me also stress that the most important variable, BRANCH, stores the app version that should be deployed. By default it points to "master" but can easily be overridden via Flightplan’s built-in parameter parsing: if we specify --branch="value" on the command line, the value is instantly available inside the options object you’ll see later.

Next we have two very similar target hooks for DEV and Canary, where we specify (using a function) which servers are to be used.

// Target plan for DEV server.
plan.target("dev", done => {
 BRANCH = "HEAD"; // Using current branch as default.

 getServersList()
   .catch(err => done(new Error(util.format("Getting servers list failed - Message: %s, Error: %j", err, err))))
   .then(servers => {
     if (_.isEmpty(servers)) {
       return;
     }

     const address = (_.find(servers, isDev) || {}).Address;

     if (!address) {
       done(new Error("DEV not found"));
       return;
     }

     console.log("DEV IP: %s", address);

     done([toHost(address)]);
   });
}, DEV_OPTIONS);

First, let’s clarify what a target means here. As you’ve probably guessed, whatever is provided as the second parameter (an object, an array, or a function that calls a callback with an object or array, as in our case) is taken as the set of target machines Flightplan will control for us. The second important concept is a task. Tasks fire in code order, provided their name matches the first parameter, which can be either a string or an array.

The contents of methods like getServersList can be found in the gist attached at the very end of this post. For now, let’s just assume that servers are taken from Consul’s catalog of registered instances, and that we similarly read the healthchecks from it later on.

Next, let’s take a look at the 25% and full production targets (let’s leave the 10% case for the imagination or you can check it in the gist).

// Target plan for 25% of production servers.
plan.target("production25", done => {
 getServersList()
   .catch(err => done(new Error(util.format("Getting servers list failed - Message: %s, Error: %j", err, err))))
   .then(servers => {
     if (_.isEmpty(servers)) {
       return;
     }

     const addresses = _(servers)
       .reject(isNonProduction)
       .map("Address")
       .sortBy()
       .value();

     const hosts = _(addresses)
       .take(Math.floor(addresses.length / 4))
       .map(toHost)
       .value();

     if (_.isEmpty(hosts)) {
       done(new Error("No production servers found"));
       return;
     }

     maxParallelDeploys = hosts.length;

     console.log("Backend servers: %j", _.map(hosts, "host"));
     console.log("Parallel deploys: %d", maxParallelDeploys);

     done(hosts);
   });
}, PROD_25_OPTIONS);

// Target plan for production servers.
plan.target("production", done => {
 getServersList()
   .catch(err => done(new Error(util.format("Getting servers list failed - Message: %s, Error: %j", err, err))))
   .then(servers => {
     if (_.isEmpty(servers)) {
       return;
     }

     const addresses = _(servers)
       .reject(isNonProduction)
       .map("Address")
       .sortBy()
       .value();

     if (_.isEmpty(addresses)) {
       done(new Error("No production servers found"));
       return;
     }

     const hosts = _.map(addresses, toHost);

     maxParallelDeploys = Math.floor(addresses.length * PARALLEL_DEPLOYS_FRACTION) || 1;

     console.log("Backend servers: %j", addresses);
     console.log("Parallel deploys: %d", maxParallelDeploys);

     done(hosts);
   });
}, PROD_OPTIONS);

The target plan for Canary server is similar.

As you can see, in the cases where we return many production instances, the IP addresses are sorted alphabetically. This means that, for instance, when deploying to 10% and then to 25%, only 15% more servers get updated with the new code. Notice that maxParallelDeploys is a fraction of all servers for a full deploy but equals the full target count for partial deploys; therefore, do not use this code for fractions higher than 25% without properly adjusting this variable.
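Why the sort matters can be shown in a few lines. This is a plain-JS restatement of the slicing done in the partial targets (the helper and the sample addresses are made up); because every partial deploy slices the front of the same sorted list, smaller deploys are always subsets of larger ones.

```javascript
// Taking the first floor(n * fraction) entries of the *sorted* address list
// makes every smaller partial deploy a subset of a larger one, so 10%
// followed by 25% touches only 15% more servers.
function partialTargets(addresses, fraction) {
  return [...addresses].sort().slice(0, Math.floor(addresses.length * fraction));
}

const addrs = Array.from({ length: 20 }, (_, i) => "172.31.0." + (100 - i));
const tenPct = partialTargets(addrs, 0.10);   // 2 servers
const quarter = partialTargets(addrs, 0.25);  // 5 servers
console.log(tenPct.every(a => quarter.includes(a))); // → true
```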

The filters shown below find the desired instances based on Consul tags, which originate in the AMI configuration of each instance and are applied during registration in Consul.

function isNonProduction(server) {
 return isDev(server) || isCanary(server);
}

function isDev(server) {
 return _.includes(server.ServiceTags, "dev");
}

function isCanary(server) {
 return _.includes(server.ServiceTags, "canary");
}

function toHost(address) {
 return {
   host: address,
   username: "user",
   agent: process.env.SSH_AUTH_SOCK
 };
}

Above you can see how each host must ultimately be formatted: besides providing the IP, we need to (or can) pass a username, a locally announced SSH agent, or even certificates for the SSH connection.
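For a feel of what the filters do against real data, here they are restated in plain JS (no lodash) and applied to a fake catalog payload. The payload shape is an assumption based on Consul's /v1/catalog/service response; the addresses and tags are invented.

```javascript
// Fake Consul catalog payload: dev and canary tags mark the non-production
// instances that the production targets must reject.
const servers = [
  { Address: "172.31.1.1", ServiceTags: ["dev"] },
  { Address: "172.31.1.2", ServiceTags: ["canary"] },
  { Address: "172.31.1.3", ServiceTags: [] },
  { Address: "172.31.1.4", ServiceTags: ["something-else"] }
];

const isDev = s => s.ServiceTags.includes("dev");
const isCanary = s => s.ServiceTags.includes("canary");
const production = servers.filter(s => !isDev(s) && !isCanary(s));

console.log(production.map(s => s.Address)); // → [ '172.31.1.3', '172.31.1.4' ]
```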

Ok, let’s take a break now if you are panting, because a really steep descent lies ahead.

Quick recap

What have we shown so far? The first part of the script is all about setting up the exit hook, setting the initial configuration, and acquiring data about the servers on which Flightplan will then execute the “remote part” of the script. We can provide a single host object, an array of such objects, or a function that returns one of the two; each object specifies the connection details for your local SSH agent. Note that an array (or object) provided directly as the target’s definition would have to be either hard-coded in the script or obtained synchronously before the target’s hook is called.

Tasks definitions

After defining all targets, it’s time to attach tasks to the local or remote side, and order matters here.

First, the local hook fires if we want to deploy the app or if we don’t specify a task at all ("default"), for example when running fly dev.

plan.local(["default", "deploy"], local => {
 localMachine = local;
 const options = plan.runtime.options;

 if (options.force) {
   forced = options.force.trim().split(",");
 }

 if (options.branch) {
   BRANCH = options.branch;
 }

 if (BRANCH === "HEAD") {
   newRev = local.exec("git rev-parse HEAD").stdout.trim();
 } else {
   const remoteRev = local.exec("git ls-remote " + BACKEND_GIT_URL + " " + BRANCH).stdout;
   newRev = (remoteRev || local.exec("git rev-parse " + BRANCH).stdout).split("\t")[0].trim();
 }

 if (options.event && typeof options.msg !== "string") {
   plan.abort("Please provide deploy summary. E.g. 'fly deploy:production --msg=\"Deploy summary\"'");
   return;
 }

 if (!buildReady()) {
   plan.abort("Build is not ready");
 }
});

Why are there so many rocks on the road? Let’s go step by step.
In the first line, we store a reference to the local machine in a script-level variable. We do this for easier access (less code) in helper methods, but primarily to be able to call its functions while Flightplan is executing a remote hook.

The second line stores the options in a handier variable. The options contain both the target’s default options and the dynamic options you provide on the command line, which is what the following two conditionals rely on: if a value is provided, parse it and use it. If a key is passed on the command line without a value, it shows up in the script as true.

Next comes the most interesting part, where the final revision is acquired. If BRANCH equals "HEAD", we take the git sha of the local branch; this is very handy when we want to quickly test a feature on the DEV server (in that target’s code this value is assigned as the default, as you saw above). If BRANCH is still "master", or any other branch label, remoteRev resolves to the git sha in the remote repository, so the deploy works properly even without a local git pull after merging the PR. If the branch is not found remotely, the value is treated as a shortened git sha and expanded into its full form.
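The sha extraction can be sketched in isolation. The helper below is hypothetical; it only mirrors the string handling in the hook: `git ls-remote <url> <branch>` prints "<sha>\t<ref>" (or nothing when the ref doesn't exist remotely), in which case the script falls back to `git rev-parse <branch>` locally.

```javascript
// Isolated sketch of the sha resolution logic from the local hook.
function resolveRev(lsRemoteOutput, revParseFallback) {
  const out = lsRemoteOutput && lsRemoteOutput.trim()
    ? lsRemoteOutput
    : revParseFallback;
  // The sha is the field before the tab; rev-parse output has no tab,
  // so split()[0] is the whole line and only needs a trim.
  return out.split("\t")[0].trim();
}

console.log(resolveRev("23af9e8deadbeef\trefs/heads/testing\n", ""));
// → 23af9e8deadbeef
console.log(resolveRev("", "fullsha0123456789\n"));
// → fullsha0123456789
```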

Finally, the script checks whether the current target demands a deploy message and whether one was provided on the command line. Next, it verifies that the desired build is ready by checking whether a Docker container with the corresponding name is saved in AWS S3. This is quite a custom solution, I admit, but you can always find your own. As you can see, if you call plan.abort, Flightplan cuts its execution. Remember that this function throws an error that must be caught while still in Flightplan’s context; otherwise we get an unhandled exception.

Commands in Flightplan

Let’s also examine how we check whether the build is ready.

function buildReady() {
 return localMachine.exec("aws s3 ls s3://url.to.docker.registry/docker/registry/v2/repositories/product/_manifests/tags/" + newRev, { failsafe: true }).code === 0;
}

Here we can see a few of Flightplan’s specifics. The script calls the exec function on the local instance, which forks a process for any system-level command we want. By default, if the command fails, an error is thrown and the whole script aborts. Setting the failsafe param to true suppresses this behavior so we can check the command’s exit code ourselves; additionally, we have access to the stdout and stderr outputs. Another useful param is silent, which hides the output of noisy commands from Flightplan’s logs.

More hooks

The following two hooks fire only to set localMachine and to check whether the given task can run on the specified target.

plan.local("rollback", local => {
 localMachine = local;
 const options = plan.runtime.options;

 if (!options.oldVersionKeyPostfix) {
   plan.abort("Target is not reversible");
 }
});

plan.local("unify", local => {
 localMachine = local;
 const options = plan.runtime.options;

 if (!options.canUnify) {
   plan.abort("Target can not be unified");
 }
});

// HACK: Connecting to each server before locking, posting to Slack, etc.
plan.remote(["default", "deploy", "rollback", "unify"], remote => remote.log("Connected"));

Above we can see a first case of the remote hook. What differs from the local one is that the reference points to an instance far away from our laptop. It’s important to grasp that the code still executes in one Node thread, as usual, while the commands fire asynchronous (network) actions whose results look as if they came from regular synchronous operations. In this way, long-running remote operations execute simultaneously on all chosen targets.

In the snippet above, it seems we just log one word, but is even that needed? This one simple function halts the script’s execution if there are any issues connecting to a BE server. It’s placed deliberately before another local hook, where we do some initial operations designed to run on one machine only, like locking or posting notifications. Those ops could just as well go into a remote hook, but then we would need to synchronize the other remote instances to wait for the privileged one.

Action hooks

If you’ve reached this point, I have some news for you. First of all, there are just four code snippets to go. The first two do the main job, so let’s skim through them; a summary of the core actions follows each code fragment. At the end there are two last pieces for stretching your legs before the struggle still ahead of you. Tie up your shoes well, and we’ll see each other just after the next hill.

plan.local(["default", "deploy"], local => {
 const options = plan.runtime.options;
 local.log("Setting Consul lock and new app version");

 const err = local.waitFor(done =>
   Promise.resolve()
     .then(() => {
       if (!options.waitForAllHealthy) {
         return;
       }

       return new Promise(checkServersHealth)
         .then(healthy => {
           if (!healthy) {
             return Promise.reject("Servers not healthy");
           }
         });
     })
     .then(() => {
       if (!options.lockTarget) {
         return;
       }

       return getLock()
         .catch(err => Promise.reject(util.format("Getting lock state failed - Message: %s, Error: %j", err, err)))
         .then(result => {
           console.log("Lock state: %j", result);

           if (result !== "free") {
             return Promise.reject("Deploy is locked in Consul");
           }

           return lock(options.lockTarget)
             .catch(err => Promise.reject(util.format("Locking failed - Message: %s, Error: %j", err, err)));
         })
         .then(() => console.log("Locking succeeded"));
     })
     .then(() => {
       if (!options.bringBackPrevVersion && !options.oldVersionKeyPostfix) {
         return;
       }

       return getCurrentVersion()
         .then(currentVersion => {
           console.log("Current version: %s", currentVersion);
           prevRev = currentVersion;
         })
         .catch(err => Promise.reject(util.format("Getting current version failed - Message: %s, Error: %j", err, err)));
     })
     .then(() => {
       // Don't update old version if new revision is same as prev - it means last deploy failed.
       if (!options.oldVersionKeyPostfix || newRev === prevRev) {
         return;
       }

       return setNewVersion(options.oldVersionKeyPostfix, prevRev)
         .then(result => console.log("Old version set - Result: %s", result))
         .catch(err => Promise.reject(util.format("Setting old version failed - Message: %s, Error: %j", err, err)));
     })
     .then(() => {
       // Posting to Slack and creating Event.
       if (options.event && !options.silent) {
         postToSlack(options.event, options.msg);
         createEvent(options.event, options.msg);
       }

       return setNewVersion(options.versionKeyPostfix)
         .catch(err => Promise.reject(util.format("Setting new version failed - Message: %s, Error: %j", err, err)));
     })
     .then(result => console.log("New version set - Result: %s", result))
     .then(done)
     .catch(done)
 );

 if (err) {
   plan.abort(err);
 }
});

Nice! Let’s briefly recap what’s happening above. First, execution quickly moves into the promise chain inside the local.waitFor call, which makes this transport object (the same can be called on a remote) wait until you invoke the callback provided to your function.

Let’s go through the steps. Server health is checked first (only on prod) to fail quickly if something is already wrong. After that, we check the lock and acquire it if it’s free (canary and prod). Next, partial and full deploys fetch the current version. Full deploys, i.e. canary and production, then set the old version to the current one. Notice that we skip this last update in the special case where the previous version equals the new one; this can occur when we break off a partial deploy, or when we repeat a full deploy instead of running the unification task.

Just before the last command, which sets a new, desired version, the script sends a Slack notification and creates an event in one of our monitoring systems. This fires only if the target has an event and the silent mode is not activated.

And that’s it for this part. If everything succeeds, the script moves on to the remote targets, executing the following code “in parallel” on each of them, and then enters the last hook, which fires locally again.

plan.remote(["default", "deploy", "rollback", "unify"], remote => {
 const options = plan.runtime.options;

 const instanceId = remote.exec("wget -q -O - http://instance-data/latest/meta-data/instance-id").stdout;
 const instanceHost = remote.hostname().stdout.trim();
 const instanceData = { id: instanceId, host: instanceHost };

 if (options.waitForAllHealthy) {
   // Wait until all servers are back in Consul (healthcheck pass).
   waitUntilServersHealthy(remote);
 }

 parallelDeploysCount++;

 remote.log("Deploying to instance: " + instanceId);

 if (options.lbTakeout) {
   removedFromELB.push(instanceData);
   logProgress();
   removeInstanceFromELB(instanceId);

   // Wait until connections are finished.
   remote.exec("sleep 20");
 }

 remote.log("Restarting project_s service on instance: " + instanceId);
 remote.exec("sudo systemctl stop project_s && sudo systemctl start project_s");

 if (options.lbTakeout) {
   addInstanceToELB(instanceId);
   _.pull(removedFromELB, instanceData);
 }

 parallelDeploysCount--;
 deployedInstancesCount++;

 remote.log("Deployed to instance: " + instanceId);
 logProgress();
});

plan.local(["default", "deploy", "rollback", "unify"], local => {
 const options = plan.runtime.options;

 const err = local.waitFor(done =>
   Promise.resolve()
     .then(() => {
       if (!options.bringBackPrevVersion) {
         return;
       }

       console.log("Setting previous version in Consul: %s", prevRev);
       newRev = prevRev;

       return setNewVersion(options.versionKeyPostfix)
         .then(result => console.log("Previous version set - Result: %s", result))
         .catch(err => Promise.reject(util.format("Setting old version failed - Message: %s, Error: %j", err, err)));
     })
     .then(() => {
       if (!options.lockTarget) {
         return;
       }

       return unlock()
         .then(result => console.log("Unlock succeeded - Result: %j", result))
         .catch(err => Promise.reject(util.format("Unlocking failed - Message: %s, Error: %j", err, err)));
     })
     .then(done)
     .catch(done)
 );

 if (err) {
   console.log(err);
 }
});

An interesting thing happens in this part. The code executes in Node’s single thread, which switches between all remote contexts whenever a command on a target object is executed; when a command finishes, the calling remote context resumes its own iteration of the hook. Script-level variables thus form a shared memory (fortunately with just one thread changing state) for all remote contexts, which we can use to synchronize their work however we want.

The following actions take place on each instance. First, we obtain the instanceId using an AWS metadata trick and the instanceHost from Flightplan’s data. Then, if the target is production, each server waits until all instances are healthy. Going further, we increment the parallelDeploysCount counter, which is visible when reporting progress. Next, when the target demands it, we take the instance out of the ELB and wait 20 seconds so that, hopefully, all in-flight requests finish. Then the core step takes place: a restart of the Docker container. Once it’s finished, we put the instance back into the ELB right away; the ELB won’t enable the instance until it’s healthy again, so we don’t need to worry about that part. Some counters are then decreased (current parallel deploys) or increased (progress), and the remote hook ends here.
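The counter-based throttling described above can be simulated in a few lines of plain async JS. This is a toy model, not the script's code: the timings are arbitrary stand-ins for the real sleeps, and `deployOne` compresses the whole remote hook into a counter dance. It shows why the shared counters are safe: single-threaded JS never interrupts the synchronous check-and-increment, so they behave like a semaphore.

```javascript
// Toy simulation of the shared-counter throttle used by the remote hooks.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const max = 3;
let inFlight = 0, peak = 0;

async function deployOne() {
  while (inFlight >= max) {   // stand-in for waitUntilServersHealthy's loop
    await sleep(5);
  }
  inFlight++;                 // parallelDeploysCount++
  peak = Math.max(peak, inFlight);
  await sleep(20);            // stand-in for ELB takeout + service restart
  inFlight--;                 // parallelDeploysCount--
}

const simulation = Promise.all(Array.from({ length: 10 }, deployOne))
  .then(() => {
    console.log("peak parallel deploys:", peak); // never exceeds max
    return peak;
  });
```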

After every remote finishes, the script enters the last local hook, which sets back the previous version (for partial deploys) and/or removes the lock if needed.

A piece of history

A word on what must be triggered on the remote to restart the app: what we do now is restart a service that queries Consul for its assigned version, downloads the corresponding Docker image, and runs it. However, it wasn’t always like that. One of the strong points of this kind of deploy script is the freedom you gain in the transport layer of your app. In the very beginning we had neither Docker images nor any CI to build them; our deploy procedure pushed us to run tests locally after merging the PR to check that all was still ok, and the script simply pulled fresh code and restarted the guardian pm2 process. So you see, shifting away from git-push-style deploys doesn’t have to be such a big jump in the first iteration. Later we switched to tarballs built on a CI and fetched onto the remote machine by our deploy script, and now we have Docker images. Nobody can predict whether we’ll make more switches in the future, but in every case we have the elasticity.

Synchronizing the work

As the last and hardest code sample I’m giving you, the waitUntilServersHealthy method matters only for the full production deploy. It orchestrates our remotes against the maximum allowed parallel deploys limit and Consul’s health checks. If it’s not yet time to break the loop for a particular remote (that is, to start its deploy), the remote executes a sleep command, releasing our thread for other work. Also worth noting is the way we use the variable w to throttle progress logging so it doesn’t overwhelm the reader.

let w = 0;
function waitUntilServersHealthy(remote) {
 while (true) {
   if (parallelDeploysCount < maxParallelDeploys) {
     const serversHealthy = remote.waitFor(checkServersHealth);

     if (serversHealthy && parallelDeploysCount < maxParallelDeploys) {
       w = 0;
       break;
     }
   }
   if (++w % 20 === 0) {
     logProgress();
   }
   remote.exec("sleep 10");
 }
}

function checkServersHealth(done) {
 getServersHealthStatuses()
   .then(servers => {
     done(_.every(servers, server => {
       const passing = _.every(server.Checks, check => check.Status === "passing") || _.includes(forced, server.Node.Address);

       if (!passing) {
         console.log("Not healthy: " + server.Node.Address);
       }
       return passing;
     }));
   })
   .catch(err => {
     console.log("Waiting for all Backend servers in Consul failed - Message: %s, Error: %j", err, err);
     done(false);
   });
}

Finally, the following is how we log the deploy progress. Besides calling it when servers are added or removed, we also use it in the loop above. A fragment of the resulting output can be seen just after the code.

function logProgress() {
 console.log("Progress: %d/%d, Removed from ELB: %j, Parallel deploys: %d/%d",
   deployedInstancesCount, plan.runtime.hosts.length, removedFromELB, parallelDeploysCount, maxParallelDeploys);
}

...
Not healthy: 172.31.63.67
172.31.65.234 $ sleep 10
172.31.54.27  ok
172.31.72.225  ok
Not healthy: 172.31.63.67
172.31.76.226 $ sleep 10
172.31.69.83  ok
172.31.86.123  ok
172.31.72.39 Deploying to instance: i-0c334653869876f82
Progress: 28/129, Removed from ELB: [{"id":"i-0c334653869876f82","host":"ip-172-31-72-39"}], Parallel deploys: 1/25
localhost $ aws elb deregister-instances-from-load-balancer --load-balancer-name ProductELB --instances i-0c334653869876f82
172.31.92.201  ok
172.31.57.73  ok
Not healthy: 172.31.63.67
172.31.72.181 $ sleep 10
172.31.57.244  ok
172.31.92.119  ok
172.31.72.249 Deploying to instance: i-0f686dea161b7caf3
Progress: 28/129, Removed from ELB: [{"id":"i-0c334653869876f82","host":"ip-172-31-72-39"},{"id":"i-0f686dea161b7caf3","host":"ip-172-31-72-249"}], Parallel deploys: 2/25
localhost $ aws elb deregister-instances-from-load-balancer --load-balancer-name ProductELB --instances i-0f686dea161b7caf3
172.31.92.44  ok
172.31.64.205  ok
localhost  ok
...

Summary

By using the presented method, we achieved a zero-downtime deploy that is regularly executed on 50–120 servers during the workday but can scale up to bigger numbers. It will not yet work perfectly at numbers an order of magnitude higher, because servers sometimes fail to start and block our script in an endless loop rechecking their health. But that’s most likely the next issue we’ll solve here when we feel the urge.

To close the story of possible ways to solve the deploy problem, let me stress that the presented list is in no way exhaustive. Place git-push deployments on one end of the spectrum, where all the hard work is done for us but the language of choice simply must be supported. On the other end, place locally executed deploy scripts such as this one, with all the freedom and responsibility they mix in. Looking at this picture, it’s hard to forget there are plenty of choices in the middle: if we squint at container deployments alone, good options like Kubernetes, Docker swarm mode, and Apache Mesos appear on the horizon. Any of them might be the best answer for you, but with each option you gain and lose something. Products dedicated to the job could provide, right out of the box, most of the basic functions we had to implement, like partial deploys or different deploy policies; yet achieving some of them, as well as extending the process with little custom steps like Slack notifications, might be tricky at some point. Scripts we maintain ourselves can also perform project-specific actions on each instance, from cleaning old logs to parallel processing of large amounts of data. The final choice is up to you and, as always, should depend mostly on your project’s needs.

As a final thought: besides saving a lot of time on repeatable steps, like creating a monitoring event, the presented way is not free of negatives. First, it’s not always easy to write it more or less correctly at the beginning. An already mentioned example: we cannot stop the whole script directly from inside a waitFor block, yet it might be tempting to call abort there. Also, the freedom to synchronize remotes however we want can end up producing more complex code than we initially envisaged. Another factor is that a locally executed script can be modified too easily. However, in good hands it provides more good than possible damage, and you can always run it from a CI platform.

Still, I hope that with this kind of help you’ll be able to finish your own flightplan.js much quicker. As promised, a link to the full script is provided, especially for those of you interested in exactly how we use Consul’s API or the aws-cli commands. Please leave a comment if you have any thoughts or want to share your way of doing deploys.