Feature as a Service - new and noteworthy

In February, we published a guest post on how to get started with e-commerce APIs. With this short note, we’d like to showcase more new and noteworthy API-first platforms. In general, we’d like to present Feature as a Service products which your developers can use to supercharge your sales, ops, marketing, shipping, you name it.

The list is of course non-exhaustive, so if you’ve found a useful FaaS, please share it in the comments and we’ll be sure to feature it in version 2.

Thanks to Scott for inspiring us!


Streamlining deployments with flightplan.js

The deployment process is a crucial step of the software development lifecycle. The more frequently you apply changes to production, the more important it becomes for business operations. No wonder that successful projects have developed unique ways of pushing new features to production and mastered keeping the systems up and running. 

When it comes to the Node.js world specifically, two major approaches have emerged. One of them is the brisk and seamless “git push heroku” style. The second one, which we’ll describe in this article, targets more advanced deployment pipelines—the ones which consist of dozens of application servers and often require a zero-downtime policy.

"By using the presented method, we achieved a zero-downtime deploy that is regularly executed on 50 – 120 servers during the workday, but can scale up to some bigger numbers."

This approach goes far beyond a handy one-liner and calls for custom code. So, our goal is to show you how to put together an easy-to-maintain deployment script with a (not-so-) little help from flightplan.js.

Let’s talk about Heroku-style deployment first. A disclaimer: we’re far from berating Heroku and its kind. We’ve been using it ourselves and we think they have a well-deserved place on the DevOps map. We just want to present an alternative we found useful for managing deployments in more challenging environments—challenging in terms of infrastructure and the responsibility for site reliability lying on your shoulders.

So, “What’s your new alternative?” you ask. The new approach isn’t actually that new. It’s a plain old shell script. The only novelty is that it’s wrapped in a nice, battle-tested, cloud-ready JS interface. Meet flightplan.js. Now, have a seat and listen to the story of our journey to a configurable and reliable deployment script that, besides full zero-downtime deploys, allows us to do partial deploys, quick rollbacks, and unification of all servers’ versions.

The infrastructure


Before we dive into the nitty-gritty, let’s take a moment to describe how servers and other things are connected in our case. Here’s the big picture:

fly1.png

In this rough overview, we have lots of application servers behind an AWS ELB load balancer. The arrow at the top represents requests coming from the external world into our application. After being routed through the ELB, they end up on one of our BE servers, i.e. the backend servers holding the app. A load balancer is essential in this architecture to split the load more or less equally among all available instances. Servers are supposed to serve requests and stay at the desired average CPU level, say 70%; monitoring plus autoscaling provide the appropriate dash of magic. After each instance is up, Puppet helps us fill it with the minimum required software dependencies and launch our app at the expected version.

Consul is a service discovery platform that helps us track all instances (IP, health), divided by the type of service a particular instance provides. Both ELB and Consul allow us to hold a list of servers and check their health; on top of that, Consul offers a key-value storage for real-time configuration, storing both low-level aspects and top domain-specific settings of the application. From the perspective of a DevOps engineer doing a deploy, our script uses all the mentioned Consul features through its API, external commands called on a preinstalled and configured aws-cli, and Flightplan’s built-in ability to use SSH agents. The aws-cli is a command line tool from AWS that provides many operations for some of their services; we use it to control the ELB and to get some info from S3.
The presented architecture allows us to scale to hundreds of servers (as shown in the graph below) and still have a way to imperceptibly update the code at any time.

The script

Let’s close the boring part and go through a summary of what we can do with Flightplan.
 

"use strict";

// Registering cleanup callback before requiring flightplan.
process.on("SIGINT", interruptedCleanup);

const util    = require("util"),
      moment  = require("moment"),
      _       = require("lodash"),
      plan    = require("flightplan"),
      request = require("request-promise");

/*
Usages:

fly [deploy:]dev                           (deploys current branch on DEV)
fly [deploy:]dev --branch=testing          (deploys testing branch, this optional param can be used for all targets)
fly [deploy:]dev --branch=23af9e8          (deploys 23af9e8 commit, can be used for all targets)
fly [deploy:]canary --msg="Summary"        (deploys master branch on Canary, msg is required param for canary and production targets)
fly [deploy:]production10 --msg="Summary"  (deploys master branch on 10% of production servers)
fly [deploy:]production25 --msg="Summary"  (deploys master branch on 25% of production servers)
fly [deploy:]production --msg="Summary"    (deploys master branch on all of production servers)

fly [deploy:]production --msg="Summary" --force="172.11.22.333"
fly [deploy:]production --msg="Summary" --force="172.11.22.333,172.22.33.444"
(the force param skips waiting for the listed instances' healthchecks and forces redeploys there)

fly [deploy:]canary --msg="Msg" --silent   (silent mode turns off Slack notifications and events)

fly rollback:canary                        (rollbacks old build on Canary)
fly rollback:production                    (rollbacks old build on all of production servers)

fly unify:production                       (unifies build version for all of production servers, helpful to "rollback" partial deploys like 10%)

*/

The comments next to each command, plus the brief process overview we’re about to go through shortly, are all I can give you right now to prepare you for the upcoming code snippets, so please look them over if you haven’t yet. And if you’re wondering whether you’ve got the idea of partial deploys, rollbacks, and unification right, let the following graph clarify it by showing how version pointers change over time. It shows how the key-value entries with app versions are updated during deploys to DEV, Canary, and production, in full or partial range. These values are then read when restarting the app on a server. The graph simplifies the view, because in reality the values are git hashes.

fly2.png

Emergency exit

Now, after we’ve gone through the introduction, it’s time to answer the real question: what’s inside interruptedCleanup and when is it needed? It’s mostly for the case when the developer wants to quit the script’s execution because they think something is wrong. In the majority of cases we don’t have any servers being processed at that time, but, for instance, when the script is stopped by accident, we get a last chance to do something useful with the data kept in memory.

If you want this to work, you must register the handler as early as possible, even before "flightplan" is required. (Note: besides a locally installed Flightplan for the require statements to work in the script, we must also have it installed globally (npm i -g flightplan) to be able to use it as a command line tool.)

// Registering cleanup callback before requiring flightplan.
process.on("SIGINT", interruptedCleanup);

...

function interruptedCleanup() {
 if (removedFromELB.length) {
   console.log("Register instances back:");
   console.log("aws elb register-instances-with-load-balancer --load-balancer-name ProjectELB --instances " + _.map(removedFromELB, "id").join(" "));
 }
}

The handler only logs the command that must be used to register back the instances currently removed from the ELB. When the script is interrupted, it’s not the best idea to fork a new process to do this, because in many cases it will simply fail. Printing out the command is, for now, the best we can do in such a case.
 

All settings go first

Going further in the code, we find the following configuration:

// --- Configuration ---
let BRANCH = "master";

const DEV_OPTIONS     = {
        versionKeyPostfix: "_dev"
      },
      CANARY_OPTIONS  = {
        event: "Canary",
        versionKeyPostfix: "_canary",
        oldVersionKeyPostfix: "_canary_old",
        lockTarget: "canary",
        lbTakeout: true
      },
      PROD_10_OPTIONS = {
        event: "10% of servers",
        lockTarget: "production",
        lbTakeout: true,
        bringBackPrevVersion: true
      },
      PROD_25_OPTIONS = {
        event: "25% of servers",
        lockTarget: "production",
        lbTakeout: true,
        bringBackPrevVersion: true
      },
      PROD_OPTIONS    = {
        event: "All servers",
        oldVersionKeyPostfix: "_old",
        lockTarget: "production",
        lbTakeout: true,
        waitForAllHealthy: true,
        canUnify: true
      };

const PARALLEL_DEPLOYS_FRACTION = 0.2;
const CONSUL_URL = "http://url.to.consul:8500/v1";
const BACKEND_GIT_URL = "git@github.com:ORG/project.git";

// ---------------------

let newRev, prevRev,
    localMachine,
    forced                 = [],
    maxParallelDeploys     = 1,
    deployedInstancesCount = 0,
    parallelDeploysCount   = 0;

const removedFromELB = [];

Some of the variables shown should make sense to you right away, like versionKeyPostfix, oldVersionKeyPostfix, bringBackPrevVersion, canUnify, or removedFromELB, and some might require a little more explanation. Take lbTakeout, for example: it indicates whether the targets should be taken out of and put back into the ELB. The lockTarget variable, if set, makes the script use simple locking to prevent parallel deploys to important servers. Its value is also stored inside the lock in Consul, just because we can.
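
To make the locking more concrete, here is a minimal sketch of what getLock, lock, and unlock might look like on top of Consul's KV HTTP API. The key name "deploy/lock" and the "free" sentinel value are illustrative assumptions; the real helpers are in the gist linked at the end.

function getLock() {
  // Consul returns the bare value when ?raw is appended.
  // A real implementation should distinguish a missing key (404) from other failures.
  return request(CONSUL_URL + "/kv/deploy/lock?raw")
    .catch(() => "free");
}

function lock(target) {
  // Store the target name inside the lock, "just because we can".
  return request({ method: "PUT", url: CONSUL_URL + "/kv/deploy/lock", body: target });
}

function unlock() {
  return request({ method: "PUT", url: CONSUL_URL + "/kv/deploy/lock", body: "free" });
}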

Also, a word about PARALLEL_DEPLOYS_FRACTION. It’s simply the fraction of servers that can be modified simultaneously. With 100 production servers and a fraction of 0.2, at most 20 are restarted at once. The rest should get easier once seen in action.

Let me also stress that the all-important BRANCH variable stores the app version that should be deployed. By default, it points to "master" but can easily be overridden by a value from the built-in params converter: if we specify --branch="value" on the command line, it will be instantly available inside the options object, which we'll see later.

Next we have two very similar target hooks for DEV and Canary where we specify (using a function) what servers are to be used.

// Target plan for DEV server.
plan.target("dev", done => {
 BRANCH = "HEAD"; // Using current branch as default.

 getServersList()
   .catch(err => done(new Error(util.format("Getting servers list failed - Message: %s, Error: %j", err, err))))
   .then(servers => {
     if (_.isEmpty(servers)) {
       return;
     }

     const address = (_.find(servers, isDev) || {}).Address;

     if (!address) {
       done(new Error("DEV not found"));
       return;
     }

     console.log("DEV IP: %s", address);

     done([toHost(address)]);
   });
}, DEV_OPTIONS);

First, let’s clarify what a target means here. As you’ve probably guessed, whatever is provided as the second parameter (an object, an array, or a function that calls a callback with an object or array, as in our case) will be taken as the target machines which Flightplan will control for us. The second important concept is a task. Tasks fire in code order, as long as their name matches the first parameter, which can be either a string or an array.

The content of methods like getServersList can be found in the gist attached at the very end of this post. For now, let’s just assume that servers are taken from Consul’s catalog of registered instances; similarly, we read the healthchecks from it later on.
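
For reference, a minimal getServersList could be as small as the sketch below. It queries Consul's catalog endpoint, which returns entries carrying the Address and ServiceTags fields used by the filters shown later; the service name "project" is a placeholder.

function getServersList() {
  // GET /v1/catalog/service/:service lists all registered instances of a service.
  return request({ url: CONSUL_URL + "/catalog/service/project", json: true });
}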

Next, let’s take a look at the 25% and full production targets (let’s leave the 10% case for the imagination or you can check it in the gist).

// Target plan for 25% of production servers.
plan.target("production25", done => {
 getServersList()
   .catch(err => done(new Error(util.format("Getting servers list failed - Message: %s, Error: %j", err, err))))
   .then(servers => {
     if (_.isEmpty(servers)) {
       return;
     }

     const addresses = _(servers)
       .reject(isNonProduction)
       .map("Address")
       .sortBy()
       .value();

     const hosts = _(addresses)
       .take(Math.floor(addresses.length / 4))
       .map(toHost)
       .value();

     if (_.isEmpty(hosts)) {
       done(new Error("No productions servers found"));
       return;
     }

     maxParallelDeploys = hosts.length;

     console.log("Backend servers: %j", _.map(hosts, "host"));
     console.log("Parallel deploys: %d", maxParallelDeploys);

     done(hosts);
   });
}, PROD_25_OPTIONS);

// Target plan for production servers.
plan.target("production", done => {
 getServersList()
   .catch(err => done(new Error(util.format("Getting servers list failed - Message: %s, Error: %j", err, err))))
   .then(servers => {
     if (_.isEmpty(servers)) {
       return;
     }

     const addresses = _(servers)
       .reject(isNonProduction)
       .map("Address")
       .sortBy()
       .value();

     if (_.isEmpty(addresses)) {
       done(new Error("No productions servers found"));
       return;
     }

     const hosts = _.map(addresses, toHost);

     maxParallelDeploys = Math.floor(addresses.length * PARALLEL_DEPLOYS_FRACTION) || 1;

     console.log("Backend servers: %j", addresses);
     console.log("Parallel deploys: %d", maxParallelDeploys);

     done(hosts);
   });
}, PROD_OPTIONS);

The target plan for Canary server is similar.

As you can see, in the cases where we return many production instances, the IP addresses are sorted alphabetically. This means that, for instance, when deploying 10% and then 25%, only 15% more servers are updated with the new code. Notice that maxParallelDeploys is a fraction of all servers for a full deploy but equals the full target server count for partial deploys. Therefore, do not use this code for fractions higher than 25% without properly adjusting this variable.

The filters shown below find the desired instances based on Consul tags, which originally come from the AMI configuration of each instance and are used during registration in Consul.

function isNonProduction(server) {
 return isDev(server) || isCanary(server);
}

function isDev(server) {
 return _.includes(server.ServiceTags, "dev");
}

function isCanary(server) {
 return _.includes(server.ServiceTags, "canary");
}

function toHost(address) {
 return {
   host: address,
   username: "user",
   agent: process.env.SSH_AUTH_SOCK
 };
}

Above you can see how each host must be formatted in the end—besides providing the IP, we need to (or can) pass a username, a locally announced SSH agent, or even the certs for the SSH connection.

Ok, let’s take a break now if you are panting, because we are about to enter a really steep descent.

Quick recap

What have we shown so far? The first part of the script is all about setting up the exit hook, the initial configuration, and acquiring data about the servers. Flightplan will then execute the “remote part” of the following script on them. We can provide just one object’s description, an array of such objects, or a function that returns one of these two. Each such object specifies all the details for your local SSH agent. An array (or object) provided as the target’s definition would have to be either hard-coded in the script or obtained synchronously before the target’s hook is called.
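
To illustrate, the three accepted forms of a target definition look roughly like this (the hosts are made up):

// A single, hard-coded host object:
plan.target("staging", {
  host: "10.0.0.5",
  username: "user",
  agent: process.env.SSH_AUTH_SOCK
});

// An array of host objects:
plan.target("staging-pair", [
  { host: "10.0.0.5", username: "user", agent: process.env.SSH_AUTH_SOCK },
  { host: "10.0.0.6", username: "user", agent: process.env.SSH_AUTH_SOCK }
]);

// A function calling back with the hosts, as used throughout this article:
plan.target("dynamic", done => done([toHost("10.0.0.7")]));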

Tasks definitions

After defining all targets, it’s time to attach tasks to the local or remote sides, and order matters here.

First, the local hook fires if we want to deploy the app or if we don’t specify a task at all ("default"), for example when running fly dev.

plan.local(["default", "deploy"], local => {
 localMachine = local;
 const options = plan.runtime.options;

 if (options.force) {
   forced = options.force.trim().split(",");
 }

 if (options.branch) {
   BRANCH = options.branch;
 }

 if (BRANCH === "HEAD") {
   newRev = local.exec("git rev-parse HEAD").stdout.trim();
 } else {
   const remoteRev = local.exec("git ls-remote " + BACKEND_GIT_URL + " " + BRANCH).stdout;
   newRev = (remoteRev || local.exec("git rev-parse " + BRANCH).stdout).split("\t")[0].trim();
 }

 if (options.event && typeof options.msg !== "string") {
   plan.abort("Please provide deploy summary. E.g. 'fly deploy:production --msg=\"Deploy summary\"'");
   return;
 }

 if (!buildReady()) {
   plan.abort("Build is not ready");
 }
});

Why are there so many rocks on the road? Let’s go step by step.
In the first line, we store the reference to the local machine in a script variable. We do it to have easier access (less code) in helper methods, but primarily to be able to call its functions while Flightplan is executing a remote hook.

The second line stores the options in a handier variable. The options combine the target’s default options and the dynamic options you provide on the command line. This is used in the following two conditional statements: if a value is provided, parse it and use it; if nothing is provided as the value for a particular key on the command line, the value in the script will be equal to true.

Next comes the most interesting part: acquiring the final revision. If BRANCH equals "HEAD", we obtain the git sha of the local branch. This is very handy when we want to test a feature quickly on the DEV server (this target’s code assigns it as the default; you can check above). If BRANCH is still "master", or any other branch label, the remoteRev variable resolves to the git sha in the remote repository, so the proper deploy can happen even without a local git pull after merging the PR. If the branch is not found, the value is tried out as a shortened git sha and expanded into its full form.
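
As a quick illustration of why the script splits remoteRev on "\t": git ls-remote prints the sha and the ref separated by a tab (the sha below is made up).

$ git ls-remote git@github.com:ORG/project.git master
23af9e8f0d6afc4b1506f1a14c5d4fb2c4196a31	refs/heads/master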

Finally, the script checks whether the current target demands a deploy message and whether it’s provided on the command line. Next, the script verifies that the desired build is ready by checking whether a Docker container with the corresponding name is saved in AWS S3. This is quite a custom solution, I admit, but you can always find yours. As you can see, once you call plan.abort, Flightplan cuts its execution. Remember that this function throws an error that must be caught while still in Flightplan’s context; if not, we’ll get an unhandled exception.

Commands in Flightplan

Let’s also examine how we check whether the build is ready.

function buildReady() {
 return localMachine.exec("aws s3 ls s3://url.to.docker.registry/docker/registry/v2/repositories/product/_manifests/tags/" + newRev, { failsafe: true }).code === 0;
}

Here we can see a few Flightplan-specific aspects. The script calls the exec function on the local instance, which forks a process for any system-level command we want. By default, if the command fails, it throws an error so that the whole script is aborted. Setting the failsafe param to true suppresses this behavior, and we can check the resulting code of the command call instead. Additionally, we have access to the stdout and stderr outputs. Another useful param is silent, which can hide the output of noisy commands from Flightplan’s logs.
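
For example, a quick sketch of both params in action (the commands here are arbitrary):

// failsafe: inspect the exit code instead of aborting the whole plan.
const check = local.exec("test -f /tmp/build.lock", { failsafe: true });
if (check.code !== 0) {
  local.log("No build lock found");
}

// silent: keep a noisy command out of Flightplan's log; stdout stays accessible.
const listing = local.exec("ls -la /var/log", { silent: true });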

More hooks

The following two hooks fire only to set localMachine and check whether the given task can run on the specified target.

plan.local("rollback", local => {
 localMachine = local;
 const options = plan.runtime.options;

 if (!options.oldVersionKeyPostfix) {
   plan.abort("Target is not reversible");
 }
});

plan.local("unify", local => {
 localMachine = local;
 const options = plan.runtime.options;

 if (!options.canUnify) {
   plan.abort("Target can not be unified");
 }
});

// HACK: Connecting to each server before locking, posting to Slack, etc.
plan.remote(["default", "deploy", "rollback", "unify"], remote => remote.log("Connected"));

Above we can see the first case of a remote hook. What differs from the local one is that here we have a reference to an instance far away from our laptop. It’s important to grasp that the code still executes in one Node thread, as usual, while commands fire asynchronous actions (network operations by nature) whose results look as if they came from a regular synchronous operation. In this way, long-running remote operations execute simultaneously on all chosen targets.

In the above snippet, it seems that we just log one word, but is that even needed? This one simple function will halt the script’s execution if there are any issues connecting to a BE server. It is placed deliberately before another local hook, where we do some initial operations designed to be executed on one machine only, like locking or posting notifications. These ops could just as well go into the remote hook, but then we would need to synchronize the other remote instances to wait for the privileged one.

Action hooks

If you reached this point, I have some news for you. First of all, there are just four code snippets to go. The first two do the main job, so let’s skim through them; a summary of the core actions follows each code fragment. At the end, there are two last pieces for stretching your legs before the struggle that is still ahead of you. Tie up your shoes well and we’ll see each other just after the next hill.

plan.local(["default", "deploy"], local => {
 const options = plan.runtime.options;
 local.log("Setting Consul lock and new app version");

 const err = local.waitFor(done =>
   Promise.resolve()
     .then(() => {
       if (!options.waitForAllHealthy) {
         return;
       }

       return new Promise(checkServersHealth)
         .then(healthy => {
           if (!healthy) {
             return Promise.reject("Servers not healthy");
           }
         });
     })
     .then(() => {
       if (!options.lockTarget) {
         return;
       }

       return getLock()
         .catch(err => Promise.reject(util.format("Getting lock state failed - Message: %s, Error: %j", err, err)))
         .then(result => {
           console.log("Lock state: %j", result);

           if (result !== "free") {
             return Promise.reject("Deploy is locked in Consul");
           }

           return lock(options.lockTarget)
             .catch(err => Promise.reject(util.format("Locking failed - Message: %s, Error: %j", err, err)));
         })
         .then(() => console.log("Locking succeeded"));
     })
     .then(() => {
       if (!options.bringBackPrevVersion && !options.oldVersionKeyPostfix) {
         return;
       }

       return getCurrentVersion()
         .then(currentVersion => {
           console.log("Current version: %s", currentVersion);
           prevRev = currentVersion;
         })
         .catch(err => Promise.reject(util.format("Getting current version failed - Message: %s, Error: %j", err, err)));
     })
     .then(() => {
       // Don't update old version if new revision is same as prev - it means last deploy failed.
       if (!options.oldVersionKeyPostfix || newRev === prevRev) {
         return;
       }

       return setNewVersion(options.oldVersionKeyPostfix, prevRev)
         .then(result => console.log("Old version set - Result: %s", result))
         .catch(err => Promise.reject(util.format("Setting old version failed - Message: %s, Error: %j", err, err)));
     })
     .then(() => {
       // Posting to Slack and creating Event.
       if (options.event && !options.silent) {
         postToSlack(options.event, options.msg);
         createEvent(options.event, options.msg);
       }

       return setNewVersion(options.versionKeyPostfix)
         .catch(err => Promise.reject(util.format("Setting new version failed - Message: %s, Error: %j", err, err)));
     })
     .then(result => console.log("New version set - Result: %s", result))
     .then(done)
     .catch(done)
 );

 if (err) {
   plan.abort(err);
 }
});

Nice! Let’s briefly recap what is happening above. First, the execution quickly moves to the promise chain inside the local.waitFor call. This makes the transport object (it can be called on a remote too) wait until you call the callback provided to your function.

Let’s go through the steps that take place. The servers’ health is checked first (only on prod) to fail fast if something is already not OK. After that, we check the lock and acquire it if it’s free (canary and prod). Next, partial and full deploys get the current version. Then, full deploys (canary and production) set the old version to the current one. Notice that we skip that last update in a special case: when the previous version equals the new one. This can occur if we break a partial deploy or repeat a full deploy instead of running a unification task.
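
The version helpers, like the lock ones, can again be reduced to thin Consul KV calls. In this sketch the base key name "project_version" is an assumption made for illustration; the postfixes come from the target options shown earlier.

function getCurrentVersion() {
  // Read the current full-deploy version.
  return request(CONSUL_URL + "/kv/project_version?raw");
}

function setNewVersion(postfix, rev) {
  // Write newRev (or an explicitly passed rev, e.g. the previous one) under the postfixed key.
  return request({
    method: "PUT",
    url: CONSUL_URL + "/kv/project_version" + (postfix || ""),
    body: rev || newRev
  });
}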

Just before the last command, which sets the new, desired version, the script sends a Slack notification and creates an event in one of our monitoring systems. This fires only if the target has an event and silent mode is not activated.

And that would be it for this part. If everything succeeds, the script moves on to the remote targets, executing the following code “in parallel” on each of them, and then goes into the last hook, which fires locally again.

plan.remote(["default", "deploy", "rollback", "unify"], remote => {
 const options = plan.runtime.options;

 const instanceId = remote.exec("wget -q -O - http://instance-data/latest/meta-data/instance-id").stdout;
 const instanceHost = remote.hostname().stdout.trim();
 const instanceData = { id: instanceId, host: instanceHost };

 if (options.waitForAllHealthy) {
   // Wait until all servers are back in Consul (healthcheck pass).
   waitUntilServersHealthy(remote);
 }

 parallelDeploysCount++;

 remote.log("Deploying to instance: " + instanceId);

 if (options.lbTakeout) {
   removedFromELB.push(instanceData);
   logProgress();
   removeInstanceFromELB(instanceId);

   // Wait until connections are finished.
   remote.exec("sleep 20");
 }

 remote.log("Restarting project_s service on instance: " + instanceId);
 remote.exec("sudo systemctl stop project_s && sudo systemctl start project_s");

 if (options.lbTakeout) {
   addInstanceToELB(instanceId);
   _.pull(removedFromELB, instanceData);
 }

 parallelDeploysCount--;
 deployedInstancesCount++;

 remote.log("Deployed to instance: " + instanceId);
 logProgress();
});

plan.local(["default", "deploy", "rollback", "unify"], local => {
 const options = plan.runtime.options;

 const err = local.waitFor(done =>
   Promise.resolve()
     .then(() => {
       if (!options.bringBackPrevVersion) {
         return;
       }

       console.log("Setting previous version in Consul: %s", prevRev);
       newRev = prevRev;

       return setNewVersion(options.versionKeyPostfix)
         .then(result => console.log("Previous version set - Result: %s", result))
         .catch(err => Promise.reject(util.format("Setting old version failed - Message: %s, Error: %j", err, err)));
     })
     .then(() => {
       if (!options.lockTarget) {
         return;
       }

       return unlock()
         .then(result => console.log("Unlock succeeded - Result: %j", result))
         .catch(err => Promise.reject(util.format("Unlocking failed - Message: %s, Error: %j", err, err)));
     })
     .then(done)
     .catch(done)
 );

 if (err) {
   console.log(err);
 }
});

An interesting thing happens in the part above. The code executes in Node’s single thread, which switches its execution time between all remote contexts whenever a command on a target object is executed. When the command finishes, the calling remote context can continue executing its iteration of our hook. Local variables create a shared memory (fortunately with just one thread changing the state) for all remote contexts, which we can use to synchronize their work the way we want.

The following actions take place on each instance. First, we obtain the instanceId using an AWS trick and the instanceHost from Flightplan’s data. Then, if the target is production, each server waits until all instances are healthy. Going further, we increase the parallelDeploysCount counter, which is visible when reporting progress. Next, when the target demands it, we take the instance out of the ELB and wait 20 seconds until, hopefully, all requests are finished. Then the core step takes place: a restart of the Docker container. After it finishes, we put the instance back into the ELB right away. The ELB will not enable the instance until it is healthy again, so we don’t need to worry about this part. Some counters are then decreased (current parallel deploys) or increased (progress), and the remote hook is finished at this point.
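
The ELB helpers themselves boil down to shelling out to aws-cli on the local machine, as the log excerpt later confirms; a minimal sketch could look like this (the load balancer name is the project's own):

function removeInstanceFromELB(instanceId) {
  // Forked on the local machine, not on the remote being deployed.
  localMachine.exec("aws elb deregister-instances-from-load-balancer --load-balancer-name ProjectELB --instances " + instanceId);
}

function addInstanceToELB(instanceId) {
  localMachine.exec("aws elb register-instances-with-load-balancer --load-balancer-name ProjectELB --instances " + instanceId);
}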

After each remote finishes, the script goes into the last local hook, which sets back the previous version (for a partial deploy) and/or removes the lock if needed.

A piece of history

Speaking of what must be triggered on the remote to restart the app: what we do now is restart a service that queries Consul for its specific version, downloads the corresponding Docker image, and runs it. However, it was not always like that. One of the strong points of this kind of deploy script is the liberty you gain when it comes to the transport layer of your app. In the very beginning, we had neither Docker images nor any CI to build them. Our deploy procedure pushed us to run tests locally after merging the PR to check that all was still OK, and the script simply pulled fresh code and restarted the guardian pm2 process. So you see, if you want to shift away from the git-push-repo deploy style, it does not have to be such a big jump in the first iteration. Then we switched to tarballs built on a CI and fetched onto the remote machine by our deploy script, and now we have Docker images. Nobody can predict whether we’ll make more switches in the future, but in all cases we have the elasticity.

Synchronizing the work

The last, hardest code sample I’m giving you, the waitUntilServersHealthy method, matters only for the full production deploy. It orchestrates our remotes against the maximum allowed parallel deploys limit and Consul’s health checks. If it’s not yet time to break the loop for a particular remote (i.e., start the deploy), the remote executes a sleep command, releasing our thread for other work. Worth noting is how we use the local variable w to throttle progress logging so it doesn’t overwhelm the reader.

let w = 0;
function waitUntilServersHealthy(remote) {
 while (true) {
   if (parallelDeploysCount < maxParallelDeploys) {
     const serversHealthy = remote.waitFor(checkServersHealth);

     if (serversHealthy && parallelDeploysCount < maxParallelDeploys) {
       w = 0;
       break;
     }
   }
   if (++w % 20 === 0) {
     logProgress();
   }
   remote.exec("sleep 10");
 }
}

function checkServersHealth(done) {
 getServersHealthStatuses()
   .then(servers => {
     done(_.every(servers, server => {
       const passing = _.every(server.Checks, check => check.Status === "passing") || _.includes(forced, server.Node.Address);

       if (!passing) {
         console.log("Not healthy: " + server.Node.Address);
       }
       return passing;
     }));
   })
   .catch(err => {
     console.log("Waiting for all Backend servers in Consul failed - Message: %s, Error: %j", err, err);
     done(false);
   });
}
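
For completeness, getServersHealthStatuses is another thin Consul call, this time against the health endpoint, which returns entries shaped like { Node: { Address }, Checks: [{ Status }] }, exactly what checkServersHealth above expects ("project" is again a placeholder service name).

function getServersHealthStatuses() {
  // GET /v1/health/service/:service returns nodes together with their health checks.
  return request({ url: CONSUL_URL + "/health/service/project", json: true });
}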

Finally, the following is how we log the deploy progress. Besides calling it when servers are added or removed, we also use it in the loop above. A fragment of the final effect can be seen just after the code.

function logProgress() {
 console.log("Progress: %d/%d, Removed from ELB: %j, Parallel deploys: %d/%d",
   deployedInstancesCount, plan.runtime.hosts.length, removedFromELB, parallelDeploysCount, maxParallelDeploys);
}

...
Not healthy: 172.31.63.67
172.31.65.234 $ sleep 10
172.31.54.27  ok
172.31.72.225  ok
Not healthy: 172.31.63.67
172.31.76.226 $ sleep 10
172.31.69.83  ok
172.31.86.123  ok
172.31.72.39 Deploying to instance: i-0c334653869876f82
Progress: 28/129, Removed from ELB: [{"id":"i-0c334653869876f82","host":"ip-172-31-72-39"}], Parallel deploys: 1/25
localhost $ aws elb deregister-instances-from-load-balancer --load-balancer-name ProductELB --instances i-0c334653869876f82
172.31.92.201  ok
172.31.57.73  ok
Not healthy: 172.31.63.67
172.31.72.181 $ sleep 10
172.31.57.244  ok
172.31.92.119  ok
172.31.72.249 Deploying to instance: i-0f686dea161b7caf3
Progress: 28/129, Removed from ELB: [{"id":"i-0c334653869876f82","host":"ip-172-31-72-39"},{"id":"i-0f686dea161b7caf3","host":"ip-172-31-72-249"}], Parallel deploys: 2/25
localhost $ aws elb deregister-instances-from-load-balancer --load-balancer-name ProductELB --instances i-0f686dea161b7caf3
172.31.92.44  ok
172.31.64.205  ok
localhost  ok
...

Summary

By using the presented method, we achieved a zero-downtime deploy that is regularly executed on 50-120 servers during the workday but can scale up to somewhat bigger numbers. It will not yet work perfectly at numbers an order of magnitude bigger, because servers sometimes fail to start and block our script in an endless loop rechecking server health. But this is most likely the next issue we will solve here when we feel the urge.

To close the story of possible ways to solve the deploy problem, let me stress that the presented list is not exhaustive in any way. Let’s place git push deployments on one end of the spectrum, where all the hard work is taken from us but the language of choice simply must be supported. On the other end, place locally executed deploy scripts such as this one, with all the freedom and responsibility they mix in. Looking at this picture, it’s hard to forget there are plenty of choices in the middle. If we squint at just container deployments, good options like Kubernetes, Docker swarm mode, and Apache Mesos arise on the horizon. Any of them might be the best answer for you, but with each option you gain and lose something. Products dedicated to that job could provide right away most of the basic functions we had to implement, like handling partial deploys or different deploy policies; yet achieving some of them, as well as extending the process with little custom steps like Slack notifications, might be tricky at some point. Scripts maintained by us can also perform project-specific actions on each instance, from cleaning old logs to parallel processing of some big amount of data. The final choice is up to you and, as always, should depend mostly on your project’s needs.

As a final thought, I can say that besides saving a lot of time on repeatable steps, like creating a monitoring event, the presented way is not free from a few negatives. First, it’s not always easy in the beginning to write it more or less correctly. An already-mentioned example is that we cannot stop the whole script directly from inside a waitFor block, and it might be tempting to use abort there. Also, the fact that we can synchronize remotes as freely as we want can end up in more complex code than we initially envisaged. Another factor is that a locally executed script can be modified too easily. However, in good hands it can do more good than possible damage, and you can always run it from a CI platform.

Still, I hope that with this kind of help you’ll be able to finish your own flightplan.js much quicker. As promised, a link to the full script is provided, especially for those of you interested in how exactly we use Consul’s API or aws-cli commands. Please leave a comment if you have any thoughts or want to share your way of doing deploys.

Coupon numbers under control - how to secure ROI with coupon limits

Learn Voucherify’s features which restrict coupon usage.

Limits on coupon usage should be adjusted to the campaign goals and be flexible enough to face the dynamics of unexpected changes. With this article, we will show you how to restrict coupon usage and make redemptions a reliable measure of campaign success.

once (4).png

With Voucherify you can define coupon usage with many properties:

  • Total number of coupon redemptions
  • Number of redemptions per customer
  • Number of redemptions per customer across the whole campaign
  • Auto-extending with auto-update mode

 

The total number of code redemptions

Whether you run a fixed-code campaign (a public code to attract new customers, shared through social media and other channels) or a campaign of unique coupons (a bulk of unique, one-off codes), you can set the total number of code redemptions according to your goals:

  • Coupon can be redeemed only once

  • Coupon can be redeemed an unlimited number of times

  • Coupon can be used a predetermined number of times—Voucherify enables you to set any custom value as the redemption limit
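
For developers, the same limit can be set through the Voucherify API when creating a voucher. Below is a minimal sketch in Node.js; the credentials and discount values are made up for illustration, and redemption.quantity is the total-redemptions limit (per-customer limits are configured through validation rules).

// Creating a voucher that can be redeemed at most 10 times in total.
// App credentials and the discount below are illustrative placeholders.
const request = require("request-promise");

request({
  method: "POST",
  url: "https://api.voucherify.io/v1/vouchers",
  headers: {
    "X-App-Id": "YOUR-APP-ID",
    "X-App-Token": "YOUR-APP-TOKEN"
  },
  json: true,
  body: {
    type: "DISCOUNT_VOUCHER",
    discount: { type: "PERCENT", percent_off: 10 },
    redemption: { quantity: 10 }
  }
}).then(voucher => console.log("Created code: %s", voucher.code));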

 

The number of redemptions per customer

Validation rules allow marketers to limit the redemption count per single customer. The most popular marketing strategy limits voucher usage to one per client. This practice is especially common in companies which run public, fixed-code campaigns.

Why only once?

  • The once-per-customer rule is useful in campaigns aimed at acquiring new customers. You can be sure that after a valid order, a new client won’t use the code again.
  • Other benefits appear while running A/B tests. You're guaranteed that the number of redemptions is limited to the number of attracted customers.

 

The number of redemptions per customer across the whole campaign

You can choose to restrict the number of redemptions per customer across the whole campaign. This property can be combined with the once-per-customer rule or any other custom redemption limit.

 

Auto-update mode

Coupon strategy is based on constant testing and searching. When a campaign requires dynamic changes in its limits or size, Voucherify provides an auto-update mode. Depending on current campaign performance, the campaign is automatically extended with the needed number of coupons.

Let’s assume you’ve created a campaign with 1,000 unique codes which became an unexpected success. In auto-update mode, Voucherify will automatically extend the pool of coupon codes according to the number of attracted customers. There is no need to create another campaign with the same pattern. No matter the scale of growth, campaign data is gathered in one place, and you can easily analyze the success rate.

 

Under control

When coupons are given limits, you need to ensure customers stick to them. In the Voucherify dashboard, you’ll find the current redemption status of every voucher. If a code has already been redeemed the available number of times, the app ensures it cannot be used again.
 

 

Summary

Effective limits on coupon usage are reflected in a more accurate ROI. You can eliminate misuse which distorts campaign metrics and harms your budget. Once your customers stick to the rules, you can be sure that redemptions are accurately reflected in campaign revenue.



Build geolocation into your coupon strategy

The concept of geolocation allows you to tailor your coupon marketing to a target location. The major goal is to serve your customers more relevant content with good timing. In the following paragraphs, we are going to show you how merchants across the world use geomarketing in their coupon campaigns to improve customer experience.

 

“74% of adult smartphone owners ages 18 and older say they use their phone to get directions or other information based on their current location.”

 

Local favorite

The idea of geomarketing came out of the correlation between a particular area and specific consumer behavior. It allows your coupon campaigns to reach the target audience in the right place. In other words, by tracking customer location and spending habits, you can more effectively tailor discounts and give customers what they want, when they want it. Take a look at these examples:

Market giants diversify their coupon offers across international markets. Below, you can see the differences between Reebok campaigns in the USA and the UK:

Popular Reebok coupons in the USA

Exemplary Reebok coupons in the UK

 

Plan your event

Geo-located marketing can be used to promote your brand through local events and contests. By creating engaging local campaigns, your target audience will have more opportunities to interact with your coupons online or offline. Memorable events enable you to grab the attention of a local audience and drive real marketing results. Many giant brands use event entertainment as an alternative to traditional marketing and advertising. This tactic works not only as a means of promotion but also helps to build a long-term brand image.

 
“Economies of scale aren’t going to cut it anymore. These days, you’ve got to go local or get out of town.”

 

What’s ROI got to do with it?

Without a target location, it is hard to estimate whether in-store coupons or free shipping will hurt your budget instead of increasing revenue. Knowledge of a campaign’s reach is vital to evaluating expected ROI. This is why global brands use location-based marketing to schedule their promotion strategy. All the data taken from an area can be used to manage sales and conquer local markets.

You can limit coupon redemption to a particular area. Whether globally or locally, use it to promote a new store or to cater for different tastes amongst your clientele.

 

Explore new fields

Geo-located campaigns allow you to make inroads into areas where your competition feels comfortable. How? By using a more hard-hitting tactic called geo-conquesting. The point is to launch attractive discounts aimed at a competitor’s area and its domestic audience. Whether you plan this kind of campaign or not, you have to be prepared for the moment when you might be in the sights of such an action organized by your competition.

 

Privacy hiccup

Today, people are getting more and more conscious when it comes to privacy. The real challenge is gathering data without being intrusive. To keep loyalty without giving up on tracking, businesses have come up with the idea of transparent marketing. Next to standard location tracking (based on user permission), you can use alternative, more subtle sources of customer data. Coupon software such as Voucherify can provide client location with redemption details:

 

The workflow is straightforward: when a customer redeems a coupon, they leave order details with location data. It’s an example of efficient tracking without hurting the customer experience.

“Customers say that relevant communication from sales and marketing plays a critical role.”

Keep pace with your audience

Traveling has never been as easy as it is now. Accordingly, geomarketing is only efficient when it's ready to handle market dynamics. Whether online or offline, people can seamlessly change their shopping location and shop around the world. The key is to capture these moves and turn them into dynamic customer grouping. Let’s use Voucherify as an example to show how dynamic geomarketing works:
The app enables you to split customers into groups (called segments). In the next step, you can send customized vouchers to the chosen audience. Segments are built with filters based on spending habits, location, previous interactions with your coupons, or any other custom property you’d like to include.


The crucial feature of Voucherify segments is the dynamic mode, which updates a customer’s location in real time.

By way of example, let’s assume you’ve built a segment of New Yorkers. If the CRM notices that a customer from NYC changes their location, they will automatically drop out of the NYC segment. If the client comes back to the NY area, they will be visible in the New Yorkers segment again. CRM data is updated in the dashboard automatically, so marketers can be sure that coupons will find the desired target at any time.

 

How to combine location with consumer-related attributes?

Once geolocation works successfully, a CRM can gather data about preferences related to a particular area. With a specific buyer profile, marketers can finally tailor a campaign’s shape. To model this in practice, let’s create a sale for a target group which:

  • prefers in-store redemptions,
  • uses a MasterCard payment,
  • has the postal code 212/646/917,

and reward them with a product-specific discount.


While creating a segment, you can include all the details which define a target. Geo-located promotions can be based on a customer’s postal code, city, and country.
With the segment completed, you can start to create a campaign with unique coupons. In our scenario, we want to run a geo-located, product-specific campaign. It comes down to using the campaign creator, where you need to add validation rules. They define a customer segment, discounted products, or any other limits for a valid redemption.

Moreover, you can distribute codes automatically to all the clients who appear in the chosen location. The workflow looks as follows:

  1. Create a dynamic segment which defines your target group.
  2. Create the coupon campaign with the manager and define redemption rules.
  3. Choose a distribution channel and set automatic delivery.

In practice, customers will receive unique coupons when they join the segment.

Customized messages with codes can be delivered automatically by email, SMS, or through your MailChimp account. Once you set up automatic distribution, Voucherify sends a code to each customer who enters the segment.

 

Maintenance

Geo-located promotions perform well only if coupons are secured from external misuse. The point is to keep an eye on customer data and verify a customer’s location when they redeem a coupon. This can be managed with an efficient information flow between a CRM and voucher software. Seamless integration with Voucherify ensures that customer data is updated in real time. Additionally, the app looks after coupons, so a redemption won't be valid in the case of a dynamic location change.

In the picture, you can see the Nike offer of free shipping. This promotion is based on a delivery address which is verified during a purchase.

Summary

The concept of geolocation, supported by appropriate tools, is like holding a key that opens the door to great opportunities. The growing popularity of mobile devices has made geolocation an essential component of any marketing strategy. For the retail trade, it’s a chance to develop sales and catch customers with their wallets in their hands.


Building an online marketplace from scratch - introduction

A recent Gartner study on digital commerce trends highlights Marketplace Integration and API-Based Commerce Platforms. The paper brackets these concepts with “strategies intended to help them [digital commerce companies] nimbly provide an outstanding experience without significant upfront investment.” Sounds serious, right? If you run an ecommerce business and you’re curious how these strategies can be translated into technology and operations, you’ve come to the right place.

Provide an outstanding customer experience without significant upfront investment

With this series of posts, we’d like to focus on the tech point of view and show you how to map the high-level concepts into a working software architecture. We’ll try to do this by designing an online marketplace architecture from scratch.

Let’s start with some definitions

Our working definition: an online marketplace is a virtual place where sellers and buyers meet to exchange goods or services. The exchange usually takes the form of transactions managed by the marketplace operator.

The operator measures the overall success of the marketplace by the number of deals between buyers and sellers, because they typically get a commission on every transaction. So, to generate revenue, it is the operator’s role to build a friendly environment which encourages:

  • Sellers to exhibit
  • Buyers to… buy

(usually in this order). 

There are many niches you can target with an online marketplace. On top of this, there are many types of marketplace. So far, the market has come up with the B2B, B2C, service, and C2C types. Yet the core mechanics of any marketplace platform are pretty much the same. Paraphrasing Mirakl, a well-oiled marketplace should provide these features:

  • a promise of ever-growing traffic - to attract new sellers,
  • a product catalog which stands out - and generates the traffic,
  • buyer-seller-operator communication and sales tooling - to spark confidence in buying,
  • a legal framework and transparent payment system - to finally close the sale, guarantee the quality, make buyers come back, and start the cycle again

As you can imagine, these essentials have a huge impact on the design of the software the platform runs on. What’s more, they often change over time. As much as we’d like to, we cannot cover all the cases up front. This is why, before we dig into the bloody details of the platform architecture, let’s talk a little bit about the context and key assumptions.

 

Assumptions

In this story, we assume we’ll be designing an early-stage marketplace. Moreover, it’ll be a company which starts in a competitive market. You can call it a startup.

These two assumptions imply that the venture’s business model won’t be battle-tested. And this, in turn, entails particular requirements for the software underneath. In other words, to get a good time-to-market, the platform should be:

  • Open to quickly handle new, unexpected business scenarios 
  • Super easy to change on a daily basis
  • Ready to onboard and offboard developers fast 

But it should also give the ability to:

  • Learn from data
  • Reasonably scale when unexpected traffic hits the platform
  • Connect different departments 

Also, the article title itself introduces a serious assumption: we want to build marketplace software from scratch. But why, and what does “from scratch” mean, after all?

“From scratch” redefined

Usually, when I hear this term, something positive pops into my head. Why? Because I immediately imagine a greenfield project with the freedom to choose the software practices and toolkit. I assume that having this kind of freedom should let me build the right software the right way. 

Build for today, design for tomorrow

But isn’t this a contradiction of the speed of development we assumed to be our primary concern? Is implementing every part of a marketplace on your own doomed to be slow? We don’t really know. It depends on so many factors that we don’t want to jump to any conclusions.

What we’d like to lay out instead is a different approach to building ecommerce software. We’ve seen this successfully deployed in a fast-growing Internet business twice; actually, we’ve been a part of both these projects.

It treats the speed of development requirement as a priority and, at the same time, gives the dev team a sufficient degree of freedom to roll out their own software architecture practices.

What we suggest is to assemble an early stage marketplace platform by mixing modern yet proven API-first platforms. We’ll focus in particular on: 

  • CRM and other sales support tools
  • Marketing & e-commerce APIs
  • Customer service SaaS

In the next posts, we’re going to outline how to actually create such a setup and how to keep it up and running until the business becomes mature enough to rethink and maybe insource some parts of the system. In particular, we’re going to describe:

  • overall architecture, 
  • integration best practices,
  • implementation tips,
  • and finally, the advantages and drawbacks of the 3rd party API platforms

All of these in light of improving time-to-market for online marketplaces.

Last but not least, while this setup has proven to be effective for us, we’ve also learned its downsides. We’re happy to share them too.

Unfortunately, we didn’t come up with any fancy name for this setup. If you have an idea, let us know via @voucherifyio. However, we randomly scattered some cheesy-yet-hopefully-accurate buzzwords all over the article to catch the attention of people who just scroll.

Agile model driven development: the key to scaling agile software development

 

What is it that you actually want to build?

Without further ado, we’re gonna build a marketplace that matches hardware designers with hardware manufacturers. It’s like tindie.com/biz, but instead of being a recommendation board, it’s gonna oversee the transactions. Let’s call our platform Manufaktura (the Polish word for “craft production”).

The business model goes as follows: imagine you have designed a prototype, your Kickstarter campaign worked well, and now you’re looking for a trusted supplier. This is when Manufaktura comes into play. You describe your needs and submit a request, and manufacturers can then place their bids.

Besides this, management has noticed another related niche. They predict the platform could act as an agency for recurrent support & maintenance contracts. This isn’t something to consider for the first stage, but something to keep in the back of your head.

There’s one more thing to keep in mind. The business starts amid tight competition. There are already two other companies that want to conquer a similar niche. We have to hurry, because chances are the market leader will take it all. So, let’s get down to business.

Disclaimer: There are many unknowns when building a marketplace like this one. Figuring out “the right way” requires a helluva lot of work. So, we have no idea if this business could be a sustainable one and we’re far from claiming that any of our business assumptions would hold true. Nonetheless, we’d like to show you how Manufaktura could work and overcome some made-up-but-potential business hurdles. 

Components of the marketplace

Marketplaces are complex beasts. There are plenty of things you have to think of. To make matters worse, the individual parts of the platform are often interwoven, and it’s hard to pinpoint clear-cut components. To make matters even worse, naming things is hard. So please bear with us when some categories don’t correspond with the conventions you’ve learned.

We’ve discerned the following pillars. This breakdown will be mapped into the upcoming posts:

  • Order management - how to handle orders through the platform in a way which allows multiple people/departments to collaborate
  • Platform infrastructure & monitoring & recovery - deploying monitoring and alerting measures, error handling guidelines, manual and automatic recovery
  • Email & SMS marketing - the use of email and SMS APIs in marketing channels
  • Payments & invoices - different payment options and gateways, recurring payments, billing and invoice operations
  • Customer service - creating email campaigns for customer onboarding, organizing support and case management
  • Customer tracking - how to track marketing channels and users online, A/B testing
  • Reporting - collecting and visualizing data
  • Promotions - how to generate more traffic and re-engage existing customers
  • Shipping - automating shipping
  • Partners recruitment - how to organize screening, onboarding, and ranking of suppliers on the platform
  • Inventory management - controlling the product catalog, keeping product descriptions up-to-date
  • Territory management - how to ensure i18n, geolocation, multiple currencies etc. before launching another country 

And to give you a sneak peek of the infrastructure, we can say we’ll be using lots of SaaS/API platforms and a small backoffice system hosted on Heroku. The overall architecture looks as follows:


The first post

The first post won’t be about preparing a developer environment. No Jenkins, no Docker, no AWS, no DB schema design at this time. There’s no time for those things at the moment. The Excel orders sheet is becoming harder and harder to maintain and is delaying Manufaktura’s bid process. Let’s replace it with a proper ordering system!