A couple of months ago, we successfully migrated a large part of our infrastructure from Heroku to AWS. Now that the dust (or should I say the cloud) has settled, we’d like to share the main drivers behind our decision and how we approached the transfer without stopping the Voucherify API, even for a minute.
To better understand our reasoning here, let’s take a quick look at what Voucherify is and what the architecture looks like.
Voucherify offers programmable building blocks for coupon, referral, and loyalty campaigns. It’s an API-first platform that devs can use to build complex, personalized promotional campaigns, such as emailing a customer a specific coupon code when they join a “premium” segment. It also allows companies to track coupon redemptions to figure out which promotions work best. Lastly, it provides a dashboard for marketers, which takes the burden of maintaining and monitoring promotions off developers’ shoulders.
The platform consists basically of 3 components:
- Core application exposing the API
- Website serving the dashboard
- Supporting microservices for non-API related jobs
For data storage, we rely on a trio of Postgres, MongoDB, and Redis.
This is how it looks after the migration:
We serve over 100 customers, who send a couple of million API calls per month, including both regular requests and more resource-intensive ones like bulk imports/exports or syncs with third-party integrations.
Why Heroku in the first place?
Heroku was a perfect solution when we kicked off Voucherify in 2015. It gave us cost-effective hosting and a fantastic continuous deployment workflow. Anyone who has used Heroku knows how simple it is to integrate with GitHub and how fast you can deploy. On top of that, Heroku is well documented and the community is pretty vibrant.
All of this allows you to focus on iterating on your product without having to assign a dedicated person to devops for quite a long time; in our case, it was ~16 months. Hosting on Heroku is all about speed. You just build, ship, and scale without the bother of infrastructure (deployment scripts, scaling, or security). The speed also shows in the low latency ensured by Heroku’s data centers located around the world. This is super important for us because the main priority for our API-first platform is developer experience - and I don’t know any dev who’s happy with sluggish responses.
Heroku was fine, but our platform started to grow more dynamically. New enterprise customers wanted to make tens of thousands of API calls per day. Our guerrilla approach of scaling up the dynos was becoming more and more costly, and we knew it wouldn’t scale financially in the long run. As a bootstrapped business, we had to react.
This is how our monthly billing looked:
- API services - $750 (only dynos for handling traffic, no databases)
- Web dashboard - $50
- Each additional dyno for handling extra API traffic - at least $50
- For massive operations (imports, exports), we needed even bigger containers with extra memory, which pushed the cost up to $250 per web dyno
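To see how quickly these numbers add up, here is a minimal Python sketch that turns the list above into a lower-bound monthly estimate. The dollar figures come from the list; the dyno counts in the example scenarios are hypothetical, and the sketch assumes each large dyno for bulk operations is billed at the full $250.

```python
# Monthly figures quoted in the billing list (USD).
API_DYNOS = 750        # dynos handling API traffic, databases excluded
DASHBOARD = 50         # web dashboard
EXTRA_DYNO_MIN = 50    # lower bound for each additional API dyno
BULK_DYNO = 250        # larger web dyno used for imports/exports (assumed additive)


def monthly_bill(extra_dynos=0, bulk_dynos=0):
    """Return a lower-bound monthly bill in USD for the given dyno counts."""
    if extra_dynos < 0 or bulk_dynos < 0:
        raise ValueError("dyno counts must be non-negative")
    return (API_DYNOS + DASHBOARD
            + extra_dynos * EXTRA_DYNO_MIN
            + bulk_dynos * BULK_DYNO)


if __name__ == "__main__":
    # Hypothetical growth scenarios, not actual invoices.
    for extra, bulk in [(0, 0), (2, 1), (4, 2)]:
        total = monthly_bill(extra, bulk)
        print(f"{extra} extra + {bulk} bulk dynos: at least ${total}/month")
```

Every new traffic spike or bulk job adds another fixed-size step to the bill, which is the pattern that made us uneasy.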
But the pricing wasn’t the only reason we walked away from Heroku.
Firstly, we started facing strange, hard-to-debug infrastructure problems. A couple of times we noticed that our platform was misbehaving and responding non-deterministically. Off we went to hunt for the cause, only to find out (much, much later) that the problem lay with Heroku; the status page just wasn’t updated right away. Sometimes they reacted quickly, and sometimes we had to work around the problem ourselves because the fix took several hours.
Secondly, Heroku (or, I guess, any other PaaS) is poor at resource utilization. Heroku plans come with a fixed resource structure per machine. As much as we understand that such a resource-limiting policy is important and justified, every application is different and needs its own CPU/memory profile. In effect, when we upgraded the plan for more memory, we also paid for additional CPU power we didn’t use. This gets more visible and more costly as you scale your app. Let’s take a look at our case. Tom, our infrastructure engineer, says:
“Now, for exactly the same monthly price ($750 - only services, without databases), we have no trouble managing bulk operations because we utilize resources properly. Moreover, we can handle 600% more traffic.”
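The bundling problem behind this can be illustrated with a small Python sketch. The plan tiers below are made up for the example (they are not Heroku’s actual plans or prices), and the names `Plan`, `cheapest_fitting_plan`, and `unused_cpu_fraction` are hypothetical. The sketch picks the cheapest tier that covers both CPU and memory, then reports how much of the purchased CPU sits idle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    cpu_shares: float
    memory_mb: int
    price_usd: int


# Made-up tiers for illustration only; not real plans or prices.
PLANS = [
    Plan("small", 1.0, 512, 25),
    Plan("medium", 2.0, 1024, 50),
    Plan("large", 8.0, 2560, 250),
]


def cheapest_fitting_plan(cpu_needed, memory_needed_mb, plans=PLANS):
    """Return the cheapest plan that satisfies both CPU and memory needs."""
    fitting = [p for p in plans
               if p.cpu_shares >= cpu_needed and p.memory_mb >= memory_needed_mb]
    if not fitting:
        raise ValueError("no plan satisfies the requested profile")
    return min(fitting, key=lambda p: p.price_usd)


def unused_cpu_fraction(plan, cpu_needed):
    """Fraction of the plan's CPU that is paid for but not needed."""
    return 1 - cpu_needed / plan.cpu_shares


if __name__ == "__main__":
    # A memory-hungry bulk import that needs little CPU (hypothetical profile).
    plan = cheapest_fitting_plan(cpu_needed=1.0, memory_needed_mb=2048)
    waste = unused_cpu_fraction(plan, cpu_needed=1.0)
    print(f"{plan.name}: ${plan.price_usd}, {waste:.0%} of CPU unused")
```

With container resources sized per workload instead of per tier, memory and CPU can be requested independently, which is what lets the same budget go further.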
Thirdly, the lack of a dedicated IP address. We received notifications that our application was generating spam traffic, which was not the case. Moreover, some of our enterprise clients have security policies controlling outbound traffic, and they asked us for an IP address to add to their firewall. With Heroku, we couldn’t satisfy this requirement.
Lastly, limited add-ons. Some Heroku add-ons are not compatible with recent versions of the software Voucherify integrates with. For example, the Compose add-on is only available in version 2.6.
All in all, Heroku became both more money- and time-consuming.
AWS to the rescue
Why AWS? Well, I guess it’s a somewhat disappointing answer, but it’s the most popular cloud provider out there and we have significant experience with it from previous projects. We didn’t do any thorough research. Plus, our database instances were already hosted on AWS in the same regions.
Migration process - or how to keep all systems 100% operational
Our API handles hundreds of thousands of coupon redemption requests from all over the globe. If the API goes down, somewhere in the world a bunch of folks get embarrassed and then frustrated when their latte discount becomes invalid right at the checkout. Or somebody can’t use a birthday gift card to pay for that new drone they’ve long dreamed of. Such unpleasant cases quickly escalate to our customers, who get disappointed, and that’s the last thing we want.
This is why we came up with a step-by-step availability-first migration strategy. Here’s how we did it.
- We started by migrating a single application with a bash script. But once we got into the details (dependencies, security patches, etc.), the script became hard to maintain, so we turned to containers. As our API runs on a fleet of servers and requires load balancing, auto-scaling, and failover (Heroku gives you this out of the box), we tapped into Kubernetes to orchestrate it. It seems to have the biggest community compared to Mesos and Docker Swarm.
- Kubernetes supports Google Cloud Platform out of the box, but that’s not the case for AWS, which is why we used StackPoint to make the k8s configuration easier. This way, we didn’t have to spend lots of time configuring the cluster on AWS. The best thing about StackPoint is that they offer a flat rate regardless of the number of nodes. An additional benefit of using k8s is that we can move the cluster to another cloud provider when needed. This is a surprisingly valid point, because we’ve recently been asked for an option to host specifically outside AWS. Finally, with this setup, we can create a cluster in another region within 15 minutes.
- Introducing a k8s monitoring platform: Prometheus + Grafana. We have an article describing this setup in the queue; subscribe to our blog to get notified.
- Calculating the current resource utilization on Heroku to set a reference point for container resources and to select proper worker nodes on EC2
- Incrementally re-routing the traffic with Route53:
  - 10% AWS - 90% Heroku
  - 25% AWS - 75% Heroku
  - 50% AWS - 50% Heroku
  - 100% AWS
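The incremental re-routing step can be sketched in Python as follows. Route 53 weighted records each receive a share of traffic equal to their weight divided by the sum of all weights, with weights between 0 and 255. The record name, target hostnames, the CNAME record type, and the helper names below are assumptions made for the example, not a description of our actual DNS setup. The sketch only builds the change payloads; it does not call AWS.

```python
# Rollout stages from the list above: (AWS share, Heroku share) in percent.
STAGES = [(10, 90), (25, 75), (50, 50), (100, 0)]

# Hypothetical names; replace with real record and target hostnames.
RECORD_NAME = "api.example.com."
TARGETS = {
    "aws": "k8s-ingress.example.net",
    "heroku": "example-app.herokuapp.com",
}


def traffic_share(weights):
    """Each record gets weight / sum(weights) of the DNS responses."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


def change_batch(aws_weight, heroku_weight, ttl=60):
    """Build a ChangeBatch payload with one weighted CNAME per backend."""
    weights = {"aws": aws_weight, "heroku": heroku_weight}
    for w in weights.values():
        if not 0 <= w <= 255:
            raise ValueError("Route 53 weights must be between 0 and 255")
    return {
        "Comment": f"shift traffic: {aws_weight} AWS / {heroku_weight} Heroku",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "SetIdentifier": backend,  # distinguishes weighted records
                    "Weight": weight,
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": TARGETS[backend]}],
                },
            }
            for backend, weight in weights.items()
        ],
    }


if __name__ == "__main__":
    for aws, heroku in STAGES:
        share = traffic_share({"aws": aws, "heroku": heroku})
        print(f"AWS {share['aws']:.0%} / Heroku {share['heroku']:.0%}")
```

Each payload could be passed as the `ChangeBatch` argument of boto3’s Route 53 `change_resource_record_sets` call, together with a hosted zone ID. Because resolvers cache answers for the TTL, the real split only approximates the weights, so it makes sense to watch error rates and latency for a while before moving to the next stage.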
In the end, we got a more predictable platform and future-proofed it against growing traffic for the next couple of months. We still use Heroku for a couple of services, like our dashboard, because it’s easier to deploy there and the dashboard doesn’t need that many resources after all. Heroku Connect is worth noting too: we love it because it reduces the effort of syncing with Salesforce.
Lastly, we also migrated our Postgres instance away from Heroku. Doing this without stopping production required some interesting tricks. We’ll describe them in the next post. Stay tuned!