How We Moved From MongoDB to Postgres Without Downtime and Cut Our Costs by 30%?

‍Voucherify was born in 2015 as a weekend hackathon project run by our small-scale software house, rspective. Initially, it was backed up by a MongoDB database. Truth be told, this choice was random – it was the most common database we used in our projects. We already had some experience with it so Mongo was a pretty natural component at that stage. However, as Voucherify's scale grew, we’ve added a second database – PostgreSQL – that seemed to be more suitable for the upcoming features. Then for some time we kept part of our data in Mongo, and the other part in Postgres, until the day we decided to move it all to Postgres, hosted on the Amazon RDS with the rest of our infrastructure.

Want to work in a bootstrapped SaaS company?

Join us

When we started out, we already had around five years of data collected, spread around multiple database instances, located on three continents, each dedicated to a different Voucherify cluster. Millions of voucher codes that could be updated anytime. Around a terabyte of constantly changing data. And to make matters worse, a lot of code had to be prepared for the upcoming breaking change. If presented on a timeline, we spent three months rewriting and testing new code and next three months migrating all the data.

Why then go through all this trouble? We had two valid reasons to do that.

First of all, as you can easily imagine, maintaining two different database types creates a cascade effect of doubled codebase, paradigms and concepts you have to keep in mind while adding new features. It was also the source of problems with the initial setup followed by issues popping up randomly (usually, on Friday afternoons). If one of them looks redundant, then all these issues sum up and cause tension and frustration in the engineering team.

Secondly, Compose – a SaaS platform serving MongoDB that we were using, was very expensive in comparison to alternatives. It became a significant percentage of our monthly expenses. Additionally, we were not satisfied with the quality of the support we got. Sometimes response delays could be as long as several days. In some cases, the only offered solution was to restart the database, with no good explanation of why the weird stuff happened in the first place or whether they plan to fix that in the future.

To succeed with the migration and maintain platform stability when the traffic was high, we split it into a couple of tasks – each corresponding to a different entity. Most of them were easy migrations of relatively small chunks of data that was updated rarely. Each of these tasks has its own story, but this article is going to tell the story of the last task. It was about two core entities – vouchers and campaigns – that serve as primary objects in Voucherify API. As you can imagine, these were kept in Mongo for the longest. You could say that the core of our system was built around the database that had to be replaced.

We used AWS’s Database Migration Service to help us with migration. The primary motivation was to reduce the time of preparing the setup for the migration tool, by relying on a SaaS solution tested by hundreds of developers already.

Here is what we did:

Sanity check of data in MongoDB.
Creating child tables with Postgres Inheritance.
Applying transform triggers.
Running Amazon Database Migration Service scripts.
Sanity check of data in child tables.
Moving “deleted” data from child to parent table.
Switching applications logic to use Postgres.
Anomaly detection.
Moving “active’ data from child to parent table.
Stopping and removing DMS tasks.

The Migration Beginning

First, we had to find out which database we should choose. This migration was different from the previous ones, so we were still struggling with the question if Postgres is the best choice. After dropping a few options, we took a closer look at DocumentDB and PostgreSQL from the AWS environment. DocumentDB is a non-relational database released in January 2019 used for storing JSON documents. Its mission is to be compatible with the MongoDB API to some extent and it is fully shifted to the AWS team, which translates to scalability and availability at a reasonable price.

Postgres had the obvious advantage of unifying the underlying technologies. As a result, buying DocumentDB would be an additional expense when compared to reusing an existing database. By design, DocumentDB does not support all MongoDB commands either, even with its greater scalability or availability. However, the supported features should be good enough for the majority of cases.

Therefore we could expect no code changes for the migration only if we didn’t play around with exotic queries too much. Our estimation was that we had around 5-10 places in the code that use some not-supported Mongo features. Doing some additional (usually small) migration of schema, like additional flags to flatten the state, were usually the solution to such problems. In the next step, such flags would let us simplify more sophisticated queries. However, such changes require adding many small adjustments across the whole application in the first place.

After finally gathering all cost predictions, it became clear that if we manage to reuse current database instances, maybe enhancing them slightly, we are still better off in comparison to other options. Therefore, after all, the choice was not that hard and when the decision was made the only question left was how we are going to migrate that huge amount of data.

To answer this question we did not dive into vast research about available migration tools. What we did instead was to check out the possibilities of AWS’s Database Migration Service, and see how far we can go with that idea. We wanted to reduce the time of preparing the setup for the migration tool, by relying on a SaaS solution tested by hundreds of developers already.‍

Database Migration Service‍

DMS is a great tool which can be used to migrate data between different types of databases – including Mongo to Postgres transfer. After turning it on, you get an underlying EC2 instance with ready-to-use software for migrating your data. To be able to keep your data in sync, even during long-running migrations, you can launch it in Multi-AZ mode, which takes just one click. With DMS, you also get a decent migration monitoring process out-of-the-box. Database Migration Service provides several abstractions, out of which three seem to be essential. So endpoints, replication instances, and database migration tasks.

DMS Pros

Starting from the last, a task is a description of a particular copy-paste job executed once and/or as an ongoing replication. It holds filtering and transformation rules together with some additional options.

Replication instances are self-explanatory, but to be clear, these are lists of EC2 instances that we can choose to be used in a particular migration task. After AWS admin creates an instance with desired RAM and CPU resources, the payment counter is triggered. Yet, all in all, the overall cost of using DMS in our case was marginable. But, if you really want to know, then yeah... we spent maybe like 200$ in total for this tool. But compared to the predicted savings it was like a drop in the ocean.

Last but not least, an endpoint is a description of how to connect to a particular database. Besides some obvious parameters, it includes the intent of direction it will be used, meaning if that’s a source or rather a target in the migration process. Once you have a replication instance running you can use it to test the connection for defined endpoints. When you create a DMS replication instance in the same VPC where your source/target endpoint is, then you are one step ahead because you don’t have to make your database public on the internet for the migration time. All in all this made DMS pretty easy to use in our case.

DMS Cons

First of all, its GUI isn’t perfect. There are two modes – graphical and json – where the first does not support all the features of the json mode. So going through the documentation is necessary to understand all of the possible filters and transformations. However, after we managed to set up the migration process using the json mode, it became way better to use DMS that way. The trick was to generate, using simple bash scripts, big JSONs with precise task descriptions that we simply pasted to the DMS website’s text area as a whole.

Another thing that we didn’t like about DMS was that when a new migration job was created the option to drop the destination table in the target database at the beginning was preselected by default. For sure there are cases when such behaviour is desirable, but why is the most dangerous option a default one? We planned to migrate the data project by project in a very controllable way, so we envisioned hundreds of migration tasks to be created. Having this constant risk of deleting all the production data by a simple mistake pushed us towards a safer, but much harder to achieve variant of the migration process. Our solution was to migrate the data in two steps – first to temporary tables that could be wiped by the DMS accidentally, and then to the destination production tables.

Before we were sure that DMS was an acceptable tool for this job, we still had to overcome many technical obstacles. Initially we planned to do two rounds of the migration, one for campaigns and second for vouchers. We imagined that each round would have only two steps for each Project (workspace in Voucherify) – run DMS with ongoing replication and switch a boolean flag in the project's config. As you can guess, it became much more complex. Let’s dive briskly into the story.

Preparations
Sanity Checks
The Big Picture
The Main Meal
Cloning Data
Process Overview
One More Issue
The Bottom Line‍

Preparations

First, we needed to clarify precisely what was the setup for each project before the migration. From the DMS task’s point of view, each project is basically a list of campaigns, and a number of vouchers that are standalone or belong to a campaign. Doable? Sure, but if you multiply that by hundreds of projects it becomes clear that you cannot type all of these settings by hand.

Before such big tasks, it’s always good to clear the code from some leftover logic to simplify the migration a little bit. In our case, this was for instance checking if all Campaign and Voucher fields are still in usage. Dropping old code is the easiest part of the process that can shorten its overall time. Unfortunately, we didn’t have any work to do here.

This allowed us to have pretty straightforward core migration scripts, which was definitely one of the goals. To save time in further steps, we made some assumptions while describing the valid model of each entity type. As you probably know, MongoDB is a document database, which means that it is schemaless. If you use it, like we did, to store data that has a schema, then you not only shift the responsibility for keeping data in good state to the application level, but you still have to expect that some part of your data will be dirty when migration comes. And yes, as we’ll see later on, data in Mongo can be dirty in many ways (or elastic, depending on the point of view). With these assumptions in mind, we were performing appropriate sanity checks before the migration of each project to find the polluted entries. This allowed us to fix corrupted entries quickly in isolation and perform the core migration in peace. Short disclaimer – a part of these assumptions could have been checked much faster in the destination database, therefore in our case it made sense to have two phases of sanity checks – initial one in Mongo and the second in Postgres.‍

Sanity Checks‍

We performed these sanity checks while data was still in temporary tables, to which DMS copied the data. More details will come, but I can mention already that we were creating short-term destination tables for each MongoDB collection, and when data was loaded we merged these temp child tables one-by-one into their parents.

To have a better insight into what type of checks we did, let’s explore the first piece of code that we used to find if entities are free from two simple types of errors:

{{CODE}}
db["vouchers-TENANT-PROJECT"].count({$or: [{ type: { $exists: false } }, { deleted_at: { $exists: true } }]})

+db["campaigns-TENANT-PROJECT"].count({$or: [{ campaign_type: { $exists: false } }, { deleted_at: { $exists: true } }]})+

The first thing we wanted to find was if one of the old migrations of voucher and campaign types is fully done. We decided to have these types with a NON NULL constraint in the final SQL form, so all entries must have been equipped with some value before the migration. It was also possible to check and fix this issue also in the final database, but it was easier to cut the problem in the roots and forget about it once for all.

The second thing we checked was if there exist any entries with an old way of naming the variable keeping the deletion time – deleted_at. Currently we had an alternative camel-case naming deletedAt, and it occurred later that for the sake of a simple migration script it’s better to clean the old data first. So if the total returned from the above query was non-zero, we listed all wrong entries and either fixed or removed them in justified cases.

We performed the second round of sanity checks right after the data landed in the Postgres database. Let’s jump a few steps ahead and quickly describe what was checked in this phase. Firstly, we searched for entries corrupted internally using the following query:

--- wrong vouchers

SELECT * FROM voucherify.vouchers_migration WHERE

...

OR id = 'MISSING'

OR discount IS NULL

OR discount = 'null'::jsonb

OR publish ->> 'count' LIKE '%.%'

OR (jsonb_array_length(publish->'entries') > 0 AND publish::text LIKE '%$date%')

You can see a part of the checks that were performed on the table holding the vouchers. This is the most interesting piece, so the rest of this query is cut here. The same kind of query was performed on the campaign's data right after that. Whenever some interesting findings came up we fixed them in Mongo, so that the effect of this action was preserved if the migration had to be aborted for some reason.

Check Anatomy

Let’s explore the details of the piece of code above. The first line will be hard to grasp at the moment yet. We will get back to it when describing the core migration script. The next two lines compared the ‘discount’ field with SQL-like ‘NULL’ and json-like ‘null’ values. These two types of NULLs were the first issue we faced. Both the old and the new code covered such a possibility nicely in the GET responses, but we wanted to be sure that data was fully complete after it was pushed through the migration script. SQL’s NULL value was expected for the missing data, however json null was a bit surprising. We didn’t have a code that could set a null value at that point, so maybe we had a code like that some time ago or these null values were a result of some old manual actions. Anyway, reading and parsing corrupted data like that worked well, however a jsonb concatenation ‘||’ applied on a null stored inside a PostgreSQL field resulted in a nasty error. So it’s good to be cautious about this little quirk.

The next line regarding the count of the publications shows how we checked if any integer-like fields were not stored in Mongo as a string value in a float-like form. For instance instead of expected 1, we got “1.0000.” Maybe we wouldn’t even notice that if we didn’t have plenty of (CAST .. AS INT) SQL transformation already in the code. That combination caused unexpected errors after migrating a couple of old test projects, so you can call it a second quirk we found. There were very few cases like that so we fixed all of them manually. These two problems – “1.0000” and json-like null values – were addressed by us later on in the migration script, but we left all these checks here to double check if all is fine.

The last check “(jsonb_array_length(publish->'entries') > 0 AND publish::text LIKE '%$date%')” was meant to find vouchers with publication entries stored in an invalid way. At that point, our code was already ignoring the data stored there, except one query that checked the total count of these entries. Therefore, we decided to migrate vouchers with all these publication entries, and fix this issue later on after we had all the data in one database. But in the first place we had to fix a nasty bug that was captured after the first manual tests. In rare cases dates were stored in MongoDB as ISODates, which are transferred by the DMS as json objects with one $date field containing a numeric timestamp value. Even though we didn’t really use this data anymore, our ORM system was still parsing it and obviously failed to read dates in this format. Same as before, there were not many cases like that, so manual fixes were most effective.

Besides checking internal vouchers and campaigns data, we also checked all kinds of relations at this step. That wasn’t possible back when vouchers and campaigns were in one database and the other part in another. That’s because a script for that would run for a very long time and most probably would give us a lot of false positives. So, we verified the vouchers count inside each campaign. This check was possible before the migration using a pure MongoDB script, but it was much easier to write it in the SQL version. Also SQL guaranteed that counting vouchers was not affected by parallel operations, like adding or removing vouchers, so comparing it with campaign’s ‘vouchers_count’ always gave reliable results. We also inspected the total of redemptions and publications stored in vouchers data. That kind of check was doable only after all data was inside one database, and even then it took a significant amount of time.‍

The Big Picture‍

For each project, there were two Mongo collections to be migrated. We decided that each collection will be transferred to a temporary table to limit the risk that one of the DMS jobs will drop a target table full of active campaigns or vouchers. While preparing for the migration, the choice whether to migrate to final or temporary tables was crucial. We went for the option that seemed the safest, if for instance we make a mistake while setting up the DMS task. This was unlikely to happen, but we wanted to use the safest path and see if we can stand the burden that will be put on us.

Additionally, as you will see later on we needed extra fields to finalize the migration, and having them in the final table would add a risk of returning this data through the API, or storing it in the system events data. We would need to prepare the code for such risk, but we still could have missed something. Therefore having child tables extending the model gave us absolute guarantee that this won’t happen. Soon you will see that some of the problems that we faced wouldn’t exist if we chose to migrate the data directly to the production tables. It’s hard to say how the other path would end, yet this article will present at least one side of the coin. From the time perspective, I can safely say that the road we settled on was the right one as it brought us the best performance and the expected results.

A Word or Two About Tables

We decided to create new temporary tables for each Mongo collection, and somehow safely merge them with production tables in the next steps. The PostgreSQL database has a nice feature that helped us here called table inheritance. It allowed us to join two tables together in a hierarchical order to obtain a parent table having multiple independent child tables. Each child table stores data and indexes separately, but when you read from the parent table you get the results aggregated from all engaged tables as if there was only one logical table. The obvious condition to join two tables in such inheritance relation is that the child tables must possess all the columns of a parent table, yet they can have more in the same sense as inheritance works in any programming language. These columns cannot be accessed while reading data through the parent table, but they can be useful in many ways. In our migration process we used these additional columns to store MongoDB raw data and _id of each object.

Let me stress once again that each child table holds a separate data with separate indexes. It’s especially important for unique indexes, because it’s possible to obtain the seemingly impossible result while reading from the parent table. For example, despite the unique index for voucher codes it would be possible to receive two vouchers with the same codes as a query result. That would break basic assumptions that our application was built upon, therefore it was critical to take this scenario into account.

Another less significant advantage of the inheritance was visible in a simple script for creating child tables. That’s because all parent’s columns are added for you if you command Postgres to create such a relationship in one call. Let’s see how this looks in an example:

CREATE TABLE

voucherify."vouchers-test@voucherify.io-proj_f1r5Tpr0J3Ct"

(_id varchar(200), _doc text) INHERITS (voucherify.vouchers);

Here we create a child table called vouchers-test@voucherify.io-proj_f1r5Tpr0J3Ct that inherits from a parent table vouchers, and has two additional columns representing the Mongodb _id and the entire json document (_doc) stored as a text. During the migration we created two child tables for each project for campaigns and vouchers. When the migration was finalized for a particular project and child tables were empty, we dropped them entirely to spare the CPU for the VACUUM process in the database. Let’s take a look at the diagram of the migration process described so far:

The diagram shows how migration looked from perspective. Above, servers running Voucherify have connection pools to both Mongodb and Postgres databases. Depending on the project's flags they read and write to one of them. Behind the scenes, the DMS task copies the data once, and then keeps running the ongoing replication for as long as we want. This way DMS was forwarding all updates made in the Mongodb collections, and our main job was to switch both reads and writes at the best possible moment.

It would also be possible to design it differently where our servers do double-writes in some time window to both an old and a new database, then we switch the reads, and disable double-writes in the end. However that approach would require much more time for code preparation. Also, it would still be possible to have a bug somewhere in the code for double-writes. Making the ongoing replication in the DMS fully work was a challenge, but we are glad it was not that hard at the end.

On the other hand, if we went for double-writes from within our code then in the case of any troubles it would be possible to stop the migration for a particular project to some late point, and switch everything back. In our model DMS does not keep state in sync in both directions, keeping the sync one-sided. We decided to make it a one-way-ticket migration with no going back. It might be possible to set up a parallel DMS task that replicates changes from target to the source, but then we would have new problems to solve, like how to stop the infinite cycle of transferring changes. In the end we didn’t find such an idea useful in our case. To minimize possible drawbacks we split the work by projects, and were applying fixes on the way so that nearly everything was cached by the time of migrating the most important projects.

Let’s catch a breather before getting to the real deal by quickly describing how we enabled the ongoing replication in the DMS. For the replication to work in a reasonable manner, a source database needs to expose some kind of a feed with a log of changes. In the case of MongoDB this feature is called a Replica Set Oplog. You can play around with oplog by yourself using commands like the following one:

db.oplog.rs.find({ ns:

"voucherify.vouchers-test@voucherify.io-proj_f1r5Tpr0J3Ct",

ts: { $gte: Timestamp(1575648464, 1) } })

First, manual tests with a temporary MongoDB instance served from one of our Kubernetes pods showed that DMS can realize the ongoing replication pretty well. Yet when testing the connection urls for the MongoDB from the Compose platform the feature didn’t work. In DMS this was marked as a partially done task, because just copying the data succeeded. Digging into DMS task logs (accessible after enabling CloudWatch feature for a task) confirmed our suspicions that there was a problem with accessing Mongodb’s changes feed. Fortunately, it’s possible to buy an addon that exposes an additional url for a MongoDB database that allows reads from an oplog. Again, the cost of this plugin was small compared to the overall cost of keeping a MongoDB.

The next big question that had to be answered was how to convert our data between two different database types. It was not a problem to join the Mongo’s schemaless and the SQL worlds, because we had the schema well visible in our classes in the code. Also, we decided to map object’s fields in a straight line with SQL columns without any weird modifications. So for instance, text fields converted to text columns and json objects converted to json or jsonb columns. After choosing MongoDB as a source database in the DMS endpoint, you will face a question which of the two metadata modes should be applied. AWS admin can select a table or a document mode at this step.

When working in the first mode, DMS will try to extract a schema from N documents, and will map data to appropriate table columns when linked with SQL database as a target database. The number N can be freely chosen by the Endpoint creator, and doesn’t seem to have an upper limit. This mode seemed to be a decent choice in our case at the first glance. However after some tests it turned out that the schema extracted by DMS didn’t match our intentions in the majority of cases. The main reason was that it flattens all nested jsons to separate columns, and we expected that they would be kept as separate jsons. Moreover, after DMS had the schema created, the task would skip new unmapped fields when facing an entry that was not scanned before. So it was not possible for us to estimate any good number for the N parameter, because we had to assume that our data can be very inconsistent. After a couple of tries it became clear that the document mode with a custom transformation script is the only option.‍

The Main Meal‍

All right! If you made it that far, then you must be hungry for some juicy code. Let’s start with a quick summary of what we know so far before diving into the technical core.

For each project, there were two Mongodb collections to be migrated using a DMS task. Each task copied whole data and was replicating the changes after that. So there was no need to add double-writes logic in our code, and we only had to prepare an alternative logic around the target database with two boolean flags enabling new logic for campaigns and vouchers separately. Yet it wasn’t clear at first when in the migration process they will be switched. Next, to make the migration as safe as possible we decided to transfer the data in two steps. First, each Mongodb collection would be copied to a corresponding temporary table, inheriting from the appropriate parent table. Next, we would move the data to parent tables and switch boolean flags in a vaguely known order. Finally, we managed to connect DMS with both source and target databases, having the MongoDB source running in a document mode. The main problem that we had to solve next, was about writing a mapper that links fields from source json objects to the appropriate SQL columns.

PostgreSQL database, same as many other contemporary SQL databases, has a nice feature called triggers. It gives a possibility to attach a previously registered, custom piece of code to be triggered on particular events for the specified database entities. Transformation functions called when inserting and updating rows in our child tables seemed like the perfect use case. The following lines show how we were plugging them in:

DROP TRIGGER IF EXISTS transform_voucher_trigger ON

voucherify.vouchers_migration;

CREATE TRIGGER transform_voucher_trigger BEFORE UPDATE OR

INSERT ON voucherify.vouchers_migration

FOR EACH ROW EXECUTE PROCEDURE

voucherify.transform_voucher('TENANT', 'PROJECT');

First, the script drops the trigger transform_voucher_trigger if it was applied already, and then creates one that is gonna be fired for the vouchers_migration table BEFORE UPDATE OR INSERT of any row. This means that there is a chance to alter the row data before each insert or update, which allows unpacking data to separate columns before constraints like NOT NULL are checked. We could therefore apply NOT NULL and UNIQUE constraints applied already on the child tables, and validate new data before moving it into the final table.

It’s important to notice that we can apply custom variables as method arguments when creating a trigger. This was crucial, because we needed to have a way to store the tenant and project ids inside each migrated entry. The problem was that this data was kept only inside the Mongodb collection name, not inside each entry. DMS has a plenty of transformations that can be used, but each of them works only within a particular abstraction level. For instance, you can alter a Mongodb collection name while matching it with a Postgres table or remap field/column names. But you cannot manipulate the value of a chosen column, like extracting it from a collection name. Fortunately for us, triggers in Postgres can have variables supplied, so the only problem to solve was how to do it precisely and quickly for each of thousands of projects that had to be migrated.

Now that we learned some triggers details, let’s see the full final core migration script:

CREATE OR REPLACE FUNCTION voucherify.transform_voucher() RETURNS TRIGGER AS $$

DECLARE

M_DATE timestamp with time zone;

PG_DATE timestamp with time zone;

DOC JSONB;

BEGIN

IF TG_OP = 'UPDATE' AND OLD._doc::text != NEW._doc::text THEN

M_DATE := GREATEST(to_timestamp((NEW._doc::jsonb->'updated_at'->>'$date')::bigint / 1000), (DOC->>'updated_at')::timestamp, to_timestamp((NEW._doc::jsonb->'deletedAt'->>'$date')::bigint / 1000), (DOC->>'deletedAt')::timestamp);

PG_DATE := GREATEST(OLD.updated_at, OLD.deleted_at);

IF M_DATE < PG_DATE THEN

RAISE WARNING 'VOUCHER ANOMALY DETECTED - ID: %, Incoming time: %, Stored time: %', NEW._id, M_DATE, PG_DATE;

RETURN NEW;

END IF;

IF NEW._doc IS NOT NULL AND (TG_OP = 'INSERT' OR OLD._doc::text != NEW._doc::text) THEN

DOC := NEW._doc::jsonb;

NEW.tenant_id = COALESCE(DOC->>'tenant_id', TG_ARGV[0]);

NEW.project_id = COALESCE(DOC->>'project_id', TG_ARGV[1]);

NEW.id = COALESCE(DOC->>'id', 'MISSING');

NEW.code = DOC->>'code';

NEW.deleted = COALESCE((DOC->>'deleted')::boolean, FALSE);

<...>

-- discount

NEW.discount = DOC->'discount';

IF jsonb_typeof(NEW.discount) = 'null' THEN

NEW.discount = NULL;

END IF;

-- publish

NEW.publish = COALESCE(DOC->'publish', '{}'::jsonb);

IF jsonb_typeof(NEW.publish) = 'null' THEN

NEW.publish = '{}'::jsonb;

END IF;

NEW.publish = NEW.publish || jsonb_build_object('count', COALESCE(FLOOR((NEW.publish->>'count')::float), 0));

-- validity_day_of_week

IF DOC->>'validity_day_of_week' IS NOT NULL THEN

NEW.validity_day_of_week = voucherify.json_arr2int_arr(DOC->'validity_day_of_week');

ELSE

NEW.validity_day_of_week = NULL;

END IF;

-- dates

NEW.created_at = to_timestamp(voucherify.hex_to_bigint(SUBSTR(DOC->'_id'->>'$oid', 1, 8)));

NEW.deleted_at = COALESCE(to_timestamp((DOC->'deletedAt'->>'$date')::bigint / 1000), (DOC->>'deletedAt')::timestamp);

END IF;

RETURN NEW;

END;

$$ language plpgsql;

The code piece can be a bit overwhelming at first glance, but it will become pretty straightforward once we split it into biteable chunks. The script consists of three parts: declaring variables, anomaly check, and the main transformation that fires when it’s an insert or an update coming from a DMS task. Guessing the meaning of each declared variable should be a piece of a cake, so let’s dive into the transformation logic.

The following check “IF NEW._doc IS NOT NULL AND (TG_OP = 'INSERT' OR OLD._doc::text != NEW._doc::text)” guarantees that any transformation happens only if it’s a row inserted or updated by a DMS task. That’s because NEW._doc could not be sent from our apps, because there was no such logic. We could in fact simplify it to just the first part of this condition, because we narrowed the scope of possible operations to INSERT and UPDATE when creating a trigger, but we preferred to have it double checked and called as rarely as possible.

In the next line, we parse _doc column to a binary JSON format and store it under the DOC variable. Following line “NEW.tenant_id = COALESCE(DOC->>'tenant_id', TG_ARGV[0]);” shows that we had two ways to deliver tenant and project ids. The first option was to update the data in Mongo first, and send it already equipped with these values. Or trigger arguments, described on previous pages. If neither of these was provided then the inserted document would be rejected by NOT NULL constraint, ending as a DMS task failure. Next line “NEW.id = COALESCE(DOC->>'id', 'MISSING');” sets the ID for the row for a proper one, or uses 'MISSING' as a fallback. We preferred to copy corrupted data and fix corrupted leftovers already in Postgres. The corresponding check “id = 'MISSING'” was fired in the --- wrong vouchers sanity check shown before. The line after that sets voucher code. We don’t have any fallback here, because it’s possible to have vouchers without the code internally in our system. The following line “NEW.deleted = COALESCE((DOC->>'deleted')::boolean, FALSE);” is responsible for setting a boolean flag for the deleted column. As you can see it provides a default FALSE value.

The next section of the presented piece of the migration script deals with a discount jsonb variable. Let’s see this fragment:

NEW.discount = DOC->'discount';

IF jsonb_typeof(NEW.discount) = 'null' THEN

NEW.discount = NULL;

END IF;

As you can see, to extract a part of the MongoDB document and store it in a json/jsonb column it is enough to use a single arrow -> notation, which does not parse to a string as the double arrow ->>. However, be cautious here! Missing value will give you SQL-like NULL value, but a json-like null value will get converted with no issues into a json-like null value in your SQL table. Although it is expected from the SQL side, it may cause your code to fail due to invalid transformations of such value. Normally, when writing SQL queries we make it safe only for a NULL case. And if we don’t exclude also the null values then using -> or ->> will cause errors in the production environment. The solution that we applied was to catch such null cases in the migration script and convert them to their SQL equivalents.

The section following the previous one also handles a jsonb publish field, but has some enhancements essential for our system. Let’s take a closer look:

NEW.publish = COALESCE(DOC->'publish', '{}'::jsonb);

IF jsonb_typeof(NEW.publish) = 'null' THEN

NEW.publish = '{}'::jsonb;

END IF;

NEW.publish = NEW.publish || jsonb_build_object('count', COALESCE(FLOOR((NEW.publish->>'count')::float), 0));

Here we additionally have a fallback '{}'::jsonb value. Simply put, it’s possible to have a NULL discount in our platform, but we prefer to always have at least a json with a zeroed count field in the publish column. Next comes the same check for null values as before, with just a different fallback value. The last line is either setting a default zero for the count variable, or parsing an existing value to an integer by rounding it down. We faced an issue, described before already, with some corrupted values like “1.00001” stored as string values under this field. This was working well in our Mongo-oriented code, but failed in the SQL version. We preferred to have clean data as a solution to this problem.

The following section for the validity_day_of_week field was used to convert a json-like array of integers into an SQL-like integers array:

IF DOC->>'validity_day_of_week' IS NOT NULL THEN

NEW.validity_day_of_week = voucherify.json_arr2int_arr(DOC->'validity_day_of_week');

ELSE

NEW.validity_day_of_week = NULL;

END IF;

This kind of transformation is not as straightforward as you may expect. Using a single arrow -> would leave an array as a json/jsonb value, which would not be assignable for this column. The trick was to use a voucherify.json_arr2int_arr helper function that we registered in Postgres together with the migration script. The content of this function will be presented in the appendix. The use of a double arrow in the “IF DOC->>'validity_day_of_week' IS NOT NULL” condition is more interesting here. When using a single arrow, the json-like null value would result in a value different than NULL causing error in the next line. Double arrow will flatten possible outcomes to a single one, making this piece of code safe again.

The last part of the migration script deals with three date variables – created_at, updated_at and deleted_at. Let’s see it again:

NEW.created_at = to_timestamp(voucherify.hex_to_bigint(SUBSTR(DOC->'_id'->>'$oid', 1, 8)));

NEW.updated_at = COALESCE(to_timestamp((DOC->'updated_at'->>'$date')::bigint / 1000), (DOC->>'updated_at')::timestamp);

NEW.deleted_at = COALESCE(to_timestamp((DOC->'deletedAt'->>'$date')::bigint / 1000), (DOC->>'deletedAt')::timestamp);

The first uses a kind of a magic trick. In the Mongo database, the first eight characters of the _id field are in fact the entity creation timestamp. It can be extracted by first converting these signs from the hexadecimal format to SQL’s big integer using the voucherify.hex_to_bigint helper function, and then to a timestamp using the built-in to_timestamp method. The _id field is not a string directly, but rather an object with a single string property $oid, that holds the desired text value.

The next two fields, that store the moments of the last update and of the entity deletion accordingly, have the same transformation logic. It covers three formats of how dates can be stored in Mongo, and converts them all into the SQL timestamp. The second part “(DOC->>'deletedAt')::timestamp” covers two cases – a timestamp as a number and an ISO formatted string, and the first part “to_timestamp((DOC->'updated_at'->>'$date')::bigint / 1000)” deals with a bit more exotic MongoDB ISODate form. The last one is received from the DMS task as an object with a single $date property holding a numeric timestamp. This timestamp also holds milliseconds, so we need to divide it by 1k to make it compatible with the to_timestamp method.

This almost ends the migration script topic if not for the anomaly detection logic. At the time of writing this script we were not sure how the migration procedure will look like in all the details, so we tried to make it more generic. The anomaly check was added to detect some nearly impossible scenarios that could happen at the moment of switching a project to the new logic.

Let’s imagine a situation where we have a specific voucher in MongoDB collection, containing a counter of redemptions equal to 1. This voucher is copied already to the Postgres by a DMS task, and a one-way ongoing replication is transferring changes applied in this entry. Next, imagine there are two very fast redemptions made one after the other, and we make a switch to new logic right at that moment. It could happen that the last change happened in the source database, like increasing the redemptions counter to 2, was then copied as a whole document to the target database. In our bad scenario, the second request to increment the counter is already received by the target db instance, and the counter there is equal to 2. After that comes, our document from the replication does not change the state anymore, and we are left with an invalid voucher’s state in the end. If we are able to make a switch in the moment of a very low project’s activity, then the chances for this scenario are super low, but we still prepared a piece of logic catching these rare cases. Better safe than sorry.

Let’s see the code responsible for the anomaly check:

IF TG_OP = 'UPDATE' AND OLD._doc::text != NEW._doc::text THEN

PG_DATE := GREATEST(OLD.updated_at, OLD.deleted_at);

IF M_DATE < PG_DATE THEN

RAISE WARNING 'VOUCHER ANOMALY DETECTED - ID: %, Incoming time: %, Stored time: %', NEW._id, M_DATE, PG_DATE;

RETURN NEW;

END IF;

Assuming that we will do a switch only once, no going back, then it’s enough to check only one side of the updates source. Updates coming from the target database are of no interest to us, because it’s a series of changes from the new logic after the switch is expected. On the other hand, what’s interesting is when an update coming from the DMS is received after some change was done already in its copy located in the target. Therefore, the check above fires only after the update was coming from the DMS, which is guaranteed in the line “IF TG_OP = 'UPDATE' AND OLD._doc::text != NEW._doc::text”. From the perspective of this chunk of code it is possible that an update is coming also from the new code logic running in our apps, but in that case OLD._doc::text and NEW._doc::text would be equal, because only DMS could alter the _doc field.

The next two lines gather last modification times from each side. First goes the time in MongoDB calculated as a GREATEST of the last update or delete. We treat time of entity deletion as a type of an update that must be compared with the other types as if that was the same operation. This first line is really simple, but looks a bit scary because there are more MongoDB datetime versions taken into account. These alternatives were described already, so there is no need to dig into that again. The second line does the same for the Postgres side, but it’s way simpler.

The final part of this check is comparing these two dates, raising a warning about a detected anomaly when the date coming from Mongo is younger than its Postgres version. To see these warnings we were monitoring the database logs directly in the RDS panel in the AWS website. It’s a little bit annoying that the preview there is split into rolling files, so we must be cautious to switch the view to the newest page after each hour. Fortunately, we didn’t have a single case cached by this mechanism, so we cannot say how to deal best with possible wrong cases. But it’s fair to say that each case would require a manual check and a fast decision what to do with such a corner case.‍

Cloning Data‍

There are two remaining technical pieces to be described yet. The first of them is the DMS task mentioned already several times. Let’s see how the task’s description looked liked in its JSON form:

‍
{

"rules": [

{

"rule-type": "transformation",

"rule-id": "1",

"rule-name": "1",

"rule-target": "table",

"object-locator": {

"schema-name": "voucherify",

"table-name": "campaigns-TENANT-PROJECT"

"rule-action": "rename",

"value": "campaigns_migration",

"old-value": null

{

"rule-type": "transformation",

"rule-id": "2",

"rule-name": "2",

"rule-target": "table",

"object-locator": {

"schema-name": "voucherify",

"table-name": "vouchers-TENANT-PROJECT"

"rule-action": "rename",

"value": "vouchers_migration",

"old-value": null

{

"rule-type": "selection",

"rule-id": "3",

"rule-name": "3",

"object-locator": {

"schema-name": "voucherify",

"table-name": "campaigns-TENANT-PROJECT"

"rule-action": "include",

"filters": []

{

"rule-type": "selection",

"rule-id": "4",

"rule-name": "4",

"object-locator": {

"schema-name": "voucherify",

"table-name": "vouchers-TENANT-PROJECT"

"rule-action": "include",

"filters": []

}

]

}

There are four items in the list of rules applied to each job responsible for a single project. This number comes from two double-splits that we have:

Campaign and voucher entities.
Selection and transformation rules.

The last point is about something new. These two rules are the only possible types of rules in the DMS tasks. Selections are used to narrow the source data to the expected scope. They can either include or exclude a subset of the data, however we used only the first mode. Transformations have many more possible actions, like rename that we used, and can be applied to one of the three levels called “rule-target” which are: schema, table and column. At the time of writing this article there are also two more options reserved for Oracle database, but they are not important since we don’t use this database.

Having this knowledge, it should not be difficult to read the example above. To be on the same page, let’s finish the job together. In our case, rules that we applied were doing the following:

1. Selecting two MongoDB collections:

campaigns-TENANT-PROJECT
vouchers-TENANT-PROJECT

2. Mapping collection names to following PostgreSQL tables (in the same order):

campaigns-TENANT-PROJECT mapped to campaigns_migration
vouchers-TENANT-PROJECT mapped to vouchers_migration

The above SQL tables were temporary child tables that we mentioned already. Each DMS task rule must have a unique rule-id. It can be an arbitrary value, but we preferred to use increasing integers to preserve compatibility with the default DMS behavior. This way we were able to switch between json and visual DMS modes easily without any differences in the outcome json. The rule-name field does not have to be unique, but we have it equal to rule-id for simplicity.

The code snippet above, together with few options, is enough to begin the migration process. After selecting the mode for copying the data and then replicating the changes, we had all we needed to have the data in the destination temporary tables. Keep in mind that changes replication does not copy the changes, like increment by 1, but rather copies whole objects again.

One more note, in the above example you can see that we used campaigns_migration and vouchers_migration SQL table names. However, in the example shown before we used voucherify."vouchers-test@voucherify.io-proj_f1r5Tpr0J3Ct" as a table name. The thing is that it’s not a coincidence that we changed the approach at some point, because it occurred that there is a limit for table names in the PostgreSQL that we faced for some long email addresses. If we used a fixed length ids here then there would be no problem with that.

The last small problem to solve was moving the data from these child tables to their parents. To do that we came up with a simple SQL query fired for each of the child temporary tables of each project. Let’s see this beast:

WITH toberemoved AS (

SELECT tenant_id,

project_id,

FROM voucherify.vouchers_migration

LIMIT 100000),

removed AS (

DELETE FROM voucherify.vouchers_migration x

USING toberemoved t

WHERE x.tenant_id = t.tenant_id

AND x.project_id = t.project_id

AND x.id = t.id

RETURNING

x.tenant_id,

x.project_id,

x.id,

x.code,

x.deleted,

<...>

x.discount,

x.publish,

x.validity_day_of_week,

x.created_at,

x.updated_at,

x.deleted_at)

INSERT

INTO voucherify.vouchers (tenant_id,

project_id,

id,

code,

deleted,

<...>

discount,

publish,

validity_day_of_week,

created_at,

updated_at,

deleted_at)

SELECT * FROM removed

ON CONFLICT ON CONSTRAINT pk_vouchers

DO UPDATE SET deleted = excluded.deleted,

<...>

discount = excluded.discount

publish = excluded.publish

validity_day_of_week = excluded.validity_day_of_week

updated_at = excluded.updated_at

deleted_at = excluded.deleted_at;

It consists of four parts:

Finding 100k entities to be moved.
Deleting them from the temporary table, at the same time fetching this data.
Inserting the just-deleted data into the final table.
If a voucher with a given id already exists then update it with new data.

It is a single query so there is an implicit transaction happening here. If inserting or updating fails for any reason, then the whole chunk of (up to) 100k items stays in the temporary table still, because this hidden SQL transaction is rolled back. Then we can investigate the issue and try to get this batch to work again after applying a fix.

Keep in mind that both copying the data by DMS and moving it the way we did, eats the IOPS Burst Balance. Therefore it’s essential to monitor this metric during the migration not to push the production out of the ring accidentally.

Process overview

After gathering all the essential knowledge, let’s put all the pieces into a list of steps to be performed for each project during the migration process:

Sanity check of vouchers and campaigns in MongoDB.
Create child tables & indexes for child tables.
Apply transform triggers.
Create and run a DMS task.
Sanity check data in child tables in Postgres.
Move deleted vouchers from child to parent table.
Switch project’s logic to use Postgres.
Check anomaly detector.
Move active vouchers from child to parent table.
Move campaigns from child to parent table.
Stop and remove DMS tasks.

Most of these steps were described in detail already, let’s quickly talk about the leftovers. The third point is about creating the same indexes for the child tables as indexes that the parent tables have. We had to do that even though we moved majority of the project’s data to final tables before making the switch. The reason behind this are database queries from other projects. For each query PostgreSQL has to fire a subquery to every child table to fetch zero results from it. However when there are no indexes applied for these child tables such a subquery executes very long due to sequence scans. Therefore it was critical to have all indexes in place, so that our production didn’t blow up right after filling up child tables with data.

When copy pasting the Mongo indexes to the SQL format we had to add an additional param for created_at in the end to some of them. It happened that we sorted by _id field in our code, which works as ordering by creation time in Mongo. We did not have a created_at field, but only sorted by “_id: 1” and “_id: -1”, so it was essential to add one more index and extend a few others so that entities listing works fast after the database switch.

To fully understand the order of steps that we chose, we must first stress that shifting the project to a new logic could happen before or after the data is moved into final tables. What are the differences between these two approaches? Well, they have different cons for bad scenarios. The dilemma of how to order these two operations would not exist if we decided to transfer the data using DMS directly to the final tables. We chose to keep new data in separate physical tables therefore we had one additional problem to be solved at this point.

So if the switch is done before the move, we have a risk of creating a duplicated entity after switching to the new logic. It is true for all unique indexes, so in our case it could have been a duplicated campaign id, campaign name, voucher id or voucher code. Why could the duplicated entry happen even though we have unique indexes for both parent and child tables? That’s because data with their unique indexes is stored separately in child and parent tables, so such misleading cases are possible.

For example, vouchers created in the final vouchers table would not collide by id or code with any voucher kept in the child table. As far as the campaign names and voucher codes are concerned, we would be on the safe side, because we check the voucher's code existence before creating one everywhere in our codebase, same as the campaign name. However things get more complicated for campaign and voucher ids. Although it’s not very likely to create a duplicate there, especially for campaigns, it would still be possible because we don’t have any additional checks for such cases in our code and adding them would not be an easy job. In regards to campaigns, we had quite long ids for campaigns at the time of the migration, and moreover they are rather always created manually on the website, so it was very unlikely to happen even once within the migration time window. Vouchers, on the other hand, are generated after an API call in the vast majority of cases, or it could happen that a batch of vouchers is added to some campaign during the migration. Even though the likelihood of picking a duplicate id is very low, we had a possibility to disable voucher generation for a while which significantly reduced such probability.

Additionally, as you can already see in the list of steps we moved deleted vouchers in the first place so that chances of generating duplicate were reduced even more. All in all, even if such an error occurs it would most probably not cause any damage unless such voucher is used. Moreover, even if it seems a bit scary – it is not that hard to fix such an issue – a couple of manual actions needed on our side. Real problem would be a code duplication if it was published right after being created, because then two customers would have the same code, and it would not work for either of them. Fortunately, we had a separate logic checking this case, so we were on the safe side.

The alternative version, where move is done before the switch, has three different risks, where the first two are closely connected. There are two bad scenarios that could happen in theory. Imagine that we moved all data from the child table, made a switch, and suddenly one last change was transferred by the DMS that occurred in the source database. We can see that there is one more element to be moved to the parent table at that moment, so we fire the query moving the data once again counting on its part updating an existing entry. The first bad thing is that it already takes some additional seconds, which could in theory enable for voucher code overuses because counting redemptions would be inaccurate for a while. The second thing is that for the same short time window we will have a duplicated entry while fetching from the parent table. It would produce an error in our code and would probably be seen by the customer. This scenario is about a very tiny time window, but the second case is more interesting.

Let’s imagine that just before we repeat firing the move command an entity is already modified in the target database. Then we would override this new state by the old one coming yet from the source. Now, keep in mind that it could be an important update on one of the sides, and without careful comparison we would not know what should be the proper entity version. Taking into account that such migration is stressful for the executor, we didn’t want to add more manual steps just in its end.

Let’s summarize the cons of these two approaches:

First switch, then move:

1. Risk of duplications in id:

Nearly zero probability of duplicate campaign id.
Very long id.
Created very rarely (in comparison to vouchers).
Low probability of duplicate voucher id.
We could disable creating batches of vouchers for a while.
Moving deleted vouchers first would reduce the chances.
Problems would show up when using vouchers, so there was plenty of time to apply a fix.

2. Risk of duplications in voucher’s code:

Protected in the code.

3. Risk of duplications in campaign’s name:

Protected in the code.

First move, then switch:

Risk of short-time data desync (couple of seconds after the switch).
Risk of short-time duplications.
Risk of long-time data desync (accidental override of an update done in target database):

1. Frequent update of campaign is highly probable:

Most often updated field: vouchers_count.
Risk of overriding important manual updates from the website.

2. Sequential update of voucher is not probable:

Most frequently updated field: redemptions_count.
A mistake here could allow for over expenses using a particular voucher.
Risk of overriding important manual update from the website.

Taking into consideration all these possible bad scenarios, and the fact that campaigns and vouchers have opposite characteristics in terms of update frequency, we picked the first option for active vouchers and the second option for the rest of the data. Therefore in the plan above we first moved deleted vouchers that cannot be modified anymore, do the switch, and then move active vouchers as fast as possible to decrease the chance of generating a duplicate voucher. After that we moved all the campaigns data which didn’t need to be done in a rush.

The way of migration described above was applied for active projects that were in use. We also migrated many more trial projects, or with cancelled subscription if someone renews it. For projects with no activity we prepared a simplified version of the process that could additionally be applied for big chunks of projects at once. In order to have such ability we needed to apply the following changes.

First of all, we needed to have an appropriate number of child tables created beforehand. This step was performed once for all projects to be migrated, and then we removed all temporary tables at once. In order to have that many temporary tables, we were adding postfixes to table names like _0, _1, and so on. Additionally, we applied all indexes for these tables just once.

Secondly, we generated the fat DMS jsons, as well as the SQL for applying the triggers, using some simple scripts that were prepared. They were taking a TSV file as input with all the essential data of the next batch of projects to migrate.

Last but not least, there was no need to overcomplicate the switching campaign and voucher procedure anymore. We could freely assume that there will be no updates at all, so it was just fine to move all the data to parent tables after projects are switched. The simplified process looked like that:

Create N child tables.
Create indexes for child table.
Apply transform triggers.
Sanity check of vouchers and campaigns in MongoDB.
Create and run a DMS task.
Sanity check data in child tables in Postgres.
Switch project’s logic to use Postgres.
Move all data from child to parent tables.
Stop and remove a DMS task.
Drop N child tables.

One More Issue

After migrating almost all projects we faced one last interesting problem. We have a feature in our application of grouping vouchers by categories. When showing the list of all available categories we fetched it from the vouchers data using Mongo’s distinct function. There was no need to have a separate list for categories for each project, because this query was always fast enough. We then rewrote this logic to SQL using SELECT DISTINCT which worked fine in general, however had a significant performance downgrade on the production when users started to use it. It seems that Mongo is much faster in queries like that due to some caching of the results, while Postgres was performing an extensive search each time. The downgrade was so noticeable that we had to quickly add a cache in Redis over this feature to gain the speed back. Later on we added missing code that updates these entries, making it a long-term solution.

The Bottom Line

After getting through the whole migration process we gained a significant monthly cost reduction (in some clusters up to 50%), much cleaner codebase in many places, less problems due to change of the provider, and a lot of bug fixes made along the way. All in all, this was not an easy task, however it was worth it. The biggest question we faced was if DMS should copy the data directly to the target tables or, as we chose, to the temporary ones. Both answers have their cons and risks, so there is no perfect solution for everybody. It seems that migrating to the final tables should be easier, but harder to control. If we were to migrate data of lower importance, we would most probably choose the alternative path, even to just see how this path ends. Also, if we were to move one big data collection at once without splitting it in any way, then there would be no issue with taking the second way.

But if you have a very similar situation to ours, then unfortunately we cannot give you the right answer. You must bear in mind all the pros and cons, and make your own decision. But, we’ll be more than happy to read about it in some article one day. Fingers crossed!

APPENDIX

json_arr2int_arr helper function:

CREATE OR REPLACE FUNCTION voucherify.json_arr2int_arr(js jsonb) RETURNS integer[] LANGUAGE sql IMMUTABLE AS

'SELECT ARRAY(SELECT jsonb_array_elements_text(js)::numeric::integer)';

hex_to_bigint helper function:

CREATE OR REPLACE FUNCTION voucherify.hex_to_bigint(hexval text) RETURNS bigint AS $$

DECLARE

result bigint;

BEGIN

EXECUTE 'SELECT x' || quote_literal(hexval) || '::bigint' INTO result;

RETURN result;

END;

$$ LANGUAGE plpgsql IMMUTABLE STRICT;

‍

Adrian Wilczek

Software enthusiast. Fan of simple but efficient code. Full-stack programmer with a special love for Node.js and Java. During weekends conquering mountains with a bike, jogging, or swimming.

Get inspiration to grow

How We Moved From MongoDB to Postgres Without Downtime and Cut Our Costs by 30%?

The Migration Beginning

Database Migration Service‍

DMS Pros

DMS Cons

Preparations

Sanity Checks‍

Check Anatomy

The Big Picture‍

A Word or Two About Tables

The Main Meal‍

Cloning Data‍

Process overview

One More Issue

The Bottom Line

APPENDIX

Katowice

New York

Product

Scenarios

Resources

For Developers

Company

Get inspiration to grow

Get inspiration to grow

How We Moved From MongoDB to Postgres Without Downtime and Cut Our Costs by 30%?

The Migration Beginning

Database Migration Service‍

DMS Pros

DMS Cons

Preparations

Sanity Checks‍

Check Anatomy

The Big Picture‍

A Word or Two About Tables

The Main Meal‍

Cloning Data‍

Process overview

One More Issue

The Bottom Line

APPENDIX

Related Voucherify Tech articles

Building A/B Tests for Product Managers

Delivering IT for an Immensely Successful Startup

Fewer Unit Tests, More Logs

Join our newsletter