Voucherify Under the Hood: Tomasz Sikora and His Path to a Stable Platform
How did you start at Voucherify?
Although today I am responsible for SRE, my beginnings at Voucherify go back to 2014, the very first days of the company. At the start, I worked on various projects for European startups. During those first years, we learned a lot about creating products from scratch.
Two projects that I remember are Kisi and ShareIQ. The first, Kisi, was about keyless locks opened with a mobile app. ShareIQ, in turn, was a product that detected unauthorized use of photos. Their platform crawled the web and scanned graphic files. Then, an algorithm generated a hash and identified copies (even if the photo was edited, e.g., it had different colors or was a mirror image).
This platform also collected information about who first posted a given photo and where (e.g., on social media). ShareIQ created a distribution map that showed where the photo was shared further, by whom, and on which portals.
ShareIQ's clients included global brands that wanted to reach influencers who promoted their products. Another use was spotting scammers who used original photos to sell counterfeit products.
The product had very experienced founders and it is no wonder that in just 2 years it was sold to the giant of the media monitoring market, Cision.
That’s some fast growth – what do you remember the most about this project?
On a more personal level, one of the best experiences with ShareIQ was recruitment. They organized a 2-day hackathon in Berlin, which they invited me to. It was actually the only hackathon after which I came back with more energy than before leaving.
The whole experience was rounded off by a beautiful sunset on the flight home.
Then it was time to work on your own product. What was your role in Voucherify's early days?
The first version of Voucherify was based on a client’s idea (more information in this article). We started building the skeleton at an internal hackathon. Later, we involved more programmers in the development. I joined the project about a year after the hackathon. I immediately started to stick my nose everywhere I could. I was building the backend, the software architecture and the frontend, and developing deployment processes – in short, everything that was needed to get new customers and ensure that downtime did not cause them to cancel their subscriptions.
I quickly felt like a part of the team. When I joined the company, the team consisted of only a few people, most of whom I already knew from previous projects. A close-knit team translated into efficiency and speedy delivery from the very beginning – and it has remained so until today.
What are you currently doing at Voucherify?
Now I mainly deal with site reliability, that is, ensuring that the platform is scalable and stable. My goal is to prevent the team from having to wake up at night to bring the platform back to life after downtime. I also have to ensure platform scalability, which means addressing potential performance issues in advance.
In many cases, this requires rewriting and correcting current solutions, addressing technical debt or building scalable solutions in place of what we once coded in a Spartan way.
We often start with MVPs. When the platform sees greater usage or a given solution becomes more popular, the original functionality is no longer sufficient and needs to be improved. One recent improvement with a capital "I" was the move from MongoDB to Postgres.
Most developers complain about technical debt and fixing the mess left by someone else. Aren’t you bored of such work?
Complaining about the work of others and bugs in the code is the bread and butter of developers. I also sometimes complain about the work of my colleagues (and also my own!). However, you have to realize that no one will ever write perfect code that is wonderful, beautiful, wins a Nobel Prize, and never needs to be improved.
On the other hand, pragmatism and building quick solutions make sense. It is not worth building an extensive solution from the beginning if you don’t know the scale and the only thing you know is that the client needs a given function right now. After all, the most valuable code is the code that makes you money.
The same principle applies to infrastructure, which should naturally grow with the platform. At the beginning, our platform used only Heroku. Later, as we grew, we moved to AWS.
Most of our technical debt and problems aren't caused by "bad code," although bugs can happen to anyone. Most bugs (95% of cases) are caught with code reviews. The remaining 5% may go to production.
Our technical debt mainly results from how we solved a given problem earlier. We did it pragmatically, i.e., we covered only a small part of the functionality that could be tested in the production environment and made sure that it was useful for the client.
Previously, such a solution was sufficient, but now, due to the growing scale and more demanding customers, it is no longer the case.
What are your plans as the head of the SRE department?
I would like to continue to influence the platform development, trying to satisfy future clients. I would also like to develop Voucherify with a focus on the scale that we will have in a few months or years. I want a strategic approach to platform development, not only solving current SRE problems and fighting fires. It just so happens that with over 300 clients and the growing number of requests per minute, the platform requires me to plan long-term and put proper fire protection in place.
It is worth mentioning that our platform currently operates in 6 different regions (we currently have 14 clusters). We are constantly improving, but at the same time we try to simplify the process of implementing new changes to the platform and infrastructure.
That is why I have been learning something new all the time for 5 years – SRE is such a bottomless well (in a positive sense), a place where creativity can be satisfied.
What was your greatest achievement in Voucherify?
Recent achievements include reducing the platform load and infrastructure cost. We optimized one key functionality, customer segments, that used to reload once or several times a day.
Reloading segments, i.e., moving users into and out of a group with specific characteristics, was a very slow process. Naturally, the more data, the longer it took to reload. For one of the larger customers who used Voucherify on a very big scale, it took over a day. Currently, it takes from a few seconds to several minutes.
[Figure: outgoing traffic from one of the machines running the applications]
What issues of scale are you facing today?
We have some technical debt, a holdover from the first years of rapid growth, and much of it still needs to be resolved. For example, our own queue mechanism worked fantastically for 5 years and allowed us to easily manage messages in the queue, quickly deliver new functionalities and iterate on existing ones, but it has started to slow us down at the current scale.
The main indicators that we are trying to improve are the number of requests to our API and their response time. We focus mainly on the 95th percentile (~600 ms), less often on the average response times. The RPM we face is 1-1.5k on average in a single cluster. There are, however, spikes to 5k RPM, which we also handle.
With our clients’ batch processes, the level of IOPS on our databases jumps, sometimes exceeding 4,000 IOPS. For this reason, we optimize queries and change the approach to standardization and data transfer. We have already implemented ElasticSearch and data caches on other layers. It would be very easy to solve this problem by boosting resources, but we try not to take shortcuts and aim to reduce the overall infrastructure costs instead.
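To illustrate why a team might watch the 95th percentile rather than the average, here is a minimal Python sketch. The sample latencies are invented for the example and the nearest-rank method is only one common way to compute a percentile; nothing here is taken from Voucherify's actual monitoring setup.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds (not real measurements):
# most requests are fast, two are slow.
latencies_ms = [120] * 18 + [600, 2500]

mean_ms = statistics.mean(latencies_ms)
p95_ms = percentile(latencies_ms, 95)

print(f"mean: {mean_ms} ms, p95: {p95_ms} ms")
```

In this made-up data set the mean stays well below the 95th percentile, because the many fast requests dilute the few slow ones. The percentile shows what the slowest requests actually experience.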
What do you like the most about your job?
My involvement in the project is enormous, and so is my trust in my colleagues. We have an ideal team: very well-coordinated, and we support and complement each other. The working atmosphere is very good. I like and appreciate the people I work with. It's as simple as that.
Is there anything that annoys you?
The continuous and very dynamic growth of the customer base and their expectations requires iterating on the architecture from the very beginning. We have had about four larger infrastructure refactors so far. The implementation of subsequent changes requires many steps. To make matters more complicated, during this process we always try to ensure full availability of the platform – we don’t accept shutting down the API for several hours. In practice, it still feels like chasing after the requirements. Now, keep in mind that we have 14 clusters. We try to avoid new processes and would rather simplify the existing ones.
We have always been a step ahead, but for me personally it is not enough. I would like to feel more free and have some reserve. Fortunately, our SRE team continues to grow stronger.
How does the SRE team work?
We have a flexible approach. We write a lot on Slack. The team often asks questions and shares knowledge there. Most often, when there are incidents or larger problems, we have a call together to discuss, analyze and solve the problem or set further action points. The length of the calls depends on the problem. Judging by the feedback, everyone is fairly satisfied with the meetings. They are very productive, and we can quickly arrange things on a call. As a rule, developers, team leaders and CTOs take part in such calls. Darek talked about it very well in his interview.
I am also glad that the management always asks us for our opinion on matters that concern us or our work. We know that they trust us and that we take an active part in making decisions.
After several years of developing the platform, is there anything else that is stressing you out?
Most often, stress is associated with incidents where we have “all hands on deck” and act ad hoc. It gets worse when the platform crashes at night. We don’t have regular shifts, but we have people who prefer to work in the evening hours, so there will always be someone who will be able to put the platform back on its feet.
Stress and workload depend on the period. Sometimes nothing happens, and sometimes incidents occur every week. Fortunately, the days of taking my laptop on vacation are over, because the SRE team today consists of people I fully trust.
But you're still looking for support – what are the most important criteria when recruiting?
Two values that I would like to see in anyone I work with are the ability to admit to mistakes (openness to criticism) and the willingness to learn.
I am happy to work with juniors and I am open to sharing my knowledge and experience. Technology is secondary. The most important thing is that programmers want and like to do what they do, and that if they make a mistake, they are open to feedback.
Last question: what is your advice to customers about using Voucherify?
I recommend that you control the number of requests sent to our API and avoid peaks (moments when a lot of requests suddenly appear). A burst of random queries will quickly exhaust the API call limit, blocking the entire integration. As a solution, I recommend queuing processes and spreading them over time, especially processes which you may not pay much attention to, e.g., importing customers or exporting redemptions. This is a good solution for both customers and our platform.
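The advice about queuing and spreading requests over time could look roughly like the Python sketch below. It is an illustration only: `send_paced`, its parameters and the example rate are hypothetical names and values chosen for this sketch, not part of any Voucherify SDK, and the real API limits should be taken from the official documentation.

```python
import time


def send_paced(items, send, max_per_minute, sleep=time.sleep, clock=time.monotonic):
    """Call send(item) for every item, leaving an even gap between calls
    so that no more than max_per_minute calls are started per minute.

    `send` is whatever function performs a single API request
    (hypothetical here). `sleep` and `clock` can be replaced in tests.
    """
    if max_per_minute <= 0:
        raise ValueError("max_per_minute must be positive")
    interval = 60.0 / max_per_minute  # seconds between two consecutive calls
    results = []
    next_slot = clock()  # the first call may start immediately
    for item in items:
        wait = next_slot - clock()
        if wait > 0:
            sleep(wait)
        # Schedule the next call one interval after this one starts,
        # so slow calls never lead to a catch-up burst afterwards.
        next_slot = clock() + interval
        results.append(send(item))
    return results
```

For example, a nightly customer import could pass its list of records as `items`, a function that performs one request as `send`, and a `max_per_minute` value safely below the limit of the subscription plan. The import then takes longer, but it no longer competes with live traffic for the same request budget.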