This is the 4th part of the series “Building an online marketplace from scratch”, a collection of articles on how to design and deliver modern business applications quickly. In this post, we’ll elaborate on platform monitoring and error handling in the early days of your system.
The first three posts focused on how to quickly build the core of the system and how to iterate efficiently, taking into account frequently changing business requirements. As a wannabe-CTO, after reading them, you should be able to ramp up and automate the basic order management process. The management and marketing/customer service/operations teams are already thankful to you for taking the pain out of mundane, manual jobs and giving them some time to be creative.
But, as you can imagine, fast software delivery involves cutting corners, or should I say, “corner cases”. You put yourself and your software team at risk of being attacked by hordes of unexpected errors coming from both external and internal users; think: unhandled form data, duplicate orders, records removed by mistake, a super important request from a big fish client, and the list continues. All of a sudden, the manual work you’ve taken away from the other departments ends up on your plate. Now, how to fix it?
Some software conservatives would say: “cover your crappy code with a thick safety net of unit tests and you'll get rid of unimplemented corner cases”. Easier said than done? Well, our standpoint is that it makes little sense to add unit tests at this stage. First, the code changes so frequently that the tests would become obsolete within a week, sometimes within a day. Second, test-driven software development requires discipline and process, and both take time. And there’s no time in an early stage business like Manufaktura.
So, let’s see what measures you can leverage to reduce error handling work in your team and, more importantly, improve the quality of your service. We will describe the low-hanging fruit first and, in the last section, point you in the right direction when it comes to heavyweight monitoring tools.
A unified logging structure is invaluable for error handling. For a business app like Manufaktura I dare say it saves you (as a developer) more time than any collection of tests, no matter if they are unit or end-to-end. It’s also beneficial for the security and devops teams later on, when your business matures. The rules are simple:
- Every dev uses the same, unified log structure
- Every log has a unique context identifier (you can use a timestamp, business component, user ID, order ID, web session ID etc.)
- Every dev pays special attention to detailed error logging
- Every dev starts measuring the execution time of the most important / popular features; it might be useful when the scale hits
Detailed error logging example:
logger.error("[EmailService] Email not sent - Order: %s Account: %s Message: %s Error: %j Stack: %j", order.Id, account.Id, error && error.message, error, error && error.stack);
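The rules above can be sketched as one small helper that every dev imports. This is only an illustration: `formatLine`, `createLogger`, `timed` and the line layout are made-up names and conventions, not part of any logging library, and the sink defaults to `console.log`.

```javascript
// Hypothetical helper: one log format for the whole team.
// Every line carries a timestamp, a level, a component tag and context IDs.
function formatLine(level, component, context, message, now = new Date()) {
  const ctx = Object.entries(context)
    .map(([key, value]) => `${key}: ${value}`)
    .join(' ');
  return `${now.toISOString()} ${level.toUpperCase()} [${component}] ${message} - ${ctx}`;
}

function createLogger(component, sink = console.log) {
  const write = (level) => (message, context = {}) =>
    sink(formatLine(level, component, context, message));

  return {
    info: write('info'),
    // Errors always include the error message and the stack trace.
    error: (message, context = {}, error) =>
      sink(formatLine('error', component, {
        ...context,
        Error: error ? error.message : 'n/a',
        Stack: error ? JSON.stringify(error.stack) : 'n/a',
      }, message)),
    // Runs an async function and logs how long it took, even if it throws.
    timed: async (label, context, fn) => {
      const start = Date.now();
      try {
        return await fn();
      } finally {
        write('info')(`${label} finished`, { ...context, DurationMs: Date.now() - start });
      }
    },
  };
}

// Usage: same shape as the EmailService example above, collected in memory.
const lines = [];
const logger = createLogger('EmailService', (line) => lines.push(line));
logger.error('Email not sent', { Order: 'o-42', Account: 'a-7' }, new Error('SMTP timeout'));
```

Because the component tag and the context keys always sit in the same place, the lines stay easy to grep now and easy to filter in a log aggregator later.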
Now, let’s see why these rules are so important. Logging makes little sense if there isn't an easy way to analyze the logs. One way forward is to dump your logs into files on your server, log in via SSH and browse them with your favorite bash spells. But this can be time-consuming and the analysis itself might not be intuitive. Also, you have to manage the storage problem yourself.
Another approach is to set up a full-fledged log aggregator built on top of one of the popular open source tools; ELK and Graylog are of particular relevance here. But again, the setup and hosting take time, especially if you don’t have a Linux wizard onboard. Additionally, alerting doesn't come out of the box - you have to configure and maintain another package.
And alerting is the key takeaway I want you to leave with. Being able to define critical error messages and be notified about them is the backbone of a sustainable monitoring setup for an early stage system.
Imagine the time saved if you can select which errors are actually important to you and if you can get the error context in a well-formatted email or a dedicated Slack channel.
So, how can we get email/Slack/SMS alerts fast? The answer is: we’ll employ a SaaS platform (you might have already noticed this pattern in our series).
The log aggregator market has matured over the last couple of years. You can choose one of many tools. However, we found LogEntries to be the best for most of our projects. We’ve been using it for over 3 years now, and it’s been totally worth the price. Why LogEntries? Because apart from the alerting module, it also offers several other time-saving features like:
- SQL-like query language for searching
- Aggregated live tail search
- Custom tags of logs
- Works with multiple PaaS (e.g. as a Heroku add-on) and IaaS providers
- Has the ability to aggregate logs from different applications/services
Let’s get back to the alerts though. Creating email notifications is super simple with LogEntries. You just define a tag using built-in filters or regex and then define which tag should send an email and who gets it. It’s worth noting that you can also adjust the frequency.
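To make the tag-and-notify idea concrete, here is a rough sketch of the matching logic in plain JavaScript. It is not the LogEntries API (there you set this up in the tool itself); the rule shape, tag names, patterns and email addresses are all made up for illustration.

```javascript
// Hypothetical alert rules: a regex pattern, a tag, and who gets notified.
const rules = [
  { tag: 'email-failed', pattern: /\[EmailService\] Email not sent/, notify: ['cto@example.com'] },
  { tag: 'server-error', pattern: /\b(500|timeout)\b/i, notify: ['cto@example.com', 'dev@example.com'] },
];

// Returns the tags (and recipients) that a single log line triggers.
function matchAlerts(line, alertRules = rules) {
  return alertRules
    .filter((rule) => rule.pattern.test(line))
    .map(({ tag, notify }) => ({ tag, notify }));
}

// Limits alert frequency: at most one notification per tag per time window.
function createThrottle(windowMs) {
  const lastSent = new Map();
  return (tag, now = Date.now()) => {
    const previous = lastSent.get(tag);
    if (previous !== undefined && now - previous < windowMs) return false;
    lastSent.set(tag, now);
    return true;
  };
}

// Usage: one alert per tag per minute, no matter how many lines match.
const shouldSend = createThrottle(60 * 1000);
const alerts = matchAlerts('ERROR [EmailService] Email not sent - Order: o-42');
```

The throttle is what keeps a burst of identical errors from flooding your inbox, which is the reason the frequency setting matters.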
In an online marketplace business like Manufaktura, the CTO will be notified about expected and unexpected errors:
- Expected - sometimes you know a particular case isn’t implemented yet (because of priorities*) but it happens so rarely that a simple, manual db update is sufficient. You just need to be informed early enough.
- Unexpected - everything else, e.g. when the server responds with a 500 or times out
*As we’ve mentioned in previous posts, prioritizing which cases should be implemented first or which features should be shipped at all is a skill in itself. The only way to learn it is the hard way - through experience. The good news is that with the error alerts you just got a handy tool to reduce the impact of the wrong choice.
Plug’n’Play monitoring tools
Imagine you can get a complex application performance dashboard, including metrics like:
- Total number of requests
- Transaction execution time
- Database query execution time
- Latency around the world
by writing only a couple of lines of code. This is possible with the plethora of app performance SaaS tools. One of the most popular and most mature is New Relic. It supports many programming languages and has grown hundreds of integrations, including database-, browser-, infrastructure- and mobile-specific plugins. But it’s pricey at the same time. That’s why it’s good to take a look at alternatives.
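For a Node.js app, the "couple of lines" look roughly like the sketch below, assuming the `newrelic` npm package is installed. The application name is a placeholder, and reading the license key from the `NEW_RELIC_LICENSE_KEY` environment variable is just one way to keep it out of the repository.

```javascript
// newrelic.js, placed in the application root (values below are placeholders).
const config = {
  app_name: ['manufaktura-marketplace'],          // assumed application name
  license_key: process.env.NEW_RELIC_LICENSE_KEY, // keep the key out of the repo
  logging: { level: 'info' },
};
exports.config = config;

// Then, as the very first line of the app's entry point (e.g. server.js):
// require('newrelic');
```

Everything else (the dashboards and the metrics listed above) is configured on the New Relic side, not in your code.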
Anyway, at the early stage you don’t need most of the New Relic features. You can get by with just the APM module. And if you host your platform on Heroku, New Relic has an interesting $49 offer for you.
What’s also nice about New Relic is the alerting module. Similar to LogEntries alerts, you can subscribe to expected and unexpected situations, like spikes in traffic which your application cannot handle yet. This gives you a way to react before the shit really hits the fan, e.g. you can scale up your infrastructure for the increased traffic period or try to queue jobs and process them later.
The basic LogEntries and New Relic notifications run on email. For both tools, you can also add a Slack channel through webhooks. Unfortunately these aren’t much use when you sleep. And your platform might have one or two super-critical business processes you don’t want to be down even for a minute.
PagerDuty handles this. It gives you SMS/push notification alerts starting at $9 a month. The price goes up if you want to add phone call alerts or if you want to apply on-call scheduling rules for your team. For example, you can make PagerDuty call Tom first; if he doesn’t pick up (that is, acknowledge to PagerDuty that he reacted), it calls Dick, and finally Harry.
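The Tom, Dick and Harry chain boils down to a simple escalation loop. The sketch below only simulates it: `escalate` and the `notify` callback are hypothetical names standing in for a real phone call, and none of this is PagerDuty's API.

```javascript
// Simulated escalation: contact each person in order until someone acknowledges.
// `notify(person, incident)` is a hypothetical callback returning true on acknowledgement.
function escalate(incident, onCallChain, notify) {
  const attempted = [];
  for (const person of onCallChain) {
    attempted.push(person);
    if (notify(person, incident)) {
      return { incident, acknowledgedBy: person, attempted };
    }
  }
  // Nobody picked up: the caller decides what to do next (e.g. start over).
  return { incident, acknowledgedBy: null, attempted };
}

// Usage: Tom doesn't pick up, Dick does, so Harry is never called.
const result = escalate(
  'Checkout is down',
  ['Tom', 'Dick', 'Harry'],
  (person) => person === 'Dick',
);
```

The point of the ordered chain is that a critical alert always ends up with exactly one person who confirmed they are on it.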
Moreover, when you go to their integrations page, you’ll find both LogEntries and New Relic among more than 200 other connectors. The integrations and the triage support give you a simple way of connecting a particular error to a responsible developer.
These 3 SaaS monitoring tools are good value-for-money investments in an early stage online business. The configuration and hosting don’t require a dedicated administrator and they all have offers for small teams (LogEntries $39, New Relic $49, PagerDuty $9). What you get is priceless - less manual work for your dev team and a better quality of service. Having such a thick safety net, our Manufaktura is ready to develop more power features. In the next article, we’ll tackle one of these - email and SMS communication.
Now, when the number of features grows, the infrastructure swells and so do the bills from our SaaS monitoring providers. That's when you might want to reconsider your monitoring toolset and tap into self-hosted open source products. These are some of the market leaders:
- Prometheus - a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true.
- Grafana - allows you to query, visualize, alert on, and understand your metrics no matter where they are stored (integrates nicely with Prometheus).
- Consul - a tool for discovering and configuring services in your infrastructure and managing health checks.
- ELK - a centralized logging system based on Elasticsearch, Logstash, and Kibana
- Graylog - another log management tool; requires Elasticsearch and MongoDB
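To give a feel for what moving to Prometheus involves on the application side: Prometheus scrapes metrics over HTTP in a plain-text exposition format. Below is a minimal sketch of a counter rendered in that format. The metric and label names are made up, label values are not escaped, and in a real app you would most likely use a client library instead of hand-rolling this.

```javascript
// Minimal in-memory counter rendered in the Prometheus text exposition format.
// Metric and label names are made up; label values are NOT escaped here.
function createCounter(name, help) {
  const values = new Map(); // serialized labels -> count
  return {
    inc(labels = {}, amount = 1) {
      const key = Object.entries(labels)
        .sort(([a], [b]) => a.localeCompare(b))
        .map(([label, value]) => `${label}="${value}"`)
        .join(',');
      values.set(key, (values.get(key) || 0) + amount);
    },
    render() {
      const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
      for (const [key, count] of values) {
        lines.push(key ? `${name}{${key}} ${count}` : `${name} ${count}`);
      }
      return lines.join('\n') + '\n';
    },
  };
}

// Usage: count failed emails per component, then expose render() at /metrics.
const failedEmails = createCounter('emails_failed_total', 'Emails that could not be sent.');
failedEmails.inc({ component: 'EmailService' });
failedEmails.inc({ component: 'EmailService' });
const metricsBody = failedEmails.render();
```

Alerting rules and Grafana dashboards are then defined on top of such metrics, instead of on top of log lines.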