At TrackJS we pride ourselves on our pragmatic approach to software development. We’re cautious about making changes - every change must be weighed not only by its reward, but also by its risk. We prefer to avoid big sweeping changes if possible. Smaller, incremental changes are typically our goal. But this is a story about what happens when a big sweeping change is the only option.
Email Is Important
We send about 400,000 emails a month. The vast majority of those are alert notifications or daily summary reports. Our customers are relying on us to tell them when their web apps are breaking. If those emails don’t go through, we’ve failed at one of our missions.
We prefer not to take dependencies on third parties if we can avoid it. But sending email effectively is hard. Sender reputation, unsubscribes, the vagaries of SMTP and the inconsistency between ESPs were not rabbit holes we wanted to climb down. Fortunately, there are several third parties who provide an API to send emails on your behalf.
Dependencies: Great Until They’re Not
For the last five years we used Mandrill for transactional emails. All alerts, summaries, password resets, etc. were sent through their API. Things mostly went well, but cracks started to appear. In 2016 it was announced that Mandrill would become a paid add-on of Mailchimp. This suggested Mandrill was not profitable or self-sufficient, and perhaps investment in Mandrill would begin to decline. Not a good sign.
Fast forward a year and we started experiencing random API failures. These were best described as hiccups. A few emails a day would fail with cryptic error responses from Mandrill. They did not acknowledge anything was wrong, which left us to figure out a path forward.
As I said, we don’t like making big changes unless we have to. It was not a tremendous number of failures. Maybe 10 on a bad day. We could have switched email providers then (and perhaps we should have, in retrospect) but instead we decided to add more robust retry logic. If an email failed, we’d ensure it would get attempted a few more times before giving up completely. These changes bought us another year.
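The retry idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual TrackJS implementation: the function name `send_with_retries`, the attempt count, and the exponential backoff between attempts are all assumptions made for the example.

```python
import time
from typing import Callable


def send_with_retries(
    send: Callable[[], None],
    max_attempts: int = 3,          # assumed value; the post only says "a few more times"
    base_delay: float = 1.0,        # assumed backoff; not described in the post
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds or max_attempts is reached.

    Returns the number of attempts used. If every attempt fails,
    the last exception is re-raised so the caller can report it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise  # give up completely
            # Wait a little longer after each failure: 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` in as a parameter is only a convenience so the waiting can be swapped out when testing.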
Making Failures Visible
It’s worth taking a brief diversion to explain how we knew our email provider was having issues. We send around 15,000 emails a day. At first, only a handful of emails were failing. How did we know? Some folks will trawl their logs daily and look for anomalies. Log severity filtering helps to surface the bigger offenders. But you have to remember to do this every day, or automate it. And even then, you have to pay attention and actually take action. It’s easy to overlook small numbers of errors in large log reports. You might not realize you have a problem until it’s out of hand.
To avoid log trawling, we decided important events should be brought to our attention immediately. The way we accomplish that is to pipe those events directly into our main chat room. We’re in this room all day, and if it starts to get noisy with errors, we’ll fix the problem quickly. As it relates to our email discussion - if an email fails to send after the specified number of retries, that error is posted to our chat room. You can imagine, even a dozen errors a day in chat is noticeable, and irritating. We do the same for all unhandled exceptions in TrackJS. We know before anyone else when something is having a problem.
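To make the "retry, then shout in chat" flow concrete, here is a small sketch. It assumes a hypothetical `notify` callable that posts a string to a chat room; the real chat integration, message format, and retry count are not described in this post, so everything named here is illustrative.

```python
from typing import Callable, Optional


def deliver_or_alert(
    send: Callable[[], None],
    notify: Callable[[str], None],   # hypothetical: posts a message to the team chat room
    recipient: str,
    max_attempts: int = 3,           # assumed value
) -> bool:
    """Try to send an email; if all attempts fail, post the error to chat.

    Returns True when the email went out, False when we gave up.
    """
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            send()
            return True
        except Exception as exc:
            last_error = exc  # remember the most recent failure for the alert
    # Only the final failure reaches chat, so routine hiccups that a retry
    # fixes do not add noise to the room.
    notify(f"Email to {recipient} failed after {max_attempts} attempts: {last_error}")
    return False
```

The design choice worth noting is that the alert fires once per abandoned email rather than once per failed attempt, which keeps the chat room readable while still making every lost email visible.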
You Had One Job
At the beginning of September we started getting dozens, sometimes a hundred failed emails a day (think of our poor chat room). The Mandrill API was constantly sputtering and sending errors. After several weeks of issues and support tickets, Mandrill removed their status page from the internet completely. At the time of this writing it points to their Twitter account, which, incidentally, was filled with their customers who were having the same issues we were. With limited acknowledgement of the problem, and even less action to fix it, we decided we had to switch email providers.
The three we looked at were SendGrid, Mailgun and Postmark. We chose Mailgun primarily because they focus on one thing - sending email via API. The other services all have additional analytics and marketing features. We didn’t need any of that; we just wanted a reliable way to send email. We are hoping this focus translates into overall reliability.
Changing the Code
Five years ago when we started sending emails, we opted to use the Mandrill SDK directly. We could have built a wrapper or abstraction layer to make emails “provider agnostic”, but at the time we didn’t know what parts of their SDK we were going to use. In retrospect this was a good choice. It let us move fast at the time, using the SDK directly, and we got 5 solid years out of it.
However, with the move to Mailgun, it felt like the right time for a provider-independent abstraction in the code. If Mailgun doesn’t work out, we want an easy way to try the next provider, or, in a pinch, use Mandrill again. In 2013 it didn’t make sense to spend a lot of time building abstractions - we weren’t even sure what trajectory our business would take. But in 2018, with thousands of users relying on our emails, it was time to refactor a bit.
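A provider-independent abstraction of this kind could look something like the sketch below. The names `Email`, `EmailProvider`, `RecordingProvider` and `Mailer` are hypothetical and chosen for illustration; the real TrackJS code and the Mailgun or Mandrill SDK calls are not shown in this post, so no vendor API is used here.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass(frozen=True)
class Email:
    """Provider-neutral description of a message."""
    to: str
    subject: str
    body: str


class EmailProvider(Protocol):
    """Anything that can send an Email. A Mailgun or Mandrill adapter
    would implement this by calling that vendor's own SDK."""
    def send(self, email: Email) -> None: ...


@dataclass
class RecordingProvider:
    """Stand-in provider that just records what it was asked to send."""
    sent: List[Email] = field(default_factory=list)

    def send(self, email: Email) -> None:
        self.sent.append(email)


class Mailer:
    """Application code talks to Mailer and never to a vendor SDK."""

    def __init__(self, provider: EmailProvider) -> None:
        self._provider = provider

    def send_alert(self, to: str, error_message: str) -> None:
        # The subject line is a placeholder for this example.
        self._provider.send(Email(to=to, subject="Alert", body=error_message))
```

With this shape, trying a different provider means writing one new adapter class and changing the object passed to `Mailer`, rather than touching every place that sends an email.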
Mailgun, So Far
Two weeks ago we pushed the changes that would switch all emails from Mandrill to Mailgun. The rollout went well. Mailgun does a nice job with onboarding, and ensuring your DNS records are set up correctly. One thing they did not mention, but we found useful, was to ensure your email sending domain has a valid A record that points to whatever IP is sending the emails in Mailgun. Several anti-spam products were flagging the lack of A record as a problem.
While Mailgun has thus far been more reliable than Mandrill, they did have some API problems a few days ago. We’re hoping this is inconvenient timing, and not signs of a more systemic issue. Because of our abstraction work, we can always switch back to Mandrill in a pinch. We’ve also talked about multi-provider fallback. Time will tell if we need to go this far.
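For the curious, multi-provider fallback might be sketched like this. It is purely hypothetical, since we have only talked about the idea: the function name, the `(name, send)` pairing, and the `AllProvidersFailed` exception are assumptions for the example, and the provider callables stand in for real SDK adapters.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Message = Dict[str, str]
Provider = Tuple[str, Callable[[Message], None]]  # (name, send function)


class AllProvidersFailed(Exception):
    """Raised when every configured provider rejected the message."""


def send_with_fallback(providers: Sequence[Provider], message: Message) -> str:
    """Try each provider in order and return the name of the one that worked."""
    errors: List[str] = []
    for name, send in providers:
        try:
            send(message)
            return name
        except Exception as exc:
            # Record the failure and move on to the next provider in line.
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Ordering the list puts the preferred provider first, so the secondary is only used when the primary is actually failing.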
We spend a lot of time making sure our emails get through, so it’s one less thing for our customers to worry about. If you want emails when your web application is having trouble, sign up for TrackJS today!