An Apology for Our Errors
In the last few weeks, you may have seen an error when trying to reach our UI. We’re really sorry about that, and we want to share what’s been happening and what we’re doing to fix it.
Around December 22nd, we began experiencing periodic connection failures between our web servers and our Elasticsearch cluster. The failures manifested as an error in the UI, but never affected all customers. We would see a flurry of these failures a few times per day, but otherwise everything was working well.
This all seemed to point to a load issue, so we set up some load tests with loader.io. We pointed the tests at a fairly expensive query in our system and steadily increased the request rate. We got as high as 100 requests per second and still saw no issues. The response time slowed, of course, but we couldn’t reproduce any timeouts.
When the issue started again, our load tests behaved very differently: they began producing connection errors at roughly the same rate as our online customers were seeing them.
We started investigating our hosting platform next. We rely heavily on Microsoft Azure, and we had not seen anything like this before. We added extra instrumentation to our app and waited for the next occurrence. Watching it happen confused us further: we could connect to Elasticsearch just fine from other hosts, and the resources on the web servers were not taxed.
After comparing notes with other companies running similar architectures, we moved all our services onto a Virtual Network. Now, our web servers reference Elasticsearch hosts directly by internal IP, and we have to handle load balancing across the cluster in the client connection.
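For readers curious what load balancing in the client connection can look like, here is a minimal sketch in Python. It is not our production code: the class name RoundRobinPool, the internal IPs, and the fake_send function are all made up for illustration. The idea is simply to rotate through a fixed list of hosts and fall through to the next one when a connection fails. Many Elasticsearch client libraries accept a list of hosts and do something along these lines for you, so treat this as an illustration of the concept rather than something you need to write yourself.

```python
import itertools
import threading


class RoundRobinPool:
    """Rotate requests across a fixed list of hosts, skipping hosts that fail."""

    def __init__(self, hosts):
        if not hosts:
            raise ValueError("at least one host is required")
        self._hosts = list(hosts)
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def _next_start(self):
        # Each call starts at the next host in the rotation.
        with self._lock:
            return next(self._counter) % len(self._hosts)

    def call(self, send):
        """Run send(host), trying every host once before giving up."""
        start = self._next_start()
        last_error = None
        for offset in range(len(self._hosts)):
            host = self._hosts[(start + offset) % len(self._hosts)]
            try:
                return send(host)
            except ConnectionError as exc:
                last_error = exc  # remember the failure and try the next host
        raise ConnectionError("all hosts failed") from last_error


# Hypothetical internal IPs on the virtual network (assumed for this example).
HOSTS = ["10.0.0.4", "10.0.0.5", "10.0.0.6"]


def fake_send(host):
    """Stand-in for a real query; pretends one node is unreachable."""
    if host == "10.0.0.5":
        raise ConnectionError(f"{host} unreachable")
    return host


pool = RoundRobinPool(HOSTS)
served_by = [pool.call(fake_send) for _ in range(4)]
print(served_by)
```

In this sketch the second request starts at the unreachable node and quietly moves on to the next one, so the caller still gets an answer. A real client would also want timeouts and some way to stop retrying a node that stays down.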
The platform has now been stable for almost a week, and we are feeling confident that this change addressed the issue. We are actively working with Microsoft to identify the culprit, as this experience has been very concerning for us. As a result of this incident, we have greatly expanded our monitoring and response tools, and we will be better equipped to handle this should it occur again.
Lastly, we have not been awesome at communicating all this to you. We’ve been heads down diagnosing and troubleshooting the issues, but other than a handful of direct emails, you probably haven’t seen anything from us about it until now. This was a mistake, and we’re looking for better ways to communicate this sort of thing. We welcome your ideas.