Your monitoring, instrumentation, and observability systems should be separated from your primary system infrastructure for resiliency and visibility. You already know monitoring is important, but where you host your monitoring matters too: it determines whether you can see real user experience and recover from outages quickly.
When we were first building TrackJS, we built it using Microsoft Azure. It was a good platform, and it allowed us to go fast in the early days. It also taught us a lot about monitoring complex systems.
Like many cloud providers, Azure has built-in tooling for monitoring the systems you build with Azure. They work pretty well, but they have blind spots that you could drive a truck through. Azure’s monitoring is built on the same infrastructure as the rest of Azure, inside the Azure network. When the infrastructure fails, the internal monitoring fails too. When external traffic is blocked, the internal monitoring doesn’t know.
We ran into these problems periodically during our time in Azure (2013-2016), and it prompted us to pull our monitoring outside of Azure. It was a game-changer. We knew how real customers saw our performance and availability. We knew when Azure’s external routing was offline, even before Microsoft did.
I’m not picking on Azure; this problem exists on every infrastructure platform. AWS had an infamous outage that its own status site couldn’t report, because the status site depended on the very infrastructure that was down. GCP has gone offline with the same problems, even recently.
All the major cloud providers, and many other systems, suffer from these blind spots.
To make matters worse, hosting monitoring on the same infrastructure as your system creates dangerous recovery scenarios. Imagine: your system is offline, customers are angry, and you’re trying to figure out what happened. But your monitoring is offline too, so you have no context to see what went wrong. Now you need to spend precious cycles restoring monitoring data just to understand the real problem.
Monitoring systems should be available and queryable even when your primary system is offline. You need important alerts sent as the system starts to fail, and you need to check your instrumentation to understand the failure.
Despite these problems, many systems suffer from these blind spots and failure scenarios because the cost and time savings of reusing shared infrastructure for monitoring are very attractive. Why wouldn’t you just host monitoring on the same web hosting, servers, or networks you already have?
Because you can’t:
- Monitor your datacenter availability & performance from inside the datacenter.
- Monitor your network availability & performance from inside the network.
- Monitor your server availability & performance from inside the server.
- Monitor your website availability & performance from inside the website.
You have to get outside of your system to see it like the users see it. It doesn’t matter what your monitors see if the end-user only sees a 500 error.
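The idea above can be sketched in a few lines: a probe that runs from infrastructure *separate* from the system it watches, and alerts through a channel that doesn’t depend on that system. This is a minimal illustration, not TrackJS’s implementation; the health-check URL and the alert function are placeholders you would replace with your own.

```python
# Minimal external uptime probe. Run this from a host on a different
# provider/network than the system being monitored, so it sees what
# real users see. CHECK_URL and send_alert() are hypothetical placeholders.
from urllib.request import urlopen
from urllib.error import URLError

CHECK_URL = "https://example.com/health"  # hypothetical health endpoint
TIMEOUT_SECONDS = 5


def check_once(url: str = CHECK_URL) -> bool:
    """Return True if the endpoint answers with HTTP 2xx within the timeout."""
    try:
        with urlopen(url, timeout=TIMEOUT_SECONDS) as response:
            return 200 <= response.status < 300
    except (URLError, OSError):
        # DNS failure, connection refused, timeout, or HTTP error status:
        # from the outside, all of these look like "down".
        return False


def send_alert(message: str) -> None:
    # Placeholder: wire this to email, SMS, or a paging service that
    # does NOT depend on the monitored infrastructure.
    print(f"ALERT: {message}")


if __name__ == "__main__":
    # In production you'd run this on a schedule (cron, a systemd timer,
    # or a loop) from the external host.
    if not check_once():
        send_alert(f"{CHECK_URL} is unreachable from the outside")
```

The key design choice is that the probe makes a plain HTTP request over the public internet, so a failure in your datacenter, network, or web tier all surface the same way your users experience them.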
This is a big reason why third-party monitoring services, like TrackJS for your client-side errors, make tons of sense. They sit outside your core system, on separate infrastructure with very different failure modes, so they’re online when you need them.
And because monitoring shouldn’t share your infrastructure, building your own monitoring correctly means provisioning and maintaining separate systems, which is expensive and time-consuming. Especially compared to well-developed online services, like TrackJS.
TrackJS is not hosted on Azure, AWS, or GCP. Instead, we host on our own dedicated hardware. So if you have problems because the cloud is down, TrackJS will still be online, ready to help.