How to take a 7 figure monitoring budget down to 6

METRICFIRE

Jun 11, 2020 ∙ 10 min read

MetricFire Blogger

Table of Contents

Monitoring is a must-have
How and why do large enterprises have a $1 million monitoring budget
The cost of a downtime
The cost of policies and data privacy
The costs of being big
Ways to reduce monitoring costs
How can MetricFire help you?
Where to start?

Great systems are not just built. They are monitored.

MetricFire runs Graphite and Grafana as a fully managed service for growing engineering teams, taking care of storage, scaling, and version updates so your team doesn't have to. Plans start at $19/month, billed per metric namespace rather than per host, and include engineer-staffed support. Integrations work natively with Heroku, AWS, Azure, and GCP, and data is stored with 3× redundancy in SOC 2- and ISO 27001-certified data centres.

Monitoring is a must-have

Today's infrastructure and systems are complex enough that they are hard to manage. Continuous health and performance metrics are crucial to ensuring the stability of your services and infrastructure. They help you solve problems with your software development, releases, deployments, and infrastructure health.

A good time-to-market and a high development velocity depend not only on your product but also on how it behaves under real production conditions. This is where you need a complete monitoring system. The problem is that monitoring becomes increasingly expensive as you grow. The list of companies that reach a seven-figure monitoring budget is long. Yet reducing this budget is possible and within reach.

First, this article will look at how large companies end up with massive monitoring budgets. We will look at why spending that money is important for them, and then also analyze how those costs can be cut down.

You will see how open source monitoring can be great for saving money, but to keep costs as low as possible you'll want to use a hosted open-source service such as MetricFire. To jump ahead, sign up for the MetricFire free trial here.

How and why do large enterprises have a $1 million monitoring budget

Being a large company causes inefficiencies, expensive security policies, and costly downtime. All of these factors contribute to higher budget requirements for monitoring.

The cost of a downtime

If you analyze the market, you'll learn that many major corporations handle all their services on their own. They usually combine legacy and modern technologies to continue serving their customers.

Downtime caused by a software bug, network congestion, an overloaded load balancer, or any other common problem costs, on average, $5,600 per minute.

The report behind that figure also says that 33% of organizations stated that 60 minutes of downtime costs their firms $1-5 million. 81% of participants said that 60 minutes of downtime costs more than $300,000, and 98% said that a single hour of downtime may cost their companies more than $100,000.
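As a rough sanity check on these figures, the per-minute average can be scaled up to a full hour. The sketch below assumes the $5,600 average applies uniformly for the whole outage, which is a simplification, and the function name is purely illustrative.

```python
# Average cost of downtime per minute, as quoted above (USD).
COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimate the cost of an outage, assuming a constant per-minute cost."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    for minutes in (1, 15, 60):
        print(f"{minutes:>3} min of downtime ~ ${downtime_cost(minutes):,.0f}")
```

At that rate, one hour of downtime comes to $336,000, which is in line with the 81% of participants who put an hour of downtime above $300,000.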

Downtime usually has many causes, but often it comes down to a lack of good monitoring. You cannot be proactive without enough data, and this is where monitoring is vital. Companies invest in monitoring to avoid losing money, clients, and reputation.

The cost of policies and data privacy

Big organizations typically have customers worldwide, so they don't want to send any statistics or information to a monitoring platform that they don't control. Nobody wants to take risks with data when they represent several customers. This pushes costs up because you not only have to monitor your systems, you have to do so securely, and you have to do it on your own (without a cloud-based provider). Alternatively, solutions like DreamFactory can help organizations maintain governance and control over sensitive data through self-hosted, secure API access with role-based permissions and identity passthrough.

In addition, some of the monitoring data may include information about users protected by regulations such as the GDPR. The GDPR restricts transferring personal data outside the EU unless specific safeguards are in place, so some types of data may have to stay in EU-based storage. This increases costs because you may be required to switch to an EU-based data storage service, which will likely be more expensive.

Requirements such as these are one of the factors driving costs up at big companies.

The costs of being big

When companies reach a certain level, their internal processes and communication become less fluid and more complicated. The lack of fluidity comes with inefficiency costs as the information becomes spread across teams and less centralized. This has an impact on the cost of development, production, and monitoring.

From a technical point of view, when there are tens or hundreds of teams working on the same product, you'll certainly find many different technologies. Each team owns its stack, and this is also applicable to monitoring technologies. When you find yourself using several monitoring technologies, the total costs will certainly increase.

To achieve production stability, some companies reach a $1 million monitoring budget. That budget has to cover everything from scaling, maintaining, and securing the monitoring system to premium subscriptions and training.

Ways to reduce monitoring costs

Scripting and in-house solutions

You can ask your existing employees to write some scripts and schedule them to run every minute. This can save you money because you will not spend extra money on tools or third-party solutions. However, time is money, and in-house solutions need maintenance and customization with the evolution of your infrastructure and applications.

You may think that keeping custom scripts simple will be easy, but let's look at it more closely. You will probably start with simple custom scripts and deploy them to each of your servers. You will probably script an email alert for whenever a metric goes over a specified threshold.
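A first version of such a script might look like the sketch below. It is only an illustration under stated assumptions: the single disk-usage metric, the 90% threshold, the function names, and the example.com addresses are all hypothetical, and actually sending the email is left as a comment because it depends on an SMTP relay being available.

```python
import shutil
import socket
from email.message import EmailMessage

# Hypothetical thresholds: metric name -> maximum acceptable value (percent).
THRESHOLDS = {"disk_used_percent": 90.0}


def read_metrics(path: str = "/") -> dict:
    """Collect the current readings for this server (disk usage only here)."""
    usage = shutil.disk_usage(path)
    return {"disk_used_percent": 100.0 * usage.used / usage.total}


def find_breaches(readings: dict, thresholds: dict) -> list:
    """Return (name, value, limit) for every reading above its threshold."""
    return [
        (name, value, thresholds[name])
        for name, value in sorted(readings.items())
        if name in thresholds and value > thresholds[name]
    ]


def build_alert(host: str, breaches: list, sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) an alert email describing the breaches."""
    msg = EmailMessage()
    msg["Subject"] = f"[ALERT] {host}: {len(breaches)} threshold(s) exceeded"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        "\n".join(f"{name} = {value:.1f} (limit {limit:.1f})" for name, value, limit in breaches)
    )
    return msg


if __name__ == "__main__":
    breaches = find_breaches(read_metrics(), THRESHOLDS)
    if breaches:
        alert = build_alert(socket.gethostname(), breaches,
                            "monitor@example.com", "ops@example.com")
        # With a reachable SMTP relay, sending would be roughly:
        #   import smtplib; smtplib.SMTP("localhost").send_message(alert)
        print(alert)
```

Scheduled every minute on each server, this covers the simple case, and it also makes the next weakness easy to see: the check runs on the very machine it is watching.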

You have solved the problem, but you now need to deploy the solution to all of your servers. And what if a server goes down? Its monitoring script goes down with it, and you will never know why the downtime occurred. As the saying goes, "bad things come in threes": your monitoring system's uptime should not depend on the server it monitors.

This is when you will probably think about creating an independent central server that performs the same monitoring, just remotely. You solved a problem again, but the challenge here is the dynamic nature of modern infrastructure.
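One common way for a central server to notice that a machine has gone quiet is a heartbeat check. This is not something the approach above prescribes; it is a minimal sketch of one option, and the class name, the 120-second timeout, and the host names are assumptions made for illustration.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Tracks the last time each server reported in to the central monitor."""

    def __init__(self, timeout_seconds: float = 120.0) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def record(self, host: str, now: Optional[float] = None) -> None:
        """Register a heartbeat from a host (the timestamp defaults to the current time)."""
        self._last_seen[host] = time.time() if now is None else now

    def silent_hosts(self, now: Optional[float] = None) -> List[str]:
        """Return hosts whose last heartbeat is older than the timeout."""
        current = time.time() if now is None else now
        return sorted(
            host
            for host, seen in self._last_seen.items()
            if current - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=120)
    monitor.record("web-1", now=1000)
    monitor.record("web-2", now=1100)
    # 150 seconds after web-1 last reported, it is flagged as silent.
    print(monitor.silent_hosts(now=1150))
```

Because the check lives on a separate machine, a server that dies stops sending heartbeats and is flagged instead of silently taking its own monitoring down with it.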

Think about distributed systems, auto scalability, and all the scenarios where the inventory of your servers at a given time is not the same after just a few minutes.

You'll certainly think about adapting your monitoring solution by implementing auto-discovery or building an agent-based monitoring solution. That solves this problem too, but it will not be the last one.
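At its core, auto-discovery means regularly reconciling the servers you are watching against the servers that currently exist. The sketch below shows only that reconciliation step, assuming the list of discovered hosts comes from somewhere else (a cloud API, a service registry, and so on); the function and host names are hypothetical.

```python
from typing import List, Set, Tuple


def reconcile_inventory(monitored: Set[str], discovered: Set[str]) -> Tuple[List[str], List[str]]:
    """Compare the monitored hosts with the hosts that exist right now.

    Returns (to_add, to_remove): hosts that need monitoring set up, and
    hosts that no longer exist and should stop being monitored.
    """
    to_add = sorted(discovered - monitored)
    to_remove = sorted(monitored - discovered)
    return to_add, to_remove


if __name__ == "__main__":
    monitored = {"web-1", "web-2", "worker-1"}
    # A few minutes later, autoscaling has replaced worker-1 and added web-3.
    discovered = {"web-1", "web-2", "web-3", "worker-2"}
    to_add, to_remove = reconcile_inventory(monitored, discovered)
    print("start monitoring:", to_add)
    print("stop monitoring:", to_remove)
```

The set arithmetic is the easy part. Obtaining a reliable list of discovered hosts, and telling a decommissioned server apart from a crashed one, is where in-house solutions tend to get complicated.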

Many of these kinds of solutions exist, but they're challenging to implement. In-house solutions seem to be cheaper, but they have a hundred and one pitfalls.

Using free versions of monitoring tools

The second way to reduce monitoring expenses is to use the freemium versions of monitoring tools. However, free versions will not serve the entire purpose of monitoring, because most of the features required for production use will be missing.

Using open source monitoring tools

The most successful strategy to reduce the monitoring cost is using open source.

It will certainly save you money in the long term, but you have to make a sizeable investment up front. Open-source software needs to be hosted and maintained on a regular basis. You will need to hire skilled engineers to use the tool and keep it running, and as your infrastructure grows, you'll need to grow that team.

One option here is to use a hosted open-source solution such as MetricFire. You can get all of the benefits of open source, without the initial investment or the in-house engineering team.

Only monitoring what matters

This is an unconventional solution, and its consequences may cost you 10x the monitoring budget. Problems are unpredictable, and you cannot solve one without knowing what the real problem is.

To give you an illustration, let's follow the same reasoning: Availability of service and its response time are two important things, and they are what ultimately matters. You decide to solely watch these two metrics.

You may ask your team to develop regular automated health checks that will measure the uptime and the response time. No extra expense is needed, sure, but remember that incidents are different from problems.
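Such a health check could be as small as the sketch below. It assumes a plain HTTP endpoint and treats any 2xx or 3xx status as "up"; the names, the five-second timeout, and the injectable `fetch` parameter (there so the check can be exercised without a network) are illustrative choices.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthResult:
    url: str
    up: bool
    response_seconds: float


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch the URL and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def check_health(url: str, fetch: Callable[[str], int] = http_status) -> HealthResult:
    """Measure availability and response time for a single URL."""
    started = time.perf_counter()
    try:
        up = 200 <= fetch(url) < 400
    except OSError:
        # Connection failures, timeouts, and HTTP errors raised by urllib
        # are all subclasses of OSError; any of them counts as "down".
        up = False
    elapsed = time.perf_counter() - started
    return HealthResult(url=url, up=up, response_seconds=elapsed)


if __name__ == "__main__":
    result = check_health("https://example.com/")
    print(f"{result.url}: up={result.up}, {result.response_seconds:.3f}s")
```

Note what the result contains: whether the service answered and how long it took, and nothing about why.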

By monitoring only "what matters" in this case, you have walked into a trap: you know that your system is not performing as it should, but you don't know how to fix it.

You need to have a broader picture of your system performance and metrics to understand why a downtime or a long response time occurred. Monitoring tools like Grafana and Graphite were built to give you an insightful view of your systems and solve this dilemma.

How can MetricFire help you?

MetricFire is an all-in-one solution for all your monitoring needs. It offers a ready-to-use pluggable and scalable monitoring system.

MetricFire acts as your metrics store, so you don't have to set up, store, back up, or manage your monitoring data. This matters if you want to focus on your core business. Monitoring is your tool, not your goal, and your time is limited. MetricFire hosts your monitoring so that you can spend more time on creating your product and less time monitoring it.

By using MetricFire, you will get the advantage of both open source and managed/hosted systems. We developed a full-scale monitoring platform by using the strength of open-source Grafana and Graphite. Then we made the whole stack accessible, extensible, and highly available.

Integrating your existing infrastructure with our monitoring system is as easy as ABC, and you will be able to choose your server location. MetricFire has servers located in Europe, as well as the US. You don't have to set them up, manage them, or spend a penny on them.
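To give a sense of what sending data to a Graphite-based system involves, the sketch below formats a datapoint in Graphite's plaintext protocol (metric path, value, and Unix timestamp on one line) and sends it over TCP. Port 2003 is the default plaintext port for open-source Graphite; the metric path and function names are made up, and a hosted service may require its own endpoint and some form of authentication, so check the provider's documentation rather than treating this as MetricFire's API.

```python
import socket
import time
from typing import Optional


def format_graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Format one datapoint as a Graphite plaintext line, ending with a newline."""
    if not path or any(ch.isspace() for ch in path):
        raise ValueError("metric path must be non-empty and contain no whitespace")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metric(line: str, host: str, port: int = 2003) -> None:
    """Send a formatted line to a Graphite (Carbon) plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("utf-8"))


if __name__ == "__main__":
    line = format_graphite_line("servers.web-1.cpu.load", 0.42, 1591862400)
    print(line, end="")
    # send_metric(line, host="graphite.example.com")  # hypothetical endpoint
```

Dots in the metric path form the hierarchy that Graphite and Grafana later use to browse and group metrics, so it pays to pick a consistent naming scheme early.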

With all of these advantages, our pricing is amongst the most competitive in the market. We offer options for all sizes of companies. If you compare our prices to other alternatives like Datadog or AWS CloudWatch, you will notice a significant difference.

Whether you are coming from CloudWatch, Datadog, SolarWinds, or New Relic, MetricFire is a more affordable monitoring solution. Unlike many alternatives that fall short in one or more key areas of customer concern, we offer a complete monitoring solution and integrate with the big providers, such as AWS and Azure, as well as many other data sources.

Our engineering team is also dedicated to enhancing our existing features and integrations, and to introducing new improvements to meet your future needs.

Where to start?

Collecting, storing, and visualizing your metrics is an important component of modern monitoring. Being able to know what's going on within your systems, and to trace problems back to their root causes in order to fix them, is crucial.

Open-source Graphite and Grafana are the best choice for setting up a complete monitoring system, but maintaining them will cost you money and time. This is why MetricFire offers the benefits of these open source tools and the comfort of managed platforms - and all of this at competitive prices.

To try out MetricFire, sign up for our free trial here, and build your own dashboards today. If you're looking to talk to our team directly, book a demo and talk with us about how MetricFire can help with your monitoring needs.
