
Choosing a Log Management Vendor: The Complete Guide
By: Ron Miller | Published: January 9, 2024
Sooner or later (usually sooner), you'll want visibility into what your application is doing in production. There are several ways to get that visibility. You might want metrics like CPU utilization and memory usage. Or distributed traces that map requests as they travel between services. Or logs, whether your own or from the services you depend on (databases, message brokers, and so on). More likely than not, you'll end up using all of those. But in this post, let's talk about logs and choosing the right tech for your log management.
There are several major decisions to make when choosing a log analysis tool—forks in the road, so to speak. And after those, you’ll still end up with many solutions to evaluate, but that's just the state of the market. So let’s dive deep, starting with the major decisions:
Decision #1: Self-Host or Use a Managed Service?
Several open-source logging platforms come in both flavors; the most prominent are Elastic Stack and Loki. You can roll out an Elastic or Loki instance on an EC2 instance relatively easily, but you'll have to handle maintenance and scaling yourself. The operational costs could be higher than anything you save in subscription fees, at least at first. But logs tend to grow very fast, and if you have an experienced team, you can absolutely save a lot of money in the long run.
On the other hand, a managed solution is by far the easiest way to get started: no maintenance, automatic scaling, and a cherry on top. But it comes at a cost. Most vendors charge a decent rate per ingested GB, per GB retained, or per instrumented host. Some of the bigger solutions, like Datadog, are notoriously expensive.
A possibly more important consideration than cost is the additional features various vendors offer. Datadog, for example, albeit expensive, is known for a cohesive solution where APM and logging are nicely combined in a single app. Many other solutions offer AI-driven anomaly detection, which you won't get with a run-of-the-mill self-hosted instance. Other tools, like Obics, offer a Copilot-like experience where you use natural language to prompt an AI model trained on your own logs.
Decision #2: Proprietary Instrumentation or OpenTelemetry?
There are many ways of sending data to your logging tool. Many vendors offer proprietary agents that instrument your code and ship logs seamlessly; others provide proprietary packages instead of agents. The problem with both is vendor lock-in. After spending years instrumenting your code for a specific vendor, you're going to be in trouble when you want to switch to another. And the option to switch always matters, whether because of rising costs, deteriorating service, or a competing feature that suddenly becomes important.
Luckily, there's an alternative: OpenTelemetry. If you're not familiar with it, it's an open-source standard for sending logs (as well as traces and metrics), plus a suite of SDKs that instrument your code. Best of all, it's vendor-neutral. OpenTelemetry has gained so much traction that all major third-party vendors now support it.
That said, there are still cases where you'll want a proprietary agent or library: when it provides something OpenTelemetry doesn't, such as profiling capabilities or richer instrumentation.
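To give a feel for what vendor-neutral instrumentation looks like, here's a minimal sketch that routes Python's standard logging module to an OTLP endpoint via the OpenTelemetry SDK. The service name ("checkout") and the endpoint (a local OpenTelemetry Collector on port 4317) are placeholders, and note that the Python logs API is still marked experimental (hence the underscore-prefixed modules), so details may shift between releases:

```python
# Minimal OpenTelemetry logging setup: send stdlib logging records to any
# OTLP-compatible backend. The logs API is experimental, so the module
# paths (the _logs underscores) may change between SDK releases.
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.resources import Resource

# Identify the service emitting the logs ("checkout" is a placeholder).
provider = LoggerProvider(resource=Resource.create({"service.name": "checkout"}))
set_logger_provider(provider)

# Point the exporter at whatever backend you choose; here, a local
# OpenTelemetry Collector. Swapping vendors means changing this one
# endpoint, not your application code.
provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="localhost:4317", insecure=True))
)

# Attach to the standard logging module so existing log calls flow through.
logging.getLogger().addHandler(LoggingHandler(logger_provider=provider))

logging.getLogger(__name__).warning("payment retries exceeded for order %s", "o-123")
```

The point of the sketch is the lock-in story: the only backend-specific detail is the exporter endpoint, while the application keeps using plain `logging` calls.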
Consideration #3: Cost
As I mentioned, log volumes tend to grow very fast. Even if a price point seems small right now because you're ingesting, say, 10GB of logs per day, that number could become 10TB sooner rather than later, at least if you're successful and have to scale. And if you never end up scaling, your logging bill won't be what matters anyway. My advice? Plan for success.
If you choose to self-host, note that an Elastic Stack solution is a storage hog. Indexed data tends to consume 2x to 5x the space of the raw logs, and adding a replica (as you should) doubles that again, so 100GB of logs can balloon to as much as 1,000GB in Elastic. Another option is Loki, which is much more frugal with storage: it skips full-text indexing and relies on a small set of labels instead, though that frugality comes at the expense of search capabilities.
You'll have an even tougher decision if you go with a third-party vendor. Some are very expensive, and prices are often hidden or confusing enough that you won't realize the real cost until you see the bill. As it happens, the biggest vendors, the ones that meet all the compliance standards and SLA requirements, are also the most expensive. It's an unfortunate situation, and customers often end up compromising: aggressive sampling strategies, or being super careful about every new log they add. But with OpenTelemetry and open-source solutions, the open market is gradually turning logging into more of a commodity.
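To make the trade-off concrete, here's a rough back-of-the-envelope sketch. Every number in it (volumes, expansion factors, per-GB price) is a hypothetical placeholder, not a quote from any vendor:

```python
# Back-of-the-envelope logging cost model. All numbers below are
# hypothetical placeholders; plug in your own volumes and vendor quotes.

DAILY_INGEST_GB = 100        # raw log volume per day
RETENTION_DAYS = 30          # how long logs are kept

# Self-hosted Elastic: indexed data commonly takes 2x-5x the raw size,
# and one replica doubles whatever you store.
EXPANSION_FACTOR = 3.0       # assume 3x index expansion
REPLICA_FACTOR = 2           # primary + one replica

storage_gb = DAILY_INGEST_GB * RETENTION_DAYS * EXPANSION_FACTOR * REPLICA_FACTOR
print(f"Self-hosted Elastic storage: ~{storage_gb:,.0f} GB on disk")

# Managed vendor: a simple per-ingested-GB price (hypothetical rate).
PRICE_PER_GB = 0.50          # USD per ingested GB
monthly_cost = DAILY_INGEST_GB * 30 * PRICE_PER_GB
print(f"Managed ingest cost: ~${monthly_cost:,.0f}/month (before retention fees)")
```

Even with these made-up numbers, 100GB/day turns into roughly 18TB of self-hosted disk or a four-figure monthly bill, which is why the "plan for success" advice above matters.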
Consideration #4: Search, Query, and Aggregations
Query capabilities vary somewhat among the major vendors:
- Field-based search (for structured logs) is supported by everyone.
- Full-text search is supported by most vendors (Datadog, Elastic, Splunk, etc.), with Loki being the notable exception.
- Pattern matching/regex search is likewise supported by most vendors (Datadog, Elastic, Splunk) but not by Loki.
- Small joins (lookups) are supported by Splunk and Elastic; Datadog, Loki, and CloudWatch don't support them at all. Azure Monitor and Obics are the only ones that fully support joins.
- Aggregations and group-by are supported by all vendors to some degree (see the sketch after this list). Loki and CloudWatch have more basic capabilities; Elastic is better but can be slow on large datasets; Splunk is better still. Azure Monitor and Obics are the only ones that support real-time aggregations on any dataset (within reason).
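As one illustration of what a group-by aggregation looks like in practice, here's a minimal sketch using the Elasticsearch Python client (8.x style). The index pattern and field names (`logs-*`, `service`, `level`, `@timestamp`) are hypothetical; adjust them to your own mapping:

```python
# Count log events per service, broken down by level, over the last hour.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="logs-*",
    size=0,  # we only want the aggregation buckets, not the raw hits
    query={"range": {"@timestamp": {"gte": "now-1h"}}},
    aggs={
        "per_service": {
            "terms": {"field": "service"},
            "aggs": {"per_level": {"terms": {"field": "level"}}},
        }
    },
)

for bucket in resp["aggregations"]["per_service"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```

Every mainstream tool has an equivalent of this (SPL in Splunk, KQL in Azure Monitor, LogQL in Loki); what differs is how fast and how complete those aggregations are at scale.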
Consideration #5: APM, Metrics, Security
You can think of log management as part of the bigger "observability" umbrella. Under that umbrella there's also APM (Application Performance Monitoring), metrics, synthetic testing, infrastructure monitoring, and plenty more. You might use one tool for all of these or a different tool for each. How much you gain from consolidating depends a lot on the vendor: Datadog is known for tying these products together into a cohesive experience, while other solutions, like Splunk, feel closer to a bundle of separate tools. My only real advice is to avoid vendor lock-in. If you use the OpenTelemetry SDKs, you'll keep the freedom to switch tools when necessary.
Consideration #6: Other Features
Besides the basic features of log ingestion, centralized storage, search, and query, there’s a myriad of other features vendors provide. Here are some:
- AI anomaly detection
- Alerting and notifications
- Dashboards and visualizations
- OLAP analytics for multidimensional data queries
- Role-Based Access Control (RBAC)
- Error monitoring
- Issue tracking
- Integrations with Grafana, Kafka, Jira, etc.
Note that many of these features can come from other third-party solutions. For example, if you like a certain log management tool but it has crappy dashboards, you don’t have to give it up if it integrates with Grafana, which provides amazing dashboards.
Consideration #7: Compliance and Auditing
If you're a big enterprise, or even a small startup looking to meet regulatory requirements (e.g., GDPR, HIPAA), you'll need every part of your supply chain to meet those requirements as well. This is especially true for a log management solution that retains your data. So whatever vendor you're considering, make sure it's easy to verify that they meet the compliance standards you need.