Analytics and OLAP Queries in Obics
Obics is an observability and log management tool that enables real-time analytics. Think about the Elastic stack (ELK), Splunk, Sumo Logic, and Graylog. Those tools at their core allow you to ingest logs, search, filter, and visualize them. You can quickly filter by time, sessionID, TraceID, and other fields. You can also create dashboards and alerts. Obics is all that, but with a big advantage: it allows you to run OLAP queries on your logs.
OLAP queries enable running SQL-like analytics on your logs, metrics, and traces. You can join logs, aggregate, group by, and filter.
You might create a query that counts logs with status code 504 for each day of the week and create a time chart.
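As a sketch, such a query could look like the one below. The StatusCode field name is an assumption for illustration, and toDayOfWeek is assumed to be available (it's a standard ClickHouse function, like the other functions used later in this post).
SELECT count(*) as times, toDayOfWeek(Time) as day
FROM Logs
WHERE StatusCode = 504
AND Time > now() - INTERVAL 7 DAY
GROUP BY day
ORDER BY day
VISUALIZE Timechart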
If you see a spike on a certain day, you can aggregate the number of logs by the exception details and see which exception caused the spike.
Next, you might suspect the error is relevant to a certain tenant, data center, or platform; no problem - you can group by those fields and see if there's a correlation.
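Those follow-up queries are just a GROUP BY away. Here's a sketch; the ExceptionType and Tenant field names are assumptions for illustration, and you would narrow the time range to the day of the spike as needed.
SELECT count(*) as times, ExceptionType
FROM Logs
WHERE StatusCode = 504 AND Time > now() - INTERVAL 1 DAY
GROUP BY ExceptionType
ORDER BY times DESC

SELECT count(*) as times, Tenant
FROM Logs
WHERE StatusCode = 504 AND Time > now() - INTERVAL 1 DAY
GROUP BY Tenant
ORDER BY times DESC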
Now you suspect that the error is related to a certain recent commit. There's probably a log showing the commit ID, so you can join the error with that log and see if they are related.
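A sketch of that last step, assuming each request writes a log whose Message is the running commit ID. The LogId value 'commit_id' is an illustrative name, and the join mirrors the CorrelationId-based join shown later in this post.
WITH commits as
(
SELECT Message as commit_id, CorrelationId
FROM Logs
WHERE LogId = 'commit_id'
)
SELECT count(*) as times, R.commit_id as commit_id
FROM Logs as L
INNER JOIN commits as R
ON L.CorrelationId = R.CorrelationId
WHERE L.Level = 'Error' AND L.Time > now() - INTERVAL 1 DAY
GROUP BY commit_id
ORDER BY times DESC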
You can do all of that and more with Obics, without waiting hours for each query and without exporting a sample to another database. Answers come back in real time: queries run in milliseconds rather than seconds.
Here's an example of a similar scenario. Let's say you come to work in the morning and want to check the system's health. You can run a query to see whether there were any errors in the last 24 hours.
SELECT count(*), toStartOfHour(Time) as hour
FROM Logs
WHERE Level = 'Error' AND Time > now() - INTERVAL 1 DAY
GROUP BY hour
ORDER BY hour
VISUALIZE Timechart
After 57 milliseconds...
Oh boy, there was some kind of spike at 8 PM. First things first: let's find out which service is causing these errors.
Since we have a microservice architecture, each log contains a Service field, so it should be easy to find the culprit.
SELECT count(*), Service
FROM Logs
WHERE Level = 'Error' AND Time > now() - INTERVAL 1 DAY
GROUP BY Service
VISUALIZE Piechart
After 44 milliseconds...
Seems like it's the Checkout service again. Figures, it just went through a big update.
So the picture is getting a bit clearer, but we still don't know what the error was. All error logs in the Checkout service are formatted as JSON with a fixed structure that includes a reason field, so we can group the errors by that field. Let's see the top 5 errors in the Checkout service in the last 24 hours.
SELECT count(*) as times,
JSONExtractString(Message, 'reason') as reason
FROM Logs
WHERE Level = 'Error' AND Service = 'Checkout'
AND Time > now() - INTERVAL 1 DAY
GROUP BY reason
ORDER BY times DESC
LIMIT 5
After 126 milliseconds...
times | reason |
---|---|
267084 | Item is out of stock |
267078 | No items left to checkout |
969 | Invalid access token |
482 | Request timed out |
138 | Session timed out |
'Item is out of stock' and 'No items left to checkout' seem to cause the issue. Since their counts are nearly identical, it's probably the same core issue behind both errors.
Could it be that some popular item is out of stock? That's weird; if an item is out of stock, the user shouldn't be able to add it to the cart.
But we can start by looking at the cart items at the time of the error. Since we have a log that prints the cart items, we can join the error log with the cart log on the CorrelationId field.
There's a log that prints the cart items just before checkout, for example: "Toy car,Basketball,Robot".
The log's code is written like this:
obics.info('cart6', cartItems.join(','));
Note that the string cart6 is the Log ID, which will appear in the LogId field in the logs. While not required, giving your logs unique names is good practice and makes analytics like this easier.
Now you can use splitByString to turn the message into an array, arrayJoin to expand that array into rows, and then join with the error logs. The query would look like this:
WITH products as
(
SELECT arrayJoin(splitByString(',', Message)) as product,
CorrelationId
FROM Logs
WHERE LogId = 'cart6'
)
SELECT count(*) as times, R.product as product
FROM Logs as L
INNER JOIN products as R
ON L.CorrelationId = R.CorrelationId
WHERE L.Level = 'Error' AND L.Time > now() - INTERVAL 1 DAY
GROUP BY product
ORDER BY times DESC
LIMIT 3
I realize that this is a bit of a complex example, but it shows the power of OLAP queries on logs.
This returns the top 3 products that were in the cart when the error occurred. The result looks like this:
After 311 milliseconds...
times | product |
---|---|
267084 | Toy car |
74245 | Deodorant |
65810 | Slippers |
It seems that the Toy car is by far the most common item in the carts where the error occurred.
It's possible that we have a bug in the system that allows users to add the Toy car to the cart even when it's out of stock.
Maybe this product's inventory was depleted, but the new amount wasn't updated in some services. I'll leave the investigation here, assuming our hypothetical engineer figured out the rest.
Intrigued? Try out the sandbox