Analytics and OLAP Queries in Obics

Obics is an observability and log management tool that enables real-time analytics. Think about the Elastic stack (ELK), Splunk, Sumo Logic, and Graylog. Those tools at their core allow you to ingest logs, search, filter, and visualize them. You can quickly filter by time, sessionID, TraceID, and other fields. You can also create dashboards and alerts. Obics is all that, but with a big advantage: it allows you to run OLAP queries on your logs.

OLAP queries enable running SQL-like analytics on your logs, metrics, and traces. You can join logs, aggregate, group by, and filter.
You might create a query that counts logs with status code 504 for each day of the week and create a time chart.
If you see a spike on a certain day, you can aggregate the number of logs by the exception details and see which exception caused the spike.
Next, you might suspect the error is relevant to a certain tenant, data center, or platform; no problem - you can group by those fields and see if there's a correlation.
Now you suspect that the error is related to a certain recent commit. There's probably a log showing the commit ID, so you can join the error with that log and see if they are related.

All of those things and more you can do with Obics, without waiting for hours for each query and without exporting a sample to another database.
With Obics, you get your answers in real-time, which means queries run in milliseconds rather than seconds.

Here's an example with a similar scenario. Let's say you come to work in the morning and want to see the system's health.
You can query to see if there were any errors in the last 24 hours.

SELECT count(*), toStartOfHour(Time) as hour 
FROM Logs 
WHERE Level = 'Error' AND Time  > now() - INTERVAL 1 DAY
GROUP BY hour
ORDER BY hour
VISUALIZE Timechart

After 57 milliseconds...

Timechart

Oh boy, there was some kind of spike at 8 PM. First thing's first, let's find out which service is causing these errors.
Since we have a microservice architecture, each log contains a Service field, and it should be easy to find out.

SELECT count(*), Service 
FROM Logs 
WHERE Level = 'Error' AND Time  > now() - INTERVAL 1 DAY
GROUP BY Service
VISUALIZE Piechart

After 44 milliseconds...

Piechart

Seems like it's the Checkout service again. Figures, it just went through a big update.
So the picture is getting a bit clearer, but we still don't know what the error was. Since all error logs in the Checkout service have a fixed JSON structure, we can group the errors by the reason field.
Let's see the top 5 errors in the Checkout service in the last 24 hours.
Error logs in this app are formatted as JSON and have the reason field, so it should be pretty easy to find out with a query.

SELECT count(*) as times, 
  JSONExtractString(Message, 'reason') as reason 
FROM Logs
WHERE Level = 'Error' AND Service = 'Checkout'
  AND Time > now() - INTERVAL 1 DAY
GROUP BY reason
ORDER BY times DESC
LIMIT 5

After 126 milliseconds...

timesreason
267084Item is out of stock
267078No items to left checkout
969Invalid access token
482Request timed out
138Session timed out

'Item is out of stock' and 'No items left to checkout' seem to cause the issue. Since they have the same count, it's probably the same core issue causing both errors.
Could it be that some popular item is out of stock? That's weird; if an item is out of stock, the user shouldn't be able to add it to the cart.
But we can start by looking at the cart items at the time of the error. Since we have a log that prints the cart items, we can join the error log with the cart log on the TraceId field.

The log that prints cart items just before checkout. For example: "Toy car,Basketball,Robot".
The log's code is written like this:

obics.info('cart6', cartItems.join(','));

Note that the string cart6 is the Log ID, which will appear in the LogId field in the logs. While not necessary, it's a good practice to uniquely name your logs, and allows for easier analytics.

Now you can use arrayJoin to split the array into rows and then join the logs. The query would look like this:

with products as
(
    SELECT arrayJoin(splitByString(',',Message)) as product,
      CorrelationId 
    from Logs
    WHERE LogId = 'cart6'
)
SELECT count(*) as times, R.product as product
FROM Logs as L
INNER JOIN products as R
ON L.CorrelationId = R.CorrelationId
GROUP BY product
ORDER BY times DESC
LIMIT 3

I realize that this is a bit of a complex example, but it shows the power of OLAP queries on logs.
In this case, I will get the top 3 products that were in the cart when the error occurred. The result would look like this:

After 311 milliseconds...

timesproduct
267084Toy car
74245Deodorant
65810Slippers

It seems that the Toy car is the most popular item in the cart.
It's possible that we have a bug in the system that allows users to add the Toy car to the cart even when it's out of stock.
Maybe this product's inventory was depleted but the new amount wasn't updated for some services. I'll leave this investigation alone, assuming that our hypothetical engineer figured out the problem.

Intrigued? Try out the sandbox

Analytics and OLAP Queries in Obics | Obics