Is logging really the 3rd pillar of observability?

Heiko W. Rupp · Published in ITNEXT · Jul 25, 2019 · 6 min read


When people talk about observability, they usually list the following 3 pillars:

  • Telemetry / Metrics
  • (Distributed) Tracing
  • Logging

[Image: log file excerpt]

Classically, with a single-server setup, one could easily log into the server and run grep over the log files. With cloud services, containers or functions that start and shut down quickly, the situation has changed. Of course logs are still captured, often by Logstash or Fluentd, and shipped to Elasticsearch for post-processing and searching, with Kibana on top for visualization. This has created a monster that is pretty hard to set up and keep running (and it uses huge amounts of resources: storage, network bandwidth, I/O and also CPU power for indexing).

There is clearly a need for a more cloud-native way of doing things:

A tweet by Alice Goldfuss is often used to motivate leaner solutions than the ELK/EFK stack, for example Loki.
But even if such a solution is leaner because it does not index the kitchen sink, it still has issues: stack traces may end up as many individual lines, or be interleaved with output from a different thread.
Also, those log entries are still either written by the appenders to a file on disk, from where they need to be grabbed, or written to standard out, from where fluentd, promtail etc. pick them up.

The classical logs also have another issue: the data in a log entry is usually unstructured (see the tweet from Charity Majors). This means that it is hard to find the meaning of individual parts of an entry, or to correlate them with other observability signals on the system that may help you further in understanding its behaviour. And if a message is logged across multiple lines, it is hard for a machine to understand which log lines belong together; it is even worse when several log sources in a multi-threaded service are interleaved in the output, effectively tearing apart entries from a single source.

Note: The Docker json logging driver treats each line as a separate message. […]
From logging in Kubernetes

When you follow that thread, you’ll find more arguments why classical logging is not up to the cloud-native challenge.
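To make the contrast concrete, here is a hypothetical unstructured entry next to a structured (JSON) version of the same event; the field names are only an illustration, not a prescribed schema:

2019-07-25 10:15:03 ERROR OrderService - payment failed for order 4711: timeout after 30s

{"ts":"2019-07-25T10:15:03Z","level":"error","service":"OrderService",
 "event":"payment_failed","order_id":4711,"reason":"timeout","timeout_s":30,
 "trace_id":"abc123"}

In the structured form every part of the message is addressable, so it can be queried and correlated with other signals without fragile string parsing.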

The other inherent problem is that, to make sense of the logs of one service in a distributed system, you also need to look at the other systems around it and correlate the log entries with metrics, data from distributed tracing, or other signals such as thread or heap dumps.
For request-based interactions, a request-id could be used and delivered in traces (as part of the trace-id?), and for logs the request-id or trace-id would also become a field of the log entry.
Unfortunately not many logging libraries can do that out of the box, so one ends up grepping through data by timestamp (and again, finding the request-id / trace-id in an unstructured log entry is hard).
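As a sketch of what that correlation could look like on the JVM (SLF4J with Logback is an assumption here, not something prescribed above), the trace-id can be put into the MDC so that every log line of the request carries it:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class OrderHandler {
    private static final Logger log = LoggerFactory.getLogger(OrderHandler.class);

    // traceId would come from the incoming request headers / the tracing library
    void handle(String traceId, String orderId) {
        MDC.put("traceId", traceId);   // attach the id to this thread's logging context
        try {
            log.info("processing order {}", orderId);
        } finally {
            MDC.clear();               // do not leak the id into unrelated requests
        }
    }
}

With a Logback pattern like %d %-5level [%X{traceId}] %msg%n the id then shows up in every entry, and with structured output it becomes a queryable field.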

Are logs still useful?

I am not saying logs are useless, but the classical way of logging is less useful in the new world than it was in the past. Instead of shipping and storing tons of logs by default, it probably makes more sense to report the number of errors of a service and only store errors or fatals, so that developers can look at them and apply a fix (but see also below). Remember though that you need a “pointer to the original error entry”, as de-aggregating is not possible. And even this is sometimes too much: there are situations where e.g. a third-party library constantly logs a message that can’t be removed. When storing those messages, the system should de-duplicate the entries and only increase a counter of errors thrown. Good old syslog has a mechanism that detects repeated entries and replaces them with a respective message (which fails if two different messages arrive in alternation):

Jul 22 12:22:20 snert awdd[241]: Diagnostics Report
Jul 22 12:22:50 --- last message repeated 3 times ---

Now when you have 10 instances of a service deployed (e.g. scaled to 10 pods), you will most likely see those messages coming from all 10 pods, so some more de-duplication on the log-receiver side is needed.
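A minimal sketch of such receiver-side de-duplication (keying on the raw message text is a simplification; a real system would also need normalization, expiry and the “pointer to the original entry” mentioned above):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class LogDeduplicator {
    // key: message text, deliberately without the pod name, so that the same
    // message from all 10 pods collapses into a single counter
    private final ConcurrentHashMap<String, AtomicLong> counts = new ConcurrentHashMap<>();

    /** Returns true if the entry should be stored, false if only the counter was bumped. */
    boolean offer(String message) {
        AtomicLong counter = counts.computeIfAbsent(message, m -> new AtomicLong());
        return counter.incrementAndGet() == 1;   // store only the first occurrence
    }

    long occurrences(String message) {
        AtomicLong c = counts.get(message);
        return c == null ? 0 : c.get();
    }
}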

Delivery

Another, orthogonal aspect is the delivery of those “log messages”, or rather, events. The old way of writing to stdout and then having a process scrape it is not a good fit. And it does not really matter whether e.g. the Docker daemon wraps each line in JSON.

[Image: Logging architecture in Kubernetes]

The handler on container/pod level grabs one line of stdout/stderr at a time, appends information about time, container and pod, and writes it to a file, from where processes like fluentd or promtail read the data and forward it to Elasticsearch or Loki.
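For illustration, a single line scraped via the Docker json-file driver looks roughly like this (values made up):

{"log":"WARN payment retry for order 4711\n","stream":"stdout","time":"2019-07-25T10:15:03.123456789Z"}

A multi-line stack trace thus becomes many such records, which is exactly the fragmentation problem described earlier.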

It would probably be much better to send those log events (a stack trace being a single event) via some messaging bus to the central “processing unit”, avoiding the scraping step and the line-by-line fragmentation described above.

Unfortunately, unlike the stdout-based method, this is not something one can easily do in a shell script or in many frameworks and languages.
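As a sketch of the bus-based alternative, assuming a Kafka cluster and a log-events topic (both are assumptions, not something mandated here), an application could send a complete event as one record:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogEventSender {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // one event == one record, so a stack trace stays in one piece
            String event = "{\"level\":\"error\",\"service\":\"order\"," +
                           "\"message\":\"payment failed\",\"stack\":\"...\"}";
            producer.send(new ProducerRecord<>("log-events", event));
        }
    }
}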

Tracing to the rescue?

Distributed tracing allows additional information to be attached to a span, which could be log information or, in the error case, stack traces. Tracing each and every request (for log-shipping purposes) would again overwhelm the network and the trace-processing system. One could thus use a modified version of dynamic sampling that keeps the usual 1% or 0.1% of traces, except when a trace contains an error flag, in which case the entire trace is kept. The tracing server would then allow finding traces not only by id or timing, but also by querying for those log entries.
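A sketch of that sampling decision (the 1% base rate and the error flag are assumptions); note that this is a tail-based decision, so it can only be made once the whole trace has been collected:

import java.util.concurrent.ThreadLocalRandom;

public class ErrorAwareSampler {
    private static final double BASE_RATE = 0.01;   // keep ~1% of normal traces

    /** Keep every trace that contains an error, sample the rest. */
    boolean keep(boolean traceHasError) {
        if (traceHasError) {
            return true;
        }
        return ThreadLocalRandom.current().nextDouble() < BASE_RATE;
    }
}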

Unfortunately tracing is not a panacea either, as basic system-level events that are not triggered by a user request will not end up in a trace (span) and are thus not reported: just think of your JVM telling you that it spends too much time in garbage collection.

But what about debugging?

What I have written above is mostly about reducing the overhead of logging and of processing logs. But when you are in production and issues occur, you sometimes want the full firehose of logs from a service at debug or even trace level. This article shows the idea of diagnostic logging in production. The tricky part, especially in the world of immutable container images, is to enable diagnostic logging on the fly without shipping new images or re-instantiating the existing ones with new parameters.
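With Logback (again an assumption; other frameworks offer similar hooks), the level can be raised at runtime without shipping a new image, for example from a small admin endpoint or a config watcher:

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import org.slf4j.LoggerFactory;

public class LogLevelToggle {
    /** Switch a logger (or the root logger) to DEBUG while diagnosing, and back afterwards. */
    static void setLevel(String loggerName, Level level) {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        Logger logger = context.getLogger(loggerName);   // Logger.ROOT_LOGGER_NAME for everything
        logger.setLevel(level);
    }
}

How the toggle is triggered (an HTTP endpoint, a ConfigMap watch, a JMX call) is left open here.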

Call to action

First: give me feedback :-)

In all seriousness, I think it has become clear that the classic way of logging is no longer up to the cloud-native world. To get better again, one has to start augmenting log entries with metadata and do structured logging. This does not imply that a system like Loki or EFK has to index all of that metadata (high-cardinality problem), but having it allows finding and grouping by it. Also, entry data that belongs together should be supplied in one blob. If you logged into different files in the past, make sure you add a filename label that denotes the file.

Conclusion

Structured logging is a pillar of observability, and an important one at that. Not in the old way, where you dump random strings to stdout or a file, but in a more elaborate, structured way that allows easier navigation through the logs.
