Set up your logs for better insight
Make logs first-class citizens in your codebase
April 22nd, 2020
Ever felt like your backend system was a black box (managed by other people) and wondered how your system was doing? Logging can be the thing that opens up that box and gives you insights into your system. This is the main reason why I love logs; they make me “feel” my code, moving it from static text to a live entity.
Logging is one of the primary ways we gain insights into running software systems, but is by no means the only option. Logging should be seen as part of a bigger ecosystem, alongside metrics, active monitoring, the ability to log directly into systems, and applications specific management/observability tools. That said, logging is one of the most versatile methods you have when you need answers to questions which you didn’t know beforehand.
Below is a set of broad topics around logging and how to get a good logging setup. The context is logging for backend systems. A lot can be directly transferred to Apps (mobile and web) and desktop applications, but some aspects will be different.
Read your logs
If you only remember one thing from this blog post it should be to read your logs. All the rest of this doesn't really matter if you don’t use your logs. If you start doing just that, a lot of what I write will be obvious.
View your log as first-class citizens in your codebase. Revisit them, improve, add missing pieces of information, and remove the ones that don’t provide real value or are redundant. When writing a new module, try following it in your logs when it goes to a live system and improve the logs in that module. I often find that half the logs are irrelevant, and the rest can be improved in different ways.
Some use a very elaborate set of log levels. For example, trace, debug, info, warning, error, and fatal. If that works in your context, do just that. I have always only used info, warning, and error.
Everything is info by default.
Error: If something bad happens that does affect the system as a whole, this will be an error, but most often error logs are for unexpected situations, like exceptions. Often the catch part of a try-catch statement will log an error.
Warning: If something bad happens, but doesn’t really affect the system as a whole — like a request to an external system fails (once) — I often make this a warning. It could also be something that is annoying to a user.
Info: The rest will be info logs.
There is a clear direction to the evolution of log levels. They tend to go from error towards warning and info. Many error situations turn out to be just everyday events that you need to handle gracefully, and thus the error log for an unexpected situation becomes an info log that happens every day.
Should you always log all log levels? It depends. If you can afford it, then yes. Having extra debug levels can be a good idea, just remember that you need to iterate over your logs in order to make them good. This is also true for your conditional log levels, which you often would have turned off.
Structured or unstructured
There is a spectrum from structured to unstructured logs, each with pros and cons.
Structured logs are easier to ship to and parse in your log management system and have a better fit for some logging frameworks. They are difficult to use directly without a log management system, which occurs in many situations such as running unit tests, running systems locally, or supporting customer systems.
Unstructured logs will almost always have some structure, but as text and not something formal like JSON. Therefore they tend to be easier to read by humans in their original format, but getting machines to extract the structure takes effort.
I normally use a human-readable format that is also fairly easy to parse, like:
20-04-02T15:37:49.927+0200 [ScalaTest-run-running-ParseXmlTest] INFO c.h.e.f.Data$ - writeSnapshot took_ms=5 targetFile=’/tmp/humiotest4292/global-data-snapshot.json’ size=17
The above can be parsed in all log management systems and is still readable as-is. I value the last quite high, because of the situations where it’s not handy to sign into the log management system.
Some notes on log formats:
Avoid multi-line logs if possible. Java stack traces are notorious for being multi-line and they can be a pain to get right.
You can only choose the log format for new systems. Refactoring existing systems from unstructured logs to structured will almost never be done, so expect to support many log formats.
The traditional way to ship logs to a log management system is to have your logging framework log to a file and then have a log agent ship those to your log management system. This works well for a lot of situations and one really big benefit to file base logging is that when things start breaking down, you can always ssh into your server and tail the logs as a last resort. In this situation having human-readable logs is a bonus. Also, with file-based logging you are in a good situation to not lose any logs if the log management system is unreachable. The log files on disk serve as a fine buffer if you’re having network issues. You need to use a log shipper that can keep track of which logs have been reliably delivered to your logging solution and retry sending those that failed to be delivered.
Note that you should always have log rotation! Getting a disk-full failure by filling the disk with old logs is not something you want.
Another method is to ship directly from your logging framework. This might be easier to set up in some environments and on platforms where services can move around autonomously due to failure or scaling up.
Make sure you know what delivery guarantees your logging framework provides. How many times will it retry sending logs if it cannot deliver them?
Always enable batch sending. You do not want to do one request per log line.
Make sure that your retry-and-batching strategy is finite. You do not want your log framework to give you out-of-memory failures.
Note, if you have multiline logs, the right place to handle those is either in your logging framework or in your log agent. The log management system will not be able to handle this reliably.
What to log
What to log is really dependent on your application, but in general you should add context to your logs and log key lifecycle events for your workflows.
All code belongs to some workflow, where workflow is very broadly defined. Examples are web requests, transactions, services, or batch jobs. You should always log key events and state changes in your workflows, such as initialization, death, errors, key decisions made or key state changes, or interactions with other workflows. Also, add timings and between steps in your workflows. Having an idea as to where time is spent is really valuable.
For adding context to your logs the following is a list of ideas:
Line number - where the log line is
Workflow IDs like request ID or transaction ID
If the code has an ID of some sort it is probably a good idea to log it
If you have a list of something, log its length
Logs vs. metrics
What is a metric compared to a log? Taking a step back, a log is really documenting an event happening. A metric is an aggregation over a set of events, so if you squint your eyes, metrics are aggregations over logs. You might think this is putting too much into logs, but when you work with a log management system you group over logs or chart aspects of logs all the time, which is exactly creating metrics from logs on-the-fly.
Some aspects of logs and metrics:
Logs tend to be high volume while metrics tend to have low volume
Logs tend to have a bursty nature, while metrics have a fixed rate
Logs tend can have high dimensionality while metrics have low dimensionality
Logs tend to have high cardinality, while metrics have low cardinality
Logs tend to be unstructured, while metrics are structured
Logs are often unspecified, while metrics often come with documentation
Some types of information naturally fit as metrics and some as logs, but you should use both. With infinite resources you could capture all metrics at the underlying level of events, but that is often not feasible in terms of money and compute power, so use metrics for situations where you have a lot of underlying events. When your systems grow you might also find that some logs get too numerous. When that happens, you can turn the logs into a set of metrics instead.
Another consideration is that given that metrics are better organized by nature, they are the best fit for key business or operational insights. You might still have the underlying logs/events, but metrics can be easier to share and communicate across organizational boundaries.
The key to high-value logs is to start using them. Logs are part of the interface to your code, not one that users often sees, but one that you yourself will use. As with most things in software, they should be treated as first-class citizens alongside your unit test, your metrics, and your documentation.