[object Object]

Using Humio to Improve Humio

How we used Humio to make Humio upload speeds 50% faster
Morten Grouleff
September 02, 2019

Humio is a great tool for ad-hoc searching and charting from free-form log files. As a developer of the tool, it’s my obvious choice when testing out the real effects of a performance improvement to the existing code base. But really, the feedback loop starts with noticing that something is spending more time than expected.

Humio has a basic built-in profiler of what is going on, by writing thread dumps information on all threads that have anything from the Humio source code on the current stack periodically. Due to the way the Java Virtual Machine (JVM) collects these stack dumps from running threads, it’s not a very accurate profiler, but it does allow you to learn a lot about what the system is actually doing in production, in a non-intrusive manner. Using Humio to inspect the resulting logs is easy, but digging deep into the output requires knowing how Humio works internally. We have put in work to make the thread dumps readable to systems administrators to help them diagnose what is going on.The threads have meaningful names, and methods where threads are parked waiting for work have names such as idleWaitingForDigestWork.

For a specific example of how this helps find the bottlenecks in production,I wanted to improve throughput of the S3 archiving feature, so I triggered a reset of the upload state for a repository, and looked at the thread dumps after it had been uploading for a couple of hours.

Table showing count for thread detection

That table shows a count of how many times a thread was detected as being at the humio class and function listed next to it for threads that also had either the S3ArchingJob or the SegmentSource (class) names as part of their stack. This represents the top humio class and function on the stack of those threads, not the actual top frame which is often a JDK function that does not tell us much about what the system is working on.

The top three entries represent threads parked, waiting for some other thread to complete a (large) task. The fourth is gzip compression of outbound data which is not my focus for this article, but could be a function worth looking into in another article. The fifth (serializeAsNDJSON) represents formatting the events into the selected output format of NDJSON. The three lines following that caught my interest, as knowing their task I would not expect them to be as common as this. Making this call requires insight into what is expected and what is not, which is a skill that can be developed over time by monitoring the system.

I then looked at how those functions were being applied in this context and in other contexts, how they were implemented, and decided I could improve on all of them. Pushing this change to the same cluster at 12:37 and looking at the stats a few hours later confirmed that the change did indeed help. There was a reduced number of samples from three and an increase in the number from serializeAsNDJSON.

Chart showing occurrences of two functions in thread dumps using stacked display of the timechart

That chart shows the relative occurrences of those two functions in the thread dumps by turning on “stacked” display of the timechart. This tells me my change had the desired effect in this case as it reversed the 40/60 distribution to favor “serialize” over “elements”. But did it also improve the throughput of the job of uploading events to S3? Let’s look at the metric for that:

Chart showing improvement of job thoroughput to S3

As this chart shows, yes it did. The previous version uploaded data at a rate of 2.2 TB/day whereas the new version managed 3.3 TB/day. Did this come at the expense of more CPU usage on the system, perhaps?

Chart showing improvement in upload speeds of roughly 50%

Apart from some spikes that I can tell from charts not included here that are due to other events, this confirms that this change did not significantly change the CPU usage. The net result is an improvement in upload speed of roughly 50%.

Applying New Compression Settings

For another example of how we inspect our own Humio using Humio, we switched from the default compression type to the new one that we expect to make the default soon on August 27 around midday. This test cluster has been using the new setting for a few days now, but how does it compare to the old settings? Well, if the input events did not change too much (and they did not) the result is pretty good, as compression ratio went from 7–8 to about 14.

Chart displaying search criteria

As this search is a bit more involved let’s walk though the search criteria.

The first two lines are pure filters that select events (log lines) that we know have the desired information for the graph from the system.

The next two calculate how much extra space is spent on the bloom filter files next to the compressed space, and take into account that those files are not present if they add 50% extra space to the compressed size. This is a common use of the case statement to execute computations conditionally.

Then we do a timechart that calculates the sum of the original_size and the compressed_size including the extra bytes potentially spent on the bloom filters.

The last line then modifies the output of the chart to show the ratio of the two series calculated in the timechart by calculating the ratio and drop the two previous series.

Try this at home!

You can do this too, using the logs for the system you care about. Ship your logs into Humio and get new insights into how that system works.