Mesos and DC/OS Logs in Humio
Having focused our efforts increasingly on additional integrations, we’ve released the first beta version of our Mesos framework. For this first iteration, there’s one very clear goal: forward all task logs to Humio.
October 25th, 2017
This integration comes in addition to our plug-in for Kubernetes and supports both plain Mesos and DC/OS. It’s therefore an important step in realising our goal of providing integrations for the majority of orchestrators.
For the purpose of demonstrating the framework, I’ve installed the Sock Shop Demo into a DC/OS cluster.
To ease the process of setting up the Humio agent, we’ve released it into the DC/OS Universe, which offers a very easy point-and-click wizard. All you need to do is create an account on humio.com with a dataspace and an ingest token.
A feature we’re especially proud of is the framework’s ability to expand and shrink together with the cluster: if you add another node, the Humio agent is installed on that node and is ready to start streaming within seconds.
When installation is complete, the agent starts streaming all logs immediately, so you can straight away search for something like groupby(mesos_service_id) to see all applications.
Currently, all task logs are annotated with the following fields:
Tasks running in DC/OS clusters are also annotated with:
Those are all fairly static fields that can’t be changed. Therefore, we’ve added the ability to configure tasks through Mesos Task Labels.
First of all, a task’s logs can be ignored by setting the HUMIO_IGNORE label to true. Secondly, you can change the log type with the HUMIO_TYPE label by setting it to the name of the Humio parser you want to use.
Finally, Humio offers more advanced configuration of multiline fields with the HUMIO_MULTILINE_ labels. Please see documentation for more details.
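As a rough illustration of the label-based configuration, here is a minimal Python sketch that builds a Marathon-style app definition carrying the HUMIO_IGNORE and HUMIO_TYPE labels. It assumes the tasks are launched through Marathon and that the app's labels end up as Mesos task labels; the helper name, the command, the resource numbers and the parser name "accesslog" are made up for the example and are not taken from the framework's documentation.

```python
import json


def marathon_app_with_humio_labels(app_id, cmd, parser=None, ignore=False):
    """Build a Marathon-style app definition with Humio task labels.

    Hypothetical helper: only the label names HUMIO_IGNORE and
    HUMIO_TYPE come from the article, everything else is illustrative.
    """
    labels = {}
    if ignore:
        # Ask the Humio agent to skip this task's logs entirely.
        labels["HUMIO_IGNORE"] = "true"
    if parser is not None:
        # Name of the Humio parser to apply to this task's logs.
        labels["HUMIO_TYPE"] = parser
    return {
        "id": app_id,
        "cmd": cmd,
        "cpus": 0.1,   # assumed resource values
        "mem": 32,
        "instances": 1,
        "labels": labels,
    }


app = marathon_app_with_humio_labels(
    "/health", "nginx -g 'daemon off;'", parser="accesslog"
)
print(json.dumps(app, indent=2))
```

The printed JSON could then be posted to the scheduler in whatever way you normally deploy apps; a noisy task would instead be built with ignore=True.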
When maintaining a Mesos cluster, it’s very useful to know what’s going on with the tasks in your cluster, e.g. whether tasks are failing. The Mesos agent writes a lot of interesting things to its log file. Searching for @source="/var/log/mesos/mesos-agent.log" will reveal the whole log on all agents in the cluster, which still leaves you with a bit of a needle in a haystack. To find updates on task status, you can search for “Received status update”, pick out the task status and optionally the task name too, and finally plot it in a time chart:
@source="/var/log/mesos/mesos-agent.log" "Received status update"
| regex("update TASK_(?<task_state>\S+?)\s")
| regex("for task\s(?<task_name>\S+?)\s")
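To see what the two regular expressions pick out, here is a small Python sketch that applies the same patterns to log lines outside Humio. Python spells named groups as (?P<name>...) rather than (?<name>...), and the sample log lines are invented for illustration; they only mimic the general shape of a Mesos agent status update and are not copied from a real cluster. The sketch also makes no attempt to reproduce how Humio itself treats events that fail to match.

```python
import re
from collections import Counter

# Same patterns as in the query, with Python's named-group syntax.
TASK_STATE = re.compile(r"update TASK_(?P<task_state>\S+?)\s")
TASK_NAME = re.compile(r"for task\s(?P<task_name>\S+?)\s")


def extract_status(line):
    """Return (task_state, task_name) for a status update line, else None."""
    if "Received status update" not in line:
        return None
    state = TASK_STATE.search(line)
    if state is None:
        return None
    name = TASK_NAME.search(line)
    task_name = name.group("task_name") if name else None
    return (state.group("task_state"), task_name)


# Hypothetical sample lines, shaped like Mesos agent log output.
sample_lines = [
    "I1025 10:15:02.1 1234 slave.cpp:1] Received status update TASK_FAILED "
    "(UUID: 0000) for task health.abc123 of framework demo",
    "I1025 10:15:03.2 1234 slave.cpp:1] Received status update TASK_RUNNING "
    "(UUID: 0001) for task health.def456 of framework demo",
    "I1025 10:15:04.3 1234 slave.cpp:1] Some unrelated agent message",
]

updates = [s for s in map(extract_status, sample_lines) if s is not None]
state_counts = Counter(state for state, _ in updates)
print(state_counts)
```

Counting the extracted states per time bucket is essentially what the time chart does with the task_state field.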
It’s definitely very interesting to dig into what happened the other day when roughly 150 tasks failed in a very short time. Just hit “Event list” to reveal the underlying events, find the interesting ones, and then narrow down the time span by selecting a shorter span in the Time Line plot.
Having uptime services that ping you if your system is down is very helpful, but you should also check up on them. Personally, I have a simple nginx deploy exposed to the public via marathon-lb.
The nginx is deployed as /health, which is conveniently picked up by the Humio agent as the mesos_service_id field. UptimeRobot identifies its client with a string containing “UptimeRobot”, so combining the two and piping the result into a time chart should hopefully reveal something like what you’re seeing below:
mesos_service_id=/health UptimeRobot | timechart(mesos_slave_id)
Although there are a few gaps here and there, it certainly looks like there’s a very even distribution of requests across all nodes in the cluster.
Having run through our newest integration, it’s time to try it out yourself for free. For more information on how to get started, here's a thorough guide. Humio is available both as an on-premise installation and as a managed cloud service.
Thanks to Peter Mechlenborg.