Improve application monitoring with modern log management
Microservices have changed everything. To be more specific, the challenges of managing microservices have changed everything. Whether talking about massive application complexity, the dynamic aspect of services, or the melding of infrastructure and software platforms into wholly contained tech stacks, the days of using a single tool to monitor application performance are over.
At the same time, the IT stack is becoming more intertwined with the entire business operation. Non-IT individuals now use application information to answer critical business questions specific to their teams, like seeing how the network or databases are supporting a widely-used app, knowing the status of the app to answer customer support questions, or knowing what features are being used to better support marketing efforts.
This “Brave New World” of applications means that understanding the entire value proposition of applications requires the best of both worlds: Application Performance Monitoring (APM) tools and log management solutions working together to answer questions for every stakeholder.
Observability of the entire infrastructure
Dashboards and alerts
Root cause investigation and analysis
Search across all data
Data retention and long-term storage
Observe app components
App dashboards and alerts
Dependency & flow mapping
Better root cause investigation from all data
Consolidated dashboards with app data and related events
Longer app data retention
With microservices and cloud applications come new challenges
Understanding the impact of microservice architectures and the technology stack on IT Operations teams requires a little understanding of the application technology timeline, and of why microservices represent both an expected and unexpected shift in the stack.
As Java rose to prominence in the enterprise application world, the lack of production visibility became a major issue, especially as banking, insurance, and telco providers began rolling out massively-scaled (at the time) applications for their customers.
Thus entered the first generation of APM tools, built on the ability to inject bytecode into the running applications to monitor component-level performance.
Over time, a new breed of enterprise applications emerged: Service Oriented Architecture (SOA) applications. This paved the path for a second generation of APM solutions that could deal with the distributed nature of SOA applications, adding “maps” to the vocabulary of IT operations. But even as an entire new generation of tools emerged, the core technique remained the same: inserting monitoring bytecode into applications.
The growth of microservice applications over the last four years has brought unprecedented change to how IT organizations build, deploy, and operate applications. Three changes stand out:
A large percentage of the microservices that make up applications run no custom code that can be instrumented.
Each developer (and development team) has the capacity to make their own platform and infrastructure decisions on everything from language choice to selecting the specific database and messaging services to include.
Dynamism — constant change in application paths — continues to grow, making it more difficult to capture and understand the dependencies within an application environment.
A third generation of APM tools has surfaced to help monitor these new, highly distributed, and dynamic applications. But the three changes discussed above also make it more difficult for APM users to deliver the back half of APM value: solving problems when they occur.
This guide outlines how Log Management solutions and APM tools are perfect complements for each other. We share the following 5 steps as a way to get started, with more detailed information later in the page.
5 Steps to use log management to strengthen APM
Connect APM and log management.
Modern versions of these tools have the ability to directly connect operationally — in some cases as easily as simply pushing a button from one to jump to the other.
Challenges in managing microservice apps
Why is it so difficult to manage microservice applications? While the nature of microservice technology creates difficulties in managing applications, there’s also an organizational shift that must be dealt with to meet the ultimate needs of the business with their applications — the need to move fast.
About the only thing that’s constant in today’s application operations is that there is continuous change. High-performing organizations no longer measure their application operations by the number of releases per year — instead, they focus on the number of updates per day. This continuous delivery model delivers higher quality and higher performance overall, but it creates all kinds of problems for monitoring, especially since the biggest challenge of application management is the need to understand context.
Of course, regardless of how dynamic an environment is, who gets involved in application operations, or how often software is updated, application complexity is the biggest roadblock to all stakeholders getting the most out of their apps.
How can individuals (or even teams) possibly see the operations, relationships, and dependencies of all the entities running in an application environment?
How can they begin to understand and interpret how millions of pieces of information indicate whether an application is running well, and how users are being served?
The days of an app server admin and a chief architect being the only two people involved with application monitoring are long gone. The need for speed encompasses the entire organization – developers, architects, operations, QA – even application business owners. After all, if you’re creating the ability to quickly adjust to market forces in your application, then those team members tasked with monitoring the market must be stakeholders in overall application success (which is more than just performance).
APM solutions rely on their monitoring agents to get the data into the system for analysis and reporting. But there are plenty of pieces of data – both configuration and operational – that exist in logs but can’t be brought into the system via a monitoring agent. Log management solutions can access more data from specific platforms than APM monitoring agents can get, including network issues, database connections or availability, or information about what’s happening in a container that the app relies on.
Correlating data from multiple sources by timestamp can be difficult. It’s not that APM tools can’t sync on time; rather, correlation breaks down when the additional data isn’t laid out on the same time frame as the deep application monitoring data. That additional data often provides the critical details for solving issues, especially those associated with configuration or platform dependencies.
Siloed monitoring solutions
The same categories of data from APM (time sequences, configuration information, updates, performance issues, resource usage, etc.) are also the hallmarks of other monitoring tools (network performance monitoring, server monitoring, deep user monitoring, etc.). Often, one affects the other, and the cause of an issue under investigation may not be collected by the APM. The best-situated platform for looking at all monitoring data in aggregate is the log management solution. By looking at data from the APM or log management alone, it may take longer to discover the cause of a performance issue. By bringing other operational details to bear, log management can take things to the next level.
But wait! There’s more!
Using an APM in isolation leaves out other operational tools that produce useful logs – from PagerDuty to Jenkins. All of them have timestamps, warnings, alerts, and other messages. They help to understand just how the application is running, and how it is operating in other parts of the infrastructure. The only common denominators between this completely-unconnected set of tools are logging and timestamps.
Mean Time to Repair
The measure of APM tools
In addition to the typical measurements Dev and Ops put on their applications, there’s an APM-specific metric that is used to determine how effective your APM solution is: Mean Time to Repair (MTTR).
Mean Time to Repair is the average time that’s required to find and repair a failed component or device.
Mean Time to Repair is usually broken down into two separate components:
Mean Time to Find: the average time it takes to isolate where a problem occurred within the complex architecture or infrastructure of the application.
Mean Time to Fix: the average amount of time it takes to debug the problem, once isolated, and take action that corrects the issue.
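As a rough illustration of the breakdown above, the following Python sketch computes Mean Time to Find, Mean Time to Fix, and overall MTTR from a list of resolved incidents. The `Incident` record and its field names are assumptions made for this example, not the output of any particular APM or ticketing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """One resolved incident; the field names are illustrative."""
    detected: datetime   # when the problem was first reported
    isolated: datetime   # when the failing component was identified
    resolved: datetime   # when the fix was confirmed


def mttr_breakdown(incidents):
    """Return (mean time to find, mean time to fix, MTTR) as timedeltas."""
    find = mean((i.isolated - i.detected).total_seconds() for i in incidents)
    fix = mean((i.resolved - i.isolated).total_seconds() for i in incidents)
    return (
        timedelta(seconds=find),
        timedelta(seconds=fix),
        timedelta(seconds=find + fix),
    )
```

Tracking the two components separately makes it visible whether the time is going into locating the problem or into correcting it.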
While the overall MTTR is used to judge the efficacy of any APM solution, the tools themselves tend to focus primarily on the Mean Time to Find side of the equation:
Mean Time to Find focuses on the collective strengths of APM: tracing problematic requests, following those requests to specific components, and isolating the piece of the stack and service map causing the slow-down, outage, or hang.
The IT skills needed to find problems are distinctly different from those needed to fix them, and they map better to APM users, who tend to be broad ops personnel.
Expert personnel tend to only be available to solve problems after having proved that the problem exists within the scope of their responsibilities.
APM tools talk about MTTR, but they focus most of their features and energy on finding problems, not fixing them. So, what’s needed to actually fix application problems?
The typical process is to isolate the exact location and type of the problem component, then get the developer or service owner involved to do the deep dive required to understand the exact cause of the problem, and then do the work to solve it.
This is where the wider lens of log management can help drastically reduce MTTR. In fact, the process for fixing usually includes a collection of several tools, including log management.
Log management and APM: better together
Log Management solutions and APM tools are perfect complements for each other — operating on adjacent technology layers. APM solutions optimize analysis of specialized data to answer a discrete set of questions about applications, while log management tools use less specialized but more comprehensive data, and a user interface designed for a broader set of questions.
A bonus of this complementary pair is the fact that log management tools can ingest data from APM solutions, making APM data available for broader analysis capabilities.
Modern log management adds value to APM
Choose the right log management solution
The nature of applications and application troubleshooting means that not every log management solution can be brought to bear on microservice applications. There are at least three absolute requirements for a log management solution to be truly helpful to a complex APM tool, especially when dealing with microservices.
Unlimited data ingestion
With microservices there is exponentially more data than with monolithic or SOA applications. On top of the individual stack data, there’s also data available from the applications, and each request can have a unique path through the infrastructure. Trying to guess what pieces of data to include for analysis isn’t just a difficult proposition, it’s practically impossible. This is why microservice application monitoring solutions put so much emphasis on automation and mapping: there’s too much for an individual (or team) to take in and understand.
In a world where time is the arbiter of success, the need to index data on the way in and query indexes for analysis simply gets in the way of advanced data analysis. And as situations change, the next query will have to build on the last. Just one troubleshooting session could incorporate dozens of queries. If streaming data can be collected without being restricted to defining the schema up front, there is much more freedom to explore relationships later. And when a search is easily generated and results come in instantly, it encourages the user to ask more questions and explore further.
Real-time data and streaming.
Yes, this is technically two different items, but they’re related enough to think of them together. As organizations move from a few software releases a year to dozens every day, the need for immediate feedback is greater than ever. The only way to effectively assist the ops team to keep their service levels up is to provide data in near real time. The best way to do that is to stream data from the source and process it without delaying for indexing.
For the best results, look for a modern log management solution optimized for speed and efficiency. Look for these hallmarks to find the best high-throughput, low-cost log management system.
Invest in modern log management tools that are fast, flexible, efficient, and easy to use.
Capacity to ingest and store all data required
Fast search with near-zero latency from ingest to being searchable
Streaming data ingest
Index-free technology for real-time ingest, free-text searches, and optimized storage
Architecture for speed, efficiency, and flexibility
Affordable license fees that scale predictably as data requirements grow
Easy-to-use free-text search
Data enrichment to augment raw data, including joins from multiple data sources
Dashboards updated in real time
Flexible visualization capabilities
Data compression for efficient storage and data transfer
Long-term retention and storage using inexpensive cloud storage
Resilient design that doesn’t require extensive ongoing maintenance
Self-hosted or SaaS
5 steps to troubleshoot apps with log management and APM together
One critical aspect of microservices and container-based applications is the percentage of service platforms that aren’t running custom code. Instead, they provide a critical function to the overall operation — including database, security, storage, and messaging, to name a few.
Troubleshooting and problem-solving in this environment requires the ability to watch the interactions between systems, as opposed to simply stepping through custom code. These are the steps for isolating the root cause of microservice application problems in these distributed environments.
A unique aspect of the relationship between application performance monitoring and log management is that one can ingest the data from the other. That’s just one connection, though. Modern versions of these tools have the ability to directly connect operationally — in some cases as easily as simply pushing a button from one to jump to the other.
If you have a set of tools that can integrate, the first thing you should do is to perform the steps needed to actually get that connection active. If you do not use tools with this direct integration, you should still set up the APM export/log management import to get that application data into the log management analysis.
In the APM, send notifications for distribution.
To have the APM send events to your configured log router, configure an Alert Channel WebHook to send events to your Fluentd/Logstash HTTP endpoint. Next configure Alerting to send the selected alerts through the previously configured Alert Channel. For testing, set Events to “Alert on Event Type(s)”, select “All Types” and set Scope to “All Available Entities”. As long as there is activity in the application environment that is monitored, those events will be propagated to the configured logging aggregator.
Push APM events to the log management system using a data shipper like Fluentd or Logstash.
Logstash and Fluentd can be installed natively or run inside a container, including inside a Kubernetes cluster such as GKE. If a log shipper is already used in the environment, it may be easiest to ship APM events through it directly. Many log management solutions can also accept events through an HTTP Event Collector (HEC) interface.
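For environments that send events over an HEC-style interface rather than through a shipper, a minimal Python sketch of wrapping one APM event in an HEC-style JSON envelope might look like the following. The endpoint URL, the token, the `sourcetype` label, and the `Bearer` authorization scheme are all assumptions for illustration; check your log management documentation for the exact endpoint and header it expects.

```python
import json
import urllib.request


def build_hec_request(url, token, apm_event, event_time):
    """Wrap one APM event in an HEC-style envelope and return an unsent request.

    url and token are placeholders supplied by the caller; the envelope
    fields follow the common HEC layout of "time" plus "event".
    """
    body = json.dumps({
        "time": event_time,          # epoch seconds
        "sourcetype": "apm-event",   # illustrative label, not a required value
        "event": apm_event,          # the APM event payload as received
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Assumed auth scheme; some collectors expect a different prefix.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

# To actually send the event:
#   urllib.request.urlopen(request, timeout=10)
```

Building the request separately from sending it keeps the payload easy to inspect while the integration is being tested.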
For the Fluentd container, take the base image and add a few extra bits, using a Dockerfile like this:
FROM fluent/fluentd:v1.4.2-debian-2.0
USER root
RUN apt-get update && \
    apt-get install -y build-essential ruby-dev
RUN fluent-gem install fluent-plugin-elasticsearch && \
    fluent-gem install fluent-plugin-elasticsearch-timestamp-check
Once the new Docker image is built and pushed to a repository, it's easy to spin it up inside the Kubernetes cluster. Start from a set of Kubernetes deployment descriptors for Fluentd, edit them, and substitute the appropriate values. Examine the configuration map for the fluentd.conf file to see the input configuration for the webhook endpoint, and the output configuration to push the events into log management with the Elasticsearch bulk API. That's all there is to it.
For the Logstash container, you may be able to use the one from Docker Hub without modification. Get the Logstash.yaml deployment files, and edit them with the appropriate values. The logstash-config file is very similar to the Fluentd file, just with a different syntax.
Events on the APM dashboard
APM events on the log management dashboard
Some people want to go right to the context of the issue at hand: if there is trouble in the database, they focus in on the database and then start the debugging or resolution process.
In reality, most situations that require an APM or log management solution to solve them aren’t quite as simple as “the problem’s in the database.”
The first thing that needs to be done is to set the time frame around which all your analysis will occur. The idea for the timeframe is to set it small enough to not overload with too many events, but not so small that you miss the event that caused the problem.
Start with an initial time frame of 30 minutes prior to the notification of the problem – this allows for double the reporting period for most legacy APM solutions.
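The starting window described above can be sketched in a few lines of Python. The function names are hypothetical, and the 30-minute default simply mirrors the suggestion in the text; widen or narrow it as the investigation requires.

```python
from datetime import datetime, timedelta


def investigation_window(notified_at, lookback=timedelta(minutes=30)):
    """Return (start, end) for the initial search window before a notification."""
    return notified_at - lookback, notified_at


def in_window(event_time, window):
    """True if an event timestamp falls inside the (start, end) window."""
    start, end = window
    return start <= event_time <= end
```

For example, a notification at 12:00 yields a window from 11:30 to 12:00, and only events inside that range are pulled into the analysis.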
By integrating the APM tool with your log management solution, the time frame selection is automatic, either drawing from the time frame of the APM tool at the time of solution jump, or directly determined by the tool.
The APM tool should have been able to isolate the problem down to a specific application server, service, database, or back-end system that is the cause of the problem (whether a hung transaction, scalability issue, or application slowdown). The result could be a specific machine or server, a cluster of services, or a cluster of systems. With the time frame and the offending system, you have what you need to get to the next step in the resolution process: figuring out what the actual cause of the problem is.
APM solutions integrated with log management will be able to provide detailed context to automatically dive into a potentially problematic component.
Case Study
An APM solution saw that end-users were getting slow response times. The tool identified a database server as the source of the problem. When the database performance was examined, everything showed green-light service levels. A savvy operator pointed out that there was another dependency behind the database – the actual storage. That was when it was discovered that the “network pipe” designated for delivering the data for requests was too small for the amount of data being requested.
This is a good example of why hidden systems are critical to solving issues. It also shows how three dashboards (App Code, Database Server, Storage Array) may show green lights, but users experience “red-light service.”
The first thing to do is to analyze the service, pod, cluster, server, and machine to see what other downstream dependencies there are. But beware — just because other dependencies exist doesn’t mean they’re the problem, so don’t stop until the problem is isolated.
There are going to be two primary “true root causes” of application issues, especially in microservice and containerized environments:
Direct result of change
Resource depletion by “outside” source
If you’re not specifically tagging change events as changes, then you’ll need to create a list of messages that act as proxies for a change event and collect those via filters into your workbench. Don’t forget to tag configuration updates and changes as change events.
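One way to build such a list of proxy messages is a small set of patterns applied to each log line. The Python sketch below is only an illustration: the patterns are hypothetical examples and should be replaced with the messages your own systems actually emit.

```python
import re

# Hypothetical proxy patterns for change events; adjust to your own logs.
CHANGE_PATTERNS = [
    re.compile(r"configuration (reloaded|updated|changed)", re.IGNORECASE),
    re.compile(r"deploy(ed|ment) (started|finished|completed)", re.IGNORECASE),
    re.compile(r"image pulled", re.IGNORECASE),
]


def tag_change_events(lines):
    """Yield (line, is_change) for each log line, tagging likely change events."""
    for line in lines:
        yield line, any(p.search(line) for p in CHANGE_PATTERNS)
```

In practice the same patterns would be expressed as saved filters or tags in the log management tool, so that change events can be pulled into the workbench with a single query.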
Start with configuration change events
Configuration changes shouldn’t be popping up left and right, so if you see any config changes, immediately be suspicious. With config changes, you’ll want to do three things:
Check with the SysAdmin to make sure changes were intended
Compare to the golden master of config images
If in a cluster, see if the config change is across the entire cluster or a subset
Typical pattern recognition and analysis takes over from here. If the problems only occur on parts of the cluster that have been changed, then there’s a good chance you’ve found the culprit. Now, digging into the logs, you can find the source of the change – another user, a runbook, a program – and pass the ticket to the appropriate admin for further correction.
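The golden-master comparison and the cluster-versus-subset check can be sketched as follows. This is a minimal Python example that assumes the configuration of each node is available as text; the function names and the "none", "subset", "all" labels are illustrative choices.

```python
import hashlib


def fingerprint(config_text):
    """Hash a config file's contents so copies can be compared cheaply."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def classify_drift(golden_text, node_configs):
    """Compare each node's config to the golden master.

    node_configs maps node name -> config text.
    Returns (scope, drifted_nodes), where scope is "none", "subset", or "all".
    """
    golden = fingerprint(golden_text)
    drifted = sorted(
        name for name, text in node_configs.items()
        if fingerprint(text) != golden
    )
    if not drifted:
        scope = "none"
    elif len(drifted) == len(node_configs):
        scope = "all"
    else:
        scope = "subset"
    return scope, drifted
```

A "subset" result is the interesting case: if the failing requests line up with the drifted nodes, the changed configuration is a strong suspect.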
Look for resource depletion next
One aspect of containers and microservices is that services and application components are relying on shared resources following a set of rules for resource requests and consumption. But in today’s massively diverse environments, it’s entirely possible that other consumers of resources do not follow those same rules.
A prime example of this would be an application service running on orchestrated containers, while other requesters to the host or server are making requests outside of orchestration. Resources can be depleted, meaning that none are available for the application component – but as far as the orchestration infrastructure (e.g., Kubernetes) is concerned, there are still plenty of resources left.
In this case, the scenario will be that the application could have been saved by a simple request for another instance of the container, but since Kubernetes believes there are still resources left for the already running container, it won’t make the request.
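A simple way to spot this mismatch is to compare what the host reports as used with what the orchestrated containers account for. The Python sketch below assumes you can collect both numbers (for example, memory in MiB) from host and container logs or metrics; the data layout and the 25% threshold are assumptions for illustration.

```python
def unaccounted_usage(host_used, orchestrated_used):
    """Usage on the host that the orchestrator does not know about."""
    return max(host_used - sum(orchestrated_used.values()), 0)


def depletion_suspects(hosts, threshold=0.25):
    """Return hosts where untracked consumers hold more than `threshold` of capacity.

    hosts maps host name -> dict with "capacity", "used", and "containers"
    (container name -> usage), all in the same unit.
    """
    suspects = {}
    for name, h in hosts.items():
        extra = unaccounted_usage(h["used"], h["containers"])
        if extra / h["capacity"] > threshold:
            suspects[name] = extra
    return suspects
```

Hosts returned by `depletion_suspects` are the ones where a consumer outside orchestration is likely competing with the application component.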
But even outside of orchestrated containers, rogue applications can deplete resources needed for the application service or component to operate properly.
Finally, it’s important to note that “resources” being depleted might not be “on the host” that the service is running on – while it certainly could be memory, disk, threads or other O/S and container resources, it’s just as easy (if not easier) for a rogue application to abuse a back-end resource like a database server or a storage array access pipe. The result is the same – slow, hung, or error-prone requests on the primary app.
Now, unless you’re the luckiest person in the world, chances are you won’t see a change, followed immediately by the added latency, although it’s not impossible that it could happen. A more likely scenario is that a change in the system started consuming resources at an elevated rate – until resources got locked, disappeared, and then had an escalating impact on related requests.
Look for software updates
After examining the first two possibilities, it’s time to look at changes to the service or microservice itself. Again, you might find a change that immediately results in a slowdown, but the more likely scenario is that a change occurred and performance, availability, or resources deteriorated or were depleted over time.
There are two types of update events that the log management tool would get from other systems:
Continuous integration and continuous delivery (CI/CD) tooling like Jenkins (or other versioning systems) should automatically insert their change events into your log management.
APM tools built to handle microservices should include some service update events.
Since you have your timeframe – and your system focus – and you’ve ruled out resource depletion from a rogue app or process, you’re looking for the changes that introduced a new bug or performance issue into the transaction path.
For performance impact, you’re looking for a ramp or spike upward. For poor resource management or rogue usage, you’re looking for ramps and spikes downward in the systems, with the start of those ramps occurring at the time of the update.
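To make the "ramp or spike starting at the update" idea concrete, the following Python sketch compares the average of a metric before and after an update timestamp. The function name, the sample layout, and the 1.5x ratio are assumptions for illustration; it checks the upward case, so for available-resource metrics you would look for the inverse ratio.

```python
from statistics import mean


def shift_after_update(samples, update_time, min_ratio=1.5):
    """Compare a metric before and after an update.

    samples is a list of (timestamp, value) pairs, e.g. response times.
    Returns (before_mean, after_mean, flagged), or None if either side
    of the update has no samples.
    """
    before = [v for t, v in samples if t < update_time]
    after = [v for t, v in samples if t >= update_time]
    if not before or not after:
        return None
    b, a = mean(before), mean(after)
    # Flag only upward shifts of at least min_ratio; a zero baseline is skipped.
    flagged = b > 0 and a / b >= min_ratio
    return b, a, flagged
```

A flagged result does not prove the update is at fault, but it tells you which update to examine first.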
When you find the updated code that created the problem (chaos), roll it back to the previous version, watch the monitor to make sure the problem is gone (a step that gets forgotten), then assign it to dev so they can profile their code and fix the bug that was introduced.
You are now a Level 5 application problem solver!
One thing is certain – the only tool that can cross every chasm within operations is the log analysis solution, and only if it can operate at the speed and scope of the other solutions. Only Humio has unlimited ingest and real-time streaming that teams need to operate their log management at the same pace as all their other monitoring and management solutions, especially those in continuous delivery environments.
Think AEM: Application “Effectiveness” Monitoring.
It’s not about monitoring application performance – it’s about managing application effectiveness. A broader set of stakeholders actually need something different. They’re more interested in how applications are delivering their promised (or desired) value – to customers, to partners, to sales teams, and to the business. That’s “application effectiveness.” That’s the value proposition that you get from putting the right application architecture, APM solution, and Analytics solution together for your entire team.
Correlate across multiple event streams.
Once you have the APM events streaming into log management, you can correlate across multiple event stream sources. For example, if you use Jenkins as your CI/CD delivery pipeline to automate builds and deployments into Kubernetes, it is possible to correlate deployment events from Jenkins with service quality events from the APM to verify that new deployments do not have a negative performance impact. In some cases, the APM event includes a contextual deep link back to the APM dashboard, enabling you to start root cause analysis immediately.
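A correlation like the Jenkins example above amounts to joining two event streams on service name and time. The Python sketch below assumes both streams have already been parsed into dictionaries with `time` and `service` keys; those key names and the 15-minute window are illustrative, not fields defined by Jenkins or any APM.

```python
from datetime import datetime, timedelta


def correlate(deployments, apm_events, window=timedelta(minutes=15)):
    """Pair each deployment with APM events on the same service within `window`.

    Both inputs are lists of dicts with "time" (datetime) and "service" keys.
    Returns a list of (deployment, [related APM events]) for deployments
    that were followed by at least one event.
    """
    matches = []
    for dep in deployments:
        related = [
            ev for ev in apm_events
            if ev["service"] == dep["service"]
            and dep["time"] <= ev["time"] <= dep["time"] + window
        ]
        if related:
            matches.append((dep, related))
    return matches
```

In a log management tool the same join would typically be written as a query, but the logic is the same: same service, event shortly after deployment.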
Supercharge APM trouble tickets.
If you’re using trouble tickets to hand off issues from the APM to log management, make sure that the IP address or server name of the APM-identified system is in the ticket, so that anyone can pick up the work at any time in the future.
Think like a hub.
Airlines run a hub-and-spoke flight operations system – it allows them to be more efficient by running all passengers through a small collection of hubs. In a complex application system, every service (machine, server, network device, etc.) is its own hub, whether architected to an official hub or not. There is a set of requests coming into the service for work to be done – upstream requests. There are requests made by the service for work it needs from other systems – downstream requests.
Isolating the true root cause of a problem requires the ability to analyze an application component from both perspectives. Where are requests coming from – and what resources are they consuming? What requests are being made by the component to other services or components – and what is the status and latency of those requests?
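The two perspectives can be sketched as a small summary over request records. This Python example assumes each request has been parsed into a dictionary with `source`, `target`, and `latency_ms` keys; those names are hypothetical and would come from whatever fields your logs provide.

```python
from collections import defaultdict


def hub_view(requests, component):
    """Summarize traffic into and out of one component.

    requests is a list of dicts with "source", "target", and "latency_ms".
    Returns {"upstream": {caller: (count, avg latency)},
             "downstream": {callee: (count, avg latency)}}.
    """
    buckets = {"upstream": defaultdict(list), "downstream": defaultdict(list)}
    for r in requests:
        if r["target"] == component:
            buckets["upstream"][r["source"]].append(r["latency_ms"])
        if r["source"] == component:
            buckets["downstream"][r["target"]].append(r["latency_ms"])
    return {
        side: {peer: (len(v), sum(v) / len(v)) for peer, v in peers.items()}
        for side, peers in buckets.items()
    }
```

Reading both sides together shows whether a component is slow because of what is being asked of it or because of what it is waiting on.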
Complete the circle.
Don’t stop when the root cause is found; complete the circle. Once you’ve isolated the issue, flip the focus of the search. Go back up the stack to see what else that issue might be affecting. Is the same issue causing problems with other apps? Is there something they need to do to mitigate any problems?
Start a journal.
In case you can’t solve the problem on the first occurrence, log the time and situation for each time it occurs. This brings you one step closer to finding and fixing the problem. The journal will help you as you search for the pattern around the issues.
Getting started using an APM with Humio
The Humio log management platform is lightning fast, flexible, and built to scale – all at an affordable price. Integrating data sources between Humio and Instana is useful because DevOps, IT Ops, and Security professionals need many types of data and information to optimize their applications and speed up software development. Correlating APM performance data with log data helps teams build better software faster.
The following steps show how to configure Instana APM to work with Humio.
We invite you to see how Humio’s modern architecture redefines what is possible with log management. Request a free live demo to find out how Humio can help your organization improve the quality of app development and reduce infrastructure monitoring costs with modern log management.
Set up a Humio free 30-day trial. See for yourself how Humio can enhance the value of application performance monitoring.
Join our Slack channel: meethumio.slack.com
You’ll also find lots of useful information on the Humio blog, and informative talks and demos on the Humio YouTube channel. To hear from Humio developers, customers, and partners, listen to our podcast series: The Hoot.
Humio's log management platform offers the lowest total cost of ownership, industry-leading unlimited plans, minimal maintenance and training costs, and remarkably low compute and storage requirements. Humio is the only log management solution that enables customers to log everything to answer anything in real time — at scale, self-hosted or in the cloud. Humio's modern, index-free architecture makes exploring and investigating all data blazing fast, even at scale. Founded in 2016, Humio is headquartered in London and backed by Accel and Dell Technologies Capital.