In this age of instant gratification, the digital experience can make or break a business. Downtime or subpar performance can cause customer attrition and decreased productivity, leading to lower revenues. Slow truly is the new broke, and so IT professionals must become even more attuned to the overall health of data centers to solve issues quickly and proactively identify problems that impact end-users’ experiences and businesses’ bottom lines.
In a fitting example of parallelism, the data center echoes how a business handles adjusting and adapting to external market forces. As trends like cloud, virtualization, hybrid IT, converged infrastructure and more continue to transform traditional IT management and troubleshooting, it’s no longer enough to say, “I know how and where my gear is located and connected, and I have platforms from which I can pull metrics when needed.”
Instead, you need to know how your infrastructure responds to the external stimuli of real users. You need tools that help you understand which combinations of inputs produce performance effects on discrete elements in your environment. Can you say, for example, that on the second Tuesday of every month, when a backup runs, a misconfigured, concurrent map-reduce job for Big Data is colliding with backup traffic on the storage network? This kind of knowledge is required to untangle performance metrics in today’s environments, and so you must have a fundamental understanding of how these disparate pieces of infrastructure and technology work together to best deliver services, and quality of service, that satisfy end-users. With the rate of change in today’s data center, that’s a tall order.
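To make the backup example concrete, here is a minimal Python sketch of how that kind of collision might be spotted from job records. The `JobRun` type, the job names, the resource labels and the times are all hypothetical, invented for illustration; a real monitoring platform would supply its own schedule and traffic data.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta
from itertools import combinations


@dataclass
class JobRun:
    """One execution of a scheduled job (hypothetical record shape)."""
    name: str
    resource: str      # the shared element the job loads, e.g. a storage network
    start: datetime
    end: datetime


def second_tuesday(year: int, month: int) -> date:
    """Return the date of the second Tuesday of the given month."""
    first = date(year, month, 1)
    # date.weekday() numbers Monday as 0, so Tuesday is 1.
    days_to_first_tuesday = (1 - first.weekday()) % 7
    return first + timedelta(days=days_to_first_tuesday + 7)


def colliding_runs(runs):
    """Return name pairs for runs that share a resource and overlap in time."""
    hits = []
    for a, b in combinations(runs, 2):
        same_resource = a.resource == b.resource
        overlap = a.start < b.end and b.start < a.end
        if same_resource and overlap:
            hits.append((a.name, b.name))
    return hits


# Illustrative data only: a backup and a map-reduce job on the same network.
day = second_tuesday(2024, 1)
runs = [
    JobRun("monthly-backup", "storage-network",
           datetime.combine(day, time(1, 0)), datetime.combine(day, time(4, 0))),
    JobRun("mapreduce-etl", "storage-network",
           datetime.combine(day, time(3, 30)), datetime.combine(day, time(5, 0))),
    JobRun("log-rotation", "local-disk",
           datetime.combine(day, time(3, 0)), datetime.combine(day, time(3, 10))),
]
print(day, colliding_runs(runs))
```

The overlap test is deliberately simple: it only says that two jobs were active on the same resource at the same time, not that one caused the other's slowdown. It is a starting point for deciding which metrics to look at next.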
Why? Because even beyond the ultimate end-goal of monitoring for and deciphering key performance metrics, modern troubleshooting presents even the most seasoned of IT professionals with several hurdles, starting with what kind of environment you’re managing. If you’re tasked with overseeing a traditional data center, you’ve more than likely selected the technology and infrastructure systems (or you’re probably only a phone call away from that person, even if they’ve since moved on). It’s much easier to generate data for these systems, because your IT department has agreed on a common set of standards for gathering those metrics, and monitoring tools are typically very mature.
On the other hand, if you’re at an organization working either partially or entirely in the cloud, IT departments typically take a back seat: business leaders often choose the service providers, so administrators are playing catch-up and getting to know someone else’s technology. Cloud-service providers also deliver a relatively new service and are much more focused on rapidly developing features and functionality than on developing monitoring capabilities. DevOps helps to bridge this gap by enabling higher-level integration of tools, but, unfortunately, it’s quite rare to find an integrated tool that has the breadth to monitor everything, from traditional enterprise technology all the way to something as abstracted as containers.
Another major hurdle is simply culling through the seemingly endless number of data points that even one monitoring tool can generate, let alone the multiple tools many organizations deploy simultaneously. For IT professionals, more data is not necessarily better. Certainly, the more metrics you have, the more visibility you have, but it’s also a much larger data set to manage. The single largest problem that IT troubleshooters face is identifying that single point of truth amidst all the noise. The key is identifying and leveraging the right data, and getting only what you need when you need it.
To help streamline troubleshooting, there are a handful of performance metrics you should always monitor, including:
– Percent of capacity used (requires an understanding of what your base capacity is and metrics that report how much you’re using)
– Quality of service at the endpoint
– Network performance across the Internet
– Component-performance metrics in a composite application (i.e., can you discretely monitor the relevant performance of each of those components?)
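The first metric in the list above is simple arithmetic once you know your base capacity. As a rough sketch, assuming hypothetical resource names, readings and an 80 percent alert threshold (none of which come from any particular tool), it might look like this in Python:

```python
def percent_capacity_used(used: float, base_capacity: float) -> float:
    """Percent of capacity used, given usage and base capacity in the same unit."""
    if base_capacity <= 0:
        raise ValueError("base capacity must be positive")
    return 100.0 * used / base_capacity


def nearing_capacity(readings, threshold_pct: float = 80.0):
    """Return (name, percent) for readings at or above the threshold.

    The 80 percent default is an assumption for illustration, not a standard.
    """
    flagged = []
    for name, used, base_capacity in readings:
        pct = percent_capacity_used(used, base_capacity)
        if pct >= threshold_pct:
            flagged.append((name, round(pct, 1)))
    return flagged


# Hypothetical readings: (resource, amount used, base capacity).
readings = [
    ("datastore-a", 750.0, 1000.0),
    ("datastore-b", 920.0, 1000.0),
    ("uplink-1", 4.0, 10.0),
]
print(nearing_capacity(readings))
```

The calculation is only as good as the base-capacity figure behind it, which is why that number needs to be known and kept current.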
However, when it comes to troubleshooting and understanding performance metrics in today’s data center, the ability to easily and quickly collaborate with your peers across silos is arguably the fastest route to resolution. Investigating a potential issue to the extent of your ability and/or responsibility, and then distributing your findings widely via a team-communication platform, ensures that no one else responds to the alert until you’ve had a look, and the same goes for the other teams involved.
Look for tools that allow you to selectively send metrics and problem requests to the next person or team in the chain and, ideally, include details about the metrics and the troubleshooting steps you’ve already attempted. Just to be clear, I’m not talking about a ticket system that logs notes and changes owners. I mean a system that lets you build a set of metrics that tell a specific story, and then share that story. Using the shared metrics and reporting capabilities of this tool, the second responder can immediately see what you’ve already addressed and what the most likely root causes of the problem are. Now, this second administrator can investigate additional metrics that relate to their specific domain, such as virtualization. They might discover that it’s a noisy-neighbor problem and move some virtual machines around to create additional capacity.
Unfortunately, most organizations deploy disparate monitoring tools in each operating silo, and cross-platform sharing and collaboration are difficult, let alone correlation of metrics (another must-have). However, as the cloud continually drives data-center convergence, look for comprehensive monitoring and management tools that allow your IT department to easily find that single point of truth to more effectively manage and troubleshoot problems.
Beyond filing away the handful of metrics most likely to indicate performance problems and cultivating a more collaborative environment, consider the following best practices as additional guidance for navigating the tangled web of modern data-center troubleshooting.
– Don’t panic. If you are broadly measuring your infrastructure—meaning you have invested time and resources into deploying a monitoring tool or system that ensures you have a significant collection of metrics and data at your disposal—you will be able to solve the problem.
– Experiment. Play with your data. Pull it into a view where you can juxtapose metrics you haven’t considered side by side before. Take wild guesses. You’ll start to more quickly make associations that uncover root cause.
– Befriend the machines. Rely on automated-context discovery wherever possible. Physical topology is one thing, but many monitoring platforms can now go a step further and determine a logical topology and interconnections between applications. Ask: Which application sits on which server, accessing which database, stored on which disk of which LUN? Then, when you start mixing and matching data and exploring juxtaposed metrics, you don’t have to do it off the top of your head. Use the automated-context discovery that your monitoring solution provides as a starting point.
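The last two practices can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the `TOPOLOGY` records stand in for whatever an automated-discovery tool might export, and every application, server, disk and LUN name, as well as the metric samples, is hypothetical. The Pearson correlation is one simple way to juxtapose two metrics, not the only one.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical dependency records: which application sits on which server,
# accessing which database, stored on which disk of which LUN.
TOPOLOGY = [
    {"app": "billing", "server": "srv-01", "database": "billing_db",
     "disk": "disk-3", "lun": "lun-7"},
    {"app": "reports", "server": "srv-02", "database": "reports_db",
     "disk": "disk-4", "lun": "lun-7"},
    {"app": "intranet", "server": "srv-03", "database": "cms_db",
     "disk": "disk-9", "lun": "lun-2"},
]


def apps_sharing(topology, layer):
    """Group applications by a shared element at one layer (e.g. 'lun')."""
    groups = defaultdict(list)
    for record in topology:
        groups[record[layer]].append(record["app"])
    # Keep only elements that more than one application depends on.
    return {key: sorted(apps) for key, apps in groups.items() if len(apps) > 1}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length metric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Discovery narrows the guess: these two applications share a LUN...
print(apps_sharing(TOPOLOGY, "lun"))

# ...so juxtapose their metrics (illustrative samples, same time slots).
billing_latency_ms = [2.0, 18.0, 20.0, 3.0]
reports_io_mbps = [10.0, 200.0, 210.0, 15.0]
print(round(pearson(billing_latency_ms, reports_io_mbps), 3))
```

A high correlation between two applications that share a LUN is a lead worth following, not proof of root cause. The value of the discovered topology is that it tells you which pairs of metrics are worth putting side by side in the first place.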
With the rate of data-center change showing no sign of slowing down, streamline and simplify your performance monitoring and troubleshooting processes by investing in next-generation, comprehensive tools that enable cross-team collaboration and by leveraging the above best practices. The bottom line is that a fundamental understanding of how your infrastructure elements, on premises and in the cloud, work together to deliver services is critical to solving issues quickly and to proactively identifying problems that impact end-users’ experiences before they happen.
Leon Adato is a SolarWinds Head Geek and long-time IT systems management and monitoring expert. Before he was a SolarWinds Head Geek, Adato was a SolarWinds user for over a decade. His expertise in IT began in 1989 and has led him through roles as a classroom instructor, courseware designer, desktop support tech, server support engineer, and software distribution expert.