Since the early 1990s, IT operations teams have deployed IT management tools to overcome severe performance and availability problems caused by change and configuration. Yet these tools were not designed to deal with the complexity and dynamics of the modern data center. They supply IT operations with plenty of raw data but little insight or actionable information. Teams are now overwhelmed by the volume, velocity and variety of change and configuration data, and change and configuration problems remain a chronic pain for IT operations.
It’s time to recognize the chronic change and configuration challenges for what they truly are: “big data” problems that can only be solved with an IT operations analytics approach. This approach stands to end the 15 years of chronic change and configuration challenges by applying powerful analytics to the overwhelming change and configuration management data, turning this data into clear, actionable insights. Enterprises that have already implemented IT operations analytics solutions report significant cuts in their mean time to repair (MTTR), fewer incidents, less downtime, and smooth, error-free releases.
IT operations analytics are a response to the three primary drivers of chronic change and configuration challenges:
Complexity. Over the years, each layer of technology in the data center has become dramatically more complex to control and manage. The average server carries environments with tens or even hundreds of thousands of configuration parameters. For example, the Windows OS contains between 1,500 and 2,500 configuration parameters, IBM WebSphere Application Server has 16,000, and Oracle WebLogic has more than 60,000. If any of these parameters is misconfigured or omitted, the error can dramatically impact IT operations. A growing interdependence between applications also makes it increasingly difficult to manage and control all business services.
This scenario has been made clear in numerous well-publicized outages. For instance, in April 2011, Amazon Web Services suffered a devastating event that knocked some of its big-name customers, including Reddit, Foursquare, HootSuite and Quora, offline for as much as four days. Amazon released a detailed postmortem about the outage and identified the culprit: a network configuration error made during a network upgrade.
In a recent report, industry analyst firm Forrester declared, “If you can’t manage today’s complexity, you stand no chance managing tomorrow’s. With each passing day, the problem of complexity gets worse. More complex systems present more elements to manage and more data, so growing complexity exacerbates an already difficult problem. Time is now the enemy because complexity is growing exponentially and inexorably.”
Dynamics. For IT Operations, change is a fact of life, taking place at every level of the application and infrastructure stack and impacting nearly every part of the business.
To keep pace, enterprises have adopted agile development processes to meet business demands for accelerated application release schedules, employing practices such as continuous integration and continuous build, and pushing hundreds of changes into production on a daily basis. For example, eBay has described having about 35,000 changes per year. According to a 2011 IBM survey, an estimated 50 to 75 percent of data centers run outdated system configurations.
Silos. Most organizations do not have a single authority that owns end-to-end environments for application management. Typically, applications run on different physical and virtual systems that communicate across networks, which in turn may include internal and external segments with limited visibility. While there are tools for BTM (business transaction management), APM (application performance management), SM (service management), and service desk, they are each focused on handling their particular scope of metrics and data in their own process silo, lacking broad and deep visibility into the overall IT environment.
Existing tools are not designed to deal with this big data problem. Given the complexity of IT systems, the dynamics of IT operations, and multiple teams working in silos, IT operations needs not only to automate but also to collect data down to the finest details, analyze all changes, and consolidate the information to unify the various operations silos. None of the traditional tools has done this, because none has approached the situation as a “big data” problem.
A New Approach: IT Operations Analytics
Now IT operations analytics are emerging, as Gartner recently reported. These new IT analytics tools take different perspectives on the abundant data and complexity confronting operations teams.
For IT operations, managing the configuration of multiple environments still feels like a nuisance. Between applications, environments, and individual instances, mistakes and unauthorized changes happen, demanding that IT ops spend time managing configuration values.
IT operations analytics tools use mathematical algorithms and other innovations to extract meaningful information from the sea of raw change and configuration data. Some of the most common analytics technologies available are statistical pattern-based analysis, event correlation analysis, heuristics-based analytics, and log analysis.
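As a rough illustration of what statistical pattern-based analysis can mean for configuration data, the sketch below flags servers whose value for a parameter deviates from the value most of the fleet shares. It is a simplified example under stated assumptions, not a description of any vendor's algorithm: the server names, the parameter names, the `find_outliers` function and the 80 percent agreement threshold are all hypothetical.

```python
from collections import Counter


def find_outliers(fleet, min_agreement=0.8):
    """Return (server, parameter, value, majority_value) tuples for servers
    that deviate from a value shared by at least min_agreement of the fleet."""
    findings = []
    # Collect every parameter name seen on any server.
    params = sorted({p for cfg in fleet.values() for p in cfg})
    for param in params:
        values = Counter(cfg.get(param) for cfg in fleet.values())
        majority_value, count = values.most_common(1)[0]
        if count / len(fleet) < min_agreement:
            continue  # no strong consensus, so there is no pattern to deviate from
        for server in sorted(fleet):
            value = fleet[server].get(param)
            if value != majority_value:
                findings.append((server, param, value, majority_value))
    return findings


# Hypothetical fleet snapshot: one server has a different thread limit.
fleet = {
    "web-01": {"max_threads": 200, "jvm_heap_mb": 4096},
    "web-02": {"max_threads": 200, "jvm_heap_mb": 4096},
    "web-03": {"max_threads": 200, "jvm_heap_mb": 4096},
    "web-04": {"max_threads": 200, "jvm_heap_mb": 4096},
    "web-05": {"max_threads": 50, "jvm_heap_mb": 4096},
}

for server, param, value, expected in find_outliers(fleet):
    print(f"{server}: {param}={value} (most of the fleet uses {expected})")
```

Real tools would weigh far more signals than a simple majority vote, but the idea of letting the data itself define what "normal" looks like is the same.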
The value of IT Operations Analytics comes through when applied to many of the common use cases in IT operations, such as:
• Incident management. MTTR is woefully high in most organizations. Analytics can dramatically reduce the time to respond to incidents and even feed efforts to prevent incidents from occurring in the first place. For instance, when an incident occurs today, IT operations starts a race against time to sort through the sea of dispersed data in an attempt to figure out “what changed” since the system was last working fine, and what caused the incident. IT Operations Analytics transforms this process by automatically analyzing all changes that occurred since the system was last working fine, applying pattern- and statistics-based algorithms to identify the root cause of the incident.
• Problem management. Very similar analytics technologies help those involved in problem management to identify a root cause, or at least a probable cause.
• Change management. IT operations analytics technologies will prove invaluable in performing a sanity check to determine the probability of success before any change is executed.
• Configuration management. IT operations analytics can detect discrepancies from desired configuration (drift) and reduce risk to environment stability.
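The drift detection described in the configuration management use case can be sketched in a few lines. The example below compares a desired configuration with the values actually found on a system and reports every discrepancy. It is a minimal sketch, assuming configurations can be represented as flat key-value maps; the `detect_drift` function, the `MISSING` marker and the parameter names are hypothetical.

```python
MISSING = "<missing>"  # marker for a parameter absent on one side


def detect_drift(desired, actual):
    """Return {parameter: (desired_value, actual_value)} for every
    parameter whose actual value differs from the desired one."""
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical desired state and a snapshot taken from a running system.
desired = {"pool.size": 20, "timeout.sec": 30, "tls.enabled": True}
actual = {"pool.size": 20, "timeout.sec": 5, "debug": True}

for param, (want, have) in detect_drift(desired, actual).items():
    print(f"{param}: desired={want}, actual={have}")
```

The same comparison, run between the last known good snapshot and the current state, gives a starting list for the "what changed" question in incident management.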
With so much at stake, IT operations analytics can end these chronic change and configuration challenges. The enterprises that have implemented IT operations analytics solutions report significant cuts in their response times, fewer incidents, less downtime, and smooth releases.
Sasha Gilenson is CEO of Evolven Software, a provider of IT operations analytics. Prior to Evolven, Sasha spent 13 years at Mercury Interactive. He studied at the London Business School and has more than 15 years of experience in IT operations.