The world’s generating mountains of machine data: Web servers, database systems, networking gear and telecommunications systems relentlessly log every activity. And there’s gold in them thar hills. For Web-based companies, mining that gold is as much a necessity as an opportunity.
Edmunds.com has been a purveyor of car pricing information since 1966. In 2006 the company transitioned to Web-only content delivery. The company’s core asset, its pricing data, was readily accessible to anyone—including illicit content scrapers.
Like many information providers, Edmunds.com had to balance the need to protect its data with the need to minimize the legal risks that come with blocking IP addresses. The company’s information security and legal departments were pulling in different directions.
At the same time, a reorganization of the company’s Edmunds.com and Insideline.com websites more than doubled the number of servers, and the company’s home-grown log analyzers couldn’t scale with the infrastructure. More than three years ago, Edmunds.com started using Splunk, a tool that aggregates data from server and application logs, analyzes the data and serves it up in readily comprehensible visualizations.
Splunk tracks 50,000 events per minute on Edmunds.com. These events generate 75 to 80 gigabytes per day of machine data. Being able to collect, manage and visualize this type of data on this scale has made Splunk a popular choice for companies looking to secure their networks and minimize errors and downtime.
Before Edmunds.com began using Splunk, the company’s legal department hesitated to allow the IT staff carte blanche in blocking abusive site visitors, said John Martin, senior director of production engineering. Splunk made it easy to show the lawyers that network operations was targeting IP addresses that were generating inordinately large numbers of requests in a given amount of time. “We were able to show just how on the mark we were,” he said.
This accuracy gave Edmunds.com the confidence to give the network operations team the authority to use a blacklist, said Martin. The company was able to reduce the number of poaching crawlers by 80 percent. “The process even got automated,” he said.
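The article doesn't show Edmunds.com's actual detection logic, but the core idea it describes (flagging IP addresses that generate inordinately many requests in a given time window) can be sketched in a few lines. The log format, threshold, and function names below are illustrative assumptions, not the company's implementation:

```python
from collections import Counter

# Hypothetical access-log lines of the form "<ip> <method> <path>".
# The threshold is illustrative, not Edmunds.com's actual value.
REQUEST_THRESHOLD = 1000  # max requests allowed per IP in one time window

def abusive_ips(log_lines, threshold=REQUEST_THRESHOLD):
    """Return IPs whose request count in the window exceeds the threshold."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return sorted(ip for ip, n in counts.items() if n > threshold)

# Example: one scraper hammering the site, two visitors behaving normally.
logs = (["10.0.0.5 GET /pricing"] * 1500
        + ["10.0.0.7 GET /reviews"] * 40
        + ["10.0.0.9 GET /specs"] * 12)
print(abusive_ips(logs))  # ['10.0.0.5']
```

In practice a tool like Splunk would run the equivalent aggregation continuously over indexed log events and feed the resulting list into a blacklist, which is the automation Martin describes.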
Splunk uncovered all sorts of useful information lurking in the log data, said Martin. “There was a lot we could do with that info outside of IT,” he said.
Dashboards for Decision Makers
The software generates dashboards that give line-of-business managers and C-level executives real-time views of the business’ performance. “The type of information that they were after was actually in our Web server logs,” said Martin. “It only made sense to present those folks with a visualized version of the data, and Splunk dashboards quickly filled that need,” he said.
Splunk was also instrumental in a major Web redesign. Edmunds.com’s IT staff created a dashboard for the manager of the redesign that made it possible to test the redesign with a portion of the site’s visitors. “We wanted to see in real time as we were making changes how users were reacting,” said Martin.
In April, Splunk made waves as the first big data initial public stock offering. Big data is a label for data analytics performed on very large data sets. Much of the excitement about big data is centered on the potential for big gains in science, finance, healthcare and business intelligence.
Splunk started out as a log analyzer for IT, and its core market is still IT. However, the company is eyeing a broad range of applications. “Splunk’s evolution is into use cases that combine log data with other kinds of data, such as information about individual users whose actions are populating the logs,” said Curt Monash, an analyst and principal of Monash Research.
In November, the company released a beta version of a connector that moves data between Splunk and Hadoop. The ultimate goal is to allow users to create Splunk dashboards for Hadoop Distributed File System (HDFS) data. Edmunds.com is planning to use that capability, said Martin. “Our ability to visualize the data within HDFS will become far easier,” he said.
Splunk’s technology is based on machine learning techniques that find patterns and spot anomalies. Splunk includes a search language that makes it possible to correlate data across structured and unstructured sources.
In January, Splunk released Splunk 4.3. The update includes capabilities that make the software more appealing for non-IT uses like business intelligence. The company replaced Flash with HTML5 for its visualizations, which makes Splunk usable on a wider range of mobile devices, notably the iPad. The new version also includes a backfill feature that automatically grabs recent historical data to provide context for real-time data. It also adds live sparklines—those small in-line-with-text line charts—to search results tables and dashboards.
A simplified free version of Splunk supports up to 500 megabytes of data per day. Splunk Enterprise pricing starts at $6,000 for a 500 megabyte-per-day perpetual license including first-year support, or $2,000 for a one-year subscription including support.
One of the unadvertised benefits of Splunk is controlling the cost of the tool itself. Splunk licenses are based on volume of data indexed. Edmunds.com used Splunk to find and reduce the noise in its log files, which reduced the volume of data Splunk processed from 125 gigabytes per day to 75 to 80 gigabytes, said Martin.
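The article doesn't say which log lines Edmunds.com classified as noise, but the volume-reduction tactic it describes (dropping low-value lines before they reach the licensed indexer) can be sketched as a simple pre-index filter. The patterns below (health checks, debug chatter) are illustrative assumptions only:

```python
import re

# Hypothetical noise patterns: lines that add indexing volume without
# analytic value. These patterns are illustrative, not Edmunds.com's.
NOISE_PATTERNS = [
    re.compile(r"GET /healthcheck"),  # load-balancer probes
    re.compile(r"\bDEBUG\b"),         # developer debug chatter
]

def strip_noise(lines):
    """Drop log lines matching any known noise pattern before indexing."""
    return [ln for ln in lines
            if not any(p.search(ln) for p in NOISE_PATTERNS)]

lines = [
    "10.0.0.5 GET /pricing 200",
    "10.0.0.1 GET /healthcheck 200",
    "app1 DEBUG cache warmed",
    "10.0.0.7 GET /reviews 200",
]
print(strip_noise(lines))
# ['10.0.0.5 GET /pricing 200', '10.0.0.7 GET /reviews 200']
```

Because Splunk licensing is priced on daily indexed volume, trimming lines like these is what let Edmunds.com cut its indexed data from 125 gigabytes per day to 75 to 80 gigabytes.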
Eric Smalley is a freelance writer in Boston. He is a regular contributor to Wired.com. Follow him on Twitter at @ericsmalley.