Whether it’s to store massive amounts of data or to quickly run applications like fraud detection, recommendation engines, address book synchronization or website traffic analysis, companies have found many different uses for the open-source platform Hadoop.
The software and its open-source sister projects, including Pig and Hive, have made a major splash in IT shops looking to tackle huge data storage and processing challenges while managing costs. There is no single way to implement or run Hadoop.
Built by engineers at Yahoo and based on the MapReduce technology created by Google, Hadoop moved to the Apache open-source license in 2007, and version 1.0.0 was released in December 2011. (Version 1.0.3 was released in May.)
Companies like MapR, Cloudera and Hortonworks have enterprise versions, offering training and support for Hadoop along with their own innovations to the platform. Database companies, analytical platform providers and business intelligence companies, from industry giants like IBM and Oracle, to the leading cloud vendors like Amazon Web Services, have created corporate partnerships with Hadoop distributors, connecting their own products to Hadoop to meet rising market demand. In May, IDC said its researchers projected the worldwide Hadoop-MapReduce ecosystem software market to rise to $812.8 million in 2016 from $77 million in 2011, a compound annual growth rate of 60.2 percent.
Hadoop as a Crater Lake for Data
Hadoop uses MapReduce and the Hadoop Distributed File System (HDFS) to let companies store unstructured or semi-structured data on clusters of commodity hardware, and then query that data faster than they could if it were stored in a relational database. Because it runs on commodity hardware and is free to download, one of Hadoop’s major attractions is that it’s cheap: each computing node costs the same as the last one. By combining low cost with a cluster-based architecture, the platform scales very well, enabling companies to store huge amounts of data (think petabytes) from web logs, social media streams or sensor data they previously ignored, threw away or stored on less-responsive and costly tape drives.
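The MapReduce model behind all of this is simpler than it sounds: map each input record to key/value pairs, group (“shuffle”) the pairs by key, then reduce each group to a result. A minimal, single-machine sketch in Python, using an invented web-log format rather than Hadoop’s actual Java API:

```python
from collections import defaultdict

def map_phase(record):
    # Emit (url, 1) for each hit; the "timestamp url status" log
    # layout here is hypothetical, chosen only for illustration.
    _, url, _ = record.split()
    yield (url, 1)

def reduce_phase(key, values):
    # Collapse all counts for one key into a total.
    return (key, sum(values))

def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:                     # map + shuffle
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

logs = [
    "2012-05-01T10:00 /home 200",
    "2012-05-01T10:01 /cart 200",
    "2012-05-01T10:02 /home 404",
]
print(run_mapreduce(logs))  # hits per URL
```

Hadoop’s contribution is not this logic but the machinery around it: splitting the input across HDFS blocks, running many mappers and reducers in parallel, and surviving node failures mid-job.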
These aspects of Hadoop represent a major shift in strategy for storing data, since space and data schema aren’t a concern like they are in a relational database management system, says Jack Norris, the vice president of marketing for MapR.
“You want to have data lakes, where you dump all the data there and then you’re going to do analysis and figure out what you’re going to ask later,” Norris says. “That’s the reverse of what data warehouses are, where you’ve got to figure out what you’re going to ask, because you need to figure out the schema and then the whole ETL process to get the data in the right stage, so you can ask those questions. With Hadoop I’m going to put it all there, and a mix of data is better than isolating data.”
To gain access to the data, companies run queries against the “data lake” through MapReduce, surveying the entire data store and returning the answers.
John Kreisa, vice president of marketing at Hortonworks, said a very common way companies are using Hadoop is like an oil refinery, where the data is like crude oil.
“You take in raw crude oil, and then you need to process that through to diesel fuel, sometimes you’re making jet fuel, sometimes you’re making plastics,” Kreisa said. “It is the refinement process that takes that original raw product and moves it to some level of processing so it is appropriate for another system. That is exactly what Hadoop is being used for in many of these cases: refine that data so the data analysts can come and take a look at it.”
Because Hadoop is written in Java, which experts say is a cumbersome language in which to write queries, the open-source community has created several other projects to make it easier and more viable, like:
- Pig, a high level scripting language to make programming applications for Hadoop easier.
- Hive, a scripting language similar to SQL to create an easier-to-use interface for Hadoop.
- HBase, a NoSQL key-value store database that rides on top of the Hadoop infrastructure.
- Oozie, a workflow scheduler for coordinating and chaining Hadoop jobs.
- And many others from the robust community of programmers working with the current version of Hadoop and developing new features to be included in future versions.
When companies need more hands-on support, they call companies like Hortonworks, MapR, or Cloudera.
Companies implementing Hadoop on their own “will realize they don’t really want to keep up with the projects or understand what the patches are, they would rather pay somebody and they’ll come to us,” says Kreisa of Hortonworks. “We bundle together and pick the most stable versions, test them and integrate them to make sure they work at scale, and then put them out as a tested and supportable version.”
Jeff Badell, the CTO of MicroStrategy, says most Hadoop users today use it to capture machine or web log information and store it, and are worrying about how to analyze it later. BI vendors like MicroStrategy come in when a company wants to blend that massive stream of unstructured data in with other historical, transactional, and business data, looking for insights.
“We’re the analytics platform,” Badell said. “Most of these customers have had a relationship with us to do analysis out of their data warehouse, and then what they’ve asked us to do is figure out how they can leverage the same analytics and delivery mechanisms that MicroStrategy provides, whether through a web interface or a mobile device interface, to take the raw Hadoop data and turn it into information.”
Use Cases: Fraud Detection, Machine Sensors, Click Streams
Norris, of MapR, said a major implementation of Hadoop for financial companies is fraud detection. By analyzing historical data and finding patterns in accounts that led up to fraud, a company can query its data for events matching those patterns and flag likely fraud.
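The pattern-matching idea can be sketched in a few lines of Python. The account events and the “suspicious” sequence below are invented for illustration; a real deployment would mine such patterns from historical fraud cases and scan account histories with MapReduce jobs.

```python
# Hypothetical ordered sequence of events that historically preceded fraud.
SUSPICIOUS_PATTERN = ["password_change", "address_change", "large_withdrawal"]

def contains_pattern(events, pattern):
    """True if `pattern` occurs within `events` in order (gaps allowed)."""
    it = iter(events)
    # Membership tests consume the iterator, enforcing left-to-right order.
    return all(step in it for step in pattern)

accounts = {
    "acct-1": ["login", "password_change", "address_change", "large_withdrawal"],
    "acct-2": ["login", "deposit", "small_withdrawal"],
}

flagged = [a for a, ev in accounts.items() if contains_pattern(ev, SUSPICIOUS_PATTERN)]
print(flagged)  # accounts whose history matches the fraud pattern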
Another case, Norris said, was machine sensors in heavy industry. By collecting sensor data, tracking and querying the events leading up to heavy equipment failures, a company can better predict when a machine needs to go in for maintenance.
“The focus there is avoiding downtime, and in the case of large industrial equipment, having a machine go down has huge business impact,” Norris said. “Being able to avoid that by certain percentages has a huge financial impact.”
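The predictive-maintenance case follows the same shape: scan historical sensor readings for the conditions that preceded past failures. A hypothetical sketch, with invented thresholds and readings, of flagging a machine whose recent vibration readings look like a pre-failure signature:

```python
# All thresholds and readings below are invented for illustration.
WINDOW = 5          # readings per sliding window
WARN_LEVEL = 7.0    # vibration level considered a warning
MAX_WARNINGS = 3    # warnings-per-window seen before past failures

def needs_maintenance(readings):
    """True if any window of readings contains too many warning-level values."""
    for i in range(len(readings) - WINDOW + 1):
        window = readings[i:i + WINDOW]
        if sum(1 for r in window if r >= WARN_LEVEL) >= MAX_WARNINGS:
            return True
    return False

vibration = [2.1, 3.0, 7.5, 8.1, 2.2, 7.9, 6.0, 1.5]
print(needs_maintenance(vibration))  # schedule service before a failure
```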
Steve Stone, the current CIO of MicroStrategy, who until recently was CIO of Lowe’s, said before he left the hardware retailer it was using Hadoop to monitor Lowes.com. By viewing click-stream and web-log data, the company could detect broken elements of the website and gauge the effectiveness of promotions on a minute-to-minute basis.
“When we saw something that was an abnormality that popped out, where we would see sales per hour or sales per minute begin to drop, often times we would go to the Hadoop cluster where detailed web logs were kept and we would run queries against that just to pull [the abnormality] out,” Stone said.
As cloud infrastructure becomes more popular, Hadoop distributors have made alliances there, too. MapR has signed partnerships with Amazon Web Services and Google to provide Hadoop in the cloud. Hortonworks has partnered with VMware. By taking advantage of the elasticity of those virtual environments, companies with little or no infrastructure can run major Hadoop queries on data stored in the cloud on an ad hoc basis.
“We’re at a major tipping point within the cloud, with the maturity of the offerings there, the infrastructure that is now being opened up, and from a cost and performance standpoint you don’t really need huge capital expenditures to have a very sizable cluster at your fingertips,” Norris said.
One of Hadoop’s great features is that it’s built using commodity hardware. FullContact, a company that synchronizes address book data across different applications, runs Hadoop in an Amazon Web Services environment.
Still Work to Do
Kreisa, of Hortonworks, says Hadoop still has a ways to go before it’s a fully viable enterprise system. It has work to do on security if it wants to attract companies with strict compliance needs.
“The early adopters are much more flexible and much more willing to take chances, and they’re much more tolerant for downtime, and they’ll hire the skills that are specialized,” Kreisa said. “If you look at enterprise on the other side of the chasm, the so called early majority, those folks need it to be easy to install, easy to run, easy to monitor and manage and easy to tie into the existing infrastructure.”
Kreisa said the BI, database and analytics vendors have work to do, too, building products that are fully integrated with Hadoop and not just “connected.”
“They’re built to work on structured data, they’re not necessarily built to work in this new paradigm,” he said.
There are analytics applications emerging specifically from the Hadoop world, Kreisa says, pointing to examples such as Datameer and Tresata, the latter of which caters to the financial services industry.
It takes a while for open-source software to reach maturity, and projects like MySQL and Linux took several years before they were fully accepted by the enterprise, he said.