For Hadoop to finally make it out of the sandbox and into a real production environment, organizations need the right tools to manage and control Hadoop clusters. That’s a considerable challenge given the vast amount and variety of data Hadoop is capable of storing and processing.
Data platform provider Hortonworks and mobile ad network InMobi hope to change all that with Falcon. A jointly developed data lifecycle management framework for Hadoop, Falcon simplifies data management by allowing users to easily configure and manage data migration, disaster recovery and data retention workflows. InMobi built Falcon for its own use nearly two years ago. Since then, the technology has been submitted to the Apache Software Foundation for the open source community to use.
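Falcon expresses datasets and their lifecycle rules as declarative XML entities rather than hand-written jobs. As a rough sketch, a feed entity with a retention policy might look like the following (the feed name, cluster name, and paths here are illustrative assumptions, not taken from any real deployment):

```xml
<!-- Hypothetical Falcon feed entity: hourly click data, kept for 90 days -->
<feed name="raw-clicks" description="Hourly click-stream data"
      xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primary-cluster" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- Falcon evicts data older than the limit automatically -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data"
              path="/data/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="etl" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Declaring a second cluster of type "target" in the same entity is how Falcon handles replication for disaster recovery: the framework schedules the copy and eviction jobs itself, which is the "configure rather than code" simplification the project promises.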
That vendors like Hortonworks are starting to look beyond perfecting Hadoop as a platform to managing its clusters is good news for enterprises. According to Shaun Connolly, Hortonworks’ vice president of corporate strategy, a string of common questions often stumps companies attempting to manage Hadoop clusters: How do you get data into the Hadoop platform from other systems? How do you transform this data into the format you need? How do you share this data easily with downstream systems like data warehouses? And how do you distribute that data geographically?
Falcon heralds the “second generation of Hadoop” as enterprises begin to expect greater scalability, security and manageability from the relatively nascent technology. “In the past year and a half, more mainstream enterprises have been adopting Hadoop so the market demand has been building very quickly and aggressively over the past 18 months,” says Connolly. “So we’re seeing more and more of these [data management] solutions come onto the market.”
Cloudera Manager, for example, promises greater transparency into and control over Hadoop clusters so that users can easily deploy and configure clusters from a centralized console. Zettaset Orchestrator, on the other hand, acts as a management layer that sits on top of any Hadoop distribution to simplify Hadoop deployments without the need for manual configuration and third-party consulting services.
While data management tools vary, the very nature of Hadoop demands a more mature toolset. The problem, says Mark Madsen, president of Third Nature, a research firm specializing in business intelligence, is that “Hadoop is a data processing platform. It’s just there to crunch data and store the information that you’ve fed into it but it doesn’t have the features of a relational database.” Relational databases, for example, have a straightforward and formal structure for managing data in simple tables.
Another roadblock to managing Hadoop is its ability to store a broad variety of data from multiple sources. In a traditional database, a set of unique product identifiers can be used to distinguish and control each row of data stored in a table – a process that’s not so simple with Hadoop.
Madsen offers the example of a traditional database used to store a grocery chain’s pricing information. “It’s easy to take a price list and a set of product identifiers and manage and control them so that you don’t have random people changing prices in your system,” he says. “But when you try to extend that to the kinds of information being put in a Hadoop cluster, like Web logs and machine sensor data, the data quality tools we have now can’t really scale to that level effectively.”
And then there are the simple laws of physics to contend with when quickly migrating vast amounts of disparate data in the event of an outage. “Hadoop is a very large distributed file system with many types of files so it’s a general purpose utility,” warns Connolly. “The fact that it can store extremely large amounts of data is where the challenge comes in. You have to think about how you’re actually going to be conducting disaster recovery in that scenario. You can’t just pick up 40 terabytes and move it elsewhere.”
Cindy Waxer, a contributing editor who covers workforce analytics and other topics for Data Informed, is a Toronto-based freelance journalist and a contributor to publications including The Economist and MIT Technology Review. She can be reached at firstname.lastname@example.org or via Twitter @Cwaxer.
Home page image of Saker Falcon by Van 3000 via Wikipedia.