Facebook, Amazon and Twitter may rely on Hadoop for processing data but despite their legions of genius engineers, not one of these tech titans can guarantee Hadoop’s fail-safe performance.
That’s because the Hadoop Distributed File System (HDFS) stores its metadata in a single service, the NameNode. When a cluster’s NameNode goes down, the entire cluster becomes unavailable, blocking access to mission-critical applications. It’s a single point of failure (SPOF) dilemma that’s prompting some companies to hesitate before jumping on the Hadoop bandwagon.
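The failure mode is easy to see in a toy sketch (hypothetical names, not real Hadoop code): every HDFS read first asks the NameNode where a file’s blocks live, so if that one metadata service is down, no client can reach its data even though the machines actually holding the blocks are healthy.

```python
# Toy illustration of the NameNode single point of failure (not real Hadoop code).
# A client must ask the metadata service where a file's blocks live before it
# can read them, so losing that one service blocks every read in the cluster.

class NameNode:
    """Single metadata service: maps file paths to block locations."""
    def __init__(self):
        self.block_map = {"/logs/app.log": ["datanode-1", "datanode-3"]}
        self.alive = True

    def locate(self, path):
        if not self.alive:
            raise ConnectionError("NameNode is down: metadata unavailable")
        return self.block_map[path]

def read_file(namenode, path):
    # Even though the data nodes holding the blocks may be perfectly healthy,
    # the read fails outright if the metadata lookup fails.
    locations = namenode.locate(path)
    return f"reading {path} from {locations}"

nn = NameNode()
print(read_file(nn, "/logs/app.log"))  # works while the NameNode is up

nn.alive = False  # simulate the NameNode crashing
try:
    read_file(nn, "/logs/app.log")
except ConnectionError as e:
    print(f"cluster effectively offline: {e}")
```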
“Hadoop is really ready for prime time in that it’s a very effective, low-cost way to store lots and lots of data in any kind of form,” says Jeffrey Kelly, an analyst with The Wikibon Project, an online community of tech experts. “But in order for Hadoop to be a full-fledged enterprise-grade platform, you can’t have a cluster going down, especially if you’re supporting mission-critical applications.” What’s worse, Kelly says it can take anywhere from “minutes to hours” to get a cluster up and running again.
But now one vendor believes that its data replication technology can render Hadoop fail-safe enough to support mission-critical tasks. WANdisco is a U.K.-based software provider known for its increasing role in the big data market. But the question is whether WANdisco’s Non-Stop NameNode tool can convince companies that Hadoop is practical for enterprises.
“WANdisco has the potential to really put a stake in the ground,” says Kelly. “The company has shown with its technology in the non-big data world that they’re very close to 100-percent uptime. If they can make that transition and actually make it happen on Hadoop, they could be on to something.”
WANdisco’s technology replicates data across wide area networks, scattering it geographically, so that if a server goes offline, failover and recovery occur immediately and without any service interruption. The tool is based on technology developed by Hadoop developer AltoStor, which WANdisco acquired for $1.5 million last November.
“If a company has a data center in London and another in New York, WANdisco connects them,” says Kelly. “It’s not necessarily just distributing data across the cluster. It’s about using WANdisco’s replication software to make sure that each node has the most up-to-date information.”
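The replication idea Kelly describes can be sketched in a few lines (a toy illustration under assumed names, not WANdisco’s actual protocol): every update is applied at replicas in each data center, so when one site drops out, a client fails over to a surviving replica that already has the current state.

```python
# Toy sketch of cross-data-center replication with failover
# (an illustration of the idea, not WANdisco's actual protocol).

class Replica:
    def __init__(self, site):
        self.site = site
        self.state = {}
        self.alive = True

def replicated_write(replicas, key, value):
    # Apply the update at every live replica so each site stays current.
    for r in replicas:
        if r.alive:
            r.state[key] = value

def read_with_failover(replicas, key):
    # Serve from the first live replica; others already hold the same state.
    for r in replicas:
        if r.alive:
            return r.site, r.state.get(key)
    raise ConnectionError("no replica available")

london, new_york = Replica("London"), Replica("New York")
replicated_write([london, new_york], "/jobs/etl", "running")

london.alive = False  # the London data center goes offline
site, value = read_with_failover([london, new_york], "/jobs/etl")
print(f"served from {site}: {value}")  # fails over to New York, no data lost
```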
By delivering 24/7 backup and data storage, WANdisco hopes not only to instill greater confidence in Hadoop but also to ease administrative tasks, such as safely taking down a node for maintenance while ensuring business continuity.
Other Approaches to Hadoop System Reliability
WANdisco isn’t the only Hadoop player to take advantage of data replication. Vendors including Attunity and Informatica incorporate the technology to speed up the distribution of information. But only WANdisco’s approach keeps data in a Hadoop environment always available by replicating it across data centers.
“The Attunity and Informatica solutions are more about data integration – getting data from other data sources into Hadoop – rather than addressing the SPOF issue,” says Kelly.
And then there are the many other approaches vendors are taking to tackle the single point of failure issue. California software company MapR, for example, distributes metadata across nodes rather than storing it all on the NameNode. The thinking is that by dispersing metadata throughout the cluster, if one node goes down, the metadata will remain available on any number of other nodes.
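The distributed-metadata idea can be sketched as simple sharding (a toy illustration, not MapR’s actual implementation): each file’s metadata lives on one of several nodes, so losing a single node takes out only that node’s slice of the namespace rather than all of it.

```python
# Toy sketch of spreading metadata across nodes instead of one NameNode
# (an illustration of the idea, not MapR's actual implementation).
import zlib

class MetadataShard:
    def __init__(self, name):
        self.name = name
        self.entries = {}   # file path -> block locations
        self.alive = True

def shard_for(path, shards):
    # Deterministic hash-based placement: each path's metadata lives on one shard.
    return shards[zlib.crc32(path.encode()) % len(shards)]

shards = [MetadataShard(f"node-{i}") for i in range(3)]

paths = [f"/data/file-{i}" for i in range(9)]
for p in paths:
    shard_for(p, shards).entries[p] = ["block-a", "block-b"]

# If one node fails, only the paths whose metadata lived there are affected;
# the rest of the namespace stays reachable.
shards[0].alive = False
reachable = [p for p in paths if shard_for(p, shards).alive]
print(f"{len(reachable)} of {len(paths)} files still have reachable metadata")
```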
The approach promises “higher performance and higher stability than just your vanilla Hadoop so it’s definitely one way to go about it,” says Kelly.
VMware’s Project Serengeti, on the other hand, relies on virtualization technology to deploy and manage Hadoop clusters. By creating a virtualized environment in which data can be moved around various servers in response to fluctuating needs, VMware hopes to help companies gain greater efficiencies from their hardware and make better use of their physical machines.
But not everyone is convinced virtualization is the right solution to address Hadoop’s inherent single point of failure weakness because of the costs and complexities involved, Kelly says.
And then there’s Cloudera, a Palo Alto, Calif.-based software provider whose chief architect, Doug Cutting, invented Hadoop. Cloudera’s Impala tool adds a data warehousing layer that sits on top of Hadoop and allows for real-time SQL queries. Rather than have to move segments of data from a traditional database to an analytics platform, companies “can just use Impala on top of their data to perform the same kind of analytics,” says Kelly.
Similarly, Hadapt, based in Cambridge, Mass., has created an analytics platform that integrates SQL with Hadoop. By doing so, Kelly says companies can “use Hadoop as their main analytics platform and build applications on top of that rather than have to move data out of Hadoop.”
But for all the buzz surrounding WANdisco and its many competitors, Kelly says that “the bottom line is traditional enterprises and risk-averse IT departments are not going to deploy mission-critical applications on top of Hadoop until they are confident Hadoop is highly available and reliable.” In the end, it’s up to vendors like WANdisco to prove that open source is worth the risk.
Cindy Waxer, a contributing editor who covers workforce analytics and other topics for Data Informed, is a Toronto-based freelance journalist and a contributor to publications including The Economist and MIT Technology Review. She can be reached at firstname.lastname@example.org or via Twitter @Cwaxer.