IT research firm Gartner projects that by 2015, 65 percent of all packaged business analytics systems will incorporate Hadoop. However, despite its increasing ubiquity, questions remain around whether Hadoop is really enterprise-ready.
“A lot of businesses, particularly financial institutions, are nowhere near ready to deploy Hadoop,” says Jim Vogt, CEO of Zettaset, a Mountain View, Calif.-based provider of Hadoop management platform technology. “A lot of their concerns revolve around security.”
And for good reason: Most traditional big data environments depend upon perimeter security tools for protection. However, because Hadoop “shares data and conducts batch processing of data across nodes,” Vogt says, “that creates a different set of security challenges that you can’t really address with perimeter security.”
Chief among these challenges is that Hadoop comes in a variety of distributions from vendors including Hortonworks, MapR and Cloudera. As vendors come and go, and their applications evolve, Hadoop distributions are likely to change over time, warns Vogt, making it much more difficult for companies to stick with a security plan.
The inherent nature of Hadoop – a distributed architecture – adds another layer of complexity, especially when it comes to encryption. Because the data is fragmented and shared across nodes, encrypting it is considerably more difficult than it would be if the data sat all in one place, on a single server.
But that’s not all. With data moving rapidly between nodes, many companies struggle to establish role-based access controls and policies governing who can and cannot access sensitive data. Without these controls in place, employees can tap into sensitive data, from payroll information to medical records, while hackers can steal confidential information.
Worse yet, failure to track who is accessing what data can prevent companies from creating an audit trail – a critical component for meeting enterprise requirements, such as regulatory compliance mandates.
“In a structured data environment, these access controls are already in place,” says Vogt. “There are now these expectations that if I’m going to use Hadoop in an operational environment, I need control over who’s accessing what and to be able to create audit logs for compliance, and to encrypt any data that’s at risk.”
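To make the two requirements Vogt describes concrete, here is a minimal sketch of a role-based access check that also records every attempt in an audit log. It is illustrative only: the role names, dataset names, and the `AccessController` class are hypothetical, and the sketch does not represent how Hadoop or any vendor's product implements these controls.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical mapping of roles to the datasets they may read.
ROLE_PERMISSIONS = {
    "hr_analyst": {"payroll"},
    "clinician": {"medical_records"},
    "data_engineer": {"clickstream"},
}


@dataclass
class AccessController:
    """Checks role-based permissions and logs every access attempt."""

    audit_log: list = field(default_factory=list)

    def check_access(self, user: str, role: str, dataset: str) -> bool:
        # Unknown roles get an empty permission set, so they are denied.
        allowed = dataset in ROLE_PERMISSIONS.get(role, set())
        # Record denied attempts as well as granted ones; an audit
        # trail needs both to show who tried to access what.
        self.audit_log.append(
            {
                "time": datetime.now(timezone.utc).isoformat(),
                "user": user,
                "role": role,
                "dataset": dataset,
                "allowed": allowed,
            }
        )
        return allowed


controller = AccessController()
granted = controller.check_access("alice", "hr_analyst", "payroll")
denied = controller.check_access("bob", "data_engineer", "payroll")
```

In this sketch the first request is granted, the second is refused, and both leave an entry in `controller.audit_log`. In a distributed system the same check and log entry would have to be enforced consistently on every node that holds a piece of the data, which is the difficulty the article describes.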
Challenge of Regulating Access to Many Datasets
According to Mary Shacklett, an industry analyst and president of Transworld Data, a research and IT consulting firm, the challenge “is that big data processing engines like Hadoop must aggregate data from many different data sources. The result is often a new set of data repositories which might in turn require a new set of access permissions to be defined.” Having to decide all over again who can and cannot access an enterprise’s datasets is “a knotty problem that can become political as well as regulatory,” warns Shacklett.
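One way to picture the problem Shacklett describes is to ask who should be allowed to read a repository built from several sources. The sketch below assumes one conservative policy, which is not attributed to Shacklett or to any vendor: a role may read the combined dataset only if it could already read every source. The function name, dataset names, and role names are hypothetical.

```python
def derived_dataset_roles(source_acls, sources):
    """Return the roles permitted to read every dataset in `sources`.

    Assumed policy: access to a combined dataset requires access to
    all of the datasets it was built from.
    """
    role_sets = [source_acls[name] for name in sources]
    return set.intersection(*role_sets) if role_sets else set()


# Hypothetical per-source access lists.
SOURCE_ACLS = {
    "payroll": {"hr_analyst", "compliance_officer"},
    "medical_records": {"clinician", "compliance_officer"},
}

combined_roles = derived_dataset_roles(
    SOURCE_ACLS, ["payroll", "medical_records"]
)
```

Under this policy only `compliance_officer` can read the combined data, even though three roles could read at least one source. Whether that outcome is acceptable is the kind of decision Shacklett warns can become political as well as regulatory.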
To meet these enterprise security expectations, Zettaset has developed a distribution-agnostic management platform for Apache Hadoop-based big data deployments. Dubbed Orchestrator, the tool is essentially a cluster management system that sits on top of Hadoop. Because it acts as a management layer surrounding any Hadoop distribution, Vogt says, Orchestrator can help protect heterogeneous clusters and provide more comprehensive, robust security, regardless of distribution or vendor.
Other vendors, such as Hadapt, HStreaming Enterprise and Splunk, also offer components that are distribution-agnostic, but these vendors’ offerings are more likely to be used for data analysis than for security purposes.
“In the next five years, Hortonworks and Cloudera aren’t necessarily going to be the leading distribution,” warns Vogt. With bigger players such as “Intel, HP, Dell, Teradata, IBM and Oracle” making inroads into big data, Vogt says Orchestrator covers all the bases by being able to “work on any distribution” rather than serve as its own Hadoop distribution.
That’s likely to please IT managers currently struggling to secure Hadoop. “When you can provide a solution in the open-source community, for a truly agnostic software that doesn’t care which distributions of software it is working with, it can give IT instant headache relief,” says Shacklett. That’s because, unlike traditional Hadoop management systems, distribution-agnostic tools haven’t been “calibrated for optimal performance for specific platforms and environments.”
But whether a distribution-agnostic approach is enough to please security-conscious enterprises remains to be seen. After all, says Shacklett, “Hadoop was designed in an open source, collaborative setting that is typically not sensitive to the level of security that enterprises need. If you are in high-security industry verticals like healthcare and finance, you’re going to need something more to meet the security expectations of your regulators.”
Cindy Waxer, a contributing editor who covers workforce analytics and other topics for Data Informed, is a Toronto-based freelance journalist and a contributor to publications including The Economist and MIT Technology Review. She can be reached at firstname.lastname@example.org or via Twitter @Cwaxer.