Apache Hadoop may be able to work through huge datasets, but it can also present its fair share of obstacles when it comes to deployment.
“Hadoop was designed to solve two problems: One, to remove the question of where to store data, and two, where to perform computing. Companies don’t need to ask these questions anymore; they just put it on Hadoop,” said Chris Wensel, CTO and founder of big data application provider Concurrent. “However, the problem is Hadoop is extremely hard to use. Many data scientists aren’t even in production yet with their models.”
The roadblock developers face, said Wensel, is that while it’s relatively easy to build a Hadoop cluster, converting it into a data analysis system requires a mix of sophisticated data mining tools such as R, MicroStrategy and SAS, along with languages and technologies such as Pig, Hive and HBase, from a whole host of vendors including Hortonworks, MapR Technologies and Cloudera.
Unfortunately, according to Wensel, many of these tools are meant for “data cleansing and scoring, not for preparing machine-learning applications for production.” As a result, he said, “A lot of code must be written to turn cool models into production-oriented applications that users can trust.”
Now Wensel believes he’s discovered a way to ease and expedite Hadoop deployments with Pattern, an open source scoring engine. A scoring engine sorts through and prioritizes a model’s huge volumes of data to make predictions about future behavior – a time-consuming and laborious yet necessary process.
Pattern improves this process by letting users create a model in a programming language such as R. Because R doesn’t always run efficiently on Hadoop, users export the model as PMML (Predictive Model Markup Language), a standards-based format for exchanging models, to carry it from R over to Hadoop. Next, Pattern converts the PMML model to a Cascading application, where developers can begin building enterprise-grade big data applications. Cascading is an application framework for Java developers working with Hadoop, also developed by Concurrent.
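To make the idea of scoring from an exported model concrete, here is a minimal sketch in Python. It is not Pattern's code or API. It assumes a hand-written, simplified PMML fragment for a linear regression (the XML namespace is omitted, and the field names "age" and "visits" are hypothetical), reads the coefficients with the standard library's XML parser, and applies them to a couple of made-up records.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PMML fragment for a linear regression.
# Real exports also carry an XML namespace and a data dictionary,
# both left out here to keep the sketch short.
PMML_DOC = """
<PMML version="4.1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="visits" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_linear_model(pmml_text):
    """Return (intercept, {field: coefficient}) from the PMML text."""
    root = ET.fromstring(pmml_text)
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(record, intercept, coefficients):
    """Apply the linear model to one record (a dict of field values)."""
    return intercept + sum(
        coef * record[name] for name, coef in coefficients.items()
    )


intercept, coefficients = load_linear_model(PMML_DOC)

# Made-up input records; in a cluster each would be one row of a large dataset.
records = [{"age": 30, "visits": 2}, {"age": 40, "visits": 0}]
scores = [score(r, intercept, coefficients) for r in records]
print(scores)
```

The point of the sketch is the separation of concerns: the model is trained elsewhere and shipped as a document, while the scoring step only needs to read that document and apply it row by row, which is the part that can be spread across a cluster.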
By quickly deploying machine-learning applications on Hadoop, Pattern can significantly reduce development time from weeks or months to “hours, if not minutes,” said Wensel. What’s more, because Pattern runs on Concurrent’s Java application framework Cascading, companies can leverage existing investments in technologies and skills such as Java, SQL and predictive modeling.
That’s good news for companies with production use cases for Pattern, such as WellPoint, a managed health care provider that’s test-driving Pattern to provide analytics used by federal agencies. Social website Airbnb is also using Pattern to better connect travellers to home-dwellers with room to spare. And then there’s early Pattern evaluation user AgilOne, a company that makes cloud-based predictive analytics and business analytics tools.
Antony Arokiasamy, senior software architect at AgilOne, said in a statement that using Pattern allows for the deployment of a variety of machine-learning algorithms for its predictive marketing tools. “As a self-service SaaS offering, Pattern allows us to evaluate multiple models and push the clients’ best models into our high performance scoring system. The PMML interface allows our advanced clients to deploy custom models,” Arokiasamy said.
Pattern’s release follows closely on the heels of the unveiling of Concurrent’s Lingual, an open source ANSI-standard SQL engine that runs on top of Cascading. Lingual lets data scientists and developers with basic SQL skills build applications on Hadoop without any training in MapReduce. Instead, users are able to run MapReduce applications from standard SQL queries.
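As a rough illustration of why a SQL query can stand in for hand-written MapReduce code, the Python sketch below computes the same grouped total two ways: once with a standard SQL query (run here on SQLite purely for convenience, not on Lingual or Hadoop) and once with explicit map and reduce steps. The table name, columns and rows are invented for the example.

```python
import sqlite3
from collections import defaultdict

# Invented sample rows: (country, nights booked).
rows = [("us", 3), ("ca", 5), ("us", 4)]

# 1. The SQL view: one declarative query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (country TEXT, nights INTEGER)")
conn.executemany("INSERT INTO bookings VALUES (?, ?)", rows)
sql_result = conn.execute(
    "SELECT country, SUM(nights) FROM bookings "
    "GROUP BY country ORDER BY country"
).fetchall()
conn.close()


# 2. The MapReduce view: the same computation spelled out by hand.
def map_phase(row):
    """Emit a (key, value) pair for each input row."""
    country, nights = row
    yield country, nights


def reduce_phase(key, values):
    """Combine all values that share a key."""
    return key, sum(values)


# The "shuffle" step groups mapped values by key before reducing.
groups = defaultdict(list)
for row in rows:
    for key, value in map_phase(row):
        groups[key].append(value)

mapreduce_result = sorted(reduce_phase(k, v) for k, v in groups.items())

print(sql_result)
print(mapreduce_result)
```

Both approaches yield the same per-country totals. A SQL-on-Hadoop engine takes on the job of turning the first form into something like the second, so the person writing the query does not have to.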
These programming tools are designed to make creating predictive models easier. But heavy coding and a hodgepodge of sophisticated tools aren’t the only hindrances to Hadoop deployments. Erik Jarlstrom, vice president of technology solutions at Dataguise, a provider of data privacy tools, has long warned of the data security dangers that keep Hadoop applications in pilot. Hadoop deployments, with their wide variety of data classifications including log files, structured data and mixed data types, can be difficult to secure. Because of this, Jarlstrom has issued a wake-up call to those who aren’t taking advantage of the tools needed to secure the “serious holes” plaguing Hadoop deployments.
Whether the hesitation springs from data security concerns or a painstakingly long development cycle, there is one thing developers can agree on: “You can have the best model in the world after spending months creating it, but if you can’t put it in production, it’s useless,” said Wensel.
Cindy Waxer, a contributing editor who covers workforce analytics and other topics for Data Informed, is a Toronto-based freelance journalist and a contributor to publications including The Economist and MIT Technology Review. She can be reached at firstname.lastname@example.org or via Twitter @Cwaxer.