Hadoop Sandboxes Offer Experimental Spaces for Analytics Modelers

July 30, 2013 10:05 am

Alpine Data Labs promises to help enterprises take advantage of predictive analytics. However, before a company can benefit from one of Alpine Data Labs’ platforms, the San Mateo, Calif.-based software provider must first test the performance of “billions of rows of data and hundreds of columns” to check how effectively its technology can build models out of clients’ data, says Steven Hillion, Alpine Data Labs’ head of products.

Running performance tests on billions of rows and terabytes of data on Apache Hadoop requires a large-scale cluster with thousands of nodes, a massive testing ground that’s often financially out of reach for small businesses. So Alpine Data Labs turned to Pivotal Analytics Workbench for help.

Pivotal AWB is a sandbox technology that works by providing free access to a Hadoop cluster in 90-day segments, enabling companies like Alpine Data Labs to take advantage of vast storage and analytics capabilities. Pivotal’s website boasts that its Hadoop cluster “has the equivalent memory of 49,000 iPhone 5’s stringed together and enough storage for 10 years’ worth of Tweets.”

It’s precisely this “hefty horsepower” that Hillion says has allowed Alpine Data Labs to test and tweak its algorithms in a safe yet robust environment. “For a small company like ours, it’s difficult to get your hands on a very large Hadoop cluster to test and see if our models can scale for very large datasets,” he says. “Pivotal AWB has been great in helping us test out our software before it gets into production at our customer sites on their live data sets.”

Although Alpine Data Labs chose Pivotal AWB for its cluster size and ease of implementation, there’s no shortage of sandbox tools to choose from these days, including Hortonworks’ Sandbox and Continuuity’s AppFabric. All of these tools allow developers to test out Hadoop’s capabilities without impacting the performance of active servers, mission-critical systems or live datasets. Still more options are available through educational programs, including IBM’s Big Data University, a set of online courses.

The ability to play with data outside a production environment offers two major benefits, according to Mary E. Shacklett, president of Transworld Data, a Seattle-based technology and market research firm. By working “outside of the chief corporate data repository,” Shacklett says a sandbox acts as “an additional ‘layer’ of separation from your production data, which is always an IT best practice when it comes to keeping the production environment clean.” Essentially, she adds, “The sandbox gives data scientists peace of mind that they are not going to disturb production in any way with their experimentation.”

But there are benefits to end users as well. For a software provider like Alpine Data Labs, commercial success hinges on its ability to take a product to market that’s already been tested for shortcomings. “It’s much more impressive if you can ship software that just works without any problems at all,” says Hillion. “You don’t really want to be testing it out, dealing with bugs and refining deployment live in front of the customer.”

Shacklett agrees. “The greatest benefit is that trials of analytics processes can be conducted, debugged and proved in a test environment so that the process doesn’t get placed into production until it is thoroughly vetted.”

Better yet, while deploying data in a production environment can take weeks—perhaps months—to complete, getting up and running in a sandbox can be like flipping a switch. It took Alpine Data Labs a matter of minutes to deploy its application onto the Pivotal AWB cluster before it could begin building analytics workflows and experimenting with decision trees and algorithms on very large datasets.

That said, Shacklett cautions companies against becoming so enthralled with sandbox technology that they forget the end goal: to create a production-ready application or platform.

In the case of Alpine Data Labs, Hillion says, “We specify what the goals are and work as long as it takes to get to those levels. Typically, we’ll work on testing an algorithm for about two to three weeks.” Chief priorities for Alpine Data Labs include making sure its models are bug-free and that the data is prepped and ready for clients to use in building models.

Still, Shacklett notes that timelines may vary according to project type. “There is no hard and fast rule here,” she says. For this reason, Shacklett recommends that companies set a deadline for experimentation based on the amount of data they plan to test. Of course, she adds, that’s “subject to the type of research being conducted. Analyzing sales behaviors of customers and trying to discover if there is life on other worlds are processes of very different magnitudes.”

Cindy Waxer, a contributing editor who covers workforce analytics and other topics for Data Informed, is a Toronto-based freelance journalist and a contributor to publications including The Economist and MIT Technology Review. She can be reached at cwaxer@sympatico.ca or via Twitter: @Cwaxer.

Home page photo by elleinad via Flickr. Used under Creative Commons license.
