These days, almost anything can generate data and transmit it via an IP address. This can be a great opportunity for businesses to know more than they ever have before about their processes and transactions. Consider these examples:
- A utility company can monitor all its customers’ energy usage because each customer’s meter is hooked up to the Internet. The utility uses that data in aggregate and in real time to manage energy demands. It can also layer other data sources such as real-time weather data on top of the usage data to more precisely predict demand.
- Each truck, the pallets in each truck or warehouse, and every vending machine that is part of a beverage company’s supply chain contain sensors that tell exactly how much product is at a given location at any given time. The company knows in real time when to replenish stock, and because some of the sensors can monitor temperature, it knows when demand might peak or wane.
Gathering and analyzing this machine-generated data is done automatically in real time. The process of creating analytical algorithms that recommend an action is referred to as machine learning.
A key challenge for organizations using machine learning is performance. Machine learning applications generate enormous amounts of data in a steady stream that typically need to be analyzed in real time.
Bottlenecks and latencies can occur in a number of areas and for a number of reasons, slowing down the process of analyzing data. We’ve identified four likely areas where performance might suffer and offer advice for minimizing latencies.
1. Traffic Tie-ups from Data Movement
The machine-based algorithms created to detect patterns in mass data sets, which then can be used to drive business strategies and decisions, likely reside on multiple machines, often in different locations.
While it may seem intuitive to move your data from where it’s stored to your algorithms in order to be crunched, Michael Groner, co-founder and chief architect of Appistry, a St. Louis-based vendor of computational storage products, says that’s the exact wrong thing to do.
“Data movement is number one, number two and number three on the list” of factors that cause latencies in real-time analytics, according to Groner.
“When at all possible, move the work to where the data is instead of moving the data to where the work is,” he says. “We’ve seen traditional systems that sit 30 percent to 50 percent idle on the computational grid simply because they’re waiting for data to move around.
“You can simply move a very small request to the machine that happens to be holding the data at that moment,” Groner says. “We’ve seen 10 to 100 times performance speed-ups in doing that.”
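Groner's advice can be illustrated with a minimal Python sketch. The `Node` class, the partition layout and the "values moved" counter are hypothetical stand-ins for a real cluster, not Appistry's product; the sketch only contrasts how much data crosses the network under each approach.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical storage node holding one partition of meter readings."""
    name: str
    readings: list

    def run(self, task):
        # The work moves to the data: only the task's small result leaves the node.
        return task(self.readings)


def partial_sum(readings):
    """Small request shipped to each node; returns just two numbers."""
    return sum(readings), len(readings)


def mean_by_moving_data(nodes):
    """Copy every reading to one central machine, then compute the mean."""
    values_moved = 0
    all_readings = []
    for node in nodes:
        all_readings.extend(node.readings)
        values_moved += len(node.readings)
    return sum(all_readings) / len(all_readings), values_moved


def mean_by_moving_work(nodes):
    """Send the small task to each node and combine the partial results."""
    values_moved = 0
    total, count = 0.0, 0
    for node in nodes:
        node_total, node_count = node.run(partial_sum)
        total += node_total
        count += node_count
        values_moved += 2  # one sum and one count per node
    return total / count, values_moved


# Assumed sample data: two nodes holding 5,000 readings between them.
nodes = [
    Node("node-a", [1.0, 2.0, 3.0] * 1000),
    Node("node-b", [4.0, 5.0] * 1000),
]
```

Both functions return the same mean, but the first one moves every reading across the network while the second moves two numbers per node. This only works directly for computations that can be split into partial results and combined, such as sums, counts and means.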
2. Algorithms Cause Speed Bumps
You can’t crunch real-time data without algorithms, which lay out specific computational procedures for calculating and processing data.
But your results and analytical conclusions are only as good as the algorithms you use, according to Martin Hack, president, CEO and co-founder of Skytree, a Silicon Valley-based big data analytics company.
“The quality of your algorithms absolutely matters,” says Hack. “Machine learning allows us to learn from data. If the algorithms aren’t smart enough or fast enough, you’re never going to get” the analytic results you want.
While Hack says there’s “usually more than one thing that makes a particular algorithm good or bad,” he cites speed as a deceptively critical variable.
“The speed of an algorithm for doing machine learning affects the final quality of the analytics result in more than just the obvious way,” Hack says. “The obvious way is that if you can perform training on a larger dataset because you can do it in a tractable amount of time, then your model will be more accurate. … Which is ultimately what everybody wants.”
Beyond that, he says, “If, during the interactive process of modeling, which involves trying out different types of models, with different parameters, you can perform each one 10 times faster, then you are much more likely to have explored a richer space of models and options in the time available for modeling effort.”
Older algorithms can be specific sources of analytics bottlenecks, Hack says.
“Many times an algorithm can’t keep up anymore with today’s big data requirements: faster, more scalable, more accuracy,” he says. “For example, there are algorithms that are up to 30 years old, but we changed the algorithm so fundamentally that now it is several orders of magnitude faster than the original but at the same time—and this is important—preserving statistical accuracy (many times even with accuracy improvements) and observing the underlying mathematical principle.”
So before you start running data through the analytics process, make sure your algorithms are current and robust enough to handle the load.
3. Computing Clusters Collect Garbage
Networks typically are messy environments, given the inherently awkward marriage of new technologies, legacy systems and the variably talented humans running them. This creates impediments for analytics programs.
“There’s a lot of hardcoded daisy-chaining and file mangling happening on top of Hadoop clusters, with little scripts and bits of glue code here and there, and very little management, coordination and orchestration,” says Steven Noels, co-founder and product vice president of NGDATA, a Belgium-based company that has created a big data management platform called Lily that runs on top of the open-source Hadoop software framework and Hadoop’s HBase non-relational distributed database.
“After FIFO and LIFO, we’re now into GIGO – garbage in, garbage out,” Noels says. “A lot of data is thrown to Hadoop clusters, and with heavy MapReduce engineering we’re trying to hunt for gold information nuggets.”
While there may be no magic bullets for cleaning up messy data clusters, Noels stresses the importance of planning out your approach to analytics based on what you’re actually working with.
“We think schema management and data pre-processing can make this better,” he says. “Also, using an interactive store such as Apache HBase forces people to think about schema, data access patterns, and so on at the start of their project.”
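As a rough illustration of the schema management and pre-processing Noels describes, the Python sketch below checks incoming records against a declared schema before they reach a cluster. The field names in `SCHEMA` and the `clean_record` and `preprocess` helpers are assumptions made for this example; they are not part of Lily, Hadoop or HBase.

```python
# Hypothetical schema: field name -> type each value must be coercible to.
SCHEMA = {"meter_id": str, "timestamp": int, "kwh": float}


def clean_record(raw):
    """Return a record that conforms to SCHEMA, or None if it cannot be repaired."""
    record = {}
    for field, expected_type in SCHEMA.items():
        value = raw.get(field)
        if value is None:
            return None  # missing or empty field
        try:
            record[field] = expected_type(value)
        except (TypeError, ValueError):
            return None  # value cannot be coerced to the declared type
    return record


def preprocess(raw_records):
    """Split raw input into schema-conforming records and rejects."""
    accepted, rejected = [], []
    for raw in raw_records:
        record = clean_record(raw)
        if record is None:
            rejected.append(raw)
        else:
            accepted.append(record)
    return accepted, rejected


# Assumed sample input: one good record, one bad timestamp, one missing field.
raw_records = [
    {"meter_id": "m1", "timestamp": "1700000000", "kwh": "3.5"},
    {"meter_id": "m2", "timestamp": "yesterday", "kwh": "1.0"},
    {"meter_id": "m3", "kwh": "2"},
]
accepted, rejected = preprocess(raw_records)
```

Rejected records are kept rather than silently dropped, so they can be inspected and repaired instead of becoming the "garbage in" that Noels warns about.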
4. Storage Drives in the Slow Lane
Appistry’s Michael Groner talks about bringing the algorithms to the data for analytics, but where that data resides also matters. One place that causes performance slowdowns is the hard drive, according to Martin Hack of Skytree.
“There are latency issues when data goes to hard drive,” Hack says. “Hard drives are the pedestrians of the network.”
One way around the hard-drive bottleneck is in-memory computing, which is becoming more prevalent as the prices of processors and memory have dropped significantly. Among the large software vendors, SAP offers its HANA in-memory computing system, and Oracle unveiled its own in-memory system, called Exalytics, in October 2011.
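The idea behind in-memory computing can be sketched in a few lines of Python. The two store classes and the CSV layout below are hypothetical and bear no relation to how HANA or Exalytics work internally; the sketch simply counts how often each approach goes back to the drive.

```python
import csv


class DiskBackedStore:
    """Hypothetical baseline that re-reads the file for every query."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0

    def _load(self):
        # Each call represents one full trip to the drive.
        self.disk_reads += 1
        with open(self.path, newline="") as f:
            return [(row["meter_id"], float(row["kwh"])) for row in csv.DictReader(f)]

    def total_kwh(self, meter_id):
        return sum(kwh for mid, kwh in self._load() if mid == meter_id)


class InMemoryStore(DiskBackedStore):
    """Reads the file once at startup and answers every query from RAM."""

    def __init__(self, path):
        super().__init__(path)
        self._rows = self._load()

    def total_kwh(self, meter_id):
        return sum(kwh for mid, kwh in self._rows if mid == meter_id)


def write_sample(path):
    """Write a tiny, made-up usage file for the example."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["meter_id", "kwh"])
        writer.writerows([["m1", 1.5], ["m2", 2.0], ["m1", 2.5]])
```

After three queries, the disk-backed store has read the file three times and the in-memory store once. In practice an operating system's file cache can hide some of that cost, so the counter shows the access pattern rather than measured latency.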
Chris Nerney, a freelance writer and musician, lives with his family in upstate New York. Reach him at email@example.com.
Correction: The original version of this story suggested that Skytree’s analytics are used by the Large Hadron Collider, NASA and the Sloan Digital Sky Survey. While the company’s CTO has worked with these organizations in the past, Skytree does not claim them as customers.