ScaleOut Software’s middleware stores data in memory to be rapidly accessed and updated over multiple servers in a server farm or a compute grid.
The map/reduce framework allows you to perform algorithms on large datasets that are dispersed across multiple systems. It’s often assumed that map/reduce is used with Hadoop, which has its own map/reduce engine. At the core of Hadoop is the Hadoop Distributed File System (HDFS), which stores large files spread across multiple systems and presents them as a single cluster. For most large-scale applications, the combination of map/reduce and Hadoop provides excellent performance.
Unfortunately, “large scale” is a relative term. As data volumes continue to grow, businesses want to do more complex analyses on the data and do it faster. For some industries—financial services, for example—the performance demands are exceeding the limits of Hadoop-based map/reduce.
One of the main factors that limits performance scaling of Hadoop and map/reduce is the fact that Hadoop relies on a file system that generates a lot of file input/output (I/O). I/O adds latency that delays completion of the map/reduce computation. An alternative is to store the needed distributed data within memory. Placing map/reduce in-memory with the data it needs eliminates file I/O latency.
ScaleOut Software has offered an in-memory map/reduce product since 2008, and the company now has about 350 customers across a number of industry verticals. We spoke with ScaleOut’s CEO Bill Bain and COO Dave Brinker to learn more about how this technology works and how it compares to Hadoop’s MapReduce.
DI. First, let’s get a little background. Why in-memory map/reduce?
Bill Bain: I came out of Microsoft. Prior to that, I was at Intel Supercomputing Systems, and then prior to that Bell Laboratories. So I’ve worked in parallel computing my entire career. This concept of map/reduce that we’re seeing very popular today has really been a technology that’s been in development for 25, 30 years. Some of the early pioneering work in this field was done by companies like Intel and IBM and others like Thinking Machines back in the ’80s. We’re seeing just the latest incarnation of that today.
At Microsoft, I saw a problem in scaling storage for applications. Applications were starting to scale out across multiple servers in the commercial space for ecommerce, for websites, and to some extent in financial services. They really didn’t have a good solution for storing data in a way that it could be rapidly accessed, updated, and coordinated over multiple servers in a server farm or a compute grid.
So we started our company to address that issue of memory-based storage. We called it distributed caching back then. We’ve more recently used the term “in-memory data grid” (IMDG). Other companies like Tangosol [now owned by Oracle] pioneered this technology, as well, working in the Java space. We focused mostly on .Net when we introduced our product in January 2005. We have since then offered a Linux Java version of our product and totally integrated our technologies across both platforms.
We added a map/reduce capability when we recognized that the financial services market might have a need for this. We’ve been developing our map/reduce implementation for about three or four years now. It draws on a lot of the technology that I worked on in previous companies like Intel.
Dave Brinker: Two segments have emerged for our business. One is what we call enterprise apps, which is characterized by either customer-facing or internal Web-based applications that need to scale. They could be reservation systems, online banking, shopping sites, and many others. What’s driving demand for the technology is the need to scale due to increased load. These apps are always running across multiple servers in a server farm.
The second segment is high-performance computing (HPC), or grid computing. The storage use of an IMDG naturally fits in an HPC grid computing environment because there’s so much data in a parallel computation that needs to be stored temporarily. Examples include interim calculations in aMonte Carlosimulation or data that’s being created from a portfolio analysis run across a grid of computers.
An IMDG in these environments allows you to store application data that’s being rapidly updated and then accessed in memory that’s accessible across the entire grid. That really enhances your performance and the scalability of your computing environment.
In addition to storage, the growing need to do very, very fast analysis on so-called big data is driving the demand for IMDG-enabled analytics. When we think of big data, we don’t necessarily think of petabytes of data. We’re thinking of data sets that fit in memory but need to be very rapidly analyzed. It might be streaming data, or it could be running a lot of what‑ifs against a more static data set. What is needed is very, very fast in-memory map/reduce.
When we talk about map/reduce, we’re not talking about Hadoop’s specific MapReduce. We’re talking about map/reduce (little “m” and little “r”) in a more generic sense, that is, the notion of shipping functions to data where the data stays in place and is operated on in parallel and then the results are combined through a reduce function.
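The generic pattern described here — ship a function to the data, apply it in parallel where the data lives, and combine the partial results with a reduce function — can be sketched with plain Java parallel streams. This is an illustration of the pattern only, not ScaleOut’s API; the `Position` record and the market-value map function are made up for the example.

```java
import java.util.List;

// Minimal sketch of generic map/reduce: the data stays in place (a
// list stands in for in-memory grid partitions), a map function is
// applied to each element in parallel, and the partial results are
// combined by an associative reduce function.
public class MapReduceSketch {

    // Hypothetical domain object standing in for grid-resident data.
    record Position(String ticker, double quantity, double price) {}

    // "Map": compute a per-object result in place.
    static double marketValue(Position p) {
        return p.quantity() * p.price();
    }

    // Run the map in parallel over the data set and reduce the
    // partial results down to a single value.
    static double totalValue(List<Position> positions) {
        return positions.parallelStream()
                        .mapToDouble(MapReduceSketch::marketValue)
                        .sum();   // "reduce": associative combine
    }

    public static void main(String[] args) {
        List<Position> grid = List.of(
            new Position("AAPL", 100, 150.0),
            new Position("MSFT", 200, 300.0));
        System.out.println(totalValue(grid)); // 75000.0
    }
}
```

The key property is that only the small, rolled-up result moves between servers; the objects themselves never leave the memory where they are stored.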
DI: What are some of the common use cases for in-memory map/reduce?
Bain: We are a middleware company, and our products have a broad range of applications. Although the bulk of our current customers are in either ecommerce or financial services, we see many other verticals which could benefit from this technology, such as predictive analytics for shopping and social media applications. The reason for financial services is certainly that they need very fast analysis of large, fast-changing data sets.
Some of our customers are focused on data analysis. For example, back-testing is something that’s quite interesting to our financial services customers, as is pricing of either fixed income or equities. Others tend to be more focused on scaling application performance by removing bottlenecks for data access as opposed to analyzing data.
DI: Explain how your technology works.
Bain: Our software acts as out-of-process, memory-based storage that runs across multiple servers and gives applications which scale out across those servers both scalable in-memory storage and scalable throughput. What that means is that as you offer more workload (reads and updates) to it, the throughput of the in‑memory storage scales proportionally without increasing response times.
For a typical database server, at some point the throughput saturates and you can’t get any more throughput out of it because you can’t scale it, and so the latency for access grows higher and higher. A scalable IMDG inherently has lower latency than disk-based storage, avoids bottlenecks to scaling, and can further drive down the latency using client‑side caching.
So the next step is, what do you do with the data? One of the things you can do is to query it. We provide a fully parallel query mechanism based on Java properties of objects and on C# properties of objects, which allows you to build a query which runs across all the servers in the grid simultaneously and returns a list of results for matching data. This is very useful for applications that want to query based on value, not just on the names of objects.
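As a rough analogy for this kind of property-based query, a parallel filter over in-memory objects selects by the values of their properties rather than by object name. On an IMDG the predicate runs on every grid server simultaneously; here a Java parallel stream stands in for the grid, and the `StockHistory` record is hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Rough analogy for a parallel, property-based query: select objects
// by the values of their properties, not by their keys. A parallel
// stream stands in for the query running on all grid servers at once.
public class PropertyQuerySketch {

    // Hypothetical grid-resident object.
    record StockHistory(String ticker, String sector, double volatility) {}

    // Query by value: return all objects whose properties match.
    static List<StockHistory> query(List<StockHistory> grid,
                                    String sector, double minVol) {
        return grid.parallelStream()
                   .filter(h -> h.sector().equals(sector))
                   .filter(h -> h.volatility() >= minVol)
                   .collect(Collectors.toList());
    }
}
```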
The problem with parallel query is that it leads to a bottleneck in analysis. Our customers discover this as they move from just storing data and retrieving it to querying it and then analyzing it. For example, if you have a set of portfolios you want to analyze, the naive way to do it is to query a large set of portfolios that are of interest (or stock positions and stock histories) and then sequentially retrieve them for analysis. What you find is that you end up moving the bottleneck from the query phase to the analysis phase.
And so, typically, when you analyze data that you’ve retrieved from the grid, you’re going to first read it in, do some analysis, and then merge the results. Map/reduce just unrolls this and runs it in parallel across all the servers. Instead of querying and retrieving the objects themselves, you simply say, “Do that query, and by the way, here is a mapping method and here is a reduction method. I want you to run all that in parallel on the grid servers themselves.” And so, by doing that, we can very, very quickly analyze data and just return the rolled‑up results.
This is just a special form of data parallel computation. When I say that Google didn’t invent map/reduce, the notion of mapping analysis methods to data that’s spread across a set of servers was pioneered in the ’80s by the companies I mentioned. We just didn’t call it that. We called it data decomposition.
So applications like weather simulations, air‑traffic‑control analysis and management programs, and many problems like these naturally partition across a set of servers. I saw a classic case of this when advanced aircraft were being designed in the early ‘90s. Analyzing the radar cross‑section of these aircraft was a highly data-parallel computation: it could be spread across a set of servers and the results combined to show how well the shape of the aircraft was designed to minimize its radar signature.
That data‑parallel work has traditionally been orchestrated using message‑passing APIs. The classic one is called Message Passing Interface (MPI), which is a standard that’s used by the national labs like Oak Ridge and others.
MPI sends computations out to be run in parallel, and then it combines the results [from parallel processing] to deliver the combined results using a technique called global aggregation. That was a relatively complex and rudimentary approach compared to map/reduce. Over time, people have moved to more object‑oriented approaches that allow you to run on higher-level platforms like IMDGs, which can incorporate global addressability, scalability through load balancing, and high availability to avoid loss of data.
Map/reduce is really the latest incarnation of a technique that’s been evolving for a long time. Google adapted it to do search and called their specific implementation map/reduce. We use that term because we know that people will understand the style of computation we’re referring to, but speaking in a more general sense, an equally valid term would be data parallel analysis.
DI: Are there differences between Google MapReduce and what you call map/reduce?
Bain: The main difference is the fact that we do automatic reduction down to a single set of results. So we have an execution tree which performs map operations in parallel and then combines the results. If you’re analyzing a set of stock‑history objects in parallel, you would produce a set of results on each server and then merge these results pair‑wise to get a final result. What we provide with our solution integrated into our IMDG is automatic pair‑wise merging based on the user’s merge method, and then we do logarithmic‑time inter‑node merging to give one final result.
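The pair-wise, logarithmic-time merge described here can be sketched as a binary-tree reduction over per-server partial results. This is an illustration under assumed names, not ScaleOut’s implementation: in the real product the user supplies the merge method and the pair-wise merges within each round run on different servers concurrently.

```java
import java.util.Arrays;
import java.util.function.BinaryOperator;

// Sketch of pair-wise merging in logarithmic depth: partial results
// (one per server) are merged in pairs, then the merged pairs are
// merged, and so on -- ceil(log2(n)) rounds instead of n-1 sequential
// merges. The merge function must be associative for this to be valid.
public class TreeMergeSketch {

    static <T> T treeMerge(T[] partials, BinaryOperator<T> merge) {
        T[] level = Arrays.copyOf(partials, partials.length);
        int n = level.length;
        while (n > 1) {
            for (int i = 0; i < n / 2; i++) {
                // Each pair-wise merge in this round is independent,
                // so on a grid the pairs merge concurrently.
                level[i] = merge.apply(level[2 * i], level[2 * i + 1]);
            }
            if (n % 2 == 1) level[n / 2] = level[n - 1]; // odd one carries over
            n = (n + 1) / 2;
        }
        return level[0];
    }
}
```

With 8 servers this takes 3 merge rounds; with 1,024 servers, only 10.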
We differ from the Google or Hadoop MapReduce in the style of merge we do. What Google and Hadoop say is, “Look, we’re going to let the mappers generate a set of parallel results, and then we look to the user to go figure out how to merge the results.” He can do that through creating a set of what are called reducers, which also run in parallel and create another set of results. So it’s really just another mapping phase after the initial map results have been reshuffled and fed to a parallel set of reducers.
While reshuffling for parallel reduction is a powerful mechanism, many applications just need pair-wise reduction down to one final set of results, which corresponds to the traditional technique of global aggregation I spoke of earlier. They also want this to be done as efficiently as possible, and that is what we designed ScaleOut Software’s IMDG to do.
One quality of memory-based map/reduce is that it’s very low‑latency. Instead of submitting a job to a batch execution engine, it’s invoked using just a method call as part of an application. It also allows results to be staged in memory.
With very low-latency map/reduce, you can create a pipeline of results. As the data in the grid is changing, you can run a new map/reduce several times a second and just get a stream of results, so you can build a near real‑time, continuously executing map/reduce engine with this technology.
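Because an in-memory invocation is just a method call rather than a batch job submission, the continuous style described above amounts to re-running the map/reduce on a schedule while the grid data changes underneath it. The sketch below uses made-up names and a local list in place of the grid API.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of continuous map/reduce: with no batch job to submit, the
// computation can be re-invoked many times per second against data
// that is changing in place, yielding a stream of fresh results.
public class ContinuousMapReduceSketch {

    private final List<Double> grid = new CopyOnWriteArrayList<>();

    void update(double value) { grid.add(value); }   // data changes in place

    double mapReduce() {                             // low-latency invocation
        return grid.parallelStream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        var engine = new ContinuousMapReduceSketch();
        for (int tick = 1; tick <= 3; tick++) {
            engine.update(tick);                     // new data arrives
            System.out.println(engine.mapReduce());  // fresh rolled-up result
        }
    }
}
```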
We’re not going after big Google search problems, such as petabyte string searches. We’re going after the closer-to-real‑time analysis of fast-changing, memory‑based data sets.
With Hadoop, there’s file I/O when you load up the data to be processed, there’s file I/O between generating the map results and doing the merge, and there’s file I/O when storing the final results in the Hadoop file system. We avoid all of that, and the net result is much more scalable performance. We simply ship the methods to the data that’s already staged in memory. The most important aspect is that each server executes only on the data it’s already holding in memory.
DI: Hadoop, of course, is not a real‑time system. You just made the comment that what you’re doing is in near real‑time. Could you have an actual real‑time application with your technology?
Bain: “Real‑time” means different things to different people. We say near real‑time because developers focused on hard real‑time problems, like building flight control systems, would not consider what we do “real time.” Having looked at real‑time applications in the past, I know that when people typically say real‑time, they mean you guarantee that a computation is completed in a fixed period of time. You specify what it is, but you guarantee it. We don’t make such guarantees. That’s why I would say that what we do is not real-time, but rather near real‑time.
What we do is avoid the batch overhead of Hadoop and analyze fast-changing, memory-based data. This incremental, continuous map/reduce that’s very low‑latency is where we think this technology will find its best application.
DI: You’ve been around since 2003, so you’ve seen this technology more or less from the beginning in terms of the big‑data era. What are the next big breakthroughs that are going to enable you to take your products to the next step?
Bain: We’re going to address the complexity barrier that Hadoop, in particular, presents. People are nibbling around the edges of that now by creating more industrial‑strength solutions around Hadoop to make it more reliable and plug it into various data sources. But that’s not addressing the core complexity.
We actually took this little program that does back‑testing of stock trading algorithms and had our best Java guy run it in Hadoop. To my surprise, it took him two weeks to do it. It took about a week just to get Hadoop tuned. There’s so much you have to know. There’s so much complexity, so it took him a long time. I think companies will start addressing the complexity issue, and a lot of people will do it in the context of pure Hadoop.
A second area for IMDGs is the ability to handle larger and larger data sets. One of the ways we’re going to be doing that is by working with solid‑state disks, so we will be able to integrate with low‑latency connections to solid‑state disks and use other high‑speed networking techniques for communicating between servers.
There’s an emerging choice between the “I want to take the language‑oriented, in‑memory approach to doing my map/reduce,” or “I want to take the approach of using the strapped‑on map/reduce technology that comes with my persistent NoSQL data store.” So, for example, you can put data in Cassandra and you use Hadoop, which is integrated into Cassandra through a set of techniques they have. And pretty much every NoSQL vendor has some map/reduce story.
My personal opinion, and one of the bets we’re making, is that memory‑based map/reduce will emerge as a much simpler and faster approach for performing data parallel computation. The roots of data parallel computation are in memory‑based data, and this approach sidesteps many of the problems you have of trying to get good performance doing map/reduce on a persistent data store.
DI: Are you talking about reducing complexity in the sense of having a skilled person just take less time to do it, or do you mean making that technology available to less‑skilled people?
Bain: It’s making the technology available to less‑skilled people: application developers, people who have domain knowledge but lack map/reduce knowledge. They’re skilled, but they’re just skilled in a different area. A lot of this reminds me of when I first got into computer science in the mid ’70s: there were these experts in IBM Job Control Language, and if you didn’t know that stuff, forget it, you couldn’t get near an IBM mainframe.
One of the benefits that UNIX achieved at that time was to allow people who had application skills but lacked mainframe computer skills to be able to very quickly use computers effectively and avoid this big complexity wall. The problem with Hadoop is there are many variables you have to tune to get any kind of performance out of it. Plus, you have to know a great deal about the internal plumbing of Hadoop to get your data flowing the way you want and get the reducers working effectively. It’s much easier to go find somebody who already knows Hadoop, can install it, can tune it, and then can learn about your application.
Long term you want to flip that model so that you take the application developer and say, “I just want some answers,” the way you can get them with MATLAB, for example. You just take an application developer or domain expert and let that person use the tool quickly and efficiently. Our view is that people want to move toward avoiding the need for skilled map/reduce experts.