Syncsort, a software company founded in 1968 to help mainframe systems administrators sort data for transaction processing, has found new life at the center of the ETL processes that enterprises with legacy systems require to adopt Hadoop and big data analytics systems.
The Woodcliff Lake, N.J.-based company recently unveiled a data integration service for Hadoop on Amazon’s Elastic MapReduce cloud computing service. It has formed partnerships with Hadoop distributors Cloudera, Hortonworks and MapR, and with Tableau Software, known for its business intelligence and data visualization tools. And, prompted by its venture capital investors, firms like Bessemer Venture Partners, Goldman Sachs and Insight Venture Partners, Syncsort is headed on a shopping spree to build out its own ecosystem of tools and services that connect mainframe systems and corporate transaction data to commodity-style data storage and analysis capabilities.
Syncsort CEO Lonne Jaffe is not shy about pointing out the interesting dynamic at work here. While boldfaced Internet names like Google, Yahoo, Facebook, LinkedIn and Twitter depend on commodity hardware and open source software projects like Hadoop and MapReduce to churn through massive quantities of weblog and other data, many of the world’s longer-established enterprises have been amassing their own data outside of the Web, on mainframes, in data warehouses and in other systems. Those are Syncsort’s kind of people, and now the company is looking to be a data services hub between the old-guard systems and the new technologies.
Jaffe, named CEO in July, comes from IBM, where he spent 13 years, including a year working on acquisitions, and CA, where he led corporate strategy.
“For the first couple of decades in its existence, [Syncsort] was focused on high performance sorting software on the mainframe,” Jaffe says, adding that the company “built these very impressive algorithms that were able to optimize down to the memory I/O, and CPU throughput level, to do sorting which was a major computer science challenge for the early part of the 70s and 80s. And then the company had moved to create some next-generation open systems software.”
What follows is a partial, edited transcript of an interview with Jaffe conducted on December 5.
Data Informed: In a sense, your history turns into a kind of advantage, having roots in all these customers using legacy systems over decades.
Lonne Jaffe: That’s exactly right. The early adopters of some of this powerful big data technology, like Apache Hadoop, the consumer Internet companies of Silicon Valley, like Facebook and LinkedIn and Twitter, they didn’t have some of these grown-up company challenges when they used one of these new technologies, because they were building essentially everything from scratch and they could hire armies of Stanford computer science Ph.D. folks to cobble together elaborate data systems running on some of these new technologies. But the more traditional enterprises, the large banks and the hospitals, and the retailers, and the government entities, that are using some of this technology, they have a whole class of problem that the earlier adopters didn’t have.
They have huge numbers of legacy systems that they need to get access to, including the mainframe, which still stores upwards of 70 to 80 percent of corporate data in some parts of the world. And there are a lot of really challenging, sustainably high-value things you need to do when you’re accessing mainframe data from low-end commodity servers connected to the Internet. And so we can handle all the connectivity to the legacy systems.
A lot of people are using Hadoop these days as essentially a pre-processor, preparing data for various uses, including traditional business intelligence but also all sorts of interesting next-generation things like machine learning and multivariate predictive analytics, or for loading into new data repositories that do similar things to legacy systems, such as columnar databases or NoSQL repositories.
Let me give you an archetypal example of one of the major scenarios we are seeing. You have people collecting data from a variety of sources and then loading it into a data warehouse like Teradata, for example, then pre-processing the data in one of those data warehouse environments, and then running a business intelligence tool against it, like Business Objects or Cognos. And upwards of 40 to 50 percent of the processing that they are doing in some of these data warehouses, which are often very expensive, is essentially what is called ELT, which is extract, load and transform. It’s somewhat inefficient processing to be doing in these data warehouses, which are really expensive. They can run $100,000 to $200,000 a terabyte depending on the customer. And what they’ll do is, they’ll put a Hadoop cluster between the source systems and these downstream systems. They don’t change any of their business processes at all. So they still load those systems, they still run the same business intelligence tool against it.
But they put Hadoop in there as a pre-processor and collect all the data, which can cost $400 to $1,000 a terabyte. And then they load the same downstream systems, so they don’t change their processes at all. There’s no change for the business users at the end. But as a side effect of this, their Hadoop cluster essentially becomes a long-term active archive of all of their data, because it’s all flowing through the Hadoop cluster. And they’re building skills, which makes it easier to retain talent and attract talent.
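The offload pattern Jaffe describes can be sketched in miniature: instead of loading raw records into the warehouse and transforming them there (ELT), a pre-processing layer cleanses and summarizes the data first, and the raw feed is retained as a cheap archive. This is a hypothetical illustration of the data flow only; the field names, records, and `preprocess` function are invented for the example, not drawn from any Syncsort or Hadoop API.

```python
# Minimal sketch of the warehouse-offload pattern described above.
# Assumption: the transform step (here plain Python) stands in for the
# Hadoop pre-processing tier; in practice this would be a MapReduce or
# similar job running on commodity hardware.

def preprocess(raw_records):
    """Cleanse and summarize raw source records before the warehouse load."""
    # Cleansing: drop records with missing values.
    cleansed = [r for r in raw_records if r.get("amount") is not None]
    # Summarizing: aggregate per store, so only compact results
    # reach the expensive downstream warehouse.
    totals = {}
    for r in cleansed:
        totals[r["store"]] = totals.get(r["store"], 0) + r["amount"]
    return totals

raw_feed = [
    {"store": "NY", "amount": 120},
    {"store": "NY", "amount": None},   # bad record, dropped in cleansing
    {"store": "NJ", "amount": 80},
]

archive = list(raw_feed)               # raw data kept as a long-term active archive
warehouse_load = preprocess(raw_feed)  # only prepared data hits the warehouse

print(warehouse_load)
```

The business process downstream is unchanged: the warehouse still receives the same prepared tables, but the inefficient transform work has moved to the cheaper tier, and the untouched `archive` accumulates history that new systems can later query directly.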
And that can be really powerful for them, because since they are pre-processing all the data on Hadoop, and they can store 10 years of data there instead of just a couple of years, they can eventually run new systems directly against Hadoop. The world of things you can do directly against it as a technology is getting better faster than almost anything else in the technology industry.
What have you done with Syncsort’s technology to make it relevant to the newer technologies?
Jaffe: In ETL, the transformation stage is where a lot of the work is. The best way to think of this is the simple use case that I described, using it as a data preparation engine. A lot of these systems, like business intelligence, and some of the operational systems even, require data to be synthesized and summarized, some of it to be cleansed, and prepared in various other ways. And there’s a tremendous volume of work involved in that, in terms of the number of machines that have typically needed to get involved.
We have made it extremely easy to do that. Think of a graphical user interface that even a not particularly technical user can pick up and learn fairly quickly. And also you need to do that in a way that’s efficient, or else it can suck up a tremendous amount of computing resources. In addition, you may want to do all sorts of analytics that let data scientists start exploring the data.
We’re seeing hospitals gather up data from all of their patients, sometimes aggregated with data from neighboring hospitals, do predictive analytics to determine which health care treatments result in the best outcomes for specific types of patients, and then deliver that insight to the clinician at the point of care in a way that doesn’t disrupt the care process. We’re seeing cities try to analyze their traffic pattern data to predict traffic multiple hours ahead of time, anticipate bottlenecks, and then change the tolls on bridges and the traffic light patterns to prevent the traffic from ever happening in the first place. We’re seeing retailers gather up large quantities of data and then do analytics on what is the optimum price of bread so that you can sell more milk, because you understand the foot traffic patterns in your store and you want to dynamically change the pricing in order to maximize both profit and revenue. We’re seeing government entities use this type of analytics for regulation, and financial services companies predict what is going to be the best investment opportunity or reduce fraud. So really interesting stuff.
But the really large enterprises, they’re funding the build-out of these systems largely through that initial project I was describing, which saves a lot of money by offloading expensive, inefficient processing from systems that aren’t very well suited to it into Hadoop, and then building that environment as essentially a long-term active archive that keeps all the raw data as a side effect.
When you talk about acquisitions, what kinds of challenges are you looking to address? What do you want to be able to do better for your customers?
Take Circle Computer Group [an acquisition announced September 30] as an example. They have a powerful middleware engine that actually sits on the mainframe, between what is called IMS, which is one of the most widely used mainframe data repositories in the world, and the IMS applications, which are sometimes very large, elaborate applications that have been built up over decades. What it does is essentially trick the applications into thinking they’re still accessing the IMS database, but actually you’ve moved the data out of IMS and you’ve put it into a more strategic location, usually DB2z, where then it can be accessed off platform by Apache Hadoop.
And so that allows them to save money on really hard-to-find IMS skills, which are getting increasingly scarce in the market, and it also puts the data in a more strategic location. So that data, which is often the mission-critical transaction data for their systems of record, can now be accessed by their next-generation analytical systems.
And there’s this phenomenon, where, if you were LinkedIn, this is not a concern. You didn’t have a mainframe. You didn’t have any legacy transactional systems. Similar for Twitter and Facebook and Google and all these other players. But the grownup companies that do have these systems, you know, what good is their next-generation data analytics environment if they can’t access their transactional data?
So those are the kinds of companies we’re really interested in: companies that unlock previously hard-to-get-to data for use in some of these next-generation systems.
You’re describing a demographic issue: some skills are hard to find. Skills that were important as IT systems grew up over decades are now finding new life.
Jaffe: That’s right. There are certain systems that are better suited for certain types of data processing workloads. So the mainframe is incredibly powerful at transactions. And will likely be one of the best transactional platforms for the foreseeable future, but there are other kinds of workloads that aren’t maybe quite so well suited to it and can be somewhat expensive to run on the mainframe.
And so it actually increases the value of the mainframe as a platform to have it be used more efficiently. This is true of certain types of data warehouse environments also. So the customers are able to save a tremendous amount of money by moving the inefficient processing from these systems into something like Apache Hadoop when it’s well suited for the data processing workload.
If you can make it turnkey, essentially a button that moves those expensive workloads from the legacy systems to Hadoop, then it also frees up your resources to do things like create all sorts of next-generation applications, or let data scientists play around in the environment to learn. And then on top of that, to your point, it saves you from imminent problems around not having enough skills in some of these legacy environments to continue to upgrade and update the systems.
And that dynamic is one that is not going to go away in the future. It’s probably going to get even more important.
Are the core technologies that Syncsort started out with still elemental to the techniques and tasks you’re talking about now?
Jaffe: It’s really amazing how that has proven to be the case. There have been a lot of platform shifts over the years, but a lot of the challenges that people had in the early days of the mainframe, around optimizing algorithms for multiple levels of caches, and memory and CPU and throughput, and I/O, are re-emerging as problems in some of these environments where you are dealing with staggering quantities of data.
Data that is so large that it stretches the capacity limits of disk storage and compute. And the parallelization of computing that has happened with some of these new technologies like Hadoop has shifted the bottlenecks around, so they tend to be more I/O oriented and less CPU oriented, for example.
And the bottlenecks are constantly shifting. But the decades of tuning these algorithms, not just to the data workloads themselves but to the speed and nature of the underlying environment, like the disk speed, the spindle speed, and the I/O bandwidth, and being able to dynamically alter the algorithms based on the nature of the data workload and the underlying system, those are reemerging as increasingly important in this new environment.
And so not only has the technology itself proven to be really valuable, but the people who built that technology, and who can therefore build some of these new systems tuned to the big data environment, have been similarly valuable.
And I’ve been particularly amazed by the extent to which some of the folks within the company who are not the kinds of people you would normally think of as part of the open source community, because they are later [in their] careers, have been tremendously embraced by the broader open source community. Because they have such great talent in some of these areas where the open source community just never really had to struggle with some of these problems before. And so that’s been really encouraging and exciting to see.
Michael Goldberg is the editor of Data Informed. Email him at Michael.Goldberg@wispubs.com.