It’s been almost two years since the IT leaders at Sears Holdings unveiled MetaScale, a business devoted to helping companies adopt big data analytics. And to listen to Ankur Gupta discuss the company’s experience so far is to hear that enterprise IT shops face their own stages of an emotional journey.
First, there is the realization stage that datasets are growing so large that traditional relational database systems won’t be enough to manage them. Second is the rosy assumption stage: since new technologies like Hadoop are open source, they must be simple to use. Frustration comes third, when an IT organization sees that it’s not so easy. Finally there’s acceptance, when IT people see the benefit of finding support from those who journeyed before them.
“Hadoop, or big data, is challenging the thinking, challenging the experience that we all are used to, which is relational databases,” says Gupta, general manager for sales, marketing and operations at MetaScale. “That is what our database [administrators] are used to, and that is what our systems are built around.”
Making the adjustment to new systems is what MetaScale was established to help customers do, and Gupta, of course, wants his firm to be the one that IT shops call. In an interview, he reveals the lessons, and the empathy, that he has developed for IT shops in his four years of working with Hadoop and other technologies in a 1,000-node cluster that works in conjunction with an enterprise data warehouse, mainframes and other longstanding IT assets. For example, MetaScale has found that COBOL programmers are well positioned to program for Pig, a platform for creating MapReduce programs with Hadoop.
What follows is an edited transcript of an interview with Gupta conducted December 18, 2013 at Data Informed’s offices in Dedham, Mass.
Data Informed: What are the kinds of lessons you learned or mistakes that you had to endure to be able to tackle the challenges that you are talking about now?
Ankur Gupta: Many of those problems were technical problems. Some of them are simple technology questions. How do you ingest data from Teradata or from a mainframe into Hadoop? How do you put the data back?
We are a company with a lot of legacy systems on our site. How do you use Hadoop in combination with those systems? Do you replace those systems? Do you use Hadoop as a complementary system to those? How do you build these various connectors between these somewhat disparate systems?
And then when you do that, and you run various jobs you will run into weird errors that there is no answer to.
So you need to figure out [those challenges], what it means, and go deeper into the whole technology landscape.
That’s one piece of it. The other pieces are [that] there are many commercial offerings out there, so which one suits your specific needs better?
Do we need commercial servers? That could be fairly expensive, it could add to the cost. Or, how do you think through several years down the line on the data side? When we started, we were thinking about 10 or 100 terabytes, now we are running data into petabytes.
Not just for us; for everybody, the data size is growing every day, every year, because of the digital life we all live in and the social media that we are all influenced by. So you need to plan for all that.
We needed to think through how you grow your distribution, grow your Hadoop footprint. As I said, we now manage a cluster of more than a thousand nodes on our side, and that footprint continues to grow for us.
Initially when we started, Hadoop for us was simple, just a cheap storage device. From there, we built this whole new initiative where we moved a lot of our batch processing from mainframes and other EDWs (enterprise data warehouses) to Hadoop.
So because of its very strong parallelization power, we could actually dramatically reduce our MIPS usage. MIPS is millions of instructions per second; that’s how you pay for a mainframe. And that actually helped us save several millions of dollars. We actually took a whole mainframe off our grid, so you can imagine the savings from that.
At Sears, you started in retail. How does that experience translate into other types of industries or other types of use cases?
Gupta: When you think about it, we’re talking about the back end infrastructure. And that is more or less consistent across industries. So while we were born out of the retail business, our solutions and offerings are independent of that vertical. Today we have clients in the financial domain, health care, technology, retail of course. And we continue to talk to and continue to get inquiries from companies in media, in gaming, wherever you have more and more data.
So the technology, the infrastructure and how data is ingested and used is somewhat neutral, I wouldn’t say the same across industries, but the footprint, the genesis, the big picture is more or less the same.
We have no problem utilizing the learning and expertise from the retail world and importing it to other industries. Where it gets very specific to an industry is, for example, if you are talking to a health care client and they are looking for whether you have a particular certification or not. So in the retail world and in health care you may have a lot of customer, client data or doctors’ data coming in. And the security on both may be somewhat different. You configure your appliance differently for that, but the underlying infrastructure is the same.
That is where you see differences. But when it comes to actual analytics, or the top-most presentation layer as you call it, that’s where the industry starts to segregate in my opinion a little bit more.
In your work, have you moved beyond the back-end infrastructure? Your work now goes beyond that?
Gupta: Yes it does. But to be very candid, where a lot of organizations are today, they are still in the initial phase. They are still talking about: What are our use cases? What is our big data strategy? There are not a lot of organizations that are in the very advanced stage of using Hadoop or talking about using machine learning and predictive analytics, and those that are don’t need MetaScale, essentially.
Our strength area, the core to us is helping companies build out their big data strategy. Help them get started. Set up infrastructure for them. And then from there, either to introduce them to a client, introduce them to another vendor partner, who can actually provide a specific solution for them. So for example if there is somebody looking for a NoSQL database or a particular analytics platform we have partnerships with many that are out there.
We could introduce those to our client. But where MetaScale is today, our core strength is in getting companies started, getting companies going, setting up their infrastructure, taking care of their infrastructure, and also providing knowledge and education, which is training them on Hadoop, training them on big data technologies, training them on both the technology side and business side.
So I would say that we are the initial, the strategic partner in the overall journey that helps companies identify who could be the right partner moving forward. Because MetaScale doesn’t have analytics products.
What are those conversations like? Who is it that you are talking to, to help them begin their journey?
Gupta: It’s interesting. Depending on the company, stakeholders vary a lot. Generally, we have seen [interest] come from a CMO or CIO, who may have heard about big data somewhere, or may want better analytics, and, generally speaking, they tell their IT folks to go and figure it out.
So most of the time, more than I want to, you actually have IT people as your first point of contact. What happens there is, when folks from IT see this, they think it’s doable in house. More often than not, they see it’s an open source technology, and they think they can pretty much put a quick couple of node clusters together and play around with it. In theory, yes, it makes sense, but in practice it’s very, very complex. It takes time and experience to make efficient use of the whole Hadoop infrastructure.
So, to your point, generally we start with IT. A lot of the time we also get to talk to [the business side], whether it is folks on the analytics side, who are providing analytics to various businesses, or even people on, say, the marketing side or finance side, who are looking to get specific data points, a specific report. However, they don’t care what the underlying infrastructure is, to be honest.
They don’t care if you give them a report that’s built on a Teradata machine or a mainframe machine or Hadoop, so you end up working with IT counterparts, those who are actually responsible for making the infrastructure-level and system-level decisions.
Are there improvements in Hadoop you would like to see? Developments you would like to see happen?
Gupta: Yes, a couple of things that have been ongoing concerns. Data append is an issue in Hadoop. And some of our clients actually talked about it; they have problems with it. And there’s an open ticket for that. It’s not been fixed. And that’s stopping some of our clients from going to that next level of Hadoop version. It’s an ongoing issue in all Hadoop versions, even Hadoop 2.0. I would like that to be fixed.
I would like security [addressed] which continues to be a concern for many. We have not seen [that concern] because we have managed security through access control, through added security layers and what not. But in the native version there are concerns about that. So I would [like to] see much stronger security protocols and security layers in the native version.
The third [issue] would be, one of the concepts that we have continued to talk about is using Hadoop as an enterprise data hub. And I would like to see that become a little bit more mainstream. I just saw that Cloudera announced its latest Hadoop distribution with Hadoop as a data hub, essentially.
Amazon earlier announced their version that allows you to get data from various sources very quickly within their EC2 [Elastic Compute Cloud] part of their AWS [Amazon Web Services].
I think there’s a lot more innovation happening. I’d love to see a lot of this innovation becoming mainstream, available to the public so they can play around even more and make it even more useful.
Does MetaScale have the resources to contribute to the community?
Gupta: Today we are a user. I do expect us to develop more and more talent that will play a key role in actually contributing to the code, the fixes and what not.
Are there things you want to do better, or hope to enhance? Things to get better on the back end?
Gupta: We continue to build out new products. A ready to go Hadoop appliance is one. Our education courses are another one that we are bringing out to the market. We continue to expand our offerings.
Today MetaScale is an enabler; today it’s a starter [partner] for many of the companies. But I think soon enough, and as more and more companies get on this journey, they won’t be in that initial phase.
So my vision is to build an organization where we have practitioners that include data scientists, that include advanced Hadoop architects and users who are thinking about the next version of Hadoop already.
Not only are they contributing to the Apache Hadoop project, they are actually helping us build those specific solutions, by industry, that are easy to deploy.
My vision is to be [a company that offers] those industry-specific products. Do we have a health care analytics platform, a Hadoop for health care, a Hadoop for whoever. And there’s analytics already built on top of it. I think that’s where I see us going.
When you think about the activity here, what kind of value do you see? You talked about costs savings so that is a big one. Are there other types of value?
Gupta: There is value in both bottom line and top line, revenue growth. I will give you an example. On our side, there was a pricing job that took over 20 hours to run on mainframe systems.
For that job we pretty much lose a day. So how would you make a pricing decision? How would you change your pricing based on customer feedback, or based on market conditions, based on inventory condition, and what not, if your job itself is taking so long. So when we moved it to Hadoop, we could run it in minutes, not hours now.
And that allowed us to do what we call more dynamic pricing: to change pricing based on how market conditions are changing, which is actually real time. So if you have snow here in Boston and you have a shortage of snow blowers, and I’m just giving you an example, what is the right price?
What is the right quantity we need to have? We should have much higher inventory at that time here, at the right price for our customer. So by moving that job to Hadoop we not only saved costs by using fewer MIPS on the mainframe, but we are also able to make a better, quicker business decision and respond to the market quickly.
And this is just one small use case. We have numerous use cases where we are able to get data.
Another one I will give you is that we have Wi-Fi in our stores. If you go to our Sears stores you can get on Wi-Fi, but now we want to see what people do on these Wi-Fi [networks]. What do they do with it? We are providing this as a service, but is it useful to us or not? That means collecting and looking at all the data points, like every device that is coming in there, phones, iPads, laptops, whatever, and then being able to look at the data across thousands of stores and make sense out of it.
I think without big data technologies and Hadoop, it would not have been, I wouldn’t say impossible, but I would say it wouldn’t have been cost effective for us to do so. So now we can analyze behavior of what website customers are going to. What are they looking at, what are they expecting?