John Wallace, CEO of San Francisco-based marketing analytics company DataSong, says he is “married to big data, but started dating her before she was popular.”
She’s very popular now.
Wallace founded data analytics company Business Researchers in 2003 and guided its evolution into DataSong, which helps retail clients including Williams-Sonoma, Neiman Marcus, and DirecTV gain insights into their customers and optimize marketing spend. In June 2011, the company became an early adopter of Hadoop and began using MapR’s Hadoop distribution in February 2012. The company recently selected MapR’s Hadoop distribution for its central data hub.
Wallace sat down with Data Informed to talk about big data, what led him to Hadoop, and the effect it has had on his business.
Data Informed: How do you define big data?
John Wallace: For me, big data has two parts. The first part is observational data. This is data that was not collected with the intention of analyzing it. We just got it. It’s low-value log data from various sources. Part two is financial data, high-value data. Banks have had most of the high-value data. Today, a lot of low-value data is available. This data was not originally collected for analysis purposes, but that broad data can be used to help make marketing decisions. But it is easy to reach false conclusions and make potentially bad marketing decisions based on observational data. That’s where analytics expertise comes into play.
What was the business problem you were trying to deal with when you started working with Hadoop?
Wallace: We knew we had a product in our analytics software. The business problem was twofold: the need for scalability, and the need to scale cost-effectively. First, can it grow? And second, at what cost? Five years ago, to do what we are doing now, we would have needed a deal with the NSA to pay for it.
We focus on the problem of understanding the effects of marketing investments. The old saw is that marketers know that 50 percent of what they spend is wasted. The problem is, they don’t know which 50 percent. Our goal is to give our customers answers at the consumer level through analysis of marketing data. A big marketer might have relationships with 20 or 30 million households, with 3 or 4 million active at any one time. Our analytics are able to build timelines that attribute purchases at the individual customer level.
To do that kind of analysis on the old-model technology, traditional relational database and SAS analytics, you needed commercial-grade UNIX servers and storage and software licenses. So that really limited what you could tackle. For Williams-Sonoma, initially we did a prototype using traditional technology. Williams-Sonoma liked what we did, but they said, “We have six more brands we want to analyze.” How could we do it for six more brands?
The answer for us was to build our application on the Hadoop platform. So three years ago we made a bet on Hadoop. We struggled through the learning curve, and now we have a highly repeatable, highly scalable platform. Today we feel that we can walk into any prospective customer, the biggest retailers in the country, and be confident we can handle their data analysis needs, because we are not constrained by the need for a multi-million dollar platform.
The timing of Hadoop could not have been better. We were ahead of our time with our application, but were limited by the available platform technology. But with Hadoop, there was a viable alternative technology that made economic sense. There are a lot of open-source projects out there. Some make it and some don’t. With Hadoop, we got in at the sweet spot.
How difficult was the implementation of Hadoop?
Wallace: We did it in two steps. Getting on Hadoop to start with was pretty difficult. Hadoop is its own particular environment. How difficult it is depends on what you are doing. We wanted to build a complex application from scratch on Hadoop. That is a hard thing to do. There are lots of things that would be a lot easier. For example, many companies need the power of Hadoop to replace current tasks that are routine. Cleaning and aggregating data in Hadoop versus expensive ETL and database tools is a relatively straightforward task that has proven attractive in many cases. What we did was a lot more difficult. We had prototyped in SAS and were learning how to rewrite code from SAS into Java. Some of the work we are doing on the data is quite intense, so we had to go down the road of implementing many things in Hadoop from first principles.
We put a lot of work, a big investment in energy and time, into getting our application onto Hadoop. Our costs were almost all in people. It took several man-years to build an application that can shape unstructured log file data into model-ready data in a repeatable fashion.
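Editor's illustration: the "cleaning and aggregating" workload Wallace calls relatively straightforward follows the map, shuffle, and reduce pattern that Hadoop popularized. The sketch below imitates that pattern in plain Python on a handful of made-up log lines. It is not DataSong's code and does not use any Hadoop API; the log layout (household, channel, date) and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical raw log lines: household_id, channel, date (tab-separated).
RAW_LOG = [
    "h1\temail\t2013-01-05",
    "h1\tdisplay\t2013-01-06",
    "bad line with no tabs",
    "h2\tEMAIL \t2013-01-06",
    "h2\temail\t2013-01-09",
]


def map_line(line):
    """Map step: clean one line and emit ((household, channel), 1).

    Malformed lines emit nothing, which is the 'cleaning' part.
    """
    parts = line.split("\t")
    if len(parts) != 3:
        return []
    household, channel, _date = parts
    return [((household.strip(), channel.strip().lower()), 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_touches(lines):
    """Run map, shuffle, and reduce over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return reduce_counts(shuffle(pairs))


touches = count_touches(RAW_LOG)
for (household, channel), count in sorted(touches.items()):
    print(household, channel, count)
```

On a real cluster the map and reduce steps would run in parallel across many nodes and the framework would handle the shuffle; here everything runs in one process so the shape of the computation is easy to see.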
What were you doing before moving to Hadoop?
Wallace: With previous-generation technology, you were constrained by technology to deal with smaller data sets. To scale up, from, say, one product line to six product lines, you’d have to go out and buy six more UNIX servers, and more storage capacity, and more licenses.
Before, our mindset was always, “avoid processing data twice.” Now, because we can handle so much more data cost-effectively, we can be much more aggressive. We can say, “OK, we can see what happens over a year’s worth of data, now let’s run the analysis on two years’ data.”
What led you to go to the MapR distribution of Hadoop?
Wallace: Going to Hadoop was hard, but moving from Hadoop to MapR was painless. We moved to MapR two years ago. In the Hadoop world, that makes us old timers. We started with another Hadoop distribution. We moved to MapR because of the investment they had made in Hadoop. There are two parts to Hadoop: the file system, and the compute tier. MapR had developed their own file system to work with the Hadoop compute tier, so their file system’s compression and data storage work more efficiently. MapR is our data store for everything. It is our data processing and analytical engine, the main file system, the utility, the hub of everything.
How do your customers work with you?
Wallace: For our customers, the process is a little like bringing your income tax data to your accountant in a shoebox. Our customers give us unstructured data from a lot of sources. The data comes in a shoebox, so to speak, and we get it ready for analysis.
The goal is always to optimize the customer’s marketing spend strategy. First, we on-board the customer’s data. Typically it includes advertising server impression logs, clickstream data for websites, direct mail logs, and transaction data from stores, websites, call centers, etc. We pick up a whole year of media data and five years of financial data, actual sales transaction data. And new data comes in daily. We prep the data and then we run our analytics. The result is that we can see what works and what doesn’t work. The results we report become a kind of general ledger for the marketing department: a precise record of what went out and what came back in.
The capacity of the Hadoop platform means we can run a full year of data rather than a smaller sample of, say, a month. The ability to run a full year’s worth of data means we can get the right answers to our customers faster. If you are limited to a smaller sample, you don’t get all the factors into the analysis, for example, the impact of weather, holidays, etc. Therefore, you don’t get the analysis right, or not as right.
What information do your customers get back?
Wallace: They get a report on the incremental effect of their marketing spend, an incremental impact report each week. Some of our customers do daily reports. Before, customers could look at a weekly or daily report, but the report would be wrong. Why wrong? There are always going to be multiple touch points for the same transaction – Facebook, website, email, etc. Unless you can make a more accurate attribution, all touch points will get credit for the sale. With attribution analytics, the customer can see what percentage of incremental sales is attributable to social media, what percentage to email, what percentage to something else. As a result, they can have a lot more confidence in their knowledge of where to put their marketing spend. The ability to quickly analyze and determine the impact of a marketing action also allows them to try a new channel or a new approach and practically immediately see whether or not it is working.
We all see the impact of new developments in consumer-facing technology. Hadoop has had a comparable impact on enterprise technology. But the fact is our customers don’t care about Hadoop. What they care about is that the analytics, enabled by Hadoop, allow them to re-deploy marketing budgets to cut waste and increase yield. They can see the segments of the marketing investment that are not effective, and those that are effective. That ability, the ability to attribute sales to specific marketing investments, is the “Holy Grail” of marketing analytics.
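Editor's illustration: the over-crediting problem Wallace describes can be shown with a few lines of Python. In the sketch below, giving every touch point full credit makes the channel totals add up to far more than the actual sales, while splitting each sale across its touch points keeps the totals equal to revenue. The equal-split rule is a deliberately simple stand-in, not DataSong's model: the interview does not describe how the company's analytics assign credit, and an equal split does not measure the incremental effect Wallace refers to. The sales figures and function names are made up.

```python
from collections import defaultdict

# Hypothetical data: each sale lists the channels that touched the customer first.
SALES = [
    {"amount": 100.0, "touches": ["facebook", "website", "email"]},
    {"amount": 60.0, "touches": ["email", "website"]},
]


def full_credit(sales):
    """Naive reporting: every channel that touched a sale claims the whole sale."""
    credit = defaultdict(float)
    for sale in sales:
        for channel in set(sale["touches"]):
            credit[channel] += sale["amount"]
    return dict(credit)


def equal_split(sales):
    """Illustrative attribution: divide each sale evenly among its channels."""
    credit = defaultdict(float)
    for sale in sales:
        channels = set(sale["touches"])
        share = sale["amount"] / len(channels)
        for channel in channels:
            credit[channel] += share
    return dict(credit)


actual_revenue = sum(sale["amount"] for sale in SALES)
naive = full_credit(SALES)
split = equal_split(SALES)

print("actual revenue:", actual_revenue)
print("naive total claimed:", sum(naive.values()))
print("split total claimed:", round(sum(split.values()), 2))
```

With these two sales, the naive report claims 420 in channel-driven revenue against 160 in actual sales, which is the sense in which such a report "would be wrong."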
Beyond what it does for your customers, what benefits does Hadoop bring to DataSong?
Wallace: Performance and cost. Sometimes people think of Hadoop as providing infinite storage on commodity hardware. But that’s only part of the story. It is also parallel computing. The beauty of Hadoop’s parallelism is speed. If a job used to take 20 hours to run on one server, with Hadoop you can run it on 20 processors in one hour. The gain is almost linear.
In terms of cost, first there’s hardware. That’s a huge factor. Instead of big UNIX boxes and big storage hardware, we run on commodity Dell boxes running Linux. We have about 120 processors on about 15 nodes, with storage approaching a petabyte. By today’s standards I’d say that’s a small cluster. Each node has its own storage. We store all our projects on Hadoop: raw data, process data, and outcome data. There are also the cost savings inherent with open-source software. With Hadoop, our costs are almost strictly hardware costs. With lower software licensing costs, we can invest in even more hardware and run even faster.
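Editor's illustration: Wallace's example of a 20-hour job finishing in one hour on 20 processors is the ideal, perfectly parallel case. Amdahl's law is one standard way to see why real gains are "almost" linear: any part of the job that cannot be split limits the speedup. The sketch below works through the arithmetic. The 99 percent parallel fraction is an assumed figure for illustration, not a number from DataSong.

```python
def speedup(workers, parallel_fraction=1.0):
    """Amdahl's law: overall speedup when only part of a job can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def runtime_hours(single_server_hours, workers, parallel_fraction=1.0):
    """Estimated wall-clock time once the job is spread across workers."""
    return single_server_hours / speedup(workers, parallel_fraction)


# Ideal case from the interview: 20 hours on one server, 20 processors.
ideal = runtime_hours(20, 20)

# Assumed case: 1 percent of the work cannot be parallelized.
realistic = runtime_hours(20, 20, parallel_fraction=0.99)

print(f"ideal: {ideal:.2f} hours")
print(f"with 1% serial work: {realistic:.2f} hours")
```

Under that assumption the job takes about 1.19 hours rather than exactly one, which is still close to the linear gain Wallace describes.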
How has Hadoop helped you win business?
Wallace: We grew our company after major brand-name retailers had invested tens of millions of dollars in servers and licenses and teams of people to do marketing analytics. But once we built our system on Hadoop, we had systems that outperform those traditional technology systems. Now we have a technology advantage. So now prospective customers might want to say, “Why don’t we just send you the data?”
Before, we could only sell analytical expertise to the big companies that had made big investments in analytical infrastructure. Now we can sell a broader range of services to even the biggest retailers, because now we have the technology advantage. The scale of what we are doing today simply does not fit in the systems big retailers invested in a few years ago.
And we can also compete for smaller companies’ business, because we can offer our services cost-effectively, running on our commodity infrastructure. And we have lowered the entry cost, because we can do initial projects economically. That is attractive to customers.
What has been the impact of Hadoop on your company’s growth?
Wallace: Hadoop is the key to our ability to grow our business. We would have been ahead of our time, but Hadoop came along and let us invest a lot of time and money in developing our applications. We could make that investment with confidence that with Hadoop and MapR, we had the scalability and cost effectiveness we needed to grow our business. That was especially important because this is a bootstrap business. We have not taken any venture capital. We are growing our software and services business by 40 percent a year. We are at about 60 people, but we do not need to add 40 percent more people every year to support our growth. At this point it is a “rinse and repeat” business. Our strategy now is to onboard as many retailers as we can. We are ready to grow and confident we can handle the growth, that we can handle as many new customers as we can land. We are in a position to do that. We are in a position to look at the biggest retailers in the country, companies like Neiman Marcus and Williams-Sonoma, and handle their business. If we landed Walmart tomorrow, we would just have to add some servers. This thing is cruising. We are confident with a capital C.
Peter McGowan is a freelance writer living in Framingham, MA. He has written on a wide range of technology topics, including data storage and telecommunications. He can be contacted at firstname.lastname@example.org.