When considering an approach to a big data analytics project, decision-makers often focus on which of the three major techniques best fits their use case: an ad-hoc approach, batch analytics, or real-time analytics. Each of these approaches has its place in particular situations, but developing an effective project demands that they be used in conjunction with one another, in the context of the goals of the project.
Even a cursory understanding of the technologies would seem to offer easy answers: batch processing if you have big chunks of data, real-time analytics if the data is streaming, and ad-hoc queries to tackle cases in between. The reality is much more nuanced, however.
Arranging technologies to address a particular use case is a little bit like arranging instruments in a symphony. Each has particular qualities and plays a different role in the end product, but that end result should be a seamless meshing of the individual elements to achieve the desired effect such that the unique qualities of each one fade into the background. The same is true of big data use cases.
In my experience working with businesses developing big data projects, one of the most common pitfalls that causes these projects to fail is an initial focus on technologies, a focus that is not rooted in a firm understanding of the practical realities of the project: the project goals, the projected time to business value, the nature of the available data, and other realities like scope and budget. When a project begins with a technology in mind, those realities tend to fall by the wayside, and the result is a big data project that withers on the vine, unable to bear the fruit its initiators envisioned. An hour-and-a-half oboe solo is not likely to satisfy a symphony audience, and likewise a big data project that focuses on only one analytic technology is not likely to satisfy most use cases.
Instead, when a project begins with a thorough understanding of the use case, it often becomes clear that a multifaceted approach makes the most sense.
For example, when my company Infochimps worked with the consultancy arm of the Canadian media company Postmedia Network, we structured our approach around integrating these technologies to meet Postmedia’s particular needs. The company was looking to beef up its media analysis offerings with ways of analyzing historical data together with new information, including real-time social media. Postmedia’s initial work with a data scientist had produced solutions that started from what could be done with a host of interesting technologies, rather than from the business problem, and the company realized that this approach would not work.
Whether data scientists are using legacy statistical software tools or newer tools born out of innovation at Google, Yahoo, Facebook and the like, the curiosity of data science experts needs to be guided with the application in mind. For Postmedia, this meant producing metrics around advertiser activity over a historical period as well as what is trending in the last second. That requires a big data platform capable of combining the two approaches within the company’s application.
At Postmedia, the “conductor” was the general manager of platform and experience, who led three movements, much like Igor Stravinsky conducting the premiere of his Symphony in Three Movements with the New York Philharmonic in 1946. Postmedia’s executives realized that the work of a single data scientist essentially experimenting with several big data “instruments” was not going to help them achieve their business goals within a reasonable time.
We ended up working with Postmedia to develop a cloud-based system flexible enough to integrate the diverse technologies—batch, ad-hoc, and real-time analytics—that contemporary media analysis requires. Using these technologies, Postmedia has been able to offer its customers in the advertising industry services that have their objectives at the core: insightful media analysis across traditional and new media channels.
Social media analytics allows companies to analyze new media channels like Twitter, Facebook, and the blogosphere in real time. Batch analysis of historical data provides context for how the real-time behavior of customers compares to the past, and in many cases helps businesses like Postmedia and their customers predict the future.
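To make the pairing concrete, here is a minimal sketch of the pattern described above: a batch layer computes a historical baseline from past data, while a real-time layer counts events in a sliding window, and the two are compared to flag a trend. All names, data, and thresholds are illustrative assumptions, not details of Postmedia’s actual system.

```python
from collections import deque

def historical_baseline(daily_counts):
    """Batch layer (illustrative): average brand mentions per hour,
    derived from historical daily totals."""
    return sum(daily_counts) / (len(daily_counts) * 24)

class SlidingWindowCounter:
    """Real-time layer (illustrative): count mentions seen in the
    most recent `window` seconds of the stream."""
    def __init__(self, window=3600):
        self.window = window
        self.events = deque()  # timestamps of observed mentions

    def record(self, ts):
        self.events.append(ts)
        # Drop events that have aged out of the window.
        while self.events and self.events[0] <= ts - self.window:
            self.events.popleft()

    def rate(self):
        return len(self.events)

def is_trending(counter, baseline, factor=3.0):
    """Flag a brand when real-time volume well exceeds the batch baseline."""
    return counter.rate() > factor * baseline
```

The design choice this sketch illustrates is that neither layer alone answers the business question: the streaming count says what is happening now, but only the batch-computed baseline says whether that volume is unusual.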
Matching Technical Approaches to Business Challenges
Postmedia has thousands of prominent advertisers who ask questions that can only be effectively addressed with a comprehensive big data analytics approach, such as which media channel is performing better for a given campaign (comparing broadcast versus print, for example). How does real-time chatter affect my brand? What is the impact of media coverage of my brand? Did my brand campaign perform well? What risks or opportunities did this campaign present? Integrating multiple media analytics platforms into one to track, report on, and analyze the metrics of 360-degree campaigns allows the company to find answers to these kinds of questions.
It is good to keep in mind that the big data technologies of batch, ad-hoc, and real-time analytics were developed to address specific technical challenges, not specific use cases. The real-world questions that initiators of big data projects want to answer are almost always more complex than the technical cases the technologies were designed for. Postmedia’s situation is representative of almost all big data use cases: the data sources and other practical realities demand that all available tools be examined and arranged in concert to achieve the project’s goals.
How can you deliver your symphony in three movements?
Jim Kaskade is the CEO of Infochimps, a unit of CSC that provides big data analytics services in the cloud. Prior to joining Infochimps, he was entrepreneur-in-residence at PARC, a Xerox company, where he led its big data program. Follow him on Twitter: @jimkaskade.