Is Big Data Really on a Bullet Train to the Cloud?

June 20, 2017 5:30 am

John Mertic, Director of Program Management for ODPi and Open Mainframe Project at The Linux Foundation

AtScale released its annual Big Data Maturity Survey report (formerly the “Hadoop Maturity Survey”) late last year, showing that Business Intelligence for Big Data is a top priority, that Big Data is on the fast track to the cloud, and that data governance is a growing concern.

The report and its results became a topic of discussion at a recent meeting of the ODPi User Advisory Board (UAB), which consists of representatives from large enterprises using Apache Hadoop and other Big Data technologies across the automotive, technology and entertainment industries, among others. The ODPi UAB strongly agreed with the growing data governance concerns presented in the report. Self-service access to Big Data, and the governance of that self-service, really struck a chord. We debated the concept of getting people to help drive policies for self-service access, reflecting on the growing number of data scientists who sit in lines of business rather than in IT. The group agreed that the current state, in which IT alone holds governance and security control, is not cost-effective, and that self-governance could be a tactic to help scale usage.

On the topic of Big Data in the cloud, the ODPi UAB members see their future in a hybrid cloud model. While they agree that we will see more cloud usage over the next three years, they will be building on existing investments rather than fully replacing what they have. Additionally, the group expects existing central processing to be complemented by local processing nodes, both to help with scaling needs and to better comply with country-specific regulations. Currently, UAB members see moving full scale to the cloud as cost prohibitive, but cloud becomes more interesting as use cases grow that would otherwise require home-running IoT data, that is, hauling all of it back to a central data center.

Conversations with the ODPi UAB then veered toward their own usage patterns. In this article, I’ll dive into these usage patterns, the insights presented by the ODPi UAB and their first-hand experience with the cloud’s role in Hadoop and Big Data.

Pre-production and Production Hadoop Defined

There are clear differences between pre-production and production Hadoop. Table 1 below outlines the core differences in operational Hadoop as enterprise usage shifts.

Table 1 - outlines the core differences in operational Hadoop as enterprise usage shifts


AtScale reported that 73 percent of respondents are in production usage, an eight percent YoY increase from 2015. Gartner reports much more pedestrian numbers, in the 15 percent range, from its business surveys. With this much variance, it’s evident that the industry lacks clear baseline definitions of deployment stages. Working with the UAB and ODPi members, we’ve come up with this detailed deployment continuum that we believe accurately matches the market.


Enterprise Hadoop Deployment Continuum


Plotting where you fit in this continuum gives a more objective view of what production usage looks like.

Analyzing Cloud for Big Data

While 72 percent of the survey respondents indicate plans to do big data in the cloud, deeper analysis would likely reveal some important nuances. As UAB member Nicholas Berg, Director of Enterprise Analytics at Seagate Technology, stated, “To drill down, I would also ask: What proportion are your applications and IT on premise and in the cloud? What proportion of your big data is on premise and in the cloud? And, finally, how much do you actually have in production in the cloud? I’d bet the percentage is much smaller.”

Another ODPi UAB member commented, “I think a lot of companies are just doing PoCs in the cloud but still do production deployment on premises.” The UAB felt the primary driver here was IT’s need to minimize risk and cost.

That leads to the biggest unanswered question on cloud: cost. For years cloud vendors have been pushing the narrative of “reduce costs and complexity – move to the cloud,” “avoid the capital expense of server purchases – just leverage the cloud,” and so on. However, one factor the ODPi UAB called out loud and clear was that the cloud gets more expensive when you get to scale.

One ODPi UAB member spoke to their own experiences, noting “Recently, we’ve been doing some cost analysis of this and the numbers just don’t work – not when you’re talking about ingressing or egressing the volumes of data that we’re thinking of. That’s where the costs of cloud vendors really start to skyrocket, thus for a full shift to the cloud the numbers just don’t work. So we’re still considering it for experimental purposes, but I don’t see how we could ever move entirely to the cloud or even consider moving a large amount of our systems out there.” Another added, “I found that the cloud works for small subsets, maybe aggregation if you will, but once you’re trying to do long-term storage it gets prohibitively expensive very quickly.”

Does that mean our ODPi UAB members feel Hadoop and Big Data in the cloud are doomed unless the cost economics come down? Definitely not. To them, the problem is that the narrative of cloud vs. on premises is constantly framed as an “either-or” argument rather than a “both” discussion.

Cloud: A Use Case for IoT Data

One ODPi UAB member shared a use case with IoT data: “Cloud does start to make more sense when you are looking at solving certain problems with bringing data from different regions in for processing. This is especially true when we start talking IoT; do you really want to home run all of your IoT data?” Looking at the use case described, this member concluded, “You’re talking about home-running everything to your own data centers. That becomes a problem, both in terms of volume, but also in legal terms. Look at Germany. Look at China, etc. Unless you’re going to bring up data centers in those countries, you have a whole set of other problems on your hands.” Others agreed with this sentiment, saying, “As we consider our cloud presence, it becomes a challenge. As a global company with offices all around the world, we don’t want to put a data center in every country, so cloud is a good option for that.”

From all of the ODPi UAB feedback above, it’s clear that cloud computing will become a larger component of Hadoop and Big Data deployments, but in balance with the cost and efficiency needs of an organization. In other words, as with other technology shifts to cloud, the most realistic answer tends to lie in some form of hybrid. Berg said it best, noting, “I think with time, as Hadoop expands in the cloud and hopefully cost structures get more reasonable, things will start to shift. I think that’s going to happen, it’s just a matter of time. We’ll likely always have some sort of hybrid deployment, which will likely swing the higher usage of cloud as the economics and use case better align over time.”

The Next Big Thing

We live in an industry – and frankly a world – on the hunt for the next big thing. It fuels our desire to grow and evolve. Industry pioneers often refuse to accept the status quo, recognizing that perfection is never achieved yet always something to continually pursue. As Vince Lombardi once said, “Perfection is not attainable, but if we chase perfection we can catch excellence.”

That being said, one thing that has rarely been fruitful is forcing innovation along. Technology ebbs and flows, responding to the challenges and opportunities of modern life. Professional athletes, when interviewed, often speak about how they “play what the game gives them”: they take a step back, see the larger picture and respond with what is needed to win, because the same formula doesn’t work every time. The same holds for technology, but the scale is larger and the players aren’t always well defined.

Perhaps the takeaway here is the value of dissecting a report like this for context. One way to do this is within a community of fellow data practitioners, such as those the ODPi UAB and SIGs foster. Communities bring new ideas and fresh perspectives, which spur new technologies and solutions. And sometimes this brings us something game changing, like Hadoop, when we least expect it…


John Mertic is Director of Program Management for ODPi and Open Mainframe Project at The Linux Foundation. Previously, Mertic was director of business development software alliances at Bitnami. Mertic comes from a PHP and Open Source background, being a developer, evangelist, and partnership leader at SugarCRM, board member at OW2, president of OpenSocial, and frequent conference speaker around the world. As an avid writer, Mertic has published articles on IBM Developerworks, Apple Developer Connection, and PHP Architect, and authored the book The Definitive Guide to SugarCRM: Better Business Applications and the book Building on SugarCRM.

