Why Big Data Projects Fail

by   |   March 5, 2013 3:46 pm   |   4 Comments

Stephen Brobst of Teradata

When approaching big data, the industry places a lot of focus on “the three Vs”: volume, variety, and velocity. Yet not nearly enough emphasis is placed on the most important V: value. For this reason, too many big data projects are undertaken without delivering the kind of return on investment that is possible in this emerging area of business opportunity.

There are four primary reasons that big data projects fail:

1. They focus on technology rather than business opportunities.

2. They are unable to provide data access to subject matter experts.

3. They fail to achieve enterprise adoption.

4. The enterprise lacks the sophistication to understand that the project’s total cost of ownership includes people as well as information technology systems.

Many of the big data projects and proofs of concept now underway are more about testing technology than uncovering business value. Downloading open source software from the Apache website and experimenting with Hadoop is an interesting exercise in programming, but this kind of endeavor is unlikely to yield business results.

For such efforts to yield value, there needs to be a business person providing direction for the project. The approach of throwing data into a file system or database and then swimming around in it, fishing for insights with the latest and greatest technologies, is bound to fail. Without business direction there will be no business results.

Project Champions, Business Analysts and Data Scientists
To be successful in extracting value from big data it must be possible for the business knowledge workers to effectively access and explore the data. In some of the more analytically-sophisticated organizations, a new “data scientist” function has emerged, with a different set of skills and job functions than a traditional business analyst.

While the business analyst uses data to answer business questions, a data scientist is not so much focused on answering business questions, but is instead focused on discovering new questions. Typically, a business analyst works with a BI (Business Intelligence) query tool using a point-and-click interface for specifying business questions and retrieving results.

A data scientist is more likely to work with data visualization and data mining tools to find patterns and relationships in the data that were not previously recognized. Once the patterns and relationships are identified, they can be translated to business questions to be answered within the domain of the business analysts.

A critical success factor for empowering a data scientist is to provide direct access to detailed data for exploratory purposes. In a big data environment, the diversity and non-relational formats of the data are a challenge for traditional analytic toolsets. Tools that generate ANSI SQL are not sufficient for manipulating big data content, which may be in the form of key-value pairs (such as weblog data), graphs (social networks, for example), text (as in social media), rich media (such as video and voice recordings), and so on.

New approaches for accessing data, such as the MapReduce programming framework invented at Google, have emerged to address these requirements. Yet, these so-called ‘No-SQL’ techniques have proven difficult for data scientists to leverage for advanced analytics; a computer scientist is usually required as an intermediary for accessing data.

Big Data Discovery
I once heard a data scientist at a large bank comment (in frustration) that “Hadoop is a great technology for storing large volumes of data at a low cost… but the problem is that the only people who can get the data out are the people who put the data in.”

To address this gap, big data discovery platforms have emerged as a key component of an ecosystem designed to optimize value creation from big data. A big data discovery platform is meant to provide direct access to big data content for a data scientist (without a computer scientist intermediary).

In order to accomplish this goal, there must be a data access interface that provides a higher level abstraction than flat file programming using MapReduce and Java or C++. Hybrid models which combine the power of MapReduce with the ease-of-use of SQL are required. There are a number of open source projects that have contributed languages such as Pig and Hive to close the skill set gap between data scientists and computer scientists. Yet these projects have so far failed to deliver the combination of efficiency and expressive power demanded by data scientists.
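The gap in abstraction level can be illustrated with a minimal sketch in plain Python. It is not Hadoop's API, and the weblog record format, the field names (`visitor`, `page`) and the function names are all assumptions made for illustration. The first half answers "how many hits per page?" with hand-written map and reduce steps over raw key-value text; the second half asks the same question declaratively in SQL, using Python's built-in `sqlite3` module as a stand-in for a relational engine.

```python
import sqlite3
from collections import defaultdict

# Hypothetical weblog lines stored as raw key-value text.
RAW_LOG = [
    "visitor=alice page=/home",
    "visitor=bob page=/cart",
    "visitor=alice page=/cart",
]

def parse(line):
    """Turn 'key=value key=value' text into a dict."""
    return dict(pair.split("=", 1) for pair in line.split())

# Approach 1: programmer-written map and reduce steps.
def map_phase(lines):
    """Emit a (page, 1) pair for every log record."""
    for line in lines:
        yield parse(line)["page"], 1

def reduce_phase(pairs):
    """Sum the emitted counts for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

page_hits_mr = reduce_phase(map_phase(RAW_LOG))

# Approach 2: the same question expressed declaratively in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weblog (visitor TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO weblog VALUES (:visitor, :page)",
    [parse(line) for line in RAW_LOG],
)
page_hits_sql = dict(
    conn.execute("SELECT page, COUNT(*) FROM weblog GROUP BY page")
)

print(page_hits_mr)
print(page_hits_sql)
```

Both approaches produce the same counts. The difference is who can write them: the first requires programming the data flow by hand, while the second states only the question.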

As a consequence, implementations in which SQL and No-SQL techniques can be combined using a ‘Not Only SQL’ interpretation of the No-SQL approach have been gaining popularity. Most commercial relational database vendors have already delivered or are actively working on such a capability.

A further common mistake in big data projects is falling victim to the “silver bullet” trap. New technologies are often perceived as silver bullets that will solve all problems. Open-source Hadoop is sometimes perceived to be the panacea for all analytic challenges. The reality, of course, is that no one technology solves all problems well. Success demands using the right tools to solve each part of the big data analytics challenge.

A Three-Platform Approach: Data Archive, Discovery, Production Analytics

Analytically sophisticated organizations including LinkedIn, eBay and Wells Fargo Bank have converged on an approach that involves three distinct platforms: a data archive platform, a discovery platform and a production analytics platform.

Open source Hadoop is often the favored technology for the data archive platform because of its scalability, high-performance loading and attractive economic characteristics. Data is stored and provisioned from the data archive platform. Data on this platform is usually stored in raw form using key-value pairs. However, data in this form is notoriously difficult to manipulate, and adoption of Hadoop as an analytic platform has been isolated to small cadres of computer scientists who have mastered the technology.

For enterprise adoption, more capability is needed in areas such as usability, manageability and security. For this reason, discovery platforms have emerged to fill the gap between Hadoop and traditional relational database platforms used for production data warehousing. A robust discovery platform will add in the aforementioned missing features of Hadoop and will also allow data scientists to work with both SQL and No-SQL programming techniques (on top of both relational and non-relational data).

The discovery platform is optimized for a small number of very sophisticated data scientists who design and execute data experiments in search of new insights. Data is brought in from the Hadoop archive platform in a fairly unrefined format so as to avoid creating delays in putting the data into the hands of the data scientists.

If no value is found from the data experiments, the experimental data is simply discarded from the discovery platform and new data experiments are undertaken (note that the data discarded from the discovery platform still remains on the archive platform). If value is found in the data, then it is promoted into the enterprise data warehouse platform where it is certified, auditable and can be re-used for production analytics.

This dynamic leads to the creation of what I call a unified data architecture, which is all about using the right tool for solving the problem at hand. An enterprise can use Hadoop for the data archive platform. The discovery platform provides beyond-SQL analytic capabilities along with database functionality in the form of optimized performance, usability, and security appropriate for a data scientist. No-ETL (extract, transform and load) techniques, with late binding on the discovery platform, provide the flexibility for data scientists to apply structure to data at query-time rather than at load-time. Meanwhile, the production analytics platform consists of the enterprise data warehouse, primarily implemented using early binding of data at load-time with traditional ETL techniques.
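The difference between early and late binding can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than any vendor's implementation: the event records, the field names and the helper functions `early_bind` and `late_bind` are all hypothetical.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (no ETL applied).
RAW_EVENTS = [
    '{"customer": "c1", "amount": "19.99", "channel": "web"}',
    '{"customer": "c2", "amount": "5.00"}',
    '{"customer": "c1", "amount": "3.50", "channel": "mobile", "coupon": "X1"}',
]

def early_bind(raw_lines):
    """Early binding: the schema is fixed at load time, as in traditional ETL.
    Fields outside the schema are dropped before anyone can query them."""
    table = []
    for line in raw_lines:
        doc = json.loads(line)
        table.append((doc["customer"], float(doc["amount"])))
    return table

def late_bind(raw_lines, fields):
    """Late binding: structure is chosen at query time, per question.
    Fields missing from a record come back as None."""
    for line in raw_lines:
        doc = json.loads(line)
        yield tuple(doc.get(name) for name in fields)

# Production-style table: one fixed shape, decided up front.
warehouse_rows = early_bind(RAW_EVENTS)

# Discovery-style queries: a different shape for each experiment.
by_channel = list(late_bind(RAW_EVENTS, ["customer", "channel"]))
with_coupon = [
    row for row in late_bind(RAW_EVENTS, ["customer", "coupon"]) if row[1]
]

print(warehouse_rows)
print(by_channel)
print(with_coupon)
```

In this sketch the `coupon` field never reaches the early-bound table, yet it remains available to a late-bound query. That is the flexibility a data scientist gains on the discovery platform, at the price of repeating the parsing work on every query.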

Of course, not all enterprises will require all three of the platforms described above. For example, if the volume of big data has not yet accumulated to large scale then it may be prudent to defer deployment of the Hadoop platform until the economics make sense for doing so. Similarly, it may make sense to carry out discovery and production tasks on the same platform in circumstances where it will simplify the overall architecture to do so.

ROI and Total Cost of Ownership
In order for a big data initiative to be successful, it must deliver a positive return on investment. Yet, the investments necessary to be successful are widely misunderstood. Investments are not just in technology, but also in people with the right skill sets. For example, deployment of Hadoop is often perceived to be free because it is open source and has no software license cost.

The challenge is that organizations often do not make the required people investments to extract the value from their big data when using “free” software: installing software on a cluster of servers is not enough.

In this respect, Hadoop can be likened to a free puppy. The acquisition cost is free, but care and feeding for the environment is definitely not free. Organizations must invest in data scientist skill sets and the operational staff to keep the system up and running in order to get value from it.

Total cost of ownership is what matters—not just acquisition cost. Keeping this in mind will help organizations to make better choices regarding using the right technology for the problem to be solved. Optimizing total cost to value involves investing in the right technology and skill set combinations, understanding which technology is most efficient for which workloads and engineering an ecosystem which allows the selected technologies to work well together.
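The arithmetic behind the “free puppy” point can be sketched as follows. Every figure below is a made-up placeholder chosen only to show the calculation, not an estimate of real Hadoop costs, and the `PlatformCost` class is a hypothetical helper.

```python
from dataclasses import dataclass

@dataclass
class PlatformCost:
    """Yearly cost inputs for one platform; all figures are placeholders."""
    license_per_year: int
    hardware_per_year: int
    staff_count: int
    cost_per_staff_per_year: int

    def acquisition_cost(self) -> int:
        """What the software itself costs to obtain for one year."""
        return self.license_per_year

    def total_cost_of_ownership(self, years: int) -> int:
        """License, hardware and people costs summed over the period."""
        people = self.staff_count * self.cost_per_staff_per_year
        return years * (self.license_per_year + self.hardware_per_year + people)

# Hypothetical open source deployment: no license fee, but servers and staff.
free_puppy = PlatformCost(
    license_per_year=0,
    hardware_per_year=100_000,
    staff_count=4,
    cost_per_staff_per_year=150_000,
)

print(free_puppy.acquisition_cost())
print(free_puppy.total_cost_of_ownership(years=3))
```

With these placeholder inputs the acquisition cost is zero, while the three-year total is dominated by the people line. Comparing technologies on the total rather than on the license fee is the comparison the article argues for.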

Organizations that approach big data from a value perspective, with partnership between the business and IT, are much more likely to be successful than those which adopt a pure technology approach. For this reason, making appropriate investments in both technology and organizational skill sets to ensure enterprise capability in extracting value from big data is essential. Aligning technologies, skill sets and costs is also fundamental to optimizing total cost to value and making big data projects successful.

Stephen Brobst is the chief technology officer for Teradata Corporation. He is widely regarded as a leading expert in data warehousing.

  1. Tom Deutsch
    Posted March 6, 2013 at 1:41 pm | Permalink

    Yikes! Either Teradata is feeling threatened here or they haven’t been actually working with the technologies because much of what is said here is just flat out incorrect. This is basically a more eloquent take on the big data myths argument, as much of this is based on straw man (logical fallacy) arguments.

    Just for starters let’s look at their “four primary reasons that big data projects fail” – Teradata assertions in quotes.

    “1. They focus on technology rather than business opportunities.” NOPE. This is use case selection 101, and was the case maybe 2 years ago but I haven’t seen a major Enterprise do this in sometime now. Any vendor that suggests anything like this or doesn’t have a methodology which prevents it should be shown the door.

    “2. They are unable to provide data access to subject matter experts.” WHY? This isn’t hard to do frankly, not as easy as traditional SQL technologies, but not hard to do. Between data provisioning, visual tools, hooks from SPSS, R and RevolutionR, and SAS and the BI tools it is unclear why anyone would say this frankly. And with the BigSQL and Impala deliverables it is about to get much-much easier.

    “3. They fail to achieve enterprise adoption.” BY DESIGN – this is a false standard and shows legacy EDW thinking. Hadoop isn’t a 1:1 map to a traditional EDW, and holding it to that standard is a violation of a Fit For Purpose approach (which is a big no-no). The role these environments play is different than an EDW, which is why they don’t replace the EDW. Boiling the ocean on a full Enterprise deployment is a worst practice frankly, and they should know that.

    “4. The enterprise lacks the sophistication to understand that the project’s total cost of ownership includes people as well as information technology systems.” MAYBE, but here again any partner or vendor with experience will prevent that from happening since their methodologies explicitly address this.

    I can go through the rest of the article if members here want…

    — Tom Deutsch, Program Director, Big Data Technologies and Advanced Analytics, IBM

  2. Stephen A. Brobst
    Posted March 14, 2013 at 8:26 pm | Permalink

    IBM is feeling a bit defensive, I see. This sounds like a sales/marketing comment if I ever heard one. Out of professional courtesy, I was not identifying the offending vendors by name, but since you self-identified I will point out that one of the worse practices is engaging with a vendor who re-brands their legacy products as “Big Insights” and pretends to add value in doing so. Cobbling together a grab bag of products combining legacy software, lab prototypes and over-priced services delivery is clearly not a formula for success. The point of the article, of course, is that the reasons that big data projects fail are preventable with good practices implementation and methodology. Teradata’s many successful customers, as well as superior positioning in the Gartner Magic Quadrant, speak quite clearly to our ability to leverage big data technologies into implementations that deliver business value with appropriate business-driven methodology.

  3. willcoxm
    Posted March 14, 2013 at 11:33 pm | Permalink

    Slightly surprised by the passion of your reaction to Stephen’s post, Tom – especially as it seems to me that on the substantive points you and Stephen are actually both in violent agreement. The main thrust of Stephen’s argument is surely that:

    (1) The new data management and analysis technologies are extending the established ones rather than replacing them, hence the requirement for a Unified Data Architecture;
    (2) We need to take care that the new technologies are not solutions looking for problems to solve, rather than the other way around.

    Since as you point out “the role [that] these environments play is different than an EDW” and “any vendor that… doesn’t have a methodology which prevents… [poor use-case selection] should be shown the door”, I actually see much less disagreement than you allow.

    I think that we have to acknowledge that there is something of a feeding frenzy sweeping the industry right now; you know that “Big Data” has jumped the shark when Dilbert’s Scott Adams has the evil pointy-haired boss intoning, “let us pay” to the All-Knowing God Who Lives In The Cloud. You may not have “seen a major Enterprise [deploy NoSQL technology without a good use-case] in sometime now”, but I still meet plenty of senior stakeholders in end-user organizations who are clearly and demonstrably thoroughly confused by the alphabet soup of new technologies. And at least some service-led vendors who are only too happy to position cheap-to-acquire-but-expensive-to-run technologies – regardless of whether they are appropriate to the task-at-hand or not – because it is man-days of consultancy that they are peddling, rather than licensed software. We are all of us, after all, selling something…

    Lastly, I am sure that Teradata can be accused of many things (even I have not drunk so much of the Kool-Aid that I claim we are perfect!), but I don’t think that we can really be reasonably suspected of “feeling threatened” by the new data management technologies – not just weeks after Gartner awarded us the leadership position in their 2013 Data Warehouse DBMS Magic Quadrant, specifically citing our Unified Data Architecture vision. If we are running anywhere, it is towards the sound and the fury – which is, of course, no less than our 1400+ customers demand and expect of us.

  4. Tom Deutsch
    Posted March 15, 2013 at 1:24 pm | Permalink

    OK guys – let’s get real here as your spreading more FUD isn’t really helping the discussion.

    Stephen – I at least had the courtesy to respond to what you wrote rather than just making stuff up.

    Stephen, when you write “worse practices is engaging with a vendor who re-brands their legacy products as ‘Big Insights’ and pretends to add value in doing so” you are either really out of touch or spreading more FUD.

    BigInsights was a brand new offering when released that natively uses Hadoop as an underlying engine (our distro or Cloudera). The NLP engine, Jaql language (opt-in since Pig and Hive and HBase and all the other Apache components are natively available), our scheduling algorithms, security management, Entity engine hooks and 20+ other capabilities are all BigInsights net-new capabilities.

    Lastly I’d just add that the architectures customers are looking for are information management, not EDW-centric, and are moving to a Fit For Purpose approach.
