When approaching big data, the industry places a lot of focus on “the three Vs”: volume, variety, and velocity. Yet not nearly enough emphasis is placed on the most important V: value. As a result, too many big data projects are undertaken without delivering the return on investment that this emerging area of business opportunity makes possible.
There are four primary reasons that big data projects fail:
1. They focus on technology rather than business opportunities.
2. They are unable to provide data access to subject matter experts.
3. They fail to achieve enterprise adoption.
4. The enterprise lacks the sophistication to understand that the project’s total cost of ownership includes people as well as information technology systems.
Many of the big data projects and proof-of-concepts now underway are more about testing technology than uncovering business value. Downloading open source software from the Apache website and experimenting with Hadoop is an interesting exercise in programming, but this kind of endeavor is unlikely to yield business results.
For such efforts to yield value, there needs to be a business person providing direction for the project. The notion that one can throw data into a file system or database and then swim around in it, fishing for insights with the latest and greatest technologies, is bound to fail. Without business direction there will be no business results.
Project Champions, Business Analysts and Data Scientists
To be successful in extracting value from big data, it must be possible for business knowledge workers to effectively access and explore the data. In some of the more analytically sophisticated organizations, a new “data scientist” function has emerged, with a different set of skills and job functions than a traditional business analyst.
While the business analyst uses data to answer business questions, a data scientist focuses less on answering known business questions than on discovering new ones. Typically, a business analyst works with a BI (Business Intelligence) query tool using a point-and-click interface for specifying business questions and retrieving results.
A data scientist is more likely to work with data visualization and data mining tools to find patterns and relationships in the data that were not previously recognized. Once the patterns and relationships are identified, they can be translated to business questions to be answered within the domain of the business analysts.
A critical success factor for empowering a data scientist is to provide direct access to detailed data for exploratory purposes. In a big data environment, the diversity and non-relational formats of the data types are a challenge for traditional analytic toolsets. Tools that generate ANSI SQL are not sufficient for manipulating big data content, which may be in the form of key-value pairs (such as weblog data), graphs (social networks, for example), text (as in social media), rich media (such as video and voice recordings), and so on.
New approaches for accessing data, such as the MapReduce programming framework invented at Google, have emerged to address these requirements. Yet, these so-called ‘No-SQL’ techniques have proven difficult for data scientists to leverage for advanced analytics; a computer scientist is usually required as an intermediary for accessing data.
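To make the MapReduce model concrete, the sketch below simulates its two phases over a few hypothetical weblog lines (the sample records and field layout are illustrative assumptions, not from any real system). The map phase emits a key-value pair per record; the reduce phase aggregates values by key. In a real Hadoop cluster these phases run distributed across many machines, which is precisely the programming burden that typically requires a computer scientist as intermediary.

```python
from collections import defaultdict

# Hypothetical weblog records (illustrative sample data).
log_lines = [
    "2013-06-01T10:00:00 GET /home 200",
    "2013-06-01T10:00:01 GET /products 200",
    "2013-06-01T10:00:02 GET /home 404",
]

def map_phase(line):
    """Map: emit a (key, value) pair for each log record."""
    timestamp, method, path, status = line.split()
    yield (path, 1)

def reduce_phase(pairs):
    """Reduce: sum the values for each key (hits per page)."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

intermediate = [pair for line in log_lines for pair in map_phase(line)]
hits = reduce_phase(intermediate)
print(hits)  # {'/home': 2, '/products': 1}
```

Even in this toy form, the analyst must reason about record parsing and key design rather than simply stating the question being asked.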
Big Data Discovery
I once heard a data scientist at a large bank comment (in frustration) that “Hadoop is a great technology for storing large volumes of data at a low cost… but the problem is that the only people who can get the data out are the people who put the data in.”
To address this gap, big data discovery platforms have emerged as a key component of an ecosystem designed to optimize value creation from big data. A big data discovery platform is meant to provide direct access to big data content for a data scientist (without a computer scientist intermediary).
In order to accomplish this goal, there must be a data access interface that provides a higher-level abstraction than flat-file programming using MapReduce and Java or C++. Hybrid models which combine the power of MapReduce with the ease-of-use of SQL are required. A number of open source projects have contributed languages such as Pig and Hive to close the skill set gap between data scientists and computer scientists. Yet these projects have so far failed to deliver the combination of efficiency and expressive power demanded by data scientists.
As a consequence, implementations in which SQL and No-SQL techniques can be combined using a ‘Not Only SQL’ interpretation of the No-SQL approach have been gaining popularity. Most commercial relational database vendors have already delivered or are actively working on such a capability.
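The appeal of the ‘Not Only SQL’ approach is the level of abstraction it restores. The sketch below expresses the same hits-per-page aggregation from the weblog example declaratively, using Python’s built-in sqlite3 module purely as a stand-in: SQL-on-Hadoop engines such as Hive accept similar queries but compile them into MapReduce-style jobs behind the scenes. The sample rows are the same illustrative, hypothetical data.

```python
import sqlite3

# In-memory database as a stand-in for a SQL interface over big data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weblog (ts TEXT, method TEXT, path TEXT, status INTEGER)"
)
conn.executemany(
    "INSERT INTO weblog VALUES (?, ?, ?, ?)",
    [
        ("2013-06-01T10:00:00", "GET", "/home", 200),
        ("2013-06-01T10:00:01", "GET", "/products", 200),
        ("2013-06-01T10:00:02", "GET", "/home", 404),
    ],
)

# The question is stated declaratively; the engine decides how to execute it.
rows = conn.execute(
    "SELECT path, COUNT(*) FROM weblog GROUP BY path ORDER BY path"
).fetchall()
print(rows)  # [('/home', 2), ('/products', 1)]
```

The query says *what* is wanted, not *how* to compute it, which is the gap the hybrid implementations aim to close for data scientists.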
A further common mistake in big data projects is falling victim to the “silver bullet” trap. New technologies are often perceived as silver bullets that will solve all problems. Open-source Hadoop is sometimes perceived to be the panacea for all analytic challenges. The reality, of course, is that no one technology solves all problems well. Success demands using the right tools to solve each part of the big data analytics challenge.
A Three-Platform Approach: Data Archive, Discovery, Production Analytics
Analytically sophisticated organizations including LinkedIn, eBay and Wells Fargo Bank have converged on an approach that involves three distinct platforms: a data archive platform, a discovery platform and a production analytics platform.
Open source Hadoop is often the favored technology for the data archive platform because of its scalability, high-performance loading and attractive economic characteristics. Data is stored and provisioned from the data archive platform, usually in raw form using key-value pairs. However, Hadoop is notoriously difficult to manipulate as an analytic platform, and its adoption for analytics has been isolated to small cadres of computer scientists who have mastered the technology.
For enterprise adoption, more capability is needed in areas such as usability, manageability and security. For this reason, discovery platforms have emerged to fill the gap between Hadoop and traditional relational database platforms used for production data warehousing. A robust discovery platform will add in the aforementioned missing features of Hadoop and will also allow data scientists to work with both SQL and No-SQL programming techniques (on top of both relational and non-relational data).
The discovery platform is optimized for a small number of very sophisticated data scientists who design and execute data experiments in search of new insights. Data is brought in from the Hadoop archive platform in a fairly unrefined format so as to avoid creating delays in putting the data into the hands of the data scientists.
If no value is found from the data experiments, the experimental data is simply discarded from the discovery platform and new data experiments are undertaken (note that the data discarded from the discovery platform still remains on the archive platform). If value is found in the data, then it is promoted into the enterprise data warehouse platform where it is certified, auditable and can be re-used for production analytics.
This dynamic leads to the creation of what I call a unified data architecture, which is all about using the right tool for solving the problem at hand. An enterprise can use Hadoop for the data archive platform. The discovery platform provides beyond-SQL analytic capabilities along with database functionality in the form of optimized performance, usability, and security appropriate for a data scientist. No-ETL (extract, transform and load) techniques, with late binding on the discovery platform, provide the flexibility for data scientists to apply structure to data at query-time rather than at load-time. Meanwhile, the production analytics platform consists of the enterprise data warehouse, primarily implemented using early binding of data at load-time with traditional ETL techniques.
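The late-binding idea can be sketched as follows, with raw records landing on the discovery platform untransformed and structure applied only at query time, per experiment. The event records and field names here are hypothetical, chosen only to illustrate schema-on-read as opposed to schema-on-load ETL.

```python
import json

# Raw records land untransformed: no ETL at load time.
# Hypothetical sample events for illustration.
raw_records = [
    '{"user": "a", "event": "click", "ms": 120}',
    '{"user": "b", "event": "view"}',
    '{"user": "a", "event": "click", "ms": 95}',
]

def query_clicks(records):
    """Late binding: impose structure at query time,
    and only on the fields this experiment needs."""
    for line in records:
        rec = json.loads(line)
        if rec.get("event") == "click":
            yield (rec["user"], rec.get("ms", 0))

clicks = list(query_clicks(raw_records))
print(clicks)  # [('a', 120), ('a', 95)]
```

If the experiment finds nothing, the query (not a load-time schema) is discarded; if it finds value, the structure it implies can be formalized with traditional ETL on the production analytics platform.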
Of course, not all enterprises will require all three of the platforms described above. For example, if the volume of big data has not yet accumulated to large scale then it may be prudent to defer deployment of the Hadoop platform until the economics make sense for doing so. Similarly, it may make sense to carry out discovery and production tasks on the same platform in circumstances where it will simplify the overall architecture to do so.
ROI and Total Cost of Ownership
In order for a big data initiative to be successful, it must deliver a positive return on investment. Yet, the investments necessary to be successful are widely misunderstood. Investments are not just in technology, but also in people with the right skill sets. For example, deployment of Hadoop is often perceived to be free because it is open source and has no software license cost.
The challenge is that organizations often do not make the required people investments to extract the value from their big data when using “free” software: installing software on a cluster of servers is not enough.
In this respect, Hadoop can be likened to a free puppy. The acquisition cost is free, but care and feeding for the environment is definitely not free. Organizations must invest in data scientist skill sets and the operational staff to keep the system up and running in order to get value from it.
Total cost of ownership is what matters—not just acquisition cost. Keeping this in mind will help organizations to make better choices regarding using the right technology for the problem to be solved. Optimizing total cost to value involves investing in the right technology and skill set combinations, understanding which technology is most efficient for which workloads and engineering an ecosystem which allows the selected technologies to work well together.
Organizations that approach big data from a value perspective with partnership between the business and IT are much more likely to be successful than those which adopt a pure technology approach. For this reason, making appropriate investments in both technology and organizational skill sets to ensure enterprise capability in extracting value from big data is essential. Aligning technologies, skill sets and costs is also fundamental to optimize total cost to value and make big data projects successful.
Stephen Brobst is the chief technology officer for Teradata Corporation. He is widely regarded as a leading expert in data warehousing.