Editor’s note: This article is the second in a series examining issues related to evaluating and implementing big data analytics in business.
The first article in this series explored how the growing desire to consume massive volumes of structured and unstructured data intersected with the reduced costs and lowered barrier to entry for enabling scalable, high-performance analytics. The conclusion was that the convergence of these market conditions makes designing and implementing big data analytics attractive to many different types of organizations, including those whose budgets were previously too small to accommodate the investment.
But even as the excitement around big data analytics reaches a fever pitch, it remains a technology-driven activity. And just because big data is feasible does not necessarily mean that it is reasonable. Unless there are clear processes for determining the value proposition, there is a risk that it will remain a fad until it hits the disappointment phase of the hype cycle. At that point, hopes may be dashed when it becomes clear that the basis for the investments in the technology was not grounded in expectations for clear business improvements.
That being said, a scan of existing content on the “value of big data” sheds interesting light on what is being promoted as the expected result of big data analytics and, more interestingly, how familiar those expectations sound. A good example is an economic study on the value of big data (titled “Data Equity—Unlocking the Value of Big Data”), published in April by the Centre for Economics and Business Research (CEBR), which speaks to the cumulative value of:
- Optimized consumer spending as a result of improved targeted customer marketing;
- Improvements to research and analytics within the manufacturing sector that lead to new product development;
- Improvements in strategizing and business planning leading to innovation and new start-up companies;
- Predictive analytics for improving supply chain management to optimize stock management, replenishment, and forecasting;
- Improved scope and accuracy of fraud detection.
It just so happens that these are exactly the same types of benefits promoted for years by business intelligence and data warehouse tools vendors and system integrators.
The Characteristics of the Big Data Environment
So what makes big data different? The answer must lie in the characteristics of the big data analytics application development environment, which largely consists of a methodology for elastically harnessing parallel computing resources and distributed storage along with data exchange via high-speed networks.
The result is improved performance and scalability. For another data point on self-reported uses of big data techniques, we can examine the projects enumerated at The Apache Software Foundation’s PoweredBy Hadoop website. A scan of the list shows most of those applications fall into these categories:
- Business intelligence, querying, reporting, and searching, including many implementations of searching, filtering, indexing, speeding up aggregation for reporting and report generation, trend analysis, search optimization, and general information retrieval. (Examples include: Alibaba, University of North Carolina Lineberger Comprehensive Cancer Center, University of Freiburg.)
- Improved performance for common data management operations, with the majority focusing on log storage, data storage and archiving, followed by sorting, running joins, Extraction/Transformation/Loading (ETL) processing, other types of data conversions, as well as duplicate analysis and elimination. (Examples: AOL, Brilig, Infochimps.)
- Non-Database Applications, such as image processing, text processing in preparation for publishing, genome sequencing, protein sequencing and structure prediction, web crawling, and monitoring workflow processes. (Examples: Benipal Technologies, University of Maryland.)
- Data mining and analytical applications, including social network analysis, facial recognition, profile matching, other types of text analytics, web mining, machine learning, information extraction, personalization and recommendation analysis, ad optimization, and behavior analysis. (Examples: FOX Audience Network, LinkedIn, Telefonica Research.)
In other words, most of the applications reported by Hadoop users are not necessarily new applications. Rather, they are familiar applications—and the availability of a low-cost, high-performance computing framework allows more users to develop them, run larger deployments, or speed up execution.
Reviewing these examples further, the types of applications described by these implementations suggest that the big data approach, as it stands today (and it will continue to develop), is best suited to addressing business problems that meet one or more of the following criteria:
1) Data-restricted throttling. There is an existing solution whose performance is throttled as a result of data access latency, data availability, or size of inputs.
2) Computation-restricted throttling. Algorithms exist, but they are heuristic and have not been implemented because their anticipated computational demands cannot be met with conventional systems.
3) Large data volumes. The analytical application combines a multitude of existing large data sets and data streams with high rates of data creation and delivery.
4) Significant data variety. The data in the different sources varies in structure and content, and some (or much) of the data is unstructured.
5) Benefits from data parallelization. Because of the reduced data dependencies, the application’s runtime can be improved through task parallelization applied to independent data segments.
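Criterion 5 is worth a concrete illustration. When records carry no dependencies on one another, a job can be split into independent segments, processed in parallel, and the partial results merged—the same map/reduce pattern that underlies Hadoop. The following is a minimal sketch in plain Python (the data, segment size, and category field are illustrative assumptions, not any particular production workload):

```python
# Sketch of criterion 5 (data parallelization): independent data
# segments are aggregated in parallel, then the partial results
# are merged. The records and segment size are hypothetical.
from collections import Counter
from multiprocessing import Pool

def count_categories(segment):
    """Map step: aggregate within one independent data segment."""
    return Counter(record["category"] for record in segment)

def parallel_category_counts(records, workers=4, segment_size=1000):
    segments = [records[i:i + segment_size]
                for i in range(0, len(records), segment_size)]
    with Pool(workers) as pool:
        partials = pool.map(count_categories, segments)
    # Reduce step: merge the per-segment aggregates.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    data = [{"category": "fraud" if i % 10 == 0 else "ok"}
            for i in range(5000)]
    print(parallel_category_counts(data))
```

Because each segment is self-contained, adding workers (or machines, in a distributed framework) shortens runtime without changing the result—precisely the property that makes such applications a good fit for the big data environment.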
A Framework for Evaluating Big Data Analytics in Business
So how does this relate to business problems whose solutions are suited to big data analytics applications? The criteria in the list above can be used to assess the degree to which business problems are adaptable to big data technology.
Take ETL processing as a prime example. ETL processing is hampered by data throttling and computation throttling. It can involve large data volumes, may consume a variety of different types of data sets, and can benefit from data parallelization. This is the equivalent of a big data “home run” application. Find more examples in the table below.
**Examples of Applications Suited to Big Data Analytics**

| Application | Criteria items addressed | Sample data sources |
| --- | --- | --- |
| Energy network monitoring and optimization | Data throttling; computation throttling; large data volumes | Sensor data from smart meters and network components |
| Credit fraud detection | Data throttling; computation throttling; large data volumes; data parallelization; data variety | Point-of-sale data, customer profiles, transaction histories, predictive models |
| Data profiling | Large data volumes; data parallelization | Sources selected for downstream repurposing |
| Clustering and customer segmentation | Data throttling; computation throttling; large data volumes; data parallelization; data variety | Customer profiles, transaction histories, enhancement datasets |
| Recommendation engines | Data throttling; computation throttling; large data volumes; data parallelization; data variety | Customer profiles, transaction histories, enhancement datasets, social network data |
| Price modeling | Data throttling; computation throttling; large data volumes; data parallelization | Point-of-sale data, customer profiles, transaction histories, predictive models |
These are the examples we see now, and as with other trends in application development we are sure to see more as big data technologies evolve. What the table suggests is a straightforward approach to assessing the costs and benefits of pursuing a big data analytics project in your enterprise. Ask: to what extent is the investment in big data technologies reasonable for alleviating the constraints associated with the project? If it is not a “home run” that hits every criterion, does it still satisfy enough of them that the benefits of doing the project outweigh the costs? If you cannot demonstrate that the answer is yes, then you are just playing with technologies.
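The assessment described above can be reduced to a simple checklist: flag which of the five criteria a candidate project hits, then weigh expected benefit against cost. The sketch below is an illustrative formalization only; the criteria come from this article, while the scoring rule and example figures are hypothetical assumptions:

```python
# Hedged sketch of the assessment framework: score a candidate
# project against the five criteria, then compare expected benefit
# to expected cost. The decision rule here is illustrative, not a
# formal methodology.
CRITERIA = (
    "data_throttling",
    "computation_throttling",
    "large_data_volumes",
    "data_variety",
    "data_parallelization",
)

def assess_project(flags, expected_benefit, expected_cost):
    """Return (criteria_met, worth_pursuing) for a candidate project."""
    met = [c for c in CRITERIA if flags.get(c, False)]
    # A "home run" hits every criterion; short of that, the project
    # must hit at least one criterion and still pay for itself.
    worth_pursuing = bool(met) and expected_benefit > expected_cost
    return met, worth_pursuing

# Example: ETL processing, the "home run" case discussed above.
# The benefit and cost figures are made up for illustration.
etl_flags = {c: True for c in CRITERIA}
met, go = assess_project(etl_flags, expected_benefit=5.0, expected_cost=2.0)
print(len(met), go)
```

A project that satisfies no criteria, or whose costs exceed its benefits, fails the check regardless of how fashionable the technology is—which is the point of the exercise.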
While we continue to employ big data analytics to build new implementations of old algorithms, we should anticipate opportunities for new paradigms that use parallel execution and data distribution in innovative ways. Yet without proper organizational alignment and preparation, neither approach is likely to succeed. In future columns we will discuss different aspects of corporate readiness in preparation for designing, developing, and implementing big data applications.
David Loshin is the author of several books, including Practitioner’s Guide to Data Quality Improvement and the upcoming second edition of Business Intelligence—The Savvy Manager’s Guide. As president of Knowledge Integrity Inc., he consults with organizations in the areas of data governance, data quality, master data management and business intelligence. Email him at email@example.com.