The fresh insights promised by big data analytics also present challenges, including business-case justification, IT system design, and the organizational shifts required to take advantage of what the innovations have to offer.
By Martin LaMonica
May 14, 2013
For many people, “big data” may simply mean running analytics on a Hadoop cluster. But there’s a broad range of technologies and techniques that are enabling a new class of analytics applications. To take them on, companies need to reconsider how they build applications and their technology infrastructure.
The challenges aren’t limited to learning a different software stack, though. Analytics professionals need to be smart about system design and making a business case for collecting and analyzing the explosion of new data sources, whether it’s social media comments or pressure sensor data from an oil pipeline.
“A lot of the architectures and products that technology managers may have been accustomed to for traditional transactional activity don’t map well to a big-data world,” says Gordon Haff, corporate technology evangelist at Red Hat and the author of Computing Next, a book on cloud computing. “You very much need to think about an architecture in the context of big data.”
The high-end reliability features of traditional storage arrays, for instance, aren’t particularly useful for big data applications where software can be distributed across many machines or on public clouds, Haff says. Another area of disruption is in databases where it once seemed that relational databases would rule forever. Now, a number of NoSQL databases, in-memory databases, and other specialized data engines are challenging incumbent technologies. While Hadoop is well suited for batch-type analytical jobs, many cutting-edge products are geared for real-time analytics, says David Feinleib, the managing director of consultancy The Big Data Group.
Open source is now pervasive, too, as products such as Hadoop and MongoDB emerged to keep pace with new big-data problems, such as handling massive datasets and querying unstructured data. That means the pace of innovation is brisk and product costs are low, but it also means businesses need to become familiar with open-source projects and with the vendors that supply support and enterprise-grade features.
Meanwhile, cloud-delivered services for analytics are rapidly maturing. Recent examples include Redshift, a service Amazon Web Services is testing that is built on a columnar database from ParAccel, and enterprise computing giant SAP's release of a cloud version of its HANA database. Cloud services allow companies to pay for computing and software applications based on usage, which can cut up-front costs, but they introduce some management overhead along with data security and privacy issues.
Altogether, it’s an extremely fertile time for data analytics and the software that underpins those applications. “What we’re seeing now is an absolute explosion in data management technology and it’s come about because of the complexity of data problems,” said Mike Olson, the CEO of Hadoop company Cloudera. “This Darwinian variability is good. We’ll have a richer tool set and that’s critical because we have very virgin data problems now.”
Olson predicts that more specialized database engines will emerge around the Hadoop framework to meet new classes of data problems.
How to Move Forward
Traditional business intelligence systems allowed analysts to monitor company performance by dipping into a well-defined and contained pool of information, such as transactional data. Now data analysts can face multiple streams of information, some from outside the corporation, presented in many formats. That means for many classes of applications, companies need to integrate and normalize a varied set of data.
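That integration step can be as simple as mapping each source's fields onto one common schema. Here is a minimal sketch of the idea; the field names and the "social" and "sensor" sources are hypothetical, chosen only to mirror the examples mentioned earlier in the article.

```python
from datetime import datetime, timezone

def normalize_record(record, source):
    """Map a raw record from a named source into one common schema.

    Real pipelines would handle many more sources and malformed input;
    this only illustrates the normalization step.
    """
    if source == "social":
        # Social feeds often timestamp with Unix epoch seconds.
        return {
            "timestamp": datetime.fromtimestamp(record["epoch"], tz=timezone.utc),
            "origin": "social",
            "value": record["comment"],
        }
    if source == "sensor":
        # Sensor exports here are assumed to use ISO-8601 timestamps.
        return {
            "timestamp": datetime.fromisoformat(record["read_at"]),
            "origin": "sensor",
            "value": record["pressure_psi"],
        }
    raise ValueError(f"unknown source: {source}")
```

Once every stream lands in the same shape, downstream analytics can treat a tweet and a pipeline pressure reading as rows in one table.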
When all your databases and analytics packages run in-house, it’s manageable because there are standardized interfaces, such as ODBC (open database connectivity) and SQL (structured query language). But as more corporate data and enterprise applications move into the cloud, the situation becomes more challenging, says Michael Benedict, vice president and business line manager at data-integration provider Progress DataDirect. “With more software being delivered as a service, it’s already increased the number of data sources an analytics or business intelligence person has to deal with by 10 or 20 times in the last five years. We expect this trend to continue,” Benedict says.
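The value of those standardized interfaces is that one SQL statement works against many back ends. As a sketch, the query below runs against Python's built-in sqlite3 module, standing in for any SQL-speaking source reachable over ODBC; the `orders` table and its contents are invented for illustration.

```python
import sqlite3

def top_customers(conn, limit=3):
    """Run a standard SQL aggregation; the same statement would work,
    largely unchanged, over ODBC against most relational databases."""
    cur = conn.execute(
        "SELECT name, SUM(amount) AS total FROM orders "
        "GROUP BY name ORDER BY total DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()

# Build a small in-memory database to query against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (name TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 100.0), ("acme", 50.0), ("globex", 120.0)],
)
```

When each cloud service instead exposes its own proprietary API, every such query must be rewritten per source, which is the multiplication of effort Benedict describes.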
Often, the most compelling analytics applications collect data from multiple sources and then seek out correlations to help make decisions or predictions. For example, a utility can implement predictive analytics using data from usage meters, an example of the “Internet of Things.” Because it has millions of two-way electricity meters installed, utility Southern California Edison now combines frequent meter readings with projected power generation supply, which increasingly includes intermittent wind and solar, to better predict daily energy demand and run more efficiently.
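The core of that kind of forecast can be stated simply: aggregate the meter readings, project demand, and net out the intermittent supply. The sketch below is a deliberately naive, hypothetical simplification of what a utility's model would do, not Southern California Edison's actual method.

```python
def predict_peak_demand(meter_readings_kw, growth_factor=1.0):
    """Naive forecast: take the highest recent aggregate reading across
    all meters per interval, scaled by an assumed growth factor."""
    interval_totals = [sum(interval) for interval in meter_readings_kw]
    return max(interval_totals) * growth_factor

def conventional_generation_needed(predicted_demand_kw,
                                   projected_wind_kw,
                                   projected_solar_kw):
    """Demand left for conventional plants after intermittent wind and
    solar supply is netted out (never negative)."""
    return max(0.0, predicted_demand_kw - projected_wind_kw - projected_solar_kw)
```

Real models add weather, seasonality, and per-circuit detail, but the shape of the computation (many small readings in, one operational number out) is the same.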
Another example is marketing company Runa, which analyzes millions of transactions, clickstream data, and online shopper history in real time to create highly customized product discount offers when people are shopping online.
Some innovative companies are using analytics to create new services from existing data sources. Startup Climate Corporation takes crop yield, weather, and agricultural monitoring data, such as temperature and precipitation, to price crop insurance for farmers. With the same datasets, the company can recommend to farmers in great detail when it’s best to irrigate, apply chemicals, or plant, says Hilary Jules, the company’s director of marketing. To take on these types of leading-edge applications requires technical sophistication, and companies need to investigate the latest technologies and techniques in analytics.
A key enabler to these innovative analytics applications is cheaper technology than traditional systems, which often relied on high-end servers and pricey software. “The big difference from an infrastructure perspective is that you can store a lot more data and process it an order of magnitude more than a few years ago,” says Feinleib. Much the way Linux enabled large-scale computing on commodity hardware, advanced analytics can be done using open source software on cheap hardware, he says.
Framing the Problem So Everyone Can Benefit
With more accessible tools, people can collect and analyze data to answer questions they simply couldn't answer economically before. And the issues that start in the data center end up influencing the way enterprises manage data formats, how they communicate about roles, and how they establish new business processes.
The American Wind Wildlife Institute is developing a system to help scientists and industry better understand how wind turbines affect bird and bat fatalities. It has been well known for years that wind turbines kill avian wildlife, but the extent of the problem and the factors that contribute to fatalities are not well understood. “We haven’t been able to conduct the analyses to demonstrate with scientific rigor what data is important to predict impacts,” says Taber Allison, the director of research and evaluation at the American Wind Wildlife Institute.
The project, which is in pilot phase, will initially collect bird fatality data from private companies and provide it to scientists to get a more accurate picture of the situation nationally. Over time, it will build its database with other sources including weather information, radar data that detects the presence of bats from their calls, and geospatial information that describes the topography of a wind farm site.
Analysis could show, for example, that factors such as the time of year, humidity, and temperature create a higher risk of bird fatalities. Instead of shutting down turbines between certain dates, the more detailed information could tell wind farm operators to shut down only during certain weather patterns. The hope is to minimize risk while maximizing energy production.
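In operational terms, the analysis would yield a conditional rule rather than a calendar. The thresholds below are entirely hypothetical placeholders for whatever the data eventually shows; they only illustrate the shape of such a rule.

```python
def curtail_turbine(month, humidity_pct, temp_c):
    """Hypothetical curtailment rule: shut down only when several
    risk factors coincide, instead of a blanket seasonal shutdown.

    Thresholds are invented for illustration, not derived from data.
    """
    migration_season = month in (4, 5, 9, 10)          # spring/fall migration
    risky_weather = humidity_pct > 80 and 10 <= temp_c <= 25
    return migration_season and risky_weather
```

A date-only policy would idle turbines for whole months; a conditioned rule like this one curtails only the hours that actually carry the risk, which is where the energy-production gain comes from.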
The application has a sophisticated role-based security system that will allow different companies to contribute to the central PostgreSQL database over the Web without sharing proprietary data. But the organizational challenges have been significant as well, says Cherri Pancake, the director of the Northwest Alliance for Computational Science & Engineering at Oregon State University. In this case, the project owners had to devise a common data format and assure the multiple companies reporting data that it would remain private, an example of the cultural and social issues that must be dealt with to enable data sharing.
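The gist of such a role-based scheme is row- and column-level filtering by requester. This is a toy sketch of the concept, not the institute's actual design; the roles, field names, and anonymization rule are all assumptions made for illustration.

```python
def visible_rows(rows, requester):
    """Toy role-based filter: scientists see anonymized fatality data
    from every contributor; a company sees only its own raw rows."""
    out = []
    for row in rows:
        if requester["role"] == "scientist":
            # Strip the owner field so contributors stay anonymous.
            out.append({"site_type": row["site_type"],
                        "fatalities": row["fatalities"]})
        elif requester["role"] == "company" and row["owner"] == requester["org"]:
            out.append(dict(row))
    return out
```

In a production system this logic would live in the database itself (PostgreSQL roles, grants, and views) rather than in application code, but the policy being enforced is the same.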
“The time you end up spending on [organizational] things has overshadowed the cost of the technical issues because that’s where the biggest challenges lay,” she says. Creating the scientific database would not have been possible without establishing trust among multiple organizations, she says.
Surfacing more detailed information can present organizational challenges in a commercial environment, too. For instance, marketing departments may need to adjust to having far more granular information on specific customers or products. Viewing real-time analytics on product sales, rather than only historical data, could yield fresh insights. But to successfully integrate that sort of information into their workflow, marketers will need easy-to-use front-end tools and visualizations to deal with a higher volume of data.
“There’s a lot of evidence that there are insights to be gained here,” says Haff. “And certainly companies are using data more in making decisions than they were before.”
Martin LaMonica is a technology journalist in the Boston area. Follow him on Twitter @mlamonica.
Home page photo of gears by Les Chatfield via Wikipedia.