Hadoop has been gaining popularity as a business intelligence solution, and the number of data analytics products built on it is growing rapidly. What does this mean for the traditional data warehouse?
Hadoop is an open-source distributed computing framework developed under the Apache Software Foundation. It pairs a distributed file system (HDFS) with parallel processing to address the problems that arise when managing increasingly large data sets in a traditional data warehouse, such as the cost of data storage and the difficulty of efficiently gleaning insights from data in a variety of formats.
Traditional data warehouses, typically built on relational (SQL) databases, use a highly disciplined approach to structure data, establish relationships, and enforce governance. They provide a canonical version of the entirety of a business’s data across departments and data sources. This data canon allows multiple users to access the same data and make consistent comparisons, forecasts, and predictions. However, as useful as traditional warehouses are, adoption rates have remained low due to cost.
The costs of administering a traditional database at the enterprise level are significant (albeit decreasing). Storage, traditionally the limiting cost, has fallen dramatically in price over the last several decades. According to StatisticBrain, the average cost per gigabyte of storage was $437,500 in 1980, $11 in 2000, and just five cents in 2013. Of course, while storage has gotten cheap, the labor involved in structuring the data is still expensive. According to Glassdoor, data scientists earn a median salary of $115,000 per year – a figure that might not give large enterprises pause, but is probably out of reach for many smaller businesses. And the amount of data being captured is growing exponentially, which will further complicate storage efforts.
An IDC study of the digital universe estimates that by 2020 the digital world will grow to 40,000 exabytes, or 40 trillion gigabytes – the equivalent of more than 5,200 gigabytes per person.
Assuming that storage still costs five cents per gigabyte in 2020, storing those 40,000EB of data would run roughly $2 trillion (40 trillion GB at $0.05 each) – and that number assumes the data won’t be replicated and stored in more than one place, which in practice it almost always is. It doesn’t take a data scientist to understand that those costs are unsustainable, hence the rise of Hadoop.
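The arithmetic behind that estimate is easy to check directly (a quick back-of-envelope sketch using the article's own 2013 price figure):

```python
# Back-of-envelope check of the IDC storage-cost projection.
EB_IN_GB = 1_000_000_000          # 1 exabyte = 10^9 gigabytes
digital_universe_eb = 40_000      # IDC estimate: 40,000 EB by 2020
cost_per_gb = 0.05                # 2013 price per GB, per StatisticBrain

total_gb = digital_universe_eb * EB_IN_GB   # 40 trillion GB
total_cost = total_gb * cost_per_gb         # dollars, with no replication

print(f"{total_gb:,} GB -> ${total_cost:,.0f}")
# 40,000,000,000,000 GB -> $2,000,000,000,000
```

Replication only multiplies that bill: every extra copy of the data adds another full pass of the per-gigabyte cost.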
Hadoop’s biggest cost advantage comes from the fact that it is open source: it is free and runs on commodity hardware. Additionally, Hadoop helps reduce the cost of data storage by allowing data to be accessed, modified, and processed in place, eliminating the need to duplicate and restructure data in a traditional warehouse. While data warehouses require painstaking organization through the construction and modification of schemas, Hadoop can store and process data with very simple schemas – and sometimes no schema at all. According to Wayne Eckerson, founder of the consultancy Eckerson Group, 80 percent of all data is “multi-structured,” which can be difficult to normalize for storage in a traditional data warehouse. Because of this flexibility, organizations are using Hadoop to store data from disparate sources, including enterprise resource planning systems, customer relationship management software, Web server logs, sensor data, and various other systems. Organizations are also using Hadoop clusters to support online analytical processing (OLAP), traditional Extract, Transform, Load (ETL) processes, development of custom business intelligence frameworks, and even file server operations.
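The schema-on-read idea described above can be sketched in a few lines. This is a hypothetical illustration in plain Python (not any specific Hadoop API): multi-structured records are stored exactly as they arrive, and a structure is imposed only at query time, by the query that needs it.

```python
import json

# Multi-structured log lines stored as-is: no schema enforced on write.
raw_records = [
    '{"user": "a1", "page": "/home", "ms": 120}',
    '{"user": "b2", "event": "click", "target": "buy-button"}',  # different fields
    '{"user": "a1", "page": "/checkout"}',                       # missing "ms"
]

# Schema is applied only on read: each query decides which fields it
# needs and how to handle records that lack them.
def pages_viewed(records):
    for line in records:
        rec = json.loads(line)
        if "page" in rec:                    # skip non-pageview records
            yield rec["user"], rec["page"], rec.get("ms")  # None if absent

print(list(pages_viewed(raw_records)))
# [('a1', '/home', 120), ('a1', '/checkout', None)]
```

A warehouse would have forced all three records into one table schema before loading; here the click event simply sits untouched until some other query wants it.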
There are certainly arguments to be made for switching to a Hadoop-only data architecture, but SQL-based data warehouses still have their place, particularly for storage and governance of sensitive information such as financial data, which requires transactional updates. However, the most powerful big data architectures are being built with hybrid solutions. For example, an eCommerce site might house transaction data in a traditional warehouse and link directly to website session data stored in a Hadoop cluster, better enabling the site to automate custom recommendations, analyze browsing paths, and ultimately increase conversions. According to a 2013 IDC study, nearly 70 percent of Hadoop deployments are used in conjunction with a traditional relational or transactional database.
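The hybrid pattern in the eCommerce example can be sketched with stand-ins for each side: here an in-memory SQLite database plays the relational warehouse holding transaction data, and a plain list of dicts plays session data pulled from a Hadoop cluster. The table and field names are hypothetical, chosen only for illustration.

```python
import sqlite3

# Stand-in for the relational warehouse: transactional order data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (user_id TEXT, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [("a1", 59.99), ("b2", 120.00)])

# Stand-in for session data stored in a Hadoop cluster: browsing paths.
sessions = [
    {"user_id": "a1", "path": ["/home", "/product/42", "/checkout"]},
    {"user_id": "c3", "path": ["/home", "/search"]},  # browsed, never bought
]

# Link the two sides: which browsing paths ended in a purchase?
buyers = {u for (u,) in db.execute("SELECT user_id FROM orders")}
for s in sessions:
    label = "converted" if s["user_id"] in buyers else "browsed only"
    print(s["user_id"], s["path"], label)
```

Joining the governed transaction record against the raw clickstream is what lets the site see which paths convert – neither store could answer that question alone.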
As Hadoop technology matures, it may end up replacing some traditional database and data warehousing operations. According to an Allied Market Research report, the Hadoop market is projected to grow from $2 billion in 2013 to more than $50 billion by 2020, with a majority of that market in Hadoop services, mainly integration and deployment efforts. The exponential data growth expected over the next several years should have no trouble supporting the continued adoption of Hadoop architectures – 40,000EB of data simply can’t be managed cost-effectively in a traditional database. Ultimately, though, each organization will have to decide which data architecture is best suited to its needs. While Hadoop may end up managing a larger percentage of an organization’s data, the traditional data warehouse will continue to have its place. There is no one-size-fits-all big data solution.