Many Hadoop experts believe an integrated data warehouse (IDW) is simply a huge pile of data. However, data volume has nothing to do with what makes a data warehouse. An IDW is a design pattern, an architecture for an analytics environment. First defined by Barry Devlin in 1988, the architecture quickly was called into question as implementers built huge databases with simple designs as well as small databases with complex designs.
In 1992, Bill Inmon published “Building the Data Warehouse,” which described two competing implementations: data warehouses and data marts. Gartner echoed Inmon’s position in 2005 in its research “Of Data Warehouses, Operational Data Stores, Data Marts and Data Outhouses.” Both are oversimplified in the following table.
|Integrated Data Warehouses||Data Marts|
|Subject oriented||Subject oriented|
|Time variant||Time variance and currency|
“Subject oriented” means the IDW is a digital reflection of the business. Subject areas contain tabular data about customers, inventory, financials, sales, suppliers, accounts, etc. The IDW contains many subject areas, each of which is 250 to 5,000 relational tables. Having many subject areas enables cross-organizational analysis – often called the 360-degree view. The IDW can answer thousands of routine, ad hoc, and complex questions.
In contrast, a data mart deploys a small fraction of one or two subject areas (i.e., a few tables). With only a few tables, data marts answer far fewer questions and are poor at handling ad hoc requests from executives.
Integration in a data warehouse has many aspects. First is the standardization of data types. This means account balances contain only valid numbers, date fields have only valid dates, and so on. Integration also means rationalizing data from multiple operational applications. For example, say four corporate applications have Bill Franks, William Franks, W. J. Franks, and Frank Williams all at the same street address. Data-integration tools figure out which is the best data to put in the IDW. Data cleansing corrects messed-up data. For example, repairs are needed when “123 Oak St., Atlanta” is in the street address but the city field is blank. Data integration performs dozens of tasks to improve the quality and validity of the data. Coupled with subject areas, this is called “a single version of the truth.”
Does Hadoop Have What it Takes?
Hadoop was engineered to rely on the schema-on-read approach, in which data is parsed, reformatted, and cleansed at runtime in a manually written program. But Hadoop (and Hive) have limited to no ability to ensure valid dates and numeric account balances. In contrast, relational database management systems (RDBMS) ensure that input records conform to the database design – called the schema. According to Dr. Michael Stonebraker, “This is the best way to keep an application from adding ‘garbage’ to a data set.”
The current rage in the Hadoop community is SQL-on-Hadoop. Those who have committed to open-source Apache are playing catch-up to databases by adding SQL language features. SQL-on-Hadoop offers are a subset of the ANSI 1992 SQL language, meaning they lack features found in SQL 1999, 2003, 2006, 2008, and 2011 standards. Therefore, the business user’s ability to perform self-service reporting and analytics is throttled. This, in turn, throws a substantial labor cost back into IT to develop reports in Java.
Additionally, the lack of a database foundation also prevents SQL-on-Hadoop from achieving fast performance. Missing from Hadoop are robust indexing strategies, in-database operators, advanced memory management, concurrency, and dynamic workload management.
A consistent – sometimes angry – complaint from Hadoop experts is the poor performance in large table joins, which the SQL-on-Hadoop tools do not fix. Remember those subject areas above? Some subject areas have two to 10 tables in the 50-1,000 terabyte range. With a mature analytic database, it is a challenging problem to optimize queries that combine 50TB with 500TB, sort it, and do it fast. Fortunately, RDBMS vendors have been innovating the RDBMS and cost-based optimizers since the 1980s. A few Apache Hadoop committers are currently reinventing this wheel, intending to release a fledgling optimizer later in 2014. Again, self-service business user query and reporting suffers.
Hadoop, therefore, does not have what it takes to be a data warehouse. It is, however, nipping at the heels of data marts.
How Many Warehouses Has Hadoop Replaced?
As far as we know, Hadoop has never replaced a data warehouse, although I’ve witnessed a few failed attempts. Instead, Hadoop has been able to peel off a few workloads from an IDW. Migrating low-value data and workloads to Hadoop is not widespread, but neither is it rare.
One workload often offloaded is extract-transform-load (ETL). Technically, Hadoop is not an ETL solution. It’s a middleware infrastructure for parallelism. Hadoop requires hand coding of ETL transformations, which is expensive, especially when maintenance costs pile up in the years to come. Simple RDBMS tasks like referential integrity checks and match key lookup don’t exist in Hadoop or Hive. Hadoop does not provide typical ETL subsystem features out-of-the-box, such as:
- hundreds of built-in data-type conversions, transformers, look-up matching, and aggregations
- Robust metadata, data lineage, and data modeling capabilities
- Data quality and profiling subsystems
- Workflow management, i.e., a GUI for generating ETL scripts and handling errors
- Fine grained, role-based security
Because migrations often come with million-dollar price tags, there is not a stampede of ETL migrations to Hadoop. Many organizations keep the low-value ETL workload in the IDW because:
- The IDW works (it ain’t broke, don’t fix it)
- Years of business logic must be recoded, debugged, and vetted in Hadoop (risk)
- There are higher business value Hadoop projects to be implemented (ROI)
Nevertheless, some ETL workload migrations are justifiable. When they occur, the IDW resources freed up are quickly consumed by business users.
Similarly, Hadoop provides a parallel platform for analytics, but it does not provide the analytics. Hadoop downloads do not include report development tools, dashboards, OLAP cubes, hundreds of statistical functions, time series analysis, predictive analytics, optimization, and other analytics. These must be hand coded or acquired elsewhere and integrated into projects.
Hadoop Was Never Free
Where does this leave the cash-strapped CIO who is still under pressure? According to Phil Russom of The Data Warehousing Institute: “Hadoop is not free, as many people have mistakenly said about it. A number of Hadoop users speaking at recent TDWI conferences have explained that Hadoop incurs substantial payroll costs due to its intensive hand coding normally done by high-payroll personnel.”
This reflects the general agreement in the industry, which is that Hadoop is far from free. The $1,000/terabyte hardware costs are hype to begin with, and traditional vendors are closing in on Hadoop’s hardware price advantage anyway. Additionally, some SQL-on-Hadoop offerings are separately priced as open source vendors seek revenue. If you want Hadoop to be fast and functional, well, that part is moving away from free and toward becoming a proprietary, priced database.
Hadoop Jumps in the Lake
Mark Madsen, President of Third Nature, gives some direction on Hadoop benefits: “Some of the workloads, particularly when large data volumes are involved, require new storage layers in the data architecture and new processing engines. These are the problems Hadoop and alternate processing engines are equipped to solve.”
Hadoop defines a new market, called the data lake. Data lake workloads include the following:
- Many data centers have 50 million to 150 million files. Organizing this into a cohesive infrastructure, knowing where everything is, its age, its value, and its upstream/downstream uses is a formidable task. The data lake concept is uniquely situated to solve this.
- Hadoop can run parallel queries over flat files. This allows it do basic operational reporting on data in its original form.
- Hadoop excels as an archival subsystem. Using low-cost disk storage, Hadoop can compress and hold onto data in its raw form for decades. This avoids the problem of crumbling magnetic tapes and current software versions that can’t read the tape they produced eight years earlier. A close cousin to archival is backup-to-disk. Again, magnetic tape is the competitor.
- Hadoop is ideal for temporary data that will be used for a month or two then discarded. There are many urgent projects that need data for a short time then never again. Using Hadoop avoids the lengthy process of getting data through committees into the data warehouse.
- Hadoop, most notably YARN from Hortonworks, is providing the first cluster operating system. This is amazing stuff. YARN improves Hadoop cluster management but does not change Hadoop’s position vis-à-vis the data warehouse.
Apples and Oranges
Bob Page, the VP of Development at Hortonworks, weighed in on the Hadoop versus IDW debate: “We don’t see anybody today trying to build an IDW with Hadoop. This is a capability issue, not a cost issue. Hadoop is not an IDW. Hadoop is not a database. Comparing these two for an IDW workload is comparing apples to oranges. I don’t know anybody who would try to build an IDW in Hadoop. There are many elements of the IDW on the technical side that are well refined and have been for 25 years. Things like workload management, the way concurrency works, and the way security works – there are many different aspects of a modern IDW that you are not going to see in Hadoop today. I would not see these two as equivalent.”
Hadoop’s success won’t come as a low-priced imitation of a data warehouse. Instead, I continue to be bullish on Hadoop as we witness the birth of the data lake with predictable birthing pains. Over the next couple of years, the hype will quiet down and we can get to work exploiting the best Hadoop has to offer.
With over 30 years in IT, Dan Graham joined Teradata Corporation in 1989, where he was the senior product manager for the DBC/1012 parallel database computer. He then joined IBM, where he wrote product plans and launched the RS/6000 SP parallel server. He then became Strategy Executive for IBM’s Global Business Intelligence Solutions. As Enterprise Systems General Manager at Teradata, Dan was responsible for strategy, go-to-market success, and competitive differentiation for the Active Enterprise Data Warehouse platform. He currently leads Teradata’s technical marketing activities.