The worlds of BI and Big Data may seem like natural partners, but there’s a schism between them. Rather than reconciling the split, the two have instead embraced awkward workarounds and architectural bandages in order to co-exist. This stopgap integration doesn’t serve customers well and over the long term, is counterproductive for the industry.
The BI-Big Data mismatch is one of generational differences in technology as well as paradigms of data design, curation and management. In the next few pages, I’m going to explore the generation gap, outline the resulting imperfect union and detail the costs it exacts and ways to improve upon it.
Business Intelligence and Its Founding Principles
BI technology largely came of age in the very late nineties and early aughts, with a number of independent companies bringing BI products to market. By 2007, the technology megavendors had acquired these BI pure-play firms, and the category matured with an eye toward stability rather than continued, disruptive innovation.
BI adoption became fairly pervasive. A significant number of projects and implementations were executed, many of them in a centralized, IT-oriented manner. With line-of-business stakeholders thus sidelined, a non-trivial number of projects failed, resulting in some industry headwinds.
In response to the setbacks, a second generation of BI tools, exemplified by the likes of Tableau and Qlik, made their impact on the market. Corporate BI stacks like SAP BusinessObjects, IBM Cognos and Oracle/Hyperion were associated with the centralized IT camp. The newer tools were more visual in approach and appealed instead to end users in the lines-of-business, enabling them to work autonomously and on a “self-service” basis.
It’s All About the Cube
As antithetical as these two BI categories may seem, products from each are in fact based on the same data architecture: a structure known as an OLAP (Online Analytical Processing) “cube.” So, while the second generation tools may seem vastly different in their use cases, all BI tools have common organizing principles for the data that they analyze.
Specifically, BI tools break down data into the numbers that need to be analyzed (called “measures”), and the categories by which those numbers are sliced and diced (“dimensions”). Dimensions may be hierarchical in structure (for example, a geography dimension may consist of country, province, city and postal code hierarchy levels), allowing for predefined drill paths for end-user analysis. Because OLAP can optimize query performance by understanding the natural hierarchies in the data, the whole exercise of working through and codifying a dimensional design is a big part of what made OLAP and BI work well.
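To make the measures-and-dimensions idea concrete, here is a minimal sketch in Python. The data, the geography hierarchy, and the `rollup` helper are all hypothetical illustrations, not any particular BI product’s API: each fact row carries dimension members (country, province, city) plus a sales measure, and truncating the dimension path to a given depth yields the aggregates at that hierarchy level.

```python
from collections import defaultdict

# Hypothetical fact rows: dimension members from a geography hierarchy
# (country > province > city) plus one measure (sales).
facts = [
    ("Canada", "Ontario",  "Toronto",  120.0),
    ("Canada", "Ontario",  "Ottawa",    80.0),
    ("Canada", "Quebec",   "Montreal",  95.0),
    ("US",     "New York", "NYC",      200.0),
]

def rollup(rows, level):
    """Aggregate the sales measure at a hierarchy depth:
    level 1 = country, level 2 = province, level 3 = city."""
    totals = defaultdict(float)
    for row in rows:
        key = row[:level]      # truncate the dimension path to this level
        totals[key] += row[3]  # sum the measure
    return dict(totals)

# Drilling down the geography dimension, one level at a time:
print(rollup(facts, 1))  # totals by country
print(rollup(facts, 2))  # totals by country and province
```

Each call slices the same measure along a different depth of the same predefined drill path, which is exactly the analysis pattern the dimensional design enables.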
If you know your measures and you’ve surmised your dimensions and their hierarchies, then you can take data from just about anywhere and restructure it to fit that schema. Extract, Transform and Load (ETL) tools are optimized for this work and for carrying it out in a continuous fashion during production.
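The restructuring work that ETL tools perform can be sketched in a few lines. This is a toy illustration with invented source records, not any ETL product’s interface: flat transactional rows are transformed into a dimension table (with surrogate keys) and a fact table that together fit the dimensional schema.

```python
# Hypothetical flat records, as a transactional source system might expose them.
source_rows = [
    {"order_id": 1, "country": "Canada", "province": "Ontario",
     "city": "Toronto", "amount": 120.0},
    {"order_id": 2, "country": "US", "province": "New York",
     "city": "NYC", "amount": 200.0},
]

dim_geography = {}  # surrogate key -> (country, province, city)
fact_sales = []     # (geography key, amount measure)

def load(rows):
    """Transform flat source rows into dimension and fact tables."""
    for row in rows:
        member = (row["country"], row["province"], row["city"])
        # Reuse the surrogate key if this dimension member already exists.
        key = next((k for k, v in dim_geography.items() if v == member), None)
        if key is None:
            key = len(dim_geography) + 1
            dim_geography[key] = member
        fact_sales.append((key, row["amount"]))

load(source_rows)
```

Running `load` continuously against incoming batches is, in miniature, the production ETL cycle described above.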
A Misleading Calm
A decade ago, this “world order,” as it were, worked pretty well and provided rather comprehensive coverage for data from all kinds of sources. The arrangement worked because of an implicit contract that was in place: As long as you thought through your dimensional design, BI technology could enable all sorts of data loading and analyses around it.
Making the dimensional design explicit allowed BI engines to pre-aggregate at various upper levels in dimensional hierarchies. This made for better performance and let query tools provide generic user interfaces that could read in users’ dimensions, allowing them to do drill-down analysis on their data in relatively short order.
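The performance benefit of pre-aggregation can be sketched as follows. This is a simplified illustration (the table and hierarchy are invented): because the dimensional design declares the hierarchy up front, the engine can compute upper-level totals once at load time, so drill-down queries at those levels read a handful of pre-aggregated rows instead of scanning the detail.

```python
from collections import defaultdict

# Leaf-level detail facts: (country, province, city, sales).
detail = [
    ("Canada", "Ontario",  "Toronto", 120.0),
    ("Canada", "Ontario",  "Ottawa",   80.0),
    ("US",     "New York", "NYC",     200.0),
]

# Pre-aggregate once, at load time, for each upper hierarchy level.
preagg = {1: defaultdict(float), 2: defaultdict(float)}
for country, province, city, sales in detail:
    preagg[1][(country,)] += sales
    preagg[2][(country, province)] += sales

def query(path):
    """Answer a drill-down query from the pre-aggregates when possible,
    scanning the detail rows only at the leaf level."""
    level = len(path)
    if level in preagg:
        return preagg[level][path]  # no detail scan needed
    return sum(r[3] for r in detail if r[:3] == path)

print(query(("Canada",)))  # served from the level-1 aggregate
```

A generic query tool needs nothing beyond the hierarchy metadata to offer this drill-down behavior, which is why explicit dimensional designs made BI front ends so reusable.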
With this stable regime in place, work got done, information workers could acquire the competency to work with their data and get insights, and the world of business and technology established a sort of equilibrium.
Not Keeping Pace
This equilibrium was fragile though, due in part to facets of its very premise: Working out a dimensional design in advance of doing serious analysis presupposes that you already know the types of questions you’ll need to ask. This can work nicely for well-known business processes, but it puts many more ad hoc, exploratory analyses out of reach.
While ETL tools make it possible to integrate data from almost any database system into the dimensional repository, they really need the source data to be in a structured format of rows and columns. If source systems are operational, transactional databases, that works fine — ERP, CRM and line of business applications all fit in nicely. But contemporary analytics involves data from sources that don’t conform to this assumption.
BI tools are built for data volumes that fall short of the benchmark for Big Data. They are gigabyte- and terabyte-scale products rather than the petabyte-scale league Big Data is in. Self-service BI tools, because they are more visual in nature, suffer even more. With visualization rendered on a client computer or on a single server, only relatively small data sets can be accommodated. This forces an architecture whereby data has to be pre-analyzed and pre-processed before it can ever be visualized.
A Forced Solution?
This impedance mismatch aside, most BI tools have a Big Data story and can connect to the leading open source Big Data platforms: Hadoop and Spark. Yes, both of these cluster-based Big Data platforms tend to be used with volumes of data that BI tools can’t handle and data structures that BI tools don’t like, but there’s a workaround: Most BI tools connect to Big Data platforms using SQL-on-Hadoop technology like Apache Hive, Apache Impala or Spark SQL.
In these three systems, all data is structured as conventional database tables with rows and columns. Furthermore, if the SQL query is written correctly, then it can return a result set small enough for these tools to handle and visualize (even if the query itself executes over a much larger volume of data).
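The pattern can be illustrated with an in-memory SQLite database standing in for a SQL-on-Hadoop engine (the table and data are invented; against Hive, Impala or Spark SQL the detail table could hold billions of rows, but the shape of the workaround is the same): the aggregate query scans the full table on the back end yet returns only a few rows, small enough for a BI client to visualize.

```python
import sqlite3

# SQLite stands in here for a SQL-on-Hadoop engine such as Hive or Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (region TEXT, n INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("NA", 1)] * 5000 + [("EU", 1)] * 3000,  # the "large" detail table
)

# The query executes over every detail row on the back end but returns
# just one row per region -- a result set a BI tool can comfortably chart.
result = conn.execute(
    "SELECT region, SUM(n) FROM clicks GROUP BY region ORDER BY region"
).fetchall()
print(result)  # [('EU', 3000), ('NA', 5000)]
```

Whittling the data down on the server side like this is precisely what lets terabyte-scale back ends feed gigabyte-scale BI front ends.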
Situation Normal…All Foofed Up
So this is the status quo: BI client technology that is 15-20 years old is being used with SQL-on-Hadoop connectors that can make Big Data platforms appear as if they are conventional, relational databases. Sound good? Well, it’s not. In fact, this is a Frankenstein architecture that involves serious compromises:
– The requirement to model the data in advance of analyzing it is still there.
– The data must emanate from Hive/Impala tables that are structured as rows and columns, and the data volumes must be whittled down on the back end before visualization on the front end can be contemplated.
– The ability to query the data is constrained by the SQL paradigm, which, though broadly understood and sufficiently expressive for corporate BI, is not especially suited to the unique aspects of Big Data.
While this state of affairs is workable, it’s also far from ideal. To use a geopolitical analogy, this is more a state of détente than real peace. The BI and Big Data technologies are merely coexisting; they’re not achieving true harmony.
See It From the Vendors’ Side
As we’ve discussed, BI tools have a heritage around using OLAP/dimensional models built from tabular data sources. Because these principles are at the products’ very foundations, re-engineering those products is expensive, disruptive and risky. Given the uncertain state of Big Data adoption and project success rates, the vendors may reasonably decide that the returns do not justify the investment.
That’s reasonable, but it doesn’t solve the problem. Classic BI platforms need to transform into modern BI platforms if they’re going to be truly useful for Big Data. They need to work with Big Data and related technologies natively and not just abstract them away to look like prior-generation technologies.
While muted Big Data adoption and success rates may appear to dilute the value of re-engineering BI tools, that’s just a self-fulfilling prophecy. Adoption and success will come when popular tools are designed to extract Big Data’s unique value; without that capability, adoption and success are in fact being sabotaged.
Big Data shouldn’t merely be about enabling the same old BI on greater volumes of data. It should be about getting more immediate, targeted insights from data that is more granular and more accessible, with less delay in collection and far less processing required prior to analysis.
Newer tools on the market are doing this. They’re built for Hadoop and Spark from the ground up, and they don’t try to fit those newer platforms’ square pegs into traditional BI’s round hole. These tools go with the grain, not against it, to perform advanced analysis on voluminous, semi-structured data, right off the wire.
Customers looking for Big Data success have two choices: Use the newer tools, or wait and see if the older ones ever catch up, despite vested interests in maintaining the status quo. Time is running out though, and the older vendors need to make a move.
Andrew Brust is Senior Director, Market Strategy and Intelligence at Datameer, liaising between the Marketing, Product and Product Management teams and the big data analytics community. Andrew writes a blog for ZDNet called “Big on Data;” is an advisor to NYTECH, the New York Technology Council; serves as Microsoft Regional Director and MVP; and writes the Redmond Review column for VisualStudioMagazine.com.