NASHVILLE, TENN. – Teradata’s Partners Conference and Expo attracted about 4,000 attendees who wanted to see how their peers are moving beyond theory and putting big data into practice. Among the best-attended sessions was a CTO panel discussion titled “Big Data Management for Effective Analytics.”
The conversation ranged from the definition of a data lake to the new vendors emerging to marry business intelligence and data integration tools, and several themes surfaced. The first was building a business case. As with most technology initiatives started after the financial crash of 2008, there has to be a strong business reason for companies to invest significant resources in new technology – especially when the outcome may or may not deliver much business value.
“Sometimes tech people are looking for new uses for a new toy and it starts with the toy instead of the business. And that gets you into trouble, particularly related to data migration,” said Teradata CTO Stephen Brobst.
Jeff Pollack, Vice President of Product Management and Data Integration at Oracle, said successful business cases for a big data project include compliance initiatives, data discovery, and ETL off-loading. Jamie Engesser, Vice President of Solution Engineering for Hadoop provider Hortonworks, said he sees many companies looking to Hadoop for solutions when IT cannot provision resources fast enough.
“This is way too simplistic,” said Engesser, “but if you know the question you want to ask, Hadoop is probably not the right answer. If you don’t know the questions yet, Hadoop might be the right answer. There are a million caveats to that stuff, but the problem space has to be business driven.”
More than Storage
The technology side of the discussion focused on defining new concepts like data lakes and when to deploy them. A data lake is what it sounds like: a place to store lots of data until you decide what to do with it. In theory, this sounds relatively straightforward, but there are all manner of questions around issues like getting data in and out of the lake and how processing is done on top of the data. Also, data lifecycle and lineage tracking must be in place so companies can conduct data governance, said Don Tirsell, Vice President of Technical Alliances at data-integration vendor Informatica.
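The lineage requirement Tirsell describes can be made concrete with a minimal sketch: each dataset landing in the lake carries metadata about its source, ingest time, and upstream datasets, so governance questions ("where did this come from?") can be answered later. All class and field names here are hypothetical, not any vendor's actual API.

```python
import datetime

# Hypothetical sketch: a catalog entry recording lineage for a dataset
# in the lake. Names are illustrative, not a real product's API.
class LakeEntry:
    def __init__(self, name, source, upstream=None):
        self.name = name
        self.source = source
        self.upstream = upstream or []   # datasets this one was derived from
        self.ingested_at = datetime.datetime.now(datetime.timezone.utc).isoformat()

    def lineage(self):
        """Walk upstream links to reconstruct where this data came from."""
        chain = [self.name]
        for parent in self.upstream:
            chain.extend(parent.lineage())
        return chain

raw = LakeEntry("clickstream_raw", source="web-logs")
sessions = LakeEntry("sessions", source="derived", upstream=[raw])
print(sessions.lineage())   # ['sessions', 'clickstream_raw']
```

Real data lakes track far more (schemas, owners, retention policies), but the point stands: without this kind of record at ingest time, governance after the fact is guesswork.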
“You should start with the structure and then move from there,” said Shyam Mudambi, Vice President of IBM’s InfoSphere Product Development. “If you don’t understand context, any analytics you do on top of [it] is really not going to be very good.”
Another important aspect of working with a data lake is deciding what data to put into it.
“If you are chasing performance, a large-scale Hadoop cluster isn’t really the best use (of a data lake). But if you want provisioning and rapid time to value, there are some interesting things where a lot of customers are going after compliance-based initiatives. … But in getting started, the best bet is to select the right use-case and build business value incrementally,” said Oracle’s Pollack.
Say someone wants to move their business intelligence (BI) suite onto Hadoop. They start hand-coding how the data will be consumed from the data lake. But then things change: upgrades occur, applications get moved to the cloud, and so on.
“It’s a disaster,” said Teradata’s Brobst. “Eventually, someone has to run the system day to day and do the maintenance on the code, and it’s a huge problem.”
So understanding the underlying purpose of the data lake is key to understanding what data it should contain. Otherwise, the data lake is nothing more than a way to save money on storage.
Users also should adopt a common framework for establishing transformation rules, and keep those rules independent of where they run, said Informatica’s Tirsell.
“You can’t do that writing (of) transformation rules in four or five different languages,” he said.
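One way to read Tirsell's point is that transformation rules should be declared once, as data, and then compiled for whichever engine executes them. The sketch below (all rule names and operations are made up for illustration) applies the same rule set as a local Python function and renders it as a simplified SQL projection:

```python
# Hypothetical sketch: transformation rules declared once as plain data,
# runnable on more than one "engine" (a Python loop and a SQL string here;
# a real tool might target Spark, Hadoop, or a database).
RULES = [
    {"op": "rename", "from": "cust_nm", "to": "customer_name"},
    {"op": "upper",  "field": "customer_name"},
]

def apply_rules(record, rules=RULES):
    """Run the rules directly in Python against one record."""
    out = dict(record)
    for rule in rules:
        if rule["op"] == "rename":
            out[rule["to"]] = out.pop(rule["from"])
        elif rule["op"] == "upper":
            out[rule["field"]] = out[rule["field"]].upper()
    return out

def rules_to_sql(rules=RULES):
    """Render the same rename rules as a (simplified) SQL projection."""
    cols = [f'{r["from"]} AS {r["to"]}' for r in rules if r["op"] == "rename"]
    return "SELECT " + ", ".join(cols) + " FROM source"

print(apply_rules({"cust_nm": "acme"}))   # {'customer_name': 'ACME'}
print(rules_to_sql())                     # SELECT cust_nm AS customer_name FROM source
```

Because the rules live in one declarative spec rather than in four or five languages' worth of hand-written code, changing a rule changes it everywhere it runs.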
To get to this point, data integration (DI) and BI tools will have to merge, said Brobst. Because the structure of Hadoop data is defined at query time rather than at load time, BI tools will need to issue an access specification that includes transformation. But BI tools don’t know how to generate a schema for data that isn’t already in a structured or relational form. They need the DI tools for this. “There’s going to be this interesting merging that takes place,” he said, with startups likely to be among the players.
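The schema-at-query-time idea Brobst describes (often called schema-on-read) can be shown in miniature: the raw bytes in the lake carry no declared structure, and each consumer supplies column names and types when it reads. The schema and data below are invented for the example:

```python
import csv
import io

# Schema-on-read sketch: RAW landed in the lake as-is, with no schema.
# A reader imposes structure (hypothetical column names/types) at query time.
RAW = "101,9.99\n102,24.50\n"

def read_with_schema(raw, schema):
    """Apply column names and types while reading, not while loading."""
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, fields)})
    return rows

# Different consumers can impose different views on the same raw data.
orders = read_with_schema(RAW, [("order_id", int), ("amount", float)])
print(orders[0])   # {'order_id': 101, 'amount': 9.99}
```

Contrast this with a relational warehouse, where the schema is fixed at load time; the transformation step the BI tool needs is exactly the `schema` argument here, which is why Brobst argues BI tools must absorb DI capabilities to generate it.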
The panel also discussed “data wrangling,” which, according to Tirsell, is a new market segment that combines business analytics skills with a BI platform, transforming data on the fly so it can be combined, merged, or de-duplicated.
“There are a number of startups that are looking at this in-between BI and DI layer and building products that address that middle ground,” he said.
Now a freelance writer, in a former, not-too-distant life, Allen Bernard was the managing editor of CIOUpdate.com and numerous other technology websites. Since 2000, Allen has written, assigned, and edited thousands of articles that focus on the intersection of technology and business. In addition to content marketing and PR work, he now writes for Data Informed, Ziff Davis B2B, CIO.com, the Economist Intelligence Unit, and other high-quality publications. Originally from the Boston area, Allen now calls Columbus, Ohio, home. He can be reached at 614-937-2316 or email@example.com. Please follow him on Twitter at @allen_bernard1, on Google+ or on LinkedIn.