Just because you’ve assembled a massive trove of information and made it searchable doesn’t necessarily mean that you have something that’s going to be useful to anybody. If you don’t know where that information is coming from and whether you can trust it, then it’s useless. The same idea applies to big data analytics. If you don’t know where the data is coming from, your data lake will quickly start to resemble a swamp instead of what it should resemble: a reservoir, something that guarantees access, quality, and provenance.
Here are four ways to ensure that your data lake successfully meets your future data analysis needs.
Understand that data is heterogeneous. No two data sets are perfectly alike. In fact, if you think about the three "Vs" of big data (volume, velocity, and variety), you'll quickly realize that the first two don't present much of a problem. We have known how to scale systems for a while, and many data systems do it extremely well.
The problem big data presents isn't so much the size of a table as the variety of the data. A lot of data doesn't feel "table-like" and needs to be accessed in ways that don't fit relational algebra. It is only by finding ways to contend with that variety, within big data generally and within Hadoop in particular, that we are able to see the data clearly. Otherwise, we are just building larger and more powerful ways to compare apples with oranges.
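To make the point concrete, here is a minimal sketch of why semi-structured data resists a fixed relational schema. The clickstream record and field names below are purely illustrative; one common workaround is to flatten nested objects into dotted column names, which works for nesting but still leaves no natural relational slot for lists:

```python
import json

# Hypothetical clickstream event: nested and variably shaped,
# so it does not map cleanly onto a fixed relational table.
raw = '''
{"user": {"id": 42, "country": "DE"},
 "event": "click",
 "tags": ["promo", "mobile"]}
'''

def flatten(record, prefix=""):
    """Recursively flatten nested dicts into dotted column names.
    Lists are kept as-is: a relational row has no natural slot for them."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

row = flatten(json.loads(raw))
print(row)
# {'user.id': 42, 'user.country': 'DE', 'event': 'click', 'tags': ['promo', 'mobile']}
```

The leftover `tags` list is exactly the kind of "apples with oranges" problem the paragraph describes: you must choose between exploding it into child rows, serializing it, or querying it with non-relational tools.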
There is no free lunch, only more choices. While it's true that the tools for data analysis are stronger than ever, we are still a long way from fully automating any of these processes. And while curated information is easier to sort through, wildness is only one characteristic of a data set. Is it memory oriented or graph oriented? Will the information be more helpful if it is indexed, or if it is merely searchable? These questions are still best answered with the help of a human being. Thinking that a particular tool can bail you out is naïve; there is no such thing as a free lunch in life. Why would we expect data analysis to be any different?
Understand the value of metadata. Just as no science experiment would be taken seriously without a control group, no data analysis is worth very much without metadata. To call your data a true "reservoir" or "lake," you need to provide the business-level guarantees one expects from a data warehouse, without the accompanying ETL constraints. One of a data lake's most important characteristics is metadata that enables non-experts to know where the various forms of stored data live and who is entitled to access them. Like actual bodies of water, data lakes fill organically. Metadata is one of the only ways you'll be able to make sense of information that arrives at unpredictable times and from all over the world.
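A minimal catalog entry can illustrate what "business-level guarantees" look like in practice. Everything below is a sketch, not a real catalog product's schema: the class, field names, and the storage path are all illustrative assumptions about what a provenance-and-entitlements record might contain.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class DatasetMetadata:
    """Hypothetical minimal catalog entry; field names are illustrative."""
    name: str
    source: str                      # provenance: where the data came from
    owner: str                       # who grants entitlements to the data
    ingested_at: str                 # when it landed in the lake
    schema_fields: list = field(default_factory=list)

entry = DatasetMetadata(
    name="clickstream_raw",
    source="s3://landing/web-logs/",  # illustrative landing-zone path
    owner="analytics-team",
    ingested_at=datetime.now(timezone.utc).isoformat(),
    schema_fields=["user_id", "event", "ts"],
)
print(asdict(entry))
```

Even a record this small answers the two questions the paragraph raises for non-experts: where the data is, and whom to ask before using it.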
Add in your own enterprise features. Every use case will have a unique solution. Geography may be an issue. (Analyzing data from two places at once is already a challenge, and it really doesn’t help if those two places are in different countries with different laws and regulatory challenges.) Your problem just as easily could be cleaning messy data, as computers are not always great at spotting duplicates. Data lakes, or data silos, are good for consolidating and incrementally refining information, but a “lowest common denominator” solution doesn’t help everyone. You need to build out unique enterprise features that go with the data you are trying to analyze. It’s important to understand that the data lake is part of a bigger ecosystem of enterprise data tools – there continues to be great value in having data warehouses, discovery platforms, and real-time processing systems that work in concert with your data lake.
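The duplicate-spotting problem mentioned above is a good example of an enterprise feature you end up building yourself. Here is a naive sketch, with illustrative company names: normalize records before comparing them, since "Acme Inc." and "acme, inc" are obviously the same customer to a human but not to an exact string match.

```python
import re

def normalize(name):
    """Crude normalization for duplicate spotting: lowercase, drop
    punctuation, and strip common legal suffixes. Illustrative only;
    real entity resolution needs fuzzier matching than this."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", "", name)             # drop punctuation
    name = re.sub(r"\b(inc|llc|ltd)\b", "", name)   # strip legal suffixes
    return " ".join(name.split())

records = ["Acme Inc.", "acme, inc", "ACME INC", "Beta LLC"]
seen, unique = set(), []
for rec in records:
    key = normalize(rec)
    if key not in seen:            # keep the first spelling we saw
        seen.add(key)
        unique.append(rec)
print(unique)
# ['Acme Inc.', 'Beta LLC']
```

Note that this still only catches duplicates that normalize to identical strings; misspellings ("Acme Incc") would slip through, which is why the paragraph is right that computers are not always great at spotting duplicates.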
The variety of big data and Hadoop's continued customizability remain two of the central themes of data analysis. Be wary of any provider that claims to be a "one-stop shop." On the contrary, it is only by keeping in mind the limitations of our data access patterns and the technological tools at our disposal that we are able to parse data in any meaningful way. You wouldn't buy a bottle of water if you didn't know where it came from. By increasing the specificity and clarity of the language we use to talk about big data, we will increase our ability to produce insights we can actually trust.
Ron Bodkin is the founder and CEO of Think Big Analytics. He founded Think Big to help companies realize measurable value from big data. Think Big is a leading provider of independent consulting and integration services focused on big data solutions, with expertise spanning all facets of data science and data engineering to help customers drive maximum value from their big data initiatives.