In today’s evolving world of big data, many businesses have been inspired by the potential of the data lake approach. A data lake offers advantages for storing large volumes of heterogeneous data and, for the majority of organizations that need to analyze complex data (structured and unstructured), delays the need to integrate the data with a data warehouse.
But constructing a usable data lake with native formats presents a number of challenges that must be addressed for the data lake to fulfill its promise of making it easier and less costly to extract actionable information from data.
According to Gartner Research Director Nick Heudecker, “Data lakes typically begin as ungoverned data stores. Meeting the needs of wider audiences requires curated repositories with governance, semantic consistency, and access controls.”
It’s important not to overly constrain the data, but without sensible governance, users will soon find that accessing what they have stored is surprisingly challenging. Idle and overgrown, the data lake will quickly become a stagnant data swamp. Organizations can avoid data swamps, however, by adding semantics to a data lake.
Semantics brings a powerful yet highly flexible structure to unstructured and structured data, in a model that is sustainable over time. It allows users to see relationships between data without first forcing that data into a schema straitjacket, and it supports ad hoc and unanticipated analytic uses. It affordably breaks down data silos and, more simply, just makes sense. Semantics is intrinsically about logic and rules, and was developed to organize information in a comprehensible fashion. So it should come as no surprise that semantics provides us with a highly usable and consistent taxonomy model for data lakes.
For example, variances between words and their uses within big data sets can be highly problematic. Think of the last time you searched for a word with multiple meanings: perhaps you wanted information on Python programming, but ended up with results about massive snakes. Think too about the way we use different spellings and abbreviations for the same words (e.g., California, Calif., CA). Context matters also: Are the data referring to the city of New York, or the state?
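The spelling and context problems above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the lookup tables, the identifiers such as "state:CA", and the function names are hypothetical, and a real semantic model would draw on a shared vocabulary rather than a hand-written dictionary.

```python
# Hypothetical lookup table: several surface forms map to one canonical identifier.
CANONICAL_STATES = {
    "california": "state:CA",
    "calif": "state:CA",
    "ca": "state:CA",
}

# Hypothetical context table: the field a value appears in decides its meaning.
NEW_YORK_BY_FIELD = {
    "city": "city:NewYork",
    "state": "state:NY",
}


def normalize_state(text):
    """Return the canonical identifier for a state name, or None if unknown."""
    key = text.strip().rstrip(".").lower()
    return CANONICAL_STATES.get(key)


def resolve_new_york(field_name):
    """Use the surrounding field as context to decide between city and state."""
    return NEW_YORK_BY_FIELD.get(field_name)


for form in ["California", "Calif.", "CA."]:
    print(form, "->", normalize_state(form))

print("New York in a city field ->", resolve_new_york("city"))
print("New York in a state field ->", resolve_new_york("state"))
```

All three spellings resolve to the same identifier, and the same string "New York" resolves differently depending on where it appears, which is the kind of meaning a semantic model is meant to capture for software.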
The semantic data model was born in response to issues like these: It is a way to extract and define the meaning of the data in a logical way that makes sense both to people and machines. Still, it’s important to note that even though semantic models make sense to humans, they are intended primarily to allow software to extract and assign meaning to data independently.
Using a semantic data model, you represent the meaning of a data string as binary relationships, typically in triples made up of two objects and the action that links them. For example, to describe a dog that is playing with a ball, your objects are DOG and BALL, and their relationship is PLAY. In order for the data tool to understand what is happening among these three pieces of information, each triple is ordered, with the active object (the subject) first; in this case, DOG. If the data were structured with BALL in the first position, the assumption would be that the ball was playing with the dog. This simple structure can express very complex ideas and makes it easy to organize information in a data lake and then integrate additional large data stores.
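As a rough illustration of the DOG, PLAY, BALL example above, the following Python sketch stores statements as ordered triples and looks them up by position. The `Triple` type, the `match` helper, and the second statement about a cat are assumptions made for this example, not part of any particular product.

```python
from collections import namedtuple

# An ordered statement: the active object (subject) comes first.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

triples = [
    Triple("DOG", "PLAY", "BALL"),
    Triple("DOG", "CHASE", "CAT"),  # hypothetical extra statement
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that match the given fields; None is a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]


# What is the dog playing with?
print(match(triples, subject="DOG", predicate="PLAY"))

# Is the ball playing with anything? No statement has BALL as its subject.
print(match(triples, subject="BALL", predicate="PLAY"))
```

Because position carries meaning, the second query returns nothing: the data says the dog plays with the ball, not the reverse.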
Semantic Data Models in the Swamp
A workable semantic data model can be created by anyone with an understanding of logic and taxonomies. But when it comes to integrating disparate data sets, the fastest route to success is the use of a common language (nomenclature) across an entire audience and, often, industry.
Semantic data models, in combination with semantic graph databases, bring clarity, relationships, and structure to unstructured information and are designed explicitly to share data, discoveries, and answers. The data sources used for analytics can be – and often are – both internal and external, such as Linked Open Data, a graph of interlinked data sets. The standard for a Linked Open Data set is RDF, the Resource Description Framework, which is a model for describing things and their relationships. Tim Berners-Lee, inventor of the World Wide Web, is fond of describing linked data as “the semantic Web done right.”
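To suggest why shared identifiers make internal and external data easy to combine, here is a minimal Python sketch that treats each data set as a set of subject, predicate, object statements. Everything in it is an assumption for illustration: the example.org identifiers, the two tiny data sets, and the helper functions are made up, and a production system would use an RDF library and a graph database rather than plain Python sets.

```python
# Hypothetical identifiers, written as URIs in the style RDF uses.
LIVES_IN = "http://example.org/livesIn"
LOCATED_IN = "http://example.org/locatedIn"
CUSTOMER = "http://example.org/customer/17"
NYC = "http://example.org/place/NewYorkCity"
NY_STATE = "http://example.org/place/NewYorkState"

# Internal data knows where the customer lives.
internal = {(CUSTOMER, LIVES_IN, NYC)}

# External (linked) data knows which state the city is in.
external = {(NYC, LOCATED_IN, NY_STATE)}

# Because both sets use the same identifier for the city, merging is a union.
graph = internal | external


def objects(graph, subject, predicate):
    """Return every object linked to the subject by the predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def states_for(graph, customer):
    """Follow livesIn, then locatedIn, to find the customer's state."""
    states = []
    for place in objects(graph, customer, LIVES_IN):
        states.extend(objects(graph, place, LOCATED_IN))
    return states


print(states_for(graph, CUSTOMER))     # answered from the merged data
print(states_for(internal, CUSTOMER))  # internal data alone cannot answer
```

The question "which state does this customer live in?" can be answered only after the merge, and the merge itself needed no mapping step because both sources already agreed on the identifier for the city.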
The development of semantic data models for key industries is underway, with healthcare at the forefront, spearheaded by Montefiore Medical Center and its partners, which have created the first semantic data lake for healthcare. Semantic data models act as a translator, allowing data that uses varying industry terminology and wording to be integrated easily with other internal and public data sets.
A semantic data lake is highly agile. The architecture adapts quickly to changing business needs, as well as to the frequent addition of new and continually changing data sets. No up-front schema design, lengthy data preparation, or curation is required before analytics work can begin. Data is ingested once and is then usable by any and all analytic applications. Best of all, analysis isn’t impeded by the limitations of pre-selected data sets or pre-formulated questions, which frees users to follow the data trail wherever it may lead them.
Jans Aasman, Ph.D., is the CEO of Franz Inc., a leading supplier of graph database technology. Dr. Aasman’s previous experience and educational background include KPN Research, the research lab of the major Dutch telecommunications company; a tenured professorship in Industrial Design at the Technical University of Delft, where he held the chair in Informational Ergonomics of Telematics and Intelligent Products; a position as visiting scientist at Carnegie Mellon University, in the Computer Science Department of Prof. Dr. Allen Newell; research at the Traffic Research Center of the University of Groningen (The Netherlands); and experimental and cognitive psychology at the University of Groningen, with a specialization in psychophysiology and cognitive psychology.