Wikipedia—that modern-day oracle—defines “unstructured data” with simplicity and elegance: “information that either does not have a predefined data model and/or does not fit well into relational tables.” This, in a lucid moment, is exactly how I would define unstructured data. Implicit in this definition, incidentally, is an ode to the relational data model, calling it out separately, while bunching all other forms of data structure—from arrays and linked lists to spreadsheets to XML and object data stores—into one category.
On the other hand, I’ve often seen (or heard) big data equated with unstructured data. That isn’t quite true, of course: big data can be just as structured as “small” data.
But a more common misconception about unstructured data is that it is, well, unstructured.
The typical image conjured by the term “big data” is nicely captured in this nugget from The New York Times: “Most of the Big Data surge is data in the wild—unruly stuff like words, images and video on the Web and those streams of sensor data. It is called unstructured data and is not typically grist for traditional databases.”
This daunting image of big data is further magnified by the staggering percentage of the world’s information that is perceived to be unstructured—70 percent? 80 percent? 90? Take your pick; I’ve seen them all, quoted at various times and from seemingly very respectable sources like Gartner and IBM. And the trouble is, they all seem very reasonable estimates.
Your Perspective on Unstructured Data
But when we look more closely at the numbers and put them into perspective (your perspective, your employer’s perspective, your customer’s perspective), two significant factors come into play, and both are game-changers.
One, most of that does not apply to you, or your situation, at all. In a study last year, IDC estimated the size of the “digital universe” to be of the order of (hold your breath) 1.8 trillion gigabytes, which works out to about 1.8 zettabytes (a zettabyte being a thousand exabytes). But think for just a moment: How much of it applies to you? Fortunately, no more than a tiny fraction (even if that amount is still enough to give us a migraine at work).
In other words, let’s not worry about how much unstructured data lies out there, and how fast it’s growing. It’s only the data pertaining to your corporate goals and strategy that matters.
Which brings us to the next comforting fact: Most of the “big data” that pertains to you is not totally unstructured.
Let’s look at an example.
Take the case of sentiment analysis being conducted by a toy manufacturer that sells toys through its own web site and also through intermediaries like Amazon.com and, directly and indirectly, eBay. (Notice the irony in the term “toy manufacturer” here: you and I both know that the real toy manufacturer is in China, and our protagonist company is merely the holder of a bit of intellectual property and a lot of marketing muscle.)
The product management team is looking to trawl the Internet for the treasure trove of user comments on product acceptability, which are of course typically free-form: a fairly classic case study for big data.
Analysis Techniques at Work
The information analysis techniques here are equally typical. One is data mining: forecasting how the market will receive the new product being launched during the holiday season, based on what customers have liked and not liked about existing, comparable products in the company portfolio. The other is active intelligence, which we will define as the ability to understand in relatively real time (say, within a day or two) the most recent causes of product failure or customer dissatisfaction, so that our call center operators are informed and on the alert.
Well, what makes these user comments not quite unstructured? In a nutshell, the metadata surrounding the unstructured data.
For example, both Amazon and eBay have product metadata—such as the name of the product, product category, product attributes, SKU number—that can be mapped to the manufacturer’s product catalog. This solves one critical problem: putting context around the comments.
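To make the mapping concrete, here is a minimal sketch of attaching catalog context to marketplace reviews by SKU. Everything in it is assumed for illustration: the catalog contents, the Review fields, the attach_context helper, and the premise that the marketplace SKU matches the in-house SKU once case and whitespace are normalized. Real listings would likely need a fuzzier match.

```python
from dataclasses import dataclass

# Hypothetical in-house product catalog, keyed by SKU.
CATALOG = {
    "TOY-1001": {"name": "Robo Racer", "category": "Vehicles"},
    "TOY-2002": {"name": "Castle Blocks", "category": "Building Sets"},
}


@dataclass
class Review:
    source: str  # marketplace the review came from
    sku: str     # SKU as listed by the marketplace (assumed to match ours)
    stars: int   # star rating given by the reviewer
    text: str    # free-form comment


def attach_context(review, catalog):
    """Return the review enriched with catalog fields, or None if the SKU is unknown."""
    sku = review.sku.strip().upper()  # normalize before lookup
    product = catalog.get(sku)
    if product is None:
        return None
    return {
        "source": review.source,
        "sku": sku,
        "product_name": product["name"],
        "category": product["category"],
        "stars": review.stars,
        "text": review.text,
    }


reviews = [
    Review("amazon", "toy-1001", 1, "Wheels fell off on day one."),
    Review("ebay", "TOY-9999", 4, "Nice, but not sure which model this is."),
]

# Keep only the reviews that could be tied back to a catalog entry.
matched = [r for r in (attach_context(rv, CATALOG) for rv in reviews) if r is not None]
```

Reviews that cannot be matched are dropped here for brevity; in practice they are probably worth keeping in a separate queue, since an unmatched SKU usually says more about the state of the catalog than about the review.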
Then, both Amazon and eBay provide a “star rating,” which translates very directly to degrees of “like.” Very useful and convenient, indeed.
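As a rough illustration of how star ratings can stand in for degrees of “like,” the sketch below buckets ratings and summarizes them per product. The 1-to-5 scale, the bucket boundaries, the like_level helper and the sample ratings are all assumptions made for the example, not anything prescribed by the marketplaces.

```python
from collections import defaultdict
from statistics import mean


def like_level(stars):
    """Map a 1-5 star rating to a coarse degree of 'like' (assumed buckets)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    if stars <= 2:
        return "dislike"
    if stars == 3:
        return "neutral"
    return "like"


# Hypothetical (SKU, stars) pairs harvested from marketplace reviews.
ratings = [("TOY-1001", 1), ("TOY-1001", 2), ("TOY-1001", 5), ("TOY-2002", 4)]

# Group the star ratings by product.
by_sku = defaultdict(list)
for sku, stars in ratings:
    by_sku[sku].append(stars)

# Per-product summary: average rating and share of reviews in the "dislike" bucket.
summary = {
    sku: {
        "mean_stars": round(mean(stars), 2),
        "dislike_share": sum(like_level(s) == "dislike" for s in stars) / len(stars),
    }
    for sku, stars in by_sku.items()
}
```

A product whose dislike share climbs from one week to the next is an obvious candidate for a closer reading of the comments themselves.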
It now becomes a straightforward exercise in text analytics to derive meaning from seemingly casual user comments, for example, “There are so many faults with the <product name> that I would not recommend it even if it was free! This is by far the worst!”
The task of semantic analysis becomes significantly more challenging when it comes to interpreting comments from, say, a customer blog, which inherently has less structure. Making some sense of this requires web scraping/harvesting tools as well as the support of enterprise data management programs like master data management and metadata management.
This kind of analysis is inherently more applicable to the consumer market, though of course not limited to products sold on the likes of Amazon (think of financial products and other consumer services). Making sense of unstructured data has two main determinants: the meta-structure surrounding the unstructured data, and how well we structure our in-house data so that we can map and decipher that meta-structure.
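For a sense of what the simplest form of such text analytics might look like, here is a toy lexicon-based scorer with one-word negation handling. The word lists, the score_comment function and the product name “Robo Racer” (standing in for the product name in the comment above) are invented for illustration; a real effort would rely on a proper text analytics tool and a far richer vocabulary.

```python
import re

# Tiny hypothetical sentiment lexicons; real ones are far larger.
POSITIVE = {"love", "great", "sturdy", "best", "recommend"}
NEGATIVE = {"faults", "worst", "broken", "refund"}
NEGATORS = {"not", "never", "no"}


def score_comment(text):
    """Sum word polarities, flipping a word's polarity if the previous word negates it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)  # +1, -1 or 0
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity  # e.g. "not recommend" counts as negative
        score += polarity
    return score


comment = (
    "There are so many faults with the Robo Racer that I would not "
    "recommend it even if it was free! This is by far the worst!"
)
print(score_comment(comment))  # -3: "faults", "not recommend", "worst"
```

A crude score like this is only useful alongside the metadata: knowing which product the comment refers to and what star rating came with it is what turns a count of negative words into something a product manager can act on.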
The point is that unstructured data can be tamed. It can be a non-trivial proposition, requiring significant investment and an ongoing commitment from stakeholders across the company to support the effort from the ground up. Come to think of it, that’s where it all starts.
Rajan Chandras is a practitioner in enterprise data management and a freelance columnist for InformationWeek.com. He is employed in a senior capacity at a major healthcare insurance firm in the New York region. You can reach him at rchandras at gmail dot com.