“Big data” isn’t anything new.
You’d be excused for thinking otherwise, given the resurgent interest in mining staggering amounts of data (which is how we could define big data at its most basic), catalyzed by an incoming tsunami of “big data” tools and technologies that are not particularly novel either.
The fact is, governments and large international organizations have been dealing with big data for decades: think census surveys and the research behind UNESCO and UNICEF programs to direct aid and alleviate poverty, for example. Also consider the ongoing revelations about the National Security Agency’s surveillance programs (more about that in a moment).
Until recently, those kinds of projects occurred in relative obscurity. Now there is no question that we are seeing tremendous momentum in reporting and understanding of big data use cases at the large-organization level, driven partly by our ability to process big data (bigger and faster servers and storage, the ubiquity of technologies like Hadoop and, of course, the sheer availability of big data) and partly by a rapidly growing awareness of the need to gain insight from this data. This in turn is throwing light on the opportunities leveraged, misused and just plain missed. It is clear that these use cases have both lessons and ramifications for all of us as business professionals and private citizens.
A very interesting case study was reported recently by The New York Times regarding the newly completed German census. According to an announcement by the Statistisches Bundesamt, Germany’s federal bureau of statistics, the country had 1.5 million fewer people than expected.
The census findings were a “double whammy”: a lower (and aging) population means fewer able-bodied workers and less tax collection; plus, who would pay off future debts?
Now, one simple reason for the discrepancy was that Germans really weren’t sure what to expect: their last census was, astonishingly, a quarter of a century ago, before East and West Germany were reunified. Unlike, say, the United States, where the populace (at least, those of us fortunate enough to be here legally) happily submits to being counted every 10 years like clockwork, the Germans, it seems, view a census as an invasion of privacy and are hence reluctant to submit to being counted. And to exacerbate things, German history does not provide a strong supporting case for state monitoring: during the 1930s and 1940s, the Nazis reputedly used the census as a tool to identify Jews.
Subsequent investigations revealed an intriguing fact: most of the missing population consisted of migrants, and unusually healthy foreigners at that. “Demographers were trying to explain the healthy-migrant effect, why they were living to be 110 years old,” as quoted in the Times.
A Classic Data Management Issue
Turns out, it was a classic data quality problem rooted in process: foreigners were required to register on arrival in the country but, of course, seldom bothered to “deregister” when they moved out. As a result, migrants continued to be counted long after they had left the country.
The “big data” lessons here are undeniable: overcoming the logistical challenges of gathering voluminous data across a large “catchment” area, scrubbing it, integrating it, and making sense of it so that the resulting analytics do not mislead, with some master data management (MDM) aspects thrown into the mix. For example, I am curious about the use, if any, of MDM-like de-duping techniques to identify migrants (or, for that matter, even citizens) who resided at multiple locations over their lifetimes, which is by no means a rare pattern. This project also underscores the importance of data profiling, the process of previewing the data collected in order to assess its quality and correctness, which could have led to early detection of data outliers such as the unusual aging profile of migrants.
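To make the profiling point concrete, here is a minimal sketch in Python of the kind of check that might surface such an outlier early. Everything in it is an assumption for illustration: the record layout, the sample values, and the age cutoff are invented and do not come from the German census.

```python
from statistics import median

# Hypothetical registry records: (person_id, group, age_in_years).
# The values are invented purely to illustrate the check.
records = [
    ("p1", "citizen", 34),
    ("p2", "citizen", 71),
    ("p3", "migrant", 29),
    ("p4", "migrant", 108),
    ("p5", "migrant", 112),
    ("p6", "citizen", 88),
]

MAX_PLAUSIBLE_AGE = 105  # assumed cutoff, chosen only for illustration


def profile_ages(rows, max_age=MAX_PLAUSIBLE_AGE):
    """Summarize ages per group and count records older than max_age."""
    by_group = {}
    for _, group, age in rows:
        by_group.setdefault(group, []).append(age)

    summary = {}
    for group, ages in by_group.items():
        summary[group] = {
            "count": len(ages),
            "median_age": median(ages),
            "max_age": max(ages),
            "implausible": sum(1 for a in ages if a > max_age),
        }
    return summary


profile = profile_ages(records)
for group, stats in profile.items():
    print(group, stats)
```

A group whose share of implausibly old records is far above that of the rest of the population is a prompt to question the collection process (here, the missing deregistrations) before anyone starts explaining the numbers.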
Meanwhile, back home, the recent furor over the National Security Agency (NSA) collecting phone records from Verizon and others on a mass scale spotlights the age-old conflict between the need of governments to gather enough information to govern effectively (a big part of which is maintaining national security) and the need to avoid trampling the rights of citizens. And, unfortunately for us hapless citizens, it does not appear that the two are mutually exclusive.
This has a correspondence in the world of business, too. Replace “government” and “citizens” with “companies” and “customers”, and we have the makings of a growing corporate dilemma: in our quest to know as much as possible about our customers and their lifestyles and preferences, at what point does the information we collect stop being benign and begin to get intrusive?
Take the example of a grocery store or a Web store that derives increasingly sophisticated insight into consumer preferences, past as well as future, in order to serve the customer better and improve both sales and margin. The line between sophisticated analytics and invasion of privacy is mighty fine indeed, as demonstrated by a report that appeared in The Times early last year. One startling and much-publicized case mentioned was how predictive analytics identified a pregnant teenage girl before her family was aware of the pregnancy. The original aim of the software was to identify women in their second trimester of pregnancy in order to influence their purchases and lock them in for years with targeted marketing. This was accomplished by means of a “pregnancy score” computed from the purchasing records for a basket of about 25 products. When the girl began to receive coupons appropriate to her condition, the secret was out.
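For readers who want a feel for how a score like this might work, here is a deliberately simplified sketch in Python. The report does not describe the retailer's actual model, so the weighted-sum approach, the product names, the weights, and the threshold below are all hypothetical.

```python
# Hypothetical signal products and integer weights. The real basket of
# about 25 products and the real scoring method are not described in
# the report; these values are invented for illustration.
WEIGHTS = {
    "product_a": 30,
    "product_b": 45,
    "product_c": 15,
    "product_d": 10,
}

THRESHOLD = 60  # assumed cutoff above which a customer would be targeted


def basket_score(purchases, weights=WEIGHTS):
    """Add up the weights of the signal products found in a purchase history."""
    bought = set(purchases)  # repeat purchases count once in this sketch
    return sum(w for product, w in weights.items() if product in bought)


def should_target(purchases, threshold=THRESHOLD):
    """Return True when the score reaches the (assumed) threshold."""
    return basket_score(purchases) >= threshold


history = ["product_b", "bread", "product_a", "product_a"]
score = basket_score(history)
targeted = should_target(history)
print(score, targeted)
```

Even in this toy form, the dilemma is visible: nothing in the purchase history says anything sensitive outright, yet the combination is enough to trigger marketing that reveals the inference.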
This sort of predictive analytics (in this case in order to cultivate closer customer connections) is one of many emerging cases in which big data plays a role that we were not aware of.
It is worth considering the implications of such use cases as we continue to exploit data. Even as many of us bask in the sunlight of what we like to call “free society”, there are, of course, countries where the very concepts of privacy and individual freedom are viewed as subversive by the state and where citizens have come to accept (albeit with reluctance and resignation, perhaps) that it is indeed the right of the state to delve deep into, and wield control over, their lives.
There is an old adage, “knowledge is power when put into action.” As big data continues to propel increasingly deep forays into population analytics by countries and corporations alike, the question of “actionable insight versus actionable intrusion” (pun on “actionable” intended) will only grow in magnitude and complexity.
Rajan Chandras, a senior level practitioner in enterprise data management and a freelance technology columnist, is employed at a major healthcare insurance firm in the New York region. You can reach him at rchandras at gmail dot com.