The article below is excerpted from the book Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information (Morgan Kaufmann, 2013), with permission from the publisher. The book covers methods that permit data to be shared and integrated among different big data resources.
Imagine using a restaurant locator on your smartphone. With a few taps, it lists the Italian restaurants located within a 10-block radius of your current location. The database being queried is big and complex (a map database, a collection of all the restaurants in the world, their longitudes and latitudes, their street addresses, and a set of ratings provided by patrons, updated continuously), but the data that it yields is small (e.g., five restaurants, marked on a street map with pop-ups indicating their exact address, telephone number, and ratings). Your task comes down to selecting one restaurant from among the five and dining thereat.
In this example, your data selection was drawn from a large data set, but your ultimate analysis was confined to a small data set (i.e., five restaurants meeting your search criteria). The purpose of the big data resource was to proffer the small data set. No analytic work was performed on the big data resource—just search and retrieval. The real labor of the big data resource involved collecting and organizing complex data so that the resource would be ready for your query. Along the way, the data creators had many decisions to make (e.g., should bars be counted as restaurants? What about take-away only shops? What data should be collected? How should missing data be handled? How will data be kept current?).
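The reduction described above, a large table queried down to a handful of relevant rows, can be sketched in a few lines of code. The data, the `nearby_italian` helper, and the 0.8 km radius below are all hypothetical, standing in for the far larger and messier database a real locator service would query:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Hypothetical slice of a much larger restaurant database.
restaurants = [
    {"name": "Trattoria Roma", "cuisine": "Italian",
     "lat": 40.7420, "lon": -73.9900, "rating": 4.5},
    {"name": "Sushi Kai", "cuisine": "Japanese",
     "lat": 40.7410, "lon": -73.9890, "rating": 4.7},
    {"name": "Osteria Blu", "cuisine": "Italian",
     "lat": 40.7800, "lon": -73.9500, "rating": 4.2},
]

def nearby_italian(rows, lat, lon, radius_km=0.8):
    """Reduce the big table to the small answer set:
    Italian restaurants within radius, best-rated first."""
    return sorted(
        (r for r in rows
         if r["cuisine"] == "Italian"
         and haversine_km(lat, lon, r["lat"], r["lon"]) <= radius_km),
        key=lambda r: -r["rating"],
    )

result = nearby_italian(restaurants, 40.7425, -73.9900)
```

All of the heavy lifting (collecting, cleaning, and indexing the worldwide table) happens before the query; the query itself is a simple filter that returns a small data set.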
Big data is seldom, if ever, analyzed in toto. There is almost always a drastic filtering process that reduces big data into smaller data. This rule applies to scientific analyses. The Australian Square Kilometre Array of radio telescopes, WorldWide Telescope, CERN’s Large Hadron Collider, and the Panoramic Survey Telescope and Rapid Response System array of telescopes produce petabytes of data every day. Researchers use these raw data sources to produce much smaller data sets for analysis.
Here is an example showing how workable subsets of data are prepared from big data resources. Blazars are rare super-massive black holes that release jets of energy moving at near-light speeds. Cosmologists want to know as much as they can about these strange objects. A first step to studying blazars is to locate as many of these objects as possible. Afterward, various measurements on all of the collected blazars can be compared and their general characteristics can be determined. Blazars seem to have an infrared signature not present in other celestial objects. The Wide-field Infrared Survey Explorer (WISE) collected infrared data on the entire observable universe. Researchers extracted from the WISE data every celestial body whose infrared signature was suggestive of a gamma-ray-emitting blazar—about 300 objects. Further research on these 300 objects led researchers to believe that about half (about 150) were blazars.
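The staged reduction in the blazar example, a vast catalog cut down to a few hundred candidates, then to a confirmed subset, can be sketched as follows. The catalog, the single "color" measurement, and both thresholds are entirely synthetic; a real WISE analysis uses several infrared bands and far more elaborate selection criteria:

```python
import random

random.seed(42)

# Synthetic stand-in for a large survey catalog: each object carries
# one hypothetical infrared "color" measurement.
catalog = [{"id": i, "color": random.gauss(0.0, 1.0)} for i in range(100_000)]

# Stage 1: a crude signature cut pulls a small candidate set out of
# the full catalog (analogous to extracting ~300 blazar candidates).
candidates = [obj for obj in catalog if obj["color"] > 3.5]

# Stage 2: follow-up analysis (faked here as a stricter threshold)
# confirms only a fraction of the candidates.
confirmed = [obj for obj in candidates if obj["color"] > 3.8]

print(len(catalog), len(candidates), len(confirmed))
```

The analysis that matters is performed on `confirmed`, a data set small enough to inspect object by object, while the full catalog serves only as the raw source.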
This is how big data research typically works—by constructing small data sets that can be productively analyzed. The table below identifies key differences between small and big data.
| | Small Data | Big Data |
| --- | --- | --- |
| Goals | Answer a specific question or serve a particular goal. | There is a vague goal, but there really is no way to completely specify what the big data resource will contain, or how the various types of data it holds will be organized, connected to other data resources, or usefully analyzed. |
| Location | Typically contained within one institution, often on one computer, sometimes in one file. | Typically spread throughout electronic space, parceled onto multiple Internet servers located anywhere on earth. |
| Data structure and content | Ordinarily contains highly structured data. The data domain is restricted to a single discipline or subdiscipline. The data often comes in the form of uniform records in an ordered spreadsheet. | Must be capable of absorbing unstructured data (e.g., free-text documents, images, motion pictures, sound recordings, physical objects). The subject matter of the resource may cross multiple disciplines, and the individual data objects in the resource may link to data contained in other, seemingly unrelated, big data resources. |
| Data preparation | In many cases, the data user prepares her own data, for her own purposes. | The data comes from many diverse sources and is prepared by many people. The people who use the data are seldom the people who prepared it. |
| Longevity | When the data project ends, the data is kept for a limited time (seldom longer than 7 years, the traditional academic life span for research data) and then discarded. | Big data projects typically contain data that must be stored in perpetuity. Ideally, data stored in a big data resource will be absorbed into another resource when the original resource terminates. Many big data projects extend into the future and the past (e.g., legacy data), accruing data prospectively and retrospectively. |
| Measurement | Typically, the data is measured using one experimental protocol, and the data can be represented using one set of standards. | Many different types of data are delivered in many different electronic formats. Measurements, when present, may be obtained by many different protocols. Verifying the quality of big data is one of the most difficult tasks for data managers. |
| Reproducibility | Projects are typically repeatable. If there is some question about the quality of the data, its reproducibility, or the validity of the conclusions drawn from it, the entire project can be repeated, yielding a new data set. | Replication of a big data project is seldom feasible. In most instances, all that anyone can hope for is that bad data in a big data resource will be found and flagged as such. |
| Stakes | Project costs are limited. Laboratories and institutions can usually recover from the occasional small data failure. | Big data projects can be obscenely expensive. A failed big data effort can lead to bankruptcy, institutional collapse, mass firings, and the sudden disintegration of all the data held in the resource. Though the costs of failure can be high in money, time, and labor, big data failures may have some redeeming value: each failed effort lives on as intellectual remnants consumed by the next big data effort. |
| Introspection | Individual data points are identified by their row and column location within a spreadsheet or database table. If you know the row and column headers, you can find and specify all of the data points contained within. | Unless the big data resource is exceptionally well designed, its contents and organization can be inscrutable, even to the data managers. Complete access to data, information about the data values, and information about the organization of the data is achieved through a technique herein referred to as introspection. |
| Analysis | In most instances, all of the data contained in the project can be analyzed together, all at once. | With few exceptions, such as analyses conducted on supercomputers or in parallel on many computers, big data is ordinarily analyzed in incremental steps. The data are extracted, reviewed, reduced, normalized, transformed, visualized, interpreted, and reanalyzed with different methods. |
Table. General differences that can help distinguish big data and small data.
Jules Berman, Ph.D., M.D., is a freelance author who writes extensively in his three areas of expertise: informatics, computer programming, and cancer biology.