GAITHERSBURG, Md.—Scientific research already uses very big data sets, but the field needs to develop a framework and standards to take major inquiries to new heights of collaboration and discovery, according to researchers convening at the National Institute of Standards and Technology (NIST).
Indeed, the reason NIST hosted the June 13 workshop was to gather information on what’s needed to create interoperability standards for big data, said Ashit Talukder, chief of NIST’s Information Access Division. Scientists at the event discussed several challenges researchers must overcome, including: providing for the clear provenance of data, to ensure researchers’ source materials are reliable; standards for exchanging data from multiple domains, to enable their combination and analysis; and security measures to allow for research while protecting sensitive data.
Although individual research organizations are using their own data sources, “the real breakthroughs will come when we are able to use someone else’s data,” said Howard Wactlar of the National Science Foundation. Wactlar is division director of information and intelligent systems in NSF’s Computer Information Science and Engineering Directorate.
One of the biggest problems facing standards developers is how to provide data provenance. Researchers need proof that the data is reliable. “Do we know where the data comes from and can we document the history of the data?” asked Neal Ziring, technical advisor to the Information Assurance Directorate of the National Security Agency. Another challenge is how to combine data from different sources and multiple domains so that it can be analyzed together, he noted.
If a standardized framework is developed, it would usher in “process automation for science,” said Ian Foster, professor of computer science at the University of Chicago and director of the Computation Institute, jointly run by Argonne National Laboratory and the university. Third-party organizations could offer “research IT as a service,” providing access to cloud-based stores of data and computation, allowing researchers to outsource the computing and data access functions and concentrate on the research itself.
Groups have already begun emerging to meet the specific needs of big data for science, which differ from those of commercial industry and aren’t given priority by commercial cloud providers. For example, moving huge data files from one location to another typically takes a long time, which slows the research process. Globus Online is an initiative by the Computation Institute that specializes in fast transfer and storage of such data. The free service was launched 18 months ago and now has 5,000 customers, including Blue Waters, an NSF-funded project on sustained petascale computing at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign.
Robert Grossman, a biology professor and director of informatics at the Institute for Genomics and Systems Biology at the University of Chicago, described a nonprofit group called the Open Science Data Cloud, essentially cloud services for researchers operated by the Open Cloud Consortium. The project provides “a way to begin to think about what kind of science we can do if we have” distributed cloud services for science, he said.
A similar concept is the virtual observatory, which was the topic of a presentation by Peter Fox, a professor of earth and environmental science and computer science at Rensselaer Polytechnic Institute. Fox is the principal investigator for the Virtual Solar-Terrestrial Observatory, a computing environment that allows researchers to analyze data drawn from diverse archives in the fields of solar, solar-terrestrial, and space physics.
But the idea is being expanded beyond astronomy. “The premise is that anyone should be able to access a global, distributed knowledge base of scientific data” just as if it were integrated and stored locally, said Fox. Standards are needed to define how to exchange data and how to mediate data from different sources so it can be analyzed, he added.
If and when these problems can be solved, the result could be academic research on steroids.
Interoperability would accelerate discovery and innovation, said Foster. Having “a big process for big science” in place, giving researchers worldwide access to huge stores of data, could dramatically reduce both the time and the costs of conducting research.
The academic world is excited by the prospects. Said Fox: “A new culture [is emerging] in how we do science in the age of big data.”
Tam Harbert is a freelance writer based in Washington, D.C. She can be contacted through her website.