GAITHERSBURG, MD.—The likes of Microsoft, Google and Amazon are just beginning to pay attention to the big data needs of scientific researchers. That was the message delivered by computer scientists from several commercial heavyweights at a workshop on big data held June 13 and 14 at the National Institute of Standards and Technology (NIST).
The tech giants are providing cloud computing resources to researchers, or exploring new projects to support their work. For example, Microsoft is exploring what role it could play in helping academic researchers catalog, share and analyze the reams of data with which they work. The company hosted a March meeting in Bellevue, Wash., with 13 university CIOs to discuss the topic, said Dr. Dennis Gannon, director of cloud research engagements in the technology policy group at Microsoft.
Gannon noted that while some areas of science, such as high-energy physics, are relatively well prepared for the explosion in data, others are not. He talked about Microsoft’s vision of an ecosystem that supports the research community and echoed the term “research IT as a service” used by another workshop participant, Ian Foster, director of the Computation Institute at Argonne National Laboratory. Gannon said he sees Microsoft providing something akin to a “data market,” in which academicians subscribe to particular data collections and use specialized tools to work with that data.
“We’re trying to find a way to test this hypothesis,” he said, adding that Microsoft has had ongoing discussions, like the one in March, about potential pilot projects with universities.
Google is getting “intensely involved with scientific activities” as part of its effort to understand big data in the research market, according to Joseph L. Hellerstein, who manages the Computational Discovery Department at the company. Acknowledging that commercial cloud vendors have been slow to support this area, he said Google is now “learning what’s required to do science in the cloud.”
One vehicle Google is using is a research grant program announced in April. Google Exacycle will award at least 100 million core-hours to each of up to 10 qualified researchers. Awards are expected to be announced starting this month. The program focuses on large-scale batch computations such as genomic search and sky survey image analysis.
Hellerstein said computational discovery needs a standard framework that makes it easy to scale computing resources and data so that scientists can focus on the science and not the programming of distributed systems. Among the other architectural challenges in developing such a framework are the need for advanced data management, including provenance of data, and the ability to interactively explore, analyze and critique the data at scale. Such a framework also needs to support crowd-sourced data curation and the computing power of multiple clouds. Hellerstein described a new capability, called introspective batch processing, that such a framework should enable. It is batch processing on a large scale that nonetheless incorporates interactive feeds, so that researchers can observe the processing of batches and change things dynamically.
Amazon’s cloud computing arm is involved, too. Mark Ryland, chief solutions architect for Amazon Web Services, touted the company’s participation in the National Institutes of Health’s 1000 Genomes Project, part of the Obama Administration’s big data initiative announced in March. Amazon Web Services is providing NIH researchers with free data access as part of the project. “It’s now at 250 terabytes and growing fast,” he said.
Ryland said researchers using the service no longer need massive on-premises storage and compute resources, and said the cloud will enable new types of research collaboration. One example is “executable papers,” which link not only to the data but also to the computational framework of the research. A reader of an academic research paper could “boot the computational resource that [the author] built,” he said. “It’s an incredibly powerful collaboration technique.”
Tam Harbert is a freelance writer based in Washington, D.C. She can be contacted through her website.