BOSTON – Researchers in cutting-edge medical fields like neuroscience and genomics are bumping into big data problems that most businesses will never encounter. The human genome, for example, is so complex that it takes enormous datasets to describe it, yet scientists are nowhere near understanding its makeup well enough to analyze all the data they collect. At the same time, the field advances at such a rapid pace that the results of experiments—and the tools used to run them—risk becoming obsolete by the time they're complete.
By comparison, supply chain or marketing data holds very few discoveries as groundbreaking as those hidden in the human genome.
But behind the complex human biology, the heads of research centers do encounter issues more familiar to enterprise IT and business executives: changing the culture to accept data sharing, using visualizations as a tool to aid communication and further discovery, and planning for future cloud adoption.
At a panel discussion hosted by the Massachusetts Technology Leadership Council on Feb. 15, three medical research and bioinformatics experts at Boston-area institutions discussed the challenges they face in collecting, storing, and analyzing medical data.
Think you have trouble finding insights in the wealth of your data? These medical researchers have big data problems that would make your hair curl, then fall out.
Michele Clamp, Harvard’s interim director of research computing and director of informatics and scientific applications, can’t tune her system to run the cutting-edge algorithms for analyzing thousands of human genomes.
Her infrastructure budget can't keep up with the amount of data collected; Harvard is already storing 15 petabytes of genomics data.
“We’re hitting the limits,” Clamp said. “We’re getting to the point where we can’t get the data off the disk fast enough to keep the CPUs spinning.”
Matthew Trunnell, CIO at the Broad Institute, a joint research center between Harvard and MIT, has the same problem. He shares another with Clamp: a new algorithm for crunching genomics data comes out every six months, and many are so complex that running them in a parallel environment yields no efficiency gain. Hadoop hasn't proven to be a successful way to crunch the data, either. The datasets are so complex that it's difficult to program the long string of MapReduce jobs fast enough to be useful; it's only a matter of time before a newer discovery comes along and makes the work obsolete.
Peter Bergethon, the head of neuroinformatics and computational neurology at Pfizer Inc.’s research unit in Cambridge, Mass., has his own set of problems. The biologists researching the brain create models so complex it drives the engineers crazy; the engineers simplify the models so much in order to compute the problems that it introduces errors into the system. In the brain drug business, that won’t work.
This is Not Your Parents’ Moon Shot
The real issue, Bergethon said, is that these fields of advanced research are probing the limits of scientists' understanding of how humans are built and how the brain works. These aren't problems solved by simply throwing more computational power at them; researchers are still working out what the collected data actually means before they can analyze it and put it to use.
“People say we should just treat it like the Moon Shot, and throw more money at it,” Bergethon said. “But we knew how to go to the moon. Newton figured it out 300 years before we did it. We still don’t understand how the brain works.”
New ways to capture data are still being invented, so new data types are constantly being created, Bergethon added. There is no standard format for clinical data or genome data, and there is so much variety among datasets that a comprehensive data model will break months after it's created.
Cultural Challenges, and the Potential of the Cloud
There are, however, some more common problems that members of the panel have had to overcome. Foremost among them: convincing the biologists working for them to make their data available to colleagues, in a format that is actually helpful.
Trunnell said the biologists have a history of being imprecise because the science they are doing can be somewhat messy. And biology is a discipline where people are “scared to hell of numbers, and that includes zero and one.”
“There is a cultural component to this problem,” he said. “There just isn’t a perceived culture of value in sharing your data, or explaining your data to somebody else.”
Bergethon said he runs into a related problem at Pfizer: his biologists aren't comfortable with the quantitative math needed for some of the computer modeling, and the computer engineers aren't comfortable with the biology required for the richest models. Efforts to cross-train current staff, or to find someone fully trained in both disciplines, are "expensive," he said.
Clamp said she’s considered using the cloud to store and compute her data; it’s too costly for her around-the-clock operation at Harvard now, but in five years it could be possible, she said. Trunnell would like to start replicating data and putting it into the cloud on a regular basis, so at some point it would be available to use for experiments without taxing their current computing system. That’s an effort that’s down the road, too.
Bergethon said he currently has access to cloud resources, but has to be careful with privacy concerns for much of his data.
Home page image of test tubes by Wikipedia user Anna Loskutova.