The international 1000 Genomes Project, a government-backed initiative that aims to sequence the entire genome of 2,600 people from around the world, is on track to reach this goal by the end of the year, continuing to grow the world’s largest dataset on human genetic variation.
Since March, when the National Institutes of Health (NIH) made DNA sequence data from 1,700 people publicly available to researchers on the Amazon Web Services (AWS) cloud, roughly 50 additional individual samples have gone through initial sequencing, according to scientists with the National Center for Biotechnology Information (NCBI), a component of NIH.
The NIH and AWS jointly announced in a March 29 press release that the project’s Phase 1 data was available for cloud computing. In the months since, the 1000 Genomes Project has demonstrated its potential by, for example, enabling researchers to validate a scientific discovery related to Alzheimer’s disease.
“Right now we’re in the middle of collecting and analyzing Phase 2 data, which will bring us up to 1,800 individuals,” said Stephen Sherry, chief of the NCBI reference collections section. Phase 3 includes the final set of populations that will be collected and sequenced over the next several months, Sherry said. The target for Phase 3 completion is the end of 2012.
Once complete, the 1000 Genomes Project dataset will contain data from 2,600 people from various population groups around the world: roughly 500 individuals from each of five continental regions, with five population subsets in each region. The goal is to produce a catalog of genetic variants present in 1 percent or more of the populations studied, and to make the data publicly available on the cloud, where it can be retrieved quickly and inexpensively, to accelerate the rate of medical discoveries.
Researchers accessing the 1000 Genomes Project data on the cloud benefit by not having to download the data, as well as by being able to run their analyses on multiple servers simultaneously. Practical uses have already been realized, including the ability to quickly confirm independent research studies, as well as the development of an application designed to focus on a specific type of genetic variant.
For example, Sherry said that by using the 1000 Genomes Project data on the AWS cloud he was able to independently confirm the recent discovery of a rare genetic mutation that may offer protection against Alzheimer’s disease. The discovery, announced July 11 in the journal Nature, was based on coding-variant data from the genomes of 1,795 Icelanders.
Sherry said that he was able to confirm the study’s results overnight at a fraction of the cost that someone would incur by running the same experiment without cloud access. His purpose in doing so was to check the accuracy of the 1000 Genomes Project data by cross-referencing it with the Icelandic study.
The mutation, Sherry said, wasn’t reported in the 1000 Genomes Project data even though that data includes DNA sequences from Icelanders. To check whether the 1000 Genomes Project data had simply missed the variant, Sherry said that his team “had to go back through all this alignment data at that one position on the chromosome and ask how many people had normal A and how many people had mutant A. We didn’t miss it, we don’t have a single observation of that mutant A in all 1,700 people (of 1000 Genomes Project samples). It’s truly a rare variant.”
Without cloud access, Sherry’s team would have needed several months to get a copy of the data used in the study, excluding the time spent running the analysis.
“On a Thursday afternoon I talked to one of my programmers. They went back in that night and scanned the data overnight and Friday morning I had my answer,” Sherry said. “Anyone on the cloud could do the same thing. You can scan the data for any position of interest overnight.
“It’s state-of-the-art in human population genetics as a community resource,” Sherry added.
Don Preuss, head of the NCBI systems group, said that any researcher wishing to replicate Sherry’s experiment could do so by spending roughly $100 for the compute time of running the analysis.
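The kind of check Sherry describes, counting how many times each allele is observed at a single position and comparing that with the project's 1 percent threshold, can be illustrated with a minimal Python sketch. This is not the NCBI team's code: the sample names, the genotype calls and the helper functions below are all hypothetical, and a real scan would read the alignment data on the cloud rather than a small in-memory dictionary.

```python
from collections import Counter

# Hypothetical genotype calls at one genomic position.
# Each sample carries two alleles; names and bases are invented for illustration.
genotypes = {
    "sample_1": ("G", "G"),
    "sample_2": ("G", "G"),
    "sample_3": ("G", "A"),
    "sample_4": ("G", "G"),
}


def count_alleles(calls):
    """Tally every allele observed across all samples at one position."""
    counts = Counter()
    for alleles in calls.values():
        counts.update(alleles)
    return counts


def allele_frequency(counts, allele):
    """Share of all observed alleles that match the given allele."""
    total = sum(counts.values())
    return counts[allele] / total if total else 0.0


def is_common(counts, allele, threshold=0.01):
    """True if the allele meets the frequency threshold (1 percent by default)."""
    return allele_frequency(counts, allele) >= threshold


counts = count_alleles(genotypes)
for allele in ("G", "A", "T"):
    print(allele, counts[allele], allele_frequency(counts, allele), is_common(counts, allele))
```

An allele with a count of zero, like "T" in this made-up example, corresponds to the situation Sherry reported: not a single observation of the variant among the samples examined.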
A Cloud Collaboration Channel for Genetics Researchers
In addition to this type of cloud use, collaborative analysis among researchers is also made possible through pipeline applications that simplify complex computer problems. One cloud-supported resource is the Variant Annotation Tool (VAT), which was designed to analyze a specific subset of the 1000 Genomes Project data. VAT’s development was announced in June by Gerstein Labs. “It’s something we developed within the context of 1000 Genomes to functionally annotate the most deleterious variants, those that essentially knock out a protein,” said Mark Gerstein of Yale University. Gerstein is the Albert Williams Professor of biomedical informatics, molecular biophysics and biochemistry and computer science at Yale.
The reasons for a variant, Gerstein explained, are not always obvious. VAT is intended to categorize and summarize the reasons for a particular variant.
After starting to develop VAT for use on a local server, Gerstein said that his team quickly realized the benefits of making the resource available on the cloud and began to develop a cloud-enabled prototype.
“The vision is that you’d have all this 1000 Genomes data and it would sit in the cloud and never come out. And then you’d have all these pipelines or chained group of analyses put together that would run on this data and they would produce subsidiary files that would then reside on the cloud and you would run them. That’s kind of what VAT is designed for,” Gerstein said. “It’s obviously not designed to do all the potential analyses you could do with 1000 Genomes data, but it’s made to really play nicely within these workflows or pipelines. In practice we’d have some use of it in the cloud. A lot of the computing still is people just downloading VAT and running it on a big cluster.”
Gerstein said that he anticipates more use of VAT on the cloud as data from the 1000 Genomes Project continue to be added.
Sherry, of the National Center for Biotechnology Information, said VAT is an important technological tool that serves as an example of how cloud computing has the power to accelerate biomedical advancements.
“The VAT is important because this is the thing most geneticists are interested in,” he said. “What the pipeline does if it’s good, that’s an important service because diagnostic labs and research labs or non-specialists can use this as a way to take their own sequencing data and analyze it with the same tools that were used on 1000 Genomes. It’s been developed and verified on 1000 Genomes data that’s been put on the cloud. You can run it on 1000 Genomes data and you can also run it on your own data. That’s the real power.”
As with Sherry’s experiment running the data from the Icelander study in a cloud environment, pipelines like VAT help accelerate the pace of discovery because researchers and scientists can independently verify the results of a published paper with easily accessed public data before applying the pipeline to their own data.
This gives a researcher “confidence that you’re running the software in a properly configured way that will make the message more stable and easier to communicate in papers explaining how things were done,” Sherry said. “There’s a whole data flow analysis environment packaged and ready to go that other people can use. And that’s very appealing to scientists.”
Ken Murphy is a freelance writer based in the Boston area.