First the Data, then the Analytics: Pioneers at deCODE Invented Analytical Tools for Genomic Research

by Stephanie Overby   |   January 3, 2013 12:14 pm

Dr. Hákon Guðbjartsson, deCODE’s director of informatics

When Reykjavik-based deCODE genetics was launched in 1996, its claim to fame was its unique access to a powerful combination of genetic, genealogical and clinical data about Iceland’s largely homogeneous population of 300,000 citizens. What it didn’t have were viable commercial tools to analyze that goldmine of data.

“Very soon after we started we found out that we were not able to solve our needs by using off-the-shelf software,” says Dr. Hákon Guðbjartsson, who has served as deCODE’s director of informatics from the beginning. Available analytics software was geared toward academia, for small projects or specific tasks. There were no systems to handle the scale of deCODE’s ambitions.

“We wanted to harvest all of that data into one database and reuse the information for research on many diseases,” Guðbjartsson says. So as a pioneer in the field of bioinformatics, deCODE took a do-it-yourself approach.

DeCODE’s history is one of analytical innovation: the company has developed some of the best-known tools in genetics research, and it is the force behind important scientific findings such as the discovery, published in 2012, of a genetic component of Alzheimer’s. But such innovation does not guarantee commercial success; after emerging from a 2009 bankruptcy, deCODE reached an agreement in December to become a research subsidiary of biotechnology giant Amgen.

In the early days, deCODE worked with microsatellites, short repeating DNA sequences that people inherit from their parents. But in 2006, the company began working with array-based chips to measure single-nucleotide polymorphisms (SNPs). These single-base differences are the most common type of genetic variation and can serve as biological markers to identify genes associated with specific diseases.

“As we moved to SNPs the sheer number of genetic markers became the issue. We captured close to a million genome-wide SNPs as compared to only 2,000 microsatellites,” says Guðbjartsson. “We’ve come a long way in terms of the tools we are using.”

As the cost to decode an individual’s entire genome began to fall in 2009, the volume of data to be mined expanded further. “That’s hundreds of gigs of sequence read data per individual,” Guðbjartsson says. “We’re working at a scale that’s 40 times larger than when we started with array-based SNP chips, working with over 40 million sequence variants per individual.”

Two Petabytes of Genetic Data
What began with homegrown systems for sample management, a genealogy database, automated encryption processes and an advanced set-definition language for defining patient cohorts has evolved into a focus on high-throughput tools for whole genome analysis.

In its early days, deCODE benefited from open source linkage analysis algorithms like the original GENEHUNTER program, which deCODE’s statisticians improved upon with the Allegro analysis package, developed in 2000 and 30 times faster than its predecessor. Today, the informatics team at deCODE continues to refine its database architecture, systems and analytics to handle the volume of whole genome datasets.

In 2008, deCODE developed an analytics method to extrapolate the results of its whole genome sequencing and apply them to other datasets. The method, called long-range phasing, enables researchers to take the sequence variants found in its database of 3,000 Icelanders with whole genome sequence data and impute those variations to thousands of others who have been measured only by SNPs.

“We impute 40 million sequence variations into the entire genotyped population, very accurately and relatively quickly, allowing us to perform genome-wide association analysis on a very large number of phenotyped individuals,” Guðbjartsson says. The company’s statisticians have taken the process further, using genealogy data to impute those variations into ancestors of individuals in the database for whom there is no genetic data at all.
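The idea of imputation can be illustrated with a deliberately tiny sketch. It is not deCODE's method: real imputation relies on phased haplotypes, probabilistic models and, in deCODE's case, genealogy. The function name, the toy reference panel and the 0/1 allele coding below are all assumptions made for illustration. The sketch picks the fully sequenced reference haplotype that agrees best with the few sites measured on a chip and copies its alleles at the unmeasured sites.

```python
from typing import Dict, List

def impute_from_reference(chip_alleles: Dict[int, int],
                          reference: List[List[int]]) -> List[int]:
    """Copy the reference haplotype that best matches the chip-measured sites.

    chip_alleles maps a site index to the allele (0 or 1) seen on the chip.
    reference is a hypothetical panel of fully sequenced haplotypes.
    """
    if not reference:
        raise ValueError("reference panel is empty")

    def mismatches(hap: List[int]) -> int:
        # Count chip sites where this reference haplotype disagrees.
        return sum(1 for site, allele in chip_alleles.items() if hap[site] != allele)

    # min() keeps the first haplotype in case of ties.
    best = min(reference, key=mismatches)
    return list(best)

# Toy panel: three sequenced haplotypes over five variant sites.
panel = [
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
]
# The chip measured only sites 0, 2 and 4 for this person.
imputed = impute_from_reference({0: 1, 2: 0, 4: 0}, panel)
print(imputed)  # sites 1 and 3 are filled in from the best-matching haplotype
```

In this toy case the second haplotype matches all three measured sites, so its alleles at the two unmeasured sites are carried over.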

Now Guðbjartsson’s informatics team is developing a process for harmonizing and standardizing whole genome sequence data. “There is a myriad of different data formats and specific tools to deal with sequence data so you end up building a lot of software glue for converting data from one format to the other,” says Guðbjartsson. “We’re trying to build a standard data architecture that allows you to use the same tools for most of the analysis that we do.”

The system is compatible with the BAM format, which emerged from the 1000 Genomes Project as the de facto standard for storing sequence data. “Conventional commercial relational databases do not scale well enough for this [work] and their query tools do not provide all the necessary functions needed for scientists to work with this data,” says Guðbjartsson. DeCODE is building what it calls its Genomic Ordered Relations (GOR) architecture, geared specifically to this kind of data, along with declarative query tools that let researchers say what they want to do rather than how to do it, much as SQL queries do.
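One reason an architecture built around genomic order can help is that tables sorted by chromosome and position can be joined in a single streaming pass, without indexes or random lookups. The sketch below is not GOR and does not use its query syntax. It is a plain merge join over two position-sorted lists, with hypothetical names and toy rows, assuming both inputs are sorted the same way and that the regions do not overlap.

```python
from typing import List, Tuple

Variant = Tuple[str, int, str]        # (chromosome, position, change)
Region = Tuple[str, int, int, str]    # (chromosome, start, end, gene), inclusive

def variants_in_regions(variants: List[Variant],
                        regions: List[Region]) -> List[Tuple[str, int, str, str]]:
    """Annotate each variant with the region containing it, in one pass.

    Assumes both lists are sorted by (chromosome, position) using the same
    chromosome ordering, and that regions do not overlap.
    """
    out = []
    i = 0
    for chrom, pos, change in variants:
        # Skip regions that end before the current variant; they can never match again.
        while i < len(regions) and (regions[i][0], regions[i][2]) < (chrom, pos):
            i += 1
        if i < len(regions):
            r_chrom, start, end, gene = regions[i]
            if r_chrom == chrom and start <= pos <= end:
                out.append((chrom, pos, change, gene))
    return out

# Hypothetical toy data.
variants = [("chr1", 100, "A>G"), ("chr1", 250, "C>T"), ("chr2", 50, "G>A")]
regions = [("chr1", 200, 300, "GENE_A"), ("chr2", 10, 60, "GENE_B")]
hits = variants_in_regions(variants, regions)
for row in hits:
    print(row)
```

Because neither list is ever rewound, the work grows with the combined length of the two inputs, which matters when one side holds tens of millions of variants.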

While deCODE did provide its SNP analysis tools to customers as part of its genotyping service, the company is just beginning to independently commercialize its whole genome sequencing analytics tools.  “The data issues with SNP analysis have largely been solved by academic and other commercial software,” says Guðbjartsson. “But now we’re seeing those same data challenges in even greater magnitude. They’ve solved the issue of how to sequence entire genomes, but they’re playing catch up with informatics tools that can mine those datasets efficiently.” DeCODE also sells a variety of reference laboratory DNA tests for assessing inherited risk factors for diseases such as breast and prostate cancer, glaucoma, and heart disease.

The Business of Genetics Research
DeCODE does have some competition, mostly in the form of start-ups offering cloud-based solutions for the visualization, analysis and storage of genomics data, like DNAnexus and Ingenuity. But “so far we have not seen the scalability and flexibility of our solution matched. We’ve got a lot more experience working with data at this scale,” says Guðbjartsson, pointing to the two petabytes deCODE has in storage.

Dr. Richard Durbin, joint head of Human Genetics at the Wellcome Trust Sanger Institute and leader of the Genome Informatics group, says there is little visibility into the analytics platform deCODE has developed, but the impression in the genomics field is that the tools are well-engineered with an intelligent underlying structure. “The thing that I envy deCODE for is its focus on its primary task, which is representation of the genetic diversity of the Icelandic population, and correlation of that to phenotypic diversity,” says Durbin. “It has built a strong data processing structure around that.”

Of course, deCODE’s primary business is not data processing or software development but genetic research. (The company gave up its original drug discovery strategy after emerging from bankruptcy in 2010, and said it will maintain its research focus when Amgen announced it would purchase deCODE for $415 million.)

The company has made some big headlines in scientific journals, most notably by publishing its finding of a rare variant in a known Alzheimer’s gene that protects the individual against the disease. “Alzheimer’s is just one example. We have a database we routinely mine for 1,500 different clinical traits against sequence variants in our population,” says Guðbjartsson. “It’s hard to pinpoint specific research and connect to a particular tool we’ve developed. It’s all part of a collective effort.”

If Guðbjartsson is reluctant to take some credit, it may be because he has his hands full. “Streamlining our analysis to take the best statistical hits out of our pipeline is a challenge in itself. We’re dealing with 40 million variants times 1,500 clinical traits times eight statistical models on each,” he says. “Our input data set is huge, but the output is big as well. We have to make sure it’s readily available to our scientists, continuing to build tools to make the data easy to filter, compare and visualize.”
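To put those figures in perspective, the multiplication Guðbjartsson describes can be worked through directly. The three counts come from his quote; the 8 bytes per stored result is purely an assumption for illustration, not a figure from deCODE.

```python
# Counts taken from the quote above.
variants = 40_000_000
clinical_traits = 1_500
statistical_models = 8

# One association test per variant, per trait, per model.
total_tests = variants * clinical_traits * statistical_models
print(f"{total_tests:,} association tests")

# Hypothetical storage cost: assume 8 bytes (one double) per test result.
ASSUMED_BYTES_PER_RESULT = 8
output_terabytes = total_tests * ASSUMED_BYTES_PER_RESULT / 1e12
print(f"about {output_terabytes:.2f} TB if each result took 8 bytes")
```

That works out to 480 billion tests, which is why the output of the pipeline is itself a large data set that has to be filtered before scientists can use it.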

Stephanie Overby is a Boston-based freelance writer. Follow her on Twitter: @stephanieoverby.
