In the past ten years, researchers have made unprecedented progress in the field of genomics. In 2008, the 1000 Genomes Project aimed to document a representative global sample of whole genome sequences from one thousand participants. The project took seven years to complete, providing deep sequencing for the largest public collection of genomic data at that time. Fast forward to today: the National Heart, Lung, and Blood Institute’s Trans-Omics for Precision Medicine (TOPMed) project has already sequenced 90,000 genomes in its first few years and will soon reach 150,000 genomes. The nationwide All of Us project has the once-unimaginable goal of collecting and analyzing genetic and clinical data from one million participants. Jonathon LeFaive, Senior App Programmer/Analyst in the Department of Biostatistics at the University of Michigan (UM), explains that “we’re starting to hit data sizes we’ve never seen before... and cloud computing makes that possible.”
Google Cloud helps the University of Michigan’s TOPMed team process and share genomics data faster
By using Google Compute Engine to catalog and store 150,000 whole genome sequences alongside clinical data about patient health, environments, and lifestyles, a team at the University of Michigan can enable insights into the diagnosis and treatment of disease.
"The NIH Data Commons and STRIDES are an effort to open up the black box of data so we get to the science faster. Instead of sending data to the scientists we’re bringing the scientists to the data."
Jonathon LeFaive, Senior App Programmer/Analyst, Department of Biostatistics at the University of Michigan
Using containers allows other researchers to reproduce results
UM serves as the Informatics Research Core (IRC) for collating sequence reads, variant call sets, and analytic methods for TOPMed, a precision medicine initiative supporting research into heart, lung, blood, and sleep disorders. The TOPMed IRC at UM has already generated three petabytes of highly compressed data, which would take thousands of core years to analyze with traditional high-performance computing (HPC). UM needed an alternative that would be fast, scalable, and cost-effective. As part of a consortium of around forty TOPMed studies, the team also needed one that would facilitate collaboration and data access.
So they turned to Google Cloud for a solution that was both highly customizable and modular for easy reproducibility. Google Cloud’s support for containers was key: “Scientists love containers,” LeFaive reports, “not just because it’s easy to deploy complicated pipelines but because it makes it easier for peers to reproduce results, and reproducibility is critical in science.” To minimize costs, the team also used Google Compute Engine’s preemptible VMs and divided their jobs into smaller batches, so that each job could be tuned individually and an interrupted batch would cost little to rerun. It worked. “With seemingly limitless resources on Google Compute Engine we can compress years of data prep into a few months,” LeFaive says. “That means researchers can start analyzing the data and making discoveries much sooner.”
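The batching idea can be illustrated with a minimal Python sketch. It is not the TOPMed team's pipeline: the `Preempted` exception, the `run_batch` callback, and the sample names are hypothetical stand-ins, and no Google Cloud API is called. It only shows why small batches suit preemptible VMs: when a VM is reclaimed, only one small batch has to be queued again.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple


class Preempted(Exception):
    """Hypothetical signal that a worker VM was reclaimed mid-batch."""


def make_batches(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split a list of work items into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def run_with_requeue(
    batches: Sequence[List[str]],
    run_batch: Callable[[List[str]], None],
    max_attempts: int = 3,
) -> List[str]:
    """Run every batch; a preempted batch goes to the back of the queue.

    Returns the items in the order their batches completed. Re-raises
    Preempted if a single batch is interrupted max_attempts times.
    """
    queue: Deque[Tuple[List[str], int]] = deque((batch, 1) for batch in batches)
    completed: List[str] = []
    while queue:
        batch, attempt = queue.popleft()
        try:
            run_batch(batch)
        except Preempted:
            if attempt >= max_attempts:
                raise
            queue.append((batch, attempt + 1))  # only this small batch is redone
        else:
            completed.extend(batch)
    return completed


if __name__ == "__main__":
    samples = [f"sample_{i}" for i in range(1, 6)]  # made-up sample names
    state = {"calls": 0}

    def flaky_runner(batch: List[str]) -> None:
        # Simulate one preemption on the very first call, then succeed.
        state["calls"] += 1
        if state["calls"] == 1:
            raise Preempted()

    print(run_with_requeue(make_batches(samples, 2), flaky_runner))
```

The batch size is the tuning knob in this sketch: smaller batches lose less work per interruption but add scheduling overhead, so the right value depends on how long each item takes and how often VMs are reclaimed.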
Google Cloud broadens access to urgently needed data
Many common medical conditions have complex causes: high cholesterol, for example, is affected by inherited genetic traits, individual lifestyle choices, and demographic factors. By combining genomic research, clinical data like personal histories, and environmental factors, the field of precision medicine can help customize treatments and predict outcomes for individual patients. But while commercial companies offer DNA microarrays to interrogate selections of a genome and its common variations, these simplified sequence analyses don’t offer the level of detail needed for some medical research, especially in the case of rare disorders. By collecting and analyzing “deep sequences” of whole genomes against a baseline reference, researchers can catalog and track variants and provide the level of detail necessary to validate results. Reading each position of the genome many times, known as sequencing “depth,” helps ensure that what researchers find isn’t the result of random errors. Identifying rare mutations is especially important: collating rare mutations against each other helps to explain their causes and can help target specific treatment options. LeFaive says, “TOPMed’s whole genome sequencing has already identified 600 million variants, providing 600 million potential avenues of discovery, of which 40% are extremely rare and have likely never been seen before. Only with this level of information can researchers study how rare mutations are associated with disease. By looking at a patient’s DNA it may be possible for doctors to determine whether a treatment would be effective for their patient.”
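To make the point about depth concrete, here is a rough Python sketch of why repeated reads guard against random errors. It rests on simplifying assumptions that are not from the TOPMed project: each read is treated as an independent trial with the same per-base error rate, and every error is assumed to produce the same wrong base, which overstates the chance of a false signal. The function name and the example numbers (30 reads, a 1% error rate) are illustrative only, and real variant callers use far more detailed statistical models.

```python
from math import comb


def prob_at_least_k_errors(depth: int, k: int, error_rate: float) -> float:
    """Probability that at least k of `depth` reads at one site are erroneous.

    Assumes independent reads with the same per-base error rate, i.e. a
    binomial model; this is a simplification for illustration.
    """
    if not 0 <= k <= depth:
        raise ValueError("k must be between 0 and depth")
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return sum(
        comb(depth, i) * error_rate**i * (1.0 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


if __name__ == "__main__":
    depth, error_rate = 30, 0.01  # illustrative values, not TOPMed parameters
    for k in (1, 3, 5):
        p = prob_at_least_k_errors(depth, k, error_rate)
        print(f"at least {k} erroneous reads out of {depth}: {p:.3g}")
```

Under this model, a single stray error somewhere in the stack of reads is fairly likely, but several reads all showing an error at the same site by chance is far less likely. That gap is what lets deep sequencing separate genuine rare variants from noise.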
STRIDES, Google Cloud’s new partnership with the National Institutes of Health to host public datasets on Google Cloud, will soon facilitate even more collaboration and shared resources to advance breakthroughs. Google Cloud storage is already compatible with the NIH Data Commons platform, which can customize access to each dataset to meet donor terms of consent as well as federal privacy regulations. According to LeFaive, cloud computing and collaborations will accelerate research and help level the playing field for researchers everywhere: “Working with Google Cloud makes it easier for all the TOPMed contributors, even small labs, to get immediate access to data to do the analyses much more quickly. The NIH Data Commons and STRIDES are an effort to open up the black box of data so we get to the science faster. Instead of sending data to the scientists we’re bringing the scientists to the data.”
"With the seemingly limitless resources on Google Compute Engine we can compress years of data prep into a few months. That means researchers can start analyzing the data and making discoveries much sooner."
Jonathon LeFaive, Senior App Programmer/Analyst, Department of Biostatistics at the University of Michigan