Toronto researchers build machine learning tool with Google Cloud to track COVID-19 genomic data

A team from Vector Institute in Canada built a web-based visualization tool to share and analyze COVID-19 genomic sequences. By pooling data and resources, they hope to better understand the disease’s evolution and progress.

When the COVID-19 pandemic hit early in 2020, Vector Institute in Toronto had to close its labs and send its students and faculty to work at home. Dr. Bo Wang, Lead Artificial Intelligence Scientist at the University Health Network and Faculty Member at Vector, redirected his team to prioritize urgent COVID-19 research. Ph.D. student Hassaan Maan, who works in Dr. Wang’s lab on machine learning for healthcare, wanted to help in the global efforts to combat the pandemic’s impact, too. He had an idea: a web-based visualization tool to process public COVID-19 viral genome data.

With Dr. Samira Mubareka of the Sunnybrook Health Sciences Centre and Dr. Andrew McArthur, associate professor in the department of biochemistry and biomedical sciences and director of the biomedical discovery & commercialization program at McMaster University, Maan developed the COVID-19 Genotyping Tool (CGT). The application provides insights into transmission pathways, outbreak epicenters, and key viral mutations. It allows users to upload viral genome data from patients anywhere in the world and analyze it in real time. “In doing so,” says Wang, “they can determine the context of local events with respect to the global picture, and help shape local health policy and alert the community to any key changes in viral evolution.”

With the total number of publicly posted SARS-CoV-2 genomes rapidly approaching 100,000, the COVID-19 Genotyping Tool (CGT) is proving to be an invaluable resource for rapidly visualizing and tracking viral genomes worldwide.

Dr. Terrance Snutch, professor at the Michael Smith Laboratories at the University of British Columbia and chair of the Canadian COVID-19 Genomics Network

Intuitive deployment within a week

Maan started developing the app in the R-Shiny framework because he was already familiar with it. Still, he needed a place to deploy the app and its data, one that would scale as the number of users and uploads grew, since genome sequences require massive data processing. His solution: Compute Engine on Google Cloud. “Google Cloud offers elastic deployment and is optimized for containers in Docker,” he says. “I had never developed a tool like this, but Google Cloud had intuitive guides and docs. I deployed the app within a week.” Now the team is working on implementing larger batch uploads.

Maan sees two main benefits for researchers using CGT: “First, it helps track the evolution of the virus’ mutations. Most mutations may be harmless or synonymous, but some variations in the genome could change how the disease is treated and transmitted. That kind of data surveillance is very important for predicting new outbreaks. Second, it’s also important to track the virus’ transmission backwards. By tracing the origins of a cluster of cases, we know more about how it spreads. This tool lets us ask new questions in new ways.”
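To make the distinction Maan draws between synonymous and other mutations concrete, here is a minimal Python sketch. It is not CGT's code or method. It assumes two short, already aligned, in-frame coding sequences of equal length, and the function name `classify_mutations` and the toy sequences are invented for illustration. A synonymous mutation changes a codon without changing the amino acid it encodes; a nonsynonymous one changes the amino acid.

```python
# Illustrative only: not part of CGT. Assumes aligned, in-frame coding
# sequences made up of the letters A, C, G, T.

BASES = "TCAG"
# Standard genetic code, listed in TCAG order ("*" marks a stop codon).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def classify_mutations(reference, sample):
    """Compare two aligned coding sequences codon by codon.

    Returns a list of tuples:
    (codon_number, ref_codon, sample_codon, ref_aa, sample_aa, kind)
    where kind is "synonymous" or "nonsynonymous".
    """
    if len(reference) != len(sample) or len(reference) % 3 != 0:
        raise ValueError("sequences must be the same length and in frame")

    changes = []
    for start in range(0, len(reference), 3):
        ref_codon = reference[start:start + 3]
        alt_codon = sample[start:start + 3]
        if ref_codon == alt_codon:
            continue  # no mutation in this codon
        ref_aa = CODON_TABLE[ref_codon]
        alt_aa = CODON_TABLE[alt_codon]
        kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
        changes.append((start // 3 + 1, ref_codon, alt_codon, ref_aa, alt_aa, kind))
    return changes


# Toy example (hypothetical sequences, three codons each).
reference = "ATGGATGGT"  # Met-Asp-Gly
sample = "ATGGGTGGC"     # Met-Gly-Gly
for change in classify_mutations(reference, sample):
    print(change)
```

In this toy pair, the second codon changes the encoded amino acid (aspartate to glycine), while the third codon changes a nucleotide but still encodes glycine. Real surveillance pipelines work on full genomes and need an alignment step first, which this sketch leaves out.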

Making COVID-19 data accessible through machine learning

The Vector team made CGT publicly available on a website in June 2020, and the tool has averaged about 10,000 new genomes uploaded every week since. “With the total number of publicly posted SARS-CoV-2 genomes rapidly approaching 100,000, CGT is proving to be an invaluable resource for rapidly visualizing and tracking viral genomes worldwide,” says Dr. Terrance Snutch, professor at the Michael Smith Laboratories at the University of British Columbia and chair of the Canadian COVID-19 Genomics Network. “As the pandemic progresses into autumn and schools begin to reopen, the tool can be a critical component of genotyping efforts, carried out in smaller communities dealing with localized outbreaks.”

For Maan, the project has exciting implications for responding to the global pandemic: “The app allows researchers to sift through genetic information and find potential patterns of transmission on a broad scale. For example, getting travel histories from every COVID-19 patient has been uneven. CGT can help guide public health policy and inform travel restrictions. When genomic sequencing of the virus picks up, it will be even more useful.”

Both Maan and McArthur received Google Cloud research credits through Dr. Wang’s lab for this COVID-19-related project. If you’re interested in accessing complimentary credits to drive your own research, Google is funding projects ranging from modeling the COVID-19 outbreak to predicting sepsis and discovering new planets.

What researchers are saying:

I had never developed a tool like this, but Google Cloud had intuitive guides and docs. I deployed the app within a week.

Hassaan Maan, Ph.D. candidate, University Health Network, Toronto