Expanding access to biomedical research by sharing datasets on Google Cloud
The Knowledge Platform consists of the public-facing Knowledge Portal for user access and resources, searchable datasets in Google Cloud Storage and BigQuery, and Terra workspaces with their own built-in datasets and tools to help researchers get started quickly on data analysis. Debra Babcock, M.D., Ph.D., Program Director for Systems and Cognitive Neuroscience at the NINDS and co-chair of AMP PD’s Steering Committee, reports that “the development process has been remarkably smooth and has proceeded rapidly.” Eline Appelmans, M.D., MPH, Project Manager of Neuroscience Research Partnerships at the FNIH, adds that the Platform has been immediately popular, with close to 2,000 users in its first two months.
AMP PD’s Knowledge Platform provides researchers with easy access to large datasets from four participating cohorts: the Michael J. Fox Foundation (MJFF) and NINDS BioFIND; the Brigham and Women’s Hospital and Massachusetts General Hospital Harvard Biomarkers Study; the NINDS Parkinson’s Disease Biomarkers Program; and MJFF’s Parkinson’s Progression Markers Initiative. According to Dr. Appelmans, the Platform already contains the clinical records of 4,298 PD and control participants, along with 8,356 RNA samples and 3,941 whole genome sequencing (WGS) samples. Storing these huge datasets on Google Cloud allows AMP PD to offer researchers convenient access through their own Google accounts. They can query the data in BigQuery and run Compute Engine virtual machine (VM) instances on demand, paying only for the storage and compute they actually use.
Dr. Babcock says there are several benefits to their approach: “A lot of PD studies are done in small samples. But some of the genes involved are rare variants and you really need large samples for subtle differences to be obvious. Increasing sample sizes is key. Also, many studies are not longitudinal. The datasets we bring together are collected longitudinally so we can see how the disease progresses, which is an important problem for Parkinson’s disease. Our datasets also include multiple data types from the same patients, which is very unusual. Many studies look at only genomics or clinical information and we provide both.”