Stanford’s School of Medicine reimagines its clinical data warehouse with Google Cloud

Stanford unlocks insights from its petabyte-scale hospital data and offers seamless access to its electronic health records. With Google Cloud, Stanford expands its compute capacity and easily enables services like consultation, training, and support.

The December 2019 launch of STARR-OMOP, Stanford School of Medicine’s (SoM) next-generation analytical clinical data warehouse, was the culmination of a three-year journey to push the frontiers of artificial intelligence in medicine. STARR-OMOP stores about seven terabytes of Stanford Electronic Health Record data from its two hospitals in an Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). By standardizing this huge archive and making it secure, private, and easier to access, the data warehouse enables faster and better data science. In October 2020, a COVID-19 network study published in Nature Communications became the first peer-reviewed article based on STARR-OMOP.

Stanford researchers access STARR-OMOP through Nero, a secure, HIPAA-compliant internal data platform for scientific research that integrates with Google Cloud’s enterprise data warehouse, BigQuery. With BigQuery, customers can analyze petabyte-scale data immediately, and their data analysts can run queries on that data with zero operational overhead. “Cohort queries run ten to one hundred times faster on BigQuery when compared to an on-premise database–and it’s cost efficient. Even better is a managed service that scales with our users, automagically,” says Somalee Datta, Director of Research IT at SoM. Over 120 data scientists now have access to STARR-OMOP through Nero.
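
To make "cohort query" concrete, here is a minimal sketch of what one might look like against an OMOP CDM dataset in BigQuery. It is an illustration under stated assumptions, not Stanford's actual query: the dataset name and the helper `build_cohort_query` are hypothetical, and the only schema elements used are the standard OMOP tables `person` and `condition_occurrence`. The function only builds the SQL text, so it runs without any cloud credentials.

```python
import re


def build_cohort_query(dataset: str) -> str:
    """Build BigQuery Standard SQL that counts distinct patients in a cohort.

    `dataset` is a hypothetical "project.dataset" name. The cohort criteria
    (condition concept and minimum birth year) are left as named query
    parameters so the caller supplies them at execution time.
    """
    # Table names cannot be passed as query parameters, so validate the
    # dataset name before interpolating it into the SQL text.
    if not re.fullmatch(r"[A-Za-z0-9_.\-]+", dataset):
        raise ValueError(f"unexpected characters in dataset name: {dataset!r}")

    lines = [
        "SELECT COUNT(DISTINCT p.person_id) AS n_patients",
        f"FROM `{dataset}.person` AS p",
        f"JOIN `{dataset}.condition_occurrence` AS co",
        "  ON co.person_id = p.person_id",
        "WHERE co.condition_concept_id = @condition_concept_id",
        "  AND p.year_of_birth >= @min_birth_year",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # "my_project.omop_demo" is a placeholder, not a real Stanford dataset.
    print(build_cohort_query("my_project.omop_demo"))
```

In practice, a string like this would be submitted through a BigQuery client with values bound to `@condition_concept_id` and `@min_birth_year`; the concept ID would come from the OMOP vocabulary tables.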

"Cohort queries run ten to one hundred times faster on BigQuery when compared to an on-premise database–and it’s cost efficient."

Somalee Datta, Director of Research IT, Stanford School of Medicine

Building on Google Cloud’s flexible infrastructure to search 30 billion medical concepts with BigQuery

Migrating some of their data storage and analysis to Google Cloud unlocks efficiencies and promotes collaboration. During weekly production runs, Research IT developers can cost-effectively burst compute capacity to hundreds of compute cores, rather than the dozens available on-premises. They can work together on secure cloud infrastructure that protects the privacy of electronic health records (EHR) and deploy their own technology stacks via shared Google technologies like Kubernetes, Dataproc, and Dataflow. For example, Research IT’s collaborator Odysseus uses Dataproc for ETL prototyping, so its team can work faster in a familiar environment. Dataflow manages Research IT’s de-identified text pipeline to set up data analysis in BigQuery. “BigQuery is a truly universal resource,” says Garrick Olson, Infrastructure and Platform team lead in Stanford’s Research IT. “I can share a dataset in my project with anyone, anywhere, and they can join my dataset with their own (or anyone else’s) right within BigQuery.”

The Research IT team also benefited from Google Cloud’s flexibility for developing custom tools. For example, they worked with Natural Language Processing (NLP) expert and data scientist Jose Posada to incorporate a text mining algorithm that identifies standardized medical concepts, such as diabetes. The algorithm searches 100 million de-identified clinical notes in the dataset, then annotates each record according to whether the concept relates to the patient (e.g., patient has diabetes), the patient’s family history (e.g., patient’s father has diabetes), and the patient’s status at the time (e.g., whether the patient has reported symptoms of diabetic retinopathy at the visit). This feature makes 30 billion standardized medical concepts easily searchable with BigQuery under maximally secure conditions through a Nero account. “Being able to do team science with the world has been a privilege,” says Posada. “And my analyses are running in near real time.”
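
The kind of annotation described above can be illustrated with a toy example. The sketch below is not the team's text mining algorithm, which the article does not detail; it swaps in simple keyword rules purely to show what "concept, subject, status" annotations might look like. The keyword lists, the `Annotation` type, and the `annotate` function are all hypothetical, and negation is used here as a stand-in for the patient's status at the visit.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists for illustration only; a production system
# would rely on a clinical vocabulary and a far more robust method.
FAMILY_TERMS = re.compile(
    r"\b(father|mother|brother|sister|family history)\b", re.IGNORECASE
)
NEGATION_TERMS = re.compile(
    r"\b(no|denies|without|negative for)\b", re.IGNORECASE
)


@dataclass
class Annotation:
    concept: str   # the matched concept, lowercased
    subject: str   # "patient" or "family"
    negated: bool  # True if the sentence contains a negation cue


def annotate(sentence: str, concept: str) -> Optional[Annotation]:
    """Return an annotation if `concept` appears in `sentence`, else None."""
    pattern = rf"\b{re.escape(concept)}\b"
    if not re.search(pattern, sentence, re.IGNORECASE):
        return None
    subject = "family" if FAMILY_TERMS.search(sentence) else "patient"
    negated = bool(NEGATION_TERMS.search(sentence))
    return Annotation(concept.lower(), subject, negated)


if __name__ == "__main__":
    for text in (
        "Patient has diabetes.",
        "Patient's father has diabetes.",
        "Patient denies diabetes symptoms.",
    ):
        print(text, "->", annotate(text, "diabetes"))
```

Storing annotations as structured rows like these is what would make them searchable with ordinary SQL filters once loaded into a warehouse table.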

Expanding access through outreach

Last year, R&D lead Priya Desai rolled out a new data science training program that teaches the basic building blocks of data science and builds up to more complex skills using STARR clinical text. With COVID-19 restrictions, the hands-on curriculum is being converted to a series of short videos on Stanford’s STARR YouTube channel.

Datta anticipates even closer collaborations as they reach out across the biomedical community and integrate more clinical data. “We have already integrated STARR-OMOP with radiology and bedside monitoring data and continue to work on other data types such as pathology,” she says. In 2019, the Research IT team became part of Technology and Digital Solutions, a single IT organization for Stanford Health Care and Stanford School of Medicine. “This merger allows Research IT to build a better STARR platform in partnership with our clinical teams,” Datta reports. “The real testament of any infrastructure is when it brings experimentation velocity and lets the team focus on innovation. Google Cloud has been foundational to Research IT being able to re-imagine our nextgen CDW and help our data science community push the frontiers of data-driven research.” To find out more about the journey to develop STARR-OMOP, read more here.