Safety, Reproducibility, Performance: Accelerating cancer drug discovery with cloud, ML, and HPC technologies


4th Computational Approaches for Cancer Workshop at SC18 | Dallas, TX

Date: November 11, 2018

Speaker: Amanda Minnich, PhD


Abstract: The drug discovery process is currently costly, slow, and failure-prone. It takes an average of 5.5 years to get to the clinical testing stage, and in this time millions of molecules are tested, thousands are made, and most fail.

The ATOM consortium is working to transform the drug discovery process by using machine learning to pretest many molecules in silico for both safety and efficacy, reducing the costly iterative experimental cycles that are traditionally needed. The consortium comprises Lawrence Livermore National Laboratory, GlaxoSmithKline, Frederick National Laboratory for Cancer Research, and UCSF. Through ATOM's unique combination of partners, machine learning experts are able to use HPC supercomputers to develop models based on proprietary and public pharma data for over 2 million compounds. The goal of the consortium is to create a new paradigm of drug discovery that drastically reduces the time from identified drug target to clinical candidate, and we intend to use oncology as the first exemplar of the platform.

To this end, we have created a computational framework to build ML models that generate all key safety and pharmacokinetics parameters needed as input for Quantitative Systems Pharmacology and Toxicology models. Our end-to-end pipeline first ingests raw datasets, curates them, and stores the result in our data lake. Next, it extracts features from these data, trains models, and saves them to our model zoo. Our pipeline generates a variety of molecular features and both shallow and deep ML models. The HPC-specific module we have developed conducts an efficient parallelized search of the model hyperparameter space and reports the best-performing hyperparameters for each of these feature/model combinations.
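The parallelized hyperparameter search described above can be illustrated with a minimal sketch. The grid, the `evaluate` objective, and all names here are hypothetical stand-ins (the toy score simply peaks at one grid point); a real pipeline would train a model per combination and return its validation-set performance, with workers distributed across HPC nodes rather than local processes.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

# Hypothetical hyperparameter grid for one feature/model combination.
GRID = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_layers": [2, 3, 4],
}

def evaluate(params):
    """Stand-in for training a model and scoring it on the validation set."""
    lr, layers = params["learning_rate"], params["n_layers"]
    # Toy score that peaks at lr=0.01, n_layers=3; a real pipeline
    # would return actual validation-set performance here.
    return params, 1.0 / (abs(lr - 0.01) + abs(layers - 3) + 1.0)

def search(grid):
    """Evaluate every grid point in parallel and return the best one."""
    combos = [dict(zip(grid, vals)) for vals in product(*grid.values())]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(evaluate, combos))
    return max(results, key=lambda r: r[1])

if __name__ == "__main__":
    best_params, best_score = search(GRID)
    print(best_params, best_score)
```

The same map-then-reduce shape (score each combination independently, keep the best) is what makes the search embarrassingly parallel and a natural fit for an HPC scheduler.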

To ensure complete traceability of results, we save the training, validation, and testing dataset version IDs, the Git hash of the code used to generate the model, and the OS- and library-related version information. We have set up a Docker/Kubernetes infrastructure, so when a promising model has been identified, we can encapsulate the pipeline that created it, supporting both reproducibility and portability. Our system is designed to handle protected data and support incorporating proprietary models, which allows the framework to be run on real drug design tasks.
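A provenance record of the kind described above can be captured with a few lines of standard tooling. This is an illustrative sketch, not ATOM's actual implementation; the function name and the shape of `dataset_ids` are assumptions, but the ingredients (dataset version IDs, the Git hash of the training code, and OS/library version strings) follow the text.

```python
import json
import platform
import subprocess
import sys

def capture_provenance(dataset_ids):
    """Collect version metadata to save alongside a trained model.

    `dataset_ids` maps split names (train/valid/test) to dataset
    version IDs; its shape here is a hypothetical example.
    """
    try:
        git_hash = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError, OSError):
        git_hash = "unknown"  # not inside a Git checkout
    return {
        "dataset_versions": dataset_ids,
        "git_hash": git_hash,
        "os": platform.platform(),
        "python_version": sys.version.split()[0],
    }

if __name__ == "__main__":
    meta = capture_provenance(
        {"train": "v1.2", "valid": "v1.2", "test": "v1.2"})
    print(json.dumps(meta, indent=2))
```

Storing this record with the model makes any result traceable back to the exact data, code, and environment that produced it, which is what the Docker/Kubernetes encapsulation then makes portable.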

Our models are currently being integrated into an active learning pipeline to aid in de novo compound generation, as well as being sent back to consortium members to incorporate into their drug discovery efforts. Our models and code will also be released to the public at the end of our 3-year proof-of-concept phase. To make these models usable externally, we have built a module that can load a model from our model zoo and generate predictions for a list of compounds on the fly. If ground truth is known, a variety of performance metrics are generated and stored in our model performance tracker database, allowing for easy querying and comparison of model performance.
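The metric-generation step can be sketched as follows. This is a minimal illustration using two common regression metrics (RMSE and R²); the actual metric set and storage schema in the tracker database are not specified in the text, so treat the function name and return shape as assumptions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute example metrics of the kind stored in a performance tracker.

    RMSE = sqrt(mean squared error); R^2 = 1 - SS_res / SS_tot.
    """
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }
```

When ground truth is available for a compound list, a record like `{"model_id": ..., "rmse": ..., "r2": ...}` (schema hypothetical) can be written to the tracker for later querying and comparison.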

We are confident that this work will help to transform cancer drug discovery from a time-consuming, sequential, and high-risk process into one that is rapid and integrated, yielding better patient outcomes.