ATOM is combining forces with NCI to release data and models for the public good

The Predictive Oncology Model and Data Clearinghouse (MoDaC) was developed within the National Cancer Institute (NCI) as a resource-sharing platform, as part of a collaborative effort with the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program and the ATOM Consortium (ATOM). MoDaC's goal is to transition predictive oncology datasets and mathematical models (such as machine learning and deep learning models) to the broader research community through four objectives:

MoDaC’s Objectives

Serve as a public repository for predictive oncology datasets and computational models.
Establish connectivity with other repositories and analysis platforms to promote a shared ecosystem for model management.
Enable users to perform prediction and evaluation of deployed models.
Enable comparison and standardization of models.

To support MoDaC’s objectives, ATOM has released its best models and datasets into MoDaC’s repository for public use. The intent is for users to reuse these machine learning models with ATOM's open-source ATOM Modeling Pipeline (AMPL) software to evaluate small-molecule compounds against selected targets in their drug discovery or chemical toxicology research.

An opportunity exists to integrate MoDaC with AMPL, both to facilitate programmatic storage and retrieval of MoDaC models from AMPL and to incorporate the ability to execute these models from MoDaC in an AMPL environment.

MoDaC’s Capabilities

Annotated datasets and models (known as assets in the system) stored in the repository are publicly searchable by metadata. The generic data hierarchy and metadata structure enable consistent organization of data irrespective of the domain. The search page, with its browsing and filtering capabilities, allows users to locate assets quickly. A user can download assets to, or upload assets from, a Globus endpoint, an AWS S3 bucket, Google Drive, or the user's computer.
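As a rough illustration of metadata-based search, the sketch below filters a small in-memory catalog by key/value pairs. It does not reflect MoDaC's actual data model: the `Asset` class, the `search_assets` function, the metadata keys, and the paths are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Hypothetical stand-in for a repository asset: a path plus metadata."""
    path: str
    metadata: dict = field(default_factory=dict)


def search_assets(assets, **criteria):
    """Return assets whose metadata matches every key/value pair.

    Values are compared as case-insensitive strings. With no criteria,
    every asset matches.
    """
    def matches(asset):
        return all(
            str(asset.metadata.get(key, "")).lower() == str(value).lower()
            for key, value in criteria.items()
        )

    return [asset for asset in assets if matches(asset)]


# Illustrative catalog; paths and metadata values are made up.
catalog = [
    Asset("/demo/collection_a/model_1", {"asset_type": "Model", "domain": "toxicology"}),
    Asset("/demo/collection_a/dataset_1", {"asset_type": "Dataset", "domain": "toxicology"}),
    Asset("/demo/collection_b/model_2", {"asset_type": "Model", "domain": "oncology"}),
]

hits = search_assets(catalog, asset_type="model", domain="toxicology")
print([asset.path for asset in hits])
```

Because every asset carries the same kind of key/value metadata regardless of domain, one filter function can serve datasets and models alike, which is the benefit of a generic metadata structure.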

A representational state transfer (REST) application programming interface (API) enables integration with modeling and biomedical analysis platforms. It allows users to store, discover, and retrieve data programmatically.
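To make the idea of programmatic discovery concrete, here is a minimal sketch that builds (but does not send) a search request using only the Python standard library. The base URL, the `/search` path, the JSON body shape, and the bearer-token header are assumptions made for illustration; the real endpoints and authentication scheme should be taken from the MoDaC API documentation.

```python
import json
import urllib.request

# Hypothetical placeholder, not the real MoDaC endpoint.
BASE_URL = "https://modac.example.org/api"


def build_search_request(base_url, metadata_query, token=None):
    """Build a POST request carrying a metadata query as JSON.

    The "/search" path and the {"query": ...} body are assumed shapes,
    not the documented MoDaC API.
    """
    body = json.dumps({"query": metadata_query}).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/search",
        data=body,
        headers=headers,
        method="POST",
    )


request = build_search_request(BASE_URL, {"asset_type": "Model"})
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` and decoding the JSON response, once the placeholder URL and body are replaced with values from the actual API.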

The owner of an asset can specify a digital object identifier (DOI) for that asset. The platform also provides a unique, publicly accessible hyperlink to each asset for citations or references. The Department of Energy Data Explorer has leveraged this capability by publishing the links for its datasets that have a DOI attached to them.

In addition to the above capabilities, MoDaC was recently enhanced to generate predictions and evaluate models deployed in an on-premises execution environment. Users can provide their own dataset, direct MoDaC to retrieve a dataset from the National Cancer Institute Genomic Data Commons (GDC) by supplying a GDC manifest file, or use the reference datasets stored in MoDaC for that purpose. MoDaC then runs the specified model with that dataset and returns the predictions and evaluation results to the user.

Planned MoDaC enhancements include:

Ability to evaluate models deployed in cloud environments.
Ability to compare models using metrics like score, recall, precision, and so on.
Expanded REST API suite to enable model evaluation and comparison.
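
To illustrate what comparing models on precision and recall involves, the sketch below computes both metrics for two sets of binary predictions against the same ground truth. The function name, labels, and predictions are invented for this example and do not come from MoDaC or any ATOM model.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive).

    precision = TP / (TP + FP), recall = TP / (TP + FN).
    Returns 0.0 for a metric whose denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up ground truth and predictions from two hypothetical models.
y_true = [1, 0, 1, 1, 0, 1]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 1],
    "model_b": [1, 1, 1, 1, 0, 0],
}

for name, y_pred in predictions.items():
    precision, recall = precision_recall(y_true, y_pred)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

Evaluating every model against the same reference dataset, as here, is what makes the resulting numbers comparable across models.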