CAPRICHO¶

The ChEMBL data curator that flags issues instead of silently dropping them.

CAPRICHO (ChEMBL Aggregation Package with Robust Inspection and Curation Handling Options) is a Python package that streamlines fetching, curating, and aggregating ChEMBL data into a machine learning-ready format for drug discovery in a flexible and reproducible manner.

Goals¶

The development of CAPRICHO is guided by two core principles:

Transparency Above All: Data curation should never be a black box. Removed data points should be saved to be scrutinized by the user and the original data should be always preserved to ensure data integrity.
Flexibility by Design: Every modeling project is unique. The tool must support flexible data collection and aggregation, allowing the incorporation of any ChEMBL metadata column to be incorporated into same-compound bioactivity values.

Features¶

Data retrieval by any ChEMBL identifier (molecule IDs, target IDs, assay IDs, or document IDs)
Automated pChEMBL (pXC50) value calculation for bioactivities if not provided through ChEMBL
ADMET data support with unit conversion and non-pChEMBL aggregation
Customizable filtering options
Configurable data aggregation options
Save a fetching and processing recipe for reproducibility
Command-line interface for easy use

Contents