Quick Start Guide¶

This guide will help you get started with CAPRICHO in just a few minutes.

Tab Completion¶

CAPRICHO supports tab completion for commands and options. To enable it, run:

capricho --install-completion

This will make it easier to discover available commands and options as you type.

Basic Workflow¶

CAPRICHO follows the workflow:

Download ChEMBL database (one-time setup);
Explore the data to understand what’s available;
Get curated bioactivity data for your analysis;
Prepare the data by cleaning concerning quality flags according to the projects’ needs and optionally output an activity matrix for multitask ML;
Binarize (optional) convert continuous activity values to binary labels for classification.

Step 1: Download ChEMBL Database¶

First, download the ChEMBL database to your local machine:

capricho download

This downloads the latest ChEMBL version to ~/.data/chembl/ by default. Defining a custom subdirectory is supported with the --prefix option, but the path will be saved under the default ~/.data/ location, as defined by pystow. For example:

capricho download --version 33 --prefix /path/to/custom/location

Step 2: Explore the Database¶

Before fetching data, you might want to explore what’s available:

# List all tables in the database
capricho explore --list-tables

# Explore a specific table
capricho explore --table activities

# Search for tables containing a specific column
capricho explore --search-column pchembl

# Run a custom SQL query
capricho explore --query "SELECT COUNT(*) FROM activities WHERE confidence_score = 9"

Step 3: Get Bioactivity Data¶

Now you’re ready to fetch and curate bioactivity data. Here are some common use cases:

Basic Target Query¶

Get all bioactivity data for EGFR (CHEMBL203):

capricho get --target-ids CHEMBL203 --output-path egfr_data.csv

Multiple Targets with High Confidence¶

Get data for multiple targets, but only high-confidence measurements:

capricho get --target-ids CHEMBL203,CHEMBL204,CHEMBL279 --confidence-scores 8,9 --output-path high_confidence.csv

Specific Bioactivity Types¶

Focus on specific bioactivity measurements:

capricho get --target-ids CHEMBL203 --bioactivity-type IC50,Ki --output-path ic50_ki_data.csv

With Custom Aggregation¶

Aggregate mutant data and include additional metadata:

capricho get --target-ids CHEMBL203 --aggregate-mutants --metadata-columns organism,tissue --output-path aggregated_data.csv

Step 4: Prepare Data for ML¶

The prepare command filters data based on quality flags, and outputs the cleaned data and an activity matrix suitable for multitask machine learning.

Quality and processing flags are introduced to all data fetched with the get command, including the max curation standards introduced by Landrum & Riniker (2024) and marking any modifications made by CAPRICHO during curation, respectively.

Basic Preparation¶

Transform aggregated data into a multitask activity matrix:

capricho prepare -i egfr_data.csv -o egfr_matrix.csv

Filtering Quality Flags¶

Remove entries with specific quality concerns:

capricho prepare -i egfr_data.csv -o egfr_matrix.csv \
    --drop-potential-duplicate \
    --drop-data-validity

The prepare command outputs:

Activity matrix (e.g., egfr_matrix.csv): Rows are compounds, columns are targets
Prepared data (e.g., egfr_prepared.csv): Filtered data before pivoting

See CLI Reference for all available quality flags (--drop-undefined-stereo, --drop-unit-error, etc.).

Step 5: Binarize Data (Optional)¶

If you need binary labels (active/inactive) for classification tasks, you can binarize the prepared data:

Basic Binarization¶

Convert continuous pChEMBL values to binary labels using the default threshold of 6.0 (1 µM):

capricho binarize -i egfr_prepared.csv -o egfr_binary.csv

Custom Threshold¶

Use a more stringent threshold of 7.0 (100 nM):

capricho binarize -i egfr_prepared.csv -o egfr_binary.csv -t 7.0

Using Median Values¶

Use median instead of mean for binarization:

capricho binarize -i egfr_prepared.csv -o egfr_binary.csv -vcol pchembl_value_median

The binarization process:

Handles censored measurements (< and >) intelligently
Flags conflicting measurements for the same compound-target pair
Outputs a new column activity_binary with values 0 (inactive) or 1 (active)

Understanding the Output¶

CAPRICHO generates several files:

Main data file (e.g., egfr_data.csv): The curated bioactivity data
Recipe file (e.g., egfr_data_recipe.json): Complete record of parameters used
Not aggregated file (e.g., egfr_data_not_aggregated.csv): Pre-aggregation data
Removed subset file (e.g., egfr_data_removed_subset.csv): Data filtered out during curation

Key Output Columns¶

The main data file contains these important columns:

connectivity: Molecular connectivity identifier
smiles: Standardized SMILES representation
target_chembl_id: ChEMBL ID for the target
pchembl_value_mean: Mean pChEMBL value (aggregated)
pchembl_value_median: Median pChEMBL value (aggregated)
pchembl_value_std: Standard deviation
standard_relation: Relation type (=, <, <<, >, >>, ~)
data_dropping_comment: Flags for filtered data
data_processing_comment: Processing notes

Next Steps¶

Read the CLI Reference for complete parameter documentation
Learn about Key Concepts like compound equality and backends
See the ADMET Data Guide for working with ADMET endpoints