Quick Start Guide¶
This guide will help you get started with CAPRICHO in just a few minutes.
Tab Completion¶
CAPRICHO supports tab completion for commands and options. To enable it, run:
capricho --install-completion
This will make it easier to discover available commands and options as you type.
Basic Workflow¶
CAPRICHO follows the workflow:
Download ChEMBL database (one-time setup);
Explore the data to understand what’s available;
Get curated bioactivity data for your analysis;
Prepare the data by cleaning concerning quality flags according to the projects’ needs and optionally output an activity matrix for multitask ML;
Binarize (optional) convert continuous activity values to binary labels for classification.
Step 1: Download ChEMBL Database¶
First, download the ChEMBL database to your local machine:
capricho download
This downloads the latest ChEMBL version to ~/.data/chembl/ by default. Defining a custom subdirectory is supported with the --prefix option, but the path will be saved under the default ~/.data/ location, as defined by pystow. For example:
capricho download --version 33 --prefix /path/to/custom/location
Step 2: Explore the Database¶
Before fetching data, you might want to explore what’s available:
# List all tables in the database
capricho explore --list-tables
# Explore a specific table
capricho explore --table activities
# Search for tables containing a specific column
capricho explore --search-column pchembl
# Run a custom SQL query
capricho explore --query "SELECT COUNT(*) FROM activities WHERE confidence_score = 9"
Step 3: Get Bioactivity Data¶
Now you’re ready to fetch and curate bioactivity data. Here are some common use cases:
Basic Target Query¶
Get all bioactivity data for EGFR (CHEMBL203):
capricho get --target-ids CHEMBL203 --output-path egfr_data.csv
Multiple Targets with High Confidence¶
Get data for multiple targets, but only high-confidence measurements:
capricho get --target-ids CHEMBL203,CHEMBL204,CHEMBL279 --confidence-scores 8,9 --output-path high_confidence.csv
Specific Bioactivity Types¶
Focus on specific bioactivity measurements:
capricho get --target-ids CHEMBL203 --bioactivity-type IC50,Ki --output-path ic50_ki_data.csv
With Custom Aggregation¶
Aggregate mutant data and include additional metadata:
capricho get --target-ids CHEMBL203 --aggregate-mutants --metadata-columns organism,tissue --output-path aggregated_data.csv
Step 4: Prepare Data for ML¶
The prepare command filters data based on quality flags, and outputs the cleaned data and an activity matrix suitable for multitask machine learning.
Quality and processing flags are introduced to all data fetched with the get command, including the max curation standards introduced by Landrum & Riniker (2024) and marking any modifications made by CAPRICHO during curation, respectively.
Basic Preparation¶
Transform aggregated data into a multitask activity matrix:
capricho prepare -i egfr_data.csv -o egfr_matrix.csv
Filtering Quality Flags¶
Remove entries with specific quality concerns:
capricho prepare -i egfr_data.csv -o egfr_matrix.csv \
--drop-potential-duplicate \
--drop-data-validity
The prepare command outputs:
Activity matrix (e.g.,
egfr_matrix.csv): Rows are compounds, columns are targetsPrepared data (e.g.,
egfr_prepared.csv): Filtered data before pivoting
See CLI Reference for all available quality flags (--drop-undefined-stereo, --drop-unit-error, etc.).
Step 5: Binarize Data (Optional)¶
If you need binary labels (active/inactive) for classification tasks, you can binarize the prepared data:
Basic Binarization¶
Convert continuous pChEMBL values to binary labels using the default threshold of 6.0 (1 µM):
capricho binarize -i egfr_prepared.csv -o egfr_binary.csv
Custom Threshold¶
Use a more stringent threshold of 7.0 (100 nM):
capricho binarize -i egfr_prepared.csv -o egfr_binary.csv -t 7.0
Using Median Values¶
Use median instead of mean for binarization:
capricho binarize -i egfr_prepared.csv -o egfr_binary.csv -vcol pchembl_value_median
The binarization process:
Handles censored measurements (< and >) intelligently
Flags conflicting measurements for the same compound-target pair
Outputs a new column
activity_binarywith values 0 (inactive) or 1 (active)
Understanding the Output¶
CAPRICHO generates several files:
Main data file (e.g.,
egfr_data.csv): The curated bioactivity dataRecipe file (e.g.,
egfr_data_recipe.json): Complete record of parameters usedNot aggregated file (e.g.,
egfr_data_not_aggregated.csv): Pre-aggregation dataRemoved subset file (e.g.,
egfr_data_removed_subset.csv): Data filtered out during curation
Key Output Columns¶
The main data file contains these important columns:
connectivity: Molecular connectivity identifiersmiles: Standardized SMILES representationtarget_chembl_id: ChEMBL ID for the targetpchembl_value_mean: Mean pChEMBL value (aggregated)pchembl_value_median: Median pChEMBL value (aggregated)pchembl_value_std: Standard deviationstandard_relation: Relation type (=, <, <<, >, >>, ~)data_dropping_comment: Flags for filtered datadata_processing_comment: Processing notes
Next Steps¶
Read the CLI Reference for complete parameter documentation
Learn about Key Concepts like compound equality and backends
See the ADMET Data Guide for working with ADMET endpoints