CLI Reference¶
CAPRICHO provides five main commands: download, explore, get, prepare, and binarize. This section provides comprehensive documentation for all command-line options.
capricho download¶
Downloads the ChEMBL SQL database using chembl_downloader.
capricho download [OPTIONS]
Options¶
| Option | Description | Default |
|---|---|---|
| --version | ChEMBL version to download | latest |
| --prefix | Custom pystow storage path | |
Examples¶
# Download latest ChEMBL version
capricho download
# Download specific version
capricho download --version 33
# Use a custom storage location (this installs version 25 under ~/.data/old-chembl)
capricho download --version 25 --prefix old-chembl/
capricho explore¶
Explore the downloaded ChEMBL SQL database.
capricho explore [OPTIONS]
Options¶
Option |
Description |
|---|---|
|
ChEMBL version to use (defaults to latest) |
|
List all tables within the SQL database and exit |
|
Explore a specific table |
|
Search for tables containing a column name pattern |
|
Run a custom SQL query |
|
Console output format for tables ( |
|
Save primary result DataFrame to file (format inferred from extension) |
|
ANSI color cycling on console table rows |
Examples¶
# List all available tables on the relational database
capricho explore --list-tables
# Examine the activities table
capricho explore --table activities
# Find tables with 'pchembl' columns
capricho explore --search-column pchembl
# Check available standard_relation types in ChEMBL
capricho explore --query "SELECT standard_relation, COUNT(*) as count FROM activities WHERE standard_relation IS NOT NULL GROUP BY standard_relation ORDER BY count DESC"
Useful Queries for Data Exploration¶
These queries can help you understand the data before running capricho get:
Count activities by bioactivity type:
capricho explore --query "SELECT standard_type, COUNT(*) as count FROM activities WHERE standard_type IS NOT NULL GROUP BY standard_type ORDER BY count DESC LIMIT 20"
Count activities by assay type:
capricho explore --query "SELECT assay_type, COUNT(*) as count FROM assays GROUP BY assay_type ORDER BY count DESC"
Check confidence score distribution:
capricho explore --query "SELECT confidence_score, COUNT(*) as count FROM assays GROUP BY confidence_score ORDER BY confidence_score DESC"
Check all standard units:
capricho explore --query "SELECT standard_units, COUNT(*) as count FROM activities WHERE standard_units IS NOT NULL GROUP BY standard_units ORDER BY count DESC LIMIT 30"
capricho get¶
Filter, download, and process bioactivity data from ChEMBL. This is the main command of CAPRICHO.
capricho get [OPTIONS]
Input ID Options¶
Specify which ChEMBL entities to retrieve data for:
Option |
Description |
Default |
|---|---|---|
|
ChEMBL molecule IDs, comma-separated |
|
|
ChEMBL target IDs, comma-separated |
|
|
ChEMBL assay IDs, comma-separated |
|
|
ChEMBL document IDs, comma-separated |
|
Filtering Options¶
Control which bioactivity data to include:
Option |
Description |
Default |
|---|---|---|
|
Confidence scores to filter, comma-separated |
|
|
Bioactivity types to filter, comma-separated |
|
|
Filter by standard relation ( |
|
|
Filter by standard units, comma-separated. Useful for ADMET data with specific units like |
|
|
Assay types (B, F, A, T, P), comma-separated |
|
|
Only fetch data reported up to a certain ChEMBL release |
|
|
Filter out bioactivities without a document date |
|
|
Maximum number of compounds in an assay |
|
|
Minimum number of compounds in an assay |
|
|
Minimum overlapping compounds between assays |
|
Confidence Scores¶
ChEMBL assigns confidence scores from 0-9 based on target assignment certainty:
9: Direct single protein target assigned
8: Homologous single protein target assigned
7: Direct protein complex subunits assigned
6: Homologous protein complex subunits assigned
5: Multiple direct protein targets may be assigned (e.g., PROTEIN FAMILY)
4: Multiple homologous protein targets may be assigned (e.g., PROTEIN FAMILY)
3: Target assigned is molecular non-protein target
2: Target assigned is subcellular fraction
1: Target assigned is non-molecular
0: Default value - Target assignment has yet to be curated
Bioactivity Types¶
Common bioactivity measurements:
IC50: Half maximal inhibitory concentration
EC50: Half maximal effective concentration
Ki: Inhibition constant
Kd: Dissociation constant
AC50: Half maximal activity concentration
Potency: General potency measurement (see assay_description for more detail)
Assay Types¶
B: Binding assay
F: Functional assay
A: ADMET assay
T: Toxicity assay
P: Physicochemical assay
U: Unclassified
Further information on assay types and confidence scores can be found in the ChEMBL documentation.
Processing & Aggregation Options¶
Control how data is processed and aggregated:
Option |
Description |
Default |
|---|---|---|
|
Calculate pChEMBL values if not reported. Required when using censored data ( |
|
|
Column to aggregate statistics on. Use |
|
|
Convert units to standard formats before aggregation. See Unit Conversion. |
|
|
Consider chirality during fingerprint calculation |
|
|
Drop entries with unassigned chiral centers |
|
|
Apply curation for pChEMBL annotation errors |
|
|
Aggregate data on targets regardless of mutation |
|
|
Flag assays with mutant-related keywords for removal |
|
|
Method for compound equality determination |
|
|
Extra metadata columns to keep, comma-separated |
|
|
Extra ID columns for aggregation, comma-separated |
|
Aggregation Column Options¶
pchembl_value: (Default) Aggregate on pChEMBL values (-log10 molar potency). Uses geometric mean.
standard_value: Aggregate on raw standard_value column. Uses arithmetic mean. Useful for ADMET data with non-molar units (%, permeability, etc.).
Compound Equality Methods¶
connectivity: (Default) Based on molecular connectivity (InChI key first block), ignoring stereochemistry
mixed_fp: Uses ECFP4 and RDKit fingerprints (each with 2048 bits) for similarity determination
smiles: Uses standardized SMILES strings directly for exact string matching
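As a rough illustration of the connectivity method, the sketch below compares two compounds by the first block of their InChIKeys. The helper names are hypothetical and are not part of CAPRICHO; the example keys are the InChIKeys of L- and D-alanine, which differ only in the stereochemistry block.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first (connectivity) block of a standard InChIKey."""
    return inchikey.split("-")[0]


def same_connectivity(inchikey_a: str, inchikey_b: str) -> bool:
    """True when two InChIKeys share the same molecular graph block."""
    return connectivity_block(inchikey_a) == connectivity_block(inchikey_b)


# L-alanine and D-alanine: same connectivity, different stereo block.
l_alanine = "QNAYBMKLOCPYGJ-REOHGSBYSA-N"
d_alanine = "QNAYBMKLOCPYGJ-UWTATZPHSA-N"
print(same_connectivity(l_alanine, d_alanine))
```

Under this method the two enantiomers are treated as one compound, which is why stereoisomers end up aggregated together unless another equality method is chosen.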
Useful Metadata Columns¶
organism: Source organism
tissue: Tissue type
cell_type: Cell line information
assay_description: Detailed assay description
target_type: Type of target (e.g., SINGLE PROTEIN, PROTEIN FAMILY)
Output & Backend Options¶
Control output format and data source:
Option |
Description |
Default |
|---|---|---|
|
Path to save the output files |
|
|
Skip saving pre-aggregation data |
|
|
Skip saving the JSON recipe file |
|
|
Backend to use for ChEMBL interaction |
|
|
ChEMBL version used by |
|
Backend Options¶
downloader: (Default) Uses local SQL database, faster for large queries
webresource: Uses ChEMBL web API, no local download required
Complete Examples¶
Basic Target Analysis¶
capricho get --target-ids CHEMBL203 --output-path egfr_analysis.csv
High-Quality Multi-Target Dataset¶
capricho get \
--target-ids CHEMBL203,CHEMBL204,CHEMBL279 \
--confidence-scores 8,9 \
--bioactivity-type IC50,Ki \
--standard-relation "=" \
--aggregate-mutants \
--metadata-columns organism,tissue \
--output-path high_quality_dataset.csv
ADMET Data with Unit Conversion¶
Retrieve Caco-2 permeability data with unit conversion and aggregation on standard_value:
capricho get \
--assay-ids CHEMBL1112933,CHEMBL3529279,CHEMBL3529278 \
--assay-types A \
--confidence-scores 0,1,2,3,4,5,6,7,8,9 \
--aggregate-on standard_value \
--convert-units \
--id-columns standard_units,assay_cell_type \
--drop-unassigned-chiral \
--output-path caco2_permeability.csv
This command:
Fetches data from specific Caco-2 permeability assays
Uses ADMET assay type (-at A)
Aggregates on standard_value instead of pChEMBL (permeability isn’t a potency measurement)
Converts permeability units to a common format (10^-6 cm/s)
Groups by standard_units and assay_cell_type during aggregation
capricho prepare¶
Clean aggregated bioactivity data by filtering entries based on quality and processing flags introduced during capricho get. Optionally, transform the cleaned data into a multitask activity matrix where rows are compounds and columns are tasks (e.g., targets).
capricho prepare [OPTIONS]
Required Options¶
| Option | Description |
|---|---|
| -i | Path to aggregated data file (CSV, TSV, or Parquet) |
| -o | Path to save the output file |
Quality Flag Filtering Options¶
These flags remove entries with specific quality concerns. Each flag corresponds to a comment added during capricho get:
Option |
Description |
Default |
|---|---|---|
|
Drop entries with undefined stereochemistry |
|
|
Drop entries flagged as potential duplicates across documents |
|
|
Drop entries with data validity comments from ChEMBL |
|
|
Drop entries with unit annotation errors (3.0 or 6.0 log unit differences) |
|
|
Drop entries containing mixtures in SMILES |
|
|
Drop entries outside assay size bounds (both too small and too large) |
|
|
Drop entries from assays with insufficient compound overlap |
|
|
Custom quality flags to remove, comma-separated. Rows with these flags in |
|
Data Cleaning Options¶
Option |
Description |
Default |
|---|---|---|
|
Remove duplicate pChEMBL values within aggregated rows and recalculate statistics |
|
|
Resolve unit annotation errors by keeping measurement from earliest document. Use |
|
Activity Matrix Options¶
These options control the optional multitask activity matrix output:
Option |
Description |
Default |
|---|---|---|
|
Column to use as task identifier |
|
|
Column for compound identity ( |
|
|
Column containing SMILES strings |
|
|
Column that was aggregated on during |
|
|
Extra columns to combine with |
|
Output Options¶
Option |
Description |
Default |
|---|---|---|
|
Path to save comparability plots (e.g., |
|
Understanding Quality Flags¶
Quality flags are added to the data_dropping_comment column during capricho get. Common flags include:
Undefined stereochemistry: Compounds with unassigned chiral centers
Potential duplicate: Quality flag introduced by the ChEMBL team, indicating that a compound-target pair is reported in multiple documents with identical values
Data validity comment: ChEMBL’s own data quality annotations
Unit annotation error: Measurements differing by exactly 3.0 or 6.0 log units (suggesting unit conversion errors)
Mixture in SMILES: SMILES containing multiple components (. separator)
Assay size too small/large: Assays outside the specified size bounds
Insufficient assay overlap: Assays without enough shared compounds for reliable comparison
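The unit annotation error flag rests on simple arithmetic: mislabeling nM as µM shifts a value by a factor of 1000, which is 3.0 log units, and nM versus mM gives 6.0. The sketch below shows one way such a check could look. The function name and the floating-point tolerance are assumptions made for illustration; CAPRICHO's actual implementation may differ.

```python
import math


def looks_like_unit_error(pchembl_a: float, pchembl_b: float, tol: float = 1e-6) -> bool:
    """Flag two measurements whose pChEMBL values differ by 3.0 or 6.0 log units.

    The tolerance is an assumption: it only absorbs floating-point noise.
    """
    diff = abs(pchembl_a - pchembl_b)
    return any(math.isclose(diff, shift, abs_tol=tol) for shift in (3.0, 6.0))


print(looks_like_unit_error(6.0, 9.0))  # differ by 3 log units
print(looks_like_unit_error(6.0, 8.5))  # differ by 2.5 log units
```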
Output Files¶
The prepare command generates two files:
Prepared data (*_prepared.csv): The cleaned data after filtering quality flags
Activity matrix (specified by -o): Rows are compounds (indexed by compound_col), columns are tasks, plus a smiles column. Suitable for multitask ML models.
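To make the activity matrix layout concrete, here is a minimal standard-library sketch of the pivot from long-format rows to one row per compound. The function, the `target_chembl_id` and `pchembl_value_mean` column names, and the sample rows are assumptions for illustration only; the real column names depend on how the data was aggregated.

```python
def build_activity_matrix(rows, compound_col, task_col, value_col, smiles_col="smiles"):
    """Pivot long-format rows into {compound: {smiles, task_1, task_2, ...}}."""
    tasks = sorted({row[task_col] for row in rows})
    matrix = {}
    for row in rows:
        # First time a compound is seen: create its row with every task empty.
        entry = matrix.setdefault(
            row[compound_col],
            {smiles_col: row[smiles_col], **{task: None for task in tasks}},
        )
        entry[row[task_col]] = row[value_col]
    return matrix


# Illustrative rows; identifiers and values are made up for the example.
rows = [
    {"connectivity": "CMPD_A", "smiles": "CCO", "target_chembl_id": "CHEMBL203", "pchembl_value_mean": 6.8},
    {"connectivity": "CMPD_A", "smiles": "CCO", "target_chembl_id": "CHEMBL204", "pchembl_value_mean": 5.1},
    {"connectivity": "CMPD_B", "smiles": "CCN", "target_chembl_id": "CHEMBL203", "pchembl_value_mean": 7.4},
]
matrix = build_activity_matrix(rows, "connectivity", "target_chembl_id", "pchembl_value_mean")
for compound, columns in matrix.items():
    print(compound, columns)
```

Compound-task pairs without a measurement stay empty (`None` here), which is the usual shape expected by multitask models that mask missing labels.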
Examples¶
Basic Preparation¶
# Clean data and output activity matrix
capricho prepare -i egfr_data.csv -o egfr_matrix.csv
Filtering Quality Flags¶
# Remove potential duplicates and data validity issues
capricho prepare -i egfr_data.csv -o egfr_clean.csv \
--drop-potential-duplicate \
--drop-data-validity
Strict Quality Filtering¶
# Apply multiple quality filters for high-confidence data
capricho prepare -i kinase_data.csv -o kinase_clean.csv \
--drop-potential-duplicate \
--drop-data-validity \
--drop-unit-error \
--drop-undefined-stereo
With Deduplication¶
# Remove duplicate values and recalculate statistics
capricho prepare -i data.csv -o clean_data.csv \
--deduplicate \
--drop-potential-duplicate
Resolve Annotation Errors¶
# Keep earliest measurement when annotation errors are detected
capricho prepare -i data.csv -o clean_data.csv \
--resolve-annotation-error first \
--drop-unit-error
Generate Comparability Plots¶
# Output plots showing data comparability across assays
capricho prepare -i data.csv -o clean_data.csv \
--drop-potential-duplicate \
--plot comparability.png
This generates two plots:
comparability_cleaned.png: Comparability of data after filtering
comparability_flags.png: Multi-panel view showing remaining flags
Composite Task Identifiers¶
# Use id-columns if data was aggregated with --id-columns
capricho prepare -i data.csv -o matrix.csv \
--id-columns assay_cell_type,standard_units
Custom Flag Removal¶
# Remove entries with custom flags
capricho prepare -i data.csv -o clean_data.csv \
--remove-flags "Censored activity comment,Mutant assay"
capricho binarize¶
Convert aggregated bioactivity data to binary labels (active/inactive) based on a pChEMBL threshold. This command handles censored measurements (< and >) and validates agreement between different measurement types.
capricho binarize [OPTIONS]
Required Options¶
| Option | Description |
|---|---|
| -i | Path to aggregated data file (CSV, TSV, or Parquet) |
| -o | Path to save the binarized output file |
Binarization Options¶
Option |
Description |
Default |
|---|---|---|
|
Activity threshold for binarization (pChEMBL scale) |
|
|
Column to use for binarization |
|
|
Column name for compound identifiers |
|
|
Column name for target identifiers |
|
|
Column name for standard_relation values |
|
|
Name for the output binary column |
|
|
Compare measurements across mutants for conflicts |
|
|
Path to save detailed conflict report as JSON |
|
|
Strategy for resolving conflicts: |
|
Understanding pChEMBL Thresholds¶
The pChEMBL scale is -log10(Molar), where higher values indicate higher activity:
pChEMBL 6.0 = 1 µM (common threshold)
pChEMBL 6.5 = 316 nM
pChEMBL 7.0 = 100 nM (stringent threshold)
pChEMBL 5.0 = 10 µM (permissive threshold)
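The conversions above follow directly from the definition of the scale. The sketch below reproduces them; the helper names are hypothetical and exist only for this example.

```python
def pchembl_to_molar(pchembl: float) -> float:
    """Convert a pChEMBL value (-log10 of the molar concentration) to mol/L."""
    return 10.0 ** (-pchembl)


def describe_threshold(pchembl: float) -> str:
    """Render the concentration behind a pChEMBL threshold in µM or nM."""
    if pchembl <= 6.0:
        return f"{10.0 ** (6.0 - pchembl):.3g} µM"
    return f"{10.0 ** (9.0 - pchembl):.3g} nM"


for threshold in (5.0, 6.0, 6.5, 7.0):
    print(f"pChEMBL {threshold} = {describe_threshold(threshold)}")
```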
Handling Standard Relations¶
The binarization process handles different measurement types:
= (discrete): Direct comparison to threshold
~ (approximate): Uses lower bound for conservative classification (±0.5 log units)
<, << (censored active): Compound is more active than reported value
>, >> (censored inactive): Compound is less active than reported value
Conflict Detection and Resolution¶
The command flags measurements that disagree for the same compound-target pair:
Mixed discrete/censored conflicts: When discrete measurements (=, ~) disagree with censored measurements (<, >)
Binary label conflicts: When measurements result in different activity classifications (active vs inactive)
Mutation handling: Use --compare-across-mutants to control whether different mutants are compared
Conflict Resolution Strategies¶
By default, conflicts are only flagged in data_dropping_comment. Use --conflict-resolution to automatically resolve them:
Strategy |
Behavior |
Fallback |
|---|---|---|
|
Remove all rows for conflicting pairs |
– |
|
Keep exact ( |
Drop all if no |
|
Keep row with highest |
Drop all on tie |
|
Classify each individual measurement against the threshold; majority label wins |
Drop all on tie |
The majority strategy splits the pipe-separated raw values (e.g., pchembl_value = "6.0|6.5|7.0") and classifies each individual measurement against the threshold. Each measurement gets one vote. This means a row whose mean is above the threshold but which contains individual values below it will contribute some inactive votes. When the raw value column is not present, the strategy falls back to count-weighted or row-based voting.
# Resolve conflicts by keeping exact measurements
capricho binarize -i data.csv -o binary.csv -t 7.0 -cr relation
# Resolve by measurement-weighted majority vote, save report
capricho binarize -i data.csv -o binary.csv -t 7.0 -cr majority -rp conflicts.json
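The per-measurement voting behind the majority strategy can be sketched as follows. This is an illustration under stated assumptions, not CAPRICHO's code: the function name is hypothetical, a measurement equal to the threshold is assumed to count as active, and standard relations are ignored for brevity.

```python
def majority_label(raw_values: str, threshold: float):
    """Vote once per pipe-separated measurement; return 1, 0, or None on a tie."""
    # Assumption: value >= threshold counts as an "active" vote.
    votes = [float(v) >= threshold for v in raw_values.split("|") if v.strip()]
    active = sum(votes)
    inactive = len(votes) - active
    if active == inactive:
        return None  # tie: the pair would be dropped
    return 1 if active > inactive else 0


print(majority_label("6.0|6.5|7.0", 6.5))  # two active votes, one inactive
print(majority_label("6.0|7.0", 6.5))      # one vote each
```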
Conflict Report¶
Use -rp / --conflict-report-path to save a JSON report with:
Summary: Total conflicts, conflict patterns (exact vs censored), active/inactive counts, MCC, resolution summary
Per-conflict details: Measurements, vote summary, severity (low/medium/high based on measurement spread), recommendation, resolution outcome
Compound Identifiers for Conflict Detection¶
By default, conflicts are detected using the connectivity column (InChI key connectivity layer), which groups compounds by their molecular graph ignoring stereochemistry. You can use a different identifier:
connectivity (default): Groups by connectivity layer, ignoring stereochemistry
smiles: Groups by standardized SMILES, which may be more or less permissive depending on your aggregation settings
To use SMILES for conflict detection, specify -cid smiles:
capricho binarize -i data.csv -o output.csv -cid smiles
This allows you to check for inconsistencies at different levels of molecular identity (e.g., detecting conflicts between stereoisomers when using SMILES, or treating stereoisomers as the same compound when using connectivity).
Examples¶
Basic Binarization¶
# Default threshold of 6.0 (1 µM)
capricho binarize -i aggregated_data.csv -o binarized_data.csv
Custom Threshold¶
# Stringent threshold of 7.0 (100 nM)
capricho binarize -i egfr_data.csv -o egfr_binary.csv -t 7.0
Using Median Values¶
# Use median instead of mean for binarization
capricho binarize \
-i aggregated_data.csv \
-o binary_median.csv \
-vcol pchembl_value_median \
-t 6.5
Compare Across Mutants¶
# Flag conflicts even when measurements are on different mutants
capricho binarize \
-i kinase_data.csv \
-o kinase_binary.csv \
-t 6.5 \
--compare-across-mutants
Resolve Conflicts and Generate Report¶
# Keep exact measurements, drop censored when they conflict
capricho binarize \
-i egfr_data.csv \
-o egfr_binary.csv \
-t 7.0 \
-cr relation \
-rp conflict_report.json
# Measurement-weighted majority vote
capricho binarize \
-i egfr_data.csv \
-o egfr_binary.csv \
-t 7.0 \
-cr majority \
-rp conflict_report.json
Understanding pchembl_relation¶
The output file includes a pchembl_relation column that adjusts the standard_relation signs for the -log scale used in pChEMBL values. This makes it easier to interpret activity thresholds:
Example: For threshold = 6.0 (1 µM)
IC50 with pchembl_value = 6.0 and pchembl_relation = >
→ -log10[IC50 concentration] > 6.0
→ IC50 concentration < 1 µM
→ active (1)
The relation inversion happens because pChEMBL values are negative logarithms:
standard_relation < (low concentration) → pchembl_relation > (high pChEMBL, active)
standard_relation > (high concentration) → pchembl_relation < (low pChEMBL, inactive)
standard_relation = → pchembl_relation = (unchanged)
This column is automatically generated during binarization and helps interpret the relationship between measurements and the activity threshold.
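The inversion amounts to a small lookup. The sketch below is a hypothetical helper, not CAPRICHO's implementation; the entries for `<`, `>`, and `=` come from the rules above, while `<<`, `>>`, and `~` are assumed to follow the same pattern.

```python
# Hypothetical lookup table for illustration.
PCHEMBL_RELATION = {
    "<": ">",    # lower concentration -> higher pChEMBL
    ">": "<",    # higher concentration -> lower pChEMBL
    "=": "=",    # unchanged
    "<<": ">>",  # assumption: mirrors "<"
    ">>": "<<",  # assumption: mirrors ">"
    "~": "~",    # assumption: approximate stays approximate
}


def to_pchembl_relation(standard_relation: str) -> str:
    """Flip a concentration-scale relation onto the -log10 (pChEMBL) scale."""
    try:
        return PCHEMBL_RELATION[standard_relation.strip()]
    except KeyError:
        raise ValueError(f"unsupported relation: {standard_relation!r}") from None


print(to_pchembl_relation("<"))
```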
Output Format¶
The output file contains all original columns plus:
Binary activity column (default: activity_binary): 0 (inactive), 1 (active), or null (missing)
pchembl_relation column: Standard relation adjusted for -log scale (see above)
Conflict flags: Rows with disagreeing measurements are flagged in the data_dropping_comment column
Conflicting measurements are logged with detailed information about the disagreement.
Post-Resolution Deduplication¶
When a conflict resolution strategy is active (-cr), compound-target pairs are deduplicated to one row per pair. During deduplication:
Individual measurements that disagree with the resolved binary label are filtered out from all pipe-separated columns
Rows for the same compound-target pair are merged, concatenating their source values
The standard_relation column becomes pipe-separated to match per-measurement relations (e.g., "=|=|<")
Statistics (*_mean, *_std, *_median, *_counts) are recalculated from the kept measurements only
The resulting pchembl_value and standard_relation columns serve as a register of source values that compose the binarized label.
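A minimal sketch of the filtering and recalculation step is shown below. It is an illustration under assumptions rather than the actual implementation: the function name is hypothetical, a value at or above the threshold is assumed to count as active, and only the mean, median, and count are recomputed.

```python
from statistics import mean, median


def keep_agreeing_measurements(values: str, relations: str, label: int, threshold: float) -> dict:
    """Drop measurements that disagree with the resolved label, then recompute stats."""
    pairs = list(zip(values.split("|"), relations.split("|")))
    # Assumption: value >= threshold means "active" (label 1).
    kept = [(v, r) for v, r in pairs if int(float(v) >= threshold) == label]
    if not kept:
        raise ValueError("no measurement agrees with the resolved label")
    numbers = [float(v) for v, _ in kept]
    return {
        "pchembl_value": "|".join(v for v, _ in kept),
        "standard_relation": "|".join(r for _, r in kept),
        "pchembl_value_mean": mean(numbers),
        "pchembl_value_median": median(numbers),
        "pchembl_value_counts": len(numbers),
    }


# Resolved label is active (1) at threshold 7.0, so the 6.0 measurement is removed.
result = keep_agreeing_measurements("6.0|7.2|7.4", "=|=|<", label=1, threshold=7.0)
print(result)
```

Keeping the original value strings rather than reformatting the floats means the surviving entries stay identical to what was in the input file.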