CLI Reference

CAPRICHO provides five main commands: download, explore, get, prepare, and binarize. This section documents the command-line options for each of them.

capricho download

Downloads the ChEMBL SQL database using chembl_downloader.

capricho download [OPTIONS]

Options

| Option | Description | Default |
|---|---|---|
| --version, -v | ChEMBL version to download | latest |
| --prefix, -p | Custom pystow storage path | ~/.data/chembl/ |

Examples

# Download latest ChEMBL version
capricho download

# Download specific version
capricho download --version 33

# Use a custom storage location (this installs version 25 under ~/.data/old-chembl)
capricho download --version 25 --prefix old-chembl/

capricho explore

Explore the downloaded ChEMBL SQL database.

capricho explore [OPTIONS]

Options

| Option | Description |
|---|---|
| --version, -v | ChEMBL version to use (defaults to latest) |
| --list-tables, -list | List all tables within the SQL database and exit |
| --table, -t | Explore a specific table |
| --search-column, -search | Search for tables containing a column name pattern |
| --query, -q | Run a custom SQL query |
| --format, -f | Console output format for tables (markdown or csv) |
| --output, -o | Save primary result DataFrame to file (format inferred from extension) |
| --colorize / --no-colorize | ANSI color cycling on console table rows |

Examples

# List all available tables in the relational database
capricho explore --list-tables

# Examine the activities table
capricho explore --table activities

# Find tables with 'pchembl' columns
capricho explore --search-column pchembl

# Check available standard_relation types in ChEMBL
capricho explore --query "SELECT standard_relation, COUNT(*) as count FROM activities WHERE standard_relation IS NOT NULL GROUP BY standard_relation ORDER BY count DESC"

Useful Queries for Data Exploration

These queries can help you understand the data before running capricho get:

Count activities by bioactivity type:

capricho explore --query "SELECT standard_type, COUNT(*) as count FROM activities WHERE standard_type IS NOT NULL GROUP BY standard_type ORDER BY count DESC LIMIT 20"

Count activities by assay type:

capricho explore --query "SELECT assay_type, COUNT(*) as count FROM assays GROUP BY assay_type ORDER BY count DESC"

Check confidence score distribution:

capricho explore --query "SELECT confidence_score, COUNT(*) as count FROM assays GROUP BY confidence_score ORDER BY confidence_score DESC"

Check all standard units:

capricho explore --query "SELECT standard_units, COUNT(*) as count FROM activities WHERE standard_units IS NOT NULL GROUP BY standard_units ORDER BY count DESC LIMIT 30"
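To get a feel for the shape of the result these GROUP BY queries return, the sketch below runs the bioactivity-type query against a toy in-memory SQLite table. The table here is a hypothetical stand-in that holds only the standard_type column; the real ChEMBL activities table has many more columns and rows.

```python
import sqlite3

# Toy stand-in for the ChEMBL activities table (hypothetical data,
# only the column that the query needs).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activities (standard_type TEXT)")
conn.executemany(
    "INSERT INTO activities VALUES (?)",
    [("IC50",), ("IC50",), ("Ki",), (None,), ("IC50",), ("Ki",), ("EC50",)],
)

# Same query as in the "Count activities by bioactivity type" example.
query = (
    "SELECT standard_type, COUNT(*) AS count FROM activities "
    "WHERE standard_type IS NOT NULL "
    "GROUP BY standard_type ORDER BY count DESC LIMIT 20"
)
rows = conn.execute(query).fetchall()

# One row per bioactivity type, most frequent first; NULL types are excluded.
for standard_type, count in rows:
    print(f"{standard_type}\t{count}")
```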

capricho get

Filter, download, and process bioactivity data from ChEMBL. This is the main command of CAPRICHO.

capricho get [OPTIONS]

Input ID Options

Specify which ChEMBL entities to retrieve data for:

| Option | Description | Default |
|---|---|---|
| -mids, --molecule-ids | ChEMBL molecule IDs, comma-separated | [] |
| -tids, --target-ids | ChEMBL target IDs, comma-separated | [] |
| -asids, --assay-ids | ChEMBL assay IDs, comma-separated | [] |
| -dids, --document-ids | ChEMBL document IDs, comma-separated | [] |

Filtering Options

Control which bioactivity data to include:

| Option | Description | Default |
|---|---|---|
| -c, --confidence-scores | Confidence scores to filter, comma-separated | [7, 8, 9] |
| -biotype, --bioactivity-type | Bioactivity types to filter, comma-separated | ['Potency', 'Kd', 'Ki', 'IC50', 'AC50', 'EC50'] |
| -rel, --standard-relation | Filter by standard relation (=, <, >, ~), comma-separated. Note: Including < or > requires --calculate-pchembl. See Standard Relations. | ['='] |
| -units, --standard-units | Filter by standard units, comma-separated. Useful for ADMET data with specific units like % (percent inhibition). | None |
| -at, --assay-types | Assay types (B, F, A, T, P), comma-separated | ['B', 'F'] |
| -cr, --chembl-release | Only fetch data reported up to a certain ChEMBL release | None |
| -reqdoc, --require-doc-date | Filter out bioactivities without a document date | False |
| -maxas, --max-assay-size | Maximum number of compounds in an assay | None |
| -minas, --min-assay-size | Minimum number of compounds in an assay | None |
| -maso, --min-assay-overlap | Minimum overlapping compounds between assays | 0 |

Confidence Scores

ChEMBL assigns confidence scores from 0-9 based on target assignment certainty:

  • 9: Direct single protein target assigned

  • 8: Homologous single protein target assigned

  • 7: Direct protein complex subunits assigned

  • 6: Homologous protein complex subunits assigned

  • 5: Multiple direct protein targets may be assigned (e.g., PROTEIN FAMILY)

  • 4: Multiple homologous protein targets may be assigned (e.g., PROTEIN FAMILY)

  • 3: Target assigned is molecular non-protein target

  • 2: Target assigned is subcellular fraction

  • 1: Target assigned is non-molecular

  • 0: Default value - Target assignment has yet to be curated

Bioactivity Types

Common bioactivity measurements:

  • IC50: Half maximal inhibitory concentration

  • EC50: Half maximal effective concentration

  • Ki: Inhibition constant

  • Kd: Dissociation constant

  • AC50: Half maximal activity concentration

  • Potency: General potency measurement (see assay_description for more detail)

Assay Types

  • B: Binding assay

  • F: Functional assay

  • A: ADMET assay

  • T: Toxicity assay

  • P: Physicochemical assay

  • U: Unclassified

Further information on assay types and confidence scores can be found in the ChEMBL documentation.

Processing & Aggregation Options

Control how data is processed and aggregated:

| Option | Description | Default |
|---|---|---|
| -calc, --calculate-pchembl | Calculate pChEMBL values if not reported. Required when using censored data (--standard-relation includes < or >). See Standard Relations. | False |
| -agg-on, --aggregate-on | Column to aggregate statistics on. Use standard_value for non-pChEMBL data (e.g., ADMET assays with % inhibition). See Non-pChEMBL Aggregation. | pchembl_value |
| -conu, --convert-units | Convert units to standard formats before aggregation. See Unit Conversion. | False |
| -chiral, --chirality | Consider chirality during fingerprint calculation | False |
| -duchi, --drop-unassigned-chiral | Drop entries with unassigned chiral centers | False |
| -cure, --curate-annotation-errors | Apply curation for pChEMBL annotation errors | False |
| -mutagg, --aggregate-mutants | Aggregate data on targets regardless of mutation | False |
| -smr, --strict-mutant-removal | Flag assays with mutant-related keywords for removal | False |
| -cpd-eq, --compound-equality | Method for compound equality determination | connectivity |
| -mcols, --metadata-columns | Extra metadata columns to keep, comma-separated | [] |
| -idcols, --id-columns | Extra ID columns for aggregation, comma-separated | [] |

Aggregation Column Options

  • pchembl_value: (Default) Aggregate on pChEMBL values (-log10 molar potency). Uses geometric mean.

  • standard_value: Aggregate on raw standard_value column. Uses arithmetic mean. Useful for ADMET data with non-molar units (%, permeability, etc.).
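One way to read "geometric mean" for pchembl_value is that averaging on the log scale is equivalent to taking the geometric mean of the underlying molar concentrations. The sketch below checks that identity numerically. It is an illustration of the arithmetic under that assumption, not a description of CAPRICHO's internal code, and the three values are made up.

```python
import math
import statistics

# Hypothetical replicate measurements for one compound-target pair.
pchembl_values = [6.0, 7.0, 8.0]              # -log10(molar)
molar = [10 ** -p for p in pchembl_values]     # same measurements in mol/L

# Arithmetic mean on the pChEMBL (log) scale.
mean_pchembl = statistics.mean(pchembl_values)

# Geometric mean on the concentration scale.
geo_mean_molar = statistics.geometric_mean(molar)

# Converting the geometric mean back to the log scale gives the same number.
assert math.isclose(-math.log10(geo_mean_molar), mean_pchembl)
print(mean_pchembl, geo_mean_molar)
```

For standard_value in non-molar units (for example %), there is no such log relationship, which is consistent with the arithmetic mean being used there.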

Compound Equality Methods

  • connectivity: (Default) Based on molecular connectivity (InChI key first block), ignoring stereochemistry

  • mixed_fp: Uses ECFP4 and RDKit fingerprints (each with 2048 bits) for similarity determination

  • smiles: Uses standardized SMILES strings directly for exact string matching
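The connectivity method can be pictured as comparing only the first block of the InChIKey. The sketch below is a minimal illustration with a hypothetical helper name (connectivity_block) and two illustrative keys that share a first block but differ in the second, stereo-sensitive block. How CAPRICHO derives the key internally is not shown here.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first block of an InChIKey (the part before the first hyphen)."""
    return inchikey.split("-")[0]


# Illustrative keys: identical first block, different second block.
key_a = "QNAYBMKLOCPYGJ-REOHSLBNSA-N"
key_b = "QNAYBMKLOCPYGJ-UWTATZPHSA-N"

# Under connectivity-based equality the two entries count as one compound,
# while a comparison on the full key keeps them apart.
same_connectivity = connectivity_block(key_a) == connectivity_block(key_b)
same_full_key = key_a == key_b
print(same_connectivity, same_full_key)
```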

Useful Metadata Columns

  • organism: Source organism

  • tissue: Tissue type

  • cell_type: Cell line information

  • assay_description: Detailed assay description

  • target_type: Type of target (e.g., SINGLE PROTEIN, PROTEIN FAMILY)

Output & Backend Options

Control output format and data source:

| Option | Description | Default |
|---|---|---|
| -o, --output-path | Path to save the output files | chembl_data.csv |
| -skip-agg, --skip-not-aggregated | Skip saving pre-aggregation data | False |
| -rec, --skip-recipe | Skip saving the JSON recipe file | False |
| -back, --chembl-backend | Backend to use for ChEMBL interaction | downloader |
| -v, --chembl-version | ChEMBL version used by chembl_downloader | None |

Backend Options

  • downloader: (Default) Uses local SQL database, faster for large queries

  • webresource: Uses ChEMBL web API, no local download required

Complete Examples

Basic Target Analysis

capricho get --target-ids CHEMBL203 --output-path egfr_analysis.csv

High-Quality Multi-Target Dataset

capricho get \
  --target-ids CHEMBL203,CHEMBL204,CHEMBL279 \
  --confidence-scores 8,9 \
  --bioactivity-type IC50,Ki \
  --standard-relation "=" \
  --aggregate-mutants \
  --metadata-columns organism,tissue \
  --output-path high_quality_dataset.csv

ADMET Data with Unit Conversion

Retrieve Caco-2 permeability data with unit conversion and aggregation on standard_value:

capricho get \
  --assay-ids CHEMBL1112933,CHEMBL3529279,CHEMBL3529278 \
  --assay-types A \
  --confidence-scores 0,1,2,3,4,5,6,7,8,9 \
  --aggregate-on standard_value \
  --convert-units \
  --id-columns standard_units,assay_cell_type \
  --drop-unassigned-chiral \
  --output-path caco2_permeability.csv

This command:

  • Fetches data from specific Caco-2 permeability assays

  • Uses ADMET assay type (-at A)

  • Aggregates on standard_value instead of pChEMBL (permeability isn’t a potency measurement)

  • Converts permeability units to a common format (10^-6 cm/s)

  • Groups by standard_units and assay_cell_type during aggregation
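The unit conversion step amounts to rescaling each value into 10^-6 cm/s before aggregation. The sketch below shows the arithmetic only. The unit labels in the lookup table and the helper name convert_permeability are hypothetical, and the set of units that --convert-units actually recognizes may differ.

```python
from typing import Optional

# Multiplicative factors into 10^-6 cm/s. The unit labels are hypothetical;
# the factors themselves are plain dimensional arithmetic.
TO_1E6_CM_PER_S = {
    "10^-6 cm/s": 1.0,
    "cm/s": 1e6,    # 1 cm/s = 10^6 x 10^-6 cm/s
    "um/s": 100.0,  # 1 um = 10^-4 cm, so 1 um/s = 100 x 10^-6 cm/s
    "nm/s": 0.1,    # 1 nm = 10^-7 cm, so 1 nm/s = 0.1 x 10^-6 cm/s
}


def convert_permeability(value: float, unit: str) -> Optional[float]:
    """Rescale a permeability value to 10^-6 cm/s, or return None for unknown units."""
    factor = TO_1E6_CM_PER_S.get(unit)
    if factor is None:
        return None  # leave unrecognized units for manual inspection
    return value * factor


print(convert_permeability(10.0, "nm/s"))
print(convert_permeability(2.5, "10^-6 cm/s"))
```

Keeping standard_units in --id-columns, as in the command above, means that any values left in an unconverted unit are aggregated separately rather than mixed with converted ones.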

capricho prepare

Clean aggregated bioactivity data by filtering entries based on quality and processing flags introduced during capricho get. Optionally, transform the cleaned data into a multitask activity matrix where rows are compounds and columns are tasks (e.g., targets).

capricho prepare [OPTIONS]

Required Options

| Option | Description |
|---|---|
| -i, --input-path | Path to aggregated data file (CSV, TSV, or Parquet) |
| -o, --output-path | Path to save the output file |

Quality Flag Filtering Options

These flags remove entries with specific quality concerns. Each flag corresponds to a comment added during capricho get:

| Option | Description | Default |
|---|---|---|
| --drop-undefined-stereo | Drop entries with undefined stereochemistry | False |
| --drop-potential-duplicate | Drop entries flagged as potential duplicates across documents | False |
| --drop-data-validity | Drop entries with data validity comments from ChEMBL | False |
| --drop-unit-error | Drop entries with unit annotation errors (3.0 or 6.0 log unit differences) | False |
| --drop-mixture | Drop entries containing mixtures in SMILES | False |
| --drop-assay-size | Drop entries outside assay size bounds (both too small and too large) | False |
| --drop-insufficient-overlap | Drop entries from assays with insufficient compound overlap | False |
| --remove-flags | Custom quality flags to remove, comma-separated. Rows with these flags in data_dropping_comment will be filtered out. | None |

Data Cleaning Options

| Option | Description | Default |
|---|---|---|
| --deduplicate | Remove duplicate pChEMBL values within aggregated rows and recalculate statistics | False |
| --resolve-annotation-error | Resolve unit annotation errors by keeping the measurement from the earliest document. Pass first to enable. | None |

Activity Matrix Options

These options control the optional multitask activity matrix output:

| Option | Description | Default |
|---|---|---|
| --task-col | Column to use as task identifier | target_chembl_id |
| --compound-col | Column for compound identity (connectivity or smiles) | connectivity |
| --smiles-col | Column containing SMILES strings | smiles |
| -agg-on, --aggregate-on | Column that was aggregated on during capricho get. Derives the value column as {aggregate_on}_mean. | pchembl_value |
| --id-columns | Extra columns to combine with task_col for composite task identifiers. Use the same columns passed to capricho get --id-columns during aggregation. | None |

Output Options

| Option | Description | Default |
|---|---|---|
| --plot | Path to save comparability plots (e.g., comparability.png). If not provided, no plot is generated. | None |

Understanding Quality Flags

Quality flags are added to the data_dropping_comment column during capricho get. Common flags include:

  • Undefined stereochemistry: Compounds with unassigned chiral centers

  • Potential duplicate: Quality flag introduced by the ChEMBL team, indicating that a compound-target pair is reported in multiple documents with identical values

  • Data validity comment: ChEMBL’s own data quality annotations

  • Unit annotation error: Measurements differing by exactly 3.0 or 6.0 log units (suggesting unit conversion errors)

  • Mixture in SMILES: SMILES containing multiple components (. separator)

  • Assay size too small/large: Assays outside the specified size bounds

  • Insufficient assay overlap: Assays without enough shared compounds for reliable comparison
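The unit annotation error flag rests on simple arithmetic: a factor of 1,000 between two reported concentrations (for example nM recorded as µM) shows up as a 3.0 log unit gap in pChEMBL, and a factor of 1,000,000 as a 6.0 log unit gap. The sketch below checks a set of values for such gaps. The function name and the floating-point tolerance are assumptions made for illustration, not CAPRICHO's actual implementation.

```python
from itertools import combinations


def has_unit_annotation_error(pchembl_values, tol=1e-6):
    """Return True if any two values differ by 3.0 or 6.0 log units (within tol)."""
    for a, b in combinations(pchembl_values, 2):
        diff = abs(a - b)
        if any(abs(diff - step) <= tol for step in (3.0, 6.0)):
            return True
    return False


# Hypothetical measurements for a single compound-target pair.
print(has_unit_annotation_error([5.3, 8.3]))       # gap of 3.0 log units
print(has_unit_annotation_error([6.0, 6.5, 6.8]))  # ordinary experimental spread
```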

Output Files

The prepare command generates two files:

  • Prepared data (*_prepared.csv): The cleaned data after filtering quality flags

  • Activity matrix (specified by -o): Rows are compounds (indexed by compound_col), columns are tasks, plus a smiles column. Suitable for multitask ML models.
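The activity matrix layout can be sketched as a pivot from long-format rows to one row per compound. The example below uses the default column names from the options table and plain dictionaries. The compound identifiers and values are hypothetical, and missing compound-task combinations are represented as None as an assumption about how gaps are encoded.

```python
def to_activity_matrix(rows, compound_col="connectivity", task_col="target_chembl_id",
                       smiles_col="smiles", value_col="pchembl_value_mean"):
    """Pivot long-format rows into {compound: {smiles, task_1, task_2, ...}}."""
    tasks = sorted({r[task_col] for r in rows})
    matrix = {}
    for r in rows:
        # First time a compound is seen: create its row with every task empty.
        entry = matrix.setdefault(
            r[compound_col],
            {smiles_col: r[smiles_col], **{t: None for t in tasks}},
        )
        entry[r[task_col]] = r[value_col]
    return tasks, matrix


# Hypothetical prepared data: two compounds, two targets, one missing pair.
rows = [
    {"connectivity": "CPD_A", "smiles": "CCO", "target_chembl_id": "CHEMBL203", "pchembl_value_mean": 7.1},
    {"connectivity": "CPD_A", "smiles": "CCO", "target_chembl_id": "CHEMBL204", "pchembl_value_mean": 5.4},
    {"connectivity": "CPD_B", "smiles": "CCN", "target_chembl_id": "CHEMBL203", "pchembl_value_mean": 6.2},
]
tasks, matrix = to_activity_matrix(rows)
for compound, entry in matrix.items():
    print(compound, entry)
```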

Examples

Basic Preparation

# Clean data and output activity matrix
capricho prepare -i egfr_data.csv -o egfr_matrix.csv

Filtering Quality Flags

# Remove potential duplicates and data validity issues
capricho prepare -i egfr_data.csv -o egfr_clean.csv \
    --drop-potential-duplicate \
    --drop-data-validity

Strict Quality Filtering

# Apply multiple quality filters for high-confidence data
capricho prepare -i kinase_data.csv -o kinase_clean.csv \
    --drop-potential-duplicate \
    --drop-data-validity \
    --drop-unit-error \
    --drop-undefined-stereo

With Deduplication

# Remove duplicate values and recalculate statistics
capricho prepare -i data.csv -o clean_data.csv \
    --deduplicate \
    --drop-potential-duplicate

Resolve Annotation Errors

# Keep earliest measurement when annotation errors are detected
capricho prepare -i data.csv -o clean_data.csv \
    --resolve-annotation-error first \
    --drop-unit-error

Generate Comparability Plots

# Output plots showing data comparability across assays
capricho prepare -i data.csv -o clean_data.csv \
    --drop-potential-duplicate \
    --plot comparability.png

This generates two plots:

  • comparability_cleaned.png: Comparability of data after filtering

  • comparability_flags.png: Multi-panel view showing remaining flags

Composite Task Identifiers

# Use id-columns if data was aggregated with --id-columns
capricho prepare -i data.csv -o matrix.csv \
    --id-columns assay_cell_type,standard_units

Custom Flag Removal

# Remove entries with custom flags
capricho prepare -i data.csv -o clean_data.csv \
    --remove-flags "Censored activity comment,Mutant assay"
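Conceptually, flag removal is a row filter on the data_dropping_comment column. The sketch below keeps only rows whose comment contains none of the requested flags. Matching by substring is an assumption made for illustration, since the exact layout of data_dropping_comment is not described here, and the rows are made up.

```python
def drop_flagged(rows, flags, comment_col="data_dropping_comment"):
    """Keep rows whose comment column contains none of the given flag strings."""
    kept = []
    for row in rows:
        comment = row.get(comment_col) or ""  # treat missing comments as empty
        if not any(flag in comment for flag in flags):
            kept.append(row)
    return kept


# Hypothetical aggregated rows.
rows = [
    {"connectivity": "CPD_A", "data_dropping_comment": None},
    {"connectivity": "CPD_B", "data_dropping_comment": "Mutant assay"},
    {"connectivity": "CPD_C", "data_dropping_comment": "Potential duplicate"},
]
clean = drop_flagged(rows, ["Censored activity comment", "Mutant assay"])
print([r["connectivity"] for r in clean])
```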

capricho binarize

Convert aggregated bioactivity data to binary labels (active/inactive) based on a pChEMBL threshold. This command handles censored measurements (< and >) and validates agreement between different measurement types.

capricho binarize [OPTIONS]

Required Options

| Option | Description |
|---|---|
| -i, --input-path | Path to aggregated data file (CSV, TSV, or Parquet) |
| -o, --output-path | Path to save the binarized output file |

Binarization Options

| Option | Description | Default |
|---|---|---|
| -t, --threshold | Activity threshold for binarization (pChEMBL scale) | 6.0 (1 µM) |
| -vcol, --value-column | Column to use for binarization | pchembl_value_mean |
| -cid, --compound-id-col | Column name for compound identifiers | connectivity |
| -tid, --target-id-col | Column name for target identifiers | target_chembl_id |
| -rel, --relation-col | Column name for standard_relation values | standard_relation |
| -bcol, --binary-col | Name for the output binary column | activity_binary |
| -cmp-mut, --compare-across-mutants | Compare measurements across mutants for conflicts | False |
| -rp, --conflict-report-path | Path to save detailed conflict report as JSON | None |
| -cr, --conflict-resolution | Strategy for resolving conflicts: drop, relation, confidence, majority | None |

Understanding pChEMBL Thresholds

The pChEMBL scale is -log10(Molar), where higher values indicate higher activity:

  • pChEMBL 6.0 = 1 µM (common threshold)

  • pChEMBL 6.5 = 316 nM

  • pChEMBL 7.0 = 100 nM (stringent threshold)

  • pChEMBL 5.0 = 10 µM (permissive threshold)
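The threshold values above follow directly from the definition of the scale. A minimal sketch of the conversions, with hypothetical helper names, is:

```python
import math


def pchembl_to_molar(pchembl: float) -> float:
    """Concentration in mol/L for a given pChEMBL value."""
    return 10 ** -pchembl


def pchembl_to_nm(pchembl: float) -> float:
    """Concentration in nM (1 M = 10^9 nM)."""
    return 10 ** (9 - pchembl)


def molar_to_pchembl(molar: float) -> float:
    """pChEMBL value for a concentration given in mol/L."""
    return -math.log10(molar)


for threshold in (5.0, 6.0, 6.5, 7.0):
    print(f"pChEMBL {threshold} = {pchembl_to_nm(threshold):.0f} nM")
```

Because the scale is logarithmic, raising the threshold by one unit makes the cutoff ten times more stringent in concentration terms.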

Handling Standard Relations

The binarization process handles different measurement types:

  • = (discrete): Direct comparison to threshold

  • ~ (approximate): Uses lower bound for conservative classification (±0.5 log units)

  • <, << (censored active): Compound is more active than reported value

  • >, >> (censored inactive): Compound is less active than reported value
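The rules above can be sketched as a small decision function. This is one possible reading, not CAPRICHO's actual code: it assumes that "active" means a pChEMBL value at or above the threshold, that the lower bound for ~ is the value minus 0.5, and that a censored measurement which cannot settle the question yields None (the missing label mentioned under Output Format).

```python
from typing import Optional


def binarize(pchembl: float, standard_relation: str, threshold: float = 6.0) -> Optional[int]:
    """Return 1 (active), 0 (inactive), or None (cannot be decided)."""
    if standard_relation == "=":
        return int(pchembl >= threshold)
    if standard_relation == "~":
        # Conservative: classify using the lower bound of the +/-0.5 window.
        return int(pchembl - 0.5 >= threshold)
    if standard_relation in ("<", "<<"):
        # Concentration is below the reported value, so the true pChEMBL is higher.
        return 1 if pchembl >= threshold else None
    if standard_relation in (">", ">>"):
        # Concentration is above the reported value, so the true pChEMBL is lower.
        return 0 if pchembl <= threshold else None
    return None  # unknown relation


print(binarize(6.3, "="), binarize(6.3, "~"), binarize(6.3, "<"), binarize(6.3, ">"))
```

The None cases show why censored values are only informative on one side of the threshold: a "more active than 10 µM" result says nothing about whether the compound passes a 1 µM cutoff.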

Conflict Detection and Resolution

The command flags measurements that disagree for the same compound-target pair:

  • Mixed discrete/censored conflicts: When discrete measurements (=, ~) disagree with censored measurements (<, >)

  • Binary label conflicts: When measurements result in different activity classifications (active vs inactive)

  • Mutation handling: Use --compare-across-mutants to control whether different mutants are compared

Conflict Resolution Strategies

By default, conflicts are only flagged in data_dropping_comment. Use --conflict-resolution to automatically resolve them:

| Strategy | Behavior | Fallback |
|---|---|---|
| drop | Remove all rows for conflicting pairs | |
| relation | Keep exact (=) rows, drop censored | Drop all if no = exists |
| confidence | Keep row with highest confidence_score | Drop all on tie |
| majority | Classify each individual measurement against the threshold; majority label wins | Drop all on tie |

The majority strategy splits the pipe-separated raw values (e.g., pchembl_value = "6.0|6.5|7.0") and classifies each individual measurement against the threshold. Each measurement gets one vote, so a row whose mean is above the threshold but which contains individual values below it will contribute some inactive votes. When the raw value column is not present, the strategy falls back to count-weighted or row-based voting.

# Resolve conflicts by keeping exact measurements
capricho binarize -i data.csv -o binary.csv -t 7.0 -cr relation

# Resolve by measurement-weighted majority vote, save report
capricho binarize -i data.csv -o binary.csv -t 7.0 -cr majority -rp conflicts.json
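The per-measurement vote behind the majority strategy can be sketched as follows. The function name is hypothetical, and treating values at or above the threshold as active is an assumption. Relation handling and the count-weighted fallback are left out to keep the example short.

```python
from typing import List, Optional


def majority_label(raw_value_cells: List[str], threshold: float) -> Optional[int]:
    """Vote over every pipe-separated measurement of one compound-target pair.

    Returns 1 (active), 0 (inactive), or None on a tie (pair would be dropped).
    """
    votes = [
        float(value) >= threshold
        for cell in raw_value_cells
        for value in cell.split("|")
        if value.strip()
    ]
    active = sum(votes)
    inactive = len(votes) - active
    if active == inactive:
        return None
    return int(active > inactive)


# One row with three measurements: one active vote, two inactive votes.
print(majority_label(["6.0|6.5|7.0"], threshold=7.0))
# Two conflicting rows for the same pair: votes are pooled across rows.
print(majority_label(["7.5|7.1", "6.2"], threshold=7.0))
```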

Conflict Report

Use -rp / --conflict-report-path to save a JSON report with:

  • Summary: Total conflicts, conflict patterns (exact vs censored), active/inactive counts, MCC, resolution summary

  • Per-conflict details: Measurements, vote summary, severity (low/medium/high based on measurement spread), recommendation, resolution outcome

Compound Identifiers for Conflict Detection

By default, conflicts are detected using the connectivity column (InChI key connectivity layer), which groups compounds by their molecular graph ignoring stereochemistry. You can use a different identifier:

  • connectivity (default): Groups by connectivity layer, ignoring stereochemistry

  • smiles: Groups by standardized SMILES, which may be more or less permissive depending on your aggregation settings

To use SMILES for conflict detection, specify -cid smiles:

capricho binarize -i data.csv -o output.csv -cid smiles

This allows you to check for inconsistencies at different levels of molecular identity (e.g., detecting conflicts between stereoisomers when using SMILES, or treating stereoisomers as the same compound when using connectivity).

Examples

Basic Binarization

# Default threshold of 6.0 (1 µM)
capricho binarize -i aggregated_data.csv -o binarized_data.csv

Custom Threshold

# Stringent threshold of 7.0 (100 nM)
capricho binarize -i egfr_data.csv -o egfr_binary.csv -t 7.0

Using Median Values

# Use median instead of mean for binarization
capricho binarize \
  -i aggregated_data.csv \
  -o binary_median.csv \
  -vcol pchembl_value_median \
  -t 6.5

Compare Across Mutants

# Flag conflicts even when measurements are on different mutants
capricho binarize \
  -i kinase_data.csv \
  -o kinase_binary.csv \
  -t 6.5 \
  --compare-across-mutants

Resolve Conflicts and Generate Report

# Keep exact measurements, drop censored when they conflict
capricho binarize \
  -i egfr_data.csv \
  -o egfr_binary.csv \
  -t 7.0 \
  -cr relation \
  -rp conflict_report.json

# Measurement-weighted majority vote
capricho binarize \
  -i egfr_data.csv \
  -o egfr_binary.csv \
  -t 7.0 \
  -cr majority \
  -rp conflict_report.json

Understanding pchembl_relation

The output file includes a pchembl_relation column that adjusts the standard_relation signs for the -log scale used in pChEMBL values. This makes it easier to interpret activity thresholds:

Example: For threshold = 6.0 (1 µM)

  • IC50 with pchembl_value = 6.0 and pchembl_relation = >

    • -log10[IC50 concentration] > 6.0

    • IC50 concentration < 1 µM

    • active (1)

The relation inversion happens because pChEMBL values are negative logarithms:

  • standard_relation < (low concentration) → pchembl_relation > (high pChEMBL, active)

  • standard_relation > (high concentration) → pchembl_relation < (low pChEMBL, inactive)

  • standard_relation = → pchembl_relation = (unchanged)

This column is automatically generated during binarization and helps interpret the relationship between measurements and the activity threshold.
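The inversion can be written as a lookup table. The mappings for <, >, and = come from the list above; the entries for ~, <<, and >> are assumptions that extend the same pattern, and the helper name is hypothetical.

```python
# standard_relation (concentration scale) -> pchembl_relation (-log10 scale)
PCHEMBL_RELATION = {
    "=": "=",
    "<": ">",
    ">": "<",
    "~": "~",    # assumed unchanged, like "="
    "<<": ">>",  # assumed to mirror "<"
    ">>": "<<",  # assumed to mirror ">"
}


def to_pchembl_relation(standard_relation: str):
    """Flip an inequality for the -log10 scale; None for unknown relations."""
    return PCHEMBL_RELATION.get(standard_relation)


for relation in ("=", "<", ">"):
    print(relation, "->", to_pchembl_relation(relation))
```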

Output Format

The output file contains all original columns plus:

  • Binary activity column (default: activity_binary): 0 (inactive), 1 (active), or null (missing)

  • pchembl_relation column: Standard relation adjusted for -log scale (see above)

  • Conflict flags: Rows with disagreeing measurements are flagged in the data_dropping_comment column

Conflicting measurements are logged with detailed information about the disagreement.

Post-Resolution Deduplication

When a conflict resolution strategy is active (-cr), compound-target pairs are deduplicated to one row per pair. During deduplication:

  • Individual measurements that disagree with the resolved binary label are filtered out from all pipe-separated columns

  • Rows for the same compound-target pair are merged, concatenating their source values

  • The standard_relation column becomes pipe-separated to match per-measurement relations (e.g., "=|=|<")

  • Statistics (*_mean, *_std, *_median, *_counts) are recalculated from the kept measurements only

The resulting pchembl_value and standard_relation columns serve as a register of source values that compose the binarized label.
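As a rough illustration of the filtering and recalculation steps, the sketch below keeps only the measurements that agree with a resolved label and recomputes the summary statistics. It makes several simplifying assumptions: agreement is judged by comparing each value to the threshold without regard to its relation, the standard deviation is the sample standard deviation, and the function name is hypothetical.

```python
import statistics


def filter_measurements(pchembl_value: str, standard_relation: str,
                        label: int, threshold: float) -> dict:
    """Drop measurements that disagree with the resolved label, then recompute stats."""
    pairs = zip(pchembl_value.split("|"), standard_relation.split("|"))
    kept = [(float(v), r) for v, r in pairs if int(float(v) >= threshold) == label]
    values = [v for v, _ in kept]
    return {
        "pchembl_value": "|".join(str(v) for v in values),
        "standard_relation": "|".join(r for _, r in kept),
        "pchembl_value_mean": statistics.mean(values) if values else None,
        "pchembl_value_median": statistics.median(values) if values else None,
        "pchembl_value_std": statistics.stdev(values) if len(values) > 1 else None,
        "pchembl_value_counts": len(values),
    }


# Pair resolved as active (1) at threshold 7.0: the 6.0 measurement is removed.
result = filter_measurements("6.0|7.2|7.4", "=|=|<", label=1, threshold=7.0)
print(result)
```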