Key Concepts

Understanding these core concepts will help you use CAPRICHO effectively and make informed decisions about data curation.

Compound Equality

One of the most important decisions in bioactivity data analysis is determining when two compound entries represent the same molecule.

Connectivity-Based (Default)

The connectivity method identifies compounds by their molecular graph. It’s based on the first 14 characters of the InChIKey, which encode atom connectivity but ignore stereochemistry and mobile-hydrogen tautomerism:

capricho get --target-ids CHEMBL203 --compound-equality connectivity

Because the connectivity layer does not encode stereochemistry, stereoisomers (e.g., R/S enantiomers) are merged during aggregation. To reflect this, CAPRICHO strips stereochemistry from the output SMILES when using connectivity mode, preventing the output from misleadingly retaining an arbitrary enantiomer’s representation.
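To make the merging behaviour concrete, here is a minimal Python sketch that groups a few InChIKeys by their first block. It is an illustration of the idea only, not CAPRICHO's implementation; the helper name connectivity_key is hypothetical, and the InChIKeys are quoted for illustration and worth checking against a reference database.

```python
from collections import defaultdict

# (name, InChIKey) pairs; the first 14 characters hash the connectivity layer,
# the second block carries stereochemistry and other layers.
records = [
    ("L-alanine", "QNAYBMKLOCPYGJ-REOHSFHJSA-N"),
    ("D-alanine", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"),
    ("glycine", "DHMQDGOQFOQNFH-UHFFFAOYSA-N"),
]


def connectivity_key(inchikey: str) -> str:
    """Return the 14-character connectivity block of an InChIKey (hypothetical helper)."""
    return inchikey.split("-")[0]


# Compounds sharing a connectivity block are treated as the same entry.
groups = defaultdict(list)
for name, inchikey in records:
    groups[connectivity_key(inchikey)].append(name)

for key, names in groups.items():
    print(key, names)
```

Both alanine enantiomers fall into one group because they differ only in the second InChIKey block, while glycine stays separate.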

Advantages:

  • Robust to tautomers: InChI normalizes mobile-hydrogen tautomers (e.g., amide/imidic acid shifts between N, O, S atoms), so tautomeric forms are correctly identified as the same compound

  • Deterministic and computationally lightweight

Limitations:

  • Merges stereoisomers that may have different biological activity

  • Does not normalize all tautomers

Use When:

  • Stereochemistry information is unreliable or inconsistent across sources

  • You want to combine data from different stereoisomers and tautomers into a single entry

Fingerprint-Based

The mixed_fp method uses a concatenation of ECFP4 (Morgan, radius 2) and RDKit path-based fingerprints to determine compound identity:

capricho get --target-ids CHEMBL203 --compound-equality mixed_fp

Combining two fingerprint types covers some failure modes of each individual method: ECFP4 captures circular substructure environments, while RDKit fingerprints capture path-based features. Of the two, only ECFP4 can encode stereochemistry, and it does so only when the --chirality flag is set.

Advantages:

  • Maintains stereochemical distinctions (with --chirality)

Limitations:

  • Sensitive to tautomers: different tautomeric forms produce different fingerprint bit vectors, potentially treating the same compound as two different entries

  • Susceptible to bit collisions in molecules with repetitive structural patterns (e.g., peptides, long lipid chains), where distinct substructures may hash to the same bits

  • Computationally heavier than connectivity

Use When:

  • Stereochemistry is important for your analysis

  • Building focused datasets for SAR studies

  • Working with data where tautomer variation is minimal

SMILES-Based

The smiles method uses standardized SMILES strings directly for exact matching:

capricho get --target-ids CHEMBL203 --compound-equality smiles

Advantages:

  • Simple and transparent matching logic

  • No additional computation (InChI or fingerprints)

Limitations:

  • Sensitive to tautomers: different tautomeric forms produce different SMILES strings

  • Relies entirely on the standardization performed by the ChEMBL structure pipeline

ChEMBL Backends

CAPRICHO supports two different ways to access ChEMBL data, each with distinct advantages.

Local Database Backend (Default)

The downloader backend uses a local SQLite database:

capricho download  # One-time setup
capricho get --target-ids CHEMBL203 --chembl-backend downloader

Advantages:

  • Much faster for large queries

  • Works offline after initial download

  • Consistent performance

  • Full SQL query capabilities

Requirements:

  • ~25GB disk space for full ChEMBL database

  • Initial download time

Best For:

  • Large-scale data mining

  • Repeated queries

  • Complex filtering requirements

  • Offline analysis

Web API Backend

The webresource backend queries the live ChEMBL web API:

capricho get --target-ids CHEMBL203 --chembl-backend webresource

Advantages:

  • No local storage required

  • Always uses latest data

  • No setup time

  • Good for small queries

Limitations:

  • Slower for large queries

  • Requires internet connection

  • Subject to API rate limits

Best For:

  • Small, targeted queries

  • One-off analyses

  • When disk space is limited

Data Flagging

A core principle of CAPRICHO is never silently dropping data. Instead of removing problematic entries during the get command, CAPRICHO flags them and lets users decide what to filter during the prepare step.

Two Types of Flags

CAPRICHO maintains two separate flag columns:

data_dropping_comment - Quality flags indicating data concerns:

  • Potential duplicates across documents

  • Data validity comments from ChEMBL

  • Unit annotation errors (3.0 or 6.0 log unit differences)

  • Patent sources

  • Undefined stereochemistry

  • Assay size issues

  • Insufficient assay overlap

These include the "maximal curation" standards from Landrum & Riniker (2024).

data_processing_comment - Processing flags documenting transformations:

  • Salt/solvent removal from SMILES

  • SMILES standardization

  • pChEMBL calculation

  • Unit conversions
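The "unit annotation errors" flag listed under data_dropping_comment refers to pairs of values that differ by 3.0 or 6.0 log units, the pattern left behind when, for example, nM is recorded as µM. A minimal Python sketch of such a check follows. The function name and the tolerance are assumptions made for illustration, not CAPRICHO's actual logic.

```python
def looks_like_unit_error(pchembl_a: float, pchembl_b: float, tol: float = 0.05) -> bool:
    """Return True when two values differ by roughly 3 or 6 log units.

    The tolerance `tol` is a hypothetical parameter chosen for this sketch.
    """
    diff = abs(pchembl_a - pchembl_b)
    return any(abs(diff - step) <= tol for step in (3.0, 6.0))


# Two reports for the same compound-target pair, one likely mis-annotated.
print(looks_like_unit_error(9.0, 6.0))
print(looks_like_unit_error(7.2, 6.9))
```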

The Two-Phase Workflow

  1. capricho get: Fetches and curates data, adding flags to all entries

  2. capricho prepare: Filters data based on quality flags according to your project’s needs

This separation ensures transparency - you can inspect flagged data before deciding what to remove:

# Step 1: Get data with all flags
capricho get --target-ids CHEMBL203 -o egfr_data.csv

# Step 2: Filter based on your quality requirements
capricho prepare -i egfr_data.csv -o egfr_clean.csv \
    --drop-potential-duplicate \
    --drop-data-validity
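Because flagged rows are kept in the output of capricho get, they can be inspected with ordinary tooling before running prepare. The Python sketch below splits a small table into flagged and unflagged rows using only the standard library. The compound identifiers and the flag wording are invented for illustration; only the data_dropping_comment column name comes from the text above.

```python
import csv
import io

# Hypothetical excerpt of a `capricho get` output file; flag texts are made up.
raw = """molecule_id,pchembl_value,data_dropping_comment
CPD_A,7.1,
CPD_B,6.4,Potential duplicate
CPD_C,5.9,Data validity comment
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Rows with a non-empty quality flag are candidates for removal in `prepare`.
flagged = [r for r in rows if r["data_dropping_comment"]]
clean = [r for r in rows if not r["data_dropping_comment"]]

for r in flagged:
    print(r["molecule_id"], "->", r["data_dropping_comment"])
```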

Data Aggregation

CAPRICHO provides several options for handling duplicate measurements and aggregating data.

Target Mutations

By default, CAPRICHO treats different target mutations as separate entities. Use --aggregate-mutants to combine them:

capricho get --target-ids CHEMBL203 --aggregate-mutants

This is useful when you want to study the target in general rather than specific mutations.

Metadata Columns

Include additional metadata in your analysis:

capricho get --target-ids CHEMBL203 --metadata-columns organism,tissue,cell_type

These columns are preserved during aggregation and can help you understand data heterogeneity.

Non-pChEMBL Aggregation

By default, CAPRICHO aggregates bioactivity data using the pchembl_value column, which represents -log10(molar) potency values. However, many ChEMBL assays (especially ADMET assays) report measurements that aren’t suitable for pChEMBL conversion, such as:

  • Permeability (e.g., Caco-2 apparent permeability in cm/s)

  • Percent inhibition (e.g., % inhibition at a fixed concentration)

  • Clearance (e.g., mL/min/kg)

  • Half-life (e.g., hours)

  • Solubility (e.g., µg/mL)

For these measurements, use --aggregate-on standard_value:

capricho get \
  --assay-ids CHEMBL1112933,CHEMBL3529279 \
  --assay-types A \
  --aggregate-on standard_value \
  --output-path permeability_data.csv

Key Differences from pChEMBL Aggregation

| Aspect | pchembl_value (default) | standard_value |
|---|---|---|
| Mean type | Geometric mean | Arithmetic mean |
| Units | Always molar (-log10) | Original units preserved |
| Use case | Potency measurements (IC50, Ki, etc.) | ADMET, physicochemical properties |
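One way to read the "Mean type" row: averaging values on the -log10 scale corresponds to a geometric mean in concentration space, whereas averaging raw values is an arithmetic mean. The Python sketch below works through this with two hypothetical IC50 values. It illustrates the arithmetic only and assumes nothing about CAPRICHO's internals.

```python
import math

ic50_nm = [10.0, 1000.0]  # two hypothetical IC50 measurements in nM

# pChEMBL = -log10(molar concentration) = 9 - log10(concentration in nM)
pchembl = [9.0 - math.log10(v) for v in ic50_nm]
mean_pchembl = sum(pchembl) / len(pchembl)

# Back-transforming the log-scale mean gives the geometric mean concentration.
geometric_mean_nm = 10 ** (9.0 - mean_pchembl)

# Averaging the raw concentrations gives the arithmetic mean instead.
arithmetic_mean_nm = sum(ic50_nm) / len(ic50_nm)

print(f"mean pChEMBL: {mean_pchembl:.2f}")
print(f"geometric mean: {geometric_mean_nm:.1f} nM")
print(f"arithmetic mean: {arithmetic_mean_nm:.1f} nM")
```

The two means differ by a factor of about five here, which is why the choice of scale matters for potency data spread over several orders of magnitude.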

Important: Preventing Unit Mixing

When aggregating on standard_value, ensure you don’t inadvertently combine measurements with different units. Use --id-columns to group by unit type:

capricho get \
  --assay-ids CHEMBL1112933,CHEMBL3529279 \
  --aggregate-on standard_value \
  --id-columns standard_units \
  --output-path permeability_data.csv

This creates separate aggregations for measurements with different standard_units values.
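The effect of adding standard_units as an identifier column can be sketched in Python as a group-by over (compound, units). The records and the grouping code below are hypothetical and only illustrate why values in different units end up in separate aggregates.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical measurements: (compound, standard_value, standard_units)
measurements = [
    ("CPD_A", 12.0, "10'-6 cm/s"),
    ("CPD_A", 18.0, "10'-6 cm/s"),
    ("CPD_A", 150.0, "nm/s"),
]

# Including the unit in the key keeps incompatible values apart.
groups = defaultdict(list)
for compound, value, units in measurements:
    groups[(compound, units)].append(value)

aggregated = {key: mean(values) for key, values in groups.items()}

for (compound, units), value in aggregated.items():
    print(compound, units, value)
```

Without the unit in the key, all three values would be averaged together even though the third is on a different scale.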

Unit Conversion

ChEMBL contains bioactivity data reported in many different units, even for the same measurement type. For example, permeability might be reported as cm/s, nm/s, or 10^-6 cm/s. This heterogeneity makes cross-study aggregation challenging.

The --convert-units flag enables automatic unit conversion before aggregation:

capricho get \
  --assay-ids CHEMBL1112933,CHEMBL3529279 \
  --aggregate-on standard_value \
  --convert-units \
  --output-path permeability_data.csv

Supported Unit Families

| Family | Target Unit | Source Units |
|---|---|---|
| Permeability | 10^-6 cm/s | cm/s, nm/s, ucm/s, 10'-6 cm/s, etc. |
| Molar concentration | nM | uM, µM, mM, pM, M |
| Mass concentration | ug/mL | ng/ml, mg/ml, mg/L, pg/ml |
| Dose | mg/kg | ug/kg, ug.kg-1, mg.kg-1 |
| Time | hr | min, s, ms, day |

Transparency

All unit conversions are logged and tracked in the data_processing_comment column. This ensures you can always trace which measurements were converted and by what factor.
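As an illustration of what a traceable conversion could look like, the Python sketch below converts permeability values to 10^-6 cm/s and returns a comment recording the factor. The factor table follows from the unit definitions (1 nm/s = 0.1 × 10^-6 cm/s), but the function name and the comment wording are assumptions, not CAPRICHO's actual output.

```python
# Multiplicative factors to reach 10^-6 cm/s, derived from the unit definitions.
TO_MICRO_CM_PER_S = {
    "cm/s": 1e6,
    "nm/s": 0.1,
    "ucm/s": 1.0,
    "10'-6 cm/s": 1.0,
}


def convert_permeability(value: float, units: str):
    """Return (converted value, target unit, processing comment); hypothetical helper."""
    factor = TO_MICRO_CM_PER_S[units]
    comment = ""
    if factor != 1.0:
        # Record the source unit and factor so the conversion stays traceable.
        comment = f"Converted from {units} to 10^-6 cm/s (factor {factor:g})"
    return value * factor, "10^-6 cm/s", comment


print(convert_permeability(150.0, "nm/s"))
print(convert_permeability(2e-5, "cm/s"))
```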

Example: Caco-2 Permeability Dataset

capricho get \
  --assay-ids CHEMBL1112933,CHEMBL3529279,CHEMBL3529278 \
  --assay-types A \
  --confidence-scores 0,1,2,3,4,5,6,7,8,9 \
  --aggregate-on standard_value \
  --convert-units \
  --id-columns standard_units,assay_cell_type \
  --drop-unassigned-chiral \
  --output-path caco2_permeability.csv

This command:

  1. Fetches permeability data from specific Caco-2 assays

  2. Converts all permeability units to 10^-6 cm/s

  3. Aggregates using arithmetic mean on standard_value

  4. Groups by cell type to maintain biological context

Confidence Scoring

ChEMBL assigns confidence scores (0-9) based on target assignment certainty. See the ChEMBL documentation for full details.

| Score | Description |
|---|---|
| 9 | Direct single protein target assigned |
| 8 | Homologous single protein target assigned |
| 7 | Direct protein complex subunits assigned |
| 6 | Homologous protein complex subunits assigned |
| 5 | Multiple direct protein targets may be assigned (e.g., PROTEIN FAMILY) |
| 4 | Multiple homologous protein targets may be assigned (e.g., PROTEIN FAMILY) |
| 3 | Target assigned is molecular non-protein target |
| 2 | Target assigned is subcellular fraction |
| 1 | Target assigned is non-molecular |
| 0 | Default value - Target assignment has yet to be curated |

Standard Relations and Censored Data

ChEMBL bioactivity measurements include a standard_relation field that indicates the relationship between the measured value and the reported concentration.

Relation Types

  • =: The measured value equals the reported concentration

  • <: The compound is active at concentrations below the reported value (active at the reported concentration)

  • <<: Stronger indication that the compound is active at concentrations well below the reported value (active at the reported concentration)

  • >: The compound is active at concentrations above the reported value (inactive at the reported concentration)

  • >>: Stronger indication that the compound is active at concentrations well above the reported value (inactive at the reported concentration)

  • ~: Approximate measurement (CAPRICHO handles these as ±0.5 log units)

Working with Censored Data

Important: ChEMBL only pre-calculates pchembl_value for exact measurements (standard_relation='='). To include censored data (<, >), you must use the --calculate-pchembl flag:

# Default: Only includes exact measurements (=)
capricho get --target-ids CHEMBL203 --output-path egfr_exact.csv

# Include censored measurements: MUST use --calculate-pchembl
capricho get --target-ids CHEMBL203 \
  --standard-relation "=,<,>" \
  --calculate-pchembl \
  --output-path egfr_all.csv

Without --calculate-pchembl, you’ll get an error if you request censored data, but the data will still be fetched:

ERROR: pchembl_values are only calculated for standard_relation='='.
If you want to use censored data, please set calculate_pchembl to True.

Aggregation with Censored Data

When aggregating data with censored measurements, CAPRICHO only combines measurements that share the same standard_relation:

  1. Measurements with different standard_relation values are never combined.

  2. For exact measurements (=), statistics are calculated for compound-target pairs with multiple measurements.

  3. Censored measurements (<, >, <<, >>) are only combined when their values match exactly (e.g., < 6.0 will not be combined with < 5.0).

This conservative approach prevents mixing incompatible measurement types (e.g., averaging an exact value with a lower bound).
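The grouping rules above can be sketched in Python as a key function: exact measurements share one group per compound-target pair, while censored bounds only group with identical bounds. The records and the aggregation_key helper are hypothetical and serve only to illustrate the rules as described.

```python
from collections import defaultdict

# Hypothetical records: (compound, target, standard_relation, value)
measurements = [
    ("CPD_A", "CHEMBL203", "=", 7.1),
    ("CPD_A", "CHEMBL203", "=", 6.9),
    ("CPD_A", "CHEMBL203", "<", 6.0),
    ("CPD_A", "CHEMBL203", "<", 6.0),
    ("CPD_A", "CHEMBL203", "<", 5.0),
]


def aggregation_key(compound, target, relation, value):
    """Exact values share one group; censored bounds group only with identical bounds."""
    if relation == "=":
        return (compound, target, relation)
    return (compound, target, relation, value)


groups = defaultdict(list)
for compound, target, relation, value in measurements:
    groups[aggregation_key(compound, target, relation, value)].append(value)

for key, values in groups.items():
    print(key, values)
```

The five records fall into three groups: the two exact values, the two identical "< 6.0" bounds, and the single "< 5.0" bound.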

pchembl_relation: Inverted Relations for -log Scale

When working with pChEMBL values ($-\log_{10}$ of the molar concentration), the direction of comparison operators is inverted compared to the original concentration values. CAPRICHO automatically creates a pchembl_relation column during binarization to make this relationship explicit:

Relation Inversion Logic:

  • standard_relation < (low concentration, active) → pchembl_relation > (high pChEMBL, active)

  • standard_relation > (high concentration, inactive) → pchembl_relation < (low pChEMBL, inactive)

  • standard_relation = → pchembl_relation = (unchanged)

  • standard_relation ~ → pchembl_relation ~ (unchanged)

Example Interpretation: For a measurement with IC50 = 1 µM (pChEMBL = 6.0) and standard_relation = <:

  • Original: IC50 < 1 µM (active at concentrations below 1 µM)

  • pChEMBL: pchembl_value > 6.0 (higher pChEMBL = more active)

  • With threshold = 6.0: classified as active (1)

This inverted relation column is automatically added when you run the binarize command and helps interpret how measurements relate to activity thresholds on the -log scale.
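The inversion and its effect on labelling can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the binarize implementation: the mapping for << and >> is extrapolated from the < and > cases, values exactly at the threshold are treated as active, and a bound that does not settle the label returns None.

```python
import math

# Relation on the concentration scale -> relation on the pChEMBL scale.
# The "<<" and ">>" entries are an assumption extrapolated from "<" and ">".
PCHEMBL_RELATION = {"<": ">", "<<": ">>", ">": "<", ">>": "<<", "=": "=", "~": "~"}


def to_pchembl(value_nm: float, standard_relation: str):
    """Convert a concentration in nM to (pChEMBL, pchembl_relation)."""
    return 9.0 - math.log10(value_nm), PCHEMBL_RELATION[standard_relation]


def is_active(pchembl: float, pchembl_relation: str, threshold: float):
    """Return True, False, or None when a censored bound does not settle the label."""
    if pchembl_relation in ("=", "~"):
        return pchembl >= threshold
    if pchembl_relation in (">", ">>"):
        return True if pchembl >= threshold else None
    return False if pchembl <= threshold else None


value, relation = to_pchembl(1000.0, "<")  # IC50 < 1 uM
print(round(value, 2), relation, is_active(value, relation, threshold=6.0))
```

For the worked example above (IC50 < 1 µM, threshold 6.0) the sketch yields a lower bound on pChEMBL and an active label.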

Activity Data Analysis

For binary classification (active/inactive), use the binarize command which properly handles censored measurements. Following the example above, we have:

capricho binarize -i egfr_all.csv -o egfr_binary.csv -t 6.0

See the CLI reference for detailed binarization options.

Binarization Conflicts

When a compound-target pair has measurements from both exact (=) and censored (<, >) assays, the resulting binary labels may disagree. For example, an exact IC50 of 100 nM (pChEMBL 7.0) says “active”, while a censored measurement > 10 µM (pChEMBL < 5.0) says “inactive”.

Why Conflicts Happen

The most common pattern (~97% of EGFR conflicts) is exact vs censored: an exact measurement classifies a compound differently from a censored bound. This often occurs when:

  • A high-throughput screen reports a censored inactive result (> 10 µM)

  • A focused follow-up study measures an exact IC50 well below the threshold

  • Different assay formats have different dynamic ranges

Conflict Resolution Strategies

By default, CAPRICHO flags conflicts but keeps all rows. Use --conflict-resolution to resolve them automatically:

relation — Trust exact measurements over censored bounds. This is the most scientifically grounded strategy: an exact IC50 measurement is inherently more informative than a “> 10 µM” bound.

majority — Measurement-level majority vote. Each row’s pipe-separated raw values (e.g., pchembl_value = "6.0|6.5|7.0") are split and each individual measurement is classified against the threshold. Every measurement gets one vote, so a row with 3 values contributes 3 votes. This is more granular than row-level counting: if a row’s mean is active but one individual value is below the threshold, that value votes inactive. Falls back to dropping all rows on a tie.

confidence — Keep the row with the highest ChEMBL confidence score. Useful when assays vary in target assignment certainty.

drop — Remove all conflicting rows entirely. The most conservative option; guarantees no ambiguous labels remain.
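To illustrate the measurement-level vote described for the majority strategy, here is a minimal Python sketch. It treats every value as a point estimate and ignores censoring, which is a simplification; the function name and the choice of "at or above the threshold means active" are assumptions.

```python
def majority_label(pchembl_rows, threshold):
    """Vote per individual measurement; return 1, 0, or None on a tie.

    `pchembl_rows` holds pipe-separated strings such as "6.0|6.5|7.0".
    """
    votes = [
        float(value) >= threshold
        for row in pchembl_rows
        for value in row.split("|")
    ]
    active = sum(votes)
    inactive = len(votes) - active
    if active == inactive:
        return None  # tie: the text above says such rows are dropped
    return 1 if active > inactive else 0


rows = ["6.0|6.5|7.0", "5.0"]
print(majority_label(rows, threshold=6.0))
print(majority_label(rows, threshold=6.5))
```

With a threshold of 6.0 three of the four measurements vote active; with 6.5 the vote is tied two to two, so no label is returned.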

Conflict Severity

The conflict report classifies each conflict by severity based on measurement spread:

  • Low (spread < 1.0 log units): Measurements are close to each other and the threshold

  • Medium (spread 1.0–2.0 log units): Moderate disagreement

  • High (spread > 2.0 log units): Large disagreement, likely different assay conditions or systematic error
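The severity bands can be written as a small Python function. The sketch assumes that "spread" means the difference between the largest and smallest pChEMBL value in the conflict, which the text does not state explicitly.

```python
def conflict_severity(pchembl_values):
    """Classify a conflict by spread, assumed here to be max minus min."""
    spread = max(pchembl_values) - min(pchembl_values)
    if spread < 1.0:
        return "low"
    if spread <= 2.0:
        return "medium"
    return "high"


print(conflict_severity([7.0, 6.5]))
print(conflict_severity([7.0, 5.5]))
print(conflict_severity([8.0, 5.0]))
```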

Choosing a Strategy

For most use cases, relation is a safe default — it keeps the most informative measurements. Use majority when you have many aggregated measurements and want each individual value to vote on the label. Use drop when label accuracy is critical and you’d rather lose data than risk mislabeling.

Generate a conflict report (-rp conflicts.json) to inspect the conflicts before deciding:

# Inspect conflicts first (no resolution)
capricho binarize -i data.csv -o binary.csv -t 7.0 -rp conflicts.json

# Then resolve with your chosen strategy
capricho binarize -i data.csv -o binary.csv -t 7.0 -cr relation -rp conflicts.json

Post-Resolution Deduplication

When any conflict resolution strategy is active, CAPRICHO deduplicates the output to one row per compound-target pair. This produces ML-ready data where each compound-target combination has a single binary label.

During deduplication:

  1. Individual measurement filtering — Within each row, the pipe-separated raw values (e.g., pchembl_value = "5.5|6.5|7.0") are split and each measurement is classified against the threshold. Measurements that disagree with the row’s resolved binary label are removed. All other pipe-separated columns (assay_chembl_id, molecule_chembl_id, etc.) are filtered at the same positions to stay aligned.

  2. Row merging — Multiple rows for the same compound-target pair (e.g., one with standard_relation = "=" and another with standard_relation = "<") are merged into a single row. The standard_relation column becomes pipe-separated to reflect per-measurement relations (e.g., "=|=|<").

  3. Stats recalculation: pchembl_value_mean, _std, _median, and _counts are recalculated from the kept measurements only.

The resulting pchembl_value and standard_relation columns serve as a register tracing which source values compose the binarized label.
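The position-aligned filtering in step 1 and the recalculation in step 3 can be sketched in Python as follows. The function, the row layout, and the assay identifiers are hypothetical; the sketch also treats all values as point estimates, which ignores censoring.

```python
from statistics import mean


def filter_row(row, label, threshold, piped_columns):
    """Drop measurements that disagree with the resolved label, keeping columns aligned."""
    values = [float(v) for v in row["pchembl_value"].split("|")]
    # Positions whose individual classification matches the resolved label.
    keep = [i for i, v in enumerate(values) if (v >= threshold) == bool(label)]
    filtered = dict(row)
    for column in piped_columns:
        parts = row[column].split("|")
        filtered[column] = "|".join(parts[i] for i in keep)
    # Recalculate summary statistics from the kept measurements only.
    kept_values = [values[i] for i in keep]
    filtered["pchembl_value_mean"] = mean(kept_values) if kept_values else None
    filtered["pchembl_value_counts"] = len(kept_values)
    return filtered


row = {"pchembl_value": "5.5|6.5|7.0", "assay_chembl_id": "A1|A2|A3"}
result = filter_row(
    row, label=1, threshold=6.0, piped_columns=["pchembl_value", "assay_chembl_id"]
)
print(result)
```

The measurement at 5.5 disagrees with the active label, so it and its assay identifier are removed together and the mean is recomputed from the two remaining values.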

Quality Control Filters

CAPRICHO provides multiple layers of quality control:

Bioactivity Type Filtering

--bioactivity-type IC50,Ki,EC50

Focus on specific measurement types relevant to your analysis.

Relation Filtering

# Only exact measurements (default behavior, pchembl pre-calculated by ChEMBL)
--standard-relation "="

# Include censored data (requires --calculate-pchembl)
--standard-relation "=,<,>" --calculate-pchembl

Choose measurement precision level based on your analysis needs.

Date Requirements

--require-doc-date

Ensure all data has associated publication dates.

Assay Size Constraints

--min-assay-size 10 --max-assay-size 1000

Filter assays by number of tested compounds.

Reproducibility

CAPRICHO ensures full reproducibility through several mechanisms:

Recipe Files

Every run generates a JSON recipe file containing the full command and all parameters used:

{
  "command": "capricho get --target-ids CHEMBL203 --output-path egfr_data.csv",
  "capricho version": "0.1.0",
  "molecule_ids": [],
  "target_ids": ["CHEMBL203"],
  "assay_ids": [],
  "document_ids": [],
  "calculate_pchembl": false,
  "output_path": "egfr_data.csv",
  "confidence_scores": [7, 8, 9],
  "bioactivity_type": ["Potency", "Kd", "Ki", "IC50", "AC50", "EC50"],
  "standard_relation": ["="],
  "assay_types": ["B", "F"],
  "chembl_version": "36",
  "compound_equality": "connectivity",
  "aggregate_on": "pchembl_value"
}

This allows exact reproduction of your data curation workflow.
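Since recipes are plain JSON, two runs can be compared with a few lines of Python. The sketch below loads a recipe excerpt and reports which parameters differ from a second run; the recipe_differences helper is hypothetical and not part of CAPRICHO.

```python
import json

# Excerpt of a recipe, using keys shown in the example above.
recipe_text = """
{
  "target_ids": ["CHEMBL203"],
  "chembl_version": "36",
  "compound_equality": "connectivity"
}
"""
recipe = json.loads(recipe_text)

# A second, hypothetical run that used a different ChEMBL release.
other = dict(recipe, chembl_version="33")


def recipe_differences(a, b):
    """Return {key: (value_in_a, value_in_b)} for every parameter that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


print(recipe_differences(recipe, other))
```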

Version Control

Specify exact ChEMBL versions:

capricho get --target-ids CHEMBL203 --chembl-version 33

Transparent Processing

All filtering steps are logged and flagged data is preserved for inspection.

Output Structure

Understanding CAPRICHO’s output helps you make the most of your curated data. Each capricho get run produces multiple files:

Main Data File (*_data.csv)

  • Aggregated bioactivity measurements

  • Standardized column names

  • Quality and processing flags in dedicated columns

Recipe File (*_recipe.json)

  • Complete record of all parameters used

  • Enables exact reproduction of the workflow

Pre-aggregation Data (*_not_aggregated.csv)

  • Individual measurements before aggregation

  • Useful for understanding how data was combined

  • Can be skipped with --skip-not-aggregated

Removed Subset (*_removed_subset.csv)

  • Entries that were filtered out during curation

  • Allows inspection of what was excluded and why

This multi-file approach ensures transparency while providing clean, analysis-ready data.