# Key Concepts

Understanding these core concepts will help you use CAPRICHO effectively and make informed decisions about data curation.

## Compound Equality

One of the most important decisions in bioactivity data analysis is determining when two compound entries represent the same molecule.

### Connectivity-Based (Default)

The `connectivity` method identifies compounds by their molecular graph. It is based on the first 14 characters of the InChIKey, which encode atom connectivity but ignore stereochemistry and mobile-hydrogen tautomerism:

```bash
capricho get --target-ids CHEMBL203 --compound-equality connectivity
```

Because the connectivity layer does not encode stereochemistry, stereoisomers (e.g., R/S enantiomers) are merged during aggregation. To reflect this, CAPRICHO strips stereochemistry from the output SMILES when using connectivity mode, so the output does not misleadingly retain an arbitrary enantiomer's representation.

**Advantages:**

- Robust to tautomers: InChI normalizes mobile-hydrogen tautomers (e.g., amide/imidic acid shifts between N, O, S atoms), so these tautomeric forms are correctly identified as the same compound
- Deterministic and computationally lightweight

**Limitations:**

- Merges stereoisomers that may have different biological activity
- Does not normalize all tautomers: forms outside InChI's mobile-hydrogen rules (e.g., many keto-enol pairs) remain separate entries

**Use When:**

- Stereochemistry information is unreliable or inconsistent across sources
- You want to combine data from different stereoisomers and tautomers into a single entry

### Fingerprint-Based

The `mixed_fp` method uses a concatenation of ECFP4 (Morgan, radius 2) and RDKit path-based fingerprints to determine compound identity:

```bash
capricho get --target-ids CHEMBL203 --compound-equality mixed_fp
```

Using two fingerprint types covers some failure modes of each individual method: ECFP4 captures circular substructure environments, while RDKit fingerprints capture path-based features.
Of the two, only ECFP4 can encode stereochemistry; whether it does depends on the `--chirality` flag.

**Advantages:**

- Maintains stereochemical distinctions (with `--chirality`)

**Limitations:**

- Sensitive to tautomers: different tautomeric forms produce different fingerprint bit vectors, potentially treating the same compound as two different entries
- Susceptible to bit collisions in molecules with repetitive structural patterns (e.g., peptides, long lipid chains), where distinct substructures may hash to the same bits
- Computationally heavier than connectivity

**Use When:**

- Stereochemistry is important for your analysis
- Building focused datasets for SAR studies
- Working with data where tautomer variation is minimal

### SMILES-Based

The `smiles` method uses standardized SMILES strings directly for exact matching:

```bash
capricho get --target-ids CHEMBL203 --compound-equality smiles
```

**Advantages:**

- Simple and transparent matching logic
- No additional computation (no InChI or fingerprint generation)

**Limitations:**

- Sensitive to tautomers: different tautomeric forms produce different SMILES strings
- Relies entirely on the standardization performed by the ChEMBL structure pipeline

## ChEMBL Backends

CAPRICHO supports two different ways to access ChEMBL data, each with distinct advantages.
### Local Database Backend (Default)

The `downloader` backend uses a local SQLite database:

```bash
capricho download  # One-time setup
capricho get --target-ids CHEMBL203 --chembl-backend downloader
```

**Advantages:**

- Much faster for large queries
- Works offline after initial download
- Consistent performance
- Full SQL query capabilities

**Requirements:**

- ~25 GB disk space for the full ChEMBL database
- Initial download time

**Best For:**

- Large-scale data mining
- Repeated queries
- Complex filtering requirements
- Offline analysis

### Web API Backend

The `webresource` backend queries the live ChEMBL web API:

```bash
capricho get --target-ids CHEMBL203 --chembl-backend webresource
```

**Advantages:**

- No local storage required
- Always uses latest data
- No setup time
- Good for small queries

**Limitations:**

- Slower for large queries
- Requires internet connection
- Subject to API rate limits

**Best For:**

- Small, targeted queries
- One-off analyses
- When disk space is limited

## Data Flagging

A core principle of CAPRICHO is **never silently dropping data**. Instead of removing problematic entries during the `get` command, CAPRICHO flags them and lets users decide what to filter during the `prepare` step.

### Two Types of Flags

CAPRICHO maintains two separate flag columns:

**`data_dropping_comment`** - Quality flags indicating data concerns:

- Potential duplicates across documents
- Data validity comments from ChEMBL
- Unit annotation errors (3.0 or 6.0 log unit differences)
- Patent sources
- Undefined stereochemistry
- Assay size issues
- Insufficient assay overlap

These include the maximal-curation criteria from [Landrum & Riniker (2024)](https://doi.org/10.1021/acs.jcim.4c00049).

**`data_processing_comment`** - Processing flags documenting transformations:

- Salt/solvent removal from SMILES
- SMILES standardization
- pChEMBL calculation
- Unit conversions

### The Two-Phase Workflow

1. **`capricho get`**: Fetches and curates data, adding flags to all entries
2. **`capricho prepare`**: Filters data based on quality flags according to your project's needs

This separation keeps the process transparent, since you can inspect flagged data before deciding what to remove:

```bash
# Step 1: Get data with all flags
capricho get --target-ids CHEMBL203 -o egfr_data.csv

# Step 2: Filter based on your quality requirements
capricho prepare -i egfr_data.csv -o egfr_clean.csv \
    --drop-potential-duplicate \
    --drop-data-validity
```

## Data Aggregation

CAPRICHO provides several options for handling duplicate measurements and aggregating data.

### Target Mutations

By default, CAPRICHO treats different target mutations as separate entities. Use `--aggregate-mutants` to combine them:

```bash
capricho get --target-ids CHEMBL203 --aggregate-mutants
```

This is useful when you want to study the target in general rather than specific mutations.

### Metadata Columns

Include additional metadata in your analysis:

```bash
capricho get --target-ids CHEMBL203 --metadata-columns organism,tissue,cell_type
```

These columns are preserved during aggregation and can help you understand data heterogeneity.

(non-pchembl-aggregation)=
## Non-pChEMBL Aggregation

By default, CAPRICHO aggregates bioactivity data using the `pchembl_value` column, which represents -log10(molar) potency values.
However, many ChEMBL assays (especially ADMET assays) report measurements that aren't suitable for pChEMBL conversion, such as:

- **Permeability** (e.g., Caco-2 apparent permeability in cm/s)
- **Percent inhibition** (e.g., % inhibition at a fixed concentration)
- **Clearance** (e.g., mL/min/kg)
- **Half-life** (e.g., hours)
- **Solubility** (e.g., µg/mL)

For these measurements, use `--aggregate-on standard_value`:

```bash
capricho get \
    --assay-ids CHEMBL1112933,CHEMBL3529279 \
    --assay-types A \
    --aggregate-on standard_value \
    --output-path permeability_data.csv
```

### Key Differences from pChEMBL Aggregation

| Aspect | pchembl_value (default) | standard_value |
|--------|------------------------|----------------|
| **Mean type** | Geometric mean | Arithmetic mean |
| **Units** | Always molar (-log10) | Original units preserved |
| **Use case** | Potency measurements (IC50, Ki, etc.) | ADMET, physicochemical properties |

### Important: Preventing Unit Mixing

When aggregating on `standard_value`, ensure you don't inadvertently combine measurements with different units. Use `--id-columns` to group by unit type:

```bash
capricho get \
    --assay-ids CHEMBL1112933,CHEMBL3529279 \
    --aggregate-on standard_value \
    --id-columns standard_units \
    --output-path permeability_data.csv
```

This creates separate aggregations for measurements with different `standard_units` values.

(unit-conversion)=
## Unit Conversion

ChEMBL contains bioactivity data reported in many different units, even for the same measurement type. For example, permeability might be reported as `cm/s`, `nm/s`, or `10^-6 cm/s`. This heterogeneity makes cross-study aggregation challenging.
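
To make the problem concrete, here is a minimal Python sketch of what bringing permeability values onto a common unit involves. The factor table, the `to_1e6_cm_s` helper, and the measurement values are illustrative assumptions rather than CAPRICHO's internal implementation; only the arithmetic (1 nm = 10^-7 cm) is fixed.

```python
from statistics import mean

# Multiplicative factors to 10^-6 cm/s. Hypothetical table for illustration;
# CAPRICHO's own unit handling may differ.
FACTORS_TO_1E6_CM_S = {
    "cm/s": 1e6,        # 1 cm/s = 10^6 x 10^-6 cm/s
    "nm/s": 0.1,        # 1 nm/s = 10^-7 cm/s = 0.1 x 10^-6 cm/s
    "10^-6 cm/s": 1.0,  # already in the target unit
}


def to_1e6_cm_s(value: float, unit: str) -> float:
    """Convert a permeability value to 10^-6 cm/s."""
    return value * FACTORS_TO_1E6_CM_S[unit]


# The same permeability reported in three different units (made-up values).
measurements = [(2.0e-5, "cm/s"), (200.0, "nm/s"), (20.0, "10^-6 cm/s")]

# Averaging the raw numbers mixes units and is meaningless.
naive_mean = mean(value for value, _ in measurements)
# Averaging after conversion compares like with like.
converted_mean = mean(to_1e6_cm_s(value, unit) for value, unit in measurements)

print(f"naive mean of raw numbers: {naive_mean:.2f}")
print(f"mean after conversion:     {converted_mean:.2f} x 10^-6 cm/s")
```

The naive mean is dominated by whichever unit happens to produce the largest raw numbers, which is why units need to be either grouped or converted before aggregation.
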
The `--convert-units` flag enables automatic unit conversion before aggregation:

```bash
capricho get \
    --assay-ids CHEMBL1112933,CHEMBL3529279 \
    --aggregate-on standard_value \
    --convert-units \
    --output-path permeability_data.csv
```

### Supported Unit Families

| Family | Target Unit | Source Units |
|--------|------------|--------------|
| **Permeability** | `10^-6 cm/s` | `cm/s`, `nm/s`, `ucm/s`, `10'-6 cm/s`, etc. |
| **Molar concentration** | `nM` | `uM`, `µM`, `mM`, `pM`, `M` |
| **Mass concentration** | `ug/mL` | `ng/ml`, `mg/ml`, `mg/L`, `pg/ml` |
| **Dose** | `mg/kg` | `ug/kg`, `ug.kg-1`, `mg.kg-1` |
| **Time** | `hr` | `min`, `s`, `ms`, `day` |

### Transparency

All unit conversions are logged and tracked in the `data_processing_comment` column. This ensures you can always trace which measurements were converted and by what factor.

### Example: Caco-2 Permeability Dataset

```bash
capricho get \
    --assay-ids CHEMBL1112933,CHEMBL3529279,CHEMBL3529278 \
    --assay-types A \
    --confidence-scores 0,1,2,3,4,5,6,7,8,9 \
    --aggregate-on standard_value \
    --convert-units \
    --id-columns standard_units,assay_cell_type \
    --drop-unassigned-chiral \
    --output-path caco2_permeability.csv
```

This command:

1. Fetches permeability data from specific Caco-2 assays
2. Converts all permeability units to `10^-6 cm/s`
3. Aggregates using arithmetic mean on `standard_value`
4. Groups by cell type to maintain biological context

## Confidence Scoring

ChEMBL assigns confidence scores (0-9) based on target assignment certainty. See the [ChEMBL documentation](https://chembl.gitbook.io/chembl-interface-documentation/frequently-asked-questions/chembl-data-questions#what-is-the-confidence-score) for full details.

| Score | Description |
|-------|-------------|
| 9 | Direct single protein target assigned |
| 8 | Homologous single protein target assigned |
| 7 | Direct protein complex subunits assigned |
| 6 | Homologous protein complex subunits assigned |
| 5 | Multiple direct protein targets may be assigned (e.g., PROTEIN FAMILY) |
| 4 | Multiple homologous protein targets may be assigned (e.g., PROTEIN FAMILY) |
| 3 | Target assigned is molecular non-protein target |
| 2 | Target assigned is subcellular fraction |
| 1 | Target assigned is non-molecular |
| 0 | Default value - Target assignment has yet to be curated |

### Recommended Usage

- **High confidence (8-9)**: Best for focused target analysis and SAR studies
- **Medium confidence (6-7)**: Good balance of data quantity and quality
- **Lower confidence (0-5)**: Useful for large-scale analyses, but may include less specific target assignments

**Note on ADMET data**: Confidence scores reflect target assignment certainty, not measurement reliability. ADMET assays (permeability, clearance, solubility, etc.) typically have low confidence scores because they measure whole-cell or physicochemical properties rather than specific protein targets. A low confidence score for an ADMET assay does not indicate unreliable data, so use `--confidence-scores 0,1,2,3,4,5,6,7,8,9` when retrieving ADMET data.

## Standard Relations and Censored Data

ChEMBL bioactivity measurements include a `standard_relation` field that indicates the relationship between the measured value and the reported concentration.
### Relation Types

- **`=`**: The measured value equals the reported concentration
- **`<`**: The compound is active at concentrations _below_ the reported value (_active_ at the reported concentration)
- **`<<`**: Stronger indication that the compound is active at concentrations _well below_ the reported value (_active_ at the reported concentration)
- **`>`**: The compound is active at concentrations _above_ the reported value (_inactive_ at the reported concentration)
- **`>>`**: Stronger indication that the compound is active at concentrations _well above_ the reported value (_inactive_ at the reported concentration)
- **`~`**: Approximate measurement (CAPRICHO handles these as ±0.5 log units)

### Working with Censored Data

**Important**: ChEMBL only pre-calculates `pchembl_value` for exact measurements (`standard_relation='='`). To include censored data (`<`, `>`), you *must* use the `--calculate-pchembl` flag:

```bash
# Default: Only includes exact measurements (=)
capricho get --target-ids CHEMBL203 --output-path egfr_exact.csv

# Include censored measurements: MUST use --calculate-pchembl
capricho get --target-ids CHEMBL203 \
    --standard-relation "=,<,>" \
    --calculate-pchembl \
    --output-path egfr_all.csv
```

If you request censored data without `--calculate-pchembl`, CAPRICHO reports the following error, but the data is still fetched:

```
ERROR: pchembl_values are only calculated for standard_relation='='. If you want to use censored data, please set calculate_pchembl to True.
```

### Aggregation with Censored Data

When aggregating data that includes censored measurements, CAPRICHO only combines measurements with identical `standard_relation` values:

1. For exact (`=`) measurements, statistics are calculated for compound-target pairs with multiple exact measurements.
2. Censored measurements (`<`, `>`, `<<`, `>>`) are combined only when their values match exactly (e.g., `< 6.0` will not be combined with `< 5.0`).

This conservative approach prevents mixing incompatible measurement types (e.g., averaging an exact value with a lower bound).

### pchembl_relation: Inverted Relations for -log Scale

When working with pChEMBL values ($-\log_{10}(\text{molar})$), the direction of comparison operators is inverted compared to the original concentration values. CAPRICHO automatically creates a `pchembl_relation` column during binarization to make this relationship explicit:

**Relation Inversion Logic:**

- `standard_relation` `<` (low concentration, active) → `pchembl_relation` `>` (high pChEMBL, active)
- `standard_relation` `>` (high concentration, inactive) → `pchembl_relation` `<` (low pChEMBL, inactive)
- `standard_relation` `=` → `pchembl_relation` `=` (unchanged)
- `standard_relation` `~` → `pchembl_relation` `~` (unchanged)

**Example Interpretation:**

For a measurement with `IC50` = 1 µM (pChEMBL = 6.0) and `standard_relation` = `<`:

- Original: IC50 < 1 µM (active at concentrations below 1 µM)
- pChEMBL: pchembl_value > 6.0 (higher pChEMBL = more active)
- With threshold = 6.0: classified as **active (1)**

This inverted relation column is automatically added when you run the `binarize` command and helps interpret how measurements relate to activity thresholds on the -log scale.

### Activity Data Analysis

**For binary classification** (active/inactive), use the `binarize` command, which properly handles censored measurements. Following the example above:

```bash
capricho binarize -i egfr_all.csv -o egfr_binary.csv -t 6.0
```

See the CLI reference for detailed binarization options.

## Binarization Conflicts

When a compound-target pair has measurements from both exact (`=`) and censored (`<`, `>`) assays, the resulting binary labels may disagree. For example, an exact IC50 of 100 nM (pChEMBL 7.0) says "active", while a censored measurement `> 10 µM` (pChEMBL < 5.0) says "inactive".
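
The conflict in this example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, the `<<`/`>>` entries of the inversion table and the treatment of `~` as exact are simplifications, and returning `None` for a bound that does not determine the label is a choice made here, not necessarily what `capricho binarize` does.

```python
import math

# Illustrative helpers; names and edge-case choices are assumptions,
# not CAPRICHO's API.
INVERTED_RELATION = {"=": "=", "~": "~", "<": ">", "<<": ">>", ">": "<", ">>": "<<"}


def pchembl_from_nm(value_nm: float) -> float:
    """pChEMBL = -log10(molar concentration); 1 nM = 1e-9 M."""
    return 9.0 - math.log10(value_nm)


def binary_label(value_nm: float, standard_relation: str, threshold: float):
    """Return 1 (active), 0 (inactive), or None when a bound is uninformative."""
    pchembl = pchembl_from_nm(value_nm)
    pchembl_relation = INVERTED_RELATION[standard_relation]
    if pchembl_relation in ("=", "~"):
        # Simplification: approximate values are treated like exact ones.
        return int(pchembl >= threshold)
    if pchembl_relation in (">", ">>"):
        # The true pChEMBL lies above the reported value.
        return 1 if pchembl >= threshold else None
    # Relation is "<" or "<<": the true pChEMBL lies below the reported value.
    return 0 if pchembl <= threshold else None


exact = binary_label(100.0, "=", threshold=6.0)         # IC50 = 100 nM
censored = binary_label(10_000.0, ">", threshold=6.0)   # IC50 > 10 µM
print("exact:", exact, "censored:", censored, "conflict:", exact != censored)
```

Both measurements yield a definite label on their own, yet the labels disagree for the same compound-target pair, which is exactly the situation the conflict handling addresses.
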
### Why Conflicts Happen

The most common pattern (~97% of EGFR conflicts) is **exact vs censored**: an exact measurement classifies a compound differently from a censored bound. This often occurs when:

- A high-throughput screen reports a censored inactive result (`> 10 µM`)
- A focused follow-up study measures an exact IC50 well below the threshold
- Different assay formats have different dynamic ranges

### Conflict Resolution Strategies

By default, CAPRICHO flags conflicts but keeps all rows. Use `--conflict-resolution` to resolve them automatically:

**`relation`**: Trust exact measurements over censored bounds. This is the most scientifically grounded strategy: an exact IC50 measurement is inherently more informative than a "> 10 µM" bound.

**`majority`**: Measurement-level majority vote. Each row's pipe-separated raw values (e.g., `pchembl_value = "6.0|6.5|7.0"`) are split and each individual measurement is classified against the threshold. Every measurement gets one vote, so a row with 3 values contributes 3 votes. This is more granular than row-level counting: if a row's mean is active but one individual value is below the threshold, that value votes inactive. Falls back to dropping all rows on a tie.

**`confidence`**: Keep the row with the highest ChEMBL confidence score. Useful when assays vary in target assignment certainty.

**`drop`**: Remove all conflicting rows entirely. The most conservative option; guarantees no ambiguous labels remain.

### Conflict Severity

The conflict report classifies each conflict by severity based on measurement spread:

- **Low** (spread < 1.0 log units): Measurements are close to each other and the threshold
- **Medium** (spread 1.0–2.0 log units): Moderate disagreement
- **High** (spread > 2.0 log units): Large disagreement, likely different assay conditions or systematic error

### Choosing a Strategy

For most use cases, `relation` is a safe default: it keeps the most informative measurements.
Use `majority` when you have many aggregated measurements and want each individual value to vote on the label. Use `drop` when label accuracy is critical and you'd rather lose data than risk mislabeling.

Generate a conflict report (`-rp conflicts.json`) to inspect the conflicts before deciding:

```bash
# Inspect conflicts first (no resolution)
capricho binarize -i data.csv -o binary.csv -t 7.0 -rp conflicts.json

# Then resolve with your chosen strategy
capricho binarize -i data.csv -o binary.csv -t 7.0 -cr relation -rp conflicts.json
```

### Post-Resolution Deduplication

When any conflict resolution strategy is active, CAPRICHO deduplicates the output to **one row per compound-target pair**. This produces ML-ready data where each compound-target combination has a single binary label.

During deduplication:

1. **Individual measurement filtering**: Within each row, the pipe-separated raw values (e.g., `pchembl_value = "5.5|6.5|7.0"`) are split and each measurement is classified against the threshold. Measurements that disagree with the row's resolved binary label are removed. All other pipe-separated columns (`assay_chembl_id`, `molecule_chembl_id`, etc.) are filtered at the same positions to stay aligned.
2. **Row merging**: Multiple rows for the same compound-target pair (e.g., one with `standard_relation = "="` and another with `standard_relation = "<"`) are merged into a single row. The `standard_relation` column becomes pipe-separated to reflect per-measurement relations (e.g., `"=|=|<"`).
3. **Stats recalculation**: `pchembl_value_mean`, `_std`, `_median`, and `_counts` are recalculated from the kept measurements only.

The resulting `pchembl_value` and `standard_relation` columns serve as a register tracing which source values compose the binarized label.
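
The measurement-level vote and the position-aligned filtering described above can be sketched in plain Python. Everything here is a hypothetical illustration: the function names, the placeholder assay IDs, and the simplification that every value is treated as an exact measurement (censored relations are ignored) are assumptions, not CAPRICHO's actual implementation.

```python
from statistics import mean


def split_values(cell: str) -> list[float]:
    """Split a pipe-separated cell such as "5.5|6.5|7.0" into floats."""
    return [float(part) for part in cell.split("|")]


def majority_label(cells: list[str], threshold: float):
    """One vote per individual measurement; None signals a tie."""
    votes = [value >= threshold for cell in cells for value in split_values(cell)]
    active = sum(votes)
    inactive = len(votes) - active
    if active == inactive:
        return None
    return 1 if active > inactive else 0


def filter_row(row: dict[str, str], label: int, threshold: float,
               aligned_columns: list[str]) -> dict[str, str]:
    """Keep only the measurements that agree with the resolved label."""
    values = split_values(row["pchembl_value"])
    keep = [i for i, value in enumerate(values) if int(value >= threshold) == label]
    filtered = dict(row)
    for column in aligned_columns:
        # Filter every aligned column at the same positions.
        parts = row[column].split("|")
        filtered[column] = "|".join(parts[i] for i in keep)
    return filtered


row = {
    "pchembl_value": "5.5|6.5|7.0",
    "assay_chembl_id": "ASSAY_A|ASSAY_B|ASSAY_C",  # placeholder IDs
    "standard_relation": "=|=|=",
}
label = majority_label([row["pchembl_value"]], threshold=6.0)
kept = filter_row(row, label, 6.0,
                  ["pchembl_value", "assay_chembl_id", "standard_relation"])
kept_mean = mean(split_values(kept["pchembl_value"]))  # stats from kept values only
print(label, kept, kept_mean)
```

Filtering by position rather than by value is what keeps `assay_chembl_id` and `standard_relation` aligned with the surviving `pchembl_value` entries.
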
## Quality Control Filters

CAPRICHO provides multiple layers of quality control:

### Bioactivity Type Filtering

```bash
--bioactivity-type IC50,Ki,EC50
```

Focus on specific measurement types relevant to your analysis.

### Relation Filtering

```bash
# Only exact measurements (default behavior, pchembl pre-calculated by ChEMBL)
--standard-relation "="

# Include censored data (requires --calculate-pchembl)
--standard-relation "=,<,>" --calculate-pchembl
```

Choose measurement precision level based on your analysis needs.

### Date Requirements

```bash
--require-doc-date
```

Ensure all data has associated publication dates.

### Assay Size Constraints

```bash
--min-assay-size 10 --max-assay-size 1000
```

Filter assays by number of tested compounds.

## Reproducibility

CAPRICHO ensures full reproducibility through several mechanisms:

### Recipe Files

Every run generates a JSON recipe file containing the full command and all parameters used:

```json
{
  "command": "capricho get --target-ids CHEMBL203 --output-path egfr_data.csv",
  "capricho version": "0.1.0",
  "molecule_ids": [],
  "target_ids": ["CHEMBL203"],
  "assay_ids": [],
  "document_ids": [],
  "calculate_pchembl": false,
  "output_path": "egfr_data.csv",
  "confidence_scores": [7, 8, 9],
  "bioactivity_type": ["Potency", "Kd", "Ki", "IC50", "AC50", "EC50"],
  "standard_relation": ["="],
  "assay_types": ["B", "F"],
  "chembl_version": "36",
  "compound_equality": "connectivity",
  "aggregate_on": "pchembl_value"
}
```

This allows exact reproduction of your data curation workflow.

### Version Control

Specify exact ChEMBL versions:

```bash
capricho get --target-ids CHEMBL203 --chembl-version 33
```

### Transparent Processing

All filtering steps are logged and flagged data is preserved for inspection.

## Output Structure

Understanding CAPRICHO's output helps you make the most of your curated data.
Each `capricho get` run produces multiple files:

### Main Data File (`*_data.csv`)

- Aggregated bioactivity measurements
- Standardized column names
- Quality and processing flags in dedicated columns

### Recipe File (`*_recipe.json`)

- Complete record of all parameters used
- Enables exact reproduction of the workflow

### Pre-aggregation Data (`*_not_aggregated.csv`)

- Individual measurements before aggregation
- Useful for understanding how data was combined
- Can be skipped with `--skip-not-aggregated`

### Removed Subset (`*_removed_subset.csv`)

- Entries that were filtered out during curation
- Allows inspection of what was excluded and why

This multi-file approach ensures transparency while providing clean, analysis-ready data.