# Quick Start Guide This guide will help you get started with CAPRICHO in just a few minutes. ## Tab Completion CAPRICHO supports tab completion for commands and options. To enable it, run: ```bash capricho --install-completion ``` This will make it easier to discover available commands and options as you type. ## Basic Workflow CAPRICHO follows the workflow: 1. **Download** ChEMBL database (one-time setup); 2. **Explore** the data to understand what's available; 3. **Get** curated bioactivity data for your analysis; 4. **Prepare** the data by cleaning concerning quality flags according to the projects' needs and optionally output an activity matrix for multitask ML; 5. **Binarize** (optional) convert continuous activity values to binary labels for classification. ## Step 1: Download ChEMBL Database First, download the ChEMBL database to your local machine: ```bash capricho download ``` This downloads the latest ChEMBL version to `~/.data/chembl/` by default. Defining a custom subdirectory is supported with the `--prefix` option, but the path will be saved under the default `~/.data/` location, as defined by [pystow](https://pystow.readthedocs.io/en/latest/). For example: ```bash capricho download --version 33 --prefix /path/to/custom/location ``` ## Step 2: Explore the Database Before fetching data, you might want to explore what's available: ```bash # List all tables in the database capricho explore --list-tables # Explore a specific table capricho explore --table activities # Search for tables containing a specific column capricho explore --search-column pchembl # Run a custom SQL query capricho explore --query "SELECT COUNT(*) FROM activities WHERE confidence_score = 9" ``` ## Step 3: Get Bioactivity Data Now you're ready to fetch and curate bioactivity data. Here are some common use cases: ### Basic Target Query Get all bioactivity data for EGFR (CHEMBL203): ```bash capricho get --target-ids CHEMBL203 --output-path egfr_data.csv ``` ### Multiple Targets with High Confidence Get data for multiple targets, but only high-confidence measurements: ```bash capricho get --target-ids CHEMBL203,CHEMBL204,CHEMBL279 --confidence-scores 8,9 --output-path high_confidence.csv ``` ### Specific Bioactivity Types Focus on specific bioactivity measurements: ```bash capricho get --target-ids CHEMBL203 --bioactivity-type IC50,Ki --output-path ic50_ki_data.csv ``` ### With Custom Aggregation Aggregate mutant data and include additional metadata: ```bash capricho get --target-ids CHEMBL203 --aggregate-mutants --metadata-columns organism,tissue --output-path aggregated_data.csv ``` ## Step 4: Prepare Data for ML The `prepare` command filters data based on quality flags, and outputs the cleaned data and an activity matrix suitable for multitask machine learning. Quality and processing flags are introduced to all data fetched with the `get` command, including the max curation standards introduced by [Landrum & Riniker (2024)](https://doi.org/10.1021/acs.jcim.4c00049) and marking any modifications made by CAPRICHO during curation, respectively. ### Basic Preparation Transform aggregated data into a multitask activity matrix: ```bash capricho prepare -i egfr_data.csv -o egfr_matrix.csv ``` ### Filtering Quality Flags Remove entries with specific quality concerns: ```bash capricho prepare -i egfr_data.csv -o egfr_matrix.csv \ --drop-potential-duplicate \ --drop-data-validity ``` The prepare command outputs: - **Activity matrix** (e.g., `egfr_matrix.csv`): Rows are compounds, columns are targets - **Prepared data** (e.g., `egfr_prepared.csv`): Filtered data before pivoting See [CLI Reference](cli-reference.md) for all available quality flags (`--drop-undefined-stereo`, `--drop-unit-error`, etc.). ## Step 5: Binarize Data (Optional) If you need binary labels (active/inactive) for classification tasks, you can binarize the prepared data: ### Basic Binarization Convert continuous pChEMBL values to binary labels using the default threshold of 6.0 (1 µM): ```bash capricho binarize -i egfr_prepared.csv -o egfr_binary.csv ``` ### Custom Threshold Use a more stringent threshold of 7.0 (100 nM): ```bash capricho binarize -i egfr_prepared.csv -o egfr_binary.csv -t 7.0 ``` ### Using Median Values Use median instead of mean for binarization: ```bash capricho binarize -i egfr_prepared.csv -o egfr_binary.csv -vcol pchembl_value_median ``` The binarization process: - Handles censored measurements (< and >) intelligently - Flags conflicting measurements for the same compound-target pair - Outputs a new column `activity_binary` with values 0 (inactive) or 1 (active) ## Understanding the Output CAPRICHO generates several files: - **Main data file** (e.g., `egfr_data.csv`): The curated bioactivity data - **Recipe file** (e.g., `egfr_data_recipe.json`): Complete record of parameters used - **Not aggregated file** (e.g., `egfr_data_not_aggregated.csv`): Pre-aggregation data - **Removed subset file** (e.g., `egfr_data_removed_subset.csv`): Data filtered out during curation ### Key Output Columns The main data file contains these important columns: - `connectivity`: Molecular connectivity identifier - `smiles`: Standardized SMILES representation - `target_chembl_id`: ChEMBL ID for the target - `pchembl_value_mean`: Mean pChEMBL value (aggregated) - `pchembl_value_median`: Median pChEMBL value (aggregated) - `pchembl_value_std`: Standard deviation - `standard_relation`: Relation type (=, <, <<, >, >>, ~) - `data_dropping_comment`: Flags for filtered data - `data_processing_comment`: Processing notes ## Next Steps - Read the [CLI Reference](cli-reference.md) for complete parameter documentation - Learn about [Key Concepts](concepts.md) like compound equality and backends - See the [ADMET Data Guide](guides/admet-data.md) for working with ADMET endpoints