API Reference¶
CAPRICHO is primarily designed as a command-line tool, but its core functionality is also available programmatically through Python APIs.
CLI Module¶
The main entry point for CAPRICHO commands.
Module containing the command line interface to get data from ChEMBL
- class Capricho.cli.main.AggregationColumn(value)[source]¶
-
- pchembl_value = 'pchembl_value'¶
- standard_value = 'standard_value'¶
- class Capricho.cli.main.ChemblBackend(value)[source]¶
-
- downloader = 'downloader'¶
- webresource = 'webresource'¶
- class Capricho.cli.main.CompoundEquality(value)[source]¶
-
- connectivity = 'connectivity'¶
- mixed_fp = 'mixed_fp'¶
- smiles = 'smiles'¶
- class Capricho.cli.main.CompoundIdColumn(value)[source]¶
-
- connectivity = 'connectivity'¶
- smiles = 'smiles'¶
- class Capricho.cli.main.ConflictResolution(value)[source]¶
-
- confidence = 'confidence'¶
- drop = 'drop'¶
- majority = 'majority'¶
- relation = 'relation'¶
- class Capricho.cli.main.LogLevel(value)[source]¶
-
- critical = 'critical'¶
- debug = 'debug'¶
- error = 'error'¶
- info = 'info'¶
- trace = 'trace'¶
- warning = 'warning'¶
- Capricho.cli.main.binarize_data(ctx, input_path, output_path, threshold=6.0, value_column='pchembl_value_mean', compound_id_col=CompoundIdColumn.connectivity, target_id_col='target_chembl_id', relation_col='standard_relation', output_binary_col='activity_binary', compare_across_mutants=False, conflict_report_path=None, conflict_resolution=None)[source]¶
Binarize aggregated bioactivity data based on activity threshold.
This command converts continuous pchembl values to binary labels (0=inactive, 1=active) while properly handling censored measurements (< and >) and validating agreement between discrete (=) and censored measurements for the same compound-target pair.
The output file also contains a new relation column with signs adjusted for the -log scale. E.g., for a threshold of 6.0, an IC50 with pchembl_value 6.0 and pchembl_relation > means:
-log[IC50 concentration] > 6.0;
IC50 concentration < 1 µM;
active (1).
Example
capricho binarize -i aggregated_data.csv -o binarized_data.csv -t 6.5
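The censoring logic described above can be sketched in a few lines of plain Python. This is only an illustration, not CAPRICHO's implementation: the `binarize` helper is hypothetical, and treating a value exactly at the threshold as active is an assumption.

```python
from typing import Optional


def binarize(pchembl: float, relation: str, threshold: float = 6.0) -> Optional[int]:
    """Return 1 (active), 0 (inactive), or None when the label is undetermined.

    Hypothetical helper; `relation` is expressed on the -log (pChEMBL) scale.
    """
    if relation == "=":
        # Assumption: a discrete value exactly at the threshold counts as active.
        return 1 if pchembl >= threshold else 0
    if relation == ">":
        # True value lies above the reported one: conclusive only at or above the threshold.
        return 1 if pchembl >= threshold else None
    if relation == "<":
        # True value lies below the reported one: conclusive only at or below the threshold.
        return 0 if pchembl <= threshold else None
    raise ValueError(f"unsupported relation: {relation!r}")


labels = [binarize(6.0, ">"), binarize(5.2, "="), binarize(7.0, "<"), binarize(5.0, "<")]
```

A censored measurement on the wrong side of the threshold (e.g., "< 7.0" with a threshold of 6.0) cannot be labelled, which is why the sketch returns None for it.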
- Capricho.cli.main.csv_intergers(value)[source]¶
Parses a comma-separated string into a list of integers.
- Capricho.cli.main.csv_string(value)[source]¶
Parses a comma-separated string into a list of strings.
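As a rough idea of what such option parsers do, here is a minimal sketch. The function names are hypothetical stand-ins, and the whitespace stripping and dropping of empty items are assumptions rather than documented behaviour.

```python
from typing import List


def parse_csv_strings(value: str) -> List[str]:
    """Split 'a, b,c' into ['a', 'b', 'c'] (assumed: whitespace stripped, empty items dropped)."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_csv_integers(value: str) -> List[int]:
    """Split '7, 8,9' into [7, 8, 9]; raises ValueError on non-integer items."""
    return [int(item) for item in parse_csv_strings(value)]


scores = parse_csv_integers("7, 8,9")
assay_types = parse_csv_strings("B,F")
```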
- Capricho.cli.main.download(ctx, version=None, prefix=None)[source]¶
Download ChEMBL SQL database using chembl_downloader.
- Capricho.cli.main.explore(ctx, version=None, list_tables=False, table=None, search_column=None, query=None, fmt=ExploreFormat.markdown, output_path=None, colorize=False)[source]¶
Explore the downloaded ChEMBL SQL database.
For a visual inspection of the latest ChEMBL schema, see: https://www.ebi.ac.uk/chembl/db_schema
- Capricho.cli.main.get_data(ctx, molecule_ids=[], target_ids=[], assay_ids=[], document_ids=[], output_path='chembl_data.csv', confidence_scores=[7, 8, 9], bioactivity_type=None, standard_relation=['='], standard_units=None, assay_types=['B', 'F'], chembl_release=None, chembl_version=None, chembl_backend='downloader', compound_equality='connectivity', aggregate_on='pchembl_value', metadata_columns=[], id_columns=[], max_assay_size=None, min_assay_size=None, min_assay_overlap=0, calculate_pchembl=False, chirality=False, drop_unassigned_chiral=False, curate_annotation_errors=True, skip_not_aggregated=False, aggregate_mutants=False, skip_recipe=False, require_doc_date=False, strict_mutant_removal=False, convert_units=False)[source]¶
Filter, download, and process bioactivity data from ChEMBL.
- Capricho.cli.main.prepare_data(ctx, input_path, output_path, task_col='target_chembl_id', aggregate_on=AggregationColumn.pchembl_value, compound_col=CompoundIdColumn.connectivity, smiles_col='smiles', remove_flags=None, id_columns=None, drop_undefined_stereo=False, drop_potential_duplicate=False, drop_data_validity=False, drop_unit_error=False, drop_mixture=False, drop_assay_size=False, drop_insufficient_overlap=False, deduplicate=False, resolve_annotation_error=None, plot_path=None)[source]¶
Transform aggregated bioactivity data into multitask format (activity matrix).
This command pivots aggregated data to create an activity matrix where rows are compounds and columns are tasks (e.g., targets). This format is suitable for multitask machine learning models.
The command supports tab-completable flags for common data quality filters, as well as a --deduplicate option to remove duplicate pChEMBL values and recalculate statistics.
Example
capricho prepare -i aggregated_data.csv -o activity_matrix.csv
capricho prepare -i data.csv -o matrix.csv --drop-undefined-stereo --drop-unit-error
capricho prepare -i data.csv -o matrix.csv --deduplicate --plot comparability.png
Data Pipeline¶
Core data processing workflow that orchestrates fetch → standardize → clean → aggregate.
- Capricho.cli.chembl_data_pipeline.aggregate_data(df, chirality, metadata_cols=[], extra_id_cols=[], extra_multival_cols=[], aggregate_mutants=False, output_path=None, compound_equality='connectivity', value_col='pchembl_value')[source]¶
Aggregate the data obtained from ChEMBL by: 1) calculating fingerprints and using them to identify same-structure compounds; 2) identifying identical fingerprint arrays and aggregating the data.
The aggregated data contains the original values separated by a semicolon, along with the mean, median, standard deviation, median absolute deviation, and value counts for the pchembl values.
- Parameters:
df – dataframe output from CompoundMapper.cli.workflow.fetch_standardize_and_clean_workflow
chirality (bool) – toggle chiral-sensitive fingerprints for identifying same molecules
extra_id_cols (list[str]) – additional columns to use as identifiers for the aggregation. Passing [“assay_chembl_id”] to this argument, for example, will only aggregate the data if the compound is the same and the assay is the same.
extra_multival_cols (list[str]) – list of extra columns that you’d like to keep as aggregated values in the final dataframe. Caveat: these columns will be displayed as strings separated by ; in the final dataframe. Defaults to [].
aggregate_mutants (bool) – if True, will aggregate data solely based on the target_chembl_id, regardless of the mutation flag in ChEMBL. Defaults to False.
output_path (Union[str, Path, None]) – path to save the aggregated data
compound_equality (Literal['mixed_fp', 'connectivity', 'smiles']) – How to identify same compounds in the dataset. If “mixed_fp”, uses mixed fingerprints (ECFP4 + RDKitFP) to identify same compounds. If “connectivity”, uses the first part of the InChI key (connectivity) to identify same compounds. If “smiles”, uses standardized SMILES strings directly. Defaults to “connectivity”.
value_col (str) – Column name containing the values to aggregate statistics on. Defaults to “pchembl_value”. Use “standard_value” for non-pChEMBL data (e.g., % inhibition).
- Returns:
the aggregated data
- Return type:
pd.DataFrame
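To make the aggregation idea concrete, the following stdlib-only sketch groups toy records by the connectivity block of an InChIKey (its first 14 characters) and computes the statistics listed above. The InChIKeys and values are made-up placeholders, the output key names are hypothetical, and the real function works on DataFrames rather than tuples.

```python
import statistics
from collections import defaultdict

# Placeholder InChIKeys (14-10-1 layout); KEY_1 and KEY_2 share the connectivity block.
KEY_1 = "A" * 14 + "-" + "B" * 10 + "-N"
KEY_2 = "A" * 14 + "-" + "C" * 10 + "-N"
KEY_3 = "D" * 14 + "-" + "B" * 10 + "-N"

records = [
    (KEY_1, "TARGET_1", 5.0),
    (KEY_2, "TARGET_1", 6.0),
    (KEY_1, "TARGET_1", 7.0),
    (KEY_3, "TARGET_1", 4.0),
]


def aggregate(records, sep=";"):
    """Group by (connectivity, target) and summarise the values of each group."""
    groups = defaultdict(list)
    for inchikey, target, value in records:
        connectivity = inchikey.split("-")[0]  # first InChIKey block
        groups[(connectivity, target)].append(value)

    summary = {}
    for key, values in groups.items():
        median = statistics.median(values)
        summary[key] = {
            "pchembl_value": sep.join(str(v) for v in values),  # original values kept
            "mean": statistics.mean(values),
            "median": median,
            "std": statistics.stdev(values) if len(values) > 1 else None,
            "mad": statistics.median(abs(v - median) for v in values),
            "counts": len(values),
        }
    return summary


summary = aggregate(records)
```

Grouping on the connectivity block merges stereoisomers into one entry, which is the trade-off the "connectivity" setting of compound_equality implies.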
- Capricho.cli.chembl_data_pipeline.get_standardize_and_clean_workflow(molecule_ids=None, target_ids=None, assay_ids=None, document_ids=None, chirality=True, calculate_pchembl=False, output_path=None, confidence_scores=[7, 8, 9], bioactivity_type=None, standard_relation=['='], standard_units=None, assay_types=['B', 'F'], chembl_release=None, save_not_aggregated=True, drop_unassigned_chiral=False, version=None, backend='downloader', curate_annotation_errors=True, require_doc_date=False, min_assay_size=None, max_assay_size=None, min_assay_overlap=0, strict_mutant_removal=False, value_col='pchembl_value', enable_unit_conversion=False)[source]¶
Fetches the filtered data from ChEMBL based on the provided IDs, assay confidence, and bioactivity types. The fetched SMILES are then standardized and chemical mixtures are removed from the dataset. Duplicate data is also removed and the remaining data is saved to a csv file.
- Parameters:
molecule_ids (Optional[list[str]]) – list of ChEMBL molecule IDs to filter data from
target_ids (Optional[list[str]]) – list of ChEMBL target IDs to filter data from
assay_ids (Optional[list[str]]) – list of ChEMBL assay IDs to filter data from
document_ids (Optional[list[str]]) – list of ChEMBL document IDs to filter data from
calculate_pchembl (bool) – whether to calculate pchembl values when not found for assay results reported in nanomolar/micromolar units
chirality (bool) – setting this to False will remove stereochemistry information from the SMILES on top of the standardization. Defaults to True.
output_path (Union[str, Path, None]) – path to save the resulting csv file
confidence_scores (list[str]) – list of confidence scores (assay-related) to filter data from
bioactivity_type (Optional[list[str]]) – list of bioactivity types (assay-related) to filter data from
standard_relation (list[str]) – standard relation to filter data from. Currently only supports “=”. Defaults to “=”.
chembl_release (Optional[int]) – latest ChEMBL release to retrieve data from
save_not_aggregated (bool) – whether to save the resulting data to the csv (output_path) before
drop_unassigned_chiral (bool) – whether to drop data points with undefined stereocenters. Defaults to False.
version (Union[int, str, None]) – backend==“downloader” only! Version of the ChEMBL database to be downloaded by chembl_downloader. If left as None, the latest version will be downloaded. Defaults to None.
backend (Literal['downloader', 'webresource']) – the backend to be used for fetching the data. If downloader, the ChEMBL sql database is downloaded and extracted first. Defaults to “downloader”.
curate_annotation_errors (bool) – Whether to apply activity curation based on pChEMBL values differing by exactly 3.0 (which indicates possible annotation errors). Defaults to True.
require_doc_date (bool) – Whether to filter out activities without a document year.
min_assay_size (Optional[int]) – Minimum number of compounds in an assay. Assays smaller than this size will have their activities flagged for removal. Defaults to None (no filtering).
max_assay_size – Maximum number of compounds in an assay. Assays exceeding this size will have their activities flagged for removal. Defaults to None (no filtering).
min_assay_overlap (int) – Minimum number of overlapping compounds between two assays for the same target for their activities to be considered. Defaults to 0 (no filtering).
strict_mutant_removal (bool) – If True, assays with ‘mutant’, ‘mutation’, or ‘variant’ in their description will be flagged for removal. Defaults to False.
- Returns:
the filtered, standardized, and cleaned data
- Return type:
pd.DataFrame
- Capricho.cli.chembl_data_pipeline.re_aggregate_data(df, chirality, extra_id_cols=[], extra_multival_cols=[], aggregate_mutants=False, output_path=None, compound_equality='connectivity')[source]¶
Re-aggregate the data obtained from the aggregate_data method after dataset explosion. Useful for exploring the effect of different extra_id_cols and other parameters.
- Parameters:
df (DataFrame) – dataframe output from aggregate_data
chirality (bool) – toggle chiral-sensitive fingerprints for identifying same molecules
extra_id_cols (list[str]) – additional columns to use as identifiers for the aggregation. Passing [“assay_chembl_id”] to this argument, for example, will only aggregate the data if the compound is the same and the assay is the same.
extra_multival_cols (list[str]) – list of extra columns that you’d like to keep as aggregated values in the final dataframe. Caveat: these columns will be displayed as strings separated by | (pipe) in the final dataframe. Defaults to [].
aggregate_mutants (bool) – if True, will aggregate data solely based on the target_chembl_id, regardless of the mutation flag in ChEMBL. Defaults to False.
output_path (Union[str, Path, None]) – path to save the aggregated data
compound_equality (Literal['mixed_fp', 'connectivity', 'smiles']) – How to identify same compounds in the dataset. If “mixed_fp”, uses mixed fingerprints (ECFP4 + RDKitFP) to identify same compounds. If “connectivity”, uses the first part of the InChI key (connectivity) to identify same compounds. If “smiles”, uses standardized SMILES strings directly. Defaults to “connectivity”.
- Returns:
the re-aggregated data
- Return type:
pd.DataFrame
Data Preparation¶
Filter data based on quality flags and transform to activity matrices for ML.
Clean aggregated bioactivity data by removing quality flags and duplicates, and pivot into activity matrices for multitask modeling.
- Capricho.cli.prepare.prepare_multitask_data(df, task_col, value_col, compound_col, smiles_col, id_columns=None)[source]¶
Transform aggregated data to multitask format (activity matrix).
This function pivots aggregated bioactivity data to create an activity matrix where:
- Rows represent unique compounds (identified by compound_col)
- Columns represent tasks (e.g., different targets)
- Values are the bioactivity measurements (e.g., pchembl_value_mean)
Use clean_data() before calling this function to filter quality flags and deduplicate values.
- Parameters:
df (DataFrame) – Aggregated DataFrame from aggregate_data() with bioactivity statistics.
task_col (str) – Column to use as task identifier (e.g., “target_chembl_id”).
value_col (str) – Column containing values to pivot (e.g., “pchembl_value_mean”).
compound_col (str) – Column for compound identity (e.g., “connectivity” or “smiles”).
smiles_col (str) – Column containing SMILES strings.
id_columns (Optional[List[str]]) – List of additional columns to combine with task_col for creating composite task identifiers. Use this when data was aggregated with --id-columns (e.g., [“assay_tissue”]) to prevent losing information.
- Return type:
- Returns:
DataFrame with compounds as rows (indexed by compound_col), tasks as columns, and a smiles column. Missing values are represented as NaN.
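The pivot described here can be illustrated without pandas. The sketch below is an assumption-laden stand-in: `to_activity_matrix` is a hypothetical helper, the toy rows are invented, and joining task_col with id_columns using an underscore is a guess at how composite task identifiers might be built.

```python
import math

rows = [
    {"connectivity": "CMPD_A", "smiles": "CCO", "target_chembl_id": "T1",
     "assay_tissue": "liver", "pchembl_value_mean": 6.1},
    {"connectivity": "CMPD_A", "smiles": "CCO", "target_chembl_id": "T2",
     "assay_tissue": "liver", "pchembl_value_mean": 5.4},
    {"connectivity": "CMPD_B", "smiles": "CCN", "target_chembl_id": "T2",
     "assay_tissue": "liver", "pchembl_value_mean": 7.0},
]


def to_activity_matrix(rows, task_col, value_col, compound_col, smiles_col, id_columns=None):
    """Pivot long-format rows into {compound: {smiles, task_1, task_2, ...}}."""
    matrix = {}
    for row in rows:
        # Composite task identifier (the "_" separator is an assumption).
        task = "_".join(str(row[col]) for col in [task_col, *(id_columns or [])])
        entry = matrix.setdefault(row[compound_col], {smiles_col: row[smiles_col]})
        entry[task] = row[value_col]

    # Fill tasks a compound was never measured on with NaN.
    tasks = sorted({key for entry in matrix.values() for key in entry if key != smiles_col})
    for entry in matrix.values():
        for task in tasks:
            entry.setdefault(task, math.nan)
    return matrix


matrix = to_activity_matrix(
    rows, "target_chembl_id", "pchembl_value_mean", "connectivity", "smiles",
    id_columns=["assay_tissue"],
)
```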
ChEMBL Processing¶
Bioactivity data processing and pChEMBL value calculation.
Module holding functionalities for the ChEMBL API.
- Capricho.chembl.processing.convert_to_log10(df)[source]¶
Function to be applied to the whole DataFrame. Will convert the standard_value column to pchembl_value column, if the standard_units are in nM, µM or uM.
Activities with incompatible units are flagged and preserved with pchembl_value=NaN for transparency.
- Parameters:
df (DataFrame) – a bioactivity DataFrame, e.g., the output from get_activity_table.
- Returns:
the DataFrame with the pchembl_value column added. Activities with incompatible units are flagged and have pchembl_value=NaN.
- Return type:
pd.DataFrame
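The conversion itself is the usual pChEMBL definition, the negative base-10 logarithm of the molar concentration. A minimal sketch, assuming a hypothetical `to_pchembl` helper and a unit table limited to the units named above:

```python
import math

# Conversion factors to molar for the units mentioned in the text.
UNIT_TO_MOLAR = {"nM": 1e-9, "uM": 1e-6, "µM": 1e-6}


def to_pchembl(standard_value: float, standard_units: str) -> float:
    """Return -log10(concentration in M), or NaN for incompatible units or non-positive values."""
    factor = UNIT_TO_MOLAR.get(standard_units)
    if factor is None or standard_value <= 0:
        return math.nan
    return -math.log10(standard_value * factor)


one_micromolar = to_pchembl(1.0, "µM")     # approximately 6
hundred_nanomolar = to_pchembl(100.0, "nM")  # approximately 7
incompatible = to_pchembl(50.0, "%")         # NaN: unit cannot be converted
```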
- Capricho.chembl.processing.curate_activity_pairs(df, mol_id_col='molecule_chembl_id', assay_id_col='assay_chembl_id', activity_value_col='pchembl_value')[source]¶
Curate activity pairs for the same molecule across different assays. Removes or flags pairs of measurements if their activity values (e.g., pChEMBL values) differ by approximately 3.0 or 6.0.
Filter inspired by Landrum & Riniker, 2024, where the authors state:
> Given the very low probability of two separate experiments producing exactly the same results, the exact matches are most likely cases where values from a previous paper are copied into a new one; this was discussed in the earlier work by Kramer et al. (10) and spot-checked with a number of assay pairs here.
- Parameters:
df (DataFrame) – DataFrame with bioactivity data.
mol_id_col (str) – column name for molecule IDs. Defaults to “molecule_chembl_id” for ChEMBL data.
assay_id_col (str) – column name for assay IDs. Defaults to “assay_chembl_id” for ChEMBL data.
activity_value_col (str) – column name for activity values. Defaults to “pchembl_value” for ChEMBL data.
- Returns:
The curated DataFrame.
- Return type:
pd.DataFrame
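A difference of 3.0 or 6.0 log units corresponds to a factor of 10^3 or 10^6 in concentration, which is what a nM/µM/mM unit mix-up would produce. The sketch below finds such pairs in toy data; the `suspicious_pairs` helper and the 0.01 tolerance standing in for "approximately" are assumptions, not the library's values.

```python
import math
from itertools import combinations

# (molecule, assay, pchembl) toy measurements
measurements = [
    ("MOL_1", "ASSAY_A", 5.0),
    ("MOL_1", "ASSAY_B", 8.0),  # exactly 3.0 log units from ASSAY_A
    ("MOL_1", "ASSAY_C", 6.4),
    ("MOL_2", "ASSAY_A", 7.1),
    ("MOL_2", "ASSAY_B", 7.3),
]


def suspicious_pairs(measurements, offsets=(3.0, 6.0), tol=0.01):
    """Return (molecule, assay_1, assay_2) triples whose values differ by ~3.0 or ~6.0."""
    by_molecule = {}
    for molecule, assay, value in measurements:
        by_molecule.setdefault(molecule, []).append((assay, value))

    flagged = []
    for molecule, entries in by_molecule.items():
        for (assay_1, value_1), (assay_2, value_2) in combinations(entries, 2):
            diff = abs(value_1 - value_2)
            if assay_1 != assay_2 and any(math.isclose(diff, o, abs_tol=tol) for o in offsets):
                flagged.append((molecule, assay_1, assay_2))
    return flagged


flagged = suspicious_pairs(measurements)
```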
- Capricho.chembl.processing.get_bioactivities_workflow(molecule_chembl_ids=None, target_chembl_ids=None, assay_chembl_ids=None, document_chembl_ids=None, standard_relation=None, standard_type=None, standard_units=None, confidence_scores=(9, 8), assay_types=('B', 'F'), chembl_release=None, additional_fields=None, prefix=None, version=None, calculate_pchembl=False, curate_annotation_errors=True, require_document_date=False, backend='downloader', value_col='pchembl_value')[source]¶
Perform the first step of the bioactivity data retrieval workflow. These are:
1. Get ChEMBL data using any of the input identifiers: molecule_chembl_ids, target_chembl_ids, assay_chembl_ids, or document_chembl_ids. Additional filters are supported by other input parameters.
2. Once the data is retrieved, the bioactivities are processed according to the calculate_pchembl parameter. ChEMBL calculates pChEMBL values for activity data that meet the following criteria:
“standard_type” in [“IC50”, “XC50”, “EC50”, “AC50”, “Ki”, “Kd”, “Potency”, “ED50”];
“standard_relation” == “=” & “standard_units” == “nM”;
“standard_value” > 0 & (“data_validity_comment” is null | “data_validity_comment” == “Manually validated”).
By setting this parameter to True, pchembl values will be calculated for bioactivities reported in nM, µM or uM standard_units, or with a -Log|Log standard_type.
3. The first quality filter is applied. Data containing “data_validity_description” or “potential_duplicate” flags are immediately removed from the DataFrame.
- Parameters:
molecule_chembl_ids (Union[list, str, None]) – list of ChEMBL molecule IDs to fetch data for. Defaults to None.
target_chembl_ids (Union[list, str, None]) – list of ChEMBL target IDs to fetch data for. Defaults to None.
assay_chembl_ids (Union[list, str, None]) – list of ChEMBL assay IDs to fetch data for. Defaults to None.
document_chembl_ids (Union[list, str, None]) – list of ChEMBL document IDs to fetch data for. Defaults to None.
standard_relation (Optional[List[str]]) – Optional filter for standard relation types (e.g., [“=”, “<”, “>”])
standard_type (Optional[List[str]]) – Optional filter for activity types (e.g., [“IC50”, “Ki”, “EC50”])
confidence_scores (Union[list, Tuple]) – list of confidence scores to filter the fetched assay data. Defaults to (9, 8).
assay_types (Union[list, Tuple]) – list of assay types to be fetched from ChEMBL. Defaults to binding (B) and functional (F) data.
chembl_release (Optional[int]) – Not to be confused with version. This is the ChEMBL release number used to filter the data. Defaults to None.
additional_fields (Optional[List[str]]) – backend==“downloader” only! Optional list of additional fields to include in the sql query. E.g.: [“vs.sequence”], to retrieve the sequence of the variant, if available. Defaults to None.
prefix (Optional[Sequence[str]]) – backend==“downloader” only! Prefix to be used by pystow for storing the data in a custom directory. Defaults to None.
version (Union[int, str, None]) – backend==“downloader” only! Version of the ChEMBL database to be downloaded by chembl_downloader. If left as None, the latest version will be downloaded. Defaults to None.
curate_annotation_errors (bool) – Whether to apply activity curation based on pChEMBL values differing by exactly 3.0 (which indicates possible annotation errors). Defaults to True.
calculate_pchembl (bool) – calculate pChEMBL values for bioactivities reported in nM, µM or uM standard_units, or with a -Log|Log standard_type. Defaults to False.
backend (Literal['downloader', 'webresource']) – the backend to be used for fetching the data. If downloader, the ChEMBL sql database is downloaded and extracted first. Defaults to “downloader”.
value_col (str) – Column name for values to be used during aggregation. When set to “standard_value”, pchembl filtering is skipped and pchembl values are only calculated opportunistically for compatible units. Defaults to “pchembl_value”.
- Raises:
BioactivitiesNotFoundError – If the retrieved bioactivity dataframe is empty.
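The pChEMBL criteria listed in step 2 can be read as a single boolean predicate. The following sketch expresses them over a plain dict; `meets_pchembl_criteria` is a hypothetical name, and the record layout simply mirrors the column names quoted in the text.

```python
PCHEMBL_TYPES = {"IC50", "XC50", "EC50", "AC50", "Ki", "Kd", "Potency", "ED50"}


def meets_pchembl_criteria(record: dict) -> bool:
    """True when a record satisfies the criteria quoted above for having a pChEMBL value."""
    value = record.get("standard_value")
    comment = record.get("data_validity_comment")
    return (
        record.get("standard_type") in PCHEMBL_TYPES
        and record.get("standard_relation") == "="
        and record.get("standard_units") == "nM"
        and value is not None
        and value > 0
        and (comment is None or comment == "Manually validated")
    )


exact_ic50 = {"standard_type": "IC50", "standard_relation": "=",
              "standard_units": "nM", "standard_value": 250.0,
              "data_validity_comment": None}
censored_ic50 = {**exact_ic50, "standard_relation": ">"}

has_pchembl = meets_pchembl_criteria(exact_ic50)
needs_calculation = not meets_pchembl_criteria(censored_ic50)
```

The censored record fails the relation check, which illustrates why calculate_pchembl matters when censored measurements are requested.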
- Capricho.chembl.processing.process_bioactivities(bioactivities_df, calculate_pchembl=True, curate_annotation_errors=True, require_document_date=False, value_col='pchembl_value')[source]¶
Processes the bioactivities DataFrame. Will convert the standard_value column to pchembl_value column if the standard_units are in mM, µM, uM, or nM. If the standard_units are in log, the original value in the pchembl_value is preserved.
- Parameters:
bioactivities_df (DataFrame) – bioactivity dataframe, e.g., the output from get_activity_table.
calculate_pchembl (bool) – Whether to calculate pChEMBL values. When aggregating on standard_value (value_col != “pchembl_value”), this enables calculation of pchembl_value for compatible units (e.g., nM, µM, mM). Although pChEMBL values are available for high-quality data, censored data does not have them readily available, so they need to be calculated.
curate_annotation_errors (bool) – Whether to apply activity curation based on pChEMBL values differing by exactly 3.0 (which indicates possible annotation errors). Defaults to True.
require_document_date (bool) – Whether to filter out activities without a document year.
value_col (str) – Column name for aggregation values. Defaults to “pchembl_value”. When set to “standard_value”, pchembl_value filtering is skipped.
- Returns:
the processed bioactivities DataFrame.
- Return type:
pd.DataFrame
Data Quality Flags¶
Functions that flag (rather than remove) problematic data entries.
Collection of functions to flag compounds based on specific criteria.
Not all the functions are annotating compounds to be removed from the dataset. Some are used to annotate processing steps that occurred during the data processing pipeline, like:
Salt/solvent removal (applied to the canonical SMILES since they’re kept as-is by CompoundMapper)
Calculated pChEMBL value (used when this measure is absent in ChEMBL and calculated from nM … etc readouts)
Potential duplicates (when one is found, CompoundMapper will keep only one of those in the dataset. If the user wants to investigate, please set the keep_duplicates flag to True)
- Capricho.chembl.data_flag_functions.flag_calculated_pchembl(df)[source]¶
Marks rows where ‘calculated_pchembl’ is True.
- Return type:
- Capricho.chembl.data_flag_functions.flag_censored_activity_comment(df)[source]¶
Mark activities with activity_comment indicating censored/inactive data but standard_relation=’=’.
ChEMBL contains many activities where the activity_comment field indicates the compound was inactive, inconclusive, or not tested, but the standard_relation is incorrectly set to ‘=’.
The standard_value represents a concentration (e.g., IC50 in nM), and pChEMBL = -log10(standard_value_in_M). When a compound is marked as “inactive” at a given concentration, it means:
- The true IC50 (standard_value) is GREATER THAN the tested concentration
- The true pChEMBL value is LESS THAN the reported pChEMBL
Therefore, the standard_relation should be ‘<’ for pChEMBL values (or ‘>’ for standard_value). This function corrects the relation to ‘<’ since we work with pChEMBL values.
- Return type:
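The sign flip follows directly from -log10 being a decreasing function. A small numeric check, using invented concentrations and a hypothetical relation lookup table, shows why an "inactive at 10 µM" result becomes "<" on the pChEMBL scale:

```python
import math


def pchembl_from_molar(concentration_molar: float) -> float:
    """pChEMBL = -log10(concentration in M)."""
    return -math.log10(concentration_molar)


tested_concentration = 10e-6  # compound reported inactive at 10 µM
reported_pchembl = pchembl_from_molar(tested_concentration)  # approximately 5

# Any true IC50 above the tested concentration maps to a pChEMBL below the reported one.
hypothetical_true_ic50s = [20e-6, 50e-6, 1e-3]
all_below_reported = all(
    pchembl_from_molar(c) < reported_pchembl for c in hypothetical_true_ic50s
)

# Relation on the concentration scale -> relation on the -log scale.
RELATION_ON_LOG_SCALE = {">": "<", "<": ">", "=": "="}
```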
- Capricho.chembl.data_flag_functions.flag_incompatible_units(df)[source]¶
Mark activities with units that cannot be converted to pChEMBL in the dropping comment.
This function flags activities with standard_units that are incompatible with pChEMBL calculation (i.e., not nM, µM, uM, or mM). These activities will have pchembl_value=NaN and are flagged for transparency.
- Parameters:
df (DataFrame) – DataFrame to be processed.
- Returns:
DataFrame with incompatible units flagged in dropping comment.
- Return type:
pd.DataFrame
- Capricho.chembl.data_flag_functions.flag_insufficient_assay_overlap(df, min_overlap=0, molecule_col='molecule_chembl_id', assay_col='assay_chembl_id', target_col='target_chembl_id', comment_col='data_dropping_comment')[source]¶
Mark activities from assay pairs (for the same target) that don’t meet the minimum compound overlap criterion. Depending on the target, this filter can remove a significant number of activities from the dataset, but it is useful for assessing the comparability of the assays reported in ChEMBL.
This function calculates overlap across ALL assays regardless of size flags, following CAPRICHO’s principle of transparency. Overlap is only counted when:
1. Compounds have DIFFERENT pChEMBL values across assays (same values indicate annotation errors)
2. The difference is not exactly 3.0 or 6.0 log units (likely censored/inactive measurements)
3. Assays are from DIFFERENT documents (same-document overlaps are excluded)
- Parameters:
df (DataFrame) – DataFrame to be processed.
min_overlap (int) – Minimum number of overlapping compounds required. Defaults to 0.
molecule_col (str) – Name of the molecule identifier column. Defaults to molecule_chembl_id.
assay_col (str) – Name of the assay identifier column. Defaults to assay_chembl_id.
target_col (str) – Name of the target identifier column. Defaults to target_chembl_id.
comment_col (str) – Name of the column to store dropping comments. Defaults to data_dropping_comment.
- Returns:
DataFrame with activities from low-overlap assay pairs flagged.
- Return type:
pd.DataFrame
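The three counting rules can be sketched for a pair of assays as follows. This is a simplified illustration on toy data for a single target: `count_overlap` and `low_overlap_pairs` are hypothetical helpers, the float tolerance is an assumption, and same-document pairs are given an overlap of zero here, which may not match how the library treats them.

```python
import math
from itertools import combinations

# Toy assays for one target: document of origin plus {molecule: pchembl}.
assays = {
    "ASSAY_A": {"document": "DOC_1", "values": {"M1": 6.0, "M2": 7.0, "M3": 5.0, "M4": 8.0}},
    "ASSAY_B": {"document": "DOC_2", "values": {"M1": 6.5, "M2": 7.0, "M3": 8.0, "M4": 7.2}},
    "ASSAY_C": {"document": "DOC_1", "values": {"M1": 6.2}},
}


def count_overlap(a, b, excluded_offsets=(3.0, 6.0), tol=1e-6):
    """Count shared compounds that satisfy the three rules listed above."""
    if a["document"] == b["document"]:
        return 0  # rule 3: same-document overlaps are excluded
    count = 0
    for molecule in a["values"].keys() & b["values"].keys():
        diff = abs(a["values"][molecule] - b["values"][molecule])
        if diff == 0:
            continue  # rule 1: identical values are not counted
        if any(math.isclose(diff, offset, abs_tol=tol) for offset in excluded_offsets):
            continue  # rule 2: differences of 3.0 or 6.0 are not counted
        count += 1
    return count


def low_overlap_pairs(assays, min_overlap):
    """Return the assay pairs whose counted overlap falls below min_overlap."""
    return [
        (x, y)
        for x, y in combinations(sorted(assays), 2)
        if count_overlap(assays[x], assays[y]) < min_overlap
    ]


flagged = low_overlap_pairs(assays, min_overlap=2)
```

In the toy data, ASSAY_A and ASSAY_B share four compounds but only two of them count, since one pair is identical and another differs by exactly 3.0.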
- Capricho.chembl.data_flag_functions.flag_inter_document_duplication(df, key_subset=['molecule_chembl_id', 'standard_smiles', 'canonical_smiles', 'pchembl_value', 'standard_relation', 'target_chembl_id', 'mutation', 'target_organism'], diff_subset=['document_chembl_id'])[source]¶
Marks rows with a potential duplication after SMILES standardization & salt removal.
This function only flags duplicates for discrete measurements (standard_relation=’=’). Censored measurements (e.g., ‘<’, ‘>’) are not flagged as duplicates since the same bound can be independently reached in different studies without indicating true duplication.
- Parameters:
df (DataFrame) – DataFrame to be processed.
key_subset (list[str]) – metadata columns used for identifying duplicates. Defaults to a list of columns typically used to identify a compound readout.
diff_subset (Optional[list[str]]) – optional metadata columns used to identify duplicates across different documents. If None, it identifies only based on key_subset. Defaults to a list containing ‘document_chembl_id’, which is used to identify duplicates across different documents.
- Returns:
- DataFrame with duplicates marked in data_processing_comment
(or data_dropping_comment if comment_type=’d’ was used).
- Return type:
pd.DataFrame
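One way to picture the key_subset / diff_subset logic is: rows sharing the same key values but appearing in more than one document are duplicates, provided the measurement is discrete. The sketch below uses a hypothetical helper and a shortened key of three columns; it is not the library's implementation.

```python
from collections import defaultdict

rows = [
    {"molecule_chembl_id": "M1", "pchembl_value": 6.0, "standard_relation": "=", "document_chembl_id": "D1"},
    {"molecule_chembl_id": "M1", "pchembl_value": 6.0, "standard_relation": "=", "document_chembl_id": "D2"},
    {"molecule_chembl_id": "M1", "pchembl_value": 5.0, "standard_relation": "<", "document_chembl_id": "D1"},
    {"molecule_chembl_id": "M1", "pchembl_value": 5.0, "standard_relation": "<", "document_chembl_id": "D2"},
    {"molecule_chembl_id": "M2", "pchembl_value": 7.0, "standard_relation": "=", "document_chembl_id": "D1"},
]


def flag_cross_document_duplicates(rows, key_cols, doc_col="document_chembl_id"):
    """Return one boolean per row: True for discrete readouts repeated across documents."""
    documents_per_key = defaultdict(set)
    for row in rows:
        if row["standard_relation"] == "=":  # censored rows are never flagged
            documents_per_key[tuple(row[c] for c in key_cols)].add(row[doc_col])

    return [
        row["standard_relation"] == "="
        and len(documents_per_key[tuple(row[c] for c in key_cols)]) > 1
        for row in rows
    ]


flags = flag_cross_document_duplicates(
    rows, key_cols=["molecule_chembl_id", "pchembl_value", "standard_relation"]
)
```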
- Capricho.chembl.data_flag_functions.flag_max_assay_size(df, max_assay_size=None)[source]¶
Mark assays for removal based on size greater than the specified maximum assay size.
- Return type:
- Capricho.chembl.data_flag_functions.flag_min_assay_size(df, min_assay_size=0)[source]¶
Mark assays for removal based on size lower than the specified minimum assay size.
- Return type:
- Capricho.chembl.data_flag_functions.flag_missing_canonical_smiles(df)[source]¶
Marks rows where ‘canonical_smiles’ is missing.
- Return type:
- Capricho.chembl.data_flag_functions.flag_missing_document_date(df)[source]¶
Mark activities that lack a document date (year) in the processing comment.
This function always flags missing document dates for transparency, regardless of whether they will be filtered out. Activities without document dates are flagged in the data_processing_comment column so users can see which data points lack temporal information.
- Parameters:
df (DataFrame) – DataFrame to be processed.
- Returns:
DataFrame with activities lacking document dates flagged in processing comment.
- Return type:
pd.DataFrame
- Capricho.chembl.data_flag_functions.flag_missing_standard_smiles(df)[source]¶
Marks rows where ‘standard_smiles’ is missing.
- Return type:
- Capricho.chembl.data_flag_functions.flag_potential_duplicate(df)[source]¶
Marks rows where ‘potential_duplicate’ is 0.
- Return type:
- Capricho.chembl.data_flag_functions.flag_salt_or_solvent_removal(df)[source]¶
Marks rows with a salt/mixture on the canonical SMILES (not modified by CompoundMapper)
- Return type:
- Capricho.chembl.data_flag_functions.flag_strict_mutant_assays(df, strict_mutant_removal=False)[source]¶
Mark assays for removal if their description contains mutant-related keywords and strict_mutant_removal is True.
- Return type:
- Capricho.chembl.data_flag_functions.flag_to_remove_mixture_compounds(df)[source]¶
Marks rows where ‘mixture_compounds’ is True.
- Return type:
- Capricho.chembl.data_flag_functions.flag_undefined_stereochemistry(df)[source]¶
Mark compounds with undefined stereochemistry based on a predefined boolean mask.
- Return type:
- Capricho.chembl.data_flag_functions.flag_unit_conversion(df)[source]¶
Mark rows where unit conversion was applied to standardize measurements.
This function flags activities that had their standard_value and standard_units converted to a common unit by unit conversion functions (e.g., convert_permeability_units, convert_molar_concentration_units, etc.). The conversion_factor column (added during conversion) is used to identify which rows were converted.
If an ‘original_unit’ column exists (added by newer conversion functions), it will be used to create a more informative comment showing the original -> target unit transformation. Both conversion_factor and original_unit columns are removed after flagging.
This is a processing flag (comment_type=’p’) to document data transformations for transparency.
- Parameters:
df (DataFrame) – DataFrame to be processed. Must contain ‘conversion_factor’ column if unit conversion was applied. May optionally contain ‘original_unit’ for more detailed flagging.
- Returns:
DataFrame with converted rows flagged in data_processing_comment. The conversion_factor and original_unit columns are removed after flagging.
- Return type:
pd.DataFrame
- Capricho.chembl.data_flag_functions.flag_with_data_validity_comment(df)[source]¶
Marks rows where ‘data_validity_comment’ is present (not NA).
- Return type:
- Capricho.chembl.data_flag_functions.flag_zero_values(df, column='standard_value')[source]¶
Mark rows where the measurement value is exactly zero.
Zero values in bioactivity measurements are typically data quality issues - they may represent values below the limit of detection, data entry errors, or rounding artifacts. This flag helps identify such problematic data points.
Analysis Tools¶
Tools for data quality analysis and comparability studies.
Analysis utilities for CAPRICHO bioactivity data comparability studies
- Capricho.analysis.explode_assay_comparability(subset, sep_str='|', extra_multival_cols=None, value_column='pchembl_value')[source]¶
Explode dataset to create pairwise comparisons between assays for the same compound.
Takes a subset of data where compounds have measurements across multiple assays (indicated by separator in columns) and creates all pairwise combinations for comparability analysis.
- Parameters:
subset (DataFrame) – DataFrame with multi-valued columns separated by sep_str.
sep_str (str) – Separator string used to delimit multiple values in columns.
extra_multival_cols (Optional[list[str]]) – Additional columns to treat as multi-valued.
value_column (str) – The column containing activity values to compare. Defaults to “pchembl_value” but can be set to “standard_value” for non-pChEMBL data (e.g., Caco-2 permeability, percent inhibition).
- Return type:
- Returns:
DataFrame with exploded pairwise comparisons, with _x and _y suffixes for each pair.
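The "explosion" into pairwise comparisons can be pictured with a single aggregated row. The sketch below is a stdlib stand-in: `explode_pairs` is a hypothetical helper that handles one row with one value column and one identifier column, whereas the real function operates on whole DataFrames.

```python
from itertools import combinations


def explode_pairs(row, value_column="pchembl_value", id_column="assay_chembl_id", sep="|"):
    """Split multi-valued fields and return every pair of measurements with _x/_y suffixes."""
    values = [float(v) for v in row[value_column].split(sep)]
    assays = row[id_column].split(sep)
    pairs = []
    for (assay_x, value_x), (assay_y, value_y) in combinations(zip(assays, values), 2):
        pairs.append({
            f"{id_column}_x": assay_x,
            f"{id_column}_y": assay_y,
            f"{value_column}_x": value_x,
            f"{value_column}_y": value_y,
        })
    return pairs


aggregated_row = {"pchembl_value": "6.1|6.4|7.0", "assay_chembl_id": "A1|A2|A3"}
pairs = explode_pairs(aggregated_row)
```

Three measurements yield three pairs; in general n measurements yield n(n-1)/2, so heavily measured compounds dominate the exploded table.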
- Capricho.analysis.plot_multi_panel_comparability(exploded_subset, comments, title='Comparability Across Flagged Data', figsize=(20, 8), ncols=5, value_column='pchembl_value', log_transform=False, log_scale_factor=1.0, axis_label=None, axis_limits=None, reference_lines=True, units=None, alpha=0.3)[source]¶
Create multi-panel plot showing comparability for different data quality flags.
- Parameters:
exploded_subset (DataFrame) – DataFrame from explode_assay_comparability().
title (str) – Overall figure title.
figsize (Tuple[float, float]) – Figure size as (width, height) tuple.
ncols (int) – Number of columns in subplot grid.
value_column (str) – Base name of the value column (without _x/_y suffix). Defaults to “pchembl_value”.
log_transform (bool) – If True, apply -log10 transformation to values before plotting. Use for concentration values (nM, µM, etc.) but NOT for percentages or pChEMBL.
log_scale_factor (float) – Scale factor for the unit of measurement (e.g., 1e-6 for values in 10^-6 cm/s units). The transformation becomes -log10(value * factor).
axis_label (Optional[str]) – Custom axis label. If None, uses format_axis_label() to generate a LaTeX-formatted label based on value_column, log_transform, and units.
axis_limits (Optional[Tuple[float, float]]) – Tuple of (min, max) for both axes. If None, defaults to (3, 12) for pchembl_value, or auto-detects from data.
reference_lines (bool) – If True, draw identity and ±1/±0.3 reference lines.
units (Optional[str]) – Unit string for axis labels (e.g., “10^-6 cm/s”). Converted to LaTeX format automatically. Only used when axis_label is None.
- Return type:
- Returns:
Tuple of (figure, axes array).
- Capricho.analysis.plot_subset(subset, title='', color='slategray', alpha=0.3, figsize=(5, 5), value_column='pchembl_value', log_transform=False, log_scale_factor=1.0, axis_label=None, axis_limits=None, reference_lines=True, units=None)[source]¶
Create scatter plot comparing values across assays with correlation metrics.
- Parameters:
subset (DataFrame) – DataFrame with {value_column}_x and {value_column}_y columns.
title (str) – Plot title.
color (str) – Color for scatter points.
alpha (float) – Transparency for scatter points.
figsize (Tuple[float,float]) – Figure size as (width, height) tuple.
value_column (str) – Base name of the value column (without _x/_y suffix). Defaults to “pchembl_value”.
log_transform (bool) – If True, apply -log10 transformation to values before plotting. Use for concentration values (nM, µM, etc.) but NOT for percentages or pChEMBL.
log_scale_factor (float) – Scale factor for the unit of measurement (e.g., 1e-6 for values in 10^-6 cm/s units). The transformation becomes -log10(value * factor). For example, a value of 5 in 10^-6 cm/s units with factor=1e-6 gives -log10(5e-6) ≈ 5.3, producing pChEMBL-like positive values.
axis_label (Optional[str]) – Custom axis label. If None, uses format_axis_label() to generate a LaTeX-formatted label based on value_column, log_transform, and units.
axis_limits (Optional[Tuple[float,float]]) – Tuple of (min, max) for both axes. If None, defaults to (3, 12) for pchembl_value, (0, 100) for percentage data, or auto-detects from data.
reference_lines (bool) – If True, draw identity and ±1/±0.3 reference lines. These are most meaningful for pChEMBL-scale data.
units (Optional[str]) – Unit string for axis labels (e.g., “10^-6 cm/s”). Converted to LaTeX format automatically. Only used when axis_label is None.
- Return type:
Tuple[Figure,Axes]
- Returns:
Tuple of (figure, axes) objects.
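The log_transform and log_scale_factor parameters describe a plain -log10(value * factor) conversion. A minimal sketch of that arithmetic follows; the helper name `neg_log10` is hypothetical and is not part of CAPRICHO, and rejecting non-positive input is an assumption of this sketch.

```python
import math


def neg_log10(value: float, log_scale_factor: float = 1.0) -> float:
    """Return -log10(value * log_scale_factor)."""
    if value <= 0:
        # A logarithm is undefined for zero or negative measurements.
        raise ValueError("value must be positive to take a logarithm")
    return -math.log10(value * log_scale_factor)


# A value of 5 reported in 10^-6 cm/s units.
transformed = neg_log10(5, log_scale_factor=1e-6)
print(round(transformed, 2))
```

This reproduces the ≈ 5.3 figure quoted for log_scale_factor in the plot_subset parameters.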
Core Utilities¶
Statistical Aggregation¶
Module containing helper functions for processing repeated elements in a DataFrame
- Capricho.core.stats_make.process_repeat_mols(df, repeat_element_idxs, solve_strat='keep', multiple_value_cols=('standard_smiles', 'canonical_smiles', 'pchembl_value', 'standard_value', 'standard_units', 'assay_chembl_id', 'assay_description', 'activity_id', 'assay_type', 'standard_type', 'confidence_score', 'standard_relation', 'target_organism', 'molecule_chembl_id', 'document_chembl_id', 'assay_tissue', 'assay_cell_type', 'relationship_type', 'max_phase', 'oral', 'prodrug', 'withdrawn_flag'), extra_id_cols=[], extra_multival_cols=[], chirality=False, aggregate_mutants=False, value_col='pchembl_value')[source]¶
Process the DataFrame according to the repeated elements identified by the function find_repeated_arr_from_series. By default, molecules with the same fingerprint representation are treated as a single entity and have their values aggregated. Upon aggregation, if the min and max values differ by 1 or more log units, those samples are removed from the dataset when solve_strat is ‘drop’. Otherwise, values are aggregated and a new boolean column, might_rancemic, is assigned, indicating whether the molecule might be racemic.
- Parameters:
df (DataFrame) – DataFrame with the bioactivity data.
repeat_element_idxs (List[List[int]]) – list of indices of repeated elements in the DataFrame.
solve_strat (str) – strategy to solve the repeated elements. If ‘drop’, data points whose values differ by 1 or more log units are dropped. If ‘keep’, no values are dropped.
extra_id_cols (List[str]) – list of extra identification columns you might have for your own compounds that you’d like to use to avoid mixing data and to keep in the final DataFrame. Defaults to [].
extra_multival_cols (List[str]) – list of extra columns that you’d like to keep as aggregated values in the final DataFrame. Caveat: these columns will be displayed as strings separated by | (pipe) in the final DataFrame. Defaults to [].
chirality (bool) – boolean flag indicating whether the fingerprints used to check for identical compounds are chirality-sensitive. Defaults to False.
- Returns:
dataframe with the repeated elements processed.
- Return type:
df
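The two solve_strat options can be illustrated with a small sketch over one group of repeated measurements. It is an illustration only: the helper `aggregate_repeats` is hypothetical, it uses the arithmetic mean as the aggregate, and it ignores the identifier and multi-value column handling described above.

```python
from statistics import mean
from typing import Optional


def aggregate_repeats(values: list[float], solve_strat: str = "keep") -> Optional[float]:
    """Aggregate repeated measurements; return None when the group is dropped."""
    spread = max(values) - min(values)
    # Groups spanning 1 or more log units are discarded only under 'drop'.
    if solve_strat == "drop" and spread >= 1.0:
        return None
    return mean(values)


print(aggregate_repeats([6.0, 6.4]))
print(aggregate_repeats([5.0, 6.5], solve_strat="drop"))
print(aggregate_repeats([5.0, 6.5], solve_strat="keep"))
```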
- Capricho.core.stats_make.repeated_indices_from_IDs_df(df, columns)[source]¶
Find repeated indices for given columns in a DataFrame.
- Parameters:
df (pd.DataFrame) – The DataFrame to search for repeats.
columns (list) – List of column names with external (non-chembl) IDs to identify repeats across. E.g.: [“JUMP_ID”, “target_chembl_id”]
- Returns:
A list of lists containing indices of repeated rows based on specified columns.
- Return type:
DataFrame Helpers¶
Module containing helper functions for manipulating pandas DataFrames
- Capricho.core.pandas_helper.add_comment(df, comment, criteria_func=None, target_column=None, comment_type='d')[source]¶
- Marks rows in a DataFrame based on a given criteria or the entire DataFrame, adding a comment to:
data_dropping_comment if comment_type == d (drop)
data_processing_comment if comment_type == p (process).
- Parameters:
df (pd.DataFrame) – The input DataFrame.
comment (str) – The comment to add to the comment column for marked rows.
criteria_func (callable, optional) – A function that takes a pandas Series and returns a boolean Series. E.g.: pd.isna, lambda x: x == ‘invalid’, lambda x: x < 0. It’s required if target_column is specified. If target_column is None, this argument is ignored and the comment is applied to all rows.
target_column (str, optional) – The name of the column to apply the criteria_func to. If None, the comment is applied to all rows of the DataFrame.
comment_type (Literal["p", "d"]) – The type of comment to add. ‘p’ for data processing comment, ‘d’ for data dropping comment. Defaults to ‘d’.
- Returns:
The DataFrame with the specified comment column added/updated.
- Return type:
pd.DataFrame
- Capricho.core.pandas_helper.save_dataframe(df, path, compression='infer')[source]¶
Saves a DataFrame to a file with optional compression.
This function determines the file format from the file extension and uses the appropriate pandas function to save the DataFrame.
- Parameters:
df (DataFrame) – The DataFrame to be saved.
path (Union[Path,str]) – The file path where the DataFrame will be saved. The file extension determines the format (.csv, .tsv, .parquet).
compression (Optional[str]) – The compression format to use. For CSV/TSV, the default is ‘infer’, which deduces the compression from the file extension (e.g., ‘.gz’, ‘.zip’). For Parquet, if ‘infer’ is passed, it defaults to ‘snappy’. Use None for no compression.
- Return type:
Binarization¶
Contains functions for binarizing bioactivity data, handling censored data, and validating agreement between discrete and censored measurements.
- Capricho.core.binarization.binarize_aggregated_data(df, threshold=6.0, value_column='pchembl_value_mean', compound_id_col='connectivity', target_id_col='target_chembl_id', relation_col='standard_relation', output_binary_col='activity_binary', compare_across_mutants=False, conflict_report_path=None, conflict_resolution=None)[source]¶
Binarize aggregated bioactivity data based on activity threshold and standard_relation.
This function converts continuous pchembl values to binary labels (0=inactive, 1=active) while properly handling censored measurements and approximate values, and validating agreement between discrete and censored measurements for the same compound-target pair.
Key logic:
- standard_relation “=”: compare value to threshold directly
- standard_relation “~”: approximate (±0.5 log units); uses lower bound for conservative classification
- standard_relation “<”, “<<” (low concentration): if pchembl >= threshold → active (1)
- standard_relation “>”, “>>” (high concentration): if pchembl <= threshold → inactive (0)
- Mixed relations: validate agreement and flag conflicts
- Parameters:
df (DataFrame) – Aggregated DataFrame from aggregate_data() with pchembl statistics
threshold (float) – Activity threshold for binarization (default 6.0 = 1 µM)
value_column (str) – Which aggregated column to use (default: “pchembl_value_mean”)
compound_id_col (str) – Column identifying compounds (default: “connectivity”)
target_id_col (str) – Column identifying targets (default: “target_chembl_id”)
relation_col (str) – Column with standard_relation values (default: “standard_relation”)
output_binary_col (str) – Name for output binary column (default: “activity_binary”)
compare_across_mutants (bool) – If False (default), different mutations are treated as separate compound-target pairs for conflict detection. If True, measurements on different mutants are compared and flagged if they disagree.
conflict_report_path (Union[str,Path,None]) – Optional path to save detailed conflict report as JSON
conflict_resolution (Optional[str]) – Strategy for resolving conflicts. One of:
- None (default): flag only, keep all rows
- “drop”: remove all rows for conflicting pairs
- “relation”: keep ‘=’ rows, drop censored; fall back to drop if no ‘=’
- “confidence”: keep row with highest confidence_score; drop all on tie
- “majority”: keep rows matching majority binary label; drop all on tie
- Return type:
- Returns:
DataFrame with binary activity labels, pchembl_relation column, and conflict flags
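The per-row decision rules listed under “Key logic” can be sketched as a single function. This is a simplified reading, not the library's code: the name `binarize` is hypothetical, treating “=” as active when the value is greater than or equal to the threshold is an assumption, subtracting 0.5 from “~” values is one interpretation of “uses lower bound”, and returning None for censored values that cannot be classified is also an assumption. Conflict detection across rows is not covered.

```python
from typing import Optional


def binarize(pchembl: float, relation: str, threshold: float = 6.0) -> Optional[int]:
    """Return 1 (active), 0 (inactive), or None when the label is undetermined."""
    if relation == "=":
        # Assumption: values equal to the threshold count as active.
        return int(pchembl >= threshold)
    if relation == "~":
        # Assumption: classify on the lower bound of the ±0.5 log unit window.
        return int(pchembl - 0.5 >= threshold)
    if relation in ("<", "<<"):
        # Concentration is below the reported value, so potency is at least pchembl.
        return 1 if pchembl >= threshold else None
    if relation in (">", ">>"):
        # Concentration is above the reported value, so potency is at most pchembl.
        return 0 if pchembl <= threshold else None
    raise ValueError(f"unsupported relation: {relation!r}")


print(binarize(6.5, "<"))  # censored but already past the threshold
print(binarize(5.0, ">"))  # censored and still below the threshold
print(binarize(5.0, "<"))  # censored value that cannot be classified
```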
- Capricho.core.binarization.invert_relation_for_pchembl(relation)[source]¶
Inverts comparison relation for pchembl values.
Since pchembl = -log10(Molar), higher pchembl = more active (lower concentration). Therefore, standard_relation directions must be inverted:
- standard_relation “<” (low concentration, active) → pchembl “>” (high value, active)
- standard_relation “>” (high concentration, inactive) → pchembl “<” (low value, inactive)
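A lookup table is enough to express this inversion. The sketch below is illustrative: the name `invert_relation` is hypothetical, and both the handling of “<<”/“>>” and the pass-through of undirected relations such as “=” and “~” are assumptions rather than documented behaviour.

```python
# Directed relations and their counterparts on the -log10 scale.
_INVERTED = {"<": ">", "<<": ">>", ">": "<", ">>": "<<"}


def invert_relation(relation: str) -> str:
    """Flip the censoring direction; undirected relations pass through unchanged."""
    return _INVERTED.get(relation, relation)


print([invert_relation(r) for r in ("<", ">", "=", "~")])
```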
- Capricho.core.binarization.save_conflict_report(conflict_details, output_path, threshold, total_rows=0, active_count=0, inactive_count=0, mcc=0.0, resolution_details=None, conflict_resolution=None)[source]¶
Save conflict report to JSON file.
- Parameters:
conflict_details (list[dict]) – List of conflict detail dictionaries
threshold (float) – Binarization threshold used
total_rows (int) – Total number of rows in the DataFrame
active_count (int) – Number of active rows
inactive_count (int) – Number of inactive rows
mcc (float) – Matthews Correlation Coefficient
resolution_details (Optional[list[dict]]) – Resolution details from _resolve_conflicts
conflict_resolution (Optional[str]) – Strategy name used for resolution
- Return type:
Backends¶
Local SQL Backend¶
Module holding functionalities for the ChEMBL API using chembl downloader as the backend.
- Capricho.chembl.api.downloader.check_and_download_chembl_db(prefix=None, version=None)[source]¶
Check if the ChEMBL database is present. Download and extract it if not. After extraction, remove the tarball to free up space. This method is also used to assert the correct downloaded ChEMBL version is used across different query functions.
- Parameters:
prefix (Optional[Sequence[str]]) – Optional prefix for an alternative data directory with path components passed as a list of strings. If passed, will create a new configuration file under ~/.data/chembl_downloader_config_{version}.json pointing to the new data directory. Defaults to None.
version (Union[int,str,None]) – Optional ChEMBL version to download. If not provided, will download the latest available version. Defaults to None.
- Return type:
- Returns:
Path to the ChEMBL SQLite database
- Capricho.chembl.api.downloader.get_activity_table_sql(molecule_chembl_ids=None, target_chembl_ids=None, assay_chembl_ids=None, document_chembl_ids=None, prefix=None, version=None, **kwargs)[source]¶
Get bioactivity data from ChEMBL using SQL backend.
- Parameters:
molecule_chembl_ids (Optional[List[str]]) – list of molecule ChEMBL IDs. Defaults to None.
target_chembl_ids (Optional[List[str]]) – list of target ChEMBL IDs. Defaults to None.
assay_chembl_ids (Optional[List[str]]) – list of assay ChEMBL IDs. Defaults to None.
document_chembl_ids (Optional[List[str]]) – list of document ChEMBL IDs. Defaults to None.
prefix (Optional[Sequence[str]]) – Optional prefix for an alternative data directory.
version (Union[int,str,None]) – Optional ChEMBL version to use.
**kwargs – Additional filtering parameters, e.g.: standard_relation=[“=”]
- Returns:
a DataFrame with bioactivity data and the parameters used.
- Return type:
Tuple[pd.DataFrame, dict]
- Capricho.chembl.api.downloader.get_assay_size_sql(assay_chembl_ids, prefix=None, version=None)[source]¶
Get the number of distinct molecules for a list of ChEMBL assay IDs.
- Parameters:
- Returns:
a DataFrame with assay_chembl_id and assay_size.
- Return type:
pd.DataFrame
- Capricho.chembl.api.downloader.get_assay_table_sql(assay_chembl_ids=None, confidence_scores=None, assay_types=None, prefix=None, version=None, **kwargs)[source]¶
Get assay information from ChEMBL using SQL backend.
- Parameters:
assay_chembl_ids (Optional[List[str]]) – list of assay ChEMBL IDs. If None, all assays are fetched. Defaults to None.
confidence_scores (Optional[List[int]]) – list of confidence scores to filter the assays. Defaults to None.
assay_types (Optional[List[str]]) – list of assay types to filter. Defaults to None.
prefix (Optional[Sequence[str]]) – Optional prefix for an alternative data directory.
version (Union[int,str,None]) – Optional ChEMBL version to use.
**kwargs – Additional filtering parameters not used in SQL implementation.
- Returns:
a DataFrame with assay information.
- Return type:
pd.DataFrame
- Capricho.chembl.api.downloader.get_compound_table_sql(molecule_chembl_ids=None, prefix=None, version=None, **kwargs)[source]¶
Get information on molecules from ChEMBL using SQL backend.
- Parameters:
- Returns:
a DataFrame with molecule information.
- Return type:
pd.DataFrame
- Capricho.chembl.api.downloader.get_document_table_sql(document_chembl_ids=None, prefix=None, version=None, **kwargs)[source]¶
Get publication details for a list of ChEMBL document IDs using SQL backend.
- Parameters:
- Returns:
a DataFrame with the publication details.
- Return type:
pd.DataFrame
- Capricho.chembl.api.downloader.get_full_activity_data_sql(molecule_chembl_ids=None, target_chembl_ids=None, assay_chembl_ids=None, document_chembl_ids=None, standard_relation=None, standard_type=None, standard_units=None, confidence_scores=(9, 8), assay_types=('B', 'F'), chembl_release=None, additional_fields=None, prefix=None, version=None)[source]¶
Retrieve ChEMBL bioactivity data from any combination of molecule, target, assay, or document IDs. Data is retrieved using the ChEMBL downloader. Merges are performed on the SQL query level and a DataFrame is returned with the bioactivity data.
- Parameters:
molecule_chembl_ids (Union[list,str,None]) – list of ChEMBL molecule IDs to fetch data for. Defaults to None.
target_chembl_ids (Union[list,str,None]) – list of ChEMBL target IDs to fetch data for. Defaults to None.
assay_chembl_ids (Union[list,str,None]) – list of ChEMBL assay IDs to fetch data for. Defaults to None.
document_chembl_ids (Union[list,str,None]) – list of ChEMBL document IDs to fetch data for. Defaults to None.
standard_relation (Optional[List[str]]) – Optional filter for standard relation types (e.g., [“=”, “<”, “>”])
standard_type (Optional[List[str]]) – Optional filter for activity types (e.g., [“IC50”, “Ki”, “EC50”])
confidence_scores (Union[list,Tuple]) – list of confidence scores to filter the fetched assay data. Defaults to (9, 8).
assay_types (Union[list,Tuple]) – list of assay types to be fetched from ChEMBL. Defaults to binding (B) and functional (F) data.
chembl_release (Optional[int]) – Not to be confused with version. This is the ChEMBL release number used to filter the data. Defaults to None.
additional_fields (Optional[List[str]]) – Optional list of additional fields to include in the SQL query. E.g.: [“vs.sequence”], to retrieve the sequence of the variant, if available. Defaults to None.
prefix (Optional[Sequence[str]]) – Optional prefix for an alternative data directory. If passed, will create a new configuration file under ~/.data/chembl_downloader_config_{version}.json pointing to the new data directory. Defaults to None.
version (Union[int,str,None]) – ChEMBL database to be downloaded and used by ChEMBL downloader. If not provided, will download the latest available version. Defaults to None.
- Returns:
a DataFrame with the bioactivity data.
- Return type:
pd.DataFrame
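Combining several optional ID and value filters at the SQL level usually comes down to building a parameterized WHERE clause. The sketch below shows that general technique with Python's built-in sqlite3 module. It makes no claim about how CAPRICHO builds its queries: the helper `build_filter` and the table `toy_activities` are hypothetical and do not reflect the ChEMBL schema.

```python
import sqlite3


def build_filter(filters: dict) -> tuple[str, list]:
    """Build a parameterized WHERE clause from column -> allowed-values filters."""
    clauses, params = [], []
    for column, values in filters.items():
        if not values:  # None or empty means "no filter on this column"
            continue
        placeholders = ", ".join("?" for _ in values)
        # Column names come from trusted code; only the values are bound.
        clauses.append(f"{column} IN ({placeholders})")
        params.extend(values)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    return where, params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_activities "
    "(target_chembl_id TEXT, standard_type TEXT, confidence_score INTEGER)"
)
conn.executemany(
    "INSERT INTO toy_activities VALUES (?, ?, ?)",
    [("CHEMBL203", "IC50", 9), ("CHEMBL203", "Ki", 7), ("CHEMBL204", "IC50", 8)],
)

where, params = build_filter(
    {
        "target_chembl_id": ["CHEMBL203"],
        "standard_type": None,  # left unfiltered, like a None argument
        "confidence_score": [9, 8],
    }
)
rows = conn.execute("SELECT * FROM toy_activities" + where, params).fetchall()
print(rows)
```

Filters left as None are skipped, so any combination of the optional arguments yields a valid query.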
Web API Backend¶
Module holding functionalities for the ChEMBL API using the webresource client as the backend.
- Capricho.chembl.api.webresource.get_activity_table(molecule_chembl_ids=None, target_chembl_ids=None, assay_chembl_ids=None, document_chembl_ids=None, **kwargs)[source]¶
Take a list of molecule ChEMBL IDs and get their respective bioactivities in ChEMBL.
- Parameters:
molecule_chembl_ids – list of molecule ChEMBL IDs to fetch bioactivities. Defaults to None.
target_chembl_ids (Optional[list]) – list of target ChEMBL IDs to fetch bioactivities. Defaults to None.
assay_chembl_ids (Optional[list]) – list of assay ChEMBL IDs to fetch bioactivities. Defaults to None.
document_chembl_ids (Optional[list]) – list of document ChEMBL IDs to fetch bioactivities. Defaults to None.
**kwargs – additional filters, e.g.: standard_relation=[“=”], assay_type__in=[“B”, “F”].
- Returns:
a DataFrame with the bioactivities and the parameters used to fetch them.
- Return type:
Tuple[pd.DataFrame, dict]
- Capricho.chembl.api.webresource.get_assay_table(assay_chembl_ids, confidence_scores=None, **kwargs)[source]¶
Take a list of assay ChEMBL IDs and get their respective assays in ChEMBL.
- Parameters:
assay_chembl_ids (list) – a list of assay ChEMBL IDs.
**kwargs – keyword arguments to filter the assays.
- Returns:
a DataFrame with the assays.
- Return type:
pd.DataFrame
- Capricho.chembl.api.webresource.get_compound_table(molecule_chembl_ids)[source]¶
Get information on molecules from ChEMBL.
- Parameters:
molecule_chembl_ids (list) – a list of molecule ChEMBL IDs.
- Returns:
a DataFrame with the molecule information.
- Return type:
pd.DataFrame
- Capricho.chembl.api.webresource.get_document_table(document_chembl_ids)[source]¶
From a list of ChEMBL document IDs, get the publication details.
- Parameters:
document_chembl_ids – list of ChEMBL document IDs.
- Returns:
a dictionary with assay IDs as keys and publication details as values.
- Return type:
- Capricho.chembl.api.webresource.get_full_activity_data(molecule_chembl_ids=None, target_chembl_ids=None, assay_chembl_ids=None, document_chembl_ids=None, confidence_scores=(9, 8), assay_types=('B', 'F'), chembl_release=None, add_document_info=True)[source]¶
Retrieve ChEMBL bioactivity data from any combination of molecule, target, assay, or document IDs. Data is retrieved using the ChEMBL webresource client, merges and returns a DataFrame with the bioactivity data.
- Fetch bioactivities for the given target or molecule IDs, considering designated confidence scores and assay types (binding or functional), using new_client.activity from the chembl_webresource_client package.
- Extract unique assay IDs from the bioactivities DataFrame and add this information to the final DataFrame.
- Parameters:
molecule_chembl_ids (Optional[list]) – list of ChEMBL molecule IDs to fetch data for. Defaults to None.
target_chembl_ids (Optional[list]) – list of ChEMBL target IDs to fetch data for. Defaults to None.
assay_chembl_ids (Optional[list]) – list of ChEMBL assay IDs to fetch data for. Defaults to None.
document_chembl_ids (Optional[list]) – list of ChEMBL document IDs to fetch data for. Defaults to None.
confidence_scores (Union[list,Tuple]) – list of confidence scores to filter the fetched assay data. Defaults to (9, 8).
assay_types (Union[list,Tuple]) – list of assay types to be fetched from ChEMBL. Defaults to binding (B) and functional (F) data.
chembl_release (Optional[int]) – specify latest ChEMBL release to extract data from (e.g., 28). Defaults to None.
add_document_info (bool) – whether to add publication-related fields to the final DataFrame. Setting to True will require one less query to be made to ChEMBL, but fields like year will be lacking. Defaults to True.
- Returns:
Merged DataFrame with molecule, bioactivity, and assay information.
- Return type:
pd.DataFrame
- Capricho.chembl.api.webresource.get_similarity_compound_table(smi, similarity)[source]¶
Fetch similar compounds from ChEMBL using the similarity API.
- Parameters:
smi – single SMILES string to find similar molecules to.
similarity (float) – similarity threshold to use for the search. Value should be between 40 and 100.
- Raises:
ValueError – If the similarity is not between 40 and 100, or if no similar molecules are found.
- Returns:
a DataFrame with the similar molecules.
- Return type:
pd.DataFrame
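The documented range check on similarity can be sketched as follows. The helper name `validate_similarity` is hypothetical, and treating both bounds as inclusive is an assumption; the sketch covers only the validation step, not the call to the ChEMBL similarity API.

```python
def validate_similarity(similarity: float) -> float:
    """Raise ValueError unless similarity lies in the 40-100 range."""
    # Assumption: both 40 and 100 are accepted.
    if not 40 <= similarity <= 100:
        raise ValueError(f"similarity must be between 40 and 100, got {similarity}")
    return similarity


print(validate_similarity(70))
```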
Note: API documentation is automatically generated from docstrings in the source code. For the most comprehensive and up-to-date information, refer to the CLI Reference which exposes all functionality.